V R wrote:
> Sorry for the off-topic and cross-post but I was curious (since we have
> the attention of so many now) if more "intelligent" floating point scheme
> exists (i.e. non-IEEE 754/854)? I know computers performed math before
> Intel's spec so of course there will be dozens of proprietary formats...
>
> It feels like manipulation of floating point data in the 754/854 formats
> is more cumbersome than it needs to be. Are there any other schemes that
> are "simpler" (besides fixed point, etc) and/or easier to implement? Any
> implementations that nicely lend themselves to FPGAs? Obviously one will
> have to make trade-offs such as bit-size vs. precision, etc. but I'm
> inquiring about general schemes...

I've implemented a few sw math libs, so the following is based on that experience, and not from hw implementations:

For a maximum speed/ease of implementation sw package, I start as others here have suggested, i.e. disregard underflow, Inf, NaN handling, and possibly also Zero. All this can be replaced with very simple saturating arithmetic on the exponent part.

Keeping Zero as a special case is probably OK, but it means that you have 4 paths through each two-operand function instead of just one.

When working on the Pentium FDIV sw workaround 7 years ago, I wrote a quick&dirty library of 128-bit math operations, and used this to implement Arctan(), so I could verify the results given by our workaround code.

The storage format I used was a direct extension of the IEEE formats, with a leading sign bit, a number of exponent bits and the remainder left over for the mantissa. I believe I used explicit storage of the leading (1) mantissa bit, but I don't remember exactly.

Anyway, the only significant IEEE requirement I disobeyed totally was the one for multiple rounding modes; I just used a single round-to-nearest rule instead.

Terje

--
- <Terje.Mathisen@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"

V R wrote:
> Are there any other schemes that
> are "simpler" (besides fixed point, etc) and/or easier to implement? Any
> implementations that nicely lend themselves to FPGAs?

Virtex-II now has fast combinatorial multipliers (18 x 18 2's complement), and such a multiplier can obviously also be used to shift the exponent. Saves a lot of multiplexers and routing. And is fast.

Peter Alfke
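
A minimal sketch of the multiplier-as-shifter idea, written as a behavioural Python model rather than HDL. The function names and the 16-bit shift span are assumptions made for illustration (an 18-bit two's complement operand can hold a positive one-hot value up to 2**16); nothing here is taken from Xilinx documentation.

```python
def shift_left_via_multiply(value: int, shift: int, span: int = 16) -> int:
    """Left-shift by multiplying with a one-hot constant 2**shift.

    In hardware the one-hot operand would come from a small decoder;
    the multiplier then replaces a multi-level multiplexer tree.
    """
    if not 0 <= shift <= span:
        raise ValueError("shift out of range for the assumed operand width")
    one_hot = 1 << shift          # exactly one bit set
    return value * one_hot


def shift_right_via_multiply(value: int, shift: int, span: int = 16) -> int:
    """Right-shift by multiplying with 2**(span - shift) and keeping
    only the upper part of the product (bits above position `span`)."""
    if not 0 <= shift <= span:
        raise ValueError("shift out of range for the assumed operand width")
    product = value * (1 << (span - shift))
    return product >> span        # discard the low `span` bits
```

For example, shift_left_via_multiply(0b1011, 3) gives 0b1011000, and shift_right_via_multiply applied to that result with shift 3 recovers 0b1011.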

rk wrote:
> How about when the counter counts?
>
> 01111111
> 10000000
>
> Hint for the h-work kid: look up static hazard in your logic book.

Yes.
Why go to the trouble of using exotic schemes, like LFSR, when a binary solution is actually simpler, and also equally fast?
Also, unless you know the tricks, an 8-bit LFSR divides by 255, not 256.

Peter Alfke
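
The "255, not 256" remark can be checked with a short behavioural model. This is a sketch under stated assumptions: it uses a Galois-style 8-bit LFSR with the feedback polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D), which is one commonly used maximal-length choice and not necessarily the one the posters had in mind. The all-zeros state maps to itself, so only the 255 non-zero states take part in the cycle.

```python
POLY = 0x11D  # assumed polynomial: x^8 + x^4 + x^3 + x^2 + 1


def lfsr_step(state: int) -> int:
    """Advance an 8-bit Galois LFSR by one clock."""
    state <<= 1
    if state & 0x100:      # bit shifted out of the top: apply feedback
        state ^= POLY      # also clears bit 8, keeping the state in 8 bits
    return state


def lfsr_period(seed: int = 1) -> int:
    """Count the clocks needed to return to the (non-zero) seed state."""
    if not 1 <= seed <= 0xFF:
        raise ValueError("seed must be a non-zero 8-bit value")
    state = lfsr_step(seed)
    count = 1
    while state != seed:
        state = lfsr_step(state)
        count += 1
    return count
```

With this polynomial lfsr_period(1) is 255, while lfsr_step(0) stays at 0, which is the lock-up state a plain LFSR has to avoid.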

On Fri, 09 Feb 2001 23:24:12 GMT, s_clubb@NOSPAMnetcomuk.co.uk (Stuart Clubb) wrote:
>That's basically taking Renoir out of the flow. You would be missing
>the design management capabilities of Renoir and compromising the
>ability of your colleagues to cooperatively work with your design as
>part of a project. The documentation and re-use of your design would
>likely be a lot harder for those within your organisation using
>Renoir. Sure it adds a little overhead, but once you understand "how"
>it all works, you'll find Renoir quite useful I suspect.

If you don't mind me saying so, this sounds like a case of the tail wagging the dog. Surely the point is that Renoir has to integrate into existing proven design flows, and not the other way round.

I worked in a group last year where some people used Renoir, and some didn't. This caused a huge amount of trouble. Maybe the situation is better with 2000.x; I don't know.

Evan

PS - can't understand why you'd prefer Portland to Stevenage - very odd... :)

Peter Alfke wrote:
> > How about when the counter counts?
> >
> > 01111111
> > 10000000
> >
> > Hint for the h-work kid: look up static hazard in your logic book.
>
> Yes.

It would be interesting to see if they take off on H-Work assignments for a pulse that has potential glitches in it.

> Why go to the trouble of using exotic schemes,
> like LFSR, when a binary
> solution is actually simpler, and also equally fast.

<rk shrugs> Lots of solutions to this one. Very simple H-Work problem.

> Also, unless you know the tricks, an LFSR divides
> by 255, not 256.

Hints for the H-Work kid:

1. _The Art of Electronics_, Horowitz and Hill

2. _HDL Chip Design_, Smith.

-----------------------------------------------------------------------
rk                               A designer has arrived at perfection
stellar engineering, ltd.        not when there is no longer anything
stellare@erols.com.NOSPAM        to add, but when there is no longer
Hi-Rel Digital Systems Design    anything to take away - Bentley, 1983

Peter Alfke wrote:
> This is a good basic article ( and I am notoriously critical of anybody else's
> tutorials :-)
> Minor flaws:
> The author describes antifuse circuits as if they used fuses ( they really make a
> connection, not break it ) and he fails to mention the enormous difference in
> flip-flop count between CPLDs and FPGAs:
> CPLDs have from 32 or 36 to <300 flip-flops, while FPGAs nowadays start at a
> couple of hundred and end above 50,000 flip-flops. So that is a dramatic
> difference.
> The speed difference has disappeared, FPGAs are now as fast as CPLDs, even for
> simple functions.
> But, fundamentally, a good introductory story.

Agreed. A few minor nits.

It's important to take the FPGA configuration time (at startup) into account when designing your system. If you need instant power-on performance, you probably want to use a flash memory-based device or an OTP device.

OTP devices, by themselves, do not give one instant power-on performance. One must take FPGA start time into account when designing a system. Real devices attached to your FPGA may be affected during the startup transient. This includes devices such as relays and pyrotechnic devices.

I would also have included Quicklogic and Atmel in the list of suppliers of FPGAs.

----------------------------------------------------------------------
rk                               How the hell do I know? I'm just a
stellar engineering, ltd.        common, ordinary, simple savior of
stellare@erols.com.NOSPAM        America's destiny.
Hi-Rel Digital Systems Design    -- Pat Paulsen

hello,

thanks to you all for the reply. I have to mention that it's not a homework...

Now, how can i generate those 7 waveforms EFFICIENTLY

waveform 1
----------
0 from 0-->4 cycles and 1 from 5-->255

waveform 2
----------
0 from 0-->3 cycles and 1 from 4-->255

waveform 3
----------
0 from 0-->2 cycles and 1 from 3-->255

waveform 4
----------
0 from 0-->1 cycles and 1 from 2-->255

waveform 5
----------
1 from 0-->251 cycles and 0 from 252-->255

waveform 6
----------
1 from 0-->252 cycles and 0 from 253-->255

waveform 7
----------
1 from 0-->253 cycles and 0 from 254-->255

i can use what you suggested, many comparators in //, but the problem of fan out occurs. As i can duplicate the counter. but is there any more appropriate way to do it ?

thanks

In article <3A86B54B.D8BAEC41@nospamplease.erols.com>, rk <stellare@nospamplease.erols.com> wrote:
> Peter Alfke wrote:
> > > How about when the counter counts?
> > >
> > > 01111111
> > > 10000000
> > >
> > > Hint for the h-work kid: look up static hazard in your logic book.
> >
> > Yes.
>
> It would be interesting to see if they take off on H-Work assignments
> for a pulse that has potential glitches in it.
>
> > Why go to the trouble of using exotic schemes,
> > like LFSR, when a binary
> > solution is actually simpler, and also equally fast.
>
> <rk shrugs> Lots of solutions to this one. Very simple H-Work problem.
>
> > Also, unless you know the tricks, an LFSR divides
> > by 255, not 256.
>
> Hints for the H-Work kid:
>
> 1. _The Art of Electronics_, Horowitz and Hill
>
> 2. _HDL Chip Design_, Smith.
>
> -----------------------------------------------------------------------
> rk                               A designer has arrived at perfection
> stellar engineering, ltd.        not when there is no longer anything
> stellare@erols.com.NOSPAM        to add, but when there is no longer
> Hi-Rel Digital Systems Design    anything to take away - Bentley, 1983

Sent via Deja.com
http://www.deja.com/

karenwlead@my-deja.com wrote:
> hello,
>
> thanks to you all for the reply. I have to mention that it's not a
> homework...
>
> Now, how can i generate those 7 waveforms EFFICIENTLY
>
> waveform 1
> ----------
> 0 from 0-->4 cycles and 1 from 5-->255
>
> i can use what you suggested, many comparators in //, but the problem
> of fan out occurs. As i can duplicate the counter. but is there any
> more appropriate way to do it ?

There are many solutions. One quick and dirty one is to use block RAM. Generate a file with 256 bytes, where waveform 1 is the first bit of each byte (LSB), waveform 2 is the second bit, and so on. Load this table into your block RAM (use coregen to generate a 256x8-bit ROM), attach an 8-bit counter to the address input, and you are ready. You can create waveforms as complex as you like; it won't degrade performance.

--
MFG
Falk
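
The table for such a ROM could be generated along the following lines. This is only a sketch: the seven waveform ranges are taken from the question as posted, the bit assignment (waveform 1 in the LSB) follows the suggestion above, and the names WAVEFORMS and build_rom are made up for illustration. Writing the result out in the file format a particular core generator expects is left out, since that format is not given in the thread.

```python
# (first_high, last_high) counter values, inclusive, for waveforms 1..7
WAVEFORMS = [
    (5, 255),   # waveform 1: 0 for counts 0..4, 1 for 5..255
    (4, 255),   # waveform 2: 0 for counts 0..3, 1 for 4..255
    (3, 255),   # waveform 3: 0 for counts 0..2, 1 for 3..255
    (2, 255),   # waveform 4: 0 for counts 0..1, 1 for 2..255
    (0, 251),   # waveform 5: 1 for counts 0..251, 0 for 252..255
    (0, 252),   # waveform 6: 1 for counts 0..252, 0 for 253..255
    (0, 253),   # waveform 7: 1 for counts 0..253, 0 for 254..255
]


def build_rom() -> list[int]:
    """Return 256 bytes; bit n-1 of byte `count` is waveform n at that count."""
    rom = []
    for count in range(256):
        byte = 0
        for bit, (first_high, last_high) in enumerate(WAVEFORMS):
            if first_high <= count <= last_high:
                byte |= 1 << bit
        rom.append(byte)
    return rom


if __name__ == "__main__":
    # Print the table as hex, 16 bytes per row, for inspection.
    table = build_rom()
    for base in range(0, 256, 16):
        print(" ".join(f"{b:02X}" for b in table[base:base + 16]))
```

Bit 7 of every byte is unused here, so an eighth waveform could be added at no extra cost in the ROM.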

"Peter Alfke" <palfke@earthlink.net> wrote in message news:3A86AA7C.D34633E4@earthlink.net...
> Virtex-II now has fast combinatorial multipliers ( 18 x 18 2's complement ),
> and such a multiplier can obviously also be used to shift the exponent.
> Saves a lot of multiplexers and routing. And is fast
>
> Peter Alfke

Ignoring rounding modes, etc., the two most area- and interconnect-intensive jobs in an FPGA-implemented FPU are the multiplier, and the denormalize (mantissa binary point alignment) and normalize barrel shifters in the adder.

Focusing on the latter, you can indeed use the new 18x18 multipliers either iteratively or in parallel to do these shifts.

But in this posting [www.fpgacpu.org/usenet/fp.html] (which references several other FPGA FPU implementations), I proposed an alternative FP adder implementation. Maybe it's wacky and unusable, I don't know. Here's the idea:

If you perform FP addition in a bit- or nybble-serial fashion, you can implement the binary point alignment denormalization, and the subsequent sum normalization, with a variable-tap shift register, which is implemented extremely efficiently in Virtex-derivative FPGAs using the powerful SRL16 primitive (which packs a variable-tap 16-element shift register into a single logic cell). (For more on applying SRL16, see [http://www.xilinx.com/support/techxclusives/SRL16-techxclusive2.htm].)

Of course, a bit- or nybble-serial FP adder will be slower than a combinational one. But you can now implement one in much less area. If you can express your FP computation as a parallelizable data flow, you might instantiate many of these area-optimized FP adders in the space of one combinational FP adder. Will the throughput be higher? I don't know.

Jan Gray, Gray Research LLC
FPGA CPU News: www.fpgacpu.org

karenwlead@my-deja.com wrote:
> hello,
>
> thanks to you all for the reply. I have to mention that it's not a
> homework...

OK, it looks like homework. I would still suggest looking up the references.

Good luck!

----------------------------------------------------------------------
rk                               We had dodged bullets before, but
stellar engineering, ltd.        this time we caught one in midair and
stellare@erols.com.NOSPAM        spit it out.
Hi-Rel Digital Systems Design    -- Gene Kranz after Apollo 5

EET, Feb 5, 2001:
Ron Wilson argues that on-the-fly reprogrammable FPGAs are a much better architecture than SoC.
The leading FPGA companies are going full speed into SoC / platform FPGAs.
How valid is Ron's point of view?

Dan

Well, thanks to everyone for the good information... I'm trying to accelerate a ray tracing application (POVRay) using digital hardware. My guess is that I do not have to be strictly IEEE compliant in my implementation, but the numbers I hand back to the have to be in double precision format.

I'm counting on everything being heavily pipelined and I am looking into CORDIC for implementing the transcendentals. The board I'm using hangs off the memory bus, so I'd like everything to be synchronous to the 66 or 100 MHz SDRAM clock.

I'm doing this for a research project for school, so no worry of me bidding it too low :)

Matt

Matt Billenstein
mbillens (at) one (dot) net
http://w3.one.net/~mbillens/

"Matt Billenstein" <mbillens@mbillens.yi.org> wrote in message news:6L2h6.1820$xh3.173569@typhoon.kc.rr.com...
| All,
|
| I've taken on a project where I'll be implementing a number of math
| functions on IEEE double precision floating point types (64 bit).
| Multiplication, division, addition, and subtraction are fairly straight
| forward. I'll need to do cosine, exponential (e^x), and square roots. Any
| advice/pointers/book titles would be appreciated. I'll be implementing in
| VHDL targeting a large Xilinx VirtexE device (XCV1000E). Hopefully at 66 or
| 100 MHz.
|
| Thanks,
|
| Matt
|
| --
|
| Matt Billenstein
| REMOVEmbillens@one.net
| REMOVEhttp://w3.one.net/~mbillens/

karenwlead@my-deja.com wrote:
> Now, how can i generate those 7 waveforms EFFICIENTLY
<snip>
> i can use what you suggested, many comparators in //, but the problem
> of fan out occurs. As i can duplicate the counter. but is there any
> more appropriate way to do it ?

To avoid the wastage of comparators, and the decode glitches, the most efficient approach is a JK register design. ( T -> JK is the usual synth path ).

Then, you need decode only the SET and CLEAR points, and if you have edge-adjacent waveforms ( as you do ), you can save more.

a) common the CLR on OP 1..4: all K = 1 at CV == 255 ( 1 clk delay )
b) Either compare, or count, to create J=1 at CV==1 for OP4.
Then J3=OP4, J2=OP3, J1=OP2 will be 1 clock wider each.

=jg
--
======= 80x51 Tools & IP Specialists =========
= http://www.DesignTools.co.nz

Matt Billenstein wrote:
> Well, thanks to everyone for the good information... I'm trying to
> accelerate a ray tracing application (POVRay) using digital hardware. My
> guess is that I do not have to be strictly IEEE compliant in my
> implementation, but the numbers I hand back to the have to be in double
> precision format.
>
> I'm counting on everything being heavily pipelined and I am looking into
> CORDIC for implementing the transcendentals. The board I'm using hangs off
> the memory bus, so I'd like everything to be synchronous to the 66 or 100
> MHz SDRAM clock.
>
> I'm doing this for a research project for school, so no worry of me bidding
> it too low :)

Using POVRay as the single target for your job makes things much simpler, in some ways!

I would take a long, close look at the full POVRay pipeline, and see what's actually needed to get from input to output, as opposed to how the algorithm is currently broken down into basic mathlib operations.

One idea is to explicitly replace each transcendental operation with a corresponding polynomial function, with sufficient precision to achieve the final sub-pixel accuracy required.

When you do this you'll notice that many of those underlying operations really don't need anything like full IEEE double semantics. You could even look into a limited-precision fixed-point number format, which makes both addition and multiplication simpler.

Terje

--
- <Terje.Mathisen@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"

On Sat, 10 Feb 2001 03:09:18 GMT, Phil Hays <spampostmaster@home.com> wrote:
>Carry chain can be inferred by synthesis tools, however the code may not be
>highly readable. For example, to create an OR gate:
>
>OR_temp <= '0' & A & B & C & D & E;
>Result_temp <= OR_temp + "011111";
>Result <= Result_temp(5); -- result is zero unless (A or B or C or D or E) = 1

Neat. However, the results from SynplifyPro don't show much of a gain. I must be missing something. If I just did a simple 12 bit add (for a 12 bit AND) Synplify inferred twelve LUTs feeding twelve MUXCYs. The speed was nothing to write home about (3.5ns in a VirtexE -7). The two-stage 12-input AND that Synplify infers from a normal IF/THEN/ELSE comes in at 2.5ns. In both tests the AND feeds the DFF.

I then instantiated three four-input ANDs and then did a three-bit add. Synplicity inferred a two stage LUT feeding the flop. Grrr.

Only when I extended the AND to three stages (32 bits) did the carry method become a tad faster. Synplify says the ADD method is 3.5ns, vs. 3.6 for the three stage LUT (seems strange as I write, I'll have to look at this again).

>I'd suggest using a procedure to improve readability.

Understood. Good comments help a lot too. ;-)

>Biggest gain in speed is from using the carry chain for priority encoders, large
>AND and OR gates gain some.

Any tricks for a fast address comparator? It seems there ought to be some way to do this with carry chains. Looking at the logic drawings of the Virtex CLB, I don't immediately see it though.

I have need of a fast comparator to compare an address bus against a register (variable address). One input has to be fast, the other doesn't. I'm now pipelining the operation over three cycles (XOR, AND, Encode), which is ugly.

----
Keith
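
Why the quoted adder trick yields an OR can be seen with a small behavioural model. This is a sketch in Python, not synthesizable code, and the function name is made up for illustration: the inputs are packed into an n-bit number, the all-ones constant is added, and the carry out of bit n-1 is 1 exactly when the packed number is non-zero.

```python
from itertools import product


def wide_or_via_adder(bits) -> int:
    """Model of OR built from an adder: pack the inputs, add all-ones,
    and return the carry into the extra top bit."""
    n = len(bits)
    value = 0
    for b in bits:                       # '0' & A & B & C & D & E
        value = (value << 1) | (b & 1)
    total = value + ((1 << n) - 1)       # + "011111" when n == 5
    return (total >> n) & 1              # Result_temp(5) when n == 5


if __name__ == "__main__":
    # Exhaustive check for the five-input case from the post.
    for combo in product((0, 1), repeat=5):
        assert wide_or_via_adder(combo) == int(any(combo))
    print("adder-based OR matches a plain OR for all 32 input combinations")
```

The same reasoning gives a wide AND: adding 1 to the packed inputs produces a carry out only when every input bit is 1.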

There's a nice example of a Virtex-carry-chain-optimized n-bit comparator in VHDL from Arrigo Benedetti in the fpga-cpu list archive, at http://groups.yahoo.com/group/fpga-cpu/message/73. Jan Gray, Gray Research LLC

The display is upside down. Rotate 180 degrees.

On Sun, 11 Feb 2001 04:10:15 +0100, "Daniel Nilsson" <danielnilsson@REMOVE_THIShem3.passagen.se> wrote:
>This LCD almost lacks own logic, many signals need to be generated etc.
>The LCD works fine now, I can put things into memory and it will come out
>fine on the display... The problem is that when I try to write to the
>display on position (x,y) = (0,0) then it ends up at (255,62), (10,0) @
>(245,62), (10,10) @ (245,52) ... and the datasheet says that it should begin
>counting from upper left corner... mine counts from (lower-1) right corner,
>and counts upwards towards lower Y-value (if (0,0) is upper left corner)
>I have verified that all timing in datasheet is met, still it doesn't behave
>properly.
>
>Has anyone of you had any similar experience with LCD displays?
>
>/Daniel Nilsson

Philip Freidin
Fliptronics

Jabari Zakiya <jzakiya@mail.com> wrote
> But it should be
> unquestionable that clocked CMOS devices draw more power than
> unclocked CMOS devices of the same technology.

While I don't know much about low power design, I do know that this does not necessarily follow, because while a purely combinational design may save on clocking power, it may also waste power due to internal signal glitching.

Consider one gate G in the middle of an unbalanced, deep, purely combinational logic expression graph. If G's inputs arrive at, and settle at, different times, it may GLITCH high then low then high again, over and over, charging and discharging its output net, and perhaps causing the downstream gates on that net to themselves glitch high and low, and so forth. Result: arbitrary amounts of wasted power due to glitching.

Now consider the same gate G, but this time in the midst of a pipelined design, with pipeline registers at each two levels of logic. Here, even if G's inputs are unbalanced (perhaps one input sources a gate that sources some registers, another input sources another register directly) then there will be at most one glitch originating at G, and since its output is registered, this glitch is not seen by downstream gates. Result: less or no power wasted due to glitching.

To quote slide# "Architecture:18" from the wonderful ISCA00 Tutorial "Low Power Design: From Soup to Nuts", presented by Mary Jane Irwin and Vijaykrishnan Narayanan of Penn State:

"Glitch Reduction by Pipelining
* Glitches are dependent on the logic depth of the circuit
* Nodes logically deeper are more prone to glitching
  > Arrival times of the gate inputs are more spread due to delay imbalances
  > Usually affected by more primary input switching
* Reduce depth by adding pipeline registers"

So it will be interesting to see how this solution benchmarks (latency, throughput, area, power) against other implementations.

For example, at FCCM00 last year, Cameron Patterson of Xilinx presented a paper on a jbits-floorplanned 16-stages-unrolled pipelined implementation of DES [1] that runs at 168 MHz and does 10.7 Gb/s ("non-feedback ECB"), for 3.2 W, in a modest Virtex-150-5, for a throughput-per-area of about 3 Mb/s/logic-cell.

Working from this paper, I infer that at 168 MHz, clocking through the stated 35 pipe stages, the 16 round DES would take ~200 ns -- which is in the same ballpark for latency as the stated 155 ns TPD of the combinational FPGA version -- but presumably with much higher throughput and throughput-per-area.

Or are the two implementations not comparable (apples to oranges)?

Jan Gray, Gray Research LLC

[1] Cameron Patterson, Xilinx, "High Performance DES Encryption in Virtex FPGAs using JBits", 2000 IEEE Symposium on Field-Programmable Custom Computing Machines.
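
The latency and throughput figures above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch that assumes one 64-bit DES block enters the pipeline on every clock (the usual reading of "non-feedback ECB"); the clock rate and stage count are the ones quoted in the post.

```python
CLOCK_HZ = 168e6      # pipeline clock quoted in the post
PIPE_STAGES = 35      # pipeline depth quoted in the post
BLOCK_BITS = 64       # DES block size

# Time for one block to travel through every pipeline stage.
latency_ns = PIPE_STAGES / CLOCK_HZ * 1e9

# Assumption: a new block is accepted on every clock cycle.
throughput_gbps = BLOCK_BITS * CLOCK_HZ / 1e9

if __name__ == "__main__":
    print(f"latency    = {latency_ns:.1f} ns")
    print(f"throughput = {throughput_gbps:.2f} Gb/s")
```

Under that assumption the latency comes out at roughly 208 ns and the throughput at roughly 10.75 Gb/s, consistent with the "~200 ns" and "10.7 Gb/s" figures in the post.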

On Sun, 11 Feb 2001 11:59:20 +0000 (UTC), V R <ipickledthefigsmyself@mrbourns.com> wrote:
>Sorry for the off-topic and cross-post but I was curious (since we have
>the attention of so many now) if more "intelligent" floating point scheme
>exists (i.e. non-IEEE 754/854)? I know computers performed math before
>Intel's spec so of course there will be dozens of proprietary formats...
>
>It feels like manipulation of floating point data in the 754/854 formats
>is more cumbersome than it needs to be. Are there any other schemes that
>are "simpler" (besides fixed point, etc) and/or easier to implement? Any
>implementations that nicely lend themselves to FPGAs? Obviously one will
>have to make trade-offs such as bit-size vs. precision, etc. but I'm
>inquiring about general schemes...
>
>Thanks!
>VR.

VR,

I've done a couple of projects using a 64-bit fixed-point format, with 32 bits of integer and 32 bits of fraction. That's enough dynamic range for most real-world, engineering-units apps. A fully-saturating implementation has no exceptions to worry about. And adds are *fast*.

John
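
A minimal sketch of what a saturating 32.32 add might look like, as a behavioural Python model. The post does not say whether the format is signed, so a signed two's complement representation is assumed here; all names are hypothetical.

```python
FRAC_BITS = 32
MAX_RAW = (1 << 63) - 1     # largest 64-bit two's complement value (assumed signed)
MIN_RAW = -(1 << 63)        # smallest 64-bit two's complement value


def saturate(raw: int) -> int:
    """Clamp to the 64-bit range instead of wrapping or raising."""
    return max(MIN_RAW, min(MAX_RAW, raw))


def to_fixed(x: float) -> int:
    """Convert a float to the raw 32.32 integer representation."""
    return saturate(round(x * (1 << FRAC_BITS)))


def to_float(raw: int) -> float:
    """Convert a raw 32.32 value back to a float for inspection."""
    return raw / (1 << FRAC_BITS)


def add_sat(a: int, b: int) -> int:
    """Fixed-point add: a plain integer add followed by a clamp."""
    return saturate(a + b)
```

For instance, to_float(add_sat(to_fixed(1.5), to_fixed(2.25))) gives 3.75, and adding anything positive to MAX_RAW simply stays at MAX_RAW. In hardware the clamp is a comparison on the carry and sign bits, which is why there are no exception paths to handle.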

Amontec has one Virtex symbol in BG560.

Gil Golov wrote:
> Does anybody have this symbol for Orcad capture?
>
> Thanks very much in advance.
>
> Gil Golov

Brian Philofsky wrote:
> >Most likely, you will have to hand create this structure as I am not sure
> >if any synthesis tools currently infer this structure. Maybe in the near
> >future though...

Christian Plessl <cplessl@ee.ethz.ch> writes:
> So how do think such a design could be implemented, if the design
> tools cannot infer it? I'm using Xilinx Foundation 3.1i, the design is
> coded in VHDL. Somehow I would need complete access to the FPGA
> elements, such as LUTs and carry chains.
>
> Is there a way to access these directly? From the Xilinx Library Guide
> I've seen, that components ADD16 and OR16 work exactly like this, but
> since these are macros I cannot extend them to my needs.

I will assume that you've figured out how to instantiate macros/components from the libraries guide.

If you look at the MUXCY component, you'll see that it is an instantiatable (sp?) component for the mux implemented in the carry chain. You can simply instantiate this however many times you need. (If you look, you'll see that LUTs and whatnot are there, too.)

A little while ago, I was playing around with a 32-bit wide trinary compare function (32 bits & 32 bit compare with 32 bits mask), and I tested both the carry-chain method and the pull-up (wired-or) method, and I did indeed find that the carry chain was faster.

Note that you will need to break the carry chain in half if the device height isn't enough for your entire carry chain. (Which is what I did.)

I've stuck my compare component here so you can see how I did it. I tested it using Foundation 3.3i ISE.

Hope this helps,
-Kent

-------------------------------------------------------------------------------
-- Title      : Fast Trinary Compare Component
-- Project    : Common Component
-------------------------------------------------------------------------------
-- File       : FastCompare.vhd
-- Author     : K.Orthner
-- Created    : 2001/01/27
-- Last update: 2001-01-25
-- Platform   : Active-HDL/FPGA Express(Synopsys)
-------------------------------------------------------------------------------
-- Description: A Fast Trinary Compare component.
--              Completely combinational.
-------------------------------------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- synopsys translate_off
library unisim;
use unisim.all;
-- synopsys translate_on

entity FastCompare is
  generic (
    Width : integer);
  port (
    Comparand0 : in  std_logic_vector(Width-1 downto 0);
    Comparand1 : in  std_logic_vector(Width-1 downto 0);
    Mask       : in  std_logic_vector(Width-1 downto 0);
    Match      : out std_logic
    );
end FastCompare;

-------------------------------------------------------------------------------
-- Architecture Xilinx_Carry
-------------------------------------------------------------------------------
architecture Xilinx_Carry of FastCompare is

  component MUXCY is
    port (
      O  : out std_ulogic;
      DI : in  std_ulogic;
      CI : in  std_ulogic;
      S  : in  std_ulogic);
  end component MUXCY;

  signal BitMatch    : std_logic_vector(Width-1 downto 0);
  signal CarryChain0 : std_logic_vector(Width/2 downto 0);
  signal CarryChain1 : std_logic_vector(Width/2 downto 0);
  signal Logic_0     : std_logic;

begin

  Logic_0 <= '0';

  -----------------------------------------------------------------------------
  -- Determine bitwise matching.
  -----------------------------------------------------------------------------
  GenBitMatch : process( Comparand0, Comparand1, Mask) is
  begin
    for i in BitMatch'range loop
      if (Mask(i) = '0') then
        BitMatch(i) <= '1';
      elsif (Comparand0(i) = Comparand1(i)) then
        BitMatch(i) <= '1';
      else
        BitMatch(i) <= '0';
      end if;
    end loop;
  end process GenBitMatch;

  GenMux : for i in 0 to ((Width/2)-1) generate

    MUXCY_0 : MUXCY
      port map (
        O  => CarryChain0(i+1),
        DI => Logic_0,
        CI => CarryChain0(i),
        S  => BitMatch(i));

    MUXCY_1 : MUXCY
      port map (
        O  => CarryChain1(i+1),
        DI => Logic_0,
        CI => CarryChain1(i),
        S  => BitMatch(i+(Width/2)));

  end generate GenMux;

  CarryChain0(0) <= '1';
  CarryChain1(0) <= '1';

  Match <= CarryChain0(Width/2) and CarryChain1(Width/2);--and CarryChain2(Width/4) and CarryChain3(Width/4);

end Xilinx_Carry;

--==[ eof ]==--
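
For readers who want to check the logic of the listing without a simulator, here is a behavioural model in Python. It is a sketch only: muxcy follows the select behaviour the listing relies on (output is CI when S is 1, otherwise DI), and fast_compare mirrors the two half-length chains. The Python names are made up for illustration, and an even Width is assumed, as in the 32-bit case described in the post.

```python
def muxcy(s: int, ci: int, di: int) -> int:
    """Carry mux: pass the incoming carry when select is 1, else DI."""
    return ci if s else di


def fast_compare(comparand0: int, comparand1: int, mask: int, width: int = 32) -> int:
    """Masked equality compare using two carry chains of width/2 stages."""
    half = width // 2

    # A bit "matches" if it is masked out (mask bit 0) or the bits agree.
    bit_match = []
    for i in range(width):
        masked_out = ((mask >> i) & 1) == 0
        equal = ((comparand0 >> i) & 1) == ((comparand1 >> i) & 1)
        bit_match.append(1 if (masked_out or equal) else 0)

    chain0 = 1   # CarryChain0(0) <= '1'
    chain1 = 1   # CarryChain1(0) <= '1'
    for i in range(half):
        chain0 = muxcy(bit_match[i], chain0, 0)          # lower half
        chain1 = muxcy(bit_match[i + half], chain1, 0)   # upper half

    return chain0 & chain1
```

Calling fast_compare(0x12345678, 0x12345679, 0xFFFFFFFE) returns 1 because the only differing bit is masked out, while the same comparands with a mask of 0xFFFFFFFF return 0.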

Hi

For something like raytracing I would suggest that you implement some kind of vector processor. This means that you, for example, give one command to add N pairs of numbers instead of just one. This way you can do very heavy pipelining and almost always keep the pipeline full.

This would require vectorization of the loops in POVRay. This is not hard to do, and would for sure make it easier to get high throughput from your hardware.

Read the user manuals of the Motorola DSP96002. They only implemented addition, subtraction, comparison and multiplication and an 8-bit division approximation. The same tradeoff is probably true for FPGAs as well. The results might be better if you spend more time on getting the basic operations really fast with a lot of pipelining instead of having specialized hardware for everything. Maybe you can even run multiple instances of the basic operations in parallel from internal blockrams. These can then be used to implement the transcendentals.

The user manual of the DSP96002 contains assembler code to emulate the other operations. They are only a little slower than hardware iterations.

1/X takes 6 instructions
1/SQRT(X) takes 11 instructions using Newton-Raphson approximation
SQRT(X) is just X/SQRT(X) and takes 12 instructions
SIN/COS is done with CORDIC.

Analog Devices SHARC manuals also contain a lot of assembler code.

In article <mHAh6.13554$ra.1279071@typhoon.kc.rr.com>, "Matt Billenstein" <mbillens@mbillens.yi.org> wrote:
> Well, thanks to everyone for the good information... I'm trying to
> accelerate a ray tracing application (POVRay) using digital hardware. My
> guess is that I do not have to be strictly IEEE compliant in my
> implementation, but the numbers I hand back to the have to be in double
> precision format.
>
> I'm counting on everything being heavily pipelined and I am looking into
> CORDIC for implementing the transcendentals. The board I'm using hangs off
> the memory bus, so I'd like everything to be synchronous to the 66 or 100
> MHz SDRAM clock.
>
> I'm doing this for a research project for school, so no worry of me bidding
> it too low :)

Sent via Deja.com
http://www.deja.com/
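
The Newton-Raphson iterations mentioned above use only multiplies and adds or subtracts, which is why they suit hardware that has no divider. The following Python sketch shows the textbook recurrences; it is not the DSP96002 code, the seed values are assumed to come from a small lookup table or a crude approximation, and the iteration counts are illustrative.

```python
def reciprocal(x: float, seed: float, iterations: int = 5) -> float:
    """Approximate 1/x with y <- y * (2 - x*y); no division is used."""
    y = seed
    for _ in range(iterations):
        y = y * (2.0 - x * y)
    return y


def rsqrt(x: float, seed: float, iterations: int = 5) -> float:
    """Approximate 1/sqrt(x) with y <- y * (1.5 - 0.5*x*y*y)."""
    y = seed
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * x * y * y)
    return y


def sqrt_via_rsqrt(x: float, seed: float, iterations: int = 5) -> float:
    """sqrt(x) = x * (1/sqrt(x)): one extra multiply after rsqrt."""
    return x * rsqrt(x, seed, iterations)


if __name__ == "__main__":
    print(reciprocal(3.0, seed=0.3))        # close to 1/3
    print(sqrt_via_rsqrt(2.0, seed=0.7))    # close to sqrt(2)
```

Each iteration roughly doubles the number of correct bits, so a better seed directly reduces the number of pipeline passes needed for a given precision.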

If speed (and not area) is a concern you might try something completely different.

Let's say you have a state machine with output flip-flops y_i. You want to compute y_0 or y_1 or ... or y_N.

Usually all the inputs of your state machine come from flip-flops, too. Let's call these input flip-flops x_i.

All the y_i are computed by a boolean function f_i(x, y). You can now easily obtain the function

F(x, y) = f_0(x,y) or f_1(x,y) or ... or f_N(x,y)

This is the same function as before, but you now have the result one clock cycle earlier, or you can use retiming and you have twice the time to evaluate the function.

In some cases F can be much larger and slower than the OR-tree. But my experience is that especially for one-hot encoded state machines F is almost the same size as the tree and you therefore gain almost a factor of 2 in throughput.

CU,
Kolja

In article <ofq48t0073rudnjlvo4fpkjnccsm2vq11f@4ax.com>, Christian Plessl <cplessl@ee.ethz.ch> wrote:
> What I need, is a _fast_ boolean OR resp. AND operations on all of
> these output signals. Since there are quite a lot of output signals,
> say typically more than 40 signals, I need several levels of logic,
> when implementing this in the obvious tree-like structure with a tree
> of 4 Input AND/OR gates.

Sent via Deja.com
http://www.deja.com/

Hi Brian, Phil and others.

Thank you very much for the very useful comments and design ideas for implementing wide logical functions. I've made little test circuits to compare your proposals, and want to briefly show the results:

I compared 3 different architectures for a 32 input AND gate

a) Simply using the 'and' operator
b) Using Brian Philofsky's scheme, by instantiating LUTs which implement a 4-bit Boolean function and passing the intermediate results via the carry chain.
c) Phil Hays's clever idea of using the design tools' capability to infer adders that use the carry chain for constructing wide boolean functions.

All designs were implemented using Xilinx Foundation Tools Version 3.3i Servicepack 6 using VHDL toolflow. The target FPGA is Xilinx Virtex-XCV1000-4.

Results:

+---------+-------------+-----------+----------+
| Circuit | Slices used | LUTs used | Delay    |
+---------+-------------+-----------+----------+
| a       |           9 |        11 | 17 ns    |
+---------+-------------+-----------+----------+
| b       |           5 |         8 | 13 ns    |
+---------+-------------+-----------+----------+
| c       |          17 |         0 | 15.25 ns |
+---------+-------------+-----------+----------+

Remarks:

a) shows that the tools cannot infer a highly efficient implementation when using just the obvious naive way of coding wide logic functions.

b) Brian's scheme generates the fastest wired-and implementation for 32-input ANDs. The logic inferred is as expected: each slice implements two 4-input LUTs, and each of the LUTs implements a 4-input AND. All the outputs of the LUTs control the CYSEL multiplexers and the results are passed via the carry chain.

c) Phil's scheme doesn't use any LUTs at all; all the logic is implemented using the carry chain, and the LUT is just used for routing 1 single signal to the multiplexer, which means the circuit is similar to b) but every slice handles only 2 bits, instead of 8 bits in circuit b). Surprisingly the circuit is quite fast. Seems as if the Virtex carry chains are _really_ fast.

Thanks to all of you, for posting your ideas.

/Chris
