Terje Mathisen <Terje.Mathisen@hda.hydro.com> writes:
>Klering's code was actually fairly straightforward, except for a set of
>flags used to detect static areas.
>Skipping that part would still (most probably) let it run at the
>required 60 fps, the code is 'just' a parallel implementation of the
>counting logic:
>Alive next iteration = (alive now AND (count == 2 OR count == 3)) OR
>                       (not alive AND count == 3),
>which simplifies to just:
>Alive next iteration = (count == 3) OR (alive AND count == 2).
>By including the cell itself in the count, then it becomes easier to
>reuse the counting logic for multiple rows:
>  alive = (iCount == 3) OR (alive AND iCount == 4)
>You need 4 bits to count to 8 (or 9), so 4 registers for counting plus
>one for the center cells leaves one or two registers for array
>addressing on an x86.
>Klering did a lot of work to simplify the logic as much as possible,
>i.e. he didn't actually implement the full 'count-to-9' bitwise logic,
>since it is possible to early-out many of the branches.
>Implementing the same logic with MMX-style wide registers should make it
>approximately twice as fast.

I'm not sure if the following is the same as Klering's code, but the approach sounds similar. The code below is based on code I got from David Seal and then optimized slightly (removing two binary operations). Initially, the variable "middle" contains a bitvector of a number of cells. "up" and "down" contain the rows above and below the row in question. "left" and "right" contain the same as middle, except that they are shifted one bit left or right (with the appropriate neighbouring bits shifted in). Similarly for "upleft" etc. At the end, "newmiddle" contains the new values for the row corresponding to "middle".
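Written out as a compilable routine, the sequence Torben gives below looks like this. This is an editorial sketch, assuming 32-bit words; the nine pre-shifted neighbour bitvectors are taken as arguments, exactly as described above, and producing them (including the edge bits pulled in from adjacent words) is the caller's job:

```c
#include <stdint.h>

/* One generation for a word-sized slice of the grid, using the
 * carry-save counting sequence from the post. */
uint32_t life_word(uint32_t up, uint32_t upleft, uint32_t upright,
                   uint32_t down, uint32_t downleft, uint32_t downright,
                   uint32_t left, uint32_t right, uint32_t middle)
{
    uint32_t ones, twos, fours, ones1, twos1, carry, carry1;

    /* 2-bit count of the three cells in the row above */
    ones = up ^ upleft;        twos = up & upleft;
    carry = ones & upright;    ones ^= upright;    twos ^= carry;

    /* 2-bit count of the three cells in the row below */
    ones1 = down ^ downleft;   twos1 = down & downleft;
    carry = ones1 & downright; ones1 ^= downright; twos1 ^= carry;

    /* add the two 2-bit counts into a 3-bit count (ones/twos/fours) */
    carry = ones & ones1;   ones ^= ones1;
    fours = twos & twos1;   twos ^= twos1;
    carry1 = twos & carry;  twos ^= carry;  fours ^= carry1;

    /* add the two same-row neighbours; the carry into fours is
     * dropped, which only matters for counts where the cell dies
     * anyway */
    carry = ones & left;    ones ^= left;
    carry1 = ones & right;  ones ^= right;
    carry |= carry1;        twos ^= carry;

    /* alive next = (count == 3) | (alive & count == 2) */
    ones |= middle;
    return ones & twos & ~fours;
}
```

Feeding it a horizontal blinker (middle = 0x0E, empty rows above and below, left/right formed by shifting) returns 0x04: only the centre cell survives, as the rule predicts.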
ones = up ^ upleft;
twos = up & upleft;
carry = ones & upright;
ones = ones ^ upright;
twos = twos ^ carry;

ones1 = down ^ downleft;
twos1 = down & downleft;
carry = ones1 & downright;
ones1 = ones1 ^ downright;
twos1 = twos1 ^ carry;

carry = ones & ones1;
ones = ones ^ ones1;
fours = twos & twos1;
twos = twos ^ twos1;
carry1 = twos & carry;
twos = twos ^ carry;
fours = fours ^ carry1; /* could be | */

carry = ones & left;
ones = ones ^ left;
carry1 = ones & right;
ones = ones ^ right;
carry = carry | carry1;
twos = twos ^ carry;

ones = ones | middle;
newmiddle = ones & twos & ~fours;

If we assume that ^, &, | and &~ are available as single-cycle operations, this takes 26 cycles to complete, some of which can be done in parallel on a superscalar machine. If we add in slightly over a dozen cycles for computing "upleft" etc., we get 40 cycles per word-sized bitvector. If we make sure only to read each memory word once, we should be able to update a wordlength of cells in 40 cycles plus the time it takes to read and write a word. If the read/write is to non-cached memory (as it will be if it goes to memory-mapped display memory), a read will take a full memory cycle (though you can use burst access), while the store can go through a write buffer and hence take only one CPU cycle (which can be scheduled in parallel with other operations). On a non-superscalar CPU (e.g. a StrongARM) with a 200MHz core and 33MHz memory, this works out to about 4 million words per second, or 128 million cells/s. With an 800x600 display, we get more than 250 frames per second. With a superscalar CPU and a non-blocking cache, this can be improved considerably.

Torben Mogensen (torbenm@diku.dk)Article: 10501
In article <6kbnki$4ji@grimer.diku.dk>, torbenm@diku.dk (Torben AEgidius Mogensen) wrote:

I'm not sure if the following is the same as Klering's code, but the

| ones = up ^ upleft;
| twos = up & upleft;
| carry = ones & upright;
| ones = ones ^ upright;
| twos = twos ^ carry;
|
| ones1 = down ^ downleft;
| twos1 = down & downleft;
| carry = ones1 & downright;
| ones1 = ones1 ^ downright;
| twos1 = twos1 ^ carry;
|
| carry = ones & ones1;
| ones = ones ^ ones1;
| fours = twos & twos1;
| twos = twos ^ twos1;
| carry1 = twos & carry;
| twos = twos ^ carry;
| fours = fours ^ carry1; /* could be | */
|
| carry = ones & left;
| ones = ones ^ left;
| carry1 = ones & right;
| ones = ones ^ right;
| carry = carry | carry1;
| twos = twos ^ carry;
|
| ones = ones | middle;
| newmiddle = ones & twos & ~fours;

This algorithm looks like the one described in the Smalltalk "blue book", where a version of Life was implemented using BitBlt operations to implement the cell counting in parallel.

-- Tim OlsonArticle: 10502
Hi all, I have been working on a parameterised, synthesisable CRC generator. I know it's generating correct implementations for CRC-32, CRC-16, and anything else where a documented check for the polynomial exists. Now I just need a handle on acceptable speeds and density. So far, I've come up with the following figures for an Ethernet CRC-32 with 32-bit data being fed in every clock cycle:

Device          Area
FLEX10K100A-1   211 LCs
XC4005XL-09      95 CLBs

Performance for both is around 2 Gbits per second. I think the numbers are pretty reasonable, but if anyone out there has better numbers, I would be grateful for some feedback.

Thanks
StuartArticle: 10503
The site for the Programmable Logic News & Views newsletter has been updated with summaries of the February and March newsletters. http://www.plnv.com Murray DismanArticle: 10504
I wonder if anyone has used the Altera MaxPlus software with a third-party programmer. I programmed an EPC1213LC20 using the Altera programmer and was unable to verify it using my ALLMAX+ programmer. The .POF file used to program the device is made from two separate .SOF files, one set as active, the other passive. It does not seem to matter if I use the .HEX or .RBF formats. Both files will generate the warning that the selected configuration has disabled the start-up time-out device option, which is no good. What I ended up doing was looking at the raw .POF file produced, stripping the header information, and storing the file back out as a raw binary image. In the file I created, the data appears to start at an offset of 0xA3. When I do this, the file may be read into the ALLMAX+ and verified against the original EPROM. I don't believe I should need to use this conversion to get the correct image. Has anyone had this kind of problem, or done this kind of test? I have a call in to Altera also and am waiting on a response from them. Also notice that when selecting the .HEX and .RBF formats, the menu states that this is for a single device.Article: 10505
Torben AEgidius Mogensen wrote:
[snip]
> I'm not sure if the following is the same as Klering's code, but the
> approach sounds similar. The code below is based on code I got from
> David Seal and then optimized slightly (removing two binary
> operations). Initially, the variable "middle" contains a bitvector of
> a number of cells. "up" and "down" contain the rows above and below
> the row in question. "left" and "right" contain the same as middle,
> except that they are shifted one bit left or right (with the
> appropriate neighbouring bits shifted in). Similarly for "upleft" etc.
> At the end, "newmiddle" contains the new values for the row
> corresponding to "middle".
>
> ones = up ^ upleft;
> twos = up & upleft;
> carry = ones & upright;
> ones = ones ^ upright;
> twos = twos ^ carry;
>
> ones1 = down ^ downleft;
> twos1 = down & downleft;
> carry = ones1 & downright;
> ones1 = ones1 ^ downright;
> twos1 = twos1 ^ carry;
>
> carry = ones & ones1;
> ones = ones ^ ones1;
> fours = twos & twos1;
> twos = twos ^ twos1;
> carry1 = twos & carry;
> twos = twos ^ carry;
> fours = fours ^ carry1; /* could be | */
>
> carry = ones & left;
> ones = ones ^ left;
> carry1 = ones & right;
> ones = ones ^ right;
> carry = carry | carry1;
> twos = twos ^ carry;
>
> ones = ones | middle;
> newmiddle = ones & twos & ~fours;
>
> If we assume that ^, &, | and &~ are available as single-cycle

All of these except &~ (AND-NOT) are available on all CPUs I know of, and on those which miss out, you can of course synthesize it in two cycles.

> operations this takes 26 cycles to complete, some of which can be done
> in parallel on a superscalar machine. If we add in slightly over a

Actually, it should be quite easy to get close to 2 IPC, because there are a lot of independent operations all the way to the end.

> dozen cycles for computing "upleft" etc., we get 40 cycles per
> word-sized bitvector.
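The "slightly over a dozen cycles for computing upleft etc." are the shifts that pull each word's edge bits in from the adjacent words. One possible convention is sketched below (the bit ordering, with bit 0 as the leftmost cell of a word, is our assumption; the posts don't fix one):

```c
#include <stdint.h>

/* Build the "left" vector for word w: bit k becomes the left
 * neighbour of cell k.  The leftmost cell's neighbour is the last
 * bit of the word to its left. */
uint32_t left_of(uint32_t w, uint32_t word_to_the_left)
{
    return (w << 1) | (word_to_the_left >> 31);
}

/* Build the "right" vector: bit k becomes the right neighbour of
 * cell k, with the first bit of the next word shifted in. */
uint32_t right_of(uint32_t w, uint32_t word_to_the_right)
{
    return (w >> 1) | (word_to_the_right << 31);
}
```

"upleft", "upright" and so on are formed the same way from the rows above and below, which is where the extra dozen-odd operations per word come from.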
> If we make sure only to read each memory word
> once, we should be able to update a wordlength of cells in 40 cycles
> plus the time it takes to read and write a word. If the read/write is
> to non-cached memory (as it will be if it goes to memory mapped display
> memory), a read will take a full memory cycle (though you can use
> burst access) while the store can be through a write buffer and hence

As you've just discovered, to get max speed you must maintain a back buffer in RAM, and then write updated blocks to the display. It is also critical to have a one bit/pixel display mode, because otherwise you'll be totally limited by write bandwidth. I.e. working in 32-bit true color will increase the size of a full screen buffer from 64K to 2MB. The 120 MB/sec required write speed (for 60 fps) will definitely overload a PCI bus, which has a (very) theoretical max speed of 133 MB/sec on a long burst.

> take only one CPU cycle (which can be scheduled in parallel with other
> operations). On a non-superscalar CPU (as e.g. StrongARM) with 200MHz
> CPU and 33MHz memory, this works out to about 4 million words per
> second, or 128 million cells/s. With a 800x600 display, we get more
> than 250 frames per second. With a superscalar CPU and a non-blocking
> cache, this can be improved considerably.

Anyway, this is the important point: a general CPU is more than fast enough to handle this problem at full frame rate, as long as the code is properly optimized. When you've optimized the code, you'll discover that the problem really is memory bandwidth and nothing else. On the regular VGA cards we had to target, writing a single pixel was so slow that it was critical to minimize the number of writes to just those pixels that actually changed. In my code I stored 4 cells plus the neighborhood counts in a 16-bit word, and then used a 64K lookup table to convert the current value to the new state.
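Terje's post doesn't give his actual 16-bit packing, so the layout below is a guess; it only illustrates how a 64K table can fold the whole rule for four cells into a single lookup. Each nibble is assumed to hold one state bit plus a 3-bit saturating neighbour count (a count of 8 can be saturated to 7, since the cell dies either way):

```c
#include <stdint.h>

/* 4 result bits per 16-bit packed word: bit i of the entry is the
 * new state of cell i.  Hypothetical nibble layout: bit 3 = current
 * state, bits 0-2 = saturated neighbour count. */
static uint8_t next_state[1u << 16];

void build_life_table(void)
{
    for (uint32_t w = 0; w < (1u << 16); w++) {
        uint8_t out = 0;
        for (int i = 0; i < 4; i++) {
            uint32_t nib   = (w >> (4 * i)) & 0xFu;
            uint32_t alive = nib >> 3;
            uint32_t count = nib & 7u;
            if (count == 3 || (alive && count == 2))
                out |= (uint8_t)(1u << i);
        }
        next_state[w] = out;  /* one lookup now updates 4 cells */
    }
}
```

The 64K table is built once; thereafter the inner loop has no per-cell rule evaluation at all, just the count maintenance and one table reference per 4 cells.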
If any of the 4 cells changed state, then I would write the updated pixel(s) to display memory, and then look up a pair of 32-bit increments/decrements: the values needed to update the status of the current line, and the lines above/below. My program actually used fewer instructions/iteration than both Stafford's and Klering's entries, but they both blew me away by keeping their working sets small enough to fit (mostly) in the 8K L1 cache! Some time after this I formulated my .sig. :-)

Terje
--
- <Terje.Mathisen@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"Article: 10506
Tim Olson wrote >This algorithm looks like the one described in the Smalltalk "blue book", >where a version of Life was implemented using BitBlt operations to >implement the cell counting in parallel. Another reference to the bitwise parallel approach is "Life Algorithms", Mark Niemiec, Byte, Jan. 1979, pp 70-79. If I recall correctly, Mark, David Buckingham, and friends, used Waterloo's Honeywell 66/60's EIS "move mountain" instructions to animate 64K 36-bit words per iteration. Inspired by Buckingham and the Blue Book, I wrote a bitblt version that did 800,000 cells in 34? bitblts on a Perq in 1983? and one that did 400,000 cells/s on an 8 MHz (1 M 2-operand 32-bit op/s) 68000 in 1985. As Messrs. Mathisen and Mogensen describe, Life should run very fast on modern processors (superscalar and multimedia enhanced and large caches). 64-bits, in 40 insns, in perhaps 15-20 clocks, at 3 ns/clock, e.g. 1 bit/ns. FPGA Implementation: It is straightforward to run at full memory bandwidth. For example, given an XCS20 and a 32Kx32 PBSRAM (32-bits in or out per 15 ns clock) we can approach 32 bits/(2*15) ns, e.g. 1 bit/ns. Since a given line is read three times (as "below", "current", and "above"), we buffer 2 lines of cells in RAM in the FPGA. A 1024 x n playfield requires 2 x 1024 bits = 64 CLBs of single port RAM, and preferably 3 x 1024 bits for 3 banks since each clock you must read from up to two lines and write to a third. Detailed design/floor plan. One bit requires approx. 9 CLBs. 
Assuming the cell neighbours are (a,b,c,d,e,f,g,h), we need:

3 CLBs RAM  -- 3 32x1 RAMs (3 banks of line buffer)
6 CLBs logic --
 1    s0a=a^b^c^d; s0e=e^f^g^h
 2    s1a="a+b+c+d == 2 or 3"; s1e="e+f+g+h == 2 or 3"
 3    s2a=a&b&c&d; s2e=e&f&g&h
 4,5  (s3,s2,s1,s0)=(s2a,s1a,s0a) + (s2e,s1e,s0e) (uses dedicated carry logic)
 6    new = ~s3&~s2&s1&(s0|old)

and so in a 20x20 CLB XCS20, we explicitly place 16 rows of 1x9 CLB tiles in the left half, another 16 in the right half, leaving plenty of room to spare for control and address generation.

At the 1997 FPGAs for Custom Computing Machines conference, the paper "The RAW Benchmark Suite" by Babb et al. proposed a set of benchmarks for comparing reconfigurable computing systems. One of the 12 benchmarks was Life, for which they reported speedups of several hundred times over a SparcStation20+software approach, but in fairness, they write "we are not currently implementing known improvements to the software to take advantage of the bit-level parallelism available in a microprocessor".

Summary. Hypothetically...
Fast microprocessor + cache:        ~1 bit/ns
Single FPGA + SRAM custom machine:  ~1 bit/ns

Jan GrayArticle: 10507
> ones = up ^ upleft;
> twos = up & upleft;
> carry = ones & upright;
> . . .

You can beat this (in terms of number of logical operations and shifts) by quite a bit. Here, `g' is the original, and only, input:

sl3=(sl2=(a=left(g))^(b=right(g)))^g
sh3=(sh2=a&b)|(sl2&g)
sll=(a=up(sl3)^(b=down(sl3)))^sl2
slh=(a|(b^sl2))^sll
a=up(sh3)^(b=down(sh3))
g=(a^sh2^slh)&((a|(b^sh2))^slh)&(sll|g)

I believe that's 19 logical operations, one left, one right, two ups and two downs (this assumes the ups and downs are cheap).

> Actually, it should be quite easy to get close to 2 IPC, because there's
> a lot of independent operations all the way to the end.

You bet!

> As you've just discovered, to get max speed you must maintain a back
> buffer in RAM, and then write updated blocks to the display.

Yep, and you need to block the algorithm appropriately so it fits in cache. This is pretty easy to do; just do the above algorithm in appropriately sized strips. Further, it's pretty easy to block out (not process) areas that are static or oscillating with period 2 (which are terribly common in Life); I generally use two alternating buffers and keep a `superbitmap' of those chunks that are changing with period >2.

> The 120 MB/sec required write speed (for 60 fps) will definitely
> overload a PCI bus, which has a (very) theoretical max speed of 133
> MB/sec on a long burst.

Which is why you do the delta. Indeed, what I did is `stupider' than that. There's no sense updating the display at greater than the frame rate, but it's easy to calculate at greater than the frame rate. So I don't update on every generation, just on every frame. And then I only update the deltas, which are often quite small compared to the real data.

> When you've optimized the code, then you'll discover that the problem
> really is memory bandwidth and nothing else.

I'm not so sure about this; it's pretty easy to make the loads/stores overlap pretty well.
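Tom's sequence above leans on embedded assignments: in the third line, `a=up(sl3)^(b=down(sl3))` stores down(sl3) in b and the XOR of the two row shifts in a. Unpacked into plain C over a toy grid of 32-bit rows with zero boundaries (the grid representation and helper names are our assumptions, not Tom's), it reads:

```c
#include <stdint.h>

#define ROWS 8  /* toy grid: ROWS rows of 32 cells, zero boundary */

/* row shifts for up()/down(); zero beyond the top and bottom edges */
static uint32_t up_row(const uint32_t *p, int i)   { return i > 0        ? p[i - 1] : 0; }
static uint32_t down_row(const uint32_t *p, int i) { return i < ROWS - 1 ? p[i + 1] : 0; }

void life_step(const uint32_t g[ROWS], uint32_t out[ROWS])
{
    uint32_t sl2[ROWS], sl3[ROWS], sh2[ROWS], sh3[ROWS];

    for (int i = 0; i < ROWS; i++) {
        uint32_t a = g[i] << 1;             /* left(g)  */
        uint32_t b = g[i] >> 1;             /* right(g) */
        sl2[i] = a ^ b;                     /* ones bit of left+right        */
        sl3[i] = sl2[i] ^ g[i];             /* ones bit of left+centre+right */
        sh2[i] = a & b;                     /* twos bit of left+right        */
        sh3[i] = sh2[i] | (sl2[i] & g[i]);  /* twos bit of left+centre+right */
    }
    for (int i = 0; i < ROWS; i++) {
        uint32_t a, b, sll, slh;
        a = up_row(sl3, i) ^ (b = down_row(sl3, i));
        sll = a ^ sl2[i];                   /* ones place of neighbour count */
        slh = (a | (b ^ sl2[i])) ^ sll;     /* carry into the twos place     */
        a = up_row(sh3, i) ^ (b = down_row(sh3, i));
        out[i] = (a ^ sh2[i] ^ slh)
               & ((a | (b ^ sh2[i])) ^ slh)
               & (sll | g[i]);
    }
}
```

A quick sanity check: a horizontal blinker in row 3 flips to a vertical blinker, i.e. the centre bit appears in rows 2, 3 and 4 of the output.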
Of course, I did it on the 68000 where there are enough registers; I'm not sure about the x86 world. The above code was completely designed by me, although I'm sure others have found a similar solution. (I actually implemented the above on an HP calculator in user-RPL, and on the Amiga, both with the blitter and in assembler. I keep meaning to get around to speeding up xlife but never can seem to find the time.) Here's 48G code for anyone who cares; just put a GROB on the stack and hit `GEN':

GEN  << WHILE 1 REPEAT DUP ->LCD GEN1 END >>
GEN1 << {#0 #1} DUP2 SH OVER LX ROT REVLIST SWAP OVER SH 5 ROLLD
        4 ROLLD SH 4 PICK LX ROT 3 PICK + NEG + 4 ROLLD NEG + LX + NEG >>
SH   << DUP2 OVER DUP ROT {#FFFh #FFFh} SUB LX 3 DUPN 7 ROLLD GXOR
        5 ROLLD GXOR + >>
LX   << {#0 #0} SWAP GXOR >>

-tomArticle: 10508
Nice to see one's name in the press! The story below is basically correct but misses out a couple of vital details: 1. Algotronix web address: www.algotronix.com 2. Algotronix phone number: (408) 480 5707 Tom Kean mtmason@ix.netcom.com wrote: > Taken from EETimes > > SAN JOSE, Calif. — Xilinx Inc. has stopped development work on its XC6200 > line of partially reconfigurable field-programmable gate arrays (FPGAs), and > the founders of the company's reconfigurable R&D group in Edinburgh, > Scotland, John Gray and Tom Kean, have both left the company. The remaining > engineering staff at Edinburgh has been reassigned to develop IP cores for > use by Xilinx's customers within the company's FPGAs. > > However, Xilinx says it remains committed to the partial reconfigurability > offered by the XC6200 devices and will offer many of the features of the > XC6200 in its next-generation FPGA family, known as Virtex. > > "John Gray is still working for Xilinx as a consultant," said Roland > Triffaux, manager of Xilinx Europe. "Tom Kean and another two engineers have > left to start a company in California." > > That spin-off company, called Quicksilver, is believed to have some backing > from Xilinx. Quicksilver is looking to apply reconfigurable logic to > multiprotocol handsets for mobile communications, sources said. It is also > believed to be working with systems companies on the application of > reconfigurable logic. > > Gray said his departure from the company was amicable. "I am just looking to > do something that's more fun again," he said. "It's time to kick back a bit." > > Meanwhile, Kean is reacquiring the name Algotronix from Xilinx. Kean and Gray > led Algotronix Ltd. in the early 1990s before it was acquired by Xilinx in > 1993, when it became the basis of the reconfigurable R&D group, and was named > Xilinx Development Corp. 
> > Algotronix developed a reconfigurable FPGA architecture known as CAL > (configurable array logic), which eventually became the XC6200. Kean said the > new Algotronix would act as a consultancy and would advise users on the > application of reconfigurable logic. > > Xilinx said the work of the R&D group was largely completed. "The goal of the > reconfigurable group in Edinburgh has been achieved," Triffaux said. XC6200 > devices would continue to be available for academic and commercial research > groups, as they have been in the past. "We never really sold it," Triffaux > said. > > Peter Cheung, a researcher in the department of electrical and electronic > engineering at Imperial College of Science and Technology (London), has used > XC6200 devices for reconfigurable hardware platforms. "We've heard they are > not developing the XC6200," Cheung said. "Unless there is a real commitment > to it we may have to look at other things. The tools for 6200 are primitive > and not well done. > > "In many ways it [the XC6200] was a product ahead of its time," he said. "It > was a beautifully conceived device but not sufficiently well supported." > > -----== Posted via Deja News, The Leader in Internet Discussion ==----- > http://www.dejanews.com/ Now offering spam-free web-based newsreadingArticle: 10509
I have seen problems with the CCLK in master mode, but it sounds like you are using the slave mode. Still could be a timing issue, try adding some series termination... say a 33 ohm resistor on the cclk. Alexander Sherstuk wrote: > > > Hi All, > > I encountered unexpected difficulty, when loading XILINX XC4005E > configuration from ATMEL AT89C52 (in serial slave mode). > I connected P1.5 pin to configuration clock CCLK pin of XILINX chip, > > and connected P1.0 pin to DIN pin of XILINX chip. > XILINX configuration is loaded, but not with 100% probability - > sometimes (1 attempt of 5) it fails. > It looks like the problem is with 8051 signals rise time. > When I fed CCLK through 74HC14, everything works fine. > Maybe, somebody knows more about this problem. > How to avoid it? > > Thanks, > Alex Sherstuk > Sherstuk@amsd.comArticle: 10510
Some time ago, I was faced with a similar question (in my case I was using Xilinx parts instead of Alteras). As far as I understood from the Altera tool, it is OK if you have not yet made your PCB; that is what these tools are good for. However, I don't know of any way to provide a tool such as MaxPlus2 or Xilinx M1.x with such a connection list. Since you are prototyping, chances are that connections will change. What we finally did was make a board, define the connection list (with inclusion of some spares), make a separate component for each module or set of modules, and route these components as stand-alone FPGAs. -- Koenraad SCHELFHOUT Switching Systems Division http://www.alcatel.com/ Microelectronics Department - VA21 _______________ ________________________________________\ /-___ \ / / Phone : (32/3) 240 89 93 \ ALCATEL / / Fax : (32/3) 240 99 88 \ / / mailto:ksch@sh.bel.alcatel.be \ / / _____________________________________________\ / /______ \ / / Francis Wellesplein, 1 v\/ B-2018 Antwerpen BelgiumArticle: 10511
Terje Mathisen wrote:
> Anyway, this is the important point: A general cpu is more than fast
> enough to handle this problem at full frame rate, as long as the code is
> properly optimized.

I've only just seen this discussion, but I'm the guy who wrote the demonstration at Oxford. I should emphasise that I wasn't trying to show the power of FPGAs particularly, more just trying to make an attractive demo, so getting it running at the full frame rate was my sole aim also. It is trivial to compute more than one cell per cycle, but as you rightly point out the problem will quickly become one of memory bandwidth. With the system we're using, I could probably get 10 cells per cycle (giving us around 600fps) before that became the bottleneck. I recently extended the program in a different way: I wrote a fairly general cellular automata harness in which each cell has four bits of state, and just about any automaton you want to try out can simply be plugged in. These should all then run at the full frame rate, as you have around four or five levels of pipelining available. I think multi-state automata like this could present a pretty difficult challenge for conventional processors: it really is just one of those things that FPGAs are very good at. Shameless plug: the general automata harness (including memory interfaces, VGA display and serial mouse interface for interacting with the automata) was done in around 700 lines of Handel-C code. If anybody wants a copy, I'll gladly send it out (send mail to mpa@comlab.ox.ac.uk). The Hardware Compilation Group homepage is at: http://www.comlab.ox.ac.uk/oucl/hwcomp.html

Cheers, Matt
--
Matt Aubury, Oxford University Computing LaboratoryArticle: 10512
Another parallel implementation of Conway's Life is given by Eugene McDonnell in "Life: Nasty, Brutish, and Short", in the ACM SIGAPL APL88 Conference Proceedings. Eugene evolves a number of algorithms in dialects of APL, ending up with a 9-token expression for one iteration.

Bob

Jan Gray wrote:
> >where a version of Life was implemented using BitBlt operations to
> >implement the cell counting in parallel.
> Another reference to the bitwise parallel approach is "LifeArticle: 10513
Robert Bernecky wrote: > > Another parallel implementation of Conway's Life is given > by Eugene McDonnell in "Life: Nasty, Brutish, and Short", > in ACM SIGAPL APL88 Conference Proceedings. Eugene evolves > a number of algorithms in dialects of APL, ending up with > a 9-token expression for one iteration. > The most parallel implementation I've ever seen is "Life in the Stencil Buffer" on page 407 of the OpenGL Programming Guide. -- Regards, | No sense being pessimistic -- Ian Ameline, | It wouldn't work anyway. Senior Software Engineer, | Alias/Wavefront |Article: 10514
Matt Aubury wrote:
>
> Terje Mathisen wrote:
> > Anyway, this is the important point: A general cpu is more than fast
> > enough to handle this problem at full frame rate, as long as the code is
> > properly optimized.
>
> I've only just seen this discussion, but I'm the guy who wrote the
> demonstration at Oxford. I should emphasise that I wasn't trying to
> show the power of FPGAs particularly, more just trying to make an
> attractive demo, so getting it running at the full frame rate was my
> sole aim also. It is trivial to compute more than one cell per cycle,
> but as you rightly point out the problem will quickly become one of
> memory bandwidth. With the system we're using, I could probably get 10
> cells per cycle (giving us around 600fps) before that became the
> bottleneck.

Nice! :-)

> I recently extended the program in a different way: I wrote a fairly
> general cellular automata harness in which each cell has four bits of
> state, and just about any automata you want to try out can simply be
> plugged in. These should all then run at the full frame rate, as you
> have around four or five levels of pipelining available. I think
> multi-state automata like this could present a pretty difficult
> challenge for conventional processors: it really is just one of those
> things that FPGAs are very good at.

This could be solved with runtime code generation as well, compiling an optimized set of binary logic ops on the fly, or (much simpler), by embedding the cell rules in lookup tables. This is actually one of my favourite ways to solve many kinds of problems: generating one or more tables at runtime which implement all the required logic. This is basically a state machine, which will almost always run at whatever speed the tables can support the (nested) lookups. My single favourite program is a 16-bit version of Word Count, which handles user-specified word and line separators, i.e. it can handle both CR and LF by themselves, as well as the CRLF combination.
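The nested-lookup state machine Terje describes can be shown in miniature. The sketch below is a much-simplified, single-byte-at-a-time relative of his program (the table layout is invented here, not his 16-bit pair-classifying version): classify each byte through a table, then update the word count with pure arithmetic, no per-character compare-and-branch:

```c
#include <stddef.h>

/* 1 if the byte can be part of a word, 0 if it is a separator;
 * the separator set here is an illustrative choice */
static unsigned char is_word_char[256];

void init_classes(void)
{
    for (int c = 0; c < 256; c++)
        is_word_char[c] = !(c == ' ' || c == '\t' ||
                            c == '\n' || c == '\r');
    is_word_char[0] = 0;  /* treat NUL as a separator too */
}

size_t count_words(const char *s, size_t n)
{
    size_t words = 0;
    unsigned char in_word = 0;  /* 0 or 1 */
    for (size_t i = 0; i < n; i++) {
        unsigned char c = is_word_char[(unsigned char)s[i]];
        /* a word starts where a word char follows a separator */
        words += (size_t)(c & (in_word ^ 1u));
        in_word = c;
    }
    return words;
}
```

Because CR and LF are both plain separators in the table, CR, LF and CRLF line endings all fall out of the same lookup with no special cases, which is the point of the technique.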
This program processes 256 chars in the inner loop, with zero compare/test/branch operations. It needs just 1.5 instructions/byte, so it is probably faster than any kind of disk or even main memory! :-)

> Hardware Compilation Group homepage is at:
>
> http://www.comlab.ox.ac.uk/oucl/hwcomp.html

Interesting, although it seems like some of the sample applications didn't run too well, i.e. the real-time video image warper is a much simpler application than a sw MPEG-2 decoder, and it still ran at just 9 fps. Is this due to poorly optimized (Handel-C) source code, or just that the task isn't very well suited to an FPGA implementation?

Terje
--
- <Terje.Mathisen@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"Article: 10515
Terje Mathisen wrote: > Matt Aubury wrote: > > I recently extended the program in a different way: I wrote a fairly > > general cellular automata harness in which each cell has four bits of > > state, and just about any automata you want to try out can simply be > > plugged in. These should all then run at the full frame rate, as you > > have around four or five levels of pipelining available. I think > > multistate automatas like this could present a pretty difficult > > challenge for conventional processors: it really is just one of those > > things that FPGAs are very good at. > > This could be solved with runtime code generation as well, compiling an > optimized set of binary logic ops on the fly, or (much simpler), by > embedding the cell rules in lookup tables. I think you're going to have a problem there: the total state coming into the lookup table is going to be the eight neighbours plus the central cell, each with four bits of state, so that's 36 bits of input data to 4 bits of output. A 32 GB lookup table might be a touch cumbersome! :-) Runtime code generation might well work; but it isn't exactly easy. I wonder how well a partial evaluator, like TEMPO (http://www.irisa.fr/compose/tempo/), would work on a problem like this... > > Hardware Compilation Group homepage is at: > > http://www.comlab.ox.ac.uk/oucl/hwcomp.html > > Interesting, although it seems like some of the sample applications didn't > run too well, i.e. the real-time video image warper is a much simpler > application than a sw MPEG-2 decoder, and it still ran at just 9 fps. Ack! > Is this due to poorly optimized (Handel-C) source code, or just that > the task isn't very well suited to an FPGA implementation? In the case of that particular version the problem was with the host board and its interface. Since that demo was created, I've written my own version which runs happily at 60fps (although that's running on a static image; it would be fairly trivial to extend it to video). 
We have a realtime warp of Bill Gates' face which entertains quite a few visitors! Cheers, Matt -- Matt Aubury, Oxford University Computing LaboratoryArticle: 10516
Ramon <rco00003@teleline.es> wrote:
>Hi, I am a student from Barcelona. I am doing my final career project,
>designing with VHDL. Now I need to compare my synthesis with other
>syntheses.
>
>What kind of parameters can I compare?
>Where could I find some comparisons? (articles, books, ...)

Ramon, there's a whole world of parameters you can compare synthesis tools on! Off the top of my head I can think of:

- (For ASICs) highest speed, least gates, net-to-gate ratios
- (For FPGAs) highest speed, least CLBs, routability
- FPGA to ASIC conversion (translation abilities)
- Various levels of VHDL (or Verilog, for that matter) support for synthesis
- Design For Testability issues as related to various synthesis tools
- Low power design synthesis (can this tool do it & how well?)
- Module block sizes, wire load models, hierarchical synthesis
- Portability of a synthesis tool's input or output with other EDA tools

And, of course, price, support, etc.

- John Cooley Part Time EDA Consumer Advocate Full Time ASIC, FPGA & EDA Design Consultant ============================================================================ Trapped trying to figure out a Synopsys bug? Want to hear how 6000+ other users dealt with it ? Then join the E-Mail Synopsys Users Group (ESNUG)! !!! "It's not a BUG, jcooley@world.std.com /o o\ / it's a FEATURE!" (508) 429-4357 ( > ) \ - / - John Cooley, EDA & ASIC Design Consultant in Synopsys, _] [_ Verilog, VHDL and numerous Design Methodologies. Holliston Poor Farm, P.O. Box 6222, Holliston, MA 01746-6222 Legal Disclaimer: "As always, anything said here is only opinion."Article: 10517
mpa@comlab.ox.ac.uk (Matt Aubury) writes: > Terje Mathisen wrote: > > This could be solved with runtime code generation as well, compiling an > > optimized set of binary logic ops on the fly, or (much simpler), by > > embedding the cell rules in lookup tables. > > I think you're going to have a problem there: the total state coming > into the lookup table is going to be the eight neighbours plus the > central cell, each with four bits of state, so thats 36 bits of input > data to 4 bits of output. A 32 GB lookup table might be a touch > cumbersome! :-) While that's true, I'm sure your hardware solution isn't using a full width sum-of-products implementation either. Anywhere you use cascaded logic, a software implementation can use the exact same cascaded logic or cascaded table lookups. -- Bruce -- 'We have no intention of shipping another bloated operating system and forcing that down the throats of our Windows customers' -- Paul Maritz, Microsoft Group Vice PresidentArticle: 10518
Hi, Sorry, I don't know how to help you with ABEL, but there are several ways to use HDL and accomplish what you wish. Using VHDL (or Verilog, but I'm just using VHDL), you can use scripts with either the Synopsys or Synplicity packages. Alternatively, if you can convert your netlist into a Viewlogic wir format, we have written software and macros to perform the substitution, or you can use Viewgen to create a schematic and replace the macros (we have made C-Module equivalents for each of the S-Module macros) from our custom library. Lastly, Actel has software (in beta right now) that incorporates flip-flop control into both Actgen and Actmap - you can select either C-Module or TMR implementations. Also, similarly to what we did, they offer library symbols in their database with hardened equivalents of all of their flip-flop macros. Please email me or see our www site (http://rk.gsfc.nasa.gov) for some more notes on this topic. Hope this helps, rk _____________________________________________________________________ Jules wrote: Hi, Could anyone tell me how to cause ABEL code to synthesize to specific Actel macros? I need to do this to avoid using sequential flip-flops, to better protect the system from single-event upset in a high-radiation Earth orbit. Are there any Actel directives in Synario ABEL that can force this? Thanks for your help regards Jules ClusterII team Space Physics group Imperial College LondonArticle: 10519
Dear all, I will be in the States (California) during June and July '98, and would like to know if you know of any training on VHDL (more of a practical workshop) conducted by anybody. Thanks, vananArticle: 10520
Bruce Hoult wrote: > > mpa@comlab.ox.ac.uk (Matt Aubury) writes: > > Terje Mathisen wrote: > > > This could be solved with runtime code generation as well, compiling an > > > optimized set of binary logic ops on the fly, or (much simpler), by > > > embedding the cell rules in lookup tables. > > > > I think you're going to have a problem there: the total state coming > > into the lookup table is going to be the eight neighbours plus the > > central cell, each with four bits of state, so thats 36 bits of input > > data to 4 bits of output. A 32 GB lookup table might be a touch > > cumbersome! :-) Actually, it isn't quite so bad: Since the output is just 4 bits, I'd pack two of them into a single byte, so my table would be "only" 16GB. :-) > While that's true, I'm sure your hardware solution isn't using a > full width sum-of-products implementation either. Anywhere you use > cascaded logic, a software implementation can use the exact same > cascaded logic or cascaded table lookups. That is of course the way to implement it. I.e. the word counter I mentioned uses one table to classify pairs of input chars, combines this 4-bit value with the result from the previous pair, and then uses another table to lookup the corresponding line/word increments which gets added to the running (block) total. Terje -- - <Terje.Mathisen@hda.hydro.com> Using self-discipline, see http://www.eiffel.com/discipline "almost all programming can be viewed as an exercise in caching"Article: 10521
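Bruce's and Terje's point about cascaded lookups can be made concrete with a minimal sketch (my own illustration, not Klering's or Seal's code) for ordinary two-state Life, where the full neighbourhood is only 9 bits and a single 512-entry rule table suffices. With four bits of state per cell the same index would be 36 bits wide - the impractically large table Matt describes - which is exactly when you would decompose the lookup into a cascade of small tables instead.

```c
#include <stdint.h>

/* Rule table indexed by the 9-bit neighbourhood, centre cell in bit 4.
 * For two-state Life this is only 2^9 = 512 entries. */
static uint8_t rule[512];

static void build_rule_table(void)
{
    for (int idx = 0; idx < 512; idx++) {
        int count = 0;
        for (int bit = 0; bit < 9; bit++)
            if (bit != 4 && ((idx >> bit) & 1))
                count++;                      /* neighbours only */
        int alive = (idx >> 4) & 1;
        /* Alive next iteration = (count == 3) OR (alive AND count == 2) */
        rule[idx] = (count == 3) || (alive && count == 2);
    }
}

/* One generation on a small w x h grid; cells outside the border
 * are treated as dead. One table lookup per cell. */
static void life_step(const uint8_t *src, uint8_t *dst, int w, int h)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int idx = 0, bit = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++, bit++) {
                    int nx = x + dx, ny = y + dy;
                    if (nx >= 0 && nx < w && ny >= 0 && ny < h)
                        idx |= src[ny * w + nx] << bit;
                }
            dst[y * w + x] = rule[idx];
        }
}
```

This per-cell version is of course far slower than the bitvector code in the thread; the point is only the table-driven structure, which generalises to wider lookups (e.g. one table per row-triple of several cells, combined by a second table) when the state per cell grows.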
There's been an update! See what's new on The Programmable Logic Jump Station at http://www.optimagic.com/ http://www.optimagic.com/whatsnew.html The Programmable Logic Jump Station is a comprehensive set of links to nearly all matters related to programmable logic. Featuring: --------- --- Frequently-Asked Questions (FAQ) --- Programmable Logic FAQ - http://www.optimagic.com/faq.html A great resource for designers new to programmable logic. --- FPGAs, CPLDs, FPICs, etc. --- Recent Developments - http://www.optimagic.com Find out the latest news about programmable logic. Device Vendors - http://www.optimagic.com/companies.html FPGA, CPLD, SPLD, and FPIC manufacturers. Device Summary - http://www.optimagic.com/summary.html Who makes what and where to find out more. Market Statistics - http://www.optimagic.com/market.html Total high-density programmable logic sales and market share. --- Development Software --- Free and Low-Cost Software - http://www.optimagic.com/lowcost.html Free, downloadable demos and evaluation versions from all the major suppliers. Design Software - http://www.optimagic.com/software.html Find the right tool for building your programmable logic design. Synthesis Tutorials - http://www.optimagic.com/tutorials.html How to use VHDL or Verilog. --- Related Topics --- FPGA Boards - http://www.optimagic.com/boards.html See the latest FPGA boards and reconfigurable computers. Design Consultants - http://www.optimagic.com/consultants.html Find a programmable logic expert in your area of the world. Research Groups - http://www.optimagic.com/research.html The latest developments from universities, industry, and government R&D facilities covering FPGA and CPLD devices, applications, and reconfigurable computing. News Groups - http://www.optimagic.com/newsgroups.html Information on useful newsgroups. Related Conferences - http://www.optimagic.com/conferences.html Conferences and seminars on programmable logic. 
Information Search - http://www.optimagic.com/search.html Pre-built queries for popular search engines plus other information resources. The Programmable Logic Bookstore - http://www.optimagic.com/books.html Books on programmable logic, VHDL, and Verilog. Most can be ordered on-line, in association with Amazon.com . . . and much, much more. Bookmark it today!Article: 10522
Ian_Ameline wrote: > > Robert Bernecky wrote: > > > > Another parallel implementation of Conway's Life is given > > by Eugene McDonnell in "Life: Nasty, Brutish, and Short", > > in ACM SIGAPL APL88 Conference Proceedings. Eugene evolves > > a number of algorithms in dialects of APL, ending up with > > a 9-token expression for one iteration. > > > > The most parallel implementation I've ever seen is "Life in the Stencil > Buffer" on page 407 of the OpenGL Programming Guide. > > -- > Regards, | No sense being pessimistic -- > Ian Ameline, | It wouldn't work anyway. > Senior Software Engineer, | > Alias/Wavefront | Since we are reminiscing, I recall Life on CLIP4 in 1984. CLIP4 was a 96x96 SIMD array processor built at University College London. The processors ran at about 1MHz. We had to slow the code by about 500x to see the display :-) The code, I suspect, was written by Paul Otto and David Renolds (I did not join the group until 1984). Regards, Zahid -- Zahid Hussain, BSc (Hons), PhD (Lond.) E-mail: zhus@daldd.sc.ti.com 3D Graphics Software Architect Tel: (972) 480-2864 Texas Instruments Inc. Fax: (972) 480-6303 8505 Forest Lane, Dallas, TX, USA MS: 8724, MSGID: ZHUSArticle: 10523
I'm using Altera products for the first time and there are some pins that remain quite mysterious : clkusr cs /cs dev_clr dev_oe init_done (well, I suppose this one is quite obvious :-) rdy_/bsy /rs /ws Some of them must be used for parallel programming, but I didn't find any info about it in the data book. I also wonder what is the "user mode" (and what are the other modes)... thanks Nicolas MATRINGE DotCom SA Développement électronique 16 rue du Moulin des Bruyères Tel: 00 33 1 46 67 51 00 92400 COURBEVOIE Fax: 00 33 1 46 67 51 01 FRANCEArticle: 10524