Terje Mathisen <Terje.Mathisen@hda.hydro.com> writes:
>Klering's code was actually fairly straightforward, except for a set of
>flags used to detect static areas.
>Skipping that part would still (most probably) let it run at the
>required 60 fps, the code is 'just' a parallel implementation of the
>counting logic:
>Alive next iteration = (alive now AND (count == 2 OR count == 3)) OR
>                       (not alive AND count == 3),
>which simplifies to just:
>Alive next iteration = (count == 3) OR (alive AND count == 2).
>By including the cell itself in the count, then it becomes easier to
>reuse the counting logic for multiple rows:
>  alive = (iCount == 3) OR (alive AND iCount == 4)
>You need 4 bits to count to 8 (or 9), so 4 registers for counting plus
>one for the center cells leaves one or two registers for array
>addressing on an x86.
>Klering did a lot of work to simplify the logic as much as possible,
>i.e. he didn't actually implement the full 'count-to-9' bitwise logic,
>since it is possible to early-out many of the branches.
>Implementing the same logic with MMX-style wide registers should make it
>approximately twice as fast.

I'm not sure if the following is the same as Klering's code, but the approach sounds similar. The code below is based on code I got from David Seal and then optimized slightly (removing two binary operations). Initially, the variable "middle" contains a bitvector of a number of cells. "up" and "down" contain the rows above and below the row in question. "left" and "right" contain the same as middle, except that they are shifted one bit left or right (with the appropriate neighbouring bits shifted in). Similarly for "upleft" etc. At the end, "newmiddle" contains the new values for the row corresponding to "middle".
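Written out as a compilable routine, the sequence Torben gives below looks like this. This is an editorial sketch, assuming 32-bit words; the nine pre-shifted neighbour bitvectors are taken as arguments, exactly as described above, and producing them (including the edge bits pulled in from adjacent words) is the caller's job:

```c
#include <stdint.h>

/* One generation for a word-sized slice of the grid, using the
 * carry-save counting sequence from the post. */
uint32_t life_word(uint32_t up, uint32_t upleft, uint32_t upright,
                   uint32_t down, uint32_t downleft, uint32_t downright,
                   uint32_t left, uint32_t right, uint32_t middle)
{
    uint32_t ones, twos, fours, ones1, twos1, carry, carry1;

    /* 2-bit count of the three cells in the row above */
    ones = up ^ upleft;        twos = up & upleft;
    carry = ones & upright;    ones ^= upright;    twos ^= carry;

    /* 2-bit count of the three cells in the row below */
    ones1 = down ^ downleft;   twos1 = down & downleft;
    carry = ones1 & downright; ones1 ^= downright; twos1 ^= carry;

    /* add the two 2-bit counts into a 3-bit count (ones/twos/fours) */
    carry = ones & ones1;   ones ^= ones1;
    fours = twos & twos1;   twos ^= twos1;
    carry1 = twos & carry;  twos ^= carry;  fours ^= carry1;

    /* add the two same-row neighbours; the carry into fours is
     * dropped, which only matters for counts where the cell dies
     * anyway */
    carry = ones & left;    ones ^= left;
    carry1 = ones & right;  ones ^= right;
    carry |= carry1;        twos ^= carry;

    /* alive next = (count == 3) | (alive & count == 2) */
    ones |= middle;
    return ones & twos & ~fours;
}
```

Feeding it a horizontal blinker (middle = 0x0E, empty rows above and below, left/right formed by shifting) returns 0x04: only the centre cell survives, as the rule predicts.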
ones = up ^ upleft;
twos = up & upleft;
carry = ones & upright;
ones = ones ^ upright;
twos = twos ^ carry;

ones1 = down ^ downleft;
twos1 = down & downleft;
carry = ones1 & downright;
ones1 = ones1 ^ downright;
twos1 = twos1 ^ carry;

carry = ones & ones1;
ones = ones ^ ones1;
fours = twos & twos1;
twos = twos ^ twos1;
carry1 = twos & carry;
twos = twos ^ carry;
fours = fours ^ carry1; /* could be | */

carry = ones & left;
ones = ones ^ left;
carry1 = ones & right;
ones = ones ^ right;
carry = carry | carry1;
twos = twos ^ carry;

ones = ones | middle;
newmiddle = ones & twos & ~fours;

If we assume that ^, &, | and &~ are available as single-cycle operations, this takes 26 cycles to complete, some of which can be done in parallel on a superscalar machine. If we add in slightly over a dozen cycles for computing "upleft" etc., we get 40 cycles per word-sized bitvector. If we make sure only to read each memory word once, we should be able to update a wordlength of cells in 40 cycles plus the time it takes to read and write a word. If the read/write is to non-cached memory (as it will be if it goes to memory-mapped display memory), a read will take a full memory cycle (though you can use burst access), while the store can go through a write buffer and hence take only one CPU cycle (which can be scheduled in parallel with other operations). On a non-superscalar CPU (e.g. a StrongARM) with a 200MHz core and 33MHz memory, this works out to about 4 million words per second, or 128 million cells/s. With an 800x600 display, we get more than 250 frames per second. With a superscalar CPU and a non-blocking cache, this can be improved considerably.

Torben Mogensen (torbenm@diku.dk)Article: 10501
In article <6kbnki$4ji@grimer.diku.dk>, torbenm@diku.dk (Torben AEgidius Mogensen) wrote:

I'm not sure if the following is the same as Klering's code, but the

| ones = up ^ upleft;
| twos = up & upleft;
| carry = ones & upright;
| ones = ones ^ upright;
| twos = twos ^ carry;
|
| ones1 = down ^ downleft;
| twos1 = down & downleft;
| carry = ones1 & downright;
| ones1 = ones1 ^ downright;
| twos1 = twos1 ^ carry;
|
| carry = ones & ones1;
| ones = ones ^ ones1;
| fours = twos & twos1;
| twos = twos ^ twos1;
| carry1 = twos & carry;
| twos = twos ^ carry;
| fours = fours ^ carry1; /* could be | */
|
| carry = ones & left;
| ones = ones ^ left;
| carry1 = ones & right;
| ones = ones ^ right;
| carry = carry | carry1;
| twos = twos ^ carry;
|
| ones = ones | middle;
| newmiddle = ones & twos & ~fours;

This algorithm looks like the one described in the Smalltalk "blue book", where a version of Life was implemented using BitBlt operations to implement the cell counting in parallel.

-- Tim OlsonArticle: 10502
Hi all, I have been working on a parameterised, synthesisable CRC generator. I know it's generating correct implementations for CRC-32, CRC-16, and anything else where a documented check for the polynomial exists. Now I just need a handle on acceptable speeds and density. So far, I've come up with the following figures for an Ethernet CRC-32 with 32-bit data being fed in every clock cycle:

Device          Area
FLEX10K100A-1   211 LCs
XC4005XL-09      95 CLBs

Performance for both is around 2 Gbits per second. I think the numbers are pretty reasonable, but if anyone out there has better numbers, I would be grateful for some feedback.

Thanks
StuartArticle: 10503
The site for the Programmable Logic News & Views newsletter has been updated with summaries of the February and March newsletters. http://www.plnv.com Murray DismanArticle: 10504
I wonder if anyone has used the Altera MaxPlus software with a third-party programmer. I programmed an EPC1213LC20 using the Altera programmer and was unable to verify it using my ALLMAX+ programmer. The .POF file used to program the device is made from two separate .SOF files, one set as active, the other passive. It does not seem to matter if I use the .HEX or .RBF formats. Both files will generate the warning that the selected configuration has disabled the start-up time-out device option, which is no good. What I ended up doing was looking at the raw .POF file produced, stripping the header information, and storing the file back out as a raw binary image. In the file I created, the data appears to start at an offset of 0xA3. When I do this, the file may be read into the ALLMAX+ and verified against the original EPROM. I don't believe I should need to use this conversion to get the correct image. Has anyone had this kind of problem, or done this kind of test? I have a call in to Altera also and am waiting on a response from them. Also notice that when selecting the .HEX and .RBF formats, the menu states that this is for a single device.Article: 10505
Torben AEgidius Mogensen wrote:
[snip]
> I'm not sure if the following is the same as Klering's code, but the
> approach sounds similar. The code below is based on code I got from
> David Seal and then optimized slightly (removing two binary
> operations). Initially, the variable "middle" contains a bitvector of
> a number of cells. "up" and "down" contain the rows above and below
> the row in question. "left" and "right" contain the same as middle,
> except that they are shifted one bit left or right (with the
> appropriate neighbouring bits shifted in). Similarly for "upleft" etc.
> At the end, "newmiddle" contains the new values for the row
> corresponding to "middle".
>
> ones = up ^ upleft;
> twos = up & upleft;
> carry = ones & upright;
> ones = ones ^ upright;
> twos = twos ^ carry;
>
> ones1 = down ^ downleft;
> twos1 = down & downleft;
> carry = ones1 & downright;
> ones1 = ones1 ^ downright;
> twos1 = twos1 ^ carry;
>
> carry = ones & ones1;
> ones = ones ^ ones1;
> fours = twos & twos1;
> twos = twos ^ twos1;
> carry1 = twos & carry;
> twos = twos ^ carry;
> fours = fours ^ carry1; /* could be | */
>
> carry = ones & left;
> ones = ones ^ left;
> carry1 = ones & right;
> ones = ones ^ right;
> carry = carry | carry1;
> twos = twos ^ carry;
>
> ones = ones | middle;
> newmiddle = ones & twos & ~fours;
>
> If we assume that ^, &, | and &~ are available as single-cycle

All of these except &~ (AND-NOT) are available on all CPUs I know of, and on those which miss out, you can of course synthesize it in two cycles.

> operations this takes 26 cycles to complete, some of which can be done
> in parallel on a superscalar machine. If we add in slightly over a

Actually, it should be quite easy to get close to 2 IPC, because there are a lot of independent operations all the way to the end.

> dozen cycles for computing "upleft" etc., we get 40 cycles per
> word-sized bitvector.
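The "slightly over a dozen cycles for computing upleft etc." are the shifts that pull each word's edge bits in from the adjacent words. One possible convention is sketched below (the bit ordering, with bit 0 as the leftmost cell of a word, is our assumption; the posts don't fix one):

```c
#include <stdint.h>

/* Build the "left" vector for word w: bit k becomes the left
 * neighbour of cell k.  The leftmost cell's neighbour is the last
 * bit of the word to its left. */
uint32_t left_of(uint32_t w, uint32_t word_to_the_left)
{
    return (w << 1) | (word_to_the_left >> 31);
}

/* Build the "right" vector: bit k becomes the right neighbour of
 * cell k, with the first bit of the next word shifted in. */
uint32_t right_of(uint32_t w, uint32_t word_to_the_right)
{
    return (w >> 1) | (word_to_the_right << 31);
}
```

"upleft", "upright" and so on are formed the same way from the rows above and below, which is where the extra dozen-odd operations per word come from.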
> If we make sure only to read each memory word
> once, we should be able to update a wordlength of cells in 40 cycles
> plus the time it takes to read and write a word. If the read/write is
> to non-cached memory (as it will be if it goes to memory mapped display
> memory), a read will take a full memory cycle (though you can use
> burst access) while the store can be through a write buffer and hence

As you've just discovered, to get max speed you must maintain a back buffer in RAM, and then write updated blocks to the display. It is also critical to have a one bit/pixel display mode, because otherwise you'll be totally limited by write bandwidth. I.e. working in 32-bit true color will increase the size of a full screen buffer from 64K to 2MB. The 120 MB/sec required write speed (for 60 fps) will definitely overload a PCI bus, which has a (very) theoretical max speed of 133 MB/sec on a long burst.

> take only one CPU cycle (which can be scheduled in parallel with other
> operations). On a non-superscalar CPU (as e.g. StrongARM) with 200MHz
> CPU and 33MHz memory, this works out to about 4 million words per
> second, or 128 million cells/s. With a 800x600 display, we get more
> than 250 frames per second. With a superscalar CPU and a non-blocking
> cache, this can be improved considerably.

Anyway, this is the important point: a general CPU is more than fast enough to handle this problem at full frame rate, as long as the code is properly optimized. When you've optimized the code, you'll discover that the problem really is memory bandwidth and nothing else. On the regular VGA cards we had to target, writing a single pixel was so slow that it was critical to minimize the number of writes to just those pixels that actually changed. In my code I stored 4 cells plus the neighborhood counts in a 16-bit word, and then used a 64K lookup table to convert the current value to the new state.
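Terje's post doesn't give his actual 16-bit packing, so the layout below is a guess; it only illustrates how a 64K table can fold the whole rule for four cells into a single lookup. Each nibble is assumed to hold one state bit plus a 3-bit saturating neighbour count (a count of 8 can be saturated to 7, since the cell dies either way):

```c
#include <stdint.h>

/* 4 result bits per 16-bit packed word: bit i of the entry is the
 * new state of cell i.  Hypothetical nibble layout: bit 3 = current
 * state, bits 0-2 = saturated neighbour count. */
static uint8_t next_state[1u << 16];

void build_life_table(void)
{
    for (uint32_t w = 0; w < (1u << 16); w++) {
        uint8_t out = 0;
        for (int i = 0; i < 4; i++) {
            uint32_t nib   = (w >> (4 * i)) & 0xFu;
            uint32_t alive = nib >> 3;
            uint32_t count = nib & 7u;
            if (count == 3 || (alive && count == 2))
                out |= (uint8_t)(1u << i);
        }
        next_state[w] = out;  /* one lookup now updates 4 cells */
    }
}
```

The 64K table is built once; thereafter the inner loop has no per-cell rule evaluation at all, just the count maintenance and one table reference per 4 cells.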
If any of the 4 cells changed state, then I would write the updated pixel(s) to display memory, and then look up a pair of 32-bit increments/decrements: the values needed to update the status of the current line, and the lines above/below. My program actually used fewer instructions/iteration than both Stafford's and Klering's entries, but they both blew me away by keeping their working sets small enough to fit (mostly) in the 8K L1 cache! Some time after this I formulated my .sig. :-)

Terje
--
- <Terje.Mathisen@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"Article: 10506
Tim Olson wrote >This algorithm looks like the one described in the Smalltalk "blue book", >where a version of Life was implemented using BitBlt operations to >implement the cell counting in parallel. Another reference to the bitwise parallel approach is "Life Algorithms", Mark Niemiec, Byte, Jan. 1979, pp 70-79. If I recall correctly, Mark, David Buckingham, and friends, used Waterloo's Honeywell 66/60's EIS "move mountain" instructions to animate 64K 36-bit words per iteration. Inspired by Buckingham and the Blue Book, I wrote a bitblt version that did 800,000 cells in 34? bitblts on a Perq in 1983? and one that did 400,000 cells/s on an 8 MHz (1 M 2-operand 32-bit op/s) 68000 in 1985. As Messrs. Mathisen and Mogensen describe, Life should run very fast on modern processors (superscalar and multimedia enhanced and large caches). 64-bits, in 40 insns, in perhaps 15-20 clocks, at 3 ns/clock, e.g. 1 bit/ns. FPGA Implementation: It is straightforward to run at full memory bandwidth. For example, given an XCS20 and a 32Kx32 PBSRAM (32-bits in or out per 15 ns clock) we can approach 32 bits/(2*15) ns, e.g. 1 bit/ns. Since a given line is read three times (as "below", "current", and "above"), we buffer 2 lines of cells in RAM in the FPGA. A 1024 x n playfield requires 2 x 1024 bits = 64 CLBs of single port RAM, and preferably 3 x 1024 bits for 3 banks since each clock you must read from up to two lines and write to a third. Detailed design/floor plan. One bit requires approx. 9 CLBs. 
Assuming the cell neighbours are (a,b,c,d,e,f,g,h), we need:

3 CLBs RAM  -- 3 32x1 RAMs (3 banks of line buffer)
6 CLBs logic --
 1    s0a=a^b^c^d; s0e=e^f^g^h
 2    s1a="a+b+c+d == 2 or 3"; s1e="e+f+g+h == 2 or 3"
 3    s2a=a&b&c&d; s2e=e&f&g&h
 4,5  (s3,s2,s1,s0)=(s2a,s1a,s0a) + (s2e,s1e,s0e) (uses dedicated carry logic)
 6    new = ~s3&~s2&s1&(s0|old)

and so in a 20x20 CLB XCS20, we explicitly place 16 rows of 1x9 CLB tiles in the left half, another 16 in the right half, leaving plenty of room to spare for control and address generation.

At the 1997 FPGAs for Custom Computing Machines conference, the paper "The RAW Benchmark Suite" by Babb et al. proposed a set of benchmarks for comparing reconfigurable computing systems. One of the 12 benchmarks was Life, for which they reported speedups of several hundred times over a SparcStation20+software approach, but in fairness, they write "we are not currently implementing known improvements to the software to take advantage of the bit-level parallelism available in a microprocessor".

Summary. Hypothetically...
Fast microprocessor + cache:        ~1 bit/ns
Single FPGA + SRAM custom machine:  ~1 bit/ns

Jan GrayArticle: 10507
> ones = up ^ upleft;
> twos = up & upleft;
> carry = ones & upright;
> . . .

You can beat this (in terms of number of logical operations and shifts) by quite a bit. Here, `g' is the original, and only, input:

sl3=(sl2=(a=left(g))^(b=right(g)))^g
sh3=(sh2=a&b)|(sl2&g)
sll=(a=up(sl3)^(b=down(sl3)))^sl2
slh=(a|(b^sl2))^sll
a=up(sh3)^(b=down(sh3))
g=(a^sh2^slh)&((a|(b^sh2))^slh)&(sll|g)

I believe that's 19 logical operations, one left, one right, two ups and two downs (this assumes the ups and downs are cheap).

> Actually, it should be quite easy to get close to 2 IPC, because there's
> a lot of independent operations all the way to the end.

You bet!

> As you've just discovered, to get max speed you must maintain a back
> buffer in RAM, and then write updated blocks to the display.

Yep, and you need to block the algorithm appropriately so it fits in cache. This is pretty easy to do; just do the above algorithm in appropriately sized strips. Further, it's pretty easy to block out (not process) areas that are static or oscillating with period 2 (which are terribly common in Life); I generally use two alternating buffers and keep a `superbitmap' of those chunks that are changing with period >2.

> The 120 MB/sec required write speed (for 60 fps) will definitely
> overload a PCI bus, which has a (very) theoretical max speed of 133
> MB/sec on a long burst.

Which is why you do the delta. Indeed, what I did is `stupider' than that. There's no sense updating the display at greater than the frame rate, but it's easy to calculate at greater than the frame rate. So I don't update on every generation, just on every frame. And then I only update the deltas, which are often quite small compared to the real data.

> When you've optimized the code, then you'll discover that the problem
> really is memory bandwidth and nothing else.

I'm not so sure about this; it's pretty easy to make the loads/stores overlap pretty well.
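Tom's sequence above leans on embedded assignments: in the third line, `a=up(sl3)^(b=down(sl3))` stores down(sl3) in b and the XOR of the two row shifts in a. Unpacked into plain C over a toy grid of 32-bit rows with zero boundaries (the grid representation and helper names are our assumptions, not Tom's), it reads:

```c
#include <stdint.h>

#define ROWS 8  /* toy grid: ROWS rows of 32 cells, zero boundary */

/* row shifts for up()/down(); zero beyond the top and bottom edges */
static uint32_t up_row(const uint32_t *p, int i)   { return i > 0        ? p[i - 1] : 0; }
static uint32_t down_row(const uint32_t *p, int i) { return i < ROWS - 1 ? p[i + 1] : 0; }

void life_step(const uint32_t g[ROWS], uint32_t out[ROWS])
{
    uint32_t sl2[ROWS], sl3[ROWS], sh2[ROWS], sh3[ROWS];

    for (int i = 0; i < ROWS; i++) {
        uint32_t a = g[i] << 1;             /* left(g)  */
        uint32_t b = g[i] >> 1;             /* right(g) */
        sl2[i] = a ^ b;                     /* ones bit of left+right        */
        sl3[i] = sl2[i] ^ g[i];             /* ones bit of left+centre+right */
        sh2[i] = a & b;                     /* twos bit of left+right        */
        sh3[i] = sh2[i] | (sl2[i] & g[i]);  /* twos bit of left+centre+right */
    }
    for (int i = 0; i < ROWS; i++) {
        uint32_t a, b, sll, slh;
        a = up_row(sl3, i) ^ (b = down_row(sl3, i));
        sll = a ^ sl2[i];                   /* ones place of neighbour count */
        slh = (a | (b ^ sl2[i])) ^ sll;     /* carry into the twos place     */
        a = up_row(sh3, i) ^ (b = down_row(sh3, i));
        out[i] = (a ^ sh2[i] ^ slh)
               & ((a | (b ^ sh2[i])) ^ slh)
               & (sll | g[i]);
    }
}
```

A quick sanity check: a horizontal blinker in row 3 flips to a vertical blinker, i.e. the centre bit appears in rows 2, 3 and 4 of the output.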
Of course, I did it on the 68000 where there are enough registers; I'm not sure about the x86 world. The above code was completely designed by me, although I'm sure others have found a similar solution. (I actually implemented the above on an HP calculator in user-RPL, and on the Amiga, both with the blitter and in assembler. I keep meaning to get around to speeding up xlife but never can seem to find the time.) Here's 48G code for anyone who cares; just put a GROB on the stack and hit `GEN':

GEN  << WHILE 1 REPEAT DUP ->LCD GEN1 END >>
GEN1 << {#0 #1} DUP2 SH OVER LX ROT REVLIST SWAP OVER SH 5 ROLLD
        4 ROLLD SH 4 PICK LX ROT 3 PICK + NEG + 4 ROLLD NEG + LX + NEG >>
SH   << DUP2 OVER DUP ROT {#FFFh #FFFh} SUB LX 3 DUPN 7 ROLLD GXOR
        5 ROLLD GXOR + >>
LX   << {#0 #0} SWAP GXOR >>

-tomArticle: 10508
Nice to see one's name in the press! The story below is basically correct but misses out a couple of vital details: 1. Algotronix web address: www.algotronix.com 2. Algotronix phone number: (408) 480 5707 Tom Kean mtmason@ix.netcom.com wrote: > Taken from EETimes > > SAN JOSE, Calif. — Xilinx Inc. has stopped development work on its XC6200 > line of partially reconfigurable field-programmable gate arrays (FPGAs), and > the founders of the company's reconfigurable R&D group in Edinburgh, > Scotland, John Gray and Tom Kean, have both left the company. The remaining > engineering staff at Edinburgh has been reassigned to develop IP cores for > use by Xilinx's customers within the company's FPGAs. > > However, Xilinx says it remains committed to the partial reconfigurability > offered by the XC6200 devices and will offer many of the features of the > XC6200 in its next-generation FPGA family, known as Virtex. > > "John Gray is still working for Xilinx as a consultant," said Roland > Triffaux, manager of Xilinx Europe. "Tom Kean and another two engineers have > left to start a company in California." > > That spin-off company, called Quicksilver, is believed to have some backing > from Xilinx. Quicksilver is looking to apply reconfigurable logic to > multiprotocol handsets for mobile communications, sources said. It is also > believed to be working with systems companies on the application of > reconfigurable logic. > > Gray said his departure from the company was amicable. "I am just looking to > do something that's more fun again," he said. "It's time to kick back a bit." > > Meanwhile, Kean is reacquiring the name Algotronix from Xilinx. Kean and Gray > led Algotronix Ltd. in the early 1990s before it was acquired by Xilinx in > 1993, when it became the basis of the reconfigurable R&D group, and was named > Xilinx Development Corp. 
> > Algotronix developed a reconfigurable FPGA architecture known as CAL > (configurable array logic), which eventually became the XC6200. Kean said the > new Algotronix would act as a consultancy and would advise users on the > application of reconfigurable logic. > > Xilinx said the work of the R&D group was largely completed. "The goal of the > reconfigurable group in Edinburgh has been achieved," Triffaux said. XC6200 > devices would continue to be available for academic and commercial research > groups, as they have been in the past. "We never really sold it," Triffaux > said. > > Peter Cheung, a researcher in the department of electrical and electronic > engineering at Imperial College of Science and Technology (London), has used > XC6200 devices for reconfigurable hardware platforms. "We've heard they are > not developing the XC6200," Cheung said. "Unless there is a real commitment > to it we may have to look at other things. The tools for 6200 are primitive > and not well done. > > "In many ways it [the XC6200] was a product ahead of its time," he said. "It > was a beautifully conceived device but not sufficiently well supported." > > -----== Posted via Deja News, The Leader in Internet Discussion ==----- > http://www.dejanews.com/ Now offering spam-free web-based newsreadingArticle: 10509
I have seen problems with the CCLK in master mode, but it sounds like you are using the slave mode. Still could be a timing issue, try adding some series termination... say a 33 ohm resistor on the cclk. Alexander Sherstuk wrote: > > > Hi All, > > I encountered unexpected difficulty, when loading XILINX XC4005E > configuration from ATMEL AT89C52 (in serial slave mode). > I connected P1.5 pin to configuration clock CCLK pin of XILINX chip, > > and connected P1.0 pin to DIN pin of XILINX chip. > XILINX configuration is loaded, but not with 100% probability - > sometimes (1 attempt of 5) it fails. > It looks like the problem is with 8051 signals rise time. > When I fed CCLK through 74HC14, everything works fine. > Maybe, somebody knows more about this problem. > How to avoid it? > > Thanks, > Alex Sherstuk > Sherstuk@amsd.comArticle: 10510
Some time ago, I was faced with a similar question (in my case I was using Xilinx parts instead of Alteras). As far as I understood from the Altera tool, it is OK if you have not yet made your PCB; that is what these tools are good for. However, I don't know of any way to provide a tool such as MaxPlus2 or Xilinx M1.x with such a connection list. Since you are prototyping, chances are that connections will change. What we finally did was make a board, define the connection list (with inclusion of some spares), make a separate component for each module or set of modules, and route these components as stand-alone FPGAs. -- Koenraad SCHELFHOUT Switching Systems Division http://www.alcatel.com/ Microelectronics Department - VA21 _______________ ________________________________________\ /-___ \ / / Phone : (32/3) 240 89 93 \ ALCATEL / / Fax : (32/3) 240 99 88 \ / / mailto:ksch@sh.bel.alcatel.be \ / / _____________________________________________\ / /______ \ / / Francis Wellesplein, 1 v\/ B-2018 Antwerpen BelgiumArticle: 10511
Terje Mathisen wrote:
> Anyway, this is the important point: A general cpu is more than fast
> enough to handle this problem at full frame rate, as long as the code is
> properly optimized.

I've only just seen this discussion, but I'm the guy who wrote the demonstration at Oxford. I should emphasise that I wasn't trying to show the power of FPGAs particularly, more just trying to make an attractive demo, so getting it running at the full frame rate was my sole aim also. It is trivial to compute more than one cell per cycle, but as you rightly point out the problem will quickly become one of memory bandwidth. With the system we're using, I could probably get 10 cells per cycle (giving us around 600fps) before that became the bottleneck. I recently extended the program in a different way: I wrote a fairly general cellular automata harness in which each cell has four bits of state, and just about any automaton you want to try out can simply be plugged in. These should all then run at the full frame rate, as you have around four or five levels of pipelining available. I think multi-state automata like this could present a pretty difficult challenge for conventional processors: it really is just one of those things that FPGAs are very good at. Shameless plug: the general automata harness (including memory interfaces, VGA display and serial mouse interface for interacting with the automata) was done in around 700 lines of Handel-C code. If anybody wants a copy, I'll gladly send it out (send mail to mpa@comlab.ox.ac.uk). The Hardware Compilation Group homepage is at: http://www.comlab.ox.ac.uk/oucl/hwcomp.html

Cheers, Matt
--
Matt Aubury, Oxford University Computing LaboratoryArticle: 10512
Another parallel implementation of Conway's Life is given by Eugene McDonnell in "Life: Nasty, Brutish, and Short", in the ACM SIGAPL APL88 Conference Proceedings. Eugene evolves a number of algorithms in dialects of APL, ending up with a 9-token expression for one iteration.

Bob

Jan Gray wrote:
> >where a version of Life was implemented using BitBlt operations to
> >implement the cell counting in parallel.
> Another reference to the bitwise parallel approach is "LifeArticle: 10513
Robert Bernecky wrote: > > Another parallel implementation of Conway's Life is given > by Eugene McDonnell in "Life: Nasty, Brutish, and Short", > in ACM SIGAPL APL88 Conference Proceedings. Eugene evolves > a number of algorithms in dialects of APL, ending up with > a 9-token expression for one iteration. > The most parallel implementation I've ever seen is "Life in the Stencil Buffer" on page 407 of the OpenGL Programming Guide. -- Regards, | No sense being pessimistic -- Ian Ameline, | It wouldn't work anyway. Senior Software Engineer, | Alias/Wavefront |Article: 10514
Matt Aubury wrote:
>
> Terje Mathisen wrote:
> > Anyway, this is the important point: A general cpu is more than fast
> > enough to handle this problem at full frame rate, as long as the code is
> > properly optimized.
>
> I've only just seen this discussion, but I'm the guy who wrote the
> demonstration at Oxford. I should emphasise that I wasn't trying to
> show the power of FPGAs particularly, more just trying to make an
> attractive demo, so getting it running at the full frame rate was my
> sole aim also. It is trivial to compute more than one cell per cycle,
> but as you rightly point out the problem will quickly become one of
> memory bandwidth. With the system we're using, I could probably get 10
> cells per cycle (giving us around 600fps) before that became the
> bottleneck.

Nice! :-)

> I recently extended the program in a different way: I wrote a fairly
> general cellular automata harness in which each cell has four bits of
> state, and just about any automata you want to try out can simply be
> plugged in. These should all then run at the full frame rate, as you
> have around four or five levels of pipelining available. I think
> multi-state automata like this could present a pretty difficult
> challenge for conventional processors: it really is just one of those
> things that FPGAs are very good at.

This could be solved with runtime code generation as well, compiling an optimized set of binary logic ops on the fly, or (much simpler), by embedding the cell rules in lookup tables. This is actually one of my favourite ways to solve many kinds of problems: generating one or more tables at runtime which implement all the required logic. This is basically a state machine, which will almost always run at whatever speed the tables can support the (nested) lookups. My single favourite program is a 16-bit version of Word Count, which handles user-specified word and line separators, i.e. it can handle both CR and LF by themselves, as well as the CRLF combination.
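The nested-lookup state machine Terje describes can be shown in miniature. The sketch below is a much-simplified, single-byte-at-a-time relative of his program (the table layout is invented here, not his 16-bit pair-classifying version): classify each byte through a table, then update the word count with pure arithmetic, no per-character compare-and-branch:

```c
#include <stddef.h>

/* 1 if the byte can be part of a word, 0 if it is a separator;
 * the separator set here is an illustrative choice */
static unsigned char is_word_char[256];

void init_classes(void)
{
    for (int c = 0; c < 256; c++)
        is_word_char[c] = !(c == ' ' || c == '\t' ||
                            c == '\n' || c == '\r');
    is_word_char[0] = 0;  /* treat NUL as a separator too */
}

size_t count_words(const char *s, size_t n)
{
    size_t words = 0;
    unsigned char in_word = 0;  /* 0 or 1 */
    for (size_t i = 0; i < n; i++) {
        unsigned char c = is_word_char[(unsigned char)s[i]];
        /* a word starts where a word char follows a separator */
        words += (size_t)(c & (in_word ^ 1u));
        in_word = c;
    }
    return words;
}
```

Because CR and LF are both plain separators in the table, CR, LF and CRLF line endings all fall out of the same lookup with no special cases, which is the point of the technique.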
This program processes 256 chars in the inner loop, with zero compare/test/branch operations. It needs just 1.5 instructions/byte, so it is probably faster than any kind of disk or even main memory! :-)

> Hardware Compilation Group homepage is at:
>
> http://www.comlab.ox.ac.uk/oucl/hwcomp.html

Interesting, although it seems like some of the sample applications didn't run too well, i.e. the real-time video image warper is a much simpler application than a sw MPEG-2 decoder, and it still ran at just 9 fps. Is this due to poorly optimized (Handel-C) source code, or just that the task isn't very well suited to an FPGA implementation?

Terje
--
- <Terje.Mathisen@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"Article: 10515
Terje Mathisen wrote: > Matt Aubury wrote: > > I recently extended the program in a different way: I wrote a fairly > > general cellular automata harness in which each cell has four bits of > > state, and just about any automata you want to try out can simply be > > plugged in. These should all then run at the full frame rate, as you > > have around four or five levels of pipelining available. I think > > multistate automatas like this could present a pretty difficult > > challenge for conventional processors: it really is just one of those > > things that FPGAs are very good at. > > This could be solved with runtime code generation as well, compiling an > optimized set of binary logic ops on the fly, or (much simpler), by > embedding the cell rules in lookup tables. I think you're going to have a problem there: the total state coming into the lookup table is going to be the eight neighbours plus the central cell, each with four bits of state, so that's 36 bits of input data to 4 bits of output. A 32 GB lookup table might be a touch cumbersome! :-) Runtime code generation might well work; but it isn't exactly easy. I wonder how well a partial evaluator, like TEMPO (http://www.irisa.fr/compose/tempo/), would work on a problem like this... > > Hardware Compilation Group homepage is at: > > http://www.comlab.ox.ac.uk/oucl/hwcomp.html > > Interesting, although it seems like some of the sample applications didn't > run too well, i.e. the real-time video image warper is a much simpler > application than a sw MPEG-2 decoder, and it still ran at just 9 fps. Ack! > Is this due to poorly optimized (Handel-C) source code, or just that > the task isn't very well suited to an FPGA implementation? In the case of that particular version the problem was with the host board and its interface. Since that demo was created, I've written my own version which runs happily at 60fps (although that's running on a static image; it would be fairly trivial to extend it to video). 
We have a realtime warp of Bill Gates' face which entertains quite a few visitors! Cheers, Matt -- Matt Aubury, Oxford University Computing LaboratoryArticle: 10516
Ramon <rco00003@teleline.es> wrote:
>Hi, I am a student from Barcelona. I am doing my final career project,
>designing with VHDL. Now I need to compare my synthesis with other
>syntheses.
>
>What kind of parameters can I compare?
>Where could I find some comparisons? (articles, books, ...)

Ramon, there's a whole world of parameters you can compare synthesis tools on! Off the top of my head I can think of:

- (For ASICs) highest speed, least gates, net-to-gate ratios
- (For FPGAs) highest speed, least CLBs, routability
- FPGA to ASIC conversion (translation abilities)
- Various levels of VHDL (or Verilog, for that matter) support for synthesis
- Design For Testability issues as related to various synthesis tools
- Low power design synthesis (can this tool do it & how well?)
- Module block sizes, wire load models, hierarchical synthesis
- Portability of a synthesis tool's input or output with other EDA tools

And, of course, price, support, etc.

- John Cooley Part Time EDA Consumer Advocate Full Time ASIC, FPGA & EDA Design Consultant ============================================================================ Trapped trying to figure out a Synopsys bug? Want to hear how 6000+ other users dealt with it ? Then join the E-Mail Synopsys Users Group (ESNUG)! !!! "It's not a BUG, jcooley@world.std.com /o o\ / it's a FEATURE!" (508) 429-4357 ( > ) \ - / - John Cooley, EDA & ASIC Design Consultant in Synopsys, _] [_ Verilog, VHDL and numerous Design Methodologies. Holliston Poor Farm, P.O. Box 6222, Holliston, MA 01746-6222 Legal Disclaimer: "As always, anything said here is only opinion."Article: 10517
mpa@comlab.ox.ac.uk (Matt Aubury) writes: > Terje Mathisen wrote: > > This could be solved with runtime code generation as well, compiling an > > optimized set of binary logic ops on the fly, or (much simpler), by > > embedding the cell rules in lookup tables. > > I think you're going to have a problem there: the total state coming > into the lookup table is going to be the eight neighbours plus the > central cell, each with four bits of state, so thats 36 bits of input > data to 4 bits of output. A 32 GB lookup table might be a touch > cumbersome! :-) While that's true, I'm sure your hardware solution isn't using a full width sum-of-products implementation either. Anywhere you use cascaded logic, a software implementation can use the exact same cascaded logic or cascaded table lookups. -- Bruce -- 'We have no intention of shipping another bloated operating system and forcing that down the throats of our Windows customers' -- Paul Maritz, Microsoft Group Vice PresidentArticle: 10518
Hi, Sorry, I don't know how to help you with ABEL, but there are several ways to use HDL and accomplish what you wish. Using VHDL (or Verilog, but I'm just using VHDL), you can use scripts with either the Synopsys or Synplicity packages. Alternatively, if you can convert your netlist into a Viewlogic wir format, we have written software and macros to perform the substitution, or you can use Viewgen to create a schematic and replace the macros (we have made C-Module equivalents for each of the S-Module macros) from our custom library. Lastly, Actel has software (in beta right now) that incorporates flip-flop control into both Actgen and Actmap - you can select either C-Module or TMR implementations. Also, similarly to what we did, they offer library symbols in their database with hardened equivalents of all of their flip-flop macros. Please email me or see our www site (http://rk.gsfc.nasa.gov) for some more notes on this topic. Hope this helps, rk _____________________________________________________________________ Jules wrote: Hi, Could anyone tell me how to cause ABEL code to synthesize to specific Actel macros? I need to do this to avoid using sequential flip-flops, to better protect the system from single-event upset in a high-radiation Earth orbit. Are there any Actel directives in Synario ABEL that can force this? Thanks for your help regards Jules ClusterII team Space Physics group Imperial College LondonArticle: 10519
Dear all, I will be in the States (California) during June and July '98, and would like to know if you know of any training on VHDL (more of a practical workshop) conducted by anybody. Thanks, vananArticle: 10520
Bruce Hoult wrote: > > mpa@comlab.ox.ac.uk (Matt Aubury) writes: > > Terje Mathisen wrote: > > > This could be solved with runtime code generation as well, compiling an > > > optimized set of binary logic ops on the fly, or (much simpler), by > > > embedding the cell rules in lookup tables. > > > > I think you're going to have a problem there: the total state coming > > into the lookup table is going to be the eight neighbours plus the > > central cell, each with four bits of state, so thats 36 bits of input > > data to 4 bits of output. A 32 GB lookup table might be a touch > > cumbersome! :-) Actually, it isn't quite so bad: Since the output is just 4 bits, I'd pack two of them into a single byte, so my table would be "only" 16GB. :-) > While that's true, I'm sure your hardware solution isn't using a > full width sum-of-products implementation either. Anywhere you use > cascaded logic, a software implementation can use the exact same > cascaded logic or cascaded table lookups. That is of course the way to implement it. I.e. the word counter I mentioned uses one table to classify pairs of input chars, combines this 4-bit value with the result from the previous pair, and then uses another table to lookup the corresponding line/word increments which gets added to the running (block) total. Terje -- - <Terje.Mathisen@hda.hydro.com> Using self-discipline, see http://www.eiffel.com/discipline "almost all programming can be viewed as an exercise in caching"Article: 10521
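Bruce's and Terje's point about cascaded lookups can be made concrete with a minimal sketch (my own illustration, not Klering's or Seal's code) for ordinary two-state Life, where the full neighbourhood is only 9 bits and a single 512-entry rule table suffices. With four bits of state per cell the same index would be 36 bits wide - the impractically large table Matt describes - which is exactly when you would decompose the lookup into a cascade of small tables instead.

```c
#include <stdint.h>

/* Rule table indexed by the 9-bit neighbourhood, centre cell in bit 4.
 * For two-state Life this is only 2^9 = 512 entries. */
static uint8_t rule[512];

static void build_rule_table(void)
{
    for (int idx = 0; idx < 512; idx++) {
        int count = 0;
        for (int bit = 0; bit < 9; bit++)
            if (bit != 4 && ((idx >> bit) & 1))
                count++;                      /* neighbours only */
        int alive = (idx >> 4) & 1;
        /* Alive next iteration = (count == 3) OR (alive AND count == 2) */
        rule[idx] = (count == 3) || (alive && count == 2);
    }
}

/* One generation on a small w x h grid; cells outside the border
 * are treated as dead. One table lookup per cell. */
static void life_step(const uint8_t *src, uint8_t *dst, int w, int h)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int idx = 0, bit = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++, bit++) {
                    int nx = x + dx, ny = y + dy;
                    if (nx >= 0 && nx < w && ny >= 0 && ny < h)
                        idx |= src[ny * w + nx] << bit;
                }
            dst[y * w + x] = rule[idx];
        }
}
```

This per-cell version is of course far slower than the bitvector code in the thread; the point is only the table-driven structure, which generalises to wider lookups (e.g. one table per row-triple of several cells, combined by a second table) when the state per cell grows.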
There's been an update! See what's new on The Programmable Logic Jump Station at http://www.optimagic.com/ http://www.optimagic.com/whatsnew.html The Programmable Logic Jump Station is a comprehensive set of links to nearly all matters related to programmable logic. Featuring: --------- --- Frequently-Asked Questions (FAQ) --- Programmable Logic FAQ - http://www.optimagic.com/faq.html A great resource for designers new to programmable logic. --- FPGAs, CPLDs, FPICs, etc. --- Recent Developments - http://www.optimagic.com Find out the latest news about programmable logic. Device Vendors - http://www.optimagic.com/companies.html FPGA, CPLD, SPLD, and FPIC manufacturers. Device Summary - http://www.optimagic.com/summary.html Who makes what and where to find out more. Market Statistics - http://www.optimagic.com/market.html Total high-density programmable logic sales and market share. --- Development Software --- Free and Low-Cost Software - http://www.optimagic.com/lowcost.html Free, downloadable demos and evaluation versions from all the major suppliers. Design Software - http://www.optimagic.com/software.html Find the right tool for building your programmable logic design. Synthesis Tutorials - http://www.optimagic.com/tutorials.html How to use VHDL or Verilog. --- Related Topics --- FPGA Boards - http://www.optimagic.com/boards.html See the latest FPGA boards and reconfigurable computers. Design Consultants - http://www.optimagic.com/consultants.html Find a programmable logic expert in your area of the world. Research Groups - http://www.optimagic.com/research.html The latest developments from universities, industry, and government R&D facilities covering FPGA and CPLD devices, applications, and reconfigurable computing. News Groups - http://www.optimagic.com/newsgroups.html Information on useful newsgroups. Related Conferences - http://www.optimagic.com/conferences.html Conferences and seminars on programmable logic. 
Information Search - http://www.optimagic.com/search.html Pre-built queries for popular search engines plus other information resources. The Programmable Logic Bookstore - http://www.optimagic.com/books.html Books on programmable logic, VHDL, and Verilog. Most can be ordered on-line, in association with Amazon.com . . . and much, much more. Bookmark it today!Article: 10522
Ian_Ameline wrote: > > Robert Bernecky wrote: > > > > Another parallel implementation of Conway's Life is given > > by Eugene McDonnell in "Life: Nasty, Brutish, and Short", > > in ACM SIGAPL APL88 Conference Proceedings. Eugene evolves > > a number of algorithms in dialects of APL, ending up with > > a 9-token expression for one iteration. > > > > The most parallel implementation I've ever seen is "Life in the Stencil > Buffer" on page 407 of the OpenGL Programming Guide. > > -- > Regards, | No sense being pessimistic -- > Ian Ameline, | It wouldn't work anyway. > Senior Software Engineer, | > Alias/Wavefront | Since we are reminiscing, I recall Life on CLIP4 in 1984. CLIP4 was a 96x96 SIMD array processor built at University College London. The processors ran at about 1MHz. We had to slow the code by about 500x to see the display :-) The code, I suspect, was written by Paul Otto and David Renolds (I did not join the group until 1984). Regards, Zahid -- Zahid Hussain, BSc (Hons), PhD (Lond.) E-mail: zhus@daldd.sc.ti.com 3D Graphics Software Architect Tel: (972) 480-2864 Texas Instruments Inc. Fax: (972) 480-6303 8505 Forest Lane, Dallas, TX, USA MS: 8724, MSGID: ZHUSArticle: 10523
I'm using Altera products for the first time and there are some pins that remain quite mysterious : clkusr cs /cs dev_clr dev_oe init_done (well, I suppose this one is quite obvious :-) rdy_/bsy /rs /ws Some of them must be used for parallel programming, but I didn't find any info about it in the data book. I also wonder what is the "user mode" (and what are the other modes)... thanks Nicolas MATRINGE DotCom SA Développement électronique 16 rue du Moulin des Bruyères Tel: 00 33 1 46 67 51 00 92400 COURBEVOIE Fax: 00 33 1 46 67 51 01 FRANCEArticle: 10524