Messages from 143350

Article: 143350
Subject: Re: Implement ARM cores on a FPGA chip?
From: Mike Treseler <mtreseler@gmail.com>
Date: Sun, 04 Oct 2009 10:48:45 -0700

Nico Coesel wrote:

> Why does verification take a rack of servers?

It doesn't.
That would be easy to arrange for me,
but probably not for Lucien.
Python might be a more practical alternative.

My advice to Lucien is to verify that the parallel algorithm
actually works and has the expected advantages before even thinking 
about the target hardware.

    -- Mike Treseler

Article: 143351
Subject: Re: Implement ARM cores on a FPGA chip?
From: LucienZ <lucien.zhang@gmail.com>
Date: Sun, 4 Oct 2009 12:58:34 -0700 (PDT)

On Oct 4, 7:24 pm, "Antti.Luk...@googlemail.com"
<antti.luk...@googlemail.com> wrote:
> On Oct 4, 7:55 pm, LucienZ <lucien.zh...@gmail.com> wrote:
>
> > > cortex m3
> > > is easy obtainable a license for 1000 instances cost 2500 $
>
> > > this IS common knowledge i assumed you know this
>
> > > Antti
>
> > Is that Cortex-M3 license targeting ASIC fabrication or what?
> > Sorry for not sharing the common knowledge; I am really a dummy here...
>
> not that is for 1000 instances in Altera Cyclone III
> asic licenses are different
>
> Antti

Thanks Antti. My concern is this:
Suppose that I've licensed 1000 'instances', and now I want to use 4
of them for my design.
Does it mean I have to pick up 4 FPGA chips, with exactly one M3 on
each chip?
Or can I deploy 4 instances on a single FPGA chip, using tools like
Quartus?

I see Cyclone solutions based on Cortex-M1, but I think the M3 is for
ASICs. It's just a little bit confusing...

---
To Mike, I will follow your advice and start with a verification and
evaluation step. The idea of going embedded is from my bosses...and
now I am forced to make my brain work in parallel: rewriting the
algorithm as well as looking for an embedded architecture for it :).

Article: 143352
Subject: Re: Implement ARM cores on a FPGA chip?
From: "Antti.Lukats@googlemail.com" <antti.lukats@googlemail.com>
Date: Sun, 4 Oct 2009 13:59:56 -0700 (PDT)

On Oct 4, 10:58 pm, LucienZ <lucien.zh...@gmail.com> wrote:
> On Oct 4, 7:24 pm, "Antti.Luk...@googlemail.com"
> <antti.luk...@googlemail.com> wrote:
> > On Oct 4, 7:55 pm, LucienZ <lucien.zh...@gmail.com> wrote:
>
> > > > cortex m3
> > > > is easy obtainable a license for 1000 instances cost 2500 $
>
> > > > this IS common knowledge i assumed you know this
>
> > > > Antti
>
> > > Is that Cortex-M3 license targeting ASIC fabrication or what?
> > > > Sorry for not sharing the common knowledge; I am really a dummy here...
>
> > not that is for 1000 instances in Altera Cyclone III
> > asic licenses are different
>
> > Antti
>
> Thanks Antti. What I concern is:
> Suppose that I've licensed 1000 'instances', and now I want to use 4
> of them for my design.
> Does it mean I have to pick up 4 FPGA chips, with exactly one M3 on
> each chip?
> Or
> I can deploy 4 instances on only one FPGA chip, using tools like
> Quartus?
>
> I see Cyclone solutions based on Cortex-M1, but I think the M3 is for
> ASICs. It's just a little bit confusing...
>
> ---
> To Mike, I will follow your advice and start with a verification and
> evaluation step. The idea of going embedded is from my bosses...and
> now I am forced to make my brain work in parallel: rewriting the
> algorithm as well as looking for an embedded architecture for it :).

my mistake M1 of course

Antti

Article: 143353
Subject: Post route simulation and real implementation
From: Giuseppe Marullo <giuseppe.marullonospam@iname.com>
Date: Mon, 05 Oct 2009 02:06:26 +0200

Hi all,
after some simulations I finally connected a Logic Analyzer, and
discovered that the results match the simulation, except for some very
short spikes on the SPI bus.

The clock is 12 MHz; the SPI clock is about 6 MHz.

Just wondering whether this is due to the LA sampling (100 MHz) or unshielded
probes, or whether the "circuit" is in fact not simulated accurately
and this crap really is coming out of the Spartan 3E for whatever reason.

Expected:
        ______________
______/              \___

Sampled:
        _   __________
______/ \_/          \___

If I sample at 50MHz, there is no sign of spikes. It seems the LA...any 
idea?

Thanks in advance,

Giuseppe Marullo

Article: 143354
Subject: Re: Post route simulation and real implementation
From: glen herrmannsfeldt <gah@ugcs.caltech.edu>
Date: Mon, 5 Oct 2009 00:51:00 +0000 (UTC)

Giuseppe Marullo <giuseppe.marullonospam@iname.com> wrote:

< after some simulations I finally connected a Logic Analyzer, and 
< discovered that the results are within the simulation, except some very 
< short spikes on the SPI bus.
 
< The clock is 12Mhz, spi clock is about 6MHz.
 
< Just wondering if this is due to the LA sampling (100MHz), not shielded 
< probes or effectively the "circuit" is not simulated accurately, 
< and this crap is going out for whatever reason from the Spartan 3E.

Not knowing at all what your design looks like, a proper synchronous
design won't have such spikes.  That means latching such outputs
that come out of logic likely to spike.  (Most logic with more 
than one path from the previous latch.)  With FPGAs and IOB
FFs that is usually pretty easy to do.

If you do a post-route simulation, you should see such spikes if the logic
design generates them.  Pre-route simulation will normally not show them.
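
A toy timing model makes the point concrete. The Python sketch below is an illustration only: the signal names, the 2 ns inverter delay, and the 83 ns clock period are made-up assumptions, not taken from the design under discussion. It evaluates a AND (NOT a), which is logically always 0, and shows that unequal path delays produce a short spike that a register clocked well after the transition never captures.

```python
# Toy model of a combinational glitch; all delays are hypothetical.
# Time is measured in whole nanoseconds.

def a(t):
    """Input signal that rises 10 ns after a clock edge (assumed)."""
    return 1 if t >= 10 else 0

def not_a_delayed(t, delay_ns=2):
    """Inverted copy of a, arriving delay_ns later (assumed inverter/routing delay)."""
    return 1 - a(t - delay_ns)

def comb_out(t):
    """a AND (NOT a): logically 0, but high while the two paths disagree."""
    return a(t) & not_a_delayed(t)

def sample(signal, period_ns, t_end_ns, phase_ns=0):
    """Sample signal every period_ns from phase_ns up to (not including) t_end_ns."""
    return [signal(t) for t in range(phase_ns, t_end_ns, period_ns)]

# Looking at every nanosecond shows the spike on the raw combinational output.
raw = sample(comb_out, 1, 40)

# A flip-flop on the same clock (about 12 MHz, 83 ns period) samples only at
# clock edges, long after the paths have settled, so it never sees the spike.
registered = sample(comb_out, 83, 400)
```

An asynchronous sampler such as a logic analyzer may or may not land on the spike, depending on its sample rate and phase, which is one reason to register outputs in the IOB flip-flops.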

(snip)

-- glen

Article: 143355
Subject: Re: Post route simulation and real implementation
From: gabor <gabor@alacron.com>
Date: Mon, 5 Oct 2009 06:05:28 -0700 (PDT)

On Oct 4, 8:51 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
> Giuseppe Marullo <giuseppe.marullonos...@iname.com> wrote:
>
> < after some simulations I finally connected a Logic Analyzer, and
> < discovered that the results are within the simulation, except some very
> < short spikes on the SPI bus.
>
> < The clock is 12Mhz, spi clock is about 6MHz.
>
> < Just wondering if this is due to the LA sampling (100MHz), not shielded
> < probes or effectively the "circuit" is not simulated accurately,
> < and this crap is going out for whatever reason from the Spartan 3E.
>
> Not knowing at all what your design looks like, a proper synchronous
> design won't have such spikes.  That means latching such outputs
> that come out of logic likely to spike.  (Most logic with more
> than one path from the previous latch.)  With FPGAs and IOB
> FFs that is usually pretty easy to do.
>
> If you do a post-route simulation, you should see such in the logic
> design generates them.  Pre-route will normally not generate them.
>
> (snip)
>
> -- glen

If you suspect the logic analyzer is the culprit, I would
use an oscilloscope instead.  Since your signal is pretty
slow, it's possible that the rise time is also slow and
that could cause sampling problems on the logic analyzer
as it slowly slews through the logic threshold region.
A typical cause of this might be the use of an open drain
signal with inadequate pullup (for the logic analyzer)
or a bus with a lot of capacitance.  If the SPI device
has hysteresis on the input it might not see the spike.

Regards,
Gabor

Article: 143356
Subject: Multiplier design with carry-save adder + Booth encoding
From: pallav <pallavgupta@gmail.com>
Date: Mon, 5 Oct 2009 10:32:43 -0700 (PDT)

Hi,

For fun, I'm trying to code up a 32x32 multiplier (R = X*Y) using 4
layers of CSA and radix-4 booth encoding. This is not targeting an
FPGA so no using the "*" operator in Verilog. My block diagram
basically has 4 layers of CSA with each layer that can compute 2 bits
of multiplication. Thus, it takes 4 clock cycles to multiply 32x32.

While I understand booth encoding, the problem I'm having is the
actual implementation. The design must handle unsigned/signed
multiplication. My procedure is as follows:

1. Clear partial sum, partial carry registers (64 bits)
2. For each CSA, choose the appropriate 3 bits for Booth select and
determine whether to add X, 2X, -X, or -2X. If adding a negative term, add
a 1 to the LSB of the partial product.
3. The lower two bits of each CSA are concatenated as {csa4_ps[1:0],
csa3_ps[1:0], csa2_ps[1:0], csa1_ps[1:0]} and 8 bits of carry are
created as {0, pp4_cin, 0, pp3_in, 0, pp2_cin, 0, pp1_cin} (this just
contains the +1 to be added if we're adding a negative term). These
are rotated to the top MSB bits of the partial sum/carry register
every clock cycle.
4. Within the CSA, the partial sum from CSA1 is right shifted two bits
(as csa#_ps[1:0] require no further processing until final addition)
and the bits 01 are added to the MSB. This along with the partial
carry is then fed to the next row of CSA.
5. After four stages, the partial sum/carry are stored in the lower 32
bits of their respective register and then fed back to the top for the
next cycle.
6. In the last step, carry-lookahead addition is performed on the upper 32
bits of partial sum/carry (which are actually the lower 32 bits of the
product). The carry out from here is the carry  in to the lower 32
bits of partial/carry (which are actually the upper 32 bits of the
product).

My problem arises from handling sign extension and carry-save
addition. In essence, I'm following the rule on this page:
http://www.geoffknagge.com/fyp/booth.shtml#sign

Trying an example by hand, I don't seem to get the appropriate results
with CSA. I get the right results with normal addition. I've checked
the work and it looks OK but I can't figure out which part of the
logic is wrong. Maybe some of you have some ideas of what I'm doing
wrong:

An example:

X = 001011 = 11 (multiplicand)
Y = 010011 = 19 (multiplier)
Expected result = 209 = 11010001

First, we need to add -X (110), then add X (001), then add X (010).
The above site says to invert the MSB of the partial product and add
"01" to the front of this (this is the same as appending 01 to the
partial sum since the lower 2 bits are shifted out, I think).
Furthermore, for the first partial product, add a 1 to the MSB.

PS = partial sum, PC = partial carry, Z = partial product

Iteration 0:
PS: 000000
PC: 000000
Z:    010100     (-X with MSB inverted)  01 = temp_carry (to be added
later)
----------------
PS: 010100
C:   000000

Now PS is shifted right 2 bits and 01 is appended,   temp_ps = 00
temp_carry = 01

Iteration 1:
PS: 010101
C  : 000000
Z  : 101011       (X with MSB inverted)   temp_ps = 00  temp_carry =
0001
----------------
PS: 111110
C :  000001

Now PS is shifted right 2 bits and 01 is appended, temp_ps = 1000,
temp_carry = 0001

Iteration 2: (final)
PS: 011111
C  : 000001
Z  : 101011       (X with MSB inverted)  temp_ps = 1000, temp_carry =
0001
---------------
PS: 110101
C:   001011

Now PS is shifted right 2 bits and 01 is appended, temp_ps = 011000,
temp_carry = 000001

PS: 011101
C:   001011

To get lower 6 bits of products, add temp_ps + temp_carry = 011000 +
000001 = 011001

To get upper 6 bits of product, add PS + C + Cin = 1 (for add 1 to MSB
of FIRST product). So 011101 + 001011 + 1 =101001

So this gives me a result of 101001 011011 =  2651 which is clearly
wrong. Any ideas where I'm going wrong? Is there a better way to think
about this?

Thanks for any help.

Kind regards.
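
A small software reference model can help when checking hand-worked steps like the ones above. The Python sketch below is not the datapath described in this post: it sign-extends every partial product to the full 2n bits instead of using the invert-MSB-and-prepend-01 shortcut, and it accumulates one radix-4 Booth digit per step in carry-save form. All function names are hypothetical. It is meant only as a golden reference whose digits and final product can be compared against a hardware-style calculation.

```python
# Radix-4 Booth recoding table, indexed by the bit triplet (y[i+1], y[i], y[i-1]).
BOOTH_DIGIT = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
               0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}

def to_signed(value, bits):
    """Interpret a bits-wide pattern as a two's complement integer."""
    return value - (1 << bits) if (value >> (bits - 1)) & 1 else value

def booth_digits(y, n):
    """Radix-4 Booth digits of the n-bit pattern y, least significant first (n even)."""
    ext = (y & ((1 << n) - 1)) << 1          # append the implicit y[-1] = 0
    return [BOOTH_DIGIT[(ext >> i) & 0b111] for i in range(0, n, 2)]

def carry_save_add(a, b, c, mask):
    """3:2 compressor: returns (sum, carry) with sum + carry == a + b + c (mod 2**w)."""
    s = (a ^ b ^ c) & mask
    carry = (((a & b) | (a & c) | (b & c)) << 1) & mask
    return s, carry

def booth_csa_multiply(x, y, n):
    """Signed n x n multiply; returns the 2n-bit two's complement product pattern."""
    mask = (1 << (2 * n)) - 1
    xs = to_signed(x, n)
    ps = pc = 0                              # partial sum and partial carry
    for k, digit in enumerate(booth_digits(y, n)):
        # Fully sign-extended partial product, weighted by 4**k.
        pp = ((digit * xs) << (2 * k)) & mask
        ps, pc = carry_save_add(ps, pc, pp, mask)
    return (ps + pc) & mask                  # final carry-propagate addition

product = booth_csa_multiply(0b001011, 0b010011, 6)   # 11 * 19 with n = 6
```

For the example in this post, `booth_digits(0b010011, 6)` gives the digits -1, +1, +1 (least significant first), matching the "-X, X, X" sequence, and the product is 209. Unsigned operands can be handled with the same model by choosing a larger even n so that the top bit of each operand is 0.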

Article: 143357
Subject: Re: Multiplier design with carry-save adder + Booth encoding
From: glen herrmannsfeldt <gah@ugcs.caltech.edu>
Date: Mon, 5 Oct 2009 18:23:54 +0000 (UTC)

pallav <pallavgupta@gmail.com> wrote:
 
< For fun, I'm trying to code up a 32x32 multiplier (R = X*Y) using 4
< layers of CSA and radix-4 booth encoding. This is not targeting an
< FPGA so no using the "*" operator in Verilog. My block diagram
< basically has 4 layers of CSA with each layer that can compute 2 bits
< of multiplication. Thus, it takes 4 clock cycles to multiply 32x32.
 
< While I understand booth encoding, the problem I'm having is the
< actual implementation. The design must handle unsigned/signed
< multiplication. My procedure is as follows:

(snip of explanation and code)

Hopefully you can get the unsigned case working first.

Then, for the twos complement (signed) case, remember that
the result is what you would get giving negative weight instead
of positive weight to the sign bit.  For a negative multiplier,
subtract the multiplicand instead of adding for the MSB.
If the multiplicand is negative, sign extend it.
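
The negative-weight rule can be checked with a plain shift-add model. A minimal Python sketch, assuming the inputs are n-bit two's complement bit patterns (the function name is hypothetical):

```python
def signed_shift_add(x, y, n):
    """Multiply two n-bit two's complement patterns with a shift-add loop.

    The multiplicand x is sign extended; the multiplier's MSB carries
    weight -2**(n-1), so its partial product is subtracted instead of added.
    """
    # Sign-extend the multiplicand (Python ints are unbounded, so a signed
    # value stands in for "extended to the full product width").
    xs = x - (1 << n) if (x >> (n - 1)) & 1 else x
    acc = 0
    for i in range(n):
        if (y >> i) & 1:
            if i == n - 1:
                acc -= xs << i    # sign bit of the multiplier: negative weight
            else:
                acc += xs << i    # ordinary bit: positive weight
    return acc

# 4-bit example: 0b1101 is -3 and 0b1010 is -6 in two's complement.
result = signed_shift_add(0b1101, 0b1010, 4)
```

Dropping the `i == n - 1` branch gives the unsigned behavior for the multiplier, which is a quick way to see exactly where the two cases differ.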

-- glen

Article: 143358
Subject: Re: Virtx 4 and FPGA programming
From: d_s_klein <d_s_klein@yahoo.com>
Date: Mon, 5 Oct 2009 13:13:41 -0700 (PDT)

On Oct 2, 12:12 pm, akohan <amit.ko...@gmail.com> wrote:
> On Oct 2, 11:05 pm, akohan <amit.ko...@gmail.com> wrote:
>
> > Hello group,
>
> > I am about to start working with ML410 which is V4 Xilinx board, so
> > far I have used Altera and Spartan 3E for academic works. I will
> > appreciate it if you could guide me:
>
> > 1) where to start and what I should know in advance.
>

For Xilinx questions I like: <http://www.xilinx.com>

> > 2) I have a UCF file which defines some constraints on Spartan 3E, can
> > I use it on this board too or connections are different?
>

The UCF belongs to the board.  New board, new UCF.

> > Regards,
> > Amit
>
> One thing I forgot to ask do I need using Linux or Windows for
> embedded coding with C?

Yes.

Article: 143359
Subject: Re: Implement ARM cores on a FPGA chip?
From: Darron <darron.black@gmail.com>
Date: Mon, 5 Oct 2009 13:54:18 -0700 (PDT)

Algorithm development on FPGAs for maximum speed does not normally use
'cores' like you're talking about.  If your algorithm does make it to
an ASIC, it VERY likely should not be in the form of a multi-core
platform.  At least, not in the traditional processing core sense.

FPGAs are good at direct hardware implementation of algorithms.
They're pretty slow at simulating processors running embedded
software.

Everyone seems to be discussing the best soft core to put in FPGAs,
but the whole premise of using processor cores in FPGAs like you're
talking about sounds flawed to me.

The ultimate speed is going to come from coding your algorithm itself
in FPGA or ASIC gates.  Don't waste FPGA gates to simulate a
processor, which in turn runs normal embedded software code (very
slowly).  Directly implement your algorithm in hardware.

For vision algorithms, you could try Matlab-to-hardware or C-to-
hardware workflows.  However, for the best speed (and the most vendor
independence) you're likely to need to write a hardware algorithm
description in VHDL or Verilog.

You would normally code a VHDL/Verilog module that implements your
algorithm in hardware, and instantiate as many copies of that module
as will fit in an FPGA.  You might then include a single soft
processor core to manage the whole thing.  That, or you'd simply
provide an external interface to a normal processor or DSP making it
look like a memory mapped device or FIFO.

If your ASIC is just going to be a bunch of processing cores, you
could probably do it all MUCH more cheaply using some NVIDIA GT200-
based video card processing using CUDA (or ATI's stuff...  I'm not
trying to play favorites...  it's just what I know).  You're going to
get way more processing speed that way.  You're very unlikely to do
better than NVIDIA does.  (The modern video card GPUs can now do
generalized parallel processing an order of magnitude (or two!) faster
than the CPU)

You might also consider a small Linux computing cluster.  That's a lot
easier to write code for.

Article: 143360
Subject: Re: Implement ARM cores on a FPGA chip?
From: glen herrmannsfeldt <gah@ugcs.caltech.edu>
Date: Mon, 5 Oct 2009 21:16:57 +0000 (UTC)

Darron <darron.black@gmail.com> wrote:

< Algorithm development on FPGAs for maximum speed do not normally use
< 'cores' like you're talking about.  If your algorithm does make it to
< an ASIC, it VERY likely should not be in the form of a multi-core
< platform.  At least, not in the traditional processing core sense.

I think I agree, and that is one reason I am against using serial
languages (such as C) as hardware description languages.
 
< FPGAs are good at direct hardware implementation of algorithms.
< They're pretty slow at simulating processors running embedded
< software.
 
< Everyone seems to be discussing the best soft core to put in FPGAs,
< but the whole premise of using processor cores in FPGAs like you're
< talking about sounds flawed to me.

I agree.  Though sometimes you need a control processor for the
hardware that directly implements the algorithm, and sometimes that
can best be implemented as a soft processor.  Most likely, though, it
should be one designed for efficient FPGA implementation.
 
< The ultimate speed is going to come from coding your algorithm itself
< in FPGA or ASIC gates.  Don't waste FPGA gates to simulate a
< processor, which in turn runs normal embedded software code (very
< slowly).  Directly implement your algorithm in hardware.

My favorite architecture for FPGA implementations is the systolic array.
Systolic arrays work especially well with FPGAs with FF's for each
LUT, and a clock tree designed to clock them all together.
 
< For vision algorithms, you could try Matlab-to-hardware or C-to-
< hardware workflows.  However, for the best speed (and the most vendor
< independence) you're likely to need to write a hardware algorithm
< description in VHDL or Verilog.

And likely it won't look anything like the serial description
of an algorithm for the same function.
 
< You would normally code a VHDL/Verilog module that implements your
< algorithm in hardware, and instanciate as many copies of that module
< as will fit in an FPGA.  You might then include a single soft
< processor core to manage the whole thing.  That, or you'd simply
< provide an external interface to a normal processor or DSP making it
< look like a memory mapped device or FIFO.

Yes.  Though a small on-chip FIFO might help.

(snip)
-- glen

Article: 143361
Subject: Re: Multiplier design with carry-save adder + Booth encoding
From: rickman <gnuarm@gmail.com>
Date: Mon, 5 Oct 2009 16:31:05 -0700 (PDT)

I have not looked at a carry-save implementation, but I don't think
that would matter.  IIRC, Booth's algorithm automatically handles
signed numbers because of the subtractions required.

I recently coded an iterative multiplier and chose the simple
shift-add multiplier, the same as you would do by hand.  My target was an
FPGA with 4-input LUTs and a built-in carry chain.  In that situation,
the Booth multiplier uses the same amount of resources, since an N-bit
adder uses no more resources than an N-bit mux.  The shift-add
multiplier would actually be simpler, but to handle a signed
multiplicand, both the multiplier and multiplicand must be negated,
which uses an extra N LUTs.  When I looked at the Booth
multiplier, I was pretty sure that the algorithm would work with either
signed or unsigned numbers equally well.  So you might want to pare
your code down to a small implementation, such as a 4 x 4 multiplier,
and step through each cycle and verify that the code is producing what
you expect.

Rick

On Oct 5, 1:32 pm, pallav <pallavgu...@gmail.com> wrote:
> Hi,
>
> For fun, I'm trying to code up a 32x32 multiplier (R = X*Y) using 4
> layers of CSA and radix-4 booth encoding. This is not targeting an
> FPGA so no using the "*" operator in Verilog. My block diagram
> basically has 4 layers of CSA with each layer that can compute 2 bits
> of multiplication. Thus, it takes 4 clock cycles to multiply 32x32.
>
> While I understand booth encoding, the problem I'm having is the
> actual implementation. The design must handle unsigned/signed
> multiplication. My procedure is as follows:
>
> 1. Clear partial sum, partial carry registers (64 bits)
> 2. For each CSA, choose the appropriate 3 bits for Booth select and
> determine whether to add X, 2X, -X, -2X. If adding, negative term, add
> a 1 to the LSB of the partial product.
> 3. The lower two bits of each CSA are concatenated as {csa4_ps[1:0],
> csa3_ps[1:0], csa2_ps[1:0], csa1_ps[1:0]} and 8 bits of carry are
> created as {0, pp4_cin, 0, pp3_in, 0, pp2_cin, 0, pp1_cin} (this just
> contains the +1 to be added if we're adding a negative term). These
> are rotated to the top MSB bits of the partial sum/carry register
> every clock cycle.
> 4. Within the CSA, the partial sum from CSA1 is right shited two bits
> (as csa#_ps[1:0] require no further processing until final addition)
> and the bits 01 are added to the MSB. This along with the partial
> carry is then fed to the next row of CSA.
> 5. After four stages, the partial sum/carry are stored in the lower 32
> bits of their respective register and then fed back to the top for the
> next cycle.
> 6. In the last step, carry-lookahead addition is form on the upper32
> bits of partial sum/carry (which are actually the lower 32 bits of the
> product). The carry out from here is the carry in to the lower 32
> bits of partial/carry (which are actually the upper 32 bits of the
> product).
>
> My problem arises from handling sign extension and carry-save
> addition. In essence, I'm following the rule on this page:
> http://www.geoffknagge.com/fyp/booth.shtml#sign
>
> Trying an example by hand, I don't seem to get the appropriate results
> with CSA. I get the right results with normal addition. I've checked
> the work and it looks OK but I can't figure out which part of the
> logic is wrong. Maybe some of you have some ideas of what I'm doing
> wrong:
>
> An example:
>
> X = 001011 = 11 (multiplicand)
> Y = 010011 = 19 (multiplier)
> Expected result = 209 = 11010001
>
> First, we need to add -X (110), then add X (001), then add X (010).
> The above site says to invert the MSB of the partial product and add
> "01" to the front of this (this is the same as appending 01 to the
> partial sum since the lower 2 bits are shifted out, I think).
> Furthermore, for the first partial product, add a 1 to the MSB.
>
> PS = partial sum, PC = partial carry, Z = partial product
>
> Iteration 0:
> PS: 000000
> PC: 000000
> Z:    010100     (-X with MSB inverted)  01 = temp_carry (to be added
> later)
> ----------------
> PS: 010100
> C:   000000
>
> Now PS is shift right 2 bits and 01 is appended,   temp_ps = 00
> temp_carry = 01
>
> Iteration 1:
> PS: 010101
> C  : 000000
> Z  : 101011       (X with MSB inverted)   temp_ps = 00  temp_carry =
> 0001
> ----------------
> PS: 111110
> C :  000001
>
> Now PS is shifted right 2 bits and 01 is appended, temp_ps = 1000,
> temp_carry = 0001
>
> Iteration 2: (final)
> PS: 011111
> C  : 000001
> Z  : 101011       (X with MSB inverted)  temp_ps = 1000, temp_carry =
> 0001
> ---------------
> PS: 110101
> C:   001011
>
> Now PS is shifted right 2 bits and 01 is appended, temp_ps = 011000,
> temp_carry = 000001
>
> PS: 011101
> C:   001011
>
> To get lower 6 bits of products, add temp_ps + temp_carry = 011000 +
> 000001 = 011001
>
> To get upper 6 bits of product, add PS + C + Cin = 1 (for add 1 to MSB
> of FIRST product). So 011101 + 001011 + 1 =101001
>
> So this gives me a results of 101001 011011 =  2651 which is clearly
> wrong. Any ideas where I'm going wrong? Is there a better way to think
> about this?
>
> Thanks for any help.
>
> Kind regards.

Article: 143362
Subject: Re: Multiplier design with carry-save adder + Booth encoding
From: pallav <pallavgupta@gmail.com>
Date: Mon, 5 Oct 2009 17:03:30 -0700 (PDT)

Hi Rick/Glen,

Thanks a lot for these responses. I am looking at a smaller bit-width
and trying to get that built first. I will keep your pointers in mind.
I'm also reading a few papers that discuss various implementations,
particularly with regard to sign extension. From what I gather, if you
have an N-bit multiplier, 2 extra bits seem to be used for sign
extension. Even in the unsigned case, the partial products have to be
sign extended due to the subtraction (I think).

The main problem is figuring out how many extra bits are necessary,
what they are set to, and how the partial sum/carry works. From the
description I had posted above, I think I had the right concept. But
maybe after reading these papers, something different might be
involved.

I will work on this in more detail and report my findings once I get
somewhere.

Thanks for your time.

Kind regards.

Article: 143363
Subject: fpga custom cpu VS. cuda
From: olliH <oliver.hofherr@googlemail.com>
Date: Tue, 6 Oct 2009 00:58:16 -0700 (PDT)

Hi,

I never tried CUDA, I just read about it. Maybe someone who uses FPGAs
also has some experience with CUDA.

I developed a system with an FPGA that calculates an algorithm that
needs about 200 64-bit floating point operations: add, multiply,
divide, square root, sine and cosine.

The algorithm is calculated in about 80 steps, so in every time step
2.5 floating point operations are done on average, and the results are
used in the next time step. For the whole algorithm the FPGA needs 1300
clocks. That makes 6.5 clocks for every FP operation (average).

With a Xilinx Virtex4 LX60 the floating point cores run at 50 MHz.
This makes it possible to achieve about a 35 kHz update rate for this
algorithm.

How fast could a CUDA system do these calculations?

My ADCs can run at 100 kSample/s, so it would be nice to speed the
whole thing up a little bit.
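
The figures in this post can be checked with a few lines of arithmetic. The Python sketch below only restates the numbers given above (the function and key names are hypothetical). The ideal rate from 1300 clocks at 50 MHz comes out near 38.5 kHz, slightly above the roughly 35 kHz quoted; the post does not say where the difference goes.

```python
def fpga_estimate(ops=200, steps=80, clocks=1300, f_clk_hz=50e6, f_adc_hz=100e3):
    """Back-of-envelope figures from the numbers quoted in the post."""
    return {
        # Average floating point operations per algorithm step.
        "ops_per_step": ops / steps,
        # Average clocks spent per floating point operation.
        "clocks_per_op": clocks / ops,
        # Ideal update rate if one result is produced every `clocks` cycles.
        "update_rate_hz": f_clk_hz / clocks,
        # Clock budget per result needed to keep up with the ADC rate.
        "clock_budget_at_adc_rate": f_clk_hz / f_adc_hz,
    }

estimate = fpga_estimate()
```

Under these assumptions, keeping up with 100 kSample/s at 50 MHz would leave a budget of 500 clocks per result, compared with the 1300 currently used.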

Article: 143364
Subject: Re: fpga custom cpu VS. cuda
From: Florian Stock <stock@esa.informatik.tu-darmstadt.de>
Date: Tue, 06 Oct 2009 11:13:05 +0200

olliH <oliver.hofherr@googlemail.com> writes:

> I never tried CUDA, I just read about it. Maybe someone who uses FPGAs
> hast also some experience with CUDA.

We published some papers with GPU (CUDA) - FPGA comparison.

> I developed a system with a FPGA that calculates a algorithm that
> needs about 200 64-bit floating point operations. Add, multiply,
> divide, square root, sinus and cosinus.
>
> The algorithm is calculated in about 80 steps. So every timestep 2.5
> floating point operations are done and the results are used in the
> next time step. For the whole agorithm the FPGA needs 1300 clocks.
> That makes 6.5 clocks for every FP-operation (average).
>
> With a Xilinx Virtex4 LX60 the floating point cores run with 50 MHz.
> This makes it possible to achieve about 35 kHz update rate for this
> algorithm.
>
> How fast could a CUDA-System make this calculations?

Very fast - OK, one short stopper at the moment is your double requirement.
That will change with the upcoming Fermi architecture, but so far the GPUs
just support single precision (at reasonable speed). The good boards have 30
multiprocessors, each with 8 scalar datapaths which operate in
SIMD. And for these 8 scalar datapaths there is just one double-precision
ALU (you could also emulate double by multiple single-float operations).

But even when you look at the worse double precision performance you get
the following numbers:

30 multiprocessors (if you had just single it would be 240 scalar
processors) at 600 MHz, that makes 18e9 computing cycles/s.  If we say
10 cycles per operation, we have 18e9/(10*80 steps) = 22.5e6 of your
computations in one second (i.e. 22.5 MSamples/s).
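
The estimate above can be reproduced as follows. This is a sketch of the same arithmetic in Python, with hypothetical function and parameter names; it is not a measurement. The second call is a variation that was not in the estimate: it charges 10 cycles for each of the 200 floating point operations instead of for each of the 80 steps.

```python
def gpu_estimate(multiprocessors=30, f_hz=600e6, cycles_per_op=10, ops_per_result=80):
    """Return (compute cycles per second, results per second) for the rough model."""
    cycles_per_s = multiprocessors * f_hz
    results_per_s = cycles_per_s / (cycles_per_op * ops_per_result)
    return cycles_per_s, results_per_s

# Figures as used in the estimate: 80 steps at 10 cycles each.
cycles_per_s, results_per_s = gpu_estimate()

# Variation (an assumption, not from the estimate): count all 200 operations.
_, results_per_s_200_ops = gpu_estimate(ops_per_result=200)
```

Either way the throughput is in the millions of results per second, far above the 100 kSample/s ADC rate mentioned in the question.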

These numbers do not include the I/O to/from the graphics board, but
the required I/O rates seem to be far below 50 GB/s, so that should be
no problem.

> My ADCs can run with 100 kSample/s so it would be nice to speed the
> whole thing a little bit up.

Speed is not everything. The numbers you saw above are huge - but they
are only throughput. The drawback you get with your huge computation
power is latency. As you compute 30 samples at the same time, you
won't have the result of the 1st before the 30th (btw. I assume in the
above calculation that they actually can be computed independently of each
other - in case you have dependencies (feedback or similar) a GPU makes no
sense at all). Also, if your problem does not constantly demand the full
high rate, your power efficiency (FLOPS/Watt) is very bad.

Florian

Article: 143365
Subject: Re: Altera logic programmer card (PLP6) problem
From: LC <cupidoREMOVE@mail.ua.pt>
Date: Tue, 06 Oct 2009 11:40:12 +0100

Hi,

I have one running on Win98,
and I have reasons to believe that you can't get
it to work on WinXP. When I tried it on XP, MaxPlus2
could not access the hardware; later on, using the
very same PC but running Win98, it all worked just fine.

lc.

anotherUserName wrote:
> I have an (obsolete) Altera Logic Programmer Card (PLP6) that I cannot get
> to work.  I installed MaxPlus2 v 10.23 and the PLP6 driver on a WinXP box. 
> When I attempt to setup the card in the MaxPlus2, it tells me that it
> cannot find the card.  Do you have any information about this card?  I can
> find nothing online.  Do you have a datasheet?  There are 5 red LEDs on the
> card.  Do you know what they mean?  Thank you.
> 
>

Article: 143366
Subject: Re: Virtx 4 and FPGA programming
From: Brian Drummond <brian_drummond@btconnect.com>
Date: Tue, 06 Oct 2009 11:53:28 +0100

On Mon, 5 Oct 2009 13:13:41 -0700 (PDT), d_s_klein <d_s_klein@yahoo.com> wrote:

>On Oct 2, 12:12 pm, akohan <amit.ko...@gmail.com> wrote:
>> On Oct 2, 11:05 pm, akohan <amit.ko...@gmail.com> wrote:
>>
>> > Hello group,
>>
>> > I am about to start working with ML410 which is V4 Xilinx board, so
>> > far I have used Altera and Spartan 3E for academic works. I will
>> > appreciate it if you could guide me:
>>
>> > 1) where to start and what I should know in advance.
>>
>
>For Xilinx questions I like: <http://www.xilinx.com>
>
>> > 2) I have a UCF file which defines some constraints on Spartan 3E, can
>> > I use it on this board too or  connections are different?
>>
>
>The UCF belongs to the board.  New board, new UCF.

Half right.
The pin allocations definitely belong to the board.

However the other constraints (I/O standards, timing constraints etc) may not,
unless you have also changed the clock rates, and the peripherals to which you
are interfacing. 

Any RLOCS or other internal constraints to ensure good placement will remain
(though if you change the FPGA you probably need to revisit these too)

- Brian

Article: 143367
Subject: Ideas for a pulse programmer needed
From: jmariano <jmariano65@gmail.com>
Date: Tue, 6 Oct 2009 08:42:59 -0700 (PDT)

Dear Group,

I have very little experience in FPGA (and in digital design!).

As part of a research project I have to add to an existing microblaze
system, implemented on spartan 3 starter kit board, a pulse programmer
(PP). A PP is a system that outputs a given pattern to a set of
digital lines for a given time and then changes the pattern according
to a program.

There are several ways of implementing the PP, but I have decided to
use what seems to me to be the simplest one: two blocks of RAM, say 2K
deep and 16 bits wide, addressed by the same address counter (AC). One block
holds the time duration and the other the bit pattern. The control
block loads the contents of the first RAM into a counter and latches
the content of the second one to the output. When the counter reaches
the end, the AC is incremented and the next time and pattern words are
loaded.
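
The scheme can be tried out in software before any HDL is written. The Python sketch below is a behavioral model only, with several assumptions that the description leaves open: durations are counted in clock ticks and are at least 1, the program is non-empty, and the sequencer simply stops after the last RAM entry. The class and function names are hypothetical.

```python
class PulseProgrammer:
    """Behavioral model: two RAMs sharing one address counter (AC)."""

    def __init__(self, durations, patterns):
        assert len(durations) == len(patterns) and len(durations) > 0
        assert all(d >= 1 for d in durations)    # assumed: at least one tick each
        self.time_ram = list(durations)          # RAM 1: duration words
        self.pattern_ram = list(patterns)        # RAM 2: output pattern words
        self.addr = 0                            # shared address counter
        self.count = self.time_ram[0]            # down counter loaded from RAM 1
        self.out = self.pattern_ram[0]           # output latch loaded from RAM 2
        self.done = False

    def tick(self):
        """Advance one clock and return the word driven during that clock."""
        driven = self.out
        self.count -= 1
        if self.count == 0:                      # current interval has elapsed
            self.addr += 1
            if self.addr == len(self.time_ram):
                self.done = True                 # assumed: stop at end of program
            else:
                self.count = self.time_ram[self.addr]
                self.out = self.pattern_ram[self.addr]
        return driven

def trace(durations, patterns):
    """Run a program to completion and return the output word for every tick."""
    pp = PulseProgrammer(durations, patterns)
    words = []
    while not pp.done:
        words.append(pp.tick())
    return words

# Pattern 0x1 for 2 ticks, 0x3 for 1 tick, 0x0 for 3 ticks.
waveform = trace([2, 1, 3], [0x1, 0x3, 0x0])
```

A model like this reloads the counter and output in the same tick that the count expires, which real block RAM with a registered read cannot do without fetching the next words ahead of time; that difference is worth keeping in mind when moving to hardware.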

The IP access to RAM must be fast (this determines the time resolution
of the PP), but the access of the processor can be slow, since this is
done only once at the beginning of the experiment to write the
programming words, and then the IP works on its own.

I was thinking of using BRAM to hold the data. Is this a good choice?
My other question is, what is the easiest way to implement MicroBlaze
access to the RAM? I appreciate any comments on this.

I would also appreciate it if you could point me to some examples or
application notes of a similar system (not a PP, but a system where
memory is accessed by an IP and MicroBlaze), where I can get some
ideas.

Thank you very much,

jmariano

Article: 143368
Subject: Re: Ideas for a pulse programmer needed
From: glen herrmannsfeldt <gah@ugcs.caltech.edu>
Date: Tue, 6 Oct 2009 16:43:44 +0000 (UTC)

jmariano <jmariano65@gmail.com> wrote:
 
< As part of a research project I have to add a pulse programmer (PP)
< to an existing MicroBlaze system implemented on a Spartan-3 starter
< kit board. A PP is a system that outputs a given pattern on a set of
< digital lines for a given time and then changes the pattern according
< to a program.

< There are several ways of implementing the PP but I have decided to
< use what seems to me to be the simplest one: two blocks of RAM, say 2K
< deep and 16 bits wide, addressed by the same address counter (AC). One
< block holds the time duration and the other the bit pattern. The control
< block loads the contents of the first RAM into a counter and latches
< the contents of the second one to the output. When the counter reaches
< the end, the AC is incremented and the next time and pattern words are
< loaded.

< The IP access to RAM must be fast (this determines the time resolution
< of the PP), but the access of the processor can be slow, since this is
< done only once at the beginning of the experiment to write the
< programming words and then the IP works on its own.

There are tricks that can be used if RAM access isn't fast enough.
For one, you want to fetch the next value from RAM and have it ready
in a register as soon as the previous one is being clocked out.
That probably works as long as the count isn't too small (like one).

Otherwise it sounds fine.
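
A rough way to see the effect of the prefetch trick, and why a count of one is the awkward case, is to count stall cycles. This is a simplified sketch under stated assumptions: `ram_latency` is a hypothetical fixed read latency in clock cycles, only one read is outstanding at a time, and the read for the next entry is assumed to start in the same cycle the current entry starts playing.

```python
def total_cycles(durations, ram_latency, prefetch):
    """Clock cycles needed to play every entry, including RAM wait cycles."""
    total = 0
    for i, duration in enumerate(durations):
        if not prefetch or i == 0:
            # No overlap possible: wait for the full read before the pulse starts.
            stall = ram_latency
        else:
            # The read was started when the previous pulse started, so it only
            # stalls if that pulse was shorter than the read latency.
            stall = max(0, ram_latency - durations[i - 1])
        total += stall + duration
    return total


durations = [5, 1, 4]
print(total_cycles(durations, ram_latency=2, prefetch=False))  # -> 16
print(total_cycles(durations, ram_latency=2, prefetch=True))   # -> 13
```

With prefetch the only extra cycles are the initial read and one stall after the single-cycle pulse, which is too short to hide a two-cycle read.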

By the way, if this is homework be sure to reference the newsgroup
as source for any ideas that you use.

-- glen

Article: 143369
Subject: Re: Ideas for a pulse programmer needed
From: backhus <goouse@twinmail.de>
Date: Tue, 6 Oct 2009 23:10:26 -0700 (PDT)

On 6 Oct., 17:42, jmariano <jmarian...@gmail.com> wrote:
> Dear Group,
>
> I have very little experience with FPGAs (and with digital design!).
>
> As part of a research project I have to add a pulse programmer (PP)
> to an existing MicroBlaze system implemented on a Spartan-3 starter
> kit board. A PP is a system that outputs a given pattern on a set of
> digital lines for a given time and then changes the pattern according
> to a program.
>
> There are several ways of implementing the PP but I have decided to
> use what seems to me to be the simplest one: two blocks of RAM, say 2K
> deep and 16 bits wide, addressed by the same address counter (AC). One
> block holds the time duration and the other the bit pattern. The control
> block loads the contents of the first RAM into a counter and latches
> the contents of the second one to the output. When the counter reaches
> the end, the AC is incremented and the next time and pattern words are
> loaded.
>
> The IP access to RAM must be fast (this determines the time resolution
> of the PP), but the access of the processor can be slow, since this is
> done only once at the beginning of the experiment to write the
> programming words and then the IP works on its own.
>
> I was thinking of using BRAM to hold the data. Is this a good choice?
> My other question is, what is the easiest way to give the MicroBlaze
> access to the RAM? I appreciate any comments on this.
>
> I would also appreciate it if you could point me to some examples or
> application notes of a similar system (not a PP, but a system where
> memory is accessed by both an IP and the MicroBlaze), where I can get
> some ideas.
>
> Thank you very much,
>
> jmariano

Hi,
if the processor is just needed for loading the RAM, a MicroBlaze may
be oversized for such a small task.
A PicoBlaze would be sufficient. And it comes with a UART, if you want
to use that kind of interface.
If you are going to use more complex interfaces like LAN, a MicroBlaze
would be a better choice, of course.

For the MicroBlaze there's an SRAM interface core available. It's
intended to access register banks of peripheral devices, and is sometimes
used to connect to the LAN interface chips on the development boards.
But that's just what you need, because your PP is just some interface
with a large register set (2K).
Remember that you have to budget BRAM resources. The MicroBlaze needs
some BRAMs for cache etc. Or you might implement the core without cache.
You probably don't need that much performance.
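
A quick budget can be worked out on paper or in a few lines. The numbers here are assumptions to be checked against the data sheet of the actual part: 16 Kbit of data per block RAM (parity bits ignored), 12 block RAMs on the device, and 8 KB of MicroBlaze local memory. The helper name is hypothetical, and the simple bit count ignores aspect-ratio restrictions that the tools may impose.

```python
import math

BRAM_DATA_BITS = 16 * 1024    # assumed data capacity of one block RAM
DEVICE_BRAMS = 12             # assumed number of block RAMs on the FPGA


def brams_needed(depth, width, bram_data_bits=BRAM_DATA_BITS):
    """Block RAMs needed for a depth x width memory, counting data bits only."""
    return math.ceil(depth * width / bram_data_bits)


pulse_programmer = 2 * brams_needed(2048, 16)   # time RAM + pattern RAM
local_memory = brams_needed(8 * 1024, 8)        # assumed 8 KB of MicroBlaze memory
remaining = DEVICE_BRAMS - pulse_programmer - local_memory

print("pulse programmer:", pulse_programmer)    # -> pulse programmer: 4
print("MicroBlaze memory:", local_memory)       # -> MicroBlaze memory: 4
print("left for caches etc.:", remaining)       # -> left for caches etc.: 4
```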

Have a nice synthesis
  Eilert

Article: 143370
Subject: Re: Ideas for a pulse programmer needed
From: -jg <jim.granville@gmail.com>
Date: Wed, 7 Oct 2009 01:22:26 -0700 (PDT)

On Oct 7, 4:42 am, jmariano <jmarian...@gmail.com> wrote:
>
> The IP access to RAM must be fast (this determines the time resolution
> of the PP),

Not quite. If you load pulse counters, for example, the RAM access time
sets the reload time, but the time resolution can be finer than the RAM
access time. RAM access sets the pulse update rate.

You could also use a simple scheme like run length coding, to expand
(compress?) what the ram holds, relative to the pulse resolution.
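
As an illustration of the run-length idea, a waveform sampled at the pulse resolution can be stored as (pattern, count) pairs and expanded again on playback. This is a minimal sketch: the function names are hypothetical, and the `max_run` limit assumes a 16-bit count word.

```python
def rle_encode(samples, max_run=0xFFFF):
    """Compress a per-clock sample list into (pattern, run_length) pairs."""
    runs = []
    for sample in samples:
        if runs and runs[-1][0] == sample and runs[-1][1] < max_run:
            runs[-1][1] += 1              # extend the current run
        else:
            runs.append([sample, 1])      # start a new run
    return [(pattern, count) for pattern, count in runs]


def rle_decode(runs):
    """Expand (pattern, run_length) pairs back into per-clock samples."""
    samples = []
    for pattern, count in runs:
        samples.extend([pattern] * count)
    return samples


waveform = [1, 1, 1, 2, 2, 1]
encoded = rle_encode(waveform)
print(encoded)                            # -> [(1, 3), (2, 2), (1, 1)]
print(rle_decode(encoded) == waveform)    # -> True
```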

With an FPGA, you have a LOT of design freedom :)

-jg

Article: 143371
Subject: Re: Virtx 4 and FPGA programming
From: Jim <jimw567@gmail.com>
Date: Wed, 7 Oct 2009 03:11:24 -0700 (PDT)

> I am about to start working with ML410 which is V4 Xilinx board, so
> far I have used Altera and Spartan 3E for academic works. I will
> appreciate it if you could guide me:
>
> 1) where to start and what I should know in advance.

I would start with some of the reference designs
(http://www.xilinx.com/products/boards/ml410/reference_designs.htm).
Be aware that ML410 is a relatively old board, so you may need to get
versions of ISE/EDK to match the ones used to create the reference design.

> 2) I have a UCF file which defines some constraints on Spartan 3E, can
> I use it on this board too or connections are different?

Depending on your application, you should be able to use most if not
all constraints from the ML410 reference designs.

> One thing I forgot to ask do I need using Linux or Windows for
> embedded coding with C?

Depending on your application, you may be able to have a standalone
system without an OS. Again, check the ML410 reference design page.

Cheers,
Jim
http://myfpgablog.blogspot.com/



>
> Regards,
> Amit

Article: 143372
Subject: Re: Multiplier design with carry-save adder + Booth encoding
From: rickman <gnuarm@gmail.com>
Date: Wed, 7 Oct 2009 04:48:55 -0700 (PDT)

On Oct 5, 8:03 pm, pallav <pallavgu...@gmail.com> wrote:
> Hi Rick/Glen,
>
> Thanks a lot for these responses. I am looking at a smaller bit-width
> and trying to get that built first. I will keep your pointers in mind.
> I'm also reading a few papers that discuss various implementations,
> particularly about sign extension. From what I gather, if you have an
> N-bit multiplier, 2 extra bits seem to be used for sign extension.
> Even in the unsigned case, the partial products have to be sign
> extended due to the subtraction (I think).
>
> The main problem is figuring out how many extra bits are necessary,
> what they are set to, and how the partial sum/carry works. From the
> description I had posted above, I think I had the right concept. But
> maybe after reading these papers, something different might be
> involved.
>
> I will work on this in more detail and report my findings once I get
> somewhere.
>
> Thanks for your time.
>
> Kind regards.

Yes, you need to consider the length of the partial products at each
point in the calculation.  In the case of Booth's algorithm, it makes
sense to me that you will need two extended bits since you can add not
just X, but 2X.  I don't remember exactly how I handled this in my
code, but I believe I had to use an extra bit in my product register
or I may have dealt with it by shifting the data as it was saved in
the register.  Then only the calculation was needed, but not the extra
register bit... a very small savings, but it was part of the shifting
and so automatic, IIRC.
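
The two-extra-bits point can be checked numerically with a small software model. This is a sketch of textbook radix-4 Booth recoding for signed operands, not the code referred to above; the function names are hypothetical and `n_bits` is assumed to be even.

```python
def booth_radix4_digits(y, n_bits):
    """Radix-4 Booth digits (each in -2..2) of a two's-complement y, LSB first."""
    assert n_bits % 2 == 0
    y &= (1 << n_bits) - 1
    digits = []
    prev = 0                                   # implicit bit y[-1] = 0
    for i in range(0, n_bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)      # d = y[2k-1] + y[2k] - 2*y[2k+1]
        prev = b1
    return digits


def signed_width(value):
    """Minimum number of two's-complement bits that can hold value."""
    magnitude = value if value >= 0 else -value - 1
    return magnitude.bit_length() + 1


def booth_multiply(x, y, n_bits):
    """Return (x*y, widest partial product in bits) for signed n_bits operands."""
    product = 0
    widest = 0
    for k, digit in enumerate(booth_radix4_digits(y, n_bits)):
        partial = digit * x                    # one of -2x, -x, 0, +x, +2x
        widest = max(widest, signed_width(partial))
        product += partial << (2 * k)          # weight 4**k
    return product, widest


N = 8
max_width = 0
for x in range(-(1 << (N - 1)), 1 << (N - 1)):
    for y in range(-(1 << (N - 1)), 1 << (N - 1)):
        product, width = booth_multiply(x, y, N)
        assert product == x * y
        max_width = max(max_width, width)
print(max_width)   # -> 10, i.e. N + 2 bits
```

The worst case is -2 times the most negative X, which gives +2^N and needs N + 2 bits in two's complement. An unsigned multiplicand doubled and negated needs the same N + 2 bits.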

Rick

Article: 143373
Subject: Re: Virtex 5 HDMI
From: "raven1322" <jheagbayani@gmail.com>
Date: Wed, 07 Oct 2009 06:52:32 -0500

>
>"maxascent" <maxascent@yahoo.co.uk> wrote in message 
>news:HpqdnQS28-7PJwbXnZ2dnUVZ_umdnZ2d@giganews.com...
>> Yes I understand what it is but I want some method to use HDMI with a
>> Virtex 5.
>>
>
>
>You could use VHDL or Verilog, or if you know what you are doing schematic
>capture.
>
>
>

Hi! Were you able to find a solution to the problem? I also need to
know how to convert between LVDS and TMDS for an HDMI interface.

Article: 143374
Subject: image scalar in Spartan 3E
From: "mr16" <hkmos@163.com>
Date: Wed, 07 Oct 2009 06:53:01 -0500

Hi,

I really need help.

I have an image scaler project, but I am very new to this. I found
a related topic here where many experts gave suggestions, so I am
trying to get help here.

My project needs to transform RGB (640x480) to RGB (1024X960).
All I have is a Spartan 3E board and Google.

I am trying to use Verilog to write a scaler based on linear interpolation.
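
Before writing the Verilog, the arithmetic can be prototyped in software. This is a minimal sketch of one-dimensional linear interpolation using integer fixed-point math, the kind of arithmetic that maps onto adders and multipliers. The function name and the 8-bit fraction (`FRAC_BITS`) are assumptions. A full scaler would apply it to each row (640 to 1024) and then each column (480 to 960) of each colour channel.

```python
FRAC_BITS = 8   # assumed fixed-point precision of the interpolation weight


def scale_line(src, dst_len):
    """Resample one line of pixel values to dst_len samples by linear interpolation."""
    src_len = len(src)
    one = 1 << FRAC_BITS
    # Source step per output pixel in fixed point; line ends map to line ends.
    step = ((src_len - 1) << FRAC_BITS) // (dst_len - 1) if dst_len > 1 else 0
    out = []
    pos = 0                                   # fixed-point source position
    for _ in range(dst_len):
        index = pos >> FRAC_BITS              # left neighbour
        frac = pos & (one - 1)                # weight of the right neighbour
        left = src[index]
        right = src[min(index + 1, src_len - 1)]
        out.append((left * (one - frac) + right * frac) >> FRAC_BITS)
        pos += step
    return out


print(scale_line([0, 100], 3))                   # -> [0, 50, 100]
print(len(scale_line(list(range(640)), 1024)))   # -> 1024
```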

Once I have finished this, how can I put it on the board?

And how do I read the image in and output it?

Thanks!
