
Nico Coesel wrote:
> Why does verification take a rack of servers?

It doesn't. That would be easy to arrange for me, but probably not for Lucien. Python might be a more practical alternative.

My advice to Lucien is to verify that the parallel algorithm actually works and has the expected advantages before even thinking about the target hardware.

-- Mike Treseler

On Oct 4, 7:24 pm, "Antti.Luk...@googlemail.com" <antti.luk...@googlemail.com> wrote:
> On Oct 4, 7:55 pm, LucienZ <lucien.zh...@gmail.com> wrote:
> > > cortex m3 is easily obtainable; a license for 1000 instances costs $2500
> > > this IS common knowledge, i assumed you knew this
> > > Antti
> >
> > Is that Cortex-M3 license targeting ASIC fabrication or what?
> > Sorry for not sharing the common knowledge; I am really a dummy here...
>
> no, that is for 1000 instances in Altera Cyclone III
> asic licenses are different
>
> Antti

Thanks Antti. What concerns me is: suppose I've licensed 1000 'instances' and now want to use 4 of them in my design. Does that mean I have to use 4 FPGA chips, with exactly one M3 on each chip? Or can I deploy 4 instances on a single FPGA chip, using tools like Quartus?

I see Cyclone solutions based on the Cortex-M1, but I think the M3 is for ASICs. It's just a little bit confusing...

---
To Mike: I will follow your advice and start with a verification and evaluation step. The idea of going embedded came from my bosses... and now I am forced to make my brain work in parallel: rewriting the algorithm as well as looking for an embedded architecture for it :).

On Oct 4, 10:58 pm, LucienZ <lucien.zh...@gmail.com> wrote:
> (snip)
>
> I see Cyclone solutions based on Cortex-M1, but I think the M3 is for
> ASICs. It's just a little bit confusing...

my mistake, M1 of course

Antti

Hi all,

after some simulations I finally connected a logic analyzer, and discovered that the results match the simulation, except for some very short spikes on the SPI bus. The clock is 12 MHz, the SPI clock is about 6 MHz.

Just wondering if this is due to the LA sampling (100 MHz), the unshielded probes, or whether the "circuit" really is not simulated accurately and this garbage is coming out of the Spartan 3E for whatever reason.

Expected:
        ______________
_______/              \___

Sampled:
        _   __________
_______/ \_/          \___

If I sample at 50 MHz, there is no sign of spikes. It seems to be the LA... any idea?

Thanks in advance,
Giuseppe Marullo

Giuseppe Marullo <giuseppe.marullonospam@iname.com> wrote:
< after some simulations I finally connected a Logic Analyzer, and
< discovered that the results are within the simulation, except some very
< short spikes on the SPI bus.
< The clock is 12Mhz, spi clock is about 6MHz.
(snip)

Not knowing at all what your design looks like, a proper synchronous design won't have such spikes. That means latching those outputs that come out of logic likely to spike (most logic with more than one path from the previous latch). With FPGAs and IOB FFs that is usually pretty easy to do.

If you do a post-route simulation, you should see whether the logic design generates them. Pre-route will normally not generate them.

(snip)

-- glen

On Oct 4, 8:51 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
> Not knowing at all what your design looks like, a proper synchronous
> design won't have such spikes. That means latching such outputs
> that come out of logic likely to spike. (Most logic with more
> than one path from the previous latch.) With FPGAs and IOB
> FFs that is usually pretty easy to do.
>
> If you do a post-route simulation, you should see such in the logic
> design generates them. Pre-route will normally not generate them.
>
> (snip)
>
> -- glen

If you suspect the logic analyzer is the culprit, I would use an oscilloscope instead. Since your signal is pretty slow, it's possible that the rise time is also slow, and that could cause sampling problems on the logic analyzer as it slowly slews through the logic threshold region. A typical cause of this might be the use of an open-drain signal with an inadequate pullup (for the logic analyzer) or a bus with a lot of capacitance. If the SPI device has hysteresis on its input it might not see the spike.

Regards,
Gabor

Hi,

For fun, I'm trying to code up a 32x32 multiplier (R = X*Y) using 4 layers of CSA and radix-4 Booth encoding. This is not targeting an FPGA, so no using the "*" operator in Verilog. My block diagram basically has 4 layers of CSA, with each layer able to compute 2 bits of the multiplication. Thus, it takes 4 clock cycles to multiply 32x32.

While I understand Booth encoding, the problem I'm having is the actual implementation. The design must handle unsigned/signed multiplication. My procedure is as follows:

1. Clear the partial sum and partial carry registers (64 bits).
2. For each CSA, choose the appropriate 3 bits for the Booth select and determine whether to add X, 2X, -X, or -2X. If adding a negative term, add a 1 to the LSB of the partial product.
3. The lower two bits of each CSA are concatenated as {csa4_ps[1:0], csa3_ps[1:0], csa2_ps[1:0], csa1_ps[1:0]} and 8 bits of carry are created as {0, pp4_cin, 0, pp3_cin, 0, pp2_cin, 0, pp1_cin} (this just contains the +1 to be added if we're adding a negative term). These are rotated to the top MSB bits of the partial sum/carry register every clock cycle.
4. Within the CSA, the partial sum from CSA1 is right shifted two bits (as csa#_ps[1:0] require no further processing until the final addition) and the bits 01 are added at the MSB. This, along with the partial carry, is then fed to the next row of CSA.
5. After four stages, the partial sum/carry are stored in the lower 32 bits of their respective registers and then fed back to the top for the next cycle.
6. In the last step, carry-lookahead addition is performed on the upper 32 bits of partial sum/carry (which are actually the lower 32 bits of the product). The carry out from here is the carry in to the lower 32 bits of partial sum/carry (which are actually the upper 32 bits of the product).

My problem arises from handling sign extension and carry-save addition.
In essence, I'm following the rule on this page: http://www.geoffknagge.com/fyp/booth.shtml#sign

Trying an example by hand, I don't seem to get the appropriate results with CSA. I get the right results with normal addition. I've checked the work and it looks OK, but I can't figure out which part of the logic is wrong. Maybe some of you have some ideas of what I'm doing wrong.

An example:

X = 001011 = 11 (multiplicand)
Y = 010011 = 19 (multiplier)
Expected result = 209 = 11010001

First, we need to add -X (110), then add X (001), then add X (010). The above site says to invert the MSB of the partial product and add "01" to the front of this (this is the same as appending 01 to the partial sum since the lower 2 bits are shifted out, I think). Furthermore, for the first partial product, add a 1 to the MSB.

PS = partial sum, PC = partial carry, Z = partial product

Iteration 0:
PS: 000000
PC: 000000
Z:  010100   (-X with MSB inverted)   01 = temp_carry (to be added later)
----------------
PS: 010100
C:  000000

Now PS is shifted right 2 bits and 01 is appended; temp_ps = 00, temp_carry = 01.

Iteration 1:
PS: 010101
C:  000000
Z:  101011   (X with MSB inverted)   temp_ps = 00, temp_carry = 0001
----------------
PS: 111110
C:  000001

Now PS is shifted right 2 bits and 01 is appended; temp_ps = 1000, temp_carry = 0001.

Iteration 2: (final)
PS: 011111
C:  000001
Z:  101011   (X with MSB inverted)   temp_ps = 1000, temp_carry = 0001
---------------
PS: 110101
C:  001011

Now PS is shifted right 2 bits and 01 is appended; temp_ps = 011000, temp_carry = 000001.

PS: 011101
C:  001011

To get the lower 6 bits of the product, add temp_ps + temp_carry = 011000 + 000001 = 011001.

To get the upper 6 bits of the product, add PS + C + Cin, with Cin = 1 (for "add 1 to MSB of FIRST partial product"). So 011101 + 001011 + 1 = 101001.

So this gives me a result of 101001 011001 = 2649, which is clearly wrong. Any ideas where I'm going wrong? Is there a better way to think about this?

Thanks for any help.

Kind regards.
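For cross-checking a datapath like this, a plain behavioral golden model can help: the sketch below (Python; function name and structure are my own, not from the post) applies the same radix-4 Booth recoding without any carry-save tricks, so each loop iteration corresponds to one CSA row and intermediate accumulator values can be compared against the hardware's PS/PC pairs. It treats both operands as n-bit two's complement; for unsigned operands you would zero-extend them by two bits first.

```python
def booth_radix4(x, y, n):
    """Radix-4 Booth multiply of two n-bit two's-complement words.

    Scans the multiplier y in overlapping 3-bit groups (with an
    implicit y[-1] = 0) and accumulates 0, +X, +2X, -X or -2X
    partial products. Intended as a golden model for checking a
    CSA implementation iteration by iteration.
    """
    assert n % 2 == 0
    if x >= 1 << (n - 1):
        x -= 1 << n                       # sign-extend the multiplicand
    acc = 0
    y_ext = y << 1                        # append the implicit y[-1] = 0
    for i in range(n // 2):
        group = (y_ext >> (2 * i)) & 0b111
        sel = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
               0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}[group]
        acc += (sel * x) << (2 * i)       # one "CSA row" per iteration
    return acc & ((1 << (2 * n)) - 1)     # wrap to a 2n-bit result

print(booth_radix4(0b001011, 0b010011, 6))   # 11 * 19 -> 209
```

For Y = 010011 the three groups select -X, +X, +X, matching the hand example above.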

pallav <pallavgupta@gmail.com> wrote:
< For fun, I'm trying to code up a 32x32 multiplier (R = X*Y) using 4
< layers of CSA and radix-4 booth encoding.
(snip of explanation and code)

Hopefully you can get the unsigned case working first. Then, for the twos complement (signed) case, remember that the result is what you would get giving negative weight instead of positive weight to the sign bit. For a negative multiplier, subtract the multiplicand instead of adding for the MSB. If the multiplicand is negative, sign extend it.

-- glen
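Glen's negative-weight rule can be demonstrated directly in software: the sketch below (Python, hypothetical names, a model rather than anyone's actual RTL) is a plain shift-add multiply in which the multiplier's MSB is subtracted instead of added, and the multiplicand is sign-extended when negative — exactly the two adjustments described.

```python
def signed_shift_add(x, y, n):
    """Shift-add multiply of two n-bit two's-complement words.

    Each set multiplier bit adds a shifted copy of x, except the MSB,
    which carries negative weight and is therefore subtracted. The
    multiplicand is sign-extended (as a Python int) before shifting.
    """
    if x & (1 << (n - 1)):
        x -= 1 << n                      # sign-extend negative multiplicand
    acc = 0
    for i in range(n):
        if (y >> i) & 1:
            term = x << i
            acc += -term if i == n - 1 else term   # MSB: subtract
    return acc & ((1 << (2 * n)) - 1)    # wrap to a 2n-bit result

print(signed_shift_add(0b001011, 0b010011, 6))   # 11 * 19 -> 209
```

With both operands negative the two sign adjustments cancel, giving the expected positive product.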

On Oct 2, 12:12 pm, akohan <amit.ko...@gmail.com> wrote:
> On Oct 2, 11:05 pm, akohan <amit.ko...@gmail.com> wrote:
> > Hello group,
> >
> > I am about to start working with ML410 which is V4 Xilinx board, so
> > far I have used Altera and Spartan 3E for academic works. I will
> > appreciate it if you could guide me:
> >
> > 1) where to start and what I should know in advance.

For Xilinx questions I like: <http://www.xilinx.com>

> > 2) I have a UCF file which defines some constraints on Spartan 3E, can
> > I use it on this board too or connections are different?

The UCF belongs to the board. New board, new UCF.

> One thing I forgot to ask: do I need Linux or Windows for
> embedded coding with C?

Yes.

Algorithm development on FPGAs for maximum speed does not normally use 'cores' like you're talking about. If your algorithm does make it to an ASIC, it VERY likely should not be in the form of a multi-core platform. At least, not in the traditional processing-core sense.

FPGAs are good at direct hardware implementation of algorithms. They're pretty slow at simulating processors running embedded software. Everyone seems to be discussing the best soft core to put in FPGAs, but the whole premise of using processor cores in FPGAs like you're talking about sounds flawed to me. The ultimate speed is going to come from coding your algorithm itself in FPGA or ASIC gates. Don't waste FPGA gates to simulate a processor, which in turn runs normal embedded software code (very slowly). Directly implement your algorithm in hardware.

For vision algorithms, you could try Matlab-to-hardware or C-to-hardware workflows. However, for the best speed (and the most vendor independence) you're likely to need to write a hardware algorithm description in VHDL or Verilog. You would normally code a VHDL/Verilog module that implements your algorithm in hardware, and instantiate as many copies of that module as will fit in an FPGA. You might then include a single soft processor core to manage the whole thing. That, or you'd simply provide an external interface to a normal processor or DSP, making it look like a memory-mapped device or FIFO.

If your ASIC is just going to be a bunch of processing cores, you could probably do it all MUCH more cheaply using some NVIDIA GT200-based video card processing using CUDA (or ATI's stuff... I'm not trying to play favorites... it's just what I know). You're going to get way more processing speed that way. You're very unlikely to do better than NVIDIA does. (The modern video card GPUs can now do generalized parallel processing an order of magnitude (or two!) faster than the CPU.) You might also consider a small Linux computing cluster. That's a lot easier to write code for.

Darron <darron.black@gmail.com> wrote:
< Algorithm development on FPGAs for maximum speed do not normally use
< 'cores' like you're talking about. If your algorithm does make it to
< an ASIC, it VERY likely should not be in the form of a multi-core
< platform. At least, not in the traditional processing core sense.

I think I agree, and that is one reason I am against using serial languages (such as C) as hardware description languages.

< FPGAs are good at direct hardware implementation of algorithms.
< They're pretty slow at simulating processors running embedded
< software.
< Everyone seems to be discussing the best soft core to put in FPGAs,
< but the whole premise of using processor cores in FPGAs like you're
< talking about sounds flawed to me.

I agree. Though sometimes you need control processors for direct hardware implementations of the algorithm, and sometimes they can best be implemented in a soft processor. Most likely, though, that should be one designed for efficient FPGA implementation.

< The ultimate speed is going to come from coding your algorithm itself
< in FPGA or ASIC gates. Don't waste FPGA gates to simulate a
< processor, which in turn runs normal embedded software code (very
< slowly). Directly implement your algorithm in hardware.

My favorite architecture for FPGA implementations is the systolic array. Systolic arrays work especially well with FPGAs that have a FF for each LUT, and a clock tree designed to clock them all together.

< For vision algorithms, you could try Matlab-to-hardware or C-to-
< hardware workflows. However, for the best speed (and the most vendor
< independence) you're likely to need to write a hardware algorithm
< description in VHDL or Verilog.

And likely it won't look anything like the serial description of an algorithm for the same function.

< You would normally code a VHDL/Verilog module that implements your
< algorithm in hardware, and instantiate as many copies of that module
< as will fit in an FPGA. You might then include a single soft
< processor core to manage the whole thing. That, or you'd simply
< provide an external interface to a normal processor or DSP making it
< look like a memory mapped device or FIFO.

Yes. Though a small on-chip FIFO might help.

(snip)

-- glen

I have not looked at a carry-save implementation, but I don't think that would matter. IIRC, Booth's algorithm automatically handles signed numbers because of the subtractions required.

I recently coded an iterative multiplier and chose the simple shift-add multiplier, the same as you would do by hand. My target was an FPGA with 4-input LUTs and a built-in carry chain. In that situation, the Booth multiplier uses the same amount of resources, since an N-bit adder uses no more resources than an N-bit mux. The shift-add multiplier would actually be simpler, but to handle a signed multiplicand, both the multiplier and multiplicand must be negated, which uses an extra N LUTs.

When I looked at the Booth's-algorithm multiplier, I was pretty sure that the algorithm works equally well with either signed or unsigned numbers. So you might want to pare your code down to a small implementation, such as a 4x4 multiplier, and step through each cycle and verify that the code is producing what you expect.

Rick

On Oct 5, 1:32 pm, pallav <pallavgu...@gmail.com> wrote:
> (snip of multiplier description and worked example)
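Rick's "pare it down to 4x4 and step through each cycle" suggestion is easy to prototype in software first. The sketch below (Python; an unsigned model with hypothetical names, not anyone's actual RTL) performs one multiplier bit per "clock" and prints the accumulator each cycle, giving a per-cycle reference trace to diff against a simulation waveform.

```python
def shift_add_trace(x, y, n=4):
    """Unsigned iterative shift-add multiply, one multiplier bit per
    'clock cycle', printing the accumulator after each cycle so it can
    be compared against a hardware simulation trace."""
    acc = 0
    for cycle in range(n):
        if (y >> cycle) & 1:
            acc += x << cycle            # add the shifted multiplicand
        print(f"cycle {cycle}: acc = {acc:0{2 * n}b}")
    return acc

result = shift_add_trace(0b1011, 0b0101)   # 11 * 5
print(result)                              # 55
```

Once the 4x4 trace matches the hardware cycle for cycle, scaling the width back up is mostly mechanical.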

Hi Rick/Glen,

Thanks a lot for these responses. I am looking at a smaller bit-width and trying to get that built first. I will keep your pointers in mind. I'm also reading a few papers that discuss various implementations, particularly regarding sign extension. From what I gather, for an N-bit multiplier, 2 extra bits seem to be used for sign extension. Even in the unsigned case, the partial products have to be sign extended due to the subtraction (I think). The main problem is figuring out how many extra bits are necessary, what they are set to, and how the partial sum/carry is working.

From the description I posted above, I think I had the right concept. But maybe after reading these papers, something different might be involved. I will work on this in more detail and report my findings once I get somewhere.

Thanks for your time. Kind regards.

Hi,

I never tried CUDA, I have just read about it. Maybe someone who uses FPGAs also has some experience with CUDA.

I developed a system with an FPGA that calculates an algorithm needing about 200 64-bit floating-point operations: add, multiply, divide, square root, sine and cosine. The algorithm is calculated in about 80 steps, so every time step 2.5 floating-point operations are done and the results are used in the next time step. For the whole algorithm the FPGA needs 1300 clocks. That makes 6.5 clocks per FP operation (average).

With a Xilinx Virtex-4 LX60 the floating-point cores run at 50 MHz. This makes it possible to achieve about a 35 kHz update rate for this algorithm.

How fast could a CUDA system do these calculations? My ADCs can run at 100 kSample/s, so it would be nice to speed the whole thing up a little bit.

olliH <oliver.hofherr@googlemail.com> writes:

> I never tried CUDA, I just read about it. Maybe someone who uses FPGAs
> also has some experience with CUDA.

We published some papers with GPU (CUDA) - FPGA comparisons.

> I developed a system with a FPGA that calculates an algorithm that
> needs about 200 64-bit floating point operations. (snip)
>
> How fast could a CUDA-System make this calculations?

Very fast - OK, one showstopper at the moment is your double-precision requirement. That will change with the upcoming Fermi architecture, but so far the GPUs just support single precision (at reasonable speed). The good boards have 30 multiprocessors, each with 8 scalar datapaths which operate in SIMD. And for these 8 scalar datapaths there exists just one double ALU (you could also emulate double with multiple single-float operations).

But even when you look at the worse double-precision performance you get the following numbers: 30 multiprocessors (if you needed just single precision it would be 240 scalar processors) at 600 MHz, which makes 18e9 computing cycles/s. If we say 10 cycles per operation, we have 18e9/(10 * 80 steps) = 22.5e6 of your computations in one second (i.e. 22.5 MSamples/s). These numbers do not include the IO to/from the graphics board, but your data rates seem far below the ~50 GB/s available, so they should be no problem.

> My ADCs can run with 100 kSample/s so it would be nice to speed the
> whole thing a little bit up.

Speed is not everything. The numbers above are huge - but they are only throughput. The drawback you get with your huge computation power is latency. As you compute 30 samples at the same time, you won't have the result of the 1st before the 30th. (BTW, I assume in the above calculation that the samples can actually be computed independently of each other - if you have dependencies (feedback or similar), a GPU makes no sense at all.) Also, if your problem does not constantly demand the full high rate, your power efficiency (FLOPS/Watt) is very bad.

Florian
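The back-of-envelope numbers traded in this subthread can be reproduced directly. Nothing below is measured; it just redoes the arithmetic on the figures quoted in the two posts (50 MHz / 1300 clocks for the FPGA; 30 multiprocessors at 600 MHz, ~10 cycles per operation, 80 steps per sample for the GPU).

```python
# FPGA: 1300 clocks per algorithm pass at 50 MHz (olliH's figures)
fpga_rate = 50e6 / 1300                        # update rate in Hz
print(f"FPGA: {fpga_rate / 1e3:.1f} kHz")      # ~38.5 kHz ("about 35 kHz")

# GPU estimate (Florian's figures): 30 multiprocessors at 600 MHz,
# ~10 cycles per double-precision op, 80 steps per sample
gpu_rate = 30 * 600e6 / (10 * 80)
print(f"GPU:  {gpu_rate / 1e6:.1f} MSamples/s")  # 22.5 MSamples/s
```

The roughly three-orders-of-magnitude gap is throughput only; as the reply notes, latency and data dependencies can erase it entirely.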

Hi,

I have one running on Win98, and I have reason to believe that you can't get it to work on WinXP. When I tried it on XP, MaxPlus2 could not access the hardware; later on, using the very same PC but running Win98, it all worked just fine.

lc.

anotherUserName wrote:
> I have an (obsolete) Altera Logic Programmer Card (PLP6) that I cannot get
> to work. I installed MaxPlus2 v 10.23 and the PLP6 driver on a WinXP box.
> When I attempt to set up the card in MaxPlus2, it tells me that it
> cannot find the card. Do you have any information about this card? I can
> find nothing online. Do you have a datasheet? There are 5 red LEDs on the
> card. Do you know what they mean? Thank you.

On Mon, 5 Oct 2009 13:13:41 -0700 (PDT), d_s_klein <d_s_klein@yahoo.com> wrote:
> (snip)
>
> The UCF belongs to the board. New board, new UCF.

Half right. The pin allocations definitely belong to the board. However, the other constraints (I/O standards, timing constraints, etc.) may not, unless you have also changed the clock rates and the peripherals to which you are interfacing.

Any RLOCs or other internal constraints to ensure good placement will remain (though if you change the FPGA you probably need to revisit these too).

- Brian

Dear Group,

I have very little experience in FPGAs (and in digital design!).

As part of a research project I have to add a pulse programmer (PP) to an existing MicroBlaze system, implemented on a Spartan 3 Starter Kit board. A PP is a system that outputs a given pattern to a set of digital lines for a given time and then changes the pattern according to a program.

There are several ways of implementing the PP, but I have decided to use what seems to me to be the simplest one: two blocks of RAM, say 2K deep and 16 bits wide, pointed to by the same address counter. One block holds the time duration and the other the bit pattern. The control block loads the contents of the first RAM into a counter and latches the contents of the second one to the output. When the counter reaches the end, the AC is incremented and the next time and pattern words are loaded.

The IP's access to the RAM must be fast (this determines the time resolution of the PP), but the processor's access can be slow, since it happens only once at the beginning of the experiment, to write the programming words; after that the IP works on its own.

I was thinking of using BRAM to hold the data. Is this a good choice? My other question is: what is the easiest way to give the MicroBlaze access to the RAM? I appreciate any comments on this.

I would also appreciate it if you could point me to some examples or application notes for a similar system (not a PP, but a system where memory is accessed by an IP and by MicroBlaze), where I can get some ideas.

Thank you very much,

jmariano
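Before committing the two-RAM scheme to HDL, it can be sanity-checked with a small behavioral model. In the sketch below (Python; all names are hypothetical, this is not the poster's design), two lists stand in for the duration and pattern BRAMs, a shared address counter indexes both, and a down-counter holds each pattern on the output for its programmed number of ticks.

```python
def pulse_programmer(dur_ram, pat_ram):
    """Behavioral model of the two-RAM pulse programmer: one address
    counter indexes both RAMs; the duration word loads a down-counter
    and the pattern word is latched on the outputs until it expires."""
    ac = 0                               # shared address counter
    trace = []                           # output pattern, one entry per tick
    while ac < len(dur_ram):
        count = dur_ram[ac]              # load counter from duration RAM
        pattern = pat_ram[ac]            # latch pattern from pattern RAM
        trace.extend([pattern] * count)  # hold the output for 'count' ticks
        ac += 1                          # counter expired: advance address
    return trace

# program: pattern 0b0001 for 3 ticks, then 0b0110 for 2 ticks
print(pulse_programmer([3, 2], [0b0001, 0b0110]))   # [1, 1, 1, 6, 6]
```

Feeding the same program words to the model and to an HDL simulation makes it easy to compare output traces tick by tick.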

jmariano <jmariano65@gmail.com> wrote:
< As part of a research project I have to add to an existing microblaze
< system, implemented on spartan 3 starter kit board, a pulse programmer
< (PP). A PP is a system that outputs a given pattern to a set of
< digital lines for a given time and then changes the pattern according
< to a program.
(snip)
< The IP access to RAM must be fast (this determines the time resolution
< of the PP), but the access of the processor can be slow.

There are tricks that can be used if RAM access isn't fast enough. For one, you want to fetch the next value from RAM and have it ready in a register as soon as the previous one is being clocked out. That probably works as long as the count isn't too small (like one). Otherwise it sounds fine.

By the way, if this is homework, be sure to reference the newsgroup as a source for any ideas that you use.

-- glen

On 6 Okt., 17:42, jmariano <jmarian...@gmail.com> wrote: > Dear Group, > > I have very little experience in FPGA (and in digital design!). > > As part of a research project I have to add to an existing microblaze > system, implemented on spartan 3 starter kit board, a pulse programmer > (PP). A PP is a system that outputs a given pattern to a set of > digital lines for a given time and then changes the pattern according > to a program. > > There are several ways of implementing the PP but I have decided to > use what seams to me to be the simpler on: two blocks of RAM, say 2K > deep and 16 bits wide, pointed by the same address counter. One block > holds the time duration and the other the bit pattern. The control > block load's the contents of the first ram into a counter and latches > the content of the second one to the output. When the counter reaches > the end, the AC in incremented and the next time and pattern words are > loaded. > > The IP access to RAM must be fast (this determines the time resolution > of the PP), but the access of the processor can be slow, since this is > done only once at the beginning of the experiment to write the > programming words and then the IP works by is one. > > I was thinking on using BRAM to hold the data. Is this a good choice? > My other question is, what is the easiest way to implement microblaze > access the ram? I appreciate any commets on this. > > I also appreciate if you could point me to somme examples or > application notes of a similar system (not a PP, but a system were > memory is accessed by an IP and microblaze), were I can get somme > ideas. > > Tank you very much, > > jmariano Hi, if the processor is just needed for loading the RAM, a MicroBlaze may be oversized for such a small task. Picoblaze would be sufficient. And it comes with a UART, if you want to use that kind of interface. If you are going to use more complex intefaces like LAN a Microblaze would be a better choice, of course. 
For the MicroBlaze there's an SRAM interface core available. It's intended to access register banks of peripheral devices, and it is sometimes used to connect to the LAN interface chips on the development boards. But that's just what you need, because your PP is simply an interface with a large register set (2K). Remember that you have to budget your BRAM resources: MicroBlaze needs some BRAMs for cache etc. Or you might implement the core without cache; you probably don't need that much performance.

Have a nice synthesis
Eilert

On Oct 7, 4:42 am, jmariano <jmarian...@gmail.com> wrote:
>
> The IP access to RAM must be fast (this determines the time resolution
> of the PP),

Not quite. If you load pulse counters, for example, the RAM sets the reload time, but the time resolution can be finer than the RAM access time; RAM access only sets the pulse update rate.

You could also use a simple scheme like run-length coding to expand (compress?) what the RAM holds, relative to the pulse resolution.

With an FPGA, you have a LOT of design freedom :)

-jg
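The run-length idea amounts to storing (pattern, repeat-count) pairs rather than one word per output tick, which is exactly the pair of words the duration RAM and pattern RAM hold in the two-RAM scheme upthread. A quick Python sketch of the compression side, with made-up trace data:

```python
from itertools import groupby

def rle_compress(trace):
    """Collapse a per-tick pattern trace into (pattern, count) pairs,
    i.e. the word pairs the duration RAM and pattern RAM would store."""
    return [(pattern, len(list(run))) for pattern, run in groupby(trace)]

# A 6-tick trace compresses to three RAM entries.
entries = rle_compress([0xA5, 0xA5, 0xA5, 0x5A, 0x5A, 0xFF])
```

Running this over a realistic pulse program tells you quickly how much RAM depth the coding actually saves for your pattern statistics.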

> I am about to start working with the ML410, which is a Virtex-4 Xilinx
> board; so far I have used Altera and Spartan-3E for academic work. I
> would appreciate it if you could guide me:
>
> 1) where to start and what I should know in advance.

I would start with some of the reference designs (http://www.xilinx.com/products/boards/ml410/reference_designs.htm). Be aware that the ML410 is a relatively old board, so you may need to get versions of ISE/EDK to match the ones used to create the reference designs.

> 2) I have a UCF file which defines some constraints on Spartan 3E; can
> I use it on this board too, or are the connections different?

Depending on your application, you should be able to use most if not all of the constraints from the ML410 reference designs.

> One thing I forgot to ask: do I need Linux or Windows for embedded
> coding in C?

Depending on your application, you may be able to have a standalone system without an OS. Again, check the ML410 reference design page.

Cheers,
Jim
http://myfpgablog.blogspot.com/

> Regards,
> Amit

On Oct 5, 8:03 pm, pallav <pallavgu...@gmail.com> wrote:
> Hi Rick/Glen,
>
> Thanks a lot for these responses. I am looking at a smaller bit width
> and trying to get that built first. I will keep your pointers in mind.
> I'm also reading a few papers that discuss various implementations,
> particularly regarding sign extension. From what I gather, if you have
> an N-bit multiplier, 2 extra bits seem to be used for sign extension.
> Even in the unsigned case, the partial products have to be
> sign-extended because of the subtraction (I think).
>
> The main problem is figuring out how many extra bits are necessary,
> what they are set to, and how the partial sum/carry works. From the
> description I posted above, I think I had the right concept, but maybe
> after reading these papers something different will turn up.
>
> I will work on this in more detail and report my findings once I get
> somewhere.
>
> Thanks for your time.
>
> Kind regards.

Yes, you need to consider the length of the partial products at each point in the calculation. In the case of Booth's algorithm, it makes sense to me that you will need two extended bits, since you can add not just X but 2X. I don't remember exactly how I handled this in my code, but I believe I had to use an extra bit in my product register, or I may have dealt with it by shifting the data as it was saved in the register. Then the extra bit was only needed in the calculation, not in the register... a very small saving, but it was part of the shifting and so automatic, IIRC.

Rick
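For what it's worth, the recoding itself can be checked behaviorally before worrying about hardware bit widths. Below is a radix-2 Booth sketch in Python; the 2X term Rick mentions belongs to radix-4 (modified) Booth, whose ±2X digits are what force the second extension bit. The width parameter `n` is a hypothetical operand width, not from the thread:

```python
def booth_multiply(x, y, n):
    """Multiply signed integers x, y (both representable in n-bit two's
    complement) via radix-2 Booth recoding. Each bit pair (y[i], y[i-1])
    selects a partial product: 01 -> +X, 10 -> -X, 00/11 -> nothing.
    In hardware, each (x << i) partial product must be sign-extended to
    the full 2n-bit product width -- that is where the extra bits go."""
    product = 0
    prev = 0                       # implicit y[-1] = 0
    for i in range(n):
        cur = (y >> i) & 1         # Python's >> sign-extends negative y
        if (cur, prev) == (0, 1):
            product += x << i      # Booth digit +1
        elif (cur, prev) == (1, 0):
            product -= x << i      # Booth digit -1
        prev = cur
    return product
```

A string of ones in y collapses into one add and one subtract (e.g. y = 7 becomes +8X - X), which is the whole point of the recoding.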

>"maxascent" <maxascent@yahoo.co.uk> wrote in message
>news:HpqdnQS28-7PJwbXnZ2dnUVZ_umdnZ2d@giganews.com...
>> Yes, I understand what it is, but I want some method to use HDMI with
>> a Virtex 5.
>
>You could use VHDL or Verilog, or, if you know what you are doing,
>schematic capture.

Hi! Were you able to find a solution to this problem? I also need to know how to convert between LVDS and TMDS for an HDMI interface.
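On the coding side (the electrical LVDS-to-TMDS level conversion is a separate, board-level matter), TMDS as defined in the DVI 1.0 specification is a two-stage algorithm: an XOR/XNOR transition-minimizing recode of the 8 data bits into 9 bits, then a conditional inversion for DC balance that adds a 10th bit. A behavioral Python sketch for generating test vectors, not production code:

```python
def tmds_encode(d, cnt=0):
    """Encode one data byte d (0..255) into a 10-bit TMDS symbol.
    cnt is the running disparity carried over from the previous symbol.
    Returns (symbol, new_cnt). Control-period symbols are not handled."""
    ones = bin(d).count("1")
    # Stage 1: transition-minimized 9-bit code. Use the XNOR chain when
    # the byte is ones-heavy; bit 8 records which chain was used (1 = XOR).
    use_xnor = ones > 4 or (ones == 4 and (d & 1) == 0)
    q = d & 1
    for i in range(1, 8):
        prev, b = (q >> (i - 1)) & 1, (d >> i) & 1
        q |= ((1 ^ (prev ^ b)) if use_xnor else (prev ^ b)) << i
    if not use_xnor:
        q |= 1 << 8
    # Stage 2: DC balance. Conditionally invert the low 8 bits, record the
    # inversion in bit 9, and track the running disparity in cnt.
    n1 = bin(q & 0xFF).count("1")
    n0 = 8 - n1
    if cnt == 0 or n1 == n0:
        out = q | (((~q >> 8) & 1) << 9)
        if (q >> 8) & 1:
            cnt += n1 - n0
        else:
            out = (out & 0x300) | ((~q) & 0xFF)
            cnt += n0 - n1
    elif (cnt > 0 and n1 > n0) or (cnt < 0 and n0 > n1):
        out = (1 << 9) | (q & 0x100) | ((~q) & 0xFF)
        cnt += 2 * ((q >> 8) & 1) + n0 - n1
    else:
        out = q
        cnt += -2 * (1 - ((q >> 8) & 1)) + n1 - n0
    return out, cnt

def tmds_decode(sym):
    """Recover the data byte from a 10-bit TMDS symbol."""
    q = (~sym & 0xFF) if (sym >> 9) & 1 else sym & 0xFF  # undo inversion
    xor_used = (sym >> 8) & 1
    d = q & 1
    for i in range(1, 8):
        a, b = (q >> i) & 1, (q >> (i - 1)) & 1
        d |= ((a ^ b) if xor_used else (1 ^ (a ^ b))) << i
    return d
```

Round-tripping every byte through encode/decode is a cheap sanity check for an HDL implementation's testbench vectors.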

Hi, I really need help. I have an image scaler project, but I am very new to this. I found a topic here where many experts gave suggestions, so I am trying to get help here too. My project needs to transform RGB 640x480 to RGB 1024x960, and all I have is a Spartan-3E board and Google. I am trying to use Verilog to write a scaler based on linear interpolation. If I finish it, how can I put it on the board? And how do I read the image in and output it? Thanks!!!
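Before writing any Verilog, it may help to prototype the interpolation in software and keep its output as a golden reference for the hardware testbench. A minimal single-channel (grayscale) bilinear sketch; for RGB you would run it once per channel, and in hardware the per-pixel ratios become fixed-point accumulators reading from a two-line buffer:

```python
def scale_bilinear(src, out_w, out_h):
    """Scale a 2-D list of pixel values to out_w x out_h using bilinear
    interpolation. Corner pixels map exactly onto corner pixels."""
    src_h, src_w = len(src), len(src[0])
    dst = []
    for oy in range(out_h):
        # Fractional source row for this output row.
        fy = oy * (src_h - 1) / (out_h - 1) if out_h > 1 else 0
        y0 = int(fy)
        y1 = min(y0 + 1, src_h - 1)
        wy = fy - y0
        row = []
        for ox in range(out_w):
            fx = ox * (src_w - 1) / (out_w - 1) if out_w > 1 else 0
            x0 = int(fx)
            x1 = min(x0 + 1, src_w - 1)
            wx = fx - x0
            # Blend the 2x2 neighborhood: horizontally, then vertically.
            top = src[y0][x0] * (1 - wx) + src[y0][x1] * wx
            bot = src[y1][x0] * (1 - wx) + src[y1][x1] * wx
            row.append(top * (1 - wy) + bot * wy)
        dst.append(row)
    return dst
```

For getting images in and out of the Spartan-3E board, a common low-effort route is streaming raw pixels over the board's serial port from a PC script, which avoids building a video input path just to test the scaler core.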
