"Ann" <ann.lai@analog.com> wrote in message news:ee8d229.1@webx.sUN8CHnE... > Hi, I have read these materials before I wrote the code, and I have just re-read it, seems like the way that I instantiate the code is write. I don't know why the data is not there though. Does anyone have an example or something? Thanks, Ann I looked at your code segments earlier and it looks 100% correct. The state machine goes through 256 writes to the same address. The first write to that address should produce a valid read value. As long as your clock is verifiably there, I'd suggest you could have a mistake in the code that reports the read value *suggesting* that the read value is zero when it isn't. I hope you find the trouble. - John_H (by the way, your posts aren't wrapping when sent causing some problems in other newsreaders)Article: 81926
mk wrote: > On 04 Apr 2005 19:48:58 +0100 (BST), Thomas Womack > <twomack@chiark.greenend.org.uk> wrote: > > >>Is there any way of using the Xilinx toolchain on a Mac? >> >>I have become spoiled by my Mac Mini, and unpacking my loud PC >>just to run place-and-route seems inelegant. >> >>Tom > > > give this a try > http://www.microsoft.com/mac/products/virtualpc/virtualpc.aspx Ahh... Even if it runs, expect at least a 10-fold performance decrease on PAR (assuming that you have already upgraded the Mac mini's pathetic 256MB of RAM). Sadly, PAR is already slow enough as it is. With the time you would waste waiting for it to finish, you might as well install a water-cooled x86 box. Better yet, running the tools on an xbox (with linux) might even be faster (http://www.xbox-linux.org/)! -jzArticle: 81927
perltcl@yahoo.com wrote: >hi > >I need help with my async design. I'm using a Xilinx Virtex-II. I'm very >new with async stuff and so my understanding is very limited-- >particularly different FPGA architectures (and async terminology). > >Here is what I want to do: > >module async(clk,loopbackclk,....) >input clk; >output loopbackclk; reg loopbackclk; >// decl, init and reset stuff omitted >always @(clk) >begin > case (state) > 0: begin // do stuff > state <= state + 1; > loopbackclk <= state; > end > 1: begin // do stuff > state <= state + 1; > loopbackclk <= state; > end > .... > endcase >end >endmodule > >Now in my top module: > >module top() >wire clk,loopbackclk; >async a(clk,loopbackclk,...); > >// now depends on what I use for synthesis -- >module SOME_BUF_STUFF?(O,I); // if using xilinx tools > assign clk = I; // maybe O > assign loopbackclk = O; // maybe I >endmodule // end loopback > >// I'm totally blank here, please tell me what do I do if using Icarus >verilog >param(.... clk .... // if using Icarus verilog >param(.... loopbackclk ... // if using Icarus verilog > >endmodule // top module > >A few questions: >First, is there some generic "buffer" or "pipe" (or insert correct >terms here) > for different FPGAs so that I can loop my "state" back as "clk", so >that my state transitions only depend on internal circuitry, not on a >global clock. >Please give me specific "names" for them, so that I can actually try >it. > >Since I prefer using generic tools like Icarus Verilog, please help if >you know how to do it. (If possible, I only use vendor-specific tools >for P and R.) > >Thanks. > > > You are not likely to succeed without doing hand place and route because of the delays in the FPGA. You need to be very careful to eliminate hazards due to race conditions in your design using proper cover terms. Also, remember that the logic is implemented conceptually as small look-up tables, so you need to be careful about any glitches generated while traversing the LUT. Also, be aware that 'wires' in an FPGA add delay, so routing can become very important in order to avoid adding unintended delays. Yes, it can be done, but the existing tools are not meant for asynchronous design (and will quickly get you into trouble if you depend blindly on them), and the FPGAs are optimized for synchronous design. Generally speaking, you'll probably be using local signals for clock in this case, so the global clock buffers are likely to be of little or no interest to you (I think that is what you are asking). -- --Ray Andraka, P.E. President, the Andraka Consulting Group, Inc. 401/884-7930 Fax 401/884-7950 email ray@andraka.com http://www.andraka.com "They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759Article: 81928
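As a concrete illustration of the "cover terms" Ray mentions: the textbook static-1 hazard in a two-way multiplexer, y = (sel & a) | (~sel & b), can glitch low when sel changes while a and b are both 1, because one product term switches off before the other switches on. Adding the logically redundant consensus term (a & b) keeps the output covered across the transition. This is a generic sketch, not Xilinx-specific advice; as Ray notes, once the function is packed into LUTs the synthesis and mapping tools are free to re-optimise it, so an asynchronous design generally has to preserve and place such structures by hand.

module mux_cover (input sel, input a, input b,
                  output y_hazard, output y_covered);

    // Hazard-prone form: y can glitch when sel toggles with a == b == 1.
    assign y_hazard  = (sel & a) | (~sel & b);

    // Covered form: the redundant term (a & b) holds y high during the sel
    // transition. Synthesis will usually strip this "useless" term unless the
    // logic is instantiated directly or protected with keep/preserve attributes.
    assign y_covered = (sel & a) | (~sel & b) | (a & b);
endmodule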
"Bret Wade" <bret.wade@xilinx.com> schrieb im Newsbeitrag news:424DC52A.1060704@xilinx.com... > Antti Lukats wrote: > > > but with the Virtex 4 bug, thats a bit scarier > > > > a simple design with 16 counters connected to 16 pin locked GCK inputs. > > > > P&R fails, saing one signal is not fully routed > > Hello Antti, > > If PAR only fails to route a single signal, that's usually an indication > of a packing or placement problem leading to an unroutable connection, > rather than a congestion issue. These problems are usually not too > difficult to understand and correct with packing or placement > constraints. Although you are focused on the number of clocks in the > design, you don't say whether the unrouted signal is a clock net or > something else, so I won't speculate on the root cause. > > I suggest examining the design in FPGA Editor and trying to understand > where the routing conflict is. If you are unable to make any progress > with this method, I suggest opening a webcase and providing a test for > investigation. > > Regards, > Bret Wade > Xilinx Product Applications > Hm, thanks for some hints, but well I think there is a an issue related to V4 because the same design with 16 clock routes OK on V2 without any problems. the all design occupies less than 4% of the V4, so I am pretty confident the design is routable. And if simple design is routable there should be no reason to look into FPGA editor or add placement constraints to make the design routable. I will try to open a webcase too. AnttiArticle: 81929
While I was thinking up a nice intro, "Nemesis" wrote: >> ModelSim needs an env variable called LM_LICENSE_FILE which points to >> the license.dat file. When you install ModelSim as administrator I guess >> that this env. variable is installed in the private space of the >> administrator. Have you verified that LM_LICENSE_FILE is defined as >> system variable and not user variable? Can you see it when you log in as >> a normal user? > Now I can't check (I'll see tomorrow) but the error says that a "text > file" cannot be written. > However I'll try setting this environment variable. I checked, and the environment variable is correct. Today I saw that if I open the licensing wizard before opening ModelSim, it correctly reports that the license is valid, but when I open ModelSim as a normal user it stops working. -- BREAKFAST.COM Halted... Cereal Port Not Responding. |\ | |HomePage : http://nem01.altervista.org | \|emesis |XPN (my nr): http://xpn.altervista.orgArticle: 81930
Hi Jonathan, Thank you for your consideration. I have lots of little objects being fed to me one pixel at a time. It's a line scan sorting operation with multiple ejectors. I have an algorithm now that does blob labeling. I am thinking that as the blob grows, the rate that pixels are added to the left and right side should first increase, then decrease. If I see increase, decrease, and increase, that might indicate a convexity, and should be fairly simple to detect. At least this is what I am thinking today. I do need the concavity information because it has proven so far to be the best way to determine where to segment the objects that are touching. The standard erosion/dilation techniques only separate some of the objects. I can remove holes with a filter if that becomes a problem. So are you doing vision now? Brad "Jonathan Bromley" <jonathan.bromley@doulos.com> wrote in message news:1k0251h31thoai7cftfg4h3vjtam859ukr@4ax.com... > On Fri, 1 Apr 2005 11:03:22 -0800, "Brad Smallridge" > <bradsmallridge@dslextreme.com> wrote: > >>But how do you calculate, or otherwise detect the concavities? >>I have done some initial work with small areas and bit patterns >>but one soon runs out of logic gates. > > Most of the traditional binary image manipulation algorithms > use various types of linked memory structures for flexibility. > These don't work at all well in FPGA. When faced with the > limited memory and non-existent memory allocation opportunities > in an FPGA, you'll need special algorithms that are sure to be > application specific. The key questions, it seems to me, are... > > 1) How is the image presented to you? Do you get it pixel-by- > pixel from a camera of some kind, or are you given ready- > processed data structures from a CPU that writes to your > FPGA? > 2) How big is the largest object that you need to process? > For the most interesting image processing operations, > you need to be able to store a complete bitmap of the > whole object, which in practice means storing every > pixel in a rectangle at least big enough to hold the > object. > 3) Do you need to process multiple objects simultaneously? > I ask this because, the last time I did any binary vision > stuff, we were capturing images of codfish on a conveyor > belt. The belt went under a line-scan camera, and it was > wide enough that there could be several fish in view at > any one time. Our software needed to process all of them, > else you got an awful lot of waste fish on the floor :-) > 4) Are you interested in any included holes in the object, > or only its outline? If you care only about the outline, > then the extent of the image on each scan line can be > represented by only two numbers - the left and right > boundaries of the object on that line. > > Given the kind of data representation I outlined in (4), > there's a relatively simple algorithm for extracting the > convex hull. It requires only two additional bits of > storage associated with each scan line, in addition to > the left and right boundary information. However, it is > not totally FPGA-friendly because at each scan line you > have to look back over all previous scan lines on the > object. You may find it's best to include a little > CPU in the design, to help with the sequencing of this > or similar algorithms. > > I wonder... does your application REALLY need all the > concavity information? 
It may be possible to get all > the information you need from simple accumulated > measurements such as the centre-of-gravity, area, > and first and second moments of inertia of the object. > -- > Jonathan Bromley, Consultant > > DOULOS - Developing Design Know-how > VHDL, Verilog, SystemC, Perl, Tcl/Tk, Verification, Project Services > > Doulos Ltd. Church Hatch, 22 Market Place, Ringwood, BH24 1AW, UK > Tel: +44 (0)1425 471223 mail:jonathan.bromley@doulos.com > Fax: +44 (0)1425 471573 Web: http://www.doulos.com > > The contents of this message may contain personal views which > are not the views of Doulos Ltd., unless specifically stated.Article: 81931
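A rough sketch of the edge bookkeeping Brad describes, assuming the blob labeller already delivers one left-edge and one right-edge column per scan line for the object of interest (all signal names here are invented). The underlying observation is that for a convex blob the left edge's line-to-line movement never decreases and the right edge's never increases, so a violation of either trend flags a concavity; real pixel data is noisy, so some hysteresis or a minimum-violation threshold would be needed on top of this.

module edge_watch #(parameter XW = 12)       // column coordinate width
   (input clk,
    input line_valid,                        // one pulse per scan line of the blob
    input blob_start,                        // asserted together with the blob's first line_valid
    input  signed [XW-1:0] left_x,
    input  signed [XW-1:0] right_x,
    output reg concave_hit);

    reg signed [XW-1:0] prev_left, prev_right;
    reg signed [XW-1:0] prev_dl, prev_dr;
    reg                 have_delta;

    wire signed [XW-1:0] dl = left_x  - prev_left;   // left edge movement this line
    wire signed [XW-1:0] dr = right_x - prev_right;  // right edge movement this line

    always @(posedge clk) begin
        if (blob_start) begin
            have_delta  <= 1'b0;
            concave_hit <= 1'b0;
        end else if (line_valid) begin
            // Convex blob: dl is non-decreasing and dr is non-increasing.
            if (have_delta && ((dl < prev_dl) || (dr > prev_dr)))
                concave_hit <= 1'b1;
            prev_dl    <= dl;
            prev_dr    <= dr;
            have_delta <= 1'b1;
        end
        if (line_valid) begin
            prev_left  <= left_x;
            prev_right <= right_x;
        end
    end
endmodule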
v_mirgorodsky@yahoo.com wrote: >Hi ALL, > >I got the problem solved in a not very efficient way. I replaced SRL16 >elements with conventional triggers and now the design flies in the sky - >the fmax went all the way up to 214+MHz. > >The only thing left to figure out - why conventional triggers do such a >good job and "very efficient" SRL16 appeared to mess up everything :( > >With best regards, >Vladimir S. Mirgorodsky > > > The SRL16s MUST be used with a flip-flop on the output and located in the same slice in order to obtain reasonably high performance. The reason is the propagation time through the SRL16 and back out of the slice is dismally slow. In order to use that flip-flop, however, the flip-flop cannot have a reset on it other than the power-on reset, because the SR line is shared with one of the controls for the SRL16. The catch is that in order to get the synthesis to infer the SRL16 followed by a flip-flop, you need to either instantiate the flip-flop or put a reset pin on it. You may also be able to make it produce this using a syn_preserve or similar attribute, but that really seems to be synthesis tool version dependent. So the short answer is your synthesizer is inferring the SRL16 without putting a flip-flop after it in the same slice, which makes for a very long set-up time through the SRL16. Another factor is the multiple levels of logic. The Xilinx PAR is notoriously bad at placing the additional LUTs when you have more than one LUT level between flip-flops in a signal path. If you can, redesign the logic so that the critical path goes through fewer layers of LUTs between flip-flops. Otherwise, at least look at the placement results and try some manual floorplanning to improve the LUT placement. From your success by replacing the SRL16, it sounds like just making sure you get that flip-flop on the SRL16 output will probably fix this for you. -- --Ray Andraka, P.E. President, the Andraka Consulting Group, Inc. 401/884-7930 Fax 401/884-7950 email ray@andraka.com http://www.andraka.com "They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759Article: 81932
lecroy7200@chek.com wrote: <snip> > As per our off-line talks, I have gone ahead and rebuilt the design > using slew limited outputs for the two pins in question. I have begun > running my transient tests but it will be a few weeks before I am > convinced this was the problem. > > The following link is to my post about the reflected energy causing > possible problems: > > http://groups-beta.google.com/group/comp.arch.fpga/browse_frm/thread/1423e577bf37d509/1f921b2ef9ae4542?q=reflected&rnum=3#1f921b2ef9ae4542 > > The following was taken from a Xilinx app. note. > > "For all FPGA families, ringing signals are not a cause for reliability > concerns. To cause such a problem, the Absolute Maximum DC conditions > need to be violated for a considerable amount of time (seconds)." <snip> That's from a pin-failure viewpoint - i.e. energy damage. They also spec a MAX peak current. There IS another failure mode, which is the lateral currents that result from the clamp diodes (which are actually sideways transistors). It is not easy to KNOW what peak currents you get, especially on cable or external runs. At the highest levels, these injection currents cause latch-up, but there can be lower levels where operation is compromised but the device does not latch up. Latch-up tests are purely "did the SCR trigger?" ones; they do NOT (AFAIK) ever check to see if the part logically misfired in any way. -jgArticle: 81933
Antti Lukats wrote: ><v_mirgorodsky@yahoo.com> schrieb im Newsbeitrag >news:1112375908.017059.293810@g14g2000cwa.googlegroups.com... > > >>Hi ALL, >> >>I got the problem solved in a not very efficient way. I replaced SRL16 >>elements with conventional triggers and now the design flies in the sky - >>the fmax went all the way up to 214+MHz. >> >>The only thing left to figure out - why conventional triggers do such a >>good job and "very efficient" SRL16 appeared to mess up everything :( >> >> > >hm, that's strange >there is one usually unused flip-flop at the 'end' of the SRL16 >so making the SRL16 1 clock shorter and using that flop should have the same >performance as flip-flops only >if what you say is so, then it must be a bug in the timing estimation ?? > >antti > > > > Most likely, that flip-flop is not being placed with the SRL16 due to routing limitations. Specifically, the flip-flop cannot have the reset used other than as part of the dedicated global reset, because the reset pin to the slice is shared with the WE function for the SRL16. If you forced instantiation of a flip-flop by adding a reset, it forced the flip-flop into another slice, which kills the SRL16 timing. -- --Ray Andraka, P.E. President, the Andraka Consulting Group, Inc. 401/884-7930 Fax 401/884-7950 email ray@andraka.com http://www.andraka.com "They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759Article: 81934
You have a few choices. You can instantiate the SRL16 and FF. Doing that guarantees the proper components, and provided you don't have a reset on the FF, the mapper will pack the two into the same slice even if you don't put RLOCs on it. Without the RLOCs, there is no issue of portability between Xilinx families later than XC4000* and SpartanI. Including RLOCs adds an additional wrinkle because the RLOC format for Virtex, VirtexE and Spartan2 is different than that for later families, and with Spartan3 or Virtex4, there are restrictions as to which columns can have an SRL16. Anyway, you have the choice of using or not using RLOCs. You can also infer the flip-flop by connecting global reset to it. If you do this, you MUST make sure every flip-flop in the design also has the global reset connected to it if you are inferring the global reset. If you leave any flip-flop out, you wind up with a huge net on general routing resources for the reset. You also get a signal wired to the SRL flip-flop reset pin, which in turn forces it out of the SRL slice. You can connect the inferred flip-flop reset pin to an instantiated ROC component. This puts the flip-flop reset on the built-in reset network. Depending on the synthesis tool, you may be able to set an attribute to force the synthesizer to put a flip-flop on the SRL16 output. If you do this, check your result any time you use a different tool or version of the tool. If you have more than one SRL16 chained together, the tools historically have only put the flip-flop on the last one in the chain, which is no better than not using the flip-flop at all. Depending on the tool, you may also be able to put a keep buffer between the inferred SRL16 and flip-flop to force that signal to be retained. Early on, I had mixed results with this using Synplify. Some versions it worked, others it didn't (one version it forced a LUT to be inserted between the SRL and the FF... the worst possible outcome). How do I deal with it? I have an IP block that instantiates RLOC'd SRL16's and flip-flops. It takes the desired delay and Virtex family as generics and generates an array of SRL16's and FFs to match the width of the output port, and divides the delay up into as many SRL16+FF segments as needed to create the delay. The root of the problem is the SRL16 has a comparatively very slow clock-Q time, which is not a problem as long as the SRL16 is wired only to the flip-flop in the same slice (thereby avoiding adding routing delays to the long clock to Q). This is compounded because synthesis tools don't automatically stick a register on the SRL16 output. -- --Ray Andraka, P.E. President, the Andraka Consulting Group, Inc. 401/884-7930 Fax 401/884-7950 email ray@andraka.com http://www.andraka.com "They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759Article: 81935
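A minimal sketch of the instantiation option Ray describes: one bit delayed by DEPTH clocks, built from an SRL16E of length DEPTH-1 followed by a flip-flop with no reset, so that MAP can pack the pair into one slice. It assumes the Xilinx unisim primitives SRL16E and FDE; RLOCs, wider buses and delays longer than 17 clocks are left out.

module srl_delay #(parameter DEPTH = 9)   // total delay in clocks, 2 to 17 in this sketch
   (input clk, input ce, input din, output dout);

    wire [3:0] addr = DEPTH - 2;   // SRL16E address N gives a shift depth of N+1
    wire       srl_q;

    SRL16E #(.INIT(16'h0000)) srl_i (
        .D(din), .CE(ce), .CLK(clk),
        .A0(addr[0]), .A1(addr[1]), .A2(addr[2]), .A3(addr[3]),
        .Q(srl_q)
    );

    // The output register deliberately has no reset, so the mapper can pack it
    // into the same slice as the SRL16E and hide the SRL16's slow clock-to-Q.
    FDE ff_i (.D(srl_q), .CE(ce), .C(clk), .Q(dout));
endmodule

For a bus, the same pair would simply be arrayed per bit, and longer delays split into several such segments, which is essentially what Ray's generic IP block automates.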
Antti Lukats wrote: > "Bret Wade" <bret.wade@xilinx.com> schrieb im Newsbeitrag > news:424DC52A.1060704@xilinx.com... > >>Antti Lukats wrote: >> >> >>>but with the Virtex 4 bug, thats a bit scarier >>> >>>a simple design with 16 counters connected to 16 pin locked GCK inputs. >>> >>>P&R fails, saing one signal is not fully routed >> >>Hello Antti, >> >>If PAR only fails to route a single signal, that's usually an indication >>of a packing or placement problem leading to an unroutable connection, >>rather than a congestion issue. These problems are usually not too >>difficult to understand and correct with packing or placement >>constraints. Although you are focused on the number of clocks in the >>design, you don't say whether the unrouted signal is a clock net or >>something else, so I won't speculate on the root cause. >> >>I suggest examining the design in FPGA Editor and trying to understand >>where the routing conflict is. If you are unable to make any progress >>with this method, I suggest opening a webcase and providing a test for >>investigation. >> >>Regards, >>Bret Wade >>Xilinx Product Applications >> > > > Hm, thanks for some hints, but well I think there is a an issue related to > V4 because > the same design with 16 clock routes OK on V2 without any problems. > the all design occupies less than 4% of the V4, so I am pretty confident the > design is > routable. And if simple design is routable there should be no reason to look > into > FPGA editor or add placement constraints to make the design routable. > > I will try to open a webcase too. > > Antti Hello Antti, Yes, this is likely a tool problem related to a new feature in the V4 parts such as the Regional Clocks. But until we know more details about the problem, like which connection is unrouted, we are no closer to a solution. It would also help to know what tool version is involved. I suspect that you are not yet using 7.1i since I would expect a different failure mode. 7.1i does an "unroutability check" before routing that detects and and errors out on unroutable connections, but you didn't describe that scenario. The 7.1i version also solves many of the early teething problems found with V4 devices. Regards, BretArticle: 81936
Thanks for the response, Josh. I knew about the ability to run two separate OSes, but I really do want to get an SMP machine out of the Virtex-II. I will check out the memory management section like you advised. I would like to avoid software solutions to this issue. Honestly, I am a bit surprised that more information isn't readily available on using the hard cores in an SMP fashion. If anyone else has any advice or references to pass along, I would appreciate it. JosephArticle: 81937
mmm, next time you want one done, send me to India! Haven't been there yet :) Depends how big the ASIC is and how much info you have on it. www.fpgaarcade.com I have cloned a few early NAMCO ASICs and made plug-in 28-pin replacements. No documentation on them, but functionally simple. Very small amounts of code compared to my normal large Virtex-4 type stuff, but lots of debugging and trial and error to get exact behaviour under all (tested, at least) cases. I have also (almost) finished the Atari ST custom chip sets, for which there is a lot of documentation. What are you after? /Mike.Article: 81938
oh, I also have written a number of tools to turn various ASIC netlists back into VHDL... Again, it all depends on what you want to do.Article: 81939
Hi Jason, > Better yet, running the tools on an xbox (with linux) might even be faster > (http://www.xbox-linux.org/)! Hmmm... A Celery 733 is not exactly the preferred CPU for computing-intensive stuff like P&R. I can even imagine a G5 using VirtualPC running faster than that. Best regards, BenArticle: 81940
"Laurent Gauch" <laurent.gauch@DELETEALLCAPSamontec.com> wrote in message news:4250FE32.90605@DELETEALLCAPSamontec.com... > > > Ross Marchant wrote: > > Hi, > > > > I'm using the XC95108 CPLD and Xilinx ISE 7.1.01i. The problem I am having > > is > > that outputs are inverted when they aren't supposed to be. > > > > ***************** > > > > This is my vhdl file: > > -------------------------------------------------------------------------- -- > > ---- > > library IEEE; > > use IEEE.STD_LOGIC_1164.ALL; > > use IEEE.STD_LOGIC_ARITH.ALL; > > use IEEE.STD_LOGIC_UNSIGNED.ALL; > > > > ---- Uncomment the following library declaration if instantiating > > ---- any Xilinx primitives in this code. > > --library UNISIM; > > --use UNISIM.VComponents.all; > > > > entity test is > > Port ( In1 : in std_logic; > > Out1 : out std_logic); > > end test; > > > > architecture Behavioral of test is > > > > begin > > Out1 <= In1; > > end Behavioral; > > > > ***************** > > > > This is my ucf file: > > #PACE: Start of Constraints generated by PACE > > > > #PACE: Start of PACE I/O Pin Assignments > > NET "In1" LOC = "P24" ; > > NET "Out1" LOC = "P54" ; > > > > #PACE: Start of PACE Area Constraints > > > > #PACE: Start of PACE Prohibit Constraints > > > > #PACE: End of Constraints generated by PACE > > > > ***************** > > > > Now i find if i put a low signal on pin 24 i get a high signal on pin 54 and > > vice-versa, even though the post fit simulation shows it working correctly. > > What could be wrong?? > > > > Try again removing your lib declaration > --use IEEE.STD_LOGIC_ARITH.ALL; > --use IEEE.STD_LOGIC_UNSIGNED.ALL; > Unfortunately this does not work. I have started a web case with Xilinx and they are looking into it for me. Thanks RossArticle: 81941
Hi, for some reason, the write line has to toggle high and low for me to write and read data back. I thought for this kind of memory module, you only need WE to be high to write, and WE to be low when you read. If in my state machine I have 10 write cycles where I set WE <= 1'b1, and then the rest are reads where I keep looping with WE = 1'b0, it doesn't work. If I set it to write 10 cycles, read 10 cycles, write 10 cycles, read 10 cycles... etc., then it works. Does anyone know what is wrong? I am terribly confused. Thanks, AnnArticle: 81942
If you're simulating, look for the wr_en being assigned 0 outside the always block you showed us; this would present odd behavior. The Block RAM absolutely does not require a WE edge. The WE level is sampled on the rising edge of the clock to the BlockRAM with specific setup and hold requirements. Are you simulating, using ChipScope, looking at test points, or other? "Ann" <ann.lai@analog.com> wrote in message news:ee8d229.3@webx.sUN8CHnE... > Hi, for some reason, the write line has to toggle high and low for me to write and read data back. I thought for this kind of memory module, you only need WE to be high to write, and WE to be low when you read. If in my state machine I have 10 write cycles where I set WE <= 1'b1, and then the rest are reads where I keep looping with WE = 1'b0, it doesn't work. If I set it to write 10 cycles, read 10 cycles, write 10 cycles, read 10 cycles... etc., then it works. Does anyone know what is wrong? I am terribly confused. Thanks, AnnArticle: 81943
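A small sketch of the pattern John describes, with invented signal names: WE is just a level that the block RAM samples on each rising clock edge, so a burst of writes followed by continuous reads only needs WE held high during the write cycles and low afterwards, with no extra toggling. If a sequence like this fails, the usual suspects are a second assignment to WE somewhere else in the code, or reporting logic that samples the read data before the registered output has actually updated.

// Hypothetical controller: 10 writes to addresses 0-9, then read them back forever.
// The data path (din) is omitted; only the WE/address sequencing is shown.
module wr_then_rd (input clk, input rst,
                   output reg we, output reg [7:0] addr);

    reg [3:0] count;
    reg       writing;

    always @(posedge clk) begin
        if (rst) begin
            writing <= 1'b1;
            we      <= 1'b1;
            count   <= 4'd0;
            addr    <= 8'd0;
        end else if (writing) begin
            addr  <= addr + 1;
            count <= count + 1;
            if (count == 4'd9) begin   // the 10th write is on the bus this cycle
                writing <= 1'b0;
                we      <= 1'b0;       // drop WE once; it stays low for every read
                addr    <= 8'd0;       // wrap back and start reading what was written
            end
        end else begin
            // Read phase: WE stays low, address cycles through the written range.
            addr <= (addr == 8'd9) ? 8'd0 : addr + 1;
        end
    end
endmodule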
Kolja Sulimma wrote: > > Most likely that's the way to do it. An alternative would be a > hyperbolic CORDIC, also explained by Ray: > http://www.andraka.com/files/crdcsrvy.pdf > > Kolja Sulimma If I needed high precision or for some reason could not use reasonably sized look-up tables, I'd consider a hybrid of these two, replacing the look-up with a CORDIC exp(x) after performing the normalize to strip off the integer part. The look up method I described in my prior post will get you plenty of precision for most applications. -- --Ray Andraka, P.E. President, the Andraka Consulting Group, Inc. 401/884-7930 Fax 401/884-7950 email ray@andraka.com http://www.andraka.com "They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759Article: 81944
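Ray's earlier post with the table method is not reproduced in this thread, but the usual shape of such an exp(x) is worth sketching: since e^x = 2^(x*log2(e)), scale x by log2(e), split the result into an integer part i and a fraction f in [0,1), look up 2^f in a table, and apply i as a final shift (the "normalize"). The fixed-point formats, widths and ROM below are arbitrary illustrations, not Ray's actual implementation, and interpolation between table entries is ignored.

module exp_sketch
   (input  signed [23:0] x,        // input in Q8.16 (example format)
    output signed [9:0]  i_part,   // floor(x*log2(e)): final shift amount
    output        [16:0] two_f);   // 2^f in Q1.16

    localparam signed [17:0] LOG2E_Q16 = 18'sd94548;   // log2(e) = 1.4427 in Q2.16

    wire signed [41:0] prod = x * LOG2E_Q16;           // full-width product, Q10.32
    wire signed [25:0] y    = prod >>> 16;             // x*log2(e) in Q10.16

    assign i_part = y[25:16];                          // integer part (two's complement floor)
    wire [15:0] f_part = y[15:0];                      // fractional part, always in [0,1)

    reg [16:0] pow2_rom [0:255];                       // pow2_rom[k] ~ 2^(k/256) in Q1.16; init not shown
    assign two_f = pow2_rom[f_part[15:8]];             // 2^f by table lookup

    // e^x = two_f scaled by 2^i_part; the barrel shifter (or floating-point
    // exponent adjustment) and range/overflow checks are omitted.
endmodule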
[I've crossposted to comp.arch.fpga, where this question really belongs] Mark wrote: > > We have a particular software system/program > that spends a large fraction of its time in two or three > particular functions that do lots of trig/geometric and matrix > calculations. Due to the nature of the algorithms they > use, they're not parallelizable (the computer is a quad cpu box). > Also because the rest of the code depends on the results > from these functions, they are a bottleneck for the rest > of the code that we can split across cpus. > > I think I've read that network hardware engineers use > special hardware that allows them to program the hardware > to directly implement code via "Programmable arrays" or something. > What I'd really like is an expansion card that would plug > into the bus and a toolkit so that these few C++ functions > in question would actually be implemented very fast in > hardware and not in software on the unix box. The C++ code > would call the function and the toolkit API would set it > up so that the function call goes to the card, is processed > and the result returned back to the C++ program as the return, > looking to the rest of the code like it was done in the app. > > I suppose the code for those functions would be dumped in a > binary file on the unix box, then the card would know to > load from the file when instructed to via an API call embedded > in the main application (in a constructor probably). Probably > the toolkit would have a way to define what C++ funcs to > grab to put on the card when the C++ is compiled. > > Is this doable? I think Xilinx and other companies make the chips > that do this, but I'm having trouble finding the end companies > that make an actual card we can use. > thanks > Mark > You're going to run into a few problems. First, FPGAs are programmed (in general) in Verilog/VHDL, not C++. Anyone know of any real behavioral synthesis from C/C++ tools? Second, if you're using trig functions, you're going to have to implement them differently than you would using software. Look up CORDIC algorithms, for example. Third, you might not get the speed-ups you expect. It depends on the chunk of work you're going to offload. It takes a significant amount of time to dump data to an accelerator card, and then to read the result back in. It works best if each function call takes a long time to complete. Fourth, if the code is truly non-parallelizable (i.e. you can't use SSE2/Altivec and you can't split it across multiple CPUs) then it's quite likely that FPGAs won't help too much. Again, someone more clued in than me might be able to answer better, but I suspect that the FPUs in the CPU _MAY_ be better than what you can achieve on an FPGA (it will depend on your algorithm etc.) Out of curiosity, what CPU are you currently using? Also, are you using double-precision or single-precision FP? Have you looked at the possibility of speeding up the performance of the software implementation? In particular, have you looked at how your trigonometric functions are implemented, and whether you can trade accuracy/precision for performance? Unless you're absolutely sure that the software can't be improved, I wouldn't recommend looking at FPGA acceleration.Article: 81945
I'd agree, the PCI will kill you first, and anything difficult for an FPGA but easy on the PC will kill you again, and finally C++ will not be as fast as HDL, by my estimate maybe 2-5x (my pure prejudice). If you must use C, take a look at Handel-C; at least it's based on Occam, so it's provably able to synthesize into HW, because it isn't really C, it just looks like it. If you absolutely must use IEEE floating point to get particular results, forget it, but I usually find these barriers are artificial; a good amount of transforms can flip things around entirely. To be fair, an FPGA PCI card could wipe out a PC only if the problem is a natural fit, say continuously processing a large stream of raw data either from converters or a special interface and then reducing it in some way to a report level. Perhaps a HD could be specially interfaced to the PCI card to bypass the OS; not sure if that can really help, getting high end there. Better still if the operators involved are simple but occur in the hundreds at least in parallel. The x86 has at least a 20x starting clock advantage, i.e. 20 ops per FPGA clock for simple inline code. An FPGA solution would really have to be several times faster to even make it worth considering. A couple of years ago, when PCI was relatively faster and PCs & FPGAs relatively slower, the bottleneck would have been less of a problem. BUT, I also think that x86 is way overrated, at least when I measure numbers. One thing FPGAs do with relatively no penalty is randomized processing. The x86 can take a huge hit if the application goes from entirely inside cache to almost never inside, maybe a factor of 5, but it depends on how close data is temporally and spatially. Now, standing things upside down: take some arbitrary HW function based on some simple math that is unnatural to a PC, say summing a vector of 13b saturated numbers. This uses less HW than the 16b version by about a quarter, but that sort of thing starts to torture the x86, since each trivial operator now needs to do a couple of things, maybe even perform a test and branch per point, which will hurt the branch predictor. Imagine the test is strictly a random choice, real murder on the predictor and pipeline. Taken to its logical extreme, even quite simple projects such as, say, a cpu emulator can run 100s of times slower as C code than as the actual HW, even at the FPGA's leisurely rate of 1/20th the PC clock. It all depends. One thing to consider though is the system bandwidth in your problem for moving data into & out of rams or buffers. Even a modest FPGA can handle 200-plus reads/writes per clock, where I suspect most x86s can really only express 1 ld or st to a cached location about every 5 ops. Then the FPGA starts to shine with a 200 v 20/4 ratio. Also, when you start in C++, you have already favored the PC, since you likely expressed ints as 32b numbers and used FP. If you're using FP when integer can work, you really stacked the deck, but that can often be undone. When you code in HDL for the data size you actually need, you are favoring the FPGA by the same margin in reverse. Mind you, I have never seen FP math get synthesized; you would have to instantiate a core for that. One final option to consider: use an FPGA cpu, take a 20x performance cut, and run the code on that; the hit might not even be 20x because the SRAM or even DRAM is at your speed, rather than 100s of times slower as seen from a PC. Then look for opportunities to add a special-purpose instruction and see what the impact of 1 kernel op might be. An example crypto op might easily replace 100 opcodes with just 1 op. 
Now also consider that you can gang up a few cpus too. It just depends on what you are doing and whether it's mostly IO or mostly internal crunching. johnjakson at usa dot comArticle: 81946
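John's 13-bit saturated-sum example is easy to picture in HDL; a sketch of one accumulator stage (widths and names chosen arbitrarily) shows why it is cheap in an FPGA and awkward on an x86: the compare-and-clamp is just a little extra logic hanging off the adder rather than a data-dependent test and branch per element.

// One stage of a saturating accumulator for unsigned 13-bit samples.
// The running sum is held in 13 bits and clamps at 8191 instead of wrapping.
module sat_acc (input clk, input clear, input sample_valid,
                input [12:0] sample, output reg [12:0] acc);

    wire [13:0] sum = acc + sample;              // one extra bit catches the carry out

    always @(posedge clk) begin
        if (clear)
            acc <= 13'd0;
        else if (sample_valid)
            acc <= sum[13] ? 13'h1FFF : sum[12:0];   // saturate instead of branching
    end
endmodule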
Nju, XMD keeps the main memory consistent with the contents of the caches and vice versa. Debugging with caches on and/or off works in a consistent way and with guaranteed integrity of the data/code in the caches and the memory. - Peter Njuguna Njoroge wrote: > Hello, > > I would like to know whether turning on the caches in the PPC influences the functionality of XMD. > > 1) For instance, when XMD downloads an ELF binary to the memory, it issues writes to processor through the debug ports. Is it safe to assume that these writes bypass the data cache? If this wasn't the case and you are using a writeback cache setting, then there is a chance that the instructions wouldn't make it to main memory. Thus, when executing the program, the instructions won't be read by the processor because it searches the instruction cache, then main memory on a miss. Does this make sense? > > 2) When using the debug mrd (memory read) or mwr (memory write), is it safe to assume that the data cache is bypassed since it is a debug memory read/write, even if the address actually resides in the cache? If the debug read/write does search the cache and causes a miss, will the configured cache behavior ensue (like fetching the rest of the cache line on a miss)? If this is the case, then debug reads could change the state of the cache/memory, which may not be desired by the programmer. > > In general, I would like to understand the mechanism that XMD uses to issue writes/reads to the processor for both instruction download and debug memory read/writes. The "PowerPC Processor Reference Guide" goes into nice detail about the debug capabilities of the PPC 405 with the various configuration registers and signals. However, there is no documentation (that I have found) that discusses how XMD employs those debug features. Therefore, I don't know if XMD is configuring the caches to go into non-cacheable mode for the debug memory accesses or it uses the existing configuration as defined by the program. > > I'm working on a ML 310 board with a V2P30 -6 chip. > > NNArticle: 81947
On Mon, 04 Apr 2005 11:48:09 -0700, Eric Smith wrote: > Tobias Weingartner wrote: >> I doubt it's a matter of patents, but more a matter of licensing. The two >> are very different beasts. > > But if there isn't a patent on an architecture, you don't need a license > to implement it. The purpose of the license is to grant you a right that > was taken away by the patent. If there's no patent, you haven't been > denied the right. Since this topic has come up, maybe someone could answer this for me: I've seen publicly available (often open source) cores for other processors, such as the AVR. Are these sorts of cores legal to make, distribute and use? Supposing I made (from scratch) an msp430-compatible core for an FPGA - any ideas whether that would be legal or not? I'm guessing that using the name "msp430" would be a trademark and/or copyright violation, but if there are no patents involved it should be okay? Does it make any difference whether it is just used by the developer, released as an inaccessible part of a closed design, or whether it is released for free use by others? mvh., DavidArticle: 81948
Currently we are doing one such assignment for a client. They want to do a board respin and wanted us to replace the few ASICs in there with FPGAs. Fortunately they are not complex, but the process sucks: little or no documentation, or it's in some foreign language, crazy!! And nothing for reference except the working board. So it's like code, debug, debug, debug... until you get it right on the screen.Article: 81949
"David" <david.nospam@westcontrol.removethis.com> schrieb im Newsbeitrag news:pan.2005.04.05.07.04.46.345000@westcontrol.removethis.com... > On Mon, 04 Apr 2005 11:48:09 -0700, Eric Smith wrote: > > > Tobias Weingartner wrote: > >> I doubt it's a matter of patents, but more a matter of licening. The two > >> are very different beasts. > > > > But if there isn't a patent on an architecture, you don't need a license > > to implement it. The purpose of the license is to grant you a right that > > was taken away from the patent. If there's no patent, you haven't been > > denied the right. > > Since this topic has come up, maybe someone could answer this for me: > > I've seen publicly available (often open source) cores for other > processors, such as the AVR. Are these sort of cores legal to make, > distribute and use? Supposing I made (from scratch) an msp430 compatible its done full soc based on MSP430 compatible core :) http://bleyer.org/ > core for an FPGA - any ideas whether that would be legal or not? I'm > guessing that using the name "msp430" would be a trademark and/or > copyright violation, but if there are no patents involved it should be > okay? Does it make any difference whether it is just used by the > developer, released as an inaccessible part of a closed design, or whether > it is released for free use by others? > > mvh., > > David >