
On 1/27/2017 10:12 AM, Benjamin Couillard wrote:
> Le vendredi 27 janvier 2017 03:17:21 UTC-5, David Brown a écrit :
>> On 27/01/17 05:39, rickman wrote:
>>> On 1/26/2017 9:38 PM, Kevin Neilson wrote:
>>>>>>
>>>>>> I think you oversimplify FP.  It works a lot better with
>>>>>> dedicated hardware.
>>>>>
>>>>> Not sure what your point is.  The principles are the same in
>>>>> software or hardware.  I was describing hardware I have worked on:
>>>>> the ST-100 from Star Technologies.  I became very intimate with the
>>>>> inner workings.
>>>>>
>>>>> The only complications are from the various error and special case
>>>>> handling of the IEEE-754 format.  I doubt the FPGA is implementing
>>>>> that, but possibly.  The basics are still the same.  Adds use a
>>>>> barrel shifter to denormalize the mantissa so the exponents are
>>>>> equal, an integer adder, and a normalization barrel shifter to
>>>>> produce the result.  Multiplies use a multiplier for the mantissas
>>>>> and an adder for the exponents (with adjustment for exponent bias),
>>>>> followed by a simple shifter to normalize the result.
>>>>>
>>>>> Both add and multiply are about the same level of complexity, as a
>>>>> barrel shifter is almost as much logic as the multiplier.
>>>>>
>>>>> Other than the special case handling of IEEE-754, what do you think
>>>>> I am missing?
>>>>>
>>>>> --
>>>>> Rick C
>>>>
>>>> It just all works better with dedicated hardware.  Finding the
>>>> leading one for normalization is somewhat slow in the FPGA and is
>>>> something that benefits from dedicated hardware.  Using a DSP48 (if
>>>> we're talking about Xilinx) for a barrel shifter is fairly fast, but
>>>> requires 3 cycles of latency, can only shift up to 18 bits, and is
>>>> overkill for the task.  You're using a full multiplier as a shifter;
>>>> a dedicated shifter would be smaller and faster.  All this stuff adds
>>>> latency.  When I pull up CoreGen and ask for the basic FP adder, I
>>>> get something that uses only 2 DSP48s but has 12 cycles of latency.
>>>> And there is a lot of fabric routing, so timing is not very
>>>> deterministic.
>>>
>>> I'm not sure how much you know about multipliers and shifters.
>>> Multipliers are not magical.  Multiplexers *are* big.  A multiplier
>>> has N stages with a one-bit adder at every bit position.  A barrel
>>> multiplexer has nearly as many bit positions (you typically don't
>>> need all the possible outputs), but uses a bit less logic at each
>>> position.  Each bit position still needs a full 4-input LUT.  Not
>>> tons of difference in complexity.
>>>
>>
>> A 32-bit barrel shifter can be made with 5 steps, each step being a set
>> of 32 two-input multiplexers.  Dedicated hardware for that will be
>> /much/ smaller and more efficient than using LUTs or a full multiplier.
>>
>> Normalisation of FP results also requires a "find first 1" operation.
>> Again, dedicated hardware is going to be a lot smaller and more
>> efficient than using LUTs.
>>
>> So a DSP block that has dedicated FP support is going to be smaller and
>> faster than using integer DSP blocks with LUTs to do the same job.
>>
>>> The multipliers I've seen have selectable latency down to 1 clock.
>>> Rolling a barrel shifter will generate many layers of logic that will
>>> need to be pipelined as well to reach high speeds, likely many more
>>> layers for the same speeds.
>>>
>>> What do you get if you design a floating point adder in the fabric?
>>> I can only imagine it will be *much* larger and slower.
>>>
>
> If I understand correctly, you can do a barrel shifter with log2(n)
> complexity, hence your 5 steps, but you will have the combinational
> delays of 5 muxes, which could limit your maximum clock frequency.  A
> brute-force approach will use more resources but will probably allow a
> higher clock frequency.

Technically N log(N).

--
Rick C
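The add datapath described above (denormalizing shift to align exponents, integer add, normalizing shift) can be sketched as a software model.  This is a toy illustration only: a 24-bit normalized mantissa is assumed, values are plain (mantissa, exponent) pairs with value = m * 2**e, and all IEEE-754 special cases, subnormals, and rounding modes are deliberately omitted.

```python
def fp_add(a, b):
    (ma, ea), (mb, eb) = a, b
    # Make 'a' the operand with the larger exponent.
    if ea < eb:
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    # Denormalizing barrel shifter: align the smaller-exponent mantissa.
    # (Shifted-out bits are simply truncated -- crude rounding.)
    mb >>= (ea - eb)
    # Integer add of the aligned mantissas.
    m, e = ma + mb, ea
    # Normalizing shifter: bring the leading 1 back into bit 23,
    # adjusting the exponent to match.
    while m >= (1 << 24):      # carry-out from the add
        m >>= 1
        e += 1
    while m and m < (1 << 23):  # cancellation from the add
        m <<= 1
        e -= 1
    return m, e

# 1.5 + 0.5: mantissas 0b11<<22 and 1<<23 with suitable exponents.
m, e = fp_add((3 << 22, -23), (1 << 23, -24))
print(m * 2.0**e)  # -> 2.0
```

The two `while` loops are where a hardware implementation wants the "find first 1" and barrel-shift primitives discussed in this thread; in logic they are a priority encoder plus one shifter, not a sequential loop.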

On 1/27/2017 11:33 AM, David Brown wrote:
> On 27/01/17 16:12, Benjamin Couillard wrote:
>> Le vendredi 27 janvier 2017 03:17:21 UTC-5, David Brown a écrit :
>>> [snip -- quoted text unchanged from earlier in the thread]
>>
>> If I understand correctly, you can do a barrel shifter with log2(n)
>> complexity, hence your 5 steps, but you will have the combinational
>> delays of 5 muxes, which could limit your maximum clock frequency.  A
>> brute-force approach will use more resources but will probably allow a
>> higher clock frequency.
>>
>
> The "brute force" method would be 1 layer of 32 32-input multiplexers.
> And how do you implement a 32-input multiplexer in gates?  You
> basically have 5 layers of 2-input multiplexers.
>
> If the depth of the multiplexer is high enough, you might use tri-state
> gates, but I suspect that in this case you'd implement it with normal
> logic.

A barrel shifter is simpler than that.  In a method somewhat parallel to
computing an FFT, the terms in a barrel shifter can be shared to allow
this.  (Pseudo-VHDL; each stage either passes its input through or does
a fixed left shift of 1, 2, 4, 8, or 16, selected by one bit of the
shift amount, and feeds the next stage.)

  function shift_left_32 (indata : unsigned(31 downto 0);
                          sel    : unsigned(4 downto 0))
    return unsigned is
    variable a, b, c, d, e : unsigned(31 downto 0);
  begin
    a := indata(30 downto 0) & '0'     when sel(0) = '1' else indata;
    b := a(29 downto 0) & "00"         when sel(1) = '1' else a;
    c := b(27 downto 0) & x"0"         when sel(2) = '1' else b;
    d := c(23 downto 0) & x"00"        when sel(3) = '1' else c;
    e := d(15 downto 0) & x"0000"      when sel(4) = '1' else d;
    return e;
  end;

--
Rick C
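The staged structure in the pseudo-VHDL above is easy to sanity-check with a software model: five conditional fixed shifts, one per bit of the 5-bit shift amount.  Python is used here purely as a checker; the 32-bit width, 5 select bits, and left-shift direction follow the sketch above.

```python
MASK = (1 << 32) - 1  # model a 32-bit datapath

def barrel_shift_left(x, sel):
    # One stage per select bit: a fixed shift by 1, 2, 4, 8, or 16,
    # or a pass-through.  In hardware each stage is a 2:1 mux per
    # output bit, giving the log2(n) depth discussed in the thread.
    for k in range(5):
        if (sel >> k) & 1:
            x = (x << (1 << k)) & MASK
    return x

# The staged network agrees with a single full-width shift:
for amount in range(32):
    assert barrel_shift_left(0x89ABCDEF, amount) == (0x89ABCDEF << amount) & MASK
print("staged shifter matches direct shift")
```

The sharing Rick alludes to is visible here: the five stages cover all 32 shift amounts, rather than building 32 separate shifted copies and selecting one (the "brute force" single-layer mux).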

Le vendredi 27 janvier 2017 14:00:10 UTC-5, rickman a écrit :
> On 1/27/2017 10:12 AM, Benjamin Couillard wrote:
>> [snip -- quoted text unchanged from earlier in the thread]
>>
>> If I understand correctly, you can do a barrel shifter with log2(n)
>> complexity, hence your 5 steps, but you will have the combinational
>> delays of 5 muxes, which could limit your maximum clock frequency.  A
>> brute-force approach will use more resources but will probably allow a
>> higher clock frequency.
>
> Technically N log(N).
>
> --
> Rick C

Yep, true -- thanks for the clarification.

> On 01/26/2017 09:15 PM, rickman wrote:
> [snip]
> What does "customer visible" have to do with it?  You seem to be
> talking about rolling your own DAC using a bunch of I/O pins and
> resistors with analog inversion circuitry after.  I'm explaining how it
> can be done more easily.  If you don't need bipolar outputs, why do you
> need negative binary values at all?
>
> I've said there are a few sign magnitude DAC parts out there, but not
> so many.  I can't recall ever using one.

"Not customer visible" means that the person who buys one of these parts
to stick on a board is unaware of the interface between the internal
digital logic and internal analog circuitry.  The people designing the
analog guts of a mixed-signal chip do "roll their own" DACs in several
flavors.  Unless you work in mixed-signal IC design or test, you would
never see these interfaces.

For a DAC or ADC product, "customer visible" has a lot to do with it,
because people want 2's complement or biased binary at the system
interface so that it plays nice with the arithmetic in the rest of the
system and doesn't confuse the firmware people.  For the internals of a
chip, as a digital design engineer or test engineer, you just deal with
it.  As for why it is done the way it is, I am not completely sure; it
is just what analog designers do.

Even fairly analog-seeming components like the voltage reference that
David mentioned generally have a digital section with flash or OTP
memory that gets read at power-up and sets DACs used to configure the
reference output to the 1% (or whatever) the part spec is.  Bandgaps
have 3 or 4 analog parameters affecting output voltage and output
flatness over temperature that need to be set.  As part of final test on
the silicon die, they measure the device performance and set the values
in the memory.  This allows the manufacturer to correct for process
variations and uniformity issues across a wafer the size of a dinner
plate.

When you see "NC" pins on small analog devices, they are often used for
access to these memories.  The actual signalling methods are pretty
closely held and not something a customer is likely to stumble on.

The "I/O pins" in use are interconnects between the digital block and
the analog section on the die.  The last chip that I worked on had 100+
digital signals crossing the analog/digital boundary.  Some were control
signals and some were trim signals.  I think that there were 5 or 6
parallel-interfaced DACs from 4 bits to 11 bits.  Of those, 2 or 3 used
sign magnitude format to pass the trim information across.

BobH
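The system-interface vs. internal-trim formats BobH contrasts are a one-line conversion each way.  This is a hedged sketch: the 8-bit width is an assumption for illustration, and real trim interfaces vary part by part.

```python
BITS = 8  # assumed word width for the demo

def int_to_sign_mag(v):
    """Signed integer (|v| < 2**(BITS-1)) -> sign-magnitude word:
    top bit is the sign, remaining bits are the magnitude."""
    sign = 1 if v < 0 else 0
    return (sign << (BITS - 1)) | abs(v)

def sign_mag_to_int(word):
    """Sign-magnitude word -> signed integer."""
    mag = word & ((1 << (BITS - 1)) - 1)
    return -mag if word >> (BITS - 1) else mag

print(hex(int_to_sign_mag(-5)))   # -> 0x85 (sign bit set, magnitude 5)
print(sign_mag_to_int(0x85))      # -> -5
```

Note the quirk that makes sign-magnitude awkward at a system interface: like ones' complement, it has two zeros (0x00 and 0x80 both decode to 0), which is exactly the kind of thing the firmware people would rather not deal with.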

> I do almost all of my work in Verilog, but I do have comments
> about editors in general.
>
> I actually tried Sigasi's editor briefly, but found it lacking some
> of the features I was used to, and I wasn't so interested in the
> project management portion since I generally work from the Xilinx
> GUI.

The last time I checked, both Sigasi and V3S had quite limited Verilog
support.  I think at the moment they are really VHDL editors, with some
basic Verilog support if you need to edit a Verilog file here and
there...

I did a Verilog project only once, about 2 years ago, and at that time
VEditor worked the best for me by far, at least among the free options
(it was also recommended to me by the respective customer, but I first
tried some different approaches).  It flagged a lot of issues in the
source code that are not detected by a Verilog compiler and saved me a
lot of time looking for stupid bugs.  (In contrast, VHDL editors flag
errors that would be detected by the compiler...  But this is a
different topic ;-)

Regards,

Thomas

www.entner-electronics.com - Home of EEBlaster and JPEG Codec

On 26/01/2017 01:14, Tim Wescott wrote:
> On Wed, 25 Jan 2017 02:59:46 -0800, cfbsoftware wrote:
>
>> On Wednesday, January 25, 2017 at 3:14:39 PM UTC+10:30, Tim Wescott
>> wrote:
>>> This is kind of a survey; I need some perspective (possibly
>>> historical).
>>>
>>> Are there any digital systems that you know of that use 1's
>>> complement or signed-magnitude number representation for technical
>>> reasons?
>>>
>>> Have you ever used it in the past?
>>>
>> Quote:
>>
>> "Some designers chose 1's complement, where −n was obtained from n by
>> simply inverting all bits.  Some chose 2's complement, where −n is
>> obtained by inverting all bits and then adding 1.  The former has the
>> drawback of featuring two forms for zero (0…0 and 1…1).  This is
>> nasty, particularly if available comparison instructions are
>> inadequate.  For example, the CDC 6000 computers had an instruction
>> that tested for zero, recognizing both forms correctly, but also an
>> instruction that tested the sign bit only, classifying 1…1 as a
>> negative number, making comparisons unnecessarily complicated.  This
>> case of inadequate design reveals 1's complement as a bad idea.
>> Today, all computers use 2's complement arithmetic."
>>
>> Ref: "Good Ideas, Through the Looking Glass", Niklaus Wirth, IEEE
>> Computer, vol. 39, no. 1, January 2006.
>>
>> https://www.computer.org/csdl/mags/co/2006/01/r1028-abs.html
>
> I'm looking for current practice, not history.

1's complement is nearly the very definition of history, as implied by
"Have you ever used it in the past?"  The only machine I am aware of
that used 1's complement is a 1970s mainframe.  I have not been aware of
any since; it's a daft idea.

--
Mike Perkins
Video Solutions Ltd
www.videosolutions.ltd.uk
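The two-zeros problem Wirth describes is easy to demonstrate concretely.  A small model, with an 8-bit width assumed: in ones' complement, negation is pure bit inversion, so negating zero yields a second, all-ones representation of zero.

```python
BITS, MASK = 8, 0xFF  # assumed 8-bit word for the demo

def ones_comp_neg(x):
    # In ones' complement, -x is simply the bitwise NOT of x.
    return ~x & MASK

plus_zero  = 0x00
minus_zero = ones_comp_neg(plus_zero)
print(f"{minus_zero:#04x}")  # -> 0xff: the second form of zero

# A sign-bit-only test (like the CDC 6000 instruction mentioned in the
# quote) wrongly classifies minus zero as a negative number:
print(minus_zero >> (BITS - 1))  # -> 1, i.e. "negative"
```

Two's complement avoids this entirely: negation is invert-then-add-1, so 0x00 negates back to 0x00 and every bit pattern denotes a distinct value.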

On 27/01/17 19:59, rickman wrote:
> On 1/27/2017 3:17 AM, David Brown wrote:
>> On 27/01/17 05:39, rickman wrote:
>>> [snip -- quoted text unchanged from earlier in the thread]
>>
>> A 32-bit barrel shifter can be made with 5 steps, each step being a
>> set of 32 two-input multiplexers.  Dedicated hardware for that will be
>> /much/ smaller and more efficient than using LUTs or a full
>> multiplier.
>
> Yes, I stand corrected.  Still, it is hardly a "waste" of multipliers
> to use them for multiplexers.

Well, if the multipliers are already there and you don't have
alternative dedicated hardware, then I agree you are not wasting the
multipliers in using them for a shifter.

>> Normalisation of FP results also requires a "find first 1" operation.
>> Again, dedicated hardware is going to be a lot smaller and more
>> efficient than using LUTs.
>
> Find first 1 can be done using a carry chain, which is quite fast.  It
> is the same function as used in Gray code operations.

It is not something I have looked into, but I'll happily take your word
for it.  However, like pretty much /any/ function, it will be smaller
and faster in dedicated hardware than in logic blocks.

>> So a DSP block that has dedicated FP support is going to be smaller
>> and faster than using integer DSP blocks with LUTs to do the same job.
>
> Who said it wouldn't be?  I say exactly that below.  My point was just
> that floating point isn't too hard to wrap your head around and not so
> horribly different from fixed point.  You just need to stick a few
> functions onto a fixed point multiplier/adder.

Fair enough.

> I was responding to:
>
> "Is this really a thing, or are they wrapping some more familiar fixed-
> point processing with IP to make it floating point?"
>
> The difference between fixed and floating point operations requires a
> few functions beyond the basic integer operations which we have been
> discussing.  Floating point is not magic or incredibly hard to do.  It
> has not been included on FPGAs up until now because the primary market
> is integer based.

Okay.

> Some 15 years ago I discussed the need for hard IP in FPGAs and was
> told by certain Xilinx employees that it isn't practical to include
> hard IP because of the proliferation of combinations and wasted
> resources that result.  The trouble is that the ratio of silicon area
> required for hard IP vs. FPGA fabric gets worse with each larger
> generation.  So, as we see now, FPGAs are including all manner of
> function blocks... like other devices.
>
> What I don't get is why FPGAs are so special that they are the last
> holdout against becoming system-on-chip devices.

I think this has come up before in this newsgroup.  But I can't remember
if any conclusion was reached (probably not!).

> [remainder of quoted text snipped]

>>> Normalisation of FP results also requires a "find first 1" operation.
>>> Again, dedicated hardware is going to be a lot smaller and more
>>> efficient than using LUTs.
>>
>> Find first 1 can be done using a carry chain, which is quite fast.  It
>> is the same function as used in Gray code operations.
>
> It is not something I have looked into, but I'll happily take your word
> for it.  However, like pretty much /any/ function, it will be smaller
> and faster in dedicated hardware than in logic blocks.

I've done it in a Xilinx, and it's not fast.  First you have to go
across the routing fabric and go through a set of LUTs to get onto the
carry chain.  The carry chain is pretty fast; getting on and off the
carry chain is slow.  After you get off the carry chain, you have to go
through the general routing fabric again.  This is where most of your
clock cycle gets eaten up.  Remember, if you had dedicated hardware,
this would be a dedicated route.  Now you get into a second set of LUTs,
where you have to AND the data from the carry chain with the original
number in order to get a one-hot bus with only the leading 1 set.  Now
you have to encode that into a number which you can use for your
shifter.  You may be able to do this with the same set of LUTs; I can't
remember.
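The two LUT stages Kevin describes around the carry chain can be modeled behaviorally: first derive a mask from the input, AND it against the input to leave only the leading 1 as a one-hot bus, then encode that into a shift count.  This is a sketch only; the mask step below uses a log2-stage bit-smear in place of the actual carry-chain mechanics, which differ at gate level.

```python
def leading_one_hot(x):
    # Smear the leading 1 down through all lower bit positions
    # (log2 stages for a 32-bit word), then isolate the top bit.
    y = x
    for s in (1, 2, 4, 8, 16):
        y |= y >> s
    return y & ~(y >> 1)  # one-hot: only the leading 1 remains set

def encode(one_hot):
    # One-hot bus -> bit index, usable as a normalization shift count.
    return one_hot.bit_length() - 1

x = 0b0000_0100_1010_0000
print(bin(leading_one_hot(x)))   # -> 0b10000000000
print(encode(leading_one_hot(x)))  # -> 10
```

In an FPGA each of those steps is a layer of LUTs plus general routing, which is where Kevin says the clock cycle goes; dedicated priority-encoder hardware collapses the whole thing into one fixed route.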

On Mon, 30 Jan 2017 10:40:39 -0800, Kevin Neilson wrote:
> [snip -- quoted text unchanged from earlier in the thread]
>
> I've done it in a Xilinx, and it's not fast.  First you have to go
> across the routing fabric and go through a set of LUTs to get onto the
> carry chain.  The carry chain is pretty fast; getting on and off the
> carry chain is slow.  [snip]

What Xilinx part?

The Altera Stratix 10 (I think that's the one) uses paired DSP blocks
that are designed with a bit of extra logic so that you can use the pair
of them as a floating-point block, or each one as a fixed-point block.
(I'm not using their terminology.)

Apparently there's enough stuff going on at the really high end that
floating point is better.

--
Tim Wescott
Control systems, embedded software and circuit design
I'm looking for work!  See my website if you're interested
http://www.wescottdesign.com

Le vendredi 27 janvier 2017 11:34:00 UTC-5, David Brown a écrit :
> On 27/01/17 16:12, Benjamin Couillard wrote:
>> Le vendredi 27 janvier 2017 03:17:21 UTC-5, David Brown a écrit :
>>> [snip -- quoted text unchanged from earlier in the thread]
>>
>> If I understand correctly, you can do a barrel shifter with log2(n)
>> complexity, hence your 5 steps, but you will have the combinational
>> delays of 5 muxes, which could limit your maximum clock frequency.  A
>> brute-force approach will use more resources but will probably allow a
>> higher clock frequency.
>
> The "brute force" method would be 1 layer of 32 32-input multiplexers.
> And how do you implement a 32-input multiplexer in gates?  You
> basically have 5 layers of 2-input multiplexers.
>
> If the depth of the multiplexer is high enough, you might use tri-state
> gates, but I suspect that in this case you'd implement it with normal
> logic.

Yeah, you're right.

On 1/30/2017 1:40 PM, Kevin Neilson wrote:
> [snip -- quoted text unchanged from earlier in the thread]
>
> I've done it in a Xilinx, and it's not fast.  First you have to go
> across the routing fabric and go through a set of LUTs to get onto the
> carry chain.  The carry chain is pretty fast; getting on and off the
> carry chain is slow.  [snip]

The comparison is using a carry chain vs. not using a carry chain.
First 1 in LUTs is either log2(N) in depth and linear in size, or
log2(N) in size and linear in depth (speed).  Using general routing and
LUTs, this is very slow.  Using a fast carry needs a LUT to enter the
carry chain and a LUT to exit the carry chain.  The carry chain itself
is a fraction of a nanosecond per bit.

--
Rick C

On 1/30/2017 3:54 PM, Tim Wescott wrote: > On Mon, 30 Jan 2017 10:40:39 -0800, Kevin Neilson wrote: > >>>>> Normalisation of FP results also requires a "find first 1" >>>>> operation. >>>>> Again, dedicated hardware is going to be a lot smaller and more >>>>> efficient than using LUT's. >>>> >>>> Find first 1 can be done using a carry chain which is quite fast. It >>>> is the same function as used in Gray code operations. >>>> >>>> >>> It is not something I have looked into, but I'll happily take your word >>> for it. However, like pretty much /any/ function, it will be smaller >>> and faster in dedicated hardware than in logic blocks. >>> >>> >> I've done it in a Xilinx, and it's not fast. First you have to go >> across the routing fabric and go through a set of LUTs to get onto the >> carry chain. The carry chain is pretty fast; getting on and off the >> carry chain is slow. After you get off the carry chain, you have to go >> through the general routing fabric again. This is where most of your >> clock cycle gets eaten up. Remember, if you had dedicated hardware, >> this would be a dedicated route. Now you get into a second set of LUTs, >> where you have to AND the data from the carry chain with the original >> number in order to get a one-hot bus with only the leading 1 set. Now >> you have to encode that into a number which you can use for your >> shifter. You may be able to do this with the same set of LUTs; I can't >> remember. > > What Xilinx part? > > The Altera Stratus 10 (I think that's the one) uses paired DSP blocks > that are designed with a bit of extra logic so that you can use the pair > of them as a floating-point block, or each one as a fixed-point block. > (I'm not using their terminology). > > Apparently there's enough stuff going on at the really high end that > floating point is better. I'm not sure what "high end" means. Floating point has some advantages and it has some disadvantages. Fixed point is the same. 
Neither is perfect for all uses or even *any* uses actually. You always need to analyze the problem you are solving and consider the sources of computational errors. They are different but always potentially present with either approach. -- Rick C
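The point about differing error sources can be made concrete. A quick Python sketch (mine, not from the thread) comparing the error profiles of Q1.15 fixed point and IEEE-754 single precision: fixed point has a bounded *absolute* error, floating point a bounded *relative* error.

```python
import struct

def q15(x):
    """Quantize to Q1.15 fixed point (round to nearest, saturate)."""
    return max(-32768, min(32767, round(x * 32768))) / 32768

def f32(x):
    """Round-trip a value through IEEE-754 single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

for x in (0.5, 0.001, 1e-6):
    print(f"x={x}: Q15 abs err={abs(q15(x) - x):.2e}, "
          f"float32 rel err={abs(f32(x) - x) / x:.2e}")
# Q15's absolute error stays bounded (<= 2**-16) but its relative error
# blows up for tiny x (1e-6 quantizes to 0); float32's relative error
# stays around 1e-7 at every scale.
```

Which profile is the "right" one is exactly the analysis Rick is describing: with an ADC/DAC pinning the dynamic range, bounded absolute error is usually what you want.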

On Mon, 30 Jan 2017 16:32:51 -0500, rickman wrote: > On 1/30/2017 3:54 PM, Tim Wescott wrote: >> On Mon, 30 Jan 2017 10:40:39 -0800, Kevin Neilson wrote: >> >>>>>> Normalisation of FP results also requires a "find first 1" >>>>>> operation. >>>>>> Again, dedicated hardware is going to be a lot smaller and more >>>>>> efficient than using LUT's. >>>>> >>>>> Find first 1 can be done using a carry chain which is quite fast. >>>>> It is the same function as used in Gray code operations. >>>>> >>>>> >>>> It is not something I have looked into, but I'll happily take your >>>> word for it. However, like pretty much /any/ function, it will be >>>> smaller and faster in dedicated hardware than in logic blocks. >>>> >>>> >>> I've done it in a Xilinx, and it's not fast. First you have to go >>> across the routing fabric and go through a set of LUTs to get onto the >>> carry chain. The carry chain is pretty fast; getting on and off the >>> carry chain is slow. After you get off the carry chain, you have to >>> go through the general routing fabric again. This is where most of >>> your clock cycle gets eaten up. Remember, if you had dedicated >>> hardware, this would be a dedicated route. Now you get into a second >>> set of LUTs, >>> where you have to AND the data from the carry chain with the original >>> number in order to get a one-hot bus with only the leading 1 set. Now >>> you have to encode that into a number which you can use for your >>> shifter. You may be able to do this with the same set of LUTs; I >>> can't remember. >> >> What Xilinx part? >> >> The Altera Stratus 10 (I think that's the one) uses paired DSP blocks >> that are designed with a bit of extra logic so that you can use the >> pair of them as a floating-point block, or each one as a fixed-point >> block. (I'm not using their terminology). >> >> Apparently there's enough stuff going on at the really high end that >> floating point is better. > > I'm not sure what "high end" means. 
Floating point has some advantages > and it has some disadvantages. Fixed point is the same. Neither is > perfect for all uses or even *any* uses actually. You always need to > analyze the problem you are solving and consider the sources of > computational errors. They are different but always potentially present > with either approach. Yes, you are correct. I tend to mostly work with stuff that comes out of an ADC, goes through some processing (usually for me it's a processor and not an FPGA, but it's still DSP), and then goes out a DAC. In that case, fixed-point processing for the signal itself is usually the way to go because the ADC and DAC between them pretty much set the ranges, which means that floating point is just a waste of silicon. HOWEVER: that's just what I mostly run into. I'm currently working on a project where, by its nature, the sensible numerical format is double- precision floating point (not FPGA -- it's _slow_ data reception on a PC- class processor, where double-precision floating point is almost as fast as integer math unless you use the DSP extensions). -- Tim Wescott Wescott Design Services http://www.wescottdesign.com I'm looking for work -- see my website!

I have installed V3S ... When I instantiate a component in the following manner: i_my: entity work.sub V3S complains that sub is unknown ... How can I suppress that behavior? (Ok, apart from using a component declaration...) Noro

Am Donnerstag, 2. Februar 2017 09:36:34 UTC+1 schrieb noreeli....@gmail.com: > I have installed V3S ... > > When I instantiate a component in the following manner: > > i_my: entity work.sub > > V3S complains that sub is unknown ... > > How can I suppress that behavior? (Ok, apart from using a component declaration...) > > Noro <sub> must be defined somewhere in the project (either another entity, or in a library/package) -> add that file to the project Thomas

It is too complicated for me to understand. I just want to divide numbers in the range -32767 to +32767 and get corresponding data from -1 to +1. To get this I must divide by 32767. For this purpose I use the Xilinx divider generator, so I use the signed core, with remainder type both fractional and remainder. It is OK when the result must be 1 or more, but when it is less than 1, that's sad. When the result must be more than 1 the quotient shows the right results in two's complement digits (according to the datasheet). But the fractional part is mad; I cannot use two's complement digit conversion. When I divide any number by 32767 (0111111111111111) the fractional result is always the dividend. I tried a 32-bit-wide fractional part but no result. Why does it not work? Has anyone met this problem?

On Fri, 03 Feb 2017 07:32:52 -0800, abirov wrote: > It is too complicated for me to understand. I just want to divide > numbers in the range -32767 to +32767 and get corresponding data from > -1 to +1. To get this I must divide by 32767. > > For this purpose I use the Xilinx divider generator, so I use the > signed core, with remainder type both fractional and remainder. > > It is OK when the result must be 1 or more, but when it is less than 1, > that's sad. > When the result must be more than 1 the quotient shows the right > results in two's complement digits (according to the datasheet). But > the fractional part is mad; I cannot use two's complement digit > conversion. When I divide any number by 32767 (0111111111111111) the > fractional result is always the dividend. I tried a 32-bit-wide > fractional part but no result. > > Why does it not work? Has anyone met this problem? Explain how it is "mad". And please, please, please, stop for a moment and think about how sensible it is to use up a whole bunch of resources to do a divide by 32767 when a divide by 32768 is just a matter of shifting down by 15 bits -- which, on an FPGA, is simply a matter of relabeling your wires. If you're absolutely bound and determined to divide by 32767, then use the following rule, which shouldn't take too much logic, because if you think about it you'll only be paying attention to the top two bits: * If the input number has an absolute value less than 0x4000, shift down by 15. * If the input number has an absolute value of 0x4000 or greater, shift down by 15 and add (or subtract) 1 to (from) it, depending on whether it's positive or negative. * Unless, of course, the input is 32767, in which case you shift down by 15 and _don't_ add 1, because if you do the result will be -1, which is a lot different from 1 - 1/32768. -- Tim Wescott Wescott Design Services http://www.wescottdesign.com I'm looking for work -- see my website!
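That three-bullet rule can be checked exhaustively in a few lines. Here is a quick Python model (mine, not from the thread) that compares it against true rounding of x/32767 expressed in 1.15 fixed point — i.e. against round(x * 32768 / 32767):

```python
def div32767_q15(x):
    """Approximate round(x * 32768 / 32767), i.e. x/32767 in Q1.15,
    using only the rule from the post: keep the wires as-is (the
    15-bit shift is just relabeling) and add/subtract 1 when
    |x| >= 0x4000, except at +32767 where +1.0 would wrap to -1."""
    if x == 32767:                 # exactly 1.0 is not representable
        return 32767               # saturate instead of wrapping
    if x >= 0x4000:
        return x + 1
    if x <= -0x4000:
        return x - 1
    return x

# Exhaustive check over the whole input range (ties never occur
# because 32767 is odd, so the rounding mode is irrelevant):
for x in range(-32767, 32768):
    ref = min(32767, round(x * 32768 / 32767))
    assert div32767_q15(x) == ref
```

Note that -32767 maps to -32768, which is exactly -1.0 in Q1.15 — only the positive end needs the saturation special case.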

Shifting down by 15 means what? My English is very poor, sorry. You mean shifting down like the following VHDL code: Ding(15) is the sign bit. Dout <= std_logic_vector(unsigned(Din(14 downto 1)) sll 14); ?

On Friday, February 3, 2017 at 11:36:36 PM UTC+6, Tim Wescott wrote: > On Fri, 03 Feb 2017 07:32:52 -0800, abirov wrote: > > > It is too complicated for me to understand. I just want to divide > > numbers in the range -32767 to +32767 and get corresponding data from > > -1 to +1. To get this I must divide by 32767. > > > > For this purpose I use the Xilinx divider generator, so I use the > > signed core, with remainder type both fractional and remainder. > > > > It is OK when the result must be 1 or more, but when it is less than 1, > > that's sad. > > When the result must be more than 1 the quotient shows the right > > results in two's complement digits (according to the datasheet). But > > the fractional part is mad; I cannot use two's complement digit > > conversion. When I divide any number by 32767 (0111111111111111) the > > fractional result is always the dividend. I tried a 32-bit-wide > > fractional part but no result. > > > > Why does it not work? Has anyone met this problem? > > Explain how it is "mad". > > And please, please, please, stop for a moment and think about how sensible > it is to use up a whole bunch of resources to do a divide by 32767 when a > divide by 32768 is just a matter of shifting down by 15 bits -- which, on > an FPGA, is simply a matter of relabeling your wires. > > If you're absolutely bound and determined to divide by 32767, then use > the following rule, which shouldn't take too much logic, because if you > think about it you'll only be paying attention to the top two bits: > > * If the input number has an absolute value less than 0x4000, shift down > by 15. > > * If the input number has an absolute value of 0x4000 or greater, shift > down by 15 and add (or subtract) 1 to (from) it, depending on whether it's > positive or negative. > > * Unless, of course, the input is 32767, in which case you shift down > by 15 and _don't_ add 1, because if you do the result will be -1, > which is a lot different from 1 - 1/32768.
> > -- > > Tim Wescott > Wescott Design Services > http://www.wescottdesign.com > > I'm looking for work -- see my website! I got it, and that finishes this problem. Thanks very much, everything is OK now.

On 2/6/2017 8:26 AM, abirov@gmail.com wrote: > Shifting down by 15 means what? My English is very poor, sorry. > You mean shifting down like the following VHDL code: > > Ding(15) is the sign bit. > Dout <= std_logic_vector(unsigned(Din(14 downto 1)) sll 14); ? Shifting down means a right shift -- "down" because the value of the number gets smaller. The assumption is that the input value is an integer, so logically a divide by 32768 (2^15) is the same as a right shift by 15 bits. Your integer has no fractional part, so this would require using a 31-bit fixed-point number in 16.15 format, meaning 16 bits to the left of the binary point and 15 bits to the right. This prevents loss of data when shifting. However... Shifting can also be done by moving the binary point while keeping the data in place. In other words, treat the data Ding(15 downto 0) as a 1.15 fixed-point number rather than a 16-bit integer. I hope this is more clear. -- Rick C
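The "move the binary point, not the data" idea is easy to demonstrate. In this Python sketch (function names are mine, for illustration), the same 16-bit pattern is read either as a signed integer or as a 1.15 fixed-point value — no shift is ever performed:

```python
def as_int16(bits):
    """Read a 16-bit pattern as a signed (two's complement) integer."""
    return bits - 65536 if bits & 0x8000 else bits

def as_q15(bits):
    """Read the *same* pattern as 1.15 fixed point: a sign bit followed
    by 15 fractional bits.  No wire moves; only the weights change."""
    return as_int16(bits) / 32768.0

for pattern in (0x7FFF, 0x4000, 0x0001, 0x8000):
    print(f"{pattern:#06x}: int16={as_int16(pattern):6d}  "
          f"q15={as_q15(pattern):+.6f}")
# 0x7fff reads as 32767 or as +0.999969 (= 1 - 2**-15);
# 0x8000 reads as -32768 or as exactly -1.0.
```

This is the relabeling Rick and Tim describe: the division by 32768 happens entirely in how you interpret the bits, which only matters physically at multipliers and I/O boundaries.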

On Mon, 06 Feb 2017 14:24:59 -0500, rickman wrote: > On 2/6/2017 8:26 AM, abirov@gmail.com wrote: >> Shifting down by 15 means what? My english very poor sorry. You mean >> shifting down like following vhdl code : >> >> Ding(15) is sign bit. >> Dout <= std_logic_vector(unsigned(Din(14 downto 1)) sll 14); ? > > Shifting down mean a right shift. "Down" because the value of the > number is less. The assumption is that the input value is an integer. > So logically to divide by 32768 (2^15) would be the same as a right > shift by 15 bits. Your integer has no fractional part so this would > require using a fixed point 31 bit number 16.15 which means 16 bits to > the left of the binary point and 15 bits to the right. This will > prevent loss of data when shifting. However... > > Shifting can also be done by moving the binary point while keeping the > data in place. In other words, treat the data Ding(15 downto 0) as a > 1.15 fixed point number rather than a 16 bit integer. > > I hope this is more clear. That's what I said! Only Rick's version makes sense. Yes -- perform a right shift. Except, as Rick says, you're not really moving anything, you're just re-labeling the wires. Your 16-bit integer had a wire with weight 1, a wire with weight 2, etc., all the way up to a wire with weight 32768. You "shift" that by relabeling your wires as having weight 1/32768, 1/16384, ... 1/2, 1. Note that there is no physical operation whatsoever inside your chip to perform this shift -- you're just _thinking differently_ about the number for all operations except multiplications. -- Tim Wescott Wescott Design Services http://www.wescottdesign.com I'm looking for work -- see my website!

On 02/06/2017 12:23 PM, Tim Wescott wrote: > On Mon, 06 Feb 2017 14:24:59 -0500, rickman wrote: > >> On 2/6/2017 8:26 AM, abirov@gmail.com wrote: >>> Shifting down by 15 means what? My english very poor sorry. You mean >>> shifting down like following vhdl code : >>> >>> Ding(15) is sign bit. >>> Dout <= std_logic_vector(unsigned(Din(14 downto 1)) sll 14); ? >> >> Shifting down mean a right shift. "Down" because the value of the >> number is less. The assumption is that the input value is an integer. >> So logically to divide by 32768 (2^15) would be the same as a right >> shift by 15 bits. Your integer has no fractional part so this would >> require using a fixed point 31 bit number 16.15 which means 16 bits to >> the left of the binary point and 15 bits to the right. This will >> prevent loss of data when shifting. However... >> >> Shifting can also be done by moving the binary point while keeping the >> data in place. In other words, treat the data Ding(15 downto 0) as a >> 1.15 fixed point number rather than a 16 bit integer. >> >> I hope this is more clear. > > That's what I said! Only Rick's version makes sense. > > Yes -- perform a right shift. Except, as Rick says, you're not really > moving anything, you're just re-labeling the wires. Your 16-bit integer > had a wire with weight 1, a wire with weight 2, etc., all the way up to a > wire with weight 32768. You "shift" that by relabeling your wires as > having weight 1/32768, 1/16384, ... 1/2, 1. > > Note that there is no physical operation whatsoever inside your chip to > perform this shift -- you're just _thinking differently_ about the number > for all operations except multiplications. > The bundle of wires doesn't care whether you think it has a binary point in it or not. -- Rob Gaddi, Highland Technology -- www.highlandtechnology.com Email address domain is currently out of order. See above to fix.

On Wednesday, 25 January 2017 05:44:39 UTC+1, Tim Wescott wrote: > This is kind of a survey; I need some perspective (possibly historical) > > Are there any digital systems that you know of that use 1's complement or > signed-magnitude number representation for technical reasons? > > Have you ever used it in the past? > > Is the world down to legacy applications and interfacing with legacy > sensors? > > TIA. > > -- > > Tim Wescott > Wescott Design Services > http://www.wescottdesign.com > > I'm looking for work -- see my website! I've used sign-magnitude for some compression schemes, in which case the sign bit takes the least significant place. 10 = -1 11 = +1 100 = -2 101 = +2 110 = -3 111 = +3 1000 = -4 ...
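The table above follows a simple rule — magnitude in the high bits, sign in the LSB (0 for negative in this scheme) — which keeps small magnitudes short, as in exp-Golomb-style codes. A sketch of the mapping (my reading of the table, not the poster's actual code; zero has no codeword here):

```python
def encode(n):
    """Sign-magnitude with the sign in the LSB: 0 = negative,
    1 = positive.  Matches the table: 10 -> -1, 11 -> +1,
    100 -> -2, 101 -> +2, ...  Requires n != 0."""
    return (abs(n) << 1) | (1 if n > 0 else 0)

def decode(code):
    return (code >> 1) if code & 1 else -(code >> 1)

# Reproduces the posted table exactly:
assert [encode(n) for n in (-1, 1, -2, 2, -3, 3, -4)] == \
       [0b10, 0b11, 0b100, 0b101, 0b110, 0b111, 0b1000]
assert all(decode(encode(n)) == n for n in range(-100, 101) if n != 0)
```

Putting the sign in the least significant place means the codeword length depends only on the magnitude, which is what makes this convenient ahead of a variable-length coder.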

So, there are algorithms out there to perform an FFT on real data, that save (I think) roughly 2x the calculations of FFTs for complex data. I did a quick search, but didn't find any that are made specifically for FPGAs. Was my search too quick, or are there no IP sources to do this? It would seem like a slam-dunk for Xilinx and Intel/Altera to include these algorithms in their FFT libraries. -- Tim Wescott Wescott Design Services http://www.wescottdesign.com I'm looking for work -- see my website!

On Sunday, February 12, 2017 at 12:05:25 PM UTC-6, Tim Wescott wrote: > So, there are algorithms out there to perform an FFT on real data, that > save (I think) roughly 2x the calculations of FFTs for complex data. > > I did a quick search, but didn't find any that are made specifically for > FPGAs. Was my search too quick, or are there no IP sources to do this? > > It would seem like a slam-dunk for Xilinx and Intel/Altera to include > these algorithms in their FFT libraries. > > -- > > Tim Wescott > Wescott Design Services > http://www.wescottdesign.com > > I'm looking for work -- see my website! It's been a long time, as I remember: The Hartley transform will work. Shuffling the data before and after a half size complex FFT will work. And you can use one of them to check the other.
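The pack/shuffle approach mentioned above is the standard trick: put the even samples in the real part and the odd samples in the imaginary part of a half-length sequence, run one N-point complex transform instead of a 2N-point one, then untangle the two interleaved spectra. A pure-Python sketch (a naive DFT stands in for the half-size FFT core, just to show the shuffle):

```python
import cmath

def dft(z):
    """Naive complex DFT -- stand-in for the half-size FFT core."""
    n = len(z)
    return [sum(z[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def real_fft(x):
    """FFT of a real sequence x (length 2N) via one N-point complex DFT."""
    n = len(x) // 2
    z = [complex(x[2 * t], x[2 * t + 1]) for t in range(n)]  # pack even/odd
    zf = dft(z)
    zf.append(zf[0])                                         # Z[N] = Z[0]
    out = []
    for k in range(n + 1):
        even = (zf[k] + zf[n - k].conjugate()) / 2           # DFT of evens
        odd = (zf[k] - zf[n - k].conjugate()) / 2j           # DFT of odds
        out.append(even + cmath.exp(-1j * cmath.pi * k / n) * odd)
    return out   # bins 0..N; the rest follow from conjugate symmetry

# Check against a full-size complex DFT of the same real data:
x = [1.0, 2.0, -1.0, 0.5, 0.0, 3.0, -2.0, 1.0]
ref = dft([complex(v, 0.0) for v in x])
got = real_fft(x)
assert all(abs(got[k] - ref[k]) < 1e-9 for k in range(len(x) // 2 + 1))
```

The untangling uses the conjugate symmetry of real-input DFTs, which is also why only bins 0..N need to be produced — roughly the factor-of-2 saving Tim is asking about.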
