Messages from 29250

Article: 29250
Subject: Re: OT: IEEE & Floating point
From: Terje Mathisen <terje.mathisen@hda.hydro.com>
Date: Sun, 11 Feb 2001 15:43:10 +0100

V R wrote:
> 
> Sorry for the off-topic and cross-post but I was curious (since we have
> the attention of so many now) if more "intelligent" floating point scheme
> exists (i.e. non-IEEE 754/854)? I know computers performed math before
> Intel's spec so of course there will be dozens of proprietary formats...
> 
> It feels like manipulation of floating point data in the 754/854 formats
> is more cumbersome than it needs to be. Are there any other schemes that
> are "simpler" (besides fixed point, etc) and/or easier to implement? Any
> implementations that nicely lend themselves to FPGAs? Obviously one will
> have to make trade-offs such as bit-size vs. precision, etc. but I'm
> inquiring about general schemes...

I've implemented a few sw math libs, so the following is based on that
experience, and not from hw implementations:

For a maximum speed/ease of implementation sw package, I start as others
here have suggested, i.e. disregard underflow, Inf, and NaN handling, and
possibly also Zero.

All this can be replaced with very simple saturating arithmetic on the
exponent part. Keeping Zero as a special case is probably OK, but it
means that you have 4 paths through each two-operand function instead of
just one.

When working on the Pentium FDIV sw workaround 7 years ago, I wrote a
quick&dirty library of 128-bit math operations, and used this to
implement Arctan(), so I could verify the results given by our
workaround code.

The storage format I used was a direct extension of the IEEE formats,
with a leading sign bit, a number of exponent bits and the remainder
left over for the mantissa. I believe I used explicit storage of the
leading (1) mantissa bit, but I don't remember exactly.

Anyway, the only significant part of the IEEE spec I disobeyed totally was
the requirement for multiple rounding modes; I just used a single
round-to-nearest rule instead.
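
To make the idea concrete, here is a minimal Python sketch of such a simplified multiply. It is not the library described above: the 24-bit mantissa, the exponent range and all function names are assumptions chosen for illustration. There is no Zero, Inf or NaN, the exponent saturates, and rounding is a single round-to-nearest (ties round up in magnitude).

```python
import math

P = 24                         # mantissa bits, explicit leading 1 included (assumed)
EXP_MIN, EXP_MAX = -126, 127   # assumed exponent range

def from_float(x):
    """Encode a nonzero float as (sign, exp, mant); extra bits are truncated."""
    m, e = math.frexp(abs(x))              # abs(x) = m * 2**e, 0.5 <= m < 1
    return (1 if x < 0 else 0, e - 1, int(m * (1 << P)))

def to_float(v):
    sign, exp, mant = v
    return (-1.0) ** sign * math.ldexp(mant, exp - (P - 1))

def saturate(sign, exp, mant):
    """Clamp out-of-range results to the largest or smallest magnitude."""
    if exp > EXP_MAX:
        return (sign, EXP_MAX, (1 << P) - 1)
    if exp < EXP_MIN:
        return (sign, EXP_MIN, 1 << (P - 1))
    return (sign, exp, mant)

def fmul(a, b):
    """Multiply two values; a single path, no special cases."""
    sa, ea, ma = a
    sb, eb, mb = b
    prod = ma * mb                         # 2P-1 or 2P significant bits
    exp = ea + eb
    shift = P - 1
    if prod >> (2 * P - 1):                # product in [2, 4): renormalize
        shift += 1
        exp += 1
    mant = (prod + (1 << (shift - 1))) >> shift   # round to nearest
    if mant >> P:                          # rounding carried out of the mantissa
        mant >>= 1
        exp += 1
    return saturate(sa ^ sb, exp, mant)

print(to_float(fmul(from_float(1.5), from_float(2.5))))
```

Adding Zero back as a special case would mean checking each operand before the main path, which is where the extra paths mentioned above come from.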

Terje
-- 
- <Terje.Mathisen@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"

Article: 29251
Subject: Re: any idea ?
From: Peter Alfke <palfke@earthlink.net>
Date: Sun, 11 Feb 2001 14:58:45 GMT

rk wrote:

How about when the counter counts?

>
>     01111111
>     10000000
>
> Hint for the h-work kid: look up static hazard in your logic book.

Yes.
Why go to the trouble of using exotic schemes, like an LFSR, when a binary
solution is actually simpler, and also equally fast?

Also, unless you know the tricks, an LFSR divides by 255, not 256.
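
A small Python model shows where the 255 comes from. The tap positions (8, 6, 5, 4) are an assumption picked for illustration as a commonly tabulated maximal-length choice for 8 bits; any maximal-length 8-bit LFSR behaves the same way. The all-zeros state maps to itself, so the register cycles through only the 255 nonzero states.

```python
def lfsr_step(state):
    """One shift of an 8-bit Fibonacci LFSR with XOR taps at bits 8, 6, 5, 4."""
    fb = ((state >> 7) ^ (state >> 5) ^ (state >> 4) ^ (state >> 3)) & 1
    return ((state << 1) | fb) & 0xFF

def period(start):
    """Number of shifts until the register returns to its start state."""
    state, n = lfsr_step(start), 1
    while state != start:
        state, n = lfsr_step(state), n + 1
    return n

print(period(1))   # length of the cycle through the nonzero states
print(period(0))   # the all-zeros state is stuck
```

One known trick is extra logic that splices the missing state into the sequence, which costs gates that a plain binary counter does not need.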

Peter Alfke


Article: 29252
Subject: Re: OT: IEEE & Floating point
From: Peter Alfke <palfke@earthlink.net>
Date: Sun, 11 Feb 2001 15:06:55 GMT

V R wrote:

> . Are there any other schemes that
> are "simpler" (besides fixed point, etc) and/or easier to implement? Any
> implementations that nicely lend themselves to FPGAs?

Virtex-II now has fast combinatorial multipliers (18 x 18, 2's complement),
and such a multiplier can obviously also be used to shift the exponent.
Saves a lot of multiplexers and routing. And is fast.

Peter Alfke

Article: 29253
Subject: Re: any idea ?
From: Peter Alfke <palfke@earthlink.net>
Date: Sun, 11 Feb 2001 15:08:20 GMT

rk wrote:

How about when the counter counts?

>
>     01111111
>     10000000
>
> Hint for the h-work kid: look up static hazard in your logic book.

Yes.
Why go to the trouble of using exotic schemes,
like an LFSR, when a binary
solution is actually simpler, and also equally fast?

Also, unless you know the tricks, an LFSR divides
by 255, not 256.

Peter Alfke


Article: 29254
Subject: Re: Mentor Advice
From: eml@riverside-machines.com.NOSPAM
Date: Sun, 11 Feb 2001 15:44:30 GMT

On Fri, 09 Feb 2001 23:24:12 GMT, s_clubb@NOSPAMnetcomuk.co.uk (Stuart
Clubb) wrote:

>That's basically taking Renoir out of the flow. You would be missing
>the design management capabilities of Renoir and compromising the
>ability of your colleagues to cooperatively work with your design as
>part of a project. The documentation and re-use of your design would
>likely be a lot harder for those within your organisation using
>Renoir. Sure it adds a little overhead, but once you understand "how"
>it all works, you'll find Renoir quite useful I suspect.

If you don't mind me saying so, this sounds like a case of the tail
wagging the dog. Surely the point is that Renoir has to integrate into
existing proven design flows, and not the other way round. I worked in
a group last year where some people used Renoir, and some didn't. This
caused a huge amount of trouble. Maybe the situation is better with
2000.x; I don't know.

Evan

PS - can't understand why you'd prefer Portland to Stevenage - very
odd... :)

Article: 29255
Subject: Re: any idea ?
From: rk <stellare@nospamplease.erols.com>
Date: Sun, 11 Feb 2001 10:52:43 -0500

Peter Alfke wrote:
> 
> How about when the counter counts?
> 
> >
> >     01111111
> >     10000000
> >
> > Hint for the h-work kid: look up static hazard in your logic book.
> 
> Yes.

It would be interesting to see if they take off on H-Work assignments
for a pulse that has potential glitches in it.

> Why go to the trouble of using exotic schemes,
> like LFSR, when a binary
> solution is actually simpler, and also equally fast.

<rk shrugs>  Lots of solutions to this one.  Very simple H-Work problem.
 
> Also, unless you know the tricks, an LFSR divides
> by 255, not 256.

Hints for the H-Work kid:

   1. _The Art of Electronics_, Horowitz and Hill

   2. _HDL Chip Design_, Smith.

-----------------------------------------------------------------------
rk                               A designer has arrived at perfection
stellar engineering, ltd.        not when there is no longer anything
stellare@erols.com.NOSPAM        to add, but when there is no longer
Hi-Rel Digital Systems Design    anything to take away - Bentley, 1983

Article: 29256
Subject: Re: what exactly is the dff between fpga and cpld?
From: rk <stellare@nospamplease.erols.com>
Date: Sun, 11 Feb 2001 11:06:37 -0500

Peter Alfke wrote:
> 
> This is a good basic article ( and I am notoriously critical of anybody else's
> tutorials :-)
> Minor flaws:
> The author describes antifuse circuits as if they used fuses ( they really make a
> connection, not break it ) and he fails to mention the enormous difference in
> flip-flop count between CPLDs and FPGAs:
> CPLDs have from 32 or 36 to <300 flip-flops, while FPGAs nowadays start at a
> couple of hundred and end above 50,000 flip-flops. So that is a dramatic
> difference.
> The speed difference has disappeared, FPGAs are now as fast as CPLDs, even for
> simple functions.
> But, fundamentally, a good introductory story.

Agreed.

A few minor nits.

     It’s important to take the FPGA configuration time
     (at startup) into account when designing your 
     system. If you need instant power-on performance,
     you probably want to use a flash memory-based device
     or an OTP device.

OTP devices, by themselves, do not give one instant power-on
performance.  One must take FPGA start time into account when designing
a system.  Real devices attached to your FPGA may be affected during the
startup transient.  This includes devices such as relays and pyrotechnic
devices.

I would also have included Quicklogic and Atmel in the list of suppliers
of FPGAs.

----------------------------------------------------------------------
rk                               How the hell do I know? I'm just a
stellar engineering, ltd.        common, ordinary, simple savior of
stellare@erols.com.NOSPAM        America's destiny.
Hi-Rel Digital Systems Design    -- Pat Paulsen

Article: 29257
Subject: Re: any idea ?
From: karenwlead@my-deja.com
Date: Sun, 11 Feb 2001 17:18:05 GMT

hello,

Thanks to you all for the replies. I have to mention that it's not
homework...

Now, how can I generate these 7 waveforms EFFICIENTLY?

waveform 1
----------
0 from 0-->4 cycles and 1 from 5-->255

waveform 2
----------
0 from 0-->3 cycles and 1 from 4-->255

waveform 3
----------
0 from 0-->2 cycles and 1 from 3-->255

waveform 4
----------
0 from 0-->1 cycles and 1 from 2-->255

waveform 5
----------
"1" from 0-->251 cycles and "0" from 252-->255

waveform 6
----------
1 from 0-->252 cycles and 0 from 253-->255

waveform 7
----------
1 from 0-->253 cycles and 0 from 254-->255

I can use what you suggested, many comparators in parallel, but then the
problem of fan-out occurs. I could also duplicate the counter, but is there
a more appropriate way to do it?


thanks




In article <3A86B54B.D8BAEC41@nospamplease.erols.com>,
  rk <stellare@nospamplease.erols.com> wrote:
> Peter Alfke wrote:
> >
> > How about when the counter counts?
> >
> > >
> > >     01111111
> > >     10000000
> > >
> > > Hint for the h-work kid: look up static hazard in your logic book.
> >
> > Yes.
>
> It would be interesting to see if they take off on H-Work assignments
> for a pulse that has potential glitches in it.
>
> > Why go to the trouble of using exotic schemes,
> > like LFSR, when a binary
> > solution is actually simpler, and also equally fast.
>
> <rk shrugs>  Lots of solutions to this one.  Very simple H-Work problem.
>
> > Also, unless you know the tricks, an LFSR divides
> > by 255, not 256.
>
> Hints for the H-Work kid:
>
>    1. _The Art of Electronics_, Horowitz and Hill
>
>    2. _HDL Chip Design_, Smith.
>
> -----------------------------------------------------------------------
> rk                               A designer has arrived at perfection
> stellar engineering, ltd.        not when there is no longer anything
> stellare@erols.com.NOSPAM        to add, but when there is no longer
> Hi-Rel Digital Systems Design    anything to take away - Bentley, 1983
>


Sent via Deja.com
http://www.deja.com/

Article: 29258
Subject: Re: any idea ?
From: Falk Brunner <Falk.Brunner@gmx.de>
Date: Sun, 11 Feb 2001 18:40:40 +0100

karenwlead@my-deja.com wrote:
> 
> hello,
> 
> thanks to you all for the reply. I have to mention that it's not a
> homework...
> 
> Now, how can i generate those 7 waveforms EFFICIENTLY
> 
> waveform 1
> ----------
> 0 from 0-->4 cycles and 1 from 5-->255
> 
>
> 
> i can use what you suggested, many comparators in //, but the problem
> of fan out occurs. As i can duplicate the counter. but is there any
> more appropriate way to do it ?

There are many solutions. One quick and dirty one is to use block RAM.
Generate a file with 256 bytes, where waveform 1 is the first bit (LSB) of
each byte, waveform 2 is the second bit, and so on. Load this table into
your block RAM (use coregen to generate a 256x8-bit ROM), attach an 8-bit
counter to the address input, and you are done. The waveforms can be as
complex as you like; it won't degrade performance.
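
As an illustration, the table contents for the seven waveforms in the question could be generated with a few lines of Python. This is only a sketch: the function names are made up here, and the output format that a particular ROM generator expects is not covered.

```python
def waveform_bits(n):
    """Levels of waveforms 1..7 at counter value n (0..255)."""
    return [
        int(n >= 5),     # waveform 1: 0 for 0..4, 1 for 5..255
        int(n >= 4),     # waveform 2
        int(n >= 3),     # waveform 3
        int(n >= 2),     # waveform 4
        int(n <= 251),   # waveform 5: 1 for 0..251, 0 for 252..255
        int(n <= 252),   # waveform 6
        int(n <= 253),   # waveform 7
    ]

def build_rom():
    """256-byte table: bit k of each byte holds waveform k+1, bit 7 unused."""
    rom = bytearray(256)
    for addr in range(256):
        for k, bit in enumerate(waveform_bits(addr)):
            rom[addr] |= bit << k
    return rom

rom = build_rom()
print(" ".join(f"{b:02X}" for b in rom[:8]))
```

Because the ROM output changes only as a function of the registered address, changing a waveform later means regenerating the table, not redesigning logic.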

-- 
Regards
Falk

Article: 29259
Subject: Re: OT: IEEE & Floating point
From: "Jan Gray" <jsgray@acm.org>
Date: Sun, 11 Feb 2001 17:43:16 GMT

"Peter Alfke" <palfke@earthlink.net> wrote in message
news:3A86AA7C.D34633E4@earthlink.net...
> Virtex-II now has fast combinatorial multipliers ( 18 x 18 2's complement ),
> and such a multiplier can obviously also be used to shift the exponent.
> Saves a lot of multiplexers and routing. And is fast
>
> Peter Alfke

Ignoring rounding modes, etc., the two most area- and interconnect-intensive
jobs in an FPGA-implemented FPU are the multiplier, and the denormalize
(mantissa binary point alignment) and normalize barrel shifters in the
adder. Focusing on the latter, you can indeed use the new 18x18 multipliers
either iteratively or in parallel to do these shifts.

But in this posting [www.fpgacpu.org/usenet/fp.html] (which references
several other FPGA FPU implementations), I proposed an alternative FP adder
implementation.  Maybe it's wacky and unusable, I don't know.  Here's the
idea:

If you perform FP addition in a bit- or nybble-serial fashion, you can
implement the binary point alignment denormalization, and the subsequent sum
normalization, with a variable-tap shift register, which is implemented
extremely efficiently in Virtex-derivative FPGAs using the powerful SRL16
primitive (which packs a variable-tap 16-element shift register into a
single logic cell).  (For more on applying SRL16, see
[http://www.xilinx.com/support/techxclusives/SRL16-techxclusive2.htm].)

Of course, a bit- or nybble-serial FP adder will be slower than a
combinational one.  But you can now implement one in much less area.  If you
can express your FP computation as a parallelizable data flow, you might
instantiate many of these area-optimized FP adders in the space of one
combinational FP adder.  Will the throughput be higher?  I don't know.
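
A rough behavioral model in Python may help picture the alignment step. It is a sketch under assumptions, not the proposed design: the `SRL16` class below only mimics a 16-deep shift register with a selectable tap, mantissas are sent LSB first, and the operand with the larger exponent is delayed by the exponent difference, which is equivalent to shifting the other operand right.

```python
class SRL16:
    """Toy model of a 16-deep shift register with an addressable tap."""
    def __init__(self):
        self.reg = [0] * 16              # reg[0] is the most recent input bit

    def clock(self, din, addr):
        q = self.reg[addr]               # tap value seen during this cycle
        self.reg = [din] + self.reg[:15]
        return q                         # din reappears addr + 1 cycles later

def to_bits(value, n):
    return [(value >> i) & 1 for i in range(n)]   # LSB first

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

def serial_aligned_add(big, small, shift, n_bits=8):
    """Compute big * 2**shift + small one bit per clock (0 <= shift <= 15)."""
    cycles = n_bits + shift + 2          # SRL latency plus final carry
    a_in = to_bits(big, n_bits) + [0] * (cycles - n_bits)
    b_in = to_bits(small, n_bits) + [0] * (cycles - n_bits)
    srl_a, srl_b = SRL16(), SRL16()
    carry, out = 0, []
    for t in range(cycles):
        a = srl_a.clock(a_in[t], shift)  # delayed by shift + 1 cycles
        b = srl_b.clock(b_in[t], 0)      # delayed by 1 cycle
        out.append(a ^ b ^ carry)        # one-bit full adder
        carry = (a & b) | (carry & (a ^ b))
    return from_bits(out[1:])            # drop the common 1-cycle latency

print(serial_aligned_add(5, 3, 2))
```

The only per-operation control is the 4-bit tap address, which is why the variable shift costs so little logic in this scheme; the price is one clock per result bit.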

Jan Gray, Gray Research LLC
FPGA CPU News: www.fpgacpu.org

Article: 29260
Subject: Re: any idea ?
From: rk <stellare@nospamplease.erols.com>
Date: Sun, 11 Feb 2001 13:02:25 -0500

karenwlead@my-deja.com wrote:
> 
> hello,
> 
> thanks to you all for the reply. I have to mention that it's not a
> homework...

OK, it looks like homework.  

I would still suggest looking up the references.

Good luck!

----------------------------------------------------------------------
rk                               We had dodged bullets before, but
stellar engineering, ltd.        this time we caught one in midair and
stellare@erols.com.NOSPAM        spit it out.
Hi-Rel Digital Systems Design    -- Gene Kranz after Apollo 5

Article: 29261
Subject: FPGAs take wrong road. SoC NO - on-the-fly reprogrammability YES
From: "Dan" <daniel.deconinck@sympatico.ca>
Date: Sun, 11 Feb 2001 18:08:23 GMT

In EET, Feb 5, 2001, Ron Wilson argues that on-the-fly reprogrammable FPGAs
are a much better architecture than SoC.

The leading FPGA companies are going full speed into SoC / platform FPGAs.

How valid is Ron's point of view ?

Dan

Article: 29262
Subject: Re: double precision floating point arithmetic
From: "Matt Billenstein" <mbillens@mbillens.yi.org>
Date: Sun, 11 Feb 2001 18:17:54 GMT

Well, thanks to everyone for the good information...  I'm trying to
accelerate a ray tracing application (POVRay) using digital hardware.  My
guess is that I do not have to be strictly IEEE compliant in my
implementation, but the numbers I hand back to the have to be in double
precision format.

I'm counting on everything being heavily pipelined and I am looking into
CORDIC for implementing the transcendentals.  The board I'm using hangs off
the memory bus, so I'd like everything to be synchronous to the 66 or 100
MHz SDRAM clock.

I'm doing this for a research project for school, so no worry of me bidding
it too low :)

Matt



Matt Billenstein
mbillens (at) one (dot) net
http://w3.one.net/~mbillens/


"Matt Billenstein" <mbillens@mbillens.yi.org> wrote in message
news:6L2h6.1820$xh3.173569@typhoon.kc.rr.com...
| All,
|
| I've taken on a project where I'll be implementing a number of math
| functions on IEEE double precision floating point types (64 bit).
| Multiplication, division, addition, and subtraction are fairly straight
| forward.  I'll need to do cosine, exponential (e^x), and square roots.  Any
| advice/pointers/book titles would be appreciated.  I'll be implementing in
| VHDL targeting a large Xilinx VirtexE device (XCV1000E).  Hopefully at 66 or
| 100 MHz.
|
| Thanks,
|
| Matt
|
|
| --
|
| Matt Billenstein
| REMOVEmbillens@one.net
| REMOVEhttp://w3.one.net/~mbillens/
|
|
|
|

Article: 29263
Subject: Re: any idea ?
From: Jim Granville <jim.granville@designtools.co.nz>
Date: Mon, 12 Feb 2001 07:54:17 +1300

karenwlead@my-deja.com wrote:
> Now, how can i generate those 7 waveforms EFFICIENTLY
<snip> 
> i can use what you suggested, many comparators in //, but the problem
> of fan out occurs. As i can duplicate the counter. but is there any
> more appropriate way to do it ?

 To avoid the waste of comparators, and the decode glitches, the
most efficient approach is to use a JK register design.
( T -> JK is the usual synth path ).

Then, you need decode only the SET and CLEAR points, and if you have
edge-adjacent waveforms ( as you do ), you can save more.
a) common the CLR on OP 1..4 all  K = 1 at CV == 255 ( 1 clk delay )
b) Either compare, or count, to create J=1 at CV==1 for OP4.
   Then J3=OP4, J2=OP3, J1 = OP2 will be 1 clock wider each.
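
The scheme in (a) and (b) can be checked with a short Python simulation. This is one reading of the description, with assumed names: `cv` is the 8-bit counter value, `op[1]`..`op[4]` are the four JK flip-flop outputs, and everything is clocked on the same edge.

```python
def jk_next(q, j, k):
    """Next state of a JK flip-flop: set, clear, hold, or toggle."""
    return (j & (1 - q)) | ((1 - k) & q)

def simulate(cycles=512):
    """Return (cv, OP1, OP2, OP3, OP4) for each clock cycle."""
    cv = 0
    op = {1: 0, 2: 0, 3: 0, 4: 0}
    trace = []
    for _ in range(cycles):
        trace.append((cv, op[1], op[2], op[3], op[4]))
        k = int(cv == 255)                         # common clear decode
        op = {
            4: jk_next(op[4], int(cv == 1), k),    # set decode at CV == 1
            3: jk_next(op[3], op[4], k),           # J3 = OP4
            2: jk_next(op[2], op[3], k),           # J2 = OP3
            1: jk_next(op[1], op[2], k),           # J1 = OP2
        }
        cv = (cv + 1) & 0xFF
    return trace

for cv, o1, o2, o3, o4 in simulate()[:8]:
    print(cv, o1, o2, o3, o4)
```

Only two counter decodes are needed for four outputs, and since every output comes straight from a flip-flop there is no combinational decode glitch on it.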

=jg

-- 
======= 80x51 Tools & IP Specialists  =========
= http://www.DesignTools.co.nz

Article: 29264
Subject: Re: double precision floating point arithmetic
From: Terje Mathisen <terje.mathisen@hda.hydro.com>
Date: Sun, 11 Feb 2001 20:06:33 +0100

Matt Billenstein wrote:
> 
> Well, thanks to everyone for the good information...  I'm trying to
> accelerate a ray tracing application (POVRay) using digital hardware.  My
> guess is that I do not have to be strictly IEEE compliant in my
> implementation, but the numbers I hand back to the have to be in double
> precision format.
> 
> I'm counting on everything being heavily pipelined and I am looking into
> CORDIC for implementing the transcendentals.  The board I'm using hangs off
> the memory bus, so I'd like everything to be synchronous to the 66 or 100
> MHz SDRAM clock.
> 
> I'm doing this for a research project for school, so no worry of me bidding
> it too low :)

Using POVRay as the single target for your job makes things much
simpler, in some ways!

I would take a long, close look at the full POVRay pipeline, and see
what's actually needed to get from input to output, as opposed to how
the algorithm is currently broken down into basic mathlib operations.

One idea is to explicitly split each transcendental operation into a
corresponding polynomial function, with sufficient precision to achieve
the final sub-pixel accuracy required.

When you do this you'll notice that many of those underlying operations
really don't need anything like full IEEE double semantics.

You could even look into a limited-precision fixed-point number format,
which makes both addition and multiplication simpler.
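
As a small illustration of the polynomial idea, here is a Python sketch that evaluates cosine with Horner's rule and measures the error. The degree, the Taylor coefficients and the range of plus or minus pi/4 are assumptions chosen for the example; a real design would pick them from the accuracy actually required, and would likely use a minimax fit.

```python
import math

# Taylor coefficients of cos(x) in powers of x**2:
# 1 - x^2/2 + x^4/24 - x^6/720
COEFFS = [1.0, -1.0 / 2, 1.0 / 24, -1.0 / 720]

def cos_poly(x):
    """Evaluate the polynomial with Horner's rule: 3 multiply-adds after x*x."""
    x2 = x * x
    acc = 0.0
    for c in reversed(COEFFS):
        acc = acc * x2 + c
    return acc

def worst_error(steps=1000):
    """Largest absolute error on a grid over [-pi/4, pi/4]."""
    limit = math.pi / 4
    return max(
        abs(cos_poly(i * limit / steps) - math.cos(i * limit / steps))
        for i in range(-steps, steps + 1)
    )

print(worst_error())
```

Each Horner step is one multiply and one add, so the whole evaluation maps onto a short chain of identical pipeline stages.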

Terje

-- 
- <Terje.Mathisen@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"

Article: 29265
Subject: Re: Wired-or on Virtex FPGAs
From: krw@btv.ibm.com (Keith R. Williams)
Date: Sun, 11 Feb 2001 21:14:41 GMT

On Sat, 10 Feb 2001 03:09:18 GMT, Phil Hays <spampostmaster@home.com>
wrote:

>Carry chain can be inferred by synthesis tools, however the code may not be
>highly readable.  For example, to create an OR gate:
>
>OR_temp <= '0' & A & B & C & D & E;
>Result_temp <= OR_temp + "011111";
>Result <= Result_temp(5); -- result is zero unless (A or B or C or D or E) = 1
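
The arithmetic behind this trick is easy to check exhaustively in Python. The OR version follows the quoted lines; the AND version (adding 1 and taking the carry out) is an assumption about what "a simple add for an AND" means, included only as a sketch.

```python
from itertools import product

def pack(bits):
    """Pack a sequence of bits into an integer, first bit most significant."""
    x = 0
    for b in bits:
        x = (x << 1) | b
    return x

def wide_or(bits):
    """OR of 5 bits: bit 5 of ('0' & bits) + "011111"."""
    return ((pack(bits) + 0b011111) >> 5) & 1

def wide_and(bits):
    """AND of 5 bits: carry out of bits + 1."""
    return ((pack(bits) + 1) >> 5) & 1

ok = all(
    wide_or(bits) == int(any(bits)) and wide_and(bits) == int(all(bits))
    for bits in product((0, 1), repeat=5)
)
print(ok)
```

In both cases the answer is the carry out of the adder, which is why the synthesized logic ends up on the carry chain.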

Neat.  However, the results from SynplifyPro don't show much of a
gain. I must be missing something. If I just did a simple 12 bit add
(for a 12 bit AND) Synplify inferred twelve LUTS feeding twelve
MUXCYs.  The speed was nothing to write home about (3.5ns in a VirtexE
-7).  The two-stage 12-input AND that Synplify infers from a normal
IF/THEN/ELSE comes in at 2.5ns. In both tests the AND feeds a
DFF.

I then instantiated three four-input ANDs and then did a three-bit
add.  Synplicity inferred a two stage LUT feeding the flop.  Grrr.

Only when I extended the AND to three stages (32 bits) did the carry
method become a tad faster.  Synplify says the ADD method is 3.5ns,
vs. 3.6 for the three stage LUT (seems strange as I write, I'll have
to look at this again).

>I'd suggest using a procedure to improve readability.

Understood.  Good comments help a lot too. ;-)

>Biggest gain in speed is from using the carry chain for priority encoders, large
>AND and OR gates gain some.

Any tricks for a fast address comparator?  It seems there ought to be
some way to do this with carry chains.  Looking at the logic drawings
of the Virtex CLB, I don't immediately see it though.  I have need of
a fast comparator to compare an address bus against a register
(variable address).  One input has to be fast, the other doesn't. I'm
now pipelining the operation over three cycles (XOR, AND, Encode),
which is ugly.

----
  Keith

Article: 29266
Subject: Re: Wired-or on Virtex FPGAs
From: "Jan Gray" <jsgray@acm.org>
Date: Sun, 11 Feb 2001 22:00:55 GMT

There's a nice example of a Virtex-carry-chain-optimized n-bit comparator in
VHDL from Arrigo Benedetti in the fpga-cpu list archive, at
http://groups.yahoo.com/group/fpga-cpu/message/73.

Jan Gray, Gray Research LLC

Article: 29267
Subject: Re: OT: SEIKO-EPSON LCD behaving strange.
From: Philip Freidin <philip@fliptronics.com>
Date: Sun, 11 Feb 2001 17:11:29 -0800

The display is upside down. Rotate 180 degrees.

On Sun, 11 Feb 2001 04:10:15 +0100, "Daniel Nilsson"
<danielnilsson@REMOVE_THIShem3.passagen.se> wrote:
>This LCD almost lacks own logic, many signals need to be generated etc.
>The LCD works fine now, I can put things into memory and it will come out
>fine on the display... The problem is that when I try to write to the
>display on position (x,y) = (0,0) then it ends up at (255,62), (10,0) @
>(245,62), (10,10) @ (245,52) ... and the datasheet says that it should begin
>counting from upper left corner... mine counts from (lower-1) right corner,
>and counts upwards towards lower Y-value (if (0,0) is upper left corner)
>I have verified that all timing in datasheet is met, still it doesn't behave
>properly.
>
>Has anyone of you had any similar experience with LCD displays?
>
>/Daniel Nilsson
>
>
>
>
>
>
>

Philip Freidin
Fliptronics

Article: 29268
Subject: Re: New DES/AES (RIJNDAEL) Cores
From: "Jan Gray" <jsgray@acm.org>
Date: Mon, 12 Feb 2001 02:15:31 GMT

Jabari Zakiya <jzakiya@mail.com> wrote
> But it should be
> unquestionable that clocked CMOS devices draw more power than
> unclocked CMOS devices of the same technology.

While I don't know much about low power design, I do know that this does not
necessarily follow, because while a purely combinational design may save on
clocking power, it may also waste power due to internal signal glitching.

Consider one gate G in the middle of an unbalanced, deep, purely
combinational logic expression graph.  If G's inputs arrive at, and settle
at, different times, it may GLITCH high then low then high again, over and
over, charging and discharging its output net, and perhaps causing the
downstream gates on that net to themselves glitch high and low, and so
forth.  Result: arbitrary amounts of wasted power due to glitching.

Now consider the same gate G, but this time in the midst of a pipelined
design, with pipeline registers at each two levels of logic.  Here, even if
G's inputs are unbalanced (perhaps one input is driven by a gate that is fed
by some registers, another input is driven by a register directly) then there
will be at most one glitch originating at G, and since its output is
registered, this glitch is not seen by downstream gates.  Result: less or no
power wasted due to glitching.

To quote slide# "Architecture:18" from the wonderful ISCA00 Tutorial "Low
Power Design: From Soup to Nuts", presented by Mary Jane Irwin and
Vijaykrishnan Narayanan of Penn State:

"Glitch Reduction by Pipelining
* Glitches are dependent on the logic depth of the circuit
* Nodes logically deeper are more prone to glitching
  > Arrival times of the gate inputs are more spread due to delay imbalances
  > Usually affected by more primary input switching
* Reduce depth by adding pipeline registers"


So it will be interesting to see how this solution benchmarks (latency,
throughput, area, power) against other implementations.

For example, at FCCM00 last year, Cameron Patterson of Xilinx presented a
paper on a jbits-floorplanned 16-stages-unrolled pipelined implementation of
DES [1] that runs at 168 MHz and does 10.7 Gb/s ("non-feedback ECB"), for
3.2 W, in a modest Virtex-150-5, for a throughput-per-area of about 3
Mb/s/logic-cell.  Working from this paper, I infer that at 168 MHz, clocking
through the stated 35 pipe stages, the 16 round DES would take ~200 ns --
which is in the same ballpark for latency as the stated 155 ns TPD of the
combinational FPGA version -- but presumably with much higher throughput and
throughput-per-area.

Or are the two implementations not comparable (apples to oranges)?

Jan Gray, Gray Research LLC

[1] Cameron Patterson, Xilinx, "High Performance DES Encryption in Virtex
FPGAs using JBits", 2000 IEEE Symposium on Field-Programmable Custom
Computing Machines.

Article: 29269
Subject: Re: OT: IEEE & Floating point
From: John Larkin <jjlarkin@highlandSNIPTHIStechnology.com>
Date: Sun, 11 Feb 2001 20:09:14 -0800

On Sun, 11 Feb 2001 11:59:20 +0000 (UTC), V R
<ipickledthefigsmyself@mrbourns.com> wrote:

>Sorry for the off-topic and cross-post but I was curious (since we have
>the attention of so many now) if more "intelligent" floating point scheme
>exists (i.e. non-IEEE 754/854)? I know computers performed math before
>Intel's spec so of course there will be dozens of proprietary formats...
>
>It feels like manipulation of floating point data in the 754/854 formats
>is more cumbersome than it needs to be. Are there any other schemes that
>are "simpler" (besides fixed point, etc) and/or easier to implement? Any
>implementations that nicely lend themselves to FPGAs? Obviously one will
>have to make trade-offs such as bit-size vs. precision, etc. but I'm
>inquiring about general schemes...
>
>Thanks!
>VR.


VR,

I've done a couple of projects using a 64-bit fixed-point format, with
32 bits of integer and 32 bits of fraction. That's enough dynamic
range for most real-world, engineering-units apps. A fully-saturating
implementation has no exceptions to worry about. And adds are *fast*.
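
A minimal Python sketch of such a format, assuming a signed 64-bit two's complement word with 32 fraction bits; the names and the rounding in the conversion helper are illustrative choices, not taken from the projects mentioned above.

```python
FRAC_BITS = 32
FIX_MAX = (1 << 63) - 1        # largest signed 64-bit value
FIX_MIN = -(1 << 63)           # smallest signed 64-bit value

def saturate(v):
    """Clamp to the 64-bit range; overflow never wraps or raises."""
    return max(FIX_MIN, min(FIX_MAX, v))

def to_fix(x):
    """Convert a float to 32.32 fixed point, rounding to the nearest step."""
    return saturate(int(round(x * (1 << FRAC_BITS))))

def to_float(v):
    return v / (1 << FRAC_BITS)

def fix_add(a, b):
    """Saturating add: a plain integer add followed by a clamp."""
    return saturate(a + b)

print(to_float(fix_add(to_fix(1.5), to_fix(2.25))))
```

The add itself is a single integer addition with no alignment or normalization step, which is where the speed comes from.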

John

Article: 29270
Subject: Re: Virtex XCV2000E-6 BG560C - Orcad capture symbol
From: Laurent Gauch <laurent.gauch@amontec.com>
Date: Mon, 12 Feb 2001 09:00:06 +0100

Amontec has a Virtex symbol in BG560.

Gil Golov wrote:

> Does anybody have this symbol for Orcad capture?
> 
> Thanks very much in advance.
> 
> Gil Golov

Article: 29271
Subject: Re: Wired-or on Virtex FPGAs
From: Kent Orthner <korthner@hotmail.nospam.com>
Date: 12 Feb 2001 17:32:31 +0900



Brian Philofsky wrote:
> >Most likely, you will have to hand create this structure as I am not sure
> >if any synthesis tools currently infer this structure.  Maybe in the near
> >future though...

Christian Plessl <cplessl@ee.ethz.ch> writes:
> So how do you think such a design could be implemented, if the design
> tools cannot infer it? I'm using Xilinx Foundation 3.1i, the design is
> coded in VHDL. Somehow I would need complete access to the FPGA
> elements, such as LUTs and carry chains.
> 
> Is there a way to access these directly? From the Xilinx Library Guide
> I've seen that components ADD16 and OR16 work exactly like this, but
> since these are macros I cannot extend them to my needs.

I will assume that you've figured out how to instantiate macros/components 
from the libraries guide.  If you look at the MUXCY component, you'll
see that it is an instantiatable (sp?) component for the mux implemented 
in the carry chain.

You can simply instantiate this however many times you need.

(If you look, you'll see that LUTs and whatnot are there, too.)

A little while ago, I was playing around with a 32-bit wide 
trinary compare function (32 bits compared against 32 bits, with a 
32-bit mask), and I tested both the carry-chain method and the
pull-up (wired-or) method, and I did indeed find that the 
carry chain was faster.

Note that you will need to break the carry chain in half if 
the device height isn't enough for your entire carry chain. 
(Which is what I did.)

I've stuck my compare component here so you can see how I 
did it.  I tested it using Foundation 3.3iISE.

Hope this helps, 
-Kent


-------------------------------------------------------------------------------
-- Title      : Fast Trinary Compare Component 
-- Project    : Common Component
-------------------------------------------------------------------------------
-- File       : FastCompare.vhd
-- Author     : K.Orthner
-- Created    : 2001/01/27
-- Last update: 2001-01-25
-- Platform   : Active-HDL/FPGA Express(Synopsys)
-------------------------------------------------------------------------------
-- Description: A Fast Trinary Compare component.
--              Completely combinational.
-------------------------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
-- synopsys translate_off
library unisim;
use unisim.all;
-- synopsys translate_on

entity FastCompare is
  generic (
    Width    : integer);
  port (
    Comparand0 : in  std_logic_vector(Width-1 downto 0);
    Comparand1 : in  std_logic_vector(Width-1 downto 0);
    Mask       : in  std_logic_vector(Width-1 downto 0);
    Match      : out std_logic );

end FastCompare;

----------------------------------------------------------------------------------------------------
-- Architecture Xilinx_Carry
----------------------------------------------------------------------------------------------------
architecture Xilinx_Carry of FastCompare is

  component MUXCY is
    port (
      O  : out std_ulogic;
      DI : in  std_ulogic;
      CI : in  std_ulogic;
      S  : in  std_ulogic);
  end component MUXCY;

  signal BitMatch    : std_logic_vector(Width-1 downto 0);
  signal CarryChain0 : std_logic_vector(Width/2 downto 0);
  signal CarryChain1 : std_logic_vector(Width/2 downto 0);
  signal Logic_0     : std_logic;
  
begin

  Logic_0 <= '0';

  ----------------------------------------------------------------------------------------------------
  -- Determine bitwise matching.
  ----------------------------------------------------------------------------------------------------
  GenBitMatch : process( Comparand0, Comparand1, Mask) is
  begin
    for i in BitMatch'range loop
      if (Mask(i) = '0') then
        BitMatch(i) <= '1';
      elsif (Comparand0(i) = Comparand1(i)) then
        BitMatch(i) <= '1';
      else
        BitMatch(i) <= '0';
      end if;
    end loop;
    
  end process GenBitMatch;


  GenMux : for i in 0 to ((Width/2)-1) generate
    MUXCY_0 : MUXCY
      port map (
        O  => CarryChain0(i+1),
        DI => Logic_0,
        CI => CarryChain0(i),
        S  => BitMatch(i));

    MUXCY_1 : MUXCY
      port map (
        O  => CarryChain1(i+1),
        DI => Logic_0,
        CI => CarryChain1(i),
        S  => BitMatch(i+(Width/2)));


  end generate GenMux;

  CarryChain0(0) <= '1';
  CarryChain1(0) <= '1';

  Match <= CarryChain0(Width/2) and CarryChain1(Width/2);--and CarryChain2(Width/4) and CarryChain3(Width/4);
  
end Xilinx_Carry;

--==[ eof ]==--
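
For checking the component in simulation, a small Python reference model of the same structure may be handy. It is only a behavioral sketch: `muxcy` mirrors the mux function (O = CI when S is 1, else DI), and the two half-length chains follow the listing above; the Python names are made up for illustration.

```python
def muxcy(di, ci, s):
    """Carry mux: pass the carry in when S is 1, otherwise output DI."""
    return ci if s else di

def fast_compare(c0, c1, mask, width=32):
    """Masked equality: 1 when c0 and c1 agree on every bit where mask is 1."""
    bit_match = []
    for i in range(width):
        masked_out = ((mask >> i) & 1) == 0
        equal = ((c0 >> i) & 1) == ((c1 >> i) & 1)
        bit_match.append(1 if masked_out or equal else 0)

    half = width // 2
    carry0 = carry1 = 1                       # both chains start at '1'
    for i in range(half):
        carry0 = muxcy(0, carry0, bit_match[i])
        carry1 = muxcy(0, carry1, bit_match[i + half])
    return carry0 & carry1

print(fast_compare(0x12345678, 0x12345679, 0xFFFFFFFE))
```

The result should always agree with the one-line expression `((c0 ^ c1) & mask) == 0`, which makes a convenient golden reference for a testbench.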

Article: 29272
Subject: Re: double precision floating point arithmetic
From: kolja@prowokulta.org
Date: Mon, 12 Feb 2001 12:20:57 GMT

Hi

For something like raytracing I would suggest that you implement some
kind of vector processor. This means that you, for example, give one
command to add N pairs of numbers instead of just one.
This way you can do very heavy pipelining and almost always keep the
pipeline full. This would require vectorization of the loops in POVRay.
This is not hard to do, and would for sure make it easier to get high
throughput from your hardware.

Read the user manuals of the Motorola DSP96002. They only implemented
addition, subtraction, comparison and multiplication and an 8-bit
division approximation.

The same tradoff is probably true for FPGA as well. The results might be
better if you spend more time on getting the basic operation really fast
with a lot of pipelining instead of having specialized hardware for
everything. Maybe you can even run multiple instances of the basic
operations in parallel from internel blockrams. These can than be used
to implement the transcendentals.

The user manual of the DSP96002 contains assembler code to emulate the other
operations. They are only a little slower than hardware iterations.

1/X takes 6 instructions
1/SQRT(X) takes 11 instructions using Newton-Raphson approximation
SQRT(X) is just X * (1/SQRT(X)) and takes 12 instructions
SIN/COS is done with CORDIC.
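
The Newton-Raphson iterations behind those numbers can be sketched as follows. This is not the DSP96002 assembler code, only the textbook recurrences in Python. The helper seed_8bit is a hypothetical stand-in for a hardware seed of about 8 bits (it cheats by rounding the exact answer), and the step count of 3 is an assumption chosen so that an 8-bit seed reaches roughly double precision. Inputs are assumed to be positive.

```python
import math


def seed_8bit(value):
    """Stand-in for a hardware seed: keep about 8 significant bits."""
    mantissa, exponent = math.frexp(value)  # value = mantissa * 2**exponent
    return math.ldexp(round(mantissa * 256) / 256, exponent)


def reciprocal(a, steps=3):
    """Approximate 1/a using only multiplies and subtracts."""
    x = seed_8bit(1.0 / a)
    for _ in range(steps):
        x = x * (2.0 - a * x)  # Newton-Raphson step for f(x) = 1/x - a
    return x


def reciprocal_sqrt(a, steps=3):
    """Approximate 1/sqrt(a) using only multiplies and subtracts."""
    y = seed_8bit(1.0 / math.sqrt(a))
    for _ in range(steps):
        y = y * (1.5 - 0.5 * a * y * y)  # step for f(y) = 1/y**2 - a
    return y


def sqrt_via_rsqrt(a):
    """SQRT(a) as a * (1/SQRT(a)): one extra multiply."""
    return a * reciprocal_sqrt(a)
```

Each step roughly doubles the number of correct bits, which is why a small seed table plus a fast pipelined multiplier can replace a dedicated divider or square-root unit.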

The Analog Devices SHARC manuals also contain a lot of assembler code.

In article <mHAh6.13554$ra.1279071@typhoon.kc.rr.com>,
  "Matt Billenstein" <mbillens@mbillens.yi.org> wrote:
> Well, thanks to everyone for the good information...  I'm trying to
> accelerate a ray tracing application (POVRay) using digital hardware.
> My guess is that I do not have to be strictly IEEE compliant in my
> implementation, but the numbers I hand back to the have to be in
> double precision format.
>
> I'm counting on everything being heavily pipelined and I am looking
> into CORDIC for implementing the transcendentals.  The board I'm using
> hangs off the memory bus, so I'd like everything to be synchronous to
> the 66 or 100 MHz SDRAM clock.
>
> I'm doing this for a research project for school, so no worry of me
> bidding it too low :)

Sent via Deja.com
http://www.deja.com/

Article: 29273
Subject: Re: Wired-or on Virtex FPGAs
From: kolja@prowokulta.org
Date: Mon, 12 Feb 2001 12:35:50 GMT
Links: << >> << T >> << A >>

If speed (and not area) is a concern you might try something completely
different.

Let's say you have a state machine with output flip-flops y_i.
You want to compute y_0 or y_1 or ... or y_N.
Usually all the inputs of your state machine come from flip-flops, too.
Let's call these input flip-flops x_i.
Each y_i is computed by a boolean function f_i(x, y).
You can now easily obtain the function
F(x, y) = f_0(x,y) or f_1(x,y) or ... or f_N(x,y)

This gives the same value as the OR of the y_i, but you now have the
result one clock cycle earlier, or you can use retiming and have twice
the time to evaluate the function.

In some cases F can be much larger and slower than the OR-tree. But my
experience is that especially for one-hot encoded state machines F is
almost the same size as the tree and you therefore gain almost a factor
of 2 in throughput.
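
A small worked example may make this concrete. The three-state one-hot machine below is entirely hypothetical (states idle, busy, done; inputs start, finish), and the signal of interest is "busy or done", i.e. the OR of two of the output flip-flops. The sketch compares the OR taken after the registers with the look-ahead function F computed from the current x and y.

```python
def next_state(x, y):
    """Next-state functions f_i(x, y) of a hypothetical one-hot machine."""
    start, finish = x
    idle, busy, done = y
    return (
        int((idle and not start) or done),               # f_0: next idle
        int((idle and start) or (busy and not finish)),  # f_1: next busy
        int(busy and finish),                            # f_2: next done
    )


def or_after_registers(y):
    """OR-tree placed behind the output flip-flops: busy or done."""
    return int(y[1] or y[2])


def lookahead_or(x, y):
    """F(x, y) = f_1(x, y) or f_2(x, y), simplified by hand."""
    start, _finish = x
    idle, busy, _done = y
    return int((idle and start) or busy)
```

Here F simplifies to (idle and start) or busy, which is smaller than either f_1 or f_2 alone. Registering F gives the same signal as the OR-tree, without any logic between the flip-flops and the consumer.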

CU,
	Kolja

In article <ofq48t0073rudnjlvo4fpkjnccsm2vq11f@4ax.com>,
  Christian Plessl <cplessl@ee.ethz.ch> wrote:
> What I need is a _fast_ boolean OR or AND operation on all of
> these output signals. Since there are quite a lot of output signals,
> say typically more than 40 signals, I need several levels of logic
> when implementing this in the obvious tree-like structure with a tree
> of 4-input AND/OR gates.


Article: 29274
Subject: Re: Wired-or on Virtex FPGAs
From: Christian Plessl <cplessl@ee.ethz.ch>
Date: Mon, 12 Feb 2001 14:09:32 +0100
Links: << >> << T >> << A >>

Hi Brian, Phil and others.

Thank you very much for the very useful comments and design ideas for
implementing wide logical functions.

I've made small test circuits to compare your proposals, and want to
briefly show the results:

I compared 3 different architectures for a 32-input AND gate:

a) Simply using the 'and' operator

b) Using Brian Philofsky's scheme, by instantiating LUTs which
implement a 4-bit Boolean function and passing the intermediate results
via the carry chain.

c) Phil Hays's clever idea of using the design tools' capability to
infer adders that use the carry chain for constructing wide boolean
functions.

All designs were implemented using Xilinx Foundation Tools Version
3.3i Service Pack 6 using the VHDL tool flow. The target FPGA is a
Xilinx Virtex XCV1000-4.

Results:

+-----------------------------------------------+
| Circuit | Slices used | LUTs used | Delay     |
+-----------------------------------------------+
| a       |  9          | 11        | 17 ns     |
+-----------------------------------------------+
| b       |  5          |  8        | 13 ns     |
+-----------------------------------------------+
| c       | 17          |  0        | 15.25 ns  |
+-----------------------------------------------+


Remarks:

a) shows that the tools cannot infer a highly efficient implementation
when using just the obvious naive way of coding wide logic functions.

b) Brian's scheme generates the fastest wired-and implementation for
32-input ANDs. The logic inferred is as expected: each slice implements
two 4-input LUTs, and each of the LUTs implements a 4-input AND. All the
outputs of the LUTs control the CYSEL multiplexers and the results are
passed via the carry chain.

c) Phil's scheme doesn't use any LUTs at all; all the logic is
implemented using the carry chain and the LUT is just used for routing
a single signal to the multiplexer, which means the circuit is similar
to b) but every slice handles only 2 bits, instead of 8 bits in
circuit b). Surprisingly, the circuit is quite fast. It seems the
Virtex carry chains are _really_ fast.
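
The structure described in b) and c) can be modelled in a few lines. This is only a behavioral sketch, assuming that each carry multiplexer passes the carry when its select is 1 and forces 0 otherwise; the function name wide_and and the lut_inputs parameter are made up for this illustration. It says nothing about timing or placement.

```python
def wide_and(bits, lut_inputs=4):
    """AND of all bits, built from per-group ANDs feeding a carry chain.

    Returns (result, stages), where stages is the number of carry
    multiplexers (and select signals) the chain needs.
    """
    carry = 1  # the chain is started with a constant 1
    stages = 0
    for k in range(0, len(bits), lut_inputs):
        # One LUT computes the AND of its group of inputs.
        select = int(all(bits[k:k + lut_inputs]))
        # Carry multiplexer: keep the carry if the group is all ones.
        carry = carry if select else 0
        stages += 1
    return carry, stages
```

With lut_inputs=4 a 32-input AND needs 8 stages, which matches the 8 LUTs reported for circuit b). With lut_inputs=1 every input drives its own stage, as in circuit c), and the chain is 32 stages long.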


Thanks to all of you, for posting your ideas.

/Chris
