Messages from 157900

Article: 157900
Subject: Re: 16->5 "Sort"
From: glen herrmannsfeldt <gah@ugcs.caltech.edu>
Date: Tue, 12 May 2015 05:37:02 +0000 (UTC)

rickman <gnuarm@gmail.com> wrote:

(snip)
>> Input (bit 15 on left):

>> 1 1 0 0  0 0 0 0  0 0 0 0  0 1 1 1

>> 20-bit Output List:
>> 1111 1110 0010 0001 0000

> I am old school so I have to picture these things....

> I see five priority encoders.  The first priority encoder output is used 
> to disable that input to the second priority encoder, etc.

> How did you code it?  I can see how this would be a lot more than three 
> levels of logic.  The equations get quite long.  When you talk about 
> levels of logic I assume you mean layers of LUTs?  I can see maybe each 
> 16 input priority encoder being no more than 3 levels of LUTs, but all 
> five layers with the inhibit logic... I don't think so.  I expect the 
> tools did a pretty good job of optimizing it and I don't easily see any 
> way of using carry chains.

It is slightly easier than that. 

(Some time ago I did this with 3 out of 36 bits.)

Consider the case of only two bits high. Use one priority encoder
the usual way, and one upside down. (That is, find the highest and
lowest set bits.) Now, as you said, mask off the highest and lowest,
use two more priority encoders the same way, and finally mask off
those two and encode the last one.

But first I would add the logic to determine that only 
(or at least) five were set.
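
A minimal behavioural sketch of the two-ended idea in Python, assuming exactly five bits are set. The function names are made up for illustration, and this models only the result, not how it maps onto LUTs:

```python
def highest(mask):
    """Index of the highest set bit (the usual priority encoder)."""
    return mask.bit_length() - 1

def lowest(mask):
    """Index of the lowest set bit (the 'upside down' priority encoder)."""
    return (mask & -mask).bit_length() - 1

def peel_five(mask):
    """Return the five set-bit indices of a 16-bit mask, highest first."""
    # One form of the check suggested above: exactly five bits set.
    if bin(mask).count("1") != 5:
        raise ValueError("expected exactly five set bits")
    top1, bottom1 = highest(mask), lowest(mask)
    mask &= ~((1 << top1) | (1 << bottom1))      # remove both ends
    top2, bottom2 = highest(mask), lowest(mask)
    mask &= ~((1 << top2) | (1 << bottom2))      # remove the next two
    middle = highest(mask)                       # one bit is left
    return [top1, top2, middle, bottom2, bottom1]

print(peel_five(0b1100_0000_0000_0111))  # [15, 14, 2, 1, 0]
```

Each encoder pair works on a mask that no longer contains the bits already found, which is why no running count is needed.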

I suspect that, as the OP noted, it is combinatorially hard.
Consider the logic needed to, separately, compute each bit
of the result. Even just the first one.

It does, however, pipeline very well.  In the one I was working on,
it had a fairly fast clock (66MHz) but I could stand some levels
of latency.   The OP claims the need for low latency.

-- glen

Article: 157901
Subject: Re: 16->5 "Sort"
From: Rob Doyle <radioengr@gmail.com>
Date: Mon, 11 May 2015 23:39:49 -0700

On 5/11/2015 7:26 PM, Kevin Neilson wrote:
> To be clearer, here's an example.
>
> Input (bit 15 on left):
>
> 1 1 0 0  0 0 0 0  0 0 0 0  0 1 1 1
>
> 20-bit Output List:
> 1111 1110 0010 0001 0000
>

64K x 20 ROM?

How badly do you want one level of logic?
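
A minimal sketch of what such a table would hold, written in Python as a reference model. It assumes the output is the indices of the five highest set bits, highest first (which matches the example above), and it assumes unused nibbles are zero when fewer than five bits are set. The name `top_five` is hypothetical:

```python
def top_five(mask):
    """Return a 20-bit word holding the indices of the five highest set
    bits of a 16-bit mask, highest index in the top nibble.
    Assumption: if fewer than five bits are set, unused nibbles are 0."""
    found = [i for i in range(15, -1, -1) if (mask >> i) & 1][:5]
    found += [0] * (5 - len(found))
    word = 0
    for index in found:
        word = (word << 4) | index
    return word

# One entry per possible 16-bit input: the "64K x 20" table.
rom = [top_five(mask) for mask in range(1 << 16)]

example = 0b1100_0000_0000_0111
print(format(rom[example], "020b"))  # 11111110001000010000
```

The same function is handy as a golden model when checking any of the smaller implementations discussed in this thread.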

Rob.

Article: 157902
Subject: Re: 16->5 "Sort"
From: GaborSzakacs <gabor@alacron.com>
Date: Tue, 12 May 2015 09:03:52 -0400

Rob Doyle wrote:
> On 5/11/2015 7:26 PM, Kevin Neilson wrote:
>> To be clearer, here's an example.
>>
>> Input (bit 15 on left):
>>
>> 1 1 0 0  0 0 0 0  0 0 0 0  0 1 1 1
>>
>> 20-bit Output List:
>> 1111 1110 0010 0001 0000
>>
> 
> 64K x 20 ROM?
> 
> How badly do you want one level of logic?
> 
> Rob.

Two ROMs, 256 by 23 and 256 by 20.  The first encodes the
lower 8 bits into 0 to 5 4-bit numbers right-justified plus
a 3-bit indication of how many bits were set.  The second
just encodes 0 to 5 4-bit numbers, left justified (you assume
it has the remaining 1's not covered by the low 8 bits).  Then
one more level to select how many bits to take from each ROM
based on the number of ones in the lower 8 bits (each output
bit is only a 2:1 mux in this case - that's why the ROMs
are justified right and left as noted).
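
The split-table scheme can be sketched as a small Python model. This is only an illustration under the assumption that exactly five input bits are set; the helper names are hypothetical, and entries with more than five ones in a byte are treated as don't-cares:

```python
def positions(byte, base):
    """Indices (base + bit number) of set bits in a byte, highest first."""
    return [base + i for i in range(7, -1, -1) if (byte >> i) & 1]

def pack(nibbles):
    """Pack a list of 4-bit values into one word, first value on the left."""
    word = 0
    for n in nibbles:
        word = (word << 4) | n
    return word

# Low ROM: (count, up to five indices right-justified)  -> 3 + 20 bits.
# High ROM: up to five indices left-justified            -> 20 bits.
rom_lo = []
rom_hi = []
for byte in range(256):
    lo = positions(byte, 0)[:5]       # truncated entries are don't-cares
    hi = positions(byte, 8)[:5]
    rom_lo.append((len(lo), pack(lo)))
    rom_hi.append(pack(hi + [0] * (5 - len(hi))))

def sort16to5(mask):
    """Two lookups, then one 2:1 mux per output nibble."""
    count, lo_word = rom_lo[mask & 0xFF]
    hi_word = rom_hi[mask >> 8]
    out = 0
    for slot in range(5):             # slot 0 is the leftmost nibble
        shift = 4 * (4 - slot)
        source = hi_word if slot < 5 - count else lo_word
        out |= source & (0xF << shift)
    return out

print(format(sort16to5(0b1100_0000_0000_0111), "020b"))  # 11111110001000010000
```

Because one table is left-justified and the other right-justified, each output nibble only ever chooses between the same nibble position of the two tables.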

-- 
Gabor

Article: 157903
Subject: Re: 16->5 "Sort"
From: glen herrmannsfeldt <gah@ugcs.caltech.edu>
Date: Tue, 12 May 2015 15:54:22 +0000 (UTC)

GaborSzakacs <gabor@alacron.com> wrote:
> Rob Doyle wrote:
>> On 5/11/2015 7:26 PM, Kevin Neilson wrote:
>>> To be clearer, here's an example.

>>> Input (bit 15 on left):

>>> 1 1 0 0  0 0 0 0  0 0 0 0  0 1 1 1

>>> 20-bit Output List:
>>> 1111 1110 0010 0001 0000

>> 64K x 20 ROM?

Hmm, that probably does work. Especially with the BRAM in many FPGAs.

When I did it some years ago, it was three out of 36 bits, and ROMs
weren't, and still aren't, that big.

>> How badly do you want one level of logic?
 
> Two ROMs, 256 by 23 and 256 by 20.  The first encodes the
> lower 8 bits into 0 to 5 4-bit numbers right-justified plus
> a 3-bit indication of how many bits were set.  The second
> just encodes 0 to 5 4-bit numbers, left justified (you assume
> it has the remaining 1's not covered by the low 8 bits).  Then
> one more level to select how many bits to take from each ROM
> based on the number of ones in the lower 8 bits (each output
> bit is only a 2:1 mux in this case - that's why the ROMs
> are justified right and left as noted).

If not in BRAM, that seems a good way. 

How many levels of logic is a 256 bit ROM in FPGA LUTs?

-- glen

Article: 157904
Subject: Re: 16->5 "Sort"
From: rickman <gnuarm@gmail.com>
Date: Tue, 12 May 2015 13:39:11 -0400

On 5/12/2015 11:54 AM, glen herrmannsfeldt wrote:
> GaborSzakacs <gabor@alacron.com> wrote:
>> Rob Doyle wrote:
>>> On 5/11/2015 7:26 PM, Kevin Neilson wrote:
>>>> To be clearer, here's an example.
>
>>>> Input (bit 15 on left):
>
>>>> 1 1 0 0  0 0 0 0  0 0 0 0  0 1 1 1
>
>>>> 20-bit Output List:
>>>> 1111 1110 0010 0001 0000
>
>>> 64K x 20 ROM?
>
> Hmm, that probably does work. Especially with the BRAM in many FPGAs.
>
> When I did it some years ago, it was three out of 36 bits, and ROMs
> weren't, and still aren't, that big.
>
>>> How badly do you want one level of logic?
>
>> Two ROMs, 256 by 23 and 256 by 20.  The first encodes the
>> lower 8 bits into 0 to 5 4-bit numbers right-justified plus
>> a 3-bit indication of how many bits were set.  The second
>> just encodes 0 to 5 4-bit numbers, left justified (you assume
>> it has the remaining 1's not covered by the low 8 bits).  Then
>> one more level to select how many bits to take from each ROM
>> based on the number of ones in the lower 8 bits (each output
>> bit is only a 2:1 mux in this case - that's why the ROMs
>> are justified right and left as noted).
>
> If not in BRAM, that seems a good way.
>
> How many levels of logic is a 256 bit ROM in FPGA LUTs?

Like most things, "it depends".  The older devices have 16 bits per LUT, 
if any, and many of the newer devices have 64 bits per LUT.

Nearly all devices have BRAMs so it's a no brainer by the split table 
method.  Only trouble is BRAMs use a clock cycle...  I'm just sayin'...

-- 

Rick

Article: 157905
Subject: Re: 16->5 "Sort"
From: glen herrmannsfeldt <gah@ugcs.caltech.edu>
Date: Tue, 12 May 2015 18:03:47 +0000 (UTC)

rickman <gnuarm@gmail.com> wrote:

(snip, I wrote)
>> Hmm, that probably does work. Especially with the BRAM in many FPGAs.

>> When I did it some years ago, it was three out of 36 bits, and ROMs
>> weren't, and still aren't, that big.

(snip)

>> How many levels of logic is a 256 bit ROM in FPGA LUTs?
 
> Like most things, "it depends".  The older devices have 16 bits per LUT, 
> if any, and many of the newer devices have 64 bits per LUT.
 
> Nearly all devices have BRAMs so it's a no brainer by the split table 
> method.  Only trouble is BRAMs use a clock cycle...  I'm just sayin'...

OK, going back, the OP says that three levels of logic is fine,
but doesn't say how many pipeline stages.

I presume a clock is available for the BRAM, but I am not sure.

-- glen

Article: 157906
Subject: Re: 16->5 "Sort"
From: GaborSzakacs <gabor@alacron.com>
Date: Tue, 12 May 2015 14:09:24 -0400

glen herrmannsfeldt wrote:
> GaborSzakacs <gabor@alacron.com> wrote:
>> Rob Doyle wrote:
>>> On 5/11/2015 7:26 PM, Kevin Neilson wrote:
>>>> To be clearer, here's an example.
> 
>>>> Input (bit 15 on left):
> 
>>>> 1 1 0 0  0 0 0 0  0 0 0 0  0 1 1 1
> 
>>>> 20-bit Output List:
>>>> 1111 1110 0010 0001 0000
> 
>>> 64K x 20 ROM?
> 
> Hmm, that probably does work. Especially with the BRAM in many FPGAs.
> 
> When I did it some years ago, it was three out of 36 bits, and ROMs
> weren't, and still aren't, that big.
> 
>>> How badly do you want one level of logic?
>  
>> Two ROMs, 256 by 23 and 256 by 20.  The first encodes the
>> lower 8 bits into 0 to 5 4-bit numbers right-justified plus
>> a 3-bit indication of how many bits were set.  The second
>> just encodes 0 to 5 4-bit numbers, left justified (you assume
>> it has the remaining 1's not covered by the low 8 bits).  Then
>> one more level to select how many bits to take from each ROM
>> based on the number of ones in the lower 8 bits (each output
>> bit is only a 2:1 mux in this case - that's why the ROMs
>> are justified right and left as noted).
> 
> If not in BRAM, that seems a good way. 
> 
> How many levels of logic is a 256 bit ROM in FPGA LUTs?
> 
> -- glen
>  

In Xilinx 7-series:

A 256-bit LUT fits in 1 SLICEL.  It uses all 4 64-bit LUTs and three
muxes (two F7 muxes, one per pair of LUTs, and an F8 mux to combine the
result), so it would show up as 3 levels of logic, but it all routes
internally to the slice and the muxes are really fast.

Convincing the tools to use LUT memory is the fun part.  Here's my test
code:

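// 256 x 8 ROM intended to map onto LUTs (distributed memory), not BRAM.
module simple_lut
(
   input wire    [7:0] addr,
   output wire   [7:0] data
);

// Synthesis attribute requesting distributed (LUT-based) memory.
(* RAM_STYLE = "distributed" *) reg [7:0] lut_mem [0:255];

// Load the table contents, one hex word per address.
initial $readmemh ("../source/lutcontents.hex", lut_mem);

// Asynchronous (combinational) read; no clock is involved.
assign data = lut_mem[addr];

endmodule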
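
The contents of lutcontents.hex are not shown here. As a sketch, text in the format $readmemh accepts (one hex word per line, lowest address first) could be produced like this in Python. The table contents below, a population count of the address, are purely a placeholder, and `readmemh_text` is a hypothetical helper:

```python
def readmemh_text(words, width_bits):
    """Format a list of integers as $readmemh text, one hex word per line."""
    digits = (width_bits + 3) // 4          # hex digits per word
    return "\n".join(format(w, f"0{digits}x") for w in words) + "\n"

# Placeholder contents: number of ones in each 8-bit address.
contents = [bin(addr).count("1") for addr in range(256)]
text = readmemh_text(contents, 8)

print(text.splitlines()[:4])  # ['00', '01', '01', '02']
```

Writing `text` to the path named in the module would give the simulator and the synthesis tool the same table.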

-- 
Gabor

Article: 157907
Subject: Re: 16->5 "Sort"
From: rickman <gnuarm@gmail.com>
Date: Tue, 12 May 2015 14:09:50 -0400

On 5/12/2015 2:03 PM, glen herrmannsfeldt wrote:
> rickman <gnuarm@gmail.com> wrote:
>
> (snip, I wrote)
>>> Hmm, that probably does work. Especially with the BRAM in many FPGAs.
>
>>> When I did it some years ago, it was three out of 36 bits, and ROMs
>>> weren't, and still aren't, that big.
>
> (snip)
>
>>> How many levels of logic is a 256 bit ROM in FPGA LUTs?
>
>> Like most things, "it depends".  The older devices have 16 bits per LUT,
>> if any, and many of the newer devices have 64 bits per LUT.
>
>> Nearly all devices have BRAMs so it's a no brainer by the split table
>> method.  Only trouble is BRAMs use a clock cycle...  I'm just sayin'...
>
> OK, going back the OP says that three levels of logic is fine.
> Doesn't say how many pipeline stages.
>
> I presume a clock is available for the BRAM, but I am not sure.

The OP said, "minimal latency" and I think I over spec'd that to mean 
combinatorial.  So one clock for the BRAM and one clock for the muxing 
should be ok.  Better than the 5 clock cycles he mentions.

-- 

Rick

Article: 157908
Subject: Re: 16->5 "Sort"
From: rickman <gnuarm@gmail.com>
Date: Tue, 12 May 2015 14:20:15 -0400

On 5/12/2015 2:09 PM, GaborSzakacs wrote:
> glen herrmannsfeldt wrote:
>> GaborSzakacs <gabor@alacron.com> wrote:
>>> Rob Doyle wrote:
>>>> On 5/11/2015 7:26 PM, Kevin Neilson wrote:
>>>>> To be clearer, here's an example.
>>
>>>>> Input (bit 15 on left):
>>
>>>>> 1 1 0 0  0 0 0 0  0 0 0 0  0 1 1 1
>>
>>>>> 20-bit Output List:
>>>>> 1111 1110 0010 0001 0000
>>
>>>> 64K x 20 ROM?
>>
>> Hmm, that probably does work. Especially with the BRAM in many FPGAs.
>>
>> When I did it some years ago, it was three out of 36 bits, and ROMs
>> weren't, and still aren't, that big.
>>
>>>> How badly do you want one level of logic?
>>
>>> Two ROMs, 256 by 23 and 256 by 20.  The first encodes the
>>> lower 8 bits into 0 to 5 4-bit numbers right-justified plus
>>> a 3-bit indication of how many bits were set.  The second
>>> just encodes 0 to 5 4-bit numbers, left justified (you assume
>>> it has the remaining 1's not covered by the low 8 bits).  Then
>>> one more level to select how many bits to take from each ROM
>>> based on the number of ones in the lower 8 bits (each output
>>> bit is only a 2:1 mux in this case - that's why the ROMs
>>> are justified right and left as noted).
>>
>> If not in BRAM, that seems a good way.
>> How many levels of logic is a 256 bit ROM in FPGA LUTs?
>>
>> -- glen
>>
>
> In Xilinx 7-series:
>
> A 256-bit LUT fits in 1 SLICEL.  It uses all 4 64-bit LUTS and three
> muxes (two for each pair of LUTS and one to combine the result), so it
> would show up as 3 levels of logic, but it all routes internally to the
> slice and the muxes are really fast.
>
> Convincing the tools to use LUT memory is the fun part.  Here's my test
> code:
>
> module simple_lut
> (
>    input wire    [7:0] addr,
>    output wire   [7:0] data
> );
>
> (* RAM_STYLE = "distributed" *) reg [7:0] lut_mem [0:255];
>
> initial $readmemh ("../source/lutcontents.hex", lut_mem);
>
> assign data = lut_mem[addr];
>
> endmodule

It would be 43 x 4 or 172 LUTs.  A fair amount plus the muxes to combine 
the outputs.  Still, it is mostly in parallel so it should be fairly fast.
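
The arithmetic behind that figure, as a short Python sketch. It assumes 6-input LUTs holding 64 bits each and counts only the two ROMs, not the final muxes:

```python
LUT_BITS = 64                 # a 6-input LUT holds 64 bits
ROM_DEPTH = 256               # entries per ROM (8 address bits)
ROM_WIDTHS = (23, 20)         # output bits of the two ROMs

luts_per_output_bit = ROM_DEPTH // LUT_BITS          # 4
output_bits = sum(ROM_WIDTHS)                        # 43
rom_luts = output_bits * luts_per_output_bit         # 172

print(luts_per_output_bit, output_bits, rom_luts)    # 4 43 172
```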

-- 

Rick

Article: 157909
Subject: Re: 16->5 "Sort"
From: Kevin Neilson <kevin.neilson@xilinx.com>
Date: Tue, 12 May 2015 11:46:16 -0700 (PDT)

The idea of working from both sides is useful.  Finding the 5th bit set
is a lot harder than finding the 1st because you have to keep a running
sum of the bits already set so that eats up a few inputs of each LUT.
If I search from both ends the running sum would be no bigger than 2
(finding, say, 3 starting from the top and 2 starting from the bottom).
When I draw this out I still have at least 3 levels of logic plus a few
levels of carry chain mux, which I'd probably still have to pipeline one
stage, but that's acceptable.  Vivado surely won't use the carry chain
mux unless I instantiate it anyway, and then it would be 5-6 levels of
logic, so I'd definitely need to pipeline that one stage.

Article: 157910
Subject: Re: 16->5 "Sort"
From: Kevin Neilson <kevin.neilson@xilinx.com>
Date: Tue, 12 May 2015 11:57:40 -0700 (PDT)

Vivado made it 16 levels of logic, and I can't tell exactly what it's
doing, but this is how I would expect it would work:  the first output
is the easiest.  You just find the leading 1 with a priority encoder
and encode it.  You can look at the first 5 bits with the first level,
using the 6th LUT input for an input from the next level if none of
those 5 bits are set, and so on.  This requires 4 levels of LUTs.  One
could use the carry chain muxes to speed things up but you'd have to
instantiate them because Vivado doesn't seem to know how to do that.
So that first output requires 4 LUTs x 4 bits.

But the 5th encoded output is harder, because you have to keep a
running 3-bit sum of the number of set bits already encountered, so 3
bits of each LUT after the first are needed for the running sum, and
the sum itself requires 2 levels of logic.  (I can't post pictures
here, can I?)  So now you end up with what I calculate should be 7
levels of logic, or 3 levels of LUT and 5 levels of carry chain mux.
I could maybe do this if I pipeline it and I can get Vivado to
synthesize it properly.  But it just seems like there should be some
easier way.
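
A rough software model of the running-sum search, in Python. The three-bit chunk per stage is an assumption (three LUT inputs taken by the sum leave three for data), the names are hypothetical, and the model says nothing about how many logic levels the tools would produce:

```python
def find_nth(mask, n, chunk=3):
    """Index of the n-th set bit counted from bit 15 down, or None.
    The scan proceeds in chunks, carrying a running count between them."""
    count = 0                                   # fits in 3 bits for n <= 5
    for top in range(15, -1, -chunk):           # one stage per chunk
        for i in range(top, max(top - chunk, -1), -1):
            if (mask >> i) & 1:
                count += 1
                if count == n:
                    return i
    return None

def sort16to5(mask):
    """The five outputs are the 1st through 5th set bits from the top."""
    return [find_nth(mask, n) for n in range(1, 6)]

print(sort16to5(0b1100_0000_0000_0111))  # [15, 14, 2, 1, 0]
```

The first output needs no count at all, while the fifth depends on the count carried through every earlier stage, which is where the extra depth comes from.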

Article: 157911
Subject: Re: 16->5 "Sort"
From: Kevin Neilson <kevin.neilson@xilinx.com>
Date: Tue, 12 May 2015 12:05:15 -0700 (PDT)

Yes, that would work.  I think it would be about 180 LUTs, which is quite a bit.  It would probably work in one cycle:  there is a LUT, F7/F8 mux, and a second level of LUT for the mux, and only 2 levels of net routed on the fabric.

Article: 157912
Subject: Re: 16->5 "Sort"
From: Kevin Neilson <kevin.neilson@xilinx.com>
Date: Tue, 12 May 2015 12:11:33 -0700 (PDT)

I have plenty of BRAMs and don't mind using them, but they're a pain
sometimes.  I have to use the output registers so they have 2 cycles
of latency, and often I have to add another cycle just to route data
to or from the BRAM column.  They come in handy, though.  Lately I've
been using them for a lot of Galois arithmetic, such as lookup tables
for 1/x.

Article: 157913
Subject: Re: 16->5 "Sort"
From: Kevin Neilson <kevin.neilson@xilinx.com>
Date: Tue, 12 May 2015 12:16:39 -0700 (PDT)

On Tuesday, May 12, 2015 at 12:10:35 PM UTC-6, Gabor wrote:
> glen herrmannsfeldt wrote:
> > GaborSzakacs <gabor@alacron.com> wrote:
> >> Rob Doyle wrote:
> >>> On 5/11/2015 7:26 PM, Kevin Neilson wrote:
> >>>> To be clearer, here's an example.
> >
> >>>> Input (bit 15 on left):
> >
> >>>> 1 1 0 0  0 0 0 0  0 0 0 0  0 1 1 1
> >
> >>>> 20-bit Output List:
> >>>> 1111 1110 0010 0001 0000
> >
> >>> 64K x 20 ROM?
> >
> > Hmm, that probably does work. Especially with the BRAM in many FPGAs.
> >
> > When I did it some years ago, it was three out of 36 bits, and ROMs
> > weren't, and still aren't, that big.
> >
> >>> How badly do you want one level of logic?
> >
> >> Two ROMs, 256 by 23 and 256 by 20.  The first encodes the
> >> lower 8 bits into 0 to 5 4-bit numbers right-justified plus
> >> a 3-bit indication of how many bits were set.  The second
> >> just encodes 0 to 5 4-bit numbers, left justified (you assume
> >> it has the remaining 1's not covered by the low 8 bits).  Then
> >> one more level to select how many bits to take from each ROM
> >> based on the number of ones in the lower 8 bits (each output
> >> bit is only a 2:1 mux in this case - that's why the ROMs
> >> are justified right and left as noted).
> >
> > If not in BRAM, that seems a good way.
> >
> > How many levels of logic is a 256 bit ROM in FPGA LUTs?
> >
> > -- glen
> >
>
> In Xilinx 7-series:
>
> A 256-bit LUT fits in 1 SLICEL.  It uses all 4 64-bit LUTS and three
> muxes (two for each pair of LUTS and one to combine the result), so it
> would show up as 3 levels of logic, but it all routes internally to the
> slice and the muxes are really fast.
>
> Convincing the tools to use LUT memory is the fun part.  Here's my test
> code:
>
> module simple_lut
> (
>    input wire    [7:0] addr,
>    output wire   [7:0] data
> );
>
> (* RAM_STYLE = "distributed" *) reg [7:0] lut_mem [0:255];
>
> initial $readmemh ("../source/lutcontents.hex", lut_mem);
>
> assign data = lut_mem[addr];
>
> endmodule
>
> --
> Gabor

The ROM can be fast if you use the F7/F8 muxes built into the slice.
I've found the key thing for V7 is to minimize the number of routes on
the fabric.  The F7/F8 muxes are slower than LUTs, I think, but since
you don't have to route the connecting net onto the fabric you save a
lot of time.

Article: 157914
Subject: Re: ZYNQ temperature
From: lasselangwadtchristensen@gmail.com
Date: Tue, 12 May 2015 13:57:47 -0700 (PDT)

On Monday, 11 May 2015 at 19:59:43 UTC+2, John Larkin wrote:
> Does anyone know if the ZYNQ chips have an internal high-temperature
> shutdown? They are behaving like they do.
>

looks like you have to enable it (it may be default) and you have to load the PL

30.3.6 Critical Over-temperature Alarm
Note: This feature sends an interrupt status to the PS and causes an automatic shutdown feature for
the PL side of the Zynq-7000 device if enabled. The PL shutdown is enabled via the bitstream and the
PL will only come out of power-down if the over-temperature alarm goes inactive or a
reconfiguration occurs.
The on-chip temperature measurement is used for critical temperature warnings. The default over
temperature threshold is 125°C. This threshold is used when the contents of the OT Upper Alarm
register (listed in UG480) have not been configured. When the die temperature exceeds the
threshold set in the XADC's Control register, the over-temperature alarm (OT) becomes active. The OT
signal resets when the die temperature has fallen below set threshold.
The OT alarm can also be used to automatically power down the PL upon activation. The OT alarm can
be disabled by writing a 1 to the OT bit in the XADC's Configuration register.
Note: these registers are in the XADC and are accessible using the DRP.
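
For orientation, a small Python sketch of how a threshold such as 125°C relates to a 12-bit XADC temperature code. It assumes the 7-series transfer function as I recall it from UG480, T = code x 503.975 / 4096 - 273.15, so check it against the user guide before relying on it. The helper names are hypothetical and nothing here touches real registers:

```python
def xadc_code_to_celsius(code):
    """Convert a 12-bit XADC temperature code to degrees Celsius
    (assumed 7-series transfer function, see the note above)."""
    return code * 503.975 / 4096 - 273.15

def celsius_to_xadc_code(celsius):
    """Nearest 12-bit code for a temperature in degrees Celsius."""
    return round((celsius + 273.15) * 4096 / 503.975)

code = celsius_to_xadc_code(125.0)
print(code, round(xadc_code_to_celsius(code), 2))
```

One code step is about 0.12°C under this formula, so rounding to the nearest code moves the threshold by well under a tenth of a degree.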

-Lasse

Article: 157915
Subject: Re: 16->5 "Sort"
From: GaborSzakacs <gabor@alacron.com>
Date: Tue, 12 May 2015 17:05:06 -0400

Kevin Neilson wrote:
> Yes, that would work.  I think it would be about 180 LUTs, which is quite a bit.  It would probably work in one cycle:  there is a LUT, F7/F8 mux, and a second level of LUT for the mux, and only 2 levels of net routed on the fabric.

If you decide to pipeline it, a register placed after the 256-deep LUT
will go into the same slice with the 4 LUTs, 2 F7 muxes and F8 mux.
Then the final 2:1 would go after a standard fabric register, which
has pretty small clock to Q (much better than BRAM without the output
register).  Even in a -2 Artix you can run above 500 MHz with this
arrangement.

-- 
Gabor

Article: 157916
Subject: Re: 16->5 "Sort"
From: rickman <gnuarm@gmail.com>
Date: Tue, 12 May 2015 18:29:30 -0400

On 5/12/2015 3:11 PM, Kevin Neilson wrote:
> I have plenty of BRAMs and don't mind using them, but they're a pain
> sometimes.  I have to use the output registers so they have 2 cycles
> of latency, and often I have to add another cycle just to route data
> to or from the BRAM column.  They come in handy, though.  Lately I've
> been using them for a lot of Galois arithmetic, such as lookup tables
> for 1/x.

Why do you have to use the output registers?  The clock to out time on a 
BRAM has always been very fast as is the setup time.  The ones I've 
worked with were only slightly slower than a FF in the context of 
typical delays in logic and fabric.  What is your clock speed?

If you are working in a large part the LUTs are not an unreasonable way 
to implement this.  Not sure how fast the resulting logic will be, but 
it should be in the same ballpark as the BRAM but purely combinatorial. 
  Do you need to run faster than 100 MHz?

-- 

Rick

Article: 157917
Subject: Re: 16->5 "Sort"
From: Kevin Neilson <kevin.neilson@xilinx.com>
Date: Tue, 12 May 2015 15:59:41 -0700 (PDT)

On Tuesday, May 12, 2015 at 4:29:39 PM UTC-6, rickman wrote:
> On 5/12/2015 3:11 PM, Kevin Neilson wrote:
> > I have plenty of BRAMs and don't mind using them, but they're a pain
> > sometimes.  I have to use the output registers so they have 2 cycles
> > of latency, and often I have to add another cycle just to route data
> > to or from the BRAM column.  They come in handy, though.  Lately I've
> > been using them for a lot of Galois arithmetic, such as lookup tables
> > for 1/x.
>
> Why do you have to use the output registers?  The clock to out time on a
> BRAM has always been very fast as is the setup time.  The ones I've
> worked with were only slightly slower than a FF in the context of
> typical delays in logic and fabric.  What is your clock speed?
>
> If you are working in a large part the LUTs are not an unreasonable way
> to implement this.  Not sure how fast the resulting logic will be, but
> it should be in the same ballpark as the BRAM but purely combinatorial.
>   Do you need to run faster than 100 MHz?
>
> --
>
> Rick

I'm using 350 MHz, or a period of about 2.86 ns.  The clk->out time for
a V7 -1 BRAM (without output reg) is about 2.1 ns, so if I didn't use
the BRAM output register, I'd barely have enough time to get the output
across a net to a FF.  And I know even that usually won't meet timing,
because Vivado is fond of pulling the output registers out of my BRAMs
and putting them into slices, I guess because it thinks it has extra
slack and can give some of it to the next path.  But then the net to
the FF will be 600 ps and the path will fail.  I have not figured out
how to make Vivado stop doing this (except by instantiating BRAM
primitives).
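
The budget behind that statement, as a quick Python calculation using only the figures quoted above. The setup time and clock uncertainty are not given in the post, so they are left out and the result is an upper bound on what remains:

```python
clock_mhz = 350.0
period_ns = 1000.0 / clock_mhz                 # about 2.857 ns
bram_clk_to_out_ns = 2.1                       # BRAM without output register
net_delay_ns = 0.6                             # net from BRAM to the slice FF

remaining_ns = period_ns - bram_clk_to_out_ns - net_delay_ns

print(f"period    {period_ns:.3f} ns")
print(f"remaining {remaining_ns:.3f} ns before setup time and clock uncertainty")
```

With roughly 0.16 ns left for everything else, it is easy to see why the path does not close once the register is pulled out of the BRAM.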

Article: 157918
Subject: Re: 16->5 "Sort"
From: rickman <gnuarm@gmail.com>
Date: Tue, 12 May 2015 19:37:17 -0400

On 5/12/2015 2:57 PM, Kevin Neilson wrote:
> Vivado made it 16 levels of logic, and I can't tell exactly what it's doing, but this is how I would expect it would work:  the first output is the easiest.  You just find the leading 1 with a priority encoder and encode it.  You can look at the first 5 bits with the first level, using the 6th LUT input for an input from the next level if none of those 5 bits are set, and so on.  This requires 4 levels of LUTs.  One could use the carry chain muxes to speed things up but you'd have to instantiate them because Vivado doesn't seem to know how to do that.  So that first output requires 4 LUTs x 4 bits.
>
> But the 5th encoded output is harder, because you have to keep a running 3-bit sum of the number of set bits already encountered, so 3 bits of each LUT after the first are needed for the running sum, and the sum itself requires 2 levels of logic.  (I can't post pictures here, can I?)  So now you end up with what I calculate should be 7 levels of logic, or 3 levels of LUT and 5 levels of carry chain mux.  I could maybe do this if I pipeline it and I can get Vivado to synthesize it properly.  But it just seems like there should be some easier way.

Sorry, I just can't picture what you are doing.  What is the "running 
sum" for?  I think I might understand.  You look at the first 5 inputs 
and output codes for all five positions.  I'm not sure why you can't 
look at the first 6 inputs though.  This outputs a three bit code of the 
number of 1's found.   The second block looks at the next five inputs 
and outputs five codes.  The last five bits would be like the second 
group and have a mux with the second group, which in turn is what actually 
feeds the first mux.  The first group would be one level of LUTs.  The 
following two groups

Let me try to draw this...

       ,------,   3                       ,-----,
   0-5 |      |--/------------------------|SEL  |
  -->--|      | 20                        |     |  20
       |      |--/------------------------| BUM*|--/-->--
       '------'                           |     |
                                      ,---|     |
       ,------,  3        ,-----,     |   '-----'
  6-10 |      |--/--------|SEL  |     |
  -->--|      | 20        |     |     |
       |      |--/--------|     |     |
       '------'           |     |  20 |
                          | BUM*|--/--'
       ,------,           |     |
 11-15 |      |           |     |
  -->--|      | 20        |     |
       |      |--/--------|     |
       '------'           '-----'

*Big, Ugly Mux

The mux might be hard to work out and will surely be more than 1 level 
of LUTs.... unless you can use the magic muxes in the slice to combine 
multiple LUTs into a 6 input mux.  You don't need any adders for the 
counts since each 3 bit count controls a separate mux.  This might just 
work in three levels of LUTs if you can use multiple LUTs to form a 6 
input mux.
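
A behavioural sketch of the diagram in Python, assuming exactly five input bits are set. Each "big, ugly mux" is modelled as shifting the upper list left by the lower group's count and appending the lower list; the function names are hypothetical:

```python
GROUPS = ((0, 5), (6, 10), (11, 15))   # inclusive bit ranges from the diagram

def encode_group(mask, low, high):
    """One front-end block: count of ones and their indices, highest first."""
    found = [i for i in range(high, low - 1, -1) if (mask >> i) & 1]
    word = 0
    for index in found:
        word = (word << 4) | index
    return len(found), word

def merge(upper_word, lower_count, lower_word):
    """One mux: the lower group's count alone selects the arrangement."""
    return ((upper_word << (4 * lower_count)) | lower_word) & 0xFFFFF

def sort16to5(mask):
    count_a, word_a = encode_group(mask, *GROUPS[0])
    count_b, word_b = encode_group(mask, *GROUPS[1])
    _count_c, word_c = encode_group(mask, *GROUPS[2])   # count not needed
    inner = merge(word_c, count_b, word_b)              # second mux
    return merge(inner, count_a, word_a)                # first mux

print(format(sort16to5(0b1100_0000_0000_0111), "020b"))  # 11111110001000010000
```

No adder appears anywhere: each count only steers its own mux, which matches the point made above.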

I just read your post where you said you were running at 350 MHz.  I 
guess even this will have to be pipelined.  But it should be less logic 
than the brute force distributed RAM approach.  But who knows until the 
LUTs are counted?  In essence this is the same thing I guess.  It might 
work better with the larger front end blocks and just one mux.

I'm very surprised the clock to out time on the V7 BRAM is 2.1ns.  I 
think that is about the same number as the Spartan 3s from long ago.  Am 
I mistaken?

-- 

Rick

Article: 157919
Subject: Re: ZYNQ temperature
From: rickman <gnuarm@gmail.com>
Date: Tue, 12 May 2015 21:08:09 -0400

On 5/12/2015 4:57 PM, lasselangwadtchristensen@gmail.com wrote:
> On Monday, 11 May 2015 at 19:59:43 UTC+2, John Larkin wrote:
>> Does anyone know if the ZYNQ chips have an internal high-temperature
>> shutdown? They are behaving like they do.
>>
>
> looks like you have to enable it (it may be default) and you have to load the PL
>
> 30.3.6 Critical Over-temperature Alarm
> Note: This feature sends an interrupt status to the PS and causes an automatic shutdown feature for
> the PL side of the Zynq-7000 device if enabled. The PL shutdown is enabled via the bitstream and the
> PL will only come out of power-down if the over-temperature alarm goes inactive or a
> reconfiguration occurs.
> The on-chip temperature measurement is used for critical temperature warnings. The default over
> temperature threshold is 125°C. This threshold is used when the contents of the OT Upper Alarm
> register (listed in UG480) have not been configured. When the die temperature exceeds the
> threshold set in the XADC's Control register, the over-temperature alarm (OT) becomes active. The OT
> signal resets when the die temperature has fallen below set threshold.
> The OT alarm can also be used to automatically power down the PL upon activation. The OT alarm can
> be disabled by writing a 1 to the OT bit in the XADC's Configuration register.
> Note: these registers are in the XADC and are accessible using the DRP.

Without me digging into the data sheet myself, can you tell me what the 
PL and PS are?

-- 

Rick

Article: 157920
Subject: Re: ZYNQ temperature
From: Rob Gaddi <rgaddi@technologyhighland.invalid>
Date: Wed, 13 May 2015 01:19:19 +0000 (UTC)

On Tue, 12 May 2015 21:08:09 -0400, rickman wrote:

> On 5/12/2015 4:57 PM, lasselangwadtchristensen@gmail.com wrote:
>> On Monday, 11 May 2015 at 19:59:43 UTC+2, John Larkin wrote:
>>> Does anyone know if the ZYNQ chips have an internal high-temperature
>>> shutdown? They are behaving like they do.
>>>
>>>
>> looks like you have to enable it (it may be default) and you have to
>> load the PL
>>
> Without me digging into the data sheet myself, can you tell me what the
> PL and PS are?

Programmable Logic (FPGA side of things) and Processor System (hard ARM 
processor and some peripherals).

-- 
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order.  See above to fix.

Article: 157921
Subject: Re: ZYNQ temperature
From: John Larkin <jlarkin@highlandtechnology.com>
Date: Tue, 12 May 2015 18:42:15 -0700

On Tue, 12 May 2015 13:57:47 -0700 (PDT),
lasselangwadtchristensen@gmail.com wrote:

>On Monday, 11 May 2015 at 19:59:43 UTC+2, John Larkin wrote:
>> Does anyone know if the ZYNQ chips have an internal high-temperature
>> shutdown? They are behaving like they do.
>>
>
>looks like you have to enable it (it may be default) and you have to load the PL
>
>30.3.6 Critical Over-temperature Alarm
>Note: This feature sends an interrupt status to the PS and causes an automatic shutdown feature for
>the PL side of the Zynq-7000 device if enabled. The PL shutdown is enabled via the bitstream and the
>PL will only come out of power-down if the over-temperature alarm goes inactive or a
>reconfiguration occurs.
>The on-chip temperature measurement is used for critical temperature warnings. The default over
>temperature threshold is 125°C. This threshold is used when the contents of the OT Upper Alarm
>register (listed in UG480) have not been configured. When the die temperature exceeds the
>threshold set in the XADC's Control register, the over-temperature alarm (OT) becomes active. The OT
>signal resets when the die temperature has fallen below set threshold.
>The OT alarm can also be used to automatically power down the PL upon activation. The OT alarm can
>be disabled by writing a 1 to the OT bit in the XADC's Configuration register.
>Note: these registers are in the XADC and are accessible using the DRP.
>
>-Lasse

It's probably shutting down at 125C, without our specifically
programming any temperature.

Extensive searching, by us and by Avnet, finds no fan that matches the
hole spacing on the MicroZed board. So we'll fab a little aluminum
adapter plate and use a standard fan. With a pin-fin heat sink glued
to the 7020 FPGA, and the fan blowing down on that, we can run at 100C
ambient.



-- 

John Larkin         Highland Technology, Inc
picosecond timing   laser drivers and controllers

jlarkin att highlandtechnology dott com
http://www.highlandtechnology.com

Article: 157922
Subject: Re: ZYNQ temperature
From: rickman <gnuarm@gmail.com>
Date: Wed, 13 May 2015 03:42:44 -0400
Links: << >> << T >> << A >>

On 5/12/2015 9:42 PM, John Larkin wrote:
>
> It's probably shutting down at 125C, without our specifically
> programming any temperature.
>
> Extensive searching, by us and by Avnet, finds no fan that matches the
> hole spacing on the MicroZed board. So we'll fab a little aluminum
> adapter plate and use a standard fan. With a pin-fin heat sink glued
> to the 7020 FPGA, and the fan blowing down on that, we can run at 100C
> ambient.

Reminds me of an array processor I worked on in the early 80's.  It had 
ECL gate arrays in ceramic PGA packages with a heat sink on each chip 
and a specially designed plenum which slid over each one to direct air 
across the heat sink.  This machine was as fast as a CRAY-1 and only a 
few years later.

-- 

Rick

Article: 157923
Subject: Re: 16->5 "Sort"
From: Kevin Neilson <kevin.neilson@xilinx.com>
Date: Wed, 13 May 2015 00:58:43 -0700 (PDT)
Links: << >> << T >> << A >>

On Tuesday, May 12, 2015 at 5:37:27 PM UTC-6, rickman wrote:
> On 5/12/2015 2:57 PM, Kevin Neilson wrote:
> > Vivado made it 16 levels of logic, and I can't tell exactly what
> > it's doing, but this is how I would expect it would work:  the first
> > output is the easiest.  You just find the leading 1 with a priority
> > encoder and encode it.  You can look at the first 5 bits with the
> > first level, using the 6th LUT input for an input from the next level
> > if none of those 5 bits are set, and so on.  This requires 4 levels
> > of LUTs.  One could use the carry chain muxes to speed things up but
> > you'd have to instantiate them because Vivado doesn't seem to know
> > how to do that.  So that first output requires 4 LUTs x 4 bits.
> >
> > But the 5th encoded output is harder, because you have to keep a
> > running 3-bit sum of the number of set bits already encountered, so
> > 3 bits of each LUT after the first are needed for the running sum,
> > and the sum itself requires 2 levels of logic.  (I can't post
> > pictures here, can I?)  So now you end up with what I calculate
> > should be 7 levels of logic, or 3 levels of LUT and 5 levels of carry
> > chain mux.  I could maybe do this if I pipeline it and I can get
> > Vivado to synthesize it properly.  But it just seems like there
> > should be some easier way.
>
> Sorry, I just can't picture what you are doing.  What is the "running
> sum" for?  I think I might understand.  You look at the first 5 inputs
> and output codes for all five positions.  I'm not sure why you can't
> look at the first 6 inputs though.  This outputs a three bit code of the
> number of 1's found.   The second block looks at the next five inputs
> and outputs five codes.  The last five bits would be like the second
> group and have a mux with the second group when in turn is what actually
> feeds the first mux.  The first group would be one level of LUTs.  The
> following two groups
>
> Let me try to draw this...
>
>        ,------,   3                       ,-----,
>    0-5 |      |--/------------------------|SEL  |
>   -->--|      | 20                        |     |  20
>        |      |--/------------------------| BUM*|--/-->--
>        '------'                           |     |
>                                       ,---|     |
>        ,------,  3        ,-----,     |   '-----'
>   6-10 |      |--/--------|SEL  |     |
>   -->--|      | 20        |     |     |
>        |      |--/--------|     |     |
>        '------'           |     |  20 |
>                           | BUM*|--/--'
>        ,------,           |     |
> 11-15 |      |           |     |
>   -->--|      | 20        |     |
>        |      |--/--------|     |
>        '------'           '-----'
>
> *Big, Ugly Mux
>
> The mux might be hard to work out and will surely be more than 1 level
> of LUTs.... unless you can use the magic muxes in the slice to combine
> multiple LUTs into a 6 input mux.  You don't need any adders for the
> counts since each 3 bit count controls a separate mux.  This might just
> work in three levels of LUTs if you can use multiple LUTs to form a 6
> input mux.
>
> I just read your post where you said you were running at 350 MHz.  I
> guess even this will have to be pipelined.  But it should be less logic
> than the brute force distributed RAM approach.  But who knows until the
> LUTs are counted?  In essence this is the same thing I guess.  It might
> work better with the larger front end blocks and just one mux.
>
> I'm very surprised the clock to out time on the V7 BRAM is 2.1ns.  I
> think that is about the same number as the Spartan 3s from long ago.  Am
> I mistaken?
>
> -- 
>
> Rick

The BRAM output is 2.1 ns, but if you use the output register (which I
have to) it's 750 ps.  Then the BRAM has 2 cycles of latency.

Yes, something like you show would work.  The design I'd written up had
the sums as inputs to the LUTs.  So the top LUT could look at 6 bits (I
said 5 originally because I was going to use the MUXCY but I abandoned
that).  Then the next LUT looks at 4 bits, and the other 2 inputs would
be the 2-bit sum of the first 5 bits.  And the next LUT looks at 4 more
bits and also has a 2-bit sum of the first 10 bits.  (This is for the
3rd encoded output so we're looking for the 3rd bit set.)  I end up with
4 of these LUTs, 2 levels of LUTs to do the sums, and an F7/F8 mux
afterward to pick one of the 4 LUTs.  So that's 3 levels of LUTs and an
F7/F8, which would work in 1 cycle.  The whole thing would be about 100
LUTs.

I couldn't get that to work, though, because I can't get Vivado to
synthesize anything right, and I was going to have to instantiate a lot
of primitives (including the F7/F8 muxes).  I couldn't even get Vivado
to do the sums correctly.  You should be able to find the mod-2 sum of
up to 18 bits with 8 LUTs in 2 levels, but Vivado does 3 levels.  It's
pitiful.

I ended up doing something else.  I did a trailing-one detector like this:

wire [15:0] trailing_1 = ~(input_vec[15:0]-1) & input_vec[15:0];

This uses the carry chain.  I think the idea is from Knuth.  That gives
you a 16-bit vector with just the trailing 1 set.
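
A quick way to convince yourself the expression works is to model it in
software. The sketch below is Python rather than HDL; the 16-bit mask
stands in for the wire width and the function name is illustrative.
Subtracting 1 flips the trailing 1 and every 0 below it, so inverting
the result and ANDing with the original leaves only that one bit.

```python
MASK16 = 0xFFFF  # stands in for the 16-bit wire width


def trailing_one(vec):
    """One-hot 16-bit vector holding only the lowest set bit of vec."""
    return ~(vec - 1) & vec & MASK16


if __name__ == "__main__":
    for v in (0b1100_0000_0000_0111, 0b0000_0000_0010_1000, 0):
        print(format(v, "016b"), "->", format(trailing_one(v), "016b"))
```

An all-zero input comes back as all zeros, so the case of no bits set
needs no special handling.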

You encode that for the 1st output.  You do the same thing with a
mirrored version of input_vec to do a leading-one detector and encode
that for the 2nd output.  Then you XOR those two vectors with the
original to get a vector with just the 3 middle bits still set.  You do
another leading/trailing 1 detector and encode those two and then XOR
those with that 3-bit vector and you have a vector with 1 bit set and
you encode that.
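
The whole procedure can be sketched as a behavioral model in Python (a
software sketch under stated assumptions, not the actual RTL). It
assumes exactly five bits are set. The helper names are made up for
illustration, and the outputs are listed highest position first only so
that they line up with the example list earlier in the thread; in
hardware the ordering is just wiring.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1


def trailing_one(vec):
    """One-hot vector with only the lowest set bit of vec."""
    return ~(vec - 1) & vec & MASK


def mirror(vec):
    """Bit-reverse a WIDTH-bit vector (bit 0 swaps with bit WIDTH-1)."""
    out = 0
    for i in range(WIDTH):
        if (vec >> i) & 1:
            out |= 1 << (WIDTH - 1 - i)
    return out


def leading_one(vec):
    """One-hot vector with only the highest set bit: mirror, isolate, mirror back."""
    return mirror(trailing_one(mirror(vec)))


def encode(one_hot):
    """Bit index of the single set bit in a one-hot vector."""
    return one_hot.bit_length() - 1


def five_positions(vec):
    """Positions of the five set bits of vec, highest first."""
    if bin(vec & MASK).count("1") != 5:
        raise ValueError("model assumes exactly five bits are set")
    lo1, hi1 = trailing_one(vec), leading_one(vec)
    mid3 = vec ^ lo1 ^ hi1            # the three middle bits remain
    lo2, hi2 = trailing_one(mid3), leading_one(mid3)
    last = mid3 ^ lo2 ^ hi2           # a single bit remains
    return [encode(x) for x in (hi1, hi2, last, lo2, lo1)]


if __name__ == "__main__":
    print(five_positions(0b1100_0000_0000_0111))
```

Each stage depends only on the vector left by the previous stage, which
is what makes the scheme easy to pipeline.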

That's 200 LUTs in all and I pipelined it for 3 cycles of latency.
There's a lot of slack so I might be able to do it in 2 but I'm not sure
if I want to risk it.

Article: 157924
Subject: Re: ZYNQ temperature
From: kkoorndyk <kris.koorndyk@gmail.com>
Date: Wed, 13 May 2015 05:37:17 -0700 (PDT)
Links: << >> << T >> << A >>

On Tuesday, May 12, 2015 at 9:42:58 PM UTC-4, John Larkin wrote:
> On Tue, 12 May 2015 13:57:47 -0700 (PDT),
> lasselangwadtchristensen@gmail.com wrote:
>
> >On Monday, May 11, 2015 at 19:59:43 UTC+2, John Larkin wrote:
> >> Does anyone know if the ZYNQ chips have an internal high-temperature
> >> shutdown? They are behaving like they do.
> >>
> >
> >looks like you have to enable it (it may be default) and you have to load the PL
> >
> >30.3.6 Critical Over-temperature Alarm
> >Note: This feature sends an interrupt status to the PS and causes an automatic shutdown feature for
> >the PL side of the Zynq-7000 device if enabled. The PL shutdown is enabled via the bitstream and the
> >PL will only come out of power-down if the over-temperature alarm goes inactive or a
> >reconfiguration occurs.
> >The on-chip temperature measurement is used for critical temperature warnings. The default over
> >temperature threshold is 125°C. This threshold is used when the contents of the OT Upper Alarm
> >register (listed in UG480) have not been configured. When the die temperature exceeds the
> >threshold set in the XADC's Control register, the over-temperature alarm (OT) becomes active. The OT
> >signal resets when the die temperature has fallen below set threshold.
> >The OT alarm can also be used to automatically power down the PL upon activation. The OT alarm can
> >be disabled by writing a 1 to the OT bit in the XADC's Configuration register.
> >Note: these registers are in the XADC and are accessible using the DRP.
> >
> >-Lasse
>
> It's probably shutting down at 125C, without our specifically
> programming any temperature.
>
> Extensive searching, by us and by Avnet, finds no fan that matches the
> hole spacing on the MicroZed board. So we'll fab a little aluminum
> adapter plate and use a standard fan. With a pin-fin heat sink glued
> to the 7020 FPGA, and the fan blowing down on that, we can run at 100C
> ambient.
>
>
>
> -- 
>
> John Larkin         Highland Technology, Inc
> picosecond timing   laser drivers and controllers
>
> jlarkin att highlandtechnology dott com
> http://www.highlandtechnology.com

The MicroZed has a -I part on it, right?  Those parts are spec'd at a
max junction temp of 100 C.  You need the Expanded temperature grade
parts (Q) to get the 125 C junction temps.
