On Friday, September 25, 2015 at 4:40:18 PM UTC-4, kaz wrote:
> > Hi,
> >
> > When I read a tutorial on FIR implementation on FPGA, I am not clear about
> > "partial results can be used for many multiplications (regardless of
> > symmetry)". That slide may be based on multipliers built from logic cells
> > in the FPGA, not a dedicated MAC in the FPGA. Anyhow, I don't know why
> > "partial results can be used for many multiplications (regardless of
> > symmetry)". I can only think of saving 50% of the multipliers by taking
> > advantage of the symmetric FIR coefficients.
> >
> > Could you tell me how to understand the partial results?
> >
> > Thanks,
>
> can we read that tutorial?
>
> Kaz
> ---------------------------------------
> Posted through http://www.FPGARelated.com

Here is the link:
https://www.google.ca/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CB0QFjAAahUKEwi_vKXaoZPIAhVRB5IKHZbHBTk&url=http%3A%2F%2Fcct.cnes.fr%2Fsystem%2Ffiles%2Fcnes_cct%2F459-mce%2Fpublic%2F06_MVD_%2520FIR_Design.pdf&usg=AFQjCNHDrIXK_J6WMErALOhKYrGsxLFg6w

Thanks,

Article: 158226
On 9/25/2015 5:09 PM, Jon Elson wrote:
> I have a product series that has mostly moved to Spartan 3A, but one member
> of the family is still using up boards made some time ago with the
> Spartan2E. I just found a mistake in all of the designs -- left out
> synchronizers on one input!
>
> I fixed the Spartan3A configurations, but no longer have a version of ISE 10
> running. Looking through some files, I see there is a directory on my ISE
> 14.7 install, ISE_DS/ISE/spartan2e, so it seems at least some of the files
> needed to synthesize for the Spartan2E are there, but of course the main GUI
> doesn't allow you to select that family. (This is on a Linux system.)
>
> Does anyone know if Spartan2E can be synthesized on ISE 14.7 by re-enabling
> that family? I think I can probably boot up one of the archived hard drives
> that had ISE 10 on it if I have to, but that would be a bit of a pain. This
> should be a one-time need, just to get this one PROM file corrected.
>
> Thanks!
>
> Jon

I'm pretty sure that any Spartan2E info still in the latest ISE is only for
programming (BSDL files, etc.) and you can't target those parts in 14.7 (or
anything after 10.1.03).

--
Gabor

Article: 158227
Galina Szakacs wrote:
>
> I'm pretty sure that any Spartan2E info still in the latest ISE is only
> for programming (BSDL files, etc.) and you can't target those parts in
> 14.7 (or anything after 10.1.03).
>
You certainly can't select them in the GUI. Yes, impact/BSDL may be the
reason these files remain there.

Thanks,

Jon

Article: 158228
On 9/25/2015 4:10 PM, fl wrote:
> Hi,
>
> When I read a tutorial on FIR implementation on FPGA, I am not clear about
> "partial results can be used for many multiplications (regardless of
> symmetry)". That slide may be based on multipliers built from logic cells
> in the FPGA, not a dedicated MAC in the FPGA. Anyhow, I don't know why
> "partial results can be used for many multiplications (regardless of
> symmetry)". I can only think of saving 50% of the multipliers by taking
> advantage of the symmetric FIR coefficients.
>
> Could you tell me how to understand the partial results?

They are talking about an extreme level of optimization by sharing partial
products between multiplies. Trouble is, each multiply is by a different
coefficient *and* a different data value. But in each successive clock cycle
the data moves to the next coefficient, so if any of the bits of the
coefficients match, the result of the previous partial product can just be
shifted into the appropriate location in the adjacent product calculation.
It would be a bit tortuous to code and would nullify the utility of the hard
multipliers available in many FPGAs. It might be worthwhile to do if you are
designing an ASIC though.

--

Rick

Article: 158229
On 9/26/2015 12:06 AM, rickman wrote:
> On 9/25/2015 4:10 PM, fl wrote:
>> When I read a tutorial on FIR implementation on FPGA, I am not clear
>> about "partial results can be used for many multiplications (regardless
>> of symmetry)". [...]
>
> They are talking about an extreme level of optimization by sharing
> partial products between multiplies. [...] It might be worthwhile to do
> if you are designing an ASIC though.

I posted this before I read your link. I assumed right, but I didn't see
the block diagram which shows all the multiplies happening on the same data
at the same time. I've written FIR filters before, I should have remembered
this. So the individual partial products can be shared across all the
multiplies and added appropriately. I expect this assumes fixed
coefficients, which naturally make multipliers simpler.

--

Rick

Article: 158230
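To make the shared-partial-result idea concrete, here is a minimal VHDL
sketch of two constant multipliers sharing a common term. The coefficient
values 5 and 13 are illustrative, not taken from the linked slides; since
13 = 5 + 8, the x*5 partial result is computed once and reused:

    -- x*5 and x*13 built from shifts and adds, sharing the x*5 term.
    -- Illustrative coefficients and widths; a sketch only.
    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity share_pp is
      port (x   : in  signed(15 downto 0);
            x5  : out signed(19 downto 0);
            x13 : out signed(19 downto 0));
    end entity;

    architecture rtl of share_pp is
      signal xe, p5 : signed(19 downto 0);
    begin
      xe  <= resize(x, 20);
      p5  <= shift_left(xe, 2) + xe;     -- x * 5   (binary 101)
      x5  <= p5;
      x13 <= p5 + shift_left(xe, 3);     -- x * 13  (binary 1101) = x*5 + x*8
    end architecture;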
On 9/26/15 12:06 AM, rickman wrote:
> On 9/25/2015 4:10 PM, fl wrote:
>> When I read a tutorial on FIR implementation on FPGA, I am not clear
>> about "partial results can be used for many multiplications (regardless
>> of symmetry)". [...]
>
> They are talking about an extreme level of optimization by sharing
> partial products between multiplies. [...] It might be worthwhile to do
> if you are designing an ASIC though.

A simple, and useful, transformation for a FIR or IIR filter on FPGAs is to
switch from using one big summing node, with a series of delays
before/after with tap-offs and multiplies, to having a single node feed
forward/back to a series of nodes with simple adders. Since with the FPGA
the registers at the outputs are free, this is the most efficient format.

It also means that if the coefficients are constants, you have a
possibility of optimizing some of the partial products if building explicit
multipliers.

Article: 158231
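A minimal VHDL sketch of the transposed-form structure described above: the
input fans out to every tap, and each product is added into a register
chain, so each adder gets a free output register. The entity name, widths,
and placeholder coefficients are illustrative assumptions, not code from
the post:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity fir_transposed is
      port (clk : in  std_logic;
            x   : in  signed(15 downto 0);
            y   : out signed(35 downto 0));
    end entity;

    architecture rtl of fir_transposed is
      constant N : positive := 4;
      type coef_t is array (0 to N-1) of signed(15 downto 0);
      constant COEF : coef_t := (to_signed(7, 16), to_signed(-3, 16),
                                 to_signed(-3, 16), to_signed(7, 16));
      type acc_t is array (0 to N-1) of signed(35 downto 0);
      signal acc : acc_t := (others => (others => '0'));
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          -- Last tap starts the chain: product only.
          acc(N-1) <= resize(x * COEF(N-1), 36);
          -- Every other tap: product plus the previous stage's register.
          for i in N-2 downto 0 loop
            acc(i) <= resize(x * COEF(i), 36) + acc(i+1);
          end loop;
        end if;
      end process;
      y <= acc(0);
    end architecture;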
It would appear there are very similar resource needs for either RISC or
stack/accumulator architectures when both are of the "load/store"
classification. Herein, the same multi-port LUT RAM serves for either the
RISC register file or dual stacks, with the DSP for multiply and block RAM
for main memory. "Load/store" refers to using distinct instructions for
moving data between LUT RAM and block RAM.

Has someone studied this situation?
Would it appear that stack/accumulator program code is denser?
Would it appear that multiple instruction issue is simpler with RISC?

Jim Brakefield

Article: 158232
On 9/26/2015 2:07 PM, jim.brakefield@ieee.org wrote:
> It would appear there are very similar resource needs for either RISC or
> stack/accumulator architectures when both are of the "load/store"
> classification. [...]
>
> Has someone studied this situation?
> Would it appear that stack/accumulator program code is denser?
> Would it appear that multiple instruction issue is simpler with RISC?

I've done a little investigation and the instruction set for a stack
processor was not much denser than the instruction set for the RISC CPU I
compared it to. I don't recall which one it was.

A lot depends on the code you use for comparison. I was using loops that
move data. Many stack processors have some levels of inefficiency because
of the juggling of the stack required in some code. Usually proponents say
the code can be written to reduce the juggling of operands, which I have
found to be mostly true. If you code to reduce the parameter juggling,
stack processors can be somewhat more efficient in terms of code space
usage.

I have looked at a couple of things as alternatives. One is to use VLIW to
allow as much parallelism as possible among the execution units within the
processor, namely the data unit, address unit, and instruction unit. This
presents some inherent inefficiency in that a fixed-size instruction field
is used to control the instruction unit when most IU instructions are just
"next", for example. But it allows both the address unit and the data unit
to be doing work at the same time, for doing things like moving data
to/from memory and counting a loop iteration, for example.

Another potential stack optimization I have looked at is combining register
and stack concepts by allowing very short offsets from top of stack to be
used for a given operand, along with variable-size stack adjustments. I
didn't pursue this very far but I think it has the potential of virtually
eliminating operand juggling, making a stack processor much faster. I'm not
sure of the effect on code size optimization because of the larger
instruction size.

--

Rick

Article: 158233
On Sunday, September 27, 2015 at 3:37:24 AM UTC+9:30, jim.bra...@ieee.org wrote:
>
> Has someone studied this situation?
> Would it appear that stack/accumulator program code is denser?
> Would it appear that multiple instruction issue is simpler with RISC?
>

I worked with the 1980's Lilith computer and its Modula-2 compiler, which
used a stack-based architecture. Christian Jacobi includes a detailed
analysis of the code generated in his dissertation titled "Code Generation
and the Lilith Architecture". You can download a copy from my website:

http://www.cfbsoftware.com/modula2/

I am currently working on the 2015 RISC equivalent - the FPGA RISC5 Oberon
compiler used in Project Oberon:

http://www.projectoberon.com

The code generation is described in detail in the included documentation.

I have both systems in operation and have some very similar test programs
for both. I'll experiment to see if the results give any surprises. Any
comparison would have to take into account the fact that the Lilith was a
16-bit architecture whereas RISC5 is 32-bit, so it might be tricky.

Regards,
Chris Burrows
CFB Software
http://www.astrobe.com

Article: 158234
On Fri, 25 Sep 2015 15:47:29 -0700, fl wrote:
> On Friday, September 25, 2015 at 4:40:18 PM UTC-4, kaz wrote:
>> > When I read a tutorial on FIR implementation on FPGA, I am not clear
>> > about "partial results can be used for many multiplications
>> > (regardless of symmetry)". [...]
>>
>> can we read that tutorial?
>>
>> Kaz
>
> Here is the link:
> https://www.google.ca/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CB0QFjAAahUKEwi_vKXaoZPIAhVRB5IKHZbHBTk&url=http%3A%2F%2Fcct.cnes.fr%2Fsystem%2Ffiles%2Fcnes_cct%2F459-mce%2Fpublic%2F06_MVD_%2520FIR_Design.pdf&usg=AFQjCNHDrIXK_J6WMErALOhKYrGsxLFg6w
>
> Thanks,

Well, the guy throws out one unfounded statement and then never supports
it. In a live presentation you could raise your hand and ask about it. Can
you send him an email?

I can see ways that, given a predetermined set of coefficients, you may be
able to get the gate count down, but that's not really within the scope of
the talk.

I suspect it's an editing turd -- he had something brilliant in a prior
version of the presentation which he either found out was unfounded, or
which he didn't have time to present for this particular talk, so he set
out to edit all of it out but he left this one little bit.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Article: 158235
On Sun, 27 Sep 2015 15:08:02 -0500, Tim Wescott wrote:
> Well, the guy throws out one unfounded statement and then never supports
> it. In a live presentation you could raise your hand and ask about it.
> Can you send him an email?
>
> I can see ways that, given a predetermined set of coefficients, you may
> be able to get the gate count down, but that's not really within the
> scope of the talk.
>
> I suspect it's an editing turd -- he had something brilliant in a prior
> version of the presentation which he either found out was unfounded, or
> which he didn't have time to present for this particular talk, so he set
> out to edit all of it out but he left this one little bit.

Just a note: since nearly all FIR filters are symmetrical around some
point, it would be interesting to see how much the area would increase or
decrease if you insisted on that, then reduced the multipliers by a factor
of two at the cost of having one more adder per multiplier.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Article: 158236
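The "one more adder per multiplier" is a pre-adder exploiting the
coefficient symmetry c(k) = c(N-1-k): the two samples that share a
coefficient are summed first and multiplied once, so N/2 multipliers
replace N. A hedged VHDL sketch, assuming an even tap count; the entity
name, widths, and placeholder coefficients are illustrative:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity fir_sym is
      port (clk : in  std_logic;
            x   : in  signed(15 downto 0);
            y   : out signed(35 downto 0));
    end entity;

    architecture rtl of fir_sym is
      constant N : positive := 8;  -- even tap count, symmetric coefficients
      type coef_t is array (0 to N/2-1) of signed(15 downto 0);
      constant COEF : coef_t := (others => to_signed(1, 16)); -- placeholders
      type dly_t is array (0 to N-1) of signed(15 downto 0);
      signal dly : dly_t := (others => (others => '0'));
    begin
      process (clk)
        variable sum : signed(35 downto 0);
      begin
        if rising_edge(clk) then
          dly <= x & dly(0 to N-2);  -- advance the tapped delay line
          sum := (others => '0');
          for k in 0 to N/2-1 loop
            -- Pre-add the two samples sharing coefficient k, multiply once.
            sum := sum + resize((resize(dly(k), 17) + resize(dly(N-1-k), 17))
                                * COEF(k), 36);
          end loop;
          y <= sum;
        end if;
      end process;
    end architecture;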
On 9/27/2015 4:08 PM, Tim Wescott wrote:
> Well, the guy throws out one unfounded statement and then never supports
> it. In a live presentation you could raise your hand and ask about it.
> Can you send him an email?
>
> I can see ways that, given a predetermined set of coefficients, you may
> be able to get the gate count down, but that's not really within the
> scope of the talk.

I think it is clear that he is talking about a hard-coded set of
coefficients. If the coefficients are variables in registers, there is no
efficient way to optimize the multipliers. With fixed coefficients and
multipliers in the fabric, this would be an important optimization. He
discusses using fabric for multipliers in later slides. It is not
unreasonable to expect the tools to do this optimization automatically.

--

Rick

Article: 158237
On Sun, 27 Sep 2015 18:28:12 -0400, rickman wrote:
> I think it is clear that he is talking about a hard-coded set of
> coefficients. If the coefficients are variables in registers, there is
> no efficient way to optimize the multipliers. With fixed coefficients
> and multipliers in the fabric, this would be an important optimization.
> He discusses using fabric for multipliers in later slides. It is not
> unreasonable to expect the tools to do this optimization automatically.

I totally missed that -- yes, one would hope that in 2015 the tools would
be able to figure out how to optimize fixed multipliers.

This is a tangent, but it makes me wonder -- I saw a paper ages ago that
was basically saying that if addition and subtraction are equally costly,
then you can optimize a multiplication by using both -- i.e., if you're
multiplying (x) by 11110001111b, then you can either do eight adds, or you
can do (x << 11) - (x << 7) + (x << 4) - x, for a 4x savings.

So -- do modern optimizers do this when multiplying by fixed coefficients,
or not?

--
www.wescottdesign.com

Article: 158238
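For the record, the decomposition works out: 11110001111b = 1935, and
2^11 - 2^7 + 2^4 - 1 = 2048 - 128 + 16 - 1 = 1935, so three add/subtract
operations replace the seven adders a plain shift-and-add over eight
one-bits would need. This is the canonic signed-digit trick. A minimal VHDL
sketch with illustrative widths:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Constant multiply by 1935 using the signed-digit form above.
    entity mul1935 is
      port (x : in  signed(15 downto 0);
            y : out signed(27 downto 0));
    end entity;

    architecture rtl of mul1935 is
      signal xe : signed(27 downto 0);
    begin
      xe <= resize(x, 28);
      y  <= shift_left(xe, 11) - shift_left(xe, 7) + shift_left(xe, 4) - xe;
    end architecture;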
On 9/27/2015 6:44 PM, Tim Wescott wrote:
> I totally missed that -- yes, one would hope that in 2015 the tools
> would be able to figure out how to optimize fixed multipliers.
>
> This is a tangent, but it makes me wonder -- I saw a paper ages ago that
> was basically saying that if addition and subtraction are equally
> costly, then you can optimize a multiplication by using both -- i.e., if
> you're multiplying (x) by 11110001111b, then you can either do eight
> adds, or you can do (x << 11) - (x << 7) + (x << 4) - x, for a 4x
> savings.
>
> So -- do modern optimizers do this when multiplying by fixed
> coefficients, or not?

I can't imagine they wouldn't. But mostly, multiplications of any type are
done using the hardware multipliers available in the vast majority of
FPGAs.

--

Rick

Article: 158239
On Saturday, September 26, 2015 at 3:02:27 PM UTC-5, rickman wrote:
> I have looked at a couple of things as alternatives. One is to use VLIW
> to allow as much parallelism as possible among the execution units
> within the processor, namely the data unit, address unit, and
> instruction unit.

Have considered multiple stacks as a form of VLIW: each stack having its
own part of the VLIW instruction, or, if nothing to do, providing future
immediates for any of the other stacks' instructions.

> Another potential stack optimization I have looked at is combining
> register and stack concepts by allowing very short offsets from top of
> stack to be used for a given operand, along with variable-size stack
> adjustments. I didn't pursue this very far but I think it has the
> potential of virtually eliminating operand juggling, making a stack
> processor much faster.

Also, this is a way to improve the processing rate, as there are fewer
instructions than in "pure" stack code (each instruction has a
stack/accumulator operation and a small offset for the other operand).
While one is at it, one can add various instruction bits for "return",
stack/accumulator mode, replace operation, stack pointer selector, ...

Personally, I don't have hard numbers for any of this (there are open
source stack machines with small offsets and various instruction bits;
what is needed is compilers so that comparisons can be done). And I don't
want to duplicate any work (AKA research) that has already been done.

Jim Brakefield

Article: 158240
On Saturday, September 26, 2015 at 8:19:29 PM UTC-5, cfbso...@gmail.com wrote:
> I worked with the 1980's Lilith computer and its Modula-2 compiler,
> which used a stack-based architecture. [...]
>
> Any comparison would have to take into account the fact that the Lilith
> was a 16-bit architecture whereas RISC5 is 32-bit, so it might be
> tricky.

And in the 1980s main memory access time was a smaller multiple of the
clock period than with today's DRAMs. However, the main memory for the
RISC5 FPGA card is asynchronous static RAM with a fast access time, and
comparable to the main memory of the Lilith?

Jim Brakefield

Article: 158241
On 9/27/2015 8:30 PM, jim.brakefield@ieee.org wrote:
> Have considered multiple stacks as a form of VLIW: each stack having its
> own part of the VLIW instruction, or, if nothing to do, providing future
> immediates for any of the other stacks' instructions.

I assume you mean two data stacks? I was trying hard not to expand on the
hardware significantly. The common stack machine typically has two stacks,
one for data and one for return addresses. In Forth the return stack is
also used for loop counting. My derivation uses the return stack for
addresses such as memory accesses as well as jumps/calls, so I call it the
address stack. This lets you do minimal arithmetic (loop counting and
incrementing addresses) and reduces stack ops on the data stack, such as
the two drops required for a memory write.

> Also, this is a way to improve the processing rate, as there are fewer
> instructions than in "pure" stack code (each instruction has a
> stack/accumulator operation and a small offset for the other operand).
> While one is at it, one can add various instruction bits for "return",
> stack/accumulator mode, replace operation, stack pointer selector, ...

Yes, returns are common so it can be useful to provide a minimal
instruction overhead for that. The other things can require extra hardware.

> Personally, I don't have hard numbers for any of this (there are open
> source stack machines with small offsets and various instruction bits;
> what is needed is compilers so that comparisons can be done). And I
> don't want to duplicate any work (AKA research) that has already been
> done.
>
> Jim Brakefield

--

Rick

Article: 158242
On Sunday, September 27, 2015 at 10:20:39 PM UTC-5, rickman wrote:

> I assume you mean two data stacks?

Yes, in particular integer arithmetic on one and floating point on the
other.

> My derivation uses the return stack for addresses such as memory
> accesses as well as jumps/calls, so I call it the address stack.

OK

> I was trying hard not to expand on the hardware significantly.
> The other things can require extra hardware.

With FPGA 6LUTs one can have several read ports (4LUT RAM can do it also,
it's just not as efficient). At one operation per clock, and mapping both
data and address stacks to the same LUT RAM, one has two ports for operand
reads, one port for the result write, and one port for the "return"
address read. Just about any stack or accumulator operation that fits
these constraints is possible with appropriate instruction decode and ALU.
The SWAP operation requires two writes, so one would need to make TOS a
separate register to do it in one clock (other implementations are
possible using two multiport LUT RAMs).

Jim

Article: 158243
On 9/28/2015 12:31 AM, jim.brakefield@ieee.org wrote:
> With FPGA 6LUTs one can have several read ports (4LUT RAM can do it
> also, it's just not as efficient). At one operation per clock, and
> mapping both data and address stacks to the same LUT RAM, one has two
> ports for operand reads, one port for the result write, and one port for
> the "return" address read. Just about any stack or accumulator operation
> that fits these constraints is possible with appropriate instruction
> decode and ALU. The SWAP operation requires two writes, so one would
> need to make TOS a separate register to do it in one clock (other
> implementations are possible using two multiport LUT RAMs).

I used a TOS register for each stack and used a write port and read port
for each stack in one block RAM. The write/read ports share the address. A
read happens on each cycle automatically, and in all the parts I have used
that can be set so the data written in a cycle shows up on the read port,
so it is the next-on-stack at all times.

Managing the stack pointers can get a bit complex if an effort to keep it
simple is not made. As it was, the stack pointer was in the critical
timing path, which ended in the flag registers. The stack pointers set
error flags in the CPU status register for over- and underflow. I thought
this would be useful for debugging, but there are likely ways to minimize
the timing overhead.

--

Rick

Article: 158244
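A hedged VHDL sketch of the TOS-register-plus-RAM arrangement rickman
describes: the top of stack lives in a register, the next-on-stack is read
from a synchronous RAM every cycle, and on a push the old TOS bypasses the
RAM write so it is immediately visible as NOS. All names, widths, and the
depth are illustrative assumptions, not his actual design:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity stack16 is
      port (clk  : in  std_logic;
            push : in  std_logic;
            pop  : in  std_logic;
            din  : in  signed(15 downto 0);
            tos  : out signed(15 downto 0);
            nos  : out signed(15 downto 0));
    end entity;

    architecture rtl of stack16 is
      type ram_t is array (0 to 255) of signed(15 downto 0);
      signal ram   : ram_t := (others => (others => '0'));
      signal sp    : unsigned(7 downto 0) := (others => '0'); -- NOS address
      signal tos_r : signed(15 downto 0) := (others => '0');
      signal nos_r : signed(15 downto 0) := (others => '0');
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if push = '1' then
            ram(to_integer(sp + 1)) <= tos_r; -- old TOS goes into the RAM
            nos_r <= tos_r;                   -- bypass: written word is NOS
            tos_r <= din;
            sp    <= sp + 1;
          elsif pop = '1' then
            tos_r <= nos_r;                   -- NOS becomes the new TOS
            nos_r <= ram(to_integer(sp - 1)); -- refill NOS from the RAM
            sp    <= sp - 1;
          else
            nos_r <= ram(to_integer(sp));     -- keep the NOS register fresh
          end if;
        end if;
      end process;
      tos <= tos_r;
      nos <= nos_r;
    end architecture;

With both top items registered, SWAP only exchanges tos_r and nos_r, plus
one RAM write to keep ram(sp) in step, which matches Jim's point about
needing TOS in a separate register to do it in one clock.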
On Monday, September 28, 2015 at 10:49:47 AM UTC+9:30, jim.bra...@ieee.org wrote:
> And in the 1980s main memory access time was a smaller multiple of the
> clock period than with today's DRAMs. However, the main memory for the
> RISC5 FPGA card is asynchronous static RAM with a fast access time, and
> comparable to the main memory of the Lilith?

Rather than trying to paraphrase the information and risk getting it wrong,
I refer you to a detailed description of the Lilith memory organisation in
the 'Lilith Computer Hardware Manual'. You can download a copy of this and
several other related documents from BitSavers:

http://www.bitsavers.org/pdf/eth/lilith/

Regards,
Chris

Article: 158245
Jon Elson wrote:
> Galina Szakacs wrote:
>
>> I'm pretty sure that any Spartan2E info still in the latest ISE is only
>> for programming (BSDL files, etc.) and you can't target those parts in
>> 14.7 (or anything after 10.1.03).
>
> You certainly can't select them in the GUI. Yes, impact/BSDL may be the
> reason these files remain there.

Well, it wasn't that big of a deal. Downloaded 10.1.03, installed it,
re-synthesized the corrected project, and it seems to work fine. Still a
bit more testing to do to be sure there isn't any unexpected behavior, but
a quick check looks OK.

Thank goodness Xilinx still has these legacy versions available!

Jon

Article: 158246
Hi,

Lately I have spent a lot of time on the development of quite complex
high-speed data processing systems in FPGA. They all had pipeline
architectures, and data were processed in parallel in multiple pipelines
with different latencies.

The worst thing was that those latencies kept changing during development.
For example, some operations were performed by blocks with a tree
structure, so the number of levels depended on the number of inputs handled
by each node. The number of inputs in each node was varied to find an
acceptable balance between the number of levels and the maximum clock
speed. I also had to add some pipeline registers to improve timing.

The entire designs were written in pure VHDL, so I had to adjust latencies
manually to ensure that data coming from different paths arrive at the next
block in the same clock cycle. It was really a nightmare, so I dreamed
about an automated way to ensure proper equalization of latencies.

After some work I have elaborated a solution which I'd like to share with
the community. It is available under the BSD license on the OpenCores
website: http://opencores.org/project,lateq . The paper with a detailed
description is available on arXiv.org: http://arxiv.org/abs/1509.08111 .

I'll appreciate any comments. I hope that the proposed method will be
useful for others.

With best regards,
Wojtek

Article: 158247
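For readers who have not hit this problem: the manual workaround that such
a tool automates is a per-path delay line whose length must be retuned
every time a parallel path's latency changes. A hedged VHDL sketch; the
generic and port names are illustrative, and this is not code from the
lateq package:

    library ieee;
    use ieee.std_logic_1164.all;

    -- Pads the shorter of two parallel pipelines by LAT clock cycles so
    -- both results reach the merging block in the same cycle.
    entity eq_delay is
      generic (LAT : natural  := 3;    -- latency difference to compensate
               W   : positive := 16);
      port (clk : in  std_logic;
            d   : in  std_logic_vector(W-1 downto 0);
            q   : out std_logic_vector(W-1 downto 0));
    end entity;

    architecture rtl of eq_delay is
      type pipe_t is array (0 to LAT) of std_logic_vector(W-1 downto 0);
      signal pipe : pipe_t;
    begin
      pipe(0) <= d;
      process (clk)
      begin
        if rising_edge(clk) then
          pipe(1 to LAT) <= pipe(0 to LAT-1);  -- LAT register stages
        end if;
      end process;
      q <= pipe(LAT);
    end architecture;

When a tree block gains or loses a level, only the LAT value at the
instantiation changes; the pain described above is keeping dozens of such
values consistent by hand across every parallel path.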
wzab01@gmail.com wrote:
> Lately I have spent a lot of time on the development of quite complex
> high-speed data processing systems in FPGA. They all had pipeline
> architectures, and data were processed in parallel in multiple pipelines
> with different latencies.
>
> The worst thing was that those latencies kept changing during
> development. For example, some operations were performed by blocks with
> a tree structure, so the number of levels depended on the number of
> inputs handled by each node. The number of inputs in each node was
> varied to find an acceptable balance between the number of levels and
> the maximum clock speed. I also had to add some pipeline registers to
> improve timing.

I have heard that some synthesis software now knows how to move around
pipeline registers to optimize timing. I haven't tried using the feature
yet, though.

I think it can move registers, but maybe not add them. You might need
enough registers in place for it to move them around.

I used to work on systolic arrays, which are really just very long
(hundreds or thousands of stages) pipelines. It is pretty hard to hand
optimize them that long.

-- glen

Article: 158248
On Tuesday, September 29, 2015 at 07:49:09 UTC+1, glen herrmannsfeldt wrote:
> I have heard that some synthesis software now knows how to move around
> pipeline registers to optimize timing. I haven't tried using the feature
> yet, though.
>
> I think it can move registers, but maybe not add them. You might need
> enough registers in place for it to move them around.
>
> I used to work on systolic arrays, which are really just very long
> (hundreds or thousands of stages) pipelines. It is pretty hard to hand
> optimize them that long.

Yes, of course the pipeline registers may be moved (e.g. using the
"retiming" feature). I usually keep this option switched on for
implementation.

My method only ensures that the number of pipeline stages is the same in
all parallel paths. And keeping track of that was really a huge problem in
bigger designs.

--
Wojtek

Article: 158249
> On Tuesday, September 29, 2015 at 07:49:09 UTC+1, glen herrmannsfeldt
> wrote:
>> I have heard that some synthesis software now knows how to move around
>> pipeline registers to optimize timing. I haven't tried using the
>> feature yet, though.
>>
>> I think it can move registers, but maybe not add them. You might need
>> enough registers in place for it to move them around.
>
> Yes, of course the pipeline registers may be moved (e.g. using the
> "retiming" feature). I usually keep this option switched on for
> implementation.
>
> My method only ensures that the number of pipeline stages is the same in
> all parallel paths. And keeping track of that was really a huge problem
> in bigger designs.
> --
> Wojtek

Not sure why you expect the tool to do what you should do, and to do so for
the simulation tool as well. How can you simulate a design where synthesis
will insert registers for you?

Kaz
---------------------------------------
Posted through http://www.FPGARelated.com