John the best is to design to never reset !Article: 151951
>John >the best is to design to never reset ! > You can create a design that will work with no resets at all. The problem is that the verification suite will take a few eons to finish. John --------------------------------------- Posted through http://www.FPGARelated.comArticle: 151952
I am trying to map, place & route a large design on a Xilinx Virtex 6 FPGA.

Target Device  : xc6vlx550t
Target Package : ff1759
Target Speed   : -2

My mapping process fails with the following errors:

ERROR:Pack:2310 - Too many comps of type "DSP48E1" found to fit this device.
ERROR:Pack:2860 - The number of logical carry chain blocks exceeds the capacity for the target device. This design requires 100940 slices but only has 85920 slices available that allow carry chains.
ERROR:Map:237 - The design is too large to fit the device. Please check the Design Summary section to see which resource requirement for your design exceeds the resources available in the device. Note that the number of slices reported may not be reflected accurately as their packing might not have been completed.

When I inspect the Mapping report file, I see:

Interim Summary
---------------
Slice Logic Utilization:
  Number of Slice Registers:                  460,088 out of 687,360   66%
    Number used as Flip Flops:                399,848
    Number used as Latches:                         0
    Number used as Latch-thrus:                     0
    Number used as AND/OR logics:              60,240
  Number of Slice LUTs:                       388,284 out of 343,680  112% (OVERMAPPED)
    Number used as logic:                     384,856 out of 343,680  111% (OVERMAPPED)
      Number using O6 output only:            311,180
      Number using O5 output only:             10,716
      Number using O5 and O6:                  62,960
      Number used as ROM:                           0
    Number used as Memory:                        114 out of  99,200    1%
      Number used as Dual Port RAM:                 0
      Number used as Single Port RAM:               0
      Number used as Shift Register:              114
        Number using O6 output only:              114
        Number using O5 output only:                0
        Number using O5 and O6:                     0
    Number used exclusively as route-thrus:     3,314
      Number with same-slice register load:         0
      Number with same-slice carry load:        3,313
      Number with other load:                       1

Slice Logic Distribution:
  Number of LUT Flip Flop pairs used:         584,470
    Number with an unused Flip Flop:          125,987 out of 584,470   21%
    Number with an unused LUT:                196,186 out of 584,470   33%
    Number of fully used LUT-FF pairs:        262,297 out of 584,470   44%
  Number of unique control sets:                  233
  Number of slice register sites lost
    to control set restrictions:                  854 out of 687,360    1%

Also,
  Number of DSP48E1s:                           4,800 out of     864  555% (OVERMAPPED)

-----------------------------------------------------------------------

I did a quick calculation on design resource usage such as LUTs versus DSP48E1s from the Xilinx Coregen GUI:
1. Multiplier1 uses 86 LUTs vs 1 DSP48E1. The design uses Multiplier1 x96. So I am looking at either 96 DSP48E1s or 8256 LUTs.
2. Multiplier2 uses 142 LUTs vs 1 DSP48E1. The design uses Multiplier2 x4704. So I am looking at either 4704 DSP48E1s or 667968 LUTs.

I tried different options to synthesize my design using LUTs and using DSPs. Before I partition my design, I just wanted to check with everyone here, on how the multipliers can optimize the usage of DSP48Es vs LUTs. The current mapping report indicates all the multipliers were mapped using DSPs, hence 4800 DSPs.
1. How can the XST tool or the mapping partition the usage of the multipliers using both DSPs and slice logic? Is this possible with some constraint?
2. The multiplier cores are currently set for Area optimization vs Speed optimization and I have used "use Mults" option. If I set "use LUTs" option, will the XST and Mapping process partition the multiplier usage between LUTs and DSPs?

Thanks in advance !!!Article: 151953
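For the first question above, here is a minimal sketch of how the split can be steered per multiplier from the HDL using the XST USE_DSP48 synthesis attribute (check your XST version's constraints guide for the exact spelling and legal values; older families use MULT_STYLE instead, and for CoreGen multiplier cores the per-core "Use Mults"/"Use LUTs" construction option already mentioned plays the same role). The entity and names below are made up for illustration; only the instances carrying the attribute are forced into slice logic, so the rest remain free to use DSP48E1s.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mult_in_fabric is
  port (
    clk : in  std_logic;
    a   : in  unsigned(17 downto 0);
    b   : in  unsigned(17 downto 0);
    p   : out unsigned(35 downto 0));
end entity;

architecture rtl of mult_in_fabric is
  signal p_reg : unsigned(35 downto 0) := (others => '0');
  -- Ask XST to build this particular multiplier out of slice logic,
  -- leaving the DSP48E1s for the instances that really need them.
  attribute use_dsp48 : string;
  attribute use_dsp48 of p_reg : signal is "no";
begin
  process (clk)
  begin
    if rising_edge(clk) then
      p_reg <= a * b;   -- inferred multiplier
    end if;
  end process;
  p <= p_reg;
end architecture;

Applying the attribute (or the equivalent CoreGen option) to a selected subset of the multiplier instances is one way to spread the 4800 multipliers across both DSP48E1s and slice logic instead of having MAP pack them all into DSPs.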
John so include it in the design and go for eons !Article: 151954
jt_eaton <z3qmtr45@n_o_s_p_a_m.n_o_s_p_a_m.gmail.com> wrote: (snip) > You can create a design that will work with no resets at all. > The problem is that the verification suite will take a few > eons to finish. Most FPGA do an asynchronous reset on all FF at the end of configuration. I don't believe that is optional. -- glenArticle: 151955
On Jun 13, 8:21 pm, "jt_eaton" <z3qmtr45@n_o_s_p_a_m.n_o_s_p_a_m.gmail.com> wrote: > >On Jun 12, 7:22 pm, "jt_eaton" > ><z3qmtr45@n_o_s_p_a_m.n_o_s_p_a_m.gmail.com> wrote: > > >Thanks for that pointer. I have always been a believer in using the >async reset and now I see that this may not always be the best way to >reset a design. But the devil is in the details. I wonder if this >still applies to non-Xilinx designs? > > >Rick > > It applies to all designs. Designers who started their careers with > asynchronous logic carried it with them when Design for Synthesis and > synchronous design became a requirement but it has never been the best > choice. Many designers make the mistake of thinking that because they need > an asynchronous reset system that they must design it using asynchronous > logic. That is simply not true. We design synchronous systems that are > black box equivalent to asynchronous systems all the time. The main thing > that you need to realize about reset system design is that the purpose of > the reset system is not to reset the system when a trigger event occurs. > Its purpose is to NOT reset the system when a trigger event is NOT > occurring. > > The same is true for airbag controllers. The job of an airbag controller is > not to deploy the bag when the car is in an accident, its job is to not > deploy the bag when the car is not having an accident. Any system where the > expected number of uses is small and the effects of the usage is large will > follow this rule. > > Remember the 1st StarWars movie? They built DeathStar with an emergency > exhaust port that provided a direct path from the reactor core to the > surface. It was ray shielded but could not be particle shielded. Bad plan. > > An asynchronous reset has a direct path from a pad into every flip-flop in > the entire chip. It is analog shielded but not digitally shielded. Bad plan. > > Resets in a real product (not a simulation) are really rare events. If a > reset is delayed by 20 microseconds then nobody will notice. If a product > that you are using suddenly resets itself then you will likely notice. > Spend a few hundred cycles on a digital filter before you do something > drastic. > > John Eaton > > --------------------------------------- > Posted through http://www.FPGARelated.com Interesting philosophy. RickArticle: 151956
On Jun 14, 1:40 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote: > jt_eaton <z3qmtr45@n_o_s_p_a_m.n_o_s_p_a_m.gmail.com> wrote: > > (snip) > > > You can create a design that will work with no resets at all. > > The problem is that the verification suite will take a few > > eons to finish. > > Most FPGA do an asynchronous reset on all FF at the end of > configuration. I don't believe that is optional. > > -- glen I believe that is optional for any given FF. The GSR has to be enabled on each FF and that is the point of the white paper. In Xilinx devices using the GSR uses one of the set/reset inputs on a FF as an async input, which also configures the other input as async IIRC. The tools are capable of using the Set and Reset inputs as synchronous inputs to reduce the LUT usage and improve the speed of a design... in some cases. As to the philosophical avoidance of async resets, I can't say I share that belief. As you point out, there is one async reset on the chip that you can't eliminate, the PROGRAM pin. Even if it doesn't reset the FFs, it will stop the design from working and reload all the LUTs and memory. It has been a long time since I used a Xilinx part, so I may not remember them 100% correctly. RickArticle: 151957
Lots of interesting advice here! In particular I read the Xilinx whitepaper with interest. Unfortunately, a lot of the advice seemed to be inapplicable to my problem. I can't look for the individual submodule that's taking up most of the area, because my application is a single long pipeline with a large number of very similar stages: the area isn't taken up by any one stage, but more by the number of stages. And because the design is a pipeline with general logic (mostly bitwise, plus a small bit of basic arithmetic) between registers, I don't really see any opportunities for special primitives like SRLs, DSPs, or the like that would reduce area. I can probably solve my problem by building a smaller pipeline and reusing it; I preferred not to do that as it will decrease system performance but it looks like I don't have much choice now. Thanks anyway! ChrisArticle: 151958
>The data arrives with some unknown phase shift relatively to system >(synchronized to SDRAM) clock. DQ can be captured more reliably if we >route the data clock, DQS, along the data. They suggest that it is easy >to transport the received data bursts into the system clock domain using >a FIFO afterwards. This is great. I just see a one small problem: > > How do you know that the read operation takes place so that > the captured data are valid for submission into FIFO? > > >A READ_EN signal must be delivered from the SDRAM write/command part >(CLK domain) into asynchronously running receiver in DQS domain (the >period is the same but phase is unknown) with one DQS clock precision. >Remember that we run away from strobing DQ by CLK phases because we do >not know the data arriving phase relatively to CLK. That is why we >introduced the DQS. But now, we still must figure out the phase shift. >It looks like our attempt to do without the phase difference has failed. > >Why people still use DQS for strobing data instead of some CLK-derived >phase? > Some DDR2 SDRAM controllers require a feedback clock input, being their output clock via a loop of track that goes the same distance as to the SDRAM and back. Others go through a training phase where they work out the "time-of-flight" from the controller to the SDRAM and back. Either works well enough. If your FPGA is from Xilinx, use their MIG tool to generate the controller. --------------------------------------- Posted through http://www.FPGARelated.comArticle: 151959
Something to remember about Xilinx FPGAs, at least when designing in VHDL and synthesizing with XST, is that you can specify the initial value of registered signals (when declaring the signal in the declarative part of the architecture). This is sometimes considered bad practice (bad coding style) in other contexts, and may not be supported by other tool flows. --------------------------------------- Posted through http://www.FPGARelated.comArticle: 151960
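A minimal VHDL sketch of that initial-value style (the power-up value is loaded from the bitstream via GSR at the end of configuration, so no reset input is needed; the entity and names are illustrative, and as noted above other tool flows may ignore the initializer):

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity init_example is
  port (
    clk : in  std_logic;
    q   : out unsigned(7 downto 0));
end entity;

architecture rtl of init_example is
  -- The := value becomes the power-up state of the register.
  signal count : unsigned(7 downto 0) := x"80";
begin
  process (clk)
  begin
    if rising_edge(clk) then
      count <= count + 1;
    end if;
  end process;
  q <= count;
end architecture;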
"RCIngham" <robert.ingham@n_o_s_p_a_m.n_o_s_p_a_m.gmail.com> wrote in message news:nYadnfl5gelSNWXQnZ2dnUVZ_tOdnZ2d@giganews.com... > >The data arrives with some unknown phase shift relatively to system >>(synchronized to SDRAM) clock. DQ can be captured more reliably if we >>route the data clock, DQS, along the data. They suggest that it is easy >>to transport the received data bursts into the system clock domain using >>a FIFO afterwards. This is great. I just see a one small problem: >> >> How do you know that the read operation takes place so that >> the captured data are valid for submission into FIFO? >> >> >>A READ_EN signal must be delivered from the SDRAM write/command part >>(CLK domain) into asynchronously running receiver in DQS domain (the >>period is the same but phase is unknown) with one DQS clock precision. >>Remember that we run away from strobing DQ by CLK phases because we do >>not know the data arriving phase relatively to CLK. That is why we >>introduced the DQS. But now, we still must figure out the phase shift. >>It looks like our attempt to do without the phase difference has failed. >> >>Why people still use DQS for strobing data instead of some CLK-derived >>phase? >> > > Some DDR2 SDRAM controllers require a feedback clok input, being their > output clock via a loop of track that goes the same distance as to the > SDRAM and back. Others go through a training phase where they work out the > "time-of-flight" from the controller to the SDRAM and back. Either works > well enough. If your FPGA is from Xilinx, use their MIG tool to generate > the controller. I think the point was:If you dont know the timing between outclk and inclk (or dqs) - It could be >1clk in theory - how do you know when data is valid on a read? I guess you can't trust DQS as it is floating when not active.. You just need to assume there is <1clk delay (and I think that is specified in the std). Imho, dq's should be single direction and separate for r/w.. Maybe they did that to later DDR standards.Article: 151961
On Jun 15, 1:35 am, Christopher Head <ch...@is.invalid> wrote: > Lots of interesting advice here! In particular I read the Xilinx > whitepaper with interest. Unfortunately, a lot of the advice seemed to > be inapplicable to my problem. I can't look for the individual > submodule that's taking up most of the area, because my application is > a single long pipeline with a large number of very similar stages: the > area isn't taken up by any one stage, but more by the number of stages. > And because the design is a pipeline with general logic (mostly > bitwise, plus a small bit of basic arithmetic) between registers, I > don't really see any opportunities for special primitives like SRLs, > DSPs, or the like that would reduce area. I can probably solve my > problem by building a smaller pipeline and reusing it; I preferred not > to do that as it will decrease system performance but it looks like I > don't have much choice now. > > Thanks anyway! > Chris I would start by saying that the biggest opportunities for savings are almost always by starting at the algorithm level. You'll only get so far by playing with implementation. One suggestion might be to look for places where you could do 'double clocking' - ie generate a 2x clock with the DCM and run a particular piece of logic twice per cycle, muxing the inputs and distributing the outputs. We have some designs that were multiplier limited, so we used this trick as our main pipeline was slow enough to use one multiplier to do double duty per pipeline stage. Some other tricks - use multipliers as shifters if you have them spare (sketched below). See if you can rejigger your pipeline stages. Some of the older parts (Virtex-2 or so) have dedicated BUFT primitives that you can use to reduce the number of logic elements in multiplexers. Look at and understand the logic usage reports from the synthesizer. If a module gets generated with more f/fs than you think it should, it's good to dig in and figure out what got generated. For XST there is a tool or option that will show a schematic of synthesized logic, this can be handy.Article: 151962
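A minimal sketch of the multiplier-as-shifter tip (assuming a DSP48 is spare; the entity and port names are made up): shifting left by N is just multiplying by the one-hot value 2**N, and the one-hot decode is much cheaper in LUTs than a full barrel shifter.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity dsp_shifter is
  port (
    clk   : in  std_logic;
    din   : in  unsigned(15 downto 0);
    shamt : in  unsigned(3 downto 0);    -- shift amount 0..15
    dout  : out unsigned(31 downto 0));
end entity;

architecture rtl of dsp_shifter is
  signal pow2 : unsigned(15 downto 0);
  signal prod : unsigned(31 downto 0) := (others => '0');
begin
  -- cheap one-hot decode of the shift amount in fabric
  pow2 <= shift_left(to_unsigned(1, pow2'length), to_integer(shamt));

  process (clk)
  begin
    if rising_edge(clk) then
      -- the actual shift happens in the multiplier, which can map to a DSP48
      prod <= din * pow2;
    end if;
  end process;

  dout <= prod;
end architecture;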
I have a design partitioned over 2 FPGAs. I am trying to determine the benefits of selecting GTX links vs. LVDS to transfer the data between FPGAs.

Target Device  : xc6vlx550t
Target Package : ff1759
Target Speed   : -2

Latency calculations:
1. GTX interface: The GTX transceiver is configured at 106.25 MHz with 20 bits input. This means the bits are transmitted at bit-rate = 20*106.25 MHz = 2.125 Gbps.
   # of bits to be transferred = 1728
   Latency of this interface = 1/(80% of bit-rate * (20/16) * (# of bits transferred/16)) = 1/(2.295e+11) = 4.35e-12 seconds

2. LVDS+Aurora: The Aurora interface is configured at 600 MHz (6 Gbps) with lane width as 2 bytes.
   Latency of this interface = 1/(80% of clock rate * (# of bits transferred/16)) = 1/(5.184e+10) = 19.29e-12 seconds

Is this calculation correct? My assumption for the LVDS calculation is that Aurora does not up-sample the clock frequency by 20 for transmitting data.

Thanks in advance for all the feedback.Article: 151963
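A quick units check on the two formulas above (my own back-of-the-envelope, not a property of either core, assuming 8b/10b coding and ignoring the transceiver and protocol pipeline): dividing 1 by (a rate times a bit count) does not give a time, which is why both results land in the picosecond range - 4.35e-12 s is less than a single bit time at 2.125 Gbps (about 470 ps). Treating the transfer time as payload bits divided by payload rate instead gives:

  GTX, 2.125 Gbps line rate:   payload rate ~ 2.125 Gb/s * 16/20 = 1.7 Gb/s
                               1728 bits / 1.7 Gb/s ~ 1.02 us
  Aurora, 6 Gbps line rate:    payload rate ~ 6 Gb/s * 16/20 = 4.8 Gb/s
                               1728 bits / 4.8 Gb/s ~ 0.36 us

Both figures are serialization time only; the cores' own latency (see the measured Aurora number later in the thread) comes on top.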
> Some DDR2 SDRAM controllers require a feedback clock input, being their > output clock via a loop of track that goes the same distance as to the > SDRAM and back. Others go through a training phase where they work out the > "time-of-flight" from the controller to the SDRAM and back. Either works > well enough. I do believe that this works very well. I just want to know one thing: how does all this stuff help to strobe nothing but valid data bits? > If your FPGA is from Xilinx, use their MIG tool to generate > the controller. My board is http://www.xilinx.com/univ/XUPV2P, routed for the Xilinx http://www.xilinx.com/support/documentation/ip_documentation/plb_ddr.pdf memory controller. It involves the on-board clock feedback trace, which matches the FPGA-to-SDRAM trace length. Can you explain the advantage of this design in the 7.05.2011 topic "Why feedback clock in SDRAM controllers?" There are two problems with using the EDK controller: 1. The CoreGen of ISE10.1 (latest for XCv2p) does not include the memory generator, and 2. plb_ddr.pdf says: "Due to the variation in board layout, the DDR clock and the DDR data relationship can vary. Therefore, the designer should analyze the time delays of the system and set all of the attributes of the phase shift controls of the DCM as needed to insure stable clocking of the DDR data." I just do not understand how to measure these timings and, in the first place, why do we need this DQS if the phase shift with respect to the system clock still must be adjusted manually? Why not strobe DQ with this manually adjusted system clock phase right away?Article: 151964
With DDR memory you would use some sort of calibration scheme so that the data coming from the memory was calibrated to the clock inside the FPGA. This usually consists of writing a 1010 pattern into the memory and then reading it back and using an IO delay inside the FPGA to alter the relationship between the data and internal clock. Jon --------------------------------------- Posted through http://www.FPGARelated.comArticle: 151965
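A behavioral sketch of that idea, just to make the flow concrete (the IODELAY primitive, the write/read of the training pattern, and the comparison logic are assumed to live elsewhere; the port and signal names are made up). The FSM sweeps every delay tap, records the window of taps where the pattern read back correctly, and then parks in the middle of that window:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity read_cal is
  port (
    clk        : in  std_logic;
    start      : in  std_logic;              -- kick off calibration
    check_done : in  std_logic;              -- checker finished one pass at this tap
    pattern_ok : in  std_logic;              -- '1' if the training read matched
    tap        : out unsigned(4 downto 0);   -- drives the IDELAY tap value
    cal_done   : out std_logic);
end entity;

architecture rtl of read_cal is
  type state_t is (idle, test_tap, advance, finish);
  signal state    : state_t := idle;
  signal tap_i    : unsigned(4 downto 0) := (others => '0');
  signal first_ok : unsigned(4 downto 0) := (others => '0');
  signal last_ok  : unsigned(4 downto 0) := (others => '0');
  signal seen_ok  : std_logic := '0';
  signal done_i   : std_logic := '0';
begin
  process (clk)
  begin
    if rising_edge(clk) then
      case state is
        when idle =>
          if start = '1' then
            tap_i   <= (others => '0');
            seen_ok <= '0';
            done_i  <= '0';
            state   <= test_tap;
          end if;
        when test_tap =>
          -- assumes the checker reruns the 1010 write/read and pulses
          -- check_done once for every new tap setting
          if check_done = '1' then
            if pattern_ok = '1' then
              if seen_ok = '0' then
                first_ok <= tap_i;
              end if;
              last_ok <= tap_i;
              seen_ok <= '1';
            end if;
            state <= advance;
          end if;
        when advance =>
          if tap_i = 31 then
            state <= finish;
          else
            tap_i <= tap_i + 1;
            state <= test_tap;
          end if;
        when finish =>
          -- park in the middle of the passing window
          -- (if no tap passed, this just parks at 0; real code would flag an error)
          tap_i  <= first_ok + shift_right(last_ok - first_ok, 1);
          done_i <= '1';
          state  <= idle;
      end case;
    end if;
  end process;

  tap      <= tap_i;
  cal_done <= done_i;
end architecture;

Per-bit deskew would repeat the sweep per DQ pin; MIG-generated controllers do something along these lines automatically during their start-up calibration.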
>I have a design partitioned over 2 FPGAs. I am trying to determine the benefits of selecting GTX links vs. LVDS to transfer the data between FPGAs. > >Target Device : xc6vlx550t >Target Package : ff1759 >Target Speed : -2 > >Latency calculations: >1. GTX interface: The GTX transceiver is configured at 106.25 MHz with 20 bits input. This means the bits are transmitted at bit-rate = 20*106.25 MHz = 2.125 Gbps. ># of bits to be transferred = 1728 >Latency of this interface = 1/(80% of bit-rate * (20/16)*(# of bits transferred/16)) = 1/(2.295e+11) = 4.35e-12 seconds > >2. LVDS+Aurora: The Aurora interface is configured at 600MHz (6 Gbps) with lane width as 2 bytes. > >Latency of this interface = 1/(80% of clock rate * (# of bits transferred/16)) = 1/(5.184e+10) = 19.29e-12 seconds > > >Is this calculation correct? My assumption for the LVDS calculation is that Aurora does not up-sample the clock frequency by 20 for transmitting data. > >Thanks in advance for all the feedback. > Generate both lots of IP. Write a testbench with both instantiated. Simulate. --------------------------------------- Posted through http://www.FPGARelated.comArticle: 151966
[snipped] > >My board is http://www.xilinx.com/univ/XUPV2P, routed for Xilinx >http://www.xilinx.com/support/documentation/ip_documentation/plb_ddr.pdf >memory controller >It involves the on-board clock feedback trace, which matches the >FPGA-to-SDRAM trace length. Can you explain the advantage of this design >in 7.05.2011 topic "Why feedback clock in SDRAM controllers?" > [snipped] Oh the old Virtex-2PRO stuff. Bad luck! It all works lovely on Virtex-4 and Virtex-5 with recent ISE and CoreGen. --------------------------------------- Posted through http://www.FPGARelated.comArticle: 151967
In your explanation, only one thing is missing: DQS. Why do we need data if we still need to calibrate "memory to the clock"? One could calibrate DQ directly "to the clock inside FPGA".Article: 151968
Why do we need _DQS_, I mean. Thank you for the appreciation.Article: 151969
On 15.06.2011 18:48, maxascent wrote: > With DDR memory you would use some sort of calibration scheme so that the > data coming from the memory was calibrated to the clock inside the FPGA. > This usually consists of writing a 1010 pattern into the memory and then > reading it back and using an IO delay inside the FPGA to alter the > relationship between the data and internal clock. BTW, why does the static installation of FPGA-SDRAM on a single board need dynamic calibration? 1010 is produced by DQS. Do you mean that duplication is needed because all DQ bits, in one DQS group, must be treated separately?Article: 151970
AMDyer@gmail.com <amdyer@gmail.com> wrote: (snip) > I would start by saying that the biggest opportunities for savings are > almost always by starting at the algorithm level. You'll only get so > far by playing with implementation. > one suggestion might be to look for places where you could do 'double > clocking' - ie generate a 2x clock with the DCM and run a particular > piece of logic twice per cycle, muxing the inputs and distributing the > outputs. We have some designs that were multiplier limited, so we > used this trick as our main pipeline was slow enough to use one > multiplier to do double duty per pipeline stage. For systolic arrays, which I will guess that the OP is working on, that often doesn't help. You could speed up the whole thing by a factor of two, though. -- glenArticle: 151971
>Something to remember about Xilinx FPGAs, at least when designing in VHDL >and synthesizing with XST, is that you can specify the initial value of >registered signals (when declaring the signal in the declarative part of >the architecture). This is sometimes considered bad practice (bad coding >style) in other contexts, and may not be supported by other tool flows. > > >--------------------------------------- >Posted through http://www.FPGARelated.com > I really like the fact that you can initialize rams as well. You no longer need to think in terms of rams or roms, you have a universal read/writable rom for everything. Need a screen buffer for your display? Create a startup screen image file and have that loaded as well. Need some boot/test code. Load it in at startup and then reuse that memory later. This stuff is great!! John --------------------------------------- Posted through http://www.FPGARelated.comArticle: 151972
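A minimal sketch of an initialized, still-writable inferred RAM in that style (VHDL, XST targets; the entity, contents, and names are illustrative, and the initial contents could just as easily come from a script-generated constant or function):

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity boot_ram is
  port (
    clk  : in  std_logic;
    we   : in  std_logic;
    addr : in  unsigned(7 downto 0);
    din  : in  std_logic_vector(7 downto 0);
    dout : out std_logic_vector(7 downto 0));
end entity;

architecture rtl of boot_ram is
  type ram_t is array (0 to 255) of std_logic_vector(7 downto 0);
  -- Build the power-up contents; these travel in the bitstream.
  function init_ram return ram_t is
    variable r : ram_t := (others => (others => '0'));
  begin
    r(0) := x"DE";  -- e.g. first bytes of boot/test code or a splash image
    r(1) := x"AD";
    return r;
  end function;
  signal ram : ram_t := init_ram;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        ram(to_integer(addr)) <= din;
      end if;
      dout <= ram(to_integer(addr));   -- registered read infers block RAM
    end if;
  end process;
end architecture;

The block behaves like a ROM at power-up but can be overwritten later, which is exactly the reusable boot-code / startup-screen use described above.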
On Jun 15, 2:35 am, Christopher Head <ch...@is.invalid> wrote: > Lots of interesting advice here! In particular I read the Xilinx > whitepaper with interest. Unfortunately, a lot of the advice seemed to > be inapplicable to my problem. I can't look for the individual > submodule that's taking up most of the area, because my application is > a single long pipeline with a large number of very similar stages: the > area isn't taken up by any one stage, but more by the number of stages. > And because the design is a pipeline with general logic (mostly > bitwise, plus a small bit of basic arithmetic) between registers, I > don't really see any opportunities for special primitives like SRLs, > DSPs, or the like that would reduce area. I can probably solve my > problem by building a smaller pipeline and reusing it; I preferred not > to do that as it will decrease system performance but it looks like I > don't have much choice now. > > Thanks anyway! > Chris "General" logic is always ripe for optimization, or maybe I should say, de-unoptimization. If I were you, I would code each stage as a separate module and measure the size to compare to what you think it should be. I have seen many times where the tools took what I thought was pretty straightforward code and blew it up to something ugly. Obviously it was doing what I told it to, but I would have been able to do better than the machine because I understood the logic better. So I had to change my code to indicate how it could be simplified. Don't worry about the special features of a chip. First figure out if the tools did an ok job... RickArticle: 151973
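One way to get those per-stage numbers without hand-building a separate project per module is to keep the hierarchy through synthesis so the reports can break utilization down by instance. A minimal sketch using the XST KEEP_HIERARCHY attribute follows; check your version's constraints guide for whether it attaches to the entity or the architecture and for the accepted values, and note that kept hierarchy can cost some cross-boundary optimization. The wrapper below is illustrative only.

library ieee;
use ieee.std_logic_1164.all;

entity stage_wrap is
  port (
    clk : in  std_logic;
    d   : in  std_logic;
    q   : out std_logic);
end entity;

architecture rtl of stage_wrap is
  -- Keep this level of hierarchy through synthesis so its LUT/FF usage
  -- shows up separately in the reports.
  attribute keep_hierarchy : string;
  attribute keep_hierarchy of rtl : architecture is "yes";
  signal q_i : std_logic := '0';
begin
  process (clk)
  begin
    if rising_edge(clk) then
      q_i <= d;
    end if;
  end process;
  q <= q_i;
end architecture;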
> >As to the philosophical avoidance of async resets, I can't say I share >that belief. As you point out, there is one async reset on the chip >that you can't eliminate, the PROGRAM pin. Even if it doesn't reset >the FFs, it will stop the design from working and reload all the LUTs >and memory. > >Rick > You can't avoid 100% of all async reset flops but you can easily do the 99.999% where sync will give you a smaller, faster design and your design is still a black box equivalent to using the async reset. With Xilinx parts every flop with an async reset wastes 1 LUT over a sync reset. In ASIC design every async reset flop doubles the number of endpoints needing timing closure from 1 to 2. If you do a really lousy job in designing your reset distribution then these async paths could become critical paths and start taking routing resources away from your other more important paths. Async resets on flops are nothing but trouble. John --------------------------------------- Posted through http://www.FPGARelated.comArticle: 151974
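A minimal sketch of the "digital filter before you do something drastic" idea from earlier in the thread: the raw reset request is synchronized and debounced, and only the filtered, fully synchronous version fans out to the rest of the design. The 256-cycle threshold and the names are arbitrary, and the power-up values lean on the configuration-time initialization discussed above.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity reset_filter is
  port (
    clk     : in  std_logic;
    ext_rst : in  std_logic;    -- raw reset request from a pad, active high
    rst_out : out std_logic);   -- clean, fully synchronous reset for the core
end entity;

architecture rtl of reset_filter is
  signal meta, sync : std_logic := '1';
  signal stable_cnt : unsigned(7 downto 0) := (others => '0');
  signal rst_i      : std_logic := '1';   -- power-up value from configuration
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- two-flop synchronizer on the request itself
      meta <= ext_rst;
      sync <= meta;
      -- digital filter: only change the internal reset after the request has
      -- been stable for 256 cycles, so a glitch on the pad never resets the chip
      if sync = rst_i then
        stable_cnt <= (others => '0');
      elsif stable_cnt = 255 then
        rst_i      <= sync;
        stable_cnt <= (others => '0');
      else
        stable_cnt <= stable_cnt + 1;
      end if;
    end if;
  end process;
  rst_out <= rst_i;
end architecture;

rst_out is then used as an ordinary synchronous reset term in the rest of the design, so only these few flops ever see the pad directly.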
Vivek, I've recently determined the latency of Aurora in my design by running simulation. It's V6, 250 MHz, 20 bit, no framing. I've got 340 ns. If there is clock compensation, it periodically inserts a symbol and adds an additional clock cycle. Thanks, Evgeni ======================== http://outputlogic.com