Messages from 48375

Article: 48375
Subject: Re: Xilinx microblaze vs. picoblaze
From: Ray Andraka <ray@andraka.com>
Date: Wed, 16 Oct 2002 20:24:28 GMT

As I recall, you were getting speeds of ~135 MHz in V2.  You should be able to get a
fully pipelined processor using BRAMs up to 200 MHz or so without any big problems.
In VirtexII, the carry chains will be the limiting factor, not the BRAM if you do it
right.  In Virtex and VirtexE, you can double the width of the BRAM and then use
registers to assemble consecutive accesses.  It does get a bit messy in that case
because it introduces pipeline misses.
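
A toy model can show where such pipeline misses might come from. The Python sketch below rests on assumptions that are not from this thread: the double-width BRAM returns an aligned pair of words per access, the following pair is always prefetched, and any access outside the current or prefetched pair counts as one miss. The name `count_misses` and the hit/miss rule are hypothetical.

```python
def count_misses(addresses):
    """Count accesses that fall outside the current or prefetched word pair.

    Assumed model (illustrative only): a double-width memory delivers the
    aligned pair containing the address, and the next sequential pair is
    prefetched while the current one is being consumed.
    """
    misses = 0
    prev_pair = None
    for addr in addresses:
        pair = addr // 2  # index of the aligned two-word pair
        if prev_pair is not None and pair not in (prev_pair, prev_pair + 1):
            misses += 1  # jump to a pair that was neither held nor prefetched
        prev_pair = pair
    return misses


if __name__ == "__main__":
    print(count_misses(list(range(8))))            # straight-line access
    print(count_misses([0, 1, 2, 3, 10, 11, 12]))  # one jump in the stream
```

In this model a purely sequential stream never misses, while each jump to a distant address costs one miss, which is the messiness referred to above.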


Goran Bilski wrote:

> Hi,
>
> I agree that you can if you also double the clock frequency of the pipeline,
> creating parts of the normal clock.
> What I meant was keeping the same clock and just adding more pipestages.
>
> I have finally got the idea of multithreading, but it is not as easy to implement
> since you need to find a good middle point in each
> pipestage that can divide the pipestage into equal parts.
>
> The control path also needs to be split into subparts, and you also need to find
> good points to break it up.
> The processor also definitely needs a cache which can run at double
> speed, or which has more ports, in order to get the data for each thread.
> I think that would be the largest obstacle for multithreading MicroBlaze: the
> number of ports to the BRAM is finite (2), and in my implementation the BRAM is
> almost in the critical path already.
>
> Göran Bilski
>
> "Nicholas C. Weaver" wrote:
>
> > In article <3DADAAB7.E96D53F2@Xilinx.com>,
> > Goran Bilski  <Goran.Bilski@Xilinx.com> wrote:
> >
> > >You can't just double the number of pipestage for a processor without
> > >major impacts.  For streaming pipeline which hardware pipelines are I
> > >agree but for processor that can't be done.
> >
> > Uhh, yes it can.
> >
> > Double all the pipeline stages, double the register file, rebalance
> > the delays now that you have more pipelining, and out drops a 2-thread
> > multithreaded architecture.  Each single thread now runs slower, but
> > aggregate throughput (sum of the two threads) is increased.
> >
> > It is so obvious yet unintuitive that nobody has actually DONE it
> > before.  :)
> > --
> > Nicholas C. Weaver                                 nweaver@cs.berkeley.edu

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930     Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

 "They that give up essential liberty to obtain a little
  temporary safety deserve neither liberty nor safety."
                                          -Benjamin Franklin, 1759

Article: 48376
Subject: Re: Xilinx microblaze vs. picoblaze
From: Ray Andraka <ray@andraka.com>
Date: Wed, 16 Oct 2002 20:25:45 GMT

The SRL16's make the permanent state really easy to store too.
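
As a minimal sketch of the thread-indexed permanent state that Ken describes in the quoted text, here is a small Python model. It is illustrative only: the names `ThreadedState`, `run` and `stall_cycles`, the one-instruction toy program format, and the strict round-robin issue order are all assumptions, not anything taken from MicroBlaze or from an actual design.

```python
class ThreadedState:
    """Permanent state (PC, registers) widened by a thread-id index."""

    def __init__(self, n_threads, n_regs=4):
        # One register bank and one PC per thread; the thread id acts as
        # the extra address bits into the larger memory.
        self.regs = [[0] * n_regs for _ in range(n_threads)]
        self.pc = [0] * n_threads


def run(program, n_threads, cycles):
    """Issue one instruction per cycle, rotating through the threads."""
    state = ThreadedState(n_threads)
    issue_log = []
    for cycle in range(cycles):
        tid = cycle % n_threads  # round-robin thread select
        op, dst, val = program[state.pc[tid] % len(program)]
        if op == "addi":
            state.regs[tid][dst] += val
        state.pc[tid] += 1
        issue_log.append(tid)
    return state, issue_log


def stall_cycles(n_threads, feedback_latency):
    """Stalls per dependent instruction in this simplified model.

    Consecutive instructions of one thread issue n_threads cycles apart,
    so a result that needs feedback_latency cycles is ready in time once
    n_threads is at least that large.
    """
    return max(0, feedback_latency - n_threads)


if __name__ == "__main__":
    state, log = run([("addi", 0, 1)], n_threads=2, cycles=6)
    print(log, state.regs)
    print(stall_cycles(1, 3), stall_cycles(4, 3))
```

The only per-thread storage here is the register banks and the PCs; everything else in the loop is shared, which is the point being made about the cost of the extra state.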

Ken McElvain wrote:

> One big advantage of multi-threading is that the pipeline
> interlocks can be eliminated if the number of threads is larger
> than the longest feedback path in the pipeline.  For example,
> a branch instruction does not have to stall waiting for
> conditions from the preceding comparison.  This yields
> some boost in the total performance.
>
> Permanent state such as condition codes and register files has to be
> expanded into larger memories with part of the index being the current
> thread id, but other registers mostly do not have to be modified.  Given
> the distributed ram capabilities in Xilinx parts, this is pretty
> cheap.
>
> The first place I saw this was the CDC 6600 IO processors, which
> I believe ran 16 threads.
>
> - Ken
>
> Goran Bilski wrote:
>
> > Hi,
> >
> > "Nicholas C. Weaver" wrote:
> >
> >
> >>In article <3DAD80F2.DC5AD4C4@Xilinx.com>,
> >>Goran Bilski  <Goran.Bilski@Xilinx.com> wrote:
> >>
> >>>Hi,
> >>>
> >>>Sort of.
> >>>
> >>>The complete decoding and the ALU is around 10-13% of the design.
> >>>The actual instruction decoding is less than 5%.
> >>>
> >>>Making it multithreaded, as I understand it, means having more than one instruction
> >>>stream in the pipeline.
> >>>What is the benefit unless you double the pipeline and have two data pipelines?
> >>>Almost nothing.
> >>>
> >>Uhh, you don't double the pipelines, you take the single pipeline,
> >>double up the registers IN them, and then move the registers to
> >>rebalance all the pipeline stages, as now you have 2x the registers
> >>through any feedback loop, allowing you to up the clock frequency a lot.
> >>
> >>If you do this to every register in the core (and tweak the RF), a
> >>multithreaded design just sort of "drops out" automatically.
> >>
> >>You can even write a tool to do that automatically.
> >>
> >>What happens in the end is you take advantage of the two threads to
> >>up the clock substantially.  Each individual thread is now a little
> >>slower, but the throughput for the 2 threads is now substantially
> >>higher.  You use more pipelining and more power, and you may or may
> >>not end up thrashing the caches, but it does work.
> >>
> >>I can send you a paper submission and a thesis chapter draft on the
> >>subject if you want.
> >>
> >>
> >
> > Please do.
> >
> > If you double all the registers in the data pipeline, haven't you doubled the
> > pipeline?
> > Or is all functionality between the pipestages shared?
> >
> >
> >>>So with two threads in MicroBlaze, to double the pipeline is to
> >>>double the size of MicroBlaze.  You also have to double the
> >>>instruction fetching data throughput in order to get the two streams
> >>>busy.  That would put a big burden on the bus infrastructure and
> >>>external memory interface which suddenly has to double its
> >>>performance.  The doubling of the pipeline and added control handling
> >>>WILL also lower the maximum clock frequency of MicroBlaze.
> >>>
> >>You don't need to double the external memory interface if you share the
> >>cache; this is especially true on workloads where the threads are
> >>related.  The external memory interface is now 2x the CLOCK, but you
> >>could slow it down from there and arbitrate between the two streams of
> >>execution.
> >>
> >
> >>You also probably want to make the feeding of interrupts a little
> >>different, so you can designate one thread as receiving the
> >>interrupts.
> >>
> >>
> >>>Say you suddenly would like to have 5 threads instead of 2. That is a major
> >>>change of the multithreading MicroBlaze and almost impossible to get the
> >>>instruction fetching to keep up. With multiprocessing, just add another 3
> >>>MicroBlazes and you're done.
> >>>
> >>What you do is you have a 1 thread and a 2 thread version (going
> >>beyond 2 threads seems to be less effective, maybe 3 depending on the
> >>architecture).  From the exterior, however, they still look normal.
> >>You can still tile that like any other core to create a multiprocessor
> >>machine.
> >>
> >>
> >>>BUT there is always a catch and that is how you write programs for these
> >>>systems.
> >>>
> >>"one thread for I/O, one thread for processing" does come up in some
> >>cases.
> >>
> >>
> >>>Göran
> >>>
> >>>Hal Murray wrote:
> >>>
> >>>
> >>>>>Another approach is to add multi-threading capabilities but I think that
> >>>>>multi-processing is better for FPGA than multi-threading.
> >>>>>
> >>>>Why?
> >>>>
> >>>>If I understand what multi-threading means, the idea is to interleave
> >>>>alternate cycles of two execution streams in order to reduce the
> >>>>losses due to stalls.
> >>>>
> >>>>It looks like it "just" requires an extra address bit (odd/even cycle)
> >>>>to the register file and the same bit selects between pairs of special
> >>>>registers like the PC.
> >>>>
> >>>>Are you telling me that the ALU and instruction decoding is small enough
> >>>>so that I might just as well build two copies of the whole CPU?
> >>>>
> >>>>--
> >>>>The suespammers.org mail server is located in California.  So are all my
> >>>>other mailboxes.  Please do not send unsolicited bulk e-mail or unsolicited
> >>>>commercial e-mail to my suespammers.org address or any of my other addresses.
> >>>>These are my opinions, not necessarily my employer's.  I hate spam.
> >>>>
> >>--
> >>Nicholas C. Weaver                                 nweaver@cs.berkeley.edu
> >>
> >


Article: 48377
Subject: Re: Why can Xilinx sw be as good as Altera's sw?
From: Ray Andraka <ray@andraka.com>
Date: Wed, 16 Oct 2002 20:31:42 GMT

Good question.  The road is littered with FPGA start ups and even big companies
that tried to get in on the action:  Dynachip, Gatefield, Motorola, TI, AMD,....

rickman wrote:

> Where is the money to start a new FPGA going to come from... ?
>
>
> Rick "rickman" Collins
>
> rick.collins@XYarius.com
> Ignore the reply address. To email me use the above address with the XY
> removed.
>
> Arius - A Signal Processing Solutions Company
> Specializing in DSP and FPGA design      URL http://www.arius.com
> 4 King Ave                               301-682-7772 Voice
> Frederick, MD 21701-3110                 301-682-7666 FAX


Article: 48378
Subject: Re: Xilinx microblaze vs. picoblaze
From: Goran Bilski <Goran.Bilski@Xilinx.com>
Date: Wed, 16 Oct 2002 13:33:39 -0700

Hal Murray wrote:

> > BUT there is always a catch and that is how you write programs
> > for these systems.
>
> Standard programming problem.  People are getting pretty good
> at it.  Yes, there are lots of applications where it doesn't work.
>
> If you can't take advantage of multi-threading then you wouldn't
> be able to use multi-processing either.
>

Is that true?

Don't you need to actually have two threads in order to use multi-threading,
whereas multiprocessor parallelism can be more fine-grained?

For example:
Code where the inner loop has a function call in which some operations take place.

Is it easier to thread that function or just place the function in another
processor?

Isn't how data is moved between two processors/threads more crucial?

How do you actually move data between two threads in the same processor?

Göran


Article: 48379
Subject: Standing on the shores of Stratix-land
From: rrr@ieee.org (Rajeev)
Date: 16 Oct 2002 13:51:56 -0700

Hello all,

I've perused various Xilinx/Altera threads in this
newsgroup with due interest, and would now like to 
invite comments and thoughts on the situation I 
find myself in:

I've done a modest amount of design recently with
Xilinx and am overall comfortable with the design 
flow, assorted support, and achievable performance.
Recently a new application has come up requiring more
horsepower (a PCI accelerator card for some compute-
intensive portions of an imaging application).

To make this fly, I need for three things to come 
together: (1) devices (2) tools (3) a development
board.

In a nutshell the Altera offering is tempting...

(1) Devices: The Stratix prices I'm being offered are
aggressively low.  I'm comparing slow speed grades of
EP1S10/20 with 2V1000/2000 and I get the feeling that 
Stratix has the edge on raw speed and on DSP block 
capability.  Things like distributed memory and SRL 
that are strengths of Virtex-2 don't seem too important 
for this application.

(2) Tools: I went to a Quartus seminar.  Quartus seems 
learnable, no huge leaps for an ISE user.  Overall the
Altera tools cost considerably less too, when you look
at the packages available and the implications of Xilinx'
Time-Based License.  I'm unsure about the level of 
support and bugginess of Quartus, but then reports of
ISE 5.1i aren't exactly flattering.

(3) With some difficulty I have identified suitable
development boards (PCI + enough off-chip memory) for
both Stratix and Virtex-II; as it happens, none of them is
available today.  So that's a wash.

Anyway, I would be eager to hear from other folks re: wisdom
of your experience, or re: pitfalls for the unwary, or if
you've been looking at the same kind of decision...

Thanks,
-rajeev-

Article: 48380
Subject: Re: Xilinx microblaze vs. picoblaze
From: Goran Bilski <Goran.Bilski@Xilinx.com>
Date: Wed, 16 Oct 2002 14:06:22 -0700

Yes, but I also have an embedded multiplier which is already using the registered
output.
I can't add another pipestage in that path since there is nowhere to insert it.
Then I would have to add a special arrangement in the control logic for instructions
that use the multiplier.

I have painfully discovered that minor tweaks in the control logic can easily make it the
critical path.
I think that it is possible to have two threads in MicroBlaze, but I am not convinced that it
would give me more performance than two separate MicroBlazes. The overall area will be
less than two MicroBlazes but not far from it.

MicroBlaze has 700 LUTs and 500 DFFs. Double the number of flip-flops and the slice
count will go up.
(There are also a lot of places in Virtex and VirtexE where I have used all the inputs/outputs
of a slice, so even if there is a free DFF, it can't be reached.)
It will not double the size, but a significant increase will occur.

The multiprocessor approach makes it much easier to add 10 extra processors (threads).


Göran

Ray Andraka wrote:

> As I recall, you were getting speeds of ~135 MHz in V2.  You should be able to get a
> fully pipelined processor using BRAMs up to 200 MHz or so without any big problems.
> In VirtexII, the carry chains will be the limiting factor, not the BRAM if you do it
> right.  In Virtex and VirtexE, you can double the width of the BRAM and then use
> registers to assemble consecutive accesses.  It does get a bit messy in that case
> because it introduces pipeline misses.

Article: 48381
Subject: Re: Xilinx microblaze vs. picoblaze
From: Ray Andraka <ray@andraka.com>
Date: Wed, 16 Oct 2002 21:26:33 GMT


It is easiest to do via memory or the register file.  Thread timing has to make sure the value
is available before using it.
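
A small Python sketch of that idea follows, assuming two threads interleaved on alternate cycles and a single shared memory word used as a mailbox. The names `simulate`, `producer_slot` and `mailbox` are hypothetical, and the model ignores pipeline depth entirely.

```python
def simulate(cycles=8, producer_slot=0):
    """Pass values between two interleaved threads through shared memory.

    Thread ids alternate every cycle (0, 1, 0, 1, ...). The thread whose id
    equals producer_slot writes the mailbox; the other thread reads it.
    """
    memory = {"mailbox": None}  # shared word visible to both threads
    received = []
    next_value = 0
    for cycle in range(cycles):
        tid = cycle % 2
        if tid == producer_slot:
            memory["mailbox"] = next_value  # producer writes
            next_value += 1
        else:
            received.append(memory["mailbox"])  # consumer reads
    return received


if __name__ == "__main__":
    print(simulate(producer_slot=0))  # consumer slot follows the producer
    print(simulate(producer_slot=1))  # consumer slot comes first
```

When the consumer's slot follows the producer's, every read sees a fresh value. When the consumer runs first, its first read returns `None`, which is the timing hazard: the value was not yet available when it was used.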

Goran Bilski wrote:

>
> Isn't how data is moved between two processors/threads more crucial?
>
> How do you actually move data between two threads in the same processor?
>
> Göran
>


Article: 48382
Subject: unconnected nets in schematic editor of ISE 5.1
From: jym_23@yahoo.com (Jym)
Date: 16 Oct 2002 14:32:12 -0700

I instantiate an 8-bit counter cc8ce from the library and only use the
MSB as output. What is the best way to deal with the remaining 7 bits of
the output from the counter?

Thanks.

Article: 48383
Subject: Re: Xilinx microblaze vs. picoblaze
From: nweaver@ribbit.CS.Berkeley.EDU (Nicholas C. Weaver)
Date: Wed, 16 Oct 2002 21:48:48 +0000 (UTC)

In article <3DAC6ED3.A8E721C4@Xilinx.com>,
Goran Bilski  <Goran.Bilski@Xilinx.com> wrote:

>Another approach is to rely on advanced compiler techniques for
>handling all the pipeline hazards but it would make it almost
>impossible to program the processor in assembler since the user has
>to do the handling.  I personally don't think that this approach
>would gain that much more performance than MicroBlaze and you have to
>spend a lot of resources on the compiler which could be used for
>other stuff.

MIPS:  Machine without Interlocking Pipeline Stages.
-- 
Nicholas C. Weaver                                 nweaver@cs.berkeley.edu

Article: 48384
Subject: Re: xilinx: VirtexII in a pqfp208 or pqfp240 ?
From: mrand@my-deja.com (Marc Randolph)
Date: 16 Oct 2002 14:56:59 -0700

John_H <johnhandwork@mail.com> wrote in message news:<3DACCA15.BE61F97B@mail.com>...
> "Theron Hicks (Terry)" wrote:
> <snip>
> 
> >  I will keep the rework machine idea in mind if we ever need to go to the virtex2
> > or other BGA parts.  How does one inspect the solder joints to determine whether the
> > joints all have flowed correctly?  How steep is the learning curve to mount the chip
> > consistently?
> 
> It looked like a straight-forward process.  I didn't do any of the work myself, but one
> of the model shop guys showed me the workings of the thing.  With the appropriate
> preheat section and a slider to the hot air reflow, it looked like a solid technique
> without too much leeway for problems.  While a good visual inspection requires expensive
> inspection microscopes (extremely low profile view) for outside balls, the "right" way
> to fully inspect the balls is with X-ray inspection.  While neither is appropriate for
> your needs, a cheap boundary scan might be the best way to go.

Good idea.  Or if you have tons of spare I/O, do a pin-out so that you
can retransmit every received signal out to a test pad if loaded with
a special FPGA load.

> Talking to the vendors that want to sell you those stations, you can probably get a
> better understanding of the ins and outs.

The learning curve isn't horrible, but it takes some practice.  I.e.,
I'd guess you'll waste 5 to 15 devices as you try to get it accurate.
The hope is that after you save off the profile, you have no more
problems.

With business so slow, the vendors may be hungry enough that you could
convince them to do the dial-in for you if you provided the packages
(real or samples).  Note that vendors often have X-ray, so if you have
questions about your process, they can double check a few boards for
you (for no charge, at least in our experience).

Have fun,

   Marc

Article: 48385
Subject: Re: Why can Xilinx sw be as good as Altera's sw?
From: nweaver@ribbit.CS.Berkeley.EDU (Nicholas C. Weaver)
Date: Wed, 16 Oct 2002 22:05:39 +0000 (UTC)

In article <6uelaqjgg4.fsf@chonsp.franklin.ch>,
Neil Franklin  <neil@franklin.ch.remove> wrote:
>> What will you feed into the backend?  Output from the X or A front end?
>
>At present just interest to feed in my own simple language. May add
>XDL if that is sufficiently interesting.

Start with post placement XDL.  Make a translator from your language
to XDL, and you can now verify that your language makes sense by
feeding the results to xilinx routing etc.

You can now use your language (output as XDL) with your tool and
Xilinx tools, and use Xilinx toolflows to drive your backend
routing/bitgen, and you also decouple your language from your tools.

>Lesser case: You do know that nearly all the fundamental patents in
>FPGAs appeared around/pre 1985 (XC2000) and are now nearing their 17th
>year, and so at end of life? Give a few years (needed for any hypothetical
>bit-compatible scenario that makes cloning interesting anyway), and
>quite a lot of them will be gone.

But nobody cares about those parts.  The patents on XC4000 features
will be in force longer, and the Virtex parts even longer, and Virtex
is so far superior that nobody is going to want to clone older parts.

You too can clone an 8088, but what's the real use apart from an
interesting emulator?

>Don't forget that then the only patents remaining are detail patents,
>i.e. on the actual implementation. And that can be varied, without
>losing bit compatibility. The situation is getting simpler the longer
>time goes.

Not exactly.  I'd bet (although I haven't searched) that there are
patents on the BlockRAMs and patents on the hex lines, which would make a
bitfile clone of the Virtex really suffer in terms of performance if
you used funky alternate structures.

But I will be glad when the LUT patent is dead and buried.

>Also you may want to take into account that Altera managed to
>survive Xilinx's patents, despite starting when Xilinx had
>maximal protection, and with Altera a latecomer. Any new competitor
>has an easier situation.

Or worse as both brand A and brand X lawyers get together to put the
Serious Hurtin on a competitor.

Part of the reason Altera was able to get away with it was a large
number of other patents, so although Xilinx was first, there was a
major mess of overlapping patents.  A new competitor won't have that
advantage.

>And a further scenario: assume bit compatibility becomes important.
>Either X or A is the winner in becoming the standard. How long do you
>think the other of the 2 will look at declining sales until they
>clone? And we already know that a patent battle between the 2 of them ends in
>stalemate.

It never will be.  If it does, I'll eat my hat.

>>  Likewise the
>> innards of an FPGA are patented and otherwise protected IP.  If you try
>> to make an FPGA that is bitstream compatible you will either violate
>> patents or end up with a very unworkable chip design or both.
>
>You can get around a patent. Altera survived Xilinx's ones. AMD has
>wrung patents off of Intel, by tripping them over other stuff. Via has
>stopped Intel attacks by tripping them up. Ask a good IP lawyer about
>all the possibilities. IP law is not the clear "you lose" that you
>believe it to be.

Your examples are all companies which entered the particular market
around the same time, give or take.  AMD also had serious
cross-licensing with Intel, which many of the lawsuits
were about.
-- 
Nicholas C. Weaver                                 nweaver@cs.berkeley.edu

Article: 48386
Subject: Re: Xilinx microblaze vs. picoblaze
From: Ray Andraka <ray@andraka.com>
Date: Wed, 16 Oct 2002 22:42:11 GMT

It won't quite double performance, but the area should not be significantly larger either; if it
is, either you aren't doing it right or the design is already heavily pipelined.   The gain is not raw
performance, it is a gain of performance/area.  Two separate instances will provide more
MIPS than one dual-threaded machine, but at the cost of more area.
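
To put rough numbers on the performance/area argument, here is a back-of-the-envelope Python sketch. The 135 MHz and 200 MHz clocks are the figures mentioned in this thread; the 1.3x area for the dual-threaded core and the one-instruction-per-clock throughput are assumptions made purely for illustration, and `compare` is a hypothetical helper name.

```python
def compare(base_mhz=135.0, threaded_mhz=200.0, threaded_area=1.3):
    """Compare aggregate MIPS and MIPS per unit area for three options.

    Assumes one instruction per clock in aggregate for each core, and
    expresses area relative to a single-threaded core (area = 1.0).
    threaded_area = 1.3 is an assumed value, not a measured one.
    """
    designs = {
        "single": (base_mhz, 1.0),
        "dual_thread": (threaded_mhz, threaded_area),
        "two_cores": (2 * base_mhz, 2.0),
    }
    return {
        name: {"mips": mips, "mips_per_area": mips / area}
        for name, (mips, area) in designs.items()
    }


if __name__ == "__main__":
    for name, figures in compare().items():
        print(f"{name:12s} {figures['mips']:6.1f} MIPS  "
              f"{figures['mips_per_area']:6.1f} MIPS/area")
```

Under these assumptions the two separate cores deliver the most raw MIPS, while the dual-threaded core comes out ahead on MIPS per unit area. A larger assumed area penalty would narrow or reverse that advantage.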

Normally, the pipeline stages should be inserted so that their input comes from the LUT in
the same slice, so it is not an issue if you used up all the inputs.  The only blocking
input in that case is the SR input if you are using a CLB RAM or SRL16.  The control logic
can usually be pipelined similarly, but it may require a fresh start at the design rather
than patching the existing one.


Goran Bilski wrote:

> Yes, but I also have an embedded multiplier which is already using the registered
> output.
> I can't add another pipestage in that path since there is nowhere to insert it.
> Then I would have to add a special arrangement in the control logic for instructions
> that use the multiplier.
>
> I have painfully discovered that minor tweaks in the control logic can easily make it the
> critical path.
> I think that it is possible to have two threads in MicroBlaze, but I am not convinced that it
> would give me more performance than two separate MicroBlazes. The overall area will be
> less than two MicroBlazes but not far from it.
>
> MicroBlaze has 700 LUTs and 500 DFFs. Double the number of flip-flops and the slice
> count will go up.
> (There are also a lot of places in Virtex and VirtexE where I have used all the inputs/outputs
> of a slice, so even if there is a free DFF, it can't be reached.)
> It will not double the size, but a significant increase will occur.
>
> The multiprocessor approach makes it much easier to add 10 extra processors (threads).
>
> Göran


Article: 48387
Subject: Re: xilinx: VirtexII in a pqfp208 or pqfp240 ?
From: Austin Lesea <austin.lesea@xilinx.com>
Date: Wed, 16 Oct 2002 15:46:04 -0700

Marc (and all),

Peter wandered by a few moments ago, and asked me why I was so against pq packages?

I'm not, honestly!

The Spartan family always has small and inexpensive packages for their parts.

But the issue of assembly and rework is a real one for the bga packages.

If you can get an 8-layer board built in three days through any number of online services, how do
you assemble these prototypes?

Well, as it so happens, most places close to big technology centers have assembly services
just for prototyping!  And it isn't that expensive!  In fact, it costs a lot less than having
all of the stuff and people yourself, and not using it (which is what most prototyping
consists of: 5% building, and 95% debugging what was built).  They do rework too.

So ask around.  Large assembly services sometimes have a small proto line just for this, and
in areas where there is a lot more business, there are small specialty shops that just do
proto runs.

And I am talking here about five, two, and sometimes even one board for a reasonable price.

Of course, you will have to learn how to assemble the kit of parts, and identify them, and
have a good schematic, and a good bill of materials, and have a good assembly drawing.  All of
that you already have, right?

Believe me, if you don't have the right documentation, you are just playing at this, and
you must enjoy the subsequent pain.

The FPGA lab regularly builds, and gets assembled, PCBs that have packages up to the ff1517
size, and even though we do have the rework equipment (not that expensive if this is your
business) to mount the parts ourselves, it is just too easy and too quick to get it done
outside.

The outside folks are so good at it they seldom need to x-ray, but they can do that if there
is any question.

Austin


Article: 48388
Subject: Re: Xilinx microblaze vs. picoblaze
From: "Jan Gray" <jsgray@acm.org>
Date: Wed, 16 Oct 2002 15:47:46 -0700

Nicholas wrote:
> MIPS:  Machine without Interlocking Pipeline Stages.

MIPS: "Microprocessor without Interlocked Pipeline Stages"

Even the R4000 (which had a 2 cycle load-to-use delay, IIRC) made sure to
have a 0 cycle ALU-to-use delay, whereas a superpipelined 250 MHz V-II RISC
would necessarily have a 1 cycle ALU-to-use delay.  That is rather more
challenging for the code scheduler to address.

My (unpublished) V-II architectural studies back up what Goran has
been writing.  Furthermore, to 2-thread any such machine would increase the
area intolerably, because so much area is tied up in register files (which
would have to double in size).  Also, if you grow the area of the processor,
it will slow down because of increased interconnect delays.  A compact
processor is a fast processor.

Goran wrote:
> MicroBlaze has 700 LUTS and 500 DFFs. Double the number of flipflops and
> the slices count will go up.

Is that an implementation improvement over the old 900+ LUTs figure, or did
the old 900 LUTs figure include other non-core resources?  Can you say what
changed?

(SPRAM reg files?  (The following is fast enough for ~150 MHz operation:
Register the result in a write-back register (in FFs) on CLK rising edge;
present reg file write address and write-back data to reg file SPRAMs while
CLK high; write results to SPRAMs on CLK falling edge; present reg file read
address while CLK low; mux SPRAM outputs with immediate and/or forwarded
result; and register in operand registers.))

Jan Gray, Gray Research LLC

Article: 48389
Subject: Re: Xilinx microblaze vs. picoblaze
From: Goran Bilski <Goran.Bilski@Xilinx.com>
Date: Wed, 16 Oct 2002 16:14:06 -0700

Hi Jan,

Jan Gray wrote:

> Nicholas wrote:
> > MIPS:  Machine without Interlocking Pipeline Stages.
>
> MIPS: "Microprocessor without Interlocked Pipeline Stages"
>
> Even the R4000 (which had a 2 cycle load-to-use delay, IIRC) made sure to
> have a 0 cycle ALU-to-use delay, whereas a superpipelined 250 MHz V-II RISC
> would necessarily a 1 cycle ALU-to-use delay.  That is rather more
> challenging for the code scheduler to address.
>
> My (unpublished) V-II architectural studies concur back up what Goran has
> been writing.  Furthermore, to 2-thread any such machine would increase the
> area intolerably, because so much area is tied up in register files (which
> would have to double in size).  Also, if you grow the area of the processor,
> it will slow down because of increased interconnect delays.  A compact
> processor is a fast processor.
>
> Goran wrote:
> > MicroBlaze has 700 LUTS and 500 DFFs. Double the number of flipflops and
> > the slices count will go up.
>
> Is that an implementation improvement over the old 900+ LUTs figure, or did
> the old 900 LUTs figure include other non-core resources?  Can you say what
> changed?
>

Oops, I did it again!!
My error: I looked at the report file for the core and took the number
of IOs instead of LUTs.

>
> (SPRAM reg files?  (The following is fast enough for ~150 MHz operation:
> Register the result in a write-back register (in FFs) on CLK rising edge;
> present reg file write address and write-back data to reg file SPRAMs while
> CLK high; write results to SPRAMs on CLK falling edge; present reg file read
> address while CLK low; mux SPRAM outputs with immediate and/or forwarded
> result; and register in operand registers.))
>

(Of 900 LUTs, 256 of them are the register file => around 30%)
It's something that I have thought of, but it would not leave much room for any
logic handling the register addresses or operating on the register output.
Since I am using an SRL16 as the instruction prefetch buffer and the output delay of
an SRL16 is around 2 ns, I can't add much logic to the register addresses before
they have to go to the register file.
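
As a quick check of the "around 30%" figure, here is a rough accounting sketch. The
breakdown of the 256 LUTs (16x1 dual-port LUT RAM cells costing two LUTs each, one RAM
copy per read port) is an assumption made for illustration, not something stated above,
and the names are hypothetical.

```python
# Rough accounting for a LUT-RAM register file (assumed organisation).
REGS = 32             # architectural registers
WIDTH = 32            # bits per register
LUT_DEPTH = 16        # one LUT holds a 16x1 RAM
LUTS_PER_DP_CELL = 2  # assumption: a 16x1 dual-port cell costs two LUTs
READ_PORTS = 2        # assumption: one RAM copy per read port


def regfile_luts(regs=REGS, width=WIDTH, read_ports=READ_PORTS):
    """LUT count for a register file built from 16-deep dual-port LUT RAMs."""
    banks = -(-regs // LUT_DEPTH)  # ceiling division: 16-deep banks needed
    return banks * width * LUTS_PER_DP_CELL * read_ports


def share(part, total):
    """Fraction of the total taken by one part."""
    return part / total


luts = regfile_luts()
print(luts, f"{share(luts, 900):.1%}")
```

Under those assumptions the count comes out at 256 LUTs, a little over 28% of 900.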

But if I get some spare time (hahahaha) this is something that I would try
in order to bring down the MicroBlaze size.
Jan, you might be able to do a clean-room implementation of a MicroBlaze where
area is everything.
If you have any spare time ;-)

>
> Jan Gray, Gray Research LLC

Article: 48390
Subject: Re: Xilinx microblaze vs. picoblaze
From: hmurray@suespammers.org (Hal Murray)
Date: Wed, 16 Oct 2002 23:51:03 -0000


>> >Another approach is to rely on advanced compiler techniques for handling all the
>> >pipeline hazardous but it would make it almost impossible to program the processor in
>> >assembler since the user has to do the handling.
>> >I personally don't think that this approach would gain that much more performance than
>> >MicroBlaze and you have to spend a lot of resources on the compiler which could be
>> >used for other stuff.
>> 
>> This seems like an interesting opportunity for an open source project.
>
>Aren't there already CPUs in FPGA open source projects?  
>
>http://www.fpgacpu.org/
>
>http://www.opencores.org/
>
>The list is getting pretty long.  

I was thinking of the compiler rather than the hardware.

The idea is to use one thread rather than multiple, and make the
compiler smart enough to understand the pipeline delays, and either
automatically insert noops or slap your wrist if you do something
bad.

Think of it as microcode rather than "normal" (whatever that
means) RISC type code.  You have to get your head around it,
but once you get in the right mode it's not that hard.  Maybe
I was lucky to have a good mentor at the right time.
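
A minimal sketch of the "automatically insert noops" idea, assuming a made-up
three-field instruction format and a fixed result delay. `Instr`, `insert_nops` and
the `delay` parameter are hypothetical names for illustration; a real compiler would
try to reorder useful instructions into the gaps before falling back to padding.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


NOP = Instr("nop", None, ())


def insert_nops(program: List[Instr], delay: int = 1) -> List[Instr]:
    """Pad with NOPs so that no instruction reads a register during the
    `delay` slots that follow the instruction writing it."""
    out: List[Instr] = []
    ready_at: Dict[str, int] = {}  # register -> first slot where it may be read
    for ins in program:
        need = max((ready_at.get(r, 0) for r in ins.srcs), default=0)
        while len(out) < need:     # not ready yet: fill the gap with NOPs
            out.append(NOP)
        out.append(ins)
        if ins.dest is not None:
            ready_at[ins.dest] = len(out) + delay
    return out


prog = [
    Instr("add", "r3", ("r1", "r2")),
    Instr("sub", "r4", ("r3", "r1")),  # uses r3 right after it is produced
    Instr("or", "r5", ("r1", "r2")),
]
print([i.op for i in insert_nops(prog, delay=1)])
```

The "slap your wrist" variant would raise an error at the point where this sketch
appends a NOP.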

-- 
The suespammers.org mail server is located in California.  So are all my
other mailboxes.  Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's.  I hate spam.

Article: 48391
Subject: Re: Xilinx microblaze vs. picoblaze
From: "Tim" <tim@rockylogic.com.nooospam.com>
Date: Thu, 17 Oct 2002 00:54:30 +0100

Various people wrote:
> > > MIPS:  Machine without Interlocking Pipeline Stages.
> >
> > MIPS: "Microprocessor without Interlocked Pipeline Stages"

From long ago, when life was simpler.

MIPS means 'MIPS', and has plenty of interlocking.  Some
implementations also have the HACF instruction, which we
all challenge Goran to implement.  (Halt and Catch Fire)

Article: 48392
Subject: Re: xilinx: VirtexII in a pqfp208 or pqfp240 ?
From: Tullio Grassi <tullio@physics.umd.edu>
Date: Wed, 16 Oct 2002 20:22:47 -0400

Falk Brunner wrote:
> 
> So you need to upgrade you assembly technology.
> You can get BGAs assembled by professional companys, or do it youself, using
> some advanced assembly tools.
> Even a amateur can do this. At least one in this world ;-))
> 

The problem is that "so-called" professional companies often
mess up BGAs too, without warning you :(

Here is what happened to our BGAs:
  http://edg.umd.edu/heater/bga/

-- 

Tullio Grassi

======================================
Univ. of Maryland - Dept. of Physics
College Park, MD 20742 - US
Tel +1 301 405 5970
Fax +1 301 699 9195
======================================

Article: 48393
Subject: Re: Xilinx microblaze vs. picoblaze
From: Goran Bilski <Goran.Bilski@Xilinx.com>
Date: Wed, 16 Oct 2002 17:28:53 -0700


I always try to get in an instruction that always produces the value 42.
But for some reason I can never get it past management.

Göran

Tim wrote:

> Various people wrote:
> > > > MIPS:  Machine without Interlocking Pipeline Stages.
> > >
> > > MIPS: "Microprocessor without Interlocked Pipeline Stages"
>
> From long ago, when life was simpler.
>
> MIPS means 'MIPS', and has plenty of interlocking.  Some
> implementations also have the HACF instruction, which we
> all challenge Goran to implement.  (Halt and Catch Fire)

Article: 48394
Subject: multiple clocks
From: "Jamie Morken" <jmorken@shaw.ca>
Date: Thu, 17 Oct 2002 01:56:56 GMT

Hi,

I'm working on a board that requires a 120 MHz clock, a 24 MHz clock and a 40 MHz clock.
The board has a SpartanIIE device on it.  I've never used a PLL (which the Spartan device
has), so I'm unsure whether I should use multiple crystals or whether I can use the PLL
to give the needed frequencies.

One of the ICs requires 24 MHz, so would it be possible to use a 24 MHz crystal for this
IC and also feed its output to the FPGA to generate the 40 and 120 MHz clocks?

cheers,
Jamie Morken

Article: 48395
Subject: Re: Xilinx microblaze vs. picoblaze
From: Ken McElvain <ken@synplicity.com>
Date: Thu, 17 Oct 2002 02:22:04 GMT



Nicholas C. Weaver wrote:

> In article <3DADAAB7.E96D53F2@Xilinx.com>,
> Goran Bilski  <Goran.Bilski@Xilinx.com> wrote:
> 
> 
>>You can't just double the number of pipestage for a processor without
>>major impacts.  For streaming pipeline which hardware pipelines are I
>>agree but for processor that can't be done.
>>
> 
> Uhh, yes it can.  
> 
> Double all the pipeline stages, double the register file, rebalance
> the delays now that you have more pipelining, and out drops a 2-thread
> multithreaded architecture.  Each single thread now runs slower, but
> aggregate throughput (sum of the two threads) is increased.
> 
> It is so obvious yet unintuitive that nobody has actually DONE it
> before.  :)
> 

Sorry, it was done a long time ago.  Try to find some info on
the CDC 6600 IO processors.  They ran 16 threads in a very deep
pipeline.

Article: 48396
Subject: Re: Xilinx microblaze vs. picoblaze
From: nweaver@ribbit.CS.Berkeley.EDU (Nicholas C. Weaver)
Date: Thu, 17 Oct 2002 02:25:45 +0000 (UTC)

In article <3DADEB1E.35A6770B@andraka.com>,
Ray Andraka  <ray@andraka.com> wrote:

>Normally, the pipeline stages should be inserted so that their input
>comes from the LUT in the same slice, so it is not an issue if you
>used up all the inputs.  The only blocking input in that case is the
>SR input if you are using a CLB RAM or SRL16.  The control logic can
>usually be pipelined similarly, but it may require a fresh start at
>the design rather than patching the existing one.

One observation:  If you want to C-slow the clock enable, you
want to loop it through LUT logic anyway, otherwise you get
interference between the two threads.

Same actually goes for the reset as well.
-- 
Nicholas C. Weaver                                 nweaver@cs.berkeley.edu

Article: 48397
Subject: Re: Xilinx microblaze vs. picoblaze
From: nweaver@ribbit.CS.Berkeley.EDU (Nicholas C. Weaver)
Date: Thu, 17 Oct 2002 02:28:51 +0000 (UTC)

In article <3DAE1E83.60005@synplicity.com>,
Ken McElvain  <ken@synplicity.com> wrote:
>> Uhh, yes it can.  
>> 
>> Double all the pipeline stages, double the register file, rebalance
>> the delays now that you have more pipelining, and out drops a 2-thread
>> multithreaded architecture.  Each single thread now runs slower, but
>> aggregate throughput (sum of the two threads) is increased.
>> 
>> It is so obvious yet unintuitive that nobody has actually DONE it
>> before.  :)
>> 
>
>Sorry, it was done a long time ago.  Try to find some info on
>the CDC 6600 IO processors.  They ran 16 threads in a very deep
>pipeline.

Those machines, and also the HEP and Tera, didn't have any bypassing.  The
multithreaded approach I'm proposing keeps the bypassing
but doubles the pipelining.

The closest, "interleaved multithreading", kept some of the bypassing,
but never took advantage of the fact that the bypass feedback loops now have
more registers in them, which lets you up the clock rate by finer pipelining.
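
A toy model of the idea, assuming a simple accumulator as the feedback loop (the
function name and the accumulator itself are illustrative, not anyone's actual design).
The single register in the loop is replaced by two, and two threads feed the loop on
alternate cycles. The model only shows that the two streams stay independent; it does
not model the timing gain from moving the extra register into the adder.

```python
def c_slow_accumulate(interleaved_inputs, c=2):
    """Accumulator whose single feedback register is replaced by `c` registers."""
    loop = [0] * c                   # the c registers in the feedback path
    outputs = []
    for x in interleaved_inputs:
        result = loop[-1] + x        # adder sees the oldest register in the loop
        loop = [result] + loop[:-1]  # every register clocks once per cycle
        outputs.append(result)
    return outputs


thread_a = [1, 2, 3]
thread_b = [10, 20, 30]
# Interleave the two threads cycle by cycle: a0, b0, a1, b1, ...
stream = [x for pair in zip(thread_a, thread_b) for x in pair]
out = c_slow_accumulate(stream, c=2)
print(out[0::2], out[1::2])  # running sums of thread A and of thread B
```

Each thread sees its own running sum at half the issue rate, which is the "each single
thread now runs slower, but aggregate throughput is increased" trade-off.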
-- 
Nicholas C. Weaver                                 nweaver@cs.berkeley.edu

Article: 48398
Subject: Re: Xilinx microblaze vs. picoblaze
From: "Jan Gray" <jsgray@acm.org>
Date: Wed, 16 Oct 2002 19:53:49 -0700

See also http://www.fpgacpu.org/log/nov01.html#011122:

"One can also build a simple barrel processor (say 4 threads (slots) x 32
regs = 128 entries of 32-bits = 2 16-bit ports on a single 256x16 BRAM,
triple cycled, or two BRAMs double cycled) and switch threads on each
cycle. Then you can have a 4-deep pipeline without need for any result
forwarding muxes (by the time you read an operand on thread[i], you have
already retired that thread's previous result to the register file).

This seems to me to be a perfectly simple and practical basis to issue
instructions faster than the ALU + result forwarding mux + operand register
recurrence critical path. Unfortunately single-thread performance is not so
hot but in workloads such as a "network processing", who cares?

This idea was taken to sublime levels in the 20-stage pipelined 5-threaded 1
GHz MicroUnity MediaProcessor (which would have needed some result
forwarding, but not 18 stages worth)."
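
To make the "no forwarding muxes" argument concrete, here is a small cycle-level
sketch. It assumes each thread just runs a chain of dependent `r = r + 1`
instructions and that a result becomes readable `depth` cycles after issue;
`run_barrel` and its parameters are hypothetical names, not part of any real design.

```python
from collections import deque


def run_barrel(threads: int, depth: int, cycles: int):
    """Round-robin issue of r = r + 1 per thread, with no result forwarding."""
    regs = [0] * threads    # one accumulator register per thread
    issued = [0] * threads  # instructions issued per thread
    in_flight = deque()     # (cycle result becomes readable, thread, value)
    for cycle in range(cycles):
        # Retire every result whose write-back completed before this cycle.
        while in_flight and in_flight[0][0] <= cycle:
            _, t, value = in_flight.popleft()
            regs[t] = value
        t = cycle % threads  # switch threads on each cycle
        in_flight.append((cycle + depth, t, regs[t] + 1))
        issued[t] += 1
    for _, t, value in in_flight:  # drain the pipeline
        regs[t] = value
    return regs, issued


print(run_barrel(threads=4, depth=4, cycles=16))  # every increment lands
print(run_barrel(threads=2, depth=4, cycles=16))  # stale reads: increments lost
```

With 4 threads on a 4-deep pipeline each register ends up equal to the number of
instructions issued; with only 2 threads, operands are read before the previous
result has retired, which is exactly the case that would need forwarding or stalls.
The register-file sizing in the quote also checks out: 4 x 32 x 32 bits = 4096 bits
= 256 x 16.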

Jan Gray, Gray Research LLC

Article: 48399
Subject: Re: multiple clocks
From: hmurray@suespammers.org (Hal Murray)
Date: Thu, 17 Oct 2002 03:48:16 -0000

>I'm working on a board that requires a 120MHz clock, a 24Mhz clock and a
>40MHz clock.

Do the clocks have to be in sync?  24 is 120/5 and 40 is 120/3 so it's
simple to generate the other clocks with a PLL or maybe just digital
logic (PAL?)

If you don't need them to be in sync (that is you have 3 separate
chunks of logic and can put synchronizers between them) then
compare the cost/risks of PLLs with three separate oscillator
packages.  Sometimes it's handy to be able to adjust (fudge?)
a clock a bit (say 22 rather than 24) because a chunk of logic
doesn't quite work at 24.
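
For the "just digital logic" option, a sketch of the counter arithmetic, assuming a
120 MHz master clock and plain wrap-around counters (`divider_ticks` is a
hypothetical helper). It reports the fast-clock cycles on which a divide-by-N counter
wraps, which is where a clock enable or divided-clock edge would occur.

```python
def divider_ticks(n_cycles: int, divide_by: int):
    """Return the fast-clock cycle indices where a divide-by-N counter wraps."""
    ticks, count = [], 0
    for cycle in range(n_cycles):
        if count == 0:
            ticks.append(cycle)
        count = (count + 1) % divide_by
    return ticks


FAST_MHZ = 120
for div in (5, 3):
    print(f"/{div}: {FAST_MHZ / div} MHz, ticks at {divider_ticks(15, div)}")
```

Note that dividing by an odd number with a single-edge counter does not give a 50%
duty cycle, so using the wrap pulses as clock enables on the 120 MHz clock may be
simpler than distributing the divided signals as clocks.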

-- 
The suespammers.org mail server is located in California.  So are all my
other mailboxes.  Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's.  I hate spam.
