
Article: 156900
Subject: Re: Know any good public FPGA projects to contribute to?
From: "mnentwig" <24789@embeddedrelated>
Date: Thu, 24 Jul 2014 22:31:05 -0500

Hi,

here, for example, is one.
http://forum.gadgetfactory.net/index.php?/topic/2046-xthundercore-is-taking-shape/

In general, there are many CPUs but a shortage of simple (!) "Hello world"
examples to actually use them without spending a week first.
 
This blog nails it, more or less:
http://blog.tube42.se/?p=105
(that said: I managed to get the "small" variant of the ZPU in question
working on a Spartan 6, here.
http://forum.gadgetfactory.net/index.php?/topic/1863-bare-metal-zpu-hello-world/.
It is slow but fairly small, about 12 % on a Spartan 6 LX9)

Another interesting project is "minSoc". It appears to be very well
maintained. 
A simulation worked right out of the box when I tried yesterday - it even
includes its own iverilog simulator - but I wasn't able to build on Spartan
6 as the JTAG block is not supported.
A minimal openRisc "hello world" example could be useful for many - nothing
but processor, on-chip RAM with initial values for program code and a LED.
	   
					
---------------------------------------		
Posted through http://www.FPGARelated.com

Article: 156901
Subject: Re: Know any good public FPGA projects to contribute to?
From: "mnentwig" <24789@embeddedrelated>
Date: Fri, 25 Jul 2014 01:24:02 -0500

wrong link: blog.tube42.se/?p=105
	   
					

Article: 156902
Subject: Re: Know any good public FPGA projects to contribute to?
From: "mnentwig" <24789@embeddedrelated>
Date: Fri, 25 Jul 2014 01:58:45 -0500

well... as fascinating as this candy business is, I was trying to link to
"Tubologue | The sad state of OSS hardware (part 1)"
but usenet won't let me... Lost in quotation...
	   
					

Article: 156903
Subject: Re: Generating a desired synthesizable binary pulse train on FPGA using VHDL
From: "mnentwig" <24789@embeddedrelated>
Date: Fri, 25 Jul 2014 10:39:58 -0500

Hi,

>However, this is not an efficient use of resources in an FPGA using up 
>16 FFs along with the control logic, if any.  If it were any larger I 
>would use a direct address of an array constant would use a four bit 
>counter and a single LUT used as memory.

would this still apply if my design uses proportionally more LUTs than
registers?
For example, here is a synthesis report for a minimal "medium" ZPU
processor on Spartan 6 LX9 (that is most enthusiastically blinking its LED
as I write this):

Slice Logic Utilization:
  Number of Slice Registers:                   284 out of  11,440    2%
    Number used as Flip Flops:                 284
..
  Number of Slice LUTs:                        934 out of   5,720   16%
    Number used as logic:                      915 out of   5,720   15%
    Number used as Memory:                       9 out of   1,440    1%

This is not to argue the point, I just want to understand the possible
trade-offs. For example, I wonder if it would make sense to replace small
counters with one-hot shift registers in such a situation?	   
					

Article: 156904
Subject: Re: Generating a desired synthesizable binary pulse train on FPGA
From: rickman <gnuarm@gmail.com>
Date: Fri, 25 Jul 2014 13:38:02 -0400

On 7/25/2014 11:39 AM, mnentwig wrote:
> Hi,
>
>> However, this is not an efficient use of resources in an FPGA using up
>> 16 FFs along with the control logic, if any.  If it were any larger I
>> would use a direct address of an array constant would use a four bit
>> counter and a single LUT used as memory.
>
> would this still apply if my design uses proportionally more LUTs than
> registers?
> For example, here is a synthesis report for a minimal "medium" ZPU
> processor on Spartan 6 LX9 (that is most enthusiastically blinking its LED
> as I write this):
>
> Slice Logic Utilization:
>    Number of Slice Registers:                   284 out of  11,440    2%
>      Number used as Flip Flops:                 284
> ...
>    Number of Slice LUTs:                        934 out of   5,720   16%
>      Number used as logic:                      915 out of   5,720   15%
>      Number used as Memory:                       9 out of   1,440    1%
>
> This is not to argue the point, I just want to understand the possible
> trade-offs. For example, I wonder if it would make sense to replace small
> counters with one-hot shift registers in such a situation?	

First, my comment was about going the other direction, from a long shift 
register to an encoded counter and memory.  You are asking if it makes 
sense to go from a state encoded counter to a one-hot register.  I don't 
see how that can save resources of any type.  The one-hot register will 
need at minimum one LUT per FF.

A counter is a very efficient use of the FPGA resources, however that is 
not a useful FSM.  To be useful there needs to be inputs which add logic 
to the counter.  In the simplest case this input is just a hold input 
which comes free other than the logic to generate the hold signal.  In a 
more general case the counter will need to jump around rather than just 
progressing through the states linearly.  In this case the FSM is not 
just a counter anymore and the LUT count increases.

So to answer your question, "it depends".  lol  But in general I would 
not expect a one-hot implementation to use any fewer LUTs at the expense 
of more FFs, but it is possible.
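
To put rough numbers on the flip-flop side of that trade-off, here is a minimal Python sketch. It counts only the state flip-flops for a counter with a given number of states; it does not estimate LUTs, which depend on the enable, reset and jump logic discussed above. The function names are made up for illustration.

```python
def binary_counter_ffs(n_states: int) -> int:
    """Flip-flops needed to hold n_states in a binary-encoded counter."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2, without float rounding.
    return max(1, (n_states - 1).bit_length())


def one_hot_ffs(n_states: int) -> int:
    """Flip-flops needed for a one-hot ring: one per state."""
    return n_states


for n in (4, 16, 1024):
    print(n, "states:", binary_counter_ffs(n), "FFs binary,", one_hot_ffs(n), "FFs one-hot")
```

For the 16-state pulse train discussed earlier this is 4 flip-flops against 16, which is where the "16 FFs" figure in the quoted text comes from.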

I've been watching the ZPU over the years and I would like to know what 
your LUT count includes.  Does that include I/O such as a UART?  Any 
idea how much is just for the CPU?  Early on the ZPU people claimed a 
*very* low LUT count of around 500 or less, IIRC.  I believe the Spartan 
6 has 6 input LUTs, so your LUT count is hard to compare to the LUT 
counts using 4 input LUTs.  Still, 900 is a fair amount more than 500. 
I assume you have optimized for performance at the expense of size?

-- 

Rick

Article: 156905
Subject: Re: Generating a desired synthesizable binary pulse train on FPGA using VHDL
From: "mnentwig" <24789@embeddedrelated>
Date: Sat, 26 Jul 2014 01:24:10 -0500

Hi,

> I don't 
>see how that can save resources of any type.  The one-hot register will 
>need at minimum one LUT per FF.

isn't a one-hot counter just a simple ring shift register? I can build it
from FFs without any further logic.
A simple experiment:

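reg [1023:0] test = 1024'd1;            // one-hot ring: a single '1' circulates
always @(posedge clk) begin
  test <= {test[1022:0], test[1023]};   // rotate left by one position
  LED <= |test[1023:1];                 // OR-reduce every bit except bit 0
end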

The final "or" forces (mostly) use of physical FFs instead of LUTs in shift
register configuration:

 Number of Slice Registers:                 1,252 out of  11,440   10%
    Number used as Flip Flops:               1,252
  Number of Slice LUTs:                        573 out of   5,720   10%
    Number used as logic:                      216 out of   5,720    3%
    Number used as Memory:                      44 out of   1,440    3%
      Number used as Shift Register:            44
    Number used exclusively as route-thrus:    313
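
The behaviour of that experiment can be checked with a small Python model. This is only a behavioural sketch, not a substitute for simulating the Verilog. It assumes the LED register starts at 0 (the original does not say) and models the non-blocking assignments by computing both new values from the old value of test.

```python
WIDTH = 1024
MASK = (1 << WIDTH) - 1


def step(test: int) -> int:
    """Rotate left by one bit, like {test[1022:0], test[1023]}."""
    return ((test << 1) & MASK) | (test >> (WIDTH - 1))


def led(test: int) -> int:
    """OR-reduce bits 1023..1, i.e. everything except bit 0."""
    return 1 if (test >> 1) != 0 else 0


def simulate(cycles: int):
    """Run the ring for a number of clock edges and record the LED register."""
    test = 1          # matches the initial value 1024'd1
    led_reg = 0       # assumption: LED register powers up at 0
    trace = []
    for _ in range(cycles):
        # Both right-hand sides see the OLD value of test, as with '<=' in Verilog.
        test, led_reg = step(test), led(test)
        trace.append(led_reg)
    return test, trace


final_state, trace = simulate(2 * WIDTH)
print("ring back at start:", final_state == 1)
print("cycles with LED low:", trace.count(0))
```

In this model the LED register is low on exactly one clock edge per 1024, namely the edge at which the single '1' was sitting in bit 0.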

>I've been watching the ZPU over the years and I would like to know what 
>your LUT count includes.  

the one in my previous mail includes only the processor with on-chip RAM
and a single "GPIO" on the bus for the LED. It's the so-called "medium"
variant with some options changed. I use a simple "for" loop as benchmark
that controls the LED and it manages around 2M hardware writes / second.

There is also the "small" ZPU which is about half the size:
  Number of Slice Registers:                   258 out of  11,440    2%
  Number of Slice LUTs:                        596 out of   5,720   10%
This one includes a UART, 500 LUTs after setting options sounds correct. 
It is, however, very slow, maybe 10 % of "medium".
I haven't optimized the settings, for example LUT sharing might reduce size
further.
There are newer variants (ZPUino, "extreme" core) that are probably faster,
especially with external memory.

If anybody knows a good, free CPU, I'd love to hear about it. Those two work
pretty well for me.
Faster CPUs exist, for example MICO32 was mentioned. I did some trials with
that one, but it used too much space on the LX9, maybe three times as big
as the "medium" ZPU if I remember correctly.

I don't use a CPU for high-performance computing, but mainly to change
functionality quickly without rebuilding RTL: Compiling my test code,
merging it into the bitstream and uploading takes only 750 ms.
	   
					

Article: 156906
Subject: Re: Generating a desired synthesizable binary pulse train on FPGA
From: rickman <gnuarm@gmail.com>
Date: Sat, 26 Jul 2014 04:57:59 -0400

On 7/26/2014 2:24 AM, mnentwig wrote:
> Hi,
>
>> I don't
>> see how that can save resources of any type.  The one-hot register will
>> need at minimum one LUT per FF.
>
> isn't a one-hot counter just a simple ring shift register? I can build it
> from FFs without any further logic.

That's only if it is a simple counter with no other transitions or 
controls other than an enable.  Usually they need some sort of sync 
reset which may or may not be supported by the FF primitive without a LUT.

> A simple experiment:
>
> reg [1023:0] test = 1024'd1;
> always @(posedge clk) begin
> test <= {test[1022:0], test[1023]};
> LED <= |test[1023:1];
>
> The final "or" forces (mostly) use of physical FFs instead of LUTs in shift
> register configuration
>
>   Number of Slice Registers:                 1,252 out of  11,440   10%
>      Number used as Flip Flops:               1,252
>    Number of Slice LUTs:                        573 out of   5,720   10%
>      Number used as logic:                      216 out of   5,720    3%
>      Number used as Memory:                      44 out of   1,440    3%
>        Number used as Shift Register:            44
>      Number used exclusively as route-thrus:    313
>
>> I've been watching the ZPU over the years and I would like to know what
>> your LUT count includes.
>
> the one in my previous mail includes only the processor with on-chip RAM
> and a single "GPIO" on the bus for the LED. It's the so-called "medium"
> variant with some options changed. I use a simple "for" loop as benchmark
> that controls the LED and it manages around 2M hardware writes / second.
>
> There is also the "small" ZPU which is about half the size:
>    Number of Slice Registers:                   258 out of  11,440    2%
>    Number of Slice LUTs:                        596 out of   5,720   10%
> This one includes a UART, 500 LUTs after setting options sounds correct.
> It is, however, very slow, maybe 10 % of "medium".

Yes, this is the one that I thought was impressive in terms of the tiny 
size, but as you note, at a price of extreme lack of speed.  I believe 
the slowness comes from the architecture rather than the clock being a 
lot slower.  That is, the clock is still a reasonable speed, but it 
needs a lot more of them to get the work done because of having fewer 
data paths.

> I haven't optimized the settings, for example LUT sharing might reduce size
> further.

LUT sharing?  Is that where the logic is broken into pieces which can be 
shared between different paths when there is some overlap?  I've never 
bothered with that as I think the savings are typically pretty small.

> There are newer variants (ZPUino, "extreme" core) that are probably faster,
> especially with external memory.
>
> If anybody knows a good, free CPU, I'd love to hear about. Those two work
> pretty well for me.
> Faster CPUs exist, for example MICO32 was mentioned. I did some trials with
> that one, but it used too much space on the LX9, maybe three times as big
> as the "medium" ZPU if I remember correctly.
>
> I don't use a CPU for high-performance computing, but mainly to change
> functionality quickly without rebuilding RTL: Compiling my test code,
> merging it to the bitstream and uploading takes only 750 ms,

I'm not familiar with the MICO32... do you mean the one from Lattice, 
maybe named MICRO32?  I don't recall for sure.  Just about any standard 
RISC CPU will be a lot bigger than the ZPU.  OpenCores has one they call 
OpenRISC which has been around a while.  I think it is fairly large 
though.  ZPU was designed specifically to be as small as possible for 
code that needs very little speed.  Then they decided to develop a few 
faster variants which are totally binary compatible.  I think they 
achieved their objective and I have heard of it being used in some 
business apps.

The other day I did see another soft core that is supported by a C 
compiler, at least a beta version.  I don't recall the name, but I 
expect I could come up with it if you are interested.  Everything else I 
have seen are stack processors intended to run a Forth like language. 
That can make for a very simple machine... like the ZPU.  :)

-- 

Rick

Article: 156907
Subject: Re: Generating a desired synthesizable binary pulse train on FPGA
From: rickman <gnuarm@gmail.com>
Date: Sat, 26 Jul 2014 05:03:01 -0400

Here is the info on the YARD-1 processor I was trying to remember.  He 
is doing an LCC backend so it has a C compiler, albeit in the early 
stages still...  This is the only other (than the ZPU) open source 
softcore CPU I know of with C support.

To: <fpga-cpu@yahoogroups.com>
From: "brimdavis@aol.com [fpga-cpu]" <fpga-cpu@yahoogroups.com>
Subject: [fpga-cpu] State of the YARD, July 2014

Another in an occasional series of updates on the YARD-1 processor.

Cleanup:
  Since my last status post[1], I've made some headway in cleaning up 
the code and documentation; the repository now contains all the core 
sources and some demo designs, in addition to the cross assembler tools 
and ISA verification code.

  Things are working well enough to use for small assembly projects, 
although not all processor features are implemented or working yet.

Docs:
  Google recently disabled the Downloads feature of Google Code, so I've 
added a wiki page[2] directly linking to the documentation files in the 
repository.

  I've also added some wiki pages[3] summarizing the build results for 
the Xilinx Spartan3 and Lattice XO2 demo designs.

ISA Changes:
  Other than some minor encoding changes, the only instruction set 
alterations of note were the {reluctant} replacement of the nifty bit 
counting instructions with register-register 8|16 bit sign|zero 
extending MOVes to better support LCC's code generator for char and 
short operations on registers.

LCC:
  The experimental YARD LCC port[4] now has a nearly complete (but not 
well tested) integer back-end, but neither floating point support nor a 
C library as of yet.

-Brian

[1] 2011 status post
  http://groups.yahoo.com/neo/groups/fpga-cpu/conversations/topics/3362

[2] doc wiki link
  http://code.google.com/p/yard-1/wiki/Documentation_Links

[3] build wiki links
  http://code.google.com/p/yard-1/wiki/Lattice_XP2_Brevia
  http://code.google.com/p/yard-1/wiki/Digilent_S3_Starter_Board

[4] lcc-homebrew link
  http://code.google.com/p/lcc-homebrew


-- 

Rick

Article: 156908
Subject: Re: Generating a desired synthesizable binary pulse train on FPGA using VHDL
From: "mnentwig" <24789@embeddedrelated>
Date: Sat, 26 Jul 2014 07:21:32 -0500

Hi,

>That's only if it is a simple counter with no other transitions or 
>controls other than an enable.  Usually they need some sort of sync 
>reset which may or may not be supported by the FF primitive without a
>LUT.

thanks. Maybe I'll just leave it to the synthesis tool...

>Yes, this is the one that I thought was impressive in terms of the tiny 
>size, but as you note, at a price of extreme lack of speed.  I believe 
>the slowness comes from the architecture rather than the clock being a 
>lot slower.  That is, the clock is still a reasonable speed, but it 
>needs a lot more of them to get the work done because of having fewer 
>data paths.

Yes, the achievable clock speed is even marginally higher for the small one
(~110 MHz vs 100 MHz, possibly faster if I'd tweak the settings).
It doesn't have registers, so every operand goes to the stack, if I
remember correctly. The "medium" variant has a hardware cache for the last
two levels.

>LUT sharing?  Is that where the logic is broken into pieces which can be 
>shared between different paths when there is some overlap?  I've never 
>bothered with that as I think the savings are typically pretty small.

There is an option to duplicate registers to reduce routing delay. But what
I meant is to put several independent logic functions into the same LUT,
i.e. four-input plus two-input to make it smaller. I haven't really read
the manual too carefully here. The one optimization option that I found
important is pipeline register balancing. 

This is the MICO32 I meant:
http://en.wikipedia.org/wiki/LatticeMico32

I just got feedback in another forum that the openRisc processor was too
limited in terms of clock speed.
There is also an ARM clone (amber), but it seems quite big, 90 % of an LX9
(compared to 20 and 10 % for the ZPUs)

I'll have a look at the YARD processor, thanks. Never heard about it
before.

For example, Ettus uses ZPUs in their SDR products, so I think I'm on the
right track with the ZPU. It doesn't have to be perfect, still beats the
alternative of running a separate MBED or raspberry board with a SPI link
to the FPGA.

Cheers

Markus
	   
					

Article: 156909
Subject: Re: Generating a desired synthesizable binary pulse train on FPGA using VHDL
From: hal-usenet@ip-64-139-1-69.sjc.megapath.net (Hal Murray)
Date: Sat, 26 Jul 2014 16:08:31 -0500

In article <lqvs62$mf1$1@dont-email.me>,
 rickman <gnuarm@gmail.com> writes:

>Seems the head gasket was installed incorrectly, not because someone 
>munged up the installation.  It was installed incorrectly because the 
>installation procedures were wrong!  Potentially they *all* could have 
>been installed badly and *every generator could have failed within 
>minutes of starting up*!!!

Bugs in documentation have long been an "interesting" problem
in the software business area.

Many years ago, I read a neat story, probably on usenet.  Our
hero was on a US destroyer in the south pacific.  He was in charge
of the 5 inch guns.  They worked, but weren't quite as accurate as
they should have been.  They even had a factory rep flown out.  He
didn't fix anything.  When their time was up and they were headed
back home, one of his guys said, roughly, "Everything is clean and
polished, how about I take a look at the gun controller?"  The guy
was good at that sort of stuff, so the answer was "go for it".  This
was analog computer days.  Picture gears all over the place, like a
kid taking apart a clock. As things were put back together, the guy
turned one gear over.  That fixed it.  The picture in the book
showed it in the wrong way.

-- 
These are my opinions.  I hate spam.

Article: 156910
Subject: Re: Generating a desired synthesizable binary pulse train on FPGA using VHDL
From: hal-usenet@ip-64-139-1-69.sjc.megapath.net (Hal Murray)
Date: Sat, 26 Jul 2014 16:14:37 -0500

Argh/blush.  Wrong newsgroup (as if you couldn't guess).
Fatfinger on my part.  Sorry for the clutter.

-- 
These are my opinions.  I hate spam.

Article: 156911
Subject: Re: Generating a desired synthesizable binary pulse train on FPGA
From: rickman <gnuarm@gmail.com>
Date: Sat, 26 Jul 2014 23:33:26 -0400

On 7/26/2014 8:21 AM, mnentwig wrote:
> Hi,
>
>> That's only if it is a simple counter with no other transitions or
>> controls other than an enable.  Usually they need some sort of sync
>> reset which may or may not be supported by the FF primitive without a
>> LUT.
>
> thanks. Maybe I'll just leave it to the synthesis tool...
>
>> Yes, this is the one that I thought was impressive in terms of the tiny
>> size, but as you note, at a price of extreme lack of speed.  I believe
>> the slowness comes from the architecture rather than the clock being a
>> lot slower.  That is, the clock is still a reasonable speed, but it
>> needs a lot more of them to get the work done because of having fewer
>> data paths.
>
> Yes, the achievable clock speed is even marginally higher for the small one
> (~110 MHz vs 100 MHz, possibly faster if I'd tweak the settings).
> It doesn't have registers, so every operand goes to the stack, if I
> remember correctly. The "medium" variant has a hardware cache for the last
> two levels.
>
>> LUT sharing?  Is that where the logic is broken into pieces which can be
>> shared between different paths when there is some overlap?  I've never
>> bothered with that as I think the savings are typically pretty small.
>
> There is an option to duplicate registers to reduce routing delay. But what
> I meant is to put several independent logic functions into the same LUT,
> i.e. four-input plus two-input to make it smaller. I haven't really read
> the manual too carefully here. The one optimization option that I found
> important is pipeline register balancing.

Ok, you are talking about something that comes with the 6 input LUTs. 
For many years the standard size for LUTs was 4 inputs.  Xilinx used 
some extra logic in the CLB to allow multiple 4LUTs to be joined via 
another mux to create the equivalent of a 5 input LUT.  So you could say 
they had 5LUTs for some time now which had the option of being split 
into a pair of 4LUTs.  semantics...
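
The "two 4LUTs joined by a mux" construction is just a Shannon expansion on the fifth input, and it can be sketched in a few lines of Python. This is an illustrative model of truth-table lookup, not a description of any vendor's actual CLB wiring. The helper names and the bit ordering (inputs[0] is the least significant address bit) are assumptions made for the example.

```python
from itertools import product


def lut(table: int, inputs) -> int:
    """Read one bit from a truth table; inputs[0] is the LSB of the address."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return (table >> addr) & 1


def lut5_from_lut4s(table32: int, inputs) -> int:
    """Build a 5-input function from two 4-input LUTs and a 2:1 mux."""
    low_half = table32 & 0xFFFF            # function when inputs[4] == 0
    high_half = (table32 >> 16) & 0xFFFF   # function when inputs[4] == 1
    a = lut(low_half, inputs[:4])          # both 4LUTs share the same four inputs
    b = lut(high_half, inputs[:4])
    return b if inputs[4] else a           # the extra mux selects on the fifth input


table = 0xDEADBEEF  # arbitrary 32-entry truth table
matches = all(
    lut5_from_lut4s(table, bits) == lut(table, bits)
    for bits in product((0, 1), repeat=5)
)
print("mux of two 4LUTs equals a direct 5LUT:", matches)
```

Because both halves share the same four inputs, the combined function needs five input pins; two independent 4LUTs would need eight.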

The issue is routing.  The pair of 4LUTs require 8 inputs while the 
single 5LUT only requires 5 inputs obviously.  Extrapolating this to the 
6LUT in the device you are using, they have provided 6 separate inputs 
to the LUT.  They have actually done this not to give you a larger LUT 
(they can always be combined easily) but to reduce the required routing. 
  So now if you want to split the 6LUT into a pair of 5LUTs (possible 
given the size of the LUT itself), there aren't enough inputs.  So 
instead it seems they give you a 4LUT and a 2LUT.  Better than nothing. 
  :)

I believe some of the Lattice devices do something like this but with 
larger LUTs as long as you can share the inputs to the two LUTs.  Or I 
may be thinking of how the add/carry thing works in their devices and I 
may be thinking of an older Altera chip, lol.

There is also a software function in most packages which can figure out 
that a given logic component is used by more than one function.  It can 
then change the net list to allow one LUT to drive both logic functions. 
  I believe they even will regroup the logic to facilitate this.  The 
down side is that it makes it harder for the placer to do its job and 
get a placement that makes fast routing possible.

> This is the MICO32 I meant:
> http://en.wikipedia.org/wiki/LatticeMico32

Geeze, all this time I was reading that as "Micro".  lol    I know this 
core is "free" as in beer, but I don't know how free it is to modify and 
distribute.

> I just got feedback in another forum that the openRisc processor was too
> limited in terms of clock speed.
> There is also an ARM clone (amber), but it seems quite big, 90 % of an LX9
> (compared to 20 and 10 % for the ZPUs)

I remember some years back a guy cloned the ARM7... until he got a call 
from someone at ARM.  Seems there was a patent on a particular feature 
in the interrupt controller (if I remember correctly) that is very hard 
to work around.  My understanding is that they explained the patent to 
him and then offered him a job... the code disappeared from the 
OpenCores web site.

BTW, never use clock speed alone as a measure of performance.  I can't 
say if the openrisc processor is fast or not.  I find it funny that you 
would consider using the ZPU if you are looking for speed.  I believe 
the ZPU is the slowest processor I have ever seen.

> I'll have a look at the YARD processor, thanks. Never heard about it
> before.
>
> For example, Ettus uses ZPUs in their SDR products, so I think I'm on the
> right track with the ZPU. It doesn't have to be perfect, still beats the
> alternative of running a separate MBED or raspberry board with a SPI link
> to the FPGA.

I'm rather surprised they are using a ZPU, but I expect it is for 
controlling the overall functionality, a bit like a front panel 
controller that would have been an 8051 some years ago.

-- 

Rick

Article: 156912
Subject: Re: Generating a desired synthesizable binary pulse train on FPGA using VHDL
From: "mnentwig" <24789@embeddedrelated>
Date: Sun, 27 Jul 2014 07:00:20 -0500

Hi,

>>  So instead it seems they give you a 4LUT and a 2LUT.  Better than
>> nothing.   :)

that's how I understand it, yes. Anyway, I'll come back to the options once
I have some code that is worth optimizing...

A genuine ARM, with the hardware multiplier option, would be nice. Those do
one 32x32=>32 bit multiplication per clock cycle. But, I think an FPGA
can't do that because I have to cascade two 18x18 multipliers and that
needs pipeline registers or a slower clock. 
So I'll use the softcore for control purposes, and do the "heavy lifting"
in RTL. Too bad, there is a lot of audio C code out there that could be
adapted.

>BTW, never use clock speed alone as a measure of performance.  I can't 
>say if the openrisc processor is fast or not.  I find it funny that you 
>would consider using the ZPU if you are looking for speed.  I believe 
>the ZPU is the slowest processor I have ever seen.

Right. The reason is simply that I want to run it synchronously with the
DSP stuff at around 100 MHz (at least unless someone comes up with a better
plan). That means, it will limit the maximum clock frequency of the whole
design.

Even if demoted to front panel controller, the ZPU would still be my choice
over the 8051 simply because it's 32 bit (got the T-shirt for "8051 front
panel control" in hand-crafted assembler, a long time ago...)

-Markus	   
					

Article: 156913
Subject: Re: Generating a desired synthesizable binary pulse train on FPGA
From: rickman <gnuarm@gmail.com>
Date: Sun, 27 Jul 2014 11:40:34 -0400

On 7/27/2014 8:00 AM, mnentwig wrote:
>
> A genuine ARM, with the hardware multiplier option, would be nice. Those do
> one 32x32=>32 bit multiplication per clock cycle. But, I think an FPGA
> can't do that because I have to cascade two 18x18 multipliers and that
> needs pipeline registers or a slower clock.

I think an ARM CPU would be rather large although they have the M1 (or 
is it the M0?) intended for FPGA use.  I wonder if anyone has cloned 
that yet?

Why would you need it to be cycle accurate?  The multiplier is already 
pipelined even if you just use one by itself.  It comes with an output 
register like the block memory so you can't send the results anywhere 
until the next clock cycle.  Using four of them to produce a 64 bit 
result and save the result in a register would take 2 clocks; one for 
the multiplies and one for the adds and save... unless you do some 
hardware register renaming... set a flag that says the output of the 
multiplier is Rxx instead of the register file.  Hmmmm... I need to 
think about that one.  It takes an extra mux which is not cheap in FPGAs 
though.  The ARM has any number of multi-clock cycle instructions, why 
couldn't the multiply be one of them?
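
The "four multipliers, then an add" scheme can be sketched arithmetically in Python. This is a minimal model under stated assumptions: unsigned 32-bit operands, split into 16-bit halves so that each partial product fits comfortably in an 18x18 multiplier. The two "stages" are only comments marking where a pipeline register could sit, and the function names are made up.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mul32x32_64(a: int, b: int) -> int:
    """Full 64-bit product of two unsigned 32-bit values from four 16x16 products."""
    al, ah = a & MASK16, (a >> 16) & MASK16
    bl, bh = b & MASK16, (b >> 16) & MASK16
    # Stage 1: four independent multiplies (one hardware multiplier each).
    ll, lh, hl, hh = al * bl, al * bh, ah * bl, ah * bh
    # Stage 2: shift and add the partial products.
    return ll + ((lh + hl) << 16) + (hh << 32)


def mul32x32_32(a: int, b: int) -> int:
    """Low 32 bits only: the ah*bh term falls entirely above bit 31."""
    al, ah = a & MASK16, (a >> 16) & MASK16
    bl, bh = b & MASK16, (b >> 16) & MASK16
    return (al * bl + ((al * bh + ah * bl) << 16)) & MASK32


x, y = 0x12345678, 0x9ABCDEF0
print(mul32x32_64(x, y) == x * y)
print(mul32x32_32(x, y) == (x * y) & MASK32)
```

In this particular decomposition the truncated 32x32=>32 result needs three of the four partial products. Signed operands would need extra handling that is not shown here.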

I have this problem in my stack CPU design.  It was originally done in 
an older part where the block RAM can be run async and so a read can be 
written to the top of stack in one clock cycle - *all* instructions are 
1 clock cycle, this is a primary design goal.  With a sync RAM the data 
is not available until the next clock cycle, so I have to find tricks to 
make it work.  One is to use two instructions to read memory, one to 
start the read and one to grab the output - repercussions for 
exceptions, now there is another register to save.  Or I have considered 
grabbing the input to the address register rather than the output and 
doing a read on every clock cycle... somewhat wasteful of power and I 
intend to use this in a low power design.

> So I'll use the softcore for control purposes, and do the "heavy lifting"
> in RTL. Too bad, there is a lot of audio C code out there that could be
> adapted.
>
>> BTW, never use clock speed alone as a measure of performance.  I can't
>> say if the openrisc processor is fast or not.  I find it funny that you
>> would consider using the ZPU if you are looking for speed.  I believe
>> the ZPU is the slowest processor I have ever seen.
>
> Right. The reason is simply that I want to run it synchronously with the
> DSP stuff at around 100 MHz (at least unless someone comes up with a better
> plan). That means, it will limit the maximum clock frequency of the whole
> design.
>
> Even if demoted to front panel controller, the ZPU would still be my choice
> over the 8051 simply because it's 32 bit (got the T-shirt for "8051 front
> panel control" in hand-crafted assembler, a long time ago...)

I wouldn't consider using an 8051 myself if there were good 
alternatives.  But I am in the stack processor crowd (which the ZPU is a 
member of oddly enough) and am happy programming in Forth or something 
like it.  I like working close to the hardware and I find it very useful 
to have a processor with all instructions 1 clock cycle long.  The ZPU 
would drive me batty and I would never want to program it in C.

-- 

Rick

Article: 156914
Subject: Primitive debuggable UART interface to a Nios within a multi-Nios
From: Ang Zhi Ping <azhiping@dso.org.sg>
Date: Mon, 28 Jul 2014 07:20:31 +0800

I am working on an IP core with a Nios controller. This IP will
eventually be integrated into a multi-Nios system. I also foresee that
this IP will not be JTAG debuggable because the integrator will be using
the JTAG facility on a higher level Nios controller.

In this case I have planned to include a UART interface, which allows
the integrator to do on-the-fly primitive debugging with the IP using a
spare serial port, while at the same time using the JTAG debugger on
other Nios controllers.

Currently this is what has been implemented. The Nios controller waits
for 3 seconds; if it receives the character 'd' within this period
it goes into diagnostic mode, otherwise it enters normal operation
without stdin and stdout. In diagnostic mode internal values are spewed
onto the console. I am planning to allow entry of an integer which
defines a bit pattern, where different bits selectively enable/disable
printing of diagnostic messages. The console also allows input of a bit
pattern which selectively modifies internal parameters.
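
The bit-pattern idea can be sketched as a single gatekeeper object, so that the firmware makes one call per message instead of scattering checks. The following is a Python model of the logic only, not Nios firmware. The channel names, the DiagConsole class and its methods are all hypothetical, and output is collected in a list rather than sent to a UART.

```python
# Hypothetical diagnostic channels, one bit each.
DIAG_TIMING = 1 << 0
DIAG_STATE = 1 << 1
DIAG_IO = 1 << 2


class DiagConsole:
    """Collects diagnostic messages whose channel bit is enabled in the mask."""

    def __init__(self):
        self.mask = 0      # all channels off until the user enters a pattern
        self.lines = []    # stands in for the serial port in this model

    def set_mask(self, text: str) -> None:
        """Parse the integer typed on the console, e.g. '5' or '0x5'."""
        self.mask = int(text, 0)

    def log(self, channel: int, message: str) -> bool:
        """Emit the message only if its channel bit is set; report whether it was."""
        if self.mask & channel:
            self.lines.append(message)
            return True
        return False


console = DiagConsole()
console.set_mask("0x5")                      # enable TIMING and IO
console.log(DIAG_TIMING, "loop took 42 us")  # emitted
console.log(DIAG_STATE, "state = IDLE")      # suppressed
print(console.lines)
```

Keeping the mask test in one place is one way to limit the clutter mentioned below; it does not answer the gdb-over-UART question.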

These modifications come at the expense of adding several alt_printf
and alt_getchar calls, which quickly clutter the Nios firmware code. Is
there any elegant method by which existing Nios firmware can be hooked
onto a debuggable framework via the UART? Even better, is there any
memory-efficient way of performing gdb over a UART without hosting a
full-blown OS on the Nios?


Article: 156915
Subject: Re: Generating a desired synthesizable binary pulse train on FPGA
From: rickman <gnuarm@gmail.com>
Date: Sun, 27 Jul 2014 21:30:56 -0400
Links: << >> << T >> << A >>

On 7/27/2014 8:00 AM, mnentwig wrote:
>
> Right. The reason is simply that I want to run it synchronously with the
> DSP stuff at around 100 MHz (at least unless someone comes up with a better
> plan). That means, it will limit the maximum clock frequency of the whole
> design.
>
> Even if demoted to front panel controller, the ZPU would still be my choice
> over the 8051 simply because it's 32 bit (got the T-shirt for "8051 front
> panel control" in hand-crafted assembler, a long time ago...)


BTW, do you know about the ZPU mailing list?

zylin-zpu mailing list
zylin-zpu@zylin.com
http://zylin.com/mailman/listinfo/zylin-zpu_zylin.com

-- 

Rick

Article: 156916
Subject: Re: Primitive debuggable UART interface to a Nios within a
From: Tim Wescott <tim@seemywebsite.please>
Date: Mon, 28 Jul 2014 00:04:14 -0500
Links: << >> << T >> << A >>

On Mon, 28 Jul 2014 07:20:31 +0800, Ang Zhi Ping wrote:

> I am working on an IP core with a Nios controller. This IP will
> eventually be integrated into a multi-Nios system. I also foresee that
> this IP will not be JTAG debuggable because the integrator will be using
> the JTAG facility on a higher level Nios controller.
> 
> In this case I have planned to include a UART interface, which allows
> the integrator to do on-the-fly primitive debugging with the IP using a
> spare serial port, while at the same time using the JTAG debugger on
> other Nios controllers.
> 
> Currently this is what has been implemented. The Nios controller waits
> for 3 seconds, where upon receipt of a character 'd' within this period
> it goes into diagnostic mode, otherwise it enters normal operation
> without stdin and stdout. In diagnostic mode internal values are spewed
> onto the console. I am planning to allow entry of an integer which
> defines a bit pattern, where different bits selectively enables/disables
> printing diagnostic messages. The console also allows input of an bit
> pattern which selectively modifies internal parameters.
> 
> These modifications comes at the expense of adding several alt_printf
> and alt_getchar which quickly clutters the Nios firmware code. Are there
> any elegant method where an existing Nios firmware can be hooked onto a
> debuggable framework via the UART? Even better, are there any memory
> efficient way of performing gdb over a UART without hosting a full blown
> OS on the Nios?
> 

Why not MUX the JTAG to the various processors, get this (presumably 
deeply buried one) debugged, and then move on?

-- 
Tim Wescott
Control system and signal processing consulting
www.wescottdesign.com

Article: 156917
Subject: Re: Primitive debuggable UART interface to a Nios within a multi-Nios
From: Ang Zhi Ping <azhiping@dso.org.sg>
Date: Mon, 28 Jul 2014 13:13:59 +0800
Links: << >> << T >> << A >>

On 28/7/2014 1:04 PM, Tim Wescott wrote:
>
> Why not MUX the JTAG to the various processors, get this (presumably
> deeply buried one) debugged, and then move on?
>

This module is more or less finalised and debugged. There are internal 
values within the hardware which are of use to the integrator who is 
debugging the top level controller.

Article: 156918
Subject: Re: Primitive debuggable UART interface to a Nios within a multi-Nios system
From: "mnentwig" <24789@embeddedrelated>
Date: Mon, 28 Jul 2014 00:15:56 -0500
Links: << >> << T >> << A >>

Hi,

what I'm doing in a similar application is to put a UART on the bus as an
addressable register, together with a four-byte FIFO. On bus read, the FIFO
is popped and empty/overflow conditions are reported in bits 30 and 31,
together with the read result in bits 7:0.

Example code is here: 
https://drive.google.com/file/d/0B1gLUU8hXL7vc0xZa1ZmMUJIbjg/edit?usp=sharing
It is for MIDI serial at 31250 baud (for a regular UART use 9600 or 115200, for example).

It is functional but there may be bugs.

This is the interesting part in zpu_top.c:
The address decoder asserts "busSel_MIDI" for a read operation, and the
result is routed via "MIDI_read" to the processor's data bus in the next
cycle.

The C code uses regular polling:
   while (1){
      u32 c = *MIDI; // MIDI is a (volatile u32 *) to the bus address
      if (c & 0x80000000){
         // bit 31: FIFO overflowed
         printf("buffer overflow!\n");
      } else if (c & 0x40000000){
         // bit 30: a byte was popped, data is in bits 7:0
         MIDI_parse(c & 0xFF);
      }
   }
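
A small sketch that separates decoding of the bus word from the bus access, so the bit layout can be checked on a PC. It assumes the layout implied by the polling loop above (bit 31 set on overflow, bit 30 set when a byte was popped, data in bits 7:0); the names uart_decode, uart_status_t and the UART_* constants are made up for illustration:

```c
#include <stdint.h>

/* Assumed register layout, taken from the polling loop in the post. */
#define UART_FLAG_OVERFLOW 0x80000000u /* bit 31 */
#define UART_FLAG_VALID    0x40000000u /* bit 30 */
#define UART_DATA_MASK     0x000000FFu /* bits 7:0 */

typedef enum {
    UART_EMPTY,   /* nothing was in the FIFO */
    UART_BYTE,    /* *out holds a received byte */
    UART_OVERFLOW /* FIFO overflowed, data was lost */
} uart_status_t;

/* Decode one 32-bit word read from the UART register.
 * Overflow is checked first, in the same order as the polling loop. */
static uart_status_t uart_decode(uint32_t word, uint8_t *out)
{
    if (word & UART_FLAG_OVERFLOW)
        return UART_OVERFLOW;
    if (word & UART_FLAG_VALID) {
        *out = (uint8_t)(word & UART_DATA_MASK);
        return UART_BYTE;
    }
    return UART_EMPTY;
}
```

On the target, the argument would be the value read once through the volatile pointer. Each read pops the FIFO, so the word must be read a single time and then decoded.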

My solution to deal with debug "printf" is  a VGA adapter on the FPGA :-)

   // ************************************************************
   // MIDI UART
   // ************************************************************      
   // 96 000 000 Hz (clock) / 31250 Hz (MIDI baudrate) = 3072 = nBitCycles
   wire [7:0] MIDI_byte;
   wire       MIDI_strobe;
   reg        MIDI_RX_r = 1;
   always @(posedge clk) MIDI_RX_r <= MIDI_RX;
   sk61_serialRx #(.nBitCycles(3072)) iMidiUart
     (.clk(clk), .in_rx(MIDI_RX_r),
      .out_byte(MIDI_byte), .out_strobe(MIDI_strobe));
   
   // ************************************************************
   // MIDI FIFO
   // ************************************************************      
   wire       MIDI_rxStrobe;
   wire [7:0] MIDI_rxData;
   serialFifo2bus iMidiFifo
     (.i_clk(clk), 
      .i_push(MIDI_strobe), 	.i_byte(MIDI_byte),
      .i_pop(busSel_MIDI), 	.o_busword(MIDI_read));
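
For other baud rates only the nBitCycles parameter needs to change. A rough C helper for the arithmetic in the comment above (96 MHz / 31250 = 3072); the function names are hypothetical, and rounding to the nearest integer is my assumption about how one would pick the divider:

```c
#include <stdint.h>

/* Clock cycles per bit, rounded to the nearest integer. */
static uint32_t baud_divider(uint32_t clk_hz, uint32_t baud)
{
    return (clk_hz + baud / 2u) / baud;
}

/* Deviation of the actual baud rate (clk_hz / divider) from the
 * nominal one, in parts per million. Positive means too fast. */
static int32_t baud_error_ppm(uint32_t clk_hz, uint32_t baud, uint32_t divider)
{
    int64_t nominal = (int64_t)divider * (int64_t)baud;
    int64_t diff = (int64_t)clk_hz - nominal;
    return (int32_t)(diff * 1000000 / nominal);
}
```

At 96 MHz this gives 10000 cycles per bit for 9600 baud (exact) and 833 for 115200 baud, where the rounding leaves an error of about 400 ppm. That is far below the few percent a UART receiver normally tolerates.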



	   
					

Article: 156919
Subject: Re: Primitive debuggable UART interface to a Nios within a multi-Nios system
From: already5chosen@yahoo.com
Date: Mon, 28 Jul 2014 03:22:28 -0700 (PDT)
Links: << >> << T >> << A >>

On Monday, July 28, 2014 2:20:31 AM UTC+3, Ang Zhi Ping wrote:
> I am working on an IP core with a Nios controller. This IP will 
> 
> eventually be integrated into a multi-Nios system. I also foresee that  
> this IP will not be JTAG debuggable because the integrator will be using  
> the JTAG facility on a higher level Nios controller. 
> 
> In this case I have planned to include a UART interface, which allows  
> the integrator to do on-the-fly primitive debugging with the IP using a  
> spare serial port, while at the same time using the JTAG debugger on 
> other Nios controllers.
>

Multiple Nios2 cores, each with its own JTAG UART, co-exist just fine on the same JTAG interface. The same applies to multiple JTAG debug modules.
The only thing that you, as designer of the module, should care about is avoiding conflicts between Nios2 CPU instance IDs. Ideally, allocation of instance IDs should be governed by the person responsible for top-level integration.

So, most likely, all your precautions with a physical UART are unnecessary.
Of course, JTAG-independent printouts can be useful for other reasons...

Article: 156920
Subject: Re: Primitive debuggable UART interface to a Nios within a multi-Nios
From: Ang Zhi Ping <azhiping@dso.org.sg>
Date: Mon, 28 Jul 2014 20:10:40 +0800
Links: << >> << T >> << A >>

On 28/7/2014 6:22 PM, already5chosen@yahoo.com wrote:
> Multiple Nios2 cores, each with its own JTAG UART, co-exist just fine on the same JTAG interface. The same applies to multiple JTAG debug modules.
> The only thing that you, as designer of the module, should care about is avoiding the conflict of Nios2 CPU instance IDs. Ideally, allocation of instance IDs should be governed by person, that is responsible for top-level integration.

I can't seem to debug two Nios processors simultaneously.

Article: 156921
Subject: Re: Primitive debuggable UART interface to a Nios within a multi-Nios system
From: already5chosen@yahoo.com
Date: Mon, 28 Jul 2014 13:14:38 -0700 (PDT)
Links: << >> << T >> << A >>

On Monday, July 28, 2014 3:10:40 PM UTC+3, Ang Zhi Ping wrote:
> On 28/7/2014 6:22 PM, already5chosen@yahoo.com wrote:
> 
> > Multiple Nios2 cores, each with its own JTAG UART, co-exist just fine on the same JTAG interface. The same applies to multiple JTAG debug modules.
> 
> > The only thing that you, as designer of the module, should care about is avoiding the conflict of Nios2 CPU instance IDs. Ideally, allocation of instance IDs should be governed by person, that is responsible for top-level integration.
> 
> 
> 
> I can't seem to debug two Nios processors simultaneously.

Did you assign different instance IDs?

I never tried to use debuggers on two Nios2 processors myself (I hate debuggers in general, so I haven't used a debugger on *one* Nios2 processor for something like 7 years), but Altera documentation claims that it should work.

I did try software download (which also uses the debugger interface) to different Nios2 processors over the same JTAG interface. It certainly works. I never tested whether it works simultaneously, because I never wanted to download simultaneously.

But all that is slightly off topic. The topic was "light" debugging with printouts. That's the method that I do like and do use regularly. Printouts over JTAG UARTs from different processors most definitely work simultaneously; there are no problems at all. Just specify the correct instance ID on the nios2-terminal command line and everything will work for you in the best possible manner.

Article: 156922
Subject: Re: Primitive debuggable UART interface to a Nios within a multi-Nios system
From: already5chosen@yahoo.com
Date: Mon, 28 Jul 2014 13:16:48 -0700 (PDT)
Links: << >> << T >> << A >>

On Monday, July 28, 2014 3:10:40 PM UTC+3, Ang Zhi Ping wrote:
> On 28/7/2014 6:22 PM, already5chosen@yahoo.com wrote:
> 
> > Multiple Nios2 cores, each with its own JTAG UART, co-exist just fine on the same JTAG interface. The same applies to multiple JTAG debug modules.
> 
> > The only thing that you, as designer of the module, should care about is avoiding the conflict of Nios2 CPU instance IDs. Ideally, allocation of instance IDs should be governed by person, that is responsible for top-level integration.
> 
> 
> 
> I can't seem to debug two Nios processors simultaneously.

P.S.
alteraforum is a much better place for asking that sort of question.

Article: 156923
Subject: Re: Primitive debuggable UART interface to a Nios within a multi-Nios system
From: "mnentwig" <24789@embeddedrelated>
Date: Mon, 28 Jul 2014 21:20:53 -0500
Links: << >> << T >> << A >>

>> the topic was "light" debugging with printouts. 

BTW my on-board VGA controller may seem a little over-the-top. The main
selling point is that it doesn't slow down the code: it's an
infinite-baudrate UART. It's surprisingly compact if I can spare one clock
and a block RAM (on Xilinx Spartan 6, haven't tried this yet on Altera).
Electrically it's not critical: patch cables to a cheap RGB resistor DAC
breakout board / "wing" work just fine at 640x480 / 25 MHz.
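
On the software side, a debug "printf" to such a text display boils down to byte writes into the character RAM. A minimal sketch, not my actual code: it assumes an 80x30 character grid (640x480 with an 8x16 font) and uses a plain array in place of the memory-mapped block RAM so that it runs on a PC; all names are made up:

```c
#include <stdint.h>
#include <string.h>

#define VGA_COLS 80 /* assumption: 640 / 8 pixel wide glyphs */
#define VGA_ROWS 30 /* assumption: 480 / 16 pixel high glyphs */

/* Stand-in for the character RAM. On hardware this would be a
 * volatile pointer to the block RAM's bus address. */
static uint8_t vga_text[VGA_ROWS * VGA_COLS];
static unsigned vga_row = 0;
static unsigned vga_col = 0;

/* Move to the start of the next line, scrolling up when at the bottom. */
static void vga_newline(void)
{
    vga_col = 0;
    if (vga_row + 1 < VGA_ROWS) {
        vga_row++;
        return;
    }
    memmove(vga_text, vga_text + VGA_COLS, (VGA_ROWS - 1) * VGA_COLS);
    memset(vga_text + (VGA_ROWS - 1) * VGA_COLS, ' ', VGA_COLS);
}

/* One character costs one memory write, with no waiting for a
 * transmitter, which is why it does not slow the code down. */
static void vga_putc(char c)
{
    if (c == '\n') {
        vga_newline();
        return;
    }
    vga_text[vga_row * VGA_COLS + vga_col] = (uint8_t)c;
    if (++vga_col == VGA_COLS)
        vga_newline();
}

static void vga_puts(const char *s)
{
    while (*s)
        vga_putc(*s++);
}
```

The scroll is the only expensive operation here (a copy of the whole text buffer); a hardware start-row register would avoid it, but that is beyond this sketch.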

	   
					

Article: 156924
Subject: Re: Primitive debuggable UART interface to a Nios within a multi-Nios
From: Ang Zhi Ping <azhiping@dso.org.sg>
Date: Tue, 29 Jul 2014 11:20:10 +0800
Links: << >> << T >> << A >>

On 29/7/2014 4:14 AM, already5chosen@yahoo.com wrote:
> 
> Did you assign different instance IDs?

Yes, different instance IDs are assigned. The JTAG UART under the Eclipse
IDE is able to tell the different Nios processors apart.

> I never tried to use debuggers on two Nios2 processors myself (I hate debuggers in general, so I didn't use debugger on *one* Nios2 processor for something like 7 years), but Altera documentation claims that it should work.

If the JTAG UART is used for stdout, the JTAG only routes the Nios being
debugged to the console. Any other Nios processors that are not being
debugged will not be able to route their stdout output to the console.
Hence this question about routing messages via a serial port.

> I did try software download (which aso uses debugger interface) to different Nios2 processors over the same JTAG interface. It certainly works. I never tested if it works simultaneously, because I never wanted to download simultaneously.

The JTAG certainly works for a multi-Nios system, but it cannot handle
stdout from multiple Nios processors.

> But all that is slightly off topic. The topic was "light" debugging with printouts. That's the method that I do like and do do regularly. Printouts over JTAG UARTs from different processor most definitely work simultaneously, there are no problems at all. Just specify correct instance ID in nios2-terminal command line and everything will work for you in the best possible manner.

Haha ok let's keep this thread on topic then.
