Hi,

here, for example, is one:
http://forum.gadgetfactory.net/index.php?/topic/2046-xthundercore-is-taking-shape/

In general, there are many CPUs but a shortage of simple (!) "Hello world" examples to actually use them without spending a week first. This blog nails it, more or less:
http://blog.tube42.se/?p=105

(That said, I managed to get the "small" variant of the ZPU in question working on a Spartan 6, here:
http://forum.gadgetfactory.net/index.php?/topic/1863-bare-metal-zpu-hello-world/
It is slow but fairly small, about 12 % on a Spartan 6 LX9.)

Another interesting project is "MinSoC". It appears to be very well maintained. A simulation worked right out of the box when I tried yesterday (it even includes its own iverilog simulator), but I wasn't able to build on Spartan 6 as the JTAG block is not supported.

A minimal OpenRISC "hello world" example could be useful for many: nothing but a processor, on-chip RAM with initial values for the program code, and an LED.

---------------------------------------
Posted through http://www.FPGARelated.com
Article: 156901
Wrong link: blog.tube42.se/?p=105
Article: 156902
Well... as fascinating as this candy business is, I was trying to link to "Tubologue | The sad state of OSS hardware (part 1)" but Usenet won't let me... Lost in quotation...
Article: 156903
Hi,

>However, this is not an efficient use of resources in an FPGA using up
>16 FFs along with the control logic, if any. If it were any larger I
>would use a direct address of an array constant would use a four bit
>counter and a single LUT used as memory.

would this still apply if my design uses proportionally more LUTs than registers?
For example, here is a synthesis report for a minimal "medium" ZPU processor on Spartan 6 LX9 (that is most enthusiastically blinking its LED as I write this):

Slice Logic Utilization:
  Number of Slice Registers:      284 out of 11,440    2%
    Number used as Flip Flops:    284
  ..
  Number of Slice LUTs:           934 out of  5,720   16%
    Number used as logic:         915 out of  5,720   15%
    Number used as Memory:          9 out of  1,440    1%

This is not to argue the point, I just want to understand the possible trade-offs. For example, I wonder if it would make sense to replace small counters with one-hot shift registers in such a situation?
Article: 156904
On 7/25/2014 11:39 AM, mnentwig wrote:
> Hi,
>
>> However, this is not an efficient use of resources in an FPGA using up
>> 16 FFs along with the control logic, if any. If it were any larger I
>> would use a direct address of an array constant would use a four bit
>> counter and a single LUT used as memory.
>
> would this still apply if my design uses proportionally more LUTs than
> registers?
> For example, here is a synthesis report for a minimal "medium" ZPU
> processor on Spartan 6 LX9 (that is most enthusiastically blinking its LED
> as I write this):
>
> Slice Logic Utilization:
> Number of Slice Registers: 284 out of 11,440 2%
> Number used as Flip Flops: 284
> ...
> Number of Slice LUTs: 934 out of 5,720 16%
> Number used as logic: 915 out of 5,720 15%
> Number used as Memory: 9 out of 1,440 1%
>
> This is not to argue the point, I just want to understand the possible
> trade-offs. For example, I wonder if it would make sense to replace small
> counters with one-hot shift registers in such a situation?

First, my comment was about going the other direction, from a long shift register to an encoded counter and memory. You are asking if it makes sense to go from a state encoded counter to a one-hot register. I don't see how that can save resources of any type. The one-hot register will need at minimum one LUT per FF.

A counter is a very efficient use of the FPGA resources; however, that is not a useful FSM. To be useful there need to be inputs, which add logic to the counter. In the simplest case this input is just a hold input, which comes free other than the logic to generate the hold signal. In a more general case the counter will need to jump around rather than just progressing through the states linearly. In this case the FSM is not just a counter anymore and the LUT count increases.

So to answer your question, "it depends". lol But in general I would not expect a one-hot implementation to use any fewer LUTs at the expense of more FFs, but it is possible.
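As a rough illustration of the flip-flop side of this trade-off (a sketch added for clarity, not taken from any poster's design): a binary-encoded counter needs ceil(log2(N)) FFs for N states, while a one-hot ring needs N. The small C model below only counts FFs and checks that a 16-state ring and a 4-bit counter walk through the same states. It says nothing about LUT usage, which depends on the control logic as described above. All function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* FFs for a binary-encoded counter with n_states states: ceil(log2(n_states)). */
unsigned binary_ffs(unsigned n_states)
{
    assert(n_states >= 1 && n_states <= (1u << 16));
    unsigned bits = 0;
    while ((1u << bits) < n_states)
        bits++;
    return bits;
}

/* FFs for a one-hot ring counter: one per state. */
unsigned one_hot_ffs(unsigned n_states)
{
    return n_states;
}

/* One clock of a 16-state one-hot ring: rotate the single set bit left. */
uint16_t one_hot_step(uint16_t state)
{
    return (uint16_t)((state << 1) | (state >> 15));
}

/* One clock of the equivalent 4-bit binary counter (wraps at 16). */
uint8_t binary_step(uint8_t count)
{
    return (uint8_t)((count + 1u) & 0x0Fu);
}

/* Returns 1 if ring and decoded counter agree for `steps` clocks, else 0. */
int counters_agree(unsigned steps)
{
    uint16_t ring = 1;   /* reset state: bit 0 is hot */
    uint8_t count = 0;
    for (unsigned i = 0; i < steps; i++) {
        if (ring != (uint16_t)(1u << count))   /* decode stands in for LUT logic */
            return 0;
        ring = one_hot_step(ring);
        count = binary_step(count);
    }
    return 1;
}
```

For 16 states this gives 4 FFs against 16, and for the 1024-state case 10 against 1024. The decode step in `counters_agree` is where the encoded version pays in LUTs instead.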
I've been watching the ZPU over the years and I would like to know what your LUT count includes. Does that include I/O such as a UART? Any idea how much is just for the CPU? Early on the ZPU people claimed a *very* low LUT count of around 500 or less, IIRC. I believe the Spartan 6 has 6-input LUTs, so your LUT count is hard to compare to the LUT counts using 4-input LUTs. Still, 900 is a fair amount more than 500. I assume you have optimized for performance at the expense of size?

--
Rick
Article: 156905
Hi,

> I don't
>see how that can save resources of any type. The one-hot register will
>need at minimum one LUT per FF.

isn't a one-hot counter just a simple ring shift register? I can build it from FFs without any further logic.
A simple experiment:

    reg [1023:0] test = 1024'd1;
    always @(posedge clk) begin
       test <= {test[1022:0], test[1023]};
       LED <= |test[1023:1];
    end

The final "or" forces (mostly) use of physical FFs instead of LUTs in shift register configuration:

  Number of Slice Registers:                1,252 out of 11,440   10%
    Number used as Flip Flops:              1,252
  Number of Slice LUTs:                       573 out of  5,720   10%
    Number used as logic:                     216 out of  5,720    3%
    Number used as Memory:                     44 out of  1,440    3%
      Number used as Shift Register:           44
    Number used exclusively as route-thrus:   313

>I've been watching the ZPU over the years and I would like to know what
>your LUT count includes.

the one in my previous mail includes only the processor with on-chip RAM and a single "GPIO" on the bus for the LED. It's the so-called "medium" variant with some options changed. I use a simple "for" loop as benchmark that controls the LED and it manages around 2M hardware writes / second.

There is also the "small" ZPU which is about half the size:

  Number of Slice Registers:   258 out of 11,440    2%
  Number of Slice LUTs:        596 out of  5,720   10%

This one includes a UART, 500 LUTs after setting options sounds correct. It is, however, very slow, maybe 10 % of "medium".

I haven't optimized the settings, for example LUT sharing might reduce size further.
There are newer variants (ZPUino, "extreme" core) that are probably faster, especially with external memory.

If anybody knows a good, free CPU, I'd love to hear about it. Those two work pretty well for me.
Faster CPUs exist, for example MICO32 was mentioned. I did some trials with that one, but it used too much space on the LX9, maybe three times as big as the "medium" ZPU if I remember correctly.
I don't use a CPU for high-performance computing, but mainly to change functionality quickly without rebuilding RTL: Compiling my test code, merging it to the bitstream and uploading takes only 750 ms,
Article: 156906
On 7/26/2014 2:24 AM, mnentwig wrote:
> Hi,
>
>> I don't
>> see how that can save resources of any type. The one-hot register will
>> need at minimum one LUT per FF.
>
> isn't a one-hot counter just a simple ring shift register? I can build it
> from FFs without any further logic.

That's only if it is a simple counter with no other transitions or controls other than an enable. Usually they need some sort of sync reset which may or may not be supported by the FF primitive without a LUT.

> A simple experiment:
>
> reg [1023:0] test = 1024'd1;
> always @(posedge clk) begin
> test <= {test[1022:0], test[1023]};
> LED <= |test[1023:1];
> end
>
> The final "or" forces (mostly) use of physical FFs instead of LUTs in shift
> register configuration
>
> Number of Slice Registers: 1,252 out of 11,440 10%
> Number used as Flip Flops: 1,252
> Number of Slice LUTs: 573 out of 5,720 10%
> Number used as logic: 216 out of 5,720 3%
> Number used as Memory: 44 out of 1,440 3%
> Number used as Shift Register: 44
> Number used exclusively as route-thrus: 313
>
>> I've been watching the ZPU over the years and I would like to know what
>> your LUT count includes.
>
> the one in my previous mail includes only the processor with on-chip RAM
> and a single "GPIO" on the bus for the LED. It's the so-called "medium"
> variant with some options changed. I use a simple "for" loop as benchmark
> that controls the LED and it manages around 2M hardware writes / second.
>
> There is also the "small" ZPU which is about half the size:
> Number of Slice Registers: 258 out of 11,440 2%
> Number of Slice LUTs: 596 out of 5,720 10%
> This one includes a UART, 500 LUTs after setting options sounds correct.
> It is, however, very slow, maybe 10 % of "medium".

Yes, this is the one that I thought was impressive in terms of the tiny size, but as you note, at a price of extreme lack of speed. I believe the slowness comes from the architecture rather than the clock being a lot slower. That is, the clock is still a reasonable speed, but it needs a lot more of them to get the work done because of having fewer data paths.

> I haven't optimized the settings, for example LUT sharing might reduce size
> further.

LUT sharing? Is that where the logic is broken into pieces which can be shared between different paths when there is some overlap? I've never bothered with that as I think the savings are typically pretty small.

> There are newer variants (ZPUino, "extreme" core) that are probably faster,
> especially with external memory.
>
> If anybody knows a good, free CPU, I'd love to hear about it. Those two work
> pretty well for me.
> Faster CPUs exist, for example MICO32 was mentioned. I did some trials with
> that one, but it used too much space on the LX9, maybe three times as big
> as the "medium" ZPU if I remember correctly.
>
> I don't use a CPU for high-performance computing, but mainly to change
> functionality quickly without rebuilding RTL: Compiling my test code,
> merging it to the bitstream and uploading takes only 750 ms,

I'm not familiar with the MICO32... do you mean the one from Lattice, maybe named MICRO32? I don't recall for sure.

Just about any standard RISC CPU will be a lot bigger than the ZPU. OpenCores has one they call OpenRISC which has been around a while. I think it is fairly large though. ZPU was designed specifically to be as small as possible for code that needs very little speed. Then they decided to develop a few faster variants which are totally binary compatible. I think they achieved their objective and I have heard of it being used in some business apps.

The other day I did see another soft core that is supported by a C compiler, at least a beta version. I don't recall the name, but I expect I could come up with it if you are interested. Everything else I have seen are stack processors intended to run a Forth-like language. That can make for a very simple machine... like the ZPU. :)

--
Rick
Article: 156907
Here is the info on the YARD-1 processor I was trying to remember. He is doing an LCC backend so it has a C compiler, albeit in the early stages still... This is the only other (than the ZPU) open source softcore CPU I know of with C support.

To: <fpga-cpu@yahoogroups.com>
From: "brimdavis@aol.com [fpga-cpu]" <fpga-cpu@yahoogroups.com>
Subject: [fpga-cpu] State of the YARD, July 2014

Another in an occasional series of updates on the YARD-1 processor.

Cleanup:
Since my last status post[1], I've made some headway in cleaning up the code and documentation; the repository now contains all the core sources and some demo designs, in addition to the cross assembler tools and ISA verification code. Things are working well enough to use for small assembly projects, although not all processor features are implemented or working yet.

Docs:
Google recently disabled the Downloads feature of Google Code, so I've added a wiki page[2] directly linking to the documentation files in the repository. I've also added some wiki pages[3] summarizing the build results for the Xilinx Spartan3 and Lattice XO2 demo designs.

ISA Changes:
Other than some minor encoding changes, the only instruction set alterations of note were the {reluctant} replacement of the nifty bit counting instructions with register-register 8|16 bit sign|zero extending MOVes to better support LCC's code generator for char and short operations on registers.

LCC:
The experimental YARD LCC port[4] now has a nearly complete (but not well tested) integer back-end, but neither floating point support nor a C library as of yet.

-Brian

[1] 2011 status post
    http://groups.yahoo.com/neo/groups/fpga-cpu/conversations/topics/3362
[2] doc wiki link
    http://code.google.com/p/yard-1/wiki/Documentation_Links
[3] build wiki links
    http://code.google.com/p/yard-1/wiki/Lattice_XP2_Brevia
    http://code.google.com/p/yard-1/wiki/Digilent_S3_Starter_Board
[4] lcc-homebrew link
    http://code.google.com/p/lcc-homebrew

--
Rick
Article: 156908
Hi,

>That's only if it is a simple counter with no other transitions or
>controls other than an enable. Usually they need some sort of sync
>reset which may or may not be supported by the FF primitive without a LUT.

thanks. Maybe I'll just leave it to the synthesis tool...

>Yes, this is the one that I thought was impressive in terms of the tiny
>size, but as you note, at a price of extreme lack of speed. I believe
>the slowness comes from the architecture rather than the clock being a
>lot slower. That is, the clock is still a reasonable speed, but it
>needs a lot more of them to get the work done because of having fewer
>data paths.

Yes, the achievable clock speed is even marginally higher for the small one (~110 MHz vs 100 MHz, possibly faster if I'd tweak the settings).
It doesn't have registers, so every operand goes to the stack, if I remember correctly. The "medium" variant has a hardware cache for the last two levels.

>LUT sharing? Is that where the logic is broken into pieces which can be
>shared between different paths when there is some overlap? I've never
>bothered with that as I think the savings are typically pretty small.

There is an option to duplicate registers to reduce routing delay. But what I meant is to put several independent logic functions into the same LUT, i.e. four-input plus two-input to make it smaller. I haven't really read the manual too carefully here. The one optimization option that I found important is pipeline register balancing.

This is the MICO32 I meant:
http://en.wikipedia.org/wiki/LatticeMico32

I just got feedback in another forum that the OpenRISC processor was too limited in terms of clock speed.
There is also an ARM clone (amber), but it seems quite big, 90 % of an LX9 (compared to 20 and 10 % for the ZPUs).

I'll have a look at the YARD processor, thanks. Never heard about it before.

For example, Ettus uses ZPUs in their SDR products, so I think I'm on the right track with the ZPU. It doesn't have to be perfect, still beats the alternative of running a separate MBED or raspberry board with a SPI link to the FPGA.

Cheers

Markus
Article: 156909
In article <lqvs62$mf1$1@dont-email.me>,
 rickman <gnuarm@gmail.com> writes:

>Seems the head gasket was installed incorrectly, not because someone
>munged up the installation. It was installed incorrectly because the
>installation procedures were wrong! Potentially they *all* could have
>been installed badly and *every generator could have failed within
>minutes of starting up*!!!

Bugs in documentation have long been an "interesting" problem in the software business area.

Many years ago, I read a neat story, probably on usenet.

Our hero was on a US destroyer in the South Pacific. He was in charge of the 5 inch guns. They worked, but weren't quite as accurate as they should have been. They even had a factory rep flown out. He didn't fix anything.

When their time was up and they were headed back home, one of his guys said, roughly, "Everything is clean and polished, how about I take a look at the gun controller?" The guy was good at that sort of stuff, so the answer was "go for it".

This was analog computer days. Picture gears all over the place, like a kid taking apart a clock.

As things were put back together, the guy turned one gear over. That fixed it. The picture in the book showed it in the wrong way.

--
These are my opinions. I hate spam.
Article: 156910
Argh/blush. Wrong newsgroup (as if you couldn't guess).

Fatfinger on my part. Sorry for the clutter.

--
These are my opinions. I hate spam.
Article: 156911
On 7/26/2014 8:21 AM, mnentwig wrote:
> Hi,
>
>> That's only if it is a simple counter with no other transitions or
>> controls other than an enable. Usually they need some sort of sync
>> reset which may or may not be supported by the FF primitive without a LUT.
>
> thanks. Maybe I'll just leave it to the synthesis tool...
>
>> Yes, this is the one that I thought was impressive in terms of the tiny
>> size, but as you note, at a price of extreme lack of speed. I believe
>> the slowness comes from the architecture rather than the clock being a
>> lot slower. That is, the clock is still a reasonable speed, but it
>> needs a lot more of them to get the work done because of having fewer
>> data paths.
>
> Yes, the achievable clock speed is even marginally higher for the small one
> (~110 MHz vs 100 MHz, possibly faster if I'd tweak the settings).
> It doesn't have registers, so every operand goes to the stack, if I
> remember correctly. The "medium" variant has a hardware cache for the last
> two levels.
>
>> LUT sharing? Is that where the logic is broken into pieces which can be
>> shared between different paths when there is some overlap? I've never
>> bothered with that as I think the savings are typically pretty small.
>
> There is an option to duplicate registers to reduce routing delay. But what
> I meant is to put several independent logic functions into the same LUT,
> i.e. four-input plus two-input to make it smaller. I haven't really read
> the manual too carefully here. The one optimization option that I found
> important is pipeline register balancing.

Ok, you are talking about something that comes with the 6 input LUTs. For many years the standard size for LUTs was 4 inputs. Xilinx used some extra logic in the CLB to allow multiple 4LUTs to be joined via another mux to create the equivalent of a 5 input LUT. So you could say they had 5LUTs for some time now which had the option of being split into a pair of 4LUTs. semantics...

The issue is routing. The pair of 4LUTs require 8 inputs while the single 5LUT only requires 5 inputs obviously. Extrapolating this to the 6LUT in the device you are using, they have provided 6 separate inputs to the LUT. They have actually done this not to give you a larger LUT (they can always be combined easily) but to reduce the required routing. So now if you want to split the 6LUT into a pair of 5LUTs (possible given the size of the LUT itself), there aren't enough inputs. So instead it seems they give you a 4LUT and a 2LUT. Better than nothing. :)

I believe some of the Lattice devices do something like this but with larger LUTs as long as you can share the inputs to the two LUTs. Or I may be thinking of how the add/carry thing works in their devices and I may be thinking of an older Altera chip, lol.

There is also a software function in most packages which can figure out that a given logic component is used by more than one function. It can then change the net list to allow one LUT to drive both logic functions. I believe they even will regroup the logic to facilitate this. The down side is that it makes it harder for the placer to do its job and get a placement that makes fast routing possible.

> This is the MICO32 I meant:
> http://en.wikipedia.org/wiki/LatticeMico32

Geeze, all this time I was reading that as "Micro". lol

I know this core is "free" as in beer, but I don't know how free it is to modify and distribute.

> I just got feedback in another forum that the OpenRISC processor was too
> limited in terms of clock speed.
> There is also an ARM clone (amber), but it seems quite big, 90 % of an LX9
> (compared to 20 and 10 % for the ZPUs).

I remember some years back a guy cloned the ARM7... until he got a call from someone at ARM. Seems there was a patent on a particular feature in the interrupt controller (if I remember correctly) that is very hard to work around. My understanding is that they explained the patent to him and then offered him a job... the code disappeared from the OpenCores web site.

BTW, never use clock speed alone as a measure of performance. I can't say if the OpenRISC processor is fast or not. I find it funny that you would consider using the ZPU if you are looking for speed. I believe the ZPU is the slowest processor I have ever seen.

> I'll have a look at the YARD processor, thanks. Never heard about it
> before.
>
> For example, Ettus uses ZPUs in their SDR products, so I think I'm on the
> right track with the ZPU. It doesn't have to be perfect, still beats the
> alternative of running a separate MBED or raspberry board with a SPI link
> to the FPGA.

I'm rather surprised they are using a ZPU, but I expect it is for controlling the overall functionality, a bit like a front panel controller that would have been an 8051 some years ago.

--
Rick
Article: 156912
Hi,

>> So instead it seems they give you a 4LUT and a 2LUT. Better than nothing. :)

that's how I understand it, yes. Anyway, I'll come back to the options once I have some code that is worth optimizing...

A genuine ARM, with the hardware multiplier option, would be nice. Those do one 32x32=>32 bit multiplication per clock cycle. But, I think an FPGA can't do that because I have to cascade two 18x18 multipliers and that needs pipeline registers or a slower clock.

So I'll use the softcore for control purposes, and do the "heavy lifting" in RTL. Too bad, there is a lot of audio C code out there that could be adapted.

>BTW, never use clock speed alone as a measure of performance. I can't
>say if the OpenRISC processor is fast or not. I find it funny that you
>would consider using the ZPU if you are looking for speed. I believe
>the ZPU is the slowest processor I have ever seen.

Right. The reason is simply that I want to run it synchronously with the DSP stuff at around 100 MHz (at least unless someone comes up with a better plan). That means, it will limit the maximum clock frequency of the whole design.

Even if demoted to front panel controller, the ZPU would still be my choice over the 8051 simply because it's 32 bit (got the T-shirt for "8051 front panel control" in hand-crafted assembler, a long time ago...)

-Markus
Article: 156913
On 7/27/2014 8:00 AM, mnentwig wrote:
>
> A genuine ARM, with the hardware multiplier option, would be nice. Those do
> one 32x32=>32 bit multiplication per clock cycle. But, I think an FPGA
> can't do that because I have to cascade two 18x18 multipliers and that
> needs pipeline registers or a slower clock.

I think an ARM CPU would be rather large although they have the M1 (or is it the M0?) intended for FPGA use. I wonder if anyone has cloned that yet?

Why would you need it to be cycle accurate? The multiplier is already pipelined even if you just use one by itself. It comes with an output register like the block memory so you can't send the results anywhere until the next clock cycle. Using four of them to produce a 64 bit result and save the result in a register would take 2 clocks; one for the multiplies and one for the adds and save... unless you do some hardware register renaming... set a flag that says the output of the multiplier is Rxx instead of the register file. Hmmmm... I need to think about that one. It takes an extra mux which is not cheap in FPGAs though.

The ARM has any number of multi-clock cycle instructions, why couldn't the multiply be one of them?

I have this problem in my stack CPU design. It was originally done in an older part where the block RAM can be run async and so a read can be written to the top of stack in one clock cycle - *all* instructions are 1 clock cycle, this is a primary design goal. With a sync RAM the data is not available until the next clock cycle, so I have to find tricks to make it work. One is to use two instructions to read memory, one to start the read and one to grab the output - repercussions for exceptions, now there is another register to save. Or I have considered grabbing the input to the address register rather than the output and doing a read on every clock cycle... somewhat wasteful of power and I intend to use this in a low power design.
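Coming back to the multiplier arithmetic discussed earlier in this reply, here is a small C model of how a 32x32 product can be assembled from 16x16 partial products, each of which would fit an 18x18 hardware multiplier. This is only a sketch of the arithmetic, assuming a plain 16-bit operand split and unsigned operands. It does not model DSP block pipelining or any particular device, and the function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Full 64-bit product from four 16x16 partial products. */
uint64_t mul32x32_full(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;

    uint64_t ll = (uint64_t)al * bl;   /* weight 2^0  */
    uint64_t lh = (uint64_t)al * bh;   /* weight 2^16 */
    uint64_t hl = (uint64_t)ah * bl;   /* weight 2^16 */
    uint64_t hh = (uint64_t)ah * bh;   /* weight 2^32 */

    /* The adder stage: in hardware this would be the second clock. */
    return ll + ((lh + hl) << 16) + (hh << 32);
}

/* Low 32 bits only (32x32=>32): the hh term falls outside the result,
   so three partial products are enough with this split. */
uint32_t mul32x32_low(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;

    /* uint32_t arithmetic wraps modulo 2^32, which is exactly what is wanted. */
    return al * bl + ((al * bh + ah * bl) << 16);
}

/* Self-check against the compiler's native multiply. */
void mul_check(uint32_t a, uint32_t b)
{
    assert(mul32x32_full(a, b) == (uint64_t)a * b);
    assert(mul32x32_low(a, b) == (uint32_t)((uint64_t)a * b));
}
```

Calling `mul_check(0xDEADBEEFu, 0x12345678u)` exercises both paths. The split between a multiply stage and an add stage mirrors the two-clock sequence described above.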
> So I'll use the softcore for control purposes, and do the "heavy lifting"
> in RTL. Too bad, there is a lot of audio C code out there that could be
> adapted.
>
>> BTW, never use clock speed alone as a measure of performance. I can't
>> say if the OpenRISC processor is fast or not. I find it funny that you
>> would consider using the ZPU if you are looking for speed. I believe
>> the ZPU is the slowest processor I have ever seen.
>
> Right. The reason is simply that I want to run it synchronously with the
> DSP stuff at around 100 MHz (at least unless someone comes up with a better
> plan). That means, it will limit the maximum clock frequency of the whole
> design.
>
> Even if demoted to front panel controller, the ZPU would still be my choice
> over the 8051 simply because it's 32 bit (got the T-shirt for "8051 front
> panel control" in hand-crafted assembler, a long time ago...)

I wouldn't consider using an 8051 myself if there were good alternatives. But I am in the stack processor crowd (which the ZPU is a member of oddly enough) and am happy programming in Forth or something like it. I like working close to the hardware and I find it very useful to have a processor with all instructions 1 clock cycle long. The ZPU would drive me batty and I would never want to program it in C.

--
Rick
Article: 156914
I am working on an IP core with a Nios controller. This IP will eventually be integrated into a multi-Nios system. I also foresee that this IP will not be JTAG debuggable because the integrator will be using the JTAG facility on a higher level Nios controller.

In this case I have planned to include a UART interface, which allows the integrator to do on-the-fly primitive debugging with the IP using a spare serial port, while at the same time using the JTAG debugger on other Nios controllers.

Currently this is what has been implemented. The Nios controller waits for 3 seconds; if it receives the character 'd' within this period it goes into diagnostic mode, otherwise it enters normal operation without stdin and stdout. In diagnostic mode internal values are spewed onto the console. I am planning to allow entry of an integer which defines a bit pattern, where different bits selectively enable/disable printing of diagnostic messages. The console also allows input of a bit pattern which selectively modifies internal parameters.

These modifications come at the expense of adding several alt_printf and alt_getchar calls, which quickly clutter the Nios firmware code. Is there any elegant method by which existing Nios firmware can be hooked onto a debuggable framework via the UART? Even better, is there any memory-efficient way of performing gdb over a UART without hosting a full-blown OS on the Nios?
Article: 156915
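Regarding the diagnostic bit pattern described in Ang Zhi Ping's post above: one common way to keep such printouts from cluttering the firmware is to route them all through a single macro that checks a channel mask. The following is a minimal sketch under stated assumptions, not the poster's code. The channel names, `diag_set_mask`, `diag_parse_mask` and the `DIAG` macro are all hypothetical, and standard `printf`/`strtoul` are used so the example runs on a host. On a Nios target one would presumably substitute the HAL's `alt_printf`, which supports only a small subset of format specifiers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical diagnostic channels, one bit each. */
enum {
    DIAG_ADC  = 1u << 0,
    DIAG_FIFO = 1u << 1,
    DIAG_CTRL = 1u << 2
};

static uint32_t diag_mask = 0;      /* all channels off by default */
static unsigned diag_count = 0;     /* number of messages actually printed */

void diag_set_mask(uint32_t mask) { diag_mask = mask; }

int diag_enabled(uint32_t channel) { return (diag_mask & channel) != 0; }

/* Parse a mask typed on the console as hexadecimal text, e.g. "5" or "0x5". */
uint32_t diag_parse_mask(const char *text)
{
    assert(text != NULL);
    return (uint32_t)strtoul(text, NULL, 16);
}

/* Single entry point for all diagnostic output. */
#define DIAG(channel, ...)                      \
    do {                                        \
        if (diag_enabled(channel)) {            \
            printf(__VA_ARGS__);                \
            diag_count++;                       \
        }                                       \
    } while (0)

/* Demo: enable ADC and CTRL (mask 0x5); returns how many messages printed. */
unsigned diag_demo(void)
{
    diag_count = 0;
    diag_set_mask(diag_parse_mask("5"));
    DIAG(DIAG_ADC,  "adc=%u\n", 42u);
    DIAG(DIAG_FIFO, "fifo level=%u\n", 3u);   /* suppressed: bit 1 is clear */
    DIAG(DIAG_CTRL, "state=%u\n", 1u);
    return diag_count;
}
```

With this arrangement the firmware body contains only `DIAG(...)` lines, and the whole facility can be compiled out by redefining the macro as empty.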
On 7/27/2014 8:00 AM, mnentwig wrote:
>
> Right. The reason is simply that I want to run it synchronously with the
> DSP stuff at around 100 MHz (at least unless someone comes up with a better
> plan). That means, it will limit the maximum clock frequency of the whole
> design.
>
> Even if demoted to front panel controller, the ZPU would still be my choice
> over the 8051 simply because it's 32 bit (got the T-shirt for "8051 front
> panel control" in hand-crafted assembler, a long time ago...)

BTW, do you know about the ZPU mailing list?

zylin-zpu mailing list
zylin-zpu@zylin.com
http://zylin.com/mailman/listinfo/zylin-zpu_zylin.com

--
Rick
Article: 156916
On Mon, 28 Jul 2014 07:20:31 +0800, Ang Zhi Ping wrote:

> I am working on an IP core with a Nios controller. This IP will
> eventually be integrated into a multi-Nios system. I also foresee that
> this IP will not be JTAG debuggable because the integrator will be using
> the JTAG facility on a higher level Nios controller.
>
> In this case I have planned to include a UART interface, which allows
> the integrator to do on-the-fly primitive debugging with the IP using a
> spare serial port, while at the same time using the JTAG debugger on
> other Nios controllers.
>
> Currently this is what has been implemented. The Nios controller waits
> for 3 seconds; if it receives the character 'd' within this period
> it goes into diagnostic mode, otherwise it enters normal operation
> without stdin and stdout. In diagnostic mode internal values are spewed
> onto the console. I am planning to allow entry of an integer which
> defines a bit pattern, where different bits selectively enable/disable
> printing of diagnostic messages. The console also allows input of a bit
> pattern which selectively modifies internal parameters.
>
> These modifications come at the expense of adding several alt_printf
> and alt_getchar calls, which quickly clutter the Nios firmware code. Is
> there any elegant method by which existing Nios firmware can be hooked
> onto a debuggable framework via the UART? Even better, is there any
> memory-efficient way of performing gdb over a UART without hosting a
> full-blown OS on the Nios?

Why not MUX the JTAG to the various processors, get this (presumably deeply buried one) debugged, and then move on?

--
Tim Wescott
Control system and signal processing consulting
www.wescottdesign.com
Article: 156917
On 28/7/2014 1:04 PM, Tim Wescott wrote:
>
> Why not MUX the JTAG to the various processors, get this (presumably
> deeply buried one) debugged, and then move on?
>

This module is more or less finalised and debugged. There are internal values within the hardware which are of use to the integrator who is debugging the top level controller.
Article: 156918
Hi,

what I'm doing in a similar application is to put a UART on the bus as an addressable register, together with a four-byte FIFO. On bus read, the FIFO is popped and empty/overflow conditions are reported in bits 30 and 31, together with the read result in bits 7:0.

Example code is here:
https://drive.google.com/file/d/0B1gLUU8hXL7vc0xZa1ZmMUJIbjg/edit?usp=sharing

It is for MIDI serial, at 31250 baud (for a regular serial port use 9600 or 115200, for example). It is functional but there may be bugs.

This is the interesting part in zpu_top.c:
The address decoder asserts "busSel_MIDI" for a read operation, and the result is routed via "MIDI_read" to the processor's data bus in the next cycle.

The C code uses regular polling:

    while (1){
      u32 c = *MIDI; // (volatile u32 *) to the bus address
      if (c & 0x80000000){
        printf("buffer overflow!\n");
      } else if (c & 0x40000000){
        MIDI_parse(c & 0xFF);
      }
    }

My solution to deal with debug "printf" is a VGA adapter on the FPGA :-)

    // ************************************************************
    // MIDI UART
    // ************************************************************
    // 96 000 000 Hz (clock) / 31250 Hz (MIDI baudrate) = 3072 = nBitCycles
    wire [7:0] MIDI_byte;
    wire MIDI_strobe;
    reg MIDI_RX_r = 1;
    always @(posedge clk) MIDI_RX_r <= MIDI_RX;
    sk61_serialRx #(.nBitCycles(3072)) iMidiUart
      (.clk(clk), .in_rx(MIDI_RX_r), .out_byte(MIDI_byte), .out_strobe(MIDI_strobe));

    // ************************************************************
    // MIDI FIFO
    // ************************************************************
    wire MIDI_rxStrobe;
    wire [7:0] MIDI_rxData;
    serialFifo2bus iMidiFifo
      (.i_clk(clk), .i_push(MIDI_strobe), .i_byte(MIDI_byte), .i_pop(busSel_MIDI), .o_busword(MIDI_read));

Article: 156919
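For the UART-behind-a-bus-register scheme in the MIDI post above, the status decoding can be pulled out of the polling loop into a small helper, which makes it testable on a host. This is a hedged sketch, not the posted driver. It assumes the bit meanings implied by the posted loop (bit 31 set means overflow, bit 30 set means a byte is present in bits 7:0), and the names `midi_decode`, `midi_poll`, `MIDI_EMPTY` and `MIDI_OVERFLOW` are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define MIDI_FLAG_OVERFLOW 0x80000000u  /* bit 31, as tested in the posted loop */
#define MIDI_FLAG_VALID    0x40000000u  /* bit 30: assumed "byte present"       */

#define MIDI_EMPTY    (-1)   /* nothing to read            */
#define MIDI_OVERFLOW (-2)   /* FIFO overflowed, data lost */

/* Decode one bus word: returns 0..255 for a received byte,
   or one of the negative status codes above. */
int midi_decode(uint32_t word)
{
    if (word & MIDI_FLAG_OVERFLOW)
        return MIDI_OVERFLOW;
    if (word & MIDI_FLAG_VALID)
        return (int)(word & 0xFFu);
    return MIDI_EMPTY;
}

/* Read the register exactly once per call, since each bus read pops the
   FIFO; the pointer must be volatile so the compiler keeps the access. */
int midi_poll(const volatile uint32_t *reg)
{
    assert(reg != 0);
    uint32_t word = *reg;
    return midi_decode(word);
}
```

In firmware, `midi_poll` would be called with the register's bus address. Keeping the decode separate means the flag handling can be checked without hardware.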
On Monday, July 28, 2014 2:20:31 AM UTC+3, Ang Zhi Ping wrote:
> I am working on an IP core with a Nios controller. This IP will
> eventually be integrated into a multi-Nios system. I also foresee that
> this IP will not be JTAG debuggable because the integrator will be using
> the JTAG facility on a higher level Nios controller.
>
> In this case I have planned to include a UART interface, which allows
> the integrator to do on-the-fly primitive debugging with the IP using a
> spare serial port, while at the same time using the JTAG debugger on
> other Nios controllers.
>

Multiple Nios2 cores, each with its own JTAG UART, co-exist just fine on the same JTAG interface. The same applies to multiple JTAG debug modules.
The only thing that you, as designer of the module, should care about is avoiding conflicts between Nios2 CPU instance IDs. Ideally, allocation of instance IDs should be governed by the person who is responsible for top-level integration.

So, most likely, all your precautions with a physical UART are unnecessary. Of course, JTAG-independent printouts can be useful for other reasons...

Article: 156920
On 28/7/2014 6:22 PM, already5chosen@yahoo.com wrote:
> Multiple Nios2 cores, each with its own JTAG UART, co-exist just fine on the same JTAG interface. The same applies to multiple JTAG debug modules.
> The only thing that you, as designer of the module, should care about is avoiding the conflict of Nios2 CPU instance IDs. Ideally, allocation of instance IDs should be governed by person, that is responsible for top-level integration.

I can't seem to debug two Nios processors simultaneously.

Article: 156921
On Monday, July 28, 2014 3:10:40 PM UTC+3, Ang Zhi Ping wrote:
> On 28/7/2014 6:22 PM, already5chosen@yahoo.com wrote:
>
> > Multiple Nios2 cores, each with its own JTAG UART, co-exist just fine on the same JTAG interface. The same applies to multiple JTAG debug modules.
>
> > The only thing that you, as designer of the module, should care about is avoiding the conflict of Nios2 CPU instance IDs. Ideally, allocation of instance IDs should be governed by person, that is responsible for top-level integration.
>
> I can't seem to debug two Nios processors simultaneously.

Did you assign different instance IDs?

I never tried to use debuggers on two Nios2 processors myself (I hate debuggers in general, so I didn't use a debugger on *one* Nios2 processor for something like 7 years), but Altera documentation claims that it should work.

I did try software download (which also uses the debugger interface) to different Nios2 processors over the same JTAG interface. It certainly works. I never tested if it works simultaneously, because I never wanted to download simultaneously.

But all that is slightly off topic. The topic was "light" debugging with printouts. That's the method that I do like and do use regularly. Printouts over JTAG UARTs from different processors most definitely work simultaneously; there are no problems at all. Just specify the correct instance ID on the nios2-terminal command line and everything will work for you in the best possible manner.

Article: 156922
On Monday, July 28, 2014 3:10:40 PM UTC+3, Ang Zhi Ping wrote:
> On 28/7/2014 6:22 PM, already5chosen@yahoo.com wrote:
>
> > Multiple Nios2 cores, each with its own JTAG UART, co-exist just fine on the same JTAG interface. The same applies to multiple JTAG debug modules.
>
> > The only thing that you, as designer of the module, should care about is avoiding the conflict of Nios2 CPU instance IDs. Ideally, allocation of instance IDs should be governed by person, that is responsible for top-level integration.
>
> I can't seem to debug two Nios processors simultaneously.

P.S.
alteraforum is a much better place for asking this sort of question.

Article: 156923
>> the topic was "light" debugging with printouts.

BTW, my on-board VGA controller may seem a little over the top. The main selling point is that it doesn't slow down the code: it's an infinite-baudrate UART.
It's surprisingly compact if I can spare one clock and a block RAM (on Xilinx Spartan 6; I haven't tried this yet on Altera). Electrically it is not critical: patch cables to a cheap RGB resistor DAC breakout board / "wing" work just fine at 640x480 / 25 MHz.

---------------------------------------
Posted through http://www.FPGARelated.com

Article: 156924
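A quick check of the 640x480 / 25 MHz figure from the post above, as a hedged sketch: the standard 640x480 at 60 Hz mode has 800 x 525 total pixel periods per frame (active plus blanking) and a nominal 25.175 MHz pixel clock. The helper name is hypothetical, and that the poster's controller uses exactly these totals is an assumption.

```c
#include <stdint.h>

/* Standard 640x480@60 frame totals, active area plus blanking.
   Assumed here; the post does not state the timing it uses. */
enum { VGA_H_TOTAL = 800, VGA_V_TOTAL = 525 };

/* Refresh rate in millihertz for a given pixel clock in Hz (truncated). */
static uint32_t vga_refresh_millihz(uint32_t pixel_clk_hz)
{
    uint64_t frame_pixels = (uint64_t)VGA_H_TOTAL * VGA_V_TOTAL; /* 420000 */
    return (uint32_t)(((uint64_t)pixel_clk_hz * 1000u) / frame_pixels);
}
```

A 25 MHz pixel clock gives about 59.52 Hz against 59.94 Hz for the nominal 25.175 MHz, a difference of roughly 0.7 %, which monitors generally tolerate.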
On 29/7/2014 4:14 AM, already5chosen@yahoo.com wrote:
>
> Did you assign different instance IDs?

Yes, different instance IDs are assigned. The JTAG UART under the Eclipse IDE is able to tell the different Nios processors apart.

> I never tried to use debuggers on two Nios2 processors myself (I hate debuggers in general, so I didn't use debugger on *one* Nios2 processor for something like 7 years), but Altera documentation claims that it should work.

If the JTAG UART is used for stdout, the JTAG only routes the Nios being debugged to the console. Any other Nios processors that are not being debugged will not be able to route their stdout output to the console. Hence this question about routing messages via a serial port.

> I did try software download (which also uses debugger interface) to different Nios2 processors over the same JTAG interface. It certainly works. I never tested if it works simultaneously, because I never wanted to download simultaneously.

The JTAG certainly works for a multi-Nios system, but it cannot handle stdout from multiple Nios processors.

> But all that is slightly off topic. The topic was "light" debugging with printouts. That's the method that I do like and do do regularly. Printouts over JTAG UARTs from different processor most definitely work simultaneously, there are no problems at all. Just specify correct instance ID in nios2-terminal command line and everything will work for you in the best possible manner.

Haha ok, let's keep this thread on topic then.