On Friday, September 25, 2015 at 4:40:18 PM UTC-4, kaz wrote:
> > Hi,
> >
> > When I read a tutorial on FIR implementation on FPGA, I am not clear about
> > "partial results can be used for many multiplications (regardless of
> > symmetry)". That slide may be based on multipliers built from logic cells
> > in the FPGA, not a dedicated MAC in the FPGA. Anyhow, I don't know why
> > "partial results can be used for many multiplications (regardless of
> > symmetry)". I can only think of saving 50% of the multipliers by taking
> > advantage of the symmetric FIR coefficients.
> >
> > Could you tell me how to understand the partial results?
> >
> > Thanks,
>
> can we read that tutorial?
>
> Kaz
> ---------------------------------------
> Posted through http://www.FPGARelated.com

Here is the link:
https://www.google.ca/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CB0QFjAAahUKEwi_vKXaoZPIAhVRB5IKHZbHBTk&url=http%3A%2F%2Fcct.cnes.fr%2Fsystem%2Ffiles%2Fcnes_cct%2F459-mce%2Fpublic%2F06_MVD_%2520FIR_Design.pdf&usg=AFQjCNHDrIXK_J6WMErALOhKYrGsxLFg6w

Thanks,

Article: 158226
On 9/25/2015 5:09 PM, Jon Elson wrote:
> I have a product series that has mostly moved to Spartan 3A, but one member
> of the family is still using up boards made some time ago with the
> Spartan2E. I just found a mistake in all of the designs -- left out
> synchronizers on one input!
>
> I fixed the Spartan3A configurations, but no longer have a version of ISE 10
> running. Looking through some files, I see there is a directory on my ISE
> 14.7 install, ISE_DS/ISE/spartan2e, so it seems at least some of the files
> needed to synthesize for the Spartan2E are there, but of course the main GUI
> doesn't allow you to select that family. (This is on a Linux system.)
>
> Does anyone know if Spartan2E can be synthesized on ISE 14.7 by re-enabling
> that family? I think I can probably boot up one of the archived hard drives
> that had ISE 10 on it if I have to, but that would be a bit of a pain. This
> should be a one-time need, just to get this one PROM file corrected.
>
> Thanks!
>
> Jon

I'm pretty sure that any Spartan2E info still in the latest ISE is only for
programming (BSDL files, etc.) and you can't target those parts in 14.7 (or
anything after 10.1.03).

--
Gabor

Article: 158227
Galina Szakacs wrote:
>
> I'm pretty sure that any Spartan2E info still in the latest ISE is only
> for programming (BSDL files, etc.) and you can't target those parts in
> 14.7 (or anything after 10.1.03).
>
You certainly can't select them in the GUI. Yes, impact/BSDL may be the
reason these files remain there.

Thanks,

Jon

Article: 158228
On 9/25/2015 4:10 PM, fl wrote:
> Hi,
>
> When I read a tutorial on FIR implementation on FPGA, I am not clear about
> "partial results can be used for many multiplications (regardless of
> symmetry)". That slide may be based on multipliers built from logic cells
> in the FPGA, not a dedicated MAC in the FPGA. Anyhow, I don't know why
> "partial results can be used for many multiplications (regardless of
> symmetry)". I can only think of saving 50% of the multipliers by taking
> advantage of the symmetric FIR coefficients.
>
> Could you tell me how to understand the partial results?

They are talking about an extreme level of optimization by sharing partial
products between multiplies. Trouble is, each multiply is by a different
coefficient *and* a different data value. But in each successive clock cycle
the data moves to the next coefficient, so if any of the bits of the
coefficients match, the result of the previous partial product can just be
shifted into the appropriate location in the adjacent product calculation.
It would be a bit tortuous to code and would nullify the utility of the hard
multipliers available in many FPGAs. It might be worthwhile to do if you are
designing an ASIC though.

--

Rick

Article: 158229
On 9/26/2015 12:06 AM, rickman wrote:
> On 9/25/2015 4:10 PM, fl wrote:
>> When I read a tutorial on FIR implementation on FPGA, I am not clear
>> about "partial results can be used for many multiplications (regardless
>> of symmetry)". [...]
>
> They are talking about an extreme level of optimization by sharing
> partial products between multiplies. [...] It might be worthwhile to do
> if you are designing an ASIC though.

I posted this before I read your link. I assumed right, but I didn't see
the block diagram which shows all the multiplies happening on the same data
at the same time. I've written FIR filters before, I should have remembered
this. So the individual partial products can be shared across all the
multiplies and added appropriately. I expect this assumes fixed
coefficients, which naturally make multipliers simpler.

--

Rick

Article: 158230
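To make the shared-partial-result idea concrete, here is a minimal VHDL
sketch of two constant multipliers sharing a common term. The coefficient
values 5 and 13 are illustrative, not taken from the linked slides; since
13 = 5 + 8, the x*5 partial result is computed once and reused:

    -- x*5 and x*13 built from shifts and adds, sharing the x*5 term.
    -- Illustrative coefficients and widths; a sketch only.
    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity share_pp is
      port (x   : in  signed(15 downto 0);
            x5  : out signed(19 downto 0);
            x13 : out signed(19 downto 0));
    end entity;

    architecture rtl of share_pp is
      signal xe, p5 : signed(19 downto 0);
    begin
      xe  <= resize(x, 20);
      p5  <= shift_left(xe, 2) + xe;     -- x * 5   (binary 101)
      x5  <= p5;
      x13 <= p5 + shift_left(xe, 3);     -- x * 13  (binary 1101) = x*5 + x*8
    end architecture;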
On 9/26/15 12:06 AM, rickman wrote:
> On 9/25/2015 4:10 PM, fl wrote:
>> When I read a tutorial on FIR implementation on FPGA, I am not clear
>> about "partial results can be used for many multiplications (regardless
>> of symmetry)". [...]
>
> They are talking about an extreme level of optimization by sharing
> partial products between multiplies. [...] It might be worthwhile to do
> if you are designing an ASIC though.

A simple, and useful, transformation for a FIR or IIR filter on FPGAs is to
switch from using one big summing node, with a series of delays
before/after with tap-offs and multiplies, to having a single node feed
forward/back to a series of nodes with simple adders. Since with the FPGA
the registers at the outputs are free, this is the most efficient format.

It also means that if the coefficients are constants, you have a
possibility of optimizing some of the partial products if building explicit
multipliers.

Article: 158231
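A minimal VHDL sketch of the transposed-form structure described above: the
input fans out to every tap, and each product is added into a register
chain, so each adder gets a free output register. The entity name, widths,
and placeholder coefficients are illustrative assumptions, not code from
the post:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity fir_transposed is
      port (clk : in  std_logic;
            x   : in  signed(15 downto 0);
            y   : out signed(35 downto 0));
    end entity;

    architecture rtl of fir_transposed is
      constant N : positive := 4;
      type coef_t is array (0 to N-1) of signed(15 downto 0);
      constant COEF : coef_t := (to_signed(7, 16), to_signed(-3, 16),
                                 to_signed(-3, 16), to_signed(7, 16));
      type acc_t is array (0 to N-1) of signed(35 downto 0);
      signal acc : acc_t := (others => (others => '0'));
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          -- Last tap starts the chain: product only.
          acc(N-1) <= resize(x * COEF(N-1), 36);
          -- Every other tap: product plus the previous stage's register.
          for i in N-2 downto 0 loop
            acc(i) <= resize(x * COEF(i), 36) + acc(i+1);
          end loop;
        end if;
      end process;
      y <= acc(0);
    end architecture;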
It would appear there are very similar resource needs for either RISC or
stack/accumulator architectures when both are of the "load/store"
classification. Herein, the same multi-port LUT RAM serves for either the
RISC register file or dual stacks, with the DSP for multiply and block RAM
for main memory. "Load/store" refers to using distinct instructions for
moving data between LUT RAM and block RAM.

Has someone studied this situation?
Would it appear that stack/accumulator program code is denser?
Would it appear that multiple instruction issue is simpler with RISC?

Jim Brakefield

Article: 158232
On 9/26/2015 2:07 PM, jim.brakefield@ieee.org wrote:
> It would appear there are very similar resource needs for either RISC or
> stack/accumulator architectures when both are of the "load/store"
> classification. [...]
>
> Has someone studied this situation?
> Would it appear that stack/accumulator program code is denser?
> Would it appear that multiple instruction issue is simpler with RISC?

I've done a little investigation and the instruction set for a stack
processor was not much denser than the instruction set for the RISC CPU I
compared it to. I don't recall which one it was.

A lot depends on the code you use for comparison. I was using loops that
move data. Many stack processors have some levels of inefficiency because
of the juggling of the stack required in some code. Usually proponents say
the code can be written to reduce the juggling of operands, which I have
found to be mostly true. If you code to reduce the parameter juggling,
stack processors can be somewhat more efficient in terms of code space
usage.

I have looked at a couple of things as alternatives. One is to use VLIW to
allow as much parallelism as possible among the execution units within the
processor, namely the data unit, address unit, and instruction unit. This
presents some inherent inefficiency in that a fixed-size instruction field
is used to control the instruction unit when most IU instructions are just
"next", for example. But it allows both the address unit and the data unit
to be doing work at the same time, for doing things like moving data
to/from memory and counting a loop iteration, for example.

Another potential stack optimization I have looked at is combining register
and stack concepts by allowing very short offsets from top of stack to be
used for a given operand, along with variable-size stack adjustments. I
didn't pursue this very far but I think it has the potential of virtually
eliminating operand juggling, making a stack processor much faster. I'm not
sure of the effect on code size optimization because of the larger
instruction size.

--

Rick

Article: 158233
On Sunday, September 27, 2015 at 3:37:24 AM UTC+9:30, jim.bra...@ieee.org wrote:
>
> Has someone studied this situation?
> Would it appear that stack/accumulator program code is denser?
> Would it appear that multiple instruction issue is simpler with RISC?
>

I worked with the 1980's Lilith computer and its Modula-2 compiler, which
used a stack-based architecture. Christian Jacobi includes a detailed
analysis of the code generated in his dissertation titled "Code Generation
and the Lilith Architecture". You can download a copy from my website:

http://www.cfbsoftware.com/modula2/

I am currently working on the 2015 RISC equivalent - the FPGA RISC5 Oberon
compiler used in Project Oberon:

http://www.projectoberon.com

The code generation is described in detail in the included documentation.

I have both systems in operation and have some very similar test programs
for both. I'll experiment to see if the results give any surprises. Any
comparison would have to take into account the fact that the Lilith was a
16-bit architecture whereas RISC5 is 32-bit, so it might be tricky.

Regards,
Chris Burrows
CFB Software
http://www.astrobe.com

Article: 158234
On Fri, 25 Sep 2015 15:47:29 -0700, fl wrote:
> On Friday, September 25, 2015 at 4:40:18 PM UTC-4, kaz wrote:
>> > When I read a tutorial on FIR implementation on FPGA, I am not clear
>> > about "partial results can be used for many multiplications
>> > (regardless of symmetry)". [...]
>>
>> can we read that tutorial?
>>
>> Kaz
>
> Here is the link:
> https://www.google.ca/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CB0QFjAAahUKEwi_vKXaoZPIAhVRB5IKHZbHBTk&url=http%3A%2F%2Fcct.cnes.fr%2Fsystem%2Ffiles%2Fcnes_cct%2F459-mce%2Fpublic%2F06_MVD_%2520FIR_Design.pdf&usg=AFQjCNHDrIXK_J6WMErALOhKYrGsxLFg6w
>
> Thanks,

Well, the guy throws out one unfounded statement and then never supports
it. In a live presentation you could raise your hand and ask about it. Can
you send him an email?

I can see ways that, given a predetermined set of coefficients, you may be
able to get the gate count down, but that's not really within the scope of
the talk.

I suspect it's an editing turd -- he had something brilliant in a prior
version of the presentation which he either found out was unfounded, or
which he didn't have time to present for this particular talk, so he set
out to edit all of it out but he left this one little bit.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Article: 158235
On Sun, 27 Sep 2015 15:08:02 -0500, Tim Wescott wrote:
> Well, the guy throws out one unfounded statement and then never supports
> it. In a live presentation you could raise your hand and ask about it.
> Can you send him an email?
>
> I can see ways that, given a predetermined set of coefficients, you may
> be able to get the gate count down, but that's not really within the
> scope of the talk.
>
> I suspect it's an editing turd -- he had something brilliant in a prior
> version of the presentation which he either found out was unfounded, or
> which he didn't have time to present for this particular talk, so he set
> out to edit all of it out but he left this one little bit.

Just a note: since nearly all FIR filters are symmetrical around some
point, it would be interesting to see how much the area would increase or
decrease if you insisted on that, then reduced the multipliers by a factor
of two at the cost of having one more adder per multiplier.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Article: 158236
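The "one more adder per multiplier" is a pre-adder exploiting the
coefficient symmetry c(k) = c(N-1-k): the two samples that share a
coefficient are summed first and multiplied once, so N/2 multipliers
replace N. A hedged VHDL sketch, assuming an even tap count; the entity
name, widths, and placeholder coefficients are illustrative:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity fir_sym is
      port (clk : in  std_logic;
            x   : in  signed(15 downto 0);
            y   : out signed(35 downto 0));
    end entity;

    architecture rtl of fir_sym is
      constant N : positive := 8;  -- even tap count, symmetric coefficients
      type coef_t is array (0 to N/2-1) of signed(15 downto 0);
      constant COEF : coef_t := (others => to_signed(1, 16)); -- placeholders
      type dly_t is array (0 to N-1) of signed(15 downto 0);
      signal dly : dly_t := (others => (others => '0'));
    begin
      process (clk)
        variable sum : signed(35 downto 0);
      begin
        if rising_edge(clk) then
          dly <= x & dly(0 to N-2);  -- advance the tapped delay line
          sum := (others => '0');
          for k in 0 to N/2-1 loop
            -- Pre-add the two samples sharing coefficient k, multiply once.
            sum := sum + resize((resize(dly(k), 17) + resize(dly(N-1-k), 17))
                                * COEF(k), 36);
          end loop;
          y <= sum;
        end if;
      end process;
    end architecture;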
On 9/27/2015 4:08 PM, Tim Wescott wrote:
> Well, the guy throws out one unfounded statement and then never supports
> it. In a live presentation you could raise your hand and ask about it.
> Can you send him an email?
>
> I can see ways that, given a predetermined set of coefficients, you may
> be able to get the gate count down, but that's not really within the
> scope of the talk.

I think it is clear that he is talking about a hard-coded set of
coefficients. If the coefficients are variables in registers, there is no
efficient way to optimize the multipliers. With fixed coefficients and
multipliers in the fabric, this would be an important optimization. He
discusses using fabric for multipliers in later slides. It is not
unreasonable to expect the tools to do this optimization automatically.

--

Rick

Article: 158237
On Sun, 27 Sep 2015 18:28:12 -0400, rickman wrote:
> I think it is clear that he is talking about a hard-coded set of
> coefficients. If the coefficients are variables in registers, there is
> no efficient way to optimize the multipliers. With fixed coefficients
> and multipliers in the fabric, this would be an important optimization.
> He discusses using fabric for multipliers in later slides. It is not
> unreasonable to expect the tools to do this optimization automatically.

I totally missed that -- yes, one would hope that in 2015 the tools would
be able to figure out how to optimize fixed multipliers.

This is a tangent, but it makes me wonder -- I saw a paper ages ago that
was basically saying that if addition and subtraction are equally costly,
then you can optimize a multiplication by using both -- i.e., if you're
multiplying (x) by 11110001111b, then you can either do eight adds, or you
can do (x << 11) - (x << 7) + (x << 4) - x, for a 4x savings.

So -- do modern optimizers do this when multiplying by fixed coefficients,
or not?

--
www.wescottdesign.com

Article: 158238
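For the record, the decomposition works out: 11110001111b = 1935, and
2^11 - 2^7 + 2^4 - 1 = 2048 - 128 + 16 - 1 = 1935, so three add/subtract
operations replace the seven adders a plain shift-and-add over eight
one-bits would need. This is the canonic signed-digit trick. A minimal VHDL
sketch with illustrative widths:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Constant multiply by 1935 using the signed-digit form above.
    entity mul1935 is
      port (x : in  signed(15 downto 0);
            y : out signed(27 downto 0));
    end entity;

    architecture rtl of mul1935 is
      signal xe : signed(27 downto 0);
    begin
      xe <= resize(x, 28);
      y  <= shift_left(xe, 11) - shift_left(xe, 7) + shift_left(xe, 4) - xe;
    end architecture;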
On 9/27/2015 6:44 PM, Tim Wescott wrote:
> I totally missed that -- yes, one would hope that in 2015 the tools
> would be able to figure out how to optimize fixed multipliers.
>
> This is a tangent, but it makes me wonder -- I saw a paper ages ago that
> was basically saying that if addition and subtraction are equally
> costly, then you can optimize a multiplication by using both -- i.e., if
> you're multiplying (x) by 11110001111b, then you can either do eight
> adds, or you can do (x << 11) - (x << 7) + (x << 4) - x, for a 4x
> savings.
>
> So -- do modern optimizers do this when multiplying by fixed
> coefficients, or not?

I can't imagine they wouldn't. But mostly, multiplications of any type are
done using the hardware multipliers available in the vast majority of
FPGAs.

--

Rick

Article: 158239
On Saturday, September 26, 2015 at 3:02:27 PM UTC-5, rickman wrote:
> I have looked at a couple of things as alternatives. One is to use VLIW
> to allow as much parallelism as possible among the execution units
> within the processor, namely the data unit, address unit, and
> instruction unit.

Have considered multiple stacks as a form of VLIW: each stack having its
own part of the VLIW instruction, or, if nothing to do, providing future
immediates for any of the other stacks' instructions.

> Another potential stack optimization I have looked at is combining
> register and stack concepts by allowing very short offsets from top of
> stack to be used for a given operand, along with variable-size stack
> adjustments. I didn't pursue this very far but I think it has the
> potential of virtually eliminating operand juggling, making a stack
> processor much faster.

Also, this is a way to improve the processing rate, as there are fewer
instructions than in "pure" stack code (each instruction has a
stack/accumulator operation and a small offset for the other operand).
While one is at it, one can add various instruction bits for "return",
stack/accumulator mode, replace operation, stack pointer selector, ...

Personally, I don't have hard numbers for any of this (there are open
source stack machines with small offsets and various instruction bits;
what is needed is compilers so that comparisons can be done). And I don't
want to duplicate any work (AKA research) that has already been done.

Jim Brakefield

Article: 158240
On Saturday, September 26, 2015 at 8:19:29 PM UTC-5, cfbso...@gmail.com wrote:
> I worked with the 1980's Lilith computer and its Modula-2 compiler,
> which used a stack-based architecture. [...]
>
> Any comparison would have to take into account the fact that the Lilith
> was a 16-bit architecture whereas RISC5 is 32-bit, so it might be
> tricky.

And in the 1980s main memory access time was a smaller multiple of the
clock period than with today's DRAMs. However, the main memory for the
RISC5 FPGA card is asynchronous static RAM with a fast access time, and
comparable to the main memory of the Lilith?

Jim Brakefield

Article: 158241
On 9/27/2015 8:30 PM, jim.brakefield@ieee.org wrote:
> Have considered multiple stacks as a form of VLIW: each stack having its
> own part of the VLIW instruction, or, if nothing to do, providing future
> immediates for any of the other stacks' instructions.

I assume you mean two data stacks? I was trying hard not to expand on the
hardware significantly. The common stack machine typically has two stacks,
one for data and one for return addresses. In Forth the return stack is
also used for loop counting. My derivation uses the return stack for
addresses such as memory accesses as well as jumps/calls, so I call it the
address stack. This lets you do minimal arithmetic (loop counting and
incrementing addresses) and reduces stack ops on the data stack, such as
the two drops required for a memory write.

> Also, this is a way to improve the processing rate, as there are fewer
> instructions than in "pure" stack code (each instruction has a
> stack/accumulator operation and a small offset for the other operand).
> While one is at it, one can add various instruction bits for "return",
> stack/accumulator mode, replace operation, stack pointer selector, ...

Yes, returns are common so it can be useful to provide a minimal
instruction overhead for that. The other things can require extra hardware.

> Personally, I don't have hard numbers for any of this (there are open
> source stack machines with small offsets and various instruction bits;
> what is needed is compilers so that comparisons can be done). And I
> don't want to duplicate any work (AKA research) that has already been
> done.
>
> Jim Brakefield

--

Rick

Article: 158242
On Sunday, September 27, 2015 at 10:20:39 PM UTC-5, rickman wrote:

> I assume you mean two data stacks?

Yes, in particular integer arithmetic on one and floating point on the
other.

> My derivation uses the return stack for addresses such as memory
> accesses as well as jumps/calls, so I call it the address stack.

OK

> I was trying hard not to expand on the hardware significantly.
> The other things can require extra hardware.

With FPGA 6LUTs one can have several read ports (4LUT RAM can do it also,
it's just not as efficient). At one operation per clock, and mapping both
data and address stacks to the same LUT RAM, one has two ports for operand
reads, one port for the result write, and one port for the "return"
address read. Just about any stack or accumulator operation that fits
these constraints is possible with appropriate instruction decode and ALU.
The SWAP operation requires two writes, so one would need to make TOS a
separate register to do it in one clock (other implementations are
possible using two multiport LUT RAMs).

Jim

Article: 158243
On 9/28/2015 12:31 AM, jim.brakefield@ieee.org wrote:
> With FPGA 6LUTs one can have several read ports (4LUT RAM can do it
> also, it's just not as efficient). At one operation per clock, and
> mapping both data and address stacks to the same LUT RAM, one has two
> ports for operand reads, one port for the result write, and one port for
> the "return" address read. Just about any stack or accumulator operation
> that fits these constraints is possible with appropriate instruction
> decode and ALU. The SWAP operation requires two writes, so one would
> need to make TOS a separate register to do it in one clock (other
> implementations are possible using two multiport LUT RAMs).

I used a TOS register for each stack and used a write port and read port
for each stack in one block RAM. The write/read ports share the address. A
read happens on each cycle automatically, and in all the parts I have used
that can be set so the data written in a cycle shows up on the read port,
so it is the next-on-stack at all times.

Managing the stack pointers can get a bit complex if an effort to keep it
simple is not made. As it was, the stack pointer was in the critical
timing path, which ended in the flag registers. The stack pointers set
error flags in the CPU status register for over- and underflow. I thought
this would be useful for debugging, but there are likely ways to minimize
the timing overhead.

--

Rick

Article: 158244
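A hedged VHDL sketch of the TOS-register-plus-RAM arrangement rickman
describes: the top of stack lives in a register, the next-on-stack is read
from a synchronous RAM every cycle, and on a push the old TOS bypasses the
RAM write so it is immediately visible as NOS. All names, widths, and the
depth are illustrative assumptions, not his actual design:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity stack16 is
      port (clk  : in  std_logic;
            push : in  std_logic;
            pop  : in  std_logic;
            din  : in  signed(15 downto 0);
            tos  : out signed(15 downto 0);
            nos  : out signed(15 downto 0));
    end entity;

    architecture rtl of stack16 is
      type ram_t is array (0 to 255) of signed(15 downto 0);
      signal ram   : ram_t := (others => (others => '0'));
      signal sp    : unsigned(7 downto 0) := (others => '0'); -- NOS address
      signal tos_r : signed(15 downto 0) := (others => '0');
      signal nos_r : signed(15 downto 0) := (others => '0');
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if push = '1' then
            ram(to_integer(sp + 1)) <= tos_r; -- old TOS goes into the RAM
            nos_r <= tos_r;                   -- bypass: written word is NOS
            tos_r <= din;
            sp    <= sp + 1;
          elsif pop = '1' then
            tos_r <= nos_r;                   -- NOS becomes the new TOS
            nos_r <= ram(to_integer(sp - 1)); -- refill NOS from the RAM
            sp    <= sp - 1;
          else
            nos_r <= ram(to_integer(sp));     -- keep the NOS register fresh
          end if;
        end if;
      end process;
      tos <= tos_r;
      nos <= nos_r;
    end architecture;

With both top items registered, SWAP only exchanges tos_r and nos_r, plus
one RAM write to keep ram(sp) in step, which matches Jim's point about
needing TOS in a separate register to do it in one clock.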
On Monday, September 28, 2015 at 10:49:47 AM UTC+9:30, jim.bra...@ieee.org wrote:
> And in the 1980s main memory access time was a smaller multiple of the
> clock period than with today's DRAMs. However, the main memory for the
> RISC5 FPGA card is asynchronous static RAM with a fast access time, and
> comparable to the main memory of the Lilith?

Rather than trying to paraphrase the information and risk getting it wrong,
I refer you to a detailed description of the Lilith memory organisation in
the 'Lilith Computer Hardware Manual'. You can download a copy of this and
several other related documents from BitSavers:

http://www.bitsavers.org/pdf/eth/lilith/

Regards,
Chris

Article: 158245
Jon Elson wrote:
> Galina Szakacs wrote:
>
>> I'm pretty sure that any Spartan2E info still in the latest ISE is only
>> for programming (BSDL files, etc.) and you can't target those parts in
>> 14.7 (or anything after 10.1.03).
>
> You certainly can't select them in the GUI. Yes, impact/BSDL may be the
> reason these files remain there.

Well, it wasn't that big of a deal. Downloaded 10.1.03, installed it,
re-synthesized the corrected project, and it seems to work fine. Still a
bit more testing to do to be sure there isn't any unexpected behavior, but
a quick check looks OK.

Thank goodness Xilinx still has these legacy versions available!

Jon

Article: 158246
Hi,

Lately I have spent a lot of time on the development of quite complex
high-speed data processing systems in FPGA. They all had pipeline
architectures, and data were processed in parallel in multiple pipelines
with different latencies.

The worst thing was that those latencies kept changing during development.
For example, some operations were performed by blocks with a tree
structure, so the number of levels depended on the number of inputs handled
by each node. The number of inputs in each node was varied to find an
acceptable balance between the number of levels and the maximum clock
speed. I also had to add some pipeline registers to improve timing.

The entire designs were written in pure VHDL, so I had to adjust latencies
manually to ensure that data coming from different paths arrive at the next
block in the same clock cycle. It was really a nightmare, so I dreamed
about an automated way to ensure proper equalization of latencies.

After some work I have elaborated a solution which I'd like to share with
the community. It is available under the BSD license on the OpenCores
website: http://opencores.org/project,lateq . The paper with a detailed
description is available on arXiv.org: http://arxiv.org/abs/1509.08111 .

I'll appreciate any comments. I hope that the proposed method will be
useful for others.

With best regards,
Wojtek

Article: 158247
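For readers who have not hit this problem: the manual workaround that such
a tool automates is a per-path delay line whose length must be retuned
every time a parallel path's latency changes. A hedged VHDL sketch; the
generic and port names are illustrative, and this is not code from the
lateq package:

    library ieee;
    use ieee.std_logic_1164.all;

    -- Pads the shorter of two parallel pipelines by LAT clock cycles so
    -- both results reach the merging block in the same cycle.
    entity eq_delay is
      generic (LAT : natural  := 3;    -- latency difference to compensate
               W   : positive := 16);
      port (clk : in  std_logic;
            d   : in  std_logic_vector(W-1 downto 0);
            q   : out std_logic_vector(W-1 downto 0));
    end entity;

    architecture rtl of eq_delay is
      type pipe_t is array (0 to LAT) of std_logic_vector(W-1 downto 0);
      signal pipe : pipe_t;
    begin
      pipe(0) <= d;
      process (clk)
      begin
        if rising_edge(clk) then
          pipe(1 to LAT) <= pipe(0 to LAT-1);  -- LAT register stages
        end if;
      end process;
      q <= pipe(LAT);
    end architecture;

When a tree block gains or loses a level, only the LAT value at the
instantiation changes; the pain described above is keeping dozens of such
values consistent by hand across every parallel path.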
wzab01@gmail.com wrote:
> Lately I have spent a lot of time on the development of quite complex
> high-speed data processing systems in FPGA. They all had pipeline
> architectures, and data were processed in parallel in multiple pipelines
> with different latencies.
>
> The worst thing was that those latencies kept changing during
> development. For example, some operations were performed by blocks with
> a tree structure, so the number of levels depended on the number of
> inputs handled by each node. The number of inputs in each node was
> varied to find an acceptable balance between the number of levels and
> the maximum clock speed. I also had to add some pipeline registers to
> improve timing.

I have heard that some synthesis software now knows how to move around
pipeline registers to optimize timing. I haven't tried using the feature
yet, though.

I think it can move registers, but maybe not add them. You might need
enough registers in place for it to move them around.

I used to work on systolic arrays, which are really just very long
(hundreds or thousands of stages) pipelines. It is pretty hard to hand
optimize them that long.

-- glen

Article: 158248
On Tuesday, September 29, 2015 at 07:49:09 UTC+1, glen herrmannsfeldt wrote:
> I have heard that some synthesis software now knows how to move around
> pipeline registers to optimize timing. I haven't tried using the feature
> yet, though.
>
> I think it can move registers, but maybe not add them. You might need
> enough registers in place for it to move them around.
>
> I used to work on systolic arrays, which are really just very long
> (hundreds or thousands of stages) pipelines. It is pretty hard to hand
> optimize them that long.

Yes, of course the pipeline registers may be moved (e.g. using the
"retiming" feature). I usually keep this option switched on for
implementation.

My method only ensures that the number of pipeline stages is the same in
all parallel paths. And keeping track of that was really a huge problem in
bigger designs.

--
Wojtek

Article: 158249
> On Tuesday, September 29, 2015 at 07:49:09 UTC+1, glen herrmannsfeldt
> wrote:
>> I have heard that some synthesis software now knows how to move around
>> pipeline registers to optimize timing. I haven't tried using the
>> feature yet, though.
>>
>> I think it can move registers, but maybe not add them. You might need
>> enough registers in place for it to move them around.
>
> Yes, of course the pipeline registers may be moved (e.g. using the
> "retiming" feature). I usually keep this option switched on for
> implementation.
>
> My method only ensures that the number of pipeline stages is the same in
> all parallel paths. And keeping track of that was really a huge problem
> in bigger designs.
> --
> Wojtek

Not sure why you expect the tool to do what you should do, and to do so for
the simulation tool as well. How can you simulate a design where synthesis
will insert registers for you?

Kaz
---------------------------------------
Posted through http://www.FPGARelated.com