Messages from 161375

Article: 161375
Subject: Microchip UNI/O controller core for FPGA
From: wzab01@gmail.com
Date: Thu, 13 Jun 2019 12:51:49 -0700 (PDT)

Hi,

I needed to access the Microchip 11AA02E48 EEPROM located on an FPGA board.
Unfortunately, I couldn't find any VHDL/Verilog sources of a UNI/O controller.
Therefore, I have decided to write my own. It took a few hours, but it seems
that it works quite reasonably. I have published the sources under the PUBLIC
DOMAIN or CC0 1.0 Universal license at
https://groups.google.com/forum/#!topic/alt.sources/-H7CjN9Y_u0 and in
https://github.com/wzab/wzab-hdl-library/tree/master/unio .
I hope that they may be useful for somebody; however, they are published
without any warranty. You use them at your own risk. You should also check
that your use of UNI/O does not violate Microchip's (or other) patents.
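For readers who have not met UNI/O: it is a single-wire bus that uses Manchester encoding. The sketch below is a minimal Python model of how a master might encode one byte plus its acknowledge bit. It is not Wojtek's VHDL. It assumes the conventions as I understand them from Microchip's UNI/O bus specification: a '0' is a high-to-low edge at mid-bit, a '1' is a low-to-high edge, data goes MSB first, and the master follows each byte with MAK (sent like a '1') or NoMAK (sent like a '0'). The slave's SAK slot, the standby pulse and all bit timing are not modelled, so check the details against the datasheet before relying on them.

```python
def manchester_bit(bit: int) -> list[int]:
    """Two half-bit line levels for one bit (assumed UNI/O convention)."""
    # '1' = low-to-high edge at mid-bit, '0' = high-to-low edge at mid-bit
    return [0, 1] if bit else [1, 0]


def encode_byte(value: int, mak: bool = True) -> list[int]:
    """Encode one byte MSB first, followed by the master's MAK/NoMAK bit.

    mak=True models MAK ('more bytes follow'), mak=False models NoMAK
    ('end of command'). The slave's SAK/NoSAK slot is not modelled.
    """
    if not 0 <= value <= 0xFF:
        raise ValueError("value must fit in one byte")
    levels: list[int] = []
    for i in range(7, -1, -1):
        levels += manchester_bit((value >> i) & 1)
    levels += manchester_bit(1 if mak else 0)
    return levels


def decode_byte(levels: list[int]) -> tuple[int, bool]:
    """Inverse of encode_byte: recover (byte, mak) from 18 half-bit levels."""
    if len(levels) != 18:
        raise ValueError("expected 9 bits = 18 half-bit levels")
    bits = []
    for i in range(0, 18, 2):
        pair = (levels[i], levels[i + 1])
        if pair == (0, 1):
            bits.append(1)
        elif pair == (1, 0):
            bits.append(0)
        else:
            raise ValueError(f"missing mid-bit edge in bit {i // 2}")
    value = 0
    for b in bits[:8]:
        value = (value << 1) | b
    return value, bool(bits[8])


if __name__ == "__main__":
    # 0x55 (alternating bits) is assumed here to be the UNI/O start header
    print(encode_byte(0x55))
```

Because every bit cell contains an edge, a receiver like the one in an FPGA core can recover the bit clock from the data itself, which is what makes a single-wire bus workable.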

BR,
Wojtek

Article: 161376
Subject: Re: bare-metal ZYNQ
From: Dimitrij Klingbeil <nospam@no-address.com>
Date: Fri, 14 Jun 2019 00:05:42 +0200

On 2019-06-13 17:09, John Larkin wrote:
>
> Separate FPGA and CPU chips is an option that we use a lot already,
> but it needs a chip-chip parallel interface that uses a lot of balls,
> or a slow SPI link.
>
> The NXP uP that we usually use for this combo, LPC3250, looks to be
> EOL, so we're looking for a next-generation product platform.

The chip-chip parallel interface is quickly becoming a chip-chip serial
interface, now that most higher-end embedded CPUs have PCIe.

NXP i.MX series has many variants with PCIe. So do many DSPs from TI.

It looks like nowadays PCIe gets to be the go-to interface both between
CPU and DSP and between CPU (or DSP) and FPGA. Few balls and high speed.

For CPU-DSP, the application CPU is typically the root complex and
the DSP(s) is(are) typically the endpoint(s). The endpoint side can send
interrupt packets when it has data (or otherwise requires attention).

Regards
Dimitrij

Article: 161377
Subject: Re: bare-metal ZYNQ
From: Richard Damon <Richard@Damon-Family.org>
Date: Thu, 13 Jun 2019 22:52:29 -0400

On 6/12/19 7:32 PM, John Larkin wrote:
> 
> 
> Assume I'm a pointy-haired boss trying to help one of my guys.
> 
> I think that...
> 
> The Xilinx ZYNQ (FPGA+ARM on a chip) has a hard boot loader. It
> figures out what the boot device is (serial flash, SD card, whatever)
> and reads in a secondary boot program, which the Xilinx tools provide
> as part of a build. That loader then reads the entire FPGA config
> bitstream into DRAM, and sets up a giant DMA transfer to configure the
> FPGA. That's all standard in the tools.
> 
> But what if there's no DRAM? My guy thinks he will have to write his
> own ARM application, which is booted at load time, and inside that
> would be a routine to read from the boot media and configure the FPGA
> in chunks, using a small uP RAM buffer, maybe DMA or maybe not. He
> figures he could do that in a few days.
> 
> Seems to me that Xilinx should support booting up a ZYNQ without DRAM.
> Does the tool chain support that (people here think not) or is there
> some loader already coded somewhere?
> 
> (Our support, through a distributor, isn't very good.)
> 
> Thanks

It has been a while since I used that chip, but my memory was that what
you are describing was the two stage boot loading process. There is a
First Level Boot Loader put into the internal flash of the device that
loads a program into the internal SRAM of the part from a limited
selection of sources (mostly limited to what you could load from with a
simple boot loader). This program is often just a Second Level
Bootloader, but could also be a simple 'bare metal' program. The Second
Level Bootloader generally had the ability to configure DRAM and load
the program it was loading into it, but it did not need to.

The other task normally done by the Boot Loader was to load the
configuration data into the FPGA, but that could also be put off till later.

When Booting to Linux, the Second Level Boot Loader actually just loaded
GRUB, and then GRUB loaded Linux and started it. GRUB and Linux required
DRAM, and much of the documentation assumes going to Linux, but the
tools did support other configurations.

Article: 161378
Subject: Re: bare-metal ZYNQ
From: Michael Kellett <mk@mkesc.co.uk>
Date: Fri, 14 Jun 2019 09:11:36 +0100

On 13/06/2019 18:35, Jan Panteltje wrote:
> On a sunny day (Thu, 13 Jun 2019 08:09:16 -0700) it happened John Larkin
> <jjlarkin@highlandtechnology.com> wrote in
> <agp4getd0ualcf8f9sl2s5ug2541t03e7g@4ax.com>:
> 
>> On Thu, 13 Jun 2019 06:14:09 GMT, Jan Panteltje
>> <pNaOnStPeAlMtje@yahoo.com> wrote:
>>
>>> On a sunny day (Wed, 12 Jun 2019 16:32:35 -0700) it happened John Larkin
>>> <jjlarkin@highland_snip_technology.com> wrote in
>>> <qm13gel4ifba24lb4p8gdeeusufc2b433b@4ax.com>:
>>>
>>>> But what if there's no DRAM?
>>>
>>>
>>> That thing runs Linux?
>>> Does not Linux use the DRAM?
>>>
>>>
>>> If not using Linux and DRAM then a simpler cheaper FPGA board?
>>
>> I said "bare metal."
>>
>> Separate FPGA and CPU chips is an option that we use a lot already,
>> but it needs a chip-chip parallel interface that uses a lot of balls,
>> or a slow SPI link.
>>
>> The NXP uP that we usually use for this combo, LPC3250, looks to be
>> EOL, so we're looking for a next-generation product platform.
> 
> OK, just read the 80-page datasheet of the LPC3250.
> While reading I was thinking about the chip in the Raspberry Pi,
>   Broadcom BCM2835 -- 2837
> but that has no ADC.. but does have HDMI out..
> There exists a FPGA plugin board for the Raspberry.
> 
> It is a pity that so many things go EOL in a short time,
> OTOH it is a throw away society.
> And very strong competition does kill some products.
> 
> It all depends on what you want to do.
> 
> A Raspberry plus some external ADC, $35 + ??
> VERY powerful platform, really, GCC compiler, Linux, lots of I/O.
> USB, Ethernet, HDMI, analog video out, analog audio out,
> GPIO for extra boards... SDcard, camera interface, logic level serial, PWM,
> PLL frequency generators, and although every year a new model, the
> basics stay more or less the same, quadcore now, lots of DRAM,
> availability...
> 
> Depends on what you call 'bare metal' these days.
>   https://en.wikipedia.org/wiki/Raspberry_Pi
> 
> I have several in use...
> 
> It is sort of moving to an ever higher level of integration.
> 

Jan - do you know of a good, simple and fast way to get the Pi to 
exchange data with an adjacent chip (uP or FPGA). Using USB or Ethernet 
doesn't count as simple (or very fast for small data packets.)

MK

---
This email has been checked for viruses by AVG.
https://www.avg.com

Article: 161379
Subject: Re: bare-metal ZYNQ
From: Theo <theom+news@chiark.greenend.org.uk>
Date: 14 Jun 2019 10:14:18 +0100 (BST)

In comp.arch.fpga John Larkin <jjlarkin@highland_snip_technology.com> wrote:
> 
> Seems to me that Xilinx should support booting up a ZYNQ without DRAM.
> Does the tool chain support that (people here think not) or is there
> some loader already coded somewhere?

Hmm... it's not the same, but on the Intel Cyclone V parts (and others I
think) there's just a FIFO.  You can push in bitstream words, and
configuration only happens when the full bitstream is provided and it meets
some kinds of checks.

The Zynq appears to drive such a process via DMA - the PCAP in chapter 6 here
https://www.xilinx.com/support/documentation/user_guides/ug585-Zynq-7000-TRM.pdf
It doesn't say as much, but I wonder if it's possible to transfer in chunked
DMA.  The Linux driver probably has to chunk anyway, given the RAM buffer
you want to transfer may not be in contiguous physical memory.
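The chunked approach John's engineer proposed, and that Theo wonders about, amounts to streaming the bitstream through one small buffer instead of staging all of it in DRAM. Below is a minimal Python model of that loop. `sink` stands in for a hypothetical "start one PCAP DMA transfer from this buffer and wait for it" routine, and the 64 KiB buffer is an arbitrary size assumed to fit comfortably in the Zynq-7000's 256 KB of on-chip memory (OCM). Whether the PCAP accepts a bitstream split across several DMA transfers, and what alignment each transfer needs, is exactly the open question and should be checked in UG585.

```python
import io
from typing import Callable


def configure_in_chunks(source: io.BufferedIOBase,
                        sink: Callable[[bytes], None],
                        buf_size: int = 64 * 1024) -> int:
    """Stream a bitstream from `source` to `sink` through one reusable buffer.

    `sink` is a hypothetical stand-in for "DMA this buffer into the
    configuration port and wait for completion". At most `buf_size` bytes
    are held in memory at any time. Returns the number of transfers issued.
    """
    if buf_size <= 0:
        raise ValueError("buf_size must be positive")
    buf = bytearray(buf_size)          # the only staging buffer (e.g. in OCM)
    transfers = 0
    while True:
        n = source.readinto(buf)       # refill from boot media (SD, QSPI, ...)
        if not n:
            break                      # end of bitstream
        sink(bytes(buf[:n]))           # one chunked transfer
        transfers += 1
    return transfers


if __name__ == "__main__":
    # 4 MiB of zeros as placeholder data, not a real Zynq bitstream size
    fake_bitstream = io.BytesIO(bytes(4 * 1024 * 1024))
    count = configure_in_chunks(fake_bitstream, lambda chunk: None)
    print(f"{count} transfers of up to 64 KiB each")
```

The design choice is simply to trade one large transfer for many small ones; the cost is the per-transfer setup and the refill time from the boot media between transfers.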

As to support for this in the tools, without DRAM you're probably running a
custom OS, so there's a limit to what they can do.

On the Arria 10 one 'normal' boot process is: ROM bootloader reads SD card,
starts u-boot, which writes FPGA bitstream then boots Linux.  Now you
mention it, I think u-boot must be running without DRAM because the DRAM
pins are only configured by the bitstream.  So it could be worth looking to
see if a similar process works on Zynq.

(instead of SD card, QSPI and other storage is also selectable)

Theo

Article: 161380
Subject: Re: bare-metal ZYNQ
From: Jan Panteltje <pNaOnStPeAlMtje@yahoo.com>
Date: Fri, 14 Jun 2019 09:44:34 GMT

On a sunny day (Fri, 14 Jun 2019 09:11:36 +0100) it happened Michael Kellett
<mk@mkesc.co.uk> wrote in <98SdnfMZpZZdy57AnZ2dnUU78LXNnZ2d@giganews.com>:

>Jan - do you know of a good, simple and fast way to get the Pi to 
>exchange data with an adjacent chip (uP or FPGA).



Sure, first for FPGA there is this:
 http://www.latticesemi.com/en/Products/DevelopmentBoardsAndKits/RaspberryPiFPGA
this connects via GPIO.

I notice a lot more big names now have FPGA boards for the Raspberry.
 Just google 'raspberry FPGA board'.


Depending on your definition of 'fast' with a micro,
the Pi has logic-level RS232 via /dev/ttyAMA0,
and also hardware SPI (or software SPI of course), and I2C likewise.

Here used as a large LED matrix display driver:
 http://panteltje.com/panteltje/raspberry_pi_FDS132_matrix_display_driver/index.html


You can also use 8 bits from GPIO and do byte level transfers,
a typical example of 'fast' is this:
 http://panteltje.com/panteltje/raspberry_pi_dvb-s_transmitter/
that also uses a FIFO hardware buffer to get a smooth timed data stream
even during OS task switching.

8 bits (or more) transfer with handshake will work with most micros.
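One common way to realise "8 bits with handshake" is a four-phase (return-to-zero) handshake: the sender drives the data and raises a REQ ("new data") line, the receiver latches the byte and raises ACK ("ready"), and then both lines return low before the next byte. The Python model below sketches that protocol only. The names `Bus`, `Sender` and `Receiver` are hypothetical, each loop step lets both sides react once, and real hardware would also need the FPGA to synchronise REQ through a couple of flip-flops before acting on it.

```python
from dataclasses import dataclass, field


@dataclass
class Bus:
    data: int = 0   # the 8 data pins
    req: int = 0    # 'new data' strobe, driven by the sender (e.g. the Pi)
    ack: int = 0    # 'ready' line, driven by the receiver (e.g. the FPGA)


@dataclass
class Sender:
    payload: list[int]
    idx: int = 0

    def step(self, bus: Bus) -> None:
        if self.idx >= len(self.payload):
            return
        if bus.req == 0 and bus.ack == 0:
            bus.data = self.payload[self.idx] & 0xFF  # drive the byte first...
            bus.req = 1                               # ...then flag it as valid
        elif bus.req == 1 and bus.ack == 1:
            bus.req = 0                               # byte taken: release strobe
            self.idx += 1


@dataclass
class Receiver:
    got: list[int] = field(default_factory=list)

    def step(self, bus: Bus) -> None:
        if bus.req == 1 and bus.ack == 0:
            self.got.append(bus.data)  # latch the byte
            bus.ack = 1
        elif bus.req == 0 and bus.ack == 1:
            bus.ack = 0                # ready for the next byte


def transfer(payload: list[int], max_steps: int = 100_000) -> tuple[list[int], int]:
    """Run both sides until every byte is sent; return (received, steps)."""
    bus, tx, rx = Bus(), Sender(list(payload)), Receiver()
    steps = 0
    while tx.idx < len(tx.payload) and steps < max_steps:
        tx.step(bus)
        rx.step(bus)
        steps += 1
    return rx.got, steps
```

Each byte costs four edges on REQ/ACK, so the achievable rate depends on how quickly each side notices the other's edge, not just on how fast the data pins can change.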

Here the Pi as JTAG programmer:
 http://panteltje.com/panteltje/raspberry_pi/

Stepper motor driver, lots of other i2c chips..
 http://panteltje.com/panteltje/xgpspc/index.html


>Using USB or Ethernet 
>doesn't count as simple (or very fast for small data packets.)

USB is slow on my older Raspberries at least, ethernet is OK.
I would prefer ethernet in some applications because of the galvanic isolation.

What is simple?
Everything is simple once you have dunnit.

Article: 161381
Subject: Re: bare-metal ZYNQ
From: Michael Kellett <mk@mkesc.co.uk>
Date: Sat, 15 Jun 2019 13:14:27 +0100

On 14/06/2019 10:44, Jan Panteltje wrote:
> [snip]
>> Using USB or Ethernet
>> doesn't count as simple (or very fast for small data packets.)
> 
> USB is slow on my older Raspberries at least, ethernet is OK.
> I would prefer ethernet in some applications because of the galvanic isolation.
> 
> What is simple?
> Everything is simple once you have dunnit.
> 

Thanks for the stuff Jan, I don't think I explained quite what I meant 
by fast (although I did say that Ethernet wasn't fast enough).

So fast for me, for the applications I have in mind is:

round trip < 1us (less than 50ns preferred) - easy to do with FPGA 
memory mapped to uP and pretending to be a RAM - but I don't see how to 
do it on a Pi.
Sustained data transfer rate > 100MiB per second in both directions 
simultaneously.
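To put these figures in the same units as the link rates discussed elsewhere in the thread, here is the arithmetic as a small Python helper. It assumes "MiB" means 2**20 bytes and ignores any protocol overhead.

```python
MIB = 1024 ** 2  # 1 MiB = 2**20 bytes


def requirement(mib_per_s: float) -> dict[str, float]:
    """Convert a per-direction MiB/s target into bit rates and time per byte."""
    bytes_per_s = mib_per_s * MIB
    mbit_per_s = bytes_per_s * 8 / 1e6
    return {
        "bytes_per_s_each_way": bytes_per_s,
        "mbit_per_s_each_way": mbit_per_s,
        "mbit_per_s_both_ways": 2 * mbit_per_s,   # full duplex
        "ns_per_byte": 1e9 / bytes_per_s,         # time budget per byte, each way
    }


if __name__ == "__main__":
    for name, value in requirement(100).items():
        print(f"{name:>22}: {value:,.2f}")
```

So the target is roughly 0.84 Gbit/s each way (about 1.68 Gbit/s in total), or a new byte every ~9.5 ns in each direction.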

You can do this kind of stuff with the PRUs on the BeagleBoards, but it
would be nice if it were possible on a Pi.

Simple means (in this context) not using lots of other fancy chips over
and above the FPGA and not needing to use a GHz serial interface
(although if the Pi had one spare that I don't know about, I might have a
go).

I had wondered if the camera or audio interfaces might be re-purposed.

MK


Article: 161382
Subject: Re: bare-metal ZYNQ
From: Jan Panteltje <pNaOnStPeAlMtje@yahoo.com>
Date: Sat, 15 Jun 2019 12:45:04 GMT

On a sunny day (Sat, 15 Jun 2019 13:14:27 +0100) it happened Michael Kellett
<mk@mkesc.co.uk> wrote in <TOWdnRE_0_exfJnAnZ2dnUU78e_NnZ2d@giganews.com>:

>Thanks for the stuff Jan, I don't think I explained quite what I meant 
>by fast (although I did say that Ethernet wasn't fast enough).
>
>So fast for me, for the applications I have in mind is:
>
>round trip < 1us (less than 50ns preferred) - easy to do with FPGA 
>memory mapped to uP and pretending to be a RAM - but I don't see how to 
>do it on a Pi.
>Sustained data transfer rate > 100MiB per second in both directions 
>simultaneously.
>
>You can do this kind of stuff with the Prus on the Beagleboards but it 
>would be nice if it were possible on a Pi.
>
>Simple means (in this context) not using lots of other fancy chips over 
>and above the FPGA and not needing to use a GHz serial interface. 
>(although if the PI had one spare that I don't know about I might have a 
>go.)
>
>I had wondered if the the camera or audio interfaces might be re-purposed.
>
>MK

Audio I do not think is usable for that, but who knows...
AFAIK the camera interface is from camera to board, so one way.
What I meant is: if you have 8 or more GPIO pins, say a byte,
then there is nothing stopping you from putting a byte on them
and using a pin as a 'new data' handshake.
The FPGA would see the handshake, read the byte, and then set a ready pin;
the Pi would then output the next byte.
Now you have 10 pins and maximum I/O speed.
Same for 16 bits: 18 pins.
The throughput problem is set by the Pi's Linux multitasker: it will
interrupt the stream every now and then, for a few milliseconds at least.
That is where you need the FIFO.
That FIFO can be in FPGA RAM, so no external logic is needed. Here I used
a hardware FIFO instead:
  http://panteltje.com/panteltje/raspberry_pi_dvb-s_transmitter/
I did consider doing that in an FPGA, but that seemed a bit of overkill in this case.
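The point about OS stalls can be turned into a FIFO size: the FIFO must absorb everything that keeps flowing while the Pi is descheduled. A minimal Python sketch follows; the 10 MB/s rate, 5 ms stall, 2x safety margin and 2 KiB per block RAM are illustrative assumptions, not numbers from this thread or from a particular FPGA.

```python
import math


def fifo_depth_bytes(rate_bytes_per_s: int, stall_ms: int, margin: float = 2.0) -> int:
    """Bytes needed to keep the stream going through one sender stall."""
    # data that keeps flowing while the Pi is not servicing the port
    return math.ceil(rate_bytes_per_s * stall_ms * margin / 1000)


def block_rams_needed(depth_bytes: int, bytes_per_block: int = 2048) -> int:
    """How many FPGA block RAMs of the (assumed) size that depth needs."""
    return math.ceil(depth_bytes / bytes_per_block)


if __name__ == "__main__":
    depth = fifo_depth_bytes(10_000_000, 5)
    print(depth, "bytes ->", block_rams_needed(depth), "block RAMs")
```

The depth scales linearly with both rate and stall time, so a 100 MB/s stream through the same 5 ms stall would need ten times as much buffering (about 1 MB), which exceeds the block RAM of many small FPGAs.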

So the maximum speed really boils down to how fast the Pi can output data.
I have not tested that, as I was never close to that limit.
It is simple to test: write some I/O pin toggle routine in asm, or even C,
and look at it on the scope.
loop:
out 0x00
out 0xff
goto loop 
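On the Pi, "out 0x00 / out 0xff" maps onto the BCM2835-family GPIO registers: a pin is made an output through the GPFSELn function-select registers and is driven by writing 1-bits to GPSET0 (high) or GPCLR0 (low). The Python sketch below takes the register offsets from the GPIO chapter of Broadcom's BCM2835 ARM Peripherals document and assumes the Raspbian `/dev/gpiomem` device, which maps the GPIO block at offset 0. `fsel_location`, `with_mode`, `byte_masks` and `toggle_forever` are illustrative names, and pin 17 is an arbitrary choice. Python is far too slow to approach hardware-limited toggle rates (hence the suggestion of asm or C); this only shows the register mechanics, including how a byte on 8 consecutive pins becomes one GPSET0 write and one GPCLR0 write.

```python
import mmap
import os

# Register offsets within the GPIO block (BCM2835 ARM Peripherals, GPIO chapter)
GPFSEL0 = 0x00   # function select: 3 bits per pin, 10 pins per 32-bit register
GPSET0 = 0x1C    # writing a 1 bit drives that pin (0..31) high
GPCLR0 = 0x28    # writing a 1 bit drives that pin (0..31) low
MODE_OUTPUT = 0b001


def fsel_location(pin: int) -> tuple[int, int]:
    """(byte offset of the GPFSELn register, bit shift) for a pin."""
    return GPFSEL0 + 4 * (pin // 10), 3 * (pin % 10)


def with_mode(reg: int, pin: int, mode: int = MODE_OUTPUT) -> int:
    """Return the GPFSELn value with `pin`'s 3-bit field replaced by `mode`."""
    _, shift = fsel_location(pin)
    return ((reg & ~(0b111 << shift)) | (mode << shift)) & 0xFFFFFFFF


def byte_masks(value: int, base_pin: int) -> tuple[int, int]:
    """GPSET0/GPCLR0 masks that put `value` on pins base_pin..base_pin+7."""
    set_mask = (value & 0xFF) << base_pin
    clr_mask = (0xFF << base_pin) & ~set_mask
    return set_mask, clr_mask


def toggle_forever(pin: int = 17) -> None:
    """Single-pin equivalent of the 'out 0x00 / out 0xff' loop (Pi only)."""
    fd = os.open("/dev/gpiomem", os.O_RDWR | os.O_SYNC)
    mem = mmap.mmap(fd, 4096, mmap.MAP_SHARED, mmap.PROT_READ | mmap.PROT_WRITE)
    regs = memoryview(mem).cast("I")      # view the block as 32-bit registers
    offset, _ = fsel_location(pin)
    regs[offset // 4] = with_mode(regs[offset // 4], pin)
    while True:
        regs[GPSET0 // 4] = 1 << pin
        regs[GPCLR0 // 4] = 1 << pin
```

Because GPSET0 and GPCLR0 are separate registers, putting a byte on the pins takes two writes, so the data lines can briefly show a mix of old and new bits; that is one reason to raise the 'new data' strobe only after both writes.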

Pi has DMA, have not used it myself, here some discussion:
 https://www.raspberrypi.org/forums/viewtopic.php?t=8376

Maybe this is of more use to you:
 https://github.com/hzeller/rpi-gpio-dma-demo

They did the pin toggle and:
<quote>
 The resulting output wave on the Raspberry Pi 1 of 22.7Mhz, the Raspberry Pi 2 reaches 41.7Mhz and the Raspberry Pi 3 65.8 Mhz.
<end quote>

Fast enough?
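A rough way to answer "fast enough?" for the 100 MiB/s requirement: assume each period of the quoted toggle wave is one GPSET0 write plus one GPCLR0 write, and that a handshaked transfer costs about four such writes (set data bits, clear data bits, raise strobe, drop strobe), ignoring the time needed to read back the FPGA's ready pin. Both assumptions are mine, not the demo's, so treat the result as an optimistic upper bound for one CPU core.

```python
MIB = 1024 ** 2

# Toggle frequencies as quoted from the rpi-gpio-dma-demo README (MHz)
QUOTED_WAVE_MHZ = {"Pi 1": 22.7, "Pi 2": 41.7, "Pi 3": 65.8}


def est_bytes_per_s(wave_mhz: float, writes_per_transfer: int = 4,
                    bus_bytes: int = 1) -> float:
    """Upper-bound byte rate for CPU-driven GPIO under the stated assumptions."""
    writes_per_s = 2 * wave_mhz * 1e6       # one set + one clear write per period
    return writes_per_s / writes_per_transfer * bus_bytes


if __name__ == "__main__":
    for board, mhz in QUOTED_WAVE_MHZ.items():
        rate = est_bytes_per_s(mhz)
        print(f"{board}: ~{rate / MIB:.1f} MiB/s on an 8-bit bus")
```

Under these assumptions even a Pi 3 with a 16-bit bus stays below 100 MiB/s, and it keeps a core fully busy, a point Michael raises later in the thread.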

Article: 161383
Subject: Re: bare-metal ZYNQ
From: Tom Gardner <spamjunk@blueyonder.co.uk>
Date: Sat, 15 Jun 2019 14:26:45 +0100

On 15/06/19 13:14, Michael Kellett wrote:
> Thanks for the stuff Jan, I don't think I explained quite what I meant by fast 
> (although I did say that Ethernet wasn't fast enough).
> 
> So fast for me, for the applications I have in mind is:
> 
> round trip < 1us (less than 50ns preferred) - easy to do with FPGA memory mapped 
> to uP and pretending to be a RAM - but I don't see how to do it on a Pi.
> Sustained data transfer rate > 100MiB per second in both directions simultaneously.

You could do that on the XMOS xCORE devices. They are
fast enough to take the 100Mb/s serial ethernet traffic,
and process it in software. *And* be doing other things
at the same time, guaranteed by design (not tests!) :)

The IDE states the maximum timing between two points,
e.g. two i/o operations, or loop times. That's possible
since there are no caches, no interrupts, and the
latency is <100ns (much less in my experience).



> You can do this kind of stuff with the PRUs on the BeagleBoards but it would be
> nice if it were possible on a Pi.
> 
> Simple means (in this context) not using lots of other fancy chips over and
> above the FPGA and not needing to use a GHz serial interface. (although if the
> Pi had one spare that I don't know about I might have a go.)

The xCORE i/o is very nice, easy to use, and is
similar to FPGAs in flexibility (e.g. SERDES,
or strobes, or...)


> I had wondered if the camera or audio interfaces might be re-purposed.

Article: 161384
Subject: Re: bare-metal ZYNQ
From: Dimitrij Klingbeil <nospam@no-address.com>
Date: Sat, 15 Jun 2019 20:19:00 +0200

On 2019-06-15 14:14, Michael Kellett wrote:

> So fast for me, for the applications I have in mind is:
>
> round trip < 1us (less than 50ns preferred) - easy to do with FPGA
> memory mapped to uP and pretending to be a RAM - but I don't see how
>  to do it on a Pi. Sustained data transfer rate > 100MiB per second
> in both directions simultaneously.

100 MiB as in 100 Mi Bytes (rather than bits) per second, full duplex?

That does not look realistic without some high speed serial interface.

As the rPi does not provide any, maybe other similar boards:

- ROCKPro64 (with Rockchip RK3399 SoC) has a PCIe x4 port
- Banana Pi M2 (with Allwinner R40 SoC) has a SATA port

PCIe x4 should easily do it, even when not using all those 4 lanes.

SATA might be somewhat tricky from the controller side, but it can work
if the FPGA target can emulate a block device.

Depending on controller capabilities, this could mean mostly half-duplex
transfers however. That would require SATA 2 with 3 Gb/s minimum (SATA 1
with 1.5Gb/s would be enough if the bulk transfers are full duplex).
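These line-rate figures can be checked with a little arithmetic. SATA 1/2 and PCIe Gen1/Gen2 all use 8b/10b encoding, so only 8 of every 10 line bits carry data. The sketch below computes that raw payload ceiling per direction and compares it with 100 MiB/s each way; real links lose a further share to framing and protocol overhead, so these are upper bounds.

```python
MIB = 1024 ** 2
NEED_EACH_WAY = 100 * MIB / 1e6      # 104.86 MB/s (1 MB = 10**6 bytes)

# Line rates in Gb/s; all of these links use 8b/10b encoding
LINKS = {
    "SATA 1": 1.5,
    "SATA 2": 3.0,
    "PCIe Gen1 x1": 2.5,
    "PCIe Gen2 x1": 5.0,
}


def payload_mb_per_s(line_gbps: float, lanes: int = 1) -> float:
    """Raw 8b/10b payload ceiling in MB/s, per direction, before protocol overhead."""
    return line_gbps * 1e9 * 8 / 10 / 8 / 1e6 * lanes


def verdict(line_gbps: float, full_duplex: bool) -> bool:
    """Can the link carry 100 MiB/s each way?

    A half-duplex link has to carry both directions' traffic in turn.
    """
    need = NEED_EACH_WAY if full_duplex else 2 * NEED_EACH_WAY
    return payload_mb_per_s(line_gbps) >= need


if __name__ == "__main__":
    for name, gbps in LINKS.items():
        print(f"{name}: {payload_mb_per_s(gbps):.0f} MB/s, "
              f"half-duplex ok={verdict(gbps, False)}, "
              f"full-duplex ok={verdict(gbps, True)}")
```

The results agree with the reasoning above: SATA 1 (150 MB/s) is only enough if traffic really flows both ways at once, SATA 2 (300 MB/s) covers the half-duplex case, and even a single PCIe Gen1 lane (250 MB/s each way) has headroom.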

Article: 161385
Subject: Re: bare-metal ZYNQ
From: Bart Fox <bartfox@gmx.net>
Date: Sat, 15 Jun 2019 21:50:18 +0200

As others mentioned: there is 256 KB of OCM on the Zynq.
I assume you want to configure the device from flash.

Step 1: FSBL
Step 2: start u-boot (with fpga configuration support)
Step 3: configure u-boot to configure the fpga
Step 4: let u-boot load and start your bare metal application

Bart Fox

Article: 161386
Subject: Re: bare-metal ZYNQ
From: Michael Kellett <mk@mkesc.co.uk>
Date: Sun, 16 Jun 2019 09:53:55 +0100

On 15/06/2019 13:45, Jan Panteltje wrote:
> [snip]
> They did the pin toggle and:
> <quote>
>   The resulting output wave on the Raspberry Pi 1 of 22.7Mhz, the Raspberry Pi 2 reaches 41.7Mhz and the Raspberry Pi 3 65.8 Mhz.
> <end quote>
> 
> Fast enough?
> 
Thanks Jan,
The toggle rate is quite good but the downside of using software 
controlled IO is that at least 1 core is 100% occupied.
For things that don't need the full rate and can be block processed it 
would do.

MK




Article: 161387
Subject: Re: bare-metal ZYNQ
From: Michael Kellett <mk@mkesc.co.uk>
Date: Sun, 16 Jun 2019 09:56:49 +0100

On 15/06/2019 14:26, Tom Gardner wrote:
> You could do that on the XMOS xCORE devices. They are
> fast enough to take the 100Mb/s serial ethernet traffic,
> and process it in software. *And* be doing other things
> at the same time, guaranteed by design (not tests!) :)
> 
> [snip]
> 
> The xCORE i/o is very nice, easy to use, and is
> similar to FPGAs in flexibility (e.g. SERDES,
> or strobes, or...)

Thanks for the suggestion.
My problem isn't in getting the FPGA to jump through the hoops but in
getting the Pi or a similar Linux-running platform to transfer data to it.
The XMOS parts could (sometimes) replace the FPGA, but not the Linux end.

MK


Article: 161388
Subject: Re: bare-metal ZYNQ
From: Michael Kellett <mk@mkesc.co.uk>
Date: Sun, 16 Jun 2019 10:00:57 +0100

On 15/06/2019 19:19, Dimitrij Klingbeil wrote:
> On 2019-06-15 14:14, Michael Kellett wrote:
> 
>> So fast for me, for the applications I have in mind is:
>>
>> round trip < 1us (less than 50ns preferred) - easy to do with FPGA
>> memory mapped to uP and pretending to be a RAM - but I don't see how
>>  to do it on a Pi. Sustained data transfer rate > 100MiB per second
>> in both directions simultaneously.
> 
> 100 MiB as in 100 Mi Bytes (rather than bits) per second, full duplex?
> 
> That does not look realistic without some high speed serial interface.
> 
> As the RPi does not provide any, maybe other similar boards would do:
> 
> - ROCKPro64 (with Rockchip RK3399 SoC) has a PCIe x4 port
> - Banana Pi M2 (with Allwinner R40 SoC) has a SATA port
> 
> PCIe x4 should easily do it, even when not using all those 4 lanes.
> 
> SATA might be somewhat tricky from the controller side, but it can work
> if the FPGA target can emulate a block device.
> 
> Depending on controller capabilities, this could mean mostly half-duplex
> transfers however. That would require SATA 2 with 3 Gb/s minimum (SATA 1
> with 1.5Gb/s would be enough if the bulk transfers are full duplex).
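A quick back-of-envelope check of the SATA figures above, as a sketch: it assumes SATA's 8b/10b line coding (10 line bits per payload byte) and ignores FIS framing, CRC and protocol turnaround, so real margins are somewhat tighter.

```python
# Back-of-envelope check of the SATA link rates quoted above.
# Assumptions: 8b/10b line coding (10 line bits per payload byte);
# FIS framing, CRC and protocol turnaround overhead are ignored.

MIB = 1024 * 1024                 # bytes in one MiB
TARGET_BYTES_PER_S = 100 * MIB    # 100 MiB/s in each direction

SATA_LINE_RATES = {"SATA 1": 1.5e9, "SATA 2": 3.0e9}  # line bits per second


def line_rate_needed(bytes_per_s, directions):
    """Line bit rate needed when `directions` streams share one link."""
    return bytes_per_s * 10 * directions  # 10 line bits per payload byte


full_duplex = line_rate_needed(TARGET_BYTES_PER_S, directions=1)
half_duplex = line_rate_needed(TARGET_BYTES_PER_S, directions=2)

for name, rate in SATA_LINE_RATES.items():
    print(f"{name}: full duplex ok={full_duplex <= rate}, "
          f"half duplex ok={half_duplex <= rate}")
```

Under those assumptions full duplex needs about 1.05 Gb/s in each direction, within SATA 1, while half duplex puts about 2.1 Gb/s on one link, which needs SATA 2.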

100 megabytes per second is quite feasible locally (on board) with a 
parallel 8-bit bus. I've had less trouble getting 1 Gbit Ethernet PHYs to 
talk to FPGAs using parallel connections than serial ones.
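The same arithmetic for a parallel bus. This sketch assumes one bus word per clock (or per edge, for DDR), no turnaround or handshake cycles, and takes 100 megabytes as 10^8 bytes/s; `bus_clock_hz` is an illustrative helper, not anything from the thread.

```python
def bus_clock_hz(bytes_per_s, width_bits, ddr=False, shared_both_directions=False):
    """Clock frequency a parallel bus needs for a given payload rate.

    Assumes every clock (or every edge, for DDR) moves one full bus word
    and that there are no idle, turnaround or handshake cycles.
    """
    words_per_s = bytes_per_s / (width_bits / 8)
    if shared_both_directions:
        # One bidirectional bus carrying both streams: twice the word rate.
        words_per_s *= 2
    return words_per_s / (2 if ddr else 1)


for label, kwargs in [
    ("8-bit SDR, one direction", {}),
    ("8-bit DDR, one direction", {"ddr": True}),
    ("8-bit SDR, shared bus, both directions", {"shared_both_directions": True}),
]:
    print(f"{label}: {bus_clock_hz(100e6, 8, **kwargs) / 1e6:.0f} MHz")
```

That works out to a byte-wide bus at about 100 MHz (50 MHz DDR) per direction, or twice that if one bidirectional bus carries both streams.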

The great thing about the Zynq is that the parallel interface is on-chip 
rather than on-board, so it's almost easy. The downside is that an RPi is 
so much cheaper.

An RPi with a free PCIe port and OS support: that would be nice.

MK

Article: 161389
Subject: Re: bare-metal ZYNQ
From: Tom Gardner <spamjunk@blueyonder.co.uk>
Date: Sun, 16 Jun 2019 14:54:41 +0100
Links: << >> << T >> << A >>

On 16/06/19 09:56, Michael Kellett wrote:
> My problem isn't in getting the FPGA to jump through the hoops but in getting 
> the PI or similar Linux running platform to transfer data to it. The Xmos parts 
> could (sometimes) replace the FPGA but not the Linux end.

Exactly so. xCORE/xC are unique and in an interesting niche, but
like all other technologies they are not The General Solution.

But it is fun trying (and succeeding) to mix and match
technologies to work around each one's limitations :)

Article: 161390
Subject: Unique uses for the DSP48
From: Kevin Neilson <kevin.neilson@xilinx.com>
Date: Thu, 27 Jun 2019 17:26:59 -0700 (PDT)
Links: << >> << T >> << A >>

I've tried to figure out how to use the Xilinx DSP48s for Galois
arithmetic, but they really aren't that useful for that.  The new ones can
do a 96-bit unary XOR, which can be used for GF(2) matrix multiplication,
but the multipliers themselves aren't of much use for Galois math.  I
wondered what unusual uses (besides FIR filters or integer matrix
multipliers) people have used these for.  Here are some of mine:

- Transposers (shifting rows up a DSP column in A/B, latching into P, and shifting columns serially out of P using the pattern matcher)
- Barrel shifters (Not that good for wide buses, though)
- Modulo by a constant (using Barrett's Reduction)
- GF(2) bit-by-vector multiply-accumulate (using the ALU as an XOR)
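Barrett's reduction, mentioned in the list above, computes x mod n for a constant n without a divider: multiply by a precomputed reciprocal m = floor(2^k / n), shift right by k, multiply the quotient estimate back by n, subtract, and correct at most once. Multiplies and a subtract are what a DSP48 offers. The sketch below models only the arithmetic; the function names and the choice k = input width are assumptions, and the post doesn't say how the operation was pipelined across DSP48s.

```python
def barrett_setup(n, in_bits):
    """Precompute the shift k and reciprocal m = floor(2**k / n).

    Choosing k = in_bits (so every input x < 2**k) guarantees that the
    quotient estimate in barrett_mod is at most one too small.
    """
    k = in_bits
    m = (1 << k) // n
    return k, m


def barrett_mod(x, n, k, m):
    """x mod n for 0 <= x < 2**k, using multiplies, a shift and one compare."""
    q = (x * m) >> k   # quotient estimate: multiply by the reciprocal, then shift
    r = x - q * n      # remainder candidate, in the range [0, 2n)
    if r >= n:         # single conditional correction
        r -= n
    return r


k, m = barrett_setup(n=10, in_bits=16)
print(barrett_mod(12345, 10, k, m))  # 5
```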

Article: 161391
Subject: Re: Unique uses for the DSP48
From: Rick C <gnuarm.deletethisbit@gmail.com>
Date: Thu, 27 Jun 2019 20:52:28 -0700 (PDT)
Links: << >> << T >> << A >>

On Thursday, June 27, 2019 at 8:27:03 PM UTC-4, Kevin Neilson wrote:
> I've tried to figure out how to use the Xilinx DSP48s for Galois
> arithmetic, but they really aren't that useful for that.  The new ones can
> do a 96-bit unary XOR, which can be used for GF(2) matrix multiplication,
> but the multipliers themselves aren't of much use for Galois math.  I
> wondered what unusual uses (besides FIR filters or integer matrix
> multipliers) people have used these for.  Here are some of mine:
>
> - Transposers (shifting rows up a DSP column in A/B, latching into P, and shifting columns serially out of P using the pattern matcher)
> - Barrel shifters (Not that good for wide buses, though)
> - Modulo by a constant (using Barrett's Reduction)
> - GF(2) bit-by-vector multiply-accumulate (using the ALU as an XOR)

I'm no expert, but I believe Galois fields use modulo 2 arithmetic without
carries, so multipliers are not what you want.  Since there is no carry,
there is no need to use any special features.  The typical fabric logic
will do the job quite nicely.

Certainly barrel shifters would be useful.  Don't know what Barrett's
Reduction is.

I don't think GF(2) would be useful since you can't use a multiplier
without the carry as far as I know.  Am I mistaken?

-- 

  Rick C.

Article: 161392
Subject: Re: Unique uses for the DSP48
From: gtwrek@sonic.net (gtwrek)
Date: Fri, 28 Jun 2019 16:41:38 -0000 (UTC)
Links: << >> << T >> << A >>

In article <c159d7b2-da4b-4852-b90f-aa619fc9a1b4@googlegroups.com>,
Kevin Neilson  <kevin.neilson@xilinx.com> wrote:
>I've tried to figure out how to use the Xilinx DSP48s for Galois arithmetic, but they really aren't that useful for that.  The new ones can do a 96-bit unary XOR, which can be used for GF(2) matrix
>multiplication, but the multipliers themselves aren't of much use for Galois math.  I wondered what unusual uses (besides FIR filters or integer matrix multipliers) people have used these for.  Here are some
>of mine:
>
>- Transposers (shifting rows up a DSP column in A/B, latching into P, and shifting columns serially out of P using the pattern matcher)
>- Barrel shifters (Not that good for wide buses, though)
>- Modulo by a constant (using Barrett's Reduction)
>- GF(2) bit-by-vector multiply-accumulate (using the ALU as an XOR)


I've used one to implement a 3-input median filter (inputs up to 12 bits
wide).  I used the SIMD modes to evaluate each comparison between the
three inputs.
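One way to see why three comparisons are enough: in SIMD mode the DSP48 ALU can be split into narrow lanes (four 12-bit lanes on the DSP48E1, which may be why the inputs were limited to 12 bits), so the three pairwise subtractions a-b, b-c and a-c can run in one pass, and the median follows from their three borrow/sign bits alone. The Python model below uses `<` in place of each lane's borrow output; the lane assignment and the selection logic are assumptions, not taken from the post.

```python
def median3(a, b, c):
    """Median of three values chosen from three pairwise comparisons.

    Each comparison stands in for the borrow/sign bit of one SIMD lane
    computing a difference; only these three bits drive the selection.
    """
    ab = a < b   # lane 0: a - b borrows
    bc = b < c   # lane 1: b - c borrows
    ac = a < c   # lane 2: a - c borrows
    if ab == bc:     # a < b < c, or a >= b >= c: b is in the middle
        return b
    if ab != ac:     # b is the max and a >= c, or b is the min and a < c
        return a
    return c


print(median3(700, 12, 4095))  # 700
```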

Regards,

Mark

Article: 161393
Subject: HOW TO READ A 64 BIT REGISTER IN 2 CLOCK CYCLES IN VERILOG
From: Anonymous <bazgha.amin@gmail.com>
Date: Fri, 28 Jun 2019 13:35:11 -0700 (PDT)
Links: << >> << T >> << A >>

There is a 64-bit input that is stored in a register.  A single clock
cycle can read 32 bits at a time.  To implement this, a 4x1 mux is used,
as the 32-bit image is divided into 8-bit parts.  The problem I'm facing
is that I don't know how to read a single register over 2 clock cycles,
i.e. 32 bits in the first cycle and the remaining 32 bits in the second.
I'll be grateful if you can help me out with it.  Thank you.

Article: 161394
Subject: Re: Unique uses for the DSP48
From: Kevin Neilson <kevin.neilson@xilinx.com>
Date: Fri, 28 Jun 2019 15:44:11 -0700 (PDT)
Links: << >> << T >> << A >>

I have to do a lot of GF(2) multiplications, which end up being a vector
times a matrix, with the matrix being sometimes 128x128 bits or bigger.
There are no carries, like you say.  Mostly it's an AND of the vector and
a column of the matrix and then an XOR of that, and that is done for each
column.  It can use up a lot of LUTs and the routing can get congested,
especially when the synthesizer tries to share terms.  The XOR is a tree
with 3-4 levels of logic.
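The vector-times-matrix operation described above can be modelled bit for bit: for each column, AND the vector with that column and XOR-reduce (take the parity of) the result. Packing each column as an integer bit mask is an illustrative choice; in hardware each column becomes a row of AND gates feeding its own XOR tree.

```python
def gf2_vec_mat(v, cols):
    """Multiply bit-vector v by a GF(2) matrix given as a list of column masks.

    Output bit j is the parity (XOR-reduction) of v AND cols[j].
    """
    result = 0
    for j, col in enumerate(cols):
        parity = bin(v & col).count("1") & 1   # AND, then XOR-reduce
        result |= parity << j
    return result


print(bin(gf2_vec_mat(0b1011, [0b1111, 0b0001, 0b0110])))  # 0b111
```

For a 128x128 matrix this is 128 independent 128-input parity trees, which is where the LUT and routing cost comes from.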

I think some processors can do a "carryless" multiply for this, so the
columns are added but no carries are taken to the next column.  You end
up with a result that is wider than the field, so you have to do a
reduction stage, but it's still advantageous.  If you could disable the
carries in the DSP48s, you could use them in the GF(2) multipliers, but
unfortunately there's no such option.
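A carry-less multiply forms the same shifted partial products as an ordinary multiplier but combines them with XOR instead of addition; x86's PCLMULQDQ instruction is one example of the processor support mentioned above. The double-width result is then reduced modulo the field polynomial. This sketch uses the GF(2^8) polynomial 0x11B from AES (FIPS-197) purely as a test case; the field and polynomial in Kevin's application aren't stated.

```python
def clmul(a, b):
    """Carry-less product: XOR together a << i for every set bit i of b."""
    result = 0
    i = 0
    while b >> i:
        if (b >> i) & 1:
            result ^= a << i   # XOR, so no carry ever reaches the next column
        i += 1
    return result


def gf2_poly_mod(x, poly):
    """Reduction stage: remainder of x divided by poly, polynomials over GF(2)."""
    deg = poly.bit_length() - 1
    while x.bit_length() - 1 >= deg:
        x ^= poly << (x.bit_length() - 1 - deg)
    return x


print(3 * 3, clmul(3, 3))                           # 9 5
print(hex(gf2_poly_mod(clmul(0x57, 0x83), 0x11B)))  # 0xc1
```

The first line shows what the carries do: 3 x 3 is 9 with carries but 5 (x^2 + 1) without.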

Article: 161395
Subject: Re: HOW TO READ A 64 BIT REGISTER IN 2 CLOCK CYCLES IN VERILOG
From: gtwrek@sonic.net (gtwrek)
Date: Fri, 28 Jun 2019 22:55:46 -0000 (UTC)
Links: << >> << T >> << A >>

In article <47eef6e5-1fdf-46ad-81c3-08387dfd526e@googlegroups.com>,
Anonymous  <bazgha.amin@gmail.com> wrote:
>There is a 64bit input that is stored in register. A single clock cycle can
>read 32bits at a time. For implementing this 4x1 mux is used as 32bit image
>is divided into 8bit. The problem I'm facing is I don't know how to read 
>a single reg in 2 clock cycles i.e. 32bit in one cycle and remaining 32bit
>in second cycle. I'll be grateful if you can help me out with it. Thankyou.

In FPGA design, one "reads" a 64-bit register just by referencing the
signal in the HDL, e.g. in Verilog:
  reg [ 63 : 0 ] foo;
  reg [ 63 : 0 ] bar;

  always @*
    bar = foo;

One can randomly access "foo" (or bar, or any other wire/register) as many 
times as you wish.  There are no limits on size etc. (other than, perhaps,
FPGA routing/logic resources).

My feeling, however, is you're talking about a software-ish read from an
(unspecified) processor across an (unspecified) bus.

Without more details, we can only guess.  But the normal procedure one
follows here is simply to issue two software read operations: one to the
lower half of the 64-bit word and one to the upper half.  Some software
massaging then correctly organizes the two independent 32-bit reads into
the single 64-bit software variable.
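A sketch of the two-read approach, assuming a hypothetical `read32(addr)` bus accessor, the low word at the lower address (base) and the high word at base + 4. The high-low-high sequence is an extra precaution, not part of the answer above: if the register can change between the two reads (a running counter, say), re-reading the high word detects a carry out of the low half and retries.

```python
from typing import Callable


def read64(read32: Callable[[int], int], base: int) -> int:
    """Assemble a 64-bit register from two 32-bit bus reads.

    read32 is a hypothetical bus accessor. The layout (low word at base,
    high word at base + 4) is an assumption; check your register map.
    """
    while True:
        hi = read32(base + 4)
        lo = read32(base)
        # If the high word moved while we read the low word (e.g. a counter
        # carried out of the low half), the pair is torn: read again.
        if read32(base + 4) == hi:
            return ((hi & 0xFFFFFFFF) << 32) | (lo & 0xFFFFFFFF)


regs = {0x10: 0x55667788, 0x14: 0x11223344}   # fake register file
print(hex(read64(regs.__getitem__, 0x10)))     # 0x1122334455667788
```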

Fill in some more details, and we can perhaps help you more.

Regards,

Mark

Article: 161396
Subject: How do big compagnies use Verilog/VHDL for processor designs?
From: Benjamin Couillard <benjamin.couillard@gmail.com>
Date: Tue, 2 Jul 2019 08:34:29 -0700 (PDT)
Links: << >> << T >> << A >>

I have a question on how big companies like Intel/AMD use VHDL and Verilog internally for their processors. 

For example, if they implement an ALU, do they implement it at the RTL level or do they instantiate hand-optimized components (adder, barrel shifter, multiplier)? 

Basically, does the synthesizer actually do something or does it only connect hand-optimized components?

Regards

Article: 161397
Subject: Re: Unique uses for the DSP48
From: David Brown <david.brown@hesbynett.no>
Date: Thu, 4 Jul 2019 11:16:07 +0200
Links: << >> << T >> << A >>

On 29/06/2019 00:44, Kevin Neilson wrote:
> I have to do a lot of GF(2) multiplications, which end up being a vector times a matrix, with the matrix being sometimes 128x128 bits or bigger.  There are no carries, like you say.  Mostly it's an AND of the vector and a column of the matrix and then an XOR of that, and that is done for each column.  It can use up a lot of LUTs and the routing can get congested, especially when the synthesizer tries to share terms.  The XOR is a tree with 3-4 levels of logic.
> 
> I think some processors can do a "carryless" multiply for this, so the columns are added but no carries are taken to the next column.  You end up with a result that is wider than the field, so you have to do a reduction stage, but it's still advantageous.  If you could disable the carries in the DSP48s, you could use them in the GF(2) multipliers, but unfortunately there's no such option.
> 

Do you really mean GF(2) ?  That is just single bits.  Addition is XOR,
multiplication is AND.

If you meant something like GF(2^8), which is popular in RAID and other
forward error correction systems on byte-wide data, then I can't think
of any smart way to handle multiplication in an FPGA.  Multiplying by 2
is easy enough, but arbitrary multiplication is done using log tables in
software.  If there is a nice hardware method, I'd love to know.
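The two software ingredients David describes can be sketched directly: multiply-by-2 ("xtime") is a shift plus a conditional XOR with the field polynomial, and repeated xtime from 1 visits every non-zero element, which is how the log/antilog tables are built. The sketch assumes the polynomial 0x11D (x^8+x^4+x^3+x^2+1) with generator 2, the choice commonly used for RAID-6; the names are illustrative.

```python
POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed field polynomial, RAID-6 style)


def xtime(a):
    """Multiply by 2 (i.e. by x): shift left, reduce if bit 8 was set."""
    a <<= 1
    if a & 0x100:
        a ^= POLY
    return a


# Log/antilog tables: walk the powers of the generator 2 through all 255
# non-zero elements. EXP is doubled in length so gf_mul needs no "mod 255".
EXP = [0] * 510
LOG = [0] * 256
value = 1
for i in range(255):
    EXP[i] = value
    LOG[value] = i
    value = xtime(value)
for i in range(255, 510):
    EXP[i] = EXP[i - 255]


def gf_mul(a, b):
    """Multiply in GF(2^8) with two log lookups, an add and one antilog lookup."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


print(hex(gf_mul(0x80, 2)), hex(gf_mul(3, 7)))  # 0x1d 0x9
```

The same lookup structure fits in a small ROM or block RAM; and multiplying by a fixed constant is linear over GF(2), so it reduces to an XOR network like the matrix products discussed earlier in the thread.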

Article: 161398
Subject: Re: How do big compagnies use Verilog/VHDL for processor designs?
From: David Brown <david.brown@hesbynett.no>
Date: Thu, 4 Jul 2019 11:22:22 +0200
Links: << >> << T >> << A >>

On 02/07/2019 17:34, Benjamin Couillard wrote:
> I have a question on how big companies like Intel/AMD use VHDL and
> Verilog internally for their processors.
> 
> For example, if they implement an ALU. Do they implement the ALU on
> an RTL-level or do they instantiate hand-optimized components (adder,
> barrel shifter, multiplier).
> 
> Basically, does the synthesizer actually do something or does it only
> connect hand-optimized components?
> 
> Regards
> 

You might get useful answers here in comp.arch - there are regulars
there who work (or worked) for companies like AMD in processor design.

My understanding - which could be /very/ flawed, so don't trust it - is
that for something as big and complex as a big CPU, people rarely use
"raw" Verilog or VHDL.  They use higher level languages that generate
lower level languages or RTL output.  For an example in the open source
world (where information is easier to obtain!), look at
<https://github.com/SpinalHDL/VexRiscv>

Article: 161399
Subject: Re: Unique uses for the DSP48
From: Allan Herriman <allanherriman@hotmail.com>
Date: Thu, 04 Jul 2019 07:15:49 -0500
Links: << >> << T >> << A >>

On Thu, 04 Jul 2019 11:16:07 +0200, David Brown wrote:

> On 29/06/2019 00:44, Kevin Neilson wrote:
>> I have to do a lot of GF(2) multiplications, which end up being a
>> vector times a matrix, with the matrix being sometimes 128x128 bits or
>> bigger.  There are no carries, like you say.  Mostly it's an AND of the
>> vector and a column of the matrix and then an XOR of that, and that is
>> done for each column.  It can use up a lot of LUTs and the routing can
>> get congested, especially when the synthesizer tries to share terms. 
>> The XOR is a tree with 3-4 levels of logic.
>> 
>> I think some processors can do a "carryless" multiply for this, so the
>> columns are added but no carries are taken to the next column.  You end
>> up with a result that is wider than the field, so you have to do a
>> reduction stage, but it's still advantageous.  If you could disable the
>> carries in the DSP48s, you could use them in the GF(2) multipliers, but
>> unfortunately there's no such option.
>> 
>> 
> Do you really mean GF(2) ?  That is just single bits.  Addition is XOR,
> multiplication is AND.
> 
> If you meant something like GF(2^8), which is popular in RAID and other
> forward error correction systems on byte-wide data, then I can't think
> of any smart way to handle multiplication in an FPGA.  Multiplying by 2
> is easy enough, but arbitrary multiplication is done using log tables in
> software.  If there is a nice hardware method, I'd love to know.

I think he does mean GF(2).
Here's the user guide:
<https://www.xilinx.com/support/documentation/user_guides/ug579-ultrascale-dsp.pdf>

Allan
