Messages from 151025

Article: 151025
Subject: Re: Count bits in VHDL, with loop and unrolled loop produces
From: Tricky <trickyhead@gmail.com>
Date: Tue, 1 Mar 2011 06:33:59 -0800 (PST)

On Mar 1, 12:59 pm, a s <nospa...@gmail.com> wrote:
> Dear all,
>
> I have come up with 2 solutions in VHDL, how to count number of bits
> in input data.
> The thing I don't understand is why the 2 solutions produce different
> results, at least with Xilinx ISE and its XST.
> There is quite a substantial difference in required number of slices/
> LUTs.
>
> 1. solution with unrolled loop:   41 slices,  73 LUTs
> 2. solution with loop:            54 slices, 100 LUTs
>
> The entity of both architectures is the same:
>
> entity one_count is
>   Port ( din : in  STD_LOGIC_vector(31 downto 0);
>          dout : out  STD_LOGIC_vector(5 downto 0)
>         );
> end one_count;
>
> The architecture with an unrolled loop is the following:
>
> library IEEE;
> use IEEE.STD_LOGIC_1164.ALL;
> use IEEE.NUMERIC_STD.ALL;
>
> entity one_count is
>   Port ( din : in  STD_LOGIC_vector(31 downto 0);
>          dout : out  STD_LOGIC_vector(5 downto 0)
>         );
> end one_count;
>
> architecture one_count_unrolled_arch of one_count is
>
>   signal  cnt : integer range 0 to 32;
>
> begin
>
>    cnt <= to_integer(unsigned(din( 0 downto  0))) +
>           to_integer(unsigned(din( 1 downto  1))) +
>           to_integer(unsigned(din( 2 downto  2))) +
>           to_integer(unsigned(din( 3 downto  3))) +
>           to_integer(unsigned(din( 4 downto  4))) +
>           to_integer(unsigned(din( 5 downto  5))) +
>           to_integer(unsigned(din( 6 downto  6))) +
>           to_integer(unsigned(din( 7 downto  7))) +
>           to_integer(unsigned(din( 8 downto  8))) +
>           to_integer(unsigned(din( 9 downto  9))) +
>           to_integer(unsigned(din(10 downto 10))) +
>           to_integer(unsigned(din(11 downto 11))) +
>           to_integer(unsigned(din(12 downto 12))) +
>           to_integer(unsigned(din(13 downto 13))) +
>           to_integer(unsigned(din(14 downto 14))) +
>           to_integer(unsigned(din(15 downto 15))) +
>           to_integer(unsigned(din(16 downto 16))) +
>           to_integer(unsigned(din(17 downto 17))) +
>           to_integer(unsigned(din(18 downto 18))) +
>           to_integer(unsigned(din(19 downto 19))) +
>           to_integer(unsigned(din(20 downto 20))) +
>           to_integer(unsigned(din(21 downto 21))) +
>           to_integer(unsigned(din(22 downto 22))) +
>           to_integer(unsigned(din(23 downto 23))) +
>           to_integer(unsigned(din(24 downto 24))) +
>           to_integer(unsigned(din(25 downto 25))) +
>           to_integer(unsigned(din(26 downto 26))) +
>           to_integer(unsigned(din(27 downto 27))) +
>           to_integer(unsigned(din(28 downto 28))) +
>           to_integer(unsigned(din(29 downto 29))) +
>           to_integer(unsigned(din(30 downto 30))) +
>           to_integer(unsigned(din(31 downto 31)));
>
>    dout <= std_logic_vector(to_unsigned(cnt,6));
>
> end one_count_unrolled_arch ;
>
> And the architecture with a loop is the following:
>
> architecture one_count_loop_arch of one_count_loop is
>
> signal  cnt : integer range 0 to 32;
>
> begin
>
>   process(din) is
>     variable  tmp : integer range 0 to 32;
>     begin
>       tmp := to_integer(unsigned(din(0 downto 0)));
>       for i in 1 to 31 loop
>           tmp := tmp + to_integer(unsigned(din(i downto i)));
>       end loop;
>       cnt <= tmp;
>   end process;
>
>   dout <= std_logic_vector(to_unsigned(cnt,6));
>
> end one_count_loop_arch ;
>
> I would be really grateful if somebody could point out what I did
> wrong with the 2. solution with loop.
> It certainly must be my mistake, but I can not find it...
>
> Additionally, I know that this "brute-force" one counting might not be
> the optimal approach,
> but this is just my first attempt to get the job done. If somebody has
> a better solution, I would
> appreciate it if you could share it.
>
> Regards,
> Peter

see what you get with this function instead (a function I have used
before):

function count_ones(slv : std_logic_vector) return natural is
  variable n_ones : natural := 0;
begin
  for i in slv'range loop
    if slv(i) = '1' then
      n_ones := n_ones + 1;
    end if;
  end loop;

  return n_ones;
end function count_ones;

....

inside architecture, no process needed:

dout <= std_logic_vector( to_unsigned(count_ones(din), dout'length) );


The beauty with this is the function will work with a std_logic_vector
of any length.

Article: 151026
Subject: Re: Simulation vs. Hardware mismatch
From: rickman <gnuarm@gmail.com>
Date: Tue, 1 Mar 2011 06:39:52 -0800 (PST)

On Feb 27, 11:02 am, Patrick <Patr...@hotmail.com> wrote:
> Hi again,
>
> Thanks to all your input, I implemented your suggestions, however the
> problem remains the same. The result in simulation works fine, but the
> hardware
> outputs something different. Just to briefly recap, I have two ctrl
> signals that determine the behaviour of the entity:
>
>   GET    (ctrl = "00000000") sets register tx to input of op1
>   SH1_L (ctrl = "00000001") res := (op1 << 1) | tx;
>                                          tx  := tx >> 31;
>
>    library ieee;
>    use ieee.std_logic_1164.all;
>
>    entity test is
>    port
>    (
>      op1   : in  std_logic_vector(31 downto 0);      -- Input operand
>      ctrl   : in std_logic_vector(7 downto 0);          -- Control signal
>      clk   : in  std_logic;                                     -- clock
>      res   : out std_logic_vector(31 downto 0)       -- Result
>    );
>    end;
>
>    architecture rtl of test is
>
>      type res_sel_type is (GET, SH1_L);
>
>      constant Z : std_logic_vector(31 downto 0) := (others => '0');
>
>      signal res_sel  : res_sel_type;
>      signal load      : std_logic := '0';
>      signal shl        : std_logic := '0';
>
>      signal tx        : std_logic_vector(31 downto 0) := (others => '0');
>      signal inp1    : std_logic_vector(31 downto 0) := (others => '0');
>
>    begin
>
>      dec_op: process (ctrl, op1)
>      begin
>
>        res_sel  <= GET;
>        load      <= '0';
>        shl        <= '0';
>        inp1      <= ( others => '0');
>
>        case ctrl is
>
>           -- store operand
>               when "00000000" =>
>                  inp1      <= op1;
>                  load      <= '1';
>                  res_sel <= GET;
>
>               -- 1-bit left-shift with carry
>               when "00000001" =>
>                inp1      <= op1;
>                shl        <= '1';
>                res_sel <= SH1_L;
>
>               when others =>
>                  -- Leave default values
>
>               end case;
>
>      end process;
>
>      sel_out: process (res_sel, inp1, tx)
>      begin
>
>        case res_sel is
>
>         when SH1_L =>
>          res  <= ( inp1(30 downto 0) & '0' ) or tx;
>
>           when others =>
>              res <= (others => '0');
>
>        end case;
>
>      end process;
>
>      sync: process(clk)
>      begin
>       if clk'event and clk = '1' then
>            if load = '1' then
>               tx <= op1;
>            elsif shl = '1' then
>               tx <= Z(30 downto 0) & op1(31);
>            end if;
>       end if;
>      end process;
>
>    end rtl;
>
> TESTPROGRAM
>
> GET  0                      (this sets tx <= 0 )
> SH1_L 0xfedcba90     exp. output: 0xfdb97520  act. output = 0xfdb97521
> SH1_L 0x7654321f     exp. output: 0xeca8643f  act. output = 0xeca8643f
> SH1_L 0x71234567    exp. output: 0xe2468ace  act. output = 0xe2468ace
>
> As you can see, the last bit is wrong for the first SH1_L operation. The
> first SH1_L operation produces a carry for the NEXT SH1_L operation since
> the MSB is set to one of the input, however, it seems that this carry is
> already considered in the current SH1_L operation, which is wrong (tx
> should be zero).
> I checked the synthesis report and there are no latches, so I am a bit
> clueless and almost desperate what is going wrong here. I use Xilinx ISE
> 12.1 for
> synthesis, could there be a problem because I do not have a reset signal
> in my architecture, that the wrong kind of latches are instantiated?
>
> Many thanks for further helpful comments to solve this issue,
> Patrick

I'm not sure what your results mean really.  It depends on when you
are looking at the output. If you are applying the inputs to the UUT
(unit under test) and toggle the clock, then look at the outputs, then
I think this result is correct.  As soon as you get a rising edge on
the clock, tx will change and your output will change.  The tx
register is the only register in your design.  The rest of the circuit
is combinatorial, so that those signals all change as soon as the
inputs change.

So what is the timing between your input stimulus, the output
observation and the clock?
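
One way to pin this down is a small testbench that samples res both
before and after the rising clock edge. The following is only a sketch:
it assumes the "test" entity from the quoted post is compiled into
library work, and the testbench name, the 10 ns clock period and the
"done" flag are made up for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity test_tb is
end entity test_tb;

architecture sim of test_tb is
  constant CLK_PERIOD : time := 10 ns;  -- assumed value, not from the post
  signal op1  : std_logic_vector(31 downto 0) := (others => '0');
  signal ctrl : std_logic_vector(7 downto 0)  := (others => '0');
  signal clk  : std_logic := '0';
  signal res  : std_logic_vector(31 downto 0);
  signal done : boolean := false;       -- hypothetical flag to stop the clock
begin
  uut : entity work.test
    port map (op1 => op1, ctrl => ctrl, clk => clk, res => res);

  clock_gen : process is
  begin
    while not done loop
      clk <= '0'; wait for CLK_PERIOD / 2;
      clk <= '1'; wait for CLK_PERIOD / 2;
    end loop;
    wait;
  end process clock_gen;

  stimulus : process is
  begin
    -- GET 0: tx is loaded with zero on the next rising edge
    wait until falling_edge(clk);
    ctrl <= "00000000"; op1 <= x"00000000";

    -- SH1_L 0xfedcba90, applied half a period before the rising edge
    wait until falling_edge(clk);
    ctrl <= "00000001"; op1 <= x"fedcba90";

    -- Sampled before the edge: tx is still 0
    wait for CLK_PERIOD / 4;
    assert res = x"fdb97520"
      report "unexpected res before the clock edge" severity error;

    -- Sampled after the edge: tx has already captured op1(31)
    wait until rising_edge(clk);
    wait for CLK_PERIOD / 4;
    assert res = x"fdb97521"
      report "unexpected res after the clock edge" severity error;

    done <= true;
    wait;
  end process stimulus;
end architecture sim;
```

If both assertions pass in simulation, the two values in the test
program would simply correspond to reading res on either side of the
same clock edge.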

Rick

Article: 151027
Subject: Re: Count bits in VHDL, with loop and unrolled loop produces
From: Andy <jonesandy@comcast.net>
Date: Tue, 1 Mar 2011 06:49:39 -0800 (PST)

A good synthesis tool should be able to optimize either version to the
same implementation. But there are semantic differences that Xilinx
may be getting hung up on.

In the unrolled version, you have a long expression, and there is
freedom within vhdl to evaluate it in different orders or groups of
operations. In the loop version, since you are continually updating
tmp, you are describing an explicitly sequential order in which the
calculation takes place. Like I said, a good synthesis tool should be
able to handle either one equally well, but you get what you pay for
in synthesis tools.

If you are looking for a general solution to the problem for any size
vector, try a recursive function implementation of a binary tree and see
what happens.

Just for kicks, you might also put the whole calculation in the loop
(0 to 31), with temp set to 0 before the loop. Shouldn't make any
difference, but then again, we're already seeing differences where
there should be none.
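
For reference, that variant might look like the sketch below. It assumes
the one_count entity from the original post (repeated here so the
example stands alone); the architecture name is hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity one_count is
  port ( din  : in  std_logic_vector(31 downto 0);
         dout : out std_logic_vector(5 downto 0) );
end entity one_count;

-- Architecture name is made up for this sketch.
architecture one_count_loop0_arch of one_count is
begin
  process (din) is
    variable tmp : integer range 0 to 32;
  begin
    tmp := 0;                      -- start from zero instead of din(0)
    for i in 0 to 31 loop          -- the loop now covers every bit
      tmp := tmp + to_integer(unsigned(din(i downto i)));
    end loop;
    dout <= std_logic_vector(to_unsigned(tmp, 6));
  end process;
end architecture one_count_loop0_arch;
```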

On the other hand, if what you have works (fits and meets timing), use
the most maintainable, understandable version. It will save you time
(=money) in the long run. It is often interesting to find out what is
"the optimal" way to code something such that it results in the
smallest/fastest circuit. But in the big picture, it most often does
not matter, especially when you write some cryptic code to squeeze the
last pS/LUT out of it, and you had plenty of slack and space to spare
anyway. Nevertheless, knowing how to squeeze every pS/LUT comes in
handy every once in a while.

Andy

Article: 151028
Subject: Re: xilinx spartan 6
From: Ed McGettigan <ed.mcgettigan@xilinx.com>
Date: Tue, 1 Mar 2011 08:26:46 -0800 (PST)

On Feb 28, 7:56 am, Serkan <ok...@su.sabanciuniv.edu> wrote:
> I need to route a FAST CLK (that is used for deserializing and input
> to only one bank) to another bank's IODELAY2 and IOSERDES2 elements.
> Is this possible?
>
> Please remember that I also need to send signals like,
>
> -serdesstrobe,
> -fast ioclk(sampling fast serial data),
> -parallel clk(clk whose frequency is the same as parallel
> data(deserialized data)) to these elements.
>
> Serkan

This isn't possible.  These clocks can only operate within one bank.

You may be able to just use a BUFG if the data rate isn't too high.

Ed McGettigan
--
Xilinx Inc.

Article: 151029
Subject: Re: PLL Cyclone III vs PLL(DLL) Spartan-3AN
From: Ed McGettigan <ed.mcgettigan@xilinx.com>
Date: Tue, 1 Mar 2011 08:34:13 -0800 (PST)

On Feb 28, 11:32 am, Tim Wescott <t...@seemywebsite.com> wrote:
> On 02/28/2011 06:36 AM, Eugen_pcad_ru wrote:
>
>
>
>
>
> > Hello all!
> > I need pll which can:
> > 1) 40 MHz ->    320 MHz (0 deg),
> >                320 MHz (15 deg),
> >                320 MHz (30 deg),
> >                320 MHz (45 deg),
> >                320 MHz (60 deg)
> >                320 MHz (75 deg),
> >                320 MHz (90 deg),
> >                320 MHz (105 deg),
> >                320 MHz (120 deg),
> >                320 MHz (135 deg),
> >                320 MHz (150 deg),
> >                320 MHz (165 deg),
> >                320 MHz (180 deg).
> > They can be together or not.
> > And I have two fpgas: Cyclone III (Altera), Spartan-3AN.
> > What fpga is better for me? Why? Or its no difference?
>
> I don't know about the Altera part, but Xilinx is too cool to use
> phase-locked loops -- they use delay-locked loops instead (see the data
> sheet).  This means that they can maybe generate the clock you need, but
> they'll do it by delaying the 40MHz clock, and they'll demand that your
> clock edges have no more than 150ps of jitter.  In other words, you need
> to feed it a 40MHz clock that jitters no worse than a good 320MHz clock.
>
> --
>
> Tim Wescott
> Wescott Design Services
> http://www.wescottdesign.com
>
> Do you need to implement control loops in software?
> "Applied Control Theory for Embedded Systems" was written for you.
> See details at http://www.wescottdesign.com/actfes/actfes.html

In the Spartan-3A family the DCM CLKIN jitter is specified at +/-300pS
at 40 MHz.

> but Xilinx is too cool to use phase-locked loops

While this is true for older families, Virtex-5, Virtex-6, and
Spartan-6 all include PLL clocking elements.

Ed McGettigan
--
Xilinx Inc.

Article: 151030
Subject: Re: Count bits in VHDL, with loop and unrolled loop produces
From: a s <nospamas@gmail.com>
Date: Tue, 1 Mar 2011 08:41:46 -0800 (PST)

Dear Andy, Tricky,

thank you both for your valuable input. Please find my comments below.

On Mar 1, 3:49 pm, Andy <jonesa...@comcast.net> wrote:
> A good synthesis tool should be able to optimize either version to the
> same implementation. But there are semantic differences that Xilinx
> may be getting hung up on.

Aha, that's a good thing, it means that I did not make some obvious
mistake. ;-)

> If you are looking for a general solution to the problem for any size
> vector, try a recursive function implementation of a binary tree and see
> what happens.

OK, I didn't quite get that but will consider it again.

> Just for kicks, you might also put the whole calculation in the loop
> (0 to 31), with temp set to 0 before the loop. Shouldn't make any
> difference, but then again, we're already seeing differences where
> there should be none.

Sorry I didn't tell you before. I have already tried that and in this
case XST produces the same result.

> On the other hand, if what you have works (fits and meets timing), use
> the most maintainable, understandable version. It will save you time
> (=money) in the long run. It is often interesting to find out what is
> "the optimal" way to code something such that it results in the
> smallest/fastest circuit. But in the big picture, it most often does
> not matter, especially when you write some cryptic code to squeeze the
> last pS/LUT out of it, and you had plenty of slack and space to spare
> anyway. Nevertheless, knowing how to squeeze every pS/LUT comes in
> handy every once in a while.

Andy, I completely agree with what you have written above.
One should strive for a maintainable and understandable version.
However, in my particular case I have to find a good solution
in terms of LUT resources, because I need 8 instances of the
ones counter with 64-bit input data. And the device is getting full...

Tricky, your approach does indeed look very neat. I like it.
However, it is far less efficient than mine. For the same input/output
ports, your version with the function requires 171 slices and 313 LUTs.
(The minimum that I get with the unrolled version is 41 slices and 73
LUTs.)

Article: 151031
Subject: Re: xilinx spartan 6
From: Serkan <oktem@su.sabanciuniv.edu>
Date: Tue, 1 Mar 2011 11:47:39 -0800 (PST)

On Mar 1, 6:26 pm, Ed McGettigan <ed.mcgetti...@xilinx.com> wrote:
> On Feb 28, 7:56 am, Serkan <ok...@su.sabanciuniv.edu> wrote:
>
> > I need to route a FAST CLK (that is used for deserializing and input
> > to only one bank) to another bank's IODELAY2 and IOSERDES2 elements.
> > Is this possible?
>
> > Please remember that I also need to send signals like,
>
> > -serdesstrobe,
> > -fast ioclk(sampling fast serial data),
> > -parallel clk(clk whose frequency is the same as parallel
> > data(deserialized data)) to these elements.
>
> > Serkan
>
> This isn't possible.  These clocks can only operate within one bank.
>
> You may be able to just use a BUFG if the data rate isn't too high.
>
> Ed McGettigan
> --
> Xilinx Inc.

Dear Ed,

Are you sure about this?

Because I was able to drive two PLLs with BUFIO2s using two
ISERDES(DFB) outputs.
Maybe I'm missing something, but the configuration below placed and routed:

same GCLK --> same IBUFGDS-->first ISERDES2(DFB)==>first bufio2===> PLL 1
same GCLK --> same IBUFGDS-->2nd ISERDES2(DFB)==>2nd bufio2===> PLL 2

I'm trying to do this because I do not want to be limited to the 400 MHz
of the Spartan 6 BUFGs while deserializing.

Article: 151032
Subject: Re: PLL Cyclone III vs PLL(DLL) Spartan-3AN
From: Tim Wescott <tim@seemywebsite.com>
Date: Tue, 01 Mar 2011 12:13:12 -0800

On 03/01/2011 08:34 AM, Ed McGettigan wrote:
> On Feb 28, 11:32 am, Tim Wescott<t...@seemywebsite.com>  wrote:
>> On 02/28/2011 06:36 AM, Eugen_pcad_ru wrote:
>>
>>
>>
>>
>>
>>> Hello all!
>>> I need pll which can:
>>> 1) 40 MHz ->    320 MHz (0 deg),
>>>                320 MHz (15 deg),
>>>                320 MHz (30 deg),
>>>                320 MHz (45 deg),
>>>                320 MHz (60 deg)
>>>                320 MHz (75 deg),
>>>                320 MHz (90 deg),
>>>                320 MHz (105 deg),
>>>                320 MHz (120 deg),
>>>                320 MHz (135 deg),
>>>                320 MHz (150 deg),
>>>                320 MHz (165 deg),
>>>                320 MHz (180 deg).
>>> They can be together or not.
>>> And I have two fpgas: Cyclone III (Altera), Spartan-3AN.
>>> What fpga is better for me? Why? Or its no difference?
>>
>> I don't know about the Altera part, but Xilinx is too cool to use
>> phase-locked loops -- they use delay-locked loops instead (see the data
>> sheet).  This means that they can maybe generate the clock you need, but
>> they'll do it by delaying the 40MHz clock, and they'll demand that your
>> clock edges have no more than 150ps of jitter.  In other words, you need
>> to feed it a 40MHz clock that jitters no worse than a good 320MHz clock.
>
> In the Spartan-3A family the DCM CLKIN jitter is specified at +/-300pS
> at 40 MHz.

By the data sheet that I looked at, that is true unless you're asking 
for synthesized frequencies > 150MHz, in which case it needs to be 150ps.

>> but Xilinx is too cool to use phase-locked loops
>
> While this is true for older families, Virtex-5, Virtex-6, and
> Spartan-6 all include PLL clocking elements.
>

That's good to know -- sometimes a real PLL is a good thing, when you 
have a clock to clean up.

-- 

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Do you need to implement control loops in software?
"Applied Control Theory for Embedded Systems" was written for you.
See details at http://www.wescottdesign.com/actfes/actfes.html

Article: 151033
Subject: Re: Count bits in VHDL, with loop and unrolled loop produces
From: johnp <jprovidenza@yahoo.com>
Date: Tue, 1 Mar 2011 13:41:06 -0800 (PST)


Here's a slightly different approach to your problem....  It tries to
take advantage
of the fact that the LUTs are pretty good as lookup tables.  It's in
Verilog, but
you should easily be able to convert it to VHDL.

`define V1

module tst (
    input       [31:0]          data,
    output      [5:0]           cnt
);


`ifdef V1

/*
    Device utilization summary:
    ---------------------------

    Selected Device : 3s50pq208-5

     Number of Slices:                       35  out of    768     4%
     Number of 4 input LUTs:                 62  out of   1536     4%
     Number of IOs:                          38
     Number of bonded IOBs:                   0  out of    124     0%
*/


function [1:0]  cnt3;
    input       [2:0]   d_in;
    begin
    case (d_in)
    3'h0:       cnt3 = 2'h0;
    3'h1:       cnt3 = 2'h1;
    3'h2:       cnt3 = 2'h1;
    3'h3:       cnt3 = 2'h2;
    3'h4:       cnt3 = 2'h1;
    3'h5:       cnt3 = 2'h2;
    3'h6:       cnt3 = 2'h2;
    3'h7:       cnt3 = 2'h3;
    endcase
    end
endfunction

assign cnt = cnt3(data[2:0])
           + cnt3(data[5:3])
           + cnt3(data[8:6])
           + cnt3(data[11:9])
           + cnt3(data[14:12])
           + cnt3(data[17:15])
           + cnt3(data[20:18])
           + cnt3(data[23:21])
           + cnt3(data[26:24])
           + cnt3(data[29:27])
           + cnt3({1'b0, data[31:30]})
           ;

`endif


`ifdef V2
/*
    Selected Device : 3s50pq208-5

     Number of Slices:                       44  out of    768     5%
     Number of 4 input LUTs:                 79  out of   1536     5%
     Number of IOs:                          38
     Number of bonded IOBs:                   0  out of    124     0%
*/

function [2:0]  cnt4;
    input       [3:0]   d_in;
    begin
    case (d_in)
    4'h0:       cnt4 = 3'h0;
    4'h1:       cnt4 = 3'h1;
    4'h2:       cnt4 = 3'h1;
    4'h3:       cnt4 = 3'h2;
    4'h4:       cnt4 = 3'h1;
    4'h5:       cnt4 = 3'h2;
    4'h6:       cnt4 = 3'h2;
    4'h7:       cnt4 = 3'h3;

    4'h8:       cnt4 = 3'h1;
    4'h9:       cnt4 = 3'h2;
    4'ha:       cnt4 = 3'h2;
    4'hb:       cnt4 = 3'h3;
    4'hc:       cnt4 = 3'h2;
    4'hd:       cnt4 = 3'h3;
    4'he:       cnt4 = 3'h3;
    4'hf:       cnt4 = 3'h4;

    endcase
    end
endfunction

assign cnt = cnt4(data[3:0])
           + cnt4(data[7:4])
           + cnt4(data[11:8])
           + cnt4(data[15:12])
           + cnt4(data[19:16])
           + cnt4(data[23:20])
           + cnt4(data[27:24])
           + cnt4(data[31:28])
           ;


`endif
endmodule
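
A possible VHDL translation of the V1 (3-bit lookup) variant is sketched
below. It has not been run through XST, so the slice/LUT figures quoted
in the Verilog comments may not carry over; the architecture name
lut3_arch is made up here.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity tst is
  port ( data : in  std_logic_vector(31 downto 0);
         cnt  : out std_logic_vector(5 downto 0) );
end entity tst;

-- Architecture name is hypothetical.
architecture lut3_arch of tst is

  -- Ones count of a 3-bit group, written as a lookup table (result 0..3).
  function cnt3(d : std_logic_vector(2 downto 0)) return natural is
  begin
    case d is
      when "000"                 => return 0;
      when "001" | "010" | "100" => return 1;
      when "011" | "101" | "110" => return 2;
      when "111"                 => return 3;
      when others                => return 0;  -- 'X', 'U', ... in simulation
    end case;
  end function cnt3;

begin

  -- Flat sum of ten 3-bit groups plus the two remaining bits (zero-padded),
  -- mirroring the Verilog assign above.
  cnt <= std_logic_vector(to_unsigned(
           cnt3(data( 2 downto  0)) + cnt3(data( 5 downto  3)) +
           cnt3(data( 8 downto  6)) + cnt3(data(11 downto  9)) +
           cnt3(data(14 downto 12)) + cnt3(data(17 downto 15)) +
           cnt3(data(20 downto 18)) + cnt3(data(23 downto 21)) +
           cnt3(data(26 downto 24)) + cnt3(data(29 downto 27)) +
           cnt3('0' & data(31 downto 30)),
           cnt'length));

end architecture lut3_arch;
```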


John Providenza

Article: 151034
Subject: Slice Usage
From: Turj <Turj@hotmail.com>
Date: Tue, 01 Mar 2011 22:03:33 +0000

Dear all,

I am using XST 12.1 for synthesis and for the first time I had a look at
how many slices are occupied by my design. I have to say, I am a bit
overwhelmed by the results; with Virtex II you essentially got one
figure, which was the total number of slices.

In Virtex 5, XST seems to break this down as follows:

Design Summary:
Slice Logic Utilization:
  Number of Slice Registers:                 3,127 out of  28,800    9%
    Number used as Flip Flops:               3,127
  Number of Slice LUTs:                      7,521 out of  28,800   25%
    Number used as logic:                    7,256 out of  28,800   25%
      Number using O6 output only:           7,001
      Number using O5 output only:              68
      Number using O5 and O6:                   75
    Number used as Memory:                      37 out of   7,680    1%
      Number used as Shift Register:            37
        Number using O6 output only:            37
    Number used as exclusive route-thru:         3
  Number of route-thrus:                        72
    Number using O6 output only:                60
    Number using O5 output only:                 9

Slice Logic Distribution:
  Number of occupied Slices:                 2,451 out of   7,200   32%
  Number of LUT Flip Flop pairs used:        8,625
    Number with an unused Flip Flop:         5,001 out of   7,707   63%
    Number with an unused LUT:                 507 out of   7,707    5%
    Number of fully used LUT-FF pairs:       2,532 out of   7,707   31%
    Number of unique control sets:             233

So what is the total number of slices occupied on my board? Which
ones do I have to add up?

Maybe: Number of occupied Slices + Number of Slice Registers + Number
used as logic?

thanks,
Turj

Article: 151035
Subject: Re: PLL Cyclone III vs PLL(DLL) Spartan-3AN
From: Chris Maryan <kmaryan@gmail.com>
Date: Tue, 1 Mar 2011 14:17:35 -0800 (PST)

Generally a post of this sort begs the question: what are you doing and
why do you think you need 15 degree phase increments of a 320MHz clock?
Do you need them all at once or can you switch between them? Do you need
those phases to be exact? Do you need exactly 15 degree increments or
just some sweep from 0 to 180?

I can't offhand recall the features of the DCMs in the Xilinx part, but
something to consider is also whether you have the clocking resources to
get what you want. Typically in a single region in the FPGA, you have
access to something like 8-12 regional or global clocks. So if you want
all 13 phases going to some logic via the clock network, it might be
tough or impossible to do. Going via general routing resources would
create horrible skew and it would probably defeat the purpose of having
precise phases.

I think in general you need to elaborate on what you are trying to do.

Chris

Article: 151036
Subject: Re: Count bits in VHDL, with loop and unrolled loop produces
From: Andy <jonesandy@comcast.net>
Date: Tue, 1 Mar 2011 14:36:33 -0800 (PST)

> > If you are looking for a general solution to the problem for any size
> > vector, try a recursive function implentation of a binary tree and see
> > what happens.
>
> OK, I didn't quite get that but will consider it again.
>

The synthesis tool may not be able to figure out that it need not
carry all the bits of the sum through every calculation in the loop. A
binary tree implementation can manage the sum width at every stage.

For a recursive binary tree implementation, define a function that
divides the vector into two ~equal parts (i.e. n/2 and n/2 + n mod 2),
calls itself on each one, and returns the sum of the two results. Stop
recursion when the size of the incoming vector is 1 (just return 1 if
the bit is set, and 0 if not). This is synthesizable as long as the
recursion is statically bound (which it is, by the size of the
original vector).

It should work out pretty close to what johnp's approach does, except
work for any size input vector.
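
A minimal sketch of such a function is given below, assuming a tool that
accepts statically bounded recursion. The package and function names are
made up. Note that it returns natural, so trimming the width of each
partial sum is still left to the synthesis tool; returning an unsigned
sized per stage would make that explicit.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Package and function names are hypothetical.
package count_ones_pkg is
  function count_ones_tree(slv : std_logic_vector) return natural;
end package count_ones_pkg;

package body count_ones_pkg is

  function count_ones_tree(slv : std_logic_vector) return natural is
    -- Normalise to a 0-based descending range so the split indices are simple.
    alias v : std_logic_vector(slv'length - 1 downto 0) is slv;
    -- Size of the lower part: n/2. The upper part gets n/2 + n mod 2.
    constant HALF : natural := slv'length / 2;
  begin
    if slv'length = 0 then
      return 0;
    elsif slv'length = 1 then
      -- Recursion stops here: 1 if the bit is set, 0 otherwise.
      if v(0) = '1' then
        return 1;
      else
        return 0;
      end if;
    else
      -- Count each part separately and add the two results.
      return count_ones_tree(v(v'high downto HALF))
           + count_ones_tree(v(HALF - 1 downto 0));
    end if;
  end function count_ones_tree;

end package body count_ones_pkg;
```

With numeric_std visible, it would be used the same way as the loop-based
function earlier in the thread, e.g.
dout <= std_logic_vector(to_unsigned(count_ones_tree(din), dout'length));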

Andy

Article: 151037
Subject: Re: Slice Usage
From: Chris Maryan <kmaryan@gmail.com>
Date: Tue, 1 Mar 2011 14:47:22 -0800 (PST)

"Number of occupied slices" is the total number of slices used. "Number
of slice registers" is the number of FFs used in the whole FPGA; there
are 4 FFs and 4 LUTs to a slice in a V5 (7200*4 = 28800).

"Number used as logic" is the number of LUTs used as a true LUT. A LUT
can also be used as distributed RAM (Number used as Memory) (see Xilinx
docs) or as a shift register (see SRL in the docs).

A slice can have anywhere from 0 to all 4 of its FFs used; the same goes
for the LUTs. Sometimes a slice is partially empty because the placer
chooses to group the LUTs/FFs that way for no good reason; sometimes
there are legitimate reasons why certain groups of FFs can't be put in
the same slice as some others (look for "control sets" in the Xilinx
documentation).

If you are trying to get an idea of how big your design is, the number
of LUTs (Number of Slice LUTs) or the number of FFs (Number of Slice
Registers) is usually a good start. In my experience, MAP/PAR are able to
fill up to about 75-80% of the FFs or LUTs in a chip before it becomes
difficult to fit everything.

Chris

Article: 151038
Subject: Re: Slice Usage
From: Thomas Womack <twomack@chiark.greenend.org.uk>
Date: 02 Mar 2011 00:04:13 +0000 (GMT)

In article <ikjqfr$uah$1@speranza.aioe.org>, Turj  <Turj@hotmail.com> wrote:
>Dear all,
>
>I am using XST 12.1 for synthesis and for the first time I had a look at
>how many slices are occupied by my design. I have to say, I am a bit
>overwhelmed by the results; with Virtex II you essentially got one
>figure, which was the total number of slices.

>In Virtex 5, XST seems to break this down as follows:

>Design Summary:
>Number of Slice Registers:                3,127 out of  28,800    9%
>Number of Slice LUTs:                     7,521 out of  28,800   25%
>
>Slice Logic Distribution:
>Number of occupied Slices:                2,451 out of   7,200   32%
>
>So what is the total number of slices occupied on my board? Which
>ones do I have to add up?

'Number of occupied slices' is the number you want; the slices are
quite big (each contains four registers and four LUTs, with not all of
them necessarily being used).  At the moment, you could double the
size of your design with no trouble; you might be able to triple it
assuming things pack nicely (you'd be using 75% of the LUTs and maybe
the software would be able to pack them into less than the 96% of the
slices which simply writing the current design down three times would
occupy); it would be unwise to attempt to quadruple it.

Tom

Article: 151039
Subject: Re: Count bits in VHDL, with loop and unrolled loop produces different results
From: glen herrmannsfeldt <gah@ugcs.caltech.edu>
Date: Wed, 2 Mar 2011 00:20:05 +0000 (UTC)

In comp.arch.fpga Andy <jonesandy@comcast.net> wrote:
(snip)

> The synthesis tool may not be able to figure out that it need not
> carry all the bits of the sum through every calculation in the loop. A
> binary tree implementation can manage the sum width at every stage.

Is that the same as a Carry Save Adder tree?

Maybe not.  The CSA has three inputs and two outputs, so it
isn't exactly binary.  
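
For comparison, a single carry-save stage (3:2 compressor) can be
sketched as below. The entity name and the WIDTH generic are
illustrative only. Each bit position is a full adder, so
a + b + c = sum + 2*carry, which means the carry vector has to be
shifted left by one place before it is added in a later stage.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Entity name and generic are hypothetical.
entity csa32 is
  generic ( WIDTH : positive := 8 );
  port ( a, b, c : in  std_logic_vector(WIDTH - 1 downto 0);
         sum     : out std_logic_vector(WIDTH - 1 downto 0);
         carry   : out std_logic_vector(WIDTH - 1 downto 0) );
end entity csa32;

architecture rtl of csa32 is
begin
  -- Per-bit full adder: three inputs in, two outputs out, no carry ripple.
  sum   <= a xor b xor c;
  carry <= (a and b) or (a and c) or (b and c);
end architecture rtl;
```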

-- glen

Article: 151040
Subject: S3E1600 Digilent drivers
From: "fili" <filisoft@n_o_s_p_a_m.gmail.com>
Date: Tue, 01 Mar 2011 18:43:59 -0600

Hello

Today I finally got my Spartan 3E 1600 eval board from Digilent. The
problem is that I can't make it work with the USB cable programmer. I
think something is wrong with the drivers.
When connecting it to Windows XP SP3 it reports the VID/PID pair as 03FD /
000D. On the other hand, in the dmodusb.inf from the Adept drivers I found
4 pairs:
VID_1443 PID_0007
VID_1443 PID_0005
VID_1443 PID_0003
VID_1443 PID_0001
This means that it can't recognize my board and fails to install any
driver. Replacing any pair with 03FD/000D will allow the driver to be
installed but Adept still can't find my board and the green LED next to the
USB connector is not lit. I guess I should replace all pairs with something
meaningful, but I don't know exactly the values.
On the other hand, Windows 7 recognizes the board without any driver (as a
Xilinx USB cable), the green LED is lit but Adept can't see it ("No devices
connected"). The funny thing is that in Win7 the VID/PID pair is
03FD/0008.
Rather strange, I might say... The board seems to be functional because
in normal operating mode it writes text on the LCD and blinks the LEDs.

Does anybody have any advice?
(searching for the Ubuntu install CD..., let's see what VID/PID I get there
:P )

Thanks,
Fili

	   
					

Article: 151041
Subject: Re: Count bits in VHDL, with loop and unrolled loop produces
From: Brian Davis <brimdavis@aol.com>
Date: Tue, 1 Mar 2011 16:59:58 -0800 (PST)
Links: << >> << T >> << A >>

Peter wrote:
>
> If somebody has a better solution, I would appreciate it if you could share it.
>
 When I looked at this some years back, XST worked well enough
at creating an adder cascade using the old "mask and add" software
trick that I never bothered writing something more optimal:

 gen_bcnt32: if ( ALU_WIDTH = 32 ) generate
   begin

     process(din)
       -- multiple variables not needed, but make intermediate steps
       -- visible in simulation
       variable temp  : unsigned (ALU_MSB downto 0);
       variable temp1 : unsigned (ALU_MSB downto 0);
       variable temp2 : unsigned (ALU_MSB downto 0);
       variable temp3 : unsigned (ALU_MSB downto 0);
       variable temp4 : unsigned (ALU_MSB downto 0);
       variable temp5 : unsigned (ALU_MSB downto 0);

       begin
         temp  := unsigned(din);

         temp1 := (temp  AND X"5555_5555") + ( ( temp  srl 1)  AND X"5555_5555");  -- 0..2 out  x16

         temp2 := (temp1 AND X"3333_3333") + ( ( temp1 srl 2)  AND X"3333_3333");  -- 0..4 out  x8

         temp3 := (temp2 AND X"0707_0707") + ( ( temp2 srl 4)  AND X"0707_0707");  -- 0..8 out  x4

         temp4 := (temp3 AND X"001f_001f") + ( ( temp3 srl 8)  AND X"001f_001f");  -- 0..16 out x2

         temp5 := (temp4 AND X"0000_003f") + ( ( temp4 srl 16) AND X"0000_003f");  -- 0..32 out

         cnt <= std_logic_vector(temp5(5 downto 0));

       end process;

  end generate gen_bcnt32;


Brian

Article: 151042
Subject: Re: encryption in FPGA
From: anudeep <reddy.mrp@gmail.com>
Date: Tue, 1 Mar 2011 20:25:20 -0800 (PST)
Links: << >> << T >> << A >>

On Mar 1, 6:37 pm, "RCIngham"
<robert.ingham@n_o_s_p_a_m.n_o_s_p_a_m.gmail.com> wrote:
> >My project is implementing the Blowfish algorithm in an FPGA: sending
> >data from a PC through the FPGA and encrypting it. Which protocols can
> >I use for this? Please point me to some related links.
>
> What is the data rate you need to send from PC to FPGA?
> Is it streaming?
> Are you sending any data back from FPGA to PC?
>

I'm using a Virtex-II Pro board, so I can use high-speed data transfer
on one side and serial data transfer on the other. I need to send a
file, but I don't know what you mean by streaming. And yes, I want to
send the file back to the PC.

Article: 151043
Subject: Re: Count bits in VHDL, with loop and unrolled loop produces different results
From: glen herrmannsfeldt <gah@ugcs.caltech.edu>
Date: Wed, 2 Mar 2011 05:55:31 +0000 (UTC)
Links: << >> << T >> << A >>

In comp.arch.fpga Brian Davis <brimdavis@aol.com> wrote:
(snip)

> When I looked at this some years back, XST worked well enough
> at creating an adder cascade using the old "mask and add" software
> trick that I never bothered writing something more optimal:

(snip)

>         temp1 := (temp  AND X"5555_5555") + ( ( temp  srl 1)  AND X"5555_5555");  -- 0..2 out  x16
>
>         temp2 := (temp1 AND X"3333_3333") + ( ( temp1 srl 2)  AND X"3333_3333");  -- 0..4 out  x8

OK, that would be a binary tree.  I believe the CSA adder tree
is slightly more efficient, though it might depend on the number
of inputs.

The binary tree cascade works especially well on word oriented
machines, and can be easily written in many high-level languages
(with the assumption of knowing the word size).
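
As a rough illustration of the cascade in a high-level language, here is
a minimal Python sketch. The language and the function name popcount32
are choices made for illustration and are not from the thread; the masks
are the ones in Brian's VHDL, and a 32-bit word is assumed.

```python
def popcount32(x):
    """Count set bits in a 32-bit word with the mask-and-add cascade."""
    x &= 0xFFFFFFFF                                   # assume a 32-bit word
    x = (x & 0x55555555) + ((x >> 1) & 0x55555555)    # 16 sums, each 0..2
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)    # 8 sums, each 0..4
    x = (x & 0x07070707) + ((x >> 4) & 0x07070707)    # 4 sums, each 0..8
    x = (x & 0x001F001F) + ((x >> 8) & 0x001F001F)    # 2 sums, each 0..16
    x = (x & 0x0000003F) + ((x >> 16) & 0x0000003F)   # 1 sum, 0..32
    return x
```

Each stage adds neighbouring fields pairwise, so the field width doubles
while the number of fields halves; the masks only need to be wide enough
for the largest value a field can hold at that stage.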

The first stage of a CSA tree starts with N inputs, and generates
N/3 two bit outputs that are the sums and carries from N/3 full adders.
(If N isn't a multiple of three, then one bit may bypass the stage,
and two bits go into a half adder.)

The next stage takes the N/3 ones and N/3 twos, and generates
N/9 ones, 2N/9 twos, and N/9 fours.   You can continue until
there is only one bit of each, or sometimes there are other
optimizations near the end.  Last time I did one, I only needed
to know zero, one, two, three, or more than three, which simplifies
it slightly.  
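
The stages described above can be modelled in software. The following
Python sketch is only a bit-level model under stated assumptions, not a
hardware description: the names full_adder and csa_popcount are
hypothetical, leftover pairs go through a half adder and a single
leftover bit bypasses the stage, and the reduction simply repeats until
one bit of each weight remains (no end-of-tree optimizations).

```python
def full_adder(a, b, c):
    """3:2 compressor: three bits of equal weight in, (sum, carry) out."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def csa_popcount(x, width=32):
    """Count the set bits of x by repeated carry-save reduction."""
    # columns[k] holds the bits that currently have weight 2**k
    columns = [[(x >> i) & 1 for i in range(width)]]
    while any(len(col) > 1 for col in columns):
        nxt = [[] for _ in range(len(columns) + 1)]
        for k, col in enumerate(columns):
            i = 0
            while len(col) - i >= 3:          # groups of three: full adder
                s, c = full_adder(col[i], col[i + 1], col[i + 2])
                nxt[k].append(s)
                nxt[k + 1].append(c)
                i += 3
            if len(col) - i == 2:             # two left over: half adder
                nxt[k].append(col[i] ^ col[i + 1])
                nxt[k + 1].append(col[i] & col[i + 1])
            elif len(col) - i == 1:           # one left over: bypass
                nxt[k].append(col[i])
        while nxt and not nxt[-1]:            # drop unused top columns
            nxt.pop()
        columns = nxt
    # one bit per weight remains; assemble the binary result
    return sum(col[0] << k for k, col in enumerate(columns) if col)
```

In hardware each pass of the while loop corresponds to one level of
adders, which is also a natural place for a pipeline register.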

It also pipelines well, but then so does the binary tree.

-- glen

Article: 151044
Subject: Re: Count bits in VHDL, with loop and unrolled loop produces
From: a s <nospamas@gmail.com>
Date: Tue, 1 Mar 2011 22:50:18 -0800 (PST)
Links: << >> << T >> << A >>

Andy nailed it again when he said you get what you pay for
regarding synthesis tools. Initially I was synthesising with
ISE 12.4 and the results were, hm, inconsistent. After
switching the synthesis tool to SynplifyPro v2010.03 the
results were as expected and, of course, even better than that.

Please see the table below. Tricky's version is denoted "funct",
where there are major differences between the 2 tools:

-------------- 32-bit input data --------------
unrolled: XST               74 LUTs,  41 slices
unrolled: SynplifyPro       57 LUTs,  34 slices

loop:     XST              100 LUTs,  54 slices
loop:     SynplifyPro       57 LUTs,  34 slices

funct:    XST              317 LUTs, 161 slices
funct:    SynplifyPro       58 LUTs,  34 slices

-------------- 64-bit input data --------------
unrolled: XST              174 LUTs,  96 slices
unrolled: SynplifyPro      129 LUTs,  80 slices

loop:     XST              227 LUTs, 121 slices
loop:     SynplifyPro      129 LUTs,  80 slices

funct:    XST              813 LUTs, 436 slices
funct:    SynplifyPro      130 LUTs,  82 slices



Peter

Article: 151045
Subject: Re: S3E1600 Digilent drivers
From: "fili" <filisoft@n_o_s_p_a_m.n_o_s_p_a_m.gmail.com>
Date: Wed, 02 Mar 2011 01:10:52 -0600
Links: << >> << T >> << A >>

It's me again :)

I finally got it working, but not the way I expected. I've uninstalled all
drivers and installed Xilinx ISE Webpack. This one came with drivers that
have the correct VID/PID pairs and now I'm able to program the board via
Impact. Digilent's Adept still can't see the board.
I'm still interested if there's a way to program the board via Adept (I
don't want to install webpack on all computers) but at least I'm not
desperate anymore, as I have a way to do it.

Thanks... err.. for listening to my problems :)
Fili

Article: 151046
Subject: Re: Count bits in VHDL, with loop and unrolled loop produces
From: a s <nospamas@gmail.com>
Date: Tue, 1 Mar 2011 23:33:34 -0800 (PST)
Links: << >> << T >> << A >>

Johnp, Brian, thank you too for your input! Much appreciated.

I have run your code through 2 synthesisers and have updated the table
of required resources.

-------------- 32-bit input data --------------
unrolled: XST               74 LUTs,  41 slices
unrolled: SynplifyPro       57 LUTs,  34 slices

loop:     XST              100 LUTs,  54 slices
loop:     SynplifyPro       57 LUTs,  34 slices

funct:    XST              317 LUTs, 161 slices
funct:    SynplifyPro       58 LUTs,  34 slices

JohnpV1:  XST               62 LUTs,  35 slices
JohnpV1:  SynplifyPro       57 LUTs,  33 slices

JohnpV2:  XST               78 LUTs,  43 slices
JohnpV2:  SynplifyPro       54 LUTs,  32 slices

Brian:    XST               57 LUTs,  39 slices
Brian:    SynplifyPro       57 LUTs,  41 slices


The last 3 pairs of results are interesting because even
XST produces good results, especially with Brian's version,
where XST is surprisingly even slightly better. But it's not
that XST is so clever; it is the clever coding of the design.

Regards,
Peter

Article: 151047
Subject: Re: encryption in FPGA
From: "RCIngham" <robert.ingham@n_o_s_p_a_m.n_o_s_p_a_m.gmail.com>
Date: Wed, 02 Mar 2011 04:23:52 -0600
Links: << >> << T >> << A >>

>On Mar 1, 6:37 pm, "RCIngham"
><robert.ingham@n_o_s_p_a_m.n_o_s_p_a_m.gmail.com> wrote:
>> >My project is implementing the Blowfish algorithm in an FPGA: sending
>> >data from a PC through the FPGA and encrypting it. Which protocols can
>> >I use for this? Please point me to some related links.
>>
>> What is the data rate you need to send from PC to FPGA?
>> Is it streaming?
>> Are you sending any data back from FPGA to PC?
>>
>
>I'm using a Virtex-II Pro board, so I can use high-speed data transfer
>on one side and serial data transfer on the other. I need to send a
>file, but I don't know what you mean by streaming. And yes, I want to
>send the file back to the PC.
>

So, not streaming, then.

How big a file?
How long can you wait for it to be sent and returned?

Article: 151048
Subject: iir filter
From: "francesco_pincio" <franz_as_tux@n_o_s_p_a_m.yahoo.it>
Date: Wed, 02 Mar 2011 08:37:35 -0600
Links: << >> << T >> << A >>

Hello!
I'm new to the forum and have just finished a very, very short university
FPGA course; we only turned LEDs on and off with finite state machines and
so on. Now I'm trying to develop an IIR filter on the XSA50 board from Xess
with a SpartanIIe50 FPGA. The filter kernel is just a 2-pole system with a
zero at 0; I would like to build a bandpass whose passband can be changed
with pushbuttons. My idea for the main structure of the design is a module
with a counter that generates the clock for the ADC/DAC, a module that
passes the samples into the filter kernel, the IIR filter kernel itself,
and a module that passes the filtered samples to the DAC. I have 2 main
problems:

1) I can only do operations with power-of-two coefficients, so I can use
only 1/2, 1/4 and so on. I don't understand how to pass in a float value
and multiply by it.

2) Do I need a RAM to store at least the y[n-2] sample?

I know my questions sound stupid; maybe I don't have a good idea of what I
have to do. If you could enlighten me on this...

Best regards

francesco

Article: 151049
Subject: Re: iir filter
From: "RCIngham" <robert.ingham@n_o_s_p_a_m.n_o_s_p_a_m.gmail.com>
Date: Wed, 02 Mar 2011 09:23:56 -0600
Links: << >> << T >> << A >>

>
>1) I can only do operations with power-of-two coefficients, so I can use
>only 1/2, 1/4 and so on. I don't understand how to pass in a float value
>and multiply by it.

You probably don't need floating point. Floating-point is difficult on
FPGAs.
Imagine your ADC data as between -1 and +1. Then, when you multiply, the
result is still between -1 and +1. After additions, scale your data so it
is again between -1 and +1.

There is a difficulty with IIR filters in digital hardware, which is
arithmetic saturation. You will need to do simulations to prove that either
it doesn't happen or that you detect and mitigate it.
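
To make the scaling idea concrete, here is a minimal fixed-point model
in Python, suitable for the kind of simulation mentioned above. It is a
sketch under assumptions that are not from the thread: 16-bit signed
samples, coefficients stored with 14 fractional bits (so they can range
from -2 to just under +2), a filter of the form
y[n] = b0*x[n] + a1*y[n-1] + a2*y[n-2], and hypothetical names
(FRAC, to_fixed, saturate, iir2).

```python
FRAC = 14  # assumed number of fractional bits in each coefficient


def to_fixed(value):
    """Convert a real coefficient to an integer with FRAC fractional bits."""
    return int(round(value * (1 << FRAC)))


def saturate(value, bits=16):
    """Clamp to the signed range of a 'bits'-wide register."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def iir2(samples, b0, a1, a2):
    """Two-pole IIR using integer arithmetic only.

    samples: 16-bit signed integers; b0, a1, a2: values from to_fixed().
    """
    y1 = y2 = 0                             # y[n-1] and y[n-2] registers
    out = []
    for x in samples:
        acc = b0 * x + a1 * y1 + a2 * y2    # wide accumulator
        y = saturate(acc >> FRAC)           # rescale, then clamp to 16 bits
        out.append(y)
        y2, y1 = y1, y                      # shift the delay line
    return out
```

For example, iir2([1024, 0, 0], to_fixed(1.0), to_fixed(0.5), 0) decays
as 1024, 512, 256. Multiplying by an integer and shifting right by FRAC
is how a non-power-of-two coefficient such as 0.3 is applied without
floating point, and the two variables y1 and y2 are all the storage the
feedback path needs.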

>
>2) Do I need a RAM to store at least the y[n-2] sample?

Unless the number and size of samples is very large, it could be done with
registers (flip-flops). You need to sketch out how the data flows through
the arithmetic elements.
