Messages from 158525

Article: 158525
Subject: Re: modulo 2**32-1 arith
From: KJ <kkjennings@sbcglobal.net>
Date: Wed, 16 Dec 2015 09:36:39 -0800 (PST)
Links: << >> << T >> << A >>

On Wednesday, December 16, 2015 at 12:12:01 PM UTC-5, rickman wrote:
> The tool was not using the carry in.  Here is a version that uses the=20
> carry in.  66 LEs in one clock cycle.  It can be reduced to about 40 LEs=
=20
> if done in two clock cycles.
>=20
Here is the fitter summary showing 98 logic cells for this design that you =
posted.  Presumably you're using a different tool or different target devic=
e to come up with 66 so that number only becomes relevant if you use that s=
ame tool and target device to synthesize all of the other variants.
+--------------------------------------------------------------------------=
-------+
; Fitter Summary                                                           =
       ;
+------------------------------------+-------------------------------------=
-------+
; Fitter Status                      ; Successful - Wed Dec 16 12:31:58 201=
5      ;
; Quartus II 64-Bit Version          ; 13.1.0 Build 162 10/23/2013 SJ Web E=
dition ;
; Revision Name                      ; Junk                                =
       ;
; Top-level Entity Name              ; FPGA                                =
       ;
; Family                             ; Cyclone IV GX                       =
       ;
; Device                             ; EP4CGX22CF19C6                      =
       ;
; Timing Models                      ; Final                               =
       ;
; Total logic elements               ; 98 / 21,280 ( < 1 % )               =
       ;
;     Total combinational functions  ; 98 / 21,280 ( < 1 % )               =
       ;
;     Dedicated logic registers      ; 0 / 21,280 ( 0 % )                  =
       ;
; Total registers                    ; 0                                   =
       ;
; Total pins                         ; 96 / 167 ( 57 % )                   =
       ;
; Total virtual pins                 ; 0                                   =
       ;
; Total memory bits                  ; 0 / 774,144 ( 0 % )                 =
       ;
; Embedded Multiplier 9-bit elements ; 0 / 80 ( 0 % )                      =
       ;
; Total GXB Receiver Channel PCS     ; 0 / 4 ( 0 % )                       =
       ;
; Total GXB Receiver Channel PMA     ; 0 / 4 ( 0 % )                       =
       ;
; Total GXB Transmitter Channel PCS  ; 0 / 4 ( 0 % )                       =
       ;
; Total GXB Transmitter Channel PMA  ; 0 / 4 ( 0 % )                       =
       ;
; Total PLLs                         ; 0 / 4 ( 0 % )                       =
       ;
+------------------------------------+-------------------------------------=
-------+

Kevin Jennings

Article: 158526
Subject: Re: modulo 2**32-1 arith
From: KJ <kkjennings@sbcglobal.net>
Date: Wed, 16 Dec 2015 09:45:00 -0800 (PST)
Links: << >> << T >> << A >>

On Wednesday, December 16, 2015 at 9:23:49 AM UTC-5, thomas....@gmail.com 
> If you can do with a result every other cycle, you could try following:
> - Make an 32b adder with carry in and carry out
> - First cycle: Add the two numbers with carry.
> - Second cycle: Check if the output carry is set. If yes, we already have the result. Otherwise clear the input carry and use the new result.
> 
> This should be below 40 LEs. But I have not tried it out...
> 
I have yours coming in at 36 if I coded up what you said correctly, not tested.  You're in the lead now with Ilya in second.

Kevin Jennings

==== START OF CODE ====
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity Custom_Adder is generic(
    SP1_METHOD:     integer range 0 to 1;   -- Controls how 'Sum + 1' is calculated
    METHOD:         integer range 0 to 7);
port(
    Clock:  in  std_ulogic;
    A:      in  unsigned(31 downto 0);
    B:      in  unsigned(31 downto 0);
    C:      out unsigned(31 downto 0);
    Valid:  out std_ulogic
);
end Custom_Adder;

-- Results:
--          Logic Elements
-- Method#  SP1=0   SP1=1   Notes
-- 0        32      32      C <= A+B, no modulo '2**32-1' as a baseline
-- 1        32      32      C <= A+B+1, as another reference point 
-- 2        76      76      Ilya's method although implemented all in one clock cycle
-- 3        97      97      Rickman's method #1
-- 4        97      97      Rickman's method #2
-- 5        65      65      Ilya's method as stated, two clock cycles
-- 6        76      108     Same as method 2, but when SP1_METHOD=1 it will computes Sum + 1 as A+1+B rather than A+B+1
-- 7        36      36      Thomas' method, two clock cycles

architecture RTL of Custom_Adder is
    signal Sum:     unsigned(32 downto 0);
    signal Sum_P1:  unsigned(32 downto 0);
    constant MAX:   unsigned(32 downto 0) := '0' & x"FFFF_FFFF";
begin
    Sum     <= resize(A, Sum'length) + resize(B, Sum'length);
    Sum_P1  <= Sum + 1 when (SP1_METHOD = 0) else resize(A, Sum'length) + resize(B, Sum'length) + 1;
    -- KJ note:  If you change Sum_P1 to compute A+1+B rather than A+B+1 then the number of logic cells 
    --           increases by 27 (i.e. almost one for each bit of the adder

    -- KJ note:  Performing all calculations on one clock cycle in order to determine logic cells.  

    GEN_METHOD0: if (METHOD = 0) generate
        C   <= Sum(C'range);
    end generate GEN_METHOD0;

    GEN_METHOD1: if (METHOD = 1) generate
        C   <= Sum_P1(C'range);
    end generate GEN_METHOD1;

    GEN_METHOD2: if (METHOD = 2) generate
        -- Ilya's method, implemented all in one clock cycle:
        -- at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result 
        -- and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that
        C   <= Sum_P1(C'range) when (Sum >= MAX) else Sum(C'range);
    end generate GEN_METHOD2;

    GEN_METHOD3: if (METHOD = 3) generate
        signal Mod_Res: unsigned(32 downto 0);
    begin
        -- Rickman's first method
        -- sum <= RESIZE(a, 33) + RESIZE(b, 33); 
        -- temp <= sum + 1; 
        -- mod_result <= sum + temp(32); 
        -- answer <= mod_result(31 downto 0); 
        Mod_Res <= Sum + unsigned'(Sum_P1(32 downto 32));
        C       <= Mod_Res(C'range);
    end generate GEN_METHOD3;
    GEN_METHOD4: if (METHOD = 4) generate
        -- Rickman's second method
        -- signal Y: unsigned(31 downto 0); 
        -- signal Y, Y1, A, B: unsigned(32 downto 0);   <-- KJ note:  Redclares 'Y'
        -- Y1 <= A + B + 1;     <-- KJ note:  Error:  must have 33 elements not 32
        -- Y <= resize(A + B + resize(Y1(32 downto 32),33), 32); 
        signal Y:   unsigned(31 downto 0); 
        signal Y1:  unsigned(32 downto 0); 
    begin
        Y1  <= resize(A, 33) + resize(B, 33) + 1; 
        Y   <= resize(A + B + resize(Y1(32 downto 32),33), 32);
        C   <= Y;
    end generate GEN_METHOD4;

    GEN_METHOD5: if (METHOD = 5) generate
        -- Ilya's method:
        -- at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result 
        -- and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that
        signal Sum_Dlyd:    unsigned(Sum'range);
        signal Sum_Dlyd_P1: unsigned(Sum'range);
    begin
        Sum_Dlyd    <= Sum when rising_edge(Clock);
        Sum_Dlyd_P1 <= Sum_Dlyd + 1;
        C   <= Sum_Dlyd_P1(C'range) when (Sum_Dlyd(32) = '1') else Sum_Dlyd(C'range);
    end generate GEN_METHOD5;

    GEN_METHOD6: if (METHOD = 6) generate
        -- Same as method 2 (Ilya's method, implemented all in one clock cycle) except
        -- the ordering of operands for the computation of 'Sum + 1' is modified.
        signal Sum_P1:  unsigned(Sum'range);
    begin
        Sum_P1  <= Sum + 1 when (SP1_METHOD = 0) else resize(A, Sum'length) + 1 + resize(B, Sum'length);
        C   <= Sum_P1(C'range) when (Sum >= MAX) else Sum(C'range);
    end generate GEN_METHOD6;

    GEN_METHOD7: if (METHOD = 7) generate
        -- Thomas' method
        -- Make an 32b adder with carry in and carry out 
        -- First cycle: Add the two numbers with carry. 
        -- Second cycle: Check if the output carry is set. If yes, we already have the result. Otherwise clear the input carry and use the new result. 
        -- This should be below 40 LEs. But I have not tried it out...
        signal Sum_With_Carry:      unsigned(Sum'range);
        signal Carry_In:            unsigned(0 downto 0);
        signal Carry_Out:           std_ulogic;
    begin
        Carry_In        <= not(Carry_In);
        Sum_With_Carry  <= resize(A, Sum_With_Carry'length) + resize(B, Sum_With_Carry'length) + Carry_In;
        Carry_Out       <= Sum_With_Carry(32) when rising_edge(Clock);
        C               <= Sum_With_Carry(C'range);
        Valid           <= (Carry_In(0) and Sum_With_Carry(32)) or (not(Carry_In(0)) and not(Carry_Out));
    end generate GEN_METHOD7;
end RTL;
==== END OF CODE ====

Article: 158527
Subject: Re: modulo 2**32-1 arith
From: rickman <gnuarm@gmail.com>
Date: Wed, 16 Dec 2015 12:46:54 -0500
Links: << >> << T >> << A >>

On 12/16/2015 12:36 PM, KJ wrote:
> On Wednesday, December 16, 2015 at 12:12:01 PM UTC-5, rickman wrote:
>> The tool was not using the carry in.  Here is a version that uses the
>> carry in.  66 LEs in one clock cycle.  It can be reduced to about 40 LEs
>> if done in two clock cycles.
>>
> Here is the fitter summary showing 98 logic cells for this design that you posted.  Presumably you're using a different tool or different target device to come up with 66 so that number only becomes relevant if you use that same tool and target device to synthesize all of the other variants.
> +---------------------------------------------------------------------------------+
> ; Fitter Summary                                                                  ;
> +------------------------------------+--------------------------------------------+
> ; Fitter Status                      ; Successful - Wed Dec 16 12:31:58 2015      ;
> ; Quartus II 64-Bit Version          ; 13.1.0 Build 162 10/23/2013 SJ Web Edition ;
> ; Revision Name                      ; Junk                                       ;
> ; Top-level Entity Name              ; FPGA                                       ;
> ; Family                             ; Cyclone IV GX                              ;
> ; Device                             ; EP4CGX22CF19C6                             ;
> ; Timing Models                      ; Final                                      ;
> ; Total logic elements               ; 98 / 21,280 ( < 1 % )                      ;
> ;     Total combinational functions  ; 98 / 21,280 ( < 1 % )                      ;
> ;     Dedicated logic registers      ; 0 / 21,280 ( 0 % )                         ;
> ; Total registers                    ; 0                                          ;
> ; Total pins                         ; 96 / 167 ( 57 % )                          ;
> ; Total virtual pins                 ; 0                                          ;
> ; Total memory bits                  ; 0 / 774,144 ( 0 % )                        ;
> ; Embedded Multiplier 9-bit elements ; 0 / 80 ( 0 % )                             ;
> ; Total GXB Receiver Channel PCS     ; 0 / 4 ( 0 % )                              ;
> ; Total GXB Receiver Channel PMA     ; 0 / 4 ( 0 % )                              ;
> ; Total GXB Transmitter Channel PCS  ; 0 / 4 ( 0 % )                              ;
> ; Total GXB Transmitter Channel PMA  ; 0 / 4 ( 0 % )                              ;
> ; Total PLLs                         ; 0 / 4 ( 0 % )                              ;
> +------------------------------------+--------------------------------------------+

I don't know what to tell you.  I used Lattice Diamond 3.3.  The design 
is simple, two 33 bit adders, 1 LUT per bit.  There is nothing special 
about either the tool or the device (standard 4 input LUTs with carry 
support for adders).  I think your Quartus tool is broken.  Try looking 
at the actual logic produced.  BTW, doesn't the Cyclone IV have 6 input 
LUTs?  What is the Altera tool doing that is so inefficient?

-- 

Rick

Article: 158528
Subject: Re: modulo 2**32-1 arith
From: KJ <kkjennings@sbcglobal.net>
Date: Wed, 16 Dec 2015 09:48:15 -0800 (PST)
Links: << >> << T >> << A >>

On Wednesday, December 16, 2015 at 12:45:05 PM UTC-5, KJ wrote:

Correction to Method 7:
Posted:  Carry_In <= not(Carry_In);
Supposed to be:  Carry_In <= not(Carry_In) when rising_edge(Clock);

Results are not changed, still 36 Logic Elements.

Kevin

Article: 158529
Subject: Re: modulo 2**32-1 arith
From: KJ <kkjennings@sbcglobal.net>
Date: Wed, 16 Dec 2015 10:03:32 -0800 (PST)
Links: << >> << T >> << A >>

On Wednesday, December 16, 2015 at 12:12:15 PM UTC-5, Rob Gaddi wrote:
>=20
> If I'm right about the checksum, then Ilya's
> method (at least on a reading) gives you an operating duty cycle of
> 50%, you can put new data no faster than every other clock because
> you're waiting for the result of that "add one more decision" to
> percolate back around.  Leave out the pipeline registers and your fmax
> gets ugly.

Ilya's method (Method=3D6 in the code I posted) should be computing one out=
put every clock cycle.  The output will just have a one clock cycle latency=
 compared with the pure combinatorial logic version (Method=3D2).  Thomas' =
method (Method=3D7) though will only accept one input every other clock cyc=
le but consumes roughly half the logic resource.

Kevin Jennings

Article: 158530
Subject: Re: modulo 2**32-1 arith
From: rickman <gnuarm@gmail.com>
Date: Wed, 16 Dec 2015 13:25:52 -0500
Links: << >> << T >> << A >>

On 12/16/2015 12:45 PM, KJ wrote:
> On Wednesday, December 16, 2015 at 9:23:49 AM UTC-5, thomas....@gmail.com
>> If you can do with a result every other cycle, you could try following:
>> - Make an 32b adder with carry in and carry out
>> - First cycle: Add the two numbers with carry.
>> - Second cycle: Check if the output carry is set. If yes, we already have the result. Otherwise clear the input carry and use the new result.
>>
>> This should be below 40 LEs. But I have not tried it out...
>>
> I have yours coming in at 36 if I coded up what you said correctly, not tested.  You're in the lead now with Ilya in second.
>
> Kevin Jennings
>
> ==== START OF CODE ====
> library ieee;
> use ieee.std_logic_1164.all;
> use ieee.numeric_std.all;
>
> entity Custom_Adder is generic(
>      SP1_METHOD:     integer range 0 to 1;   -- Controls how 'Sum + 1' is calculated
>      METHOD:         integer range 0 to 7);
> port(
>      Clock:  in  std_ulogic;
>      A:      in  unsigned(31 downto 0);
>      B:      in  unsigned(31 downto 0);
>      C:      out unsigned(31 downto 0);
>      Valid:  out std_ulogic
> );
> end Custom_Adder;
>
> -- Results:
> --          Logic Elements
> -- Method#  SP1=0   SP1=1   Notes
> -- 0        32      32      C <= A+B, no modulo '2**32-1' as a baseline
> -- 1        32      32      C <= A+B+1, as another reference point
> -- 2        76      76      Ilya's method although implemented all in one clock cycle
> -- 3        97      97      Rickman's method #1
> -- 4        97      97      Rickman's method #2
> -- 5        65      65      Ilya's method as stated, two clock cycles
> -- 6        76      108     Same as method 2, but when SP1_METHOD=1 it will computes Sum + 1 as A+1+B rather than A+B+1
> -- 7        36      36      Thomas' method, two clock cycles
>
> architecture RTL of Custom_Adder is
>      signal Sum:     unsigned(32 downto 0);
>      signal Sum_P1:  unsigned(32 downto 0);
>      constant MAX:   unsigned(32 downto 0) := '0' & x"FFFF_FFFF";
> begin
>      Sum     <= resize(A, Sum'length) + resize(B, Sum'length);
>      Sum_P1  <= Sum + 1 when (SP1_METHOD = 0) else resize(A, Sum'length) + resize(B, Sum'length) + 1;
>      -- KJ note:  If you change Sum_P1 to compute A+1+B rather than A+B+1 then the number of logic cells
>      --           increases by 27 (i.e. almost one for each bit of the adder
>
>      -- KJ note:  Performing all calculations on one clock cycle in order to determine logic cells.
>
>      GEN_METHOD0: if (METHOD = 0) generate
>          C   <= Sum(C'range);
>      end generate GEN_METHOD0;
>
>      GEN_METHOD1: if (METHOD = 1) generate
>          C   <= Sum_P1(C'range);
>      end generate GEN_METHOD1;
>
>      GEN_METHOD2: if (METHOD = 2) generate
>          -- Ilya's method, implemented all in one clock cycle:
>          -- at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result
>          -- and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that
>          C   <= Sum_P1(C'range) when (Sum >= MAX) else Sum(C'range);
>      end generate GEN_METHOD2;
>
>      GEN_METHOD3: if (METHOD = 3) generate
>          signal Mod_Res: unsigned(32 downto 0);
>      begin
>          -- Rickman's first method
>          -- sum <= RESIZE(a, 33) + RESIZE(b, 33);
>          -- temp <= sum + 1;
>          -- mod_result <= sum + temp(32);
>          -- answer <= mod_result(31 downto 0);
>          Mod_Res <= Sum + unsigned'(Sum_P1(32 downto 32));
>          C       <= Mod_Res(C'range);
>      end generate GEN_METHOD3;
>      GEN_METHOD4: if (METHOD = 4) generate
>          -- Rickman's second method
>          -- signal Y: unsigned(31 downto 0);
>          -- signal Y, Y1, A, B: unsigned(32 downto 0);   <-- KJ note:  Redclares 'Y'
>          -- Y1 <= A + B + 1;     <-- KJ note:  Error:  must have 33 elements not 32
>          -- Y <= resize(A + B + resize(Y1(32 downto 32),33), 32);
>          signal Y:   unsigned(31 downto 0);
>          signal Y1:  unsigned(32 downto 0);
>      begin
>          Y1  <= resize(A, 33) + resize(B, 33) + 1;
>          Y   <= resize(A + B + resize(Y1(32 downto 32),33), 32);
>          C   <= Y;
>      end generate GEN_METHOD4;
>
>      GEN_METHOD5: if (METHOD = 5) generate
>          -- Ilya's method:
>          -- at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result
>          -- and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that
>          signal Sum_Dlyd:    unsigned(Sum'range);
>          signal Sum_Dlyd_P1: unsigned(Sum'range);
>      begin
>          Sum_Dlyd    <= Sum when rising_edge(Clock);
>          Sum_Dlyd_P1 <= Sum_Dlyd + 1;
>          C   <= Sum_Dlyd_P1(C'range) when (Sum_Dlyd(32) = '1') else Sum_Dlyd(C'range);
>      end generate GEN_METHOD5;
>
>      GEN_METHOD6: if (METHOD = 6) generate
>          -- Same as method 2 (Ilya's method, implemented all in one clock cycle) except
>          -- the ordering of operands for the computation of 'Sum + 1' is modified.
>          signal Sum_P1:  unsigned(Sum'range);
>      begin
>          Sum_P1  <= Sum + 1 when (SP1_METHOD = 0) else resize(A, Sum'length) + 1 + resize(B, Sum'length);
>          C   <= Sum_P1(C'range) when (Sum >= MAX) else Sum(C'range);
>      end generate GEN_METHOD6;
>
>      GEN_METHOD7: if (METHOD = 7) generate
>          -- Thomas' method
>          -- Make an 32b adder with carry in and carry out
>          -- First cycle: Add the two numbers with carry.
>          -- Second cycle: Check if the output carry is set. If yes, we already have the result. Otherwise clear the input carry and use the new result.
>          -- This should be below 40 LEs. But I have not tried it out...
>          signal Sum_With_Carry:      unsigned(Sum'range);
>          signal Carry_In:            unsigned(0 downto 0);
>          signal Carry_Out:           std_ulogic;
>      begin
>          Carry_In        <= not(Carry_In);
>          Sum_With_Carry  <= resize(A, Sum_With_Carry'length) + resize(B, Sum_With_Carry'length) + Carry_In;
>          Carry_Out       <= Sum_With_Carry(32) when rising_edge(Clock);
>          C               <= Sum_With_Carry(C'range);
>          Valid           <= (Carry_In(0) and Sum_With_Carry(32)) or (not(Carry_In(0)) and not(Carry_Out));
>      end generate GEN_METHOD7;
> end RTL;
> ==== END OF CODE ====

I see one issue.  You are calculating Sum and Sum_P1 even for methods 
that don't use them.  This would then depend on the tool to optimize 
away this logic.  I see in some methods you recompute these values.  Why 
not fix this and add it to the code for the methods that use it?

-- 

Rick

Article: 158531
Subject: Re: modulo 2**32-1 arith
From: Ilya Kalistru <stebanoid@gmail.com>
Date: Sat, 19 Dec 2015 10:03:12 -0800 (PST)
Links: << >> << T >> << A >>

I thank you all for this very interesting and useful discussion. There is a lot to think about.
But I should apologize: later on it turns out that the definition of "modulo 2**32-1" which is used in the algorithm description other than I had thought. It is weird from the mathematical point of view:
A+B if A+B<2**32
A+B-2**32+1 if A+B>=2**32
I manged to meet timings by using Sum(32) to decide what sum I should use.

    Sum <= resize(A, 33)+resize(B, 33);
    Sum_P1 <= resize(A, 33)+resize(B, 33)+1;
    
    process(clk)
    begin
        if rising_edge(clk) then
            A <= A_in;
            B <= B_in;
            
            if Sum(32) = '1' then
                C<=resize(Sum_P1,32);
            else
                C<=resize(Sum,32);
            end if;
        end if;
    end process;

In my case (Artix-7) this code gives me a critical path of LUT2 then 8 CARRY4 then LUT3 and it's pretty fast (3.09 ns estimated before the routing step). It consumes 96 luts estimated.

If I use
Sum_P1 <= Sum+1;
istead, it gives me a critical path of LUT2 + 9 CARRY4 + LUT2 + 8 CARRY4 and it doesn't meet timings (4.232 ns estimated before the routing step). It consumes 64 luts estimated.


If we return to our original task with a proper modulo 2*32-1 addition and use Sun_P1(32) for the select input of the multiplexer
    Sum <= resize(A, 33)+resize(B, 33);
    Sum_P1 <= resize(A, 33)+resize(B, 33)+1;
    
    process(clk)
    begin
        if rising_edge(clk) then
            A <= A_in;
            B <= B_in;
            
            if Sum_P(32) = '1' then
                C<=resize(Sum_P1,32);
            else
                C<=resize(Sum,32);
            end if;
        end if;
    end process;
we would have exactly the same results on Artix-7 as with "weird" addition. (And it makes sense)

in case of Sum_P1 <= Sum+1; we have LUT2 + 7 CARRI4 + LUT + 2 CARRY4 + LUT2 and we don't meet timings(4.444 ns estimated before the routing step). It consumes 96 luts estimated.

if we do it like that:
    Y1  <= resize(A, 33) + resize(B, 33) + 1;
    Y   <= resize(A + B + resize(Y1(32 downto 32),33), 32); 
    
    process(clk)
    begin
        if rising_edge(clk) then
            A <= A_in;
            B <= B_in;
            
            C <= Y;
        end if;
    end process;
we have in critical path LUT2 + 8*CARRY4 + LUT2 + 8*CARRY4 and we don't meet timings(4.338 ns estimated before the routing step). It consumes 64 luts estimated.

Article: 158532
Subject: Re: modulo 2**32-1 arith
From: Ilya Kalistru <stebanoid@gmail.com>
Date: Sat, 19 Dec 2015 10:13:47 -0800 (PST)
Links: << >> << T >> << A >>

I also should apologize for my absence for long time. Device I'm working on suddenly started to lose PCIe link (without any reason) and I have to put off all other works until this problem is resolved.

Article: 158533
Subject: Re: modulo 2**32-1 arith
From: rickman <gnuarm@gmail.com>
Date: Sat, 19 Dec 2015 14:18:23 -0500
Links: << >> << T >> << A >>

On 12/19/2015 1:03 PM, Ilya Kalistru wrote:
> I thank you all for this very interesting and useful discussion. There is a lot to think about.
> But I should apologize: later on it turns out that the definition of "modulo 2**32-1" which is used in the algorithm description other than I had thought. It is weird from the mathematical point of view:
> A+B if A+B<2**32
> A+B-2**32+1 if A+B>=2**32
> I manged to meet timings by using Sum(32) to decide what sum I should use.
>
>      Sum <= resize(A, 33)+resize(B, 33);
>      Sum_P1 <= resize(A, 33)+resize(B, 33)+1;
>
>      process(clk)
>      begin
>          if rising_edge(clk) then
>              A <= A_in;
>              B <= B_in;
>
>              if Sum(32) = '1' then
>                  C<=resize(Sum_P1,32);
>              else
>                  C<=resize(Sum,32);
>              end if;
>          end if;
>      end process;
>
> In my case (Artix-7) this code gives me a critical path of LUT2 then 8 CARRY4 then LUT3 and it's pretty fast (3.09 ns estimated before the routing step). It consumes 96 luts estimated.
>
> If I use
> Sum_P1 <= Sum+1;
> istead, it gives me a critical path of LUT2 + 9 CARRY4 + LUT2 + 8 CARRY4 and it doesn't meet timings (4.232 ns estimated before the routing step). It consumes 64 luts estimated.
>
>
> If we return to our original task with a proper modulo 2*32-1 addition and use Sun_P1(32) for the select input of the multiplexer
>      Sum <= resize(A, 33)+resize(B, 33);
>      Sum_P1 <= resize(A, 33)+resize(B, 33)+1;
>
>      process(clk)
>      begin
>          if rising_edge(clk) then
>              A <= A_in;
>              B <= B_in;
>
>              if Sum_P(32) = '1' then
>                  C<=resize(Sum_P1,32);
>              else
>                  C<=resize(Sum,32);
>              end if;
>          end if;
>      end process;
> we would have exactly the same results on Artix-7 as with "weird" addition. (And it makes sense)
>
> in case of Sum_P1 <= Sum+1; we have LUT2 + 7 CARRI4 + LUT + 2 CARRY4 + LUT2 and we don't meet timings(4.444 ns estimated before the routing step). It consumes 96 luts estimated.
>
> if we do it like that:
>      Y1  <= resize(A, 33) + resize(B, 33) + 1;
>      Y   <= resize(A + B + resize(Y1(32 downto 32),33), 32);
>
>      process(clk)
>      begin
>          if rising_edge(clk) then
>              A <= A_in;
>              B <= B_in;
>
>              C <= Y;
>          end if;
>      end process;
> we have in critical path LUT2 + 8*CARRY4 + LUT2 + 8*CARRY4 and we don't meet timings(4.338 ns estimated before the routing step). It consumes 64 luts estimated.

A couple of comments.  The timing from the synthesis step is pretty much 
garbage.  The only timing results that matter are those from place and 
route.  At least that has been my experience.  So I would never toss an 
approach based on timings from the synthesis step unless they were very 
far apart.

The actual definition of "modulo 2**32-1" would be:
  A+B if A+B<2**32-1
  A+B-2**32+1 if A+B>=2**32-1

In the last example you give the tool seems to treat resize(Y1(32 downto 
32),33) as another operator and devotes a full adder 33 bits long to add 
that to the sum of the other two operands, obviously not efficient.  My 
other example gets around that.  But I have another way of expressing it.

This method calculates a carry out of the 32 bit addition of A+B+1 which 
is added to the sum A+B to get the final answer.  Rather than trying to 
force the tools to use a carry in to propagate the carry, just make it 
one longer adder.  Built in carries tend to be fast, so it will be a 
foot race between the carry propagation and the LUT and routing delays 
of using the mux.

   signal A,B : unsigned(WIDTH-1 downto 0);
   signal AxA, BxB, Y : unsigned(2*WIDTH-1 downto 0);

BEGIN
   AxA <= A & A;
   BxB <= B & B;
   Y <= AxA + BxB + 1;

   process(clk)
   begin
     if rising_edge(clk) then
       A <= A_in;
       B <= B_in;

       C <= Y (2*WIDTH-1 downto WIDTH);
     end if;
   end process;

This produced 64 bits of adder and no additional logic.  I didn't 
simulate it to make sure the logic is correct.  You will need to do your 
own timing analysis since your target device is different from what I 
work with.  Whether it is faster depends on the speed of the carry 
chains in your device.

-- 

Rick

Article: 158534
Subject: Re: modulo 2**32-1 arith
From: Ilya Kalistru <stebanoid@gmail.com>
Date: Sat, 19 Dec 2015 13:00:14 -0800 (PST)
Links: << >> << T >> << A >>

On Saturday, December 19, 2015 at 10:18:36 PM UTC+3, rickman wrote:

> A couple of comments.  The timing from the synthesis step is pretty much 
> garbage.  The only timing results that matter are those from place and 
> route.  At least that has been my experience.  So I would never toss an 
> approach based on timings from the synthesis step unless they were very 
> far apart.

It's just a way to have a rough estimate of the timings without insertion of this block in actual design or having deals with timings from/to input ports.
May be I'll check this designs with
set_false_path -from A_in -to A
and so on. 

> 
> The actual definition of "modulo 2**32-1" would be:
>   A+B if A+B<2**32-1
>   A+B-2**32+1 if A+B>=2**32-1

It is. But as I said this particular algorithm uses nonstandard definition. (obviously they wanted to make hardware implementation easier)


> In the last example you give the tool seems to treat resize(Y1(32 downto 
> 32),33) as another operator and devotes a full adder 33 bits long to add 
> that to the sum of the other two operands, obviously not efficient.  My 
> other example gets around that.  But I have another way of expressing it.
> 
> This method calculates a carry out of the 32 bit addition of A+B+1 which 
> is added to the sum A+B to get the final answer.  Rather than trying to 
> force the tools to use a carry in to propagate the carry, just make it 
> one longer adder.  Built in carries tend to be fast, so it will be a 
> foot race between the carry propagation and the LUT and routing delays 
> of using the mux.
> 
>    signal A,B : unsigned(WIDTH-1 downto 0);
>    signal AxA, BxB, Y : unsigned(2*WIDTH-1 downto 0);
> 
> BEGIN
>    AxA <= A & A;
>    BxB <= B & B;
>    Y <= AxA + BxB + 1;
> 
>    process(clk)
>    begin
>      if rising_edge(clk) then
>        A <= A_in;
>        B <= B_in;
> 
>        C <= Y (2*WIDTH-1 downto WIDTH);
>      end if;
>    end process;
> 
> This produced 64 bits of adder and no additional logic.  I didn't 
> simulate it to make sure the logic is correct.  You will need to do your 
> own timing analysis since your target device is different from what I 
> work with.  Whether it is faster depends on the speed of the carry 
> chains in your device.
> 
> -- 
> 
> Rick

Cool! It makes LUT2 + 16*CARRY4 with 3.112 ns estimated propagation time (before route) and uses 64 lutes.

Article: 158535
Subject: Re: modulo 2**32-1 arith
From: rickman <gnuarm@gmail.com>
Date: Sat, 19 Dec 2015 16:33:29 -0500
Links: << >> << T >> << A >>

On 12/19/2015 4:00 PM, Ilya Kalistru wrote:
> On Saturday, December 19, 2015 at 10:18:36 PM UTC+3, rickman wrote:
>
>> A couple of comments.  The timing from the synthesis step is pretty much
>> garbage.  The only timing results that matter are those from place and
>> route.  At least that has been my experience.  So I would never toss an
>> approach based on timings from the synthesis step unless they were very
>> far apart.
>
> It's just a way to have a rough estimate of the timings without insertion of this block in actual design or having deals with timings from/to input ports.
> May be I'll check this designs with
> set_false_path -from A_in -to A
> and so on.
>
>>
>> The actual definition of "modulo 2**32-1" would be:
>>    A+B if A+B<2**32-1
>>    A+B-2**32+1 if A+B>=2**32-1
>
> It is. But as I said this particular algorithm uses nonstandard definition. (obviously they wanted to make hardware implementation easier)
That doesn't make sense.  Either it is one definition or the other.  The 
what I wrote above is what is produced by my code.  I believe your code 
can be simply modified to produce the same result, just use Sum_P1(32) 
in the conditional instead of Sum(32).  I don't know how you can use 
your code if you need a modulo function since it will produce a result 
of 2**n-1 which is not in the range of mod 2**n-1 and so is not a valid 
result.


>> In the last example you give the tool seems to treat resize(Y1(32 downto
>> 32),33) as another operator and devotes a full adder 33 bits long to add
>> that to the sum of the other two operands, obviously not efficient.  My
>> other example gets around that.  But I have another way of expressing it.
>>
>> This method calculates a carry out of the 32 bit addition of A+B+1 which
>> is added to the sum A+B to get the final answer.  Rather than trying to
>> force the tools to use a carry in to propagate the carry, just make it
>> one longer adder.  Built in carries tend to be fast, so it will be a
>> foot race between the carry propagation and the LUT and routing delays
>> of using the mux.
>>
>>     signal A,B : unsigned(WIDTH-1 downto 0);
>>     signal AxA, BxB, Y : unsigned(2*WIDTH-1 downto 0);
>>
>> BEGIN
>>     AxA <= A & A;
>>     BxB <= B & B;
>>     Y <= AxA + BxB + 1;
>>
>>     process(clk)
>>     begin
>>       if rising_edge(clk) then
>>         A <= A_in;
>>         B <= B_in;
>>
>>         C <= Y (2*WIDTH-1 downto WIDTH);
>>       end if;
>>     end process;
>>
>> This produced 64 bits of adder and no additional logic.  I didn't
>> simulate it to make sure the logic is correct.  You will need to do your
>> own timing analysis since your target device is different from what I
>> work with.  Whether it is faster depends on the speed of the carry
>> chains in your device.
>>
>> --
>>
>> Rick
>
> Cool! It makes LUT2 + 16*CARRY4 with 3.112 ns estimated propagation time (before route) and uses 64 lutes.

Since the carry chains are not part of the routing, you will find little 
additional delay if the placement is good.

-- 

Rick

Article: 158536
Subject: Re: modulo 2**32-1 arith
From: Ilya Kalistru <stebanoid@gmail.com>
Date: Sat, 19 Dec 2015 13:38:21 -0800 (PST)
Links: << >> << T >> << A >>

Here you are:
after implenetation Rick's Y <= AxA + BxB + 1; approach give us 3.222 ns.

Multiplexor approach give us 3.506 ns and it is not nearly as elegant as Rick's solution.

If we want to stick to weird modulo definition used in algorithm, we use
Y <= AxA + BxB;
and have the same 3.222 ns as was expected.

Article: 158537
Subject: Re: modulo 2**32-1 arith
From: Ilya Kalistru <stebanoid@gmail.com>
Date: Sat, 19 Dec 2015 14:10:31 -0800 (PST)
Links: << >> << T >> << A >>


> I believe your code 
> can be simply modified to produce the same result, just use Sum_P1(32) 
> in the conditional instead of Sum(32).  

It's exactly what I've done in my second post - there's test results for both definitions of "modulo" with Sum_P1(32) and Sum(32) used as an input of the multiplexer. 


> I don't know how you can use 
> your code if you need a modulo function since it will produce a result 
> of 2**n-1 which is not in the range of mod 2**n-1 and so is not a valid 
> result.
Sorry, I don't understand you here. Could you paraphrase it a bit simpler or wider?
I don't need a modulo function, I need a function which is like modulo but with this strange difference.
I've done testing of both definitions for the sake of comparison between them and to be able to compare my results of "standard modulo" on Xilinx with "standard modulo" results of KJ on Altera.
 
> Since the carry chains are not part of the routing, you will find little 
> additional delay if the placement is good.

I want to check how routing affects performance in real design when I fix this problem with the bloody PCIe. I hope I won't forget to report my results here.

Article: 158538
Subject: Re: modulo 2**32-1 arith
From: rickman <gnuarm@gmail.com>
Date: Sat, 19 Dec 2015 23:56:02 -0500
Links: << >> << T >> << A >>

On 12/19/2015 5:10 PM, Ilya Kalistru wrote:
>
>> I believe your code
>> can be simply modified to produce the same result, just use Sum_P1(32)
>> in the conditional instead of Sum(32).
>
> It's exactly what I've done in my second post - there's test results for both definitions of "modulo" with Sum_P1(32) and Sum(32) used as an input of the multiplexer.
>
>
>> I don't know how you can use
>> your code if you need a modulo function since it will produce a result
>> of 2**n-1 which is not in the range of mod 2**n-1 and so is not a valid
>> result.
> Sorry, I don't understand you here. Could you paraphrase it a bit simpler or wider?
> I don't need a modulo function, I need a function which is like modulo but with this strange difference.
> I've done testing of both definitions for the sake of comparison between them and to be able to compare my results of "standard modulo" on Xilinx with "standard modulo" results of KJ on Altera.

Using Sum high bit as the selector means the mux will produce an output 
of 2**n-1.  This is not a valid output for a modulo 2**n-1 function.  In 
other words this is not a valid function to produce the modulo 2**n-1 
function.  It is the wrong circuit.

I can maybe illustrate this better with 8 bits.  Mod 255 means the 
output will range from 0 to 254.  So if you used Sum high bit to control 
the mux it will produce an range of 0 to 255 which is not correct.  Use 
Sum_p1 high bit to control the mux and the output will range from 0 to 
254 which is correct.

>> Since the carry chains are not part of the routing, you will find little
>> additional delay if the placement is good.
>
> I want to check how routing affects performance in real design when I fix this problem with the bloody PCIe. I hope I won't forget to report my results here.

I'm interested.  Placement will have a huge effect.  I look at routed 
results for my implementation and I saw one route of 4.5 ns.  Looks like 
they shoved the input FFs into the IOBs, so the design can't all be 
close together as the IO is scattered around the chip perimeter.  I'd 
need to kill that or add another layer of FFs so the routes of interest 
are all internal FFs which could be co-located.  Still doesn't say they 
*will* be close together.  I find tools to be hard to manage in that 
regard unless you do floor-planning.  I much prefer a slow clock... lol

-- 

Rick

Article: 158539
Subject: Re: modulo 2**32-1 arith
From: Ilya Kalistru <stebanoid@gmail.com>
Date: Sun, 20 Dec 2015 00:57:39 -0800 (PST)
Links: << >> << T >> << A >>


>=20
> Using Sum high bit as the selector means the mux will produce an output=
=20
> of 2**n-1.  This is not a valid output for a modulo 2**n-1 function.  In=
=20
> other words this is not a valid function to produce the modulo 2**n-1=20
> function.  It is the wrong circuit.
>=20
> I can maybe illustrate this better with 8 bits.  Mod 255 means the=20
> output will range from 0 to 254.  So if you used Sum high bit to control=
=20
> the mux it will produce an range of 0 to 255 which is not correct.  Use=
=20
> Sum_p1 high bit to control the mux and the output will range from 0 to=20
> 254 which is correct.

Ok. It's wrong function to produce the modulo 2**n-1. But I don't need corr=
ect modulo function, I need function which gives us that:
A+B if A+B<2**32
A+B-2**32+1 if A+B>=3D2**32=20
It's absolutely different from modulo function, but it's what they use in t=
he description of algorithm. It's not my responsibility to change algorithm=
 and this algorithm is used in software implementation already and if I cha=
nged this function to correct modulo function my hardware implementation wo=
uld be incompatible with software implementation and wouldn't comply to req=
uirements of government.


> I'm interested.  Placement will have a huge effect.  I look at routed=20
> results for my implementation and I saw one route of 4.5 ns.  Looks like=
=20
> they shoved the input FFs into the IOBs, so the design can't all be=20
> close together as the IO is scattered around the chip perimeter.  I'd=20
> need to kill that or add another layer of FFs so the routes of interest=
=20
> are all internal FFs which could be co-located.  Still doesn't say they=
=20
> *will* be close together.  I find tools to be hard to manage in that=20
> regard unless you do floor-planning.  I much prefer a slow clock... lol

That's why I didn't use postplacement results in my first posts. Later I've=
 worked around it. As you can see I've used A_in,B_in,C_out as a ports and =
A,B,C as a FF. I've also made a rule that the paths from *_in port to A,B F=
Fs and from C FFs to C_out ports are "don't care" paths.

set_false_path -from [get_ports A_in*] -to [get_cells A_reg*]
set_false_path -from [get_ports B_in*] -to [get_cells B_reg*]
set_false_path -from [get_cells C_reg*] -to [get_ports C_out*]

It allows to place FF anywhere on the chip.

I usually don't do any floor-planning at all because it's usually enough to=
 set constrains well for my designs at ~ 250 MHz and Vivado placement tool =
does its job well.

Article: 158540
Subject: Re: modulo 2**32-1 arith
From: rickman <gnuarm@gmail.com>
Date: Sun, 20 Dec 2015 10:56:29 -0500
Links: << >> << T >> << A >>

On 12/20/2015 3:57 AM, Ilya Kalistru wrote:
>
>>
>> Using Sum high bit as the selector means the mux will produce an output
>> of 2**n-1.  This is not a valid output for a modulo 2**n-1 function.  In
>> other words this is not a valid function to produce the modulo 2**n-1
>> function.  It is the wrong circuit.
>>
>> I can maybe illustrate this better with 8 bits.  Mod 255 means the
>> output will range from 0 to 254.  So if you used Sum high bit to control
>> the mux it will produce an range of 0 to 255 which is not correct.  Use
>> Sum_p1 high bit to control the mux and the output will range from 0 to
>> 254 which is correct.
>
> Ok. It's wrong function to produce the modulo 2**n-1. But I don't need correct modulo function, I need function which gives us that:
> A+B if A+B<2**32
> A+B-2**32+1 if A+B>=2**32
> It's absolutely different from modulo function, but it's what they use in the description of algorithm. It's not my responsibility to change algorithm and this algorithm is used in software implementation already and if I changed this function to correct modulo function my hardware implementation would be incompatible with software implementation and wouldn't comply to requirements of government.

If that is your requirement, then the other solutions are wrong, like 
mine.

In your original post you say you want a function to produce "modulo 
2**32-1".  The description you supply is *not* "modulo 2**32-1".  It 
sounds like you have an existing implementation that may or may not be 
correct but you have been tasked with duplicating it.  If I were tasked 
with this I would go through channels to get verification that the 
specification is correct.  It wouldn't be the first time there was an 
error in an implementation that works most of the time, but fails 1 time 
in 2^32 in this case.

What happens downstream if your function spits out a number that is not 
in the range of "modulo 2**32-1"?


>> I'm interested.  Placement will have a huge effect.  I look at routed
>> results for my implementation and I saw one route of 4.5 ns.  Looks like
>> they shoved the input FFs into the IOBs, so the design can't all be
>> close together as the IO is scattered around the chip perimeter.  I'd
>> need to kill that or add another layer of FFs so the routes of interest
>> are all internal FFs which could be co-located.  Still doesn't say they
>> *will* be close together.  I find tools to be hard to manage in that
>> regard unless you do floor-planning.  I much prefer a slow clock... lol
>
> That's why I didn't use postplacement results in my first posts. Later I've worked around it. As you can see I've used A_in,B_in,C_out as a ports and A,B,C as a FF. I've also made a rule that the paths from *_in port to A,B FFs and from C FFs to C_out ports are "don't care" paths.
>
> set_false_path -from [get_ports A_in*] -to [get_cells A_reg*]
> set_false_path -from [get_ports B_in*] -to [get_cells B_reg*]
> set_false_path -from [get_cells C_reg*] -to [get_ports C_out*]
>
> It allows to place FF anywhere on the chip.
>
> I usually don't do any floor-planning at all because it's usually enough to set constrains well for my designs at ~ 250 MHz and Vivado placement tool does its job well.

The problem I found was the input FFs were shoved into the IOB (a common 
opimization).  This spreads the FFs around the IOs which are spaced far 
apart.  You need to tell the tools not to do that *or* place additional 
FFs in the circuit so the ones in the IOBs are not part of the path 
being timed.  Maybe you already have IOB FFs disabled.

-- 

Rick

Article: 158541
Subject: Re: modulo 2**32-1 arith
From: Richard Damon <Richard@Damon-Family.org>
Date: Sun, 20 Dec 2015 13:39:07 -0500
Links: << >> << T >> << A >>

On 12/20/15 3:57 AM, Ilya Kalistru wrote:
>
> Ok. It's wrong function to produce the modulo 2**n-1. But I don't need correct modulo function, I need function which gives us that:
> A+B if A+B<2**32
> A+B-2**32+1 if A+B>=2**32
> It's absolutely different from modulo function, but it's what they use in the description of algorithm. It's not my responsibility to change algorithm and this algorithm is used in software implementation already and if I changed this function to correct modulo function my hardware implementation would be incompatible with software implementation and wouldn't comply to requirements of government.
>
This is absolutely a module function, It just treats 2^32-1 as == 0 
(which it is module 2^31-1).

Article: 158542
Subject: Re: modulo 2**32-1 arith
From: Ilya Kalistru <stebanoid@gmail.com>
Date: Sun, 20 Dec 2015 11:07:46 -0800 (PST)
Links: << >> << T >> << A >>

>=20
> If that is your requirement, then the other solutions are wrong, like=20
> mine.

They are. But they give me ideas and approaches.=20

> In your original post you say you want a function to produce "modulo=20
> 2**32-1".  The description you supply is *not* "modulo 2**32-1".  It=20
> sounds like you have an existing implementation that may or may not be=20
> correct but you have been tasked with duplicating it.  If I were tasked=
=20
> with this I would go through channels to get verification that the=20
> specification is correct.  It wouldn't be the first time there was an=20
> error in an implementation that works most of the time, but fails 1 time=
=20
> in 2^32 in this case.

In my original post I was wrong. I forgot about this oddity and gave you wr=
ong description. This specification is 26 years old and I have tons of impl=
ementations :). I don't think it's a good idea to try to get verification. =
It's an old standard set by government and it's a very long way to go. Espe=
cially if the part of work you are responsible for haven't done yet.


> What happens downstream if your function spits out a number that is not=
=20
> in the range of "modulo 2**32-1"?
>=20

Nothing special. It's just a way to generate pseudorandom data. It could be=
 done this way or other way, but should be similar on all devices. The only=
 difference - security, but I wouldn't check algorithm after professional m=
athematicians - not my business.


>=20
> The problem I found was the input FFs were shoved into the IOB (a common=
=20
> opimization).  This spreads the FFs around the IOs which are spaced far=
=20
> apart.  You need to tell the tools not to do that *or* place additional=
=20
> FFs in the circuit so the ones in the IOBs are not part of the path=20
> being timed.  Maybe you already have IOB FFs disabled.
>=20

I think that it's pointless to shove FF inside the IOB if there is a rule t=
hat we don't care about timings between ports and FFs. Why could we do that=
?
I haven't disable IOB FFs.

Article: 158543
Subject: Re: modulo 2**32-1 arith
From: Ilya Kalistru <stebanoid@gmail.com>
Date: Sun, 20 Dec 2015 11:10:38 -0800 (PST)
Links: << >> << T >> << A >>


> This is absolutely a module function, It just treats 2^32-1 as == 0 
> (which it is module 2^31-1).

Maybe. Let's not fight about mathematical definitions. :)

Article: 158544
Subject: Re: modulo 2**32-1 arith
From: rickman <gnuarm@gmail.com>
Date: Sun, 20 Dec 2015 15:14:06 -0500
Links: << >> << T >> << A >>

On 12/20/2015 1:39 PM, Richard Damon wrote:
> On 12/20/15 3:57 AM, Ilya Kalistru wrote:
>>
>> Ok. It's wrong function to produce the modulo 2**n-1. But I don't need
>> correct modulo function, I need function which gives us that:
>> A+B if A+B<2**32
>> A+B-2**32+1 if A+B>=2**32
>> It's absolutely different from modulo function, but it's what they use
>> in the description of algorithm. It's not my responsibility to change
>> algorithm and this algorithm is used in software implementation
>> already and if I changed this function to correct modulo function my
>> hardware implementation would be incompatible with software
>> implementation and wouldn't comply to requirements of government.
>>
> This is absolutely a module function, It just treats 2^32-1 as == 0
> (which it is module 2^31-1).

That's the problem.  Some of the algorithms provided create mod 2^32-1 
while others include 2^32-1 in the output and so *are not* mod 2^32-1. 
To get mod 2^32-1 you need to add a 1 to the sum that controls if a one 
is added to the sum before taking mod 2^32.  The way Ilya is specifying 
the algorithm the 1 is not added to the sum that controls the adding of 
the 1 to the output sum and so 2^32-1 will show up in the output, not 
converted to 0 as it should be for the mod function.

-- 

Rick

Article: 158545
Subject: Re: modulo 2**32-1 arith
From: rickman <gnuarm@gmail.com>
Date: Sun, 20 Dec 2015 15:25:23 -0500
Links: << >> << T >> << A >>

On 12/20/2015 2:07 PM, Ilya Kalistru wrote:
>>
>> If that is your requirement, then the other solutions are wrong, like
>> mine.
>
> They are. But they give me ideas and approaches.
>
>> In your original post you say you want a function to produce "modulo
>> 2**32-1".  The description you supply is *not* "modulo 2**32-1".  It
>> sounds like you have an existing implementation that may or may not be
>> correct but you have been tasked with duplicating it.  If I were tasked
>> with this I would go through channels to get verification that the
>> specification is correct.  It wouldn't be the first time there was an
>> error in an implementation that works most of the time, but fails 1 time
>> in 2^32 in this case.
>
> In my original post I was wrong. I forgot about this oddity and gave you wrong description. This specification is 26 years old and I have tons of implementations :). I don't think it's a good idea to try to get verification. It's an old standard set by government and it's a very long way to go. Especially if the part of work you are responsible for haven't done yet.

Ok, if they have been using this method for a long time I guess the 
issue of being out of range does not cause a problem.  But certainly if 
this were being used in a critical application (such as cryptography) it 
would be a major concern since I am almost positive the greater 
algorithm is expecting numbers in the more restricted range.  Otherwise 
why specify it as mod 2^n-1?

In any event, remove the " + 1" from the end of the sum I offered and 
you should have the behavior you are looking for.

>> What happens downstream if your function spits out a number that is not
>> in the range of "modulo 2**32-1"?
>>
>
> Nothing special. It's just a way to generate pseudorandom data. It could be done this way or other way, but should be similar on all devices. The only difference - security, but I wouldn't check algorithm after professional mathematicians - not my business.

This sounds familiar.  I think I was in a conversation about this 
algorithm in another group sometime not too long ago.  The mod 2^n-1 
rings that bell.

>> The problem I found was the input FFs were shoved into the IOB (a common
>> opimization).  This spreads the FFs around the IOs which are spaced far
>> apart.  You need to tell the tools not to do that *or* place additional
>> FFs in the circuit so the ones in the IOBs are not part of the path
>> being timed.  Maybe you already have IOB FFs disabled.
>>
>
> I think that it's pointless to shove FF inside the IOB if there is a rule that we don't care about timings between ports and FFs. Why could we do that?
> I haven't disable IOB FFs.

I think I'm not being clear.  In my P&R the input FFs are automatically 
placed into the IOBs.  I don't know if the output FFs are also placed in 
the IOBs, I didn't dig this info out of the report once I saw the inputs 
were not right.

This placement distorts the timing numbers because the routes have to be 
much longer to reach the widely spread IOBs.  I don't know if your tool 
is doing this or not.  In my case I would either need to find the 
setting to prevent placing the FFs in IOBs or I need to add extra FFs to 
the test design so that the routes I want to time are all between fabric 
FFs.

This has nothing to do with timing rules.  In fact, if you add a timing 
rule that says to ignore timings between IOB FFs and the internal FFs it 
may not time your paths at all.  But I expect your tool is not using the 
IOB FFs so you aren't seeing this problem.

Please don't feel you need to continue this conversation if you have 
gotten from it what you need.  I don't want to be bugging you with 
nagging details.

-- 

Rick

Article: 158546
Subject: Re: modulo 2**32-1 arith
From: Gabor Szakacs <gabor@alacron.com>
Date: Sun, 20 Dec 2015 17:11:27 -0500
Links: << >> << T >> << A >>

On 12/19/2015 1:03 PM, Ilya Kalistru wrote:
> I thank you all for this very interesting and useful discussion. There is a lot to think about.
> But I should apologize: later on it turns out that the definition of "modulo 2**32-1" which is used in the algorithm description other than I had thought. It is weird from the mathematical point of view:
> A+B if A+B<2**32
> A+B-2**32+1 if A+B>=2**32

So in fact this is what I originally suggested, i.e. "end around carry."
This is often used in checksum algorithms.  it's the equivalent of
(without the type / size changes):

sum <= A + B;
out <= sum(31 downto 0) + sum(32);

Note that if you do the whole thing in one cycle by looping the carry
out of a 32-bit full adder to its carry in, you have not only two passes
through a 32-bit carry chain (worst case) but also the non-dedicated
routing in the loopback connection from carry out to carry in.

If your device supports a full 64-bit addition in one cycle, you could
work around the routing issue by using a 64-bit addition on replicated
inputs like:

sum = (A & A) + (B & B);
out = sum (63 downto 32);

Here the low 32-bits of the adder just provide the original carry out,
and the second adder allows you to use dedicated carry chain routing
assuming your device is large enough to support 64 bits in one
column.

If you can use two clocks of latency, then your pipeline scheme
below should work best:

> I managed to meet timings by using Sum(32) to decide what sum I should use.
>
>      Sum <= resize(A, 33)+resize(B, 33);
>      Sum_P1 <= resize(A, 33)+resize(B, 33)+1;
>
>      process(clk)
>      begin
>          if rising_edge(clk) then
>              A <= A_in;
>              B <= B_in;
>
>              if Sum(32) = '1' then
>                  C<=resize(Sum_P1,32);
>              else
>                  C<=resize(Sum,32);
>              end if;
>          end if;
>      end process;
>
> In my case (Artix-7) this code gives me a critical path of LUT2 then 8 CARRY4 then LUT3 and it's pretty fast (3.09 ns estimated before the routing step). It consumes 96 luts estimated.
>
> If I use
> Sum_P1 <= Sum+1;
> istead, it gives me a critical path of LUT2 + 9 CARRY4 + LUT2 + 8 CARRY4 and it doesn't meet timings (4.232 ns estimated before the routing step). It consumes 64 luts estimated.
>
>
> If we return to our original task with a proper modulo 2*32-1 addition and use Sun_P1(32) for the select input of the multiplexer
>      Sum <= resize(A, 33)+resize(B, 33);
>      Sum_P1 <= resize(A, 33)+resize(B, 33)+1;
>
>      process(clk)
>      begin
>          if rising_edge(clk) then
>              A <= A_in;
>              B <= B_in;
>
>              if Sum_P(32) = '1' then
>                  C<=resize(Sum_P1,32);
>              else
>                  C<=resize(Sum,32);
>              end if;
>          end if;
>      end process;
> we would have exactly the same results on Artix-7 as with "weird" addition. (And it makes sense)
>
> in case of Sum_P1 <= Sum+1; we have LUT2 + 7 CARRI4 + LUT + 2 CARRY4 + LUT2 and we don't meet timings(4.444 ns estimated before the routing step). It consumes 96 luts estimated.
>
> if we do it like that:
>      Y1  <= resize(A, 33) + resize(B, 33) + 1;
>      Y   <= resize(A + B + resize(Y1(32 downto 32),33), 32);
>
>      process(clk)
>      begin
>          if rising_edge(clk) then
>              A <= A_in;
>              B <= B_in;
>
>              C <= Y;
>          end if;
>      end process;
> we have in critical path LUT2 + 8*CARRY4 + LUT2 + 8*CARRY4 and we don't meet timings(4.338 ns estimated before the routing step). It consumes 64 luts estimated.
>

-- 
Gabor

Article: 158547
Subject: Re: modulo 2**32-1 arith
From: rickman <gnuarm@gmail.com>
Date: Sun, 20 Dec 2015 23:23:52 -0500
Links: << >> << T >> << A >>

On 12/20/2015 5:11 PM, Gabor Szakacs wrote:
> On 12/19/2015 1:03 PM, Ilya Kalistru wrote:
>> I thank you all for this very interesting and useful discussion. There
>> is a lot to think about.
>> But I should apologize: later on it turns out that the definition of
>> "modulo 2**32-1" which is used in the algorithm description other than
>> I had thought. It is weird from the mathematical point of view:
>> A+B if A+B<2**32
>> A+B-2**32+1 if A+B>=2**32
>
> So in fact this is what I originally suggested, i.e. "end around carry."
> This is often used in checksum algorithms.  it's the equivalent of
> (without the type / size changes):
>
> sum <= A + B;
> out <= sum(31 downto 0) + sum(32);
>
> Note that if you do the whole thing in one cycle by looping the carry
> out of a 32-bit full adder to its carry in, you have not only two passes
> through a 32-bit carry chain (worst case) but also the non-dedicated
> routing in the loopback connection from carry out to carry in.

Another problem with this method is that it creates a combinatorial loop 
which prevents proper static timing analysis.  It is very efficient with 
the logic however.

I'm just very curious about why this is the desired algorithm while it 
is being called "mod 2^32-1", but obviously we will never know the 
origin of this.

-- 

Rick

Article: 158548
Subject: Re: modulo 2**32-1 arith
From: Gabor Szakacs <gabor@alacron.com>
Date: Mon, 21 Dec 2015 11:26:43 -0500
Links: << >> << T >> << A >>

On 12/20/2015 11:23 PM, rickman wrote:
> On 12/20/2015 5:11 PM, Gabor Szakacs wrote:
>> On 12/19/2015 1:03 PM, Ilya Kalistru wrote:
>>> I thank you all for this very interesting and useful discussion. There
>>> is a lot to think about.
>>> But I should apologize: later on it turns out that the definition of
>>> "modulo 2**32-1" which is used in the algorithm description other than
>>> I had thought. It is weird from the mathematical point of view:
>>> A+B if A+B<2**32
>>> A+B-2**32+1 if A+B>=2**32
>>
>> So in fact this is what I originally suggested, i.e. "end around carry."
>> This is often used in checksum algorithms.  it's the equivalent of
>> (without the type / size changes):
>>
>> sum <= A + B;
>> out <= sum(31 downto 0) + sum(32);
>>
>> Note that if you do the whole thing in one cycle by looping the carry
>> out of a 32-bit full adder to its carry in, you have not only two passes
>> through a 32-bit carry chain (worst case) but also the non-dedicated
>> routing in the loopback connection from carry out to carry in.
>
> Another problem with this method is that it creates a combinatorial loop
> which prevents proper static timing analysis.  It is very efficient with
> the logic however.
>
> I'm just very curious about why this is the desired algorithm while it
> is being called "mod 2^32-1", but obviously we will never know the
> origin of this.
>

I have always heard this method referred to as "end around carry,"
however if you are using this as an accumulator, i.e. A is the next
data input and B is the result of the previous sum, then it is in
effect the same as taking the sum of inputs modulo 2**32 - 1, with
the only difference being that the final result can be equal to
2**32 - 1 where it would normally be zero.  Intermediate results
equal to 2**32-1 do not change the final outcome vs. doing a true
modulo operator.

-- 
Gabor

Article: 158549
Subject: Re: modulo 2**32-1 arith
From: rickman <gnuarm@gmail.com>
Date: Mon, 21 Dec 2015 13:57:18 -0500
Links: << >> << T >> << A >>

On 12/21/2015 11:26 AM, Gabor Szakacs wrote:
> On 12/20/2015 11:23 PM, rickman wrote:
>> On 12/20/2015 5:11 PM, Gabor Szakacs wrote:
>>> On 12/19/2015 1:03 PM, Ilya Kalistru wrote:
>>>> I thank you all for this very interesting and useful discussion. There
>>>> is a lot to think about.
>>>> But I should apologize: later on it turns out that the definition of
>>>> "modulo 2**32-1" which is used in the algorithm description other than
>>>> I had thought. It is weird from the mathematical point of view:
>>>> A+B if A+B<2**32
>>>> A+B-2**32+1 if A+B>=2**32
>>>
>>> So in fact this is what I originally suggested, i.e. "end around carry."
>>> This is often used in checksum algorithms.  it's the equivalent of
>>> (without the type / size changes):
>>>
>>> sum <= A + B;
>>> out <= sum(31 downto 0) + sum(32);
>>>
>>> Note that if you do the whole thing in one cycle by looping the carry
>>> out of a 32-bit full adder to its carry in, you have not only two passes
>>> through a 32-bit carry chain (worst case) but also the non-dedicated
>>> routing in the loopback connection from carry out to carry in.
>>
>> Another problem with this method is that it creates a combinatorial loop
>> which prevents proper static timing analysis.  It is very efficient with
>> the logic however.
>>
>> I'm just very curious about why this is the desired algorithm while it
>> is being called "mod 2^32-1", but obviously we will never know the
>> origin of this.
>>
>
> I have always heard this method referred to as "end around carry,"
> however if you are using this as an accumulator, i.e. A is the next
> data input and B is the result of the previous sum, then it is in
> effect the same as taking the sum of inputs modulo 2**32 - 1, with
> the only difference being that the final result can be equal to
> 2**32 - 1 where it would normally be zero.  Intermediate results
> equal to 2**32-1 do not change the final outcome vs. doing a true
> modulo operator.

I guess I see what you are saying, even if this only applies to an 
accumulator rather than taking a sum of two arbitrary numbers.  When a 
sum is equal to 2**N-1 the carry is essentially delayed until the next 
cycle.  But... if the next value to be added is 2**n-1 (don't know if 
this is possible) a carry will happen, but there should be a second 
carry which will be missed.  So as long as the input domain is 
restricted to number from 0 to 2**N-2 this will work if the final result 
is adjusted to zero when equal to 2**N-1.

-- 

Rick

Site Home Archive Home FAQ Home How to search the Archive How to Navigate the Archive
Compare FPGA features and resources

Authors:A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

Custom Search