FPGA-FAQ 0016

Wired AND/OR

Vendor	Xilinx, Virtex
FAQ Entry Author	Christian Plessl, Brian Philofsky, Evan, Phil Hays, Keith Williams, Jan Gray, Kent Orthner
FAQ Entry Editor	Philip Freidin
FAQ Entry Date	02/18/2001

Q. How do I implement fast Wired OR logic?

In my design, I need to evaluate boolean functions of a large number
of input signals. These signals are outputs of state machines in the
design.

What I need is a _fast_ boolean OR (or AND) operation on all of
these output signals. Since there are quite a lot of them, typically
more than 40, I need several levels of logic when implementing this
in the obvious tree-like structure built from 4-input AND/OR gates.

I think this problem could also be solved by using a wired-OR
function, where (in the case of the OR operation) each signal
either drives a line to logical '1' or goes into the high-impedance
state 'Z'. The only additional thing I need is a pulldown resistor on
this line.

Does anybody know whether a circuit like this will work and can be
realized in Xilinx Virtex FPGAs? How could this be done? I'm working
with VHDL using Xilinx Foundation F3.1i.

A.

Brian Philofsky :
You can create a wired-AND/OR function; however, I don't think you would
realize the same speed as if you used the carry chain instead. Although
this is not the best method in all situations, I suggest using the
Virtex carry chain to create large AND or OR gate functions. Depending
on exactly how many inputs you have, the approximate delay will be 1 LUT
delay + (inputs/4 * carry chain delay). For 32 inputs, for example, that
is 1 LUT delay plus 8 carry mux delays.

The basic structure of an AND gate is to configure the LUT as a
four-input AND gate and tie its output to the MUXCY select. Tie the
MUX's 1 input to CIN and its 0 input to ground. The initial CIN needs
to be tied to VCC.

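OR is a similar structure, except that all the VCC and ground
connections stated above are reversed. Because the MUXCY passes CIN
when its select is high, the LUT then has to produce the inverted OR
(a NOR) of its inputs, so that any high input forces the chain to 1.
Obviously, other logical structures such as wide decoders can be made
in a similar manner if you use your imagination.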

Most likely, you will have to hand create this structure as I am not sure
if any synthesis tools currently infer this structure. Maybe in the near
future though...
.......
To help you out, I created a small piece of VHDL to show you what I mean. The
code shows how to create a 32-input AND gate using the carry chain. It can
easily be expanded to accommodate more inputs. You can modify the code to best
suit your needs; this is just an example. I used XST as the synthesis tool,
so some changes may be necessary depending on the synthesis tool you use.

============= carry_and.vhd
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

-- synopsys translate_off
library UNISIM;
use unisim.Vcomponents.all;
-- synopsys translate_on

entity carry_and is
    Port ( And_in : in std_logic_vector(3 downto 0);
           Carry_in : in std_logic;
           Carry_out : out std_logic);
end carry_and;

architecture behavioral of carry_and is

-- 2-to-1 Multiplexer for Carry Logic with General Output

component MUXCY
port (
    DI : in std_logic;
    CI : in std_logic;
    S : in std_logic;
    O : out std_logic
);
end component;

--4-Bit Look-Up-Table with General Output

component LUT4
-- synopsys translate_off
generic (INIT : bit_vector := x"8000");
-- synopsys translate_on
port (
    I0 : in std_logic;
    I1 : in std_logic;
    I2 : in std_logic;
    I3 : in std_logic;
    O : out std_logic
);
end component;

signal LUT_OUT: std_logic;

attribute INIT: string;
attribute INIT of LUT_INST: label is "8000";

begin

LUT_INST: LUT4 port map (I0=>And_in(0), I1=>And_in(1), I2=>And_in(2),
I3=>And_in(3), O=>LUT_OUT);

MUXCY_INST: MUXCY port map (DI=>'0', CI=>Carry_in, S=>LUT_OUT,
O=>Carry_out);

end behavioral;

============= big_and.vhd
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

-- synopsys translate_off
library UNISIM;
use unisim.Vcomponents.all;
-- synopsys translate_on

entity big_and is
Port ( And_in : in std_logic_vector(31 downto 0);
And_out : out std_logic);
end big_and;

architecture behavioral of big_and is

component carry_and
    port (
      And_in : in std_logic_vector(3 downto 0);
      carry_in : in std_logic;
      carry_out : out std_logic
    );
end component;

component XORCY
    port (
      LI : in std_logic;
      CI : in std_logic;
      O : out std_logic
      );
end component;

signal CARRY: std_logic_vector(8 downto 0);

signal zero: std_logic;

begin

   AND_GEN:
     for I in 0 to 7 generate
        CARRY_AND_INST: carry_and port map(And_in=>And_in((I*4+3) downto (I*4)),
                                       carry_in=>CARRY(I), carry_out=>CARRY(I+1));
     end generate;

-- Adding a redundant XORCY generally gives access to faster routing than
-- exiting the carry chain directly

XORCY_INST : XORCY port map (LI=>'0', CI=>CARRY(8), O=>And_out);

-- This is used to initialize the carry chain

CARRY(0) <= '1';

end behavioral;
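
For the OR case, a minimal sketch of the corresponding 4-input block
(not part of the original post) could look like the following. The
entity and port names carry_or and Or_in are illustrative. It assumes
the MUXCY behaviour used above (O = CI when S = '1', O = DI when
S = '0'), so the LUT is initialized to 0001 (a 4-input NOR) and DI is
tied high.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- synopsys translate_off
library UNISIM;
use unisim.Vcomponents.all;
-- synopsys translate_on

-- Hypothetical OR counterpart of carry_and; names are illustrative.
entity carry_or is
    Port ( Or_in     : in  std_logic_vector(3 downto 0);
           Carry_in  : in  std_logic;
           Carry_out : out std_logic);
end carry_or;

architecture structural of carry_or is

-- 2-to-1 Multiplexer for Carry Logic with General Output
component MUXCY
port (
    DI : in std_logic;
    CI : in std_logic;
    S  : in std_logic;
    O  : out std_logic
);
end component;

-- 4-Bit Look-Up-Table with General Output
component LUT4
-- synopsys translate_off
generic (INIT : bit_vector := x"0001");
-- synopsys translate_on
port (
    I0 : in std_logic;
    I1 : in std_logic;
    I2 : in std_logic;
    I3 : in std_logic;
    O  : out std_logic
);
end component;

signal LUT_OUT : std_logic;

-- INIT 0001: output is '1' only when all four inputs are '0' (NOR)
attribute INIT : string;
attribute INIT of LUT_INST : label is "0001";

begin

LUT_INST: LUT4 port map (I0=>Or_in(0), I1=>Or_in(1), I2=>Or_in(2),
I3=>Or_in(3), O=>LUT_OUT);

-- Any high input gives S = '0', which forces '1' onto the chain;
-- otherwise the incoming carry is passed through unchanged.
MUXCY_INST: MUXCY port map (DI=>'1', CI=>Carry_in, S=>LUT_OUT,
O=>Carry_out);

end structural;
```

These blocks would be chained the same way big_and chains carry_and,
except that the chain is initialized with CARRY(0) <= '0'.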

Evan :

The best way to do this is to use a carry chain to combine the outputs
of multiple 4-in LUTs - look up an OR16 in the online docs. This gives
you a 1-level delay, plus the carry chain delay. You can't directly
extend the OR16 primitive, but you'll be able to code up a wider
version by instantiating the primitives in the diagram.

===== and also

To put a more structural spin on all this, a carry chain can be used
to combine the outputs of multiple LUTs. The chain can implement an
AND function, as coded in Arrigo's code, or an OR function, by doing
the obvious De Morganising.

One thing that you can put in a 4-in LUT is a 2-bit compare (2 XNORs
combined with an AND gate). The simple way to do this is to
instantiate a LUT with an INIT attribute of 9009. A wide n-bit compare
can therefore be implemented in n/2 LUTs, with an AND-type carry
chain.
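
As a quick sanity check of that INIT value, the simulation-only sketch
below compares 9009 (and the 8000 used for a 4-input AND) against the
expected truth tables. It assumes the usual LUT4 convention that INIT
bit n is the output for the input combination (I3, I2, I1, I0) = n,
which pairs I0 with I1 and I2 with I3. The entity and constant names
are illustrative.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Simulation-only check; no ports, no vendor libraries needed.
entity lut_init_check is
end entity lut_init_check;

architecture sim of lut_init_check is
    -- INIT bit n = LUT output when (I3, I2, I1, I0) encodes n.
    constant INIT_EQ2  : std_logic_vector(15 downto 0) := x"9009";
    constant INIT_AND4 : std_logic_vector(15 downto 0) := x"8000";
begin
    check : process
        variable addr : unsigned(3 downto 0);
        variable eq2  : std_logic;
        variable and4 : std_logic;
    begin
        for n in 0 to 15 loop
            addr := to_unsigned(n, 4);
            -- Two XNORs combined with an AND: (I0 = I1) and (I2 = I3)
            eq2  := (addr(0) xnor addr(1)) and (addr(2) xnor addr(3));
            -- Plain 4-input AND
            and4 := addr(0) and addr(1) and addr(2) and addr(3);
            assert INIT_EQ2(n) = eq2
                report "9009 mismatch at address " & integer'image(n)
                severity failure;
            assert INIT_AND4(n) = and4
                report "8000 mismatch at address " & integer'image(n)
                severity failure;
        end loop;
        report "INIT values match the expected truth tables";
        wait;
    end process check;
end architecture sim;
```

In a compare, A(0) and B(0) would then go to I0 and I1, and A(1) and
B(1) to I2 and I3.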

My own personal preference is to do this sort of thing (ie. basic
components) completely structurally, to leave the synth and the mapper
the minimum opportunity to de-optimise your code.

Phil Hays :

Carry chains can be inferred by synthesis tools; however, the code may not be
highly readable. For example, to create an OR gate:

OR_temp <= '0' & A & B & C & D & E;
Result_temp <= OR_temp + "011111";
Result <= Result_temp(5); -- result is zero unless (A or B or C or D or E) = 1

I'd suggest using a procedure to improve readability.

The biggest gain in speed comes from using the carry chain for priority
encoders; large AND and OR gates gain some.

Wide AND is:

AND_temp <= '0' & A & B & C & D & E;
Result_temp <= AND_temp + "000001";
Result <= Result_temp(5); -- result is zero unless (A AND B AND C AND D AND E) = 1
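
The same trick can be sketched as a complete, width-generic entity with
a small self-checking testbench. This is an illustration rather than
code from the original post: the names adder_gate, d, or_out and
and_out are made up, and it uses numeric_std instead of
STD_LOGIC_UNSIGNED. Whether a given synthesis tool actually maps the
adders onto the carry chain is something to verify in its reports.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Wide OR / AND taken from the carry-out of an adder.
entity adder_gate is
    generic (WIDTH : positive := 5);
    port (
        d       : in  std_logic_vector(WIDTH-1 downto 0);
        or_out  : out std_logic;
        and_out : out std_logic);
end entity adder_gate;

architecture rtl of adder_gate is
    constant ONES  : unsigned(WIDTH-1 downto 0) := (others => '1');
    signal d_ext   : unsigned(WIDTH downto 0);
    signal or_sum  : unsigned(WIDTH downto 0);
    signal and_sum : unsigned(WIDTH downto 0);
begin
    d_ext   <= '0' & unsigned(d);
    -- d + 11..1 carries into the top bit unless d is all zeros
    or_sum  <= d_ext + ('0' & ONES);
    -- d + 1 carries into the top bit only when d is all ones
    and_sum <= d_ext + 1;
    or_out  <= or_sum(WIDTH);
    and_out <= and_sum(WIDTH);
end architecture rtl;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Exhaustive check of the 5-input case.
entity adder_gate_tb is
end entity adder_gate_tb;

architecture sim of adder_gate_tb is
    signal d       : std_logic_vector(4 downto 0) := (others => '0');
    signal or_out  : std_logic;
    signal and_out : std_logic;
begin
    dut : entity work.adder_gate
        generic map (WIDTH => 5)
        port map (d => d, or_out => or_out, and_out => and_out);

    stimulus : process
    begin
        for n in 0 to 31 loop
            d <= std_logic_vector(to_unsigned(n, 5));
            wait for 1 ns;
            assert (or_out = '1') = (n /= 0)
                report "OR mismatch for input " & integer'image(n)
                severity failure;
            assert (and_out = '1') = (n = 31)
                report "AND mismatch for input " & integer'image(n)
                severity failure;
        end loop;
        report "adder_gate matches plain OR/AND for all 32 inputs";
        wait;
    end process stimulus;
end architecture sim;
```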

Keith R. Williams :

However, the results from Synplify Pro don't show much of a
gain. I must be missing something. If I just did a simple 12-bit add
(for a 12-bit AND), Synplify inferred twelve LUTs feeding twelve
MUXCYs. The speed was nothing to write home about (3.5ns in a Virtex-E
-7). The two-stage 12-input AND that Synplify infers from a
normal IF/THEN/ELSE comes in at 2.5ns. In both tests the AND feeds a
DFF.

I then instantiated three four-input ANDs and then did a three-bit
add. Synplicity inferred a two stage LUT feeding the flop. Grrr.

Only when I extended the AND to three stages (32 bits) did the carry
method become a tad faster. Synplify says the ADD method is 3.5ns,
vs. 3.6 for the three stage LUT (seems strange as I write, I'll have
to look at this again).

Jan Gray :

There's a nice example of a Virtex-carry-chain-optimized n-bit comparator in
VHDL from Arrigo Benedetti in the fpga-cpu list archive, at
http://groups.yahoo.com/group/fpga-cpu/message/73

Keith R. Williams :

Very nice indeed! Synplify says I can do a 12-bit comparator at just
a hair under 200MHz (Virtex-E -7) with registers around it. That
might be a bit slow for what I need, but not by much. I modified it
slightly to do two XORs and an AND in the LUT (I have four inputs, why
waste two :-) before feeding the MUXCY and Synplicity reports 224MHz.
Nice indeed! If true (and it wires nicely), I can cut a pipeline
stage out of my address compare.

Kent Orthner :

A little while ago, I was playing around with a 32-bit wide
trinary compare function (comparing 32 bits against 32 bits under a
32-bit mask), and I tested both the carry-chain method and the
pull-up (wired-or) method, and I did indeed find that the
carry chain was faster.

Note that you will need to break the carry chain in half if
the device height isn't enough for your entire carry chain.
(Which is what I did.)

I've stuck my compare component here so you can see how I
did it. I tested it using Foundation 3.3i ISE.

Hope this helps,
-Kent

-------------------------------------------------------------------------------
-- Title      : Fast Trinary Compare Component
-- Project    : Common Component
-------------------------------------------------------------------------------
-- File       : FastCompare.vhd
-- Author     : K.Orthner
-- Created    : 2001/01/27
-- Last update: 2001-01-25
-- Platform   : Active-HDL/FPGA Express(Synopsys)
-------------------------------------------------------------------------------
-- Description: A Fast Trinary Compare component.
--              Completely combinational.
-------------------------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
-- synopsys translate_off
library unisim;
use unisim.all;
-- synopsys translate_on

entity FastCompare is
generic (
    Width    : integer);
port (
    Comparand0 : in std_logic_vector(Width-1 downto 0);
    Comparand1 : in std_logic_vector(Width-1 downto 0);
    Mask       : in std_logic_vector(Width-1 downto 0);
    Match      : out std_logic );

end FastCompare;

----------------------------------------------------------------------------------------------------
-- Architecture Xilinx_Carry
----------------------------------------------------------------------------------------------------
architecture Xilinx_Carry of FastCompare is

component MUXCY is
    port (
      O : out std_ulogic;
      DI : in std_ulogic;
      CI : in std_ulogic;
      S : in std_ulogic);
end component MUXCY;

signal BitMatch    : std_logic_vector(Width-1 downto 0);
signal CarryChain0 : std_logic_vector(Width/2 downto 0);
signal CarryChain1 : std_logic_vector(Width/2 downto 0);
signal Logic_0     : std_logic;

begin

Logic_0 <= '0';

----------------------------------------------------------------------------------------------------
-- Determine bitwise matching.
----------------------------------------------------------------------------------------------------
GenBitMatch : process( Comparand0, Comparand1, Mask) is
begin
    for i in BitMatch'range loop
      if (Mask(i) = '0') then
        BitMatch(i) <= '1';
      elsif (Comparand0(i) = Comparand1(i)) then
        BitMatch(i) <= '1';
      else
        BitMatch(i) <= '0';
      end if;
    end loop;

end process GenBitMatch;

GenMux : for i in 0 to ((Width/2)-1) generate
    MUXCY_0 : MUXCY
      port map (
        O => CarryChain0(i+1),
        DI => Logic_0,
        CI => CarryChain0(i),
        S => BitMatch(i));

    MUXCY_1 : MUXCY
      port map (
        O => CarryChain1(i+1),
        DI => Logic_0,
        CI => CarryChain1(i),
        S => BitMatch(i+(Width/2)));

end generate GenMux;

CarryChain0(0) <= '1';
CarryChain1(0) <= '1';

Match <= CarryChain0(Width/2) and CarryChain1(Width/2);--and CarryChain2(Width/4) and CarryChain3(Width/4);

end Xilinx_Carry;
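
To check the split-chain idea without the UNISIM library, a
simulation-only sketch such as the one below can be used. It is not
part of the original component: muxcy_model is a behavioural stand-in
that assumes MUXCY passes CI when S = '1' and DI otherwise, and the
entity name, function names and test vectors are illustrative.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity split_chain_check is
end entity split_chain_check;

architecture sim of split_chain_check is
    constant WIDTH : positive := 8;  -- must be even for the split

    -- Behavioural stand-in for MUXCY: pass CI when S = '1', else DI.
    function muxcy_model (di, ci, s : std_logic) return std_logic is
    begin
        if s = '1' then
            return ci;
        else
            return di;
        end if;
    end function muxcy_model;

    -- Masked compare built from two half-length AND-type chains.
    function split_match (a, b, m : std_logic_vector(WIDTH-1 downto 0))
        return std_logic is
        variable c0, c1 : std_logic := '1';  -- both chains start at '1'
        variable bm     : std_logic_vector(WIDTH-1 downto 0);
    begin
        -- A bit matches if it is masked out (mask = '0') or equal.
        for i in bm'range loop
            if m(i) = '0' or a(i) = b(i) then
                bm(i) := '1';
            else
                bm(i) := '0';
            end if;
        end loop;
        -- Lower half feeds chain 0, upper half feeds chain 1.
        for i in 0 to WIDTH/2 - 1 loop
            c0 := muxcy_model('0', c0, bm(i));
            c1 := muxcy_model('0', c1, bm(i + WIDTH/2));
        end loop;
        return c0 and c1;
    end function split_match;
begin
    check : process
    begin
        assert split_match(x"A5", x"A5", x"FF") = '1'
            report "equal words should match" severity failure;
        assert split_match(x"A5", x"A4", x"FF") = '0'
            report "bit 0 differs, lower chain should fail" severity failure;
        assert split_match(x"A5", x"A4", x"FE") = '1'
            report "bit 0 is masked out, should match" severity failure;
        assert split_match(x"A5", x"25", x"FE") = '0'
            report "bit 7 differs, upper chain should fail" severity failure;
        assert split_match(x"A5", x"25", x"7E") = '1'
            report "bits 7 and 0 are masked out, should match" severity failure;
        report "split carry chain model behaves as a masked compare";
        wait;
    end process check;
end architecture sim;
```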

Christian Plessl :

I've made little test circuits to compare your proposals, and want
to briefly show the results:

I compared 3 different architectures for a 32-input AND gate:

a) Simply using the 'and' operator

b) Using Brian Philofsky's scheme, by instantiating LUTs which
implement a 4-bit Boolean function and passing the intermediate results
via the carry chain.

c) Phil Hays's clever idea of using the design tools' capability to
infer adders that use the carry chain for constructing wide boolean
functions.
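
The source for the test circuits is not reproduced here. As an
assumption of what variant a) amounts to, a plain behavioural 32-input
AND could be written like this (the entity name plain_and is
illustrative):

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Naive wide AND: leaves the whole implementation to the tools.
entity plain_and is
    port (
        And_in  : in  std_logic_vector(31 downto 0);
        And_out : out std_logic);
end entity plain_and;

architecture rtl of plain_and is
begin
    reduce : process (And_in)
        variable acc : std_logic;
    begin
        acc := '1';
        -- AND every input bit into the accumulator
        for i in And_in'range loop
            acc := acc and And_in(i);
        end loop;
        And_out <= acc;
    end process reduce;
end architecture rtl;
```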

All designs were implemented using Xilinx Foundation Tools Version
3.3i Servicepack 6 using VHDL toolflow. The target FPGA is Xilinx
Virtex-XCV1000-4.

Results:

+---------+-------------+-----------+----------+
| Circuit | Slices used | LUTs used | Delay    |
+---------+-------------+-----------+----------+
| a       | 9           | 11        | 17 ns    |
+---------+-------------+-----------+----------+
| b       | 5           | 8         | 13 ns    |
+---------+-------------+-----------+----------+
| c       | 17          | 0         | 15.25 ns |
+---------+-------------+-----------+----------+

Remarks:

a) shows that the tools cannot infer a highly efficient implementation
when using just the obvious, naive way of coding wide logic functions.

b) Brian's scheme generates the fastest wide-AND implementation for
32-input ANDs. The logic inferred is as expected: each slice implements
two 4-input LUTs, each of which implements a 4-input AND. All the
outputs of the LUTs control the carry-select (CYSEL) multiplexers and
the results are passed via the carry chain.

c) Phil's scheme doesn't use any LUTs for logic at all: all the logic is
implemented using the carry chain, and each LUT is just used for routing
one single signal to the multiplexer. This means the circuit is similar
to b), but every slice handles only 2 bits instead of the 8 bits in
circuit b). Surprisingly, the circuit is quite fast. It seems the
Virtex carry chains are _really_ fast.
