Research Article  Open Access
P. Balasubramanian, D. A. Edwards, W. B. Toms, "Redundant Logic Insertion and Latency Reduction in Self-Timed Adders", VLSI Design, vol. 2012, Article ID 575389, 13 pages, 2012. https://doi.org/10.1155/2012/575389
Redundant Logic Insertion and Latency Reduction in Self-Timed Adders
Abstract
A novel concept of logic redundancy insertion is presented that enables significant latency reduction in self-timed adder circuits. The proposed concept is universal in the sense that it can be extended to a variety of self-timed design methods. Redundant logic can be incorporated to generate efficient self-timed realizations of iterative logic specifications. In a case study of a 32-bit self-timed carry-ripple adder, redundant implementations reduce the data path latency by 21.1% on average, at the expense of average increases of 2.3% in area and 0.8% in power compared with their non-redundant counterparts. With further peephole logic optimization, the delay reduction in one specific scenario reaches 31%, accompanied by area and power penalties of only 0.6% and 1.2%, respectively. Moreover, redundant logic adders allow spacers to propagate in constant time while exhibiting actual-case latency when adding valid data.
1. Introduction
The 2009 International Technology Roadmap for Semiconductors (ITRS) predicts, in its design chapter, that adaptive digital circuits will become increasingly necessary as variability increases [1]. This is owing to a blurring of the boundary between catastrophic faults caused by manufacturing defects and parametric faults resulting from device and interconnect variability. The ITRS roadmap [1] projects a growing requirement for asynchronous global signaling and emphasizes the need for continued development of asynchronous logic/circuit design tools. This is significant in the context of a key challenge in modern IC design, namely, distributing a centralized clock signal throughout the chip with acceptably low skew while keeping the power, congestion, and area costs of traditional repeater insertion in long global clock lines to a minimum. Indeed, as variability increases, circuits can exhibit faulty behavior similar to that caused by catastrophic defects. The major sources of failure include (i) process variations: statistical variations of device parameters such as channel length, threshold voltage, and mobility; (ii) lifetime variations: variations that shift physical parameters over the operating life of a circuit; and (iii) intrinsic noise: noise sources (shot noise, thermal noise, and random noise) that are inherent to normal device operation and become dominant at small feature sizes. At a time when variability has become prominent and reliability is tending to assume greater significance than quality of results in nanometer-scale digital circuits, the self-timed design paradigm offers an attractive alternative to conventional synchronous design. In fact, self-timed logic circuits are inherently tolerant of process, temperature, and parameter uncertainties [2–6].
A recent work [7] by Chelcea et al. demonstrated the superior resilience of asynchronous circuits vis-à-vis their synchronous counterparts in the presence of parametric variations (probabilistic device delays) for a 32-bit Brent-Kung adder and a 16-bit multiplier. Self-timed circuits also have better electromagnetic compatibility [8] and lower noise susceptibility [9] than synchronous designs, consume power only where and when active [10, 11], and offer excellent design reusability [12]. Moreover, self-timed circuits are self-checking [13, 14] and latency insensitive, and are therefore naturally elastic or adaptive.
Although the term “self-timed” is often used to refer to asynchronous circuits in general, self-timed circuits actually constitute a robust class of asynchronous circuits, namely, input/output-mode circuits. In general, circuits operating in input/output mode impose no timing assumptions on when the environment should respond to the circuit. The robustness of self-timed circuits usually results from employing a delay-insensitive (DI) code for data representation, communication, and processing, and a 4-phase (return-to-zero) handshake signaling convention is commonly adopted. Among the family of DI codes [15], the dual-rail (1-of-2) code is widely preferred owing to its simplicity and ease of logic implementation.
Under dual-rail data encoding, each data wire d is represented by two encoded data wires, d0 and d1, as shown in Figure 1. A transition on d0 indicates that a zero has been transmitted, while a transition on d1 indicates that a one has been transmitted. Since the request signal is embedded within the data wires, a transition on either d0 or d1 informs the receiver that the data are valid. The condition in which d0 and d1 are both zero is referred to as the spacer or empty data. d0 and d1 must never both be high at the same time: that state is illegal, because it would contain both codewords and thus violate the unordered property of the code [16], under which no codeword forms a subset of another codeword.
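A minimal Python sketch of the dual-rail convention just described, assuming each rail is modelled as a 0/1 level and a codeword as the tuple (d0, d1); the function names are hypothetical. It also checks the unordered property: neither valid codeword is covered by the other, whereas the illegal state (1, 1) would cover both.

```python
from typing import Optional, Tuple

DualRail = Tuple[int, int]          # (d0, d1) rail levels
SPACER: DualRail = (0, 0)           # empty data: neither rail asserted


def encode(bit: int) -> DualRail:
    """Map a single-rail bit onto its dual-rail codeword (d0, d1)."""
    if bit not in (0, 1):
        raise ValueError("bit must be 0 or 1")
    return (1, 0) if bit == 0 else (0, 1)


def decode(word: DualRail) -> Optional[int]:
    """Return the transmitted bit, or None while the wires hold the spacer."""
    d0, d1 = word
    if d0 and d1:
        raise ValueError("illegal dual-rail state: both rails high")
    if not d0 and not d1:
        return None                 # spacer: data not yet valid
    return 1 if d1 else 0


def covers(x: DualRail, y: DualRail) -> bool:
    """True if every rail asserted in x is also asserted in y."""
    return all(xi <= yi for xi, yi in zip(x, y))
```

Because `covers(encode(0), encode(1))` and `covers(encode(1), encode(0))` are both false, the receiver can never mistake a partially arrived codeword for a different valid one.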
With reference to Figure 1, the 4-phase handshake protocol operates as follows (the explanation remains valid for data represented using any DI encoding scheme).
(i) The dual-rail data bus is initially in the spacer state. The sender transmits a codeword (valid data). This results in “low” to “high” transitions on the bus wires that correspond to the nonzero bits of the codeword (i.e., one of the two rails of every dual-rail signal is driven “high”).
(ii) After the receiver has received the codeword, it drives the ackout (ackin) wire “high” (“low”).
(iii) The sender waits for ackin to go “low” and then resets the data bus (i.e., drives it to the spacer state).
(iv) Once the receiver detects the spacer, after an unbounded but finite (positive) amount of time, it drives the ackout (ackin) wire “low” (“high”). A single transaction is now complete, and the system is ready to proceed with the next transaction.
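The four steps above can be checked mechanically. The following Python sketch assumes that ackin is the inverted ackout (high while the receiver is ready) and that a trace is a list of (signal, value) events; it generates the events of one transaction for a single dual-rail bit and verifies that a trace follows the (i)–(iv) ordering. All names are illustrative.

```python
from typing import List, Tuple

Event = Tuple[str, object]
SPACER = (0, 0)
VALID = {(1, 0), (0, 1)}            # dual-rail codewords for 0 and 1


def transaction(bit: int) -> List[Event]:
    """Events of one return-to-zero transaction carrying a single bit."""
    codeword = (1, 0) if bit == 0 else (0, 1)
    return [
        ("data", codeword),   # (i)   sender drives a valid codeword
        ("ackin", 0),         # (ii)  receiver acknowledges (ackout high)
        ("data", SPACER),     # (iii) sender returns the bus to the spacer
        ("ackin", 1),         # (iv)  receiver releases the acknowledge
    ]


def follows_four_phase(trace: List[Event]) -> bool:
    """Check that a trace is a sequence of complete 4-phase transactions."""
    phase = 0
    for signal, value in trace:
        if phase == 0 and signal == "data" and value in VALID:
            phase = 1
        elif phase == 1 and signal == "ackin" and value == 0:
            phase = 2
        elif phase == 2 and signal == "data" and value == SPACER:
            phase = 3
        elif phase == 3 and signal == "ackin" and value == 1:
            phase = 0
        else:
            return False              # event out of protocol order
    return phase == 0                 # trace must end between transactions
```

For example, `follows_four_phase(transaction(1) + transaction(0))` holds, while a trace in which the sender drives a second codeword without first returning the bus to the spacer is rejected.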
The timing diagram for the 4-phase asynchronous signaling protocol is shown in Figure 2, where the request (req) signal, which is actually embedded within the data wires, is drawn explicitly to describe the handshaking. The dual-rail code is the simplest member of the general family of delay-insensitive m-of-n codes [15], in which m lines out of a total of n physical lines are asserted “high” to represent a codeword. The size (i.e., the number of unique symbols) of a generic m-of-n code is given by the binomial coefficient $\binom{n}{m}$. The dual-rail code is ideally suited to representing a single bit of binary information. To represent two bits of information, dual-rail codes can be concatenated as shown in Table 1, or the two bits can equivalently be represented by a 1-of-4 code.
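As a quick check of the size formula, the Python sketch below enumerates the codewords of an m-of-n code (each written as the set of wires driven high) and compares the count with $\binom{n}{m}$ using the standard-library `math.comb` (Python 3.8+). Because every codeword asserts exactly m wires, no codeword can be a proper subset of another, so every m-of-n code is unordered. The function names are hypothetical.

```python
from itertools import combinations
from math import comb
from typing import FrozenSet, List


def m_of_n_code(m: int, n: int) -> List[FrozenSet[int]]:
    """All codewords of an m-of-n code; each is the set of wires driven high."""
    return [frozenset(c) for c in combinations(range(n), m)]


def is_unordered(code: List[FrozenSet[int]]) -> bool:
    """No codeword is a proper subset of another codeword."""
    return not any(x < y for x in code for y in code)


for m, n in [(1, 2), (1, 4), (2, 4)]:
    code = m_of_n_code(m, n)
    print(f"{m}-of-{n}: {len(code)} codewords (C({n},{m}) = {comb(n, m)}),"
          f" unordered = {is_unordered(code)}")
```

The dual-rail (1-of-2) code has 2 codewords, the 1-of-4 code has 4, and a 2-of-4 code would have 6.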

The 1-of-4 encoded values of the single-rail inputs given in Table 1 represent only one of many possible encodings; an arbitrary choice is shown here. Although both coding schemes require the same number of physical lines (four), the 1-of-4 code represents two binary bits by asserting only one line “high”, whereas two concatenated dual-rail codes assert two lines, i.e., the 1-of-4 code asserts half as many lines. As a result, the 1-of-4 encoding scheme experiences only half the transitions of the dual-rail encoding convention, so its dynamic power dissipation is likely to be lower owing to reduced switching activity. This was confirmed in the practical example of an ARM Thumb instruction decoder [17]. However, considering the additional encoding and decoding circuitry required to realize 1-of-4 encoded self-timed data paths compared with dual-rail encoded ones [18], the power savings gained by the former might diminish.
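The halving of transitions can be counted directly. In a 4-phase cycle every asserted wire rises once (spacer to codeword) and falls once (codeword to spacer), so the transitions per cycle equal twice the number of asserted wires. The Python sketch below assumes a hypothetical 1-of-4 assignment in which wire 2a + b is asserted for inputs (a, b); any other one-hot assignment, such as the one in Table 1, gives the same counts.

```python
from itertools import product
from typing import Tuple


def dual_rail_pair(a: int, b: int) -> Tuple[int, ...]:
    """Two concatenated dual-rail codes: (a0, a1, b0, b1)."""
    return (1 - a, a, 1 - b, b)


def one_of_four(a: int, b: int) -> Tuple[int, ...]:
    """Hypothetical 1-of-4 assignment: wire 2a + b is asserted."""
    wires = [0, 0, 0, 0]
    wires[2 * a + b] = 1
    return tuple(wires)


def transitions_per_cycle(codeword: Tuple[int, ...]) -> int:
    """Rising plus falling transitions for spacer -> codeword -> spacer."""
    return 2 * sum(codeword)


for a, b in product((0, 1), repeat=2):
    print(a, b, transitions_per_cycle(dual_rail_pair(a, b)),
          transitions_per_cycle(one_of_four(a, b)))   # always 4 vs. 2
```

The count covers the data wires only; it does not include switching in the additional encoding and decoding circuitry.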
Although higher-order encoding schemes are available, the most widely used DI codes are the dual-rail code, which allows easy mapping to and from conventional binary functions, and the 1-of-4 code. This is because, for self-timed data paths, encoding by the sender and membership testing and decoding by the receiver are important considerations, and the encoding and decoding complexity depends on the size of the message space to be coded [19]. When the dual-rail code and the 1-of-4 code are used to represent exactly one bit and two bits of binary information, respectively, they are said to be complete [14]. A code is complete if and only if it contains all the codewords implied by its definition; if even one codeword is missing, the code is incomplete. In general, a DI coding scheme is required to be both unordered and complete.
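Completeness can likewise be tested by comparing a code's codeword set with the full set implied by its m-of-n definition. In this Python sketch (function and variable names hypothetical), the dual-rail code used for one bit and the 1-of-4 code used for two bits are complete, while a 1-of-4 code with one codeword left unused is not.

```python
from itertools import combinations
from typing import Iterable, Set


def is_complete(code: Iterable[Set[int]], m: int, n: int) -> bool:
    """True iff the code contains every codeword of the m-of-n definition."""
    full = {frozenset(c) for c in combinations(range(n), m)}
    return {frozenset(w) for w in code} == full


dual_rail = [{0}, {1}]                  # 1-of-2: codewords for 0 and 1
one_of_four = [{0}, {1}, {2}, {3}]      # 1-of-4: one wire per 2-bit value
three_symbols = [{0}, {1}, {2}]         # 1-of-4 with one codeword unused
```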
Seitz classified robust self-timed logic circuits into two categories on the basis of their indication (acknowledgement) behavior: strongly indicating and weakly indicating [20]. It was also shown there that a legal interconnection of strongly or weakly indicating logic circuits yields a larger strongly or weakly indicating logic circuit.
(i) Strong Indication. The self-timed circuit waits for all of its inputs (valid/spacer) to arrive before it produces any of its outputs (valid/spacer). The sequencing constraints are as follows:
(a) all the inputs become defined (valid)/undefined (spacer) before any output becomes defined/undefined; that is, any or all of the outputs become defined/undefined only after all the inputs have become defined/undefined;
(b) all the outputs become defined/undefined before any input becomes undefined/defined.
(ii) Weak Indication. The self-timed circuit may produce some of its outputs (valid/spacer) after receiving only a subset of its inputs (valid/spacer). However, Seitz's weak timing specification requires that at least one output (valid/spacer) is not produced until all the inputs (valid/spacer) have arrived. The sequencing constraints are as follows:
(a) some inputs become defined (undefined) before some outputs become defined (undefined); that is, some outputs can become defined (undefined) only after at least some inputs have become defined (undefined);
(b) all the inputs become defined (undefined) before all the outputs become defined (undefined); that is, the outputs can all be defined (undefined) only after all the inputs have become defined (undefined);
(c) all the outputs become defined (undefined) before any input becomes undefined (defined).
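Seitz's constraints can be read as inequalities on event times. The Python sketch below assumes that each input and output is annotated with the time at which it becomes defined (or undefined) within one phase; the timestamps, function names, and example values are hypothetical. The example mimics a full adder whose carry output appears before its carry input arrives, which violates strong indication but satisfies weak indication.

```python
from typing import Dict

Times = Dict[str, float]   # signal name -> time it becomes defined/undefined


def strongly_indicating(inputs: Times, outputs: Times) -> bool:
    """(a) every input changes before any output changes."""
    return max(inputs.values()) < min(outputs.values())


def weakly_indicating(inputs: Times, outputs: Times) -> bool:
    """(a) some input precedes the first output; (b) the last output
    changes only after every input has changed."""
    some_before_some = min(inputs.values()) < min(outputs.values())
    all_before_all = max(inputs.values()) < max(outputs.values())
    return some_before_some and all_before_all


def phase_closed(outputs: Times, next_input_change: float) -> bool:
    """Final constraint of both regimes: all outputs change before any
    input starts the next phase."""
    return max(outputs.values()) < next_input_change


inputs = {"a": 0.0, "b": 1.0, "cin": 5.0}
outputs = {"cout": 2.0, "sum": 6.0}   # carry decided before cin arrives
```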
The signaling behavior of the strong- and weak-indication timing regimes, in terms of input-output characteristics, is illustrated in Figure 3, which summarizes the sequencing constraints listed above. In general, for iterative circuits, weakly indicating implementations are preferable to strongly indicating ones, since the computation time of the former is data dependent for valid data and may be constant for spacers, whereas the latter are always bound by worst-case latency for both valid data and spacers [21].
2. Redundant Logic Insertion
This section presents an efficient method of reducing the critical path delay of self-timed adders by means of a novel concept called redundant logic insertion. In general, the concept can be extended to reduce latency in any iterative logic circuit that consists of a cascade of basic building blocks. Redundancy insertion means adding extra, redundant logic to a non-redundant implementation, without altering the function it realizes, in order to speed up the propagation of those signals that drive the subsequent stages of the cascade.
Logic redundancy can be incorporated into a self-timed circuit implementation by careful duplication of existing logic; this can lead to multiple acknowledgements, which might be useful in simplifying timing assumptions. This procedure can also enable faster, constant-latency reset of the logic during the return-to-zero phase. Logic redundancy realized with input-incomplete gates essentially introduces the weak-indication property into the circuit, since it relaxes the indication constraints of the outputs chosen as candidates for optimization. (Input-incomplete gates need not wait for all of their inputs to arrive to produce the required output in every scenario; AND gates and OR gates are examples. If any one input of an AND gate (OR gate) is 0 (1), its output becomes 0 (1).) Logic redundancy can be either implicit or explicit in the circuit. The drawbacks of this approach are small increases in area and power. Since logic is duplicated, switching activity increases owing to multiple acknowledgements, which raises dynamic power and hence average power dissipation. However, the area and power overheads may be marginal, depending on the functionality, its initial non-redundant implementation, and the degree of logic redundancy introduced. We now consider some case studies that demonstrate the benefits of redundancy insertion for the self-timed ripple carry adder (RCA) architecture, in which logic redundancy targets the carry output function, since the carry must propagate between successive stages of the adder.
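The distinction between input-complete and input-incomplete gates can be made concrete with a three-valued view in which an input that has not yet arrived is unknown. In this illustrative Python sketch (None stands for a not-yet-arrived input; all names are hypothetical), an AND or OR gate can commit to its output as soon as one controlling input is present, whereas an input-complete gate computing the same function can only commit once every input is known.

```python
from typing import Optional, Sequence

Level = Optional[int]   # 0, 1, or None for an input that has not arrived


def and_gate(inputs: Sequence[Level]) -> Level:
    """Input-incomplete: a single 0 determines the output."""
    if any(x == 0 for x in inputs):
        return 0
    return 1 if all(x == 1 for x in inputs) else None


def or_gate(inputs: Sequence[Level]) -> Level:
    """Input-incomplete: a single 1 determines the output."""
    if any(x == 1 for x in inputs):
        return 1
    return 0 if all(x == 0 for x in inputs) else None


def input_complete_and(inputs: Sequence[Level]) -> Level:
    """Same function as AND, but no output until every input is known."""
    if any(x is None for x in inputs):
        return None
    return and_gate(inputs)
```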
2.1. Implicit Logic Redundancy
The basic equations of a dual-rail encoded full adder are given by (1)–(4). Here (a0, a1), (b0, b1), and (cin0, cin1) represent the dual-rail encoded augend, addend, and carry inputs of the adder, while (Sum0, Sum1) and (Cout0, Cout1) represent its dual-rail encoded sum and carry outputs, respectively.
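Equations (1)–(4) are not reproduced in this text. For reference, the Python sketch below uses one standard orthogonal sum-of-products form of a dual-rail full adder; we assume, but have not confirmed, that it matches the paper's (1)–(4) up to the ordering of terms. An exhaustive check confirms that the rails agree with binary addition for all eight valid input combinations, and an all-spacer input yields spacer outputs.

```python
from itertools import product
from typing import Tuple


def dual_rail_full_adder(a0, a1, b0, b1, c0, c1) -> Tuple[int, int, int, int]:
    """Returns (Sum0, Sum1, Cout0, Cout1); c0, c1 are the rails of cin."""
    sum1 = (a0 & b0 & c1) | (a0 & b1 & c0) | (a1 & b0 & c0) | (a1 & b1 & c1)
    sum0 = (a0 & b0 & c0) | (a0 & b1 & c1) | (a1 & b0 & c1) | (a1 & b1 & c0)
    cout1 = (a1 & b1) | (a1 & b0 & c1) | (a0 & b1 & c1)
    cout0 = (a0 & b0) | (a1 & b0 & c0) | (a0 & b1 & c0)
    return sum0, sum1, cout0, cout1


def rails(bit: int) -> Tuple[int, int]:
    """Dual-rail codeword (x0, x1) of a single-rail bit."""
    return (1 - bit, bit)


def matches_binary_addition() -> bool:
    """Compare the rail equations with a + b + cin for every valid input."""
    for a, b, c in product((0, 1), repeat=3):
        s0, s1, k0, k1 = dual_rail_full_adder(*rails(a), *rails(b), *rails(c))
        total = a + b + c
        if (s0, s1) != rails(total % 2) or (k0, k1) != rails(total // 2):
            return False
    return True
```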
The circuit shown in Figure 4 is our synthesized dual-rail encoded full adder, henceforth referred to as the SSSC_DRE adder (single-sum, single-carry dual-rail encoded adder). The synthesis process involves three steps: (i) deriving the minimum orthogonal sum-of-products form of the given logic function [22], (ii) speed-independent decomposition of the logic to enable realization using standard cells [23], and (iii) performing logic optimizations to reduce latency. In the figures, the C-element is represented by an AND gate symbol with a distinguishing marking on its periphery. (The Muller C-element governs the rendezvous of input signals. It outputs 1 (0) if all of its inputs are 1 (0); otherwise, it retains its existing steady state. The C-element, also called a C-gate, is classified as an input-complete gate, as it waits for all of its inputs to arrive before producing the desired output.)
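The C-element behavior described above (the output follows the inputs when they agree and holds otherwise) can be captured in a few lines. This is a behavioral Python sketch, not one of the gate-level models of [35]; the class name is hypothetical.

```python
class CElement:
    """Behavioural Muller C-element with any number of inputs."""

    def __init__(self, initial: int = 0) -> None:
        self.out = initial

    def update(self, *inputs: int) -> int:
        if all(x == 1 for x in inputs):
            self.out = 1            # all inputs high: output rises
        elif all(x == 0 for x in inputs):
            self.out = 0            # all inputs low: output falls
        # mixed inputs: keep the previous steady state
        return self.out


c = CElement()
outputs = [c.update(*pair) for pair in [(1, 0), (1, 1), (0, 1), (0, 0)]]
# outputs == [0, 1, 1, 0]
```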
First, note that the responsibility for indication is confined to the sum outputs of the adder block, which frees the carry signal from indication constraints and facilitates fast carry propagation. The carry outputs can become defined/undefined once only a subset of the inputs has arrived, while the sum outputs must wait for all the inputs before becoming defined/undefined. The full adder therefore satisfies Seitz's weak-indication timing constraints. This style of implementation is labeled the biased approach [24], since the indication of the inputs is not distributed among the primary outputs; in other words, the primary outputs are not collectively responsible for acknowledging the arrival of all the primary inputs and internal outputs. Our proposed synthesis solution is a direct synthesis strategy and differs from the method presented in [24], which generates a dual-rail asynchronous gate pair or a delay-insensitive minterm synthesis (DIMS) equivalent [25] for each synchronous logic gate. The process of generating a dual-rail asynchronous gate pair for a synchronous logic gate is based on the dual-rail combinational logic style [26, 27]. The dual-rail asynchronous gate pair or the DIMS equivalent of each synchronous logic gate is eventually realized using proprietary null convention logic (NCL) macros [28], which are based on threshold logic [29].
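The weak-indication behavior can be illustrated by evaluating a full adder while its carry input is still a spacer. The Python sketch below assumes a standard orthogonal sum-of-products form of the dual-rail full adder (Figure 4's gate structure may differ); names are hypothetical. With carry-generate operands the carry output is already valid, while the sum stays a spacer until the carry input arrives; with carry-propagate operands both outputs wait.

```python
from typing import Tuple

Rail = Tuple[int, int]      # (x0, x1)
SPACER: Rail = (0, 0)


def full_adder(a: Rail, b: Rail, cin: Rail) -> Tuple[Rail, Rail]:
    """Return (sum, cout) as dual-rail pairs; spacer inputs contribute 0."""
    (a0, a1), (b0, b1), (c0, c1) = a, b, cin
    sum0 = (a0 & b0 & c0) | (a0 & b1 & c1) | (a1 & b0 & c1) | (a1 & b1 & c0)
    sum1 = (a0 & b0 & c1) | (a0 & b1 & c0) | (a1 & b0 & c0) | (a1 & b1 & c1)
    cout0 = (a0 & b0) | (a1 & b0 & c0) | (a0 & b1 & c0)
    cout1 = (a1 & b1) | (a1 & b0 & c1) | (a0 & b1 & c1)
    return (sum0, sum1), (cout0, cout1)


ONE, ZERO = (0, 1), (1, 0)
# Carry-generate (a = b = 1) with cin not yet arrived
s, cout = full_adder(ONE, ONE, SPACER)
# Carry-propagate (a != b) with cin not yet arrived
s_p, cout_p = full_adder(ONE, ZERO, SPACER)
```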
Second, the full adder block depicted in Figure 4 features implicit logic redundancy. The intermediate gate output functions “int1” and “int2” are embedded, in their input-incomplete forms, within the logic producing the carry outputs Cout0 and Cout1, respectively. With respect to the n-bit self-timed RCA architecture shown in Figure 5, the principal advantages of this full adder are (i) fast production and propagation of the carry output when the carry-generate (a1 = b1 = 1) and carry-kill (a0 = b0 = 1) conditions occur, and (ii) reset of the entire adder circuitry within an approximate propagation delay of only two full adders during the return-to-zero phase, regardless of the adder size. The latter advantage arises because the intermediate dual-rail output carries of all the cascaded full adder modules can be reset in parallel as the dual-rail encoded augend and addend inputs of every adder stage are reset. Subsequently, the dual-rail sum outputs of all the adder stages are reset as their input carries assume the spacer state. This yields constant-latency operation for spacer data, while latency is data dependent for valid data. Indeed, this attribute is inherent in all the redundancy-incorporated self-timed adders. The worst-case latency occurs when every full adder stage is in carry-propagate mode, that is, when a1 = b0 = 1 or a0 = b1 = 1 in every stage. The SSSC_DRE adder shares some properties with Martin's full adder [30], which, however, is a standalone full-custom transistor-level realization.
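A unit-delay abstraction makes the contrast between valid-data and spacer latency concrete. In this Python sketch, each full adder stage is assumed to take one time unit: a stage's carry becomes valid one unit after its operands when a = b (kill or generate), and one unit after its incoming carry otherwise (propagate). Real gate delays differ, and the function names are hypothetical. Valid-data latency then grows with the longest propagate run, whereas the return-to-zero phase takes two units regardless of n.

```python
from typing import List


def valid_latency(a: List[int], b: List[int]) -> int:
    """Units until every sum and the final carry are valid.

    Bits are listed LSB first; the primary carry input is valid at time 0.
    """
    carry_ready = 0                        # carry into the current stage
    finish = 0
    for ai, bi in zip(a, b):
        sum_ready = carry_ready + 1        # the sum waits for its carry input
        if ai == bi:
            carry_ready = 1                # kill/generate: operands suffice
        else:
            carry_ready += 1               # propagate: wait for carry-in
        finish = max(finish, sum_ready, carry_ready)
    return finish


def spacer_latency(n: int) -> int:
    """Return-to-zero: all carries reset from their own operands in one
    unit, then all sums reset one unit later, independent of n."""
    return 2 if n > 0 else 0


n = 32
worst = valid_latency([1] * n, [0] * n)   # every stage propagates: 32 units
best = valid_latency([1] * n, [1] * n)    # every stage generates: 2 units
```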
2.2. Explicit Logic Redundancy
We now consider a variety of scenarios in which logic redundancy is explicit in a circuit design. To this end, we analyze adder circuits that employ either a uniform DI data encoding protocol (dual-rail encoding) for both primary inputs and outputs, or a combination of DI codes (dual-rail and 1-of-4 codes) for the primary inputs together with a single DI code (the dual-rail code) for the primary outputs.
2.2.1. Single-Bit Adder Based on Hybrid Input Encoding
The term “hybrid input encoding” denotes a mix of at least two different DI data encoding schemes adopted for the primary inputs. In a single-bit full adder block, the augend and addend input bits can be jointly encoded using a 1-of-4 code, while the carry input and the sum and carry outputs use the dual-rail code; that is, hybrid encoding is used for the primary inputs and uniform encoding for the primary outputs. The structure of the n-bit hybrid input encoded self-timed RCA is depicted in Figure 6; it is similar to the topology of Figure 5, except that the augend and addend single-rail inputs are now encoded using the 1-of-4 code.
The general expressions governing a full adder block that uses hybrid encoding for its inputs and dual-rail encoding for its outputs are given below. In these equations, (i0, i1, i2, i3) represents the 1-of-4 encoded equivalent of the single-rail adder inputs (a, b), using the single-rail to 1-of-4 mapping shown in Table 1.
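Because the hybrid-input equations themselves are not reproduced in this text, the following Python sketch derives one possible set under an assumed Table 1 assignment in which 1-of-4 wire i_k is asserted when 2a + b = k. Wires i1 and i2 (a ≠ b) select the propagate case, and i0 and i3 select the kill and generate cases. These expressions are our own illustration, checked exhaustively against binary addition, and may differ in form from the paper's; all names are hypothetical.

```python
from typing import Tuple

Rail = Tuple[int, int]   # dual-rail (x0, x1)


def one_of_four(a: int, b: int) -> Tuple[int, int, int, int]:
    """Assumed assignment: wire i_k is high when 2a + b == k."""
    wires = [0, 0, 0, 0]
    wires[2 * a + b] = 1
    return tuple(wires)


def hybrid_full_adder(i, cin: Rail) -> Tuple[Rail, Rail]:
    """1-of-4 operands plus dual-rail carry in -> dual-rail (sum, cout)."""
    i0, i1, i2, i3 = i
    c0, c1 = cin
    prop = i1 | i2          # a != b: carry out equals carry in
    same = i0 | i3          # a == b: carry out fixed by the operands
    sum0 = (same & c0) | (prop & c1)
    sum1 = (same & c1) | (prop & c0)
    cout0 = i0 | (prop & c0)
    cout1 = i3 | (prop & c1)
    return (sum0, sum1), (cout0, cout1)


def rails(bit: int) -> Rail:
    return (1 - bit, bit)
```

With a generate operand pair (a = b = 1) and the carry input still a spacer, `hybrid_full_adder(one_of_four(1, 1), (0, 0))` already yields a valid carry of 1 while the sum remains a spacer.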
The full adder block that synthesizes (2), including the carry output logic optimization, is shown in Figure 7. Henceforth, this adder module is referred to as the SSSC_HIE_NRL adder (single-sum, single-carry hybrid input encoded non-redundant logic adder). As the name implies, all the gates in this adder are irredundant. As Figure 7 shows, the sum outputs are responsible for indicating the inputs, while the carry outputs can evaluate to the correct state whenever the carry-kill or carry-generate condition occurs, without waiting for the carry input. The SSSC_HIE_NRL adder therefore conforms to the weak-indication timing model.
The synthesized hybrid input encoded full adder block that incorporates logic redundancy is shown in Figure 8.
In Figure 8, a pair of 2-input AND gates duplicates a pair of 2-input C-elements: for upgoing transitions, each AND gate realizes the same function as its corresponding C-element. Redundancy is therefore explicit in this design, henceforth referred to as the SSSC_HIE_RL adder (single-sum, single-carry hybrid input encoded redundant logic adder). In this adder, logic redundancy is beneficial in two ways. First, during the spacer phase all the sum outputs can be reset in parallel, because the dual-rail carry output of the ith stage of an n-bit adder can be reset on the basis of its own 1-of-4 encoded augend and addend inputs, and the dual-rail sum output of the (i+1)th stage depends only on the dual-rail carry input from the ith stage. Second, computation speed improves during the valid data phase. This is evident from a comparison of Figures 7 and 8: the carry propagation delay of the SSSC_HIE_RL adder (AND2 and OR2 gate delays) is lower than that of the SSSC_HIE_NRL adder (CE2 and OR2 gate delays).
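The reset advantage of the redundant AND gates can be seen by driving a C-element and an AND gate with the same input sequence. In this behavioral Python sketch (the input sequence and names are hypothetical), both outputs rise at the same step, consistent with their equivalence for upgoing transitions, but the AND gate returns to 0 as soon as one input falls, whereas the C-element holds its output until both inputs have fallen.

```python
from typing import List, Tuple


class CElement:
    """Two-input Muller C-element: follows agreeing inputs, otherwise holds."""

    def __init__(self) -> None:
        self.out = 0

    def update(self, x: int, y: int) -> int:
        if x == y:
            self.out = x
        return self.out


def trace(sequence: List[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Outputs of a C-element and an AND2 gate for each input pair."""
    c = CElement()
    c_out = [c.update(x, y) for x, y in sequence]
    and_out = [x & y for x, y in sequence]
    return c_out, and_out


# raise one input at a time, then lower one input at a time
seq = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
c_out, and_out = trace(seq)
# c_out   == [0, 0, 1, 1, 0]
# and_out == [0, 0, 1, 0, 0]
```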
2.2.2. Dual-Bit Adder Utilizing Homogeneous Data Encoding
We now analyze the effect of introducing redundant logic into a self-timed dual-bit adder module that employs homogeneous data encoding for both its primary inputs and outputs. Homogeneous encoding means adopting the same DI data encoding protocol, here dual-rail encoding, for all the primary inputs and outputs of a function block. The dual-bit adder block has dual-rail encoded versions of five single-rail inputs, namely a1, a0, b1, b0, and cin, and three single-rail outputs, Cout, Sum1, and Sum0, where (a1, a0) and (b1, b0) represent the augend and addend inputs and cin is the carry input; here the suffixes 1 and 0 denote bit positions rather than rails. The output Cout is the carry output, or overflow bit, of the addition, and Sum1 and Sum0 are the most significant and least significant sum output bits, respectively.
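At the single-rail level, the dual-bit block simply adds two 2-bit operands and a carry. The Python sketch below is a behavioral reference (not the gate-level DSSC_DRE design; names are hypothetical) that chains n/2 such blocks into an n-bit carry-ripple adder, confirming that the cascade reproduces ordinary integer addition.

```python
from typing import Tuple


def dual_bit_adder(a: int, b: int, cin: int) -> Tuple[int, int, int]:
    """a = (a1 a0), b = (b1 b0) as 2-bit integers; returns (Cout, Sum1, Sum0)."""
    total = a + b + cin
    return total >> 2, (total >> 1) & 1, total & 1


def ripple_add(x: int, y: int, n_bits: int = 32) -> Tuple[int, int]:
    """n-bit carry-ripple addition from n_bits // 2 dual-bit stages.

    Returns (sum modulo 2**n_bits, final carry out)."""
    if n_bits % 2:
        raise ValueError("n_bits must be even")
    carry, result = 0, 0
    for stage in range(n_bits // 2):
        a = (x >> (2 * stage)) & 0b11
        b = (y >> (2 * stage)) & 0b11
        carry, s1, s0 = dual_bit_adder(a, b, carry)
        result |= ((s1 << 1) | s0) << (2 * stage)
    return result, carry
```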
The reduced orthogonal sum-of-products forms of the encoded outputs of the dual-bit adder, expressed in terms of its encoded inputs, are given below. In an orthogonal sum-of-products form, the logical conjunction of any pair of product terms is null (logically zero).
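Orthogonality can be checked term by term. In the Python sketch below, a product term is a mapping from a variable to the rail it uses (for example, {"a": 1, "b": 0} stands for a1·b0). Two terms are disjoint exactly when they use opposite rails of some shared variable, because the two rails of a valid dual-rail input are never high together. Since the dual-bit equations are not reproduced here, the example uses the carry output of a dual-rail full adder; all names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List

Term = Dict[str, int]    # variable -> rail used in the product (0 or 1)


def disjoint(p: Term, q: Term) -> bool:
    """True if p and q require opposite rails of at least one variable."""
    return any(v in q and q[v] != rail for v, rail in p.items())


def is_orthogonal_sop(terms: List[Term]) -> bool:
    """Every pair of product terms has a null conjunction."""
    return all(disjoint(p, q) for p, q in combinations(terms, 2))


# Cout1 = a1·b1 + a1·b0·c1 + a0·b1·c1 (orthogonal form)
cout1_orthogonal = [{"a": 1, "b": 1}, {"a": 1, "b": 0, "c": 1},
                    {"a": 0, "b": 1, "c": 1}]
# Cout1 = a1·b1 + a1·c1 + b1·c1 (minimal, but not orthogonal)
cout1_majority = [{"a": 1, "b": 1}, {"a": 1, "c": 1}, {"b": 1, "c": 1}]
```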
The architecture of the n-bit self-timed carry-ripple adder built from dual-bit adder modules is shown in Figure 9. The synthesized dual-bit adder module is portrayed in Figure 10 and is referred to as the DSSC_DRE adder module (dual-sum, single-carry dual-rail encoded adder) in the following discussion. Figure 10 also shows the redundant AND gates (shaded) inserted into the DSSC_DRE adder block. The non-redundant adder block would not contain the AND gates rg1 and rg2, so one input of each OR2 gate producing Cout1 and Cout0 would instead be the output of a C-element, namely the nets labeled gn2 and gn3, respectively. In fact, gn2 and gn3 would be isochronic forks in the non-redundant version. Isochronic forks are the weakest compromise to delay insensitivity [31] and lead to quasi-delay-insensitive (QDI) circuit implementations. Under the isochronicity assumption, if a transition on one branch of a wire fork is acknowledged, the transitions on the other branches of the fork are also considered acknowledged. It was shown in [32] that QDI circuits that include isochronic fork assumptions can be realized even at nanometer-scale dimensions. Indeed, QDI circuits are the practically implementable form of DI circuits, and they constitute the robust class of self-timed circuits. In the redundant dual-bit adder shown in Figure 10, the OR2 gates producing Cout1 and Cout0 instead receive gn1 and gn4 as inputs, respectively. For low-to-high transitions, the AND gates rg1 and rg2 are functionally equivalent to the C-elements they duplicate.
The gate output node labeled “isf” denotes an isochronic fork junction. Referring to Figure 10, an upgoing transition on the fork isf (isf↑) is followed by either gn2↑ or gn3↑ in the non-redundant DSSC_DRE adder block, and by (gn1↑, gn2↑) or (gn3↑, gn4↑) in the DSSC_DRE adder module that incorporates logic redundancy; this explains the possible multiple acknowledgements. Introducing logic redundancy also makes fast, or eager, reset possible during the return-to-zero phase. During the spacer phase, all the sum outputs can be reset in parallel, since the carry output of each dual-bit adder stage can be reset by its own augend and addend inputs without waiting for the input carry from the preceding stage. The latency reduction gained by introducing redundant logic results from the lower data path delay: the critical path in every dual-bit adder stage contains only input-incomplete gates, instead of the mix of input-complete and input-incomplete gates found in the original non-redundant version.
2.2.3. Dual-Bit Adder Incorporating Heterogeneous Data Encoding
Heterogeneous encoding means that a combination of at least two different DI codes (e.g., dual-rail and 1-of-4 codes) is used to encode the primary inputs and outputs of a self-timed logic circuit. In a dual-bit adder block based on heterogeneous DI data encoding, the augend and addend inputs and the sum outputs can be represented by a 1-of-4 code, while the input and output carry signals are represented using the dual-rail code. With this encoding scheme, the minimized expressions for the function block outputs are given below. Note that the 1-of-4 code assignments for the augend and addend inputs and the sum outputs are the reverse of those given in Table 1.
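The heterogeneous block can be sketched behaviorally. Because the exact 1-of-4 assignment used here (the reverse of Table 1) is not reproduced in this text, the Python sketch below simply assumes that wire k is asserted when the 2-bit value equals k; all names are hypothetical. It also shows which operand pairs allow the dual-rail carry output to be decided before the carry input arrives: only when a + b = 3 must the block wait for cin.

```python
from typing import Optional, Tuple

Rail = Tuple[int, int]


def to_1of4(value: int) -> Tuple[int, int, int, int]:
    """Assumed assignment: wire k is high when the 2-bit value equals k."""
    wires = [0, 0, 0, 0]
    wires[value] = 1
    return tuple(wires)


def from_1of4(wires: Tuple[int, int, int, int]) -> int:
    return wires.index(1)            # exactly one wire is high


def dual_bit_adder_he(a_q, b_q, cin: Rail):
    """1-of-4 operands, dual-rail cin -> (1-of-4 sum, dual-rail cout)."""
    c0, c1 = cin
    if c0 == c1:
        raise ValueError("cin must be a valid dual-rail codeword")
    total = from_1of4(a_q) + from_1of4(b_q) + c1
    carry = total >> 2
    return to_1of4(total & 0b11), (1 - carry, carry)


def early_carry(a_q, b_q) -> Optional[int]:
    """Carry out decidable from the operands alone (None: must wait for cin)."""
    s = from_1of4(a_q) + from_1of4(b_q)
    if s >= 4:
        return 1                     # generate
    if s <= 2:
        return 0                     # kill
    return None                      # a + b == 3: propagate cin
```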
The dual-bit adder module that synthesizes (4)–(9) is shown in Figure 11. Henceforth, this adder is referred to as the DSSC_HE adder (dual-sum, single-carry heterogeneously encoded adder). The DSSC_HE adder block satisfies the weak-indication timing constraints: the 1-of-4 encoded sum outputs are responsible for indicating the arrival of all the adder inputs, while the dual-rail encoded carry output is relieved of ensuring input completeness. The logic redundancy introduced into the DSSC_HE adder module appears in the figure as the input-incomplete AND gates (shaded) marked rg1 and rg2. The notation follows that of Figure 10, so the discussion of the previous section applies to this case as well. As before, the sum outputs of the (i+1)th dual-bit adder stage can be reset on the basis of the carry input from the ith dual-bit adder stage, so there is no need to wait for the entire carry chain to reset during the return-to-zero phase.
The n-bit self-timed carry-ripple adder architecture built from heterogeneously encoded dual-bit adder modules is shown in Figure 12, and the self-timed system configuration that supports this RCA topology is depicted in Figure 13. A subset of the dual-rail inputs (augends and addends) is 1-of-4 encoded before being fed to the function block, while the remaining inputs (the dual-rail encoded input carry) are fed in directly. The non-dual-rail outputs produced by the logic block (sum outputs) are decoded before being passed on to the next stage, while the dual-rail outputs (output carry) are driven to the next stage directly. The encoding and decoding costs amount to 28 and 12 transistors per bit, respectively.
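The encoding and decoding stages can be sketched as follows, assuming a hypothetical assignment in which 1-of-4 wire w_k is asserted for the dual-rail bit pair with value 2p + q (p being the more significant bit). Each encoder output is the conjunction of one rail from each bit; in a self-timed implementation such gates are commonly C-elements, and for a complete valid word or a complete spacer the C-element output equals the AND used in this combinational model. Each decoder rail is the OR of the two 1-of-4 wires that imply it. The sketch does not attempt to reproduce the transistor counts quoted above.

```python
from typing import Tuple

Rail = Tuple[int, int]                    # dual-rail (x0, x1)
Quad = Tuple[int, int, int, int]          # 1-of-4 (w0, w1, w2, w3)


def encode_pair(p: Rail, q: Rail) -> Quad:
    """Two dual-rail bits (p = MSB, q = LSB) -> one 1-of-4 codeword."""
    (p0, p1), (q0, q1) = p, q
    return (p0 & q0, p0 & q1, p1 & q0, p1 & q1)


def decode_pair(w: Quad) -> Tuple[Rail, Rail]:
    """One 1-of-4 codeword -> two dual-rail bits, using OR gates."""
    w0, w1, w2, w3 = w
    p = (w0 | w1, w2 | w3)                # MSB is 0 for w0, w1; 1 for w2, w3
    q = (w0 | w2, w1 | w3)                # LSB is 0 for w0, w2; 1 for w1, w3
    return p, q
```

Both functions map the spacer to the spacer, so the round trip preserves the return-to-zero phase as well as valid data.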
3. Simulation Mechanism and Results
To demonstrate the usefulness of the proposed concept of logic redundancy insertion, simulations were performed on a 32-bit self-timed RCA architecture, considering a subset of well-known self-timed design methods [25, 33, 34]. Various 32-bit self-timed RCAs were built from different adder building blocks: either 32 single-bit adder blocks or 16 dual-bit adder modules. Before the simulation results are discussed, the procedure for estimating the design metrics is explained. The delay parameter is the maximum propagation delay (critical path delay) encountered in the data path, which is the sum of the latency of the input register and that of the combinational adder logic. The delay metric was estimated using PrimeTime. To avoid introducing a clock source, a virtual clock was used as a remote reference to constrain the input and output ports of the design. The area and power metrics cover the input registers, the completion detection logic, and the 32-bit combinational adder. The delay and power metrics account for estimated parasitics in addition to the parameters of the actual components (gates). The area metric is the combined area of all the logic cells. The total (average) power dissipation is the sum of the dynamic and static power components, where dynamic power in turn consists of switching and internal power. NCSim was used for functional simulation and to obtain the switching activity files from gate-level simulations of the Verilog descriptions of the various 32-bit self-timed adders. Input data were supplied to the adders at intervals of 15 ns by a random test bench that models the environment. The switching activity files obtained were then used for power estimation in PrimeTime PX.
The simulations targeted a PVT corner of a 130 nm bulk CMOS standard cell library with a recommended supply voltage of 1.32 V and a junction temperature of −40°C. All the circuit inputs have the drive strength of the library's minimum-sized inverter, while each output drives a fanout-of-4 load. Appropriate buffering of the input acknowledgement signal was provided where necessary to eliminate timing violations. Since identical registers and similar completion detection circuits were used for all the 32-bit adders, differences in area and power can be attributed to the function blocks, which enables a straightforward comparison of adders synthesized using different self-timed design methods. Strong- and weak-indication adders corresponding to the various self-timed design methods were constructed manually and then optimized for minimum latency, taking into account the physical constraints of the target cell library. (In this 130 nm CMOS standard cell library, the maximum fan-in of the AND gate and the OR gate is 4 and 3, respectively. C-elements are available with 2 to 4 inputs, and the gate-level C-element models are given in [35].) The delay, area, and power metrics obtained from the simulations of the various non-redundant 32-bit self-timed RCAs are given in Table 2.
 
The dual-sum, single-carry (DSSC) adder realization based on the DIMS method required careful speed-independent decomposition of its high-fan-in C-gates.
The indication type of each adder is given in brackets in the first column of Table 2, and the values in brackets in the third column give the area of the respective individual single-bit/dual-bit self-timed adder block. The delay, area, and power parameters of the various redundant-logic 32-bit self-timed RCAs are given in Table 3. Logic redundancy was not introduced into the dual-bit adder module synthesized using Toms and Edwards' approach [33], since doing so would change the indication property of the original synthesis solution; only redundant versions of the other adders are therefore compared in Table 3. Comparing the results in Tables 2 and 3 shows that logic redundancy insertion yields a mean delay reduction of 21.1%, with associated area and power penalties of 2.3% and 0.8%, respectively. On average, the size of an individual self-timed single-bit/dual-bit adder module increases by 2.8% after incorporating redundant logic.
For the DSSC_HE adder module shown in Figure 11, a further peephole optimization was carried out by merging gate rg1 with the OR gate producing Cout1, and gate rg2 with the OR gate producing Cout0, replacing each pair with a complex gate (an AO12 cell). Simulations were repeated for this case, and the delay, area, and power of the 32-bit RCA composed of optimized redundant DSSC_HE adder blocks were found to be 4 ns, 10953 μm², and 696.8 μW, respectively. The optimized redundant DSSC_HE adder block occupies 1.5% less area than the non-optimized redundant DSSC_HE adder block. Compared with the 32-bit self-timed RCA built from non-redundant DSSC_HE adder modules, the 32-bit self-timed RCA built from a cascade of optimized redundant DSSC_HE adder blocks reduces delay by 31%, while incurring area and average power penalties of only 0.6% and 1.2%, respectively.
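The peephole merge relies on the fact that an AND gate followed by an OR gate computes the same Boolean function as a single AND-OR complex cell. The exhaustive check below is a sketch under stated assumptions: it treats rg1 as a hypothetical two-input AND gate feeding a two-input OR gate, and it takes AO12 to compute Y = A + B·C. The actual gates are those of Figure 11, and AO-cell naming and pin order vary between cell libraries.

```python
from itertools import product


def rg1_then_or(x: int, y: int, other: int) -> int:
    """Original two-cell path: hypothetical 2-input AND (rg1) feeding a 2-input OR."""
    rg1 = x & y
    return rg1 | other


def ao12(a: int, b: int, c: int) -> int:
    """Assumed AO12 function, Y = A + B*C (pin naming varies between libraries)."""
    return a | (b & c)


def merge_preserves_function() -> bool:
    """Exhaustively compare the two-cell path with the single merged cell."""
    return all(
        rg1_then_or(x, y, other) == ao12(other, x, y)
        for x, y, other in product((0, 1), repeat=3)
    )


if __name__ == "__main__":
    print("equivalent:", merge_preserves_function())
```

The check covers Boolean equivalence only; the timing and indication behavior of the merged cell are not modeled here.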
4. Conclusions
This paper described a new concept, redundant logic insertion, that can be used to minimize the data path delay of self-timed arithmetic circuits. It was shown that logic redundancy can be introduced in many self-timed design methods, especially when synthesizing iterative logic specifications. The advantages of logic redundancy insertion were demonstrated through a case study of 32-bit self-timed carry-ripple addition. The simulation results show that a significant reduction in latency can be achieved at the expense of only marginal increases in area and power. It was also discussed how logic redundancy enables constant-latency operation when spacer data is applied, by permitting fast reset, while actual-case latency is incurred for the addition of valid data.
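The distinction between constant-time spacer propagation and actual-case latency for valid data can be illustrated with a simple unit-delay model. The sketch below is an assumption-laden illustration rather than a model of the paper's gates: each stage is assumed to produce its carry one unit after its operands arrive when a_i = b_i (carry kill or generate), and otherwise one unit after its own carry-in becomes valid, while the spacer is assumed to reset every stage in one unit regardless of its neighbors. The names `valid_data_latency`, `mean_random_latency`, and `SPACER_LATENCY` are hypothetical.

```python
import random

WIDTH = 32
SPACER_LATENCY = 1  # assumed: every stage resets to spacer in one unit, independent of neighbors


def valid_data_latency(a: int, b: int, width: int = WIDTH) -> int:
    """Unit-delay carry-chain latency of an early-output ripple-carry adder for valid data.

    A stage with a_i == b_i (kill or generate) drives its carry-out one unit after
    its operands arrive; a propagate stage (a_i != b_i) must wait for its carry-in.
    The external carry-in to stage 0 is assumed valid at t = 0.
    """
    carry_ready = 0  # time at which the carry into the current stage is valid
    latest = 0
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        # after this line, carry_ready is the time the carry-out of stage i is valid
        carry_ready = 1 if a_i == b_i else carry_ready + 1
        latest = max(latest, carry_ready)
    return latest


def mean_random_latency(trials: int = 10_000, width: int = WIDTH, seed: int = 1) -> float:
    """Average valid-data latency over uniformly random operand pairs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += valid_data_latency(rng.getrandbits(width), rng.getrandbits(width), width)
    return total / trials


if __name__ == "__main__":
    print("spacer:", SPACER_LATENCY)
    print("best case:", valid_data_latency(0, 0))
    print("worst case:", valid_data_latency(2**WIDTH - 1, 0))
    print("random average:", mean_random_latency())
```

Under this model, the worst case (a carry propagating through all 32 stages) costs 32 units, but for random operands the longest propagate chain, and hence the typical latency, is much shorter, growing roughly logarithmically with the word length.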
Acknowledgments
This research was supported in part by the Engineering and Physical Sciences Research Council, UK, under Grant EP/D052238/1. The first author was additionally supported by a bursary from the School of Computer Science of the University of Manchester.
References
[1] “Semiconductor Industry Association’s International Technology Roadmap for Semiconductors,” Design Report, 2009, http://www.itrs.net/.
[2] A. J. Martin, S. M. Burns, T. K. Lee, D. Borkovic, and P. J. Hazewindus, “The first asynchronous microprocessor: the test results,” ACM SIGARCH Computer Architecture News, vol. 17, no. 4, pp. 95–98, 1989.
[3] K. J. Kulikowski, V. Venkataraman, Z. Wang, A. Taubin, and M. Karpovsky, “Asynchronous balanced gates tolerant to interconnect variability,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '08), pp. 3190–3193, May 2008.
[4] I. J. Chang, S. P. Park, and K. Roy, “Exploring asynchronous design techniques for process-tolerant and energy-efficient subthreshold operation,” IEEE Journal of Solid-State Circuits, vol. 45, no. 2, pp. 401–410, 2010.
[5] M. Zamani and M. B. Tahoori, “A transient error tolerant self-timed asynchronous architecture,” in Proceedings of the 15th IEEE European Test Symposium (ETS '10), pp. 88–93, May 2010.
[6] J. Hamon and L. Fesquet, “Robust and programmable self-timed ring oscillators,” in Proceedings of the IEEE 9th International New Circuits and Systems Conference (NEWCAS '11), pp. 249–252, 2011.
[7] T. Chelcea, G. Venkataramani, and S. C. Goldstein, “Area optimizations for dual-rail circuits using relative-timing analysis,” in Proceedings of the 13th IEEE International Symposium on Asynchronous Circuits and Systems, pp. 1–12, 2007.
[8] G. F. Bouesse, G. Sicard, A. Baixas, and M. Renaudin, “Quasi delay insensitive asynchronous circuits for low EMI,” in Proceedings of the 4th International Workshop on Electromagnetic Compatibility of Integrated Circuits, pp. 27–31, 2004.
[9] W. A. Lien, P. Day, C. Farnsworth et al., “Noise in self-timed and synchronous implementations of a DSP,” in Proceedings of the IEEE Radio and Wireless Conference, pp. 75–78, 1998.
[10] L. S. Nielsen, C. Niessen, J. Sparsø, and K. van Berkel, “Low-power operation using self-timed circuits and adaptive scaling of the supply voltage,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 2, no. 4, pp. 391–397, 1994.
[11] O. C. Akgun, J. Rodrigues, and J. Sparsø, “Minimum-energy sub-threshold self-timed circuits: design methodology and a case study,” in Proceedings of the 16th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC '10), pp. 41–51, May 2010.
[12] C. H. (Kees) van Berkel, M. B. Josephs, and S. M. Nowick, “Scanning the technology: applications of asynchronous circuits,” Proceedings of the IEEE, vol. 87, no. 2, pp. 223–233, 1999.
[13] I. David, R. Ginosar, and M. Yoeli, “Self-timed is self-checking,” Journal of Electronic Testing, vol. 6, no. 2, pp. 219–228, 1995.
[14] S. J. Piestrak and T. Nanya, “Towards totally self-checking delay-insensitive systems,” in Proceedings of the 25th International Symposium on Fault-Tolerant Computing, pp. 228–237, June 1995.
[15] T. Verhoeff, “Delay-insensitive codes—an overview,” Distributed Computing, vol. 3, no. 1, pp. 1–8, 1988.
[16] B. Bose, “On unordered codes,” IEEE Transactions on Computers, vol. 40, no. 2, pp. 125–131, 1991.
[17] D. W. Lloyd and J. D. Garside, “A practical comparison of asynchronous design styles,” in Proceedings of the 7th International Symposium on Asynchronous Circuits and Systems (ASYNC '01), pp. 36–45, March 2001.
[18] W. J. Bainbridge, W. B. Toms, D. A. Edwards, and S. B. Furber, “Delay-insensitive, point-to-point interconnect using M-of-N codes,” in Proceedings of the 9th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC '03), pp. 132–140, May 2003.
[19] V. Akella, N. H. Vaidya, and G. R. Redinbo, “Limitations of VLSI implementation of delay-insensitive codes,” in Proceedings of the 26th International Symposium on Fault-Tolerant Computing, pp. 208–217, June 1996.
[20] C. L. Seitz, “System timing,” in Introduction to VLSI Systems, C. Mead and L. Conway, Eds., pp. 218–262, Addison-Wesley, Reading, Mass, USA, 1980.
[21] J. Sparsø and S. B. Furber, Eds., Principles of Asynchronous Circuit Design: A Systems Perspective, Kluwer Academic Publishers, 2001.
[22] P. Balasubramanian and D. A. Edwards, “Self-timed realization of combinational logic,” in Proceedings of the 19th International Workshop on Logic and Synthesis, pp. 55–62, 2010.
[23] P. Balasubramanian and D. A. Edwards, “A new design technique for weakly indicating function blocks,” in Proceedings of the 11th IEEE Workshop on Design and Diagnostics of Electronic Circuits and Systems (DDECS '08), pp. 116–121, April 2008.
[24] C. Jeong and S. M. Nowick, “Block-level relaxation for timing-robust asynchronous circuits based on eager evaluation,” in Proceedings of the 14th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC '08), pp. 95–104, April 2008.
[25] J. Sparsø and J. Staunstrup, “Delay-insensitive multi-ring structures,” Integration, the VLSI Journal, vol. 15, no. 3, pp. 313–340, 1993.
[26] “Aperiodic circuits,” in Self-Timed Control of Concurrent Processes: The Design of Aperiodic Logical Circuits in Computers and Discrete Systems, V. I. Varshavsky, Ed., chapter 4, pp. 77–85, Kluwer Academic Publishers, 1990.
[27] C. D. Nielsen, “Evaluation of function block designs,” Tech. Rep. ID-TR: 1994-135, Department of Computer Science, Technical University of Denmark, 1994.
[28] K. M. Fant and G. E. Sobelman, “Null convention threshold gate,” US Patent 5664211, 1997.
[29] P. M. Lewis II and C. L. Coates, Threshold Logic, Wiley, New York, NY, USA, 1967.
[30] A. J. Martin, “Asynchronous datapaths and the design of an asynchronous adder,” Formal Methods in System Design, vol. 1, no. 1, pp. 117–137, 1992.
[31] A. J. Martin, “The limitation to delay-insensitivity in asynchronous circuits,” in Proceedings of the 6th MIT Conference on Advanced Research in VLSI, pp. 263–278, 1990.
[32] A. J. Martin and P. Prakash, “Asynchronous nanoelectronics: preliminary investigation,” in Proceedings of the 14th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC '08), pp. 58–68, April 2008.
[33] W. B. Toms and D. A. Edwards, “Efficient synthesis of speed independent combinational logic circuits,” in Proceedings of the 10th Asia and South Pacific Design Automation Conference, pp. 1022–1026, 2005.
[34] B. Folco, V. Bregier, L. Fesquet, and M. Renaudin, “Technology mapping for area optimized quasi delay insensitive circuits,” in Proceedings of the IFIP International Conference on VLSI-SoC, pp. 146–151, 2005.
[35] P. Balasubramanian, Self-timed logic and the design of self-timed adders, Ph.D. thesis, The University of Manchester, 2010.
Copyright
Copyright © 2012 P. Balasubramanian et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.