# Mono3D: Open Source Cell Library for Monolithic 3-D Integrated Circuits

Chen Yan, Student Member, IEEE, and Emre Salman, Senior Member, IEEE

Abstract-Monolithic 3-D (M3-D) integrated circuits (ICs) provide vertical interconnects with comparable size to on-chip metal vias, and therefore, achieve ultra-high density device integration. This fine-grained connectivity enabled by monolithic inter-tier vias reduces the silicon area, overall wirelength, and power consumption. An open source standard cell library for design automation of large-scale transistor-level M3-D ICs is developed, thereby facilitating future research on the critical aspects of M3-D technology. The cell library is based on full-custom design of each standard cell and is fully characterized by using existing design automation tools. The proposed open source cell library is utilized to demonstrate the M3-D implementation of several benchmark circuits of various sizes ranging from 2.7-K gates to 1.6-M gates. Both power and timing characteristics of the M3-D ICs are quantified. Several versions of the cell library are developed with different numbers of routing tracks to better understand the issue of routing congestion in the M3-D ICs. The effect of the number of routing tracks on area, power, and delay characteristics is investigated. Finally, the primary clock tree characteristics of the M3-D ICs are discussed.

*Index Terms*—Monolithic 3D integration, 3D cell library, 3D routing track distribution, 3D routing congestion, 3D signal integrity, 3D clock tree synthesis.

# I. INTRODUCTION

THREE-dimensional (3D) integrated circuits (ICs) have emerged as an effective solution to some of the critical issues encountered in planar technologies such as longer global interconnects and difficulty in scaling the transistors [1], [2]. Through silicon via (TSV) based 3D ICs have attracted significant attention during the past decade with emphasis on both fabrication and design methodologies [3]–[5]. In TSV based 3D integration, multiple dies are thinned, aligned, and vertically bonded, thereby enabling shorter global interconnects (and therefore reduced power consumption) and heterogeneous integration [3], [6], [7].

A typical TSV diameter, however, is in the range of several micrometers, which is multiple orders of magnitude greater than nanoscale transistor dimensions. Thus, bulky TSVs not only restrict the integration density, but also limit the power and performance advantages of vertical integration due to significant TSV capacitance [8]–[10].

The authors are with the Department of Electrical and Computer Engineering, Stony Brook University, Stony Brook, NY 11794 USA (e-mail: chen.yan@stonybrook.edu; emre.salman@stonybrook.edu).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSI.2017.2768330

Fig. 1. Three different design styles for monolithic 3D (M3D) technology: (a) transistor-level M3D, (b) gate-level M3D, and (c) block-level M3D.

More recently, interest in monolithic 3D integration has grown due to encouraging developments in sequentially fabricating multiple transistor layers (particularly the thermal characteristics) [11]–[13]. In monolithic vertical integration, stacked transistors are sequentially fabricated after the bottom layers have been manufactured. Communication among the tiers is achieved by monolithic inter-tier vias (MIVs). A critical challenge in the fabrication of monolithic 3D ICs is to minimize the detrimental effect of the manufacturing process of the top tier on the bottom tier [14]. Thus, significant research on the fabrication side has focused on developing low thermal budget processes, typically less than 500-600 °C for the upper tiers [15], [16].

MIVs have comparable size to conventional on-chip metal vias since multiple tiers can be aligned with lithographic alignment precision [17]. Thus, MIV based 3D integration enables significantly higher interconnect density as compared to TSV based vertical integration.

Transistor-, gate-, and block-level 3D monolithic integration have been proposed, as depicted in Fig. 1 [18], [19]. In transistor-level monolithic 3D integration, which is the focus of this paper, nMOS and pMOS transistors within a circuit are separated into two different tiers. For example, the pull-down network of each gate is placed within one tier whereas the pull-up network is placed in another tier. This approach not only achieves fine-grained 3D integration with intra-cell MIVs, but also enables the individual optimization of the bottom and top tier devices [20]. Existing design automation methodologies (with modifications) can be used for this approach.

In gate-level monolithic 3D integration, multiple cells within a functional block are partitioned into multiple tiers. MIVs are utilized for inter-cell communication. Finally, block-level monolithic 3D integration represents a more coarse-grain integration where the partitioning of the IC is achieved based on individual functional blocks.

1549-8328 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications\_standards/publications/rights/index.html for more information.

Manuscript received June 11, 2017; revised September 14, 2017; accepted October 21, 2017. Date of publication November 16, 2017; date of current version February 15, 2018. This work was supported in part by the National Science Foundation under Grant CCF-1253715, Grant CNS-1646318, and Grant CNS-1717306, in part by the Semiconductor Research Corporation under Contract TS-2767 and Contract TJ-2449, in part by the National Institute of Health under Grant 1R21AR068572-01A1, and in part by the Simons Foundation under Grant 241210. This paper was recommended by Associate Editor F. Pareschi. (*Corresponding author: Emre Salman.*)

An open source cell library based on full-custom design of each cell is developed in this paper for transistor-level monolithic 3D integration [21]. The power and timing characteristics of each cell are fully characterized with both SPICE-level simulations and a commercial library characterization tool to ensure accuracy. The proposed cell library is used to evaluate the power and timing characteristics of multiple benchmark circuits of various sizes ranging from 2.7K gates to 1.6M gates. The effect of the number of routing tracks on area, power, and delay characteristics is investigated by developing three versions of the cell library with different heights. This analysis is important since routing congestion is one of the primary physical design issues in monolithic 3D ICs. Clock tree characteristics of a large FFT core are also investigated.

The rest of the paper is organized as follows. Related previous work and contributions of this paper are summarized in Section II. The details of the proposed open source cell library, characterization, and comparison with 2D cells are provided in Section III. Power/timing and several important physical design characteristics of 3D monolithic ICs are investigated in Section IV. Finally, the paper is concluded in Section V.

# II. SUMMARY OF PREVIOUS WORKS

Liu and Lim have investigated the design tradeoffs in monolithic 3D ICs considering both transistor-level and gate-level monolithic integration [19]. Useful physical design guidelines and insight into the routability issue have been provided. The effects of inter-tier process variation have also been investigated. In that work, however, the authors have assumed that the monolithic 3D gates and traditional 2D gates have the same power and timing characteristics. This assumption is not accurate due to different parasitic impedances within a 3D monolithic cell and the existence of MIVs.

Lee *et al.* have addressed this limitation by individually characterizing the timing and power consumption of transistor-level monolithic 3D cells [22]. The power characteristics of several 3D monolithic benchmark circuits have been investigated and compared with 2D versions at similar timing performance. The authors have also discussed the issue of routing congestion in monolithic 3D ICs. The authors, however, have adopted the cell-folding method and used the same pull-up and pull-down networks as in 2D cells. MIVs have been inserted in between these two networks. As a result, the proposed 3D cells are not optimized for footprint. In addition, in [22], the timing constraints are relatively relaxed, which may prevent investigation of the behavior of the monolithic 3D technology under tighter clock frequency constraints.

Shi *et al.* have recently demonstrated the power benefits of transistor-level monolithic 3D ICs through custom design of a cell library in 14 nm FinFET technology, also utilizing the cell-folding method [23]. A dedicated track is assumed for the MIVs. A detailed cell-level *RC* extraction methodology is described. The authors have also shown how routability is affected by two different cell heights. Timing, power, and area characteristics for different cell heights and related tradeoffs, however, have not been investigated. The effect of routing congestion on timing constraints and power consumption was not discussed.

Recently, the issue of routing congestion due to reduced cell height in monolithic 3D cells and its impact on power and delay characteristics have been discussed in [24]. A custom 3D cell library was developed and integrated into a standard design flow. The use of the cell library in certain applications was also demonstrated [25], [26]. However, this work assumes that there are five metal layers within the bottom tier, which is not practical considering the existing monolithic 3D fabrication capabilities. Some of the important cells such as clock buffers and a latch are not included. More importantly, none of the existing works have investigated the optimum number of routing tracks for transistor-level monolithic 3D ICs. As demonstrated in this paper, the number of routing tracks plays a critical role in system-level power, performance, and area characteristics.

The primary contributions of this paper are as follows:

- The monolithic 3D cells are developed with a full-custom methodology and a cell-stacking technique, while optimizing the footprint. The cell library contains not only the basic gates, but also clock buffers, a latch, and some cells with higher drive strength to produce a fully placed and routed circuit including clock and power networks. The automated cell characterization results are verified with SPICE-level simulations.
- Additional versions of the 3D monolithic cell library with different heights are developed to analyze the effects of number of tracks on routing congestion, area, power, and delay characteristics.
- Detailed data (such as number of overflows and DRC violations) are provided to further investigate the important issue of routing congestion in monolithic 3D ICs.
- Both the performance and power characteristics of large scale 3D monolithic ICs are investigated at both relaxed and tight timing constraints.
- Detailed data on clock tree characteristics are provided for 3D monolithic ICs.
- Finally, the proposed cell library and all of the related automation files are made publicly available [21] to facilitate future research on various important aspects of 3D monolithic integration such as thermal integrity, design-for-test, and interaction between the manufacturing/device development and the design process. To the best of the authors' knowledge, this study represents the first open source cell library with full integration into design flow for monolithic 3D ICs.

# III. OPEN SOURCE CELL LIBRARY FOR MONOLITHIC 3D ICS

The primary characteristics of the proposed cell library are described in Section III-A. The design flow to integrate the proposed library into the design process is discussed in Section III-B. Cell-level simulation results and comparison of 3D cells with 2D cells are provided in Section III-C.

## A. Library Development

In this work, *Mono3D*, an open source standard cell library for transistor-level monolithic 3D technology, is developed in 45 nm technology [21]. Mono3D consists of two tiers where each tier is based on the 2D 45 nm process design kit FreePDK45 from North Carolina State University (NCSU) [27]. Thus, the process and physical characteristics (transistor models and characteristics of the on-chip metal layers) are obtained from the FreePDK45. Similar to [22] and [23], the pull-down network of a CMOS gate (nMOS transistors) is built within the top tier whereas the pull-up network (pMOS transistors) is fabricated within the bottom tier. Note that the processing temperature of the top tier is constrained to be less than 500-600 °C [15] to not damage the transistors within the bottom tier. This relatively low processing temperature, however, degrades the quality of the top tier devices. Thus, pMOS devices (that already have lower mobility) are placed within the bottom tier. As such, the proposed cell library can only be used for transistor-level monolithic 3D approach since MIVs exist within each standard cell to connect nMOS and pMOS devices. The transistor device characteristics are the same as in 2D FreePDK45. Thus, any processing temperature related degradations are not considered. However, the impact of novel devices/models and manufacturing steps for 3D monolithic integration can be captured by replacing/modifying the device models within the provided design kit. System-level effects of varying device characteristics (due to, for example, the manufacturing steps of the top tiers) can therefore be investigated.

Fig. 2. Cross-sections of the (a) conventional 2D and (b) transistor-level monolithic (TL-Mono) 3D technology with two tiers. The top tier hosts the nMOS transistors whereas the pMOS transistors are placed within the bottom tier.

In the proposed *Mono3D*, two metal layers are allocated to the bottom tier (metal1\_btm and metal2\_btm), as illustrated in Fig. 2. These metal layers are primarily for routing the intra-cell signals. The top tier is separated from the bottom tier with an inter-layer dielectric (ILD) with a thickness of 100 nm. Inter-tier coupling is minimized at this thickness, as experimentally validated [17]. The 10 metal layers that exist in 2D *FreePDK45* are maintained the same for the top tier in *Mono3D*. The intra-cell connections that span the two tiers are achieved by MIVs. Each MIV has a width of 50 nm and height of 215 nm [18].

Currently, 20 standard cells exist in *Mono3D*, as listed in Table I. In addition to the fundamental cells, multiple clock buffers and a latch are included. Each cell is developed with a full-custom design methodology using a cell stacking technique. As opposed to [22] and [23] where the power (within the bottom tier) and ground (within the top tier) rails overlap, in the proposed *Mono3D*, the power rail is located at the top of the bottom tier and ground rail is located at the bottom of the top tier. These power and ground rails at each cell row are connected to the system-level power network through power and ground rings placed during the placement and routing process.

TABLE I LIST OF STANDARD CELLS IN THE MONOLITHIC 3D LIBRARY

| Cell     | Cell     |
|----------|----------|
| AND2X1   | INVX2    |
| AOI21X1  | INVX4    |
| BUFX2    | LATCHNEG |
| BUFX4    | MUX2X1   |
| CLKBUF1  | NAND2X1  |
| CLKBUF2  | NOR2X1   |
| CLKBUF3  | OAI21X1  |
| DFFPOSX1 | OR2X1    |
| FILL     | XNOR2X1  |
| INVX1    | XOR2X1   |

A specific track is allocated for intra-cell MIVs, which are distributed within the cell to minimize the interconnect length and reduce the cell height. Each cell within the 2D *NanGate* library has 14 routing tracks [28]. Alternatively, in this study, three monolithic 3D cell libraries are developed with different numbers of tracks: 8-track (*Mono3D\_v1*), 9-track (*Mono3D\_v2*), and 10-track (*Mono3D\_v3*). The number of tracks plays an important role in chip-level routing congestion, a primary issue in monolithic 3D ICs (see Section IV for more details). The cell heights in *Mono3D\_v1*, *Mono3D\_v2*, and *Mono3D\_v3* are, respectively, 1.33 µm, 1.52 µm, and 1.71 µm. These cell heights are, respectively, 46%, 38%, and 31% shorter than the standard cell height (2.47 µm) in the *NanGate* cell library.
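The quoted height reductions follow directly from the four cell heights. A minimal Python sketch of that arithmetic is given below; the constant and function names are illustrative and are not part of the Mono3D release.

```python
# Cell heights in micrometers, as stated in the text.
NANGATE_2D_HEIGHT_UM = 2.47
MONO3D_HEIGHTS_UM = {
    "Mono3D_v1": 1.33,  # 8 routing tracks
    "Mono3D_v2": 1.52,  # 9 routing tracks
    "Mono3D_v3": 1.71,  # 10 routing tracks
}


def height_reduction_pct(height_3d_um, height_2d_um=NANGATE_2D_HEIGHT_UM):
    """Percent by which a 3D cell is shorter than the 2D reference cell."""
    return (1.0 - height_3d_um / height_2d_um) * 100.0


# Round to whole percents, matching the precision used in the text.
reductions = {
    name: round(height_reduction_pct(height))
    for name, height in MONO3D_HEIGHTS_UM.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct}% shorter than the 2D cell")
```

Rounded to whole percents, the sketch gives 46%, 38%, and 31%, in agreement with the values above.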

The layout of a NAND cell is illustrated in Fig. 3 in both 2D and 3D monolithic technologies with three different cell heights. Cell dimensions and the three MIVs are highlighted. Similarly, a 2D D-flip-flop cell and a 3D monolithic D-flip-flop cell within *Mono3D\_v1* are compared in Fig. 4. In this case, the top and bottom tiers are separately depicted. Note that the width of the 3D flip-flop cell increases by approximately 7% due to MIVs and intra-cell routing. Also note that special emphasis is given to providing white space at the top tier (depending upon the number of routing tracks) to avoid routing congestion induced by pin blockage.

## B. Design Flow

The design flow adopted in this work and the modifications required for 3D monolithic technology are depicted in Fig. 5. A new technology file (*.tf*) is generated for *Mono3D* to include all of the new layers (interconnects, via, ILD, and MIV). Based on these modifications, a new display resource file (*.drf*) is generated to develop full-custom layouts of the 3D cells. The design rule check (DRC), layout versus schematic (LVS) and parasitics extraction (PEX) are performed using *Calibre* [29]. The DRC rule file is modified to include new features for the additional metal layers, vias, transistors, ILD and MIV. For example, minimum spacing between two MIVs is equal to 120 nm, producing an MIV pitch of 170 nm.
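The MIV pitch quoted in the DRC example is the sum of the MIV width (50 nm, Section III-A) and the minimum MIV-to-MIV spacing (120 nm). The short Python sketch below reproduces this and, as an illustration only, counts how many MIVs fit side by side in a given span. The helper names and the 1000 nm span are hypothetical and do not come from the rule files.

```python
MIV_WIDTH_NM = 50      # MIV width, from Section III-A
MIV_SPACING_NM = 120   # minimum spacing between two MIVs, from the DRC example


def miv_pitch_nm(width_nm=MIV_WIDTH_NM, spacing_nm=MIV_SPACING_NM):
    """Center-to-center pitch of two MIVs placed at minimum spacing."""
    return width_nm + spacing_nm


def max_mivs_in_span(span_nm, width_nm=MIV_WIDTH_NM, spacing_nm=MIV_SPACING_NM):
    """Largest number of MIVs that fit in a 1-D span at minimum spacing.

    The first MIV needs one width; each additional MIV needs one full pitch.
    """
    if span_nm < width_nm:
        return 0
    return (span_nm - width_nm) // miv_pitch_nm(width_nm, spacing_nm) + 1


print(miv_pitch_nm())          # pitch in nm
print(max_mivs_in_span(1000))  # hypothetical 1000 nm span
```

This one-dimensional count ignores spacing rules to other layers, so it should be read as an upper bound under the stated assumptions.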



Fig. 3. Comparison of the layout views of a NAND gate in (a) traditional 2D technology with 14 routing tracks in each cell, (b) monolithic 3D technology with 8 routing tracks, (c) monolithic 3D technology with 9 routing tracks, and (d) monolithic 3D technology with 10 routing tracks, illustrating the three MIVs used to connect the top and bottom tiers.



Fig. 4. Comparison of the layout views of a D-flip-flop in (a) traditional 2D and (b) transistor-level monolithic 3D technology. The top and bottom tiers are separately depicted for the 3D technology.

The LVS rule file is also modified for the tool to be able to independently identify transistors located in separate tiers. The extracted netlist with MIVs is analyzed to accurately determine the interconnections between nMOS (within the top tier) and pMOS (within the bottom tier) transistors. The RC extraction rule file is modified to be able to recognize the new device tier, new metal layers, and MIVs. For metal interconnects, intrinsic plate capacitance, intrinsic fringe capacitance, and near-body (coupling) capacitance are considered between silicon and metal, and metal and metal. A single MIV is characterized with a resistance of 5.5 Ω and a capacitance of 0.04 fF, based on [23] where device-level extraction is performed. The only parasitic component that is not considered during the extraction process is the tier-to-tier coupling capacitance. As experimentally demonstrated in [17], this component is negligible when the inter-layer dielectric is 100 nm thick.
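For a sense of scale, the MIV parasitics above can be combined into a standalone lumped RC product. The Python sketch below is a first-order figure only: it treats the MIV as a single lumped resistor and capacitor and ignores the driver and load that determine its actual effect inside a cell. The names are illustrative.

```python
MIV_RESISTANCE_OHM = 5.5        # per-MIV resistance, from the text
MIV_CAPACITANCE_F = 0.04e-15    # per-MIV capacitance (0.04 fF), from the text


def lumped_rc_seconds(resistance_ohm, capacitance_f):
    """Time constant of a single lumped RC element, in seconds."""
    return resistance_ohm * capacitance_f


# Convert to femtoseconds for readability.
tau_fs = lumped_rc_seconds(MIV_RESISTANCE_OHM, MIV_CAPACITANCE_F) * 1e15
print(f"Standalone MIV RC product: {tau_fs:.2f} fs")
```

The product is 0.22 fs. Delay changes at the cell level also depend on the surrounding intra-cell routing and coupling, which this estimate does not capture.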

After *RC* extraction, 3D cells are characterized with *Encounter Library Characterizer (ELC)* [30] to obtain the timing and power characteristics (lookup tables) of each cell. The extracted 3D cell netlists are also simulated with *HSPICE* [31] to ensure the accuracy of the characterization process. More details on the area, timing, and power characteristics of the 3D cells and comparison with 2D cells are provided in Section III-C.

The *.lib* file for the *Mono3D* generated by *ELC* is converted into the *.db* format, which is used for circuit synthesis, placement, clock tree synthesis, and routing. Since all of the I/O pins of the 3D cells are located within the top tier, existing physical design tools can be used for these steps.

## C. Cell-Level Evaluation

1) Area: Cell-level area improvement obtained by monolithic 3D technology is shown in Fig. 6. According to this figure, the reduction in cell area varies from 6.5% to 64.1% in *Mono3D\_v1*, -6.9% to 59.0% in *Mono3D\_v2*, and -13.5% to 53.8% in *Mono3D\_v3*, depending upon the specific cell. An average improvement of 32%, 22%, and 14% is achieved for, respectively, *Mono3D\_v1*, *Mono3D\_v2*, and *Mono3D\_v3*. Note that a negative percent implies that the cell area increases as compared to the 2D cell. This behavior occurs for cells where the reduction in cell height causes a considerable increase in cell width. Similarly, the average area reduction is not as large as the reduction in cell height since, on average, the cell width slightly increases due to MIVs and intra-cell routing within the reduced cell footprint.
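The interplay between height reduction and width increase can be illustrated with a simple model. The Python sketch below assumes a rectangular cell whose area is height times width, which is a simplification of the actual layouts. The cell heights come from Section III-A, whereas the width ratios passed to the function are hypothetical and are not values read from Fig. 6.

```python
HEIGHT_2D_UM = 2.47  # NanGate 2D cell height
HEIGHTS_3D_UM = {"Mono3D_v1": 1.33, "Mono3D_v2": 1.52, "Mono3D_v3": 1.71}


def area_reduction_pct(height_3d_um, width_ratio, height_2d_um=HEIGHT_2D_UM):
    """Percent area reduction of a 3D cell relative to its 2D counterpart.

    width_ratio is (3D cell width) / (2D cell width). A negative result
    means the 3D cell occupies more area than the 2D cell.
    """
    return (1.0 - (height_3d_um / height_2d_um) * width_ratio) * 100.0


def break_even_width_ratio(height_3d_um, height_2d_um=HEIGHT_2D_UM):
    """Width ratio above which the 3D cell becomes larger than the 2D cell."""
    return height_2d_um / height_3d_um


for name, height in HEIGHTS_3D_UM.items():
    print(
        f"{name}: break-even width ratio = {break_even_width_ratio(height):.2f}, "
        f"area reduction at a hypothetical 1.5x width = "
        f"{area_reduction_pct(height, 1.5):.1f}%"
    )
```

Under this model, the taller the 3D cell, the smaller the width increase it can absorb before the area reduction turns negative, which is consistent with negative values appearing only in *Mono3D\_v2* and *Mono3D\_v3*.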

2) Delay and Power Consumption: HSPICE simulations are performed on the extracted 3D netlists to compare monolithic 3D technology with the conventional 2D technology at the



Fig. 5. Integration of the proposed open source cell library into design flow, illustrating the required modifications.



Fig. 6. Percent reduction in area achieved by each individual monolithic 3D cell as compared to the 2D cells. Results are provided for each 3D library, *Mono3D\_v1*, *Mono3D\_v2*, and *Mono3D\_v3*.

 
TABLE II AVERAGE DELAY AND POWER CHARACTERISTICS OF 2D AND MONOLITHIC 3D CELLS WITH 8 (Mono3D\_v1), 9 (Mono3D\_v2), AND 10 (Mono3D\_v3) ROUTING TRACKS. THE PERCENT CHANGES WITH RESPECT TO 2D CELLS ARE LISTED

| Cell     | Delay 2D (ps) | Delay 3D_v1 (ps) | Delay 3D_v2 (ps) | Delay 3D_v3 (ps) | Power 2D (µW) | Power 3D_v1 (µW) | Power 3D_v2 (µW) | Power 3D_v3 (µW) |
|----------|-------|----------------|----------------|----------------|-------|----------------|----------------|----------------|
| AND2X1   | 17.63 | 19.27 (9.3%)   | 19.5 (10.6%)   | 19.52 (10.7%)  | 2.82  | 2.98 (5.7%)    | 3.01 (6.7%)    | 3.03 (7.4%)    |
| AOI21X1  | 13.68 | 13.58 (-0.7%)  | 13.69 (0.1%)   | 13.80 (0.9%)   | 3.32  | 3.33 (0.3%)    | 3.34 (0.6%)    | 3.35 (0.9%)    |
| BUFX2    | 17.89 | 17.56 (-1.8%)  | 17.86 (-0.2%)  | 17.94 (0.3%)   | 14.04 | 13.71 (-2.4%)  | 13.92 (-0.9%)  | 13.97 (-0.5%)  |
| BUFX4    | 15.82 | 14.97 (-5.4%)  | 15.29 (-3.4%)  | 15.76 (-0.4%)  | 29.00 | 28.97 (-0.1%)  | 28.98 (-0.1%)  | 29.14 (0.5%)   |
| CLKBUF1  | 27.01 | 27.28 (1.0%)   | 27.32 (1.2%)   | 27.34 (1.2%)   | 64.07 | 62.17 (-3.0%)  | 62.52 (-2.4%)  | 62.84 (-1.9%)  |
| CLKBUF2  | 39.57 | 40.05 (1.2%)   | 40.17 (1.5%)   | 40.24 (1.7%)   | 93.25 | 90.43 (-3.0%)  | 90.88 (-2.5%)  | 91.05 (-2.4%)  |
| CLKBUF3  | 51.73 | 52.74 (2.0%)   | 53.04 (2.5%)   | 53.27 (3.0%)   | 121.4 | 118.6 (-2.3%)  | 119.0 (-2.0%)  | 119.1 (-1.9%)  |
| DFFPOSX1 | 41.69 | 34.54 (-17.2%) | 34.72 (-16.7%) | 34.93 (-16.2%) | 26.75 | 27.13 (1.4%)   | 27.18 (1.6%)   | 27.39 (2.4%)   |
| INVX1    | 6.73  | 6.10 (-9.4%)   | 6.42 (-4.6%)   | 6.50 (-3.4%)   | 4.69  | 4.64 (-1.1%)   | 4.68 (-0.2%)   | 4.72 (0.6%)    |
| INVX2    | 6.54  | 6.08 (-7.0%)   | 6.32 (-3.4%)   | 6.40 (-2.1%)   | 9.31  | 9.15 (-1.7%)   | 9.24 (-0.8%)   | 9.35 (0.4%)    |
| INVX4    | 6.44  | 6.08 (-5.6%)   | 6.29 (-2.3%)   | 6.38 (-0.9%)   | 18.01 | 17.99 (-0.1%)  | 18.16 (0.8%)   | 18.36 (1.9%)   |
| MUX2X1   | 16.25 | 17.21 (5.9%)   | 17.23 (6.0%)   | 17.27 (6.2%)   | 5.81  | 6.14 (5.7%)    | 6.15 (5.9%)    | 6.17 (6.2%)    |
| NAND2X1  | 10.22 | 9.76 (-4.5%)   | 9.78 (-4.3%)   | 9.89 (-3.2%)   | 1.63  | 1.57 (-3.7%)   | 1.58 (-3.1%)   | 1.61 (-1.2%)   |
| NOR2X1   | 11.41 | 11.78 (3.2%)   | 12.16 (6.6%)   | 12.18 (6.8%)   | 1.63  | 1.66 (1.8%)    | 1.69 (3.7%)    | 1.73 (6.1%)    |
| OAI21X1  | 12.89 | 12.72 (-1.3%)  | 12.88 (-0.1%)  | 12.93 (0.3%)   | 3.27  | 3.24 (-0.9%)   | 3.25 (-0.6%)   | 3.26 (-0.3%)   |
| OR2X1    | 18.33 | 20.89 (14.0%)  | 21.12 (15.2%)  | 21.14 (15.3%)  | 2.54  | 2.84 (11.8%)   | 2.85 (12.2%)   | 2.87 (12.9%)   |
| XNOR2X1  | 36.05 | 41.32 (14.6%)  | 41.76 (15.8%)  | 41.92 (16.3%)  | 12.66 | 14.12 (11.5%)  | 14.13 (11.6%)  | 14.18 (12.0%)  |
| XOR2X1   | 35.59 | 41.95 (17.9%)  | 42.41 (19.2%)  | 42.69 (19.9%)  | 12.53 | 14.16 (13.0%)  | 14.24 (13.6%)  | 14.28 (13.9%)  |
| Average  | 21.42 | 21.88 (2.15%)  | 22.11 (3.22%)  | 22.23 (3.78%)  | 23.71 | 23.49 (-0.93%) | 23.60 (-0.46%) | 23.69 (-0.08%) |

cell level. At 1.1 V power supply, 50 ps transition time, and 27 °C temperature, average delay and power consumption are analyzed, as listed in Table II for 2D and each version

of the 3D technology. According to this table, *Mono3D\_v1* cells have, on average, 2.15% (3.22% in *Mono3D\_v2* and 3.78% in *Mono3D\_v3*) higher propagation delay and 0.93%

TABLE III NUMBER AND TYPE OF CELLS FOR EACH BENCHMARK CIRCUIT OPERATING AT 500 MHz

| Circuit | NO. Gates | DFF     | INV     | NAND    | NOR     | AND     | OR      | MUX    | OAI     | XOR/XNOR |
|---------|-----------|---------|---------|---------|---------|---------|---------|--------|---------|----------|
| SIMON   | 2,697     | 392     | 206     | 9       | 332     | 138     | 1,228   | 258    | 134     | 0        |
| s38584  | 22,273    | 1,426   | 6,798   | 916     | 1,040   | 8,559   | 3,534   | 0      | 0       | 0        |
| FFT64   | 89,991    | 18,963  | 14,069  | 16,266  | 5,477   | 20,468  | 1       | 6,881  | 7,866   | 0        |
| FFT128  | 691,839   | 96,746  | 74,945  | 104,613 | 81,293  | 106,012 | 102,242 | 17,374 | 108,614 | 0        |
| FFT256  | 1,467,815 | 190,025 | 152,803 | 193,864 | 210,231 | 197,293 | 246,543 | 34,924 | 242,132 | 0        |

TABLE IV NUMBER AND TYPE OF CELLS FOR EACH BENCHMARK CIRCUIT OPERATING AT 1.5/2 GHz

| Circuit | NO. Gates | DFF     | INV     | NAND    | NOR     | AND     | OR      | MUX    | OAI     | XOR/XNOR |
|---------|-----------|---------|---------|---------|---------|---------|---------|--------|---------|----------|
| SIMON   | 3,456     | 392     | 294     | 648     | 509     | 650     | 547     | 33     | 383     | 0        |
| s38584  | 35,804    | 1,426   | 7,653   | 2,531   | 2,101   | 15,057  | 7,036   | 0      | 0       | 0        |
| FFT64   | 96,123    | 18,963  | 23,992  | 15,056  | 0       | 19,258  | 3,502   | 6,882  | 8,470   | 0        |
| FFT128  | 770,370   | 96,746  | 79,375  | 220,357 | 62,881  | 322,196 | 83,353  | 46,588 | 108,874 | 0        |
| FFT256  | 1,617,784 | 190,025 | 254,358 | 213,765 | 187,063 | 218,183 | 243,049 | 66,489 | 244,852 | 0        |

(0.46% in *Mono3D\_v2* and 0.08% in *Mono3D\_v3*) lower power consumption as compared to the 2D standard cells. This slight increase in delay is due to denser cell layout, producing additional coupling capacitances and MIV impedances. Note that in a D-Flip-Flop cell, both clock-to-Q delay and power are improved as compared to 2D cells since the D-Flip-Flop cell has relatively longer average interconnect length where the monolithic 3D technology is helpful. Also note that the cell-level change in delay and power highly depends upon the individual cell layout, interconnects, and MIVs.
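The average percent changes quoted above can be recomputed from the "Average" row of Table II. The Python sketch below does so; the dictionary and function names are illustrative.

```python
# "Average" row of Table II: delay in ps and power in microwatts.
AVERAGE_DELAY_PS = {"2D": 21.42, "3D_v1": 21.88, "3D_v2": 22.11, "3D_v3": 22.23}
AVERAGE_POWER_UW = {"2D": 23.71, "3D_v1": 23.49, "3D_v2": 23.60, "3D_v3": 23.69}


def percent_change(value, reference):
    """Signed percent change of value with respect to reference."""
    return (value - reference) / reference * 100.0


def changes_vs_2d(row):
    """Percent change of each 3D library entry relative to the 2D entry."""
    reference = row["2D"]
    return {
        library: round(percent_change(value, reference), 2)
        for library, value in row.items()
        if library != "2D"
    }


delay_changes = changes_vs_2d(AVERAGE_DELAY_PS)
power_changes = changes_vs_2d(AVERAGE_POWER_UW)
print("Delay change (%):", delay_changes)
print("Power change (%):", power_changes)
```

The results (2.15%, 3.22%, and 3.78% for delay; -0.93%, -0.46%, and -0.08% for power) match the percentages listed in Table II.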

# IV. EXPERIMENTAL RESULTS

The proposed open source Mono3D cell libraries are used to investigate the footprint, power, and timing characteristics as well as routing congestion of several benchmarks with various numbers of gates, ranging from 2.7K gates to 1.6M gates. For the conventional 2D technology and synthesis, the 45 nm NanGate cell library and the FreePDK45 process kit are used whereas for the monolithic 3D technology, the proposed Mono3D libraries are used (all of the libraries have the same type of cells for fair comparison). Circuits are synthesized using Synopsys Design Compiler [32] at 500 MHz (relaxed timing constraint with no timing violations) and 1.5/2 GHz (tighter constraint with negative slack) clock frequencies. Note that for the relatively small benchmarks SIMON (lightweight encryption core) and s38584 (academic benchmark), the high frequency constraint is 2 GHz whereas for larger FFT cores (64-, 128-, and 256-point [33]), the high frequency constraint is 1.5 GHz. The synthesized netlists are placed (at 70% placement density) and routed using Cadence Encounter [34]. The overall number of gates and the number of cells of each type are listed in Tables III and IV, for, respectively, 500 MHz and 1.5/2 GHz. According to these tables, those cells that achieve above average reduction in area are typically used more than the other cells during the synthesis process, thereby maximizing the reduction in system-level footprint.

Area and wirelength characteristics of the benchmark circuits and the issue of routing congestion are discussed in Section IV-A. Power, timing, and clock network characteristics are described, respectively, in Sections IV-B, IV-C, and IV-D.

## A. Footprint, Wirelength, and Routing Congestion

The comparison of footprint and overall wirelength in 2D and 3D designs is listed in Table V. As an example, the layout views of the 2D and 3D versions of the 128-point FFT core are depicted in Fig. 7, illustrating the effect of the number of tracks on chip-level footprint.

According to Table V, benchmark circuits developed with transistor-level monolithic 3D libraries consume, on average, 37.3% (Mono3D\_v1), 32.4% (Mono3D\_v2), and 25.1% (Mono3D\_v3) less area as compared to conventional 2D designs. As expected, the footprint reduction diminishes as the number of tracks in each cell increases.

At 500 MHz, no DRC violations are reported. At the high frequency constraint, however, there is a relatively large number of DRC violations (an indication of routing congestion) for Mono3D\_v1 (8 routing tracks) due to the significantly denser layout as compared to 2D technology. These DRC violations are significantly reduced in Mono3D\_v2 (9 routing tracks), and completely eliminated in *Mono3D\_v3* (10 routing tracks), emphasizing the need for additional interconnect resources in monolithic 3D technologies.

The reduction in the overall wirelength is typically in the range of 10% to 24% depending upon the specific 3D library and operating frequency. An important observation is that a larger reduction in wirelength is achieved once the number of tracks is increased from 8 to 9. An additional routing track provides flexibility during the routing process, thereby further reducing the overall wirelength. However, if the number of tracks is increased to 10 (Mono3D\_v3), the wirelength reduction diminishes. This behavior is due to the relatively larger increase in footprint for *Mono3D\_v3*.

To gain more insight into routing congestion in monolithic 3D ICs, the number of overflows is reported for each benchmark circuit. These results are listed in Table VI. A global cell has overflow if the routing resources assigned to the cell exceed the available routing resource for that cell. According to the table, 3D designs with 8 routing tracks in each cell exhibit a large number of overflows due to congestion. Increasing the number of routing tracks significantly reduces the number of overflows, particularly at

#### TABLE V

COMPARISON OF FOOTPRINT, WIRELENGTH, AND NUMBER OF DRC VIOLATIONS (VIOS) IN 2D AND MONOLITHIC 3D TECHNOLOGIES WITH 8 (*Mono3D\_v1*), 9 (*Mono3D\_v2*), AND 10 (*Mono3D\_v3*) ROUTING TRACKS IN EACH 3D CELL. THE PERCENT CHANGES WITH RESPECT TO 2D CELLS ARE LISTED.

| Circuit | Design style | Footprint at 500 MHz (mm²) | Change (%) | Wirelength at 500 MHz (µm) | Change (%) | Footprint at 2 GHz / 1.5 GHz (mm²) | Change (%) | Wirelength at 2 GHz / 1.5 GHz (µm) | Change (%) | DRC vios at 2 GHz / 1.5 GHz |
|---------|--------------|--------|-------|------------|-------|--------|-------|------------|-------|-----|
| SIMON   | 2D           | 0.0110 | -     | 30,260     | -     | 0.0122 | -     | 44,242     | -     | 0   |
| SIMON   | 3D_v1        | 0.0066 | -40.0 | 24,942     | -17.6 | 0.0079 | -35.2 | 31,129     | -29.6 | 32  |
| SIMON   | 3D_v2        | 0.0073 | -33.6 | 24,027     | -20.6 | 0.0084 | -31.1 | 26,449     | -40.2 | 2   |
| SIMON   | 3D_v3        | 0.0083 | -24.5 | 26,402     | -12.7 | 0.0090 | -26.2 | 29,557     | -33.2 | 0   |
| s38584  | 2D           | 0.077  | -     | 174,114    | -     | 0.079  | -     | 203,703    | -     | 0   |
| s38584  | 3D_v1        | 0.048  | -38.3 | 144,108    | -17.2 | 0.048  | -38.7 | 167,532    | -17.8 | 231 |
| s38584  | 3D_v2        | 0.051  | -34.2 | 142,442    | -18.2 | 0.051  | -35.4 | 164,039    | -19.5 | 53  |
| s38584  | 3D_v3        | 0.055  | -29.5 | 152,605    | -12.4 | 0.056  | -29.5 | 182,051    | -10.6 | 0   |
| FFT64   | 2D           | 0.45   | -     | 965,796    | -     | 0.59   | -     | 1,202,699  | -     | 0   |
| FFT64   | 3D_v1        | 0.28   | -37.7 | 786,683    | -18.6 | 0.40   | -32.2 | 1,031,401  | -14.2 | 253 |
| FFT64   | 3D_v2        | 0.30   | -33.3 | 771,457    | -20.1 | 0.42   | -28.8 | 975,095    | -18.9 | 11  |
| FFT64   | 3D_v3        | 0.33   | -26.7 | 813,338    | -15.8 | 0.44   | -25.4 | 1,031,922  | -14.2 | 0   |
| FFT128  | 2D           | 2.54   | -     | 12,205,011 | -     | 2.94   | -     | 15,201,864 | -     | 0   |
| FFT128  | 3D_v1        | 1.55   | -39.1 | 9,436,940  | -22.7 | 1.76   | -40.2 | 11,670,182 | -23.2 | 568 |
| FFT128  | 3D_v2        | 1.59   | -37.2 | 9,240,148  | -24.3 | 1.84   | -37.5 | 11,407,021 | -24.9 | 7   |
| FFT128  | 3D_v3        | 1.77   | -30.3 | 10,178,772 | -16.6 | 2.04   | -30.6 | 12,973,791 | -14.6 | 0   |
| FFT256  | 2D           | 5.72   | -     | 40,787,944 | -     | 5.88   | -     | 39,094,466 | -     | 0   |
| FFT256  | 3D_v1        | 3.90   | -31.7 | 37,688,337 | -7.60 | 4.16   | -29.3 | 37,298,924 | -4.59 | 931 |
| FFT256  | 3D_v2        | 4.36   | -23.7 | 33,404,062 | -18.1 | 4.58   | -22.1 | 33,983,460 | -13.1 | 29  |
| FFT256  | 3D_v3        | 4.89   | -14.5 | 35,994,158 | -11.9 | 5.18   | -11.9 | 36,626,092 | -6.31 | 0   |



Fig. 7. The layout views of a highly parallelized 128-point FFT core in (a) conventional 2D technology with 14 routing tracks in each cell, (b) transistor-level monolithic 3D technology with 8 routing tracks, (c) monolithic 3D technology with 9 routing tracks, and (d) monolithic 3D technology with 10 routing tracks.

higher frequency. The increase in the number of overflows at higher frequency is partly due to the lack of sufficient higher drive strength cells in the proposed library, and partly due to a tighter timing constraint with additional limitations on available wiring resources. The routing congestion in monolithic 3D circuits can also be observed by comparing the overall reduction in wirelength with the reduction in footprint. For each of the benchmarks, the percent reduction in footprint exceeds the percent reduction in overall wirelength, indicating exacerbated routing congestion for monolithic 3D technology.
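The overflow metric described above can be illustrated with a minimal sketch. It assumes the common global-routing convention in which the overflow of a global cell is the routing demand in excess of its capacity, summed over all global cells. The exact accounting used by the place-and-route tool is not stated here, and the demand values below are hypothetical. Equating the per-cell capacity with the cell-level track count is also a simplification made only for illustration.

```python
def gcell_overflow(demand, capacity):
    """Overflow of one global cell: tracks requested beyond those available (never negative)."""
    return max(0, demand - capacity)


def total_overflow(gcells):
    """Sum the overflow over an iterable of (demand, capacity) pairs."""
    return sum(gcell_overflow(demand, capacity) for demand, capacity in gcells)


# Hypothetical routing demand (in tracks) for five global cells.
demand_per_gcell = [7, 9, 10, 8, 11]

# Same demand evaluated against capacities of 8, 9, and 10 tracks per global cell.
overflow_by_tracks = {
    tracks: total_overflow((demand, tracks) for demand in demand_per_gcell)
    for tracks in (8, 9, 10)
}

print(overflow_by_tracks)
```

In this toy case each additional track removes overflow from every congested global cell at once, which mirrors the qualitative trend in Table VI, where the totals drop sharply from 8 to 9 to 10 routing tracks.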

# B. Power Characteristics

The power consumption of 2D and monolithic 3D designs is compared in Table VII. All three components of power consumption (internal, switching, and leakage) are provided. Internal power is consumed due to the intra-cell device and interconnect capacitances and the short-circuit current during the switching activity of a cell. Switching power is consumed by the inter-cell interconnect (net) capacitances. Due to the considerable reduction in overall wirelength in monolithic 3D designs, the switching power is reduced, on average, by 22% (8 routing tracks) and 24% (9 routing tracks) at 500 MHz. Note that if the number of routing tracks is increased from 8 to 10, the switching power first decreases, then slightly increases. This behavior follows the same trend as the wirelength, as described in the previous subsection. Thus, the largest reduction in switching power is achieved when the number of routing tracks is 9 (*Mono3D\_v2*).
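To make the relation between the three power components concrete, the sketch below sums the components for the SIMON benchmark at 500 MHz and computes the percent change relative to 2D. The input values are copied from Table VII, while the helper names are illustrative. Since the tabulated values are rounded, the recomputed percentages may differ from the listed ones in the last digit.

```python
# SIMON at 500 MHz, values in mW copied from Table VII: (internal, switching, leakage).
simon_500mhz = {
    "2D": (8.67, 4.48, 0.571),
    "Mono3D_v1": (10.08, 2.96, 0.551),
    "Mono3D_v2": (9.88, 2.82, 0.548),
    "Mono3D_v3": (9.95, 2.89, 0.554),
}


def percent_change(new, reference):
    """Signed percent change of `new` with respect to `reference`."""
    return 100.0 * (new - reference) / reference


ref_internal, ref_switching, ref_leakage = simon_500mhz["2D"]
ref_total = ref_internal + ref_switching + ref_leakage

summary = {}
for style, (internal, switching, leakage) in simon_500mhz.items():
    total = internal + switching + leakage
    summary[style] = {
        "total_mw": total,
        "switching_change": percent_change(switching, ref_switching),
        "total_change": percent_change(total, ref_total),
    }

for style, row in summary.items():
    print(
        f"{style}: total = {row['total_mw']:.2f} mW, "
        f"switching change = {row['switching_change']:.1f}%, "
        f"total change = {row['total_change']:.2f}%"
    )
```

For this benchmark the switching power drops by roughly a third, yet the total changes by only a few percent because the internal component is larger in the 3D versions.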

This characteristic also affects the internal power component by changing the signal transition times since the short-circuit power strongly depends upon the input transition time (which in turn depends upon the wirelength). Specifically, when the

#### TABLE VI

COMPARISON OF OVERFLOW CHARACTERISTICS IN 2D AND MONOLITHIC 3D TECHNOLOGIES WITH 8 (*Mono3D\_v1*), 9 (*Mono3D\_v2*), AND 10 (*Mono3D\_v3*) ROUTING TRACKS IN EACH 3D CELL

| Circuit | Design style | Overflow at 500 MHz: horizontal tracks | Overflow at 500 MHz: vertical tracks | Overflow at 500 MHz: total | Overflow at 2 GHz: horizontal tracks | Overflow at 2 GHz: vertical tracks | Overflow at 2 GHz: total |
|---------|--------------|-------|-------|-------|-------|--------|--------|
| SIMON   | 2D           | 0     | 0     | 0     | 0     | 0      | 0      |
| SIMON   | 3D_v1        | 0     | 0     | 0     | 5     | 1      | 6      |
| SIMON   | 3D_v2        | 0     | 0     | 0     | 2     | 0      | 2      |
| SIMON   | 3D_v3        | 0     | 0     | 0     | 0     | 0      | 0      |
| s38584  | 2D           | 0     | 0     | 0     | 1     | 0      | 1      |
| s38584  | 3D_v1        | 121   | 17    | 138   | 157   | 31     | 186    |
| s38584  | 3D_v2        | 32    | 2     | 34    | 19    | 3      | 22     |
| s38584  | 3D_v3        | 0     | 1     | 1     | 8     | 0      | 8      |
| FFT64   | 2D           | 4     | 0     | 4     | 11    | 8      | 19     |
| FFT64   | 3D_v1        | 127   | 341   | 468   | 372   | 1,621  | 1,993  |
| FFT64   | 3D_v2        | 23    | 76    | 99    | 47    | 309    | 356    |
| FFT64   | 3D_v3        | 1     | 4     | 5     | 6     | 21     | 27     |
| FFT128  | 2D           | 3     | 0     | 3     | 14    | 12     | 26     |
| FFT128  | 3D_v1        | 891   | 1,427 | 2,318 | 1,160 | 16,807 | 17,967 |
| FFT128  | 3D_v2        | 152   | 42    | 194   | 19    | 1,010  | 1,467  |
| FFT128  | 3D_v3        | 18    | 19    | 37    | 8     | 140    | 202    |
| FFT256  | 2D           | 640   | 264   | 904   | 1,419 | 258    | 1,677  |
| FFT256  | 3D_v1        | 260   | 4,475 | 4,735 | 584   | 13,161 | 13,475 |
| FFT256  | 3D_v2        | 2,310 | 2,042 | 4,352 | 985   | 3,462  | 4,447  |
| FFT256  | 3D_v3        | 1,236 | 655   | 1,891 | 568   | 1,244  | 1,812  |

#### TABLE VII

COMPARISON OF POWER CONSUMPTION IN 2D AND MONOLITHIC 3D TECHNOLOGIES WITH 8 (*Mono3D\_v1*), 9 (*Mono3D\_v2*), and 10 (*Mono3D\_v3*) ROUTING TRACKS IN EACH CELL. *INT, SWI*, and *LK* REFER, RESPECTIVELY, TO INTERNAL, SWITCHING (NET), AND LEAKAGE POWER

| Circuit | Design style | INT at 500 MHz (mW) | SWI at 500 MHz (mW) (change) | LK at 500 MHz (mW) | Total at 500 MHz (mW) (change) | INT at 2 GHz / 1.5 GHz (mW) | SWI at 2 GHz / 1.5 GHz (mW) (change) | LK at 2 GHz / 1.5 GHz (mW) | Total at 2 GHz / 1.5 GHz (mW) (change) |
|---------|--------------|-------|-----------------|-------|-----------------|--------|-----------------|-------|------------------|
| SIMON   | 2D           | 8.67  | 4.48 (-)        | 0.571 | 13.73 (-)       | 34.15  | 14.61 (-)       | 0.546 | 49.3 (-)         |
| SIMON   | 3D_v1        | 10.08 | 2.96 (-34%)     | 0.551 | 13.59 (-1.01%)  | 35.45  | 11.75 (-19%)    | 0.542 | 47.7 (-3.25%)    |
| SIMON   | 3D_v2        | 9.88  | 2.82 (-37%)     | 0.548 | 13.25 (-3.49%)  | 34.72  | 11.54 (-21%)    | 0.537 | 46.8 (-5.07%)    |
| SIMON   | 3D_v3        | 9.95  | 2.89 (-35%)     | 0.554 | 13.39 (-2.48%)  | 35.93  | 11.61 (-21%)    | 0.537 | 48.1 (-2.43%)    |
| s38584  | 2D           | 55.04 | 13.80 (-)       | 2.472 | 71.31 (-)       | 223.4  | 58.16 (-)       | 2.822 | 284.4 (-)        |
| s38584  | 3D_v1        | 54.35 | 11.27 (-18%)    | 2.515 | 68.13 (-4.46%)  | 223.1  | 49.43 (-15%)    | 2.955 | 275.4 (-3.16%)   |
| s38584  | 3D_v2        | 51.43 | 11.24 (-19%)    | 2.558 | 65.23 (-8.52%)  | 214.0  | 47.01 (-19%)    | 2.870 | 263.9 (-7.21%)   |
| s38584  | 3D_v3        | 51.83 | 11.41 (-17%)    | 2.559 | 65.80 (-7.73%)  | 220.7  | 47.47 (-18%)    | 2.865 | 271.0 (-4.70%)   |
| FFT64   | 2D           | 352   | 160.8 (-)       | 17.4  | 530 (-)         | 713    | 324.7 (-)       | 17.7  | 1,056 (-)        |
| FFT64   | 3D_v1        | 377   | 128.8 (-19.90%) | 17.0  | 523 (-1.32%)    | 757    | 259.3 (-20.14%) | 18.2  | 1,034 (-2.08%)   |
| FFT64   | 3D_v2        | 356   | 125.5 (-21.95%) | 17.0  | 498 (-6.04%)    | 720    | 252.8 (-22.14%) | 17.8  | 990 (-6.25%)     |
| FFT64   | 3D_v3        | 360   | 125.6 (-21.89%) | 16.9  | 503 (-5.09%)    | 747    | 256.5 (-21%)    | 17.9  | 1,021 (-3.31%)   |
| FFT128  | 2D           | 2,365 | 924.8 (-)       | 119.5 | 3,510 (-)       | 7,891  | 2,859 (-)       | 144.9 | 10,895 (-)       |
| FFT128  | 3D_v1        | 2,340 | 750.3 (-18.87%) | 119.4 | 3,210 (-8.55%)  | 6,936  | 2,351 (-17.77%) | 146.5 | 9,437 (-13.38%)  |
| FFT128  | 3D_v2        | 2,309 | 726.0 (-21.50%) | 118.8 | 3,154 (-10.14%) | 6,863  | 2,309 (-19.24%) | 145.9 | 9,318 (-14.47%)  |
| FFT128  | 3D_v3        | 2,302 | 749.7 (-18.93%) | 119.9 | 3,172 (-9.63%)  | 6,884  | 2,333 (-18.40%) | 145.3 | 9,363 (-14.06%)  |
| FFT256  | 2D           | 4,852 | 1,956 (-)       | 252.2 | 7,060 (-)       | 9,674  | 3,919 (-)       | 288.1 | 13,881 (-)       |
| FFT256  | 3D_v1        | 5,359 | 1,603 (-18%)    | 254.8 | 7,217 (1.76%)   | 10,594 | 3,237 (-17.40%) | 286.3 | 14,117 (1.7%)    |
| FFT256  | 3D_v2        | 5,176 | 1,571 (-19.7%)  | 254.3 | 7,001 (-0.83%)  | 10,286 | 3,210 (-18.09%) | 287.9 | 13,784 (-0.69%)  |
| FFT256  | 3D_v3        | 5,355 | 1,622 (-17.1%)  | 254.4 | 7,232 (2.44%)   | 10,646 | 3,288 (-16.10%) | 285.5 | 14,221 (2.45%)   |

number of routing tracks in each cell is increased from 8 to 10, the internal power consumption first decreases (due to shorter wirelength), then slightly increases. Note that the change in internal power in 3D designs depends upon the specific circuit. For example, for some of the benchmarks (such as s38584 and FFT128), the internal power consumed by the 3D versions is slightly less, whereas for others (such as SIMON, FFT64, and FFT256), the 3D versions consume slightly more internal power than the 2D counterpart. This variation depends upon the number of times each cell is used in the circuit since the 3D cell power may increase or decrease depending upon the specific cell (see Table II). For example, comparing the cell types and cell counts of FFT128 and FFT256 (listed in Tables III and IV), FFT256 contains a significantly higher number of OR, NOR, and MUX gates. According to Table II, the 3D versions of these gates consume more power as compared to the traditional 2D gates. Since the internal power is still the dominant power component in these benchmarks, this fluctuation significantly affects the overall power savings, despite a consistent and reasonable reduction

#### TABLE VIII

COMPARISON OF TIMING CHARACTERISTICS IN 2D AND MONOLITHIC 3D TECHNOLOGIES WITH 8 (*Mono3D\_v1*), 9 (*Mono3D\_v2*), AND 10 (*Mono3D\_v3*) ROUTING TRACKS IN EACH CELL. *WS*, *WNS*, AND *TNS* REFER, RESPECTIVELY, TO WORST SLACK, WORST NEGATIVE SLACK, AND TOTAL NEGATIVE SLACK

| Circuit | Design style | WS at 500 MHz (ns) | WNS at 2 GHz / 1.5 GHz (ns) | TNS at 2 GHz / 1.5 GHz (ns) | Number of violations at 2 GHz / 1.5 GHz |
|---------|--------------|--------|--------|----------|--------|
| SIMON   | 2D           | 0.326  | 0.051  | 0        | 0      |
| SIMON   | 3D_v1        | 0.296  | -0.03  | -1.404   | 98     |
| SIMON   | 3D_v2        | 0.429  | 0.048  | 0        | 0      |
| SIMON   | 3D_v3        | 0.468  | 0.045  | 0        | 0      |
| s38584  | 2D           | 0.760  | -0.216 | -10.98   | 252    |
| s38584  | 3D_v1        | 0.736  | -0.224 | -16.68   | 330    |
| s38584  | 3D_v2        | 0.767  | -0.205 | -5.459   | 169    |
| s38584  | 3D_v3        | 0.762  | -0.212 | -8.665   | 183    |
| FFT64   | 2D           | 0.561  | -0.070 | -151.799 | 1,107  |
| FFT64   | 3D_v1        | 0.476  | -0.097 | -192.580 | 2,656  |
| FFT64   | 3D_v2        | 0.606  | -0.055 | -125.390 | 1,792  |
| FFT64   | 3D_v3        | 0.602  | -0.068 | -136.983 | 1,901  |
| FFT128  | 2D           | 0.21   | -0.104 | -516.100 | 8,097  |
| FFT128  | 3D_v1        | 0.16   | -0.116 | -725.474 | 9,118  |
| FFT128  | 3D_v2        | 0.23   | -0.091 | -302.582 | 6,221  |
| FFT128  | 3D_v3        | 0.22   | -0.097 | -319.003 | 6,574  |
| FFT256  | 2D           | 0.145  | -0.152 | -1029.1  | 18,404 |
| FFT256  | 3D_v1        | -0.197 | -0.686 | -4626.1  | 30,087 |
| FFT256  | 3D_v2        | 0.024  | -0.202 | -1322.4  | 19,431 |
| FFT256  | 3D_v3        | 0.015  | -0.237 | -2392.5  | 26,919 |

in switching power in all of the benchmarks. For example, up to 10% (at 500 MHz) and 14% (at 1.5 GHz) reduction in overall power consumption is achieved for FFT128. For FFT256, however, the power reduction is only 0.8% (at 500 MHz) and 0.7% (at 1.5 GHz) due to an increase in the internal power component of the 3D versions.

# C. Timing Characteristics

The timing characteristics of the 2D and monolithic 3D circuits are compared in Table VIII where the worst slack (WS), worst negative slack (WNS), total negative slack (TNS), and number of timing violations are listed at both 500 MHz (with no timing violations) and 1.5/2 GHz (with timing violations).
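For readers less familiar with these metrics, the following minimal sketch shows how they are commonly derived from per-endpoint slack values in static timing analysis. It follows the usual conventions (the worst slack is the minimum endpoint slack, TNS is the sum of all negative slacks, and the violation count is the number of endpoints with negative slack) and is not taken from the tool flow used in this work. The slack values are hypothetical.

```python
def timing_summary(endpoint_slacks_ns):
    """Summarize endpoint slacks (in ns) using common static timing analysis conventions."""
    negative = [slack for slack in endpoint_slacks_ns if slack < 0]
    return {
        # Minimum slack over all endpoints; this is the WNS when it is negative.
        "worst_slack": min(endpoint_slacks_ns),
        # Sum of the negative slacks only; zero when every endpoint meets timing.
        "total_negative_slack": sum(negative),
        # Number of endpoints that fail the timing constraint.
        "violations": len(negative),
    }


# Hypothetical endpoint slacks in ns under a relaxed and a tight clock constraint.
relaxed = timing_summary([0.30, 0.12, 0.05])
tight = timing_summary([0.10, -0.03, -0.02, 0.04])

print(relaxed)
print(tight)
```

Under this convention a design with no violations has a TNS of zero and a positive worst slack, which is consistent with the positive entries that appear in the WNS column of Table VIII for benchmarks without violations.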

An important observation from Table VIII is that the timing characteristics are degraded when monolithic 3D circuits with 8 routing tracks (*Mono3D\_v1*) are considered. This degradation is due to 1) higher average cell delay for monolithic 3D technology and 2) routing congestion. However, if the number of routing tracks in each cell is increased to 9 (*Mono3D\_v2*), the timing characteristics of all of the 3D benchmarks are enhanced. Furthermore, some of the 3D benchmarks (SIMON, s38584, and FFT128) outperform the 2D counterparts at both 500 MHz and 1.5/2 GHz operating frequencies. At 500 MHz, the positive slack increases. At 1.5/2 GHz, the WNS, TNS, and number of violations are reduced. In contrast, for FFT64 and FFT256 (where the number of OR and NOR gates is significantly higher), the 3D versions cannot outperform the 2D version due to the higher cell-level delays of the 3D OR and 3D NOR gates (see Table II). Similar to the power characteristics, *Mono3D\_v2* achieves the best timing characteristics among the 3D versions for each benchmark.

The results obtained from these benchmarks are summarized in Fig. 8 when the clock frequency is 1.5/2 GHz. According to this figure, if a sufficient number of routing tracks is not provided during cell library development (as in *Mono3D\_v1*), the timing characteristics of the 3D circuits degrade as compared to 2D designs due to routing congestion. *Mono3D\_v2* (9 routing tracks in each cell) achieves the largest reduction in power while also enhancing the timing characteristics. If the number of routing tracks is further increased to 10, the power and timing characteristics slightly degrade due to longer overall wirelength. Thus, for relatively low performance applications with relaxed timing constraints, monolithic 3D technology can be leveraged to achieve the highest reduction in footprint (and therefore cost) by developing highly dense 3D cell layouts. For high performance applications with tighter timing constraints, however, interconnects and the routing process play a significant role in system timing and power consumption. In this case, metal resources for routing (such as the number of tracks) should be carefully considered to alleviate routing congestion and prevent timing degradation at the expense of slightly reduced savings in footprint.

# D. Clock Tree Characteristics

Since clock networks play a significant role in both performance and power in large-scale circuits, the clock tree synthesis (CTS) results of one of the FFT cores are also reported to quantify the benefits of monolithic 3D technology in clocking. The clock trees obtained by *Encounter* for both 2D and 3D technologies (*Mono3D\_v2*) are shown in Fig. 9. The number of sinks for both designs is 96,796. Both the skew and slew constraints are set to 100 ps. The smaller 3D footprint is helpful for enhancing primary clocking characteristics, as listed in Table IX. Due to reduced footprint, the number of clock buffers is reduced from 8,231 to 6,427, which reduces the clock internal power by approximately 28%. The clock wirelength is also reduced by 28% and the clock net power



Fig. 8. Summary of the results obtained from the benchmarks operating at 1.5/2 GHz: (a) footprint, (b) wirelength, (c) power consumption, and (d) worst negative slack.



Fig. 9. Clock tree floorplan of a 128-point FFT core with approximately 97K flip-flops: (a) traditional 2D technology and (b) monolithic 3D technology with 9 routing tracks in each cell (*Mono3D\_v2*).

#### TABLE IX

COMPARISON OF PRIMARY CLOCK TREE CHARACTERISTICS OF 2D 128-POINT FFT CORE AND MONOLITHIC 3D 128-POINT FFT CORE

| Characteristic (128-point FFT core) | 2D     | 3D_v2  |
|-------------------------------------|--------|--------|
| Number of sinks                     | 96,796 | 96,796 |
| Number of buffers                   | 8,231  | 6,427  |
| Clock wirelength (mm)               | 732    | 527    |
| Clock cap (pF)                      | 1,052  | 719    |
| Max. buffer slew (ps)               | 110.1  | 98.7   |
| Max. sink slew (ps)                 | 107.8  | 95.1   |
| Skew (ps)                           | 51.7   | 45.5   |
| Clock internal power (mW)           | 2,419  | 1,738  |
| Clock switching power (mW)          | 2,430  | 1,767  |
| Clock leakage power (mW)            | 9.2    | 7.3    |
| Overall clock power (mW)            | 4,858  | 3,512  |

is reduced by approximately 27%. The overall clock power is reduced by 28%.
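The percentages quoted for the clock network follow directly from Table IX, as the short sketch below illustrates. The values are copied from the table, while the dictionary keys are illustrative names.

```python
# Values copied from Table IX: metric -> (2D, Mono3D_v2).
clock_tree = {
    "buffers": (8231, 6427),
    "wirelength_mm": (732, 527),
    "internal_power_mw": (2419, 1738),
    "switching_power_mw": (2430, 1767),
    "overall_power_mw": (4858, 3512),
}

# Percent reduction of the 3D clock tree relative to the 2D clock tree.
reduction_percent = {
    metric: 100.0 * (value_2d - value_3d) / value_2d
    for metric, (value_2d, value_3d) in clock_tree.items()
}

for metric, value in reduction_percent.items():
    print(f"{metric}: {value:.1f}% lower in 3D")
```

The recomputed reductions in clock wirelength, internal power, switching (net) power, and overall clock power round to the 28%, 28%, 27%, and 28% figures stated in the text.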

The clock tree of the 2D design exhibits slew violations, which are fixed in the 3D clock network (due to shorter and therefore less resistive clock nets). The global skew decreases from 51.7 ps in the 2D FFT core to 45.5 ps in the 3D FFT core, implemented by *Mono3D\_v2*. The 3D FFT core also exhibits



Fig. 10. Comparison of clock insertion delay histograms for 2D 128-point FFT core and monolithic 3D 128-point FFT core.

lower clock insertion delays, as shown in Fig. 10 where the insertion delay histograms are compared for 2D and 3D designs. Lower insertion delays are helpful in reducing the variation-induced skew or corner-to-corner skew variation.

# V. CONCLUSION

An open source transistor-level monolithic 3D cell library is developed and integrated into a digital design flow. The proposed library is used to investigate several important characteristics of monolithic 3D ICs such as 1) footprint, timing, and power consumption at both relaxed and tight timing constraints, 2) routing congestion, 3) the effect of the number of routing tracks in each cell, and 4) clock tree characteristics. The results of a 128-point FFT core operating at 1.5 GHz demonstrate that the monolithic 3D technology can reduce the footprint and overall power consumption by, respectively, 38% and 14%. The effect of routing congestion on timing characteristics is stronger in monolithic 3D technology, where the cell-level number of routing tracks plays an important role. An optimum number of routing tracks exists that achieves the largest improvements in both power and timing characteristics.

The entire proposed library and related files for tool integration are publicly available to facilitate future research in some of the critical design aspects of monolithic 3D technology such as thermal integrity and design-for-test methodologies as well as manufacturing aspects such as the effects of tier-specific device characteristics on system-level performance.

# REFERENCES

- [1] E. Salman and E. G. Friedman, *High Performance Integrated Circuit Design*. New York, NY, USA: McGraw-Hill, Aug. 2012.
- [2] V. F. Pavlidis, I. Savidis, and E. G. Friedman, *Three-Dimensional Integrated Circuit Design*, 2nd ed. San Mateo, CA, USA: Morgan Kaufmann, 2017.
- [3] D. H. Kim *et al.*, "3D-MAPS: 3D massively parallel processor with stacked memory," in *Proc. IEEE Int. Solid-State Circuits Conf.*, Feb. 2012, pp. 188–190.
- [4] V. F. Pavlidis and E. G. Friedman, "Interconnect-based design methodologies for three-dimensional integrated circuits," *Proc. IEEE*, vol. 97, no. 1, pp. 123–140, Jan. 2009.
- [5] S. M. Satheesh and E. Salman, "Power distribution in TSV-based 3-D processor-memory stacks," *IEEE Trans. Emerg. Sel. Topics Circuits Syst.*, vol. 2, no. 4, pp. 692–703, Dec. 2012.
- [6] L. Brunet *et al.*, "First demonstration of a CMOS over CMOS 3D VLSI CoolCube integration on 300 mm wafers," in *Proc. IEEE Symp. VLSI Technol.*, Jun. 2016, pp. 1–2.
- [7] C. Erdmann *et al.*, "A heterogeneous 3D-IC consisting of two 28 nm FPGA die and 32 reconfigurable high-performance data converters," *IEEE J. Solid-State Circuits*, vol. 50, no. 1, pp. 258–269, Jan. 2015.
- [8] D. H. Kim, K. Athikulwongse, and S. K. Lim, "A study of through-silicon-via impact on the 3D stacked IC layout," in *Proc. ACM Int. Conf. Comput.-Aided Design*, Nov. 2009, pp. 674–680.
- [9] I. Savidis and E. G. Friedman, "Closed-form expressions of 3-D via resistance, inductance, and capacitance," *IEEE Trans. Electron Devices*, vol. 56, no. 9, pp. 1873–1881, Sep. 2009.
- [10] H. Wang, M. H. Asgari, and E. Salman, "Compact model to efficiently characterize TSV-to-transistor noise coupling in 3D ICs," *Integr., VLSI J.*, vol. 47, no. 3, pp. 296–306, Jun. 2014.
- [11] O. Billoint *et al.*, "From 2D to monolithic 3D: Design possibilities, expectations and challenges," in *Proc. ACM Int. Symp. Phys. Design*, Mar. 2015, p. 127.
- [12] S. Wong, A. El-Gamal, P. Griffin, Y. Nishi, F. Pease, and J. Plummer, "Monolithic 3D integrated circuits," in *Proc. IEEE Int. Symp. VLSI Technol., Syst. Appl.*, Apr. 2007, pp. 1–4.
- [13] M. Vinet *et al.*, "3D monolithic integration: Technological challenges and electrical results," *Microelectron. Eng.*, vol. 88, no. 4, pp. 331–335, Apr. 2011.
- [14] P. Batude *et al.*, "Advances in 3D CMOS sequential integration," in *Proc. IEEE Int. Electron Devices Meeting*, Dec. 2009, pp. 1–4.
- [15] F. Fenouillet-Beranger *et al.*, "FDSOI bottom MOSFETs stability versus top transistor thermal budget featuring 3D monolithic integration," *Solid-State Electron.*, vol. 113, pp. 2–8, Nov. 2015.
- [16] B. Rajendran *et al.*, "Low thermal budget processing for sequential 3-D IC fabrication," *IEEE Trans. Electron Devices*, vol. 54, no. 4, pp. 707–714, Apr. 2007.
- [17] P. Batude *et al.*, "GeOI and SOI 3D monolithic cell integrations for high density applications," in *Proc. IEEE Int. Symp. VLSI Technol.*, Jun. 2009, pp. 166–167.
- [18] S. A. Panth, K. Samadi, Y. Du, and S. K. Lim, "Design and CAD methodologies for low power gate-level monolithic 3D ICs," in *Proc. IEEE/ACM Int. Symp. Low Power Electron. Design*, Aug. 2014, pp. 171–176.
- [19] C. Liu and S. K. Lim, "A design tradeoff study with monolithic 3D integration," in *Proc. 13th Int. Symp. Quality Electron. Design*, Mar. 2012, pp. 529–536.
- [20] P. Batude *et al.*, "3D sequential integration opportunities and technology optimization," in *Proc. IEEE Int. Interconnect Technol. Conf., Adv. Metall. Conf.*, May 2014, pp. 373–376.
- [21] Mono3D, Open Source Cell Library for Transistor-Level Monolithic 3D Integration. Accessed: Mar. 2017. [Online]. Available: http://www.ece.stonybrook.edu/~emre/Mono3D.zip
- [22] Y.-J. Lee, D. Limbrick, and S. K. Lim, "Power benefit study for ultra-high density transistor-level monolithic 3D ICs," in *Proc. 50th ACM/EDAC/IEEE Design Autom. Conf.*, May 2013, pp. 104:1–104:10.
- [23] J. Shi, D. Nayak, M. Ichihashi, S. Banna, and C. A. Moritz, "On the design of ultra-high density 14 nm FinFET based transistor-level monolithic 3D ICs," in *Proc. IEEE Comput. Soc. Annu. Symp. VLSI*, Jul. 2016, pp. 449–454.

- [24] C. Yan, S. Kontak, H. Wang, and E. Salman, "Open source cell library Mono3D to develop large-scale monolithic 3D integrated circuits," in *Proc. IEEE Int. Symp. Circuits Syst.*, May 2017, pp. 2581–2584.
- [25] C. Yan, J. Dofe, S. Kontak, Q. Yu, and E. Salman, "Hardware-efficient logic camouflaging for monolithic 3D ICs," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, to be published. [Online]. Available: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8026129
- [26] J. Dofe, C. Yan, S. Kontak, E. Salman, and Q. Yu, "Transistor-level camouflaged logic locking method for monolithic 3D IC security," in *Proc. IEEE Asian HOST*, Dec. 2016, pp. 1–6.
- [27] FreePDK45. Accessed: Mar. 2017. [Online]. Available: http://www.eda.ncsu.edu/wiki/FreePDK45:Contents
- [28] Nangate Cell Library. Accessed: Mar. 2017. [Online]. Available: http://www.nangate.com
- [29] Mentor Graphics Calibre. Accessed: Mar. 2017. [Online]. Available: https://www.mentor.com/products/icnanometerdesign/verificationsignoff/
- [30] Cadence Encounter Library Characterizer (ELC). Accessed: Mar. 2017. [Online]. Available: http://www.cadence.com/products/di/library-characterizer/pages/default.aspx
- [31] Synopsys HSPICE. Accessed: Mar. 2017. [Online]. Available: https://www.synopsys.com/tools/Verification/AMSVerification/Circuit-Simulation/HSPICE/Pages/default.aspx
- [32] Synopsys Design Compiler. Accessed: Mar. 2017. [Online]. Available: http://www.synopsys.com/Tools/Implementation/RTLSynthesis/Design-Compiler/Pages/default.aspx
- [33] P. Milder, F. Franchetti, J. C. Hoe, and M. Püschel, "Computer generation of hardware for linear digital signal processing transforms," *ACM Trans. Design Autom. Electron. Syst.*, vol. 17, no. 2, pp. 15:1–15:33, Apr. 2012.
- [34] Cadence Encounter. Accessed: Mar. 2017. [Online]. Available: https://www.cadence.com/content/cadencewww/global/enUS/home/tools-/digitaldesignandsignoff/hierarchicaldesignandfloorplanning/innovus-implementationsystem.html



**Chen Yan** (S'17) received the B.S. degree in computer engineering from the University of Science and Technology Beijing, Beijing, China, in 2013. He is currently pursuing the Ph.D. degree in electrical engineering with Stony Brook University, Stony Brook, NY, USA. Since 2017, he has been with GLOBALFOUNDRIES as a Graduate Intern. His current research is on monolithic 3-D integrated circuits.



**Emre Salman** (S'03–M'10–SM'17) received the B.S. degree in microelectronics engineering from Sabanci University, Istanbul, Turkey, in 2004, and the M.S. and Ph.D. degrees in electrical engineering from the University of Rochester, NY, USA, in 2006 and 2009, respectively.

He was with STMicroelectronics, Synopsys, and NXP Semiconductors, where he was involved in research in the fields of custom circuit design, timing, and noise analysis. Since 2010, he has been with the Department of Electrical and Computer Engineering, Stony Brook University (SUNY), NY, USA, where he is currently an Associate Professor and the Director of the Nanoscale Circuits and Systems Laboratory. He is the lead author of a comprehensive tutorial book *High Performance Integrated Circuit Design* (McGraw-Hill, 2012, Chinese translation, 2015). He has also authored three book chapters and over 60 papers in refereed IEEE/ACM journals and conferences, and holds two issued and two pending U.S. patents. His broad research interests include analysis, modeling, and design methodologies for high performance and energy efficient integrated circuits with emphasis on power, signal, and sensing integrity.

Dr. Salman received the National Science Foundation Faculty Early Career Development Award in 2013 and the Outstanding Young Engineer Award from IEEE Long Island, NY, USA, in 2014. He also received multiple outreach initiative awards from the IEEE Circuits and Systems Society. He is a SUNY Inaugural Discovery Prize Finalist. He served on the Editorial Board of the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS. He currently serves as the Americas Regional Editor for the *Journal of Circuits, Systems and Computers* and on the organizational/technical committees of various IEEE and ACM conferences.