Accurate Quantitative Physics-of-Failure Approach to Integrated Circuit Reliability
Modern electronics typically consist of microprocessors and other complex integrated circuits (ICs) such as FPGAs, ADCs, and memory. Like other components on a printed circuit board, they are susceptible to electrical, mechanical and thermal modes of failure, but due to their materials, complexity and roles within a circuit, accurately predicting a failure rate has become difficult, if not impossible. Development of these critical components has conformed to Moore's Law, under which the number of transistors on a die doubles approximately every two years. This trend has been sustained over the last four decades through reductions in transistor size, creating faster, smaller ICs with greatly reduced power dissipation. Although this is great news for developers and users of high performance equipment, including consumer products and analytical instrumentation, a crucial yet often overlooked reliability risk has emerged. Semiconductor failure mechanisms, which are far more severe at these minute feature sizes (tens of nanometers), result in higher failure rates, shorter device lifetimes and unanticipated early device wearout. This is of special concern to users whose requirements include long service lifetimes and rugged environmental conditions, such as the aerospace, defense, and other high performance (ADHP) industries. To that end, the Aerospace Vehicle Systems Institute (AVSI) has conducted research in this area, and DfR Solutions has performed much of the work as a contractor to AVSI.
Physics-of-Failure (PoF) knowledge and an accurate mathematical approach which utilizes semiconductor formulae, industry-accepted failure mechanism models, and device functionality can assess the reliability of those integrated circuits vital to system stability. Currently, four semiconductor failure mechanisms that exist in silicon-based ICs are analyzed: Electromigration (EM), Time Dependent Dielectric Breakdown (TDDB), Hot Carrier Injection (HCI), and Negative Bias Temperature Instability (NBTI). Mitigation of these inherent failure mechanisms, including those considered wearout, is possible only when reliability can be quantified. Algorithms have been folded into a software application not only to calculate a failure rate, but also to produce confidence intervals and lifetime curves, using both steady state and wearout failure rates, for the IC under analysis. The algorithms have been statistically verified through testing and employ data and formulae from semiconductor materials (including technology node parameters), circuit fundamentals, transistor behavior, circuit design and fabrication processes. Initial development has yielded a user-friendly software module with the ability to address silicon-based integrated circuits of the 0.35 µm, 0.25 µm, 0.18 µm, 0.13 µm and 90 nm technology nodes.
INTRODUCTION In the ADHP industries, there is considerable interest in assessing the long term reliability of electronics whose anticipated lifetimes extend beyond those of consumer "throw away" electronics. Because complex integrated circuits within their designs may face wearout or even failure within the period of useful life, it is necessary to investigate the effects of use and environmental conditions on these components. The main concern is that submicron process technologies drive device wearout into the regions of useful life well before wearout was initially anticipated to occur. The continuous scaling down of semiconductor feature sizes raises challenges in electronic circuit reliability prediction. Smaller and faster circuits cause higher current densities, lower voltage tolerances and higher electric fields, which make the devices more vulnerable to early failure. Emerging new generations of electronic devices require improved tools for reliability prediction in order to investigate new manifestations of existing failure mechanisms, such as NBTI, EM, HCI, and TDDB.
Working with AVSI, DfR Solutions has developed an integrated circuit (IC) reliability calculator using a multiple failure mechanism approach. This approach successfully models the simultaneous degradation behaviors of multiple failure mechanisms on integrated circuit devices. The multiple mechanism model extrapolates independent acceleration factors for each semiconductor mechanism of concern based on the transistor stress states within each distinct functional group. Integrated circuit lifetime is calculated from semiconductor materials and technology node, IC complexity, and operating conditions.
A major input to the tool is integrated circuit complexity. This characteristic has been approached by using specific functionality cells called functional groups. The current set of functional groups covers memory-based devices and analog-to- digital conversion circuitry. Technology node process parameters, functional groups and their functionality, and field/test operating conditions are used to perform the calculations. Prior work verified the statistical assessment of the algorithms for aerospace electronic systems and confirmed that no single semiconductor failure mechanism dominates failures in the field.
Two physics-of-failure approaches are used within the tool to determine each of four semiconductor failure mechanisms’ contribution to the overall device failure rate. The tool calculates a failure rate and also produces confidence intervals and a lifetime curve, using both steady state and wearout failure rates, for the part under analysis.
Reliability prediction simulations are the most powerful tools developed over the years to cope with these challenging demands. Simulations may provide a wide range of predictions, starting from the lower-level treatment of physics-of-failure (PoF) mechanisms up to high-level simulations of entire devices As with all simulation problems, primary questions need to be answered, such as: “How accurate are the simulation results in comparison to in-service behavior?” and “What is the confidence level achieved by the simulations?” Thus the validation and calibration of the simulation tools becomes a most critical task. Reliability data generated from field failures best represents the electronic circuit reliability in the context of the target system/application. Field failure rates represent competing failure mechanisms' effects and include actual stresses, in contrast to standard industry accelerated life tests.
In this paper, the failure rates of recorded field data from 2012 to 2018 were determined for various device process technologies and feature sizes (or “technology nodes”). These failure rates are used to verify the PoF models and a competing failure approach, as implemented in the software. Comparison of the actual and simulated failure rates shows a strong correlation. Furthermore, comparing the field failure rates with those obtained from the standard industry High Temperature Operating Life (HTOL) test reveals the inadequacy of the HTOL to predict integrated circuit (IC) failure rates.
Background Semiconductor life calculations were performed using an integrated circuit reliability prediction software tool developed by DfR Solutions in cooperation with AVSI. The software uses component accelerated test data and physics-of-failure (PoF) based die-level failure mechanism models to calculate the failure rate of integrated circuit components during their useful lifetime. Integrated circuit complexity and transistor behavior are contributing factors to the calculation of the failure rate. Four failure mechanisms are modeled in this software using readily available, published models from the semiconductor reliability community and NASA/Jet Propulsion Laboratory (JPL) as well as research from the University of Maryland, College Park. These mechanisms are Electromigration (EM), Time Dependent Dielectric Breakdown (TDDB), Hot Carrier Injection (HCI) and Negative Bias Temperature Instability (NBTI). Taking the reliability bathtub curve into consideration, research shows that EM and TDDB are considered steady-state failure modes (constant random failure) whereas HCI and NBTI have wearout behavior (increasing failure rate).
Each of these failure mechanisms is driven by a combination of temperature, voltage, current, and frequency. Traditional reliability predictions assume temperature, and sometimes voltage, as the only accelerators of failure. Each failure mechanism affects the on-die circuitry in a unique way. Therefore, each is modeled independently and later combined with the others using associated proprietary weighting factors. This software uses circuit complexity, test and field operating conditions, derating values, and transistor behavior as mathematical parameters. In general, there is no single dominant failure mechanism: depending on the use conditions, any of the four mechanisms can make the largest contribution to a specific component's failure rate.
Failure rates were calculated using a specialized set of time to failure (TTF) equations. For a constant failure rate, time to failure is the reciprocal of the failure rate. Mean time to failure (MTTF) is the mean or expected value of the probability distribution defined as a function of time. MTTF is used with non-repairable systems, like an integrated circuit, which can fail only once. For repairable systems, like a re-workable printed circuit board or assembly, mean time between failures (MTBF) is used as the metric. Repairable systems can fail several times, and in general it takes more time for the first failure to occur than for subsequent failures. The mathematics are the same for MTTF and MTBF. Since this analysis method is for integrated circuits, which can be replaced on the assembly but are themselves non-repairable circuitry, MTTF applies.
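For the constant (steady-state) portion of the bathtub curve, this reciprocal relationship can be sketched in a few lines; the 100 FIT figure below is an illustrative assumption, not a value from this study:

```python
# Constant-failure-rate sketch: MTTF is the reciprocal of the failure rate.
# The 100 FIT starting value is purely illustrative (an assumption, not data).
fit = 100.0                     # failures per 1e9 device-hours (FIT)
lam = fit / 1e9                 # failure rate in failures per hour
mttf_hours = 1.0 / lam          # mean time to failure of a non-repairable part
mttf_years = mttf_hours / 8760  # 8760 hours per year

print(mttf_hours)          # 10000000.0 hours
print(round(mttf_years))   # 1142 years
```

This also illustrates why FIT (failures in 10⁹ device-hours) is the customary unit: even modest FIT values correspond to MTTF figures far beyond any single device's service life, which only makes sense as a population-level rate.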
Reliability Effects of Scaling Smaller and faster circuits have higher current densities, lower voltage tolerances and higher electric fields, which make integrated circuits more vulnerable to electrically based failure. New generations of electronic devices and circuits demand new means of investigation to check for new problems or new versions of old issues. New devices with new designs and materials require failure analysis to develop new models for both the individual failure mechanisms and the possible interactions between them. Understanding these potential interactions is particularly important and requires serious investigation.
In the sub-micrometer region, the demand for higher performance conflicts with reliability, and proper tradeoffs in the early design stage are a dominating challenge. A quick and effective reliability analysis (like the one performed for this project) yields both a lifetime estimate for the device and a failure mechanism dominance hierarchy. Using reliability knowledge and improvement techniques, higher reliability integrated circuits can be developed in two ways: suppression of die-level failure mechanisms and adjustment of circuit structures. The former has been realized for electromigration (through Black's equation) using design techniques; however, adjusting transistor sizes runs counter to the industry-wide goal of device scaling. Redesign of transistor architecture and circuit schematics is too resource intensive, in both time and cost, to be the corrective action for reliability concerns. The end user must decide what reliability goals need to be achieved; moreover, it has become the user's responsibility to determine how to achieve those goals without any influence on component design, manufacturing, or quality. This type of reliability assessment is crucial for the end user, as adjustments to electrical conditions and thermal management seem to be the only ways to improve the reliability of modern technology nodes. The tradeoff in performance can be significantly reduced by using devices from larger technology nodes, as they provide larger operating tolerances and the architectures necessary to reduce the effects of multiple mechanism degradation behaviors.
As technology shifts to smaller nodes, the operating voltage of the device is not reduced proportionally with the gate oxide thickness, which results in a higher electric field; moreover, the increasing density of transistors on a chip causes more power dissipation and in turn increases operating temperature through self-heating. Additionally, introducing nitrogen into the dielectric to aid in gate leakage reduction and boron penetration control has its own effect: linearly worsening NBTI and other modes of degradation. Because the threshold voltage of new devices is not being reduced proportionally to the operating voltage, there is more degradation for the same threshold voltage shift.
RELIABILITY MODELING AND SIMULATION History There has been steady progress over the years in the development of a physics-of-failure understanding of the effects that various stress drivers have on semiconductor structure performance and wearout. This has resulted in better modeling and simulation capabilities. Early investigators sought correlations between the degradation of single device parameters (e.g. Vth, Vdd or Isub) and the degradation of parameters related to circuit performance such as the delay between read and write cycles. It was quickly realized that the degradation of a broad range of parameters describing device performance had to be considered, rather than just a single parameter. Most simulation tools tend to simulate a single failure mechanism, such as Electromigration. System-level simulators attempting to integrate several mechanisms into a single model have been developed as well. The latest circuit design tools, such as Cadence Ultrasim and Mentor Graphics Eldo, have integrated reliability simulators. These simulators model the most significant physical failure mechanisms and help designers address the lifetime performance requirements. However, inadequacies, such as complexity in the simulation of large-scale circuits and a lack of prediction of wearout mechanisms, hinder broader adoption of these tools.
Validation Concerns Reliability simulations are commonly based on combinations of PoF models, empirical data and statistical models developed over the years by different research groups and industries. The inevitable consequence of a wide range of models and approaches is a lack of confidence in the predictions obtained from any given model. From the point of view of a real-world end-user, single failure mechanism modeling and simulation is less meaningful than system level reliability.
Validation and calibration of simulations is accomplished by comparing simulation predictions with empirical data obtained from laboratory tests or by analyzing field data (or both). To evaluate the reliability of their devices, semiconductor manufacturers use laboratory tests such as environment stress screens (ESS), highly accelerated life testing (HALT), HTOL and other accelerated life tests (ALT). Several concerns cause doubts about the prediction accuracy derived from such tests. The assumption of a single failure mechanism is an inaccurate simplification of actual failure dynamics. Furthermore, ALT tests based on sampling a set of devices have the inherent problem of a lack of statistical confidence in the case of zero observed failures. Finally, ALT tests can only mimic actual field conditions to estimate real-world reliability, and extrapolation from test environmental stresses to field stresses can be misleading.
FAILURE MECHANISMS AND MODELS Introduction The dominant failure mechanisms in Si-based microelectronic devices that are most commonly simulated are EM, TDDB, NBTI and HCI. Other degradation models do exist but are less prevalent. These mechanisms can be generally categorized as either Steady State Failure Modes (EM and TDDB) or Wearout Failure Modes (HCI and NBTI). A brief explanation of each failure mechanism is necessary to understand their contribution to the overall device failure rate.
Steady State Failure Modes Electromigration can lead to interconnect failure in an integrated circuit. It is characterized by the migration of metal atoms in a conductor in the direction of the electron flow. Electromigration causes opens or voids in some portions of the conductor and corresponding hillocks in other portions.
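Electromigration lifetime is commonly modeled with Black's equation, which this paper references later; the sketch below uses illustrative parameter values (assumptions, not values from this study) to show how MTTF scales with current density:

```python
import math

K_BOLTZ = 8.617e-5  # Boltzmann constant, eV/K

def em_mttf(A, J, n, Ea, T):
    """Black's equation sketch: MTTF = A * J^-n * exp(Ea / kT).
    A: process constant, J: current density (A/cm^2),
    n: current-density exponent (typically ~1-2),
    Ea: activation energy (eV), T: junction temperature (K)."""
    return A * J ** (-n) * math.exp(Ea / (K_BOLTZ * T))

# Illustrative: with n = 2, doubling the current density cuts MTTF to a quarter
# (the Arrhenius term cancels in the ratio).
ratio = em_mttf(1.0, 2e6, 2, 0.7, 358.15) / em_mttf(1.0, 1e6, 2, 0.7, 358.15)
print(ratio)  # 0.25
```

The strong dependence on J is why EM design rules focus on current density limits in trace and via geometry.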
Time Dependent Dielectric Breakdown is caused by the formation of a conducting path through the gate oxide to the substrate due to an electron tunneling current. If the tunneling current is sufficient, it will cause permanent damage to the oxide and surrounding material. This damage will result in performance degradation and eventual failure of the device. If the tunneling current remains very low, it will increase the field necessary for the gate to turn on and impede its functionality. The gate dielectric breaks down over a long period of time for devices with larger feature sizes (>90 nm) due to a comparatively low electric field. Although core voltages have been scaled down as feature sizes have shrunk, supply voltages have remained constant. These field strengths are an even greater concern since high fields exacerbate the effects of TDDB.
Wearout Failure Modes Negative Bias Temperature Instability occurs only in pMOS devices stressed with a negative gate bias voltage while at elevated temperatures. Degradation occurs in the gate oxide region allowing electrons and holes to become trapped. Negative bias is driven by smaller electric fields than hot carrier injection, which makes it a more significant threat at smaller technology nodes where increased electric fields are used in conjunction with smaller gate lengths. The interface trap density generated by NBTI is found to be more pronounced with thinner oxides.
Hot Carrier Injection occurs in both nMOS and pMOS devices stressed with drain bias voltage. High electric fields energize the carriers (electrons or holes), which are injected into the gate oxide region. Like NBTI, the degraded gate dielectric can then more readily trap electrons or holes, causing a change in threshold voltage, which in turn results in a shift in the subthreshold leakage current. HCI is accelerated by an increase in bias voltage and is the predominant mechanism at lower stress temperatures. Therefore, hot carrier damage, unlike the other failure mechanisms, will not be accelerated by HTOL tests, which are commonly used for accelerated life testing.
Trending Analysis of Failure Mechanisms A trending study was performed to understand the risk associated with reduction in feature size to facilitate better design decisions and mitigation strategies. Five component types were identified:
1. Microprocessor (MPU) - No change in core voltage per node
2. Microprocessor (CPU) - Different core voltage per node
3. Static Random Access Memory (SRAM)
4. ASIC Microcontroller
5. Analog to Digital Converter (ADC)
One component of each type from each of five technology nodes (0.35 µm, 0.25 µm, 0.18 µm, 0.13 µm and 90 nm) was selected for analysis to show lifetime trends. The components selected are industrial grade. Thermal characteristics were researched and electrical parameters, commonly found on the component's datasheet, were identified for the calculations. Component complexity and electrical characteristics were extracted from corresponding component documentation for use in the calculator. The results of the calculations are used to correlate expected life for each component to technology node for a specified use environment (identified as 65°C).
Research showed that conductor material improvements were made around the 0.18 micron node and later to reduce the effects of electromigration. The resulting trend shows a reduction in failure rate from electromigration. However, as feature sizes decrease, the wearout effects of hot carrier injection and negative bias temperature instability become more prevalent. Two failure models for TDDB are used. Research shows that the applicable electro-chemical models for TDDB follow the dielectric (oxide) thickness at each node. A change occurs when scaling passes 5nm in thickness (corresponding to the 0.13 micron node). TDDB becomes a constant failure rate as oxide thickness approaches 1nm. Above 5nm, however, failure rate increases as the thickness approaches this turn-over point. Differences in trending can be seen for each failure mechanism:
- Electromigration trends (plateauing) correspond to conductor materials which have improved to negate the effects of electromigration
- TDDB has two separate increasing trends above and below an oxide thickness of 5nm
- The effects of both HCI and NBTI increase with feature scaling
The combined failure rate graph for the microprocessor device type is shown in Figure 3. This graph shows that a technology node dependent trend does exist for failure rates. As feature sizes are scaled down, failure rate does increase. The microprocessor device type is a prime example of this trending as the electrical and thermal conditions of these parts are consistent over each technology node.
The science behind the visible trends of each failure mechanism across the technology nodes is worth discussing. Consider 90nm technology as an appropriate starting point for future technology node trending. A main differentiation between 0.35 micron and 90nm is conductor materials. Electromigration is directly influenced by this, which is why industry has made process improvements both to reduce the effects of EM through metallurgical improvements and to develop design rules to mitigate EM. The former increases the activation energy required to start degradation from ~0.6eV to ~0.75eV. However, even with design rules, i.e. Black's equation, the latter can only forestall EM for a finite period of time by ensuring properly laid out geometries of traces and interconnects on die. It is unknown at this point whether any further improvements will be made to conductor metals (Al, Al + Cu) beyond what has already been done. The overall trend of electromigration is reduced lifetimes as a result of feature scaling. However, it can be considered a constant additive to failure rate because the trend is two plateaus (three or more if the material changes again). The failure rate constituent from EM will likely be the same for future nodes.
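The impact of the metallurgical improvement noted above, raising the activation energy from ~0.6 eV to ~0.75 eV, can be roughly quantified with the Arrhenius term; the 65°C use temperature is borrowed from this trending study, and all other factors are assumed equal:

```python
import math

K_BOLTZ = 8.617e-5   # Boltzmann constant, eV/K
T = 65 + 273.15      # assumed 65 C use environment from the trending study, in K

# Ratio of Arrhenius terms: how much longer EM degradation takes to initiate
# when activation energy rises from ~0.6 eV to ~0.75 eV, all else being equal.
improvement = math.exp((0.75 - 0.60) / (K_BOLTZ * T))
print(improvement)   # on the order of 100x at this temperature
```

This rough factor illustrates why the conductor-material change produces the plateau behavior described above rather than a gradual trend.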
Although small compared to EM and TDDB at these nodes, hot carrier injection (HCI) and negative bias temperature instability (NBTI) contributions to failure rates increase as features are scaled down. Hot carrier injection will be almost negligible at high temperatures, i.e. 65°C operating environment. Although 65nm and 45nm process data are not currently included in this calculator, the projected contribution to failure rate of both of these failure mechanisms will surely increase and exceed those of EM and TDDB (which are also constant, as mentioned above).
The inverted trend of TDDB has to do with the voltage tolerances of each component type. Above 5nm oxide thickness (0.25 micron and 0.35 micron nodes), the influence of TDDB is directly related to the electric field on the gate and, in turn, the voltage on it. The industry accepted reliability models at these nodes are different from those at 0.18 micron, 0.13 micron and 90nm. TDDB is trending toward a time independent mechanism and will induce random failures instead of acting as a wearout function; one reason is that all possible failure sites are sufficiently large, within the bounds of the electric field, to cause instantaneous failure. The effects of TDDB from 0.18 micron down to likely the ~32nm node will be a plateau just like EM. It is driven by voltage rather than the magnitude of the electric field on the gate oxide. Therefore, when considering "old technology" as 180, 130 and even 90nm compared to the high performance 65 and 45nm, the effects of both EM and TDDB will be the same. For trending purposes, these contributions could be subtracted out altogether. This would result in increasing failure rate trends for all analyzed device types as feature sizes are scaled down.
THE SIMULATION TOOL Approach The simulation tool used for this research is a web-based application based on recent PoF circuit reliability prediction methodologies that were developed by the University of Maryland (UMD), in cooperation with AVSI, for 130 nm and 90 nm devices. The two methods developed are referred to by the acronyms FaRBS (Failure-Rate-Based SPICE) and MaCRO (Maryland Circuit Reliability-Oriented). FaRBS is a reliability prediction process that uses accelerated test data and PoF based die-level failure mechanism models to calculate the failure rate of integrated circuit components during their useful lifetime. As its name implies, it uses mathematical techniques to determine the failure rate of an integrated circuit. MaCRO contains SPICE (Simulation Program with Integrated Circuit Emphasis) analyses using several different commercial applications, wearout models, system reliability models, lifetime qualification, and reliability and performance tradeoffs in order to achieve system and device reliability trends, prediction and analysis. The simulation tool implements two simplified approaches to compute reliabilities:
- Independent of Transistor Behavior (ITB)
- Dependent on Transistor Behavior (DTB)
These approaches are used to determine each failure mechanism's contribution to overall device failure. The ITB approach makes two assumptions:
1. In each integrated circuit, each failure mechanism has an equal opportunity to initiate a failure
2. Each failure mechanism can take place at a random interval during the time of operation
Conversely, DTB utilizes back-end SPICE simulation to determine these contributions based on transistor behavior and circuit function. Using these mechanism weighting factors, sub-circuit cell counts, and transistor quantities, an overall component failure rate is calculated.
The software assumes that all the parameters for these models are technology node dependent. Although many different intermediate process technologies can be identified for devices under analysis, only major nodes are used. The major CMOS process nodes on the International Technology Roadmap for Semiconductors (ITRS) reflect a trend of 70% scaling every 2-3 years and fall within the projections of Moore's Law. It is assumed that technology qualification (process qualification) has been performed and at least one screening process has been applied before a device is packaged. This reliability prediction covers the steady-state random failure and wearout portions of the bathtub curve.
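The ~70% scaling per node step can be checked directly against the technology nodes analyzed in this paper:

```python
# Node-to-node scaling check for the nodes used in this paper (feature size, nm).
# ITRS major nodes follow roughly 0.7x linear scaling per generation.
nodes_nm = [350, 250, 180, 130, 90]
ratios = [b / a for a, b in zip(nodes_nm, nodes_nm[1:])]
print([round(r, 2) for r in ratios])  # [0.71, 0.72, 0.72, 0.69]
```

Each step is close to 0.7x in linear feature size, which halves the area per transistor and tracks the two-year doubling of Moore's Law.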
Mathematical Theory Each failure mechanism described above has a failure rate, λi, driven by a combination of temperature, voltage, current, and frequency. Parametric degradation of each type affects the on-die circuitry in its own unique way; therefore, the relative acceleration of each one must be defined and averaged for the applied condition. The failure rate contribution of each can be normalized by taking into account the effect of the weighted percentage of that failure rate. We ignore interactions between failure mechanisms for practical reasons, although more rigorous studies of potential interactions could be made in the future. For the four mechanisms of EM, HCI, NBTI and TDDB, the normalized failure rates can be defined as λEM, λHCI, λNBTI and λTDDB, respectively. In order to achieve more accuracy in the overall failure rate estimation, it is useful to split the IC into equivalent function sub-circuits and refer to it as a system of functional group cells, for example: 1 bit of SRAM, 1 bit of DRAM, one stage of a ring oscillator, and select modules within Analog-to-Digital circuitry (ADC). For each functional group type, the failure rate can be defined as a weighted summation of each failure rate type multiplied by a normalization constant for the specific failure mechanism.
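A minimal sketch of this weighted summation follows; the per-mechanism rates, weighting factors, and cell counts are illustrative placeholders, not the calculator's proprietary values:

```python
# Sketch of the weighted-summation failure-rate model. All numbers below are
# illustrative assumptions, not values from the actual calculator.
# lam: normalized per-mechanism failure rates (FIT per functional group cell).
lam = {"EM": 10.0, "TDDB": 8.0, "HCI": 3.0, "NBTI": 4.0}

functional_groups = [
    # (cell count, per-mechanism weighting factors for this group type)
    (1024, {"EM": 0.30, "TDDB": 0.30, "HCI": 0.20, "NBTI": 0.20}),  # e.g. SRAM bits
    (8,    {"EM": 0.25, "TDDB": 0.25, "HCI": 0.25, "NBTI": 0.25}),  # e.g. ADC modules
]

def group_rate(weights):
    """Weighted sum of mechanism failure rates for one functional group cell."""
    return sum(weights[m] * lam[m] for m in lam)

# Device failure rate: sum over groups of (cell count) x (per-cell weighted rate).
device_rate = sum(n * group_rate(w) for n, w in functional_groups)
print(device_rate)  # total device failure rate in FIT
```

In the real tool the weights come from the ITB assumptions or the DTB SPICE analysis described earlier, rather than being chosen by hand.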
Failure Rate Calculation An assembly, the output from a system reliability assessment application, and/or the Bill of Materials (BOM) for an assembly is examined for complex integrated circuits that could be analyzed with the Integrated Circuit Lifetime Prediction calculator. The current limitations of the software are number of functional group types and technology node data beyond 90nm. Thermal characteristics are researched and electrical parameters, commonly found on the component's datasheet, are identified for the calculations. Component complexity and electrical characteristics are extracted from corresponding component documentation for use in the calculator. Thermal parameters for field conditions are acquired through prototyping, direct thermal measurements, and simulations.
Preliminary analysis of the device uses a process that divides an integrated circuit into smaller functional blocks to apply acceleration factors at the most basic level. Equivalent function sub-circuits are used as part of the calculator to organize the complexity of the integrated circuit being analyzed into functional group cells, i.e. one (1) bit of DRAM. As an example, the functional group breakdown for National Semiconductor's 12-bit ADC component, ADC124S021, contains a multiplexer group, track and hold function, control logic, and a 12-bit analog-to-digital converter. The standard procedure for integrated circuit analysis uses high temperature operating life (HTOL) test conditions as the conditions for extrapolation:
- ambient temperature
- supply voltage
- core voltages
The HTOL ambient temperature was calculated for each component (except when supplied by the manufacturer). Thermal information was obtained from the datasheet and/or thermal characteristic documentation and each manufacturer's website. Junction temperature, power dissipation, and junction-to-air thermal resistance are used to calculate ambient temperature. Junction-to-air thermal resistance was obtained either from a component's datasheet or from thermal characteristic databases for package type and size (e.g. the Texas Instruments or NXP Semiconductors websites).

Inputs to the calculator are the test parameters and results from the standard JEDEC accelerated test and information pertaining to the integrated circuit:
- JEDEC Standard No. 47D defaults:
  - 25 devices under test
  - 1000 hour test duration
  - zero (0) failures
  - 50% confidence level
- Pre- or user-defined process node parameters
- Device complexity as broken down by functional groups and quantity of cells within each functional group, where applicable
- Accelerated test information (qty. of failures, qty. of devices, test duration)
- Duty cycle of device (i.e. diurnal cycling or 50%)
- Confidence level of calculation/test
- Field and test conditions (field conditions allow for multiple operating modes):
  - ambient temperature
  - operating frequency
  - core voltage
  - supply voltage
- Failure mechanism parameters and corresponding equations

An extensive field study was conducted in order to demonstrate the simulation tool and verify its prediction capabilities. Reliability predictions were performed based on field failures of DRAM, microcontrollers and microprocessors.
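The default zero-failure test inputs listed above (25 devices, 1000 hours, 0 failures, 50% confidence) translate into an upper-bound failure rate estimate; the sketch below uses the standard zero-failure bound and deliberately omits any acceleration back to field conditions:

```python
import math

def zero_failure_fit(devices, hours, confidence):
    """Upper-bound failure rate (FIT) for an accelerated test with zero failures.
    For zero failures the chi-square bound reduces to -ln(1 - CL) / device-hours."""
    device_hours = devices * hours
    lam = -math.log(1.0 - confidence) / device_hours  # failures per hour
    return lam * 1e9  # convert to FIT

# JEDEC-default inputs quoted in this paper: 25 devices, 1000 h, 0 failures, 50% CL.
# Note: this is the rate at test conditions, before applying any acceleration factor.
print(round(zero_failure_fit(25, 1000, 0.50)))  # 27726 FIT
```

Tests with more failures would use the full chi-square expression (χ² with 2f+2 degrees of freedom), of which the formula above is the zero-failure special case.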
VALIDATION RESULTS The reliability calculations are based on the time domain of the host computer. Except for the microcontroller, which is stressed 24 hours a day, we assume that the memory parts and the processor are partly stressed depending on the user profile. A conservative assumption is that a regular user will stress the parts two shifts/day, i.e. 16 hours/day. The predicted failure rates were calculated using the methodology described in the section The Simulation Tool. Field and test condition inputs were extracted from the components' datasheets. These calculator inputs are shown in Table 2. The functional group distribution of each IC was found in each component's description as provided on the datasheets. Table 3 shows the field failure rates, as obtained using Eq. (8), and the corresponding results of the predictions. Figure 10 shows the comparison of the field failure rates and the prediction results, along with the 95% confidence intervals obtained by the Weibull analysis. It should be noted that the DRAM failure rates presented in Table 3 and Figure 10 refer to critical faults which forced the user to replace the part. They do not reflect specific rates of different kinds of errors (correctable or non-correctable data errors or single event upsets from radiation) but rather a complete part failure rate.
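A field failure rate of the kind reported in Table 3 can be sketched as failures per accumulated device-hours; the fleet size and failure count below are hypothetical, while the 16 hours/day duty follows the two-shift assumption above:

```python
# Sketch of a field failure-rate estimate in FIT. The fleet size, service period,
# and failure count are hypothetical placeholders, not data from this study.
units = 1000          # hypothetical number of fielded units
years = 5             # hypothetical service period
hours_per_day = 16    # two-shift duty assumption from this section
failures = 3          # hypothetical critical (replacement-forcing) failures

device_hours = units * years * 365 * hours_per_day
fit = failures / device_hours * 1e9   # failures per 1e9 device-hours
print(device_hours, round(fit))       # 29200000 103
```

The duty-cycle assumption matters: counting 24 hours/day instead of 16 would understate the operating failure rate by a third.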
Failure Rate based HTOL Qualification Test

JEDEC standards define a family of accelerated life tests, of which the one most commonly used for estimating component failure rate is HTOL (JEDEC Standard number 47D, “Stress-Test-Driven Qualification of Integrated Circuits”). It consists of stressing 77 pieces per qualification lot for an extended time, usually 1,000 hours, at an accelerated voltage and temperature (typically +125°C). The stated purpose of HTOL testing is to simulate device operation at elevated temperatures and higher-than-nominal operating voltages, providing sufficient acceleration to simulate many years of operation at ambient temperature (typically +55°C). The data obtained from the HTOL test are traditionally translated to a lower temperature using the Arrhenius temperature acceleration model. In a standard HTOL, where 77 parts per lot are taken from 3 different lots (a total of 231 parts tested) for 1,000 hours at +125°C, the calculated acceleration factor, AF, using the Arrhenius model would be 78 [assuming: 1) Ea = 0.7 eV, 2) the ambient temperature is +55°C, and 3) both temperatures refer to the junction]. The equivalent field time is ~18 million hours. In the case of zero failures in test, the upper limit for the failure rate at 60% confidence would be 51 FIT. It is clearly apparent that a predicted failure rate based on HTOL is misleading.
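The 78x acceleration factor and 51 FIT upper limit quoted above can be reproduced from the Arrhenius model and the zero-failure chi-square bound; with zero failures, the chi-square quantile with 2 degrees of freedom reduces to the closed form -2 ln(1 - CL). A sketch, approximating kelvin as °C + 273:

```python
import math

K_BOLTZMANN = 8.617e-5  # eV/K

def arrhenius_af(ea_ev, t_use_c, t_stress_c):
    """Arrhenius acceleration factor between use and stress junction
    temperatures (kelvin approximated as Celsius + 273)."""
    inv_dt = 1.0 / (t_use_c + 273.0) - 1.0 / (t_stress_c + 273.0)
    return math.exp(ea_ev / K_BOLTZMANN * inv_dt)

def zero_fail_fit_upper(devices, test_hours, af, confidence):
    """Upper failure-rate limit (FIT) for a zero-failure test:
    chi2(CL, 2) / (2 * equivalent hours), with chi2 = -2 * ln(1 - CL)."""
    equivalent_hours = devices * test_hours * af
    return -2.0 * math.log(1.0 - confidence) / (2.0 * equivalent_hours) * 1e9

af = arrhenius_af(0.7, 55.0, 125.0)
print(round(af))                                        # -> 78
print(round(zero_fail_fit_upper(231, 1000, af, 0.60)))  # -> 51
```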
Activation energy is the parameter used to express the degree of acceleration related to temperature. Each individual failure mechanism is associated with a unique activation energy value (JEDEC Publication No. 122B). However, it is traditional to use an activation energy of 0.7 eV, as this is generally assumed to be the average activation energy for failure mechanisms that occur during the useful life of a device. This useful life lies beyond the early stage of infant mortality failures (defect-driven failures). Industry widely uses this value of 0.7 eV in the following two cases:
1. When estimating an overall failure rate without focusing on a single failure mechanism; 0.7 eV is assumed to be a conservative value with respect to the mixture of single-mechanism activation energies.
2. When the failure mechanisms degrading a device are unknown.
The goal of HTOL is to gain the maximum possible acceleration, so as to accumulate the maximum equivalent field time with zero failures. Assuming a higher activation energy serves this goal, but it also reduces the failure rate upper limit. For example, assuming an activation energy of 1.0 eV instead of 0.7 eV raises the acceleration factor from 78 to 504 (6.5 times more). On the other hand, the failure rate upper limit drops from 51 FIT to only 8 FIT, which is even more over-optimistic.
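The sensitivity to the assumed activation energy can be checked directly; a sketch repeating the HTOL example above (231 parts, 1,000 hours, +55°C to +125°C junction, 60% confidence):

```python
import math

K = 8.617e-5  # Boltzmann constant, eV/K

def af_and_fit_upper(ea_ev):
    """Acceleration factor and zero-failure 60%-confidence FIT upper
    limit for the 231-part, 1,000-hour, 55->125 C HTOL example."""
    inv_dt = 1.0 / (55 + 273) - 1.0 / (125 + 273)
    af = math.exp(ea_ev / K * inv_dt)
    fit = -math.log(1.0 - 0.60) / (231 * 1000 * af) * 1e9
    return round(af), round(fit)

print(af_and_fit_upper(0.7))  # -> (78, 51)
print(af_and_fit_upper(1.0))  # -> (504, 8)
```

The higher activation energy buys a 6.5x larger acceleration factor, and the resulting FIT bound shrinks accordingly.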
DISCUSSION

The validation study has shown strong correlation between the field failure rates and the rates obtained by the prediction tool. The results in Figure 10 clearly demonstrate the accuracy and repeatability of the multi-mechanism model in predicting the field performance of complex integrated circuits.
The simulated estimates lie well within the confidence intervals, except for the Intel processor, where a small deviation of 60 FIT was observed. The small deviation between the rough estimates and the point estimates obtained from the statistical plots justifies the use of the exponential distribution. For the memories, an average failure rate of 720 FIT was observed, with an average deviation of 10% between the field and simulated failure rates. The average interval of the field failure rate (upper limit minus lower limit) is 280 FIT. Considering that the 512MB DRAM node technology is quite similar to that of the 1GB DRAM (100 nm and 110 nm, respectively), both parts actually exhibit the same failure rate of 0.8 FIT per 1 MB. In contrast, the 256MB DRAM, at 689 FIT, does not correspond to this projection, which would have led to a failure rate of 205 FIT. This is explained by the lower accelerated-test ambient temperature to which the 256MB DRAM is exposed relative to the other two memories. Nevertheless, components whose predicted failure rate is relatively large compared to similar device types, i.e. the 1GB DRAM, might be categorized as more sensitive to electro-thermal tolerances. They will be subjected to greater stresses at the peripheries of these sensitive operating ranges. Components with large operating ranges are typically operated at an average nominal value; therefore, small fluctuations away from the mean of these larger ranges will not excessively stress the components. A graphical depiction of this is shown in Figure 11. The microcontroller and the processor experienced lower failure rates than the memories: their average failure rate is 220 FIT, with an interval of 120 FIT.
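The 205 FIT projection for the 256MB part is just the per-capacity scaling of the ~0.8 FIT/MB observed on the other two memories:

```python
fit_per_mb = 0.8  # observed for the 512MB and 1GB DRAMs
print(round(fit_per_mb * 256))  # -> 205
```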
The multi-mechanism limitation: Ideally, yet unrealistically, a complete lifetime distribution of a component would be generated under a fixed set of accelerated loading conditions, from which inferences about the reliability of the component under a different set of loading conditions could be made confidently using a proven acceleration model. In practice, one accelerated test such as HTOL cannot stimulate all the major failure mechanisms (e.g. HCI), and the acceleration factor obtained for some of them is negligible. Under the assumption of multiple failure mechanisms, each will be accelerated differently depending on its physics. If a HTOL test is performed at an arbitrary voltage and temperature chosen to accelerate only a single failure mechanism, then only that mechanism will be reasonably accelerated. In that instance, which is generally true for most devices, the reported FIT rate (especially one based on zero failures) will be meaningless with respect to the other failure mechanisms.
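Under a competing-risks (sum-of-rates) view of multiple mechanisms, each test-condition rate must be de-rated by its own acceleration factor; a mechanism that HTOL barely accelerates (e.g. HCI) can then dominate the field rate even if it contributed little in test. A sketch with purely illustrative numbers (none of these FITs or acceleration factors come from the study):

```python
def field_fit_multi(test_fits, accel_factors):
    """Sum-of-rates (competing risks) assumption:
    lambda_field = sum_i(lambda_test_i / AF_i)."""
    return sum(f / a for f, a in zip(test_fits, accel_factors))

# Illustrative per-mechanism FITs at stress and acceleration factors
# for EM, TDDB, HCI, NBTI.  HCI barely accelerates (AF ~ 2), so it
# dominates the field rate despite TDDB dominating the test.
test_fits     = [400.0, 900.0, 100.0, 250.0]
accel_factors = [120.0,  80.0,   2.0,  60.0]
print(field_fit_multi(test_fits, accel_factors))
```

A single-mechanism acceleration factor applied to the total test rate would miss exactly this effect.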
The zero failure limitation: The fact that HTOL is a zero-failure test limits the statistical confidence of the predicted failure rate. Zero failures in HTOL is a poor indicator of the expected failure rate. To obtain statistical confidence, failures must be observed.
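How much observed failures matter can be seen from the standard one-sided Poisson/chi-square bound: the upper limit grows with the number of failures, and only observed failures let the bound converge toward a meaningful point estimate. A sketch (the function name and bisection solver are ours, not the paper's):

```python
import math

def fit_upper_limit(failures, device_hours, confidence):
    """One-sided upper failure-rate limit in FIT for r observed
    failures: solve P(X <= r | m) = 1 - CL for m = lambda * T
    (Poisson count model) by bisection, then convert to FIT."""
    def poisson_cdf(r, m):
        return sum(math.exp(-m) * m**k / math.factorial(k)
                   for k in range(r + 1))
    lo, hi = 0.0, 100.0 + 10.0 * failures
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if poisson_cdf(failures, mid) > 1.0 - confidence:
            lo = mid  # CDF still above target: m must be larger
        else:
            hi = mid
    return lo / device_hours * 1e9

# ~18e6 equivalent device-hours (the HTOL example), 60% confidence:
for r in (0, 1, 5):
    print(r, round(fit_upper_limit(r, 18.0e6, 0.60)))
```

With zero failures this reproduces the 51 FIT bound quoted earlier; each additional failure raises the bound, which is why a zero-failure result carries so little statistical information.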
ACKNOWLEDGEMENTS

This work was performed by DfR Solutions and initially funded by Aero Engine Controls, Boeing, General Electric, the National Aeronautics and Space Administration, the Department of Defense, and the Federal Aviation Administration in cooperation with the Aerospace Vehicle Systems Institute (AVSI) for the 0.13 µm and 90 nm technology nodes. DfR is now working to extend the capability of the tool to smaller technology nodes, including 65 nm, 45 nm, and 32 nm. Several commercial organizations have indicated a willingness to assist with the development and validation of the 45 nm technology through IC test components and the acquisition of field failure data. Continued development would incorporate this information and would expand into functional groups relevant to analog and processor-based (e.g. DSP and FPGA) integrated circuits.