The effectiveness of an error-tolerant approach in VLSI answers,
qualitatively, the question of whether it is reasonable to apply
a particular technique. In other words, under certain
conditions it may be preferable not to apply error tolerance,
since its application may have a detrimental effect. A similar
analysis was carried out for redundancy-repairing
circuits against manufacturing faults for yield improvement [Hir-01].
In the following, we present the basic idea that drives
us to study the effectiveness of error-tolerant techniques in
VLSI circuits.
Let the reliability, $R$, be quantitatively defined as the
probability that a system will not fail under specified
conditions [Lyo-62]. If the redundant system employs the
classical Triple Modular Redundancy (TMR), the resulting
reliability, $R_{TMR}$, given the reliability of one module,
$R$, is
$R_{TMR} = 3R^2 - 2R^3$. (1)
Several assumptions are made in order to obtain the previous
formula:
1. the failures of the three modules are statistically
independent and have equal probability, and
2. the majority voter or voting circuit is fault-free.
Even if the assumptions are not exactly fulfilled, a general
observation can be made regarding Eq. (1): there is no
increase in reliability if the reliability of a module, $R$,
is less than a certain value, which in this particular case is 50%.
This observation is simply a result of the dependency of the
redundant system reliability, $R_{TMR}$, on the non-redundant system
reliability, $R$. Now, one can postulate that this dependency
exists in general in error-tolerant VLSI circuits and that the
application of a particular error-tolerant technique is only
effective, i.e. it yields an increase in reliability, under
certain specific conditions. Since a quantitative analysis does
not seem an easy task, we will content ourselves with deriving a
framework to obtain the effectiveness of a particular concurrent
error-correction technique and drawing some final conclusions.
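The 50% break-even point can be checked numerically. The following is a minimal sketch, assuming the standard TMR expression $R_{TMR} = 3R^2 - 2R^3$ with independent, equally reliable modules and a fault-free voter; the function names are illustrative and not part of the original analysis.

```python
def tmr_reliability(r: float) -> float:
    """Reliability of a TMR system built from modules of reliability r.

    Assumes independent module failures and a fault-free voter.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("module reliability must lie in [0, 1]")
    # At least two of the three modules must work:
    # 3 r^2 (1 - r) + r^3 = 3 r^2 - 2 r^3
    return 3.0 * r**2 - 2.0 * r**3


def tmr_gain(r: float) -> float:
    """Change in reliability obtained by triplicating a module."""
    return tmr_reliability(r) - r


if __name__ == "__main__":
    for r in (0.3, 0.5, 0.7, 0.9, 0.99):
        print(f"R = {r:.2f}  R_TMR = {tmr_reliability(r):.4f}  "
              f"gain = {tmr_gain(r):+.4f}")
```

The gain is negative below $R = 0.5$, zero at $R = 0.5$, and positive above it, which is the dependency the text refers to.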
Rather than working with module reliabilities, which seem to be
difficult to measure in VLSI circuits, it is reasonable to use
transient fault densities as parameters. The transient fault
density of a VLSI circuit, $D$, can be defined as the
average number of faults per unit of area in an
arbitrary time window. It is a quantitative figure whose measure
is independent of any module definition and is not
tied to any specific circuit area. Furthermore, if we focus on
real-time signal processing applications where concurrent
error-correction capabilities are considered, the time window matches
the sample period and the effect of the span of a fault is
included in the fault averaging. Altogether, it will be seen that
for a given circuit of area $A$ the average number of faults, $A \cdot D$,
determines a measure for simple error-tolerant diagnostics.
Now, assuming that the circuit is partitioned into an infinite
number of statistically independent subareas, the
transient faults follow a Poisson distribution with average
$A \cdot D$. If faults do not occur independently in the different regions
but rather tend to cluster, as is the case for permanent faults,
then we can make use of the Negative Binomial distribution with a
fitting cluster factor $\alpha$. See the Figure below.
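The two fault-count models can be compared with a short sketch. It assumes the Negative Binomial parameterization commonly used in yield modeling, $P(k) = \frac{\Gamma(\alpha+k)}{k!\,\Gamma(\alpha)} \frac{(\lambda/\alpha)^k}{(1+\lambda/\alpha)^{\alpha+k}}$, with mean $\lambda$ (the product of area and fault density) and a cluster factor called `alpha` here. The function names and the sample values are hypothetical.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """P(k faults) when faults occur independently with the given mean."""
    return math.exp(-mean) * mean**k / math.factorial(k)


def neg_binomial_pmf(k: int, mean: float, alpha: float) -> float:
    """P(k faults) for clustered faults with cluster factor alpha."""
    ratio = mean / alpha
    # log of Gamma(alpha + k) / (k! * Gamma(alpha)), computed stably
    log_coeff = (math.lgamma(alpha + k) - math.lgamma(k + 1)
                 - math.lgamma(alpha))
    return (math.exp(log_coeff - (alpha + k) * math.log1p(ratio))
            * ratio**k)


if __name__ == "__main__":
    mean, alpha = 1.5, 2.0  # hypothetical values, for illustration only
    for k in range(5):
        print(k, poisson_pmf(k, mean), neg_binomial_pmf(k, mean, alpha))
```

For the same mean, clustering raises the probability of zero faults, and the Negative Binomial approaches the Poisson distribution as the cluster factor grows.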


At this point, it may be criticized that these simple models
are probably not accurate enough. It should be noted that it is
not our purpose to predict accurate figures for reliability but
rather to generate qualitative insights into the effectiveness of a
given error-tolerant technique. Thus, for ease of
analysis, we define the type-I single error correction (SEC) as a
concurrent error-correction technique where any single error in a
given area, $A_{SEC}$, is corrected. Obviously, the
error-tolerant circuitry does not come for free and it has a
penalty in area of $\delta A$, with $\delta$ the relative overhead,
when compared to the non-redundant circuit with area $A$, so that
$A_{SEC} = (1+\delta)A$. The question that arises now is whether or
not this overhead annihilates the gain in reliability due to the
incorporation of a type-I SEC. In order to answer this question,
it is noted that the probability of a working non-redundant
circuit, represented by $Y$, is given by
$Y = P(k = 0) = e^{-AD}$,
where $k$ represents the number of errors, and the new probability
of a redundant circuit is
$Y_{SEC} = P(k = 0) + P(k = 1) = \left(1 + A_{SEC}D\right) e^{-A_{SEC}D}$. (2)
The aforementioned question can now be stated as follows:
$Y_{SEC} > Y$. (3)
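The comparison behind Eq. (3) can be sketched in a few lines. This is an illustration under stated assumptions, not the original authors' code: faults are taken as Poisson distributed, the non-redundant circuit works only with zero faults, the type-I SEC circuit tolerates at most one fault over its enlarged area, and `overhead` is the relative area penalty (1.0 means 100%). All names and sample numbers are hypothetical.

```python
import math


def yield_nonredundant(area: float, density: float) -> float:
    """Probability of zero faults in the non-redundant circuit."""
    return math.exp(-area * density)


def yield_sec(area: float, density: float, overhead: float) -> float:
    """Probability of at most one fault in the enlarged SEC circuit."""
    mean = area * (1.0 + overhead) * density
    return (1.0 + mean) * math.exp(-mean)


def sec_is_effective(area: float, density: float, overhead: float) -> bool:
    """True when type-I SEC improves on the non-redundant circuit."""
    return yield_sec(area, density, overhead) > yield_nonredundant(area, density)


if __name__ == "__main__":
    # Hypothetical operating points, in arbitrary consistent units.
    for area, density, overhead in ((1.0, 0.5, 0.5), (2.0, 1.0, 1.0)):
        print(area, density, overhead,
              yield_nonredundant(area, density),
              yield_sec(area, density, overhead),
              sec_is_effective(area, density, overhead))
```

Only the product of area and fault density enters both expressions, so any consistent pair of units can be used.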
Before we get into solving the inequality of Eq. (3) it
is interesting to see the enhancement in reliability (if any) and
its trend for a wide range of area and overhead values. The
enhancement in reliability, or in other words, the enhancement in
dynamic yield (by analogy with random-defect yield) is defined as
(4)
The following Figures show specific examples for
circuit area and overhead figures up to 140
and 100%,
respectively.




In all cases, the enhancement increases as area and
fault density increase, making the use of type-I SEC capabilities
more and more attractive, i.e. effective. On the other hand,
enhancement decreases with increasing overhead, as one would
intuitively expect. The trend shows that with worse conditions
and bigger areas the attractiveness of error-tolerant circuitry
increases. We will see that this trend changes whenever the
product $A \cdot D$ surpasses a certain value.
In a sense, one may argue that these results are totally
dependent on the values given for the fault density. We have
chosen
faults/
as an initial figure
setting up the order of magnitude that one could expect for the
transient faults to come into important consideration. This value
is taken from the predicted constant random-defect density given
by the ITRS roadmap 2003.
So far we have seen that in all examples we have obtained a
positive enhancement. Now, we turn to computing the threshold value
for which the inequality of Eq. (3) no longer holds or, in
other words, for which there is no enhancement. In order to
compute this threshold we may restate Eq. (3) as
$Y_r = Y_{SEC} / Y > 1$,
where $Y_r$, the ratio of the working probability of the redundant
circuit, $Y_{SEC}$, to that of the non-redundant circuit, $Y$,
represents the relative yield ratio of a type-I SEC
circuit; that is, if $Y_r > 1$ then there is enhancement.
Otherwise, $Y_r \le 1$, it is better not to apply type-I SEC. To
illustrate, the following Figure shows the relative yield ratio for
two different circuit areas versus the fault density.


It is apparent that the curves bend to cross the horizontal line
representing the value 1 at a certain value of $D$. From that value
on, it does not make sense to apply type-I SEC capabilities. This
change can be better understood by looking at the product
$A \cdot D$. In particular, if the average number of faults in the whole
circuit (including an error-tolerant circuitry overhead of around
100%), $A \cdot D$, goes above 1.25 faults, then
it is not advantageous to correct single errors. Alternatively,
one could apply type-I SEC capabilities at a finer granularity,
reducing the area figure without greatly increasing the
overhead (if possible), in order to retain enhancement. It is remarkable
that under worse conditions, that is, high fault density and large
area, the application of type-I SEC capabilities is questionable.
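The break-even point can be located numerically. The sketch below assumes the Poisson model, in which the relative yield ratio reduces to $(1 + (1+\delta)AD)\,e^{-\delta AD}$ for a relative overhead $\delta$ and non-redundant area $A$. The bisection routine and its names are illustrative. Under these assumptions a 100% overhead gives a crossing near $A \cdot D \approx 1.26$, in line with the figure of 1.25 quoted above.

```python
import math


def relative_yield_ratio(ad: float, overhead: float) -> float:
    """Y_SEC / Y for Poisson faults, as a function of the product A*D."""
    mean_sec = (1.0 + overhead) * ad
    return (1.0 + mean_sec) * math.exp(-overhead * ad)


def break_even_ad(overhead: float, tol: float = 1e-10) -> float:
    """Value of A*D above which type-I SEC no longer pays off."""
    if overhead <= 0.0:
        raise ValueError("without overhead the ratio never drops below 1")
    lo = 1e-9  # the ratio is slightly above 1 just above A*D = 0
    hi = 1.0
    # Grow the bracket until the ratio has fallen below 1.
    while relative_yield_ratio(hi, overhead) >= 1.0:
        hi *= 2.0
    # Plain bisection on the single crossing of ratio == 1.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if relative_yield_ratio(mid, overhead) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for overhead in (0.25, 0.5, 1.0):
        print(f"overhead = {overhead:.0%}  "
              f"break-even A*D = {break_even_ad(overhead):.4f}")
```

Smaller overheads push the break-even point to larger values of $A \cdot D$, which is why applying the technique to smaller blocks helps only if the overhead does not grow much.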
In the case of clustered faults, a Negative Binomial
distribution with a fitting parameter, i.e. the cluster factor
$\alpha$, can be used. If we assume a cluster factor of
we
obtain the following Figures.




Compared with the Poisson distribution results, it is apparent
from the figures that there is still enhancement even at higher
values of $D$, but this enhancement is lower.
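The same comparison can be repeated for clustered faults. This is a sketch under assumptions: the Negative Binomial model replaces the Poisson one, type-I SEC is modeled as before (at most one fault over the enlarged area), and the cluster factor `alpha = 2.0` is a hypothetical value chosen only for illustration, not the one used for the figures.

```python
import math


def poisson_ratio(ad: float, overhead: float) -> float:
    """Relative yield ratio Y_SEC / Y for Poisson-distributed faults."""
    mean_sec = (1.0 + overhead) * ad
    return (1.0 + mean_sec) * math.exp(-overhead * ad)


def nb_prob_at_most_one(mean: float, alpha: float) -> float:
    """P(k <= 1) for a Negative Binomial with given mean and cluster factor."""
    ratio = mean / alpha
    p0 = (1.0 + ratio) ** (-alpha)
    p1 = p0 * mean / (1.0 + ratio)
    return p0 + p1


def nb_ratio(ad: float, overhead: float, alpha: float) -> float:
    """Relative yield ratio Y_SEC / Y for clustered faults."""
    y = (1.0 + ad / alpha) ** (-alpha)
    y_sec = nb_prob_at_most_one((1.0 + overhead) * ad, alpha)
    return y_sec / y


if __name__ == "__main__":
    alpha = 2.0      # hypothetical cluster factor
    overhead = 1.0   # 100% area overhead
    for ad in (0.25, 0.5, 1.0, 2.0, 3.0):
        print(f"A*D = {ad:.2f}  Poisson = {poisson_ratio(ad, overhead):.3f}  "
              f"NegBin = {nb_ratio(ad, overhead, alpha):.3f}")
```

With these assumed values the clustered model keeps the ratio above 1 at fault densities where the Poisson ratio has already dropped below 1, while giving a smaller ratio at low densities.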
[Hir-01] J. Hirase, “Yield Increase of VLSI after Redundancy-repairing”, Proc. 10th Asian Test Symposium, pp. 353–358, 2001.
[Lyo-62] R. E. Lyons and W. Vanderkulk, “The Use of Triple-Modular Redundancy to Improve Computer Reliability”, IBM Journal, pp. 200–209, April 1962.