Timing Recovery by Digital Interpolation

Astoundingly enough, J.G. Proakis textbook “Digital Communications” in its 4th edition does not mention how to perform truly digital timing recovery. That is, timing recovery with a sampling clock which is unrelated to the symbol rate, or in other words, timing recovery where symbol rate (1/T) and sampling rate (1/Ts) are incommensurate (T/Ts isn’t an integer). Thus, there is no need to control the sampling instant (no need for a Voltage-Controlled Oscillator, VCO; the receiver uses a free-running oscillator for sampling).

Fortunately, H. Meyr, M. Moeneclaey and S.A. Fechtel textbook “Digital Communication Receiver” comes to the rescue. The trick is to use the shift-property of band-limited functions applied to the sampling theorem,

x(t+\epsilon) = \sum_{n=-\infty}^{\infty}{x(\epsilon+nT_s)\cdot\text{sinc}[\frac{\pi}{T_s}(t-nT_s)] }

= \sum_{n=-\infty}^{\infty}{x(nT_s) \cdot\text{sinc}[\frac{\pi}{T_s}(t+\epsilon-nT_s)]}

with \epsilon an arbitrary time shift. This states that our signal x(t) can be represented by samples either at (nTs+\epsilon) or (nTs)! Thus, we can perform interpolation through sinc filtering to obtain a set of samples at the needed (nTs+\epsilon) instants. Interpolation can be performed by a digital time-variant FIR filter, in most practical cases, a 4-tap FIR filter running at 1/Ts. If oversampling (Ts \geq T) was performed then decimation is needed. The time shift can be estimated using a digital phase-locked loop (DPLL).

Theoretical aspects are discussed in Chapter 4. Chapters 9 and 10 are a must for the practitioner.

Leave a Comment

Effectiveness Analysis of Error-Tolerant VLSI Circuits with Concurrent Error Correction

The effectiveness of an error-tolerant approach in VLSI answers
qualitatively the question whether it is reasonable to apply
a particular technique or not. In other words, under certain
conditions it may be preferable not to apply error tolerance
since its application may have a detrimental effect. A similar
analysis was carried out in the case of redundancy-repairing
circuits against manufacturing faults for yield improvement [Hir-01].

In the following, we present the basic idea that drives
us to study the effectiveness of error-tolerant techniques in
VLSI circuits.

Let the reliability, R, be quantitatively defined as the
probability that a system will not fail under specified
conditions [Lyo-62]. If the redundant system employs the
classical Triple Modular Redundancy (TMR) the resulting
reliability given the reliability of one module, R_{M}, is

R=3R_{M}^{2}-2R_{M}^{3}.                 (1)

Several assumptions are performed in order to obtain the previous
formula:

1. the failures of the three modules are statistically
independent and have equal probability, and

2. the majority voter or voting circuit is fault-free.

Even if the assumptions are not exactly fullfilled a general
observation can be made regarding Eq. (1). There is no
increase of reliability if the reliability of a module, R_{M},
is less than a certain value, which in this particular case is 50%.
This observation is just a result of the dependency of the
redundant system reliability, R, on the non-redundant system
reliability, R_{M}. Now, one can postulate that this dependency
exists in general in error-tolerant VLSI circuits and that the
application of a particular error-tolerant technique is only
effective, i.e. it yields an increase in reliability, under
certain specific conditions. Since a quantitative analysis seems
not an easy task, we will content ourselves with deriving a
framework to obtain the effectiveness of a particular concurrent
error-correction technique and draw some final outlines.

Rather than working with module reliabilities, which seem to be
difficult to measure in VLSI circuits, it is reasonable to use
transient fault densities as parameters. The transient fault
density of a VLSI circuit, D, can be defined to account for the
ratio of the average number of faults per unit of area in an
arbitrary time window. It is a quantitative figure, the measure
of which is independent of any module definition and it is not
related to any specific circuit area. Furthermore, if we focus on
real-time signal processing applications where concurrent-error
correction capabilities are considered, the time window matches
the sample period and the effect of the span of a fault is
included in the fault averaging. Altogether, it will be seen that
for a given circuit of area A the average number of faults \lambda=A\cdot D
determines a measurement for simple error-tolerant diagnostics.

Now, assuming that the circuit is partitioned into an infinite
number of statistically independent subareas we get that the
transient faults follow a Poisson distribution with average \lambda
. If faults do not occur independently in the different regions
but rather tend to cluster, as is the case for permanent faults,
then we can make use of the Negative Binomial distribution with a
fitting cluster factor \alpha. See Figure below.

PoissonNegBin

At this point, it may be critisized that these simple models
are probably not accurate enough. It should be noted that it is
not our purpose to predict accurate figures for reliability but
rather generate qualitative insights of the effectiveness of a
given error-tolerant technique. Thus, following the ease of
analysis we define the type-I single error correction (SEC) as a
concurrent-error correction technique where any single error in a
given area A(1+\delta), \delta>0, is corrected. Obviously, the
error-tolerant circuitry does not come for free and it has a
penalty in area of \delta A when compared to the non-redundant
circuit with area A. The question that arises now is whether this
overhead anhiliates or not the gain in reliability due to the
incorporation of a type-I SEC. In order to answer this question
it is noted that the probability of a working non-redundant
circuit represented by R_{0} is given by,

R_{0}=Prob(k=0;A,D)=e^{AD}

where k represents the number of errors, and the new probability
of a redundant circuit is,

R = Prob(k=0,1;A(1+\delta),D)  =

Prob(k=0;A(1+\delta),D)+Prob(k=1;A(1+\delta),D)  =

e^{(1+\delta)AD}+\{(1+\delta)AD\}\cdot e^{(1+\delta)AD}. (2)

The aforementioned question can be now stated as follows,

R_{0}\gtrless R. (3)

Before we get into solving the inequality of Eq. (3) it
is interesting to see the enhancement in reliability (if any) and
its trend for a wide range of area and overhead values. The
enhancement in reliability, or in other words, the enhancement in
dynamic yield (by analogy with random-defect yield) is defined as

\Delta Y_{dyn}=\frac{R-R_{0}}{R_{0}}. (4)

The following Figures show specific examples for
circuit area and overhead figures up to 140 mm^{2} and 100%,
respectively.

Dyn_Yield_Poisson_22
Dynamic_Yield_Enh_SEC_I_Poisson_22
Dyn_Yield_Poisson
Dynamic_Yield_Enh_SEC_I_Poisson

In all cases, the enhancement increases as area and
fault density increase making more and more attractive, i.e.
effective, the use of type-I SEC capabilities. On the other hand,
enhancement decreases with increasing overhead as one could
intuitively expect. The trend shows that with worse conditions
and bigger areas the attractiveness of error-tolerant circuitry
increases. We will see that this trend is changed whenever the
product AD surpasses a certain value.

In a sense, one may argue that these results are totally
dependent on the values given for the fault density. We have
chosen D=2.2\times10^{-3} faults/mm^{2} as an initial figure
setting up the order of magnitude that one could expect for the
transient faults to come into important consideration. This value
is taken from the predicted constant random-defect density given
by the ITRS roadmap 2003.

So far we have seen that in all examples we have obtained a
positive enhancement. Now, we turn to compute the threshold value
for which the inequality R_{0}<R does not hold anymore, or in
other words, for which there is no enhancement. In order to
compute this threshold we may restate Eq. (3) as

\Gamma=\frac{Prob(k=0;A,D)-Prob(k=0;(1+\delta)A,D)}{Prob(k=1;(1+\delta)A,D)}\gtrless1,

where \Gamma represents the relative yield ratio of a type-I SEC
circuit; that is, if \Gamma<1 then there is enhancement.
Otherwise, \Gamma\ge1, it is better not to apply type-I SEC. To
illustrate, the following Figure shows the relative yield ratio for
two different circuit areas versus the fault density.

Gamma_SEC_I_Poisson_20
Gamma_SEC_I_Poisson_100

It is apparent that the curves bend to cross the horizontal line
representing the value 1 at a certain value of D. From that value
on it does not make sense to apply type-I SEC capabilities. This
change can be better understood by looking at the product \lambda=AD
. In particular, if the average number of faults in the whole
circuit (including an error-tolerant circuitry overhead of around
100%), \lambda, goes above an average number of 1.25 faults, then
it is not advantageous to correct single errors. Alternatively,
one could apply type-I SEC capabilities at a lower granularity,
reducing the area figure, without increasing to a big extent the
overhead (if possible) to retain enhancement. It is remarkable
that under worse conditions, that is, high fault density and high
area, application of type-I SEC capabilities is questionable.

In case of having clusters of faults, a Negative Binomial
distribution with a fitting parameter, i.e. the cluster factor \alpha
, can be used. If we assume a cluster factor of \alpha=2 we
obtain the following Figures.

Dyn_Yield_Negbin
Dynamic_Yield_Enh_SEC_I_Negbin
Gamma_SEC_I_Negbin_20
Gamma_SEC_I_Negbin_100

As compared to the Poisson distribution results, it is apparent
from the figures that there is still enhancement even at higher
values of D, but this enhancement is lower.

[Hir-01] J. Hirase, “Yield Increase of VLSI after Redundancy-repairing”, Proc. 10th Asian Test Symposium, 353–8, 2001.
[Lyo-62] R. E. Lyons and W. Vanderbulk, “The use of Triple-Modular Redundancy to improve computer reliability”, IBM Journal, 200–9, April 1962.

Leave a Comment

When Moving Average Filter + Decimation = Accumulate-and-Dump Circuit

A Moving Average (MA) filter computes the average of the last N samples. Its difference equation is given by,

y_n = \frac{1}{N}\sum_{k=0}^{N-1}x_{n-k}

where x_n is the input signal, and y_n is the output signal.

For a DSP engineer a MA filter is a “good” smoothing filter in the sense of being optimal for reducing random white noise but, a “bad” low-pass filter since its roll-off is slow and the stopband attenuation is awful (sinc function). For a digital designer a MA filter is a small, low-power and fast structure since there are no multipliers to design, only adders/subtractors. Its recursive representation is given by,

y_n = y_{n-1} + \frac{1}{N}(x_n -x_{n-N})

which yields for relatively big N’s a very efficient implementation in terms of delay elements and number of full-adders. (Note that, except for the factor 1/N, this implementation equals to the one given by an integrator-comb section of a cascaded intergrator-comb (CIC) filter).

If the output of the MA filter is to be decimated (sample rate decrease by R) then it is a good idea to place decimation within the filter so that the sample values that are not needed for the final output are not computed.  Well, you might say that this is exactly what efficient CIC decimation filters do. Right, but if the sample rate reduction factor, R, is a multiple of or equal to N, the averaging factor,  then we can do better. The point is that if we take one output sample out of R and R\geqN then we can do word-serial arithmetic, that is processing the input sample serially. An  accumulate-and-dump circuit does the job,

accdumpwhere CKfast is the clock running at the original sampling rate, CKslow is the R-times reduced clock, and tcnt is a pulse every N clock cycles of CKfast (for example, the terminal count of a counter). Note that the accumulate register (ACC) must be of length n+\lceil\log_{2}(N)\rceil bits, with n the length of the input signal, to accomodate the length of the output signal.

Compared to a first-order CIC decimation filter one saves the comb section (subtractor) at the cost of logic needed for resetting the ACC register.

Leave a Comment

Sign-Bit Processing of Signed Numbers

To my knowledge there are two ways to process the sign-bits of two signed numbers: the very well-known sign-extension method where the sign-bit is pushed forward based on the fact that -a=-2a+a, with a\in\{0,1\}, and, the correction term method, based on the formula -a=\bar{a}-1, with \bar{a} being the inverse of a.

Let’s illustrate by example: let X and Y be two signed numbers, n-bit and (n-1)-bit long, respectively. Their algebraic value in two’s complement representation is given by,

X=-x_{n-1}2^{n-1}+\sum_{i=0}^{n-2}x_i2^i and,

Y=-y_{n-2}2^{n-2}+\sum_{i=0}^{n-3}y_i2^i.

If we add both numbers, we obtain,

X+Y=-x_{n-1}2^{n-1}+(-y_{n-2}+x_{n-2})2^{n-2}+\sum_{i=0}^{n-3}(y_i+x_i)2^i,

and the term (-y_{n-2}+x_{n-2}) must be specially processed to get the final result in two’s complement representation. Now, applying sign-extension to the term yields, -y_{n-2}2^{n-1}+(y_{n-2}+x_{n-2})2^{n-2} and we are done. Alternatively, we can apply the correction term method, that is, (\bar{y}_{n-2}-1+x_{n-2})2^{n-2}. And, to get rid of the -1 we apply the fact that -1=-2+1, that is,

-2^{n-1}+(\bar{y}_{n-2}+1+x_{n-2})2^{n-2}.

What we obtain are just constant terms which can be easily processed.

In general, the sign-extension method incurs in higher fan-out of the operands, whereas the correction-term method suffers from additional inputs of the first-stage of processing. In any case, it is worth to ponder both options.

Leave a Comment

Power, Transitions, Stimulus and Standard Cells

If dynamic power consumption is all about transitions, and transitions are all about stimulus, … what about stimulus?

Long time ago I had to characterise in power a standard-cell which I designed using full-custom tools. It was a full-adder. To get the dynamic switching power figure of one output for a given output load I ran some SPICE simulations. The stimulus to the three-input (a,b and d) two-output (sum and carry) cell was given by

{1->4->6->7->3->5->2->1->4->6..}.

Thus, the inputs take the logical values {a,b,d}={0,0,1}->{1,0,0}->{1,1,0}->{1,1,1}->… and the carry output, c, changes as follows

c={0->0->1->1->1->1->0->…}.

As it can be seen, ONLY two transitions occur in c (as transitions I mean changes from 0->1 or 1->0). So, I was able to generate just 2 transitions out of 7 states (28.6%) with this sequence. My full-adder was not really much exercised!
Now, is this input sequence “random”? Well, it is a PN-sequence (PN stands for pseudo-noise) generated by a max-length linear-feedback shift-register (LFSR) and as such it complies with all the properties a sequence must have according to S. Golomb to be “random”. Is it not random enough? Is it possible to increase the number of transitions by proper stimulus assignment?

Then, I looked at two different 4-state max-length LFSR: the first one gave 4 transitions out of 15 states, and the second one gave 6 out of 15. A change of +13%. I thought for a while and realized that: if dynamic power consumption is all about transitions and, transitions is all about stimulus, then, how can I define the “right” stimulus to my cells?

Ok, Databooks of standard-cell libraries give dynamic power figures normalized to the frequency of the signal, that is, in µW/MHz. But now, how can I define the frequency of a periodic signal which does not have 50% duty cycle? Would it be a solution to use {1->4->6->7->1->4->…}?

Leave a Comment

Gated Clock Dividers

One of the most well-known structures in digital design is a “Clock divider by 2″,

ckdiv21

where the negated output Q of a D-Flip Flop is connected to its D input. Thus, CKdiv2 shows half of frequency of clock CK. If we want to gate this output by a signal x, then we can do the following

ckdiv2whenx3which is straightforward. So far it seems very intuitive. Now, we want a clock divider by 4 and gated by a signal x. One could try to cascade the previous structure and then try to incorporate an AND gate to provide a gated clock. Ok, I am not a very good friend of trial & error methods and I prefer the more systematic methods in circuit design. In this case, my favourite one is the state machine synthesis. My finite state machine (FSM) looks like this,

fsm_ckdiv4whenx2

where X stands for “don’t care”, that is, either 1 or 0. Note that we have chosen the output to be S0. Thus, we have a 2 D-Flip Flop (DFF) solution to our problem. The synthesis of a FSM follows straightforward: 1) build a logic table with x, S0 and S1 as inputs and, S0_next  and S1_next (D input of DFFs) as outputs, 2) compute per Karnaugh the logic function of S1_next and S0_next.

The resulting structure follows,

ckdiv4whenxwhich is not so intuitive.

Leave a Comment

An all-digital Manchester Decoder

A Manchester encoder is that one which encodes logic bits –usually given as a non-return-to-zero (nrz) signal– into signal transitions. For example, a logic one (zero) is encoded as a positive (negative) edge. Thus, the encoded signal has zero DC and provides enough transitions to recover easily a clock (obviously, at the cost of doubling the bandwidth). A Manchester decoder (MD) is that one which decodes the encoded signal back to nrz.

My MD is an all-digital one, it has a hard decision input, a feedforward structure,  and is based on an over-sampling & counting algorithm. Since a Manchester encoded signal is basically a signal of 4 alternating signal states (short/long ones/zeros), the idea is first to take a decision on whether a short or long one/zero was sent and then based on this information decode into nrz.

The following Figure shows the structure of the MD.

md_top

The Pre-processing block converts the Manchester encoded signal mdin –1-bit hard decision– into datain and bitsynchin and generates clk_transition (a tick everytime a transition on mdin occurs). The mdin is oversampled by 16 and the number of adjacent ones/zeros is counted. Based on the resulting count the block takes a decision: short or long one/zero. The decision intervals can be made programmable. For simplicity, we call a short one/zero a SINGLE one/zero (S1/S0) and, a long one/zero a DOUBLE one/zero (D1/D0). Thus, datain represents mdin as a sequence of symbols taken from the following alphabet: S0, S1, D0 and D1. Similarly, bitsynchin is just a subset of datain and gives information about mdin being a SINGLE or a DOUBLE, that is, a sequence of symbols from the following alphabet: S, D.

The use of clk16x, that is, an oversampling factor of 16 is arbitrary. It is worth emphasizing that if T is the bit period and Ts the sampling period, Ts/T does not be necessarily an integer, it can be rational.  Also, note that the higher the oversampling frequency, 1/Ts,  the more robust the algorithm works — obviously at the cost of power–. On the other side, just the Pre-processing block runs at the highest clock. The Finite State Machines (FSM) run at the slower derived clock clk_transition. In the following, both FSMs in the Mealy sense are shown,

fsm_nrz_decoder

fsm_md_bitsynchwhere transitions are labelled with /1 and /0 denoting output bit 1 and 0, respectively. A transition occurs from A to B if the input signal is equal to B. In the case of FSM Bit Synch., a transition occurs from Axx to Bxx if the input signal is equal to B (not attending to the subscript). To understand the FSMs you will need to think on terms of SINGLEs and DOUBLEs (to exercise I recommend to use the following timing diagram). Note that nrz is synchron to the positive and negative edges of clk_nrz.

td_md1

I do believe a decoder like this one is more robust than one based on a mid-bit finder –but, I’ve not verified this yet–. It would be great to get some feedback from the engineers out there. Furthermore, I do think one can combine both FSMs into just one and save some power. Interestingly enough, it takes more states –8 instead 4– to generate the clock signal clk_nrz than the data nrz, even though the input information needed is less — [D,S] instead [D0, D1, S1, S0]–.

This post may not seem to be related to arithmetic, but I do believe arithmetic has to do a lot with finite state machines (FSM), and, an all-digital Manchester decoder is precisely an FSM (or two!).

Comments (1)

Moving Average filters and Cascaded Integrator-Comb (CIC) filters

The best way to implement a Moving Average (MA) filter in terms of digital logic is to implement it as a Cascaded Integrator-Comb (CIC) filter. At least this is the case for standard cell-based designs and probably FPGAs.  The first time I read about MA filters described as “CIC filters” was in Oppenheim & Schafer ’s textbook “Discrete-Time Signal Processing” –page 35, though, it does not mention CICs–.  Nonetheless, the first reference is Hogenauer’s article “An economical class of digital filters for decimation and interpolation”,  IEEE Trans. on Acoustics, Speech and Signal Processing, Vol.29,No.2, April 1981.  Anyway, just have a look at Richard Lyon’s article in Embedded.com to get a feeling.

Comments (5)

Two’s Complement versus One’s Complement

Let \{z_{n-1},\dots,z_0\} represent a n-bit number in two’s complement. Its algebraic value, Z^{2s}, is given by,

Z^{2s}=-z_{n-1}2^{n-1}+\sum_{i=0}^{n-2}{z_i 2^i},

which equals to

-z_{n-1}2^{n}+[ \sum_{i=0}^{n-1}{z_i 2^i} ],

and, thus, the so-called 2^n-complement. Now,  the same n-bit number interpreted as a one’s complement number, (2^n-1)-complement, results in

Z^{1s}=-z_{n-1}(2^{n}-1)+[\sum_{i=0}^{n-1}{z_i 2^i}] .

Therefore,

Z^{1s}=Z^{2s}+z_{n-1}.

Note how the conversion is possible. Extending the previous equation we obtain

Z^{1s} = -z_{n-1}(2^{n-1}-1)+ \sum_{i=0}^{n-2}{z_i 2^i},

or,

Z^{1s}=-z_{n-1}2^{n-1}+\sum_{i=1}^{n-2}{z_i 2^i} + (z_{n-1}+z_0)2^0.

Interpreting two’s and one’s complement numbers in the framework of digit-set conversions, it is apparent that the MSB of a two’s compl. and a one’s compl. number,-z_{n-1}, takes values in [-1,0]. Unlike two’s compl., the LSB of a one’s compl. number,(z_{n-1}+z_0), takes values in [0,1,2].

Leave a Comment

Balanced Logic and Glitches

Glitches, the result of different time delays in logic gates, is responsible for part of the dynamic power consumption. A quantitave figure for glitching is difficult to estimate since it depends very much on the structure of our arithmetic block. Given the same bit-length, ripple structures are more affected by glitches than tree structures due to a higher logic depth.  In general, dynamic power is proportional to the product of the activity factor and capacitance, i.e.

P\propto\alpha\cdot C.

Balanced logic is supposed to reduce glitches and therefore power consumption. Most of the times, this does not come for free and the total capacitance is increased due to additional transistors. A very simple analysis shows that reducing power by reducing glitching is not an easy task. Let’s assume that

\alpha = (1-\rho)\cdot\alpha_{g}+\alpha_{t},

where $latex \rho$ is our glitching reduction factor due to a balanced structure, \alpha_g is the glitching activity factor and \alpha_t is the dynamic activity factor (which depends only on the input signal probabilities). On the other hand,

C = (1+\psi)\cdot C_{g}+C_{w},

where \psi is the capacitance increase factor, C_g is the total gate capacitance and C_w is the total wiring capacitance (we assume that only the gate capacitance is affected).

Let C_T = C_g + C_w and \alpha_T = \alpha_g + \alpha_t be the total capacitance and the total actitivity factor, respectively. Then, we obtain,

\alpha\cdot C=C_T\alpha_T + \{\alpha_T\psi C_g - \rho \alpha_g(C_T+\psi C_g)\}.

Now, to get a reduction in power consumption the second term in the right-hand-side must be lower than zero. In other words, the reduction in glitching activity must be bigger than,

\rho > (1+\frac{\alpha_t}{\alpha_g})\cdot\frac{1}{1+\frac{1}{\psi}(1+\frac{C_w}{C_g})}.

In order to get some feeling, let’s assume that the wiring capacitance is neglectable, C_w=0 , and that our balanced logic increases the gate capacitance by 20%, \psi=0.2. It results that it is IMPOSSIBLE to reduce power consumption if the glitching activity accounts for at least 20% of the transition activity, \alpha_g=0.2\alpha_t. (Just for reference, Veendrick, in his textbook, mentions that in a 8-bit ripple adder glitching accounts for as much as 30%.)

Also, this means that in a tree adder like, for example, a Brent-Kung adder, there is little or no benefit when balancing the tree cells.

Even if the wiring capacitance were not neglectable (deep-submicron technologies), say C_w = 3C_g, and we were able to reduce as much as 50% of glitching, then the power reduction would account for as little as 3.8% ! (assuming once again \psi=0.2 and \alpha_g=0.2\alpha_t).

Leave a Comment

Older Posts »