4. Process Modeling
4.3. Data Collection for Process Modeling

4.3.2. Why is experiment design important for process modeling?

Output from Process Model is Fitted Mathematical Function

The output from process modeling is a fitted mathematical function with estimated coefficients. For example, in modeling resistivity Y as a function of dopant density X, an analyst may suggest the function

    $$ Y = A_0 + A_1 X + A_2 X^2 $$

The functional form is the above quadratic, and the coefficients to be fitted are A0, A1, and A2. Even for a given functional form, there are infinitely many candidate sets of coefficient values. Each set of coefficient values will in turn yield predicted values.

What are Good Coefficient Values?

Poor values of the coefficients are those for which the resulting predicted values are considerably different from the observed raw data Y. Good values of the coefficients are those for which the resulting predicted values are close to the observed raw data Y. Best values of the coefficients are those for which the resulting predicted values are close to the observed raw data Y, and the statistical uncertainty connected with each coefficient is small.
There are two considerations that are useful for the generation of "best" coefficients:
  1. Least squares criterion
  2. Design of experiment principles

Least Squares Criterion

For a given data set (e.g., 10 (X,Y) pairs), the most common procedure for automatically generating good coefficients is the least squares estimation criterion. This criterion yields coefficients whose predicted values are closest to the raw data Y in the sense that the sum of the squared differences between the raw data and the predicted values is as small as mathematically possible.
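
As a minimal sketch of this criterion, the following Python fragment fits the quadratic model to 10 (X,Y) pairs by least squares. The data values are invented purely for illustration, and the use of NumPy's numpy.linalg.lstsq routine is our own choice of tool for the sketch, not something the procedure requires.

```python
import numpy as np

# Hypothetical data: 10 (X, Y) pairs invented for illustration only.
x = np.arange(1.0, 11.0)
y = np.array([2.7, 3.3, 4.5, 5.4, 7.1, 8.5, 10.6, 12.3, 14.7, 17.1])

# One column per coefficient: 1 (for A0), X (for A1), X^2 (for A2).
design = np.column_stack([np.ones_like(x), x, x**2])

# Least squares: choose the coefficients that minimize the sum of
# squared differences between the raw data and the predicted values.
coef, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
a0, a1, a2 = coef

predicted = design @ coef
residuals = y - predicted
sse = float(np.sum(residuals**2))

# Any other coefficient values give a larger sum of squares.
# Here A0 is shifted by an arbitrary 0.1 to show this.
perturbed = coef + np.array([0.1, 0.0, 0.0])
sse_perturbed = float(np.sum((y - design @ perturbed) ** 2))

print(f"A0 = {a0:.4f}, A1 = {a1:.4f}, A2 = {a2:.4f}")
print(f"sum of squares at least squares fit: {sse:.4f}")
print(f"sum of squares after perturbing A0:  {sse_perturbed:.4f}")
```

Whatever perturbation is tried, the sum of squared differences cannot fall below the value obtained at the least squares coefficients.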

The overwhelming majority of regression programs today make use of the least squares criterion for the estimation of the model coefficients. Least squares estimates are popular because

  1. the estimates are statistically optimal (BLUEs: Best Linear Unbiased Estimates) under the usual assumptions of uncorrelated, equal-variance errors with mean zero;
  2. the estimation algorithm is mathematically tractable, in closed form, and therefore easily programmable.
How then can this be improved? For a given set of X values it cannot be; but frequently the choice of the X values is under our control. If we intervene at an early enough stage, we have a potential avenue for improved coefficient estimates.

Design of Experiment Principles

As to what values should be used for the X's, we look to established experiment design principles for guidance.

Principle 1: Minimize Coefficient Estimation Variation

The first principle of experiment design is to control the values within the X vector such that, after the Y data are collected, the subsequent model coefficients are as good as possible, in the sense of having the smallest variation.

The key underlying point with respect to design of experiment and process modeling is this: even though (for simple (X,Y) fitting, for example) the least squares criterion yields optimal (minimal variation) estimates for a given set of X values, some distributions of values in the X vector yield better (smaller variation) coefficient estimates than others. If the analyst has options as to how the values are distributed in the X vector, then the analyst has the power to drastically reduce the noisiness of the subsequent least squares coefficient estimates.

Five Designs

To see the effect of experiment design on process modeling, consider the following simplest case of fitting a line:

    $$ Y = A_0 + A_1 X $$

Suppose the analyst can afford 10 observations (that is, 10 (X,Y) pairs) for the purpose of determining optimal (that is, minimal variation) estimates of A0 and A1. What 10 X values should be used for the purpose of collecting the corresponding 10 Y values? Colloquially, where should the 10 X values be sprinkled along the horizontal axis so as to minimize the variation of the least squares estimated coefficients for A0 and A1? Should the 10 X values be:

  1. ten equi-spaced values across the range of interest?
  2. five replicated equi-spaced values across the range of interest?
  3. five values at the minimum of the X range, and five values at the maximum of the X range?
  4. one value at the minimum, eight values at the mid-range, and one value at the maximum?
  5. four values at the minimum, two values at mid-range, and four values at the maximum?
Or, in terms of the "quality" of the resulting estimates for A0 and A1, does it perhaps make no difference?

The answer is that it DOES make a difference. Each of the above five experiment designs will result in its own collected Y data, followed by the generation of least squares estimates for A0 and A1, and so each design will in turn yield a fitted line.
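
One way to see this, offered only as a sketch, is to simulate many repeated experiments under each design and observe how much the fitted slope varies from one experiment to the next. The X range of [0, 1], the true line Y = 1 + 2X, the noise standard deviation of 1, the random seed, and the reading of design 2 as five distinct values each observed twice are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(12345)  # fixed seed so the run is repeatable

# The five candidate designs, on an assumed X range of [0, 1].
designs = {
    1: np.linspace(0.0, 1.0, 10),                    # ten equi-spaced values
    2: np.repeat(np.linspace(0.0, 1.0, 5), 2),       # five values, each twice
    3: np.repeat([0.0, 1.0], 5),                     # five at min, five at max
    4: np.array([0.0] + [0.5] * 8 + [1.0]),          # 1 / 8 / 1
    5: np.array([0.0] * 4 + [0.5] * 2 + [1.0] * 4),  # 4 / 2 / 4
}


def simulated_slope_sd(x, n_rep=5000, a0=1.0, a1=2.0, sigma=1.0):
    """Standard deviation of the fitted slope over n_rep simulated experiments."""
    x = np.asarray(x, dtype=float)
    noise = rng.normal(0.0, sigma, size=(n_rep, x.size))
    y = a0 + a1 * x + noise  # one simulated experiment per row
    # np.polyfit accepts a 2-D y with one data set per column;
    # row 0 of the result holds the slope estimates.
    slopes = np.polyfit(x, y.T, 1)[0]
    return float(slopes.std(ddof=1))


slope_sd = {name: simulated_slope_sd(x) for name, x in designs.items()}
for name, sd in slope_sd.items():
    print(f"design {name}: SD of fitted slope = {sd:.3f}")
```

The printed standard deviations differ from design to design even though the number of observations and the noise level are the same in every case.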

Are the Fitted Lines Better for Some Designs?

But are the fitted lines, i.e., the fitted process models, better for some designs than for others? Are the coefficient estimate variances smaller for some designs than for others? For given estimates, are the resulting predicted values better (that is, closer to the observed Y values) for some designs than for others? The answer to all of the above is YES.

The most popular answer to the above question about which design to use for linear modeling is design 1, with ten equi-spaced points. This, however, is not correct. It can be shown that the variance of the least squares estimate of the slope A1 satisfies

    $$ \mathrm{Var}(\hat{A}_1) \propto \frac{1}{\sum_{i=1}^{n} (X_i - \bar{X})^2} $$

where $\bar{X}$ is the mean of the n X values. So to obtain minimum variance estimates, one minimizes the above by maximizing the denominator on the right. To maximize the denominator, it is best (for an arbitrarily fixed $\bar{X}$) to position the X's as far away from $\bar{X}$ as possible. This is done by positioning half of the X's at the lower extreme and the other half of the X's at the upper extreme. This is design 3 above, and this "dumbbell" design (half low and half high) is in fact the best possible design for fitting a line. Upon reflection, this agrees with the adage that "2 points define a line": it makes most sense to place those 2 points as far apart as possible (at the extremes) and to determine each of them as well as possible (by putting half the data at each extreme). Hence the design of experiment solution to process modeling, when the model is a line, is the "dumbbell" design: half the X's at each extreme.

What is the Worst Design?

What is the worst design in the above case? Of the five designs, the worst design is the one which has maximum variation. In the mathematical expression above, it is the one which minimizes the denominator, and so this would be design 4 above, where almost all of the data are located at the mid-range. Clearly the estimated line in this case is going to chase the solitary point at each end, and so the resulting linear fit is intuitively inferior.
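
The comparison of best and worst designs can be checked numerically. The sketch below computes, for each of the five designs, the sum of squared deviations of the X values about their mean, which is the quantity in the denominator above. The X range of [0, 1] and the reading of design 2 as five distinct values each observed twice are assumptions made for illustration; the function and variable names are likewise our own.

```python
import numpy as np

# The five candidate designs, on an assumed X range of [0, 1].
designs = {
    1: np.linspace(0.0, 1.0, 10),                    # ten equi-spaced values
    2: np.repeat(np.linspace(0.0, 1.0, 5), 2),       # five values, each twice
    3: np.repeat([0.0, 1.0], 5),                     # five at min, five at max
    4: np.array([0.0] + [0.5] * 8 + [1.0]),          # 1 / 8 / 1
    5: np.array([0.0] * 4 + [0.5] * 2 + [1.0] * 4),  # 4 / 2 / 4
}


def sum_sq_dev(x):
    """Sum of squared deviations of the X values about their mean."""
    x = np.asarray(x, dtype=float)
    return float(np.sum((x - x.mean()) ** 2))


sxx = {name: sum_sq_dev(x) for name, x in designs.items()}

# The slope variance is proportional to 1/sxx, so the slope standard
# deviation relative to the dumbbell design (design 3) is sqrt(sxx[3]/sxx).
relative_sd = {name: (sxx[3] / value) ** 0.5 for name, value in sxx.items()}

# Designs ordered from smallest to largest slope variance.
ranking = sorted(sxx, key=sxx.get, reverse=True)

for name in ranking:
    print(f"design {name}: sum of squares = {sxx[name]:.3f}, "
          f"relative slope SD = {relative_sd[name]:.3f}")
```

Under these assumptions the sums of squares are 2.5 for design 3, 2.0 for design 5, 1.25 for design 2, about 1.02 for design 1, and 0.5 for design 4, so design 3 gives the smallest slope variance and design 4 the largest.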

Designs 1, 2, and 5

How about the other 3 designs? Designs 1, 2, and 5 are useful only when we think the model may be linear but are not sure, and so we allow additional points which permit fitting a line if appropriate, but which build into the design the "capacity" to fit beyond a line (e.g., quadratic, cubic, etc.) if necessary. In this regard, the ordering of the designs would be
  • design 5 (if our worst-case model is quadratic),
  • design 2 (if our worst-case model is quartic),
  • design 1 (if our worst-case model is quintic and beyond).
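
The "capacity" of a design to fit beyond a line is governed by its number of distinct X values: a polynomial of degree d has d + 1 coefficients and so needs at least d + 1 distinct X values. The sketch below counts distinct values for each design and, as a cross-check, tests whether the polynomial design matrix has full column rank. The X range of [0, 1], the reading of design 2 as five distinct values each observed twice, and the helper names are assumptions made for illustration.

```python
import numpy as np

# The five candidate designs, on an assumed X range of [0, 1].
designs = {
    1: np.linspace(0.0, 1.0, 10),                    # ten equi-spaced values
    2: np.repeat(np.linspace(0.0, 1.0, 5), 2),       # five values, each twice
    3: np.repeat([0.0, 1.0], 5),                     # five at min, five at max
    4: np.array([0.0] + [0.5] * 8 + [1.0]),          # 1 / 8 / 1
    5: np.array([0.0] * 4 + [0.5] * 2 + [1.0] * 4),  # 4 / 2 / 4
}


def max_poly_degree(x):
    """Highest polynomial degree whose coefficients can all be estimated."""
    return int(np.unique(np.asarray(x, dtype=float)).size) - 1


def supports_degree(x, degree):
    """True if the degree-`degree` polynomial design matrix has full column rank."""
    vandermonde = np.vander(np.asarray(x, dtype=float), degree + 1)
    return bool(np.linalg.matrix_rank(vandermonde) == degree + 1)


max_degree = {name: max_poly_degree(x) for name, x in designs.items()}
for name, degree in max_degree.items():
    print(f"design {name}: can fit polynomials up to degree {degree}")

# Cross-check on two designs: the dumbbell design cannot fit a quadratic,
# whereas design 5 (three distinct X values) can.
print("design 3 supports a quadratic:", supports_degree(designs[3], 2))
print("design 5 supports a quadratic:", supports_degree(designs[5], 2))
```

This is consistent with the ordering above: design 5 has three distinct X values (enough for a quadratic), design 2 has five (enough for a quartic), and design 1 has ten (enough for a quintic and beyond), while the dumbbell design, with only two distinct values, can fit nothing beyond a line.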