4. Process Modeling
4.6. Case Studies in Process Modeling
4.6.2. Alaska Pipeline

4.6.2.5. Weighting to Improve Fit

Weighting

Another approach when the assumption of constant standard deviation of the residuals (that is, homogeneous variances) is violated is to perform a weighted fit. In a weighted fit, we give less weight to the less precise measurements and more weight to the more precise ones.

Finding An Appropriate Weight Function

The obvious question is: how do we determine an appropriate weighting function?

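Weighted least squares estimates are found by minimizing

$$Q = \sum_{i=1}^{n} w_i \left[ y_i - f(x_i; \beta) \right]^2$$

where $f(x_i; \beta)$ is the model being fitted and the weights are

$$w_i = \frac{1}{\sigma_i^2},$$

with $\sigma_i^2$ the variance of the $i$th observation. These relative weights give optimal results when they are known. Unfortunately, the true weights are rarely known, so they have to be estimated.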
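As a minimal sketch of the weighted least squares criterion described above, the following Python code fits a straight line by solving the weighted normal equations. The function name `weighted_line_fit` and the small data set are hypothetical and chosen only for illustration; the handbook's own analysis was done in Dataplot.

```python
import numpy as np

def weighted_line_fit(x, y, w):
    """Return (intercept, slope) minimizing sum(w * (y - a0 - a1*x)**2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    X = np.column_stack([np.ones_like(x), x])   # design matrix with columns [1, x]
    XtW = X.T * w                               # equivalent to X.T @ diag(w)
    beta = np.linalg.solve(XtW @ X, XtW @ y)    # weighted normal equations
    return beta[0], beta[1]

# Hypothetical, noise-free data on the line y = 2 + 3x.
x = np.array([1.0, 2.0, 4.0, 8.0])
y = 2.0 + 3.0 * x

# Any set of positive weights recovers the line exactly when there is no noise.
a0, a1 = weighted_line_fit(x, y, w=1.0 / x**1.5)
print(a0, a1)
```

Only the relative sizes of the weights matter here: multiplying every weight by the same positive constant leaves the estimates unchanged.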
Replication in the Data

The obvious way to estimate the weights $w_i$, if there are replicates in the data, is

$$w_i = \frac{1}{s_i^2}$$

where $s_i^2$ is the sample variance for the $i$th set of replicates in the data set.

However, this rarely works well because the weights are extremely variable when estimated this way.

An Improved Strategy

A better strategy for estimating the weights is to find a function that relates $s_i^2$ to $x_i$.

If

$$s_i^2 \approx g(x_i)$$

then use

$$w_i = \frac{1}{g(x_i)}$$

as the weights.

One model, called the power model, that often works well for modeling the variances is

$$\sigma_i^2 = \alpha \, x_i^{c}$$

That is, the variances are proportional to a power of the independent variable.
Estimate Weights Using Power Function

To estimate the weights using the power model, fit the function

$$\ln(s_i^2) = \beta_1 + \beta_2 \ln(x_i)$$

to the variances from each set of replicates in the data.

Then use $\hat{c} = \hat{\beta}_2$ (the slope of the fit) to estimate c, and

$$w_i = \frac{1}{x_i^{\hat{c}}}$$

to use as the weights.

You should check the residuals from the fit used to estimate c to make sure everything looks reasonable. However, this fit does not have to meet the standards usually applied to a final model, because it only needs to give a reasonable approximation of c.
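The log-log fit for the power model can be sketched as follows. This is an illustration under stated assumptions, not the handbook's Dataplot code: the helper `estimate_power_exponent` and the replicate summaries `x_rep` and `s2_rep` are hypothetical, and the variances are constructed to follow a power law with exponent 1.5 exactly so that the recovered slope is easy to check.

```python
import numpy as np

def estimate_power_exponent(x_rep, s2_rep):
    """Fit ln(s2) = b1 + b2*ln(x) by least squares; b2 estimates c."""
    log_x = np.log(np.asarray(x_rep, dtype=float))
    log_s2 = np.log(np.asarray(s2_rep, dtype=float))
    # np.polyfit returns coefficients from highest degree down: [slope, intercept].
    b2, b1 = np.polyfit(log_x, log_s2, deg=1)
    return b1, b2

# Hypothetical replicate summaries: predictor value and sample variance per set.
x_rep = np.array([5.0, 10.0, 20.0, 40.0, 80.0])
s2_rep = 0.5 * x_rep**1.5            # exact power law, for illustration only

b1, c_hat = estimate_power_exponent(x_rep, s2_rep)
weights = 1.0 / x_rep**c_hat         # weights shrink as the predictor grows
print(c_hat)
```

With real data the points will scatter around the line, which is why the residuals of this auxiliary fit are worth a quick look.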

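Replicates Not Available

If there are few or no replicates in the data, then we can approximate the replication case as follows. Divide the data into several ranges of the predictor variable within which the responses have similar means. That is, we pick the ranges narrow enough that the plot of the response against the predictor shows little slope within each range.

We then treat the observations in each range as replicates and compute $\bar{x}_i$ and $s_i^2$ for each range.

Then fit

$$\ln(s_i^2) = \beta_1 + \beta_2 \ln(\bar{x}_i)$$

and define the weights by

$$w_i = \frac{1}{x_i^{\hat{c}}}$$

where $\hat{c} = \hat{\beta}_2$ is the estimated slope and $x_i$ is the predictor value of the $i$th observation.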
Approaches to Forming Replicate Groups

There are several possible approaches to forming the replicate groups.
  • We could manually form the groups from the plot of the response against the predictor variable. Although this allows us the most flexibility in adjusting for the oddities of a specific data set, it is impractical for routine use. It may be useful if you have a relatively small data set that has significant gaps in the data.
  • We can divide the data into equal-sized groups of size n. There is a tradeoff between picking n too small and n too large. If n is too large, the variances are inflated because the points within a group are no longer close to being replicates. If n is too small, the variance estimates may not be reliable. We can use the plot of the response variable against the predictor variable as a guide.
  • We can pick an increment for the predictor variable. That is, instead of picking groups of size n, we divide the range of the predictor variable into intervals of equal width. In this case, the size of the groups may vary. There is a similar tradeoff between picking the interval too narrow or too wide. Again, we use the plot of the response variable against the predictor variable as a guide.
Note that the estimate of c is somewhat dependent on the approach used to pick the replication group. However, we are not trying to find an optimal estimate of c, simply a reasonable approximation. In practice, the resulting weighted fit, which is the real goal, is typically not particularly sensitive to small changes in c.
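The two automatic grouping approaches in the list above can be sketched in Python as follows. The function names and the synthetic data are hypothetical; the 107 evenly spaced predictor values are used only so that the group counts are easy to verify by hand.

```python
import numpy as np

def groups_of_size(x, y, n):
    """Sort by x, then take n points in succession to form each group."""
    order = np.argsort(x, kind="stable")   # stable sort keeps ties in input order
    xs = np.asarray(x, dtype=float)[order]
    ys = np.asarray(y, dtype=float)[order]
    return [(xs[i:i + n], ys[i:i + n]) for i in range(0, len(xs), n)]

def groups_by_width(x, y, width):
    """Split the range of x into intervals of equal width; group sizes may vary."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    bins = np.floor((x - x.min()) / width).astype(int)
    return [(x[bins == b], y[bins == b]) for b in np.unique(bins)]

def group_summaries(groups, min_size=2):
    """Mean of x and sample variance of y for each group with enough points."""
    kept = [(gx, gy) for gx, gy in groups if len(gy) >= min_size]
    means = np.array([gx.mean() for gx, _ in kept])
    variances = np.array([gy.var(ddof=1) for _, gy in kept])
    return means, variances

# Synthetic data for illustration only: 107 points on a straight line.
x = np.arange(1.0, 108.0)
y = 2.0 * x

groups = groups_of_size(x, y, 4)          # 26 groups of four and one of three
means, variances = group_summaries(groups)
wide_groups = groups_by_width(x, y, 10.0)
print(len(groups), len(groups[-1][0]), len(wide_groups))
```

Groups with fewer than `min_size` points are dropped from the summaries because a sample variance cannot be computed from a single observation.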
Weighted Residuals

One complication with weighted analysis is that the distribution of the residuals can vary substantially across the values of the predictor variable.

This necessitates the use of weighted residuals when plotting residuals. The weighted residuals are given by

$$e_i^{*} = \sqrt{w_i} \, (y_i - \hat{y}_i)$$
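As a small illustration, and assuming the usual definition in which each raw residual is multiplied by the square root of its weight, the weighted residuals can be computed as follows. The function name and the numbers are hypothetical.

```python
import numpy as np

def weighted_residuals(y, y_hat, w):
    """Scale each raw residual by the square root of its weight."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    w = np.asarray(w, dtype=float)
    return np.sqrt(w) * (y - y_hat)

# Hypothetical observed values, fitted values, and weights.
y = np.array([10.0, 20.0])
y_hat = np.array([9.0, 22.0])
w = np.array([4.0, 0.25])

res_w = weighted_residuals(y, y_hat, w)
print(res_w)
```

If the weights are proportional to the reciprocals of the error variances, the weighted residuals should show roughly constant spread across the predictor, which is what the residual plots are meant to check.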

Fit for Estimating Weights

For the pipeline data, we chose replicate groups so that each group has four observations (the last group has only three). The groups were generated by first sorting the data by the predictor variable and then taking four points in succession to form each replicate group.

Dataplot generated the following output for the fit of log(variances) against log(means) for the replicate groups. The output has been edited slightly for display.

  
LEAST SQUARES MULTILINEAR FIT
SAMPLE SIZE N       =       27
NUMBER OF VARIABLES =        1
NO REPLICATION CASE
  
  
       PARAMETER ESTIMATES           (APPROX. ST. DEV.)    T VALUE
1  A0                  -3.18451       (0.8265    )         -3.9
2  A1       XTEMP       1.69001       (0.2344    )          7.2
  
RESIDUAL    STANDARD DEVIATION =         0.8561206460
RESIDUAL    DEGREES OF FREEDOM =          25
  
      

[Figure: replicate variances plotted against replicate means, with fit]

The fit output and the plot of the replicate variances against the replicate means show that a linear fit is reasonable, with an estimated slope of 1.69. Note that this data set has a small number of replicates, so you may get a slightly different estimate for the slope. For example, S-PLUS generated a slope estimate of 1.52. The difference is caused by the sorting of the predictor variable: where there are actual replicates in the data, different sorting algorithms may put some observations in different replicate groups. In practice, any choice of c in the range 1.5 to 2.0 is reasonable and should produce comparable results for the weighted fit.
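One way to see how little the weighted fit depends on the exact value of c is to refit with several exponents and compare the coefficients. The sketch below does this on synthetic data, not the pipeline data: the true line, the noise model, and the random seed are all assumptions made for illustration, and `weighted_line_fit` is a hypothetical helper.

```python
import numpy as np

def weighted_line_fit(x, y, w):
    """Return [intercept, slope] from the weighted normal equations."""
    X = np.column_stack([np.ones_like(x), x])
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ y)

# Synthetic data: the noise standard deviation grows like x**0.75,
# so the variance grows like x**1.5.
rng = np.random.default_rng(0)
x = rng.uniform(5.0, 90.0, size=107)
y = 2.0 + 0.8 * x + rng.normal(scale=0.3 * x**0.75)

# Refit with weights 1 / x**c for several candidate exponents.
fits = {c: weighted_line_fit(x, y, 1.0 / x**c) for c in (1.5, 1.69, 2.0)}
for c, (a0, a1) in fits.items():
    print(f"c = {c}: intercept = {a0:.3f}, slope = {a1:.3f}")
```

On data of this kind, the fitted slopes for the three exponents typically agree closely, which mirrors the statement that values of c between 1.5 and 2.0 give comparable weighted fits.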

We used an estimate of 1.5 for c in the weighting function.

Residual Plot for Weight Function

[Figure: residuals from the fit for estimating weights]

The residual plot from the fit to determine an appropriate weighting function reveals no obvious problems.

Numerical Output from Weighted Fit

Dataplot generated the following output for the weighted fit (edited slightly for display).
LEAST SQUARES MULTILINEAR FIT
SAMPLE SIZE N       =      107
NUMBER OF VARIABLES =        1
REPLICATION CASE
REPLICATION STANDARD DEVIATION =     0.6112687111D+01
REPLICATION DEGREES OF FREEDOM =          29
NUMBER OF DISTINCT SUBSETS     =          78
  
  
       PARAMETER ESTIMATES           (APPROX. ST. DEV.)    T VALUE
1  A0                   2.35234       (0.5431    )          4.3
2  A1       LAB        0.806363       (0.2265E-01)          36.
  
RESIDUAL    STANDARD DEVIATION =         0.3645902574
RESIDUAL    DEGREES OF FREEDOM =         105
REPLICATION STANDARD DEVIATION =         6.1126871109
REPLICATION DEGREES OF FREEDOM =          29

      
This output shows a slope of 0.81 and an intercept of 2.35, compared with a slope of 0.73 and an intercept of 4.99 for the original fit.
Plot of Predicted Values

[Figure: predicted values plotted with the raw data]

The plot of the predicted values with the data indicates a good fit.

6-Plot of Fit

[Figure: 6-plot of the weighted fit]

We need to verify that the weighting did not result in the other regression assumptions being violated. The 6-plot indicates that the regression assumptions are satisfied.

Plot of Residuals

[Figure: residuals versus the predictor variable]

To check the assumption of homogeneous variances for the residuals in more detail, we generate a full-size plot of the residuals versus the predictor variable. This plot shows that the residuals now exhibit homogeneous variances.
