$R^2$ Is Not Enough!
|
Model validation is possibly the most important step in the model building
sequence. It is also one of the most overlooked. Often the validation of
a model seems to consist of nothing more than quoting the $R^2$
statistic from the fit (which measures the fraction
of the total variability in the data that is accounted for by the model).
Unfortunately, a high $R^2$ value does not guarantee
that the model fits the data well. Use of a model that does not fit the
data well cannot provide good answers to the underlying engineering or
scientific questions under investigation.
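The point about $R^2$ can be illustrated with a minimal sketch. The data here are hypothetical (an exactly quadratic response, not taken from the handbook), and the helper names `fit_line` and `r_squared` are introduced only for this illustration. A straight line is fitted to the curved data and still yields an $R^2$ of about 0.95, even though the functional form is wrong.

```python
def fit_line(x, y):
    """Ordinary least-squares straight line; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(y, fitted):
    """Fraction of the total variability in y accounted for by the fit."""
    y_bar = sum(y) / len(y)
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1.0 - ss_resid / ss_total


# Hypothetical data (an assumption for illustration): the true
# relationship is quadratic, but a straight line is fitted anyway.
x = [float(v) for v in range(1, 11)]
y = [v ** 2 for v in x]

intercept, slope = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
r2 = r_squared(y, fitted)

print(f"R^2 of the straight-line fit: {r2:.3f}")
```

A summary number this high could easily be quoted as evidence of a good fit, which is why it should not be the only check.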
|
||
| Main Tool: Graphical Residual Analysis |
There are many statistical tools for model validation, but the primary tool
for most process modeling applications is graphical residual analysis.
Different types of plots of the residuals (see definition
below) from a fitted
model provide information on the adequacy of different aspects of the model.
Numerical methods for model validation, such as the $R^2$
statistic, are also useful, but usually to a lesser degree than graphical
methods. Graphical methods have the advantage over numerical methods for
model validation because they readily illustrate a broad range of complex
aspects of the relationship between the model and the data. Numerical methods
for model validation tend to be narrowly focused on a particular aspect of the
relationship between the model and the data and often try to compress that
information into a single descriptive number or test result.
|
||
| Numerical Methods' Forte | Numerical methods do play an important role as confirmatory methods for graphical techniques, however. For example, the lack-of-fit test for assessing the correctness of the functional part of the model can aid in interpreting a borderline residual plot. There are also a few modeling situations in which graphical methods cannot easily be used. In these cases, numerical methods provide a fallback position for model validation. One common situation in which numerical validation methods take precedence over graphical methods occurs when the number of parameters being estimated is relatively close to the size of the data set. In this situation residual plots are often difficult to interpret because of constraints on the residuals imposed by the estimation of the unknown parameters. One area where this typically happens is in optimization applications using designed experiments. Logistic regression with binary data is another area where graphical residual analysis can be difficult. | ||
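As a rough illustration of the lack-of-fit test mentioned above, the sketch below splits the residual sum of squares from a straight-line fit into a pure-error part (scatter among replicates at the same setting) and a lack-of-fit part (distance of the replicate means from the fitted line), then forms the usual F ratio. The replicated data and all names are hypothetical, chosen only for this example. The ratio would still need to be compared against an F distribution with the stated degrees of freedom, which is not done here.

```python
def fit_line(x, y):
    """Ordinary least-squares straight line; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical replicated data: three responses at each of five settings.
levels = [1.0, 2.0, 3.0, 4.0, 5.0]
responses = [
    [1.1, 0.9, 1.0],
    [3.9, 4.2, 4.0],
    [9.1, 8.8, 9.0],
    [16.2, 15.9, 16.0],
    [24.8, 25.1, 25.0],
]
x = [lv for lv, ys in zip(levels, responses) for _ in ys]
y = [yi for ys in responses for yi in ys]

intercept, slope = fit_line(x, y)
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

ss_pure_error = 0.0   # scatter of replicates about their own mean
ss_lack_of_fit = 0.0  # distance of replicate means from the fitted line
for lv, ys in zip(levels, responses):
    mean = sum(ys) / len(ys)
    ss_pure_error += sum((yi - mean) ** 2 for yi in ys)
    ss_lack_of_fit += len(ys) * (mean - (intercept + slope * lv)) ** 2

df_lack_of_fit = len(levels) - 2       # distinct settings minus parameters
df_pure_error = len(y) - len(levels)   # observations minus distinct settings
f_stat = (ss_lack_of_fit / df_lack_of_fit) / (ss_pure_error / df_pure_error)

print(f"F = {f_stat:.1f} on ({df_lack_of_fit}, {df_pure_error}) degrees of freedom")
```

A large ratio indicates that the replicate means deviate from the fitted function by more than replicate scatter alone would explain. The test requires replicate measurements at one or more settings.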
| Residuals |
The residuals from a fitted model are the differences between the responses
observed at each combination of explanatory variables and the corresponding
prediction of the response computed using the regression function.
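Mathematically, the definition of the residual for the ith
observation in the data set is written
$$e_i = y_i - f(\vec{x}_i; \hat{\vec{\beta}}),$$
where $y_i$ represents the ith
response in the data set, $\vec{x}_i$
represents the list of explanatory variables, each set at the corresponding
values found in the ith observation in the data set, and
$\hat{\vec{\beta}}$ denotes the estimated parameters of the regression function.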
|
||
| Example |
The data listed below is from the
Pressure/Temperature example introduced
in Section 4.1.1. The first column
contains the values of the explanatory variable, Temperature, and the second
contains the observed responses, Pressure. The third column gives the
corresponding values from the fitted straight-line regression function.
The last column lists the residuals, the difference between columns two and three. |
||
| Data, Fitted Values & Residuals |
Fitted
Temperature Pressure Values Residuals
--------------------------------------------------------
54.749 225.066 222.920 2.146
23.323 100.331 99.411 0.920
58.775 230.863 238.744 -7.881
25.854 106.160 109.359 -3.199
68.297 277.502 276.165 1.336
37.481 148.314 155.056 -6.741
49.542 197.562 202.456 -4.895
34.101 138.537 141.770 -3.232
33.901 137.969 140.983 -3.014
29.242 117.410 122.674 -5.263
39.506 164.442 163.013 1.429
43.004 181.044 176.759 4.285
53.226 222.179 216.933 5.246
54.467 227.010 221.813 5.198
57.549 232.496 233.925 -1.429
61.204 253.557 248.288 5.269
42.989 169.427 176.703 -7.276
68.476 273.931 276.871 -2.940
51.144 207.969 208.753 -0.784
68.774 280.205 278.040 2.165
55.350 227.060 225.282 1.779
44.692 180.605 183.396 -2.791
50.995 206.229 208.167 -1.938
21.602 91.464 92.649 -1.186
54.673 223.869 222.622 1.247
41.449 172.910 170.651 2.259
35.451 152.073 147.075 4.998
31.489 139.894 131.506 8.388
48.599 192.561 198.748 -6.188
21.448 94.448 92.042 2.406
56.982 222.794 231.697 -8.902
47.901 199.003 196.008 2.996
40.285 168.668 166.077 2.592
25.609 109.387 108.397 0.990
22.971 98.445 98.029 0.416
25.838 110.987 109.295 1.692
49.127 202.662 200.826 1.835
54.936 224.773 223.653 1.120
50.917 216.058 207.859 8.199
41.976 171.469 172.720 -1.251
|
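The fitted values and residuals in the table can be recomputed with a short sketch. It assumes the straight line was fitted by ordinary least squares to the 40 Temperature/Pressure pairs listed above. The helper `fit_line` and the variable names are introduced here for illustration only.

```python
# Temperature/Pressure pairs copied from the table above (two pairs per line).
DATA = """
54.749 225.066   23.323 100.331
58.775 230.863   25.854 106.160
68.297 277.502   37.481 148.314
49.542 197.562   34.101 138.537
33.901 137.969   29.242 117.410
39.506 164.442   43.004 181.044
53.226 222.179   54.467 227.010
57.549 232.496   61.204 253.557
42.989 169.427   68.476 273.931
51.144 207.969   68.774 280.205
55.350 227.060   44.692 180.605
50.995 206.229   21.602  91.464
54.673 223.869   41.449 172.910
35.451 152.073   31.489 139.894
48.599 192.561   21.448  94.448
56.982 222.794   47.901 199.003
40.285 168.668   25.609 109.387
22.971  98.445   25.838 110.987
49.127 202.662   54.936 224.773
50.917 216.058   41.976 171.469
"""


def fit_line(x, y):
    """Ordinary least-squares straight line; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


values = [float(v) for v in DATA.split()]
temperature = values[0::2]  # explanatory variable
pressure = values[1::2]     # observed response

intercept, slope = fit_line(temperature, pressure)
fitted = [intercept + slope * t for t in temperature]

# Residual = observed response minus fitted value (column two minus column three).
residuals = [p - f for p, f in zip(pressure, fitted)]

print(f"intercept = {intercept:.4f}, slope = {slope:.4f}")
for t, p, f, e in list(zip(temperature, pressure, fitted, residuals))[:5]:
    print(f"{t:8.3f} {p:9.3f} {f:9.3f} {e:8.3f}")
```

Because the model includes an intercept, the least-squares residuals sum to zero apart from rounding, which is a quick sanity check on any such table.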
||
| Why Use Residuals? | Assuming the model fit to the data is correct, the residuals approximate the random errors that make the relationship between the explanatory variables and the response variable a statistical relationship. Therefore, if the residuals appear to behave randomly, it suggests that the model fits the data well. On the other hand, if non-random structure is evident in the residuals, it is a clear sign that the model fits the data poorly. The subsections listed below detail the types of plots to use to test different aspects of a model and give guidance on the correct interpretations of different results that could be observed for each type of plot. | ||
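A crude numerical counterpart to looking for non-random structure is to count runs of same-signed residuals when the observations are ordered by the explanatory variable. The sketch below uses hypothetical data and hypothetical helper names (`fit_line`, `count_sign_runs`). It is meant only to show the idea and is not a substitute for the residual plots described in this section.

```python
import random


def fit_line(x, y):
    """Ordinary least-squares straight line; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def line_residuals(x, y):
    """Residuals (observed minus fitted) from a straight-line fit."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def count_sign_runs(residuals):
    """Number of runs of consecutive residuals sharing the same sign."""
    signs = [r > 0 for r in residuals]
    return 1 + sum(a != b for a, b in zip(signs, signs[1:]))


x = [float(v) for v in range(1, 21)]  # already ordered by x

# Hypothetical case 1: curved data, straight-line model (wrong form).
y_curved = [xi ** 2 for xi in x]
runs_curved = count_sign_runs(line_residuals(x, y_curved))

# Hypothetical case 2: linear data plus random noise (correct form).
rng = random.Random(0)
y_linear = [2.0 + 3.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
runs_linear = count_sign_runs(line_residuals(x, y_linear))

print("runs, wrong functional form:", runs_curved)
print("runs, correct functional form:", runs_linear)
```

For the misspecified fit the residuals are positive, then negative, then positive again, giving only three runs across twenty observations. Residuals that behave randomly typically change sign far more often.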
| Model Validation Specifics |
|
||