4. Process Modeling
4.4. Data Analysis for Process Modeling

4.4.4. How can I tell if a model fits my data?

$R^2$ Is Not Enough!
Model validation is possibly the most important step in the model building sequence. It is also one of the most overlooked. Often the validation of a model seems to consist of nothing more than quoting the $R^2$ statistic from the fit (which measures the fraction of the total variability in the data that is accounted for by the model). Unfortunately, a high $R^2$ value does not guarantee that the model fits the data well. Use of a model that does not fit the data well cannot provide good answers to the underlying engineering or scientific questions under investigation.
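The point about a high $R^2$ can be illustrated with a small sketch. The data here are hypothetical and noise-free (they are not from the Handbook): the response follows a curve, yet a straight line is fitted. The helper names `fit_line` and `r_squared` are illustrative only.

```python
def fit_line(x, y):
    """Ordinary least-squares straight line; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(y, fitted):
    """Fraction of total variability accounted for by the fitted values."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical data: the true relationship is quadratic, not linear.
x = list(range(1, 11))
y = [xi ** 2 for xi in x]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print(f"R^2 = {r_squared(y, fitted):.3f}")
# Signs of the residuals, in order of increasing x.
print("residual signs:", "".join("+" if e > 0 else "-" for e in residuals))
```

For these data $R^2$ is about 0.95, which looks reassuring, but the residuals are positive at both ends of the range and negative in the middle. That systematic pattern shows the straight line is the wrong functional form, which the single number does not reveal.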
Main Tool: Graphical Residual Analysis
There are many statistical tools for model validation, but the primary tool for most process modeling applications is graphical residual analysis. Different types of plots of the residuals (see definition below) from a fitted model provide information on the adequacy of different aspects of the model. Numerical methods for model validation, such as the $R^2$ statistic, are also useful, but usually to a lesser degree than graphical methods. Graphical methods have an advantage over numerical methods for model validation because they readily illustrate a broad range of complex aspects of the relationship between the model and the data. Numerical methods for model validation tend to be narrowly focused on a particular aspect of the relationship between the model and the data and often try to compress that information into a single descriptive number or test result.
Numerical Methods' Forte
Numerical methods do play an important role as confirmatory methods for graphical techniques, however. For example, the lack-of-fit test for assessing the correctness of the functional part of the model can aid in interpreting a borderline residual plot. There are also a few modeling situations in which graphical methods cannot easily be used. In these cases, numerical methods provide a fallback position for model validation. One common situation in which numerical validation methods take precedence over graphical methods occurs when the number of parameters being estimated is relatively close to the size of the data set. In this situation residual plots are often difficult to interpret due to constraints on the residuals imposed by the estimation of the unknown parameters. One area where this typically happens is in optimization applications using designed experiments. Logistic regression with binary data is another area where graphical residual analysis can be difficult.
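The constraints that parameter estimation places on the residuals can be seen in an extreme case. The sketch below is an illustration under assumed, made-up data, not an example from the Handbook: a straight line (two parameters) is fitted by least squares to only three points. Because least squares with an intercept forces the residuals to sum to zero and to be uncorrelated with the explanatory variable, only one degree of freedom is left, and for equally spaced x values every residual vector is a multiple of (1, -2, 1), whatever the responses are.

```python
def line_residuals(x, y):
    """Residuals from an ordinary least-squares straight-line fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


x = [0.0, 1.0, 2.0]

# Three unrelated, hypothetical response vectors.
for y in ([1.0, 5.0, 3.0], [2.0, 2.5, 7.0], [10.0, 4.0, 4.0]):
    e = line_residuals(x, y)
    # Scale by the first residual to expose the common pattern.
    pattern = [round(v / e[0], 6) for v in e]
    print("residuals:", [round(v, 4) for v in e], "pattern:", pattern)
```

With so few degrees of freedom, the shape of a residual plot is determined by the fitting procedure rather than by the data, which is why such plots are hard to interpret.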
Residuals
The residuals from a fitted model are the differences between the responses observed at each combination of explanatory variables and the corresponding prediction of the response computed using the regression function. Mathematically the definition of the residual for the ith observation in the data set is written

$$ e_i = y_i - f(\vec{x}_i; \hat{\vec{\beta}}), $$

where $y_i$ represents the ith response in the data set and $\vec{x}_i$ represents the list of explanatory variables, each set at the corresponding values found in the ith observation in the data set. Here $f(\vec{x}_i; \hat{\vec{\beta}})$ denotes the regression function evaluated at $\vec{x}_i$ using the estimated parameters.
Example
The data listed below are from the Pressure/Temperature example introduced in Section 4.1.1. The first column contains the values of the explanatory variable, Temperature, and the second contains the observed responses, Pressure. The third column gives the corresponding values from the fitted straight-line regression function. The last column lists the residuals, the differences between columns two and three.
Data, Fitted Values & Residuals
                               Fitted
Temperature    Pressure        Values         Residuals
--------------------------------------------------------
  54.749        225.066        222.920          2.146
  23.323        100.331         99.411          0.920
  58.775        230.863        238.744         -7.881
  25.854        106.160        109.359         -3.199
  68.297        277.502        276.165          1.336
  37.481        148.314        155.056         -6.741
  49.542        197.562        202.456         -4.895
  34.101        138.537        141.770         -3.232
  33.901        137.969        140.983         -3.014
  29.242        117.410        122.674         -5.263
  39.506        164.442        163.013          1.429
  43.004        181.044        176.759          4.285
  53.226        222.179        216.933          5.246
  54.467        227.010        221.813          5.198
  57.549        232.496        233.925         -1.429
  61.204        253.557        248.288          5.269
  42.989        169.427        176.703         -7.276
  68.476        273.931        276.871         -2.940
  51.144        207.969        208.753         -0.784
  68.774        280.205        278.040          2.165
  55.350        227.060        225.282          1.779
  44.692        180.605        183.396         -2.791
  50.995        206.229        208.167         -1.938
  21.602         91.464         92.649         -1.186
  54.673        223.869        222.622          1.247
  41.449        172.910        170.651          2.259
  35.451        152.073        147.075          4.998
  31.489        139.894        131.506          8.388
  48.599        192.561        198.748         -6.188
  21.448         94.448         92.042          2.406
  56.982        222.794        231.697         -8.902
  47.901        199.003        196.008          2.996
  40.285        168.668        166.077          2.592
  25.609        109.387        108.397          0.990
  22.971         98.445         98.029          0.416
  25.838        110.987        109.295          1.692
  49.127        202.662        200.826          1.835
  54.936        224.773        223.653          1.120
  50.917        216.058        207.859          8.199
  41.976        171.469        172.720         -1.251
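The table can be reproduced with a short sketch, assuming the fitted values come from an ordinary least-squares straight line through all 40 observations (the residuals in the table sum to approximately zero, which is consistent with that assumption). The variable names are illustrative; the data pairs are copied from the first two columns of the table.

```python
# (Temperature, Pressure) pairs copied from the table above.
data = [
    (54.749, 225.066), (23.323, 100.331), (58.775, 230.863), (25.854, 106.160),
    (68.297, 277.502), (37.481, 148.314), (49.542, 197.562), (34.101, 138.537),
    (33.901, 137.969), (29.242, 117.410), (39.506, 164.442), (43.004, 181.044),
    (53.226, 222.179), (54.467, 227.010), (57.549, 232.496), (61.204, 253.557),
    (42.989, 169.427), (68.476, 273.931), (51.144, 207.969), (68.774, 280.205),
    (55.350, 227.060), (44.692, 180.605), (50.995, 206.229), (21.602, 91.464),
    (54.673, 223.869), (41.449, 172.910), (35.451, 152.073), (31.489, 139.894),
    (48.599, 192.561), (21.448, 94.448), (56.982, 222.794), (47.901, 199.003),
    (40.285, 168.668), (25.609, 109.387), (22.971, 98.445), (25.838, 110.987),
    (49.127, 202.662), (54.936, 224.773), (50.917, 216.058), (41.976, 171.469),
]
temperature = [t for t, _ in data]
pressure = [p for _, p in data]

# Least-squares estimates for the straight line: pressure = b0 + b1 * temperature.
n = len(data)
t_bar = sum(temperature) / n
p_bar = sum(pressure) / n
stt = sum((t - t_bar) ** 2 for t in temperature)
stp = sum((t - t_bar) * (p - p_bar) for t, p in data)
b1 = stp / stt
b0 = p_bar - b1 * t_bar

# Residual = observed response minus fitted value, one per observation.
fitted = [b0 + b1 * t for t in temperature]
residuals = [p - f for p, f in zip(pressure, fitted)]

print("Temperature  Pressure    Fitted  Residual")
for t, p, f, e in zip(temperature, pressure, fitted, residuals):
    print(f"{t:11.3f} {p:9.3f} {f:9.3f} {e:9.3f}")
```

Small differences in the last printed digit relative to the table can arise from rounding.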
Why Use Residuals?
Assuming the model fit to the data is correct, the residuals approximate the random errors that make the relationship between the explanatory variables and the response variable a statistical relationship. Therefore, if the residuals appear to behave randomly, it suggests that the model fits the data well. On the other hand, if non-random structure is evident in the residuals, it is a clear sign that the model fits the data poorly. The subsections listed below detail the types of plots to use to test different aspects of a model and give guidance on the correct interpretations of different results that could be observed for each type of plot.
Model Validation Specifics
  1. How can I assess the sufficiency of the functional part of the model?
  2. How can I detect non-constant variation across the data?
  3. How can I tell if there was drift in the measurement process?
  4. How can I assess whether the random errors are independent from one to the next?
  5. How can I test whether or not the random errors are distributed normally?
  6. How can I test whether all of the terms in the functional part of the model are necessary?
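As a crude numerical companion to the plots covered in these subsections, one can count runs of same-signed residuals. This is a swapped-in illustration based on the idea behind the Wald-Wolfowitz runs test, not a method prescribed in this section, and the residual values below are hypothetical. Far fewer runs than expected under random ordering hints at non-random structure, though a plot remains the more informative check.

```python
def count_runs(residuals):
    """Number of maximal runs of same-signed residuals (zeros are skipped)."""
    signs = [e > 0 for e in residuals if e != 0]
    if not signs:
        return 0
    runs = 1
    for prev, cur in zip(signs, signs[1:]):
        if cur != prev:
            runs += 1
    return runs


def expected_runs(residuals):
    """Expected number of runs if the signs were in random order."""
    n_pos = sum(1 for e in residuals if e > 0)
    n_neg = sum(1 for e in residuals if e < 0)
    return 2.0 * n_pos * n_neg / (n_pos + n_neg) + 1.0


# Hypothetical residuals, ordered by the explanatory variable, from a
# straight-line fit to curved data: positive, then negative, then positive.
patterned = [12.0, 4.0, -2.0, -6.0, -8.0, -8.0, -6.0, -2.0, 4.0, 12.0]

print("observed runs:", count_runs(patterned))
print("expected runs under randomness:", expected_runs(patterned))
```

The residuals must be listed in a meaningful order (for example by the explanatory variable or by run order) for the count to say anything about structure.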