1. Exploratory Data Analysis
1.4. EDA Case Studies

1.4.1.

Case Studies Introduction

Purpose

The purpose of the first eight case studies is to show how EDA graphics and quantitative measures and tests are applied to data from scientific processes, and to critique those data with regard to the following assumptions that typically underlie a measurement process; namely, that the data behave like:
  • random drawings
  • from a fixed distribution
  • with a fixed location
  • with a fixed standard deviation
Case studies 9 and 10 show the use of EDA techniques in distributional modeling and the analysis of a designed experiment respectively.

Yi = C + Ei

If the above assumptions are satisfied, the process is said to be statistically "in control" with the core characteristic of having "predictability", that is, being able to make probability statements about the process, not only in the past, but also in the future.

An appropriate model for an "in control" process is

    Yi = C + Ei
where C is a constant (the "deterministic" or "structural" component), and where Ei is the error term (or "random" component).

The constant C is the "typical value" of the process--it is the primary summary number which shows up on any report. Although C is (assumed) fixed, it is unknown, and so a primary analysis objective of the engineer is to arrive at an estimate of C.

This goal partitions into 4 sub-goals:

  1. Is the most common estimator of C, $\bar{Y}$, the best estimator for C? What does "best" mean?
  2. If $\bar{Y}$ is best, what is the uncertainty $s_{\bar{Y}}$ for $\bar{Y}$? In particular, is the usual formula for the uncertainty of $\bar{Y}$:
    $s_{\bar{Y}} = S/\sqrt{N}$
    valid? Here, S is the standard deviation of the data and N is the sample size.
  3. If $\bar{Y}$ is NOT the best estimator for C, what is a better estimator $\tilde{C}$ for C (for example, median, midrange, midmean)?
  4. For this better estimator $\tilde{C}$, what is its uncertainty? That is, what is $s_{\tilde{C}}$?
EDA and the routine checking of underlying assumptions provide insight into all of the above.
  1. Location and variation checks provide information as to whether C is really constant.
  2. Distributional checks indicate whether the sample mean $\bar{Y}$ is the best estimator. Techniques for distributional checking include histograms, normal probability plots, and probability plot correlation coefficient plots.
  3. Randomness checks ascertain whether the usual uncertainty formula $s_{\bar{Y}} = S/\sqrt{N}$ is valid.
  4. Distributional tests assist in determining a better estimator, if needed.
  5. Simulator tools (namely bootstrapping) provide values for the uncertainty of alternate estimators.
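
As a rough illustration of sub-goals 2 and 4, the sketch below computes the sample mean with the usual S/sqrt(N) uncertainty and then uses a bootstrap to estimate the uncertainty of the median. The data set, the seed, the number of resamples, and the function names are all assumptions made for this example; they are not part of the handbook.

```python
import math
import random
import statistics


def mean_and_standard_error(data):
    """Return the sample mean and the usual uncertainty S / sqrt(N)."""
    n = len(data)
    s = statistics.stdev(data)  # sample standard deviation (N - 1 divisor)
    return statistics.fmean(data), s / math.sqrt(n)


def bootstrap_standard_error(data, estimator, n_resamples=2000, seed=0):
    """Estimate the uncertainty of `estimator` by resampling with replacement."""
    rng = random.Random(seed)
    n = len(data)
    estimates = [estimator(rng.choices(data, k=n)) for _ in range(n_resamples)]
    return statistics.stdev(estimates)


# Hypothetical data: 200 values from Yi = C + Ei with C = 10 and normal errors.
rng = random.Random(42)
y = [10.0 + rng.gauss(0.0, 1.0) for _ in range(200)]

y_bar, se_mean = mean_and_standard_error(y)
se_median = bootstrap_standard_error(y, statistics.median)

print(f"mean   = {y_bar:.3f}, S/sqrt(N)    = {se_mean:.3f}")
print(f"median = {statistics.median(y):.3f}, bootstrap SE = {se_median:.3f}")
```

The S/sqrt(N) figure is only meaningful when the randomness assumption holds, which is why the randomness checks come first. The bootstrap carries the same requirement, since it resamples the observations as if they were independent.
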

Assumptions not satisfied

If one or more of the above assumptions is not satisfied, then we use EDA techniques, or some mix of EDA and classical techniques, to find a more appropriate model for the data. That is,
    Yi = D + Ei
where D is the deterministic part and Ei is the error component.

If the data are not random, then we may investigate fitting some simple time series models to the data. If the constant location and scale assumptions are violated, we may need to investigate the measurement process to see if there is an explanation.

The assumptions above are still quite relevant in the sense that, for an appropriate model, the error component should follow the assumptions. The criterion for validating the model, or comparing competing models, is framed in terms of these assumptions.

Non-univariate data

Although the case studies in this chapter concentrate on univariate data, the assumptions above are relevant for non-univariate data as well.

If the data are not univariate, then we are trying to find a model

    Yi = F(X1, ..., Xk) + Ei
where F is some function based on one or more variables. The error component of a good model, which is a univariate data set, should satisfy the assumptions given above. The criteria for validating and comparing models are based on how well the error component follows these assumptions.
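
A minimal sketch of this idea, assuming a straight-line F with one predictor and made-up data: fit the model by least squares, then treat the residuals as a univariate data set and check them. The helper names, the true coefficients, and the choice of the lag-1 autocorrelation as the check are illustrative assumptions only.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def lag1_autocorrelation(values):
    """Lag-1 autocorrelation coefficient; close to 0 for random data."""
    m = statistics.fmean(values)
    num = sum((values[i] - m) * (values[i + 1] - m) for i in range(len(values) - 1))
    den = sum((v - m) ** 2 for v in values)
    return num / den


# Hypothetical data: F(X) = 2 + 0.5*X plus normal errors.
rng = random.Random(1)
x = [float(i) for i in range(100)]
y = [2.0 + 0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]

a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print(f"intercept = {a:.3f}, slope = {b:.3f}")
print(f"residual standard deviation = {statistics.stdev(residuals):.3f}")
print(f"residual lag-1 autocorrelation = {lag1_autocorrelation(residuals):.3f}")
```

If the residuals showed a drifting location, a changing spread, or strong autocorrelation, that would suggest the fitted F is missing structure.
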

The load cell calibration case study in the process modeling chapter shows an example of this in the regression context.

First three case studies operate on data with known characteristics

The first three case studies operate on data which are randomly generated from the following distributions:
  • normal distribution with mean 0 and standard deviation 1
  • uniform distribution over the interval (0,1), with mean 0.5 and standard deviation $\sqrt{1/12}$ (about 0.289)
  • random walk
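
Data sets of these three kinds can be simulated in a few lines, as sketched below. The sample size and seed are arbitrary, and the random walk is built here as a cumulative sum of uniform steps centered on zero, which is an assumption about the step distribution rather than something stated in this section.

```python
import math
import random
import statistics
from itertools import accumulate

rng = random.Random(0)
n = 500  # arbitrary sample size for illustration

# Normal distribution with mean 0 and standard deviation 1.
normal_data = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Uniform distribution over the interval (0, 1).
uniform_data = [rng.random() for _ in range(n)]

# Random walk: running sum of zero-centered uniform steps (assumed step distribution).
steps = [rng.random() - 0.5 for _ in range(n)]
random_walk = list(accumulate(steps))

print(f"normal:  mean = {statistics.fmean(normal_data):.3f}, "
      f"sd = {statistics.stdev(normal_data):.3f}")
print(f"uniform: mean = {statistics.fmean(uniform_data):.3f}, "
      f"sd = {statistics.stdev(uniform_data):.3f} "
      f"(theory: 0.5 and {math.sqrt(1 / 12):.3f})")
print(f"random walk: first = {random_walk[0]:.3f}, last = {random_walk[-1]:.3f}")
```

The first two sets should satisfy all four assumptions, while the random walk violates randomness and fixed location, which makes it a useful contrast.
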
The other univariate case studies operate on data from scientific processes. The goal is to determine if
    Yi = C + Ei
is a reasonable model. This is done by testing the underlying assumptions. If the assumptions are satisfied, then an estimate of C and an estimate of the uncertainty of C are computed. If the assumptions are not satisfied, we attempt to find a model where the error component does satisfy the underlying assumptions.

Graphical methods that are applied to the data

To test the underlying assumptions, each data set is analyzed using four graphical methods which are particularly suited to this purpose:
  1. run sequence plot, which is useful for detecting shifts of location or scale
  2. lag plot, which is useful for detecting non-randomness in the data
  3. histogram, which is useful for determining the underlying distribution
  4. normal probability plot, which is useful for deciding whether the data follow the normal distribution
There are a number of other techniques for addressing the underlying assumptions. However, the four plots listed above provide an excellent opportunity for addressing all of the assumptions on a single page of graphics.
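
To make the content of the four plots concrete, the sketch below computes the coordinates each one would display, leaving the actual drawing to whatever plotting tool is at hand. The function name, the bin count, and the plotting positions (i - 0.5)/N used for the normal probability plot are simplifying assumptions for this example.

```python
import random
import statistics


def four_plot_coordinates(y, n_bins=10):
    """Return the points behind a run sequence plot, lag plot,
    histogram, and normal probability plot for the data y."""
    n = len(y)

    # Run sequence plot: Yi against its index i.
    run_sequence = list(enumerate(y, start=1))

    # Lag plot: each value against the value that preceded it.
    lag = list(zip(y[:-1], y[1:]))

    # Histogram: equal-width bins spanning the data range.
    lo, hi = min(y), max(y)
    width = (hi - lo) / n_bins or 1.0  # guard against constant data
    counts = [0] * n_bins
    for value in y:
        counts[min(int((value - lo) / width), n_bins - 1)] += 1
    histogram = [(lo + (k + 0.5) * width, counts[k]) for k in range(n_bins)]

    # Normal probability plot: ordered data against standard normal quantiles
    # at the assumed plotting positions (i - 0.5) / n.
    std_normal = statistics.NormalDist()
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    normal_probability = list(zip(quantiles, sorted(y)))

    return {
        "run_sequence": run_sequence,
        "lag": lag,
        "histogram": histogram,
        "normal_probability": normal_probability,
    }


# Hypothetical data: 200 standard normal values.
rng = random.Random(7)
y = [rng.gauss(0.0, 1.0) for _ in range(200)]
plots = four_plot_coordinates(y)
for name, points in plots.items():
    print(f"{name}: {len(points)} points, first = {points[0]}")
```

For data that satisfy the assumptions, the run sequence points should form a flat band, the lag points a structureless cloud, the histogram a bell shape, and the normal probability points an approximately straight line.
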

Additional graphical techniques are used in certain case studies to develop models that do have error components that satisfy the underlying assumptions.

Quantitative methods that are applied to the data

The normal and uniform random number data sets are also analyzed with the following quantitative techniques, which are explained in more detail in an earlier section:
  1. Summary statistics
  2. Linear fit of the data as a function of time to assess drift (test for fixed location)
  3. Bartlett test for fixed variance
  4. Autocorrelation plot and coefficient to test for randomness
  5. Runs test to test for lack of randomness
  6. Anderson-Darling test for a normal distribution
  7. Grubbs test for outliers
  8. Summary report
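
Several of these checks reduce to a single statistic that is easy to compute directly. The sketch below covers the drift check (slope of a straight-line fit against the index, divided by its standard error), the lag-1 autocorrelation, a runs test, and the Grubbs statistic. The runs test shown is the large-sample version based on runs above and below the median, which is an assumption about the variant intended. The data and function names are hypothetical, and critical values are omitted.

```python
import math
import random
import statistics
from itertools import accumulate


def drift_t_ratio(y):
    """Slope of a line fitted to y against index 1..N, over its standard error."""
    n = len(y)
    x = range(1, n + 1)
    x_bar, y_bar = (n + 1) / 2, statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return slope / math.sqrt(sse / (n - 2) / sxx)


def lag1_autocorrelation(y):
    """Lag-1 autocorrelation coefficient; close to 0 for random data."""
    m = statistics.fmean(y)
    num = sum((y[i] - m) * (y[i + 1] - m) for i in range(len(y) - 1))
    return num / sum((v - m) ** 2 for v in y)


def runs_test_z(y):
    """Large-sample z statistic for the number of runs above/below the median."""
    med = statistics.median(y)
    above = [v > med for v in y if v != med]  # drop ties with the median
    n1 = sum(above)
    n2 = len(above) - n1
    runs = 1 + sum(a != b for a, b in zip(above, above[1:]))
    expected = 2 * n1 * n2 / (n1 + n2) + 1
    variance = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
                / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    return (runs - expected) / math.sqrt(variance)


def grubbs_statistic(y):
    """Largest absolute deviation from the mean, in standard deviation units."""
    m, s = statistics.fmean(y), statistics.stdev(y)
    return max(abs(v - m) for v in y) / s


# Hypothetical data: random normal values and, for contrast, a random walk.
rng = random.Random(3)
random_data = [rng.gauss(0.0, 1.0) for _ in range(500)]
walk_data = list(accumulate(rng.gauss(0.0, 1.0) for _ in range(500)))

for name, data in [("random", random_data), ("random walk", walk_data)]:
    print(f"{name}: drift t = {drift_t_ratio(data):.2f}, "
          f"lag-1 r = {lag1_autocorrelation(data):.2f}, "
          f"runs z = {runs_test_z(data):.2f}, "
          f"Grubbs G = {grubbs_statistic(data):.2f}")
```

For the random data, the drift ratio, autocorrelation, and runs statistic should all be small in magnitude. For the random walk, the autocorrelation is close to 1 and the runs statistic is strongly negative, reflecting far fewer runs than randomness would produce.
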

Although the graphical methods applied to the normal and uniform random numbers are sufficient to assess the validity of the underlying assumptions, the quantitative techniques are used to show the differing flavor of the graphical and quantitative approaches.

The remaining case studies intermix one or more of these quantitative techniques into the analysis where appropriate.
