1. Exploratory Data Analysis
1.1. EDA Introduction

1.1.6. An EDA/Graphics Example

Anscombe Example A simple, classic (Anscombe) example of the central role that graphics play in providing insight into a data set starts with the following data set:
Data
  X              Y
10.00           8.04
 8.00           6.95
13.00           7.58
 9.00           8.81
11.00           8.33
14.00           9.96
 6.00           7.24
 4.00           4.26
12.00          10.84
 7.00           4.82
 5.00           5.68

Summary Statistics If the goal of the analysis is to compute summary statistics, plus determine the best linear fit for Y as a function of X, then the analysis would yield:
    N = 11
    Mean of X = 9.0
    Mean of Y = 7.5
    Intercept = 3
    Slope = 0.5
    Residual standard deviation = 1.237
    Correlation = 81.6%
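
The summary values listed above can be recomputed directly from the eleven (X, Y) pairs. The following is a minimal Python sketch rather than part of the original analysis: the helper name `summarize` is hypothetical, and the residual standard deviation is assumed to use the usual n - 2 divisor for a straight-line fit with two estimated parameters.

```python
import math

# Data set 1, copied from the table above.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]


def summarize(x, y):
    """Least-squares summary of y versus x (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Centered sums of squares and cross-products.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals about the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "intercept": intercept,
        "slope": slope,
        "resid_sd": math.sqrt(ss_res / (n - 2)),  # assumed n - 2 divisor
        "corr": sxy / math.sqrt(sxx * syy),
    }


stats = summarize(x, y)
for name, value in stats.items():
    print(f"{name:>10} = {value:.3f}")
```

Nothing in this calculation looks at the shape of the point cloud, which is the limitation the rest of this section illustrates.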

The above quantitative analysis, though valuable, gives us only limited insight into the data.

Scatter Plot In contrast, the following simple scatter plot of the data

[Figure: scatter plot of Y versus X for the data set above]

suggests the following:

  1. the data "behaves like" a linear curve with some scatter;
  2. there is no justification for a more complicated model (e.g., quadratic);
  3. there are no outliers;
  4. the points are not serially correlated with one another;
  5. the vertical envelope of the data appears to be of equal height irrespective of the X-value; this indicates that the data is equally precise throughout and so a "regular" (that is, equi-weighted) fit is appropriate.
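
The five observations above are read off a picture of the data. A plotting library would normally draw that picture; as a dependency-free sketch, the following Python renders data set 1 as a crude character plot. The helper `text_scatter` and its grid size are hypothetical choices made for illustration only.

```python
# Data set 1, copied from the table above.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]


def text_scatter(xs, ys, width=41, height=15):
    """Render points as '*' on a character grid (hypothetical helper)."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for xi, yi in zip(xs, ys):
        # Scale each coordinate onto the grid; row 0 is the top of the plot.
        col = round((xi - x_min) / (x_max - x_min) * (width - 1))
        row = round((y_max - yi) / (y_max - y_min) * (height - 1))
        grid[row][col] = "*"
    return "\n".join("".join(row) for row in grid)


plot = text_scatter(x, y)
print(plot)
```

Even at this resolution the points rise from lower left to upper right in a band of roughly constant height.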
3 Additional Data Sets This kind of characterization serves as the core for gaining insight into, and a feel for, the data. Such insight does not come from the quantitative statistics; on the contrary, quantitative statistics such as the intercept and slope should be calculated after the characterization, and they make sense only if the characterization holds. To illustrate the loss of information that results when the graphical insight step is skipped, consider the following 3 data sets: [Anscombe data sets 2, 3, and 4]
 X2     Y2       X3     Y3       X4     Y4
10.00   9.14    10.00   7.46     8.00   6.58
 8.00   8.14     8.00   6.77     8.00   5.76
13.00   8.74    13.00  12.74     8.00   7.71
 9.00   8.77     9.00   7.11     8.00   8.84
11.00   9.26    11.00   7.81     8.00   8.47
14.00   8.10    14.00   8.84     8.00   7.04
 6.00   6.13     6.00   6.08     8.00   5.25
 4.00   3.10     4.00   5.39    19.00  12.50
12.00   9.13    12.00   8.15     8.00   5.56
 7.00   7.26     7.00   6.42     8.00   7.91
 5.00   4.74     5.00   5.73     8.00   6.89
Quantitative Statistics for Data Set 2 A quantitative analysis on data set 2 yields
    N = 11
    Mean of X = 9.0
    Mean of Y = 7.5
    Intercept = 3
    Slope = 0.5
    Standard deviation of residuals = 1.237
    Correlation = 81.6%
which is identical to the analysis of data set 1. One may naively assume that the two data sets are "equivalent" since that is what the statistics tell us; but what do the statistics not tell us?
Quantitative Statistics for Data Sets 3 and 4 Remarkably, a quantitative analysis on data sets 3 and 4 yields the same values to within rounding:
    N = 11
    Mean of X = 9.0
    Mean of Y = 7.5
    Intercept = 3
    Slope = 0.5
    Standard deviation of residuals = 1.236
    Correlation = 81.6% (81.7% for data set 4)
which implies that in some quantitative sense, all four of the data sets are "equivalent". In fact, the four data sets are far from "equivalent" and a scatter plot of each data set, which would be step 1 of any EDA approach, would tell us that immediately.
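
The near-equality of the summary statistics across all four data sets can be checked with a short script. This is a sketch under stated assumptions: `fit_line` is a hypothetical helper, and the residual standard deviation again assumes an n - 2 divisor.

```python
import math

# X is shared by data sets 1-3; data set 4 has its own X column.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
data = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Return (intercept, slope, residual SD, correlation) for y versus x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    resid_sd = math.sqrt(ss_res / (n - 2))  # assumed n - 2 divisor
    corr = sxy / math.sqrt(sxx * syy)
    return intercept, slope, resid_sd, corr


results = {k: fit_line(x, y) for k, (x, y) in data.items()}
for k, (b0, b1, s, r) in results.items():
    print(f"set {k}: intercept={b0:.2f} slope={b1:.3f} resid_sd={s:.3f} corr={r:.3f}")
```

The four printed rows differ only in the third decimal place, even though the underlying data sets have very different structure.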
Scatter Plots

[Figure: four scatter plots, one per data set, that exhibit different characteristics]

Interpretation of Scatter Plots Conclusions from the scatter plots are:
  1. data set 1 is clearly linear with some scatter.
  2. data set 2 is clearly quadratic.
  3. data set 3 clearly has an outlier.
  4. data set 4 clearly is the victim of a poor experimental design with a single point far removed from the bulk of the data "wagging the dog".
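
Conclusions 3 and 4 can also be given rough numeric companions once the plots have suggested what to look for. The sketch below is illustrative only and is not a substitute for the plots: it finds the largest residual from the straight-line fit in data set 3 and the leverage of each point in data set 4, using the standard simple-regression leverage 1/n + (x - mean of X)^2 / Sxx. The helper names `linear_residuals` and `leverages` are hypothetical.

```python
# Data sets 3 and 4, copied from the table above.
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def linear_residuals(x, y):
    """Residuals from the least-squares straight line (hypothetical helper)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def leverages(x):
    """Leverage of each point in a straight-line fit (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]


# Data set 3: the point with the largest absolute residual.
res3 = linear_residuals(x3, y3)
i_out = max(range(len(res3)), key=lambda i: abs(res3[i]))
print(f"set 3 largest residual at X={x3[i_out]}, Y={y3[i_out]}: {res3[i_out]:.2f}")

# Data set 4: the point with the largest leverage, and its residual.
lev4 = leverages(x4)
res4 = linear_residuals(x4, y4)
i_lev = max(range(len(lev4)), key=lambda i: lev4[i])
print(f"set 4 largest leverage at X={x4[i_lev]}: {lev4[i_lev]:.3f}, "
      f"residual {res4[i_lev]:.3f}")
```

A leverage of 1 means the fitted line is forced through that single point, which is the numeric counterpart of one point "wagging the dog".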
Importance of Exploratory Analysis These points are exactly the substance that provides and defines "insight" and "feel" for a data set. They are the goals and the fruits of an open exploratory data analysis (EDA) approach to the data. Quantitative statistics are not wrong per se, but they are incomplete. They are incomplete because they are numeric SUMMARIES that, in the summarization operation, do a good job of focusing on a particular aspect of the data (e.g., location, intercept, slope, degree of relatedness) by judiciously reducing the data to a few numbers. Doing so also FILTERS the data, necessarily omitting and screening out other, sometimes crucial, information in the focusing operation. Quantitative statistics focus but also filter, and filtering is exactly what makes the quantitative approach incomplete at best and misleading at worst.

The estimated intercepts (= 3) and slopes (= 0.5) for data sets 2, 3, and 4 are misleading because the estimation is done in the context of an assumed linear model and that linearity assumption is the fatal flaw in this analysis.
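
For data set 2 the cost of the assumed linear model can be made concrete by comparing a straight-line fit with a quadratic fit, the model the scatter plot suggests. The following is a minimal sketch using plain normal equations; `solve`, `polyfit`, and `max_abs_residual` are hypothetical helpers written for this illustration, and a numerical library would normally do this work.

```python
# Data set 2, copied from the table above.
x2 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def solve(a, b):
    """Solve the linear system a * coef = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column up.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    coef = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef


def polyfit(x, y, degree):
    """Least-squares polynomial coefficients, lowest power first."""
    powers = range(degree + 1)
    a = [[sum(xi ** (i + j) for xi in x) for j in powers] for i in powers]
    b = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in powers]
    return solve(a, b)


def max_abs_residual(x, y, coef):
    """Largest absolute difference between y and the fitted polynomial."""
    return max(
        abs(yi - sum(c * xi ** k for k, c in enumerate(coef)))
        for xi, yi in zip(x, y)
    )


line = polyfit(x2, y2, 1)
quad = polyfit(x2, y2, 2)
print("linear coefficients:   ", [round(c, 4) for c in line])
print("quadratic coefficients:", [round(c, 4) for c in quad])
print("max |residual|, linear:   ", round(max_abs_residual(x2, y2, line), 3))
print("max |residual|, quadratic:", round(max_abs_residual(x2, y2, quad), 3))
```

The straight line leaves residuals of more than one unit of Y, while the quadratic reproduces the data almost exactly, so the intercept of 3 and slope of 0.5 describe a model that does not fit this data set.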

The EDA approach of deliberately postponing model selection until further along in the analysis has many rewards, not the least of which are the ultimate convergence to a much-improved model and the formulation of valid and supportable scientific and engineering conclusions.