Types of Exploratory Data Analysis

Exploratory data analysis applies a set of basic methods for summarizing a set of data in order to detect unexpected patterns and relationships among variables. This separates the exploratory approach from confirmatory data analysis, which emphasizes hypothesis testing. Statistician John Tukey pioneered the methods of exploratory data analysis in the 1970s. Although exploratory analysis includes some basic statistical methods, most of its techniques are visual, as graphical representations provide a means for open-minded exploration of the data.

    Five-Number Summary

    • This exploratory technique summarizes the distribution of the data with five values: the smallest value, the first quartile, the median, the third quartile and the largest value. The median is a measure of central tendency, while the quartiles and the two extremes convey variation. Analysts develop this summary by arranging the values of the data in ascending order, then selecting the smallest and largest values, as well as the median, the data point that lies in the middle. The other two values are the first quartile, the value at which 25 percent of the observations are smaller and 75 percent are larger, and the third quartile, the value at which 75 percent are smaller and 25 percent are larger. Arraying these five numbers from smallest to largest conveys a sense of how symmetrical the data are.
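
    The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a standard routine: the function name and the sample scores are made up, and the quartiles are taken as the medians of the lower and upper halves of the sorted data, which is only one of several common quartile conventions.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for at least two numbers."""
    data = sorted(values)              # arrange the values in ascending order
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = data[:half]                # values below the middle position
    upper = data[half + n % 2:]        # values above the middle position
    return (
        data[0],                       # smallest value
        statistics.median(lower),      # first quartile (median of lower half)
        statistics.median(data),       # median
        statistics.median(upper),      # third quartile (median of upper half)
        data[-1],                      # largest value
    )

# Hypothetical test scores, used only for illustration.
scores = [52, 61, 67, 70, 74, 78, 83, 83, 88, 91, 98]
print(five_number_summary(scores))
```

    Other quartile rules, such as interpolation between neighboring values, can give slightly different first and third quartiles for the same data.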

    Box-and-Whisker Plot

    • The box-and-whisker plot provides a visual representation of the five-number summary by showing the shape of the data's distribution as well as its central tendency and variability. The diagram consists of a rectangular box whose lower and upper boundaries represent the first and third quartiles, while a line through the box represents the median. In addition, two lines (the "whiskers"), one extending from each end of the box, show variation within the data by reaching to the smallest and largest data points. If the data are symmetrical, the whiskers will have equal lengths and the median line will divide the box into equal halves. Most data sets, however, are not symmetrical, but skewed either to the left or right because of unusually low or high values. The box-and-whisker plot visually displays the amount of skew in the data.
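
    The symmetry check described above can also be done numerically. The sketch below computes the lengths a plot would draw instead of drawing them; the function name, the sample data and the simple rule for naming the direction of skew are all assumptions made for illustration, and quartiles are again taken as medians of the lower and upper halves.

```python
import statistics

def box_plot_shape(values):
    """Return whisker and box-half lengths, plus a rough skew hint."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    q1 = statistics.median(data[:half])
    med = statistics.median(data)
    q3 = statistics.median(data[half + n % 2:])

    # Rough heuristic (an assumption, not a formal test): compare how far
    # the data reach above and below the median.
    lower_reach = med - data[0]
    upper_reach = data[-1] - med
    if upper_reach > lower_reach:
        skew = "right"
    elif upper_reach < lower_reach:
        skew = "left"
    else:
        skew = "symmetric"

    return {
        "lower_whisker": q1 - data[0],   # smallest value up to the box
        "lower_box": med - q1,           # bottom of box up to median line
        "upper_box": q3 - med,           # median line up to top of box
        "upper_whisker": data[-1] - q3,  # top of box up to largest value
        "skew": skew,
    }

# Hypothetical data with one unusually high value.
print(box_plot_shape([1, 2, 3, 4, 5, 6, 30]))
```

    In this made-up sample the upper whisker is far longer than the lower one, which is the pattern a right-skewed plot would show.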

    Stem-and-Leaf Display

    • This exploratory method combines quantitative and graphical techniques by arranging raw numbers in a display similar to a histogram or bar graph. The display takes the raw numbers and separates the leading digits, or "stems," from the trailing digits, or "leaves." For example, a data analyst could construct a stem-and-leaf display of test scores in a college class of 30 students, in which scores ranged from a low of 52 to a high of 98, by making the digit in the tens column the stem and the digit in the ones column the leaf. Thus, the stems would consist of the digits 5 through 9, with the leaves branching out from each stem. If four students scored 83 on the test, the display would show a stem of 8, followed by four 3s. The stem-and-leaf diagram conveys the distribution of frequencies in the data while also allowing an analyst to see the actual values.
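
    A display like the one described can be produced with a short script. The following is a sketch under stated assumptions: the values are non-negative whole numbers, the tens digit (and anything above it) is the stem, the ones digit is the leaf, and the function name and sample scores are invented for the example.

```python
def stem_and_leaf(values):
    """Return the display as a list of text lines, one line per stem."""
    data = sorted(values)
    # One entry per stem from the lowest to the highest, including empty
    # stems, so that gaps in the data stay visible.
    leaves = {stem: [] for stem in range(data[0] // 10, data[-1] // 10 + 1)}
    for value in data:
        leaves[value // 10].append(str(value % 10))
    return [f"{stem} | {' '.join(digits)}".rstrip()
            for stem, digits in leaves.items()]

# Hypothetical scores, including four students who scored 83.
scores = [52, 67, 83, 83, 83, 83, 98]
for line in stem_and_leaf(scores):
    print(line)
```

    Because the leaves are sorted within each stem, the length of each line works like the bar of a histogram while the digits preserve the original values.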

    Scatterplot

    • This visual display plots individual data points from two variables on a graph, with each point marking the pair of values recorded for one observation. For example, an economic analyst could create a scatterplot of hourly wages and years of work experience. The pattern of the points provides insight into the correlation between the two variables. If the points cluster around a straight line, it suggests a strong linear correlation, while a random-looking scatter suggests little or no relation between the variables.
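
    One way to put a number on the pattern a scatterplot shows is the Pearson correlation coefficient, which is close to 1 or -1 when the points hug a straight line and close to 0 when they do not. The text does not name this measure, so treat the sketch below as a companion to the plot and not as part of the method; the function name and the wage figures are made up.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: years of work experience and hourly wages.
years = [1, 2, 3, 4, 5]
wages = [12.0, 14.0, 16.0, 18.0, 20.0]
print(pearson_r(years, wages))
```

    The coefficient only captures straight-line relationships, so it complements the scatterplot and does not replace it: a curved pattern can be obvious on the graph and still produce a value near zero.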

    Descriptive Statistics

    • These include such measures as the mean, or numerical average, and the standard deviation, which conveys the amount of dispersion in the data. While means and standard deviations are valuable measures, they provide only limited insight into the data; in addition, extreme high or low values -- known as outliers -- can distort these measures. The best exploratory analysis that uses descriptive statistics does so in conjunction with other methods, such as a graphical display like a scatterplot or box-and-whisker diagram.
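
    The effect of an outlier on these measures is easy to demonstrate. The sketch below uses Python's standard statistics module on invented numbers; the helper name is hypothetical, and the median is included only as a point of comparison.

```python
import statistics

def describe(values):
    """Return the mean, sample standard deviation and median of the data."""
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),   # sample standard deviation
        "median": statistics.median(values),
    }

# Hypothetical measurements, then the same data with one extreme value added.
typical = [10, 12, 11, 13, 12]
with_outlier = typical + [100]

print(describe(typical))
print(describe(with_outlier))
```

    With the single extreme value added, the mean and standard deviation both rise sharply while the median stays put, which is why these statistics are best read alongside a plot of the data.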
