Statistics | Exploratory Data Analysis
S470 | 11818 | Karen Kafadar

How do you analyze data? When faced with data from various sources and
of various types, what questions should we ask, and what clues can we
find in the data to further our understanding?

Statistics, broadly defined, is the science and art of analyzing
data. Many statistical procedures require formal probability model
structures with parameters, and statistical methods offer tools for
estimating those model parameters. Sometimes the assumptions governing
those models hold, but often they do not. What analyses can provide
insight into the data and the underlying mechanisms while being
insensitive to model assumptions? Nonparametric methods are
distribution-free, but some prior analysis is needed to understand the

Exploratory data analysis is a philosophy of analyzing data. The
ubiquity of data and the emergence of "data mining" make this course
essential for anyone who wants to analyze data. In this course, we
will learn many different tools for data analysis as well as the
commands and programs in R (free statistical software) for conducting
these analyses. Some prior familiarity with statistical methods is
assumed. Those who have had formal statistics courses can take the
course at a higher level, where connections between EDA tools and
mathematical statistical methods will be developed. This course is
valuable to anyone who has data to analyze. It is also a lot of fun;
students learn a lot.

Course objectives: Introduce the philosophy of exploratory data
analysis; teach tools for the analysis of data; provide opportunities
for analyzing data (R/S-Plus); demonstrate the value of oral and
written communication skills; offer experience in preparing oral and
written reports of data analyses.

Topics: The philosophy of exploratory versus confirmatory data
analysis; summarizing batches of data (stem-and-leaf diagrams,
boxplots, Q-Q plots); data transformations (ladder of re-expressions);
jackknife and bootstrap; two-way and three-way analyses (median
polish); standardization; fitting robust-resistant lines (least
absolute deviations); analyzing count data.