Statistics | Statistical Learning and High-Dimensional Data Analysis
S475 | 27584 | Michael Trosset

STAT S475/S675  Statistical Learning and High-Dimensional Data Analysis

10:10-11:00 on Monday-Wednesday-Friday in Swain West 221

Instructor:  Michael Trosset, Department of Statistics

"Data-analytic methods for exploring the structure of high-dimensional
data.  Graphical methods, linear and nonlinear dimension reduction
techniques, manifold learning.  Supervised, semisupervised, and
unsupervised learning."

This course surveys various data-analytic approaches to detecting
structure in multivariate data sets.  Many of the topics covered are
active areas of research in multivariate statistics and machine
learning.  High-dimensional data sets arise in many applications,
e.g., gene expression levels from a microarray experiment.  Techniques
for high-dimensional data are useful in a wide variety of disciplines;
I plan to emphasize applications to bioinformatics and text mining.

Here is a rough outline of the topics that I expect to cover:

1.  Multivariate Data.  Data matrices, proximity matrices and graphs.
Labeled and unlabeled data.

2.  Graphical Methods for Exploring Multivariate Data.  Scatterplots
in two and three dimensions, grand tours, projection pursuit.
Parallel coordinates.  Brushing.

3.  Dimension Reduction.  Linear techniques: principal component
analysis, biplots and $h$-plots, principal coordinate analysis.
Spectral techniques for manifold learning: Isomap, Locally Linear
Embedding, Laplacian eigenmaps, diffusion maps.  Nonspectral embedding
techniques and their application to dimension reduction.

4.  Supervised Learning.  Linear/quadratic discriminant analysis,
nearest neighbor methods, distance/metric learning, support vector
machines.  Multiple kernel learning.

5.  Unsupervised Learning.  K-means clustering, self-organizing maps,
iterative denoising.

Text:  I will rely on my own lecture notes and various talks,
technical reports, and papers from the literature.

The essential prerequisite for this course is some familiarity with
linear algebra (vectors, matrices, eigenvalues, etc.).  We will use a
high-level statistical programming language (R), so some previous
experience with a computer programming language would be helpful.
Previous exposure to classical multivariate statistical methods is
helpful, but not essential.

For more information, please visit the course web page:

If you are uncertain whether or not you have the background to take
this course, please contact me at <>.

Computer Science graduate students may count STAT S675 as an Area 5
(Artificial Intelligence) course for the purpose of fulfilling their
area distribution requirements.  Any student who intends to do so
should notify Amr Sabry <>.