S475 | 27584 | Michael Trosset

STAT S475/S675 Statistical Learning and High-Dimensional Data Analysis
10:10-11:00 on Monday-Wednesday-Friday in Swain West 221
Instructor: Michael Trosset, Department of Statistics

"Data-analytic methods for exploring the structure of high-dimensional data. Graphical methods, linear and nonlinear dimension reduction techniques, manifold learning. Supervised, semisupervised, and unsupervised learning."

This course surveys various data-analytic approaches to detecting structure in multivariate data sets. Many of the topics covered are active areas of research in multivariate statistics and machine learning. High-dimensional data sets arise in many applications, e.g., gene expression levels from a microarray experiment. Techniques for high-dimensional data are useful in a wide variety of disciplines; I plan to emphasize applications to bioinformatics and text mining.

Here is a rough outline of the topics that I expect to cover:

1. Multivariate Data. Data matrices, proximity matrices and graphs. Labeled and unlabeled data.
2. Graphical Methods for Exploring Multivariate Data. Scatterplots in two and three dimensions, grand tours, projection pursuit. Parallel coordinates. Brushing.
3. Dimension Reduction. Linear techniques: principal component analysis, biplots and $h$-plots, principal coordinate analysis. Spectral techniques for manifold learning: Isomap, Locally Linear Embedding, Laplacian eigenmaps, diffusion maps. Nonspectral embedding techniques and their application to dimension reduction.
4. Supervised Learning. Linear/quadratic discriminant analysis, nearest neighbor methods, distance/metric learning, support vector machines. Multiple kernel learning.
5. Unsupervised Learning. K-means clustering, self-organizing maps, iterative denoising.

Text: I will rely on my own lecture notes and various talks, technical reports, and papers from the literature.

The essential prerequisite for this course is some familiarity with linear algebra (vectors, matrices, eigenvalues, etc.). We will use a high-level statistical programming language (R), so some previous experience with a computer programming language would be helpful. Previous exposure to classical multivariate statistical methods is helpful, but not essential.

For more information, please visit the course web page: http://mypage.iu.edu/~mtrosset/675.html

If you are uncertain whether or not you have the background to take this course, please contact me at <mtrosset@indiana.edu>.

Computer Science graduate students may count STAT S675 as an Area 5 (Artificial Intelligence) course for the purpose of fulfilling their area distribution requirements. Any student who intends to do so should notify Amr Sabry <sabry@cs.indiana.edu>.