Statistical models for language: structure and computation
Abstract:
Statistical models are now widely used in NLP, speech recognition,
and other forms of linguistic computation. Yet few graduate programs
afford students the opportunity to fully learn the principles behind such
statistical models. This course addresses this gap by presenting key
statistical concepts -- aggregation, variance, degrees of freedom,
parameter estimation, and significance testing -- in the context of
analyzing and modeling linguistic data. These concepts are illustrated
using log-linear models. Hidden Markov models and probabilistic
context-free grammars are discussed in terms of their application to
language analysis on the one hand, and in terms of statistical model
structure on the other. Course materials for demonstrations and
exercise assignments will be prepared using R, a cross-platform
statistical computing environment available under the GNU General
Public License (www.r-project.org).
This course assumes no prior knowledge of statistics; a general
background in linguistics is expected.