Linguistics | Advanced Natural Language Processing
L645 | 3057 | Dr. Damir Cavar
Linguistics L645 Natural Language Processing
3 Credits
In recent years, the fields of Natural Language Processing and
Information Retrieval have experienced a convergence of interests.
For researchers in NLP, IR represents a promising real-world
application. For researchers in IR, improving retrieval systems and
building new experimental question-answering systems require applying
insights from NLP. Moreover, both NLP and IR
have converged on the use of probabilistic models and statistical
methods of analysis.
This course examines the convergence of NLP and IR through the lens
of large-scale analysis of texts, otherwise known as corpus
linguistics. We examine the construction of corpora, the types of
linguistic analysis that they afford, and the interesting properties
of language that can be observed in studies of corpora. In addition,
we consider probabilistic studies of corpora and models of language
that help account for properties of corpora, and the nature of IR
systems as linguistic corpora. The technical focus is on the role
of new standards and technologies based on XML for annotation and
processing of corpora. These perspectives help us better understand
the design and performance aspects of IR systems.
Readings for the course are drawn from current research literature in
NLP, corpus linguistics, and IR. A small number of focused
assignments illustrate principles of corpus construction and analysis
and probabilistic models of language. The final course project may be
either a substantial (~15-page) research paper or a programming
project.