Linguistics | Advanced Natural Language Processing
L645 | 3188 | Dr. Damir Cavar

3 Credits

In recent years, the fields of Natural Language Processing and
Information Retrieval have experienced a convergence of interests.
For researchers in NLP, IR represents a promising real-world
application. For researchers in IR, improving retrieval systems and
building new experimental question-answering systems both require
applying insights from NLP. Moreover, both NLP and
IR have converged on the use of probabilistic models and statistical
methods of analysis.

This course examines the convergence of NLP and IR through the lens
of large-scale analysis of texts, otherwise known as corpus
linguistics. We examine the construction of corpora, the types of
linguistic analysis that they afford, and the interesting properties
of language that can be observed in studies of corpora. In addition,
we consider probabilistic studies of corpora, models of language
that help account for properties of corpora, and the nature of IR
systems as linguistic corpora. A technical focus of the course is the
role of new XML-based standards and technologies for the annotation
and processing of corpora. These perspectives help us better
understand the design and performance aspects of IR systems.

Readings for the course are drawn from current research literature
in NLP, corpus linguistics, and IR. A small number of focused
assignments illustrate principles of corpus construction and
analysis, and of probabilistic models of language. The final course
project may be either a substantial (~15-page) research paper or a
programming project.