Unambiguous + Unlimited = Unsupervised

Marti A. Hearst

The key to many modern computational linguistics problems is to train a machine learning algorithm over large numbers of labeled examples. However, in most cases, acquisition of labeled data is expensive, and so for years researchers have been striving to develop unsupervised algorithms that require little or no labeled data.

I will discuss one type of (nearly) unsupervised algorithm that has recently become more widely applicable, thanks to the nearly unlimited amount of searchable text now available via web search engines. The main idea is to restate the problem so that at least some unambiguous examples of it are likely to be found in this vast sea of text. I will show examples of this kind of algorithm applied to structural ambiguity resolution and semantic relation identification, and touch on the larger implications.
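To make the main idea concrete, here is a minimal sketch, not the method presented in the talk. It assumes one particular restatement: a three-word noun compound "w1 w2 w3" is ambiguous between a left bracketing [[w1 w2] w3] and a right bracketing [w1 [w2 w3]], but a hyphenated variant such as "w1-w2 w3" is taken as an unambiguous vote for the left reading. The function names, the hyphen cue, and the toy corpus are all hypothetical; a small in-memory list of sentences stands in for web search engine counts.

```python
import re
from typing import List


def count_phrase(corpus: List[str], phrase: str) -> int:
    """Count case-insensitive occurrences of an exact phrase.

    The lookarounds stop a match from starting or ending in the
    middle of a longer word or hyphenated token.
    """
    pattern = re.compile(
        r"(?<![\w-])" + re.escape(phrase) + r"(?![\w-])", re.IGNORECASE
    )
    return sum(len(pattern.findall(text)) for text in corpus)


def bracket(corpus: List[str], w1: str, w2: str, w3: str) -> str:
    """Guess the bracketing of the compound "w1 w2 w3".

    Assumption (hypothetical cue): "w1-w2 w3" is unambiguous evidence
    for left bracketing, and "w1 w2-w3" for right bracketing.
    """
    left_votes = count_phrase(corpus, f"{w1}-{w2} {w3}")
    right_votes = count_phrase(corpus, f"{w1} {w2}-{w3}")
    if left_votes > right_votes:
        return "left"
    if right_votes > left_votes:
        return "right"
    return "undecided"  # no unambiguous evidence, or a tie


# Toy stand-in for the "vast sea of text"; invented for illustration only.
TOY_CORPUS = [
    "The cell-cycle analysis took two weeks.",
    "We repeated the cell-cycle analysis on new samples.",
    "She ordered a brain stem-cell assay.",
]

if __name__ == "__main__":
    print(bracket(TOY_CORPUS, "cell", "cycle", "analysis"))
    print(bracket(TOY_CORPUS, "brain", "stem", "cell"))
```

No labeled training data appears anywhere in the sketch: the decision rests entirely on counts of the unambiguous variants. With a corpus this small most compounds would come back "undecided", which is why the approach depends on very large amounts of text.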

Joint work with Preslav Nakov. Supported in part by NSF DBI-0317510.