Research

The IN3SCAPE Project

Research Goals

The huge increase in the volume of online literature has led to a parallel surge in research into methods for retrieving meaningful information from this textual data, and "content extraction" has emerged as a prominent field in natural language computing. However, little progress has yet been made in determining the pragmatic content of a document: the "hidden" meaning such as the attitudes of the writer toward her audience, the intentions being communicated, the intra-textual relationships between document objects, and so forth. Pragmatic information carries a great deal of the underlying meaning in a document, and without access to it, current content extraction methods miss much of what a document conveys. If we could recognize and use fine-grained relationships among documents to assist navigation through information networks, we could better address this problem.

Our goal is to develop natural language systems capable of extracting this pragmatic information from text to provide more meaningful document understanding. To this end, we are developing automated systems, using both discourse-based and machine learning techniques, to recognize and interpret pragmatic cues in text. This pragmatic evidence will support more sophisticated document indexing and guide information extraction by supplying detailed information on the fine-grained nature of the links between documents.

IN3SCAPE ([...] Sophisticated Computational Analysis using Pragmatic Evidence) is bootstrapping the development of a set of methods and software tools for the automated classification of links between documents in online corpora by focusing initially on the problem of automated citation classification in scholarly articles. This is a very challenging problem, as there can be upwards of 35 citation categories used in scholarly writing, with fine-grained distinctions among the categories. Determining the purpose of a citation can involve recognizing linguistic features at all levels of the text: lexical cues, syntactic arrangement, and overall discourse structure. We have developed a basic classifier for automated citation classification, but to improve its performance we need more sophisticated techniques that blend discourse understanding with statistical methods for large-scale corpus analysis.
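As a rough illustration of the lexical-cue level only, the following is a minimal sketch of a cue-phrase citation classifier. It is not the project's classifier: the three category names and the cue patterns in `CUE_PATTERNS` are hypothetical, chosen for the example, and a realistic system would also need the syntactic and discourse-level features described above.

```python
import re

# Hypothetical categories and cue phrases, for illustration only.
# They are assumptions, not the project's actual citation scheme.
CUE_PATTERNS = {
    "contrast": [r"\bhowever\b", r"\bin contrast to\b", r"\bunlike\b"],
    "support": [r"\bconsistent with\b", r"\bconfirms?\b", r"\bin agreement with\b"],
    "use": [r"\bwe (?:use|adopt|follow)\b", r"\bfollowing\b"],
}


def classify_citation(sentence: str) -> str:
    """Return the category with the most cue-phrase matches, or "neutral"."""
    text = sentence.lower()
    # Count how many cue-phrase occurrences each category has in the sentence.
    scores = {
        category: sum(len(re.findall(pattern, text)) for pattern in patterns)
        for category, patterns in CUE_PATTERNS.items()
    }
    # On ties, max() keeps the first category in dictionary order.
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "neutral"


if __name__ == "__main__":
    for example in [
        "Unlike Smith (2001), we find no effect.",
        "We use the parser of Jones (1999).",
        "Brown (1995) studied citations.",
    ]:
        print(classify_citation(example), "<-", example)
```

A purely lexical approach like this cannot separate categories whose differences are only visible in sentence structure or in the surrounding discourse, which is one reason a basic classifier of this kind falls short when many fine-grained categories are involved.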

The results of our research will be a set of algorithms, methods, and software tools that can be applied to the following problems:

(i) Annotation tools for both manual and automated annotation of fine-grained linguistic features;
(ii) Automated analysis of document content for cues to purpose;
(iii) Automated classification of semantic links between documents;
(iv) Mapping from typed document links to social networks.
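To make item (iv) concrete, here is a minimal sketch of one way typed document links could be aggregated into a network of authors. The function name `author_network`, the link type labels, and the input shapes are all assumptions made for this example, not the project's actual method.

```python
from collections import Counter


def author_network(typed_links, authors_of):
    """Aggregate typed document links into typed author-to-author edge counts.

    typed_links: iterable of (citing_doc, cited_doc, link_type) tuples.
    authors_of:  dict mapping a document id to a list of author names.
    Both input shapes are hypothetical, chosen for this sketch.
    """
    edges = Counter()
    for citing_doc, cited_doc, link_type in typed_links:
        for citing_author in authors_of.get(citing_doc, []):
            for cited_author in authors_of.get(cited_doc, []):
                # Skip self-links, e.g. an author citing their own paper.
                if citing_author != cited_author:
                    edges[(citing_author, cited_author, link_type)] += 1
    return edges


if __name__ == "__main__":
    links = [("d1", "d2", "support"), ("d3", "d2", "contrast")]
    authors = {"d1": ["A", "B"], "d2": ["C"], "d3": ["A"]}
    for edge, count in author_network(links, authors).items():
        print(edge, count)
```

Keeping the link type in the edge key preserves the distinction between, say, supportive and contrastive citations, which an untyped citation count would lose.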

Acknowledgements

IN3SCAPE is funded by a Google Research Award.

Group Members and Associates

Chrysanne DiMarco, Associate Professor, David R. Cheriton School of Computer Science

Robert E. Mercer, Professor, Department of Computer Science, The University of Western Ontario

Pascal Poupart, Assistant Professor, David R. Cheriton School of Computer Science

Victoria L. Rubin, Assistant Professor, Faculty of Information and Media Studies, The University of Western Ontario

Students

Fred Kroon (PhD, Waterloo)
Thesis topic: Mapping scientific communities using citation analysis

Jakub Gawryjolek (MMath, Waterloo)
Adam Hartfiel (PhD, Waterloo)
Trevor Maynard (MSc, UWO)
Radoslav Radoulov (MMath, Waterloo)
Jeff Taylor (PhD, UWO)
Barbara White (PhD, UWO)

Research staff

Steve Banks, Research Scientist
Olga Gladkova, Research Assistant (Linguistics, Rhetoric)
Matthew Skala, Research Assistant (Software Development)