Computer Science Seminar

2012 Feb 15 at 10:30

DC 1304

Information Discovery in Large Complex Datasets

Julia Stoyanovich, Visiting Scholar, University of Pennsylvania

The focus of my research is on enabling novel kinds of interaction between the user and the information in a variety of digital environments, ranging from social content sites, to digital libraries, to the World Wide Web. In this talk, I will give an overview of my research, and will then present two recent lines of work that focus on information discovery in two important application domains.

In the first part of this talk I will discuss information discovery in PubMed - the largest public bibliographic resource in the life sciences. Life sciences researchers, practitioners and students search PubMed daily. Many search queries return thousands of results, pointing to the need for data exploration. PubMed articles are annotated with terms from the Medical Subject Headings (MeSH) vocabulary - a manually curated semantic knowledge base containing about 50 thousand terms. We propose to use these annotations for relevance ranking. I will describe the unique structure of MeSH, which we termed a scoped polyhierarchy. I will present novel relevance measures appropriate for MeSH, and will demonstrate that ranking with these measures leads to a better user experience and can be computed efficiently. I will also describe a Skyline visualization of results that further improves a user's data exploration experience.

In the second part of this talk, I will present novel approaches for the management, querying, searching, and browsing of scientific workflow repositories. A scientific workflow, commonly used for in silico experimentation in scientific domains, is an encoding of a sequence of steps that progressively transform one or several data products. Workflows are gaining popularity because they help make experiments reproducible, and may be used to answer questions about data provenance - the dependencies between input, intermediate, and output data. I will describe a declarative provenance framework that uses Pig Latin to capture fine-grained dependencies between data items, enabling novel kinds of analytic queries. I will demonstrate that careful design and the use of distributed processing make tracking and querying fine-grained provenance feasible.

Bio:

Julia Stoyanovich is a Visiting Scholar at the University of Pennsylvania. Julia holds M.S. and Ph.D. degrees in Computer Science from Columbia University, and a B.S. in Computer Science and in Mathematics and Statistics from the University of Massachusetts at Amherst. After receiving her B.S., Julia went on to work for two start-ups and one real company in New York City, where she interacted with, and was puzzled by, a variety of massive datasets. Julia's research focuses on modeling and exploring large datasets in the presence of rich semantic and statistical structure. She has recently worked on personalized search and ranking in social content sites, rank-aware clustering in large structured datasets with a focus on dating and restaurant reviews, data exploration in repositories of biological objects as diverse as scientific publications, functional genomics experiments and scientific workflows, and representation and inference in large datasets with missing values.