This project will develop an advanced infrastructure for information retrieval (IR) research in the School of Information and Library Science. The funding request is for disk space to be added to an existing Sun workstation so that it can store a larger subset of an experimental data set.
Information retrieval is a topic with which everyone has some familiarity, but the large problems that remain in IR are not always evident. While Web-based search engines can produce matching “hits” in short order, the quality of the responses is often lacking. There are numerous reasons why it is difficult to retrieve only those documents that are relevant to a particular information need. Among them are:
1. Synonymy: many different words (or phrases) may be used to refer to the same subject.
2. Polysemy: the same word might have different meanings in different contexts (for example, the “bank” of a river, and the “bank” where you keep your money).
3. Varying document lengths: long documents might cover many different subjects with many different words, making them more likely to be retrieved than shorter documents.
4. Varying query specificity: adding or deleting a term, or providing a paragraph narrative on the information need rather than a few keywords, can greatly alter the outcome of the search.
5. Real-world information needs: people who search using IR systems are engaged in particular situations, and have backgrounds and knowledge that might help to refine search results.
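The document-length effect in item 3 can be illustrated with a toy example. The sketch below is a hypothetical illustration, not one of the prototype systems: the two documents, the query, and the function names `raw_score` and `normalized_score` are invented here. It scores a document by counting occurrences of the query terms, then shows how dividing by document length changes the ranking.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def raw_score(query, doc):
    """Count how many times the query terms occur in the document."""
    counts = Counter(tokenize(doc))
    return sum(counts[term] for term in set(tokenize(query)))


def normalized_score(query, doc):
    """Raw score divided by document length, so long documents lose their edge."""
    length = len(tokenize(doc))
    return raw_score(query, doc) / length if length else 0.0


# Invented example documents; the long one also uses "bank" in the financial sense.
short_doc = "river bank erosion"
long_doc = ("the bank raised rates while the river flooded "
            "and the bank closed early today")
query = "river bank"

for name, doc in [("short", short_doc), ("long", long_doc)]:
    print(name, raw_score(query, doc), round(normalized_score(query, doc), 3))
```

With raw counts the long document outscores the short one, even though the short document is the one about river banks; after length normalization the order reverses. The long document's use of “bank” in the financial sense also shows how polysemy (item 2) can inflate a match.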
The investigator is working with a community of faculty and students from the School of Information and Library Science (with participation open to faculty and students from elsewhere on campus) in a focused multi-semester study of these problems. The study is taking place within the rubric of the Text REtrieval Conference (TREC). TREC is an international multi-disciplinary effort to address the problems of IR through ongoing research on a growing collection of full-text documents.
The investigator is starting his third year of participation in TREC. In each of the previous two years, he attended the annual TREC conference and published in its proceedings (see http://trec.nist.gov). Two doctoral students and one other faculty member at SILS have also participated in TREC. With a better infrastructure for significant and ongoing research on the TREC data, we hope to build a world-class presence in the field of IR. The SILS faculty and students are already strong in this area, but have not yet been able to work together, year-round, on investigations.
This proposal seeks additional disk space on an existing server for use by the SILS TREC community of faculty and students. With over 1.3 million documents in the TREC collection, and nearly 100,000 relevance judgments for 350 queries used in past TREC experiments, the storage needs for large-scale investigations are significant. Without the funding, it will be impossible to keep large-scale retrieval experiments online, because competing research needs for the workstation facilities require that system resources be freed for other work.
The methods and procedures to be followed will develop further as students gain familiarity with the data. Overall, this funding will enable a broad range of study techniques. Currently, two prototype systems from last year’s TREC are available, and the investigator is pursuing several lines of research. These include:
a. Comparing small, medium, and large database subsets to identify trends in term (word) counts across the subsets. Can we characterize a database by sampling a subset? The subsets are 100 MB, 500 MB, and 5 GB in size.
b. Performing statistical analysis on the documetrics of documents judged as relevant and non-relevant (judgments from past TRECs are part of the TREC collection). What is the impact of document length and query length? What forms of query expansion (adding terms or phrases) might result in better matches with known relevant documents?
c. Conducting usability studies with different interfaces and different retrieval mechanisms. Subjects will be presented with several topics to research, and a variety of retrieval mechanisms and database subsets to choose from. What is the impact of retrieval mechanism on search effectiveness, and how does this relate to the searcher’s perceptions of success?
d. Extending retrieval techniques to other document collections. We are specifically interested in Web-based data: how does retrieval effectiveness on Web data differ from effectiveness on the news articles and other data that make up the standard TREC collection?
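The sampling question in item (a) can be made concrete with a small sketch. The code below is a minimal illustration under stated assumptions, not one of the prototype systems: the toy collection, the whitespace tokenizer, and the function names `term_frequencies`, `sample_docs`, and `top_term_overlap` are all hypothetical. It draws a random subset of documents and measures how many of the full collection's most frequent terms also rank among the subset's most frequent terms.

```python
import random
from collections import Counter


def term_frequencies(docs):
    """Return each term's share of all term occurrences in the documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def sample_docs(docs, fraction, seed=0):
    """Draw a reproducible random subset containing about `fraction` of the documents."""
    rng = random.Random(seed)
    k = min(len(docs), max(1, round(len(docs) * fraction)))
    return rng.sample(docs, k)


def top_terms(freqs, n):
    """The n most frequent terms, with ties broken alphabetically."""
    return set(sorted(freqs, key=lambda t: (-freqs[t], t))[:n])


def top_term_overlap(full_freqs, sample_freqs, n=10):
    """Fraction of the full collection's top-n terms that are also top-n in the sample."""
    full_top = top_terms(full_freqs, n)
    if not full_top:
        return 0.0
    return len(full_top & top_terms(sample_freqs, n)) / len(full_top)


# Invented stand-in for a document collection.
collection = [
    "the bank raised interest rates",
    "the river bank flooded after the storm",
    "interest in the storm coverage was high",
    "the rates of erosion along the river increased",
]

full = term_frequencies(collection)
subset = term_frequencies(sample_docs(collection, fraction=0.5))
print(top_term_overlap(full, subset, n=5))
```

Repeating the comparison at several sampling fractions, and with other measures such as rank correlation of term frequencies, would show how quickly a subset's term statistics converge on those of the full database.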
In this research, we anticipate building on existing strengths in SILS and developing partnerships among faculty and students. The funding requested will enable a significant expansion in the key research facility that turns the researchers’ energy into concrete outcomes – the computing platform for retrieval experimentation.
The proposal budget has a single line item, for equipment to expand the disk storage capacity of an existing Sun workstation in the School of Information and Library Science. This expansion will enable information retrieval research that is currently impossible to sustain because other research competes for the existing disk space.
There is no departmental funding for this disk space expansion in the current fiscal year, and no other sources of University funding are available. A new NSF initiative for digital libraries may be forthcoming, however, and SILS faculty are optimistic that the research performed as part of this proposal will help make SILS a favorable candidate for NSF funding in the future.
UNC Chancellor Academic 1998 $70,000. None
This proposal was funded to expand the teaching laboratory facilities at SILS, and the funds have been spent. The emphasis was on adding multimedia capabilities (scanners, personal cameras, and video capture cards) to existing lab computers, and on adding a server for multimedia authoring. No research-oriented facilities were funded. Co-PI with Scott Barker.