This paper was not accepted for SIGIR '95, but is under submission review elsewhere. This paper should not be reproduced or distributed without the author's consent.

Paper Submission for SIGIR '95

Information Space for Profile-Based Document Retrieval*

Submitted by:
Gregory B. Newby, Ph.D.
Assistant Professor
Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
501 East Daniel Street, Room 204
Champaign, IL 61820, USA
Telephone: 217-244-7365
Fax: 217-244-3302
Email: gbnewby@uiuc.edu

Paper word count: 2565

ABSTRACT

This paper presents the theoretical foundations of an information space for IR and describes a practical application of its concepts. The information space is similar to the well-known vector space except that concept terms in the space have non-orthogonal relations to each other. The methods for constructing the space are similar to those of a psychometric technique, multidimensional scaling (MDS). Information space profiles of two organizational work groups were used to scan for relevant documents from the full 1992 U.S. Utility Patent Abstracts database. Results indicate promise for performance comparable to that of typical CD-ROM systems, but in an environment that (a) allows for feedback, navigation, and visualization of the retrieval process; (b) does not require an explicit query, only analysis of existing documents to create the profile; and (c) provides output ranked on similarity to the profile. Future work is described, including an automatic scanning system for dynamic large-scale environments such as network news or newswire services.

INTRODUCTION

This paper addresses several aspects of IR research. This section describes the conceptual background of the work. The following section describes the outcomes of a practical test of the concepts, and a closing section addresses future work.

The focus of the work is "information space." Information space differs from vector space in that a metric of similarity (or, more precisely, dissimilarity) may be applied to both concept terms and documents within the space. This is contrasted with the vector space model, in which concept terms are represented by mutually orthogonal vectors (Salton & McGill, 1983). Information space is proposed as a mechanized analog to "cognitive space," which is the set of concepts and relations among them as perceived by an individual, or collectively (e.g., on the average) by a group. The presumption is made that information spaces which closely match the cognitive spaces of particular users with particular information needs will be more useful for retrieval; this work includes a partial test of this presumption.

One approach to measuring cognitive space is the psychometric technique of multidimensional scaling (MDS). MDS builds a metric space in which items of interest are located relative to each other, with the relative distance proportional to the perceived dissimilarity among the items (Kruskal, 1978; Woelfel & Fink, 1980). "Items" could be anything chosen, but are typically single- or multi-term concepts within a particular domain. Sets of terms chosen from the controlled vocabulary of a database, or selected from the words occurring in a manuscript or set of documents, are well-suited for an MDS study.

In MDS, human respondents are presented with a list of paired items, and asked to make numeric judgments about the relatedness of each pair. All possible pairs of items are presented, so that for n items, n(n-1)/2 judgments are made. Eigenvectors and eigenvalues are extracted from the matrix of paired comparisons, and may be used for various purposes. A common purpose is the identification of major concepts that emerge from clusters of the individual items measured.
Another purpose, more closely related to IR efforts, is to build a space, suitable for visualization, in which the measured relations among items may be examined.

The methods used here for building an information space approximate those for generating a cognitive space using MDS. Simple word counts are used to select a set of words from a corpus for study. Similarity scores for the words are generated based on the tendency of each pair of terms to co-occur across the corpus. The matrix of co-occurrence scores is subjected to a statistical technique, principal components analysis, which extracts eigenvectors and eigenvalues. The eigenvectors and eigenvalues make up an information space consisting of a set of relations among words, suitable for use in various retrieval situations.

The focus in the test reported here was to use the information space as a basis for similarity-based identification of potentially relevant documents from the Patent Abstracts database. Documents (in this case, consisting of the "abstract" field of the Patent Abstracts entries) were located in the space at the center of the terms they contain. Retrieval was then accomplished using the simplest of the alternative methods: identification of the documents closest to the center of the space. (The center of the space is the center of the documents that built it; in this case, the documents that built it were representative of the interests of the recipients of the IR results.)

The information space is suitable for other purposes. Due to its metric nature, it may be used effectively for relevance feedback or browsing, much as vector space systems have been used (Salton et al., 1985). Unlike vector spaces, however, information space may be easily visualized. Vector spaces are not easily visualizable because the large number of vectors (or, in the geometric terms preferred here, axes or dimensions), each representing a single word or term, are mutually orthogonal.
Orthogonal axes of this kind provide no good basis for selecting a three-dimensional view. With principal components analysis, as with factor analysis and other multivariate statistical techniques based on the general linear model, dimensions are extracted one at a time, each accounting for the maximum possible share of the remaining variance. Thus, selection of the first three dimensions for viewing will yield a reasonable basis for visualization, typically accounting for 10 to 25% of the total variance in the entire multidimensional information space. A report of work on the visual aspects of this method appears elsewhere (Newby, 1992).

INFORMATION SPACE FOR IR

The Patent Abstracts database makes use of professionally-written abstracts and index terms selected from a controlled vocabulary typical of commercial database systems. The purpose of this experiment was to retrieve documents of possible relevance to two corporate R&D work groups. The work groups had real information needs, and frequently made use of Patent Abstracts to identify possible work relevant to their own.

The 1992 U.S. Utility Patent Abstracts database was used, with a total of 97,915 documents. Each work group supplied a collection of written documents from a project, which was used to generate an information space. The information spaces, therefore, were profiles of the areas of work for the groups. The information spaces were then used to locate each of the patent abstracts, based on the terms in the abstracts. Those abstracts closest to the center of each space were presented to the appropriate work group manager for relevance judgments.

Unfortunately, there was no opportunity to perform iterative searches using different methods, or to compare this experimental method with the regular CD-ROM database, due to limitations on access to the work groups. The results are promising, though, in that some relevant documents were identified from the very large corpus of patent abstracts.

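The point made earlier in this section about the variance carried by the first three dimensions can be illustrated with a short Python sketch. It is an illustration only: the eigenvalues are invented and are not the values obtained for either project, and the function name `variance_shares` is an assumption introduced here.

```python
def variance_shares(eigenvalues):
    """Fraction of the total variance accounted for by each dimension."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


# Invented eigenvalues, largest first: three dominant dimensions
# followed by sixty minor ones. These are NOT values from the study.
eigenvalues = [6.0, 5.0, 4.0] + [1.0] * 60

shares = variance_shares(eigenvalues)

# Share of the total variance visible in a three-dimensional view
# built from the first three dimensions.
first_three = sum(shares[:3])
print(f"first three dimensions: {first_three:.0%} of total variance")
```

With these invented values the three leading dimensions carry one fifth of the total variance, which falls inside the 10 to 25% range mentioned above; the remaining variance is spread thinly over many minor dimensions.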
Project Group Profiles

The first work group, for Project A, provided 23 documents consisting of some 209,301 words total. The second work group, for Project B, supplied 52 documents consisting of 148,232 words. The documents ranged in size from under 5,000 words to over 50,000 words each and were technical in nature. Basewords for each project were selected by trimming terms whose total occurrence fell above an upper limit or below a lower limit; the terms kept were also at least 2 characters long and did not appear on a generic stoplist of 509 words. The limits were selected in order to have an approximately normal frequency distribution.

Project A:
• Number of input documents: 23
• Total words in input documents: 209,301
• Total unique words in documents, not including stopwords: 6779
• Basewords kept (20 <= total occurrence <= 250): 930

Project B:
• Number of input documents: 52
• Total words in input documents: 148,232
• Total unique words in documents, not including stopwords: 4920
• Basewords kept (19 <= total occurrence <= 175): 669

Building the Information Space

A separate information space was built for each project, as follows:

Step 1: Identify basewords (described above).

Step 2: Build the co-occurrence matrix. This is a square symmetric matrix. Each row of the matrix consists of a set of scores for the co-occurrence of each baseword with that row's baseword. Co-occurrence scores are derived from the frequencies of terms in the patent abstracts. For every pair of words (i,j) in an abstract, the value at position (i,j) of the matrix is incremented by 1 [1].

Step 3: Extract eigenvectors and eigenvalues. Principal components analysis is used to produce a set of eigenvectors and eigenvalues (using the standard IMSL library routine "PRINC"). The most convenient means of interpreting these data is as a set of coordinates in a metric space.

Step 4: Compute locations of patent abstracts in the space.
The location of a document consisting of n terms is the center of the locations of all terms i_1, i_2, ..., i_n in the document which are also basewords in the space. Other terms are ignored; terms which occur more than once are counted as though they occur only once (in other words, the location is not weighted based on document term frequency for this experiment, due to the brevity of the patent abstract documents).

Step 5: Select abstracts for retrieval. The first and easiest criterion for retrieval of individual patent abstracts is proximity to the center of the information space, ranked based on the metric distance.

Project A results:
• Of the 930 basewords identified, 571 occurred in the patent abstract database.
• The number of eigenvalues required to account for 99.998% of the variance in the input co-occurrence matrix was 468.
• The first principal component (or first dimension) accounted for 20% of the overall variance in the co-occurrence matrix [2].

Project B results:
• Of the 669 basewords identified, 656 occurred in the patent abstract database.
• The number of eigenvalues required to account for 99.998% of the variance in the input co-occurrence matrix was 515.
• The first principal component accounted for 25% of the overall variance in the co-occurrence matrix.

The first principal component for both projects was found to indicate general terms such as "response," "operate," "occur," and "detect." This first dimension was dropped in each case, and the next seven were used for the rest of the analysis [3]. These seven dimensions totalled 10% of the total variance in Project A and 12% in Project B, but did include project-specific concepts in both cases.

The twenty-seven patent abstracts closest to the origin of the Project A space and the fifty-eight patent abstracts closest to the origin of the Project B space were returned to the respective project teams for comment.
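Steps 1, 2, 4, and 5 above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the code used in the experiment. All function names are introduced here for illustration. Each distinct pair of basewords is counted once per abstract and the diagonal of the matrix is left at zero, which is one possible reading of Step 2. Step 3 is not reproduced: the sketch takes baseword coordinates as given, standing in for the output of the principal components analysis.

```python
from collections import Counter
from itertools import combinations
from math import dist


def select_basewords(documents, stoplist, low, high):
    """Step 1: keep words whose total count lies within [low, high]."""
    counts = Counter(word for doc in documents for word in doc)
    return sorted(
        word for word, n in counts.items()
        if low <= n <= high and len(word) >= 2 and word not in stoplist
    )


def cooccurrence_matrix(abstracts, basewords):
    """Step 2: count the abstracts in which each pair of basewords co-occurs.

    Assumption: a pair is counted once per abstract; the diagonal stays 0.
    """
    index = {word: k for k, word in enumerate(basewords)}
    matrix = [[0] * len(basewords) for _ in basewords]
    for abstract in abstracts:
        present = sorted({index[w] for w in abstract if w in index})
        for i, j in combinations(present, 2):
            matrix[i][j] += 1
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


def locate(abstract, coordinates):
    """Step 4: unweighted center of the distinct basewords in an abstract."""
    points = [coordinates[w] for w in set(abstract) if w in coordinates]
    if not points:
        return None  # no basewords, so the abstract cannot be placed
    return [sum(axis) / len(points) for axis in zip(*points)]


def rank_by_proximity(abstracts, coordinates, center):
    """Step 5: order abstract ids by distance from the center of the space."""
    distances = {}
    for key, words in abstracts.items():
        location = locate(words, coordinates)
        if location is not None:
            distances[key] = dist(location, center)
    return sorted(distances, key=distances.get)


# Toy data, invented for illustration only.
project_docs = [["valve", "sensor", "valve", "the"], ["sensor", "pump", "valve"]]
basewords = select_basewords(project_docs, stoplist={"the"}, low=2, high=3)

abstracts = {
    "p1": ["valve", "sensor", "drill"],
    "p2": ["valve", "valve"],
    "p3": ["sensor"],
    "p4": ["drill"],
}
matrix = cooccurrence_matrix(abstracts.values(), basewords)

# Hypothetical two-dimensional coordinates standing in for Step 3.
coordinates = {"sensor": [1.0, 0.0], "valve": [3.0, 4.0]}
ranking = rank_by_proximity(abstracts, coordinates, center=[0.0, 0.0])
print(basewords, matrix, ranking)
```

In this toy run the abstract containing no basewords ("p4") is never placed in the space and so cannot be retrieved, which mirrors the rule that non-baseword terms are ignored. A retrieval cutoff would then be applied to the head of the ranked list.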
The cutoff points yielding 28 and 59 abstracts were arrived at by examining the list of patent abstracts ordered by proximity to the origin and including a number of abstracts that would seem to be of interest to the projects. The project team managers were asked to evaluate every retrieved patent abstract.

Project A evaluation:
• Of 27 abstracts presented: 7 were judged to be a "good match," 17 were judged to be a "poor match."

Project B evaluation:
• Of 55 abstracts presented: 2 were judged to be "pertinent," 3 were judged to be "not pertinent." Feedback was not given for the remaining 55 abstracts presented.

An interesting question not addressed in this experiment is whether any of the more useful documents retrieved would also have been retrieved using traditional keyword-based methods (e.g., with a professional database searcher). A comparison of three retrieved sets might have been illuminating, as follows:

Set 1: Documents retrieved with the method described here.
Set 2: Documents retrieved with a traditional keyword system.
Set 3: Randomly retrieved documents.

FUTURE DIRECTIONS IN INFORMATION SPACE

While this experiment in automatically selecting patent abstracts for project work groups shows some promise, the method is not ready for commercial or production use. Nevertheless, the full potential of the methods employed here has not yet been explored. The work completed has demonstrated only that the system works to some extent, not the degree to which it might function as either a replacement for traditional information retrieval methods or as a major component of a text scanning and retrieval system.

Current efforts are directed at the domain of network news, where millions of unique documents per year are available in thousands of forums. This provides an excellent opportunity for ongoing feedback and profile adjustment, as well as comparison to "traditional" netnews interfaces.

Additional areas for investigation which may be derived directly from the experiment described in this paper are being tested, but without the benefit (and performance imperative) of actual work group involvement. These include:

• Gather "good" documents for use as targets in the information space; ascertain the value of documents which are close to these targets, rather than the center of the space.
• Generate descriptive statistics to better understand the distribution of concepts and documents in the information space generated here, and how they vary with different choices about word stemming, stoplists, document size, etc.
• Test a fully interactive system which has been developed to enable users to provide natural language queries and obtain immediate results.
• Use the fully interactive system to do similarity-based information browsing, where users would request documents similar to target documents.
• Employ MDS techniques to assess the relationships among concepts and documents as perceived by work group members; use this to fine-tune the automatic methods described here.

The methods used in this project assess the similarity of items within a database, in order that conceptual similarity may be used for information scanning or retrieval purposes. This is offered as an alternative to, or supplement for, systems which rely on the presence or absence of particular keywords from a controlled vocabulary. In both cases, terms generated by the information seeker are the basis for discovering potentially useful documents. In the concept similarity space described here, though, the presence or absence of a particular term is not enough alone to qualify or disqualify particular documents from retrieval. In fact, the concept similarity space is useful for ascertaining multi-modal relationships, where a term may have different meanings in different contexts.

Information space is a necessary component of modern multimedia and internetworked information systems.
The processes described here are not the only possible way to produce a searchable, navigable, and visualizable information space, but may offer a possible basis for future IR systems.

NOTES:

1. It would be possible to build the information space based on the co-occurrence of terms occurring in the project documents themselves, rather than the patent abstracts. However, the project documents did not contain either enough documents or enough total terms to yield a well-populated matrix. A sparse co-occurrence matrix can be used for the principal components analysis, but the result tends not to have the strongly front-weighted variance scores for the first few dimensions found with a less sparse matrix.

2. By definition, the complete solution to the principal components analysis accounts for 100% of the variance in the co-occurrence matrix.

3. The choice of dimensions is somewhat arbitrary. All dimensions could be used for the analysis, but many account for very small amounts of variance (essentially equivalent to the presence or absence of only one or two terms). The seven dimensions retained for this analysis were chosen because the eighth dimension had a sudden drop-off in the amount of variance accounted for. Future research will assess the relative impact of the smaller dimensions.

REFERENCES

Kruskal, B. (1978). Multidimensional Scaling. Beverly Hills: Sage.

Newby, G. B. (1992). "An Investigation of the Role of Navigation for Information Retrieval." In: Proceedings of the American Society for Information Science Annual Meeting 29: 20-25. Medford, NJ: Learned Information.

Salton, G., Fox, E. A. & Voorhees, E. (1985). "Advanced feedback methods in information retrieval." Journal of the American Society for Information Science 36(3): 200-210.

Salton, G. & McGill, M. J. (1983). Introduction to Modern Information Retrieval. New York: McGraw Hill.

Woelfel, J. D. & Fink, E. L. (1980). The Measurement of Communication Processes: Galileo Theory and Method.
New York: Academic Press.

* The work reported here was carried out using the facilities of the National Center for Supercomputing Applications in Urbana, Illinois, under a grant from Schlumberger.