Moving More Quickly toward Full Term Relations in Information Space
Gregory B. Newby*
School of Information and Library Science
This paper describes the ISpace retrieval system’s involvement in TREC8. The main goal for this year’s work was to speed up document indexing and query processing compared to previous years. This goal was achieved, but retrieval performance was not as good as for TREC7. System details for the adhoc task, small Web task, and large Web (VLC) task are presented. The paper also describes an implementation of a multidimensional tree structure for retrieval from information space based on the kd-tree. A concluding section describes future plans for ISpace.
This year’s efforts for the 8th Text REtrieval Conference (TREC) included the following:
1. Adhoc task, fully automatic.
2. Small Web track
3. Large Web track (VLC)
Throughout the work described here, the central question of interest is:
How might information space techniques reach acceptable
performance levels?
The issue of performance is ambiguous, but was defined as emphasizing the following, in decreasing order of importance:
a. Performance means being able to quickly produce a ranked response set for a query topic
b. Performance means being able to handle the full variety of queries and documents – i.e., without limitations on the number of unique terms or number of documents
c. Performance means the response set has a large proportion of relevant documents
In this hierarchy, the goal of high relevance is uncharacteristically last, but not forgotten. Because post-hoc analysis of last year’s non-judged TREC submissions (Newby, 1999) indicated reasonable recall-precision performance with exact precision of 0.14, the emphasis was on developing a more practical and usable system. While 0.14 is unremarkable compared to other groups’ TREC submissions, it represented an order of magnitude improvement from prior years (Newby, 1998).
A description of the information space technique, system design considerations for each phase of the work, and outcomes follow. A concluding section summarizes this year’s TREC activities and lays out plans for the near future.
This section introduces the approach to information retrieval employed by the author. The Information Space approach to information retrieval is comparable to Latent Semantic Indexing (LSI, see Deerwester et al., 1990), with some differences. These differences are:
- Information Space starts with the term by term correlation matrix while LSI starts with the term by document co-occurrence matrix;
- Information Space performs eigensystems analysis while LSI performs a Singular Value Decomposition (SVD);
- Information Space has not made use of eigenvalues for scaling document vectors while LSI does make use of singular values (the square root of the eigenvalues);
- The Information Space approach does not assume that higher-dimensioned eigenvectors are without merit while LSI historically has sought to discard higher-dimensioned eigenvectors
Techniques that have been applied to various types of IR have also been applied to the Information Space system described here, called ISpace. These include: query expansion, part of speech tagging, document length normalization, term weighting, and stemming.
Like many other IR systems, ISpace may be compared to the Vector Space Model (VSM). Although variations in the VSM have proliferated, fundamental differences between Information Space and the VSM are:
- Information Space measures relations among terms while VSM treats term vectors are unrelated (orthogonal)
- A fair approximation of an Information Space may be visualized in 2 or 3 dimensions while a vector space does not have a clear visual interpretation
For TREC8, the ISpace implementation from prior years was largely rewritten. The goal of rewriting was to increase modularity and enable incorporation of multiple IR techniques. For example, the same data structures, term weights, etc. used for an Information Space retrieval experiment with ISpace may also be used for a VSM retrieval experiment.
The specific steps taken for ISpace retrieval are described for the different tasks, below.
Experimentation with query expansion was the main goal for the adhoc task. The steps taken for the task were as follows:
1. Read terms from all adhoc documents, building an inverted index and term frequency list. About 367,000 Porter-stemmed terms were pre-identified from various word lists, past TREC topics, and the adhoc document set. A test run demonstrated that all possible terms could be indexed, but would include many single-use terms not likely to be in queries.
2. 3136 non-stoplist terms were selected based on IDF values for inclusion in the information space. (The SMART stoplist was used.)
3. A term co-occurrence matrix for the 3136 terms was generated from the inverted index.
4. A Pearson product moment correlation matrix was generated from the co-occurrence matrix.
5. Eigensystems analysis was performed on the correlation matrix, resulting in 3062 eigenvalues ranging from 2871 to nearly zero.
6. Each adhoc document was then re-analyzed and assigned a location vector at the geometric center of the eigenvectors from the terms it contained.
7. These document vectors were weighted with TF * IDF using simple formulas from Frakes & Baeza-Yates (1992). Then, vectors were normalized to unit length.
8. Topics were expanded by either 25 or 50 terms. Term expansion added the most highly correlated terms with each topic term.
9. The topic vector in the information space was then weighted and normalized at the center of its (expanded) term vectors.
10. The closest document vectors to the topic vector were retrieved in rank order
Four adhoc runs were submitted. Two judged runs utilized topic titles and descriptions, expanded by 25 or 50 terms (isa25, isa50). Two un-judged runs utilized titles only and expanded by 25 or 50 terms (isa25t, isa50t). Results were considerably poorer than comparable adhoc ISpace runs from TREC7, yielding average exact precision scores under 0.025.
There are two likely explanations for this poor showing. One is that part of speech tagging (Brill, 1994) and query processing, used last year, are important. This is supported by the observation that expanded topics expanded all topic terms, including terms without much discriminatory value. This may have served to bring in useful terms, but it also increased the noise in topics. For query processing, all topics were pre-analyzed for sentence structure in TREC7 so that phrases about non-relevance were eliminated. Thus, only “desirable” terms were kept.
The other likely explanation is that the 3136 terms chosen were not well suited to the TREC8 adhoc topics. Of the 283 unique stemmed terms in topics 401-450, 14 important topic terms were not among them.
The stemmed missing terms were: Burma, comet, decai, estonia, hurrican, legionnair, lockerbi, milosev, Parkinson, potassium, saharan, salvag, scotland, and typhoon. From this list, it is evident that low performance would be expected from the topics that include them: 403, 405, 406, 408, 409, 411, 415, 423, 429, 434 and 443.
However, topics 403 (“osteoporosis”) and 408 (“tropical storms”) both performed relatively well, as mentioned below. It appears that the missing term from 403, “potassium,” was not overly injurious, and the presence of both “tropical” and “storm” may have offset the missing “hurricane” and “typhoon” from topic 408.
Exact precision scores for the adhoc task ranged from 0 to .59. Table 1 presents a summary of scores. Due to a scoring error, scores for the title-only 50-word expansion are not considered here.
Topics with exact precision or average precision greater than 0.10 on the adhoc task included:
1. Topic 403, “osteoporosis.” This topic had the highest median average precision of all topics, possibly due to a large number of fairly specific terms in the expanded topic. Even the title-only run yielded good scores (Exact precision = .42). Term expansion appeared to work well, with a better exact precision for expansion by 50 than by 25 (.38 vs. .19).
2. Topic 407, “poaching, wildlife preserves.” Good performance on the title-only run, but not title plus description. .13 average precision and .17 exact precision for title-only.
3. Topic 408, “tropical storms.” Average precision scores of about .9 on all runs were unexceptional, but exact precision scores of .22 and above for all but the expand by 25 condition indicate reasonable early precision.
4. Topic 441, “Lyme disease.” This seems to be a good example of a topic well-represented by two terms. Title-only runs yielded exact precision scores in excess of .52, and average precision of over .58, while title plus description runs were very poor, below .01 on both exact and average precision.
|
|
Expand
by 25 (isa25) |
Expand
by 50 (isa50) |
Expand
by 25 title only (isa25t) |
|||
|
Exact precision |
|
|
|
|||
|
Average precision |
|
|
|
|||
|
Number
over median average precision |
0 |
0 |
0 |
5. Topic 444, “supercritical fluids.” This is the one case where title plus description heavily out-scored title-only, but just for the expansion by 50 case. 3 of the 17 (total) relevant documents were retrieved in the top 10 documents, indicating some good terms were brought in by expansion that were missed in the other runs.
The easiest interpretation to make from these results is that query term expansion is valuable, but only when applied to useful terms. Future efforts will return to a reliance on part of speech tagging and term weights to identify “good” terms to expand, while avoiding expanding the rest.
Finally, note that no runs were made with non-expanded queries. This would be a valuable comparison. Future experiments should therefore compare:
- Expansion versus no expansion
- Different levels of expansion (bringing in 25 terms versus 50 terms, or other values)
- Methods for choosing terms to expand (TF weights, part of speech, presence in <title> field, etc.)
The steps taken for the small Web track (a 2 gigabyte subset of the large Web track’s Very Large Corpus or VLC) were nearly identical to the steps for adhoc. The only difference was that the co-occurrence matrix was not based on the small Web track corpus, but rather the adhoc corpus. This mismatch should theoretically be detrimental to the recall-precision statistics, but with an average exact precision of 0.02 from the adhoc task, such differences would be difficult to see. (Performance at exact precision values of 0.02 or below is approximately equal to that expected from a random sample of documents, although higher early precision than later precision indicates the retrieved set is actually non-random.).
Four small Web track runs were submitted. As with the adhoc task, two title plus description runs were judged, but two title only runs were not judged. Due to a scoring error, scores for the title-only 50-word expansion are not considered here. See Table 2 for a summary.
|
|
Expand
by 25 (isw25) |
Expand
by 50 (isw50) |
Expand
by 25 title only (isw25t) |
|||||||
Exact precision |
|
|
|
|||||||
|
Average
precision |
|
|
|
|||||||
|
Number
over median average precision |
0 |
2 |
2 |
Topics with either exact precision or average precision over 0.10 on one or more runs included:
1. Topic 403, “osteoporosis.” Presumably for the same reasons the topic performed well for the adhoc task (above).
2. Topic 408, “tropical storms.” Interestingly, performance was somewhat less than for the adhoc task, but still better than most other topics.
3. Topic 415, “drugs, Golden Triangle.” This topic performed well with title-only, but not well with title plus description, presumably due to the ambiguous terms in the description (e.g., “organizations,” “international.”).
4. Topic 432, “profiling, motorists, police.” This topic also performed well only on title-only runs.
5. Topic 433, “Greek, philosophy, stoicism.” One or more good expansion terms were presumably identified in title plus description expansion by 50, which yielded .18 exact precision but less than .01 for the other runs.
6. Topic 448, “ship losses.” Similarly to topic 433, exact precision on title plus description expansion by 50 yielded an exact precision of .18, but very low scores on other runs.
7. Topics 441, 445, 446 and 450 all had comparable patterns of exact precision scores near or over .20, but unexciting average precision scores (except for topic 450, with an average precision of .10 or above for all runs). These topics are characterized by useful title terms and, except for 441, specific description terms.
The small Web track data set seemed to be fairly well suited for topics 401-450. Average precision scores across all TREC participants were sometimes higher for the Web track, and other times higher for the adhoc task. The mean of median average precision scores for the adhoc task and small Web track across all participants was identical to 2 decimal places at .24.
It seems reasonable to conclude that the somewhat better performance for the small Web track versus the adhoc task for the ISpace system is genuine. Some possible explanations are that an increased variety in Web documents resulted in fewer highly-ranked false hits, and that the 3136 terms pre-identified for information space document location calculation helped to eliminate a larger proportion of small Web track documents than adhoc task documents.
The large Web track was the main emphasis of this year’s work, with the primary goal of achieving reasonable speed and efficiency in processing documents and topics. As discussed below, this goal was achieved, but at the expense of retrieval performance. This section describes the system considerations for the large Web track, including a discussion of the tree data structure employed for retrieval. However, because results for the VLC were submitted after the deadline, they were not officially judged. Because almost none of the VLC documents retrieved by ISpace were judged, no useful retrieval performance statistics were generated.
As with the small Web track, the large Web track made use of the information space pre-computed from the adhoc task. Thus, the steps for retrieval were nearly identical, except that a multidimensional tree (a modified kd tree, described below) was used for retrieval, rather than a brute-force search for the closest document vectors for each query vector.
Although 3062 eigenvectors (dimensions) were available for use from the adhoc collection, the VLC documents were only processed with 300 eigenvectors per term. The 300-dimensional information saved to disk was not utilized for the actual retrieval, however: only 100 dimensions were utilized. This is beneath the threshold recommended by Deerwester et al. (1990) and others, but was necessary to fit as many documents as possible into the tree, which was memory-based.
.
|
Measure |
Performance |
Notes |
|
Time to index |
5 clock hours |
0.0012 seconds per document |
|
Index size, all files |
11GB, 18 files |
~640 bytes per document |
|
Index size, eigenvectors only |
8GB, 3 files |
300 dimensions per document |
|
Time to start retrieval engine (prior to running queries) |
20 clock minutes |
|
|
Time to run 10,000 queries |
52 seconds |
0.0052 seconds per query |
The 100GB VLC (Very Large Corpus) was stored on a disk/tape array at UNC Chapel Hill. This enabled rapid staging from tape robot to disk without requiring more than about 10GB of disk space at a time. See Table 3 for a summary of non-retrieval performance measures
Most of the large Web track was performed on UNC-CH’s Sun ES-10000 server with 36 processors and 16GB of main memory. Disk access on the disk/tape array was found to be quite fast, cutting indexing time by 60% versus a comparable server with local SCSI RAID disk. No parallelization was utilized, and the ES-10000 was shared with many other user processes.
The overhead to start the retrieval engine consisted mainly of reading in all document locations (eigenvectors) from disk to a modified kd tree in memory. The kd tree is a data structure for matching data on a large number of keys – in this case, the keys corresponded to the relatively large number of dimensions. Kd trees are type of multidimensional tree developed by Friedman, Bentley & Finkel (1977), intended to solve the challenge of quick searching of multi-keyed data. This is a similar problem to that of multi-way trees such as B-trees (cf. Knuth, 1998), but whereas B-trees are split on a single key (such as a filename), kd trees are split on multiple keys (such as dimensions in a multidimensional space). The particular memory-based implementation of the kd tree utilized for ISpace was derived from Weiss (1999).
Although up to 300 dimensions were available on disk, as mentioned above, only 100 were used for the tree. Even so, only about 7 million of the 14 million VLC documents could be considered for retrieval at one time on the ES-10000, using about 4.5GB of memory. (There was no opportunity for dedicated access to the ES-10000.) Note that only 14 million of the 18 million VLC documents were considered, because the other 4 million had none of the 3136 terms (many of these were in non-English languages).
It is possible to store the kd tree on disk, rather than storing the entire tree in memory. This is a required development to grow ISpace beyond 7 million documents in 100 dimensions.
The general purpose of the kd tree is to minimize the number of document vectors that need to be compared by brute force (one at a time) to the query. Consider that a key difference between ISpace (and other LSI-like approaches) and the VSM or Boolean systems is that a document can be “close” to a query without having any terms from the query.
For example, a document, “pig farmer” might be close in ISpace to a query, “swine domesticator.” In order to determine and rank the closest documents to a query, it is necessary to consider ALL documents, not just those with query terms. For up to a few thousand documents, if their coordinates are in main memory, an exhaustive comparison and ranking is reasonable. For millions of documents, as in the large Web track, it is desirable to group documents in some sort of structure so that a relatively small proportion of documents need to be exhaustively compared.
The modification of the kd tree used here splits the collection into three groups at each level based on a dimension corresponding to that level. There are only two differences between the modified kd tree – which is tentatively called a green-black or gb tree (for its similarity to a red-black tree). One difference is that there are three children per parent, instead of two. The other is that the decision of which branch or leaf to select at each level is based on the coordinates at the corresponding dimension, instead of on the dimension with the highest remaining variance. These changes are anticipated to be helpful with a highly multidimensional space (hundreds or thousands of dimensions), as compared to the more typical case from the literature on algorithms where only a dozen or so dimensions are used.
Insertion time for a kd tree is proportional to n log (n) where n is the number of items to be inserted. Identifying the cell with the best match in a kd tree takes time proportional to log n, but then all items in a cell (leaf) must be evaluated and ranked relative to the target query, which takes linear time with respect to the number of items in the cell.
For ISpace, the goal was to minimize the number of linear time comparisons by minimizing the number of cells that had to be examined. The challenge, as might be expected, is that the tree depth does not nearly reach the total number of dimensions. For example, a tree with 7 million items might reach a depth of 60 levels (that is, the most dimensions examined to separate documents into separate leafs – known as buckets – is 60). This is for a relatively large bucket size of 1000 (a limit of 1000 documents per leaf, before the leaf is split and the tree descends a level).
Smaller bucket sizes would result in deeper levels, but bucket creation and splitting the parent bucket’s contents is relatively expensive.
The modified kd tree used for ISpace in TREC8 yielded strong system-based performance, with time per query well under 1 second. Further work with this type of tree structure will determine the extent to which the performance win on search time can also yield good retrieval performance.
This year’s implementation of ISpace incorporated query term expansion. A modest-sized information space of only 3136 terms was built, consisting of term eigenvectors derived from a term correlation matrix.
ISpace retrieval performance was not as good for TREC8 as for TREC7. However, the indexing and speed performance was increased by at least an order of magnitude. Implementation of the following should result in regained retrieval performance:
1. Allow more terms in the information space. At least a 40,000 term eigensystems should be achievable.
2. Incorporate part of speech tagging to identify terms that should be expanded.
3. Implement sentence parsing for TREC-like queries so that terms in phrases identifying unwanted concepts may be bypassed.
In addition to these features, which have already been implemented in prior versions of ISpace, additional features should be added:
1. Enable Boolean -style queries with AND, OR and NOT
2. Multiple options for term weighting schemes
3. Term co-occurrence relations measured at the sub-document level (e.g., within an N-term window or within the same sentence.
Finally, it should be noted that this year’s ISpace is an IR system without an interface. Adding a Web-based front end, enabling relevance feedback and so forth will present no problems. Incorporating the navigable fly-through system, YAVI, described in TREC7 is also desirable in the near-term.
Figure 1: Diagram of gb tree structure. The tree consists of leafs or buckets, which contain documents, and branches, which are used to determine where a document should be placed.
|
Leaf containing documents Leaf containing documents Leaf containing documents Leaf containing documents Leaf containing documents |
Brill, Erik. (1994). “Some advances in rule-based part of speech tagging.” Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94). Seattle, Washington.
Deerwester, Scott; Dumais, Susan T.; Furnas, George W.; Landauer, Thomas K.; Harshman, Richard. 1990. “Indexing by Latent Semantic Analysis.” J. Amer. Soc. For Information Science 41(6): 391-407.
Frakes, William B. & Baeza-Yates, Ricardo (Eds.). (1992). Information Retrieval Data STructures &Algorithms. Englewood Clifs, New Jersey: Prentice-Hall.
Friedman, Jerome H.; Bentley, Jon Louis; Finkel, Raphael Ari. (1977). “An Algorithm for Finding Best Matches in Logarithmic Expected Time.” ACM Trans. On Math. Software 3(3): 209-226.
Knuth, Donald E. (1998). The Art of Computer Programming Volume 3: Sorting and Searching. Reading, Massachusetts: Addison-Wesley.
Newby, Gregory B. (1998). “Context-Based Statistical Sub-Spaces.” Text REtrieval Conference (TREC-6) Proceedings, pp. 735-746. Gaithersburg, MD: National Institute of Science and Technology.
Newby, Gregory B. (1999). “Information Space Gets Normal.” Text REtrieval Conference (TREC-6) Proceedings, pp. 567-571. Gaithersburg, MD: National Institute of Science and Technology.
Weiss, Mark Allen.
(1999). Data Structures
and Algorithm Analysis in C++ (2nd ed.). Reading, Massachusetts: Addison-Wesley.
* School of Information and Library Science, Campus box 3360 Manning Hall, Chapel Hill, NC, 27599-3360. Email: gbnewby@ils.unc.edu