Indexing Large Text Collections Using the Vector Space Model

Posted on February 11, 2015

Kevin C. O’Kane, Professor Emeritus

University of Northern Iowa

From the multimillion-document collections of research reports, memos, and email at large corporations to the nearly endless mishmash of Internet web pages, tweets, texts, and cat photos, someone, somewhere, wants to search them. To accomplish this, we need efficient ways to identify, organize, store, search, and retrieve content.

One of the most widely used approaches to this problem is the vector space model. It constructs a multidimensional hyperspace whose axes (possibly thousands) correspond to the terms of the indexing vocabulary. A position along an axis indicates the importance of the corresponding term. Each document in the collection is then expressed as a vector with one component per vocabulary term, whose value indicates the importance of that term in the document. Thus, based on its component vocabulary, a document vector describes a point in the hyperspace. Queries are likewise rendered as vectors, which also define points in the hyperspace. Documents that lie within an adjustable, multidimensional envelope around a query point are retrieved and ranked according to their distance from it. The first use of this model was in the SMART System (Salton 1988, 1992), and it has been the basis of many implementations since.
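The retrieval step described above can be sketched in a few lines of Python. This is only an illustrative toy, not the SMART System's method or the implementation used in the talk. It assumes raw term counts as the measure of a term's importance, cosine similarity as the closeness measure (a common choice, but an assumption here), and a hypothetical `threshold` parameter standing in for the adjustable envelope. The function names and the three sample documents are invented for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def to_vector(text, vocabulary):
    """Return one component per vocabulary term; the value is the raw count."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, documents, threshold=0.1):
    """Return (similarity, document index) pairs at or above the threshold,
    most similar first. The threshold plays the role of the envelope."""
    # The axes of the space are the distinct terms found in the collection.
    vocabulary = sorted({term for doc in documents for term in tokenize(doc)})
    doc_vectors = [to_vector(doc, vocabulary) for doc in documents]
    query_vector = to_vector(query, vocabulary)
    scored = [(cosine(query_vector, dv), i) for i, dv in enumerate(doc_vectors)]
    hits = [(score, i) for score, i in scored if score >= threshold]
    return sorted(hits, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical three-document collection, invented for illustration.
documents = [
    "aspirin reduces fever",
    "aspirin and heart attack risk",
    "insulin therapy for diabetes",
]
results = search("aspirin fever", documents)
```

Here the first document shares both query terms and ranks first, the second shares one and ranks below it, and the third shares none and falls outside the threshold. A realistic system would replace the raw counts with weights such as the inverse document frequency weights listed among the talk's topics.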

The talk will discuss the model with examples drawn from its application to a collection of 293,000 medical abstracts (sorry, no cat pictures). Topics will include:

Zipf’s Law,
Word frequency analysis, stop list generation, and word stemming,
Term weighting: inverse document frequency weights and discrimination coefficients,
Similarity functions,
Construction of document-term, term-document, term-term, and document-document matrices,
Synonym and phrase identification,
Term and document clustering,
Database implications: SQL or NoSQL?
Document retrieval.

Salton, G. (1988). Automatic Text Processing. Addison-Wesley, Reading.

Salton, G. (1992). The state of retrieval system evaluation. Information Processing & Management, Vol. 28, No. 4, pp. 441-449.

Dr. O’Kane is former professor and head of Computer Science at the University of Northern Iowa and, prior to that, professor and head of Computer Science at the University of Alabama. He also taught at the University of Tennessee and the Ohio State University College of Medicine. He is the author of about 50 publications in the areas of information retrieval and medical informatics. He received his S.B. in chemistry from Boston College and his Ph.D. in computer science from the Pennsylvania State University. He is the author of an open source IS&R workbench as well as an open source compiler and interpreter for the Mumps language.