The goal of the indexing task is to assign characterizations (terms) to documents that best represent their content. Every term used to characterize the documents of a collection can be seen as adding a new dimension to the characterization. Terms should be assigned to documents in such a way that documents on the same topic are positioned close together in the N-dimensional term space, while those on different topics are placed sufficiently far apart. Terms can be anything from tri-grams and words to linguistic entities and concepts. At the two extremes, each document can be characterized by itself, e.g. by its document number, or all documents can be given exactly the same characterization. The former positions documents as far apart as possible, leaving no way to retrieve documents on the same topic, and is thus unusable in the IR context; the latter provides no way of discriminating between topics. A suitable characterization must therefore be both usable and discriminating.
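The geometric intuition of a term space can be made concrete with a small sketch. The cosine measure used below is one standard way to compare term vectors; the three-term space and the vectors themselves are invented purely for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term vectors:
    1.0 for identical directions, 0.0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy 3-dimensional term space (term order: "river", "pollution", "system");
# the weights are made up for the example.
doc_a = [2, 3, 0]   # about river pollution
doc_b = [1, 2, 0]   # also about river pollution
doc_c = [0, 0, 4]   # about a different topic

# Documents on the same topic end up close together,
# documents on different topics far apart.
assert cosine(doc_a, doc_b) > cosine(doc_a, doc_c)
```

In this picture, the "document numbers" characterization corresponds to mutually orthogonal vectors (every pairwise cosine is 0), while the "identical characterization" case collapses all documents onto one vector (every pairwise cosine is 1).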
In a keyword-based representation, every document is characterized by a set of keywords, with weights representing the importance of each keyword in characterizing the document. Keywords are usually derived directly from the document's text. Keyword-based representations are only modestly usable and discriminating. Single words are rarely specific enough for accurate representation: the word system by itself says little, whereas sound system clarifies the meaning somewhat. Moreover, a word with a high frequency of occurrence across the document collection is not a good discriminator. A phrase, on the other hand, even one made up of high-frequency words, may occur in only a few documents, thus becoming a good discriminator. These observations suggest that a better characterization will make use of phrases. Consequently, a naive phrase retrieval hypothesis can be formalized as follows.
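One standard way to assign such importance weights, not prescribed by the text above but widely used, is tf-idf: a term's weight in a document grows with its frequency there, but is damped when the term occurs in many documents of the collection and is therefore a poor discriminator. A minimal sketch:

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight each term in each document by tf * idf.
    docs: a list of tokenized documents (lists of words)."""
    n = len(docs)
    # document frequency: in how many documents does each term occur?
    df = Counter(t for doc in docs for t in set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

docs = [["sound", "system", "design"],
        ["operating", "system", "kernel"],
        ["river", "pollution", "study"]]
weights = tfidf(docs)
# "system" occurs in two of the three documents, so its weight in the
# first document is lower than that of the document-specific word "sound".
assert weights[0]["sound"] > weights[0]["system"]
```

A collection-wide term such as system is damped toward zero, matching the observation that high-frequency words are poor discriminators.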
Phrases can be obtained using statistical or syntactic methods. Syntactic phrases appear to be reasonable indicators of content, arguably better than proximity-based statistical phrases, since they account for word order and other structural variation, e.g. science library vs. library science vs. library of science. However, experiments have shown that syntactic methods are not significantly more effective than statistical methods [18,19,20]. This failure of NLP to outperform statistics can be attributed to the limited quality and robustness of existing NLP techniques. Nevertheless, we will adopt a syntactic approach for the time being, assuming that accurate syntactic analysis and disambiguation techniques will become available. We will return to the effectiveness issues of NLP in section 7.
Evidence suggests that noun phrases should be considered as semantic units. The most important reasons are:
Phrases can be used in their literal form as terms, although performance is then expected to be inferior to that of keywords. It is well known that, as a corpus grows, the number of distinct keywords grows with the square root of the corpus size. One might expect the same to hold for phrases, but the number of such enriched terms grows even faster, and so does the likelihood of different phrases corresponding to the same concept. On the one hand we would like to use phrases to achieve precision; on the other hand recall will be too low, because the probability of a phrase recurring literally is too small. To deal with this sparsity of phrasal terms, we shall introduce a number of linguistic normalizations (section 5). Linguistic normalization tries to reduce alternative formulations of the same meaning to a single normalized form. For example, river pollution and pollution of rivers are both normalized to the same indexing term pollution+river.
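The effect of such a normalization can be sketched as follows. The stopword list, the naive plural stripping, and the alphabetical ordering of the joined words are illustrative assumptions, not the actual normalization procedure introduced in section 5.

```python
# Illustrative assumptions, not the paper's actual procedure:
STOPWORDS = {"of", "the", "a", "in", "for", "on"}

def lemma(word):
    """Crude lemmatizer: strip a plural 's' (e.g. 'rivers' -> 'river')."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def normalize(phrase):
    """Map alternative formulations of a phrase to one indexing term."""
    content = [lemma(w) for w in phrase.lower().split() if w not in STOPWORDS]
    return "+".join(sorted(content))

# Different surface forms collapse to the same indexing term:
assert normalize("river pollution") == "pollution+river"
assert normalize("pollution of rivers") == "pollution+river"
```

Collapsing variant formulations this way trades a little precision for the recall that literal phrase matching loses, which is exactly the motivation given above.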