The problems of linguistic variation have been noted by many researchers, who have responded with various techniques. Many of these techniques employ Natural Language Processing (NLP) and language resources such as on-line dictionaries and thesauri. The results so far have been inconsistent, making it difficult to reach a conclusion about their effectiveness. We review some approaches and their outcomes for each of morphological, lexical, semantical, and syntactical variation.
Morphology is the area of linguistics concerned with the internal structure of words. It is usually divided into two types, inflectional and derivational. Inflectional morphology describes the predictable changes a word undergoes as a result of syntax; it has no effect on the word's part-of-speech (e.g. a noun remains a noun) and little effect on its meaning. The most common changes are the plural and possessive forms of nouns (e.g. computer, computers, computer's), the comparative and superlative forms of adjectives (e.g. good, better, best), and the past tense, past participle, and progressive forms of verbs (e.g. compute, computed, computing). In contrast, derivational morphology may or may not affect a word's part-of-speech or meaning (e.g. computerize, computerization).
Two approaches have generally been followed to deal with morphology in IR in order to increase recall: query expansion and stemming. In query expansion, morphological variants of keywords are added to the query. Stemming simply strips a word's suffix to reduce it to its stem, assuming that keywords with a common stem usually have similar meanings. Query expansion and stemming can be regarded as equivalent, and the choice between them depends on the nature of the particular application. We will concentrate on stemming, as it is the more common choice.
Stemming can be done in a linguistic fashion, taking into account the function and the part-of-speech of a word, or a non-linguistic fashion, disregarding a word's context.
Lovins and Porter developed non-linguistic suffix-stripping algorithms that use a list of frequent suffixes to reduce words to their stems [2,3]. It is a common belief that stemmers improve recall without losing too much precision. However, a detailed comparison of the Lovins stemmer, the S-stemmer, and the Porter stemmer against a baseline of no stemming at all concluded that none of the three algorithms consistently improves retrieval for English documents [4]. It was argued that the evaluation measures used were not appropriate, and new measures were proposed for evaluating the performance of different stemming algorithms [5]. After experimentation, it was concluded that stemming is almost always beneficial for English, except for long queries at low recall levels. A more reliable version of Porter's stemmer was developed, which uses a dictionary to validate the result after every suffix-stripping step. This revised Porter stemmer improved retrieval performance for English documents, especially short ones [6].
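As a minimal sketch of the suffix-stripping idea, the following Python fragment removes suffixes in a few ordered steps and, optionally, stops as soon as an intermediate result is found in a dictionary. The suffix lists, the step order, and the names STEPS, apply_step, and stem are illustrative assumptions; they are not the published Lovins or Porter rule sets, nor the revised stemmer of [6].

```python
# Illustrative suffix groups, applied in order. These are assumptions made
# for this sketch, NOT the actual Lovins or Porter rules.
STEPS = [
    ["es", "s"],                 # plural-like endings
    ["ing", "ed"],               # verb endings
    ["ization", "ation", "er"],  # derivational endings
]


def apply_step(word, suffixes, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def stem(word, dictionary=None):
    """Apply each step once; if a dictionary is given, stop at the first
    stripped form that is a known word (dictionary validation)."""
    for suffixes in STEPS:
        stripped = apply_step(word, suffixes)
        if stripped != word:
            word = stripped
            if dictionary is not None and word in dictionary:
                break
    return word


if __name__ == "__main__":
    for w in ["computers", "computing", "computed"]:
        print(w, "->", stem(w))
    print("computers ->", stem("computers", dictionary={"computer"}))
```

Without a dictionary, the three inflected forms are conflated to the non-word stem "comput"; with the one-word dictionary, stripping stops at "computer". Whether stopping at dictionary words helps retrieval is an empirical question that this sketch does not address.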
Research on morphologically more complex languages such as Slovene showed an improvement in effectiveness when using a Porter-like stemmer modified for Slovene [7]. In the same study, when the Slovene corpus was translated into English and the experiment was repeated, there was no improvement in retrieval. For Dutch texts, it was found that linguistic inflectional stemming improves recall without significant loss in precision, while derivational stemming, although sometimes useful, in general reduces precision too much [8].
Lexical variation has generally been treated in two ways: on the one hand, by (lexical) query expansion with semantically related terms, e.g. synonyms, and on the other hand, by matching query and document keywords via conceptual distance measures. For these purposes, thesauri have been exploited to supply related query terms, and semantical networks like WORDNET [9] to define semantical distance measures between words.
The choice of semantically related terms for a word depends on the context in which the word is used. Thus, the context specifies the word's sense. When a word can be used in different senses, the problem of word sense ambiguity arises. Most techniques that deal with lexical variation require prior word sense disambiguation, which makes them strongly dependent on semantical variation (described in the next section).
Query expansion with WORDNET has shown potential for enhancing recall, since it permits the matching of relevant documents which do not contain any of the query terms [10]. Expansion of queries using synonymy and other semantic relations supported by WORDNET showed that short and incomplete queries can be significantly improved, yielding better retrieval effectiveness [11]. However, this query expansion technique made little difference in effectiveness for relatively complete descriptions of the information sought. For Dutch texts, synonym expansion was reported as potentially useful [12].
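The effect of synonym expansion on matching can be illustrated with a small Python sketch. The thesaurus below is a hypothetical hand-made dictionary standing in for a resource such as WORDNET, and the names THESAURUS, expand_query, and matches are assumptions introduced only for this example.

```python
# Hypothetical toy thesaurus; a real system would consult a lexical
# resource such as WORDNET instead of this hand-made dictionary.
THESAURUS = {
    "car": ["automobile", "auto"],
    "buy": ["purchase"],
}


def expand_query(terms, thesaurus):
    """Return the original terms followed by their synonyms, no duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def matches(query_terms, document_terms):
    """Boolean match: true if the query shares at least one term with the document."""
    return bool(set(query_terms) & set(document_terms))


if __name__ == "__main__":
    query = ["buy", "car"]
    document = "automobile purchase agreement".split()
    print(matches(query, document))                           # False
    print(matches(expand_query(query, THESAURUS), document))  # True
```

The document shares no term with the original query, so it is only retrieved after expansion. The sketch expands every term blindly; as discussed in the text, choosing the right related terms in practice depends on the sense in which each query word is used.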
Experiments on a small collection of image captions (i.e. very short documents), using measures of semantical distance between words based on WORDNET, showed improvements in retrieval [13]. However, earlier experiments by the same group with word-to-word semantical similarity measures resulted in a drop in effectiveness, due to the effects of erroneous word sense disambiguation [14].
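One simple way to define a distance between words over a semantical network is the length of the shortest path between them. The Python sketch below uses this path-length measure on a tiny hand-made is-a hierarchy; both the hierarchy and the choice of measure are assumptions for illustration, and the measures actually used in [13,14] may differ.

```python
from collections import deque

# Hypothetical is-a links (child -> parent); not actual WORDNET content.
HYPERNYMS = {
    "dog": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
    "sparrow": "bird",
    "bird": "animal",
}


def build_graph(links):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, parent in links.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def path_distance(graph, source, target):
    """Breadth-first search for the number of links on the shortest path."""
    if source not in graph or target not in graph:
        return None
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, distance = queue.popleft()
        if node == target:
            return distance
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None  # the two words are not connected


if __name__ == "__main__":
    graph = build_graph(HYPERNYMS)
    print(path_distance(graph, "dog", "cat"))      # 4
    print(path_distance(graph, "dog", "sparrow"))  # 5
```

In this toy hierarchy every word has a single sense. With a real network, a word form may map to several nodes, which is where erroneous sense selection can distort the distances.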
Another approach, based on indexing in terms of WORDNET's synonym sets (synsets) instead of word-forms, yielded successful results when queries are fully disambiguated [15]. If queries are not disambiguated, indexing by synsets performs, at best, only as well as standard word indexing.
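The idea of indexing by synsets rather than word-forms can be sketched as follows in Python. The sense inventory, the sense labels, and the synset identifiers are all hypothetical, and the sketch assumes that documents and queries have already been disambiguated correctly, which is precisely the condition under which [15] reports success.

```python
from collections import defaultdict

# Hypothetical sense inventory: (word, sense label) -> synset identifier.
# The labels and identifiers are invented for this sketch.
SYNSET_OF = {
    ("car", "road"): "syn_automobile",
    ("automobile", "road"): "syn_automobile",
    ("car", "rail"): "syn_railcar",
}


def build_synset_index(documents):
    """Build an inverted index from synset identifiers to document ids.

    documents maps a document id to a list of (word, sense) pairs,
    i.e. text that is assumed to be sense-tagged already.
    """
    index = defaultdict(set)
    for doc_id, tagged_words in documents.items():
        for word, sense in tagged_words:
            index[SYNSET_OF[(word, sense)]].add(doc_id)
    return index


def search(index, word, sense):
    """Return the ids of documents containing the synset of a tagged query word."""
    return index.get(SYNSET_OF[(word, sense)], set())


if __name__ == "__main__":
    documents = {
        "d1": [("automobile", "road")],
        "d2": [("car", "rail")],
    }
    index = build_synset_index(documents)
    print(search(index, "car", "road"))  # {'d1'}
    print(search(index, "car", "rail"))  # {'d2'}
```

The query word "car" in its road sense retrieves d1, which contains only the synonym "automobile", and does not retrieve d2, which contains the same word-form in another sense. If the sense labels were wrong or missing, this advantage would disappear.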
Semantical variation has a strong impact on lexical query expansion, on matching based on word-to-word semantical distance measures, and on conceptual indexing; the success of these techniques requires prior disambiguation of word senses, as many researchers have noted [11,12,13,15]. Most of the research has concentrated on how large the impact of semantical variation, and of its inaccurate resolution, is on IR effectiveness.
It is estimated that if word sense disambiguation is performed with less than 90% accuracy, the retrieval results are worse than those obtained without disambiguating at all [16]. Poor retrieval results in previous research were attributed to this cause [14]. On the other hand, in the same experiments [16], word sense ambiguity was shown to produce only minor effects on retrieval accuracy, apparently suggesting that query-document matching strategies already perform an implicit disambiguation. In that experimental setup, ambiguity was introduced artificially by substituting randomly selected word pairs like bank and spring with ambiguous terms like bank/spring. This setup has two disadvantages: first, real ambiguity might not behave like the artificially introduced one, and second, the disambiguation of an artificially ambiguous term is only partial: when bank/spring is disambiguated as bank, bank is still ambiguous, as it can be used in more than one sense in a text collection [15].
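The artificial-ambiguity setup can be sketched in Python as follows. The first function conflates both members of a word pair into a pseudo-word such as bank/spring; the second simulates a disambiguator with a chosen accuracy. The function names, the error model, and the assumption that ordinary tokens never contain "/" are ours and are not taken from [16].

```python
import random


def introduce_pseudo_words(tokens, pairs):
    """Replace each member of a word pair by the joined pseudo-word."""
    mapping = {}
    for first, second in pairs:
        pseudo = f"{first}/{second}"
        mapping[first] = pseudo
        mapping[second] = pseudo
    return [mapping.get(token, token) for token in tokens]


def noisy_disambiguate(original, ambiguous, accuracy, rng):
    """Resolve each pseudo-word correctly with probability `accuracy`,
    otherwise pick the other member of the pair.

    Assumes that only pseudo-words contain the character "/".
    """
    resolved = []
    for truth, token in zip(original, ambiguous):
        if "/" not in token:
            resolved.append(token)
        elif rng.random() < accuracy:
            resolved.append(truth)
        else:
            first, second = token.split("/")
            resolved.append(second if truth == first else first)
    return resolved


if __name__ == "__main__":
    tokens = "the bank near the spring".split()
    ambiguous = introduce_pseudo_words(tokens, [("bank", "spring")])
    print(ambiguous)
    print(noisy_disambiguate(tokens, ambiguous, 0.9, random.Random(0)))
```

Indexing a collection once with the ambiguous tokens and once with tokens resolved at different accuracy levels allows the effect of disambiguation errors to be measured. As noted above, a resolved pseudo-word such as bank may still carry its own real ambiguity.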
The techniques developed to deal with syntactical variation may be grouped into two categories: addition of phrases to queries, and use of syntactical structures for indexing. These techniques aim to increase retrieval precision.
A phrase is a group of words, and historically what has been referred to as a phrase in the IR context varies significantly among researchers. The hypothesis behind using phrases has been that they denote more meaningful entities or concepts than single words, and thus may constitute a better representation. Indeed, the use of phrases has become common in IR; many systems participating in the Text Retrieval Conferences (TRECs) now use one form or another of phrase extraction [17].
Traditionally, two types of phrases have been used, statistical and syntactic. Statistical phrases are any series of words that occur contiguously often enough in a text collection. Syntactic phrases are any set of words that satisfy certain syntactic relations or constitute specified syntactic structures. Statistical phrases are extracted using word frequency and co-occurrence information, while syntactic phrases usually require sophisticated NLP techniques. Which of the two types is more useful for IR remains unclear; syntactic phrases seem to offer an advantage, but one that is not statistically significant [18,19,20]. Addition of syntactic phrases to queries yielded a substantial improvement in precision, especially near the top of the ranking [21]. This benefit, however, was tied to the length of the query: the longer the query, the larger the improvement. Significant improvements in retrieval performance were found when syntactic phrases supplemented single words [22]. However, the impact of adding phrases varied according to the query topic: adding phrases helped some topics, while it hurt others. Small, statistically insignificant improvements were also found for Dutch texts [20]. Other research concluded that phrases do not have a major effect on precision at high ranks, but are more useful at lower ranks [19].
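The notion of a statistical phrase can be made concrete with a short Python sketch that collects contiguous word pairs occurring at least a given number of times in a collection. Restricting phrases to pairs, the threshold value, and the name statistical_phrases are assumptions made for illustration; the cited systems use their own definitions and thresholds.

```python
from collections import Counter


def statistical_phrases(documents, min_count=2):
    """Return the contiguous word pairs seen at least min_count times.

    documents is a list of token lists; pairs never span two documents.
    """
    counts = Counter()
    for tokens in documents:
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, count in counts.items() if count >= min_count}


if __name__ == "__main__":
    collection = [
        ["information", "retrieval", "systems"],
        ["modern", "information", "retrieval"],
        ["retrieval", "systems", "evaluation"],
    ]
    print(sorted(statistical_phrases(collection)))
```

On this toy collection the pairs (information, retrieval) and (retrieval, systems) each occur twice and are kept, while the pairs that occur only once are discarded. No syntactic analysis is involved, which is what distinguishes statistical from syntactic phrases.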
Lexical atoms, such as hot dog, were used to replace their constituent single words in indexing [22]. Experiments in which high-frequency adjacent word pairs (only adjective-noun and noun-noun combinations) were replaced with the corresponding phrase for indexing showed an improvement in average precision. Nevertheless, the inconsistent influence of phrases on recall and initial precision suggested a need for better control over the selection of phrases used to replace single words.
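Replacing the constituent words of a lexical atom by a single indexing unit can be sketched in Python as below. The set of atoms is supplied by hand and the joined token format (hot_dog) is an arbitrary choice; the part-of-speech filtering and frequency-based selection described above are deliberately omitted from this sketch.

```python
def replace_lexical_atoms(tokens, atoms):
    """Replace each adjacent word pair found in atoms by one joined token.

    atoms is a set of (first_word, second_word) tuples. The scan is left
    to right, so overlapping candidates are resolved in favour of the
    leftmost pair.
    """
    result = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in atoms:
            result.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # skip both words of the atom
        else:
            result.append(tokens[i])
            i += 1
    return result


if __name__ == "__main__":
    atoms = {("hot", "dog")}
    print(replace_lexical_atoms(["a", "hot", "dog", "stand"], atoms))
    print(replace_lexical_atoms(["hot", "weather"], atoms))
```

After replacement, a query for the single word dog no longer matches a document that only mentions hot dog, which may raise precision but can also cost recall; this trade-off is one reason why careful selection of the replaced phrases matters.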
Indexing structures derived from syntax were tried in [23]. The matching between queries and documents was based on tree structures constructed from clauses. Syntactic ambiguity was also encoded in these tree structures and taken into account by weighting the various syntactic interpretations at retrieval time. The experimental results were disappointing in both precision and recall. As possible reasons for the poor results, the group named the low quality of the language analyzer, the different type of language in documents and queries, and the retrieval strategy applied.
Various attempts have been made to break out of the bag-of-words paradigm. Experiments have shown large variation in retrieval effectiveness, making it difficult to establish which techniques actually work and which do not.
In summary, although considerable effort has been put into linguistically-motivated retrieval schemes, whether or not this is worth the trouble remains unclear. The evidence suggests the need for further investigation and better modeling. In the rest of this study, we will describe a retrieval scheme which demonstrates the application of linguistically-motivated techniques.