Next: 4. Experimental Setup
Up: An Evaluation of Linguistically-motivated
Previous: 2. A Linguistically-motivated Indexing
3. Representational Choices
The different indexing sets we experimented with are summarized below.
The acronyms will be used to refer to these choices
in the rest of the article.
- w
- (words): All word-forms found in the text.
- Sw
- (Stemmed words):
All word-forms stemmed by a Porter stemmer.
This is a traditional indexing scheme and serves as the baseline
against which the effectiveness of the other indexing schemes is compared.
- Lw
- (Lemmatized words):
The same as w, except that all word-forms are lemmatized with respect
to their POS category.
In all the following choices, lemmatization is applied as standard.
For each of w, Sw and Lw,
we eliminate words of low indexing value
by using a POS stop-list (see section 4.5).
- Ln
- (Lemmatized nouns):
Nouns and proper nouns are well known to be important in retrieval.
What happens if we omit all other keywords?
- Lnj
- (Lemmatized nouns and adjectives):
The combined effect of using the union of nouns and adjectives
is investigated in this experiment. These two categories cover
most of the words occurring in noun phrases.
- Lnv
- (Lemmatized nouns and verbs):
We investigate the combined effect of using the union of nouns and verbs.
- Lnjv
- (Lemmatized nouns, adjectives and verbs):
This experiment serves as an indication of what might happen
if we include in the indexing language only
linguistic entities which are extracted from noun or verb phrases.
Moreover, the impact of using adverbs for indexing
can be measured indirectly by comparing Lnjv with Lw,
since the indexing set Lnjv can be constructed from Lw
by removing the adverbs.
- Lap
- (Lemmatized adjacent word-pairs,
extracted from NPs):
These word-pairs consist of the nouns and adjectives of Lnj,
associated to form 2-word phrases by using the adjacency criterion.
The hypothesis for this experiment is that adjacent words
can be considered semantically related because of their proximity
and be taken as one term.
We use an extended notion of adjacency
by accepting non-adjacent words as adjacent if
the in-between words belong to certain POS categories
(e.g. determiner, article, or preposition).
For instance, the phrase pollution of the air gives the adjacent pair
pollution_air.
This is an important experiment because its comparison with
Lbt (described next) should measure the effect of syntactic
normalization on performance.
- Lbt
- (Lemmatized binary terms,
extracted from NPs):
These binary terms consist of the nouns and adjectives of Lnj,
associated to form 2-word phrases by using the term modification criterion,
i.e. head-modifier pairs.
Head-modifier pairs are computationally more expensive than
adjacent pairs, since syntactic normalization is required.
Binary terms, however, are syntactically canonical:
e.g. both phrases air pollution and pollution of the air
are mapped onto the same head-modifier pair, [pollution,air].
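
The POS-based index sets and the extended adjacency criterion of Lap can be illustrated with a minimal sketch. It assumes the input is already lemmatized and POS-tagged; the tag names, the helper names `index_set` and `adjacent_pairs`, and the choice of which categories break adjacency are illustrative assumptions, not the implementation used in these experiments.

```python
# Toy input: (lemma, tag) pairs for one noun phrase. The tag names are
# illustrative assumptions, not the tagset used in the article.
SKIPPABLE = {"DET", "PREP"}   # in-between categories ignored for adjacency
CONTENT = {"NOUN", "ADJ"}     # categories kept in Lnj


def index_set(tagged, keep):
    """Return the lemmas whose POS tag is in `keep` (e.g. Ln, Lnj, Lnjv)."""
    return [lemma for lemma, tag in tagged if tag in keep]


def adjacent_pairs(tagged):
    """Pair consecutive content words, skipping words of SKIPPABLE categories."""
    pairs = []
    previous = None
    for lemma, tag in tagged:
        if tag in CONTENT:
            if previous is not None:
                pairs.append(f"{previous}_{lemma}")
            previous = lemma
        elif tag not in SKIPPABLE:
            previous = None   # any other category breaks adjacency
    return pairs


np1 = [("pollution", "NOUN"), ("of", "PREP"), ("the", "DET"), ("air", "NOUN")]
np2 = [("air", "NOUN"), ("pollution", "NOUN")]

print(index_set(np1, {"NOUN"}))   # ['pollution', 'air']
print(adjacent_pairs(np1))        # ['pollution_air']
print(adjacent_pairs(np2))        # ['air_pollution']
```

Under these assumptions the two phrases yield different adjacent pairs (pollution_air and air_pollution), whereas the head-modifier criterion maps both onto [pollution,air]. That difference is what the Lap versus Lbt comparison is meant to isolate.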
avi (dot) arampatzis (at) gmail