Next: 5. Linguistic Normalization
Up: Linguistically-motivated Information Retrieval
Previous: 3. The Phrase Retrieval
4. Representation of Phrases
A syntactic phrase can be represented in various ways.
At the bottom end of the representation spectrum,
a phrase can be represented simply by the unordered set of its words,
disregarding all structure.
At the other end, all linguistic structure can be taken
into account, resulting in complicated parse-tree representations.
The choice is a trade-off between
syntactic information and ease of phrase extraction.
For example, a simple noun phrase picker could easily be constructed
by looking for sequences of articles, adjectives and nouns within a text.
A noun phrase extracted in this way would contain little information
about how its adjectives and nouns are related to each other,
except that adjacent words are most probably more closely related
than non-adjacent ones.
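The simple picker described above can be sketched in a few lines. This is only an illustration under stated assumptions: the tiny lexicon `TOY_LEXICON`, the tag names and the function `pick_noun_phrases` are hypothetical, and a real system would obtain part-of-speech tags from a tagger rather than a lookup table.

```python
# Toy part-of-speech lexicon (hypothetical; stands in for a real tagger).
TOY_LEXICON = {
    "the": "ART", "a": "ART", "an": "ART",
    "new": "ADJ", "federal": "ADJ",
    "senate": "NOUN", "health": "NOUN", "care": "NOUN",
    "bill": "NOUN", "proposal": "NOUN",
}
NP_TAGS = {"ART", "ADJ", "NOUN"}


def pick_noun_phrases(text):
    """Return maximal runs of articles, adjectives and nouns that contain a noun.

    Tokenisation is plain whitespace splitting; punctuation is not handled.
    """
    phrases, run = [], []
    # The trailing None is a sentinel that flushes the last run.
    for token in text.lower().split() + [None]:
        tag = TOY_LEXICON.get(token) if token is not None else None
        if tag in NP_TAGS:
            run.append((token, tag))
            continue
        # The run ended: keep it only if it holds at least one noun.
        if any(t == "NOUN" for _, t in run):
            phrases.append([word for word, _ in run])
        run = []
    return phrases
```

Note that the result is a flat word list per phrase: nothing in it records which adjective or noun modifies which, which is exactly the limitation discussed above.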
In an unordered set-of-words representation,
and assuming there is no special treatment of proper names,
the noun phrase
"the hillary clinton health care bill proposal"
would contain "bill clinton", although the phrase obviously does not
refer to him.
However, experiments have shown that such co-occurrence of query keywords
within a noun phrase
clearly improves precision
[25].
A sequence-of-words representation does not contain
"bill clinton" (rightly),
but does not contain "clinton proposal" either (wrongly).
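The difference between the two representations can be made concrete with a small sketch. It assumes that "contains" means subset inclusion for the set-of-words representation and a contiguous sub-sequence for the sequence-of-words representation; the function names are hypothetical.

```python
def as_word_set(phrase):
    """Unordered set-of-words representation."""
    return frozenset(phrase.lower().split())


def as_word_sequence(phrase):
    """Ordered sequence-of-words representation."""
    return tuple(phrase.lower().split())


def set_contains(phrase, query):
    """True if every query word occurs somewhere in the phrase."""
    return as_word_set(query) <= as_word_set(phrase)


def sequence_contains(phrase, query):
    """True if the query words occur adjacently and in order in the phrase."""
    words, q = as_word_sequence(phrase), as_word_sequence(query)
    return any(words[i:i + len(q)] == q
               for i in range(len(words) - len(q) + 1))


phrase = "the hillary clinton health care bill proposal"
print(set_contains(phrase, "bill clinton"))           # True  (spurious match)
print(sequence_contains(phrase, "bill clinton"))      # False (rightly)
print(sequence_contains(phrase, "clinton proposal"))  # False (wrongly)
```

Neither test can tell that "clinton" modifies "proposal", which is the kind of information a structured representation is meant to retain.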
Full linguistic parsing would result in a much more precise representation.
However,
the parse-tree would contain too much linguistic detail,
most of which is unnecessary for indexing,
since such detail mostly reflects the syntactic description of the natural
language used rather than the intended meaning.
Since
the goal is to derive adequately precise (for retrieval purposes) meaning
from syntax,
we will settle for less than full linguistic parsing.
Linguistically motivated light parsing has already been shown
to slightly improve retrieval results over
the classic IR approximation to noun phrase recognition
[26].
As a result,
an intermediate representation of noun and verb phrases is desirable,
eliminating structures which can be assumed not to be beneficial to IR:
Definition 3 (Noun Phrase for IR)
A core noun phrase $\mathit{NP}$,
from an IR point of view,
has the general form:
$$\mathit{NP} = \mathit{det}^{*}\ \mathit{pre}^{*}\ \mathit{head}\ \mathit{post}^{*}$$
where
- $\mathit{det}$ (determiner) = article, quantifier, number, etc.
- $\mathit{pre}$ (pre-modifier) = adjective, noun or coordinated phrase.
- $\mathit{head}$ = usually a noun.
- $\mathit{post}$ (post-modifier) = prepositional phrase, relative clause, etc.
- the asterisk ($*$) denotes a list of zero or more elements.
Pre- and post-modifiers may recursively include other NPs.
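As an illustration of Definition 3, the noun phrase form can be held in a small recursive data structure. This is a minimal sketch, not the parser's actual representation: the class `NounPhrase`, its field names (including `head` for the central element that is usually a noun) and the bracketing of the example are assumptions made here.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class NounPhrase:
    """Zero or more determiners and pre-modifiers, one head, zero or more post-modifiers."""
    head: str
    determiners: List[str] = field(default_factory=list)
    # A modifier is either a plain word or, recursively, another noun phrase.
    pre_modifiers: List[Union[str, "NounPhrase"]] = field(default_factory=list)
    post_modifiers: List[Union[str, "NounPhrase"]] = field(default_factory=list)

    def words(self):
        """Flatten the structure back into surface word order."""
        out = list(self.determiners)
        for mod in self.pre_modifiers:
            out.extend(mod.words() if isinstance(mod, NounPhrase) else [mod])
        out.append(self.head)
        for mod in self.post_modifiers:
            out.extend(mod.words() if isinstance(mod, NounPhrase) else [mod])
        return out


# One possible bracketing: the ((health care) bill) proposal
example = NounPhrase(
    head="proposal",
    determiners=["the"],
    pre_modifiers=[
        NounPhrase(head="bill",
                   pre_modifiers=[NounPhrase(head="care", pre_modifiers=["health"])]),
    ],
)
print(example.words())  # ['the', 'health', 'care', 'bill', 'proposal']
```

Unlike a set or a sequence of words, the nesting records which word modifies which, while still omitting most of the detail of a full parse-tree.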
Definition 4 (Verb Phrase for IR)
A verb phrase $\mathit{VP}$,
from an IR point of view,
has the general form:
$$\mathit{VP} = \mathit{subj}\ \mathit{verb}\ \mathit{compl}^{*}$$
where
- $\mathit{subj}$ (subject) = an NP
(in the wide sense, including personal names, personal pronouns etc.)
- $\mathit{verb}$ (verbal clause) = inflected form of some verb, possibly
composed with other auxiliary verb-forms and adverbs.
- $\mathit{compl}$ (complements like object, indirect object,
preposition complement, etc.) = an NP or Prepositional Phrase (PP).
- the asterisk ($*$) denotes a list of zero or more elements,
depending
on the transitivity of the verb (e.g. intransitive verbs have no
complements, transitive verbs have an object, ditransitive verbs
have an object and indirect object).
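Definition 4 can be illustrated in the same spirit. The sketch below is only a toy under stated assumptions: the class `VerbPhrase`, the role labels and the table `EXPECTED_ROLES` are hypothetical, the subject and complements are kept as plain strings for brevity, and the check looks only at objects and indirect objects, ignoring preposition complements.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Object-like complements expected for each transitivity class (toy table).
EXPECTED_ROLES = {
    "intransitive": [],
    "transitive": ["object"],
    "ditransitive": ["object", "indirect object"],
}


@dataclass
class VerbPhrase:
    """A subject, a verbal clause, and zero or more complements."""
    subject: str                      # an NP in the wide sense
    verbal_clause: List[str]          # inflected verb plus auxiliaries and adverbs
    # Each complement is a (role, phrase) pair, e.g. ("object", "the bill").
    complements: List[Tuple[str, str]] = field(default_factory=list)

    def fits(self, transitivity):
        """Check the object-like complements against a transitivity class."""
        roles = sorted(role for role, _ in self.complements
                       if role in ("object", "indirect object"))
        return roles == sorted(EXPECTED_ROLES[transitivity])


vp = VerbPhrase("the senate", ["has", "rejected"], [("object", "the bill")])
print(vp.fits("transitive"))    # True
print(vp.fits("intransitive"))  # False
```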
In accordance with the above definitions, it is possible to perform parsing
that is arguably lighter than full linguistic parsing
while still retaining a reasonable amount of structural information.
An example parse-tree is given in
figure 1;
it is rather compact in comparison with
a full linguistic parse-tree for the same sentence,
which would easily have overrun this page.
Figure 1: Light parsing for IR purposes.
Of course, it is important that the parser is able to
deduce the correct (or at least the most probable)
dependency structure in complicated phrases.
As we will see next, some elements which are considered of little interest
from an IR point of view,
e.g. determiners, prepositions, auxiliaries and adverbs,
may be eliminated.
avi (dot) arampatzis (at) gmail