[Fwd: [lingucomponent-dev] completely new thesaurus supporting
multiple meanings]
Robert Ludvik
robert.ludvik at zd-lj.si
Fri Dec 5 08:29:15 CET 2003
We once talked about the thesaurus. Kevin has rewritten it from scratch;
I am forwarding this mostly as a point of interest.
Regards,
Robert Ludvik
Hi,
FYI:
Because my old thesaurus was a thrown-together hack job that is too
limiting for many languages, I have built a new thesaurus.
I will be committing the completely new thesaurus implementation to
the 680 cws based on m17.
This thesaurus features:
- support for synonyms grouped by multiple meanings
- no more artificial int16 or 65,000 word limits
- based on the WordNet 2.0 data, post-processed into a structured
  text file
- no separate parser needs to be built any more
- the index is also completely text based, built on the fly from the
  thesaurus data with very simple Perl
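To make the "index built on the fly" idea concrete, here is a minimal sketch in Python (not the actual Perl script, which is not shown in this message). It assumes the entry layout from the "contract" snippet in this message: a header line "word|count" followed by that many meaning lines. The index layout (one "word|byte offset" line per entry), the UTF-8 encoding, and the names build_index and format_index are assumptions made for illustration only.

```python
import io
from typing import Dict


def build_index(data: bytes) -> Dict[str, int]:
    """Map each entry word to the byte offset of its header line.

    Assumed layout: a header line "word|count" followed by `count`
    meaning lines. The encoding (UTF-8) is also an assumption.
    """
    index: Dict[str, int] = {}
    stream = io.BytesIO(data)
    while True:
        offset = stream.tell()          # position of the next header line
        header = stream.readline()
        if not header:                  # end of data
            break
        if not header.strip():          # tolerate blank lines between entries
            continue
        text = header.decode("utf-8").rstrip("\r\n")
        word, count = text.rsplit("|", 1)
        index[word] = offset
        for _ in range(int(count)):     # skip this entry's meaning lines
            stream.readline()
    return index


def format_index(index: Dict[str, int]) -> str:
    """Render the index as sorted "word|offset" text lines (hypothetical layout)."""
    return "".join(f"{word}|{offset}\n" for word, offset in sorted(index.items()))
```

With such an index, a lookup only needs to seek to the stored offset and read the header plus its meaning lines, instead of scanning the whole data file.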
Here is a snippet of the thesaurus text file for the entry "contract"
to help illustrate the format:
contract|12
n|written agreement
n|declaration|bid|bidding
n|contract bridge|bridge
v|undertake|promise|assure
v|sign|sign on|sign up|hire|engage|employ
v|compress|constrict|squeeze|compact|press|tighten
v|shrink|decrease|diminish|lessen|fall
v|take|get|sicken|come down
v|shrink|reduce
v|condense|concentrate|change|alter|modify
v|narrow|change
v|abridge|foreshorten|abbreviate|shorten|cut|reduce|decrease|lessen|minify
The delimiter character is the pipe "|".
The first line is the entry word or phrase, followed by the number of
meanings.
Each line after that starts with an abbreviation for the part of speech
(n,v,a,r) that becomes part of the meaning, followed by synonyms,
the first of which is used as the meaning as well.
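As a rough illustration of how the format described above could be read, here is a small Python sketch. It is not the parser used by the new thesaurus; the names Meaning and parse_thesaurus are hypothetical, and the way the part of speech and the first synonym are combined into a label, e.g. "(n) written agreement", is an assumption, since the message only says that both become part of the meaning.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Meaning:
    pos: str              # part-of-speech abbreviation: n, v, a or r
    label: str            # display label (formatting is an assumption)
    synonyms: List[str]   # all synonyms, including the one used as label


def parse_thesaurus(text: str) -> Dict[str, List[Meaning]]:
    """Parse entries of the form "word|count" followed by `count` meaning lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    entries: Dict[str, List[Meaning]] = {}
    i = 0
    while i < len(lines):
        # Header line: entry word or phrase, then the number of meanings.
        word, count = lines[i].rsplit("|", 1)
        n = int(count)
        meanings = []
        for ln in lines[i + 1 : i + 1 + n]:
            pos, *synonyms = ln.split("|")
            first = synonyms[0] if synonyms else ""
            # The first synonym doubles as the meaning, prefixed by the
            # part of speech when one is given.
            label = f"({pos}) {first}" if pos else first
            meanings.append(Meaning(pos, label, synonyms))
        entries[word] = meanings
        i += 1 + n
    return entries


# Shortened sample for illustration only (the real entry has 12 meanings).
SAMPLE = "contract|2\nn|written agreement\nv|shrink|reduce\n"
contract_meanings = parse_thesaurus(SAMPLE)["contract"]
```

Each meaning keeps its full synonym list, so a user interface could first show the labels and then the synonyms for the chosen meaning.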
So if you have or are in the process of building a thesaurus, please
start thinking about converting it to the new format (btw, this format
can quite easily be used to store the old thesaurus information if the
part of speech is not used and only 1 meaning is given, so nothing is
lost).
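A conversion along those lines might look like the following Python sketch. It assumes the old data is available as a plain mapping from each word to one flat list of synonyms, and that "part of speech not used" is written as an empty first field; both are assumptions, as the message does not spell out either detail. The name to_new_format is hypothetical.

```python
from typing import Dict, List


def to_new_format(old_entries: Dict[str, List[str]]) -> str:
    """Write flat synonym lists as new-format entries with a single meaning.

    Assumption: an unused part of speech is stored as an empty field,
    so each meaning line starts with the delimiter.
    """
    lines = []
    for word in sorted(old_entries):
        lines.append(f"{word}|1")                       # exactly one meaning
        lines.append("|".join([""] + old_entries[word]))  # empty part of speech
    return "".join(line + "\n" for line in lines)
```

For example, an old entry "big" with the synonyms "large" and "huge" would become the two lines "big|1" and "|large|huge".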
Comments and questions welcome.
Hope this helps,
Kevin