[LUGOS-SLO] Thesaurus

Tue Aug 5 11:29:34 CEST 2003

Hello,
Bernard Herman, who leads the OKO project, is currently on vacation.
We could probably present this to him and have a word about it. Or to someone else?

 > Threats: inability to obtain copyright clearance for open-source
 > distribution of the database;
Perhaps this could be solved by having all users who add entries to
the database agree to it (that it may be released as open source)?

LUGOS could contact the authors of the existing ones and ask them for
'words' (if SDJT or MID or someone with a bigger name asked them, or
all of them together, it might be even better :-). That way the
database could be filled to start with.

 > could in turn offer a display of examples (concordances). All of this
 > of course involves any amount of work, enough even for a diploma
 > thesis or a doctorate.
It would be good to tell this to someone who is more into this scene :-)
A computer scientist/linguist can find a pile of ideas at OOo. One of
them is "Grammar checking":
One of the goals of the Lingucomponent project is to design, 
develop, and implement a Grammar checker for English and other 
supported Languages.
Summary
This is a "Wish" project. I do not intend to undertake it unless 
significant interest and developers decide to help out. If you 
have any interest in helping to design, develop, and implement a 
Grammar Checker for the OpenOffice.Org project, please send an 
e-mail to dev at lingucomponent.openoffice.org identifying yourself, 
your skills, your willingness to lead this project, etc.

And more at http://lingucomponent.openoffice.org/

Regards
--
Robert Ludvik

PS
The latest discussion on dev at lingucomponent.openoffice.org is about
the current limit of 32000 entries in the database. Kevin is a
co-author of the thesaurus for OOo "(Sander and I just threw some ideas
around over a weekend and I simply coded something up.)", but there are
also new proposals and people who will work on this.

*******************************************************************
The en_US thesaurus only needed under 32000 unique entries, but the binary
format chosen holds offsets into the table as unsigned shorts, which can
hold up to 64000 entries.
The current en_US thesaurus code was and still is a hack I wrote to get a
thesaurus in place in time for OOo 1.0.  It was never meant to be a model
for other languages to use.
If I had to do it all over again I would redesign it with other languages in
mind, support for affixes, support for multiple meanings, etc.
Unfortunately, I did not (Sander and I just threw some ideas around over a
weekend and I simply coded something up).
I was kind of hoping that someone else would come along and design a much
better international thesaurus.  Now the French, German, Italian, and Czech
seem to have thesauri (or one in the works) and those languages probably feel
quite constrained by the layout and design I did originally.
The key here is that if someone else would like to propose a better design and
layout for a thesaurus I would be happy to create a new thesaurus component
that would interface to that design.  I just haven't had time to
develop this properly and hoped that someone else would pick up the pieces
and move forward.
I just got back from vacation and I will check that OOo 1.1 rc3 really does
have the changes in place to support 64535 entries so that others can use it
as well.
Hope this helps,
Kevin

*******************************************************************
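
An aside on the limit discussed in the message above: the following is only a minimal sketch, assuming (hypothetically) that each table offset is stored as one unsigned 16-bit field. It is not the actual OOo thesaurus file format or code, and the helper names are made up for illustration. It shows that such a field accepts values from 0 to 65535 and nothing larger; a signed 16-bit field would stop at 32767.

```python
import struct

# Largest value an unsigned 16-bit field ("unsigned short") can hold.
MAX_UNSIGNED_SHORT = 2**16 - 1  # 65535


def pack_offset(offset: int) -> bytes:
    """Hypothetical helper: store a table offset in two bytes.

    "<H" means little-endian unsigned short; the byte order is an
    assumption made only for this sketch.
    """
    return struct.pack("<H", offset)


def fits_in_unsigned_short(offset: int) -> bool:
    """Return True if the offset can be stored in an unsigned 16-bit field."""
    try:
        pack_offset(offset)
        return True
    except struct.error:
        # struct refuses values outside 0..65535 for the "H" format.
        return False


if __name__ == "__main__":
    for candidate in (32000, MAX_UNSIGNED_SHORT, MAX_UNSIGNED_SHORT + 1):
        print(candidate, fits_in_unsigned_short(candidate))
```

A format meant for larger word lists would need wider offset fields, for example 32-bit ones ("<I" in the struct module).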

The OpenThesaurus website (German thesaurus) indeed has some features that
are lost in the export for OOo (e.g. multiple meanings). But it's based on
a simple relational data model, which should be easy to take over to OOo
if there's a relational data lookup stuff possible in OOo, but a simple
Berkeley DB key/value pair lookup would also work. Currently the data
lookup code for the thesaurus is hand-written AFAIK, which makes it a bit
complicated.
I will some day add other features to OpenThesaurus, like query
normalization. People can then search for German "ging" and find the base
form, "gehen" (in English that would be: search for "walked" and find the
synset for "walk", because "walked" isn't known). Once these things are
added, OpenThesaurus can be a model for a new OOo thesaurus. It's easy to
understand, you only need to look at the database structure.
Regards
  Daniel
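
The key/value lookup with query normalization described in the message above could look roughly like this. It is a minimal sketch only: plain Python dictionaries stand in for a Berkeley DB style store, the word lists are made-up illustrative data (not taken from OpenThesaurus), and the names BASE_FORMS, SYNSETS and lookup are hypothetical.

```python
# Illustrative data only; a real thesaurus would load this from its database.
# Maps an inflected form to its base form (query normalization).
BASE_FORMS = {
    "ging": "gehen",
    "walked": "walk",
}

# Maps a base form to a list of synsets; several synsets per word
# would represent multiple meanings.
SYNSETS = {
    "gehen": [["gehen", "laufen"]],
    "walk": [["walk", "stroll"]],
}


def lookup(word):
    """Return (key actually used, list of synsets) for a query word.

    If the word itself is unknown, fall back to its base form,
    as in the "walked" -> "walk" example.
    """
    key = word.lower()
    if key not in SYNSETS:
        key = BASE_FORMS.get(key, key)
    return key, SYNSETS.get(key, [])


if __name__ == "__main__":
    for query in ("walk", "walked", "ging", "unknownword"):
        print(query, "->", lookup(query))
```

Because both tables are simple key/value mappings, the same logic would work unchanged on top of any key/value store.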

 > Project to build an open-source Slovenian thesaurus
 >
 > Connections: Lugos, SDJT, OKO
 > Projects: MID, MSZS, EU
 >
 > Approach: localization of the German OpenThesaurus
 > http://thesaurus.kdenews.org/
 >
 > They say:
 >
 > Gibt es dieses Projekt auch für anderen Sprachen?
 >
 > Es ist geplant, das gleiche auch mit anderen Sprachen zu machen,
 > sofern für diese ebenfalls noch kein freier Thesaurus zur Verfügung
 > steht und sofern sich Muttersprachler der jeweiligen Sprachen finden,
 > die sich als Administrator intensiv um ihren Bereich kümmern.
 >
 > or, according to Google:
 >
 > It is planned to make the same also with other languages if for this
 > likewise still no free thesaurus is available and if to native speaker
 > of the respective languages are, which worry intensively as
 > administrator about their range.
 >
 > Strengths of the concept: corrections and additions can be entered in
 > a distributed way; a good-quality web platform (php) is already in place
 >
 > Weaknesses: an initial dictionary has to be written that is attractive
 > enough for people to visit at all.
 >
 > Opportunities: automatic filling of the (initial) database from the
 > Amebis thesaurus, SSKJ, other dictionaries, or a corpus.
 >
 > Threats: inability to obtain copyright clearance for open-source
 > distribution of the database; and the complexity of detecting synonyms
 > (..nyms) in mono-/bilingual dictionaries, let alone in corpora.
 >

 >
 > Tasks:
 > 1. transfer and localization of the German OT
 > 2. acquisition of sources, conversion and filling
 > 3. acquisition of users (first mailing lists, then newspapers)
 > (4. building new modules: concordances, lemmatization, mining)
 >
 > Sources / Acquisition:
 >
 > 1. Amebis general thesaurus - perhaps they would release it (part of
 >   it?) as open source for a fee; an additional, grass-roots option is
 >   that they initially put their whole thesaurus into the search
 >   engine/editor (but not into distributions) - but every entry that
 >   some user corrects (in the course of the project?) becomes open source.
 >
 > 2. Other dictionaries - these are probably terminological thesauri and
 >   dictionaries of individual fields (links at
 >   http://nl.ijs.si/sdjt/sdjt-www.html#lex),
 >   perhaps the Geodetski tezaver (geodetic thesaurus), EvroTerm, Slovar
 >   Informatike, the Pametnjakovic (Smart :) database??
 >   There are more specialized thesauri on the www than one would think:
 >   http://www.mszs.si/eurydice/term/tez1poj.htm
 >   Copyright would have to be cleared for each one separately.
 >
 > 3. SSKJ - it is problematic to obtain a signed document permitting use
 >   for the purposes of the project; the copyright is shared by ZRC SAZU
 >   and the authors. An alternative, anarchistic variant is not to ask
 >   for consent but to send a notice to ZRC that the project intends to
 >   use SSKJ to build a thesaurus.
 >   It is in fact debatable whether building an (open-source) thesaurus
 >   from SSKJ constitutes copyright infringement at all - only extremely
 >   small parts of the dictionary text would be 'copied', and often they
 >   are not even contiguous (buljiti .... gledati, i.e. "to stare" ....
 >   "to look").
 >   Legally speaking, the project has fair use on its side; against it,
 >   perhaps, the Millennium Act...
 >   Well, the whole thing is, as they say, "a can of worms"...
 >   Within SDJT it was said that at least the current state of the
 >   electronic SSKJ database would be documented; well, nobody has done
 >   it yet.
 >
 > 4. Corpora - go hunting for Slovenian texts on the Web? Query words
 >   on najdi.si? FIDA - copyright.
 >
 >
 > Sources / Use:
 >
 > There are two goals: to include the existing sources in
 > OO::XThesaurus and in OpenThesaurus, respectively;
 > for "openoffice thesaurus" google returns
 >     com::sun::star::linguistic2::XThesaurus
 >     Description
 >         allows for the retrieval of possible meanings for a given
 >         word and language.
 >
 > which is very general - are these "meanings" definitions, synonyms,
 > hypernyms, hyponyms and other -nyms? Perhaps even translations? All of
 > this? It probably depends a lot on which source we would have for
 > filling and how much work would be put into extraction - I don't know
 > what the Amebis thesaurus contains; the Geodetski tezaver could give
 > hypernyms and hyponyms (but, almost by definition, no synonyms);
 > EvroTerm perhaps synonyms, if we followed the translations of a given
 > word; SSKJ again hypernyms and hyponyms, or else definitions? A corpus,
 > if it were included, could in turn offer a display of examples
 > (concordances). All of this of course involves any amount of work,
 > enough even for a diploma thesis or a doctorate.
 >
 >
 >
 > ...
 >
 > Tomaz Erjavec wrote:
 >
 >> Hello,
 >> I agree with Ales, but I am also quite enthusiastic about OpenThesaurus.
 >> At http://nl.ijs.si/et/project/ootezaver.txt I have written down some
 >> starting points (read: blather); now we only need to find someone
 >> whose mother tongue is Slovenian and who at the same time "which worry
 >> intensively as administrator about their range"!
 >> Well, seriously, comments welcome...
 >> Regards,
 >> Tomaz
 >>
 >> Ales Kosir writes:
 >>  > A serious and usable general thesaurus takes a really large amount
 >>  > of work (measured in person-years) if we start it from scratch. So
 >>  > we need to think about other options, so that we do not start from
 >>  > nothing.
 >>  >
 >>  > Best regards,
 >>  > Ales
 >>  >