Question on Thesaurus

Elanjelian_Venugopal · September 22, 2011, 8:53am

Dear all:

I wonder if any of you have experience creating a thesaurus for your
language ... particularly in UTF-8.

This is my problem. I'm working to create a Tamil thesaurus -- to go along
with the Tamil Spell-checker (already created). I've followed the guidelines
found in http://lingucomponent.openoffice.org/thesaurus.html and have
created test .dat and .idx files required for the purpose. (See attached.)
However, despite trying many times, I'm not able to get the output to work in
LibreOffice. I don't get the anticipated synonyms. It says (None).

Could anyone here help me?

Many thanks.
-e.

timar · September 22, 2011, 9:00am

Hi,

This mailing list does not allow attachments. Have you looked at other
thesauri that work? Does yours look the same?

Best regards,
Andras

Elanjelian_Venugopal · September 22, 2011, 9:13am

Hi Andras,

Most other thesauri are created in ISO8859-1 encoding, except Hungarian,
which uses UTF-8. The dat and idx files I created look pretty much like
the Hun files to me. So, I'm not sure why it's not working... -e.

timar · September 22, 2011, 9:18am

Did you register it with LibreOffice? I mean via dictionaries.xcu. Do
you see your thesaurus in Tools - Options - Language Settings -
Writing Aids?

Elanjelian_Venugopal · September 22, 2011, 9:31am

Yes. I see OpenOffice.org New Thesaurus, below Hunspell SpellChecker. Both
ticked. -e.

Elanjelian_Venugopal · September 22, 2011, 10:46am

Andras,

I found one oddity, though. In Hungarian, which uses UTF-8, the byte offset
into the first data is 6. Which makes sense. For English, which uses
ISO8859-1, the byte offset is 10. Again fine. However, in the file I
generated, the byte offset is 9. I wonder if there is something there...

-e.
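
For anyone checking the same thing: below is a minimal Python sketch that recomputes the byte offsets for a .dat file, so they can be compared with what the .idx file says. It assumes the usual layout (first line is the encoding name, then each entry is a "word|count" line followed by count meaning lines) and that offsets are counted in bytes from the start of the file. The function name index_entries and the sample data are made up for illustration. It also flags a UTF-8 byte-order mark, since 3 BOM bytes plus "UTF-8" and a newline would add up to 9; that is only one possible explanation for the offset above and cannot be confirmed without the actual files.

```python
import codecs


def index_entries(dat: bytes):
    """Return (encoding, has_bom, [(word, byte_offset), ...]) for a .dat file.

    Assumed layout: line 1 is the encoding name; each entry is a
    "word|count" line followed by `count` lines of meanings.
    """
    has_bom = dat.startswith(codecs.BOM_UTF8)
    first_nl = dat.index(b"\n")
    header = dat[:first_nl]
    if has_bom:
        header = header[len(codecs.BOM_UTF8):]
    encoding = header.decode("ascii").strip()

    entries = []
    pos = first_nl + 1  # the first entry starts right after the header line
    while pos < len(dat):
        end = dat.find(b"\n", pos)
        if end == -1:
            end = len(dat)
        line = dat[pos:end]
        if not line.strip():  # tolerate stray blank lines
            pos = end + 1
            continue
        word, count = line.rsplit(b"|", 1)
        entries.append((word.decode(encoding), pos))
        pos = end + 1
        # Skip the meaning lines that belong to this entry.
        for _ in range(int(count)):
            nxt = dat.find(b"\n", pos)
            pos = len(dat) if nxt == -1 else nxt + 1
    return encoding, has_bom, entries


# Made-up sample data, not the real thesaurus files.
utf8_dat = "UTF-8\nsimple|1\n(adj)|elementary\nhouse|1\n(noun)|home\n".encode("utf-8")
bom_dat = codecs.BOM_UTF8 + utf8_dat
latin_dat = b"ISO8859-1\nsimple|1\n(adj)|elementary\n"

for name, data in [("utf8", utf8_dat), ("utf8+bom", bom_dat), ("latin1", latin_dat)]:
    enc, bom, found = index_entries(data)
    print(name, enc, "BOM" if bom else "no BOM", found[0])
```

On this sample data the first offsets come out as 6, 9 and 10 respectively. If the real file turns out to start with a BOM, saving it as UTF-8 without a BOM and regenerating the .idx might be worth a try.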