errors in dictionaries/ ...

Hi guys,

  From the dev list: Steve has found a load of malformed entries in the
dictionaries/ module while rewriting the indexer. I was wondering if
these errors were known and/or who maintains that stuff :-)

  We were wondering whether we should do some sanity checking as we index
the dictionaries and reject any entries with errors?

  Thoughts appreciated,

  Thanks,

    Michael.

PS. Steve - the l10n list almost certainly does reply-to mangling, so you
will get no replies directly ;-> Could people please CC Steve (or the dev
list) manually.

Hi Steve, *,

It would be nice if you could provide us with an error log.
BTW only the following locales may be involved (the others do not have
a thesaurus):
ca, cs, da, de, en, fr, hu, it, ne, no, pl, ro, ru, sk, sl, uk

Many thanks,
Andras

Hi,

It would be nice if you could provide us with an error log.
BTW only the following locales may be involved (the others do not have
a thesaurus):
ca, cs, da, de, en, fr, hu, it, ne, no, pl, ro, ru, sk, sl, uk

I guess (I might be wrong) that de, pl, sk and sl (and maybe some other
languages) use our openthesaurus sites to generate the thesaurus
dictionaries, so any errors in these dictionaries, if there are any,
should be common to all of them.

At least for Slovenian I am using my www.tezaver.si project, based on the
(old php version of) openthesaurus.de.

Lp, m.

Hi all,

Rather than try to list all the issues here I thought it might help if
I provided a script that tries to find errors in the files.

I have made some assumptions about the file format to look for the
common errors I found:
1. A line that starts with one or more characters followed by a |, then
only digits to EOL, is a word definition.
2. A line that starts with either ( or - is a synonym definition.
This may not be a valid assumption, as I've seen lines starting with
interj that were definitely synonym definitions. I am not sure what
interj means in th_ro_RO_v2.dat, so I have special-cased interj and
prep to also count as synonym lines.
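
To make the two assumptions above concrete, here is a minimal Python
sketch. It is an illustration only, not the Perl script itself: the name
classify_line and the regular expressions are hypothetical, written just
from the rules listed above. The word test is tried first, so that a
word such as "prepare|3" is not mistaken for a "prep" synonym line.

```python
import re

# Assumption 1: one or more characters, a "|", then only digits to EOL.
WORD_RE = re.compile(r"(.+)\|(\d+)")
# Assumption 2: starts with "(" or "-", plus the special-cased
# "interj" and "prep" prefixes.
SYNONYM_RE = re.compile(r"(?:\(|-|interj|prep)")


def classify_line(line):
    """Classify one line of a th_*.dat file.

    Returns ("word", word, expected_count), ("synonym",) or ("other",).
    """
    line = line.rstrip("\r\n")
    m = WORD_RE.fullmatch(line)
    if m:
        return ("word", m.group(1), int(m.group(2)))
    if SYNONYM_RE.match(line):  # match() anchors at the start of the line
        return ("synonym",)
    return ("other",)
```

Anything that is neither kind of line (for example the encoding name on
the first line of a file) comes back as "other".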

With these assumptions the script compares the expected number of
synonyms with the actual number of synonyms and complains if they
don't match (with word and line numbers displayed for the definition).

It will also complain if it finds the same word more than once and
will print out both lines on which the suspect word was found.
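
The two checks described above could look roughly like this in Python.
This is a sketch under the stated assumptions rather than the actual
Perl script; check_thesaurus, the regular expressions and the message
wording are made up for illustration.

```python
import re

WORD_RE = re.compile(r"(.+)\|(\d+)")              # word definition line
SYNONYM_RE = re.compile(r"(?:\(|-|interj|prep)")  # synonym line prefixes


def check_thesaurus(lines):
    """Return a list of complaints for an iterable of thesaurus lines."""
    complaints = []
    first_seen = {}   # word -> line number of its first definition
    current = None    # (word, line number, expected count) of open entry
    actual = 0        # synonym lines seen since the current word line

    def check_count(entry, found):
        # Complain if the entry promised a different number of lines.
        if entry is not None and found != entry[2]:
            complaints.append(
                f"line {entry[1]}: '{entry[0]}' expects {entry[2]} "
                f"synonym line(s), found {found}")

    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        m = WORD_RE.fullmatch(line)
        if m:
            check_count(current, actual)  # close the previous entry
            word, expected = m.group(1), int(m.group(2))
            if word in first_seen:
                complaints.append(
                    f"lines {first_seen[word]} and {lineno}: "
                    f"'{word}' is defined more than once")
            else:
                first_seen[word] = lineno
            current, actual = (word, lineno, expected), 0
        elif SYNONYM_RE.match(line):
            actual += 1
    check_count(current, actual)          # close the last entry
    return complaints
```

It could be run over a file with something like
check_thesaurus(open(path, encoding=enc)), where enc has to match the
encoding the dictionary declares.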

I hope this helps - the script finds no issues in a number of
dictionaries, but it output this many informational lines for the
following dictionaries in my libreoffice build tree:
    138 th_ca_ES_v3.dat
   1092 th_de_AT_v2.dat
   1101 th_de_CH_v2.dat
   1092 th_de_DE_v2.dat
      2 th_hu_HU_v2.dat
      6 th_ne_NP_v2.dat
   2582 th_ro_RO_v2.dat
      8 th_ru_RU_v2.dat
     15 th_sk_SK_v2.dat

I hope this helps. The perl script is LGPL/MPL.

Regards
Steven Butler

Sorry for asking,

I hope this helps - the script finds no issues in a number of
dictionaries, but it output this many informational lines for the
following dictionaries in my libreoffice build tree:
   138 th_ca_ES_v3.dat
  1092 th_de_AT_v2.dat
  1101 th_de_CH_v2.dat
  1092 th_de_DE_v2.dat
     2 th_hu_HU_v2.dat
     6 th_ne_NP_v2.dat
  2582 th_ro_RO_v2.dat
     8 th_ru_RU_v2.dat
    15 th_sk_SK_v2.dat

just to be clear, also for my www.tezaver.si project - you found no
errors in the Slovenian thesaurus, right?

Thanks,
m.

Hi Martin,

Sorry for asking,

That's okay.

just to be clear, also for my www.tezaver.si project - you found no
errors in the Slovenian thesaurus, right?

I don't know which thesaurus is Slovenian, but I assume it would have
SL in the name? You'll have to excuse my ignorance as I am only
familiar with the English language.

I checked th_sl_SI_v2.dat with my script and found none of the errors
I noticed (but obviously I can't guarantee I can find all possible
errors :-) )

I hope that helps.

Regards
Steven Butler

Hi,

I don't know which thesaurus is Slovenian, but I assume it would have
SL in the name? You'll have to excuse my ignorance as I am only
familiar with the English language.

Yes, sl-SI is for Slovenian.

I checked th_sl_SI_v2.dat with my script and found none of the errors
I noticed (but obviously I can't guarantee I can find all possible
errors :-) )

Great.

Lp, m.

This list does not allow attachments. Could you please commit your script to
git?

Many thanks,
Andras