English Dictionaries Project - Introduction by Marco Pinto

Hello!

I am Marco from Portugal and I have been involved in several projects in
the past.

I translated Pretty Good Privacy 2.6.3i to pt_PT back in 1997,
which was my first translation.

I have also translated Gpg4win, OpenSlides, websites and other documentation.

Lately I have dedicated most of my free time to my PhD project/thesis,
to the British Dictionary, to my tool Proofing Tool GUI, and to LanguageTool.

Around three years ago I had the idea of creating a place to store the
most up-to-date English dictionaries, so that I could create OXTs for AOO
and LO and so that people could download the files from there for other
projects.

I have been trying to find the original authors and get their most
recent files, but most of them have been gone for a long time; hence my
fork of en_GB.

I created a GitHub repository, and there are already people/companies
that download the files from there using scripts.
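
Such a script can be just a few lines of Python. Here is a rough sketch;
the base URL and file paths are placeholders rather than the actual
layout of my repository, so adjust them to wherever the .dic/.aff files
really live:

# Hypothetical fetcher for the en_GB dictionary files; the repository
# layout and file paths are illustrative assumptions.
import urllib.request

BASE = "https://raw.githubusercontent.com/marcoagpinto/aoo-mozilla-en-dict/master"

for name in ("en_GB.dic", "en_GB.aff"):
    url = BASE + "/" + name  # adjust to the real path inside the repository
    with urllib.request.urlopen(url) as response, open(name, "wb") as out:
        out.write(response.read())  # save the raw dictionary file locally
    print("fetched", name)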

When I created my tool, Proofing Tool GUI, the first goal was to create
a thesaurus editor and the second to create a dictionary editor.

Both have been accomplished.

I offered to improve the thesaurus of the pt_PT language for
Minho University, but they never said anything, so the project was halted
(I will resume it sometime in the future).

But other people are developing thesauri using Proofing Tool GUI.

I have also forked and been improving en_GB; so far I have added around
21K words, and I am releasing it as extensions for Mozilla, OpenOffice
and LibreOffice:

*Mozilla (British):*
https://addons.mozilla.org/en-US/firefox/addon/british-english-dictionary-2

*OpenOffice:*
http://extensions.openoffice.org/en/project/english-dictionaries-apache-openoffice

*LibreOffice:*
http://extensions.libreoffice.org/extension-center/english-dictionaries

Notice that for AOO and LO I am releasing OXTs with several English
dictionaries.

My forked en_GB is the best British dictionary around, and I have even
bought a Gold Account at Oxford Dictionaries in order to test words
so that I may add them if needed. I first paste texts into Thunderbird
and, if words appear as typos, I check them in Oxford and add them if they
are valid words. Since the en_GB that came with AOO was obfuscated, I
grabbed the one from Mozilla around three years ago.

I do monthly releases of en_GB.

This is what I have been doing for nearly three years.

Other people and I realised that the bundled dictionaries were totally
outdated. I had some difficulties setting up the technical stuff
(Git, Gerrit, etc.), and I was told in a discussion on IRC that a push is
very likely not going to be merged because of some licence problems and such.

Thanks!

Kind regards,

Appreciate your good work, Marco, many thanks!

Michael

Marco A.G.Pinto wrote the following on 01/05/2016 at 15:25:

Hi *,

Hello!
</snip>
This is what I have been doing for nearly three years.

Other people and I realised that the bundled dictionaries were totally
outdated. I had some difficulties setting up the technical stuff
(Git, Gerrit, etc.), and I was told in a discussion on IRC that a push is
very likely not going to be merged because of some licence problems and such.
And there is more hiding in this paragraph than some might imagine. Marco
is doing a great job, but because the Hunspell features are underestimated
and Git was used "wrongly", he is not able to update the bundled
dictionary of LibreOffice (though the bundled one in AOO was updated).

My proposal for the new project would solve many of the problems Marco
has. (If he had used the project from the beginning, I'm sure we could
have avoided the current situation with the en_GB dictionary.)

I have started a wiki page for tracking proposals, ideas, pros and cons
together on one page.
https://wiki.documentfoundation.org/User:Dennisroczek/CDP

Please feel free to improve the page!

Thanks!

Kind regards,

Dennis Roczek

On 03.05.2016 at 13:29, Dennis Roczek wrote:
<snip>
I have started a wiki page for tracking proposals, ideas, pros and cons
together on one page.
https://wiki.documentfoundation.org/User:Dennisroczek/CDP

Please feel free to improve the page!
</snip>

I've read that wiki page, and as somebody who has just started to maintain a dictionary, I have to say it all sounds so good to me that it borders on fantastic (science-fiction fantastic).

First:

no doubly maintained word lists by multiple maintainers (not knowing each other)

will not and cannot be resolved.

Whose dictionary do you include in that single repository? How do you merge them? How do you resolve different concepts (what to include, what to exclude)? How do you merge affix files with different affix classes (that will be a mess)? Why do you think the included dictionary is 'standard' and better than the other one? How do you introduce those guys who don't know each other? Will the other guy give up his work? Who will hunt down all those 'other' guys, telling them 'Yo, dude, leave that, do this shit!'?

How will such a repository resolve competition between two English dictionaries?

You can only make one of them LO's default and hope it will get maintained regularly and well (and that the other guy will help).

But again, you can decide which one to include and which one to care about. What you will get is other people contributing to a (new) default LO dictionary, and that's where it might end.

Nobody can (or should) just declare 'we are building a dictionary repository - here, use this, not that' just because they are in a position of power to do so.

You can leave everything as it is, but it should be communicated differently. It's blurry like this, at least to me.

It sounds like the equation goes 'LO dictionary maintainer = hunspell dictionary maintainer', but that does not evaluate to true (it's so obvious that I'm risking being called a fool here).

Maybe I misunderstood something (my English is bad).

The Croatian dictionary hasn't been updated since it was released in 2001; that's fifteen years. I'm desperate for somebody to help me, but the whole concept feels a little bit problematic. It's the concept that's bugging me...

What should be done is a central repository for the _LO_ dictionary, and hope that _that instance_ -- that particular dictionary -- will become widely used (Firefox etc.).

Next:

use of hunspell features correctly (not simple word lists, but by logic)

What does this mean? OK, every dictionary should be built using all the features, but dictionaries that are just word lists will continue to be just word lists, because it is purgatory (or hell) to do this right if you were not doing it right from the beginning.

If people start randomly adding affix classes, nobody will be able to maintain those dictionaries pretty soon. This cannot be done easily if the maintainer doesn't have access to some other dictionary he can use to automate the job. It's a pain, and I often regret that I took it on myself. There is no guy in the world who can add a word with affixes to that dictionary without me, because it would take him three days to study it (I need to write some kind of manual for that as soon as possible).
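
To show what doing it 'by logic' means in practice, here is a toy example; everything in it is made up for illustration, and it assumes the pyhunspell bindings (pip install hunspell) just to test the result. One stem in the .dic file carries flags that the .aff file expands into several accepted forms:

# Toy .aff/.dic pair: one stem plus affix classes instead of four
# word-list entries. All names here are illustrative assumptions.
import hunspell  # pyhunspell bindings, assumed installed

AFF = """SET UTF-8
SFX D Y 1
SFX D 0 ed .
SFX G Y 1
SFX G 0 ing .
SFX S Y 1
SFX S 0 s .
"""

DIC = """1
walk/DGS
"""

with open("sample.aff", "w", encoding="utf-8") as f:
    f.write(AFF)
with open("sample.dic", "w", encoding="utf-8") as f:
    f.write(DIC)

h = hunspell.HunSpell("sample.dic", "sample.aff")
for word in ("walk", "walks", "walked", "walking", "walkings"):
    print(word, h.spell(word))  # first four True, "walkings" False

The point: "walked", "walking" and "walks" are never listed, yet they are accepted, while an invalid combination like "walkings" is not. A plain word list has to spell all of this out by hand.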

My point: it should be a repository for maintaining hunspell dictionaries and building extensions for other projects - that's fine, but don't expect it to be as lively as Pootle and translations - it's just not going to happen, that's not realistic.

Maintaining a dictionary is devil's work and only the doomed are willing to do it; let others participate, but only a few will care about the technical part, even without Git or Gerrit. Others will just want to add a word or two, report a bad suggestion and such. Generally this repository will make the job easier only for those who are already doing this, and that's where it might end (but it's a lot).

(This is not criticism of the wiki page; I wanted to post this three days ago...)

Did I get it all wrong?

Hit me hard,
Kruno

Kruno wrote the following on 03/05/2016 at 16:51:

will not and cannot be resolved.

Whose dictionary do you include in that single repository? How do you merge them? How do you resolve different concepts (what to include, what to exclude)? How do you merge affix files with different affix classes (that will be a mess)? Why do you think the included dictionary is 'standard' and better than the other one? How do you introduce those guys who don't know each other? Will the other guy give up his work? Who will hunt down all those 'other' guys, telling them 'Yo, dude, leave that, do this shit!'?

How will such a repository resolve competition between two English dictionaries?

I think this is less of a problem than it might seem at first glance. To begin with, there are not that many dictionaries which have active competing teams. Even en-GB (where you might expect a flurry of competition) only has a single maintainer.

First, I think, you gather in all those projects willing to participate, then you rescue the dead ones, and then you try and work out arrangements with competing dictionaries. I don't see a reason why such a resource could not host multiple dictionaries for the same locale, even if it somehow selects one for default inclusion. Many locales have pre- and post-spelling-reform variants, so you have to allow for multiples anyway; if you had 3 competing en-US dictionaries, you would just label them differently if the differences cannot be reconciled.

My point: it should be a repository for maintaining hunspell dictionaries and building extensions for other projects - that's fine, but don't expect it to be as lively as Pootle and translations - it's just not going to happen, that's not realistic.

It won't, and that's a good thing; here, too many cooks would certainly spoil the broth.

Did I get it all wrong?

Hit me hard,

No, I think I had thoughts similar to yours going through my head, but I think the concept still has legs if we pull together.

Michael

On 03.05.2016 at 17:58, Michael Bauer wrote:
<snip>
No, I think I had thoughts similar to yours going through my head, but I think the concept still has legs if we pull together.
</snip>

I see... (and we shall see).

It has always seemed logical to me for the Hunspell project to host 'official' dictionaries. Since it lacks this feature -- here is a chance for LO to play this card right.

No doubt, it could become huge.

Thanks for clarifications,
Kruno

There are many things about Hunspell that make me shake my head in disbelief :wink: All the more so given how widely used it is. Like the lack of documentation for the outside world. I mean, a user-friendly "Here is how you make a spellchecker for your language" page and an "All about affixes" page. Their GitHub space aside, there doesn't seem to be much. Not at a casual websearch anyway.

The old FOSS caltrop? Great idea, machinery and people, shaky end-user strategy?

M

Kruno wrote the following on 03/05/2016 at 17:26:

Michael and people,

I give a brief explanation of how .AFF files work in Proofing Tool GUI's
manual:
http://marcoagpinto.cidadevirtual.pt/proofingtoolgui_files/ProofingToolGUI_manual_V30.html

But it is an unfinished manual, since I haven't had much free time to
work on everything.

Kind regards,

no doubly maintained word lists by multiple maintainers (not knowing each other)

will not and cannot be resolved.

With a central repository for working on dictionaries, it is far easier for two individuals interested in the same dictionary to find each other than if they are working on two different sites, in different locations.

Whose dictionary do you include in that single repository, how do you merge

As a practical matter, a repository that only allows for one dictionary per language is not viable. At a minimum, you'll have specialized dictionaries.

How do you merge affix files with different affix classes (that will be a mess)?

I've seen some tools for automating the creation of affix files.
I don't know how well they work, though.

This goes back to my claim that spell checking without built-in grammar checking is useless.

Why do you think the included dictionary is 'standard' and better than the other one?

Any dictionary project has to include the ability to have the same language in at least two different writing systems --- Braille (^1) and the standard writing system for the language.

Will the other guy give up his work?

The proposal does not require the other guy to give up his project.

I wouldn't be surprised to see the other guy create a more specialized dictionary.

* John Doe creates a general purpose dictionary;
* Jane Doe creates a name and places dictionary;
* John Roe creates a scientific terminology dictionary;
* Jane Roe creates a basic words dictionary;

Who will hunt down all those 'other' guys, telling them 'Yo, dude, leave that, do this shit!'?

As far as existing spell checking and wordlist projects go, nobody is going to tell them to "leave that, do this". What might happen is that known, existing projects are offered space, etc. in the proposed repository/incubator, but they stay where they currently are, due to how their workflow operates.

How will such a repository resolve competition between two English dictionaries?

Since you specifically mentioned English, there currently are versions of English for a dozen locales, plus around half a dozen specialist dictionaries.

Most users won't choose the English (OED) variant, because it has too many words in it. Too many words means that words that are wrongly used get flagged as correctly spelled. The "Eye right withe aye pin" phenomenon.

Nobody can (or should) just declare 'we are building a dictionary repository - here, use this, not that' just because they are in a position of power to do so.

The proposal does not mandate that only the proposed space/workflow/etc. be used. In an ideal world, existing groups would be able to drop their work-product into the repository with only one change to their workflow --- a bot that automatically uploads their new, verified, approved work product into the repository. Furthermore, this change would occur if, and only if, the existing group wanted it to.

This proposal is about non-technical types being able to _easily_ create viable dictionaries for their specific use-case. It doesn't matter if that use-case is a dictionary in Pondo, or a dictionary of people and places in Bharat, or a dictionary in Moon.

The other part of the proposal is that even if the original dictionary creator abandons the dictionary, it can still be maintained, and updated.

The third part of the proposal is that whilst it is initially for LibO, the hope is that it becomes the source for dictionaries for FLOSS projects.

Totally disagree, from experience. Of course having both is better, but try working in a language without even a spellchecker and then get someone to count the errors. Even mediocre spellcheck coverage kills a good % of typos. I just have to take a random Gaelic page off the BBC, dump it into LO and count the hits.
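
If you want to run that count outside LO, a few lines of Python will do. A rough sketch, assuming the pyhunspell bindings and a gd_GB dictionary at the usual path (both of those are assumptions; point it at whatever dictionary and text file you actually have):

# Count how many tokens in a text file the spellchecker flags.
# Dictionary path, locale name and file name are assumptions.
import re
import hunspell  # pyhunspell bindings, assumed installed

h = hunspell.HunSpell("/usr/share/hunspell/gd_GB.dic",
                      "/usr/share/hunspell/gd_GB.aff")

with open("page.txt", encoding="utf-8") as f:
    # naive tokenisation: runs of letters, accented characters included
    words = re.findall(r"[^\W\d_]+", f.read())

misses = [w for w in words if not h.spell(w)]
print(len(misses), "of", len(words), "tokens flagged")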

Michael

toki wrote the following on 03/05/2016 at 19:35:

On 03.05.2016 at 20:49, Michael Bauer wrote:
<snip>
Totally disagree, from experience. Even mediocre spellcheck coverage kills a good % of typos.

toki wrote the following on 03/05/2016 at 19:35:

This goes back to my claim that spell checking without built-in grammar checking is useless.
</snip>

I agree. Otherwise you could say that no grammar checker is good unless it's n-gram based or such. And there is now way any language smaller then English build something like outside some institute or outside funded project of some sort.

For small languages even having a spell checker is huge. There's quite a few English dictionaries out there to help you with this or that, but when the whole country has a population equivalent to only one (average) US city, everything is extra hard.

We all know the downsides of spelling checkers but it's just the way it is.

And yet, spelling checkers (dumb as they are) and grammar checkers (poor as they are) still do a lot of good.

It's easier to teach people how to write than to make a decent grammar checker (and that's just the way it is).

On 03.05.2016 at 20:35, toki wrote:

no doubly maintained word lists by multiple maintainers (not knowing each other)

will not and cannot be resolved.

With a central repository for working on dictionaries, it is far easier for two individuals interested in the same dictionary to find each other than if they are working on two different sites, in different locations.

Yes, I agree, I never argued against that. But now we are talking about my point from the first mail: you build a place for people involved with LO and provide them with tools to make better dictionaries. I was saying that it makes more sense to tie it up with LO, for LO, than to think you are making a repository for Hunspell -- you are making one for LibreOffice and that's it.

My point was that you can't build a repository for 'official' Hunspell dictionaries, only for 'official' LO dictionaries. I was just saying it was communicated in a way that was a little bit blurry and unclear (to me).

(And you explained it to me in the last part of your mail.)

Whose dictionary do you include in that single repository, how do you merge

As a practical matter, a repository that only allows for one dictionary per language is not viable. At a minimum, you'll have specialized dictionaries.

[Starting new discussion (sic!)]

Which languages? Half of them don't even have a decent affix file (mine included, and that's nobody's fault).

I'm not trying to discourage or sabotage (I really hope this comes alive), but who do you think will build such dictionaries for languages that don't have them already? Who can maintain that?

Are all those specialized dictionaries sharing an affix file?

That was my second (and last) point: it sounds like the goals are set too high. Not in terms of possibilities, but in terms of actual interest.

And we are talking about different things here; having two for the same language was not what I meant. We are starting a new discussion here, so back on topic:

I was more concerned about how such a system would work. I'm not telling anyone what to do or how (not even suggesting), because I don't understand all of that. I just wanted to know. It sounded so unreal.

[Yes, it has possibilities -- we agree on practically everything, but you are pressing where it hurts here :wink: ]

How do you merge affix files with different affix classes (that will be a mess)?

I've seen some tools for automating the creation of affix files.
I don't know how well they work, though.

No, no and - no! No scripts with any natural language if you don't already have a finished dictionary for cross-referencing. No, no way. (And small and not-so-small languages don't have access to those, or they simply don't exist.)

This goes back to my claim that spell checking without built-in grammar checking is useless.

Why do you think the included dictionary is 'standard' and better than the other one?

Any dictionary project has to include the ability to have the same language in at least two different writing systems --- Braille (^1) and the standard writing system for the language.

Will the other guy give up his work?

The proposal does not require the other guy to give up his project.

I wouldn't be surprised to see the other guy create a more specialized dictionary.

* John Doe creates a general purpose dictionary;
* Jane Doe creates a name and places dictionary;
* John Roe creates a scientific terminology dictionary;
* Jane Roe creates a basic words dictionary;

It sure will be easier than it is now.

Who will hunt down all those 'other' guys, telling them 'Yo, dude, leave that, do this shit!'?

As far as existing spell checking and wordlist projects go, nobody is going to tell them to "leave that, do this".

Yes, exactly: so again, you can invite people -- people you already know, people who are already doing this stuff (and those are few).

So having some sort of bugzilla for missing or wrong words has more potential for regular users (even integrated into the UI, so it just reports to the matching language in that repository of some sort).

The dictionary building tool can help the ones already doing it to do it better.

(Not trying to make a discussion out of this.)

What might happen is that known, existing projects are offered space, etc. in the proposed repository/incubator, but they stay where they currently are, due to how their workflow operates.

How will such a repository resolve competition between two English dictionaries?

Since you specifically mentioned English, there currently are versions of English for a dozen locales, plus around half a dozen specialist dictionaries.

Most users won't choose the English (OED) variant, because it has too many words in it. Too many words means that words that are wrongly used get flagged as correctly spelled. The "Eye right withe aye pin" phenomenon.

How many languages have that problem?

This proposal is about non-technical types being able to _easily_ create viable dictionaries for their specific use-case. It doesn't matter if that use-case is a dictionary in Pondo, or a dictionary of people and places in Bharat, or a dictionary in Moon.

Perfectly fine, can't wait.

Although it will be very complex if it's going to support all of Hunspell's advanced features (because how does entering text in a text box differ from editing a txt file?).

Not trying to be rude, again: I really hope this will work!

The other part of the proposal is that even if the original dictionary creator abandons the dictionary, it can still be maintained, and updated.

That's a plus as big as a skyscraper!

The third part of the proposal is that whilst it is initially for LibO, the hope is that it becomes the source for dictionaries for FLOSS projects.

#####

Hypothetical situation. One of Kevin Scannell's students decides that what the world needs is a dictionary in each of the 2,500 languages that have been reduced to a writing system. So said student walks through Kevin's word lists and creates a dictionary project for each of the 2,000 languages that Kevin maintains word lists for. A year later, said student graduates and forgets about their dictionaries.

Under the current scenarios, when said student abandons their dictionaries, the only way other people can update them is by forking them --- assuming that the license allows forking.

Under the proposed scenario, if said student creates the dictionaries in the repository, when said student abandons them, other people can still update the dictionaries, which can then be distributed to LibO, etc.

I'll grant that were said student to create 2,000+ dictionaries for LibO, it would break the UI. However, as far as the proposal goes, that breakage is irrelevant.

All good. Just wanted to see if I got it all right.

Thanks,
Kruno

For small languages even having a spell checker is huge. There's quite a few English dictionaries out there to help you with this or that, but when the whole country has a population equivalent to only one (average) US city, everything is extra hard.

That is why such a tool is _more_ than welcome.

And there is now way any language smaller then English build something like outside some institute or outside funded project of some sort.

Please either rephrase that sentence, or write it in your native language.

I'm trying to figure out if you mean that English is the only language in which an N-Gram based grammar checker can be created, or if that is the only language for which adequate funding for such a critter can be found.

For small languages even having a spell checker is huge.

When working with evidential grammars or noun class grammars, spell checking falls apart, because the entire word is rewritten according to the evidential particle or noun class.

jonathon

toki wrote the following on 04/05/2016 at 21:43:

And there is now way any language smaller then English build something like outside some institute or outside funded project of some sort.

Please either rephrase that sentence, or write it in your native language.

It's not that hard to understand. What he said is that unless you happen to be lucky and have an institute or some funding mechanism, as a small language you often don't have the means to go and do the really fancy stuff that would be really nice to have. English has a massive amount of research and resources to throw at its linguistic problems. A language like Scottish Gaelic mostly works on the back of dedicated volunteers (or just a volunteer in some cases) donating time and/or expertise.

For small languages even having a spell checker is huge.

When working with evidential grammars or noun class grammars, spell checking falls apart, because the entire word is rewritten according to the evidential particle or noun class.

That's a strange and rather defeatist argument. Whatever those are (I've really never heard of evidential grammars; with noun class grammars I'm guessing you mean languages like Bantu), I have yet to come across a language for which spellchecking is practically or theoretically impossible. Ideographic scripts like Chinese perhaps, where you need to take longer chunks or semantics into account to catch 水 vs 氷 being in the wrong place, but there are few systems like that. Sure, coverage is an issue in morphologically complex languages, but it's by no means impossible. Basque has a dozen or so cases and a myriad of suffixes which can be combined in lots of different ways, but oddly enough, spellchecking is possible. You just have to be clever about how you go about creating them.
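
To give a flavour of what being clever can look like: Hunspell's cross-product affixes let a single stem entry cover every valid prefix+suffix combination. A made-up, English-like toy (not Basque!), assuming the pyhunspell bindings only to test it:

# Toy .aff/.dic: with cross product (the "Y" in the header lines),
# one stem yields lock, locks, locked, unlock, unlocks, unlocked.
import hunspell  # pyhunspell bindings, assumed installed

AFF = """SET UTF-8
PFX U Y 1
PFX U 0 un .
SFX D Y 1
SFX D 0 ed .
SFX S Y 1
SFX S 0 s .
"""

DIC = """1
lock/UDS
"""

with open("toy.aff", "w", encoding="utf-8") as f:
    f.write(AFF)
with open("toy.dic", "w", encoding="utf-8") as f:
    f.write(DIC)

h = hunspell.HunSpell("toy.dic", "toy.aff")
for word in ("lock", "locks", "locked", "unlock", "unlocks", "unlocked"):
    print(word, h.spell(word))  # all six accepted from one stem entry

Multiply that by a few hundred affix classes and you cover a lot of morphology without listing every surface form.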

No need to throw out the baby with the bathwater.

Michael

On 04.05.2016 at 22:43, toki wrote:

And there is now way any language smaller then English build something like outside some institute or outside funded project of some sort.

Please either rephrase that sentence, or write it in your native language.

I'm trying to figure out if you mean that English is the only language in which an N-Gram based grammar checker can be created, or if that is the only language for which adequate funding for such a critter can be found.

I think it would make more sense if you had commented on my first mail; that's where I could (and would) rephrase.

For small languages even having a spell checker is huge.

When working with evidential grammars or noun class grammars, spell checking falls apart, because the entire word is rewritten according to the evidential particle or noun class.

jonathon

How does that reflect the situation with dictionaries in LO?

No grammar checker or spelling checker can fix Dadaism. Nor can they think for the user. The only thing you can do is fix typical grammar or spelling mistakes with a spelling checker and grammar checker working together.

Without a grammar checker you can still catch a typo. So having a spelling checker is still better than not having one.

I really didn't want to start this kind of discussion, but can't those languages still build word lists (maybe with an affix file), and isn't that better than not having that option at all?

You know what you are talking about and you have the whole system in your head (phonology, morphology...); you know what can possibly go wrong and what to correct -- but you are only working with Hunspell (and maybe LanguageTool) -- so you work with that and forget about the stuff you read in doctoral theses.

My point was that a shift in how dictionaries are (or will be) built should not be expected. I'm surprised there are as many of them as there are (overall quality aside).

And I can't build even decent grammar checking with LanguageTool without a corpus available. There is one, but I'm not doing this on my own, alone -- forget it. So let it be just a spelling checker (broken or not).

Kruno