[PROPOSAL] New project for dictionaries

Hello community,

I had some discussions with dictionary maintainers and came up with a
completely new idea for a new project!

The TL;DR version: Provide a central place for dictionary maintainers,
including useful tools plus a possibility for easier collaboration.

The long version: LibreOffice uses dictionaries based on Hunspell.
Hunspell is a free, open source project. Many applications support
Hunspell-based dictionaries. The following (non-exhaustive) list
of applications supporting Hunspell-based dictionaries is
shamelessly copied from the German Wikipedia:
* AOO
* LibO
* Mozilla products
** Thunderbird
** Seamonkey
** Firefox
* LaTeX IDEs: WinShell, TeXworks, LyX, Texmaker, TeXstudio, etc.
* Google Chrome
* The Bat
* Emacs
* Opera web browser
* Apple Mac OS X 10.6+
* Adobe InDesign
* Adobe FrameMaker
* SoftMaker Office
* Scribus
and many, many more applications.

Some applications additionally provide their own dictionary extension page
for downloading dictionaries. Moreover, depending on the application, the
dictionary has to be packed differently (the simplest example is a
different file extension).

Some projects maintain their extension separately (independently).
Also, a user who wants to propose a change or an addition usually has no
easy way to find out whom to contact to get another word added.

The basic idea is to provide a portal with tools that pack the
word lists into dictionaries automatically by scripts and, if possible,
update and upload the new version of the dictionary to the download
pages (e.g. the extension centre of AOO / LO or Mozilla's page). The
portal should also provide a way for users to help the maintainers by
proposing new words, or to contact the maintainer if there are errors in the
word list.

The manual work of packaging the extension is very time-consuming and
moreover leads to situations where there are multiple maintained
versions for the same language. With tools that create a new
version of the dictionary, maintainers can invest their time in
improving the dictionary and can easily release new versions more often.
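
As a rough illustration of the packaging step described above, here is a
minimal Python sketch that writes the same .dic/.aff pair into several
zip-based archives that differ only in file extension. The function name,
the parameters and the flat archive layout are assumptions made for this
example. A real .oxt or .xpi also needs the manifest files its host
application expects, and those are deliberately left out.

```python
import zipfile
from pathlib import Path


def pack_dictionary(dic_path, aff_path, out_dir, basename,
                    extensions=(".zip", ".oxt", ".xpi")):
    """Write one zip-based archive per extension and return the paths.

    Only the shared container step is shown. The manifest files that a
    real .oxt or .xpi requires are NOT generated here (assumption: they
    would be added by a later, application-specific step).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for ext in extensions:
        target = out_dir / f"{basename}{ext}"
        with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
            # Store both files flat at the archive root (hypothetical layout).
            archive.write(dic_path, arcname=Path(dic_path).name)
            archive.write(aff_path, arcname=Path(aff_path).name)
        created.append(target)
    return created
```

Keeping the per-application differences in a list of extensions means a
new target could start out as one more entry before it grows its own
manifest handling.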

I'm eager to hear what you think about this idea.

Regards,

Dennis Roczek

I think the different licenses across products might be a problem, but I'm not an expert.

On the whole, I'm not sure the packaging is actually the biggest stumbling block for the end user, rather than the actual installation/implementation. Some are very easy, like Mozilla or LO: you just get the file and it either auto-installs or you click it. With others, like Trados, you have to find the folder for the dic file, manually add the locale to an XML file (if it's not bundled) ... and pray it works. With others, like Chrome, you can't actually add a spellchecker for a language that Google hates.

Though perhaps if there were a central place on the web for all these, a single place where devs could point their software for grabbing Hunspell dictionaries, that might make things easier. I'd be all up for that, but it would probably end up an exercise in herding cats. Maybe forking existing dictionaries might be worth considering if the owners are either non-communicative or not interested?

Michael

Dennis Roczek wrote the following on 25/04/2016 at 11:13:

Hi Michael,

I think the different licenses across products might be a problem.
But not an expert.

Unlike the Apache Foundation, The Document Foundation has no
particular open source license mentioned in our statutes. The Document
Liberation Project, for example, uses many different licenses across its
different libraries.

On the whole, I'm not sure if the packaging is actually the
biggest stumbling block for the end user

Well, the ideal is that somebody provides a pre-baked "binary" /
extension.

but the actual installation/implementation. Some are very easy,
like Mozilla or LO, you just get the file and it either
auto-installs or you click it. Others, like Trados, you have to
find the folder for the dic file, you have to manually add (if it's
not bundled) the locale to an XML file ... and pray it works.
Others, like Chrome, you can't actually add a spellchecker for a
language that Google hates.

We actually cannot do anything about the different implementations.
But we could provide packages that are as easy to use as possible.

Though perhaps if there was a central place on the web for all
these so there's a single place where devs could point their
software for grabbing Hunspell dictionaries that might make things
easier. I'd be all up for that but it would probably end up an
exercise in herding cats. Maybe forking existing dictionaries might
be worth considering, if the owners are either non-communicative or
not interested?

If the maintainers are not responsive (or if the dictionary is
abandoned), we actually cannot do anything as long as we have no new
maintainer for a fork. If there is somebody willing to take over the
development, then forking that particular dictionary is indeed a
possibility.

Michael

</snip>

On 25.04.2016 at 12:13, Dennis Roczek wrote:

The basis idea is to provide a portal which provides tools to pack the
word lists to dictionaries automatically by scripts and - if possible -
to update and upload the new version of the dictionary to the download
pages. (e.g. the extension centre of AOO / LO or Mozilla's page) The
portal should provide a way for users to help the maintainers and
proposing new words or contact the maintainer if there are errors in the
word list.

I think that's a great idea and I look forward to it (although I can do very little to help).

The first time I looked for a Hunspell dictionary, I looked in Hunspell's own source code. How stupid of me (or was it?).

People often complain about abandoned dictionaries, but having a central repository could make maintenance easier. There are languages whose dictionaries haven't been updated for fifteen years. Maintaining dictionaries the way we do translations could be beneficial for smaller languages. It would be very helpful to hear about missing or wrong word forms in a dictionary.

You can't abandon a Wikipedia article, and in the same way you could not abandon a dictionary; there will always be somebody to continue when one person loses interest. It's very hard to take over an unmaintained dictionary.

But I expect the big ones to complain about doing things this way.

Building plugins for other projects also sounds good. It's hard to beg: 'Hey, I've updated the word list, could you please update that Firefox plugin you maintain? You haven't done it for two years.'

But again, what will make that Firefox person actually include the updated dictionary?

It could be a great repository for developing dictionaries and updating the ones in LO, but I doubt that cooperation with other projects will go the way it was described in the initial mail.

Thanks,
Kruno

Hi Dennis,

I strongly support your idea. For our Czech language, the dictionary has not been updated for many years and no one maintains it. So we were thinking about a webpage where users would give their feedback and propose new words. If there was an infrastructure for that provided by TDF, it would be extremely helpful.

Thanks,
Stanislav

On 25.4.2016 at 12:13, Dennis Roczek wrote:

Hello Dennis,

2016-04-25 13:13, Dennis Roczek wrote:

<...>

The TL;DR version: Provide a central place for dictionaries maintainers
including useful tools plus a possibility for easier collaboration.

<...>

I think the idea is awesome. One of the programs I localize currently
maintains its own list of dictionary URLs in XML format, and these point
to OOo mirrors, which I suppose are slowly going into oblivion...

Since Hunspell (with a few exceptions, I know) is pretty much the
de-facto spell checker in today's open-source applications (and not just
them), I think it may be beneficial to have a central repository to host
these dictionaries. Perhaps it would even make sense to adopt one of the
package formats as proposed/official, and then begin getting in touch
with application developers, suggesting that they adopt support for it.
Possibilities here are endless, for example, the repository could
(should) provide a generated listing of these dictionaries in some
pre-agreed format, so that application developers could parse it
automatically and allow users to download desired dictionaries and
install them without ever opening their browser. TDF might indeed be a
good candidate to host such a repository.
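
To make the idea of a generated listing concrete, here is a minimal
Python sketch that scans a directory of packages and emits a JSON index
an application could parse. The JSON field names, the "format" version
number and the "<locale>.oxt" naming scheme are assumptions invented for
this example; no such format has been agreed on in this thread.

```python
import hashlib
import json
from pathlib import Path


def build_index(repo_dir, base_url):
    """Return a JSON string listing every *.oxt package under repo_dir."""
    entries = []
    for package in sorted(Path(repo_dir).glob("*.oxt")):
        data = package.read_bytes()
        entries.append({
            # Hypothetical naming scheme: "<locale>.oxt", e.g. "cs_CZ.oxt".
            "locale": package.stem,
            "url": f"{base_url.rstrip('/')}/{package.name}",
            "size": len(data),
            # A checksum lets the client verify the download.
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return json.dumps({"format": 1, "dictionaries": entries}, indent=2)
```

An application would fetch this one file, show the available locales,
and download the chosen package from the listed URL.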

Rimas

Hi Dennis

The TL;DR version: Provide a central place for dictionaries maintainers
including useful tools plus a possibility for easier collaboration.

+1
Everything I can say is summarized as: it is nice, hard work among a
"herd of cats".

The easiest part is the technical part. Beware of the service user.

The success of such an initiative will depend on the strategy and approach
used to get the support of the dictionary maintainers and the l10n and doc
leaders, and to make them use the tool and spread the word about it: mutate
cats into wolves. It involves communication and marketing of the service and
will require engagement of resources.

The strategy is also about a migration from existing processes to the
new one. We know a lot about migration issues: a solid/old installed base,
user resistance to change, rebellions and even sabotage (derailing the
project). So, IMHO, start with TDF Hunspell for LibreOffice in as many
L10n communities as possible. Once the community is engaged,
invite/open it to other communities (Mozilla, LaTeX, etc.). We did it for
Pootle, and it did not happen in a snap.

Regards

If Hunspell does not offer a repository for all languages, and The
Document Foundation has the resources to do so, then this is something
that would be extremely useful.

I don't know what resources would be required, but my guess is:
* Git, or something similar, that hosts word lists;
* A BuildBot system that creates the dictionary extensions/packages/etc.
on a weekly/fortnightly/monthly/quarterly schedule;
* Automatic uploading of spelling dictionaries to the Dictionary
Extensions host used by LibreOffice.
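
As a hedged sketch of what the build step in such a pipeline might do,
the Python function below turns a plain word list into the text of a
Hunspell .dic file: the first line carries the (approximate) entry
count, followed by one entry per line. The function name and the choice
to sort and deduplicate are assumptions made for this example.

```python
def build_dic(words):
    """Return the text of a Hunspell .dic file for the given entries.

    Entries are deduplicated and sorted. Affix flags (the part after
    "/", as in "walk/S") are kept untouched.
    """
    entries = sorted({w.strip() for w in words if w.strip()})
    # The first line of a .dic file is the (approximate) entry count.
    return "\n".join([str(len(entries)), *entries]) + "\n"
```

Because the output is deterministic, a scheduled build would only
produce a changed file when the word list itself has changed.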

Somebody has to write _extensive_ documentation on:
* How to create a word list;
* How to modify a word list;
* How to upload the word list to the repository;
* How third parties can download content in the repository;
And maybe also:
* How to transform the data in the repository for use with other software.

I know that storing the wordlist on Git is well within the realms of
being doable. However, I have no idea how that is done, and my reading
of books on Git hasn't provided me with any pointers in that direction.

For people who aren't used to versioning systems, Git is a confusing
maze that is utterly incomprehensible. If it weren't for the fact that
there are over 100 books on using Git, in print, I'd suggest that the
first solvable problem is writing a manual for Git that mortals can
understand. On second thoughts, even with 100 books in print, this type
of manual could be an extremely useful addition.

jonathon

Dennis Roczek wrote the following on 25/04/2016 at 15:33:

If the maintainers are not responsive (or if the dictionary is
abandoned), we actually cannot do anything as long as we have no new
maintainer for a fork. If there is somebody willing to overtake the
development, then forking of that particular dictionary is indeed a
possibility.

A couple of extra thoughts from an active maintainer of a dictionary and the associated extensions.

In general, I'm interested in anything that brings together scattered resources and makes stuff more user friendly, be that end users or devs using it.

- there would have to be some sort of a locale-specific admin system for active maintainers joining. Crowdsourcing stuff works in big languages but if folk start submitting random stuff to a locale like Scottish Gaelic or playing with the affix file, that's a recipe for disaster.

- there are different ways in which people maintain their dictionaries. Some edit the text file directly, others build from corpora; in our case, we use scripts to build from a dictionary database, which automatically generates the .zip/.oxt/.xpi files. One would probably have to consider different entry points if one were to successfully attract people from different projects, because I would NOT want to have to start manually maintaining our file.

- one thing that hasn't been mooted yet would be making some provision for easily creating new dictionaries for new locales, something that has a loc-tech threshold.

Michael

Hi Michael,
Hi toki,

but that's the whole point: making the actual system easier and including
more contributors.

I have no clear understanding of which languages use which kind of system
and scripts. But as far as I know, many of those who use scripts to get the
extensions packed simply use homegrown / self-made scripts in any
language, or simply pack the extension manually.

If we get the system running, i.e. provide a system (hosted / maintained
by TDF), everybody would use the same system to create new
dictionaries, and it will become superior over time. Additional systems
(e.g. Mozilla-based products, as the easiest example) can be added later.

The point about the affix file: I imagined the system on a much
lower level. Joe Average is computer-savvy, realizes that the system
is based on volunteer work, and sends his custom/unknown words to the
maintainer (e.g. via a web page, directly integrated within
LibreOffice, or whatever), and the maintainer (similar to the Language
Administrator in Pootle) decides whether they go into the dictionary or not.
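
The workflow described above (anyone proposes, a per-locale maintainer
decides) can be sketched in a few lines of Python. The class name, the
field names and the in-memory storage are assumptions made for this
example; a real portal would need persistence and authentication.

```python
from dataclasses import dataclass, field


@dataclass
class LocaleQueue:
    """Pending word proposals for one locale, reviewed by its maintainers."""
    locale: str
    maintainers: set
    pending: list = field(default_factory=list)
    accepted: set = field(default_factory=set)

    def propose(self, word):
        """Anyone may propose; duplicates and known words are ignored."""
        word = word.strip()
        if word and word not in self.accepted and word not in self.pending:
            self.pending.append(word)

    def review(self, reviewer, word, accept):
        """Only a maintainer of this locale may accept or reject a word."""
        if reviewer not in self.maintainers:
            raise PermissionError(f"{reviewer} does not maintain {self.locale}")
        self.pending.remove(word)
        if accept:
            self.accepted.add(word)
```

Keeping one queue per locale means random submissions never reach the
word list of a small language without its maintainer's approval.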

@toki I really hope that the maintainers do *NOT* have to learn Git.
It should be more than an intelligent Pootle system. (Dunno how that
looks with the affix files, but we will find somebody who can do it, if
we want.)

Moreover, that kind of system won't need many resources (neither
human nor server resources) to maintain. OTOH we might gain
many new easy hacks for new developers who want to develop
new scripts to create extensions for other plugins, or conversions to
other dictionary systems, or the like.

At the moment this is only a discussion of whether "we" find that useful.
Whether "TDF" and volunteers are able to implement it technically
is a completely different story.

Dennis Roczek

Do try to get the system running, provide a system (hosted / maintained
by TDF) and everybody would use the same system to create new dictionaries

Make it easy enough, and every group that is working on reducing a
language to writing, will be uploading their word lists here.
(If that occurs, then LibO will definitely need to redesign the language
selection component.)

libreoffice or whatever) and the maintainer (similar to the Language
Administrator in Pootle) decides if it goes in the dictionary or not.

This is where automated tools are great.
(I'll ignore issues such as
the one with Afrikaans, which didn't contain the word "die" for several
years. This is the definite article in Afrikaans, so it was a pretty
annoying omission.)

@toki I really hope not that the maintainers do have *NOT* to learn git.
It should be more than an intelligent pootle system.

The reason I mentioned GIT, is because there was (operative word _was_)
a group of people working on an extension for AOo/LibO to save documents
to GIT. The projected end result would be that all the individual would
have to know about GIT, was to click on this extension to either save,
or retrieve the document from GIT.

looks with the affix files, but we will find somebody who can do it -

My impression is that there is a python library that takes word lists,
and creates affix files from them.

If the "TDF" and volunteers are able to implement that in a technical way, Will be a completely different story.

Listing the technical requirements for the must-haves and nice-to-haves.

jonathon

Hi toki,

@toki I really hope not that the maintainers do have *NOT* to learn git.
It should be more than an intelligent pootle system.

The reason I mentioned GIT, is because there was (operative word _was_)
a group of people working on an extension for AOo/LibO to save documents
to GIT. The projected end result would be that all the individual would
have to know about GIT, was to click on this extension to either save,
or retrieve the document from GIT.

There are still some projects, like the en_US or the en_GB dictionary.

looks with the affix files, but we will find somebody who can do it -

My impression is that there is a python library that takes word lists,
and creates affix files from them.

:-)

If the "TDF" and volunteers are able to implement that in a technical
way, Will be a completely different story.

Listing the technical requirements for the must-haves and nice-to-haves.

Yeah, I think the easiest would be to start by creating a wiki page listing
* pros and cons
* must-haves
* nice-to-haves
* could be added later
stuff (mostly features). The technical requirements mostly depend on the
implementation (e.g. Python as mentioned above).

jonathon

Dennis

Hi Dennis,

Dennis Roczek wrote the following on 28/04/2016 at 00:30:

I have no clear understanding which languages uses which kind of system
and scripts. But as far as I know: many who uses scripts to get the
extensions packed, simply use homegrown / self-made scripts in any
language or simply pack the extension manually.

Yes, we use a homegrown script because it has to parcel a very specific file which gets exported from the dictionary database

Do try to get the system running, provide a system (hosted / maintained
by TDF) and everybody would use the same system to create new
dictionaries - it will become superior after time. Additional systems
(e.g. Mozilla based products for the easiest example) can be added later

There will always be special cases, and if the new centralised system is to draw in as many people as possible, it must allow committing of ready-made dic/aff/xpi etc. files by people who create their Hunspell stuff in other ways. There's no way I would ever start maintaining our files manually on another platform, and I would imagine that not many people who have a dynamic setup like ours would either. We have grown the dic from 500k to 1.5m words this way in 4 years; that would simply not be feasible any other way (for us).

The point about the affix file: I imagined the system more on a much
lower base: Joe Average is computer affine and realized that the system
is based on volunteer work and sends his "customs"/unknown words to the
maintainer (either e.g. a web page, directly integrated within
libreoffice or whatever) and the maintainer (similar to the Language
Administrator in Pootle) decides if it goes in the dictionary or not.

That might work quite well for a very mature dictionary file or new locales where there is no existing data that one can draw on.

Moreover: that kind of system won't include that much resources (neither
human resource nor server source) to maintain. OTOH we might have

Hah. Nothing that involves spelling is ever easy :-) Even if just because many languages have competing orthographies.

Michael