Testing out 2 new large-word-list English dictionaries.

I am testing out 2 new .oxt extensions. They are about 5.5 MB each in size.

The American English one has en_US spelling, hyphenation, and thesaurus. The British one has en_GB spelling, hyphenation, and the US thesaurus. I could not find a GB thesaurus, so I used the US one since they are close enough.

The spelling dictionary word lists contain over 217,000 words, with the British one having about 280 less words. These lists include proper names, possessive name/word forms, and words ending with suffixes like "ing" and "ism".

I would like to have constructive comments about these .oxt extensions for LO/OOo. I have word lists that contain 50K+, 98K+, 217K+, 390K+, and 638K+ words. My original test of editing a dictionary had only about 50K words and no hyphenation or thesaurus info in it. This is a major step for me in this "field". I hope it will be welcomed.

If you would rather see the 98K word list, or the other sizes, let me know. I will be working on testing them later. But for my first testing, I decided to go with a large word file.

Here are the links.

http://libreoffice-na.us/English-3.4-installs/add-on-dictionaries-large-list/kpp-american-english-dictionary-large-list.oxt

http://libreoffice-na.us/English-3.4-installs/add-on-dictionaries-large-list/kpp-british-english-dictionary-large-list.oxt

But does it include the word "fewer"?

Brian Barker

Are there instructions for installing these alternate files? OS=Win7

Jerry

So, since I do actually use my Writer setup in my business, is it
going to be easy to revert to my current working dictionary/thesaurus
if I want?

Regards
Mark Stanton
One small step for mankind...

Hi :slight_smile:
I think you have choices. I think you can just double-click on an .oxt file and that installs it in LibreOffice, or else open Writer or something and click on

Tools - "Extension Manager" - Add

and then browse to wherever you downloaded or saved the file. I could be wrong, but I heard that one of the lists might have incorrectly put .zip instead of .oxt at the end of the file name. I'm not sure how to handle that if it has, except to try one of the other word lists and see if that seems to have the same problem (unlikely).
Regards from
Tom :slight_smile:

I have been "offline" (think near-hospital-stay illness) since just after I posted this list, so I am only now getting back to answering the replies. Thanks to Tom for giving the help/hint on installing the file on Windows.

The .zip issue was due to the Extension Center for LO and a MIME-type error with the file storage. So the original one I had in the test version of the "center" was redone as an externally hosted file until that issue is resolved. The links I used for the initial post came directly from a dictionary list on my domain, http://libreoffice-na.us/English-3.4-installs/dictionary.html, not the LO Extension Center.

"fewer words"
I opened the word lists with each word on a separate line, as needed. Then I did a line count with a text editor. I had to have a "proper" count for the words, for the system to use the file[s] correctly. Unlike some lists that use "controls" at the end of words to help with suffix word forms, the lists I use do not have them. The suffix-control version may take less space, but it would take more processing power to do the adding of the suffixes, plurals, possessive forms, etc., compared with a straight list that includes all the word forms.

I used the counts from the package to give the actual number of spellings in the list[s] as my number of words. The largest of the American lists is 638,644 words, and for British 638,285 words. I could give you the word counts for all the lists I have, including the one for 217K words, but it is not needed, or is it?

The last time I did a dictionary spelling project at a college, the total number of words was over 177K. It used the "suffix control" method to fit as many words into the ROM memory chips as possible. Actually, that was when "IBM Compatible" desktop computers were in the 286 CPU stage. Now there are over 600K word lists.
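
For anyone who wants to reproduce a count like that, the same line count can be done with a few lines of Python instead of a text editor (just a rough sketch; "word_list.txt" is only an example name, not one of the actual files):

word_count = 0
with open("word_list.txt") as word_file:
    for line in word_file:
        # Count only non-blank lines; the lists have one word per line.
        if line.strip():
            word_count += 1
print(word_count, "words")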

The difference in the actual number of "fewer" words could be that there are words that are not included, based upon whether that spelling is used in American English or in British English.

I have seen some words that are spelled in a way or two that I did not expect, but they are proper according to the sources I have.

I did not create the lists from scratch. They are from a source I trust.

The real thing I am looking for help with is what you, the users, would like to have for an English spelling dictionary. I could test them out for hours, days, weeks, months, etc., but it would be only with my own use and needs in mind. I want others to decide if it is useful for THEM.

I chose the 217K word list version as a starting point. I may, for my personal use, go back down to the 98K-size word list version. I would like to know what you think.

Would you prefer the 50K, 98K, 217K, or a larger word list for an English spelling dictionary for LibreOffice? If I could unlock the default one, I would, for my testing. As far as I remember, the included default English dictionaries had about 50K words in their en_US and en_GB word lists. I could actually check if needed.

So please let me know what you think. What size of word lists would you like?

I have been talking to Tom on and off about a larger word count dictionary for LO/OOo for some months. I just decided it was time to get my feet wet again.

The "funny" thing I saw in an older word list, once, was one that did not have the word "dictionary" in it at all.

If I could still remember my basic C programming, I would write a program to compare the different word lists and see which words are not common, but after 3 strokes I have not programmed such a package in many years. Actually, the last time was a few months after the last stroke.

webmaster:

Possibly a convenient language for comparing the word lists would be Python.

--- Python has a data structure "dict" (dictionary, hashtable, associative array).

--- Python has a data structure "set".

If you wish, I can email you short, working, example code.

Winston

Sorry to hear you (?) have been ill.

I think you'll find Brian was just pulling your leg about your use of
English, rather than asking you anything about specific word counts.

Mark Stanton
One small step for mankind...

Sorry, I did not catch that.
Even when I am not ill, after surviving 3 strokes [even small ones], it can be difficult for me to catch grammar and other mistakes. Sometimes it is hard to type 3 or 4 letters correctly in a row. Other times my speech is no good. I hate it when both are working against me at the same time. At least I am better off than my wife, who suffers from advanced-stage Alzheimer's.

Thanks Winston

I was one day going to look into learning Python, but I am afraid that my programming days are over. The strokes damaged parts of my brain that dealt with communication skills and programming skills, among others. I tried to relearn C/C++, but it was not going to happen.

I would like to see a "simple" working example in Python, though. If I ever decide to try to retrain the programming part of my brain, I was told to try Python.

The dictionary word lists are not in any "special" type of array, so far as I can tell. The list sample below is part of an Oxford English "dic" file. After the "/" are the codes for the system to use. I really have not found a good reference to what each of these codes indicates, except that some appear to be suffix indicators. The word lists I now use do not have codes, but they still work well. (There is a small Python sketch after the sample that shows how such an entry splits into the word and its codes.)

When I dealt with word lists and code indicators, these codes would shrink the amount of space you need to store the dictionary lists. Back when RAM/ROM was in the price range of 32 to 64 MB for $100 or more, the more space you could save, the better. Now we all get sloppy in our code size. Since we do not need to fit a chess game onto a single 1.44 MB floppy, we do not write such "tight" code. As long as it is quick enough, correct?
Well not everyone is like that now.

abbreviation/M
abdicate/DNGSn
Abelard/M
abider/M
Abidjan
ablaze
abloom
aboveground
abrader/M
Abram/M
abreaction/MS
abrogator/MS
abscond/DRSG
absinthe/MS
absoluteness/S
absorbency/SM
abstract/ShTVDPiGY
absurdness/S
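
In case it helps anyone reading the sample above, here is a rough Python sketch that separates each entry into the word and its codes. As far as I can tell, this is the Hunspell/MySpell format, where the letters after the "/" are affix flags whose meanings are defined in the dictionary's companion .aff file; the entries below are just copied from the sample.

# Split Hunspell-style .dic entries into the word and its affix flags.
# The flag letters are defined per dictionary in the matching .aff file.
sample_entries = [
    "abbreviation/M",
    "abdicate/DNGSn",
    "Abidjan",
    "abstract/ShTVDPiGY",
]

for entry in sample_entries:
    word, _, flags = entry.partition("/")
    print(word, "->", flags if flags else "(no codes)")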

Hi :slight_smile:
"Pulling your leg"?? I think you would have noticed?
Regards from
Tom :slight_smile:

Sorry Tom, but maybe not.

The way my head was the last few days, I might not have noticed if there was a 300-pound tiger sleeping on my bed instead of the 15-pound tiger cat called Knobby.

Many days' worth of lack of calories and lack of heavy painkillers [6 prescriptions' worth that could not be kept down] tends to make my mind not see things straight, or even know whether things are real. My prescriptions add up to the equivalent of morphine, since I could not drive or "function" with the more powerful single medication. With the 6 meds instead of one, I can drive legally and still have the "power" of morphine without the addiction and the side effects.

webmaster:

Below is some elementary example Python code that reads two files, each with one word per line, and writes an output file with the words that are in exactly one of the files.

If you wish, you can use the code in LibreOffice.

If you have any comments or questions, email me.

Winston

Possibly it is good that you see the results first. Then, if you are interested, you can read about how to generate the results.

On a command line, you will type the following:

     python3.2 find_nonshared_words.py

(You can also type "python2.7" instead of "python3.2". But realize that Python 3.2 is not always backwards-compatible with Python 2.7.)

This command will generate the following output file:

output_file.txt:
a1
a2
a3
a4
a5
a6
b1
b2
b3
b4

Below, I show you how to create the results:

Create the following two input files. (Words starting with "a" appear in only file 1. Words starting with "b" appear in only file 2. Words starting with "c" appear in both files, and should be ignored by the code.)

input_file1.txt:
a1
a2
c1
a3
a4
c2
a5
c3
a6

input_file2.txt:
c1
b1
b2
b3
c2
c3
b4

Then create a file called find_nonshared_words.py:

def create_set_from_file(input_file_name):
    # Read a file with one word per line and return the words as a set.
    input_file = open(input_file_name)

    s = set()
    for line in input_file:
        # Delete any leading or trailing whitespace.
        s.add(line.strip())

    input_file.close()

    return s

set1 = create_set_from_file("input_file1.txt")
set2 = create_set_from_file("input_file2.txt")

# The symmetric difference holds the words that appear in exactly one
# of the two input files.
set_of_words_in_exactly_one_file = set1.symmetric_difference(set2)

output_file_name = "output_file.txt"
output_file = open(output_file_name, "w+")
for word in sorted(set_of_words_in_exactly_one_file):
    output_file.write(word + "\n")

output_file.close()

Hi :slight_smile:
There are some programmers' guides in the external documentation list
http://wiki.documentfoundation.org/Documentation/Publications#Programmers
I'm not sure that's really something you want to get involved in at this point, but it might help people who do want to get involved on the dev lists.
Regards from
Tom :slight_smile: