Creating a dictionary with LibreOffice from a simple TXT file

Constantine · December 24, 2014, 8:45pm

Hi everyone,

I am new here and I hope someone can help me.

I am a translator and my colleagues and I use CAT-software with TMX-files,
glossaries in simple text-files which are in UTF8 encoding and TAB-separated
as well as other tools of course.

Over the years, some of us collected a huge amount of Terminology (with its
definitions) from each other and various other sources.
This Terminology is unfortunately in simple text-files encoded with UTF8 but
with no TABs or similar as field-separators.

In order to be able to use this Terminology as a glossary or to convert it
into a TMX and then use it efficiently with our CAT-Software (OmegaT
mostly), it has to be in the following format:

Language-A TAB Language-B
or
Language-A ; Language-B

I'll show you an example of our files so you can better understand my problem.
I will use a German-Greek terminology example, because I believe the
different languages encodings will make it easier to achieve our goal:

Abbrucharbeiten (πληθ.) εργασίες κατεδάφισης
Abbruchkosten (πληθ.) δαπάνες κατεδάφισης
abbruchreif ετοιμόρροπος, κατεδαφιστέος, das Haus ist ~ το σπίτι είναι
ετοιμόρροπο
Abbruchsbewilligung (θηλ.) άδεια κατεδάφισης
Abbruchunternehmen (ουδ.) εταιρεία κατεδάφισης
abbuchen χρεώνω, Gebühren vom Konto ~ τα τέλη χρεώνονται στον τραπεζικό
λογαριασμό

As a minimum requirement for our goal, it should look like this:

Abbrucharbeiten;(πληθ.) εργασίες κατεδάφισης
Abbruchkosten;(πληθ.) δαπάνες κατεδάφισης
abbruchreif;ετοιμόρροπος, κατεδαφιστέος, das Haus ist ~ το σπίτι είναι
ετοιμόρροπο
Abbruchsbewilligung;(θηλ.) άδεια κατεδάφισης
Abbruchunternehmen;(ουδ.) εταιρεία κατεδάφισης
abbuchen;χρεώνω, Gebühren vom Konto ~ τα τέλη χρεώνονται στον τραπεζικό
λογαριασμό

or

Abbrucharbeiten (πληθ.) εργασίες κατεδάφισης
Abbruchkosten (πληθ.) δαπάνες κατεδάφισης
abbruchreif ετοιμόρροπος, κατεδαφιστέος, das Haus ist ~ το σπίτι είναι
ετοιμόρροπο
Abbruchsbewilligung (θηλ.) άδεια κατεδάφισης
Abbruchunternehmen (ουδ.) εταιρεία κατεδάφισης
abbuchen χρεώνω, Gebühren vom Konto ~ τα τέλη χρεώνονται στον τραπεζικό
λογαριασμό

(where the space is a TAB)

Does any of you know a way to achieve this in LibreOffice or even any
other editor?

I was thinking of something like:
Search for German-language characters + space
at the beginning of the sentence/line and
replace it with the same word + ";" or TAB

I and my colleagues would be most grateful if any of you could provide a
simple solution or suggestion.

I thank you all in advance for your help and interest.

Constantine

krackedpress · December 24, 2014, 9:14pm

The question is what type of dictionary you really want?

A spell checking type for the terms
or
something that shows "Latin / English / German" character set to the Greek character set term including the definition when available.

That really makes a huge difference.

To make a searchable dictionary for use within LibreOffice may be difficult.

Also, how many terms do you have? Hundreds, thousands? That may also affect what can be done easily.

I used to do spacing alignments on massive text files using C++ and other programming languages, for use as text input for various documents and for word and term processing. But it may not be easily done within an office package.

Constantine · December 24, 2014, 9:41pm

Thank you for your fast reply krackedpress,

it is actually very simple what I want.
I just want to insert a separator after the first word.
As I said the file is a simple text file encoded in UTF8 containing:

German Word/Term "space" Greek definition ( sometimes plus more definitions
or comments)

I need to put a TAB or ";" as a separator after the first Word/Term at the
beginning of the line.

Do you have ANY suggestion how to do that?
It doesn't have to be in LIbreoffice as long as it is simple enough for me
to apply it.

Brian_Barker · December 24, 2014, 10:23pm

It won't be easy to search for characters in a particular language, but in all your examples the part you want separated from the remainder of the line is simply a single word, so you need just to search for the first space and replace it with a semi-colon or a tab character.

o Paste your material into a LibreOffice text (Writer) document.
o Put the cursor at the beginning of the text.
o Go to Edit | Find & Replace... (or press Ctrl+H).
o In the "Search for" box, enter:
([^ ]*) (.*)
- that's leftparenthesis-leftbracket-circumflex-space-rightbracket-asterisk-rightparenthesis-space-leftparenthesis-dot-asterisk-rightparenthesis.
o In the "Replace with" box, enter either:
$1;$2 or $1\t$2
- that's dollar-one-semicolon-dollar-two or dollar-one-backslash-lowercasetee-dollar-two (as preferred).
o Click More Options.
o Ensure "Regular expressions" is ticked.
o Click Replace All.

Don't do this more than once, or this process will then replace the second occurrences of spaces.

You can copy and paste the resulting material wherever you want it to go, or you can use File | Save As... and select a plain text format for "Save as type".
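
For anyone who prefers a script, the same first-space replacement can be sketched in Python. This is only an illustration, not part of the Find & Replace procedure above: the function name add_separator and the sample list are made up for the example, and it assumes each entry sits on a single line.

```python
import re

# Same idea as the Find & Replace pattern: everything up to the first
# space, then that space, then the rest of the line.
FIRST_SPACE = re.compile(r"^([^ ]*) (.*)$")


def add_separator(line, sep=";"):
    """Replace only the first space in a line with sep (hypothetical helper)."""
    # A function as replacement avoids any escape handling of sep.
    return FIRST_SPACE.sub(lambda m: m.group(1) + sep + m.group(2), line)


# Two sample entries taken from the thread.
sample = [
    "Abbrucharbeiten (πληθ.) εργασίες κατεδάφισης",
    "abbuchen χρεώνω, Gebühren vom Konto ~ τα τέλη χρεώνονται",
]
converted = [add_separator(line) for line in sample]
```

Calling add_separator(line, "\t") gives the TAB-separated variant instead. Because each line is matched exactly once, running the function a second time would replace the next space, just as in the dialog.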

I trust this helps.

Brian Barker

Tom_Davies1 · December 24, 2014, 9:55pm

Hi
Surely it is best done with a text-editor rather than with a
word-processor?

Either way it looks like you could simply use "search and replace" or "find
and replace". I would search for " (" and replace with " ; (". It might
even be possible to use a tab character in the replace-field.

If you are using Linux then you can probably do this from a command-line
and maybe to multiple files but that would be waaay beyond me.
Regards from
Tom
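
The multi-file job hinted at here could look roughly like the following Python sketch. It assumes UTF-8 files with one entry per line and a single-word term at the start of each line; the folder layout, the .txt and .tsv extensions and both function names are assumptions chosen for illustration only.

```python
from pathlib import Path


def convert_file(src, dst, sep="\t"):
    """Copy src to dst, replacing the first space of each line with sep."""
    lines = Path(src).read_text(encoding="utf-8").splitlines()
    # str.replace with a count of 1 touches only the first space on a line.
    out = [line.replace(" ", sep, 1) for line in lines]
    Path(dst).write_text("\n".join(out) + "\n", encoding="utf-8")


def convert_folder(folder, sep="\t"):
    """Convert every *.txt file in folder, writing a *.tsv file next to it."""
    for src in sorted(Path(folder).glob("*.txt")):
        convert_file(src, src.with_suffix(".tsv"), sep)
```

The originals are left untouched, so a bad run can simply be repeated after deleting the generated files.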

Constantine · December 24, 2014, 11:11pm

Dear Brian,

just FANTASTIC!!!

Thank you very very much for your fast and efficient help.

This did the job (almost) perfectly. I didn't apply this from the beginning
myself because I wasn't sure if there aren't any terms at the beginning of
the line with two or more German words. I also was too tired from work to
think clearly and run a test.
But after your suggestion AND after you gave me the correct expressions, I ran
a test on a copy of one of the files.

Indeed there were terms with two words or something like this:

a.D. (außer Dienst) εκτός υπηρεσίας, εν αποστρατεία

where "(außer Dienst)" belongs to "a.D."
Which is not bad because "(außer Dienst)" in the second field is more
useful to me (us).

For the few cases with two or more German words at the beginning, well, I
think we will survive that and be able to correct it manually.

Once again mate, cheers and THANK YOU you saved me a lot of time and
trouble.

Constantine · December 24, 2014, 11:37pm

Hi Tom,
thank you too for your reply. Just missed it before, that's why I didn't
respond to you.
I do use Linux (Mint-Mate Rebecca actually) and I do make use of PLUMA
combined with writer.

The file I now have is ready for use with OmegaT as a glossary which is
exactly what I needed desperately, I will later see if I will make a TMX out
of it too.

But one thought is still in my head though.
I do NOT need the following, but a colleague asked me for it.
Perhaps you or Brian or someone else could help me with it since I have no
idea how to do it.
I spent the last few hours reading the documentation but I obviously am too
dumb to get it done.

My friend wants a very simple standalone form for his desktop, which uses
this newly created text file or a Calc file as a database, to search for a word
and get all the definitions where this word occurs.

So, it should be a small form with 2 fields, one small entry field for the
search and a much larger field (window) where the answer appears.

Do you think you and Brian could help me with that too?
I would be most grateful as will be my friend.
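
A very rough starting point for such a lookup, without any form around it, might be the following Python sketch. It assumes the glossary is already in "term TAB definition" form; load_glossary and search are hypothetical names, and the two-field window described above would still have to be built on top of it.

```python
def load_glossary(text, sep="\t"):
    """Parse 'term<sep>definition' lines into (term, definition) pairs."""
    entries = []
    for line in text.splitlines():
        term, found, definition = line.partition(sep)
        if found:  # skip lines that contain no separator at all
            entries.append((term, definition))
    return entries


def search(entries, word):
    """Return every entry whose term or definition contains word."""
    needle = word.casefold()  # casefold gives a case-insensitive comparison
    return [
        (term, definition)
        for term, definition in entries
        if needle in term.casefold() or needle in definition.casefold()
    ]
```

The text passed to load_glossary would come from reading the glossary file with UTF-8 encoding; search then returns all matching pairs, which is the "all the definitions where this word occurs" part of the request.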

Brian_Barker · December 25, 2014, 1:20am

As you suggest, it is easy to get your text into a two-column spreadsheet array. A simple way forward, if the list is not too long, is just to use the Find & Replace facility again. If you search for the relevant word using Find All, all cells containing the text will be highlighted. You can scroll down to see the highlighted material.

Otherwise, you probably do need a proper database.
o Start a new database (Base) document.
o Select Tables in the left "Database" column.
o In your spreadsheet, select the array of material.
o Drag the array into the lower "Tables" panel of the database window.
o Follow the instructions to create a table from the imported values.
o Spend part of the holiday season reading the Base documentation and learning enough to be able to create the required form!
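
As a side illustration of the "proper database" idea (this is not the Base procedure listed above, but a swapped-in technique), here is a minimal Python sketch using the standard sqlite3 module. The table name glossary, its two columns and the helper names build_db and lookup are assumptions made up for the example.

```python
import sqlite3


def build_db(entries):
    """Load (term, definition) pairs into an in-memory SQLite table."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE glossary (term TEXT, definition TEXT)")
    con.executemany("INSERT INTO glossary VALUES (?, ?)", entries)
    con.commit()
    return con


def lookup(con, word):
    """Return all rows whose term or definition contains word."""
    # Note: % and _ inside word act as LIKE wildcards, and SQLite's LIKE
    # ignores case only for ASCII letters by default.
    pattern = f"%{word}%"
    cur = con.execute(
        "SELECT term, definition FROM glossary "
        "WHERE term LIKE ? OR definition LIKE ? ORDER BY term",
        (pattern, pattern),
    )
    return cur.fetchall()
```

A form would only need to pass the content of its search field to lookup and show the returned rows in the larger field.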

I trust this helps.

Brian Barker

Constantine · December 25, 2014, 2:06am

Hi Brian,

as you say, I will need to use base and I already started reading the docs
and experimenting with the form creation.

But I would also like to report on my progress.
I took all the files containing German-Greek terms and pasted them in a
single text-file, then using the linux editor pluma for various corrections
(I am more comfortable there) I prepared it for the final phase, which of
course was applying your instructions.
Finally I opened the file in calc and manually corrected the entries where
the German term had 2 or more words. Fortunately they weren't too many. Then
I filtered all the duplicates.
The result is a perfect glossary for OmegaT with, believe it or not, 31.400
unique entries.
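
The duplicate filtering step, done here in Calc, can also be sketched in a few lines of Python. unique_lines is a hypothetical helper; it only removes lines that are exactly identical, so entries that differ in spacing or punctuation would survive.

```python
def unique_lines(lines):
    """Drop exact duplicate lines, keeping the first occurrence of each."""
    # dict keys keep insertion order (Python 3.7+), so the original
    # order of the glossary is preserved.
    return list(dict.fromkeys(lines))
```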

Now I started the same procedure for the Greek-German files but...
These files contain too many Greek terms consisting of 2, 3, 4 and even 5
words. Too many to deal with manually.

What would you say? Is there any possible way to do the job with an
expression like the one you gave me?
Can you think of anything? Does it not help that they are Greek characters
at the beginning of the line?
As far as I know, in Writer one can search by language, so perhaps also
for characters of a certain non-Latin language.
Combining this with an expression like the one before, it would probably
work.

I am not asking you to do the work for me, but I sincerely tried everything
I could and read as much as possible and still couldn't come up with anything.
I will not give up trying and reading, but since you obviously have much
more knowledge of the matter as well as experience, you could save me a lot
of time but also from possible errors in the resulting file.

Paul16 · December 25, 2014, 2:37am

Just a thought from what I remember of the previous posts, but will
Tom's idea of searching for the left parenthesis instead of the first
space not work?

Paul

Constantine · December 25, 2014, 2:48am

Hi Paul,

unfortunately not.
Not all definitions start with a left parenthesis. For example, verbs never
do, and neither do many other entries.
If that were the case, you are right, it would be very easy. Too easy, in fact.

Brian_Barker · December 25, 2014, 3:55am

Now I started the same procedure for the Greek-German files but... These files contain too many Greek terms consisting of 2, 3, 4 and even 5 words. Too many to deal with manually. What would you say? Is there any possible way to do the job with an expression like the one you gave me? Can you think of anything? Does it not help that they are Greek characters at the beginning of the line?

Yup. Try searching for
([^a-z]*) (.*)
and replacing with
$1;$2 or $1\t$2
as before.
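
To see what this pattern does outside LibreOffice, here is a small Python sketch. Python's re module is case-sensitive by default, so the class is written as [^A-Za-z] here to exclude capital letters as well; the function name is hypothetical, and the idea is simply "the longest run of non-Latin characters that is followed by a space".

```python
import re

# Group 1: the longest prefix without ASCII Latin letters that is
# followed by a space. Group 2: everything after that space.
NON_LATIN_PREFIX = re.compile(r"^([^A-Za-z]*) (.*)$")


def split_greek_headword(line, sep=";"):
    """Put sep between a leading Greek term and the text that follows it."""
    return NON_LATIN_PREFIX.sub(lambda m: m.group(1) + sep + m.group(2), line)
```

Multi-word Greek terms stay together because spaces are themselves non-Latin characters. For the same reason, digits and parentheses that directly follow the Greek term are also pulled into the first field.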

I am not asking you to do the work for me, ...

Oh, I think you did! But no matter. ;^)

I trust this helps.

Brian Barker

Constantine · December 25, 2014, 4:55am

Dear Brian,
you are the greatest.

Yup. Try searching for
([^a-z]*) (.*)
and replacing with
$1;$2 or $1\t$2
as before.

Days ago I printed out the table of regular expressions from the Writer
help file and experimented with it quite a lot, but I missed the (.*) part.
I probably wouldn't have come to it for a long time.
This works perfectly with words at the beginning of the line, but... I still
have a small problem.

What I mean, you can see at the following example:

αιτία 1. Ursache, Grund, causa
changes to
αιτία 1.;Ursache, Grund, causa
should be
αιτία;1. Ursache, Grund, causa

αιτιολογία (93 § 3 Σ ) Begründung
changes to
αιτιολογία (93 § 3 Σ );Begründung
should be
αιτιολογία;(93 § 3 Σ ) Begründung

How can I avoid that? The semicolon or tab should be before the number and
the parenthesis.
Please tell me this last thing, I really don't know how to.

I am not asking you to do the work for me, ...

Oh, I think you did! But no matter. ;^)

If it goes like a duck and sounds like a duck......
Yes, you are right to think so, but it is very important to me, for you to
know, that I am not one of those lazy guys who let others do the work for
them without googling or reading the documentation themselves first.
I really try very hard myself first and then ask for help. But some things,
I just do not get.

I trust this helps.

Yes, it does help of course.
I hope I can do something in return for you to show you my appreciation.

Constantine


toki · December 25, 2014, 5:13am

How can I avoid that? The semicolon or tab should be before the number and
the parenthesis.

It looks like the match is occurring on glyphs that utilize the Latin
writing system, before you get to German text.

Please tell me this last thing, I really don't know how to.

I'm assuming this is still a text file.

Segregate the material into two files. GREP probably is the easiest tool
to use.
File one is words that contain the left hand parenthesis.
File two is words that do not contain the left hand parenthesis.

For the second file, use the regex that you were using.
For the first file, do the search on the left hand parenthesis.

jonathon

Constantine · December 25, 2014, 5:41am

Thank you for your suggestion jonathon-4, but after working for more than 20
hours non-stop on these files, I am not even able to do that.

BUT I came up with a lazy solution: I just replaced 1 with QQQ and ( with
UUU and ran what Brian suggested.
Et Voilà, it worked. And then of course replaced it back.
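
The placeholder trick can be written down as a short Python sketch. The stand-ins QQQ and UUU are the ones mentioned above; the function name and the exact pattern ([^A-Za-z] rather than [^a-z], since Python matches case-sensitively by default) are assumptions for this illustration, and it only works if QQQ and UUU never occur naturally in the text.

```python
import re

# Characters that must not stay in the first field, and their stand-ins.
PLACEHOLDERS = {"1": "QQQ", "(": "UUU"}
NON_LATIN_PREFIX = re.compile(r"^([^A-Za-z]*) (.*)$")


def split_with_placeholders(line, sep=";"):
    """Hide '1' and '(' behind Latin placeholders, split, then restore them."""
    # Step 1: the Latin letters of the placeholders stop the Greek prefix.
    for char, token in PLACEHOLDERS.items():
        line = line.replace(char, token)
    # Step 2: split after the run of non-Latin characters.
    line = NON_LATIN_PREFIX.sub(lambda m: m.group(1) + sep + m.group(2), line)
    # Step 3: put the original characters back.
    for char, token in PLACEHOLDERS.items():
        line = line.replace(token, char)
    return line
```

As in the post, only "1" and "(" are covered; other digits directly after the Greek term would need their own placeholders.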

Thank you all for your help guys,
and ESPECIALLY you Brian... As you see I am not that lazy. I do try to
help myself, well, as far as I can after such long hours.

I wish you all merry Holidays and all the best for the New Year.

I will probably see you all around here in the future too.

Constantine

Constantine · December 25, 2014, 6:17am

Brian,

you are unbelievable!!!

While I solved the problem with my very sloppy trick and was writing my mail
in order to inform you about it, you were looking for a correct solution and
writing this very long and very very detailed answer.

I am just speechless.

I saved all of your instructions, not twice but three times and also printed
them out. They are priceless.

These expressions are things that I need very often and very badly when
working on my translations and they will make my life much easier in the
future.
I know, that even if I had spent 30 more hours studying the manual, I would
probably never have come to such clean expressions.

Thank you for everything. I really hope I can do something for you in the
future.

Constantine

Brian_Barker · December 25, 2014, 7:23am

you are unbelievable!!!

I hope not!

While I solved the problem with my very sloppy trick ...

Oh, what you did - replacing a text item temporarily with a placeholder that won't occur naturally in the text in order to simplify a search - is a useful technique and not at all sloppy!

I am just speechless.

Don't be that: I'm guessing you may soon need to say "Thank you" to Santa.

I know, that even if I had spent 30 more hours studying the manual, I would probably never have come to such clean expressions.

I imagine you'll now be able to do such things for yourself: that's the idea, of course.

Thank you for everything.

No probs!

Brian Barker

J.A_de_Vries · December 25, 2014, 10:54am

Hi Constantine,

I am a bit late in the thread to help you with this specific case (glad
to see it is solved), but I'd like to suggest something. I think you
wrote that you are working on a Linux system and are willing to learn.
In that case one of the best things you can do to help yourself with
similar problems in the future is to look into vim and especially into
regular expressions too. Your replacement needs would have been
something that regular expressions would have helped you with a lot. In
particular things like positive/negative lookahead/lookbehind. If vim is
not your thing, then Emacs might be more to your liking. Both can be a
tremendous help with any problem that has to do with text. It will take
some effort to learn vim/Emacs and regular expressions, but it will be
worth every second.
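
As a small taste of what a lookahead can do for the Greek-German case discussed in this thread, here is a hedged Python sketch (Python's re module is used instead of vim purely for illustration). The character class listing digits, "(", Latin letters and German umlauts is an assumption about what can start the German part; the function name is hypothetical.

```python
import re

# A space that is immediately followed by a digit, "(" or a Latin letter.
# The lookahead (?=...) only checks the next character, it does not consume it.
BOUNDARY = re.compile(r" (?=[0-9(A-Za-zÄÖÜäöüß])")


def split_at_boundary(line, sep=";"):
    """Replace the first space that precedes non-Greek material with sep."""
    return BOUNDARY.sub(lambda m: sep, line, count=1)
```

Spaces between Greek words are skipped because they are followed by Greek letters, so multi-word Greek terms stay in the first field while a leading "1." or "(" goes to the second.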

Grx HdV

krackedpress · December 25, 2014, 12:26pm

WoW
where do you get an "easy" documentation file that has this type of search parameters and "coding"?

Luuk · December 25, 2014, 12:34pm

After a lot of responses on how to do this in Writer,
here is a short note on how to do this in Calc.

Open the text file. When the 'Text Import' wizard is shown, do:
1) Select character set 'Unicode (UTF-8)'
2) Separator options: 'Separated by', check 'Tab' and 'Space'; other options should not be checked.
3) At 'Text delimiter' type a space
4) Click 'OK'

5) Insert a column B, and fill it with a semicolon ';'

6) Click Save As, type a name, and check 'Edit filter settings'
7) The 'Export Text File' wizard should be shown.
8) Character set: 'Unicode (UTF-8)'
9) Field delimiter: space ' '
10) Text delimiter: <empty> ''
11) checkboxes: only leave 'Save cell content as shown' checked.....