[Writer] Regular expressions

gordom · January 8, 2013, 6:58pm

Hello everyone.
I would appreciate your help with regular expressions. I have a document consisting of hundreds of lines. A small sample is here:

Set: 01SA34509
0109SA
011017B
01020207B
010902B
01090002
011007B
01090001
090110
Set: 0134501
011101
01110102
01110103
080908
Set: 0111679SE
0111SE

I need to delete all text except the lines that start with the word "Set". If I use the regular expression "set:.+", all the lines that should be kept are selected. I can't find a way to reverse this selection. I tried "[^set:.+].+" and "[^(set:.+)].+" but they don't work. Could you please give me any clues? Thanks in advance. Regards,
gordom

Miroslaw_Zalewski · January 8, 2013, 7:21pm

Since the LibreOffice regex engine is crippled and doesn't support lookaheads, the
short answer is:
no, you can't do that.

BUT do the lines you want to delete happen to fall into some common pattern? In
your sample (which may or may not be representative of the entire text) they
do. In fact, you want to delete all lines that start with a number, sometimes
followed by a letter. You can use this regexp to match these lines:

^[0-9]+.*

^ (caret) stands for "beginning of line". So this will match every line that starts
with at least one digit.

If you don't care about formatting, you may also export your file to TXT and
use Perl, which has superior regex capabilities.
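
For the plain-text route, here is a minimal sketch of the lookahead idea. It uses Python's re module instead of Perl, purely as an illustration; the sample text is taken from this thread and the variable names are made up for the example.

```python
import re

# Sample text exported from the document, one entry per line.
text = (
    "Set: 01SA34509\n"
    "0109SA\n"
    "S01020207B\n"
    "HB01110102\n"
    "Set: 0134501\n"
    "011101\n"
    "Set: 0111679SE\n"
    "0111SE\n"
)

# ^        start of a line (because of re.MULTILINE)
# (?!Set:) negative lookahead: the line must NOT begin with "Set:"
# .*\n?    the rest of the line, plus its line break if there is one
not_set_line = re.compile(r"^(?!Set:).*\n?", re.MULTILINE)

# Deleting every non-"Set:" line leaves only the lines to keep.
result = not_set_line.sub("", text)
print(result)
```

Only the three lines beginning with "Set:" remain, in their original order.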

Miroslaw_Zalewski · January 8, 2013, 7:37pm

Another idea:
since you can select the lines you want to preserve, why not copy them and
paste them into a new document? This should be the easiest and most error-proof solution.

gordom · January 8, 2013, 8:11pm

On 2013-01-08 20:20, Mirosław Zalewski wrote:

Since the LibreOffice regex engine is crippled and doesn't support lookaheads, the
short answer is:
no, you can't do that.

That's a pity.

BUT do the lines you want to delete happen to fall into some common pattern? In
your sample (which may or may not be representative of the entire text) they
do. In fact, you want to delete all lines that start with a number, sometimes
followed by a letter. You can use this regexp to match these lines:

^[0-9]+.*

Unfortunately my sample wasn't very accurate and can't be regarded as fully representative. The pattern is more complex actually.

regards,
gordom

gordom · January 8, 2013, 8:11pm

On 2013-01-08 20:37, Mirosław Zalewski wrote:

I use "set:.+" regular expression, all these lines, that should be kept,
are selected.

Another idea:
since you can select the lines you want to preserve, why not copy them and
paste them into a new document? This should be the easiest and most error-proof solution.

I can't simply copy and paste them because I will lose the line order. But I did another thing: I used Calc to sort the data, delete the unwanted text and restore the previous order.

Regards :-),
gordom

Nino_Novak1 · January 8, 2013, 8:13pm

Search for: ^[^Ss]
Replace with: (leave empty)
[x] Regular expression

works fine for me (LibreOffice 3.5.4.2 on win-32)

Nino

Nino_Novak1 · January 8, 2013, 8:17pm

On 2013-01-08 20:20, Mirosław Zalewski wrote:

Since the LibreOffice regex engine is crippled and doesn't support
lookaheads, the
short answer is:
no, you can't do that.

That's a pity.

but only if it's really true

BUT do the lines you want to delete happen to fall into some common
pattern? In
your sample (which may or may not be representative of the entire text) they
do. In fact, you want to delete all lines that start with a number, sometimes
followed by
a letter. You can use this regexp to match these lines:

^[0-9]+.*

Unfortunately my sample wasn't very accurate and can't be regarded as
fully representative. The pattern is more complex actually.

Then you should try to show a better example: regexes are about pattern matching, and without knowing the pattern we cannot guess how to match it.

Nino

gordom · January 8, 2013, 8:45pm

In "real" life there are also lines starting with letters. Only those with "Set:" at the beginning should be kept; the rest is to be deleted.

Set: 01SA34509
0109SA
011017B
S01020207B
010902B
01090002
011007B
01090001
090110
Set: 0134501
011101
HB01110102
01110103
080908
Set: 0111679SE
0111SE

I'm surprised that there is no simple way to find everything except "Set:.+$"

gordom

Miroslaw_Zalewski · January 8, 2013, 10:38pm

Although you have already solved your problem by other means, let's check
other possibilities.

In "real" life there are also lines starting with letters. Only
those with "Set:" at the beginning should be kept; the rest is to be
deleted.

Looking at this sample, I see three more patterns that could get the work done.
1. It seems that lower-case letters occur only in "Set", which is in the lines you
want to preserve. So maybe looking for lines containing only numbers and
upper-case letters will do the trick.
2. In this sample, spaces are only in the lines you want to preserve. What about
matching lines without whitespace characters?
3. It looks like a colon occurs only in the lines you want to preserve. Match lines
without colons, maybe?
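
A quick way to check the three ideas against the posted sample is sketched below in Python's re module (not in LibreOffice, so the patterns may need adapting for Find & Replace). The dictionary keys and variable names are made up for the example.

```python
import re

# The second sample from this thread.
sample = [
    "Set: 01SA34509", "0109SA", "011017B", "S01020207B", "010902B",
    "01090002", "011007B", "01090001", "090110", "Set: 0134501",
    "011101", "HB01110102", "01110103", "080908", "Set: 0111679SE",
    "0111SE",
]

# One pattern per idea; each must match a whole line that is to be deleted.
heuristics = {
    "digits and upper-case letters only": re.compile(r"[0-9A-Z]+"),
    "no whitespace": re.compile(r"\S+"),
    "no colon": re.compile(r"[^:]+"),
}

# Lines that should be deleted: everything not starting with "Set:".
unwanted = [line for line in sample if not line.startswith("Set:")]

# For each idea, check whether it selects exactly the unwanted lines.
agreement = {}
for name, pattern in heuristics.items():
    selected = [line for line in sample if pattern.fullmatch(line)]
    agreement[name] = (selected == unwanted)
    print(name, "->", agreement[name])
```

On this sample all three ideas select exactly the lines to delete; whether they hold for the full document depends on data not shown here.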

I'm surprised that there is no simple way to find everything except
"Set:.+$"

Well, regexps are most often used in programming languages or tools like grep,
where you can easily get a "reverse match" mode (print everything that does NOT
match). But LO is not a programming language and its simple regexp engine is
simply not sophisticated enough in many cases.

I have read somewhere on the TDF wiki about incorporating a mature regexp library
into LO, but this idea was rejected due to portability issues. LO must run on
Windows and OS X as well as on Linux, whereas the library in question runs only on
Linux.

Miroslaw_Zalewski · January 8, 2013, 10:39pm

Unless you are 100% sure that my statement is not true, please don't question
it.

And you are NOT 100% sure until you post an LO Writer regexp that uses the
negation pattern the OP is looking for.

Tom_Davies · January 9, 2013, 1:08am

Hi :)
I like the quote "Never believe anything until it's been denied by a government minister", but now I can't find where the quote comes from, so I'm beginning to doubt it was ever really said by anyone.
Regards from
Tom

Winston_Chuen-Shih_Y · January 9, 2013, 4:04am

Gordom,

Miroslaw mentioned grep earlier. In Linux, you can achieve your goal by typing something like the following:

grep --extended-regexp "^Set" input.txt > output.txt

This gets the lines in input.txt that start with "Set" and writes these lines to output.txt.

You might not even have to type the "--extended-regexp" part.
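
If grep is not available, the same filtering can be sketched in Python. The function name grep_lines and its invert parameter are made up for this example; invert mimics the "reverse match" mode (grep's -v option) mentioned earlier in the thread.

```python
import io
import re

def grep_lines(pattern, stream, invert=False):
    """Yield the lines of `stream` that match `pattern`.

    With invert=True, yield the lines that do NOT match instead.
    """
    regex = re.compile(pattern)
    for line in stream:
        matched = regex.search(line) is not None
        if matched != invert:
            yield line

# A few lines from the sample; a real file opened with open() works the same way.
source = "Set: 0134501\n011101\nHB01110102\nSet: 0111679SE\n0111SE\n"

kept_lines = list(grep_lines(r"^Set", io.StringIO(source)))
dropped_lines = list(grep_lines(r"^Set", io.StringIO(source), invert=True))
print("".join(kept_lines))
```

Here kept_lines holds the two "Set:" lines and dropped_lines holds everything else.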

Winston

Brian_Barker · January 9, 2013, 6:17am

I think this is fairly simple. I'm assuming that your "lines" are in fact separate paragraphs: that is, that they are separated by paragraph breaks, not line breaks.

o Using Find & Replace with "Regular expressions" ticked, search for ^Set and click Find All. This will select just those words, where they occur at the start of a line, not the whole lines.
o Click the down-arrow at the right of the Apply Style window in the Formatting toolbar, and select some (paragraph) style different from the style of your text (perhaps Heading?). Since this is a paragraph style, it will apply to the whole of each relevant line (paragraph), not just the selected occurrences of the word "Set".
o Back in the Find & Replace dialogue, click "Search for Styles", choose your original style (perhaps Default?) in the "Search for" box, and click Find All.
o Press Delete to remove all the unwanted lines.
o Tick "Regular expressions" again, and search for ^$ - replacing with nothing. Click Replace All. This removes the empty paragraphs left by the previous process.
o Go to Edit | Select All (or press Ctrl+A) and use the Apply Style window again to reset your paragraph style appropriately (to Default?).

I trust this helps.

Brian Barker

Brian_Barker · January 9, 2013, 6:33am

This poses a nice paradox. According to your own stricture, you shouldn't be criticising Mr Novak unless you can be sure that he cannot be correct in even questioning your claim. And again, according to your same requirement, you cannot be sure that he cannot be correct to question your claim until you have posted a proof of your belief - in other words a proof of your original claim.

;^)

Brian Barker

gordom · January 9, 2013, 10:02am

It seems to work indeed. Thank you very much :-). Regards,

gordom

Johnny_Rosenberg · January 9, 2013, 12:26pm

This worked for me with your example lines a minute ago:

Ctrl+h (or whatever method you prefer for opening the Search and
Replace dialogue).
☒ Regular expressions
Search for: ^[^S][^e][^t].*$
Replace with: (leave empty)
Click Replace All

Search for: ^$
Leave everything else as is
Click Replace All.

Done.

The funny thing is that the last part didn't work for me maybe ten
minutes ago, but I must have done something slightly different that
time…

So, in short:
1. Replace all ^[^S][^e][^t].*$ with nothing (regular expressions on).
2. Replace all ^$ with nothing (regular expressions still on).
Done.

With case-sensitive matching, step 1 would still erase a variant such as
”sET”, where none of the three letters matches exactly, so if you
want to keep all possible combinations of the word ”set”, you should
rather try: ^[^Ss][^Ee][^Tt].*$
I didn't try that myself, but it should work. There is always Undo if
it doesn't…
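
To see which lines the pattern from step 1 removes, here is a small check using Python's re module rather than the Find & Replace dialogue (an assumption: LibreOffice's own engine and its "Match case" setting may behave differently). The last two sample lines are hypothetical, added only to show the effect of the second and third character classes.

```python
import re

# Same pattern as in step 1, applied to one whole line at a time.
pattern = re.compile(r"[^S][^e][^t].*")

lines = [
    "Set: 01SA34509",
    "0109SA",
    "S01020207B",
    "HB01110102",
    "Set: 0134501",
    "0e0110",  # hypothetical line with "e" in second position
    "01t110",  # hypothetical line with "t" in third position
]

# A line is removed only if ALL three leading characters differ from S, e, t.
removed = [line for line in lines if pattern.fullmatch(line)]
kept = [line for line in lines if not pattern.fullmatch(line)]

print("removed:", removed)
print("kept:   ", kept)
```

Each character class is tested on its own, so a line survives as soon as one of its first three characters is S, e or t in the right position, not only when it starts with "Set".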

Kind regards

Johnny Rosenberg
ジョニー・ローゼンバーグ

Johnny_Rosenberg · January 9, 2013, 12:37pm

Okay, forget it. This keeps everything that starts with an S, not only
the Set lines…
If that's not a problem, this is a fast way; otherwise it could
require quite some manual work or further Search and Replace
operations.
I'll give it a few more thoughts…

Kind regards

Johnny Rosenberg
ジョニー・ローゼンバーグ

Johnny_Rosenberg · January 9, 2013, 12:48pm

Another method:
1. Ctrl+h, Search for: ^Set.*$
2. ☒ Regular expressions, click Search all. Close the dialogue.
3. Ctrl+x Ctrl+a Ctrl+v Ctrl+h
4. Search for: Set
5. Replace with: \nSet
6. ☒ Regular expressions, click Replace all. Close the dialogue.
7. Remove the first line, which now is empty, manually.

Done.
Looks like many steps, but it is quick, actually.

Kind regards

Johnny Rosenberg
ジョニー・ローゼンバーグ

Johnny_Rosenberg · January 9, 2013, 1:03pm

Maybe we should do a feature request.
Somewhere in the dialogue, there could be something like this:
☐ Inverse Search/Replace

Leaving it like this, everything should work like usual.

☒ Inverse Search/Replace
Now, Search should find first non-match, Search all should find all
non-matches, Replace should work as usual – that is replace the
currently highlighted text and Replace all should replace all
non-matches.
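
Roughly what such an option could do, sketched in Python outside LibreOffice. The function names are made up, and treating each paragraph as the unit of matching is an assumption, not part of the proposal above.

```python
import re

def inverse_find_all(pattern, paragraphs):
    """Return (index, text) for every paragraph the pattern does NOT match."""
    regex = re.compile(pattern)
    return [(i, p) for i, p in enumerate(paragraphs) if regex.search(p) is None]

def inverse_replace_all(pattern, paragraphs, replacement):
    """Replace every non-matching paragraph with `replacement`; keep matches."""
    regex = re.compile(pattern)
    return [p if regex.search(p) else replacement for p in paragraphs]

paragraphs = ["Set: 0134501", "011101", "HB01110102", "Set: 0111679SE", "0111SE"]

non_matches = inverse_find_all(r"^Set:", paragraphs)
replaced = inverse_replace_all(r"^Set:", paragraphs, "")

print(non_matches)
print(replaced)
```

With the search term ^Set: this finds the three paragraphs that are not "Set:" lines, and an inverse Replace All with an empty replacement blanks them while leaving the "Set:" paragraphs untouched.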

Just an idea.

Johnny Rosenberg

Tom_Davies · January 9, 2013, 1:18pm

Hi
That does sound fairly insane and unlikely to ever be useful. I like it!! It would be great to have even if it's just to show-off that LO is ahead of the game in yet another way.
Regards from
Tom