Removing Index Markers from Writer: a How-To

CVAlkan · January 28, 2014, 3:45pm

In previous posts, I described how Writer adds extra index markers when
updating an Alphabetical Index. One side effect of this behavior is that,
even if an item is later removed from the concordance file, the marker
remains in the text, and therefore in the index.

So, here's how to remove all of the index markers from a Writer document so
you can start with a clean slate. To do this, you will need to be running
LibreOffice on some flavor of Linux/Unix, or at least on a system that has a
command line or some text editor with "sed" or "grep" capabilities.

1: Make a backup of your Writer document. You know the consequences if
something goes amiss.
2: Open the document in Writer, and choose Save As "OpenDocument Text (Flat
XML) (fodt)"
   This creates an uncompressed XML version of the document.
   On my system (Ubuntu), I was unable to unzip the odt version directly, as
the OS complained it was malformed. Using Writer's native capability is a
better idea in any case.
3: Close the document and exit Writer.
4: Open a command line shell, preferably in the directory containing the
fodt file.
5: Run the following command (the sed expression must stay on one line; the
trailing backslashes continue the command across lines):
   sed 's/<text:alphabetical-index-mark text:string-value="\([A-Za-z]*\)"\/>//g' \
       < Old_File_Name_and_Path.fodt \
       > New_File_Name_and_Path.fodt
   Depending on the file size and processor speed, this may take a bit.
   If this gives errors, you're on your own.
6: Close the command line shell.
7: Open the new "cleansed" fodt file with Writer.
8: The file should look the same but without any alphabetical index markers.
(Your index formatting is still there, though)
9: Go to where your alphabetical index is located, right click on it and
select "Update Index/Table".
A: If any entries still appear in the index, go to the referenced pages and
manually delete the markers there. Apparently, some of the index markers are
embedded in others and aren't found by the sed command above.
   I didn't bother to try figuring out how or why that happened. I had
several hundred markers, of which only five weren't removed.
B: Now, go back to the index and select Edit Index/Table, then File | Open.
C: Select the original concordance file (assuming you have it set up how you
want it), and let Writer go do its thing.
D: You now have a "clean" document with no duplicate index entries.
E: LOOK AT IT CAREFULLY, of course, before replacing your original. The
document I tried this on was over four hundred pages with lots of tables,
graphics and so forth, and I found no problems, but it's up to you to
determine if everything is ok.
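
For anyone who wants to try the expression from step 5 before running it on a
real file, here is a minimal sketch. The sample strings and the function name
strip_markers are made up for illustration; real fodt content will look
different. The second sample shows one case the pattern does not cover: a
marker whose text contains a space. That may or may not be related to the few
markers that were left behind in my document.

```shell
# Hypothetical one-line stand-ins for .fodt content.
sample='<text:p>An <text:alphabetical-index-mark text:string-value="apple"/>apple a day.</text:p>'
multi='<text:alphabetical-index-mark text:string-value="apple pie"/>pie'

# Same expression as in step 5, kept on a single line.
strip_markers() {
  sed 's/<text:alphabetical-index-mark text:string-value="\([A-Za-z]*\)"\/>//g'
}

cleaned=$(printf '%s\n' "$sample" | strip_markers)
left=$(printf '%s\n' "$multi" | strip_markers)

printf '%s\n' "$cleaned"   # marker removed
printf '%s\n' "$left"      # unchanged: "apple pie" contains a space

# To count markers still present in a converted file (placeholder file name):
# grep -o '<text:alphabetical-index-mark' New_File_Name_and_Path.fodt | wc -l
```

If the grep count is not zero after the sed pass, those are the markers that
need to be removed by hand.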

I hope this helps any others who might be using alphabetic indexes.

Michael_Reich · January 28, 2014, 7:05pm

Thanks for this work-around. I think it will also work on Mac OSX as it
includes sed in Terminal.
I will try it as soon as I get a chance.

pbw · January 29, 2014, 5:14am

I've asked a question on this list about soffice, and this topic is the reason I was asking.

Let's say we can use the command line to 1) convert an .odt to an .fodt file, and 2) to convert it back again.

If so, I can write a script that uses soffice to do
1) as above
perl -e .... # perform the text substitutions
2) as above to re-establish the modified .odt file.

Peter West
"Other seed fell among thorns, and the thorns grew up and choked it..."

CVAlkan · January 29, 2014, 1:23pm

Peter:

I mentioned sed and grep, but don't see any reason why perl couldn't be used
as well. If you test this and it works, please post back to give others
another option.

BUT: my sed command only removed the markers from the fodt. As I mentioned I
was unable to convert the odt to an fodt (essentially uncompressing the odt
to readable xml) using the unzip capability of my OS (as I'm pretty sure
could be done with earlier open office documents).

Since LO can easily write and read fodt files, though, it really wasn't
necessary to do any file format conversion, and I didn't bother spending the
time to figure out how to do everything in one shot.

Frank

pbw · February 4, 2014, 5:44am

Hi Frank,

Finally got back to this. Can you check this for me against an .fodt file, please?

perl -pi'orig_*' -e 's/<text:alphabetical-index-mark text:string-value="[[:alpha:]]*"//g'

pbw · February 4, 2014, 5:51am

The file name goes at the end of the command, of course.

pbw · February 4, 2014, 10:16am

Hi Frank,

This script works on OS X, to the extent that it does the conversions and creates new filtered odt files. The only questions are whether it works in Linux, and whether the filter does the right thing with the index markers.

I haven't created any files with such index-markers, so maybe you can run it against your files to see how it goes.

http://pbw.id.au/src/sh/strip-odt-index-markers

It's a shell script, using soffice and perl.

CVAlkan · February 4, 2014, 3:54pm

Peter:

The actual perl command should be changed slightly to (all one line):
perl -pi'orig_*' -e 's/<text:alphabetical-index-mark text:string-value="[[:alpha:]]*"\/>//g' Index_Experiment.fodt

After [[:alpha:]]*" the \/> needs to be added to remove the "/>" ending of
the XML tag - otherwise it seems to work fine.

The full blown shell script you sent me (I don't see it here on the forum
for some reason) needs to be modified in the same way of course.

I used [A-Za-z] instead of the [:alpha:] that you used because some systems
don't respect that substitution syntax (I can't remember what it's called),
which limits things just a bit, but a comment might be added to take care of
that - the [:alpha:] syntax, again, is probably a little easier to
understand for those not familiar with grep, sed and their relatives.
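
As a quick comparison, this sketch runs both character classes through sed on
made-up sample lines, with LC_ALL=C set so the result does not depend on the
locale. For plain ASCII letters the two behave the same. Neither one matches
a space or a digit. The third pattern, [^"]*, is my own assumption of a
broader alternative (anything up to the closing quote) and is not what either
of us actually ran on a real document.

```shell
# Hypothetical sample lines.
line='x<text:alphabetical-index-mark text:string-value="Index"/>y'
spaced='x<text:alphabetical-index-mark text:string-value="two words"/>y'

# Explicit ASCII ranges.
a=$(printf '%s\n' "$line" | LC_ALL=C sed 's/<text:alphabetical-index-mark text:string-value="[A-Za-z]*"\/>//g')

# POSIX character class.
b=$(printf '%s\n' "$line" | LC_ALL=C sed 's/<text:alphabetical-index-mark text:string-value="[[:alpha:]]*"\/>//g')

# Assumption: a broader class that accepts anything except a double quote.
c=$(printf '%s\n' "$spaced" | LC_ALL=C sed 's/<text:alphabetical-index-mark text:string-value="[^"]*"\/>//g')

printf '%s\n%s\n%s\n' "$a" "$b" "$c"
```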

You should post the shell script, as it is probably easier to use (?) for
some folks than my simple sed command, since it takes care of hand-holding,
locating the right directories and so forth.

Of course, I hope some of the LibreOffice developers will incorporate the
option and capability to remove old markers when an index is regenerated,
and "fix" the generator so that it doesn't add additional markers to the
same word when updating takes place. (It doesn't always do that, but I
haven't figured out the exact conditions when it does).

So - good work.

Peter_West · February 4, 2014, 11:55pm

D'uh!

Thanks Frank. I've updated the script with that change.

Attachments aren't accepted by this list, are they? What did you mean by "post the shell script"?

Peter

Peter West
"...and a sword will pierce through your own soul also..."

CVAlkan · February 5, 2014, 12:12pm

Hi Peter:

Hmmm. I hadn't noticed that there was no "attachment" button on this forum,
as I haven't ever uploaded anything longer than a few lines.

Under the "More" button on the top of the message box, there is an option to
"upload a file" but I don't know if other users will have easy access to
that.

Maybe someone who is more familiar with this forum can advise. But, even if
it only goes to the LibreOffice folks themselves, I think that would be
useful, since perhaps they can use it as a guide for further development of
the indexing feature.

Frank

Tom_Davies1 · February 5, 2014, 12:38pm

Hi
CVAlkan (=Frank) is using Nabble to view things from the mailing-list
and for posting. The 2 other ways are GMane and as just normal
emails. Of those 3 ways it's only Nabble that has a system for
uploading 'attachments'.

Follow the links in Frank's email to get to the right place in Nabble
or go through the official LibreOffice website thru "Get Help" and
find the correct thread by looking at the subject-lines and/or
date&time.

With scripts people have often just copy&pasted the code directly into
an email rather than try to upload a file. Either way around is fine.
Regards from
Tom