Document 'corrupt' for LibreOffice, opens fine with other OOo-based software

Hello all,

We are trying to generate a document in Open Document Format and we're having a problem with the latest Mac version of LibreOffice (3.5.4.2): it says the document is corrupt, whereas other OOo-based software opens it correctly (namely IBM / Lotus Symphony 3.0.0 FP2 and OOo itself, version 3.1.1, all on the Mac). The 'repair' option is available and actually opens the document correctly, but we would prefer to have a usable document right away and not force people to repair it after each generation…

Here is how we generate the document: we use an existing ODF document as a kind of template, unzip it, extract all its contents and replace the content.xml file with one we generate ourselves. One of the generated documents causing the problem is attached.

Could anyone have a look and tell us what we are doing wrong? Or even better, if there's a way to make LibreOffice tell us what's actually wrong in the document, that would be great.

Thanks a lot.

Looks like the attached document didn't get through, so here is a link to it:
http://ubuntuone.com/0LoWLJjhmIkje4wrRLOpuo

On 04.06.2012 09:59, Eric Brunel wrote:

Looks like the attached document didn't get through, so here is a link
to it:
http://ubuntuone.com/0LoWLJjhmIkje4wrRLOpuo

Hi,

All of my recent office suites consider your document broken and suggest repairing it. The repair completes successfully and yields a section, a subsection and a picture. The repaired document opens fine.
Tested under Linux (32-bit) with LibO 3.3.4, LibO 3.5.4 and AOO 3.4.

Hope this helps,
Andreas


I did some testing on the document you created. First I downloaded the file, created a copy, and renamed the copy from .odt to .zip (doctext.zip). Then I opened the downloaded file with LO, allowing it to repair the file. Then I saved it (doctext(repaired).odt), created a copy of it, and renamed that copy from .odt to .zip. I then opened the two zip files next to each other.
My findings (a dash means the entry is absent):

                   doctext.zip    doctext(repaired).zip
Configurations2    -              0 bytes
META-INF           2.0 kB         1.2 kB
Pictures           767 bytes      767 bytes
Thumbnails         -              2.3 kB
content.xml        5.0 kB         4.8 kB
manifest.rdf       -              899 bytes
meta.xml           1.1 kB         1.1 kB
mimetype           39 bytes       39 bytes
settings.xml       6.8 kB         9.0 kB
styles.xml         14.3 kB        15.0 kB
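
The same side-by-side comparison can be scripted instead of done by hand in an archive viewer. The following is only a minimal sketch, assuming Python is available; the function names entry_sizes and compare_packages are made up for illustration. It lists individual entries rather than per-directory totals, and an .odt file can be read directly without renaming it to .zip.

```python
import zipfile


def entry_sizes(path):
    """Map each entry name in a zip/ODF package to its uncompressed size in bytes."""
    with zipfile.ZipFile(path) as package:
        return {info.filename: info.file_size for info in package.infolist()}


def compare_packages(original, repaired):
    """Return sorted rows of (name, size in original, size in repaired).

    A size of None means the entry is absent from that package.
    """
    before = entry_sizes(original)
    after = entry_sizes(repaired)
    return [(name, before.get(name), after.get(name))
            for name in sorted(before.keys() | after.keys())]


# Hypothetical usage with the two files from this thread:
# for name, old, new in compare_packages("doctext.odt", "doctext(repaired).odt"):
#     print(f"{name:45} {old!s:>10} {new!s:>10}")
```

Entries that show None on one side are the ones that exist in only one of the two packages.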

I will let others draw their own conclusions as to what this means.
Then again, I have some questions and comments. Why are you creating a document by only changing the content.xml file? From the above information, clearly other files needed to be changed as well. And as Andreas pointed out, LO will report the error regardless of the OS used on a computer.
Would creating a template with the layout you want work just as well or better? It can be used to open a new document. Then you can enter or copy material into the new document. When you save this, LO will update all the files and folders within the ODT file.

--Dan

This is exactly what I've been doing to try to create a correct document myself. The problem is, I'm not really sure which parts of the document I should change to prevent it from being reported as corrupt. Some of the information in these files does not look important at all, or is not related to the actual contents of the document. It seems that there is some internal consistency problem between the files, but since LO doesn't say anything other than 'the document is corrupt', it's very hard to figure out what's happening here.

    Then again, I have some questions and comments. Why are you creating a document by only changing the content.xml file? From the above information, clearly other files needed to be changed as well. And as Andreas pointed out, LO will report the error regardless of the OS used on a computer.

Yes, we've been working on the matter a bit further and figured out that too: it seems the latest versions of all OOo-based software we could test on any platform do report the document as corrupt. So I was hoping to figure out what had changed between the versions that open it without problem and the current ones to know what to do to generate it correctly.

    Will creating a template with the layout you want work just as well or better? It can be used to open a new document. Then you can enter or copy material into the new document. When you save this, LO will update all the files and folders within the ODT file.

Well, the problem is, as I said in my original post, the document is generated, not created by hand. We have a description of the document contents in an external source, and we are using an existing ODF document only to get the styles from it. So it did seem to be enough to get all the files from the existing ODF document, replace the content.xml in it with the one we generate, and zip everything back into a new ODF file. And it used to work, but it doesn't anymore… But basically, we would like to be able to generate the document without running LO at all. With the method we're using, it doesn't even need to be installed, and that's even better…

And by the way, if we open the existing ODF file (not the generated one), everything works fine. LO doesn't say it's corrupt and opens it without any problem.

Thanks a lot anyway to you and Andreas for your answers.

I need you to clear up a point for me. You did not say what format the generated document is in. This might make a difference. Also, how did you generate the content.xml file? Perhaps these have a similar answer.

--Dan

Eric Brunel wrote:

So if anyone could have a look and tell us what we are doing wrong. Or even better, if there's a way to make LibreOffice tell us what's actually wrong in the document, that would be great.

Hi Eric,

I've followed most of the discussion to date and I have had similar experiences in the past. When I repaired the document you posted, two new directories appeared, Configurations2 and Thumbnails.

In your original document, the directory META-INF contains a file manifest.xml, which references the two missing directories and two files:

Configurations2/accelerator/current.xml
Thumbnails/thumbnail.png

The first is empty and the second is a 181 x 256 image of the page. I do not know whether this is the core of your problem or not, but it is a place I for one would investigate.

Barry

Corruption could indeed be triggered by there being an inconsistency between META-INF/manifest.xml and what is present in the package.

It is interesting if that is a complaint about Configurations/.../current.xml (usually a zero-length useless component), which is private information from the producer that is not defined by ODF and can't be meaningfully used by any software other than the producer. For the most part, that material appears to be gratuitous and unnecessary.

The missing Thumbnails/thumbnail.png is also benign.

Working from memory in order to reply quickly, I believe that there *IS* an ODF requirement for every stream in the package (a Zip) to be accounted for in META-INF/manifest.xml except the manifest itself, mimetype, and anything else in META-INF/ (except if it is meant to be encryptable). I suspect the specifications are silent concerning META-INF/manifest.xml entries that have no corresponding stream in the Zip. I need to confirm the facts.
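
A check along these lines can be automated. The following is only a minimal sketch, assuming Python and the standard ODF manifest namespace; the function name manifest_mismatches is made up for illustration, and the exemptions for mimetype and META-INF/ simply follow the description above rather than a confirmed reading of the specification.

```python
import zipfile
import xml.etree.ElementTree as ET

MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"
MANIFEST_PATH = "META-INF/manifest.xml"


def manifest_mismatches(path):
    """Compare the manifest's file entries with the streams actually in the package.

    Returns (listed_but_missing, present_but_unlisted) as two sorted lists.
    """
    with zipfile.ZipFile(path) as package:
        # Zip directory records end with "/" and are not streams, so skip them.
        streams = {n for n in package.namelist() if not n.endswith("/")}
        root = ET.fromstring(package.read(MANIFEST_PATH))

    listed = set()
    for entry in root.iter(f"{{{MANIFEST_NS}}}file-entry"):
        full_path = entry.get(f"{{{MANIFEST_NS}}}full-path")
        # "/" stands for the package itself; paths ending in "/" are directories.
        if full_path and not full_path.endswith("/"):
            listed.add(full_path)

    # Assumption taken from the reply above: mimetype and anything under
    # META-INF/ need not be listed in the manifest.
    exempt = {n for n in streams if n == "mimetype" or n.startswith("META-INF/")}
    missing = sorted(listed - streams)
    unlisted = sorted(streams - listed - exempt)
    return missing, unlisted
```

Two empty lists mean the manifest and the package agree at the level of file names; that alone does not show that an office suite will accept the document.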

- Dennis

THINKING OUT LOUD:

I would not be surprised if this tightening of consistency with the manifest is for purposes of improved detection of tampering and the possible incidence of a security exploit of one kind or another. There is a practice in security cases to avoid providing details since it provides too much information for someone attempting to craft an exploit. That's a stretch in this case.

It would be useful to soften the message to one of "There are inconsistencies and it is possible the document is corrupted." The request for permission to attempt correction by eliminating the inconsistencies should be quite clear. It would also be valuable to report whether there was any apparent data loss or that repair did not involve loss of anything critical to the document. Encouraging a save-as of the repaired document to a different location would also be handy in restoring the confidence of the user in the successful effort.

MORE THINKING OUT LOUD:

PS: If the original document was digitally signed, repair will lose the signature of course, but there should have been difficulties with verification of the presumed original too. If the original document was encrypted ("Saved with Password"), how that fails is very dicey.

PPS: It is cumbersome to put too much into the analysis and repair as part of the mainstream operation of a product such as LibreOffice where performance is an important feature. A separate analysis and repair utility might be more valuable (and LO could launch it, of course) for providing smoother handling and forensic analysis when the product software detects an inconsistency.

Thanks to Dennis for his thoughts. They make sense to a newbie to this field (with 40 years of intermittent programming experience).

  Dennis E. Hamilton wrote:

Corruption could indeed be triggered by there being an inconsistency between META-INF/manifest.xml and what is present in the package.

It is interesting if that is a complaint about Configurations/.../current.xml (usually a zero-length useless component), which is private information from the producer that is not defined by ODF and can't be meaningfully used by any software other than the producer. For the most part, that material appears to be gratuitous and unnecessary.

To my mind, a 'standard' document, say for archival purposes, should not have any extraneous 'information'. LibreOffice can include private information, but other XML agents should gracefully ignore it. Likewise, LibreOffice should gracefully accept documents which do not conform to its private structure, provided they conform to an Open Document standard. Do we need an .odt checker?

If my understanding is correct we have an internally inconsistent document (directory) given that it refers to files which do not exist. Your reference to 'a tightening of consistency' strikes exactly the right chord to my way of thinking.

Barry

    I need you to clear up a point for me. You did not say what format the generated document is in. This might make a difference. Also, how did you generate the content.xml file? Perhaps these have a similar answer.

I'm not too sure I'm understanding the question so I'll describe exactly what we are doing: we have an internal representation in a tool that includes some text with style information, mainly paragraph style names. To export the document, we create a text document in LO or any other OOo-based software, and ask to use it as a kind of template. The document is supposed to contain paragraph styles with the same names as the ones referenced in our tool. That document is correct, since it has been created by LO itself, and is not reported as corrupt at all. For the document I sent out, we of course made sure all styles were correctly defined (they are, as they appear correctly after the document is repaired).

Then our tool unzips the LO document, opens its content.xml file, takes some parts from it (mainly the header and the <office:font-face- and <text:sequence-decls> blocks) and creates a new content.xml file, copying those parts into it with default values for the others. Then it writes the contents itself, with regular <text:p>, <text:h>, <text:list> tags. This writing is done 'manually', i.e. with the equivalent of printf calls. We're basically sure the document structure is correct, since it opens without problem in previous versions. We also ran a few XML validating tools on it, and no errors were reported at all.

As for the other files in the .odt, they are basically kept as in the original document, so they come from LO itself. The only thing is that we copy only the files styles.xml, mimetype, meta.xml, settings.xml and the whole contents of the META-INF directory. This used to work in former versions.
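
For reference, the repackaging step described above might look roughly like the sketch below. It is not the tool's actual code: it assumes Python, the function name build_document is made up, and it differs from the process described in two ways that are assumptions on my part. It copies every entry of the template, so that the copied manifest keeps matching the package contents, and it writes mimetype as the first entry without compression, which is what the ODF packaging rules ask for. Nothing in this thread establishes that either point is what triggers the 'corrupt' report.

```python
import zipfile


def build_document(template_path, content_xml, target_path):
    """Copy an ODF template package, swapping in a newly generated content.xml."""
    with zipfile.ZipFile(template_path) as template, \
         zipfile.ZipFile(target_path, "w", zipfile.ZIP_DEFLATED) as target:
        # Write mimetype first and stored (uncompressed).
        target.writestr("mimetype", template.read("mimetype"),
                        compress_type=zipfile.ZIP_STORED)
        for info in template.infolist():
            # Skip mimetype (already written) and zip directory records.
            if info.filename == "mimetype" or info.filename.endswith("/"):
                continue
            if info.filename == "content.xml":
                data = content_xml  # the generated replacement (str or bytes)
            else:
                data = template.read(info.filename)
            target.writestr(info.filename, data)
```

A call such as build_document("template.odt", generated_xml, "output.odt") would then produce the new package without LO being installed; both file names here are placeholders.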

Hope this is clearer now, and thanks again for your answer.
  -Eric -

Thanks for the hints. I worked on that, but no luck so far: I removed the references to the non-existing files and directories from the META-INF/manifest.xml file, but the document is still reported as corrupt. Then I did the reverse, keeping the existing META-INF/manifest.xml file and copying all the missing files and directories from the repaired document: LO still says the document is corrupt… I've checked the manifest.xml file thoroughly, and it does reference exactly all the files in the document, except everything in the META-INF directory itself and the mimetype file.

I would not be surprised if this tightening of consistency with the manifest is for purposes of improved detection of tampering and the possible incidence of a security exploit of one kind or another. There is a practice in security cases to avoid providing details since it provides too much information for someone attempting to craft an exploit. That's a stretch in this case.

This was what came to my mind too…

It would be useful to soften the message to one of "There are inconsistencies and it is possible the document is corrupted." The request for permission to attempt correction by eliminating the inconsistencies should be quite clear. It would also be valuable to report whether there was any apparent data loss or that repair did not involve loss of anything critical to the document. Encouraging a save-as of the repaired document to a different location would also be handy in restoring the confidence of the user in the successful effort.

Well, again, being told the reason why the document is reported as corrupt would be a great help too. As it is now, we have to rely on wild guesses to figure out what to correct in the generated document, and that's a long and painful thing to do…

Anyway, thanks a lot again for your answers.
  - Eric -

Silly question:

Can you unpack the repaired file and rezip it? Does it then open correctly?

I have tried something similar to your process, and what once worked stopped working.

Barry

Eric Brunel wrote: