How does 4.4 compress PDFs so well? Is there a quality problem?

I edit a community magazine using LibreOffice Writer. I create a PDF from the
final version, which I send electronically to the printing company, which
prints the PDF.

To create the PDF, I use the settings "Lossless compression" and do not
reduce image resolution. The resulting PDF is usually in excess of 100 MB,
presumably mainly because of all the images used.

This month, I have created the PDF using both the current stable version
4.3.4.1 and the beta version 4.4.0.0. (I used the beta version because of a
new small bug in 4.3.)

Fascinating results!

Visually, the two PDFs look the same (apart from the small bug). I've
printed a sample page from each PDF and compared them, and they appear
to be of equally good quality.

However, the sizes are dramatically different. The PDF from 4.3.4.1 is
157.7 MB, which is roughly what I'd expect from past experience. The same PDF
from 4.4.0.0 is a mere 27 MB, about a sixth of that size.

Adding up the sizes of all the images, they come to (approximately) 30 MB.
So, it seems to me that the current version perhaps stuffs the files with
unnecessary content — or perhaps decompresses the images when saving —
whereas the beta version seems to save the files decently compressed.

So, are you able to tell me…

* How is the PDF from the beta version so small, or the PDF from the current
  version so large?
* Am I going to get an unwelcome surprise in the quality of the printed copy
  if I send the PDF from 4.4.0.0?
* Does the beta version use some amazing compression to create its PDF, or
  does the current version stuff the file full of unnecessary content?

Thanks in advance.

@Paddy,

A printed copy of your PDF is not a very good test of document quality.
Embedded BMP representation within LibreOffice is at 300dpi. Export print
may be the vector format (wmf, emf, eps, svg) or a bitmap rendering
preview--at 300dpi.

You really need to open each PDF in suitable viewer and zoom in to 800% or
1200%. How do embedded images compare there? Are they bitmap? Or more
concise full resolution vector images?

Also, the platform you work on will impact handling of vector images.
Several helper programs are needed--Ghostscript, ImageMagick, pstoedit and
the mix will impact handling as bitmap or vector.

Please provide your OS details, and perhaps attach a sample output (via the
Nabble interface).

Stuart

V Stuart Foote wrote

A printed copy of your PDF is not a very good test of document quality.
Embedded BMP representation within LibreOffice is at 300dpi. Export print
may be the vector format (wmf, emf, eps, svg) or a bitmap rendering
preview--at 300dpi.

You really need to open each PDF in suitable viewer and zoom in to 800%
or 1200%. How do embedded images compare there? Are they bitmap? Or
more concise full resolution vector images?

Also, the platform you work on will impact handling of vector images.
Several helper programs are needed--Ghostscript, ImageMagick, pstoedit and
the mix will impact handling as bitmap or vector.

To clarify:

* All original images are either JPG or PNG.
* I don't need super-clear printing, as this is a simple non-profit
  volunteer-run community magazine.
* I have zoomed the two files to 6,400% in Adobe Reader XI on Windows, and
  they look identical.

V Stuart Foote wrote

Please provide your OS details, and perhaps attach a sample output (via
the Nabble interface).

I am using Linux Ubuntu 14.04 64-bit.

One of the files is too large for the Nabble interface, and so I am
including links to the files instead.
* PDF from version 4.3.4.1
<https://dl.dropboxusercontent.com/u/49313422/Version%204.3.4.1.pdf>
* PDF from version 4.4.0.0
<https://dl.dropboxusercontent.com/u/49313422/Version%204.4.0.0.pdf>

My suspicion now is that the current version converts the images to a
bitmap, whereas the new version saves the images in their original format.
Is there a way to look "inside" the PDF to see how the images are stored,
either on Windows or Linux?

Thank you for looking at this.

Paddy

@Paddy, *,

Hey that was fun to track down. You were spot on...

Looks like a long-standing issue: JPEGs exported to PDF were not
retaining their JPEG compression. That looks to have changed, thanks to work
done by Armin Le Grand on the Apache OpenOffice project and picked up in
LibreOffice with commit
http://cgit.freedesktop.org/libreoffice/core/commit/?id=8930030323f269a9b3c6bd6a09fc723e09211caa

The difference between the two versions is the encoding of the two large
JPEGs. On builds through the 4.3.4.1 release they are stored with
/FlateDecode (lossless, so the JPEG compression is thrown away), while on the
4.4.0 development builds they now correctly use /DCTDecode and retain the
JPEG compression of the original images.

Probably will not be back ported to the 4.3 branch, but will be present when
4.4 is released.

Stuart
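
For anyone who wants to check the filters on their own files, here is a rough Python sketch that counts which /Filter each image stream in a PDF declares. It is a plain byte scan, not a real PDF parser. The function name `image_filters` and both regular expressions are made up for this illustration, and it assumes each image dictionary sits directly before its `stream` keyword, so it may miss inline images or unusual layouts. Mostly /DCTDecode would mean the JPEG data was kept; mostly /FlateDecode would mean lossless re-encoding.

```python
import re
from collections import Counter

# Dictionary that directly precedes a "stream" keyword, without crossing "endobj".
# (Hypothetical helper patterns for this sketch, not part of any library.)
_STREAM_DICT = re.compile(
    rb"\bobj\s*(<<(?:(?!endobj).)*?>>)\s*stream\r?\n", re.DOTALL
)
# Value of /Filter: either a single name or an array of names.
_FILTER = re.compile(rb"/Filter\s*(\[[^\]]*\]|/\w+)")
_IS_IMAGE = re.compile(rb"/Subtype\s*/Image\b")


def image_filters(pdf_bytes: bytes) -> Counter:
    """Count the /Filter entries declared by image XObject streams."""
    counts = Counter()
    for match in _STREAM_DICT.finditer(pdf_bytes):
        header = match.group(1)
        if not _IS_IMAGE.search(header):
            continue  # fonts, page content and other streams are skipped
        found = _FILTER.search(header)
        name = found.group(1).decode("ascii", "replace") if found else "(none)"
        counts[name] += 1
    return counts
```

To try it on a real export, pass in `open(path, "rb").read()`, where `path` is whichever PDF you want to inspect. If poppler-utils is installed, `pdfimages -list` should give a similar per-image summary without any scripting.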

V Stuart Foote wrote

Hey that was fun to track down. You were spot on...

Looks like a long-standing issue: JPEGs exported to PDF were not
retaining their JPEG compression…

Thanks for that, Stuart.

I have tested further using GIF, JPG and PNG images, opening PDFs in
LibreOffice Draw and exporting the graphics (right-click > Save Graphic…).

Comparing the files to the originals, the story seems complicated.

Sometimes, the original images are stored without change, and sometimes with
change (always larger and as PNG, even if the original is PNG). But I cannot
find a pattern. It doesn't matter if the images are large or small, embedded
or linked, resized within the document or at their original sizes.
Furthermore, sometimes 4.3.4.1 does save the images as the original. Again,
I can discern no pattern.

It seems entirely random to me.

The link that you gave is, unfortunately, too technical for me, so I can't
use that to figure out why it sometimes stores the images as their original
and sometimes not.

At least the new PDF files are significantly smaller, without losing
resolution.

Perhaps someone with technical understanding can weigh in to explain when
4.4 saves an image in its original format and when it converts to PNG
(sometimes already PNG, just made larger) — and why? What is behind the
decision to *sometimes* change the format to PNG or, when already PNG, to
make it larger?

Hi :)
I think this is tooo new so no-one really knows yet!

Is there a difference between freshly created new documents done in the
4.4.0 compared with documents created in previous branches? Does that help
with figuring out a pattern or is that a "red herring"?
Regards from
Tom :)

TomD wrote

I think this is tooo new so no-one really knows yet!

Is there a difference between freshly created new documents done in the
4.4.0 compared with documents created in previous branches?

Yes, there is a difference, comparing 4.3.4.1 and 4.4.0.0. The file size was
what alerted me: it has been reduced from enormous to a reasonable size. For
example, the latest community magazine comes out as a 169 MB PDF with 4.3.4.1
and just 30 MB with 4.4.0.0.

Every example that I've looked at shows that pictures are either stored in
their original format, or converted to PNG (if not already PNG) and made
larger. The difference is that 4.3.4.1 seems to do it with most images,
whereas 4.4.0.0 with very few.

When a specific image is converted by both 4.3.4.1 and 4.4.0.0, the
resulting file is identical, which suggests that the algorithm has not
changed; only the decision whether or not to convert has changed.

As for a pattern, I've not been able to find one. As mentioned before, there
seems to be no pattern as to whether the images are linked or embedded,
large or small, resized or not.

It seems random. I'm sure it's not random, but still, I find it crazy to
modify any of the images especially as they are always made larger (why take
a lossless PNG image and make it into a bigger PNG?).

Paddy
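
One possible explanation for the "bigger PNG" puzzle, offered as a guess that nobody in this thread has confirmed: as far as I know, PDF has no PNG container, so a lossless image goes in as Flate-compressed pixel data, and any PNG that comes back out of "Save Graphic" has been re-encoded along the way. Identical pixels can deflate to very different sizes depending on encoder settings such as the PNG "Sub" predictor. The sketch below shows that effect on synthetic data only; `sub_filter` and `smooth_row` are illustrative names, and nothing here is LibreOffice's actual code path.

```python
import random
import zlib


def sub_filter(row: bytes) -> bytes:
    """PNG 'Sub' predictor at 1 byte per pixel: each byte minus its left neighbour."""
    return bytes(
        (row[i] - (row[i - 1] if i else 0)) & 0xFF for i in range(len(row))
    )


def smooth_row(length: int, seed: int = 0) -> bytes:
    """Synthetic greyscale row that drifts by -1, 0 or +1 per pixel."""
    rng = random.Random(seed)
    value, out = 128, bytearray()
    for _ in range(length):
        value = (value + rng.choice((-1, 0, 1))) & 0xFF
        out.append(value)
    return bytes(out)


row = smooth_row(100_000)
plain = zlib.compress(row, 9)                  # deflate the pixels as they are
predicted = zlib.compress(sub_filter(row), 9)  # deflate after the Sub predictor

# Same pixels, same deflate level, different sizes.
print(len(row), len(plain), len(predicted))
```

Both outputs decode back to exactly the same pixels, so a larger file does not mean lost or changed image data, only a less effective encoding.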

Hi :)
Yeh, i got that and it was interesting to see Stuart's response.

I was wondering if there are 2 distinct patterns; 1 for what happens when
you use 4.4.0 to edit old documents created with earlier versions of
LibreOffice or other programs/suites and the other pattern for new
documents.

There was something a few weeks ago about a Table-of-Content or Index (or
something) behaving differently in fresh new documents. In older documents
the behaviour carried on even if the older document was opened in 4.4.0. A
tad confusing but quite pleasing to note that documents remained consistent
regardless of what opened them.

Regards from
Tom :)

TomD wrote

I was wondering if there are 2 distinct patterns; … when you use 4.4.0 to
edit old documents created with earlier versions of LibreOffice … and the
other pattern for new documents.

Good question, but I already tried that. I created identical documents with
4.3.4.1 and 4.4.0.0, and there was no difference in their treatment.

TomD wrote

There was something a few weeks ago about a Table-of-Content or Index…

Is the difference documented anywhere? It would be good to know in advance!

Paddy

Note, I am not looking at the code, I am just guessing in my answers, but, if you are testing, perhaps you can test these things:

As for a pattern, I've not been able to find one. As mentioned before, there
seems to be no pattern as to whether the images are linked or embedded,

I expect that linked versus embedded relates to how the images were added when the ODT document was created. That said, I was not aware that you could simply link an image in a PDF document, so I have almost nothing to add on this.

large or small, resized or not.

This may be related to the perceived DPI of the original image as compared to the DPI of the created PDF. On export you can choose Lossless compression, JPEG compression, and Reduce image resolution. Obviously this will not apply if you do not set "Reduce image resolution". When I look at those three combinations, I expect that no combination allows for images to never be modified.
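
To make the DPI comparison concrete, here is a small Python sketch of how an image's effective resolution on the page can be worked out and compared with an export target. The function names and the simple "downsample when above the target" rule are assumptions made for illustration; I have not checked how LibreOffice actually decides.

```python
MM_PER_INCH = 25.4


def effective_dpi(pixel_width: int, placed_width_mm: float) -> float:
    """Pixels per inch of an image at the width it occupies on the page."""
    return pixel_width / (placed_width_mm / MM_PER_INCH)


def would_downsample(pixel_width: int, placed_width_mm: float,
                     target_dpi: float) -> bool:
    """Assumed rule: only images above the target resolution get reduced."""
    return effective_dpi(pixel_width, placed_width_mm) > target_dpi


# A 3000-pixel-wide photo placed 127 mm (5 inches) wide is about 600 DPI.
print(round(effective_dpi(3000, 127.0)))
print(would_downsample(3000, 127.0, target_dpi=300))
```

Under that assumption, the same photo placed twice as wide would drop to about 300 DPI on the page and would no longer be a candidate for reduction.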

Hi :)
Maybe in the release notes?
http://www.libreoffice.org/download/release-notes/

Sometimes things get fixed or smoothed out 'by accident' as the devs are
working hard to clean the code up and remove any vague hint of Java. Also
solving 1 bug-report sometimes has a "knock on effect" of accidentally
fixing other things at the same time. It's not always worth trying to find
which weird and obscure use-cases have been fixed in this way but it's
hoped that users might sometimes write in somewhere and let people know.

Regards from
Tom :)

Andrew Douglas Pitonyak wrote

… i was not aware that you could simply link an image in a PDF document

Ah, the link is only in the ODT file. The image is always embedded within
the PDF.

Andrew Douglas Pitonyak wrote

large or small, resized or not.

This may be related to the perceived DPI of the original image as compared
to the DPI of the created PDF. On export you can choose Lossless
compression, JPEG compression, and Reduce image resolution.

I have been looking only at Lossless compression and no reduction of image
resolution. Also, there is absolutely no point ever in taking a PNG image
and making it larger; that should never be done regardless.

Andrew Douglas Pitonyak wrote

When I look at those three combinations, I expect that no combination
allows for images to never be modified.

That's exactly what we've found. I think that images should never be
modified when choosing lossless with no change in resolution.

TomD wrote

Maybe in the release notes?
http://www.libreoffice.org/download/release-notes/

That doesn't give the release notes for 4.4.0.0. I managed to find the 4.4
release notes:
https://wiki.documentfoundation.org/ReleaseNotes/4.4
But nowhere there does it mention image compression or resolution.

TomD wrote

Sometimes things get fixed or smoothed out 'by accident' as the devs are
working hard to clean the code up and remove any vague hint of Java. Also
solving 1 bug-report sometimes has a "knock on effect" of accidentally
fixing other things at the same time. It's not always worth trying to
find which weird and obscure use-cases have been fixed in this way but
it's hoped that users might sometimes write in somewhere and let people
know.

I'd be happy to write in to let the developers know. I wouldn't know where
to do this, though.

Paddy