Searchable PDFs from Graphite fonts

Hello,

I've found that using the Graphite fonts Libertine and Biolinum produces very attractive PDFs. However, the PDFs are effectively unsearchable, as LibreOffice seems to insert spurious characters into the text.

The attached files demonstrate what I mean. The text in the ODT reads "This is the official version." while the text in the PDF reads "This his the offichial vershion."

Can anyone suggest why this is happening or what I can do about it?

Many thanks,
Jonathan

Since attached files do not get saved with this email list, I tried using your email's text to see what happens.

I tried both Linux Libertine and Linux Biolinum [14 point] on my 3.5.7 version for Ubuntu 64-bit. I cannot replicate the issue with "added" characters.

Could you tell us which OS and version you are using, plus the "exact" name of the font[s] as shown in the font window of LO?

ALSO
please try some other PDF generation methods. For Windows, use the doPDF printer-driver-like software, or CUPS-PDF for Linux. See if those types of "Print" to PDF file packages give you the same issues. If they do, then it is not the "export-to-pdf" part of LO, but some other rendering issue.

Also, see if your media printed version [i.e. print on paper] gives the same extra characters.

Perhaps your PDF reader has had one tipple too many?

;^)

Brian Barker

Hi :)
Errr, you can upload an "attachment" using the Nabble interface. You just
need to follow the links in this email or navigate from the official
LibreOffice website. When you get to the right thread, just reply to any of
the posts; the "More" button then has
"upload file"
as its top option. There is a "More" button earlier somewhere that has
different options that look vaguely interesting, but it's only once you are
replying that you get the option to upload a file.

Happy hunting!
Regards from
Tom :)

Thanks for the feedback.

I am currently using LibreOffice 3.6.3 from the Debian experimental
repository. I just updated from version 3.5.4 from the Debian unstable
repository to see if it would make a difference (it didn't). Debian doesn't
package the Graphite fonts so I have downloaded and installed them from
http://numbertext.org/linux/

I have tried other methods for generating PDFs - both CUPS-PDF and
Print-to-file. These methods produced PDFs that looked the same but had no
text encoding - ie not searchable or text-selectable.

The printed version is perfect in every case. My concern is to generate a
PDF that both looks perfect and is searchable.

I am attaching the relevant files following Tom's instructions.

Thanks!
Jonathan

test.odt <http://nabble.documentfoundation.org/file/n4016027/test.odt>
test.pdf <http://nabble.documentfoundation.org/file/n4016027/test.pdf>

I can select the text in the PDF, copy and paste, but get an 'h' added before most 'i'.
I can search, but not if the word is one with the extra h before i
Steve

> I can select the text in the PDF, copy and paste, but get an 'h'
> added before most 'i'. I can search, but not if the word is one with
> the extra h before i Steve

That's exactly what I mean. It effectively means no searching.

> I tried both Linux Libertine and Linux Biolinum [14 point] on my
> 3.5.7 version for Ubuntu 64-bit. I cannot replicate the issue with
> "added" characters.

Were you using Graphite fonts ('Libertine G'/'Biolinum G')? Those are the ones where the problem arises. Those are also the fonts that do ligatures and other lovely typesetting things that make them look so nice, which is why I want to use them.

Cheers,
Jonathan

Seems to me that you have solved your own problem: it is the fonts. The search function cannot handle the lovely typesetting things. As you mentioned, an "i" looks like a "hi" to it. The only real solution is to not use any of the Graphite fonts in a PDF.
But if you want to search the PDF, have you opened it in Draw and searched for the text in it?

--Dan

Sorry but I cannot get the results you have.
I use 3.5.7 from LO's site and the fonts are from "wherever" since I do not know if I got them from LO or from some other source.

Interesting problem. Based on my tests, which I detail below, it appears to be a LibO bug rather than a font problem.

I'm using LibO 3.5.6.2 with Win7 and Adobe Reader.

In the sentence "This is the official version" with Linux Libertine G, there are two instances of automatic ligatures: the "Th" combination in "This," and the "ffi" combination in "official." In Adobe Reader, I was able to find "This" when I did a search, which means that the Reader recognized the "Th" ligature as a "T" followed by an "h", which is what I typed into the search box. But when I tried to search for "official", Adobe Reader couldn't find it, which means it did NOT associate my typing of an "f", "f", "i" with the "ffi" ligature.

Then when I copied and pasted the sentence from the PDF file into a plain text editor, it placed an "h" before every instance of an "i" just as was reported. However, this obviously has nothing to do with ligatures as most of the instances of "i" were NOT included in the ligatures. In fact, it did not place an "h" before the "i" in the "ffi" ligature.
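
The search and copy-and-paste checks described above can also be scripted once the text has been pulled out of the PDF (for example with a tool such as pdftotext). What follows is only a minimal Python sketch and has nothing to do with how LibreOffice or Adobe Reader work internally. The function name unsearchable_words is made up for illustration, and the "extracted" string is simply the text the original poster reported seeing in the PDF.

```python
import string


def unsearchable_words(source_text, extracted_text):
    """Return the words of source_text that a plain substring search
    would fail to find in extracted_text.

    Hypothetical helper for illustration; it only compares two strings
    and does not read PDF files itself.
    """
    missing = []
    for word in source_text.split():
        # Drop surrounding punctuation so "version." is searched as "version".
        needle = word.strip(string.punctuation)
        if needle and needle not in extracted_text and needle not in missing:
            missing.append(needle)
    return missing


# Text typed into the ODT, and the text reported as coming out of the PDF.
source = "This is the official version."
extracted = "This his the offichial vershion."

print(unsearchable_words(source, extracted))
```

For the sentence in this thread it reports "official" and "version" as unfindable, while "This", "is" and "the" still match, which is consistent with the search results described above.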

For comparison, I ran the same test using Apache OpenOffice 3.4.1 to see if it is a font issue or a program issue. I'm sorry to report that it appears to be a program issue. In AOO, I typed the same sentence using Linux Libertine G, "This is the official version." I then saved it as a PDF and opened it in Adobe Reader. This time, a search found both "This" and "official" despite both words containing ligatures. And, when I copied the sentence into a plain text editor, it copied correctly without any additions of "h" before "i".

I love the Linux Libertine set of fonts. I use it, not only with LibO and AOO, but also when I set a document in LaTeX.

I have found that Apache OpenOffice's support for Linux Libertine G appears to be more complete and polished than LibO's. This may be an example of that more complete support.

Of course, LibO has its advantages over AOO; for example, it properly hyphenates American English words, which AOO does not appear to do. It would be nice if someone could combine the best of both programs into one complete program (along with the tabbed interface of Lotus Symphony, yet a third fork of the original OO). But, I won't hold my breath.

Virgil

I tested it with 3.5.7 for 64-bit Ubuntu [.deb] and I do not see the issue at all. I use the default PDF viewer as well.

I wonder if it was fixed between 3.5.6 and 3.5.7?

OR - Could it look different in a different viewer than the "official" Adobe Reader?

I do have both the Linux Libertine G and the non-G versions. I used the uploaded text file and tested it. The PDF output with "Export to PDF" did not give me any issue.

As for the "best of both packages", in some articles I have read, AOO is taking the "best" of LO coding and including it in their package, since LO developers really have done a large amount of work cleaning and improving the base code from the OOo 3.x days. The big problem, IMO, is the licensing issue. AOO has a different approach, so I have been told, and it is not as "flexible" for the rights of the individual developers as the LO project's. AOO can use LO's code, but LO's licensing approach will not allow AOO coding to "easily" become a part of LO's package unless there is a revamping of the way the developers keep their ownership of their work [or so I have been told in my reading].

As for the Libertine font itself, I see the following in my font window:

    Linux Libertine
    Linux Libertine Capitals
    Linux Libertine Display
    Linux Libertine Display Capitals
    Linux Libertine G
    Linux Libertine Initials
    Linux Libertine Slanted.

I also have:

    Linux Biolinum
    Linux Biolinum Capitals
    Linux Biolinum G
    Linux Biolinum Keyboard
    Linux Biolinum Outline
    Linux Biolinum Shadow
    Linux Biolinum Slanted

I do not use Libertine or Biolinum "much", since about 1/3 of the things I do tend to go to others for editing in Windows and non-LO environments. They use the MS core fonts. Now if I were to send things out as PDFs, I could embed the fonts in the documents and therefore could use these fonts. With over 14 gigs of font files to choose from, I tend to lose track of who has and who does not have the fonts I use on a weekly basis.

Okay, I just tested it with LibreOffice 3.6.2.2 for 64-bit Ubuntu (the version I got when I just clicked on "Download" at the LibO website) and got identical results. I used both Document Viewer and Okular to view the PDF file. In both cases, the search function found "This" with the "Th" ligature, but not "official" with the "ffi" ligature. Cutting and pasting the text from the PDF file to GEdit also produced the rogue additional "h" before each "i".

I then ran the same test with Apache OpenOffice 3.4.1 for Ubuntu and all was fine, just as it was on my Win7 setup.

My next test will be to download the latest version of LibO for Windows and try it again there.

I'm also wondering if the version of Linux Libertine G matters. Even if it does, it is clear that, at least on my computer, Apache OpenOffice is rendering it properly in a PDF file and LibO is not. Since I'm not a developer, I have no idea why.

I agree that Apache and Libre have different licensing structures, but as an end-user, not a developer, I don't particularly care. Both AOO and LibO are free to use by users for any purpose without restriction. As a user, I view the differences in the programs in terms of what they do for me, not in how they are licensed.

Right now, on my computer, AOO works better with the Libertine G fonts, but LibO has accurate US-English hyphenation.

So, my solution, which I hate, is to keep both programs on my computer and load the one that meets my particular need at a given time. When I need good Libertine G support, I use AOO; if I need good hyphenation, I load LibO. I refuse to get sucked into licensing battles between two very similar programs. I just want to get my work done.

Virgil

Why would 3.6.2.2 [for 64-bit Ubuntu] give you the problem when 3.5.7.2 [for 64-bit Ubuntu] does not give the same issue to me?

That is a real head-scratching problem for me.

I downloaded my versions of the fonts in June of 2011. If you want, I can upload the Linux Libertine font files I use to a place where you can download them and check them.

As I said, my Ubuntu 3.5.7 version of LO is not showing me the issue you are showing with 3.6.2.2.

I just downloaded LibO 3.5.7.2 to my Win7 computer and tried again. Same results as with 3.5.6.2 for Win7 and 3.6.2.2 for Ubuntu.

I'm running a dual boot Win7/Ubuntu laptop and the results have been consistent across platforms and different versions of the program.

LibO, whether for Win7 or Ubuntu, produces PDFs in which a search cannot find "official" (with the "ffi" ligature) and from which a cut and paste inserts an additional "h" before each "i".

AOO does not share this behavior.

Virgil

Weird.

My Windows version of Linux Libertine G is version 5.1.3. I'm not yet deft enough with Ubuntu to even know where fonts are stored on the system, but I think it's the same version. Either way, I don't think it would explain why LibO and AOO are producing different results with the same font.

I've spent enough of my afternoon looking at this issue, so don't bother sending me your font files. This wasn't my problem to begin with. I just got interested in it because of my fondness for the Libertine fonts.

For me, though, the real issue is that it underscores small (but important) differences between LibO and AOO, which might be resolved if the respective developers could get past their licensing issues and work together.

Virgil

Hi all,

Thank you for all your enthusiastic help with this.

I also made some progress. Using Infix PDF Editor 5 (Windows trial version under Wine) and/or fontforge (Linux) I could see that the Font mapping was messed up. The mistaken mappings begin:

> Th -> T
> i -> hi

That is, the PDF converter has broken 'This' at the wrong place into 'T/hi/s' instead of 'Th/i/s'. All successive instances of the 'i' glyph are mapped to 'hi', thus explaining the drunken slurring effect.
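
To make the effect concrete, here is a small Python sketch of what a PDF reader does with such a mapping when it rebuilds text for search or copy. It is a toy model only: the glyph names and the function extract_text are invented for illustration, and it assumes that the "ffi" ligature glyph itself is mapped correctly, which nothing in the thread contradicts.

```python
# Glyph sequence for the test sentence as a font with "Th" and "ffi"
# ligatures might set it. The glyph names are invented for illustration.
GLYPHS = ["Th", "i", "s", " ", "i", "s", " ", "t", "h", "e", " ",
          "o", "ffi", "c", "i", "a", "l", " ",
          "v", "e", "r", "s", "i", "o", "n", "."]


def extract_text(glyphs, glyph_to_text):
    """Rebuild text the way a PDF reader does for search and copy:
    look up each glyph in the mapping; unmapped glyphs map to themselves."""
    return "".join(glyph_to_text.get(g, g) for g in glyphs)


# Intended mapping: each ligature glyph stands for all of its letters.
correct_map = {"Th": "Th", "ffi": "ffi"}

# Mapping as reported above: the "h" has moved from the "Th" glyph
# to the "i" glyph. Assumption: the "ffi" glyph is mapped correctly.
broken_map = {"Th": "T", "i": "hi", "ffi": "ffi"}

print(extract_text(GLYPHS, correct_map))  # This is the official version.
print(extract_text(GLYPHS, broken_map))   # This his the offichial vershion.
```

With the broken mapping the very first word still comes out as "This" (T + hi + s), which would explain why a search for "This" succeeds while every later "i" drags an extra "h" along with it.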

I was able to manually repair the mappings using podofobrowser (open source, pre-built Windows executable run under Wine). Now my real PDF is good. But if I re-create it I need to re-repair it.
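
For anyone wanting to see what such a repair amounts to: in a PDF, the glyph-to-text mapping used for search and copy is normally stored in the font's ToUnicode CMap, as "bfchar" entries pairing a glyph code with UTF-16BE text. The Python sketch below merely formats entries of that kind. It assumes the ToUnicode CMap is where the damage sits in this file, and the glyph codes 1 and 2 are hypothetical placeholders, not the codes in the actual PDF.

```python
def bfchar_block(glyph_to_text):
    """Format a glyph-code -> text mapping as a ToUnicode 'bfchar' block.

    Codes are written as 2-byte hex values and the text as UTF-16BE hex,
    the form used in ToUnicode CMaps for fonts with 2-byte codes.
    """
    lines = [f"{len(glyph_to_text)} beginbfchar"]
    for code, text in sorted(glyph_to_text.items()):
        dst = text.encode("utf-16-be").hex().upper()
        lines.append(f"<{code:04X}> <{dst}>")
    lines.append("endbfchar")
    return "\n".join(lines)


# Hypothetical glyph codes: 1 = the "Th" ligature, 2 = the "i" glyph.
broken = {1: "T", 2: "hi"}
repaired = {1: "Th", 2: "i"}

print(bfchar_block(broken))
print(bfchar_block(repaired))
```

In this toy example the repair consists of moving the "h" (0068) from the second entry back to the first, which is the same change that has to be redone by hand each time the PDF is re-exported.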

I've reproduced the problem with both the Debian packages and LO's own builds of LO. Maybe I should also try installing AOO.

webmaster-Kracked_P_P: would you be able to attach the PDF that you produced from my file so I can look for signs?

My version of the fonts is from January 2012, so that may be a source of the problem. Another possibility, since it's fairly clearly a bug somewhere, is that maybe building for 64 bit architecture makes it disappear.

Unless anyone has a better idea, I'll file this as a LO bug.

Cheers,
Jonathan

Jonathan,

You are obviously much more advanced than I am. Just a couple thoughts.

My computer is 64 bit and both my Windows and Ubuntu versions of LibO are 64 bit, and I have the same problem as you.

Again, my AOO (on either Windows or Ubuntu) doesn't have the problem but I don't know if it is specifically a 64 bit version. I just downloaded whatever popped up from the download page.

I'm beginning to think that the problem may be in the PDF converter settings. I tried checking and unchecking the "embed standard fonts" option, but it made no difference. I'm still thinking that there's a problem in the way LibO creates PDF files as opposed to AOO.

Good luck.

Virgil

Jonathan,

At the risk of muddying the waters further, let me share another strange behavior I've noticed between LibO and AOO when using Linux Libertine.

I typed the sentence "This is the official version" in LibO. I then placed the cursor at the beginning of the sentence. If I then hit the right arrow key and move the cursor along the sentence one character at a time, in LibO, the cursor jumps over the ligatures as if they are one character. I would expect this since a ligature is, in fact, one character.

But, when I typed the same sentence in AOO and moved the cursor along the sentence, the cursor stopped at each letter, even stopping between the letters of the ligatures.

It seems that LibO and AOO are treating Libertine G differently even before trying to convert a file to PDF.

PS - As to switching to AOO, keep in mind that it has its own problems. Several users, including myself, have noticed that it doesn't hyphenate US-English words at the right locations. It's not a dictionary issue, as I have used the same dictionaries in LibO, which hyphenates words at the right locations.

Virgil

LibreOffice 3.5.7.1
Build ID: 3fa2330-e49ffd2-90d118b-705e248-051e21c

Version 3.6.2.2 (Build ID: da8c1e6)

Both Linux (.deb versions).

$ apt-cache policy fonts-linuxlibertine
fonts-linuxlibertine:
  Installed: 5.1.3-1
  Candidate: 5.1.3-1
  Version table:
*** 5.1.3-1 0
        500 http://archive.ubuntu.com/ubuntu/ precise/universe i386 Packages
        100 /var/lib/dpkg/status
$ apt-cache policy ttf-linux-libertine
ttf-linux-libertine:
  Installed: 5.1.3-1
  Candidate: 5.1.3-1
  Version table:
*** 5.1.3-1 0
        500 http://archive.ubuntu.com/ubuntu/ precise/universe i386 Packages
        100 /var/lib/dpkg/status
$ apt-cache policy texlive-fonts-extra
texlive-fonts-extra:
  Installed: 2009-10ubuntu1
  Candidate: 2009-10ubuntu1
  Version table:
*** 2009-10ubuntu1 0
        500 http://archive.ubuntu.com/ubuntu/ precise/main i386 Packages
        100 /var/lib/dpkg/status

Libertine G and Biolinum G fonts have been bundled with LO since LO 3.3:

<https://www.libreoffice.org/download/3-3-new-features-and-fixes/>

I extracted the
"http://www.numbertext.org/linux/e7a384790b13c29113e22e596ade9687-LinLibertineG-20120116.zip"
file from the page the OP refers to <http://numbertext.org/linux/>, and found that
the fonts are .ttf. They are the same ones that are bundled with LO:

$ ls /home/gl/tempdir/testing/LinLibertineG
ChangeLog             LinBiolinum_R_G.ttf     LinLibertine_R_G.ttf    origfiles
doc                   LinBiolinum_RI_G.ttf    LinLibertine_RI_G.ttf   README
GPL.txt               LinLibertine_DR_G.ttf   LinLibertine_RZ_G.ttf   src
LICENCE.txt           LinLibertine_RB_G.ttf   LinLibertine_RZI_G.ttf
LinBiolinum_RB_G.ttf  LinLibertine_RBI_G.ttf  OFL.txt

$ locate LinBiolinum_R_G.ttf
/opt/libreoffice3.5/share/fonts/truetype/LinBiolinum_R_G.ttf
/opt/libreoffice3.6/share/fonts/truetype/LinBiolinum_R_G.ttf

$ locate LinLibertine_R_G.ttf
/opt/libreoffice3.5/share/fonts/truetype/LinLibertine_R_G.ttf
/opt/libreoffice3.6/share/fonts/truetype/LinLibertine_R_G.ttf
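
Matching file names suggest, but do not prove, that the downloaded fonts are identical to the bundled ones. One way to check would be to compare checksums of the two copies. The following is a minimal Python sketch using only the standard library; the function names are made up for illustration.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large font files need not fit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_contents(path_a, path_b):
    """True if both files have the same SHA-256 digest."""
    return sha256_of(path_a) == sha256_of(path_b)
```

With the paths shown above, one would call same_file_contents("/home/gl/tempdir/testing/LinLibertineG/LinLibertine_R_G.ttf", "/opt/libreoffice3.6/share/fonts/truetype/LinLibertine_R_G.ttf"). A result of False would mean the two copies differ, which could matter given the question raised earlier about font versions.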

Perhaps I'm missing the 'graphite' bits? When I follow the instructions
in <http://numbertext.org/linux/fontfeatures.odt> everything matches.

> You are obviously much more advanced than I am. Just a couple thoughts.

Persistent, I think, is a better word. When I publish my work on the Web I want search engines to index it correctly, otherwise there is little point.

> I'm beginning to think that the problem may be in the PDF converter
> settings. I tried checking and unchecking the "embed standard fonts"
> option, but it made no difference.

I don't think the Graphite fonts count as standard, so they are embedded regardless of that setting.

> I'm still thinking that there's a
> problem in the way LibO creates PDF files as opposed to AOO.

It sounds like it to me too.

Jonathan