Text find/replace in PDFs using API

Benjamin_Smith · April 17, 2015, 5:41pm

Hello! This is my first post to this list.

We have a large number of PDF documents that we need to modify slightly and
save. After a fair amount of research, it seems like the LibreOffice API and the
TextReplace.java code sample should get me rather close to the mark. I almost
have the script working, but I'm having trouble telling libreoffice to open the
PDF as a PDF rather than a text document. The result is a bunch of
gobbledygook.

Here's the file I started with:
http://api.libreoffice.org/examples/java/Text/TextReplace.java

I'm running on a spare desktop system, Fedora 21, fresh install, all updates
applied. EG:
java-1.8.0-openjdk-1.8.0.40-25.b25.fc21.x86_64
libreoffice-core-4.3.6.2-8.fc21.x86_64
libreoffice-writer-4.3.6.2-8.fc21.x86_64

I've modified the script so that it finds and opens the example PDF I'm
using, which contains the same text as the
TextReplace.java example. When it opens the file, it's not importing the PDF,
instead it's reading it as a text file or something.

You can see what this looks like here: (I'm not running headless on purpose)
http://hal.schoolpathways.com/libre/screenshot.png

Looking at the documentation, it looks like there's an "Arguments" parameter
which reads:

"Arguments: these arguments specify component or filter specific behavior

For example, "ReadOnly" with a boolean value specifies whether the document
is opened read-only. "FilterName" specifies the component type to create
and the filter to use, for example: "Text - CSV". For more information see
com::sun::star::document::MediaDescriptor."

See
http://api.libreoffice.org/docs/idl/ref/interfacecom_1_1sun_1_1star_1_1frame_1_1XComponentLoader.html

This leads to TypeDetection but I have not figured out how to find out what
"FilterName" I should use to specify that I'm importing a PDF.

Below is the code fragment that I'm using now. My *guess* is that this line is
the culprit:

loadProps[0].Value = "X-Application/PDF";

but I can't be sure. I'm guessing the following references are related, but
nowhere am I seeing a *list* of import filters that I can use, nor a way to find
out what filters are available in my install.

http://www.openoffice.org/api/docs/common/ref/com/sun/star/document/ImportFilter.html
http://www.openoffice.org/api/docs/common/ref/com/sun/star/document/FilterFactory.html

Applicable snippet below, the full .java script that I'm using is visible
http://hal.schoolpathways.com/libre/TextReplace.java

---- SNIP ------
// ORIGINAL EXAMPLE
// String sURL = "private:factory/" + sDocumentType;

// SEES PDF AS A TEXT DOCUMENT.
// String sURL = "file:///home/ben/pdf/example.pdf";
String sURL = "file:///home/ben/pdf/example.pdf";

com.sun.star.lang.XComponent xComponent = null;
com.sun.star.frame.XComponentLoader xComponentLoader = null;
// com.sun.star.beans.PropertyValue xEmptyArgs[] =
//     new com.sun.star.beans.PropertyValue[0];

com.sun.star.beans.PropertyValue[] loadProps =
    new com.sun.star.beans.PropertyValue[3];

loadProps[0] = new com.sun.star.beans.PropertyValue();
loadProps[0].Name = "FilterData";
loadProps[0].Value = "X-Application/PDF";

try {
    xComponentLoader = UnoRuntime.queryInterface(
        com.sun.star.frame.XComponentLoader.class, xDesktop);

    xComponent = xComponentLoader.loadComponentFromURL(
        sURL, "_blank", 0, loadProps);
}
---- SNIP ------

Your time and attention are much appreciated.

Benjamin Smith

Brad_Rogers · April 17, 2015, 6:01pm

Hello Benjamin,

>TextReplace.java example. When it opens the file, it's not importing
>the PDF, instead it's reading it as a text file or something.

IME, PDFs open in Draw, not Writer. What you're seeing in Writer is the
mark up.

V_Stuart_Foote · April 17, 2015, 6:28pm

The default PDF import filter opens PDF into Draw, but the filter can also
parse into Writer and is manipulated in GUI from the LibreOffice Open/Save
dialogs.

YMMV to do your processing, but see if this gets you closer to configuration
needed to parse your PDF for changes.

http://opengrok.libreoffice.org/xref/core/sdext/source/pdfimport/config/pdf_import_filter.xcu#79

<node oor:name="writer_pdf_import" oor:type="xs:string" oor:op="replace">
    <prop oor:name="DocumentService">
        <value>com.sun.star.text.TextDocument</value>
    </prop>
    <prop oor:name="FileFormatVersion" oor:type="xs:int">
        <value>0</value>
    </prop>
    <prop oor:name="FilterService" oor:type="xs:string">
        <value>com.sun.star.comp.Writer.XmlFilterAdaptor</value>
    </prop>
    <prop oor:name="Flags" oor:type="oor:string-list">
        <value>3RDPARTYFILTER ALIEN IMPORT PREFERRED</value>
    </prop>
    <prop oor:name="Type" oor:type="xs:string">
        <value>pdf_Portable_Document_Format</value>
    </prop>
    <prop oor:name="UIName">
        <value xml:lang="x-default">PDF - Portable Document Format (Writer)</value>
    </prop>
    <prop oor:name="TemplateName"/>
    <prop oor:name="UIComponent"/>
    <prop oor:name="UserData" oor:type="oor:string-list">
        <value oor:separator=",">org.libreoffice.comp.documents.WriterPDFImport,com.sun.star.comp.Writer.XMLOasisImporter,true</value>
    </prop>
</node>
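
For readers wondering which part of this fragment matters to the API: as far as I can tell, the value to pass as "FilterName" is the `oor:name` attribute of the `<node>` element, here `writer_pdf_import`. The sketch below is only an illustration using the JDK's own XML parser on a trimmed copy of the fragment; the class name `FilterNameFromXcu` and the method `filterNames` are made up for this example, and it does not talk to LibreOffice at all.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class FilterNameFromXcu {
    // Trimmed copy of the fragment quoted above (hypothetical test input).
    static final String FRAGMENT =
          "<node oor:name=\"writer_pdf_import\" oor:op=\"replace\">"
        + "<prop oor:name=\"DocumentService\">"
        + "<value>com.sun.star.text.TextDocument</value></prop>"
        + "<prop oor:name=\"Type\" oor:type=\"xs:string\">"
        + "<value>pdf_Portable_Document_Format</value></prop>"
        + "</node>";

    /** Returns the oor:name attribute of every <node> element in the XML. */
    static List<String> filterNames(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The fragment does not declare the oor: prefix, so parse
            // without namespace processing and treat "oor:name" as a plain name.
            factory.setNamespaceAware(false);
            Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("node");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(((Element) nodes.item(i)).getAttribute("oor:name"));
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        // Prints each filter name found, one per line.
        for (String name : filterNames(FRAGMENT)) {
            System.out.println(name);
        }
    }
}
```

The same idea could be pointed at a full pdf_import_filter.xcu file to see every filter it defines, assuming the file is wrapped in a root element that declares its namespaces.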

Benjamin_Smith · April 17, 2015, 6:42pm

Good catch! Now the sample PDF opens and is readable!

Now it's failing at this line... how *do* you specify a search/replace all
text on an instance of draw? (squinting at documentation now)

// You need a descriptor to set properties for Replace
xReplaceDescr = xReplaceable.createReplaceDescriptor();

Benjamin_Smith · April 17, 2015, 6:47pm

// resent with additional info //

>TextReplace.java example. When it opens the file, it's not importing
>the PDF, instead it's reading it as a text file or something.

>IME, PDFs open in Draw, not Writer. What you're seeing in Writer is the
>mark up.

Good catch! Now the sample PDF opens and is readable!

Now it's failing at this line... how *do* you specify a search/replace all
text on an instance of draw? (squinting at documentation now)

// You need a descriptor to set properties for Replace
xReplaceDescr = xReplaceable.createReplaceDescriptor();

Apparently due to this returning null:

xReplaceable = UnoRuntime.queryInterface(
    com.sun.star.util.XReplaceable.class, xTextDocument);

Where xTextDocument is actually an instance of draw rather than writer.

(so close!)

Brad_Rogers · April 17, 2015, 7:39pm

Hello Benjamin,

>Now it's failing at this line... how *do* you specify a search/replace
>all text on an instance of draw? (squinting at documentation now)

All beyond my abilities, I'm afraid. V Stuart has a more detailed and,
potentially, useful response.

Benjamin_Smith · April 17, 2015, 7:58pm

Sorry I'm a newbie at the LO API, but how would I specify this import filter?
Does this XML content belong in the value of FilterData ? EG:

loadProps[0].Value = "<node oor:name= .... "

?

Thanks,

Ben
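
A note on this question, offered as a sketch rather than a confirmed fix: the documentation quoted earlier in the thread names the property "FilterName" (the snippet sets "FilterData"), and the value is a short filter name, not the XML. Based on the filter definition quoted above, `writer_pdf_import` looks like the name to try. The example below only assembles the name/value pairs in plain Java so it runs without the UNO jars; the class `PdfLoadArgs` and the method `writerPdfArgs` are hypothetical, and "Hidden" is included purely as a second example of a media descriptor property.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PdfLoadArgs {
    /** Name/value pairs intended for loadComponentFromURL's Arguments. */
    static Map<String, Object> writerPdfArgs() {
        Map<String, Object> args = new LinkedHashMap<>();
        // Assumption: the node name from pdf_import_filter.xcu is the filter name.
        args.put("FilterName", "writer_pdf_import");
        // Keep the window visible, as in the non-headless test described above.
        args.put("Hidden", Boolean.FALSE);
        return args;
    }

    public static void main(String[] unused) {
        Map<String, Object> args = writerPdfArgs();
        // With the UNO jars on the classpath, each entry would become one
        // com.sun.star.beans.PropertyValue (Name = key, Value = value), in an
        // array sized to args.size() so that no slot is left null.
        for (Map.Entry<String, Object> entry : args.entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

Sizing the array from the map matters here: the snippet earlier in the thread allocates three slots but fills only one, which leaves two null entries in the arguments.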

Benjamin_Smith · April 17, 2015, 8:29pm

I figured out how to install this as an extension and did so, but it still
opens the example PDF in Draw. I tried installing just the section you
referenced and also the entire file, and neither seemed to make a difference.

So it seems to me that my choices are:

1) Try to determine if there's a way to open the (relatively simple) PDFs in
Writer to go ahead with my TextReplace.java plan;

2) Find some way to do something like TextReplace in Draw;

3) Something else altogether, perhaps a Macro?

Thanks,

Ben
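
One low-effort way to explore option 1 before changing the API code might be to check from the command line whether the Writer PDF import filter works on this install. The sketch below only builds the argument list; it assumes the `soffice` binary is on the PATH and that the `writer_pdf_import` filter quoted earlier in the thread is available. The class `SofficeCommand`, the method `convertCommand`, and the output directory are hypothetical. It converts the PDF to ODT through Writer and does not perform any text replacement.

```java
import java.util.ArrayList;
import java.util.List;

public class SofficeCommand {
    /** Builds a headless conversion command that imports the PDF via Writer. */
    static List<String> convertCommand(String pdfPath, String outDir) {
        List<String> cmd = new ArrayList<>();
        cmd.add("soffice");                          // assumed to be on the PATH
        cmd.add("--headless");
        cmd.add("--infilter=writer_pdf_import");     // force the Writer PDF import
        cmd.add("--convert-to");
        cmd.add("odt");
        cmd.add("--outdir");
        cmd.add(outDir);
        cmd.add(pdfPath);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = convertCommand("/home/ben/pdf/example.pdf", "/tmp/out");
        // Print the command for inspection; to run it, pass the list to
        // new ProcessBuilder(cmd).inheritIO().start().waitFor().
        System.out.println(String.join(" ", cmd));
    }
}
```

If the resulting ODT contains readable text, the same filter name is a reasonable candidate for the "FilterName" load property.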