Upland OL User community

PDF Boundaries

I’m working with a PDF file as data source and having difficulty defining the Boundary. I’m trying to set the boundary on text “Page 1”, but I’m getting strange splits. I don’t see anywhere to define whether this text represents the START of a record/document or END.

In the Boundaries section there is a field named Page before/after.

I think this is what you are looking for.

That would be useful if the page count is static. What about situations where there is no “document start” marker, but there is a “document end” marker, such as the word “Total”, which is on the last page of a variable-length document?

My conceptual difficulty is in understanding if the boundary is defining the START of a document or the END of a document.

(“Document” interchangeable with “record” in this discussion since we’re discussing a PDF data source.)

Let’s take your example. Your “document end” marker is the word “Total”, which can be found in a region on your last page. See this PDF Example.

Set the Page before/after field to 1. This places the boundary so that each record starts 1 page after the page on which the word “Total” is found (except the first record, obviously) ;).

This is from our documentation:

Boundaries are the division between records: they define where one record ends and the next record begins.
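
To make the START versus END question concrete, here is a minimal sketch of how I picture the offset working. It is a simplified model, not the DataMapper's actual implementation: the function name splitIntoRecords, the page arrays and the plain substring match are all made up for illustration. With an offset of 0 the trigger text marks the first page of a record; with an offset of 1 it marks the last page of the previous one.

```javascript
// Hypothetical model of boundary splitting (an assumption, not DataMapper code).
// pages:   array of page texts
// trigger: text to look for on a page
// offset:  the "Page before/after" value (0 = same page, 1 = next page, -1 = previous page)
function splitIntoRecords(pages, trigger, offset) {
  // The first record always starts on the first page of the file.
  var starts = [0];
  for (var i = 0; i < pages.length; i++) {
    if (pages[i].indexOf(trigger) !== -1) {
      var start = i + offset;
      // Ignore boundaries outside the file and ones that repeat an existing start.
      if (start > 0 && start < pages.length && starts.indexOf(start) === -1) {
        starts.push(start);
      }
    }
  }
  starts.sort(function (a, b) { return a - b; });

  // Turn the start positions into lists of 1-based page numbers.
  var records = [];
  for (var r = 0; r < starts.length; r++) {
    var end = r + 1 < starts.length ? starts[r + 1] : pages.length;
    var record = [];
    for (var p = starts[r]; p < end; p++) {
      record.push(p + 1);
    }
    records.push(record);
  }
  return records;
}

// Trigger marks the START of a document: "Page 1" with offset 0.
var startMarked = ["Page 1", "Page 2", "Page 1", "Page 2", "Page 3"];
var byStart = splitIntoRecords(startMarked, "Page 1", 0);

// Trigger marks the END of a document: "Total" with offset 1.
var endMarked = ["A", "A", "A Total", "B Total", "C", "C Total"];
var byEnd = splitIntoRecords(endMarked, "Total", 1);

console.log(JSON.stringify(byStart)); // [[1,2],[3,4,5]]
console.log(JSON.stringify(byEnd));   // [[1,2,3],[4],[5,6]]
```

In this model a negative offset works the same way, moving the start of the record to a page before the one that contains the trigger.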

Hope this clarifies things.

@hamelj I get a PDF file with an unknown number of records, and each record can span an unknown number of pages. Sadly there is no unique trigger point on the first page of a record, but the second page of each record contains “Page: 2”.
So I want to set the boundaries on that trigger point. I set Page before/after to -1, which generally does what I need, except for the first page of the file.

My boundary settings:
boundaries

My result:
record 1 = first page of the pdf file
record 2 = second page of the pdf file and all corresponding pages of the first record
record 3 = first page with all corresponding pages of the second record (correct)
record 4 = first page with all corresponding pages of the third record (correct)
etc…

How it should be:

How it appears in Datamapper:

So as you can see, the boundaries are set correctly, but the very first page of my PDF file is not included in the first record.

What can I do here?

PS: I am using PlanetPress Connect version 2022.2.

Can you share that PDF? It's easier to experiment with the actual file than to puzzle over the concept.

It appears to be a bug in the DataMapper. I will report it to the development team.

I found a workaround which isn’t very elegant, but at least it works:

  • Add a page at the very beginning of the PDF (contents of that page are irrelevant)
  • Ignore the first record in the DataMapper

You can create a script in Workflow that automatically adds a page at the beginning of the PDF (it just duplicates the very first page):

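// Open the current job file as an editable PDF
var pdf = Watch.GetPDFEditObject();
pdf.open(Watch.GetJobFileName(), false);
// Copy the first page of this same PDF and insert it at the very beginning
pdf.pages().InsertFrom2(pdf.pages(),0,1,0);
// Write the modified PDF back to the job file
pdf.Save(false);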

To ignore the first record in the DataMapper, you have to create a condition to check if the record number is equal to 1: if it is, then an Action step is used to immediately skip to the next record, if not, then the data is extracted.
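
As a rough illustration of that skip logic, here is a standalone sketch that can be run outside the DataMapper. The names shouldSkipRecord and mockRecords are hypothetical, and the loop only stands in for the DataMapper's own record processing. In the real Condition step the test would be on the record number; as far as I know that is available as record.index (1-based), but treat that as an assumption and check the DataMapper API documentation.

```javascript
// Hypothetical stand-in for the Condition step; recordIndex is assumed to be 1-based.
function shouldSkipRecord(recordIndex) {
  // Record 1 only holds the duplicated page added by the Workflow script.
  return recordIndex === 1;
}

// Mock records (lists of page numbers) as they might look after the extra page was added.
var mockRecords = [[1], [2, 3, 4], [5, 6]];

var extracted = [];
for (var i = 0; i < mockRecords.length; i++) {
  var recordIndex = i + 1;
  if (shouldSkipRecord(recordIndex)) {
    continue; // plays the role of the Action step that skips to the next record
  }
  extracted.push(mockRecords[i]); // plays the role of the extraction steps
}

console.log(JSON.stringify(extracted)); // [[2,3,4],[5,6]]
```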


Now that workaround may not be what you’re looking for, but that’s all I got at this stage.

Thank you for sharing that workaround. In my case I thankfully found another trigger point on the first page of each record, but it is always good to have a workaround at hand.