OL Learn

How to divide a PDF into records, drop records meeting a condition

I have input that consists of invoices of two different types (type 1 or type 2), addressed to many different account numbers. There may be any number (including 0) of either type of invoice for a particular account number.

The goal is that the invoices should be printed, sorted by account number, subsorted by invoice type: for any given account number the type 1 invoices should come first, then the type 2 invoices. However, if there are type 2 invoices that do not have at least one matching type 1, those should not be printed. (Actually the output is a PDF, but you get the idea.)

I have all the invoices formatted in a single PDF, sorted by account number, subsorted by invoice type, and have the account number and invoice type printed in the margin in the same place on every page.

So how to finish up by filtering out invoice type 2s that have no matching type 1?

My thought was to read the PDF into a datamapper, separate into records based on change of accountnumber. That (I think) gives one record per account number?

I should then be able to delete any record that is not invoicetype1 on the first page. But do I do this with some sort of filter in a preset? Or is this something best done in the workflow?

It seems simple, but I’m not sure how to proceed. Advice, please?

UPDATE:
First, thanks for the advice on how to do this with scripts in Workflow. That seems a good workaround for the issue. I have also been told that the filter is not the optimal way to do this. The best way is to have a mapper that opens the sorted PDF and breaks it into “records” on change of accountnumber (without extracting, just looking for a change in a region of the PDF), followed by a true/false branch on whether a region of the PDF contains “doctype1”. If true, extract fields. If not, use the action step “skip record”.
This yields metadata where all the records (which might actually be several pages consisting of several invoices, all with the same account number) begin with doctype 1, and any records that begin with any other doctype have no metadata extracted.
Then when I build a new PDF from this, I should get the results I want.
However, every record is printed, paying no attention to whether there is metadata for the record or not.
I can’t see why it would print. Any ideas?

You could try a script in workflow.
I came up with this, which seems to work, but obviously you will have to try it for yourself as I have no real test data.

Set pdf = Watch.GetPDFEditObject()

oldcustomer = ""
type1found = False
Dim deleteit()

pdf.Open Watch.GetJobFileName(), False
pagecount = pdf.Pages().Count()
' One delete flag per page, sized to the PDF that was just opened
ReDim deleteit(pagecount)

' Pass 1: read the margin text of every page and flag orphan type 2 pages
For i = 0 To pagecount - 1
    deleteit(i) = False
    Set objpagei = pdf.Pages().Item(i)
    customer = Trim(objpagei.ExtractText2(0.08333, 0.61458, 1.48958, 0.97916))
    Watch.Log "cust=/" & customer & "/", 1
    If customer <> oldcustomer Then
        ' New account number: no type 1 seen for it yet
        oldcustomer = customer
        type1found = False
    End If
    type1or2 = Trim(objpagei.ExtractText2(1.0625, 0.25, 1.29166, 0.5625))
    Watch.Log "type=/" & type1or2 & "/", 1
    If type1or2 = "1" Then
        type1found = True
    ElseIf Not type1found Then
        ' Type 2 with no preceding type 1 for this account
        deleteit(i) = True
    End If
    ' Release the page object before moving on
    Set objpagei = Nothing
Next

' Pass 2: delete flagged pages from last to first so indexes stay valid
For i = pagecount - 1 To 0 Step -1
    If deleteit(i) Then
        pdf.Pages().Delete(i)
    End If
Next
pdf.Save(False)

It is in VBS because the JS version won’t delete pdf pages. There are posts on here somewhere regarding that problem. You will have to modify the selection regions to match your data. Also modify the type check if you haven’t used literal “1” and “2”.

Why does it store page numbers and then delete pages later? Because when a page is deleted, the pdf pagecount is modified on the fly, meaning the main loop finds itself running out of pages to look at. I’m sure there are other ways of handling that, but this method seems to work for me.
Hope that helps
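For anyone who wants to check the flagging rule before touching a real PDF, here is a minimal sketch of the first pass on plain data. It is not Workflow code: the function name `flagOrphanType2Pages` and the `{ account, type }` objects are made up for illustration and stand in for the text that `ExtractText2` reads from each page. It assumes the pages are already sorted by account number and then by invoice type, as described in the question.

```javascript
// Hypothetical stand-in for the PDF: one object per page, in document order.
// Returns one boolean per page, true where the page should be deleted.
function flagOrphanType2Pages(pages) {
  var flags = [];
  var previousAccount = null;
  var type1Found = false;
  for (var i = 0; i < pages.length; i++) {
    if (pages[i].account !== previousAccount) {
      // New account number: no type 1 seen for it yet
      previousAccount = pages[i].account;
      type1Found = false;
    }
    if (pages[i].type === "1") {
      type1Found = true;
      flags.push(false);
    } else {
      // Keep a type 2 page only if a type 1 came before it for this account
      flags.push(!type1Found);
    }
  }
  return flags;
}

var samplePages = [
  { account: "A100", type: "1" },
  { account: "A100", type: "2" },
  { account: "B200", type: "2" },
  { account: "B200", type: "2" },
  { account: "C300", type: "1" }
];
var sampleFlags = flagOrphanType2Pages(samplePages);
console.log(sampleFlags);
```

In this sample only the two pages for account B200 are flagged, because that account has no type 1 invoice.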

@stuart-gascoigne the JS implementation will delete pages without issue. You just have to go through the extra step of forcing garbage collection to be able to successfully release objects.

To clarify:

In JavaScript, right before closing the PDF, you must ensure all references to PDF pages are discarded, as explained in this help page: https://help.objectiflune.com/en/pres-workflow-user-guide/2019.2/Default.html#Workflow/Alambic_API/IPDF_Methods.html#Close()

When using the PDF API to manipulate, delete, or reorder pages, I loop backwards through the source PDF to avoid the “changing page count because I deleted things” issue.
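To illustrate why the direction of the loop matters, here is a small sketch that uses a plain array in place of the PDF page collection. The function names and the sample labels are hypothetical, and `splice` merely plays the role of a page delete; it is meant to show the index shifting, not the actual PDF API.

```javascript
// Walks from the last index to the first, so removing an item never
// changes the index of an item that is still to be visited.
function deleteBackward(items, shouldDelete) {
  for (var i = items.length - 1; i >= 0; i--) {
    if (shouldDelete(items[i])) {
      items.splice(i, 1);
    }
  }
  return items;
}

// For contrast: deleting while walking forward skips whichever item
// slides into the slot that was just freed.
function deleteForwardNaive(items, shouldDelete) {
  for (var i = 0; i < items.length; i++) {
    if (shouldDelete(items[i])) {
      items.splice(i, 1);
    }
  }
  return items;
}

function isType2(label) {
  return label.slice(-2) === "-2";
}

var backwardResult = deleteBackward(["A-1", "B-2", "B-2", "C-1"], isType2);
var forwardResult = deleteForwardNaive(["A-1", "B-2", "B-2", "C-1"], isType2);
console.log(backwardResult);
console.log(forwardResult);
```

The backward version removes both "B-2" entries. The naive forward version leaves one behind, because after the first delete the second "B-2" moves into the index the loop has already passed.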