OL Learn

Extracting Pages based on content of PDF

First time user of PP Connect, and I’m having difficulty with extracting pages.

I was thinking of using metadata, but I’m running into an odd situation. I search for a specific word that is, say, on page 100 of a 200-page file. The output ends up running from the page where the word is found to the end of the document, instead of containing just that page.
The goal is to split one PDF into three groups: Foreigns, Multis and Regulars.
For the Foreigns, I need to find the keyword, take that page and the next one, and create a separate PDF (or several PDFs if there is more than one instance).
For the Multis, just finding “Page N of N” will probably do. If I find “Page 1 of 1”, that’s a Regular; anything else is a Multi. These have backing pages as well, so even if a page says 1 of 1, it’s actually 2 pages.

I’ve been digging through OL Learn for a few days and I can’t find how it’s done. Scripting would probably do it real quick. If there is a script out there that I could look at, that would help. Thank you.

Using metadata is the proper way to do this.

  • First, create basic metadata in passthrough mode.
  • Then use the Metadata Level Creation task to create document boundaries. You have to specify a condition that uniquely identifies the start of each document. This is the key element in the entire process. At this stage, you don’t care whether each document is Foreign, Multi or Regular; you just want the metadata to know where each document begins in the PDF.
  • Once that’s done, use a Metadata Sequencer and set it to sequence on each Metadata document.
  • At this stage, the rest of the process will be dealing with individual documents. So you can use conditions to examine each document and look for markers indicating whether the document is Foreign, Regular or Multi.
  • When a condition is met, use a Send To Folder task to send each type of document to its own destination. If you want to concatenate all Multis and Regular documents into their own PDF, make sure to tick the Concatenate files option.
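
To make the logic of these steps concrete, here is a rough sketch in plain Python. It is not a PlanetPress Connect script and uses none of its APIs. It assumes the text of each page has already been extracted into a list of strings, that a new document starts on any page containing "Page 1 of N" (an assumed boundary condition, so backing pages without that marker stay with the preceding document), and that Foreigns carry a keyword, here the placeholder "FOREIGN". The function names are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed marker format; adjust to whatever the real pages contain.
PAGE_RE = re.compile(r"Page\s+(\d+)\s+of\s+(\d+)", re.IGNORECASE)
FOREIGN_KEYWORD = "FOREIGN"  # placeholder keyword, not taken from the thread


def split_documents(pages):
    """Group page indexes into documents (the 'document boundaries' step).

    A page showing 'Page 1 of N' opens a new document. Pages without a
    marker, such as blank backing pages, stay with the current document.
    """
    documents = []
    for index, text in enumerate(pages):
        match = PAGE_RE.search(text)
        starts_document = match is not None and match.group(1) == "1"
        if starts_document or not documents:
            documents.append([])
        documents[-1].append(index)
    return documents


def classify(pages, page_indexes):
    """Examine one document and return 'foreign', 'multi' or 'regular'."""
    texts = [pages[i] for i in page_indexes]
    if any(FOREIGN_KEYWORD in text for text in texts):
        return "foreign"
    matches = [PAGE_RE.search(text) for text in texts]
    totals = [int(m.group(2)) for m in matches if m]
    if totals and max(totals) > 1:
        return "multi"
    return "regular"


def route(pages):
    """Send each document to its group (the 'Send To Folder' step)."""
    groups = defaultdict(list)
    for document in split_documents(pages):
        groups[classify(pages, document)].append(document)
    return dict(groups)


sample_pages = [
    "Invoice Page 1 of 1", "",          # Regular with a backing page
    "FOREIGN Invoice Page 1 of 1", "",  # Foreign with a backing page
    "Invoice Page 1 of 2", "",          # Multi, first sheet
    "Invoice Page 2 of 2", "",          # Multi, second sheet
]
print(route(sample_pages))
```

The point of the sketch is the order of operations: the boundaries are established first, and only then is each document examined on its own. That is why a match on page 100 no longer drags everything up to the end of the file along with it.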