The most logical method for identifying boundaries in this data file is to look for the Page x of y string and set a document boundary whenever x equals y (e.g. Page 1 of 1 or Page 4 of 4).
However, that string is found somewhere in the middle of a variable-length page, so any number of lines may follow it before the page ends. To make sure the process isn’t thrown off by those additional lines, you can write a script that checks the x and y values and, when they match, waits until it finds the next header (identified, in this data file, by the string Thurrock Council Benefits Department,) before setting the document boundary.
The following script should achieve that:
// Read columns 1 to 100 of the current line.
var line = boundaries.get(region.createRegion(1,1,100,1));
var re = /Page (\d+) of (\d+)/;
var matches = re.exec(line[0]);
if(matches!==null && matches.length==3 && matches[1]==matches[2]){
    // Last page of a document (x equals y): remember it, but set no boundary yet.
    boundaries.setVariable("found", true);
    logger.info(line[0]);
} else if(boundaries.getVariable("found") && line[0].slice(0,37)=="Thurrock Council Benefits Department,") {
    // First header after a last page: the next document starts here.
    boundaries.set(0);
    boundaries.setVariable("found", false);
}
The script inspects each line in turn. If a line contains the Page x of y construct, the script compares the values of x and y. If they match, it sets a variable (found) to true, but does not set a boundary yet. The script keeps processing lines, and when it finds the header string (“Thurrock…”) while the found variable is true, it sets the boundary on that line and resets found to false.
To make this work, make sure that:
- The page delimiter is set to Lines, with the number of lines set to 1.
- The trigger is set to On script, with the above script.
If any given page contains more lines than the others, the script will still work because it doesn’t look for fixed page lengths.
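To try out the boundary logic outside the DataMapper, the same decision rules can be sketched in plain JavaScript. This is only an illustrative simulation: the findBoundaries function and the sample lines are hypothetical, and an ordinary local variable stands in for boundaries.setVariable and boundaries.getVariable.

```javascript
// Stand-alone simulation of the boundary logic (no DataMapper API involved).
// findBoundaries and the sample data are hypothetical, for illustration only.
const HEADER = "Thurrock Council Benefits Department,";
const PAGE_RE = /Page (\d+) of (\d+)/;

function findBoundaries(lines) {
  const boundaries = []; // indexes of the lines on which a boundary would be set
  let found = false;     // true once the last page of a document has been seen
  lines.forEach((line, index) => {
    const matches = PAGE_RE.exec(line);
    if (matches !== null && matches[1] === matches[2]) {
      // "Page x of y" with x equal to y: wait for the next header.
      found = true;
    } else if (found && line.startsWith(HEADER)) {
      // First header after a last page: a new document starts here.
      boundaries.push(index);
      found = false;
    }
  });
  return boundaries;
}

// Two documents: one of two pages (with extra trailing lines), one of one page.
const sample = [
  "Thurrock Council Benefits Department, (header text)", // index 0
  "Some text",
  "Page 1 of 2",
  "More text",
  "Thurrock Council Benefits Department, (header text)", // index 4
  "Page 2 of 2",
  "Extra line",
  "Another extra line",
  "Thurrock Council Benefits Department, (header text)", // index 8
  "Page 1 of 1",
  "Closing line",
];

console.log(findBoundaries(sample)); // logs [ 8 ]
```

The header at index 4 does not produce a boundary because no matching Page x of y line precedes it, and the two extra lines after Page 2 of 2 do not affect the result.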