
Extracting a multi-line description in PDF and TXT

In this How-To we’ll show how to extract an item description that spans a variable number of lines. We will use this sample file (which can be opened in a text editor).

  • First of all, create a new DataMapper Configuration for the Text File provided.
  • Leave the Input Data as default
  • In the Boundaries, as the Trigger field, choose On text. Select the first series of numbers on the invoice (i.e. 101010089) and click on the Select the area button (blue square next to the Location field). Finally, in the Operator drop-down, choose on changes.

 

If you take a look at the excerpt below, each description spans a variable number of lines and ends when we find the text LOAD FACTOR:

The method we’ll use is to concatenate all the description lines in a Property using an Action step, and then extract this property value into a single field using an Extract step based on a JavaScript expression.
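To make the idea concrete before we build it in the DataMapper, here is a minimal sketch in plain JavaScript that can be run outside the DataMapper. The function name joinDescription and the sample lines are invented for illustration, and the sketch assumes the LOAD FACTOR line itself is not part of the description.

```javascript
// Plain-JavaScript sketch of the idea; not DataMapper code.
// The sample lines below are invented for illustration.
const TERMINATOR = "LOAD FACTOR";

// Concatenate lines until the terminator line is reached.
function joinDescription(lines) {
  let description = "";
  for (const line of lines) {
    if (line.includes(TERMINATOR)) {
      break; // assumption: the terminator line is not part of the description
    }
    description += line.trim() + " ";
  }
  return description.trim();
}

const descriptionLines = [
  "STEEL BRACKET, GALVANIZED",
  "PACKED IN BOXES OF 12",
  "LOAD FACTOR : 1.5",
  "THIS LINE BELONGS TO SOMETHING ELSE",
];

console.log(joinDescription(descriptionLines));
// prints "STEEL BRACKET, GALVANIZED PACKED IN BOXES OF 12"
```

In the DataMapper, the loop is a Repeat step, the accumulator is the record property, and the final read is the Extract step.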

Step 1: Defining the property in the Preprocessor

Let’s first define the property in the preprocessor. This property will store the string as we concatenate it.

  1. Double-click the Preprocessor step and take a look at the Step Properties panel.
  2. In the Properties section, click the Add icon on the right side. A new property is added to the list.
  3. Rename this property, e.g. description.
  4. Select Each record in the Scope column. This is the only scope that we can modify per record.
  5. Select String in the Type column and leave the default value '';

Step 2: Creating the Repeat Step

Next we need to start extracting the actual data. Our initial loop needs to be for each item, and within that we’ll loop to extract each line of the description into our Source Record Property.

  • Toggle back to the Steps pane.
  • As usual we need to start by using a Goto step to go to the first detail lines. By going through the different records, notice that the first item is not always on the same line for each and every record. We will need to find a way to detect the line which contains the first item. To do this, we can use a Goto step defined by a regular expression:
    1. Add a Goto step and, in the Goto Definition section, in the Target type dropdown, choose Next occurrence of.
    2. Uncheck the Inspect entire page width option, specify the following area: Left: 1, Right: 7, and check Use Regular Expression.
    3. In the Expression field, type [A-Z]{3}[0-9]{4} in order to find any string composed of 3 uppercase letters followed by 4 digits.
  • The next step is to loop only on the lines that have a product number on the left. Let’s use our secret trick again: Regular Expressions!
    1. Add a Repeat Step and change its Repeat type to Until no more elements
    2. In the Goto Step that’s added automatically, change the Target type to Next occurrence of.
    3. Uncheck Inspect entire page width.
    4. In the Text Viewer pane, make a selection around the ‘ItemID’ value (i.e. HKL2001). Then, click on the Use selection icon (blue square), next to the Left and Right properties.
    5. Check Use Regular Expression to enable the magic. For this particular case, the item number is 3 uppercase letters followed by 4 digits, which can be expressed as [A-Z]{3}[0-9]{4} in the Expression field.
  • Finally add an Extraction Step within the loop (before the Goto step) in order to retrieve the data of the item (ItemId, Seq, Qty_Ord, etc.) into a detail table.
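To see what the regular expression selects, here is a small plain-JavaScript sketch that can be run outside the DataMapper. It assumes that Left: 1 and Right: 7 correspond to the first seven characters of a line. The helper isItemLine and the sample lines are invented for illustration (only HKL2001 comes from the sample described above).

```javascript
// Pattern from the how-to: 3 uppercase letters followed by 4 digits.
const ITEM_ID = /[A-Z]{3}[0-9]{4}/;

// Assumption: Left 1 / Right 7 maps to the first seven characters of a line.
function isItemLine(line) {
  return ITEM_ID.test(line.slice(0, 7));
}

// Invented sample lines; only the item lines start with an item number.
const pageLines = [
  "HKL2001  1   12  STEEL BRACKET",
  "                 PACKED IN BOXES OF 12",
  "                 LOAD FACTOR : 1.5",
  "ABC1234  2    3  RUBBER GASKET",
];

// Collect the indexes of the lines the Goto step would stop on.
const itemLineIndexes = pageLines
  .map((line, index) => (isItemLine(line) ? index : -1))
  .filter((index) => index >= 0);

console.log(itemLineIndexes); // prints [ 0, 3 ]
```

Restricting the test to the first seven columns is what keeps a description line that happens to contain a similar code further to the right from being mistaken for an item line.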

Now that we have the basic loop, we’ll need to create a second, nested loop that stores each description line in the variable we created before. Here’s how we do it.

  1. Make sure that the Extraction Step is selected (it should be the only orange step in the list), so that what we add next appears between it and the existing Goto step.
  2. Toggle to the Data viewer, and make a selection of the text “LOAD FACTOR :” which is the last line of each item description.
  3. Add a new Repeat step, which will loop on the description lines until it encounters “LOAD FACTOR :”.
  4. Add a new Action step, which should be within the new loop.
  5. In the Step properties pane of the Action step, go to the Run Javascript section.
  6. Erase the default line in Expression editor and enter the following lines:
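    // Read columns 51 to 85 of the current line (the description area).
    var currentLineDesc = data.extract(51,85,0,1,"<br />");
    // Append it, followed by a space, to the description property.
    sourceRecord.properties.description += currentLineDesc + ' ';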
    
  7. Back in the Steps pane, select the end of the Repeat step that loops on the description lines.
  8. Add a new Extract step, then change its Step properties:
    1. Set the Mode to JavaScript
    2. Type the following line in Expression editor, which grabs the value you’d expect: sourceRecord.properties.description;
    3. Check Append values to current record. This makes sure we’re extracting to the same detail line we already have, rather than a new line in the detail table.
    4. Rename this field by clicking the Order and rename fields icon on the right side of the Field List, and give it a name such as Description.
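The Action step's script can be approximated in plain JavaScript to see what ends up in the property. This is only a stand-in that can be run outside the DataMapper: extractColumns is a hypothetical helper, it assumes columns are 1-based and inclusive, and it trims padding, which may not match what data.extract returns. The fixed-width lines are invented.

```javascript
// Hypothetical stand-in for data.extract(51, 85, 0, 1, ...) on one line.
// Assumption: columns are 1-based and inclusive, so 51..85 is 35 characters.
function extractColumns(line, left, right) {
  return line.slice(left - 1, right).trim();
}

// Invented fixed-width lines: 50 columns of other data, then the description.
const firstLine =
  "HKL2001".padEnd(50, " ") + "STEEL BRACKET, GALVANIZED".padEnd(35, " ");
const secondLine = "".padEnd(50, " ") + "PACKED IN BOXES OF 12";

// Same accumulation as the Action step: append the slice plus one space.
let description = "";
description += extractColumns(firstLine, 51, 85) + " ";
description += extractColumns(secondLine, 51, 85) + " ";

// Each append leaves one trailing space; trim() removes it if unwanted.
console.log(JSON.stringify(description.trim()));
// prints "STEEL BRACKET, GALVANIZED PACKED IN BOXES OF 12" (with the quotes)
```

Note that the accumulated value always ends with a space. If that matters for the extracted field, trimming the value in the Extract step's expression would be one way to drop it.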

And lastly, we need to reset the property description to an empty string, so that the next description doesn’t just continue on this one!

  1. Add an Action step after the last Goto step.
  2. Toggle to its Step properties and in the Actions section, select Set property in the Type column.
  3. Select description in Property and select JavaScript in Based on.
  4. Enter ''; in Expression editor if it is not already written.
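To see why the reset matters, here is a plain-JavaScript simulation of the whole flow that can be run outside the DataMapper: an outer loop over item lines, an inner loop over description lines, and an optional reset. The function extractItems, the column position DESC_START, and the sample lines are all invented for illustration. This is a sketch of the logic, not the DataMapper's own implementation.

```javascript
const ITEM = /^[A-Z]{3}[0-9]{4}/; // item number at the start of the line
const END = "LOAD FACTOR";        // marks the end of a description
const DESC_START = 9;             // assumed column where descriptions begin

function extractItems(lines, resetAfterEachItem) {
  const records = [];
  let description = ""; // stands in for the description property
  let i = 0;
  while (i < lines.length) {
    if (!ITEM.test(lines[i])) {
      i += 1; // not an item line: skip it
      continue;
    }
    const record = { ItemID: lines[i].slice(0, 7) };
    // Inner loop: accumulate description lines until the END line.
    do {
      description += lines[i].slice(DESC_START).trim() + " ";
      i += 1;
    } while (i < lines.length && !lines[i].includes(END));
    record.Description = description.trim();
    records.push(record);
    if (resetAfterEachItem) {
      description = ""; // the reset performed by the last Action step
    }
  }
  return records;
}

const invoiceLines = [
  "HKL2001  STEEL BRACKET",
  "         GALVANIZED",
  "         LOAD FACTOR : 1.5",
  "ABC1234  RUBBER GASKET",
  "         LOAD FACTOR : 2.0",
];

console.log(extractItems(invoiceLines, true)[1].Description);
// prints "RUBBER GASKET"
console.log(extractItems(invoiceLines, false)[1].Description);
// prints "STEEL BRACKET GALVANIZED RUBBER GASKET"
```

Without the reset, the second item inherits the first item's text, which is exactly the symptom the last Action step prevents.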

Results

In the end, the steps pane should look something like this:

You can also download the solution for this exercise.

