Upland OL User community

Regex to split address block

In the DataMapper I am trying to create a regex that splits an address block (Central European addresses only) into

  • street
  • (countrycode) ZIP City
  • country

but I have failed so far.

Examples of the possible formats (captured lines already separated by |):

Sir|Alfred Testman |c/o Watermelon 2245 Inc|Brownstreet 13|6280 Hochdorf|SCHWEIZ
Alfred Testman |Testman 2245|Yellowlane 13|CH-6280 Hochdorf|SCHWEIZ
Boris Checkman |c/o Bavarian 5555|Morninglane 13|CH-6280 Hochdorf
Peter Pan|Oststrasse 13|99999 Simcity

The goal is to identify the first 4-5 digit ZIP code (with the appropriate country code).

record.fields.AdrBlock.match(/(\d{4,5}\s)(.[^|]+)/gi).slice(-1)[0].trim(); // works, but strips the country code

Prerequisites: the country may be empty, the ZIP may be preceded by a country code, the ZIP is 4-5 digits, and the street is always the line before the ZIP/city.

So the result should look like this:

Line 1:
Street: Brownstreet 13
ZipCity: 6280 Hochdorf
Country: SCHWEIZ

Line 2:
Street: Yellowlane 13
ZipCity: CH-6280 Hochdorf
Country: SCHWEIZ

Line 3:
Street: Morninglane 13
ZipCity: CH-6280 Hochdorf
Country: null

Line 4:
Street: Oststrasse 13
ZipCity: 99999 Simcity
Country: null

I tried hundreds of regex combinations, but none worked for all of these cases.

appreciate any help,

Ralf.

I believe the following should work:

.match(/(([A-Za-z]{2}-)?\d{4,5}\s)(.[^|]+)/gi).slice(-1)[0].trim();

The trick is to make the country code (with dash) optional, so that it may or may not be there. That's what the ([A-Za-z]{2}-)? part of the RegEx does.
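To illustrate the optional group, here is a minimal sketch that runs the pattern against the four sample lines from the question. The `samples` array and the `lastZipCity` helper are hypothetical names introduced only for this example, and the null check is an addition for blocks that contain no ZIP at all.

```javascript
// Sample address blocks copied from the question.
var samples = [
  "Sir|Alfred Testman |c/o Watermelon 2245 Inc|Brownstreet 13|6280 Hochdorf|SCHWEIZ",
  "Alfred Testman |Testman 2245|Yellowlane 13|CH-6280 Hochdorf|SCHWEIZ",
  "Boris Checkman |c/o Bavarian 5555|Morninglane 13|CH-6280 Hochdorf",
  "Peter Pan|Oststrasse 13|99999 Simcity"
];

// Optional two-letter country code plus dash, then a 4-5 digit ZIP,
// whitespace, and the city up to the next "|".
var zipCityPattern = /(([A-Za-z]{2}-)?\d{4,5}\s)(.[^|]+)/gi;

// Hypothetical helper: returns the last ZIP/city match, or null if none.
function lastZipCity(block) {
  var matches = block.match(zipCityPattern); // with the g flag: array of full matches
  return matches ? matches.slice(-1)[0].trim() : null;
}

samples.forEach(function (block) {
  console.log(lastZipCity(block));
});
// prints:
// 6280 Hochdorf
// CH-6280 Hochdorf
// CH-6280 Hochdorf
// 99999 Simcity
```

Note that in the first sample the pattern also matches "2245 Inc"; taking the last match with slice(-1) is what picks the real ZIP/city line here.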

Phil, thx!
I changed it to /(?<=\|)(([A-Za-z]{2}[- ])?\d{4,5}\s)(.[^|]+)/gi
to eliminate the wrong match in Line 1. By allowing a space between country code and ZIP and checking for a leading "|", ZIP+City now work (sometimes I can't see the wood for the trees).

And: any idea how to capture the line before and the line after that match (i.e. the previous and following text delimited by "|")?

Ralf

works in regex101 but not in DM:
record.fields.AdrBlock.match(/(?<=\|)(([A-Za-z]{2}[- ])?\d{4,5}\s)(.[^|]+)/gi)[0];

Hi @RalfG, I assume that the DataMapper cannot handle the following part of your regular expression: "(?<=\|)", because without it the regular expression seems to work fine.

The expression (?<=\|) triggers the RegEx engine's lookbehind functionality (i.e. the rest of the expression matches if, and only if, it is immediately preceded by a |). That functionality was added in the ECMAScript 2018 specification, but the DataMapper's JavaScript engine implements the ECMAScript 2016 spec, so lookbehind is not supported.

But in your case, you don’t have to use lookbehind. You can adjust your RegEx to look for the | character without capturing it:

record.fields.AdrBlock.match(/(?:\|)((?:[A-Za-z]{2}[- ])?\d{4,5}\s(?:.[^|]+))/i)[1]

Notice the /i flag (without g) and the [1] index at the end of the statement. Without the g flag, match() returns the full match followed by the capturing groups, so [1] retrieves the content of the first capturing group instead of the fully matched expression.
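The difference between index [0] and index [1] can be seen in the following sketch, which uses no lookbehind. The `firstZipCity` helper and the null check are assumptions added for illustration; they are not part of the DataMapper API.

```javascript
// Non-capturing (?:\|) consumes the leading "|" but keeps it out of group 1.
var zipCityAfterPipe = /(?:\|)((?:[A-Za-z]{2}[- ])?\d{4,5}\s(?:.[^|]+))/i;

// Hypothetical helper: returns the ZIP/city part, or null when nothing matches.
function firstZipCity(block) {
  var m = block.match(zipCityAfterPipe); // no g flag: m[0] is the full match, m[1] is group 1
  return m ? m[1] : null;
}

var block = "Alfred Testman |Testman 2245|Yellowlane 13|CH-6280 Hochdorf|SCHWEIZ";
var m = block.match(zipCityAfterPipe);

console.log(m[0]);                // |CH-6280 Hochdorf  (includes the "|")
console.log(m[1]);                // CH-6280 Hochdorf   (first capturing group only)
console.log(firstZipCity(block)); // CH-6280 Hochdorf
```

Because match() returns null when there is no match, indexing the result directly would throw for an address block without a ZIP line; the helper guards against that.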

I didn't think about ECMA… and you're absolutely right, your solution fits best!

To improve DM speed (the regex shouldn't run on every extracted field) I changed the following steps:

  • added 2 global properties (AdrBlock object, international int)
  • inserted an action step to capture the whole address block into one object
  • inserted another action step that reverses the address block array and tests whether the (now simplified) regex matches:

sourceRecord.properties.Adressblock = sourceRecord.properties.Adressblock.split("<br />").reverse();
// international = 0 when the last line is ZIP/city, 1 when a country line follows it
if (/((?:[A-Za-z]{2}[- ])?\d{4,5}\s.+)/i.test(sourceRecord.properties.Adressblock[0])) {
    sourceRecord.properties.international = 0;
} else {
    sourceRecord.properties.international = 1;
}

Then, in the extraction, I just pulled the array elements, adding the international value to the array index:
e.g. Street:
sourceRecord.properties.Adressblock[1+sourceRecord.properties.international];

e.g. City:
sourceRecord.properties.Adressblock[(0+sourceRecord.properties.international)];
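The reverse-and-offset idea can also be tried outside the DataMapper. The following is only a sketch: `splitAddress` is a hypothetical standalone function, the separator is passed in as a parameter (the sample lines in this thread use "|", the action step uses "<br />"), and the trim() call is an addition that the action step does not perform.

```javascript
// Same simplified ZIP/city test as in the action step.
var zipCityTest = /((?:[A-Za-z]{2}[- ])?\d{4,5}\s.+)/i;

// Hypothetical helper mirroring the two action steps plus the extraction.
function splitAddress(block, separator) {
  // Reverse so that the last address line becomes index 0.
  var lines = block.split(separator).map(function (s) {
    return s.trim(); // assumption: surrounding whitespace is not wanted
  }).reverse();

  // 0: last line is ZIP/city; 1: a country line follows the ZIP/city line.
  var international = zipCityTest.test(lines[0]) ? 0 : 1;

  return {
    street: lines[1 + international],
    zipCity: lines[0 + international],
    country: international ? lines[0] : null
  };
}

var withCountry = splitAddress(
  "Sir|Alfred Testman |c/o Watermelon 2245 Inc|Brownstreet 13|6280 Hochdorf|SCHWEIZ", "|");
var withoutCountry = splitAddress("Peter Pan|Oststrasse 13|99999 Simcity", "|");

console.log(withCountry);    // street "Brownstreet 13", zipCity "6280 Hochdorf", country "SCHWEIZ"
console.log(withoutCountry); // street "Oststrasse 13", zipCity "99999 Simcity", country null
```

The regex now runs once per address block instead of once per extracted field, and the "2245 Inc" false match from the first sample cannot occur because only the last line is tested.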

and: it works!

thx again!

Ralf.