BOOST-BASED REGULAR EXPRESSIONS


OVERVIEW

Regular expressions are used for

  • data validation, for example, to check whether a phone number is well formed;

  • data extraction, for example, to extract phone numbers from a string; and

  • data transformation, for example, to normalize different phone number inputs.

Stata provides two sets of regular expression functions: the byte-stream-based regexm(), regexr(), and regexs(); and the Unicode-based ustrregexm(), ustrregexrf(), ustrregexra(), and ustrregexs(). The Unicode-based functions are built on top of the ICU libraries.

In Stata 18, the byte-stream-based functions have been updated to use the Boost library as their engine. The functions are under user-version control, so the old behavior is retained when the user specifies version 17 (for example, with the version 17: prefix).

A good discussion of regular expressions in Stata can be found in Asjad Naqvi’s Stata guide.

 

The old implementation is based on Henry Spencer’s NFA algorithm and follows the POSIX.2 standard almost exactly. The new implementation in Stata 18 has more features. For example, it supports {n} for matching the preceding element exactly n times:

. display regexm("123", "\d{3}")
1

. version 17: display regexm("123", "\d{3}")
0

 

A set of new functions that use the Boost library exclusively has also been added:

  • regexmatch() performs a match of a regular expression to an ASCII string.

  • regexreplace() replaces the first substring that matches a regular expression with specified text.

  • regexreplaceall() replaces all substrings that match a regular expression with specified text.

  • regexcapture() returns a subexpression from a previous match.

  • regexcapturenamed() returns a subexpression corresponding to a matching named group in a regular expression from a previous match.
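The functions listed above can be tried interactively. The following is a minimal sketch, assuming the basic call forms regexmatch(s, re), regexreplace(s, re, replacement), regexreplaceall(s, re, replacement), regexcapture(n), and regexcapturenamed(name); the input strings and the group name last are made up for illustration.

```stata
* regexmatch(): does the string contain three digits, a hyphen, and four digits?
display regexmatch("tel: 555-0100", "(\d{3})-(\d{4})")

* regexcapture(): subexpressions 1 and 2 from the match above
display regexcapture(1)
display regexcapture(2)

* regexcapturenamed(): the same idea with a named group (hypothetical name "last")
display regexmatch("tel: 555-0100", "\d{3}-(?<last>\d{4})")
display regexcapturenamed("last")

* regexreplace() changes only the first match;
* regexreplaceall() changes every match
display regexreplace("a1b22c333", "\d+", "#")
display regexreplaceall("a1b22c333", "\d+", "#")
```

The capture functions read from the most recent match, so each call to regexcapture() or regexcapturenamed() should follow the match it refers to.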

 

© Copyright 1996–2025 StataCorp LLC. All rights reserved.

BOOST-BASED REGULAR EXPRESSIONS IN ACTION!

We would like to match and extract phone numbers in the addresses of heads of governments.

We require the following rules:

  • The phone number follows “Phone:” or “tel:”.

  • It may start with “+”.

  • After “+” or at the start, it has 1 to 3 nonzero digits.

  • After that, it has anywhere from 7 to 32 characters, each of which is a digit, a space, or “-”.

If the address matches, we would like to generate a variable, phone, that holds the extracted phone number without the leading “Phone:” or “tel:”.

We would also like to generate another variable, address1, in which the matched phone text in the address is replaced by “tel:” followed by the extracted phone number.

Components of the regular expression in the local macro reg are as follows:

  • (?:Phone\:[\s]*?|tel\:[\s]*) matches either “Phone:” or “tel:” followed by zero or more whitespace characters, without capturing the match.

  • ([+]{0, 1}[1-9]{1, 3}[0-9\s-]{7,32}) matches and captures a substring that satisfies the following:

    • [+]{0, 1} means that it may start with “+”.

    • [1-9]{1, 3} means that after “+” or at the start, it has 1 to 3 nonzero digits.

    • [0-9\s-]{7,32} means that after that, it has anywhere from 7 to 32 characters, each of which is a digit, a whitespace character, or “-”.

We see that the third address does not contain either “Phone:” or “tel:” and thus does not match the regular expression, so phone is missing for this observation.
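Putting the pieces together, the steps described in this section could be written roughly as follows. This is a sketch under stated assumptions, not necessarily the code behind the results discussed above: it assumes a string variable named address is already in memory (a hypothetical name), and it builds the replacement text by string concatenation rather than by a back-reference in the replacement string.

```stata
* Assumes a string variable named address is already in memory (hypothetical name)
local reg "(?:Phone\:[\s]*?|tel\:[\s]*)([+]{0, 1}[1-9]{1, 3}[0-9\s-]{7,32})"

* phone: the first captured group, filled in only for addresses that match
generate phone = regexcapture(1) if regexmatch(address, "`reg'")

* address1: the matched text replaced by "tel:" plus the extracted number;
* addresses that do not match are copied unchanged
generate address1 = regexreplace(address, "`reg'", "tel:" + phone)

list address phone address1
```

Because the “Phone:” or “tel:” prefix sits in a noncapturing group, it is part of the matched text that gets replaced but is not part of regexcapture(1).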