BOOST-BASED REGULAR EXPRESSIONS

OVERVIEW

Regular expressions are used for

data validation, for example, to check whether a phone number is well formed;
data extraction, for example, to extract phone numbers from a string; and
data transformation, for example, to normalize different phone number inputs.

Stata provides two sets of regular expression functions: byte-stream-based regexm(), regexr(), and regexs(); and Unicode-based ustrregexm(), ustrregexrf(), ustrregexra(), and ustrregexs(). Unicode-based regular expression functions are built on top of ICU libraries.

In Stata 18, the byte-stream-based functions are updated to use the Boost library as the engine. The functions are under version control: prefixing a command with version 17: retains the old behavior.

A good discussion of regular expressions in Stata can be found in Asjad Naqvi’s Stata guide.

The old implementation is based on Henry Spencer’s NFA algorithm and is nearly identical to the POSIX.2 standard. The new implementation in Stata 18 has more features. For example, the new implementation supports {n} for matching a regular expression exactly n times:

. display regexm("123", "\d{3}")
1

. version 17: display regexm("123", "\d{3}")
0

A set of new functions that exclusively use the Boost library have been added:

regexmatch() performs a match of a regular expression to an ASCII string.
regexreplace() replaces the first substring that matches a regular expression with specified text.
regexreplaceall() replaces all substrings that match a regular expression with specified text.
regexcapture() returns a subexpression from a previous match.
regexcapturenamed() returns a subexpression corresponding to a matching named group in a regular expression from a previous match.
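
As a rough illustration of how these functions fit together, the do-file fragment below calls each one on made-up literal strings. Only the basic argument forms are used. The sample strings and the group names year and month are assumptions chosen for this sketch.

```stata
* Hypothetical inputs; requires Stata 18 or later.
* Does the string contain a run of digits?
display regexmatch("order 123 and 456", "\d+")

* Replace only the first run of digits, then every run.
display regexreplace("order 123 and 456", "\d+", "N")
display regexreplaceall("order 123 and 456", "\d+", "N")

* Capture by position: group 1 of the most recent match.
display regexmatch("date 2020-01-31", "(\d{4})-(\d{2})")
display regexcapture(1)

* Capture by name: the groups are labeled year and month inside the pattern.
display regexmatch("date 2020-01-31", "(?<year>\d{4})-(?<month>\d{2})")
display regexcapturenamed("year")
```

The capture functions read from the most recent match, so each call to regexcapture() or regexcapturenamed() here directly follows the regexmatch() it depends on.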

BOOST-BASED REGULAR EXPRESSIONS IN ACTION!

We would like to match and extract phone numbers in the addresses of heads of governments.

We require the following rules:

The phone number follows “Phone:” or “tel:”.
It may start with “+”.
After “+” or at the start, it has 1 to 3 nonzero digits.
After that, it has 7 to 32 characters, each of which is a digit, a space, or “-”.

When an address matches, we would like to generate a variable, phone, that holds the extracted phone number without the leading “Phone:” or “tel:”.

We would also like to generate another variable, address1, in which the matched phone text in the address is replaced by “tel:” followed by the extracted phone number.
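
One way these steps might look in a do-file is sketched below. It assumes the addresses are stored in a string variable named address. Using regexcapture(1) under an if qualifier and building the replacement text by string concatenation are choices made for this sketch, not necessarily the original commands.

```stata
* Assumes a dataset in memory with a string variable named address.
* The pattern is kept in the local macro reg.
local reg "(?:Phone\:[\s]*?|tel\:[\s]*)([+]{0,1}[1-9]{1,3}[0-9\s-]{7,32})"

* phone holds capture group 1 where the address matches; "" otherwise.
generate phone = regexcapture(1) if regexmatch(address, "`reg'")

* address1 is address with the matched text replaced by "tel:" plus the number.
generate address1 = regexreplace(address, "`reg'", "tel:" + phone)

list address phone address1
```

Concatenating "tel:" with phone avoids writing a group reference such as $1 inside a Stata string, where the dollar sign could be read as a global macro.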

Components of the regular expression in the local macro reg are as follows:

(?:Phone\:[\s]*?|tel\:[\s]*) matches either “Phone:” or “tel:”, followed by optional whitespace, without capturing the match.
([+]{0,1}[1-9]{1,3}[0-9\s-]{7,32}) matches and captures text that satisfies the following:
- [+]{0,1} means it may start with “+”.
- [1-9]{1,3} means that after “+” or at the start, it has 1 to 3 nonzero digits.
- [0-9\s-]{7,32} means that after that, it has 7 to 32 characters, each of which is a digit, a whitespace character, or “-”.
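
To see the difference between the non-capturing prefix group and the capturing number group, the pattern can be tried on a single literal string. The sample string below is hypothetical and is not one of the addresses in the dataset.

```stata
* Hypothetical sample string, not taken from the dataset.
local reg "(?:Phone\:[\s]*?|tel\:[\s]*)([+]{0,1}[1-9]{1,3}[0-9\s-]{7,32})"
local s "Office of the President, tel: +1 202-555-0100"

display regexmatch("`s'", "`reg'")

* Group 0 is the whole match, including the "tel:" prefix.
display regexcapture(0)

* Group 1 is only the number, because the prefix group starts with ?:
* and is therefore not counted as a capture group.
display regexcapture(1)
```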

We see that the third address does not contain either “Phone:” or “tel:” and thus does not match the regular expression, so phone is missing for this observation.
