mdata: Stata package to handle metadata
Gustavo Iglésias - Banco de Portugal Microdata Research Laboratory (BPLIM)
extract: extracts metadata from the dataset in memory to an Excel file
apply: applies metadata from an Excel file to the dataset in memory
check: checks for inconsistencies in the Excel metadata file
cmp: compares Excel metadata files
combine: combines Excel metadata files
morph: transforms Excel metadata files to eliminate redundant information
uniform: harmonizes information in Excel metadata files
clear: removes all metadata from the dataset in memory
mdata subcommand [, options]
where subcommand is one of the tools listed above.
mdata extract
mdata extract exports metadata from the dataset in memory to an Excel file, which is organized in sheets.
Metadata exported to this file includes, but is not limited to:
Data labels, notes and characteristics
Label languages defined
Variables' labels, type and format
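To make the list above concrete, the sketch below shows where each kind of metadata lives inside Stata itself, using only built-in commands. It assumes the copy of nlsw88 that ships with Stata (loaded with sysuse), which may differ slightly from the data/nlsw88 file used in these slides.

```stata
* Minimal sketch: inspect the metadata items listed above with built-in commands.
* Assumption: the shipped example dataset nlsw88 stands in for data/nlsw88.
sysuse nlsw88, clear

notes _dta            // dataset notes
char list             // characteristics of the dataset and its variables
label language        // label languages defined in the dataset
describe race         // storage type, display format, value label, variable label
label list racelbl    // value-to-text mapping of one value label
```

These are the pieces of information that a metadata file has to carry if it is to describe the dataset without the data.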
Let's take as an example the Stata dataset nlsw88:
%%stata
use "data/nlsw88", clear
describe
. use "data/nlsw88", clear
(NLSW, 1988 extract)

. describe

Contains data from data/nlsw88.dta
 Observations:         2,246                  NLSW, 1988 extract
    Variables:            17                  22 Apr 2022 16:41
                                              (_dta has notes)
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
idcode          int     %8.0g                 NLS ID
age             byte    %8.0g                 Age in current year
race            byte    %8.0g      racelbl    Race
married         byte    %8.0g      marlbl     Married
never_married   byte    %16.0g     nev_mar    Never married
grade           byte    %8.0g                 Current grade completed
collgrad        byte    %16.0g     gradlbl    College graduate
south           byte    %9.0g      southlbl   Lives in the south
smsa            byte    %9.0g      smsalbl    Lives in SMSA
c_city          byte    %16.0g     ccitylbl   Lives in a central city
industry        byte    %23.0g     indlbl     Industry
occupation      byte    %22.0g     occlbl     Occupation
union           byte    %8.0g      unionlbl   Union worker
wage            float   %9.0g                 Hourly wage
hours           byte    %8.0g                 Usual hours worked
ttl_exp         float   %9.0g                 Total work experience (years)
tenure          float   %9.0g                 Job tenure (years)
-------------------------------------------------------------------------------
Sorted by: idcode
%%stata
cap mkdir meta
mdata extract, meta("meta/meta1", replace)
. cap mkdir meta

. mdata extract, meta("meta/meta1", replace)
File meta/meta1.xlsx saved
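Since the exported file is organized in sheets, one way to see that organization without leaving Stata is the built-in import excel command with its describe option. This is only a sketch; it assumes meta/meta1.xlsx exists, as reported in the output above, and makes no claim about which sheets mdata writes.

```stata
* Minimal sketch: list the worksheets (and their cell ranges) in the metadata file.
* Assumption: meta/meta1.xlsx was created by the mdata extract call shown above.
import excel using "meta/meta1.xlsx", describe

* The same information is left behind in r() for programmatic use
return list
```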
%%stata
* Example with labels in Portuguese
label language pt, new
* Variable labels
label var age "Idade"
label var race "Raça"
* Value labels
label define marlbl_pt 0 "Solteiro" 1 "Casado"
label values married marlbl_pt
label language en
* Extract metadata
mdata extract, meta("meta/meta2", replace)
. * Example with labels in Portuguese
. label language pt, new
(language pt now current language)

. * Variable labels
. label var age "Idade"

. label var race "Raça"

. * Value labels
. label define marlbl_pt 0 "Solteiro" 1 "Casado"

. label values married marlbl_pt

. label language en

. * Extract meta data
. mdata extract, meta("meta/meta2", replace)
File meta/meta2.xlsx saved
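The effect of defining a second label language can be checked with built-in commands before extracting. The sketch below is self-contained and assumes the copy of nlsw88 that ships with Stata; that copy names its only language "default", so it is renamed to en first to mirror the example above.

```stata
* Minimal sketch: the same variable carries a different label in each language.
* Assumption: the shipped nlsw88 stands in for data/nlsw88.
sysuse nlsw88, clear
label language en, rename      // rename the current ("default") language to en

label language pt, new         // new language; labels start out empty
label variable age "Idade"
local lbl_pt : variable label age

label language en              // switch back to English
local lbl_en : variable label age

label language                 // list the languages now defined
display "pt: `lbl_pt' | en: `lbl_en'"
```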
Advantages of using mdata extract:
All the metadata is stored in an Excel file, so users can easily inspect it
Metadata may be analyzed (and changed) by non-Stata users
By separating data from metadata, it is possible to use more efficient formats
We can apply the stored metadata to new data (mdata apply)
mdata apply
mdata apply applies metadata stored in the Excel metadata file to data in memory.
The file is expected to have the structure produced by mdata extract, and it is verified before being applied (see mdata check).
mdata apply is particularly useful when you have incoming (monthly, annual, etc.) data that is structurally similar.
%%stata
use data/nlsw85, clear
describe
mdata extract, meta("meta/meta85", replace)
. use data/nlsw85, clear
(NLSW - 1985 extraction)

. describe

Contains data from data/nlsw85.dta
 Observations:         2,085                  NLSW - 1985 extraction
    Variables:             7                  22 Apr 2022 18:26
                                              (_dta has notes)
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label        Variable label
-------------------------------------------------------------------------------
idcode          int     %8.0g                   NLS ID
year            byte    %8.0g                   Interview year
birth_yr        byte    %8.0g                   Birth year
age             byte    %8.0g                   Age in current year
race            byte    %8.0g      racelbl      Race
msp             byte    %23.0g     msplbl       1 if married, spouse present
collgrad        byte    %16.0g     collgradlbl  1 if college graduate
-------------------------------------------------------------------------------
Sorted by: idcode  year

. mdata extract, meta(meta/meta85, replace)
File meta/meta85.xlsx saved
%%stata
use data/nlsw87, clear
describe
. use data/nlsw87, clear

. describe

Contains data from data/nlsw87.dta
 Observations:         2,164
    Variables:             8                  22 Apr 2022 18:29
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
idcode          int     %8.0g
year            byte    %8.0g
birth_yr        byte    %8.0g
age             byte    %8.0g
race            byte    %8.0g
msp             byte    %8.0g
collgrad        byte    %8.0g
union           byte    %8.0g
-------------------------------------------------------------------------------
Sorted by:
%%stata
mdata apply, meta("meta/meta87") do("dos/apply87")
describe
. mdata apply, meta(meta/meta87) do(dos/apply87)
File dos/apply87.do saved

. describe

Contains data from data/nlsw87.dta
 Observations:         2,164
    Variables:             8                  22 Apr 2022 18:29
                                              (_dta has notes)
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label        Variable label
-------------------------------------------------------------------------------
idcode          int     %8.0g                   NLS ID
year            byte    %8.0g                   Interview year
birth_yr        byte    %8.0g                   Birth year
age             byte    %8.0g                   Age in current year
race            byte    %8.0g      racelbl      Race
msp             byte    %23.0g     msplbl       1 if married, spouse present
collgrad        byte    %16.0g     collgradlbl  1 if college graduate
union           byte    %8.0g      unionlbl     Union worker
-------------------------------------------------------------------------------
Sorted by: idcode  year
     Note: Dataset has changed since last saved.
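The output above reports that the do() option saved a do-file. A simple way to review it is Stata's built-in type command. The sketch assumes dos/apply87.do exists as reported; the contents of that file are not shown in these slides, so nothing is assumed about them here.

```stata
* Minimal sketch: check that the do-file written by mdata apply exists, then print it.
* Assumption: dos/apply87.do was saved by the mdata apply call shown above.
confirm file "dos/apply87.do"
type "dos/apply87.do"
```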
mdata check
mdata check verifies the integrity of metadata stored in the Excel metadata file.
The file is expected to have the structure produced by mdata extract.
The same verification is run by mdata apply, whose execution stops if any inconsistency is found.
mdata cmp
mdata cmp compares metadata found in two Excel metadata files.
It assumes that both files were created by mdata extract and that the files should be identical (with the exception of data features).
Differences are labeled as inconsistencies.
Variables
Characteristics
Notes
Value labels
mdata combine
mdata combine combines metadata found in two Excel metadata files, generating a new Excel metadata file.
Both input files are expected to have the structure produced by mdata extract.
mdata morph
mdata morph transforms the Excel metadata file by removing redundant information.
The file is expected to have the structure produced by mdata extract.
mdata uniform
mdata uniform harmonizes metadata stored in the Excel metadata file.
The file is expected to have the structure produced by mdata extract.
mdata offers a suite of tools to handle metadata
All the metadata is stored in an Excel file, so users can easily inspect it
Metadata may be analyzed (and changed) by non-Stata users
By separating data from metadata:
It is possible to use more efficient formats (Parquet, for example) when dealing with large amounts of data
Metadata can be manipulated and combined without loading data into memory (useful for huge datasets)
Users who cannot see the data (confidential data) can still analyze and manipulate the metadata
The same metadata can be used for multiple datasets
Metadata becomes portable
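The separation of data and metadata summarized above can be sketched as a round trip: extract the metadata, strip it from the data, then apply it again. The sketch reuses only the call forms shown earlier in these slides. The file names meta/meta_roundtrip and dos/apply_roundtrip are hypothetical, calling mdata clear without options is an assumption based on the general syntax, and the shipped nlsw88 stands in for the slides' own data.

```stata
* Minimal sketch of a metadata round trip (file names are hypothetical).
sysuse nlsw88, clear
cap mkdir meta
cap mkdir dos

* 1. Store the metadata in an Excel file
mdata extract, meta("meta/meta_roundtrip", replace)

* 2. Remove all metadata from the dataset in memory
*    (assumption: mdata clear needs no options)
mdata clear
describe

* 3. Apply the stored metadata back to the stripped data
mdata apply, meta("meta/meta_roundtrip") do("dos/apply_roundtrip")
describe
```

Between steps 2 and 3 the bare data could be saved in any format and handed over separately from the Excel file, which is the workflow the points above describe.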
gtools package by Mauricio Caceres
bpencode by BPLIM