The Workflow of Data Analysis Using Stata, by J. Scott Long, is an essential productivity tool for data analysts. Aimed at anyone who analyzes data, this book presents an effective strategy for designing and doing data-analytic projects.
In this book, Long presents lessons gained from his experience with numerous academic publications, as a coauthor of the immensely popular Regression Models for Categorical Dependent Variables Using Stata, and as a coauthor of the SPOST routines, which are downloaded over 20,000 times a year.
A workflow of data analysis is a process for managing all aspects of data analysis. Planning, documenting, and organizing your work; cleaning the data; creating, renaming, and verifying variables; performing and presenting statistical analyses; producing replicable results; and archiving what you have done are all integral parts of your workflow.
Long shows how to design and implement efficient workflows for both one-person projects and team projects. He guides you toward streamlining your workflow because a good workflow is essential for replicating your work, and replication is essential for good science.
An efficient workflow reduces the time you spend doing data management and lets you produce datasets that are easier to analyze. When you methodically clean your data and carefully choose names and effective labels for your variables, the time you spend doing statistical and graphical analyses will be more productive and more enjoyable.
After introducing workflows and explaining how a better workflow can make it easier to work with data, Long describes planning, organizing, and documenting your work. He then introduces how to write and debug Stata do-files and how to use local and global macros. Long presents conventions that greatly simplify data analysis—conventions for naming, labeling, documenting, and verifying variables. He also covers cleaning, analyzing, and protecting your data.
While describing effective workflows, Long also introduces the concepts of basic data management using Stata and writing Stata do-files. Using real-world examples, Stata commands, and Stata scripts, Long illustrates effective techniques for managing your data and analyses. If you analyze data, this book is recommended for you.
List of tables
List of figures
List of examples
Preface
A word about fonts, files, commands, and examples
1. INTRODUCTION
Replication: The guiding principle for workflow
Steps in the workflow
Cleaning data
Running analysis
Presenting results
Protecting files
Tasks within each step
Planning
Organization
Documentation
Execution
Criteria for choosing a workflow
Accuracy
Efficiency
Simplicity
Standardization
Automation
Usability
Scalability
Changing your workflow
How the book is organized
2. PLANNING, ORGANIZING, AND DOCUMENTING
The cycle of data analysis
Planning
Organization
Principles for organization
Organizing files and directories
Creating your directory structure
A directory structure for a small project
A directory structure for a large, one-person project
Directories for collaborative projects
Special-purpose directories
Remembering what directories contain
Planning your directory structure
Naming files
Batch files
Moving into a new directory structure (advanced topic)
Example of moving into a new directory structure
Documentation
What should you document?
Levels of documentation
Suggestions for writing documentation
Evaluating your documentation
The research log
A sample page from a research log
A template for research logs
Codebooks
A codebook based on the survey instrument
Dataset documentation
Conclusions
3. WRITING AND DEBUGGING DO-FILES
Three ways to execute commands
The Command window
Dialog boxes
Do-files
Writing effective do-files
Making do-files robust
Make do-files self-contained
Use version control
Exclude directory information
Include seeds for random numbers
Making do-files legible
Use lots of comments
Use alignment and indentation
Use short lines
Limit your abbreviations
Be consistent
Templates for do-files
Commands that belong in every do-file
A template for simple do-files
A more complex do-file template
Debugging do-files
Simple errors and how to fix them
Log file is open
Log file already exists
Incorrect command name
Incorrect variable name
Incorrect option
Missing comma before options
Steps for resolving errors
Step 1: Update Stata and user-written programs
Step 2: Start with a clean slate
Step 3: Try other data
Step 4: Assume everything could be wrong
Step 5: Run the program in steps
Step 6: Exclude parts of the do-file
Step 7: Starting over
Step 8: Sometimes it is not your mistake
Example 1: Debugging a subtle syntax error
Example 2: Debugging unanticipated results
Advanced methods for debugging
How to get help
Conclusions
4. AUTOMATING YOUR WORK
Macros
Local and global macros
Local macros
Global macros
Using double quotes when defining macros
Creating long strings
Specifying groups of variables and nested models
Setting options with locals
Information returned by Stata commands
Using returned results with local macros
Loops: foreach and forvalues
The foreach command
The forvalues command
Ways to use loops
Loop example 1: Listing variable and value labels
Loop example 2: Creating interaction variables
Loop example 3: Fitting models with alternative measures of education
Loop example 4: Recoding multiple variables the same way
Loop example 5: Creating a macro that holds accumulated information
Loop example 6: Retrieving information returned by Stata
Counters in loops
Using loops to save results to a matrix
Nested loops
Debugging loops
The include command
Specifying the analysis sample with an include file
Recoding data using include files
Caution when using include files
Ado-files
A simple program to change directories
Loading and deleting ado-files
Listing variable names and labels
A general program to change your working directory
Words of caution
Help files
nmlabel.hlp
help me
Conclusions
5. NAMES, NOTES, AND LABELS
Posting files
The dual workflow of data management and statistical analysis
Names, notes, and labels
Naming do-files
Naming do-files to re-create datasets
Naming do-files to reproduce statistical analysis
Using master do-files
Master log files
A template for naming do-files
Using subdirectories for complex analysis
Naming and internally documenting datasets
Never name it final!
One time only and temporary datasets
Datasets for larger projects
Labels and notes for datasets
The datasignature command
A workflow using the datasignature command
Changes datasignature does not detect
Naming variables
The fundamental principle for creating and naming variables
Systems for naming variables
Sequential naming systems
Source naming systems
Mnemonic naming systems
Planning names
Principles for selecting names
Anticipate looking for variables
Use simple, unambiguous names
Try names before you decide
Labeling variables
Listing variable labels and other information
Changing the order of variables in your dataset
Syntax for label variable
Principles for variable labels
Beware of truncation
Test labels before you post the file
Temporarily changing variable labels
Creating variable labels that include the variable name
Adding notes to variables
Commands for working with notes
Listing notes
Removing notes
Searching notes
Using macros and loops with notes
Value labels
Creating value labels is a two-step process
Step 1: Defining labels
Step 2: Assigning labels
Why a two-step system?
Removing labels
Principles for constructing value labels
Keep labels short
Include the category number
Avoid special characters
Keeping track of where labels are used
Cleaning value labels
Consistent value labels for missing values
Using loops when assigning value labels
Using multiple languages
Using label language for different written languages
Using label language for short and long labels
A workflow for names and labels
Step 1: Plan the changes
Step 2: Archive, clone, and rename
Step 3: Revise variable labels
Step 4: Revise value labels
Step 5: Verify the changes
Step 1: Check the source data
Step 1a: List the current names and labels
Step 1b: Try the current names and labels
Step 2: Create clones and rename variables
Step 2a: Create clones
Step 2b: Create rename commands
Step 2c: Rename variables
Step 3: Revise variable labels
Step 3a: Create variable-label commands
Step 3b: Revise variable labels
Step 4: Revise value labels
Step 4a: List the current labels
Step 4b: Create label define commands to edit
Step 4c: Revise labels and add them to dataset
Step 5: Check the new names and labels
Conclusions
6. CLEANING YOUR DATA
Importing data
Data formats
ASCII data formats
Binary-data formats
Ways to import data
Stata commands to import data
Using other statistical packages to export data
Using a data conversion program
Verifying data conversion
Converting the ISSP 2002 data from Russia
Verifying variables
Values review
Values review of data about the scientific career
Values review of data on family values
Substantive review
What does time to degree measure?
Examining high-frequency values
Links among variables
Changes in survey questions
Missing-data review
Comparisons and missing values
Creating indicators of whether cases are missing
Using extended missing values
Verifying and expanding missing-data codes
Using include files
Internal consistency review
Consistency in data on the scientific career
Principles for fixing data inconsistencies
Creating variables for analysis
Principles for creating new variables
New variables get new names
Verify that new variables are correct
Document new variables
Keep the source variables
Core commands for creating variables
The generate command
The clonevar command
The replace command
Creating variables with missing values
Additional commands for creating variables
The recode command
The egen command
The tabulate, generate() command
Labeling variables created by Stata
Verifying that variables are correct
Checking the code
Listing variables
Plotting continuous variables
Tabulating variables
Constructing variables multiple ways
Saving datasets
Selecting observations
Deleting cases versus creating selection variables
Dropping variables
Selecting variables for the ISSP 2002 Russian data
Ordering variables
Internal documentation
Compressing variables
Running diagnostics
The codebook, problem command
Checking for unique ID variables
Adding a data signature
Saving the file
After a file is saved
Extended example of preparing data for analysis
Creating control variables
Creating binary indicators of positive attitudes
Creating four-category scales of positive attitudes
Merging files
Match-merging
Sorting the ID variable
One-to-one merging
Combining unrelated datasets
Forgetting to match-merge
Conclusions
7. ANALYZING DATA AND PRESENTING RESULTS
Planning and organizing statistical analysis
Planning in the large
Planning in the middle
Planning in the small
Organizing do-files
Using master do-files
What belongs in your do-file?
Documentation for statistical analysis
The research log and comments in do-files
Documenting the provenance of results
Captions on graphs
Analyzing data using automation
Locals to define sets of variables
Loops for repeated analyses
Computing t tests using loops
Loops for alternative model specifications
Matrices to collect and print results
Collecting results of t tests
Saving results from nested regressions
Saving results from different transformations of articles
Creating a graph from a matrix
Include files to load data and select your sample
Baseline statistics
Replication
Lost or forgotten files
Software and version control
Unknown seed for random numbers
Bootstrap standard errors
Letting Stata set the seed
Training and confirmation samples
Using a global that is not in your do-file
Presenting results
Creating tables
Using spreadsheets
Regression tables with esttab
Creating graphs
Colors, black, and white
Font size
Tips for papers and presentations
Papers
Presentations
A project checklist
Conclusions
8. PROTECTING YOUR FILES
Levels of protection and types of files
Causes of data loss and issues in recovering a file
Murphy’s law and rules for copying files
A workflow for file protection
Part 1: Mirroring active storage
Part 2: Offline backups
Archival preservation
Conclusions
9. CONCLUSIONS
A. HOW STATA WORKS
How Stata works
Stata directories
The working directory
Working on a network
Customizing Stata
Fonts and window locations
Commands to change preferences
Options that can be set permanently
Options that need to be set each session
profile.do
Function keys
Additional resources