The Workflow of Data Analysis Using Stata

by J. Scott Long



 The Workflow of Data Analysis Using Stata, by J. Scott Long, is an essential productivity tool for data analysts. Aimed at anyone who analyzes data, this book presents an effective strategy for designing and doing data-analytic projects.

In this book, Long presents lessons drawn from his experience writing numerous academic publications, coauthoring the immensely popular Regression Models for Categorical Dependent Variables Using Stata, and coauthoring the SPost routines, which are downloaded over 20,000 times a year.

A workflow of data analysis is a process for managing all aspects of data analysis. Planning, documenting, and organizing your work; cleaning the data; creating, renaming, and verifying variables; performing and presenting statistical analyses; producing replicable results; and archiving what you have done are all integral parts of your workflow.

Long shows how to design and implement efficient workflows for both one-person projects and team projects. He guides you toward streamlining your workflow because a good workflow is essential for replicating your work, and replication is essential for good science.

An efficient workflow reduces the time you spend doing data management and lets you produce datasets that are easier to analyze. When you methodically clean your data and carefully choose names and effective labels for your variables, the time you spend doing statistical and graphical analyses will be more productive and more enjoyable.

After introducing workflows and explaining how a better workflow can make it easier to work with data, Long describes planning, organizing, and documenting your work. He then introduces how to write and debug Stata do-files and how to use local and global macros. Long presents conventions that greatly simplify data analysis—conventions for naming, labeling, documenting, and verifying variables. He also covers cleaning, analyzing, and protecting your data.
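As a rough illustration of the do-file and macro topics mentioned above, here is a minimal sketch of a self-contained do-file. It is not taken from the book: the log file name (example01.log) and the macro names (rhs, dv) are hypothetical, and the auto dataset that ships with Stata is used only so that the sketch runs without other files.

```stata
* Minimal do-file sketch; file and macro names are hypothetical
version 10
clear all
set more off
capture log close
log using example01.log, replace text

* Load a dataset shipped with Stata so the sketch is self-contained
sysuse auto, clear

* A local macro exists only while this do-file runs
local rhs "weight length foreign"

* A global macro persists for the rest of the Stata session
global dv "mpg"

* Stata expands the macros before it runs each command
summarize $dv `rhs'
regress $dv `rhs'

log close
exit
```

Because the list of predictors is defined once in the local macro, changing the model specification means editing a single line rather than every command that uses the list.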

While describing effective workflows, Long also introduces the concepts of basic data management using Stata and writing Stata do-files. Using real-world examples, Stata commands, and Stata scripts, Long illustrates effective techniques for managing your data and analyses. If you analyze data, this book is recommended for you.

Comments from Readers

You have written the book that I had planned to write someday. But I’m glad I didn’t—your book is much better. Congratulations, this was greatly needed.

Prof. Bill Gardner
The Ohio State University

I will post the announcement of Workflow on my door with the following note: “I’m glad to help anybody who followed at least 25% of the advice Long provides—and brings me their do-files!”

Prof. Alan C. Acock
Oregon State University

I just wanted to send you a thank you for taking the time to write this book. I feel a little like an obsessed fan because I read it for several hours last night, bought 3 copies for my new research team and am presenting our new organization scheme tomorrow. It turns out that we have just finished a first flurry of data collection and hiring and I’ve been scratching my head about how to systematize some aspects. It is a perfect time to superimpose a structure. I’ve used aspects of your plan in my own work (hence my eagerness to adopt) but having this coherent volume is a wonderful and practical resource. I learned a lot from reading this. Thank you!

Elizabeth Gifford, Ph.D.
Research Scientist
Duke University

I just received a knock at my door with my new copy of The Workflow of Data Analysis Using Stata. I immediately ripped off the packaging and began perusing it. Just before the knock, I was attempting to write a program to get Stata to save the r(mean) and r(sd) for two variables following a summarize command to be saved for a ttesti command. After looking at your book for about two minutes, I stumbled upon pages 91–92, where it gave me all the information I need. … I have only had the book about 10 minutes and already it has made my life easier. Thanks much, and I am already looking forward to reading the rest of the book!

Claire M. Kamp Dush, Ph.D.
The Ohio State University

I am a Spanish professor of public economics who is at present enjoying a study-research leave at Melbourne University (Australia). Because of that I have had the time to read your book from cover to cover. I just want to thank you for the incredible work you have done! A book such as this one is a must for anyone trying to make an academic career. Definitely, I will recommend it to my graduate students as soon as I go back to Spain. If I had the chance to reach this book twenty years ago I would have been much more efficient doing my work. Never is it too late! Thanks!

Prof. Jose Felix Sanz-Sanz
Dept. of Applied Economics
Universidad Complutense de Madrid

Table of contents

List of tables

List of figures

List of examples

Preface

A word about fonts, files, commands, and examples

1 Introduction 

1.1 Replication: The guiding principle for workflow
1.2 Steps in the workflow
         1.2.1 Cleaning data
         1.2.2 Running analysis
         1.2.3 Presenting results
         1.2.4 Protecting files
1.3 Tasks within each step
         1.3.1 Planning
         1.3.2 Organization
         1.3.3 Documentation
         1.3.4 Execution
1.4 Criteria for choosing a workflow
         1.4.1 Accuracy
         1.4.2 Efficiency
         1.4.3 Simplicity
         1.4.4 Standardization
         1.4.5 Automation
         1.4.6 Usability
         1.4.7 Scalability
1.5 Changing your workflow
1.6 How the book is organized

2 Planning, organizing, and documenting 

2.1 The cycle of data analysis
2.2 Planning
2.3 Organization
         2.3.1 Principles for organization
         2.3.2 Organizing files and directories
         2.3.3 Creating your directory structure
                 A directory structure for a small project
                 A directory structure for a large, one-person project
                 Directories for collaborative projects
                 Special-purpose directories
                 Remembering what directories contain
                 Planning your directory structure
                 Naming files
                 Batch files
         2.3.4 Moving into a new directory structure (advanced topic)
                 Example of moving into a new directory structure
2.4 Documentation
         2.4.1 What should you document?
         2.4.2 Levels of documentation
         2.4.3 Suggestions for writing documentation
                 Evaluating your documentation
         2.4.4 The research log
                 A sample page from a research log
                 A template for research logs
         2.4.5 Codebooks
                 A codebook based on the survey instrument
         2.4.6 Dataset documentation
2.5 Conclusions 

3 Writing and debugging do-files 

3.1 Three ways to execute commands
         3.1.1 The Command window
         3.1.2 Dialog boxes
         3.1.3 Do-files
3.2 Writing effective do-files
         3.2.1 Making do-files robust
                 Make do-files self-contained
                 Use version control
                 Exclude directory information
                 Include seeds for random numbers
         3.2.2 Making do-files legible
                 Use lots of comments
                 Use alignment and indentation
                 Use short lines
                 Limit your abbreviations
                 Be consistent
         3.2.3 Templates for do-files
                 Commands that belong in every do-file
                 A template for simple do-files
                 A more complex do-file template
3.3 Debugging do-files
         3.3.1 Simple errors and how to fix them
                 Log file is open
                 Log file already exists
                 Incorrect command name
                 Incorrect variable name
                 Incorrect option
                 Missing comma before options
         3.3.2 Steps for resolving errors
                 Step 1: Update Stata and user-written programs
                 Step 2: Start with a clean slate
                 Step 3: Try other data
                 Step 4: Assume everything could be wrong
                 Step 5: Run the program in steps
                 Step 6: Exclude parts of the do-file
                 Step 7: Starting over
                 Step 8: Sometimes it is not your mistake
         3.3.3 Example 1: Debugging a subtle syntax error
         3.3.4 Example 2: Debugging unanticipated results
         3.3.5 Advanced methods for debugging
3.4 How to get help
3.5 Conclusions  

4 Automating your work

4.1 Macros
         4.1.1 Local and global macros
                 Local macros
                 Global macros
                 Using double quotes when defining macros
                 Creating long strings
         4.1.2 Specifying groups of variables and nested models
         4.1.3 Setting options with locals
4.2 Information returned by Stata commands
                 Using returned results with local macros
4.3 Loops: foreach and forvalues
                 The foreach command
                 The forvalues command
         4.3.1 Ways to use loops
                 Loop example 1: Listing variable and value labels
                 Loop example 2: Creating interaction variables
                 Loop example 3: Fitting models with alternative measures of education
                 Loop example 4: Recoding multiple variables the same way
                 Loop example 5: Creating a macro that holds accumulated information
                 Loop example 6: Retrieving information returned by Stata
         4.3.2 Counters in loops
                 Using loops to save results to a matrix
         4.3.3 Nested loops
         4.3.4 Debugging loops
4.4 The include command
         4.4.1 Specifying the analysis sample with an include file
         4.4.2 Recoding data using include files
         4.4.3 Caution when using include files
4.5 Ado-files
         4.5.1 A simple program to change directories
         4.5.2 Loading and deleting ado-files
         4.5.3 Listing variable names and labels
         4.5.4 A general program to change your working directory
         4.5.5 Words of caution
4.6 Help files
         4.6.1 nmlabel.hlp
         4.6.2 help me
4.7 Conclusions

5 Names, notes, and labels

5.1 Posting files
5.2 The dual workflow of data management and statistical analysis
5.3 Names, notes, and labels
5.4 Naming do-files
         5.4.1 Naming do-files to re-create datasets
         5.4.2 Naming do-files to reproduce statistical analysis
         5.4.3 Using master do-files
                 Master log files
         5.4.4 A template for naming do-files
                 Using subdirectories for complex analysis
5.5 Naming and internally documenting datasets
                 Never name it final!
         5.5.1 One time only and temporary datasets
         5.5.2 Datasets for larger projects
         5.5.3 Labels and notes for datasets
         5.5.4 The datasignature command
                 A workflow using the datasignature command
                 Changes datasignature does not detect
5.6 Naming variables
         5.6.1 The fundamental principle for creating and naming variables
         5.6.2 Systems for naming variables
                 Sequential naming systems
                 Source naming systems
                 Mnemonic naming systems
         5.6.3 Planning names
         5.6.4 Principles for selecting names
                 Anticipate looking for variables
                 Use simple, unambiguous names
                 Try names before you decide
5.7 Labeling variables
         5.7.1 Listing variable labels and other information
                 Changing the order of variables in your dataset
         5.7.2 Syntax for label variable
         5.7.3 Principles for variable labels
                 Beware of truncation
                 Test labels before you post the file
         5.7.4 Temporarily changing variable labels
         5.7.5 Creating variable labels that include the variable name
5.8 Adding notes to variables
         5.8.1 Commands for working with notes
                 Listing notes
                 Removing notes
                 Searching notes
         5.8.2 Using macros and loops with notes
5.9 Value labels
         5.9.1 Creating value labels is a two-step process
                 Step 1: Defining labels
                 Step 2: Assigning labels
                 Why a two-step system?
                 Removing labels
         5.9.2 Principles for constructing value labels
                 1) Keep labels short
                 2) Include the category number
                 3) Avoid special characters
                 4) Keeping track of where labels are used
         5.9.3 Cleaning value labels
         5.9.4 Consistent value labels for missing values
         5.9.5 Using loops when assigning value labels
5.10 Using multiple languages
         5.10.1 Using label language for different written languages
         5.10.2 Using label language for short and long labels
5.11 A workflow for names and labels
                   Step 1: Plan the changes
                   Step 2: Archive, clone, and rename
                   Step 3: Revise variable labels
                   Step 4: Revise value labels
                   Step 5: Verify the changes
         5.11.1 Step 1: Check the source data
                   Step 1a: List the current names and labels
                   Step 1b: Try the current names and labels
         5.11.2 Step 2: Create clones and rename variables
                   Step 2a: Create clones
                   Step 2b: Create rename commands
                   Step 2c: Rename variables
         5.11.3 Step 3: Revise variable labels
                   Step 3a: Create variable-label commands
                   Step 3b: Revise variable labels
         5.11.4 Step 4: Revise value labels
                   Step 4a: List the current labels
                   Step 4b: Create label define commands to edit
                   Step 4c: Revise labels and add them to dataset
         5.11.5 Step 5: Check the new names and labels
5.12 Conclusions 

6 Cleaning your data

6.1 Importing data
          6.1.1 Data formats
                  ASCII data formats
                  Binary-data formats
          6.1.2 Ways to import data
                  Stata commands to import data
                  Using other statistical packages to export data
                  Using a data conversion program
           6.1.3 Verifying data conversion
                   Converting the ISSP 2002 data from Russia
6.2 Verifying variables
           6.2.1 Values review
                   Values review of data about the scientific career
                   Values review of data on family values
           6.2.2 Substantive review
                    What does time to degree measure?
                    Examining high-frequency values
                    Links among variables
                    Changes in survey questions
            6.2.3 Missing-data review
                    Comparisons and missing values
                    Creating indicators of whether cases are missing
                    Using extended missing values
                    Verifying and expanding missing-data codes
                    Using include files
             6.2.4 Internal consistency review
                     Consistency in data on the scientific career
             6.2.5 Principles for fixing data inconsistencies
6.3 Creating variables for analysis
             6.3.1 Principles for creating new variables
                     New variables get new names
                     Verify that new variables are correct
                     Document new variables
                     Keep the source variables
             6.3.2 Core commands for creating variables
                      The generate command
                      The clonevar command
                      The replace command
             6.3.3 Creating variables with missing values
             6.3.4 Additional commands for creating variables
                     The recode command
                     The egen command
                     The tabulate, generate() command
             6.3.5 Labeling variables created by Stata
             6.3.6 Verifying that variables are correct
                     Checking the code
                     Listing variables
                     Plotting continuous variables
                     Tabulating variables
                     Constructing variables multiple ways
6.4 Saving datasets
             6.4.1 Selecting observations
                     Deleting cases versus creating selection variables
             6.4.2 Dropping variables
                     Selecting variables for the ISSP 2002 Russian data
             6.4.3 Ordering variables
             6.4.4 Internal documentation
             6.4.5 Compressing variables
             6.4.6 Running diagnostics
                     The codebook, problem command
                     Checking for unique ID variables
             6.4.7 Adding a data signature
             6.4.8 Saving the file
             6.4.9 After a file is saved
6.5 Extended example of preparing data for analysis
                     Creating control variables
                     Creating binary indicators of positive attitudes
                     Creating four-category scales of positive attitudes
6.6 Merging files
             6.6.1 Match-merging
                     Sorting the ID variable
             6.6.2 One-to-one merging
                     Combining unrelated datasets
             6.6.3 Forgetting to match-merge
6.7 Conclusions

7 Analyzing data and presenting results

7.1 Planning and organizing statistical analysis
             7.1.1 Planning in the large
             7.1.2 Planning in the middle
             7.1.3 Planning in the small
7.2 Organizing do-files
             7.2.1 Using master do-files
             7.2.2 What belongs in your do-file?
7.3 Documentation for statistical analysis
             7.3.1 The research log and comments in do-files
             7.3.2 Documenting the provenance of results
                     Captions on graphs
7.4 Analyzing data using automation
             7.4.1 Locals to define sets of variables
             7.4.2 Loops for repeated analyses
                     Computing t tests using loops
                     Loops for alternative model specifications
             7.4.3 Matrices to collect and print results
                     Collecting results of t tests
                     Saving results from nested regressions
                     Saving results from different transformations of articles
             7.4.4 Creating a graph from a matrix
             7.4.5 Include files to load data and select your sample
7.5 Baseline statistics
7.6 Replication
             7.6.1 Lost or forgotten files
             7.6.2 Software and version control
             7.6.3 Unknown seed for random numbers
                     Bootstrap standard errors
                     Letting Stata set the seed
                     Training and confirmation samples
             7.6.4 Using a global that is not in your do-file
7.7 Presenting results
             7.7.1 Creating tables
                     Using spreadsheets
                     Regression tables with esttab
             7.7.2 Creating graphs
                     Colors, black, and white
                     Font size
             7.7.3 Tips for papers and presentations
                     Papers
                     Presentations
7.8 A project checklist
7.9 Conclusions 

8 Protecting your files

8.1 Levels of protection and types of files
8.2 Causes of data loss and issues in recovering a file
8.3 Murphy’s law and rules for copying files
8.4 A workflow for file protection
                     Part 1: Mirroring active storage
                     Part 2: Offline backups
8.5 Archival preservation
8.6 Conclusions   

9 Conclusions

A How Stata works 

A.1 How Stata works
                     Stata directories
                     The working directory
A.2 Working on a network
A.3 Customizing Stata
           A.3.1 Fonts and window locations
           A.3.2 Commands to change preferences
                   Options that can be set permanently
                   Options that need to be set each session
           A.3.3 profile.do
                   Function keys
A.4 Additional resources

References

Author Index

Subject Index 

© Copyright StataCorp LP 2002-2015.


 