This book presents the fundamental theoretical notions of supervised machine learning along with a wide range of applications using Python, R, and Stata. It strikes a balance between theory and applications and fosters an awareness of how machine learning methods can be implemented across different software platforms.
After introducing the machine learning basics, the focus turns to a broad spectrum of topics: model selection and regularization, discriminant analysis, nearest neighbors, support vector machines, tree modeling, artificial neural networks, deep learning, and sentiment analysis. Each chapter is self-contained and comprises an initial theoretical part, where the basics of the methodologies are explained, followed by an applied part, where the methods are applied to real-world datasets. Numerous examples are included and, for ease of reproducibility, the Python, R, and Stata code used in the text, along with the related datasets, is available online.
The intended audience comprises PhD students, researchers, and practitioners from various disciplines, including economics and other social sciences, medicine, and epidemiology, who have a good understanding of basic statistics and a working knowledge of statistical software, and who want to apply machine learning methods in their work.
The Basics of Machine Learning – Pages 1-17
- This chapter offers a general introduction to the rationale and ontology of Machine Learning (ML). It starts by discussing the definition, rationale, and usefulness of ML in the scientific context. Then, it underscores the transition from symbolic AI to statistical learning and examines the curse of dimensionality as the main source of non-identifiability of the mapping between the features and the target variable in supervised Machine Learning. Next, the chapter provides a guiding taxonomy of Machine Learning methods and discusses some ontological aspects related to Machine Learning as a scientific paradigm. A few final remarks end the chapter.
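The curse of dimensionality mentioned in this chapter lends itself to a quick numerical illustration. The following is an editorial sketch, not drawn from the book: discretizing each feature into 10 bins (an arbitrary choice), the number of cells needed to cover the feature space grows exponentially with the number of features, so a fixed sample soon leaves almost every cell empty and the feature-target mapping cannot be identified without further assumptions.

```python
# Toy illustration of the curse of dimensionality (not the book's example).
# With 10 bins per feature, covering the feature space requires 10**d cells,
# so a fixed sample of 10,000 observations occupies a vanishing fraction
# of the space as the dimension d grows.
bins, n_obs = 10, 10_000
for d in (1, 2, 3, 5, 10):
    cells = bins ** d
    coverage = min(n_obs / cells, 1.0)   # share of cells a sample could fill
    print(f"d={d:>2}: {cells:>14,} cells -> at most {coverage:.2%} occupied")
```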
The Statistics of Machine Learning – Pages 19-58
- This chapter offers a general introduction to the statistics of Machine Learning (ML) and provides the foundations for the chapters that follow. It starts by focusing on the trade-off between ML prediction and inference, and between model flexibility and interpretability. Next, the chapter presents model goodness-of-fit measures and discusses the central notion of overfitting, and then sets out methods for the optimal tuning of a model's hyper-parameters, including the bootstrap and K-fold cross-validation. After devoting some attention to a schematic representation of learning modes and architectures, useful for understanding how a machine can learn effectively from data, the last part of the chapter focuses on ML limitations and potential failures, along with a look at the main software solutions for implementing ML in practice. Some final conclusions end the chapter.
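As a flavor of the tuning machinery covered in this chapter, here is a minimal sketch of K-fold cross-validation in Python with scikit-learn. It is not the book's own code; the diabetes dataset, the ridge model, and the candidate alpha values are illustrative assumptions.

```python
# Minimal K-fold cross-validation sketch (illustrative, not the book's code).
# For each candidate regularization strength alpha, average the out-of-fold
# R^2 scores and pick the value with the best mean performance.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

for alpha in (0.01, 0.1, 1.0, 10.0):
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=cv)
    print(f"alpha={alpha:<5} mean CV R^2 = {scores.mean():.3f}")
```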
Model Selection and Regularization – Pages 59-146
- This chapter presents regularization and selection methods for linear and nonlinear (parametric) models. These are important Machine Learning techniques, as they target three distinct objectives: (1) prediction improvement; (2) model identification and causal inference in high-dimensional data settings; (3) feature-importance detection. The chapter starts by presenting model selection for improving prediction accuracy and for model identification and estimation in high-dimensional data settings. Then, it addresses regularized linear models, focusing on the Lasso, Ridge, and Elastic-net models. Next, it turns to regularized nonlinear models, which extend the linear ones to generalized linear models (GLMs). Subsequently, it illustrates optimal subset selection algorithms, which are purely computational approaches to optimal modeling and feature-importance extraction. After delving into the statistical properties of regularized regression, the chapter discusses causal inference in high-dimensional settings, with both exogenous and endogenous treatment. The applied part of the chapter is fully dedicated to the Stata, R, and Python implementations of the methods presented in the theoretical part.
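To make the three regularized estimators concrete, here is a brief Python sketch using scikit-learn; the simulated high-dimensional data and the penalty values are assumptions chosen for illustration only, not the book's settings.

```python
# Illustrative sketch of Lasso (L1), Ridge (L2), and Elastic-net (L1+L2)
# on simulated data where only 5 of 50 features are truly informative.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Lasso (L1)": Lasso(alpha=1.0),
    "Ridge (L2)": Ridge(alpha=1.0),
    "Elastic-net (L1+L2)": ElasticNet(alpha=1.0, l1_ratio=0.5),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    n_zero = (model.coef_ == 0).sum()  # L1 penalties zero out some coefficients
    print(f"{name}: test R^2 = {model.score(X_te, y_te):.3f}, "
          f"zero coefficients = {n_zero}")
```

The zero-coefficient count highlights a design difference: the L1 penalty in the Lasso and Elastic-net shrinks some coefficients exactly to zero, performing feature selection, whereas Ridge only shrinks them toward zero.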
Discriminant Analysis, Nearest Neighbor, and Support Vector Machine – Pages 147-200
- This chapter covers three related machine learning techniques: discriminant analysis (DA), the k-nearest neighbors (KNN) algorithm, and the support vector machine (SVM). It focuses mainly on classification but also shows how to extend SVM and KNN to a regression setup. The chapter starts by introducing discriminant analysis, a Bayesian approach to classification that allocates unknown class membership using Bayes' rule. Here, we discuss both linear and quadratic DA, paying attention to the concept of a decision boundary. The chapter goes on to introduce the k-nearest neighbors (KNN) algorithm, an extension of discriminant analysis to nonparametric estimation of the decision boundary, which classifies using a local proximity-based imputation technique. We show how KNN can also be extended to a regression setting. The chapter continues by introducing the support vector machine (SVM), a method that can be used in both classification and regression settings. It devotes considerable attention to two-class SVM classification, then extends SVM classification to a multi-class setup and, finally, discusses SVM regression. The applied part of the chapter is fully dedicated to the Stata, R, and Python implementations of the methods presented in the theoretical part.
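A minimal side-by-side comparison of the three classifiers can be sketched with scikit-learn; the iris dataset, the choice k = 5, and the RBF kernel below are arbitrary illustrative assumptions, not the book's own application.

```python
# Illustrative comparison of LDA, KNN, and SVM classifiers on the same data,
# scored by 5-fold cross-validated accuracy (not the book's code).
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
}
for name, clf in classifiers.items():
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```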
Tree Modeling – Pages 201-267
- This chapter deals with tree-based regression and classification models. These are greedy supervised Machine Learning methods involving a sequential stratification of the initial sample obtained by splitting over the feature space. The logic of growing a tree mimics a decision process where, at each step, one decides whether or not to go in a given direction according to a specific rule that follows an if-then logic. Because of their greedy nature, single trees generally have poor predictive power. For this reason, the chapter goes on to present the tree-related ensemble methods developed in the literature to improve prediction: bagging, random forests, and boosting. These methods aggregate several different trees to decrease the prediction variance, thus increasing prediction quality. Higher precision, however, comes at the expense of lower interpretability. The chapter discusses how to fit and tune this class of ensemble methods. The applied part of the chapter is fully dedicated to the Stata, R, and Python implementations of the methods presented in the theoretical part.
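The variance-reduction argument can be previewed with a small scikit-learn sketch comparing a single tree against the three ensembles; the simulated dataset and the number of estimators are illustrative assumptions rather than the book's settings.

```python
# Illustrative sketch: a single decision tree versus bagging, random forest,
# and boosting on the same simulated classification task (not the book's code).
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "Single tree": DecisionTreeClassifier(random_state=0),
    "Bagging": BaggingClassifier(n_estimators=100, random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```

On most runs the ensembles score above the single tree, illustrating how aggregating many trees reduces prediction variance.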
Artificial Neural Networks – Pages 269-322
- This chapter treats artificial neural networks (ANNs), simplified computational representations of the biological neural systems that constitute the brains of humans and animals. It starts with a formal definition of what an artificial neural network is and of how to fit one to data. Here, we present the popular gradient descent approach, based on the back-propagation algorithm, for fitting an ANN. The chapter goes on to outline two notable ANN architectures, the Perceptron and the Adaline. The applied part of the chapter, finally, presents some ANN implementations using Python, R, and Stata.
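As a taste of the material, the classic Perceptron learning rule can be sketched from scratch in a few lines of Python; the simulated two-cloud data and the learning rate below are assumptions made for illustration, not the book's own implementation.

```python
# From-scratch sketch of Rosenblatt's perceptron learning rule on two
# linearly separable point clouds (illustrative; not the book's code).
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)       # class labels in {-1, +1}

w, b, eta = np.zeros(2), 0.0, 0.1        # weights, bias, learning rate
for epoch in range(20):
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (xi @ w + b) <= 0:       # point is misclassified
            w += eta * yi * xi           # nudge the separating hyperplane
            b += eta * yi
            errors += 1
    if errors == 0:                      # a full error-free pass: converged
        break

print("weights:", w, "bias:", b, "epochs used:", epoch + 1)
```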
Deep Learning – Pages 323-364
- This chapter presents deep learning algorithms, a subset of machine learning methods built on sophisticated multi-layer artificial neural networks (ANNs). The "deep" label refers mainly to ANNs characterized by a hierarchical stratification of layers standing between the observed inputs and the observed outputs. The chapter starts by providing some intuition as to why deep learning models can achieve high predictive performance, focusing on the concept of data ordering. Next, it presents in detail two classes of (supervised) deep learning models that have become popular in the literature: (i) convolutional neural networks (CNNs) and (ii) recurrent neural networks (RNNs). CNNs have proved particularly successful for image recognition, while RNNs excel at sequence prediction (covering both quantitative and qualitative sequences). The applied part of the chapter presents applications in Python of these two deep learning architectures using the Keras API.
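A minimal Keras sketch in the spirit of the chapter's CNN application is shown below; the MNIST data, the tiny single-convolution architecture, and the single training epoch are illustrative assumptions rather than the book's actual example.

```python
# Minimal Keras CNN sketch for digit classification (illustrative only).
# Assumes TensorFlow/Keras is installed.
from tensorflow import keras
from tensorflow.keras import layers

# Load MNIST, add a channel axis, and rescale pixel values to [0, 1].
(x_tr, y_tr), (x_te, y_te) = keras.datasets.mnist.load_data()
x_tr, x_te = x_tr[..., None] / 255.0, x_te[..., None] / 255.0

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, kernel_size=3, activation="relu"),  # feature maps
    layers.MaxPooling2D(pool_size=2),                     # downsample
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),               # one unit per digit
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_tr, y_tr, epochs=1, batch_size=128,
          validation_data=(x_te, y_te))
```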
Sentiment Analysis – Pages 365-384
- This chapter presents the theory and practical applications in Stata, R, and Python of so-called sentiment analysis (SA), a Machine Learning task belonging to the field of natural language processing (NLP). The main purpose of SA is to produce a predictive mapping between human textual expressions (generally stored in documents such as emails, web pages, and other electronic documents) and specific human feelings. For this purpose, SA integrates text mining and statistical learning for sentiment prediction. After presenting the definition and logic of SA, the chapter presents the procedure of text vectorization through the "bag of words" algorithm, which is key for transforming textual documents into datasets. We propose three software applications: (1) classifying the topic of newswires based on their text (Stata); (2) detecting positive/negative sentiment toward movies based on people's reviews (R); (3) building an email "spam" detector (Python).
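The vectorize-then-classify pipeline can be previewed with a toy scikit-learn sketch; the four-document corpus, its spam labels, and the naive Bayes classifier below are invented for illustration and are not the book's spam application.

```python
# Toy bag-of-words pipeline: turn raw text into a document-term count matrix,
# then fit a classifier on the counts (illustrative; not the book's code).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["win a free prize now", "meeting moved to friday",
        "free cash offer inside", "project report attached"]
labels = [1, 0, 1, 0]            # hypothetical labels: 1 = spam, 0 = not spam

vec = CountVectorizer()          # the "bag of words" vectorization step
X = vec.fit_transform(docs)      # sparse document-term count matrix

clf = MultinomialNB().fit(X, labels)
test = vec.transform(["claim your free prize"])
print("predicted class:", clf.predict(test)[0])   # expect 1 (spam)
```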