The Basics

Data Science

Data Science is a multidisciplinary field that combines statistics, computer science and business intelligence to extract meaningful information from data.

Machine Learning

Machine Learning is a method of building computer systems through finding and applying patterns learned from previous observations.

Models

In the context of machine learning, a model is a mathematical expression whose parameters are determined through the training process and which is used to generate predictions (inferences) for new observations.

Features/covariates/explanatory variables/predictors/independent variables

These terms all refer to the input variables passed to a model to obtain a prediction.

Target Variable/dependent variable/response variable

These terms all refer to the variable the model is trained to predict, i.e. the model output.

Hyperparameters

Parameters whose values are set before the training process begins. Unlike the other model parameters, they are not learned from the training data (see below). Example: learning rate.
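To make the distinction concrete, here is a minimal sketch in which the learning rate is chosen up front while the parameter w is found by the training loop. Gradient descent and the toy objective (w - 3)^2 are assumptions picked for illustration; they do not come from the glossary itself.

```python
def gradient_descent(learning_rate, n_steps=100, start=0.0):
    """Minimise the toy objective f(w) = (w - 3)^2 with plain gradient descent."""
    w = start  # the parameter: its value is learned by the loop below
    for _ in range(n_steps):
        gradient = 2.0 * (w - 3.0)      # derivative of (w - 3)^2
        w -= learning_rate * gradient   # the hyperparameter scales every step
    return w

# Same training loop, two hyperparameter choices (both values are illustrative).
w_small_rate = gradient_descent(learning_rate=0.1)  # settles near the minimum at w = 3
w_large_rate = gradient_descent(learning_rate=1.1)  # overshoots further on every step
print(w_small_rate, w_large_rate)
```

The training loop never changes the learning rate; only w moves, which is why the learning rate has to be tuned from outside the training process.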

Training Set/Training Data

Data used to develop the model (i.e. determine the model parameters).

Validation Set/Validation Data

Data that is withheld from the model training process, but used to provide an unbiased evaluation of the model for the purpose of hyperparameter tuning.

Testing Set/Testing Data

Data that is withheld from the training and validation process to provide a realistic evaluation of model performance on subsequent observations.
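The three sets above can be produced with a simple shuffled split. This is a dependency-free sketch; the function name train_val_test_split and the 60/20/20 proportions are assumptions for illustration, not a recommendation.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the rows once, then cut them into training, validation and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

observations = list(range(100))  # stand-in for 100 observations
train, val, test = train_val_test_split(observations)
print(len(train), len(val), len(test))  # prints: 60 20 20
```

Every observation ends up in exactly one set, so the validation and testing data are never seen while fitting the model parameters.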

General Concepts

Model Deployment

Model deployment is the process of integrating a Machine Learning model with a production environment, usually to make inference available to other business systems.

Feature Engineering

The series of transformation steps applied to the raw input variables prior to the training phase.
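One common transformation step is standardization, which rescales a numeric feature to zero mean and unit standard deviation. The sketch below is only one example of feature engineering; the function name and the income values are made up for illustration.

```python
import statistics

def standardize(values):
    """Rescale a numeric feature to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # a constant feature carries no information
    return [(v - mean) / std for v in values]

raw_incomes = [30_000, 45_000, 60_000, 120_000]  # hypothetical raw input variable
scaled_incomes = standardize(raw_incomes)
print(scaled_incomes)
```

In practice the mean and standard deviation would normally be computed on the training set only and then reused for the validation and testing sets.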

Overfitting

Overfitting is used to describe models that “fit too well” to the training data. These models capture noise specific to the training set, so they generalize poorly and perform worse on new observations than on the data they were trained on.

Underfitting

Underfitting is used to describe models that learned too little from the data set, which results in a simplistic understanding of the underlying relationships.

Supervised Learning

When we have a clearly defined input and output, we can use a supervised learning algorithm (think linear regression, support vector machines) to map the input to the output based on prior observations.

Unsupervised Learning

When we have a clearly defined input, but not a clearly defined output, we need to rely on unsupervised learning algorithms (such as clustering) to draw inference from our dataset.

Semi-supervised Learning

In semi-supervised learning, we have labels for only some of our observations. A classic semi-supervised approach, often called self-training or pseudo-labelling, is to train a model on the labelled data, use this model to infer the missing labels, convert confident predictions to definite labels, retrain the model on the enlarged labelled set and repeat until all data is labelled.
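The loop described above can be sketched as follows. The nearest-centroid classifier on 1-D data, the margin-based confidence rule and all names are assumptions chosen to keep the example short. Unlike the description, this sketch stops early when no remaining prediction is confident, rather than forcing a label onto every observation.

```python
def centroids(labelled):
    """Mean feature value per class for 1-D data, as {label: mean}."""
    sums, counts = {}, {}
    for x, y in labelled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_with_margin(model, x):
    """Return the nearest class and how much closer it is than the runner-up."""
    ranked = sorted(model, key=lambda y: abs(x - model[y]))
    best, runner_up = ranked[0], ranked[1]
    return best, abs(x - model[runner_up]) - abs(x - model[best])

def self_train(labelled, unlabelled, min_margin=2.0, max_rounds=10):
    labelled, unlabelled = list(labelled), list(unlabelled)
    for _ in range(max_rounds):
        if not unlabelled:
            break
        model = centroids(labelled)  # (re)train on everything labelled so far
        confident, remaining = [], []
        for x in unlabelled:
            label, margin = predict_with_margin(model, x)
            (confident if margin >= min_margin else remaining).append((x, label))
        if not confident:
            break  # nothing confident left: stop instead of guessing
        labelled.extend(confident)  # confident predictions become definite labels
        unlabelled = [x for x, _ in remaining]
    return labelled, unlabelled

final_labelled, still_unlabelled = self_train(
    labelled=[(1.0, "a"), (9.0, "b")],
    unlabelled=[2.0, 8.0, 4.0, 6.5, 5.1],
)
print(final_labelled, still_unlabelled)
```

With these made-up numbers, the point 5.1 sits almost halfway between the two classes and is left unlabelled, which shows why the confidence threshold matters: a wrong pseudo-label would be fed back into training.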

Regression

Regression models explain/predict the relationship between independent variables and a continuous dependent variable. Modeling house prices would be a regression problem.

Classification

Classification models explain/predict the relationship between independent variables and a categorical dependent variable. Classifying animals from pictures is a classification problem.

Clustering

Clustering is a set of unsupervised learning techniques used to group data points based on similarities within each group and dissimilarities between groups.
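As one concrete instance, k-means groups points by repeatedly assigning each point to its nearest cluster centre and moving each centre to the mean of its points. The sketch below is a minimal 1-D version with hand-picked starting centres; k-means is only one of many clustering techniques, and the data is invented.

```python
def kmeans_1d(points, centers, n_iter=10):
    """Lloyd's algorithm on 1-D points, starting from the given centres."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(n_iter):
        # Assignment step: each point joins the group of its nearest centre.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Update step: move each centre to the mean of its group (keep it if empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

points = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]
centers, groups = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)  # prints: [1.5, 9.0]
print(groups)   # prints: [[1.0, 1.5, 2.0], [8.0, 9.0, 10.0]]
```

No labels are used anywhere: the two groups emerge purely from how close the points are to each other.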

Natural Language Processing (NLP)

NLP is the area of machine learning tasks focused on human languages. This includes both the written and spoken language.

Computer Vision

Computer vision is the area of machine learning tasks focused on interpreting images and video, with image recognition as a typical example.

Feature Space

The n-dimensional space constructed by the model features.

Algorithms

“a process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer.”

- Oxford Dictionary definition

Neural Network

Neural Networks (sometimes also referred to as Artificial Neural Networks) are a class of machine learning models loosely modelled on the brain.

Recurrent Neural Network/RNN: RNNs are a subclass of neural networks typically used to process sequential data.

Convolutional Neural Network/CNN: CNNs are a subclass of neural networks typically used to process spatial data such as images.

Deep Learning

Deep Learning is the area of machine learning that uses multi-layer neural networks.

Linear Regression

Linear Regression is used to model a linear relationship between a continuous, scalar response variable and at least one explanatory variable. Linear Regression can be used for predicting monetary valuations amongst other use cases.
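For a single explanatory variable, the best-fitting line has a closed-form solution. The sketch below uses ordinary least squares, which is one standard way to fit the model; the floor areas and prices are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

floor_areas = [50, 80, 100, 120]   # hypothetical explanatory variable
prices = [150, 240, 300, 360]      # hypothetical response, here exactly 3 * area
slope, intercept = fit_line(floor_areas, prices)
predicted_price = slope * 90 + intercept  # valuation for an unseen floor area
print(slope, intercept, predicted_price)
```

The slope and intercept are the model parameters determined by training; once fitted, a prediction is a single multiplication and addition.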

Logistic Regression

Logistic Regression is used to model a probabilistic relationship between a binary response variable and at least one explanatory variable. The linear part of the Logistic Regression model outputs the log odds, which the logistic (sigmoid) function transforms into a probability. Logistic Regression can be used to predict the likelihood of churn amongst other use cases.
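The transformation from log odds to probability is the logistic (sigmoid) function. In the sketch below the weights, bias and feature values are made-up numbers, not a fitted churn model.

```python
import math

def log_odds(features, weights, bias):
    """Linear part of the model: bias + sum of weight * feature."""
    return bias + sum(w * x for w, x in zip(weights, features))

def sigmoid(z):
    """Map log odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical churn model with two features (not fitted to any data).
weights, bias = [0.8, -0.5], -0.2
z = log_odds([2.0, 1.0], weights, bias)
churn_probability = sigmoid(z)
print(z, churn_probability)
```

Log odds of 0 correspond to a probability of 0.5, and odds of 3 to 1 (log odds of ln 3) correspond to a probability of 0.75. This simple form is not hardened against overflow for very large negative inputs.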

Support Vector Machine (SVM)

Support Vector Machine is a binary classifier that finds the hyperplane separating the two classes in the feature space with the largest possible margin. New observations are classified based on which side of the hyperplane they fall on.
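Once the hyperplane is known, classifying a new observation only requires checking the sign of w·x + b. The sketch below covers that prediction step only; it does not train an SVM, and the weight vector w and offset b are hypothetical values.

```python
def classify(x, w, b):
    """Return +1 or -1 depending on which side of the hyperplane w.x + b = 0 x lies."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical hyperplane in a 2-D feature space: the line x1 = x2.
w, b = [1.0, -1.0], 0.0
print(classify([3.0, 1.0], w, b))  # prints: 1
print(classify([1.0, 3.0], w, b))  # prints: -1
```

Training an SVM amounts to choosing w and b so that this boundary sits as far as possible from the closest training points of each class.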

Decision Tree

A decision tree separates the feature space into distinct subsets through a sequence of simple rules on the features. New observations are classified based on the subset they fall into.

Ensemble Modeling

Ensemble Modeling is the process of aggregating multiple models to make a single prediction. The key behind successful ensembling is to pick diverse models that use very different algorithms. There are several ways of combining the predictions of multiple models, the simplest being:

take the most commonly predicted value
average/weigh the scores from each model and predict the outcome from the aggregated score
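The two strategies above can be sketched in a few lines. The function names, the example predictions and the weights are assumptions for illustration.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most commonly predicted value (ties go to the first one seen)."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_average(scores, weights=None):
    """Average the model scores, optionally giving some models more weight."""
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# Three hypothetical models predicting whether a client will churn.
vote = majority_vote(["churn", "stay", "churn"])
plain_score = weighted_average([0.9, 0.6, 0.3])
weighted_score = weighted_average([0.9, 0.6, 0.3], weights=[2.0, 1.0, 1.0])
print(vote, plain_score, weighted_score)
```

With the aggregated score in hand, the final outcome is predicted by comparing it to a threshold, for example 0.5.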

Tools for Model Development

GitHub

GitHub is a web platform used for software development. It offers version control and other collaborative features such as task management and code reviews.

Docker

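Docker is a platform for packaging an application and its dependencies into a container that runs consistently across environments. It is commonly used to deploy applications, including machine learning models.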

Python Libraries

Python is widely regarded as the most popular programming language among Data Scientists.

NumPy

NumPy is a numerical computation library used to structure and manipulate data. It is a building block for many other open source Data Science libraries.

pandas

pandas makes it easy to read, export and work with tabular data. The core pandas data structure (the DataFrame) organizes data into a table format that makes it easy to perform indexing, filtering, aggregating and grouping operations.

Scikit-Learn/sklearn

sklearn is a comprehensive library used for data analysis, feature engineering and for developing machine learning models.

TensorFlow

TensorFlow is a machine learning framework developed by the Google Brain team. The primary use of TensorFlow is for developing and productionizing deep learning models.

Keras

Keras is a deep learning library written in Python. It is a high level API that can be used on top of several deep learning frameworks, including TensorFlow.

PyTorch

PyTorch is a machine learning library developed by the Facebook Artificial Intelligence Research group. It is also primarily used for developing deep learning models.

Metrics

These are some commonly used metrics for assessing model performance. When communicating model performance, we need to specify which dataset we obtained these metrics from in addition to the metrics themselves. A training accuracy of 95% is not the same as a testing accuracy of 95%!

Metrics for Binary Classification

We are usually more interested in the presence of one class than the other. For example, we are more concerned if a client is “fraudulent” than if they are not fraudulent. Let’s keep this example when defining the following terms.

False Positive/Type 1 Error (FP): this is an observation we misclassified as being our class of interest (example: a non-fraudulent client misclassified as fraudulent).

False Negative/Type 2 Error (FN): this is an observation we misclassified as not being our class of interest (example: a fraudulent client misclassified as non-fraudulent).

True Positive (TP): this is a fraudulent client that we correctly classified.

True Negative (TN): this is a non-fraudulent client that we correctly classified.

Precision: precision is calculated as TP/(TP+FP) where TP is the number of true positives and FP is the number of false positives

Recall: recall is calculated as TP/(TP + FN) where TP is the number of true positives and FN is the number of false negatives

F1-Score: as we can see above, optimizing for precision means reducing the number of false positives while optimizing for recall means reducing the number of false negatives. We use the F1-score, the harmonic mean of precision and recall, to combine these metrics. F1-score is calculated as 2 * (precision * recall) / (precision + recall)

Confusion Matrix: the confusion matrix is a table of the TP, FP, FN and TN counts, often displayed as a visual.
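The definitions above translate directly into code. This sketch reuses the fraud example with made-up labels; the function names and the choice to return 0.0 when a denominator is zero are assumptions.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FP, FN and TN, treating `positive` as the class of interest."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # correctly flagged
            else:
                fp += 1  # flagged, but actually not the class of interest
        else:
            if a == positive:
                fn += 1  # missed
            else:
                tn += 1  # correctly left alone
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

actual    = ["fraud", "fraud", "fraud", "ok", "ok", "ok", "ok", "ok"]
predicted = ["fraud", "fraud", "ok", "fraud", "ok", "ok", "ok", "ok"]
tp, fp, fn, tn = confusion_counts(actual, predicted, positive="fraud")
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print((tp, fp, fn, tn))  # prints: (2, 1, 1, 4)
print(precision, recall, f1)
```

Here two of the three flagged clients are truly fraudulent (precision 2/3) and two of the three fraudulent clients are caught (recall 2/3), so the F1-score is also 2/3.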

Metrics for Multi-class Classification

Most metrics used for binary classification can be used to assess the performance of each class in the multi-class scenario. If we had 3 classes, we would derive 3 precision scores. For each precision score, the TP would be the number of correct predictions we made for that class and FP would be the number of times we misclassified one of the other 2 classes as that class.

The confusion matrix can also be generalized as an n x n matrix where n is the number of classes. The cell with row i and column j represents the number of observations of class i that were predicted to be class j.

[Figure: Confusion Matrix Visualization from one of my other articles]

Accuracy: the % of correct predictions. This is a good representation of model performance when class sizes are fairly balanced.
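The n x n confusion matrix, the per-class precision scores and the accuracy can all be computed from the same counts. The sketch below follows the row/column convention described above (row = actual class, column = predicted class); the animal labels are invented for illustration.

```python
def confusion_matrix(actual, predicted, classes):
    """matrix[i][j] = number of observations of class i predicted as class j."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def per_class_precision(matrix):
    """For each class j: diagonal cell divided by the column total (TP / (TP + FP))."""
    n = len(matrix)
    scores = []
    for j in range(n):
        predicted_as_j = sum(matrix[i][j] for i in range(n))
        scores.append(matrix[j][j] / predicted_as_j if predicted_as_j else 0.0)
    return scores

def accuracy(matrix):
    """Share of all predictions that land on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)

classes = ["cat", "dog", "bird"]
actual = ["cat", "cat", "dog", "dog", "bird", "bird"]
predicted = ["cat", "dog", "dog", "dog", "bird", "cat"]
matrix = confusion_matrix(actual, predicted, classes)
print(matrix)  # prints: [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(per_class_precision(matrix), accuracy(matrix))
```

With 3 classes we obtain 3 precision scores, one per column of the matrix, while accuracy collapses the whole matrix into a single number.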

Metrics for Regression

Mean Squared Error (MSE): Mean Squared Error averages the square of the difference between the actual and predicted values. This is one of the most common metrics used to evaluate regression models.

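Root Mean Squared Error (RMSE): RMSE is the square root of MSE. RMSE is expressed in the same units as the response variable, which makes it easier to interpret, while MSE is often preferred during model fitting because it is easier to work with (one less operation).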

Mean Absolute Error (MAE): MAE is the average of the absolute difference between the actual and predicted values.

R-squared/Coefficient of Determination: R-squared is a statistical measure of the proportion of variance in the response variable that can be “explained” by the model.

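Adjusted R-squared: Adjusted R-squared adjusts R-squared by penalizing the number of explanatory variables, so that adding a variable only raises the score if it improves the fit enough to offset the penalty.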
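The regression metrics above can be computed together from the actual and predicted values. The sketch below uses the usual textbook formulas, including adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1) for n observations and p explanatory variables; the numbers are made up.

```python
import math

def regression_metrics(actual, predicted, n_predictors):
    """Return MSE, RMSE, MAE, R-squared and adjusted R-squared as a dict."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)                   # unexplained variation
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total variation
    mse = ss_res / n
    r2 = 1 - ss_res / ss_tot
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": r2,
        "adjusted_r2": 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1),
    }

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(actual, predicted, n_predictors=1)
print(metrics)
```

For these values MSE and MAE are both 0.5, R-squared is 0.9 and adjusted R-squared drops to 0.85 once the single explanatory variable is penalized.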

Thursday, November 21, 2019

Data Scientist Glossary