This is a short summary of some of the terminology used in machine learning, with an emphasis on neural networks. I’ve put it together primarily to help my own understanding, phrasing it largely in non-mathematical terms. As such it may be of use to others who come from more of a programming than a mathematical background. Note that there are some overlapping or even slightly different meanings of some words or expressions depending on the context (e.g. “linear” and “dimension”), and I’ve tried to make this clear where appropriate. Disclaimer: It is by no means exhaustive or definitive.
This is the function which determines whether a unit should be activated or not, i.e. a “gate” between a unit’s input and output. It is applied to the weighted sum of all of the inputs from the previous layer, i.e. each input number multiplied by its weight and the results added together. At its simplest it could be a “step” function which outputs one number below the threshold and another above the threshold. In practice activation functions are non-linear (in the machine learning rather than statistical sense), i.e. a curved or bent line rather than a single straight line. One of the most common activation functions at the moment is the rectified linear unit. A sigmoid function might be used for binary classification. Given there may be a large number of activation functions (one for each unit in the network) and each may need to be computed a very large number of times, activation functions are designed for computational efficiency.
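As a rough sketch of the idea (not taken from any particular library), a single unit with a step activation could look something like this in Python. The names `step` and `unit_output` and all of the numbers are hypothetical, chosen only for illustration.

```python
def step(z, threshold=0.0):
    """Simplest possible activation: 0 below the threshold, 1 at or above it."""
    return 1.0 if z >= threshold else 0.0


def unit_output(inputs, weights, bias, activation=step):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)


# Hypothetical inputs and weights; only the bias differs between the two calls.
fired = unit_output([1.0, 2.0], [0.5, -0.25], bias=0.1)    # weighted sum is positive
silent = unit_output([1.0, 2.0], [0.5, -0.25], bias=-0.2)  # weighted sum is negative
```

Swapping `step` for a different function (e.g. a ReLU or sigmoid) changes the shape of the “gate” without changing the weighted sum that feeds it.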
Back propagation is an algorithm used to determine a network’s weights. It computes the gradient of the loss function with respect to the weights, with the gradient determined by computing the derivatives or partial derivatives of the loss function, and does this one layer at a time, iterating backwards from the output layer.
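To make the “one layer at a time, backwards” idea a little more concrete, here is a minimal sketch for a deliberately tiny network: one input, one sigmoid hidden unit and one linear output unit, with no biases. The network, the squared-error loss and all names here are my own assumptions for illustration, not a general implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, w2):
    """Forward pass: input -> sigmoid hidden unit -> linear output."""
    h = sigmoid(w1 * x)
    y = w2 * h
    return h, y


def loss(x, target, w1, w2):
    """Half squared error between the output and the target."""
    _, y = forward(x, w1, w2)
    return 0.5 * (y - target) ** 2


def backprop(x, target, w1, w2):
    """Gradients of the loss with respect to w1 and w2 via the chain rule."""
    h, y = forward(x, w1, w2)
    dloss_dy = y - target                      # output layer first
    dloss_dw2 = dloss_dy * h
    dloss_dh = dloss_dy * w2                   # then step back to the hidden layer
    dloss_dw1 = dloss_dh * h * (1.0 - h) * x   # sigmoid'(z) = h * (1 - h)
    return dloss_dw1, dloss_dw2


# Hypothetical values, chosen only for illustration.
grad_w1, grad_w2 = backprop(x=0.5, target=1.0, w1=0.3, w2=-0.7)
```

An optimisation function would then use these gradients to adjust `w1` and `w2`.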
The number of examples used in one iteration of training. Stochastic gradient descent will have a batch size of 1, while Mini-batch gradient descent will have more. In a Recurrent Neural Network, you’d typically have as large a batch size as your memory allows, given a larger batch size will work through the training examples more quickly and a small batch size can lead to irregularities in the rate that the loss decreases over the epochs, but some experimentation may be required to optimise batch size.
In a neural network, a bias unit is a special unit typically present in each (non output) layer which is connected to the next layer but not the previous layer, and which stores the value of 1. It allows the activation function to be shifted to the left or to the right. It is sometimes called the intercept term in linear regression.
In supervised learning, this is trying to predict a discrete value, e.g. true/false or sun/cloud/rain/snow. Contrast with Regression. A binary classification has 2 discrete values, while one with more than 2 is a multiclass classification. You can think of classification as working with “labels”.
The process of getting very close to the correct answer, e.g. in gradient descent, reaching the point where further iterations barely reduce the loss.
Convolutional Neural Network
A Convolutional Neural Network (CNN) is a network architecture commonly used for image processing. It uses convolutions, which are filters which pass over the image and apply transformations to a grid, to emphasise certain features in the image, e.g. vertical edges or horizontal edges. CNNs often use pooling, which is a way of compressing an image, e.g. max pooling preserves only the highest value in a grid. The network would typically train on the results of the convolutions, and the pooling, rather than the original input pixels.
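A small pure-Python sketch of a convolution followed by max pooling may help here. It assumes no padding and a stride of 1 for the filter, and the 4 x 4 “image” and the vertical-edge filter are made up for illustration. Like most deep learning libraries, it slides the filter without flipping it.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(grid, size=2):
    """Keep only the highest value in each non-overlapping size x size block."""
    return [
        [
            max(grid[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(grid[0]) - size + 1, size)
        ]
        for r in range(0, len(grid) - size + 1, size)
    ]


# Hypothetical 4 x 4 image: dark on the left half, bright on the right half.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

edges = convolve2d(image, vertical_edge)  # large values where dark meets bright
pooled = max_pool(edges, size=2)          # compressed summary of the edge map
```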
See Loss function.
Slope of a line tangent to a curve (e.g. generated by a function). In neural networks, this is usually applied to the cost function to determine how correct or incorrect its output is - the steeper the curve, i.e. the bigger the gradient, the more incorrect it is.
A matrix is a two dimensional array, i.e. has rows and columns. A vector could be said to be a one dimensional array, i.e. just having a single column and multiple rows of data, or a single row and multiple columns. However, a vector’s dimension is how long it is, i.e. how many elements it has. Similarly the dimension of a matrix often refers to the number of rows and columns (conventionally in that order), e.g. a 6 x 2 matrix has 6 rows and 2 columns. Sometimes dimensions are referenced in the shape.
Representing a higher dimension vector with a lower dimension representation, e.g. a single number representing a whole word or a category representing a specific book. Reducing dimensionality helps with training, and generated embeddings can help with visualisation. The term “vocabulary size” is used to indicate how many different mappings from number to character or word or part word there are.
One full pass through the training data, so that each example has been seen once. The term is most relevant where batch size is greater than 1, in which case the number of iterations in an epoch will be the number of examples divided by batch size.
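The relationship between examples, batch size and iterations can be sketched as follows. Rounding up when the examples don’t divide evenly (so the last batch is smaller) is my assumption here; some setups drop the incomplete batch instead.

```python
import math


def iterations_per_epoch(num_examples, batch_size):
    """Number of weight updates needed to see every example once."""
    return math.ceil(num_examples / batch_size)


# Hypothetical dataset of 1,000 examples.
sgd_iterations = iterations_per_epoch(1000, 1)         # one example per iteration
mini_batch_iterations = iterations_per_epoch(1000, 32)
full_batch_iterations = iterations_per_epoch(1000, 1000)
```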
Information that is relevant to learning. An attribute of the subject. Features are like columns in a database, with distinct values for each row, e.g. separate features for houses could include the floor area and number of bedrooms. In the example of an image, it could be just the pixel data, but can also include metadata, such as tags describing the image.
More features are not necessarily better because some features may not help improve the results, and too many features may make learning impractical, so feature engineering is important for neural networks. A network may even have automated feature extraction, e.g. a layer for edge detection of an image. A useful rule of thumb when deciding which features to use is whether a human expert would be able to predict the results given the selected features, or at least say the features contain sufficient information to predict the results.
This is where the features are scaled to be of similar range to each other, typically in the -1 to +1 range. Feature scaling can ensure the learning gives a comparable amount of attention to all the different features, not just the features that have a range of large values. In gradient descent, this makes the contours less skewed or tight, which allows it to reach the global minimum more quickly. Feature scaling can be more important in some learning algorithms, e.g. SVMs.
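One simple way of doing this is mean normalisation, sketched here: subtract the mean and divide by the range. This is only one of several possible scaling schemes, and the function name and the floor areas are hypothetical.

```python
def scale_feature(values):
    """Mean normalisation: centre on the mean, then divide by (max - min)."""
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    if spread == 0:
        # Every value is identical, so there is nothing to scale.
        return [0.0 for _ in values]
    return [(v - mean) / spread for v in values]


# Hypothetical floor areas for three houses.
floor_area = [50.0, 100.0, 150.0]
scaled = scale_feature(floor_area)
```

The same scaling (using the mean and range from the training data) would need to be applied to any new data before making predictions.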
The combination of the features is called a feature vector. A feature vector is one row (or column) of input, and the dimension of the feature vector is the number of columns of that row (or rows of that column).
In its simplest form, a function is the method used to turn an input into an output. Machine learning is trying to find a function describing the relationship between data features x and classification or regression for y, i.e. y = f(x). A linear function is just a straight line. A quadratic function is a curve. A polynomial function is one with lots of ups and downs, i.e. a wiggly line. Note that many examples demonstrate with one or two features (dimensions) so they can be illustrated on a graph, but in practice there are often more dimensions, which makes the function more difficult to visualise in this way. See also Activation function, Optimisation function and Loss function.
A function where operations include non-negative integer exponents of variables, e.g. x-squared. A degree 2 polynomial has x-squared and generates a parabolic curve (one peak or trough), a degree 3 polynomial has x-squared and x-cubed and generates a cubic curve (one peak and one trough), etc.
Also called logistic function. In contrast to a linear function which is a straight line, a sigmoid function starts off with a slow slope, has the largest slope in the middle, and finishes with a slow slope again.
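The shape described above can be checked with a couple of lines of Python. This is just the standard logistic formula, with a helper for its slope; the function names are my own.

```python
import math


def sigmoid(z):
    """Logistic function: squashes any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_slope(z):
    """Derivative of the sigmoid, i.e. how steep the curve is at z."""
    s = sigmoid(z)
    return s * (1.0 - s)


middle = sigmoid(0.0)               # halfway between 0 and 1
slope_middle = sigmoid_slope(0.0)   # steepest point of the curve
slope_left = sigmoid_slope(-6.0)    # nearly flat
slope_right = sigmoid_slope(6.0)    # nearly flat again
```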
Algorithm to find a local minimum of a function. It is an iterative process which determines the gradient at each step and proceeds in the direction of steepest descent, i.e. the opposite direction to the gradient. Step size is determined by the learning rate. The gradient is determined by calculating the derivative of the loss function. Also called batch gradient descent.
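A minimal sketch of the loop, using a one-variable function rather than a real loss function: f(x) = (x - 3)^2, whose derivative is 2(x - 3) and whose minimum is at x = 3. The function, starting point and learning rate are all assumptions for illustration.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step in the opposite direction to the gradient."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x


# Derivative of the hypothetical function f(x) = (x - 3)^2.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
```

In a neural network the same loop applies, except that `x` becomes all of the weights and the gradient comes from back propagation.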
Stochastic gradient descent
Stochastic gradient descent (SGD) uses individual training examples in each iteration to update parameters, in contrast to batch gradient descent which uses all the training examples in each iteration before updating parameters, which can be computationally expensive with large datasets. This means progress on the path to the global minimum is quicker, although it is not as direct, and tends to converge near the global minimum rather than on it exactly (using a smaller learning rate can help, optionally with a learning rate that decreases over time). You would typically periodically check on progress towards convergence by plotting the cost averaged over the last x training examples processed.
Mini-batch gradient descent
Mini-batch gradient descent is between batch gradient descent and stochastic gradient descent, using b training examples in each iteration, where b is the mini-batch size. It can be faster than SGD if you use vectorisation to partially parallelise the derivative calculations (gradient computations).
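The three variants differ only in how many examples feed each update, which the following sketch tries to show by fitting a single weight w in y = w * x. The tiny noise-free dataset, the learning rate and the function name `train` are hypothetical, and because the data has no noise all three variants end up in the same place (with real data SGD would wander around the minimum more).

```python
import random


def train(xs, ys, batch_size, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w * x by minimising mean squared error, one batch at a time."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the batch's mean squared error with respect to w.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * grad
    return w


# Hypothetical data generated by y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w_sgd = train(xs, ys, batch_size=1)          # stochastic: one example per update
w_mini = train(xs, ys, batch_size=2)         # mini-batch: b = 2
w_batch = train(xs, ys, batch_size=len(xs))  # batch: all examples per update
```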
Configuration, e.g. learning rate, iterations, hidden layers, type of activation function. They help determine the final values of the parameters.
Running an already-trained model to make predictions, without training, i.e. without adjusting the weights via back-propagation.
In gradient descent, the learning rate is the size of each step taken. If it is too small it could take a long time to reach the target, and if it is too large it could miss the target.
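The effect of the learning rate is easy to see on a toy problem. This sketch minimises f(x) = (x - 3)^2 (target x = 3) for a fixed 20 steps with three different rates; the function and the specific rates are assumptions picked to show each behaviour.

```python
def descend(learning_rate, steps=20, start=0.0):
    """Minimise f(x) = (x - 3)^2, whose derivative is 2 * (x - 3)."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * (x - 3)
    return x


too_small = descend(0.001)  # barely moves away from the start
reasonable = descend(0.1)   # gets close to the target
too_large = descend(1.1)    # overshoots by more on every step
```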
The loss function determines the performance of a given model for given data, in the form of a single number denoting the difference between predicted values and expected values. The “score” from the loss function is simply known as the loss. The loss function is sometimes called the cost function, and sometimes represented as J(theta). It always has a term to calculate the error (like Mean Squared Error), and sometimes has a term to regularise the result. The derivative of the loss function is the gradient, which shows how quickly the loss (or cost) is improving. The lower the loss (or cost) the better. Mean Squared Error (MSE) and Mean Absolute Error (MAE) are the most common terms to calculate the error, with MAE perhaps more suitable for time series because it doesn’t punish larger errors as much as MSE does, and there are other terms to calculate the error, e.g. Huber loss which is also less sensitive to outliers.
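To illustrate how differently these error terms treat a single large mistake, here is a sketch of all three in plain Python. The Huber threshold `delta=1.0` is a common default but is an assumption here, as are the example numbers.

```python
def mse(predicted, expected):
    """Mean Squared Error: squaring punishes large errors heavily."""
    return sum((p - e) ** 2 for p, e in zip(predicted, expected)) / len(expected)


def mae(predicted, expected):
    """Mean Absolute Error: every unit of error counts the same."""
    return sum(abs(p - e) for p, e in zip(predicted, expected)) / len(expected)


def huber(predicted, expected, delta=1.0):
    """Huber loss: squared for small errors, linear for large ones."""
    total = 0.0
    for p, e in zip(predicted, expected):
        error = abs(p - e)
        if error <= delta:
            total += 0.5 * error ** 2
        else:
            total += delta * (error - 0.5 * delta)
    return total / len(expected)


# Hypothetical predictions: three are perfect, one is off by 4.
expected = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 8.0]

mse_loss = mse(predicted, expected)
mae_loss = mae(predicted, expected)
huber_loss = huber(predicted, expected)
```

With these numbers MSE comes out as the largest of the three, because the one error of 4 is squared before averaging.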
Rows and columns of numbers. Contrast with a vector which is a matrix with one column and many rows.
The neural network architecture is the connectivity pattern between units, i.e. the number of layers (e.g. single layer, 2 layer etc.), the number of units in each layer, and the connection patterns between layers (e.g. fully connected layers, aka dense layers). It sometimes also includes the activation functions and learning methods. A single layer will just have the input layer and the output layer (the input layer is not counted hence this is called a single layer not a 2 layer), a 2 layer network will have the input layer, a hidden layer and the output layer, etc. A fully connected layer is one where each of the units in one layer is connected to each of the units in the next layer. The number of input units will be the same as the dimension of the feature vector, and the number of output units would match the number of “labels” required in classification. A reasonable default is one hidden layer, and if there are multiple hidden layers a reasonable approach is to have the same number of units in each hidden layer, with the number of units in a hidden layer the same or a small multiple of the number of units in the input layer.
In a neural network, an optimisation function (or optimisation algorithm) would be used with back propagation to try to minimise the loss function as a function of the parameters. In other words, it uses the “score” from the loss function to work out how well it is doing and then makes adjustments to try to improve the “score” on the next iteration / epoch. Common optimisation functions are gradient descent, Adam and RMSProp (a benefit of Adam and RMSProp is that they automatically adapt the learning rate during training).
The model is simply memorising the training data and recalling it, which means excellent results against data that has already been seen but generally much poorer results against data that has not been seen. It should learn generalisations which would apply to unseen data. It can happen when you have a lot of features and little training data. You can typically see this has happened when the validation loss is higher than the training loss (or the test loss is higher than the training loss if there is no validation set). To remedy, you can decrease the number of features (which might not be possible or desirable), regularise, or specifically with neural networks decrease network size, or increase the dropout hyperparameter, or get more training data. Overfitting is sometimes called “high variance”.
A parameter represents the weight learned for the connection between two units. Parameters are often represented by W and b, for the weights and biases respectively. You can calculate the number of parameters in a fully connected network by multiplying the number of units in a layer with the number of units in the next layer and adding all the results for all the layers together, and adding the bias units in the hidden and output layers. The parameters are also typically stored in a matrix for efficiency in calculating the forward propagation and back propagation, although they sometimes have to be “unrolled” into vectors for advanced optimisation (i.e. alternatives to gradient descent). See also Hyperparameters which are the configuration options excluding the weights.
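The parameter count described above can be sketched as a short function. It assumes a fully connected network with one bias per unit in every hidden and output layer; the layer sizes are hypothetical.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total = 0
    for units_in, units_out in zip(layer_sizes, layer_sizes[1:]):
        total += units_in * units_out  # one weight per connection
        total += units_out             # one bias per unit in the receiving layer
    return total


# Hypothetical network: 3 input units, 4 hidden units, 2 output units.
# Weights: 3*4 + 4*2 = 20, biases: 4 + 2 = 6.
num_parameters = count_parameters([3, 4, 2])
```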
Precision and recall, and F Score
Precision is a measure of exactness (or quality), while recall is a measure of completeness (or quantity). Specifically, precision is the number of true positives divided by predicted positives (or true positives divided by the sum of true positives and false positives), while recall is true positives divided by actual positives (or true positives divided by the sum of true positives and false negatives). Therefore, perfect precision has no false positives (or no irrelevant results), and perfect recall has no false negatives (all the relevant results). When the classes are heavily skewed (e.g. when 99% of expected results are false and you could get reasonable error/accuracy by always predicting false), you can get a better insight into how well a learning algorithm is performing by looking at precision and recall rather than error and accuracy. To get a view of overall combined precision and recall, use the F Score (or F1 Score) which is 2 times ((precision times recall) divided by (precision plus recall)).
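These formulas translate directly into code. In this sketch, returning 0 when a denominator would be zero is my own convention (libraries differ on this), and the counts are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive/negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical classifier: 8 true positives, 2 false positives, 8 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)

# A classifier that always predicts false never produces a positive,
# so it scores 0 here however good its accuracy looks.
always_false = precision_recall_f1(tp=0, fp=0, fn=1)
```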
Rectified Linear Unit
A Rectified Linear Unit (ReLU) outputs 0 if the input is negative or 0 and outputs the input if it is positive, i.e. it only returns x if x is greater than 0.
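In code this is about as simple as an activation function gets, which is part of its appeal. A one-line sketch:

```python
def relu(x):
    """Rectified Linear Unit: 0 for negative or zero input, otherwise the input."""
    return max(0.0, x)


negative = relu(-2.0)
zero = relu(0.0)
positive = relu(3.5)
```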
Recurrent Neural Network
A Recurrent Neural Network (RNN) is a network architecture for sequences of data. Hidden layers from the previous run provide part of the input to the same hidden layer in the next run.
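A very stripped-down sketch of that feedback loop, using a single hidden unit: the hidden state from the previous step is fed back in alongside the next input. The tanh activation, the weights and the input sequence are all assumptions for illustration; real RNN layers use weight matrices rather than single numbers.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, b):
    """One time step: combine the new input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Feed a sequence through the unit, carrying the hidden state forward."""
    h = 0.0
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states


# Hypothetical sequence: a single "pulse" followed by zeros.
states = run_rnn([1.0, 0.0, 0.0])
```

Even though the second and third inputs are 0, the later hidden states are non-zero because they still carry something over from the first input.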
In supervised learning, this is trying to predict a continuous output, e.g. price or time or age. Contrast with Classification, which is trying to predict discrete values. In broader statistical terms, regression is a technique for estimating the relationship among variables, and there are several types of regression, e.g. linear regression (estimates for continuous output based on a function which generates a line, although not necessarily a straight line in statistical terms because e.g. polynomial regression is a type of linear regression using a polynomial function) and logistic regression (estimates for discrete values, using the “S-shaped” curve of a logistic function aka sigmoid function).
Reduce the magnitude/values of parameters. In a neural network, this would often be performed in the loss function. Without regularisation a model might overfit, and with too much regularisation it could underfit. Not to be confused with Normalisation which adjusts the data rather than the prediction function.
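One common way to do this in the loss function is to add a penalty based on the squared weights (often called L2 regularisation). The entry above doesn’t name a specific method, so treat this as one possible sketch; `lam` is a hypothetical regularisation strength.

```python
def l2_penalty(weights, lam):
    """Penalty that grows with the size of the weights."""
    return lam * sum(w ** 2 for w in weights)


def regularised_loss(error, weights, lam=0.01):
    """Error term plus the regularisation term."""
    return error + l2_penalty(weights, lam)


# Hypothetical error of 0.5 and two weights.
plain = regularised_loss(0.5, [1.0, -2.0], lam=0.0)      # no regularisation
penalised = regularised_loss(0.5, [1.0, -2.0], lam=0.1)  # larger weights cost more
```

Because large weights now increase the loss, minimising the loss pushes the weights towards smaller values.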
Shape in the context of an input layer is the original dimension of the feature, prior to flattening into a vector. In a Convolutional Neural Network used for image processing, a 28x28 pixel square grayscale input image could be represented by a 28 x 28 matrix of the grayscale values for each pixel and this would be flattened into a vector with an input shape of (28, 28) for the input layer, or a 28x28 pixel square RGB image could have the input shape (28, 28, 3) where 3 is the number of channels (i.e. R, G and B), or if input has a batch size of 4 a 28x28 RGB image would have an input shape of (4, 28, 28, 3). In a Recurrent Neural Network, input shape would be batch size, the number of timestamps, and series dimensionality (i.e. 1 for univariate and 2 or more for multivariate). Shape in the context of a tensor is the number of elements in each dimension, e.g. in a two-dimensional tensor the shape is the [number of rows, number of columns].
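The shapes mentioned above can be reproduced with nested Python lists standing in for tensors. The `shape` helper is hypothetical and assumes the nesting is regular (every row the same length); in practice a library would report the shape for you.

```python
def shape(tensor):
    """Number of elements in each dimension of a regularly nested list."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        if not tensor:
            break
        tensor = tensor[0]
    return tuple(dims)


# 28 x 28 grayscale image: one number per pixel.
grayscale = [[0 for _ in range(28)] for _ in range(28)]

# 28 x 28 RGB image: three numbers (channels) per pixel.
rgb = [[[0, 0, 0] for _ in range(28)] for _ in range(28)]

# A batch of 4 RGB images.
batch = [rgb for _ in range(4)]

grayscale_shape = shape(grayscale)
rgb_shape = shape(rgb)
batch_shape = shape(batch)
```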
Support Vector Machine (SVM)
A Support Vector Machine is a supervised learning algorithm. In simple terms, it is a way of finding the optimal line to separate features. Compared to a neural network, an SVM might be faster to train, and should always find global optima (although in practice neural networks tend not to suffer from the local optima problem).
A multidimensional array that can contain scalars (i.e. numbers) and/or vectors and/or matrices. A vector is a one-dimensional tensor and a matrix is a two-dimensional tensor. A picture, for example, can be represented by a 3 dimensional tensor, with fields for width, height and depth, or a series of pictures as a 4 dimensional tensor. Grouping data in this way can be more computationally efficient.
The model hasn’t learned much. You typically see this when the training loss and validation loss remain about the same over multiple epochs. To remedy, you could increase the network size and/or number of layers, or add features. Underfitting is sometimes called “high bias”.
Univariate and multivariate
In relation to time series, a univariate time series has a single value at each time stamp (e.g. a chart showing birth rate), whereas a multivariate time series has multiple values at each time stamp (e.g. charts showing birth rate and death rate).
Validation Set vs Training Set vs Test Set
There would normally be a Training Set, Validation Set and Test Set of data. Most would be Training e.g. 70%, then Validation e.g. 20%, then Test e.g. 10%. The Training Set is used for training, i.e. adjusting the weights. The Validation Set is used during training to calculate the accuracy and avoid overfitting, but does not adjust the weights (although it may be used to tune hyperparameters). If the validation loss is higher (i.e. accuracy lower) than the training loss you are likely overfitting, and if the validation loss is about the same as the training loss but neither is reducing you are likely underfitting. You’d normally choose a model with the lowest validation loss (noting that this is not necessarily the last checkpoint if there has been overfitting). The Test Set is used after training is complete to confirm the predictive accuracy. In cases where there are just 2 sets, they are usually referred to as the Training Set and the Test Set.
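A split along those lines could be sketched like this. Shuffling before splitting and the fixed seed are my assumptions (for time series you would usually keep the original order instead), and the 70/20/10 proportions are just the example figures from above.

```python
import random


def split_dataset(examples, train_fraction=0.7, validation_fraction=0.2, seed=0):
    """Shuffle the examples, then cut them into training, validation and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    training_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_validation]
    test_set = shuffled[n_train + n_validation:]  # whatever is left over
    return training_set, validation_set, test_set


# Hypothetical dataset of 100 examples, represented by their index numbers.
training_set, validation_set, test_set = split_dataset(range(100))
```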
A list of numbers. A vector can be thought of as a special type of matrix that has one row and multiple columns (sometimes called a “row vector”) or one column and multiple rows (a “column vector”). Pretty much all of AI/ML works on lists of numbers, so everything has to be converted to them - see also Embedding, Parameters, Shape, Matrix and Tensor. With images, after “flattening” from a width x height matrix, a number could represent the colour of a pixel with the length of the vector equal to the number of pixels wide multiplied by the number of pixels high. Additional information is often added to the input vector - see Feature vector. In terms of implementation within the machine learning algorithms themselves, vectors are linear algebra vectors, with various properties relating to addition, multiplication etc., and mechanisms for measuring distances and determining the gradients for calculating errors and rate of change. For computational efficiency in machine learning, vectors are more likely to be used as part of a matrix or tensor.
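Flattening an image matrix into a vector can be sketched in a couple of lines. The tiny 2 x 3 “image” is hypothetical, and going row by row is an assumption about the ordering.

```python
def flatten(matrix):
    """Turn a matrix (list of rows) into a single vector, row by row."""
    return [value for row in matrix for value in row]


# Hypothetical image, 3 pixels wide by 2 pixels high, one grayscale value per pixel.
image = [
    [0, 128, 255],
    [64, 32, 16],
]
vector = flatten(image)
```

The length of the resulting vector is the width multiplied by the height, 6 in this case.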