The Theoretical Minimum (for Machine Learning)…And Why
Linear algebra, calculus, and probability & statistics are often cited as the minimum math you need to start your machine learning journey. But why these disciplines? Here’s the intuition:
Linear Algebra: Machines turn the data we produce (text, images, audio, etc.) into a format that’s suited to algorithmic manipulation. Often, they turn data into vectors. A 2D vector is a point in 2D space, a 3D vector is a point in 3D space, and so on. Once data has been converted into vectors, an ML algorithm can do many things with it. For instance, it can calculate the distance between two n-dimensional vectors in nD space to establish how similar or dissimilar they are: similar vectors (data) lie closer together in that nD space than dissimilar vectors (data). Or an ML algorithm can learn how these vectors are distributed in nD space, information that can then be used to generate new vectors, which is to say new data!
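To make the distance idea concrete, here is a minimal sketch in plain Python. The three vectors and their names (`cat`, `kitten`, `car`) are made up for illustration, and Euclidean distance is only one of several measures an algorithm might use.

```python
import math

# Hypothetical 3-dimensional vectors standing in for three pieces of data.
cat = [0.9, 0.1, 0.3]
kitten = [0.8, 0.2, 0.3]
car = [0.1, 0.9, 0.7]

def euclidean_distance(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d_similar = euclidean_distance(cat, kitten)
d_dissimilar = euclidean_distance(cat, car)

# The "similar" pair sits closer together than the "dissimilar" pair.
print(d_similar < d_dissimilar)
```

The same function works unchanged for vectors of any dimension, as long as both vectors have the same length.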
You can also think of a machine learning model as something that transforms an input vector X into an output vector Y. This transformation is often the result of multiplying a matrix by a vector, or the result of a sequence of such transformations. The matrix (or matrices) holds the parameters of the model, and these have to be learned from data. Vectors, matrices, and their manipulations are the subject matter of linear algebra.
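As a rough sketch of a model as a matrix acting on a vector, the snippet below multiplies a matrix by an input vector, then chains a second matrix on top. The helper `matvec` and the values in `W`, `V`, and `x` are hypothetical; in a real model the matrix entries would be learned rather than typed in.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Hypothetical parameters: W maps 3D inputs to 2D, V maps 2D to 1D.
W = [[1.0, 0.0, 2.0],
     [0.0, 3.0, 1.0]]
V = [[1.0, -1.0]]

x = [1.0, 2.0, 3.0]   # input vector X
y = matvec(W, x)      # one transformation
z = matvec(V, y)      # a sequence of two transformations

print(y, z)
```

Note how the shapes have to line up: `W` has three columns because `x` has three elements, and `V` has two columns because `y` has two.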
Calculus: A model, or rather its parameters, has to be learned from data. But what does learning entail? Usually, it involves three steps: transforming an input vector into an output vector; calculating the loss (or error) made by the model, according to some measure of the distance between the produced output vector and the expected output vector; and then using that loss to tweak the parameters of the model so that, given the same input vector, the model produces an output that is a little closer to the expected output. In other words, the loss is reduced.
This is where calculus comes in. If the loss can be written down as a function of the parameters of the model, then reducing the loss is an optimization problem. The loss function gives you the loss landscape, and one has to descend from some region high up in the landscape (representing high loss) to some deep valley (representing very low or zero loss). One way to do so is by using some form of gradient descent. The gradient is a vector whose elements are the partial derivatives of the loss function with respect to each of the parameters. The gradient points in the direction in which the loss increases most steeply. Calculate the gradient, and use it to tweak the parameters such that one moves a smidgen in the opposite direction along the surface of the loss function. Doing so iteratively, one can reduce the loss to a minimum (often a local one rather than the lowest possible), or even to zero. At that point, one has learned a good-enough model. All this is calculus.
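Here is a minimal sketch of that loop for a toy model with two parameters, `w` and `b`, that predicts `w*x + b`. The training data, the mean-squared-error loss, the learning rate, and the number of steps are all assumptions chosen for illustration; the partial derivatives are written out by hand, whereas ML libraries typically compute them automatically.

```python
# Hypothetical training data generated by y = 2x + 1.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

def loss(w, b):
    """Mean squared error of the model w*x + b on the training data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(w, b):
    """Partial derivatives of the loss with respect to w and b."""
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db

w, b = 0.0, 0.0          # start somewhere high up in the loss landscape
learning_rate = 0.05     # size of each "smidgen" (an assumed value)
initial_loss = loss(w, b)

for _ in range(2000):
    dw, db = gradient(w, b)
    # Step in the direction opposite to the gradient.
    w -= learning_rate * dw
    b -= learning_rate * db

final_loss = loss(w, b)
print(f"w={w:.4f}, b={b:.4f}, loss={final_loss:.2e}")
```

With these settings the parameters should settle very close to `w = 2` and `b = 1`, the values used to generate the data. A learning rate that is too large would make the steps overshoot the valley instead of descending into it.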
Probability & Statistics: These are huge fields, but the basics of how probability and statistics apply to ML are not hard to appreciate. All machine learning involves learning about patterns in data and using those learned patterns to say something about new data, or to generate new data. All training data can be said to have been drawn from some underlying true distribution of all possible data. But we can never know this ground truth. So, the job of an ML algorithm is, in one way of thinking, to estimate the underlying distribution as well as possible using only the training data. The closer this estimate is to the ground truth, the better your ML model. This estimate can then be used to, say, discriminate between different classes of data (discriminative learning). Or one can sample from this distribution to generate a new instance of data that looks very much like real data; this is generative AI. The details are non-trivial, but conceptually that’s a good, simple start.
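The estimate-then-sample idea can be sketched in a few lines with Python's standard library. This toy version assumes the data is one-dimensional and normally distributed, which real data rarely is; `true_dist` plays the role of the ground truth that an actual ML algorithm would never get to see, and the specific mean, spread, and seeds are arbitrary.

```python
import random
from statistics import NormalDist

rng = random.Random(0)  # fixed seed so the run is repeatable

# The hidden "ground truth" distribution (assumed here for illustration).
true_dist = NormalDist(mu=5.0, sigma=2.0)

# Training data: a finite sample drawn from the true distribution.
training_data = [rng.gauss(true_dist.mean, true_dist.stdev)
                 for _ in range(10_000)]

# Estimate the distribution using only the training data.
estimate = NormalDist.from_samples(training_data)

# Generate new data by sampling from the estimate.
new_data = estimate.samples(5, seed=1)

print(estimate.mean, estimate.stdev)
print(new_data)
```

The estimated mean and standard deviation should land near the true values, and they tend to get closer as the amount of training data grows.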