The Conceptual Simplicity of Machine Learning
Machine Learning, once you grok it, is conceptually simple. But it hardly seems so when you begin, especially if you were once a software engineer—as I was—who worked on non-ML code. My first encounters with ML were confusing at best. It took the writing of WHY MACHINES LEARN for me to see its conceptual simplicity (even if the details can be extremely gnarly). Here’s ML distilled into key concepts:
First, a machine learning (ML) model—like any other piece of software—turns an input into some desired output. The main difference is that traditional software transforms the input into the desired output using algorithms designed and implemented by a programmer, whereas machine learning examines patterns in training data and learns an algorithm to do the transformation itself. The key words here are learning, patterns, and training data.
This transformation—no matter how complicated—can be thought of as the outcome of providing an input to a function f (), which produces an output. So, y = f (x). Note that x and y are both vectors (they can, of course, simply be scalars, but using vectors makes the description more general). So, the function f () transforms some input vector x into the desired output vector y. But how do you learn the requisite function f ()?
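To make the y = f (x) view concrete, here is a trivial, hand-coded f () in Python. It is purely illustrative; in ML, the whole point is that f () is learned from data rather than written by hand.

```python
import numpy as np

# A hand-written rule standing in for f(): double each element and add 1.
# In machine learning, this rule would be learned from data, not coded.
def f(x):
    return 2 * x + 1

x = np.array([1.0, 2.0, 3.0])   # input vector
y = f(x)                        # output vector: array([3., 5., 7.])
```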
That’s where data comes in. If you have enough examples of the mapping from some instance of x to the desired instance of y, i.e. you have enough (x, y) pairs, you can feed this “training data” to an ML algorithm, which will then learn the best possible f () to transform inputs into outputs. But what does training mean?
This is the beginning of machine learning. Humans have to create the training data, those (x, y) pairs. Then, given an x, your ML model (just think of it as some black box with parameters you can tune or tweak) will take x as input and produce some output y*, based on the current values of the model parameters. But you know the output should be y. You calculate the loss, based on the discrepancy between the expected value y and the predicted value y*. The loss depends on the model parameters. Tweak each parameter ever so slightly so that the model’s loss, given the same input, will be a little less than before. If you do this over and over again for every instance of training data, and reach an overall loss that’s zero or acceptably low, you’ll have a model that approximates the desired function f (), which will transform any x that wasn’t in the training data into the appropriate y (assuming that the new x hews to the statistics of the training data).
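Here is a minimal sketch of that tweak-the-parameters loop, assuming a toy one-parameter model y* = w·x and a squared-error loss; the names and numbers are illustrative, not from the text.

```python
# Toy training loop: learn w so that y* = w * x matches the (x, y) pairs.
training_pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # examples of y = 2x
w = 0.0                # the single tunable model parameter
learning_rate = 0.01   # how slight each tweak is

for epoch in range(1000):
    for x, y in training_pairs:
        y_star = w * x                    # the model's prediction
        gradient = 2 * (y_star - y) * x   # d/dw of the loss (y* - y)^2
        w -= learning_rate * gradient     # tweak w to lower the loss

print(w)   # ends up close to 2.0, the function hidden in the training data
```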
What can such a function f () accomplish? It depends on the task at hand. Let’s say you simply want to find a way to classify 100 x 100 images. Your training data has two sets of images: of cats and of dogs. Humans have labeled them as such. If you turn each image into a vector (by laying out the pixel values end to end, into a 10,000-element vector), then you have the requisite (x, y) pairs, where x is the vectorized image and y is the label (say, 0 for cat and 1 for dog). Now, the task of the ML algorithm is to find the function f () that represents the boundary between the two types of data (assuming that such a boundary exists), such that if you give the function a vectorized image of a cat it should produce a 0, and a 1 if the image is that of a dog. Such an ML algorithm can be used to tell apart types of data (there can be more than two types), and the method is called discriminative learning.
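As a toy illustration of turning labeled images into (x, y) pairs, here is one way it might look in Python; the random arrays below are stand-ins for real cat and dog photos.

```python
import numpy as np

cat_images = [np.random.rand(100, 100) for _ in range(5)]   # pretend cat photos
dog_images = [np.random.rand(100, 100) for _ in range(5)]   # pretend dog photos

pairs = []
for img in cat_images:
    pairs.append((img.flatten(), 0))   # x: 10,000-element vector, y: 0 for cat
for img in dog_images:
    pairs.append((img.flatten(), 1))   # x: 10,000-element vector, y: 1 for dog

X = np.stack([x for x, _ in pairs])           # shape (10, 10000)
y = np.array([label for _, label in pairs])   # shape (10,)
```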
This brings us to another key issue in ML algorithms. Is the boundary we just discussed linear (meaning, a straight line in 2D, a plane in 3D, or a hyperplane in higher dimensions) or non-linear? Many ML algorithms are designed to find linear boundaries (such as the perceptron algorithm or even vanilla support vector machines). But sometimes the data is not linearly separable, in which case you can use algorithms such as the k-nearest neighbor (k-NN) rule, or the Naïve Bayes classifier, to find a non-linear boundary. Or, if you are feeling more adventurous, you can use a kernel method to project the data into higher dimensions, use a linear classifier such as an SVM to find a linear boundary in that higher-dimensional space, and project back to lower dimensions, where the boundary becomes non-linear.
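A quick sketch of the linear-versus-non-linear point, using scikit-learn (my choice of library here, not something the text prescribes) on data that no straight line can separate:

```python
from sklearn.datasets import make_circles
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Two concentric rings: not linearly separable by construction.
X, y = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)          # looks for a straight-line boundary
kernel_svm = SVC(kernel="rbf").fit(X, y)             # kernel trick: linear in higher dimensions
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)  # non-linear boundary by construction

print(linear_svm.score(X, y))   # poor: no linear boundary exists
print(kernel_svm.score(X, y))   # near-perfect
print(knn.score(X, y))          # near-perfect
```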
Sometimes such a function f () is not used to discriminate, but to optimally fit a collection of data points, a process called regression. You still use the (x, y) pairs, but this time you are learning how to predict y, given x. Once you learn the f (), then given some new x, you can predict y. The regression can be linear or non-linear.
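A minimal regression sketch, assuming noisy (x, y) pairs generated from the line y = 3x + 1 (the numbers are illustrative); the learned f () is the straight line fit to those points.

```python
import numpy as np

x = np.linspace(0, 10, 50)
y = 3 * x + 1 + np.random.normal(scale=0.5, size=x.shape)   # noisy observations

slope, intercept = np.polyfit(x, y, deg=1)   # learn the linear f()
new_x = 12.0
predicted_y = slope * new_x + intercept      # predict y for an x not seen before
```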
What if you want to do more than tell apart data? Say, you want to generate new data that statistically resembles the training data. That’s where generative learning comes in. In this case, the function f () you learn is an estimate of the probability distribution over data. Imagine your data spread out on the x-y plane and a 3D surface above that data that captures the probability distribution over the data. Such a surface will have peaks/bumps over regions where the data is more likely and valleys where it’s less likely. (Of course, in reality the data is going to be very high dimensional and the surface that models the probability distribution is going to be in one higher dimension.) But once you have learned an estimate of the distribution over data, given enough data, what do you do with it?
Well, you can use it to tell apart different kinds of data: so, you can leverage it for discriminative tasks. But the more interesting use is data generation. If you can sample from the distribution (not an easy task), then—in our toy example—it’s akin to finding a point on your 3D surface, projecting down to the x-y plane and figuring out the properties of the data instance that’s underneath (in actuality, all this would be happening in very high dimensions). But if you could do it, you’d have an instance of data that looks very much like the data on which the ML model was trained.
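Here is a toy sketch of those two steps, using a Gaussian mixture as the learned distribution (a deliberately simple stand-in; real generative models are far more elaborate): estimate the distribution over the training data, then sample from it to get new, similar-looking data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Pretend training data: two clusters of 2-D points.
training_data = np.vstack([
    np.random.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2)),
    np.random.normal(loc=[5.0, 5.0], scale=0.5, size=(200, 2)),
])

model = GaussianMixture(n_components=2).fit(training_data)  # step 1: estimate the distribution
new_points, _ = model.sample(10)                            # step 2: sample new, similar data
```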
Generative AI comes down to these two (sometimes very difficult) tasks: 1) learn or estimate the probability distribution over the training data, and 2) sample from that distribution to get at the underlying data. In the case of deep learning, the f () you learn could involve the entire process of estimating the distribution and sampling from it. It might beggar belief, but even the most complex AIs out there—large language models and diffusion models such as DALL-E—can be understood by thinking in these terms. Again, the details of the implementations and models can be extremely complicated, but that doesn’t detract from making sense of ML/AI using these broad-brush strokes.
Let’s take DALL-E. It’s an image generation “diffusion model” that’s trained on oodles and oodles of images. The process involves transforming an image, tiny step by tiny step (by adding a smidgen of Gaussian noise at each step), into a noisy but simple image that resembles a sample from some Gaussian distribution—a process called diffusion. You train a deep neural network to reverse this process: go back from the simple image to the complex image, step by step. Once you have trained an AI to do this for every image in your training data, you are good to go. Now, if you want to generate a new image, you first sample from the simple Gaussian distribution to get a noisy image (which is easy to do), and then run the diffusion process in reverse. You get back a sample from the complex distribution over images: an image that resembles the training data.
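A bare-bones sketch of the forward (noising) half of diffusion described above; the reverse, denoising half is what the neural network is trained to do and is only indicated by comments here. The step count and noise schedule are illustrative.

```python
import numpy as np

image = np.random.rand(100, 100)   # stand-in for a real training image
num_steps = 1000
beta = 0.02                        # fraction of noise mixed in at each tiny step

noisy = image.copy()
for t in range(num_steps):
    noise = np.random.normal(size=noisy.shape)
    noisy = np.sqrt(1 - beta) * noisy + np.sqrt(beta) * noise
    # A denoising network would be trained to predict the noise added at step t,
    # so that at generation time it can run these steps in reverse, starting
    # from an image sampled from a simple Gaussian distribution.
```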
What about Large Language Models (LLMs)? Again, the bird’s-eye view of an LLM is that it learns how to estimate the conditional probability distribution over its entire vocabulary of words (well, tokens, but let’s stick with words). Let’s say you take a trained LLM and give it 100 words. Assume it has a vocabulary of 1,000 words. Basically, what the LLM does is to calculate the conditional probability distribution over its entire vocabulary for the next word, given the input of 100 words. Once it has the distribution, it can sample from it to pick the next word. It appends that word to the 100 words, to create an input of 101 words, and repeats the process. It now has 102 words, and it keeps doing this until, say, it generates some end-of-text symbol. Very simply, during training the LLM learns to estimate the correct conditional probability distribution over its entire vocabulary given some input text; during generation, it samples from that distribution, word by word! And that leads to the amazing behavior that we see from large language models.
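A schematic version of that generation loop is below; the model here is a placeholder that returns random scores, whereas a real LLM would be a trained neural network producing one score per vocabulary word.

```python
import numpy as np

VOCAB_SIZE = 1000      # the 1,000-word vocabulary from the example above
END_OF_TEXT = 999      # pretend id of the end-of-text symbol

def model(token_ids):
    # Placeholder: a real LLM computes these scores from the input words.
    return np.random.randn(VOCAB_SIZE)

tokens = list(range(100))   # the 100 input words, as ids
while tokens[-1] != END_OF_TEXT and len(tokens) < 200:
    scores = model(tokens)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                                 # conditional distribution over the vocabulary
    next_token = np.random.choice(VOCAB_SIZE, p=probs)   # sample the next word
    tokens.append(next_token)                            # 101 words, then 102, and so on
```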
A not-so-small aside: We talked of (x, y) pairs used for training data. In the early days of AI, these (x, y) pairs were created by humans by painstakingly labeling the training data. That’s not the case for modern generative AI. Rather, one can take some x, and use it to construct some y. For example, you can take a sentence, mask the last word of the sentence, and ask an LLM to learn to predict the masked word. So, x is the masked sentence and y is the masked word. If you took text from the internet, you’d have billions and billions of such (x, y) pairs. This can be automated without human intervention. Or, you can take an image and mask, say, 25% of the pixels and teach the neural network to predict the entire image. In this case, x is the masked image and y is the unmasked image. If humans create labeled (x, y) pairs and train an ML model, the technique is called supervised learning. If you automate the process of creating (x, y) pairs from data, it’s called self-supervised learning.
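A small sketch of that self-supervised pair creation: take raw sentences, mask the last word, and use the masked word as the target; no human labeling is involved (the sentences here are placeholders).

```python
sentences = [
    "the cat sat on the mat",
    "machine learning is conceptually simple",
]

pairs = []
for s in sentences:
    words = s.split()
    x = " ".join(words[:-1] + ["[MASK]"])   # input: sentence with its last word masked
    y = words[-1]                           # target: the masked word
    pairs.append((x, y))

print(pairs[0])   # ('the cat sat on the [MASK]', 'mat')
```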
FOR MORE, please see the links to the book, WHY MACHINES LEARN: US Edition UK Edition