The Many Moods of Machine Learning
If you feel overwhelmed by the jargon of machine learning and wonder how it all ties together, you aren’t alone. Training vs test data, supervised vs unsupervised vs self-supervised learning, regression vs classification, linear vs non-linear models, hand-designed vs learned features, discriminative vs generative AI, and so on. These are a few of the important axes along which one can analyze and understand ML. The more ways we can grok machine learning, the more it begins making sense. Here are some intuitions to get us started.
But first, let’s get a basic definition out of the way. Machine learning, in the broadest sense, is about getting machines to learn the patterns that exist in data and then use what’s learned—i.e., the model of the data—to make predictions given new, previously unseen data, or even to generate new data.
TRAINING vs TEST DATA
The data used to train a model is the training dataset. The test dataset is the unseen data on which you test your model before releasing it into the wild. There are nuances, but that’s the broad distinction. One crucial point: both the training and test data are said to be drawn from the same underlying distribution of data (so, if the training data only has images of cats and dogs, so too must the test data; you can’t test your ML model of cats and dogs using images of elephants, for example). In most ML projects, you ensure this consistency by taking a dataset and splitting it 80:20—80% of the dataset is used for training and 20% for testing. There are niggling details about getting this split right, but that’s the general idea.
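As a minimal sketch of that split, here is how it might look in Python, assuming scikit-learn is available; the dataset is a random placeholder, and the 80:20 ratio and random seed are purely illustrative:

```python
# A sketch of an 80:20 train/test split, assuming scikit-learn.
# `data` is a placeholder: 10,000 examples, each described by 100 numbers.
import numpy as np
from sklearn.model_selection import train_test_split

data = np.random.rand(10000, 100)

# 80% of the examples go to training, 20% are held out for testing.
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
print(train_data.shape, test_data.shape)  # (8000, 100) (2000, 100)
```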
Let’s start with a basic dataset to illustrate the concepts. Consider grayscale images that are 50 pixels by 50 pixels, so 2,500 pixels in total. Let’s say we have ten thousand images in total. But we don’t know if the images have any commonality—are they of similar things (say, cats) or very dissimilar things (say, cats and cars)? In other words, no human has looked at the raw data and labeled them as being those of cats or cars. What can we do?
UNSUPERVISED vs SUPERVISED vs SELF-SUPERVISED
One type of learning we can do with such raw data is called unsupervised learning—meaning, no human supervision is involved. We’d first turn each image into a vector, such that each pixel is an element of that vector. Each image becomes a 2,500-element vector. If you were to plot this vector in 2,500-dimensional space, it’d be a point. Imagine plotting each image in that 2,500-D space. You’d get 10,000 points, one for each image. Maybe about half of them are clustered in one region of that space and the rest in another region. A clustering algorithm—such as K-means clustering—can find the “centroid” of each cluster; an image is of one type or another based on the centroid it’s closest to. Note that the algorithm can’t tell whether the image is that of a cat or a car, just that it’s of one type or another (you can, of course, have more than two clusters—but you’ll have to tell your algorithm at the outset how many clusters to look for).
Now, what does it mean to make a prediction about new, unseen data in this case? Well, given a new image, you can simply plot it as a point in the 2,500-D space, and see which centroid it’s closest to; that predicts the type of image (type 1 or type 2, without again knowing anything more about the image).
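Here is a small sketch of that idea, assuming scikit-learn’s K-means implementation; the image vectors are random placeholders standing in for real flattened 50x50 images:

```python
# A sketch of unsupervised clustering with K-means, assuming scikit-learn.
# Each row of `images` stands in for one 50x50 grayscale image flattened
# into a 2,500-element vector; real data would replace the random values.
import numpy as np
from sklearn.cluster import KMeans

images = np.random.rand(10000, 2500)

# We must tell the algorithm how many clusters to look for (here, 2).
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
kmeans.fit(images)

# The learned centroids, one per cluster, each a point in 2,500-D space.
centroids = kmeans.cluster_centers_

# "Prediction" for a new, unseen image: which centroid is it closest to?
new_image = np.random.rand(1, 2500)
cluster_id = kmeans.predict(new_image)[0]   # 0 or 1: type 1 or type 2
print(cluster_id)
```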
But what if you wanted to not just cluster the images, but also classify them, i.e., tell apart images of cats from those of cars? That’s where supervised learning comes in. Humans would have to first label the 10,000 images as either a “cat” or a “car”, where “cat” can be -1 and “car” +1. Now, training a model means showing it an image and asking it to learn to correctly classify it as a cat (-1) or a car (+1). Once the model has learned this correlation between an image and its class, for all the images in the training dataset, you can give it a new image, and it’ll tell you whether it’s an image of a cat or a car! Of course, there’s always a risk that the classifier will make a mistake, and this risk can be quantified; different algorithms come with different levels of risk.
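A minimal sketch of that supervised setup, assuming scikit-learn and placeholder data; a linear support-vector classifier is just one of many algorithms you could pick here:

```python
# A sketch of supervised classification with human-provided labels:
# -1 for "cat", +1 for "car". The images and labels are random placeholders.
import numpy as np
from sklearn.svm import LinearSVC

X_train = np.random.rand(1000, 2500)             # flattened training images
y_train = np.random.choice([-1, 1], size=1000)   # human-provided labels

clf = LinearSVC()           # one choice of classifier among many
clf.fit(X_train, y_train)   # learn to associate images with their labels

new_image = np.random.rand(1, 2500)
print(clf.predict(new_image))   # -1 ("cat") or +1 ("car")
```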
There’s another class of learning: self-supervised. This is not about finding clusters in data, but rather about learning the features or patterns in data that can then be used for downstream tasks. Briefly, imagine taking an image from your training dataset and randomly masking 25% of the pixels, providing the masked image as input to your ML algorithm, and asking it to predict the full unmasked image. As the ML algorithm—in this case, most likely a deep neural network trained using backpropagation—learns to do this, by iterating over and over through the entire set of images in the training dataset, it learns some internal representations that capture essential features of the data, which can then be used to reconstruct images. Now, given some new, partially obscured image, the model can fill in the missing pixels.
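To make the masking step concrete, here is a rough sketch of how such a training pair could be constructed; the 25% fraction matches the example above, and the reconstruction network itself (the hypothetical model that learns to fill in the pixels) is not shown:

```python
# A sketch of the self-supervised setup described above: randomly mask 25%
# of the pixels and ask a model to reconstruct the full image.
import numpy as np

rng = np.random.default_rng(0)

def mask_image(flat_image, fraction=0.25):
    """Zero out a random fraction of pixels; return the masked copy and the mask indices."""
    masked = flat_image.copy()
    n_pixels = flat_image.size
    idx = rng.choice(n_pixels, size=int(fraction * n_pixels), replace=False)
    masked[idx] = 0.0
    return masked, idx

image = rng.random(2500)                 # one flattened 50x50 image (placeholder)
masked_input, masked_idx = mask_image(image)

# Training pair: the model sees `masked_input`, and the "teaching" signal is
# the original `image` itself; no human labels are needed.
target = image
```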
Modern Large Language Models (LLMs) use self-supervised learning. Given textual data, the algorithm learns a model of human-written language. It starts by masking, for example, the last word of a sentence and getting the model to correctly predict the masked word. By doing this for every sentence in a massive corpus of such data, the model learns internal representations of the statistical structure of written language.
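A toy sketch of how such training pairs might be extracted from raw text, with a two-sentence placeholder corpus; real LLM training works on tokens at a vastly larger scale, but the self-supervised idea is the same:

```python
# A sketch of building self-supervised training pairs from raw text:
# hide the last word of each sentence and ask the model to predict it
# from the remaining words. The corpus here is a toy placeholder.
corpus = [
    "the cat sat on the mat",
    "machine learning finds patterns in data",
]

training_pairs = []
for sentence in corpus:
    words = sentence.split()
    context, masked_word = words[:-1], words[-1]
    training_pairs.append((context, masked_word))

# Each pair is (input context, word the model must predict), e.g.:
# (['the', 'cat', 'sat', 'on', 'the'], 'mat')
print(training_pairs)
```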
Why use the term self-supervised and not unsupervised for LLMs? In unsupervised learning, the algorithm doesn’t generate a “teaching” signal by comparing the output produced by a model against the expected output. In supervised learning, the algorithm generates such a teaching signal by comparing, say, the output of the model against a human-provided label. In self-supervised learning, the algorithm manufactures such a teaching signal without a human in the loop, say, by comparing the predicted values for the masked pixels against the known values of those pixels, or by comparing the predicted masked word at the end of a sentence against the known word, and so on.
REGRESSION vs CLASSIFICATION
An ML algorithm that’s trying to tell apart, for example, an image of a cat from that of a car is doing classification. It’s categorizing data into discrete classes.
But what if the problem you were trying to solve involved only images of cars, and associated with each image was some real-valued number denoting the size/volume of the car in cubic feet? (This is a completely made-up example, and quite an unreasonable one at that.) Let’s say you had 10,000 such images in the training data, and humans had painstakingly provided the size/volume of the car in each image. One can train an ML model to correlate the car in an image with its size/volume. Now, given an image of a car that wasn’t in the training dataset, the model can predict some real-valued number that’s an estimate of the car’s size/volume. This task—of predicting some real-valued number given some input—is regression.
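Here is a minimal regression sketch, assuming scikit-learn; the images and the volumes are random placeholders, and ridge regression is just one simple choice of model:

```python
# A sketch of regression: each flattened car image is paired with a
# human-provided real number (its volume in cubic feet), and the model
# learns to predict that number for unseen images. The data is fake.
import numpy as np
from sklearn.linear_model import Ridge

X_train = np.random.rand(1000, 2500)              # flattened car images
y_train = np.random.uniform(80, 600, size=1000)   # volume in cubic feet

reg = Ridge()              # a simple (linear) regression model
reg.fit(X_train, y_train)

new_car = np.random.rand(1, 2500)
print(reg.predict(new_car))   # a real-valued estimate of the car's volume
```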
LINEAR vs NON-LINEAR
A classifier described above finds a boundary between, say, two clusters of data. This boundary can be linear—a line in 2D, a plane in 3D, or a hyperplane in higher dimensions—making it a linear classifier.
But sometimes a line/hyperplane cannot separate the clusters; you need a curve/surface, something wiggly. The boundary, and hence the classifier, becomes non-linear.
The same can be said for regression, which can be thought of as learning a function that maps input variables to a real-valued output. If this function is a straight line, it’s linear regression. If it’s curvy, wiggly, it’s non-linear.
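The sketch below, assuming scikit-learn, contrasts the two on the same made-up data: a straight-line fit versus a polynomial (non-linear) fit to a deliberately wiggly target:

```python
# A sketch contrasting linear and non-linear regression on the same data.
# The target follows a sine curve, so a straight line fits poorly and a
# polynomial (non-linear) model does noticeably better.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200).reshape(-1, 1)
y = np.sin(x).ravel() + 0.1 * rng.standard_normal(200)   # wiggly target

linear = LinearRegression().fit(x, y)
nonlinear = make_pipeline(PolynomialFeatures(degree=5), LinearRegression()).fit(x, y)

print("linear fit R^2:    ", linear.score(x, y))
print("non-linear fit R^2:", nonlinear.score(x, y))
```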
HAND-DESIGNED vs LEARNED FEATURES
Firstly, what are features? Say you had images of two types of vehicles: fire trucks and school buses. You want to build an ML model that can tell apart fire trucks from school buses. One thing you can do is to analyze the images, either manually or by writing some software, and come up with values for each of these aspects (or features) of the vehicles:
Color (Yellow for school bus, Red for fire truck)
Has a ladder (No for school bus, Yes for fire truck)
Has a hose (No for school bus, Yes for fire truck)
Has a stop sign (Yes for school bus, No for fire truck)
Now, you can train an ML model to recognize that an object is a school bus if the four features take the values (Yellow, No, No, Yes) and a fire truck if they take the values (Red, Yes, Yes, No). These features are said to have been hand-designed. Someone decided these were the important features!
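A small sketch of what classifying with hand-designed features might look like, assuming scikit-learn; the step that extracts the four feature values from an image is assumed to exist elsewhere, and the tiny dataset is purely illustrative:

```python
# A sketch of classification with hand-designed features. Each vehicle is
# described by four human-chosen features, encoded as 0/1:
# [is_red, has_ladder, has_hose, has_stop_sign].
from sklearn.tree import DecisionTreeClassifier

X_train = [
    [0, 0, 0, 1],   # school bus: yellow, no ladder, no hose, stop sign
    [1, 1, 1, 0],   # fire truck: red, ladder, hose, no stop sign
]
y_train = ["school bus", "fire truck"]

clf = DecisionTreeClassifier().fit(X_train, y_train)
print(clf.predict([[1, 1, 1, 0]]))   # -> ['fire truck']
```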
Or, you can take a set of images that have been labeled as “school bus” or “fire truck” and feed each image to a neural network (by first turning each image into, say, a 2,500-element vector), and use supervised learning to get the network to distinguish between the two types of images. If there are no other pieces of confounding information in the images, it’s very likely that the neural network will learn to internally represent the features of interest—i.e., color, ladder, hose, stop sign—and use these features for classification. It would have learned these features. (It’s entirely possible that the model might pick up some other extraneous information or features that allow it to classify the images.)
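As a rough sketch of the learned-features route, here is a small multi-layer network trained directly on pixel vectors, assuming scikit-learn (a stand-in for a larger deep network trained with backpropagation); the data is a random placeholder, so it won’t learn anything meaningful here:

```python
# A sketch of learning features from raw pixels: the network is given only
# flattened images and labels, and builds its own internal features in the
# hidden layer. Placeholder data stands in for real labeled images.
import numpy as np
from sklearn.neural_network import MLPClassifier

X_train = np.random.rand(1000, 2500)                              # flattened images
y_train = np.random.choice(["school bus", "fire truck"], size=1000)

net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=50)
net.fit(X_train, y_train)

new_image = np.random.rand(1, 2500)
print(net.predict(new_image))
```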
DISCRIMINATIVE vs GENERATIVE AI
When you train a model to simply categorize or classify data, you have built a discriminative AI. It discriminates between different classes of data.
Sometimes you want to do more than categorize. You want to generate new data. Think about our training dataset of 10,000 50x50-pixel images of, say, cars and cats. Since each image is a point in 2,500-D space, you’d get two clusters of points, one for cars and one for cats, if you plotted all the images in that high-dimensional space. Now imagine a surface in this space, plus one extra dimension, that hovers over the data points—call it the probability distribution over the data, which models how likely any given point is in that 2,500-D space. There will be two peaks in this surface: one over points representing cats and the other over points representing cars; the value of the surface will taper off to zero elsewhere. Now, instead of trying to learn a curve or a surface to separate the classes, you can build an ML model of the surface that estimates the probability distribution over the data.
Then, one can sample from that distribution: a sample gives you a point in 2,500-D space, or a 50x50 image, with likely pixel values. In all likelihood, such an image would look either like a cat or a car, even if it doesn’t exactly resemble any image in the training data.
This, in essence, is generative AI: you learn the probability distribution over data and figure out how to sample from it, to create a new instance of data that resembles the training data, statistically speaking. However, the twin problems of estimating the distribution and sampling from it are far from trivial.
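One simple way to make this concrete, assuming scikit-learn: fit a Gaussian mixture model with two components as a crude stand-in for that two-peaked distribution, and then sample from it. Real generative models estimate and sample from far richer distributions, but the shape of the idea is the same:

```python
# A sketch of the generative idea: estimate a probability distribution over
# the data (a two-component Gaussian mixture, one component per expected
# peak) and then sample new points from it. Placeholder data stands in for
# real flattened 50x50 images.
import numpy as np
from sklearn.mixture import GaussianMixture

images = np.random.rand(1000, 2500)

gmm = GaussianMixture(n_components=2, covariance_type="diag")
gmm.fit(images)                           # estimate the distribution over data

samples, _ = gmm.sample(3)                # three new points in 2,500-D space
new_images = samples.reshape(3, 50, 50)   # i.e., three brand-new 50x50 "images"
print(new_images.shape)
```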
All this and more, mingled with the social history and lots of math, can be found in WHY MACHINES LEARN.