Anil Ananthaswamy

The Many Moods of Machine Learning

If you feel overwhelmed by the jargon of machine learning and wonder how it all ties together, you aren’t alone. Training vs test data, supervised vs unsupervised vs self-supervised learning, regression vs classification, linear vs non-linear models, hand-designed vs learned features, discriminative vs generative AI, and so on. These are a few of the important axes along which one can analyze and understand ML. The more ways we can grok machine learning, the more it begins making sense. Here are some intuitions to get us started.

But first, let’s get a basic definition out of the way. Machine learning, in the broadest sense, is about getting machines to learn the patterns that exist in data and then use what’s learned—i.e., the model of the data—to make predictions about new, previously unseen data, or even to generate new data.

TRAINING vs TEST DATA

The data used to train a model is the training dataset. The test dataset is the unseen data on which you test your model before releasing it into the wild. There are nuances, but that’s the broad distinction. One crucial point: both the training and test data are said to be drawn from the same underlying distribution of data (so, if the training data only has images of cats and dogs, so too must the test data; you can’t test your ML model of cats and dogs using images of elephants, for example.) In most ML projects, you ensure this consistency by taking a dataset and splitting it 80:20—80% of the dataset is used for training and 20% for testing. There are niggling details about getting this split right, but that’s the general idea.

Let’s start with a basic dataset to illustrate the concepts. Consider grayscale images that are 50 pixels by 50 pixels, so 2,500 pixels in total. Let’s say we have ten thousand images in total. But we don’t know if the images have any commonality—are they of similar things (say, cats) or very dissimilar things (say, cats and cars)? In other words, no human has looked at the raw data and labeled them as being those of cats or cars. What can we do?

UNSUPERVISED vs SUPERVISED vs SELF-SUPERVISED

One type of learning we can do with such raw data is called unsupervised learning—meaning, no human supervision is involved. We’d first turn each image into a vector, such that each pixel is an element of that vector. Each image becomes a 2,500-element vector. If you were to plot this vector in 2,500-dimensional space, it’d be a point. Imagine plotting each image in that 2,500-D space. You’d get 10,000 points, one for each image. Maybe about half of them are clustered in one region of that space and the rest in another region. A clustering algorithm—such as K-means clustering—can find the “centroid” for each cluster; an image is of one type or another based on the centroid it’s closest to. Note, the algorithm can’t tell whether the image is that of a cat or a car, just that it’s of one type or another (you can, of course, have more than two clusters—but you’ll have to tell your algorithm at the outset how many clusters to look for).

Now, what does it mean to make a prediction about new, unseen data in this case? Well, given a new image, you can simply plot it as a point in the 2,500-D space, and see which centroid it’s closest to; that predicts the type of image (type 1 or type 2, without again knowing anything more about the image).
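Here’s a minimal sketch of that clustering-and-prediction idea in Python (not code from the book). It assumes the 10,000 images are already loaded into a NumPy array called `images`, with `new_image` a single unseen image; both names are placeholders for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# images: assumed to be a (10000, 50, 50) array of grayscale values (placeholder)
vectors = images.reshape(len(images), -1)     # each image becomes a 2,500-element vector

# Ask K-means to find two clusters among the 10,000 points in 2,500-D space
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

# "Prediction" for a new, unseen image: which centroid is it closest to?
cluster_id = kmeans.predict(new_image.reshape(1, -1))[0]   # 0 or 1: type 1 or type 2, nothing more
```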

But what if you wanted to not just cluster the images, but also classify them, i.e., tell apart images of cats from those of cars? That’s where supervised learning comes in. Humans would have to first label the 10,000 images as either a “cat” or a “car”, where “cat” can be -1 and “car” +1. Now, training a model means showing it an image and asking it to learn to correctly classify it as a cat (-1) or a car (+1). Once the model has learned this correlation between an image and its class, for all the images in the training dataset, you can give it a new image, and it’ll tell you whether it’s an image of a cat or a car! Of course, there’s always a risk that the classifier will make a mistake, and this risk can be quantified; different algorithms come with different levels of risk.
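As a rough sketch of the supervised version (illustrative rather than definitive), suppose `vectors` is the same (10000, 2500) array as in the clustering sketch and `labels` is a human-provided array of -1s (“cat”) and +1s (“car”); a linear classifier plus the 80:20 split described earlier might look like this:

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# labels: assumed human-provided, -1 for "cat" and +1 for "car" (placeholder)
X_train, X_test, y_train, y_test = train_test_split(
    vectors, labels, test_size=0.2, random_state=0)     # the 80:20 split

clf = LinearSVC().fit(X_train, y_train)                  # learn the image -> class mapping

print("accuracy on unseen test images:", clf.score(X_test, y_test))
print("prediction for a new image:", clf.predict(new_image.reshape(1, -1)))  # -1 (cat) or +1 (car)
```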

There’s another class of learning: self-supervised. This is not about finding clusters in data, but rather about learning the features or patterns in data that can then be used for downstream tasks. Briefly, imagine taking an image from your training dataset and randomly masking 25% of the pixels, providing the masked image as input to your ML algorithm, and asking it to predict the full unmasked image. As the ML algorithm—in this case, most likely a deep neural network trained using backpropagation—learns to do this, by iterating over and over across the entire set of images in the training dataset, it learns internal representations that capture essential features of the data, which can then be used to reconstruct images. Now, given some new, partially obscured image, the model can fill in the missing pixels.

Modern Large Language Models (LLMs) use self-supervised learning. Given textual data, the algorithm learns a model of human written language. It starts by masking, for example, the last word of a sentence, and gets a model to correctly predict the masked word. By doing this for every sentence in a massive corpus of such data, the model learns internal representations of the statistical structure of written language.

Why use the term self-supervised and not unsupervised for LLMs? In unsupervised learning, the algorithm doesn’t generate a “teaching” signal by comparing the output produced by a model against the expected output. In supervised learning, the algorithm generates such a teaching signal by comparing, say, the output of the model against a human-provided label. In self-supervised learning, the algorithm manufactures such a teaching signal without a human in the loop, say, by comparing the predicted values for the masked pixels against the known values of those pixels, or by comparing the predicted masked word at the end of a sentence against the known word, and so on.
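To make the “manufactured teaching signal” concrete, here is a small, hypothetical helper that builds an (input, target) pair from a single unlabeled image by masking 25% of its pixels; how the masked pixels are hidden (here, set to zero) is just one simple choice.

```python
import numpy as np

def make_self_supervised_pair(image, mask_fraction=0.25, seed=0):
    """Manufacture a (masked input, full target) pair from one unlabeled image."""
    rng = np.random.default_rng(seed)
    x = image.astype(float).copy()
    n_masked = int(mask_fraction * x.size)
    hidden = rng.choice(x.size, size=n_masked, replace=False)
    x.flat[hidden] = 0.0          # hide 25% of the pixels
    return x, image               # input: masked image; target: the original image itself
```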

REGRESSION vs CLASSIFICATION

An ML algorithm that’s trying to tell apart, for example, an image of a cat from that of a car is doing classification. It’s categorizing data.

But what if the problem you were trying to solve involved only images of cars, and associated with each image was some real-valued number denoting the size/volume of the car in cubic feet? (This is a completely made-up example, and quite an unreasonable one at that.) Let’s say you had 10,000 such images in the training data, and humans had painstakingly provided the size/volume for the car in each image. One can train an ML model to correlate the car in an image with its size/volume. Now, given an image of a car that wasn’t in the training dataset, the model can predict some real-valued number that’s an estimate of the car’s size/volume. This task—of predicting some real-valued number given some input—is regression.
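A hedged sketch of that (admittedly contrived) regression setup, assuming `car_vectors` holds the vectorized car images and `volumes` the human-provided sizes in cubic feet (both placeholder names):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# car_vectors: (10000, 2500) vectorized car images; volumes: real-valued sizes in cubic feet
X_train, X_test, y_train, y_test = train_test_split(car_vectors, volumes, test_size=0.2)

reg = LinearRegression().fit(X_train, y_train)
predicted_volume = reg.predict(X_test[:1])   # a real-valued number, not a class label
```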

LINEAR vs NON-LINEAR

A classifier described above finds a boundary between, say, two clusters of data. This boundary can be linear—a line in 2D, a plane in 3D, or a hyperplane in higher dimensions—making it a linear classifier.

But sometimes a line/hyperplane cannot separate the clusters; you need a curve/surface, something wiggly. The boundary and hence the classifier becomes non-linear.

The same can be said for regression, which can be thought of as learning a function that maps input variables to a real-valued output. If this function is a straight line, it’s linear regression. If it’s curvy, wiggly, it’s non-linear.
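One way to see the difference is to fit both a straight line and a curvy polynomial to the same wiggly data; the sketch below uses NumPy’s polynomial fitting purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)   # a wiggly input-output relationship

linear_fit = np.polyfit(x, y, deg=1)    # a straight line: linear regression, underfits the wiggles
cubic_fit = np.polyfit(x, y, deg=3)     # a curvy fit: non-linear in x

print(np.polyval(linear_fit, 1.5), np.polyval(cubic_fit, 1.5), np.sin(1.5))
```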

HAND-DESIGNED vs LEARNED FEATURES

Firstly, what are features? Say you had images of two types of vehicles: fire trucks and school buses. You want to build an ML model that can tell apart fire trucks from school buses. One thing you can do is analyze the images, either manually or by writing some software, and come up with values for each of these aspects (or features) of the vehicles:

  • Color (Yellow for school bus, Red for fire truck)

  • Has a ladder (No for school bus, Yes for fire truck)

  • Has a hose (No for school bus, Yes for fire truck)

  • Has a stop sign (Yes for school bus, No for fire truck)

Now, you can train an ML model to recognize that an object is a school bus if the four features have the value (Yellow, No, No, Yes) and a fire truck if the features have the value (Red, Yes, Yes, No). These features are said to have been hand-designed. Someone decided these were the important features!

Or, you can take a set of images that have been labeled as “school bus” or “fire truck” and feed each image to a neural network (by first turning each image into, say, a 2,500-element vector), and use supervised learning to get the network to distinguish between the two types of images. If there are no other pieces of confounding information in the images, it’s very likely that the neural network will learn to internally represent the features of interest—i.e. color, ladder, hose, stop sign—and use these features for classification. It would have learned these features. (It’s entirely possible that the model might pick up some other extraneous information or features that allow it to classify the images.)
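A minimal sketch of the learned-features route, assuming `vehicle_vectors` and `vehicle_labels` are placeholder names for the vectorized images and their human-provided labels; the small neural network below is never told about color, ladders, hoses, or stop signs.

```python
from sklearn.neural_network import MLPClassifier

# vehicle_vectors: (N, 2500) vectorized images; vehicle_labels: "school bus" or "fire truck"
net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
net.fit(vehicle_vectors, vehicle_labels)

# Whatever features the network needs, it has learned internally (in net.coefs_);
# they may or may not line up with the features a human would have designed.
print(net.predict(new_vehicle.reshape(1, -1)))
```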

DISCRIMINATIVE vs GENERATIVE AI

When you train a model to simply categorize or classify data, you have built a discriminative AI. It discriminates between different classes of data.

Sometimes you want to do more than categorize. You want to generate new data. Think about our training dataset of 10,000 50x50-pixel images of, say, cars and cats. Since each image is a point in 2,500-D space, you’d get two clusters of points, one for cars and one for cats, if you plotted all the images in that high-dimensional space. Now imagine a surface in this space plus one extra dimension that hovers over the data points—call it the probability distribution over the data, which models how likely any given point in that 2,500-D space is. There will be two peaks in this surface: one over points representing cats and the other over points representing cars; the value of the surface will taper off to zero elsewhere. Now, instead of trying to learn a curve or a surface to separate the classes, you can build an ML model of the surface that estimates the probability distribution over the data.

Then, one can sample from that distribution: a sample gives you a point in 2,500-D space, or a 50x50 image, with likely pixel values. In all likelihood, such an image would look like either a cat or a car, even if it doesn’t exactly resemble any image in the training data.

This, in essence, is generative AI: you learn the probability distribution over data and figure out how to sample from it, to create a new instance of data that resembles the training data, statistically speaking. However, the twin problems of estimating the distribution and sampling from it are far from trivial.
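As a crude stand-in for that surface, one could fit a simple two-bump distribution to the image vectors and sample from it; real generative models estimate far richer distributions, so treat this only as a sketch of the “estimate, then sample” recipe (again with placeholder variable names).

```python
from sklearn.mixture import GaussianMixture

# vectors: the (10000, 2500) array of vectorized cat and car images (placeholder)
gmm = GaussianMixture(n_components=2).fit(vectors)   # two bumps: one for cats, one for cars

samples, _ = gmm.sample(5)                   # draw new points from the estimated distribution
new_images = samples.reshape(-1, 50, 50)     # each sample is a brand-new 50x50 "image"
```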

All this and more, mingled with the social history and lots of math, can be found in WHY MACHINES LEARN.

Anil Ananthaswamy

The Theoretical Minimum (for Machine Learning)…And Why

Linear Algebra, Calculus, and Probability & Statistics often get mentioned as the minimum math you need to start on your machine learning journey. But why these disciplines? Here’s an intuition for why:

Linear Algebra: Machines turn data that we produce (text, images, audio, etc.) into a format that’s conducive for algorithmic manipulation. Often, they turn data into vectors. A 2D vector is a point in 2D space, a 3D vector is a point in 3D space, and so on. By converting data into vectors, an ML algorithm can do many things. For instance, it can calculate the distance between two n-dimensional vectors in nD space and establish similarity or dissimilarity. Similar vectors (data) are closer in that nD space than dissimilar vectors (data). Or an ML algorithm can learn how these vectors are distributed in nD space, information that can then be used to generate new vectors—or new data!
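A tiny, illustrative example of that distance-as-similarity idea:

```python
import numpy as np

a = np.array([0.9, 0.1, 0.3])    # one piece of data, as a 3-D vector
b = np.array([0.8, 0.2, 0.25])   # a similar piece of data
c = np.array([-0.7, 0.9, -0.5])  # a dissimilar one

print(np.linalg.norm(a - b))     # small distance: a and b are similar
print(np.linalg.norm(a - c))     # large distance: a and c are not
```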

You can also think of a machine learning model as something that transforms an input vector X into an output vector Y. This transformation is often the result of multiplying a matrix by a vector, or the result of a sequence of such transformations. The matrix (or matrices) represent the parameters of the model—and have to be learned from data. Vectors, matrices and their manipulations are all essentially linear algebra.

Calculus: A model, or rather its parameters, has to be learned from data. But what does learning entail? Usually, learning involves transforming an input vector into an output vector, calculating the loss (or error) made by the model—according to some measure of the distance between the produced output vector and the expected output vector—and then using that loss to tweak the parameters of the model such that given the same input vector the model produces an output that is a little closer to the expected output, or the loss is reduced.

This is where calculus comes in. If the loss can be written down as a function of the parameters of the model, then reducing the loss is an optimization problem. The loss function gives you the loss landscape, and one has to descend from some region high up in the landscape (representing high loss) to some deep valley (representing very low or zero loss). One way to do so is by using some form of gradient descent. The gradient is a vector, where the elements of the vector are partial derivatives of the loss function with respect to the parameters. The gradient points in a direction of increasing loss. Calculate the gradient, and use it to tweak the parameters such that one moves a smidgen in the opposite direction along the surface of the loss function. Doing so iteratively, one can reduce the loss to some optimal minimum or even zero. At that point, one has learned a good-enough model. All this is calculus.
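Here’s a minimal sketch of gradient descent for the simplest possible model, a straight line with two parameters; the loss is mean squared error, and the partial derivatives are written out by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
y = 3.0 * x + 0.5 + 0.1 * rng.standard_normal(100)   # data generated by a "true" line

w, b = 0.0, 0.0        # the model's parameters, to be learned
lr = 0.1               # how big a smidgen to move at each step

for step in range(200):
    y_pred = w * x + b                        # model output
    loss = np.mean((y_pred - y) ** 2)         # the loss: mean squared error
    grad_w = np.mean(2 * (y_pred - y) * x)    # partial derivative of the loss w.r.t. w
    grad_b = np.mean(2 * (y_pred - y))        # partial derivative of the loss w.r.t. b
    w -= lr * grad_w                          # step a smidgen against the gradient
    b -= lr * grad_b

print(w, b, loss)      # w approaches 3.0, b approaches 0.5, loss approaches the noise floor
```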

Probability & Statistics: These are huge fields, but the basics of how probability and statistics apply to ML are not hard to appreciate. All machine learning involves learning about patterns in data and using those learned patterns to say something about new data, or to generate new data. All training data can be said to have been drawn from some underlying true distribution of all possible data. But we can never know this ground truth. So, the job of an ML algorithm is—in one way of thinking—to estimate as well as possible the underlying distribution using only the training data. The closer this estimate is to the ground truth, the better your ML model. This estimate can then be used to, say, discriminate between different classes of data (discriminative learning). Or one can sample from this distribution to generate a new instance of data that looks very much like real data—this is generative AI. The details are non-trivial, but conceptually that’s a good, simple start.

Anil Ananthaswamy

The Conceptual Simplicity of Machine Learning

Machine Learning, once you grok it, is conceptually simple. But it hardly seems so when you begin, especially if you were once a software engineer—as I was—who worked on non-ML code. My first encounters with ML were confusing at best. It took the writing of WHY MACHINES LEARN for me to see its conceptual simplicity (even if the details can be extremely gnarly). Here’s ML distilled into key concepts:

  1. First, a machine learning (ML) model—like any other piece of software—turns an input into some desired output. The main difference between non-ML software and machine learning is that traditional software transforms the input into the desired output using algorithms designed and implemented by a programmer, whereas machine learning examines patterns in training data and learns the algorithm to do the transformation. The key words here are: learning, patterns and training data.

  2. This transformation—no matter how complicated—can be thought of as an outcome of providing an input to a function f (), which produces an output. So, y = f (x). Note that x and y are both vectors (they can, of course, be simply scalars, but using vectors makes it more general). So, the function f () transforms some input vector into the desired output vector. But how do you learn the requisite function f ()?

  3. That’s where data comes in. If you have enough examples of the mapping from some instance of x to the desired instance of y, i.e., you have enough (x, y) pairs, you can feed this “training data” to an ML algorithm, which will then learn the best possible f () to transform inputs to outputs. But what does training mean?

  4. This is the beginning of machine learning. Humans have to create the training data, those (x, y) pairs. Then, given an x, your ML model (just think of it as some black box with parameters you can tune or tweak) will take x as input and produce some output y*, based on the current values of the model parameters. But you know the output should be y. You calculate the loss, based on the discrepancy between the expected value y and the predicted value y*. The loss depends on the model parameters. Tweak each parameter ever so slightly so that the model’s loss, given the same input, will be a little less than before. If you do this over and over again for every instance of training data, and reach an overall loss that’s zero or acceptably low, you’ll have a model that approximates the desired function f (), which will transform any x that wasn’t in the training data into the appropriate y (assuming that the new x hews to the statistics of the training data).

  5. What can such a function f () accomplish? Depends on the task at hand. Let’s say you simply want to find a way to classify 100 x 100 images. Your training data has two sets of images: of cats and of dogs. Humans have labeled them as such. If you turn each image into a vector (by laying out the pixel values end-to-end, so into a 10,000-element vector), then you have the requisite (x, y) pairs, where x is the vectorized image, and y is the label (say, 0 for cat and 1 for dog). Now, the task of the ML algorithm is to find the function f () that represents the boundary between the two types of data (assuming that such a boundary exists), such that if you give the function a vectorized image of a cat, it should produce a 0, and a 1 if the image is that of a dog. Such an ML algorithm can be used to tell apart types of data (there can be more than two types), and the method is called discriminative learning.

  6. This brings us to another key issue in ML algorithms. Is the boundary we just discussed linear (meaning, a straight line in 2D, or a plane in 3D, or a hyperplane in higher dimensions) or non-linear? Many ML algorithms are designed to find linear boundaries (such as the perceptron algorithm or even vanilla support vector machines). But sometimes data is not linearly separable, in which case you can use algorithms such as the k-nearest neighbor (k-NN) rule or the Naïve Bayes classifier to find a non-linear boundary. Or, if you are feeling more adventurous, you can use a kernel method to project the data into higher dimensions, use a linear classifier such as an SVM in the higher-dimensional space to find a linear boundary, and project back to lower dimensions, where the boundary becomes non-linear.

  7. Sometimes such a function f () is not used to discriminate, but to optimally fit a collection of data points, a process called regression. You still use the (x, y) pairs, but this time you are learning how to predict y, given x. Once you learn the f (), then given some new x, you can predict y. The regression can be linear or non-linear.

  8. What if you want to do more than tell apart data? Say, you want to generate new data that statistically resembles the training data. That’s where generative learning comes in. In this case, the function f () you learn is an estimate of the probability distribution over data. Imagine your data spread out on the x-y plane and a 3D surface above that data that captures the probability distribution over data. Such a surface will have peaks/bumps over regions where the data is more likely and valleys where it’s less likely. (Of course, in reality the data is going to be very high-dimensional and the surface that models the probability distribution is going to be in one higher dimension.) But once you have learned an estimate of the distribution over data, given enough data, what do you do with it?

  9. Well, you can use it to tell apart different kinds of data: so, you can leverage it for discriminative tasks. But the more interesting use is data generation. If you can sample from the distribution (not an easy task), then—in our toy example—it’s akin to finding a point on your 3D surface, projecting down to the x-y plane and figuring out the properties of the data instance that’s underneath (in actuality, all this would be happening in very high dimensions). But if you could do it, you’d have an instance of data that looks very much like the data on which the ML model was trained.

  10. Generative AI comes down to these two (sometimes very difficult) tasks: 1) learn or estimate the probability distribution over the training data, and 2) sample from that distribution to get at the underlying data. In the case of deep learning, the f () you learn could involve the entire process of estimating the distribution and sampling from it. It might beggar belief, but even the most complex AIs out there—large language models and diffusion models such as DALL-E—can be understood by thinking in these terms. Again, the details of the implementations and models can be extremely complicated, but that doesn’t detract from making sense of ML/AI using these broad-brush strokes.

  11. Let’s take DALL-E. It’s an image generation “diffusion model” that’s trained on oodles and oodles of images. The process involves transforming an image, tiny step by tiny step (by adding a smidgen of Gaussian noise at each step), into a noisy but simple image that resembles a sample from some Gaussian distribution—a process called diffusion. You train a deep neural network to reverse this process: go back from the simple image to the complex image, step by step. Once you have trained an AI to do this for every image in your training data, you are good to go. Now, if you want to generate a new image, you first sample from the simple Gaussian distribution to get a noisy image (which is easy to do), and then run the diffusion process in reverse. You get back a sample from the complex distribution over images: it would be an image that resembles the training data.

  12. What about Large Language Models (LLMs)? Again, the bird’s-eye view of an LLM is that it learns how to estimate the conditional probability distribution over its entire vocabulary of words (well, tokens, but let’s stick with words). Let’s say you take a trained LLM and give it 100 words. Assume it has a vocabulary of 1,000 words. Basically, what the LLM does is calculate the conditional probability distribution over its entire vocabulary for the next word, given the input of 100 words. Once it has the distribution, it can sample from it to pick a likely next word. It appends that to the 100 words, to create an input of 101 words, and repeats the process. It now has 102 words, and it keeps doing this until it, say, generates some end-of-text symbol. Very simply, during training the LLM learns to estimate the correct conditional probability distribution over its entire vocabulary given some input text, and sample from it! And that leads to the amazing behavior that we see from large language models. (A toy version of this sampling loop is sketched after this list.)

  13. A not-so-small aside: We talked of (x, y) pairs used for training data. In the early days of AI, these (x, y) pairs were created by humans by painstakingly labeling the training data. That’s not the case for modern generative AI. Rather, one can take some x, and use it to construct some y. For example, you can take a sentence, mask the last word of the sentence, and ask an LLM to learn to predict the masked word. So, x is the masked sentence and y is the masked word. If you took text from the internet, you’d have billions and billions of such (x, y) pairs. This can be automated without human intervention. Or, you can take an image and mask, say, 25% of the pixels and teach the neural network to predict the entire image. In this case, x is the masked image and y is the unmasked image. If humans create labeled (x, y) pairs and train an ML model, the technique is called supervised learning. If you automate the process of creating (x, y) pairs from data, it’s called self-supervised learning.
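Here is the toy sampling loop promised in point 12. The `next_word_distribution` function is entirely made up (a real LLM computes this distribution from billions of learned parameters), but the loop around it captures the essential recipe: get a distribution over the vocabulary, sample, append, repeat.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat", "<end>"]   # a toy 6-word vocabulary

def next_word_distribution(context):
    """Stand-in for a trained LLM: returns probabilities over the whole vocabulary.
    Here it conditions only on the last word, with made-up numbers."""
    toy_probs = {"the": [0.0, 0.5, 0.0, 0.0, 0.5, 0.0],
                 "cat": [0.0, 0.0, 0.9, 0.0, 0.0, 0.1],
                 "sat": [0.0, 0.0, 0.0, 0.9, 0.0, 0.1],
                 "on":  [0.9, 0.0, 0.0, 0.0, 0.0, 0.1],
                 "mat": [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]}
    return np.array(toy_probs[context[-1]])

rng = np.random.default_rng(0)
text = ["the"]
while text[-1] != "<end>" and len(text) < 20:
    p = next_word_distribution(text)         # conditional distribution over the vocabulary
    text.append(rng.choice(vocab, p=p))      # sample the next word and append it
print(" ".join(text))
```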

FOR MORE, please see the links to the book, WHY MACHINES LEARN: US Edition UK Edition

Anil Ananthaswamy

LLM Prompts for Learning About Hopfield Networks

Ill. Niklas Elmehed © Nobel Prize Outreach

John Hopfield and Geoffrey Hinton won the 2024 Nobel Prize for physics. There has been a lot of debate about whether their work counts as physics. I’m biased: I think it does. WHY MACHINES LEARN has an entire chapter describing the physics rationale for Hopfield networks. The chapter is titled: “With A Little Help from Physics”!

This blog post is about using a large language model (I used Claude 3.5 Sonnet) to generate code and help us learn about the inner workings of Hopfield networks. I’m only showing the prompts and the outputs of running the code here. Your mileage will vary depending on the LLM/Code Assistant you use, but this should give you a good idea of how to go about coding a Hopfield network. Here’s a brief intro (more details of how they work, i.e. the algorithm, can be found in Ch8 of WHY MACHINES LEARN):

Training a Hopfield network, which is akin to storing some information in the network, is a simple procedure. First, the size of the network is dictated by the number of bits of information you want to store. Let’s say that you want to store a 28x28 image. If you turn the image into a vector, it’ll be 784 bits long. So, you need 784 bidirectional neurons in your Hopfield network.

Of course, you don’t really need neurons: you just need the 784x784 matrix that gives you the weights of the connections between the neurons. In a Hopfield network, any given neuron doesn’t connect to itself. So, the diagonal of this weight matrix will be zero. Also, the weights are symmetric, meaning the weight for a connection going from the output of neuron i to the input of neuron j is the same as the weight of the connection going from the output of neuron j to the input of neuron i. This was one of Hopfield’s key insights! The neuron states are either +1 or -1, and when a single such pattern is stored, each off-diagonal weight also works out to either +1 or -1.

The essence of a Hopfield network is that once you store some information, such that the outputs of the neurons taken together make up the memory, then that state is designed to be an energy minimum. If you now corrupt the memory, by setting the outputs of neurons to other values, then that raises the energy of the network. Its dynamics are such that the network will find its way to an energy minimum; if you read off the neurons now, you’d have retrieved the memory.
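For readers who want to see the mechanics without any prompting, here is a minimal, independent sketch of storing and retrieving a single ±1 pattern. This is not the code Claude generated; it is a bare-bones version of the Hebbian rule and the update dynamics described above, with placeholder variable names.

```python
import numpy as np

def train_hopfield(pattern):
    """Store one ±1 pattern: Hebbian outer product, symmetric weights, zero diagonal."""
    W = np.outer(pattern, pattern).astype(float)
    np.fill_diagonal(W, 0.0)
    return W

def retrieve(W, state, n_sweeps=5, seed=0):
    """Update neurons one at a time until the network settles into an energy minimum."""
    rng = np.random.default_rng(seed)
    state = state.copy()
    for _ in range(n_sweeps):
        for i in rng.permutation(len(state)):
            state[i] = 1 if W[i] @ state >= 0 else -1
    return state

# pattern: a 784-element vector of -1s and +1s (e.g., a binarized, mapped MNIST digit)
# noisy = pattern * np.where(np.random.rand(784) < 0.2, -1, 1)   # flip ~20% of the bits
# recovered = retrieve(train_hopfield(pattern), noisy)           # should match `pattern`
```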

Here’s a series of prompts that begin with MNIST images, which are wonderful images to store and restore:

(Yann LeCun (Courant Institute, NYU) and Corinna Cortes (Google Labs, New York) hold the copyright of MNIST dataset, which is a derivative work from original NIST datasets. MNIST dataset is made available under the terms of the Creative Commons Attribution-Share Alike 3.0 license).

Prompt 1: Please write code to load the MNIST dataset and turn each image in that dataset into an image where each pixel is either 0 or 1, depending on whether the grayscale value in the original image is less than or equal to 120 or greater than 120, respectively

CLAUDE:

This code does the following:

  1. Imports necessary libraries: TensorFlow for loading the MNIST dataset, and NumPy for array operations.

  2. Loads the MNIST dataset using tf.keras.datasets.mnist.load_data().

  3. Defines a binarize_image function that takes an image and a threshold (default 120) and returns a binary version of the image.

  4. Applies the binarization to both the training and test sets using list comprehension.

  5. Prints the shapes of the original and binarized datasets to verify the operation.

  6. Includes an optional visualization step to display an original image and its binarized version side by side.

I noticed that the code was simply showing me the first image in the list. I wanted to select a digit based on its label. This needed additional prompting.

Prompt 2: These images have associated labels. Could you also provide a way to select an appropriate digit, such as 8 or 5, and plot only that?

If you look at the two rows of images shown above, it’s clear that the five images from MNIST and the five binarized images are different. It required further prompting to get the code to use the same set of five images.

Prompt 3: You are selecting different random indices for the original and binarized images. Make sure that the random indices are the same.

It’s clear that the code is working. The two sets of images shown above are of the same digits, one from MNIST and the other binarized. Now, I want to turn the binarized images, with 0 and 1 for pixels, into images that have -1 and +1. I guess I could have done it in one shot, but well…

Prompt 4: Okay, now, create function to map the binarized images into images where 0 is -1 and 1 is 1.

Note: It’s worth looking at Claude’s response, because it “gets” the reason for the prompt!

CLAUDE:

Certainly! I'll create a function to map the binarized images from {0, 1} to {-1, 1}. This kind of mapping is often useful in machine learning applications, particularly when working with certain types of neural networks or algorithms that benefit from centered data.

Let's update our code to include this new functionality:

[Stuff here from Claude about the details of the coding changes]

This updated version allows you to visualize the original image, its binarized form, and the version mapped to {-1, 1} all at once. This can be particularly useful for understanding how the data is transformed at each step and how these transformations might affect downstream machine learning tasks.

Prompt 5: Now write code for a Hopfield Network that can store a mapped image. We should then be able to corrupt the mapped image using Gaussian noise, plot the corrupted image, and then give the corrupted image to the trained Hopfield Network, and have it retrieve the uncorrupted image. Plot what is retrieved.

This code generated an error. Instead of debugging it myself, I showed Claude the error and asked it to fix it.

Prompt 6: The code produces this error: Traceback (most recent call last):   File "/Volumes/mydisc/Dropbox/BACKUP/Why Machines Learn/ChaptersNew/python/code/ai-assist/hopfield/test5.py", line 51, in <module>     (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data ^^ NameError: name 'tf' is not defined

The code worked! You can see the stored image (above left), the corrupted image in the center, and the retrieved image (above right). I now wanted the corrupted image to have more noise.

Prompt 7: Make the Gaussian noise stronger

You can see that the image in the center (above) is noisier. Yet, the Hopfield network retrieves the stored image. The next thing I did was use a series of prompts to first generate an interactive user interface and then ask Claude to turn that code into something that could run inside a Jupyter notebook. Here are the prompts:

Prompt 8: Provide a button on the user interface to do the same process for another digit, and replace the plots for the new digit

Prompt 9: Can you only use matplotlib; don't use Flask or React and all that jazz.

Prompt 10: Can you turn the entire code into something that can run inside a Jupyter Notebook?

Claude generated the code.

The output of running the code inside the Jupyter Notebook was almost the same as what’s shown on the left. There were some changes. The Process New Digit button was above the images. Also, the code in the Jupyter Notebook generated two rows of such images, only one of which was updated when you press the Process New Digit button. I didn’t debug it further.

Sometimes you will see results like the one shown above. I changed the amount of Gaussian noise that was added to the image to corrupt it (tweaked the mean and standard deviation from (0, 1.5) to (2, 3.5)). The Hopfield network recovered a bit-flipped image: black became white and vice-versa. Can you figure out why? Think about energy minima. For more detail, please have a look at Chapter 8 of WHY MACHINES LEARN.

Anil Ananthaswamy

The Monty Hall Problem: Could an LLM have convinced Paul Erdős?

Could a large language model have convinced one of the most prolific mathematicians—Paul Erdős—that he was wrong about his intuition about the Monty Hall Problem, as were so many others?

In WHY MACHINES LEARN, I begin the chapter on probability and statistics with a brief description of the Monty Hall Problem. But here’s how Claude 3.5 Sonnet answered when asked to describe it:

Prompt: Do you know the Monty Hall Dilemma?
Claude: Yes, I'm familiar with the Monty Hall Dilemma. It's a famous probability puzzle named after Monty Hall, the host of the television game show "Let's Make a Deal."

The basic scenario is:

  1. There are three doors. Behind one door is a prize (usually a car), and behind the other two doors are goats (or some other non-prize).

  2. The contestant chooses a door, but it isn't opened yet.

  3. The host, who knows what's behind all the doors, opens one of the other two doors, always revealing a goat.

  4. The host then offers the contestant the chance to switch their choice to the other unopened door.

The key question is: Should the contestant switch doors?

The counterintuitive correct answer is that the contestant should always switch doors, as it increases their chances of winning from 1/3 to 2/3.

You can read the chapter in WHY MACHINES LEARN for a thorough analysis of multiple ways of thinking about this problem, and why most of us get it wrong: our reasoning often gets messed up in the presence of uncertainty. Erdős thought there was no point in switching, because the probability of winning would be 1/2 regardless. He was wrong.

In 1995, the Hungarian mathematician Andrew Vázsonyi used a computer program that ran 100,000 simulations of the game (where each trial involved making a choice about whether to switch or not), to show that you win 2/3rds of the time if you switch.

For WHY MACHINES LEARN, I too wrote some code to do the simulation and create an illustration (with due credit, for inspiration, to data scientist Paul van der Laken’s blog post on the same).

Could one use an LLM to do the same? Of course! This blog post is about the way you’d go about prompting an LLM.

But first, here’s the final output of my efforts at prompting Claude to generate an animation, depicting how the probability of winning converges to 2/3 if you switch, and to 1/3 if you don’t.

Animation of how probabilities converge to 1/3 and 2/3 over 10,000 trials of the Monty Hall Game

Here’s how you can use an LLM (I used Claude 3.5 Sonnet) to get it to simulate the game. You can tell that the LLM has most likely seen Paul van der Laken’s code during training, because the plot it generates is very much like his! I’m only showing the prompts and the outputs here, not the code.

Prompt: Do you know the Monty Hall Dilemma?

(Claude responded with the same description of the puzzle quoted at the top of this post.)

Prompt: Can you write python code that simulates the game? It's a Monte Carlo simulation. Calculate the probability that the contestant wins when they choose to switch and when they choose not to switch, by running 10,000 trials. Then plot the odds of winning for either case (Y-axis) against the number of trials (X-axis)

Output: The graph produced by running the generated code is shown here.

Prompt: Please describe the code you generated using pseudo-code

Claude: Certainly! I'll describe the code using pseudo-code to give you a high-level overview of the algorithm and structure.

Pseudo-code for Monty Hall Simulation

This pseudo-code outlines the main structure and logic of the Monty Hall simulation:

  1. We define two main functions:

    • simulate_monty_hall_game: This simulates a single game of the Monty Hall problem.

    • run_simulation: This runs multiple trials of the game and calculates probabilities.

  2. In the simulate_monty_hall_game function:

    • We randomly select the prize door and the contestant's initial choice.

    • We determine which door the host opens (it must not be the prize door or the contestant's choice).

    • Based on whether the contestant switches, we determine their final choice.

    • We return whether they won or not.

  3. In the run_simulation function:

    • We run the specified number of trials, keeping track of wins for both switching and staying strategies.

    • For each trial, we calculate and store the current probability of winning for each strategy.

  4. In the main part of the script:

    • We run the simulation for 10,000 trials.

    • We plot the results, showing how the probabilities change as more trials are conducted.

    • We print the final probabilities for both strategies.

This pseudo-code captures the essence of the simulation without going into the implementation details.
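Claude’s actual code isn’t reproduced in this post, but a bare-bones simulation along the lines of that pseudo-code might look something like this (a sketch, not the generated code):

```python
import random

def simulate_monty_hall_game(switch):
    doors = [0, 1, 2]
    prize = random.choice(doors)
    choice = random.choice(doors)
    # The host opens a door that is neither the prize nor the contestant's choice
    opened = random.choice([d for d in doors if d != prize and d != choice])
    if switch:
        choice = next(d for d in doors if d != choice and d != opened)
    return choice == prize

trials = 10_000
wins_switch = sum(simulate_monty_hall_game(True) for _ in range(trials))
wins_stay = sum(simulate_monty_hall_game(False) for _ in range(trials))
print("P(win | switch) ~", wins_switch / trials)   # converges to ~2/3
print("P(win | stay)   ~", wins_stay / trials)     # converges to ~1/3
```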

At this stage, I wanted Claude to generate a video file or a GIF, to show the probabilities changing with the number of trials. Again, it’s worth looking at the prompts to see how you might get to clean, working code that does what you want. I’m only showing the series of prompts here, and not their outputs.

Prompt: Great. Can you change the plot, so that it appears as an animation. Show the probabilities changing over a span of about 5 seconds, for a total of 10,000 trials

Prompt: The animation takes way too long to complete. Finish it in 5 seconds

Prompt: At the end, please save the animation as video file

Prompt: Running the code generates the following error; please fix it:

Saving animation as video file... MovieWriter ffmpeg unavailable; using Pillow instead. Traceback (most recent call last):   File "/Users/anilananth/anaconda3/lib/python3.11/site-packages/PIL/Image.py", line 2390, in save     format = EXTENSION[ext]              ~~~~~~~~~^^^^^ KeyError: '.mp4'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):   File "/Volumes/mydisc/Dropbox/BACKUP/Why Machines Learn/ChaptersNew/python/code/ai-assist/prob/test3.py", line 67, in <module>     ani.save('monty_hall_animation.mp4', writer='ffmpeg', fps=20)   File "/Users/anilananth/anaconda3/lib/python3.11/site-packages/matplotlib/animation.py", line 1089, in save     with writer.saving(self._fig, filename, dpi), \   File "/Users/anilananth/anaconda3/lib/python3.11/contextlib.py", line 144, in exit     next(self.gen)   File "/Users/anilananth/anaconda3/lib/python3.11/site-packages/matplotlib/animation.py", line 245, in saving     self.finish()   File "/Users/anilananth/anaconda3/lib/python3.11/site-packages/matplotlib/animation.py", line 515, in finish     self._frames[0].save(   File "/Users/anilananth/anaconda3/lib/python3.11/site-packages/PIL/Image.py", line 2393, in save     raise ValueError(msg) from e ValueError: unknown file extension: .mp4

Claude: I apologize for the error. It seems that the ffmpeg writer is not available on your system, and Matplotlib is falling back to using Pillow, which doesn't support MP4 format. Let's modify the code to save the animation as a GIF instead, which should work with the Pillow writer. Here's the updated version:

Prompt: Please annotate the plot so that it shows the final probabilities after 10,000 trials

Prompt: Move the final probability labels to inside the bounding box of the plot

Prompt: Make the labels for the two probabilities BOLD font

The final GIF that appears above in this post is the final output of running the code generated by Claude after all the prompts.

Anil Ananthaswamy

From Rosenblatt to Claude

Getting a modern large language model to generate code to implement and visualize Frank Rosenblatt’s perceptron algorithm is in some way paying tribute to Rosenblatt’s visionary ideas. In 1958, he designed the first learnable artificial neurons and the first single-layer artificial neural networks. LLMs are descendants of those early networks. It’s particularly sweet to get an LLM to help us code/visualize Rosenblatt’s perceptron.

When I started writing WHY MACHINES LEARN, one of the first algorithms I coded for the book was the perceptron algorithm. I designed a simple, interactive user interface that would allow me to select my data points on the 2D X-Y plane, so that I could visualize the algorithm as it tried to find a line separating two clusters of data.

The figures in the book were generated using the same UI (built with the Python plotting library matplotlib). I did all this sometime in late 2020, well before ChatGPT came on the scene, and certainly well before any LLM-based coding assistants such as Copilot.

But it’s a different world now. As Harvard professor Boaz Barak said in a recent tweet: “Just realized that the next time I teach my ML foundations course, the primary programming language we use will likely be English. (Students will still need to know math, and be able to read model-generated python.)”

I have been thinking along the same lines: creating a Codebook for WHY MACHINES LEARN using code assistants, so that interested readers could read about the algorithms and basic mathematical ideas in WHY MACHINES LEARN and then prompt an LLM to generate the code and learn how the algorithms work in code, if they are so interested (I used Anthropic’s Claude 3.5 Sonnet, the paid version; but I’m sure there are many open-source models out there that would do the job just as well).

This post is about the process of generating Python code, so that you can engage with the perceptron algorithm and see it working. Details of Rosenblatt’s work, the history and the math, etc., can be found in the first two chapters of WHY MACHINES LEARN.

Some lessons I learned regarding code generation: It really helps if you know exactly what you want, so that your prompts can be precise. You also need to be reasonably familiar with coding, to be able to understand the coding mistakes made by the LLM, so that you can ask it to correct the errors.

The first thing I did was take one of the images of the perceptron algorithm from the book, which shows a linearly separating hyperplane (in this case a line, as the data is 2D), drop it into Claude’s context window, and give it my first prompt (I find myself being weirdly polite while interacting with an LLM, hence the over-the-top usage of “please”!).

Prompt: Please look at the image provided. Can you write code that does the following:

Provide a matplotlib interactive user interface that allows the user to click on a 2D graph. The first 5 clicks should be used for circles, the second 5 clicks should be used for triangles.

Output of Claude’s code

Claude generated code that worked without any errors. I was able to interact with the UI and select 10 data points, 5 for circles and 5 for triangles. But you can see that the plot doesn’t look exactly like what I asked for. So, I prompted it a little more, to create code that could generate a plot with solid lines for the axes, no bounding box, etc.

Output of Claude’s code

This is the output after a couple of iterations of simple prompting. Okay, close enough. Ideally, I should have asked Claude to make the circles and triangles have gray “fill”, but I can now work with this. So, I gave Claude a new prompt.

Prompt: Great. Now, once the user has finished clicking 10 times and generating the circles and triangles, when the user clicks next, use that input to kick off a perceptron algorithm, to find a straight line that separates the circles from the triangles. Once the perceptron finds the line, please draw the line

Output of Claude’s code

Okay. This was a big change. The code that Claude generated had significantly more functionality than the previous version, which simply allowed me to select the data points. This time, it actually implemented a perceptron algorithm and plotted the linearly separating hyperplane.
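Claude’s generated code isn’t shown here, but the heart of what such code implements, Rosenblatt’s update rule, fits in a few lines. Here is a minimal sketch with made-up variable names: X holds the clicked 2D points and y the labels, using the -1 (circles) and +1 (triangles) convention asked for in a later prompt.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """X: (n_points, 2) coordinates of the clicks; y: +1 (triangles) or -1 (circles)."""
    w = np.zeros(2)
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:     # misclassified (or sitting on the boundary)
                w += yi * xi               # nudge the line toward classifying this point correctly
                b += yi
                mistakes += 1
        if mistakes == 0:                  # converged: the line separates the two clusters
            break
    return w, b                            # separating line: w[0]*x + w[1]*y + b = 0
```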

Next, I wanted to visualize the process in the form of an animation, where the output involved plotting some of the incorrect hyperplanes and ending with the correct one. Getting this to work took some prompting. Below is the series of prompts that got it to work (I show only the important prompts; there were simpler ones, to do with the look and feel of the UI, that I haven’t included).

Prompt: Can you modify the code such that you draw every 3rd line the perceptron finds. Show the wrong lines as gray dotted lines, and the final correct line as a solid, black line. But plot it slowly, so that there is a 1-second delay between the plotting of each line.

Prompt: Something is not right. The code is creating a separate plot for each line. Please don't redraw the plot each time, but use the same canvas. It should seem like an animation.

Prompt: Also, for drawing the line, please use the same fig and ax you use for drawing the circles and triangles. This means your perceptron class will need extra arguments: to take in the fig and ax. Once you have the fig and ax inside the perceptron class, then use the artist to draw the line.

Prompt: Also, the code doesn't have a check to see if the perceptron has found a solution. Modify code to check if the perceptron has found a solution and then terminate the loop.

Prompt: Instead of drawing the perceptron's lines for every 3 iterations, do it for every iteration. Also, make the circles and triangles a little bigger.

Prompt: You removed the 1-second pause between drawing the perceptron's lines. Reintroduce the pause, but keep it to 0.5 seconds.

Prompt: So, everything is great, except for one detail. You have used values of 0 for circles and 1 for triangles, for the classification. The perceptron algorithm requires it to be -1 for circles and 1 for triangles. Can you redo the code with this change?

Prompt: After the perceptron has converged and you have drawn the black solid line, can you turn the entire sequence of lines drawn to convergence into a GIF file?

This is the final GIF generated by the Claude-generated code. The code allows you to select your data points, and it then uses the perceptron algorithm to find a line that separates the circles from the triangles.

For readers of WHY MACHINES LEARN: I’ll be writing a series of blog posts, detailing my attempts to generate code using Claude or some open-source code assistant (preferably). I think it’s a great way to learn both the conceptual and mathematical basics of machine learning—which is the subject of WHY MACHINES LEARN—and also learn how to use code assistants, inspect the generated code, and understand HOW the machines work, by seeing/coding the algorithms at work.
