Tutorial: How to learn anything in the world using generative AI

We are in a golden age of learning enabled by generative AI. Yet I am constantly stunned by how few people are using generative AI to learn new things. In the video below, I explain how to use GPT-4 as a teacher. By teaching myself what a tensor is, what role math plays in modern machine learning, and how to create a class in Python, I hope to show you how you can use generative AI to learn anything in the world. I explain my prompts, what I’m thinking about, how to control hallucinations, and other small tips for using generative AI to your advantage. I strongly believe everyone who is intellectually curious would benefit greatly from understanding how to use generative AI, and I hope this video helps some of you do that.

Contents

The following timestamps are highlights of the video tutorial; there are other sections in the full video.

  • 00:00 – Why you should use GPT and introducing today’s project
  • 03:20 – Introduction to GPT and differences between models
  • 05:02 – The importance of providing context
  • 05:56 – Prompt 1: What is a tensor?
  • 10:40 – Prompt 2: How does math relate to ML?
  • 23:15 – Prompt construction & why GPT-4 is better
  • 27:40 – How to control hallucinations

Chat Log

ZW I’m trying to learn about large language models (LLMs). I’m watching a video from a renowned computer scientist that explains how to build an LLM in Python. They are actually building GPT-2. I am very much a beginner with Python, machine learning, and the math I understand is necessary for machine learning, so I’d like to ask questions about that subject. First, the computer scientist, Andrej Karpathy, is talking about “tensors”. My understanding of tensors is in the context of GPUs. For example, I would watch GPU review videos that say “oh, this card has 2156 Tensor cores” or “tensor units”. I forget what this means. In the video, though, he defines a tensor in Python and it basically looks like a two-dimensional matrix/array. Can you explain what a tensor is?

GPT Yes, of course. In the context of machine learning and especially deep learning, a tensor can be thought of as a generalization of vectors and matrices to potentially higher dimensions. In this context, a tensor is simply a type of data structure used in programming languages like Python, especially with libraries like TensorFlow or PyTorch, which are often used for machine learning.

To understand tensors, let’s start from the basics:

  • A scalar is a single number, such as 7 or 9.5.
  • A vector is an ordered array of numbers, such as [7, 2, 3].
  • A matrix is a 2D grid of numbers, arranged into rows and columns.
  • Now, a tensor, in a generalized sense, can be a 3D, 4D, or higher-dimensional grid of numbers. A tensor can have multiple dimensions, or “axes”.

So, the idea of a tensor as a two-dimensional array is partially correct, but that’s more of a matrix. A tensor is a more general concept that includes not just 2D matrices but also 1D vectors and 0D scalars, as well as structures with three or more dimensions. Each dimension in a tensor is often called an axis, and the number of axes is the tensor’s rank.

For example, you might have a tensor that represents all the pixel values in a batch of color images. This could be a 4D tensor, with axes corresponding to the batch size (number of images), image height in pixels, image width in pixels, and color channels (red, green, blue).
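For instance, here is a minimal sketch in PyTorch (the sizes are arbitrary, chosen only to illustrate the ranks; note that PyTorch itself more often puts the color channel before height and width):

Python
import torch

scalar = torch.tensor(7.0)              # 0D tensor: a single number
vector = torch.tensor([7.0, 2.0, 3.0])  # 1D tensor: a vector
matrix = torch.ones(2, 3)               # 2D tensor: a matrix with 2 rows and 3 columns
images = torch.zeros(32, 64, 64, 3)     # 4D tensor: batch, height, width, color channels

for t in (scalar, vector, matrix, images):
    print(t.ndim, tuple(t.shape))       # rank (number of axes) and the size along each axis
# 0 ()
# 1 (3,)
# 2 (2, 3)
# 4 (32, 64, 64, 3)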

Now, regarding your confusion about GPU Tensor cores: In that context, “tensor” refers to a kind of specialized hardware developed by NVIDIA that is designed to accelerate certain types of computations that are common in deep learning. These are usually operations on large tensors, such as multiplying two matrices together, hence the name “Tensor cores”. They’re designed to perform these operations much more efficiently than traditional GPU cores, making them valuable for machine learning applications where such operations are common. The naming is a bit confusing, but remember that in this context, “tensor” doesn’t refer to a type of data structure in a programming language but rather to a kind of hardware in a GPU.

ZW So essentially, a tensor is just a matrix of greater than two dimensions. Is that right? But right now, probably to keep things simple at the beginning, Karpathy is just using a two-dimensional tensor. I understand that ML relies on a lot of complex math that often involves high-dimensional matrix/tensor multiplication. Assume I have a high school level of understanding of math/engineering/CS. Can you explain at a very basic level how math and matrix multiplication play into ML, if at all?

GPT Yes, in simple terms, a tensor can be thought of as a multi-dimensional matrix. But remember, tensors can also be 0D (scalars), 1D (vectors), and 2D (matrices), in addition to higher dimensions.

Matrix multiplication plays a very critical role in machine learning. At a high level, machine learning models are essentially complex mathematical functions that “learn” to map inputs to outputs based on example data. Many of these models, especially in deep learning, are built using layers of matrix operations, often including multiplication.

To understand how this works, let’s consider a simple type of model called a linear regression. This is a model that learns to predict a numeric output based on one or more numeric inputs, by assuming a linear relationship between them.

For example, you might have a model that predicts a person’s weight based on their height. If we have a dataset of examples with both heights and weights, we can “train” the model by adjusting its parameters to minimize the difference between the predicted weights and the actual weights in the data.

In this case, the linear regression model can be written as a simple equation: y = wx + b, where y is the output (weight), x is the input (height), w is a parameter representing the weight’s dependency on height, and b is a bias term that represents the weight when the height is zero.

Even in this simple model, we can see a kind of multiplication at work: we multiply the input x by the weight w to get a component of the output.
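For example, with made-up numbers chosen purely for illustration:

Python
w = 0.9                  # learned slope: extra kilograms per extra centimeter of height
b = -90.0                # learned bias
height = 180.0           # the input x, in centimeters
weight = w * height + b  # the model's prediction: 0.9 * 180 - 90 = 72.0
print(weight)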

Now consider a more complex model, like a neural network, which can have hundreds, thousands, or even millions of inputs and parameters. The inputs to the model can be represented as a vector, and the parameters for each layer of the model can be represented as a matrix. To get the output of the layer, we perform a matrix multiplication operation between the input vector and the parameter matrix. This is then passed through a non-linear “activation function”, and the result forms the input to the next layer.
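Here is a minimal sketch of one such layer in PyTorch (the sizes are arbitrary and chosen only for illustration):

Python
import torch

x = torch.randn(4)         # input vector with 4 features
W = torch.randn(3, 4)      # parameter matrix for a layer with 3 outputs
b = torch.randn(3)         # bias vector, one entry per output

h = torch.relu(W @ x + b)  # matrix multiplication, then a non-linear activation
print(h.shape)             # torch.Size([3]) -- this becomes the input to the next layer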

This is just the tip of the iceberg when it comes to the mathematics of machine learning. There’s a lot more to learn, including how models are trained using algorithms like gradient descent, and how to handle more complex data types like images and text. But hopefully, this gives you a basic understanding of how matrix multiplication fits into the bigger picture.

ZW That’s really interesting, thank you. The example of how a linear regression can be “scaled up” into many dimensions is very intuitive. Question – can you explain the difference between inputs to a model and parameters of a model?

GPT The difference between inputs and parameters in a machine learning model is in their role and how they’re used:

  1. Inputs: These are the data points that you’re trying to make predictions about. If you’re training a model to recognize handwritten digits, for example, the inputs would be the pixel values of the images of these digits. If you’re building a model to predict housing prices, the inputs might be features of the house like its size, location, number of bedrooms, etc. Inputs are also sometimes called “features” in a machine learning context.
  2. Parameters: These are the parts of the model that get adjusted during the learning process. When you “train” a machine learning model, what you’re really doing is adjusting these parameters to reduce the difference between the model’s predictions and the actual values (this difference is often called the “loss” or “error”). For example, in a simple linear regression model of the form y = wx + b, w and b are the parameters that the model learns. In a neural network, the parameters include the weights and biases for each neuron in each layer of the network.

In other words, inputs are what the model uses to make its predictions, and parameters are what the model adjusts to improve its predictions.

There’s a third term that’s often used in this context as well: hyperparameters. These are settings or configurations that are set before the model starts learning, and they’re not automatically learned from the data. For example, the learning rate (which determines how much the parameters change each time they’re updated), the number of layers in a neural network, and the number of neurons in each layer can all be considered hyperparameters. Unlike parameters, hyperparameters aren’t learned from the data and must be set manually, often through trial and error.
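A short PyTorch sketch may help keep the three apart (the model and numbers here are made up just to label which piece is which):

Python
import torch
import torch.nn as nn

inputs = torch.randn(8, 3)  # inputs/features: a batch of 8 examples with 3 features each

model = nn.Linear(3, 1)     # parameters: 3 weights and 1 bias, adjusted during training
print(sum(p.numel() for p in model.parameters()))  # 4 parameters in total

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # hyperparameter: the learning rate, chosen by hand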

ZW Okay, thanks. That’s really interesting. I have some code that Karpathy has just input into his Python file. Can you explain what is going on here at a high level?

Python
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        return logits

m = BigramLanguageModel(vocab_size)
out = m(xb, yb)
print(out.shape)

GPT Sure, let’s go through this piece by piece:

  • The first few lines are importing necessary modules from PyTorch, a popular library for deep learning. nn is PyTorch’s neural network module, which contains classes for building neural network layers. The functional module is also imported as F, which contains functions for operations that you might perform in a neural network, such as activation functions and loss functions.
  • torch.manual_seed(1337) is setting the seed for PyTorch’s random number generator. This makes the results of the program reproducible, as the initial random weights of the network will be the same each time the program is run.
  • The BigramLanguageModel class is defining a type of neural network. This is a subclass of nn.Module, which is the base class for all neural network modules in PyTorch.
  • In the __init__ method (which is called when an instance of the class is created), a single layer is created: an embedding layer, which is often used in natural language processing to convert words (represented as integer indices) into dense vectors of fixed size. Here, the embedding layer has vocab_size input dimensions and vocab_size output dimensions.
  • The forward method defines how the network processes inputs. When you call an instance of nn.Module (like BigramLanguageModel), PyTorch automatically calls the forward method with whatever arguments you passed. Here, the forward method takes an input tensor idx (and optionally, target tensor targets), applies the embedding layer to idx, and returns the resulting tensor.
  • The notation (B,T,C) is a common way of denoting the dimensions of a tensor, where B is the batch size (the number of examples processed at once), T is the sequence length (the number of words in each example), and C is the number of classes (in this case, the vocabulary size).
  • The last lines of code create an instance of the BigramLanguageModel with vocab_size as the size of the vocabulary, apply this model to xb and yb, and print the shape of the output.

In summary, this script is defining and applying a simple type of neural network called a bigram language model, which predicts each word in a sequence based on the previous word. The network is very simple and consists of a single embedding layer, which maps each word to a vector of logits (unnormalized log probabilities) for the next word.
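If you want to run this snippet on its own, here is a self-contained sketch; vocab_size, xb, and yb come from earlier code in the video, so the values below are stand-ins made up for illustration:

Python
import torch
import torch.nn as nn
torch.manual_seed(1337)

vocab_size = 65                            # stand-in vocabulary size
B, T = 4, 8                                # batch size and sequence length, chosen arbitrarily
xb = torch.randint(0, vocab_size, (B, T))  # stand-in batch of input token indices
yb = torch.randint(0, vocab_size, (B, T))  # stand-in batch of target token indices

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, C) where C equals vocab_size
        return logits

m = BigramLanguageModel(vocab_size)
out = m(xb, yb)
print(out.shape)  # torch.Size([4, 8, 65])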

ZW Let’s slow down. I understand the first two lines, but I’m stuck on the third line. When the code says from torch.nn import functional as F, to me that’s confusing because if ‘functional’ is part of torch.nn, it would seem like we have already imported it. What is going on here? Is ‘functional’ a sub-library (or whatever it’s called) of ‘torch.nn’?

GPT Yes, you’re correct that functional is a module within the torch.nn package. The line from torch.nn import functional as F imports this specific module and gives it the alias F, which is a convention often used in PyTorch code. This is done so that you can conveniently call functions from this module with a shorter name, like F.relu(x) instead of torch.nn.functional.relu(x).

To clarify, the import torch.nn as nn line imports the nn module from the torch package, but it doesn’t automatically import sub-modules like functional. If you want to use functions from torch.nn.functional, you need to import it explicitly.

So why do we use functional at all? Well, the torch.nn package also contains classes for many of the same operations, like nn.ReLU(). The difference is that the classes in nn maintain state (like parameters that get updated during training), while the functions in functional are stateless. For operations that don’t have any parameters, like activation functions or loss functions, it can be more convenient to use the functional module. That’s why you’ll often see PyTorch code that imports both nn and functional.
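Here is a small sketch of the two styles side by side; both lines compute the same thing:

Python
import torch
import torch.nn as nn
from torch.nn import functional as F

x = torch.randn(5)

relu_layer = nn.ReLU()    # a module object that can be created once and stored inside a model
a = relu_layer(x)         # class-based style (ReLU happens to have no parameters to store)
b = F.relu(x)             # stateless functional style: just call the function directly

print(torch.equal(a, b))  # True -- same result either way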

ZW Gotcha, that’s really helpful. Let’s move on to the next section. I see that the code defines a class, but I don’t actually know what a class is. Can you explain what a class is in Python? I took a Java course in high school but barely remember anything.

GPT Absolutely! A class in Python is a code template for creating objects. Objects have member variables and behavior associated with them. In Python, a class is created with the class keyword.

An object is created using the constructor of the class. This object is then called an instance of the class.

Here’s a simple example of a class:

Python
class Dog:
    def __init__(self, name):
        self.name = name  # this is an instance variable

    def bark(self):  # this is an instance method
        return f"{self.name} says woof!"

In this case, Dog is a class that has an __init__ method and a bark method. The __init__ method is a special method that gets called when you create a new instance of the class. You can think of it like a constructor. It sets up the object and gives it its initial state. In this case, every Dog has a name, which is provided when you create the Dog.

Here’s how you might use this class:

Python
my_dog = Dog("Fido")  # create a new Dog object
print(my_dog.name)  # prints "Fido"
print(my_dog.bark())  # prints "Fido says woof!"

In the case of the BigramLanguageModel class in your code, it’s a little more complex because it’s a subclass of nn.Module, which is a base class provided by PyTorch for creating neural network models. But the basic principles are the same: it has an __init__ method that sets up the object’s state, and a forward method that defines how the model processes inputs.

It’s worth noting that Python is an object-oriented programming language, just like Java. So if you’ve learned about classes and objects in Java, many of the same concepts apply in Python, although there are some differences in the syntax and some of the details.

ZW You said, “An object is created using the constructor of the class.” What is a constructor? Is it the line that defines the new class, i.e., “class Dog:”? You also use a definite article “the” for both ‘constructor’ and ‘class’. Is there always a single constructor that is linked to a particular class?

GPT Great questions! In object-oriented programming, a constructor is a special method used to initialize a newly created object. In Python, the constructor method is called __init__.

So, when you define a class in Python, you often include an __init__ method. This method gets called automatically when you create a new instance of the class, and it’s where you can set up any state that the object needs.

For example, in the Dog class I described before, the constructor is:

Python
def __init__(self, name):
    self.name = name

This constructor takes one argument (in addition to self, which refers to the object being created), which is the name of the Dog.

You can create a new Dog object (i.e., an instance of the Dog class) by calling Dog("Fido"). This creates a new object and automatically calls the __init__ method with “Fido” as the name.

The result is a Dog object where the name attribute is set to “Fido”. Any methods of the Dog class can access this attribute using self.name.

There is typically one constructor method (__init__) per class in Python. However, you can define different ways to create instances of a class by using class methods or static methods, but this is a more advanced topic. For now, it’s sufficient to understand that __init__ is the primary constructor for a class, and it gets called when you create a new instance of that class.
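To give a flavor of that more advanced idea, here is a hypothetical sketch of an alternative way to build a Dog using a class method; the from_string name is made up for this example:

Python
class Dog:
    def __init__(self, name):           # the one "real" constructor
        self.name = name

    @classmethod
    def from_string(cls, description):  # an alternative entry point that still goes through __init__
        name = description.split("=")[1]
        return cls(name)

fido = Dog.from_string("name=Fido")
print(fido.name)  # prints "Fido"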

ZW What does the `super().__init__()` line do in the original Python code I gave you?

GPT The line super().__init__() is calling the constructor (the __init__ method) of the parent class. In this case, the parent class is nn.Module because BigramLanguageModel is subclassing nn.Module.

In Python, when you create a subclass, the subclass inherits all the methods and attributes of its parent class. But when you override the constructor of the parent class (i.e., when you define your own __init__ method), the constructor of the parent class is not automatically called when you create a new instance of the subclass. If you still want to run the constructor of the parent class, you need to call it explicitly, which is what super().__init__() does.

In the context of PyTorch and nn.Module, it’s important to call super().__init__() in the __init__ method of any class that subclasses nn.Module. The nn.Module constructor does some important setup, like keeping track of what parameters and submodules the module contains. By calling super().__init__(), you ensure that this setup gets done, which is necessary for many of PyTorch’s features to work correctly.
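As a minimal sketch of what this looks like in practice (the TinyModel name and layer sizes are made up for illustration):

Python
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()            # runs nn.Module's setup so parameters and submodules get tracked
        self.layer = nn.Linear(4, 2)  # registered as a submodule thanks to that setup

m = TinyModel()
print(sum(p.numel() for p in m.parameters()))  # 10 -- the layer's 8 weights plus 2 biases

Without the super().__init__() call, assigning self.layer would fail, because nn.Module’s internal bookkeeping would not have been set up yet.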

ZW Okay, interesting. I imagine then that it’s a common convention to call `super().__init__()`, almost by default, when you create a subclass, in order to re-inherit the constructor of the parent class?

GPT Yes, you’re correct. It’s very common to see super().__init__() in the __init__ method of subclasses in Python, especially when using frameworks like PyTorch that provide base classes with important setup code in their constructors. This is to ensure that the initialization processes of the parent classes are not skipped when a new object is instantiated from the subclass.

Remember that calling super().__init__() is especially crucial when your class is inheriting from a parent class that initializes important attributes in its constructor. If you don’t call super().__init__(), and the parent class has some attributes set in its constructor, those attributes won’t be set for instances of your class, which could lead to bugs or errors.

So, when you’re writing a subclass, it’s a good habit to start by calling super().__init__() in your own __init__ method, unless you have a specific reason not to.

END