I was poking around the PyTorch internals a few weeks ago, and thought it’d be fun to write my own (very simplified) deep learning framework. This is the first in what I hope will be a series of posts documenting my progress on the project, and my thoughts along the way. Coral can be found on GitHub here.
The PyTorch codebase is absolutely enormous, with a ton of features which, for the purposes of this project, we don’t care about (read: extensive tensor library, hardware/sparsity-specific implementations, TorchScript, etc). In order to keep the project from going off the rails, we need a few reasonable goal posts which together would be enough to train a simple model. Here’s a rough roadmap:
- A small tensor library with support for automatic differentiation
- A torch.nn.Module class
- Data handling
- Training a simple model

An aside on the name: I spent much more time on this than I’m proud to admit. At first, I was thinking that “CyTorch” might be a good fit. The more I thought about it, though, the less I liked it. The “Cy” was too suggestive of “Cyborg”, and sounded too much like Cython. Also, apparently it wasn’t original. I eventually went with Coral, which I’m pretty happy with. Bonus points that it starts with a C.
Central to everything are tensors: multi-dimensional arrays which can be added/subtracted/multiplied/convolved (cross-correlated??)/broadcasted/transposed, etc. Our tensor object should store the following:
1. The data itself: a pointer to a contiguous block of sizeof(float) * (tensor size) bytes. Here, sizeof(float)=4 is the number of bytes in a float (usually), and tensor size is pseudocode for the number of entries in the tensor.
2. The shape: the tensor’s dimensions, together with the strides describing how those dimensions map onto the underlying one-dimensional memory.
3. A record of how the tensor has been combined with other tensors.

Whereas the first two requirements describe the tensor’s data itself, the third relates only to how the tensor has been combined with other tensors, irrespective of any details of the tensor itself. I’ve opted to split what I’ve thus far been referring to as a tensor into two structures, which I’ve decided to call tensor_t and variable_t respectively. Finally, a third structure, shape_t, is used to store shape objects. tensor_t objects (structs, in C) comprise requirements (1) and (2), and represent the value that a tensor-valued variable might take on, while variable_t objects comprise the third, and represent a tensor-valued object with additional autograd-related metadata.
Explicitly:
typedef struct {
    int num_dims;      // number of dimensions
    size_t size;       // total number of entries
    size_t* dims;      // extent of each dimension
    size_t* strides;   // stride of each dimension, in entries
} shape_t;
typedef struct {
    tensor_entry_t* data;  // ptr to data
    shape_t* shape;        // dimensions of data
} tensor_t;
typedef struct {
    tensor_t* tensor;        // the variable’s current value
    tensor_t* gradient;      // the accumulated gradient
    grad_meta_t* grad_meta;  // autograd metadata (defined below)
} variable_t;
In summary: shape_t objects are first-class, allowing a separation between a tensor’s data and its view; tensor_t objects hold the data and provide the value for variable_t objects; and variable_t objects are the actual variable instances we intend users of our library to work with.
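To make the stride bookkeeping concrete, here’s a minimal sketch of a constructor for shape_t that fills in row-major (C-contiguous) strides. The function name and the error-free malloc handling are my own simplifications for illustration, not necessarily what Coral does:

#include <stdlib.h>

// Hypothetical constructor: build a shape_t with row-major strides, so that
// the last dimension is contiguous in memory.
shape_t* shape_new(int num_dims, const size_t* dims) {
    shape_t* shape = malloc(sizeof(shape_t));
    shape->num_dims = num_dims;
    shape->dims = malloc(num_dims * sizeof(size_t));
    shape->strides = malloc(num_dims * sizeof(size_t));

    size_t stride = 1;
    for (int i = num_dims - 1; i >= 0; --i) {
        shape->dims[i] = dims[i];
        shape->strides[i] = stride;  // e.g. (20, 5, 1) for dims (3, 4, 5)
        stride *= dims[i];
    }
    shape->size = stride;            // total number of entries
    return shape;
}

With strides in hand, the entry at index $(i,j,k)$ of a three-dimensional tensor lives at flat offset i*strides[0] + j*strides[1] + k*strides[2] in the data buffer, which is also what makes it possible to express a transpose as a change of shape alone.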
Our tensor library won’t be very useful unless we have some way of performing backpropagation on expressions involving our variable_t objects. Here’s the idea. By considering not just the variables present in the expression, but the intermediate results as well, we can construct from our expression a directed acyclic graph (DAG) in which out-edges indicate “is an argument of”. For example, the expression $(x+y)z - w$ is given by the graph:
Or, slightly more complex, we can have a variable appear more than once, as occurs in $x(x+y)$:
In any case, backpropagation amounts to initializing to one the gradient of the variable representing the final result of the expression, and then traversing the graph backwards, propagating the gradient via the chain rule. Suppose that $y = f(x_1,\dots, x_n)$, and refer to the $x_i$ as the children of $y$, and $y$ as the parent of the $x_i$. Then at $y$, this backwards procedure is as follows:
# y = f(x_1, ..., x_n)
def backwards(y):
    for x in (x_1, ..., x_n):
        x.gradient += (df/dx)(x_1, ..., x_n) * y.gradient
        if all_parents_processed(x):
            backwards(x)
Here, the function all_parents_processed acts as an oracle, returning a boolean indicating whether or not all variables $y$ for which $x$ is an argument have been processed, which in turn indicates whether or not the gradient of $x$ is fully computed. Only in the event that all_parents_processed is true should we call backwards on $x$. This also ensures that backwards is called only once per variable, and basically amounts to performing a topological sort as we go.
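To see the oracle and the += at work, take the second example above, $f = x(x+y)$, and write $s = x + y$ for the intermediate result. Starting from $f$ with gradient one, processing $f$ adds $\frac{df}{dx} = s$ to the gradient of $x$ and $\frac{df}{ds} = x$ to the gradient of $s$; since $x$ still has an unprocessed parent (namely $s$), we don’t recurse into it yet. Processing $s$ then adds $x$ to the gradient of $x$ and $x$ to the gradient of $y$. Only now are all of $x$’s parents processed, and its gradient holds the full $s + x = 2x + y$, which matches differentiating $f = x^2 + xy$ directly.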
This is similar to the approach taken by PyTorch, which also constructs “computation graphs”. This way, the DAG rooted at a particular variable is exactly the expression needed to reproduce that variable in the forward direction, or to perform backpropagation in the backward direction. This is an alternative to tape-based automatic differentiation, also known as a Wengert list, in which a global “tape” is constructed listing all of the operations and intermediate arguments in the order that they were performed. Backpropagation with this approach then just involves replaying the tape backwards. This is similar to the approach taken by TensorFlow.
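For contrast, here’s a rough, scalar-valued sketch of what such a tape might look like. None of this is Coral (or TensorFlow) code, and every name is invented; it just shows how replaying recorded operations in reverse order accumulates gradients:

// One recorded operation: which primitive was applied, and the slots (indices
// into parallel value/grad arrays) of its two inputs and its output.
typedef enum { OP_ADD, OP_MUL } tape_op_t;

typedef struct {
    tape_op_t op;
    int lhs, rhs, out;
} tape_entry_t;

// Replay the tape backwards, applying the chain rule at each entry.
void tape_backward(const tape_entry_t* tape, int num_entries,
                   const float* value, float* grad) {
    for (int i = num_entries - 1; i >= 0; --i) {
        const tape_entry_t* e = &tape[i];
        switch (e->op) {
        case OP_ADD:  // d(a+b)/da = d(a+b)/db = 1
            grad[e->lhs] += grad[e->out];
            grad[e->rhs] += grad[e->out];
            break;
        case OP_MUL:  // d(a*b)/da = b, d(a*b)/db = a
            grad[e->lhs] += value[e->rhs] * grad[e->out];
            grad[e->rhs] += value[e->lhs] * grad[e->out];
            break;
        }
    }
}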
In order to enable this kind of backwards automatic differentiation, we need to ensure that the computation graph is constructed during the forward pass. To do this, it is sufficient that every differentiable operation modify the metadata of its result to store pointers to the argument variables, as well as pointers to the appropriate backwards functions to call. For example, the operation $z = f(x,y)$ would add a pointer from the variable $z$ to each of $x$ and $y$. Furthermore, $z$ would also need to store the functions $\frac{df}{dx}$ and $\frac{df}{dy}$, each to be multiplied by the gradient of $z$, per the chain rule. In our implementation, if $f$ were variable_add (addition), then this gradient accumulation would be handled by the library function add_backwards_grad. I’ve gone ahead and added the following two structures to store this information:
// differentiable argument
typedef struct {
    variable_t* variable;        // the argument variable
    variable_grad_op_t grad_op;  // backwards function for this argument
} input_t;
// grad metadata
typedef struct {
    int ref_count;       // how many times this variable is used as an argument
    int num_inputs;      // number of differentiable arguments
    input_t* inputs[2];  // the arguments (at most two, for now)
} grad_meta_t;
where variable_grad_op_t is the function pointer described above. Note that the ref_count attribute maintains the number of vertices to which there is an outgoing edge (in other words, how many times the vertex appears as an argument in the computation of another vertex). By only calling backwards recursively on a node once its ref_count reaches zero, we are able to implement the all_parents_processed function from the pseudocode above.
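To tie the pieces together, here’s a sketch of how these structures could drive the backward pass. The signature I’ve assumed for variable_grad_op_t (the argument variable plus the parent’s gradient), and the helpers tensor_add_in_place and variable_backwards, are illustrative guesses rather than Coral’s actual API:

// Assumed signature: accumulate (d result / d argument) * result_grad into the
// argument's gradient.
typedef void (*variable_grad_op_t)(variable_t* arg, const tensor_t* result_grad);

void tensor_add_in_place(tensor_t* dst, const tensor_t* src);  // hypothetical: dst += src

// For addition the local derivative is 1, so the grad op just passes the
// upstream gradient through.
void add_backwards_grad(variable_t* arg, const tensor_t* result_grad) {
    tensor_add_in_place(arg->gradient, result_grad);
}

// Backward pass rooted at y (whose gradient has been initialized to ones):
// push y's gradient into each input, and recurse only once an input has had
// all of its parents processed, i.e. its ref_count has dropped to zero.
void variable_backwards(variable_t* y) {
    for (int i = 0; i < y->grad_meta->num_inputs; ++i) {
        input_t* input = y->grad_meta->inputs[i];
        variable_t* x = input->variable;

        input->grad_op(x, y->gradient);    // x.gradient += (df/dx) * y.gradient
        x->grad_meta->ref_count--;         // one fewer unprocessed parent

        if (x->grad_meta->ref_count == 0)  // all_parents_processed(x)
            variable_backwards(x);
    }
}

Kicking off backpropagation on a result variable then just means filling its gradient with ones and calling variable_backwards on it, mirroring the pseudocode above.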
So far we’ve built a limited tensor library complete with automatic differentiation. Next time we’ll look at introducing modules, handling data, and training.
This blog post has a wonderful explanation of some of what’s going on behind the scenes in PyTorch. ↩
While the first two are pretty self-explanatory, stride essentially encodes how the multi-dimensional tensor is mapped onto the one-dimensional chunk of memory. For example, within a three-dimensional tensor of dimensions $ 3\times 4\times 5 $, indices $(i,j,k)$ and $(i,j,k+1)$ are one entry apart in memory, indices $(i,j,k)$ and $(i,j+1,k)$ are five entries apart, and indices $(i,j,k)$ and $(i+1,j,k)$ are $4\times 5 = 20$ entries apart. The strides, one for each dimension, are therefore $(20, 5, 1)$. ↩
Note that while we may obtain the desired transposition by only updating the tensor’s shape, this is sometimes undesirable for performance reasons. The property that adjacent entries in memory are adjacent in the tensor is no longer true after applying any permutation which moves the last dimension. ↩