I was poking around the PyTorch internals a few weeks ago, and thought it’d be fun to write my own (very simplified) deep learning framework. This is the first in what I hope will be a series of posts documenting my progress on the project, and my thoughts along the way. Coral can be found on GitHub here.
The PyTorch codebase is absolutely enormous, with a ton of features which, for the purposes of this project, we don’t care about (read: extensive tensor library, hardware/sparsity-specific implementations, TorchScript, etc). In order to keep the project from going off the rails, we need a few reasonable goal posts which together would be enough to train a simple model. Here’s a rough roadmap:
- A small tensor library with support for automatic differentiation
- A torch.nn.Module class
- Data handling
- Training a simple model

An aside on the name: I spent much more time on this than I’m proud to admit. At first, I was thinking that “CyTorch” might be a good fit. The more I thought about it, though, the less I liked it. The “Cy” was too suggestive of “Cyborg”, and sounded too much like Cython. Also, apparently it wasn’t original. I eventually went with Coral, which I’m pretty happy with. Bonus points that it starts with a C.
Central to everything are tensors: multi-dimensional arrays which can be added/subtracted/multiplied/convolved (cross-correlated??)/broadcasted/transposed, etc. Our tensor object should store the following:
1. The data itself: a pointer to a contiguous block of sizeof(float) * (tensor size) bytes. Here, sizeof(float)=4 is the number of bytes in a float (usually), and tensor size is pseudocode for the number of entries in the tensor.
2. The shape: the tensor’s dimensions, together with the strides describing how those dimensions map onto the underlying one-dimensional memory.
3. A record of how the tensor has been combined with other tensors.

Whereas the first two requirements describe the tensor’s data itself, the third relates only to how the tensor has been combined with other tensors, irrespective of any details of the tensor itself. I’ve opted to split what I’ve thus far been referring to as a tensor into two structures, which I’ve decided to call tensor_t and variable_t respectively. Finally, a third structure, shape_t, is used to store shape objects. tensor_t objects (structs, in C) comprise requirements (1) and (2), and represent the value that a tensor-valued variable might take on, while variable_t objects comprise the third, and represent a tensor-valued object with additional autograd-related metadata.
Explicitly:
typedef struct {
    int num_dims;      // number of dimensions
    size_t size;       // total number of entries
    size_t* dims;      // extent of each dimension
    size_t* strides;   // stride of each dimension, in entries
} shape_t;
typedef struct {
    tensor_entry_t* data;  // ptr to data
    shape_t* shape;        // dimensions of data
} tensor_t;
typedef struct {
    tensor_t* tensor;        // the variable’s current value
    tensor_t* gradient;      // the accumulated gradient
    grad_meta_t* grad_meta;  // autograd metadata (defined below)
} variable_t;
In summary: shape_t objects are first-class, allowing a separation between a tensor’s data and its view; tensor_t objects hold the data and provide the value for variable_t objects; and variable_t objects are the actual variable instances we intend users of our library to work with.
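To make the stride bookkeeping concrete, here’s a minimal sketch of a constructor for shape_t that fills in row-major (C-contiguous) strides. The function name and the error-free malloc handling are my own simplifications for illustration, not necessarily what Coral does:

#include <stdlib.h>

// Hypothetical constructor: build a shape_t with row-major strides, so that
// the last dimension is contiguous in memory.
shape_t* shape_new(int num_dims, const size_t* dims) {
    shape_t* shape = malloc(sizeof(shape_t));
    shape->num_dims = num_dims;
    shape->dims = malloc(num_dims * sizeof(size_t));
    shape->strides = malloc(num_dims * sizeof(size_t));

    size_t stride = 1;
    for (int i = num_dims - 1; i >= 0; --i) {
        shape->dims[i] = dims[i];
        shape->strides[i] = stride;  // e.g. (20, 5, 1) for dims (3, 4, 5)
        stride *= dims[i];
    }
    shape->size = stride;            // total number of entries
    return shape;
}

With strides in hand, the entry at index $(i,j,k)$ of a three-dimensional tensor lives at flat offset i*strides[0] + j*strides[1] + k*strides[2] in the data buffer, which is also what makes it possible to express a transpose as a change of shape alone.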
Our tensor library won’t be very useful unless we have some way of performing backpropagation on expressions involving our variable_t objects. Here’s the idea. By considering not just the variables present in the expression, but the intermediate results as well, we can construct from our expression a directed acyclic graph (DAG) in which out-edges indicate “is an argument of”. For example, the expression $(x+y)z - w$ is given by the graph:
Or, slightly more complex, we can have a variable appear more than once, as occurs in $x(x+y)$:
In any case, backpropagation amounts to initializing to one the gradient of the variable representing the final result of the expression, and then traversing the graph backwards, propagating the gradient via the chain rule. Suppose that $y = f(x_1,\dots, x_n)$, and refer to the $x_i$ as the children of $y$, and $y$ as the parent of the $x_i$. Then at $y$, this backwards procedure is as follows:
# y = f(x_1, ..., x_n)
def backwards(y):
    for x in (x_1, ..., x_n):
        x.gradient += (df/dx)(x_1, ..., x_n) * y.gradient
        if all_parents_processed(x):
            backwards(x)
Here, the function all_parents_processed acts as an oracle, returning a boolean indicating whether or not all variables $y$ for which $x$ is an argument have been processed, which in turn indicates whether or not the gradient of $x$ is fully computed. Only in the event that all_parents_processed is true should we call backwards on $x$. This also ensures that backwards is called only once per variable, and basically amounts to performing a topological sort as we go.
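To see the oracle and the += at work, take the second example above, $f = x(x+y)$, and write $s = x + y$ for the intermediate result. Starting from $f$ with gradient one, processing $f$ adds $\frac{df}{dx} = s$ to the gradient of $x$ and $\frac{df}{ds} = x$ to the gradient of $s$; since $x$ still has an unprocessed parent (namely $s$), we don’t recurse into it yet. Processing $s$ then adds $x$ to the gradient of $x$ and $x$ to the gradient of $y$. Only now are all of $x$’s parents processed, and its gradient holds the full $s + x = 2x + y$, which matches differentiating $f = x^2 + xy$ directly.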
This is similar to the approach taken by PyTorch, which also constructs “computation graphs”. This way, the DAG rooted at a particular variable is exactly the expression needed to reproduce that variable in the forward direction, or to perform backpropagation in the backward direction. This is an alternative to tape-based automatic differentiation, also known as a Wengert list, in which a global “tape” is constructed listing all of the operations and intermediate arguments in the order that they were performed. Backpropagation with this approach then just involves replaying the tape backwards. This is similar to the approach taken by TensorFlow.
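For contrast, here’s a rough, scalar-valued sketch of what such a tape might look like. None of this is Coral (or TensorFlow) code, and every name is invented; it just shows how replaying recorded operations in reverse order accumulates gradients:

// One recorded operation: which primitive was applied, and the slots (indices
// into parallel value/grad arrays) of its two inputs and its output.
typedef enum { OP_ADD, OP_MUL } tape_op_t;

typedef struct {
    tape_op_t op;
    int lhs, rhs, out;
} tape_entry_t;

// Replay the tape backwards, applying the chain rule at each entry.
void tape_backward(const tape_entry_t* tape, int num_entries,
                   const float* value, float* grad) {
    for (int i = num_entries - 1; i >= 0; --i) {
        const tape_entry_t* e = &tape[i];
        switch (e->op) {
        case OP_ADD:  // d(a+b)/da = d(a+b)/db = 1
            grad[e->lhs] += grad[e->out];
            grad[e->rhs] += grad[e->out];
            break;
        case OP_MUL:  // d(a*b)/da = b, d(a*b)/db = a
            grad[e->lhs] += value[e->rhs] * grad[e->out];
            grad[e->rhs] += value[e->lhs] * grad[e->out];
            break;
        }
    }
}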
In order to enable this kind of backwards automatic differentiation, we need to ensure that the computation graph is constructed during the forward pass. To do this, it is sufficient that every differentiable operation modify the metadata of its result to store pointers to the argument variables, as well as pointers to the appropriate backwards functions to call. For example, the operation $z = f(x,y)$ would add a pointer from the variable $z$ to each of $x$ and $y$. Furthermore, $z$ would also need to store the functions $\frac{df}{dx}$ and $\frac{df}{dy}$, each to be multiplied by the gradient of $z$, per the chain rule. In our implementation, if $f$ were variable_add (addition), then this gradient accumulation would be handled by the library function add_backwards_grad. I’ve gone ahead and added the following two structures to store this information:
// differentiable argument
typedef struct {
    variable_t* variable;        // the argument variable
    variable_grad_op_t grad_op;  // backwards function for this argument
} input_t;
// grad metadata
typedef struct {
    int ref_count;       // how many times this variable is used as an argument
    int num_inputs;      // number of differentiable arguments
    input_t* inputs[2];  // the arguments (at most two, for now)
} grad_meta_t;
where variable_grad_op_t is the function pointer described above. Note that the ref_count attribute maintains the number of vertices to which there is an outgoing edge (in other words, how many times the vertex appears as an argument in the computation of another vertex). By only calling backwards recursively on a node once its ref_count reaches zero, we are able to implement the all_parents_processed function from the pseudocode above.
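To tie the pieces together, here’s a sketch of how these structures could drive the backward pass. The signature I’ve assumed for variable_grad_op_t (the argument variable plus the parent’s gradient), and the helpers tensor_add_in_place and variable_backwards, are illustrative guesses rather than Coral’s actual API:

// Assumed signature: accumulate (d result / d argument) * result_grad into the
// argument's gradient.
typedef void (*variable_grad_op_t)(variable_t* arg, const tensor_t* result_grad);

void tensor_add_in_place(tensor_t* dst, const tensor_t* src);  // hypothetical: dst += src

// For addition the local derivative is 1, so the grad op just passes the
// upstream gradient through.
void add_backwards_grad(variable_t* arg, const tensor_t* result_grad) {
    tensor_add_in_place(arg->gradient, result_grad);
}

// Backward pass rooted at y (whose gradient has been initialized to ones):
// push y's gradient into each input, and recurse only once an input has had
// all of its parents processed, i.e. its ref_count has dropped to zero.
void variable_backwards(variable_t* y) {
    for (int i = 0; i < y->grad_meta->num_inputs; ++i) {
        input_t* input = y->grad_meta->inputs[i];
        variable_t* x = input->variable;

        input->grad_op(x, y->gradient);    // x.gradient += (df/dx) * y.gradient
        x->grad_meta->ref_count--;         // one fewer unprocessed parent

        if (x->grad_meta->ref_count == 0)  // all_parents_processed(x)
            variable_backwards(x);
    }
}

Kicking off backpropagation on a result variable then just means filling its gradient with ones and calling variable_backwards on it, mirroring the pseudocode above.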
So far we’ve built a limited tensor library complete with automatic differentiation. Next time we’ll look at introducing modules, handling data, and training.
This blog post has a wonderful explanation of some of what’s going on behind the scenes in PyTorch. ↩
While the first two are pretty self-explanatory, stride essentially encodes how the multi-dimensional tensor is mapped onto the one-dimensional chunk of memory. For example, within a three-dimensional tensor of dimensions $ 3\times 4\times 5 $, indices $(i,j,k)$ and $(i,j,k+1)$ are one entry apart in memory, indices $(i,j,k)$ and $(i,j+1,k)$ are five entries apart, and indices $(i,j,k)$ and $(i+1,j,k)$ are $4\times 5 = 20$ entries apart. The strides, one for each dimension, are therefore $(20, 5, 1)$. ↩
Note that while we may obtain the desired transposition by only updating the tensor’s shape, this is sometimes undesirable for performance reasons. The property that adjacent entries in memory are adjacent in the tensor is no longer true after applying any permutation which moves the last dimension. ↩