How I got into deep learning

I ran an education company, Dataquest, for 8 years. Last year, I got the itch to start building again. Deep learning was always interesting to me, but I knew very little about it. I set out to fix that problem.

Since then, I’ve trained dozens of models (several of them state of the art for open source), built 2 libraries that have 5k+ Github stars, and recently accepted an offer from answer.ai, a research lab started by Jeremy Howard.

I say this to establish the very rough outline of my learning journey. In this post, I’m going to cover more detail about how I learned deep learning. Hopefully it helps on your journey.

My background

I didn’t learn this stuff in school. I majored in American History for undergrad, and failed quite a few classes.

I did machine learning and Python work in 2012, but convinced myself that deep learning was too complicated for me. One reason was that I learned by doing Kaggle competitions. Kaggle competitions are amazing for learning quickly, but can leave you with gaps in the fundamentals - like how algorithms work mathematically.

When deep learning started to become popular, it was very math-heavy, and I felt like I’d never be able to understand it. Of course, this was false, as I proved to myself 10 years later, but the angle at which you approach something makes all the difference. I approached deep learning top-down the first time, by gluing models together without understanding how they worked. I eventually hit a wall, and couldn’t get past it.

Useful skills

When I studied deep learning last year, I already had useful skills. The first was strong Python programming ability. Despite efforts to the contrary, Python is still the universal language of AI. If you want to get into AI, start by getting really good at programming.

No matter what era of AI I’ve been in, data cleaning has been >70% of my work. It’s possible to avoid working with data if you’re doing pure research or working on toy problems, but otherwise data skills are essential.

There’s a slightly more nebulous skill I’ll call pragmatism. Deep learning has a lot of rabbit holes - ranging from “what’s the perfect base model?” to “what if I get rid of the sigmoid here?” Some of these rabbit holes are useful, but most of them will eat a lot of time. Being able to recognize when to go deep and when to just do the fast/easy solution is important.

Book learning

This time, I decided to learn bottom-up, fundamentals first. I read The Deep Learning Book. It’s a few years old, but still a fantastic resource. Read it slowly. A lot of the terminology and math will be unfamiliar - look them up as you go. You may need to sketch some things out or code them to really understand them - give yourself the space to do that. If the math is unfamiliar, a good complementary resource is Math for Machine Learning. Although I haven’t taken them myself, fast.ai and the Karpathy videos are also high quality.

Even though architectures like CNNs and RNNs might seem out of date in a world that is moving towards transformers for everything, CNNs are still widely used, and everything old is new again with RNNs.

When you’re done with the first 2 parts of the book (you can skip part 3), you should be at a point where you can code up any of the main neural network architectures in plain numpy (forward and backward passes).
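
To give a flavor of that exercise, here’s a minimal sketch - a toy two-layer network rather than one of the book’s architectures - with the forward pass, backward pass, and update written out by hand in numpy:

```python
import numpy as np

# Toy data: 64 samples, 10 features, regression targets
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))
y = rng.normal(size=(64, 1))

# Two-layer MLP: 10 -> 32 -> 1
W1, b1 = rng.normal(scale=0.1, size=(10, 32)), np.zeros(32)
W2, b2 = rng.normal(scale=0.1, size=(32, 1)), np.zeros(1)
lr = 1e-2

for step in range(200):
    # Forward pass
    h = X @ W1 + b1
    a = np.maximum(h, 0)                 # ReLU
    pred = a @ W2 + b2
    loss = np.mean((pred - y) ** 2)      # MSE

    # Backward pass (chain rule, layer by layer)
    d_pred = 2 * (pred - y) / len(X)
    dW2, db2 = a.T @ d_pred, d_pred.sum(axis=0)
    d_a = d_pred @ W2.T
    d_h = d_a * (h > 0)                  # ReLU gradient
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0)

    # Gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```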

One thing that will really help you get to that point is teaching the skills while you learn them. I started putting together a course, Zero to GPT, as I read the deep learning book. Teaching is the ultimate way to solidify concepts in your head, and I found myself falling into a nice cycle of learning, looking up/sketching what I didn’t understand, then teaching it.

Papers

The book will take you up to 2015-era deep learning. After reading the book, I read some of the foundational deep learning papers from the 2015-2022 era and implemented them in PyTorch. You can use Google Colab for free/cheap GPUs, and Weights and Biases to track your training runs.
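
As a rough sketch of that workflow (the model and data here are toy placeholders, not any specific paper reimplementation), a training loop with Weights and Biases tracking looks something like this:

```python
import torch
import torch.nn as nn
import wandb

# Start a tracked run; config values show up alongside the logged metrics
wandb.init(project="paper-reimplementations", config={"lr": 1e-3, "epochs": 50})

# Placeholder model and data - swap in the architecture from the paper you're implementing
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=wandb.config.lr)
loss_fn = nn.MSELoss()
x, y = torch.randn(256, 10), torch.randn(256, 1)

for epoch in range(wandb.config.epochs):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    wandb.log({"epoch": epoch, "loss": loss.item()})  # appears in the W&B dashboard

wandb.finish()
```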

A noncomprehensive list is:

After this, you should be able to understand most conversations people have about deep learning model architectures.

Fine-tuning and Discord

The easiest entry point for training models these days is finetuning a base model. Huggingface transformers is great for finetuning because it already implements a lot of models and uses PyTorch.

There are Discord communities, like Nous Research and EleutherAI, where people discuss the latest models and papers. I’d recommend joining them, seeing what’s state of the art at the moment, and trying some finetuning.

The easiest way to finetune is to pick a small model (7B or fewer params), and try finetuning with LoRA. You can use Google Colab, or something like Lambda Labs if you need more VRAM or multiple GPUs.
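
As a rough sketch of what that looks like (the model name, dataset, and hyperparameters below are placeholders, not recommendations), LoRA finetuning with transformers and the peft library goes something like this:

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

model_name = "mistralai/Mistral-7B-v0.1"  # any base model with 7B or fewer params
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Wrap the base model with low-rank adapters; only the adapter weights get trained
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Tokenize a text dataset - replace this with your own finetuning data
data = load_dataset("imdb", split="train[:1%]")
data = data.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                batched=True, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=1,
                           learning_rate=2e-4, logging_steps=10),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```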

I wanted to train models to code better, so I put together datasets and finetuned a few different base models on data from StackOverflow and other places. It really helped me understand the linkage between model architecture, data, compute, and output. However, finetuning is a very crowded space, and it’s hard to make an impact when the state of the art changes every day.

Problem Discovery

As I was working on finetuning, I realized that some of the highest quality data was in textbook form, and locked away in pdfs. One way I tried to solve this was to generate synthetic data.

Another way was to extract the data from pdfs and turn it into good training data (markdown). There was an approach called nougat that worked well in many cases, but was slow and expensive to run. I decided to see if I could build something better by leveraging the data already in the pdf (avoiding OCR), and only using models when needed. I chained together several different models, along with heuristics in between. This approach, marker, is 10x faster than nougat, works with any language, and is usually more accurate.
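
To make that concrete, here’s a simplified sketch of the general idea - not marker’s actual code, and run_ocr_model is a hypothetical stand-in for whatever model handles pages without usable embedded text:

```python
import fitz  # PyMuPDF

def extract_pages(pdf_path, run_ocr_model):
    doc = fitz.open(pdf_path)
    pages = []
    for page in doc:
        text = page.get_text("text")
        if len(text.strip()) > 50:   # heuristic: the embedded text looks usable
            pages.append(text)
        else:                        # likely a scanned page - fall back to a model
            pix = page.get_pixmap(dpi=150)
            pages.append(run_ocr_model(pix.tobytes("png")))
    return pages
```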

Working on marker led me to want to solve several more problems, and I’ve also trained an equation-to-LaTeX model, a text detection model, an OCR model that’s competitive with Google Cloud, and a layout model.

For all of these models, I took existing architectures, changed the layers, loss, and other elements, then generated/found the right datasets. For example, for the OCR model, I started with the Donut architecture, added GQA, an MoE layer, UTF-16 decoding (1-2 tokens for any character), and changed some of the model shapes.
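
To illustrate the UTF-16 decoding piece (a toy sketch of the idea, not the model’s actual tokenizer), encoding text as UTF-16 code units means every character maps to 1-2 fixed ids, with no learned vocabulary needed to cover any language:

```python
def utf16_tokens(text: str) -> list[int]:
    # One token per UTF-16 code unit: 1 for characters in the BMP, 2 for a surrogate pair
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def utf16_detokenize(tokens: list[int]) -> str:
    return b"".join(t.to_bytes(2, "little") for t in tokens).decode("utf-16-le")

print(utf16_tokens("héllo 数学"))                    # 1 token per character
print(utf16_tokens("𝔸"))                            # outside the BMP -> 2 tokens
print(utf16_detokenize(utf16_tokens("héllo 数学")))  # round-trips back to the text
```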

Since OCR models are typically small (less than 300M params), I was able to train all of these models on 4x A6000s. I probably could have gotten away with 2x A6000s if I was a bit more efficient.

Hopefully this illustrates 3 things for you:

  • Understanding the fundamentals is important to training good models
  • Finding interesting problems to solve is the best way to make an impact with what you build
  • You don’t need a lot of GPUs

There are many niches in AI where you can make a big impact, even as a relative outsider.

Open source

As you may have noticed, I open source all of my AI projects. The data stack is a very underinvested area of AI relative to impact. I feel strongly that the more widely you can distribute high quality training data, the lower the risk of 1-2 organizations having a monopoly on good models.

Open source also has a side effect of being a good way to get exposure. Which leads me to the last part of my story.

Getting a research job

I was thinking about building a business around my open source tools. Working somewhere wasn’t on my radar at all. But when Jeremy reached out about answer.ai, I felt like it was an opportunity I had to take. The chance to work with talented people, make a positive impact, and learn a lot is hard to pass up.

My open source work directly led to the job opportunity, both in the obvious way (it gave me exposure), and in a subtler way (it significantly improved my skills). Hopefully, you’ll open source something as you learn, too.

Next steps

I suspect my work at answer.ai will look very similar to my open source work. I’ll keep training models, improving the data stack, and releasing publicly.

If you’re trying to break into deep learning, I hope this post was useful. If not, I hope it was somewhat entertaining (you made it to the end, so it probably was, right?).

As for me, I’m going back to training some models (watching 2 of them converge right now).


Vikas Paruchuri

I’m Vikas, a developer based in Oakland. I’m a lifelong learner currently diving deep into AI. I’ve started an education company, won ML competitions, and served as a US diplomat.