What I have been working on: Modal
Long story short: I'm working on a super cool tool called Modal. Please check it out — it lets you run things in the cloud without having to think about infrastructure. Scaling out, scheduling, containerization, using GPUs, setting up webhooks, and all kinds of other stuff. It's primarily meant for data teams. We aren't quite live, but you can sign up for our waitlist.
Most of my career has been in data. I spent seven years working at Spotify, doing everything from large-scale numerical methods to making charts for board decks. I then spent six years as a CTO, although I managed the data team directly for a long time and would occasionally write some data code.
Data1 strikes me as a discipline that deserves a bit more love. It's grown from obscurity to become some meaningful % of software engineering, but the state of the art in terms of tools and workflow is still emerging.
Data as its own discipline
I spent a lot of time as a CTO, and if I had to condense how to make engineers productive into one thing, it would be something like: make the feedback loops fast.
What sort of “loops” am I talking about? In frontend development it's saving the code in the editor on one screen and seeing the updates in the browser on another. In backend development it's running unit tests (or sometimes, just compiling).
Data is sort of weird because you have to run things on production data to have this sort of feedback loop. Whether you're running SQL or doing ML, it's often pointless to do that on non-production data. This violates a holy wall for a lot of software engineers: the strict separation of local and prod. 2
There are other differences too! Data teams often need to change infrastructure a lot more often (sometimes every new cron job needs a Terraform update), have very “bursty” needs for compute power, and need a much wider range of hardware (GPUs! high memory jobs! etc). Not to mention the “liberal” (ehum) standards for packaging cutting edge ML code 3.
There's a weird sort of backend-normative view of what data teams should do, but I think it's very misguided. The UN should impose sanctions on anyone complaining that “data teams need to adopt software engineering practices”. The needs are different! Let's let the right workflows emerge from what makes teams the most productive, and let's let data workflows stand on their own feet.4
Let's make data teams more productive!
I started looking at this a few years ago, looked at the absurdly crowded market landscape, and spent a lot of time talking to very smart people. I kind of wanted to rebuild everything! Which is an incredibly aspirational and dumb idea. Looking at a lot of different parts of the stack, I started getting interested in the lowest part of it, which I've been thinking of as the “runtime”.
A lot of the issues with data productivity, I think, come back to the runtime:
- Infrastructure is often a “chore” in the end. After you get something running locally, you now have to do a ton of complex testing/configuration to ship it.
- The feedback loops when working with infra are super slow, because every iteration requires deploying code to K8s or similar
- Things break in production in weird ways because the environment is different
- Every company that reaches a certain stage tends to build its own data platform. This seems… wasteful?
- It turns out it's hard to build abstractions on top of Kubernetes without it leaking through in 100 ways.
Some very large fraction of this has effectively been solved by shoving all the transformations into SQL in the last few years. I'm a big fan of SQL. Better tooling made it 10x easier to write complex pipelines in SQL and so its adoption increased by a lot. But there's still a lot of things where you need code. What can we do to make data teams 10x more productive when they write code?
Let's build a new runtime
I wanted to build something that takes code on a user's computer and launches it in the cloud within a second. Skimming over all the intermediate steps, I built Modal.
Let's write some code that computes the square of 42 and prints it to standard out:
    import modal

    stub = modal.Stub()

    @stub.function
    def square(x):
        print("This code is running on a remote worker!")
        return x**2

    if __name__ == "__main__":
        with stub.run():
            print("the square is", square.call(42))
The foundational building block in Modal is a decorator that takes any Python function and moves its execution to the cloud. This might seem like a very trivial thing, but it turns out you can use this as a very powerful primitive to build a lot of cool stuff.
Let's run this code:
Two things worth noting here:
- This launches the code into the cloud in ~1s.
- The square function runs in the cloud, but when it prints, we see it on the local stdout.
Things like this let us take the cloud inside the innermost feedback loop. If we edit the code and re-run the app, the new code just runs. Compare that with the usual loop of build container → push container → trigger job → download logs, which can take a few minutes if you're lucky, or a few hours if you're unlucky (especially if there's version control and CI/CD in that feedback loop).
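To make the decorator idea a bit more concrete, here is a toy local stand-in. It is only a sketch under heavy assumptions: ToyStub and ToyFunction are hypothetical names, the "remote worker" is just a thread in the same process, and none of this reflects how Modal is actually implemented. It only shows the shape of the primitive: the decorator swaps the function for a handle, the call runs somewhere else, and whatever it prints gets replayed on the local stdout.

```python
import contextlib
import io
from concurrent.futures import ThreadPoolExecutor


class ToyFunction:
    """Hypothetical handle returned by the decorator instead of the raw function."""

    def __init__(self, fn, pool):
        self._fn = fn
        self._pool = pool

    def _run_captured(self, *args, **kwargs):
        # "Worker" side: run the function and capture anything it prints.
        # redirect_stdout swaps sys.stdout for the whole process, so this toy
        # is only safe for one call at a time.
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            result = self._fn(*args, **kwargs)
        return result, buf.getvalue()

    def call(self, *args, **kwargs):
        # "Client" side: ship the call off, wait for it, replay output locally.
        future = self._pool.submit(self._run_captured, *args, **kwargs)
        result, output = future.result()
        print(output, end="")
        return result


class ToyStub:
    """Hypothetical local stand-in: a thread pool plays the role of the cloud."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)

    def function(self, fn):
        return ToyFunction(fn, self._pool)


stub = ToyStub()


@stub.function
def square(x):
    print("This code is running on a worker thread!")
    return x**2


if __name__ == "__main__":
    print("the square is", square.call(42))
```

The interesting part is that the caller never touches the worker directly: everything goes through the handle, which is what leaves room to put a network and a container behind it later.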
The other benefit is, you only have a single environment. We can define arbitrary environments in code and have Modal build it for you, in the cloud:
This says: run the function square inside a container image that has numpy installed in it. When we run this, if the image doesn't exist, it will be built in the cloud for us. We build it super fast (the above example in a couple of seconds), since we built our own container builder and have fast machines in the cloud with super fast internet.
What about using a GPU?
Or a cronjob running in the cloud?
Any function in Modal can also be used to map over it, fanning it out to lots of containers running in the cloud (see code):
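As a rough local analogy for this kind of fan-out (not Modal's API), the same pattern can be sketched with Python's standard library. The assumptions are loud ones: fan_out is a hypothetical helper, and worker threads stand in for containers in the cloud.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x**2


def fan_out(fn, inputs, max_workers=8):
    """Apply fn to every input on a pool of workers, keeping input order.

    Hypothetical helper: threads play the role that containers would play
    in a real cloud fan-out.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, inputs))


if __name__ == "__main__":
    total = sum(fan_out(square, range(10)))
    print("the sum of squares is", total)
```

The calling code stays the same whether there is one worker or a hundred, which is the property that makes map a nice primitive to scale out.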
There's plenty of other things too. All of this while not making you write a single line of YAML in the process (in fact, there is no configuration in Modal — everything is in code).
And since we take care of all the infrastructure, it's trivial to get started. Once you have an account 5, just run:
pip install modal-client
modal token new
This initializes an API token and you're ready to run Modal. We run everything in our infrastructure, so there's nothing to set up other than that.
What are some things you can build using Modal?
We've been focusing a lot on machine learning recently, in particular model inference — Stable Diffusion is obviously the coolest thing right now, but we also support a wide range of other things: Using OpenAI's Whisper model for transcription, Dreambooth, object detection (with a webcam demo!). It's possible to deploy an ML model on Modal in a few minutes, and the actual deployment step usually takes a few seconds.
But Modal is really a general purpose compute layer you can use for a lot of stuff. You can use us for querying datasets using DuckDB, for web scraping, and for many other things: data pipelines, cron jobs, large-scale simulations, and more. And it's serverless6, so you only pay for the actual usage.
How does it work?
Taking code on a user's computer and launching it into custom containers in the cloud in less than a second isn't trivial. Along the way, we ended up building a lot of custom stuff: in particular a container runner, an image builder, and our own filesystem. Most of this is written in Rust for performance and safety.
We decided to not build this on top of tools like Docker/Kubernetes because we want infrastructure to be fast. I met a lot of VCs and other people while I was still just working alone on a prototype and most of them told me I was nuts when I started talking about building custom file systems and container engines. But we built it and it's working beautifully! Modal has no problem building a 100GB container, and then booting up 100 of those containers — you can do the whole thing in a few seconds. This is what it's built for. 7
There's a lot of other complexity: the work of scheduling tasks onto workers, exposing everything as a Python SDK, and much more. I'm very spoiled to have this team 8 and really proud of where we've gotten so far. BUT it's also clear that we're extremely early in this journey — we're working on infrastructure at a very low level and it's going to take a lot of time to support all the cool stuff we want to get to. I will be posting a lot more about it!
I'm deliberately vague about what exact role I mean here: take it to mean data engineers, data scientists, ML engineers, analytics engineers, and maybe more roles. The current division of responsibility is still in flux, and I think it's a mistake to overfit to what we have today. ↩︎
I think it's interesting that Snowflake was (arguably) the first major database to only run in the cloud, and it sort of makes sense from this view. OLTP databases typically need to run in multiple places: in prod, in dev, and in CI. An OLAP database like Snowflake doesn't have this portability requirement. ↩︎
I looked at some code recently where the author had put the entire virtual env in a zip, and the notebook included steps that patched installed site-packages. ↩︎
To be clear: I would still recommend every data person to learn a lot about “traditional” software engineering! But let's not treat it as the “right” way to do things. ↩︎
Unfortunately account registration still isn't open, because we aren't quite ready for it! ↩︎
The term serverless is applied waaaaay too liberally by vendors today, so I struggle a bit with the term, to be honest. ↩︎
Pushing and pulling OCI images is incredibly inefficient. Only a very small fraction of the image content is ever read, and there's an extremely high degree of overlap between even unrelated images. We're exploiting this by running our own file system in FUSE which we expose to containers. But under the hood, we use a content-addressed storage system. This means we don't have to copy around large images, because we can fetch individual files lazily when they are needed, and we can also achieve a very high cache efficiency for these files. I hope to write a blog post about this in the future! ↩︎
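The content-addressing idea in that last footnote can be sketched in a few lines. This is a toy under stated assumptions, not the real file system: BlobStore, LazyImage and build_image are hypothetical names, a dict stands in for remote storage, and there is no FUSE involved. It only shows why hashing file contents gives both deduplication across unrelated images and lazy, cache-friendly reads.

```python
import hashlib


class BlobStore:
    """Hypothetical stand-in for remote storage, keyed by content hash."""

    def __init__(self):
        self.blobs = {}
        self.fetches = 0  # counts simulated round trips to "remote" storage

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blobs[digest] = data  # identical content lands on the same key
        return digest

    def get(self, digest: str) -> bytes:
        self.fetches += 1
        return self.blobs[digest]


class LazyImage:
    """An 'image' here is only a manifest mapping paths to content hashes."""

    def __init__(self, manifest, store, cache):
        self._manifest = manifest
        self._store = store
        self._cache = cache  # shared across images, keyed by hash

    def read(self, path: str) -> bytes:
        digest = self._manifest[path]
        if digest not in self._cache:
            # Fetch only when a file is actually read, and only once per hash.
            self._cache[digest] = self._store.get(digest)
        return self._cache[digest]


def build_image(files, store, cache):
    """Store each file by hash and return an image that reads lazily."""
    manifest = {path: store.put(data) for path, data in files.items()}
    return LazyImage(manifest, store, cache)


if __name__ == "__main__":
    store, cache = BlobStore(), {}
    image_a = build_image({"/lib/shared.so": b"shared", "/app/a.py": b"a"}, store, cache)
    image_b = build_image({"/lib/shared.so": b"shared", "/app/b.py": b"b"}, store, cache)
    image_a.read("/lib/shared.so")
    image_b.read("/lib/shared.so")  # served from the shared cache
    print("blobs stored:", len(store.blobs), "fetches:", store.fetches)
```

Two images that share a file store it once, and the second image to read it never goes back to storage. Files that nobody reads are never fetched at all.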