Software infrastructure 2.0: a wishlist

Software infrastructure (in which I include everything ending with *aaS, or anything remotely similar to it) is an exciting field, in particular because (despite what the neo-luddites may say) it keeps getting better every year! I love working with something that moves so quickly.

In the last few months, I've thought a lot about where it's going in the next 5-10 years, and a wishlist has taken shape in my head. It's very opinionated! As in, you may not agree with these. That's fine: these are fundamentally predictions I'm making, or at least a wishlist. I'm OK if I'm right about some but not all of these. Let's dive right in.

Built for delight

You know how crappy software is crappy in ways that are so blatantly obvious to the user that you wonder why it was released? A touchscreen interface that's super laggy, or an appointment booking app that forces you to go in and out of possible dates and fill in all information before it tells you if it's available. We've all seen janky stuff like that, and they are generally janky in the same way: it feels like no one actually used the product after it was built, and said like, hey, this is kind of annoying, maybe we should make it more intuitive?

In 99% of the cases, I imagine they ended up in this situation because someone spelled out a long checklist of requirements, but there was nothing on the checklist to make sure the experience is delightful. Like, someone started with a wall full of post-it notes going “as a user, I want to …”. Which I think logically makes sense: you can define a requirement that users should be able to do x, y, z, but you can't define that the experience shouldn't suck.

Anyway, I feel like this applies to like 90% of software infrastructure products.

I mean, as a user, I can set up a static website in AWS, but it takes 45 steps in the console and 12 of them are highly confusing if you've never done it before. It's also super slow, and any time I make a mistake, I end up in some weird state where maybe I broke something terribly and I might have to start over. It's sad that this is the current state of infrastructure.

There's a lot to learn from how the best companies build consumer products: how they use data to identify friction points, and constantly experiment with changes to make things easier. I have a lot of hope here that natural selection will favor the products that are easy to get started with and fun to use. The first step is, we just need more alternatives and not just a handful of big semi-monopolies. Can't wait.

Truly serverless

We are, like what, 10 years into the cloud adoption? Most companies (at least the ones I talk to) run their stuff in the cloud. So why is software still acting as if the cloud doesn't exist?

  • The word cluster is an anachronism to an end-user in the cloud! I'm already running things in the cloud where there are elastic resources available at any time. Why do I have to think about the underlying pool of resources? Just maintain it for me.
  • I don't ever want to provision anything in advance of load.
  • I don't want to pay for idle resources. Just let me pay for whatever resources I'm actually using.
  • Serverless doesn't mean a burstable VM that saves its instance state to disk during idle periods.

I could go on, but I won't. I'm dreaming of a world where things are truly serverless. As in, I don't want to think about future resource needs, I just want things to magically handle it. The good news is I think we're actually getting closer to this dream every year!

The beauty of this is that a lot of the configuration stuff goes away magically. The competitive advantage of most startups is to deliver business value through business logic, not capacity planning and capacity management!

Not just that, but multi-tenancy is truly a “free lunch” from a resource utilization point of view, so any opportunity to pool resources represents a true win-win bargain. At the scale of data centers worldwide, it's big. Depending on who you are, you could get excited about either the gigatons of CO2 saved or the increased corporate net income margin (I guess I like both!)
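To make the pooling argument a bit more concrete, here's a tiny sketch with made-up numbers (the tenant names and hourly loads are purely hypothetical, not measurements of anything). If every tenant provisions for its own peak, you need the sum of the peaks. A shared pool only needs to cover the peak of the combined load, which can never be larger.

```python
# Hypothetical hourly load (arbitrary units) for three tenants.
# These numbers are invented purely for illustration.
loads = {
    "tenant_a": [10, 80, 20, 10],
    "tenant_b": [70, 10, 10, 20],
    "tenant_c": [10, 20, 90, 10],
}


def dedicated_capacity(loads):
    """Capacity needed if every tenant provisions for its own peak."""
    return sum(max(series) for series in loads.values())


def pooled_capacity(loads):
    """Capacity needed if all tenants share a single pool."""
    combined = [sum(hour) for hour in zip(*loads.values())]
    return max(combined)


print(dedicated_capacity(loads))  # 240
print(pooled_capacity(loads))     # 120
```

The gap is biggest when tenants peak at different times, which is exactly the situation where separately provisioned capacity sits idle the most.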

Fast

I don't mean fast as in, serving requests fast. We have software that does a great job doing this! Honestly, I think it's mind blowing how good it is: you can run functions at the edge and get response times around the world on the order of milliseconds.

The speed that's not there is setting up infrastructure. If I make a change in the AWS console, or if I add a new pod to Kubernetes, or whatever, I want that to happen in seconds. I'm not asking for milliseconds! Just please at least get it to less than a second. If we can serve requests in milliseconds then I have no doubt that we can get it there. We have the tech to boot VMs and containers basically instantaneously.

The speed matters because this is a serious waste of time for engineers. I feel like I've wasted years waiting for some infrastructure change to kick in. I'll get back to this topic in a second because I think it's an important one!

Ephemeral resources

Almost all infrastructure I've worked with treats resources as something meant to exist indefinitely. If I create a database in the cloud, it sticks around, and unless I do something, it will clutter up the console forever and I will pay money for it forever.

I used to think this was fine! My justification was that, well, if you want to run a test suite, just run the database yourself locally, maybe in a container. This is fine for some stuff, but I've come to think that it's probably pretty bad:

  • It's a lot of work to build your own replica of the infrastructure so you can run it locally.
  • The development-production delta gets bigger. There are always going to be subtle differences between how cloud infra works and how it works when you run it locally.
  • A lot of cloud infra is proprietary and impossible to run locally!

My deep desire is to make it easy to create ephemeral resources. Do you need a database for your test suite? Create it in the cloud in a way so that it gets garbage collected once your test suite is done. Run your tests against the cloud infra!
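Here's roughly what I have in mind, as a minimal sketch. `FakeCloud` is an in-memory stand-in I made up so the example runs on its own; it is not any real provider's API, and `ephemeral_database` is a hypothetical helper. The point is just the shape: the resource is tied to a scope, and it's cleaned up when the scope ends, even if the tests blow up.

```python
import contextlib
import uuid


class FakeCloud:
    """In-memory stand-in for a cloud client library (hypothetical)."""

    def __init__(self):
        self.databases = {}

    def create_database(self, name):
        self.databases[name] = {"rows": []}
        return name

    def delete_database(self, name):
        self.databases.pop(name, None)


@contextlib.contextmanager
def ephemeral_database(cloud):
    """Create a database that is deleted when the block exits, even on failure."""
    name = cloud.create_database(f"test-{uuid.uuid4().hex[:8]}")
    try:
        yield name
    finally:
        cloud.delete_database(name)


cloud = FakeCloud()
with ephemeral_database(cloud) as db:
    # The "test suite" runs against the freshly created database.
    cloud.databases[db]["rows"].append({"id": 1})
    assert len(cloud.databases[db]["rows"]) == 1

print(cloud.databases)  # nothing left behind
```

A real version would want the provider to enforce the cleanup too (say, with a time-to-live), since a client that crashes hard never reaches its `finally` block.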

My dumb take is I feel like the debate in the last 5 years roughly looks like this:

[Image: test in production]

(To be 100% clear, what I'm advocating for in this blog post isn't necessarily pushing out changes in front of users immediately, although I generally support that for other reasons not covered in this blog post: go read everything written by Charity Majors. I'm saying — let me use production-like infrastructure as much as possible throughout the process of building and testing code.)

The point about ephemeral resources gets 100x more powerful in conjunction with the previous point about letting me create resources quickly. The general pattern of how code gets built is that infrastructure has been decoupled from logic, and the logic is tested independently. Slightly simplified, you can think of the development process as a set of nested loops where the cycle time of each loop gets exponentially worse at each level:

[Image: programming loops]

At each loop level, the stakes get higher and the feedback cycle gets slower. This has an extremely strong relationship to productivity! The key thing to note is how important it is to shift concerns from outer loops into inner loops. Getting iteration speeds down by an order of magnitude has dramatic impacts on getting things done.

Having fast ephemeral cloud infra resources would let us move a lot of the infrastructure concerns from the outermost loop to the innermost loop. This lets you get feedback in seconds or at least minutes, rather than hours or more.

Code not configuration

There are at least 4 ways I can think of that you can interact with infrastructure:

  1. Web interface
  2. Local configuration, then run some command-line client that talks to the system
  3. APIs and you have to build the client yourself
  4. Client libraries

The first one is great to have but generally only for getting started. Once you have something set up, you typically move away from it as a way to make changes, and maybe only use it for monitoring etc.

Local configuration seems to be the general next step. Which is fine for a while, but half the time you realize that

  • Actually, I want this framework to be controlled by another framework at a higher level. In this case, you have two (both bad) options: expose configuration for both frameworks, or have the outermost framework generate configuration dynamically for the other framework.
  • You need to generate resources dynamically, maybe in a for-loop or whatever.

Now suddenly you move from YAML to YAML generated using Jinja or Handlebars or whatever. Slowly, you start adding custom functions to those template languages to make it easier to generate configuration. Eventually, it evolves into its own super-custom DSL with its own documentation.

This is super annoying! 10 times out of 10, I prefer to have everything accessible through a nice little client library. This library might in turn be a simple wrapper around a solid API. Now I can write my own for-loops! I can generate things dynamically! I don't have to learn a custom DSL! The world is a happy place again.
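As a rough sketch of what that looks like: `InfraClient`, `Queue`, and `create_queue` below are names I invented for illustration (a real library would call the provider's API instead of appending to a list), and the region and environment names are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Queue:
    name: str
    region: str
    max_retries: int


@dataclass
class InfraClient:
    """Hypothetical client library; a real one would wrap the provider's API."""

    queues: list = field(default_factory=list)

    def create_queue(self, name, region, max_retries=3):
        queue = Queue(name, region, max_retries)
        self.queues.append(queue)
        return queue


client = InfraClient()
for region in ["us-east", "eu-west"]:
    for env in ["staging", "prod"]:
        # Plain Python handles the loop and the conditional;
        # no template language or custom DSL needed.
        retries = 5 if env == "prod" else 1
        client.create_queue(f"jobs-{env}-{region}", region, max_retries=retries)

print([q.name for q in client.queues])
```

Two nested for-loops and one conditional replace what would otherwise be a templated YAML file, and a higher-level framework can call the same functions directly instead of generating configuration for this one.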

Built for productivity

I wanted to wrap this up under sort of a meta-point which isn't really a point in itself but more of a mindset change and maybe a corollary of all the other points.

Infrastructure feels like it's been built to solve hard scalability and reliability problems. There's some amazing infra and I'm in awe of how much hard thinking must have gone into it. But things are rarely built to optimize for developer productivity. I think long term, the tools that “win” are generally the tools that optimize directly for that. Actually it's not just productivity, it's also quality, and these tools push the quality-productivity tradeoff “up and to the right”:

[Image: programming tradeoff curve]

My point is, the new tradeoff curve lets you “cash out” the improvements in different ways: maybe purely as higher quality, maybe purely as higher productivity, maybe a bit of both.

To me, this represents a massive opportunity gap over the next 5-10 years. I can't wait for engineers to unleash another order of magnitude of productivity. There's so much software waiting to get built!

[Image: paradise]
