Here’s another example: Why Are Handsome Men Such Jerks?
I saw a bunch of tweets over the weekend about Peter Norvig claiming there’s a negative correlation between being good at programming competitions and being good at the job. There were some decent Hacker News comments on it.
Norvig’s statement is obviously not true if we’re drawing samples from the general population – most people can’t code. It doesn’t necessarily even have to do with time allocation as this commenter alluded to:
Being a champion at something requires excruciatingly narrow focus on one thing for an unusually long time. If you have a 4.0 GPA, play Rachmaninoff’s Piano Concerto No. 3, deadlift 400 pounds, or rank in the top 1000 chess players – you probably have to work on it for hours a day, for years, while ignoring everything else (unless, of course, you are one of those one-in-a-million polymaths).
Here’s the real reason: Google is already selecting for the top 1% of programmers using some criteria, leading to selection bias. Even if the two values are positively correlated in the population, the selection criterion can produce a negative correlation among the people selected.
But let’s start with the ideal case. Let’s say there’s a slight positive correlation between “being good at programming competitions” and “what really matters”. Let’s assume Google hires perfectly. Let’s assume everyone is on a multivariate Gaussian:
For all the people who were hired, I calculate the correlation between “programming competition skills” and “what really matters”. The correlation for hired people is almost 0.2 – and it’s still positive!
However let’s say Google for some reason puts too much weight on programming competitions during the interviews. We now get a negative correlation!
Does this mean it’s bad to hire people who are good at programming competitions? No, it just means that we probably overweighted it during the hiring process. If we lower the weight a bit, we get a positive correlation again:
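The whole setup is easy to simulate. Here’s a rough sketch – the correlation, weights, and hiring fraction are made-up numbers, not the ones behind the plots above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
rho = 0.5  # assumed positive correlation in the general population
skill, matters = rng.multivariate_normal(
    [0, 0], [[1, rho], [rho, 1]], size=n).T

def hired_corr(w, top_frac=0.05):
    # hire the top 5%, ranked by a weighted interview score
    score = w * skill + (1 - w) * matters
    hired = score >= np.quantile(score, 1 - top_frac)
    return np.corrcoef(skill[hired], matters[hired])[0, 1]

print(hired_corr(0.0))  # hiring purely on what matters: still positive (~0.2)
print(hired_corr(0.8))  # overweighting competitions: negative
```

Selecting on a weighted sum of two positively correlated variables is exactly the kind of collider bias that can flip the sign of the correlation within the selected sample.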
But in general, does this mean we should never look at programming competition skills? The reality is a lot more complicated: instead of observing what really matters, you observe some crappy proxy for it. And when all your metrics are noisy, you should put some nonzero positive weight on any metric that correlates positively with your target. Just not too much!
Sorry for spamming you with scatter plots, but it’s in the name of statistics! My point here is that you can tweak these variables and end up seeing correlations with pretty much any value. So when you have these complex selection biases you need to be super careful about how to interpret the data. It’s a great reminder that studies like Project Oxygen always need to be taken with a bucket of sea salt.
Are there other examples of selection biases leading to spurious correlations? Let me know!
Pinterest just open sourced Pinball, which seems like an interesting Luigi alternative. There are two blog posts: Pinball: Building workflow management (from 2014) and Open-sourcing Pinball (from this week). The author left a comment in the Hacker News thread:
Luigi was not publicly available when Pinball started, so I’m not sure about the pros and cons between Pinball and Luigi.
When we built Pinball, we aimed to build a scalable and flexible workflow manager that satisfies the following requirements (I’ll just name a few here):
- easy system upgrades – when we fix bugs or add new features, there should be no interruption to currently running workflows and jobs.
- easy workflow addition/testing – end users can easily add new jobs and workflows to the Pinball system without affecting other running jobs and workflows.
- extensibility – a workflow manager should be easy to extend. As the company and business grow, a lot of new requirements and features will be needed. And we love your contributions as well.
- flexible workflow scheduling policies and easy failure handling.
- a rich UI to easily manage your workflows – you can retry failed jobs (automatically or manually), skip jobs, or run a selected subset of a workflow’s jobs, all from the UI – you can easily access the full running history of your jobs, along with their stderr and stdout logs – you can also explore the topology of your workflow, with easy search.
- Pinball is very generic and can support different kinds of platforms: you can use different Hadoop clusters, e.g., a Qubole cluster or an EMR cluster, and write different kinds of jobs, e.g., Hadoop Streaming, Cascading, Hive, Pig, Spark, Python…
There are a lot of interesting things built into Pinball, and you should probably give it a try!
Sounds pretty similar to Luigi! My initial impression is that
- The architecture is a bit more advanced than Luigi’s and has some features that Luigi lacks. From what I can tell, it comes with task storage out of the box (whereas Luigi’s task history DB is still not entirely integrated), distributed execution, and a triggering mechanism. These are all areas where Luigi still needs some love.
- The workflow API seems very convoluted. I don’t really understand how the code works, and there’s a lot of boilerplate.
Fun to have something to compare to. Not that I want to rationalize Luigi’s missing features, but in general I would argue that the importance of good API design is underrated compared to good architecture. I still believe the key thing for a workflow manager is to reduce boilerplate and configuration wherever possible. It’s slightly harder to create an easy-to-use API than to think hard about the architecture and check all the boxes for every feature.
Hopefully we’ll see more of these in the future. Obviously being Luigi’s author, I think Luigi is an awesome tool. But I think it’s 10% of what it could be, and diversity in this space is great for innovation. There’s a lot of them now: Oozie, Azkaban, Drake, Pinball, etc. Some people apparently use Jenkins for workflow management. A wildcard I encountered the other day is Ketrew. I wish I knew enough OCaml to understand what’s going on!
Wow I guess it was more than a year ago that I tweeted this. Crazy how time flies by. Anyway, here’s my rationale:
When I update one line of code I feel like I have to put in a long explanation about its side effects, why it’s fully backwards compatible, and why it fixes some issue #xyz.
When I refactor 500 lines of code, I get too lazy to write anything sensible, so I just put “refactoring FooBarController”. Note: don’t try this at home!
I decided to plot the relationship for Luigi:
The plot is clickable! Check it out! It’s an old-school image map, which is pretty pathetic since no one has used those since 1997, but it was just so much easier for this task. Hover over any point to see the commit message and click on it to jump to the commit on Github.
As you can see, there’s essentially no relationship between the two values. Not as spectacular as I was hoping for, but still kind of weird/interesting.
Code is here if you’re curious!
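If you want to reproduce something like this, the extraction step is simple. A sketch of the idea – the sample log and regex here are illustrative, not the actual code linked above:

```python
import re

# pretend output of: git log --pretty='%H %s' --shortstat
sample = """\
abc123 Fix off-by-one in date range handling for hourly tasks
 1 file changed, 2 insertions(+), 2 deletions(-)
def456 refactoring FooBarController
 14 files changed, 512 insertions(+), 498 deletions(-)
"""

def message_length_vs_diff_size(log_text):
    # returns one (commit message length, lines changed) pair per commit
    points = []
    lines = log_text.splitlines()
    for i in range(0, len(lines), 2):
        _, message = lines[i].split(" ", 1)
        changed = sum(int(n) for n in
                      re.findall(r"(\d+) (?:insertion|deletion)", lines[i + 1]))
        points.append((len(message), changed))
    return points

print(message_length_vs_diff_size(sample))
# [(54, 4), (28, 1010)] – the big refactor gets the short message
```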
For most people straight out of school, work life is a bit of a culture shock. For me it was an awesome experience, but a lot of the constraints were different and I had to learn to optimize for different things. It wasn’t necessarily the technology that I struggled with. The hardest part was managing my own projects and my time, as well as figuring out how to grow and make an impact as an engineer. I’ve listed some of my biggest mistakes, which are also mistakes I see other (mostly junior) engineers make.
Having the wrong scope
How do you know what’s the right amount of work to spend on a project? I had horrible intuition about this coming out of school. One thing I think is helpful is to think of the relationship between time spent and impact. For a given project, it looks something like this:
It usually ends up being a concave function.
How do you pick a point on this curve? If you only have one task, it’s usually pretty easy because you have some constraint on total time or total impact. In school, you usually work on a task until it hits a certain y value (the problem is solved) or a certain x value (time to turn in what you have).
The problem is that in real life you need to pick not just one point on one curve, but a point on each of many curves – actually, an infinite number of curves. And you need to pick these points subject to the constraint that you get the maximum value per unit of time invested.
This is a much harder problem! It means the amount of time we spend on task A is determined not just by how hard task A is, but also by how hard an infinite number of other tasks are.
Let’s get mathematical here: for this concave optimization problem, you can show that at the optimum the marginal impact of time spent on each task should be identical. (I really want to write a book some day called The Mathematics of Project Management.)
This means: recognize when the marginal impact of spending more time on a project starts to get low and you would get more marginal impact elsewhere. Or just ask: is this already good enough to deliver user value? Then take a break and look at the whole portfolio of possible tasks: ignoring what I have done so far, what’s the highest-impact next thing I can do?
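A toy version of that argument, with two made-up concave impact curves: if you greedily spend each small slice of time on whichever task currently has the highest marginal impact, the marginal impacts are driven toward each other.

```python
import math

# two hypothetical concave impact-vs-time curves (diminishing returns)
impact = [lambda t: 10 * math.log1p(t), lambda t: 6 * math.sqrt(t)]

def marginal(f, t, dt=1e-3):
    # finite-difference estimate of the marginal impact at time t
    return (f(t + dt) - f(t)) / dt

time_spent = [0.0, 0.0]
dt = 0.01
for _ in range(int(10 / dt)):  # allocate 10 units of time in total
    best = max(range(len(impact)),
               key=lambda i: marginal(impact[i], time_spent[i]))
    time_spent[best] += dt

margins = [marginal(impact[i], time_spent[i]) for i in range(len(impact))]
print(time_spent, margins)  # the two marginal impacts come out (nearly) equal
```

The greedy rule is exactly the “equal marginal impact” condition in disguise: it only stops pouring time into a task once some other task offers more impact per hour.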
Focusing only on the things you are supposed to focus on
This might sound weird. What are you supposed to do at work? Most of the time you should probably do what your team/manager told you to do. But guess what? Your team/manager is not an all-seeing all-knowing oracle. Sometimes you might actually have a better idea of what to do.
Your sole purpose in working somewhere is to deliver value for the company. Completing a task from the backlog is a great proxy for that. But it’s still a proxy, and as such it has no intrinsic value. In many cases there are even higher-leverage things that no one will tell you to do. For instance, look around you: is the team struggling with some old framework? Can you help someone get unblocked?
I like people to come in every morning and ask themselves: what is the highest-impact thing I can do for the company today? And then do that. If you think about it, a task backlog is a completely artificial construct, needed only because we don’t have perfect information.
This gets especially important if you are interested in management roles. The higher up you get, the less people are going to tell you what to do.
Silly obligatory visualization:
Focusing only on low-leverage activities
There’s only so much leverage you get as an individual contributor – even if you’re a 10x engineer. Look around for things with a force multiplier built in. Usually that means applying something to the entire team. Are you using the wrong language for the task? Spend a few days investigating something else, introduce it to the team, and watch the whole team move twice as fast.
I used to work with Neville Li at Spotify, who was a genius at finding these opportunities. He also never did what you told him to. Instead, he would spend days reading blogs and trying new technologies. Every few months he would find something that made the whole team twice as productive. Then he would organize a workshop, introduce it to the team, and move on.
Not realizing technology isn’t just a job
This is probably my most cynical note, or optimistic, depending on how you look at it.
The truth is, software engineering isn’t just a normal job. It’s a lifestyle. It’s also a field that keeps changing from year to year. If you want to be successful, you need to stay up to date. If you want to be above average, you need to do things like:
- Working on side projects
- Reading tech blogs
- Following influencers on Twitter
- Going to meetups
- Reading papers
Being a software engineer is a fantastic career in many ways. With lots of freedom comes a lot of responsibility. If you want to stay fresh, you need to invest a fair amount of your spare time.
Not drawing diagrams on glass walls
This is a no-brainer. Everyone knows that solid software engineers draw everything on glass walls. And they write everything flipped horizontally because it’s cooler.
I love technology. Go write some kick ass code now.
February 6 was my last day at Spotify. In total I spent more than six years at Spotify and it was an amazing experience.
I joined Spotify in Stockholm in 2008, mainly because a bunch of friends from programming competitions had joined already. Their goal to change music consumption seemed ridiculous at that point, but six years later I think it’s safe to say they actually succeeded.
Back in the early days, my job was to do almost anything related to data. I think the range of tasks I was responsible for has now grown into 100+ people at Spotify. My day-to-day tasks were all over the map: maintaining Hadoop, making PowerPoint presentations, label reporting, running A/B tests, optimizing ad delivery, forecasting ad delivery, building music recommendations, and much more (for most of that time we were actually three people, though, not just me).
It was an amazing learning experience to see a company grow this way. I think a company goes through different challenges at every stage, both technically and organizationally (honestly a lot more of the latter compared to the former).
I’ve been craving to go back and go through the same journey again, so I’ve joined a small startup in NYC as the head of engineering. I will share more details soon. Hopefully this will be an opportunity to apply all the things I learned at Spotify.
Oskar Stål, the CTO of Spotify and a great mentor, would always tell me that I have to decide between machine learning and the “CTO ladder” at some point. I made a conscious decision right now to focus more on management and building teams. I think this might be the topic of some future blog post, but not now.
Stay tuned for more updates!
I just made it to Sweden suffering from jet lag induced insomnia, but this blog post will not cover that. Instead, I will talk a little bit about technical debt.
The concept of technical debt has always resonated with me, partly because I like the analogy with “real” debt. If you take the analogy really far, there are some curious implications. I like to think about the “interest rate” of software development. Debt is really just borrowing from the future, at some interest rate: you get a free lunch right now, but you need to pay back 1.2 free lunches in a few months. That’s the interest. In a software project, the equivalent could be picking a database that will have scalability issues later, or making all the member variables of some class public. You do it because it makes things easier now, but you will have to pay the cost of that later.
A recent paper from Google stretches the analogy in its title: Machine Learning: The High-Interest Credit Card of Technical Debt. It focuses specifically on machine learning, but definitely read it if you are interested. A recent blog post questions whether tech debt is really “debt” in the strict sense (you borrow a fixed amount and pay back slightly more) or whether it has a more complicated structure: Bad code isn’t Technical Debt, it’s an Unhedged Call Option.
I like the blog post because it brings up something I have noticed many times. A lot of developers have an intuitive aversion towards tech debt and always want to fix anything that’s perceived as “hacky”. FooBarController is a 1,000-line mayhem that no one understands – we need to refactor it! But say FooBarController is a well-separated component that you have no intention of ever modifying; then there’s really no reason to fix it. It’s almost always a waste of time to fix bad code or bad architecture unless you have at least some idea of how it helps you in the future.
So in some cases it makes sense not to fix technical debt. In other cases, it makes sense to take on tech debt deliberately. Back to the interest rate analogy: if the interest rate is lower than the return on investment, you should borrow money from the bank. It’s fine to ship a product a year earlier with hacky code if it makes you a lot of money and lets you hire a ton of developers to clean it up later. The concept of interest rates applies both to financing and to software engineering.
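In made-up numbers, the borrowing rule looks like this:

```python
principal = 100.0     # resources borrowed from the future, in free-lunch units
interest_rate = 0.20  # the hack costs 1.2x to clean up later
roi = 0.50            # shipping a year earlier returns 1.5x

# borrow whenever the return beats the interest
payoff = principal * (1 + roi) - principal * (1 + interest_rate)
print(payoff)  # 30.0: worth taking on the debt
```

As long as the return beats the interest, the hack pays for its own cleanup.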
In my experience, the biggest issue isn’t whether or not to take on technical debt. As long as you make a conscious decision to take on tech debt, and everyone agrees it’s tech debt that you might need to fix later, you’re in the clear. You will get problems if you build up technical debt without acknowledging it. I made a chart to make it clear:
| Are you taking on tech debt? | Do you think so? – No | Do you think so? – Yes |
|---|---|---|
| No | Ok, cool | Don’t worry so much! |
| Yes | (Tony Soprano picture) | (bank picture) |
The bottom left picture is Tony Soprano knocking on your door because he’s here to collect the debt you owe him. What happened is, you saw this investment (real estate?) that you thought would appreciate 10% year over year. You borrowed money from Tony, but you never realized you might have to pay it back. It turns out the interest was a lot heftier than you thought, and now he wants the money back a year later at a 50% interest rate.
The bottom right picture is you going to the bank because you want to buy real estate. You examine the interest rates and make a decision to get a mortgage.
These pictures might not illustrate the point super well, because the bottom right also covers this situation: borrowing at a high interest rate because the return on investment is even higher. Maybe you know of this boxing match that’s already rigged, and it’s 5:1 odds. You won’t be able to borrow money from the bank, so you go to Tony Soprano and borrow it for a few days. Next week, you pay it back with some interest, but you still made a ton of money.
Back to software engineering. The example above is like shipping the v2.0 of your web shop on time, and it turns out to be much better for users. You sell twice as much now! But you also have a bunch of scripts you have to run manually every day. You clearly should automate those scripts later, and it might be really messy to do so, but it’s also clear that you can do that later. You made a deliberate decision to borrow some resources from the future, because the return of your investment was really high.
I was talking with some data engineers at Spotify and had a moment of nostalgia.
I was writing my master’s thesis at Spotify and had to run a Hadoop job to extract some data from the logs. Every time I started the job, I kept hearing this subtle noise. I noticed the correlation for a few days but was too intimidated to ask. Finally people started cursing that their machines had gotten really slow lately, and I realized we were running Hadoop on the developers’ desktop machines. No one had told me. I think back then we had only gigabytes of log data. I remember running less on the logs and recognizing half the usernames because they were my friends.
We took a bunch of machines and put them on a pallet in the foosball room. It was a super hot Swedish summer and I kept running this matrix factorization job in Hadoop that would fail halfway through. The node on the top of the pile would crash and you had to reboot it. I suspected overheating. We had a fan running in the room but it wasn’t helping. Finally I realized the problem was the sun was shining in through the window.
I found a big sheet or blanket and some nails and a hammer and put it up over the window. I was finally able to run my matrix factorization job to completion after doing this. This is probably going to be my favorite bug fix until the day I die.
In the summer of 2009, we installed a 30-node Hadoop cluster in our data center in Stockholm. Finally a “real” cluster.
More and more people started using Hadoop so we decided to move to Elastic Mapreduce. I uploaded all our logs to S3 and we put together some tooling so that you could run things on our own Hadoop cluster or on EC2 using the same source code. It was pretty beautiful but the performance wasn’t super great compared to how much we were paying.
Later in 2011 we had grown even more. We decided to move back to our own data center. We installed 500 nodes in our data center in London, later upgrading it to 700 and then 900 nodes.
I also implemented Luigi as a workflow engine with Mapreduce support in late 2011.
There was this long-standing assumption (at least one I had) that Hadoop jobs were I/O bound and thus the language didn’t matter. We were using Python for probably 95% of all jobs, with some stuff in Hive by the analytics team. From 2012 onward, we started realizing Python isn’t the ideal language, both from a performance and a usability point of view. Eventually we ended up switching to Crunch and Scalding. We still use Luigi as the workflow engine to glue everything together.