Articles tagged with Statistics

Mortality statistics and Sweden's "dry tinder" effect

2020-09-23

We live in a year of about 350,000 amateur epidemiologists and I have no desire to join that “club”. But I read something about COVID-19 deaths that I thought was interesting and wanted to see if I could replicated it through data. Basically the claim is that Sweden had an exceptionally “good” year in 2019 in terms of influenza deaths causing there to be more deaths “overdue” in 2020.

How to hire smarter than the market: a toy model

2020-01-13

Let’s consider a toy model where you’re hiring for two things and that those are equally valuable. It’s not very important what those are, so let’s just call them “thing A” and “thing B” for now. For one set of abilities, the scatter plot looks like this:

Buffet lines are terrible, but let's try to improve them using computer simulations

2019-10-16

My company has a buffet every Friday, and the lines grow to epic proportions when the food arrives. I’ve suspected for years that the “classic” buffet line system is a deeply flawed and inefficient method, and every time I’m stuck in the line has made me more convinced.

Modeling conversion rates using Weibull and gamma distributions

2019-08-05

This is a blog post originally featured on the Better engineering blog. If you want to link to this article or share it, please go to the original post URL! Separately, I’m sorry it’s been so long with no posts on this blog. Between kids, moving, and being a startup CTO, I’ve been busy. I have a few posts coming down the pipe though, so stay tuned…

Why software projects take longer than you think: a statistical model

2019-04-15

Anyone who built software for a while knows that estimating how long something is going to take is hard. It’s hard to come up with an unbiased estimate of how long something will take, when fundamentally the work in itself is about solving something. One pet theory I’ve had for a really long time, is that some of this is really just a statistical artifact.

The hacker's guide to uncertainty estimates

2018-10-08

It started with a tweet:

New years resolution: every plot I make during 2018 will contain uncertainty estimates
— Erik Bernhardsson (@bernhardsson) January 7, 2018

Why? Because I’ve been sitting in 100,000,000 meetings where people endlessly debate whether the monthly number of widgets is going up or down, or whether widget method X is more productive than widget method Y. For almost any graph, quantifying the uncertainty seems useful, so I started trying. A few months later:

Waiting time, load factor, and queueing theory: why you need to cut your systems a bit of slack

2018-03-27

queue

I’ve been reading up on operations research lately, including queueing theory. It started out as a way to understand the very complex mortgage process (I work at a mortgage startup) but it’s turned into my little hammer and now I see nails everywhere.

Learning from users faster using machine learning

2017-12-12

I had an interesting idea a few weeks ago, best explained through an example. Let’s say you’re running an e-commerce site (I kind of do) and you want to optimize the number of purchases.

Let’s also say we try to learn as much as we can from users, both using A/B tests but also using just basic slicing and dicing of the data. We are looking at how many people convert (buy our widgets) but a constant problem is there’s just too much uncertainty.

The half-life of code & the ship of Theseus

2016-12-05

trireme

As a project evolves, does the new code just add on top of the old code? Or does it replace the old code slowly over time? In order to understand this, I built a little thing to analyze Git projects, with help from the formidable GitPython project. The idea is to go back in history historical and run a git blame (making this somewhat fast was a bit nontrivial, as it turns out, but I’ll spare you the details, which involve some opportunistic caching of files, pick historical points spread out in time, use git diff to invalidate changed files, etc).

Subway waiting math

2016-07-09

Why does it suck to wait for things? In a previous post I analyzed a NYC subway dataset and found that at some point, quite early, it’s worth just giving up.

This isn’t a proof that the subway doesn’t run on time – in fact it might actually proves that the subway runs really well. The numbers indicate that it’s not worth waiting after 10 minutes, but it’s a rare event and usually involves something extraordinary like a multi-hour delay. You should roughly give up after some point related to the normal train frequency, and 10 minutes is not a lot at all. Conversely if the trains ran hourly, it probably would had been worth waiting an hour or more. My analysis gave me a lot of respect for the job MTA is doing.

NYC subway math

2016-04-04

Apparently MTA (the company running the NYC subway) has a real-time API. My fascination for the subway takes autistic proportions and so obviously I had to analyze some of the data. The documentation is somewhat terrible, but here’s some relevant code for how to use the API:

More MCMC – Analyzing a small dataset with 1-5 ratings

2015-12-05

I’ve been obsessed with how to iterate quickly based on small scale feedback lately. One awesome website I encountered is Usability Hub which lets you run 5 second tests. Users see your site for 5 seconds and you can ask them free-form questions afterwards. The nice thing is you don’t even have to build the site – just upload a static png/jpg and collect data.

MCMC for marketing data

2015-10-31

The other day I was looking at marketing spend broken down by channel and wanted to compute some simple uncertainty estimates. I have data like this:

<th>
  Total spend
</th>

<th>
  Transactions
</th>

<td>
  2292.04
</td>

<td>
  9
</td>

<td>
  1276.85
</td>

<td>
  2
</td>

<td>
  139.59
</td>

<td>
  3
</td>

<td>
  954.98
</td>

<td>
  5
</td>


Channel A
Channel B
Channel C
Channel D

Of course, it’s easy to compute the cost per transaction, but how do you produce uncertainty estimates? Turns out to be somewhat nontrivial. I don’t even think it’s possible to do a t-test, which is kind of interesting in itself.

It's called Berkson's paradox!

2015-04-09

As noted by multiple tweets, my previous post describes a phenomenon denoted Berkson’s paradox.

Here’s another example: Why Are Handsome Men Such Jerks?