Deep learning for… Go

This is the last post about deep learning for chess/go/whatever. But this really cool paper by Christopher Clark and Amos Storkey was forwarded to me by Michael Eickenberg. It’s about using convolutional neural networks to play Go. The authors of the paper do a much better job than I would ever have done of modeling move prediction in Go and show that their model beat certain Go engines.

The fascinating thing about this paper is that playing against other Go engines, they just plug in their move prediction function, with no deep search beyond one level. That means the total time it spends is a fraction of the opponents. Still, the fact that it plays so well speaks for its strength.

So what happened if we could plug this into a deep search framework? The authors suggest doing exactly that in the conclusion. State of the art of Go engines actually use Monte Carlo tree search rather than minimax but other than that, it’s the same principle.

I talked a bit with the authors and the main thing that you have to change is to switch from move prediction to an evaluation function. For my chess experiments, I found a (hacky) way to train a function that does both at the same time. There’s essentially two terms in my objective function: one is comparing the actual move with a random move, using a sigmoid:

\frac{P(q)}{P(q) + P(r)} = \frac{exp(f(q))}{exp(f(q)) + exp(f(r))}.

If you extend that to all possible random moves you actually get a full probability distribution (a softmax) over all possible next moves.

P(p \rightarrow q) = \frac{exp(f(q))}{\sum exp(f(r)) }.

Now, how do you “convert” that into an evaluation function? That’s the second term, which tries fit the negative parent score to the current score. We penalize the quantity f(p) + f(q) by throwing in two more sigmoids. It’s a “soft constraint” that has absolutely no probabilistic interpretation. This a hacky solution, but here’s how I justify it:

  1. Note that the evaluation functions are unique up to a monotonic transform, so we can actually mangle it quite a lot.
  2. The softmax distribution has one degree of freedom in how it chooses the quantities, so (I’m speculating) the artificial constraint does not change the probabilities.

I think you could do the exact thing with their Go engine. In fact I’m willing to bet a couple of hundred bucks that if you did that, you would end up with the best Go engine in the world.

Btw another fun thing was that they plot some of the filters and they seem as random as the ones I learn for Chess. But a clever trick enforcing symmetry seem to help the model quite a lot.

Deep learning for… chess (addendum)

My previous blog post about deep learning for chess blew up and made it to Hacker News and a couple of other places. One pretty amazing thing was that the Github repo got 150 stars overnight. There was also lots of comments on the Hacker News post that I thought were really interesting. (See this skeptical comment for instance).

A couple of things came up in several places. I actually fully agree with a lot of the skepticism my blog post got. Here’s a bit of clarification + other stuff

My assumption that amateur players make near-optimal moves

Let me retract that statement a bit. But just a little bit. There’s several ideas here. The first one is that if 1,000 amateur chess players could vote for the next move, that move is probably pretty strong. There’s some anecdotal evidence suggesting that a large amount of amateurs actually, eg. Kasparov vs the World. The cool thing is that training this machine learning model, it will actually learn to pick the move that corresponds to what “most” players would choose. (You can actually see that the probability distribution over all next valid moves are given by a softmax distribution where the z values are given by the evaluation function).

The second idea is that a lot of moves are pretty obvious, because you are forced to do something. The third thing is that almost any move is good compared to a random move.

I think in hindsight it’s probably not correct that most moves by amateur players are “near-optimal”, but I don’t think it matters for the model.

What does each layer show if you look at it?

I looked at it, but it’s pretty much all over the place. Unlike convolutional neural networks, where the first layer often represents edges, there is nothing like that in this network. It seems like the logic is encoded throughout the whole network. Here’s the first few coefficients of the first feature (out of the 2048 features in the first layer), ranked in decreasing order of magnitude:

0.0856 q @ e7
-0.0686 P @ f7
0.0658 q @ f6
0.0657 P @ d3
-0.0655 r @ c6
0.0650 N @ e4
0.0648 P @ d6
-0.0625 r @ e6
0.0625 q @ d6
0.0588 p @ d7

White pieces are upper case, black are lower case. I don’t see much going on here.

There is actually at least one paper about using deep neural networks for Go

Ilya Sutskever and Vinod Nair wrote this paper in 2008. It even uses convolutional neural networks. It only has about 10k parameters (compared to 10M in my model) but it does something very similar to what I did: it tries to predict the next move of an expert player. I’m not sure why they didn’t evaluate playing with it though. I would guess it probably needs a lot more parameters to play well.



Deep learning for… chess

I’ve been meaning to learn Theano for a while and I’ve also wanted to build a chess AI at some point. So why not combine the two? That’s what I thought, and I ended up spending way too much time on it. I actually built most of this back in September but not until Thanksgiving did I have the time to write a blog post about it.

What’s the theory?

Chess is a game with a finite number of states, meaning if you had infinite computing capacity, you could actually solve chess. Every position in chess is either a win for white, a win for black, or a forced draw for both players. We can denote this by the function f(\mbox{position}). If we had an infinitely fast machine we could compute this by

  1. Assign all the final positions the value \{-1, 0, 1\} depending on who wins.
  2. Use the recursive rule
    f(p) = \max_{p \rightarrow p'} -f(p')
    where p\rightarrow p' denotes all the legal moves from position p. The minus sign is because the players alternate between positions, so if position p is white’s turn, then position p' is black turns (and vice versa). This is the same thing as minimax.

There’s approximately 10^{43} positions in chess, so there’s no way we can compute this. We need to resort to approximations to f(p).

What’s the point of using machine learning for this?

What machine learning really boils down to is approximating functions given data. So assuming we can get a lot of data to learn this from, we can learn this function f(p). Once we have a model, an objective, and training data, we can go knock ourselves out.

I downloaded 100M games from FICS Games Database and began training a machine learning model. My function f(p) is learned from data by using two principles

  1. Players will choose an optimal or near-optimal move. This means that for two position in succession p \rightarrow q observed in the game, we will have f(p) = -f(q).
  2. For the same reason above, going from p, not to q, but to a random position p \rightarrow r, we must have f(r) > f(q) because the random position is better for the next player and worse for the player that made the move.

The model

We construct f(p) as a 3 layer deep 2048 units wide artificial neural network, with rectified linear units in each layer. The input is a 8 * 8 * 12 = 768 wide layer which indicates whether each piece (there are 12 types) is present in each square (there are 8 * 8 squares). After three matrix multiplications (each followed by a nonlinearity), there’s a final dot product with a 2048-wide vector to condense it down to a single value.

In total there’s roughly 10M unknown parameters in the network.


To train the network, I present it with (p, q, r) triplets. I feed it through the network. Denoting by S(x) = 1 / (1 + \exp(-x)), the sigmoid function, the total objective is:

\sum_{(p, q, r)} \log S(f(q) - f(r)) + \kappa \log (f(p) + f(q)) + \kappa \log (-f(q) - f(p))


This is the log likelihood of the “soft” inequalities f(r) > f(q), f(p) > -f(q), and f(p) < -f(q). The last two are just a way of expressing a “soft” equality f(p) = -f(q). I also use \kappa to put more emphasis on getting the equality right. I set it to 10.0. I don’t think the solution is super sensitive to the value of \kappa.

Notice that the function we learn has no idea about the rules of chess. We’re not even teaching it how each piece move. We make sure the model has the expressiveness to work out legal moves, but we don’t encode any information about the game itself. The model learns this information by observing tons of chess games.

Note that I’m also not trying to learn anything from who won the game. The reason is that the training data is full of games played by amateurs. If a grandmaster came into the middle of a game, s/he could probably completely turn it around. This means the final score is a pretty weak label. Still, even an amateur player probably makes near-optimal moves for most time.

Training the model

I rented a GPU instance from AWS and trained it on 100M games for about four days using stochastic gradient descent with Nesterov momentum. I put all (p, q, r) triplets into a HDF5 data file. I was messing around with learning rates for a while but after a while I realized I just wanted something that would give me good results in a few days. So I ended using a slightly unorthodox learning rate scheme: 0.03 \cdot \exp(-\mbox{time in days}). Since I had so much training data, regularization wasn’t necessary, so I wasn’t using either dropout or L2 regularization.

A trick I did was to encode the boards as 64 bytes and then transform the board into a 768 units wide float vector on the GPU. This gave a pretty substantial performance boost since there’s a lot less I/O.

How does a chess AI work?

Every chess AI starts with some function f(p) that approximates the value of the position. This is known as evaluation function.

This function is also combined with a deep search of many millions of positions down the game tree. It turns out that an approximation of f(p) is just a small part of the playing chess well. All chess AI’s focus on smart search algorithms, but the number of positions explode exponentially down the search tree, so in practice you can’t go deeper than say 5-10 positions ahead. What you do is you use some approximation to evaluate leaf nodes and then use some variety of negamax to evaluate a game tree of a bunch of possible next moves.

By applying some smart searching algorithm, we can take pretty much any approximation and make it better. Chess AI’s typically start with some simple evaluation function like: every pawn is worth 1 point, every knight is worth 3 points, etc.

We’re going to take the function we learned and use it to evaluate leaves in the game tree. Then try to search deep. So we’re first going to learn the function f(p) from data, then we’re going to plug it into a search algorithm.

Does it work?

I coined my chess engine Deep Pink as an homage to Deep Blue. As it turns out, the function we learn can definitely play chess. It beats me, every time. But I’m a horrible chess player.

Does Deep Pink beat existing chess AI’s? Sometimes

I pit it against another chess engine: Sunfish by Thomas Dybdahl Ahle. Sunfish is written entirely in Python. The reason I chose to stick to the same language was that I didn’t want this to be an endless exercise of making fast move generation. Deep Pink also relies heavily on quick move generation, and I didn’t want to spend weeks working out edge cases with bitmaps in C++ to be able to compete with the state of the art engines. That would just be an arms race. So to be able to establish something useful, I picked a pure Python engine.

The obvious thing in hindsight is: the main thing you want out of any evaluation function f(p) isn’t accuracy, it’s accuracy per time unit. It doesn’t matter that one evaluation function is slightly better than another if it’s ten times slower, because you can take the fast (but slightly worse) evaluation function and search more nodes in the game tree. So you really want to take into account the time spent by the engine. Without further ado, here’s some results of playing against the engine many times:

Notice the log-scale. The x-axis and y-axis aren’t super relevant here, the main thing is the distance to the diagonal, because that tells us which engine spent more CPU time. Every game I randomized the parameters for each engine: the max depth for Deep Pink, and the max number of nodes for Sunfish. (I didn’t include draws because both engines struggle with it).

Not surprisingly, the more time advantage either side has, the better it plays. Overall, Sunfish is better, winning the majority of the games, but Deep Pink probably wins 1/3 of the time. I’m actually pretty encouraged by this. I think with some optimizations, Deep Pink could actually play substantially better:

  • Better search algorithm. I’m currently using Negamax with alpha-beta pruning, whereas Sunfish uses MTD-f
  • Better evaluation function. Deep Pink plays pretty aggressively, but makes a lot of dumb mistakes. By generating “harder” training examples (ideally fed from mistakes it made) it should learn a better model
  • Faster evaluation function: It might be possible to train a smaller (but maybe deeper) version of the same neural network
  • Faster evaluation function: I didn’t use the GPU for playing, only for training.

Obviously the real goal wouldn’t be to beat Sunfish, but one of the “real” chess engines out there. But for that, I would have to write carefully tuned C++ code, and I’m not sure it’s the best way to spend my time.


I’m encouraged by this. I think it’s really cool that

  1. It’s possible to learn an evaluation function directly from raw data, with no preprocessing
  2. A fairly slow evaluation function (several orders of magnitude slower) can still play well if it’s more accurate
I’m pretty curious to see if this could fare well for Go or other games where AI’s still don’t play well. Either way, the conclusions above come with a million caveats. The biggest one is obviously that I haven’t challenged a “real” chess engine. I’m not sure if I have the time to start hacking on chess engines, but if anyone is interested, I’ve put all the source code up on Github.


Optimizing things: everything is a proxy for a proxy for a proxy

Say you build a machine learning model, like a movie recommender system. You need to optimize for something. You have 1-5 stars as ratings so let’s optimize for mean squared error. Great.

Then let’s say you build a new model. It has even lower mean squared error. You deploy it. This model turns out to give a lower mean squared error. You roll it out to users and the metrics are tanking. Crap! Ok so maybe mean squared error isn’t the right thing to optimize for.

The way you solve this, of course, is you start A/B testing your changes. But what metric to choose? People often ask me why we use one or another metric. We typically look at numbers like Daily active users, Day 2 retention, etc. But what if optimizing too hard for one hurts the other? What if you’re driving day 2 retention but screwing up month 2 retention?

What I like to remind myself is that everything is a proxy metric. We really want to maximize shareholder value or something like similar (let’s not get into a debate about this, so for the purpose of this blog post I’m going to assume that’s our goal).

The problem is, you can’t take all your hyperparameters and calculate the gradient

\frac{\partial}{\partial \Theta} \mbox{shareholder value}(\Theta)


That’s silly for many reasons, but let’s break it down. First of all, the functional relationship is highly stochastic, and depends on all kinds of external factors. Second of all, there’s no way we can even evaluate multiple values of this function, and it won’t be possible until we invent time machines. Third of all, there’s no way we can extract gradient information at all.

So what are we going to do? We’re going to invent a new function that we think is highly correlated with shareholder value. We could even define multiple functions if we want to, just to cross check different ones. But we will never be able to establish the correlation because of the reasons I just mentioned.

So that’s usually why metrics like daily active users are great. If you are trying to grow your company, it’s reasonable to assume that ultimately user growth will lead to success.

But in general, what properties should such a function have? Ideally as many as possible out of this:

  1. Should be highly correlated with shareholder value. For a company focusing on growth, the number of active users is probably a good one.
  2. Should be possible to measure separate independent outcomes, using A/B tests, blind tests, or something else. For instance, number of signups is tricky to test.
  3. Should be fast to measure. We don’t want to launch an A/B test and have to wait many months to get an answer
  4. Should have a high signal to noise ratio. You want to extract as much value from it. If you’re running an A/B test you want to reach statistical significance quickly.

One thing I’ve learned the hard way is that sometimes it’s often useful to pick a more biased metric if that means you can results faster or getting more results. For instance, we can roll out an A/B test with two different recommendation algorithms. We probably won’t see an impact on high level metrics such as retention, so we can pick a feature-specific metric instead.

But we can go even further. Still if we’re A/B testing, we probably have to run that test for two weeks to get any meaningful numbers. At the end of the day we obtain 1 bit of information (roughly) at best after two weeks, which tells us which test group won.

If we want to iterate quickly sometimes it makes a lot more sense to just take the output of the recommendation algorithms and let people go through a blind test. This is slightly more biased because we’re not using real users, but usually a lot quicker, and also lets us extract a lot more information. Not just do we learn which algorithm is the better, we often end up with lots of (noisy) anecdotal information, such as “algorithm A sucks for popular content” or “algorithm B is less diverse”.

At the lowest level, if you have a recommender algorithm, the model’s objective (eg. mean squared error) is another great proxy. Turns out it’s extremely easy to try multiple parameters and learn from it – all you need to do is to retrain the model. It totally dominates points 2, 3 and 4, but doesn’t do a great job on point 1.

TL;DR any metric you’re using is just a proxy. Pick the right one for your task.

Luigi conquering the world

I keep forgetting to buy a costume for Halloween every year, so this year I prepared and got myself a Luigi costume a month in advance. Only to realize I was going to be out of town the whole weekend. If anyone wants a Luigi costume, let me know!

(I’m not as big as the guy in the picture)

Anyway, that’s not the Luigi this blog post is about. This is about the Python workflow manager that I’ve open sourced. If you’re putting together batch jobs into complex pipelines, you should check it out.

There’s been a couple of companies in NYC using it for a while, eg. Foursquare, Bitly, and Mortar Data. The latter company even provides a hosted Luigi solution.

Lately, there’s been a surge of contributors (at the time of writing there’s 78 contributors, which is pretty cool!). People at Houzz has sent tons of pull requests, and they are not the only ones.

Two blog posts just came out, both describing their use of Luigi (among other things)

In addition, I found this interesting deck from Stripe:

There’s also a couple of meetups in the past and the future. There was a dedicated Luigi meetup on July 31 at the Spotify NYC office, also sponsored by Run Ads who use Luigi a lot. Here are my slides about Luigi’s future, heavily delayed:

I’m also talking about Luigi at the NYC Data Science meetup on Dec 19 (also at Spotify’s NYC office). Feel free to drop by and ask me some tough questions!


Annoying blog post

I spent a couple of hours this weekend going through some pull requests and issues to Annoy, which is an open source C++/Python library for Approximate Nearest Neighbor search.

I set up Travis-CI integration and spent some time on one of the issues that multiple people had reported. At the end of the day, it turns out the issue was actually caused by a bug in GCC 4.8. Some crazy compiler optimization introduced between 4.6 and 4.8 caused this loop to be removed:

if (indices.size() <= (size_t)_K) {
  for (size_t i = 0; i < indices.size(); i++)
    m->children[i] = indices[i];
  return item;

Replacing it with std::copy turned out to do the trick

if (indices.size() <= (size_t)_K) {
  copy(indices.begin(), indices.end(), m->children);
  return item;

It’s still bizarre, but I probably deserved it, given how Annoy is abusing C++. The m->children array is declared as being only 2 elements long, but I deliberately overflow the array because I allocate extra space after it. I think this might cause GCC to unroll the loop to run twice.

I always feel a bit more comfortable when it turns out that the compiler is introducing bugs rather than my code. Made me think of the Jeff Dean joke: Jeff Dean builds his code before committing it, but only to check for compiler and linker bugs.

Anyway, after fixing this in three separately places, it seems like it’s finally working. Dirk Eddelbuettel is working on an R implementation of Annoy which is fun to see.

I haven’t spent much time with Annoy in a year or two and looking around it seems like there’s some new competitors on the block. Panns is one of them, another one is the LSHForest pull request for scikit-learn. I haven’t looked at them thoroughly, but they are both Python-only and claim some advantages over Annoy. None of them implement mmap as a method to load indexes, which imho is Annoy’s killer feature.

There’s a performance benchmark featuring Annoy, LSHForest, and FLANN, written by the author of LSHForest. Annoy performs horribly in the benchmark, getting its ass severely kicked by the other two. After re-running the benchmark myself, I think what happened is that the bug I mentioned above was present for Annoy and that’s why it performed so bad. Re-running the benchmark (thanks for making it easily reproducible!) yields very different results.

It’s extremely hard to compare all trade-offs between index building time, index size, query performance, and accuracy. So please don’t take this super seriously. The only thing I changed in the benchmark was (1) I added Panns, for good measure (2) I reduced the number of trees for Annoy (and Panns) to 10 instead of using n_features. Without reducing the number of trees for Annoy, it gets pretty much 100% accuracy for all data sets, but takes several minutes to build each index. So to emphasize approximate aspect of ANN, I decided to sacrifice some accuracy to gain performance.

Pardon the lousy graphics, but here’s the result in all its glory:

In my layman terms:

  • Annoy and Panns outperform LSHF and FLANN significantly on accuracy.
  • Index building process is fast for LSHF and FLANN. Annoy takes a lot more time, but Panns is 10x slower than Annoy
  • FLANN is faster than Annoy for queries. Annoy is 10x faster than LSHF. Panns is super duper slow.

And with my severely biased conclusions:

  • If you want to use mmap for fast index loading, use Annoy
  • If you want to minimize file size at any cost, use Panns
  • If you want fast query times at any cost, use FLANN
  • If you want a pure Python solution, use LSHF
  • For anything else, use Annoy. Or am I going too far promoting my own projects now…?

Btw, I would love it if someone could help me reimplement the algo used by Panns in Annoy, since it seems pretty good.

For another comparison, check out Radim Řehůřek’s Performance Shootout of Nearest Neighbours.

All metrics below:

Time building index (s) Average query time (ms) Average accuracy
n_samples: 1000 n_features: 100
LSHF 0.02 5.46 0.59
Annoy 0.15 0.27 0.98
Flann 0.00 0.17 0.60
Panns 3.27 66.48 0.93
n_samples: 1000 n_features: 500
LSHF 0.10 7.11 0.61
Annoy 0.39 0.75 0.98
Flann 0.01 0.24 0.62
Panns 9.94 140.60 0.96
n_samples: 10000 n_features: 100
LSHF 0.25 8.01 0.61
Annoy 3.17 0.45 0.98
Flann 0.02 0.20 0.62
Panns 55.34 71.12 0.96
n_samples: 10000 n_features: 500
LSHF 1.29 9.50 0.15
Annoy 10.46 1.14 0.50
Flann 0.07 0.24 0.13
Panns 154.58 139.83 0.54
n_samples: 10000 n_features: 1000
LSHF 2.70 13.74 0.16
Annoy 18.28 2.32 0.49
Flann 0.11 0.32 0.12
Panns 278.21 257.45 0.49

The Filter Bubble is Silly and you Can’t Guess What Happened Next

I’m at RecSys 2014, meeting a lot of people and hanging out at talks. Some of the discussions here was about the filter bubble which prompted me to formalize my own thoughts.

I firmly believe that it’s the role of a system to respect the user’s intent. Any sensible system will optimize for user’s long-term happiness by providing info back to the user that s/he finds useful. This holds true as long as a system isn’t (a) stupid and recommends the wrong content (b) trying to push its own agenda, that may or may not be hidden.

When I go to Google and search for “banana” it tries to predict my intent. I expect to results about bananas back. It’s like I go to an Indian restaurant and order something – I expect to get Indian food back.

Now, the chef could be bad, and I might not get what most people consider Indian food. But that’s just a “bug”. The chef knows that he should respect my intent.

The other option is that I get some weird food because the chef has a hidden agenda. For instance, maybe s/he puts crack in my food so that I come back tomorrow. Or maybe, s/he gives me a hamburger, tells me this is Indian food, and blatantly lies that there’s no other Indian restaurants in United States.

Luckily, we live in a free market, and as long as there’s some competition, I believe that the truth will prevail. I’ll eventually figure out that there are other Indian restaurants and I will discover that they are much better. This also means that there’s really no incentive for the chef to serve something that’s wildly different that what I’m asking for.

My point is, I think any system’s role is to respect the intent of the user, and serve whatever you ask for. It’s the user’s role to decide what s/he wants to consume, and in a free market I think there will be those options out there.

Anyway, this story took a funny twist. As I was debating this over Twitter, some Twitter bot stepped in and started offering advice on Indian restaurants. Obviously unfiltered and neutral.

Detecting corporate fraud using Benford’s law

Note: This is a silly application. Don’t take anything seriously.

Benford’s law describes a phenomenon where numbers in any data series will exhibit patterns in their first digit. For instance, if you took a list of the 1,000 longest rivers of Mongolia, or the average daily calorie consumption of mammals, or the wealth distribution of German soccer players, you will on average see that these numbers start with “1″ about 30% of the time. I won’t attempt at proving this, but essentially it’s a result of scale invariance. It doesn’t apply to all numerical series, like IQ or shoe size, but this pattern turns out to pop up in a lot of places.

Since the theory predicts that the first digit follows a certain outcome, you can use it to find “strange” distributions that seem to disobey what we can expect statistically. The Wikipedia article mentions using Benford’s law to detect accounting fraud, and Greece was busted by researchers noting that the Greek macroeconomic data had an abnormally large deviation from what Benford’s law would predict. There’s another couple of papers and an interesting blog post applying Benford’s law to industry sectors.

For fun, I downloaded about 5,000 annual reports (10-K) for most publicly traded companies in the US, to see if there are big outliers.

Benford’s law predict that the probability of any first digit, 1-9, is

Q(d) = \left(\log (d+1) - \log d \right) / \log 10.

For every annual report, I calculate the empirical distribution, P(d) = n_d / \sum n_i where n_d is just the number of occurrences of a dollar amount starting with digit d. To correct for reports with few values, I smooth the measured digit distribution a bit and add  100\cdot P(d) “fake” counts to each n_d.

To measure the difference between expected and actual distributions, I use the KL-divergence which boils down to

D_{P | Q} = \sum_i \log \left( P(i) / Q(i) \right) P(i)


I downloaded the annual reports from SEC and extracted all figures from all tables containing dollar amounts. Since some amounts may occur many times, and skew the digit distribution, I only looked at the unique amounts that occurred in the report. I then extracted first non-zero digit of all amounts.

The distributions of digits for the top five outlier entries illustrate Benford’s law in practice:

On closer inspection, some of these seem legit. For instance, the #1 spot on the list, Mid-America Apartment Communities, Inc. has a long list of units across the country, and the average price per unit happens to cluster around $800.

Below is a list containing the 100 companies with the largest KL-divergence (most “fishy”). None of the companies stand out as having an outrageous distribution, and even the top companies on the list are very unlikely to have commit fraud. The prior belief of accounting fraud is basically extremely low. We would commit the prosecutor’s fallacy for singling out any of these numbers as fraudulent. Anyway, I’ll follow up with a new blog post in five years to see if any of the companies below were actually caught:

0.1311 Mid-America Apartment Communities, Inc.
0.0578 Power Integrations Inc.
0.0497 United Natural Foods,Inc.
0.0474 Lexicon Pharmaceuticals, Inc
0.0461 Pacific Office Properties Trust, Inc.
0.0414 Xilinx
0.0406 Host Hotels & Resorts Inc.
0.0391 World Acceptance Corp
0.0390 Immunomedics, Inc.
0.0388 Marriott International Inc.
0.0387 CVS Caremark Corporation
0.0382 Paychex, Inc.
0.0382 Luna Innovations Incorporated
0.0381 Capstead Mortgage Corporation
0.0370 Verso Paper Corp.
0.0370 Fastenal Co.
0.0364 Insperity, Inc.
0.0359 Diamond Hill Investment Group Inc.
0.0354 National Security Group Inc.
0.0345 GameStop Corp.
0.0342 Compass Minerals International Inc.
0.0340 SIRIUS XM Radio Inc.
0.0339 BP Prudhoe Bay Royalty Trust
0.0326 Investors Bancorp Inc.
0.0323 Kohlberg Capital Corporation
0.0319 Equity One
0.0319 Kona Grill Inc.
0.0313 Alliance Financial Corporation
0.0310 Zale Corporation
0.0310 Anadarko Petroleum Corporation
0.0308 Sigma-Aldrich Corp.
0.0304 Global Cash Access Holdings, Inc.
0.0300 Corcept Therapeutics
0.0294 Enbridge Energy Management LLC
0.0293 BJ’s Restaurants Inc.
0.0293 Air Transport Services Group, Inc.
0.0292 Fairchild Semiconductor International Inc.
0.0292 Universal Electronics Inc.
0.0291 Espey Manufacturing & Electronics Corp.
0.0290 Inland Real Estate Corporation
0.0286 W. R. Berkley Corporation
0.0285 Albemarle Corp.
0.0282 Koss Corp.
0.0281 Leap Wireless International Inc.
0.0279 Encore Wire Corp.
0.0276 UQM Technologies, Inc.
0.0276 DuPont Fabros Technology Inc.
0.0276 Applied Materials Inc.
0.0275 Destination Maternity Corporation
0.0272 Pepsico, Inc.
0.0271 CorVel Corporation
0.0270 Nathan’s Famous Inc.
0.0269 Sport Chalet, Inc.
0.0269 Key Technology Inc.
0.0268 Overhill Farms Inc.
0.0268 Digi International Inc.
0.0267 Materion Corporation
0.0265 DreamWorks Animation SKG Inc.
0.0265 NIC Inc.
0.0264 ANSYS Inc.
0.0258 Volterra Semiconductor Corporation
0.0258 Verenium Corporation
0.0258 KeyCorp
0.0255 Rockwell Collins Inc.
0.0254 Meritage Homes Corporation
0.0253 Perrigo Co.
0.0249 Zhone Technologies Inc
0.0249 McGrath RentCorp
0.0249 A.M. Castle & Co.
0.0248 Delta Natural Gas Co. Inc.
0.0247 Pervasive Software Inc.
0.0247 Senomyx
0.0247 ManTech International Corp.
0.0246 Ross Stores Inc.
0.0245 Bancorp Of New Jersey, Inc.
0.0245 Werner Enterprises
0.0244 Dillards Inc.
0.0244 Sparton Corp.
0.0243 Rudolph Technologies Inc.
0.0243 CyberOptics Corp.
0.0240 Hallador Energy Company
0.0238 DARA BioSciences, Inc
0.0238 Chico’s FAS Inc.
0.0237 Delcath Systems Inc.
0.0236 Pure Cycle Corp.
0.0235 Cytori Therapeutics
0.0235 Vonage Holdings Corporation
0.0235 Spectranetics Corporation
0.0235 Regal-Beloit Corporation
0.0234 ScanSource, Inc.
0.0234 Weyco Group Inc
0.0232 Ambassadors Group Inc.
0.0232 Rent-A-Center Inc.
0.0232 Accenture plc
0.0231 Idenix Pharmaceuticals
0.0231 KAR Auction Services, Inc.
0.0230 Progressive
0.0230 BCSB Bankcorp Inc.
0.0229 PCTEL, Inc.
0.0229 Cincinnati Financial Corp.

Again, a bunch of disclaimers: this is just a silly application, don’t take it seriously, elevator inspection certificate available in the building manager’s office, etc.