My issue with GPU-accelerated deep learning

I've been spending several hundred bucks renting GPU instances on AWS over the last year. The speedup from a GPU is awesome and hard to deny. GPUs have taken over the field. Maybe, following in the footsteps of Bitcoin mining, there's some research on using FPGAs (I know very little about this).

I don't think it's a coincidence that GPUs built for graphics turn out to be great for image classification using convolutional neural networks. When you're dealing with pixel data packed into 2D arrays, it's possible to parallelize all operations very efficiently.

My issue is that the complexity of each minibatch is $$ \mathcal{O}(n) $$, where $$ n $$ is the number of parameters. The larger the models you're dealing with, the bigger this issue becomes.
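To make that concrete, here's a rough numpy sketch of the worst offender, a flat softmax output layer: every output weight vector participates in the forward pass for every example, so the work per example grows linearly with that chunk of the parameters. The names and sizes (`V`, `d`, `W_out`) are made up for illustration.

```python
import numpy as np

# Illustrative sizes: a 100k-class output layer on top of a 100-dimensional hidden layer.
V, d = 100_000, 100
W_out = np.random.randn(V, d).astype(np.float32)   # V * d output parameters

def full_softmax(hidden):
    # One matrix-vector product touches all V rows: O(V * d) work per example,
    # regardless of which single class the example is actually about.
    logits = W_out @ hidden
    logits -= logits.max()                          # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

probs = full_softmax(np.random.randn(d).astype(np.float32))
```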

Word2vec uses a clever technique called hierarchical softmax to achieve $$ \mathcal{O}(\log n) $$ (more details here). I have no idea how to implement this on a GPU and I suspect it's impossible. Here's where the CPU shows its strength – traversing a logarithmic data structure takes a lot of branching and can't be expressed as a batch operation.
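To see why this is such a natural fit for a CPU and such an awkward fit for a GPU, here's a hedged sketch of the scoring step, assuming a binary Huffman tree over the vocabulary (the names `path`, `codes` and `node_vectors` are mine, not word2vec's actual internals):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(hidden, path, codes, node_vectors):
    """P(word | context) as a product of ~log2(V) binary decisions.

    path         -- indices of the inner nodes on the root-to-leaf path for this word
    codes        -- which branch (0 or 1) the path takes at each of those nodes
    node_vectors -- one learned vector per inner node (V - 1 of them in total)
    """
    p = 1.0
    for node, code in zip(path, codes):              # O(log V) iterations
        s = sigmoid(node_vectors[node] @ hidden)
        p *= s if code == 1 else (1.0 - s)           # data-dependent branch
    return p
```

Each example takes a different path through the tree, so the work is a data-dependent sequence of small dot products and branches rather than one big matrix multiply.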

Logarithmic data structures happen to be a field I'm pretty excited about, particularly for vector models and multi-class prediction problems. I'm the author of Annoy, which is a library for high-dimensional nearest neighbor queries, so it's something I've spent some time thinking about.
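For context, this is roughly what building and querying an Annoy index looks like from Python (based on the library's public API; the vectors here are just random data):

```python
from annoy import AnnoyIndex
import random

f = 40                               # dimensionality of the vectors
index = AnnoyIndex(f, 'angular')     # angular distance ~ cosine similarity
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])

index.build(10)                      # 10 trees; more trees -> better recall, bigger index
print(index.get_nns_by_item(0, 10))  # the 10 approximate nearest neighbors of item 0
```

Under the hood each query descends a forest of random-projection trees, which is exactly the kind of branch-heavy, roughly logarithmic traversal that a CPU handles well and a GPU doesn't.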

For collaborative filtering and natural language processing, GPU architectures are highly constraining. I suspect that once you hit a billion parameters or so, more specialized networks that use logarithmic data structures will outperform for NLP and CF. The speedup from the brute-force GPU approach will be offset by the smarter data structures that a CPU can handle. I haven't seen any research on this, but it seems to me like a huge opportunity. In particular, I would love to see hybrid architectures that can use a GPU for the “dense” networks and a CPU for the “sparse” networks.
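Purely as a hypothetical sketch of what that split could look like (nothing here is an existing system): the dense layers run as one batched matrix multiply, and the sparse output is handled per example with a logarithmic structure like the hierarchical softmax traversal above.

```python
import numpy as np

def hybrid_forward(X, W_dense, sparse_output):
    # X: (batch, d_in), W_dense: (d_in, d_hidden).
    # Dense, batched matrix multiply: the part a GPU is great at.
    hidden = np.maximum(X @ W_dense, 0.0)            # ReLU hidden layer
    # Per-example, branch-heavy traversal (e.g. hs_probability above, bound to
    # a target word's path): the part that belongs on a CPU.
    return [sparse_output(h) for h in hidden]
```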

Tagged with: math