Nov. 12, 2019, 5:29 a.m. (UTC)
You’re writing software that processes data, and it works fine when you test it on a small sample file. But when you load the real data, your program crashes.

The problem is that you don’t have enough memory: if you have 16GB of RAM, you can’t load a 100GB file. At some point the operating system will run out of memory, fail to allocate, and there goes your program.

So what can you do? You could spin up a Big Data cluster; all you’ll need to do is:

- Get a cluster of computers.
- Spend a week on setup.
- In many cases, learn a completely new API and rewrite all your code.

This can be expensive and frustrating; luckily, in many cases it’s also unnecessary.

You need a solution that’s simple and easy: processing your data on a single computer, with minimal setup, and as much as possible using the same libraries you’re already using. And much of the time you can actually do that, using a set of techniques that are sometimes called “out-of-core computation”.

In this article I’ll cover:

- Why you need RAM at all.
- The easiest way to process data that doesn’t fit in memory: spending some money.
- The three basic software techniques for handling too much data: compression, chunking, and indexing.

Followup articles will then show you how to apply these techniques to particular libraries like NumPy and Pandas.
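To give a first taste of two of those techniques, chunking and compression, here is a minimal sketch using only Python’s standard library. The file name, sample data, and sizes are illustrative, not from any particular dataset: the point is that the data is stored compressed on disk, and processed one small piece at a time, so peak memory use stays flat no matter how large the file is.

```python
import gzip
import os
import tempfile

# Illustrative setup: write sample data to a gzip-compressed file.
# (Compression: the data takes far less space on disk.)
path = os.path.join(tempfile.mkdtemp(), "numbers.txt.gz")
with gzip.open(path, "wt") as f:
    for i in range(1, 101):
        f.write(f"{i}\n")

# Chunking: process the file one line at a time. Only a single line
# is ever decompressed and held in memory, regardless of file size.
total = 0
with gzip.open(path, "rt") as f:
    for line in f:
        total += int(line)

print(total)  # 5050
```

The same streaming pattern scales from this toy file to files much larger than RAM, because memory usage depends on the chunk size you choose, not on the total size of the data.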