Why There’s No Single Best Way To Store Information
In data storage, sometimes it’s best to embrace a bit of disorder.
Just as there’s no single best way to organize your bookshelf, there’s no one-size-fits-all solution to storing information.
Consider the simple situation where you create a new digital file. Your computer needs to rapidly find a place to put it. If you later want to delete it, the machine must quickly find the right bits to erase. Researchers aim to design storage systems, called data structures, that balance the amount of time it takes to add data, the time it takes to later remove it, and the total amount of memory the system needs.
To get a feel for these challenges, imagine you keep all your books in a row on one long shelf. If they’re organized alphabetically, you can quickly pick out any book. But whenever you acquire a new book, it’ll take time to find its proper spot. Conversely, if you place books wherever there’s space, you’ll save time now, but they’ll be hard to find later. This trade-off between insertion time and retrieval time might not be a problem for a single-shelf library, but you can see how it could get cumbersome with thousands of books.
Instead of a shelf, you could set up 26 alphabetically labeled bins and assign books to bins based on the first letter of the author’s last name. Whenever you get a new book, you can instantly tell which bin it goes in, and whenever you want to retrieve a book, you will immediately know where to look. In certain situations, both insertion and removal can be a lot faster than they would be if you stored items on one long shelf.
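To make this concrete, here's a minimal sketch of the bin system in Python. The books and function names are purely illustrative, not part of any real library:

```python
# A sketch of the 26-bin system: one bin per letter, keyed by the
# first letter of the author's last name.
import string

bins = {letter: [] for letter in string.ascii_uppercase}

def add_book(last_name: str, title: str) -> None:
    # Insertion is instant: the first letter tells you the bin.
    bins[last_name[0].upper()].append((last_name, title))

def find_book(last_name: str, title: str):
    # Retrieval only searches one bin instead of the whole shelf,
    # but it slows down if many books share a bin.
    for book in bins[last_name[0].upper()]:
        if book == (last_name, title):
            return book
    return None

add_book("Austen", "Persuasion")
add_book("Orwell", "1984")
print(find_book("Austen", "Persuasion"))  # ('Austen', 'Persuasion')
```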
Of course, this bin system comes with its own problems. Retrieving books is only instantaneous if you have one book per bin; otherwise, you’ll have to root around to find the right one. In an extreme scenario where all your books are by Asimov, Atwood, and Austen, you’re back to the problem of one long shelf, plus you’ll have a bunch of empty bins cluttering up your living room.
Computer scientists often study data structures called hash tables that resemble more sophisticated versions of this simple bin system. Hash tables calculate a storage address for each item from a known property of that item, called the key. In our example, the key for each book is the first letter of the author’s last name. But that simple key makes it likely that some bins will be much fuller than others. (Few authors writing in English have a last name that starts with X, for example.) A better approach is to start with the author’s full name, replace each letter in the name with the number corresponding to its position in the alphabet, add up all these numbers, and divide the sum by 26. The remainder is some number between zero and 25. Use that number to assign the book to a bin.
This kind of mathematical rule for transforming a key into a storage address is called a hash function. A cleverly designed hash function ensures that items will usually end up distributed relatively evenly across bins, so you won’t need to spend as much time searching in each bin.
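Here's what that letter-sum rule might look like as a toy Python function, with made-up author names for illustration:

```python
# A sketch of the letter-sum hash function described above: map each
# letter of the author's full name to its position in the alphabet
# (a = 1, ..., z = 26), add up the values, and keep the remainder
# after dividing by the number of bins.
def hash_name(full_name: str, num_bins: int = 26) -> int:
    total = sum(ord(c) - ord("a") + 1
                for c in full_name.lower() if c.isalpha())
    return total % num_bins  # a bin number from 0 to num_bins - 1

print(hash_name("Isaac Asimov"))     # 8
print(hash_name("Margaret Atwood"))  # 5
print(hash_name("Jane Austen"))      # 6
```

All three of these names start with A, so the first-letter scheme would pile them into a single bin; the letter-sum rule spreads them across three.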
If you want to reduce retrieval time further, you can use more bins. But that leads to another trade-off: Those bins will take up space even if they end up empty.
This trade-off between space and time is an inherent feature of hash tables — it’s the price you pay for avoiding the tension between insertion and retrieval time that plagues simpler data structures. More than 70 years after hash tables were invented, computer scientists are still discovering new things about their fundamental properties. Recently, they finally devised a version that strikes an ideal balance between space and time. And last year, an undergraduate student disproved a long-standing conjecture about the minimum amount of time needed to find a specific item in a hash table that’s almost full.
A Heap of Priorities
Hash tables work well when you can’t anticipate which piece of data you’ll need to retrieve next. But that’s not always the case. Imagine you’re trying to complete tasks on a to-do list, but you’re constantly being assigned new tasks with different deadlines. You want to be able to quickly add new items to the to-do list, but you don’t care about retrieving items until they become your top priority.
In this case, your best bet is a type of data structure called a heap. As the name suggests, a heap is a somewhat haphazard approach to data storage. It’s basically a mathematical version of a pile of stuff: Some items are stored above others, and these higher items are easier to access. The highest-priority item is always at the top of the heap, where you can instantly pluck it off. Lower layers will be more disorganized, but you don’t need to worry about the relative positions of these low-priority items.
The simplest implementation of this basic idea uses a mathematical object called a binary tree, which is a network of nodes with a special shape: There’s a single node at the top, and each node is connected to at most two nodes directly below it.
Let’s imagine a binary tree that contains the items in a to-do list. Each node can store a single item, and each item is labeled with a number that represents its due date. High-priority items get smaller numbers.
Each new item is put into the first empty slot in the current lowest layer; if that layer is full, the item starts a new one.
Once the new item goes in, compare its due date to that of the item in the node directly above it. If the new task is due sooner, swap the items. Keep swapping until the new item ends up directly below an item that’s more urgent (or reaches the top of the tree).
This procedure ensures that the highest-priority item will always rise to the top. What’s more, the procedure is extremely fast. Even in a nightmare scenario where you have 1,000 tasks on your to-do list and keep getting new assignments, storing them in a heap ensures that it takes no more than nine swaps to move each new item up to the appropriate position. That’s because each swap lifts an item one layer, and 1,000 items fit in a binary tree just 10 layers deep. Whenever you complete the most urgent task and remove it from the heap, you can quickly pull up your new top priority from the layer below.
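Here's a minimal sketch of this procedure in Python, using the standard trick of storing the tree in a flat list so that the item at index i has its parent at index (i - 1) // 2. The tasks and due-date numbers are made up for illustration:

```python
# A binary min-heap stored as a list. The item at index i has its
# parent at index (i - 1) // 2, so "the node directly above" is easy
# to find without building an explicit tree.
heap = []

def add_task(due_date: int, task: str) -> None:
    # Put the new item in the first empty slot of the lowest layer...
    heap.append((due_date, task))
    i = len(heap) - 1
    # ...then swap it upward until its parent is due sooner.
    while i > 0 and heap[i] < heap[(i - 1) // 2]:
        parent = (i - 1) // 2
        heap[i], heap[parent] = heap[parent], heap[i]
        i = parent

def pop_most_urgent() -> tuple:
    # The most urgent task is always at the top (index 0).
    top = heap[0]
    last = heap.pop()
    if heap:
        # Move the last item to the top, then sift it back down
        # below whichever child is more urgent.
        heap[0] = last
        i = 0
        while True:
            smallest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < len(heap) and heap[child] < heap[smallest]:
                    smallest = child
            if smallest == i:
                break
            heap[i], heap[smallest] = heap[smallest], heap[i]
            i = smallest
    return top

add_task(5, "file taxes")
add_task(1, "reply to email")
add_task(3, "review draft")
print(pop_most_urgent())  # (1, 'reply to email')
```

Python's standard library offers this same structure ready-made in its heapq module.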
Within computer science, heaps are widely used in algorithms for finding the shortest path from a given starting point in a network to every other point. In 2024, a team of researchers used an ingenious new heap design to transform a classic shortest-paths algorithm into one that is theoretically optimal for any network layout.
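The 2024 heap design itself is intricate, but the classic algorithm at the heart of that result, Dijkstra's shortest-path algorithm, can be sketched with an ordinary binary heap. Below is a standard textbook version using Python's built-in heapq; the example network is made up:

```python
import heapq

def dijkstra(graph: dict, start: str) -> dict:
    """Shortest distance from `start` to every reachable node, where
    `graph` maps each node to a list of (neighbor, edge_length) pairs."""
    dist = {start: 0}
    # The heap always surfaces the unexplored node closest to the start.
    frontier = [(0, start)]
    while frontier:
        d, node = heapq.heappop(frontier)
        if d > dist.get(node, float("inf")):
            continue  # stale entry; a shorter route was already found
        for neighbor, length in graph.get(node, []):
            new_dist = d + length
            if new_dist < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_dist
                heapq.heappush(frontier, (new_dist, neighbor))
    return dist

# A small made-up network.
graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2), ("D", 5)],
    "C": [("D", 1)],
}
print(dijkstra(graph, "A"))  # {'A': 0, 'B': 1, 'C': 3, 'D': 4}
```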
There’s no shortage of self-help books filled with contradictory advice about the best way to organize your belongings. If computer science offers any lesson, it’s that there is no perfect solution — every approach comes with trade-offs. But if some items are more important to you than others, don’t be afraid to leave a bit of a mess.