# The Computer Scientist Who Shrinks Big Data

## Introduction

For Jelani Nelson, algorithms represent a wide-open playground. “The design space is just so broad that it’s fun to see what you can come up with,” he said.

Yet the algorithms Nelson devises obey real-world constraints — chief among them the fact that computers cannot store unlimited amounts of data. This poses a challenge for companies like Google and Facebook, which have vast amounts of information streaming into their servers every minute. They’d like to quickly extract patterns in that data without having to remember it all in real time.

Nelson, 36, a computer scientist at the University of California, Berkeley, expands the theoretical possibilities for low-memory streaming algorithms. He’s discovered the best procedures for answering on-the-fly questions like “How many different users are there?” (known as the distinct elements problem) and “What are the trending search terms right now?” (the frequent items problem).

Nelson’s algorithms often use a technique called sketching, which compresses big data sets into smaller components that can be stored using less memory and analyzed quickly.

For example, in 2016 Nelson and his collaborators devised the best possible algorithm for monitoring things like repeat IP addresses (or frequent users) accessing a server. Instead of keeping track of billions of different IP addresses to identify the users who keep coming back, the algorithm breaks each 10-digit address into smaller two-digit chunks. Then sub-algorithms each focus on remembering a different two-digit chunk — drastically cutting the number of combinations logged in memory from billions of possible 10-digit combinations down to just 100 possible two-digit ones. Finally, by using clever strategies to put the chunks back together, the algorithm reconstructs the original IP addresses with a high degree of accuracy. But the massive memory-saving benefits don’t kick in until the users are identified by numbers much longer than 10 digits, so for now his algorithm is more of a theoretical advance.

Before he started designing cutting-edge algorithms, Nelson was a kid trying to teach himself to code. He grew up in St. Thomas in the U.S. Virgin Islands and learned his first programming languages from a few textbooks he picked up during visits to the U.S. mainland. Today he devotes a lot of time to making it easier for kids to get into computer science. In 2011 he founded AddisCoder, a free summer program in Addis Ababa, Ethiopia (where his mother is from). So far the program has taught coding and computer science to over 500 high school students. Perhaps not surprisingly, given Nelson’s involvement, the course is highly compressed, packing a semester of college-level material into just four weeks.

*Quanta *spoke with Nelson about the challenges and trade-offs involved in developing low-memory algorithms, how growing up in the Virgin Islands protected him from America’s race problem, and the story behind AddisCoder. This interview is based on video calls and has been condensed and edited for clarity.

## Introduction

### When did you first get interested in computer science?

When I was 11, I remember reading a book on HTML and giving myself exercises of websites to make. And I remember I wanted to make a webpage for my little sister. She was probably in kindergarten or first grade and she really loved *Rugrats* on Nickelodeon. So on her webpage, I wanted to have some of the characters’ images, like Tommy and Chuckie, and I was really worried that I might be violating Nickelodeon’s copyright. So I called Nickelodeon to ask permission to include these images on my sister’s site. They denied permission.

### That was when you were still in the U.S. Virgin Islands, right? What was it like growing up there?

It felt like a very small town. I grew up on the island of St. Thomas, which has roughly 50,000 people, and the whole island is less than 35 square miles. We didn’t have running water supplied by the government, so we had two big cisterns, which are just giant rooms under the house, to collect water. We still have a giant generator in our front yard in case of power outages and hurricanes.

And when I was 11, Hurricane Marilyn came and wreaked havoc on the Virgin Islands. I remember we actually stayed at the Marriott the night of the hurricane, and it was a good thing we did that because our house was messed up.

### What’s something you remember about the U.S. Virgin Islands that’s different from here in the States?

What I’ve noticed a lot in the mainland U.S. is that often there’s this very strange lack of respect, or disrespect, that I receive in some circumstances. I’ve been in the mainland for a long time now, it’s been like 20 years. There have been isolated incidents here and there where usually no one will explicitly tell you that something is happening because of your race, but you get treated very weirdly.

I did not witness it in my childhood because of where I was. People often ask me about being Black in science in America. But I think in the Virgin Islands, somehow my race was less important down there. Because almost everybody’s Black. It was never like, “Oh, you’re a Black kid who’s succeeding in math and science.” It was like, well, of course I’m a Black kid, everyone’s a Black kid here. I think that growing up in the Virgin Islands shielded me from some of the negative psychological effects of racism in America.

Now that I’m in the mainland U.S., I do agree there’s a real problem here. But because of what I’ve seen in my own life, I feel that this problem is not intrinsic to the world. It doesn’t have to be this way. There are places in the world where it isn’t this way.

## Introduction

### Speaking of other places in the world, what led you to start the AddisCoder program in Ethiopia?

I was a final-semester Ph.D. student; I already had a faculty offer from Harvard, so I was thinking: What do I want to do? And I thought, I’ll go to Ethiopia and visit my relatives and hang out over the summer. And then I thought, well, if I’m going to spend six weeks there, that’s a lot of time to just hang out and do nothing else. Why don’t I teach on the side?

So I had some friends help me advertise my program to high schools in Addis Ababa. I thought there would be a large number of interested students, so I made a puzzle. You had to solve a math problem. The solution to that math problem gave you an email address, and you could sign up for the class by emailing that address. We got a couple hundred kids who signed up to take the class. It was actually too many even with the puzzle. The classroom we got wasn’t big enough to support that. So I made the first few days of class very hard and fast to encourage students to drop out, which many did.

### How has AddisCoder changed since that first summer in 2011?

Instead of 80-something students a year, now we’re training a little more than double that, and it’s a boarding camp that we co-organize with the Meles Zenawi Foundation, a nonprofit that was created in honor of the late prime minister of Ethiopia. The students now come from all over the country, and we have a teaching staff of 40.

A lot of the students have never been outside of their town, or their region. So AddisCoder is the first time they’re seeing kids from all over the country, and then they’re meeting instructors from all over the world. It’s very eye-opening for them.

## Introduction

### I’m sure it’s exciting for them to meet top computer scientists. In your own work, what is your main challenge when developing algorithms?

An algorithm is just a procedure for solving some task. So your job as an algorithm designer is to come up with a procedure that solves that task as efficiently as possible. And the design space is infinite, right? You could do anything.

### What kinds of problems do your algorithms solve?

Companies we interact with use “distinct elements” a lot, which is counting the number of distinct items in a stream. So, maybe you count the number of distinct IP addresses that watch a YouTube video.

It turns out that this is a problem that also can be solved using a low-memory streaming algorithm. The one that’s most often used in practice is something called HyperLogLog. It’s used at Facebook, Google and a bunch of big companies. But the very first optimal** **low-memory algorithm for distinct elements, in theory, is one that I co-developed in 2010 for my Ph.D. thesis with David Woodruff and Daniel Kane.

### If your algorithm is more memory-efficient than HyperLogLog, why don’t companies use it?

It’s optimal once the problem is big enough, but with the kinds of problem sizes that people usually deal with, HyperLogLog is more of a practical algorithm.

Imagine that you’re seeing a stream of packets, and what you want is to count the number of distinct IP addresses that are sending traffic on this link. You want to know how many IP addresses there are. Well, in internet protocol version 4, there are 2^{32} IP addresses total, which is about 4 billion. That’s not big enough for our algorithm to win. It really has to be something astronomically big for our algorithms to be better.* *

### How do streaming algorithms analyze data without having to remember it all?

Here’s a very simple example. I’m Amazon and I want to know what my gross sales were today. Let’s say there were a million transactions today. There’s a very simple solution to this problem, which is you keep a running sum. Every time there’s a sale, you just add the sale amount to that sum, so you’re just storing one number instead of every transaction.

It turns out that there are other problems where the data might not seem numerical, but you somehow think of the data as numerical. And then what you’re doing is somehow taking a little bit of information from each piece of data and combining it, and you’re storing those combinations. This process takes the data and summarizes it into a sketch.

### How does that kind of compression work?

There are many techniques, though a popular one is linear sketching. Let’s say I want to answer the distinct elements problem, where a website like Facebook wants to know how many of their users visit their site each day. Facebook has roughly 3 billion users, so you could imagine creating a data set which has 3 billion dimensions, one for each user. I don’t want to remember the full Facebook user data set. Instead of storing 3 billion dimensions, I’ll store 100 dimensions.

The first dimension comes from taking a subset of the 3 billion numbers and adding them together, or multiplying them by some coefficient. And then it does that again 100 times, so now you’re only storing 100 numbers. Each of these 100 numbers might depend, in principle, on all of the 3 billion numbers. But it doesn’t store them individually.

**A lot of big-data algorithms output answers that are within some range of the truth but don’t output the true value itself. Why can’t they produce an exact answer?**** **

Unfortunately, for a lot of these problems, like the distinct elements problem, you can mathematically prove that if you insist on having the exact correct answer, then there is no algorithm that’s memory-efficient. To get the exact answer, the algorithm would basically have to remember everything it saw.

## Introduction

### So what kinds of sacrifices do you have to make to keep the estimate as accurate as possible?

The sacrifice is cost — memory cost. The more accuracy you want, the more memory you’re typically going to have to devote to the algorithm. And also, there’s failure probability. Maybe I’m OK with outputting a wrong answer with probability 10% of the time. Or maybe I really want it to be 5% or maybe 1%. The lower I make the failure probability, usually that costs me more memory too.** **

### Do you think today’s big-data algorithms are limited by the human mind or human engineering?

I would say the human mind. Can you come up with an algorithm, and can you come up with a proof that there’s no better algorithm? But I should mention that the models we’re working in are constrained by human engineering. Why does it matter that the algorithm uses low memory? It’s because my device has bounded memory. Why does my device have bounded memory? Well, because of some constraints of the device.

### In a sense, then, you could be developing the algorithms of the future. Do you think companies will start using your algorithms at some point?

Well, I think with our distinct elements algorithm that may very well never happen. What may happen is that people see our result as a proof of concept and they’ll work harder at making their practical algorithms as good as the theory suggests they can be.

### What do you enjoy about finding exactly how good an algorithm can be?

There is a sense of satisfaction that comes at the end. I like when we don’t just know a better method, but we know the method — when we know that a million years into the future, no matter how clever humanity becomes, it’s mathematically impossible to do anything better.