# Big Data’s Mathematical Mysteries

Machine learning works spectacularly well, but mathematicians aren’t quite sure why.


At a dinner I attended some years ago, the distinguished differential geometer Eugenio Calabi volunteered to me his tongue-in-cheek distinction between pure and applied mathematicians. A pure mathematician, when stuck on the problem under study, often decides to narrow the problem further and so avoid the obstruction. An applied mathematician interprets being stuck as an indication that it is time to learn more mathematics and find better tools.

### Quantized

A monthly column in which top researchers explore the process of discovery. This month’s columnist, Ingrid Daubechies, is the James B. Duke Professor of Mathematics and Electrical and Computer Engineering at Duke University.

I have always loved this point of view; it explains how applied mathematicians will always need to make use of the new concepts and structures that are constantly being developed in more foundational mathematics. This is particularly evident today in the ongoing effort to understand “big data” — data sets that are too large or complex to be understood using traditional data-processing techniques.

Our current mathematical understanding of many techniques that are central to the ongoing big-data revolution is inadequate, at best. Consider the simplest case, that of supervised learning, which has been used by companies such as Google, Facebook and Apple to create voice- or image-recognition technologies with a near-human level of accuracy. These systems start with a massive corpus of training samples — millions or billions of images or voice recordings — which are used to train a deep neural network to spot statistical regularities. As in other areas of machine learning, the hope is that computers can churn through enough data to “learn” the task: Instead of being programmed with the detailed steps necessary for the decision process, the computers follow algorithms that gradually lead them to focus on the relevant patterns.


In mathematical terms, these supervised-learning systems are given a large set of inputs and the corresponding outputs; the goal is for a computer to learn the function that will reliably transform a new input into the correct output. To do this, the computer breaks down the mystery function into a number of layers of unknown functions called sigmoid functions. These S-shaped functions look like a street-to-curb transition: a smoothed step from one level to another, where the starting level, the height of the step and the width of the transition region are not determined ahead of time.
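To make the street-to-curb picture concrete, here is a minimal sketch (my illustration, not from the article) of such an S-shaped function, with the starting level, step height, center and transition width left as free parameters:

```python
import numpy as np

def sigmoid_step(x, base=0.0, height=1.0, center=0.0, width=1.0):
    """An S-shaped "street-to-curb" transition: a smoothed step that
    rises from `base` to `base + height`, centered at `center`, over a
    transition region whose sharpness is set by `width`."""
    return base + height / (1.0 + np.exp(-(x - center) / width))

# Far to the left the function sits near its starting level; far to the
# right it sits near base + height; in between it climbs smoothly.
low, mid, high = sigmoid_step(-10.0), sigmoid_step(0.0), sigmoid_step(10.0)
```

In a network, training amounts to adjusting these free parameters, together with the weights that combine the functions, until the overall composition matches the input-output examples.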

Inputs enter the first layer of sigmoid functions, which spits out results that can be combined before being fed into a second layer of sigmoid functions, and so on. This web of resulting functions constitutes the “network” in a neural network. A “deep” one has many layers.
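A toy version of that wiring, with made-up layer sizes and random untrained weights (a sketch of the structure only, not a real recognition system), might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, biases):
    """Each layer takes weighted combinations of the previous layer's
    outputs and feeds them into sigmoid functions; stacking many such
    layers is what makes the network "deep"."""
    for W, b in zip(weights, biases):
        x = sigmoid(W @ x + b)
    return x

# Four inputs, two hidden layers of eight nodes each, two outputs.
sizes = [4, 8, 8, 2]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes, sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]
output = forward(rng.normal(size=4), weights, biases)
```

Training would then adjust `weights` and `biases` until `forward` maps each training input close to its known output.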



Decades ago, researchers proved that these networks are universal, meaning that they can generate all possible functions. Other researchers later proved a number of theoretical results about the unique correspondence between a network and the function it generates. But these results assume networks that can have extremely large numbers of layers and of function nodes within each layer. In practice, neural networks use anywhere between two and two dozen layers.* Because of this limitation, none of the classical results come close to explaining why neural networks and deep learning work as spectacularly well as they do.

It is the guiding principle of many applied mathematicians that if something mathematical works really well, there must be a good underlying mathematical reason for it, and we ought to be able to understand it. In this particular case, it may be that we don’t even have the appropriate mathematical framework to figure it out yet. (Or, if we do, it may have been developed within an area of “pure” mathematics from which it hasn’t yet spread to other mathematical disciplines.)

Another technique used in machine learning is unsupervised learning, which is used to discover hidden connections in large data sets. Let’s say, for example, that you’re a researcher who wants to learn more about human personality types. You’re awarded an extremely generous grant that allows you to give 200,000 people a 500-question personality test, with answers that vary on a scale from one to 10. Eventually you find yourself with 200,000 data points in 500 virtual “dimensions” — one dimension for each of the original questions on the personality quiz. These points, taken together, form a lower-dimensional “surface” in the 500-dimensional space in the same way that a simple plot of elevation across a mountain range creates a two-dimensional surface in three-dimensional space.

What you would like to do, as a researcher, is identify this lower-dimensional surface, thereby reducing the personality portraits of the 200,000 subjects to their essential properties — a task that is similar to finding that two variables suffice to identify any point in the mountain-range surface. Perhaps the personality-test surface can also be described with a simple function, a connection between a number of variables that is significantly smaller than 500. This function is likely to reflect a hidden structure in the data.
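As a deliberately simplified illustration (a linear toy model of my own; real survey data would be messier and the hidden surface likely curved), principal component analysis can reveal when only a couple of hidden variables drive many measured ones:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the survey: 2,000 "people" whose 500 answers
# are secretly driven by just two hidden traits, plus a little noise.
n_people, n_questions, n_traits = 2000, 500, 2
traits = rng.normal(size=(n_people, n_traits))
loadings = rng.normal(size=(n_traits, n_questions))
answers = traits @ loadings + 0.05 * rng.normal(size=(n_people, n_questions))

# Principal component analysis via the singular value decomposition:
# the squared singular values say how much variance each direction holds.
centered = answers - answers.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)
variance_ratio = s**2 / np.sum(s**2)
# The first two ratios dominate; the remaining 498 directions are noise.
```

Because the true surface in data like the personality test is generally curved, a linear tool like this is only a first check that the data really are low-dimensional; nonlinear methods are needed to map the surface itself.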

In the last 15 years or so, researchers have created a number of tools to probe the geometry of these hidden structures. For example, you might build a model of the surface by first zooming in at many different points. At each point, you would place a drop of virtual ink on the surface and watch how it spread out. Depending on how the surface is curved at each point, the ink would diffuse in some directions but not in others. If you were to connect all the drops of ink, you would get a pretty good picture of what the surface looks like as a whole. And with this information in hand, you would no longer have just a collection of data points. Now you would start to see the connections on the surface, the interesting loops, folds and kinks. This would give you a map for how to explore it.
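One concrete family of such tools is the diffusion-map construction developed by Coifman, Lafon and collaborators. The toy sketch below, with illustrative parameters of my own choosing, mimics the ink-drop picture: points sampled from a curve hidden in three dimensions, a kernel that says how readily ink spreads between neighbors, and an eigenvector that recovers position along the hidden curve:

```python
import numpy as np

rng = np.random.default_rng(2)

# Sample 300 points along a coiled one-dimensional curve sitting in 3-D,
# a toy stand-in for a hidden low-dimensional surface in high dimensions.
t = np.sort(rng.uniform(0.0, 3.0 * np.pi, 300))
points = np.column_stack([np.cos(t), np.sin(t), 0.2 * t])

# "Ink drop" at every point: a Gaussian kernel gives the rate at which
# ink spreads between two points, large for neighbors, tiny otherwise.
sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
kernel = np.exp(-sq_dists / 0.1)

# Normalize rows into transition probabilities, then take the leading
# nontrivial eigenvector: it orders the points along the hidden curve.
P = kernel / kernel.sum(axis=1, keepdims=True)
eigvals, eigvecs = np.linalg.eig(P)
idx = np.argsort(-eigvals.real)
diffusion_coord = eigvecs[:, idx[1]].real
```

Here `diffusion_coord` varies steadily along the curve, so a single number now summarizes each three-dimensional point, much like the reduction the ink-drop picture suggests.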

These methods are already leading to interesting and useful results, but many more techniques will be needed. Applied mathematicians have plenty of work to do. And in the face of such challenges, they trust that many of their “purer” colleagues will keep an open mind, follow what is going on, and help discover connections with other existing mathematical frameworks. Or perhaps even build new ones.

* Correction on December 4, 2015: The original version of this article stated that practical neural nets use only two or three layers. This number is now greater than 10 layers in state-of-the-art systems. The Google image-recognition algorithm that won a recent ImageNet Large-Scale Visual Recognition Challenge used 22 layers.

• Mark Gubrud says:

The most successful networks today do not use sigmoid activation functions; they use rectified linear units, or "leaky" variants with a smaller slope for the negative part. Networks that use these functions learn faster. They also use many layers, not just two or three. That is the meaning of the term "deep."

• Steven Sagaert says:

Hi Ingrid,
As a former physicist you'll appreciate the following links to statistical physics:
https://charlesmartin14.wordpress.com/2015/03/25/why-does-deep-learning-work
https://charlesmartin14.wordpress.com/2015/04/01/why-deep-learning-works-ii-the-renormalization-group

• Dreamtimer says:

Any literature recommendations in the "pure data science" department ( books or overview articles )?

• littlefooch says:

Can someone define what "spectacularly well" means? If I hand the algorithms lots of lab data for lots of patients, collected at various medical locations and times, can it figure out the semantics of when a lab result is meaningful? (For example: it was taken without clinical data to confirm a diagnosis, or without a diagnosis at all; or how it figured in a given encounter; or it was a lab taken in the context of a complex diagnosis that could have multiple meanings, depending upon the patient, timing and disease progression.)

• bigfooch says:

"Spectacularly well" means there is a class of problems on which CNNs handily beat any human. Today, that's the image-classification problem, where object-recognition and image-captioning results from CNNs are far more accurate than humans'. This was not the case, say, five years ago; in fact, even three years ago the human was ahead. Now it's clear we have no chance!
Regarding the "lab result is meaningful" problem: there's a bunch of ML startups, most notably Enlitic, with good success in that area.

• Eugene says:

Ingrid,

Could you point to the relevant papers about "a number of tools to probe the geometry of these hidden structures" or names of the researchers? I'd like to see if these tools may be of help, but it's hard to do based on the vague reference.

• Randy Crawford says:

Deep Learning, by Goodfellow, Courville, Bengio
https://goodfeli.github.io/dlbook/

• Jens Stilling says:

Translating? If that really is so, I hope Google will start using it on Google Translate. As far as I can see, it is getting worse, not better. Honestly, I do not think such an activity can be performed without additional structures that have an understanding of the subject. Translating is obviously a very difficult activity; most of the human-generated translations I see are wrong. I remember reading a book as a young man in my native language, and only years later, when reading it in another language, did I realise what some parts really meant.

• Ben says:

@littlefooch
Yes, but actually implementing it can't be described so simply.

For simplicity's sake, imagine your lab reports came with only three values. This means you could plot the content of every possible lab report as a point on a three-dimensional graph. Every single lab report, every single point in that graph, has a correct diagnosis. So there would be an ideal graph where you could take three numbers, trace along the axes, and arrive at a correct diagnosis. That ideal graph would probably be filled with all kinds of exotic shapes describing the different kinds of diagnoses possible.

Using both trained examples and manually configured knowledge about the world, machine learning attempts to construct that ideal graph, but before you can get there, there are a bunch of hiccups to overcome along the way. You could have an incomplete set of inputs: maybe you try to train your machine-learning algorithm with those three lab-report numbers, but the diagnosis depends not only on those but also on the nurse who collected the data. Maybe the instruments you used to record the data are not sensitive enough for some of the diagnoses, or perhaps you have too few lab reports to accurately describe the system. Often, these hiccups do not hurt accuracy badly enough to abandon the project. There will always be inaccuracy in your system, as nothing other than reality itself will produce that ideal graph. You just need to use your judgment to decide when it matters.

If you have a good mind for how these diagnoses work, I think you would be successful at building a machine learning system to classify them. I'd recommend you start from the simplest cases possible. Pick diagnoses that can be described by numerical data in a lab report (like high blood pressure vs low blood pressure), and try to think about which inputs you'll need to get a good diagnosis. Once you've mastered that, start trying to tackle more complex ones. I'd bet most of the fruit are hanging pretty low. Remember that if a doctor can do it, a computer can too. Anything worth knowing can be explained in code; if a doctor knows something that can't be, either he is lying (intentionally or not), or the universe operates on magic.

If your diagnosis includes entirely text-based information, you're probably SOL. You'll need to find some way to encode the content of the text as numerical values before you can try to make a diagnosis. You'd be creating a pretty cutting-edge system if you managed to do that.

Here's an example of what you're talking about using different methods. AFAIK their system achieved higher accuracy than the average doctor.

• littlefooch says:

Enlitic looks like they are just starting out, perhaps successful in gaining funding to do further work. At some point, they will need to clear at least FDA medical-device trials (in the US, at least) to be considered medically successful.

• littlefooch says:

I wasn't describing an implementation. I was trying to describe a real end result that could plausibly emerge and have medical value.

• JR says:

As a fan of the author's wavelet transform, I would like to thank Quanta for printing this piece. If anything I wish it were simply longer!

It is interesting to consider what "understanding" means in the case of applied mathematics and machine learning. I would argue that the researchers at the forefront of this field unquestionably have some understanding of why their techniques work – they design the architecture of the network and specify the optimization functions that are used to learn connection weights. The classification accuracy of deep networks approaches that of human observers. These networks are able to cluster data correctly in high dimensional spaces, and although we can only draw either cartoons or reduced-dimensionality representations to "show what they are doing", our understanding (or at least the understanding of the designers) is embodied in the construction of the networks.

This is an elegant articulation of the theory, as applied to object recognition by the human brain:
Untangling invariant object recognition
James J. DiCarlo and David D. Cox
http://www.coxlab.org/pdfs/TICS_DiCarloCox_2007.pdf

Visualizing and Understanding Convolutional Networks
Matthew D. Zeiler and Rob Fergus, Dept. of Computer Science, Courant Institute, New York University
http://www.matthewzeiler.com/pubs/arxive2013/arxive2013.pdf

• Mike James says:

The way neural networks learn may seem to be unreasonably effective, but there are some big problems. In particular, there is the problem of adversarial inputs.
The Flaw In Every Neural Network Just Got A Little Worse
http://www.i-programmer.info/news/105-artificial-intelligence/9090-the-flaw-in-every-neural-network-just-got-a-little-worse.html
See the related articles at the end for a timeline of how this issue is developing.
It seems that neural networks aren't quite like us, and this defect could give some clue as to how they do work.

• Dr K.R.Balasubramaniyan says:

Excellent article, with which you can make students understand the importance of both pure and applied mathematics. It could be part of the syllabus for both mathematics and physics students. A must-read for everyone in the teaching profession.

• JR says:

It is also a nice coincidence that the first-layer linear filters in networks used for object recognition, as well as primate cortical V1 receptive fields, are sometimes described as families of 2D wavelets.

• Mike Maxwell says:

@Jens: Translation involves a huge number of factors; for most documents, that includes a lot of world knowledge, without which one cannot understand the meaning, hence cannot translate reliably. Most statistical MT systems are "trained" on texts from a fairly narrow domain, which allows them to work reasonably well on other documents in that domain. But expecting them to deal with text outside that domain can lead to disappointment. (Of course there's lots to be said beyond that…)

You mentioned that "most of the human generated translations I see are wrong." "Wrong" is a binary judgment; in my experience, there's a wide range of reasonably good ways of expressing something, hence a wide range of potentially more-or-less correct ways to translate something (and of course an even wider range of ways to translate it wrong!). That said, some (human) translators are better than others, and even human translations suffer if the translator doesn't have the world knowledge for some domain. Do you have some specific examples of bad translations?

• Rafael says:

Pardon my ignorance, but what would the problem be with defining classic pre- and post-conditions for the neural network? As a pre-condition you have inputs and corresponding outputs, and as a post-condition, the learned function. In the middle, you have assertions that restrict each of the generated sigmoid functions. What is so hard about finding those restrictions? The inputs imply restrictions on the first layer of sigmoid functions, the first layer implies restrictions on the second, and so on, until the final assertion is implied. Do you need more than that?

• Frankenstein says:

Well, I am a mathematician, I work in big data, and the current statistical approaches work remarkably unwell. I guess our measure of wellness must be established first.