Quantized: Computer Science

The Rise of Computer-Aided Explanation

Computers can translate French and prove mathematical theorems. But can they make deep conceptual insights into the way the world works?


Imagine it’s the 1950s and you’re in charge of one of the world’s first electronic computers. A company approaches you and says: “We have 10 million words of French text that we’d like to translate into English. We could hire translators, but is there some way your computer could do the translation automatically?”


A monthly column in which top researchers explore the process of discovery. This month’s columnist, Michael Nielsen, is a computer scientist and author of three books.

At this time, computers are still a novelty, and no one has ever done automated translation. But you decide to attempt it. You write a program that examines each sentence and tries to understand the grammatical structure. It looks for verbs, the nouns that go with the verbs, the adjectives modifying nouns, and so on. With the grammatical structure understood, your program converts the sentence structure into English and uses a French-English dictionary to translate individual words.

For several decades, most computer translation systems used ideas along these lines — long lists of rules expressing linguistic structure. But in the late 1980s, a team from IBM’s Thomas J. Watson Research Center in Yorktown Heights, N.Y., tried a radically different approach. They threw out almost everything we know about language — all the rules about verb tenses and noun placement — and instead created a statistical model.


They did this in a clever way. They got hold of a copy of the transcripts of the Canadian parliament from a collection known as Hansard. By Canadian law, Hansard is available in both English and French. They then used a computer to compare corresponding English and French text and spot relationships.

For instance, the computer might notice that sentences containing the French word bonjour tend to contain the English word hello in about the same position in the sentence. The computer didn’t know anything about either word — it started without a conventional grammar or dictionary. But it didn’t need those. Instead, it could use pure brute force to spot the correspondence between bonjour and hello.

By making such comparisons, the program built up a statistical model of how French and English sentences correspond. That model matched words and phrases in French to words and phrases in English. More precisely, the computer used Hansard to estimate the probability that an English word or phrase will be in a sentence, given that a particular French word or phrase is in the corresponding translation. It also used Hansard to estimate probabilities for the way words and phrases are shuffled around within translated sentences.

Using this statistical model, the computer could take a new French sentence — one it had never seen before — and figure out the most likely corresponding English sentence. And that would be the program’s translation.
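
To make the idea concrete, here is a minimal sketch in Python. It is my own toy illustration, not the IBM team's actual alignment models: it estimates the probability of an English word given a French word from a handful of aligned sentence pairs (a stand-in for Hansard), then "translates" a new sentence word by word, ignoring the reordering probabilities described above.

```python
from collections import defaultdict

# Toy aligned corpus standing in for Hansard: (French, English) sentence pairs.
corpus = [
    ("bonjour le monde", "hello the world"),
    ("bonjour à tous", "hello to all"),
    ("le chat dort", "the cat sleeps"),
    ("quel monde étrange", "what a strange world"),
]

# Count how often each English word appears in sentences paired with each French word.
counts = defaultdict(lambda: defaultdict(int))
for fr, en in corpus:
    for f in fr.split():
        for e in en.split():
            counts[f][e] += 1

def p_english_given_french(e, f):
    """Estimated probability that English word e appears, given French word f appears."""
    total = sum(counts[f].values())
    return counts[f][e] / total if total else 0.0

def translate(french_sentence):
    """Naive decoder: for each French word, pick the most probable English word."""
    picks = []
    for f in french_sentence.split():
        if counts[f]:
            picks.append(max(counts[f], key=lambda e: p_english_given_french(e, f)))
    return " ".join(picks)

print(translate("bonjour le monde"))  # prints "hello the world"
```

The real IBM models were far more elaborate, modeling word alignment, phrase correspondences and reordering, but the ingredient is the same: probabilities estimated from a bilingual corpus rather than hand-written grammatical rules.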

When I first heard about this approach, it sounded ludicrous. This statistical model throws away nearly everything we know about language. There’s no concept of subjects, predicates or objects, none of what we usually think of as the structure of language. And the models don’t try to figure out anything about the meaning (whatever that is) of the sentence either.

Despite all this, the IBM team found this approach worked much better than systems based on sophisticated linguistic concepts. Indeed, their system was so successful that the best modern systems for language translation — systems like Google Translate — are based on similar ideas.

Statistical models are helpful for more than just computer translation. There are many problems involving language for which statistical models work better than those based on traditional linguistic ideas. For example, the best modern computer speech-recognition systems are based on statistical models of human language. And online search engines use statistical models to understand search queries and find the best responses.
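
As a concrete (and deliberately tiny) illustration of what a statistical model of language looks like in this setting, here is a sketch of a bigram model, written for this discussion rather than drawn from any production system. It estimates how likely each word is to follow the previous one, which is the kind of score a speech recognizer can use to prefer a plausible transcription over an acoustically similar but unlikely one.

```python
from collections import defaultdict

# Tiny training text; real systems use billions of words.
words = "we want to recognize speech . we want to recognize images .".split()
vocab = set(words)

# Count bigrams: how often word b follows word a.
bigram = defaultdict(lambda: defaultdict(int))
for a, b in zip(words, words[1:]):
    bigram[a][b] += 1

def p_next(a, b):
    """P(next word is b | previous word is a), with add-one smoothing."""
    total = sum(bigram[a].values()) + len(vocab)
    return (bigram[a][b] + 1) / total

def score(sentence):
    """Product of bigram probabilities: higher means the sequence looks more like the training text."""
    prob = 1.0
    for a, b in zip(sentence, sentence[1:]):
        prob *= p_next(a, b)
    return prob

# The model prefers the transcription that resembles language it has seen before.
print(score("we want to recognize speech".split()) >
      score("we want to wreck a nice beach".split()))  # prints True
```

Real systems train on vastly larger corpora and combine such language-model scores with acoustic evidence, but the statistical character is the same.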

Many traditionally trained linguists view these statistical models skeptically. Consider the following comments by the great linguist Noam Chomsky:

There’s a lot of work which tries to do sophisticated statistical analysis, … without any concern for the actual structure of language, as far as I’m aware that only achieves success in a very odd sense of success. … It interprets success as approximating unanalyzed data. … Well that’s a notion of success which is I think novel, I don’t know of anything like it in the history of science.

Chomsky compares the approach to a statistical model of insect behavior. Given enough video of swarming bees, for example, researchers might devise a statistical model that allows them to predict what the bees might do next. But in Chomsky’s opinion it doesn’t impart any true understanding of why the bees dance in the way that they do.

A map of New York from 1896 (Library of Congress, Geography and Map Division). The four-color theorem states that any map can be shaded using four colors in such a way that no two adjacent regions have the same color.

Related stories are playing out across science, not just in linguistics. In mathematics, for example, it is becoming more and more common for problems to be settled using computer-generated proofs. An early example occurred in 1976, when Kenneth Appel and Wolfgang Haken proved the four-color theorem, the conjecture that every map can be colored using four colors in such a way that no two adjacent regions have the same color. Their computer proof was greeted with controversy. It was too long for a human being to check, much less understand in detail. Some mathematicians objected that the theorem couldn’t be considered truly proved until there was a proof that human beings could understand.
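
For a sense of what the theorem asserts (though not of how Appel and Haken proved it; their argument worked by having a computer check a large catalog of special configurations), here is a small sketch of my own that four-colors one particular map given as an adjacency list. Coloring any single map is easy; the hard part, and the part that required the controversial computer search, is proving that such a coloring exists for every possible map.

```python
def four_color(adjacency, colors=("red", "green", "blue", "yellow")):
    """Backtracking search for a coloring in which no two adjacent regions share a color.
    The four-color theorem guarantees success whenever `adjacency` describes a planar map."""
    regions = list(adjacency)
    assignment = {}

    def extend(i):
        if i == len(regions):
            return True                  # every region colored
        region = regions[i]
        for color in colors:
            if all(assignment.get(nbr) != color for nbr in adjacency[region]):
                assignment[region] = color
                if extend(i + 1):
                    return True
                del assignment[region]   # undo and try the next color
        return False

    return assignment if extend(0) else None

# A hypothetical five-region map: A, B, C and D all border one another, so four colors are needed.
toy_map = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["A", "B", "C", "E"],
    "E": ["D"],
}
print(four_color(toy_map))
```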

Today, the proofs of many important theorems have no known human-accessible form. Sometimes the computer is merely doing grunt work — calculations, for example. But as time goes on, computers are making more conceptually significant contributions to proofs. One well-known mathematician, Doron Zeilberger of Rutgers University in New Jersey, has gone so far as to include his computer (which he has named Shalosh B. Ekhad) as a co-author of his research work.

Not all mathematicians are happy about this. In an echo of Chomsky’s doubts, the Fields Medal-winning mathematician Pierre Deligne said: “I don’t believe in a proof done by a computer. In a way, I am very egocentric. I believe in a proof if I understand it, if it’s clear.”

On the surface, statistical translation and computer-assisted proofs seem different. But the two have something important in common. In mathematics, a proof isn’t just a justification for a result. It’s actually a kind of explanation of why a result is true. So computer-assisted proofs are, arguably, computer-generated explanations of mathematical theorems. Similarly, in computer translation the statistical models provide circumstantial explanations of translations. In the simplest case, they tell us that bonjour should be translated as hello because the computer has observed that it has nearly always been translated that way in the past.

Thus, we can view both statistical translation and computer-assisted proofs as instances of a much more general phenomenon: the rise of computer-assisted explanation. Such explanations are becoming increasingly important, not just in linguistics and mathematics, but in nearly all areas of human knowledge.

But as smart skeptics like Chomsky and Deligne (and critics in other fields) have pointed out, these explanations can be unsatisfying. They argue that these computer techniques are not offering us the sort of insight provided by an orthodox approach. In short, they’re not real explanations.

A traditionalist scientist might agree with Chomsky and Deligne and go back to conventional language models or proofs. A pragmatic young scientist, eager to break new ground, might respond: “Who cares, let’s get on with what works,” and continue to pursue computer-assisted work.

Better than either approach is to take both the objections and the computer-assisted explanations seriously. Then we might ask the following: What qualities do traditional explanations have that aren’t currently shared by computer-assisted explanations? And how can we improve computer-assisted explanations so that they have those qualities?

For instance, might it be possible to get the statistical models of language to deduce the existence of verbs and nouns and other parts of speech? That is, perhaps we could actually see verbs as emergent properties of the underlying statistical model. Even better, might such a deduction actually deepen our understanding of existing linguistic categories? For instance, imagine that we discover previously unknown units of language. Or perhaps we might uncover new rules of grammar and broaden our knowledge of linguistics at the conceptual level.
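
Here is one toy way to picture how parts of speech might emerge from statistics alone; it is my own sketch of the general distributional idea, not a description of how any translation system actually works. Each word is represented by the counts of the words that appear next to it, and words used in similar contexts, such as two nouns, end up with similar profiles even though the program was never told what a noun is.

```python
from collections import defaultdict
from math import sqrt

# Tiny corpus; the hope is that words playing the same grammatical role get similar profiles.
corpus = ("the cat eats fish . the dog eats meat . "
          "the cat chases the dog . the dog chases the cat .").split()

# Context profile: counts of the words immediately before and after each occurrence of a word.
profile = defaultdict(lambda: defaultdict(int))
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            profile[w][corpus[j]] += 1

def similarity(u, v):
    """Cosine similarity between the context profiles of words u and v."""
    keys = set(profile[u]) | set(profile[v])
    dot = sum(profile[u][k] * profile[v][k] for k in keys)
    norm = lambda w: sqrt(sum(c * c for c in profile[w].values()))
    return dot / (norm(u) * norm(v))

# Two nouns resemble each other more than a noun resembles a verb, purely from usage statistics.
print(similarity("cat", "dog"))     # 1.0 on this tiny corpus
print(similarity("cat", "chases"))  # noticeably smaller (about 0.71)
```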

As far as I know, this has not yet happened in the field of linguistics. But analogous discoveries are now being made in other fields. For instance, biologists are increasingly using genomic models and computers to deduce high-level facts about biology. By using computers to compare the genomes of crocodiles, researchers have determined that the Nile crocodile, formerly thought to be a single species, is actually two different species. And in 2010 a new species of human, the Denisovans, was discovered through an analysis of the genome of a finger-bone fragment.

Another interesting avenue is being pursued by Hod Lipson of Columbia University. Lipson and his collaborators have developed algorithms that, when given a raw data set describing observations of a mechanical system, will actually work backward to infer the “laws of nature” underlying those data. In particular, the algorithms can figure out force laws and conserved quantities (like energy or momentum) for the system. The process can provide considerable conceptual insight. So far Lipson has analyzed only simple systems (though complex raw data sets). But it’s a promising case in which we start from a very complex situation, and then use a computer to simplify the description to arrive at a much higher level of understanding.
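
The flavor of the conserved-quantity search can be conveyed with a stripped-down sketch, which is my own illustration under simplifying assumptions (noise-free data from a simulated mass on a spring, and a small hand-picked list of candidate expressions) rather than Lipson's actual algorithms, which search over free-form symbolic expressions. The program generates a trajectory and then flags the candidate whose value barely changes along it.

```python
# Simulate a unit mass on a unit-stiffness spring (x'' = -x) with semi-implicit Euler steps.
def trajectory(x=1.0, v=0.0, dt=1e-3, steps=20000):
    data = []
    for _ in range(steps):
        v -= x * dt
        x += v * dt
        data.append((x, v))
    return data

data = trajectory()

# Candidate expressions in position x and velocity v; only one is (approximately) conserved.
candidates = {
    "x + v":     lambda x, v: x + v,
    "x * v":     lambda x, v: x * v,
    "x^2 + v^2": lambda x, v: x**2 + v**2,   # twice the energy: the conserved quantity
}

# A candidate "law" is interesting if its value stays nearly constant along the observed motion.
for name, f in candidates.items():
    values = [f(x, v) for x, v in data]
    spread = max(values) - min(values)
    print(f"{name:10s} varies by {spread:.4f}")
```

Swapping in a richer space of candidate expressions, and a search procedure for exploring it, is essentially what turns this toy into a law-discovery method.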

The examples I’ve given are modest. As yet, we have few powerful techniques for taking a computer-assisted proof or model, extracting the most important ideas, and answering conceptual questions about the proof or model. But computer-assisted explanations are so useful that they’re here to stay. And so we can expect that developing such techniques will be an increasingly important aspect of scientific research over the next few decades.

Correction: This article was revised on July 24, 2015, to reflect that a Denisovan bone fragment was not found in an Alaskan cave.

This article was reprinted on Wired.com.


Reader Comments

  • "a new species of human, the Denisovans, was discovered through an analysis of the genome of a finger-bone fragment that had been found in an Alaskan cave."

    Denisova cave (in Siberia) is quite far from Alaska.

    Sorry to nitpick, very interesting article! Stuff emerging from computer programs, that's a fascinating topic.

  • I would think of computer-assisted translation as "computer-assisted mapping", mapping A (such as French) to B (English).

  • The French-English translation example reminds me of a recent story on Nigel Richards who has just become the French Scrabble champion even though he doesn't speak French:
    Statistical models have similarly become "translation champions" even though they don't understand the linguistic structure of the language. I also remember being very surprised when I first heard of such an approach to translation, and that it can actually work quite well. I agree that it would be interesting to combine both approaches, for example, by trying to extract some linguistic structure (e.g., the concepts of nouns and verbs) from the statistical data.

  • If I have understood this correctly, then Google Translate is something like John Searle's Chinese Room. No wonder it's unsatisfying; we don't normally think of proof as happening without consciousness.

  • At first the statistical approach sounded all wrong to me. But then I thought about how very few people actually know the rules of language. Some people can speak English well even though they don't know what a verb or participle is. Perhaps humans generate speech partly by echoing things that they remember hearing, not constructing by rules, and this programming technique is mimicking the human brain in this regard. There's definitely something to approaching it from both directions, and the same goes for mathematical proofs.

  • In all of its forms, this argument depends upon the profound uniformity of the world and, thus, induction. In other words, in any given circumstances what has happened will, necessarily, happen as it did before.

    Statistics is a method that allows us to analyze a record of systems of behaviors, and it depends upon the behavior being captured or described in a consistent way for the purposes of useful comparison. Text translation of Latin-based languages is an example of a good case. However, the audio of a Chinese speaker and an English speaker is much less so. Indeed, the regional audio of one English speaker over another presents difficult challenges. Similarly, the case for translation between the blind and the sighted will vary.

    Computer-aided proofs typically demonstrate only tautologies by deduction. Deduction, and truth value systems generally, is a useful dualism for the purposes of explanation, but it is a dualism, nonetheless, with well-known challenges that result (Gödel). Lengthy proofs require the time and attention to the rules that a machine can give. They cannot conceive of new mathematics nor modify mathematical approaches that they have not been instructed to allow.

    Computers are useful machines. They are very good at handling many artificial things, such as ASCII text, and they can do a more diligent job of applying the methods we discover (such as statistics and deduction); by this means they may offer us explanations that would be difficult for us to reach. But let us be clear: this is our own intelligence extended by proxy. The machine may surprise us, but it adds the persistence of a mechanical device to our own ingenuity. They are, and as currently conceived always will be, unable to apprehend the world as one, as we do. And while machines do have short-term benefits over our own capacities, this is a limit upon their possible intelligence. And this is most certainly the case at the levels of electrical power utilized by the human organism.

    As both Charles Peirce and Alan Turing note (I paraphrase in order to unify their words): "computing machines allow us to discern what remains for the living mind."

  • May I suggest that an 'adequate' understanding of complex systems (by science as a whole) could help very significantly to clarify the questions raised by Michael Nielsen.
    For instance:
    — The brain is part of the human physiological system that enables the human organism (as an encompassing system – think Venn Diagrams) to exist; live; direct action to meet challenges that confront the whole system; etc, etc, etc.
    — The 'mind' is a system 'associated' in various complex ways with the 'brain'. (Science has not adequately explored the nature of this association).
    — We have today a quite significantly better understanding of the brain as a complex system (controlling, in some sense, the encompassing human organism) than we had even two decades ago; however, our understanding of 'brain' is still very limited indeed.
    — Our understanding of the 'mind' as system is even more limited.
    — The computer is a 'human-made system' that, in some (very limited) sense, mimics a few of the simpler operations and activities of the human mind (as part of the living system).

    Any real advances on the issues raised by Michael Nielsen would have to wait till cognitive and other scientists have adequately *integrated* systems science into their disciplines. At the moment (I claim), conventional science has rather little understanding of system science. (What is called 'systems analysis' is scarcely pertinent to the issues under consideration; statistics enhances understanding of these issues as little as the 'discipline' of economics has enhanced the understanding of the nature of national/world economies).


  • Empirically, the fact that the statistical translators are used for the major translating applications we have today, is sufficient proof that this approach is the most valid. As a neuroscientist it also feels like the closest kin to how we learn about the world – observing regularities, making predictions about pattern based on context and probability. All of the "rules" of grammar are reverse engineered: nobody sat down at the dawn of humanity and said, "we're going to pattern our communication around nouns, verbs and the following tenses …". The regularities in language have coalesced around patterns the human brain is good at inferring from the communication sounds of others, and we tend to converge on a consensus structure which is easily predicted and reproduced, so that we can move past grammar and deal with meaning. If we need more complex meaning, the language gets bent into new structures which yield that meaning, and we all adapt by assimilating the new forms into our internal models. The rise of internet slang is a beautiful example of this process, verbing nouns, acronyms as spoken words, graphical symbols and even pasted photos and GIFs as language elements. New slang, new meaning, new communication. It's not designed, and if you want to understand it, you just have to ride its currents. Anyway – thanks for writing this piece, I enjoyed the issues it raised!

  • A very interesting article and discussion. As one knows, computers can play better chess than humans. In the opening they will play the best openings based on the statistics of millions of games. In the middle game, they will do better calculations. And in the endgame, they can use the database of all endgame positions with the shortest sequence of moves to checkmate. Of course, chess players were complaining that the computers did not understand the game. However, what is there to understand if you know the way to force mate?

  • A very early example of computers generating conceptually new information resulted in my being able to make sense of my PhD research data, which otherwise would have been completely baffling. In 1960 Gibson et al. (see Physical Review 120, p. 1229, 1960) simulated what happens when an atom plows into a crystal made of atoms having the same mass as the incident atom. They looked at crystals consisting of 500 to 1000 atoms, varying the energy and direction of the incident atom with respect to the crystal structure. They kept track of the motion and interactions of each atom in the simulated crystal as the energetic atom was stopped and the corresponding energy and momentum were dissipated throughout the crystal.

    The results of 45 such 3-D simulations revealed that “replacement collisions” were very numerous, resulting in a localized “hole” in the crystal of only a few atoms in size at the point of the collision. Other groups subsequently replicated the simulated phenomenon for a wide range of crystal structures. This localized “hole” was an intrinsically different form of radiation damage, compared to damage that happens when the bulk of a material gradually deteriorates under radiation.

    (Replacement collisions occur when the incident atom and the struck atom have the same mass. If this occurs in a crystal, the incident atom stops and can transfer its energy down some axis of the crystal, with an atom “down the road”, so to speak, getting bumped out of its position. Think of the desk toy where a row of steel balls are hanging in a line and the end balls alternately swing away and then return to bump the ball on the other end out of its position.)

    In my PhD research we were intending to measure the relative nuclear shapes of three hafnium isotopes (176, 178, 180). This entailed creating isotopically-pure hafnium crystals (e.g. HfC) and using the local electromagnetic field within the crystal to reveal the different interaction of each isotope nucleus with that field. (We used a beam of high-energy alpha particles to excite the hafnium nuclei and the Mossbauer effect to observe the subsequent de-excitation spectra.)

    What we found were dramatically distorted interactions compared to what was expected. And, curiously, the shape of each spectrum was independent of the time the crystal was exposed to the beam of alpha particles. The distortion could only be the result of changes in the immediate neighborhood of the decaying nucleus, not of bulk changes due to the alpha-particle irradiation. This was the first time such localized radiation damage had been observed (see Jacobs et al., Physics Letters A 29, p. 498, 1969). It would have been difficult, if not impossible, to explain the data without the results of the earlier computer simulations.

  • Mr. Rickard brings up an interesting point, and one that occurred to me while reading the article, as well. Searle's Chinese Room poses an interesting question regarding the state of machine intelligence, but, in the case of translation, I'm not sure that it is material to any appraisals of its success. For just like the Chinese Room, what do the methods by which the translation is accomplished matter, so long as it is performed satisfactorily?

    Interestingly enough, there is currently a Reddit experiment – /r/subredditsimulator – that employs bots which parse Markov chains and attempt to produce content that is indistinguishable from human-submitted content. Humans vote on the best-performing bots, and score is kept to determine which is best at creating human-like content.

    While there is no actual prize at the moment other than pride for the programmer, it might give a glimpse into the future where bots compete at a Darwinistic level for more computing resources – a cyberpunk vision of survival of the fittest.

  • Two points:
    1. Skeptics of computer generated proofs are correct in that if a human does not understand the proof or cannot directly verify that it is correct, then the theorem hasn't been proven to him, even though it may have been proven to the computer. I'm not familiar with the software that generates such proofs, but I'm guessing that some work has to be done on producing comprehensible output.
    2. While it is true that statistically based translation techniques do not say much about the structure of language and are thus of little use to the linguist, that is not what they are for; rather, their purpose is to provide accurate translations. Thus, Dr. Chomsky and his linguistic colleagues still need to rely on more structured analyses with the aid of other sorts of programs (to the extent they use computers).

  • MIT claims to have found a “language universal” that ties all languages together

    Perhaps this is an example of statistical models allowing us to "discover previously unknown units of language"?

  • Interesting article. I had not (still have not) kept up with progress in machine translation, so I was quite startled at how (and in what manner) it had progressed.
    The Google and similar translators, while far, far from perfect, nonetheless sometimes leave me stunned by their sheer usefulness; for example I have been able to use sizable technical articles in topics with which I have a nodding acquaintance, in languages that I do not know, as bases for producing English material on the topics (not translations, but along the same notional lines). Not many decades ago I would not have believed that I would ever see the day. It also has permitted me to correspond with experts who have little competence in English (barely more than I have in their language!). This might not sound like much, but in practice it is beyond rubies!
    Of course, I still strongly urge people to read H.G. Wells' short story "Triumphs of a Taxidermist".
    HOWEVER… it seems that although it works well on major European languages such as Italian, Spanish, the Scandinavian languages and the like, the translators have a lot more work to do on the minor languages. I speak Afrikaans for example, and on occasion I check on translations from that language, mostly for fun, and what comes out is often not intelligible. I mention this not as a criticism, but as a caution to anyone who might think that the magic all just HAPPENS…

    That said, not to clutter correspondence with adulation, I am deeply impressed with your writing and your command of your subject matter. Much strength to your powers, and long may they function!

  • This reminds me of the trinity of Brahe, Kepler and Newton. Brahe collected – which is now done by our machines. Kepler discovered the equations – which is now being done by machine learning. How far can machine learning go? Will it be able to discover the laws that give rise to the equations that can explain the data?

    Michael – which domain is the most fertile? You mention the example of natural language. Very hard. You mention the genome – very rich because we have so few laws. But probably very hard too – for the same reason.

    Is there a simpler domain? What about machine modeling?
