At first glance, the big news coming out of this summer’s conference on the theory of computing appeared to be something of a letdown. For more than 40 years, researchers had been trying to find a better way to compare two arbitrary strings of characters, such as the long strings of chemical letters within DNA molecules. The most widely used algorithm is slow and not all that clever: It works through the two strings position by position, comparing every prefix of one against every prefix of the other, so its running time grows with the product of the strings’ lengths. If a better method to calculate this “edit distance” could be found, researchers would be able to quickly compare full genomes or large data sets, and computer scientists would have a powerful new tool with which they could attempt to solve additional problems in the field.

Yet in a paper presented at the ACM Symposium on Theory of Computing, two researchers from the Massachusetts Institute of Technology put forth a mathematical proof that the current best algorithm was “optimal” — in other words, that finding a more efficient way to compute edit distance was mathematically impossible. *The Boston Globe* celebrated the hometown researchers’ achievement with a headline that read “For 40 Years, Computer Scientists Looked for a Solution That Doesn’t Exist.”

But researchers aren’t quite ready to record the time of death. One significant loophole remains. The impossibility result is only true if another, famously unproven statement called the strong exponential time hypothesis (SETH) is also true. Most computational complexity researchers assume that this is the case — including Piotr Indyk and Artūrs Bačkurs of MIT, who published the edit-distance finding — but SETH’s validity is still an open question. This makes the article about the edit-distance problem seem like a mathematical version of the legendary report of Mark Twain’s death: greatly exaggerated.

The media’s confusion about edit distance reflects a murkiness in the depths of complexity theory itself, where mathematicians and computer scientists attempt to map out what is and is not feasible to compute as though they were deep-sea explorers charting the bottom of an ocean trench. This algorithmic terrain is just as vast — and poorly understood — as the real seafloor, said Russell Impagliazzo, a complexity theorist who first formulated the exponential-time hypothesis with Ramamohan Paturi in 1999. “The analogy is a good one,” he said. “The oceans are where computational hardness is. What we’re attempting to do is use finer tools to measure the depth of the ocean in different places.”

According to Ryan Williams, a computational complexity theorist at Stanford University, an imprecise understanding of theoretical concepts like SETH may have real-world consequences. “If a funding agency read that [*Boston Globe* headline] and took it to heart, then I don’t see any reason why they’d ever fund work on edit distance again,” he said. “To me, that’s a little dangerous.” Williams rejects the conclusion that a better edit-distance algorithm is impossible, since he happens to believe that SETH is false. “My stance [on SETH] is a little controversial,” he admits, “but there isn’t a consensus. It is a hypothesis, and I don’t believe it’s true.”

SETH is more than a mere loophole in the edit-distance problem. It embodies a number of deep connections that tie together the hardest problems in computation. The ambiguity over its truth or falsity also reveals the basic practices of theoretical computer science, in which math and logic often marshal “strong evidence,” rather than proof, of how algorithms behave at a fundamental level.

Whether by assuming SETH’s validity or, in Williams’ case, trying to refute it, complexity theorists are using this arcane hypothesis to explore two different versions of our universe: one in which precise answers to tough problems stay forever buried like needles within a vast computational haystack, and one in which it may be possible to speed up the search for knowledge ever so slightly.

**How Hard Can It Be?**

Computational complexity theory is the study of problems. Specifically, it attempts to classify how “hard” they are — that is, how efficiently a solution can be computed under realistic conditions.

SETH is a hardness assumption about one of the central problems in theoretical computer science: Boolean satisfiability, abbreviated SAT. On its face, SAT seems simple. If you have a formula containing variables that can be set to true or false rather than to numerical values, is it possible to set those variables in such a way that the formula outputs “true”? Translating SAT into plain language, though, reveals its metamathematical thorniness: Essentially, it asks whether a generic problem (as modeled by a logical formula) is solvable at all.

As far as computer scientists know, the only general-purpose method to find the correct answer to a SAT problem is to try all possible settings of the variables one by one. The amount of time that this exhaustive or “brute-force” approach takes depends on how many variables there are in the formula. As the number of variables increases, the time it takes to search through all the possibilities increases exponentially. To complexity theorists and algorithm designers, this is bad. (Or, technically speaking, hard.)
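The brute-force search described above can be sketched in a few lines of Python. This is a minimal illustration, not a real SAT solver; the clause encoding (a positive integer *i* means “variable *i* is true,” a negative one means “false”) and the function name are my own:

```python
from itertools import product

def brute_force_sat(num_vars, clauses):
    """Try all 2^n true/false settings; return a satisfying one, or None.

    A formula is a list of clauses; each clause is a list of literals,
    where literal i means "variable i is true" and -i means "it is false".
    """
    for assignment in product([False, True], repeat=num_vars):
        # The formula is satisfied if every clause has at least one true literal.
        if all(any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return assignment
    return None

# (x1 OR x2) AND (NOT x1 OR x3) AND (NOT x2 OR NOT x3)
formula = [[1, 2], [-1, 3], [-2, -3]]
print(brute_force_sat(3, formula))  # → (False, True, False)
```

The outer loop over `product([False, True], repeat=num_vars)` is exactly the exponential blowup the article describes: each additional variable doubles the number of settings to try.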

SETH takes this situation from bad to worse. It implies that finding a better general-purpose algorithm for SAT — even one that only improves on brute-force searching by a small amount — is impossible.
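For readers who want the precise form, the standard statement (following Impagliazzo and Paturi; exact phrasings vary slightly across papers) is:

```latex
% SETH: brute-force search is essentially optimal for k-SAT as k grows.
\text{For every } \varepsilon > 0 \text{ there exists a } k \text{ such that}
\text{ $k$-SAT on $n$ variables cannot be solved in time } O\!\left(2^{(1-\varepsilon)n}\right).
```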

The computational boundaries of SAT are important because SAT is mathematically equivalent to thousands of other problems related to search and optimization. If it were possible to find one efficient, general-purpose algorithm for any of these so-called “NP-complete” problems, all the rest of them would be instantly unlocked too.

This relationship between NP-complete problems is central to the “P versus NP” conjecture, the most famous unsolved problem in computer science, which seeks to define the limits of computation in mathematical terms. (The informal version: if P equals NP, we could quickly compute the true answer to almost any question we wanted, as long as we knew how to describe what we wanted to find and could easily recognize it once we saw it, much like a finished jigsaw puzzle. The vast majority of computer scientists believe that P does not equal NP.) The P versus NP problem also helps draw an informal line between tractable (“easy”) and intractable (“hard”) computational procedures.

SETH addresses an open question about the hardness of NP-complete problems under worst-case conditions: What happens as the number of variables in a SAT formula gets larger and larger? SETH’s answer is given in razor-sharp terms: You shall never do better than exhaustive search. According to Scott Aaronson, a computational complexity expert at MIT, “it’s like ‘P not equal to NP’ on turbochargers.”

**The Upside of Impossible**

Paradoxically, it’s SETH’s sharpness about what cannot be done that makes it so useful to complexity researchers. By assuming that certain problems are computationally intractable under precise constraints, researchers can make airtight inferences about the properties of other problems, even ones that look unrelated at first. This technique, combined with another called reduction (which can translate one question into the mathematical language of another), is a powerful way for complexity theorists to examine the features of problems. According to Impagliazzo, SETH’s precision compared to that of other hardness conjectures (such as P not equal to NP) is a bit like the difference between a scalpel and a club. “We’re trying to [use SETH to] form more delicate connections between problems,” he said.

SETH speaks directly about the hardness of NP-complete problems, but some surprising reductions have connected it to important problems in the complexity class P — the territory of so-called easy or efficiently solvable problems. One such P-class problem is edit distance, which computes the lowest number of operations (or edits) required to transform one sequence of symbols into another. For instance, the edit distance between *book* and *back* is 2, because one can be turned into the other with two edits: Swap the first *o* for an *a*, and the second *o* for a *c*.
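The “current best algorithm” the article refers to is the textbook dynamic-programming solution. Here is a minimal Python sketch (the function name and comments are my own):

```python
def edit_distance(a, b):
    """Classic dynamic-programming edit distance (Wagner-Fischer).

    Fills a (len(a)+1) x (len(b)+1) table of distances between prefixes;
    the running time grows with the product of the two lengths, which is
    why it becomes impractical for billion-letter genomes.
    """
    m, n = len(a), len(b)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        dist[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]

print(edit_distance("book", "back"))  # → 2
```

The doubly nested loop is the quadratic cost at the heart of the result: on two strings of a billion symbols each, the table would have on the order of 10^18 entries.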

Indyk and Bačkurs were able to prove a connection between the complexity of edit distance and that of k-SAT, a version of SAT (in which each clause of the formula contains at most k variables) that researchers often use in reductions. K-SAT is “the canonical NP-complete problem,” Aaronson said, which meant that Indyk and Bačkurs could use SETH, and its pessimistic assumptions about k-SAT’s hardness, to make inferences about the hardness of the edit-distance problem.

The result was striking because edit distance, while theoretically an easy problem in the complexity class P, would take perhaps 1,000 years to run when applied to real-world tasks like comparing genomes, where the number of symbols is in the billions (as opposed to *book* and *back*). Discovering a more efficient algorithm for edit distance would have major implications for bioinformatics, which currently relies on approximations and shortcuts to deal with edit distance. But if SETH is true — which Indyk and Bačkurs’ proof assumes — then there is no hope of ever finding a substantially better algorithm.

The key word, of course, is “if.” Indyk readily concedes that their result is not an unconditional impossibility proof, which is “the holy grail of theoretical computer science,” he said. “Unfortunately, we are very, very far from proving anything like that. As a result, we do the next best thing.”

Indyk also wryly admits that he was “on the receiving end of several tweets” regarding the *Globe*’s overstatement of his and Bačkurs’ achievement. “A more accurate way of phrasing it would be that [our result] is strong evidence that the edit-distance problem doesn’t have a more efficient algorithm than the one we already have. But people might vary in their interpretation of that evidence.”

Ryan Williams certainly interprets it differently. “It’s a remarkable connection they made, but I have a different interpretation,” he said. He flips the problem around: “If I want to refute SETH, I just have to solve edit distance faster.” And not even by a margin that would make a practical dent in how genomes get sequenced. If Williams or anyone else can prove the existence of an edit-distance algorithm that runs even moderately faster than the standard quadratic-time method, SETH is history.

And while Williams is one of the few experts trying to refute SETH, it’s not a heretical position to take. “I think it’s entirely possible,” Aaronson said. Williams is making progress: His latest research refutes another hardness assumption closely related to SETH. (He is preparing the work for publication.) If refuting SETH is scaling Everest, this latest result is like arriving at base camp.

Even though falsifying SETH “could be the result of the decade,” in Aaronson’s words, to hear Williams tell it, SETH’s truth or falsity is not the point. “It’s almost like the truth value isn’t so relevant to me while I’m working,” he said. What he means is that the scalpel of SETH is double-edged: Most researchers like to prove results by assuming that SETH is true, but Williams gets more leverage by assuming it is false. “For me it seems to be a good working hypothesis,” he said. “As long as I believe it’s false, I seem to be able to make lots of progress.”

Williams’ attempts at disproving SETH have borne considerable fruit. For example, in October he will present a new algorithm for solving the “nearest neighbors” problem. The advance grew out of a failed attempt to disprove SETH. Two years ago, he took a tactic that he had used to try to refute SETH and applied it to the “all-pairs shortest paths” problem, a classic optimization task “taught in every undergraduate computer science curriculum,” he said. His new algorithm improved on computational strategies that hadn’t significantly changed since the 1960s. And before that, another abortive approach led Williams to derive a breakthrough proof in a related domain of computer science called circuit complexity. Lance Fortnow, a complexity theorist and chairman of the Georgia Institute of Technology’s School of Computer Science, called Williams’ proof “the best progress in circuit lower bounds in nearly a quarter century.”

**The Map and the Territory**

In addition to these peripheral benefits, attacking SETH head-on helps researchers like Williams make progress in one of the central tasks of theoretical computer science: mapping the territory. Just as we know more about the surface of the moon than we do about the depths of our own oceans, algorithms are all around us, and yet they seem to defy researchers’ efforts to understand their properties. “In general, I think we underestimate the power of algorithms and overestimate our own ability to find them,” Williams said. Whether SETH is true or false, what matters is the ability to use it as a tool to map what Williams calls the topography of computational complexity.

Indyk agrees. While he didn’t prove that edit distance is impossible to solve more efficiently, he did prove that this theoretically tractable problem is fundamentally connected to the intrinsic hardness of NP-complete problems. The work is like discovering a mysterious isthmus connecting two landmasses that had previously been thought to be oceans apart. Why does this strange connection exist? What does it tell us about the contours of the mathematical coastline that defines hard problems?

“P versus NP and SETH are ultimately asking about the same thing, just quantifying it differently,” Aaronson said. “We want to know, How much better can we do than blindly searching for the answers to these very hard computational problems? Is there a quicker, cleverer way to mathematical truth, or not? How close can we get?” The difference between solving the mysteries of SETH and those of P versus NP, Aaronson adds, may be significant in degree, but not in kind. “What would be the implications of discovering one extraterrestrial civilization versus a thousand?” he mused. “One finding is more staggering than the other, but they’re both monumental.”

*This article was reprinted on Wired.com.*

Interesting read! On the relevance of this finding for biology, you should read https://liorpachter.wordpress.com/2015/08/14/in-biology-n-does-not-go-to-infinity/

This problem reminds me of the time when I was the lead programmer working on a computer virus scanner. The most basic way of detecting computer viruses was to search the computer code (program genome) for a string of bytes (virus genome) and alert the user if such a signature was found. That sounds very much like the DNA problem discussed above.

When the number of virus signatures increased, the scanning times exploded. This was unacceptable for the users. The solution I implemented was to deconstruct the virus genome into byte pairs (65536 possibilities) and construct a bitmap of that. Then do the same for the program genome. You end up with two bitmaps. If you AND both bitmaps (bitwise operations are very fast on computers) and the result is equal to the virus bitmap you have a potential hit. Then you fall back to the slow byte-by-byte comparison. It was relatively rare that a program matched a virus with the bitmap method, so the real CPU-intensive scan only had to be done in a small percentage of cases. The scanning times improved spectacularly.

It's even conceivable to calculate the genome bitmaps (the most CPU-intensive part) in advance so that comparisons against a new bitmap could be done very quickly.

On the off chance that this could be useful.
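[Editor’s note: the byte-pair filter this commenter describes might look roughly like the following in Python, using a big integer as the 65,536-bit bitmap. This is a reconstruction for illustration, not the original implementation; all names are invented.]

```python
def pair_bitmap(data: bytes) -> int:
    """Build a 65,536-bit bitmap with one bit set per adjacent byte pair."""
    bm = 0
    for i in range(len(data) - 1):
        bm |= 1 << ((data[i] << 8) | data[i + 1])
    return bm

def might_contain(program_bm: int, signature_bm: int) -> bool:
    """Cheap filter: the signature can only occur in the program if every
    one of its byte pairs appears somewhere in the program. There are no
    false negatives, only occasional false positives that the slow
    byte-by-byte scan must then confirm."""
    return program_bm & signature_bm == signature_bm

signature = b"\xde\xad\xbe\xef"
program = b"header" + signature + b"trailer"
sig_bm = pair_bitmap(signature)
print(might_contain(pair_bitmap(program), sig_bm))         # → True (do slow scan)
print(might_contain(pair_bitmap(b"clean bytes"), sig_bm))  # → False (skip it)
```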

I wouldn't be quick to jump on the "it's settled" bandwagon. I think quantum algorithms may still surprise us. Once quantum computers start becoming practical and heavily used, research money will flow and I'd bet we see revolutions in complexity theory.

Quantum computers haven't been found to work on a fundamentally different order than any other computer. They are just better at certain types of problems than other computers.

I loved the explanation about hardness, SAT, etc. In the same vein, would the author be so kind as to explain what P, NP, circuits, and the k (in k-SAT) are? I would greatly appreciate simple, easy-layperson-understandable explanations.

Most quantum computer designs are probabilistic. They give you answers that have some fixed probability of being true. Running the algorithm repeatedly lets you increase your confidence in the answer.

This article is about deterministic algorithms. Probabilistic algorithms are a whole other thing.

I've struggled with a few of these calculation limit problems, and think the "strong exponential time hypothesis" (SETH) is not a computational hypothesis but a physical one, so needing physics rather than analysis, and maybe a study of the difference between conceptual and natural law. That the physical world puts constraints on mathematics is, of course, one of those "outcast subjects", generally unpopular because of not being possible to mathematically define. It's how you measure whether a certain result is worth the trouble, though, and we use that quite productively very often.

It's not that it's wrong to challenge it, as you may well find major simplification of a problem that way, like by redefining the problem so it doesn't need much computation any more, but the physical limits of computation aren't a problem of computation or assumptions; they're a problem of externalities.

I have worked extensively with NP-hard optimization problems as a Harvard grad student (with <a href="http://www.merl.com/publications/TR2002-08">published results</a>).

It matters only in an academic sense whether NP-complete problems will remain "out-of-reach." Actual, human-scale problems are falling to faster computation with algorithms that don't necessarily find the optimum solution. There are many algorithmic shortcuts that lead to great results (often provably within a few percent of optimum) that may or may not be perfect, but have the great advantage of being available now.

Continued research into algorithms for NP-hard searching is bringing human benefits in the realm of finding solutions to NP-complete problems. This will continue to be the case even if P does not equal NP (which is likely).

The moral is we shouldn't let the perfect be the enemy of the good.

@ProfessorGuy, sure but P!=NP has all sorts of implications beyond doing big things precisely and fast. In particular of course if P=NP then almost all modern cryptography is doomed, creating very serious economic implications.

The breakthrough connecting SETH and lower bounds for polynomial-time computable problems actually goes back further, to Ryan Williams' 2004 paper. Since then, it has become an active and exciting area, establishing connections between a large number of polynomial-time computable problems (of which Edit Distance is an important example) and SETH. There are other beautiful connections in this area such as relations with dynamic problems, an even more surprising and practically relevant direction.

A good source of information about it is a series of lectures that Virginia Vassilevska recently gave at the Simons institute, as a part of the program she is co-organizing: https://simons.berkeley.edu/talks/computational-complexity-polynomial-time-problems

(It is quite surprising that Virginia is not mentioned here: as far as I know, she is one of the main experts on the subject, and Bačkurs and Indyk's work builds upon some of her results… I would imagine she would be the most natural choice to interview for this article.)

For people interested in "SAT solver" algorithms, I can recommend:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.68.6133 : "A SAT Solver Primer" by David G. Mitchell (from 2008 I think)

wherein algorithmic approaches are listed to beat SAT problems over the head.

These algorithms have a strong "roulette" feeling to them because they try to exploit possibly-present regularities in the problem formulation. They will work well, maybe extremely well in some (most?) cases but fail on a "worst case" problem which has no exploitable structure.

The interesting question is, how often are worst-case problems encountered in the "real world" (how often is the roulette wheel perfectly balanced?) so that the only thing left to do is to enumerate possible solutions and hope that a match is found "soon"? This seems to be a practical question demanding empirical data collection. And maybe there is a deterministic way to cut a non-worst-case problem in NP of size n into a problem in P of size m and a remainder problem in NP of size n-m.