Improve Linux performance
By Cameron Laird
2004-04-08
Reader Rating:

Sorting is hard
That was an "easy" lesson; "civilians" with no particular programming background can understand the previous paragraph. Sorting, though -- that's in the "hard" category. And it must be: Donald Knuth devotes an entire volume of his classic series on computing to sorting and searching (see Resources for a link to a review of Knuth's The Art of Computer Programming).
Sorting turns out to be at the heart of many performance challenges for a deep reason. Performance is an issue for large-scale computations alone. Humans can only understand a few items at a time, so the way we grasp problems large enough to take a long time is to organize or structure them in some way. Sorting is the most common of these ways; a sorted list can be binary-searched rather than linearly searched. That simplifies manual lookup in the New York City telephone directory, for example, from a problem on the scale of a million to a problem on the order of the logarithm of a million (about fourteen, say).
That's typical of sorting work; superior algorithms often outperform inferior ones by a factor of a thousand or more. It pays to sort intelligently.
Most intelligent of all, however, is to avoid sorting altogether, or at least limit it to occasions when it won't be noticed. That's common with database management systems; among other benefits, they allow for the construction of "insert-time" indexes. When a sorted result is needed, it can be read directly from the index. The cost is a slightly longer time to create or insert elements -- but users won't notice this if the application's workflow is typical.
Knuth's reference describes plenty of other tactics, such as "Boyer-Moore" or "Rabin-Karp" for accelerating sorting operations in specific situations. One common principle that unifies several of these is that it can be easier to solve a general problem than a more specific one. Mathematicians are familiar with this; common problems in algebra are more tractable when considered over the complex numbers than over the reals, even though complex arithmetic is more difficult.
That's how I think about "decorate-sort-undecorate" (DSU). Suppose you have this dataset:
"Jane Jones" -> 15,000
"Dan Smith" -> 6,000
"Kim Black" -> 40,000
Say you need the names sorted by their associated revenues. A naive approach might be to use a library entry point for sorting, supplied with a comparison function to say that "Kim Black" precedes "Dan Smith" because 40,000 is greater than 6,000, and so on. That's straightforward and effective for small problems.
Far more efficient, though -- especially when sorting a million items at a time, as is common in bioscience, finance, and other domains -- is to go to the trouble of creating a special-purpose "reverse table." While this operation takes time, it allows the sorting to be very fast. Python's list comprehensions make this particularly economical to express; the same principle applies in any language you might have at hand, though:
Listing 3. Minimal decorate-sort-undecorate illustration, in Python
decorated = [ (revenues[name], name) for name in name_list ]
decorated.sort()
for (revenue, name) in decorated:
print "%14s: %8d" % (name, revenue)
Sorting the derived dataset can easily accelerate the overall operation by an order of magnitude or more. Calculations of computational complexity seem to be one of the harder subjects for students; there's no doubt of their value, though, when they lead to such speedups.
First published by IBM developerWorks
If you found this article interesting, you may want to read these as well:
|