I wrote a blog post earlier about an example of where program optimization really matters and can make an enormous difference in how long you have to wait for a program to finish (even though the notion that program optimization is not all that important is far too widespread). In that case, Matt Parker's original Python program took an entire month to calculate something that could quite trivially be calculated in a few seconds, or even under a second with a bit of algorithmic optimization.
I would like to describe here how an experienced programmer approaches such a problem.
Said problem is:
Given a big English dictionary file (i.e. a file containing hundreds of thousands of unique English words), find all combinations of five 5-letter words that, combined, use 25 unique letters of the alphabet.
Any experienced programmer would very quickly think of ways to do that in a few seconds at most, and would know categorically that one month is far, far too slow, even if we restrict ourselves to using Python.
Matt Parker is not an experienced programmer, and his approach to solving the problem with a Python program was extraordinarily naive and inefficient. Essentially, he read the entire dictionary, all several hundred thousand words, into some kind of data container, and then went through every single combination of 5 words and checked whether it matched the conditions: all five words are 5 letters long, and all 25 letters are unique.
That is, of course, an extremely inefficient way of doing it, and, as it happens, it can be sped up by several orders of magnitude with extremely simple optimizations. These optimizations might sound trivial to most programmers once stated, but they might not occur to a beginner programmer.
First: By far the biggest and also the simplest optimization (that Matt clearly didn't think of): We are only interested in 5-letter words. That means we can discard all the other words from the get-go. It's that simple. When reading the dictionary file, if a word doesn't have 5 letters, just discard it and don't add it to the data container.
That simple step reduces the number of words from several hundred thousand to just a few thousand. That speeds up all the subsequent calculations by many orders of magnitude, since going through every combination of 5 words grows as the fifth power of the number of words.
Matt's biggest mistake was taking all the words in the dictionary file without discarding any of them. That, rather obviously, is completely unnecessary and extremely wasteful. This optimization alone, even without anything else, would probably have reduced the runtime of his program from one month to an hour or so, perhaps even less.
Second: We can actually discard even more words than that, further reducing the amount of data to process. In particular, if a word has a repeated letter, we can also discard it when reading the input, because such a word fails the conditions immediately and cannot be part of the answer. This further reduces the number of words, probably to less than half of all the 5-letter words in the dictionary.
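Both filtering steps together amount to just a couple of conditions when reading the input. A minimal sketch in Python (the function name and the example word list are my own, not Matt's code):

```python
def filter_candidates(words):
    """Keep only 5-letter words whose five letters are all distinct.

    len(set(word)) == 5 is true exactly when no letter repeats,
    since a set collapses duplicate letters.
    """
    return [w for w in words if len(w) == 5 and len(set(w)) == 5]
```

Applied to a real dictionary file, this would run once while loading the words, e.g. `filter_candidates(line.strip() for line in f)`, and everything afterwards only ever sees the reduced list.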
This second optimization alone would likely have brought Matt's program, even without any other changes, down to ten minutes, perhaps even less.
Third: This is where actual algorithmic knowledge and thinking are required.
Matt's program went through every single possible combination of 5 words in the input. This is unnecessary: we can go through significantly fewer combinations than that, further reducing the runtime by an order of magnitude or two, perhaps even more.
This algorithmic trick is very common and well known when dealing with exactly this kind of problem. The idea is: if two words share any letters, you don't need to check any combination containing both of those words (because, rather obviously, all of those combinations will fail). This idea alone allows skipping the vast majority of possible combinations, speeding up the program enormously.
While the idea and principle are quite simple, it might not be immediately obvious to a beginner programmer how to actually implement it in code (and many beginner, and even not-so-beginner, programmers will often resort to really complicated and lengthy solutions to achieve it). This is where programming and algorithmic expertise becomes very helpful, as the solution is much simpler than it might sound.
Explaining the algorithm in great detail would require quite a bit of text, so I'll just summarize the general idea: going through all the combinations of elements (words, in this case) can be implemented as a recursive function that keeps track of which elements have been chosen so far. The recursion can be cut short (returning to the previous recursion level) as soon as we detect a word sharing letters with the ones already chosen, thus skipping all the deeper recursions under that branch.
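The recursive idea might be sketched like this in Python (a minimal illustration under my own naming, not Matt's actual code; it assumes the word list has already been filtered as described above, and uses plain letter sets for the shared-letter check):

```python
def find_combos(words, start=0, chosen=(), used=frozenset()):
    """Yield every 5-word combination with no shared letters.

    'chosen' is the partial combination built so far, 'used' the set of
    letters it occupies. Any candidate sharing a letter with 'used' is
    skipped, which prunes every deeper combination under that branch.
    """
    if len(chosen) == 5:
        yield chosen
        return
    for i in range(start, len(words)):
        letters = set(words[i])
        if letters & used:          # shared letter: prune this whole branch
            continue
        yield from find_combos(words, i + 1,
                               chosen + (words[i],), used | letters)
```

Starting each level at `i + 1` ensures each combination is visited only once, in sorted order, rather than in all 120 orderings of its five words.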
Fourth: Comparing words.
Here, too, Matt used a very naive approach: he took the original words and did an elaborate quadratic comparison of each of their letters.
The thing is: when doing this comparison, we don't need the letters of the words in their original order. We only want to know whether the two words share letters, and for that the order of the letters within a word doesn't matter. Thus, we can rearrange the letters of each word, e.g. alphabetically, to make the comparison simpler and faster: two alphabetically sorted words can be checked for shared letters in a single linear pass instead of a quadratic letter-by-letter comparison. (The original word can be kept alongside the reordered one in order to print out the found words.)
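With both words' letters pre-sorted, the shared-letter check becomes a merge-style walk, advancing whichever side holds the smaller letter. A sketch (my own illustration, not from the original program):

```python
def share_letter_sorted(a, b):
    """Return True if the words a and b share any letter.

    Both arguments must have their letters pre-sorted alphabetically,
    e.g. "".join(sorted(word)). The walk is linear in the word lengths.
    """
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            return True
        if a[i] < b[j]:
            i += 1              # a's letter is smaller: advance a
        else:
            j += 1              # b's letter is smaller: advance b
    return False
```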
But that's not even the most efficient way of doing it. Since no letters are repeated (we have already discarded all the words with repeated letters), we can create a small bitmask for each word, with one bit representing each letter of the alphabet. Since the English alphabet consists of 26 letters, a 32-bit integer more than suffices for this. Thus, we can "convert" each 5-letter word into an integer (whose bits tell which letters are used in the word), and then use a bitwise "and" to check whether two words share any letters: a nonzero result means they do. In other words, rather than walking through strings of letters, we are just combining two integers with a bitwise-and operation. This is extraordinarily fast.
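The conversion is a few lines (again a sketch of the general technique, with my own naming):

```python
def word_mask(word):
    """Encode a word's letters as a 26-bit integer: bit 0 = 'a' ... bit 25 = 'z'."""
    mask = 0
    for ch in word:
        mask |= 1 << (ord(ch) - ord('a'))
    return mask
```

After precomputing one mask per word, the shared-letter test in the recursive search reduces to `mask & used != 0`, and the cumulative letter set becomes a plain `used | mask`.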
Even if we restricted ourselves to using Python, doing the four optimizations above would solve the problem in just a few seconds, perhaps even in less than a second.