Everybody in Computer Science hears about quick sort soon enough. We are told that if you have to choose one sorting algorithm, choose quick sort. Like everyone else, you start to wonder about complexity and see that quick sort is O(n**2), which is worse than O(n lg n) of Merge Sort and Heap sort. So you are told that well, Merge sort requires extra storage so that is bad. So, you wonder, well, Heap Sort does not need extra storage at all? Why is not that called quick sort? This blog post is about some of the things I have learnt about quick sort recently. The information here is all thanks to several handouts by various professors and also to Robert Sedgewick for his slides.
Vanilla Quick Sort
Quick Sort was invented by Sir Tony Hoare in 1962. Quick Sort gives you the impression that its just rearranging elements and by the end, you realize, “Look Ma, its sorted!”. Let’s see an example.
Consider an input array [38, 14, 19, 4, 15, 6, 1, 7, 18]. 9 elements in total. What does quick sort do?
quicksort(a, low, high): p = partition(a, low, high) quicksort(a, low, p - 1) quicksort(a, p + 1, high)
Essentially, we are “partitioning” the array and then recursively calling quicksort on the two partitions. Let’s see our partition method.
In this case, we are selecting the first element each time, as our “pivot”. Pivot indicates the element around which we are partitioning. Consider the initial array shown above. If we partition around 38, our resulting partition will look like:
[18, 14, 19, 4, 15, 6, 1, 7, 38]. Why is this? Because none of the elements in that array are greater than 38. So, we see the whole array and make no swaps. Finally, our resulting partition will put 38 at the end and leave us to sort the rest of the array in the next step. This is the first problem with quicksort!
Selecting the pivot
So, how do you select the pivot?
Select the mid-point each time. If the data is randomly distributed, then, choosing the midpoint would be most often better than choosing the first or last. The only thing that would change is
mid = lo + (hi - lo)//2 #prevent integer overflow pivot = a[mid] a[lo], a[pivot] = a[pivot], a[lo]
Problem with this one?
It’s easy to game this. You can give a dataset that is quite sorted. And selecting the midpoint will not help us a lot in that case.
Why all this fuss about pivot selection?
We have all this fuss because each time we call “partition”, we actually want to make sure our data gets split into half! Or as close as possible. The complexity of partitioning will always be linear. But, if we can make sure that each time we call it, our data gets split in half, then, our quicksort method will make less number of recursive calls (~lg n). And that is exactly why there is a lot of fuss about selecting the right pivot under all (most) conditions. IF you want to try, call our vanilla_quicksort method passing data that is already sorted. (it will crash or use up an insane amount of recursion depth).
Median of first, mid and last
It’s an array. You know the first & last indices, you can find the mid. What if you take the median of these three as pivot? For instance, in our previous array above, the median of (38, 15, 18) is 18, better than either 38 for sure, pretty much like 15. Will that help us? Asympototically, it may not help us much. Well, what if we select the median of the whole array? Use a linear-time algorithm to find the median and then call partition on that. You will only call it ~lg n times because well, you are partitioning on the median. Lets find out which one helps us in practice.
Random pivot selection
One of the most fascinating things about algorithm study is the power of randomization algorithms. These are just how they sound, they randomly permute the input and guarantee an optimal “expected” time. The first question that obviously comes up is, “Why do I trust randomness?”. We do because in randomized algorithms, only a little is left to chance. In practice, they work out really well. How does it fit into quicksort? Well, remember the recent discussion above about pivot selection. We can select the midpoint, or select the first or last, or? Or select a random index between lo and hi! Select it randomly. And trust your random number generator and the power of randomization to give you good results in practice. All that would change in the code above is the line that selects the midpoint to
p = random.randint(lo, hi)
Everything else would remain the same.
What is the problem with random pivot selection?
We will call a random number generation many times! Each time we call partition, we will call a random number generator. So, we want to really reduce the number of times we call partition.
What else can we do?
Idea 1: Switch to insertion sort for smaller arrays
Insertion sort is a O(n**2) algorithm but its fast for smaller, partially sorted arrays as it makes minimum number of swaps, O(n+d). You can choose a small value like 15 or 32. Probably not more than that. I believe, in Prof Knuth’s book, he proves n = 15 is optimal (using Entropy).
Idea 2: Always call on the smaller partition
From above, we know that the faster we reach smaller sizes, the faster we get. We also know that our partition method is O(N) and there is no way around that. So, each time we get a pivot position, we can always call quicksort recursively using the smaller partition. This way, our recursive stack will bound to O(lg N).
With both of these optimizations, how will the code look?
This will reduce number of calls to the partition method and also sort smaller arrays faster! What are we missing?
It is not so straight-forward. What do you do when you have few distinct keys? You have, say, 10000 partitions but only about 15 are unique. Say, one of the keys is 5 and it occurs 78 times. You pivot using 5 but, the other 77 copies could be anywhere. Ideally, we would want that 5 to be together and then sort the array that has elements before 5 and after 5 recursively. But, our pivot based method is only < and >.
Enter 3-way partitioning
This will divide our array into 3 parts, less than pivot, equal to pivot, more than pivot. This will need some changes to our partition method. It will now look like below
So, with all of this, quicksort is what’s used in languages and APIs right?
Not so fast
If you notice, none of our partitioning methods are stable. So, that is kind of a bummer. You can make it stable, but that is rather complicated. Python uses timsort, which is a hybrid sorting algorithm that uses merge sort as one. It finds parts of the data that are partially sorted and does not sort them again. It also uses the cache effectively. C++ STL has something called as introsort, which is quicksort with all the optimizations mentioned above and it switches to heapsort for smaller arrays (in-place and guaranteed O(n lg n)).
As a summary, when trying to implement quicksort and wondering why its not quick, remember:
- partitioning is expensive. minimize those calls
- data can be bad. permute them randomly and then sort
- select pivot at random
- don’t use it for smaller arrays. switch to something faster