Random randomness

Making Quick Sort Quick

Everybody in Computer Science hears about quick sort soon enough. We are told that if you have to choose one sorting algorithm, choose quick sort. Like everyone else, you start to wonder about complexity and see that quick sort is O(n**2) in the worst case, which is worse than the O(n lg n) of merge sort and heap sort. Then you are told that merge sort requires extra storage, so that is bad. So you wonder: heap sort does not need extra storage at all, why isn't that called quick sort? This blog post is about some of the things I have learnt about quick sort recently. The information here is all thanks to several handouts by various professors and to Robert Sedgewick's slides.

Vanilla Quick Sort

Quick sort was invented by Sir Tony Hoare in 1962. Quick sort gives you the impression that it's just rearranging elements, and by the end you realize, "Look Ma, it's sorted!". Let's see an example.

Consider an input array [38, 14, 19, 4, 15, 6, 1, 7, 18]. 9 elements in total. What does quick sort do?

    quicksort(a, low, high):
        if low < high:
            p = partition(a, low, high)
            quicksort(a, low, p - 1)
            quicksort(a, p + 1, high)

Essentially, we are “partitioning” the array and then recursively calling quicksort on the two partitions. Let’s see our partition method.
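The partition code itself is not shown here, so here is a minimal sketch of one way to write it in Python (the names and details are mine, not necessarily the original):

    def partition(a, lo, hi):
        pivot = a[lo]               # first element as the pivot
        i = lo + 1
        for j in range(lo + 1, hi + 1):
            if a[j] < pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[lo], a[i - 1] = a[i - 1], a[lo]   # drop the pivot between the two halves
        return i - 1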

In this case, we are selecting the first element each time as our "pivot". The pivot is the element around which we partition. Consider the initial array shown above. If we partition around 38, the resulting array will look like:

[18, 14, 19, 4, 15, 6, 1, 7, 38]. Why is this? Because none of the other elements is greater than 38, so we scan the whole array without moving anything, and the partition step simply puts 38 at the end, leaving us to sort the remaining eight elements in the next call. One element settled per linear pass: this is the first problem with quicksort!

Selecting the pivot

So, how do you select the pivot?

Easy idea

Select the midpoint each time. If the data is randomly distributed, choosing the midpoint will usually be better than choosing the first or last element. The only thing that changes is

    mid = lo + (hi - lo) // 2  # written this way to prevent integer overflow
    pivot = a[mid]
    a[lo], a[mid] = a[mid], a[lo]  # move the pivot to the front, then partition as before

Problem with this one?

It's easy to game this. If an adversary knows you always pick the midpoint, they can hand you an input arranged so that the midpoint is a poor pivot at every level, and you are back to quadratic behaviour.

Why all this fuss about pivot selection?

We have all this fuss because each time we call "partition", we actually want our data to get split in half, or as close to half as possible. The complexity of partitioning will always be linear. But if every call splits the data roughly in half, our quicksort makes only ~lg n levels of recursive calls. That is exactly why there is so much fuss about selecting the right pivot under all (well, most) conditions. If you want to see the difference, call our vanilla_quicksort method on data that is already sorted: the recursion depth becomes linear in n, so it will either crash or use up an insane amount of recursion depth.

Median of first, mid and last

It's an array. You know the first and last indices, and you can compute the mid. What if you take the median of these three as the pivot? For instance, in our array above, the median of (38, 15, 18) is 18: certainly better than 38, and about as good as 15. Will that help us? Asymptotically, it may not help much against a determined adversary. Well, what if we select the median of the whole array? Use a linear-time selection algorithm to find the median and then partition around it; the recursion depth is then exactly ~lg n, because you always split in half. Let's find out which one helps in practice.
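As a sketch, median-of-three only changes how we choose the element that gets moved to a[lo] before partitioning (my illustration, not the original code):

    def median_of_three(a, lo, hi):
        mid = lo + (hi - lo) // 2
        # pick the index holding the median of a[lo], a[mid], a[hi]
        candidates = sorted([(a[lo], lo), (a[mid], mid), (a[hi], hi)])
        _, m = candidates[1]
        a[lo], a[m] = a[m], a[lo]   # move it to the front so partition can use a[lo]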

Random pivot selection

One of the most fascinating things about studying algorithms is the power of randomized algorithms. They are just what they sound like: they make random choices (for example, randomly permuting the input) and guarantee an optimal expected running time. The first question that comes up is, "Why should I trust randomness?". We do because in randomized algorithms only a little is left to chance, and in practice they work out really well. How does this fit into quicksort? Remember the discussion above about pivot selection: we can select the midpoint, or the first or last element, or? Or select a random index between lo and hi! Select it randomly, and trust your random number generator and the power of randomization to give you good splits in practice. All that changes in the code above is the line that selects the midpoint:

		p = random.randint(lo, hi)

Everything else would remain the same.

What is the problem with random pivot selection?

We call the random number generator many times: once for every call to partition. Generating random numbers is not free, so we really want to reduce the number of times we call partition.

What else can we do?

Idea 1: Switch to insertion sort for smaller arrays

Insertion sort is an O(n**2) algorithm, but it's fast for small, partially sorted arrays: it does O(n + d) work, where d is the number of inversions. You can choose a small cutoff like 15 or 32, probably not more than that. I believe Prof Knuth's book derives a cutoff around 15 as optimal.

Idea 2: Always call on the smaller partition

From the above, we know that the sooner we get down to small subarrays, the better. We also know that our partition method is O(N) and there is no way around that. So, each time we get a pivot position, we can recurse only on the smaller partition and loop on the larger one. This way, our recursion stack is bounded by O(lg N).

With both of these optimizations, how will the code look?
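Roughly like the sketch below, which combines the insertion-sort cutoff with recursing only on the smaller partition (my own sketch, reusing the partition function from earlier; the cutoff value is arbitrary):

    CUTOFF = 16   # arbitrary small cutoff

    def insertion_sort(a, lo, hi):
        for i in range(lo + 1, hi + 1):
            j = i
            while j > lo and a[j] < a[j - 1]:
                a[j - 1], a[j] = a[j], a[j - 1]
                j -= 1

    def quicksort(a, lo, hi):
        while hi - lo > CUTOFF:
            p = partition(a, lo, hi)
            # recurse on the smaller side, loop on the larger one
            if p - lo < hi - p:
                quicksort(a, lo, p - 1)
                lo = p + 1
            else:
                quicksort(a, p + 1, hi)
                hi = p - 1
        insertion_sort(a, lo, hi)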

This reduces the number of calls to the partition method and also handles small arrays faster! What are we missing?

Real-world

It is not so straightforward. What do you do when you have few distinct keys? You have, say, 10000 elements but only about 15 distinct values. Say one of the keys is 5 and it occurs 78 times. You pivot on a 5, but the other 77 copies could end up anywhere. Ideally, we would want all the 5s gathered together, and then recursively sort only the elements before and after that block. But our pivot-based method only distinguishes < and >.

Enter 3-way partitioning

This divides our array into three parts: less than the pivot, equal to the pivot, and greater than the pivot. It needs some changes to our partition method, which will now look like the code below.
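Here is a sketch of Dijkstra-style 3-way partitioning quicksort (my own rendering, not the original code):

    def quicksort_3way(a, lo, hi):
        if lo >= hi:
            return
        pivot = a[lo]
        lt, i, gt = lo, lo + 1, hi
        while i <= gt:
            if a[i] < pivot:
                a[lt], a[i] = a[i], a[lt]
                lt += 1
                i += 1
            elif a[i] > pivot:
                a[i], a[gt] = a[gt], a[i]
                gt -= 1
            else:
                i += 1
        # now a[lo..lt-1] < pivot, a[lt..gt] == pivot, a[gt+1..hi] > pivot
        quicksort_3way(a, lo, lt - 1)
        quicksort_3way(a, gt + 1, hi)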

So, with all of this, quicksort is what’s used in languages and APIs right?

Not so fast

If you notice, none of our partitioning methods is stable, which is kind of a bummer. You can make quicksort stable, but that is rather complicated. Python uses timsort, a hybrid of merge sort and insertion sort: it finds runs of data that are already sorted and does not sort them again, and it uses the cache effectively. The C++ STL has something called introsort, which is quicksort with optimizations like the ones above, except that it switches to heapsort when the recursion gets too deep (in-place and guaranteed O(n lg n)) and to insertion sort for small arrays.

As a summary, when trying to implement quicksort and wondering why it's not quick, remember:

  1. partitioning is expensive. minimize those calls
  2. data can be bad. permute them randomly and then sort
  3. select pivot at random
  4. don’t use it for smaller arrays. switch to something faster

Bloom Filters in Python

In a previous post, we discussed the essence of Bloom filters and how they can help in language modeling. I had promised to say more but never kept that promise. In this post, we will see how easy it is to implement a Bloom filter. Others have covered this in their own posts in equally accessible ways; I wanted to do it for some tweets I had lying around, and this is the result.

What tweets? A while ago, in March 2011, the Stanford SNAP lab shared around 470 million tweets for download. Unfortunately, Twitter asked them to take the dataset down. Fortunately, I had grabbed a copy before that happened. I cannot and won't share it, but I can tell you about the scale of the data. Consider the following problem:

You want to sign up for Twitter with a username. Twitter tells you in a split second if that username is already taken. They have a LOT of users. How do you test, as fast as possible, whether an element is already in a set? Some percentage of error is tolerable, you say.

Enter Bloom filters. Wikipedia does a decent job of explaining the concept. You have a bit array; to insert an element, you set k bits chosen by k hash functions. Now you have a new element and want to check whether it was inserted: you repeat the k computations and, if all k bits are set, you report it as present. Note that we do not keep any track of collisions. So, what do we conclude if all k bits are set to 1? Only that the element "might" be in the set. Why? Because some other element, or combination of elements, might have set those same bits to 1, (mis)leading us to that conclusion.

Without further ado, let's write some code. I have a lot of tweets. To get started, let's pull out all the usernames from June 2009, which has about 19 million tweets. Each record has a username and a timestamp along with the tweet text; obviously, usernames repeat. Here is the Python script for it.
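The script itself is not reproduced here; as a rough sketch, and assuming each tweet record carries a line like "U http://twitter.com/<username>" (the exact file format is an assumption on my part), it could look like:

    import sys

    # Pull the username out of every 'U' line and print it, one per tweet.
    # Assumes records like "U\thttp://twitter.com/<username>"; adjust to the real format.
    with open(sys.argv[1]) as tweets:
        for line in tweets:
            if line.startswith('U'):
                url = line.split()[-1]
                print url.rstrip('/').split('/')[-1]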

Now we have a file with all the usernames, and we want to store them in a Bloom filter. We can just use a Python list, with 0 denoting 'not set' and 1 denoting 'set'. How many bits should we have? This is a question you answer iteratively. You have some idea of how many elements you will insert (n); you decide how many bits you want (m); and from those you can calculate the optimal number of hash functions, k. What is the problem if you keep the number of bits too low? With too few bits, other elements are far more likely to have already set the same bits your query checks, leading us to wrong conclusions (false positives). Let's find out what happens with our username file.
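For reference, the standard formulas are k = (m/n) ln 2 for the optimal number of hash functions and roughly (1 - e^(-kn/m))^k for the false-positive rate after n distinct insertions. A quick sketch (the 2 million distinct-usernames figure below is purely hypothetical):

    import math

    def optimal_k(m, n):
        # optimal number of hash functions for m bits and n distinct elements
        return (float(m) / n) * math.log(2)

    def false_positive_rate(m, n, k):
        # expected false-positive rate after inserting n distinct elements
        return (1.0 - math.exp(-float(k) * n / m)) ** k

    # e.g. with 2**24 bits, 5 probes and (hypothetically) 2 million distinct usernames:
    print false_positive_rate(2 ** 24, 2 * 10 ** 6, 5)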

Number of usernames, including duplicates:

 
	18572084

The test file consists of the last 20 lines of the training input, plus 3 extra usernames that are definitely not in the file. Let's walk through the code.
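The original code lives with the rest of my scripts; here is a minimal sketch of the same idea, assuming the mmh3 murmur binding and hypothetical file names:

    import mmh3   # one common murmur binding; any murmur implementation would do

    NUM_BITS = 2 ** 12     # try 2 ** 24 for the second experiment below
    NUM_PROBES = 5

    bits = [0] * NUM_BITS

    def indexes(key):
        # simulate NUM_PROBES hash functions from two murmur hashes: h1 + i * h2
        h1 = mmh3.hash(key, 0)
        h2 = mmh3.hash(key, 1)
        return [(h1 + i * h2) % NUM_BITS for i in range(NUM_PROBES)]

    def add(key):
        for i in indexes(key):
            bits[i] = 1

    def contains(key):
        return all(bits[i] == 1 for i in indexes(key))

    with open('usernames.txt') as f:            # hypothetical training file
        for line in f:
            add(line.strip())

    with open('test_usernames.txt') as f:       # hypothetical test file
        for line in f:
            name = line.strip()
            print 'Is ', name, ' present in the ', contains(name)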

We set the number of bits in our Bloom filter to 2**12 = 4096. We only use one underlying hash function, murmur hash, but we probe 5 times by computing

	h1 + i * h2

Why? Because of a paper (which can be found here) showing that simulating many hash functions from just two does not increase the asymptotic false-positive rate. We use murmur hash because it has been found to work well before, and h1 + i*h2 is the standard technique from the hashing literature for "simulating" extra hash functions. So, what do we get?

Is  fherdito  present in the  True
Is  joshdevine  present in the  True
Is  ayman99  present in the  True
Is  shabuta  present in the  True
Is  babiiiecakez  present in the  True
Is  billhustle  present in the  True
Is  amilacelia  present in the  True
Is  drewstakinatwit  present in the  True
Is  nwriter5  present in the  True
Is  frzz  present in the  True
Is  knightowl  present in the  True
Is  lely91  present in the  True
Is  arcusnelson  present in the  True
Is  ayonaze  present in the  True
Is  ssalti  present in the  True
Is  yr1c  present in the  True
Is  faithkenia  present in the  True
Is  lnjnana  present in the  True
Is  sted4you  present in the  True
Is  ubledyouth1  present in the  True
Is  WhATEver  present in the  True
Is  NOREALLY  present in the  True
Is  #!@#@#%$^(shit  present in the  True

Notice the last three entries. They are actually not in the set, yet they still show up as present. This is because we kept our size at 2^12 bits: with roughly 18 million insertions into 4096 bits, essentially every bit ends up set, so the false-positive rate with 5 probes is close to 100%. What happens if we make our size 2^24?

Is  fherdito  present in the  True
Is  joshdevine  present in the  True
Is  ayman99  present in the  True
Is  shabuta  present in the  True
Is  babiiiecakez  present in the  True
Is  billhustle  present in the  True
Is  amilacelia  present in the  True
Is  drewstakinatwit  present in the  True
Is  nwriter5  present in the  True
Is  frzz  present in the  True
Is  knightowl  present in the  True
Is  lely91  present in the  True
Is  arcusnelson  present in the  True
Is  ayonaze  present in the  True
Is  ssalti  present in the  True
Is  yr1c  present in the  True
Is  faithkenia  present in the  True
Is  lnjnana  present in the  True
Is  sted4you  present in the  True
Is  ubledyouth1  present in the  True
Is  WhATEver  present in the  False
Is  NOREALLY  present in the  False
Is  #!@#@#%$^(shit  present in the  True

The above points to two things. One: Bloom filters don't give false negatives; if something was inserted, the filter will always report it as present. Two: increasing the number of bits drives the false-positive rate down dramatically, even with the same five probes (two of our three fake usernames are now correctly rejected). Note that the dataset is quite large.

That was a very simple implementation of a Bloom filter. People have done cooler things with Bloom filters, like Bloomier filters, studies of the effects of various hash functions, and more. They are used in Cassandra, Google's BigTable and, I presume, in the safe-browsing functionality of browsers. The real value is not squeezing the false-positive rate all the way to zero; it is that the filter never throws up false negatives, so a "no" answer can be trusted without touching the slower backing store.

Michelin, Three Stars and Long Days

The Michelin guide is an annual book published by a French company. Essentially, they have a set of experienced testers who eat all around the world and rate restaurants with 1, 2 or 3 stars. Three Michelin stars is the pinnacle, and every chef's dream.

I recently saw a documentary that interviewed 9 chefs who hold 3 Michelin stars. What is common among them, their stories and their work days?

16-hour workdays are common. Everyone works very, very hard on their craft. It's a constant endeavour to perfect the art, whether it's sushi, Eurasian cuisine, Basque cuisine or a unique take on royal French cuisine. Everyone works their arse off every single day.

The Michelin guide used to be very fixated on the luxury factor and the ambience. You have to realize that French cuisine is one of the most sophisticated cuisines: it was refined in royal halls and is still cooked the way it was done for kings and queens. Every dish has many ingredients, a long process, reductions for sauces and what not. When Michelin went to Tokyo for the first time, they were so impressed that several restaurants got 3 stars, which took the French culinary world by surprise. Asian cooking does not give presentation the same importance; it's more about ingredients, herbs and spices and their esoteric combinations. Let me not go into food details, because I only eat food; my knowledge of cuisines is very limited.

The documentary reinforced the fact that life is hard for chefs. One might think, "Oh, he is a cook, he just does the same thing day in and day out." It's not really true. There are so many variables in play, and they cook under time pressure where a little more or less salt, a little less sour cream, one cherry on the left instead of the right, can destroy a dish, bring one bad review and maybe end hopes of a Michelin star. Just imagine the pressure they go through every day, spent in all that smoke, standing all day. They all have 16-hour workdays!

It also raises the question of whether Michelin stars are worth all that effort. A group of people, self-certified as geniuses at identifying good food, rating you on a scale they defined themselves, really? Is that what it's all about? Good food is about creating experiences. All food gets digested in 24 hours; what stays with you are memories. That cheese fondue, that tiramisu which still lingers on: that's what remains later.

As a parting note, you should watch the film "Jiro Dreams of Sushi". It's amazing, it's mind-boggling: the dedication of one man to his craft, the endless pursuit of perfection, and the no-fuss way of going about it. Amazing, incredible and very inspiring.

Hashing in Python

One of the most popular data structures in Python is the dictionary. Dictionaries promise O(1) lookup on average; they are easy to define, easy to populate, easy to look up, and almost always fit somewhere in your design. This post is not about how to use dictionaries in Python. Instead, we will look at some of the internal machinery that kicks in when dictionaries are in the picture. Note that most of the content below comes from the comments in the CPython source for dictionaries, plus some links here and there. The credit goes to the awesome people behind the Python project; I am just the medium trying to find his way through the maze and appreciate, a little better every day, the awesomeness of the language I use every day.

d = {}
d[1] = 2
d[2] = 3
d[3] = 4
print d

The above will print

{1: 2, 2: 3, 3: 4}

Of course, you can also write,

    d = {}
    d['abc'] = 3
    d['defg'] = 4
    d['agdce'] = 5

This has keys which are strings.

Dictionaries in Python are hash tables. A hash table uses a hashing algorithm to map a key to an index. When one hears "hashing algorithm", one may think of MD5 or SHA1, but those are cryptographic hashes. One of the aims of a cryptographic hash is mathematical hardness: you want something that is practically infeasible to invert or break by brute force. A dictionary's hash function has different goals.

You want a hash function that is i) simple, ii) well distributed, and iii) minimizes collisions.

What do we mean by simple here? A function that is cheap to compute, so that finding the hash of a given key does not take long. We still want it well distributed, so that the output space is well covered. Remember that collisions will always happen: hashing maps a large input space U onto an output space smaller than U, like trying to fit 1000 people into a room that can only accommodate 400. So there will always be collisions. What you want is, first, a way to minimize them and, second, a way to deal with them as quickly as possible when they do happen.

When Python was still a new language, you could only use strings as dictionary keys; now you can also use integers, tuples and other hashable objects. Let's look at how Python hashes three of those types. You can see this in action using the built-in hash() function (the nice thing about introspection is that you can peek at the internals of the language). But remember that this step is really pre-hashing: hashing proper is the step that maps a huge number of possible keys onto a finite number of slots, so you don't pay the memory cost of a traditional direct-access table. Pre-hashing is just mapping an object to an integer, which need not be positive.

Strings

>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]

The numbers above look quite close, but at the bit level their low-order bits are all different. This is good behavior, because the table index is taken from the low-order bits of the hash (and not from all 32 bits!). So if I have a 32-bit hash and a small, mostly empty hash table, I do not use all 32 bits to decide the index: for a table with 8 slots, I use only the last 3 bits.

Integers

>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]

Integers hash to themselves. Think that is crazy? Not really: consecutive small integers land in consecutive slots, so they never collide with each other.

Tuples

>>> map(hash, [(0,1), (0,2),(0,3),(0,4)])
[3713080549409410656, 3713080549410493181, 3713080549411575706, 3713080549412658231]

Observe how the prefix "37130805494" is common to all of them, and the next two digits increase progressively: "09", "10", "11", "12".

What happens when collisions occur?

A collision here is when a key maps to an index that already holds some other entry. Collisions can and will happen. What do you do when they happen? How do you start looking at the rest of the table? Well, let's think of the dumbest thing: scan the slots serially (linear probing). Is that good?

CPython does not look at the slots serially. Instead, a probe sequence is defined so that, on average, you reach the right slot (or an empty slot, in which case you return "key not found") in a small number of probes. This means that when you insert or search for a key and land on a slot that is already filled with a different key, the next probe positions are generated by a recurrence, and the perturbation mixes in the higher bits of the hash so that keys which collide on their low bits quickly go their separate ways.
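The recurrence described in the dictobject.c comments looks roughly like this; here is a toy rendering in Python (the table size and hash value below are just for illustration):

    def probe_sequence(h, mask, max_probes=10):
        # toy rendering of the probe recurrence described in dictobject.c
        PERTURB_SHIFT = 5
        perturb = h
        j = h & mask
        for _ in range(max_probes):
            yield j
            j = (5 * j + 1 + perturb) & mask
            perturb >>= PERTURB_SHIFT

    # the first few slots visited for hash 1658398457 in an 8-slot table:
    print list(probe_sequence(1658398457, 7))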

Okay, seriously, what is the hashing function?!

class string:
    def __hash__(self):
        if not self:
            return 0  # hash of the empty string is 0
        value = ord(self[0]) << 7
        for char in self:
            # c_mul is C-style multiplication: the result wraps around at 32 bits
            value = c_mul(1000003, value) ^ ord(char)
        value = value ^ len(self)
        if value == -1:
            value = -2  # -1 is reserved as an error indicator, so it is never returned
        return value

I have seen this at various places, one of which is effbot. The hash starts from the first character shifted left by 7 (multiplied by 2**7). Then, for each character in the string, the running value is multiplied by 1000003 and XORed with the character. Yes, the values get really large really fast, which is why the multiplication wraps around at 32 bits.

We already saw that the hash of an integer is the integer itself. The above link also has stuff about how tuples are hashed. (it’s pretty much the same)

What about resizing?

Yes, yes, good question. I looked around and found comments in the source code and an answer from Tim Peters. The table is resized whenever fill >= 2/3 * size, where fill is the number of slots in use (including dummies left by deletions) out of the total number of slots. On a resize, if the dictionary is large (more than about 50k used keys), the size is doubled; otherwise it is quadrupled. Why quadruple? Because a bigger jump leaves far more empty slots, so collisions are minimized and resizes happen less often, and for small tables the extra memory is not that harmful. For really big dictionaries, doubling keeps the memory consumption in check.

What happens when you delete an element?

Intuitively, you might think that deleting a key, or many keys, means Python will free that memory and consumption will go down. In practice, that does not happen. Python assumes that if you emptied a slot, you will probably fill it again soon. Moreover, we are using open addressing: when you delete a key, the slot it occupied is not necessarily the first slot it was hashed to. If that slot were simply marked empty, a later search for another key (one whose probe sequence passes through this slot) could hit the empty slot and incorrectly report "key not found". You do not want that, so Python instead fills the slot with a <dummy> marker. If you have deleted a lot of keys and have no intention of adding more, dict.copy() is a good way to reclaim the space: when the copy is made, the dummy slots are dropped. Again, this information is from a Tim Peters reply.

Note: A lot of this information can also be had, with great graphs and an awesome speaker, Brandon Rhodes, in this pycon talk

When I Met Prof Andrew Ng

Recently, I was attending EMNLP, since Seattle is quite close to Vancouver, and Prof Andrew Ng gave a keynote talk. Now, I believe that almost everyone with access to the Internet today knows who Andrew Ng is, but just in case you were on an island like Tom Hanks in Cast Away and lost track of time, a couple of lines about him. He is a very famous machine learning researcher; his contributions range from robotics and self-flying helicopters to driverless cars to deep neural networks that can identify cats! As if that were not enough, he felt he wanted to reach a lot more people, and that's how Coursera came about. Coursera offers free online courses to anyone with Internet access. You might think, "Oh, MIT has been offering that. Khan Academy is there." But Coursera started with graduate-level courses, and what's more, a lot of famous people have offered theirs. It's not a watered-down version; it's the same course that is taught at the university! For free! There have been kids in Afghanistan who walked miles of war-ravaged terrain to get to the Internet and solve the problem sets. Prof Andrew Ng has made the lives of millions of people better. After his keynote talk, I was eager to meet him and tell him, in a few seconds, how his machine learning course had helped me.

The first thing I wanted was a picture with him. At conferences you meet a lot of famous people, and the response to "Hi, I am a graduate student and I was wondering if I can get a picture with you?" ranges from "Sure" to "Oh, I am busy, let's do it later" to "No!". What did Prof Andrew Ng say?

"I will be honoured" 

Yes, that's what the great Prof who has done so much for grad students worldwide said. Note that before me he had met a lot of other people, had given a 75-minute talk, and was now talking to me instead of enjoying the coffee break.

After we got the pic, we had the following conversation :

    Prof Ng: Where are you from ? 
    Me : India
    Prof Ng: Oh where in India ? 
    Me : Eastern part, around 5 hours south of Calcutta. Its a  rural area. 
    Prof Ng : Oh, which course did you take ? 
    Me : I took your very first course, that you also offered on SEE, 

    and that got me a job in a research lab in India, 

    which in turn got me into grad school, and now I work on MT. Thanks for that very first course ! 

    (I proceeded to tell him a little about my hometown and how, with just 256 kbps internet, his videos helped me learn about the bias-variance tradeoff!)


    Prof Ng: I am happy you benefited. 

    Your story is  inspiring and if you don't mind, please email me about it. 

    ( and he gave me his coursera card)

Now, after I had gotten over the fact that somebody as famous as him could be so approachable and so down-to-earth, I started thinking about what I could write as my "inspiring" story. I feel quite lucky that I at least had internet; there are countless people who have faced more obstacles than me and done more than I could in a lifetime. I realized that my biggest inspiration is Prof Andrew Ng himself. He could easily have gone through the rigorous Stanford process and then lived the life: taught the best minds, consulted for Google and what not. Instead, he chose to start a company, a company that offers free courses. If you heard his keynote talk, you would realize that he is very, very passionate about what he is doing and is not in it for the money at all. He believes education is everybody's fundamental right, and with his passion and dedication, I am sure he will achieve all of that and more.

Often, I have encountered people in grad school who, having done something during the course of their PhD, are uptight, unapproachable and just not humble (maybe I have sample bias, but let's not get into that; I have seen plenty of arrogant people outside grad school too). Remember that a guy who was named in a 40-under-40 list, whose company has won numerous awards, who is taking education to the masses, who is one of the smartest people you will ever meet in CS, and who is a prof at Stanford, is humble, patient and so approachable. I saw people giving him a hug and saying, "Professor, you got me a job at (somewhere nice)".

Stay humble.

Vancouver and Food

I moved to Vancouver on August 31, 2012, to pursue grad school. To say that Vancouver is awesome is quite an understatement. Yes, yes, it rains quite a bit. Yes, it's gloomy and depressing on those days. But even when it's raining and gloomy, Vancouver still looks awesome, beautiful.

Growing up, and mostly after college, a lot of people told me that I am all about food, and not in a good way. I am not sure what the issue is with being a foodie, but let's not talk about mean people. This is not a post about that :). It's about Vancouver :)

Vancouver is a foodie city, no doubt about it. I have had the good fortune of having Mexican, Bulgarian, Italian, Sichuan, Indian, Turkish, Peruvian(!), Ethiopian(!) food, and desserts that blow your mind. Growing up in a small town where you got pretty much nothing to eat outside, I had simple dreams. I did not dream of becoming the president of the United States or anything grandiose; when even basic things like going to a nearby city are a challenge :), your dreams get scaled automatically. What I did dream of was living in a city where I could have different cuisines every now and then. Not bastardized Chinese, real Chinese. In India you get Indo-Chinese: they have changed Chinese food to fit the Indian palate. I don't want to get into an argument about how unfair it is to put red chilli powder on a pasta dish, so, moving on. That was my dream. How was it going? Well, the first time I left home, in 2004, I ended up in a remote area outside Hyderabad. Bad start. Then, for 4 years, I was in interior Tamil Nadu, where you did not even get proper South Indian food! Looking worse. Well, there is always a job.

But no, I ended up in Hyderabad again, this time with a little bit of money but little luck. I realized that most restaurants in India that promise you a cuisine are almost never the real deal. I thought that, well, my dream would probably remain a dream. And then I got a chance to do grad school in Vancouver.

Now, after a year and some, I feel like that foodie dream of mine has surely been achieved. It's just a start, but at least it's something. Here are some of the pictures :)

Bulgarian dessert called Bird's Nest

Tiramisu

Turkish dessert called Kunefah

Heaven

And if that was not enough, Vancouver is amazingly beautiful. There are beaches, mountains and what not. It's like, every day at work, you have a new reason not to work.

Sunset Beach

Kits

Snow

English Bay

I could post a lot more pictures, but anyway. The point is that one should always dream; if you don't dream, you don't work towards it. It's a dream, so feel free to dream big. There is no rule against dreaming big. Just work hard to achieve those dreams; dreaming alone is not enough :) And don't let anyone bring you down. Good food leads to memorable experiences. Do you not remember the first time you had a truly awesome tiramisu? One with the right amount of cream, where you could smell the ingredients, and which was so well presented. All food gets digested. What remains are beautiful memories that make your tongue tingle. Everybody eats, whether you are a random graduate student worried about a dissertation or a high-flying billionaire with more air miles than a 32-bit integer can represent.

I recently gave a talk in the department about food and good restaurants in Vancouver. It is pretty cool that I was lucky enough to live one of my passions. I hope that always remains the case and I do end up eating a lot more all around the world!

On my bucket list, I want to do both fine dining and hole-in-the-wall. I would love to roam around Vietnam and try various stalls, and the same with Thailand. Northern Thai cooks are super adventurous with their ingredients, which would be lovely. And Vietnam has so many fresh things to play with, plus all that subtle and not-so-subtle French influence in the cooking. It should be a heady mix.

And I would love to eat at a Thomas Keller restaurant someday! Although I am not sure I will be able to appreciate the hard work that goes into each dish.

Common Python Idioms

In my lab, we write a lot of Python code. I recently did a course in Machine Translation which required me to write a lot more Python code for core MT tasks. Python provides very nice ways to do common programming things, and it would be good to list some of them here.

Note that there is nothing very "hidden" about these features; you can find them spread across other blog posts too. The credit really goes to the amazing Python community, who keep proving that beauty trumps everything. It's a joy to write code in Python.

Opening multiple files at once

You want to read two files line-by-line, do something and write the output to a third file. How do you do that ?

    from itertools import izip

    with open(f1) as f1, open(f2) as f2, open(f3, 'w') as out:
        for line1, line2 in izip(f1, f2):
            # do something with line1 and line2, write to out
            pass

Remember to open files using the “with’’ statement. It takes care of closing the files for you, so that you don't have to write

    f1 = open(file1, 'r')
    # do something
    f1.close()

People can forget to close files, and that leaks file handles (and unflushed buffers). Always use “with’’!

Also, use izip instead of zip. zip builds the whole list in memory and iterates over that; izip is an iterator that yields one pair at a time (and reuses the same result tuple, so really it's just one memory location).

Using any

Say you have a list of 6 words that you are searching for in each sentence. You don't care which of them occurs or where, as long as one of them occurs.

  six = ('a', 'b', 'c', 'd', 'e', 'f')

  count = 0
  for line in f:   # f is an open file, or any iterable of sentences
      if any(item in line for item in six):
          count += 1

fromkeys

You want to form a dictionary with various keys. What will you generally do ?

    d = {'a': 0, 'b' : 0, 'c' : 0, 'd' : 0}

There is nothing wrong with that, but there is a cleaner way (of course):

    seq = ('a', 'b', 'c', 'd')
    d = dict.fromkeys(seq, 0)

If you omit the value, the keys all default to None.

gzip files

When dealing with gzip files, you will be tempted to use gzip.open(). Don't. The actual decompression is done in C by zlib, but the gzip module's file object is pure Python, and (at least in Python 2) reading it line by line is very slow. Instead, do

import os, sys

for line in os.popen("zcat " + sys.argv[1]):
    "do your thang"
In [5]: %timeit for line in os.popen("zcat" + " some gzipped file"): pass
1 loops, best of 3: 3.13 s per loop

In [6]: %timeit for line in gzip.open("same gzipped file"): pass
1 loops, best of 3: 27.1 s per loop

For more details and the best option, you can refer to http://bugs.python.org/issue7471

The phrase table I tested this with had 4.8 million rules (only).

namedtuples

Say you want to store a few records. Using tuples makes sense because the record data won't change. But using plain tuples means you have to remember which index holds which field. Namedtuples are just like tuples, but allow named access to the fields. Here is how:

    from collections import namedtuple
    example = namedtuple('Example', ['name', 'location'])

    reader1 = example('abc', 'Vancouver')
    reader2 = example('def', 'Calgary')

    print reader1.name, reader1.location
    print reader2[0], reader2[1]


Which looks better, [0] or .name? The latter, of course! Use namedtuples where you can.

List comprehensions

This really is a very powerful feature, and you can find very talented people who have written posts about its power. Use it liberally!

All numbers between 1 and 100 divisible by 2:

    return [i for i in xrange(1, 101) if i % 2 == 0]

This is a simple one. What about an if-else inside the comprehension?

    # replace elements whose position is in `mask` with 1, keep the rest
    return [1 if pos in mask else i for pos, i in enumerate(some_list)]

There are a lot of insane possibilities with list comprehensions. But in production code, and in day-to-day writing as well, keep them relatively simple; use nested list comprehensions only when absolutely needed. It's cool to be insane, but there is no point when other developers struggle to interpret the chain of events within a pair of square brackets.

Nested dictionaries

Creating a defaultdict (with a default of 0) is

    from collections import defaultdict
    d = defaultdict(int)

A nested dict is

    d = defaultdict(lambda: defaultdict(int))

What about an infinitely nested dict?

    infinite = lambda: defaultdict(infinite)
    d = infinite()
    d['a']['b']['c']['d'] = 10

Interesting Reads

Of late, in my feeble attempts to sleep early, I have been trying to read something that is not related to work. In general, I try to read the publicly available longform articles in The New Yorker. Fortunately, every day for the past few days, I have come across awesome, sometimes heart-breaking, sometimes awe-inspiring tales of bad luck, good luck and, sometimes, just plain craziness. Without further ado, let's share some links!

A Murder Foretold

David Grann, what a journalist. What a writer. This was the first work of art that I came across. It's hard to say much without revealing spoilers, so let's just say it's an awesome tale of politics, gang wars, corruption, love and more. Oh, and the setting is Guatemala. And one more thing: David Grann is awesome!

The Throwaways

A look into the dark side of drugs, drug offences, undercover novices and the consequences. Drug offences attract harsh punishments in the States, but has that helped? Or not? This story will make you cry long after you have read it, and in my case, pretty much immediately. It's unfathomable how parents carry on with their lives, picking up the pieces long after it's all gone.

Trial By Fire

Another David Grann masterpiece. Did a man really kill his three young kids? Yes, it looks so obvious. No, maybe not. Naah, everybody who matters is convinced he did. What finally happened? Was he really innocent? Young girls! How barbaric.

The Lost City of Z

I am not an archaeology nerd; I have seen the Indiana Jones series but never got the hoopla over it. Between you and me, I never even liked the movies (ssshhhhhhh….) But this gave me a whole new level of respect for the endurance, hope, intelligence, foresight and, mostly, the intrinsic motivation of the explorers of the previous century. It is about one Percy Fawcett. What a man. And how brave of David Grann to set off on a path taken by many others before him, with bad consequences. The city may be a lost one, but the awesomeness of David Grann is quite visible.

Sweden’s notorious Serial Killer

Thomas Quick, Sweden's famous, notorious serial killer. He has recounted the gruesome ways he killed his victims; they say 16 of them! But now he wants to say something? What? What is this last confession?

Never thought GQ has good stuff like this.

Almost Famous

Please check the github profile for up-to-date code !

Can you predict which questions will become famous on StackOverflow? (The title is only in keeping with my fondness for movie references.) A famous question gets a minimum of 10,000 views; for details on famous questions, click here. This post is more about feature collection and the code. The prediction problem itself is quite research-y, and I will leave it to others to do it properly and publish :-). Feel free to use the code.

A question can be famous or not famous, so we have a binary outcome variable. What about features? A question has user attributes and question attributes, and then there are the answers. Ideally, though, you want to predict the badge for a question soon after it has been asked. Why, you might ask? Good question! (It's surely going to be famous ;) )

Famous questions get a lot of views, which implies the question has been useful to a lot of users on the site. Usefulness means the fulfilment of an information need: the ultimate motto of a community like StackExchange is getting your specific questions answered. On sites like the LaTeX one, StackOverflow and ServerFault, the problems can be very specific and require expert knowledge, and that knowledge is often useful to a lot of other users too.

tl;dr We should work on this prediction problem as this can promote content that has a high potential of being useful.

Now, what about features? Let's start with user attributes.

  1. Reputation
  2. Num upvotes/downvotes
  3. Num accepted answers
  4. Num questions/answers

For starters, let's consider these user attributes for the OP: the reputation, a metric that measures your contribution to the given site, and the number of upvotes/downvotes you have received (remember, even a question can receive upvotes and downvotes). A high number of accepted answers would indicate an active user.

What about question attributes ?

  1. Length of title
  2. Num tags
  3. Num questions for all tags
  4. Num famous questions for these tags

Then there are answer attributes too. Recent research has shown that the activity on a question right after it's asked helps predict its usefulness after a period of time (1 year in the paper here). Let's consider only the answer attributes within 24 hours of the question being asked.

  1. Num answers
  2. Mean reputation of answerers
  3. Mean Length of answers

Let's start with the user attributes. We can easily get the number of upvotes/downvotes, reputation and views from the users.xml file. To find the number of questions asked, answers given and answers accepted, we need posts.xml; it's easiest to do a single pass over posts.xml first and store all the answer IDs that were accepted, since they will be useful later. Having done that, we can store all the user details in a dictionary.
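A sketch of those two passes with ElementTree's iterparse; the attribute names follow the Stack Exchange data dump as I remember it, so treat them as assumptions:

    from xml.etree.cElementTree import iterparse

    # Pass 1: remember every accepted answer ID from posts.xml.
    accepted = set()
    for _, elem in iterparse('posts.xml'):
        if elem.tag == 'row' and elem.get('AcceptedAnswerId'):
            accepted.add(elem.get('AcceptedAnswerId'))
        elem.clear()   # keep memory bounded on a multi-GB file

    # Pass 2: per-user attributes from users.xml.
    users = {}
    for _, elem in iterparse('users.xml'):
        if elem.tag == 'row':
            users[elem.get('Id')] = {
                'reputation': int(elem.get('Reputation', 0)),
                'upvotes': int(elem.get('UpVotes', 0)),
                'downvotes': int(elem.get('DownVotes', 0)),
            }
        elem.clear()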

But before that, how do you divide the data into training and test sets? We know that there are only 10,369 famous questions, out of almost 2 million questions in total. We can take 15% of the questions for testing and keep the rest for training. Let's do that.

But how do you sample? Random seems best. You can do that with a script like this.
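Something along these lines (a sketch; the file names are hypothetical):

    import random

    # Sample ~15% of question IDs for the test set.
    random.seed(0)
    question_ids = [line.strip() for line in open('question_ids.txt')]
    test_ids = [qid for qid in question_ids if random.random() < 0.15]
    with open('test_ids.txt', 'w') as out:
        out.write('\n'.join(test_ids))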

You can write the sampled question IDs to a text file; then, in the feature extractor, we can use them to decide whether a question goes into training or testing. Without any more ado, let's extract features already!

This gives us the user attributes and the question attributes. Note that we have not yet used the features of the answers, how many in the first 24 hours and so on. Why not start with this first :-) ?

What do we use? Why not scikit-learn? It's Python, it's fast, it's relatively easy to use, and did I mention it's super awesome, from what I have seen and read?

Before that, let's take a look at a snapshot of the CSV file.

You have a few features, just a few, and the last column is the famous/not-famous variable. We are predicting a binary outcome, so why not start with logistic regression?
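A minimal sketch with scikit-learn, assuming a headerless CSV whose last column is the 0/1 famous label (the file names and layout are my assumptions):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # last column is the 0/1 "famous" label, the rest are features
    train = np.loadtxt('train.csv', delimiter=',')
    test = np.loadtxt('test.csv', delimiter=',')

    X_train, y_train = train[:, :-1], train[:, -1]
    X_test, y_test = test[:, :-1], test[:, -1]

    clf = LogisticRegression()
    clf.fit(X_train, y_train)
    print clf.score(X_test, y_test)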

This is just an example; we haven't done anything with the features, not even normalized them. Anyway, the output for this is :

Viva La Redis

This is a post about how I wrote over 80 million key:value pairs into Redis and then ran a couple of hundred concurrent clients bombarding the server with queries.

Redis is an in-memory NoSQL datastore which, by the way, also has persistence. Persistence means you get the performance of an in-memory datastore while your data also gets stored durably. But this is not a post about MySQL vs NoSQL, and it is not about benchmarking Redis either (although I will have some awesome things to say).

First, the corpus: Gigaword, we call it. It is a corpus of about 1,200 million words, mostly from news articles. From it we build a language model, and in simple terms a language model is like a huge key:value store where every key is an n-gram. What is an n-gram, you ask?

An n-gram is a sequence of n consecutive words. For example, take the sentence

"I am a boy". If you are asked to list all the n-grams up to order 3 from this sentence, it would look like the following:

    I, am, a, boy        -- 1-grams
    I am, am a, a boy    -- 2-grams
    I am a, am a boy     -- 3-grams

To give you an idea, a reasonably sized corpus, not large by any metric, would contain a million sentences. And for good performance, the higher the order n, the better. What do we mean by performance here, by the way?

A language model assigns a probability to a sentence. A sentence like "I am a boy" gets a higher probability than "I boy am I", which makes sense because "I boy am I" is not grammatical (although you cannot blame someone for saying it). A language model learns this from the counts of the n-grams. This blog post is not about how an LM learns it. (You must be wondering what this post is about, then? Soon, soon.)

Language models are huge. How huge? The one generated from Gigaword here has more than 80 million key:value pairs. We will see whether storing the whole language model in Redis, a one-time operation, can support large-scale concurrent execution of a natural language processing task that accesses the language model very frequently.

First, the format of the file we will load:


-0.1234  a
-3.63    b
-1.2232  c
.
.
.

What do you need :

  1. A system with a decent amount of RAM. Say, 24 gigs.
  2. A redis-client. I used redis-py.
  3. Patience !

Let's first do the dumbest thing we can think of: read and insert.

Read a line, split it, and insert. If you are inserting more than a thousand records, DO NOT DO THIS!
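For the record, the naive loader looks something like this (a redis-py sketch; host, port and variable names are mine):

    import sys
    import redis   # redis-py

    r = redis.StrictRedis(host='localhost', port=6379, db=0)

    # the naive version: one SET per line, one network round trip each
    with open(sys.argv[1]) as lm:
        for line in lm:
            value, key = line.split(None, 1)   # "-0.1234  a" -> log-prob, n-gram
            r.set(key.strip(), value)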

Redis follows the client-server model: for each command sent, the server replies. No matter how fast the connection is, there is a round-trip latency involved, and that adds up to something huge over tens of millions of strings. We need to send commands in batches.

Enter Pipelines

As can be seen from the documentation, pipelining can be thought of as batching: you keep sending commands to the server without waiting, and read all of the replies at the end.

You have a pipe object, and you flush it every so often; I simply chose a batch size of 10,000 from the documentation.
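A sketch of the pipelined loader (again with redis-py; transaction=False since we don't need atomicity):

    import sys
    import redis

    r = redis.StrictRedis(host='localhost', port=6379, db=0)
    pipe = r.pipeline(transaction=False)

    # batch the SETs and flush every 10,000 commands
    with open(sys.argv[1]) as lm:
        for count, line in enumerate(lm, 1):
            value, key = line.split(None, 1)
            pipe.set(key.strip(), value)
            if count % 10000 == 0:
                pipe.execute()
        pipe.execute()   # flush whatever is left over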

But, this is much faster. I ran without and with pipelining and have mentioned the times below

    Without Pipelining -- 12348.340 seconds 
    With pipelining -- 3188.732 seconds
    Num key:value pairs -- 81997635

So, for roughly 82 million key:value pairs, we get an almost 4X improvement. Moral of the story: always use pipelining!

At the end, the keys take up almost 10 GB of RAM. 10 GB is quite a bit, although with falling hardware costs this may not matter for long. Can we reduce the amount of memory taken, though? Turns out we can!

Redis offers various data structures, and the one we are interested in is the hash. Hashes were originally intended for storing objects like user details: a user has a name, password, account number and so on. But the useful property here is that you can store many small fields in the same hash, and small hashes are encoded very compactly, which saves memory. The Redis documentation does a great job of explaining the intuition behind hashes here. As an aside, Redis has top-notch documentation, IMHO.

We can test this by storing the complete Gigaword language model with all of the keys inside hashes. We can have buckets of about 1000 keys each. We don't have explicit numeric IDs associated with a key, but we know we have approximately 82 million keys, so let's have 82K buckets. Writing a key:value pair to a bucket then looks like:

    Hash the key using hash() in Python
    Take that hash modulo 82,000 to get the bucket (hash name)
    Insert into the bucket with HSET, using pipelining

The script below does just that. Here again we use pipelining, or it can take forever.
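A sketch of that script (mine, not the original):

    import sys
    import redis

    NUM_BUCKETS = 82000

    r = redis.StrictRedis(host='localhost', port=6379, db=0)
    pipe = r.pipeline(transaction=False)

    # each n-gram goes into the hash whose name is hash(key) % 82,000
    with open(sys.argv[1]) as lm:
        for count, line in enumerate(lm, 1):
            value, key = line.split(None, 1)
            key = key.strip()
            pipe.hset(str(hash(key) % NUM_BUCKETS), key, value)
            if count % 10000 == 0:
                pipe.execute()
        pipe.execute()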

    Before Hashes -- 10G
    After Hashes -- 2.66G

Woah, about a quarter of the memory. That's an epic improvement for us. Second moral of the story: use hashes when you have lots of keys. But, but, but, do the lookup times change if we use hashes? What if you have LOTS of memory at your disposal but low latency is very important to you? Let's find out.

Let's do truecasing (the NLP task alluded to earlier, which queries the language model constantly) for one file, starting with 1 client and testing with a few more :-)

The truecasing script takes the number of clients on the command line. And for looking up keys in Redis when they are stored in hashes, we can use something like the following:
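A minimal sketch of that lookup:

    import redis

    NUM_BUCKETS = 82000
    r = redis.StrictRedis(host='localhost', port=6379, db=0)

    def lookup(key):
        # with the plain layout this would just be r.get(key);
        # with buckets we find the hash first, then HGET inside it
        return r.hget(str(hash(key) % NUM_BUCKETS), key)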

To run everything and collect the times, I preferred to write makefiles.

Run the makefile for both non-hashes and hashes based lookups.

Clearly, the hash-bucket layout adds to the lookup times, which becomes significant when running several concurrent clients. Having said that, the memory gains are very, very significant, and hashes are worth a second look. Note that in the truecasing experiment I did not shard the files: each client processes the same file and writes the same result to a different output file. This is the absolute worst-case scenario; in the real world, you would shard the input, give separate parts to separate clients, and merge everything into a final output file.

Now, there is one more thing: suppose we distribute our keys among 3 Redis instances. How do you distribute them? Do you say, well, let me put everything that starts with a-g in one, and so on? That is a bad idea: English has way too many strings starting with "a" and too few starting with "z" in comparison, so one instance would get pinged far more often than the rest. So? Well, I know that all the keys are unique, so there are no duplicates to worry about. Why not take the ordinal value of each character, add them up, and take the sum modulo 3 to pick the instance? What does that look like?
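A sketch of that sharding function:

    def shard(key, num_instances=3):
        # add up the ordinal values of the characters and take the sum
        # modulo the number of Redis instances
        return sum(ord(ch) for ch in key) % num_instances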

This will add up all the ordinal values. Do you think we can get faster than this ?

Turns out we can. Just use python’s default hash() method.

    s = key 
    h = hash(s) % 3

On timing both, it turned out that the default hash() method is faster than the ordinal-sum one.