Random randomness

Almost Famous

Can you predict which question will become famous on StackOverflow? (The title is only in keeping with my fondness for movie references.) A famous question gets a minimum of 10000 views. For details on famous questions, click here. This post will be more about feature collection and the code explanation. The problem of predicting is quite research-y and I will leave it to other people to do it and publish :-). Feel free to use the code.

A question can be famous or not famous, so we have a binary outcome variable. What about features? A question has user attributes and question attributes, and then there are answers. Ideally, though, you would want to predict the badge for a question that has been asked recently. Why, you might ask? Good question! (It's surely going to be famous ;) )

Famous questions get a lot of views. This implies that the question has been useful to a lot of users on the site. Usefulness implies the fulfillment of an information need. The ultimate motto of a community like StackExchange is getting your specific questions answered. In the case of sites like LaTeX, StackOverflow and ServerFault, the problems can be very specific and require expert knowledge. In such cases, this knowledge can be useful to a lot of other users too.

tl;dr We should work on this prediction problem as this can promote content that has a high potential of being useful.

Now, what about features? Let's start with user attributes.

  1. Reputation
  2. Num upvotes/downvotes
  3. Num accepted answers
  4. Num questions/answers

For starters, let's consider these user attributes for the OP. Reputation is a metric that measures your contribution to the given site. Then there is the number of upvotes/downvotes you have received. Remember, even a question can receive upvotes and downvotes. A high number of accepted answers would indicate an active user.

What about question attributes ?

  1. Length of title
  2. Num tags
  3. Num questions for all tags
  4. Num famous questions for these tags

Then, there are answer attributes too. Recent research has shown that the activity on a question right after it's asked helps in predicting its usefulness after a period of time (1 year in the paper here). Let's consider the answer attributes only within 24 hours of the question being asked.

  1. Num answers
  2. Mean reputation of answerers
  3. Mean Length of answers

Let's start with user attributes. We can easily find the number of upvotes/downvotes, reputation and views from the users.xml file. To find the number of questions asked, answers given and answers accepted, we will need to use posts.xml. But it's easier to do a single pass over posts.xml and store all the answer IDs that were accepted. They will be useful later. Having done that, we can now store all the user details in a dictionary.
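The single pass over posts.xml could look roughly like the sketch below. It assumes the dump's usual layout, where every post is a `row` element, questions have `PostTypeId="1"`, and an accepted answer is recorded in the question's `AcceptedAnswerId` attribute. The function name and the tiny inline sample are made up for illustration; this is not the original script.

```python
import io
import xml.etree.ElementTree as ET


def accepted_answer_ids(source):
    """Return the set of answer IDs that some question marked as accepted."""
    accepted = set()
    # iterparse streams the file, so the whole tree is never built up front.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row" and elem.get("PostTypeId") == "1":
            answer_id = elem.get("AcceptedAnswerId")
            if answer_id is not None:
                accepted.add(answer_id)
        elem.clear()  # drop the attributes of a row once it has been inspected
    return accepted


# Tiny made-up sample standing in for posts.xml.
sample = io.BytesIO(
    b'<posts>'
    b'<row Id="1" PostTypeId="1" AcceptedAnswerId="3" />'
    b'<row Id="2" PostTypeId="1" />'
    b'<row Id="3" PostTypeId="2" ParentId="1" />'
    b'</posts>'
)
print(accepted_answer_ids(sample))
```

With the set in hand, checking whether a given answer was accepted is a constant-time lookup while building the per-user dictionary.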

But, before that, how do you divide the data into training and test? We know that there are only 10369 famous questions. We know that there are almost 2 million questions in all. We can take 15% of the questions for testing and keep the rest for training. Let's do that.

But, how do you sample ? Random seems best. You can do that with this script.
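A random 15% split can be sketched in a few lines. The function name, the fixed seed and the stand-in ID range are assumptions for illustration, not the original script.

```python
import random


def pick_test_ids(question_ids, test_fraction=0.15, seed=42):
    """Randomly choose a fraction of the question IDs for the test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    ids = sorted(question_ids)
    k = round(len(ids) * test_fraction)
    return set(rng.sample(ids, k))


# Stand-in IDs; in practice these would be the question IDs read from posts.xml.
test_ids = pick_test_ids(range(1, 101))
print(len(test_ids), "of 100 questions go to the test set")
```

Everything not in `test_ids` is training data. Since only a tiny share of questions are famous, it may be worth checking that both sets end up with some famous questions in them.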

You can write the numbers in a text file. Then, in our feature extractor, we can use these numbers to decide whether a question goes to training or testing. Without any more ado, let's extract features already!

This gives us the user attributes and the question attributes. Note that we have not yet used the features of the answers, how many in the first 24 hours and so on. Why not start with this first :-) ?

What do we use? Why not use scikit-learn? It's Python, it's fast, it's relatively easy to use, and did I mention it's super awesome from what I have seen/read?

Before that, let's take a look at a snapshot of the csv file.

You have a few features, just a few. And the last one is the famous/not famous variable. We are predicting a binary outcome variable, why not start with Logistic Regression ?

This is just an example. We haven't normalized the features or done anything else with them. Anyway, the output for this is:

Viva La Redis

This is a post about how I wrote close to 10^8 key:value pairs into Redis and then ran a couple of hundred concurrent clients bombarding the server with queries.

Redis is a NoSQL datastore which, by the way, also has persistence. Persistence means you get the performance of a NoSQL datastore and your data also gets stored permanently. But this is not a post about MySQL v/s NoSQL. This is not about benchmarking Redis either (although I will have some awesome things to say).

First, the corpus. Gigaword, we call it. It is a corpus of about 1200 million words, mostly from news articles. In simple words, a language model is like a huge key:value store. Every key is an n-gram. What is an n-gram, you ask?

An n-gram is a sequence of n consecutive words. For example, take a sentence, say,

I am a boy. If you are asked to find out all the n-grams up to the order of 3 from this sentence, it would look like the following:

    I, am, a, boy  --- 1 - grams
    I am, am a, a boy -- 2 - grams
    I am a, am a boy -- 3 - grams
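The listing above can be reproduced with a short function. This is only a sketch of the idea (slide a window of n words over the sentence); the function name is made up.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I am a boy".split()
for n in (1, 2, 3):
    print(n, ngrams(tokens, n))
```

A sentence of k words gives k - n + 1 n-grams of order n, which is why the counts shrink from 4 to 3 to 2 above.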

To give you an idea, a reasonably-sized, but not large by any metric, corpus, would contain a million sentences. And, for good performance, the higher the value of n, the better. What do we mean by performance here by the way ?

A language model gives a probability to a sentence. So, a sentence like “I am a boy” has a higher probability than “I boy am I”. This is understood, as “I boy am I” is not grammatical (although you cannot blame someone if he/she says that sentence). A language model learns this through the counts of the n-grams. This blog post is not about how an LM learns it. (You must be wondering what this is about then? Soon, soon.)

Language models are huge. How huge? This one, generated from Gigaword, has about 82 million key:value pairs. And we will see if storing the whole language model in Redis, which is a one-time operation, can be followed by large-scale concurrent execution of a Natural Language Processing task which involves frequent accesses to the language model.

First, the form of the corpus:

    -0.1234  a
    -3.63    b
    -1.2232  c
    ...

What do you need :

  1. A system with a decent amount of RAM. Say, 24 gigs.
  2. A redis-client. I used redis-py.
  3. Patience !

Let's first do the dumbest thing we can think of. Read and insert.

Read a line. Split. And insert. If you are inserting more than 1000 records, DO NOT DO THIS!

Redis follows the client-server model. For each operation sent, the server replies. No matter how fast the network connection is, there is a latency involved and that is huge when done over a billion strings. We need to send stuff in a batch.

Enter Pipelines

As can be seen from the documentation, pipelining can be thought of as mass insertion followed by a single read of the replies. You keep sending commands to the server, and then read the replies of the server at the end.

You have a pipe object, and you execute it once every 10,000 commands. I simply chose the number 10,000 from the documentation.
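Here is a minimal sketch of the batched insert with redis-py. It relies only on `pipeline()`, `set()` and `execute()`. The function name, the assumption that each line is "value, whitespace, n-gram", and the file name in the usage comment are mine, not the original script.

```python
def load_pairs(client, lines, batch_size=10000):
    """Insert 'value<whitespace>ngram' lines, flushing the pipe every batch_size commands."""
    pipe = client.pipeline()
    pending = 0
    total = 0
    for line in lines:
        parts = line.strip().split(None, 1)  # split off the value, keep the n-gram whole
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        value, ngram = parts
        pipe.set(ngram, value)  # queued on the client, not sent yet
        pending += 1
        if pending == batch_size:
            pipe.execute()  # one round trip for the whole batch
            total += pending
            pending = 0
    if pending:
        pipe.execute()  # flush the last partial batch
        total += pending
    return total


# Usage, assuming a local Redis server and a file called lm.txt (hypothetical name):
#   import redis
#   client = redis.Redis(host="localhost", port=6379)
#   with open("lm.txt") as handle:
#       print(load_pairs(client, handle))
```

The batch size trades memory for round trips: a bigger batch means fewer waits on the network, but the client and server both have to hold more queued commands and replies at once.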

This is much faster. I ran without and with pipelining, and the times are below.

    Without Pipelining -- 12348.340 seconds 
    With pipelining -- 3188.732 seconds
    Num key:value pairs -- 81997635

So, for about 82 million key:value pairs, we get an almost 4X improvement. Moral of the story: always use pipelining!

So, at the end, the keys end up taking almost 10 GB of RAM. 10GB is quite a bit of RAM, although with reducing hardware costs, this might not be true soon enough. Can we reduce the amount of memory taken though? Turns out we can !

Redis allows you to use various data structures. The one that we are interested in is hashes. Hashes were originally meant for storing objects like user details. For instance, user details can have name, password, account number and so on. But the usefulness of a hash is that you can store several records in the same hash and thus save memory. The Redis documentation does a great job of explaining the intuition behind hashes here. As an aside, Redis has top-notch documentation IMHO.

We can test this by storing our complete Gigaword corpus, all of the keys, in the form of hashes. We can have buckets of 1000. We don't have explicit IDs associated with a key. We know that we have approximately 82 million keys, so let's have 82K buckets. The process of writing a key:value pair to a bucket would look like

    Hash the key using hash() in Python
    Take the hash modulo 82,000 to get the bucket
    Write the key:value pair into that bucket, using pipelining

This script does just that. Here again we use pipelining or it can take forever.
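The bucket scheme can be sketched as below. One deliberate swap: instead of the built-in `hash()`, the sketch uses `zlib.crc32`. In Python 3, `hash()` on strings is salted per process, so a client started later would compute a different bucket for the same key and miss it. The function names are made up. `hset` and `hget` are the regular redis-py calls.

```python
import zlib

NUM_BUCKETS = 82000  # roughly 82 million keys at about 1000 keys per bucket


def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Map a key to a bucket name with a hash that is stable across processes."""
    return str(zlib.crc32(key.encode("utf-8")) % num_buckets)


def load_into_hashes(client, pairs, batch_size=10000):
    """Write (ngram, value) pairs into bucketed Redis hashes through a pipeline."""
    pipe = client.pipeline()
    pending = 0
    for ngram, value in pairs:
        pipe.hset(bucket_for(ngram), ngram, value)  # field = n-gram, inside its bucket
        pending += 1
        if pending == batch_size:
            pipe.execute()
            pending = 0
    if pending:
        pipe.execute()  # flush the last partial batch


def lookup(client, ngram):
    """Fetch a value back: recompute the bucket, then read the field from that hash."""
    return client.hget(bucket_for(ngram), ngram)
```

How much memory this saves depends on Redis keeping each hash in its compact encoding, which is controlled by server settings. The Redis memory optimization documentation covers the limits that apply.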

    Before Hashes -- 10G
    After Hashes -- 2.66G

Whoa, that is only about a quarter of the memory. That's an epic improvement for us. Second moral of the story: always use hashes when you have many keys. But, but, but, do the lookup times change if we use hashes? What if you have LOTS of memory at your disposal but low latency is very important to you? Let's find out.

Let's do truecasing for one file using a few clients, starting from 1 client and testing for a few more :-)

This does truecasing with any number of clients you mention on the command line. And for looking up Redis when we insert keys in hashes , we can use the following script :

To run and find the times, I preferred to write makefiles. It looked something like this :

Run the makefile for both non-hashes and hashes based lookups.

Clearly, hashing adds to the lookup times, which is significant when running several concurrent clients. Having said that, the memory gains are very, very significant and hashes are worth a second look. It has to be noted that in the truecasing case, I have not sharded the files. Each client is processing the same file and writing the same result to different output files. This is the absolute worst-case scenario. In the real world, you will probably shard the files, give separate parts to separate clients and then write everything into a final output file.

Now, there is one more thing. We distribute our keys among 3 instances. How do you distribute them? Do you say, well, let me put everything that starts with a-g in one, and so on? This is a bad idea. English has way too many strings that start with “a” and too few that start with “z” in comparison. One instance would get pinged more often than the rest. So? Well, I know that all the keys are unique. So, there will not be any collisions. Why not take the ordinal value of each character and then add them up? And to distribute, take the value modulo 3. What does that look like?

This will add up all the ordinal values. Do you think we can get faster than this ?
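The ordinal-sum idea fits in one line of Python. This is a sketch with a made-up function name, using the 3 instances mentioned above.

```python
def shard_by_ordinals(key, num_instances=3):
    """Pick an instance by summing the character codes of the key, modulo the instance count."""
    return sum(ord(ch) for ch in key) % num_instances


for key in ("a boy", "am a", "I am a"):
    print(key, "->", shard_by_ordinals(key))
```

It walks over every character of the key in Python code, which is the part that costs time on long n-grams.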

Turns out we can. Just use python’s default hash() method.

    s = key 
    h = hash(s) % 3

On testing the times, it turned out that it's better to use the default hash() method than the ordinal one.

Epic, Famous and All That

To see part 1 of the StackExchange analysis, start here.

To signify any sort of achievement on StackExchange, we have badges. Badges are a metric of measurement, defined by the StackExchange owners and each badge indicates a certain achievement. For example, filling all fields of your StackExchange profile makes you an autobiographer. Asking a question with 10000 views makes your question a “famous question”.

The Epic badge is given when a user manages to earn 200 reputation in a day, 50 times over! That now, is quite epic. Keep in mind that an upvote on a question gets you +5 and an upvote on an answer gets you +10. An accepted answer earns you +15. Bounties do have a higher value associated with them, but you don't have bounties every day. This is mainly because, more often than not, the question gets answered anyway. Hence, earning 200 reputation in a day 50 times is a big deal.

There are various questions we will get answered in this post.

  1. How many epic users ?
  2. Mean reputation of those ?
  3. Do they contribute after they get the badge ?

How many epic users

If we look at all of StackExchange, very few sites apart from StackOverflow have users with the badge. This makes sense, as most of the sites started only in September 2010 and all of them cover very exclusive topics. Let's focus on StackOverflow for the Epic badge.

This gives us the number of epic badges, which is 182. Keep in mind that the dataset is only till September, 2011. StackOverflow, as of today, says that the number stands at 351, almost double.

But it's not just the number we want to look at. Let's find out the contribution of the users before and after they received the badge. An epic badge indicates a healthy daily contribution to the site. Does it continue after they receive it too?

This script gives us the number of questions before and after, answered by users with the badge. We dump our output to a dict which we will use soon to plot our findings.

This shows that the number of questions answered by epic users before and after remains about the same. But let's find out if it's distributed as before.

Let's take the top 50 and plot them.

(Figure: beforeAfter, contributions of the top 50 epic users before and after the badge)

You can see that, by and large, people contribute less once they get the badge. The green one shows the contribution prior to attaining the badge and the red one shows it after. Some users contribute regardless of their achievement, and that keeps up the overall contributions either way. But, mostly, contribution is a function of the badge.

Famous Question

A question that gets 10000 views is declared a famous question.

This gives us the number of famous questions, which is 10369 on StackOverflow. More than 10K questions on StackOverflow have more than 10000 views. To give some context to the number of views, let's find out the average number of views for a non-famous question v/s a famous question.

The output of that script gives us

There is a huge difference between the number of views. On an average, a famous question gets way more views than a non-famous one.

Let's put all utility functions in one file so that it's easier and cleaner.

This is a snapshot of the utils file.

Let's find out the most commonly associated tags with famous questions, and some of the most uncommon :-)

This gives us the top 20 most often occurring tags.
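Counting tags is a natural fit for `collections.Counter`. The sketch below assumes the `Tags` attribute of a question in posts.xml looks like `<python><list>` once the XML is parsed, which is how the dump stored tags at the time. The function name and the sample strings are made up.

```python
import re
from collections import Counter

TAG_PATTERN = re.compile(r"<([^<>]+)>")  # one match per <tag>


def top_tags(tag_strings, n=20):
    """Count tags over an iterable of raw Tags attribute values and return the n most common."""
    counts = Counter()
    for raw in tag_strings:
        counts.update(TAG_PATTERN.findall(raw))
    return counts.most_common(n)


# Made-up Tags values standing in for the famous questions.
sample = ["<python><list>", "<python><string>", "<c#><.net>"]
print(top_tags(sample, n=3))
```

The least common tags can be read off the other end of the same counter, for example with `counts.most_common()[-20:]`.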

A wordcloud of the top 20 is

(Figure: wordCloud, word cloud of the top 20 tags)

Honey I Just Got Upvoted!

Have you heard of StackOverflow? If you code for a living, it's highly unlikely you have not heard of StackOverflow. If you are alive on the Internet, it is highly unlikely you have not heard of “StackExchange”. So, one day, I thought, can we get that data? How much fun would it be to analyze that?

Well, turns out that even the founders of the site believe in sharing data. You can get all of StackExchange, yes, all of it, in one nice dump, from here. It's under a very general license, which means you are free to use it with proper attribution.

I took all of that data and I have tried to answer a few questions. For instance, how many questions have accepted answers on the various sites? How many users are there? What kind of reputation distribution is seen? And a lot more. First, let's see what kind of data we have.

The data has been shared all in XML. I know, right, XML, who would have thunk! But it is what it is. You gotta write an XML parser to get that data. Worry not, I did the unpleasant thing and I am giving away all that code! [Okay, honestly, it was not that hard. Just unpleasant looking at XML. It's ugly as hell.] Like I mentioned earlier, you have data for all of StackExchange. So, that forum you saw on Android, those scary TeX questions, those piano answers, it's all there. All of it!

Each site has several XML files: one for users, one for badges, one for posts, and then there is post_history. We will focus on them one by one. Let's first take users.

How many users does StackExchange actually have? Remember, you can use the site without registering. Read answers and get all the knowledge. Users have to register to ask questions, or post their own answers or both. So, the count of users you get is those who have participated in the site. But, the number of users who use the site to fulfill their information need will be definitely more than what we see.

This is the script you can use as well [ the github repo is public and free to use] . Note the usage of elem.clear() and delete. This is necessary as the XML files are quite large and if you reach deep into the tree and do not remove the unnecessary previous node, it can hog a lot of memory. All the analysis has been done on an entry-level Macbook Pro with 4GB of RAM. These optimizations are fun and necessary :-)
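The memory trick can be sketched like this. It assumes users.xml is one root element holding a flat list of `row` elements. The sketch clears the root after each finished row, which is a close cousin of the `elem.clear()` plus delete approach and not necessarily what the linked script does. The function name is made up.

```python
import io
import xml.etree.ElementTree as ET


def count_rows(source):
    """Count <row> elements while keeping memory use flat."""
    count = 0
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            count += 1
            root.clear()  # detach finished rows so the tree never grows
    return count


# Tiny made-up sample standing in for users.xml.
sample = io.BytesIO(b'<users><row Id="1" /><row Id="2" /><row Id="3" /></users>')
print(count_rows(sample))
```

Without the clearing step, `iterparse` still ends up holding the whole tree in memory by the time it reaches the end of the file, which is exactly what a 4GB machine cannot afford on the bigger dumps.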

That script gave us the user counts for all the sites. But let's just look at the top 5.

(Figure: Top5, user counts for the top 5 sites)

Clearly, StackOverflow has far more users than the 2nd most popular one. That is obvious. The other StackExchange sites were started only in 2010. But, it all started with StackOverflow in 2008. It is the most popular programming website on the Internet. Moreover, I guess a lot of people get doubts when coding :-)

One of the reasons why all StackExchange sites are so famous, so trusted and so frequented is that your questions are answered. What does answered mean here? Does it mean getting any answer or getting an answer that fulfills your information need? It is the latter. Then, how do you, the OP, indicate the same? There is something called an accepted answer. An answer can be accepted only by the user who posted the question. And a user does that only when the given answer answers the question he originally had. Of course, there can be users who forget to accept any of the answers, or just neglect it, or feel none of them meets their requirement. But, in any case, the percentage we get is the baseline.

(Figure: Acceptances, percentage of questions with an accepted answer per site)

Quite a few of them are in the high 60s, low 70s. Even the minimum is quite close to 50%. What this implies is that, even on the site with the lowest rate, about 50% of the questions asked have an answer deemed relevant and correct by the original poster himself/herself.

In the next post, we will start answering more questions, for instance, how many tags are associated with famous questions, the badge distribution, and contribution before and after. And more.

OLS Regression in R:A Primer

Regression is from Latin roots: re, back, and gradus, to go. It literally means to go back, which fits with the way the method is used in mathematics/statistics. The first usage of the term “regression” is credited to Sir Francis Galton. He observed that sons of tall fathers are tall but not as tall as their fathers, and sons of short fathers are short but not as short as their fathers (this is known as the “regression effect”).

Today, we will discuss Regression using the statistical computing Language, R . The R Project is a great language for using statistical methods with your data and is very widely used. The dataset we are going to use is “Abalone” , from the UCI Machine Learning repository . Get the dataset here : Abalone

Description of the dataset, from the file :

    # Abalone
    #   Predicting the age of abalone from physical measurements.  The age of abalone is determined by cutting the shell through the cone, staining it,and counting the number of rings through a microscope -- a boring and time-consuming task.  Other measurements, which are easier to obtain, are used to predict the age.  Further information, such as weather patterns
    and location (hence food availability) may be required to solve the problem.
> abalone<-read.table("/home/rohit/abalone/Dataset.data")#This loads the dataset . R can read from any url and any location
> abalone[1:20,] #Read the 1st 20  lines of the table
   V1    V2    V3    V4     V5     V6     V7    V8 V9
1   M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15
2   M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070  7
3   F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210  9
4   M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10
5   I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055  7
6   I 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.120  8
7   F 0.530 0.415 0.150 0.7775 0.2370 0.1415 0.330 20
8   F 0.545 0.425 0.125 0.7680 0.2940 0.1495 0.260 16
9   M 0.475 0.370 0.125 0.5095 0.2165 0.1125 0.165  9
10  F 0.550 0.440 0.150 0.8945 0.3145 0.1510 0.320 19
11  F 0.525 0.380 0.140 0.6065 0.1940 0.1475 0.210 14
12  M 0.430 0.350 0.110 0.4060 0.1675 0.0810 0.135 10
13  M 0.490 0.380 0.135 0.5415 0.2175 0.0950 0.190 11
14  F 0.535 0.405 0.145 0.6845 0.2725 0.1710 0.205 10
15  F 0.470 0.355 0.100 0.4755 0.1675 0.0805 0.185 10
16  M 0.500 0.400 0.130 0.6645 0.2580 0.1330 0.240 12
17  I 0.355 0.280 0.085 0.2905 0.0950 0.0395 0.115  7
18  F 0.440 0.340 0.100 0.4510 0.1880 0.0870 0.130 10
19  M 0.365 0.295 0.080 0.2555 0.0970 0.0430 0.100  7
20  M 0.450 0.320 0.100 0.3810 0.1705 0.0750 0.115  9

Let's get the names of the columns.

> names(abalone)
[1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9"

Clearly, the names are not descriptive. Let's change them.

    > names(abalone)=c("sex","length","diameter","height","whole_weight","shucked_weight","viscera_weight","shell_weight","rings")

c(a,b) or c("1","2") basically creates a vector of the values that are given within the brackets. So, we have given names to the columns of the dataset. Let's verify the same.

> abalone[1:10,]
   sex length diameter height whole_weight shucked_weight viscera_weight shell_weight rings
1    M  0.455    0.365  0.095       0.5140         0.2245         0.1010        0.150    15
2    M  0.350    0.265  0.090       0.2255         0.0995         0.0485        0.070     7
3    F  0.530    0.420  0.135       0.6770         0.2565         0.1415        0.210     9
4    M  0.440    0.365  0.125       0.5160         0.2155         0.1140        0.155    10
5    I  0.330    0.255  0.080       0.2050         0.0895         0.0395        0.055     7
6    I  0.425    0.300  0.095       0.3515         0.1410         0.0775        0.120     8
7    F  0.530    0.415  0.150       0.7775         0.2370         0.1415        0.330    20
8    F  0.545    0.425  0.125       0.7680         0.2940         0.1495        0.260    16
9    M  0.475    0.370  0.125       0.5095         0.2165         0.1125        0.165     9
10   F  0.550    0.440  0.150       0.8945         0.3145         0.1510        0.320    19

Bingo, this looks way better than V1,V2,V3 and so on, is it not ? So , now we know that given the attributes, sex,length,diameter,height, whole_weight,shucked_weight,viscera_weight,shell_weight, we need to predict the number of rings.

We can see that except sex, all the other attributes have numerical values. Sex, on the other hand, can take only three values, which are M, F and I (male, female and infant). So, when we tell R to do regression for us, we need to tell the language to treat sex as a categorical variable.

The advantage with a kernel estimator is that it's smooth and it is not dependent on the bins that you have to go for with histograms, and thus it gives you a better idea about the distribution of the data w.r.t. that variable.

The basic command for doing regression in R is lm (Linear Models). The form of the command is:

model <- lm ( outcome ~ predictor1 + predictor2 + predictor3 )

Let's apply it to our case.


    > linearM<-lm(rings ~ as.factor(sex)+length+diameter+height+whole_weight+shucked_weight+viscera_weight+shell_weight,data=abalone) 

What this does is , it generates a linear model called “linearM” where the variable “rings” is predicted based on the values of the other variables. as.factor(sex) indicates that the variable “sex” should be treated as categorical.

> summary(linearM)

Call:
lm(formula = rings ~ as.factor(sex) + length + diameter + height +
    whole_weight + shucked_weight + viscera_weight + shell_weight,
    data = abalone)

Residuals:
     Min       1Q   Median       3Q      Max
-10.4800  -1.3053  -0.3428   0.8600  13.9426 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       3.89464    0.29157  13.358  < 2e-16 ***
as.factor(sex)I  -0.82488    0.10240  -8.056 1.02e-15 ***
as.factor(sex)M   0.05772    0.08335   0.692    0.489
length           -0.45834    1.80912  -0.253    0.800
diameter         11.07510    2.22728   4.972 6.88e-07 ***
height           10.76154    1.53620   7.005 2.86e-12 ***
whole_weight      8.97544    0.72540  12.373  < 2e-16 ***
shucked_weight  -19.78687    0.81735 -24.209  < 2e-16 ***
viscera_weight  -10.58183    1.29375  -8.179 3.76e-16 ***
shell_weight      8.74181    1.12473   7.772 9.64e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 2.194 on 4167 degrees of freedom
Multiple R-squared: 0.5379, Adjusted R-squared: 0.5369
F-statistic: 538.9 on 9 and 4167 DF,  p-value: < 2.2e-16

That is a summary of the linear model. Let's delve into the details.

Taking a step back, we recollect that regression means having an equation of the form :

Y = a*X1 + b*X2 + c*X3 + ... + z*X26 + A, where X1, X2, etc. are the attributes, a, b, c, etc. are the coefficients, and A is the value Y is predicted to have when all the independent variables are zero.

Coefficients:

From our regression summary, we can , take a few variables, and say :

Y = -0.45834 * length + 10.76154 * height + 8.97544 * whole_weight + 3.89464

What this effectively says is that for every 1 unit increase in length, the number of rings goes down by 0.45834, for every 1 unit increase in height, the number of rings goes up by 10.76154, and if all the attributes are zero, the number of rings is 3.89464 (the intercept). So, that is the role that coefficients play. Attributes which have low coefficients indicate that their effect on the outcome variable, when changed, is minimal (for example, length), while attributes with higher ones have a higher contribution to the outcome variable. Also, positive/negative gives you the direction of the effect. In the case of multiple regression, as in the one above, what the coefficient says is that when the independent variable is increased by 1 unit, the outcome variable changes by the value of the coefficient, keeping all the other independent variables constant.
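To see the coefficients at work, here is a small sketch that plugs the first row of the dataset into the full fitted equation from the summary above. It is written in Python purely to do the arithmetic (in R you would call `predict(linearM, ...)`), and the names are made up. Since the summary lists coefficients for I and M only, F is the baseline level and contributes nothing extra.

```python
INTERCEPT = 3.89464
SEX_OFFSET = {"F": 0.0, "I": -0.82488, "M": 0.05772}  # F is the baseline level
SLOPES = {
    "length": -0.45834,
    "diameter": 11.07510,
    "height": 10.76154,
    "whole_weight": 8.97544,
    "shucked_weight": -19.78687,
    "viscera_weight": -10.58183,
    "shell_weight": 8.74181,
}


def predict_rings(sex, measurements):
    """Intercept + sex offset + sum of coefficient * measurement."""
    total = INTERCEPT + SEX_OFFSET[sex]
    for name, slope in SLOPES.items():
        total += slope * measurements[name]
    return total


# First row of the table above (sex M, 15 rings recorded).
first_row = {
    "length": 0.455, "diameter": 0.365, "height": 0.095,
    "whole_weight": 0.5140, "shucked_weight": 0.2245,
    "viscera_weight": 0.1010, "shell_weight": 0.150,
}
print(round(predict_rings("M", first_row), 2))
```

The prediction comes out at roughly 9.2 rings against the 15 recorded, which is in line with a model whose R-squared is only about 0.54.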

T-statistic :

What is it? The t-statistic is the coefficient divided by its standard error. The standard error is an estimate of the standard deviation of the coefficient, the amount it varies across all the cases. It's a measure of the precision with which the coefficient has been estimated.
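The division is easy to check against the summary table. A quick sketch (Python used only as a calculator, names made up):

```python
def t_statistic(estimate, std_error):
    """t value = coefficient estimate divided by its standard error."""
    return estimate / std_error


# (Estimate, Std. Error) pairs copied from the summary above.
rows = {
    "diameter": (11.07510, 2.22728),
    "height": (10.76154, 1.53620),
    "length": (-0.45834, 1.80912),
}
for name, (estimate, std_error) in rows.items():
    print(name, round(t_statistic(estimate, std_error), 3))
```

The results match the "t value" column of the summary to three decimals.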

Pr(>|t|) :

This is the p-value. It's one of the main things one should look at for a regression model. The p-value is x% if (100-x)% of the t-distribution is closer to the mean than the t-value on the coefficient you are looking at. For example, if 95% of the t-distribution is closer to the mean than the t-value of the coefficient you are looking at, then the p-value is 5%. The p-value is the probability of seeing a result as extreme as the one you are getting in a collection of random data in which the variable had no effect. A p-value of 5% or less is generally considered the acceptable level at which to reject the null hypothesis, the null hypothesis being that the attribute has no effect on the outcome.

NOTE: The p-value does not have a relation with the size of the effect the independent variable has on the outcome. One can have a very small p-value and still have a small effect on the outcome variable.
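The two-sided p-value can be approximated from the t value with nothing but the standard library. This sketch leans on one assumption: with 4167 degrees of freedom the t-distribution is practically a normal distribution, so the normal tail is used in its place. R itself uses the exact t-distribution.

```python
import math


def two_sided_p(t_value):
    """Two-sided p-value using the normal approximation to the t-distribution."""
    return math.erfc(abs(t_value) / math.sqrt(2))


# t values for as.factor(sex)M and length from the summary above.
print(round(two_sided_p(0.692), 3))
print(round(two_sided_p(-0.253), 3))
```

Both come out close to the 0.489 and 0.800 reported in the Pr(>|t|) column. For small samples the approximation drifts and a proper t-distribution function is needed instead.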

P-value of the regression as a whole :

In case your independent variables are correlated, intuitively, they are explaining the same variations in the dataset and hence the influence is divided among them. This condition is known as multicollinearity.

To have a great explanation of P-values and Student t-distribution , see the following links :

P-values Student t-distribution

A Week With Nexus 4

I am one of the lucky few who got a chance to order a Nexus 4 on December 3rd when it came up for ordering in Canada. Yay! It was not easy. I had to keep refreshing and then I botched up entering the details, but all's well that ends well. (Yes, IRCTC experiences did help in refreshing the page multiple times.) After counting the days (35 days!), I got the phone a week back and it's awesome :-)

How does the phone feel

Solid. Slightly big. Sometimes you have to use two hands to press a button, but with a few widget tweaks, that is unnecessary. The glass is slippery. You should treat it like a baby! This has glass on both sides. I did not get a protector because that would destroy the phone's looks, but if you keep your phone with coins and keys and other hard objects, or you are just a “dropper” (Chandler reference!), then you should probably get a bumper case. The Google one is always out of stock, but there are various options in the AndroidCentral forums. The Poetic ones come with a 3-year warranty, so you may want to look into those.

Battery life

Much has been made of battery life. Please remember that under the hood, the Nexus 4 is a beast. It has a very high resolution, a quad-core CPU, allows multi-tasking, and has 2 GB of RAM! Comparing it to the backup you got on, in my case, a Samsung Galaxy S is wrong. Also, out of the box, GPS, NFC, Map location updates and a host of other things are enabled by default. There are two thought processes here. Either you can say “Heck, this phone is amazing. Let me enjoy it to the fullest and then keep charging it” or “I hate charging. I will disable the crap out of all apps and enjoy battery life like a Nokia phone”. Both of them are perfectly justified. In the latter case, you may want to do the following for starters:

  1. Disable NFC.
  2. Disable the constant report of location in Maps’ settings.
  3. Disable wi-fi/mobile data when not using it.
  4. Disable auto-brightness. I kept it at a constant 20%. You don't need more unless you go out in the bright too often. I don't. I am a graduate student. Enuff' said.

I get more than 30 hours of battery backup with about 2 hours of phone calls and a couple of hours of browsing, tweeting, fb'ing and mails. I would call that very good. I did a few of the aforementioned things, mainly because I am almost always online from my work desktop or laptop, and am not much of a social butterfly by any metric.

Tweaks

Remember that this is the stock Android experience. This is how Google would want you to use Android. It's the Google way. If you feel the brightness is a little low, or you miss HTC Sense or TouchWiz, then you would probably want proprietary software. But, as has been well-documented, proprietary software on top of Android does more harm than good.

The apps that I did get are :

  1. Apex Launcher – to remove the dock, incorporate gestures, and have an empty home screen. There is a Pro version which gives you more options. There is Nova Launcher as well. From my limited knowledge, there is not much difference between them. You can get either.
  2. Tango – to make free video calls home.
  3. Viber – similar to Tango

Surprises

The last time I used Android was in late 2011. Those days, we had Gingerbread. If that was your last Android experience, be prepared for a huge leap forward. Jelly Bean, like other people have noted, is incredibly polished. Backed by beast-like power, everything just runs. Mind-bogglingly fast. Multi-tasking is a breeze and sometimes I have more than 11 apps running. (I keep clicking on the multi-tasking button to see if it's hanging :P)

I have not used an iPhone, so if you are looking for an iPhone 5 v/s Nexus 4 flame post, sorry, look elsewhere. Most of the apps that I use (Dropbox, Evernote, Gmail, Facebook, etc.) are cross-platform, so, despite having a Mac, I am finding it easy to use them with an Android phone. Some say that the iOS apps for Gmail and Maps are more intuitive; can't say. The ones that come with the phone are enough for me.

Camera! The camera is awesome! Okay, my previous experience was with a Samsung Galaxy, which was pretty decent. But, even when coming from there, this one is great. PhotoSphere is very, very easy to use and does give good panoramas. There are a multitude of other options. I generally carry around a very good point-and-shoot camera (I stay in Vancouver. This place calls for a camera all the time), so I have not experimented much with the phone's camera.

Value for money

Absolutely. At $420 here in Canada with shipping and (ridiculous) HST, it’s an absolute steal. The sad part, though, is that Google has botched it up. I am not an expert, or even an amateur, at analyzing product launches (or screw-ups), but even my grandmother can say that the Nexus 4 is in huge demand and not having it on the Play Store for months on end is a disaster. The phone is amazing, and Google’s lack of initiative in telling the (eager) public about the reasons, if any, behind the absence adds fuel to the rumours. But when you do read about rumours, do not trust those which do not mention sources. In the heat of journalistic competition, there are many outlets which report things like “LG has back-stabbed Google like Brutus did to Julius Caesar”. Now, I do not know if that is true or false, but the outlet here did not mention its sources. So I would not fret over it.

Why Breaking Bad Is So Awesome

I love Breaking Bad. I started watching it very recently, about a month back. And I finished all the seasons in 15 days. It is a heavy-duty series. It will make you think, react, love, hate and go through a gamut of emotions, thoughts and mood swings. And you will keep going from episode to episode long after the sun has set and then come up again. My body clock went for a toss, I had no idea about day and night, but no regrets! Breaking Bad is awesome!

You would ask why. Good question. More than the fact that most of the cast is awesome, especially the lead, Bryan Cranston, it’s about the excellent writers behind the series. The attention to detail is quite mind-boggling: the episode names (each one has a reason behind it), the fact that each name in the credits has a chemical element symbol from the periodic table highlighted in it, the pre-episode snippet which almost always fast-forwards several episodes and makes you wonder and wonder about what can happen next, and everything else. Let’s look at it in more depth.

Warning: There could be spoilers ahead. If you have not watched Breaking Bad at all, I would suggest that you watch it. (If you do not trust me enough, please use IMDb, Quora or any of the other more trustworthy resources to make a decision and watch it. Then you can come back!)

Great Writing

The dialogues in Breaking Bad ring loud in your mind long after they have been delivered, incredibly well, by the cast. Gus’ advice to Walt about “not making the same mistake twice”, the unbelievable portrayal of Jesse as a faithful yet stupid kid who slowly grows up to be more stable as the series goes on. Even the car that Walt uses is a no-frills, very old model that I am sure the writers chose to signify how under-the-radar Walt wanted to be. Walt is very meticulous, and that is something that you start to think about as well as you try and predict what could be next in the episode. (Yet mostly Vince, the lead writer, manages to surprise you.)

On the surface level, the hero slowly becomes the villain. You start off by feeling really bad for Walt, the poor, supremely over-qualified chemistry teacher teaching indifferent kids in high school, who gets lung cancer even though “he has never smoked a cigarette ever”. You realize he has an unplanned child on the way, his first has cerebral palsy, and there is a mortgage to pay. But then he decides he has to leave enough for his family and starts to cook meth with a former student.

But a description like this you can get from Wikipedia. What Wikipedia will not tell you is that everything in Breaking Bad has a reason: the cars the protagonists use, the colours of their clothes (Marie wears yellow in the very last episode; she always wears purple earlier), the gun that Jesse had, G.B. and W.W. Why do you think Dean Norris, a.k.a. Hank, found that book in the bathroom in the last episode? Do you think Walt made a mistake? I don’t think so. I think Walt is way too meticulous for that. Is Walt really “out” of everything? How will Lydia and Todd take that? See, the attention to detail by Vince is quite incredible.

Superb Acting

Bryan: you will love to hate him. You will love and hate him at the same time. You will not realize how he does it, but he does. And then you will not be able to believe that it’s the same guy from “Malcolm in the Middle”! His transition from the unfortunate chemistry teacher to the vengeful and very canny Heisenberg is so gradual, so well written and acted, that it’s probably the best transition ever rendered on TV! (As far as I have seen. Feel free to tell me of better ones.)

Jesse (Aaron Paul) as a stupid, haphazard kid is well played. From what I have read about the show, the initial plan was to kill him off in a deal gone wrong. But he is so awesome that Vince Gilligan decided that killing him would be an incredible mistake. He slowly transitions into someone who does not flip-flop as much as Walt does, who starts to think things through, and he also comes across as someone who cares. He is not cold and brutal like Walt. That is a very nice contrast that is not so expected when you see Seasons 1 and 2.

More to come

Indexing Millions of Tweets

Say you want to find out, in a few milliseconds, how many tweets have the word “awesome” in them. How do you go about it? We have lots of tweets; let’s take a sample of 30 million. 30 million tweets, full-text search. What do you do?

One way would have been to use Lucene. But Lucene is a text-search library and my data is in the DB, so I went with Sphinx. Sphinx is a full-text search engine for databases. And let me tell you, Sphinx is fast, wicked wicked fast. It’s very easy to get Sphinx up and running, and the amount of configuration needed is relatively low. If you come from the Java world, where you write more and more XML files with each day, Sphinx will be a very pleasant experience.

As you can see, we are indexing about 30 million tweets for starters. Let’s get started!

The first thing you need to do with Sphinx is edit the sphinx.conf file. This configuration file is needed to tell Sphinx a few things:

  • Username, host, password, DB name.

  • SQL queries to fetch the data.

  • Name of the index, and some features of the index.

Now, the Sphinx distribution ships its own version of the configuration file. But it’s huge and it can be intimidating. For starters, you don’t need more than a few lines of the file. They are mentioned below:

For starters, this much configuration is more than enough. The source section specifies which DB has to be indexed. The indexer section configures the program that indexes the DB. searchd is the daemon which serves searches. Remember, there is a “search” utility too, but that only helps in debugging and command-line usage (as we will see below). For production use, you have to run searchd and then use the APIs.
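As a rough sketch of what such a minimal configuration could contain, here is a small Python helper that renders one. The directive names (type, sql_host, sql_query, path, mem_limit, listen and so on) are standard sphinx.conf settings, but the table and column names, the index name, the memory limit and the file locations are all assumptions made for illustration.

```python
from textwrap import dedent


def render_sphinx_conf(db_type, host, user, password, db_name, data_dir):
    """Return the text of a minimal sphinx.conf: source, index, indexer, searchd."""
    # Assumed schema: a table `tweets` with columns id, username, tweet.
    # Sphinx expects the first column of sql_query to be a unique integer id.
    return dedent(f"""\
        source tweets_src
        {{
            type      = {db_type}
            sql_host  = {host}
            sql_user  = {user}
            sql_pass  = {password}
            sql_db    = {db_name}
            sql_query = SELECT id, username, tweet FROM tweets
        }}

        index tweets
        {{
            source = tweets_src
            path   = {data_dir}/tweets
        }}

        indexer
        {{
            mem_limit = 256M
        }}

        searchd
        {{
            listen   = 9312
            log      = {data_dir}/searchd.log
            pid_file = {data_dir}/searchd.pid
        }}
        """)


# Hypothetical connection details; "mysql" and "pgsql" are the usual source types.
conf_text = render_sphinx_conf(
    "mysql", "localhost", "sphinx", "secret", "twitter", "/var/lib/sphinx"
)
```

Writing conf_text to sphinx.conf gives indexer and searchd everything they need for one index; anything beyond that can be added later.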

If you run indexer --all on your command prompt and you get a similar output, congrats! You have successfully run Sphinx to index a database. Now, let’s test if search is up to the mark. :)

I searched for awesome and Sphinx gives me the tweets that contain the word awesome. There are a lot of other options you could use with Sphinx, like extended mode and phrase mode (the query has to match as an exact phrase), you can search over multiple indexes, and so much more! And of course, at mind-boggling speeds!

The next step is to search via an API. Sphinx comes with a Python API which can be used to search. But the search query returns only the ids, and you have to run SQL queries to get the tweets. Of course, there is an option to store the fields you want (in this case username and tweet) in the index itself, but I decided not to. The reason being that if I do it for 30 million tweets, and down the line 400 million tweets, I would run out of memory way too soon! Remember, I do all the work on a simple dual-core machine with 8 GB RAM. :) Nothing fancy. Hence, it’s really important to make trade-offs and try to minimize the load on the system; otherwise, it hangs like there is no tomorrow :(

Links: Download Sphinx, Talks on Sphinx, Documentation for Sphinx. The documentation is really good, and Sphinx now has real-time indexes, lots of search options and more! There is loads to explore with Sphinx.

API Search:

If you want to use Sphinx in production, just a CLI search won’t do :). Let’s use the Python API and search. This is how Sphinx gives the output:

You can call the API and get the results in the following way:
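A minimal sketch of that flow: ask searchd for the matching ids, then fetch the rows with SQL. It assumes the sphinxapi module from the api/ directory of the Sphinx distribution, an index called tweets and a table called tweets with id, username and tweet columns; sqlite3 stands in for the real database so the lookup half can be tried without a server.

```python
import sqlite3


def search_ids(query, index="tweets", host="localhost", port=9312, limit=20):
    """Ask a running searchd for the ids of the documents matching `query`."""
    import sphinxapi  # imported lazily: only needed when searchd is running

    client = sphinxapi.SphinxClient()
    client.SetServer(host, port)
    client.SetLimits(0, limit)
    result = client.Query(query, index)
    if not result:
        raise RuntimeError(client.GetLastError())
    return ids_from_result(result)


def ids_from_result(result):
    """Pull the document ids out of the dict returned by SphinxClient.Query."""
    return [match["id"] for match in result.get("matches", [])]


def fetch_tweets(conn, ids):
    """Fetch the rows for the given ids, keeping the order Sphinx ranked them in."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, username, tweet FROM tweets WHERE id IN ({placeholders})",
        ids,
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    return [by_id[i] for i in ids if i in by_id]


# Stand-in data: an in-memory table and a hand-written result dict in the
# shape Query returns (a "matches" list whose entries carry an "id").
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, username TEXT, tweet TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?, ?)",
    [(1, "a", "awesome day"), (2, "b", "meh"), (3, "c", "so awesome")],
)
fake_result = {"matches": [{"id": 3, "weight": 2}, {"id": 1, "weight": 1}], "total": 2}
tweets = fetch_tweets(conn, ids_from_result(fake_result))
```

With searchd up, fetch_tweets(conn, search_ids("awesome")) would do the same thing against the real index.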

The output is shown like this. Remember to index the id and username columns of the table in the database! It makes the SQL queries really really fast. :)

WARs and Tomcat With Ant

So, I was looking forward to a pain(less) week when, early Monday morning, my boss told me to automate our build/deployment process. I was silent for a few minutes, hoping to pass on the message that I had no clue what I needed to do. I soon realized that silence was futile and I had to start my war with WARs. Here is what a typical deployment process looks like:

0. Stop Tomcat

1. Check out the latest, greatest code from the source control you use (we use SVN)

2. Build the source

3. Run the unit-test cases

4. If and when they pass, you deploy the WAR file.

5. Restart Tomcat

I had two days to do this and, let me tell you, I had no idea about Ant. I was able to finish it, in spite of power cuts and several wrong steps :). I will post my (limited) knowledge and hope you do not repeat the mistakes I made :)

i) Ant works based on XML files. Ant has something called “tasks”. Say you want to run JUnit test cases: you have the <junit> task. You want to execute a shell script? You have <exec>. You want to compile some Java code? You have <javac>. There are more than 80 core tasks in Ant, and more than 60 optional ones (à la plugins). So, there is a LOT Ant can do to automate your build/deployment process. For instance, say you do not use JUnit, you use TestNG. Sure, Ant has something for that. You want to selectively pick some Java files? Sure, use <fileset> with includes and excludes. You hate SVN, it’s too old? Sure, it can work with other source control systems too. You prefer using property files? Sure, use <property>. You see, Ant can do a lot for you.

ii) But the catch is that Ant understands only XML files. And XML files, unlike code, do not scale too well. (Hint: try understanding a 300-line XML build file. Yes, it’s a mess.) Hence, use comments wisely, and use properties as much as possible. Keep all your paths, file names and build paths in a separate property file and then read from it. For example,

You can define properties that way in, say, config.properties. Now, how do you access them in your XML file? Well, like I said, you have a task for it: the <property> task.

When you declare a prefix, a property can be accessed as ${<prefix>.<name>}. This way, you can keep all your main paths in one file and avoid any form of hard-coding in the build files.
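To make the prefix lookup concrete, here is a small Python sketch that holds a made-up config.properties and a made-up build snippet as strings and resolves a ${...} reference by hand. The file and prefix attributes of <property> are real Ant attributes; the property names, the paths and the cfg prefix are invented for the example, and the resolver only imitates what Ant does internally.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical contents of config.properties.
CONFIG_PROPERTIES = """\
src.dir=/home/build/projectA/src
lib.dir=/home/build/lib
war.name=projectA.war
"""

# Hypothetical build file that loads the properties under the prefix "cfg".
BUILD_SNIPPET = """\
<project name="demo" default="show">
    <property file="config.properties" prefix="cfg"/>
    <target name="show">
        <echo message="Sources are in ${cfg.src.dir}"/>
    </target>
</project>
"""


def load_properties(text, prefix):
    """Parse key=value lines and store each key as <prefix>.<key>."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[f"{prefix}.{key.strip()}"] = value.strip()
    return props


def expand(text, props):
    """Replace ${name} with its value; unknown names are left untouched."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: props.get(m.group(1), m.group(0)), text)


project = ET.fromstring(BUILD_SNIPPET)
prefix = project.find("property").get("prefix")
props = load_properties(CONFIG_PROPERTIES, prefix)
message = expand(project.find("target/echo").get("message"), props)
```

Here src.dir from the property file becomes cfg.src.dir inside the build file, so message ends up holding the full source path.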

Stop and start Tomcat :

I personally prefer the empty screen and the blinking prompt of the command line to the humongous, bells-and-whistles, fully loaded Eclipse IDE. Starting and stopping Tomcat on the command line is a matter of running either startup.sh or shutdown.sh. How about doing it from Ant? Now, there are various ways. You could run it as a “java -jar” command and pass arguments, or you could use the Tomcat property in Ant. I took an easier way: running a shell script from Ant :) [Did I not tell you that Ant can do a LOT of tasks for you?]
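A build file for this could look roughly like the tomcat.xml held in the Python string below, which is parsed only to show which script each target hands to the shell. <exec>, its executable attribute and the nested <arg value="..."/> are standard Ant; the /opt/tomcat location and the target names start and stop are assumptions.

```python
import subprocess
import xml.etree.ElementTree as ET

TOMCAT_XML = """\
<project name="tomcat" default="start">
    <!-- Hypothetical install location; adjust to your machine. -->
    <property name="tomcat.home" value="/opt/tomcat"/>

    <target name="start">
        <exec executable="/bin/sh">
            <arg value="${tomcat.home}/bin/startup.sh"/>
        </exec>
    </target>

    <target name="stop">
        <exec executable="/bin/sh">
            <arg value="${tomcat.home}/bin/shutdown.sh"/>
        </exec>
    </target>
</project>
"""


def scripts_by_target(build_xml):
    """Map each target name to the script its <exec> task passes to the shell."""
    project = ET.fromstring(build_xml)
    return {
        target.get("name"): target.find("exec/arg").get("value")
        for target in project.findall("target")
    }


def run_target(build_file, target):
    """Same as typing `ant -f <build_file> <target>`; needs Ant on the PATH."""
    return subprocess.run(["ant", "-f", build_file, target], check=True)


scripts = scripts_by_target(TOMCAT_XML)
```

run_target is never called here, so nothing gets started or stopped by merely loading the sketch.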

This calls the script and runs it for you. Say you want to stop the server; run it as follows:

ant -f tomcat.xml stop

By default, Ant looks for build.xml. If your build file has a different name, you have to specify it with the “-f” option. And to run a particular target (this is why you use <target>), mention its name.

Now we have two things left: building our source code and running JUnit test cases on it. Let’s start with building the source code. There are two things which are always true of a production codebase:

  1.  There are a lot of project dependencies

  2. A lot of JAR dependencies

Ant needs to be told that project A depends on project B, and that project A needs a set of jar files. Let’s first see how to tell Ant where to find the jar files.

You define a path, pointing at the set of jar files using <fileset> (used for multiple files), and then use its id as “refid” at the corresponding places. It will get much clearer with the XML file below.

I have 5 projects, and 3 WAR files to make. This is how my project dependencies look:

A -> B, C
D -> B, C
E -> B, C

All three of course depend on a lot of *.jar files too, which live in 3 folders in my lib directory. The important thing to note here is that before A can be built, B and C should be built. The order has to be maintained. How do you tell Ant that?

Here is what you do. You use the attribute depends=”target-name” on a target.

If you see, compileC depends on compileB, which in turn depends on compileA. In case it’s not clear, what the build process does is convert the *.java files to *.class files. Remember, build stands for compilation! You build the code, which means you compile it and check for errors. Ant prints them on the console if there are any. So, when you have dependencies, you handle them by using “depends”.
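Here is a hedged sketch of how a shared classpath and “depends” fit together, using the A -> B, C layout from above. The build file lives in a Python string; the target names (compile-a and friends), the directory names and the project.classpath id are all made up. The small function walks the depends attributes the way Ant does, dependencies first and left to right, to show the order the targets would run in.

```python
import xml.etree.ElementTree as ET

BUILD_XML = """\
<project name="build" default="compile-a">
    <path id="project.classpath">
        <fileset dir="lib">
            <include name="**/*.jar"/>
        </fileset>
        <pathelement location="build/B"/>
        <pathelement location="build/C"/>
    </path>

    <target name="compile-b">
        <mkdir dir="build/B"/>
        <javac srcdir="B/src" destdir="build/B" includeantruntime="false">
            <classpath refid="project.classpath"/>
        </javac>
    </target>

    <target name="compile-c">
        <mkdir dir="build/C"/>
        <javac srcdir="C/src" destdir="build/C" includeantruntime="false">
            <classpath refid="project.classpath"/>
        </javac>
    </target>

    <target name="compile-a" depends="compile-b,compile-c">
        <mkdir dir="build/A"/>
        <javac srcdir="A/src" destdir="build/A" includeantruntime="false">
            <classpath refid="project.classpath"/>
        </javac>
    </target>
</project>
"""


def build_order(build_xml, goal):
    """Return the targets that run for `goal`, dependencies first."""
    project = ET.fromstring(build_xml)
    depends = {
        t.get("name"): [d.strip() for d in t.get("depends", "").split(",") if d.strip()]
        for t in project.findall("target")
    }
    order = []

    def visit(name, stack=()):
        if name in stack:
            raise ValueError("circular dependency: " + " -> ".join(stack + (name,)))
        if name in order:
            return  # each target runs at most once per build
        for dep in depends[name]:
            visit(dep, stack + (name,))
        order.append(name)

    visit(goal)
    return order


order = build_order(BUILD_XML, "compile-a")
```

Asking for compile-a gives B, then C, then A, which is the ordering the build needs.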

Deploying the WAR file: We have compiled the source code, seen that there are no errors in compilation, and now we want to package it as a WAR file. Well, guess what? Yup, Ant has a task for it: <war>.

You can tell Ant which files to use to generate the WAR file and where to place it, and Ant does it for you. So, you stop Tomcat, build the source, and deploy it. Well, are we not forgetting something? Yup! JUnit.

We are using Ant 1.8.2, and Ant has a <junit> task. Here is how it looks:

You need to tell JUnit a few things:

  1. Path to junit.jar and ant-junit.jar
  2. Which files make up the unit test cases?
  3. What format do you want the report in? [If you choose plain with usefile=false, it prints to the console. You have XML and HTML reports too.]
  4. Where to keep the results of the test cases?

NOTE: This is for running the JUnit cases. You have to build them too, and that will need the usual <javac> task.
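The four items above map onto a <junit> target along the lines of the one in the Python string below; the helper just reads them back out. The element and attribute names (junit, classpath, formatter, batchtest, todir, usefile) are standard Ant, while the directories, the *Test.class naming pattern and the compile-tests target are assumptions. It also assumes ant-junit.jar is already on Ant’s own classpath, so only junit.jar is listed.

```python
import xml.etree.ElementTree as ET

JUNIT_TARGET = """\
<target name="test" depends="compile-tests">
    <mkdir dir="reports"/>
    <junit printsummary="yes" haltonfailure="yes">
        <classpath>
            <pathelement location="lib/junit.jar"/>
            <pathelement location="build/classes"/>
            <pathelement location="build/test-classes"/>
        </classpath>
        <formatter type="plain" usefile="false"/>
        <formatter type="xml"/>
        <batchtest todir="reports">
            <fileset dir="build/test-classes">
                <include name="**/*Test.class"/>
            </fileset>
        </batchtest>
    </junit>
</target>
"""


def describe_junit(target_xml):
    """Read back the four things the <junit> task has been told."""
    junit = ET.fromstring(target_xml).find("junit")
    return {
        "classpath": [p.get("location") for p in junit.findall("classpath/pathelement")],
        "tests": [i.get("name") for i in junit.findall("batchtest/fileset/include")],
        "formats": [f.get("type") for f in junit.findall("formatter")],
        "report_dir": junit.find("batchtest").get("todir"),
    }


summary = describe_junit(JUNIT_TARGET)
```

The plain formatter with usefile="false" prints to the console, while the xml formatter writes one report file per test class into the reports directory.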

Role-based Access in Tomcat

Apache Tomcat is a servlet/JSP container. It can also be used as a web server, although it is generally suggested that it be used only as a container, with Apache 2 as the web server in front. At my workplace, we used Tomcat as both. For our client, we wanted to add role-based access. We have 7 roles and quite a lot of servlets. What we wanted was that a role, say Manager, can only access certain servlets, say A, B and C. How do you go about it?

1. Enter Realms

We added a table which lists the role names. So, if you have 3 roles, employee, CEO and CFO, you can create a table with these 3 roles mentioned. In Tomcat, edit server.xml with the following lines:

<Realm className="org.apache.catalina.realm.JDBCRealm" connectionName="postgres" connectionPassword="password" 
connectionURL="jdbc:postgresql://localhost:5432/main" digest="MD5" driverName="org.postgresql.Driver" 
roleNameCol="role_name" userCredCol="user_pass" userNameCol="user_name" userRoleTable="user_roles" userTable="users"/>

What this does is connect to a database and specify which columns to use for the role names and the user credentials (role_name and user_pass respectively). It also names a user-role table (user_roles, which records the roles assigned to each user) and a user table (users, with each user’s name and credentials). A usual scenario is when you want to show different parts of the site to different people. For example, the manager of the site need not see the exact financials of the company, and so on. In such a case, you want your servlets/JSPs to be accessed only by the correct role. Tomcat has constructs which enable you to do this.
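To see what the realm expects on the database side, here is a minimal Python sketch of the two tables and of the MD5 digest that digest="MD5" implies for the user_pass column. The table and column names come from the Realm element above; the column types, the sample user and the use of sqlite3 in place of PostgreSQL are assumptions, and roles_for only imitates the kind of lookup the realm performs.

```python
import hashlib
import sqlite3


def md5_hex(password):
    """Hex-encoded MD5 digest, the form stored in user_pass with digest="MD5"."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def create_realm_tables(conn):
    """Create the users and user_roles tables named in the Realm element."""
    conn.executescript("""
        CREATE TABLE users (
            user_name TEXT PRIMARY KEY,
            user_pass TEXT NOT NULL
        );
        CREATE TABLE user_roles (
            user_name TEXT NOT NULL REFERENCES users(user_name),
            role_name TEXT NOT NULL,
            PRIMARY KEY (user_name, role_name)
        );
    """)


def add_user(conn, user_name, password, roles):
    """Store the digested password and one user_roles row per role."""
    conn.execute("INSERT INTO users VALUES (?, ?)", (user_name, md5_hex(password)))
    conn.executemany(
        "INSERT INTO user_roles VALUES (?, ?)", [(user_name, role) for role in roles]
    )


def roles_for(conn, user_name, password):
    """Check the digest, then return the user's roles (None if the login fails)."""
    row = conn.execute(
        "SELECT user_pass FROM users WHERE user_name = ?", (user_name,)
    ).fetchone()
    if row is None or row[0] != md5_hex(password):
        return None
    cursor = conn.execute(
        "SELECT role_name FROM user_roles WHERE user_name = ?", (user_name,)
    )
    return [r[0] for r in cursor]


conn = sqlite3.connect(":memory:")
create_realm_tables(conn)
add_user(conn, "alice", "password", ["CEO"])  # hypothetical user
```

The point to take away is that the password never sits in the table in clear text; what gets compared is the digest.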

i) Enter <security-role>: <security-role> is just for defining a role. Say you have 3 roles: employee, CEO and CFO. A typical declaration of one of these roles would be as follows:

    <security-role>
        <role-name>employee</role-name>
    </security-role>

ii) <security-constraint>: This encapsulates the set of resources which have to be “protected”. What this basically means is that if you have 3 servlets, servlet1, servlet2 and servlet3, and an employee can access only servlet1, then you should do the following:

    <security-constraint>
        <web-resource-collection>
            <web-resource-name>Dashboard</web-resource-name>
            <url-pattern>/Servlet1</url-pattern>
            <http-method>POST</http-method>
            <http-method>GET</http-method>
        </web-resource-collection>
        <auth-constraint>
            <role-name>employee</role-name>
            <role-name>CEO</role-name>
            <role-name>CFO</role-name>
        </auth-constraint>
        <user-data-constraint>
            <transport-guarantee>CONFIDENTIAL</transport-guarantee>
        </user-data-constraint>
    </security-constraint>

Two words: DON’T PANIC. Everything has an explanation. <web-resource-collection> names the set of resources that are going to be protected. The <url-pattern> entries list the resources which have to be protected. Whenever any of these resources is accessed by a user, he/she has to be authenticated. A common question here could be:

If you have a set of 10 protected resources, does the user need to log into the site 10 times, once for each access?

The answer is an emphatic NO. Once the user is logged in, this is stored in the session and he/she does not have to log in again.

<auth-constraint> is the list of roles which have access to the <url-pattern> mentioned above it. You can mention multiple role names here, btw! <user-data-constraint> helps you specify transport-layer constraints with respect to the security you want in your application. We used CONFIDENTIAL as we wanted HTTPS throughout. It is generally a good practice to use HTTPS as much as possible, and it’s a MUST if you are transferring sensitive data.

Here is the complete XML file for reference. We can have three servlets. One of them can be accessed by all, another one can only be seen by the CEO.
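A sketch of what such a file could look like, wrapped in a Python string and parsed to list who can reach what. Only the first two rules come from the description above, reading “accessed by all” as all three roles; the rule for the third servlet (CEO and CFO), the servlet URLs and the resource names are assumptions, and the fragment leaves out the namespace declaration, servlet mappings and <login-config> that a real web.xml would carry.

```python
import xml.etree.ElementTree as ET

WEB_XML = """\
<web-app>
    <security-constraint>
        <web-resource-collection>
            <web-resource-name>Dashboard</web-resource-name>
            <url-pattern>/Servlet1</url-pattern>
        </web-resource-collection>
        <auth-constraint>
            <role-name>employee</role-name>
            <role-name>CEO</role-name>
            <role-name>CFO</role-name>
        </auth-constraint>
        <user-data-constraint>
            <transport-guarantee>CONFIDENTIAL</transport-guarantee>
        </user-data-constraint>
    </security-constraint>
    <security-constraint>
        <web-resource-collection>
            <web-resource-name>Board</web-resource-name>
            <url-pattern>/Servlet2</url-pattern>
        </web-resource-collection>
        <auth-constraint>
            <role-name>CEO</role-name>
        </auth-constraint>
        <user-data-constraint>
            <transport-guarantee>CONFIDENTIAL</transport-guarantee>
        </user-data-constraint>
    </security-constraint>
    <security-constraint>
        <web-resource-collection>
            <web-resource-name>Financials</web-resource-name>
            <url-pattern>/Servlet3</url-pattern>
        </web-resource-collection>
        <auth-constraint>
            <role-name>CEO</role-name>
            <role-name>CFO</role-name>
        </auth-constraint>
        <user-data-constraint>
            <transport-guarantee>CONFIDENTIAL</transport-guarantee>
        </user-data-constraint>
    </security-constraint>
    <security-role><role-name>employee</role-name></security-role>
    <security-role><role-name>CEO</role-name></security-role>
    <security-role><role-name>CFO</role-name></security-role>
</web-app>
"""


def access_map(web_xml):
    """Map each protected URL pattern to the roles allowed to reach it."""
    mapping = {}
    for constraint in ET.fromstring(web_xml).findall("security-constraint"):
        roles = [r.text for r in constraint.findall("auth-constraint/role-name")]
        for pattern in constraint.findall("web-resource-collection/url-pattern"):
            mapping[pattern.text] = roles
    return mapping


def undeclared_roles(web_xml):
    """Roles used in a constraint but never declared with <security-role>."""
    declared = {r.text for r in ET.fromstring(web_xml).findall("security-role/role-name")}
    used = {role for roles in access_map(web_xml).values() for role in roles}
    return used - declared


access = access_map(WEB_XML)
```

Running undeclared_roles over your own file is a cheap way to catch a role name that is misspelt or differs only in case.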