The word regression comes from the Latin roots re (back) and gradi (to go). It literally means "to go back", which fits with the way the method is used in mathematics and statistics. The first usage of the term "regression" is credited to Sir Francis Galton. He observed that sons of tall fathers tend to be tall, but not as tall as their fathers, and sons of short fathers tend to be short, but not as short as their fathers (this is known as the "regression effect").
Today, we will discuss regression using the statistical computing language R. R is a great language for applying statistical methods to your data and is very widely used. The dataset we are going to use is "Abalone", from the UCI Machine Learning Repository. Get the dataset here:
Abalone
Description of the dataset, from the file:
# Abalone
# Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.
> abalone<-read.table("/home/rohit/abalone/Dataset.data") #Loads the dataset. read.table accepts a local path or a URL; add sep="," if your copy is comma-separated
> abalone[1:20,] #Show the first 20 rows of the table
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15
2 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 7
3 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 9
4 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10
5 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055 7
6 I 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.120 8
7 F 0.530 0.415 0.150 0.7775 0.2370 0.1415 0.330 20
8 F 0.545 0.425 0.125 0.7680 0.2940 0.1495 0.260 16
9 M 0.475 0.370 0.125 0.5095 0.2165 0.1125 0.165 9
10 F 0.550 0.440 0.150 0.8945 0.3145 0.1510 0.320 19
11 F 0.525 0.380 0.140 0.6065 0.1940 0.1475 0.210 14
12 M 0.430 0.350 0.110 0.4060 0.1675 0.0810 0.135 10
13 M 0.490 0.380 0.135 0.5415 0.2175 0.0950 0.190 11
14 F 0.535 0.405 0.145 0.6845 0.2725 0.1710 0.205 10
15 F 0.470 0.355 0.100 0.4755 0.1675 0.0805 0.185 10
16 M 0.500 0.400 0.130 0.6645 0.2580 0.1330 0.240 12
17 I 0.355 0.280 0.085 0.2905 0.0950 0.0395 0.115 7
18 F 0.440 0.340 0.100 0.4510 0.1880 0.0870 0.130 10
19 M 0.365 0.295 0.080 0.2555 0.0970 0.0430 0.100 7
20 M 0.450 0.320 0.100 0.3810 0.1705 0.0750 0.115 9
Let's get the names of the columns.
> names(abalone)
[1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9"
Clearly, the names are not descriptive. Let's change them.
> names(abalone)=c("sex","length","diameter","height","whole_weight","shucked_weight","viscera_weight","shell_weight","rings")
c(a,b) or c("1","2") creates a vector of the values given within the parentheses. So, we have given names to the columns of the dataset. Let's verify that.
> abalone[1:10,]
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight rings
1 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15
2 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 7
3 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 9
4 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10
5 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055 7
6 I 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.120 8
7 F 0.530 0.415 0.150 0.7775 0.2370 0.1415 0.330 20
8 F 0.545 0.425 0.125 0.7680 0.2940 0.1495 0.260 16
9 M 0.475 0.370 0.125 0.5095 0.2165 0.1125 0.165 9
10 F 0.550 0.440 0.150 0.8945 0.3145 0.1510 0.320 19
Bingo, this looks way better than V1, V2, V3 and so on, does it not? So now we know that, given the attributes sex, length, diameter, height, whole_weight, shucked_weight, viscera_weight and shell_weight, we need to predict the number of rings.
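As a side note, the column names can also be supplied while the data are being read. The sketch below is only an illustration: it assumes the data are comma-separated, and to stay self-contained it reads three rows copied from the table above through the `text` argument instead of a file. The names `column_names`, `sample_text` and `sample_rows` are made up for this example.

```r
# Column names used in this post, in file order.
column_names <- c("sex", "length", "diameter", "height", "whole_weight",
                  "shucked_weight", "viscera_weight", "shell_weight", "rings")

# Three rows copied from the table above, written as comma-separated text.
# Assumption: the real file uses commas; for a file, pass its path
# as the first argument instead of text = ...
sample_text <- "M,0.455,0.365,0.095,0.5140,0.2245,0.1010,0.150,15
M,0.350,0.265,0.090,0.2255,0.0995,0.0485,0.070,7
F,0.530,0.420,0.135,0.6770,0.2565,0.1415,0.210,9"

# col.names assigns the names at read time, so no renaming step is needed.
sample_rows <- read.table(text = sample_text, sep = ",", col.names = column_names)
print(names(sample_rows))
print(sample_rows)
```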
We can see that, except for sex, all the attributes have numerical values. Sex, on the other hand, can take only three values: M, F and I (male, female and infant). So, when we tell R to do regression for us, we need to tell it to treat sex as a categorical variable.
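To make "categorical" a little more concrete, here is a small sketch of what R does with such a variable. It uses the sex values of the first ten rows shown above; `sex_factor` and `indicator_columns` are names chosen for this example only.

```r
# Sex of the first ten abalone in the table above.
sex <- c("M", "M", "F", "M", "I", "I", "F", "F", "M", "F")

# as.factor() turns the strings into a categorical variable with sorted levels.
sex_factor <- as.factor(sex)
print(levels(sex_factor))

# model.matrix() shows the 0/1 indicator columns a linear model would use.
# The first level ("F") gets no column of its own; it acts as the baseline.
indicator_columns <- model.matrix(~ sex_factor)
print(indicator_columns)
```

Each non-baseline level gets its own 0/1 column, so a model fitted with a factor estimates one coefficient per such level, measured relative to the baseline.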
The advantage of a kernel density estimator is that it is smooth and does not depend on the choice of bins that a histogram forces on you, and thus gives you a better idea of the distribution of the data with respect to that variable.
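A minimal sketch of that comparison, assuming we look at the ring counts of the first twenty rows shown earlier (the variable names are invented for the example, and the bandwidth is left at R's default):

```r
# Ring counts of the first twenty abalone in the table above.
rings <- c(15, 7, 9, 10, 7, 8, 20, 16, 9, 19, 14, 10, 11, 10, 10, 12, 7, 10, 7, 9)

# Two histograms of the same data: the picture depends on the number of bins.
coarse_hist <- hist(rings, breaks = 4, plot = FALSE)
fine_hist <- hist(rings, breaks = 10, plot = FALSE)
print(coarse_hist$counts)
print(fine_hist$counts)

# Kernel density estimate: a smooth curve with no bins to choose
# (it has a bandwidth instead, left at the default here).
rings_density <- density(rings)

# Draw the curve only when running interactively.
if (interactive()) {
  plot(rings_density, main = "Kernel density of rings (20 sample rows)")
}
```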
The basic command for doing regression in R is lm (linear models). The form of the command is:
model <- lm ( outcome ~ predictor1 + predictor2 + predictor3 )
Let's apply it to our case.
> linearM<-lm(rings ~ as.factor(sex)+length+diameter+height+whole_weight+shucked_weight+viscera_weight+shell_weight,data=abalone)
This generates a linear model called "linearM" in which the variable "rings" is predicted from the values of the other variables. as.factor(sex) indicates that the variable "sex" should be treated as categorical.
> summary(linearM)
Call:
lm(formula = rings ~ as.factor(sex) + length + diameter + height +
whole_weight + shucked_weight + viscera_weight + shell_weight,
data = abalone)
Residuals:
Min 1Q Median 3Q Max
-10.4800 -1.3053 -0.3428 0.8600 13.9426
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.89464 0.29157 13.358 < 2e-16 ***
as.factor(sex)I -0.82488 0.10240 -8.056 1.02e-15 ***
as.factor(sex)M 0.05772 0.08335 0.692 0.489
length -0.45834 1.80912 -0.253 0.800
diameter 11.07510 2.22728 4.972 6.88e-07 ***
height 10.76154 1.53620 7.005 2.86e-12 ***
whole_weight 8.97544 0.72540 12.373 < 2e-16 ***
shucked_weight -19.78687 0.81735 -24.209 < 2e-16 ***
viscera_weight -10.58183 1.29375 -8.179 3.76e-16 ***
shell_weight 8.74181 1.12473 7.772 9.64e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.194 on 4167 degrees of freedom
Multiple R-squared: 0.5379, Adjusted R-squared: 0.5369
F-statistic: 538.9 on 9 and 4167 DF, p-value: < 2.2e-16
That is a summary of the linear model. Let's delve into the details.
Taking a step back, recall that regression means having an equation of the form:
Y = a*X1 + b*X2 + c*X3 + ... + A, where X1, X2, ... are the attributes, a, b, c, ... are the coefficients, and A is the intercept, the value Y is predicted to have when all the independent variables are zero.
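The equation can be checked against a fitted model. The sketch below is an illustration on a tiny sample (the first ten rows shown earlier, two predictors only), so its coefficients will not match the full model; `sample_abalone`, `small_fit` and `by_hand` are names invented for this example.

```r
# First ten rows of the table above, three columns only.
sample_abalone <- data.frame(
  length = c(0.455, 0.350, 0.530, 0.440, 0.330, 0.425, 0.530, 0.545, 0.475, 0.550),
  height = c(0.095, 0.090, 0.135, 0.125, 0.080, 0.095, 0.150, 0.125, 0.125, 0.150),
  rings  = c(15, 7, 9, 10, 7, 8, 20, 16, 9, 19)
)

# Fit Y = a*length + b*height + A on the sample.
small_fit <- lm(rings ~ length + height, data = sample_abalone)
fit_coefs <- coef(small_fit)
print(fit_coefs)

# Rebuild the fitted values from the equation by hand.
by_hand <- fit_coefs[["(Intercept)"]] +
  fit_coefs[["length"]] * sample_abalone$length +
  fit_coefs[["height"]] * sample_abalone$height

# The hand-built values agree with what lm() reports.
print(all.equal(unname(fitted(small_fit)), by_hand))
```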
Coefficients:
From our regression summary, we can take a few variables and write part of the fitted equation:
Y = -0.45834 * length + 10.76154 * height + 8.97544 * whole_weight + ... + 3.89464
What this says is that for every 1 unit increase in length, the predicted number of rings goes down by 0.45834, for every 1 unit increase in height it goes up by 10.76154, and if all the attributes are zero (with sex at its baseline level, F), the predicted number of rings is the intercept, 3.89464. So, that is the role that coefficients play. The sign of a coefficient gives the direction of the effect, and its size gives the change in the outcome per unit of that attribute. Attributes with small coefficients (for example, length) change the prediction little per unit, while attributes with larger ones change it more; keep in mind that the size depends on the units the attribute is measured in. In multiple regression, such as the one above, each coefficient is the change in the outcome variable when that independent variable is increased by 1 unit while all the other independent variables are held constant.
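As a worked example, the full equation can be applied by hand to the first row of the table. The estimates below are copied from the summary output above; the vector names and the 0/1 encoding of sex (F as the baseline) are written out for illustration.

```r
# Estimates copied from the summary(linearM) output above.
estimates <- c(intercept = 3.89464, sexI = -0.82488, sexM = 0.05772,
               length = -0.45834, diameter = 11.07510, height = 10.76154,
               whole_weight = 8.97544, shucked_weight = -19.78687,
               viscera_weight = -10.58183, shell_weight = 8.74181)

# First row of the table: a male abalone with 15 rings.
# The leading 1 multiplies the intercept; sexI = 0 and sexM = 1 encode "M".
first_row <- c(intercept = 1, sexI = 0, sexM = 1,
               length = 0.455, diameter = 0.365, height = 0.095,
               whole_weight = 0.5140, shucked_weight = 0.2245,
               viscera_weight = 0.1010, shell_weight = 0.150)

# Prediction = sum of (coefficient * value) over all terms.
predicted_rings <- sum(estimates * first_row)
print(predicted_rings)
```

The result is about 9.2 rings, against 15 observed for that abalone, which gives a feel for how large individual residuals can be.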
T-statistic:
What is it? The t-statistic is the coefficient divided by its standard error. The standard error is an estimate of the standard deviation of the coefficient, that is, how much the estimate would vary from sample to sample. It is a measure of the precision with which the coefficient has been estimated.
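The division is easy to reproduce. This short sketch takes three estimates and standard errors from the summary above and recomputes their t values; small differences from the printed column are rounding.

```r
# Estimate and Std. Error columns for three rows of the summary above.
estimates  <- c(length = -0.45834, diameter = 11.07510, height = 10.76154)
std_errors <- c(length = 1.80912, diameter = 2.22728, height = 1.53620)

# t value = estimate / standard error, element by element.
t_values <- estimates / std_errors
print(round(t_values, 3))
```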
The Pr(>|t|) column is the p-value. It's one of the main things one should look at for a regression model. The p-value is x% if (100-x)% of the t-distribution is closer to the mean than the t-value of the coefficient you are looking at. For example, if 95% of the t-distribution is closer to the mean than the t-value of a coefficient, then the p-value is 5%. In other words, the p-value is the probability of seeing a result at least as extreme as the one you got in a collection of random data in which the variable had no effect. A p-value of 5% or less is the generally accepted point at which to reject the null hypothesis, the null hypothesis here being that the attribute in question has no effect on the outcome (its coefficient is zero).
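The two-sided p-value can be recomputed from a t value with R's `pt` function. The sketch below uses two t values and the residual degrees of freedom (4167) from the summary above; because the printed t values are rounded, the results should land close to, but not necessarily exactly on, the printed p-values of 0.489 and 0.800.

```r
# t values for two rows of the summary above.
t_values <- c(sexM = 0.692, length = -0.253)

# Residual degrees of freedom reported in the summary.
residual_df <- 4167

# Two-sided p-value: probability mass in both tails beyond |t|.
p_values <- 2 * pt(-abs(t_values), df = residual_df)
print(round(p_values, 3))
```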
NOTE:
The p-value says nothing about the size of the effect the independent variable has on the outcome. One can have a very small p-value and still have a small effect on the outcome variable.
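P-value of the regression as a whole:
The last line of the summary gives the F-statistic and its p-value (< 2.2e-16 here), which test the model as a whole rather than a single coefficient. If your independent variables are correlated, they are, intuitively, explaining the same variation in the dataset, and hence the influence is divided among them. This condition is known as multicollinearity. It can leave individual coefficients with large p-values even when the regression as a whole is significant.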
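A quick, informal way to spot correlated predictors is to look at their pairwise correlations. The sketch below does this for length and diameter on the first twenty rows shown earlier; it is a rough check under that small-sample assumption, and `sample_predictors` is a name made up for the example.

```r
# Length and diameter of the first twenty abalone in the table above.
sample_predictors <- data.frame(
  length = c(0.455, 0.350, 0.530, 0.440, 0.330, 0.425, 0.530, 0.545, 0.475, 0.550,
             0.525, 0.430, 0.490, 0.535, 0.470, 0.500, 0.355, 0.440, 0.365, 0.450),
  diameter = c(0.365, 0.265, 0.420, 0.365, 0.255, 0.300, 0.415, 0.425, 0.370, 0.440,
               0.380, 0.350, 0.380, 0.405, 0.355, 0.400, 0.280, 0.340, 0.295, 0.320)
)

# Pairwise Pearson correlations; values near 1 or -1 signal overlap.
correlation_matrix <- cor(sample_predictors)
print(round(correlation_matrix, 3))
```

On these rows the two measurements move almost in lockstep, which is consistent with length having a large p-value in the summary while diameter does not.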
For a great explanation of p-values and the Student t-distribution, see the following links:
P-values
Student t-distribution