Tuesday, May 08, 2012

Linear Regression


Our aim is to see whether the current pay system in public universities has any correlation with the research value of professors. We try to model salary as a linear function of the following parameters:

  1. Years since PhD : x_1
  2. Research value : x_2
  3. Gender : x_3
  4. Area of residence : x_4

Salary = c_0 + c_1 * x_1 + c_2 * x_2 + c_3 * x_3 + c_4 * x_4

We used the citation data for High Energy Physics Theory [HepTh] (from arXiv, 1992 to 2003) and salary data for the year 2003 for the University of California campuses. There are 27,770 nodes and 352,807 edges in our citation graph.

There are 10 UC campuses :
  1. Berkeley
  2. Davis
  3. Irvine
  4. Los Angeles
  5. Merced
  6. Riverside
  7. San Diego
  8. San Francisco
  9. Santa Barbara
  10. Santa Cruz

We made a list of professors who appear both in the abstracts of papers in the citation network and in the salary database of the UC campuses. This gave us a list of UC professors who have published in HepTh. We found 52 matches. On checking manually, we found that some of these professors are from other areas such as Medicine, Political Science, Finance and Mathematics, and have published only a few papers in HepTh. The bulk of their work lies outside HepTh, so we cannot compare them with Physics professors who work primarily on HepTh. This leaves us with 30 professors.
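The matching step can be sketched as a set intersection over normalized names. This is only an illustration: the helper names and the sample names are made up, and real name matching (initials, middle names, namesakes) needs the manual check described above.

```python
def normalize(name):
    """Lower-case a name and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def common_names(paper_authors, payroll_names):
    """Return the sorted names that occur in both collections after normalization."""
    authors = {normalize(n) for n in paper_authors}
    payroll = {normalize(n) for n in payroll_names}
    return sorted(authors & payroll)


# Made-up placeholder names, not taken from either data set.
paper_authors = ["Alice  Example", "Bob Sample", "Carol Placeholder"]
payroll_names = ["alice example", "Dan Nobody", "CAROL PLACEHOLDER"]

matches = common_names(paper_authors, payroll_names)
print(matches)
```

An exact match on the full name is deliberately strict; it misses some true matches, but it keeps the list that has to be checked by hand short.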

We collected the data on years since PhD and gender for each professor manually. We found only one female professor and hence decided to drop gender as a variable. Also, since we are confining ourselves to UC campuses, we take x_4 to be the same for all professors, assuming the cost of living is approximately the same all over California. Our previous exercise of comparing years since PhD with salary suggests that there is some correlation between the two.

So, our model reduces to a two-variable linear regression model :

y = b0 + b1 * x_1 + b2 * x_2

To begin with, we take a simple definition of research value: total citation count.
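For readers who want to reproduce this kind of fit without Mathematica, here is a minimal sketch of a two-variable least-squares fit using the normal equations in plain Python. The data is synthetic and the function names are our own for this illustration; it is not the data or the code used for the results in this post.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2; returns [b0, b1, b2]."""
    rows = [[1.0, u, v] for u, v in zip(x1, x2)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(3)]
    return solve(xtx, xty)


def r_squared(coef, x1, x2, y):
    """Coefficient of determination of the fitted model."""
    mean = sum(y) / len(y)
    pred = [coef[0] + coef[1] * u + coef[2] * v for u, v in zip(x1, x2)]
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum((t - mean) ** 2 for t in y)
    return 1 - ss_res / ss_tot


# Synthetic, exactly linear data (NOT the UC salary data).
years = [5, 10, 15, 20, 25, 30]
citations = [120, 40, 900, 300, 1500, 700]
salary = [80000 + 2000 * a + 10 * c for a, c in zip(years, citations)]

b0, b1, b2 = fit(years, citations, salary)
r2 = r_squared([b0, b1, b2], years, citations, salary)
print(b0, b1, b2, r2)
```

Because the synthetic salaries are exactly linear in the two inputs, the fit should recover the generating coefficients up to rounding. Real data will not behave this nicely.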

We fit the model to the 30 professors. The result of the fit is:
y = 98470.3 + 269.615 * yrsSincePhD + 5.32492 * citationCount

r-squared : 0.02974

                 Estimate   Standard Error   t-Statistic   P-Value

1 (intercept)    98470.3    14595.5          6.74663       3.03632E-7

yrsSincePhD      269.615    627.158          0.4299        0.67068

citationCount    5.32492    7.42821          0.716851      0.479622

Going by the p-values, neither variable appears to be statistically significant.
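As a sanity check on the first table, each t-statistic is the estimate divided by its standard error, and the p-value is the two-sided tail probability of Student's t distribution. The sketch below assumes 27 degrees of freedom (30 professors minus 3 fitted coefficients) and integrates the density numerically with Simpson's rule; the function names are our own.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 - 2 * (integral of the density from 0 to |t|).

    The integral uses Simpson's rule, so steps must be even.
    """
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)


df = 30 - 3  # assumption: 30 observations, 3 fitted coefficients

# Estimates and standard errors copied from the first table.
t_years = 269.615 / 627.158
t_cites = 5.32492 / 7.42821

p_years = two_sided_p(t_years, df)
p_cites = two_sided_p(t_cites, df)
print(t_years, p_years)
print(t_cites, p_cites)
```

The printed values can be compared with the t-Statistic and P-Value columns of the table.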

But we have already seen that years since PhD should contribute to salary. So we go back and look at the data. The following is a plot of gross pay versus citation count:

[Plot: gross pay versus citation count]

We looked at the outliers, which suggest the following two problems:

  1. Some professors are retired/emeritus and are drawing just a pension.
  2. Some professors have published in HepTh, but their papers are primarily tagged in other areas of physics. Our current database only has papers whose first tag is HepTh, which significantly lowers the citation count of some professors with high salaries.

To overcome problem 1, we decided to look only at professors whose years since PhD is <= 30. The result is as follows:

y = 73625.4 + 12.0094 * citationCount + 1880.43 * yrsSincePhD

r-squared = 0.313416

                 Estimate   Standard Error   t-Statistic   P-Value

1 (intercept)    73625.4    13891.3          5.3001        0.0000255525

yrsSincePhD      1880.43    791.199          2.37668       0.0265942

citationCount    12.0094    6.6497           1.80601       0.0846189


The p-value of years since PhD suggests that it is statistically significant (this is not surprising given our earlier observation and the fact that we have eliminated outliers). The p-value of citation count is still not satisfactory (the usual threshold for statistical significance is 0.05), but it is encouraging. With a larger data set we might see a better result, so we are collecting one. Also, citation count might be a very crude measure of research value, which calls for better ways of ranking papers and hence professors. We have calculated PageRank, closeness and betweenness centrality for the current network. Regression using these measures, and a comparison between them, is currently under way. We shall also look at the h-index and g-index of the professors and compare them with measures based on traditional notions of centrality on the citation graph.
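For reference, here is a small sketch of the h-index and g-index computed from a professor's per-paper citation counts. It uses the usual definitions (h: the largest h such that h papers have at least h citations each; g: the largest g such that the top g papers together have at least g^2 citations, capped here at the number of papers). The citation counts are made up.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def g_index(citations):
    """Largest g (at most the number of papers) such that the top g papers
    together have at least g*g citations."""
    ranked = sorted(citations, reverse=True)
    total = 0
    g = 0
    for rank, count in enumerate(ranked, start=1):
        total += count
        if total >= rank * rank:
            g = rank
    return g


# Made-up citation counts for one hypothetical author.
papers = [10, 8, 5, 4, 3]
print(h_index(papers), g_index(papers))
```

Unlike the total citation count, both indices are less sensitive to a single very highly cited paper, which is one reason to compare them with the centrality measures.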

Problem 2 cannot be completely solved, since we cannot get data on all the publications of all professors, i.e., a complete citation graph. But we are crawling arXiv to get better citation data by also considering papers that have HepTh as a secondary tag.


--
P.S.: We use Gephi to work on the citation network and Mathematica for the regression analysis.
