Monday, May 21, 2012

Updates for May 21, 2012

This weekend, we ran the full regression of delta salary on delta citations. We observed that most professors received base salary raises every two years, so we calculated the salary difference over two years and the corresponding change in citations. We obtained 150 datapoints, and the result is as follows:

deltaSalary = beta * deltaCitations


                  Estimate    p-value
delta Citation    33.4899     6.14027 x 10^-12

The p-value is extremely low, so we reject the null hypothesis that beta = 0.
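As a rough illustration of the no-intercept model deltaSalary = beta * deltaCitations, here is a minimal sketch of the least-squares slope through the origin. The numbers are made up for the example and are not our 150 datapoints; the function name is ours.

```python
def slope_through_origin(x, y):
    """Least-squares slope for the model y = beta * x (no intercept)."""
    sxx = sum(xi * xi for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least one non-zero value")
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return sxy / sxx


# Made-up illustrative values, NOT data from the study.
delta_citations = [10, 25, 40, 5]
delta_salary = [300.0, 750.0, 1200.0, 150.0]  # constructed as 30 * delta_citations

beta = slope_through_origin(delta_citations, delta_salary)
print(beta)
```

Because the toy salaries are an exact multiple of the citation changes, the fitted slope is that multiple; on real data the fit would also need a standard error to get a p-value.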

Also, while running regression on biyearly data, we realized that we should have paired up the previous year's centrality with the current year's salary. We ran the regressions with centralities for 2009 with salaries for 2010, and here are the results:


                   Estimate         p-value
Constant           57078.5          1.62923 x 10^-9
Years Since PhD    2431.5           2.01299 x 10^-10
Citations          5.062            0.00961352


                   Estimate         p-value
Constant           60246.7          2.01867 x 10^-10
Years Since PhD    2338.52          9.00664 x 10^-10
PageRank           4.4146 x 10^6    0.0155467


The p-values and the estimates were similar to those of the previous results.
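The pairing of the previous year's centrality with the current year's salary can be sketched as a simple lagged join. This is only an illustration: the dictionary layout, the function name and the professor labels are hypothetical, and the values are made up.

```python
def pair_lagged(centrality, salary, lag=1):
    """Pair each (professor, year) salary with that professor's
    centrality from `lag` years earlier, skipping missing entries."""
    pairs = []
    for (prof, year), pay in sorted(salary.items()):
        key = (prof, year - lag)
        if key in centrality:
            pairs.append((prof, year, centrality[key], pay))
    return pairs


# Hypothetical toy records keyed by (professor, year); NOT real data.
centrality = {("prof_a", 2009): 120, ("prof_b", 2009): 45, ("prof_a", 2010): 150}
salary = {
    ("prof_a", 2010): 101000.0,
    ("prof_b", 2010): 88000.0,
    ("prof_c", 2010): 93000.0,  # no 2009 centrality, so it is dropped
}

pairs = pair_lagged(centrality, salary)
for row in pairs:
    print(row)
```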


During the last meeting, we agreed that we should try to capture the values of citations coming from differently ranked professors. This weekend, we debated whether PageRank would be a good measure to capture that. Even though the regression result we obtained for PageRank could be interpreted as statistically significant, we still haven't established the significance of delta PageRank.

Michael also wrote code for calculating the g- and h-indices. We will be analyzing the output this week.

Why PageRank is not a good measure to capture value of a paper

Let us consider a toy example:

P = [ 0 0 0 0; 1 0 0 0; 1 0 0 0; 1 1 1 0]

From the traditional page-rank:

G = a* (P+I)./deg_of_each_node + (1-a)*(1/n)*(ones)

Computing PageRank for our toy example with this G, we get
    0.9873
    0.1039
    0.1039
    0.0598

that is  r(1) = 1, r(2) = 2, r(3) = 2, r(4) = 3.

Now, suppose paper 3 had cited paper 2,

P = [ 0 0 0 0; 1 0 0 0; 1 1 0 0; 1 1 1 0]

Now the page rank gives,

    0.9835
    0.1475
    0.0848
    0.0608

that is  r(1) = 1, r(2) = 2, r(3) = 3, r(4) = 4.
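For readers who want to check the numbers, here is a minimal sketch of the computation above in plain Python. Two things are assumptions on our part, since the formula does not state them: the damping factor a = 0.85, and scaling the result to unit Euclidean length. With those choices the sketch should give values that round to the ones listed for both networks. Row i of P lists the papers that paper i cites.

```python
def pagerank(P, a=0.85, iters=200):
    """PageRank for G = a * (P + I) ./ deg + (1 - a) * (1/n) * ones.

    a = 0.85 and the unit 2-norm scaling are assumptions, not stated in the post.
    """
    n = len(P)
    # Add self-loops (P + I), then divide each row by its degree (row sum).
    M = []
    for i in range(n):
        row = [P[i][j] + (1 if i == j else 0) for j in range(n)]
        deg = sum(row)
        M.append([v / deg for v in row])
    G = [[a * M[i][j] + (1 - a) / n for j in range(n)] for i in range(n)]
    # Power iteration for the left eigenvector: r <- r G.
    r = [1.0 / n] * n
    for _ in range(iters):
        r = [sum(r[i] * G[i][j] for i in range(n)) for j in range(n)]
    norm = sum(v * v for v in r) ** 0.5
    return [v / norm for v in r]


network_1 = [[0, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0]]
network_2 = [[0, 0, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0]]

ranks_1 = pagerank(network_1)
ranks_2 = pagerank(network_2)
print([round(v, 4) for v in ranks_1])
print([round(v, 4) for v in ranks_2])
```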

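What did we want to capture?  (assuming each citation means the same, i.e., an unweighted graph)
The only difference between network 2 and network 1 is that paper 2 gains one citation, so in network 2 we expect
val(1) > val(2) > val(3) > val(4)
at the current time, with val(3) and val(4) unchanged from network 1.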

Problems with absolute value of page ranks :

If we look at the values that PageRank gives, the value of paper 4 increased from 0.0598 to 0.0608. This does not make sense: whether or not an older paper cited another older paper should not change the value of a newer paper.

If we look at ranks instead, paper 3 dropped from rank 2 to rank 3. But a paper's value should depend on its in-degree and not on its out-degree.

So, page rank does not capture the essence of the difference between network 1 and 2. 
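To make the in-degree argument concrete, here is a small sketch that counts citations received in each toy network. The helper name is ours; the matrices are the two P matrices from the example, with row i listing the papers that paper i cites.

```python
def in_degrees(P):
    """Number of citations each paper receives (column sums of P)."""
    n = len(P)
    return [sum(P[i][j] for i in range(n)) for j in range(n)]


network_1 = [[0, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0]]
network_2 = [[0, 0, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0]]

deg_1 = in_degrees(network_1)
deg_2 = in_degrees(network_2)
print(deg_1)
print(deg_2)
```

Only paper 2's in-degree differs between the two networks, while papers 3 and 4 receive the same number of citations in both, which is the behaviour we wanted a value measure to reflect.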

Wednesday, May 16, 2012

Linear regression with delta salary and delta citation count for the year 2010

Another goal of this week was to perform regression on the change in salary and centrality. To do so, we first calculated cumulative centrality for every year between 2004 and 2010. Then we calculated the change in citation count and PageRank.
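The change calculation can be sketched as follows, assuming the cumulative value for each year is already available as a dictionary keyed by year. The function name and the numbers are illustrative only.

```python
def lagged_delta(value_by_year, lag=1):
    """Change in a cumulative measure between `year - lag` and `year`."""
    return {
        year: value_by_year[year] - value_by_year[year - lag]
        for year in sorted(value_by_year)
        if year - lag in value_by_year
    }


# Made-up cumulative citation counts for one professor; NOT real data.
cumulative_citations = {2008: 120, 2009: 150, 2010: 165}

yearly_change = lagged_delta(cumulative_citations)
two_year_change = lagged_delta(cumulative_citations, lag=2)
print(yearly_change)
print(two_year_change)
```

The same helper applies to any yearly measure, for example PageRank computed on each year's cumulative graph.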

We had access to two different types of salary data: gross pay, and base pay. We initially ran a regression on the change in gross pay and citation count, but the variance in gross pay was too high to draw any correlation.



When we graphed change in base pay versus change in citation count, it fared much better.



We had in fact not expected a negative change in base pay!

We ran linear regression on delta base pay using two variables, years since PhD and delta citation count, for year 2010 (that is change from 2009 to 2010). The result is as follows:

R^2 : 0.0709406
                  Estimate    p-value
Constant          -406.115    0.869237
delta Citation    45.6588     0.0348579

The p-value for the constant term (0.87) indicates that it is not statistically significant. We also note that the estimate for the constant term is negative. Taken at face value, a negative intercept would mean that a professor whose citation count grows by less than a certain threshold (where the line crosses y=0) has their salary reduced. This would be quite interesting if it were statistically significant. But with such a high p-value we cannot reject the null hypothesis that the constant is zero, so the data does not suggest a negative change in salary due to non-performance (no change in citation count).
Hence, we performed a single-variable regression on the data without the constant term, and the following was the result :


R^2 : 0.0895593
                  Estimate    p-value
delta Citation    43.7869     0.0162938


For PageRank, since the number of papers is steadily increasing, it was natural for the change in PageRank to be negative. We'd like to try multiplying PageRank by the number of papers taken into consideration.

We are performing regression on delta citation for the other years (2004 to 2009), and will then compare and analyze the results. Also, we are calculating h- and g-indices to see how they would perform in predicting the salary under a simple linear assumption.
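For reference, here is a minimal sketch of the two indices using their usual definitions: h is the largest number such that h papers have at least h citations each, and g is the largest number such that the top g papers together have at least g^2 citations. This is not the code used in the project, and it uses the common variant in which g cannot exceed the number of papers.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(ranked, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h


def g_index(citations):
    """Largest g such that the top g papers have at least g^2 citations in total."""
    ranked = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(ranked, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g


# Made-up citation counts per paper for one professor.
paper_citations = [10, 8, 5, 4, 3]
print(h_index(paper_citations), g_index(paper_citations))
```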


Tuesday, May 15, 2012

Linear Regression on new data

Previously we ran a two-variable linear regression on data from SNAP and found problems caused by considering only papers whose primary tag is HepTh. So we crawled arXiv to obtain new data and ran the linear regression again. The results are as follows :

Using Gross Pay : (limiting years since PhD to <= 40)

R^2 : 0.34126
                   Estimate        p-value
Constant           84184.6         2.56162*10^-8
Years Since PhD    2094.61         0.000113447
Citation Count     7.47883         0.0138856

R^2 : 0.323274
                   Estimate        p-value
Constant           89390.7         3.70218*10^-9
Years Since PhD    1955.36         0.000356706
Page Rank          6.57888*10^6    0.0284647

Using Base Pay : (limiting years since PhD to <= 40)

R^2 : 0.605081
                   Estimate        p-value
Constant           57158.9         1.61337*10^-9
Years Since PhD    2427.95         2.17864*10^-10
Citation Count     4.74429         0.0103403

R^2 : 0.598601
                   Estimate        p-value
Constant           60317.7         1.94649*10^-10
Years Since PhD    2334.74         9.65709*10^-10
Page Rank          4.39694*10^6    0.0158609

Yearly Centralities

We have finished calculating centrality for the citation network for each year from 2004 to 2010. We have also obtained UC salary data from ucpay.globl.org for the years 2004 through 2010. We identified UC professors in the citation data and listed the papers for each professor. Now we are converting the centrality of papers into a rank of professors. Also, with yearly centrality we are computing delta centrality, which will be used to construct a rank based on the change in centrality of papers.

Tuesday, May 08, 2012

Fresh Data :)

We have obtained data on High Energy Physics Theory (HepTh) papers on arXiv from 1991 to May 2nd, 2012. We crawled all papers, with their abstracts, that are tagged HepTh as either a primary or a secondary tag. This addresses issue 2, which we raised after looking at outliers in the old data obtained from SNAP (1992 to 2003, only papers with HepTh as the primary tag).

The current data has 79,188 nodes and 1,163,903 edges! We ran into memory issues :-P After fixing them, we calculated the in-degree, out-degree and PageRank of each node using Gephi. We are currently computing closeness and betweenness centrality on the data. It is taking a LOT of time, as expected.

Meanwhile, we are also working on joining the fresh citation data with the salary data. We are downloading the 2010 salary data for the UC campuses. Since the latest salary data is from 2010, we need to truncate our citation data accordingly.

Linear Regression


Our aim is to see whether the current payment system in public schools has any correlation with the research value of professors. We try to model salary linearly based on the following parameters:

  1.  Years since PhD : x_1
  2. Research value : x_2
  3. Gender : x_3
  4. Area of residence : x_4

Salary = c_0 + c_1 * x_1 + c_2 * x_2 + c_3 * x_3 + c_4 * x_4

We used the citation data for High Energy Physics Theory [HepTh] (from arXiv, 1992 to 2003), and salary data for the year 2003 for the University of California campuses. There are 27,770 nodes and 352,807 edges in our citation graph.

There are 10 UC campuses :
  1. Berkeley
  2. Davis
  3. Irvine
  4. Los Angeles
  5. Merced
  6. Riverside
  7. San Diego
  8. San Francisco
  9. Santa Barbara
  10. Santa Cruz

We made a list of professors who appear both in the abstracts of papers in the citation network and in the salary database of the UC campuses. This gave us a list of UC professors who have published in the area of HepTh. We found 52 matches. By checking manually, we found that some of these professors are from other areas, such as Medicine, Political Science, Finance and Mathematics, and have published only a few papers in HepTh. The bulk of their work lies outside HepTh, so we cannot compare them with Physics professors who work primarily on HepTh. This leaves us with 30 professors.

We collected the data on years since PhD and gender for each professor manually. We found only one female professor and hence decided to ignore gender as a variable. Also, since we are confining ourselves to UC campuses, x_4 is the same for all professors, assuming the cost of living is approximately the same all over California. Our previous exercise of comparing years since PhD with salary suggests that there is some correlation.

So, our model reduces to a two-variable linear regression model :

y = b0 + b1 * x_1 + b2 * x_2
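A minimal sketch of how such a two-variable fit can be computed from the normal equations, in plain Python. This is only an illustration and not the Mathematica fit we ran; the helper names are ours, and the data is synthetic, generated exactly from y = 60000 + 2000 * x_1 + 5 * x_2 so that the fit should recover those coefficients.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_two_variable(x1, x2, y):
    """Ordinary least squares for y = b0 + b1 * x1 + b2 * x2."""
    rows = [[1.0, a, b] for a, b in zip(x1, x2)]
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve_linear(XtX, Xty)


# Synthetic data, NOT the professors in the study.
years_since_phd = [5, 10, 20, 30]
citation_count = [100, 50, 400, 250]
salary = [60000 + 2000 * a + 5 * b for a, b in zip(years_since_phd, citation_count)]

coeffs = fit_two_variable(years_since_phd, citation_count, salary)
print(coeffs)  # approximately [b0, b1, b2]
```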

To begin with, we consider a simple definition: research value = total citation count.

We test our model on the 30 professors. The following is the result of the fit :
y = 5.32492 * citationCount + 269.615 * yrsSincePhD + 98470.3

r-squared : 0.02974

                 Estimate    Standard Error    t-Statistic    P-Value
1                98470.3     14595.5           6.74663        3.03632E-7
yrsSincePhD      269.615     627.158           0.4299         0.67068
citationCount    5.32492     7.42821           0.716851       0.479622

If we look at the p-values, it seems that neither variable is significant.
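As a quick sanity check on the table, each t-statistic is the estimate divided by its standard error. The sketch below recomputes that column from the reported values; with 30 professors and 3 fitted parameters, an absolute t-statistic below roughly 2 is not significant at the 5% level.

```python
# (estimate, standard error) pairs copied from the fit table above.
fit = {
    "1": (98470.3, 14595.5),
    "yrsSincePhD": (269.615, 627.158),
    "citationCount": (5.32492, 7.42821),
}

# t-statistic = estimate / standard error
t_stats = {name: est / se for name, (est, se) in fit.items()}

for name, t in t_stats.items():
    print(f"{name:15s} t = {t:.4f}")
```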

But we have already seen that years since PhD should contribute to the salary. So, we go back and look at the data. The following is a plot of gross pay versus citation count :




We looked at the outliers, which suggest the following two problems :

  1. There are retired/emeritus professors who are drawing just a pension.
  2. There are professors who have published in HepTh, but whose papers are primarily tagged in other areas of physics. Our current database only has papers whose first tag is HepTh. This significantly lowers the citation count of some professors with high salaries.

To overcome problem 1, we decided to look at professors whose years since PhD is <=30. The result is as follows :

y = 73625.4 + 12.0094 * citationCount + 1880.43 * yrsSincePhD

r-squared = 0.313416

                 Estimate    Standard Error    t-Statistic    P-Value
1                73625.4     13891.3           5.3001         0.0000255525
yrsSincePhD      1880.43     791.199           2.37668        0.0265942
citationCount    12.0094     6.6497            1.80601        0.0846189


The p-value of years since PhD suggests that it is statistically significant (this is not surprising given our earlier observation and the fact that we have eliminated outliers). The p-value of citation count (0.085) is still not satisfactory (the standard threshold for statistical significance is ~ 0.05), but it is encouraging. With a larger data set we might be able to see a better result, so we are collecting one. Also, citation count might be a very crude measure of research value, which calls for better ways of ranking papers and hence professors. We have calculated PageRank, closeness and betweenness centrality for the current network, and regression using these measures is currently under way. We shall also look at the h-index and g-index of the professors and compare them with measures based on traditional notions of centrality on the citation graph.

Problem 2 cannot be completely solved, since we cannot get data on all the publications by all professors, i.e., a complete citation graph. But we are crawling arXiv to get better citation data by also considering papers that have HepTh as a secondary tag.


--
P.S : We use Gephi to work on citation network and Mathematica for regression analysis.