Monday, April 30, 2012

Crawling Arxiv

We just finished crawling the abstracts and the citation network for 10 years of theoretical high energy physics papers. Arxiv lists up to 2000 articles for a given year, and we can retrieve 10000 abstracts per API request, so fetching metadata was quite simple.

The bottleneck was retrieving the citation data for each paper. This came from http://inspirehep.net/, which is a high energy physics (HEP) literature database. It seems to perform multiple database lookups to match corresponding paper ids, taking anywhere from 1 to 10 seconds per request. We initially ran twenty threads, but dropping to ten threads actually improved performance.
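A minimal sketch of the bounded-thread approach described above, using only the Python standard library. The `fetch` argument is a hypothetical caller-supplied function that looks up one paper id and returns its citations; it is not our actual crawler code, and the default of ten workers simply mirrors the number that worked for us.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_citations(paper_ids, fetch, max_workers=10):
    """Fetch citation data for each paper id using a bounded thread pool.

    `fetch` is a caller-supplied (hypothetical) function that takes one
    paper id and returns the list of paper ids it cites.
    """
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, pid): pid for pid in paper_ids}
        for future in as_completed(futures):
            pid = futures[future]
            try:
                results[pid] = future.result()
            except Exception:
                failed.append(pid)  # keep the id so it can be retried later
    return results, failed
```

Keeping `max_workers` as a parameter makes it easy to compare different thread counts against the same server.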

About 30% of the papers were cross-listed from fields other than theoretical high energy physics, and more than half of the papers are tagged with multiple fields. "General Relativity and Quantum Cosmology" (GR-QC) was the most common overlap. It might be in our interest to crawl GR-QC and retrieve its citation data as well, but that would introduce a bias toward HEP-theory authors who publish papers related to that specific field.

Work Update as of 2012-04-30

[ramya] Retrieved secondary variables for regression. Calculating centrality for the new data.
[kijun] Finished arxiv crawler. Crawled high energy physics citation network from 2003-2012. 
[michael] Retrieved secondary variables for regression. Working on a unified salary table for public schools in multiple states.

Wednesday, April 25, 2012

Plot of Salary vs. Year of PhD




This is a plot of basic pay versus years since Ph.D. for professors in the area of High Energy Physics at UC campuses (as of year 2003).

Monday, April 23, 2012

Work Update as of 2012-04-23

[ramya] Computed closeness centrality, betweenness centrality and PageRank for the high energy physics papers.
Working on centrality for the weighted graph.
[kijun] For each paper in the database, obtained which journal/conference it was published in.
Still working on arxiv crawler.
[michael] Working on salary data.

A snap-shot of Citation Graph

A snapshot of the citation graph of papers in High Energy Physics on arXiv (from 1993 to 2003).

Average In-Degree : 10.089
Average Out-Degree : 15.723


Monday, April 16, 2012

Milestone 2 : Regression Variables and More Data

This week's goals are to calculate centralities for the high energy physics (HEP) citation network, start crawling arXiv for other domains of physics, and obtain data for regression variables.

Here's the to-do list for our second milestone:
  • calculate the PageRank for each paper
  • calculate closeness and betweenness centrality
  • obtain years since Ph.D. for professors in UCs who study high energy physics (HEP)
  • obtain relative cost of living for each college area
  • start crawling arXiv's HEP citation network from 2003 to 2012
  • start crawling arXiv's astrophysics and condensed matter citation network
  • obtain salary data from outside UCs, including University of Texas system.
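As an illustration of the first item on the list, here is a minimal power-iteration sketch of PageRank in plain Python. It is not the Mathematica code we actually use; the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from papers that cite nothing are all assumptions of this sketch.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on a directed graph.

    `edges` is a list of (citing, cited) pairs. Rank held by papers that
    cite nothing (dangling nodes) is spread evenly over all papers.
    """
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if out_links[v]:
                share = damping * rank[v] / len(out_links[v])
                for target in out_links[v]:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy citation network: papers B, C and D all cite A; D also cites B.
toy_edges = [("B", "A"), ("C", "A"), ("D", "A"), ("D", "B")]
ranks = pagerank(toy_edges)
```

In the toy network, A collects rank from three citing papers and ends up with the highest score, while B outranks C and D because D cites it.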

Work Update as of 2012-04-16

[Michael] Collecting salary data from other public universities and calculating degree centrality for high energy physics citation data.
[Ramya] Calculated in-degree for each paper, and working on PageRank and degree centrality.
[Kijun] Mapped high energy physics authors to corresponding professors from UCs. Working on an arXiv crawler for collecting citation network for different areas of physics.

We also prepared for our presentation, which was held on Thursday, April 12th!

Saturday, April 14, 2012

Saturday, April 07, 2012

Milestone 1 : Simple Regression

Our first goal is to get the available data and run a simple single-variable regression on it. For this we will be using citation network data for high energy physics from http://snap.stanford.edu/data/cit-HepPh.html. This dataset gives us the citation network from January 1993 until April 2003. For salary data we are crawling ucpay.globl.org. This gives us base salary for the year 2004. Currently we restrict ourselves to professors from UC campuses.

To Do for first milestone :
  • Make a table of 2004 salary data from ucpay.globl.org (python script) DONE
  • Make a table of HEP-TH professors and their papers (by arxiv id)
  • Parse the abstracts from http://www.cs.cornell.edu/projects/kddcup/datasets.html
  • Alter ucpay salary table to align with HEP-TH professors (rearrange name, remove middle name, etc.)
  • Calculate citation measurements from SNAP data (mathematica, R)
  • Use HEP-TH table to calculate rank for each professor
  • Join the resulting tables
  • Linear regression (mathematica, R)
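The name-alignment and join steps in the list above could be sketched roughly as follows. The two input formats ("Last, First Middle" and "First Middle Last") are assumptions about how the salary and author tables spell names, and the sample names in the usage note are made up.

```python
def normalize_name(raw):
    """Reduce a name to a lowercase (first, last) key.

    Accepts "Last, First Middle" or "First Middle Last"; middle names and
    periods are dropped. Both input formats are assumptions about the data.
    """
    cleaned = raw.replace(".", " ").strip()
    if "," in cleaned:
        last, _, given = cleaned.partition(",")
        given_parts = given.split()
        first = given_parts[0] if given_parts else ""
        last = " ".join(last.split())
    else:
        parts = cleaned.split()
        first = parts[0] if parts else ""
        last = parts[-1] if len(parts) > 1 else ""
    return (first.lower(), last.lower())


def join_by_name(salary_rows, author_rows):
    """Inner-join two lists of (name, value) pairs on the normalized name."""
    salaries = {normalize_name(name): pay for name, pay in salary_rows}
    joined = []
    for name, papers in author_rows:
        key = normalize_name(name)
        if key in salaries:
            joined.append((key, salaries[key], papers))
    return joined
```

For example, `normalize_name("SMITH, JOHN Q.")` and `normalize_name("John Q. Smith")` both give `("john", "smith")`. Two different people who share a first and last name would collide under this key, so matches should be spot-checked by hand.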

Work Update as of 2012-04-07

[Michael] ucpay.globl.org actually lets you download the raw data as CSV. Parsed these files (for years [2004, 2010)) and added them to the sqlite DB.
[Ramya] Converted the citation network data to a format usable by Mathematica. Exploring various tools available in Mathematica for analysis of networks.
[Kijun] Getting the list of authors for each article in the citation network.

We have also set up a GitHub repository where we commit our data, code and results. This makes coordination between team members easier.

Proposed Timeline

first half of the term
We will first focus on the empirical side. Since gathering data may take a long time, after we obtain sufficient data to start we will concurrently analyze the data and refine/obtain more data.

Gathering citation/authorship and salary data (5 weeks)
We need to gather data for citation network and salary of the professors. For citation networks, we will start with the high-energy physics data from http://snap.stanford.edu (a relatively small network). (SNAP also has software that may be useful for processing and computing using the datasets.) If needed, we will also use the DBLP citation network, available at http://arnetminer.org/citation, which consists of more than one million nodes and two million edges.

For professors’ salaries, the Collegiate Times offers a centralized database of such data. Unfortunately, they only display a non-pageable list of the first 250 results for each school. We will write to them and ask for the dataset for academic purposes. Otherwise, we will write a scraper to retrieve data from individual sites that host state-wide salary information, such as http://ucpay.globl.org.

One complication is that many women publish under a name that differs from their legal name, which is presumably used in the salary data.

Regression tools (1 week)
To familiarize ourselves with regression techniques, we will run regressions using the salary data and a trivial centrality measure (such as the number of papers for each professor). This involves choosing a software package, and writing scripts to process the data obtained into a form suitable for the software we decide to use.
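For reference, the single-variable regression we have in mind is ordinary least squares, which is small enough to write out by hand. The sketch below is plain Python rather than the package we will eventually choose; `xs` would be the centrality measure (for example, number of papers) and `ys` the salary.

```python
def simple_linear_regression(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # With a single predictor, R^2 is the squared correlation coefficient.
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared
```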

Apply Centrality Measures and Regression (2 weeks)
We will calculate the PageRank, degree, closeness, and betweenness centralities of the citation network in our dataset. After calculating the centrality for each node, we will try different regression models to determine which model has the best fit.

Evaluate Result (1 week)
We will analyze the results obtained from the previous step. We will explain why some measures performed better than others, and determine and describe factors accounting for error or bias.

second half of the term
Focused on theory. Of course, since the difficulty of obtaining data is hard to predict, this could be advanced or delayed from the midpoint of the course.

Design a Better Centrality Measure (2 weeks)
If we believe that the centrality measure can be improved dramatically, perhaps through weighting different measures or using an entirely new concept, we will spend the next two weeks implementing a centrality measure better suited for measuring importance in a citation network. We will run regressions using the new centrality measure and see whether it correlates more strongly with salary.

Analysis of gameability of centrality measures (2 weeks)
If the new centrality measure is deemed “good enough”, it indicates that there exists an important correlation between a professor’s centrality and salary. We will explore the ways in which a professor is able to “game” the centrality measurement to improve his or her salary.

Discussion

It is interesting to learn how scholarly pay is determined, and whether citation centrality is a useful indicator of academic performance. If the academic market is a meritocracy, more productive professors, as measured for example by the number of publications, would earn higher pay. However, quality also matters. This suggests that a better measure would be number of publications adjusted for quality, such as quality of journals. Another way to measure quality of publications is by citation counts of the publications.


Existing research on academic salaries and on citation network centralities

Others have correlated academic salaries with citation counts, including:
Bernard Grofman, “Determinants of Political Science Faculty Salaries at the University of California.” Political Science, 2009, 719-727.
Luis R. Gomez-Mejia and David B. Balkin. “Determinants of Faculty Pay: An Agency Theory Perspective.” Academy of Management Journal, 1992, Vol. 35, No. 5, 921-955.
But counting citations is an imperfect measure of a scholar’s marginal product; it is more informative of quality to be cited by an important author than by a minor author. We therefore want to try to construct a different measure of importance of the authors based on centrality measures in the network of research in the academic subject area. If we are successful, our measure should have incremental explanatory power for scholarly pay, and perhaps even dominate determinants identified in past studies.

Michael Hadani, Susan Coombes, Diya Das and David Jalajas, “Finding a good job: Academic network centrality and early occupational outcomes in management Academia,” Journal of Organizational Behavior, 2011,
examines academic network centrality (where linkage is by department) in relation to occupational outcomes. However, it does not examine citation networks or the centrality of citations, and it does not look at the effect on pay. We intend to look at these aspects.


Centrality measures

The concept of centrality was introduced in 1948 by Bavelas in the context of human communication. Since then, various centrality measures have been proposed and studied in different contexts.
Linton C. Freeman, “Centrality in Social Networks Conceptual Clarification”, Social Networks, 1 (1978/79) 215-239, gives a graph-theoretic approach to defining and measuring centrality in a network.

The following are a few measures of centrality in a network:
  1. PageRank/Katz centrality (there are several variations)
  2. Degree centrality (Citation count in citation network)
  3. Constant function (i.e. number of papers in case of citation network)
  4. Closeness centrality (reciprocal of the average distance with other nodes)
  5. Betweenness (percentage of shortest paths that pass through the given node)
  6. Eigenvector centrality (PageRank is a variant of it)
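To make one of the definitions above concrete, here is a small breadth-first-search sketch of closeness centrality. It makes two simplifying assumptions that are ours, not part of the definition: citations are treated as undirected links, and the average distance is taken only over nodes reachable from the source.

```python
from collections import deque


def closeness_centrality(edges):
    """Closeness of each node: (reachable nodes) / (sum of distances to them).

    Assumptions of this sketch: citations are treated as undirected links,
    and distances are averaged only over nodes reachable from the source.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    closeness = {}
    for source in neighbors:
        distance = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search gives shortest path lengths
            current = queue.popleft()
            for nxt in neighbors[current]:
                if nxt not in distance:
                    distance[nxt] = distance[current] + 1
                    queue.append(nxt)
        total = sum(distance.values())
        closeness[source] = (len(distance) - 1) / total if total else 0.0
    return closeness
```

On the three-node path A, B, C, the middle node B is at distance 1 from both others and gets closeness 1.0, while each end node gets 2/3.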

We need to identify which centrality measure best suits our purpose. In fact, any individual centrality measure might fail to capture all the desired characteristics. So, we believe that a weighted centrality measure encompassing different aspects contributing to the relative importance needs to be defined. Some of the points we need to consider:
  1. Authors often publish papers that build on their previous work, and hence cite their own old papers. We need to see how to value these against citations by other authors.
  2. People also use citations as a form of social exchange. For example, as a favor people may preferentially cite those that they know personally.
  3. Getting cited by an important author should count for more, analogous to the contribution to PageRank when a highly ranked node links to a node.
(See also the discussion on bias below.)


Robustness

An important challenge in studying centrality is robustness. In most cases we have imperfect data, with a few nodes or edges missing, or spurious nodes or edges present. A study of the robustness of centrality measures has been done in
Stephen P. Borgatti, Kathleen M. Carley, David Krackhardt, “On the robustness of centrality measures under conditions of imperfect data”, Social Networks, Volume 28, Issue 2, May 2006, Pages 124–136
where they (quote) “show that the accuracy of centrality measures declines smoothly and predictably with the amount of error. This suggests that, for random networks and random error, we shall be able to construct confidence intervals around centrality scores. In addition, centrality measures were highly similar in their response to error. Dense networks were the most robust in the face of all kinds of error except edge deletion. For edge deletion, sparse networks were more accurately measured”. The authors considered degree, betweenness, closeness and eigenvector centrality, and compared them using top 1%, top 3%, top 10%, overlap, and R² measures of accuracy. In our proposed project, we will be collecting data on a citation network, and we will only be able to create a partial citation network from it. So an understanding of robustness becomes necessary.
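One way to get a feel for this on our own data is a small edge-deletion experiment: drop a random fraction of citations and check how much of the top-k set survives. The sketch below is our own simplified illustration, not the procedure from the paper; it uses citation count as the centrality and a seeded random generator so runs are repeatable.

```python
import random


def citation_counts(edges):
    """Number of incoming citations for every node in the edge list."""
    counts = {}
    for citing, cited in edges:
        counts.setdefault(citing, 0)
        counts[cited] = counts.get(cited, 0) + 1
    return counts


def top_k(scores, k):
    """Set of the k highest-scoring nodes (ties broken by node name)."""
    ranked = sorted(scores, key=lambda node: (-scores[node], node))
    return set(ranked[:k])


def top_k_overlap_after_deletion(edges, fraction, k, seed=0):
    """Share of the true top-k nodes still in the top-k after random edge loss.

    Each edge is dropped independently with probability `fraction`; k must
    be at least 1.
    """
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() >= fraction]
    true_top = top_k(citation_counts(edges), k)
    observed_top = top_k(citation_counts(kept), k)
    return len(true_top & observed_top) / k
```

Repeating this over several seeds and deletion fractions would give a rough picture of how quickly the ranking degrades as data goes missing.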


Sources of bias

Another major challenge will be noise in the data on salaries. Academics are paid not only to convert coffee into papers, but also to teach students and perform administrative work for the university. Furthermore, one expects output in these dimensions to be correlated (for example, time spent writing papers is not used for teaching), which could bias our results if not properly controlled for. A difficulty in controlling for this is that measures of teaching load and administrative workload are unlikely to be public. We still need to address this issue.

We will only be looking at a small subset of scholars and academic papers in a few fields. Ideally, we want a very-well-delimited field, distinct from all others, so that field boundaries (for the scholars and papers chosen) are aligned perfectly. People involved in interdisciplinary research may show up as peripheral when looking at only a single field, whereas due to their importance in connecting disparate fields, they could plausibly have especially high centrality in the unobserved overall graph of publications. They would have smaller centrality measures than their true importance, and thus presumably salary, so this error biases the correlation toward zero.


Gameability

If citation centrality affects pay, that gives academics an incentive to alter their citation patterns to maximize their score. That would reduce the usefulness of the centrality measure; we want to find a measure that is hard to game in this way. To do this, we must first develop a precise notion of gameability. One possibility is to analyze the maximum and expected marginal effects of getting one additional citation (perhaps as a favor or in exchange for a citation for the other author). More sophisticated models could try to capture social links (since it’s easier to get a citation from someone you know); perhaps school graduated from, graduation year, and current university department could be used as proxies for whether authors are socially linked.
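The "maximum and expected marginal effect" idea can be sketched as below. The centrality is passed in as a function so that any measure can be plugged in; in-degree is used as the default only to keep the example short. Treating every possible citing node as equally likely when taking the mean is an assumption of this sketch.

```python
def in_degree(edges):
    """Citation count of every node; duplicate edges are counted once."""
    counts = {}
    for citing, cited in set(edges):
        counts.setdefault(citing, 0)
        counts[cited] = counts.get(cited, 0) + 1
    return counts


def marginal_effect(edges, target, centrality=in_degree):
    """Score gain for `target` from one extra citation by each other node.

    Returns (max_gain, mean_gain) over all candidate citing nodes that do
    not already cite `target`.
    """
    edges = list(edges)
    base = centrality(edges).get(target, 0.0)
    existing = set(edges)
    nodes = {n for edge in edges for n in edge}
    gains = []
    for citer in sorted(nodes - {target}):
        if (citer, target) in existing:
            continue
        boosted = centrality(edges + [(citer, target)])
        gains.append(boosted.get(target, 0.0) - base)
    if not gains:
        return 0.0, 0.0
    return max(gains), sum(gains) / len(gains)
```

With in-degree every extra citation is worth exactly 1, so the maximum and the mean coincide. A measure in which the two differ widely is one where it matters a great deal who the favor comes from.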

Introduction

The problem of identifying how significant a node is in a network is encountered in various applications. For example, search engines need to find the importance of web pages for a keyword and rank them in order to produce useful output for the user. For advertising and epidemic control, identifying the most important nodes in the network is a key. The major challenge here is how to measure importance according to the observed structure of a network. The notion of centrality addresses this issue by offering metrics for how “central” a node is to the given network.

In determining faculty salaries, universities attempt to measure a scholar’s marginal value to the university or to the general social welfare. Common measures used are the number of publications, the number of publications in top journals, and the number of times papers are cited. But these measures may not comprehensively measure a scholar’s marginal product.

We propose to construct centrality measures in the citation network of an academic field, use the citation centralities of papers to construct empirical indices of the importance of scholars, and test whether these indices are correlated with professors’ salaries. We will evaluate different centrality measures in the citation network of an academic field. By doing so, we hope to learn what each centrality measure describes, and how well each measure correlates with salary, a proxy for a scholar’s benefit to the university.

Papers cite other papers, and each paper belongs to one or more researchers. We can construct a paper citation network, and calculate the centrality of each paper. We can then calculate statistics for a scholar based on the centralities of the scholar’s papers, such as the mean centrality of the scholar’s papers, and the maximum centrality. We will need some index of the scholar’s overall influence, such as the sum of the author’s paper centralities. This reflects both number of papers and their importance.

But we do not have valuation measures for individual papers, only for scholars. So we must develop a centrality measure for each scholar instead of each paper. A simple measure would be to sum the centralities of the papers authored. Alternatively, one could calculate a centrality directly for each scholar. That requires a measure of centrality that is more general than PageRank, because it would have to allow for the fact that scholars do not just have single directed links. For example, Scholar A’s papers may cite Scholar B’s papers 17 times, and that must be distinguished from the case where there is only 1 such citation. So some form of weighted PageRank may be better.
It may also be interesting to examine other centrality measures such as degree centrality or closeness centrality. Given the sensitivity of betweenness centrality to small differences in network structure, this measure does not seem appropriate for this study.
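Both options for a scholar-level measure discussed above start from the same bookkeeping, sketched here. The input shapes (a list of paper-to-paper citation pairs and a dict from paper id to author list) are assumptions for illustration, as is the choice to skip self-citations when building the weighted author graph.

```python
from collections import Counter


def author_citation_weights(paper_edges, paper_authors):
    """Collapse paper-level citations into weighted author-level edges.

    `paper_edges` is a list of (citing_paper, cited_paper) pairs and
    `paper_authors` maps a paper id to its list of authors. The weight of
    (a, b) is the number of paper citations from a paper by a to a paper by b.
    Self-citations (a == b) are skipped in this sketch.
    """
    weights = Counter()
    for citing, cited in paper_edges:
        for a in paper_authors.get(citing, []):
            for b in paper_authors.get(cited, []):
                if a != b:
                    weights[(a, b)] += 1
    return weights


def author_scores(paper_scores, paper_authors):
    """Sum each author's paper centralities into one score per author."""
    totals = {}
    for paper, score in paper_scores.items():
        for author in paper_authors.get(paper, []):
            totals[author] = totals.get(author, 0.0) + score
    return totals
```

The weighted edges from `author_citation_weights` are what a weighted PageRank over scholars would take as input, while `author_scores` implements the simpler sum-of-paper-centralities index.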

Although we are proposing to work on citation networks, our idea and approach is widely applicable. The main aim is to formulate a way to find the central/powerful nodes on a network and see whether the valuation assigned to each node indeed reflects its true importance in the graph. For example, this applies to patent citation networks, considering R&D valuations of companies. This helps us come up with refined measures of centrality for specific applications which require design or incentives based on importance.

After exploring prior centrality measures, we will have some insight into how well each fits the salary data, and why. If the correlations are poor, and we have a specific idea as to why, we will try to spend the rest of the term coming up with a better centrality measurement. If the correlations are strong, that means that a professor has enough incentive to optimize his or her centrality. Then we will study the “gameability” of the most relevant centrality measurement, and analyze how one can improve the centrality and the salary given the existing network.

CS 145 project blog

This is the CS 145 project blog created by Ramya, Kijun and Michael. We are studying centrality measures on networks and their implications, specifically in citation networks.