Saturday, April 07, 2012

Proposed Timeline

First half of the term
We will first focus on the empirical side. Since gathering data may take a long time, we will begin the analysis as soon as we have enough data to start, and continue refining and collecting data in parallel.

Gathering citation/authorship and salary data (5 weeks)
We need to gather citation network data and salary data for professors. For citation networks, we will start with the high-energy physics data from http://snap.stanford.edu (a relatively small network). (SNAP also provides software that may be useful for processing the datasets and running computations on them.) If needed, we will also use the DBLP citation network, available at http://arnetminer.org/citation, which consists of more than one million nodes and two million edges.
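As a rough sketch of the first processing step, the snippet below reads a citation edge list into an adjacency map. It assumes the file has one "citing cited" pair per line, separated by whitespace, with comment lines starting with "#"; the function name read_edge_list and the sample lines are our own illustration, not part of any dataset or library.

```python
from collections import defaultdict


def read_edge_list(lines):
    """Parse 'source target' pairs into {paper: set of papers it cites}.

    Assumes whitespace-separated node ids and '#' comment lines.
    """
    cites = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        src, dst = line.split()[:2]
        cites[src].add(dst)
        cites.setdefault(dst, set())  # make sure cited-only papers appear as nodes
    return dict(cites)


# Tiny made-up sample in the assumed format.
sample = [
    "# FromNodeId\tToNodeId",
    "1\t2",
    "1\t3",
    "2\t3",
]
graph = read_edge_list(sample)
print(graph)
```

For a real file, the same function can be given an open file object in place of the sample list.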

For professors’ salaries, the Collegiate Times offers a centralized database of such data. Unfortunately, it displays only a non-pageable list of the first 250 results for each school. We will write to them and ask for the full dataset for academic purposes. If that fails, we will write a scraper to retrieve data from individual sites that host state-wide salary information, such as http://ucpay.globl.org.

One complication is that many women publish under a name that differs from their legal name, which is presumably the one used in the salary data, so matching authors to salary records by name alone will miss some people.

Regression tools (1 week)
To familiarize ourselves with regression techniques, we will run regressions using the salary data and a trivial centrality measure (such as the number of papers for each professor). This involves choosing a software package and writing scripts to convert the data we obtain into a form suitable for that package.
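To make this warm-up concrete, here is a minimal sketch of a one-variable least-squares regression of salary on paper count, written in plain Python so it does not presuppose any package choice. The function name fit_line and the numbers are made up for illustration only; they are not real salary data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Invented toy data: paper counts and salaries for four hypothetical professors.
papers = [5, 10, 20, 40]
salaries = [70000, 80000, 100000, 140000]

slope, intercept, r_squared = fit_line(papers, salaries)
print(slope, intercept, r_squared)
```

The slope estimates how much salary changes per additional paper, and the R² value gives a first measure of fit that later models can be compared against.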

Apply Centrality Measures and Regression (2 weeks)
We will calculate the PageRank, degree, closeness, and betweenness centralities of the nodes in the citation network in our dataset. After calculating the centralities for each node, we will try different regression models to determine which model fits the salary data best.
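The sketch below illustrates two of these measures, in-degree and PageRank, on a small made-up graph. It assumes the network is stored as a dict mapping each paper to the set of papers it cites, with every paper present as a key; the damping factor of 0.85 and the fixed iteration count are conventional defaults we chose, not values fixed by the proposal. Closeness and betweenness are omitted to keep the example short.

```python
def in_degree(graph):
    """Count how many times each node is cited."""
    counts = {node: 0 for node in graph}
    for targets in graph.values():
        for dst in targets:
            counts[dst] += 1
    return counts


def pagerank(graph, damping=0.85, iterations=100):
    """PageRank by power iteration; rank of dangling nodes is spread evenly."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing citations.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            targets = graph[src]
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Made-up example: A and B both cite C, and C cites D.
graph = {"A": {"C"}, "B": {"C"}, "C": {"D"}, "D": set()}
degrees = in_degree(graph)
ranks = pagerank(graph)
print(degrees)
print(ranks)
```

Each resulting score dictionary can then serve as an explanatory variable in the regressions.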

Evaluate Result (1 week)
We will analyze the results obtained from the previous step. We will explain why some measures performed better than others, and determine and describe factors accounting for error or bias.

Second half of the term
The second half will focus on theory. Since the time needed to obtain data is hard to predict, this phase could begin earlier or later than the midpoint of the course.

Design a Better Centrality Measure (2 weeks)
If we believe that the centrality measures can be improved dramatically, perhaps by weighting different measures or by using an entirely new concept, we will spend the next two weeks implementing a centrality measure better suited to measuring importance in a citation network. We will run regressions using the new centrality measure and see whether it correlates more strongly with salary.
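One simple way to weight different measures could look like the sketch below: rescale each measure to the range 0 to 1 and take a weighted sum. This is only one possible design, not a measure we have committed to; the names normalize and combine, the weights, and the scores are all hypothetical.

```python
def normalize(scores):
    """Rescale scores linearly so the minimum is 0 and the maximum is 1."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {node: 0.0 for node in scores}
    return {node: (value - lo) / (hi - lo) for node, value in scores.items()}


def combine(measures, weights):
    """Weighted sum of normalized measures, computed per node."""
    normalized = {name: normalize(scores) for name, scores in measures.items()}
    nodes = next(iter(measures.values())).keys()
    return {
        node: sum(weights[name] * normalized[name][node] for name in measures)
        for node in nodes
    }


# Hypothetical scores for three nodes under two measures.
measures = {
    "degree": {"A": 0, "B": 2, "C": 4},
    "pagerank": {"A": 0.2, "B": 0.5, "C": 0.3},
}
weights = {"degree": 0.5, "pagerank": 0.5}  # illustrative weights only

combined = combine(measures, weights)
print(combined)
```

In practice the weights could themselves be chosen by regression against salary, which would turn the combination into a multi-variable regression on the individual measures.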

Analysis of gameability of centrality measures (2 weeks)
If the new centrality measure is deemed “good enough”, that would indicate an important correlation between a professor’s centrality and salary. We will then explore the ways in which a professor could “game” the centrality measure to improve their salary.
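As one hypothetical example of gaming, the sketch below adds a few "padding" papers whose only purpose is to cite a target paper, then compares the target's PageRank before and after. The graph, the node names, and the helper add_padding_papers are invented for illustration, and this is just one conceivable strategy among those we would study.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """PageRank by power iteration; rank of dangling nodes is spread evenly."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            targets = graph[src]
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank


def add_padding_papers(graph, target, count):
    """Return a copy of the graph with `count` new papers that cite only `target`."""
    padded = {node: set(targets) for node, targets in graph.items()}
    for i in range(count):
        padded[f"pad{i}"] = {target}
    return padded


# Made-up network: paper T starts with no incoming citations.
base = {"A": {"B"}, "B": {"C"}, "C": set(), "T": {"C"}}

before = pagerank(base)["T"]
after = pagerank(add_padding_papers(base, "T", 3))["T"]
print(before, after)
```

Comparing how much each centrality measure moves under the same manipulation gives a rough way to rank the measures by how easy they are to game.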
