Saturday, April 07, 2012

Milestone 1: Simple Regression

Our first goal is to get the available data and run a simple single-variable regression on it. For this we will use the high energy physics citation network data from http://snap.stanford.edu/data/cit-HepPh.html, which covers citations from January 1993 to April 2003. For salary data we are crawling ucpay.globl.org, which gives us base salaries for 2004. For now we restrict ourselves to professors from UC campuses.
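As a rough sketch of how a basic citation measure could be pulled from the SNAP file, the snippet below counts incoming citations per paper. It assumes the file is a whitespace-separated edge list with `#` comment lines, where each row reads "citing paper, cited paper". The function name `count_citations` and the sample IDs are made up for illustration; the real file would be read line by line in the same way.

```python
from collections import Counter
from typing import Iterable


def count_citations(lines: Iterable[str]) -> Counter:
    """Count incoming citations per paper from edge-list lines.

    Assumption: each data row is "<citing id> <cited id>" and
    rows starting with '#' are header comments.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        parts = line.split()
        if len(parts) != 2:
            continue  # skip rows that do not look like an edge
        _citing, cited = parts
        counts[cited] += 1
    return counts


# Illustrative rows only; these IDs are not taken from the dataset.
sample = [
    "# FromNodeId\tToNodeId",
    "1003\t1001",
    "1003\t1002",
    "1004\t1001",
]
citations = count_citations(sample)
```

With the real data, `count_citations(open(path))` would do the same thing, where `path` points at the unpacked edge-list file.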

To do for the first milestone:
  • Make a table of 2004 salary data from ucpay.globl.org (Python script) DONE
  • Make a table of HEP-TH professors and their papers (by arXiv ID)
  • Parse the abstracts from http://www.cs.cornell.edu/projects/kddcup/datasets.html
  • Alter the ucpay salary table so names align with the HEP-TH professors table (reorder name parts, drop middle names, etc.)
  • Calculate citation measurements from SNAP data (Mathematica, R)
  • Use HEP-TH table to calculate rank for each professor
  • Join the resulting tables
  • Linear regression (Mathematica, R)
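
The name-alignment, join, and regression steps above could look roughly like the sketch below. It is written in plain Python to stay self-contained, although the plan is to do the regression in Mathematica or R. Everything in it is an assumption made for illustration: the salary names are taken to be in "LAST, FIRST MIDDLE" form, the names and numbers are invented, salary is treated as the response and citation count as the single predictor, and `normalize_name`, `join_tables`, and `fit_line` are hypothetical helpers.

```python
def normalize_name(raw: str) -> str:
    """Reduce 'LAST, FIRST MIDDLE' or 'First Middle Last' to 'first last'."""
    raw = raw.strip()
    if "," in raw:
        last, _, rest = raw.partition(",")
        given = rest.split()
        first = given[0] if given else ""
        last = last.strip()
    else:
        tokens = raw.split()
        first = tokens[0] if tokens else ""
        last = tokens[-1] if len(tokens) > 1 else ""
    return f"{first} {last}".strip().lower()


def join_tables(salaries: dict, citations: dict) -> list:
    """Inner join on the normalized name; returns (citations, salary) pairs."""
    by_name = {normalize_name(k): v for k, v in salaries.items()}
    pairs = []
    for name, cites in citations.items():
        key = normalize_name(name)
        if key in by_name:
            pairs.append((cites, by_name[key]))
    return pairs


def fit_line(xs: list, ys: list) -> tuple:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented example rows, not real salary or citation data.
salaries = {"DOE, JANE Q": 100000.0, "ROE, RICHARD": 120000.0,
            "POE, ALEX M": 140000.0, "NOE, SAM": 90000.0}
citations = {"Jane Doe": 10, "Richard A. Roe": 20, "Alex Poe": 30}

pairs = join_tables(salaries, citations)
slope, intercept = fit_line([c for c, _ in pairs], [s for _, s in pairs])
```

Matching on first and last name alone will merge distinct people who share a name, so the joined table would need a manual check before the regression results are trusted.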
