Monday, April 30, 2012

Crawling arXiv

We just finished crawling the abstracts and the citation network for ten years of theoretical high energy physics papers. arXiv lists up to 2000 articles for a given year, and we can retrieve 10000 abstracts per API request, so fetching the metadata was quite simple.
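
For anyone curious what the paging looks like, here is a rough sketch in Python. It assumes the public arXiv query endpoint with its `search_query`, `start` and `max_results` parameters; the `hep-th` category string, the `page_urls` helper and the totals passed to it are illustrative, not the exact calls we ran. It only builds the request URLs and does not touch the network.

```python
from urllib.parse import urlencode

# Assumption: the public arXiv API query endpoint.
ARXIV_API = "http://export.arxiv.org/api/query"


def page_urls(category, total, page_size):
    """Return one query URL per page needed to cover `total` results."""
    urls = []
    for start in range(0, total, page_size):
        params = {
            "search_query": f"cat:{category}",
            "start": start,
            # The last page may be shorter than page_size.
            "max_results": min(page_size, total - start),
        }
        urls.append(f"{ARXIV_API}?{urlencode(params)}")
    return urls


if __name__ == "__main__":
    for url in page_urls("hep-th", 2000, 10000):
        print(url)
```

With a page size at least as large as the number of articles, the whole listing fits in a single request, which is why this step was cheap.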

The bottleneck was retrieving the citation data for each paper. We got it from http://inspirehep.net/, a high energy physics (HEP) literature database. It seems to perform multiple database lookups to match the corresponding paper IDs, and each request took anywhere from 1 to 10 seconds. We initially ran twenty threads, but dropping to ten threads actually improved performance.
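
The thread count is easy to make tunable. Below is a minimal sketch, not our actual crawler: `fetch_all` and the `fetch_citations` callable are hypothetical names, and the function that really talks to the citation database is left to the caller. The pool size is a plain argument, so twenty versus ten workers can be compared by changing one number.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(paper_ids, fetch_citations, max_workers=10):
    """Call fetch_citations(paper_id) for every id using a bounded thread pool.

    fetch_citations is a caller-supplied (hypothetical) function that performs
    the slow lookup. Returns a dict mapping each paper id to its result, or to
    the exception raised, so one failed request does not abort the crawl.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_citations, pid): pid for pid in paper_ids}
        for future in as_completed(futures):
            pid = futures[future]
            try:
                results[pid] = future.result()
            except Exception as exc:
                # Keep the error so the id can be retried later.
                results[pid] = exc
    return results
```

Recording failures rather than raising keeps a long crawl going when individual requests time out; the failed IDs can be collected from the result and retried in a second pass.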

About 30% of the papers were cross-listed from fields other than theoretical high energy physics, and more than half of the papers were tagged with multiple fields. "General Relativity and Quantum Cosmology" (GR-QC) was the most common overlap. It might be in our interest to crawl GR-QC and retrieve its citation data as well, but doing so would bias the data toward HEP-theory authors who publish papers related to that specific field.
