We just finished crawling the abstracts and the citation network for 10 years of theoretical high energy physics papers. arXiv lists up to 2000 articles for a given year, and a single API request can return up to 10000 abstracts, so fetching the metadata was quite simple.
The bottleneck was retrieving the citation data for each paper. We got it from http://inspirehep.net/, a high energy physics (HEP) literature database. It seems to perform multiple database lookups to match corresponding paper IDs, so each request takes anywhere from 1 to 10 seconds. We initially ran twenty threads, but dropping to ten threads actually improved the performance.
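A bounded thread pool along these lines makes the thread count easy to tune. This is only a minimal sketch, not our actual crawler: `fetch_all` is a hypothetical helper, and `fetch_citations` stands in for whatever function performs the real lookup for one paper ID (it is passed in as a parameter, so no particular API is assumed).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(paper_ids, fetch_citations, max_workers=10):
    """Call fetch_citations(paper_id) for every ID using a bounded thread pool.

    fetch_citations is a caller-supplied (hypothetical) function that does the
    slow lookup for a single paper. Returns (results, failures), both keyed by
    paper ID, so failed lookups can be retried later.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the paper ID it was submitted for.
        futures = {pool.submit(fetch_citations, pid): pid for pid in paper_ids}
        for future in as_completed(futures):
            pid = futures[future]
            try:
                results[pid] = future.result()
            except Exception as exc:
                # Record the error instead of aborting the whole crawl.
                failures[pid] = exc
    return results, failures
```

Because `max_workers` is a single argument, comparing ten threads against twenty is a one-line change, and the `failures` dictionary gives a list of papers to retry.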
About 30% of the papers were cross-listed from fields other than theoretical high energy physics, and more than half of the papers are tagged with multiple fields. "General Relativity and Quantum Cosmology" (GR-QC) was the most common overlap. It might be in our interest to crawl GR-QC as well and retrieve its citation data, but that would introduce a bias in favor of HEP-theory authors who also publish papers related to that specific field.
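Numbers like these can be computed from the category tags with a short script. The sketch below assumes each paper is a dictionary with a `primary` category and a `categories` list; that record layout and the `category_stats` helper are illustrative assumptions, not the format of our crawl. `hep-th` and `gr-qc` are the arXiv codes for the two fields.

```python
from collections import Counter


def category_stats(papers, target="hep-th"):
    """Summarize how papers in the `target` listing overlap with other fields.

    Each paper is assumed (hypothetically) to look like
    {"primary": "gr-qc", "categories": ["gr-qc", "hep-th"]}.
    """
    total = len(papers)
    if total == 0:
        raise ValueError("no papers to summarize")
    # Cross-listed: the paper's primary field is something other than target.
    cross_listed = sum(1 for p in papers if p["primary"] != target)
    # Multi-field: the paper carries more than one distinct category tag.
    multi_field = sum(1 for p in papers if len(set(p["categories"])) > 1)
    # Count how often each non-target category appears alongside target.
    overlap = Counter(
        c for p in papers for c in set(p["categories"]) if c != target
    )
    return {
        "cross_listed_fraction": cross_listed / total,
        "multi_field_fraction": multi_field / total,
        "overlap_counts": overlap,
    }
```

Calling `stats["overlap_counts"].most_common(1)` on the result gives the field that overlaps most often.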