arxiv mining

Some time ago I launched a little project mining data from arxiv; you can read about it in other blog posts. Specifically, I modeled about 500k figures as Gaussian mixture models in order to create features, so that figures might ultimately be represented as graphs for comparison. More ordinary methods might suffice too, e.g., cosine similarity. The goal was (and still is) to build a new search engine for science, analogous to content-based image retrieval: find similar articles on the basis of similar figures, and thus similar data.
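To make the idea concrete, here is a minimal sketch of the figure-to-feature step, not the pipeline I actually used: scikit-learn's `GaussianMixture` stands in for the fitting code, and the feature layout (component means sorted by weight, then the weights) is an illustrative choice of mine. Two figures' point clouds become fixed-length vectors that can be compared with cosine similarity.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def figure_features(points, n_components=3, seed=0):
    """Fit a GMM to a figure's (x, y) point cloud and flatten the
    component means and weights into a fixed-length feature vector."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed)
    gmm.fit(points)
    # sort components by weight so vectors are comparable across figures
    order = np.argsort(gmm.weights_)[::-1]
    return np.concatenate([gmm.means_[order].ravel(), gmm.weights_[order]])

def cosine_similarity(a, b):
    """Plain cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# two synthetic "figures": slightly shifted Gaussian point clouds
rng = np.random.default_rng(0)
fig_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
fig_b = rng.normal(loc=[0.1, 0.1], scale=0.5, size=(200, 2))

feats_a = figure_features(fig_a)
feats_b = figure_features(fig_b)
sim = cosine_similarity(feats_a, feats_b)
```

With three components in two dimensions each vector has nine entries (six mean coordinates plus three weights); similar figures should score near 1.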

In the last 12 months I came to grips with much more of AWS, and the hype over AI reached a fairly raucous crescendo. So the search engine has taken a back seat to polishing off the p2t API, which offers many traditional algorithms associated with AI and machine learning. We've quietly released the API for testing; it is currently available here: https://plot2txt-staging.us-east-1.elasticbeanstalk.com

As a test of our API, I returned to mining the arxiv repos. It might surprise you to learn that most if not all arxiv articles are available for bulk download here: https://arxiv.org/help/bulk_data_s3. I took the initial tar volume, about 1700 articles, extracted the pages, and used the bbox and simple-scatter-plot methods to model about 2500 figures. In practice this means that in less than a day, I was able to infer about 2500 datasets from a broad cross section of articles in physics.
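The bulk-data page linked above describes a requester-pays S3 bucket of source tarballs. A minimal sketch of pulling one volume and listing the articles inside it might look like the following, assuming boto3 and configured AWS credentials; `fetch_tar` and `list_articles` are my own helper names, and you pay the transfer costs.

```python
import tarfile

def fetch_tar(key, dest):
    """Download one source volume from arxiv's requester-pays S3 bucket.
    Requires boto3 and AWS credentials; the RequestPayer flag is mandatory."""
    import boto3  # imported here so the rest of the sketch works offline
    s3 = boto3.client("s3")
    s3.download_file("arxiv", key, dest,
                     ExtraArgs={"RequestPayer": "requester"})

def list_articles(tar_path):
    """Each file inside a source volume is one article's (usually gzipped)
    source bundle."""
    with tarfile.open(tar_path) as tar:
        return [m.name for m in tar.getmembers() if m.isfile()]
```

From there, each member can be extracted and handed to the figure-modeling step.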

While the initial results were really encouraging (I've dropped the results on GitHub for now: https://github.com/wjb19/arxiv_mining), in further iterations of the API I will add a little more discriminating power, e.g., segregating data pixels from non-data pixels. I also need to invest a little more time in handling non-linear and compound scales. Matias in particular is working with Caffe to address some of these issues.
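One crude way to start segregating data from non-data pixels, entirely my own illustration rather than the API's method, is to treat dark pixels as candidate marks and flag long horizontal or vertical runs, which in a typical plot are axes or gridlines rather than data:

```python
import numpy as np

def _long_runs(row, min_len):
    """Return a boolean array marking runs of True at least min_len long."""
    out = np.zeros_like(row)
    start = None
    for i, v in enumerate(row):
        if v and start is None:
            start = i
        elif not v and start is not None:
            if i - start >= min_len:
                out[start:i] = True
            start = None
    if start is not None and len(row) - start >= min_len:
        out[start:] = True
    return out

def data_pixel_mask(img, dark_thresh=0.5, min_len=20):
    """Heuristic mask of likely data pixels in a grayscale figure (0=black,
    1=white): dark pixels minus long horizontal/vertical runs (axes, gridlines)."""
    dark = img < dark_thresh
    lines = np.zeros_like(dark)
    for r in range(dark.shape[0]):
        lines[r] |= _long_runs(dark[r], min_len)
    for c in range(dark.shape[1]):
        lines[:, c] |= _long_runs(dark[:, c], min_len)
    return dark & ~lines

# synthetic figure: white page, a vertical axis line, two data points
img = np.ones((50, 50))
img[:, 5] = 0.0    # axis line
img[10, 20] = 0.0  # data point
img[30, 40] = 0.0  # data point
mask = data_pixel_mask(img)
```

Real figures would need more care (anti-aliasing, dashed gridlines, text), but even this kind of mask removes the axes while keeping the two plotted points above.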