After a very busy couple of months, including a move to Silicon Valley (!), I’m pleased to say that plot2txt now offers a few more API methods, intended to help with search as much as data mining.
The new methods allow the user to automatically create a searchable collection of figures and tables, identified and extracted from papers uploaded to a gateway endpoint.
The gateway separates an input PDF into pages, which are then analyzed for figure and table content using AWS Lambda functions. Detected text and reverse-engineered figure data are then stored in database tables, ready for search. Database entries also point to images of the discovered figures and tables, which are stored in the cloud.
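To make the shape of that workflow concrete, here is a rough sketch in Python. It is not the plot2txt implementation: the `analyze_page` function is a toy keyword detector standing in for the per-page Lambda function, the `items` table layout is invented for illustration, and the `s3://example-bucket/...` paths are placeholders. Pages are passed in as plain strings, so the PDF splitting step is assumed to have happened already.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Detection:
    kind: str        # "figure" or "table"
    text: str        # detected text, e.g. a caption
    image_url: str   # pointer to the cropped image in cloud storage (placeholder)


def analyze_page(doc_id, page_no, page_text):
    """Hypothetical stand-in for the per-page analysis step.

    A real detector would work on the rendered page; this one only
    looks for lines that start with "Figure" or "Table".
    """
    detections = []
    for line in page_text.splitlines():
        stripped = line.strip()
        lowered = stripped.lower()
        if lowered.startswith("figure"):
            kind = "figure"
        elif lowered.startswith("table"):
            kind = "table"
        else:
            continue
        index = len(detections)
        url = f"s3://example-bucket/{doc_id}/p{page_no}-{kind}-{index}.png"
        detections.append(Detection(kind, stripped, url))
    return detections


def ingest(conn, doc_id, pages):
    """Analyze each page and store one row per detected figure or table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "doc_id TEXT, page INTEGER, kind TEXT, text TEXT, image_url TEXT)"
    )
    count = 0
    for page_no, page_text in enumerate(pages, start=1):
        for d in analyze_page(doc_id, page_no, page_text):
            conn.execute(
                "INSERT INTO items VALUES (?, ?, ?, ?, ?)",
                (doc_id, page_no, d.kind, d.text, d.image_url),
            )
            count += 1
    conn.commit()
    return count
```

Calling `ingest(sqlite3.connect(":memory:"), "doc1", pages)` with a list of page strings returns the number of rows written. Each row keeps the page number and an image pointer alongside the text, so a search hit can lead back to the original figure or table.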
I ran the overall workflow over about 50k documents from arXiv and made a small test API method for searching the extracted content. You can search the output of the workflow here, selecting the figure or table search option. At least in Chrome and Firefox, the text output is available for download.
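For a sense of what a figure-or-table search over stored text could look like, here is a minimal sketch using SQLite from the Python standard library. It is an illustration under assumptions, not the test API itself: the `search` function, the `items` table, and its columns are all hypothetical, and a plain substring match stands in for whatever matching the real service uses.

```python
import sqlite3


def search(conn, query, kind):
    """Return rows of the given kind whose text contains the query.

    Assumes a hypothetical table:
    items(doc_id TEXT, page INTEGER, kind TEXT, text TEXT, image_url TEXT)
    """
    if kind not in ("figure", "table"):
        raise ValueError("kind must be 'figure' or 'table'")
    # Escape LIKE wildcards so the query is matched literally.
    escaped = (
        query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    cur = conn.execute(
        "SELECT doc_id, page, text, image_url FROM items "
        "WHERE kind = ? AND text LIKE ? ESCAPE '\\' "
        "ORDER BY doc_id, page",
        (kind, f"%{escaped}%"),
    )
    return cur.fetchall()
```

The `kind` argument mirrors the figure or table option on the search page. SQLite's `LIKE` ignores case for ASCII letters by default, so a query such as `energy` would also match `Energy` in a caption.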