Figure Search Engine II

For the time being, the entire figure search engine pipeline and its API methods are freely available for testing. Once you sign up here, you’ll receive an API key that provides limited access to all the relevant methods; the key appears in the upload tab of the dashboard.

With the key, you can then upload PDFs for processing by the backend lambda functions, which use machine learning. The particular method below identifies figures in PDFs, detects text, recovers data values from pixel positions, and populates a database, making figures searchable and data potentially reusable. For example, you could test with the PDF available for this arXiv article. To upload:

curl --request POST -H "Authorization: Token my_token" -H "Accept: application/pdf" -H "Content-Type: application/pdf" -H "Filename: test.pdf" --data-binary "@0001311.pdf" https://b8jd85qv3b.execute-api.us-east-1.amazonaws.com/test/fig-search-gateway

where 0001311.pdf is the name of the file to upload, and the Filename header field is the preferred name for storage in the database. After uploading one or more PDFs, you can then search on the x and y label names, as well as the filename field, and limit the search based on x and y data ranges. For example, referring again to the previous document:
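For readers who would rather script the upload than call curl, here is a minimal Python sketch of the same POST request using only the standard library. The endpoint and headers are copied from the curl command above; the function names build_upload_request and upload_pdf are hypothetical, and the sketch assumes the service expects the raw PDF bytes as the request body (which is what --data-binary sends).

```python
import urllib.request

# Endpoint taken from the curl upload command above.
UPLOAD_URL = "https://b8jd85qv3b.execute-api.us-east-1.amazonaws.com/test/fig-search-gateway"


def build_upload_request(token, pdf_bytes, stored_name):
    """Build (but do not send) the POST request that uploads one PDF.

    stored_name goes into the Filename header, i.e. the preferred name
    for storage in the database.
    """
    headers = {
        "Authorization": f"Token {token}",
        "Accept": "application/pdf",
        "Content-Type": "application/pdf",
        "Filename": stored_name,
    }
    return urllib.request.Request(
        UPLOAD_URL, data=pdf_bytes, headers=headers, method="POST"
    )


def upload_pdf(path, token, stored_name):
    """Read a local PDF, send it, and return the HTTP status code."""
    with open(path, "rb") as fh:
        request = build_upload_request(token, fh.read(), stored_name)
    with urllib.request.urlopen(request) as response:
        return response.status
```

Calling upload_pdf("0001311.pdf", "my_token", "test.pdf") would mirror the curl example; what the service returns on success is not documented here, so the sketch only reports the status code.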


curl --request GET -H "Authorization: Token my_token" -H "Accept: application/json" -H "Content-Type: application/json" "https://b8jd85qv3b.execute-api.us-east-1.amazonaws.com/test/figure-search?label=erg"

Here we’ve asked for all figures with x and y labels that contain the substring ‘erg’, and in this case we receive the following (extract):


{"x":6565.761188,"y":3.12652},{"x":6575.148423,"y":4.553576},{"x":6616.079185,"y":4.651428},{"x":6353.417084,"y":0.639236}],
"qc_input":"https://s3.amazonaws.com/textextract1/914797fa933347351535b2c7e3ff3885c5da1457044a09a81067799b8857366f/test-000004_qc_in_0.png",
"ymax":10,"time":1547417624586,"xmax":6800,"ymin":-2,"xmin":6300,
"qc_output":"https://s3.amazonaws.com/textextract1/914797fa933347351535b2c7e3ff3885c5da1457044a09a81067799b8857366f/test-000004_qc_out_0.png","page":"000004"}]

Again, this is just an extract, but note that both figures and data are returned in the response, e.g., the qc_input URL links to an image of the detected input figure found on page 4 of the input PDF.
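To reuse the returned data, the response can be parsed with a few lines of Python. The sketch below assumes the response body is a JSON array of figure records, as the closing `}]` of the extract suggests. The key that holds the list of points is cut off in the extract, so the code looks for the first list of {"x", "y"} objects instead of guessing its name; the "data" key in the sample and the function names are hypothetical.

```python
import json


def extract_points(record):
    """Return the first list of {"x", "y"} dicts in a record as (x, y) tuples.

    The key name is not visible in the extract, so every value is inspected.
    """
    for value in record.values():
        if (isinstance(value, list) and value
                and all(isinstance(p, dict) and {"x", "y"} <= p.keys()
                        for p in value)):
            return [(p["x"], p["y"]) for p in value]
    return []


def summarize(record):
    """Collect the fields shown in the extract into a compact summary."""
    points = extract_points(record)
    return {
        "page": int(record["page"]),          # "000004" -> 4
        "n_points": len(points),
        "x_range": (record["xmin"], record["xmax"]),
        "y_range": (record["ymin"], record["ymax"]),
        "qc_input": record.get("qc_input"),   # link to the figure image, if any
    }


# Shortened sample built from the extract; "data" is a made-up key name.
sample = json.loads(
    '[{"data": [{"x": 6565.761188, "y": 3.12652},'
    ' {"x": 6353.417084, "y": 0.639236}],'
    ' "ymax": 10, "xmax": 6800, "ymin": -2, "xmin": 6300, "page": "000004"}]'
)

for record in sample:
    print(summarize(record))
```

Scanning the values rather than hard-coding a key keeps the sketch usable even if the real key differs from the placeholder.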

You can also narrow your search using a query string, e.g., here we limit results to those where the minimum x value exceeds 6100 and the filename contains the phrase ‘test’:


curl --request GET -H "Authorization: Token my_token" -H "Accept: application/json" -H "Content-Type: application/json" "https://b8jd85qv3b.execute-api.us-east-1.amazonaws.com/test/figure-search?label=erg&xmin=6100&source=test"
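When several filters are combined, it can be easier to let a library assemble and encode the query string than to escape it by hand. The following Python sketch builds the same search URL; label, xmin and source are the only parameters confirmed by the examples above, so any other filter name passed in is an assumption about the API, and build_search_url is a hypothetical helper.

```python
from urllib.parse import urlencode

# Endpoint taken from the curl search commands above.
SEARCH_URL = "https://b8jd85qv3b.execute-api.us-east-1.amazonaws.com/test/figure-search"


def build_search_url(label, **filters):
    """Return the search URL for a label substring plus optional filters.

    Filters set to None are skipped; values are URL-encoded by urlencode.
    """
    params = {"label": label}
    params.update({k: v for k, v in filters.items() if v is not None})
    return f"{SEARCH_URL}?{urlencode(params)}"


# Same query as the curl example: label contains 'erg', xmin above 6100,
# filename contains 'test'.
print(build_search_url("erg", xmin=6100, source="test"))
```

The resulting URL can be passed to curl in quotes, or fetched with urllib.request using the same Authorization header as before.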

Of course, if you’d prefer, you can always freely browse the 100k+ figure datasets extracted from arXiv without signing up for an API key, e.g., using the p2t dashboard or else using curl; some examples appear in previous blog posts.