Mining text from document pages

We’re edging closer to officially releasing our API methods, including a core OCR method (text-lines) that extracts text even in the presence of extraneous objects such as embedded images. Image upload and download times, combined with the computation cost at the backend, amount to several seconds, which isn’t too bad. Using curl to POST data, along with a couple of standard Linux tools, makes for a simple text extraction flow. In the bash script below I loop over the PDFs in a document directory (command-line argument 1), break each PDF into PNGs, and send the first page (the title page) to the AWS Lambda backend for processing:


#!/bin/bash
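# Loop over every PDF in the directory given as argument 1.
# A quoted glob is safer than parsing ls (paths with spaces survive).
for x in "$1"/*.pdf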
do
base=${x%.pdf}
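# Render each page of the PDF to a PNG whose name starts with $base
pdftopng "$x" "$base"

# The glob expands in sorted order, so the first match is page 1
for y in "$base"*.png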
do
curl --request POST -H "Accept: image/png" \
-H "Content-Type: image/png" --data-binary "@$y" \
https://xxxxx.execute-api.us-east-1.amazonaws.com/test/text-lines > foo.tar.gz

tar -xvf foo.tar.gz
mv tmp/output.json "$y.txt"
# Only the title page is needed, so stop after the first PNG
break
done
# Clean up the page images; the glob must sit outside the quotes to expand
rm -f "$1"/*.png
done
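
The request step in the loop above can also be wrapped in a function that notices failures. What follows is only a sketch under a few assumptions: the OCR_URL variable and the ocr_page name are made up for illustration, and the response is assumed to be the same tarball containing tmp/output.json that the script unpacks. It relies on curl's --fail option so that an HTTP error does not leave an error page behind in place of the archive.

```shell
#!/bin/bash
# OCR_URL is a hypothetical variable; the default is the placeholder endpoint
# from the script above and must be replaced with a real URL.
OCR_URL="${OCR_URL:-https://xxxxx.execute-api.us-east-1.amazonaws.com/test/text-lines}"

# ocr_page PNG: POST one page image and save the extracted JSON as PNG.txt.
# Returns non-zero if the request, the unpacking, or the move fails.
ocr_page() {
    local png="$1" work
    # Unpack in a private scratch directory instead of the current one
    work=$(mktemp -d) || return 1
    # --fail: exit non-zero on HTTP errors; --silent: no progress meter
    if curl --fail --silent --request POST \
            -H "Accept: image/png" -H "Content-Type: image/png" \
            --data-binary "@$png" "$OCR_URL" -o "$work/response.tar.gz" &&
       tar -xf "$work/response.tar.gz" -C "$work" &&
       mv "$work/tmp/output.json" "$png.txt"
    then
        rm -rf "$work"
        return 0
    fi
    rm -rf "$work"
    return 1
}
```

Inside the loop, something like ocr_page "$y" || echo "OCR failed for $y" >&2 would then replace the curl, tar and mv lines.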

In this way you could, for instance, grab author names, institutions, emails, etc. The spatial information provided in the response makes it easier to cluster related text fields together, for example the title field with the authors, using Euclidean distance.
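
As a rough illustration of the distance idea, the sketch below ranks text fields by their Euclidean distance from a chosen point, such as the position of the title. The layout of output.json is not described here, so the input is assumed to have already been flattened into a hypothetical tab-separated file with one field per line (x, y, text). The nearest_fields name and its arguments are likewise made up for the example.

```shell
#!/bin/bash
# nearest_fields FILE X Y [N]: print the N fields closest to the point (X, Y),
# one per line as "distance<TAB>text", nearest first (N defaults to 3).
# FILE is a hypothetical tab-separated listing: x <TAB> y <TAB> text.
nearest_fields() {
    local file="$1" x0="$2" y0="$3" n="${4:-3}"
    awk -F '\t' -v x0="$x0" -v y0="$y0" '
        {
            dx = $1 - x0
            dy = $2 - y0
            # Euclidean distance from the reference point
            printf "%.1f\t%s\n", sqrt(dx * dx + dy * dy), $3
        }
    ' "$file" | sort -n | head -n "$n"
}
```

Calling nearest_fields fields.tsv 100 50 with the title's coordinates lists the title itself first (distance 0.0), followed by its closest neighbours, which on a typical title page would be the author and institution lines.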