Bulk-create searchable PDFs from paper documents on Linux

For some documents you have to retain the original in dead tree storage format. For most documents which arrive in the mail, however, a digital copy is just fine and there really isn't any need to retain the paper version, especially if your computer can store millions of them in the space needed for one paper binder.

To archive such documents you could buy a scanner, maybe even an office appliance with an automatic document feeder (ADF) that spits out searchable PDFs, and then let the device collect dust for a few months until the next batch needs processing. However, you can achieve nearly the same result with your phone and a Linux desktop.

Step 1: Capture

First you need to digitize your documents. Any scanner will do, but a phone can be the perfect device for quickly capturing dozens of documents, often much faster than a flatbed scanner optimized for photos.

I personally use the Scanbox for this, but any contraption that holds your phone or digital camera steady, such as a tripod mounted at a similar distance from the document, should do the trick. Speed is the important factor in my solution, not accurately capturing 6pt legalese in the footer.

Step 2: Format

After capturing, you might first need to rotate your photos to get the pages into portrait orientation. Watch out, though: the generic image preview in GNOME may rotate images on the fly based on their EXIF data without telling you, so a photo can look upright while its pixel data is still sideways. If you were to OCR such files, you would not get any text out of them. You can check whether your files still need to be rotated by opening them in GIMP, which will ask whether you want to rotate them. You can bulk-rotate according to the orientation recorded by the camera with:

mogrify -auto-orient *.jpg

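If the camera's orientation did not match the document's orientation, you'll have to rotate by hand. The following command rotates every image 90 degrees clockwise. Note that mogrify overwrites the files in place, so work on copies if you want to keep the untouched originals: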

mogrify -rotate 90 *.jpg
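
As an alternative to opening each file in GIMP, the orientation tag can also be read on the command line. The following is only a minimal sketch, assuming that ImageMagick's identify is installed and that your camera writes the standard EXIF Orientation tag; the helper name needs_rotation is made up for this example. A value of 1, or no tag at all, means there is no pending rotation hint; any other value means a viewer may be rotating the image for you.

```shell
#!/bin/bash
# Sketch: list JPEGs whose EXIF orientation tag still asks for a rotation.

# Hypothetical helper: succeeds if an EXIF orientation value means
# "rotate or flip for display". 1 means the pixels are already upright,
# an empty value means the file carries no orientation tag.
needs_rotation() {
    case "$1" in
        ""|1) return 1 ;;
        *)    return 0 ;;
    esac
}

for f in *.jpg
do
    # Skip the unexpanded pattern if there are no JPEGs in this directory
    [ -e "$f" ] || continue

    # identify ships with ImageMagick; warnings for files without EXIF are discarded
    orientation=$(identify -format '%[EXIF:Orientation]' "$f" 2>/dev/null)
    if needs_rotation "$orientation"
    then
        echo "$f: orientation tag $orientation, not yet rotated"
    fi
done
```

If this prints nothing after running mogrify -auto-orient, the files should no longer depend on a viewer rotating them. It cannot tell you whether a page is actually upright, only whether a rotation hint is still pending.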

Step 3: Process

Now you are ready to convert your images to PDFs with text in them. Basically, all you need to do is call tesseract and hocr2pdf; the rest of the script below is only concerned with naming things and cleaning up. The packages tesseract-ocr-eng, imagemagick and exactimage should be all that's needed on Debian-based systems; it worked flawlessly for me on Ubuntu 13.04. Essentially, it's a cruder version of Konrad Voelkel's solution.
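
Before running the conversion, it may be worth confirming that the tools are actually installed. The snippet below is a separate, optional pre-flight check and not part of the conversion script itself; it is a sketch, and the function name missing_tools is made up for this example. It assumes hocr2pdf comes from the exactimage package and mogrify from imagemagick.

```shell
#!/bin/bash
# Sketch: report which of the required command-line tools are not on the PATH.

# Hypothetical helper: prints the name of every argument that is not
# an available command.
missing_tools() {
    for tool in "$@"
    do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# tesseract and hocr2pdf do the conversion; mogrify is used for rotating.
missing=$(missing_tools tesseract hocr2pdf mogrify)
if [ -n "$missing" ]
then
    # $missing is deliberately unquoted so the names end up on one line
    echo "Not installed or not on the PATH:" $missing
fi
```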

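#!/bin/bash

for f in *.jpg
do
    # Skip the unexpanded pattern if there are no JPEGs in this directory
    [ -e "$f" ] || continue

    # Strip the directory and the extension: "scan 01.jpg" -> "scan 01"
    localname=$(basename "$f")
    filename="${localname%.*}"

    # OCR the image; tesseract writes the hOCR markup to $filename.html
    # (newer tesseract releases name this file $filename.hocr instead)
    tesseract "$f" "$filename" -l eng hocr
    [ -f "$filename.hocr" ] && mv "$filename.hocr" "$filename.html"

    # Combine the image with the recognized text into a searchable PDF
    hocr2pdf -i "$f" -s -o "$filename.pdf" < "$filename.html"

    rm "$filename.html"
    # I wouldn't do this, but you could...
    # rm "$f"
done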
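
The naming part of the script relies on two standard shell features: basename drops any leading directory, and the parameter expansion ${name%.*} removes the shortest trailing match of ".*", that is, the last extension. The sketch below isolates that step so you can try it on your own file names; the function name stem is made up for this example.

```shell
#!/bin/bash
# Sketch: how the output name is derived from an input file name.

# Hypothetical helper: prints the file name without its directory
# and without its last extension.
stem() {
    localname=$(basename "$1")
    echo "${localname%.*}"
}

stem "letter.jpg"             # prints: letter
stem "scans/tax 2013.jpg"     # prints: tax 2013
stem "invoice.page1.jpg"      # prints: invoice.page1
```

Only the last extension is removed, so dots elsewhere in the name survive. Names containing spaces only work if the variable is quoted everywhere it is used.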

Et voilà, you have a searchable PDF which you can locate with the desktop search of your choice, for example Recoll.