Last updated on 2014-07-11 at 12:40 EDT

Today, I had to convert a scanned 3-page PDF file back into an editable document. So, open source software to the rescue. I was able to complete the task with the help of:

tesseract — for OCR, and
imagemagick — for converting PDF pages to an image format that tesseract accepts.

Installing the software

sudo apt-get -y install tesseract-ocr imagemagick

Convert PDF pages to image

```
convert -density 300 -depth 8 scan.pdf[0] scan0.png
convert -density 300 -depth 8 scan.pdf[1] scan1.png
convert -density 300 -depth 8 scan.pdf[2] scan2.png
```
convert is a member of the ImageMagick tools. You can use it to convert between image formats as well as resize an image, blur, crop, despeckle, dither, draw on, flip, join, re-sample, and much more.

Here, I’m only using two options:

-density width
to set the resolution of an image for rendering to devices. The default unit of measure is dots per inch. The default resolution is 72 dpi, which is too coarse for reliable OCR, so I render the pages at 300 dpi.

-depth value
to set the number of bits in a color sample within a pixel.
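
To get a feel for what the density setting does to image size, here is a small back-of-the-envelope sketch. It assumes the scanned pages are US Letter (8.5 x 11 inches), which is my guess and not something the PDF tells you; page_pixels is just a hypothetical helper name.

```shell
# Pixel size of a page rasterized at a given density (dots per inch).
# Assumption: US Letter page, 8.5 x 11 in. Sizes are kept in tenths of
# an inch because shell arithmetic is integer-only.
page_pixels() {
    dpi=$1
    echo "$((85 * dpi / 10))x$((110 * dpi / 10))"
}

page_pixels 72     # the default density: prints 612x792
page_pixels 300    # the density used above: prints 2550x3300
```

So going from 72 to 300 dpi gives the OCR engine roughly four times as many pixels along each edge of a character.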

The numbers between the brackets mark the page in the PDF document to be converted. Of course, as any programmer can tell you, you start counting at zero.
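
For a longer document, typing one convert command per page gets old quickly. The same commands can be wrapped in a loop; this is only a sketch, and pdf_pages_to_png is a hypothetical helper name of mine, not something ImageMagick provides. It also assumes you already know the page count.

```shell
# Render pages 0..(count-1) of a PDF to PNG files, one convert call per page.
# Usage: pdf_pages_to_png scan.pdf 3
pdf_pages_to_png() {
    pdf=$1               # input file, e.g. scan.pdf
    count=$2             # number of pages to render
    base=${pdf%.pdf}     # scan.pdf -> scan
    i=0
    while [ "$i" -lt "$count" ]; do
        # "[N]" selects page N, counting from zero
        convert -density 300 -depth 8 "${pdf}[$i]" "${base}${i}.png" || return 1
        i=$((i + 1))
    done
}
```

Calling pdf_pages_to_png scan.pdf 3 runs the same three commands as above and writes scan0.png, scan1.png, and scan2.png.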

OCR page images to text

$ tesseract scan0.png scan0
Tesseract Open Source OCR Engine v3.02.01 with Leptonica
$ tesseract scan1.png scan1
Tesseract Open Source OCR Engine v3.02.01 with Leptonica
$ tesseract scan2.png scan2
Tesseract Open Source OCR Engine v3.02.01 with Leptonica

The second argument is the output base name. tesseract appends .txt to it, so the results land in scan0.txt, scan1.txt, and scan2.txt. (Passing scan0.txt as the base name would give you scan0.txt.txt.)

And then just copy the OCR text from the text files into a new document to clean up any typos and reformat the document.
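
If you would rather have all the text in one file before you start cleaning up, the OCR step can be looped as well. Again a sketch only: ocr_images is a hypothetical helper name, and it assumes the page images end in .png and that you pass them in page order.

```shell
# OCR each image and append the recognized text to one combined file.
# Usage: ocr_images combined.txt scan0.png scan1.png ...
ocr_images() {
    out=$1
    shift
    : > "$out"                   # start with an empty output file
    for img in "$@"; do
        base=${img%.png}         # scan0.png -> scan0
        # tesseract adds the ".txt" extension to the output base name itself
        tesseract "$img" "$base" || return 1
        cat "$base.txt" >> "$out"
    done
}
```

For the three pages here, ocr_images scan.txt scan0.png scan1.png scan2.png leaves the per-page text files in place and collects everything in scan.txt.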

Category: Utility

OCR Text from PDF Document
