Simple use of tesseract OCR on a multipage PDF
These notes use the command line to OCR a PDF file; I did this in Cygwin. First, convert the pages of the PDF to PPM files, which tesseract can read. I chose 300 dpi.
pdftoppm -r 300 pdf-filename.pdf page
The PDF is ‘pdf-filename.pdf’ and the PPM files will have names of the form ‘page-??.ppm’ since the conversion will add ‘-??.ppm’ to the given stem, where ?? is the page number.
Then, run tesseract:
for f in *.ppm ; do tesseract "$f" "$f" ; done
So this loops over all files with ppm as the extension and runs tesseract, and just gives the file name itself as the stem of the output. That means we’ll end up with a bunch of files with names like ‘page-03.ppm.txt’. I could have used basename to chop off the ppm, but there’s just no need.
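For anyone who does want the shorter names, the basename variant mentioned above might look something like this. It is only a sketch: the function name ocr_ppm_pages is made up for illustration, and it assumes tesseract is on the PATH.

```shell
# Hypothetical helper (name not from the original notes): OCR every .ppm
# file in the current directory, writing page-03.txt rather than
# page-03.ppm.txt.
ocr_ppm_pages() {
    for f in *.ppm; do
        # If nothing matches, the pattern is left as literal text; skip it.
        [ -e "$f" ] || continue
        stem=$(basename "$f" .ppm)   # page-03.ppm -> page-03
        tesseract "$f" "$stem"       # tesseract appends .txt to the stem
    done
}
```

With these names the text files no longer carry the .ppm marker, so any later pattern that relies on ‘.ppm.txt’ would have to change to something like ‘page-*.txt’.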
Next, combine the txt files.
cat *.ppm.txt > pdf-filename.txt
This shows another reason for keeping the .ppm in the file name — if I have other .txt files in the subdirectory, they will not get caught up in the cat.
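The page order in the combined file comes from the way the shell sorts the matched names. A small sketch of the same step as a function, with a slightly narrower pattern; the name combine_pages is hypothetical, and it assumes the page numbers are zero-padded to the same width, as in the names above.

```shell
# Hypothetical helper: join the per-page text files in page order.
# The shell sorts the names a pattern matches, so page-02 comes before
# page-10 as long as the numbers are padded to the same width.
combine_pages() {
    out=$1
    cat page-*.ppm.txt > "$out"
}
```

Called as ‘combine_pages pdf-filename.txt’, this does the same job as the cat line above but ignores any .ppm.txt files that do not start with ‘page-’.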
If you are a very thorough person, you might call your final text file something like ‘pdf-filename-tesseractOCR.txt’, to preserve some information about provenance.
This OCR engine is pretty good.
Once the text file has been examined, the .ppm and .ppm.txt files are no longer needed, so
rm page-??.ppm*
Of course, if you are surer of what’s in the folder, you might go
rm *ppm*
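If you are less sure, it can help to look at what a pattern matches before handing it to rm. A minimal sketch; the function name show_cleanup_targets is invented for this example, and it deletes nothing.

```shell
# Hypothetical helper: print the files the cleanup pattern would match,
# one per line, without deleting anything.
show_cleanup_targets() {
    for f in page-??.ppm*; do
        # Skip the literal pattern when there are no matches.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

Once the list looks right, the rm command with the same pattern removes exactly those files.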
Now, clearly this could all be wrapped up in a very simple script, something like this (script has some improvements over commands noted above):
echo "1. Converting to png (limit 9999 pages or your disk space)"
#
gs -dBATCH -dNOPAUSE -sDEVICE=pnggray -r600 -dUseCropBox -sOutputFile=ZZZZpage-%04d.png "$1" 2> /dev/null > /dev/null
echo -n " "
for f in ZZZZpage-????.png ; do echo -n "." ; done
echo
echo -n "2. Performing OCR "
for f in ZZZZpage-????.png ; do echo -n "*" ; tesseract "$f" "$f" --dpi 600 2> /dev/null > /dev/null ; done
echo " done."
prename=`basename "$1" .pdf`
newname="$prename.txt"
echo 3. Creating text file "$newname"
cat ZZZZpage-????.png.txt > "$newname"
echo 4. Cleaning up
rm ZZZZpage-????.png ZZZZpage-????.png.txt
Here I’ve added a few bells and whistles. Note that all output from gs and tesseract is discarded: ‘2> /dev/null’ means ‘send output stream 2 (stderr) to /dev/null’, which makes it disappear, and ‘> /dev/null’ does the same for normal output. If something does not work, these redirections should be removed so the error messages become visible.
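A middle road is to keep the error output in a file rather than throwing it away. This is a sketch, not part of the script as I ran it; the wrapper name run_logged and the log file name ocr-errors.log are both assumptions.

```shell
# Hypothetical wrapper: run whatever command is given, appending its
# stderr to a log file instead of discarding it.
run_logged() {
    "$@" 2>> ocr-errors.log
}
```

In the script, a call might then read ‘run_logged tesseract "$f" "$f" --dpi 600 > /dev/null’, and the log can be checked or deleted afterwards.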
‘echo -n’ means ‘echo but do not add a linefeed’. This is not portable to every shell (some versions of echo print the ‘-n’ literally), though in bash it should be fine.
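Where portability matters, printf is the usual substitute, since it only prints a newline when the format string contains one. A small sketch of the progress line done that way; the function name and the three fake pages are just for illustration.

```shell
# Illustration only: the same style of progress line as in the script,
# built with printf instead of echo -n.
show_progress_demo() {
    printf '2. Performing OCR '
    for page in 1 2 3; do
        printf '*'            # one mark per page, no newline
    done
    printf ' done.\n'         # newline only here, at the end
}
```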
This requires a working Ghostscript interpreter. Other conversion paths are possible. The standard tesseract uses Leptonica, which can read ppm, png and other formats, so pdftoppm as used above works, but ppm files are big and uncompressed, which is why I changed to png. I note that the gs-based version picked up some text that the pdftoppm version did not. That may be because I went up to 600 dpi, but some other factor may be at work; I can’t say for sure.
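For completeness, pdftoppm from poppler-utils can also write PNG files directly, which avoids the large ppm files without needing Ghostscript. The following is an untested-against-my-PDF sketch, so it says nothing about whether the missing text would reappear; the function name pdf_to_png_pages is hypothetical, while -r, -gray and -png are pdftoppm options.

```shell
# Hypothetical helper: convert a PDF to grayscale PNG pages with pdftoppm.
# $1 is the PDF file, $2 is the stem for the output file names.
pdf_to_png_pages() {
    pdftoppm -r 600 -gray -png "$1" "$2"
}
```

Note that pdftoppm chooses how many digits to use for the page number itself, so a fixed pattern such as ‘ZZZZpage-????.png’ may need to become ‘ZZZZpage-*.png’ if this replaces the gs step.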