Tesseract
Latest revision as of 10:37, 3 October 2019
Tesseract takes images in many different formats (jpg, png, tiff, etc.) and extracts the text from them. Tesseract is considered one of the best OCR tools. It was originally developed at Hewlett-Packard starting in 1985, is written in C and C++, and has been improved steadily since that time.

You will likely need the convert tool from ImageMagick to use Tesseract successfully.
installation and basic usage
To install Tesseract OCR on Linux Mint:
sudo apt install tesseract-ocr
This also installs the English language data package (tesseract-ocr-eng) for you automatically.
Syntax (releases 4.0 and later spell the page segmentation option --psm):
tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
To process an image file and print the OCR'd text to the screen:
tesseract document.png stdout
To process an image file and write the OCR'd text to a text file (Tesseract appends .txt to the output base, so this writes document.txt):
tesseract document.png document
Tesseract does not accept PDF input directly, so these commands only work on image files.
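The output base rule is easy to trip over, so here is a minimal sketch of a wrapper that derives the output base from the image name. The function name ocr_to_text and the default language eng are assumptions made for this example, not part of Tesseract.

```shell
# ocr_to_text IMAGE [LANG]
# Hypothetical helper: OCR one image and write the text next to it.
# Tesseract appends ".txt" to the output base itself, so the image
# extension is stripped instead of passing "name.txt".
ocr_to_text() (
    image=$1
    lang=${2:-eng}                      # assumption: English data is installed
    dir=$(dirname -- "$image")
    name=$(basename -- "$image")
    base=$dir/${name%.*}                # ./document.png -> ./document
    # On success, print the path of the text file tesseract created.
    tesseract "$image" "$base" -l "$lang" && printf '%s\n' "$base.txt"
)
```

Calling ocr_to_text document.png would run tesseract document.png ./document -l eng and print ./document.txt on success.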
Sometimes Tesseract works better if the image is bigger, sharper, or has higher contrast. Using a combination of ImageMagick and Tesseract we can get a more accurate OCR text file.
convert -colorspace gray -fill white -resize 480% -sharpen 0x1 documentpage01.png documentpage01.jpg
tesseract documentpage01.jpg documentpage01
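When a document has many pages, the two steps can be wrapped in a loop. The following is a rough sketch that reuses the convert options from the command above; the function name preprocess_and_ocr and the "-ocr" file suffix are arbitrary choices for this example.

```shell
# preprocess_and_ocr PAGE.png...
# Sketch: enlarge and sharpen each page with ImageMagick, then OCR the result.
preprocess_and_ocr() (
    for src in "$@"; do
        base=${src%.*}                  # documentpage01.png -> documentpage01
        # Same options as the convert command above, with the input file first.
        convert "$src" -colorspace gray -fill white -resize 480% -sharpen 0x1 \
            "$base-ocr.jpg" || exit 1
        # Tesseract appends ".txt", so this writes documentpage01.txt.
        tesseract "$base-ocr.jpg" "$base" || exit 1
    done
)
```

For example, preprocess_and_ocr documentpage*.png would process every matching page in turn and stop at the first failure.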
for better results prepare your images
Before using Tesseract, it is very important to pre-process all the images so that Tesseract can read them as reliably as possible. For best results:
- text x-height is at least 20 pixels
- reduce or eliminate rotation or skew of the text
- high contrast is recommended
- eliminate any border or dark boxes around text
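One possible way to apply these tips with ImageMagick is sketched below. The function name prepare_image, the 40% deskew threshold, and the default 200% scale are assumptions for illustration; the right values depend on the scan. Note that -trim only removes a uniform border, so dark boxes inside the page may still need manual cropping.

```shell
# prepare_image SRC DST [SCALE]
# Sketch: clean up a scan before OCR. All option values are starting points.
prepare_image() (
    src=$1
    dst=$2
    scale=${3:-200%}        # enlarge so the text x-height reaches about 20 px
    # -deskew straightens rotated text, -normalize stretches the contrast,
    # -trim +repage cuts off a uniform border and resets the canvas.
    convert "$src" -colorspace gray -deskew 40% -normalize \
        -trim +repage -resize "$scale" "$dst"
)
```

For example, prepare_image scan.png clean.png 300% would write a cleaned, three-times-larger copy to clean.png.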
graphical front end
gImageReader is a simple GTK+ front-end to tesseract-ocr.
sudo apt install gimagereader
searchable PDF using Tesseract
This section describes how to use the command line to turn a non-searchable PDF, such as one created by the process described in Create PDF Documents with ImageMagick and Ghostscript, into a searchable PDF. It is possible, but a little complicated.
tesseract document01.png document01 pdf
tesseract document02.png document02 pdf
tesseract document03.png document03 pdf
Pages 1, 2, and 3 are each written to their own searchable PDF (document01.pdf, document02.pdf, document03.pdf). However, we want them all in one PDF.
How about multiple pages in one PDF document?
- Create a PDF with multiple pages
- OCR on a multipage PDF
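One way to get several page images into a single searchable PDF is to hand Tesseract a text file that lists the images, one per line, instead of a single image. The sketch below assumes a Tesseract release recent enough to accept such a list file and to have the pdf config available; the function name images_to_searchable_pdf is made up for this example.

```shell
# images_to_searchable_pdf OUTBASE PAGE...
# Sketch: build one multi-page searchable PDF (OUTBASE.pdf) from page images.
images_to_searchable_pdf() (
    outbase=$1
    shift
    list=$(mktemp) || exit 1
    # One image path per line, in page order.
    printf '%s\n' "$@" > "$list"
    tesseract "$list" "$outbase" pdf
    status=$?
    rm -f "$list"
    exit "$status"
)
```

For example, images_to_searchable_pdf out document01.png document02.png document03.png is intended to produce a three-page out.pdf.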
TESSERACT CANNOT DIRECTLY READ PDF FORMAT. I know, that stinks, doesn't it? Well, there are too many variations in the PDF document format for a tool like Tesseract to cope with.
PDF to PPM method:
Pdftoppm converts Portable Document Format (PDF) files to color image files in Portable Pixmap (PPM) format, grayscale image files in Portable Graymap (PGM) format, or monochrome image files in Portable Bitmap (PBM) format.
The process is best described by Darren Goossens in the article "Simple use of tesseract OCR on a multipage PDF".
Note: if his site disappears then the page will be mirrored here and credited to Darren.
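As a rough illustration of the PDF to image method (this is a sketch, not Darren's script), the function below renders each PDF page to a PNG with pdftoppm, runs Tesseract on every page, and joins the text. The function name pdf_to_text and the 300 dpi resolution are assumptions, and the -png option assumes the poppler-utils version of pdftoppm.

```shell
# pdf_to_text INPUT.pdf OUTPUT.txt
# Sketch: OCR a multipage PDF by rendering each page to an image first.
pdf_to_text() (
    pdf=$1
    out=$2
    work=$(mktemp -d) || exit 1
    # Render every page as a PNG named page-<number>.png in the work directory.
    pdftoppm -r 300 -png "$pdf" "$work/page" || { rm -rf "$work"; exit 1; }
    : > "$out"
    status=0
    # Page numbers are zero-padded by pdftoppm, so glob order is page order.
    for img in "$work"/page-*.png; do
        [ -e "$img" ] || continue
        if tesseract "$img" "${img%.png}"; then
            cat "${img%.png}.txt" >> "$out"
        else
            status=1
            break
        fi
    done
    rm -rf "$work"
    exit "$status"
)
```

For example, pdf_to_text scanned.pdf scanned.txt would leave the combined text of all pages in scanned.txt.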
Learn more...
Tesseract does a lot more than is covered here, and it also performs better when tuned through a process called 'training'. Compared with what we would like to have documented, this page is a sparse entry.