Difference between revisions of "Tesseract"

From Free Knowledge Base- The DUCK Project: information for everyone

Revision as of 10:49, 3 October 2019

Tesseract can take images in many formats, such as JPG, PNG, and TIFF, and extract the text from them. Tesseract is considered one of the best OCR tools. It was developed by Hewlett-Packard in C and C++, starting in 1985, and has been improved steadily since.

You will likely need the convert tool from ImageMagick to use Tesseract successfully.

installation and basic usage

To install Tesseract OCR on Linux Mint:

sudo apt install tesseract-ocr

This also installs the English language data package, tesseract-ocr-eng, automatically.

Syntax:

tesseract imagename outputbase [-l lang] [--psm pagesegmode] [configfile...]

To process an image file and print the OCR'd text to the screen:

tesseract document.png stdout

To process an image file and write the OCR'd text to a text file (Tesseract appends .txt to the output base name, so this creates document.txt):

tesseract document.png document

Tesseract does not accept PDF files directly as input in the way shown above. Convert the PDF pages to images first.
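Since PDF input is not accepted directly, one way to get page images is to rasterize the PDF with ImageMagick's convert tool. The following is a minimal sketch, not a definitive recipe: the file names document.pdf and document%02d.png are hypothetical, and 300 DPI is an assumed resolution that is commonly used for OCR.

```shell
# Hypothetical file names; adjust them to your own document.
input=document.pdf
pattern='document%02d.png'   # %02d becomes the page number: document01.png, document02.png, ...

# -density 300 rasterizes each page at 300 DPI (assumed here as a reasonable OCR resolution).
# -scene 1 starts the page numbering at 1 instead of 0.
# The conversion is skipped if convert is not installed or the input PDF is missing.
if command -v convert >/dev/null 2>&1 && [ -f "$input" ]; then
    convert -density 300 "$input" -scene 1 "$pattern"
fi
```

Some distributions disable PDF reading in ImageMagick's security policy, in which case convert reports a policy error instead of writing the images.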

graphical front end

gImageReader is a simple GTK+ front-end to tesseract-ocr.

sudo apt install gimagereader

searchable PDF using Tesseract

This section describes how to use the command line to turn a non-searchable PDF, such as one created by the process described in Create PDF Documents with ImageMagick and Ghostscript, into a searchable PDF. It is possible, but a little complicated.

tesseract document01.png page01 pdf
tesseract document02.png page02 pdf
tesseract document03.png page03 pdf

Page 1, Page 2, and Page 3 are now each a searchable PDF (page01.pdf, page02.pdf, and page03.pdf). However, we want them all in one PDF.
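One way to get all pages into a single searchable PDF is to give Tesseract a text file that lists the page images, one path per line, in place of a single image name. The sketch below assumes the three page images from this section are in the current directory. The list file name pages.txt and the output base name out are arbitrary choices made for illustration.

```shell
# List the page images in reading order, one path per line.
printf '%s\n' document01.png document02.png document03.png > pages.txt

# Given a list file as input, Tesseract runs OCR on every listed image and,
# with the pdf config, writes all pages to one file: out.pdf.
# The OCR step is skipped if tesseract is not installed or the first page image is missing.
if command -v tesseract >/dev/null 2>&1 && [ -f document01.png ]; then
    tesseract pages.txt out pdf
fi
```

The order of the lines in pages.txt determines the page order in the resulting PDF.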