Changes - Free Knowledge Base- The DUCK Project: information for everyone

PDF: The Portable Document Format

1,067 bytes added, 17:26, 3 October 2019

The following lines were added (+) and removed (-):

=== OCR Scanned Images for your PDF Pages ===
==== Tesseract ====
Tesseract is an optical character recognition (OCR) utility that runs on Linux, Microsoft Windows, and other operating systems.

Tesseract up to and including version 2 could only accept TIFF images of simple one-column text as input. Since version 3.00, Tesseract has supported output text formatting and accepts a number of image formats besides TIFF.

Tesseract is suitable for use as a backend and can handle more complicated OCR tasks, including layout analysis, when driven by a frontend such as OCRopus or gImageReader.

Before using Tesseract, it is very important to preprocess all images so that Tesseract can read them most effectively:
* text x-height is at least 20 pixels
* reduce or eliminate rotation or skew of the text
* high contrast is recommended
* eliminate any border or dark boxes around the text

See [[Tesseract]] for usage and examples of this powerful OCR tool, which beats many expensive commercial software products, including Adobe. It is pretty impressive!
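
Some of the preprocessing guidelines above can be expressed as simple arithmetic on pixel values. The following is a minimal sketch in plain Python, not a description of how Tesseract itself works. It assumes a grayscale image held as a list of rows of 0-255 values; the function names (<code>upscale_factor</code>, <code>stretch_contrast</code>, <code>crop_dark_border</code>) and the darkness threshold of 64 are hypothetical and chosen only for illustration. Deskewing is not covered here.

```python
MIN_X_HEIGHT_PX = 20  # minimum x-height taken from the guideline above


def upscale_factor(x_height_px, target_px=MIN_X_HEIGHT_PX):
    """Return the resize factor needed so the text x-height reaches target_px.

    A factor of 1.0 means the image is already large enough.
    """
    if x_height_px <= 0:
        raise ValueError("x-height must be positive")
    return max(1.0, target_px / x_height_px)


def stretch_contrast(pixels):
    """Linearly rescale grayscale values so they span the full 0-255 range."""
    lo, hi = min(pixels), max(pixels)
    if lo == hi:
        # A flat image has no contrast to stretch; return it unchanged.
        return list(pixels)
    return [round((p - lo) * 255 / (hi - lo)) for p in pixels]


def crop_dark_border(rows, threshold=64):
    """Drop outer rows and columns whose pixels are all darker than threshold.

    The threshold is an assumed value; a real scan may need a different one.
    """
    def is_dark(line):
        return all(p < threshold for p in line)

    rows = [list(r) for r in rows]
    # Trim fully dark rows from the top and the bottom.
    while rows and is_dark(rows[0]):
        rows.pop(0)
    while rows and is_dark(rows[-1]):
        rows.pop()
    if not rows:
        return []

    # Trim fully dark columns from the left and the right.
    width = len(rows[0])
    left = 0
    while left < width and is_dark([r[left] for r in rows]):
        left += 1
    right = width
    while right > left and is_dark([r[right - 1] for r in rows]):
        right -= 1
    return [r[left:right] for r in rows]
```

For example, text measured at an x-height of 10 pixels would need <code>upscale_factor(10)</code>, that is a 2x enlargement, before OCR. In practice an image library would do the actual resizing and cropping; the sketch only shows the decisions behind each step.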

Ke0etz
