Changes

Tesseract

826 bytes added, 17:01, 3 October 2019
The following lines were added (+) and removed (-):
How about multiple pages in one PDF document?
* Create a PDF with multiple pages
* OCR on a multipage PDF

TESSERACT CANNOT DIRECTLY READ PDF FORMAT. I know that stinks, doesn't it? Well, there are too many variations in the PDF document format for a tool like Tesseract to cope with.

PDF to PPM method:

pdftoppm converts Portable Document Format (PDF) files to color image files in Portable Pixmap (PPM) format, grayscale image files in Portable Graymap (PGM) format, or monochrome image files in Portable Bitmap (PBM) format.

The process is best described by Darren Goossens in the article [https://darrengoossens.wordpress.com/2019/01/07/simple-use-of-tesseract-ocr-on-a-multipage-pdf/ Simple use of tesseract OCR on a multipage PDF]. Note: if his site disappears then the page will be mirrored here and credited to Darren.

Tesseract does a lot, and it also performs better when tweaked using a process of 'training'. Compared with everything Tesseract can do and what we would like to have documented, this page is a sparse entry.
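
The PDF to PPM method can be sketched in a few lines of Python. This is only a rough illustration, not the procedure from Darren's article: it assumes the <code>pdftoppm</code> and <code>tesseract</code> commands are installed and on the PATH, and the function names, the <code>page</code> file prefix and the 300 DPI default are made up for this example.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, prefix, dpi=300):
    """Build the pdftoppm call that renders every PDF page to <prefix>-N.ppm."""
    # -r sets the rendering resolution in DPI; 300 is an assumed default.
    return ["pdftoppm", "-r", str(dpi), str(pdf_path), str(prefix)]


def tesseract_command(image_path):
    """Build the tesseract call; the text is written to <image name without suffix>.txt."""
    image_path = Path(image_path)
    return ["tesseract", str(image_path), str(image_path.with_suffix(""))]


def ocr_pdf(pdf_path, workdir):
    """Convert a multipage PDF to page images, OCR each one, and join the text."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)

    # Step 1: one PPM image per page (page-1.ppm, page-2.ppm, ...).
    subprocess.run(pdftoppm_command(pdf_path, workdir / "page"), check=True)

    # Step 2: run tesseract on each page image in page order.
    texts = []
    for image in sorted(workdir.glob("page-*.ppm")):
        subprocess.run(tesseract_command(image), check=True)
        texts.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

Calling <code>ocr_pdf("scan.pdf", "ocr_work")</code> (both names are placeholders) would leave the page images and per-page text files in <code>ocr_work</code> and return the combined text. Splitting out the two command builders keeps the exact command lines easy to inspect before anything is run.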