Changes

PDF: The Portable Document Format

3,570 bytes added, 23:12, 17 November 2021
/* Linux PDF Tools: pdfcrack */
This particular method I highly recommend if you are comfortable with the Linux shell. I found that it yields the best results with the least amount of labor.

It also works with PNG files:

 convert *.png document.pdf

<big>See also: [[Create PDF Documents with ImageMagick and Ghostscript]]</big>

=== Linux PDF Tools: tiff2pdf and tiffcp ===

The tiff2pdf utility can convert a single TIFF file into a PDF document. For multiple pages it is necessary to create a multi-page TIFF file first. Yes, a single TIFF file can contain multiple pages.

A 12-page black and white document was scanned into JPEG images. JPEG is not the best choice for black and white documents, but this is how the scan was delivered, so it had to be converted to a PDF as-is. ImageMagick convert produced a large PDF of over 6 MB that was not optimized for black and white. Optimizing here does not mean compression: applying JPEG compression or changing the DPI is not the correct way to optimize black and white scanned images.

Our fat PDF, created from JPEG and not optimized for black and white, is called document.pdf. It will be deconstructed back into images, this time into TIFF images optimized for black and white. The individual TIFF images will then be combined into a single multi-page TIFF file, and that file will be converted back into a much smaller, optimized PDF document.

 convert -colorspace rgb -density 300 document.pdf -monochrome document-%03d.tiff
 tiffcp document-???.tiff multipage.tiff
 tiff2pdf -o documentfinal.pdf multipage.tiff

While the original document.pdf is over 6 MB, documentfinal.pdf is less than 1 MB.

=== Linux PDF Tools: pdfcrack ===

Use pdfcrack to unlock a password-protected PDF file when you do NOT know the password. PDFCrack is a GNU/Linux tool for recovering passwords and content from PDF files. It is small and command-line driven, with no external dependencies.
 pdfcrack -f 2020CrackMe.pdf

If you see the error

 The specific version is not supported (Standard - 6)

then your version of pdfcrack does not support 256-bit encryption.

''Other resources: look into John the Ripper to brute-force a protected PDF. John the Ripper is a fast password cracker. Its primary purpose is to detect weak Unix passwords.''

=== OCR Scanned Images for your PDF Pages ===

Tesseract is an optical character recognition utility that works on Linux and Microsoft Windows as well as other operating systems.

Tesseract up to and including version 2 could only accept TIFF images of simple one-column text as input. Since version 3.00, Tesseract has supported output text formatting and accepts a number of image formats besides TIFF.

Tesseract is suitable for use as a backend and can be used for more complicated OCR tasks, including layout analysis, through a frontend such as OCRopus or gImageReader.

Before using Tesseract it is very important to prepare all the images properly so that Tesseract can read them as efficiently as possible:

* text x-height is at least 20 pixels
* reduce or eliminate rotation or skew of the text
* high contrast is recommended
* eliminate any border or dark boxes around text

See [[Tesseract]] for usage and examples of this powerful OCR tool, which beats many expensive commercial software products, including Adobe. It is pretty impressive!

* [https://www.moreno.marzolla.name/software/scan-to-pdf/ Creating multi-page PDF documents from scanned images in Linux] - discusses tiff2pdf
* [[Create PDF Documents with Gimp and LibreOffice Draw]]
* [[Create PDF Documents with ImageMagick and Ghostscript]]
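
For repeated use, the three-step tiff2pdf workflow from the "tiff2pdf and tiffcp" section can be wrapped in a small shell function. The following is only a sketch, assuming ImageMagick (convert) and the libtiff tools (tiffcp, tiff2pdf) are installed. The names mono_pdf, run and DRY_RUN, and the intermediate file naming, are made up for this example and are not part of any of those tools.

```shell
#!/bin/sh
# Sketch: rebuild a scanned PDF as a black and white optimized PDF.
# Assumes convert (ImageMagick), tiffcp and tiff2pdf (libtiff) are installed.
# mono_pdf, run and DRY_RUN are hypothetical names chosen for this example.

run() {
    # Show the command; execute it only when DRY_RUN is not set to 1.
    printf '%s\n' "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

mono_pdf() {
    src=$1              # input PDF, for example document.pdf
    dst=$2              # output PDF, for example documentfinal.pdf
    base=${src%.pdf}    # prefix for the intermediate TIFF files

    # 1. Split the PDF into one monochrome TIFF per page.
    run convert -colorspace rgb -density 300 "$src" -monochrome "${base}-%03d.tiff" || return 1
    # 2. Combine the page images into one multi-page TIFF.
    #    The glob only expands once the page files from step 1 exist.
    run tiffcp "${base}"-???.tiff "${base}-multipage.tiff" || return 1
    # 3. Convert the multi-page TIFF into the final PDF.
    run tiff2pdf -o "$dst" "${base}-multipage.tiff" || return 1
}
```

After loading the function into your shell, call it as mono_pdf document.pdf documentfinal.pdf. Setting DRY_RUN=1 beforehand prints the commands without running them, which is a cheap way to check the file names before any work is done. The intermediate TIFF files are left in place so you can inspect the pages; delete them when you are satisfied with the result.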