
PDF: The Portable Document Format

9,711 bytes added, 23:12, 17 November 2021
== PDF Types ==
Consider the two main PDF file types, scanned versus native.  A native PDF file is superior to a scanned PDF file in capabilities, flexibility, and efficiency.  This is because a native PDF contains true text, while a scanned PDF contains only images of text.

PDF Types:
* Native
* Scanned

=== Native PDF ===
A native PDF file will contain literal text as part of the structure, including information about the text.  This is not to say that there are no images.  It is to say that the text itself is actual text and not just part of an image.  A native PDF has an internal structure that can be read and interpreted.  Only a native PDF can utilize all of the capabilities that the format lends to the reader software.

=== Scanned PDF ===
PDF files created by scanning hard-copy documents containing primarily text do not have the same structure as a PDF file of the same document created directly. The scanned document internally contains a picture of the document, with no information about the text. As far as a user can see, it is just another PDF file, with a name and extension indistinguishable from any other; a good scan may look exactly the same as a native PDF file, although a visually poor-quality file, often with skewed pages, gives away its nature. However, the file size will be different, and it will not be possible to search for text. For a scan of adequate quality it is possible, with suitable software, to regenerate the text of the document with optical character recognition (OCR) and embed it in the file so as to make it searchable, subject to the accuracy of the OCR.

=== Conversion ===
Converting a scanned PDF into a native PDF with software involves optical character recognition (OCR) technology.  OCR analyzes the image of each character and matches it to the corresponding electronic character, producing character-based text.  The level of accuracy depends on the quality of the scan and the font used.
OCR works primarily on typeset characters and not handwritten text.

View the list of [[PDF Viewers That Are Open Source]].
{{:PDF Viewers That Are Open Source}}

=== The GUI Way: Using Gimp and LibreOffice Draw ===
Creating PDF documents from scanned images and other data sources this way is fast, simple, and can all be accomplished without dropping to the console. This method is for people who wish to scan documents to images, make any modifications to the images, order the images, and generate a custom multi-page PDF document.  Learn how to [[Create PDF Documents with Gimp and LibreOffice Draw]].

=== The GUI Way: Using Simple Scan and PDF Chain ===
If all you are looking to do is scan some documents page by page, then combine them as a single ordered PDF without the need to make any edits or do any fancy OCR, compression, or other modification, you can accomplish this quickly and easily using two programs:
* Simple Scan
* PDF Chain

With Simple Scan you can scan each page and save each page as a PDF.  You can even skip PDF Chain and scan a number of pages to save as one PDF.  However, if you need to re-order pages, you can load each PDF you saved into PDF Chain and do some order changes, annotation, or other basic PDF modification.

You can compress an existing PDF (like one made with Gimp) into a smaller file size (ref: [https://www.shellhacks.com/linux-compress-pdf-reduce-pdf-size/ Compress PDF File In Linux])
 ps2pdf big.pdf smaller.pdf

=== Linux PDF Tools: imagemagick ===
I highly recommend this particular method if you are comfortable with the Linux shell.
I found this to yield the best results with the least amount of labor.

From the imagemagick package, use the convert command to perform tasks such as taking a folder of jpg images and creating a single PDF document.  If the images are numbered with leading zeros, such as 01 02 03 04 05, then the page order will match the file order.
 convert *.jpg document.pdf
It also works with png files:
 convert *.png document.pdf
The PDF contracts.pdf is black and white and contains multiple pages.  We can generate a tiff image for each page and add parameters to limit quality loss.
 convert -colorspace rgb -density 300 contracts.pdf -monochrome contracts-%03d.tiff
You can install imagemagick with apt:
 sudo apt install imagemagick

<big>See also: [[Create PDF Documents with ImageMagick and Ghostscript]]</big>

=== Linux PDF Tools: qpdf PDF transformation software ===
The qpdf program is used to convert one PDF file to another equivalent PDF file.  It is capable of performing a variety of transformations such as linearization (also known as web optimization or fast web viewing), encryption, and decryption of PDF files.  It also has many options for inspecting or checking PDF files, some of which are useful primarily to PDF developers.

For example, I have a password-protected PDF and I know the password.  I simply wish to remove the password protection:
 qpdf --password=password --decrypt /home/nicole/Documents/resume.pdf /home/nicole/Documents/resume2.pdf
Replace "password" with the actual password of the document.  qpdf was installed by default on my Linux Mint 18 system.  If it is not installed on yours:
 sudo apt install qpdf

=== Linux PDF Tools: tiff2pdf and tiffcp ===
The tiff2pdf utility can convert a single tiff file into a pdf document.  For multiple pages it will be necessary to create a multi-page tiff file.  Yes, a single tiff file can contain multiple pages.

A 12-page black and white document was scanned into jpeg images.
Although jpeg was not the best choice for black and white documents, this is how it was presented and thus needed to be converted to a pdf.  imagemagick convert produced a large pdf of over 6 MB that was not optimized for black and white.  This is not referring to compression, as applying jpeg compression or changing the dpi is not the correct way to optimize black and white scanned images.

Our fat pdf that was created from jpeg and not optimized for black and white is called document.pdf.  It will be deconstructed back to images, except this time into tiff images optimized for black and white.  A larger multi-page tiff file will then be created from the multiple tiff images.  The single multi-page tiff file will then be converted back into a much smaller optimized pdf document.
 convert -colorspace rgb -density 300 document.pdf -monochrome document-%03d.tiff
 tiffcp document-???.tiff multipage.tiff
 tiff2pdf -o documentfinal.pdf multipage.tiff
While the original document.pdf is over 6 MB, documentfinal.pdf is less than 1 MB.

=== Linux PDF Tools: pdfcrack ===
Use pdfcrack to unlock a password-protected PDF file when you do NOT know the password.  PDFCrack is a GNU/Linux tool for recovering passwords and content from PDF files. It is small and command-line driven, without external dependencies.
 pdfcrack -f 2020CrackMe.pdf
If you see the error
 The specific version is not supported (Standard - 6)
then your version of pdfcrack does not support 256-bit encryption.

''Other resources: look into John the Ripper to brute-force crack a protected PDF.  John the Ripper is a fast password cracker.  Its primary purpose is to detect weak Unix passwords.''

=== Print to PDF in Linux ===
One simple option that works in Debian-based distributions such as the popular Ubuntu Linux is to use cups-pdf.  See: [[Ubuntu_How_Do_I:_A_Linux_Q%26A#Q:_How_do_I_print_to_a_PDF_document_from_something_like_Libra_Office.3F|Install and Use cups-pdf in Ubuntu]] for a detailed guide.

=== Convert Images to PDF in Windows ===
Free Image to PDF Converter.
Supported formats are BMP, DIB, GIF, JPEG, JPG, JPE, JFIF, PNG, TIFF, and TIF.  It converts multiple files to a multi-page PDF.  The tool combines multiple directories and images into one PDF.

Installer:  PDFdu_Image_To_PDF_setup.exe<BR>
Developer Web Site: http://pdfdu.com/app/image-to-pdf-converter.aspx

=== Convert PDF to Images in Windows ===
==== Windows Print Driver: PDF to TIFF ====
[[The Virtual Image Printer driver by tariel]] will allow you to convert a PDF to multiple page image files in several image formats.  This is not all the Virtual Image Printer can do, and it is not exclusively for converting PDF to images.  However, it is very handy for performing this task under the Windows XP operating system.

==== GhostScript ====
The installer "gs915w32.exe" is the Win32 installer as of Dec 2014 for Microsoft Windows 32-bit operating systems such as Windows XP.  Using GhostScript, a PDF can be converted to PNG, for example:
 gswin32c.exe -dNOPAUSE -dBATCH -sDEVICE=pnggray -sOutputFile="test.png" "test.pdf"
GhostScript requires a proper PDF.  Some PDF files are broken, in that they will open in some viewers but are not completely compliant with the standard.  In short, GhostScript is picky.

=== OCR Scanned Images for your PDF Pages ===
Tesseract is an optical character recognition utility that will work in Linux and Microsoft Windows as well as other operating systems.

Tesseract up to and including version 2 could only accept TIFF images of simple one-column text as inputs. Since version 3.00, Tesseract has supported output text formatting and, besides TIFF, accepts a number of new image formats.

Tesseract is suitable for use as a backend and can be used for more complicated OCR tasks, including layout analysis, by using a frontend such as OCRopus or gImageReader.

Before using Tesseract, it is very important to properly process all the images so they will be read as accurately as possible by Tesseract.
* text x-height is at least 20 pixels
* reduce or eliminate rotation or skew of the text
* high contrast is recommended
* eliminate any border or dark boxes around text

See: [[Tesseract]] for usage and examples of this powerful OCR tool that beats many expensive commercial software products, including Adobe.  It is pretty impressive!

* [https://www.moreno.marzolla.name/software/scan-to-pdf/ Creating multi-page PDF documents from scanned images in Linux] - discusses tiff2pdf

== Related Pages ==
* [[GIMP]]
* [[Ubuntu How Do I: A Linux Q&A]]
* [[PDF: The Portable Document Format]]
* [[Create PDF Documents with Gimp and LibreOffice Draw]]
* [[Create PDF Documents with ImageMagick and Ghostscript]]
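
Returning to the tiff2pdf and tiffcp workflow described earlier on this page: the three commands can be wrapped in a small shell function so the same steps are repeatable for any black and white scan. This is only a sketch, assuming ImageMagick (convert) and the libtiff tools (tiffcp, tiff2pdf) are installed. The names shrink_bw_pdf, run, and DRY_RUN, and the intermediate file name ending in -multipage.tiff, are hypothetical and chosen just for illustration.

```shell
#!/bin/sh
# Sketch only: rebuild a bloated scanned PDF as a compact black and white PDF.
# Assumes ImageMagick (convert) and libtiff tools (tiffcp, tiff2pdf) are installed.
# shrink_bw_pdf, run and DRY_RUN are hypothetical names used for this illustration.

run() {
    # With DRY_RUN=1, print the command instead of executing it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

shrink_bw_pdf() {
    in_pdf=$1                 # e.g. document.pdf
    out_pdf=$2                # e.g. documentfinal.pdf
    base=${in_pdf%.pdf}       # strip the .pdf suffix

    # 1. Split the PDF into one monochrome TIFF per page.
    run convert -colorspace rgb -density 300 "$in_pdf" -monochrome "$base-%03d.tiff" &&
    # 2. Merge the per-page TIFF files into one multi-page TIFF.
    run tiffcp "$base"-???.tiff "$base-multipage.tiff" &&
    # 3. Convert the multi-page TIFF back into a PDF.
    run tiff2pdf -o "$out_pdf" "$base-multipage.tiff"
}
```

After sourcing the file, calling shrink_bw_pdf document.pdf documentfinal.pdf runs the three steps in order and stops at the first failure. Setting DRY_RUN=1 beforehand only prints the commands, which is a cheap way to check the file names before touching anything.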