Difference between revisions of "Talk:PDF: The Portable Document Format"

From Free Knowledge Base- The DUCK Project: information for everyone
* http://hublog.hubmed.org/archives/001875.html
 
 
* http://newsgroups.derkeiler.com/Archive/Comp/comp.lang.postscript/2006-06/msg00089.html
 

Revision as of 19:48, 17 October 2016

G4-compressed-TIFF-to-PDF-conversion FAQ


Last change: 02 Aug 2006 (most content is from 2003 though), contact Holger Blasum, <img src="http://c42pdf.ffii.org/email.png" alt="c42pdf ATT ffii DOTT org"> for comments, critique or updates

Why PDF ?

<a href="http://partners.adobe.com/asn/developer/acrosdk/docs.html#filefmtspecs">PDF</a> readers are somewhat more common than TIFF readers.

What is a G4-compressed TIFF? Is there an example?

A TIFF file that has been compressed according to the ITU G4 Fax compression standard, so typically it is a black-and-white scan (may consist of several pages). Here is an example <a href="http://c42pdf.ffii.org/src/sample.tif">(TIFF)</a> , <a href="http://c42pdf.ffii.org/src/sample.pdf">(PDF)</a> .

How can I check the compression of any TIFF / PDF files ?

  • TIFF: tiffdump sample.tif (libtiff-tools) or tifftopnm -headerdump file.tiff > /dev/null (netpbm)

  • PDF: open PDF in text editor, search for /CCITTFaxDecode, or /FlateDecode or if it is in another format for /Filter in the Objects of Subtype /Image
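
The text-editor search for PDF filters can also be done from the command line. The following is a minimal sketch, not part of the original FAQ: the function name pdf_filters is made up for illustration, and it assumes the filter names are stored as plain text in the file (true when the PDF does not use compressed object streams) and that grep supports the -a and -o options.

```shell
#!/bin/sh
# List the stream filters named in a PDF and how often each one occurs.
# Assumption: filter names appear as plain text in the file.
pdf_filters() {
    # -a: treat the binary file as text; -o: print only the matching part
    grep -a -o -E '/(CCITTFaxDecode|FlateDecode|RunLengthDecode|LZWDecode|DCTDecode)' "$1" |
        sort | uniq -c
}

# Usage: pdf_filters sample.pdf
```

A count next to /CCITTFaxDecode suggests fax-compressed images; /FlateDecode alone is ambiguous because it is also used for non-image streams.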

Which compression options are there for monocolor PDF and TIFF ?

(see also: Jason Summers, <a href="http://groups.google.com/groups?q=TIFF+deflate+compression&hl=en&rnum=10&selm=76mqnp%249j3%241%40news-2.news.gte.net">comp.infosystems.www.authoring.images, 1999/01/02</a>)

  • no compression ("bitmap"), should not be used

  • packbit compression, not very powerful, in the PDF spec this is called "RunLengthDecode"

  • CCITT G3 compression, ITU fax standard, not optimal

  • LZW general compression algorithm, patented, should not be used

  • CCITT G4 compression, ITU fax standard

  • (PDF since version 1.2): Flate (zip/"gzip") general compression algorithm

In practice, for one (random) 300 dpi scan this boiled down to: 1,121 kB for no compression, 182 kB for packbit compression, 113 kB for G3 compression, 78 kB for G4 compression and 71 kB for Flate compression. But don't take this benchmark too seriously; it varies widely!
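
To put those sizes in proportion, the following sketch computes each size as a factor of the uncompressed one. It only restates the figures quoted above for that single scan; the function name compression_factors is illustrative.

```shell
#!/bin/sh
# Compression factors relative to the uncompressed size, using the
# file sizes (in kB) quoted above for one 300 dpi scan.
compression_factors() {
    awk 'BEGIN {
        raw = 1121
        n = split("packbits:182 G3:113 G4:78 Flate:71", item, " ")
        for (i = 1; i <= n; i++) {
            split(item[i], kv, ":")
            printf "%-8s %4d kB  factor %.1f\n", kv[1], kv[2], raw / kv[2]
        }
    }'
}

compression_factors
```

For this scan G4 comes out at roughly a factor of 14 and Flate at roughly a factor of 16, so the gap between the two is small compared with the gap to packbits.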

File size of produced PDFs vs incoming G4 compressed TIFFs

Produced PDFs should usually not be more than 2% larger than incoming TIFFs. Depending on the structure of the TIFFs output may be even smaller and by changing compression (see below) you may typically gain another one-digit percentage. Deviations of more than 10% in either direction are atypical (though John has shown me some formally correct G4 compressed TIFF files that had huge chunks (15%) of repetitive data inside - this can be checked in a text editor or with tiffsplit).

What conversion programs are there ?

Well there is <a href="http://c42pdf.ffii.org">c42pdf</a> from this site.

Most more general converters come in a package that as command line tools install several small programs:

  • the <a href="http://www.libtiff.org/">libtiff-tools</a>, the important command line tools (for TIFF conversion) are tiffcp, tiff2ps, tiffdump and tiffsplit (AFAIK the best tool for TIFF handling)

  • <a href="http://www.ghostscript.com/">Ghostscript</a>, the important command line tools are ps2pdf, gs or gswin32 (Win version), (AFAIK the best tool for Postscript/PDF parsing)

  • <a href="http://www.imagemagick.org/">ImageMagick</a>, the important command line tool is convert (a swiss army knife for conversions of all types)

  • <a href="http://www.inf.bme.hu/~pts/sam2p/">sam2p</a>, a raster converter similar to ImageMagick with fewer library dependencies; its output is sometimes smaller than that of convert.

You also may want to have a look at:

    • Other free software for PDF imaging: <a href="http://sourceforge.net/projects/netpbm/">netpbm</a>, important command tools are tifftopnm, pnmtops and 'pnmtotiff'; <a href="http://www.foolabs.com/xpdf/">Xpdf</a> (cool pdfimages command)

    • <a href="http://tumble.brouhaha.com/">tumble</a>, (GPL) Tumble converts one or more TIFF (B&W only) and JPEG files into a PDF file. It was developed primarily for use with scanned images. B&W images will be encoded with Group 4 fax (T.6) lossless compression. JPEG images are embedded as-is. Tumble supports multi-page TIFF files, and can generate PDF outline entries (bookmarks) based on the file name and/or the page number.

    • free-as-beer: <a href="http://www.fastio.com/">tiff2pdf</a> (the downloadable demo version adds an advertising line to the converted PDF); another shareware "tiff2pdf" can be found in the <a href="http://www.davince.com/">Davince PDF suite</a> (full version; Windows only; registration after 30 days)

    • the (non-free) Acrobat Exchange conversion facilities

    • APIs: <a href="http://www.pdflib.com/">PDFlib</a> (Aladdin license) now has a fast passthrough mode as well, you can use the image.c example to start with; other APIs are <a href="http://www.stillhq.com/">Panda</a> (also C, GPL; that would be my choice if I wrote c42pdf now) and <a href="http://www.reportlab.org/">ReportLab</a> (Python).

    • <a href="http://www.blasum.net/holger/wri/comp/data/image/g4tweak/">g4tweak</a> (GPL) some g4-compressed scanned image manipulations (experimental, e.g. inverse functionality to c42pdf)

      ... but I am on Windows !

      So many programs ! Which one shall I choose ? How shall I use it ? (Christoph)

      There are several tradeoffs between ease of installation of programs and ease of the actual program use you can strike.

      I: One-step conversion:

      Either use <a href="http://c42pdf.ffii.org">c42pdf</a>, command:

      
                           c42pdf sample.tif
      

      This will create a file sample.pdf. In case it gives you an error message, a more versatile tool is the convert command from 'ImageMagick':

                              convert sample.tif sample.pdf
      

      You will be pleased to find out that convert also works in this intuitive way for many other formats.

      Big limitation with ImageMagick (tested: version 5.3.9): the resulting PDFs use the packbits/RunLengthDecode compression, which for some b/w images is about a factor of 10 less efficient than CCITT G4. During conversion the image data is represented as a bitmap, so it is rather memory intensive and slow ('time convert sample.tif sample.pdf': 2.105s, in contrast to 'time c42pdf sample.tif': 0.063s).

      Even if you are frustrated when ImageMagick just takes too long on your machine (and proceed to the next step) keep in mind that it is a very cool tool for raster image conversion in general (e.g. GIF to PNG), only that its black-white PDF image generation is not so good.

      II.1: Ghostscript-based two-step conversion (eg c42pdf doesn't work, output size is an issue):

      use tiff2ps from the libtiff-tools in connection with ps2pdf from Ghostscript (ps2pdf is just a shell script with reasonable params to gs):

                              tiff2ps -a sample.tif > sample.ps
                              ps2pdf sample.ps
      

      My benchmark (BTW, on a 137 MHz [237 bogomips] cpu, with the sample.tif included in the source distrib of c42pdf) for this is 0.387s for first plus 0.630s for the second step.

      I have so far not found a way of piping this (though I may be overlooking something obvious) so you would probably want to write a script like:

      unix/Cygwin bash:

                              #!/bin/bash
                              # strip the .tif/.tiff ending to get the base name
                              basename=${1%.tif*}
                              tiff2ps -a "$1" > "${basename}.ps" && \
                                      ps2pdf "${basename}.ps" && \
                                      rm "${basename}.ps"
      
      

      to be run as: script sample.tif

      It is of course also possible to wrap said script in a condition such as:

                              if ! c42pdf "$1"; then script "$1"; fi
      

      so that c42pdf is used where it works fast and fine and something else where it doesn't. This setup would emulate the behavior of www.fastio.com's tiff2pdf.
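
On the piping question raised above, one possibility is that ps2pdf, as described in current Ghostscript documentation, accepts - in place of the input file and then reads Postscript from standard input. The sketch below combines that with the c42pdf-first fallback. It rests on two assumptions that may not hold everywhere: that the installed ps2pdf supports -, and that c42pdf exits with a non-zero status when it cannot handle a file. The function name tif2pdf is illustrative.

```shell
#!/bin/sh
# Try c42pdf first; if it fails, fall back to tiff2ps piped into ps2pdf.
# Assumptions: ps2pdf accepts "-" as input (Postscript from stdin), and
# c42pdf signals failure through a non-zero exit status.
tif2pdf() {
    out="${1%.tif*}.pdf"
    if c42pdf -o "$out" "$1"; then
        return 0
    fi
    # Fallback: no intermediate .ps file is written.
    tiff2ps -a "$1" | ps2pdf - "$out"
}

# Usage: tif2pdf sample.tif    (writes sample.pdf)
```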

      DOS shell, not tested:

                              tiff2ps -a %1.tif > %1.ps
                              call ps2pdf %1.ps %1.pdf
                              del %1.ps
      

      to be run as: script sample (without tif ending)

      Output paper format and resolution:

      On my localized linux system, the default of ps2pdf is to write A4 format, but this may be different on another machine. You can easily add an argument to ps2pdf for the papersize, e.g.:

                              ps2pdf -sPAPERSIZE=a4 sample.ps
      

      Other options for PAPERSIZE could be legal, b5, etc.

      II.2: Alternative: c42pdf-based two-step conversion:

      use tiffcp from libtiff-tools in connection with c42pdf, first use tiffcp (small and fast) to bring your TIFFs into adequate format:

                              tiffcp -c g4 -r 100000 sample.tif sampleg4.tif
      

      Then run c42pdf on the new file:

                              c42pdf -o sample.pdf sampleg4.tif 
      

      unix/Cygwin bash:

                        #!/bin/bash
                        basename=${1%.tif*}
                        tiffcp -c g4 -r 100000 $1 /tmp/temptif.tif
                        c42pdf -o ${basename}.pdf /tmp/temptif.tif
      

      DOS shell: 'SCRIPT.BAT':

                        tiffcp -c g4 -r 100000 %1.tif %1g4.tif
                        c42pdf -o %1.pdf %1g4.tif
                        del %1g4.tif
      

      to be run as: SCRIPT SAMPLE (without tif ending!) The "del" command will obviously delete a file named sampleg4.tif so you should make sure that there is no file sampleg4.tif in the current directory if you have a file sample.tif.

      III. Three-step conversion (advanced users):

      Another obvious way to go is to use tiffcp to convert any TIFF document to a format c42pdf can definitely process, then do the fast throughput via c42pdf, and then run Ghostscript on the result for FlateDecode and optimized page trees. In effect c42pdf is used as the intermediate link between the most powerful tool for reading TIFFs and the most powerful tool for producing PDFs. In comparison with the aforementioned second two-step conversion, the trick is that by jumping directly into PDF the poor Postscript compression can be avoided.

      unix/Cygwin bash:

                        #!/bin/bash
                        basename=${1%.tif*}
                        tiffcp -c g4 -r 100000 $1 /tmp/temptif.tif
                        c42pdf -o /tmp/temptif.pdf /tmp/temptif.tif
                        gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
                              -sOutputFile=${basename}.pdf /tmp/temptif.pdf 
      

      My benchmark for this is 0.751s on a 466 bogomips CPU for (III) as opposed to 1.121s for (II.2) on that CPU on the tiny sample.tif. On larger files the ratio even improves to about 50% savings in computing power and disk activity of III vs II.2.

      Remark on choosing compression modes in Ghostscript (advanced users)

      Ghostscript version 5.50 produces a /CCITTFaxDecode PDF, whereas Ghostscript version 6.50 produces /FlateDecode PDF.

      The compression efficiency of /FlateDecode seems slightly (usually about 0-15%) better than /CCITTFaxDecode on random documents, so this is nothing to worry about.

      If for whatever reason you want CCITTFaxDecode you can either deliberately use version 5.50 of Ghostscript, since explicit control by the -dAutoFilterMonoImages=false or -dMonoImageFilter=/CCITTFaxDecode seems not to be operational in version 6.50, or apply a trick posted to <a href="http://groups.google.com/groups?hl=en&threadm=B65308C7.6CBA%25fulvio%40omniacom.net&rnum=1&prev=/groups%3Fq%3DMonoImageFilter%26hl%3Den%26rnum%3D1%26selm%3DB65308C7.6CBA%2525fulvio%2540omniacom.net">comp.lang.postscript</a>

      Ghostscript also should be able to do it directly, but I haven't figured out how yet ;-) - pls tell me if you know how to do it because it is

      Stuff that I had tried too (but not really recommended):

      Postscript generation with tifftopnm and pnmtops:

                        tifftopnm sample.tif > sample.pbm
                        pnmtops sample.pbm > sample.ps
      

      1.664s for the first, 0.742s for the second step.

      Limitation: this sometimes results in black-white inversion.

      convert any TIFF to something else (eg non-multistripped TIFF),

      netpbm tools:

                              tifftopnm sample.tif > sample.pbm
                              pnmtotiff -g4 -rowsperstrip 100000 sample.pbm > sample.tif      
      
      

      1.606s for the first, 0.791s for the second step.

      This creates a TIFF image with all (well the first 100000 which is sufficient for paper sizes less than 10 meters ;-) ) rows in one strip.

      Limitation: this results in black-white inversion. I have also not figured out how to do this on multipage TIFFs.

      Various routes from single-page to multipage (Hartmut)

      All of the above-mentioned conversions from TIFF to PDF work fine where input files are already multipage TIFFs. When you want to merge single-page input files into a multipage document, conceptually you have three options for when to do it: before conversion, during conversion and after conversion.

      Before conversion should be a piece of cake with libtiff and can definitely be done with the -adjoin option of convert but if we are converting anyway, why do one useless conversion more?
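
For completeness, the "before conversion" route with libtiff could look like the sketch below. It assumes that tiffcp, when given several input files followed by one output file, appends the pages of each input to that output in order, and that mktemp -d is available. The function name merge_tifs_to_pdf and the file names in the usage line are illustrative.

```shell
#!/bin/sh
# Merge single-page TIFFs into one multipage G4 TIFF, then convert once.
# Assumption: tiffcp takes several inputs followed by one output file.
merge_tifs_to_pdf() {
    outpdf=$1
    shift
    dir=$(mktemp -d) || return 1
    tiffcp -c g4 -r 100000 "$@" "$dir/all.tif" &&
        c42pdf -o "$outpdf" "$dir/all.tif"
    status=$?
    # Remove the private temporary directory whether or not it worked.
    rm -rf "$dir"
    return $status
}

# Usage: merge_tifs_to_pdf book.pdf page1.tif page2.tif page3.tif
```

Because the intermediate file lives in a fresh temporary directory, no existing file in the working directory is overwritten or deleted.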

      During conversion: either use the c42pdf -l option for converting lists of files (see its documentation) if that is good enough for you or slightly adapt our 'tiff2ps'->ps2pdf two-step conversion process.

      A reasonable way for doing this (due to the bulkiness of postscript it is probably better to delay merging into the second step) could be:

                              #!/bin/bash
                              # miff2pdf: merge several TIFFs into one PDF
                              # (needs outfile plus at least two infiles)
                              if [ $# -le 2 ]
                              then 
                                      echo "Usage: miff2pdf outfile infiles..."
                                      exit 1
                              fi
                              if [ -e tmpdir ]
                              then
                                      echo "remove tmpdir manually"
                                      exit 1
                              fi
                              outfile=$1
                              shift
                              mkdir tmpdir
                              for i in "$@"
                              do
                                      # Postscript is stored under the TIFF's name
                                      tiff2ps -a "$i" > "tmpdir/$i"
                              done
                              cd tmpdir
                                      gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
                                              -sOutputFile="../$outfile" "$@"
                              cd ..
                              rm -rf tmpdir
      

      My benchmark is 2.7s for running this for sample.tif and a copy of it, sample2.tif. Please do not forget to add the -sPAPERSIZE to gs if you want to control that.

      After conversion: not the best way to do it (one conversion more to do than during) but should be explained in case somebody just delivers you PDFs, so this is about concatenating PDFs ...

      The (recommended) fast way is to use Ghostscript:

                      gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
                              -sOutputFile=output.pdf sample.pdf sample.pdf
      

      Benchmark: 1.224s for that command. Use -sPAPERSIZE where needed.

      If you want to add bookmarks or document information, this is easy, see section below: bookmarks and other pdfmarks.

      A more playful way is to use eg <a href="http://www.etymon.com/pj/">pjscript</a> , a working script for concatenation would be:

                      #concat.pjs: A simple concat pjscript. http://www.etymon.com/pj/
                      #invoked: pjscript concat.pjs infile1 infile2 outfile
                      println "Concatenating..."
                      #Command line args are stored in the vars 'arg0', 'arg1', etc.
                      =file arg0
                      readpdf
                      =file arg1
                      appendpdf
                      =file arg2
                      writepdf
                      println "Done."
      

      time pjscript concat.pjs sample.pdf sample.pdf output.pdf: 4.329s

      For multiple arguments let's write a little java program based on the same PJ library (LGPL):

                      import com.etymon.pj.*;
                      import com.etymon.pj.object.*;
                      import java.io.*;
                      public class PjConcat {
                              public static void main (String [] args) {
                                      if (args.length < 2) System.out.println
                                        ("Usage: PjConcat outfile infiles ...");
                                      else try {
                                        Pdf pdf = new Pdf (args[1]);
                                        int filesno = args.length - 1;
                                        for (int i=2; i<=filesno; i++) 
                                          pdf.appendPdfDocument(new Pdf(args[i]));
                                        pdf.writeToFile(args[0]);
                                      } catch (Exception e) {
                                        e.printStackTrace();
                                      }
                              }
                      }       
      

      time java PjConcat output.pdf sample.pdf sample.pdf: 6.175s

      If you are interested in detailed PDF parsing, PJ's Pdf.appendPdfDocument and the pjscript source are a good place to start learning; if you are more C-oriented, look at PandaLex (http://www.stillhq.com/).

      Or use something ready-made like pdcat, (expiring) demo versions at pdcat <a href="http://pdf.glance.ch/eval/CLT/Win/pdcat.zip">(win)</a>, <a href="http://pdf.glance.ch/eval/CLT/Linux/pdcat.gz">(linux)</a>, <a href="http://pdf.glance.ch/eval/CLT/Sun26/pdcat.Z">solaris</a>

      Other approach to <a href="http://ktmatu.com/info/merge-pdf-files/">joining PDFs</a>.

      Image size of PDFs

      This will mainly concern engineering drawings.

      Image size in PDF 1.1 and 1.2 is <a href="http://www.pdfzone.com/resources/tips/tip0031.html">limited</a> to 3240x3240 units, in PDF 1.3 it is limited to 14,400x14,400 units. <a href="http://partners.adobe.com/asn/developer/acrosdk/docs.html#filefmtspecs">PDF</a> 1.4 (in appendix C) gives the same number of 14,400x14,400 as a limitation of Acrobat Reader. Accordingly, Acrobat Reader as of version 5.05 was unable to display a sample scan of 13,568x42,438 pixels, whereas that scan could be displayed by ghostview.
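
The limits are stated in default user-space units of 1/72 inch, so whether a scan fits depends on the resolution at which its pixels are placed on the page. The following sketch does that arithmetic. It assumes the page size is derived from pixel dimensions and resolution, and the resolutions in the usage lines are illustrative because the resolution of the sample scan is not given here. The function name fits_pdf_limit is made up.

```shell
#!/bin/sh
# Does an image of W x H pixels, placed at DPI pixels per inch, fit within
# a page limit of LIMIT units (1 unit = 1/72 inch, default limit 14400)?
# Assumption: page size is derived from pixel dimensions and resolution.
fits_pdf_limit() {
    w=$1 h=$2 dpi=$3 limit=${4:-14400}
    # Round up so that a fractional unit counts against the limit.
    wu=$(( (w * 72 + dpi - 1) / dpi ))
    hu=$(( (h * 72 + dpi - 1) / dpi ))
    echo "${wu}x${hu} units"
    [ "$wu" -le "$limit" ] && [ "$hu" -le "$limit" ]
}

# Usage (exit status 0 means the page fits):
#   fits_pdf_limit 13568 42438 300         # PDF 1.3 limit
#   fits_pdf_limit 13568 42438 300 3240    # PDF 1.1/1.2 limit
```

If each pixel were mapped to one unit (72 dpi), the 42,438 pixel side of the sample scan would exceed 14,400 units; at 300 dpi it would come to about 10,186 units.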

      If you plan to convert a whole repository of legacy TIFFs it would be wise to run tiffdump on some of them to see whether there are headers containing document metadata. Though this seldom occurs in practice you'd better at least check. You can also use tiffcp to check for unknown headers.

      Free OCR ?

      As an alternative to using Acrobat Capture, in some cases you might be interested in looking at <a href="http://jOCR.sourceforge.net">GOCR</a> or <a href="http://altmark.nat.uni-magdeburg.de/~jschulen/ocr/index.html">here</a> .

      Bookmarks and other pdfmarks (Hartmut) ?

      By going through a Postscript stage you can utilize the pdfmark kludge for Postscript (well documented by Thomas Merz, <a href="http://www.pdflib.com/pdfmark/">pdfmark Primer</a>).

      The cleanest way to do this is to create another small file, e.g. called "pdfmarks".

      Into it we write something like:

                      [ /Page 1 /View [/XYZ 0 842 1.0] /Title (1stpage) /OUT pdfmark
                      [ /Page 2 /View [/XYZ 0 842 1.0] /Title (2ndpage) /OUT pdfmark
                      [ /PageMode /UseOutlines /DOCVIEW pdfmark
      

      /OUT is outline, a bookmark in pdfmark-speak. The last line makes sure bookmarks are displayed on startup. And you might want to add document info as well:

                      [ /Title (A guide to TIFF-PDF conversion)
                        /Author (Holger Blasum)
                        /Creator (lousy software)
                        /Keywords (who knows)
                        /ModDate (D:20011004101012)
                        /DOCINFO pdfmark
      
      

      Now you just invoke ghostscript with the infile plus the new "pdfmarks" file:

                      gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
                      -sOutputFile=withmarks.pdf withoutmarks.pdf pdfmarks
      

      This adds pdfmark annotations to an existing PDF document. You can also use multiple documents, it actually doesn't matter when ghostscript receives the additional information in pdfmark. And of course "withoutmarks" can also be in Postscript format. Use -sPAPERSIZE where needed.

      For much more detail see the above-mentioned pdfmark Primer (written at a time when gs had not yet supported pdfmark (an Acrobat Distiller "standard") and pdfmarks had to be embedded in PS but that is irrelevant here).
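
For documents with many pages, the bookmark lines can be generated instead of typed. The sketch below prints one /OUT entry per page in the same form as the example above, followed by the /DOCVIEW line. It assumes A4 pages (top edge at y=842, as in that example) and titles without parentheses or backslashes, which would need escaping. The function name make_pdfmarks is illustrative.

```shell
#!/bin/sh
# Print a pdfmarks file with one top-level bookmark per page.
# Arguments are the bookmark titles, in page order.
# Assumptions: A4 pages and titles that need no Postscript escaping.
make_pdfmarks() {
    page=1
    for title in "$@"; do
        printf '[ /Page %d /View [/XYZ 0 842 1.0] /Title (%s) /OUT pdfmark\n' \
            "$page" "$title"
        page=$((page + 1))
    done
    # Show the bookmark pane when the document is opened.
    printf '[ /PageMode /UseOutlines /DOCVIEW pdfmark\n'
}

# Usage:
#   make_pdfmarks "1stpage" "2ndpage" > pdfmarks
#   gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
#      -sOutputFile=withmarks.pdf withoutmarks.pdf pdfmarks
```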

      Single files to single files: Any way to convert multiple input files (all files in a directory) into individual output files with the same name (new extension) (Greg) ?

      Converting all files in a directory can be achieved at the operating system's shell level:

                      Unix solution:
      
                              for A in *.tif; do c42pdf "$A"; done
      
                      DOS/Win solution ("MS-DOS command line prompt"):
      
                              for %f in (*.tif) do c42pdf "%f"
      

      Another, more powerful approach (e.g. for converting all *.tif files on an entire drive recursively, including subdirectories, in a way that is also robust for "long" filenames) is to download the Cygwin GNU tools (http://www.cygwin.com/) and run the following from the freshly installed Cygwin bash shell, in the directory whose subdirectories you want converted:

                      find . -name '*.tif' -exec c42pdf '{}' ';'
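
When a large tree is converted repeatedly, it can help to skip files that already have an up-to-date PDF and to report failures without stopping. The following is a sketch under a few assumptions: the shell's test command supports -nt (common, but not required by POSIX), file names contain no newlines, and c42pdf exits with a non-zero status on failure. The function name convert_tree is illustrative.

```shell
#!/bin/sh
# Convert every .tif below a directory, skipping files whose PDF is
# already newer than the TIFF. Names with spaces are handled; names
# containing newlines are not.
convert_tree() {
    find "${1:-.}" -type f -name '*.tif' -print |
    while IFS= read -r tif; do
        pdf="${tif%.tif}.pdf"
        # -nt: true if the PDF exists and is newer than the TIFF
        if [ "$pdf" -nt "$tif" ]; then
            continue
        fi
        # stdin is redirected so the converter cannot eat the file list
        c42pdf -o "$pdf" "$tif" < /dev/null || echo "failed: $tif" >&2
    done
}

# Usage: convert_tree /path/to/scans
```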
      

      I'd rather convert from PDF to TIFF.

      This is an easy one, use Ghostscript:

                      gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r300 -sPAPERSIZE=a4 \
                              -sOutputFile=mypdf.tif mypdf.pdf
      

      -r is the resolution, not using it would default to 204x196 (standard FAX resolution) for tiffs.

      If you'd prefer multistrip -dMaxStripSize=8192 would be an option.
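
If one TIFF per page is wanted instead of a single multipage file, Ghostscript can number the output files itself. This sketch assumes that Ghostscript expands a printf-style %03d in -sOutputFile to the page number; the function name pdf_to_page_tifs and the default of 300 dpi are illustrative choices.

```shell
#!/bin/sh
# Write one G4 TIFF per PDF page (mypdf-001.tif, mypdf-002.tif, ...).
# $1: input PDF, $2: optional resolution in dpi (default 300).
# Assumption: Ghostscript replaces %03d in the output name by the page number.
pdf_to_page_tifs() {
    gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r"${2:-300}" \
        -sOutputFile="${1%.pdf}-%03d.tif" "$1"
}

# Usage: pdf_to_page_tifs mypdf.pdf 600
```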

      Or simply use ImageMagick: convert mypdf.pdf mypdf.tif. Or use pdfimages (Xpdf) to convert to JPEG, PBM or PPM.

      I still have questions on PDF

      Acknowledgments:

      To Christoph Schulze, John A Kunze, James Y Hope, Hartmut Pilch, Greg Falvo, Jimmy Ngo, Dan Cogliano, Bill Gilchrist, Eric Smith for comments and questions.

      GhostScript Examples

      Ghostscript PDF to PNG/TIF


      gswin32c.exe -dNOPAUSE -dBATCH -r300 -sDEVICE=pnggray -sOutputFile="test.png" "TestFile.PDF"

      gswin32c.exe -dNOPAUSE -dBATCH -r300 -sDEVICE=pnggray -sOutputFile="c:\test.png" "c:\internet\Eudora\attach\12-30-2014 6-57 PM.pdf"

      copy "c:\internet\Eudora\attach\12-30-2014 6-57 PM.pdf" .\

      gswin32c.exe -dNOPAUSE -dBATCH -r300 -sDEVICE=pnggray -sOutputFile="test.png" "test.pdf"


      the map file cidfmap was not found

      gswin32c.exe -s OutputFile="test.png" "test.pdf"

      gswin32c.exe -q -dNOPAUSE -sDEVICE=pngalpha -r300 -dEPSCrop -sOutputFile=test2.png test.pdf

      gswin32c.exe -q -sOutputFile="test3.png" "test3.pdf"

      gswin32c.exe -dNOPAUSE -dBATCH -r300 -sDEVICE=pnggray -sOutputFile="test.png" "TestFile.PDF"

      gswin32c.exe -dNOPAUSE -dBATCH -sDEVICE=pnggray -sOutputFile="test.png" "test3.pdf"

      pdftoppm for Lisa's Research Papers

      The script:

      root@x:/etc# cat /bin/pdft 
      ------------------------------------------------------------------------------------------------------------------------
      #!/bin/sh
      # OCR pages 1-10 of one PDF (Dutch, -l nld) into pdf-ocr-output.txt
      mkdir tmp || exit 1
      cp "$1" tmp/input.pdf
      cd tmp || exit 1
      # pdftoppm expects its options before the PDF file and the output root
      pdftoppm -f 1 -l 10 -r 600 input.pdf ocrbook
      for i in *.ppm; do convert "$i" "`basename "$i" .ppm`.tif"; done
      for i in *.tif; do tesseract "$i" "`basename "$i" .tif`" -l nld; done
      # append each page's text, then a page marker, to the one output file
      for i in ocrbook-*.txt; do cat "$i" >> pdf-ocr-output.txt; echo "[pagebreak]" >> pdf-ocr-output.txt; done
      mv pdf-ocr-output.txt ..
      rm -- *
      cd ..
      rmdir tmp
      ------------------------------------------------------------------------------------------------------------------------
      

      The syntax

      Say I want to convert a file called "Research.pdf" to a text file.

      /bin/pdft Research.pdf
      

      The converted text ends up in pdf-ocr-output.txt in the directory the script was started from.

      The script still needs work: it handles one PDF at a time, only its first 10 pages (-f 1 -l 10), and assumes Dutch text (-l nld).