Talk:PDF: The Portable Document Format
Contents
- 1 G4-compressed-TIFF-to-PDF-conversion FAQ
- 2 Why PDF ?
- 3 What is a G4 compressed TIFF ?, example ?
- 4 How can I check the compression of any TIFF / PDF files ?
- 5 Which compression options are there for monocolor PDF and TIFF ?
- 6 File size of produced PDFs vs incoming G4 compressed TIFFs
- 7 What conversion programs are there ?
- 8 ... but I am on Windows !
- 9 So many programs ! Which one shall I choose ? How shall I use it ? (Christoph)
- 9.1 I: One-step conversion:
- 9.2 II.1: Ghostscript-based two-step conversion (eg c42pdf doesn't work, output size is an issue):
- 9.3 II.2: Alternative: c42pdf-based two-step conversion:
- 9.4 III. Three-step conversion (advanced users):
- 9.5 Remark on choosing compression modes in Ghostscript (advanced users)
- 10 Various routes from single-page to multipage (Hartmut)
- 11 Image size of PDFs
- 12 Free OCR ?
- 13 Bookmarks and other pdfmarks (Hartmut) ?
- 14 Single files to single files: Any way to convert multiple input files (all files in a directory) into individual output files with the same name (new extension) (Greg) ?
- 15 I'd rather convert from PDF to TIFF.
- 16 I still have questions on PDF
- 17 Acknowledgments:
- 18 GhostScript Examples
G4-compressed-TIFF-to-PDF-conversion FAQ
Last change: 02 Aug 2006 (most content is from 2003 though), contact Holger Blasum (c42pdf ATT ffii DOTT org) for comments, critique or updates
Why PDF ?
<a href="http://partners.adobe.com/asn/developer/acrosdk/docs.html#filefmtspecs">PDF</a> readers are somewhat more common than TIFF readers.
What is a G4 compressed TIFF ?, example ?
A TIFF file that has been compressed according to the ITU G4 Fax compression standard, so typically it is a black-and-white scan (may consist of several pages). Here is an example <a href="http://c42pdf.ffii.org/src/sample.tif">(TIFF)</a> , <a href="http://c42pdf.ffii.org/src/sample.pdf">(PDF)</a> .
How can I check the compression of any TIFF / PDF files ?
TIFF:
tiffdump sample.tif
(libtiff-tools) or
tifftopnm -headerdump file.tiff > /dev/null
(netpbm)
PDF: open the PDF in a text editor and search for
/CCITTFaxDecode
or
/FlateDecode
or, if it is in another format, search for
/Filter
in the objects of Subtype
/Image
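Instead of opening the PDF in an editor, the same search can be scripted. The following is only a minimal sketch: the function name pdf_filters is made up for this example, and it assumes a grep that supports -a and -o (GNU and BSD grep do). Filter names that sit inside compressed object streams, as newer PDF versions allow, will not be found this way.

```shell
# List the stream filters named in a PDF and how often each one occurs.
# -a treats the binary file as text, -o prints only the matching part.
pdf_filters() {
    grep -a -o -E '/(CCITTFaxDecode|FlateDecode|DCTDecode|LZWDecode|RunLengthDecode)' "$1" \
        | sort | uniq -c
}
```

It would be run as: pdf_filters sample.pdf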
Which compression options are there for monocolor PDF and TIFF ?
(see also: Jason Summers, <a href="http://groups.google.com/groups?q=TIFF+deflate+compression&hl=en&rnum=10&selm=76mqnp%249j3%241%40news-2.news.gte.net">comp.infosystems.www.authoring.images, 1999/01/02</a>)
- no compression ("bitmap"), should not be used
- packbit compression, not very powerful; in the PDF spec this is called "RunLengthDecode"
- CCITT G3 compression, ITU fax standard, not optimal
- LZW general compression algorithm, patented, should not be used
- CCITT G4 compression, ITU fax standard
- (PDF since version 1.2): Flate (zip/"gzip") general compression algorithm
In practice, for (a random) 300 dpi scan this boiled down to: 1,121 kB for no compression, 182 kB for packbit compression, 113 kB for G3 compression, 78 kB for G4 compression and 71 kB for Flate compression. But don't take this benchmark too seriously, this varies widely!
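To repeat this comparison on your own scans, a loop over the -c option of tiffcp is one way to do it. This is only a sketch: the function name compare_tiff_compression is my own, and which schemes actually work depends on how your libtiff was built (zip, i.e. Deflate, needs zlib support).

```shell
# Recompress one TIFF with several schemes and print the resulting sizes.
compare_tiff_compression() {
    src=$1
    for scheme in none packbits g3 g4 zip; do
        dst=$(mktemp) || return 1
        if tiffcp -c "$scheme" "$src" "$dst"; then
            printf '%-9s %s bytes\n' "$scheme" "$(wc -c < "$dst" | tr -d ' ')"
        else
            printf '%-9s failed\n' "$scheme"
        fi
        rm -f "$dst"
    done
}
```

It would be run as: compare_tiff_compression sample.tif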
File size of produced PDFs vs incoming G4 compressed TIFFs
Produced PDFs should usually not be more than 2% larger than incoming TIFFs. Depending on the structure of the TIFFs output may be even smaller and by changing compression (see below) you may typically gain another one-digit percentage. Deviations of more than 10% in either direction are atypical (though John has shown me some formally correct G4 compressed TIFF files that had huge chunks (15%) of repetitive data inside - this can be checked in a text editor or with tiffsplit).
What conversion programs are there ?
Well there is <a href="http://c42pdf.ffii.org">c42pdf</a> from this site.
Most more general converters come as a package that installs several small command line tools:
- the <a href="http://www.libtiff.org/">libtiff-tools</a>, the important command line tools (for TIFF conversion) are tiffcp, tiff2ps, tiffdump and tiffsplit (AFAIK the best tool for TIFF handling)
- <a href="http://www.ghostscript.com/">Ghostscript</a>, the important command line tools are ps2pdf, gs or gswin32 (Win version) (AFAIK the best tool for Postscript/PDF parsing)
- <a href="http://www.imagemagick.org/">ImageMagick</a>, the important command line tool is convert (a swiss army knife for conversions of all types)
- <a href="http://www.inf.bme.hu/~pts/sam2p/">sam2p</a>, a raster converter similar to ImageMagick with fewer library dependencies; its output is sometimes smaller than that of convert.
You also may want to have a look at:
- Other free software for PDF imaging: <a href="http://sourceforge.net/projects/netpbm/">netpbm</a>, important command line tools are tifftopnm, pnmtops and pnmtotiff; <a href="http://www.foolabs.com/xpdf/">Xpdf</a> (cool pdfimages command)
- <a href="http://tumble.brouhaha.com/">tumble</a> (GPL): Tumble converts one or more TIFF (B&W only) and JPEG files into a PDF file. It was developed primarily for use with scanned images. B&W images will be encoded with Group 4 fax (T.6) lossless compression. JPEG images are embedded as-is. Tumble supports multi-page TIFF files, and can generate PDF outline entries (bookmarks) based on the file name and/or the page number.
- free-as-beer: <a href="http://www.fastio.com/">tiff2pdf</a> (the downloadable demo version adds an advertising line to the converted PDF); another shareware "tiff2pdf" can be found in the <a href="http://www.davince.com/">Davince PDF suite</a> (full version; Windows only; registration after 30 days)
- the (non-free) Acrobat Exchange conversion facilities
- APIs: <a href="http://www.pdflib.com/">PDFlib</a> (Aladdin license) now has a fast passthrough mode as well, you can use the image.c example to start with; other APIs are <a href="http://www.stillhq.com/">Panda</a> (also C, GPL; that would be my choice if I wrote c42pdf now) and <a href="http://www.reportlab.org/">ReportLab</a> (Python).
- <a href="http://www.blasum.net/holger/wri/comp/data/image/g4tweak/">g4tweak</a> (GPL): some G4-compressed scanned image manipulations (experimental, e.g. inverse functionality to c42pdf)
... but I am on Windows !
- Imagemagick and tiff2pdf have links to the Windows binaries directly on the above-mentioned respective homepages
- the libtiff-tools can currently be found as executables at stillhq.com, e.g. http://www.stillhq.com/libtiff/win32/3.5.4/tiffcp.exe and http://www.stillhq.com/libtiff/win32/3.5.4/tiff2ps.exe
- <a href="http://sourceforge.net/project/showfiles.php?group_id=1897&release_id=38186">Ghostscript</a>, recommended: gs651w32.exe
- <a href="http://gnuwin32.sourceforge.net/packages/netpbm.htm">Netpbm</a>: (1) download netpbm-9.19-bin.zip. (2) You also need <a href="http://www.cygwin.com/">Cygwin</a> (follow "Install Cygwin now") to run it in (it's fun); during the installation routine please enable (check) TIFF support. (3) You might also want to get the <a href="http://www.stillhq.com/libtiff/win32/3.5.4/libtiff.dll">libtiff only</a>; please put it into an executable path.
- of course doing the compile yourself is always an option, but please drop me a mail to c42pdf@ffii.org if any of these links becomes outdated
So many programs ! Which one shall I choose ? How shall I use it ? (Christoph)
There are several tradeoffs between ease of installation of programs and ease of the actual program use you can strike.
I: One-step conversion:
Either use <a href="http://c42pdf.ffii.org">c42pdf</a>, command:
c42pdf sample.tif
This will create a file sample.pdf. In case it gives you an error message, a more versatile tool is the convert command from ImageMagick:
convert sample.tif sample.pdf
You will be pleased to find out that convert also works in this intuitive way for many other formats.
Big limitation with ImageMagick (tested: version 5.3.9): the resulting PDFs use packbits/RunLengthDecode compression, which for some b/w images is about a factor of 10 less efficient than CCITT G4. During conversion the image data is represented as a bitmap, so it is rather memory intensive and slow ('time convert sample.tif sample.pdf': 2.105s, in contrast to 'time c42pdf sample.tif': 0.063s).
Even if you are frustrated because ImageMagick just takes too long on your machine (and proceed to the next step), keep in mind that it is a very cool tool for raster image conversion in general (e.g. GIF to PNG); only its black-and-white PDF image generation is not so good.
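The "try c42pdf first, fall back to convert" advice can be put into a small shell function. This is a sketch under one assumption that is not confirmed in this FAQ: that c42pdf exits with a non-zero status when it cannot handle a file. The function name tiff_to_pdf is hypothetical.

```shell
# Try the fast c42pdf first; fall back to ImageMagick's convert on failure.
# Assumption: c42pdf returns a non-zero exit status when it gives up.
tiff_to_pdf() {
    src=$1
    dst=${src%.tif*}.pdf
    if c42pdf -o "$dst" "$src"; then
        echo "c42pdf: $dst"
    elif convert "$src" "$dst"; then
        echo "convert: $dst"
    else
        echo "could not convert $src" >&2
        return 1
    fi
}
```

It would be run as: tiff_to_pdf sample.tif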
II.1: Ghostscript-based two-step conversion (eg c42pdf doesn't work, output size is an issue):
use tiff2ps from the libtiff-tools in connection with ps2pdf from Ghostscript (ps2pdf is just a shell script with reasonable params to gs):
tiff2ps -a sample.tif > sample.ps
ps2pdf sample.ps
My benchmark (BTW, on a 137 MHz [237 bogomips] cpu, with the sample.tif included in the source distrib of c42pdf) for this is 0.387s for first plus 0.630s for the second step.
I have so far not found a way of piping this (though I may be overlooking something obvious), so you would probably want to write a script like:
unix/Cygwin bash:
#!/bin/bash
basename=${1%.tif*}
tiff2ps -a "$1" > "${basename}.ps" && \
ps2pdf "${basename}.ps" && \
rm "${basename}.ps"
to be run as:
script sample.tif
It is of course also possible to wrap said script in a condition such as:
if ! c42pdf "$1"; then script "$1"; fi
so that c42pdf is used where it works fast and fine and something else where it doesn't; this setup would emulate the behavior of www.fastio.com's tiff2pdf.
DOS shell, not tested:
tiff2ps -a %1.tif > %1.ps
call ps2pdf %1.ps %1.pdf
del %1.ps
to be run as:
script sample
(without tif ending)
Output paper format and resolution:
On my localized linux system, the default of ps2pdf is to write A4 format, but this may be different on another machine. You can easily add an argument to ps2pdf for the papersize, e.g.:
ps2pdf -sPAPERSIZE=a4 sample.ps
Other options for PAPERSIZE could be legal, b5, etc.
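On the piping question: the ps2pdf script documents - as a valid input file name, meaning standard input, so the intermediate .ps file can probably be avoided. A minimal sketch follows; the function name tiff2pdf_pipe is hypothetical, and you should check that the ps2pdf of your Ghostscript version accepts - before relying on it.

```shell
# Convert a TIFF to PDF without an intermediate PostScript file.
# $1 = input TIFF, $2 = paper size handed to ps2pdf (a4 if omitted).
# Note: the exit status is that of ps2pdf; a tiff2ps failure goes unnoticed.
tiff2pdf_pipe() {
    src=$1
    paper=${2:-a4}
    tiff2ps -a "$src" | ps2pdf -sPAPERSIZE="$paper" - "${src%.tif*}.pdf"
}
```

It would be run as: tiff2pdf_pipe sample.tif legal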
II.2: Alternative: c42pdf-based two-step conversion:
use tiffcp from libtiff-tools in connection with c42pdf; first use tiffcp (small and fast) to bring your TIFFs into an adequate format:
tiffcp -c g4 -r 100000 sample.tif sampleg4.tif
Then run c42pdf on the new file:
c42pdf -o sample.pdf sampleg4.tif
unix/Cygwin bash:
#!/bin/bash
basename=${1%.tif*}
tiffcp -c g4 -r 100000 "$1" /tmp/temptif.tif
c42pdf -o "${basename}.pdf" /tmp/temptif.tif
DOS shell: 'SCRIPT.BAT':
tiffcp -c g4 -r 100000 %1.tif %1g4.tif
c42pdf -o %1.pdf %1g4.tif
del %1g4.tif
to be run as:
SCRIPT SAMPLE
(without tif ending!) The "del" command will obviously delete a file named sampleg4.tif, so you should make sure that there is no file sampleg4.tif in the current directory if you have a file sample.tif.
III. Three-step conversion (advanced users):
Another obvious way to go is to use tiffcp for the conversion of any TIFF document to a format c42pdf can definitely process, then do the fast throughput via c42pdf and then run Ghostscript for FlateDecode and optimized page trees on it; so in effect c42pdf is used as an intermediate link between the most powerful tool for reading TIFFs and the most powerful tool for producing PDFs. In comparison with the aforementioned Postscript-based two-step conversion, the trick is that by jumping directly into PDF the poor Postscript compression can be avoided.
unix/Cygwin bash:
#!/bin/bash
basename=${1%.tif*}
tiffcp -c g4 -r 100000 "$1" /tmp/temptif.tif
c42pdf -o /tmp/temptif.pdf /tmp/temptif.tif
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
-sOutputFile="${basename}.pdf" /tmp/temptif.pdf
My benchmark for this is 0.751s on a 466 bogomips CPU for (III) as opposed to 1.121s for (II.2) on that CPU on the tiny sample.tif. On larger files the ratio even improves to about 50% savings in computing power and disk activity of III vs II.2.
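The three-step script uses fixed names in /tmp, so two runs at the same time would overwrite each other's intermediate files. A variant with a private temporary directory could look like the sketch below. The function name tiff_to_flate_pdf is made up; the tool options are the ones used in the script above.

```shell
# Three-step conversion with private temporary files that are removed
# when the function (a subshell, note the parentheses) exits.
tiff_to_flate_pdf() (
    src=$1
    tmpdir=$(mktemp -d) || exit 1
    trap 'rm -rf "$tmpdir"' EXIT
    tiffcp -c g4 -r 100000 "$src" "$tmpdir/page.tif" &&
    c42pdf -o "$tmpdir/page.pdf" "$tmpdir/page.tif" &&
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
       -sOutputFile="${src%.tif*}.pdf" "$tmpdir/page.pdf"
)
```

It would be run as: tiff_to_flate_pdf sample.tif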
Remark on choosing compression modes in Ghostscript (advanced users)
Ghostscript version 5.50 produces a /CCITTFaxDecode PDF, whereas Ghostscript version 6.50 produces a /FlateDecode PDF.
The compression efficiency of /FlateDecode seems slightly (usually about 0-15%) better than that of /CCITTFaxDecode on random documents, so this is nothing to worry about.
If for whatever reason you want CCITTFaxDecode you can either deliberately use version 5.50 of Ghostscript, since explicit control by -dAutoFilterMonoImages=false or -dMonoImageFilter=/CCITTFaxDecode seems not to be operational in version 6.50, or apply a trick posted to <a href="http://groups.google.com/groups?hl=en&threadm=B65308C7.6CBA%25fulvio%40omniacom.net&rnum=1&prev=/groups%3Fq%3DMonoImageFilter%26hl%3Den%26rnum%3D1%26selm%3DB65308C7.6CBA%2525fulvio%2540omniacom.net">comp.lang.postscript</a>.
Ghostscript also should be able to do it directly, but I haven't figured out how yet ;-) - please tell me if you know how to do it.
Stuff that I had tried too (but not really recommended):
Postscript generation with tifftopnm and pnmtops (netpbm):
tifftopnm sample.tif > sample.pbm; pnmtops sample.pbm > sample.ps
0.664s for the first, 0.742s for the second step.
Limitation: this sometimes results in black-white inversion.
Convert any TIFF to something else (e.g. a non-multistripped TIFF) with the netpbm tools:
tifftopnm sample.tif > sample.pbm
pnmtotiff -g4 -rowsperstrip 100000 sample.pbm > sample.tif
0.606s for the first, 0.791s for the second step.
This creates a TIFF image with all rows (well, the first 100000, which is sufficient for paper sizes of less than 10 meters ;-) ) in one strip.
Limitation: this results in black-white inversion. I have also not figured out how to do this on multipage TIFFs.
Various routes from single-page to multipage (Hartmut)
All of the above-mentioned conversions from TIFF to PDF work fine where input files are already multipage TIFFs. When you want to merge single-page input files into multipage documents, conceptually you have three options for when to do it: before conversion, during conversion and after conversion.
Before conversion should be a piece of cake with libtiff and can definitely be done with the -adjoin option of convert, but if we are converting anyway, why do one useless conversion more?
During conversion: either use the c42pdf -l option for converting lists of files (see its documentation) if that is good enough for you, or slightly adapt our tiff2ps -> ps2pdf two-step conversion process.
A reasonable way of doing this (due to the bulkiness of Postscript it is probably better to delay merging until the second step) could be:
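#!/bin/bash
# miff2pdf: merge several single-page TIFFs into one PDF
if [ $# -le 2 ]
then
    echo "Usage: miff2pdf outfile infiles..."
    exit 1
fi
if [ -e tmpdir ]
then
    echo "remove tmpdir manually"
    exit 1
fi
outfile=$1
shift
mkdir tmpdir
# one Postscript file per input TIFF, kept under the same name in tmpdir
for i in "$@"
do
    tiff2ps -a "$i" > "tmpdir/$i"
done
cd tmpdir
# let Ghostscript merge all Postscript files into a single PDF
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
    -sOutputFile="../$outfile" "$@"
cd ..
rm -rf tmpdir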
My benchmark is 2.7s for running this for sample.tif and a copy of it, sample2.tif. Please do not forget to add -sPAPERSIZE to gs if you want to control that.
After conversion: not the best way to do it (one conversion more than "during"), but it should be explained in case somebody just delivers you PDFs, so this is about concatenating PDFs ...
The (recommended) fast way is to use Ghostscript:
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
-sOutputFile=output.pdf sample.pdf sample.pdf
Benchmark: 1.224s for that command. Use -sPAPERSIZE where needed.
If you want to add bookmarks or document information, this is easy, see section below: bookmarks and other pdfmarks.
A more playful way is to use eg <a href="http://www.etymon.com/pj/">pjscript</a> , a working script for concatenation would be:
#concat.pjs: A simple concat pjscript. http://www.etymon.com/pj/
#invoked: pjscript concat.pjs infile1 infile2 outfile
println "Concatenating..."
#Command line args are stored in the vars 'arg0', 'arg1', etc.
=file arg0 readpdf
=file arg1 appendpdf
=file arg2 writepdf
println "Done."
time pjscript output.pdf sample.pdf sample.pdf 4.329s
For multiple arguments let's write a little java program based on the same PJ library (LGPL):
import com.etymon.pj.*;
import com.etymon.pj.object.*;
import java.io.*;

public class PjConcat {
    public static void main (String [] args) {
        if (args.length < 2)
            System.out.println ("Usage: PjConcat outfile infiles ...");
        else try {
            // args[0] is the output file, args[1] and up are the input files
            Pdf pdf = new Pdf (args[1]);
            int filesno = args.length - 1;
            for (int i=2; i<=filesno; i++)
                pdf.appendPdfDocument(new Pdf(args[i]));
            pdf.writeToFile(args[0]);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
time java PjConcat output.pdf sample.pdf sample.pdf
6.175s
If you are interested in detailed PDF parsing, look at PJ's Pdf.appendPdfDocument as well as the pjscript source for a good point to start learning or, if you are more C-oriented, at PandaLex (http://www.stillhq.com/).
Or use something ready-made like pdcat; (expiring) demo versions of pdcat are available: <a href="http://pdf.glance.ch/eval/CLT/Win/pdcat.zip">(win)</a>, <a href="http://pdf.glance.ch/eval/CLT/Linux/pdcat.gz">(linux)</a>, <a href="http://pdf.glance.ch/eval/CLT/Sun26/pdcat.Z">solaris</a>.
Another approach to <a href="http://ktmatu.com/info/merge-pdf-files/">joining PDFs</a>.
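For the "before conversion" route, tiffcp from the libtiff-tools can itself concatenate several single-page TIFFs into one multipage TIFF, which can then be converted in one go. A minimal sketch, with merge_tiffs_to_pdf as a made-up name; it assumes that c42pdf accepts the multipage file that tiffcp writes with the options from II.2.

```shell
# Usage: merge_tiffs_to_pdf outfile.pdf page1.tif page2.tif ...
merge_tiffs_to_pdf() (
    [ $# -ge 2 ] || { echo "Usage: merge_tiffs_to_pdf outfile infiles..." >&2; exit 1; }
    outfile=$1
    shift
    tmptif=$(mktemp) || exit 1
    trap 'rm -f "$tmptif"' EXIT
    # tiffcp copies every input page, in order, into the one output file
    tiffcp -c g4 -r 100000 "$@" "$tmptif" &&
    c42pdf -o "$outfile" "$tmptif"
)
```

It would be run as: merge_tiffs_to_pdf output.pdf sample.tif sample2.tif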
Image size of PDFs
This will mainly concern engineering drawings.
Image size in PDF 1.1 and 1.2 is <a href="http://www.pdfzone.com/resources/tips/tip0031.html">limited</a> to 3240x3240 units, in PDF 1.3 it is limited to 14,400x14,400 units. <a href="http://partners.adobe.com/asn/developer/acrosdk/docs.html#filefmtspecs">PDF</a> 1.4 (in appendix C) gives the same number of 14,400x14,400 as a limitation of Acrobat Reader. Accordingly, Acrobat Reader as of version 5.05 was unable to display a sample scan of 13,568x42,438 pixels, whereas that scan could be displayed by ghostview.
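To check a scan against these limits before converting, the page size in PDF units can be estimated from its pixel dimensions. The sketch below assumes the default PDF unit of 1/72 inch and that the converter sizes the page as pixels * 72 / dpi; the function name pdf_page_fits and that sizing rule are assumptions, so verify them against the output of your converter. Shell integer arithmetic rounds the result down.

```shell
# Usage: pdf_page_fits width_px height_px dpi [limit_units]
# Prints the page size in 1/72 inch units and whether it stays within the
# limit (14400 by default, the Acrobat Reader limit mentioned above).
pdf_page_fits() {
    limit=${4:-14400}
    w=$(( $1 * 72 / $3 ))
    h=$(( $2 * 72 / $3 ))
    if [ "$w" -le "$limit" ] && [ "$h" -le "$limit" ]; then
        echo "${w}x${h} units: ok"
    else
        echo "${w}x${h} units: too large"
    fi
}
```

For example, pdf_page_fits 2480 3508 300 (an A4 page scanned at 300 dpi) prints "595x841 units: ok".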
If you plan to convert a whole repository of legacy TIFFs it would be wise to use tiffdump on some of these to see whether there are headers containing document metadata. Though this seldom occurs in practice you'd better at least check. You can also use tiffcp to check for unknown headers.
Free OCR ?
As an alternative to using Acrobat Capture, in some cases you might be interested in looking at <a href="http://jOCR.sourceforge.net">GOCR</a> or <a href="http://altmark.nat.uni-magdeburg.de/~jschulen/ocr/index.html">here</a>.
Bookmarks and other pdfmarks (Hartmut) ?
By going through a Postscript stage you can utilize the pdfmark kludge for Postscript (well documented by Thomas Merz, <a href="http://www.pdflib.com/pdfmark/">pdfmark Primer</a>).
The cleanest way to do this is to create another small file, e.g. called "pdfmarks".
Into it we write something like:
[ /Page 1 /View [/XYZ 0 842 1.0] /Title (1stpage) /OUT pdfmark
[ /Page 2 /View [/XYZ 0 842 1.0] /Title (2ndpage) /OUT pdfmark
[ /PageMode /UseOutlines /DOCVIEW pdfmark
/OUT is outline, a bookmark in pdfmark-speak. The last line makes sure bookmarks are displayed on startup. And you might want to add document info as well:
[ /Title (A guide to TIFF-PDF conversion)
/Author (Holger Blasum)
/Creator (lousy software)
/Keywords (who knows)
/ModDate (D:20011004101012)
/DOCINFO pdfmark
Now you just invoke ghostscript with the infile plus the new "pdfmarks" file:
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
-sOutputFile=withmarks.pdf withoutmarks.pdf pdfmarks
This adds pdfmark annotations to an existing PDF document. You can also use multiple documents, it actually doesn't matter when ghostscript receives the additional information in pdfmark. And of course "withoutmarks" can also be in Postscript format. Use -sPAPERSIZE where needed.
For much more detail see the above-mentioned pdfmark Primer (written at a time when gs had not yet supported pdfmark (an Acrobat Distiller "standard") and pdfmarks had to be embedded in PS but that is irrelevant here).
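For documents with many pages the "pdfmarks" file can be generated instead of typed. The following sketch writes one bookmark per page in the format shown above; the function name make_pdfmarks and the "Page N" titles are my own choices, and the 842 in /View is the A4 page height in points taken over from the example, so adjust it for other paper sizes.

```shell
# Usage: make_pdfmarks number_of_pages > pdfmarks
# Writes one bookmark per page plus the line that opens the outline pane.
make_pdfmarks() {
    n=1
    while [ "$n" -le "$1" ]; do
        printf '[ /Page %d /View [/XYZ 0 842 1.0] /Title (Page %d) /OUT pdfmark\n' "$n" "$n"
        n=$((n + 1))
    done
    printf '[ /PageMode /UseOutlines /DOCVIEW pdfmark\n'
}
```

The resulting file is passed to gs after the input PDF, exactly like the hand-written one.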
Single files to single files: Any way to convert multiple input files (all files in a directory) into individual output files with the same name (new extension) (Greg) ?
Converting all files in a directory can be achieved at the operating system's shell level:
Unix solution:
for A in *.tif; do c42pdf "$A"; done
DOS/Win solution ("MS-DOS command line prompt"):
for %f in (*.tif) do c42pdf %f
Another, more powerful approach (e.g. for converting all *.tif files on an entire drive recursively, including subdirs, in a way that is also robust for "long" filenames) would be to download the Cygwin GNU tools (http://www.cygwin.com/) and, from the freshly installed Cygwin bash shell, run the following in the directory whose subdirs you want converted:
find . -name '*.tif' -exec c42pdf '{}' ';'
I'd rather convert from PDF to TIFF.
This is an easy one, use Ghostscript:
gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r300 -sPAPERSIZE=a4 -sOutputFile=mypdf.tif mypdf.pdf
-r is the resolution; not using it would default to 204x196 (standard fax resolution) for TIFFs.
If you'd prefer multistrip, -dMaxStripSize=8192 would be an option.
Or simply use ImageMagick:
convert mypdf.pdf mypdf.tif
Or use pdfimages (Xpdf) to convert to JPEG, PBM or PPM.
I still have questions on PDF
<a href="http://www.stillhq.com/cgi-bin/getpage?area=ctpfaq&page=index.htm">PDF FAQ</a>
<a href="http://www.pdfzone.com/">PDFzone</a> has a catalogue of PDF software vendors and distributors
<a href="http://www.stillhq.com/">Stillhq</a> and <a href="http://www.pdflib.com/">Thomas Merz</a> offer commercial consulting
Acknowledgments:
To Christoph Schulze, John A Kunze, James Y Hope, Hartmut Pilch, Greg Falvo, Jimmy Ngo, Dan Cogliano, Bill Gilchrist, Eric Smith for comments and questions.
GhostScript Examples
Ghostscript PDF to PNG/TIF
gswin32c.exe -dNOPAUSE -dBATCH -r300 -sDEVICE=pnggray -sOutputFile="test.png" "TestFile.PDF"
gswin32c.exe -dNOPAUSE -dBATCH -r300 -sDEVICE=pnggray -sOutputFile="c:\test.png" "c:\internet\Eudora\attach\12-30-2014 6-57 PM.pdf"
copy "c:\internet\Eudora\attach\12-30-2014 6-57 PM.pdf" .\
gswin32c.exe -dNOPAUSE -dBATCH -r300 -sDEVICE=pnggray -sOutputFile="test.png" "test.pdf"
the map file cidfmap was not found
gswin32c.exe -s OutputFile="test.png" "test.pdf"
gswin32c.exe -q -dNOPAUSE -sDEVICE=pngalpha -r300 -dEPSCrop -sOutputFile=test2.png test.pdf
gswin32c.exe -q -sOutputFile="test3.png" "test3.pdf"
gswin32c.exe -dNOPAUSE -dBATCH -sDEVICE=pnggray -sOutputFile="test.png" "test3.pdf"