Monday, June 4, 2012

How to convert scanned images to pdf

From time to time I need to convert scanned documents to pdf.

Usage scenario 1: I scan part of a book (e.g. an article) on a school's scanner that sends me 10 big separate color pdf files (one pdf per page). I want to get one nice, small (black and white) pdf file with all the pages.

Usage scenario 2: I download a web form, print it, fill it in, sign it, scan it on my own scanner using Gimp and now I want to convert the image into a nice pdf file (either color or black & white) to send back over email.

Solution: I save the original files (be they pdf or png) into a folder and use git to track them. Then I create a simple reduce script to convert them to the final format (view it as a pipeline). Often I need to tweak one or two parameters in the pipeline.

Here is a script for scenario 1:

And here for scenario 2:

There can be several unexpected surprises along the way. From my experience:

  • If I convert png directly to tiff, the resolution sometimes comes out wrong. The solution is to always convert to ppm (color) or pbm (black and white) first; these are simple file formats containing just the raw pixels. This is the "starting" format: first I convert the initial pdf or png into ppm/pbm, and only then do anything else. That proved to be very robust.
  • The tiff2pdf utility proved to be the most robust way to convert an image to a pdf. All other ways that I have tried failed in one way or another (resolution, positioning, paper format and other things were wrong). It can create multi-page pdf files, set the paper format (US Letter, A4, ...) and so on.
  • The convert utility (from ImageMagick) is a robust tool for cropping images, converting color to black and white (using a threshold, for example) and other things, as long as the image is first converted to ppm/pbm. In principle it can also produce pdf files, but that didn't work well for me.
  • I sometimes use the unpaper program in the pipeline for some automatic polishing of the images.

In general, I am happy with my solution. So far I have always been able to get what I needed using this "pipeline" method.
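
As a rough illustration of the pipeline described above for scenario 1 (this is a sketch, not the original script), the steps could look like the following. It assumes that pdftoppm (from poppler-utils), ImageMagick's convert, and tiffcp/tiff2pdf (from libtiff) are installed. The function name scans_to_pdf, the DRY_RUN switch, and the defaults (300 DPI, 50% threshold, letter paper) are made up for this example and would need tweaking for real scans.

```shell
# Sketch only: one-page pdf scans -> ppm -> black-and-white tiff -> one pdf.
# DPI, THRESHOLD, PAPER and DRY_RUN are illustrative knobs, not standard names.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: scans_to_pdf OUTPUT.pdf PAGE1.pdf [PAGE2.pdf ...]
# The body runs in a subshell, so variables and "set -e" stay local.
scans_to_pdf() (
    set -e
    [ "$#" -ge 2 ] || { echo "usage: scans_to_pdf OUT.pdf IN.pdf..." >&2; exit 2; }
    out=$1
    shift
    dpi=${DPI:-300}
    threshold=${THRESHOLD:-50%}
    paper=${PAPER:-letter}

    tmp=$(mktemp -d)
    trap 'rm -rf "$tmp"' EXIT

    n=0
    for src in "$@"; do
        n=$((n + 1))
        page=$(printf '%s/page-%04d' "$tmp" "$n")
        # Step 1: render the page to raw pixels (writes "$page.ppm").
        run pdftoppm -r "$dpi" -singlefile "$src" "$page"
        # Step 2: grayscale, threshold to black and white, and write a
        # Group4-compressed tiff that records the resolution.
        run convert "$page.ppm" -colorspace Gray -threshold "$threshold" \
            -density "$dpi" -units PixelsPerInch -compress Group4 "$page.tiff"
    done

    # Collect the per-page tiff names in page order.
    set --
    i=1
    while [ "$i" -le "$n" ]; do
        set -- "$@" "$(printf '%s/page-%04d.tiff' "$tmp" "$i")"
        i=$((i + 1))
    done

    # Step 3: join the pages into one multi-page tiff, then wrap it in a pdf.
    run tiffcp "$@" "$tmp/all.tiff"
    run tiff2pdf -p "$paper" -o "$out" "$tmp/all.tiff"
)

# Example (prints the commands without running them):
#   DRY_RUN=1 scans_to_pdf article.pdf scan1.pdf scan2.pdf
```

Running it with DRY_RUN=1 first shows exactly which commands would be executed, which is handy when tweaking one parameter (such as the threshold) at a time.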


Anonymous said...

For combining pdf files, you can also use pdftk "pdftk *.pdf cat output combined.pdf" but maybe it does not do exactly what you want?

When converting images to pdf files, I always had the problem that the conversion would reencode the image (and thereby lose quality or blow up the file size), so I wrote this python script to always preserve quality by not reencoding the image when embedding it into the pdf. It can also output a multipage pdf if given multiple images. Maybe you find it useful.

hyperair said...

I actually did "convert foo.png foo.pdf" with no options yesterday, and it just worked. I didn't have to convert it into any intermediary formats as you mentioned, though.

Peter Keel said...

pdfjoin --outfile foo.pdf *.jpg

Will do the job _much_ faster. Plus it works for hundreds of megabytes of jpeg-files.

dimpase said...

Big scanners usually can do multipage pdf files, although the interface might be pretty bad. E.g. our school scanner only does multipage when you tell it that you do a double sided scan.

And then, for longer scanned texts djvu format is much more compact...

Michael Below said...

I like to use the djvu tools for your first scenario. Converting to djvu and then exporting as "cleaned" b/w pdf results in the best quality/size ratio for text, IMHO.

Thomas Koch said...

I currently need a sponsor for an update of the unpaper package:

Would you mind?

ALchEmiXt said...

How about the other way around, from hq pdf to hq png or jpg?

Peter Keel said...

The other way round? If you only need to extract images from a pdf, then it's "pdfimages -j foo.pdf"

Or if you want to make some kind of screenshot/thumbnail, the fastest is "pdftoppm -jpeg bar.pdf". It also produces tiff or png.

Ondřej Čertík said...

Hi, thanks everybody for your comments! It's nice to have all the options in one place. Today I tried to convert directly with:

convert p1.png p2.png p3.png -colors 16 output.pdf

(This automatically makes the pdf size smaller by reducing the colors.)
It produces a US letter format and it seems to work. However, when I try to force the paper size with the "-page" switch (or use the A4 format), it produces a pdf with a paper size of 2x2 inches...

Thomas -- I am still finishing my DD application process, hopefully I'll finish it soon. After that, I'll be happy to upload unpaper anytime. Sorry about that.