Monday, June 4, 2012

How to convert scanned images to pdf

From time to time I need to convert scanned documents to pdf.


Usage scenario 1: I scan part of a book (e.g. an article) on a school scanner, which sends me 10 big, separate color pdf files (one pdf per page). I want to get one nice, small (black and white) pdf file with all the pages.


Usage scenario 2: I download a web form, print it, fill it in, sign it and scan it on my own scanner using GIMP. Now I want to convert the image into a nice pdf file (either color or black & white) to send back over email.

Solution: I save the original files (be they pdf or png) into a folder and use git to track them. Then I create a simple reduce script that converts them to the final format (think of it as a pipeline). Often I need to tweak one or two parameters in the pipeline.
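
Since the parameters that need tweaking tend to be the same few (the threshold, the paper size), one option is to gather them at the top of the reduce script, with defaults that can be overridden from the environment. This is only a minimal sketch: the variable names THRESHOLD and PAPER and the helper threshold_arg are hypothetical and do not appear in the scripts below.

```shell
#!/bin/sh
# Hypothetical tunables (illustrative names); override them from the
# environment, e.g.  THRESHOLD=70 ./reduce.sh
THRESHOLD="${THRESHOLD:-60}"   # black/white cutoff in percent
PAPER="${PAPER:-letter}"       # paper size, meant for tiff2pdf -p

# Print the threshold in the form that convert -threshold takes, e.g. "60%".
threshold_arg() {
    printf '%s%%' "$THRESHOLD"
}

echo "threshold: $(threshold_arg), paper: $PAPER"
```

The pipeline steps would then use "$(threshold_arg)" and "$PAPER" instead of hard-coded values, so a tweak is a one-line change (or no change at all, when set from the command line).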

Here is a script for scenario 1:
#! /bin/bash
# Creates a very small black and white pdf file.
set -e
mkdir -p tmp
for f in orig/*.pdf; do
    filename=$(basename "$f")
    base="${filename%.*}"    # file name without the extension
    echo "pdf -> ppm images: $base"
    # pdfimages writes the extracted image to tmp/b-000.ppm
    pdfimages "$f" tmp/b
    mv tmp/b-000.ppm "tmp/$base.ppm"
    # Cut two regions out of each scan (geometry: WIDTHxHEIGHT+X+Y)
    convert -crop 1600x2632+1544+64 "tmp/$base.ppm" "tmp/crop-$base-01.ppm"
    convert -crop 1600x2632+3224+72 "tmp/$base.ppm" "tmp/crop-$base-02.ppm"
done
# Manual correction: remove the first and last pages:
rm tmp/crop-a00-01.ppm
rm tmp/crop-a12-02.ppm
for f in tmp/crop-*.ppm; do
    filename=$(basename "$f")
    base="${filename%.*}"
    echo "ppm -> tiff: $base"
    # This
    #convert -monochrome -threshold 60% tmp/$base.ppm tmp/$base.tiff
    # produces a much worse result than this:
    convert -threshold 60% "tmp/$base.ppm" "tmp/$base.pbm"
    unpaper -s letter --no-border-align "tmp/$base.pbm" "tmp/u$base.pbm"
    ppm2tiff "tmp/u$base.pbm" "tmp/$base.tiff"
done
# Create a pdf document
tiffcp tmp/crop-*.tiff tmp/article.tiff
tiff2pdf -p letter -F tmp/article.tiff -o article.pdf
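
The crop arguments in the script above follow ImageMagick's geometry form WIDTHxHEIGHT+XOFFSET+YOFFSET: the script cuts two regions of the same size out of each scan, at different offsets. These are the values most likely to need adjusting for a different book or scanner. As a sketch of how the strings could be built from separate numbers rather than typed by hand (the geometry helper is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: build a geometry string WIDTHxHEIGHT+X+Y
# from four numbers, for use with convert -crop.
geometry() {
    printf '%sx%s+%s+%s' "$1" "$2" "$3" "$4"
}

# Values taken from the scenario 1 script.
width=1600
height=2632
geometry "$width" "$height" 1544 64; echo   # first region
geometry "$width" "$height" 3224 72; echo   # second region
```

A crop line would then read convert -crop "$(geometry "$width" "$height" 1544 64)" ..., so that changing the page size means editing only two variables.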

And here is one for scenario 2:
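#! /bin/sh
set -e
# Black and white form: png -> pbm -> tiff -> pdf
convert -threshold 80% form.png form.pbm
convert form.pbm form.tiff
tiff2pdf -p letter -F form.tiff -o form.pdf
# Color card: png -> ppm -> tiff -> pdf
convert card.png card.ppm
convert card.ppm card.tiff
# -r o: use the -x/-y resolution instead of the image's own; -u i: units are inches
tiff2pdf -p letter -r o -u i -x 5 -y 5 card.tiff -o card.pdf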

A few surprises can turn up along the way. In my experience:

  • If I convert png directly to tiff, the resolution sometimes comes out wrong. The solution is to always convert to ppm (color) or pbm (black and white) first; these are simple file formats containing just the raw pixels. This is the "starting" format: first I convert the initial pdf or png into ppm/pbm, and only then do anything else. That has proved to be very robust.
  • The tiff2pdf utility proved to be the most robust way to convert an image to a pdf. All the other ways that I tried failed in one way or another (the resolution, positioning, paper format or other things were wrong). It can create multi-page pdf files, set the paper format (US Letter, A4, ...) and so on.
  • The convert utility (part of ImageMagick) is a robust tool for cropping images, converting color to black and white (using a threshold, for example) and other things, as long as the image is first converted to ppm/pbm. In principle it can also produce pdf files, but that didn't work well for me.
  • I sometimes use the unpaper program in the pipeline for some automatic polishing of the images.
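
Put together, the points above amount to a fixed chain: source image, then pbm, then tiff, then pdf. Here is a sketch of that chain as a shell function, using only the commands from the scenario 2 script. The function name to_bw_pdf and the DRY_RUN switch are hypothetical additions; the switch lets the command sequence be printed and inspected before anything is run.

```shell
#!/bin/sh
# Hypothetical wrapper: with DRY_RUN=1 the command is only printed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# to_bw_pdf INPUT [THRESHOLD]: black and white pdf via pbm and tiff.
to_bw_pdf() {
    src=$1
    base=${src%.*}               # file name without the extension
    threshold=${2:-80}           # percent, as in the scenario 2 script
    run convert -threshold "${threshold}%" "$src" "$base.pbm"
    run convert "$base.pbm" "$base.tiff"
    run tiff2pdf -p letter -F "$base.tiff" -o "$base.pdf"
}

# Print the commands for form.png without running them.
DRY_RUN=1
to_bw_pdf form.png
```

With DRY_RUN unset or set to 0, the same call runs the three commands for real, so ImageMagick and the libtiff tools must be installed.
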
In general, I am happy with my solution. So far I have always been able to get what I needed using this "pipeline" method.