Usage scenario 1: I scan part of a book (e.g. an article) on a school scanner that sends me 10 large, separate color pdf files (one pdf per page). I want to end up with one nice, small black-and-white pdf file containing all the pages.
Usage scenario 2: I download a web form, print it, fill it in, sign it, and scan it on my own scanner using Gimp. Now I want to convert the image into a nice pdf file (either color or black and white) to send back over email.
Solution: I save the original files (pdf or png) into a folder and track it with git. Then I write a simple script that reduces them to the final format (think of it as a pipeline). Often I only need to tweak one or two parameters in the pipeline.
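As a concrete sketch of this setup (the directory and script names here are placeholders I chose for illustration, not the ones from the scripts below), it might look like:

```shell
# Hypothetical layout: original scans live under orig/, the pipeline
# regenerates everything else, so only the inputs and the script are tracked.
mkdir -p scan-project/orig
cd scan-project
git init -q
printf '%s\n' '#!/bin/bash' 'set -e' '# pipeline steps go here' > reduce.sh
chmod +x reduce.sh
# tmp/ holds regenerable intermediates, so keep it out of version control:
printf 'tmp/\n' > .gitignore
git add .gitignore reduce.sh
```

Re-running the pipeline after tweaking a parameter is then just `./reduce.sh`, and git makes it safe to experiment because the originals are never modified.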
Here is a script for scenario 1:
#!/bin/bash
# Creates a very small black-and-white pdf file.
set -e
mkdir -p tmp
for f in orig/*.pdf; do
    filename=$(basename "$f")
    base="${filename%.*}"
    echo "pdf -> ppm images: $base"
    pdfimages "$f" tmp/b
    mv tmp/b-000.ppm "tmp/$base.ppm"
    convert -crop 1600x2632+1544+64 "tmp/$base.ppm" "tmp/crop-$base-01.ppm"
    convert -crop 1600x2632+3224+72 "tmp/$base.ppm" "tmp/crop-$base-02.ppm"
done

# Manual correction: remove the first and last pages:
rm tmp/crop-a00-01.ppm
rm tmp/crop-a12-02.ppm

for f in tmp/crop-*.ppm; do
    filename=$(basename "$f")
    base="${filename%.*}"
    echo "ppm -> tiff: $base"
    # This:
    #convert -monochrome -threshold 60% "tmp/$base.ppm" "tmp/$base.tiff"
    # produces a much worse result than this:
    convert -threshold 60% "tmp/$base.ppm" "tmp/$base.pbm"
    unpaper -s letter --no-border-align "tmp/$base.pbm" "tmp/u$base.pbm"
    ppm2tiff "tmp/u$base.pbm" "tmp/$base.tiff"
done

# Create a pdf document
tiffcp tmp/crop-*.tiff tmp/article.tiff
tiff2pdf -p letter -F tmp/article.tiff -o article.pdf
And here is the script for scenario 2:
#!/bin/sh
set -e

# The form (black and white):
convert -threshold 80% form.png form.pbm
convert form.pbm form.tiff
tiff2pdf -p letter -F form.tiff -o form.pdf

# The card (color):
convert card.png card.ppm
convert card.ppm card.tiff
tiff2pdf -p letter -r o -u i -x 5 -y 5 card.tiff -o card.pdf
There were several unexpected surprises along the way. From my experience:
- If I convert png directly to tiff, the resolution sometimes comes out wrong. The solution is to always convert to ppm (color) or pbm (black and white) first; these are simple file formats containing the raw pixels. This is the "starting" format (so first I convert the initial pdf or png into ppm/pbm) and only then do everything else. That proved to be very robust.
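The ppm/pbm formats really are that simple, which is part of why they make such a robust intermediate: a short header followed by the pixel values. As an illustration (file names are mine, not from the scripts above), these two hand-written files are valid images:

```shell
# A 4x2 black-and-white image in plain pbm: "P1", width and height,
# then one value per pixel (1 = black, 0 = white).
printf 'P1\n4 2\n1 0 1 0\n0 1 0 1\n' > checker.pbm

# A 2x1 color image in plain ppm: "P3", width and height, the maximum
# channel value, then one "R G B" triple per pixel (here red, then blue).
printf 'P3\n2 1\n255\n255 0 0\n0 0 255\n' > redblue.ppm
```

Tools like pdfimages and convert typically emit the binary variants of these formats (P4/P6 headers) rather than the ASCII ones above, but the structure is the same.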
- The tiff2pdf utility proved to be the most robust way to convert an image to a pdf. Every other method I tried failed in one way or another (wrong resolution, positioning, paper format, ...). It can create multi-page pdf files, set the paper format (US Letter, A4, ...) and so on.
- ImageMagick's convert utility is a robust tool for cropping images, converting color to black and white (using a threshold, for example) and other operations, as long as the image is first converted to ppm/pbm. In principle it can also produce pdf files directly, but that did not work well for me.
- I sometimes use the unpaper program in the pipeline for some automatic polishing of the images.
In general, I am happy with this solution. So far I have always been able to get what I needed using this "pipeline" method.