Thursday, 23 June 2011

Splitting PDF pages in two

Been a while. To fill the void, here's something I finally worked out how to do. "Do what?" you ask. I'll try to explain, though my inability to describe the problem is probably why it took me so long to Google an answer.

Sometimes, you might find yourself with a document where two pages of real document are on each page of the document file. Like if you scanned two A5 sheets onto one A4 sheet. The document in question is typically a scan of something like, say, the 1989 IAU Style Manual. My quest, with such documents, is to separate the pages; to take the double-page layout and split it in two; to separate those A5 sheets from their doubled-up A4 version; or to go from the first screenshot to the second, subtly different, one...

To refine the problem slightly, I'm firstly presuming you're working with a PDF. Most documents should be circulated in this format but you can probably print to a PDF anyway. Secondly, I'm working on the Linux command line. This should work wherever the standard tools I use are present and will probably work on Macs, since OS X is Unix-based. If you know how to do this in Windows, let me know in the comments. Finally, just before you tell me that this is easy, I'm not paying for any software.

The real work here is done by a tool called Unpaper. It's capable of much more and I invite you to check out the documentation to see what other tricks are possible. It can be downloaded as a binary (navigate to /bin/ in the tarball), so you don't need root permissions to install it. Given that it runs here, I guess the binary must be compiled for 32-bit x86. Other architectures might require compilation from source.

Unpaper works with Portable Bitmap (PBM) files, so the first thing we need is to extract such images from the PDF using pdfimages.

pdfimages in.pdf in

This tool is part of the Xpdf package, which is itself bundled in just about every major Linux distro, as far as I know. It produces a set of files with names like in-012.pbm, where 012 is the number of the extracted image. For a scan with one image per page, that follows the page order, though pdfimages starts counting at 000. Unpaper can now get cracking. Following the example given in the Unpaper documentation,

unpaper --layout double --output-pages 2 in-%03d.pbm out-%03d.pbm
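Once that command has run, a quick sanity check is to confirm that every input sheet really did produce two output pages. The following is only a sketch: count_files and check_split are names made up for illustration, not part of Unpaper, and it assumes the in-*.pbm and out-*.pbm files all sit in the current directory.

```shell
# count_files and check_split are illustrative helpers, not part of any tool.

# Print how many arguments name existing files. The shell expands the glob
# before the call; a glob that matches nothing is passed through literally,
# which the -e test catches.
count_files() {
    [ -e "$1" ] || { echo 0; return 0; }
    echo "$#"
}

# Succeed only if there are exactly twice as many out-*.pbm as in-*.pbm.
check_split() {
    in_count=$(count_files in-*.pbm)
    out_count=$(count_files out-*.pbm)
    echo "$in_count input sheets, $out_count output pages"
    [ "$out_count" -eq $((2 * in_count)) ]
}
```

Running check_split in the working directory prints the two counts and returns a non-zero status if they don't match, which would suggest that a sheet was skipped or not split.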

The %03d is the placeholder for the zero-padded, three-digit numbers in the filenames. This will, unsurprisingly, produce twice as many output files. We now want to combine these PBM files back into a PDF. There might be a shortcut but I accomplish this by converting the PBMs to TIFFs, combining the TIFFs into one file, and converting that to a PDF. So, the first step is

ls out-*.pbm | xargs -I {} ksh -c 'pnmtotiff {} > {}.tiff'

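where I've used xargs to pass each PBM to pnmtotiff, giving files with names like out-001.pbm.tiff (ksh is just my shell; sh or bash should work equally well there). I warn you that pnmtotiff might be deprecated in your version of Netpbm, in which case pamtotiff should do it. Back on track, no-one really uses TIFFs, so as long as your pages are the only TIFFs around, you can combine them with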

tiffcp *.tiff out.tiff

and finally convert to PDF with

tiff2pdf -z -o out.pdf out.tiff

where the -z flag indicates zip compression. That should be it!
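For repeated use, the steps above could be chained into a single shell function. This is a minimal sketch rather than a tested tool: split_pdf and its temporary work directory are hypothetical names, the commands and flags are exactly the ones used in this post, and a plain for loop stands in for the xargs line. It assumes pdfimages, unpaper, pnmtotiff, tiffcp and tiff2pdf are all on the PATH.

```shell
# split_pdf is an illustrative wrapper around the commands from this post.
# Usage: split_pdf in.pdf out.pdf
split_pdf() {
    in_pdf=$1
    out_pdf=$2
    work=$(mktemp -d) || return 1    # scratch space for the intermediates

    # 1. Extract the scanned images as in-NNN.pbm.
    pdfimages "$in_pdf" "$work/in" || return 1

    # 2. Split each double-page sheet into two pages.
    unpaper --layout double --output-pages 2 \
        "$work/in-%03d.pbm" "$work/out-%03d.pbm" || return 1

    # 3. Convert each page to TIFF (swap in pamtotiff if pnmtotiff is missing).
    for f in "$work"/out-*.pbm; do
        pnmtotiff "$f" > "${f%.pbm}.tiff" || return 1
    done

    # 4. Combine the TIFFs and convert the result to a zip-compressed PDF.
    tiffcp "$work"/out-*.tiff "$work/all.tiff" || return 1
    tiff2pdf -z -o "$out_pdf" "$work/all.tiff" || return 1

    # Only reached on success; after a failure the files stay for inspection.
    rm -r "$work"
}
```

Keeping the intermediates in their own directory also means the tiffcp glob only ever sees the pages from this run, whatever other TIFFs happen to be lying around.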

One note I will make is that this seemed to use quite a lot of space. I say this as someone who has no problem with 10GB of user data space, so it probably won't worry anyone else. Regardless, it's still worth pointing out that working with a 5MB PDF file generated PBMs and TIFFs adding up to several hundred MB in each format, so the intermediate files took up the better part of a GB while they all existed at once. I warned you.
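The size is less surprising once you work out what an uncompressed bitmap costs. A raw PBM stores one bit per pixel, with each row padded to a whole number of bytes, plus a short header. Here is a rough estimate in shell arithmetic; pbm_bytes is a made-up helper, and the 2480 by 3508 pixel dimensions (about A4 at 300 dpi) are an assumption for illustration, not measured from the original scan.

```shell
# pbm_bytes is an illustrative helper: approximate size of a raw PBM image
# body (header excluded). Each row is packed 8 pixels per byte, rounded up.
pbm_bytes() {
    # $1 = width in pixels, $2 = height in pixels
    echo $(( (($1 + 7) / 8) * $2 ))
}

# Assumed example: one A4 sheet at about 300 dpi.
pbm_bytes 2480 3508    # prints 1087480, i.e. roughly 1MB per sheet
```

At about 1MB per sheet under those assumptions, a few hundred sheets would already reach several hundred MB, which is consistent with the figures above.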



