(?) Combining multiple PDFs into one

From Faber Fedor

Answered By Ben Okopnik, Yann Vernier


From the chaos of creation
just the final form survives
-- The World Inside The Crystal, Steve Savitsky
We could have just posted the finished script in 2c tips, but there are juicy Perl bits to learn from the crafting. Enjoy. -- Heather

(?) Hey Gang,

I was playing with my new scanner last night (under a legacy OS unfortunately) when I realized a shortcoming: I wanted all of the scanned pages to be in one PDF file, not in separate ones. Well, to that end, I threw together this quick and dirty Perl script to do just that.

The script assumes you have Ghostscript and pdf2ps installed. It takes two arguments: the name of the output file and a directory name that contains all of the PDFs (which have .pdf extensions) to be combined, e.g.

    ./combine-pdf.pl test.pdf test/

I'm sure you can point out many flaws with the script (like how I grab the command line arguments and clean up after myself), but that's why it's "quick and dirty". If/when I clean it up, I'll repost it.

See attached combine-pdf-faber1.pl.txt

(!) [Ben] If you don't mind, I'll toss in some ideas. :) See my version at the end.
#!/usr/bin/perl -w

use strict;
Good idea on both.
# n21pdf.pl: A quick and dirty little program to convert multiple PDFs
# to one PDF; requires pdf2ps and Ghostscript
# written by Faber Fedor (faber@linuxnj.com) 2003-05-27

if (scalar(@ARGV) != 2 ) {
You don't need 'scalar'. The comparison operator already imposes scalar context, and an array in scalar context returns the number of its elements, so "if ( @ARGV != 2 )" works fine.

(?) Okay. I was trying to get ptkdbi (my fave Perl debugger) to show me the scalar value of @ARGV and the only way was with scalar(). That's also what I found in the Perl Bookshelf.

(!) [Ben] This is the same as "$foo = @foo". $foo is going to contain the number of elements in @foo.
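A minimal sketch of the scalar-context point above. The array @args is a made-up stand-in for @ARGV so the snippet runs without command-line arguments:

```perl
use strict;
use warnings;

# Hypothetical stand-in for @ARGV.
my @args = ('test.pdf', 'test/');

# Assigning an array to a scalar yields its element count.
my $count = @args;

# A numeric comparison imposes the same scalar context, so scalar() is optional.
my $explicit = scalar(@args) == 2 ? 'yes' : 'no';
my $implicit = @args == 2         ? 'yes' : 'no';

print "count=$count explicit=$explicit implicit=$implicit\n";
```

Both tests agree, which is why the explicit scalar() is only needed where the context would otherwise be a list (inside a print, for instance).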
my $PDFFILE = shift ;
my $PDFDIR = shift;
You could also just do
my ( $PDFFILE, $PDFDIR ) = @ARGV;
Combining declaration and assignment is perfectly valid.

(?) Cute. I'll have to remember that.
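For comparison, here is a small sketch of the two ways of pulling the arguments out. The names @args, @copy, $file_a and so on are illustrative only and do not appear in either script:

```perl
use strict;
use warnings;

# Hypothetical argument list standing in for @ARGV.
my @args = ('test.pdf', 'test/');

# Two shifts consume the array one element at a time...
my @copy   = @args;
my $file_a = shift @copy;
my $dir_a  = shift @copy;

# ...while a list assignment declares and fills both variables at once,
# leaving the source array untouched.
my ($file_b, $dir_b) = @args;

print "$file_a $dir_a | $file_b $dir_b | left in \@args: ", scalar(@args), "\n";
```

The practical difference is that shift empties the array as it goes, while the list assignment leaves it available for later use.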

(!) [Ben]
chomp($PDFDIR);
No need; the "\n" isn't part of @ARGV.
$PDFDIR = $PDFDIR . '/' if substr($PDFDIR, length($PDFDIR)-1) ne '/';
Yikes! You could just say "$PDFDIR .= '/'"; an extra slash doesn't hurt anything (part of the POSIX standard, as it turns out).

(?) I know, but I really don't like seeing "a_dir//a_file". I always expect it to fail (although it never does). :-)

(!) [Yann] I'm no Perlist myself, but my first choice would be: $foo =~ s%/*$%/%;
Which simply ensures that the string ends with exactly one /.
(!) [Ben] <grin> That's one of the ten most common "Perl newbie" mistakes that CLPM wizards listed: "Using s/// where tr/// is more appropriate." When you're substituting strings, think "s///"; for characters, go with "tr///".
tr#/##s
Better yet, just ignore it; multiple slashes work just fine.
(!) [Yann] I did say I'm no perlist. Tr to me would be the translation tool, for replacing characters, including deletion.
(!) [Ben] Yep; that's exactly what it does. However, even the standard utils "tr" can _compress_ strings - which is exactly what was needed here (note the "s"queeze modifier at the end.)
(!) [Yann] Thank you. It's a modifier I had not learned but should have noticed in your mail. The script would have to tack a / onto the end of the string before doing that tr.
(!) [Ben] You're welcome. Yep, either that or use the globbing mechanism the way I did; it eliminates all the hassle.
for ( <$dir/*pdf> ){

=pod
	Here's the beef, Granny! :)

	All you get here are the specified files as returned by "sh".
	You could also use the actual "glob" keyword which is an alias for the
	internal function that implements <shell_expansion> mechanism.
=cut

	# Mung individual PDF to heart's content
	...

}
(!) [Yann] I don't know how to apply it to the end of the string, which is very easy given a regular expression as the substitute command uses. I'm more used to dealing with sed. Remember, the input data may well look like "/foo/bar/" and not just "bar/".
(!) [Ben] You can't apply it to the end of the string, but then I'd imagine Faber would be just as unhappy with ////foo/////bar////. "tr", as above, will regularize all of that.
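The two approaches discussed in this exchange can be tried side by side. This is only a sketch; the sample paths are invented for the demonstration:

```perl
use strict;
use warnings;

# Hypothetical messy path for illustration.
my $path = '////foo/////bar////';

# tr with the s (squeeze) modifier collapses every run of slashes to one.
(my $squeezed = $path) =~ tr#/##s;

# The substitution only normalizes the end of the string: it replaces the
# trailing run of slashes (possibly empty) with exactly one.
(my $trailing = '/foo/bar') =~ s%/*$%/%;

print "$squeezed\n$trailing\n";
```

Both lines print /foo/bar/. The tr version cleans up the whole path but cannot add a missing final slash; the s/// version guarantees the final slash but leaves any doubled slashes earlier in the path alone.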
(!) [Ben]
opendir(DIR, $PDFDIR) or die "Can't open directory $PDFDIR: $! \n" ;
Take a look at "perldoc -f glob" or read up on the globbing operator <*.whatever> in "I/O Operators" in perlop. "opendir" is a little clunky for things like this.
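A rough sketch of the globbing approach, assuming a throwaway directory created with File::Temp and made-up file names, so it does not depend on any real PDFs:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Throwaway directory with hypothetical file names, just for the demo.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(page1.pdf page2.pdf notes.txt)) {
    open my $fh, '>', "$dir/$name" or die "Can't create $dir/$name: $!\n";
    close $fh;
}

# glob returns the matching paths with the directory part already attached,
# so there is no opendir/readdir/closedir bookkeeping and no manual filtering.
# Caveat: the pattern is split on whitespace, so this assumes $dir has no spaces.
my @pdfs = glob("$dir/*.pdf");

print "$_\n" for @pdfs;
```

With opendir you would get bare file names and would have to filter for the extension and prepend the directory yourself.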
     `$PDF2PS $file $outfile` ;
Don't use backticks unless you want the STDOUT output from the command you invoke. "system" is much better for stuff like this and lets you check the exit status.
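As a sketch of checking the exit status, the snippet below runs the current perl binary ($^X) in place of pdf2ps, purely so that it works on a machine without pdf2ps installed:

```perl
use strict;
use warnings;

# $^X is the running perl binary; it stands in for pdf2ps here.
my @ok   = ($^X, '-e', 'exit 0');
my @fail = ($^X, '-e', 'exit 3');

# system returns the wait status: 0 on success, non-zero otherwise.
my $ok_status = system(@ok);

# After a failure, the child's exit code is in the high byte of $?.
system(@fail);
my $fail_code = $? >> 8;

print "ok status: $ok_status, failing exit code: $fail_code\n";
```

Because success is 0, the idiom is "system(...) and die" or "system(...) == 0 or die", the reverse of most Perl calls.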
Note - the following is untested but should work.

See attached combine-pdf-ben1.pl.txt

(?) Thanks, I've cleaned it up and attached it. There's one thing that I couldn't make work, but first...

(now looking inside Ben's version)

die "Usage: ", $0 =~ /([^\/]+)$/, " <outfile.pdf> <directory_of_pdf_files>\n"
       unless @ARGV == 2;

Uh, that regex there. Take $_, match one or more characters that aren't a / up to the end of line and remember it and place it in $0? Huh?

(!) [Ben] Nope - it's exactly the behavior that Jason was talking about. "die", like "print", takes a list - that's why the members are separated by commas. The binding operator, =~, says to match against whatever comes before it; matching against "$_" doesn't require it.
print if /gzotz/;		# Print $_ if $_ contains "gzotz"
print if $foo =~ /gzotz/;	# Print $_ if $foo contains "gzotz"
print $foo if /gzotz/;		# Print $foo if $_ contains "gzotz"
So, what I'm doing is looking at what's in "$0", and capturing/returning the part in the parens as per standard list behavior. It's a cute little trick.
I guess I will have to do this one soon in my One-Liner articles; it's a useful little idiom.
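The idiom can be seen in isolation in the sketch below. The path in $program is a hypothetical value for $0, chosen so the result is predictable:

```perl
use strict;
use warnings;

# Hypothetical value of $0 when the script is run by full path.
my $program = '/usr/local/bin/combine-pdf.pl';

# In list context a successful match returns its captures,
# so the parenthesized part can be assigned directly.
my ($basename) = $program =~ /([^\/]+)$/;

# The same expression dropped into a list, as in the usage message.
my $usage = join '', "Usage: ", $program =~ /([^\/]+)$/,
    " <outfile.pdf> <directory_of_pdf_files>\n";

print $usage;
```

Note the parentheses around $basename: without them the assignment is in scalar context and the match returns true or false rather than the captured text.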

(?) I had to move a few things around to get it to work. I did have one problem, though:

#convert ps files to a pdf file
system $GS, $GS_ARGS, $filelist
	and die "Problem combining files!\n";

This did not work no way, no how. I kept getting "/undefinedfilename" from GS no matter how I quoted it (and I used every method I found in the Perl Bookshelf).

(!) [Ben] Hm. I didn't try it, but -
perl -we'$a="ls"; $b="-l"; $c="Docs"; system $a, $b, $c and die "Fooey!\n"'
That works fine. I wonder what "gs"'s hangup was. Oh, well - you got it going, anyway. I guess there's not much of a security issue in handing it to "sh -c" instead of execvp()ing it in this case: the perms will take care of all that.

(?) To get it to finally work, I did:

#convert ps files to a pdf file
my $cmd_string = $GS . $GS_ARGS . $filelist ;
system $cmd_string
        and die "Problem combining files!\n";

<shrug>
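One plausible explanation, assuming $GS_ARGS and $filelist each held several space-separated words (the thread does not show their contents): the list form of system hands every element to the program as a single argument, spaces and all, so gs would have looked for one file with a name like "a.ps b.ps". The sketch below shows the effect with a child perl process that exits with its argument count; the file names are invented:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the space-separated list of PostScript files.
my $filelist = 'a.ps b.ps';

# Child program: exits with the number of arguments it received.
my @counter = ($^X, '-e', 'exit scalar @ARGV');

# List form: $filelist arrives as ONE argument containing a space.
system(@counter, $filelist);
my $as_one = $? >> 8;

# Splitting first hands each name over as its own argument.
system(@counter, split ' ', $filelist);
my $as_many = $? >> 8;

print "unsplit: $as_one argument(s), split: $as_many argument(s)\n";
```

If that was indeed the cause, keeping the list form and passing the options and file names as separate list elements would also have worked, while still avoiding the shell.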

Anywho, here's the final (?) working copy:

See attached combine-pdf-faber2.pl.txt

(!) [Ben] Cool! Glad I could help.


Copyright © 2003
Copying license http://www.linuxgazette.com/copying.html
Published in Issue 91 of Linux Gazette, June 2003
HTML script maintained by Heather Stern of Starshine Technical Services, http://www.starshine.org/

