An Introduction to PDF Generation
Submitted by moxley on Sun, 2006-10-15 06:02
This article is meant as a starting point for generating a PDF file from scratch, either without a PDF generation library or as the basis for a library of your own. The code is written in Ruby, but it can easily be translated to another high-level language of your choice.
I started with Adobe's Developer Resources page for PDF. Everything I needed was in there. I paid special attention to the language syntax chapter, the document layout chapter, and finally the examples chapter. The PDF syntax is really just a programming language, except there are no control statements; there are only data structures. Also, the exact positions of statements in the source are significant.
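To make "only data structures" concrete, here is a minimal sketch of how Ruby values could be rendered in PDF syntax. The `to_pdf` helper and the `Ref` struct are hypothetical names invented for this illustration, and the mapping covers only the handful of types that show up in a simple document.

```ruby
# Hypothetical helper, for illustration only.
# Ref stands in for an indirect reference such as "4 0 R".
Ref = Struct.new(:num, :gen)

# Render a Ruby value in PDF object syntax.
def to_pdf(value)
  case value
  when Ref
    "#{value.num} #{value.gen} R"
  when Hash
    # Dictionary: key-value pairs between << and >>
    pairs = value.map { |key, val| "#{to_pdf(key)} #{to_pdf(val)}" }
    "<< #{pairs.join(' ')} >>"
  when Array
    "[#{value.map { |item| to_pdf(item) }.join(' ')}]"
  when Symbol
    # Name object
    "/#{value}"
  when String
    # Literal string: backslashes and parentheses are escaped
    escaped = value.gsub(/[\\()]/) { |ch| "\\#{ch}" }
    "(#{escaped})"
  when Integer
    value.to_s
  else
    raise ArgumentError, "no PDF representation for #{value.class}"
  end
end

puts to_pdf({ Type: :Pages, Kids: [Ref.new(4, 0)], Count: 1 })
# → << /Type /Pages /Kids [4 0 R] /Count 1 >>
```

A generation library would build its objects as hashes and arrays like these and serialize them at write time, rather than assembling each line by hand.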
From the examples chapter, I copied and pasted some PDF code into a text editor, saved it with a ".pdf" extension, and opened it in Preview (what OS X uses to open PDF files). I had to fix a few things in the sample code to make it work, including replacing a few non-ASCII characters that had somehow slipped in. The PDF document contained numerous references to the exact character positions of items within it, which I knew probably wouldn't all be correct, so I was pleasantly surprised when it opened up just fine in Preview. It took longer to open than other PDF documents of that size, so I suspected that the character position references were a little off, and that Preview was forgiving enough to open it anyway.
Figuring out the exact character positions of elements in the document by hand proved too error-prone, so I moved the document into a series of file output statements in a Ruby script and let the script keep track of the various character positions. I still don't know if the references are all correct, but they're probably closer than I could have managed with the text editor alone. Besides, I can now keep making changes to my PDF, and the script will recalculate the positions automatically.
Here's the script, after a bit of refactoring and organization:
# pdf.rb
# Abstract representation of an "Indirect Object", which is a PDF object
# that is accessed by an object number. PDF generation software needs to
# account for the file positions of every indirect object definition, so
# that it can build a cross reference table.
class IndirectObject
attr_accessor :num, :pos
def initialize(num, pos)
@num = num
@pos = pos
end
end
# A File subclass that manages Indirect Objects
class PDFFile < File
attr_accessor :objects
def initialize(*args)
super(*args)
@objects = []
end
def indirect_object
obj = IndirectObject.new(@objects.size + 1, self.pos)
yield obj
@objects << obj
end
end
def create_pdf(filepath)
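# Binary mode keeps every line ending at one byte on any platform, so the
# recorded file positions stay exact.
PDFFile.open(filepath, "wb") do |pdf|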
# The first line of the PDF: The PDF identifier
pdf.puts("%PDF-1.4")
# The document catalog: the root of the object hierarchy. It points to
# the outline (object 2) and the page tree (object 3).
pdf.indirect_object do |obj|
pdf.puts "#{obj.num} 0 obj"
pdf.puts " << /Type /Catalog"
pdf.puts " /Outlines 2 0 R"
pdf.puts " /Pages 3 0 R"
pdf.puts " >>"
pdf.puts "endobj"
pdf.puts
end
# The document outline (the bookmark tree). It is empty here.
pdf.indirect_object do |obj|
pdf.puts "#{obj.num} 0 obj"
pdf.puts " << /Type /Outlines"
pdf.puts " /Count 0"
pdf.puts " >>"
pdf.puts "endobj"
pdf.puts
end
# The root of the page tree. It lists the pages (/Kids) and says how many
# there are (/Count).
pdf.indirect_object do |obj|
pdf.puts "#{obj.num} 0 obj"
pdf.puts " << /Type /Pages"
pdf.puts " /Kids [4 0 R]"
pdf.puts " /Count 1"
pdf.puts " >>"
pdf.puts "endobj"
pdf.puts
end
# This defines the page dimensions and the font that will be used
# The font name is "F1"
pdf.indirect_object do |obj|
pdf.puts "#{obj.num} 0 obj"
pdf.puts " << /Type /Page"
pdf.puts " /Parent 3 0 R"
pdf.puts " /MediaBox [0 0 612 792]"
pdf.puts " /Contents 5 0 R"
pdf.puts " /Resources << /ProcSet 6 0 R"
pdf.puts " /Font << /F1 7 0 R >>"
pdf.puts " >>"
pdf.puts " >>"
pdf.puts "endobj"
pdf.puts
end
# Here's the text that will be displayed on the page: "Hello World"
# We're using font "F1" at 24 points. The text starts 100 points
# from the left of the page, and 600 points from the bottom.
pdf.indirect_object do |obj|
stream = "\n BT\n"
stream += " /F1 24 Tf\n"
stream += " 100 600 Td\n"
stream += " (Hello World) Tj\n"
stream += " ET"
pdf.puts "#{obj.num} 0 obj"
pdf.puts " << /Length #{stream.length} >>"
pdf.puts "stream" + stream
pdf.puts "endstream"
pdf.puts "endobj"
pdf.puts
end
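# The procedure set array named in the page's /Resources. It declares
# that the page uses the PDF and Text operator sets. The PDF reference
# treats it as obsolete since PDF 1.4 but still recommends writing it for
# older viewers.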
pdf.indirect_object do |obj|
pdf.puts "#{obj.num} 0 obj"
pdf.puts " [/PDF /Text]"
pdf.puts "endobj"
pdf.puts
end
# This describes the font that will be used. The font will be referenced
# with the name "F1"
pdf.indirect_object do |obj|
pdf.puts "#{obj.num} 0 obj"
pdf.puts " << /Type /Font"
pdf.puts " /Subtype /Type1"
pdf.puts " /Name /F1"
pdf.puts " /BaseFont /Helvetica"
pdf.puts " /Encoding /MacRomanEncoding"
pdf.puts " >>"
pdf.puts "endobj"
pdf.puts
end
# Here is the cross reference table. It gives the file positions of
# all the indirect objects in the document.
startxref = pdf.pos
pdf.puts "xref"
pdf.puts "0 8"
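# Each entry must be exactly 20 bytes including the line ending, so a
# one-byte "\n" needs a space in front of it.
pdf.puts "0000000000 65535 f "
pdf.objects.each do |obj|
pdf.puts(("%010d" % obj.pos) + " 00000 n ")
end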
pdf.puts ""
# The trailer. /Size is the number of entries in the cross reference
# table (object 0 plus the 7 objects above), and /Root names the
# catalog, which is object 1.
pdf.puts "trailer"
pdf.puts " << /Size 8"
pdf.puts " /Root 1 0 R"
pdf.puts " >>"
# The file position where the cross reference is located
pdf.puts "startxref"
pdf.puts startxref.to_s
# End of file identifier
pdf.puts "%%EOF"
end
end
if __FILE__ == $0
create_pdf "gen.pdf"
end
The bulk of the document is a collection of objects. Each indirect object has an "object number", which is a positive integer (number 0 is reserved for the free entry that heads the cross-reference table). Most of the objects in this document are dictionaries: collections of key-value pairs enclosed in "<<" and ">>". Towards the end of the document is a cross-reference section that is like an index. It gives the exact positions of each of the objects in the document so that they can be accessed randomly, without having to scan the entire document.
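Since the cross-reference section is just an index of file positions, it can be checked mechanically. The sketch below follows "startxref" to the table the way a reader would, then confirms that each in-use entry points at the matching "N G obj" line. The `check_xref` name is made up for this example, and the code assumes a single cross-reference section with one-byte line endings, which is what the script above writes; it is not a general PDF parser.

```ruby
# Hypothetical checker, for illustration only. Returns a list of
# problem descriptions; an empty list means every offset lined up.
def check_xref(path)
  data = File.binread(path)

  # The table's position is on the line after the final "startxref".
  xref_pos = data[/startxref\s+(\d+)\s+%%EOF\s*\z/, 1]
  raise "startxref not found" unless xref_pos

  tail = data[xref_pos.to_i..-1]
  raise "startxref points past the end of the file" unless tail
  lines = tail.lines
  unless lines.shift.to_s.strip == "xref"
    raise "no xref keyword at offset #{xref_pos}"
  end

  # Subsection header: first object number and entry count, e.g. "0 8".
  first_num, count = lines.shift.to_s.split.map(&:to_i)

  problems = []
  lines.first(count).each_with_index do |entry, i|
    offset, generation, kind = entry.split
    next unless kind == "n" # skip free entries such as object 0

    num = first_num + i
    expected = "#{num} #{generation.to_i} obj"
    # Random access: jump straight to the recorded offset.
    found = data[offset.to_i, expected.length]
    unless found == expected
      problems << "object #{num}: expected #{expected.inspect} " \
                  "at #{offset.to_i}, found #{found.inspect}"
    end
  end
  problems
end
```

Calling `check_xref("gen.pdf")` after running the script should return an empty array when every recorded position is correct; anything else lists the objects whose offsets are off.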

