An Introduction to PDF Generation

This article is meant as a starting point for generating a PDF file from scratch, without a PDF generation library, or for writing your own library. The code is written in Ruby, but it should translate easily to any other high-level language.

I found the PDF documentation on Adobe's Developer Resources page. Everything I needed was in there. I paid special attention to the language syntax chapter, the document layout chapter, and finally the examples chapter. PDF syntax reads like a programming language with no control statements, only data structures. Also, the exact byte positions of objects in the file are significant, because the file refers to them by offset.
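To make the "only data structures" point concrete, here is a minimal sketch that maps Ruby values onto PDF object syntax. The names `Ref` and `to_pdf` are hypothetical helpers invented for this illustration; they are not part of the script later in this article, and the mapping covers only a handful of types.

```ruby
# Hypothetical helper: a reference to an indirect object, written "3 0 R".
Ref = Struct.new(:num, :gen)

# Serialize a Ruby value into PDF object syntax.
# A sketch only: real numbers, hex strings, and streams are not handled.
def to_pdf(value)
  case value
  when Ref     then "#{value.num} #{value.gen} R"
  when Hash    then "<< " + value.map { |k, v| "/#{k} #{to_pdf(v)}" }.join(" ") + " >>"
  when Array   then "[" + value.map { |v| to_pdf(v) }.join(" ") + "]"
  when Symbol  then "/#{value}"
  # Literal strings need backslashes and parentheses escaped.
  when String  then "(" + value.gsub(/[\\()]/) { |c| "\\#{c}" } + ")"
  when Integer then value.to_s
  when true, false then value.to_s
  when nil     then "null"
  else raise ArgumentError, "unsupported value: #{value.inspect}"
  end
end

puts to_pdf({ Type: :Page, Parent: Ref.new(3, 0), MediaBox: [0, 0, 612, 792] })
```

Symbols stand in for PDF names, hashes for dictionaries, and arrays for arrays, so a page description becomes an ordinary nested Ruby literal.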

From the examples chapter, I copied and pasted some PDF code into a text editor, saved it with a ".pdf" extension, and opened it in Preview (what OS X uses to open PDF files). I had to fix a few things in the sample code to make it work, including replacing a few non-ASCII characters that had somehow slipped in. The PDF document contained numerous references to exact character positions of items within the document, which I knew probably wouldn't all be correct, so I was pleasantly surprised when it opened up just fine in Preview. It took longer to open than other PDF documents of that size, so I suspected that the character position references were a little off and that Preview was forgiving enough to open the file anyway.

The task of figuring out exact character positions of elements in the document proved too error-prone, so I moved the document to a series of file output statements in a Ruby script, and I let the script keep track of the various character positions. I still don't know if the references are all correct, but they're probably closer than I could have done with the text editor alone. Besides, I can now continually make changes to my PDF, and the script will calculate the positions automatically.
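The position-tracking idea can be shown in miniature before the full script. This is a sketch under assumptions: it writes to an in-memory `StringIO` instead of a file, and `write_objects` is a hypothetical helper name used only here.

```ruby
require "stringio"

# Write each object body to the stream, recording the byte offset at
# which its "N 0 obj" line begins. Returns the list of offsets.
def write_objects(io, bodies)
  offsets = []
  bodies.each_with_index do |body, i|
    offsets << io.pos
    io.write("#{i + 1} 0 obj\n#{body}\nendobj\n")
  end
  offsets
end

io = StringIO.new
io.write("%PDF-1.4\n")
offsets = write_objects(io, ["<< /Type /Catalog >>", "[/PDF /Text]"])

# Each recorded offset points at the start of the matching object.
offsets.each_with_index do |off, i|
  puts "object #{i + 1} at byte #{off}: #{io.string.byteslice(off, 7)}"
end
```

Because the offsets are read from the stream as it is written, editing an object body never requires recounting anything by hand.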

Here's the script, after a bit of refactoring and organization:

# pdf.rb

# Abstract representation of an "Indirect Object", which is a PDF object
# that is accessed by an object number. PDF generation software needs to
# account for the file positions of every indirect object definition, so
# that it can build a cross reference table.
class IndirectObject
  attr_accessor :num, :pos
  def initialize(num, pos)
    @num = num
    @pos = pos
  end
end

# A File subclass that manages Indirect Objects
class PDFFile < File
  attr_accessor :objects
  def initialize(*args)
    super(*args)
    @objects = []
  end
  def indirect_object
    obj = IndirectObject.new(@objects.size + 1, self.pos)
    yield obj
    @objects << obj
  end
end

def create_pdf(filepath)
  # Binary mode keeps newline translation from shifting byte offsets.
  PDFFile.open(filepath, "wb") do |pdf|
    # The first line of the PDF: The PDF identifier
    pdf.puts("%PDF-1.4")
    # The document catalog: the root object, which points to the outline
    # (bookmarks) and to the page tree.
    pdf.indirect_object do |obj|
      pdf.puts "#{obj.num} 0 obj"
      pdf.puts "  << /Type /Catalog"
      pdf.puts "     /Outlines 2 0 R"
      pdf.puts "     /Pages 3 0 R"
      pdf.puts "  >>"
      pdf.puts "endobj"
      pdf.puts
    end
    # The document outline (bookmarks). A count of 0 means it is empty.
    pdf.indirect_object do |obj|
      pdf.puts "#{obj.num} 0 obj"
      pdf.puts "  << /Type /Outlines"
      pdf.puts "     /Count 0"
      pdf.puts "  >>"
      pdf.puts "endobj"
      pdf.puts
    end
    # The root of the page tree. /Kids lists the pages and /Count says
    # how many there are.
    pdf.indirect_object do |obj|
      pdf.puts "#{obj.num} 0 obj"
      pdf.puts "  << /Type /Pages"
      pdf.puts "    /Kids [4 0 R]"
      pdf.puts "    /Count 1"
      pdf.puts "  >>"
      pdf.puts "endobj"
      pdf.puts
    end
    # This defines the page dimensions and the font that will be used
    # The font name is "F1"
    pdf.indirect_object do |obj|
      pdf.puts "#{obj.num} 0 obj"
      pdf.puts "  << /Type /Page"
      pdf.puts "     /Parent 3 0 R"
      pdf.puts "     /MediaBox [0 0 612 792]"
      pdf.puts "     /Contents 5 0 R"
      pdf.puts "     /Resources << /ProcSet 6 0 R"
      pdf.puts "       /Font << /F1 7 0 R >>"
      pdf.puts "     >>"
      pdf.puts "  >>"
      pdf.puts "endobj"
      pdf.puts
    end
    # Here's the text that will be displayed on the page: "Hello World"
    # We're using font "F1" at 24 points. The text starts at 100 points
    # from the left of the page, and 600 points from the bottom.
    pdf.indirect_object do |obj|
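      stream = "  BT\n"
      stream += "    /F1 24 Tf\n"
      stream += "    100 600 Td\n"
      stream += "    (Hello World) Tj\n"
      stream += "  ET"
      pdf.puts "#{obj.num} 0 obj"
      # /Length counts only the stream data: not the end-of-line after
      # "stream", and not the one before "endstream".
      pdf.puts "  << /Length #{stream.bytesize} >>"
      pdf.puts "stream"
      pdf.puts stream
      pdf.puts "endstream"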
      pdf.puts "endobj"
      pdf.puts
    end
    # The procedure set array referenced by /ProcSet above. It declares
    # that the page content uses the general PDF and text operators.
    pdf.indirect_object do |obj|
      pdf.puts "#{obj.num} 0 obj"
      pdf.puts "  [/PDF /Text]"
      pdf.puts "endobj"
      pdf.puts
    end
    # This describes the font that will be used. The font will be referenced
    # with the name "F1"
    pdf.indirect_object do |obj|
      pdf.puts "#{obj.num} 0 obj"
      pdf.puts "  << /Type /Font"
      pdf.puts "     /Subtype /Type1"
      pdf.puts "     /Name /F1"
      pdf.puts "     /BaseFont /Helvetica"
      pdf.puts "     /Encoding /MacRomanEncoding"
      pdf.puts "  >>"
      pdf.puts "endobj"
      pdf.puts
    end
    # Here is the cross reference table. It gives the file positions of
    # all the indirect objects in the document.
    startxref = pdf.pos
    pdf.puts "xref"
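    pdf.puts "0 #{pdf.objects.size + 1}"
    # Each entry must be exactly 20 bytes including its end-of-line
    # marker, so a space precedes the single newline that puts adds.
    pdf.puts "0000000000 65535 f "
    pdf.objects.each do |obj|
      pdf.puts(("%010d" % obj.pos) + " 00000 n ")
    end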
    pdf.puts ""
    # /Size is the number of entries in the cross reference table
    # (objects 0 through 7), and /Root names the catalog, object 1.
    pdf.puts "trailer"
    pdf.puts "  << /Size #{pdf.objects.size + 1}"
    pdf.puts "     /Root 1 0 R"
    pdf.puts "  >>"
    # The file position where the cross reference is located
    pdf.puts "startxref"
    pdf.puts startxref.to_s
    # End of file identifier
    pdf.puts "%%EOF"
  end
end
if __FILE__ == $0
  create_pdf "gen.pdf"
end

The bulk of the document is a collection of indirect objects. Each has an "object number", which is a positive integer. Most of the objects here are dictionaries, which are collections of key-value pairs enclosed in "<<" and ">>"; object 5 also carries a stream, and object 6 is an array. Towards the end of the document is a cross-reference section that works like an index. It gives the exact byte position of each object so that a reader can jump straight to any of them without scanning the entire document.
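To show how a reader might use that index, here is a rough sketch that goes the other way: it finds the `startxref` value at the end of the file, parses the table, and returns the offset of each object. The function name `xref_offsets` is hypothetical, and the code assumes a single-subsection table like the one the script writes; real-world PDFs can have several subsections, incremental updates, or cross-reference streams, none of which are handled.

```ruby
# Parse a classic cross-reference table with a single subsection and
# return { object_number => byte_offset } for the in-use ("n") entries.
def xref_offsets(data)
  start = data[/startxref\s+(\d+)\s+%%EOF/, 1]
  raise "no startxref found" unless start

  lines = data.byteslice(start.to_i, data.bytesize).lines.map(&:strip)
  raise "xref keyword not at startxref offset" unless lines.shift == "xref"

  # Subsection header: first object number, then number of entries.
  first, count = lines.shift.split.map(&:to_i)
  lines.first(count).each_with_index.each_with_object({}) do |(line, i), table|
    offset, _generation, kind = line.split
    table[first + i] = offset.to_i if kind == "n"
  end
end

# Usage, assuming the script has already written gen.pdf:
if File.exist?("gen.pdf")
  data = File.binread("gen.pdf")
  xref_offsets(data).each do |num, off|
    puts "object #{num} starts at byte #{off}: #{data.byteslice(off, 7)}"
  end
end
```

If the generator's bookkeeping is right, every line printed should show the matching "N 0 obj" header, which makes this a quick sanity check for the position references.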