Have you ever tried to debug a problem with the XREF table in a PDF file? I’ve been debugging PDF problems for a long time, and every now and then I come across a file that has a corrupt XREF table – either because the PDF-generating application did not emit a valid XREF table, or because somebody edited the PDF file without updating the XREF table. And yes, there are applications that write out corrupt XREF data – either as a result of a bug, or because the developer did not understand the PDF spec.
The most important information about the XREF table (and many who try to create a PDF file in a text editor will stumble over this) is that every entry is exactly 20 bytes long – including the line ending character(s). This means that the content of the line is different depending on the line ending conventions used (e.g. CRLF for Windows vs. LF for Unix or Mac OS). If the file is generated on a Windows system, the line ending will already use up two bytes, so we have 18 bytes left for the actual XREF entry:
– 10 digits for the byte offset
– 1 space between the byte offset and the generation number
– 5 digits for the generation number
– 1 space as delimiter between the generation number and the in-use/free flag
– 1 byte for the in-use/free flag
That makes 18 bytes. For a line that ends with just a LF character, we need to “stuff” the line with a space character after the flag. So, when writing XREF data, make sure that you are indeed writing out 20 bytes per XREF entry.
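The field layout above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name xref_entry and its parameters are made up for this post and are not part of any PDF library. The end-of-line marker is passed in as a two-byte sequence, so a lone LF is written as space plus LF.

```python
def xref_entry(offset: int, generation: int = 0, in_use: bool = True,
               eol: bytes = b" \n") -> bytes:
    """Build one XREF entry (hypothetical helper, not a library API).

    eol must be a two-byte end-of-line marker, e.g. b" \\n" or b"\\r\\n".
    """
    if len(eol) != 2:
        raise ValueError("end-of-line marker must be exactly 2 bytes")
    flag = b"n" if in_use else b"f"
    # 10-digit offset, space, 5-digit generation, space, flag = 18 bytes
    entry = b"%010d %05d %s" % (offset, generation, flag) + eol
    if len(entry) != 20:
        raise ValueError("offset or generation number has too many digits")
    return entry


# Entry for an in-use object at byte offset 317, LF line ending
print(xref_entry(317))
# Free-list head entry with a CRLF line ending
print(xref_entry(0, 65535, in_use=False, eol=b"\r\n"))
```

Checking the length at the end catches values that do not fit into the fixed-width fields, which would otherwise silently produce an entry longer than 20 bytes.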
The next problem is what to use as the byte offset for the different entries. There are different ways to determine the byte offset. If you are using vim as your editor, put the following into your .vimrc file and you will get the byte count of the character that’s currently under the cursor:
set laststatus=2
set statusline=%o/%l/%c
The byte count is not the same as the byte offset relative to the beginning of the file. The first character in the document shows a byte count of “1” because it is the first byte, but its offset relative to the beginning of the file is “0” (it is the beginning of the file). So, to convert the byte count into a byte offset, we need to subtract 1 from the value that is displayed.
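If you would rather not do the subtraction by hand, a script can report zero-based offsets directly, because Python indexes bytes from 0. The following is a rough sketch, not a PDF parser: the function name object_offsets is my own, it assumes that every object header sits at the start of a line that follows an LF, and it will also report anything inside a stream that happens to look like an object header.

```python
import re


def object_offsets(data: bytes) -> dict:
    """Map object number -> zero-based byte offset of its 'N G obj' header.

    Hypothetical helper; a plain pattern search, no real PDF parsing.
    """
    pattern = re.compile(rb"^(\d+) (\d+) obj\b", re.MULTILINE)
    return {int(m.group(1)): m.start() for m in pattern.finditer(data)}


# The 9-byte header line means object 1 starts at offset 9,
# which an editor counting from 1 would display as byte 10.
sample = b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\n"
print(object_offsets(sample))
```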
For the rest of this document I will use the basic PDF file that gets created via the excellent article series “Make Your Own PDF” – created by the people who brought you JPedal and PDF2HTML5, IDRsolutions.
When I try to open the file in Ghostscript, I end up with an error message:
gs test.pdf
GPL Ghostscript 9.06 (2012-08-08)
Copyright (C) 2012 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
**** Error reading a content stream. The page may be incomplete.
**** Warning: An error occurred while reading an XREF table.
**** The file has been damaged. This may have been caused
**** by a problem while converting or transfering the file.
**** Ghostscript will attempt to recover the data.
Here is a screenshot of vim editing a PDF file. I’ve highlighted three lines: The first one shows where the cursor is positioned – it is on the start of the line that starts object 6. The second highlight represents the XREF entry for object 6. And the last highlight shows the character count up to and including the character under the cursor. This is 318 characters – we need to adjust that by subtracting one to get the byte offset, so object 6 starts at byte offset 317, which is what is found in the XREF table.
I assume that other editors can provide similar information. I won’t go into debugging this file any further in the editor because it’s a tedious process, and I have a better solution:
A different approach is to output the PDF file with the byte offsets for every line prepended to the line’s content via a small program. A while ago I wrote such a utility, which I cleaned up a little for this post. You can download the C source code here: print_pdf_offset.c.
Just compile the utility (e.g. via make print_pdf_offset if you have a command line build system installed) and provide the filename of a PDF file on the command line:
print_pdf_offset test.pdf
This will then print the PDF file with the byte offsets (and this time this is the true byte offset, so no adjustments necessary) to stdout. Here is an example of the output:
00000317: 6 0 obj
00000325: <</Length 44>>
00000340: stream
00000347: BT /F1 24 Tf 175 720 Td (Hello World!)Tj ET
00000391: endstream
00000401: endobj
00000408: xref
00000413: 0 7
00000417: 0000000000 65535 f
00000436: 0000000009 00000 n
00000455: 0000000056 00000 n
00000474: 0000000111 00000 n
00000493: 0000000212 00000 n
00000512: 0000000250 00000 n
00000531: 0000000317 00000 n
00000550: trailer <</Size 7/Root 1 0 R>>
00000581: startxref
00000591: 406
00000595: %%EOF
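For readers who do not want to compile anything, the same idea can be sketched in Python. To be clear, this is not the print_pdf_offset.c source, just my approximation of what such a tool does. The names lines_with_offsets and main are made up for this sketch, and binary stream data will come out as unreadable characters.

```python
import sys


def lines_with_offsets(data: bytes):
    """Yield (offset, line) pairs; offset is the zero-based start of the line."""
    offset = 0
    # bytes.splitlines treats LF, CR and CRLF as line endings
    for line in data.splitlines(keepends=True):
        yield offset, line
        offset += len(line)


def main(path: str) -> None:
    with open(path, "rb") as f:
        data = f.read()
    for offset, line in lines_with_offsets(data):
        # latin-1 maps every byte to a character, so decoding never fails
        text = line.rstrip(b"\r\n").decode("latin-1")
        print(f"{offset:08d}: {text}")


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

The file is read in binary mode on purpose: text mode could translate line endings, which would shift every offset after the first CRLF.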
By looking at the XREF table, it’s pretty obvious that there is something wrong with this file: the XREF entries are not 20 bytes long. Each one is 19 bytes, one byte short. This would usually mean that the table was created on a system that uses CRLF as line endings and then pasted into an editor on a Mac that only uses LF – hence the missing byte (that is not the real reason here: I “broke” the file for demonstration purposes). Also, the startxref information points to byte offset 406, but the XREF table actually starts at offset 408. No wonder this file gave me problems when I tried to open it with Ghostscript. After fixing these two problems (adding a space after every XREF table entry and changing the 406 to 408), the file loads without any problem.
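Both checks are easy to automate. Here is a minimal sketch, assuming a file with a single classic XREF table (no incremental updates, no XREF streams) and line endings that include an LF. The function name check_xref is my own invention, and the test data is a cut-down stand-in rather than the real test.pdf.

```python
import re


def check_xref(data: bytes) -> list:
    """Return a list of problems found in a simple XREF table (sketch only)."""
    table = re.search(rb"^xref\b", data, re.MULTILINE)
    start = re.search(rb"startxref\s+(\d+)", data)
    if table is None or start is None:
        return ["no xref table or startxref value found"]

    problems = []
    claimed = int(start.group(1))
    if claimed != table.start():
        problems.append(
            f"startxref says {claimed}, but xref is at {table.start()}")

    # Entries sit between the xref keyword and the trailer keyword
    end = data.find(b"trailer", table.end())
    if end == -1:
        end = len(data)
    for line in data[table.end():end].splitlines(keepends=True):
        if re.match(rb"\d{10} \d{5} [nf]", line) and len(line) != 20:
            problems.append(
                f"entry {line!r} is {len(line)} bytes, expected 20")
    return problems


broken = (b"xref\n0 2\n"
          b"0000000000 65535 f\n"
          b"0000000009 00000 n\n"
          b"trailer <</Size 2/Root 1 0 R>>\nstartxref\n5\n%%EOF\n")
for problem in check_xref(broken):
    print(problem)
```

An empty list means that neither of the two problems described above was found. It does not prove that the byte offsets inside the entries are correct.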
By the way, the corrupt file loads without any error message in Adobe Acrobat or Adobe Reader. The only indication that something is wrong is that Acrobat wants to save the file when it’s closed – because it got repaired behind the scenes. Chances are that a popup informs the user that the file is being repaired, but with such a small file and a fast computer, it’s gone before the user can register it.