Buggy PDF Files, should we try to fix them?

Aug 16

Very frequently we come across malformed PDF files. These files view just fine in Acrobat(r) Reader but fail to load with our tools and other third-party tools. This has become a real nightmare for PDF developers. From the customer’s view, the PDF file opens with Acrobat Reader, so it has to be a perfectly valid PDF and the customer does not care about what happens in the background.

Here is one example of a situation that we encountered very recently. The PDF file is generated from a well known scanner brand whose manufacturer would probably sue us just for mentionning their name. At the end of every PDF file, there is a cross reference table that tells the PDF consumer (reader or processing engine) where the various parts of the PDF are located within the file. To locate the cross reference table, the consumer looks for the “startxref” entry towards the end of the file. Here is what this entry looks like in our problematic PDF file:


The PDF consumer goes to the location 101806 in the file and expects to find a cross reference table that looks like this:

0 153
0000000000 65535 f
0000101369 00000 n

Again, this table provides the location of each object within the PDF file. In the case of our buggy PDF, location 101806 points to a random location in the file that contains:


So how is Acrobat Reader opening the file? It actually displays a very brief warning when opening the file and then rebuilds the whole xref table by going through all the file. On most systems with Acrobat Reader already launched, the warning is so brief that users do not even notice it. There are many inconveniences and risks in doing this:

  1. If the file is fairly complex, rebuilding the xref table is time consuming and might fail for compressed objects.
  2. PDF files can be incrementally modified, which means the original information remains intact and new objects added to modify, add or delete things that existed in the original PDF. This is done by creating an updated cross reference table. A funny but dangerous thing might happen: The viewer might retrieve the old PDF data rather than the updated one, so if the user had deleted something from the PDF, the viewer might actually display the deleted information.
  3. The startxref might be valid, but the xref table itself contain invalid entries. So the problem does not appear when opening the file but only later on when scrolling through the document.

Explaining all this to a customer usually results in negative reactions such as “why do we care when the file was generated by this multi-billion-dollar corporation and is processed correctly by Adobe?” So what is the solution? Try to convince Adobe that invalid files should be rejected, or at least really warn the users? But then, some of Adobe’s own files have similar issues, so how could they explain this warning appearing on their own PDFs?

This is only one example of badly formatted PDFs that we come across. Our position has always been to inform our customers, try to fix the invalid PDF and generate a warning to the developer. But this has been going on for way too long with no end in sight.

Dany Amiouny is the CTO for Amyuni Technologies