Automating Preflight with PDF Analyzer

Sep 01 2009

Amyuni Technologies Blog

From Check List to PDF Options

More than a decade ago Chuck Weger coined the term “preflight” to define the verification process of digital files prior to printing. Since then, this process or “check list” has materialized into integrated PDF options, plug-ins, and standalone products.

One such standalone product is the Amyuni PDF Analyzer. Its purpose: to verify the internal validity and conformance of PDF documents.

Automated Batch Processing

Unlike integrated options and plug-ins, PDF Analyzer can be installed on a server to perform the preflight of large numbers of documents as an automated batch process (Figure 1) without user interaction.

Figure 1: Automated Preflight

PDF Analyzer validates the structure of PDF documents with customizable VB.NET rule sets (Figure 2) to ensure that the documents comply with industry or custom specifications.

Figure 2: VB.NET Rule Sets

Customized VB.NET rules can be created, saved, and reloaded to verify that:

  • PDF document object instructions, syntax, and hierarchies do not contain errors.
  • Fonts and corresponding font information such as TrueType tables are properly embedded.
  • Embedded graphics and images are properly compressed for specific application processing.
  • PDF/A documents contain embedded fonts, XMP metadata, and device-independent colors, etc.
  • PDF/A documents do not contain encryption, JavaScript, embedded files, etc.
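
The idea of a reusable rule set applied to a batch of files can be sketched in a few lines of Python. This is not the PDF Analyzer API or its VB.NET rule format; the rule names and the `preflight` and `preflight_folder` helpers are hypothetical, and the checks are deliberately crude byte-level tests chosen only to illustrate the pattern.

```python
from pathlib import Path

# Hypothetical rules: each takes the raw bytes of a file and returns True on success.
def has_pdf_header(data):
    # The "%PDF-" marker is expected near the start of the file.
    return b"%PDF-" in data[:1024]

def has_eof_marker(data):
    # A complete file normally ends with "%%EOF".
    return b"%%EOF" in data[-1024:]

def not_encrypted(data):
    # Crude assumption: an /Encrypt key anywhere suggests an encrypted document.
    return b"/Encrypt" not in data

RULES = {
    "header": has_pdf_header,
    "eof": has_eof_marker,
    "no-encryption": not_encrypted,
}

def preflight(data, rules=RULES):
    """Return the names of the rules that the document fails."""
    return [name for name, rule in rules.items() if not rule(data)]

def preflight_folder(folder, rules=RULES):
    """Apply the rule set to every .pdf file in a folder, without user interaction."""
    return {path.name: preflight(path.read_bytes(), rules)
            for path in sorted(Path(folder).glob("*.pdf"))}
```

A real rule set would inspect parsed objects rather than raw bytes, but the structure (named rules, one report per file) stays the same.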

In addition to document analysis, you can also use PDF Analyzer to compare documents, verify sensitive metadata, and extract confidential information. Its design and functions stem from years of application deployment experience in the field.

Learn more about PDF Analyzer at www.pdfanalyzer.com.

Franc Gagnon is the technical copywriter for Amyuni Technologies
www.amyuni.com

Missing PDF Fonts: How Metadata Affects Text Optimization (Part 2)

Aug 28 2009

Missing Font Information in PDFs: How Metadata Affects Text Optimization

This document is the second in a series of white papers that will explore the problem of missing font information in PDFs. Since its inception, the PDF has revolutionized the way individuals and businesses communicate and exchange information. The promise to maintain informational integrity and display content consistently across different platforms secured the PDF’s position as a leader in document exchange. Yet despite its innovations, the PDF’s own evolution would also bring with it new challenges.

Missing PDF Fonts: Who Does It Affect and Why is it Important?

At first glance, missing font information may appear trivial. After all, who hasn’t experienced unintelligible characters while scrolling through a PDF? However, this problem is more than just improperly rendered text on a screen. For developers and IT managers, missing font information is problematic as it can delay software development and hinder the production cycle. For end‐users, it translates into lost time and compromised deadlines when they cannot display, print, or edit content properly.

The inconvenience of missing font information affects more than disgruntled individuals in the workplace; it also undermines a document’s accuracy and its value as a product. The original purpose of the portable document format was to ensure content integrity and display consistency, but what happens when content is incomplete, cannot print accurately, or can even change?

The issue of improper text rendering is an ironic side‐effect of the PDF’s popularity. After all, the format’s portability is the cornerstone of its purpose. And although developers will always design PDFs to accommodate as many configurations as possible, ultimately, they will never be able to fully accommodate how users choose to work with the portable document format.

Missing Metadata and the User Experience

The portable document format is a complex technology and there are many internal variables that, if left unchecked, can compromise its final output. The following sections will provide a brief overview of one of the many problems that can occur when development corners are cut: the problem of missing or incorrect metadata.

When metadata goes missing or is incorrect (whether through corruption or developmental oversight), viewers cannot optimize the contents of the document, which means that the PDF cannot guarantee optimal usability within the context in which it is being used. For example, if a user is unable to display or print the contents of a document accurately, the usability of the PDF is not 100% reliable. Likewise, if a user is unable to search a document for a word or extract content, the PDF has also failed to provide optimal usability.

The term “context” is important because it is closely tied to the user experience and often, a positive or negative user experience influences product or vendor perception. Therefore, PDFs that cannot be properly optimized have the potential (if it happens often enough) to directly affect product perception. Unfortunately, many PDF developers are unaware that their PDFs are internally unsound (as in the case of missing or incorrect font information), and their work goes unchecked, only to fall under the scrutiny of disgruntled end‐users.

If metadata goes missing, a PDF viewer can experience text rendering problems such as missing or unintelligible characters and a slow refresh rate. Within certain contexts, these problems are more apparent (and frustrating) than in others. Take, for example, working with documents remotely. Before the advent of thin client environments such as terminal servers, users worked with PDF documents locally, using a viewing application such as Adobe Reader.

As the popularity of remote access grew, so did the expectation of working remotely with PDFs while maintaining the same real-time functionality. However, PDF rendering issues are amplified in these remote environments: the lack of quality and detail in poorly optimized text is more apparent, as is the document’s slower refresh rate.

Text Rendering on Screen

The process that a viewer or rendering application follows to turn text instructions into meaningful glyphs on screen is complex. Figure 1 provides a simplified overview of this process; it is during this process that missing metadata takes its toll. The following sections outline some of the different scenarios that might occur when font information goes missing.

Figure 1: Overview of the Text Rendering Process for On-Screen Display

Missing Font Resources

When a PDF viewer encounters a text drawing instruction, it loads a specific font from the font resource. If this resource is missing (Figure 2), the viewer is unable to display any character(s) that uses the specified font and is also unable to provide a substitute for it. Most often the viewer will simply fail to load the document. In some cases the viewer may randomly substitute the missing fonts, but most times it produces unpredictable text rendering results.

Figure 2: Missing Font Resource

Missing Font Family Name

Another piece of font information that can go missing is the font family name. For example, if the font family name “Arial” is missing (Figure 3), the viewer cannot even determine an equivalent system font to use as a replacement. As a result, the viewer is unable to optimize the loading and rendering of the PDF file.

Figure 3: Missing Font Family Name

Character Codes to Glyphs: The CMap Leads the Way

However, if the font resource is found, the viewer processes the information and, depending on whether the fonts in the PDF are embedded, follows one of several text decoding methods. If the font is embedded and the viewer is not set to optimize the rendering of text, the viewer will refer to the embedded (CID to Glyph) CMap in the PDF for information on how the font engine can convert the text into glyphs.

Essentially, the CMap is metadata (Figure 4) that maps character codes to their corresponding graphical representations (glyphs) in order for the font engine to render all the details of each character. However, if there is information missing in the CMap, the font engine is unable to accurately render the characters and the text is unrecognizable.

Figure 4: CID to Glyph CMap

In this situation, anti‐aliasing is unable to function using Windows GDI. For thin client services such as remote terminals, PDAs, and virtualization environments, the absence of anti‐aliasing dramatically affects text processing speed and display quality (Figure 5).

Figure 5: Anti-aliasing experienced with thin client services

Viewer Rendering Options

If the font is not embedded, the viewer will look to the system to find a substitute or replacement font. In most cases, even though the binary information that makes up the font file is missing, its corresponding positional and descriptive metadata is sufficient to enable the viewer to compensate by substituting the font.

If a matching font is found on the system, the viewer proceeds to use the services of a font engine (such as GDI, FreeType, or a commercial library) to render the final output to the screen. But what if the system fonts are not found or the viewer is not supplied with its own standard list of replacement fonts? The viewer will have to select the closest font replacement instead and fall back on the drawing parameters provided in the PDF’s metadata.

Without these parameters, the viewer is unable to provide a font engine with the information required to draw the glyph(s). If all the information in the font metadata is valid and well‐structured, the viewer is able to load the appropriate glyphs and the font engine can render the final output.

Character Codes to Unicode: Different Metadata, Same Result

As we have seen in the previous section, missing information in the CMap causes rendering problems. The next question, then, is what about the CID to Unicode CMap? What happens when information is missing there as well? Just like its embedded counterpart, the Unicode CMap provides character to Unicode mapping information (Figure 6). This includes character encoding parameters such as WinAnsi, MacRoman, and Unicode. And just as with the (CID to Glyph) CMap, if there is information missing in the Unicode CMap, the font engine is once again unable to draw the appropriate glyphs and the user can expect more of the same unpredictable text rendering results.

Figure 6: CID to Unicode CMap

Incorrect Metadata

By contrast, incorrect metadata presents a different set of problems. Because the information is incorrect, the resulting text may (in extreme cases) display incorrectly, as when a Unicode CMap table contains incomplete or wrong entries. In Figure 7, both lower-case and upper-case characters point to the same Unicode values. As a result, the viewer may display the same characters for lower-case and upper-case letters.

Figure 7: Unicode CMap Errors
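
A check for this kind of error can be sketched as follows. The sketch assumes the Unicode CMap has already been extracted into a plain dictionary of character code to Unicode string; the `find_unicode_collisions` name and the sample mapping are hypothetical. Shared values are not always a defect, so the result is best treated as a list of entries to review rather than a list of confirmed errors.

```python
from collections import defaultdict

def find_unicode_collisions(to_unicode):
    """Return {unicode_value: [codes]} for values that several codes map to."""
    codes_by_value = defaultdict(list)
    for code, value in to_unicode.items():
        codes_by_value[value].append(code)
    return {value: sorted(codes)
            for value, codes in codes_by_value.items()
            if len(codes) > 1}

# Hypothetical mapping in which the codes for "A" and "a" both point to "a".
sample = {0x41: "a", 0x61: "a", 0x42: "B", 0x62: "b"}
collisions = find_unicode_collisions(sample)
```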

Is There Blame to Place?

Most of the aforementioned problems occur even before the PDF reaches the viewing application. The causes? Often, weak document design and poor development practices are the main culprits behind many of the missing or incorrect metadata problems found in PDFs circulating today. For example, some developers choose not to embed vital font information if it makes the file too large or if the data is not required by the PDF specifications.

Missing or incorrect metadata is a commonly overlooked problem because its effects are often only noticeable away from the familiar development environment. Lack of testing and the assumption that “if a PDF renders properly in Acrobat, it will render the same elsewhere” create problematic documents not only for end‐users, but also for the vendors that generate them.

So Many Branches to Prune – Starting at the Root: Where to Begin Tackling the Problem of Incorrect Metadata and Font Information

Where does one begin tackling the problem of incorrect metadata and font information? Since PDF development and production is always changing and growing, where is the starting point? What type of development tool or best practices should developers think about and why?

First and foremost, at the production level, developers need to use the right software tools. With the right tools, developers can start generating well‐structured PDFs that will optimize and render properly. Not only are these good documents appreciated by end-users, but other developers who need to work with them later in different environments also benefit. For example, some tools used by developers tend to remove some TrueType tables because according to the PDF specifications, they are not needed by Acrobat. However, these tables could be required for other purposes such as rendering PDFs on thin clients and PDAs, or exporting document content to other formats such as XPS or XAML.

Yet using the right tools is often not enough. Because PDF is a complex technology, developers need to think “outside the box”, especially regarding the PDF specifications. Many items are not included or mentioned in the PDF specifications, yet these are part of the solution when trying to create well-structured and optimized documents. The following sections outline some of the best practices (based on years of working with countless problematic and optimized PDF documents) that Amyuni Technologies believes lead to better quality PDFs.

Solid Tables Mean Solid Font Files

Developers should ensure that embedded font files contain all of their respective tables. This way the font file is valid not only from a PDF standpoint, but also for other tools or font engines that might be required to process the document.

Include Valid Metadata

Developers should ensure that the font metadata contains a valid font family name, either through the FontName attribute or the FamilyName attribute. Using cryptic names such as “/F1234,Bold” to represent /Arial,Bold is permitted by PDF specifications, but prevents the viewer from doing any optimization, since the viewer will not recognize “/F1234” as a valid font family name. Developers also need to make sure that all metadata values reflect the actual value(s) of the font file, even if these values seem unimportant. A common example is setting an incorrect value for the AvgWidth, which has no (immediate) visual effect until a viewer attempts to optimize the viewing of the PDF.
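
As a rough illustration of why a cryptic name defeats optimization, the sketch below mimics a viewer trying to recover a family name from a font name. The `KNOWN_FAMILIES` set stands in for the fonts installed on a system and is purely illustrative, as are the helper names; the sketch assumes the common conventions of a six-letter subset prefix ending in "+" and a style suffix introduced by a comma or hyphen.

```python
import re

# Illustrative stand-in for the font families available on the system.
KNOWN_FAMILIES = {"arial", "helvetica", "tahoma", "calibri"}

def family_from_font_name(font_name):
    """Strip the leading slash, a subset prefix, and any style suffix."""
    name = font_name.lstrip("/")
    name = re.sub(r"^[A-Z]{6}\+", "", name)        # e.g. "ABCDEF+Arial" -> "Arial"
    return re.split(r"[,-]", name, maxsplit=1)[0]  # e.g. "Arial,Bold" -> "Arial"

def is_recognizable(font_name, known=KNOWN_FAMILIES):
    """True if the recovered family name matches a known family."""
    return family_from_font_name(font_name).lower() in known
```

With these assumptions, "/Arial,Bold" resolves to a known family while "/F1234,Bold" does not, so the viewer has nothing to match against a system font.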

Font Duplication

Optimizing PDF files so that they do not contain multiple instances of the same font is also important. Developers frequently encounter PDFs that contain one instance of a specific font per page. Font duplication not only hinders optimization but it also increases file size and slows down document processing. True, it is easier to generate a PDF that contains multiple instances of a font, but then it becomes much more complicated to make sure those duplicate fonts are removed before saving the document afterwards.
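
One simple way to spot exact duplicates is to hash each embedded font program and group identical digests. The sketch assumes the font programs have already been extracted into a dictionary keyed by a (hypothetical) object identifier; it only catches byte-identical copies, so differently subsetted instances of the same font would need a deeper comparison.

```python
import hashlib
from collections import defaultdict

def find_duplicate_fonts(font_programs):
    """Group object identifiers whose embedded font bytes are identical."""
    ids_by_digest = defaultdict(list)
    for object_id, data in font_programs.items():
        ids_by_digest[hashlib.sha256(data).hexdigest()].append(object_id)
    return [sorted(ids) for ids in ids_by_digest.values() if len(ids) > 1]

# Hypothetical document in which objects 5 and 9 embed the same font bytes.
duplicates = find_duplicate_fonts({5: b"font-A", 9: b"font-A", 12: b"font-B"})
```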

Tools and Alternatives

PDF is not a new technology, yet there are constantly new PDF tools to work with. These include free online PDF conversion services, application plug‐ins, and popular open source tools. How is one to choose which tools offer the quality and optimization output most appropriate for the task at hand?

To many, PDF is simply the final output of a work document. To others (especially within development environments), a PDF is a document format that may be integrated into a larger, more complex series of tasks. For instance, some applications process large numbers of individual PDF files, remove specific or sensitive metadata from them, and recreate single PDF documents.

By contrast, other applications take single large PDF files and recreate hundreds (or more) of individual documents. In both cases, such applications routinely process PDF files that come from different producers, differ in (internal) structure, contain errors, or have vital information missing. It is in these demanding PDF processing tasks (often in large corporate environments) that the absence of the right PDF tools (or their customizations) and development experience leads to problems.

An example of a tool designed to operate within the confines of demanding PDF processing is the Amyuni PDF Converter. Its developers recognized early on the technological and development directions in which PDF was heading and designed the PDF Converter to accommodate the ever-changing PDF landscape. From the start, it provided what developers expected from a conversion tool: documents that rendered and optimized predictably, regardless of the output environment.

A Needle in the PDF Haystack

In addition to missing metadata, the inability to know why or where the internal structure of a PDF has gone wrong slows down development cycles and increases technical support costs later on. An example of a tool designed to explore these problems is the Amyuni PDF Analyzer. It was designed for the developer who needs to know whether a document complies with the minimum font specifications needed to optimize font rendering. Its ability to scrutinize the many PDF objects gives developers a better understanding of the inner workings of their documents and, ultimately, greater control over how their documents are processed and optimized.

Learning From the Evolution of PDF

Having been involved with the portable document format almost from the beginning, developers at Amyuni Technologies have had the opportunity to experience and troubleshoot a multitude of PDF development scenarios. The result is a set of PDF tools that produce documents free of duplicate fonts, making PDFs smaller and faster to process.

Because the best practices discussed earlier are integrated into Amyuni products, the documents they produce are already well-structured. The well-structured fonts allow any viewer to optimize the rendering and display the document seamlessly, whether from a desktop or a remote connection. If a font that is in the PDF is missing from the system, a viewer such as the Amyuni PDF Creator can easily find a substitute font. The result: a PDF that renders as it should on different platforms and is almost indistinguishable from the original document.

Applications

Pruning the tree of problematic PDFs is one approach to fixing missing or incorrect metadata, and this is simply the reality of software development. However, it has always been Amyuni’s approach to avoid potential metadata-related problems at the root (by combining the right tools and best practices) before they can arise. Why? Because the PDF is expected to do more than it did 15 years ago. For example, PDFs are expected to:

  • Display in numerous applications and viewers other than Adobe Acrobat.
  • Be archived in various media formats, such as XML, XAML, databases, etc.
  • Be accessed and processed using different tools and platforms.

Of course, no development environment can ever predict or avoid every possible PDF scenario. Developers are often left having to fix some of the problems discussed in this paper and, again, the choice of tools can lead to different results, some of which are not apparent until later on. The Amyuni PDF Creator is another example of such a tool. Positioned to enable developers to optimize documents, the PDF Creator can:

  • Fix a font file that was not properly embedded by a third-party tool or fix errors in its tables (Figure 8).
  • Detect and remove duplicate font entries in a PDF.
  • Ensure that all font file tables and font metadata are accurate.

Figure 8: Font Error and Repair

As we have seen, achieving optimal PDF results can be a daunting affair. Missing or erroneous metadata is just one of several scenarios that can undermine the user experience and integrity of a PDF document. The nuances inherent in a PDF document are not always obvious until it’s too late.

Conclusion

Inaccurate or illegible characters may be negligible to some in an office memo, but when PDF documents are the cornerstone of medical records, insurance policies, or judicial statements, there is no room for inaccuracies or poor legibility. Their content must be clear and indisputably accurate. The same expectations are also warranted in PDF application environments that rely on the timely and efficient processing of documents to avoid software crashes and production interruptions.

As the use of PDF continues to grow, so does our dependence on it. The recent emergence (and importance) of PDF/A as a standard is a testament to how seriously document consortiums view the integrity of PDF content. Although new document formats position themselves as potential alternatives to the portable document format, it’s up to PDF developers and vendors to continue to push for better and more efficient methods of improving a technology we sometimes take for granted.

Franc Gagnon is the technical copywriter for Amyuni Technologies
www.amyuni.com

© Amyuni Technologies Inc. All rights reserved. All trademarks are property of their respective owners.

Missing PDF Fonts: Why it Happens and What You Can Do About it (Part 1)

Aug 28 2009

Missing PDF Fonts: What Causes Fonts in a PDF to Render Incorrectly or Even Go Missing

This document is the first of two that will look at some of the challenges faced by developers and non-developers who work with PDF technologies and who are curious about what causes fonts in a PDF to render incorrectly or even go missing. Specifically, these documents provide an overview of some of the problems associated with missing font information in PDFs.

The first document presents the Portable Document Format as well as industry terms and concepts related to that format. The problem of missing font information will also be introduced. The second document expands on those terms and concepts and explores some of the common scenarios in which PDFs are missing some or all of their font information.

Brief Overview of PDF

The Portable Document Format was originally conceived in 1991 as the Camelot Project, by Adobe’s co-founder Dr. John Warnock. Inspired by the device independence of PostScript, Dr. Warnock wanted to develop a technology that could accurately display and print electronic documents across different operating systems, hardware, or applications. His answer was the PDF.

Unlike its predecessor (i.e., PostScript), PDF was first and foremost a file format and not a programming language. Although PDF evolved from PostScript, the primary difference is that PostScript is a full programming language and PDF is not: PDF does not contain programming constructs such as loops, other control-flow constructs, or variables.

Rather, PDF was envisioned to go further than PostScript by being able to describe how pages behave and what type of information a document could contain. Years later, the PDF would encompass complex features and functionalities such as search capabilities, audio, and even video.

On July 1, 2008, PDF became an open standard published by ISO as ISO 32000-1:2008.

PDF Structure

PDFs are essentially collections of data objects organized in a hierarchical manner that describe how one or more pages in a document must be displayed. These data objects can describe a page, a resource, other objects, a sequence of operating instructions, and so on. Furthermore, a data object can reference other objects and be referenced by other objects (i.e., an object can be a parent object and a child object at the same time).

PDF documents contain four main types of objects that define their structure:

  • The document catalog object
  • Page objects
  • Page content objects
  • Document and page resources

Document Catalog Objects

The document object typically contains a cross reference table and page objects. It can also contain elements such as document information, named destinations, thumbnails, and bookmarks.

Page Objects

Page objects can contain one or more content objects as well as several other types of elements such as page cropping information, hyperlinks, article threads, file annotations, form fields, digital signatures, and child pages in the document. Page objects also contain references to all the resources used by a page.

Content Objects

Content objects contain marking operators (i.e., drawing instructions) and use resources such as fonts, images, or color spaces that are needed to fully render the page.

Resource Objects

PDF defines a number of resource objects such as fonts, images, color spaces, patterns, etc. Fonts are needed to render text, color spaces represent colors used in the document, patterns define how backgrounds are painted, etc.

PDF Organization

PDFs are sectioned into four separate areas:

  • The header.
  • The body.
  • The cross-reference table (Xref).
  • The trailer.

The Header

The header contains a comment that identifies the nature of a PDF document and the specifications to which it adheres. For example, the comment outlined in Figure 1 indicates that the document conforms to Version 1.7 of the PDF specification.

Figure 1. The Header
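
Reading the version from the header comment can be sketched as below. The `pdf_version` helper is a hypothetical name; the sketch assumes the usual `%PDF-M.m` form of the comment and searches only the first 1024 bytes, since some files carry a few stray bytes before the header.

```python
import re

HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")

def pdf_version(data):
    """Return (major, minor) from the header comment, or None if it is absent."""
    match = HEADER_RE.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# The bytes of a file that starts with the Version 1.7 header comment.
version = pdf_version(b"%PDF-1.7\n1 0 obj\n<< >>\nendobj\n")
```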

The Body

The body of a PDF is where the content objects in the document are located. These objects include text streams, image data, fonts, annotations, and so on (Figure 2). The body can also contain numerous types of invisible (non-display) objects that help implement the document’s interactivity, security features, or logical structure. Each object has three essential components: a numerical identifier, a fixed position (also known as an offset), and its content.

Figure 2. Examples of Objects in the Body
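
The link between an object's identifier and its offset can be illustrated by scanning a file for object headers of the form `N G obj` and recording where each one starts. This is only a sketch with a hypothetical `object_offsets` helper: a real parser relies on the cross-reference table instead, because binary stream data can contain byte sequences that look like object headers.

```python
import re

# An object starts with "<object number> <generation number> obj" on its own line.
OBJ_RE = re.compile(rb"(?m)^(\d+) (\d+) obj\b")

def object_offsets(data):
    """Map each object number to the byte offset where its header begins."""
    return {int(m.group(1)): m.start() for m in OBJ_RE.finditer(data)}

sample = (b"%PDF-1.7\n"
          b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
          b"2 0 obj\n<< /Type /Pages >>\nendobj\n")
offsets = object_offsets(sample)
```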

The Cross-Reference Table

The cross-reference table (Figure 3) lists the locations of all the objects in a PDF document. The table is divided into subsections, each beginning with the number of its first object and the count of objects it covers. With the cross-reference table, a PDF parser can look up the offset of any object directly (random access) and reach it without having to read the entire file.

Figure 3. Xref Table
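
A minimal parser for the classic table syntax might look like the sketch below. It assumes the traditional text form, in which each subsection header gives the first object number and the number of entries, and each entry holds a 10-digit offset, a 5-digit generation number, and an `n` (in use) or `f` (free) flag. The function name and the sample table are illustrative, and cross-reference streams are not handled.

```python
def parse_xref_section(text):
    """Return {object_number: (offset, generation, in_use)} for a classic xref table."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or lines[0] != "xref":
        raise ValueError("not a classic cross-reference section")
    entries = {}
    i = 1
    while i < len(lines) and lines[i] != "trailer":
        # Subsection header: first object number and number of entries.
        first, count = (int(part) for part in lines[i].split())
        for n in range(count):
            offset, generation, flag = lines[i + 1 + n].split()
            entries[first + n] = (int(offset), int(generation), flag == "n")
        i += 1 + count
    return entries

SAMPLE = """xref
0 3
0000000000 65535 f
0000000017 00000 n
0000000081 00000 n
5 1
0000000331 00000 n
trailer
"""
table = parse_xref_section(SAMPLE)
```

With the table in memory, finding object 5 is a dictionary lookup followed by a seek to byte 331, which is what makes random access possible.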

The Trailer

Even though the trailer (Figure 4) is technically the end of a PDF document, it is the first entry point that applications use to access the essential components of a PDF. The trailer contains pointers that parsers and applications use to locate the cross-reference table and other important objects in a PDF. Examples of important objects include the root object (that identifies the beginning of a page tree) and info objects (that contain vital metadata).

Figure 4. Trailer
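
Because the entry point sits at the end of the file, a parser typically reads the last block of bytes first. The sketch below looks for the `startxref` keyword, which is followed by the byte offset of the cross-reference table and the `%%EOF` marker. The `find_startxref` name and the 1024-byte window are assumptions made for illustration.

```python
import re

STARTXREF_RE = re.compile(rb"startxref\s+(\d+)\s+%%EOF")

def find_startxref(data, window=1024):
    """Return the cross-reference table offset recorded at the end of the file."""
    matches = STARTXREF_RE.findall(data[-window:])
    # Use the last match: incrementally updated files can contain several trailers.
    return int(matches[-1]) if matches else None

sample = (b"%PDF-1.7\n1 0 obj\n<< >>\nendobj\n"
          b"trailer\n<< /Size 6 /Root 1 0 R >>\n"
          b"startxref\n416\n%%EOF\n")
xref_offset = find_startxref(sample)
```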

Terms and Concepts

Before outlining the challenge of missing fonts in PDFs, it is important to review some of the underlying concepts and technologies that will be used throughout the rest of the documents.

Glyphs and Characters

Norman Walsh defines a glyph as: “the actual shape (bit pattern, outline) of a character image. For example, an italic “a” and a Roman “a” are two different glyphs representing the same underlying character. In this strict sense, any two images which differ in shape constitute different glyphs.” Glyphs, in turn, are organized into different types of fonts. By contrast, a character is an abstract symbol that is given shape through a glyph’s design.

Character Codes

A character code is a number associated with a specific character. For example, the character code “37” can display a different glyph depending on the typeface (e.g., Calibri, Arial, Webdings, etc.). At the most basic level, an application that renders PDF documents only needs to access the character codes, the font information, and the mapping from the character code to the font information. With this information the rendering application extracts the key graphical data to draw a glyph on an output device such as a screen or printer.

Fonts and Typefaces

Although the difference between fonts and typefaces may seem trivial to some, confusion still lingers within some development circles, where the term font families is commonly used when referring to typefaces. This is why it is important to clarify some of the upcoming terms.

A font is a comprehensive group of characters with a specific style of type. It includes the letter and number set, special characters, as well as diacritical marks (accents). Furthermore, a font specifies the member of a type family such as roman, boldface, or italic type.

Within the context of PDF software development, a font is a PDF object commonly referred to as a font object (Figure 5), font dictionary, or font data file. A font object contains a set of glyphs, characters, or symbols (such as Wingdings). The font object also identifies the font program and contains additional information such as its properties.

Figure 5. Example of a Font Object

By contrast, a typeface specifies a consistent visual appearance or style which can be a “family” or related set of fonts. Arial, Tahoma, or Helvetica are examples of typefaces. A typeface can contain a series of fonts. For example, a typeface such as Helvetica may include roman, bold, and italic fonts.

Font Technologies: Laying the Foundation

From their inception in the mid-1980s, font technologies have helped jump start the desktop publishing revolution and have enabled the written word to cross over to digital typesetting mediums.

Standards expanded, new font technologies emerged, and within a few years, the world of PDF had become more complex. Not only did those who developed PDF viewers and converters have to adapt to the emerging trends within the PDF industry, but they also had to support the rising demand for different languages.

Asian languages presented PDF developers with new challenges as the existing font technologies could no longer sufficiently answer increasing font complexities. These new challenges helped push font technologies and developers forward.

Outline Fonts

Although digital fonts are generally grouped into three format types (namely, bitmap, stroke, and outline (vector) types), this paper will focus on outline fonts. Unlike bitmap fonts that are collections of raster images of glyphs, outline fonts (also known as scalable fonts) are collections of vector images. This means that outline fonts describe glyphs using points that are interpreted as lines and curves.

The advantage to using vector images is that they can be scaled to varying sizes without losing too many details. By contrast, bitmap fonts lose their detailed edges and often appear jagged or choppy when resized (Figure 6).

Figure 6. Font Type Scalability Differences

Hinting: When Scalability Isn’t Enough

Even though outline fonts are scalable, there are many instances in which proper rasterization can be compromised. For example, different applications, output devices, or printers can affect rasterization. To address this problem, hinting technologies were developed. Hinting is additional mathematical information added to a font to ensure it retains its visual integrity when rasterized under various conditions.

Type 1 (PostScript) Fonts

PostScript fonts were developed by Adobe Systems to answer the demands of the laser printing technologies emerging at the time. Using a subset of the PostScript language, Type 1 fonts contain an organized collection of procedures to describe glyph forms.

In addition, Type 1 fonts describe glyph outlines using cubic Bézier curves. When introduced, Type 1 fonts were the first to include proprietary hinting technology to improve their display capabilities. Type 1 fonts store information in two files: one contains the character outlines (referred to as printer fonts) and the other contains the character information for on-screen display.

Type 3 Fonts

Type 3 fonts are similar to Type 1 fonts except that they don’t include hinting technology. While Type 1 fonts use only a subset of the PostScript language, Type 3 fonts can use most of it. This makes Type 3 fonts capable of displaying more elaborate designs and ligatures than Type 1 fonts. However, the added weight of the PostScript language also makes Type 3 font files larger, so they take up more memory. Because Type 3 fonts use bit-mapped technology instead of hinting, they often produce poorer display results when they are scaled.

TrueType Fonts

Developed by Apple Computer, TrueType fonts are similar to Type 1 fonts, but with some important differences. Like Type 1 fonts, TrueType uses Bézier curves to describe glyph outlines; however, TrueType employs quadratic curves rather than cubic ones.
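The difference between the two curve types can be sketched in a few lines of portable C. This is only an illustration of the underlying math (repeated linear interpolation), not code taken from any font rasterizer; the Point type and the function names are assumptions made for this example.

```c
#include <assert.h>

typedef struct { double x, y; } Point;

/* Linear interpolation between a and b at parameter t in [0, 1]. */
static Point lerp(Point a, Point b, double t)
{
    Point p = { a.x + (b.x - a.x) * t, a.y + (b.y - a.y) * t };
    return p;
}

/* Quadratic Bezier (the TrueType flavor): two end points, one control point. */
Point bezier_quadratic(Point p0, Point p1, Point p2, double t)
{
    assert(t >= 0.0 && t <= 1.0);
    return lerp(lerp(p0, p1, t), lerp(p1, p2, t), t);
}

/* Cubic Bezier (the Type 1 flavor): two end points, two control points. */
Point bezier_cubic(Point p0, Point p1, Point p2, Point p3, double t)
{
    assert(t >= 0.0 && t <= 1.0);
    Point a = bezier_quadratic(p0, p1, p2, t);
    Point b = bezier_quadratic(p1, p2, p3, t);
    return lerp(a, b, t);
}
```

A cubic segment has one more control point than a quadratic segment, which gives the designer more freedom per segment at the cost of slightly more arithmetic per evaluated point.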

Another difference between TrueType and Type 1 is that TrueType contains both the screen and printer font data in a single file. In addition, hinting information is stored inside the font file. This additional information makes TrueType fonts larger than their original PostScript rivals.

Unlike Type 1 files, however, which are composed of a subset of the PostScript language, TrueType font files are composed of structured tables. Each table contains the necessary information that applications or PDF viewers need to use and display a font. Tables also contain information to ensure that glyphs are displayed correctly when there are different types of internal encodings used in a document.

OpenType

OpenType fonts bring together some of the Type 1 and TrueType technologies in one cross-platform format. OpenType’s character encoding is based on Unicode, and a single font can contain up to 65,536 glyphs. As a result, OpenType offers more development flexibility, especially when working with Asian character sets and more sophisticated Roman glyphs that may use non-lining numerals, small caps, fractions, ligatures, and swashes. Like TrueType, an OpenType font contains all of its outline, metric, and bitmap information in a single file.

Font File Structures

In addition to their technological differences, fonts can also be categorized according to how they are structured as PDF objects. Generally, fonts can be structured as:

  • Simple Fonts
  • Composite Fonts

PDFs contain font objects (Figure 5) that essentially act as wrappers for embedded font programs that contain the actual font data. Font programs can be TrueType, OpenType, Type 1, and so forth. Font objects also contain a number of properties and descriptions of the font data in order to enable PDF applications and viewers to use the font in the document.

Simple Fonts

Simple Fonts use a single byte of information to represent a glyph. As a result, a maximum of 256 (2^8) different glyph representations are possible. The Simple Font category includes the original instances of Type 1 and TrueType fonts.

Composite Fonts

Because of their 256-character encoding limitation, Simple Fonts could not support complex Asian glyphs (a typical Japanese font can have over 7,000 Kanji, Katakana, and Hiragana characters) or non-horizontal writing.

The solution was the development of Composite Fonts (or CID fonts). Unlike Simple Fonts, Composite Fonts are multi-byte and can thus contain an arbitrary number of glyphs. As a result, Composite Fonts are able to support a wider range of glyphs.

Composite Font technologies enable developers to use any number of base fonts and create new composite fonts. Composite font technologies also enable developers to include two sets of character spacing details (metrics) in fonts. One metric can be used for horizontal writing mode and another for vertical writing mode. Aside from their ability to handle complex glyphs, Composite Fonts are also flexible and expandable.

CMap File

A CMap is an ASCII text file that contains the PostScript language instructions required to map character codes to CID codes used by Composite Fonts. For example, after a character code is processed (from a keyboard input), the CMap file maps the character code to a corresponding Character Identifier number (CID). The CID code is then passed on to the Composite Font which will in turn generate the appropriate glyph. As we shall see in the next document, CMap files can also be missing and impact proper PDF processing.
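The mapping step can be pictured as a lookup in a table of code ranges. The sketch below is a simplified illustration in portable C, not a CMap parser: the CidRange type and cmap_lookup function are hypothetical names, and the ranges a caller supplies are whatever its CMap declares. It assumes the usual convention that CID 0 stands for the "not defined" glyph.

```c
#include <assert.h>
#include <stddef.h>

/* One range entry: character codes lo..hi map to consecutive CIDs
   starting at first_cid (hypothetical structure for this sketch). */
typedef struct {
    unsigned lo;
    unsigned hi;
    unsigned first_cid;
} CidRange;

/* Returns the CID for a character code, or 0 when no range covers it. */
unsigned cmap_lookup(const CidRange *ranges, size_t count, unsigned code)
{
    assert(ranges != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (code >= ranges[i].lo && code <= ranges[i].hi)
            return ranges[i].first_cid + (code - ranges[i].lo);
    }
    return 0; /* no mapping found */
}
```

When the table is missing or incomplete, every lookup falls through to the final return, which is one way to picture how a missing CMap prevents the correct glyphs from being selected.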

Font Embedding

To display, print, or process a PDF accurately, it must contain the necessary font information. If font information is missing, recipients may not be able to display or edit the document properly or, worse, applications may not be able to process the PDF at all.

Embedding fonts in a PDF ensures that they display and print exactly from one system to another as the author intended. The following sections will look at how fonts are embedded in PDFs and introduce the upcoming subject matter for the following document.

Full Font Embedding

The first method of embedding fonts is full font embedding. Full font embedding effectively makes the font part of the PDF, thereby preventing font substitution when recipients need to display or print the PDF. Essentially, recipients don’t need to have the same fonts installed to view or edit the document. This method is advisable in situations in which modifications to the PDF are expected.

Full font embedding can also potentially help avoid some of the problems associated with missing system fonts and ensure optimal viewing regardless of the system and platform. In an ideal PDF world, fully embedding all fonts would reduce many development woes.

The main drawbacks to full font embedding are file size and licensing issues. Every embedded font makes the document larger, especially if it contains Chinese, Japanese, or Korean (CJK) fonts, which can be problematic. In fact, CJK fonts are rarely fully embedded due to their large character sets. Also, fully embedded fonts can be extracted and used outside of the PDF file. As a result, this font extraction can create the potential of unlimited font distribution and violate the licensing policy of the font manufacturer. The solution then is to partially embed fonts in a document.

Partial Font Embedding (Subsetting Fonts)

Unlike full font embedding, subsetting a font only embeds the glyph definitions for the characters used (i.e., that are displayed in the PDF).
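The first step of subsetting, working out which characters are actually used, can be sketched as follows. This is a minimal illustration for a single-byte (simple) font only; collect_used_codes is a hypothetical helper, and a real subsetter would go on to copy the matching glyph definitions and rebuild the font program.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Marks every character code that appears in the text and returns how
   many distinct codes were found. used[] has one flag per possible
   single-byte code. */
size_t collect_used_codes(const unsigned char *text, size_t len,
                          unsigned char used[256])
{
    size_t distinct = 0;

    assert(text != NULL || len == 0);
    memset(used, 0, 256);
    for (size_t i = 0; i < len; i++) {
        if (!used[text[i]]) {
            used[text[i]] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

Only the glyphs flagged in used[] would need to be embedded, which is why a subset is usually a small fraction of the full font.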

There are three main reasons one should subset fonts. First, as previously stated, PDFs are primarily for content exchange and viewing. PDF is not an ideal editing format, despite the popularity of PDF editing programs available on the Internet, and it is generally assumed (rightfully or wrongfully) by the PDF’s creator that the recipient will not modify the document’s contents. As we shall see in the following document, editing a PDF is not always a straightforward affair.

Second, subsetting fonts reduces document size. For example, the size of the font “Arial Unicode MS” is nearly 20MB; however, subsetting this font to show 10 Kanji characters would add only approximately 25KB to the PDF. In cases where CJK fonts are used, fully embedding all fonts would result in very large files.

Third, subsetting fonts avoids licensing issues because the font becomes unusable for purposes other than rendering the document, a use that font licensors often permit.

The drawback of partially embedded fonts is that if recipients do not have the fonts on their system, they will not be able to edit the document or will be very limited in their ability to edit text. This is where the problem of missing fonts begins to emerge.

When Fonts Go Missing

Now that some of the key PDF and font concepts have been reviewed, the different problems that can occur when font information is missing can be addressed.

The following document (Part 2) will explore how problems associated with missing font information can start right at the source, with the creation of the PDF document itself. These problems include full and partial font embedding, incomplete font information in TrueType fonts, and missing CMap files.

References:

Walsh, Norman. “Frequently Asked Questions About Fonts.” 14 August 1996. <http://nwalsh.com/comp.fonts/FAQ/index.html>

Franc Gagnon is the technical copywriter for Amyuni Technologies
www.amyuni.com

© Amyuni Technologies Inc. All rights reserved. All trademarks are property of their respective owners.

The Benefits of the Layered PDF and the Amyuni PDF Creator

Aug 07 2009

With the release of Acrobat 6 (PDF 1.5), Adobe introduced the concept of the layered PDF. A layered PDF contains optional content that can be displayed or hidden by the reader. It is typically generated by software with layer-generating capabilities. For example, AutoCAD, CorelDRAW, Adobe’s InDesign and Photoshop, and Microsoft’s Visio are all capable of producing layered documents or files.

Users who work with such software can separate content into layers during page composition, save their file as PDF, and retain their separated content. A layered PDF viewer such as Amyuni’s PDF Creator then detects the document’s optional content and displays it according to its visibility settings. The primary benefit of a layered PDF is its ability to contain different types of content within a single document. For example, a single software brochure can contain multiple languages (where each layer is a separate language), with the visible layer set to match a specific linguistic market.

The layered PDF has been extensively used by government immigration, education, and foreign relations departments. In addition, optional content is also used in the following fields and industries:

  • Engineering firms.
  • Architectural firms.
  • Cartography, geology, and geography departments.
  • Legal and judicial departments.
  • Software research and development departments.
  • Multimedia companies and departments that work with layered artwork.

Since the inception of optional content, Amyuni has initiated significant development efforts to ensure this feature of the PDF Creator:

  • Continues to meet and surpass the expectations of the layered PDF market.
  • Remains a competitive alternative to other layered PDF viewers.

We invite you to learn more about Amyuni’s PDF Creator and its optional content capabilities, or about the Amyuni PDF Suite (which also includes the PDF Creator), at Amyuni.com.

Franc

Franc Gagnon is the technical copywriter for Amyuni Technologies
www.amyuni.com

Under the Radar: How PDF Provides Cover and Sends the Goods

Jul 28 2009

Email technology is a carefully exploited terrain and individuals are increasingly aware of its vulnerabilities. In early 2009, Microsoft reported that 97% of all emails are of “unwanted origin”. “The good news is that the majority of that never hits your inbox although some will get through,” said Cliff Evans to BBC News.

However, as a result of this “catch-all” approach, many legitimate and important messages never make it to their intended destinations. This is where the portable document format and Amyuni’s expertise come together as a single service: PDF Courier.

With PDF Courier, businesses and individuals can use an online or desktop application to bypass email servers, send large attachments, and even convert documents on the fly to PDF for enhanced security.

Learn more about PDF Courier at: http://www.pdfcourier.com

Franc

Franc Gagnon is the technical copywriter for Amyuni Technologies
www.amyuni.com

References:

http://news.bbc.co.uk/2/hi/technology/7988579.stm
http://www.securityfocus.com/news/3106

Analyzing Your PDFs: Save Time and Reduce Costs

Jul 13 2009

By the mid- to late 1990s, Adobe’s version updates, coupled with the use of PDF to send complex documents via email, gave emerging PDF vendors and developers new technological opportunities. These opportunities also introduced different development methodologies, and these differences often created problems, in particular “missing font information”.

For developers and IT managers, missing font information is problematic as it can delay software development and hinder the production cycle. PDF Analyzer enables developers to troubleshoot and repair PDF documents early, saving time and reducing technical support overhead.

Learn more about the PDF Analyzer at http://www.pdfanalyzer.com.

Franc

Franc Gagnon is the technical copywriter for Amyuni Technologies
www.amyuni.com

Amyuni in SD Times Top 100

Jun 19 2009

Amyuni Developer Tools are among a handful of components to win in the Components and Libraries category of the 2009 SD Times 100.

The SD Times 100 recognizes the top innovators and leaders in multiple software development industry areas.

Learn more about Amyuni Developer Tools

Alex Furness is the Marketing Director for Amyuni Technologies
www.amyuni.com

Creating Layered PDFs with Escape Calls to PDF Converter Printer

Jun 01 2009

Background Information

PDF Layers are called Optional Content within the PDF specifications. This is a feature that was added as of PDF-1.5 (Acrobat 6). A layer is identified by an identifier and a title. Only the layer title is visible to the end-user of the PDF document. Layers can be structured as a tree or as a hierarchy of layers. Here is a sample Optional Content object in a PDF:

9 0 obj <</Type/OCG/Name(Blue Layer) >> endobj

Each page using this layer should add the layer to its list of resources, for example:

6 0 obj <</Type /Page /Parent 3 0 R /MediaBox [0 0 612 792 ] /Contents [7 0 R ]
/Resources << /ProcSet [/PDF /Text] /Font <</F10 10 0 R >>
/Properties <</OC9 9 0 R >>
>> >> endobj

OC9 is the layer ID and (Blue Layer) is the layer title (strings in PDF are surrounded by parentheses). To indicate that various objects within the page content belong to specific layers, the following syntax is used:

% start the Blue Layer that contains the text: Page 1
/OC /OC9 BDC
BT /F10 5.76  Tf 1 0 0 1 24 745.56 Tm 0.026  Tc -0.067  Tw (Page 1) Tj ET

% start a sub-layer that will contain a filled rectangle
/OC /OC11 BDC
n 1 g 96 326.96 35.76 -6.6 re f*
% end of sub-layer 1
EMC

% start another sub-layer that will contain another filled rectangle
/OC /OC13 BDC
n 1 g 96 526.96 35.76 -6.6 re f*
% end of sub-layer 2
EMC

%end of blue layer
EMC

Although the code shows one parent layer and two nested layers (contained within the parent layer), this is not enough to define the hierarchy of layers. The Order string is used to determine the hierarchy of layers. The Order string is part of the document Catalog. The Order string for the above PDF sample should look like this:

/Order [ 9 0 R [11 0 R 13 0 R] ]

The object IDs are needed to create a proper order string. The order string can also contain a parent node that is not a layer but only a container for other layers. In the example above, we could have all the layers belong to one parent node called (Layered PDF) by using:

/Order [ (Layered PDF) 9 0 R [11 0 R 13 0 R] ]

It is important that the Order string matches the order of layers as they appear in the page content, otherwise the PDF viewer will not be able to properly “show and hide” layers with their sub-layers.

Note: There is an inconsistency in the PDF specifications related to defining nodes that are parents of other layers. To have a Layered PDF as the parent of layer 9, we do not use [(Layered PDF) [9 0 R] ] as one might expect but [(Layered PDF) 9 0 R] as if the 2 nodes were on the same level.

Creating Groups of Layers

The PDF format allows creating groups of mutually exclusive layers where showing one layer will automatically hide the other. These are called radio-button groups and defined using the RBGroups PDF keyword. One can also define which layers are visible by default and which are hidden, for example:

/RBGroups [[11 0 R 13 0 R]] /ON [11 0 R] /OFF [13 0 R]

indicates that layers 11 and 13 are grouped and that layer 11 is visible by default whereas layer 13 is hidden.

Dynamically Creating Layers with the Amyuni PDF Converter

External developers need to be aware of layer titles only. The layer IDs and object numbers are not significant to the developer, so the implemented API uses only layer titles to insert layers and define the layer hierarchy.

Layers can be inserted within a PDF file while the document is being printed by calling Escape sequences. Escape sequences are GDI artifacts that enable developers to send custom data to a printer driver.

Checking the Printer for PDF Layer Support

The first step in adding layers is to check that the printer supports the escape sequences that are defined by Amyuni. The following calls are needed:

// check if PDF printer supports layers
#define ESCAPE_SETLAYER 248

CHAR technology[4];
int escape = ESCAPE_SETLAYER;

// ask the driver for its technology string
ExtEscape( hDC, GETTECHNOLOGY, 0, NULL, sizeof(technology), (LPSTR)technology );

// the technology should be PDF
if ( lstrcmp(technology, "PDF") )
{
    MessageBox( 0, "Not an Amyuni PDF driver", "Error", MB_ICONERROR );
}

// and support the SETLAYER escape
if ( !ExtEscape( hDC, QUERYESCSUPPORT, sizeof(escape), (LPCSTR)&escape, 0, NULL ) )
{
    MessageBox( 0, "Not an Amyuni PDF driver", "Error", MB_ICONERROR );
}

GETTECHNOLOGY and QUERYESCSUPPORT are escape sequences that are predefined by Windows GDI. ESCAPE_SETLAYER is a custom escape that is processed by the Amyuni printer.

The SETLAYER escape takes one parameter which is a Unicode string. The escape can be wrapped into a helper function such as:

void SetLayer( HDC hDC, LPWSTR LayerInfo )
{
    // the input size is the string length in bytes, including the terminating null
    switch ( ExtEscape( hDC, ESCAPE_SETLAYER,
                        (int)((wcslen(LayerInfo) + 1) * sizeof(LayerInfo[0])),
                        (LPCSTR)LayerInfo, 0, NULL ) )
    {
    default:
        // positive return indicates success
    case 0:
        // no error
        return;

    case -1:
        // error occurred, closing a layer when none was open
        break;

    case -2:
        // memory allocation error (low memory or invalid string)
        break;
    }
}

The same escape call is used to start a layer or sub-layer, end a layer or define the hierarchy of layers.

Setting the Hierarchy of Layers

The order or hierarchy of layers is sent to the PDF printer using the SETLAYER escape before the call to StartDoc. It can actually be called anytime during the printing as long as the call is made outside of a StartPage/EndPage block but before the EndDoc is called. The advantage of setting the hierarchy before StartDoc is that it allows the PDF printer to expect layer support and switch the file header to PDF-1.5.

If the hierarchy of layers is not known before the call to StartDoc, the developer can call SetLayer( hDC, L"" ) with an empty string before the call to StartDoc simply to instruct the PDF printer to output a PDF-1.5 header. The order string should contain the titles of all the layers and their hierarchy, all in Unicode format. For example:

// set the order and hierarchy of layers
SetLayer( hDC, L"/Order[(Blue Layer)[(Blue Layer – 1)(Blue Layer – 2)](Red Layer)(Green Layer)]" );

This produces the following layer structure:

Blue Layer
    Blue Layer – 1
    Blue Layer – 2
Red Layer
Green Layer
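When the layer titles are only known at run time, the order string can be assembled from a small tree of titles before it is passed to SetLayer. The following is a minimal sketch in portable C; LayerNode and build_order are hypothetical helpers written for this post, not part of the Amyuni API, and the sketch assumes that titles contain no parentheses (which would need escaping).

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* A layer title with optional sub-layers (hypothetical helper type). */
typedef struct LayerNode {
    const wchar_t *title;
    const struct LayerNode *children;
    size_t child_count;
} LayerNode;

/* Appends src to dst if it fits in cap characters; 0 on success, -1 otherwise. */
static int append(wchar_t *dst, size_t cap, const wchar_t *src)
{
    size_t used = wcslen(dst), need = wcslen(src);
    if (used + need + 1 > cap)
        return -1;
    wcscpy(dst + used, src);
    return 0;
}

/* Writes "(Title)" for each node, followed by "[...]" for its sub-layers. */
static int append_nodes(wchar_t *dst, size_t cap,
                        const LayerNode *nodes, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (append(dst, cap, L"(") || append(dst, cap, nodes[i].title) ||
            append(dst, cap, L")"))
            return -1;
        if (nodes[i].child_count > 0) {
            if (append(dst, cap, L"[") ||
                append_nodes(dst, cap, nodes[i].children, nodes[i].child_count) ||
                append(dst, cap, L"]"))
                return -1;
        }
    }
    return 0;
}

/* Builds "/Order[...]" into dst; returns 0 on success, -1 if dst is too small. */
int build_order(wchar_t *dst, size_t cap, const LayerNode *roots, size_t count)
{
    assert(dst != NULL && cap > 0);
    dst[0] = L'\0';
    if (append(dst, cap, L"/Order[") ||
        append_nodes(dst, cap, roots, count) ||
        append(dst, cap, L"]"))
        return -1;
    return 0;
}
```

The resulting buffer can then be handed to the SetLayer helper in place of a string literal.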

To group layers within radio-button groups, the /RBGroups, /ON and /OFF entries should be added to the call above. This code, for example, makes the red and green layers part of a group:

// set the order and radio-button groups of layers
SetLayer( hDC,
    L"/Order[(Blue Layer)[(Blue Layer – 1)(Blue Layer – 2)](Red Layer)(Green Layer)]"
    L"/RBGroups[[(Red Layer)(Green Layer)]]/OFF[(Red Layer)]/ON[(Green Layer)]" );

Drawing Objects within Layers

Within a StartPage/EndPage sequence of calls, calling SetLayer with a valid layer title will start a new layer and place all subsequent drawing instructions within that layer. Calling SetLayer with an empty string will end the last layer.

If multiple layers are nested, multiple calls to SetLayer with an empty string are needed to close all layers. If all layers are not closed at the call to EndPage, the PDF printer will automatically close all open layers, for example:

// start parent layer
SetLayer( hDC, L"Blue Layer" );
SetTextColor( hDC, RGB(0, 0, 255) );
TextOut( hDC, 200, yPos, buf, lstrlen(buf) );
yPos += 200;

// start blue sub-layer 1
SetLayer( hDC, L"Blue Layer – 1" );
SetTextColor( hDC, RGB(0, 0, 128) );
TextOut( hDC, 800, yPos, "Blue Layer – 1", lstrlen("Blue Layer – 1") );
yPos += 200;
SetLayer( hDC, L"" );    // close blue sub-layer 1
SetLayer( hDC, L"" );    // close blue parent layer

The sequence of opening and closing layers should be consistent with the order string for the PDF viewer to show or hide layers with their sub-layers in a consistent way.
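One way to keep the open and close calls balanced is to count the nesting depth in a small wrapper. The sketch below is portable C and purely illustrative: LayerScope, layer_open, layer_close and layer_close_all are hypothetical names, and the actual escape call is abstracted behind a function pointer so the logic can be exercised without a printer. The log_call function is a stand-in that only counts calls; in a real application the callback would forward to the SetLayer helper shown earlier.

```c
#include <assert.h>
#include <stddef.h>

/* Callback that performs the real work, e.g. by forwarding to SetLayer. */
typedef void (*SetLayerFn)(void *ctx, const wchar_t *info);

typedef struct {
    SetLayerFn set_layer;  /* how to send a layer command */
    void *ctx;             /* passed through to set_layer */
    int depth;             /* number of layers currently open */
} LayerScope;

/* Opens a layer or sub-layer with the given (non-empty) title. */
void layer_open(LayerScope *s, const wchar_t *title)
{
    assert(title != NULL && title[0] != L'\0');
    s->set_layer(s->ctx, title);
    s->depth++;
}

/* Closes the innermost layer; returns -1 if none is open. */
int layer_close(LayerScope *s)
{
    if (s->depth == 0)
        return -1;
    s->set_layer(s->ctx, L"");  /* an empty string ends the last layer */
    s->depth--;
    return 0;
}

/* Closes every layer that is still open, innermost first. */
void layer_close_all(LayerScope *s)
{
    while (layer_close(s) == 0) {
        /* keep closing until no layer remains open */
    }
}

/* Stand-in callback for illustration: counts open and close requests. */
typedef struct { int opened; int closed; } CallLog;

void log_call(void *ctx, const wchar_t *info)
{
    CallLog *log = ctx;
    if (info[0] == L'\0')
        log->closed++;
    else
        log->opened++;
}
```

Calling layer_close_all before EndPage makes the closing explicit instead of relying on the printer to close open layers automatically.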

The contents for this post were provided by Dany Amiouny using the Amyuni PDF Converter.

To learn more about Amyuni Developer Pro tools please visit www.amyuni.com.

Dany Amiouny is the CTO for Amyuni Technologies
www.amyuni.com

Amyuni Documentation is Now Online!

May 11 2009

Hi everyone!

Our product documentation is now online.

Have a look and browse through the subject matter in the Content area (by product name) or by using the Search tab. This week I’ve started working on the index to provide you with an additional option to look up information faster.

The online help is an ongoing project. If you have any comments or suggestions about it, feel free to contact me at franc.gagnon@amyuni.com.

Franc

Franc Gagnon is the Technical Writer for Amyuni Technologies
www.amyuni.com

Amyuni Tools Certified for Windows Server 2008

Apr 13 2009

Windows Server 2008

The Amyuni Developer Tools are now Certified for Windows Server 2008.

License holders can log in to update/upgrade to v4.0.

Dany

Dany Amiouny is the CTO for Amyuni Technologies
www.amyuni.com