Understand the PDF format

23 March 2019, by luigi

The purpose of these notes is to offer a brief analysis of a PDF document, outlining its structure and constituent elements, and then to address the issues related to handling fonts, vector graphics and raster images.

For brevity, we will not deal with data compression or with the revision or update of a previously created file, nor will we see how to create thumbnails or the bookmark structure of the document.

The information contained in this section IS NOT NECESSARY to use the library, but can be used for building new functions.

The basic elements

A PDF file is essentially a structured text file, that is, a sequence of characters and line separators (ASCII 13 or ASCII 10) in which information takes on a particular meaning depending on the structure it is placed in, and each structure follows a precise syntax.

The basic elements are the objects (“obj”), which can contain, as needed, data sequences (“stream”), dictionaries (“dictionary”) or other values. An object can represent a page, an image, a graphic sequence, etc.

Every object, enclosed between the keywords obj and endobj, is identified by a number and a revision. Since we will not consider changes to previously created files, all of our objects will have revision number 0 (zero).

2 0 obj
.. .. .. .. ..
endobj

Objects do not necessarily have to appear in numerical order, and it is possible to refer to a “future” object, that is, one not yet defined; this is particularly useful, and in some cases essential (for instance, when the file must state the length of a text before the text itself has been written). To refer to an object, it is enough to give its number and revision followed by the letter “R”.

.. ..
/Parent 3 0 R
.. ..

In general, whenever the same object (an image or any other resource) has to be used at several points of the document, it is worthwhile to create a single object containing the resource and to refer to it wherever it is needed; this optimizes memory use and display speed.

Data sequences are enclosed between the keywords stream and endstream. A stream can contain any sequence of characters (including non-printable ones) and is used to describe text, an image or other content.

stream
//.. sequences of characters ..//
endstream

Dictionaries are sets of key/value pairs enclosed between the delimiters << and >>. They are used to characterize particular objects by defining their attributes. A value can be expressed as a numerical constant (the decimal separator is the point, not the comma), a string (delimited by a pair of round parentheses), a nested dictionary, a reference to an object, or an array (delimited by a pair of square brackets).

<< /Type /Page
/Parent 3 0 R
/Resources << /ProcSet 6 0 R >>
/MediaBox [0 0 612 792]
.. .. ..
>>
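
The value types listed above map naturally onto ordinary data structures. The following is a minimal sketch in Python, not part of the library described in these notes: the helper names Ref, Name and to_pdf are hypothetical, and only the value types mentioned in the text (plus booleans) are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Indirect reference to object `num`, revision `gen` (hypothetical helper)."""
    num: int
    gen: int = 0


@dataclass(frozen=True)
class Name:
    """A name such as /Page (hypothetical helper)."""
    value: str


def to_pdf(value):
    """Serialize a Python value using the dictionary syntax described above."""
    if isinstance(value, Ref):
        return f"{value.num} {value.gen} R"
    if isinstance(value, Name):
        return "/" + value.value
    if isinstance(value, bool):  # checked before int, since bool is an int subclass
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, float):
        # Decimal point, never a comma; trailing zeros are trimmed.
        return f"{value:.3f}".rstrip("0").rstrip(".")
    if isinstance(value, str):
        # Strings are delimited by round parentheses, so those must be escaped.
        escaped = value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return f"({escaped})"
    if isinstance(value, (list, tuple)):
        return "[" + " ".join(to_pdf(v) for v in value) + "]"
    if isinstance(value, dict):
        items = " ".join(f"/{key} {to_pdf(val)}" for key, val in value.items())
        return f"<< {items} >>"
    raise TypeError(f"unsupported type: {type(value).__name__}")


page = {
    "Type": Name("Page"),
    "Parent": Ref(3),
    "MediaBox": [0, 0, 612, 792],
}
print(to_pdf(page))
```

The sketch writes the whole dictionary on one line; line breaks between entries, as in the example above, are equally valid.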

A PDF file is nothing more than a suitable sequence of objects, built according to a fairly simple (case-sensitive) syntax, linked to one another by specific references, and equipped with everything the application reading the file (for instance Acrobat Reader) needs to know how to retrieve the information and in what order.

Structure of a PDF file

The structure of a PDF file can be summarized in the following scheme:

HEADER
BODY  
CROSS-REFERENCE TABLE
TRAILER

The HEADER section contains the information the reader software needs to identify the file type and the version of the PDF standard used.

It consists of the first line of the file and looks like this:

%PDF-1.3

where the symbol % generally introduces a comment line and 1.3 indicates that Acrobat Reader 4.0 is needed to correctly read the information contained in the file (1.4 would require Acrobat Reader 5.0, 1.5 would require Acrobat Reader 6.0, and so on). From here on, we will always work with a format compatible with Acrobat Reader 4.0, that is, with the PDF-1.3 specification.

The BODY section contains the objects that will be represented on the pages; we will dwell on them later.

The CROSS-REFERENCE TABLE section is a table that holds an entry for every object in the BODY section, together with its revision; in particular, each entry gives the position of the first character of the object's definition, measured from the beginning of the file, and the revision number it refers to.

xref
0 23
0000000019 65535 f
0000000009 00000 n
.. .. ..
0000000300 00000 n
0000000384 00000 n
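
To make the layout of a table entry concrete, here is a minimal Python sketch that splits one entry into its three fields. The function name parse_xref_entry is hypothetical; the field meanings follow the description above, with the letter n marking an object in use and f a free entry.

```python
def parse_xref_entry(line):
    """Split one cross-reference entry into (position, revision, in_use).

    For an entry marked "n", the first field is the position of the object
    from the beginning of the file. For an entry marked "f" (free), it is
    the number of the next free object instead.
    """
    position, revision, flag = line.split()
    if flag not in ("n", "f"):
        raise ValueError(f"unexpected entry type: {flag!r}")
    return int(position), int(revision), flag == "n"


print(parse_xref_entry("0000000300 00000 n"))
```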

The TRAILER section tells the reader how many entries the cross-reference table holds, that is, one per object in the BODY section plus entry 0 (/Size), which object is the initial one (/Root), which object contains the general information about the document such as author, title and creation date (/Info), and where to find the CROSS-REFERENCE TABLE (startxref); it also marks the end of the file (%%EOF).

trailer
<< /Size 7
/Root 1 0 R
/Info 2 0 R
>>
startxref
408
%%EOF
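
The four sections can be assembled mechanically once the position of every object is recorded while writing. The following Python sketch is one possible way to do it, under stated assumptions: the function name build_pdf is hypothetical, objects are numbered 1..n without gaps, and the font uses the standard name Helvetica rather than a name taken from these notes.

```python
def build_pdf(objects, root=1):
    """Assemble HEADER, BODY, CROSS-REFERENCE TABLE and TRAILER.

    `objects` maps object numbers (1..n, no gaps) to their body text.
    """
    out = bytearray(b"%PDF-1.3\n")  # HEADER
    offsets = {}
    for num in sorted(objects):  # BODY
        offsets[num] = len(out)  # byte position of "num 0 obj"
        out += f"{num} 0 obj\n{objects[num]}\nendobj\n".encode("latin-1")

    size = len(objects) + 1  # entry 0 is the head of the free list
    xref_pos = len(out)
    out += f"xref\n0 {size}\n".encode("latin-1")  # CROSS-REFERENCE TABLE
    out += b"0000000000 65535 f \n"  # every entry is exactly 20 bytes long
    for num in range(1, size):
        out += f"{offsets[num]:010d} 00000 n \n".encode("latin-1")

    out += (  # TRAILER
        f"trailer\n<< /Size {size} /Root {root} 0 R >>\n"
        f"startxref\n{xref_pos}\n%%EOF\n"
    ).encode("latin-1")
    return bytes(out)


content = "BT /F1 24 Tf 100 400 Td (Hello World) Tj ET"
objects = {
    1: "<< /Type /Catalog /Pages 2 0 R >>",
    2: "<< /Type /Pages /Count 1 /Kids [3 0 R] >>",
    3: "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595.2 842] "
       "/Resources << /Font << /F1 5 0 R >> >> /Contents 4 0 R >>",
    4: f"<< /Length {len(content)} >>\nstream\n{content}\nendstream",
    5: "<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
}
pdf = build_pdf(objects)
print(len(pdf), "bytes")
```

Writing the returned bytes to a file opened in binary mode should give a one-page document. The positions are counted in bytes, which is why the sketch accumulates encoded bytes rather than text.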


Structure of a PDF document

The structure of a PDF document can be summarized as follows.

The CATALOG object represents the root of the whole document and must be the one pointed to by the /Root reference in the TRAILER section.

1 0 obj
<< /Type /Catalog
/Pages 3 0 R
/Outlines 20 0 R
>>
endobj

In turn, it contains a reference to the root of the pages (/Pages) and a reference to the root of the tree that serves as an index (/Outlines), that is, the panel that usually appears to the left of the page when a document is opened with Acrobat Reader and makes it possible to move quickly through the document; for simplicity we will not analyze it.

The PAGES object represents the root of the pages; it gives the total number of pages in the document (/Count) and holds a reference to the object that contains each page (/Kids).

3 0 obj
<< /Type /Pages
/Count 3
/Kids [4 0 R 8 0 R 10 0 R]
>>
endobj

The PAGE object holds a reference to its own root (/Parent), a list of the resources used in the page (/Resources, described later), an array with the dimensions of the intended print format (/MediaBox) and, finally, a reference to the object that contains the elements to be drawn on the page (/Contents).

4 0 obj
<< /Type /Page
/Parent 3 0 R
/Resources << /ProcSet [/PDF /Text] >>
/MediaBox [0 0 595.2 842]
/Contents 5 0 R
>>
endobj

To include in the document information about the author, the application that produced the file or the creation date, the following object can be used (taking care to keep the parentheses, which are part of the syntax). This information is shown when the document properties are displayed in the reader.

2 0 obj
<< /Title (title)
/Author (author)
/Creator (application_creator)
/Producer (copyright)
/CreationDate (D:yyyymmddhhmmss+0100)
>>
endobj


The system of coordinates

The default coordinate system of a PDF document uses the point as its unit of measure, defined as 1/72 of an inch. The origin of the axes is in the bottom-left corner. In this system, a normal A4 sheet (21×29.7 cm) measures approximately 595.2 x 842 points. By default, all images, unless explicitly translated, scaled or rotated, have dimension 1×1 and are placed with their bottom-left corner at the point (0,0).
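
The A4 figures can be checked with a short calculation. This is a minimal Python sketch; the function name mm_to_pt is hypothetical, and the only facts used are 72 points per inch and 25.4 mm per inch.

```python
POINTS_PER_INCH = 72
MM_PER_INCH = 25.4


def mm_to_pt(mm):
    """Convert millimetres to points (1 point = 1/72 inch)."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


width, height = mm_to_pt(210), mm_to_pt(297)
print(f"A4: {width:.2f} x {height:.2f} points")
```

The exact values are about 595.28 and 841.89 points, which the figures quoted in the text approximate.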

The operators

To define the elements contained in each page, the PDF standard provides a number of operators. Some of them are listed below; for a more detailed description, refer to the texts in the bibliography. In particular, keep in mind that the graphic operators only describe a path, which is actually drawn only when a specific painting operator is used, and that color components range from 0 to 1, not from 0 to 255.

Operator                 Description
X Y m                    Set the current point to (X,Y)
X Y l                    Add a line ending at (X,Y) to the current path
X w                      Set the line width to X points
h                        Close a path
n                        End a path without filling or stroking it
f                        Fill a path
r g b RG                 Set the color for stroking operations
r g b rg                 Set the color for non-stroking operations
gray G                   Set the gray level for stroking operations
gray g                   Set the gray level for non-stroking operations
x1 y1 x2 y2 x3 y3 c      Add a Bezier curve to the path
x y width height re      Add a rectangle to the path
W 0 0 H X Y cm           Graphic transformation matrix
S                        Stroke the path
s                        Close and stroke the path
B                        Fill and stroke the path
BT                       Start a text sequence
ET                       End a text sequence
space Tc                 Set the character spacing
space Tw                 Set the word spacing
scale Tz                 Set the horizontal scaling for text, in percent
space TL                 Set the line spacing
X Y Td                   Set the insertion point for text
fontname size Tf         Set the font and the font size
string Tj                Draw a text string
/name Do                 Draw the object called name


The paths

Paths are invisible outlines that become visible only after a suitable command. An example makes the concept clearer: in an ordinary graphic context, to draw a polyline made up of several segments, I physically draw the first segment, then the second, and so on. In a PDF document, I instead prepare the first segment (that is, I build an invisible path that describes it), then the second, and so on, and after the last segment I issue an operator that physically draws the whole polyline. This way of working exists because the path described may be used not only to draw a polyline on the page, but also to delimit (clip) a portion of the graphics that follow, to contain text, and so on.

Some common elements – (Font and text)

One of the first objects we analyze is the one needed to describe a font. At this stage we will see how to use one of the 14 default Type 1 fonts of the PDF standard, that is, fonts that the reader already knows and that need no further information (by contrast, other font types such as TrueType require detailed information, for instance the width in points of every single character).

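The 14 standard Type 1 fonts, listed here under the Windows-style names used in these notes (the PDF specification itself calls the three text families Courier, Helvetica and Times), are: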

  • CourierNew (.Italic, .Bold, .BoldItalic)
  • Arial (.Italic, .Bold, .BoldItalic)
  • TimesNewRoman (.Italic, .Bold, .BoldItalic)
  • ZapfDingbats
  • Symbol

To use one of them, it is necessary to create an object that contains it. In the following example, object 7 contains the font Arial with the Bold and Italic attributes, gives the font the name Fn1 and uses the ANSI character set.

7 0 obj
<< /Type /Font
/Subtype /Type1
/Name /Fn1
/BaseFont /Arial.BoldItalic
/Encoding /WinAnsiEncoding
>>
endobj

To write the classic “Hello World” at coordinates (100,400) with the previously defined font Fn1, it is enough to insert the following sequence in an object:

% Write Hello World with Arial Bold Italic 24 pts
BT
/Fn1 24 Tf
100 400 Td
(Hello World) Tj
ET
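
Since a string is delimited by round parentheses, any parenthesis or backslash inside the text has to be escaped before it is written. The Python sketch below generates the sequence shown above; the function names escape_pdf_string and text_block are hypothetical and are not part of the library.

```python
def escape_pdf_string(text):
    """Escape the characters that would otherwise end or corrupt a (...) string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def text_block(font, size, x, y, text):
    """Return the operators that draw `text` at (x, y) with the given font."""
    return "\n".join([
        "BT",                                  # start the text sequence
        f"/{font} {size} Tf",                  # font name and size
        f"{x} {y} Td",                         # insertion point
        f"({escape_pdf_string(text)}) Tj",     # draw the string
        "ET",                                  # end the text sequence
    ])


print(text_block("Fn1", 24, 100, 400, "Hello World"))
```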


Some common elements – (Vector graphic)

To insert some graphic elements, it is enough to insert a sequence like:

% Draw a rectangle filled in light blue, with a red border
.5 .75 1 rg
1 0 0 RG
200 300 50 75 re
B

% Draw a line with width 2 points
2 w
150 250 m
150 350 l
S
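
The path mechanism lends itself to a small generator: the segments are accumulated first and a single painting operator is issued at the end. This is a minimal Python sketch; the function name polyline and its parameters are hypothetical.

```python
def polyline(points, width=1, close=False):
    """Build a path through `points` and stroke it with one final operator."""
    (x0, y0), *rest = points
    ops = [f"{width} w", f"{x0} {y0} m"]       # line width, then the starting point
    ops += [f"{x} {y} l" for x, y in rest]      # segments added to the invisible path
    ops.append("s" if close else "S")           # only now is the path actually drawn
    return "\n".join(ops)


print(polyline([(150, 250), (150, 350), (200, 350)], width=2))
```

Passing close=True uses the s operator, which closes the path back to its starting point before stroking it.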


Some common elements – (Bitmap images)

Another common object is the one needed to represent bitmap (BMP) images. For images there are two options: to create an object that can be recalled every time the same image has to be displayed, even at different sizes (this is the case we will analyze), or to define the image inline within the page, without the possibility of reusing it.

For instance, to define an image Img1 of 10 x 5 pixels in 24-bit color (3 RGB color components of 8 bits each) using a hexadecimal representation, 150 elements are needed, that is 10 x 5 x 3.

12 0 obj
<< /Type /XObject
/Subtype /Image
/Name /Img1
/Width 10
/Height 5
/BitsPerComponent 8
/ColorSpace /DeviceRGB
/Filter /ASCIIHexDecode
/Length 13 0 R
>>
stream
80 A1 2F .. .. %150 hex elements
endstream
endobj
 
13 0 obj
150
endobj

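Note that the stream length is given as a reference to an object (13 0) that contains only the value 150. Strictly speaking, /Length counts the bytes actually written between stream and endstream, so for hex-encoded data the value to store is the length of the encoded text rather than the 150 decoded elements.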
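
The relationship between pixel data and its hexadecimal representation can be sketched in Python as follows. The pixel values are hypothetical (every pixel gets the same colour); the ">" character is the end-of-data marker expected by the ASCIIHexDecode filter.

```python
WIDTH, HEIGHT, COMPONENTS = 10, 5, 3  # the 10 x 5 RGB image from the example

# Hypothetical pixel data: every pixel set to the same colour.
pixels = bytes([0x80, 0xA1, 0x2F]) * (WIDTH * HEIGHT)

# ASCIIHexDecode expects two hex digits per byte; ">" marks the end of the data.
hex_data = pixels.hex().upper() + ">"

print(len(pixels))    # number of colour component values
print(len(hex_data))  # number of characters written in the stream
```

Spaces between the hex pairs, as in the example above, are allowed and ignored by the filter, but they do add to the number of characters written.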

To display the image with its bottom-left corner at the point (100, 80), with a horizontal size of 200 and a vertical size of 300, it is enough to insert the sequence:

q                             % save the graphic state
200 0 0 300 100 80 cm         % graphic transformation matrix
                              % to translate/scale the image
/Img1 Do                      % draw the image Img1
Q                             % restore the graphic state
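
Because an image is 1×1 by default, placing it amounts to filling in the scale and translation entries of the matrix. A minimal Python sketch, with the hypothetical function name place_image:

```python
def place_image(name, x, y, width, height):
    """Return operators that draw object `name` scaled to width x height at (x, y)."""
    return "\n".join([
        "q",                                   # save the graphic state
        f"{width} 0 0 {height} {x} {y} cm",    # scale the 1 x 1 image and translate it
        f"/{name} Do",                         # draw the image
        "Q",                                   # restore the graphic state
    ])


print(place_image("Img1", 100, 80, 200, 300))
```

Saving and restoring the graphic state keeps the transformation from affecting whatever is drawn afterwards.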

Likewise, the sequence

.. ..
/BitsPerComponent 8
/ColorSpace /DeviceGray
/Length 50
>>
stream
$-#etT .. .. %50 bytes
endstream
endobj

defines a grayscale image (only 1 color component), represented by a sequence of 50 bytes (10x5x1).

Some common elements – (The Form)

It often happens that portions of a document have to be reproduced several times, for instance page headers or footers, or watermarks (the slanted writing placed under the text, such as “draft”). For this the PDF standard provides a particular object, known as a form, which, let us be clear at once, has nothing to do with Visual Basic forms. For instance, consider everything contained in the stream of object 13, named Frm1:

13 0 obj
<< /Type /XObject
/Subtype /Form
/FormType 1
/Name /Frm1
/BBox [0 0 595.2 842]
/Matrix [1 0 0 1 0 0]
/Length 14 0 R >>
stream
.. .. .. .. ..
endstream
endobj

This content can be reproduced anywhere with the sequence

/Frm1 Do


The idea: the structure of the document produced by the library

As we have seen, the main elements of a PDF document are the objects. We therefore need a mechanism that handles both the writing of each object and, at the same time, the CROSS-REFERENCE TABLE section, in which, as already said, a reference to every created object must be written. We must not forget that some objects logically come before others, for example the root of the pages /Pages, even though they contain references (to the pages) that are known only after all the objects have been built. To solve these problems, the idea is to use a structure and an object numbering such that these references are known in advance. The document is thus built so that the /Pages object, and similarly the others, always has the same object number, even though it is physically written after the others. In other words, the typical structure of the produced document is the following:

1 0 obj Info
2 0 obj Catalog
3 0 obj Encoding
6 0 obj (available)
7 0 obj (available)
.. .. .. ..
.. .. .. .. ..
n-1 0 obj (available)
n 0 obj (available)
4 0 obj Pages
5 0 obj Resource

In this way the references to the fixed objects, for instance the /Parent entry in every /Page object, are always known, while the other references can be collected as the objects are created and inserted at the end, in the definition of objects 4 and 5.
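
The numbering scheme can be illustrated with a short Python sketch. It is not the library's actual code: the class DocumentSketch and its methods are hypothetical, and only the /Pages object (number 4) is handled, the /Resources object (number 5) being left out for brevity.

```python
class DocumentSketch:
    """Hypothetical writer that reserves object 4 for /Pages, as described above."""

    PAGES = 4       # fixed number, so every page can name its /Parent in advance
    FIRST_FREE = 6  # objects 1 to 5 are reserved

    def __init__(self):
        self.objects = []  # (number, body) pairs in the order they are written
        self.next_num = self.FIRST_FREE
        self.kids = []

    def add_object(self, body):
        num = self.next_num
        self.next_num += 1
        self.objects.append((num, body))
        return num

    def add_page(self, contents_num):
        body = (f"<< /Type /Page /Parent {self.PAGES} 0 R "
                f"/Contents {contents_num} 0 R >>")
        self.kids.append(self.add_object(body))

    def finish(self):
        # /Pages is written last, once the number of every page is known.
        kids = " ".join(f"{n} 0 R" for n in self.kids)
        body = f"<< /Type /Pages /Count {len(self.kids)} /Kids [{kids}] >>"
        self.objects.append((self.PAGES, body))
        return self.objects


doc = DocumentSketch()
doc.add_page(doc.add_object("<< /Length 0 >>\nstream\n\nendstream"))
doc.add_page(doc.add_object("<< /Length 0 >>\nstream\n\nendstream"))
for num, body in doc.finish():
    print(num, body.splitlines()[0])
```

Each page is written with /Parent 4 0 R before object 4 exists, and object 4 itself is emitted only at the end with the complete /Kids array.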

The object Resources

We have mentioned that every /Page object contains a reference to a /Resources object, a dictionary that essentially describes the resources used in the document, that is, the list of fonts, images and Form objects used, and the kind of content the document holds. For instance:

4 0 obj
<< /Type /Pages
/Resources 5 0 R
.. ..
>>
endobj

5 0 obj
<< /Font <</Fnt1 6 0 R /Fnt2 7 0 R >>
/ProcSet [/PDF /Text /ImageB /ImageC /ImageI]
/XObject <</Img1 8 0 R /Frm1 9 0 R >>
>>
endobj

Object 5 0 states that the document uses two fonts (Fnt1 and Fnt2), represented by objects 6 0 and 7 0 respectively, an image object Img1 (object 8 0) and a form object Frm1 (object 9 0). It also indicates that the document contains standard operators (/PDF), text objects (/Text), grayscale images (/ImageB), RGB color images (/ImageC) and indexed-palette images (/ImageI).

The units of measure

The unit of measure of the PDF standard is the point, at 72 points per inch. With the appropriate conversions it is of course possible to use other units (centimeters, millimeters and inches). As for decimal digits, 3 digits on values expressed in the standard unit are sufficient, and it should not be necessary to go beyond that.
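
Limiting values to 3 decimal digits is easy to apply when numbers are written out. A minimal Python sketch, with the hypothetical function name fmt:

```python
def fmt(value):
    """Format a value with at most 3 decimal digits and a decimal point."""
    text = f"{value:.3f}".rstrip("0").rstrip(".")
    return "0" if text == "-0" else text  # avoid writing a negative zero


print(fmt(21 / 2.54 * 72))   # 21 cm expressed in points
print(fmt(100.0), fmt(0.5))
```

Trailing zeros are dropped, so whole numbers are written without a decimal part.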

Vector graphics

Before using line, arc or similar operators, always remember to set the starting point of the path with the MoveTo operator. For paths made up of several elementary lines, it is recommended to apply the stroke, close or fill options only on the last line.