2 Jul 2008

PDF Content Streams

Posted by khk

Update

Please visit the same post on my business site. The comments are closed here, so if you want to comment, you have to head over to http://khkonsulting.com/2008/07/pdf-content-streams/

How Did You Get Here?

I did some research on why and how visitors come to my site. One interesting finding is that a number of people are searching for information about PDF content streams. Here is the list of the 50 most common Google searches that contain the string “content”:


1  pdf content streams
2  content streams pdf
3  acrobat content streams
4  pdf "content streams"
5  adobe content streams
6  content streams
7  acrobat +"content streams"
8  content streams acrobat
9  content streams in pdfs
10  what are content streams in pdf
11  "content stream" pdf issues
12  "content stream"+pdf
13  "content streams" pdf
14  "content streams"+"pdf"
15  "content streams"+adobe+pdf
16  acrobat content stream
17  acrobat content stream reduction
18  acrobat professional content streams
19  acrobat reduce content stream
20  acrobat what is a content stream
21  acrobat, "content stream"
22  content stream adobe acrobat pdf
23  content stream in acrobat pdf
24  content streams adobe pdf
25  content streams in a pdf
26  content streams in adobe
27  content streams in pdf
28  content streams pdf acrobat
29  how to create content stream in pdf
30  how to reduce adobe acrobat content stream?
31  how to reduce content streams acrobat
32  move content streams pdf
33  pdf "content stream"
34  pdf "content stream" example
35  pdf "content streams" help
36  pdf acrobat what is content stream
37  pdf stream content
38  pdf what are content streams
39  pdf what is content streams
40  pdf+content+streams
41  reduce content streams in acrobat professional
42  what are content streams in pdfs
43  what are content streams pdf
44  what are pdf content streams
45  what are pdf content streams?
46  what is a content stream in acrobat
47  what is a content stream in pdf
48  what is content stream acrobat
49  what is content stream in a pdf?
50  what pdf content streams

So, I guess you want to learn more about what PDF content streams are, and how to create them, with an example or two thrown in… I think I can do that.

In a previous post and here, I’ve shown you how to look into a content stream with the tools that Acrobat has on board. The good news is that with Acrobat 9 Professional (or Pro Extended), you no longer have to run a preflight first; the option to browse the internal structure of the PDF is available right away.

What does the PDF spec have to say about content streams?

PDF Specification

Section 3.7 in the PDF Specification talks about content streams (and resource objects – the two travel together). Here we read that “Content streams are the primary means for describing the appearance of pages and other graphical elements.” Section 3.7.1 goes into more detail. The name “content stream” already gives away an important piece of information: We are talking about stream objects. The content stream is a stream object that describes how a page will be rendered. If your recollection of what a stream object is has become a bit fuzzy, please review section “3.2.7 Stream Objects” in the PDF spec again.

When we look at a page object in a PDF document, we will see a number of required entries in the page object dictionary:

  • Type
  • Parent
  • Resources
  • MediaBox

Hmmm… This list does not include the Contents entry (which points to a content stream). Because this entry is optional, a page does not need any page content, so an empty page in a PDF document does not necessarily have a content stream. This makes it very easy to add blank pages to a PDF file.

Back to the spec: The Contents entry can be either a single stream or an array of streams. It is up to the creating application to decide which way to go. In general, if it’s possible to create the content stream in one operation, it’s probably best to use a single stream object. Page content made up of parts that are created at different times, or copied from other objects or locations, is a better fit for an array of content streams.
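To make the three cases concrete (no Contents entry, a single stream, an array of streams), here is a minimal Python sketch that only builds the text of a page dictionary. The function name `page_dict`, the object numbers and the empty Resources dictionary are made up for illustration; this is not how any particular library does it.

```python
def page_dict(parent_ref, media_box, content_refs=()):
    """Build the text of a page object dictionary.

    content_refs holds indirect references such as "13 0 R".
    All object numbers used below are hypothetical.
    """
    entries = [
        "/Type /Page",
        f"/Parent {parent_ref}",
        "/Resources << >>",
        "/MediaBox [" + " ".join(str(n) for n in media_box) + "]",
    ]
    if len(content_refs) == 1:
        # One content stream: reference it directly.
        entries.append(f"/Contents {content_refs[0]}")
    elif len(content_refs) > 1:
        # Several streams: an array, read as if the streams were concatenated.
        entries.append("/Contents [" + " ".join(content_refs) + "]")
    # No references at all: leave /Contents out, which gives a blank page.
    return "<< " + " ".join(entries) + " >>"


blank = page_dict("2 0 R", (0, 0, 612, 792))
single = page_dict("2 0 R", (0, 0, 612, 792), ["13 0 R"])
multi = page_dict("2 0 R", (0, 0, 612, 792), ["13 0 R", "14 0 R"])
print(blank)
print(single)
print(multi)
```

The blank page is simply the dictionary without a Contents key, which is why adding empty pages is so cheap.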

Contents of a Content Stream

So, what exactly is the content of a content stream? We find this information in the “Operator Summary” (appendix A in the PDF specification). This section lists all operators and a reference to where the operator is introduced in the body of the PDF specification.

I don’t want to discuss every operator (maybe in a future post – let me know if that’s something you want to see), but just for reference purposes, and so that this stuff shows up if somebody googles for one or more of these operators, here is a list:

b,B,b*,B*,BDC,BI,BMC,BT,BX,c,cm,CS,cs,d,d0,d1,Do,DP,EI,EMC,ET,EX,f,F,f*,
G,g,gs,h,i,ID,J,j,K,k,l,m,M,MP,n,q,Q,re,RG,rg,ri,s,S,SC,sc,SCN,scn,sh,T*,Tc,Td,
TD,Tf,TJ,Tj,TL,Tm,Tr,Ts,Tw,Tz,v,w,W,W*,y,',"

Go and read up on those operators 🙂
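Content streams use postfix notation: the operands come first, followed by the operator that consumes them. The following Python sketch splits a small, uncompressed stream into (operator, operands) pairs to show that layout. It is deliberately naive: the sample stream is made up, and the hypothetical `split_operations` helper ignores arrays, hex strings, dictionaries, inline images, and nested or escaped parentheses, so treat it as an illustration rather than a parser.

```python
import re

# A literal string without nested parentheses, or a run of other characters.
TOKEN = re.compile(r"\([^()]*\)|[^\s()]+")
NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)")


def split_operations(stream_text):
    """Group the tokens of a simple content stream by operator."""
    operations = []
    operands = []
    for token in TOKEN.findall(stream_text):
        if token[0] in "/(" or NUMBER.fullmatch(token):
            # Names, strings and numbers are operands; collect them.
            operands.append(token)
        else:
            # Anything else is treated as an operator that ends the group.
            operations.append((token, operands))
            operands = []
    return operations


sample = "q 1 0 0 1 72 720 cm BT /F1 12 Tf (Hello World) Tj ET Q"
for operator, operands in split_operations(sample):
    print(operator, operands)
```

In this sample, `cm` receives six numbers, `Tf` a font name and a size, and `Tj` a single string, while `q`, `BT`, `ET` and `Q` take no operands at all.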

Creating Content Streams

So, how do you create a content stream out of nothing? There are two ways: easy (that is relative!) and complicated.

Let’s take a look at the simple method first.

Using a Library or a Framework

The simplest approach to creating a content stream is to let somebody else do the heavy lifting: If you have a PDF library or a framework that allows you to create PDF content, then you don’t have to mess with the details of what needs to be where in your content stream. Examples of such libraries are of course the Adobe PDF Library, or the Acrobat API (take a look at the PDE level of API functions), PDFlib or iText. Just get familiar with the environment and create PDF content streams as complicated as you need them to be – without too much hassle.

Manually Creating Content Streams

OK, before we go any further, allow me a question: Why do you want to do this the hard way? Just stick to the approach mentioned in the last paragraph and be done with it. There really are not many reasons to torture yourself with this stuff, so just get a nice library and enjoy life…

Still here? There are only two things I can tell you at this point: Read the PDF spec, and read it again, and when you try to create your first content stream, make sure you get the stream length right. If you are dedicated to learning how to do this from scratch, there is nothing I can say that will magically make it unnecessary to read (and understand!) the PDF spec. So, get started, and if you have questions, ask them in the comments to this article. Good luck.
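As a small illustration of the “get the stream length right” advice, here is a minimal Python sketch that wraps uncompressed content stream bytes in a stream object. The helper name `make_stream_object`, the object number 13 and the sample operators are assumptions for illustration (the font name /F1 would also need a matching entry in the page’s Resources). The point is that /Length is the number of bytes of stream data; the end-of-line marker in front of `endstream` is not counted.

```python
def make_stream_object(obj_num, data):
    """Wrap raw content stream bytes in an indirect stream object."""
    # /Length is computed from the bytes, never typed in by hand.
    header = (
        f"{obj_num} 0 obj\n<< /Length {len(data)} >>\nstream\n"
    ).encode("ascii")
    # The newline before "endstream" is not part of the stream data.
    return header + data + b"\nendstream\nendobj\n"


content = b"BT /F1 24 Tf 72 720 Td (Hello) Tj ET"
obj = make_stream_object(13, content)
print(obj.decode("ascii"))
```

Computing the length from the final bytes (after any compression) avoids the most common mistake when writing streams by hand. A complete file still needs the rest of the PDF structure described in the spec, such as the cross-reference table and trailer.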


Content Streams in PDF Files

So, what does a content stream in a PDF file look like? Here is an example:


13 0 obj
<<
  /Length 66/Filter/FlateDecode
>> stream
[some binary data]

endstream
endobj

This stream is obviously compressed, which is indicated by the “/Filter” entry with the value “/FlateDecode” in the stream dictionary. Let’s take a look at the uncompressed stream:

The first image shows the content stream of a page using the “View content stream with q/Q nesting levels collapsed” option, and the second image uses the “View content stream by marked object” option. The important difference is that the first image shows just the content stream operators, whereas the second image shows each operator without any parameters, followed by a description. To see the actual operators with parameters, the individual blocks need to be expanded.

ContentStream_1.png
ContentStream_2.png

Do you have any idea what this PDF page will look like? Here is the PDF document: test.pdf
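If you want to look inside a Flate-compressed stream without Acrobat, the FlateDecode filter corresponds to the zlib/deflate format, so Python’s standard `zlib` module can handle it. The sketch below is a round trip on made-up sample data, not the actual bytes of object 13 above, and it assumes a single /FlateDecode filter with no predictor set in /DecodeParms.

```python
import zlib

# Hypothetical uncompressed content stream.
raw = b"BT /F1 24 Tf 72 720 Td (Hello) Tj ET"

# What a PDF producer would store between "stream" and "endstream".
compressed = zlib.compress(raw)

# What a consumer does to get the operators back.
restored = zlib.decompress(compressed)

print(len(raw), len(compressed))
print(restored.decode("ascii"))
```

For a real file you would first locate the stream data using the /Length entry, then pass exactly those bytes to `zlib.decompress`.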


8 Responses to “PDF Content Streams”

  1. I have a download section on my site with PDFs of approx. 21 MB each. I checked the PDFs and found that 95% of the file size is taken up by content streams. Is there a way to reduce the content streams?

    I made these PDFs with CorelDRAW 13. They contain vector graphics and no text.

    Please suggest.

     

    Mohammed Shahid Rahman

  2. Karl,

    Under Resources – XObjects, the PDF browser shows how many images there are inside the PDF file, for example Im0 and Im1. When I double-click Im0 in the internal PDF structure window, Acrobat does not draw a rectangle around the object inside the Acrobat document window. The help file at:
    http://help.adobe.com/en_US/Acrobat/8.0/Professional/help.html?content=WS561DA35A-C1C6-4493-AE1D-80C46C2C571A.html
    gives this tip: “You can also view content streams as snippets by selecting Show Selected Page Object In Snap View in the Preflight window”. After I open the Snap View window and double-click an image content stream object, nothing is shown. Is there a way to have Acrobat show that object inside the document window or the Snap View window after double-clicking a content stream object?

    Regards,
    Vantomme Bart

     

    Bart Vantomme

  3. I want to be able to “read” a page of content in a pdf looking for specific headings and then copy all of the text from that heading to the beginning of the next heading and paste it into a word doc.

    I need to be able to repeat this process with multiple headings.

    Any help you can provide will be greatly appreciated.

    Bill

     

    William Hughes

  4. Bill,

    that’s a pretty complicated task. Actually, text extraction is probably one of the most complex things you can do with a PDF. And, explaining it would definitely be outside of the scope of a reply in the comments. Before you start something like this, you need a pretty good understanding of the PDF format. Start by reading the PDF spec a couple of times, especially the sections about text and fonts.

     

    khk

  5. Is the tool that you use to browse the PDF contents generally available? I am working on a book about privacy leakage and the tool would be quite useful to my readers.

    Thank you.

     

    Simson Garfinkel

  6. Simson, the tool I show is part of Adobe Acrobat Pro. There are other tools available; if you want more information, please contact me privately.

     

    khk

  7. Hi Karl,
    I am working on a project in which I need to get the bounds of paragraphs, images and other content on a PDF page. Digging down into the stream, I learned that BT and ET are the operators that indicate the beginning and end of content. I am trying to find something more precise that directly gives me the bounds of the content. Please let me know if you can help me with this.
    Thank you.

     

    saurav

  8. There is no simple solution: you will have to determine the bounding box for each of your individual elements and then assemble the overall bounding box from those. For a paragraph, the elements would be the individual characters/text runs.

     

    khk