2 Jul 2008
PDF Content Streams
Update
Please visit the same post on my business site. The comments are closed here, so if you want to comment, you have to head over to http://khkonsulting.com/2008/07/pdf-content-streams/
How Did You Get Here?
I did some research on why and how visitors come to my site. One interesting finding is that a number of people are searching for information about PDF content streams. Here is the list of the 50 most common Google searches that contain the string “content”:
1. pdf content streams
2. content streams pdf
3. acrobat content streams
4. pdf "content streams"
5. adobe content streams
6. content streams
7. acrobat +"content streams"
8. content streams acrobat
9. content streams in pdfs
10. what are content streams in pdf
11. "content stream" pdf issues
12. "content stream"+pdf
13. "content streams" pdf
14. "content streams"+"pdf"
15. "content streams"+adobe+pdf
16. acrobat content stream
17. acrobat content stream reduction
18. acrobat professional content streams
19. acrobat reduce content stream
20. acrobat what is a content stream
21. acrobat, "content stream"
22. content stream adobe acrobat pdf
23. content stream in acrobat pdf
24. content streams adobe pdf
25. content streams in a pdf
26. content streams in adobe
27. content streams in pdf
28. content streams pdf acrobat
29. how to create content stream in pdf
30. how to reduce adobe acrobat content stream?
31. how to reduce content streams acrobat
32. move content streams pdf
33. pdf "content stream"
34. pdf "content stream" example
35. pdf "content streams" help
36. pdf acrobat what is content stream
37. pdf stream content
38. pdf what are content streams
39. pdf what is content streams
40. pdf+content+streams
41. reduce content streams in acrobat professional
42. what are content streams in pdfs
43. what are content streams pdf
44. what are pdf content streams
45. what are pdf content streams?
46. what is a content stream in acrobat
47. what is a content stream in pdf
48. what is content stream acrobat
49. what is content stream in a pdf?
50. what pdf content streams
So, I guess you want to learn more about what PDF content streams are, and how to create them, with an example or two thrown in… I think I can do that.
In a previous post and here, I’ve shown you how to look into a content stream with the tools that Acrobat has on board. The good news is that with Acrobat 9 Professional (or Pro Extended), you no longer have to run a preflight first; the option to browse the internal structure of the PDF is available right away.
What does the PDF spec have to say about content streams?
PDF Specification
Section 3.7 in the PDF Specification talks about content streams (and resource objects – the two travel together). Here we read that “Content streams are the primary means for describing the appearance of pages and other graphical elements.” Section 3.7.1 goes into more detail. The name “content stream” already gives away an important piece of information: we are talking about stream objects. The content stream is a stream object that describes how a page will be rendered. If your recollection of what a stream object is has become a bit fuzzy, please review section “3.2.7 Stream Objects” in the PDF spec again.
When we look at a page object in a PDF document, we will see a number of required entries in the page object dictionary:
- Type
- Parent
- Resources
- MediaBox
Hmmm… This list does not include the Contents entry (which points to a content stream). This entry is optional: a page does not need page content, so an empty page in a PDF document does not necessarily contain a content stream. This makes it very easy to add blank pages to a PDF file.
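To make the blank page idea concrete, here is a minimal sketch in Python that writes a one-page PDF whose page object carries only the four required entries and no Contents entry. The function name build_blank_pdf, the object numbering and the US Letter page size are my own assumptions for illustration, not something the spec prescribes.

```python
def build_blank_pdf(width=612, height=792):
    """Return the bytes of a one-page PDF whose page has no /Contents entry."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        # Only the required page entries: Type, Parent, Resources, MediaBox.
        b"<< /Type /Page /Parent 2 0 R /Resources << >> "
        b"/MediaBox [0 0 %d %d] >>" % (width, height),
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"          # entry for the free object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


if __name__ == "__main__":
    with open("blank.pdf", "wb") as handle:
        handle.write(build_blank_pdf())
```

Running the script should leave a file named blank.pdf (a hypothetical file name) containing a single empty page.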
Back to the spec: The Contents entry can be either a single stream or an array of streams. It is up to the creating application to decide which way to go. In general, if it’s possible to create the content stream in one operation, it’s probably best to use a single stream object, whereas page content made up of parts that are created at different times, or copied from other objects or locations, suggests an array of content streams.
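As a rough illustration of the two forms, the following Python sketch builds the Contents entry for one or several stream objects and shows how a consumer treats an array: the decoded streams behave as if they were concatenated in order. The helper names and the object numbers 4, 5 and 6 are made up for this example, and joining the parts with a newline is my own choice to keep the tokens separated.

```python
def contents_entry(object_numbers):
    """Build the /Contents entry for one stream or for an array of streams."""
    refs = ["%d 0 R" % number for number in object_numbers]
    if len(refs) == 1:
        return "/Contents " + refs[0]
    return "/Contents [" + " ".join(refs) + "]"


def join_content_streams(decoded_parts):
    """Combine already decoded content streams the way a viewer reads them."""
    # The parts are processed in array order; the newline keeps the last
    # token of one part from running into the first token of the next.
    return b"\n".join(decoded_parts)


single = contents_entry([4])        # "/Contents 4 0 R"
array = contents_entry([4, 5, 6])   # "/Contents [4 0 R 5 0 R 6 0 R]"

parts = [b"q 1 0 0 1 72 720 cm", b"0 0 100 50 re f", b"Q"]
combined = join_content_streams(parts)
```

Note that a split between two streams may only happen between tokens, so an operator and its operands can live in different streams, but a single number or name cannot be cut in half.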
Contents of a Content Stream
So, what exactly is the content of a content stream? We find this information in the “Operator Summary” (appendix A in the PDF specification). This section lists all operators and a reference to where the operator is introduced in the body of the PDF specification.
I don’t want to discuss every operator (maybe in a future post – let me know if that’s something you want to see), but just for reference purposes and so that this stuff shows up if somebody googles for one or more of these operators, here is a list:
b,B,b*,B*,BDC,BI,BMC,BT,BX,c,cm,CS,cs,d,d0,d1,Do,DP,EI,EMC,ET,EX,f,F,f*, G,g,gs,h,i,ID,J,j,K,k,l,m,M,MP,n,q,Q,re,RG,rg,ri,s,S,SC,sc,SCN,scn,sh,T*,Tc,Td, TD,Tf,TJ,Tj,TL,Tm,Tr,Ts,Tw,Tz,v,w,W,W*,y,',"
Go and read up on those operators 🙂
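Content streams use postfix notation: the operands come first, then the operator. The following Python sketch splits a simple, uncompressed content stream into (operator, operands) pairs. It is a deliberately simplified reader of my own, not a complete parser: it assumes no nested parentheses in strings, no inline images (BI/ID/EI), no dictionaries and no true/false/null operands, and the names TOKEN, is_operand and split_operations are hypothetical.

```python
import re

# One token of a simplified content stream (see the assumptions above).
TOKEN = re.compile(rb"""
    \((?:\\.|[^\\()])*\)      # literal string without nested parentheses
  | <[0-9A-Fa-f\s]*>          # hex string
  | \[ | \]                   # array delimiters
  | /[^\s/\[\]()<>]*          # name
  | [^\s/\[\]()<>]+           # number or operator
""", re.VERBOSE)

NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")


def is_operand(token):
    """Strings, names, array brackets and numbers are operands."""
    return token[0] in "(<[]/" or NUMBER.fullmatch(token) is not None


def split_operations(stream):
    """Return a list of (operator, operands) tuples in stream order."""
    operations, operands = [], []
    for match in TOKEN.finditer(stream):
        token = match.group().decode("latin-1")
        if is_operand(token):
            operands.append(token)
        else:
            # Anything that is not an operand is taken to be an operator,
            # which consumes all operands collected since the last one.
            operations.append((token, operands))
            operands = []
    return operations


sample = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
for operator, operands in split_operations(sample):
    print(operator, operands)
```

For the sample stream this pairs Tf with the font name and size, Td with the two offsets and Tj with the string, while BT and ET take no operands.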
Creating Content Streams
So, how do you create a page stream out of nothing? There are two ways: easy (that is relative!) and complicated.
Let’s take a look at the simple method first.
Using a Library or a Framework
The simplest approach to creating a content stream is to let somebody else do the heavy lifting: if you have a PDF library or a framework that allows you to create PDF content, then you don’t have to mess with the details of what needs to be where in your content stream. Examples of such libraries are of course the Adobe PDF Library, the Acrobat API (take a look at the PDE level of API functions), PDFlib or iText. Just get familiar with the environment and create PDF content streams as complicated as you need them to be – without too much hassle.
Manually Creating Content Streams
OK, before we go any further, allow me a question: Why do you want to do this the hard way? Just stick to the approach mentioned in the last paragraph and be done with it. There really are not many reasons to torture yourself with this stuff, so just get a nice library and enjoy life…
Still here? There are only two things I can tell you at this point: Read the PDF spec, and read it again, and when you try to create your first content stream, make sure you get the stream length right. If you are dedicated to learning how to do this from scratch, there is nothing I can say that will magically make it unnecessary to read (and understand!) the PDF spec. So, get started, and if you have questions, ask them in the comments to this article. Good luck.
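On the subject of getting the stream length right, here is a small Python sketch that wraps content stream data in a stream object and lets the code count the bytes. The function name make_stream_object, the object number 4 and the sample content (which assumes a font resource named F1 exists in the page resources) are made up for illustration.

```python
def make_stream_object(number, data):
    """Wrap raw content stream bytes in an indirect stream object."""
    # /Length counts only the bytes between the end-of-line after "stream"
    # and the end-of-line before "endstream"; neither marker is included.
    header = b"%d 0 obj\n<< /Length %d >>\nstream\n" % (number, len(data))
    return header + data + b"\nendstream\nendobj\n"


# Hypothetical page content: one line of text in a font called F1.
content = b"BT /F1 24 Tf 72 720 Td (Hello) Tj ET"
stream_object = make_stream_object(4, content)
print(stream_object.decode("latin-1"))
```

Letting the code compute len(data) avoids the classic mistake of editing the stream by hand and forgetting to update the Length entry.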
Content Streams in PDF Files
So, what does a content stream in a PDF file look like? Here is an example:
13 0 obj
<< /Length 66 /Filter /FlateDecode >>
stream
[some binary data]
endstream
endobj
This stream is obviously compressed – which is indicated by the “/Filter” entry of “/FlateDecode” in the stream dictionary. Let’s take a look at the uncompressed stream:
The first image shows the content stream of a page using the “View content stream with q/Q nesting levels collapsed”, and the second image uses the “View content stream by marked object”. The important difference is that the first image shows just the content stream operators, whereas the second image shows the operator without any parameters, followed by a description. To see the actual operators with parameters, the individual blocks need to be expanded.
Do you have any idea what this PDF page will look like? Here is the PDF document: test.pdf
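If you want to look at a FlateDecode stream without Acrobat, the data is plain zlib/deflate, so Python’s zlib module can compress and decompress it. The sketch below uses made-up sample content of my own (a filled blue rectangle), not the contents of test.pdf.

```python
import zlib

# Hypothetical content stream: set a blue fill color, draw a rectangle, fill it.
content = b"0 0 1 rg\n72 72 200 100 re\nf\n"

# Writing: compress the data and record the compressed size as /Length.
compressed = zlib.compress(content)
stream_dict = b"<< /Length %d /Filter /FlateDecode >>" % len(compressed)

# Reading: the bytes between "stream" and "endstream" decompress back
# to the operators and operands.
restored = zlib.decompress(compressed)

print(stream_dict.decode("ascii"))
print(restored.decode("ascii"))
```

Note that Length refers to the compressed size, that is, to the bytes actually stored in the file.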
I have a download section in my site which has PDF of approx 21 mb each. I checked the PDF and found that 95% of it is covered by content stream. Is there a way to reduce the content streams ?
I made this PDFs from coreldraw 13. It contains vector graphics and no text.
Please suggest.
Mohammed Shahid Rahman
December 2nd, 2009 at 10:15 am
Karl,
Under Resources – Xobjects the pdf browser is telling how many images you have inside the pdf file, as example im0 – im1. When double clicking on Im0 in the internal pdf structure window Acrobat isn’t drawing a rectangle around the object inside the Acrobat document window. When going to the help file on:
http://help.adobe.com/en_US/Acrobat/8.0/Professional/help.html?content=WS561DA35A-C1C6-4493-AE1D-80C46C2C571A.html
Acrobat is telling as tip “You can also view content streams as snippets by selecting Show Selected Page Object In Snap View in the Preflight window”. After opening the snap view window and double clicking on an image content stream object the system is showing nothing. Is there a possibility after double clicking a content stream object, acrobat is viewing that object inside the document window or snap view window?
Regards,
Vantomme Bart
Bart Vantomme
November 23rd, 2010 at 9:02 am
I want to be able to “read” a page of content in a pdf looking for specific headings and then copy all of the text from that heading to the beginning of the next heading and paste it into a word doc.
I need to be able to repeat this process with multiple headings.
Any help you can provide will be greatly appreciated.
Bill
William Hughes
April 25th, 2012 at 10:32 pm
Bill,
that’s a pretty complicated task. Actually, text extraction is probably one of the most complex things you can do with a PDF. And, explaining it would definitely be outside of the scope of a reply in the comments. Before you start something like this, you need a pretty good understanding of the PDF format. Start by reading the PDF spec a couple of times, especially the sections about text and fonts.
khk
July 17th, 2012 at 2:06 pm
Is the tool that you use to browse the PDF contents generally available? I am working on a book about privacy leakage and the tool would be quite useful to my readers.
Thank you.
Simson Garfinkel
July 4th, 2013 at 3:26 pm
Simson, the tool I show is part of Adobe Acrobat Pro. There are other tools available, if you want more information, please contact me privately.
khk
July 24th, 2013 at 9:45 am
Hi Karl,
I am working on a project in which I need to get the bounds of paragraphs, images and other contents in the PDF page. Digging down into the stream I came to know that BT and ET are the operators which indicate the beginning and end of content. I am trying to find something more precise which directly gives me the bounds of the content. Please let me know if you can help me with this.
Thank you.
saurav
December 17th, 2014 at 1:58 am
There is no simple solution: you will have to determine the bounding box for your individual elements and then assemble the overall bounding box based on the elements. For a paragraph, that would be the individual characters/text runs.
khk
December 27th, 2014 at 12:57 pm