PDFContentExtractor¶

Namespace: O2S.Components.PDF4NET.Content

Defines an extractor for various types of content from PDF pages.

public class PDFContentExtractor

Inheritance Object → PDFContentExtractor

Constructors¶

PDFContentExtractor(PDFPage)¶

Initializes a new PDFContentExtractor object.

public PDFContentExtractor(PDFPage page)

Parameters

page PDFPage
The page from which content is extracted.

Properties¶

Encodings¶

Gets the list of additional encodings used for extracting text from PDF files.

public PDFEncodingCollection Encodings { get; }

Property Value

PDFEncodingCollection
List of custom encodings used for extracting text from PDF files.

Remarks

The library supports a set of commonly used encodings for extracting text. Additional encodings can be loaded and added to this collection, and they will also be used when PDF files require them.

Methods¶

ExtractColorSpaces()¶

Extracts the color spaces that exist in the page resources.

public PDFColorSpaceCollection ExtractColorSpaces()

Returns

PDFColorSpaceCollection
List of color spaces defined in the page resources.

ExtractContentStreamOperators()¶

Extracts the page content stream as a list of graphic operators with their operands.

public PDFContentStreamOperatorCollection ExtractContentStreamOperators()

Returns

PDFContentStreamOperatorCollection
The page content stream as a list of graphic operators.

ExtractImages(Boolean)¶

Extracts the information related to the images displayed on the page.

public PDFVisualImageCollection ExtractImages(bool includeImageData)

Parameters

includeImageData Boolean
True if image data should be extracted and decoded.

Returns

PDFVisualImageCollection
The list of images that are displayed on the page.

Remarks

Image data should be extracted only if the images need to be saved for later processing because this operation can take a significant amount of time, depending on the number of images and how they are encoded.
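
The trade-off described above can be sketched as follows. This is a minimal illustration, not the library's recommended pattern: the class and method names are hypothetical, and it assumes that PDFPage lives in the O2S.Components.PDF4NET namespace and that PDFVisualImageCollection lives alongside PDFContentExtractor in O2S.Components.PDF4NET.Content.

```csharp
using O2S.Components.PDF4NET;          // assumed namespace of PDFPage
using O2S.Components.PDF4NET.Content;  // PDFContentExtractor; assumed to hold PDFVisualImageCollection

public static class PageImagesExample
{
    // Hypothetical helper: collects image information without decoding
    // the image data, which avoids the potentially slow decoding step.
    public static PDFVisualImageCollection ListImages(PDFPage page)
    {
        PDFContentExtractor extractor = new PDFContentExtractor(page);
        return extractor.ExtractImages(false);
    }

    // Hypothetical helper: also extracts and decodes the image data,
    // for cases where the images will be saved or processed later.
    public static PDFVisualImageCollection LoadImages(PDFPage page)
    {
        PDFContentExtractor extractor = new PDFContentExtractor(page);
        return extractor.ExtractImages(true);
    }
}
```

Calling ListImages first and LoadImages only for pages that need their image data keeps the expensive path optional.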

ExtractOptionalContentGroup(String)¶

Extracts the content of an optional content group.

public PDFPageOptionalContent ExtractOptionalContentGroup(string ocgName)

Parameters

ocgName String
The name of the optional content group to be extracted.

Returns

PDFPageOptionalContent
The optional content group if it exists.

Remarks

The optional content group is extracted as a form XObject that can be later drawn on another page.

ExtractText()¶

Extracts the text from the PDF page.

public string ExtractText()

Returns

String
The text on the page.
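
A minimal sketch of the parameterless overload, assuming a PDFPage instance has already been obtained from a loaded document (loading a document is not covered on this page) and that PDFPage lives in the O2S.Components.PDF4NET namespace. The class and method names are hypothetical.

```csharp
using O2S.Components.PDF4NET;          // assumed namespace of PDFPage
using O2S.Components.PDF4NET.Content;  // PDFContentExtractor

public static class PageTextExample
{
    // Hypothetical helper: returns the text of a single page.
    public static string ReadPageText(PDFPage page)
    {
        // The extractor is bound to the page it receives in the constructor.
        PDFContentExtractor extractor = new PDFContentExtractor(page);
        return extractor.ExtractText();
    }
}
```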

ExtractText(PDFContentExtractionContext)¶

Extracts the text from the PDF page.

public string ExtractText(PDFContentExtractionContext context)

Parameters

context PDFContentExtractionContext
Context for text extraction.

Returns

String
The text on the page.

Remarks

When extracting text from multiple pages of the same document, the same context object should be used with all PDFContentExtractor.ExtractText(PDFContentExtractionContext) method calls.
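
The remark above suggests sharing one context across pages. A possible sketch follows, with two loudly labeled assumptions: that PDFContentExtractionContext exposes a public parameterless constructor, and that the caller can supply the pages of a single document as a sequence of PDFPage objects. The class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text;
using O2S.Components.PDF4NET;          // assumed namespace of PDFPage
using O2S.Components.PDF4NET.Content;  // PDFContentExtractor, PDFContentExtractionContext

public static class DocumentTextExample
{
    // Hypothetical helper: concatenates the text of all supplied pages,
    // which are expected to belong to the same document.
    public static string ReadAllText(IEnumerable<PDFPage> pages)
    {
        // Assumption: a public parameterless constructor exists.
        PDFContentExtractionContext context = new PDFContentExtractionContext();
        StringBuilder text = new StringBuilder();

        foreach (PDFPage page in pages)
        {
            // A new extractor per page, but the same context for every call.
            PDFContentExtractor extractor = new PDFContentExtractor(page);
            text.AppendLine(extractor.ExtractText(context));
        }

        return text.ToString();
    }
}
```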

ExtractTextLines()¶

Extracts the text from the PDF page as a collection of PDFTextLine objects.

public PDFTextLineCollection ExtractTextLines()

Returns

PDFTextLineCollection
The lines of text on the page.

Remarks

When extracting text from multiple pages of the same document, use the ExtractTextLines(PDFContentExtractionContext) overload and pass the same context object to every call.

ExtractTextLines(PDFContentExtractionContext)¶

Extracts the text from the PDF page as a collection of PDFTextLine objects.

public PDFTextLineCollection ExtractTextLines(PDFContentExtractionContext context)

Parameters

context PDFContentExtractionContext
Context for text extraction.

Returns

PDFTextLineCollection
The lines of text on the page.

Remarks

When extracting text from multiple pages of the same document, the same context object should be used with all PDFContentExtractor.ExtractTextLines(PDFContentExtractionContext) method calls.

ExtractTextRuns()¶

Extracts the text fragments from the PDF page.

public PDFTextRunCollection ExtractTextRuns()

Returns

PDFTextRunCollection
The list of text fragments in the order they appear in the page content stream.

ExtractTextRuns(PDFContentExtractionContext)¶

Extracts the text fragments from the PDF page.

public PDFTextRunCollection ExtractTextRuns(PDFContentExtractionContext context)

Parameters

context PDFContentExtractionContext
Context for text extraction.

Returns

PDFTextRunCollection
The list of text fragments in the order they appear in the page content stream.

Remarks

When extracting text from multiple pages of the same document, the same context object should be used with all PDFContentExtractor.ExtractTextRuns(PDFContentExtractionContext) method calls.

ExtractVisualObjects(Boolean)¶

Extracts the page content as a list of visual objects.

public PDFVisualObjectCollection ExtractVisualObjects(bool includeImageData)

Parameters

includeImageData Boolean
True if image data should be extracted and decoded.

Returns

PDFVisualObjectCollection
The list of visual objects that are displayed on the page.

Remarks

Image data should be extracted only if the images need to be saved for later processing because this operation can take a significant amount of time, depending on the number of images and how they are encoded.

ExtractVisualObjects(Boolean, Boolean)¶

Extracts the page content as a list of visual objects.

public PDFVisualObjectCollection ExtractVisualObjects(bool includeImageData, bool keepGraphicContainers)

Parameters

includeImageData Boolean
True if image data should be extracted and decoded.

keepGraphicContainers Boolean
True if the list of objects should preserve the graphic containers such as form XObjects.

Returns

PDFVisualObjectCollection
The list of visual objects that are displayed on the page.

Remarks

Image data should be extracted only if the images need to be saved for later processing because this operation can take a significant amount of time, depending on the number of images and how they are encoded.

ExtractVisualObjects(Boolean, Boolean, PDFContentExtractionContext)¶

Extracts the page content as a list of visual objects.

public PDFVisualObjectCollection ExtractVisualObjects(bool includeImageData, bool keepGraphicContainers, PDFContentExtractionContext context)

Parameters

includeImageData Boolean
True if image data should be extracted and decoded.

keepGraphicContainers Boolean
True if the list of objects should preserve the graphic containers such as form XObjects.

context PDFContentExtractionContext
Context for content extraction.

Returns

PDFVisualObjectCollection
The list of visual objects that are displayed on the page.

Remarks

Image data should be extracted only if the images need to be saved for later processing because this operation can take a significant amount of time, depending on the number of images and how they are encoded.

ExtractWords()¶

Extracts the text from the PDF page as a collection of words.

public PDFTextWordCollection ExtractWords()

Returns

PDFTextWordCollection
A collection of words on the page.

ExtractWords(PDFContentExtractionContext)¶

Extracts the text from the PDF page as a collection of words.

public PDFTextWordCollection ExtractWords(PDFContentExtractionContext context)

Parameters

context PDFContentExtractionContext
Context for text extraction.

Returns

PDFTextWordCollection
A collection of words on the page.

SearchText(String)¶

Searches the page content for the specified text.

public PDFTextSearchResultCollection SearchText(string text)

Parameters

text String
Text to find in page content.

Returns

PDFTextSearchResultCollection
A collection of zero or more search results.

SearchText(String, PDFTextSearchOptions)¶

Searches the page content for the specified text.

public PDFTextSearchResultCollection SearchText(string text, PDFTextSearchOptions searchOptions)

Parameters

text String
Text to find in page content.

searchOptions PDFTextSearchOptions
A set of options that specify how the text search should be performed.

Returns

PDFTextSearchResultCollection
A collection of zero or more search results.

SearchText(String, PDFTextSearchOptions, PDFContentExtractionContext)¶

Searches the page content for the specified text.

public PDFTextSearchResultCollection SearchText(string text, PDFTextSearchOptions searchOptions, PDFContentExtractionContext context)

Parameters

text String
Text to find in page content.

searchOptions PDFTextSearchOptions
A set of options that specify how the text search should be performed.

context PDFContentExtractionContext
Context for text extraction.

Returns

PDFTextSearchResultCollection
A collection of zero or more search results.
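
As an illustration of the three-argument overload, the sketch below searches several pages of one document with a shared context. It assumes that PDFContentExtractionContext has a public parameterless constructor, that PDFPage lives in O2S.Components.PDF4NET, and that the search types live in O2S.Components.PDF4NET.Content. The search options are taken as a parameter because their members are not documented on this page, and the class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using O2S.Components.PDF4NET;          // assumed namespace of PDFPage
using O2S.Components.PDF4NET.Content;  // PDFContentExtractor; assumed to hold the search types

public static class DocumentSearchExample
{
    // Hypothetical helper: returns one result collection per page, in page order.
    public static List<PDFTextSearchResultCollection> SearchPages(
        IEnumerable<PDFPage> pages, string text, PDFTextSearchOptions searchOptions)
    {
        // Assumption: a public parameterless constructor exists.
        PDFContentExtractionContext context = new PDFContentExtractionContext();
        List<PDFTextSearchResultCollection> resultsPerPage =
            new List<PDFTextSearchResultCollection>();

        foreach (PDFPage page in pages)
        {
            PDFContentExtractor extractor = new PDFContentExtractor(page);
            // Each collection holds zero or more results for its page.
            resultsPerPage.Add(extractor.SearchText(text, searchOptions, context));
        }

        return resultsPerPage;
    }
}
```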