PDFContentExtractor¶
Namespace: O2S.Components.PDF4NET.Content
Defines an extractor for various types of content from PDF pages.
Inheritance Object → PDFContentExtractor
Constructors¶
PDFContentExtractor(PDFPage)¶
Initializes a new PDFContentExtractor object.
Parameters
page PDFPage
The page from which content is extracted.
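As a minimal sketch of how the constructor is typically used, the helper below wraps a page in an extractor and reads its text with ExtractText(). The class and method names (PageTextSample, ReadPageText) are hypothetical, and the namespace that contains PDFPage is an assumption; loading the document and obtaining the PDFPage is left to the caller.

```csharp
using O2S.Components.PDF4NET;          // assumption: namespace that contains PDFPage
using O2S.Components.PDF4NET.Content;  // namespace documented for PDFContentExtractor

public static class PageTextSample     // hypothetical helper, not part of the library
{
    // Returns the text of a single page that the caller has already loaded.
    public static string ReadPageText(PDFPage page)
    {
        // One extractor is bound to one page.
        PDFContentExtractor extractor = new PDFContentExtractor(page);
        return extractor.ExtractText();
    }
}
```

Because the extractor is bound to the page passed to its constructor, a new PDFContentExtractor is created for each page that needs to be processed.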
Properties¶
Encodings¶
Gets the list of additional encodings used for extracting text from PDF files.
Property Value
PDFEncodingCollection
List of custom encodings used for extracting text from PDF files.
Remarks
The library supports a set of commonly used encodings for extracting text. Additional encodings can be loaded and added to this collection, and they will also be used when PDF files require them.
Methods¶
ExtractColorSpaces()¶
Extracts the colorspaces that exist in the page resources.
Returns
PDFColorSpaceCollection
List of colorspaces defined in page resources.
ExtractContentStreamOperators()¶
Extracts the page content stream as a list of graphic operators with their operands.
Returns
PDFContentStreamOperatorCollection
The page content stream as a list of graphic operators.
ExtractImages(Boolean)¶
Extracts the information related to the images displayed on the page.
Parameters
includeImageData Boolean
True if image data should be extracted and decoded.
Returns
PDFVisualImageCollection
The list of images that are displayed on the page.
Remarks
Image data should be extracted only if the images need to be saved for later processing because this operation can take a significant amount of time, depending on the number of images and how they are encoded.
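The trade-off described in the remark can be sketched as follows: pass false when only the image information is needed, and true only when the decoded data will actually be saved or processed. The helper names are hypothetical, and the namespaces of PDFPage and PDFVisualImageCollection are assumptions.

```csharp
using O2S.Components.PDF4NET;          // assumption: namespace that contains PDFPage
using O2S.Components.PDF4NET.Content;  // assumption: also contains PDFVisualImageCollection

public static class PageImageSample    // hypothetical helper, not part of the library
{
    // Cheaper call: image information only, no decoding of image data.
    public static PDFVisualImageCollection ListImages(PDFPage page)
    {
        PDFContentExtractor extractor = new PDFContentExtractor(page);
        return extractor.ExtractImages(false);
    }

    // More expensive call: image data is extracted and decoded as well.
    public static PDFVisualImageCollection LoadImages(PDFPage page)
    {
        PDFContentExtractor extractor = new PDFContentExtractor(page);
        return extractor.ExtractImages(true);
    }
}
```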
ExtractOptionalContentGroup(String)¶
Extracts the content of an optional content group.
Parameters
ocgName String
The name of the optional content group to be extracted.
Returns
PDFPageOptionalContent
The optional content group if it exists.
Remarks
The optional content group is extracted as a form XObject that can later be drawn on another page.
ExtractText()¶
Extracts the text from the PDF page.
Returns
String
The text on the page.
ExtractText(PDFContentExtractionContext)¶
Extracts the text from the PDF page.
Parameters
context PDFContentExtractionContext
Context for text extraction.
Returns
String
The text on the page.
Remarks
When extracting text from multiple pages of the same document, the same context object should be used with all PDFContentExtractor.ExtractText(PDFContentExtractionContext) method calls.
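The remark about sharing one context across pages could look like the sketch below. It assumes that PDFContentExtractionContext has a public parameterless constructor and that PDFPage lives in the O2S.Components.PDF4NET namespace; neither is stated on this page. The names DocumentTextSample and ReadAllText are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text;
using O2S.Components.PDF4NET;          // assumption: namespace that contains PDFPage
using O2S.Components.PDF4NET.Content;

public static class DocumentTextSample // hypothetical helper, not part of the library
{
    // Concatenates the text of all pages, reusing a single extraction context.
    public static string ReadAllText(IEnumerable<PDFPage> pages)
    {
        // Assumption: the context can be created with a parameterless constructor.
        PDFContentExtractionContext context = new PDFContentExtractionContext();
        StringBuilder text = new StringBuilder();

        foreach (PDFPage page in pages)
        {
            // A new extractor per page, but the same context for every call.
            PDFContentExtractor extractor = new PDFContentExtractor(page);
            text.AppendLine(extractor.ExtractText(context));
        }

        return text.ToString();
    }
}
```

The context is created once outside the loop, so every page of the document is extracted with the same context object, as the remark recommends.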
ExtractTextLines()¶
Extracts the text from the PDF page as a collection of PDFTextLine objects.
Returns
PDFTextLineCollection
The lines of text on the page.
Remarks
When extracting text from multiple pages of the same document, use the PDFContentExtractor.ExtractTextLines(PDFContentExtractionContext) overload and pass the same context object to all method calls.
ExtractTextLines(PDFContentExtractionContext)¶
Extracts the text from the PDF page as a collection of PDFTextLine objects.
Parameters
context PDFContentExtractionContext
Context for text extraction.
Returns
PDFTextLineCollection
The lines of text on the page.
Remarks
When extracting text from multiple pages of the same document, the same context object should be used with all PDFContentExtractor.ExtractTextLines(PDFContentExtractionContext) method calls.
ExtractTextRuns()¶
Extracts the text fragments from the PDF page.
Returns
PDFTextRunCollection
The list of text fragments in the order they appear in the page content stream.
ExtractTextRuns(PDFContentExtractionContext)¶
Extracts the text fragments from the PDF page.
Parameters
context PDFContentExtractionContext
Context for text extraction.
Returns
PDFTextRunCollection
The list of text fragments in the order they appear in the page content stream.
Remarks
When extracting text from multiple pages of the same document, the same context object should be used with all PDFContentExtractor.ExtractTextRuns(PDFContentExtractionContext) method calls.
ExtractVisualObjects(Boolean)¶
Extracts the page content as a list of visual objects.
Parameters
includeImageData Boolean
True if image data should be extracted and decoded.
Returns
PDFVisualObjectCollection
The list of visual objects that are displayed on the page.
Remarks
Image data should be extracted only if the images need to be saved for later processing because this operation can take a significant amount of time, depending on the number of images and how they are encoded.
ExtractVisualObjects(Boolean, Boolean)¶
Extracts the page content as a list of visual objects.
public PDFVisualObjectCollection ExtractVisualObjects(bool includeImageData, bool keepGraphicContainers)
Parameters
includeImageData Boolean
True if image data should be extracted and decoded.
keepGraphicContainers Boolean
True if the list of objects should preserve the graphic containers such as form XObjects.
Returns
PDFVisualObjectCollection
The list of visual objects that are displayed on the page.
Remarks
Image data should be extracted only if the images need to be saved for later processing because this operation can take a significant amount of time, depending on the number of images and how they are encoded.
ExtractVisualObjects(Boolean, Boolean, PDFContentExtractionContext)¶
Extracts the page content as a list of visual objects.
public PDFVisualObjectCollection ExtractVisualObjects(bool includeImageData, bool keepGraphicContainers, PDFContentExtractionContext context)
Parameters
includeImageData Boolean
True if image data should be extracted and decoded.
keepGraphicContainers Boolean
True if the list of objects should preserve the graphic containers such as form XObjects.
context PDFContentExtractionContext
Context for content extraction.
Returns
PDFVisualObjectCollection
The list of visual objects that are displayed on the page.
Remarks
Image data should be extracted only if the images need to be saved for later processing because this operation can take a significant amount of time, depending on the number of images and how they are encoded.
ExtractWords()¶
Extracts the text from the PDF page as a collection of words.
Returns
PDFTextWordCollection
A collection of words on the page.
ExtractWords(PDFContentExtractionContext)¶
Extracts the text from the PDF page as a collection of words.
Parameters
context PDFContentExtractionContext
Context for text extraction.
Returns
PDFTextWordCollection
A collection of words on the page.
SearchText(String)¶
Searches the page content for the specified text.
Parameters
text String
Text to find in page content.
Returns
PDFTextSearchResultCollection
A collection of zero or more search results.
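A short sketch of a page-level search using the single-argument overload is given below. The helper names are hypothetical, and the namespaces of PDFPage and PDFTextSearchResultCollection are assumptions. The returned collection may be empty when the text does not occur on the page.

```csharp
using O2S.Components.PDF4NET;          // assumption: namespace that contains PDFPage
using O2S.Components.PDF4NET.Content;  // assumption: also contains PDFTextSearchResultCollection

public static class PageSearchSample   // hypothetical helper, not part of the library
{
    // Searches one page for the given text and returns the matches found on it.
    public static PDFTextSearchResultCollection Find(PDFPage page, string text)
    {
        PDFContentExtractor extractor = new PDFContentExtractor(page);
        return extractor.SearchText(text);
    }
}
```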
SearchText(String, PDFTextSearchOptions)¶
Searches the page content for the specified text.
Parameters
text String
Text to find in page content.
searchOptions PDFTextSearchOptions
A set of options that specify how the text search should be performed.
Returns
PDFTextSearchResultCollection
A collection of zero or more search results.
SearchText(String, PDFTextSearchOptions, PDFContentExtractionContext)¶
Searches the page content for the specified text.
public PDFTextSearchResultCollection SearchText(string text, PDFTextSearchOptions searchOptions, PDFContentExtractionContext context)
Parameters
text String
Text to find in page content.
searchOptions PDFTextSearchOptions
A set of options that specify how the text search should be performed.
context PDFContentExtractionContext
Context for text extraction.
Returns
PDFTextSearchResultCollection
A collection of zero or more search results.