How to Parse and Extract Content from PDF Documents in C# VB.NET

Chelsea Devereaux - May 5 '22 - Dev Community

Tutorial Concept

Learn how to parse and modify text, extract images, and utilize regular expressions to find specific data within PDF documents using C# and a powerful .NET PDF API.


PDF files are among the most common files in the digital world, yet they can also be among the most challenging to work with. Parsing text or images from a PDF can seem daunting, but Document Solutions for PDF v7 makes the process easy. Our latest releases continue to improve the handling of text within PDF documents and add many other upgrades and feature enhancements, specifically the ability to parse and read text from a PDF using C# and to modify text throughout a PDF document. New samples covering every feature, each with a full code-behind view, also help developers editing PDF documents get up and running quickly.

Starting with version 3.2, we have continually improved the logic for parsing, extracting, and reading text from a PDF. It efficiently handles special cases, such as text that is rendered multiple times to create bold or shadowed effects, so that such text appears only once in the output rather than being repeated.
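
To make the bold/shadow case concrete, here is a minimal sketch of the general idea: when the same string is drawn twice at almost the same position, only one copy is kept. This is not how Document Solutions for PDF implements it; the tuple layout, the Dedupe helper, and the tolerance value are all assumptions made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical text runs: the text plus the position where it was drawn.
var runs = new List<(string Text, float X, float Y)>
{
    ("Wetlands", 72f, 100f),
    ("Wetlands", 72.4f, 100.3f), // same text re-drawn with a tiny offset (faux bold)
    ("matter", 140f, 100f),
};

var unique = Dedupe(runs, 1f);
Console.WriteLine(string.Join(" ", unique.ConvertAll(r => r.Text)));

// Keeps the first copy of each run and drops later runs that have the same
// text and sit within 'tolerance' units of an already kept run.
static List<(string Text, float X, float Y)> Dedupe(
    IEnumerable<(string Text, float X, float Y)> input, float tolerance)
{
    var kept = new List<(string Text, float X, float Y)>();
    foreach (var run in input)
    {
        bool duplicate = kept.Exists(k =>
            k.Text == run.Text &&
            Math.Abs(k.X - run.X) <= tolerance &&
            Math.Abs(k.Y - run.Y) <= tolerance);
        if (!duplicate)
            kept.Add(run);
    }
    return kept;
}
```

With the sample data, the second "Wetlands" run is treated as a re-draw of the first, so each word shows up once in the printed result.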

For text within a PDF, the Document Solutions for PDF API contains the FindText method, which can find text that spans more than one line. The FindText method returns a list of FoundPosition objects; each object's Bounds property contains an array of Quadrilateral structures that outline the found text. A new property, ITextMap.Paragraphs, returns a collection of ITextParagraph objects associated with the ITextMap.

For images within a PDF, the Document Solutions for PDF API contains the GetImages method, which utilizes the ImageBrush class to return an array of images from the PDF file. You can see an example of this in action directly within our demos.

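In this blog, we will explore the following topics: extracting paragraphs with the ITextMap.Paragraphs property, finding text that spans multiple lines or paragraphs, using regular expressions to extract data, and extracting images.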

Create C# PDF Parsing Code with the ITextMap.Paragraphs Property

This example reads an existing multi-page PDF document and demonstrates how to use ITextMap.Paragraphs to extract paragraphs from each page of a PDF document. The complete example and code are included in the updated demo sample explorer for Document Solutions for PDF.

Original PDF with image below the text

The code extracts the text paragraphs on each page, rendering each section in alternating colors (for clarity) in a new PDF document:

Extracted paragraphs now in a new PDF

Set the Formatting for the Generated PDF

The code used to generate the format settings for the above PDF is shown below. First, the code sets an integer value to indicate the PDF page margins we will be working with, along with some colors we’ll be using throughout. Then, the code creates a new PDF document where the text paragraphs will be rendered and adds a note explaining the sample at the top of the first page.

Next, new separate TextFormat objects are created to format the captions and paragraphs, and a new TextLayout object is created to specify the page margins.

Finally, a new TextSplitOptions object is made to handle pagination. Using the new ITextMap.Paragraphs property, the code required to perform this task is straightforward:

    const int margin = 36;  
    Color c1 = Color.PaleGreen;  
    Color c2 = Color.PaleGoldenrod;

    GcPdfDocument doc = new GcPdfDocument();  
    var page = doc.NewPage();

    var rc = Common.Util.AddNote(  
        "Here we load an existing PDF (Wetlands) into a temporary GcPdfDocument, " +  
        "and iterate over the pages of that document, printing all paragraphs found on the page. " +  
        "We alternate the background color for the paragraphs so that the bounds between paragraphs are more clear. " +  
        "The original PDF is appended to the generated document for reference.",  
        page,   
        new RectangleF(margin, margin, page.Size.Width - margin * 2, 0));

    // Text format for captions:  
    var tf = new TextFormat()  
    {  
        Font = GCTEXT.Font.FromFile(Path.Combine("Resources", "Fonts", "yumin.ttf")),  
        FontSize = 14,  
        ForeColor = Color.Blue  
    };  
    // Text format for the paragraphs:  
    var tfpar = new TextFormat()  
    {  
        Font = StandardFonts.Times,  
        FontSize = 12,  
        BackColor = c1,  
    };  
    // Text layout to render the text:  
    var tl = page.Graphics.CreateTextLayout();  
    tl.MaxWidth = doc.PageSize.Width;  
    tl.MaxHeight = doc.PageSize.Height;  
    tl.MarginAll = rc.Left;  
    tl.MarginTop = rc.Bottom + 36;  
    // Text split options for widow/orphan control:  
    TextSplitOptions to = new TextSplitOptions(tl)  
    {  
        MinLinesInFirstParagraph = 2,  
        MinLinesInLastParagraph = 2,  
        RestMarginTop = rc.Left,  
    };

Code Analysis of Document Solutions Parsing/Reading PDF with C#

Now, we'll show how to use the GetTextMap method to extract the text from the original PDF. First, the Wetlands.pdf document (the original PDF) is opened and loaded into a new GcPdfDocument object. Then, for each page, the new ITextMap.Paragraphs API is used to get the text paragraphs, which are appended to the text layout of the new document. Each paragraph is appended using the tfpar TextFormat, and after each one the code switches tfpar.BackColor between the two colors so that the separate paragraphs are easy to tell apart in the new document.

Then, TextLayout.PerformLayout and TextLayout.Split are used to paginate the results, and the original document is appended to the output for reference using the GcPdfDocument.MergeWithDocument method.

The final result is saved as a new PDF using the GcPdfDocument.Save method.

    // Open an arbitrary PDF, load it into a temp document and get all page texts:  
    using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "Wetlands.pdf")))  
    {  
        var doc1 = new GcPdfDocument();  
        doc1.Load(fs);

        for (int i = 0; i < doc1.Pages.Count; ++i)  
        {  
            tl.AppendLine(string.Format("Paragraphs from page {0} of the original PDF:", i + 1), tf);

            var pg = doc1.Pages[i];  
            var pars = pg.GetTextMap().Paragraphs;  
            foreach (var par in pars)  
            {  
                tl.AppendLine(par.GetText(), tfpar);  
                tfpar.BackColor = tfpar.BackColor == c1 ? c2 : c1;  
            }  
        }

        tl.PerformLayout(true);  
        while (true)  
        {  
            // 'rest' will accept the text that did not fit:  
            var splitResult = tl.Split(to, out TextLayout rest);  
            doc.Pages.Last.Graphics.DrawTextLayout(tl, PointF.Empty);  
            if (splitResult != SplitResult.Split)  
                break;  
            tl = rest;  
            doc.NewPage();  
        }  
        // Append the original document for reference:  
        doc.MergeWithDocument(doc1, new MergeDocumentOptions());  
    }  
    // Done:  
    doc.Save(stream);

Parse/Read the Text Across Multiple Lines or Paragraphs with C# and DsPdf

Using the FindText method across lines and paragraphs

Finalize C# PDF Parsing/Reading Code and Extract Data (Save)

The FindText method now supports finding text that appears in multiple lines in a paragraph or across pages. To illustrate this, code similar to the code in the FindText demo sample is added, which searches for longer text strings that span multiple lines and paragraphs. FindText returns a list with the found positions of all instances of the search string within the document. The FoundPosition.Bounds property returns an array of Quadrilateral structures, which together form the bounds of the found text on each successive line or section.

In the code below, we use the FindText method to find two longer text strings, where the first string spans multiple lines and the second string spans multiple paragraphs.

The code uses GcGraphics.FillPolygon to highlight the found text and fill the area of the found text with a semi-transparent orange-red color, as shown in the output image above.

    var findIt = doc.FindText(new FindTextParams("Hundreds, if not thousands, of invertebrates that form the food of birds also rely on water for most, if not all, phases of their existence.", true, false), OutputRange.All);   
    foreach (var find in findIt)   
        foreach (var ql in find.Bounds)   
            doc.Pages[find.PageIndex].Graphics.FillPolygon(ql, Color.FromArgb(100, Color.OrangeRed));   
    var findIt2 = doc.FindText(new FindTextParams("To lose any more of these vital areas is almost unthinkable. Wetlands enhance and protect water quality in lakes and streams where additional species spend their time and from which we draw our water.", true, false), OutputRange.All);
    foreach (var find in findIt2)   
        foreach (var ql in find.Bounds)   
            doc.Pages[find.PageIndex].Graphics.FillPolygon(ql, Color.FromArgb(100, Color.OrangeRed));  
    // Done:
    doc.Save(stream);
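
As a small aside on the highlight color used above: the first argument to Color.FromArgb is the alpha channel on a 0 to 255 scale, so a value of 100 gives roughly 39% opacity and the underlying text stays readable. The sketch below only inspects the color value; it assumes the Color type in the sample is System.Drawing.Color, which the snippet does not state explicitly.

```csharp
using System;
using System.Drawing;

// Same call as in the highlighting code: alpha 100 applied to OrangeRed.
Color highlight = Color.FromArgb(100, Color.OrangeRed);

// Alpha is a byte, so opacity is alpha / 255 (about 0.39 here).
double opacity = highlight.A / 255.0;

Console.WriteLine($"A={highlight.A}, R={highlight.R}, G={highlight.G}, B={highlight.B}");
Console.WriteLine(opacity < 0.5 ? "mostly transparent" : "mostly opaque");
```

Raising the alpha value makes the highlight more opaque, and lowering it lets more of the page content show through.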

Utilize REGEX to Extract Data from a PDF

Another useful application of DsPDF’s FindText method is the use of regular expressions so that specific known pieces of information can be quickly and easily extracted from PDF documents.

Document Solutions for PDF supports finding text based on regular expressions using the FindText method of the GcPdfDocument class and passing the regular expression to this method using the FindTextParams class. The code snippet below makes use of the FindText method to extract an invoice total and a customer email address from an invoice:

    void ExtractRegex()  
    {  
        using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "InvoiceDemo.pdf")))  
        {  
            //Load Sample PDF document  
            GcPdfDocument doc = new GcPdfDocument();
            doc.Load(fs);

            //Find Invoice total amount
            FindTextParams searchParam1 = new FindTextParams(@"(Total)\r\n\$([-+]?[0-9]*\.?[0-9]+)", false, false, 72, 72, true, true);
            IList<FoundPosition> pos1 = doc.FindText(searchParam1);  
            string totalAmount = pos1[0].NearText.Substring(pos1[0].PositionInNearText + pos1[0].TextMapFragment[0].Length).TrimStart();  
            Console.WriteLine("Total amount found using regex in FindText method: " + totalAmount);

            //Find customer's email address from Invoice  
            FindTextParams searchParam2 = new FindTextParams(@"[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+", false, false, 72, 72, true, true);  
            IList<FoundPosition> pos2 = doc.FindText(searchParam2);  
            string foundEmail = pos2[0].NearText.Substring(pos2[0].PositionInNearText, pos2[0].TextMapFragment[0].Length);  
            Console.WriteLine("Email Address found using regex in FindText method: " + foundEmail);  
        }  
    }

Here is an image of the input PDF document that we run the regular expressions against:

Example of invoice PDF to be REGEX searched

Below is the console window output after we run the REGEX search on a PDF from the code above:

Console output from REGEX Expression search on PDF
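
Before running the two patterns against a real PDF, it can help to try them on plain strings first. The sketch below does that with the standard System.Text.RegularExpressions classes and no PDF library at all. The sample text and the TryExtractTotal and TryExtractEmail helpers are hypothetical, and it is an assumption that FindText interprets a pattern the same way the .NET Regex class does.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical text standing in for what an invoice page might contain.
string sample = "Total\r\n$1234.50\r\nBill to: jane.doe@example.com";

if (TryExtractTotal(sample, out decimal total))
    Console.WriteLine("Total: " + total.ToString(CultureInfo.InvariantCulture));
if (TryExtractEmail(sample, out string email))
    Console.WriteLine("Email: " + email);

// Uses the same "Total" pattern as the FindText example; group 2 is the amount.
// The pattern expects a CRLF line break between "Total" and the dollar sign.
bool TryExtractTotal(string text, out decimal value)
{
    value = 0m;
    Match m = Regex.Match(text, @"(Total)\r\n\$([-+]?[0-9]*\.?[0-9]+)");
    return m.Success && decimal.TryParse(
        m.Groups[2].Value, NumberStyles.Number, CultureInfo.InvariantCulture, out value);
}

// Uses the same email pattern as the FindText example; returns the first match.
bool TryExtractEmail(string text, out string value)
{
    Match m = Regex.Match(text, @"[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+");
    value = m.Success ? m.Value : "";
    return m.Success;
}
```

Note that the email pattern also accepts a trailing period, so an address at the end of a sentence may need trimming, and that both helpers report failure instead of throwing when nothing matches.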

Parse and Extract Images from a PDF

While many people may only be interested in extracting text from PDFs, Document Solutions for PDF’s powerful API library also offers the ability to extract images from PDFs. Let’s revisit the Wetlands.pdf that we extracted text from in the first portion of this blog, but this time, we’ll only be extracting the images instead of the text. Each image extracted from the original PDF will exist on a separate page of our newly generated PDF. The full sample can be found in the demo section of our website.

The image below shows the extracted image from the original Wetlands.pdf inside a new PDF file:

Extracted image inside a new PDF file

To extract the image, the GetImages method was used. In the code below, we start by loading the Wetlands.pdf file into a new GcPdfDocument object. We then create a variable called imageInfos that stores the array of images returned from the GetImages method. Next, we create a new PDF document to hold the extracted images and iterate through imageInfos, adding each image found in the original PDF to a new page of the new document. While doing this, we also label each page with the page number(s) where the image was found in the original PDF. This is possible because the objects returned from the GetImages method contain information about the page indices and locations where each image appeared in the original PDF. Lastly, we save the newly generated PDF document to view later.

    using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "Wetlands.pdf")))
    {
        var docSrc = new GcPdfDocument();
        // Load an existing PDF with some images:
        docSrc.Load(fs);
        // This call extracts information about images from the loaded PDF,
        // note that for large files it may take a while to complete:
        var imageInfos = docSrc.GetImages();

        var doc = new GcPdfDocument();
        var textPt = new PointF(72, 72);
        var imageRc = new RectangleF(72, 72 * 2, doc.PageSize.Width - 72 * 2, doc.PageSize.Height - 72 * 3);

        foreach (var imageInfo in imageInfos)
        {
            // The same image may appear on multiple locations, 
            // imageInfo includes page indices and locations on pages;
            // for simplicity sake we only print page numbers here:
            var sb = new StringBuilder();
            imageInfo.Locations.ForEach(il_ => sb.Append((il_.Page.Index + 1).ToString() + ", "));
            var g = doc.NewPage().Graphics;
            // Note: tf is a TextFormat that is not defined in this snippet; it comes from the rest of the sample.
            g.DrawString($"This image appears on page(s) {sb.ToString().TrimEnd(' ', ',')} of the original PDF:", tf, textPt);
            g.DrawImage(imageInfo.Image, imageRc, null, ImageAlign.ScaleImage);
        }
        // Done:
        doc.Save(stream);
    }
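
The page list in the sample above is built with a StringBuilder and then trimmed with TrimEnd. As a possible alternative, the same string can be produced with string.Join, which never adds a trailing separator. This is only a sketch: the plain integer list stands in for the il_.Page.Index values, and the FormatPages helper is a hypothetical name, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical zero-based page indices standing in for il_.Page.Index values.
var pageIndices = new List<int> { 0, 2, 2, 5 };

string pages = FormatPages(pageIndices);
Console.WriteLine($"This image appears on page(s) {pages} of the original PDF:");

// Converts zero-based indices to 1-based page numbers, removes repeats,
// sorts them, and joins them with ", " (no trailing separator to trim).
static string FormatPages(IEnumerable<int> zeroBasedIndices) =>
    string.Join(", ", zeroBasedIndices.Select(i => i + 1).Distinct().OrderBy(n => n));
```

Unlike the original loop, this version lists a page only once even if the image appears on it several times; drop the Distinct call to keep the original behavior.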


We hope you have found this helpful! Contact us with any questions you may have related to this blog or the Document Solutions product family, and keep on coding!
