Automate data entry from PDF to Excel with Python

I will show you today how you can automate the process of importing PDF data into Excel. This can save a lot of time for everyone doing this daily.

This is our test data:

Let's jump to the code!

First we need to import dependencies

from tika import parser
import pprint
from collections import defaultdict
import re
import pandas as pd

After that we need to define PrettyPrinter and get the content of the PDF file and convert it into a list:

pp = pprint.PrettyPrinter(indent=3)
parsedPDF = parser.from_file("final-test.pdf")

content = parsedPDF['content']
contentlist = content.split('\n')

Remove empty strings in the list resulting from the split

contentlist = list(filter(lambda a: a != '', contentlist))

Create an iterator and other flags that we will use to for the algorithm, Iterator of the contents of PDF per line:

iterateContent = iter(contentlist)

Dictionary placeholder of the data scraped

data = defaultdict(dict)

Our counter to count how many blocks did we able to get

 cntr = 0

Indicator which line are we in a specific block of data

 line = 1

The algorithm will use the flags cntr and line to determine if we are in a new block or existing block

while True:
    try:
        string = next(iterateContent)
    except StopIteration:
        break

    if re.match('^[A-Z\s]+$', string):
        cntr += 1           

        data[cntr]['Name'] = string
        line = 2
        print('matched')

    elif line == 2:
        data[cntr]['Address'] = string
        line += 1

    elif line == 3:
        data[cntr]['Website'] = string
        line += 1

print("Total data:", len(data.keys()))

Setting up the data into Dataframe

df = pd.DataFrame(data.values())
df.index += 1
print(df)

Write the dataframe into excel

writer = pd.ExcelWriter("dataframe.xlsx", engine='xlsxwriter')
df.to_excel(writer, sheet_name='output', index=False)
writer.save()

Our final results:

Thank you all.