Using Python for Data Extraction from PDFs

Data extraction refers to obtaining valuable information from different sources. These sources might include CSV files, websites, PDF documents, Excel files, and many other file formats. Portable Document Format (PDF) is one of the most widely used document formats in the world. It is used extensively across enterprises, government offices, education, finance, healthcare, and other industries. PDF documents contain a massive volume of unstructured data, and extracting and analyzing this data accurately is a regular task for data scientists and other professionals. There are many Python libraries for working with PDF documents. This article describes, in simple terms, how to use several Python libraries for PDF data extraction, such as PyPDF2, a versatile library built as a PDF toolkit.

PDF Formatting

Form data in PDF documents exists in two basic types. One is XML Forms Architecture (XFA), and the other is AcroForms. The latter is Adobe's oldest and original interactive form technology, introduced in 1996 as part of the PDF 1.2 specification. With AcroForms, the form layout is designed in a tool such as Adobe Illustrator, Adobe InDesign, or Microsoft Word, and the form elements (fields, dropdown controls, checkboxes, and so on) are added afterwards. XFA, by contrast, is the basis of Adobe's interactive and dynamic forms: users can create and publish such PDF forms with Adobe Experience Manager (AEM) Forms Designer. AcroForms, on the other hand, provide a traditional static PDF layout with interactive form fields.
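
The distinction between the two form types can be checked in code. In the PDF format, an interactive form is described by an /AcroForm dictionary in the document catalog, and an XFA form stores its XML under an /XFA entry inside that dictionary. The following is a minimal sketch on that basis; classify_form is a hypothetical helper, and the plain dictionaries are hand-written stand-ins for the catalog that a PDF library would return:

```python
def classify_form(catalog):
    """Return 'none', 'acroform', or 'xfa' for a PDF document catalog.

    `catalog` is any mapping that behaves like the PDF catalog dictionary.
    """
    if "/AcroForm" not in catalog:
        return "none"          # no interactive form at all
    acroform = catalog["/AcroForm"]
    if "/XFA" in acroform:
        return "xfa"           # XML Forms Architecture data is present
    return "acroform"          # classic AcroForm fields only


# Hand-written stand-ins for real catalogs (illustrative data only).
static_form = {"/AcroForm": {"/Fields": []}}
dynamic_form = {"/AcroForm": {"/Fields": [], "/XFA": ["template", "..."]}}

print(classify_form({}))            # none
print(classify_form(static_form))   # acroform
print(classify_form(dynamic_form))  # xfa
```

With PyPDF2, the catalog is usually reachable as reader.trailer["/Root"]; that lookup is left out so the sketch runs without a PDF file.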

Frequently Used Python Libraries

Because PDF documents hold a wide range of data, the text often has to be extracted before it can be analyzed. PDF documents can contain structured or unstructured data, so choosing the right library is necessary to achieve maximum accuracy. The following are some popular Python libraries and packages that help extract data from PDF documents:

PyPDF2
Tika
PyMuPDF
Textract
Tabula
PDFMiner
PDFtotext
PDFQuery
Xpdf
Slate
pdflib
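
Before choosing among these, it can help to see which of them are already installed. The following is a rough sketch that uses only the standard library; the import names in CANDIDATES are assumptions (they can differ from the package names above, and PyMuPDF, for example, is imported as fitz), and installed_libraries is a hypothetical helper:

```python
from importlib.util import find_spec

# Display name -> assumed import name. Adjust to match your environment.
CANDIDATES = {
    "PyPDF2": "PyPDF2",
    "Tika": "tika",
    "PyMuPDF": "fitz",
    "Textract": "textract",
    "Tabula": "tabula",
    "PDFMiner": "pdfminer",
    "PDFtotext": "pdftotext",
    "PDFQuery": "pdfquery",
    "Slate": "slate",
}


def installed_libraries(candidates=CANDIDATES):
    """Return the display names whose import name resolves in this environment."""
    return [name for name, module in candidates.items()
            if find_spec(module) is not None]


print(installed_libraries())
```

The check only shows that a module can be found. It does not verify external requirements such as Java for Tika and Tabula.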

PyPDF2

PyPDF2 is a pure-Python library that allows users to split, merge, crop, encrypt, and transform PDFs. It can add custom data, viewing options, and passwords to PDF documents. Because it is pure Python, it runs on any Python platform and spares users the hassle of external dependencies. The library also supports other operations on PDF files, such as decrypting data and adding watermarks.

PyPDF2 Example

The following is a simple example that prints the number of pages in an input PDF and extracts its text using PyPDF2:

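import PyPDF2

path = r"\....Downloads\sample.pdf"

# PdfReader, pages, and extract_text() are the names used by PyPDF2 2.x and later;
# older releases call them PdfFileReader, numPages/getPage(), and extractText().
with open(path, 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfReader(pdfFileObj)
    print(len(pdfReader.pages))  # number of pages in the document
    pypdf2_text = ""             # accumulates the text of every page
    for page in pdfReader.pages:
        pypdf2_text += page.extract_text()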

PyPDF2 also has some disadvantages. The library extracts the data but cannot preserve the structure of the original PDF document, including any tabular structure. The extracted text also contains extra newlines and spaces. Therefore, users who extract data from a LaTeX-based PDF might lose valuable information because of spurious spaces.
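
Some of the stray newlines and spaces can be cleaned up after extraction. The following is a minimal sketch of such post-processing; normalize_whitespace is a hypothetical helper, the sample string is made up, and the approach cannot restore table structure that was lost during extraction:

```python
import re


def normalize_whitespace(text):
    """Collapse runs of spaces/tabs inside lines and squeeze blank-line runs."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    cleaned = "\n".join(lines)
    # Keep at most one empty line between blocks of text.
    return re.sub(r"\n{3,}", "\n\n", cleaned).strip()


raw = "Total   amount \n\n\n\n  42.00\t USD \n"
print(repr(normalize_whitespace(raw)))  # 'Total amount\n\n42.00 USD'
```

The same function could be applied to the pypdf2_text string produced by the example above.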

PDFMiner

PDFMiner extracts data from PDF documents. It enables analysis of text and tabular data and reports the exact location of text on a page, which is its primary purpose alongside text extraction. It also offers information such as fonts, lines, and metadata, and it can work as a PDF transformer and a PDF parser. The original PDFMiner is compatible with Python versions 2.5 to 2.7 and does not work well with Python 3; the community-maintained fork pdfminer.six supports Python 3. The library can convert PDF files into other formats such as HTML or XML, and its functionality is available through a Python API.

PDFMiner Example

The following is an example of using PDFMiner. Although the PDFMiner extraction code is longer, it often gives more accurate text extraction than the other libraries.

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

path = r"\....Downloads\Sample.pdf"

def convert_pdf_to_txt(path):
    """Extract the text of every page in the PDF at `path`."""
    rsrcmgr = PDFResourceManager()   # caches shared resources such as fonts
    retstr = StringIO()              # in-memory buffer that receives the text
    codec = 'utf-8'
    laparams = LAParams()            # default layout-analysis parameters
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0                     # 0 means no page limit
    caching = True
    pagenos = set()                  # an empty set means all pages
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)
    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text

pdf_miner_text = convert_pdf_to_txt(path)

PDFtotext

PDFtotext is a Python package for extracting text from PDFs. It is not pure Python: it wraps the Poppler library, which must be installed on the system. The package supports only PDF files and no other formats. The extracted data is returned as an object that holds the text of each page. The main benefit of using PDFtotext is that it preserves the original layout of the text in the PDF file.

PDFtotext Example

The following is an example of using the PDFtotext library to extract data from an input PDF:

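import pdftotext

path = r"\....Downloads\Sample.pdf"

# Load the PDF; the resulting object can be iterated page by page.
with open(path, "rb") as f:
    pdf = pdftotext.PDF(f)

# Join the text of all pages into a single string.
pdftotext_text = "\n\n".join(pdf)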
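
Because the extracted object yields one string per page, it can also be processed page by page instead of being joined. The following is a small sketch that finds the pages mentioning a keyword; pages_containing is a hypothetical helper, and the list of strings is a made-up stand-in for the pdf object above:

```python
def pages_containing(pages, keyword):
    """Return 1-based page numbers whose text contains `keyword` (case-insensitive)."""
    needle = keyword.lower()
    return [number for number, text in enumerate(pages, start=1)
            if needle in text.lower()]


# Stand-in for a pdftotext.PDF object, which can be iterated page by page.
sample_pages = [
    "Invoice 2021\nTotal: 42.00",
    "Terms and conditions",
    "Total due on receipt",
]
print(pages_containing(sample_pages, "total"))  # [1, 3]
```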

Other Libraries

A popular Python wrapper for PDF data extraction is tabula-py. It wraps tabula-java, which can read tables from PDF files, and converts them into pandas DataFrames or into CSV, TSV, or JSON files.

Another Python package is Slate. It enables the extraction of information but depends on the PDFMiner library. Tika is a Python package that binds to the Apache Tika™ REST services, so it requires Java to be installed on the system. Tika returns the parsed result as a dictionary whose keys hold the PDF metadata and the extracted content.
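
To illustrate the dictionary that Tika returns, the following sketch summarizes a result with "metadata" and "content" keys. The summarize_parsed helper is hypothetical, and fake_parsed is a hand-written stand-in for what parser.from_file(path) from the tika package would return, so the sketch runs without Java or a PDF file:

```python
def summarize_parsed(parsed, preview_chars=60):
    """Summarize a Tika-style result dictionary with 'metadata' and 'content' keys."""
    metadata = parsed.get("metadata") or {}           # may be missing or None
    content = (parsed.get("content") or "").strip()   # may be None for empty files
    return {
        "metadata_keys": sorted(metadata),
        "preview": content[:preview_chars],
    }


# Made-up example data; real metadata keys depend on the document.
fake_parsed = {
    "metadata": {"Content-Type": "application/pdf", "xmpTPg:NPages": "2"},
    "content": "\n\nQuarterly report\nRevenue grew in the third quarter.",
}
print(summarize_parsed(fake_parsed))
```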

Finally, PDFQuery is a Python wrapper that extracts PDF data with a minimum of programming. It wraps around PDFMiner, lxml, and pyquery. It is useful for extracting data from sets of PDFs. Moreover, it is cost-friendly and user-friendly, and it helps in PDF scraping by using jQuery-style and XPath syntax.

Python Alternatives

Although Python libraries are quite versatile and have great features for PDF data extraction, they require a coding-based approach that may not suit many users. These libraries also have dependencies and may not offer the most accurate results in some cases. To address these issues, dedicated solutions for PDF data extraction are recommended, such as ByteScout and PDF Solutions.

ByteScout

ByteScout is a document generation and configuration tool. It enables easy PDF document creation, supporting various formats, including PNG, JPEG, TIFF, and CCITT Fax. ByteScout integrates advanced security features, allowing 40-bit, 128-bit, and 256-bit encryption and enabling Type1, TrueType, and Unicode font embedding. It does not require any third-party software and provides PDF generation in C# or VB.

PDF Solutions

PDF Solutions offers a specialized solution for semiconductor manufacturing and test operations. Its consultants have expertise in big data analytics and AI/ML algorithms, providing state-of-the-art services worldwide. PDF Solutions has amassed a wide-ranging product portfolio to execute end-to-end data analysis across FDC, Yield, Test, and Assembly & Packaging. Eighteen of the top 20 semiconductor manufacturing firms and the top 6 foundries in the semiconductor industry use PDF Solutions' products.
