Efficient PDF File Handling with Python

Introduction

PDF (Portable Document Format) files are widely used for sharing and presenting documents in a standardized manner. Whether you need to extract data from a PDF, modify its content, or generate new PDF files programmatically, Python provides several powerful libraries and tools to accomplish these tasks. In this article, we will explore various techniques for working with PDF files in Python.

Installing Required Library

Before getting started, we need to install the necessary Python library. One of the most commonly used libraries for PDF manipulation in Python is PyPDF2. You can install it using the following command:


pip install PyPDF2

The examples in this article use PyPDF2’s older camelCase API (`PdfFileReader`, `getPage()`, and so on), which was removed in PyPDF2 3.0. If you encounter the `PyPDF2.errors.DeprecationError` message while running any of the programs provided here, install a version below 3.0 instead:


pip install 'PyPDF2<3.0'

How to read a PDF File

The PyPDF2 library provides a convenient way to read and extract data from existing PDF files. Here’s an example of how to open a PDF file and extract its text content:


import PyPDF2

pdf_file = open('example.pdf', 'rb')
pdf_reader = PyPDF2.PdfFileReader(pdf_file)

for page in range(pdf_reader.numPages):
    page_obj = pdf_reader.getPage(page)
    text_content = page_obj.extractText()
    print(text_content)

pdf_file.close()
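
The example above closes the file manually, so the handle stays open if text extraction raises an error. The following is a minimal sketch of the same loop wrapped in a `with` block. The helper names `extract_page_texts` and `read_pdf_text` are illustrative and not part of PyPDF2, and the code assumes the legacy `PyPDF2<3.0` API used throughout this article.

```python
def extract_page_texts(reader):
    """Return a list with the extracted text of every page.

    `reader` is assumed to expose the legacy PyPDF2 (<3.0) interface:
    a `numPages` attribute and a `getPage(index)` method.
    """
    texts = []
    for index in range(reader.numPages):
        texts.append(reader.getPage(index).extractText())
    return texts


def read_pdf_text(path):
    """Open `path`, extract the text of all pages, and close the file."""
    # Imported lazily so extract_page_texts() stays usable on its own.
    import PyPDF2

    # The with-block closes the file even if extraction raises an exception.
    with open(path, 'rb') as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        return extract_page_texts(reader)
```

Calling `read_pdf_text('example.pdf')` returns one string per page, which you can print or write to a file.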

Extracting Meta Data from PDF Files

PDF files often contain metadata such as author information, creation dates, and keywords. Extracting this metadata can help with organizing and managing documents. Here’s an example of how to read it with PyPDF2:


import PyPDF2

# Opening the PDF File
file_path = 'path_to_pdf_file.pdf'
pdf_file = open(file_path, 'rb')

# Extracting Metadata
pdf_reader = PyPDF2.PdfFileReader(pdf_file)

info = pdf_reader.getDocumentInfo()
author = info.author
title = info.title
# The raw creation date is stored under the '/CreationDate' key
creation_date = info.get('/CreationDate')

# Displaying the Extracted Metadata
print("Author:", author)
print("Title:", title)
print("Creation Date:", creation_date)

# Closing the PDF File
pdf_file.close()
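
PDF files store dates in the document information dictionary (under keys such as `/CreationDate`) as strings like `D:20230115093000+05'30'`. The sketch below converts such a string into a Python `datetime`. The function name `parse_pdf_date` is an assumption made for this example, and the parser only covers the common layout described in the comments.

```python
import re
from datetime import datetime, timedelta, timezone

# Layout: D:YYYYMMDDHHmmSS followed by an optional offset such as +05'30' or Z.
# Every field after the year is optional.
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<sign>[+\-Z])?(?P<tz_hour>\d{2})?'?(?P<tz_minute>\d{2})?'?$"
)


def parse_pdf_date(raw):
    """Convert a PDF date string to a datetime, or return None if it is invalid."""
    if not raw:
        return None
    match = _PDF_DATE.match(str(raw))
    if match is None:
        return None
    parts = match.groupdict()

    try:
        # Missing month and day default to 1, missing time fields to 0.
        moment = datetime(
            int(parts["year"]),
            int(parts["month"] or 1),
            int(parts["day"] or 1),
            int(parts["hour"] or 0),
            int(parts["minute"] or 0),
            int(parts["second"] or 0),
        )
    except ValueError:
        return None  # e.g. month 13

    sign = parts["sign"]
    if sign is None:
        return moment  # no time zone given: naive datetime

    offset = timedelta(
        hours=int(parts["tz_hour"] or 0),
        minutes=int(parts["tz_minute"] or 0),
    )
    if sign == "-":
        offset = -offset
    return moment.replace(tzinfo=timezone(offset))
```

You could pass the raw creation date read from the metadata to `parse_pdf_date()` before displaying or comparing it.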

Extracting Images from PDF Files

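PDF files often contain images that you may want to extract for further processing. Image data is stored in a page’s resources as XObjects, which PyPDF2 lets you read directly. Here’s an example:


import PyPDF2

pdf_file = open('example.pdf', 'rb')
pdf_reader = PyPDF2.PdfFileReader(pdf_file)

for page in range(pdf_reader.numPages):
    page_obj = pdf_reader.getPage(page)

    # Images are stored as XObjects in the page resources
    resources = page_obj['/Resources']
    if '/XObject' not in resources:
        continue
    x_objects = resources['/XObject']

    for name in x_objects:
        x_object = x_objects[name]
        if x_object['/Subtype'] != '/Image':
            continue
        # Stream bytes; which filters get decoded
        # depends on the PyPDF2 version
        img_data = x_object.getData()
        img_type = x_object.get('/Filter')
        # Process the image data as required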

pdf_file.close()
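
The `/Filter` entry of an image stream tells you how its data is encoded, which decides whether the bytes can be saved to disk as they are. The sketch below maps the two filters whose data is already a complete image file to a file extension. The helper name `extension_for_filter` is hypothetical, and any other filter (for example `/FlateDecode`, which holds compressed raw pixel data) is reported as `None` because it needs an imaging library to turn it into a viewable file.

```python
# Filters whose stream data is already a complete image file.
_DIRECT_EXTENSIONS = {
    '/DCTDecode': '.jpg',  # JPEG
    '/JPXDecode': '.jp2',  # JPEG 2000
}


def extension_for_filter(filter_value):
    """Return a file extension if the image bytes can be saved as-is, else None.

    `filter_value` may be None, a single filter name, or a list of names.
    """
    if isinstance(filter_value, (list, tuple)):
        # Chained filters must be decoded step by step; only unwrap the simple case.
        if len(filter_value) != 1:
            return None
        filter_value = filter_value[0]
    if filter_value is None:
        return None
    return _DIRECT_EXTENSIONS.get(str(filter_value))
```

A return value of `'.jpg'`, for instance, suggests writing the image bytes to a file with that extension in binary mode.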

Rotating a Single Page of a PDF file

A PDF file may contain pages with the wrong orientation. To rotate a specific page of a PDF file, you can use the PyPDF2 library. Here’s an example that rotates a single page within a PDF file:


import PyPDF2

# Open the PDF file in read-binary mode
pdf_file = open('example.pdf', 'rb')

# Create a PDF reader object
pdf_reader = PyPDF2.PdfFileReader(pdf_file)

# Zero-based index of the page to rotate
# (e.g., 1 for the second page)
page_number = 1

# Create a new PDF writer object
pdf_writer = PyPDF2.PdfFileWriter()

# Copy every page to the writer in its original
# order, rotating only the selected page
for page_num in range(pdf_reader.numPages):
    page = pdf_reader.getPage(page_num)
    if page_num == page_number:
        # Rotate the page clockwise by 90 degrees
        page.rotateClockwise(90)
    pdf_writer.addPage(page)

# Save the modified PDF to a new file
output_pdf = open('rotated.pdf', 'wb')
pdf_writer.write(output_pdf)

# Close the file handles
pdf_file.close()
output_pdf.close()

In this example, we open the PDF file using the `open()` function in read-binary mode (`'rb'`). Then, we create a PDF reader object using the `PdfFileReader` class from PyPDF2. Next, we store the zero-based index of the page we want to rotate in `page_number`.

We then create a new PDF writer object using the `PdfFileWriter` class and loop over every page of the original PDF file, retrieving each one with the `getPage()` method.

When the loop reaches the selected page, we rotate it using the `rotateClockwise()` method of the `PageObject` class. In this example, we rotate the page by 90 degrees clockwise, but you can use any multiple of 90 degrees.

Each page, rotated or not, is added to the writer object using the `addPage()` method, so the output file keeps the original page order.

Finally, we save the modified PDF to a new file using the `write()` method of the writer object. Remember to close the file handles using the `close()` method for both the input and output PDF files.

By executing this code, you will generate a new PDF file named ‘rotated.pdf’ with the specified page rotated.

Rotating Multiple Pages of a PDF File

You can also rotate multiple pages of a PDF with Python and the PyPDF2 library. The following example rotates every page:


import PyPDF2

# Open the PDF file in read-binary mode
pdf_file = open('example.pdf', 'rb')

# Create a PDF reader object
pdf_reader = PyPDF2.PdfFileReader(pdf_file)

# Create a new PDF writer object
pdf_writer = PyPDF2.PdfFileWriter()

# Iterate through each page of the PDF
for page_number in range(pdf_reader.numPages):
    # Get the specific page
    page = pdf_reader.getPage(page_number)

    # Rotate the page clockwise by 90 degrees
    page.rotateClockwise(90)

    # Add the rotated page to the writer object
    pdf_writer.addPage(page)

# Save the modified PDF to a new file
output_pdf = open('rotated.pdf', 'wb')
pdf_writer.write(output_pdf)

# Close the file handles
pdf_file.close()
output_pdf.close()

In this example, we open the PDF file using the `open()` function in read-binary mode (`'rb'`). Then, we create a PDF reader object using the `PdfFileReader` class from PyPDF2.

Next, we create a new PDF writer object using the `PdfFileWriter` class. We iterate through each page of the original PDF file using a for loop and the `numPages` property of the PDF reader.

For each page, we retrieve the specific page using the `getPage()` method. We then rotate the page clockwise by 90 degrees using the `rotateClockwise()` method of the `PageObject` class.

We add the rotated page to the writer object using the `addPage()` method. This way, we accumulate all the rotated pages.

Finally, we save the modified PDF to a new file named ‘rotated.pdf’ using the `write()` method of the writer object. Remember to close the file handles using the `close()` method for both the input and output PDF files.
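
If only some pages need rotating, the loop can take a set of page indexes. Below is a small sketch of that idea. The function `rotate_pages` is a hypothetical helper, not a PyPDF2 function, and it assumes `reader` and `writer` follow the legacy `PyPDF2<3.0` interface (`numPages`, `getPage()`, `rotateClockwise()`, `addPage()`) used in this article.

```python
def rotate_pages(reader, writer, page_numbers, angle=90):
    """Copy every page from `reader` to `writer`, rotating the selected ones.

    `page_numbers` holds zero-based page indexes. PDF page rotation is
    expressed in multiples of 90 degrees, so other angles are rejected.
    """
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90 degrees")

    selected = set(page_numbers)
    for index in range(reader.numPages):
        page = reader.getPage(index)
        if index in selected:
            page.rotateClockwise(angle)
        # Pages are added in their original order.
        writer.addPage(page)
    return writer
```

With PyPDF2 installed, you might call `rotate_pages(PyPDF2.PdfFileReader(pdf_file), PyPDF2.PdfFileWriter(), [0, 2])` and then write the returned writer to an output file as in the examples above.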

How to Split Pages of a PDF File

The PyPDF2 library also lets you split a PDF file into separate pages. Here’s an example of how to split the pages of a PDF file using Python:


import PyPDF2

# Open the PDF file in read-binary mode
pdf_file = open('example.pdf', 'rb')

# Create a PDF reader object
pdf_reader = PyPDF2.PdfFileReader(pdf_file)

# Iterate through each page of the PDF
for page_number in range(pdf_reader.numPages):
    # Create a new PDF writer object for each page
    pdf_writer = PyPDF2.PdfFileWriter()

    # Get the specific page
    page = pdf_reader.getPage(page_number)

    # Add the page to the writer object
    pdf_writer.addPage(page)

    # Save the individual page as a new PDF file
    output_pdf = open(f'page_{page_number + 1}.pdf', 'wb')
    
    pdf_writer.write(output_pdf)

    # Close the file handle for
    # the individual page PDF
    output_pdf.close()

# Close the file handle for the original PDF
pdf_file.close()

In this example, we open the PDF file using the `open()` function in read-binary mode (`'rb'`). Then, we create a PDF reader object using the `PdfFileReader` class from PyPDF2.

We iterate through each page of the original PDF file using a for loop and the `numPages` property of the PDF reader. For each page, we create a new PDF writer object using the `PdfFileWriter` class.

We retrieve the specific page using the `getPage()` method. We add the page to the writer object using the `addPage()` method.

We save the individual page as a new PDF file by opening a new file handle with a name that includes the page number (`f'page_{page_number + 1}.pdf'`). We write the contents of the writer object to the new file handle using the `write()` method.

After saving each individual page as a separate PDF file, we close the file handle for the individual page PDF using the `close()` method.

By executing this code, you will generate separate PDF files for each page of the original PDF, with filenames in the format `page_1.pdf`, `page_2.pdf`, and so on.
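
The same approach can split a PDF into chunks of several pages instead of single pages. The sketch below first computes the page ranges and then writes one file per range. Both function names and the `part_N.pdf` naming scheme are assumptions made for this example, and the PyPDF2 calls follow the legacy `PyPDF2<3.0` API.

```python
def chunk_page_ranges(num_pages, chunk_size):
    """Return a list of ranges that cover pages 0..num_pages-1 in chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        range(start, min(start + chunk_size, num_pages))
        for start in range(0, num_pages, chunk_size)
    ]


def split_into_chunks(path, chunk_size, prefix='part'):
    """Write consecutive groups of `chunk_size` pages to separate PDF files."""
    # Imported lazily so chunk_page_ranges() stays usable on its own.
    import PyPDF2

    with open(path, 'rb') as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        ranges = chunk_page_ranges(reader.numPages, chunk_size)
        for number, pages in enumerate(ranges, start=1):
            writer = PyPDF2.PdfFileWriter()
            for index in pages:
                writer.addPage(reader.getPage(index))
            # The input file is still open here, as the writer needs it.
            with open(f'{prefix}_{number}.pdf', 'wb') as output_pdf:
                writer.write(output_pdf)
```

For a seven-page document and a chunk size of three, this would produce two files of three pages and one file with the last page.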

How to Merge PDF Files

To merge multiple PDF files into a single PDF file using Python, you can follow the steps below:


import PyPDF2

# List the PDF files to be merged
pdf_files = ['file1.pdf', 'file2.pdf', 'file3.pdf']

# Create a PDF writer object
pdf_writer = PyPDF2.PdfFileWriter()

# Keep the input files open until the
# merged PDF has been written
open_files = []

# Iterate through each PDF file
for file in pdf_files:
    # Open each PDF file in read-binary mode
    pdf_file = open(file, 'rb')
    open_files.append(pdf_file)

    # Create a PDF reader object for each file
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)

    # Iterate through each page of the PDF
    for page_number in range(pdf_reader.numPages):
        # Get the specific page
        page = pdf_reader.getPage(page_number)

        # Add the page to the writer object
        pdf_writer.addPage(page)

# Save the merged PDF to a new file
output_pdf = open('merged.pdf', 'wb')
pdf_writer.write(output_pdf)

# Close the output file, then the input files
output_pdf.close()
for pdf_file in open_files:
    pdf_file.close()

In this example, we first list the PDF files to be merged in the `pdf_files` list. You can modify this list to include the filenames of the PDF files you want to merge.

We then create a PDF writer object using the `PdfFileWriter` class from PyPDF2. This object will store the merged PDF.

Next, we iterate through each PDF file in the `pdf_files` list. For each file, we open it in read-binary mode using the `open()` function and remember the file handle in the `open_files` list.

Inside the loop, we create a PDF reader object for the current file using the `PdfFileReader` class. We iterate through each page of the PDF using a for loop and the `numPages` property of the PDF reader.

For each page, we retrieve it using the `getPage()` method. We add the page to the writer object using the `addPage()` method.

We do not close the input files inside the loop. The writer reads page content from them when `write()` is called, so closing them earlier can cause the write to fail.

Finally, after adding the pages of all the PDF files, we save the merged PDF to a new file named ‘merged.pdf’ using the `write()` method of the writer object. We then close the file handle for the output PDF file and the handles for all the input files.

By executing this code, you will generate a new PDF file named ‘merged.pdf’ that contains all the pages from the listed PDF files merged into a single PDF.

Adding Watermark to a PDF File

Watermarks are a popular way to add a professional touch to PDF documents. Whether you want to protect your intellectual property or simply brand your PDFs, Python offers a simple and efficient solution for adding watermarks to PDF files. Here’s an example:


import PyPDF2

# Open the original PDF file in 
# read-binary mode
pdf_file = open('original.pdf', 'rb')

# Create a PDF reader object
pdf_reader = PyPDF2.PdfFileReader(pdf_file)

# Create a PDF writer object
pdf_writer = PyPDF2.PdfFileWriter()

# Open the watermark PDF file in 
# read-binary mode
watermark_file = open('watermark.pdf', 'rb')

# Create a PDF reader object for 
# the watermark file
watermark_reader = PyPDF2.PdfFileReader(watermark_file)

# Get the first page of the watermark PDF
watermark_page = watermark_reader.getPage(0)

# Iterate through each page of the original PDF
for page_number in range(pdf_reader.numPages):
    # Get the specific page
    page = pdf_reader.getPage(page_number)

    # Merge the watermark page 
    # with the original page
    page.mergePage(watermark_page)

    # Add the merged page to the writer object
    pdf_writer.addPage(page)

# Save the modified PDF with 
# the watermark to a new file
output_pdf = open('watermarked.pdf', 'wb')
pdf_writer.write(output_pdf)

# Close the file handles
pdf_file.close()
watermark_file.close()
output_pdf.close()

In this example, we open the original PDF file in read-binary mode using the `open()` function. We create a PDF reader object using the `PdfFileReader` class from PyPDF2.

Next, we create a PDF writer object using the `PdfFileWriter` class. We open the watermark PDF file in read-binary mode and create a PDF reader object for the watermark file.

We retrieve the first page of the watermark PDF using the `getPage()` method. This assumes that the watermark PDF contains a single page for the watermark.

We then iterate through each page of the original PDF using a for loop and the `numPages` property of the PDF reader. For each page, we retrieve it using the `getPage()` method.

We merge the watermark page with the original page using the `mergePage()` method of the `PageObject` class. This overlays the watermark onto the original page.

We add the merged page to the writer object using the `addPage()` method.

After iterating through all the pages of the original PDF and adding the watermark, we save the modified PDF with the watermark to a new file named ‘watermarked.pdf’ using the `write()` method of the writer object.

Finally, we close the file handles for the original PDF, watermark PDF, and output PDF using the `close()` method.

By executing this code, you will generate a new PDF file named ‘watermarked.pdf’ with the watermark applied to each page of the original PDF.

Generating PDF Files

If you need to create new PDF files programmatically, you can use libraries like ReportLab or FPDF. Here’s an example using ReportLab:


from reportlab.pdfgen import canvas

pdf_file = canvas.Canvas('new_file.pdf')

# Add content to the PDF file
pdf_file.setFont("Helvetica", 12)
pdf_file.drawString(100, 700, "Hello, World!")

# Save and close the PDF file
pdf_file.save()
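
ReportLab measures coordinates in points from the bottom-left corner of the page, so writing several lines means decreasing the y value for each line and starting a new page when the bottom margin is reached. Here is a minimal sketch of that bookkeeping. The function names, the margin, and the line height are assumptions for illustration, and the default page height of 841.89 points assumes an A4 page.

```python
def layout_lines(lines, page_height=841.89, margin=72, line_height=14):
    """Return a (page_index, y, text) tuple for each line, top to bottom."""
    positions = []
    page_index = 0
    y = page_height - margin
    for text in lines:
        if y < margin:
            # No room left above the bottom margin: continue on a new page.
            page_index += 1
            y = page_height - margin
        positions.append((page_index, y, text))
        y -= line_height
    return positions


def write_lines_to_pdf(path, lines):
    """Draw `lines` into a new PDF, adding pages as needed."""
    # Imported lazily so layout_lines() stays usable without ReportLab.
    from reportlab.pdfgen import canvas

    pdf = canvas.Canvas(path)
    pdf.setFont("Helvetica", 12)
    current_page = 0
    for page_index, y, text in layout_lines(lines):
        if page_index != current_page:
            pdf.showPage()  # finish the current page
            pdf.setFont("Helvetica", 12)  # font settings reset on a new page
            current_page = page_index
        pdf.drawString(72, y, text)
    pdf.save()
```

Keeping the position calculation separate from the drawing code makes it easy to check the layout without generating a file.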

Text Search within PDF Files

PDF files are a popular format for document storage and distribution, but extracting specific information from them can be a challenge. Here’s an example of how to search for a specific piece of text within a PDF file using Python:


import PyPDF2

# Opening the PDF File
file_path = 'path_to_pdf_file.pdf'
pdf_file = open(file_path, 'rb')

# Specify the text you want to search for
search_text = "your_search_text"

# Creating a PDF Reader Object
pdf_reader = PyPDF2.PdfFileReader(pdf_file)

# Iterating through Pages and Searching Text
for page_num in range(pdf_reader.numPages):
    page = pdf_reader.getPage(page_num)
    text = page.extractText()
    
    if search_text in text:
        print("Text found on page:", page_num+1)

pdf_file.close()
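
Extracted text often contains line breaks and irregular spacing, so an exact substring test can miss phrases that are visibly present on the page. The following sketch, which works on a plain list of page texts, ignores case and collapses whitespace before searching. The function name `find_pages` is hypothetical.

```python
import re


def find_pages(page_texts, query):
    """Return {page_number: match_count} for the pages that contain `query`.

    Page numbers start at 1. Matching ignores case and treats any run of
    whitespace (spaces, tabs, line breaks) as a single space.
    """
    def normalize(text):
        return re.sub(r'\s+', ' ', text).strip().casefold()

    needle = normalize(query)
    if not needle:
        return {}

    hits = {}
    for number, text in enumerate(page_texts, start=1):
        count = normalize(text or '').count(needle)
        if count:
            hits[number] = count
    return hits
```

You can build `page_texts` by collecting the result of `extractText()` for every page and then call `find_pages(page_texts, "your_search_text")`.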

Encrypt a PDF File

In today’s digital age, data security is of paramount importance. When it comes to sensitive information stored in PDF files, encrypting them provides an extra layer of protection. You can encrypt PDF files using Python and the PyPDF2 library. Here’s an example:


import PyPDF2

# Open the PDF file in read-binary mode
pdf_file = open('example.pdf', 'rb')

# Create a PDF reader object
pdf_reader = PyPDF2.PdfFileReader(pdf_file)

# Create a PDF writer object
pdf_writer = PyPDF2.PdfFileWriter()

# Set the encryption options
# Password required to open the PDF
user_password = "password"
# Owner password (optional; defaults
# to the user password)
owner_password = "owner_password"

# Iterate through each page of the PDF
for page_number in range(pdf_reader.numPages):
    # Get the specific page
    page = pdf_reader.getPage(page_number)

    # Add the page to the writer object
    pdf_writer.addPage(page)

# Encrypt the PDF with the specified passwords
pdf_writer.encrypt(user_password, owner_password)

# Save the encrypted PDF to a new file
output_pdf = open('encrypted.pdf', 'wb')
pdf_writer.write(output_pdf)

# Close the file handles
pdf_file.close()
output_pdf.close()

In this example, we open the PDF file using the `open()` function in read-binary mode (`'rb'`). Then, we create a PDF reader object using the `PdfFileReader` class from PyPDF2.

Next, we create a PDF writer object using the `PdfFileWriter` class. We set the encryption options for the PDF using the following variables:

  • `user_password`: Set the desired user password. This password is required to open the PDF file.
  • `owner_password`: Set the desired owner password (optional). This password grants full permissions on the PDF file. If you leave it out, the user password is used as the owner password too.

After setting the encryption options, we iterate through each page of the original PDF using a for loop and the `numPages` property of the PDF reader. For each page, we retrieve it using the `getPage()` method and add it to the writer object using the `addPage()` method.

Once all the pages are added, we encrypt the PDF with the two passwords using the `encrypt()` method of the writer object.

Finally, we save the encrypted PDF to a new file named ‘encrypted.pdf’ using the `write()` method of the writer object. Remember to close the file handles using the `close()` method for both the input and output PDF files.

By executing this code, you will generate a new PDF file named ‘encrypted.pdf’ that is encrypted with the specified passwords.

Decrypt a PDF File

Just as you can enhance the security of your PDF documents by applying passwords, Python allows you to reverse the encryption of a PDF file that is protected by a password. You can follow the steps below:


import PyPDF2

# Open the encrypted PDF file in read-binary mode
pdf_file = open('encrypted.pdf', 'rb')

# Create a PDF reader object
pdf_reader = PyPDF2.PdfFileReader(pdf_file)

# Check if the PDF is encrypted
if pdf_reader.isEncrypted:
    # Decrypt the PDF with the provided password
    # Enter the password used to encrypt the PDF
    password = "password"  
    pdf_reader.decrypt(password)

    # Create a PDF writer object
    pdf_writer = PyPDF2.PdfFileWriter()

    # Iterate through each page of the PDF
    for page_number in range(pdf_reader.numPages):
        # Get the specific page
        page = pdf_reader.getPage(page_number)

        # Add the page to the writer object
        pdf_writer.addPage(page)

    # Save the decrypted PDF to a new file
    output_pdf = open('decrypted.pdf', 'wb')
    pdf_writer.write(output_pdf)

    # Close the output file handle
    output_pdf.close()

    print("PDF decrypted successfully.")
else:
    print("The PDF file is not encrypted.")

# Close the input file handle in both cases
pdf_file.close()

In this example, we open the encrypted PDF file using the `open()` function in read-binary mode (`'rb'`). We create a PDF reader object using the `PdfFileReader` class from PyPDF2.

We check if the PDF is encrypted using the `isEncrypted` property of the PDF reader. If it is encrypted, we proceed to decrypt the PDF.

To decrypt the PDF, we store the password that was used to encrypt it in the `password` variable and pass it to the `decrypt()` method of the reader.

We then create a PDF writer object using the `PdfFileWriter` class. We iterate through each page of the encrypted PDF using a for loop and the `numPages` property of the PDF reader. For each page, we retrieve it using the `getPage()` method and add it to the writer object using the `addPage()` method.

Once all the pages are added, we save the decrypted PDF to a new file named ‘decrypted.pdf’ using the `write()` method of the writer object. Remember to close the file handles using the `close()` method for both the input and output PDF files.

If the PDF is not encrypted, we display a message indicating that the PDF file is not encrypted.

By executing this code, you will generate a new PDF file named ‘decrypted.pdf’ if the provided password successfully decrypts the input encrypted PDF file.
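
The script above does not check whether the password was accepted. In the legacy PyPDF2 API, `decrypt()` returns 0 when the password is wrong and a non-zero value when it matches, so the result can be tested before copying pages. The sketch below uses that return value to try several candidate passwords. The helper name `try_passwords` is hypothetical, and the return-value convention is assumed to hold for the `PyPDF2<3.0` versions used in this article.

```python
def try_passwords(reader, candidates):
    """Return the first password that decrypts `reader`, or None.

    `reader.decrypt(password)` is assumed to return 0 for a wrong
    password and a non-zero value for a correct one.
    """
    for password in candidates:
        if reader.decrypt(password) != 0:
            return password
    return None
```

If the function returns `None`, none of the candidates worked and the script should stop instead of trying to read the pages.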

Exercises

Here are some exercises to practice working with PDF files in Python:

1. Extract Text: Write a Python script to extract text from a given PDF file and save it as a text file.

2. Merge PDFs: Create a program that merges multiple PDF files into a single PDF document.

3. Watermarking: Develop a script to add a watermark (e.g., a logo or text) to each page of a PDF file.

4. Page Rotation: Build a program that rotates the pages of a PDF file by a specified angle (e.g., 90 degrees clockwise).

5. Split PDFs: Write a script that splits a large PDF file into individual pages or smaller PDF documents.

6. Password Protection: Create a program that encrypts a PDF file with a password to restrict unauthorized access.

7. Metadata Extraction: Develop a script to extract metadata (e.g., author, creation date) from a PDF file and display it.

8. Image Extraction: Write a Python script to extract images from a PDF file and save them in a separate folder.

9. Text Search: Build a program that searches for a specific keyword or phrase within a PDF document and highlights the matches.

These exercises will provide hands-on experience in various aspects of working with PDF files using Python.

Conclusion

In conclusion, working with PDF files using Python provides a versatile and powerful way to manipulate and extract information from this widely used document format. Throughout this tutorial, we explored various aspects of PDF file handling in Python, including reading text, extracting metadata and images, rotating, splitting, merging, watermarking, generating, searching, encrypting, and decrypting PDFs.

Whether you need to automate repetitive tasks, extract data from PDF forms, or integrate PDF processing into your Python applications, the techniques discussed here provide a solid foundation to get started. 

As technology continues to evolve, PDF files remain a prevalent and essential file format in various industries. Being able to handle and manipulate PDFs programmatically using Python provides a valuable skill set for professionals and opens up opportunities for automation, data extraction, and content customization.

Subhankar Rakshit

Hey there! I’m Subhankar Rakshit, the brains behind PySeek. I’m a Post Graduate in Computer Science. PySeek is where I channel my love for Python programming and share it with the world through engaging and informative blogs.
