Working with PDFs in Python
Recently I tried out PDF manipulation in my code to extract data for a project. PDF is an unorthodox format for data storage and manipulation compared with the more traditional Excel and CSV files, so working with it in Python comes with a few hoops to jump through. In this walkthrough, I’ll guide you through doing exactly that using a helper package called PyPDF2.
Let’s get started.
We’ll begin by setting up our environment: create a folder, then a virtual environment inside it. You can choose virtualenv or pipenv for the virtual environment; I’ll be using pipenv.
Creating the virtual environment in the working directory:
mkdir pypdfdir
cd pypdfdir
pipenv shell
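After creating the virtual environment, install the package that will be central to our tutorial. The class and method names used below (PdfFileReader, numPages, getPage, extractText and so on) come from PyPDF2 releases before 3.0; version 3.0 removed them in favour of new names, so pin an older release to run the code as written.
pip install "PyPDF2<3.0"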
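Since the PdfFileReader-style names used in this walkthrough only exist in releases before 3.0, it can be handy to confirm which release pip actually installed. Here is a minimal sketch using only the standard library; the two helper names are my own and not part of PyPDF2.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_pypdf2_version():
    """Return the installed PyPDF2 version string, or None if it is not installed."""
    try:
        return version("PyPDF2")
    except PackageNotFoundError:
        return None


def has_legacy_names(version_string):
    """Return True for releases before 3.0, which still provide PdfFileReader."""
    major = int(version_string.split(".")[0])
    return major < 3


if __name__ == "__main__":
    found = installed_pypdf2_version()
    if found is None:
        print("PyPDF2 is not installed in this environment")
    else:
        print(f"PyPDF2 {found}; legacy names available: {has_legacy_names(found)}")
```

Run it from inside the virtual environment so that it inspects the same packages your script will use.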
Using PyPDF2 in Read Mode
Once the package has been installed, we can begin. I have two PDF files in my root folder that I’ll be using for demonstration. The first is a PDF of lorem ipsum dummy text and the second has dummy data in a table structure.
Let’s create a file where our code will reside; I’ll call mine pdf.py. While in the directory, run the command below to create the file.
touch pdf.py
You’ll start off by using the PdfFileReader class, which gives you read access to your PDF files. Add the code below to your Python script.
import PyPDF2

def pdf_read_mode():
    reader = PyPDF2.PdfFileReader('lorem.pdf')
    # Get the number of pages in the pdf
    print(f"Number of pages are {reader.numPages}")

if __name__ == '__main__':
    pdf_read_mode()
You first import the package to gain access to its PdfFileReader class, then create an instance by passing in the name of a PDF file located at the same level as your script, and finally read the numPages attribute, which returns the number of pages in the PDF. You can now go to the terminal and run the file using the command:
python pdf.py
This should output a message with the number of pages in your PDF file:
Number of pages are 1
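If the script fails at this point instead, the usual suspect is a file name that does not point at a real PDF. A small pre-check can catch that before PyPDF2 is involved. This is only a sketch: the helper name is made up, and it assumes a well-formed PDF, which starts with the bytes %PDF-.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if path is an existing file whose first bytes are the PDF marker."""
    candidate = Path(path)
    if not candidate.is_file():
        return False
    with candidate.open("rb") as handle:
        # Well-formed PDF files begin with a header such as %PDF-1.4
        return handle.read(5) == b"%PDF-"
```

You could call looks_like_pdf('lorem.pdf') at the top of pdf_read_mode and print a friendlier message when it returns False.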
Let’s test out more of the PdfFileReader class. Add the code below to your Python script; we’ll keep using the reader instance:
import PyPDF2

def pdf_read_mode():
    # Open in binary mode; the with block closes the file once we are done
    with open('lorem.pdf', 'rb') as file:
        reader = PyPDF2.PdfFileReader(file)
        # Get the number of pages in the pdf
        print(f"Number of pages are {reader.numPages}")
        # Get metadata attached to a pdf file
        print(f"The pdf file has the following metadata: {reader.documentInfo}")
        # Get the page instance (we fetch the first page)
        page = reader.getPage(0)
        # We proceed to fetch the contents of the page by calling the extractText method
        print(page.extractText())

if __name__ == '__main__':
    pdf_read_mode()
In this code, you go on to extract the contents of a page. First, you get the page instance for the first page in your PDF file (pages are numbered from 0), then you call the extractText method which, as the name suggests, gets all the text on the page. You should see the text printed in your terminal.
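The same two calls extend naturally to a whole document. Below is a minimal sketch that walks every page of a reader and collects the text. The function name is my own; it only assumes the reader exposes numPages and getPage, and that each page has extractText, exactly as used above.

```python
def extract_all_text(reader):
    """Collect the text of every page from a PdfFileReader-like object."""
    texts = []
    for index in range(reader.numPages):
        page = reader.getPage(index)
        texts.append(page.extractText())
    return texts
```

Call it inside pdf_read_mode while the file is still open, for example print(extract_all_text(reader)). The result is a list with one string per page.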
You can proceed to explore some more functionality on your own.
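One thing worth exploring is the metadata object printed earlier. It behaves like a dictionary whose keys start with a slash (for example /Title or /Producer), and it can be missing altogether. The sketch below, with a helper name of my own choosing, turns it into a plain dict that is easier to work with; it assumes the values can sensibly be converted with str.

```python
def metadata_as_dict(info):
    """Convert a documentInfo object into a plain dict with slash-free keys."""
    if info is None:
        # Some PDFs carry no metadata at all
        return {}
    return {key.lstrip("/"): str(value) for key, value in info.items()}
```

With the reader from pdf_read_mode, metadata_as_dict(reader.documentInfo) gives you keys such as Title or Producer when the file defines them.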
Using PyPDF2 in Write Mode
We’ll move on from reading PDF files for now and venture briefly into writing them. You’ll be using the PdfFileWriter class of the package. Let’s get started: add the code below to your file.
def pdf_write_mode():
    # Keep the source file open until the writer has finished writing
    with open('lorem.pdf', 'rb') as file:
        reader = PyPDF2.PdfFileReader(file)
        page = reader.getPage(0)
        # The with block closes new.pdf once the content is written
        with open('new.pdf', 'wb') as newpdf:
            writer = PyPDF2.PdfFileWriter()
            writer.addPage(page)
            writer.write(newpdf)

if __name__ == '__main__':
    # pdf_read_mode()
    pdf_write_mode()
Write mode caveat…
Unfortunately, the package does not allow you to edit existing pdf files with new content from python.
It does, however, provide the ability to copy content from one PDF file to a new one, and that’s exactly what the code above does. Going line by line: first, we open the old file in rb (binary read) mode and fetch the only page that exists. Second, we create a new file by calling open in wb (binary write) mode with an arbitrary name, in this case new.pdf. We then create a new writer instance from the PdfFileWriter class and add the page instance fetched earlier. Finally, we use the writer to write the new page content to the newly created PDF file, which is then closed.
On the next run of the script, a new file, originally 😉 labelled new.pdf, should be created, and it should have the same content as the earlier file.
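Copying a single page generalises to copying several. Here is a minimal sketch of a helper (the name and the page_indexes parameter are my own) that adds either every page or a chosen subset to a writer, using the same numPages, getPage and addPage calls as above.

```python
def copy_pages(reader, writer, page_indexes=None):
    """Add the chosen pages of reader to writer; all pages when page_indexes is None."""
    if page_indexes is None:
        page_indexes = range(reader.numPages)
    for index in page_indexes:
        # Page indexes are zero-based, so 0 is the first page
        writer.addPage(reader.getPage(index))
    return writer
```

In pdf_write_mode, copy_pages(reader, writer) would stand in for the single addPage call, while copy_pages(reader, writer, [0]) keeps the original one-page behaviour.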
Merging two pdf files
We can also merge two or more PDF files into one. Add the following code to your script.
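def pdf_merger():
    with open('lorem.pdf', 'rb') as file, open('mock.pdf', 'rb') as file2:
        merger = PyPDF2.PdfFileMerger()
        # Insert the pages of mock.pdf at the start of the merged document
        merger.merge(position=0, fileobj=file2)
        # Insert the pages of lorem.pdf at page position 2
        merger.merge(position=2, fileobj=file)
        # Write the merged result; the with block closes final.pdf afterwards
        with open('final.pdf', 'wb') as final_pdf:
            merger.write(final_pdf)

if __name__ == '__main__':
    # pdf_read_mode()
    # pdf_write_mode()
    pdf_merger()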
In the code above we create a pdf_merger function, which holds the code to merge the two PDF files we have. You can proceed to run the file again. A new PDF file is created, and each input file is inserted at the page position you passed to merge; for example, merger.merge(position=0, fileobj=file2) places the pages of mock.pdf at the very start of the merged document.
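The position argument is easier to picture with plain lists. The sketch below is a model of the idea rather than PyPDF2 code: it assumes merge inserts pages the way a Python list slice does, and it assumes each file has a single page. The merge_at helper and the page labels are made up for illustration.

```python
def merge_at(pages, position, new_pages):
    """Return a new page list with new_pages inserted before index position."""
    merged = list(pages)
    # Slice assignment inserts without replacing anything; a position past
    # the end simply appends
    merged[position:position] = new_pages
    return merged


# Mirror the two merge calls from pdf_merger, assuming one page per file
document = merge_at([], 0, ["mock-1"])
document = merge_at(document, 2, ["lorem-1"])
print(document)  # ['mock-1', 'lorem-1']
```

Under those assumptions the mock page comes first and the lorem page lands after it, because position 2 is beyond the end of a one-page document.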
Conclusion
We’ll end the article there. It’s by no stretch of the imagination comprehensive coverage of what the PyPDF2 package can do, and I would advise going through its documentation for more details. Check out more of my writing here.
Happy programming.
Code used in the article can be found here.