Extracting Text from a PDF using Python

I was recently asked by a dear friend to extract some text from a PDF.

I was sure that Python could do it, and the pdfminer.six library turned out to be all I needed. Install it with pip:

pip install pdfminer.six

This is the code:

import sys
from pdfminer.high_level import extract_text

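def extract_pdf_text(pdf_path):
    """Return all the text pdfminer.six can extract from the PDF at pdf_path."""
    return extract_text(pdf_path)

# Expect the PDF path as the only command-line argument
if len(sys.argv) < 2:
    print("Usage: python pdf_to_text.py <path_to_pdf>")
    sys.exit(1)

pdf_path = sys.argv[1]
text_content = extract_pdf_text(pdf_path)

# Write as UTF-8 so non-ASCII characters from the PDF are saved intact
with open('output.txt', 'w', encoding='utf-8') as file:
    file.write(text_content)

print("Text extracted successfully! Check output.txt for your content.")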

I saved the script as pdf_to_text.py, put the PDF in the same directory, and ran:

python pdf_to_text.py your_pdf.pdf

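Reading down the code, you can see it’s only a few lines, and the library does the real work. I set encoding='utf-8' on the output file because PDFs often contain non-ASCII characters (curly quotes, ligatures, accented letters) that the platform's default encoding may not be able to write.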
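To show why the encoding matters, here is a small sketch that uses only the standard library, so it runs without a PDF. The sample string is made up to stand in for extracted text, and save_text and load_text are hypothetical helper names, not part of pdfminer.six.

```python
import os
import tempfile

def save_text(text, out_path):
    """Write text to out_path as UTF-8 (hypothetical helper for this sketch)."""
    with open(out_path, "w", encoding="utf-8") as file:
        file.write(text)

def load_text(path):
    """Read the file back as UTF-8 (hypothetical helper for this sketch)."""
    with open(path, "r", encoding="utf-8") as file:
        return file.read()

# Made-up sample standing in for extracted PDF text:
# curly quotes, an "fi" ligature, and a bullet character
sample = "\u201cQuoted\u201d \ufb01le \u2022 item"

# A narrow encoding such as ASCII cannot represent these characters
try:
    sample.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 round-trips the text without losing anything
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "output.txt")
    save_text(sample, out_path)
    round_trip_ok = load_text(out_path) == sample

print("ASCII can encode the sample:", ascii_ok)
print("UTF-8 round trip matches:", round_trip_ok)
```

If the output file is opened without an explicit encoding, Python falls back to a platform-dependent default, which on some systems cannot represent characters like these.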

Enjoy!
