I was recently asked by a dear friend to extract some text from a PDF.
I was sure that Python could do it via the pdfminer.six library.
pip install pdfminer.six
This is the code:
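import sys

from pdfminer.high_level import extract_text


def extract_pdf_text(pdf_path):
    # Let pdfminer.six do the work: it returns the text of the PDF as one string
    return extract_text(pdf_path)


if len(sys.argv) < 2:
    print("Usage: python pdf_to_text.py <path_to_pdf>")
    sys.exit(1)

# sys.argv[0] is the script name; the PDF path is the first argument
pdf_path = sys.argv[1]
text_content = extract_pdf_text(pdf_path)

# Write as UTF-8 so unusual characters in the PDF don't break the output
with open('output.txt', 'w', encoding='utf-8') as file:
    file.write(text_content)

print("Text extracted successfully! Check output.txt for your content.")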
I just put the PDF in the same directory as the script and type:
python pdf_to_text.py your_pdf.pdf
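Reading down the code, you can see it is only a few lines and lets the library do the work. I added the
encoding='utf-8' argument for the output file because PDFs sometimes contain characters (ligatures, curly quotes, symbols) that the platform's default encoding may not be able to write.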
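To see why the encoding argument matters, here is a minimal sketch using only the standard library. The sample string and the helper names can_encode and write_and_read_back are made up for illustration and are not part of pdfminer.six. It checks whether some typical "strange" characters fit into cp1252 (a common default on Windows) versus UTF-8, then writes and reads the text back as UTF-8.

```python
import tempfile
from pathlib import Path

# Hypothetical sample with characters that often turn up in extracted PDF text:
# an "fi" ligature (U+FB01), a bullet, and curly quotes.
sample = "The \ufb01nal report \u2022 \u201cdraft\u201d"


def can_encode(text, encoding):
    """Return True if every character in text fits the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def write_and_read_back(text, path):
    """Write text as UTF-8 and read it back the same way."""
    path = Path(path)
    path.write_text(text, encoding="utf-8")
    return path.read_text(encoding="utf-8")


if __name__ == "__main__":
    print("cp1252 can hold it:", can_encode(sample, "cp1252"))
    print("utf-8 can hold it:", can_encode(sample, "utf-8"))
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "output.txt"
        print("round trip ok:", write_and_read_back(sample, out) == sample)
```

The ligature is the character that cp1252 cannot represent, so writing this text without an explicit encoding could fail on a machine whose default is cp1252. With UTF-8 the text survives the round trip unchanged.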