I was recently asked by a dear friend to extract some text from a PDF.
I was sure Python could do it with the pdfminer.six library, which installs with pip:
pip install pdfminer.six
This is the code:
import sys
from pdfminer.high_level import extract_text


def extract_pdf_text(pdf_path):
    # extract_text reads the whole PDF and returns its text as one string
    text = extract_text(pdf_path)
    return text


# Expect the PDF path as the first command-line argument
if len(sys.argv) < 2:
    print("Usage: python pdf_to_text.py <path_to_pdf>")
    sys.exit(1)

pdf_path = sys.argv[1]
text_content = extract_pdf_text(pdf_path)

# Write as UTF-8 so unusual characters from the PDF are saved correctly
with open('output.txt', 'w', encoding='utf-8') as file:
    file.write(text_content)

print("Text extracted successfully! Check output.txt for your content.")
I save the script as pdf_to_text.py, put the PDF in the same directory, and type:
python pdf_to_text.py your_pdf.pdf
Reading down the code, you can see it’s only a few lines; the library does all the work. I added the encoding
argument when opening the output file because PDFs sometimes contain unusual characters.
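To show why the encoding argument matters, here is a small sketch that needs only the standard library. The sample string and the helper names (save_text, load_text, ascii_would_fail) are my own assumptions for illustration, not part of pdfminer.six; the string stands in for the curly quotes, ligatures, and bullets that often come out of a PDF. Without encoding='utf-8', Python may fall back to the platform's default encoding, which on some systems cannot represent these characters and raises a UnicodeEncodeError.

```python
import tempfile
from pathlib import Path

# Hypothetical sample standing in for text extracted from a PDF:
# curly quotes, an "fi" ligature, and a bullet are all common in PDFs.
SAMPLE_TEXT = "\u201cQuoted\u201d text with a \ufb01 ligature and a \u2022 bullet"


def save_text(text, out_path):
    """Write text as UTF-8, which can encode every Unicode character."""
    with open(out_path, "w", encoding="utf-8") as file:
        file.write(text)


def load_text(out_path):
    """Read the file back using the same encoding it was written with."""
    with open(out_path, "r", encoding="utf-8") as file:
        return file.read()


def ascii_would_fail(text):
    """Return True if a narrower codec (ASCII here) cannot encode the text."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return True
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        out_path = Path(tmp_dir) / "output.txt"
        save_text(SAMPLE_TEXT, out_path)
        # The text survives the round trip unchanged
        print(load_text(out_path) == SAMPLE_TEXT)
        # The same text could not be written with a narrower encoding
        print(ascii_would_fail(SAMPLE_TEXT))
```

ASCII is used here only as an easy-to-reproduce example of a narrower encoding; the default on your machine may be a different one.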
Enjoy!