This post covers basic PDF manipulation for daily tasks using simple Python modules.
- Merging multiple PDFs
- Extract text from PDF
- Extract image from PDF
Merging PDF
    from PyPDF2 import PdfFileMerger  # renamed PdfMerger in newer PyPDF2 releases

    pdfs = ['a.pdf', 'b.pdf']

    merger = PdfFileMerger()
    for pdf in pdfs:
        # Files are appended in list order
        merger.append(pdf)

    merger.write("output.pdf")
    merger.close()
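For a daily task you often want to merge every PDF in a folder rather than a hand-written list. The following is a minimal sketch of that idea, not part of the original recipe: `list_pdfs` and `merge_pdfs` are hypothetical helper names, and the import fallback assumes that either `PdfMerger` or the older `PdfFileMerger` is available in your installed PyPDF2.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files in `folder`, sorted by file name (case-insensitive)."""
    return sorted(
        (p for p in Path(folder).iterdir()
         if p.is_file() and p.suffix.lower() == ".pdf"),
        key=lambda p: p.name.lower(),
    )


def merge_pdfs(paths, output_path):
    """Append each PDF in `paths` to a single output file, in the given order."""
    # Imported here so list_pdfs() can be used without PyPDF2 installed.
    try:
        from PyPDF2 import PdfMerger  # newer PyPDF2 releases
    except ImportError:
        from PyPDF2 import PdfFileMerger as PdfMerger  # older releases

    merger = PdfMerger()
    for path in paths:
        merger.append(str(path))
    merger.write(str(output_path))
    merger.close()
```

Usage would look like `merge_pdfs(list_pdfs("invoices"), "output.pdf")`, where `invoices` is a placeholder folder name. Sorting by file name keeps the merge order predictable, so naming files `01_...pdf`, `02_...pdf` is one way to control the result.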
Extract text from PDF
    import pdftotext

    # Load your PDF
    with open("Target.pdf", "rb") as f:
        pdf = pdftotext.PDF(f)

    # Save all text to a txt file.
    with open('output.txt', 'w') as f:
        f.write("\n\n".join(pdf))
More information from “Convert PDF pages to text with python”
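Because a `pdftotext.PDF` object can be iterated page by page (the `join` call above relies on that), the same text can also be searched instead of saved. Here is a rough sketch that reports which pages mention a keyword. The function names `pages_containing` and `search_pdf` are made up for illustration and are not part of pdftotext.

```python
def pages_containing(pages, keyword):
    """Return 1-based page numbers whose text contains `keyword` (case-insensitive)."""
    needle = keyword.lower()
    return [number for number, text in enumerate(pages, start=1)
            if needle in text.lower()]


def search_pdf(pdf_path, keyword):
    """Load `pdf_path` with pdftotext and list the pages that mention `keyword`."""
    # Imported here so pages_containing() can be used without pdftotext installed.
    import pdftotext

    with open(pdf_path, "rb") as f:
        pdf = pdftotext.PDF(f)  # iterating yields one string per page
        return pages_containing(pdf, keyword)
```

For example, `search_pdf("Target.pdf", "invoice")` would return a list of page numbers. `pages_containing` accepts any iterable of strings, so it can be tried on a plain list before pointing it at a real PDF.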
Extract Image (JPEG) from PDF
    import os
    import tempfile
    from pdf2image import convert_from_path

    filename = 'target.pdf'
    save_dir = 'your_saved_dir'
    os.makedirs(save_dir, exist_ok=True)

    with tempfile.TemporaryDirectory() as path:
        # Page numbers are 1-based; this converts only the first page.
        images_from_path = convert_from_path(filename, output_folder=path, first_page=1, last_page=1)

        base_filename = os.path.splitext(os.path.basename(filename))[0] + '.jpg'

        # Save while the temporary directory still exists.
        for page in images_from_path:
            page.save(os.path.join(save_dir, base_filename), 'JPEG')
More information from “Convert PDF pages to JPEG with python”
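The snippet above saves a single page under one file name. To convert every page, each image needs its own name, otherwise later pages overwrite earlier ones. The sketch below shows one way to do that. `page_jpeg_path` and `save_pages_as_jpeg` are hypothetical helpers, and the `<name>_page<N>.jpg` naming scheme is only an assumption.

```python
import os
import tempfile


def page_jpeg_path(pdf_path, save_dir, page_number):
    """Build an output path such as <save_dir>/<pdf name>_page<N>.jpg."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(save_dir, f"{stem}_page{page_number}.jpg")


def save_pages_as_jpeg(pdf_path, save_dir):
    """Convert every page of `pdf_path` to a JPEG file in `save_dir`."""
    # Imported here so page_jpeg_path() can be used without pdf2image installed.
    from pdf2image import convert_from_path

    os.makedirs(save_dir, exist_ok=True)
    saved = []
    with tempfile.TemporaryDirectory() as tmp:
        images = convert_from_path(pdf_path, output_folder=tmp)
        # Save inside the block, while the temporary files still exist.
        for number, page in enumerate(images, start=1):
            out_path = page_jpeg_path(pdf_path, save_dir, number)
            page.save(out_path, "JPEG")
            saved.append(out_path)
    return saved
```

Calling `save_pages_as_jpeg("target.pdf", "your_saved_dir")` would return the list of files written, one per page.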
Installing pdftotext on Windows is not straightforward. This blog post gives detailed step-by-step instructions that worked for me on Windows 10 x64, and it would be worth linking from this post: https://coder.haus/2019/09/27/installing-pdftotext-through-pip-on-windows-10/