
PDF manipulation with Python

This post covers basic PDF manipulation for daily tasks using simple Python modules.

Merging multiple PDFs
Extract text from PDF
Extract image from PDF

Merging PDF

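from PyPDF2 import PdfFileMerger  # available in PyPDF2 releases before 3.0

# Input files, in the order they should appear in the merged document
pdfs = ['a.pdf', 'b.pdf']
merger = PdfFileMerger()

for pdf in pdfs:
    merger.append(pdf)

merger.write("output.pdf")
merger.close()  # release the input file handles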
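When the files to merge all sit in one folder, the list of inputs can be built instead of typed by hand. The following is a minimal sketch using only the standard library; the helper name collect_pdfs is an assumption made for illustration and is not part of PyPDF2.

```python
from pathlib import Path


def collect_pdfs(directory):
    """Return the paths of the PDF files in `directory`, sorted by path."""
    folder = Path(directory)
    # Compare the extension case-insensitively so 'SCAN.PDF' is picked up too.
    return sorted(
        str(p) for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

The result can be assigned to pdfs in the merge loop. If the merged file is written into the same folder, it will be picked up as an input on the next run, so writing it somewhere else is the safer choice.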

Extract text from PDF

import pdftotext

# Load your PDF
with open("Target.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Save all text to a txt file.
with open('output.txt', 'w') as f:
    f.write("\n\n".join(pdf))
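The pdf object above yields one string per page, so the pages can also be written to separate files instead of one combined file. A small sketch, assuming a hypothetical helper write_pages and an output folder of your choosing; it accepts any iterable of strings, so it does not depend on pdftotext itself.

```python
import os


def write_pages(pages, out_dir, prefix="page"):
    """Write each page's text to its own numbered .txt file; return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for number, text in enumerate(pages, start=1):
        # Zero-padded numbers keep the files in page order when sorted by name.
        path = os.path.join(out_dir, "{}_{:03d}.txt".format(prefix, number))
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        paths.append(path)
    return paths
```

With the pdf object loaded as shown earlier, write_pages(pdf, "pages") would produce pages/page_001.txt, pages/page_002.txt and so on.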

More information from “Convert PDF pages to text with python”

Extract Image (JPEG) from PDF

import os
import tempfile
from pdf2image import convert_from_path

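filename = 'target.pdf'
save_dir = 'your_saved_dir'

# Name the output after the PDF, e.g. target.pdf -> target.jpg
base_filename = os.path.splitext(os.path.basename(filename))[0] + '.jpg'

with tempfile.TemporaryDirectory() as path:
    # Page numbers start at 1; this renders only the first page
    images_from_path = convert_from_path(filename, output_folder=path, first_page=1, last_page=1)

    # Save inside the with block, while the temporary files still exist
    for page in images_from_path:
        page.save(os.path.join(save_dir, base_filename), 'JPEG')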
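The loop above writes every page to the same file name, which is fine for a single page but means later pages would overwrite earlier ones if more than one page is converted. One way around this is to put the page number in the name. A sketch, where page_filename is a hypothetical helper and not part of pdf2image:

```python
import os


def page_filename(pdf_path, page_number, save_dir):
    """Build a path such as save_dir/target_page1.jpg for one page of a PDF."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(save_dir, "{}_page{}.jpg".format(stem, page_number))


# Possible use inside the save loop (names taken from the snippet above):
# for number, page in enumerate(images_from_path, start=1):
#     page.save(page_filename(filename, number, save_dir), 'JPEG')
```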

More information from “Convert PDF pages to JPEG with python”

How to Install Scrapy in Windows

scraper24x7


It took me a long time to install Scrapy on my Windows PC. I tried the official Scrapy installation guide and several YouTube tutorials, and always ended up with errors. For weeks I installed and uninstalled components and kept getting different errors. Finally, after a lot of research, I installed Scrapy successfully. This is how I did it.

Step 1: Install Python 2.7

You can download Python 2.7 from here. Make sure you download and install Python 2.7, because Scrapy does not support Python 3 at the time of writing, although the Scrapy team is working on Python 3 compatibility. If you have already installed Python 3, uninstall it before installing Python 2.7.


Now you need to add C:\Python27 and C:\Python27\Scripts to your Path environment variable. To do this, open a command prompt, type the following, and press Enter:

c:\python27\python.exe c:\python27\tools\scripts\win_add2path.py
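To confirm the change took effect, open a new command prompt and check that both directories appear in the Path value. The following is only a sketch: missing_from_path is a hypothetical helper, and the directories are the default Python 2.7 locations named above. It is written to run on Python 2.7 as well as Python 3.

```python
from __future__ import print_function
import os


def missing_from_path(required, path_value, sep=os.pathsep):
    """Return the entries of `required` that are absent from a PATH-style string."""
    def normalise(p):
        # Windows paths are case-insensitive and may carry a trailing slash.
        return p.strip().rstrip("\\/").lower()

    present = {normalise(p) for p in path_value.split(sep) if p.strip()}
    return [p for p in required if normalise(p) not in present]


# Possible use on Windows; an empty list means both directories were found:
# wanted = [r"C:\Python27", r"C:\Python27\Scripts"]
# print(missing_from_path(wanted, os.environ.get("PATH", "")))
```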

To check whether Python has been installed properly, go to…


Searching for GitHub projects to contribute to?

For those who are looking for GitHub projects to contribute to:

LFPR: Looking for Pull Requests

Hotel Reviews Scraping

A Python module for scraping hotel ratings and reviews from TripAdvisor and Orbitz. The documentation seems to suggest it only scrapes US hotels. It is worth studying to see how it could be applied to other countries.