Month: July 2019

Easy Web Scraping with Google Sheets

Google sheets simplify the process of web scraping especially for table and list elements. For below project, the purpose is to obtain common/essential words and their corresponding definitions for GMAT/GRE preparations.

Below are examples of each.

Table type extraction (source)

In one of the cells, type in =IMPORTHTML(“url-site“,“table”,<table_id>) where <table_id> is the table position in the url (either guess or iterate from 1 to XXX etc or use chrome developer tools to count the table num)

tabletypeexample

tabletypeexamplegooglesheet

List Type Extraction (source)

In one of the cells, type in =IMPORTHTML(“url-site“,“list”,<list_id>) where <list_id> is the list order in the url (either guess or iterate from 1 to XXX etc or use chrome developer tools to count the list num)

listtypeexamplegooglesheet

listtypeexamplegooglesheet1

The above techniques can also apply to other websites that have list or table elements. For this project, one of the next step is to create flash cards video to help in the learning. With the table format in google sheets, it is easy to download the whole list or table as .CSV file and create in the form of flash cards. Check the link for the quick project.

Create own flash cards video using Python

Build your own study flash cards video (+ background music) using Python easily.

Required Modules

moviepy
ImageMagick — for creating text clip
pandas — optional for managing CSV file

Basic steps

Read in the text information. Pandas can be used to read in a .csv file for table manipulation.
create a Textclip object for each text and append all Textclips together
Add in an audio if desired. Allow the audio to loop through duration of the clip
Save the file as mp4.

Sample Python Project — Vocabulary flash cards

Below is a simple project to create a vocabulary list of common words use in GMAT etc. For each word and meaning pair, it will flash the word followed by its meaning . There is slight pause in the timing to allow some time for the user to recall on the meaning for the particular words

Sample table for wordlist.csv (which essentially is a table of words and their respective meanings) * random sample (subset) obtained from web

Screen Shot 2019-07-23 at 11.32.42 PM


def create_txtclip(tgt_txt, duration = 2, fontsize = 18):
    try:
        txt_clip = TextClip(tgt_txt, fontsize = fontsize, color = 'black',bg_color='white', size=(426,240)).set_duration(duration)
        clip_list.append(txt_clip)
    except UnicodeEncodeError:
        txt_clip = TextClip("Issue with text", fontsize = fontsize, color = 'white').set_duration(2)
        clip_list.append(txt_clip)

from moviepy.editor import *

df = pd.read_csv("wordlist.csv")
for word, meaning in zip(df.iloc[:,0], df.iloc[:,1]):
    create_txtclip(word,1, 70)
    create_txtclip(meaning,3)

final_clip = concatenate(clip_list, method = "compose")

# optional music background with loop
music = AudioFileClip("your_audiofile.mp3")
audio = afx.audio_loop( music, duration=final_clip.duration)

final_clip = final_clip.set_audio(audio)

final_clip.write_videofile("flash_cards.mp4", fps = 24, codec = 'mpeg4')<span id="mce_SELREST_start" style="overflow:hidden;line-height:0;"></span>

In some cases, the audio for the flash cards does not work when play with Quicktime, will work on VLC

Sample video (converted to gif)

ezgif.com-video-to-gif

PDF manipulation with Python

This post covers basic PDF manipulation for daily tasks using simple Python modules.

Merging mulitple PDF
Extract text from PDF
Extract image from PDF

Merging PDF

from PyPDF2 import PdfFileMerger
pdfs = ['a.pdf', b.pdf]
merger = PdfFileMerger()

for pdf in pdfs:
    merger.append(pdf)

merger.write("output.pdf")

Extract text from PDF

import pdftotext

# Load your PDF
with open("Target.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Save all text to a txt file.
with open('output.txt', 'w') as f:
    f.write("\n\n".join(pdf))

More information from “Convert PDF pages to text with python”

Extract Image (JPEG) from PDF

import os
import tempfile
from pdf2image import convert_from_path

filename = 'target.pdf'

with tempfile.TemporaryDirectory() as path:
     images_from_path = convert_from_path(filename, output_folder=path, last_page=1, first_page =0)

base_filename  =  os.path.splitext(os.path.basename(filename))[0] + '.jpg'      

save_dir = 'your_saved_dir'

for page in images_from_path:
    page.save(os.path.join(save_dir, base_filename), 'JPEG')

More information from “Convert PDF pages to JPEG with python“