YouTube videos download using Python (Part 2)

A continuation from the “Search and download YouTube videos using Python” post with more features added.

The initial project only allowed searching for playlists on YouTube and downloading the videos from all the playlists found. The project has been expanded with the following features:

Multiple playlist searches can be entered at one go (key in all the search phrases in a text file), and all the videos found for those search phrases are downloaded automatically. Playlist search is recommended for searches such as song playlists or online courses (e.g. “Top favorite English songs/Most popular English songs”, “Machine learning Coursera”).
Non-playlist search (normal video search): both single and multiple searches can be performed. This suits normal video searches or general topics that are less likely to be in a playlist (e.g. “Python Machine learning”).
Single video download (uses the Pafy module directly). The user just needs to input the video link.
Multiple options: users can limit the number of downloads, apply filters such as popularity and a video length limit, and download in video or audio format.
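For the multiple-search feature, the phrases in the text file could be read roughly as follows. This is only a minimal sketch, not the project's actual code: the function name read_search_phrases, the one-phrase-per-line layout and the '#' comment convention are all assumptions made for illustration.

```python
import io


def read_search_phrases(path):
    """Return the search phrases found in a text file, in file order.

    Assumes one phrase per line. Blank lines and lines starting with '#'
    are skipped, and repeated phrases are kept only once (hypothetical
    conventions, not taken from the project).
    """
    phrases = []
    with io.open(path, encoding='utf-8') as f:
        for line in f:
            phrase = line.strip()
            if not phrase or phrase.startswith('#'):
                continue
            if phrase not in phrases:  # keep the first occurrence only
                phrases.append(phrase)
    return phrases
```

Each phrase in the returned list could then be searched and downloaded in turn.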

The script makes use of the Python Pattern module for URL requests and DOM object processing. For the actual downloading of videos, it uses Pafy. Pafy is a very comprehensive Python module that allows downloads in both video and audio formats. Pafy has other features that are not used in this module.

The full script can be found on GitHub.

Search and download youtube videos using Python

The following Python module allows users to search YouTube videos and download all the videos from the different playlists found within the search. Currently, it is able to search for playlists or collections of videos and download individual videos from each of the playlists.

For example, searching for “Top English KTV” will scan all the song playlists found in the search results and collect the web link of each individual song from each playlist so that it can be downloaded locally. Users can choose to download in either video or audio format.

The following is the main flow of the script.

Form the YouTube search URL from the prefix “https://www.youtube.com/results?search_query=” and the search keyword.
Based on the above URL, scrape and get all the URLs that link to a playlist. The XPath for the playlist element can easily be obtained with any web browser's developer tools by inspecting the element and retrieving its XPath. The playlist URLs can be obtained using the Pattern DOM object: dom_object('div ul li a[class="yt-uix-sessionlink"]').
Filter the list of extracted links to keep only the URLs starting with “/playlist?”. A typical playlist URL looks something like the one below:
- https://www.youtube.com/playlist?list=PLO9BZlXiK-j-6MElJgnYvPaVyESLGaz-G
From the list of playlists, scrape each individual playlist page to retrieve the URL of each individual video. The playlist elements can be retrieved using the Pattern DOM object: dom_object('div ul li a[class="yt-uix-sessionlink"]').
Download each individual video/audio to the local computer by passing the video URL to the Pafy module.
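The first and third steps above need only the standard library and can be sketched as below. This is an illustration under assumptions rather than the script's real code: the function names are hypothetical, and the hrefs are assumed to have already been extracted from the search page as plain strings.

```python
try:
    from urllib.parse import quote_plus  # Python 3
except ImportError:
    from urllib import quote_plus  # Python 2

SEARCH_PREFIX = 'https://www.youtube.com/results?search_query='
YOUTUBE_ROOT = 'https://www.youtube.com'


def form_search_url(search_key):
    """Step 1: append the URL-encoded keyword to the search prefix."""
    return SEARCH_PREFIX + quote_plus(search_key)


def filter_playlist_urls(hrefs, limit=None):
    """Step 3: keep only the hrefs that point to a playlist.

    Relative links such as '/playlist?list=...' are turned into absolute
    URLs; duplicates are dropped and the original order is preserved.
    """
    playlist_urls = []
    for href in hrefs:
        if not href.startswith('/playlist?'):
            continue
        url = YOUTUBE_ROOT + href
        if url not in playlist_urls:
            playlist_urls.append(url)
    if limit is not None:
        playlist_urls = playlist_urls[:limit]  # cap the number of playlists
    return playlist_urls
```

The limit argument plays a role similar to set_num_playlist_to_extract in the sample code, although how the real script applies that limit is not shown here.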

Below is the sample code to download a series of videos.


from __future__ import print_function  # lets the print() calls below run on Python 2

from youtube_search_and_download import YouTubeHandler

search_key = 'chinese top ktv'  # search keywords
yy = YouTubeHandler(search_key)
yy.download_as_audio = 1  # 1 - download as audio, 0 - download as video
yy.set_num_playlist_to_extract(5)  # number of playlists to download

print('Get all the playlist')
yy.get_playlist_url_list()
print(yy.playlist_url_list)

# Get the individual video links and titles from each playlist
yy.get_video_link_fr_all_playlist()
for key, value in yy.video_link_title_dict.items():
    print(key, '  ', value)
    print()
print()

print('download video')
yy.download_all_videos(dl_limit=200)  # maximum number of videos to download

This is the initial script. Some features are still work in progress, such as an option to download individual videos (instead of playlists) from the search page and support for multiple searches.

The full script can be found on GitHub.

Scraping Google results using Python (Part 3)

The post on testing the Google search script I created last week describes the script's limitations in scraping the required information. The search phrase was “best hotels to stay in Tokyo”. My objective was to find suitable and popular hotels in Tokyo within the budget limit.

The other limitation is that the script can only take in one key phrase at a time. This is not very useful, as users tend to search variations of a key phrase to get the desired results. I made some modifications to the script so that it can take in either a key phrase (str) or a list of key phrases (list) and search all of them at one go.

The script will now iterate over the search phrases. Below is the summarized flow:

For each key phrase in the key phrase list, generate the associated Google search URL and append it to a list.
For the list of Google search URLs, Scrapy scrapes each URL for the Google result links and appends all the links to an output file. There is one drawback: the links for the first key phrase are listed first, followed by those for the second key phrase.
For each of the links, Scrapy scrapes the content, namely the title, the meta description and, for now, if specified, all the text within the <p> tags.
The resulting file can be very big, depending on the number of search results.
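The first step of this flow, accepting either a single key phrase or a list of them, might look like the sketch below. The function name build_search_urls and the exact URL prefix are assumptions for illustration; the real script may build its URLs differently (for example, with extra query parameters).

```python
try:
    from urllib.parse import quote_plus  # Python 3
except ImportError:
    from urllib import quote_plus  # Python 2

try:
    string_types = (basestring,)  # Python 2: covers str and unicode
except NameError:
    string_types = (str,)  # Python 3

GOOGLE_SEARCH_PREFIX = 'https://www.google.com/search?q='  # assumed URL form


def build_search_urls(key_phrases):
    """Return one Google search URL per key phrase.

    A single string is treated as a list holding one key phrase, so
    callers can pass either a str or a list.
    """
    if isinstance(key_phrases, string_types):
        key_phrases = [key_phrases]
    return [GOOGLE_SEARCH_PREFIX + quote_plus(phrase) for phrase in key_phrases]
```

Normalizing the input to a list up front keeps the rest of the flow identical for the single-phrase and multi-phrase cases.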

The format of the output is still not satisfactory. Also, printing all the <p> tag text does not accomplish much in summarizing what I need.

The next step, hopefully, is to use NLTK and some summarization tools to help filter the results.

The current script is on GitHub.

Getting Google Search results with Python (testing the program)

I was testing out the Google search script I created last week. I was searching for the “best hotels to stay in Tokyo”. My objective was to find suitable and popular hotels in Tokyo within the budget limit.

The Python module was created with the intention of displaying more meaningful and relevant data without clicking through to individual websites. However, the meta title and meta description from the search results alone are not really useful for obtaining meaningful results.

I tried modifying the module to extract the paragraphs from each site and output them together with the meta descriptions. I made some changes to the script to handle multiple newline characters and to debug the Unicode error that kept popping up when outputting the text results.

To extract the paragraphs from each site, I used the XPath expression below.

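from scrapy.selector import Selector

sel = Selector(response)  # response: the page passed to the Scrapy spider callback
paragraph_list = sel.xpath('//p/text()').extract()  # text of every <p> element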

To handle the Unicode error, the following changes were made. The Stack Overflow link provides the solution to the problem.

## Convert the paragraph list to one continuous string
para_str = self.join_list_of_str(paragraph_list, joined_chars='..')
## Replace any characters that cannot be encoded with ?
para_str = para_str.encode(errors='replace')
## Remove newline characters
para_str = self.remove_whitespace_fr_raw(para_str)
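The two helper methods used above are not shown in this post. As a rough idea of what such a clean-up chain does, here is a standalone sketch using only the standard library. The function names are hypothetical stand-ins, not the script's own helpers, and the ASCII target encoding is an assumption (it matches the default encoding on Python 2).

```python
import re


def join_paragraphs(paragraphs, joined_chars='..'):
    """Join the non-empty paragraph strings into one continuous string."""
    cleaned = [p.strip() for p in paragraphs]
    return joined_chars.join(p for p in cleaned if p)


def to_ascii(text):
    """Replace every character that is not ASCII with '?'."""
    return text.encode('ascii', errors='replace').decode('ascii')


def collapse_whitespace(text):
    """Turn each run of spaces, tabs and newlines into a single space."""
    return re.sub(r'\s+', ' ', text).strip()


def clean_paragraphs(paragraphs):
    """Apply the three steps in the same order as the snippet above."""
    return collapse_whitespace(to_ascii(join_paragraphs(paragraphs)))
```

Decoding back to text after the encode step keeps the result a normal string on Python 3, where encode returns bytes.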

With the paragraphs displayed in the output, I was basically reading large chunks of text, and it was certainly messy with the newlines removed. I could not really get good information out of it.

For example, it would be better to get the ranked hotels from the TripAdvisor site, but in the output of the Google search module, TripAdvisor only shows its top page without any hotels listed. Below is the output I got from the TripAdvisor site for this search.

Tokyo Hotels: Check Out 653 hotels with 77,018 Reviews – TripAdvisor
http://www.tripadvisor.com.sg/Hotels-g298184-Tokyo_Tokyo_Prefecture_Kanto-Hotels.html

Tokyo Hotels: Find 77,018 traveller reviews and 2,802 candid photos for 653 hotels in Tokyo, Japan on TripAdvisor.

Price per night..Property type..Neighbourhood..Traveller rating..Hotel class..Amenities..Property name..Hotel brand

Performing recursive crawling on TripAdvisor itself would perhaps achieve more meaningful results.

Currently, I do not have much of an idea of how to enhance the script to extract more meaningful data. Perhaps the next step would be to use text processing, with the NLTK module, to summarize the paragraphs into meaningful data. However, I am not hopeful about the final results.

For this particular search query, perhaps it would be easier to write specific crawling methods for several target websites such as TripAdvisor, Agoda, etc., rather than rely on a general extraction of text.
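One possible shape for such site-specific handling is a small lookup that routes each result link to a dedicated parser and falls back to the general text extraction otherwise. Everything in this sketch is hypothetical: the parser functions are empty placeholders, and matching on a label of the hostname is just one simple way to recognize a site across country domains.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def parse_generic(url):
    """Placeholder for the general <p> text extraction."""
    return 'generic'


def parse_tripadvisor(url):
    """Placeholder for a TripAdvisor-specific crawler."""
    return 'tripadvisor'


def parse_agoda(url):
    """Placeholder for an Agoda-specific crawler."""
    return 'agoda'


# Hostname label -> parser (hypothetical registry)
SITE_PARSERS = {
    'tripadvisor': parse_tripadvisor,
    'agoda': parse_agoda,
}


def pick_parser(url):
    """Return the site-specific parser for a URL, or the generic one."""
    hostname = urlparse(url).hostname or ''
    labels = hostname.lower().split('.')  # e.g. ['www', 'tripadvisor', 'com', 'sg']
    for site, parser in SITE_PARSERS.items():
        if site in labels:
            return parser
    return parse_generic
```

Matching on a hostname label rather than the full domain lets the TripAdvisor link from the output above (a .com.sg address) reach the same parser as a .com address would.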