web

Adding PostgreSQL to Django

Requirements

VirtualEnv
Django
PostgreSQL

Add on to post from Painless PostgreSQL + Django

The recommendation is to follow the steps from the original well-written post and refers to the following to fill in some of the possible gaps .

Activate a virtualenv
Git clone the project (in the post) to local directory
Run pip install -r requirements.txt
Upgrade Django version (will encounter error if this step is not performed). pip install django==1.11.17. This only applies if you following the post and cloning the project used in the post.
Create new user in Postgres, create new database & grant assess (Step 1 & 2 of post)
Update settings.py on the database portion.
Create environment variables in the virtualenv. See link for more information.
1. Note: Secret Key needs to be included as one of the environment variable.
2. Update the postactivate file of the virtualenv so the environment variables are present when virtualenv is activated.
3. To get path of the virtualenv: echo $VIRTUAL_ENV

Create new user in Postgres

# Psql codes for Step 1 and 2 of original post.
# ensure Postgres server is running
psql
# create user with password
CREATE USER sample_user WITH PASSWORD 'sample_password';
# create database
CREATE DATABASE sample_database WITH OWNER sample_user;

Update database information in Setting.py

# Changes in the settings.py

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': os.environ.get('DB_NAME', ''),
        'USER': os.environ.get('DB_USER', ''),
        'PASSWORD': os.environ.get('DB_PASS', ''),
        'HOST': 'localhost',
        'PORT': '5432',
    }
# SECURITY WARNING: keep the secret key used in production secret!
SECRET_KEY = os.environ.get('DJANGO_SECRET_KEY', '')

Update environment variables in VirtualEnv

# postactivate script in the project virtual env bin path.
# E.g. ~/.virtualenv/[projectname]/bin/postactivate

#!/bin/bash
# This hook is sourced after this virtualenv is activated.
export DB_NAME='sample_database'
export DB_USER='sample_user'
export DB_PASS='sample_password'
export DJANGO_SECRET_KEY='thisissecretkey'

Running migrations (Ensure PostgreSQL server is running)

python manage.py makemigrations
python manage.py migrate
python manage.py createsuperuser
python manage.py runserver

Additional notes:

When running python manage.py runserver on local host and error occurs, check domain is included in the ALLOWED_HOSTS of setting.py. Alternatively, you can use below:
- ALLOWED_HOSTS = [‘*’] # for local host only
No database created when running psql command: CREATE DATABASE …, check if semi-colon add to end of the statement. In the event, the ‘;’ is missing, type ‘;’ and try inputting the commands again. See link for more details.

Hosting static website with GitHub Pages

Create static website with custom domain names. Perks is having your own web hosting at minimal cost. The only cost is the cost of the custom domain name.

Requirements:

Github account: For hosting the static website.
Custom domain name: Purchase domain names from GoDaddy or Namecheap etc. Alternatively, can use GitHub default url <username>.github.io

Steps:

Github
1. Create new repository with following format <username>.github.io where username refers to GitHub userid.
2. In the repository, go to setting: Under Theme, choose a Jekyll theme. When finish, click on Source, select master branch. A file needs to exist in repository before Source option can be selected.
3. If you have purchase your custom domain, you need to configure the A records and CNAME for the domain at the registrar to point to the GitHub site. Proceed to make the necessary changes at the domain registrar website.
Registrar (Below is using GoDaddy as example)
1. Under My Products, select the domain name that will be used. Click on Manage button.
2. Once in setting page, scroll down to Additional Settings and click Manage DNS
3. Within the DNS management page, Add in 4 “A” row with each pointing to IP as follows:
  1. 185.199.108.153
  2. 185.199.109.153
  3. 185.199.110.153
  4. 185.199.111.153
4. Add in the CNAME pointing to your repository at Github: <username>.github.io
5. View link for more info on configuring domain name with goDaddy
6. Similarly, see following link for Namecheap
7. Note: if you setup using A records and CNAME, leave the nameservers as default.
8. Once the settings are configured, return to GitHub pages to add the custom domain name
Github
1. At the setting page, add the custom domain name in the Custom Domain section.
2. Tick Enforce Https (may take up to 24 hours to take effect)
3. Completed.
Proceed to add in contents in GitHub using markdown.

Resources

Notes

GoDaddy default A records: 50.63.202.32

Shorte.st Url Shortener API with Python: Create multiple shorteners at one go (& monetize your links)

A mini project that shortens urls with Shorte.st using python. Shorte.st only provides the “curl” command version of the API. In this post, the command is translated in the form of python requests for easy integration with rest of python scripts and enable multiple urls shortening.

Please note that I have an account with Shorte.st.

Objectives:
1. 1. Create python function to shorten url using Shorte.st
Required Tools:
1. 1. Requests — for handling HTML protocol. Use pip install requests.
  2. Shorte.st account — Shorte.st account to shorten url.
Steps:
1. 1. Retrieve the API token from Shorte.st by going to Link Tools –> Developer API and copy the API token.
  2. Use request.put with the following parameters:
    1. headers containing the API token and user-agent
    2. data which contains the target url to shorten.
  3. Get the response.text which contain the shortened url
  4. Complete! Include shortened url in target sites/twitter/social media etc.

Curl commands as provided by Shorte.st

curl -H "public-api-token: your_api_token" -X PUT -d "urlToShorten=target_url_to_shortened.com" https://api.shorte.st/v1/data/url

Python function to insert to part of your code or as standalone

import os, sys, re
import requests

USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"

def shorten_url(target_url, api_token):
    """
        Function to shorten url (With your shorte.st) account.
        Args:
            target_url (str): url to shorten
            api_token (str): api token str
        Returns:
            shortened_url (str)

    """

    headers = {'user_agent':USER_AGENT, 'public-api-token':api_token}
    data = dict(urlToShorten=target_url)

    url = 'https://api.shorte.st/v1/data/url'

    r= requests.put(url, data, headers= headers)

    shortened_url = re.search('"shortenedUrl":"(.*)"',r.text).group(1)
    shortened_url = shortened_url.replace('\\','')

    return shortened_url

if __name__ == "__main__":

    api_token = 'your_api_token'

    urllist = [
                'https://simply-python.com/2018/07/20/fast-download-images-from-google-image-search-with-python-requests-grequests',
                'https://simply-python.com/2018/04/22/building-a-twitter-bot-with-python'

                ]

    for target_url in urllist:
        shortened_url = shorten_url(target_url, api_token)
        print 'shortened_url: {}'.format(shortened_url)

Results

shortened_url: http://destyy.com/wKqD2s
shortened_url: http://destyy.com/wKqD17

Further notes

If you have some fantastic links to share and hope to monetize your links, you can click on below banner to explore more.
The above script is not meant for spamming with huge amount of urls. Shorte.st will monitor on the quality of the urls be shortened.
An ads-free shortener will be with bit.ly. Please see post on using the bit.ly shortener with python if prefer an alternative.

Fast Download Images from Google Image search with python requests/grequests

A mini project that highlights the usage of requests and grequests.

Objectives:
1. 1. Download multiple images from Google Image search results.
Required Modules:
1. 1. Requests – for HTTP request
  2. grequests – for easy asynchronous HTTP Requests.
  3. Both can be installed by using pip install requests, grequests
Steps:
1. 1. Retrieve html source from the google image search results.
  2. Retrieve all image url links from above html source. (function: get_image_urls_fr_gs)
  3. Feed the image url list to grequests for multiple downloads (function: dl_imagelist_to_dir)
Breakdown: Steps on grequests implementation.
1. Very similar to requests implementation which instead of using requests. get() use grequests.get() or grequests.post()
2. Create a list of GET or POST actions with different urls as the url parameters. Identify a further action after getting the response e.g. download image to file after the get request.
3. Map the list of get requests to grequests to activate it. e.g. grequests.map(do_stuff, size=x) where x is the number of async https requests. You can choose x for values such as 20, 50, 100 etc.
4. Done !

Below is the complete code.


import os, sys, re
import string
import random
import requests, grequests
from functools import partial
import smallutils as su  #only use for creating folder

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
headers = { 'User-Agent': USER_AGENT }

def get_image_urls_fr_gs(query_key):
    """
        Get all image url from google image search
        Args:
            query_key: search term as of what is input to search box.
        Returns:
            (list): list of url for respective images.

    """

    query_key = query_key.replace(' ','+')#replace space in query space with +
    tgt_url = 'https://www.google.com.sg/search?q={}&tbm=isch&tbs=sbd:0'.format(query_key)#last part is the sort by relv

    r = requests.get(tgt_url, headers = headers)

    urllist = [n for n in re.findall('"ou":"([a-zA-Z0-9_./:-]+.(?:jpg|jpeg|png))",', r.text)] 

    return urllist

def dl_imagelist_to_dir(urllist, tgt_folder, job_size = 100):
    """
        Download all images from list of url link to tgt dir
        Args:
            urllist: list of the image url retrieved from the google image search
            tgt_folder: dir at which the image is stored
        Kwargs:
            job_size: (int) number of downloads to spawn.

    """
    if len(urllist) == 0:
        print "No links in urllist"
        return

    def dl_file(r, folder_dir, filename, *args, **kwargs):
        fname = os.path.join(folder_dir, filename)
        with open(fname, 'wb') as my_file:
            # Read by 4KB chunks
            for byte_chunk in r.iter_content(chunk_size=1024*10):
                if byte_chunk:
                    my_file.write(byte_chunk)
                    my_file.flush()
                    os.fsync(my_file)

        r.close()

    do_stuff = []
    su.create_folder(tgt_folder)

    for run_num, tgt_url in enumerate(urllist):
        print tgt_url
        # handle the tgt url to be use as basename
        basename = os.path.basename(tgt_url)
        file_name = re.sub('[^A-Za-z0-9.]+', '_', basename ) #prevent special characters in filename

        #handling grequest
        action_item =  grequests.get(tgt_url, hooks={'response': partial(dl_file, folder_dir = tgt_folder, filename=file_name)}, headers= headers,  stream=True)
        do_stuff.append(action_item)

    grequests.map(do_stuff, size=job_size)

def dl_images_fr_gs(query_key, tgt_folder):
    """
        Function to download images from google search

    """
    url_list = get_image_urls_fr_gs(query_key)
    dl_imagelist_to_dir(url_list, tgt_folder, job_size = 100)

if __name__ == "__main__":

    query_key= 'python symbol'
    tgt_folder = r'c:\data\temp\addon'
    dl_images_fr_gs(query_key, tgt_folder)

Further notes

Note that the images download from google search are only those displayed. Additional images which are only shown when “show more results” button is clicked will not be downloaded. To resolve this case:
1. a user can continuously clicked on “show more results”, manually download the html source and run the 2nd function (dl_imagelist_to_dir) on the url list extracted.
2. Use python selenium to download the html source.
Instead of using grequests, request module can be used to download the images sequentially or one by one.
The downloading of files are break into chunks especially for those very big files.
Code can be further extended for downloading other stuff.
Further parameters in the google search url here.

Create Train breakdown notifications

Imagine walking 10 mins to the train station, finds the train has broken down and the bus stop is 20 mins walk away in opposite direction from the station. This is extremely frustrating especially if you are living in a country with relatively frequent train delay and breakdown. The solution: create a simple alert system to your phone using Python, Pattern and Pushbullet.

The below script will scrape the MRT website for latest announcements using Python Pattern and send to the phone using Pushbullet. In this version of script, it will always return the latest post on the website. As such, the latest post might be a few days ago if there is no new breakdown.

We will assume that MRT is working well if it returns a non-current post. In such case, we will also pull the date and time of latest post for date comparison. The script is then scheduled to run every day at specific timing preferably before going out for work.

import os, sys, time, datetime
from pattern.web import URL, extension, download, DOM, plaintext
from pyPushBullet.pushbullet import PushBullet

#target website url
target_website = 'https://twitter.com/SMRT_Singapore'

# Require user-agent in the download field
html = download(target_website, unicode=True,
                user_agent='"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"',
                cached=False)

#optional, for checking purpose
with open(r'c:\data\temp\ans.html','wb') as f:
    f.write(html.encode('utf8'))

dom = DOM(html)

#X-path for scraping target parameters
time_str =  dom('a[class="tweet-timestamp js-permalink js-nav js-tooltip"]')[0].attributes['title']
content_str =  plaintext(dom('p[class="TweetTextSize TweetTextSize--26px js-tweet-text tweet-text"]')[0].content)

full_str = time_str + '\n' + content_str

api_key_path = r'API key str filepath'
with open(api_key_path,'r') as f:
    apiKey = f.read()

p = PushBullet(apiKey)
p.pushNote('all', 'mrt alert', full_str ,recipient_type="random1")

Modification can be done such that it will only create alert if there is current news simply by comparing the date of the post to today’s date.

Configuring mobile alert with pushbullet: “Sending alerts to iphone or Android phone using python“.
Example of web scrape using pattern “Simple Python Script to retrieve all stocks data from Google Finance Screener“

Scraping housing prices using Python Scrapy Part 2

This is the continuation of the previous post on “Scraping housing prices using Python Scrapy“. In this session, we will use Xpath to retrieve the corresponding fields from the targeted website instead of just having the full html page. For a preview on how to extract the information from a particular web page, you can refer to the following post “Retrieving stock news and Ex-date from SGX using python“.

Parsing the web page using Scrapy will require the use of Scrapy spider “parse” function. To test out the function, it might be an hassle to run Scrapy crawl command each time you try out a field as this means making requests to the website every single time.

There are two ways to go about it. One way is to let Scrapy cache the data. The other is to make use of the html webpage downloaded in the previous session. I have not really try out caching the information using scrapy but it is possible to run using Scrapy Middleware. Some of the links below might help to provide some ideas.

For utilizing the downloaded copy of the html page which is what I have been using, the following script demonstrate how it is done. The downloaded page is taken from this property website link. Create an empty script and input the following snippets, run the script as normal python script.

    import os, sys, time, datetime, re
    from scrapy.http import HtmlResponse

    #Enter file path
    filename = r'targeted file location'

    with open(filename,'r') as f:
        html =  f.read()

    response = HtmlResponse(url="my HTML string", body=html) # Key line to allow Scrapy to parse the page

    item = dict()

    for sel in response.xpath("//tr")[10:]:
        item['id'] = sel.xpath('td/text()')[0].extract()
        item['block_add'] = sel.xpath('td/a/span/text()')[0].extract()
        individual_block_link = sel.xpath('td/a/@href')[0].extract()
        item['individual_block_link'] = response.urljoin(individual_block_link)
        item['date'] = sel.xpath('td/text()')[3].extract()

        price = sel.xpath('td/text()')[4].extract()
        price = int(price.replace(',',''))
        price_k = price/1000
        item['price'] = price
        item['price_k'] = price_k
        item['size'] = sel.xpath('td/text()')[5].extract()
        item['psf'] = sel.xpath('td/text()')[6].extract()
        #agent = sel.xpath('td/a/span/text()')[1].extract()
        item['org_url_str'] = response.url

        for k, v in item.iteritems():
            print k, v

Once verified there are no issue retrieving the various components, we can paste the portion to the actual Scrapy spider parse function. Remember to exclude the statement “response = HtmlResponse …”.

From the url, we noticed that the property search results are available in multiple pages. The idea is to traverse each page and obtain the desired information from each page. This would need Scrapy to know the next url to go to. To parse the information, the same method can be use to retrieve the url link to the next page.

Below show the parse function use in the Scrapy spider.py.

def parse(self, response):

    for sel in response.xpath("//tr")[10:]:
        item = ScrapePropertyguruItem()
        item['id'] = sel.xpath('td/text()')[0].extract()
        item['block_add'] = sel.xpath('td/a/span/text()')[0].extract()
        individual_block_link = sel.xpath('td/a/@href')[0].extract()
        item['individual_block_link'] = response.urljoin(individual_block_link)
        item['date'] = sel.xpath('td/text()')[3].extract()

        price = sel.xpath('td/text()')[4].extract()
        price = int(price.replace(',',''))
        price_k = price/1000
        item['price'] = price
        item['price_k'] = price_k
        item['size'] = sel.xpath('td/text()')[5].extract()
        item['psf'] = sel.xpath('td/text()')[6].extract()
        #agent = sel.xpath('td/a/span/text()')[1].extract()
        item['org_url_str'] = response.url

        yield item

    #get next page link
    next_page = response.xpath("//div/div[6]/div/a[10]/@href")
    if next_page:
        page_url = response.urljoin(next_page[0].extract())
        yield scrapy.Request(page_url, self.parse)

For the next post, I will share how to migrate the running of spider to Scrapy Cloud

Scraping housing prices using Python Scrapy

This post (and subsequent posts) show how to scrape the latest housing prices from the web using python Scrapy. As an example, the following website, propertyguru.com, is used. To start, select the criteria and filtering within the webpage to get the desired search results. Once done, copy the url link. Information from this url will be scraped using Scrapy. Information on installing Scrapy can be found from the following post “How to Install Scrapy in Windows“.

For a guide of running Scrapy, you can refer to the Scrapy tutorial. The following guidelines can be used for building a simple project.

Create project
scrapy startproject name_of_project

Define items in items.py (temporary set a few fields)

from scrapy.item import Item, Field

class ScrapePropertyguruItem(Item):
    # define the fields for your item here like:
    name = Field()
    id = Field()
    block_add = Field()

Create a spider.py. Open spider.py and input the following codes to get the stored html form of the scraped web.

import scrapy
from propertyguru_sim.items import ScrapePropertyguruItem #this refer to name of project

class DmozSpider(scrapy.Spider):
    name = "demo"
    allowed_domains = ['propertyguru.com.sg']
    start_urls = [
       r'http://www.propertyguru.com.sg/simple-listing/property-for-sale?market=residential&property_type_code%5B%5D=4A&property_type_code%5B%5D=4NG&property_type_code%5B%5D=4S&property_type_code%5B%5D=4I&property_type_code%5B%5D=4STD&property_type=H&freetext=Jurong+East%2C+Jurong+West&hdb_estate%5B%5D=13&hdb_estate%5B%5D=14'
    ]
    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        print
        print
        print 'filename', filename 

        with open(filename, 'wb') as f:
            f.write(response.body)

Run the scrapy command “scrapy crawl demo” where “demo” is the spider name assigned.

You will notice that by setting the project this way, there will be error parsing the website. Some websites like the one above required an user agent to be set. In this case, you can add the user_agent to settings.py to have the scrapy run with an user agent.

BOT_NAME = 'propertyguru_sim'

SPIDER_MODULES = ['propertyguru_sim.spiders']
NEWSPIDER_MODULE = 'propertyguru_sim.spiders'

USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"

Run the script again with the updated code and you will see an html page appear in the project folder. Success.

In the next post, we will look at getting the individual components from the html page using xpath.

Google Search results web crawler (Updates)

A continuation of the project based on the following post “Google Search results web crawler (re-visit Part 2)” & “Getting Google Search results with Scrapy”. The project will first obtain all the links of the google search results of target search phrase and comb through each of the link and save them to a text file.

Two new main features are added. First main feature allows multiple keywords to be search at one go. Multiple search phrases can be entered from a target file and search all at one go.

There is also an option to converge all the results of all the search phrases. This is useful when all the search phrases are related and you wish to see all the top ranked results group together. The results will display all the top search result of all the key phrases followed by the 2nd and so forth.

Other options include specifying the number of text sentences of each result to print, min length of the sentence, sort results by date etc. Below are the key options to choose from:

    NUM_SEARCH_RESULTS = 30  # number of search results returned
    SENTENCE_LIMIT = 50
    MIN_WORD_IN_SENTENCE = 6
    ENABLE_DATE_SORT = 0

The second feature is an experimental feature that deal with language processing. It will try to retrieve all the noun phrases from all the search results and note the its frequency. The idea is to retrieve the most popular noun phrases based on the results of all the search, this is something similar to word cloud.

This is done using the python pattern module which also deal with the HTML request and processing used in the script. Under the pattern module, there is sub module that handles natural language processing. For this feature, the pattern module will tokenize the text and (part-of-speech) tag each of the word. With the in-built tag identifcation, you can specify it to detect noun phrase chunk tag or NP (Tags: DT+RB+JJ+NN + PR). For more part-of-speech tag, you can refer to pattern website. I have included part of the code for the noun phrase detection (Under pattern_parsing.py).

def get_noun_phrases_fr_text(text_parsetree, print_output = 0, phrases_num_limit =5, stopword_file=''):
    """ Method to return noun phrases in target text with duplicates
        The phrases will be a noun phrases ie NP chunks.
        Have the in build stop words --> check folder address for this.
        Args:
            text_parsetree (pattern.text.tree.Text): parsed tree of orginal text

        Kwargs:
            print_output (bool): 1 - print the results else do not print.
            phrases_num_limit (int): return  the max number of phrases. if 0, return all.
        
        Returns:
            (list): list of the found phrases. 

    """
    target_search_str = 'NP' #noun phrases
    target_search = search(target_search_str, text_parsetree)# only apply if the keyword is top freq:'JJ?+ NN NN|NNP|NNS+'

    target_word_list = []
    for n in target_search:
        if print_output: print retrieve_string(n)
        target_word_list.append(retrieve_string(n))

    ## exclude the stop words.
    if stopword_file:
        with open(stopword_file,'r') as f:
            stopword_list = f.read()
        stopword_list = stopword_list.split('\n')

    target_word_list = [n for n in target_word_list if n.lower() not in stopword_list ]

    if (len(target_word_list)>= phrases_num_limit and phrases_num_limit>0):
        return target_word_list[:phrases_num_limit]
    else:
        return target_word_list
        
def retrieve_top_freq_noun_phrases_fr_file(target_file, phrases_num_limit, top_cut_off, saveoutputfile = ''):
    """ Retrieve the top frequency words found in a file. Limit to noun phrases only.
        Stop word is active as default.
        Args:
            target_file (str): filepath as str.
            phrases_num_limit (int):  the max number of phrases. if 0, return all
            top_cut_off (int): for return of the top x phrases.
        Kwargs:
            saveoutputfile (str): if saveoutputfile not null, save to target location.
        Returns:
            (list) : just the top phrases.
            (list of tuple): phrases and frequency

    """
    with open(target_file, 'r') as f:
        webtext =  f.read()

    t = parsetree(webtext, lemmata=True)

    results_list = get_noun_phrases_fr_text(t, phrases_num_limit = phrases_num_limit, stopword_file = r'C:\pythonuserfiles\google_search_module_alt\stopwords_list.txt')

    #try to get frequnecy of the list of words
    counts = Counter(results_list)
    phrases_freq_list =  counts.most_common(top_cut_off) #remove non consequencial words...
    most_common_phrases_list = [n[0] for n in phrases_freq_list]

    if saveoutputfile:
        with open(saveoutputfile, 'w') as f:
            for (phrase, freq) in phrases_freq_list:
                temp_str = phrase + ' ' + str(freq) + '\n'
                f.write(temp_str)
            
    return most_common_phrases_list, phrases_freq_list

The second feature is very crude and give rise to quite a number of redundant phrases. However, in some cases, are able to pick up certain key phrases. Below are the frequency results based on list of the search key phrases. As seen, the accuracy still need some refinement.

Key phrases

Top cafes in singapore
where to go to for coffee in singapore
Recommended cafes in singapore
Most popular cafes singapore

================
Results

=================

Singapore 139
coffee 45
the past year 23
plenty 23
the Singapore cafe scene 22
new additions 22
View Photo 19
PH 16
cafes 14
20 Best Cafes 13
Fri 11
Coffee 11
Nylon 10
Thu 10
Artistry 10
Indonesia 10
The coffee 9
The Plain 9
Chye Seng Huat Hardware 9
the coffee 9
Photos 9
you re 9
Everton Park 8
sugar 8
Hours 8
t 8
Changi Airport 7
time 7
Food 7
p. 7
Common Man Coffee Roasters 7
Tel 7
Rise & Grind Coffee Co 6
good coffee 6
40 Hands 6
a lot 6
the cafe 6
The Coffee Bean 6
your friends 6
Malaysia 6
s 6
a cup 6
Korea 6
Sarnies 6
Waffles 6
Address 6
Chinese New Year 6
desserts 6
the river 6
Taiwan 6
home 6
the city 5
service 5
the best coffee 5
Tea Leaf 5
great coffee 5
a couple 5
the heart 5
people 5
the side 5
Nylon Coffee Roasters 5
hours 5
Singaporeans 5
food 5
any time 5
eve 5
eggs 5
a bit 5
Eve 5
the day 5
kopi 5
Thailand 5
brunch 5
their coffee 5
Chinatown 5
Restaurants 4
Brunch 4
the top 4
Jalan Besar 4
Ideas 4
Dutch Colony 4
night 4
Cafes 4
a variety 4
Visit 4
course 4
Melbourne 4
The Best 4

Main script can be obtained from Github.

RSS feeds Reader GUI

The last post mentions about retrieving RSS feeds. To allow easy viewing, a GUI is constructed. The GUI is built using wxpython and consists of few adjustable pane with scrolling enabled. The user can choose to display the different group (eg: “World” and “SG” news) in separate panels.

For live updates, a wx.timer function is added to the GUI so the data can update every x time specified by the users. This post highlights the use of wx MultiSplitterWindow, scrollable panels and wx.timer for feeds live updates.

import os, sys, re, time
import wx
from wx.lib.splitter import MultiSplitterWindow
from General_feed_extract import FeedsReader
import  wx.lib.scrolledpanel as scrolled

class SamplePane(scrolled.ScrolledPanel):
    """
    Just a simple test window to put into the splitter.
    Set to scrollable, set to word wrap
    """
    def __init__(self, parent, label):
        scrolled.ScrolledPanel.__init__(self, parent,style = wx.BORDER_SUNKEN)
        #self.SetBackgroundColour(colour)
        self.textbox = wx.TextCtrl(self, -1, label,style=wx.TE_MULTILINE )
        vbox = wx.BoxSizer(wx.VERTICAL)
        vbox.Add(self.textbox, 1, wx.ALIGN_LEFT | wx.ALL|wx.EXPAND, 5)
        self.SetSizer(vbox)
        self.SetAutoLayout(1)
        self.SetupScrolling()

        self.SetupScrolling()
    def SetOtherLabel(self, label):
        self.textbox.SetValue(label)
        self.SetupScrolling()

class MyPanel(wx.Panel):
    def __init__(self, parent):
        wx.Panel.__init__(self, parent, -1)
        self.parent = parent

        ## Add in the feeds parameters
        self.reader = FeedsReader()

        ## Add in timer
        self.timer = wx.Timer(self)
        self.Bind(wx.EVT_TIMER, self.on_timer_update_feeds, self.timer)
        self.timer.Start(30000) # start timer after a delay, time in milli sec

        splitter = MultiSplitterWindow(self, style=wx.SP_LIVE_UPDATE)
        self.splitter = splitter
        sizer = wx.BoxSizer(wx.HORIZONTAL)
        sizer.Add(splitter, 1, wx.EXPAND)
        self.SetSizer(sizer)

        self.world_news_panel = SamplePane(splitter, "Panel One")
        splitter.AppendWindow(self.world_news_panel, 140)

        self.SG_panel = SamplePane(splitter, "Panel Two")
        #self.SG_panel.SetMinSize(self.SG_panel.GetBestSize())
        splitter.AppendWindow(self.SG_panel, 180)

        self.others_panel = SamplePane(splitter,  "Panel Three")
        splitter.AppendWindow(self.others_panel, 105)

        ## Set the orientation
        self.splitter.SetOrientation(wx.VERTICAL)

        ## Updates the panel
        self.update_panels()

    def get_feeds(self):
        """ Run the get feeds class. Use for getting updates of the feeds.

        """
        self.reader.parse_rss_sites_by_cat()

    def update_panels(self):
        """ Update all the panels with the updated feeds.
            Can use the set other label method

        """
        self.get_feeds()
        self.update_SG_panel()
        self.update_world_panel()

    def update_world_panel(self):
        """ Update World_panel on the World news.

        """
        date_key = self.reader.set_last_desired_date(0)
        if self.reader.rss_results_dict_by_cat['World'].has_key(date_key):
            World_news_list = self.reader.rss_results_dict_by_cat['World'][date_key]
            World_news_str = '\n********************\n'.join(['\n'.join(n) for n in World_news_list])
            self.world_news_panel.SetOtherLabel(World_news_str)

    def update_SG_panel(self):
        """ Update SG_panel on the Singapore stock news.

        """
        date_key = self.reader.set_last_desired_date(0)
        if self.reader.rss_results_dict_by_cat['SG'].has_key(date_key):
            SG_news_list = self.reader.rss_results_dict_by_cat['SG'][date_key]
            SG_news_str = '\n********************\n'.join(['\n'.join(n) for n in SG_news_list])
            self.SG_panel.SetOtherLabel(SG_news_str)

    def on_timer_update_feeds(self,evt):
        """ Update feeds once timer reach.
        """
        print 'Updating....'
        self.update_panels()

    def SetLiveUpdate(self, enable):
        if enable:
            self.splitter.SetWindowStyle(wx.SP_LIVE_UPDATE)
        else:
            self.splitter.SetWindowStyle(0)

class MyFrame(wx.Frame):
    def __init__(self, parent, ID, title):      

        wx.Frame.__init__(self, parent, ID, title,pos=(150, 20), size=(850, 720))#size and position

        self.top_panel = MyPanel(self)

class MyApp(wx.App):
    def __init__(self):
        wx.App.__init__(self,redirect =False)
        self.frame= MyFrame(None,wx.ID_ANY, "Feeds Watcher")
        self.SetTopWindow(self.frame)

        self.frame.Show()

def run():
    try:
        app = MyApp()
        app.MainLoop()
    except Exception,e:
        print e
        del app

if __name__== "__main__":
    run()

The following links contains information on setting up scroll bars in wx and also working with wx.timers.

Get RSS feeds using python pattern

Python Pattern allows easy way to retrieve RSS feeds. The following script will act as a feeds reader and retrieve feeds from various sites, focusing on world news and related Singapore stock market in this example.

The pattern module has the NewsFeed() function that can take in RSS url and output the corresponding results. The following is the description of the Newsfeed object from the pattern website “The Newsfeed object is a wrapper for Mark Pilgrim’s Universal Feed Parser. Newsfeed.search() takes the URL of an RSS or Atom news feed and returns a list of Result objects.”

This will return object that has the following attributes title, link and desc. The script below takes in a dict with the different categories as key. The value are the list of RSS url belonging to that category. The script will output results in the form of dict of categories and results of each category are segregated by date key. This script allows consolidation of different feeds from various RSS sources enabling the user to further process the feeds. The printing of the feeds can be limited by the set_last_desired_date() which display only results from a certain date.

import os, re, sys, time, datetime, copy, calendar
from pattern.web import URL, extension, cache, plaintext, Newsfeed

class FeedsReader(object):
    def __init__(self):

        #For grouping to various category
        self.rss_sites_by_category_dict = {
                                            'SG':   [
                                                        'http://feeds.theedgemarkets.com/theedgemarkets/sgtopstories.rss',
                                                        'http://feeds.theedgemarkets.com/theedgemarkets/sgmarkets.rss',
                                                        'http://feeds.theedgemarkets.com/theedgemarkets/sgproperty.rss',
                                                      ],
                                            'World':[
                                                        'http://www.ft.com/rss/home/asia',
                                                        'http://rss.cnn.com/rss/money_news_economy.rss',
                                                        'http://feeds.reuters.com/reuters/businessNews',
                                                      ],
                                            }
        self.rss_sites = []

        ## num of feeds to parse_per_site
        self.num_feeds_parse_per_site = 100

        ## individual group storage of feeds.
        self.rss_results_dict = {} # dict with date as key
        self.rss_title_list = []

        ## full results set consist of category
        self.rss_results_dict_by_cat ={} # dict of dict
        self.rss_title_list_by_cat = {}  # dict of list

    def set_rss_sites(self, rss_site_urls):
        """ Set to self.rss_sites.
            Args:
                rss_site_urls (list): list of rss site url for getting feeds.
        """
        self.rss_sites = rss_site_urls

    def convert_date_str_to_date_key(self, date_str):
        """ Convert the date str given by twiiter [created_at] to date key in format YYYY-MM-DD.
            Args:
                date_str (str): date str in format given by twitter. 'Mon Sep 29 07:00:10 +0000 2014'
            Returns:
                (int): date key in format YYYYMMDD
        """
        date_list = date_str.split()

        month_dict = {v: '0'+str(k) for k,v in enumerate(calendar.month_abbr) if k <10}
        month_dict.update({v:str(k) for k,v in enumerate(calendar.month_abbr) if k >=10})

        return int(date_list[3] + month_dict[date_list[2]] + date_list[1])

    def parse_rss_sites(self):
        """ Function to parse the RSS sites.
            Results are stored in self.rss_results_dict with date as key.
        """
        self.rss_results_dict = {}
        self.rss_title_list = []

        cache.clear()

        for rss_site_url in self.rss_sites:
            print "processing: ", rss_site_url
            for result in Newsfeed().search(rss_site_url)[:self.num_feeds_parse_per_site]:
                date_key = self.convert_date_str_to_date_key(result.date)
                self.rss_title_list.append(result.title)
                if self.rss_results_dict.has_key(date_key):
                    self.rss_results_dict[date_key].append([result.title,  plaintext(result.text)])
                else:
                    self.rss_results_dict[date_key] = [[result.title,  plaintext(result.text)]]
        print 'done'

    def parse_rss_sites_by_cat(self):
        """ Iterate over the list of categories and parse the list of rss sites.
        """
        self.rss_results_dict_by_cat ={} # dict of dict
        self.rss_title_list_by_cat = {}  # dict of list

        for cat in self.rss_sites_by_category_dict:
            print 'Processing Category: ', cat
            self.set_rss_sites(self.rss_sites_by_category_dict[cat])
            self.parse_rss_sites()
            self.rss_results_dict_by_cat[cat] = self.rss_results_dict
            self.rss_title_list_by_cat[cat] = self.rss_title_list

    def set_last_desired_date(self, num_days = 0):
        """ Return the last date in which the results will be displayed.
            It is set to be the current date - num of days as set by users.
            Affect only self.print_feeds function.
            Kwargs:
                num_days (int): num of days prior to the current date.
                Setting to 0 will only retrieve the current date
            Returns:
                (int): datekey as yyyyymmdd.
        """
        last_eff_date_list = list((datetime.date.today() - datetime.timedelta(num_days)).timetuple()[0:3])

        if len(str(last_eff_date_list[1])) == 1:
            last_eff_date_list[1] = '0' + str(last_eff_date_list[1])

        return int(str(last_eff_date_list[0]) + last_eff_date_list[1] + str(last_eff_date_list[2]))

    def print_feeds(self, rss_results_dict):
        """ Print the RSS data results. Required the self.rss_results_dict.
            Args:
                rss_results_dict (dict): dict containing date as key and title, desc as value.
        """
        for n in rss_results_dict.keys():
            print 'Results of date: ', n
            dataset = rss_results_dict[n]
            if int(n) >= self.set_last_desired_date():
                print '===='*10
                for title,desc in dataset:
                    print title
                    print desc
                    print '--'*5
                    print

    def print_feeds_for_all_cat(self):
        """ Print feeds for all the category specified by the self.rss_results_dict_by_cat

        """
        for cat in self.rss_results_dict_by_cat:
            print 'Printing Category: ', cat
            self.print_feeds(self.rss_results_dict_by_cat[cat])
            print
            print "####"*18

if __name__ == '__main__':
        f = FeedsReader()
        f.parse_rss_sites_by_cat()
        print '=='*19
        f.print_feeds_for_all_cat()

The results are as followed:

Processing Category: World
processing: http://www.ft.com/rss/home/asia
processing: http://rss.cnn.com/rss/money_news_economy.rss
processing: http://feeds.reuters.com/reuters/businessNews
done
Processing Category: SG
processing: http://feeds.theedgemarkets.com/theedgemarkets/sgtopstories.rss
processing: http://feeds.theedgemarkets.com/theedgemarkets/sgmarkets.rss
processing: http://feeds.theedgemarkets.com/theedgemarkets/sgproperty.rss
done
======================================

Printing Category: World
Results of date: 20150126
Results of date: 20150127
========================================
China seeks end to gold medal fixation
‘Blind pursuit’ of success condemned as sports administrator scraps rewards for victory
———-

Tsipras poised to unveil new Greek cabinet
Athens and international creditors dig in on Greek debt
———-

EU threatens Russia with more sanctions
Call comes as violence in eastern Ukraine escalates
———-

……..

Printing Category: SG
Results of date: 20150127
========================================
Singapore shares higher; ComfortDelGro shines on broker upgrade
SINGAPORE (Jan 27): Gains in most Asian markets helped lift Singapore shares, with much of the buying centred on penny stocks.

Gainers outnumbered decliners 267 to 187, with some 1.84 billion shares worth $1.49 billion shares changin…
———-

Job vacancies in Singapore up 8.9%
SINGAPORE (Jan 27): The number of job vacancies in Singapore swelled to 67,400 in the year to September 2014, from 61,900 the previous year, with the services industry in greatest need of workers.

The bulk of the vacancies was from c…