
Building a Twitter bot with Python

For this post, we will be creating a bot that tweets daily (and automatically) on world events or any category desired.

The major steps are as follows:

1. Creating a Twitter account and API authorization

As we will be automating with Python, we need to authorize the Twitter API to work with Python. Sign in to the Twitter application page, click the “create new App” button and fill in the required fields. You will need to obtain the “Access Token” and “Access Token Secret”, together with the consumer key and consumer secret. These tokens will be used by the Python module in the later part.

2. Using Python and Tweepy

The Tweepy module will be used to handle Twitter-related actions such as posting tweets, getting results, and following/unfollowing. The snippet below shows how to initialize the API object used for posting tweets and other Twitter-related calls. It requires the consumer key, consumer secret and access tokens from part 1.

import os, sys, datetime, re
import tweepy
import ConfigParser

def get_twitter_api():

    config_file_list = [
                        'directory/configfile_that_contain_credentials.ini'
                        ]

    #get the config_file that exists
    config_file = [n for n in config_file_list if os.path.exists(n)][0] #take the first entry

    parser = ConfigParser.ConfigParser()
    parser.read(config_file)

    CONSUMER_KEY = parser.get('CONFIG', 'CONSUMER_KEY')
    CONSUMER_SECRET = parser.get('CONFIG', 'CONSUMER_SECRET')
    ACCESS_KEY = parser.get('CONFIG', 'ACCESS_KEY')
    ACCESS_SECRET = parser.get('CONFIG', 'ACCESS_SECRET')

    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)

    api = tweepy.API(auth)
    return api
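
For reference, the credentials config file read above might look something like the following (the section name and keys must match the parser.get calls; the values shown are just placeholders for your own keys and tokens):

[CONFIG]
CONSUMER_KEY = your_consumer_key
CONSUMER_SECRET = your_consumer_secret
ACCESS_KEY = your_access_token
ACCESS_SECRET = your_access_token_secret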

3. Getting Contents

We can either create our own content or gather content from various sources (the Twitter account will then act as a sort of feed/content aggregator). We will explore one simple case: displaying RSS feeds from various sources (such as blogs, news sites etc.) as content for our Twitter bot. The first step is to get all the RSS feeds from the various sites. Below are some Python scripts that will aid in the collection of RSS feeds, links and contents. The main module used is Python pattern for all URL/RSS feed access and downloading.

You can pip install the following modules for the Python snippets below: pattern, smallutils and pandas.

3.1 Getting all URL links from a particular website

This is for cases such as an aggregation site that displays a list of websites whose links you might be interested in collecting. Note that the following script will retrieve all the link tags on the website, so there might be redundant data. You can set a filter to limit the website search, or you can manually select from the output list.

import re
from pattern.web import URL, extension
from pattern.web import find_urls
from pattern.web import Newsfeed

def get_all_url_link_fr_target_website(tgt_site):
    """ Quick way to harvest all the url links and extract those that are feeds"""

    url = URL(tgt_site)
    page_source = url.download()

    return find_urls(page_source)

tgt_site = 'http://target_aggregation_site.com' # placeholder: replace with the site you want to scan
site_list = []

# drop links to images, scripts and other non-website resources
for site in [n for n in get_all_url_link_fr_target_website(tgt_site) if not re.search("jpg|jpeg|png|ico|bit|svg|js", n)]:
    site_list.append(site)

# keep only links that look like top-level website urls
site_list = [n for n in site_list if re.search("http(?:s)?://(?:www.)?[a-zA-Z0-9_]*.[a-zA-Z0-9_]*/$", n)]

for n in sorted(site_list):
    print n

3.2 Getting RSS feed links from a website

Sometimes it is difficult to find the RSS link on a particular website or blog. The following script will search for any RSS feed links on the website and output them. Again, there might be some redundant links present.

import re
from pattern.web import URL, extension
from pattern.web import find_urls
from pattern.web import Newsfeed
import smallutils as su

def get_feed_link_fr_target_website(tgt_site, pull_one = 1):
    """ Get the feed url from target website
        Args:
            tgt_site = url of target site
            pull_one = pull only 1 particular feed link

    """

    url = URL(tgt_site)
    page_source = url.download()

    if pull_one:
        return [n for n in find_urls(page_source) if re.search("feed|feeds", n)][0]
    else:
        return [n for n in find_urls(page_source) if re.search("feed|feeds", n)]

tgt_file = r'directory/txtfile_with_all_url.txt'
url_list = su.read_data_fr_file(tgt_file)

for url in url_list:
    try:
        w = get_feed_link_fr_target_website(url, 0)
    except:
        continue

    # print the feed links found for this url
    if type(w) == list:
        for n in w:
            print n

3.3 Extracting contents from the RSS feeds

To extract contents from RSS feeds, we need a Python module that can parse the RSS feed structure (primarily XML format). We will make use of Python pattern for the RSS feed parsing and pandas to save the extracted data in csv format. The following snippet takes in a file that contains a list of feed URLs and retrieves the corresponding feeds.

from pattern.web import URL, extension
from pattern.web import find_urls
from pattern.web import Newsfeed
import smallutils as su
import pandas as pd

def get_feed_details_fr_url_list(url_list, save_csvfilename):
    """ Get the feeds info and save as dataframe to target location"""
    target_list = []
    for feed_url in url_list:
        print feed_url
        if feed_url == "-":
            break
        try:
            for result in Newsfeed().search(feed_url)[:2]:
                print repr(result.title), repr(result.url),  repr(result.date)
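                # "ref" uses extract_site_name_fr_url, assumed to be a small helper defined elsewhere that derives a short site name from the feed url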
                temp_data = {"title":result.title, "feed_url":result.url, "date":result.date, "ref":extract_site_name_fr_url(feed_url)}
                target_list.append(temp_data)
            print "*"*18
            print
        except:
            print "No feeds found"
            continue

    ## save the results to csv using pandas
    df = pd.DataFrame(target_list)
    df.to_csv(save_csvfilename, index= False , encoding='utf-8')

tgt_file = r'directory\tgt_file_that_contain_list_of_feeds_url.txt'
url_list = su.read_data_fr_file(tgt_file)

get_feed_details_fr_url_list(url_list, r"output\feed_result.csv")

You can also refer to the post below on feed extraction.

  1. Get RSS feeds using python pattern

3.4 URL shortener

Normally we would like to include the actual link in the tweet after the content. However, sometimes the URL is too long and may hit the Twitter character limit. In this case, we can use a URL shortener to help. There are a couple of URL shortener services such as Google and TinyURL. We will incorporate TinyURL in our Python script.

from pattern.web import URL, extension

def shorten_target_url(tgt_url):
    agent = 'http://tinyurl.com/api-create.php?url={}'
    query_url = agent.format(tgt_url)

    url = URL(query_url)
    page_source = url.download()

    return page_source
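
As a quick check, calling the function with a long link (a made-up URL here) should return the shortened TinyURL link as plain text:

print shorten_target_url('http://www.example.com/2017/01/a-very-long-article-title/')
# prints something like http://tinyurl.com/xxxxxxx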

4. Posting contents to Twitter

We make use of the snippets in sections 2 and 3 to create a combined script that authenticates the user, gets all feeds from a feed URL list text file, selects a few of the more recent feeds, and posts them to the Twitter account with targeted hashtags and URL shortening. Do observe proper tweeting etiquette and avoid spamming.

import os, sys, datetime, time
import pandas as pd
from FeedsHandler import get_feed_details_fr_url_list
from urlshortener import shorten_target_url
from initialize_twitter_api import get_twitter_api
import smallutils as su

if __name__  == "__main__":

    print "start of project"

    ## Defined parameters
    tgt_file_list = [
                        r'directory\tgt_file_contain_feedurl_list.txt'
                        ]

    #get the tgt_file that exists
    tgt_file = [n for n in tgt_file_list if os.path.exists(n)][0] #take the first entry

    feeds_outputfile =  r"c:\data\temp\feed_result.csv"
    hashtags = '#DIY #hacks' #include hash tags
    feeds_sample_size = 8

    ## Get feeds from url list
    print "Get feeds from target url list ... "
    url_list = su.read_data_fr_file(tgt_file)
    get_feed_details_fr_url_list(url_list, feeds_outputfile)

    ## Read the feeds_outputfile and
    print "Handling the feeds data"
    feeds_df = pd.read_csv(feeds_outputfile)
    feeds_df['date'] = pd.to_datetime(feeds_df['date'])

    ## filter the date within one day to today
    feeds_df['date_delta'] = datetime.datetime.now() - feeds_df['date']
    feeds_df['date_delta_days'] = feeds_df['date_delta'].apply(lambda x: float(x.days))

    feeds_df_filtered = feeds_df[feeds_df['date_delta_days'] < 1]

    if len(feeds_df_filtered) > feeds_sample_size: # do a sampling if the input is high
        feeds_df_filtered_sample = feeds_df_filtered.sample(feeds_sample_size)
    else:
        feeds_df_filtered_sample = feeds_df_filtered

    ## set up for twitter api
    print "Initialized the Twitter API"
    api = get_twitter_api()

    ## shorten the feed urls before posting (creates the 'feeds_url_shorten' column used in the tweet text below)
    feeds_df_filtered_sample['feeds_url_shorten'] = feeds_df_filtered_sample['feed_url'].apply(shorten_target_url)

    ## handling message to twitter
    print "Sending all data to twitter"
    for index, row in feeds_df_filtered_sample.iterrows():
        #convert to full text for output
        target_txt = 'Via @' + row['ref'] + ': ' + row['title'] + ' ' + row['feeds_url_shorten'] + ' ' + hashtags
        try:
            api.update_status(target_txt)
        except:
            pass
        time.sleep(60*30)

5. Scheduling tweets

We can use either Windows Task Scheduler or a cron job to schedule the daily tweet posting.
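
For example, a cron entry along the following lines (the interpreter and script paths are placeholders) would run the bot every morning at 9am:

0 9 * * * /usr/bin/python /home/user/twitterbot/post_feeds_to_twitter.py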

6. What to do next

The contents above are derived mainly from RSS feeds. We can add more content by retweeting or embedding YouTube videos automatically. A sample Twitter bot created using the above methods is included in the link.
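
As an illustration (this is not part of the original script), automatic retweeting could reuse the Tweepy API object from section 2; the search hashtag, count and sleep interval below are arbitrary choices:

import time
import tweepy
from initialize_twitter_api import get_twitter_api

api = get_twitter_api()

# retweet a handful of recent tweets matching a hashtag, spaced out to avoid spamming
for tweet in api.search(q='#DIY', count=5):
    try:
        api.retweet(tweet.id)
    except tweepy.TweepError:
        pass
    time.sleep(60*10)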

You can refer to some of the posts below that involve retrieving data from Twitter.

  1. Get Stocks tweets using Twython
  2. Get Stocks tweets using Twython (Updates)

Get Stocks tweets using Twython (Updates)

This post adds more functionality to the script for getting stock tweets using Twython and Python. It adds a class StockTweetsReader that inherits from the base class TweetsReader.

The StockTweetsReader class is able to take in a series of stock names (company names) and incorporate different search phrases (such as <stockname> stock, <stockname> sentiment, <stockname> buy) to form a combined Twitter query.

These search phrases are joined together by the “OR” keyword and the Twitter search is based on the resulting series of queries. Below is part of the code showing how the stock name is joined to the additional parts, with the phrases eventually combined using the “OR” operator. The final query will look something like <stockname> OR <stockname> shares OR <stockname> stock etc., based on the modified parts list ['', 'shares', 'stock', 'Sentiment', 'buy', 'sell'].


    self.modified_part_search_list = ['','shares','stock', 'Sentiment', 'buy', 'sell']
    
    def set_search_list_and_form_search_query(self):
        """ Set the search list for individual stocks.
            Set to self.search_list and self.twitter_search_query.
        """
        self.search_list = ['"' + self.target_stock + ' ' + n + '"' for n in self.modified_part_search_list]
        self.form_seach_str_query()

After iterating through the series of stock symbols, it will compute the number of tweets, grouped by date, for each company or stock name to see whether there is any sudden spike in interest in a particular stock on any given date. A sample of the tweet count results from a series of Singapore stocks is shown below:

Processing stock: Sembcorp Ind
Processing stock: Mapletree Com Tr
Processing stock: Riverstone
20141006 14
20141007 86
Processing stock: NeraTel
20140930 3
Processing stock: Amtek Engg
Processing stock: Fortune Reit HKD
Processing stock: SATS
20141007 100
Processing stock: UOB Kay Hian
20141001 1
20141003 2
Processing stock: CapitaR China Tr
Processing stock: LantroVision
Processing stock: Sim Lian
20140929 1
20141001 2
20141005 1

There are currently some limitations on the results due to the API. One is that a query is limited to 100 results and only to recent tweets (perhaps capped within a month or two). The other is that for a short-form stock name, the search may pick up other tweets sharing the same short form, or content irrelevant to the stock news, e.g. SATS, which has 100 tweets in a single day.

The updated script is found on GitHub. It may need certain workarounds to resolve some of the limitations observed.

Get Stocks tweets using Twython

Twython is a Python Twitter API for getting tweets as well as performing more advanced actions such as posting or updating statuses. A particular project of mine requires monitoring stock tweets in the hope that they will give more insight into a particular stock. One way, I think, is to detect a sudden rise in the number of tweets for a particular stock on a particular day, which would signify increased attention or activity around that stock.

The script requires authentication from Twitter, hence a Twitter account is needed. We only need OAuth2 authentication, which is sufficient for just requesting feeds. Twython describes how to set up the various authorizations in its documentation. After setting up, querying the search is relatively easy and is covered in the following tutorial. Additional parameters of the search function can also be found on the website.
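
A minimal sketch of the OAuth2 (application-only) setup based on the Twython documentation; APP_KEY and APP_SECRET are placeholders for your own credentials:

from twython import Twython

APP_KEY = 'your_consumer_key'
APP_SECRET = 'your_consumer_secret'

# obtain an OAuth2 access token, then create the Twython object used for searching
twitter = Twython(APP_KEY, APP_SECRET, oauth_version=2)
ACCESS_TOKEN = twitter.obtain_access_token()
twitter = Twython(APP_KEY, access_token=ACCESS_TOKEN)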

A sample script that scans based on a series of keywords is shown below. The script forms the search query string based on the include_search_list and ignores items based on the exclude_list. More advanced usage of the different query methods can be found in the tutorial. The items in the include_search_list are joined by the “OR” keyword. Similarly, the items in the exclude_list are joined by “-”, meaning tweets that contain those phrases will be excluded from the search results.
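
A rough sketch of how such a query string might be formed (the method and attribute names here are illustrative, not the exact ones used in the actual script):

    def form_search_str_query(self):
        """ Join the include list with "OR" and prefix each exclude item with "-". """
        include_part = ' OR '.join(self.include_search_list)
        exclude_part = ' '.join(['-' + n for n in self.exclude_list])
        self.twitter_search_query = include_part + ' ' + exclude_part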

The date extracted from the search function under “created_at” is converted to a date_key for easy comparison. Hence, by grouping on the date_key, we can count the number of tweets for a particular stock for each day. Any unusual sign or increased activity can then be noted. The code below shows the query method used for the Twitter search function.

    def perform_twitter_search(self):
        """Perform twitter search by calling the self.twitter_obj.search function.
            Ensure the setting for search such as lang, count are being set.
            Will store the create date and the contents of each tweets.
        """
        for n in self.twitter_obj.search(q=self.twitter_search_query, lang = self.lang,
                                         count= self.result_count, result_type = self.result_type)["statuses"]:
            # store the date
            date_key =  self.convert_date_str_to_date_key(n['created_at'])
            contents = n['text'].encode(errors = 'ignore')
            self.search_results.append([date_key, contents])

To convert the date string to a date key for easy processing, the calendar module is used to convert the month name to an integer, which is then joined with the year and day strings.

    def convert_date_str_to_date_key(self, date_str):
        """Convert the date str given by twiiter [created_at] to date key in format YYYY-MM-DD.
            Args:
                date_str (str): date str in format given by twitter. 'Mon Sep 29 07:00:10 +0000 2014'
            Returns:
                (int): date key in format YYYYMMDD
        """
        date_list = date_str.split()

        month_dict = {v: '0'+str(k) for k,v in enumerate(calendar.month_abbr) if k < 10}
        month_dict.update({v:str(k) for k,v in enumerate(calendar.month_abbr) if k >= 10})

        return int(date_list[5] + month_dict[date_list[1]] + date_list[2])

To count the number of tweets per day, the pandas module is used in this case, but other methods can do the job too.

    def count_num_tweets_per_day(self):
        """ Count the number of tweets per day present. Only include the days where there are at least one tweets,.
        """
        day_info = [n[0] for n in self.search_results]
        date_df = pandas.DataFrame(day_info)
        grouped_date_info = date_df.groupby(0).size()
        date_group_data = zip(list(grouped_date_info.index), list(grouped_date_info.values))
        for date, count in date_group_data:
            print date,' ', count

The full script is found on GitHub. Note that there seem to be some limitations on the number of tweets retrieved through the Twitter API compared to the search results displayed on the main Twitter interface. This poses some limitations on the information the program can provide.