Building a twitter bot with python

For this post, we will be creating a bot that tweet daily (and automatically) on world events or any categories desired.

Major steps as follows:

1. Create a twitter account and API authorization.

As we will be automating using python, we will require to authorize the twitter API to work with python. Sign in to twitter application, click the “create new App” button and fill the required fields. You will need to obtain the “Access Token” and “Access Token Secret.” These two token will be used for python module in the later part.

2. Using python and tweepy

Tweepy module will be used to handle twitter related actions such as posting and getting results or even following/follow. Below snippet shows how to initialize the api for posting tweets and twitter related api. It will require consumer key and secret key from part 1.

import os, sys, datetime, re
import tweepy
import ConfigParser

def get_twitter_api():

    config_file_list = [
                        'directory/configfile_that_contain_credentials.ini'
                        ]

    #get the config_file that exists
    config_file = [n for n in config_file_list if os.path.exists(n)][0] #take the first entry

    parser = ConfigParser.ConfigParser()
    parser.read(config_file)

    CONSUMER_KEY =parser.get('CONFIG', 'CONSUMER_KEY')
    CONSUMER_SECRET = parser.get('CONFIG', 'CONSUMER_SECRET')
    ACCESS_KEY = parser.get('CONFIG', 'ACCESS_KEY')
    ACCESS_SECRET = parser.get('CONFIG', 'ACCESS_SECRET')

    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)

    api = tweepy.API(auth)
    return api

3. Getting Contents

We can either create own contents or get contents from various sources (the twitter will be like some sort of feeds/content aggregators). We will explore one simple case of displaying RSS feeds from various sources (such as blog, news etc) as contents for our twitter bot. The first step is to get all the RSS feeds from various sites. Below are some of the python scripts that will aid in the the collection of RSS feeds, links and contents. The main modules used are python pattern for all url/RSS feed access and downloading.

You can pip install the following modules pattern, smallutils and pandas for below python snippets.

3.1 Getting all url links from particular website.

This is for cases such as an aggregation site that display a list of websites that you might be interested to get all the website links. Note that the following scripts will retrieve all the link tags in the website and there might be redundant data. You can set the filter to limit the website search or you can manually select from the output list.

from pattern.web import URL, extension
from pattern.web import find_urls
from pattern.web import Newsfeed

def get_all_url_link_fr_target_website(tgt_site):
    """ Quick way to harvest all the url links and extract those that are feeds"""

    url = URL(tgt_site)
    page_source = url.download()

    return find_urls(page_source)

for site in  [n for n in get_all_url_link_fr_target_website(tgt_site) if not re.search("jpg|jpeg|png|ico|bit|svg|js",n)]:
	site_list.append(site)

site_list = [n for n in site_list if re.search("http(?:s)?://(?:www.)?[a-zA-Z0-9_]*.[a-zA-Z0-9_]*/$",n)]

for n in sorted(site_list):
	print n

3.2 Getting RSS feeds link from a website

Sometimes it is difficult to search for the RSS link from a particular website and blog. The following script will search for any RSS feeds link in the website and output it. Again, there might be some redundant links present.

from pattern.web import URL, extension
from pattern.web import find_urls
from pattern.web import Newsfeed
import smallutils as su

def get_feed_link_fr_target_website(tgt_site, pull_one = 1):
    """ Get the feed url from target website
        Args:
            tgt_site = url of target site
            pull_one = pull only 1 particular feed link

    """

    url = URL(tgt_site)
    page_source = url.download()

    if pull_one:
        return [n for n in find_urls(page_source) if re.search("feed|feeds",n)][0]
    else:
        return [n for n in find_urls(page_source) if re.search("feed|feeds",n)]

tgt_file = r'directory/txtfile_with_all_url.txt'
url_list = su.read_data_fr_file(tgt_file)

for url in url_list:
	try:
		w =  get_feed_link_fr_target_website(url,0)
	except:
		continue

if type(w) == list:
	for n in w:
		print n

3.3 Extracting contents from the RSS feeds

To extract contents from the RSS feeds, we need a python module that can parse a RSS feed structure (primarily xml format). We will make use of python pattern for RSS feed parsing and pandas to save extracted data in csv format. The following snippet will take in a file that contain a list of feeds url and retrieve the corresponding feeds.

from pattern.web import URL, extension
from pattern.web import find_urls
from pattern.web import Newsfeed
import smallutils as su
import pandas as pd

def get_feed_details_fr_url_list(url_list, save_csvfilename):
    """ Get the feeds info and save as dataframe to target location"""
    target_list = []
    for feed_url in url_list:
        print feed_url
        if feed_url == "-":
            break
        try:
            for result in Newsfeed().search(feed_url)[:2]:
                print repr(result.title), repr(result.url),  repr(result.date)
                temp_data = {"title":result.title, "feed_url":result.url, "date":result.date, "ref":extract_site_name_fr_url(feed_url)}
                target_list.append(temp_data)
            print "*"*18
            print
        except:
            print "No feeds found"
            continue

    ## save to padnas
    df = pd.DataFrame(target_list)
    df.to_csv(save_csvfilename, index= False , encoding='utf-8')

tgt_file = r'directory\tgt_file_that_contain_list_of_feeds_url.txt'
url_list = su.read_data_fr_file(tgt_file)

get_feed_details_fr_url_list(url_list, r"output\feed_result.csv")<span id="mce_SELREST_start" style="overflow:hidden;line-height:0;"></span>

You can also refer below post on feeds extraction.

Get RSS feeds using python pattern

3.4 URL shortener

Normally we would like to include the actual link in the twitter after including the content. However, sometimes the url is too long and may hit the twitter word limit. In this case, we can use URL shortener to help in our job. There are a couple of URL shortener services such as google, tinyurl. We will incorporate tinyurl in our python script.

from pattern.web import URL, extension

def shorten_target_url(tgt_url):
    agent = 'http://tinyurl.com/api-create.php?url={}'
    query_url = agent.format(tgt_url)

    url = URL(query_url)
    page_source = url.download()

    return page_source

4. Posting contents to Twitter

We make use of the snippets in section 2 and 3 and create a combined script that authenticate the user, get all feeds from a list a feeds url text file, select a few of the more recent feeds and post them to the twitter account with targeted hash tags and url shortening. Do observe proper tweeting etiquette and avoid spamming.

import os, sys, datetime, time
import pandas as pd
from FeedsHandler import get_feed_details_fr_url_list
from urlshortener import shorten_target_url
from initialize_twitter_api import get_twitter_api
import smallutils as su

if __name__  == "__main__":

    print "start of project"

    ## Defined parameters
    tgt_file_list = [
                        r'directory\tgt_file_contain_feedurl_list.txt'
                        ]

    #get the tgt_file that exists
    tgt_file = [n for n in tgt_file_list if os.path.exists(n)][0] #take the first entry

    feeds_outputfile =  r"c:\data\temp\feed_result.csv"
    hashtags = '#DIY #hacks' #include hash tags
    feeds_sample_size = 8

    ## Get feeds from url list
    print "Get feeds from target url list ... "
    url_list = su.read_data_fr_file(tgt_file)
    get_feed_details_fr_url_list(url_list, feeds_outputfile)

    ## Read the feeds_outputfile and
    print "Handling the feeds data"
    feeds_df = pd.read_csv(feeds_outputfile)
    feeds_df['date'] = pd.to_datetime(feeds_df['date'])

    ## filter the date within one day to today
    feeds_df['date_delta'] = datetime.datetime.now() - feeds_df['date']
    feeds_df['date_delta_days'] = feeds_df['date_delta'].apply(lambda x: float(x.days))

    feeds_df_filtered =  feeds_df[feeds_df['date_delta_days']  feeds_sample_size:# do a sampling if the input is high
        feeds_df_filtered_sample = feeds_df_filtered.sample(feeds_sample_size)
    else:
        feeds_df_filtered_sample = feeds_df_filtered

    ## set up for twitter api
    print "Initialized the Twitter API"
    api = get_twitter_api()

    ## handling message to twitter
    print "Sending all data to twitter"
    for index, row in feeds_df_filtered_sample.iterrows():
        #convert to full text for output
        target_txt = 'Via @' + row['ref'] + ': ' + row['title'] + ' ' + row['feeds_url_shorten'] + ' ' + hashtags
        try:
            api.update_status(target_txt)
        except:
            pass
        time.sleep(60*30)

5. Scheduling tweets

We can use either windows task scheduler or cron job to do scheduling of tweet posting daily.

6. What to do next

Above contents are derived mainly from RSS feeds. We can add contents by retweeting or embedding youtube videos automatically. A sample twitter bot created using the above methods are included in the link.

You can refer to some of the posts that include retrieving data from twitter.