For this post, we will be creating a bot that tweets daily (and automatically) on world events or any other desired category.
The major steps are as follows:
1. Create a Twitter account and obtain API authorization
As we will be automating with Python, we need to authorize the Twitter API to work with Python. Sign in to the Twitter application page, click the "Create New App" button and fill in the required fields. You will need to obtain the "Access Token" and "Access Token Secret" (together with the "Consumer Key" and "Consumer Secret"). These tokens will be used by the Python script in the later parts.
2. Using Python and Tweepy
The Tweepy module will be used to handle Twitter-related actions such as posting tweets, getting results and following users. The snippet below shows how to initialize the API object used for posting tweets and other Twitter-related calls. It requires the consumer key and secret as well as the access tokens from part 1.
import os, sys, datetime, re
import tweepy
import ConfigParser

def get_twitter_api():
    config_file_list = ['directory/configfile_that_contain_credentials.ini']
    # get the config_file that exists
    config_file = [n for n in config_file_list if os.path.exists(n)][0]  # take the first entry

    parser = ConfigParser.ConfigParser()
    parser.read(config_file)
    CONSUMER_KEY = parser.get('CONFIG', 'CONSUMER_KEY')
    CONSUMER_SECRET = parser.get('CONFIG', 'CONSUMER_SECRET')
    ACCESS_KEY = parser.get('CONFIG', 'ACCESS_KEY')
    ACCESS_SECRET = parser.get('CONFIG', 'ACCESS_SECRET')

    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)
    api = tweepy.API(auth)
    return api
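For reference, the credentials file read by get_twitter_api() could look roughly like the sketch below. The section name 'CONFIG' and the four key names match the parser.get() calls above; the values are placeholders to be replaced with your own keys from part 1.

; configfile_that_contain_credentials.ini (placeholder values)
[CONFIG]
CONSUMER_KEY = your_consumer_key_here
CONSUMER_SECRET = your_consumer_secret_here
ACCESS_KEY = your_access_token_here
ACCESS_SECRET = your_access_token_secret_here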
3. Getting Contents
We can either create our own content or pull content from various sources (the Twitter account then acts as a sort of feed/content aggregator). We will explore one simple case: displaying RSS feeds from various sources (such as blogs and news sites) as content for our Twitter bot. The first step is to get all the RSS feed links from the various sites. Below are some Python scripts that aid in the collection of RSS feeds, links and contents. The main module used is python pattern, which handles all the URL/RSS feed access and downloading.
You can pip install the following modules for the Python snippets below: pattern, smallutils and pandas.
3.1 Getting all URL links from a particular website
This is for cases such as an aggregation site that displays a list of websites whose links you would like to collect. Note that the following script retrieves all the link tags on the website, so there may be redundant data. You can set a filter to limit the search or manually select from the output list.
import re
from pattern.web import URL, extension
from pattern.web import find_urls
from pattern.web import Newsfeed

def get_all_url_link_fr_target_website(tgt_site):
    """ Quick way to harvest all the url links and extract those that are feeds"""
    url = URL(tgt_site)
    page_source = url.download()
    return find_urls(page_source)

tgt_site = 'http://www.example.com'  # placeholder: replace with the target website
site_list = []
for site in [n for n in get_all_url_link_fr_target_website(tgt_site) if not re.search("jpg|jpeg|png|ico|bit|svg|js", n)]:
    site_list.append(site)

# keep only links that look like top-level website urls
site_list = [n for n in site_list if re.search("http(?:s)?://(?:www.)?[a-zA-Z0-9_]*.[a-zA-Z0-9_]*/$", n)]

for n in sorted(site_list):
    print n
3.2 Getting RSS feed links from a website
Sometimes it is difficult to find the RSS link for a particular website or blog. The following script searches for any RSS feed links on the website and outputs them. Again, there may be some redundant links present.
import re
from pattern.web import URL, extension
from pattern.web import find_urls
from pattern.web import Newsfeed
import smallutils as su

def get_feed_link_fr_target_website(tgt_site, pull_one=1):
    """ Get the feed url from target website
        Args:
            tgt_site = url of target site
            pull_one = pull only 1 particular feed link
    """
    url = URL(tgt_site)
    page_source = url.download()
    if pull_one:
        return [n for n in find_urls(page_source) if re.search("feed|feeds", n)][0]
    else:
        return [n for n in find_urls(page_source) if re.search("feed|feeds", n)]

tgt_file = r'directory/txtfile_with_all_url.txt'
url_list = su.read_data_fr_file(tgt_file)

for url in url_list:
    try:
        w = get_feed_link_fr_target_website(url, 0)
    except:
        continue
    if type(w) == list:
        for n in w:
            print n
3.3 Extracting contents from the RSS feeds
To extract contents from the RSS feeds, we need a Python module that can parse an RSS feed structure (primarily XML format). We will make use of python pattern for the RSS feed parsing and pandas to save the extracted data in CSV format. The following snippet takes in a file that contains a list of feed URLs and retrieves the corresponding feeds.
from pattern.web import URL, extension
from pattern.web import find_urls
from pattern.web import Newsfeed
import smallutils as su
import pandas as pd

def extract_site_name_fr_url(feed_url):
    """ Minimal helper (assumed implementation): derive a short site name from the feed url"""
    return feed_url.split('//')[-1].split('/')[0]

def get_feed_details_fr_url_list(url_list, save_csvfilename):
    """ Get the feeds info and save as dataframe to target location"""
    target_list = []
    for feed_url in url_list:
        print feed_url
        if feed_url == "-":
            break
        try:
            for result in Newsfeed().search(feed_url)[:2]:
                print repr(result.title), repr(result.url), repr(result.date)
                temp_data = {"title": result.title, "feed_url": result.url,
                             "date": result.date,
                             "ref": extract_site_name_fr_url(feed_url)}
                target_list.append(temp_data)
            print "*" * 18
            print
        except:
            print "No feeds found"
            continue

    ## save to pandas dataframe and output as csv
    df = pd.DataFrame(target_list)
    df.to_csv(save_csvfilename, index=False, encoding='utf-8')

tgt_file = r'directory\tgt_file_that_contain_list_of_feeds_url.txt'
url_list = su.read_data_fr_file(tgt_file)
get_feed_details_fr_url_list(url_list, r"output\feed_result.csv")

You can also refer to the post below on feeds extraction.
3.4 URL shortener
Normally we would like to include the actual link in the tweet along with the content. However, sometimes the URL is too long and may hit Twitter's character limit. In this case, we can use a URL shortener. There are a couple of URL shortener services, such as Google and TinyURL. We will incorporate TinyURL in our Python script.
from pattern.web import URL, extension

def shorten_target_url(tgt_url):
    agent = 'http://tinyurl.com/api-create.php?url={}'
    query_url = agent.format(tgt_url)
    url = URL(query_url)
    page_source = url.download()
    return page_source
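A quick usage example (the URL below is only an illustrative placeholder; TinyURL returns the shortened link as the response body):

short_url = shorten_target_url('https://www.example.com/some/very/long/article-url')
print short_url  # prints a http://tinyurl.com/... link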
4. Posting contents to Twitter
We make use of the snippets in sections 2 and 3 to create a combined script that authenticates the user, gets all the feeds from a text file containing a list of feed URLs, selects a few of the more recent feeds and posts them to the Twitter account with targeted hashtags and URL shortening. Do observe proper tweeting etiquette and avoid spamming.
import os, sys, datetime, time
import pandas as pd
from FeedsHandler import get_feed_details_fr_url_list
from urlshortener import shorten_target_url
from initialize_twitter_api import get_twitter_api
import smallutils as su

if __name__ == "__main__":

    print "start of project"

    ## Defined parameters
    tgt_file_list = [r'directory\tgt_file_contain_feedurl_list.txt']
    # get the tgt_file that exists
    tgt_file = [n for n in tgt_file_list if os.path.exists(n)][0]  # take the first entry
    feeds_outputfile = r"c:\data\temp\feed_result.csv"
    hashtags = '#DIY #hacks'  # include hash tags
    feeds_sample_size = 8

    ## Get feeds from url list
    print "Get feeds from target url list ... "
    url_list = su.read_data_fr_file(tgt_file)
    get_feed_details_fr_url_list(url_list, feeds_outputfile)

    ## Read the feeds_outputfile and process the feeds data
    print "Handling the feeds data"
    feeds_df = pd.read_csv(feeds_outputfile)
    feeds_df['date'] = pd.to_datetime(feeds_df['date'])

    ## filter the date within one day to today
    feeds_df['date_delta'] = datetime.datetime.now() - feeds_df['date']
    feeds_df['date_delta_days'] = feeds_df['date_delta'].apply(lambda x: float(x.days))
    feeds_df_filtered = feeds_df[feeds_df['date_delta_days'] < 1]

    ## shorten the feed urls for posting
    feeds_df_filtered['feeds_url_shorten'] = feeds_df_filtered['feed_url'].apply(shorten_target_url)

    if len(feeds_df_filtered) > feeds_sample_size:  # do a sampling if the input is high
        feeds_df_filtered_sample = feeds_df_filtered.sample(feeds_sample_size)
    else:
        feeds_df_filtered_sample = feeds_df_filtered

    ## set up for twitter api
    print "Initialized the Twitter API"
    api = get_twitter_api()

    ## handling message to twitter
    print "Sending all data to twitter"
    for index, row in feeds_df_filtered_sample.iterrows():
        # convert to full text for output
        target_txt = 'Via @' + row['ref'] + ': ' + row['title'] + ' ' + row['feeds_url_shorten'] + ' ' + hashtags
        try:
            api.update_status(target_txt)
        except:
            pass
        time.sleep(60 * 30)
5. Scheduling tweets
We can use either the Windows Task Scheduler or a cron job to schedule the daily tweet posting.
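For example, on Linux a crontab entry along the lines of the one below would run the bot every day at 9 am. It assumes the combined script from section 4 is saved as twitter_bot.py; the paths are placeholders for your own setup.

0 9 * * * /usr/bin/python /path/to/twitter_bot.py >> /path/to/twitter_bot.log 2>&1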
6. What to do next
The above content is derived mainly from RSS feeds. We can add content by retweeting or embedding YouTube videos automatically. A sample Twitter bot created using the above methods is included in the link.
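As a pointer, retweeting can also be done through Tweepy. Below is a minimal sketch (not part of the bot above, and using the older Tweepy search call that matches the version assumed in this post): it searches recent tweets for a hashtag and retweets a few of them, reusing the api object from section 2. The query and count are arbitrary illustrations.

from initialize_twitter_api import get_twitter_api

api = get_twitter_api()

# search recent tweets matching a hashtag and retweet a handful of them
for status in api.search(q='#DIY', count=3):
    try:
        api.retweet(status.id)
    except:
        pass  # skip tweets that are already retweeted or protected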
You can also refer to some of the earlier posts on retrieving data from Twitter.