Google Search results web crawler (Updates)

A continuation of the project based on the following posts: “Google Search results web crawler (re-visit Part 2)” and “Getting Google Search results with Scrapy”. The project first obtains all the links of the google search results for a target search phrase, then combs through each of the links and saves the contents to a text file.

Two new main features are added. The first main feature allows multiple keywords to be searched in one go: multiple search phrases can be entered in a target file and all searched at once, as sketched below.
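
A minimal sketch of loading the phrases (the file path and the one-phrase-per-line format are assumptions, not the script's actual settings):

    with open(r'c:\data\temp\search_phrases.txt', 'r') as f:  # hypothetical phrase file
        search_words = [line.strip() for line in f if line.strip()]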

There is also an option to converge the results of all the search phrases. This is useful when the search phrases are related and you wish to see all the top-ranked results grouped together. The results will display the top search result of each key phrase, followed by the 2nd-ranked results and so forth.

Other options include specifying the number of text sentences of each result to print, the minimum number of words in a sentence, sorting results by date, etc. Below are the key options to choose from:

    NUM_SEARCH_RESULTS = 30   # number of search results returned per search phrase
    SENTENCE_LIMIT = 50       # max number of text sentences to print per result
    MIN_WORD_IN_SENTENCE = 6  # min number of words for a sentence to be kept
    ENABLE_DATE_SORT = 0      # 1 - sort results by date descending, 0 - sort by relevance

The second feature is an experimental one that deals with language processing. It tries to retrieve all the noun phrases from all the search results and note their frequency. The idea is to retrieve the most popular noun phrases based on the results of all the searches, somewhat similar to a word cloud.

This is done using the python pattern module, which also handles the HTML requests and processing used in the script. Within the pattern module, there is a sub-module that handles natural language processing. For this feature, the pattern module tokenizes the text and tags each word with its part of speech. With the in-built tag identification, you can specify it to detect the noun phrase chunk tag, or NP (Tags: DT+RB+JJ+NN + PR). For more part-of-speech tags, you can refer to the pattern website. As a quick illustration (a minimal sketch, not taken from the script itself):
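
    from pattern.en import parsetree
    from pattern.search import search

    t = parsetree('The new cafe serves excellent single origin coffee.')
    for match in search('NP', t):   # match every noun phrase chunk
        print match.string          # e.g. 'The new cafe', 'excellent single origin coffee'

I have included part of the code for the noun phrase detection below (under pattern_parsing.py).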

from collections import Counter

from pattern.en import parsetree
from pattern.search import search

## Note: retrieve_string is a helper defined elsewhere in pattern_parsing.py;
## it joins the words of a match back into a single string.

def get_noun_phrases_fr_text(text_parsetree, print_output = 0, phrases_num_limit = 5, stopword_file = ''):
    """ Method to return noun phrases in target text with duplicates.
        The phrases will be noun phrases, i.e. NP chunks.
        Stop words can be excluded by supplying a stop word file (see stopword_file).
        Args:
            text_parsetree (pattern.text.tree.Text): parsed tree of original text.

        Kwargs:
            print_output (bool): 1 - print the results, else do not print.
            phrases_num_limit (int): max number of phrases to return. If 0, return all.
            stopword_file (str): path to a stop word file; if given, phrases found in it are excluded.

        Returns:
            (list): list of the found phrases.

    """
    target_search_str = 'NP'  # noun phrase chunks
    target_search = search(target_search_str, text_parsetree)  # only apply if the keyword is top freq: 'JJ?+ NN NN|NNP|NNS+'

    target_word_list = []
    for n in target_search:
        if print_output: print retrieve_string(n)
        target_word_list.append(retrieve_string(n))

    ## exclude the stop words (only when a stop word file is supplied).
    if stopword_file:
        with open(stopword_file, 'r') as f:
            stopword_list = f.read().split('\n')
        target_word_list = [n for n in target_word_list if n.lower() not in stopword_list]

    if phrases_num_limit > 0 and len(target_word_list) >= phrases_num_limit:
        return target_word_list[:phrases_num_limit]
    else:
        return target_word_list
        
def retrieve_top_freq_noun_phrases_fr_file(target_file, phrases_num_limit, top_cut_off, saveoutputfile = ''):
    """ Retrieve the top frequency noun phrases found in a file.
        Stop word filtering is active by default.
        Args:
            target_file (str): filepath as str.
            phrases_num_limit (int): max number of phrases. If 0, return all.
            top_cut_off (int): return only the top x phrases.
        Kwargs:
            saveoutputfile (str): if not null, save the results to the target location.
        Returns:
            (list): just the top phrases.
            (list of tuple): phrases and frequency.

    """
    with open(target_file, 'r') as f:
        webtext = f.read()

    t = parsetree(webtext, lemmata=True)

    results_list = get_noun_phrases_fr_text(t, phrases_num_limit = phrases_num_limit, stopword_file = r'C:\pythonuserfiles\google_search_module_alt\stopwords_list.txt')

    ## get the frequency of each phrase in the list
    counts = Counter(results_list)
    phrases_freq_list = counts.most_common(top_cut_off)  # keep only the top x most frequent phrases
    most_common_phrases_list = [n[0] for n in phrases_freq_list]

    if saveoutputfile:
        with open(saveoutputfile, 'w') as f:
            for (phrase, freq) in phrases_freq_list:
                temp_str = phrase + ' ' + str(freq) + '\n'
                f.write(temp_str)

    return most_common_phrases_list, phrases_freq_list
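
A typical call would look like the following (the file paths here are illustrative only, not the script's actual paths):

    ## assume the consolidated search results have been saved to a text file
    top_phrases, phrases_freq = retrieve_top_freq_noun_phrases_fr_file(
                                    r'c:\data\temp\search_results.txt',  # hypothetical input file
                                    phrases_num_limit = 0,               # keep all phrases found
                                    top_cut_off = 100,                   # return the 100 most frequent
                                    saveoutputfile = r'c:\data\temp\phrases_freq.txt')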

The second feature is very crude and gives rise to quite a number of redundant phrases. However, in some cases it is able to pick up certain key phrases. Below are the frequency results based on the list of search key phrases. As seen, the accuracy still needs some refinement.

Key phrases

Top cafes in singapore
where to go to for coffee in singapore
Recommended cafes in singapore
Most popular cafes singapore

================
Results
================

Singapore 139
coffee 45
the past year 23
plenty 23
the Singapore cafe scene 22
new additions 22
View Photo 19
PH 16
cafes 14
20 Best Cafes 13
Fri 11
Coffee 11
Nylon 10
Thu 10
Artistry 10
Indonesia 10
The coffee 9
The Plain 9
Chye Seng Huat Hardware 9
the coffee 9
Photos 9
you re 9
Everton Park 8
sugar 8
Hours 8
t 8
Changi Airport 7
time 7
Food 7
p. 7
Common Man Coffee Roasters 7
Tel 7
Rise & Grind Coffee Co 6
good coffee 6
40 Hands 6
a lot 6
the cafe 6
The Coffee Bean 6
your friends 6
Malaysia 6
s 6
a cup 6
Korea 6
Sarnies 6
Waffles 6
Address 6
Chinese New Year 6
desserts 6
the river 6
Taiwan 6
home 6
the city 5
service 5
the best coffee 5
Tea Leaf 5
great coffee 5
a couple 5
the heart 5
people 5
the side 5
Nylon Coffee Roasters 5
hours 5
Singaporeans 5
food 5
any time 5
eve 5
eggs 5
a bit 5
Eve 5
the day 5
kopi 5
Thailand 5
brunch 5
their coffee 5
Chinatown 5
Restaurants 4
Brunch 4
the top 4
Jalan Besar 4
Ideas 4
Dutch Colony 4
night 4
Cafes 4
a variety 4
Visit 4
course 4
Melbourne 4
The Best 4

The main script can be obtained from GitHub.

Google Search results web crawler (re-visit Part 2)

Added two new features to the Google search results web crawler. This is a continuation of the previous work on the web crawler with pattern. The script can be found on GitHub.

The first feature returns the Google search results sorted by date relevance. To turn on the date filter manually in a Google search, the URL string “&as_qdr=d” is appended. For the script-based crawler, the URL string to append is “&tbs=qdr:d,sbd:1”, which sorts by date in descending order, i.e. the most recent results first.
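
For example (a minimal sketch; the base URL here is the standard Google search URL, not necessarily the exact string the script forms):

    search_phrase = 'Sheng Siong buy'
    base_url = 'https://www.google.com/search?q=' + '+'.join(search_phrase.split())
    date_sorted_url = base_url + '&tbs=qdr:d,sbd:1'  # past day, sorted by date descending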

The second feature is the enable_results_converging option, which merges the results from a list of keyword searches. The merging is such that the top results from each search keyword are grouped together, i.e. it lists all the #1 results together, followed by all the #2 results and so forth.
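
The merging logic can be sketched as a round-robin interleave of the ranked result lists (an illustration only, not the script's actual implementation):

    from itertools import izip_longest  # Python 2; zip_longest in Python 3

    def converge_results(results_per_keyword):
        """ Interleave ranked result lists so that all the #1 results come
            first, followed by all the #2 results and so forth.
            results_per_keyword: one ranked list of results per keyword. """
        merged = []
        for rank_group in izip_longest(*results_per_keyword):
            merged.extend([r for r in rank_group if r is not None])
        return merged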

A sample run of the script is shown below. The date filter is turned off in this case. The example focuses on fetching all the news on a particular stock, “Sheng Siong”, by searching for multiple keywords. It is assumed that the most relevant results are grouped at the top of each list, hence consolidating all the same-ranked results provides more useful information.

        print 'Start search'

        ## User options
        NUM_SEARCH_RESULTS = 5                # number of search results returned 
        search_words = ['Sheng Siong buy' , 'Sheng Siong sell', 'Sheng Siong sentiment', 'Sheng Siong stocks review', 'Sheng siong stock market']  # set the keyword setting
        ## Create the google search class
        hh = gsearch_url_form_class(search_words)

        ## Set the results
        hh.set_num_of_search_results(NUM_SEARCH_RESULTS)
        #hh.enable_sort_date_descending()# enable sorting of date by descending. --> not enabled

        ## Generate the Url list based on the search item
        url_list =  hh.formed_search_url()

        ## Parse the google page based on the url
        hh.parse_all_search_url()
        hh.consolidated_results()
        
        print 'End Search'

The top 5 outputs are displayed below. The link from the google results plus the description are printed. Note that there are repeated entries, as some keywords return the exact same website. Further work is ongoing to remove the duplicates.

================
Results
================

link: http://www.shengsiong.com.sg/
Description:
Sheng Siong
****
link: http://www.shengsiong.com.sg/
Description:
Sheng Siong
****
link: http://www.sharejunction.com/sharejunction/listMessage.htm%3FtopicId%3D10021%26msgbdName%3DSheng%2520Siong%26topicTitle%3DSheng%2520Siong
Description:
ShareJunction – Stock Forum Messages : Sheng Siong
****
link: https://sg.finance.yahoo.com/echarts%3Fs%3DOV8.SI
Description:
Sheng Siong Share Price Chart | OV8.SI – Yahoo! Singapore Finance
****
link: http://sbr.com.sg/source/motley-fool-singapore/here-are-5-things-you-should-know-about-sheng-siong
Description:
Here are 5 things you should know about Sheng Siong | Singapore …
****
link: Sheng+Siong+buy&hq=Sheng+Siong+buy&hnear=0x31da1767b42b8ec9:0x400f7acaedaa420,Singapore
Description:
Local business results for Sheng Siong buy near Singapore
****

Further work includes scraping the individual sites for more details, much like what was done in the post with Scrapy. The duplicate entries will also be addressed; one possible approach is sketched below.
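
One way to remove the duplicates (a sketch only; the eventual fix may differ) is to keep the first occurrence of each link while preserving the converged order:

    def remove_duplicate_links(results):
        """ results: list of (link, description) tuples in ranked order.
            Returns the list with repeated links dropped, order preserved. """
        seen = set()
        unique_results = []
        for link, description in results:
            if link not in seen:
                seen.add(link)
                unique_results.append((link, description))
        return unique_results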