Google Search results web crawler (Updates)

A continuation of the project based on the following post “Google Search results web crawler (re-visit Part 2)” & “Getting Google Search results with Scrapy”. The project will first obtain all the links of the google search results of target search phrase and comb through each of the link and save them to a text file.

Two new main features are added. First main feature allows multiple keywords to be search at one go. Multiple search phrases can be entered from a target file and search all at one go.

There is also an option to converge all the results of all the search phrases. This is useful when all the search phrases are related and you wish to see all the top ranked results group together. The results will display all the top search result of all the key phrases followed by the 2nd and so forth.

Other options include specifying the number of text sentences of each result to print, min length of the sentence, sort results by date etc. Below are the key options to choose from:

    NUM_SEARCH_RESULTS = 30  # number of search results returned
    SENTENCE_LIMIT = 50
    MIN_WORD_IN_SENTENCE = 6
    ENABLE_DATE_SORT = 0

The second feature is an experimental feature that deal with language processing. It will try to retrieve all the noun phrases from all the search results and note the its frequency. The idea is to retrieve the most popular noun phrases based on the results of all the search, this is something similar to word cloud.

This is done using the python pattern module which also deal with the HTML request and processing used in the script. Under the pattern module, there is sub module that handles natural language processing. For this feature, the pattern module will tokenize the text and (part-of-speech) tag each of the word. With the in-built tag identifcation, you can specify it to detect noun phrase chunk tag or NP (Tags: DT+RB+JJ+NN + PR). For more part-of-speech tag, you can refer to pattern website. I have included part of the code for the noun phrase detection (Under pattern_parsing.py).

def get_noun_phrases_fr_text(text_parsetree, print_output = 0, phrases_num_limit =5, stopword_file=''):
    """ Method to return noun phrases in target text with duplicates
        The phrases will be a noun phrases ie NP chunks.
        Have the in build stop words --> check folder address for this.
        Args:
            text_parsetree (pattern.text.tree.Text): parsed tree of orginal text

        Kwargs:
            print_output (bool): 1 - print the results else do not print.
            phrases_num_limit (int): return  the max number of phrases. if 0, return all.
        
        Returns:
            (list): list of the found phrases. 

    """
    target_search_str = 'NP' #noun phrases
    target_search = search(target_search_str, text_parsetree)# only apply if the keyword is top freq:'JJ?+ NN NN|NNP|NNS+'

    target_word_list = []
    for n in target_search:
        if print_output: print retrieve_string(n)
        target_word_list.append(retrieve_string(n))

    ## exclude the stop words.
    if stopword_file:
        with open(stopword_file,'r') as f:
            stopword_list = f.read()
        stopword_list = stopword_list.split('\n')

    target_word_list = [n for n in target_word_list if n.lower() not in stopword_list ]

    if (len(target_word_list)>= phrases_num_limit and phrases_num_limit>0):
        return target_word_list[:phrases_num_limit]
    else:
        return target_word_list
        
def retrieve_top_freq_noun_phrases_fr_file(target_file, phrases_num_limit, top_cut_off, saveoutputfile = ''):
    """ Retrieve the top frequency words found in a file. Limit to noun phrases only.
        Stop word is active as default.
        Args:
            target_file (str): filepath as str.
            phrases_num_limit (int):  the max number of phrases. if 0, return all
            top_cut_off (int): for return of the top x phrases.
        Kwargs:
            saveoutputfile (str): if saveoutputfile not null, save to target location.
        Returns:
            (list) : just the top phrases.
            (list of tuple): phrases and frequency

    """
    with open(target_file, 'r') as f:
        webtext =  f.read()

    t = parsetree(webtext, lemmata=True)

    results_list = get_noun_phrases_fr_text(t, phrases_num_limit = phrases_num_limit, stopword_file = r'C:\pythonuserfiles\google_search_module_alt\stopwords_list.txt')

    #try to get frequnecy of the list of words
    counts = Counter(results_list)
    phrases_freq_list =  counts.most_common(top_cut_off) #remove non consequencial words...
    most_common_phrases_list = [n[0] for n in phrases_freq_list]

    if saveoutputfile:
        with open(saveoutputfile, 'w') as f:
            for (phrase, freq) in phrases_freq_list:
                temp_str = phrase + ' ' + str(freq) + '\n'
                f.write(temp_str)
            
    return most_common_phrases_list, phrases_freq_list

The second feature is very crude and give rise to quite a number of redundant phrases. However, in some cases, are able to pick up certain key phrases. Below are the frequency results based on list of the search key phrases. As seen, the accuracy still need some refinement.

Key phrases

Top cafes in singapore
where to go to for coffee in singapore
Recommended cafes in singapore
Most popular cafes singapore

================
Results

=================

Singapore 139
coffee 45
the past year 23
plenty 23
the Singapore cafe scene 22
new additions 22
View Photo 19
PH 16
cafes 14
20 Best Cafes 13
Fri 11
Coffee 11
Nylon 10
Thu 10
Artistry 10
Indonesia 10
The coffee 9
The Plain 9
Chye Seng Huat Hardware 9
the coffee 9
Photos 9
you re 9
Everton Park 8
sugar 8
Hours 8
t 8
Changi Airport 7
time 7
Food 7
p. 7
Common Man Coffee Roasters 7
Tel 7
Rise & Grind Coffee Co 6
good coffee 6
40 Hands 6
a lot 6
the cafe 6
The Coffee Bean 6
your friends 6
Malaysia 6
s 6
a cup 6
Korea 6
Sarnies 6
Waffles 6
Address 6
Chinese New Year 6
desserts 6
the river 6
Taiwan 6
home 6
the city 5
service 5
the best coffee 5
Tea Leaf 5
great coffee 5
a couple 5
the heart 5
people 5
the side 5
Nylon Coffee Roasters 5
hours 5
Singaporeans 5
food 5
any time 5
eve 5
eggs 5
a bit 5
Eve 5
the day 5
kopi 5
Thailand 5
brunch 5
their coffee 5
Chinatown 5
Restaurants 4
Brunch 4
the top 4
Jalan Besar 4
Ideas 4
Dutch Colony 4
night 4
Cafes 4
a variety 4
Visit 4
course 4
Melbourne 4
The Best 4

Main script can be obtained from Github.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s