I was testing the Google search script I created last week by searching for “best hotels to stay in Tokyo”. My objective was to find suitable, popular hotels in Tokyo that fall within my budget.
The Python module was created to display more meaningful and relevant data without having to click through to individual websites. However, the meta title and meta description from the search results alone are not enough to obtain meaningful results.
I tried modifying the module to extract the paragraphs from each site and output them together with the meta descriptions. I also changed the script to handle multiple newline characters and to debug the Unicode error that kept popping up when outputting the text results.
To extract the paragraphs from each site, I used the XPath expression below.
    sel = Selector(response)
    paragraph_list = sel.xpath('//p/text()').extract()
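To see what `//p/text()` selects without running a full crawl, here is a minimal sketch that uses only Python's standard-library `html.parser` instead of Scrapy's `Selector`. The class name `ParagraphTextParser` and the sample HTML are made up for illustration. Like the XPath expression, it keeps only the text nodes that sit directly inside a `<p>` element, so text inside a nested link is skipped.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ParagraphTextParser(HTMLParser):
    """Collect text nodes whose direct parent is a <p> element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []        # stack of currently open elements
        self.paragraph_list = []   # collected text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Keep the text only when the innermost open element is <p>.
        if self.open_tags and self.open_tags[-1] == "p":
            self.paragraph_list.append(data)


# Made-up sample page, not real search output.
html = """
<html><body>
  <h1>Tokyo Hotels</h1>
  <p>Find hotels in Tokyo.</p>
  <p>Read <a href="#">reviews</a> first.<br>Then book.</p>
</body></html>
"""

parser = ParagraphTextParser()
parser.feed(html)
parser.close()
paragraph_list = parser.paragraph_list
print(paragraph_list)
```

Note that the second paragraph comes back as several fragments because the link and the line break split its text, which is part of why the joined output reads so choppily.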
To handle the Unicode error, I made the following changes. The Stack Overflow link provides the solution to the problem.
    # Convert the paragraph list to one continuous string
    para_str = self.join_list_of_str(paragraph_list, joined_chars='..')
    # Replace any unknown unicode characters with ?
    para_str = para_str.encode(errors='replace')
    # Remove newline characters
    para_str = self.remove_whitespace_fr_raw(para_str)
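The two helper methods above are not shown in this post, so the following is only a rough, standalone sketch of the same three steps. It assumes Python 3 and an ASCII target, which is stated explicitly here because the default encoding differs between Python versions. The function names `join_paragraphs`, `to_ascii` and `collapse_whitespace` are hypothetical stand-ins, not the module's actual methods.

```python
import re


def join_paragraphs(paragraphs, joined_chars=".."):
    """Join non-empty paragraph fragments into one string (hypothetical stand-in)."""
    return joined_chars.join(p.strip() for p in paragraphs if p.strip())


def to_ascii(text):
    """Replace every character that ASCII cannot represent with '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


def collapse_whitespace(text):
    """Collapse runs of whitespace, including newlines, into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


# Made-up fragments containing blank entries and non-ASCII characters.
paragraph_list = ["Tokyo hotels\n\n", "  ", "Caf\u00e9 near Shinjuku \u2013 open late"]

para_str = collapse_whitespace(to_ascii(join_paragraphs(paragraph_list)))
print(para_str)
```

Replacing characters with `?` avoids the encoding exception but loses information. If the output destination can handle UTF-8, keeping the text as-is would preserve names and accented characters.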
With the paragraphs included in the output, I was basically reading large chunks of text, and it was certainly messy with the newlines removed. I could not really get good information out of it.
For example, it would be better to get the ranked hotels from the TripAdvisor site, but through the Google search module, TripAdvisor only shows its top page without any hotels listed. Below is the output I got from the TripAdvisor site for this search.
Tokyo Hotels: Check Out 653 hotels with 77,018 Reviews – TripAdvisor
Tokyo Hotels: Find 77,018 traveller reviews and 2,802 candid photos for 653 hotels in Tokyo, Japan on TripAdvisor.
Price per night..Property type..Neighbourhood..Traveller rating..Hotel class..Amenities..Property name..Hotel brand
Recursively crawling TripAdvisor itself would perhaps give more meaningful results.
Currently, I do not have much idea how to enhance the script to extract more meaningful data. Perhaps the next step is to use text processing, with the NLTK module, to summarize the paragraphs into meaningful data. However, I am not hopeful about the final results.
For this particular search query, perhaps it would be easier to write specific crawling methods for several target websites, such as TripAdvisor and Agoda, rather than relying on general text extraction.
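One way this idea could be organized is a small lookup table that maps a site's domain to its own extraction function, falling back to the general paragraph extraction for everything else. This is only a sketch of the structure. The extractor functions are empty placeholders with hypothetical names, and the URLs are illustrative.

```python
from urllib.parse import urlparse


def extract_tripadvisor(html):
    """Placeholder for a TripAdvisor-specific extractor (hypothetical)."""
    return {"strategy": "tripadvisor"}


def extract_agoda(html):
    """Placeholder for an Agoda-specific extractor (hypothetical)."""
    return {"strategy": "agoda"}


def extract_generic(html):
    """Fallback: the general paragraph-text extraction."""
    return {"strategy": "generic"}


# Domain -> extractor. More target sites can be added here.
SITE_EXTRACTORS = {
    "tripadvisor.com": extract_tripadvisor,
    "agoda.com": extract_agoda,
}


def pick_extractor(url):
    """Return the site-specific extractor for a URL, or the generic one."""
    host = (urlparse(url).hostname or "").lower()
    for domain, extractor in SITE_EXTRACTORS.items():
        # Match the bare domain and any subdomain such as "www.".
        if host == domain or host.endswith("." + domain):
            return extractor
    return extract_generic


for url in ("https://www.tripadvisor.com/", "http://example.org/page"):
    print(url, "->", pick_extractor(url).__name__)
```

The search module would then only need to call `pick_extractor(url)` for each result link, and the site-specific logic stays isolated in one function per site.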