Selecting which keywords to bid for in Shopee keyword ads can be a hassle on the Shopee platform. One of the main reasons is that you cannot really sort or rank the keywords by number of searches and/or bid price within the platform. Having the option to export to Excel/csv can really help with the analysis, and it is really simple to do so with Python.
Navigate to "Create (keyword) Ads". Select Add Keywords and add as many related keywords as you like. Once completed, save the page as an html file. Next we will use python pandas to parse the table tags in the html file and generate a pandas DataFrame.
## Sample Code
import pandas as pd
tgt = r'C:\yourfilelocation\shopee.html'
# read_html returns a list of tables.
# For this page, table 0 is the header row and table 1 is the data
tb = pd.read_html(tgt)
# Assign header from table 0 to table 1
tb[1].columns = tb[0].columns.tolist()
# Drop columns that are entirely empty
bid_table = tb[1].dropna(axis=1, how='all')
# Save to csv (to_excel can be used to save to Excel instead)
bid_table.to_csv(r'c:\data\output.csv', index=False)
A sample of the output is shown above. I usually sort by search volume (highest first). I also add a Search Vol/Num Bids column, which gives some indication of the search volume per 0.1 cents of bids.
Selenium can be used to automate the saving of the html file.
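A minimal sketch of this with Selenium is shown below. It assumes Chrome with chromedriver available; the URL is a placeholder and you may need to log in and navigate to the keyword ads page manually before saving.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://seller.shopee.sg')  # placeholder URL; log in and open the Create Keyword Ads page
input('Press Enter once the keyword ads page is fully loaded...')
# save the rendered page source for pandas.read_html to parse
with open(r'C:\yourfilelocation\shopee.html', 'w', encoding='utf-8') as f:
    f.write(driver.page_source)
driver.quit()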
The "Quality Score" cannot be parsed using the read_html method since it is rendered as an image. However, for those who are really keen, the quality score is reflected in the image tag attribute [style="width: x%"]. Parsing this will give the estimated quality score.
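A possible sketch for this is below; the img selector and the width-to-score mapping are assumptions and should be checked against the saved html.
import re
from bs4 import BeautifulSoup
with open(r'C:\yourfilelocation\shopee.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')
quality_scores = []
for img in soup.find_all('img', style=re.compile(r'width:\s*\d+%')):
    width_pct = int(re.search(r'width:\s*(\d+)%', img['style']).group(1))
    quality_scores.append(width_pct / 10)  # assumed mapping, e.g. width: 60% -> score ~6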
The recommendation is to follow the steps from the original well-written post and refer to the following notes to fill in some of the possible gaps.
Activate a virtualenv
Git clone the project (in the post) to local directory
Run pip install -r requirements.txt
Upgrade the Django version (an error will occur if this step is not performed): pip install django==1.11.17. This only applies if you are following the post and cloning the project used in the post.
Create a new user in Postgres, create a new database and grant access (Steps 1 & 2 of the post)
Update settings.py for the database portion.
Create environment variables in the virtualenv. See the link for more information.
Note: the Secret Key needs to be included as one of the environment variables.
Update the postactivate file of the virtualenv so the environment variables are present whenever the virtualenv is activated (see the example just after this list).
To get path of the virtualenv: echo $VIRTUAL_ENV
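For example, to open the postactivate hook of the current virtualenv for editing (any editor works; nano shown here):
# edit the postactivate hook of the active virtualenv
nano $VIRTUAL_ENV/bin/postactivate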
Create new user in Postgres
# Psql codes for Step 1 and 2 of original post.
# ensure Postgres server is running
psql
# create user with password
CREATE USER sample_user WITH PASSWORD 'sample_password';
# create database
CREATE DATABASE sample_database WITH OWNER sample_user;
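Step 2 of the original post also grants the user access to the database; the usual psql statement for that, using the same sample names, is:
# grant access to the new database
GRANT ALL PRIVILEGES ON DATABASE sample_database TO sample_user;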
Update the database information in settings.py
# Changes in the settings.py (ensure os is imported at the top of the file)
import os

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': os.environ.get('DB_NAME', ''),
        'USER': os.environ.get('DB_USER', ''),
        'PASSWORD': os.environ.get('DB_PASS', ''),
        'HOST': 'localhost',
        'PORT': '5432',
    }
}

# SECURITY WARNING: keep the secret key used in production secret!
SECRET_KEY = os.environ.get('DJANGO_SECRET_KEY', '')
Update environment variables in VirtualEnv
# postactivate script in the project virtual env bin path.
# E.g. ~/.virtualenv/[projectname]/bin/postactivate
#!/bin/bash
# This hook is sourced after this virtualenv is activated.
export DB_NAME='sample_database'
export DB_USER='sample_user'
export DB_PASS='sample_password'
export DJANGO_SECRET_KEY='thisissecretkey'
Running migrations (Ensure PostgreSQL server is running)
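The standard Django commands for this step, run from the project root with the virtualenv activated, are:
# apply the migrations, then start the development server
python manage.py migrate
python manage.py runserver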
If an error occurs when running python manage.py runserver on localhost, check that the domain is included in ALLOWED_HOSTS in settings.py. Alternatively, you can use the below:
ALLOWED_HOSTS = ['*'] # for local host only
If no database is created when running the psql command CREATE DATABASE …, check whether a semi-colon was added at the end of the statement. If the ';' is missing, type ';' and try inputting the commands again. See the link for more details.
The Boston Housing prices dataset is used for items 1 and 2, and the Titanic dataset for item 3.
Basic Python module import
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import load_boston
boston = load_boston()
X = boston.data
y = boston.target
df = pd.DataFrame(X, columns= boston.feature_names)
Multiple Histogram plots of numeric features
Stack the dataframe so that all the features are in a single column. This may consume significant memory if the dataset has a large number of features and observations.
If you need to separate by group (hue in FacetGrid), you can modify the numeric_features accordingly:
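A minimal sketch of the stacking and plotting, assuming the numeric features are taken directly from the Boston dataframe loaded earlier:
# select numeric feature columns and stack them into long format
numeric_features = df.select_dtypes(include=np.number).columns.tolist()
stacked = df[numeric_features].melt(var_name='feature', value_name='value')
# one histogram per feature; add hue='your_group_col' (hypothetical column) to split by group
g = sns.FacetGrid(stacked, col='feature', col_wrap=4, sharex=False, sharey=False)
g.map(plt.hist, 'value', bins=30)
plt.show()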
Google Sheets simplifies the process of web scraping, especially for table and list elements. For the project below, the purpose is to obtain common/essential words and their corresponding definitions for GMAT/GRE preparation.
In one of the cells, type =IMPORTHTML("url-site", "table", <table_id>) where <table_id> is the table position in the url (either guess, iterate from 1 upwards, or use Chrome developer tools to count the table number).
In one of the cells, type =IMPORTHTML("url-site", "list", <list_id>) where <list_id> is the list order in the url (either guess, iterate from 1 upwards, or use Chrome developer tools to count the list number).
The above techniques can also be applied to other websites that have list or table elements. For this project, one of the next steps is to create a flash cards video to help with the learning. With the table format in Google Sheets, it is easy to download the whole list or table as a .CSV file and turn it into flash cards. Check the link for the quick project.
Build your own study flash cards video (+ background music) using Python easily.
Required Modules
moviepy
ImageMagick — for creating text clip
pandas — optional for managing CSV file
Basic steps
Read in the text information. Pandas can be used to read in a .csv file for table manipulation.
Create a TextClip object for each text and append all the TextClips together.
Add in audio if desired, and allow the audio to loop through the duration of the clip.
Save the file as mp4.
Sample Python Project — Vocabulary flash cards
Below is a simple project to create a vocabulary list of common words used in the GMAT etc. For each word and meaning pair, it will flash the word followed by its meaning. There is a slight pause in the timing to give the user some time to recall the meaning of the particular word.
Sample table for wordlist.csv (which is essentially a table of words and their respective meanings). *Random sample (subset) obtained from the web.
import pandas as pd
from moviepy.editor import *

# collect all the generated TextClips here
clip_list = []

def create_txtclip(tgt_txt, duration=2, fontsize=18):
    try:
        txt_clip = TextClip(tgt_txt, fontsize=fontsize, color='black',
                            bg_color='white', size=(426, 240)).set_duration(duration)
        clip_list.append(txt_clip)
    except UnicodeEncodeError:
        txt_clip = TextClip("Issue with text", fontsize=fontsize, color='white').set_duration(2)
        clip_list.append(txt_clip)

df = pd.read_csv("wordlist.csv")
for word, meaning in zip(df.iloc[:, 0], df.iloc[:, 1]):
    create_txtclip(word, 1, 70)
    create_txtclip(meaning, 3)

# note: newer moviepy versions use concatenate_videoclips
final_clip = concatenate(clip_list, method="compose")

# optional music background with loop
music = AudioFileClip("your_audiofile.mp3")
audio = afx.audio_loop(music, duration=final_clip.duration)
final_clip = final_clip.set_audio(audio)
final_clip.write_videofile("flash_cards.mp4", fps=24, codec='mpeg4')
In some cases, the audio for the flash cards does not play in QuickTime; it will work in VLC.
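A workaround that may help (an assumption, not verified for every player) is to write the audio stream as AAC, which QuickTime tends to handle better than mp3 inside an mp4 container:
# specify an AAC audio codec when writing the file (possible QuickTime workaround)
final_clip.write_videofile("flash_cards.mp4", fps=24, codec='mpeg4', audio_codec='aac')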
This post covers basic PDF manipulation for daily tasks using simple Python modules.
Merging multiple PDFs
Extract text from PDF
Extract image from PDF
Merging PDF
from PyPDF2 import PdfFileMerger
pdfs = ['a.pdf', 'b.pdf']
merger = PdfFileMerger()
for pdf in pdfs:
    merger.append(pdf)
merger.write("output.pdf")
Extract text from PDF
import pdftotext
# Load your PDF
with open("Target.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)
# Save all text to a txt file.
with open('output.txt', 'w') as f:
    f.write("\n\n".join(pdf))
from google.colab import drive
drive.mount('/content/drive')
View list of files:
!ls "/content/drive/My Drive"
Note: In the notebook, click on the > arrow on the top left of the notebook, click on Files, select the file and right click to "copy path". Note that the path must begin with "/content/xxx".
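The copied path can then be used directly, for example with pandas (the folder and file names below are placeholders):
import pandas as pd
df = pd.read_csv('/content/drive/My Drive/yourfolder/yourfile.csv')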
Run IRkernel::installspec() to register the R kernel for the current user, OR IRkernel::installspec(user = FALSE) to install system-wide.
Open Jupyter and create a new notebook with the R kernel.
Further notes
Additional R libraries might be hard to install from inside the Notebook. As a workaround, install the desired library in the R terminal first, then open the Notebook.
If you need to use R.exe on the Windows command terminal, ensure R.exe is on the PATH. [likely location: C:\R\R-2.15.1\bin]
I created a more generic text cleaning function that can accommodate various text data sets. This can be used as a base function for text-related problem sets. The function, with all options enabled, will be able to perform the following:
Converting all text to lowercase.
Stripping html tags, especially if the data is scraped from the web.
Replacing accented characters with the closest English characters.
Removing special characters, including punctuation. Digits may or may not be excluded depending on context. (Digits are not removed for this data set.)
Removing stop-words (simple vs detailed: if detailed, the text is tokenized before removal, otherwise simple word replacement is used).
Removing extra white spaces and newlines.
Normalizing text, i.e. either stemming or lemmatizing.
In this example, we only turn on:
converting text to lowercase,
removing special characters (keeping digits) and extra white spaces,
doing a simple stop words removal.
As mentioned in the previous post, a seller is unlikely to include many stop words and will try to keep the title as concise as possible given the limited characters, and also to make the title more relevant to the search engine. As the text length is not too long, we will skip normalizing the text to save time.
# Text pre-processing modules
import re
from bs4 import BeautifulSoup
import unidecode
import spacy, en_core_web_sm
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
STOPWORDS = set(stopwords.words('english'))

# Compile regular expressions
SPEC_CHARS_REPLACE_BY_SPACE = re.compile(r'[/(){}\[\]\|@,;]')
SPEC_CHARS = re.compile(r'[^a-zA-Z0-9\s]')
SPEC_CHARS_INCLUDE_DIGITS = re.compile(r'[^a-zA-Z\s]')
EXTRA_NEWLINES = re.compile(r'[\r\n]+')
## Functions for text preprocessing, cleaning

def strip_htmltags(text):
    soup = BeautifulSoup(text, "lxml")
    return soup.get_text()

def replace_accented_chars(text):
    return unidecode.unidecode(text)

def stem_text(text):
    ps = PorterStemmer()
    modified_txt = ' '.join([ps.stem(word) for word in text.split()])
    return modified_txt

def lemmatize(text):
    modified_text = nlp(text)
    return ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in modified_text])

def normalize(text, method='stem'):
    """ Text normalization to generate the root form of the inflected words.
        This is done by either "stem" or "lemmatize" the text as defined by the 'method' argument.
        Note that using "lemmatize" will take much longer to run compared to "stem".
    """
    if method == 'stem':
        return stem_text(text)
    if method == 'lemmatize':
        return lemmatize(text)
    print('Please choose either "stem" or "lemmatize" method to normalize.')
    return text

def rm_special_chars(text, rm_digits=False):
    # remove & replace below special chars with space
    modified_txt = SPEC_CHARS_REPLACE_BY_SPACE.sub(' ', text)
    # remove rest of special chars, no replacing with space
    if rm_digits:
        return SPEC_CHARS_INCLUDE_DIGITS.sub('', modified_txt)
    else:
        return SPEC_CHARS.sub('', modified_txt)

def rm_extra_newlines_and_whitespace(text):
    # rm extra newlines
    modified_txt = EXTRA_NEWLINES.sub(' ', text)
    # rm extra whitespaces
    return re.sub(r'\s+', ' ', modified_txt)

def rm_stopwords(text, simple=True):
    """ Remove stopwords using either the simple mode with word replacement,
        or using nltk.tokenize to split the words before replacement. The latter incurs a speed penalty.
    """
    if simple:
        return ' '.join(word for word in text.split() if word not in STOPWORDS)
    else:
        tokens = word_tokenize(text)
        tokens = [token.strip() for token in tokens]
        return ' '.join(word for word in tokens if word not in STOPWORDS)

def clean_text(raw_text, strip_html=True, replace_accented=True,
               normalize_text=True, normalize_methd='stem',
               remove_special_chars=True, remove_digits=True,
               remove_stopwords=True, rm_stopwords_simple_mode=True):
    """ The combined function for all the various preprocessing methods.
        Keyword args:
            strip_html : Remove html tags.
            replace_accented : Convert accented characters to closest English characters.
            normalize_text : Normalize text based on normalize_methd.
            normalize_methd : "stem" or "lemmatize". Default "stem".
            remove_special_chars : Remove special chars.
            remove_digits : Remove digits/numeric as special characters.
            remove_stopwords : Stopwords removal based on NLTK corpus.
            rm_stopwords_simple_mode : Skip tokenizing before stopword removal to speed up processing.
    """
    text = raw_text.lower()
    if strip_html:
        text = strip_htmltags(text)
    if replace_accented:
        text = replace_accented_chars(text)
    if remove_special_chars:
        text = rm_special_chars(text, remove_digits)
    if normalize_text:
        text = normalize(text, normalize_methd)
    if remove_stopwords:
        text = rm_stopwords(text, rm_stopwords_simple_mode)
    text = rm_extra_newlines_and_whitespace(text)
    return text
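A quick usage example with only the options discussed above enabled is shown below (the sample title is made up for illustration):
sample_title = "Samsung Galaxy S9+ 64GB - Brand New & Sealed!!"  # hypothetical product title
cleaned = clean_text(sample_title,
                     strip_html=False,
                     replace_accented=False,
                     normalize_text=False,
                     remove_special_chars=True,
                     remove_digits=False,   # keep digits
                     remove_stopwords=True,
                     rm_stopwords_simple_mode=True)
print(cleaned)  # expected output: samsung galaxy s9 64gb brand new sealed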
Grid Search for Hyperparameter Tuning
Using pipelines, it is easy to incorporate the sklearn grid search to sweep through the various hyperparameters and select the best values. The two main parameters tuned are:
ngram_range in CountVectorizer:
In the first part, we only looked at unigrams (single words), but some attributes are identified by more than one word (e.g. 4G network, 32GB Memory), therefore we sweep the ngram range to find the optimal setting.
The larger the ngram range, the more feature columns are generated, so it is more memory consuming.
alpha in SGDClassifier:
This affects the regularization term and the learning rate of the training model.
With the ngram range and alpha parameters swept and the best values selected, we see quite a significant improvement in accuracy for all the attribute predictions compared to the first version. Most of the improvement comes from the ngram range adjusted to (1,3), i.e. accounting for trigrams. This is within expectation as more attributes are described by more than one word.
# sklearn imports needed for this snippet (assumes df from the earlier section)
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Prepare model -- drop na and keep those with values
def get_X_Y_data(x_col, y_col):
    sub_df = df[[x_col, y_col]]
    sub_df = sub_df.dropna()
    return sub_df[x_col], sub_df[y_col]

# Model training & GridSearch
def generate_model(X, y, verbose=1):
    text_vect_pipe = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer())
    ])

    pred_model = Pipeline([
        ('process', text_vect_pipe),
        ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                              random_state=42, max_iter=5, tol=None))
    ])

    parameters = {}
    parameters['process__vect__ngram_range'] = [(1, 1), (1, 2), (1, 3)]
    parameters['clf__loss'] = ["hinge"]
    parameters['clf__alpha'] = [5e-6, 1e-5]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    CV = GridSearchCV(pred_model, parameters)
    CV.fit(X_train, y_train)
    y_pred = CV.predict(X_test)

    print('accuracy %s' % accuracy_score(y_pred, y_test))
    print("==" * 18)
    print()
    print("Details of GridSearch")

    if verbose:
        print('Best score and parameter combination = ')
        print(CV.best_score_)
        print(CV.best_params_)
        print()
        print("Grid scores on development set:")
        means = CV.cv_results_['mean_test_score']
        stds = CV.cv_results_['std_test_score']
        for mean, std, params in zip(means, stds, CV.cv_results_['params']):
            print("%0.3f (+/-%0.03f) for %r"
                  % (mean, std * 2, params))
        print("==" * 18)
        print()
    return CV

X, y = get_X_Y_data('title1', 'Brand')
brand_model = generate_model(X, y)
print('=' * 29)
The full script is shown below. The text cleaning function takes up a large part of the code. Excluding that function, the addition of a few lines of code for the grid search and pipeline brings a relatively significant accuracy improvement.
Next Actions
So far only text features are considered; in the next part we will try adding numeric features to see if further improvement can be made.