Machine Learning

Predict Product Attributes from Product Listings Part 2 – Pipelines & GridSearch

Further improvement on the Product Attributes Text Classifier

This is part 2 of extracting attributes from product titles, with the following improvements and additions:

  1. Creating a more generic text cleaning function.
  2. Adding GridSearch for hyperparameter tuning.

Text Cleaning Function

I created a more generic text cleaning function that can accommodate various text data sets. It can serve as a base function for text-related problems. With all options enabled, the function performs the following:

  1. Converting all text to lowercase.
  2. Stripping HTML tags, especially if the data is scraped from the web.
  3. Replacing accented characters with the closest English alphabets/characters.
  4. Removing special characters, which includes punctuation. Digits may or may not be removed depending on context. (Digits are not removed for this data set.)
  5. Removing stop words (simple vs detailed). If detailed, the text is tokenized before removal; otherwise a simple word replacement is used.
  6. Removing extra white spaces and newlines.
  7. Normalizing text. This refers to either stemming or lemmatizing.

In this example, we only turn on:

  1. converting text to lowercase,
  2. removing special characters (digits need to be kept) and extra white spaces,
  3. doing a simple stop word removal.

As mentioned in the previous post, a seller is unlikely to include many stop words and will try to keep the title as concise as possible, given the limited characters and the need to make the title relevant to search engines. As the titles are short, text normalization is skipped to save time.
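
As a rough illustration of the three options turned on here, the sketch below applies them with the standard library only. The `clean_title` name and the tiny `STOPWORDS_DEMO` set are made up for this example; the actual function in this post uses the NLTK stop word list.

```python
import re

# Hypothetical, tiny stop word list for illustration only
STOPWORDS_DEMO = {"the", "a", "an", "with", "and", "for", "of"}

def clean_title(title):
    """Lowercase, drop special characters (digits kept), remove stop words."""
    text = title.lower()
    # separators become a space so that neighbouring words do not merge
    text = re.sub(r"[/(){}\[\]|@,;]", " ", text)
    # everything that is not a letter, digit or whitespace is dropped
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # simple stop word removal; split() also collapses extra white space
    return " ".join(w for w in text.split() if w not in STOPWORDS_DEMO)

print(clean_title("The NEW Samsung Galaxy-S9 (64GB) with 4G, Black!!"))
```

Digits such as "64gb" and "4g" survive the cleaning, which matters for this data set because many attributes are numeric.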

# Text pre-processing modules
import re
from bs4 import BeautifulSoup
import unidecode
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# Load the spaCy English model; parser and NER are not needed for lemmatizing
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
STOPWORDS = set(stopwords.words('english'))

# Compile regular expressions
SPEC_CHARS_REPLACE_BY_SPACE = re.compile(r'[/(){}\[\]|@,;]')
# The range must be A-Z; 'A-z' would also keep the characters [ \ ] ^ _ `
SPEC_CHARS = re.compile(r'[^a-zA-Z0-9\s]')
SPEC_CHARS_INCLUDE_DIGITS = re.compile(r'[^a-zA-Z\s]')
EXTRA_NEWLINES = re.compile(r'[\r\n]+')
## Functions for text preprocessing, cleaning

def strip_htmltags(text):
    soup = BeautifulSoup(text,"lxml")
    return soup.get_text()

def replace_accented_chars(text):
    return unidecode.unidecode(text)

def stem_text(text):
    ps = PorterStemmer()
    modified_txt = ' '.join([ps.stem(word) for word in text.split()])
    return modified_txt    

def lemmatize(text):
    modified_text = nlp(text)
    return ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in modified_text])

def normalize(text, method='stem'):
    """ Text normalization to generate the root form of the inflected words.
        This is done by either "stem" or "lemmatize" the text as defined by the 'method' argument.
        Note that using "lemmatize" will take much longer to run compared to "stem".
    """
    if method == 'stem':
        return stem_text(text)
    if method == 'lemmatize':
        return lemmatize(text)
    print('Please choose either "stem" or "lemmatize" method to normalize.')
    return text

def rm_special_chars(text, rm_digits=False):
    # replace the special chars below with a space
    modified_txt = SPEC_CHARS_REPLACE_BY_SPACE.sub(' ', text)

    # remove the rest of the special chars, without replacing with a space
    if rm_digits:
        return SPEC_CHARS_INCLUDE_DIGITS.sub('', modified_txt)
    return SPEC_CHARS.sub('', modified_txt)

def rm_extra_newlines_and_whitespace(text):
    # rm extra newlines
    modified_txt =  EXTRA_NEWLINES.sub(' ', text)

    # rm extra whitespaces
    return re.sub(r'\s+', ' ', modified_txt)

def rm_stopwords(text, simple=True):
    """ Remove stopwords using either the simple mode (split on whitespace and filter)
        or nltk.tokenize to split the words before filtering. The latter incurs a speed penalty.
    """
    if simple:
        return ' '.join(word for word in text.split() if word not in STOPWORDS)
    tokens = word_tokenize(text)
    tokens = [token.strip() for token in tokens]
    return ' '.join(word for word in tokens if word not in STOPWORDS)

def clean_text(raw_text, strip_html = True, replace_accented = True,
                normalize_text = True, normalize_methd = 'stem',
                remove_special_chars = True, remove_digits = True,
                remove_stopwords = True, rm_stopwords_simple_mode = True):

    """ The combined function for all the various preprocessing method.
        Keyword args:
            strip_html               : Remove html tags.
            replace_accented         : Convert accented characters to closest English characters.
            normalize_text           : Normalize text based on normalize_methd.
            normalize_methd          : "stem" or "lemmatize". Default "stem".
            remove_special_chars     : Remove special chars.
            remove_digits            : Remove digits/numeric as special characters.
            remove_stopwords         : Stopwords removal based on NLTK corpus.
            rm_stopwords_simple_mode : Skip tokenizing before stopword removal. Speeds up processing.
    """

    text = raw_text.lower()

    if strip_html:
        text = strip_htmltags(text)
    if replace_accented:
        text = replace_accented_chars(text)
    if remove_special_chars:
        text = rm_special_chars(text, remove_digits)
    if normalize_text:
        text = normalize(text, normalize_methd)
    if remove_stopwords:
        text = rm_stopwords(text, rm_stopwords_simple_mode)

    text = rm_extra_newlines_and_whitespace(text)  

    return text

Grid Search for Hyper Parameters Tuning

Using pipelines, it is easy to incorporate the sklearn grid search to sweep through the various hyperparameters and select the best values. The two main parameters tuned are:

  1. ngram range in CountVectorizer:
    • In the first part, we only looked at unigrams (single words), but some attributes are identified by more than one word (e.g. 4G network, 32GB memory), so we sweep the ngram range to find the optimal range.
    • The larger the ngram range, the more feature columns are generated, so memory consumption increases.
  2. alpha in SGDClassifier
    • This will affect the regularization term and the learning rate of the training model.

With the ngram range and alpha swept and the best values selected, we can see quite a significant improvement in accuracy for all the attribute predictions compared to the first version. Most of the improvement comes from setting the ngram range to (1,3), i.e. including unigrams, bigrams and trigrams. This is expected, as many attributes are described by more than one word.
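
To get a feel for how the ngram range affects the number of features, here is a small plain-Python sketch of word n-gram generation. It only mimics the idea behind the `ngram_range` setting and is not how CountVectorizer is implemented; the `word_ngrams` helper and the sample title are made up for illustration.

```python
def word_ngrams(text, ngram_range=(1, 1)):
    """Return all word n-grams of `text` for n in ngram_range (inclusive)."""
    tokens = text.split()
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

title = "samsung galaxy 32gb 4g"
for rng in [(1, 1), (1, 2), (1, 3)]:
    print(rng, len(word_ngrams(title, rng)), "features")

print(word_ngrams(title, (2, 2)))
```

Even for a four-word title, the feature count grows with the upper bound of the range, which is where the extra memory consumption comes from.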

# Model modules
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Prepare model -- Drop na and keep those with values
# (df is assumed to be the training DataFrame loaded elsewhere in the script)
def get_X_Y_data(x_col, y_col):
    sub_df =  df[[x_col, y_col]]
    sub_df = sub_df.dropna()
    return sub_df[x_col], sub_df[y_col]

# Model training & GridSearch
def generate_model(X, y, verbose = 1):

    text_vect_pipe = Pipeline([
                            ('vect', CountVectorizer()),
                            ('tfidf', TfidfTransformer())
                            ])

    pred_model = Pipeline([
                ('process', text_vect_pipe),
                ('clf', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42, max_iter=5, tol=None))
                ])

    # Nested pipeline parameters are addressed as <step>__<step>__<parameter>
    parameters = {}
    parameters['process__vect__ngram_range'] = [(1,1),(1,2),(1,3)]  # (1,1) is unigrams only
    parameters['clf__loss'] = ["hinge"]
    parameters['clf__alpha'] = [5e-6,1e-5]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)

    CV = GridSearchCV(pred_model, parameters)
    CV.fit(X_train, y_train)
    y_pred = CV.predict(X_test)

    print('accuracy %s' % accuracy_score(y_test, y_pred))
    print("Details of GridSearch")

    if verbose:
        print('Best score and parameter combination = ')
        print(CV.best_score_)
        print(CV.best_params_)
        print("Grid scores on development set:")
        means = CV.cv_results_['mean_test_score']
        stds = CV.cv_results_['std_test_score']
        for mean, std, params in zip(means, stds, CV.cv_results_['params']):
            print("%0.3f (+/-%0.03f) for %r"
                  % (mean, std * 2, params))

    return CV

X, y = get_X_Y_data('title1', 'Brand')
brand_model = generate_model(X, y)

The full script is below. The text cleaning function takes up a large part of the code. Excluding that function, the addition of a few lines of code for the grid search and pipeline can bring a relatively significant accuracy improvement.

Next Actions

So far only text features are considered. In the next part we will try adding numeric features to see if further improvement can be made.

See Also

  1. Predict Product Attributes from Product Listing Title — Text Feature Extraction and Classification



Predict Product Attributes From Product Listing Title — Text Feature Extraction and Classification

Extracting Attributes from Product Title and Image

This is a National (Singapore) Data Science Challenge organised by Shopee and hosted on Kaggle. In the advanced category, the task is to extract a list of attributes from each product listing given the product title and the accompanying image (a text and an image input). Training sets and full instructions are available in the Kaggle link. This is a short attempt at the problem which includes basic data exploration, data cleaning, feature extraction and classification.

Basic Data Exploration

While the project requirement has 3 main product categories (Beauty, Mobile and Fashion), I will just focus on the Mobile data set. The two other categories will follow the same approach. For the Mobile data set, the requirement is to extract attributes such as Brand, Phone Model, Camera, Phone Screen Size and Color Family. A brief exploration of the training data set showed the following:

  1. Only the title (text) and image (pic) are available to predict the several attributes of the product.
  2. The attributes are already label-encoded.
  3. There are a lot of missing values; attributes like Network Connections have more than 80% of the data missing. This is expected, as sellers are unlikely to put some of these more obscure attributes in the title, while attributes like Brand and Model should have less missing data.

From the seller’s perspective, a seller will try to include as much information as possible in a concise manner, especially attributes like brand and model, to make the posting relevant to search and stand out to buyers. Using only the image to extract attributes such as Brand and Model might be difficult, especially for the Mobile category, where phones are hard to tell apart from a picture even by the human eye.

From the exploration, I planned the following steps.

  1. Using title (text) as main classification input and ignore images.
  2. Train and predict one attribute at a time.

Basic Data and Text Cleaning

Some attributes, such as Network Connections and Warranty Period, have a large proportion of missing data. However, for those attributes the majority of the observations share one value. In this case, the missing values are assigned the mode of the training population (e.g. for Network Connections, most phones are likely to be 4G). The attributes are also converted to integers for training.
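
The mode fill can be sketched in a few lines of plain Python. The `fill_with_mode` helper and the sample values are made up for illustration; in a pandas workflow the same idea is usually expressed with the column's `mode()` and `fillna()`.

```python
from collections import Counter

def fill_with_mode(values):
    """Replace None entries with the most common non-missing value."""
    observed = [v for v in values if v is not None]
    mode_value = Counter(observed).most_common(1)[0][0]
    return [mode_value if v is None else v for v in values]

# Hypothetical 'Network Connections' column with missing entries
network = ["4g", None, "4g", "3g", None, "4g"]
filled = fill_with_mode(network)
print(filled)
```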

For the title, before extracting the numeric features, we perform cleaning on the data set. Since most users highlight the most important features in the product title to make their product stand out and relevant, they generally omit most stop words, punctuation and white spaces. Hence for this data set, I will try minimal cleaning: change the title to lowercase and remove special characters. This can save a significant amount of time in text cleaning, especially for large data sets.

Data Cleaning and pipelines

For the advanced data extraction, I chose the Bag-of-Words (BOW) model to generate the features from the text columns. Within the BOW model, I use the TF-IDF approach, which computes the weighted frequency of each word in each title. For classification, SVM is chosen as the classifier. Pipelining makes it easy to streamline the whole text processing and attribute classification so that it can run on all the different attributes.
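
To make the "weighted frequency" concrete, the sketch below computes TF-IDF weights by hand for three made-up titles. It assumes the smoothed IDF formula, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row, which is what scikit-learn documents as the default behaviour of TfidfTransformer. The `tfidf` helper is only for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Smoothed TF-IDF with L2-normalised rows, one dict per document."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # document frequency: number of titles containing each word
    df = Counter(word for tokens in tokenised for word in set(tokens))
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {w: c * (math.log((1 + n_docs) / (1 + df[w])) + 1)
                   for w, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        rows.append({w: v / norm for w, v in weights.items()})
    return rows

titles = ["samsung galaxy black", "apple iphone black", "samsung note gold"]
weights = tfidf(titles)
print(weights[0])
```

In the first title, "galaxy" appears in only one title and therefore gets a higher weight than "samsung" or "black", which each appear in two.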

Below is the complete code running from extraction, cleaning to classification.

Further Improvement

This is the starting point of the project and takes only a few lines of code to get up and running for quick analysis. I will improve the existing code by incorporating grid search for hyperparameters and expanding on the pipelines and features in subsequent posts.

See Also

  1. Predict Product Attributes from Product Listings Part 2 – Pipelines & GridSearch


Using k-means clustering to detect abnormal profile or sudden trough


For a particular test we are handling, we need to ensure that a particular metric A maintains a certain parabolic or relatively flat profile across a range of metric B. In recent days, we encountered an issue where certain samples of the population experience a significant and sudden drop in metric A within a sub-range of metric B.

We need to comb through the population to detect the samples that have the abnormal profile, as shown in the chart below, for further failure analysis. While it is easy to identify by eye which samples show abnormal performance after plotting metric A against metric B, it is impractical to scan through all the plots to identify the problem samples.


I decided to use machine learning to comb through the population to find the defective samples. Given the limited training samples on hand and the hassle of getting more data, I will use unsupervised learning for quick detection in this case.

** Note: the examples below use randomly generated data that models the real data set.


Certain pre-processing is done on the actual data but not on the sample data. Some of the usual pre-processing tasks are listed below.

  1. Check for and remove missing data (can use df.isnull().sum()).
  2. Drop columns that are not required (df.drop()).

Features Engineering

To detect the abnormal profile, I need to build features that might be able to differentiate a normal from an abnormal profile. Below are some of the features I can think of, each derived by aggregating Metric A measured across all Metric B for each sample:

  1. Standard deviation of Metric A
    • Abnormal profile will have larger stddev due to the sharp drop.
  2. Range of Metric A
    • larger range of max – min for the abnormal profile.
  3. Standard deviation of Running delta of Metric A
    • Running delta is defined as the difference between Metric A at a particular Metric B and Metric A at the previous Metric B. A sudden dip in Metric A will be reflected in a sudden large delta.
    • Standard deviation of the running delta will catch the variation in the rise and dip.
  4. Max of Running delta of Metric A
    • This will display the largest delta within a particular sample.
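
The four features above can be sketched with NumPy as follows. The `profile_features` helper and the two small profiles are made up for illustration, and each profile is assumed to be already ordered by Metric B.

```python
import numpy as np

def profile_features(metric_a):
    """Aggregate one sample's Metric A values (ordered by Metric B) into 4 features."""
    a = np.asarray(metric_a, dtype=float)
    running_delta = np.diff(a)            # a[i] - a[i-1]
    return {
        "std": a.std(),
        "range": a.max() - a.min(),
        "delta_std": running_delta.std(),
        "delta_max": running_delta.max(),
    }

normal = [10.0, 10.2, 10.3, 10.2, 10.0]       # flat / gently parabolic
abnormal = [10.0, 10.2, 6.0, 10.2, 10.0]      # sudden trough in the middle

f_normal = profile_features(normal)
f_abnormal = profile_features(abnormal)
for name in f_normal:
    print(name, round(f_normal[name], 3), round(f_abnormal[name], 3))
```

All four features come out larger for the profile with the trough, which is what the clustering step relies on.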

Scaling and K-means Clustering

Basic scaling is done to normalize the features before applying KMeans. All the functions are from scikit-learn. The number of KMeans clusters is set to 2 (normal vs abnormal profile).
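
A minimal sketch of this step is shown below. The feature table is synthetic, and the choice of StandardScaler is an assumption (the post only says "basic scaling"); treating the smaller cluster as the abnormal one is likewise an assumption that only holds when abnormal samples are the minority.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Synthetic feature table: columns = [std, range, delta_std, delta_max]
normal = rng.normal(loc=[0.1, 0.3, 0.1, 0.2], scale=0.02, size=(40, 4))
abnormal = rng.normal(loc=[1.5, 4.0, 3.0, 4.0], scale=0.2, size=(5, 4))
features = np.vstack([normal, abnormal])

# Scale each feature, then split the samples into 2 clusters
scaled = StandardScaler().fit_transform(features)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

# KMeans label numbers are arbitrary; treat the smaller cluster as "abnormal"
abnormal_label = np.argmin(np.bincount(labels))
flagged = np.where(labels == abnormal_label)[0]
print("flagged sample indices:", flagged)
```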


This is a short and quick way to get some of the samples out for failure analysis, but it will need further fine-tuning before being turned on in production.

Sample Script


Tensorflow: Low Level API with iris DataSets

This post demonstrates the basic use of the TensorFlow low level core API and TensorBoard to build machine learning models for study purposes. There are higher level APIs (TensorFlow Estimators etc.) which simplify some of the process and are easier to use, at the cost of some level of control. If fine-grained control is not required, a higher level API might be a better option.

The following Python script uses the iris data set and the following Python modules to build and run the model: NumPy, scikit-learn and TensorFlow. For this program, NumPy is used mainly for array manipulation. Scikit-learn is used for min-max scaling, test-train set splitting and one-hot encoding of the categorical output. The iris data set is imported using the scikit-learn module.

A. Data Preparation

There are 4 input features (all numeric), 150 data rows and 3 categorical outputs for the iris data set. The data processing steps are as below:

  1. Split into training and test set.
  2. Min-Max Scaling (‘Normalization’) on the features to cater for features with different units or scales.
  3. Encode the categorical outputs (3 types: setosa, virginica and versicolor ) using one-hot encoding.

import tensorflow as tf
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# reset graph
tf.reset_default_graph()

## Loading the data set
raw_data =  load_iris()

## split data set (stratify on the target so class proportions are preserved)
X_train, X_test, Y_train, Y_test = train_test_split(raw_data.data, raw_data.target, test_size=0.33, random_state=42, stratify=raw_data.target)

## max min scalar on parameters
X_scaler = MinMaxScaler(feature_range=(0,1))

## Preprocessing the dataset: fit the scaler on the training set only, then apply it to the test set
X_train_scaled = X_scaler.fit_transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

## One hot encode Y (the encoder is fitted on the training labels only)
## Note: scikit-learn >= 1.2 renames the 'sparse' argument to 'sparse_output'
onehot_encoder = OneHotEncoder(sparse=False)
Y_train_enc = onehot_encoder.fit_transform(Y_train.reshape(-1,1))
Y_test_enc = onehot_encoder.transform(Y_test.reshape(-1,1))
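
To make the scaling and encoding steps concrete, here is a NumPy-only sketch of what min-max scaling and one-hot encoding compute. The arrays are made up; the scaling statistics are taken from the training rows and reused for the test rows, which is the usual convention.

```python
import numpy as np

X_train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
X_test = np.array([[2.0, 40.0]])

# "fit": learn min and max per feature from the training set only
col_min, col_max = X_train.min(axis=0), X_train.max(axis=0)

def min_max(X):
    """Map each feature to [0, 1] using the training min and max."""
    return (X - col_min) / (col_max - col_min)

X_train_scaled = min_max(X_train)
X_test_scaled = min_max(X_test)   # test values may fall outside [0, 1]

# one-hot encoding of integer class labels 0..2
y = np.array([0, 2, 1])
y_onehot = np.eye(3)[y]

print(X_train_scaled)
print(X_test_scaled)
print(y_onehot)
```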

B. Model definition or building the computation graph

Next we will build the computation graph. As defined by Tensorflow: “a computational graph is a series of TensorFlow Operations arranged into a graph of nodes. Each node takes zero or more tensors as inputs and produces a tensor as output”. Hence, we would need to define certain key nodes and operations such as the inputs, outputs, hidden layers etc.

The following are the key nodes or layers required:

  1. Input: This will be a tf.placeholder for data feeding. The shape depends on the number of features.
  2. Hidden layers: Here we are using 2 hidden layers. The output of each hidden layer is of the form f(XW+B), where X is the input from either the previous layer or the input layer itself, W is the weights and B is the bias. f() is an activation function.
    • The Rectified Linear Unit (ReLU) activation function is selected for this example to introduce non-linearity to the system. ReLU: A(x) = max(0, x), i.e. output x when x > 0 and 0 when x <= 0. The sigmoid activation function can also be used for this example.
    • Weights and biases are variables here. They are updated at each training step/epoch in this case.
    • Weights are initialized with xavier_initializer and biases are initialized to zero.
  3. Output or prediction or y hat: This is the output of the neural network, the computation result from the hidden layers.
  4. Y: the actual output used for comparison against the predicted value. This will be a tensor (tf.placeholder) for data feeding.
  5. Loss function: Computes the error between the predicted and the actual classification (Yhat vs Y). The TensorFlow built-in function tf.nn.softmax_cross_entropy_with_logits is used for the multi-class classification problem. As described by TensorFlow: “It measures the probability error in discrete classification tasks in which the classes are mutually exclusive (each entry is in exactly one class)”.
  6. Train model or optimizer: This defines the training algorithm used to minimize the cost or loss. For this example, we are using gradient descent to find the minimum cost by updating the various weights and biases.
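
The pieces above (f(XW+B) with ReLU, logits, and the softmax cross-entropy loss) can be sketched in NumPy as follows. This is only an illustration of the arithmetic, not the TensorFlow graph: it uses a single hidden layer of 8 units for brevity, random made-up inputs, and zero output weights so that the expected loss is easy to check.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax_cross_entropy(logits, onehot):
    """Mean cross-entropy between softmax(logits) and one-hot labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(onehot * log_probs).sum(axis=1).mean()

rng = np.random.RandomState(0)
X = rng.rand(5, 4)                    # 5 rows, 4 features (as in iris)
Y = np.eye(3)[[0, 1, 2, 1, 0]]        # one-hot labels for 3 classes

W1, b1 = rng.randn(4, 8) * 0.1, np.zeros(8)   # hidden layer: f(XW + B)
W2, b2 = np.zeros((8, 3)), np.zeros(3)        # output layer (zero weights here)

hidden = relu(X @ W1 + b1)
logits = hidden @ W2 + b2
loss = softmax_cross_entropy(logits, Y)
print(loss)   # zero logits give a uniform prediction over the 3 classes
```

With all logits equal, every class gets probability 1/3, so the loss equals ln(3); training then reduces it from there.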

In addition, the learning rate and the total number of steps or epochs are defined for the above model.

# Define Model Parameters
learning_rate = 0.01
training_epochs = 10000

# define the number of neurons
layer_1_nodes = 150
layer_2_nodes = 150

# define the number of inputs
num_inputs = X_train_scaled.shape[1]
num_output = len(np.unique(Y_train, axis = 0)) 

# Define the layers
with tf.variable_scope('input'):
    X = tf.placeholder(tf.float32, shape= (None, num_inputs))

with tf.variable_scope('layer_1'):
    weights = tf.get_variable('weights1', shape=[num_inputs, layer_1_nodes], initializer = tf.contrib.layers.xavier_initializer())
    biases = tf.get_variable('bias1', shape=[layer_1_nodes], initializer = tf.zeros_initializer())
    layer_1_output =  tf.nn.relu(tf.matmul(X, weights) +  biases) 

with tf.variable_scope('layer_2'):
    weights = tf.get_variable('weights2', shape=[layer_1_nodes, layer_2_nodes], initializer = tf.contrib.layers.xavier_initializer())
    biases = tf.get_variable('bias2', shape=[layer_2_nodes], initializer = tf.zeros_initializer())
    layer_2_output =  tf.nn.relu(tf.matmul(layer_1_output, weights) + biases)

with tf.variable_scope('output'):
    weights = tf.get_variable('weights3', shape=[layer_2_nodes, num_output], initializer = tf.contrib.layers.xavier_initializer())
    biases = tf.get_variable('bias3', shape=[num_output], initializer = tf.zeros_initializer())
    prediction =  tf.matmul(layer_2_output, weights) + biases

with tf.variable_scope('cost'):
    Y = tf.placeholder(tf.float32, shape = (None, num_output))  # one column per class, since Y is one-hot encoded
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels = Y, logits = prediction))

with tf.variable_scope('train'):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

with tf.variable_scope('accuracy'):
    correct_prediction = tf.equal(tf.argmax(Y, axis =1), tf.argmax(prediction, axis =1) )
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# Logging results
with tf.variable_scope("logging"):
    tf.summary.scalar('current_cost', cost)
    tf.summary.scalar('current_accuracy', accuracy)
    summary = tf.summary.merge_all()

C. Running the computation Graph or Session

Actual computation takes place during the running of computation graph (handled by tf.Session). The first step is to initialize the global variables and create the log writer object to log the parameters defined in “logging” scope for Tensorboard.

Next we iterate through each training step. For simplicity, we use the full training data at each step to train and update the respective weights and biases by calling session.run on the optimizer.

Intermediate results are output every 5 steps, both to standard output and to the respective log files. The optimization uses the training data, but the accuracy is assessed on both the test and the training data.
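
The shape of this loop (one full-batch gradient descent step per iteration, with cost and accuracy logged every 5 steps) can be sketched without TensorFlow. The example below is an assumption-laden stand-in: it trains a plain softmax layer with no hidden layers on made-up data, purely to show the update and logging pattern.

```python
import numpy as np

rng = np.random.RandomState(0)
# toy data: 3 well-separated classes, 2 features
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
labels = np.repeat([0, 1, 2], 20)
X = centers[labels] + rng.randn(60, 2) * 0.3
Y = np.eye(3)[labels]

W, b = np.zeros((2, 3)), np.zeros(3)
learning_rate = 0.1
history = []

def forward(X):
    """Softmax probabilities for the current weights and biases."""
    logits = X @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

for step in range(200):
    probs = forward(X)
    grad_logits = (probs - Y) / len(X)       # d(mean cross-entropy)/d(logits)
    W -= learning_rate * X.T @ grad_logits   # full-batch gradient descent step
    b -= learning_rate * grad_logits.sum(axis=0)
    if step % 5 == 0:                        # log every 5 steps
        probs = forward(X)
        cost = -np.log((probs * Y).sum(axis=1)).mean()
        acc = (probs.argmax(axis=1) == labels).mean()
        history.append((step, cost, acc))

print(history[0])
print(history[-1])
```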

# Initialize a session so that we can run TensorFlow operations

with tf.Session() as session:

    # Run the global variable initializer to initialize all variables and layers of the neural network
    session.run(tf.global_variables_initializer())

    # create log file writer to record training progress.
    training_writer = tf.summary.FileWriter(r'C:\data\temp\tf_try\training', session.graph)
    testing_writer = tf.summary.FileWriter(r'C:\data\temp\tf_try\testing', session.graph)

    # Run the optimizer over and over to train the network.
    # One epoch is one full run through the training data set.
    for epoch in range(training_epochs):

        # Feed in the training data and do one step of neural network training
        session.run(optimizer, feed_dict={X:X_train_scaled, Y:Y_train_enc})

        # Every 5 training steps, log our progress
        if epoch %5 == 0:
            training_cost, training_summary = session.run([cost, summary], feed_dict={X: X_train_scaled, Y: Y_train_enc})
            testing_cost, testing_summary = session.run([cost, summary], feed_dict={X: X_test_scaled, Y: Y_test_enc})

            train_accuracy = session.run(accuracy, feed_dict={X: X_train_scaled, Y: Y_train_enc})
            test_accuracy = session.run(accuracy, feed_dict={X: X_test_scaled, Y: Y_test_enc})

            print(epoch, training_cost, testing_cost, train_accuracy, test_accuracy )

            training_writer.add_summary(training_summary, epoch)
            testing_writer.add_summary(testing_summary, epoch)

    # Training is now complete!
    print("Training is complete!\n")

    final_train_accuracy = session.run(accuracy, feed_dict={X: X_train_scaled, Y: Y_train_enc})
    final_test_accuracy = session.run(accuracy, feed_dict={X: X_test_scaled, Y: Y_test_enc})

    print("Final Training Accuracy: {}".format(final_train_accuracy))
    print("Final Testing Accuracy: {}".format(final_test_accuracy))


D. Viewing in Tensorboard

The logging of the cost and the accuracy (tf.summary.scalar) allows us to view the performance of both the test and train set.

The results are as shown below.

Final Training Accuracy: 1.0
Final Testing Accuracy: 0.9599999785423279


Analyzing Iris Data Set with Scikit-learn

The following code demonstrates the use of Python scikit-learn to analyze/categorize the iris data set commonly used in machine learning. This post also highlights several of the methods and modules available for various machine learning studies.

While the code is not very lengthy, it covers a fairly comprehensive range of topics, as below:

  1. Data preprocessing: data encoding, scaling.
  2. Feature decomposition/dimension reduction with PCA. PCA is not needed or applicable to the Iris data set as the number of features is only 4. Nevertheless, it is shown here as a tool.
  3. Splitting test and training set.
  4. Classifier: Logistic Regression. Only logistic regression is shown here. Random forest and SVM can also be used for this dataset.
  5. GridSearch: for parameters sweeping.
  6. Pipeline: combining all the steps, and running GridSearch on the Pipeline.
  7. Scoring metrics, Cross Validation, confusion matrix.

import sys, re, time, datetime, os
import numpy as np
import pandas as pd
import seaborn as sns
from pylab import plt

from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV

from sklearn.metrics import accuracy_score, confusion_matrix

def print_cm(cm, labels, hide_zeroes=False, hide_diagonal=False, hide_threshold=None):
    """
        pretty print for confusion matrixes
        Code from:
    """
    columnwidth = max([len(x) for x in labels]+[5]) # 5 is value length
    empty_cell = " " * columnwidth
    # Print header
    print "    " + empty_cell,
    for label in labels:
        print "%{0}s".format(columnwidth) % label,
    print
    # Print rows
    for i, label1 in enumerate(labels):
        print "    %{0}s".format(columnwidth) % label1,
        for j in range(len(labels)):
            cell = "%{0}.1f".format(columnwidth) % cm[i, j]
            if hide_zeroes:
                cell = cell if float(cm[i, j]) != 0 else empty_cell
            if hide_diagonal:
                cell = cell if i != j else empty_cell
            if hide_threshold:
                cell = cell if cm[i, j] > hide_threshold else empty_cell
            print cell,
        print

def pca_2component_scatter(data_df, predictors, legend):
    """
        outlook of data set by decomposing data to only 2 pca components.
        do: scaling --> either maxmin or stdscaler
    """

    print 'PCA plotting'

    data_df[predictors] =  StandardScaler().fit_transform(data_df[predictors])

    pca_components = ['PCA1','PCA2'] #make this exist then insert the fit transform
    pca = PCA(n_components = 2)
    for n in pca_components: data_df[n] = ''
    data_df[pca_components] = pca.fit_transform(data_df[predictors])

    sns.lmplot('PCA1', 'PCA2',
       data=data_df,
       hue=legend,
       scatter_kws={"marker": "D",
                    "s": 100})

if __name__ == "__main__":

    iris =  load_iris()
    target_df = pd.DataFrame(data=iris.data, columns=iris.feature_names )

    #combining the categorial output
    target_df['species'] = pd.Categorical.from_codes(codes=iris.target,categories = iris.target_names)
    target_df['species_coded'] = iris.target #encoding --> as provided in iris dataset

    print '\nList of features and output'
    print target_df.columns.tolist()

    print '\nOutlook of data'
    print target_df.head()

    print "\nPrint out any missing data for each rows. "
    print np.where(target_df.isnull())

    predictors =[ n for n in target_df.columns.tolist() if n not in  ['species','species_coded']]
    target = 'species_coded' #use the encoded version y-train, y-test

    print '\nPCA plotting'
    pca_2component_scatter(target_df, predictors, 'species')

    print "\nSplit train test set."
    X_train, X_test, y_train, y_test = train_test_split(target_df[predictors], target_df[target], test_size=0.25, random_state=42)
    #test_size -- should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split
    #random state -- Pseudo-random number generator state used for random sampling.(any particular number use?
    print "Shape of training set: {}, Shape of test set: {}".format(X_train.shape, X_test.shape)

    print "\nCreating pipeline with the estimators"
    estimators = [
                    ('reduce_dim', PCA()),
                    ('clf', LogisticRegression())#the logistic regression use from ML teset not part of actual test. --> may have to change the way it is done
                 ]

    #Parameters of the estimators in the pipeline can be accessed using the <estimator>__<parameter> syntax:
    pipe = Pipeline(estimators)

    #input the grid search
    params = dict(reduce_dim__n_components=[2, 3, 4], clf__C=[0.1, 10, 100,1000])
    grid_search = GridSearchCV(pipe, param_grid=params, cv =5)
    grid_search.fit(X_train, y_train)

    print '\nGrid Search Results:'
    gridsearch_result = pd.DataFrame(grid_search.cv_results_)
    gridsearch_display_cols = ['param_' + n for n in params.keys()] + ['mean_test_score']
    print gridsearch_result[gridsearch_display_cols]
    print '\nBest Parameters: ', grid_search.best_params_
    print '\nBest Score: ', grid_search.best_score_

    print "\nCross validation Performance on the training set with optimal parms"
    pipe.set_params(reduce_dim__n_components=4)#how much PCA should reduce??
    scores = cross_val_score(pipe, X_train, y_train, cv=5)
    print scores

    print "\nPerformance on the test set with optimal parms:"
    pipe.fit(X_train, y_train)
    predicted = pipe.predict(X_test)

    print 'Accuracy Score on test set: {}'.format(accuracy_score(y_test, predicted))

    print "\nCross tab(confusion matrix) on results:"

    print_cm(confusion_matrix(y_test, predicted),iris.target_names)
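
For readers new to the confusion matrix, the sketch below builds one by hand for made-up labels. It follows the same layout that scikit-learn's confusion_matrix uses (rows are the true classes, columns are the predicted classes); the `confusion_counts` helper is hypothetical and only for illustration.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """matrix[i][j] = number of samples of true class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2, 1]
cm = confusion_counts(y_true, y_pred, 3)

# accuracy is the share of samples on the diagonal
accuracy = sum(cm[i][i] for i in range(3)) / float(len(y_true))
for row in cm:
    print(row)
print(accuracy)
```

Off-diagonal cells show which classes get confused with each other, which a single accuracy figure hides.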



Installing XGBoost On Windows

Below is a guide to installing the XGBoost Python module on a Windows system (64-bit). It can be used as another ML model alongside scikit-learn. For more information on XGBoost or “Extreme Gradient Boosting”, you can refer to the following material.

The following steps are compiled from the combined information in the 3 links below:

  1. Installing Xgboost on Windows
  2. xgboost readthedocs
  3. StackOverFlow

The resources used are as below. All have to be for the 64-bit platform.

  1. Git Bash for Windows
  2. MinGW (TDM-GCC) for building. Ensure that the OpenMP install option is ticked. Please see details here.

The commands below have to be run in Git Bash on Windows (you may encounter errors if using the Windows cmd prompt).

  1. git clone --recursive
  2. cd xgboost
  3. git submodule init
  4. git submodule update

The additional steps below resolve the “build” issue:

  1. cd dmlc-core
  2. mingw32-make -j4
  3. cd ../rabit
  4. mingw32-make lib/librabit_empty.a -j4
  5. cd ..
  6. cp make/
  7. mingw32-make -j4

You can use an alias for mingw32-make (alias make='mingw32-make').

Finally, set up the Python installation.

  1. cd xgboost\python-package
  2. python install

Note that Python, NumPy and SciPy need to be installed beforehand. All have to be 64-bit.

After successful installation, you can try out the following quick example to verify that the xgboost module is working.