Natural Language Processing¶
Dataset: The Multilingual Amazon Reviews Corpus¶
This project is available in this GitLab repository.
Details about the dataset:
- It can be downloaded here.
- Description.
- License
Let's import all the necessary libraries and define the constants and functions for the analysis:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
from time import time
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline
random_state = 2021 - 1 - 30  # Evaluates to 1990
fontsize = 16
base_sample = 10000
max_features = 2000
nlp = spacy.load("es_core_news_sm")
def print_elapsed_minutes(start_time):
    minutes_count = (time() - start_time) / 60
    print(f"{minutes_count:.2f} min")
Import training data and print the first 5 rows:
reviews_train_data = pd.read_json("./data/dataset_es_train.json", lines=True)
reviews_train_data.head()
| | review_id | product_id | reviewer_id | stars | review_body | review_title | language | product_category |
|---|---|---|---|---|---|---|---|---|
| 0 | es_0491108 | product_es_0296024 | reviewer_es_0999081 | 1 | Nada bueno se me fue ka pantalla en menos de 8... | television Nevir | es | electronics |
| 1 | es_0869872 | product_es_0922286 | reviewer_es_0216771 | 1 | Horrible, nos tuvimos que comprar otro porque ... | Dinero tirado a la basura con esta compra | es | electronics |
| 2 | es_0811721 | product_es_0474543 | reviewer_es_0929213 | 1 | Te obligan a comprar dos unidades y te llega s... | solo llega una unidad cuando te obligan a comp... | es | drugstore |
| 3 | es_0359921 | product_es_0656090 | reviewer_es_0224702 | 1 | No entro en descalificar al vendedor, solo pue... | PRODUCTO NO RECIBIDO. | es | wireless |
| 4 | es_0068940 | product_es_0662544 | reviewer_es_0224827 | 1 | Llega tarde y co la talla equivocada | Devuelto | es | shoes |
Exploratory Data Analysis¶
reviews_train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 8 columns):
 #   Column            Non-Null Count   Dtype
---  ------            --------------   -----
 0   review_id         200000 non-null  object
 1   product_id        200000 non-null  object
 2   reviewer_id       200000 non-null  object
 3   stars             200000 non-null  int64
 4   review_body       200000 non-null  object
 5   review_title      200000 non-null  object
 6   language          200000 non-null  object
 7   product_category  200000 non-null  object
dtypes: int64(1), object(7)
memory usage: 12.2+ MB
There are no null values and the only numeric column is the review stars.
A bar plot will help visualize the type of problem:
g = sns.countplot(x="stars", data=reviews_train_data)
g.axes.set_title("Reviews by Scores", fontsize = 1.5 * fontsize)
g.set_xlabel("Stars", fontsize=fontsize)
g.set_ylabel("Count", fontsize=fontsize)
plt.show()
From the previous figure, it can be seen that it is a multi-class classification problem.
The following plot will provide a better understanding about the context of the data:
order = reviews_train_data["product_category"].value_counts().index
plt.figure(figsize=(20, 5))
g = sns.countplot(x="product_category", data=reviews_train_data, order=order)
g.axes.set_title("Reviews count by Category", fontsize = 1.5 * fontsize)
g.set_xlabel("Product Category", fontsize=fontsize)
g.set_ylabel("Count", fontsize=fontsize)
g.tick_params(labelsize=fontsize)
plt.xticks(rotation=90)
plt.show()
reviews_count = reviews_train_data[["product_category", "stars"]]\
    .groupby(["product_category", "stars"])["product_category"]\
    .count()\
    .reset_index(name="count")\
    .copy()
plt.figure(figsize=(20, 8))
g = sns.lineplot(data=reviews_count, x="product_category", y="count", hue="stars", palette = "Set1")
g.axes.set_title("Count by Product Category", fontsize = 1.5 * fontsize)
g.set_xlabel("Product Category", fontsize=fontsize)
g.set_ylabel("Count", fontsize=fontsize)
plt.xticks(rotation=90)
plt.show()
Home and Wireless are the most frequent categories; however, the frequency patterns across all categories are similar.
Some extra information about the data:
print(f"Unique languages: {reviews_train_data.language.nunique()}")
print(f"Unique products: {reviews_train_data.product_id.nunique()} ({100 * reviews_train_data.product_id.nunique() / reviews_train_data.shape[0]}%)")
print(f"Unique reviewers: {reviews_train_data.reviewer_id.nunique()} ({100 * reviews_train_data.reviewer_id.nunique() / reviews_train_data.shape[0]}%)")
Unique languages: 1
Unique products: 150938 (75.469%)
Unique reviewers: 179076 (89.538%)
In summary, it can be seen that:
- Home and Wireless are the categories with the most rows.
- The quantity of review stars is balanced across the categories.
- Almost all reviews correspond to unique users and products.
Let's set the seed (for reproducibility) to analyze the text in one review body:
np.random.seed(random_state)
random_index = np.random.randint(reviews_train_data.shape[0])
reviews_train_data.iloc[random_index]
review_id                                                  es_0018170
product_id                                         product_es_0239974
reviewer_id                                       reviewer_es_0518937
stars                                                               1
review_body         La primera vez que la metí en el lavavajillas ...
review_title                                   Lo barato sale caro...
language                                                           es
product_category                                               sports
Name: 9162, dtype: object
random_review_body = reviews_train_data.review_body[random_index]
random_review_body
'La primera vez que la metí en el lavavajillas con un programa de 50 grados, se deformó. Ya antes se había partido el asa con tan solo unos días de uso.'
Create a spaCy document to handle the text:
doc = nlp(random_review_body)
print(doc.text)
La primera vez que la metí en el lavavajillas con un programa de 50 grados, se deformó. Ya antes se había partido el asa con tan solo unos días de uso.
Listing the sentences and tokens generated by the library:
for sentence in doc.sents:
    print(sentence)
La primera vez que la metí en el lavavajillas con un programa de 50 grados, se deformó.
Ya antes se había partido el asa con tan solo unos días de uso.
tokenized_review_body = [token for token in doc]
print(tokenized_review_body)
[La, primera, vez, que, la, metí, en, el, lavavajillas, con, un, programa, de, 50, grados, ,, se, deformó, ., Ya, antes, se, había, partido, el, asa, con, tan, solo, unos, días, de, uso, .]
spaCy provides a lemmatisation implementation, so lemmatisation is the normalisation process used in this analysis.
Below, the results of the lemmatisation process are shown together with the original tokens:
for token in doc:
    # Get the token text, lemma, part-of-speech tag and dependency label
    token_text = token.text
    token_lemma = token.lemma_
    token_pos = token.pos_
    token_dep = token.dep_
    token_explanation = spacy.explain(token_pos)
    print(f"{token_text:<13}{token_lemma:<13}{token_pos:<10}{token_dep:<10}{token_explanation}")
La           La           DET       det       determiner
primera      primero      ADJ       amod      adjective
vez          vez          NOUN      nsubj     noun
que          que          SCONJ     obl       subordinating conjunction
la           lo           DET       obj       determiner
metí         meter        NOUN      acl       noun
en           en           ADP       case      adposition
el           el           DET       det       determiner
lavavajillas lavavajillas NOUN      obl       noun
con          con          ADP       case      adposition
un           uno          DET       det       determiner
programa     programar    NOUN      obl       noun
de           de           ADP       case      adposition
50           50           NUM       nummod    numeral
grados       grado        NOUN      nmod      noun
,            ,            PUNCT     punct     punctuation
se           se           PRON      obj       pronoun
deformó      deformar     VERB      ROOT      verb
.            .            PUNCT     punct     punctuation
Ya           Ya           ADV       advmod    adverb
antes        antes        ADV       advmod    adverb
se           se           PRON      obj       pronoun
había        haber        VERB      aux       verb
partido      partir       VERB      ROOT      verb
el           el           DET       det       determiner
asa          asar         PROPN     nsubj     proper noun
con          con          ADP       case      adposition
tan          tan          INTJ      advmod    interjection
solo         solo         INTJ      fixed     interjection
unos         uno          DET       det       determiner
días         día          NOUN      obl       noun
de           de           ADP       case      adposition
uso          usar         NOUN      nmod      noun
.            .            PUNCT     punct     punctuation
Excluding all stop words and punctuation marks:
for token in doc:
    if token.is_stop or token.is_punct:
        continue
    # Get the token text, lemma, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    token_lemma = token.lemma_
    token_explanation = str(spacy.explain(token_pos))
    print(f"{token_text:<13}{token_lemma:<13}{token_pos:<10}{token_dep:<10}{token_explanation}")
metí         meter        NOUN      acl       noun
lavavajillas lavavajillas NOUN      obl       noun
programa     programar    NOUN      obl       noun
50           50           NUM       nummod    numeral
grados       grado        NOUN      nmod      noun
deformó      deformar     VERB      ROOT      verb
partido      partir       VERB      ROOT      verb
asa          asar         PROPN     nsubj     proper noun
The quantity of words is reduced. However, some remaining tokens (numbers in this case) have fewer than three characters. To reduce the probability that an undesired word slips through the filter, all tokens with fewer than four characters will be discarded.
The following function is a summary of the filtering process described above:
def is_valid_token(token, min_word_length=4, exceptions=[]):
    has_min_length = len(token.lemma_) >= min_word_length
    is_exception = token.text.lower() in exceptions
    return (not token.is_stop and not token.is_punct and has_min_length) or is_exception
Creating a function to lemmatize text:
def lemmatize_text(text):
    doc = nlp(text.lower())
    lemmatized_words = [token.lemma_ for token in doc if is_valid_token(token, exceptions=["no"])]
    return " ".join(lemmatized_words)
The final result is the following:
lemmatize_text(random_review_body)
'meter lavavajillas programar grado deformar partir asar'
Note:
The word no was included as an exception to the filtering because it can be important for differentiating positive and negative reviews. The next cell illustrates why it was included: the two titles El producto no fue recibido and El producto fue recibido are compared after a raw lemmatisation (without exceptions):
def lemmatize_text_without_exceptions(text):
    doc = nlp(text.lower())
    lemmatized_words = [token.lemma_ for token in doc if is_valid_token(token)]
    return " ".join(lemmatized_words)
original_negative_text = "El producto no fue recibido"
original_positive_text = "El producto fue recibido"
print(f"Negative text: {original_negative_text} => {lemmatize_text_without_exceptions(original_negative_text)}")
print(f"Positive text: {original_positive_text} => {lemmatize_text_without_exceptions(original_positive_text)}")
Negative text: El producto no fue recibido => producto recibir
Positive text: El producto fue recibido => producto recibir
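For contrast, applying lemmatize_text (which keeps no as an exception) should preserve the negation. A quick check; the expected output is shown as comments and is approximate, since lemmas depend on the spaCy model version:
print(f"Negative text: {original_negative_text} => {lemmatize_text(original_negative_text)}")
print(f"Positive text: {original_positive_text} => {lemmatize_text(original_positive_text)}")
# Expected output (approximate):
# Negative text: El producto no fue recibido => producto no recibir
# Positive text: El producto fue recibido => producto recibir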
N-grams can provide more information about the content (and differentiate it from other content). The following function will be used a couple of cells below to visualize the most frequent bigrams and trigrams.
def common_words_count(df, stars, ngram_range, max_count=40, fontsize=fontsize, figsize=(15, 8)):
    # Filter the dataset and create the count vectorizer
    data_to_study = df.copy()[df.stars.isin(stars)]
    count_vectorizer = CountVectorizer(ngram_range=ngram_range)
    fig, axes = plt.subplots(1, 2, figsize=figsize)
    fig.suptitle(f"Most frequent words for reviews with {stars} stars", fontsize=fontsize)
    # Title
    body_words_count = count_vectorizer.fit_transform(data_to_study.review_title)
    count_per_word = body_words_count.toarray().sum(axis=0)
    index_order = np.argsort(-count_per_word)
    body_words_labels = np.array(count_vectorizer.get_feature_names())[index_order][0:max_count]
    count_per_word = count_per_word[index_order][0:max_count]
    g = sns.barplot(x=body_words_labels, y=count_per_word, ax=axes[0])
    g.set_xlabel("n-grams", fontsize=fontsize)
    g.set_ylabel("Count", fontsize=fontsize)
    g.tick_params(labelsize=fontsize, labelrotation=90)
    axes[0].set_title("Most common n-grams in the title")
    # Body
    body_words_count = count_vectorizer.fit_transform(data_to_study.review_body)
    count_per_word = body_words_count.toarray().sum(axis=0)
    index_order = np.argsort(-count_per_word)
    body_words_labels = np.array(count_vectorizer.get_feature_names())[index_order][0:max_count]
    count_per_word = count_per_word[index_order][0:max_count]
    g = sns.barplot(x=body_words_labels, y=count_per_word, ax=axes[1])
    g.set_xlabel("n-grams", fontsize=fontsize)
    g.set_ylabel("Count", fontsize=fontsize)
    g.tick_params(labelsize=fontsize, labelrotation=90)
    axes[1].set_title("Most common n-grams in the body")
    plt.show()
We will proceed to:
- Lemmatize the training data (both the title and the body of each review).
- Sample 10,000 reviews to:
  - Create the benchmark model as a starting point.
  - Repeat the study and training with more complete data.
  - Optimize the hyperparameters.
- Once the hyperparameter optimization phase is completed, train a model on all the training data using the resulting hyperparameters.
Note: this procedure is performed this way because of the computational cost involved. Taking a representative sample allows a greater number of iterations when searching for the best possible hyperparameters. The size was set to 10,000 to keep a 2:1 ratio with the dev dataset (5,000 records in total).
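As an aside, if the sample were required to preserve the star distribution exactly, a stratified sample could be drawn instead of the uniform one used below. A minimal sketch, assuming the balanced five-class training set (this is not the approach taken in this notebook):
# Hypothetical alternative: sample the same number of reviews per star class
stratified_sample = (
    reviews_train_data
    .groupby("stars", group_keys=False)
    .apply(lambda g: g.sample(n=base_sample // 5, random_state=random_state))
)
print(stratified_sample.stars.value_counts())  # 2,000 reviews per class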
# Lemmatize
print("Body process:", end=" ")
start_time = time()
reviews_train_data.review_body = reviews_train_data.review_body.apply(lemmatize_text)
print_elapsed_minutes(start_time)
print("Title process:", end=" ")
start_time = time()
reviews_train_data.review_title = reviews_train_data.review_title.apply(lemmatize_text)
print_elapsed_minutes(start_time)
# Uniform weights: equivalent to a simple random sample
weights = np.ones(reviews_train_data.shape[0])
sampled_reviews_train_data = reviews_train_data.sample(n=base_sample, weights=weights, random_state=random_state)
sampled_reviews_train_data = sampled_reviews_train_data[["review_title", "review_body", "stars"]]
print(f"Size of the sample: {sampled_reviews_train_data.shape}")
g = sns.countplot(x="stars", data=sampled_reviews_train_data)
g.set_xlabel("Stars", fontsize=fontsize)
g.set_ylabel("Count", fontsize=fontsize)
plt.show()
Body process: 35.88 min
Title process: 22.55 min
Size of the sample: (10000, 3)
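The lemmatisation above is the most expensive step because .apply calls nlp once per row. A hedged sketch of a likely speed-up using spaCy's nlp.pipe batching (not used in this notebook; the helper name lemmatize_series is hypothetical):
def lemmatize_series(texts, batch_size=256):
    # Stream the texts through the pipeline in batches instead of one by one
    results = []
    for doc in nlp.pipe((text.lower() for text in texts), batch_size=batch_size):
        lemmas = [token.lemma_ for token in doc if is_valid_token(token, exceptions=["no"])]
        results.append(" ".join(lemmas))
    return results

# Usage sketch, equivalent in output to the .apply(lemmatize_text) calls above:
# reviews_train_data.review_body = lemmatize_series(reviews_train_data.review_body)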
For positive reviews (grouping ratings 4 and 5) and negative reviews (ratings 1 and 2), the following can be observed:
common_words_count(df=sampled_reviews_train_data, stars=[4,5], ngram_range=(2,3), max_count=20, fontsize=fontsize, figsize=(20, 5))
common_words_count(df=sampled_reviews_train_data, stars=[1,2], ngram_range=(2,3), max_count=20, fontsize=fontsize, figsize=(20, 5))
It can be noted that some n-grams, such as `no gustar` and `preciar no`, are shared between positive and negative reviews (in different proportions), but this is outside the scope of this study.
Benchmark Model¶
As a benchmark model we will use:
- The bag-of-words method, counting the words in the review bodies.
- An `SVC` model.
- The One vs Rest method for multi-class problems.
The strategy to be followed is described below:
- Only the text in the body of the review will be considered.
- An instance of `CountVectorizer` will be created to process and vectorize the information from both `train` and `dev`.
- The `SVC` and `RandomForestClassifier` models will be trained with the sample of the training data.
- Once training is finished, the `dev` data will be used to obtain the `accuracy` of the model.
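To make the bag-of-words idea concrete, here is a minimal, self-contained sketch of what CountVectorizer produces (the toy sentences are invented for illustration):
toy_corpus = ["producto perfecto", "producto malo muy malo"]
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_corpus)
print(toy_vectorizer.get_feature_names())  # ['malo', 'muy', 'perfecto', 'producto']
print(toy_counts.toarray())
# [[0 0 1 1]
#  [2 1 0 1]]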
# Create counter
benchmark_count_vectorizer = CountVectorizer(max_features = max_features, ngram_range=(1, 1))
# Transform
print("Vectorizing process:", end=" ")
%time matrix_body_train = benchmark_count_vectorizer.fit_transform(sampled_reviews_train_data.review_body)
X_train_bench = matrix_body_train.toarray()
y_train_bench = sampled_reviews_train_data.stars
Vectorizing process: Wall time: 176 ms
reviews_dev_data = pd.read_json("./data/dataset_es_dev.json", lines=True)
# Lemmatize
print("Body process:", end=" ")
%time reviews_dev_data.review_body = reviews_dev_data.review_body.apply(lemmatize_text)
print("Title process:", end=" ")
%time reviews_dev_data.review_title = reviews_dev_data.review_title.apply(lemmatize_text)
reviews_dev_data = reviews_dev_data[["review_title", "review_body", "stars"]]
Body process: Wall time: 51.9 s
Title process: Wall time: 33.4 s
print("Vectorizing process:", end=" ")
%time matrix_body_dev = benchmark_count_vectorizer.transform(reviews_dev_data.review_body)
X_dev_bench = matrix_body_dev.toarray()
y_dev_bench = reviews_dev_data.stars
Vectorizing process: Wall time: 85 ms
Below are two functions that will be useful to visualize the results and train the models:
# Plot confusion matrix
def confusion_multi(ytest, y_pred):
    names = ["1", "2", "3", "4", "5"]
    cm = confusion_matrix(ytest, y_pred)
    f, ax = plt.subplots(figsize=(5, 5))
    sns.heatmap(cm, annot=True, linewidth=.5, linecolor="r", fmt=".0f", ax=ax)
    plt.xlabel("y_pred")
    plt.ylabel("y_true")
    ax.set_xticklabels(names)
    ax.set_yticklabels(names)
    plt.show()
# Train model
def train_benchmark_model(model, traindata, devdata):
    X_train, y_train = traindata
    X_dev, y_dev = devdata
    # Training classifier
    print("Training", end=" ")
    start_time = time()
    model.fit(X_train, y_train)
    print_elapsed_minutes(start_time)
    # Predict
    print("Predicting", end=" ")
    start_time = time()
    predictions = model.predict(X_dev)
    print_elapsed_minutes(start_time)
    # Accuracy on dev
    accuracy = accuracy_score(y_dev, predictions)
    print(f"Accuracy for dev set: {100 * accuracy:.2f}%")
    confusion_multi(y_dev, predictions)
We execute the function to train and obtain the results of the model:
# One vs Rest + SVC
train_benchmark_model(OneVsRestClassifier(SVC(random_state=random_state), n_jobs=-1), (X_train_bench, y_train_bench), (X_dev_bench, y_dev_bench))
Training 4.04 min
Predicting 3.69 min
Accuracy for dev set: 41.76%
# One vs Rest + Random Forest
train_benchmark_model(OneVsRestClassifier(RandomForestClassifier(random_state=random_state, n_jobs=-1), n_jobs=-1), (X_train_bench, y_train_bench), (X_dev_bench, y_dev_bench))
Training 1.02 min
Predicting 0.02 min
Accuracy for dev set: 39.30%
As a benchmark we will take:
- Model: `SVC` with default values.
- Accuracy obtained: `41%`.
Model Creation and Hyperparameter Optimization¶
Considering the hardware resources available at the time of this analysis, the following steps will be followed:
- The training data will be the sample of 10,000 records obtained previously; it will be used for the hyperparameter optimization.
- Both the `SVC` and `RandomForestClassifier` models will be studied.
- For vectorization, the `CountVectorizer` and `TfidfVectorizer` classes will be used.
Due to the large volume of data, and to monitor the training time of each combination during the hyperparameter optimization phase, instead of utilities like `GridSearchCV` we will use:
- `for` loops.
- Hyperparameters optimized by training with `train` and validating with `dev`.
- The actual `accuracy` of the model estimated against the `test` data.
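As a side note, the nested for loops used below to build the hyperparameter grid could be generated more compactly with itertools.product; a minimal equivalent sketch (not the notebook's code):
from itertools import product

combinations = [
    {"ngram_range": n, "C": C, "kernel": k, "degree": d, "vectorizer_strategy": v}
    for n, C, k, d, v in product(
        [(1, 1), (1, 2)], [1.0, 2.0], ["linear", "rbf"], [2, 3], ["count", "tf_idf"]
    )
]
print(len(combinations))  # 32, matching the loop count below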
Here are two functions:
- `confusion_binary`, which will help us at the end of the analysis.
- `train_model`, an adaptation of `train_benchmark_model` that covers the cases mentioned in the list above.
def confusion_binary(ytest, y_pred):
    names = ["0", "1"]
    cm = confusion_matrix(ytest, y_pred)
    f, ax = plt.subplots(figsize=(5, 5))
    sns.heatmap(cm, annot=True, linewidth=.5, linecolor="r", fmt=".0f", ax=ax)
    plt.xlabel("y_pred")
    plt.ylabel("y_true")
    ax.set_xticklabels(names)
    ax.set_yticklabels(names)
    plt.show()
def train_model(data, classifier, title_vectorizer, body_vectorizer, return_results=True, print_confusion_matrix=False, binary=False):
    start_time = time()
    train_data, dev_data = data
    # Vectorize title and body separately, then stack the sparse feature matrices
    title_words_train = title_vectorizer.fit_transform(train_data.review_title)
    body_words_train = body_vectorizer.fit_transform(train_data.review_body)
    X_train = sparse.hstack((title_words_train, body_words_train))
    y_train = train_data.stars
    # Model generation
    classifier.fit(X_train, y_train)
    title_words_dev = title_vectorizer.transform(dev_data.review_title)
    body_words_dev = body_vectorizer.transform(dev_data.review_body)
    X_dev = sparse.hstack((title_words_dev, body_words_dev))
    y_dev = dev_data.stars
    # Predict
    predictions = classifier.predict(X_dev)
    print_elapsed_minutes(start_time)
    # Accuracy on dev
    accuracy = accuracy_score(y_dev, predictions)
    print(f"Accuracy for dev set: {100 * accuracy:.2f}%\n")
    if print_confusion_matrix:
        confusion_binary(y_dev, predictions) if binary else confusion_multi(y_dev, predictions)
    if return_results:
        return accuracy
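Note that Pipeline was imported at the top but train_model applies the vectorizers by hand. A hedged sketch of an equivalent single estimator built with ColumnTransformer (a hypothetical refactor, not used in this notebook; make_review_pipeline is an invented name):
from sklearn.compose import ColumnTransformer

def make_review_pipeline(classifier, title_vectorizer, body_vectorizer):
    # A string column selector hands the 1-D text series to each vectorizer;
    # ColumnTransformer stacks the resulting sparse matrices, like sparse.hstack
    features = ColumnTransformer([
        ("title", title_vectorizer, "review_title"),
        ("body", body_vectorizer, "review_body"),
    ])
    return Pipeline([("features", features), ("classifier", classifier)])

# Usage sketch:
# model = make_review_pipeline(OneVsRestClassifier(SVC()), TfidfVectorizer(), TfidfVectorizer())
# model.fit(sampled_reviews_train_data, sampled_reviews_train_data.stars)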
Starting the analysis for the SVC model:
ngram_ranges = [(1, 1), (1, 2)]
Cs = [1.0, 2.0]
kernels = ["linear", "rbf"]
degrees = [2, 3]
vectorizer_strategies = ["count", "tf_idf"]
combinations = []
best_params_svc = {
    "accuracy": 0
}
for ngram_range in ngram_ranges:
    for C in Cs:
        for kernel in kernels:
            for degree in degrees:
                for vectorizer_strategy in vectorizer_strategies:
                    combinations.append({
                        "ngram_range": ngram_range,
                        "C": C,
                        "kernel": kernel,
                        "degree": degree,
                        "vectorizer_strategy": vectorizer_strategy
                    })
total_combinations = len(combinations)
for loop_number, combination in enumerate(combinations):
    print(f"Loop {loop_number + 1}/{total_combinations} ({combination}) =>", end=" ")
    if combination["vectorizer_strategy"] == "tf_idf":
        body_vectorizer = TfidfVectorizer(max_features=max_features, ngram_range=combination["ngram_range"])
        title_vectorizer = TfidfVectorizer(max_features=max_features, ngram_range=combination["ngram_range"])
    else:
        body_vectorizer = CountVectorizer(max_features=max_features, ngram_range=combination["ngram_range"])
        title_vectorizer = CountVectorizer(max_features=max_features, ngram_range=combination["ngram_range"])
    classifier = OneVsRestClassifier(SVC(C=combination["C"], kernel=combination["kernel"], degree=combination["degree"], random_state=random_state), n_jobs=-1)
    model_accuracy = train_model((sampled_reviews_train_data, reviews_dev_data), classifier, title_vectorizer, body_vectorizer)
    if best_params_svc["accuracy"] < model_accuracy:
        best_params_svc["accuracy"] = model_accuracy
        best_params_svc["params"] = combination
        best_params_svc["classifier"] = classifier
        best_params_svc["body_vectorizer"] = body_vectorizer
        best_params_svc["title_vectorizer"] = title_vectorizer
print("\nBest Params:")
best_params_svc
Loop 1/32 ({'ngram_range': (1, 1), 'C': 1.0, 'kernel': 'linear', 'degree': 2, 'vectorizer_strategy': 'count'}) => 0.41 min
Accuracy for dev set: 43.38%
Loop 2/32 ({'ngram_range': (1, 1), 'C': 1.0, 'kernel': 'linear', 'degree': 2, 'vectorizer_strategy': 'tf_idf'}) => 0.31 min
Accuracy for dev set: 44.62%
Loop 3/32 ({'ngram_range': (1, 1), 'C': 1.0, 'kernel': 'linear', 'degree': 3, 'vectorizer_strategy': 'count'}) => 0.40 min
Accuracy for dev set: 43.38%
Loop 4/32 ({'ngram_range': (1, 1), 'C': 1.0, 'kernel': 'linear', 'degree': 3, 'vectorizer_strategy': 'tf_idf'}) => 0.31 min
Accuracy for dev set: 44.62%
Loop 5/32 ({'ngram_range': (1, 1), 'C': 1.0, 'kernel': 'rbf', 'degree': 2, 'vectorizer_strategy': 'count'}) => 0.55 min
Accuracy for dev set: 44.34%
Loop 6/32 ({'ngram_range': (1, 1), 'C': 1.0, 'kernel': 'rbf', 'degree': 2, 'vectorizer_strategy': 'tf_idf'}) => 0.56 min
Accuracy for dev set: 45.66%
Loop 7/32 ({'ngram_range': (1, 1), 'C': 1.0, 'kernel': 'rbf', 'degree': 3, 'vectorizer_strategy': 'count'}) => 0.55 min
Accuracy for dev set: 44.34%
Loop 8/32 ({'ngram_range': (1, 1), 'C': 1.0, 'kernel': 'rbf', 'degree': 3, 'vectorizer_strategy': 'tf_idf'}) => 0.56 min
Accuracy for dev set: 45.66%
Loop 9/32 ({'ngram_range': (1, 1), 'C': 2.0, 'kernel': 'linear', 'degree': 2, 'vectorizer_strategy': 'count'}) => 0.51 min
Accuracy for dev set: 42.78%
Loop 10/32 ({'ngram_range': (1, 1), 'C': 2.0, 'kernel': 'linear', 'degree': 2, 'vectorizer_strategy': 'tf_idf'}) => 0.32 min
Accuracy for dev set: 43.72%
Loop 11/32 ({'ngram_range': (1, 1), 'C': 2.0, 'kernel': 'linear', 'degree': 3, 'vectorizer_strategy': 'count'}) => 0.52 min
Accuracy for dev set: 42.78%
Loop 12/32 ({'ngram_range': (1, 1), 'C': 2.0, 'kernel': 'linear', 'degree': 3, 'vectorizer_strategy': 'tf_idf'}) => 0.32 min
Accuracy for dev set: 43.72%
Loop 13/32 ({'ngram_range': (1, 1), 'C': 2.0, 'kernel': 'rbf', 'degree': 2, 'vectorizer_strategy': 'count'}) => 0.61 min
Accuracy for dev set: 44.26%
Loop 14/32 ({'ngram_range': (1, 1), 'C': 2.0, 'kernel': 'rbf', 'degree': 2, 'vectorizer_strategy': 'tf_idf'}) => 0.71 min
Accuracy for dev set: 45.38%
Loop 15/32 ({'ngram_range': (1, 1), 'C': 2.0, 'kernel': 'rbf', 'degree': 3, 'vectorizer_strategy': 'count'}) => 0.61 min
Accuracy for dev set: 44.26%
Loop 16/32 ({'ngram_range': (1, 1), 'C': 2.0, 'kernel': 'rbf', 'degree': 3, 'vectorizer_strategy': 'tf_idf'}) => 0.69 min
Accuracy for dev set: 45.38%
Loop 17/32 ({'ngram_range': (1, 2), 'C': 1.0, 'kernel': 'linear', 'degree': 2, 'vectorizer_strategy': 'count'}) => 0.40 min
Accuracy for dev set: 42.06%
Loop 18/32 ({'ngram_range': (1, 2), 'C': 1.0, 'kernel': 'linear', 'degree': 2, 'vectorizer_strategy': 'tf_idf'}) => 0.35 min
Accuracy for dev set: 44.30%
Loop 19/32 ({'ngram_range': (1, 2), 'C': 1.0, 'kernel': 'linear', 'degree': 3, 'vectorizer_strategy': 'count'}) => 0.41 min
Accuracy for dev set: 42.06%
Loop 20/32 ({'ngram_range': (1, 2), 'C': 1.0, 'kernel': 'linear', 'degree': 3, 'vectorizer_strategy': 'tf_idf'}) => 0.34 min
Accuracy for dev set: 44.30%
Loop 21/32 ({'ngram_range': (1, 2), 'C': 1.0, 'kernel': 'rbf', 'degree': 2, 'vectorizer_strategy': 'count'}) => 0.58 min
Accuracy for dev set: 44.72%
Loop 22/32 ({'ngram_range': (1, 2), 'C': 1.0, 'kernel': 'rbf', 'degree': 2, 'vectorizer_strategy': 'tf_idf'}) => 0.61 min
Accuracy for dev set: 45.86%
Loop 23/32 ({'ngram_range': (1, 2), 'C': 1.0, 'kernel': 'rbf', 'degree': 3, 'vectorizer_strategy': 'count'}) => 0.58 min
Accuracy for dev set: 44.72%
Loop 24/32 ({'ngram_range': (1, 2), 'C': 1.0, 'kernel': 'rbf', 'degree': 3, 'vectorizer_strategy': 'tf_idf'}) => 0.61 min
Accuracy for dev set: 45.86%
Loop 25/32 ({'ngram_range': (1, 2), 'C': 2.0, 'kernel': 'linear', 'degree': 2, 'vectorizer_strategy': 'count'}) => 0.53 min
Accuracy for dev set: 41.06%
Loop 26/32 ({'ngram_range': (1, 2), 'C': 2.0, 'kernel': 'linear', 'degree': 2, 'vectorizer_strategy': 'tf_idf'}) => 0.35 min
Accuracy for dev set: 43.44%
Loop 27/32 ({'ngram_range': (1, 2), 'C': 2.0, 'kernel': 'linear', 'degree': 3, 'vectorizer_strategy': 'count'}) => 0.53 min
Accuracy for dev set: 41.06%
Loop 28/32 ({'ngram_range': (1, 2), 'C': 2.0, 'kernel': 'linear', 'degree': 3, 'vectorizer_strategy': 'tf_idf'}) => 0.35 min
Accuracy for dev set: 43.44%
Loop 29/32 ({'ngram_range': (1, 2), 'C': 2.0, 'kernel': 'rbf', 'degree': 2, 'vectorizer_strategy': 'count'}) => 0.67 min
Accuracy for dev set: 44.20%
Loop 30/32 ({'ngram_range': (1, 2), 'C': 2.0, 'kernel': 'rbf', 'degree': 2, 'vectorizer_strategy': 'tf_idf'}) => 0.76 min
Accuracy for dev set: 45.54%
Loop 31/32 ({'ngram_range': (1, 2), 'C': 2.0, 'kernel': 'rbf', 'degree': 3, 'vectorizer_strategy': 'count'}) => 0.66 min
Accuracy for dev set: 44.20%
Loop 32/32 ({'ngram_range': (1, 2), 'C': 2.0, 'kernel': 'rbf', 'degree': 3, 'vectorizer_strategy': 'tf_idf'}) => 0.76 min
Accuracy for dev set: 45.54%
Best Params:
{'accuracy': 0.4586,
'params': {'ngram_range': (1, 2),
'C': 1.0,
'kernel': 'rbf',
'degree': 2,
'vectorizer_strategy': 'tf_idf'},
'classifier': OneVsRestClassifier(estimator=SVC(degree=2, random_state=1990), n_jobs=-1),
'body_vectorizer': TfidfVectorizer(max_features=2000, ngram_range=(1, 2)),
'title_vectorizer': TfidfVectorizer(max_features=2000, ngram_range=(1, 2))}
Continuing with RandomForestClassifier:
ngram_ranges = [(1, 1), (1, 2)]
n_estimators_array = [100, 200]
criterions = ["gini", "entropy"]
max_depths = [10, 50, 100]
vectorizer_strategies = ["count", "tf_idf"]
combinations = []
best_params_random_forest = {
    "accuracy": 0
}
for n_estimators in n_estimators_array:
    for criterion in criterions:
        for max_depth in max_depths:
            for ngram_range in ngram_ranges:
                for vectorizer_strategy in vectorizer_strategies:
                    combinations.append({
                        "n_estimators": n_estimators,
                        "criterion": criterion,
                        "max_depth": max_depth,
                        "ngram_range": ngram_range,
                        "vectorizer_strategy": vectorizer_strategy
                    })
total_combinations = len(combinations)
for loop_number, combination in enumerate(combinations):
    print(f"Loop {loop_number + 1}/{total_combinations} ({combination}) =>", end=" ")
    if combination["vectorizer_strategy"] == "tf_idf":
        body_vectorizer = TfidfVectorizer(max_features=max_features, ngram_range=combination["ngram_range"])
        title_vectorizer = TfidfVectorizer(max_features=max_features, ngram_range=combination["ngram_range"])
    else:
        body_vectorizer = CountVectorizer(max_features=max_features, ngram_range=combination["ngram_range"])
        title_vectorizer = CountVectorizer(max_features=max_features, ngram_range=combination["ngram_range"])
    classifier = OneVsRestClassifier(RandomForestClassifier(max_depth=combination["max_depth"], criterion=combination["criterion"], n_estimators=combination["n_estimators"], n_jobs=-1, random_state=random_state), n_jobs=-1)
    model_accuracy = train_model((sampled_reviews_train_data, reviews_dev_data), classifier, title_vectorizer, body_vectorizer)
    if best_params_random_forest["accuracy"] < model_accuracy:
        best_params_random_forest["accuracy"] = model_accuracy
        best_params_random_forest["params"] = combination
        best_params_random_forest["classifier"] = classifier
        best_params_random_forest["body_vectorizer"] = body_vectorizer
        best_params_random_forest["title_vectorizer"] = title_vectorizer
print("\nBest Params for Random Forest:")
best_params_random_forest
Loop 1/48 ({'n_estimators': 100, 'criterion': 'gini', 'max_depth': 10, 'ngram_range': (1, 1), 'vectorizer_strategy': 'count'}) => 0.03 min
Accuracy for dev set: 42.32%
Loop 2/48 ({'n_estimators': 100, 'criterion': 'gini', 'max_depth': 10, 'ngram_range': (1, 1), 'vectorizer_strategy': 'tf_idf'}) => 0.03 min
Accuracy for dev set: 42.76%
Loop 3/48 ({'n_estimators': 100, 'criterion': 'gini', 'max_depth': 10, 'ngram_range': (1, 2), 'vectorizer_strategy': 'count'}) => 0.03 min
Accuracy for dev set: 42.62%
Loop 4/48 ({'n_estimators': 100, 'criterion': 'gini', 'max_depth': 10, 'ngram_range': (1, 2), 'vectorizer_strategy': 'tf_idf'}) => 0.03 min
Accuracy for dev set: 42.08%
Loop 5/48 ({'n_estimators': 100, 'criterion': 'gini', 'max_depth': 50, 'ngram_range': (1, 1), 'vectorizer_strategy': 'count'}) => 0.05 min
Accuracy for dev set: 44.58%
Loop 6/48 ({'n_estimators': 100, 'criterion': 'gini', 'max_depth': 50, 'ngram_range': (1, 1), 'vectorizer_strategy': 'tf_idf'}) => 0.05 min
Accuracy for dev set: 44.16%
Loop 7/48 ({'n_estimators': 100, 'criterion': 'gini', 'max_depth': 50, 'ngram_range': (1, 2), 'vectorizer_strategy': 'count'}) => 0.05 min
Accuracy for dev set: 44.24%
Loop 8/48 ({'n_estimators': 100, 'criterion': 'gini', 'max_depth': 50, 'ngram_range': (1, 2), 'vectorizer_strategy': 'tf_idf'}) => 0.06 min
Accuracy for dev set: 44.22%
Loop 9/48 ({'n_estimators': 100, 'criterion': 'gini', 'max_depth': 100, 'ngram_range': (1, 1), 'vectorizer_strategy': 'count'}) => 0.07 min
Accuracy for dev set: 44.10%
Loop 10/48 ({'n_estimators': 100, 'criterion': 'gini', 'max_depth': 100, 'ngram_range': (1, 1), 'vectorizer_strategy': 'tf_idf'}) => 0.07 min
Accuracy for dev set: 43.40%
Loop 11/48 ({'n_estimators': 100, 'criterion': 'gini', 'max_depth': 100, 'ngram_range': (1, 2), 'vectorizer_strategy': 'count'}) => 0.08 min
Accuracy for dev set: 44.30%
Loop 12/48 ({'n_estimators': 100, 'criterion': 'gini', 'max_depth': 100, 'ngram_range': (1, 2), 'vectorizer_strategy': 'tf_idf'}) => 0.08 min
Accuracy for dev set: 43.94%
Loop 13/48 ({'n_estimators': 100, 'criterion': 'entropy', 'max_depth': 10, 'ngram_range': (1, 1), 'vectorizer_strategy': 'count'}) => 0.03 min
Accuracy for dev set: 42.56%
Loop 14/48 ({'n_estimators': 100, 'criterion': 'entropy', 'max_depth': 10, 'ngram_range': (1, 1), 'vectorizer_strategy': 'tf_idf'}) => 0.03 min
Accuracy for dev set: 43.24%
Loop 15/48 ({'n_estimators': 100, 'criterion': 'entropy', 'max_depth': 10, 'ngram_range': (1, 2), 'vectorizer_strategy': 'count'}) => 0.03 min
Accuracy for dev set: 42.86%
Loop 16/48 ({'n_estimators': 100, 'criterion': 'entropy', 'max_depth': 10, 'ngram_range': (1, 2), 'vectorizer_strategy': 'tf_idf'}) => 0.04 min
Accuracy for dev set: 42.84%
Loop 17/48 ({'n_estimators': 100, 'criterion': 'entropy', 'max_depth': 50, 'ngram_range': (1, 1), 'vectorizer_strategy': 'count'}) => 0.05 min
Accuracy for dev set: 44.64%
Loop 18/48 ({'n_estimators': 100, 'criterion': 'entropy', 'max_depth': 50, 'ngram_range': (1, 1), 'vectorizer_strategy': 'tf_idf'}) => 0.05 min
Accuracy for dev set: 44.14%
Loop 19/48 ({'n_estimators': 100, 'criterion': 'entropy', 'max_depth': 50, 'ngram_range': (1, 2), 'vectorizer_strategy': 'count'}) => 0.05 min
Accuracy for dev set: 44.24%
Loop 20/48 ({'n_estimators': 100, 'criterion': 'entropy', 'max_depth': 50, 'ngram_range': (1, 2), 'vectorizer_strategy': 'tf_idf'}) => 0.06 min
Accuracy for dev set: 44.00%
Loop 21/48 ({'n_estimators': 100, 'criterion': 'entropy', 'max_depth': 100, 'ngram_range': (1, 1), 'vectorizer_strategy': 'count'}) => 0.07 min
Accuracy for dev set: 43.96%
Loop 22/48 ({'n_estimators': 100, 'criterion': 'entropy', 'max_depth': 100, 'ngram_range': (1, 1), 'vectorizer_strategy': 'tf_idf'}) => 0.08 min
Accuracy for dev set: 44.34%
Loop 23/48 ({'n_estimators': 100, 'criterion': 'entropy', 'max_depth': 100, 'ngram_range': (1, 2), 'vectorizer_strategy': 'count'}) => 0.08 min
Accuracy for dev set: 44.22%
Loop 24/48 ({'n_estimators': 100, 'criterion': 'entropy', 'max_depth': 100, 'ngram_range': (1, 2), 'vectorizer_strategy': 'tf_idf'}) => 0.09 min
Accuracy for dev set: 44.16%
Loop 25/48 ({'n_estimators': 200, 'criterion': 'gini', 'max_depth': 10, 'ngram_range': (1, 1), 'vectorizer_strategy': 'count'}) => 0.04 min
Accuracy for dev set: 43.06%
Loop 26/48 ({'n_estimators': 200, 'criterion': 'gini', 'max_depth': 10, 'ngram_range': (1, 1), 'vectorizer_strategy': 'tf_idf'}) => 0.04 min
Accuracy for dev set: 42.96%
Loop 27/48 ({'n_estimators': 200, 'criterion': 'gini', 'max_depth': 10, 'ngram_range': (1, 2), 'vectorizer_strategy': 'count'}) => 0.04 min
Accuracy for dev set: 42.90%
Loop 28/48 ({'n_estimators': 200, 'criterion': 'gini', 'max_depth': 10, 'ngram_range': (1, 2), 'vectorizer_strategy': 'tf_idf'}) => 0.05 min
Accuracy for dev set: 42.96%
Loop 29/48 ({'n_estimators': 200, 'criterion': 'gini', 'max_depth': 50, 'ngram_range': (1, 1), 'vectorizer_strategy': 'count'}) => 0.08 min
Accuracy for dev set: 44.82%
Loop 30/48 ({'n_estimators': 200, 'criterion': 'gini', 'max_depth': 50, 'ngram_range': (1, 1), 'vectorizer_strategy': 'tf_idf'}) => 0.08 min
Accuracy for dev set: 44.02%
Loop 31/48 ({'n_estimators': 200, 'criterion': 'gini', 'max_depth': 50, 'ngram_range': (1, 2), 'vectorizer_strategy': 'count'}) => 0.09 min
Accuracy for dev set: 44.38%
Loop 32/48 ({'n_estimators': 200, 'criterion': 'gini', 'max_depth': 50, 'ngram_range': (1, 2), 'vectorizer_strategy': 'tf_idf'}) => 0.09 min
Accuracy for dev set: 44.36%
Loop 33/48 ({'n_estimators': 200, 'criterion': 'gini', 'max_depth': 100, 'ngram_range': (1, 1), 'vectorizer_strategy': 'count'}) => 0.14 min
Accuracy for dev set: 43.96%
Loop 34/48 ({'n_estimators': 200, 'criterion': 'gini', 'max_depth': 100, 'ngram_range': (1, 1), 'vectorizer_strategy': 'tf_idf'}) => 0.14 min
Accuracy for dev set: 43.70%
Loop 35/48 ({'n_estimators': 200, 'criterion': 'gini', 'max_depth': 100, 'ngram_range': (1, 2), 'vectorizer_strategy': 'count'}) => 0.15 min
Accuracy for dev set: 44.74%
Loop 36/48 ({'n_estimators': 200, 'criterion': 'gini', 'max_depth': 100, 'ngram_range': (1, 2), 'vectorizer_strategy': 'tf_idf'}) => 0.15 min
Accuracy for dev set: 43.82%
Loop 37/48 ({'n_estimators': 200, 'criterion': 'entropy', 'max_depth': 10, 'ngram_range': (1, 1), 'vectorizer_strategy': 'count'}) => 0.04 min
Accuracy for dev set: 42.98%
Loop 38/48 ({'n_estimators': 200, 'criterion': 'entropy', 'max_depth': 10, 'ngram_range': (1, 1), 'vectorizer_strategy': 'tf_idf'}) => 0.04 min
Accuracy for dev set: 43.46%
Loop 39/48 ({'n_estimators': 200, 'criterion': 'entropy', 'max_depth': 10, 'ngram_range': (1, 2), 'vectorizer_strategy': 'count'}) => 0.04 min
Accuracy for dev set: 43.02%
Loop 40/48 ({'n_estimators': 200, 'criterion': 'entropy', 'max_depth': 10, 'ngram_range': (1, 2), 'vectorizer_strategy': 'tf_idf'}) => 0.05 min
Accuracy for dev set: 43.32%
Loop 41/48 ({'n_estimators': 200, 'criterion': 'entropy', 'max_depth': 50, 'ngram_range': (1, 1), 'vectorizer_strategy': 'count'}) => 0.08 min
Accuracy for dev set: 44.98%
Loop 42/48 ({'n_estimators': 200, 'criterion': 'entropy', 'max_depth': 50, 'ngram_range': (1, 1), 'vectorizer_strategy': 'tf_idf'}) => 0.09 min
Accuracy for dev set: 44.50%
Loop 43/48 ({'n_estimators': 200, 'criterion': 'entropy', 'max_depth': 50, 'ngram_range': (1, 2), 'vectorizer_strategy': 'count'}) => 0.09 min
Accuracy for dev set: 44.64%
Loop 44/48 ({'n_estimators': 200, 'criterion': 'entropy', 'max_depth': 50, 'ngram_range': (1, 2), 'vectorizer_strategy': 'tf_idf'}) => 0.10 min
Accuracy for dev set: 44.32%
Loop 45/48 ({'n_estimators': 200, 'criterion': 'entropy', 'max_depth': 100, 'ngram_range': (1, 1), 'vectorizer_strategy': 'count'}) => 0.14 min
Accuracy for dev set: 44.14%
Loop 46/48 ({'n_estimators': 200, 'criterion': 'entropy', 'max_depth': 100, 'ngram_range': (1, 1), 'vectorizer_strategy': 'tf_idf'}) => 0.16 min
Accuracy for dev set: 44.08%
Loop 47/48 ({'n_estimators': 200, 'criterion': 'entropy', 'max_depth': 100, 'ngram_range': (1, 2), 'vectorizer_strategy': 'count'}) => 0.15 min
Accuracy for dev set: 44.74%
Loop 48/48 ({'n_estimators': 200, 'criterion': 'entropy', 'max_depth': 100, 'ngram_range': (1, 2), 'vectorizer_strategy': 'tf_idf'}) => 0.17 min
Accuracy for dev set: 44.30%
Best Params for Random Forest:
{'accuracy': 0.4498,
'params': {'n_estimators': 200,
'criterion': 'entropy',
'max_depth': 50,
'ngram_range': (1, 1),
'vectorizer_strategy': 'count'},
'classifier': OneVsRestClassifier(estimator=RandomForestClassifier(criterion='entropy',
max_depth=50,
n_estimators=200,
n_jobs=-1,
random_state=1990),
n_jobs=-1),
'body_vectorizer': CountVectorizer(max_features=2000),
'title_vectorizer': CountVectorizer(max_features=2000)}
It can be observed that:
- The performance of the `SVC` model (45.86%) was better than `RandomForestClassifier` (44.98%); however, the difference is less than one percentage point.
- The training and validation execution time with the `SVC` model is approximately 14 times greater than with `RandomForestClassifier`.
Taking the above into account, `RandomForestClassifier` will be the model trained with the complete training data, due to its speed-performance trade-off (in production, less computing time means lower cost).
Training with the complete dataset using the chosen hyperparameters:
best_params_random_forest
{'accuracy': 0.4498,
'params': {'n_estimators': 200,
'criterion': 'entropy',
'max_depth': 50,
'ngram_range': (1, 1),
'vectorizer_strategy': 'count'},
'classifier': OneVsRestClassifier(estimator=RandomForestClassifier(criterion='entropy',
max_depth=50,
n_estimators=200,
n_jobs=-1,
random_state=1990),
n_jobs=-1),
'body_vectorizer': CountVectorizer(max_features=2000),
'title_vectorizer': CountVectorizer(max_features=2000)}
rf_best_params = best_params_random_forest["params"]
rf_body_vectorizer = CountVectorizer(max_features = max_features, ngram_range=rf_best_params["ngram_range"])
rf_title_vectorizer = CountVectorizer(max_features = max_features, ngram_range=rf_best_params["ngram_range"])
rf_classifier = OneVsRestClassifier(RandomForestClassifier(max_depth=rf_best_params["max_depth"], criterion=rf_best_params["criterion"], n_estimators=rf_best_params["n_estimators"], n_jobs=-1, random_state=random_state), n_jobs=-1)
print("Training time:", end=" ")
train_model((reviews_train_data, reviews_dev_data), rf_classifier, rf_title_vectorizer, rf_body_vectorizer, return_results=False, print_confusion_matrix=True)
Training time: 9.05 min
Accuracy for dev set: 48.18%
Estimating the actual accuracy of the model with the test data set:
# Import
reviews_test_data = pd.read_json("./data/dataset_es_test.json", lines=True)
# Process
print("Body process:", end=" ")
%time reviews_test_data.review_body = reviews_test_data.review_body.apply(lemmatize_text)
print("Title process:", end=" ")
%time reviews_test_data.review_title = reviews_test_data.review_title.apply(lemmatize_text)
reviews_test_data = reviews_test_data[["review_title", "review_body", "stars"]]
title_words_test = rf_title_vectorizer.transform(reviews_test_data.review_title)
body_words_test = rf_body_vectorizer.transform(reviews_test_data.review_body)
X_test = sparse.hstack((title_words_test, body_words_test))
y_test = reviews_test_data.stars
# Predict
print("Predict:", end=" ")
start_time = time()
predictions = rf_classifier.predict(X_test)
print_elapsed_minutes(start_time)
# Evaluation
accuracy_test = accuracy_score(y_test, predictions)
print(f"Accuracy for test set: {100 * accuracy_test:.2f}%\n")
confusion_multi(y_test, predictions)
Body process: Wall time: 1min 8s
Title process: Wall time: 39.3 s
Predict: 0.02 min
Accuracy for test set: 48.28%
In this case we obtain an accuracy of approximately 48% for the prediction model.
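For context, a quick sanity check (not part of the original analysis): the accuracy of always predicting the most frequent class. Since the test split is balanced across the five ratings, this baseline should sit around 20%, well below the 48% obtained:
# Hypothetical baseline: always predict the most frequent class in the test set
majority_class = y_test.mode()[0]
baseline_predictions = np.full(len(y_test), majority_class)
print(f"Majority-class baseline: {100 * accuracy_score(y_test, baseline_predictions):.2f}%")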
Using the properties of the classifier, we can obtain the words that are most useful for making predictions. Below are the most relevant words for predicting the 1-star and 5-star cases, respectively:
all_body_words = [f"{word} (body)" for word in rf_body_vectorizer.get_feature_names()]
all_title_words = [f"{word} (title)" for word in rf_title_vectorizer.get_feature_names()]
all_words = [*all_title_words, *all_body_words]
def plot_feature_importance(all_words, importances, use_case):
    indices = np.argsort(importances)[::-1]
    indices = indices[:30]
    selected_words = [all_words[i] for i in indices]
    selected_importances = importances[indices]
    plt.figure(figsize=(15, 8))
    g = sns.barplot(x=selected_words, y=selected_importances)
    g.set_title(f"Most relevant words for predicting {use_case}", fontsize=fontsize)
    g.set_xlabel("Words", fontsize=fontsize)
    g.set_ylabel("Importance", fontsize=fontsize)
    g.tick_params(axis="x", labelsize=fontsize, labelrotation=80)
    plt.show()
# In One vs Rest, estimators_[0] is the model for the first class (1 star)
importances = rf_classifier.estimators_[0].feature_importances_
plot_feature_importance(all_words, importances, "1 star")
# ...and estimators_[4] is the model for the last class (5 stars)
importances = rf_classifier.estimators_[4].feature_importances_
plot_feature_importance(all_words, importances, "5 stars")
Indeed, the first plot focuses on negative words (1-star reviews) and the second on words that could be considered positive (5-star reviews). However, both plots share some words, which helps explain the confusion seen in the matrix.
Something interesting can be observed in the confusion matrix of the test set: the algorithm mostly "confuses" ratings 4 with 5, and 1 with 2.
This suggests it may make sense to change focus and, instead of treating the problem as multi-class, treat it as binary (positive reviews vs negative reviews).
To do this, reviews with 3 stars will be removed, reviews with 4 and 5 stars will be grouped as positive, and reviews with 1 and 2 stars as negative.
# Keep only 1-, 2-, 4- and 5-star reviews and map the stars to binary labels:
# 1 and 2 stars -> 1 (negative), 4 and 5 stars -> 0 (positive)
binary_train_data = reviews_train_data[reviews_train_data.stars.isin([1, 2, 4, 5])].copy()
binary_train_data.stars.replace([1, 2, 4, 5], [1, 1, 0, 0], inplace=True)
binary_dev_data = reviews_dev_data[reviews_dev_data.stars.isin([1, 2, 4, 5])].copy()
binary_dev_data.stars.replace([1, 2, 4, 5], [1, 1, 0, 0], inplace=True)
fig, axes = plt.subplots(1, 2, figsize=(15,5))
fig.suptitle("Binary classes count for train and dev datasets", fontsize=fontsize)
g = sns.countplot(x="stars", data=binary_train_data, ax=axes[0])
g.set_xlabel("Classes", fontsize=fontsize)
g.set_ylabel("Count", fontsize=fontsize)
axes[0].set_title("Train dataset")
g = sns.countplot(x="stars", data=binary_dev_data, ax=axes[1])
g.set_xlabel("Classes", fontsize=fontsize)
g.set_ylabel("Count", fontsize=fontsize)
axes[1].set_title("Dev dataset")
plt.show()
Starting the training:
binary_body_vectorizer = CountVectorizer(max_features = max_features, ngram_range=rf_best_params["ngram_range"])
binary_title_vectorizer = CountVectorizer(max_features = max_features, ngram_range=rf_best_params["ngram_range"])
binary_random_forest = RandomForestClassifier(max_depth=rf_best_params["max_depth"], criterion=rf_best_params["criterion"], n_estimators=rf_best_params["n_estimators"], n_jobs=-1, random_state=random_state)
print("Training time:", end=" ")
train_model((binary_train_data, binary_dev_data), binary_random_forest, binary_title_vectorizer, binary_body_vectorizer, return_results=False, print_confusion_matrix=True, binary=True)
Training time: 1.48 min
Accuracy for dev set: 86.40%
Inspecting the most important words for the binary model:
all_body_words_binary = [f"{word} (body)" for word in binary_body_vectorizer.get_feature_names()]
all_title_words_binary = [f"{word} (title)" for word in binary_title_vectorizer.get_feature_names()]
all_words_binary = [*all_title_words_binary, *all_body_words_binary]
# Unlike the One vs Rest case, the plain RandomForestClassifier exposes a single
# feature_importances_ attribute (its estimators_ are individual trees, not
# per-class models), so one plot covers the whole binary problem
importances = binary_random_forest.feature_importances_
plot_feature_importance(all_words_binary, importances, "the binary problem")
Results Analysis¶
A brief summary may help to compare the results obtained with what was expected:
- The problem: predict the star rating of a review from its content.
- Expected results: obtain a model that can predict the number of stars a user would assign to a purchase from their review. To do this, words with a positive connotation (perfect, wonderful, delivered, etc.) are expected to be associated with high ratings such as 4 and 5, and negative words (bad, lost, faulty, etc.) with low ratings such as 1 and 2.
The Random Forest model and the One vs Rest strategy provide 5 models, each dedicated to one rating. In the plots Most relevant words for predicting 1 star and Most relevant words for predicting 5 stars, we can see the predominance of words with negative connotations (return, cheat or fatal) and positive ones (perfect, great or excellent), respectively. However, considering that the comparison involves the two extreme scores, the differences were expected to be much more marked. There are unexpected words, like "perfect" among the negatives and "bad" among the positives. These coincidences between the models help explain the accuracy obtained (48%).
One might also expect the body of the review to serve as a better guide for classification, which agrees with the plotted results. Still, it should not be forgotten that some words carry considerable weight when they appear in the title, such as 'perfect' or 'no'.
Another expected result: if the perspective of the problem is changed and it is treated as binary (positive vs negative reviews), accuracy should increase. The confusion matrix of the multi-class problem hints at this, since the confusions are mostly concentrated between ratings 4 and 5 (positive) and between 1 and 2 (negative). Converting the problem to a binary one gives a significant increase in accuracy, to 86%, confirming the hypothesis.
Conclusions¶
- The best model obtained was `RandomForestClassifier` with the following hyperparameters: `n_estimators: 200`, `criterion: 'entropy'`, `max_depth: 50`.
- For vectorization, the `Bag of Words` strategy performed better than `TF-IDF` with Random Forest, in this case with unigrams.
- The estimated performance of the model is `48%` accuracy.
- An article published on this dataset indicates that using an algorithm called `BERT` and approximately `10` hours of training, an `accuracy` of `58%` can be obtained, so the analysis presented in this notebook can be improved by using more advanced models.
- The confusion matrix gives clues that the problem can be treated as binary: positive (4 and 5) and negative (1 and 2) reviews. Doing so yields a significant improvement in `accuracy`, from `48%` to `86%`.
- The Random Forest model makes the results much easier to interpret. In the plots of the most important words for making predictions, words associated with negative feelings predominate for the negative cases, and the same can be said for the positive reviews.
- An additional point to consider in those plots is the overlap (although in different proportions) between positive and negative reviews of words such as `devolucion`, `normal` or `perfecto`, which may be causing the confusion of the model.
- Possible improvements to this analysis:
  - Use the `One vs One` multi-class strategy.
  - Investigate the effect on model performance of filtering words more rigorously (for example, removing the word `no` from positive reviews).
  - Include the product categories of the reviews in the model.
  - Investigate and apply more advanced models (such as neural networks).