The Kaggle Trip Advisor Reviews Dataset comprises:
- 20,491 hotel reviews
- ratings from 1* (worst) to 5* (best)
The Dataset was modified to facilitate the task: the Rating column was binned into 3 classes
- Negative - class 0 (Rating 1* & Rating 2*)
- Neutral - class 1 (Rating 3*)
- Positive - class 2 (Rating 4* & 5*)
This thus becomes a sentiment classification task.
However, there is a class imbalance.
To compare the feature engineering of azureml AutoML with hand-crafted engineering, a preprocessing script was implemented.
For each hotel review the preprocessing script generates the following features:
- Ia. Spacy Transformer embedding en_trf_robertabase_lg
- Ib. Alternatively a character n-gram TFIDF embedding was used
- II. Review text length based features: short_review, long_review
- III. Sentiment polarity dictionaries for adjectives/frequent words were mapped to the texts. Minimum, mean and maximum polarity were determined per review text: min_adj, max_adj, mean_adj, min_freq_w, max_freq_w, mean_freq_w (see the sketch after this list)
- IV. LDA topic vectors were fitted for each text. Hyperparameter search for topic coherence yielded 30 topics as an optimal number of topics
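As an illustration, the polarity features (III) boil down to a simple aggregation per review; this is a minimal sketch, and the lexicon dict and function name are illustrative, not taken from the actual preprocessing script:

import numpy as np

def polarity_features(tokens, polarity_lexicon):
    # polarity_lexicon: dict mapping token -> polarity score (loaded from the socialsent TSVs)
    scores = [polarity_lexicon[t] for t in tokens if t in polarity_lexicon]
    if not scores:
        return {"min": 0.0, "mean": 0.0, "max": 0.0}
    return {"min": float(np.min(scores)),
            "mean": float(np.mean(scores)),
            "max": float(np.max(scores))}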
This feature-engineered Dataset was used for Hyperparameter tuning with azureml Hyperdrive
Please note that part of the pre-processing involves sentiment polarity dictionaries which were created by:
William L. Hamilton, Kevin Clark, Jure Leskovec, and Dan Jurafsky. Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora. arXiv preprint arXiv:1606.02820, 2016.
Download the sentiment polarity dictionaries via:
#### Save the unzipped files in the polarity directory to get the following structure:
automl_vs_hyperdrive/
│
└── data/
└── polarity_data
├── socialsent_hist_adj
│ └── adjectives
│ └── * many_tsv_files
│
└── socialsent_hist_freq
└── frequent_words
└── * many_tsv_files
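A possible way to read the polarity TSV files into a single lexicon is sketched below; the column layout (token, mean polarity, standard deviation) is an assumption about the socialsent file format and the function name is illustrative:

import glob
import os
import pandas as pd

def load_polarity_lexicon(directory):
    # Read every TSV file in the directory and collect token -> mean polarity
    lexicon = {}
    for tsv_path in glob.glob(os.path.join(directory, "*.tsv")):
        df = pd.read_csv(tsv_path, sep="\t", header=None,
                         names=["token", "mean_polarity", "std_polarity"])
        lexicon.update(dict(zip(df["token"], df["mean_polarity"])))
    return lexicon

adjective_lexicon = load_polarity_lexicon("data/polarity_data/socialsent_hist_adj/adjectives")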
Please note that the modeling is based on the Kaggle Trip Advisor Reviews Dataset, see the citation below:
Alam, M. H., Ryu, W.-J., Lee, S., 2016. Joint multi-grain topic sentiment: modeling semantic aspects for online reviews. Information Sciences 339, 206–223.
#### Kaggle Trip Advisor Reviews
Save the unzipped dataset file in the datasets directory to get the following structure:
automl_vs_hyperdrive/
│
└── data/
    └── datasets
        └── * unzipped Kaggle dataset csv
- I. Create a virtual environment
conda create --name automl_vs_hyperdrive python=3.7
- II. Activate your conda environment
conda activate automl_vs_hyperdrive
- III. Install requirements.txt
pip install -r requirements.txt
- IV. Execute setup.py
python setup.py develop
- V. Download en_trf_robertabase_lg spacy model
python -m spacy download en_trf_robertabase_lg
- VI. Download en_core_web_md spacy model
python -m spacy download en_core_web_md
In our workspace the Dataset is:
- manually loaded into the Notebook workspace of Azure Machine Learning studio
- uploaded into the Azure default datastore
- loaded as a Tabular Dataset
- registered as a Dataset object
import pandas as pd
from azureml.core import Dataset

# Load the raw Kaggle CSV and write a local copy
hotel_review_dataset = pd.read_csv(filepath_2_dataset)
hotel_review_dataset.to_csv("data/review_dataset.csv", index=False)

# Upload the csv to the default datastore, load it as a Tabular Dataset and register it
datastore = workspace.get_default_datastore()
datastore.upload(src_dir="data", target_path="data")
dataset_training = Dataset.Tabular.from_delimited_files(path=[(datastore, "data/review_dataset.csv")])
dataset_training = dataset_training.register(workspace=workspace, name="hotel-review-data", description="Hotel Review Dataset")
The following parameters were set for the AutoML Training Configuration (a configuration sketch follows after this list):
- experiment_timeout_minutes: set to prevent the experiment from running for long time periods at high cost
- max_concurrent_iterations: set to 4 since only 4 compute target nodes are available for parallel child runs
- primary_metric: set to AUC_weighted since it balances the false positive and true positive rates
- n_cross_validations: 5 cross-validations were selected, since this yields a more robust mean/std estimate for each model
- enable_early_stopping: enabled to prevent unproductive runs which yield no improvement but add cost
- compute_target: needs to be defined to perform the AutoML computations
- task: needs to be classification since the label column defines separate classes
- training_data: corresponds to the training set
- label_column: corresponds to the target/label column defining the separate classes
- debug_log: defined to enable detailed logging of AutoML errors
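A minimal sketch of how these settings map onto an AutoMLConfig is shown below; the timeout value and the label column variable are assumptions, the other values follow the list above:

from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task="classification",
    primary_metric="AUC_weighted",
    experiment_timeout_minutes=60,        # assumed timeout budget
    max_concurrent_iterations=4,
    n_cross_validations=5,
    enable_early_stopping=True,
    compute_target=compute_target,
    training_data=dataset_training,
    label_column_name=label_column,       # name of the binned sentiment label column (assumption)
    debug_log="automl_errors.log",
)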
The best selected AutoML models were a Voting Ensemble and a Random Forest:
- Voting Ensemble Accuracy: 0.7352
- RandomForest Accuracy: 0.7352
- The auto-selected Features were pre-processed by a MinMaxScaler
- The comparison of the AutoML results with the Hyperparameter tuning results revealed that the hand-crafted feature engineering is a major advantage over the automated preprocessing; thus the AutoML results could be improved if the engineered features were provided to AutoML as well
- Much better results might be obtainable by using a pre-trained neural model e.g. a pre-trained bert-base-uncased
(Screenshots: AutoML RunDetails part I, part II, part III)
A big pitfall of using Accuracy as the metric to be optimized was that the best model learned to classify 100% of the class 2 examples correctly while it failed for 100% of the cases of classes 0 and 1, as can be seen in the confusion matrix (see screenshot). In general the Dataset is highly class imbalanced, and e.g. an F1 score or weighted AUC would have been a better metric. Stratified sampling, upsampling or downsampling might also help with the class imbalance.
In the previous azureml experiments with AutoML, LightGBM was mostly the best performing model.
It is also widely reported that Gradient Boosting Machines win many Kaggle competitions unless they compete against a suitable pretrained neural net with sufficient finetuning.
To limit the training time I decided against training a neural net and in favour of the well-established XGBoost library.
The following hyperparameters were provided to the XGBoost model (an illustrative train.py excerpt follows after this list):
- '--max-depth': "How deep individual trees grow during one round of boosting."
- '--min-child-weight': "Minimum sum of weights of all observations in a child. Controls overfitting."
- '--gamma': "Minimum loss reduction required to make a split."
- '--subsample': "Fraction of samples that are randomly sampled per tree."
- '--colsample-bytree': "Fraction of feature columns that are randomly sampled per tree."
- '--reg-alpha': "L1 regularization of the weights. Increasing the value more strongly prevents overfitting."
- '--eta': "Learning rate for XGBoost."
- '--seed': "Random seed."
- '--num-iterations': "Number of fitting iterations."
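The following excerpt sketches how train.py might parse these arguments and pass them to an XGBClassifier; the default values and the mapping of --num-iterations to n_estimators are assumptions, not the exact script:

import argparse
from xgboost import XGBClassifier

parser = argparse.ArgumentParser()
parser.add_argument("--max-depth", type=int, default=3)
parser.add_argument("--min-child-weight", type=int, default=2)
parser.add_argument("--gamma", type=float, default=0.0)
parser.add_argument("--subsample", type=float, default=0.9)
parser.add_argument("--colsample-bytree", type=float, default=0.8)
parser.add_argument("--reg-alpha", type=float, default=1e-5)
parser.add_argument("--eta", type=float, default=0.2)
parser.add_argument("--seed", type=int, default=42)
parser.add_argument("--num-iterations", type=int, default=20)
args = parser.parse_args()

# argparse converts the dashes to underscores (args.max_depth etc.)
model = XGBClassifier(
    max_depth=args.max_depth,
    min_child_weight=args.min_child_weight,
    gamma=args.gamma,
    subsample=args.subsample,
    colsample_bytree=args.colsample_bytree,
    reg_alpha=args.reg_alpha,
    learning_rate=args.eta,              # eta is XGBoost's learning rate
    n_estimators=args.num_iterations,    # interpreting --num-iterations as boosting rounds
    random_state=args.seed,
)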
The parameters for RandomSearch were selected according to my prior experience and the excellent XGBoost Tuning Blog Post:
- XGBoost is very sensitive to tree-centered parameters such as max_depth, min_child_weight, subsample, colsample_bytree and gamma
- The regularization parameter alpha was also tuned
For Hyperparameter tuning a Random Grid was defined:
- uniform distribution for the subsample, gamma and colsample-bytree parameters
- loguniform distribution for the reg-alpha parameter
- discrete choice for the max-depth and min-child-weight parameters
from azureml.train.hyperdrive import RandomParameterSampling, choice, uniform, loguniform

# Search space for the XGBoost hyperparameters
parameter_sampling_grid = RandomParameterSampling(
    {
        "--max-depth": choice(3, 4, 5, 6),
        "--min-child-weight": choice(1, 2, 3, 4, 5),
        "--colsample-bytree": uniform(0.8, 1.0),
        "--subsample": uniform(0.7, 1.0),
        "--gamma": uniform(0, 0.4),
        "--reg-alpha": loguniform(-5, -1),
    }
)
An early termination BanditPolicy was used with an evaluation_interval of 2 and a slack_factor of 0.1:
from azureml.train.hyperdrive import BanditPolicy
early_termination_policy = BanditPolicy(slack_factor=0.1, evaluation_interval=2)
- evaluation_interval = 2 ==> every 2nd iteration it is checked whether the termination criterion is met
- slack_factor = 0.1 ==> if the evaluation result is 10% worse than the current optimum, the run is terminated
The Accuracy score was defined as the primary metric to be maximized during the parameter search. Besides the primary metric, the weighted F1 score was also logged.
The hyperparameter tuning was performed via the train.py script
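A rough sketch of how the sampling grid, the BanditPolicy and train.py could be wired together into a HyperDriveConfig is given below; the logged metric name, the environment variable and the run budgets are assumptions:

from azureml.core import ScriptRunConfig
from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal

script_config = ScriptRunConfig(
    source_directory=".",
    script="train.py",
    compute_target=compute_target,
    environment=myenv,                     # assumed training environment
)

hyperdrive_config = HyperDriveConfig(
    run_config=script_config,
    hyperparameter_sampling=parameter_sampling_grid,
    policy=early_termination_policy,
    primary_metric_name="Accuracy",        # assumed name of the logged metric
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=50,                     # assumed run budget
    max_concurrent_runs=4,
)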
For some plots of the Hyperparameter Search RunDetails see:
- Test set accuracy: 0.9189
- Test set weighted F1 score: 0.9080
Default script arguments:
- --max-depth 3
- --min-child-weight 2
- --gamma 0
- --subsample 0.9
- --colsample-bytree 0.8
- --reg-alpha 1E-05
- --eta 0.2
- --seed 42
- --num-iterations 20
Hyperparameters of the best model (sampled by RandomSearch):
- --colsample-bytree 0.9501132077820976
- --gamma 0.16945015198714986
- --max-depth 6
- --min-child-weight 2
- --reg-alpha 0.06308908942969567
- --subsample 0.7683197302311903
The best XGBoost model as selected by the hyperparameter RandomSearch was deployed as a Webservice.
(Screenshots: successful deployment of the best XGBoost model, Application Insights logs for the best model, example request against the best model endpoint)
The scoring script score.py was defined in such a way that (an illustrative sketch follows below):
- "Negative" is returned for class 0
- "Neutral" is returned for class 1
- "Positive" is returned for class 2
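The actual score.py lives in the repository; the following is only a minimal sketch of the described mapping, with the registered model name and the input format as assumptions:

import json
import joblib
import numpy as np
from azureml.core.model import Model

LABELS = {0: "Negative", 1: "Neutral", 2: "Positive"}

def init():
    global model
    # assumed registered model name
    model_path = Model.get_model_path("xgboost-review-model")
    model = joblib.load(model_path)

def run(raw_data):
    # expects a JSON payload of the form {"data": [[...feature values...], ...]}
    data = np.array(json.loads(raw_data)["data"])
    predictions = model.predict(data)
    return [LABELS[int(p)] for p in predictions]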
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

# Deploy the best model as an Azure Container Instance webservice
inference_config = InferenceConfig(entry_script="score.py", environment=myenv)
service_name = 'xgboost-review-classification'
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)
service = Model.deploy(workspace=ws,
                       name=service_name,
                       models=[best_model],
                       inference_config=inference_config,
                       deployment_config=aci_config,
                       overwrite=True)
service.wait_for_deployment(show_output=True)
print("scoring URI: " + service.scoring_uri)
Application Insights was enabled for logging the status of the Deployed Webservice:
service.update(enable_app_insights=True)
An example request was done with the following code:
import requests
import json
headers = {'Content-Type':'application/json'}
test_sample = json.dumps({'data': [
positive_example,
neutral_example,
negative_example
]})
test_sample = bytes(test_sample, encoding = 'utf8')
service_url = "http://06df2eb2-6456-4d1d-ae18-0470e3d3e11b.southcentralus.azurecontainer.io/score"
response = requests.post(service_url, test_sample, headers=headers)
print("prediction:", response.text)
All in all, the Hyperparameter Search for an XGBoost model with the engineered features was the most successful approach. However, the class imbalance was still not resolved and the feature engineering required a lot of exploratory data analysis. Finetuning a pre-trained neural model such as bert-base-uncased or distilbert might lead to even better performance but would also require more training time and more costly GPU resources.
Nevertheless, the combination of polarity dictionaries, transformer embeddings and topic modeling was a very fruitful representation and yielded reasonable results, in particular the test set F1 score of 0.908.
From an azureml perspective, a future improvement would definitely be to create 2 azureml Pipelines which include the text preprocessing step followed either by AutoML or Hyperdrive and a final custom evaluation step. This would enable the identification of a new model if additional training data becomes available and e.g. a domain shift of the review data is observed over time. An advantage would be that such a Pipeline could easily be triggered via an endpoint once it is published.
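A very rough sketch of such a pipeline is shown below; the step script names (preprocess.py, evaluate.py) are hypothetical and the AutoML or HyperDrive step would be slotted in between the two steps:

from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

preprocess_step = PythonScriptStep(name="preprocess", script_name="preprocess.py",
                                   compute_target=compute_target, source_directory=".")
evaluate_step = PythonScriptStep(name="evaluate", script_name="evaluate.py",
                                 compute_target=compute_target, source_directory=".")

pipeline = Pipeline(workspace=ws, steps=[preprocess_step, evaluate_step])
published_pipeline = pipeline.publish(name="review-classification-pipeline")
print("Pipeline endpoint:", published_pipeline.endpoint)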
Another improvement of the current approach would be to deal better with the class imbalance, e.g. via a stratified train/test split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
Alternatively, a random bootstrap upsampling of the underrepresented Negative and Neutral samples could be performed.
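A minimal upsampling sketch is given below, assuming a training DataFrame train_df with a label column named "label"; both names are illustrative:

import pandas as pd
from sklearn.utils import resample

majority = train_df[train_df["label"] == 2]
upsampled_parts = [majority]
for minority_class in (0, 1):
    minority = train_df[train_df["label"] == minority_class]
    # bootstrap: sample with replacement up to the majority class size
    upsampled_parts.append(resample(minority, replace=True,
                                    n_samples=len(majority), random_state=42))
train_df_balanced = pd.concat(upsampled_parts).sample(frac=1, random_state=42)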
A. Furthermore, it could be considered to bin the classes differently based on the original Rating column, e.g. splitting the Positive class into two classes (4* = Positive, 5* = Perfect):
- class 0 - Negative (1*&2*)
- class 1 - Neutral (3*)
- class 2 - Positive (4*)
- class 3 - Perfect (5*)
B. It could even be considered to merge the ratings 1*, 2* and 3* into a single class and convert the problem into a binary sentiment classification problem (a re-binning sketch for both options follows after this list):
- class 0 - Negative (1*,2*,3*)
- class 1 - Positive (4*, 5*)
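Both re-binnings boil down to a simple mapping of the original Rating column; the DataFrame and column names below are illustrative:

# Option A: Negative / Neutral / Positive / Perfect
four_class_map = {1: 0, 2: 0, 3: 1, 4: 2, 5: 3}
# Option B: binary Negative / Positive
binary_map = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1}
df["label_4class"] = df["Rating"].map(four_class_map)
df["label_binary"] = df["Rating"].map(binary_map)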
Actually, the distributions of the example Ratings in the TSNE embedding space for the topic embeddings, Roberta embeddings and TFIDF embeddings show that the current Negative and Neutral classes strongly overlap, which supports this approach. See the next section in the Readme.md about Extensive Exploratory Data Analysis and Feature Engineering.
For a fair comparison, AutoML should also be run with the engineered features and not just the raw review texts. Neural methods should also be enabled for the AutoML training.
To further improve the performance of the sentiment classifier I would test a range of state-of-the-art neural text classification architectures which have worked very well for text classification tasks I was conducting. Importantly, the Hyperdrive-based parameter search for these models needs to be performed on a GPU compute cluster. Also, the training and evaluation code needs to be refactored into a train.py script to enable the Hyperdrive-based parameter search. An important step for training/finetuning those neural text classification methods will be the implementation of DataLoaders which provide mini-batches to the models during training and evaluation.
- Train a BERT base text classifier with a Jupyter Notebook
- Here is an excellent introduction to transformer models and a walkthrough for the Jupyter Notebook. A bert-base-uncased transformer model from huggingface can be finetuned on the sequence sentiment classification task; however, the num_labels argument has to be set to 3. I have previously trained some high-performance text classification models with a customized version of this notebook.
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",          # Use the 12-layer BERT model, with an uncased vocab.
    num_labels = 3,               # The number of output labels -- 3 for 3 classes
    output_attentions = False,    # Whether the model returns attention weights.
    output_hidden_states = False, # Whether the model returns all hidden-states.
)
How to train a DistilBert classifier: a TFDistilBertForSequenceClassification transformer model from the Huggingface transformers library can be finetuned on the sentiment classification task.
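A minimal loading sketch, assuming the TensorFlow variant of the Huggingface API:

from transformers import DistilBertTokenizerFast, TFDistilBertForSequenceClassification

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased",
                                                              num_labels=3)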
FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP. With flair a TextClassifier can be trained by embedding the hotel reviews with concatenated FlairEmbeddings ('news-forward' + 'news-backward') and WordEmbeddings('glove') fed into a single LSTM DocumentRNNEmbeddings layer.
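A sketch of this setup is shown below; label_dictionary is assumed to be built from the corpus elsewhere, and the exact TextClassifier constructor arguments depend on the flair version:

from flair.embeddings import FlairEmbeddings, WordEmbeddings, DocumentRNNEmbeddings
from flair.models import TextClassifier

# Stack glove word embeddings with forward/backward Flair embeddings
# and pool them with a single LSTM document encoder
document_embeddings = DocumentRNNEmbeddings(
    [WordEmbeddings("glove"), FlairEmbeddings("news-forward"), FlairEmbeddings("news-backward")],
    rnn_type="LSTM",
)
classifier = TextClassifier(document_embeddings, label_dictionary=label_dictionary)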
[Convolutional Neural Networks for Sentence Classification](https://arxiv.org/pdf/1408.5882.pdf) (implementation of the CNN with PyTorch on Github): This widely used CNN text classification architecture introduced by Yoon Kim (2015) could be trained on the hotel review text classification task. As the input embedding matrix for the CNN I would rather use a sequence of the transformer embeddings generated with the Roberta model, as used as input features in my current text classifier. As in my current approach I would truncate the texts to a maximum length defined by the review length distribution. I would also extend the Github implementation to concatenate the final flattened CNN representation with the topic model embeddings and the text length features and use a softmax layer for the final classification.
The model should be exported in ONNX format to enable cross-platform compatibility and e.g. facilitate edge deployment.
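For the current XGBoost model, the export could look roughly like this; onnxmltools is an assumption about the conversion library, and n_features stands for the number of engineered input features:

from onnxmltools import convert_xgboost
from onnxmltools.convert.common.data_types import FloatTensorType

# Declare the input shape and convert the trained booster to ONNX
initial_types = [("input", FloatTensorType([None, n_features]))]
onnx_model = convert_xgboost(best_xgb_model, initial_types=initial_types)
with open("xgboost_review_classifier.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())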
A lot of feature engineering was performed prior to the Hyperparameter Tuning. For this purpose a Topic modeling and a Random Forest classification for identifying the feature importance of the engineered features were also performed with the following Jupyter notebook:
A tiny fraction of the gained insights is presented here:
One interesting feature for the classification task is a topic model vector. The topics were fitted on the documents via Latent Dirichlet Allocation (LDA) based on the lemmatized tokens, keeping only tokens with the part-of-speech (POS) tags ADJ, ADV, NOUN and VERB. To decide on the number of topics a grid search was performed with the topic coherence score as the target metric.
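Such a coherence-driven search can be sketched with gensim as follows; lemmatized_docs (a list of token lists restricted to the POS tags above) and the tested topic range are assumptions:

from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

dictionary = Dictionary(lemmatized_docs)
corpus = [dictionary.doc2bow(doc) for doc in lemmatized_docs]

# Fit an LDA model per candidate topic number and keep the coherence score
coherence_per_k = {}
for num_topics in range(10, 51, 10):
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, random_state=42)
    cm = CoherenceModel(model=lda, texts=lemmatized_docs,
                        dictionary=dictionary, coherence="c_v")
    coherence_per_k[num_topics] = cm.get_coherence()
best_num_topics = max(coherence_per_k, key=coherence_per_k.get)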
Another feature comprises the minimum, maximum and mean frequent word and adjective polarity scores which are obtained by matching the Adjectives and frequent words from polarity dictionaries with the corresponding documents and aggregating the polarity scores. The plot indicates increasing mean adjective polarity scores for classes 1 and 2 compared to class 0.
To assess how well the different embeddings (document topic vectors, roberta document embeddings, TFIDF document embeddings) are separating the different classes, 2D plots of the dimension reduced TSNE embeddings were visualized. Those plots indicate that all of those embeddings show at least some subspace being occupied by the Negative and Neutral hotel reviews.
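The 2D projections can be produced with scikit-learn's TSNE; here, embeddings stands for any of the document-level matrices (topic vectors, Roberta embeddings, TFIDF embeddings) and labels for the class labels, both assumed to exist:

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project the high-dimensional document embeddings to 2D for plotting
tsne_2d = TSNE(n_components=2, random_state=42).fit_transform(embeddings)
plt.scatter(tsne_2d[:, 0], tsne_2d[:, 1], c=labels, s=2)
plt.show()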
As a sanity check for how well suited the engineered features are for sentiment classification, a simple Random Forest classifier with its default fitting parameters was fitted to the training data, and the confusion matrix and classification report were evaluated. Both indicate that the RF model is at least reasonably capable of predicting the labels for classes 0 and 2 correctly, while it totally fails to predict class 1. The weighted average F1 score of 0.79 and the accuracy score of 0.84 for this very crude classification attempt are a good starting point.
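This sanity check boils down to a default-parameter Random Forest, roughly as sketched here (the train/test split variables follow the earlier snippet):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))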
Also the feature importance of the engineered features was evaluated. This indicates that in particular the polarity scores and Roberta embedding are mainly contributing to the classification.
All in all, this indicated that a traditional machine learning approach with these engineered features is worth pursuing.
The content of this repository is licensed under a MIT License