
Optimizing an ML Pipeline in Azure

Overview

This project is part of the Udacity Azure ML Nanodegree. In this project, we build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model. This model is then compared to an Azure AutoML run.

Summary

The dataset appears to be based on the Kaggle Bank Marketing Dataset: https://www.kaggle.com/janiobachmann/bank-marketing-dataset

It contains data about bank customers (21 columns in total), including the success of past marketing campaigns.

Numeric columns: balance, day, age, duration, campaign, pdays, emp.var.rate, cons.price.idx, euribor3m, nr.employed

Categorical columns: month, dayofweek, default, job, loan, marital status, education, housing, contact, previous, poutcome

Label column: success of the marketing campaign (yes/no)

The goal is to predict the success of future marketing campaigns based on the outcomes of past campaigns, taking into account the features of the target customers.

The best performing model was a LightGBM (gradient boosting machine) classifier with MaxAbsScaler preprocessing, found by running AutoML on the cleaned dataset. It achieved an overall accuracy of 0.91596.

Scikit-learn Pipeline

The pipeline architecture, including the data, hyperparameter tuning, and classification algorithm, consists of the following steps:

I. The dataset is loaded

II. Data preprocessing/cleanup step:

  1. NA values are dropped
  2. The marital, default, housing, loan and poutcome columns are one-hot encoded
  3. The contact and education columns are also one-hot encoded via dummy variables
  4. The months are mapped to the numbers 1-12
  5. The weekdays are mapped to the numbers 1-7
  6. The target label column is encoded as 1 = success, 0 = failure of the marketing campaign

III. The dataset is split into a training and a test set with sklearn's train_test_split (a test size of 20% is used)

IV. A regularization constant and a maximum number of iterations are defined as hyperparameters

V. A Logistic Regression classifier is fitted (taking the hyperparameters into account)

VI. The accuracy of the classifier is evaluated on the test set
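A minimal sketch of such a training script, assuming a local CSV copy of the data, a label column named `y`, and a `clean_data` helper shaped after the steps above (these names, file paths, and string mappings are assumptions, not the exact code of this repository):

```python
import argparse

import pandas as pd
from azureml.core.run import Run
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def clean_data(df):
    # Drop NA values and one-hot encode the categorical columns
    df = df.dropna()
    df = pd.get_dummies(df, columns=["marital", "default", "housing", "loan",
                                     "poutcome", "contact", "education"])
    # Map months to 1-12 and weekdays to 1-7 (assumed string formats)
    months = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
              "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}
    weekdays = {"mon": 1, "tue": 2, "wed": 3, "thu": 4, "fri": 5, "sat": 6, "sun": 7}
    df["month"] = df["month"].map(months)
    df["dayofweek"] = df["dayofweek"].map(weekdays)
    # Encode the label (1 = success, 0 = failure); "y" is an assumed column name
    y = df.pop("y").apply(lambda v: 1 if v == "yes" else 0)
    return df, y


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--C", type=float, default=1.0)       # inverse regularization strength
    parser.add_argument("--max_iter", type=int, default=100)  # maximum number of iterations
    args = parser.parse_args()

    # Log the sampled hyperparameters and the resulting accuracy to the run
    run = Run.get_context()
    run.log("Regularization Strength:", float(args.C))
    run.log("Max iterations:", int(args.max_iter))

    # "bankmarketing.csv" is a hypothetical local copy of the dataset
    x, y = clean_data(pd.read_csv("bankmarketing.csv"))
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

    model = LogisticRegression(C=args.C, max_iter=args.max_iter).fit(x_train, y_train)
    run.log("Accuracy", float(model.score(x_test, y_test)))
```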

During the hyperparameter search, an optimal classifier is found by varying the hyperparameters of the model and evaluating the model performance for each specific combination. For the Logistic Regression classifier, the L2 regularization strength and the maximum number of iterations are varied.

For the hyperparameter search, random sampling is employed: in each run, a random combination of the hyperparameters C and max_iter is sampled based on the provided constraints. A loguniform range from -3 to 2 is defined for the regularization constant C (Azure ML's loguniform draws exp(uniform(-3, 2)), i.e. roughly 0.05 to 7.4), and a choice set is defined for the maximum number of iterations: choice(1, 5, 10, 20, 30, 40, 50, 80, 100, 200, 400, 800, 1000).
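A sketch of this sampler with the Azure ML SDK v1 hyperdrive package (the script argument names `--C` and `--max_iter` follow the training sketch above):

```python
from azureml.train.hyperdrive import RandomParameterSampling, choice, loguniform

# Randomly sample C on a log scale and max_iter from a discrete choice set
param_sampling = RandomParameterSampling({
    "--C": loguniform(-3, 2),  # exp(uniform(-3, 2)), i.e. roughly 0.05 to 7.4
    "--max_iter": choice(1, 5, 10, 20, 30, 40, 50, 80, 100, 200, 400, 800, 1000),
})
```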

For each iteration of the hyperparameter search, a different combination is sampled, e.g. C = 3*10^-1 and max_iter = 80. Those parameters are set at step IV, and steps IV-VI are run to find the accuracy of the Logistic Regression model with those hyperparameters.

What are the benefits of the parameter sampler you chose? In contrast to random search, grid search is an exhaustive search method that sequentially tries every combination of hyperparameters across the hyperparameter space. Due to its randomness, random search explores the whole search space more rapidly. With a stopping criterion in place, the random search can stop exactly at the point in time when the criterion is reached. Because random search takes random steps across the search space as a whole, it can visit different parameter space regions more quickly than grid search, which explores each region sequentially, and it will stop once some subregion satisfies the stopping criterion, while grid search may still be searching in a local subspace far away from it.

What are the benefits of the early stopping policy you chose? If the hyperparameter search has already converged, no parameter combination that is much better than the currently best configuration is likely to be found. If there is a sequence of runs that are much worse than the best configuration for multiple iterations, the hyperparameter search is aborted. This avoids wasting resources, since it is unlikely that a better parameter combination can still be found.
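The behaviour described above corresponds to a Bandit-style early termination policy; the text does not name the exact policy, so the following is only a sketch, additionally assuming a ScriptRunConfig `src` for the training script and the `param_sampling` defined above:

```python
from azureml.core import ScriptRunConfig
from azureml.train.hyperdrive import BanditPolicy, HyperDriveConfig, PrimaryMetricGoal

# Stop runs whose accuracy falls too far below the best run so far
early_termination_policy = BanditPolicy(slack_factor=0.1, evaluation_interval=1, delay_evaluation=5)

hyperdrive_config = HyperDriveConfig(
    run_config=src,                          # ScriptRunConfig for the training script (assumed)
    hyperparameter_sampling=param_sampling,  # the RandomParameterSampling from above
    policy=early_termination_policy,
    primary_metric_name="Accuracy",          # metric logged by the training script
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=50,                       # budget values are illustrative
    max_concurrent_runs=4,
)
```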

AutoML

AutoML iterated through multiple preprocessing steps (e.g. MinMaxScaler vs. SparseNormalizer for the continuous numeric features) to find the best feature engineering, and a model architecture was selected. Mainly tree-based classifiers were investigated: the best model was a Light Gradient Boosting Machine (LightGBM), while some Random Forest models and voting ensemble classifiers were also evaluated for their classification performance.
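A minimal sketch of such an AutoML configuration with the Azure ML SDK v1, assuming the cleaned data has been registered as a tabular dataset `train_ds` with label column `y` and that a compute cluster `compute_target` exists (all assumed names and budgets):

```python
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task="classification",
    primary_metric="accuracy",
    training_data=train_ds,          # cleaned TabularDataset (assumed name)
    label_column_name="y",           # label column (assumed name)
    n_cross_validations=5,
    experiment_timeout_minutes=30,
    compute_target=compute_target,   # previously created compute cluster (assumed)
)
```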

For the different model architectures, different model hyperparameters were analysed, as reported in the output logs. For example, an XGBoostClassifier with SparseNormalizer was evaluated with

`"param_kwargs": {"booster": "gbtree", "colsample_bytree": 1, "eta": 0.3, "gamma": 5, "grow_policy": "lossguide", "max_bin": 63, "max_depth": 10, "max_leaves": 0, "n_estimators": 25, "objective": "reg:logistic", "reg_alpha": 1.5625, "reg_lambda": 0.10416666666666667, "subsample": 0.7, "tree_method": "hist"}`

versus another XGBoostClassifier with SparseNormalizer with

`"param_kwargs": {"booster": "gbtree", "colsample_bytree": 0.9, "eta": 0.3, "gamma": 0, "max_depth": 9, "max_leaves": 0, "n_estimators": 25, "objective": "reg:logistic", "reg_alpha": 0, "reg_lambda": 0.7291666666666667, "subsample": 0.9, "tree_method": "auto"}`

Pipeline comparison

What are the differences in accuracy? In architecture? If there was a difference, why do you think there was one?

The maximum Logistic Regression classification performance with HyperDrive was reached for multiple combinations of the hyperparameters. The maximum Logistic Regression accuracy was 0.907283, e.g. for the combination of C ≈ 0.073 and max_iter = 50:

{'Regularization Strength:': 0.07304859840588435, 'Max iterations:': 50, 'Accuracy': 0.9072837632776934}

{'Regularization Strength:': 0.8783856468955435, 'Max iterations:': 30, 'Accuracy': 0.9072837632776934}

{'Regularization Strength:': 0.07815523037726442, 'Max iterations:': 200, 'Accuracy': 0.9072837632776934}

{'Regularization Strength:': 0.19954366367073706, 'Max iterations:': 1000, 'Accuracy': 0.9072837632776934}

{'Regularization Strength:': 1.1174328674115732, 'Max iterations:': 1, 'Accuracy': 0.887556904400607}

...

Thus, the Logistic Regression accuracy of 0.907283 (optimized by HyperDrive) is lower than that of the best performing AutoML model, LightGBM, at 0.91596.

The model architectures are completely different. A decision tree ensemble method such as LightGBM is based on an ensemble of individually weak decision tree classifiers, with every decision tree making a sequence of binary decisions on features to bin the data points into homogeneous sets of classes. In contrast, the Logistic Regression model learns a weight for each feature, capturing each feature's impact on the binary decision via a matrix multiplication followed by a sigmoid.
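For reference, the Logistic Regression prediction for a feature vector x with weights w and bias b is:

$$p(y = 1 \mid x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}}$$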

The Logistic Regression model did not include any hyperparameter search over the preprocessing of continuous features, in contrast to the AutoML model; this might at least partially explain the improved performance of LightGBM vs. Logistic Regression. Also, the Logistic Regression was regularized, which might decrease its performance at least partially, while gradient boosting tends to overfit the data if not properly regularized.

Future work

What are some areas of improvement for future experiments? Why might these improvements help the model?

The class imbalance should be addressed, e.g. by stratified sampling when performing the train/test split or by upsampling the underrepresented class (see the sketch below). For hyperparameter tuning, the preprocessing of continuous variables should also be investigated, e.g. log normalization vs. min-max scaling. Missing value imputation might help include more data points if there are any missing values. A neural network model might yield some improvement as well, since it is a completely different architecture that was not yet tested.
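As a sketch of the first suggestion, scikit-learn's train_test_split supports stratification directly (`x` and `y` are the cleaned features and labels from the training sketch above):

```python
from sklearn.model_selection import train_test_split

# Preserve the class ratio of y in both the training and the test set
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=42
)
```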

Proof of cluster clean up

Image of the compute cluster marked for deletion

License

The content of this repository is licensed under an MIT License.
