Recipe 1: Automated hyperparameter optimization with Optuna
Tuning a Random Forest Classifier automatically with Optuna.
Ever wonder how to maximize your model's potential without the usual computational drain? Learn how Optuna provides the key to efficient hyperparameter tuning.
🍽 Introduction
In the world of machine learning, the recipe for a great model isn't just about data quality, or selecting the right algorithm—it's also about finding the perfect mix of hyperparameters. Think of hyperparameters as the spices in a dish. Too much or too little, and the outcome can be underwhelming or even unpalatable.
Traditionally, researchers and data scientists spent countless hours, or even days, manually tweaking these hyperparameters. Then, they turned to grid search—a method as exhaustive as it sounds—testing every possible combination. Random search came along as a more palatable alternative, offering a randomized approach. But both methods often fall short in efficiency and can be computationally expensive, especially for complex modeling pipelines.
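To make the cost of grid search concrete, here is a small back-of-the-envelope sketch (the grid values and trial budget below are made up for illustration): grid search must fit one model per combination, while random search samples a fixed budget of points from the same space.

```python
import itertools
import random

# A hypothetical search space for a Random Forest
grid = {
    "n_estimators": [10, 50, 100, 150, 200],
    "max_depth": [2, 4, 8, 16, 32],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", "log2", None],
}

# Grid search: one model fit per combination
n_grid = len(list(itertools.product(*grid.values())))
print(f"Grid search fits: {n_grid}")  # 5 * 5 * 3 * 3 = 225 fits

# Random search: a fixed budget, regardless of how large the grid is
budget = 30
rng = random.Random(0)
samples = [{k: rng.choice(v) for k, v in grid.items()} for _ in range(budget)]
print(f"Random search fits: {len(samples)}")  # 30 fits
```

Adding one more hyperparameter with five values multiplies the grid-search cost by five, while the random-search budget stays put; that is exactly the inefficiency that smarter optimizers aim to fix.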
How, then, can we season our models to perfection without the exhaustive taste-testing? Enter Optuna, the next-generation hyperparameter optimization tool.
💡 Spotlight on Optuna
Optuna emerges as an open source tool in the realm of hyperparameter optimization for machine learning. Embracing a "define-by-run" style, Optuna offers dynamic hyperparameter search, setting it apart from conventional static approaches. This unique design promotes flexibility, allowing the search space to be constructed dynamically. Moreover, it's optimized for efficiency; not only does it systematically prune unpromising trials, but it also seamlessly integrates with popular machine learning frameworks, making it a versatile tool for both lightweight experiments and intensive, distributed computations.
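To build intuition for "define-by-run", here is a toy sketch in plain Python; the `ToyTrial` class is a stand-in of our own invention, not Optuna's actual implementation. The point is that the search space is not declared up front: parameters come into existence only when (and if) the objective asks for them.

```python
import random

class ToyTrial:
    """A toy stand-in for an Optuna trial: it records parameters as the
    objective requests them, so the search space only exists at run time."""
    def __init__(self, rng):
        self.rng = rng
        self.params = {}

    def suggest_categorical(self, name, choices):
        self.params[name] = self.rng.choice(choices)
        return self.params[name]

    def suggest_int(self, name, low, high):
        self.params[name] = self.rng.randint(low, high)
        return self.params[name]

def objective(trial):
    # Define-by-run: the branch taken decides which parameters exist at all.
    classifier = trial.suggest_categorical("classifier", ["rf", "svm"])
    if classifier == "rf":
        trial.suggest_int("n_estimators", 2, 150)   # only sampled for "rf"
    else:
        trial.suggest_int("svm_degree", 1, 5)       # only sampled for "svm"
    return trial.params

trial = ToyTrial(random.Random(42))
print(objective(trial))  # the sampled params depend on the branch taken
```

A static grid would have to enumerate `n_estimators` and `svm_degree` for every trial; here, each trial only carries the parameters its own code path touched.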
🔪 Optuna in Action: Optimizing a Scikit-Learn Model
Setting up Optuna: First, ensure you have Optuna installed. If not, you can easily install it using pip:
```shell
pip install optuna
```
Prepare the Dataset: For our example, we'll use the classic Iris dataset. Split it into training and testing sets for evaluation:
```python
import optuna
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2
)
```
Defining the Objective Function: The objective function should encompass the model, its hyperparameters, and the metric you wish to optimize. For simplicity, let's optimize a Random Forest Classifier's `n_estimators` and `max_depth` on the Iris dataset.

```python
def objective(trial):
    # Optuna will explore integer values between 2 and 150 for 'n_estimators'
    n_estimators = trial.suggest_int("n_estimators", 2, 150)

    # For 'max_depth', Optuna will test integers between 1 and 32.
    # Using 'log=True' makes the distribution logarithmic, making it
    # more likely to explore values around the lower limit.
    max_depth = trial.suggest_int("max_depth", 1, 32, log=True)

    clf = RandomForestClassifier(
        max_depth=max_depth, n_estimators=n_estimators
    )
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))
```
Running the Optimization: Use Optuna to find the hyperparameters that maximize the accuracy score:

```python
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
```

We set up an Optuna `study` to manage the optimization process. The `direction="maximize"` argument tells Optuna that we want to maximize the output of our objective function (in this case, model accuracy). The `optimize` method then performs the actual optimization by evaluating the objective function with different hyperparameters; `n_trials=100` means Optuna will try 100 different combinations of hyperparameters to find the best ones.
For a deeper dive into the optimization process, refer to this segment of the documentation.
Retrieving the best trial: After optimization, display the parameters of the best trial:

```python
print(f"Best trial: {study.best_trial.params}")
# Output example: {'n_estimators': 93, 'max_depth': 2}
```
By the end of the study, Optuna will have searched through a variety of `n_estimators` and `max_depth` combinations, identifying the ones that give the best accuracy score on the test set of the Iris dataset.

Curious about the other trials or how the best one compares?

```python
study.trials_dataframe().sort_values("value", ascending=False).head()
```
This is a rudimentary example, and in a real-world scenario, the objective function can encompass more sophisticated pipelines, preprocessing steps, and additional hyperparameters.
🍲 Final Thoughts
Voilà! With Optuna's help, we've fine-tuned our Random Forest Classifier's hyperparameters. A well-tuned model can make all the difference in your data dish! However, remember that, as in any cooking show, today's recipe has been tailored for a quick demonstration. For a robust model in real-world applications, it's essential to guard against overfitting: because our objective scored every trial on the same test split, the chosen hyperparameters can end up tuned to that particular split. But that is a topic for another delicious recipe.
Bon Appétit! Oussama, your friendly Data Cook. 🍳
📖 References
Optuna Official Documentation: Optuna: A hyperparameter optimization framework
Image prompt: “In a modern kitchen setting, a chef, wearing an apron adorned with the 'Optuna' logo, stands focused on his culinary preparation. He glances at a nearby screen, which displays detailed graphs. Next to the screen, a range of spices are elegantly arranged.”
This journey of curating data science 'recipes' began with the inspiration drawn from Optuna. It's only fitting that it becomes Recipe #1 in this series.