Author: Ioannis Tolios
Data Scientist

Introduction to PyCaret - Build ML models faster w/ less code

Creating Regression Models with the PyCaret Library

Most machine learning practitioners start experimenting with the established scikit-learn library, but there's an easier and more approachable alternative, named PyCaret.

This library has many advantages compared to scikit-learn, especially for people with limited experience. This tutorial will provide an overview of the main features of PyCaret, as well as a case study focusing on regression. I suggest installing the latest version of Anaconda on Windows, macOS, or Linux to follow this tutorial, but the code also runs on Google Colab. You can either execute the code in a Jupyter notebook or use your preferred IDE.

Installing PyCaret

First of all, we have to install PyCaret by running the following command in an Anaconda terminal:

pip install pycaret

This installs PyCaret on your machine, along with any required dependencies that are missing.

The PyCaret Regression Module

Regression is a basic supervised machine learning task which estimates the relationship between a dependent variable $y$ (known as the target) and independent variables (known as features).
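As a minimal illustration of that idea, the sketch below fits a one-feature linear regression by ordinary least squares in plain Python. The data points are made up for illustration and are not taken from the dataset used later in this tutorial.

```python
# Ordinary least squares for a single feature: y ≈ slope * x + intercept.
# The data points below are hypothetical.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# slope = covariance(x, y) / variance(x)
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
intercept = mean_y - slope * mean_x


def predict(value):
    """Predict the target for a new feature value."""
    return slope * value + intercept


print(f"slope={slope:.3f}, intercept={intercept:.3f}, prediction for x=6: {predict(6.0):.3f}")
```

Here the feature is x and the target is y; the fitted line is the estimated relationship between them.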

Regression is used to predict continuous values, such as the price of a house, whereas classification predicts discrete values known as classes. The PyCaret regression module, which uses sklearn under the hood, lets you create and test regression models with a few lines of code. It includes a variety of algorithms, as well as plotting and hyperparameter tuning functions.

We are now going to examine a regression case study based on that module.

Loading a Dataset

The basis of every machine learning project is the acquisition or creation of an appropriate dataset. PyCaret includes a variety of example datasets for different kinds of machine learning tasks, and in this project we will use the medical insurance dataset.

This dataset originates from the book Machine Learning with R by Brett Lantz, and contains health insurance information. The target variable, $y$, represents the insurance charges for each person, and the features are properties, such as age, sex, and Body Mass Index (BMI).

Real-world data is rarely that simple, but working with toy datasets helps us understand the concepts and methodology before moving on to more complex cases.

Here is a description for each dataset variable:

  • age: age of the primary beneficiary
  • sex: insurance contractor gender - female, male
  • bmi: body mass index, an objective index of body weight relative to height, computed as weight divided by height squared ($kg / m ^ 2$); the ideal range is 18.5 to 24.9
  • children: number of children covered by health insurance / number of dependents
  • smoker: whether the beneficiary smokes - yes, no
  • region: the beneficiary's residential area in the US - northeast, southeast, southwest, northwest
  • charges: Individual medical costs billed by health insurance
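The bmi variable in the list above is weight in kilograms divided by the square of height in metres. A small sketch of that arithmetic, using a hypothetical person rather than a row from the dataset:

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """BMI in kg/m^2: weight divided by the square of height."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall.
bmi = body_mass_index(70.0, 1.75)
print(round(bmi, 1))
```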

To get the data, we'll use the get_data function from pycaret:

from pycaret.datasets import get_data

data = get_data('insurance')
Out:
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520

data.info()
Out:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

The get_data function returns a pandas dataframe, so we can use the info() function from pandas to get some details about the dataset.

As we can see, there are 1338 records, and zero null values. Most real-world datasets have some null values and may require some feature engineering, but in this case we don't have to deal with that.
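When a dataset does contain gaps, pandas can count them directly. The sketch below uses a tiny hypothetical dataframe (not the insurance data) with one missing value to show the idea:

```python
import pandas as pd

# Hypothetical toy frame with one missing value, for illustration only.
toy = pd.DataFrame({
    "age": [19, 18, None],
    "smoker": ["yes", "no", "no"],
})

# Count missing values per column; info() reports the complement (non-null counts).
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# True if any value in the frame is missing.
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```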

Exploratory Data Analysis (EDA)

After loading the dataset, we will normally need to examine and understand its basic properties. This is known as Exploratory Data Analysis, and can be accomplished with various tools and methods, such as plotting.

We start by plotting the histograms of the numerical variables.

import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
sns.set_style('darkgrid')
colors = ['#851836', '#EDBD17', '#0E1428', '#407076', '#4C5B61']
sns.set_palette(sns.color_palette(colors))
numerical = ['bmi', 'age', 'charges']
data[numerical].hist(bins=20, layout=(1, 3), figsize=(9,3))

plt.tight_layout()
plt.show()
RESULT:
bmi-age-charges-plot.png

Here, we're using the built-in hist() function from pandas to plot a histogram for age, BMI, and charges. This helps us better understand the distribution of values for these numerical variables.

The BMI variable has a distribution close to normal, while the charges variable is right-skewed. Skewed distributions can be a problem for machine learning algorithms, so we will deal with that later.
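Skewness can also be quantified rather than read off a histogram. The following sketch computes it for a synthetic right-skewed sample and again after a log transform. The sample is generated with the standard library and only stands in for a variable like charges; it is not the insurance data.

```python
import math
import random


def skewness(values):
    """Skewness as the mean cubed z-score (population form)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n


random.seed(0)
# Synthetic right-skewed sample (log-normal); the parameters are arbitrary.
skewed = [random.lognormvariate(9.0, 0.9) for _ in range(5000)]
logged = [math.log(v) for v in skewed]

print(f"skewness before log: {skewness(skewed):.2f}")
print(f"skewness after log:  {skewness(logged):.2f}")
```

A value near zero indicates a roughly symmetric distribution, while a clearly positive value indicates a long right tail.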

Now, we'll get a bit creative by plotting the histogram of the target variable — i.e., the insurance charges — with stacked bars that represent different categories of the categorical variables. We accomplish this by using the histplot() function of the seaborn library:

categorical = ['sex', 'children', 'smoker', 'region']

fig, axs = plt.subplots(2, 2, figsize=(20,10))

for variable, ax in zip(categorical, axs.flatten()):
  sns.histplot(data, x='charges', hue=variable, multiple='stack', ax=ax)
RESULT:
categorical-stacked-bar-plots.png

Smokers have significantly higher charges, and we can see that men have higher medical costs more often than women.

Now that we've gleaned some useful insight from EDA, let's begin the PyCaret process for regression on this data.

Initializing a PyCaret Environment

The setup() function of PyCaret initializes the environment and prepares the data for machine learning modeling and deployment. It has two required parameters: the dataset and the name of the target variable. After executing the function, each feature's type is inferred, and several pre-processing tasks are performed on the data.

from pycaret.regression import *

reg = setup(
    data=data, 
    target='charges', 
    train_size=0.8, 
    session_id=10,
    normalize=True, 
    transform_target=True
)
Out:
    Description                             Value
0   session_id                              10
1   Target                                  charges
2   Original Data                           (1338, 7)
3   Missing Values                          False
4   Numeric Features                        2
5   Categorical Features                    4
6   Ordinal Features                        False
7   High Cardinality Features               False
8   High Cardinality Method                 None
9   Transformed Train Set                   (1070, 14)
10  Transformed Test Set                    (268, 14)
11  Shuffle Train-Test                      True
12  Stratify Train-Test                     False
13  Fold Generator                          KFold
14  Fold Number                             10
15  CPU Jobs                                -1
16  Use GPU                                 False
17  Log Experiment                          False
18  Experiment Name                         reg-default-name
19  USI                                     bd4e
20  Imputation Type                         simple
21  Iterative Imputation Iteration          None
22  Numeric Imputer                         mean
23  Iterative Imputation Numeric Model      None
24  Categorical Imputer                     constant
25  Iterative Imputation Categorical Model  None
26  Unknown Categoricals Handling           least_frequent
27  Normalize                               True
28  Normalize Method                        zscore
29  Transformation                          False
30  Transformation Method                   None
31  PCA                                     False
32  PCA Method                              None
33  PCA Components                          None
34  Ignore Low Variance                     False
35  Combine Rare Levels                     False
36  Rare Level Threshold                    None
37  Numeric Binning                         False
38  Remove Outliers                         False
39  Outliers Threshold                      None
40  Remove Multicollinearity                False
41  Multicollinearity Threshold             None
42  Clustering                              False
43  Clustering Iteration                    None
44  Polynomial Features                     False
45  Polynomial Degree                       None
46  Trignometry Features                    False
47  Polynomial Threshold                    None
48  Group Features                          False
49  Feature Selection                       False
50  Features Selection Threshold            None
51  Feature Interaction                     False
52  Feature Ratio                           False
53  Interaction Threshold                   None
54  Transform Target                        True
55  Transform Target Method                 box-cox

After running the setup() function on our data, the results display the pre-processing pipeline applied to the dataset. Some highlights of this pipeline are:

1. Inferred data types

We can see that four features have been correctly identified as categorical, and the rest as numerical. In case PyCaret fails to do that correctly, we can define them in the setup() function ourselves, using the categorical_features and numeric_features parameters.

2. Train/Test Split

The dataset has been split into a train and test set, as is standard practice in machine learning. The train set size has been set to 80% of the original dataset, meaning that 80% of the data will be used to train the machine learning model and the rest for testing its accuracy.

3. Normalization of Numerical Features

Many regression algorithms require the features to be normalized to work as expected. Normalized features have $μ = 0$ and $σ = 1$. The standard method to accomplish that is to replace each value with its associated z-score, which is defined as $z = \frac{x-μ}{σ}$.

4. One-Hot Encoding of Categorical Features

Some machine learning algorithms accept categorical features and some don't, so it is best to convert them to numerical features using one-hot encoding. One-hot encoding removes the categorical features and replaces them with binary indicator variables, one per category. For a feature with only two categories, a single indicator column is enough, which also avoids the dummy variable trap.

5. Target Transformation

As we noticed in the EDA section, the target variable is right-skewed. This could cause problems, as many regression algorithms perform best when the target is approximately normally distributed. The setup() function includes the option to transform the target so that its distribution is closer to normal; the output above reports Box-Cox as the method used. Transformations can also be applied to the features if needed, but that was unnecessary in this case.
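To make the target transformation more concrete, here is a minimal sketch of the Box-Cox transform and its inverse in plain Python. The value of lambda is a hypothetical placeholder chosen for illustration; in practice it is estimated from the training target, and this sketch does not reproduce how PyCaret does that internally.

```python
import math


def box_cox(y: float, lam: float) -> float:
    """Box-Cox transform of a positive value y for a given lambda."""
    if y <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def inverse_box_cox(z: float, lam: float) -> float:
    """Map a transformed value back to the original scale."""
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)


# Hypothetical lambda, chosen only for illustration.
lam = 0.1
# The charges of the first five rows of the dataset, sorted ascending.
charges = [1725.5523, 3866.8552, 4449.462, 16884.924, 21984.47061]

transformed = [box_cox(c, lam) for c in charges]
restored = [inverse_box_cox(z, lam) for z in transformed]

for original, z in zip(charges, transformed):
    print(f"{original:>12.4f} -> {z:.4f}")
```

The transform compresses large values much more than small ones, which is what pulls in a long right tail. The inverse is needed to express predictions in the original units.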

There are various other advanced parameters in the setup() function, so if you're curious, feel free to check out the relevant section of their docs that goes over each piece in detail.

Viewing the pre-processed data

The get_config('X') function returns the features dataset after the pre-processing pipeline has been applied to it:

get_config('X')
Out:
            age        bmi  sex_female  children_0  children_1  children_2  children_3  children_4  children_5  smoker_no  region_northeast  region_northwest  region_southeast  region_southwest
0     -1.423959  -0.457049         1.0         1.0         0.0         0.0         0.0         0.0         0.0        0.0               0.0               0.0               0.0               1.0
1     -1.494665   0.498336         0.0         0.0         1.0         0.0         0.0         0.0         0.0        1.0               0.0               0.0               1.0               0.0
2     -0.787608   0.373013         0.0         0.0         0.0         0.0         1.0         0.0         0.0        1.0               0.0               0.0               1.0               0.0
3     -0.434080  -1.302572         0.0         1.0         0.0         0.0         0.0         0.0         0.0        1.0               0.0               1.0               0.0               0.0
4     -0.504786  -0.297547         0.0         1.0         0.0         0.0         0.0         0.0         0.0        1.0               0.0               1.0               0.0               0.0
...
1333   0.767917   0.042616         0.0         0.0         0.0         0.0         1.0         0.0         0.0        1.0               0.0               1.0               0.0               0.0
1334  -1.494665   0.197235         1.0         1.0         0.0         0.0         0.0         0.0         0.0        1.0               1.0               0.0               0.0               0.0
1335  -1.494665   0.999627         1.0         1.0         0.0         0.0         0.0         0.0         0.0        1.0               0.0               0.0               1.0               0.0
1336  -1.282548  -0.798839         1.0         1.0         0.0         0.0         0.0         0.0         0.0        1.0               0.0               0.0               0.0               1.0
1337   1.545679  -0.266623         1.0         1.0         0.0         0.0         0.0         0.0         0.0        0.0               0.0               1.0               0.0               0.0

1338 rows × 14 columns

We can see that the numerical features have been normalized with the z-score method, and the categorical features have been encoded with one-hot encoding. It is important to verify that the pre-processing has been completed successfully, as in some cases, our dataset might not be as clean as the one used in this example. In case the pre-processing pipeline fails, we may get incorrect and unexpected results from the machine learning models.
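For intuition, the two transformations can be reproduced by hand with pandas. This is a sketch on three rows modelled on the head of the insurance data, not PyCaret's own pipeline, so the resulting numbers will not match the table above, which uses statistics from the full training set.

```python
import pandas as pd

# Three rows modelled on the head of the insurance data (illustration only).
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
})

# z-score: subtract the column mean and divide by the (population) standard deviation.
toy["age"] = (toy["age"] - toy["age"].mean()) / toy["age"].std(ddof=0)

# One-hot encode the categorical columns into 0/1 indicator columns.
encoded = pd.get_dummies(toy, columns=["smoker", "region"], dtype=float)
print(encoded)
```

Unlike the PyCaret output above, get_dummies keeps a column for every level by default, including both levels of a two-level feature such as smoker.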

Comparing Different Models

There are numerous regression algorithms available, and it is not always obvious which one is optimal for our dataset. The only way to find the best model is to test a number of them and compare the results. Fortunately, PyCaret provides the compare_models() function, which compares a variety of different models easily:

best = compare_models(sort='RMSE')
Out:
          Model                                  MAE             MSE        RMSE       R2   RMSLE    MAPE  TT (Sec)
gbr       Gradient Boosting Regressor      2049.5818   20421350.9926   4370.9223   0.8629  0.3560  0.1638    0.0160
rf        Random Forest Regressor          2148.1937   20860760.2193   4463.4103   0.8585  0.3816  0.1857    0.0560
lightgbm  Light Gradient Boosting Machine  2320.8815   21164314.7458   4491.1725   0.8568  0.3771  0.1910    0.0550
catboost  CatBoost Regressor               2272.1176   21920375.9207   4552.2118   0.8525  0.3685  0.1737    0.7090
ada       AdaBoost Regressor               3028.9828   22202851.6593   4619.7146   0.8498  0.4583  0.3918    0.0110
et        Extra Trees Regressor            2290.2154   24332567.3742   4858.0821   0.8341  0.4075  0.2014    0.0490
xgboost   Extreme Gradient Boosting        2782.4022   35645139.9000   5681.2366   0.7593  0.4167  0.2362    0.0950
dt        Decision Tree Regressor          2883.4270   38905351.2137   6207.1844   0.7276  0.4946  0.3120    0.0050
omp       Orthogonal Matching Pursuit      5700.3404   59762727.6470   7668.1283   0.5919  0.6876  0.6901    0.0070
ridge     Ridge Regression                 4081.9423   63909181.2000   7873.9212   0.5655  0.4260  0.2620    0.0050
br        Bayesian Ridge                   4088.1831   64170144.6909   7889.7825   0.5637  0.4259  0.2620    0.0050
lar       Least Angle Regression           4106.0225   64908348.6235   7935.2674   0.5587  0.4259  0.2619    0.0060
lr        Linear Regression                4106.0354   64908770.4000   7935.2942   0.5587  0.4259  0.2619    0.0080
huber     Huber Regressor                  4245.0332   81231444.5110   8865.0706   0.4478  0.4356  0.2068    0.0080
knn       K Neighbors Regressor            4982.9582   81987651.9475   8946.7009   0.4452  0.5405  0.3290    0.0120
par       Passive Aggressive Regressor     6250.5322  114585710.1368  10361.4084   0.2406  0.6238  0.5523    0.0060
en        Elastic Net                      8276.7225  165075368.0000  12754.6845  -0.1198  0.9128  0.9605    0.0050
llar      Lasso Least Angle Regression     8385.7427  166526887.1876  12811.0623  -0.1297  0.9245  0.9895    0.0060
lasso     Lasso Regression                 8385.7422  166526895.2000  12811.0627  -0.1297  0.9245  0.9895    0.0080

After running the compare_models() function, the results are displayed. This table may seem intimidating, but it's actually fairly simple to understand. The first column contains each model's name, and the rest of the columns are various metrics.

You can focus on RMSE for now, which stands for Root Mean Squared Error. RMSE is a widely used metric for regression, and it is defined as the square root of the averaged squared difference between the actual value and the one predicted by the model:

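$RMSE = \sqrt{ \frac{1}{N}\sum_{i=1}^{N} ( y_{i} - \hat{y}_{i} )^2 }$

Here $y_{i}$ is the actual value of the target, $\hat{y}_{i}$ is the value predicted by the model, and $N$ is the number of observations.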

The lower the RMSE value, the more accurate our model is. In this case, the best model is the Gradient Boosting Regressor, with an RMSE value of 4370.9223.
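The RMSE definition translates almost directly into code. The sketch below computes it for a handful of hypothetical actual and predicted values, which are not taken from the model results:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical actual and predicted charges.
actual = [1000.0, 2000.0, 3000.0, 4000.0]
predicted = [1100.0, 1900.0, 3300.0, 4000.0]

print(rmse(actual, predicted))
```

Because the errors are squared before averaging, the single error of 300 contributes far more to the result than the two errors of 100 combined.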

Creating a model with PyCaret

The create_model() function lets you create a regression model based on the algorithm of your preference. In this case, we'll use Gradient Boosting Regressor since it had the best performance from compare_models() above.

The create_model() function uses k-fold cross-validation to evaluate the model accuracy. In this method, the dataset is first partitioned into $k$ subsamples. One subsample is held out for validation, and the remaining $k-1$ are used to train the model. This process is repeated $k$ times, so that each subsample is used exactly once as validation data.

model = create_model('gbr', cross_validation=True, fold=10)
Out:
            MAE            MSE       RMSE      R2   RMSLE    MAPE
0     1153.5021   3234575.3253  1798.4925  0.9704  0.2460  0.1493
1     2726.0526  31665967.0393  5627.2522  0.7558  0.4797  0.1887
2     2378.3264  29063760.2546  5391.0815  0.8309  0.3298  0.1569
3     2079.7411  21902085.8132  4679.9664  0.8806  0.4493  0.1497
4     1791.8977  17011091.0753  4124.4504  0.8976  0.3362  0.1576
5     1521.1686   9602958.3059  3098.8640  0.9157  0.2453  0.1530
6     1971.6365  15482810.4199  3934.8203  0.8652  0.3269  0.1721
7     2608.5165  31293027.6163  5594.0171  0.8249  0.4281  0.1597
8     2300.7854  25535340.4020  5053.2505  0.8118  0.4252  0.1699
9     1964.1911  19421893.6738  4407.0278  0.8760  0.2937  0.1813
Mean  2049.5818  20421350.9926  4370.9223  0.8629  0.3560  0.1638
SD     458.6997   8928114.5410  1147.3402  0.0571  0.0801  0.0129

After training the model, the cross-validation results are displayed. We set the number of folds ($k$) to 10, so in this case, we have a ten-fold cross-validation. We can see the metrics for every fold, as well as their mean and standard deviation across all folds.
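The summary rows are plain aggregates of the per-fold values. As a quick check, the sketch below recomputes them for the RMSE column with the standard library; using the population standard deviation should reproduce the Mean and SD rows up to rounding.

```python
from statistics import mean, pstdev

# Per-fold RMSE values copied from the cross-validation table above.
fold_rmse = [
    1798.4925, 5627.2522, 5391.0815, 4679.9664, 4124.4504,
    3098.8640, 3934.8203, 5594.0171, 5053.2505, 4407.0278,
]

rmse_mean = mean(fold_rmse)
rmse_sd = pstdev(fold_rmse)  # population standard deviation (divides by k, not k - 1)

print(f"Mean: {rmse_mean:.4f}  SD: {rmse_sd:.4f}")
```

The spread across folds is large relative to the mean, which is a reminder that a single train/validation split can give a misleading picture of model accuracy.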

If you've used sklearn before, you'll notice that one line of code with PyCaret is equivalent to several lines with sklearn.
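To give a sense of the bookkeeping that is hidden behind that one line, here is a minimal pure-Python sketch of how k-fold index splitting could be done by hand. It is an illustration of the technique only, with a hypothetical helper name, and not how PyCaret or sklearn implement it (for one thing, it does not shuffle the data).

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


# 1070 matches the training-set size reported by setup().
folds = list(k_fold_indices(1070, 10))
for i, (train, validation) in enumerate(folds):
    print(f"fold {i}: train={len(train)} validation={len(validation)}")
```

Every row index lands in exactly one validation fold, and for each fold a model would be trained on the remaining rows and scored on the held-out ones.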

Tuning a Model

The tune_model() function tunes the hyperparameters of a given model and outputs the results. Hyperparameters are model settings that are chosen before training, and changing them can either improve or degrade the model's accuracy.

tune_model() uses the Random Grid Search method to tune and optimize the model by testing a random sample of hyperparameter combinations. We can define a grid with specific values for the hyperparameters by using the custom_grid parameter.

We can also define the number of iterations with the n_iter parameter. In every iteration, a random combination of values is selected from the defined grid and tested using k-fold cross-validation.

params = {
    'learning_rate': [0.01, 0.1],
    'max_depth': [5, 6, 7, 8],
    'subsample': [0.6, 0.7, 0.8],
    'n_estimators' : [100, 300, 400, 500]
}

tuned_model = tune_model(
    model, 
    optimize='RMSE',
    fold=10,
    custom_grid=params, 
    n_iter=20
)
Out:
            MAE            MSE       RMSE      R2   RMSLE    MAPE
0     1245.5549   4026621.4141  2006.6443  0.9631  0.2423  0.1569
1     2583.1972  30189293.0509  5494.4784  0.7672  0.4800  0.1834
2     2442.0266  30135598.8806  5489.5900  0.8247  0.3410  0.1730
3     1997.9252  21734060.8504  4661.9804  0.8816  0.4503  0.1487
4     1946.5765  16879740.1085  4108.4961  0.8984  0.3409  0.1819
5     1488.9834   9451504.0446  3074.3299  0.9170  0.2606  0.1656
6     2025.9735  15546849.0444  3942.9493  0.8646  0.3295  0.1775
7     2387.5058  28945218.9003  5380.0761  0.8381  0.4243  0.1528
8     2317.8041  26198464.5189  5118.4436  0.8069  0.4456  0.1837
9     1843.0738  17015176.7972  4124.9457  0.8914  0.2845  0.1729
Mean  2027.8621  20012252.7610  4340.1934  0.8653  0.3599  0.1696
SD     404.6465   8560903.9358  1083.9623  0.0545  0.0807  0.0123

As we can see from the cross-validation results, hyperparameter tuning slightly improved the model: the mean RMSE dropped from 4370.9223 to 4340.1934. The improvement is small, but experimenting with a higher number of iterations or a grid with different hyperparameter values may lead to better results.
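To see how much of the grid a random search actually covers, the sketch below enumerates every combination in the params grid and draws n_iter of them at random. Sampling without replacement and the seed value are assumptions made for illustration; this is not PyCaret's internal implementation.

```python
import itertools
import random

# The same grid that is passed to custom_grid above.
params = {
    "learning_rate": [0.01, 0.1],
    "max_depth": [5, 6, 7, 8],
    "subsample": [0.6, 0.7, 0.8],
    "n_estimators": [100, 300, 400, 500],
}

# Every possible combination of the listed values.
names = list(params)
all_combinations = [
    dict(zip(names, values)) for values in itertools.product(*params.values())
]

# Draw n_iter distinct candidates at random (sampling without replacement is an assumption).
random.seed(10)
n_iter = 20
candidates = random.sample(all_combinations, n_iter)

print(f"{len(candidates)} of {len(all_combinations)} combinations would be evaluated")
```

With 96 possible combinations and 20 iterations, only about a fifth of the grid is tried, which is why a larger n_iter can uncover better settings at the cost of more training time.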

Plotting the Model Performance

PyCaret includes a plot_model() function that lets us visualize our model's accuracy and other properties. The function includes a variety of plots that help us evaluate and understand our model better. Generating the same plots directly with the underlying libraries (sklearn, pandas, and matplotlib) would take considerably more code.

First, we'll plot the error of the predictions on the test set:

plot_model(tuned_model, plot='error')
RESULT:
pycaret-plot-model-error.png

Second, we'll plot the importance of each feature:

plot_model(tuned_model, plot='feature')
RESULT:
pycaret-feature-importance-plot.png

In the EDA section above, we saw that being a smoker leads to significantly higher insurance charges, and now from the feature importance chart we see that being a smoker has the highest predictive value. Furthermore, we can also see that age and BMI seem to play an important role as well.

Making Predictions on New Data

Every real-world machine learning project's ultimate goal is to make predictions on new data, where the target variable is unknown. You can accomplish that by using the predict_model() function, which returns a pandas dataframe with predictions.

We are going to create a small synthetic dataset and see how our model predicts insurance charges for it:

import pandas as pd

cols = ['age', 'sex', 'bmi', 'children', 'smoker', 'region']

records = [
       [30, 'male', 20, 0, 'no', 'southeast'],
       [30, 'male', 20, 0, 'yes', 'southeast'],
       [30, 'male', 35, 0, 'yes', 'southeast'],
       [70, 'male', 35, 0, 'yes', 'southeast'],
       [30, 'female', 20, 0, 'no', 'southeast'],
       [30, 'female', 20, 0, 'yes', 'southeast'],
       [30, 'female', 35, 0, 'yes', 'southeast'],
       [70, 'female', 35, 0, 'yes', 'southeast'] 
]

new_data = pd.DataFrame(data=records, columns=cols)

predict_model(tuned_model, new_data)
Out:
   age     sex  bmi  children smoker     region         Label
0   30    male   20         0     no  southeast   4043.350231
1   30    male   20         0    yes  southeast  17007.642015
2   30    male   35         0    yes  southeast  35749.960178
3   70    male   35         0    yes  southeast  45790.897563
4   30  female   20         0     no  southeast   4503.047383
5   30  female   20         0    yes  southeast  17208.037478
6   30  female   35         0    yes  southeast  35853.324929
7   70  female   35         0    yes  southeast  45870.135872

We can see that young non-smokers with a low BMI are predicted to have the lowest charges by our model. On the other hand, those who are older, obese, and smoke are predicted to be charged more than eleven times as much. These results are in line with the EDA and the feature importance plot.
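A quick check of the ratios behind that comparison, using the predicted values for the male rows of the table above:

```python
# Predicted charges copied from the Label column above (male rows).
young_lean_nonsmoker = 4043.350231   # age 30, BMI 20, non-smoker
young_lean_smoker = 17007.642015     # age 30, BMI 20, smoker
older_obese_smoker = 45790.897563    # age 70, BMI 35, smoker

smoking_ratio = young_lean_smoker / young_lean_nonsmoker
extreme_ratio = older_obese_smoker / young_lean_nonsmoker

print(f"smoking alone: x{smoking_ratio:.1f}")
print(f"older, obese smoker vs. young lean non-smoker: x{extreme_ratio:.1f}")
```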

Interpreting the Model

The ability to interpret a machine learning model's results allows you to avoid relying on a "black box model" whose inner workings you don't understand.

PyCaret includes the interpret_model() function that provides an interpretation plot for a given model. This function requires the SHAP (SHapley Additive exPlanations) library to work, so we'll have to install it first.

pip install shap

After installing the SHAP library, we can create an interpretation plot for our model. The Gradient Boosting Regressor isn't supported by the interpret_model() function, so we will create another model based on the XGBoost algorithm and interpret that model instead.

To interpret the model, we'll use the "reason" plot type:

xgb = create_model('xgboost', cross_validation=True, verbose=False)

interpret_model(xgb, plot='reason', observation=32)
pycaret-interpret-model-reason-plot.jpeg

Above the plot, you'll notice the "base value," which is defined as the mean predicted target, and f(x), which is the prediction for a selected observation. The red-colored features increased the predicted value, while the blue-colored features decreased it.

The size of each feature's bar indicates the impact it has on the prediction. In this case, not being a smoker and having zero children pushed the prediction down, and as a result, the predicted insurance charges fall below the mean value.
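The arithmetic behind such a plot is additive: the prediction equals the base value plus the sum of the per-feature contributions. The sketch below illustrates this with hypothetical numbers, which are made up for this example and are not read from the plot or computed by SHAP.

```python
# Hypothetical numbers, chosen only to illustrate the additive structure of a
# SHAP explanation; they are not taken from the plot above.
base_value = 9.10  # mean model output over the background data

contributions = {
    "smoker_no": -0.35,   # pushes the prediction down (drawn in blue)
    "children_0": -0.08,  # pushes the prediction down (drawn in blue)
    "age": 0.21,          # pushes the prediction up (drawn in red)
    "bmi": 0.05,          # pushes the prediction up (drawn in red)
}

# The explained prediction is the base value plus all feature contributions.
f_x = base_value + sum(contributions.values())

# Rank features by the magnitude of their contribution (bar width in the plot).
ranked = sorted(contributions, key=lambda name: abs(contributions[name]), reverse=True)

print(f"f(x) = {f_x:.2f}")
print("largest impact first:", ranked)
```

In this made-up case the negative contributions outweigh the positive ones, so f(x) ends up below the base value.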

Conclusion

We have pre-processed our data, compared a variety of regression models, and tuned the model of our preference, all in a few lines of code. Using scikit-learn for regression is, of course, an option, but the time and effort required are significantly higher. PyCaret lets us create machine learning models quickly and easily, making it an ideal choice for beginners. Furthermore, PyCaret can also be used by experienced data scientists who want to reduce the time needed to complete machine learning projects.

There are many other machine learning tasks you can accomplish with PyCaret, so definitely check out their docs.


Meet the Authors


Giannis Tolios is passionate about data science, machine learning and other cutting-edge technologies. He is currently offering his services as a freelancer, and his goal is to work on projects that utilize AI to mitigate climate change, economic inequality, and help achieve the UN sustainable development goals. Giannis was excited to join LearnDataSci as an author, because he is always eager to share his knowledge and expertise with others!
