Author: Fatih Karabiber
Ph.D. in Computer Engineering, Data Scientist

Dummy Variable Trap

You should already know:

Some Machine Learning – See our top picks for machine learning courses.

What is the Dummy Variable Trap?

The Dummy Variable Trap occurs when two or more dummy variables created by one-hot encoding are highly correlated (multi-collinear). This means that one variable can be predicted from the others, which makes it difficult to interpret the estimated coefficients in regression models. In other words, the individual effect of each dummy variable on the prediction cannot be interpreted well because of multicollinearity.

Using the one-hot encoding method, a new dummy variable is created for each category of a categorical variable to represent the presence (1) or absence (0) of that category. For example, if tree species is a categorical variable made up of the values pine or oak, then tree species can be represented as dummy variables by converting each value to a one-hot vector. This means that a separate column is obtained for each category, where the first column represents whether the tree is pine and the second column represents whether the tree is oak. Each column contains a 1 if the tree in question is of the column's species and a 0 otherwise. These two columns are multi-collinear since if a tree is pine, then we know it's not oak and vice versa.

Further explanation

To demonstrate the dummy variable trap, consider that we have a categorical variable of tree species and assume that we have seven trees:

$$\large x_{species} = [pine, oak, oak, pine, pine, pine, oak]$$

If the tree species variable is converted to dummy variables, the two vectors obtained are:

$$\large x_{pine} = [1,0,0,1,1,1,0] \\[.5em] \quad \large x_{oak} = [0,1,1,0,0,0,1]$$

Because a 1 in the pine column always means a 0 in the oak column, we can say $\large x_{pine} = 1 - x_{oak}$. This results in two perfectly multi-collinear dummy variables, so the dummy variable trap may occur in regression analysis.
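The relationship can be checked numerically. The following is a minimal sketch using NumPy, which is not part of the original example. The array names mirror the vectors above, and the column of ones is an assumption standing in for the intercept a regression model would add.

```python
import numpy as np

# Dummy vectors for the seven trees above
x_pine = np.array([1, 0, 0, 1, 1, 1, 0])
x_oak = np.array([0, 1, 1, 0, 0, 0, 1])

# Each tree belongs to exactly one species, so x_pine equals 1 - x_oak
print(np.array_equal(x_pine, 1 - x_oak))

# Design matrix with an intercept column of ones (assumed for illustration)
X = np.column_stack([np.ones(7), x_pine, x_oak])

# Three columns, but only two of them are linearly independent
print(np.linalg.matrix_rank(X))
```

The rank is lower than the number of columns because the intercept column equals the sum of the two dummy columns.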

To overcome the Dummy Variable Trap, we drop one of the columns created when the categorical variable was converted to dummy variables by one-hot encoding. This can be done because the dummy variables include redundant information.

To see why this is the case, consider a multiple linear regression model for the given simple example as follows:

$$ \begin{equation} \large y = \beta_{0} + \beta_{1} {x_{pine}} + \beta_{2} {x_{oak}} + \epsilon \end{equation} $$

where $y$ is the response variable, $x_{pine}$ and $x_{oak}$ are the explanatory variables, $\beta_0$ is the intercept, $\beta_1$ and $\beta_2$ are the regression coefficients, and $\epsilon$ is the error term. Since these two dummy variables are multi-collinear (if a tree is pine, then it's not oak), we can substitute $x_{oak}$ with $(1 - x_{pine})$ in the multiple linear regression equation.

$$ \begin{equation} \large \begin{aligned} y &= \beta_{0} + \beta_{1} x_{pine} + ({1-x_{pine}}) \beta_{2} + \epsilon \\[.5em] &= (\beta_{0} + \beta_{2} ) + (\beta_{1} - \beta_{2}) x_{pine} + \epsilon \end{aligned} \end{equation} $$

As you can see, we were able to rewrite the regression equation using only $x_{pine}$, where the new coefficients to be estimated are $(\beta_{0} + \beta_{2})$ and $(\beta_{1} - \beta_{2})$. Only these two combinations can be determined from the data, not $\beta_0$, $\beta_1$, and $\beta_2$ individually. By dropping a dummy variable column, we can avoid this trap.
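One way to see what this means in practice: when both dummies are kept, different coefficient triples can produce exactly the same predictions, so the data cannot tell them apart. The sketch below uses made-up coefficient values. They are illustrative assumptions, not estimates from any data, and the predict helper is hypothetical.

```python
import numpy as np

x_pine = np.array([1, 0, 0, 1, 1, 1, 0])
x_oak = 1 - x_pine

def predict(b0, b1, b2):
    """Evaluate y = b0 + b1 * x_pine + b2 * x_oak (error term omitted)."""
    return b0 + b1 * x_pine + b2 * x_oak

# Two different, arbitrarily chosen coefficient sets
y_a = predict(b0=10.0, b1=5.0, b2=2.0)
y_b = predict(b0=7.0, b1=8.0, b2=5.0)

# Both sets share b0 + b2 = 12 and b1 - b2 = 3, so the predictions coincide
print(np.array_equal(y_a, y_b))
```

Any triple with the same values of $\beta_0 + \beta_2$ and $\beta_1 - \beta_2$ would behave the same way, which is why the individual coefficients cannot be interpreted.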

This example shows two categories, but it can be extended to any number of categories. In general, if we have $p$ categories, we use $p-1$ dummy variables. Dropping one dummy variable protects against the dummy variable trap.
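The $p-1$ rule can be illustrated with a small NumPy sketch. The three-category sample, the integer labels, and the variable names below are assumptions made for illustration. The intercept column is included because the redundancy arises between the intercept and the full set of dummies.

```python
import numpy as np

# Hypothetical sample with p = 3 categories, encoded as integer labels 0, 1, 2
labels = np.array([0, 1, 2, 1, 0, 2, 2])
p = 3

# One-hot encode: row i of the identity matrix is the dummy vector for label i
dummies = np.eye(p, dtype=int)[labels]
intercept = np.ones((len(labels), 1))

# Intercept plus all p dummies: p + 1 columns, but the rank is only p
X_full = np.hstack([intercept, dummies])
print(X_full.shape[1], np.linalg.matrix_rank(X_full))

# Intercept plus p - 1 dummies (first category dropped): full column rank
X_reduced = np.hstack([intercept, dummies[:, 1:]])
print(X_reduced.shape[1], np.linalg.matrix_rank(X_reduced))
```

With one dummy removed, the number of columns matches the rank, so every remaining coefficient can be estimated.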

Python Example

The get_dummies() function in the Pandas library can be used to create dummy variables.

import pandas as pd

Create a simple categorical variable:

c1 = ['pine', 'oak', 'oak', 'pine', 'pine']

Then convert the categorical variable to dummy variables:

# dtype=int gives 0/1 columns; recent pandas versions default to booleans
pd.get_dummies(c1, dtype=int)

Out:

	oak	pine
0	0	1
1	1	0
2	1	0
3	0	1
4	0	1

You can see that pandas has one-hot encoded our two tree species into two columns.

To drop the first dummy variable, we can specify the drop_first parameter in the get_dummies function:

pd.get_dummies(c1, drop_first=True, dtype=int)

Out:

	pine
0	1
1	0
2	0
3	1
4	1

Now, we simply know whether a tree is pine or not pine.

For a more complex example, consider the following randomly created dataset:

# Define a simple dataset
data = {
    'gender': ['female', 'male', 'male', 'male', 'male', 'male', 'female', 'male', 'male', 'female'],
    'race': ['white', 'hispanic', 'african', 'asian', 'asian', 'white', 'african', 'hispanic', 'white', 'african'],
    'age': [12, 15, 22, 21, 29, 42, 17, 25, 14, 47],
    'income': [25, 34, 42, 50, 48, 39, 25, 73, 86, 61],
}
df = pd.DataFrame(data)

print(df)

Out:

   gender      race  age  income
0  female     white   12      25
1    male  hispanic   15      34
2    male   african   22      42
3    male     asian   21      50
4    male     asian   29      48
5    male     white   42      39
6  female   african   17      25
7    male  hispanic   25      73
8    male     white   14      86
9  female   african   47      61

    
Next, convert the gender and race columns to dummy variables:

pd.get_dummies(df, prefix=['gender', 'race'], dtype=int)

    
        Learn Data Science with

Out:

	age	income	gender_female	gender_male	race_african	race_asian	race_hispanic	race_white
0	12	25	1	0	0	0	0	1
1	15	34	0	1	0	0	1	0
2	22	42	0	1	1	0	0	0
3	21	50	0	1	0	1	0	0
4	29	48	0	1	0	1	0	0
5	42	39	0	1	0	0	0	1
6	17	25	1	0	1	0	0	0
7	25	73	0	1	0	0	1	0
8	14	86	0	1	0	0	0	1
9	47	61	1	0	1	0	0	0

The gender variable is converted to gender_female and gender_male, and the race variable is converted to the dummy variables race_african, race_asian, race_hispanic, and race_white. Again, to avoid the dummy variable trap, the first dummy variable of each categorical variable is dropped by setting drop_first=True:

pd.get_dummies(df, prefix=['gender', 'race'], drop_first=True, dtype=int)

Out:

	age	income	gender_male	race_asian	race_hispanic	race_white
0	12	25	0	0	0	1
1	15	34	1	0	1	0
2	22	42	1	0	0	0
3	21	50	1	1	0	0
4	29	48	1	1	0	0
5	42	39	1	0	0	1
6	17	25	0	0	0	0
7	25	73	1	0	1	0
8	14	86	1	0	0	1
9	47	61	0	0	0	0

The result shows that the gender_female and race_african dummy variables are dropped. If we have more than two categories, the dropped variable can be thought of as the absence of all other options, represented by zeros in every column. In this example, race_african = 1 - (race_asian + race_hispanic + race_white). In other words, if race_asian, race_hispanic, and race_white are all zero, the race of this record is assumed to be the dropped variable race_african.
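The identity above can be checked directly. This is a minimal sketch that reuses the race values from the dataset above. The variable names are chosen here for illustration, and dtype=int is passed so that the dummies are 0/1 integers.

```python
import pandas as pd

# Same race values as in the dataset above
race = pd.Series(['white', 'hispanic', 'african', 'asian', 'asian',
                  'white', 'african', 'hispanic', 'white', 'african'])

full = pd.get_dummies(race, prefix='race', dtype=int)
reduced = pd.get_dummies(race, prefix='race', drop_first=True, dtype=int)

# Rebuild the dropped column from the three remaining ones
rebuilt = 1 - reduced.sum(axis=1)
print((rebuilt == full['race_african']).all())
```

Because the dropped column can be rebuilt from the others, removing it loses no information.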

Meet the Authors

Fatih Karabiber Ph.D. in Computer Engineering, Data Scientist

Associate Professor of Computer Engineering. Author/co-author of over 30 journal publications. Instructor of graduate/undergraduate courses. Supervisor of Graduate thesis. Consultant to IT Companies.
