Author: Fatih Karabiber
Ph.D. in Computer Engineering, Data Scientist

Binary Variable

You should already know:

Basic Python — Learn Python and Data Science concepts interactively on Dataquest.

A binary variable is a categorical variable that can take only one of two values. It is usually represented as a Boolean (True or False) or as an integer (0 or 1), where $0$ typically indicates that the attribute is absent and $1$ indicates that it is present.

Some examples of binary variables (attributes) are:

  • Smoking is a binary variable with only two possible values: yes or no
  • A medical test has two possible outcomes: positive or negative
  • Gender is traditionally described as male or female
  • Health status can be defined as diseased or healthy
  • Company types may have two values: private or public
  • E-mails can be assigned into two categories: spam or not
  • Credit card transactions can be fraud or not

In some applications, it may be useful to construct a binary variable from other types of data. If you can reduce a non-binary attribute to only two categories, you have a binary variable. For example, the numerical variable age can be split into two groups: 'less than 30' and '30 or greater'.
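
As a minimal sketch of this idea in plain Python, the snippet below splits ages at the threshold of 30 used in the example above. The ages list and the is_30_or_older name are invented for illustration.

```python
# Hypothetical ages, invented for this illustration
ages = [19, 33, 20, 30, 45]

# 1 if the person is 30 or older, 0 if younger than 30
is_30_or_older = [1 if age >= 30 else 0 for age in ages]

print(is_30_or_older)
```

Where the cut-off falls (here, whether exactly 30 belongs to the upper group) is a modelling decision and should be stated explicitly.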

Datasets used in machine learning applications often contain binary variables. Applications such as medical diagnosis, spam analysis, facial recognition, and financial fraud detection commonly involve them.

Binary Variables in Python

In Python, the built-in Boolean data type, bool, represents a binary variable. Its only two values are $True$ and $False$.

# Boolean data type
x = True
y = False
print(type(x), type(y))
Out:
<class 'bool'> <class 'bool'>
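
The two representations mentioned earlier, Boolean and 0/1 integer, are closely related in Python: bool is a subclass of int, so $True$ and $False$ behave like 1 and 0 in arithmetic. The short sketch below illustrates this; the flags list is a made-up example.

```python
# bool is a subclass of int, so True and False act like 1 and 0
print(issubclass(bool, int))
print(True == 1, False == 0)

# Summing a list of Booleans counts the True values
flags = [True, False, True, True]  # hypothetical flags
n_true = sum(flags)
print(n_true)
```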

Additionally, the bool() function converts a value to a Boolean. It returns $True$ for most objects; among the built-in types, the values that convert to $False$ are:

  • Empty objects (list, tuple, string, dictionary)
  • Zero numbers (0, 0.0, 0j)
  • The None value
  • False itself

print("Boolean value of an empty list is", bool([]))
print("Boolean value of zero is", bool(0))
print("Boolean value of number 10 is", bool(10))
print("Boolean value of an empty string is", bool(''))
print("Boolean value of a string is", bool('string'))
Out:
Boolean value of an empty list is False
Boolean value of zero is False
Boolean value of number 10 is True
Boolean value of an empty string is False
Boolean value of a string is True
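
One possible use of these rules is to build a 0/1 indicator from raw values. The sketch below marks whether a comment is present; the comments list and the has_comment name are hypothetical.

```python
# Hypothetical raw values, e.g. optional comments left by users
comments = ["great", "", None, "ok", ""]

# bool() maps empty strings and None to False, everything else here to True;
# int() then turns the Boolean into 1 or 0
has_comment = [int(bool(c)) for c in comments]

print(has_comment)
```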

In a dataset

To observe binary variables in practice, we will load birthwt, a real dataset on 'Risk Factors Associated with Low Infant Birth Weight' from the R package MASS, using the get_rdataset function of the statsmodels library.

import statsmodels.api as sm
dataset1 = sm.datasets.get_rdataset(dataname='birthwt', package='MASS')
df1 = dataset1.data

df1.head()
Out:
    low  age  lwt  race  smoke  ptl  ht  ui  ftv   bwt
85    0   19  182     2      0    0   0   1    0  2523
86    0   33  155     3      0    0   0   0    3  2551
87    0   20  105     1      1    0   0   0    1  2557
88    0   21  108     1      1    0   0   1    2  2594
89    0   18  107     1      1    0   0   1    0  2600

The dataset's help text, available through dataset1.__doc__, describes the variables as follows:

  • low : an indicator of whether the birth weight is less than 2.5 kg
  • age : mother’s age in years
  • lwt : mother’s weight in pounds at last menstrual period
  • race : mother’s race (1 = white, 2 = black, 3 = other)
  • smoke : smoking status during pregnancy
  • ptl : number of previous premature labours
  • ht : history of hypertension
  • ui : presence of uterine irritability
  • ftv : number of physician visits during the first trimester
  • bwt : birth weight in grams

As the dataset description shows, the low, smoke, ht, and ui attributes are binary variables. In pandas, the value_counts() method returns the count of each unique value in a column.

# count how often each value of smoke occurs
df1['smoke'].value_counts()
Out:
0    115
1     74
Name: smoke, dtype: int64
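
When no description is available, one rough heuristic for spotting candidate binary columns is to count the distinct values per column. The sketch below uses a small invented frame (the column names mirror birthwt, but the values are made up) so that it runs without downloading anything.

```python
import pandas as pd

# Small hypothetical frame; column names mirror birthwt, values are invented
df = pd.DataFrame({
    "smoke": [0, 1, 1, 0],
    "age": [19, 33, 20, 21],
    "race": [2, 3, 1, 1],
})

# Columns with exactly two distinct values are candidate binary variables
n_unique = df.nunique()
binary_cols = n_unique[n_unique == 2].index.tolist()

print(binary_cols)
```

This is only a heuristic: a column may show two values in a small sample without being binary by definition, so the result should be checked against the dataset documentation.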

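In the following example, the numerical variable age is converted to a binary variable that indicates whether the mother is older than 30.

# convert a numerical variable to a binary variable:
# True if age is greater than 30, False otherwise
# (the comparison already yields a Boolean column, so no further cast is needed)
df1['new_age'] = df1['age'] > 30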

print('Type of the new variable:\n', type(df1['new_age'].iloc[0]), '\n')
print('Value Counts of the new variable:\n', df1['new_age'].value_counts())
Out:
Type of the new variable:
 <class 'numpy.bool_'> 

Value Counts of the new variable:
 False    169
True      20
Name: new_age, dtype: int64
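
If the 0/1 integer representation is preferred over $True$/$False$, a Boolean column can be cast with astype(int). The sketch below shows this on a small hypothetical Series rather than on the birthwt data.

```python
import pandas as pd

# Hypothetical ages, invented for this illustration
ages = pd.Series([19, 33, 20, 21, 18], name="age")

older_than_30 = ages > 30            # Boolean Series
as_int = older_than_30.astype(int)   # True -> 1, False -> 0

print(as_int.tolist())
```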

Meet the Authors

Associate Professor of Computer Engineering. Author/co-author of over 30 journal publications. Instructor of graduate/undergraduate courses. Supervisor of Graduate thesis. Consultant to IT Companies.
