You should already know:
- Basic Python: learn Python and data science concepts interactively on Dataquest.
A binary variable is a categorical variable that can take only one of two values. It is usually represented as a Boolean ($True$ or $False$) or as an integer ($0$ or $1$), where $0$ typically indicates that the attribute is absent and $1$ indicates that it is present.
Some examples of binary variables (attributes) are:
- Smoking is a binary variable with only two possible values: yes or no
- A medical test has two possible outcomes: positive or negative
- Gender is traditionally described as male or female
- Health status can be defined as diseased or healthy
- Company types may have two values: private or public
- E-mails can be assigned into two categories: spam or not
- Credit card transactions can be fraud or not
In some applications, it may be useful to construct a binary variable from other types of data. If you can reduce a non-binary attribute to exactly two categories, you have a binary variable. For example, the numerical variable age can be divided into two groups: 'less than 30' and '30 or older'.
Datasets used in machine learning applications often contain binary variables. Applications such as medical diagnosis, spam analysis, facial recognition, and financial fraud detection all rely on them.
Binary Variables in Python
In Python, the Boolean data type (bool) represents a binary variable and takes one of two values: $True$ or $False$.
The bool() function converts an object to a Boolean value. For built-in types, it returns $True$ for every value except $False$ itself and the following:
- Empty objects (list, tuple, string, dictionary)
- Zero number (0, 0.0, 0j)
- None value
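As a minimal sketch of the rules listed above, the snippet below passes a few illustrative values to bool(). The variable names and sample values are chosen only for this example.

```python
# Values that bool() maps to False: empty containers, zeros, and None.
falsy_values = [[], (), "", {}, 0, 0.0, 0j, None]

# Non-empty containers and non-zero numbers map to True,
# even when their contents are themselves falsy.
truthy_values = [[0], (None,), " ", {"a": 1}, -1, 0.1, 1j]

falsy_results = [bool(value) for value in falsy_values]
truthy_results = [bool(value) for value in truthy_values]

print(falsy_results)
print(truthy_results)

# bool is a subclass of int, so True and False also behave like 1 and 0.
total_true = True + True
print(total_true)
```

Because bool is a subclass of int, a Boolean column can be summed directly to count how many observations have the attribute.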
In a dataset
A real dataset named birthwt, on 'Risk Factors Associated with Low Infant Birth Weight', will be imported through the statsmodels library to observe binary variables.
The description of the dataset, obtained from the help file with dataset1.__doc__, is given below.
- low : an indicator of whether the birth weight is less than 2.5 kg
- age : mother’s age in years
- lwt : mother’s weight in pounds at last menstrual period
- race : mother’s race (1 = white, 2 = black, 3 = other)
- smoke : smoking status during pregnancy
- ptl : number of previous premature labours
- ht : history of hypertension
- ui : presence of uterine irritability
- ftv : number of physician visits during the first trimester
- bwt : birth weight in grams
As the dataset description shows, the low, smoke, ht, and ui attributes are binary variables. In pandas, the value_counts() method gives the count of each unique value in a variable.
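The sketch below shows how value_counts() might be applied to binary columns. To stay self-contained it does not load birthwt; instead it builds a small hypothetical DataFrame named df whose values are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of birthwt; these values are invented.
df = pd.DataFrame({
    "low":   [0, 0, 1, 1, 0, 1],
    "smoke": [0, 1, 1, 0, 0, 1],
    "ui":    [0, 0, 0, 1, 0, 0],
})

for column in ["low", "smoke", "ui"]:
    # Count how often each distinct value occurs in the column.
    counts = df[column].value_counts()
    print(counts)
    # A binary variable has exactly two distinct values.
    print(column, "is binary:", df[column].nunique() == 2)
```

Checking nunique() alongside value_counts() is one simple way to confirm that a column really holds only two categories before treating it as binary.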
In the following example, a numerical variable, age, will be converted to a binary variable.
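A minimal sketch of such a conversion is given here, assuming the threshold of 30 used earlier in the text. The ages are hypothetical sample values rather than rows from birthwt, and the column name age_30_plus is introduced only for this illustration.

```python
import pandas as pd

# Hypothetical ages standing in for the birthwt "age" column.
df = pd.DataFrame({"age": [19, 33, 20, 21, 30, 45, 29]})

# The comparison yields a Boolean Series; astype(int) maps False/True to 0/1.
df["age_30_plus"] = (df["age"] >= 30).astype(int)

print(df)
print(df["age_30_plus"].value_counts())
```

Leaving out astype(int) would keep the column as $True$/$False$ instead of $0$/$1$; both forms are valid binary variables.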