Author: Fatih Karabiber
Ph.D. in Computer Engineering, Data Scientist

Binary Classification

What is Binary Classification?

Binary classification is a form of classification — the process of predicting categorical variables — where the output is restricted to two classes.

Binary classification is used in many different data science applications, such as:

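| Application             | Class 0   | Class 1  |
|-------------------------|-----------|----------|
| Medical Diagnosis       | Healthy   | Diseased |
| Email Analysis          | Not Spam  | Spam     |
| Financial Data Analysis | Not Fraud | Fraud    |
| Marketing               | Won't Buy | Will Buy |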

Quick example

In medical diagnosis, a binary classifier for a specific disease could take a patient's symptoms as input and predict whether the patient is healthy or has the disease. The two possible outcomes of the diagnosis are positive and negative.

Evaluation of binary classifiers

If the model correctly predicts a diseased patient as positive, the case is called a True Positive (TP). If the model correctly predicts a healthy patient as negative, the case is called a True Negative (TN). The binary classifier may also misdiagnose some patients. If a diseased patient is classified as healthy by a negative test result, the error is called a False Negative (FN). Similarly, if a healthy patient is classified as diseased by a positive test result, the error is called a False Positive (FP).

The binary classifier can be evaluated based on the following four outcomes.

  • True Positive (TP): The patient is diseased and the model predicts diseased
  • False Positive (FP): The patient is healthy but the model predicts diseased
  • True Negative (TN): The patient is healthy and the model predicts healthy
  • False Negative (FN): The patient is diseased but the model predicts healthy

After obtaining these values, the accuracy score of the binary classifier is calculated as follows:

$$ \text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN} $$
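As a quick arithmetic check of the formula, here is a minimal sketch in Python. The function name `accuracy_from_counts` and the counts are made up for illustration; they are not taken from any dataset.

```python
def accuracy_from_counts(tp, fp, tn, fn):
    """Return (TP + TN) / (TP + FP + TN + FN)."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# Hypothetical counts: 40 TP, 5 FP, 50 TN, 5 FN out of 100 patients
example_accuracy = accuracy_from_counts(tp=40, fp=5, tn=50, fn=5)
print(example_accuracy)  # 0.9
```

Here 90 of the 100 hypothetical patients are classified correctly, so the accuracy is 0.9.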

A confusion matrix is a table that summarizes these four counts for a binary classifier.
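To make the layout concrete, the sketch below counts the four outcomes by hand with NumPy, assuming labels are encoded as 0 (negative) and 1 (positive). The label arrays are invented for illustration, and the arrangement used here (actual classes as rows, predicted classes as columns) is one common convention rather than the only one.

```python
import numpy as np

# Hypothetical labels: 1 = diseased (positive), 0 = healthy (negative)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# Count each outcome by comparing actual and predicted labels
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

# Rows: actual class (0, 1); columns: predicted class (0, 1)
matrix = np.array([[tn, fp],
                   [fn, tp]])
print(matrix)
```

With these toy labels, the matrix holds 3 true negatives, 1 false positive, 1 false negative, and 3 true positives.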

In machine learning, there are many methods used for binary classification. The most common are:

  • Support Vector Machines
  • Naive Bayes
  • Nearest Neighbor
  • Decision Trees
  • Logistic Regression
  • Neural Networks

A Python Example for Binary Classification

Here, we will use a sample dataset to demonstrate binary classification. We will use breast cancer data, which contains measurements of tumor cell nuclei, to predict whether or not a tumor is malignant. For this example, we will use Logistic Regression, which is one of the many algorithms for performing binary classification. Both the data and the algorithm are available in the sklearn library.

First, we'll import and load the data:

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer

# Load the breast cancer dataset bundled with scikit-learn
dataset = load_breast_cancer()

# Plot styling (optional; the style sheet is fetched from the URL below)
sns.set_style('dark')
mpl.style.use(['https://gist.githubusercontent.com/BrendanMartin/01e71bb9550774e2ccff3af7574c0020/raw/6fa9681c7d0232d34c9271de9be150e584e606fe/lds_default.mplstyle'])
mpl.rcParams.update({"figure.figsize": (8,6), "axes.titlepad": 22.0})

We'll print the target names, the unique values of the target variable, and the frequency of each value:

print('Target variables  : ', dataset['target_names'])

(unique, counts) = np.unique(dataset['target'], return_counts=True)

print('Unique values of the target variable', unique)
print('Counts of the target variable :', counts)

Out:
Target variables  :  ['malignant' 'benign']
Unique values of the target variable [0 1]
Counts of the target variable : [212 357]

Now, we can plot a bar chart to see the target variable:

sns.barplot(x=dataset['target_names'], y=counts)
plt.title('Target variable counts in dataset')
plt.show()

RESULT:
breast-cancer-target-variable-counts-plot.png

In this dataset, we have two classes: malignant denoted as $0$ and benign denoted as $1$, making this a binary classification problem.

To perform binary classification using Logistic Regression with sklearn, we need to accomplish the following steps.

Step 1: Define explanatory variables and target variable

X = dataset['data']
y = dataset['target']

Step 2: Standardize the features for numerical stability

from sklearn.preprocessing import StandardScaler
standardizer = StandardScaler()
X = standardizer.fit_transform(X)

Step 3: Split the dataset into training and testing sets

75% of data is used for training, and 25% for testing.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y , test_size=0.25, random_state=0)

Step 4: Fit a Logistic Regression model to the training data

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

Out:
LogisticRegression()
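Note that Step 2 fits the scaler on the full dataset, so the test rows influence the scaling statistics. A common alternative, sketched below as one option rather than as the approach used in this article, is to split first and let a scikit-learn pipeline fit the scaler on the training rows only. The names `X_raw` and `pipeline` are illustrative, and the resulting score may differ slightly from the numbers reported in this article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_raw, y = load_breast_cancer(return_X_y=True)

# Split the unscaled data first, with the same settings as Step 3
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.25, random_state=0)

# The pipeline fits the scaler on the training rows only, then the classifier
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train_raw, y_train)

# score() applies the training-set scaling to the test rows before predicting
test_accuracy = pipeline.score(X_test_raw, y_test)
print('Test accuracy: {:0.3f}'.format(test_accuracy))
```

Keeping the scaler inside the pipeline also means the same preprocessing is applied automatically whenever the model is used on new data.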

Step 5: Make predictions on the testing data

predictions = model.predict(X_test)
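`predict` returns hard class labels. For logistic regression, scikit-learn also exposes class probabilities through `predict_proba`, which can be useful when a different cut-off is needed. The sketch below rebuilds the model from the steps above so that it runs on its own; the 0.5 threshold and the names `proba_benign` and `custom_predictions` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild the data and model as in Steps 1 to 4
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)

# Column 1 holds the estimated probability of class 1 (benign)
proba_benign = model.predict_proba(X_test)[:, 1]

# Turn probabilities into labels with an explicit threshold
threshold = 0.5
custom_predictions = (proba_benign > threshold).astype(int)

# Fraction of test rows where the thresholded labels match predict()
agreement = np.mean(custom_predictions == model.predict(X_test))
print('Agreement with predict(): {:0.3f}'.format(agreement))
```

Raising or lowering `threshold` trades false positives against false negatives, which matters when one kind of error is more costly than the other.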

Step 6: Calculate the accuracy score by comparing the actual values and predicted values.

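from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes (labels sorted as 0, 1)
cm = confusion_matrix(y_test, predictions)

# For binary labels, ravel() flattens the 2x2 matrix into TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()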

print('True Positive(TP)  = ', TP)
print('False Positive(FP) = ', FP)
print('True Negative(TN)  = ', TN)
print('False Negative(FN) = ', FN)

accuracy =  (TP+TN) /(TP+FP+TN+FN)

print('Accuracy of the binary classification = {:0.3f}'.format(accuracy))

Out:
True Positive(TP)  =  88
False Positive(FP) =  3
True Negative(TN)  =  50
False Negative(FN) =  2
Accuracy of the binary classification = 0.965
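Accuracy is not the only quantity that can be read off these four counts. As a sketch, precision, TP / (TP + FP), and recall, TP / (TP + FN), can be computed from the values printed above. One caveat worth checking in your own run: scikit-learn treats label 1 as the positive class by default, and in this dataset label 1 is benign, so the "positive" counts here refer to benign tumors.

```python
# Counts copied from the output above
TP, FP, TN, FN = 88, 3, 50, 2

precision = TP / (TP + FP)   # share of predicted positives that are correct
recall = TP / (TP + FN)      # share of actual positives that are found
accuracy = (TP + TN) / (TP + FP + TN + FN)

print('Precision = {:0.3f}'.format(precision))  # Precision = 0.967
print('Recall    = {:0.3f}'.format(recall))     # Recall    = 0.978
print('Accuracy  = {:0.3f}'.format(accuracy))   # Accuracy  = 0.965
```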

Other Binary Classifiers in the Scikit-Learn Library

Here, we'll list some of the other classification algorithms defined in the scikit-learn library, which we will evaluate and compare. You can read more about these algorithms in the scikit-learn documentation.

Well-known evaluation metrics for classification are also defined in the scikit-learn library. Here, we'll focus on the Accuracy, Precision, and Recall metrics for performance evaluation. If you'd like to read more about the many other metrics, see the scikit-learn documentation.

Initializing each binary classifier

Below, we create an empty dictionary, initialize each model, and store it by name in the dictionary:

models = {}

# Logistic Regression
from sklearn.linear_model import LogisticRegression
models['Logistic Regression'] = LogisticRegression()

# Support Vector Machines
from sklearn.svm import LinearSVC
models['Support Vector Machines'] = LinearSVC()

# Decision Trees
from sklearn.tree import DecisionTreeClassifier
models['Decision Trees'] = DecisionTreeClassifier()

# Random Forest
from sklearn.ensemble import RandomForestClassifier
models['Random Forest'] = RandomForestClassifier()

# Naive Bayes
from sklearn.naive_bayes import GaussianNB
models['Naive Bayes'] = GaussianNB()

# K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
models['K-Nearest Neighbor'] = KNeighborsClassifier()

Performance evaluation of each binary classifier

Now that all models are initialized, we'll loop over each one, fit it, make predictions, calculate metrics, and store each result in a dictionary.

from sklearn.metrics import accuracy_score, precision_score, recall_score

accuracy, precision, recall = {}, {}, {}

for name, clf in models.items():

    # Fit the classifier on the training data
    clf.fit(X_train, y_train)

    # Predict labels for the test data
    predictions = clf.predict(X_test)

    # scikit-learn metrics take the true labels first, then the predictions
    accuracy[name] = accuracy_score(y_test, predictions)
    precision[name] = precision_score(y_test, predictions)
    recall[name] = recall_score(y_test, predictions)

With all metrics stored, we can use the pandas library to view the data as a table:

import pandas as pd

df_model = pd.DataFrame(index=models.keys(), columns=['Accuracy', 'Precision', 'Recall'])
df_model['Accuracy'] = list(accuracy.values())
df_model['Precision'] = list(precision.values())
df_model['Recall'] = list(recall.values())

df_model

Out:
| Model                   | Accuracy | Precision | Recall   |
|-------------------------|----------|-----------|----------|
| Logistic Regression     | 0.965035 | 0.967033  | 0.977778 |
| Support Vector Machines | 0.944056 | 0.965909  | 0.944444 |
| Decision Trees          | 0.895105 | 0.974684  | 0.855556 |
| Random Forest           | 0.965035 | 0.977528  | 0.966667 |
| Naive Bayes             | 0.916084 | 0.933333  | 0.933333 |
| K-Nearest Neighbor      | 0.951049 | 0.936842  | 0.988889 |

Finally, here's a quick bar chart to compare the classifiers' performance:

ax  = df_model.plot.bar(rot=45)
ax.legend(ncol= len(models.keys()), bbox_to_anchor=(0, 1), loc='lower left', prop={'size': 14})
plt.tight_layout()

RESULT:
sklearn-binary-classifier-comparison.png

It's important to note that because the models use default parameters, these results alone do not show which classifier is best. Each algorithm should be analyzed carefully, and its parameters tuned, before drawing conclusions about performance.
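One way to select parameters is a cross-validated grid search. The sketch below tunes only the regularization strength `C` of logistic regression with scikit-learn's `GridSearchCV`; the grid values, the 5-fold setting, and `max_iter=1000` are arbitrary choices for illustration, not recommendations from this article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Scaling lives inside the pipeline so each CV fold is scaled
# using only its own training portion
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid; make_pipeline names the step 'logisticregression'
param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

tuned_accuracy = search.score(X_test, y_test)
print('Best parameters:', search.best_params_)
print('Test accuracy: {:0.3f}'.format(tuned_accuracy))
```

The same pattern can be repeated for each classifier with its own parameter grid, keeping the test set aside until the final comparison.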


Meet the Authors


Associate Professor of Computer Engineering. Author/co-author of over 30 journal publications. Instructor of graduate/undergraduate courses. Supervisor of Graduate thesis. Consultant to IT Companies.
