Author: Alfie Grace
Data Scientist

How to iterate over rows in Pandas: Most efficient options

There are many ways to iterate over rows of a DataFrame or Series in pandas, each with their own pros and cons.

Since pandas is built on top of NumPy, also consider reading through our NumPy tutorial to learn more about working with the underlying arrays.

To demonstrate each row-iteration method, we'll use the ubiquitous Iris flower dataset, an easy-to-access dataset containing measurements of flowers from three iris species.

We'll import the dataset using seaborn and limit it to the top three rows for simplicity:

import pandas as pd
import seaborn as sns
df = sns.load_dataset('iris').head(3)

Most straightforward row iteration

The most straightforward method for iterating over rows is with the iterrows() method, like so:

for index, row in df.iterrows():
    print(row, '\n')
Out:
sepal_length       5.1
sepal_width        3.5
petal_length       1.4
petal_width        0.2
species         setosa
Name: 0, dtype: object 

sepal_length       4.9
sepal_width          3
petal_length       1.4
petal_width        0.2
species         setosa
Name: 1, dtype: object 

sepal_length       4.7
sepal_width        3.2
petal_length       1.3
petal_width        0.2
species         setosa
Name: 2, dtype: object

iterrows() yields each row's index along with the row itself. If you don't need the index value, you can discard it by assigning it to an underscore (_), which also improves readability.

Despite its ease of use and intuitive nature, iterrows() is one of the slowest ways to iterate over rows. This article will also look at how you can replace iterrows() with itertuples() or apply() to speed up iteration.

Finally, we'll consider vectorized alternatives, such as map(), that avoid explicit row-by-row loops altogether.

Option 1 (worst): iterrows()

Calling iterrows() on a dataframe returns what is known as a generator. A generator is an iterable object that produces its items one at a time, meaning we can loop through it.
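To see the generator behavior directly, here is a minimal sketch that pulls items out one at a time with next(). The two-row `sample` frame is a hand-built stand-in for the Iris data (an assumption for illustration, not part of the original example):

```python
import pandas as pd

# Hand-built stand-in for two rows of the Iris data (illustrative values)
sample = pd.DataFrame({
    'sepal_length': [5.1, 4.9],
    'species': ['setosa', 'setosa'],
})

rows = sample.iterrows()   # rows are produced on demand, not all at once
first = next(rows)         # pull the first item from the generator
index, row = first         # each item is an (index, Series) tuple

remaining = list(rows)     # consumes whatever is left (one row here)
exhausted = list(rows)     # empty: a generator can only be consumed once

print(type(first).__name__, index, type(row).__name__)
print(len(remaining), len(exhausted))
```

Because the generator is used up after one pass, call iterrows() again whenever you need to loop a second time.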

Let's use iterrows() again, but without pulling out the index in the loop definition:

for row in df.iterrows():
    print(row, '\n')
Out:
(0, sepal_length       5.1
sepal_width        3.5
petal_length       1.4
petal_width        0.2
species         setosa
Name: 0, dtype: object) 

(1, sepal_length       4.9
sepal_width          3
petal_length       1.4
petal_width        0.2
species         setosa
Name: 1, dtype: object) 

(2, sepal_length       4.7
sepal_width        3.2
petal_length       1.3
petal_width        0.2
species         setosa
Name: 2, dtype: object)

You might have noticed that the result looks a little different from the output shown in the intro. Here, we've only assigned the output to the row variable, which now contains both the index and row in a tuple.

But let's try and extract the sepal length and width from each row:

for row in df.iterrows():
    print(f"Sepal Length - {row['sepal_length']} Sepal Width - {row['sepal_width']}")
Out:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-9addd52c9b21> in <module>
      1 for row in df.iterrows():
----> 2     print(f"Sepal Length - {row['sepal_length']} Sepal Width - {row['sepal_width']}")

TypeError: tuple indices must be integers or slices, not str

This error occurs because each item in our iterrows() generator is a tuple-type object with two values, the row index and the row content.

If you look at the output at the beginning of this section, the data for each row is inside parentheses, with a comma separating the index from the row content. By assigning the generator output to two variables, we can unpack the data inside the tuple into a more accessible format.

After unpacking the tuple, we can easily access specific values in each row. In this example, we'll throw away the index variable by using an underscore:

for _, row in df.iterrows():
    print(f"Sepal Length - {row['sepal_length']} Sepal Width - {row['sepal_width']}")
Out:
Sepal Length - 5.1 Sepal Width - 3.5
Sepal Length - 4.9 Sepal Width - 3.0
Sepal Length - 4.7 Sepal Width - 3.2

Row iteration example

A typical machine learning project for beginners is to use the Iris dataset to create a classification model that can predict the species of a flower based on its measurements.

To do this, we'll need to convert the values in the species column of the dataframe into a numerical format. This process is known as label encoding. Plenty of tools exist to automatically do this for us, such as sklearn's LabelEncoder, but let's use iterrows() to do the label encoding ourselves.

First, we need to know the unique species in the dataset:

df = sns.load_dataset('iris') # use the entire dataset

df['species'].unique()
Out:
array(['setosa', 'versicolor', 'virginica'], dtype=object)

We'll manually assign a number for each species:

species_labels = {'setosa': 0, 'versicolor': 1, 'virginica': 2}

After that, we can iterate through the DataFrame and update each row:

for index, row in df.iterrows():
    # get the correct number for the species
    label = species_labels[row['species']]

    # update the row in the dataframe
    # (df.at[index, column] avoids chained assignment, which may not update df)
    df.at[index, 'species'] = label

# check that the species were converted correctly
df['species'].unique()
Out:
array([0, 1, 2], dtype=object)

The output shows that we successfully converted the values. The problem with this approach is that using iterrows requires Python to loop through each row in the dataframe. Many datasets have thousands of rows, so we should generally avoid Python looping to reduce runtimes.
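One detail in the output above is easy to miss: the array still reports dtype=object, because the loop swaps values one cell at a time without changing the column's type. A minimal sketch of converting the column afterwards, assuming a small hand-built frame (`sample`) in place of the Iris data and a species column that is explicitly created with dtype=object:

```python
import pandas as pd

# Hand-built stand-in for the Iris data; the object dtype is set explicitly
sample = pd.DataFrame({
    'sepal_length': [5.1, 7.0, 6.3],
    'species': pd.Series(['setosa', 'versicolor', 'virginica'], dtype=object),
})
species_labels = {'setosa': 0, 'versicolor': 1, 'virginica': 2}

for index, row in sample.iterrows():
    sample.at[index, 'species'] = species_labels[row['species']]

# The values are integers now, but the column is still typed as object
dtype_after_loop = sample['species'].dtype

# Convert explicitly so later numeric code sees an integer column
sample['species'] = sample['species'].astype(int)
dtype_after_cast = sample['species'].dtype

print(dtype_after_loop, dtype_after_cast)
```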

Before eliminating loops, let's consider a slightly better option than iterrows().

Option 2 (okay): itertuples()

An excellent alternative to iterrows is itertuples, which functions very similarly to iterrows, with the main difference being that itertuples returns named tuples. With a named tuple, you can access specific values as if they were an attribute. Thus, in the context of pandas, we can access the values of a row for a particular column without needing to unpack the tuple first.
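As a quick illustration of what a named tuple from itertuples looks like, here is a minimal sketch on a hand-built two-row frame (`sample` and the class name 'Flower' are assumptions made for this example):

```python
import pandas as pd

# Hand-built stand-in for two rows of the Iris data (illustrative values)
sample = pd.DataFrame({
    'sepal_length': [5.1, 7.0],
    'species': ['setosa', 'versicolor'],
})

first = next(sample.itertuples())
fields = first._fields       # field names: the index first, then the columns
row_label = first.Index      # the row's index value
species = first.species      # column values are available as attributes

# index=False leaves out the Index field; name sets the tuple's class name
bare = next(sample.itertuples(index=False, name='Flower'))

print(fields, row_label, species)
print(type(bare).__name__, bare._fields)
```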

The example below shows how we could use itertuples to label the Iris species:

df = sns.load_dataset('iris') # reset the dataframe
species_labels = {'setosa': 0, 'versicolor': 1, 'virginica': 2}

for row in df.itertuples():
    label = species_labels[row.species]
    df.at[row.Index, 'species'] = label # update the row in the dataframe

df['species'].unique() # check that the species were converted correctly
Out:
array([0, 1, 2], dtype=object)

As the example demonstrates, the setup for itertuples is very similar to iterrows. The only significant differences are that itertuples doesn't require unpacking into index and row and that the syntax row['column_name'] doesn't work with itertuples. It's also worth mentioning that itertuples is much faster.
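The access differences can be sketched as follows, again on a hand-built frame (`sample` is an assumption for illustration). Bracket access with a column name raises a TypeError, while attribute access, positional access and getattr all work:

```python
import pandas as pd

# Hand-built stand-in for two rows of the Iris data (illustrative values)
sample = pd.DataFrame({
    'sepal_length': [5.1, 7.0],
    'species': ['setosa', 'versicolor'],
})

row = next(sample.itertuples())

# Dictionary-style access fails because the row is a tuple, not a Series
bracket_access_failed = False
try:
    row['species']
except TypeError:
    bracket_access_failed = True

by_attribute = row.species            # attribute access
by_position = row[2]                  # position 0 is the index, then the columns
by_name = getattr(row, 'species')     # handy when the column name is in a variable

print(bracket_access_failed, by_attribute, by_position, by_name)
```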

As mentioned previously, we should generally avoid looping in pandas. However, there are a few situations where looping may be required, for example, if one of your dataframe columns contained URLs you wanted to visit one at a time with a web scraper. In situations where looping is needed, itertuples is a much better choice than iterrows.
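A minimal sketch of that kind of unavoidable loop is shown below. The `pages` frame, its URLs and the `visit` function are all hypothetical; `visit` is a stand-in that makes no network request, where a real scraper would fetch the page instead:

```python
import pandas as pd

# Hypothetical frame of pages to scrape (example.com URLs are placeholders)
pages = pd.DataFrame({
    'name': ['docs', 'blog'],
    'url': ['https://example.com/docs', 'https://example.com/blog'],
})

def visit(url):
    """Hypothetical stand-in for a scraper call; performs no network access."""
    return f"visited {url}"

# Each URL has to be handled one at a time, so a loop is appropriate here
results = [visit(row.url) for row in pages.itertuples(index=False)]

print(results)
```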

Option 3 (best for most applications): apply()

By calling apply with axis=1, we can run a function on every row of a dataframe. This solution still loops to get the job done, but apply is better optimized than iterrows, which results in faster runtimes. See below for an example of how we could use apply to label the species in each row.

df = sns.load_dataset('iris') # reset the dataframe
species_labels = {'setosa': 0, 'versicolor': 1, 'virginica': 2}

df['species'] = df.apply(lambda row: species_labels[row['species']], axis=1)

df.species.unique() # check that the species were converted correctly
Out:
array([0, 1, 2])

In this situation, axis=1 is used to specify that we'd like to iterate across the rows of the dataframe. By changing this to axis=0, we could apply a function to each column of a dataframe instead. For example, see below for how we could use apply to get the max value for each column in our dataframe:

df = sns.load_dataset('iris') # reset the dataframe

df.apply(lambda column: max(column), axis=0)
Out:
sepal_length          7.9
sepal_width           4.4
petal_length          6.9
petal_width           2.5
species         virginica
dtype: object

It's worth noting that using axis=0 is much faster here: apply calls the function once per column (five calls) rather than once per row (150 calls for the full Iris dataset), so each call works on a whole column at once instead of a single row. A solution like this, which acts on multiple array values simultaneously, is known as a vectorized solution. Check out the next section to see how we could use a vectorized solution for our label encoding problem.
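To make the idea of acting on whole columns concrete, here is a minimal sketch comparing a row-by-row apply with column arithmetic and a NumPy function. The `sample` frame and the derived product of length and width are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for three rows of the Iris data (illustrative values)
sample = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7],
    'sepal_width': [3.5, 3.0, 3.2],
})

# Row by row: the lambda is called once for every row
product_apply = sample.apply(
    lambda row: row['sepal_length'] * row['sepal_width'], axis=1
)

# Vectorized: a single multiplication over the two whole columns
product_vectorized = sample['sepal_length'] * sample['sepal_width']

# NumPy functions also accept a whole column and return a Series
log_length = np.log(sample['sepal_length'])

print(np.allclose(product_apply, product_vectorized))
```

Both approaches give the same values; the vectorized form simply hands the whole computation to pandas and NumPy instead of running Python code for each row.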

Option 4 (best for some applications): map()

One option for implementing a vectorized solution is map, which we can call on a Series (such as a dataframe column). As mentioned previously, a vectorized solution is one we can apply to multiple array values simultaneously. When given a dictionary as an argument, map treats each column value as a dictionary key and replaces it with the corresponding dictionary value. By passing map the species labels dictionary that we've been using, we can solve our label encoding problem:

df = sns.load_dataset('iris') # reset the dataframe
species_labels = {'setosa': 0, 'versicolor': 1, 'virginica': 2}

df['species'] = df['species'].map(species_labels)

df['species'].unique() # check that the species were converted correctly
Out:
array([0, 1, 2])

Note that when using map with a dictionary, the dictionary needs to include a key for every possible value in the column you're transforming. Otherwise, pandas will replace any value that doesn't have a key in the dictionary with a missing (NaN) value.
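A minimal sketch of this pitfall, using a hand-built Series and a deliberately incomplete dictionary (both are assumptions for illustration), along with one way to spot the values that were not mapped:

```python
import pandas as pd

species = pd.Series(['setosa', 'versicolor', 'virginica', 'setosa'])

# 'virginica' is deliberately left out of the dictionary
incomplete_labels = {'setosa': 0, 'versicolor': 1}

encoded = species.map(incomplete_labels)

# Values without a matching key come back as NaN
missing_count = int(encoded.isna().sum())

# List the original values that had no key in the dictionary
unmapped = species[encoded.isna()].unique().tolist()

print(encoded.tolist(), missing_count, unmapped)
```

Checking for missing values right after the map call is a cheap way to catch an incomplete dictionary before it affects later steps.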

Speed testing different options

Let's look at a larger dataset to get a good feel for how a vectorized approach is faster.

seaborn has another sample dataset called gammas, which contains medical imaging data. More specifically, the data represents the blood oxygen levels for different areas of the brain.

The ROI column refers to the region of interest the row data represents and has the unique values IPS, AG and V1. The ROI values need to be labeled, so let's write a solution using each of iterrows, itertuples, apply and map while recording the times to get a better idea of the speed differences.

region_labels = {'IPS': 0, 'AG': 1, 'V1': 2}

def iterrows_test(region_labels):
    df = sns.load_dataset('gammas')
    for index, row in df.iterrows():
        label = region_labels[row['ROI']]
        df.at[index, 'ROI'] = label  # direct .at access avoids chained assignment

def itertuples_test(region_labels):
    df = sns.load_dataset('gammas')
    for row in df.itertuples():
        label = region_labels[row.ROI]
        df.at[row.Index, 'ROI'] = label

def apply_test_axis0(region_labels):
    # Series.apply: get_label is called on each value of the ROI column
    def get_label(x):
        return region_labels[x]
    df = sns.load_dataset('gammas')
    df['ROI'] = df['ROI'].apply(get_label)

def apply_test_axis1(region_labels):
    df = sns.load_dataset('gammas')
    df['ROI'] = df.apply(lambda row: region_labels[row['ROI']], axis=1)
    
def map_test(region_labels):
    df = sns.load_dataset('gammas')
    df['ROI'] = df['ROI'].map(region_labels)


# Calculate timings (%timeit is an IPython/Jupyter magic command)
iterrows_time = %timeit -o -q iterrows_test(region_labels)
itertuples_time = %timeit -o -q itertuples_test(region_labels)
apply_time1 = %timeit -o -q apply_test_axis0(region_labels)
apply_time2 = %timeit -o -q apply_test_axis1(region_labels)
map_time = %timeit -o -q map_test(region_labels)

# Create data table
data = [
    ['iterrows', iterrows_time.best], 
    ['itertuples', itertuples_time.best],
    ['apply axis=0', apply_time1.best], 
    ['apply axis=1', apply_time2.best], 
    ['map', map_time.best]
]

df = pd.DataFrame(data, columns=['type', 'milliseconds'])
df.milliseconds = round(df.milliseconds * 1e3, 2) 
df.sort_values('milliseconds', inplace=True)

df
Out:
   type          milliseconds
4  map                   7.67
2  apply axis=0          9.35
3  apply axis=1         77.74
1  itertuples           82.49
0  iterrows            585.06

The results show that apply massively outperforms iterrows: row-wise apply (axis=1) took about 78 ms compared with 585 ms for iterrows. As mentioned previously, this is because apply is optimized to loop through dataframe rows much more quickly than iterrows does.

itertuples took about 82 ms, roughly seven times faster than iterrows and only slightly slower than row-wise apply, so if looping is required, use itertuples instead.

Using map as a vectorized solution gives even faster results. The gammas dataset is relatively small (only 6000 rows), but the performance gap between map and apply will increase as dataframes get larger. Vectorized solutions are therefore much more scalable, so it's worth getting used to them.
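If you're working outside a notebook, a comparison along these lines can be sketched with the standard library's timeit module. The version below is a rough sketch under stated assumptions: it uses a synthetic ROI column of random values rather than the gammas data, the `with_*` function names are made up for this example, and each function returns a list of labels instead of writing back into the dataframe. Absolute timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

region_labels = {'IPS': 0, 'AG': 1, 'V1': 2}

# Synthetic stand-in for the ROI column (random values, not the gammas data)
rng = np.random.default_rng(0)
frame = pd.DataFrame({'ROI': rng.choice(list(region_labels), size=2_000)})

def with_iterrows(df):
    return [region_labels[row['ROI']] for _, row in df.iterrows()]

def with_itertuples(df):
    return [region_labels[row.ROI] for row in df.itertuples()]

def with_apply(df):
    return df.apply(lambda row: region_labels[row['ROI']], axis=1).tolist()

def with_map(df):
    return df['ROI'].map(region_labels).tolist()

candidates = [with_iterrows, with_itertuples, with_apply, with_map]

# Best of three single runs per approach, converted to milliseconds
timings = {
    func.__name__: min(timeit.repeat(lambda: func(frame), number=1, repeat=3)) * 1e3
    for func in candidates
}

for name, ms in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:<16}{ms:8.2f} ms")
```

The frame is built once, outside the timed calls, so only the labeling work is measured.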

Summary

Using iterrows or itertuples to manipulate dataframe rows is an acceptable approach when you're just starting with dataframes. For a much quicker solution, apply is usually easy to drop in where you would have used iterrows, which makes it a good quick fix. However, real-life datasets can be very large, so vectorized solutions (such as map) are usually the way to go. Because vectorized approaches scale much better, runtimes stay manageable as datasets grow, which makes vectorizing dataframe processing an essential skill when working in the data industry.


Meet the Authors


Alfie graduated with a Master's degree in Mechanical Engineering from University College London. He's currently working as a top-rated data scientist on Upwork. Find him on LinkedIn.

Editor: Brendan Martin
Founder of LearnDataSci
