How to iterate over rows in Pandas: Most efficient options
There are many ways to iterate over rows of a DataFrame or Series in pandas, each with their own pros and cons.
Since pandas is built on top of NumPy, also consider reading through our NumPy tutorial to learn more about working with the underlying arrays.
To demonstrate each row-iteration method, we'll be utilizing the ubiquitous Iris flower dataset, an easy-to-access dataset containing features of different species of flowers.
We'll import the dataset using seaborn and limit it to the top three rows for simplicity:
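As a minimal sketch of that setup, the snippet below types in the first three Iris rows by hand so that it runs without downloading anything. The commented-out lines show the seaborn call, which assumes seaborn is installed and the dataset can be fetched. The variable name df is an assumption.

```python
import pandas as pd

# With seaborn installed and network access, the equivalent would be:
#   import seaborn as sns
#   df = sns.load_dataset("iris").head(3)

# First three rows of the Iris dataset, typed in by hand.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7],
        "sepal_width": [3.5, 3.0, 3.2],
        "petal_length": [1.4, 1.4, 1.3],
        "petal_width": [0.2, 0.2, 0.2],
        "species": ["setosa", "setosa", "setosa"],
    }
)
print(df)
```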
Most straightforward row iteration
The most straightforward method for iterating over rows is with the iterrows() method, like so:
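A basic iterrows() loop might look like the sketch below. The three-row dataframe is a hand-typed stand-in for the Iris sample, and the names df and seen_indexes are assumptions made for illustration.

```python
import pandas as pd

# Hand-typed stand-in for the first three Iris rows (fewer columns for brevity).
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7],
        "sepal_width": [3.5, 3.0, 3.2],
        "species": ["setosa", "setosa", "setosa"],
    }
)

seen_indexes = []
for index, row in df.iterrows():
    # index is the row label; row is a Series keyed by column name.
    print(f"Index: {index}")
    print(row, "\n")
    seen_indexes.append(index)
```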
iterrows() returns the row index as well as the row itself. To improve readability, if you don't care about the index value, you can throw it away with an underscore (_).
Despite its ease of use and intuitive nature, iterrows() is one of the slowest ways to iterate over rows. This article will also look at how you can replace iterrows() with itertuples() or apply() to speed up iteration.
Additionally, we'll consider how you can use apply(), NumPy functions, or map() for a vectorized approach to row operations.
Option 1 (worst): iterrows()
Using iterrows() in combination with a dataframe creates what is known as a generator. A generator is an iterable object, meaning we can loop through it.
Let's use iterrows() again, but without pulling out the index in the loop definition:
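A sketch of that loop, using a hand-typed stand-in for the Iris sample (the names df and items are assumptions):

```python
import pandas as pd

# Hand-typed stand-in for the first three Iris rows.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7],
        "sepal_width": [3.5, 3.0, 3.2],
        "species": ["setosa", "setosa", "setosa"],
    }
)

items = []
for row in df.iterrows():
    # With a single loop variable, each item is the whole (index, Series) pair.
    print(row, "\n")
    items.append(row)
```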
You might have noticed that the result looks a little different from the output shown in the intro. Here, we've only assigned the output to the row variable, which now contains both the index and row in a tuple.
But let's try and extract the sepal length and width from each row:
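The attempt can be reproduced with the sketch below, which catches the exception so the script keeps running. The dataframe is a hand-typed stand-in, and error_message is a hypothetical variable used only to hold the error text.

```python
import pandas as pd

# Hand-typed stand-in for the first three Iris rows.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7],
        "sepal_width": [3.5, 3.0, 3.2],
        "species": ["setosa", "setosa", "setosa"],
    }
)

error_message = None
for row in df.iterrows():
    try:
        # row is a tuple here, so indexing it with a column name fails.
        print(row["sepal_length"], row["sepal_width"])
    except TypeError as err:
        error_message = str(err)
        print(f"TypeError: {err}")
        break
```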
This error occurs because each item in our iterrows() generator is a tuple-type object with two values, the row index and the row content.
If you look at the output at the beginning of this section, each item is wrapped in parentheses, with a comma separating the index from the row data. By assigning the generator output to two variables, we can unpack the tuple into a more accessible format.
After unpacking the tuple, we can easily access specific values in each row. In this example, we'll throw away the index variable by using an underscore:
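A sketch of the unpacked loop, again on a hand-typed stand-in for the Iris sample (df and sepal_lengths are assumed names):

```python
import pandas as pd

# Hand-typed stand-in for the first three Iris rows.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7],
        "sepal_width": [3.5, 3.0, 3.2],
        "species": ["setosa", "setosa", "setosa"],
    }
)

sepal_lengths = []
for _, row in df.iterrows():
    # The underscore discards the index; row is a Series we can index by column.
    print(row["sepal_length"], row["sepal_width"])
    sepal_lengths.append(row["sepal_length"])
```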
Row iteration example
A typical machine learning project for beginners is to use the Iris dataset to create a classification model that can predict the species of a flower based on its measurements.
To do this, we'll need to convert the values in the species column of the dataframe into a numerical format. This process is known as label encoding. Plenty of tools exist to automatically do this for us, such as sklearn's LabelEncoder, but let's use iterrows() to do the label encoding ourselves.
First, we need to know the unique species in the dataset:
We'll manually assign a number for each species:
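The two steps above, listing the unique species and assigning each one a number, might be sketched as follows. The small dataframe is made up for illustration (the full Iris dataset has 150 rows), and the dictionary name species_labels and the particular numbers chosen are assumptions.

```python
import pandas as pd

# Small made-up sample containing all three Iris species.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 6.3, 4.9],
        "species": ["setosa", "versicolor", "virginica", "setosa"],
    }
)

# unique() returns the distinct values in order of first appearance.
species = df["species"].unique()
print(species)

# Manually chosen numeric label for each species.
species_labels = {"setosa": 0, "versicolor": 1, "virginica": 2}
```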
After that, we can iterate through the DataFrame and update each row:
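One way to write that loop is sketched below. It is not necessarily the original code: this version writes the numbers into a separate species_label column (an assumed name) rather than overwriting species, which avoids mixing strings and integers in one column. It also writes through df.at because the row yielded by iterrows() is a copy, so assigning to it would not change the dataframe.

```python
import pandas as pd

# Small made-up sample containing all three Iris species.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 6.3, 4.9],
        "species": ["setosa", "versicolor", "virginica", "setosa"],
    }
)
species_labels = {"setosa": 0, "versicolor": 1, "virginica": 2}

# Create the target column first so every row has an integer slot.
df["species_label"] = 0
for index, row in df.iterrows():
    # Look up the label for this row's species and store it by row index.
    df.at[index, "species_label"] = species_labels[row["species"]]

print(df)
```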
The output shows that we successfully converted the values. The problem with this approach is that using iterrows requires Python to loop through each row in the dataframe. Many datasets have thousands of rows, so we should generally avoid Python looping to reduce runtimes.
Before eliminating loops, let's consider a slightly better option than iterrows.
Option 2 (okay): itertuples()
An excellent alternative to iterrows is itertuples, which functions very similarly to iterrows, with the main difference being that itertuples returns named tuples. With a named tuple, you can access specific values as if they were attributes. Thus, in the context of pandas, we can access the values of a row for a particular column without needing to unpack the tuple first.
The example below shows how we could use itertuples to label the Iris species:
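A sketch of the itertuples version, under the same assumptions as before: a small made-up sample, a species_labels dictionary, and a separate species_label column for the result.

```python
import pandas as pd

# Small made-up sample containing all three Iris species.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 6.3, 4.9],
        "species": ["setosa", "versicolor", "virginica", "setosa"],
    }
)
species_labels = {"setosa": 0, "versicolor": 1, "virginica": 2}

df["species_label"] = 0
for row in df.itertuples():
    # Each row is a named tuple: row.Index is the index, row.species the column value.
    df.at[row.Index, "species_label"] = species_labels[row.species]

print(df)
```

Attribute access only works for column names that are valid Python identifiers, which is the case for every column in this sample.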
As the example demonstrates, the setup for itertuples is very similar to iterrows. The only significant differences are that itertuples doesn't require unpacking into index and row, and that the syntax row['column_name'] doesn't work with itertuples; use row.column_name instead. It's also worth mentioning that itertuples is much faster.
As mentioned previously, we should generally avoid looping in pandas. However, there are a few situations where looping may be required, for example, if one of your dataframe columns contains URLs you want to visit one at a time with a web scraper. In situations where looping is needed, itertuples is a much better choice than iterrows.
Option 3 (best for most applications): apply()
By using apply and specifying axis=1, we can run a function on every row of a dataframe. This solution also uses looping to get the job done, but apply has been optimized better than iterrows, which results in faster runtimes. See below for an example of how we could use apply for labeling the species in each row.
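A sketch of the apply version, using the same assumed sample data, species_labels dictionary, and species_label output column as the earlier sketches:

```python
import pandas as pd

# Small made-up sample containing all three Iris species.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 6.3, 4.9],
        "species": ["setosa", "versicolor", "virginica", "setosa"],
    }
)
species_labels = {"setosa": 0, "versicolor": 1, "virginica": 2}

# axis=1 passes each row (as a Series) to the function.
df["species_label"] = df.apply(
    lambda row: species_labels[row["species"]], axis=1
)
print(df)
```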
In this situation, axis=1 is used to specify that we'd like to iterate across the rows of the dataframe. By changing this to axis=0, we could apply a function to each column of a dataframe instead. For example, see below for how we could use apply to get the max value for each column in our dataframe:
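A column-wise sketch, restricted here to two numeric columns of a made-up sample. The lambda calling col.max() is one way to express this; the names numeric and column_max are assumptions.

```python
import pandas as pd

# Small made-up sample with two numeric columns.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 6.3, 4.9],
        "sepal_width": [3.5, 3.2, 3.3, 3.0],
        "species": ["setosa", "versicolor", "virginica", "setosa"],
    }
)

numeric = df[["sepal_length", "sepal_width"]]

# axis=0 passes each column (as a Series) to the function.
column_max = numeric.apply(lambda col: col.max(), axis=0)
print(column_max)
```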
It's worth noting that using axis=0 is much faster, as it applies functionality to every row of a column at once instead of iterating through rows one at a time. A solution like this, which acts on multiple array values simultaneously, is known as a vectorized solution. Check out the next section to see how we could use a vectorized solution for our label encoding problem.
Option 4 (best for some applications): map()
One option we've got for implementing a vectorized solution is using map. We can use map on a series (such as a dataframe column). As mentioned previously, a vectorized solution is one we can apply to multiple array values simultaneously. By providing map a dictionary as an argument, map will treat column values as dictionary keys, then transform these to their corresponding values in the dictionary. By passing map the species labels dictionary that we've been using, we can solve our label encoding problem:
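A sketch of the map version, with the same assumed sample data and species_labels dictionary as the earlier sketches:

```python
import pandas as pd

# Small made-up sample containing all three Iris species.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 6.3, 4.9],
        "species": ["setosa", "versicolor", "virginica", "setosa"],
    }
)
species_labels = {"setosa": 0, "versicolor": 1, "virginica": 2}

# map looks up every value of the column in the dictionary in one call.
df["species_label"] = df["species"].map(species_labels)
print(df)
```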
Note that your dictionary needs to include keys for all possible values in the column you're transforming when using map with a dictionary. Otherwise, pandas will replace any dataframe values that don't have a key in the dictionary with a missing (NaN) value.
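The missing-key behavior can be seen with a deliberately incomplete dictionary. In this sketch, partial_labels (a hypothetical name) leaves out one species on purpose:

```python
import pandas as pd

# Small made-up sample containing all three Iris species.
df = pd.DataFrame({"species": ["setosa", "versicolor", "virginica"]})

# "virginica" is intentionally left out of the dictionary.
partial_labels = {"setosa": 0, "versicolor": 1}

encoded = df["species"].map(partial_labels)
print(encoded)

# isna() flags the rows whose species had no dictionary key.
missing = encoded.isna()
print(missing)
```

Checking the result with isna() after a dictionary-based map is a cheap way to catch values the dictionary does not cover.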
Speed testing different options
Let's look at a larger dataset to get a good feel for how a vectorized approach is faster.
seaborn has another sample dataset called gammas, which contains medical imaging data. More specifically, the data represents the blood oxygen levels for different areas of the brain.
The ROI column refers to the region of interest the row data represents and has a small set of unique values.
These ROI values need to be labeled, so let's try solutions using iterrows, itertuples, apply, and map while recording the times to get a better idea of speed differences.
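A rough way to run such a comparison is sketched below. It does not load the gammas data: it builds a synthetic 6000-row dataframe with placeholder region names, and the helper functions and the roi_labels dictionary are assumptions made for illustration. Each helper returns a plain list of labels instead of writing back into the dataframe, and timings will vary by machine and pandas version.

```python
import time

import pandas as pd

# Synthetic stand-in for the gammas data: 6000 rows, placeholder region names.
rois = ["roi_a", "roi_b", "roi_c"]
df = pd.DataFrame({"ROI": [rois[i % 3] for i in range(6000)]})
roi_labels = {"roi_a": 0, "roi_b": 1, "roi_c": 2}


def with_iterrows(frame):
    return [roi_labels[row["ROI"]] for _, row in frame.iterrows()]


def with_itertuples(frame):
    return [roi_labels[row.ROI] for row in frame.itertuples()]


def with_apply(frame):
    return frame.apply(lambda row: roi_labels[row["ROI"]], axis=1).tolist()


def with_map(frame):
    return frame["ROI"].map(roi_labels).tolist()


results = {}
for func in (with_iterrows, with_itertuples, with_apply, with_map):
    # Time a single run of each approach with a monotonic clock.
    start = time.perf_counter()
    results[func.__name__] = func(df)
    elapsed = time.perf_counter() - start
    print(f"{func.__name__}: {elapsed:.4f} s")
```

A single run per method is noisy; for more stable numbers, the standard library's timeit module can repeat each call several times.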
The results show that apply massively outperforms iterrows. As mentioned previously, this is because apply is optimized for looping through dataframe rows much quicker than iterrows.
While slower than apply, itertuples is quicker than iterrows, so if looping is required, try implementing itertuples instead.
Using map as a vectorized solution gives even faster results. The gammas dataset is relatively small (only 6000 rows), but the performance gap between map and apply will increase as dataframes get larger. As a result, vectorized solutions are much more scalable, so you should get used to using these.
Using iterrows or itertuples to manipulate dataframe rows is an acceptable approach when you're just starting with dataframes. For a much quicker solution, apply is usually pretty easy to implement in place of iterrows, meaning that it's good when you need a quick fix. However, real-life datasets can get very large, meaning that vectorized solutions (such as map) are usually the way to go. As mentioned previously, vectorized approaches scale much better with huge datasets. Good scalability means that you won't start to experience long runtimes with big datasets, so vectorizing dataframe processing is essential when working in the data industry.