Cookie Policy

We use cookies to operate this website, improve usability, personalize your experience, and improve our marketing. Privacy Policy.

By clicking "Accept" or further use of this website, you agree to allow cookies.

Accept
Learn Machine Learning by Doing Learn Now
You are reading solutions
Cansın-Guler-profile-photo.jpg
Author: Cansin Guler
Software Engineer

Pandas df.explode(): Unnesting Series and DataFrame Columns

Pandas explode(): Unnesting Series and DataFrame Columns

Pandas' explode() flattens nested Series objects and DataFrame columns by unfurling the list-like values and spreading their content to multiple rows.

Let's have a quick look. Take the DataFrame below:

import pandas as pd

data = [['Kelly Osborn', ['Calculus', 'Programming']], ['Jade Reed', ['Biology', 'Chemistry', 'Physics']]]
df = pd.DataFrame(data, columns= ['Student', 'Subject'])

df
Out:
StudentSubject
0Kelly Osborn[Calculus, Programming]
1Jade Reed[Biology, Chemistry, Physics]

We can call explode() to unpack the values under Subject, like so:

df.explode('Subject')
Out:
StudentSubject
0Kelly OsbornCalculus
0Kelly OsbornProgramming
1Jade ReedBiology
1Jade ReedChemistry
1Jade ReedPhysics

How does explode() work?

Having multiple values bunched up in one cell (in a _list-like_ form) can create a challenge for analysis. explode() (adopted by Pandas in version 0.25.0) tackles this particular problem.

explode() has two parameters:

  • column - specifies the column(s) to be exploded. It is either a column name or a bracketed list of column names.
  • ignore_index - decides whether the original indexes are to be reset. It is False by default.

Let's work on an example. Take the DataFrame below:

midterm_data = {'name': ['Jack', 'David'], 'math101': [[35, 67], []], 'comp101': [[50, 73], [20, 40]]}
midterms = pd.DataFrame(midterm_data)

midterms
Out:
namemath101comp101
0Jack[35, 67][50, 73]
1David[][20, 40]

Let's separate the comp101 column as a Series of its own:

midterms['comp101']
Out:
0    [50, 73]
1    [20, 40]
Name: comp101, dtype: object

We can apply explode() to it directly without any parameters:

midterms['comp101'].explode()
Out:
0    50
0    73
1    20
1    40
Name: comp101, dtype: object

Notice that even though we now have numbers populating the cells, the dtype stays object. This is the case for every column explode() produces.

Now, let's run explode() on the DataFrame itself. In this case, we have to specify the column to be exploded:

midterms.explode('math101')
Out:
namemath101comp101
0Jack35[50, 73]
0Jack67[50, 73]
1DavidNaN[20, 40]

As you can see, the first row repeated itself for each value of math101, and the empty list of the second row got replaced with NaN.

In this DataFrame, the row labels do not hold any particular information. We could reset them by passing True to ignore_index. Like so:

midterms.explode('comp101', True)
Out:
namemath101comp101
0Jack[35, 67]50
1Jack[35, 67]73
2David[]20
3David[]40

The ignore_index=True creates numeric indices from 0 to n.

Exploding List-Looking Strings

The previous section defined the explode() behavior as 'unpacking list-like values'. explode(), more precisely, works on Python lists, tuples and sets, Pandas Series, and Numpy n-dimensional arrays. It does not affect Strings, and this is known to cause problems.

Take the DataFrame below:

movie_data = {'movie': ['Memento', 'Casablanca'], 'genre':["['Thriller', 'Mistery']", "['Drama', 'Romance', 'War']"]}
movies = pd.DataFrame(movie_data)

movies
Out:
moviegenre
0Memento['Thriller', 'Mistery']
1Casablanca['Drama', 'Romance', 'War']

Let's try to explode movies' genre column:

movies.explode('genre')
Out:
moviegenre
0Memento['Thriller', 'Mistery']
1Casablanca['Drama', 'Romance', 'War']

explode() fails us here without any apparent error.

This happens -often when working on imported data. Here, it is easy to pinpoint the problem since we intentionally filled the genre column with list-looking Strings.

We must convert these String values to lists before running explode(). Since the Strings under genre have the exact form of a Python list, we can use ast.literal_eval like so:

from ast import literal_eval

movies['genre'] = movies['genre'].apply(literal_eval)

movies
Out:
moviegenre
0Memento[Thriller, Mistery]
1Casablanca[Drama, Romance, War]

And now, we should be able to explode the genre column:

movies.explode('genre')
Out:
moviegenre
0MementoThriller
0MementoMistery
1CasablancaDrama
1CasablancaRomance
1CasablancaWar

literal_eval only applies, however, to Strings that have the _literal form_ of a Python list, set, or tuple.

Let's take a DataFrame where the data is slightly different in structure:

book_data = {'book': ['Little Women', 'Jane Eyre'], 'tags':["classics,historical,young adult", "classics,romance, gothic"]}
books = pd.DataFrame(book_data)

books
Out:
booktags
0Little Womenclassics,historical,young adult
1Jane Eyreclassics,romance, gothic

Here, we can use str.split to transform the values in the tags column.

Like so:

books = books.assign(tags=books.tags.str.split(","))

books
Out:
booktags
0Little Women[classics, historical, young adult]
1Jane Eyre[classics, romance, gothic]

And now we explode it:

books.explode("tags")
Out:
booktags
0Little Womenclassics
0Little Womenhistorical
0Little Womenyoung adult
1Jane Eyreclassics
1Jane Eyreromance
1Jane Eyregothic

Exploding Multiple Columns at Once

We can explode more than one column at a time, given their list-like values on each row match in length.

Let's create a new midterms DataFrame:

midterm_data = {'name': ['Nia', 'Millie'], 'calc': [[80, 88], [45, 50]], 'bio': [[80, 43], [78, 50]], 'chem': [[], [50, 67]]}
midterms = pd.DataFrame(midterm_data)

midterms
Out:
namecalcbiochem
0Nia[80, 88][80, 43][]
1Millie[45, 50][78, 50][50, 67]

Since calc and bio entries in each row match in length, we can explode these columns together:

midterms.explode(['calc', 'bio'])
Out:
namecalcbiochem
0Nia8080[]
0Nia8843[]
1Millie4578[50, 67]
1Millie5050[50, 67]

However, we cannot explode the chem column along with the others since its first value is an empty list, while others in the same row hold two elements each.

Let's try it, though:

midterms.explode(['calc', 'bio', 'chem'])
Out:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [17], in <cell line: 1>()
----> 1 midterms.explode(['calc', 'bio', 'chem'])
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\frame.py:8351, in DataFrame.explode(self, column, ignore_index)
   8349     for c in columns[1:]:
   8350         if not all(counts0 == self[c].apply(mylen)):
-> 8351             raise ValueError("columns must have matching element counts")
   8352     result = DataFrame({c: df[c].explode() for c in columns})
   8353 result = df.drop(columns, axis=1).join(result)
ValueError: columns must have matching element counts

Note that the names must be enclosed in brackets when exploding multiple columns. Otherwise, Pandas throws no error yet only changes the first column whose name was given:

midterms.explode('calc', 'bio')

Summary

explode() unnests the multi-value cells in a given Series or DataFrame column, transforming wide-format data into long-format.


Meet the Authors

Cansın-Guler-profile-photo.jpg

Software engineer, technical writer and trainer.

Brendan Martin
Editor: Brendan Martin
Founder of LearnDataSci

Get updates in your inbox

Join over 7,500 data science learners.