Cookie Policy

We use cookies to operate this website, improve usability, personalize your experience, and improve our marketing. Privacy Policy.

By clicking "Accept" or further use of this website, you agree to allow cookies.

Accept
Learn Machine Learning by Doing Learn Now
You are reading solutions / SQL
alfie-grace-headshot-square2.jpg
Author: Alfie Grace
Data Scientist

SQL String Contains in Postgres, MySQL, SQLite, BigQuery

For this article, we used a table containing Netflix titles. You can find it here on Kaggle.

For your reference, here's what the first five rows of the data looks like:

show_idtypetitledirectorcastcountrydate_addedrelease_yearratingdurationlisted_indescription
0s1MovieDick Johnson Is DeadKirsten JohnsonNaNUnited StatesSeptember 25, 20212020PG-1390 minDocumentariesAs her father nears the end of his life, filmm...
1s2TV ShowBlood & WaterNaNAma Qamata, Khosi Ngema, Gail Mabalane, Thaban...South AfricaSeptember 24, 20212021TV-MA2 SeasonsInternational TV Shows, TV Dramas, TV MysteriesAfter crossing paths at a party, a Cape Town t...
2s3TV ShowGanglandsJulien LeclercqSami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...NaNSeptember 24, 20212021TV-MA1 SeasonCrime TV Shows, International TV Shows, TV Act...To protect his family from a powerful drug lor...
3s4TV ShowJailbirds New OrleansNaNNaNNaNSeptember 24, 20212021TV-MA1 SeasonDocuseries, Reality TVFeuds, flirtations and toilet talk go down amo...
4s5TV ShowKota FactoryNaNMayur More, Jitendra Kumar, Ranjan Raj, Alam K...IndiaSeptember 24, 20212021TV-MA2 SeasonsInternational TV Shows, Romantic TV Shows, TV ...In a city of coaching centers known to train I...

A Quick and Easy Way

An easy way to check with SQL if a string contains a particular phrase is by using the LIKE clause in conjunction with the '%' wildcard, which acts as a way to match characters around the target word. For example, we can do this as follows:

SELECT title
FROM netflix_data
WHERE LOWER(title) LIKE '%desert%'  -- LOWER used to avoid case-sensitivity
Out:
title
0Madness in the Desert
1Six Windows in the Desert
2Who Would You Take to a Deserted Island?
3Jalpari: The Desert Mermaid
4Queen of the Desert
5The Desert Bride

Notice that the '%' wildcard matched "desert" in many positions. Other than LIKE, this article covers various SQL options developers can use to check if a string contains a substring.

A few simple use cases for wanting to find a substring include:

  • Identifying different variants of a product e.g., blue T-shirt small, blue T-shirt medium
  • Searching user messages/posts for a particular phrase or word
  • Finding and replacing substrings when cleaning data

We'll start by covering the LIKE operator in more detail, then look at Regular Expressions (Regex) and the INSTR function. Our examples will target the PostgreSQL dialect, but we've included a list of SQL dialects before each feature to show if it's supported and if the approach needs to be varied.

Option 1: The LIKE clause

Main SQL dialects supported

SQL DialectVariation
PostgreSQLWHERE value LIKE pattern
MYSQLWHERE value LIKE pattern
SQLiteWHERE value LIKE pattern
BigQueryWHERE value LIKE pattern
T-SQLWHERE value LIKE pattern

The LIKE clause compares two string values. It's usually combined with the '%' and '_' wildcards to search for substrings. We'll return to the '_' wildcard later, but '%' can be used loosely in the following manners:

Wildcard usageOutcome
'%substring%'Match values containing 'substring'
'substring%'Match values that start with 'substring'
'%substring'Match values that end with 'substring'

Match values containing a substring

SELECT title
FROM netflix_data
WHERE LOWER(title) LIKE '%drama%'
Out:
title
0Mama Drama
1Abla Fahita: Drama Queen
2Total Drama
3Dramaworld

Match values that start with a substring

SELECT title
FROM netflix_data
WHERE LOWER(title) LIKE 'drama%'
Out:
title
0Dramaworld

Match values that end with a substring

SELECT title
FROM netflix_data
WHERE LOWER(title) LIKE '%drama'
Out:
title
0Mama Drama
1Total Drama

Adding in the '_' operator

It's less common, but you can leverage the '_' wildcard to enhance your LIKE usage. '_' represents a single character, which allows you to make your LIKE comparisons more specific. For example, you can use '_' in the following fashions:

Wildcard usageOutcome
'_substring%'Match values where 'substring' starts from the second character
'substring_%'Match values that start with 'substring' and have at least one additional character
'substring__%'Match values that start with 'substring' and have at least two additional characters

As mentioned, this is rare, but it does allow you to make more bespoke substring searches. Following the above examples, see below for a demonstration of how we could add in the '_' wildcard to find all values that contain a substring but don't end with it:

SELECT title
FROM netflix_data
WHERE LOWER(title) LIKE '%drama_%'
Out:
title
0Abla Fahita: Drama Queen
1Dramaworld

Remember that you don't need to restrict the usage of '%' and '_' to what's shown here. These wildcards can be stringed together in many ways to help pinpoint much more specific patterns.

Option 2: Regex matching

Main SQL dialects supported

SQL DialectVariation
PostgreSQLREGEXP_MATCH(column, pattern)
MYSQLcolumn REGEXP pattern
SQLitecolumn REGEXP pattern
BigQueryREGEXP_CONTAINS(column, pattern)

While slightly more complicated, regex is often the go-to approach for substring searching. Regex is used to search for patterns ('%substring%' could be defined as a simple pattern), but it allows for heavy customization - this can be ideal when working with messy text data.

While regex can be complex, using it to search for substrings is relatively simple:

SELECT title
FROM netflix_data
WHERE REGEXP_MATCHES(title, '(?i)batman') -- (?i) used to avoid case-sensitivity
Out:
title
0Batman: The Killing Joke
1Lego DC Comics: Batman Be-Leaguered

Regex 101 is a great place to experiment with regex. You can apply your regex patterns to test strings to determine what's getting matched. See below for an example of how Regex 101 was used to test the usage of the \\D character class, which corresponds to anything that isn't a digit.

Option 3: The INSTR function

Main SQL dialects supported

SQL DialectVariation
MySQLINSTR(column, pattern)
SQLiteINSTR(column, pattern)
BigQueryCONTAINS_SUBSTR(column, pattern)
SELECT title
FROM netflix_data
WHERE INSTR(LOWER(title), 'fried')
Out:
title
0Fresh, Fried & Crispy
1Judah Friedlander: America Is the Greatest Cou...

Performance comparison

Using Python combined with duckdb, we can run some speed tests on the approaches described above to compare their performances.

import duckdb

def like_clause_speed_test():
    sql = """
    SELECT title
    FROM netflix_data
    WHERE LOWER(title) LIKE '%desert%'
    """

    duckdb.query(sql)

def regex_speed_test():
    sql = """
    SELECT title
    FROM netflix_data
    WHERE REGEX_MATCHES(LOWER(title), 'desert')
    """

    duckdb.query(sql)
        
def instr_speed_test():
    sql = """
    SELECT title
    FROM netflix_data
    WHERE INSTR(LOWER(title), 'desert')
    """

    duckdb.query(sql)

# Calculate timings
like_clause_time = %timeit -o -q like_clause_speed_test()
regex_time = %timeit -o -q regex_speed_test
instr_time = %timeit -o -q instr_speed_test

# Create data table
performance_data = [
    ['LIKE clause', like_clause_time.best], 
    ['Regex', regex_time.best],
    ['INSTR', instr_time.best]
]

performance_df = pd.DataFrame(performance_data, columns=['type', 'milliseconds'])
performance_df.milliseconds = round(performance_df.milliseconds * 1e3, 6) 
performance_df.sort_values('milliseconds', inplace=True)

performance_df
Out:
typemilliseconds
1Regex0.000015
2INSTR0.000015
0LIKE clause5.511349

These results show that regex is the fastest, outpacing the LIKE clause by a reasonable amount. INSTR is also very quick, suggesting it could be utilizing a regex search under the hood.

Remember that the data table used in this article is relatively small-scale; performance could vary based on table size and the type of database used.

Summary

In SQL, the simplest way to see if a column contains a substring is by using the LIKE clause in combination with wildcards. Alternatively, most SQL dialects include functionality for regex matching.

While more complex, we'd recommend using Regex, which offers more customization.

Regex skills can be massively helpful when working with text data - it does much more than just substring searching, but this is a good starting point if you want to start leveraging regex. As a bonus, it usually comes with a performance boost!


Meet the Authors

alfie-grace-headshot-square2.jpg

Alfie graduated with a Master's degree in Mechanical Engineering from University College London. He's currently working as Data Scientist at Square Enix. Find him on LinkedIn.

Brendan Martin
Editor: Brendan
Founder of LearnDataSci

Get updates in your inbox

Join over 7,500 data science learners.