SQL String Contains in Postgres, MySQL, SQLite, BigQuery
For this article, we used a table containing Netflix titles. You can find it here on Kaggle.
For your reference, here's what the first five rows of the data looks like:
|0||s1||Movie||Dick Johnson Is Dead||Kirsten Johnson||NaN||United States||September 25, 2021||2020||PG-13||90 min||Documentaries||As her father nears the end of his life, filmm...|
|1||s2||TV Show||Blood & Water||NaN||Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...||South Africa||September 24, 2021||2021||TV-MA||2 Seasons||International TV Shows, TV Dramas, TV Mysteries||After crossing paths at a party, a Cape Town t...|
|2||s3||TV Show||Ganglands||Julien Leclercq||Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...||NaN||September 24, 2021||2021||TV-MA||1 Season||Crime TV Shows, International TV Shows, TV Act...||To protect his family from a powerful drug lor...|
|3||s4||TV Show||Jailbirds New Orleans||NaN||NaN||NaN||September 24, 2021||2021||TV-MA||1 Season||Docuseries, Reality TV||Feuds, flirtations and toilet talk go down amo...|
|4||s5||TV Show||Kota Factory||NaN||Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...||India||September 24, 2021||2021||TV-MA||2 Seasons||International TV Shows, Romantic TV Shows, TV ...||In a city of coaching centers known to train I...|
A Quick and Easy Way
An easy way to check with SQL if a string contains a particular phrase is by using the
LIKE clause in conjunction with the
'%' wildcard, which acts as a way to match characters around the target word. For example, we can do this as follows:
|0||Madness in the Desert|
|1||Six Windows in the Desert|
|2||Who Would You Take to a Deserted Island?|
|3||Jalpari: The Desert Mermaid|
|4||Queen of the Desert|
|5||The Desert Bride|
Notice that the
'%' wildcard matched "desert" in many positions. Other than
LIKE, this article covers various SQL options developers can use to check if a string contains a substring.
A few simple use cases for wanting to find a substring include:
- Identifying different variants of a product e.g., blue T-shirt small, blue T-shirt medium
- Searching user messages/posts for a particular phrase or word
- Finding and replacing substrings when cleaning data
We'll start by covering the
LIKE operator in more detail, then look at Regular Expressions (Regex) and the
INSTR function. Our examples will target the PostgreSQL dialect, but we've included a list of SQL dialects before each feature to show if it's supported and if the approach needs to be varied.
Option 1: The
Main SQL dialects supported
|PostgreSQL||WHERE value LIKE pattern|
|MYSQL||WHERE value LIKE pattern|
|SQLite||WHERE value LIKE pattern|
|BigQuery||WHERE value LIKE pattern|
|T-SQL||WHERE value LIKE pattern|
LIKE clause compares two string values. It's usually combined with the
'_' wildcards to search for substrings. We'll return to the
'_' wildcard later, but
'%' can be used loosely in the following manners:
|'%substring%'||Match values containing 'substring'|
|'substring%'||Match values that start with 'substring'|
|'%substring'||Match values that end with 'substring'|
Match values containing a substring
|1||Abla Fahita: Drama Queen|
Match values that start with a substring
Match values that end with a substring
Adding in the
It's less common, but you can leverage the
'_' wildcard to enhance your
'_' represents a single character, which allows you to make your
LIKE comparisons more specific. For example, you can use
'_' in the following fashions:
|'_substring%'||Match values where 'substring' starts from the second character|
|'substring_%'||Match values that start with 'substring' and have at least one additional character|
|'substring__%'||Match values that start with 'substring' and have at least two additional characters|
As mentioned, this is rare, but it does allow you to make more bespoke substring searches. Following the above examples, see below for a demonstration of how we could add in the
'_' wildcard to find all values that contain a substring but don't end with it:
|0||Abla Fahita: Drama Queen|
Remember that you don't need to restrict the usage of
'_' to what's shown here. These wildcards can be stringed together in many ways to help pinpoint much more specific patterns.
Option 2: Regex matching
Main SQL dialects supported
|MYSQL||column REGEXP pattern|
|SQLite||column REGEXP pattern|
While slightly more complicated, regex is often the go-to approach for substring searching. Regex is used to search for patterns (
'%substring%' could be defined as a simple pattern), but it allows for heavy customization - this can be ideal when working with messy text data.
While regex can be complex, using it to search for substrings is relatively simple:
|0||Batman: The Killing Joke|
|1||Lego DC Comics: Batman Be-Leaguered|
Regex 101 is a great place to experiment with regex. You can apply your regex patterns to test strings to determine what's getting matched. See below for an example of how Regex 101 was used to test the usage of the
\\D character class, which corresponds to anything that isn't a digit.
Option 3: The
Main SQL dialects supported
|0||Fresh, Fried & Crispy|
|1||Judah Friedlander: America Is the Greatest Cou...|
Using Python combined with
duckdb, we can run some speed tests on the approaches described above to compare their performances.
These results show that regex is the fastest, outpacing the
LIKE clause by a reasonable amount. INSTR is also very quick, suggesting it could be utilizing a regex search under the hood.
Remember that the data table used in this article is relatively small-scale; performance could vary based on table size and the type of database used.
In SQL, the simplest way to see if a column contains a substring is by using the
LIKE clause in combination with wildcards. Alternatively, most SQL dialects include functionality for regex matching.
While more complex, we'd recommend using Regex, which offers more customization.
Regex skills can be massively helpful when working with text data - it does much more than just substring searching, but this is a good starting point if you want to start leveraging regex. As a bonus, it usually comes with a performance boost!