How to split strings on multiple delimiters with Python?


Problem

You need to split a string into fields, but the delimiters aren’t consistent throughout the string.

Solution

There are several ways to split a string on multiple delimiters in Python. The simplest approach is the built-in split() method; however, it accepts only a single fixed separator, so it is meant for simple cases.

re.split() is more flexible than the plain split() method because the separator is a regular expression rather than a fixed string.
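The difference can be seen in a small side-by-side sketch. The sample string and the variable names here are made up for illustration and are not part of the original example.

```python
import re

# Hypothetical sample string with three different separators.
text = 'alpha,beta;gamma delta'

# str.split() takes one fixed separator, so only the comma is honoured.
by_comma = text.split(',')

# re.split() takes a pattern; the character class matches a comma,
# a semicolon, or any single whitespace character.
by_any = re.split(r'[,;\s]', text)

print(by_comma)  # ['alpha', 'beta;gamma delta']
print(by_any)    # ['alpha', 'beta', 'gamma', 'delta']
```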

With re.split() you can specify multiple patterns for the separator. In the examples in this article, the separator is either a hyphen (-), a whitespace character, or a comma (,), followed by any amount of extra whitespace. The full regular expression syntax is described in the Python documentation for the re module.

Whenever that pattern is found, the entire match becomes the delimiter between the fields that are on either side of the match.
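To see what "the entire match" means in practice, one option is to list the matches themselves with re.findall(). This is only an illustrative sketch; it uses the same sample string and pattern as the examples in this article, and the variable names are assumptions.

```python
import re

tennis_greats = 'Roger-federer, Rafael nadal, Novak Djokovic,Andy murray'
pattern = r'[-,\s]\s*'

# The pattern has no capture groups, so findall() returns each full match.
# Note that ', ' (comma plus trailing space) is consumed as one delimiter.
delimiters = re.findall(pattern, tennis_greats)

print(delimiters)  # ['-', ', ', ' ', ', ', ' ', ',', ' ']
```

Everything listed here is removed by re.split(), and the text left between consecutive matches becomes the fields.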

Extract only the text between the delimiters (no delimiters).

Example

import re
tennis_greats = 'Roger-federer, Rafael nadal, Novak Djokovic,Andy murray'
# -----------------------------------------------------------------------------
# Scenario 1 - Output the players
# Input - String with multiple delimiters (hyphen, comma, whitespace)
# Code - Specify the delimiters in a character class []; the trailing \s*
#        absorbs any extra whitespace that follows a delimiter
# -----------------------------------------------------------------------------
players = re.split(r'[-,\s]\s*', tennis_greats)

print(f"The output is - {players}")

Output

The output is - ['Roger', 'federer', 'Rafael', 'nadal', 'Novak', 'Djokovic', 'Andy', 'murray']
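One thing to watch for with this pattern: a delimiter at the very start or end of the string, or two delimiters in a row, produces empty strings in the result. The sketch below uses a made-up input (not from the original example) and shows one possible way to drop the empty fields.

```python
import re

# Hypothetical input with a leading delimiter, a doubled comma,
# and a trailing comma.
messy = ', Roger-federer,,Rafael nadal,'

raw = re.split(r'[-,\s]\s*', messy)
print(raw)      # ['', 'Roger', 'federer', '', 'Rafael', 'nadal', '']

# Keep only the non-empty fields.
cleaned = [field for field in raw if field]
print(cleaned)  # ['Roger', 'federer', 'Rafael', 'nadal']
```

Whether empty fields should be dropped depends on the data; in a CSV-like record an empty field may be meaningful.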

Extract the text between the delimiters along with the delimiters

Example

import re
tennis_greats = 'Roger-federer, Rafael nadal, Novak Djokovic,Andy murray'
# -----------------------------------------------------------------------------
# Scenario 2 - Output the players and the delimiters
# Input - String with multiple delimiters (hyphen, comma, whitespace)
# Code - Separate the delimiters with a pipe (|) inside a capture group ();
#        text matched by the group is kept in the result
# -----------------------------------------------------------------------------
players = re.split(r'(-|,|\s)\s*', tennis_greats)

print(f"The output is - {players}")

Output

The output is - ['Roger', '-', 'federer', ',', 'Rafael', ' ', 'nadal', ',', 'Novak', ' ', 'Djokovic', ',', 'Andy', ' ', 'murray']
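Keeping the delimiters is useful when the string has to be rebuilt later. The following sketch, with illustrative variable names, separates the values from the captured delimiters and joins them back together. It also shows a non-capturing group, (?:...), for the case where the pipe syntax is wanted for grouping but the delimiters should not appear in the result.

```python
import re

tennis_greats = 'Roger-federer, Rafael nadal, Novak Djokovic,Andy murray'

fields = re.split(r'(-|,|\s)\s*', tennis_greats)

# Values sit at the even indices, captured delimiters at the odd ones.
values = fields[::2]
delimiters = fields[1::2] + ['']  # pad so both lists have equal length

# Rebuild the string. Whitespace matched by the trailing \s* was outside
# the capture group, so it is not restored.
rebuilt = ''.join(v + d for v, d in zip(values, delimiters))
print(rebuilt)  # Roger-federer,Rafael nadal,Novak Djokovic,Andy murray

# A non-capturing group groups the alternatives without keeping them.
names_only = re.split(r'(?:-|,|\s)\s*', tennis_greats)
print(names_only)
```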

Updated on: 17-Nov-2020
