Extracting a substring using Linux bash


Overview

Extracting a substring from a string is a basic and common operation of text processing in Linux.

In this tutorial, we'll look at different ways to extract substrings from strings using the Linux command line.

Extracting an Index-Based Substring

Let's first take a quick look at how to extract an index-based substring using four different methods:

  • Using the cut command

  • Using the awk command

  • Using Bash’s substring expansion

  • Using the expr command

Next, we’ll see them in action.

Using The cut Command

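We can extract the characters from position N through position M of each input line using the cut command: cut -c N-M.

Throughout this section, our goal is to extract the substring "Linux" from the input '0123Linux9'. It starts at zero-based index 4 and is 5 characters long. The cut command counts characters from 1, so we must add 1 to the zero-based start index. The substring therefore covers positions 5 through 9, and the range to pass is 5-9.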

Now, we’ll see if the cut command solves the problem.

$ cut -c 5-9 <<< '0123Linux9'
Linux

We've got the expected substring, "Linux", so the problem is solved.

We passed the input string to the cut command via a here-string, and cut printed the result to standard output.
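The range given to cut -c can also be left open at either end. The following is a small sketch of that behavior; the variable names (middle, tail_part, head_part) are chosen here only for illustration.

```shell
STR='0123Linux9'

# Characters 5 through 9 (1-based, inclusive): the same range as above.
middle=$(printf '%s\n' "$STR" | cut -c 5-9)

# No end position: from character 5 to the end of the line.
tail_part=$(printf '%s\n' "$STR" | cut -c 5-)

# No start position: from the first character through character 4.
head_part=$(printf '%s\n' "$STR" | cut -c -4)

echo "$middle"     # Linux
echo "$tail_part"  # Linux9
echo "$head_part"  # 0123
```

An open-ended range is convenient when we know where the substring starts but not how long the input is.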

Using The awk Command

When we need to solve text processing problems in Linux, awk is one of the most versatile tools to reach for. It provides a built-in substr(s, i, n) function for extracting substrings.

The substr() function takes three arguments. Let's examine each of them:

  • s − The input string

  • i − The start index of the substring (awk uses the 1-based index system)

  • n − The length of the substring. If it’s omitted, awk will return from index i until the last character in the input string as the substring

Let's now see whether awk's substr() function gives us the desired output.

$ awk '{print substr($0, 5, 5)}' <<< '0123Linux9'
Linux

Since awk uses 1-based indexing, we add 1 to the zero-based start index 4 and pass 5 as the start position. The length, 5, stays the same.
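As a minimal sketch of the index conversion and of the optional length argument, the commands below pass a zero-based start index to awk through -v. The names start, len, i, and n are assumptions made for this example.

```shell
STR='0123Linux9'
start=4   # zero-based start index (illustrative variable name)
len=5     # substring length (illustrative variable name)

# substr() is 1-based, so add 1 to the zero-based start index.
sub_with_len=$(printf '%s\n' "$STR" |
  awk -v i="$((start + 1))" -v n="$len" '{ print substr($0, i, n) }')

# With the length omitted, substr() returns everything from position i onward.
sub_to_end=$(printf '%s\n' "$STR" |
  awk -v i="$((start + 1))" '{ print substr($0, i) }')

echo "$sub_with_len"  # Linux
echo "$sub_to_end"    # Linux9
```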

Using Bash’s Substring Expansion

We've seen how cut and awk can easily extract index-based substrings.

Both are external commands, however. If we're working in Bash, we can use its built-in substring expansion instead: ${VAR:offset:length}. Unlike cut and awk, the offset here is zero-based.

Today, Bash is the default shell on most modern Linux distributions. In other words, if we're working on the command line, we usually don't need to install or call anything else.

$ STR="0123Linux9"
$ echo "${STR:4:5}"
Linux
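Substring expansion has a few more forms that may be useful. The sketch below requires Bash (the negative-length form needs Bash 4.2 or later); the variable names are chosen only for this example.

```shell
STR="0123Linux9"

# Offset only: everything from zero-based index 4 to the end.
from_offset=${STR:4}

# Negative offset counts from the end; the space before "-" is required,
# otherwise Bash parses ":-" as the "use default value" expansion.
from_end=${STR: -6:5}

# Negative length (Bash 4.2+): stop 1 character before the end of the string.
drop_last=${STR:4:-1}

echo "$from_offset"  # Linux9
echo "$from_end"     # Linux
echo "$drop_last"    # Linux
```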

Using The expr Command

The expr command is part of the GNU Core Utilities (coreutils) package, which means it's available on practically every Linux system.

Further, expr has a substr subcommand that lets us extract a substring from a string:

expr substr <input_string> <start_index> <length>

Note that expr substr uses 1-based indexing.

Let's solve our problem with expr substr:

$ expr substr "0123Linux9" 5 5
Linux

The output above indicates that the expr solution worked.
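If we'd rather keep thinking in zero-based indexes, a small wrapper can do the conversion for us. This is only a sketch: substr0 is a hypothetical helper name, and it assumes an expr implementation that supports substr, as GNU expr does.

```shell
# substr0 STRING START LENGTH
# Hypothetical helper: START is zero-based, as in Bash's substring expansion.
substr0() {
  # expr substr is 1-based, so shift the start index by one.
  expr substr "$1" "$(($2 + 1))" "$3"
}

result=$(substr0 "0123Linux9" 4 5)
echo "$result"  # Linux
```

One caveat of this approach: expr may treat a string that looks like one of its own operators or keywords specially, so it's best suited to plain input.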

Extracting a Pattern-Based Substring

Now that we've learned how to extract index-based substrings, let's explore pattern-based substrings.

We'll discuss two ways to solve this kind of problem:

  • Using the cut command

  • Using the awk command

After that, we'll look at a different kind of pattern-based substring problem.

Using The cut Command

The cut command is also a handy tool for working with field-based data.

Let's take a quick look at the problem. We have an input string of comma-separated values, and we want to get the third field from it.

With cut, we can set the comma as the field delimiter (-d ,) and then select the third field (-f 3).

$ cut -d , -f 3 <<< "Eric,Male,28,USA"
28

We achieved our desired results and fixed the issue.
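The -f option accepts lists and ranges as well as a single field number. A brief sketch, with variable names chosen for illustration:

```shell
LINE="Eric,Male,28,USA"

# A comma-separated list selects several fields; cut keeps the input delimiter.
first_and_third=$(printf '%s\n' "$LINE" | cut -d , -f 1,3)

# An open-ended range selects from field 2 to the last field.
from_second=$(printf '%s\n' "$LINE" | cut -d , -f 2-)

echo "$first_and_third"  # Eric,28
echo "$from_second"      # Male,28,USA
```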

Using the awk Command

awk is also good at handling field-based input. A compact awk one-liner can solve this problem:

$ awk -F',' '{print $3}' <<< "Eric,Male,28,USA"
28

Furthermore, since awk's field separator (FS) supports regular expressions, we can build more generic solutions using awk.

For example, if the fields are separated by a comma followed by a space (", "), cut isn't a good choice, because it only supports a single character as the field delimiter.

awk, however, handles this input easily.

$ awk -F', ' '{print $3}' <<< "Eric, Male, 28, USA"
28

A single awk command can even handle both cases if we make the space optional in the separator regex. This can be a handy trick in the real world.

$ awk -F', ?' '{print $3}' <<< "Eric, Male, 28, USA"
28
$ awk -F', ?' '{print $3}' <<< "Eric,Male,28,USA"
28
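To make the comparison concrete, the sketch below wraps the regex-separator command in a function and contrasts it with cut on the same input. nth_field is a hypothetical helper name introduced for this example.

```shell
# nth_field LINE N
# Hypothetical helper: prints field N of a comma-separated line,
# tolerating an optional single space after each comma.
nth_field() {
  printf '%s\n' "$1" | awk -F', ?' -v n="$2" '{ print $n }'
}

with_space=$(nth_field "Eric, Male, 28, USA" 3)
without_space=$(nth_field "Eric,Male,28,USA" 3)

# For comparison: cut with a single-character delimiter keeps the leading space.
cut_result=$(printf '%s\n' "Eric, Male, 28, USA" | cut -d , -f 3)

echo "[$with_space]"     # [28]
echo "[$without_space]"  # [28]
echo "[$cut_result]"     # [ 28]
```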

A Different Pattern-Based Substring Case

We've solved the problem of extracting a field from comma-separated data. Now let's look at another one.

The pattern-based substring isn't always in a delimited format such as CSV. For example, we may need the text located between two markers, such as "BEGIN:" and "END:" in the input below.

The cut command can't solve this problem, because the markers are longer than one character. awk, however, is an excellent tool for this kind of challenge.

Let's now look at how to solve this problem using awk. We store the input string in a variable called STR so that our commands are easier to read.

$ STR="whatever dataBEGIN:Interesting dataEND:something else"
$ awk -F'BEGIN:|END:' '{print $2}' <<< "$STR"
Interesting data
$ awk '{ sub(/.*BEGIN:/, ""); sub(/END:.*/, ""); print }' <<< "$STR"
Interesting data

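The first awk command uses the regular expression BEGIN:|END: as the field separator, so the text between the two markers becomes the second field.

The second command calls sub() twice: the first call removes everything up to and including "BEGIN:", and the second removes everything from "END:" to the end of the line. After these two substitutions, only the desired substring is left, and print outputs it.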
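Both commands above treat the markers as regular expressions. If the markers might contain regex metacharacters, one possible variation is to locate them as literal strings with awk's index() and substr() functions. The sketch below is one way to do that, not the only one; between is a hypothetical helper name, and it prints nothing when either marker is missing.

```shell
# between STRING START_MARKER END_MARKER
# Hypothetical helper: prints the text between two literal markers.
between() {
  printf '%s\n' "$1" | awk -v b="$2" -v e="$3" '{
    s = index($0, b)                   # 1-based position of the start marker
    if (s == 0) next                   # start marker not found: print nothing
    rest = substr($0, s + length(b))   # everything after the start marker
    t = index(rest, e)                 # position of the end marker in the rest
    if (t == 0) next                   # end marker not found: print nothing
    print substr(rest, 1, t - 1)       # text before the end marker
  }'
}

STR="whatever dataBEGIN:Interesting dataEND:something else"
extracted=$(between "$STR" "BEGIN:" "END:")
echo "$extracted"  # Interesting data
```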

Conclusion

Extracting substrings is a common text processing task in Linux. Depending on the requirement, a substring can be identified either by its index and length or by a pattern.

Through examples, we've looked at how to extract both kinds of substrings.

Updated on: 03-Jan-2023
