Get the Contents of a Web Page in a Shell Variable on Linux


Introduction

One of the most useful features of the Linux command line is the ability to manipulate text. This is especially handy when working with web pages, because a page's content can be downloaded as plain text and then processed with command-line tools. In this article, we will explore how to store the content of a web page in a shell variable on Linux.

What is a Shell variable?

A shell variable is a named value that the shell (the command-line interpreter) keeps in memory. Shell variables are defined in the form NAME=value, with no spaces around the equals sign, where “NAME” is the name of the variable and “value” is the value stored in it. Other programs started from the shell can see a variable only after it has been exported with the export command.

Shell variables can be used to store a wide variety of information, including the output of command-line tools, the contents of text files, and even the contents of web pages.
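
As a minimal sketch of these ideas, the snippet below assigns a literal value, captures the output of a command, and reads a text file into a variable. The variable names and the temporary file are illustrative assumptions, not part of any standard setup.

```shell
# Assign a literal value. There must be no spaces around "=".
greeting="Hello, world"

# Store the output of a command using command substitution.
today=$(date +%Y)

# Store the contents of a text file (a temporary file created just for this demo).
tmpfile=$(mktemp)
printf 'line one\nline two\n' > "$tmpfile"
file_text=$(cat "$tmpfile")
rm -f "$tmpfile"

# Quote variables when expanding them so spaces and newlines are preserved.
echo "$greeting"
echo "$today"
echo "$file_text"
```

Command substitution removes trailing newlines, so file_text holds the two lines without a final line break.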

Using curl to get the content of a web page

One of the easiest ways to put the content of a web page into a shell variable is to use the “curl” command. Curl is a command line tool used to transfer data to or from a server. It supports a wide range of protocols, including HTTP, HTTPS, FTP and much more.

To get the content of a web page into a shell variable using curl, we can use the following command −

$ webcontent=$(curl -s https://www.example.com)

This command stores the content of the web page at https://www.example.com in the shell variable "webcontent". The "-s" flag runs curl in silent mode, which hides the progress meter and error messages. The page content itself is still written to standard output, where the $(...) command substitution captures it.
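
In a script it is usually worth checking whether the download succeeded before using the variable. The following is a minimal sketch under that assumption; fetch_page is a hypothetical helper name, and the 10-second limit is an arbitrary choice.

```shell
# fetch_page is a hypothetical helper name used only for this sketch.
#   -f  fail on HTTP errors (4xx/5xx) instead of returning the error page
#   -s  hide the progress meter
#   -S  still show error messages even though -s is set
#   -L  follow redirects
#   --max-time 10  give up after 10 seconds (arbitrary limit)
fetch_page() {
    curl -fsSL --max-time 10 "$1"
}

# Capture the page and test curl's exit status in one step.
if webcontent=$(fetch_page "https://www.example.com"); then
    printf 'Fetched %d characters\n' "${#webcontent}"
else
    echo "Download failed" >&2
fi
```

Because the assignment is the condition of the if statement, the else branch runs whenever curl exits with a non-zero status, for example on a network error or an HTTP 404 response.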

Using Grep to extract specific lines from a web page

Once we have the web page content in a shell variable, we can use command line tools like grep to extract specific lines of text from the web page. Grep is a powerful command line tool used to search for patterns in text.

For example, suppose we want to extract all links from the web page. We can use the following command to do this −

$ links=$(echo "$webcontent" | grep -o 'href="[^"]*"')

This command uses grep to find every occurrence of the pattern 'href="[^"]*"' in the web page content, which matches the href attribute of each link that uses double quotes. The "-o" flag tells grep to print only the matching part of each line rather than the whole line. The output is a list of href attributes, one per line, each in the form href="URL". Note that every entry still includes the href=" prefix and the closing quote.
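
To keep only the URLs, the matched attributes can be passed through sed to strip the surrounding text. This is a minimal sketch; the sample HTML below stands in for real page content and is an assumption made for illustration.

```shell
# Sample HTML standing in for real page content (an assumption for this sketch).
webcontent='<p><a href="https://www.example.com/about">About</a>
<a href="/contact">Contact</a></p>'

# Match each href attribute, then strip the leading href=" and the closing quote.
urls=$(printf '%s\n' "$webcontent" \
    | grep -o 'href="[^"]*"' \
    | sed 's/^href="//; s/"$//')

printf '%s\n' "$urls"
# Output:
# https://www.example.com/about
# /contact
```

This approach only sees double-quoted href attributes. Pages that use single quotes or unquoted values need a different pattern or a real HTML parser.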

Using Awk to extract specific fields from a web page

Another useful command-line tool for extracting specific information from text is “awk”. Awk is a programming language designed for text processing and is often used to extract specific fields from text files.

For example, suppose we want to extract the title of the web page. The web page title is usually stored in the "title" element of the HTML source code, which looks like this −

<title>Example Web Page</title>

To extract the web page title using awk, we can use the following command −

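$ title=$(echo "$webcontent" | awk '/<title>/ {print $0}' | sed 's/<[^>]*>//g')

This command uses awk to print every line that matches the pattern <title>, then pipes the result to sed, which removes anything that looks like an HTML tag and leaves only the title text. It assumes that the opening and closing title tags are on the same line.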
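
The following is a minimal sketch of title extraction with awk and sed, assuming the awk pattern is /<title>/ and the whole title element sits on one line. The sample HTML is made up for illustration, and the extra sed expressions that trim whitespace are an addition.

```shell
# Sample HTML standing in for real page content (an assumption for this sketch).
webcontent='<html>
<head>
  <title>Example Web Page</title>
</head>
<body><p>Hello</p></body>
</html>'

# awk keeps only lines containing <title>; sed removes the tags,
# then trims leading and trailing whitespace left over from indentation.
title=$(printf '%s\n' "$webcontent" \
    | awk '/<title>/ {print $0}' \
    | sed 's/<[^>]*>//g; s/^[[:space:]]*//; s/[[:space:]]*$//')

echo "$title"
# Output: Example Web Page
```

The match is case-sensitive and line-based, so a page that writes <TITLE> or splits the title across several lines would need a more tolerant pattern.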

Using Cut to extract specific fields from the web page

Another useful command-line tool for extracting specific fields from text is “cut”. Cut is a command-line tool used to extract specific fields from a file or command output.

For example, suppose we want to extract the first and last names from a list of names with one name per line, in the format "first last". Cut processes its input line by line, so each name must be on its own line. We can use the following commands to do this −

$ names=$(printf 'John Smith\nJane Doe\n')
$ first_names=$(echo "$names" | cut -d' ' -f1)
$ last_names=$(echo "$names" | cut -d' ' -f2)
$ echo "$first_names"
# Output:
# John
# Jane
$ echo "$last_names"
# Output:
# Smith
# Doe

These commands use the "-d" flag to specify the delimiter (in this case a space) and the "-f" flag to specify the number of the field we want to extract. The output is a list of first names and a list of last names, one name per line.
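
Cut can be applied to web data in the same way. As a minimal sketch, the snippet below pulls the host name out of a list of absolute URLs by splitting on the slash character; the links value is sample data assumed for illustration.

```shell
# Sample list of absolute URLs, one per line (an assumption for this sketch).
links='https://www.example.com/about
https://blog.example.org/posts/1'

# Splitting "https://host/path" on "/" gives: "https:", "", "host", "path", ...
# so the host name is field 3.
hosts=$(printf '%s\n' "$links" | cut -d/ -f3)

printf '%s\n' "$hosts"
# Output:
# www.example.com
# blog.example.org
```

This works only for absolute URLs. A relative link such as /contact has no third field, so cut prints an empty line for it.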

Conclusion

In this article, we've explored how to get the content of a web page into a shell variable on Linux and how to use command-line tools like curl, grep, awk, and cut to extract specific information from it. These tools can save you a lot of time and effort when working with web pages on the command line.

Updated on: 25-Jan-2023
