How to read the contents of a web page into a String in Java?


You can read the contents of a web page in several ways in Java. Here, we discuss three of them: the openStream() method of the URL class, the Apache HttpClient library, and the Jsoup library.

Using the openStream() method

The URL class of the java.net package represents a Uniform Resource Locator, which points to a resource (a file, a directory, or a reference) on the World Wide Web.
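For instance, a URL object can be created and inspected without opening any network connection; a minimal sketch (the address is just a placeholder):

```java
import java.net.URL;

public class UrlParts {
   public static void main(String[] args) throws Exception {
      //Parsing the URL does not open a network connection
      URL url = new URL("http://www.something.com/index.html");
      System.out.println(url.getProtocol()); //http
      System.out.println(url.getHost());     //www.something.com
      System.out.println(url.getPath());     ///index.html
   }
}
```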

The openStream() method of this class opens a connection to the URL represented by the current object and returns an InputStream from which you can read the data at that URL.

Therefore, to read data from a web page (using the URL class) −

  • Instantiate the java.net.URL class by passing the URL of the desired web page as a parameter to its constructor.

  • Invoke the openStream() method and retrieve the InputStream object.

  • Instantiate the Scanner class by passing the retrieved InputStream object as a parameter to its constructor.

Example

import java.io.IOException;
import java.net.URL;
import java.util.Scanner;
public class ReadingWebPage {
   public static void main(String args[]) throws IOException {
      //Instantiating the URL class
      URL url = new URL("http://www.something.com/");
      //Retrieving the contents of the specified page
      Scanner sc = new Scanner(url.openStream());
      //Instantiating the StringBuffer class to hold the result
      StringBuffer sb = new StringBuffer();
      while(sc.hasNext()) {
         //Note: next() returns whitespace-separated tokens, so the spaces between them are lost
         sb.append(sc.next());
      }
      sc.close();
      //Retrieving the String from the String Buffer object
      String result = sb.toString();
      System.out.println(result);
      //Removing the HTML tags
      result = result.replaceAll("<[^>]*>", "");
      System.out.println("Contents of the web page: "+result);
   }
}

Output

<html><body><h1>Itworks!</h1></body></html>
Contents of the web page: Itworks!
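Because Scanner's next() splits on whitespace, the spaces inside the page are dropped (hence "Itworks!" above). To read the raw content with whitespace preserved, you can set the scanner's delimiter to "\\A" (start of input), so a single next() call returns the entire stream. A minimal sketch, using an in-memory stream in place of url.openStream():

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Scanner;

public class WholeStreamRead {
   //Reads the entire stream into one String, preserving whitespace
   static String readAll(InputStream in) {
      Scanner sc = new Scanner(in, "UTF-8").useDelimiter("\\A");
      String result = sc.hasNext() ? sc.next() : "";
      sc.close();
      return result;
   }
   public static void main(String[] args) {
      //In-memory stream standing in for url.openStream()
      InputStream in = new ByteArrayInputStream(
         "<html><body><h1>It works!</h1></body></html>".getBytes());
      String result = readAll(in);
      //Removing the HTML tags as in the example above
      System.out.println(result.replaceAll("<[^>]*>", "")); //It works!
   }
}
```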

Using HttpClient

Apache HttpClient is an HTTP transfer library; it resides on the client side and sends and receives HTTP messages. It provides an up-to-date, feature-rich, and efficient implementation that meets recent HTTP standards.

The GET request (of the HTTP protocol) is used to retrieve information from a given server using a given URI. Requests using GET should only retrieve data and have no other effect on it.

The HttpClient API provides a class named HttpGet which represents the GET request. To execute the GET request and retrieve the contents of a web page −

  • The createDefault() method of the HttpClients class returns a CloseableHttpClient object, which is the base implementation of the HttpClient interface. Using this method, create an HttpClient object.

  • Create an HTTP GET request by instantiating the HttpGet class. The constructor of this class accepts a String value representing the URI of the web page to which you need to send the request.

  • Execute the HttpGet request by invoking the execute() method.

  • Retrieve an InputStream object representing the content of the web page from the response as −

httpresponse.getEntity().getContent()

Example

import java.util.Scanner;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
public class HttpClientExample {
   public static void main(String args[]) throws Exception{
      //Creating a HttpClient object
      CloseableHttpClient httpclient = HttpClients.createDefault();
      //Creating a HttpGet object
      HttpGet httpget = new HttpGet("http://www.something.com/");
      //Executing the Get request
      HttpResponse httpresponse = httpclient.execute(httpget);
      Scanner sc = new Scanner(httpresponse.getEntity().getContent());
      //Instantiating the StringBuffer class to hold the result
      StringBuffer sb = new StringBuffer();
      while(sc.hasNext()) {
         //Note: next() returns whitespace-separated tokens, so the spaces between them are lost
         sb.append(sc.next());
      }
      sc.close();
      httpclient.close();
      //Retrieving the String from the String Buffer object
      String result = sb.toString();
      System.out.println(result);
      //Removing the HTML tags
      result = result.replaceAll("<[^>]*>", "");
      System.out.println("Contents of the web page: "+result);
   }
}

Output

<html><body><h1>Itworks!</h1></body></html>
Contents of the web page: Itworks!

Using the Jsoup library

Jsoup is a Java-based library for working with HTML content. It provides a very convenient API to extract and manipulate data, using the best of DOM, CSS, and jQuery-like methods. It implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do.

To retrieve the contents of a web page using the Jsoup library −

  • The connect() method of the Jsoup class accepts the URL of a web page, connects to it, and returns a Connection object. Connect to the desired web page using the connect() method.

  • The get() method of the Connection interface sends the GET request and returns the HTML document as an object of the Document class. Send a GET request to the page by invoking the get() method.

  • Retrieve the contents of the obtained document into a String as −

String result = doc.body().text();

Example

import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class JsoupExample {
   public static void main(String args[]) throws IOException {
      String page = "http://www.something.com/";
      //Connecting to the web page
      Connection conn = Jsoup.connect(page);
      //executing the get request
      Document doc = conn.get();
      //Retrieving the contents (body) of the web page
      String result = doc.body().text();
      System.out.println(result);
   }
}

Output

It works!
raja
Published on 10-Oct-2019 11:03:43