How to read the contents of a web page into a String in Java?


You can read the contents of a web page in several ways in Java. Here, we discuss three of them: the openStream() method of the URL class, the Apache HttpClient library, and the Jsoup library.

Using the openStream() method

The URL class of the java.net package represents a Uniform Resource Locator, which points to a resource (a file, a directory, or a reference) on the World Wide Web.
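For instance, a URL object can be created and inspected without opening any network connection; a minimal sketch (the address is just a placeholder):

```java
import java.net.URL;

public class UrlParts {
   public static void main(String[] args) throws Exception {
      //Parsing the URL does not open a network connection
      URL url = new URL("http://www.something.com/index.html");
      System.out.println(url.getProtocol()); //http
      System.out.println(url.getHost());     //www.something.com
      System.out.println(url.getPath());     ///index.html
   }
}
```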

The openStream() method of this class opens a connection to the URL represented by the current object and returns an InputStream from which you can read the data at that URL.

Therefore, to read data from a web page (using the URL class) −

  • Instantiate the java.net.URL class by passing the URL of the desired web page as a parameter to its constructor.

  • Invoke the openStream() method and retrieve the InputStream object.

  • Instantiate the Scanner class by passing the retrieved InputStream object as a parameter to its constructor.

Example

import java.io.IOException;
import java.net.URL;
import java.util.Scanner;
public class ReadingWebPage {
   public static void main(String args[]) throws IOException {
      //Instantiating the URL class
      URL url = new URL("http://www.something.com/");
      //Retrieving the contents of the specified page
      Scanner sc = new Scanner(url.openStream());
      //Instantiating the StringBuffer class to hold the result
      StringBuffer sb = new StringBuffer();
      while(sc.hasNext()) {
         //Note: next() returns whitespace-separated tokens, so the spaces between them are lost
         sb.append(sc.next());
      }
      sc.close();
      //Retrieving the String from the String Buffer object
      String result = sb.toString();
      System.out.println(result);
      //Removing the HTML tags
      result = result.replaceAll("<[^>]*>", "");
      System.out.println("Contents of the web page: "+result);
   }
}

Output

<html><body><h1>Itworks!</h1></body></html>
Contents of the web page: Itworks!
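Because Scanner's next() splits on whitespace, the spaces inside the page are dropped (hence "Itworks!" above). To read the raw content with whitespace preserved, you can set the scanner's delimiter to "\\A" (start of input), so a single next() call returns the entire stream. A minimal sketch, using an in-memory stream in place of url.openStream():

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Scanner;

public class WholeStreamRead {
   //Reads the entire stream into one String, preserving whitespace
   static String readAll(InputStream in) {
      Scanner sc = new Scanner(in, "UTF-8").useDelimiter("\\A");
      String result = sc.hasNext() ? sc.next() : "";
      sc.close();
      return result;
   }
   public static void main(String[] args) {
      //In-memory stream standing in for url.openStream()
      InputStream in = new ByteArrayInputStream(
         "<html><body><h1>It works!</h1></body></html>".getBytes());
      String result = readAll(in);
      //Removing the HTML tags as in the example above
      System.out.println(result.replaceAll("<[^>]*>", "")); //It works!
   }
}
```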

Using HttpClient

Apache HttpClient is an HTTP transfer library; it resides on the client side and sends and receives HTTP messages. It provides an up-to-date, feature-rich, and efficient implementation that meets recent HTTP standards.

The GET request (of the HTTP protocol) is used to retrieve information from a given server using a given URI. Requests using GET should only retrieve data and have no other effect on it.

The HttpClient API provides a class named HttpGet which represents the GET request. To execute the GET request and retrieve the contents of a web page −

  • The createDefault() method of the HttpClients class returns a CloseableHttpClient object, which is the base implementation of the HttpClient interface. Using this method, create an HttpClient object.

  • Create an HTTP GET request by instantiating the HttpGet class. The constructor of this class accepts a String value representing the URI of the web page to which you need to send the request.

  • Execute the HttpGet request by invoking the execute() method.

  • Retrieve an InputStream object representing the content of the web page from the response as −

httpresponse.getEntity().getContent()

Example

import java.util.Scanner;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
public class HttpClientExample {
   public static void main(String args[]) throws Exception{
      //Creating a HttpClient object
      CloseableHttpClient httpclient = HttpClients.createDefault();
      //Creating a HttpGet object
      HttpGet httpget = new HttpGet("http://www.something.com/");
      //Executing the Get request
      HttpResponse httpresponse = httpclient.execute(httpget);
      Scanner sc = new Scanner(httpresponse.getEntity().getContent());
      //Instantiating the StringBuffer class to hold the result
      StringBuffer sb = new StringBuffer();
      while(sc.hasNext()) {
         //Note: next() returns whitespace-separated tokens, so the spaces between them are lost
         sb.append(sc.next());
      }
      sc.close();
      httpclient.close();
      //Retrieving the String from the String Buffer object
      String result = sb.toString();
      System.out.println(result);
      //Removing the HTML tags
      result = result.replaceAll("<[^>]*>", "");
      System.out.println("Contents of the web page: "+result);
   }
}

Output

<html><body><h1>Itworks!</h1></body></html>
Contents of the web page: Itworks!

Using the Jsoup library

Jsoup is a Java-based library for working with HTML content. It provides a very convenient API to extract and manipulate data, using the best of DOM, CSS, and jQuery-like methods. It implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do.

To retrieve the contents of a web page using the Jsoup library −

  • The connect() method of the Jsoup class accepts the URL of a web page, connects to it, and returns a Connection object. Connect to the desired web page using the connect() method.

  • The get() method of the Connection interface sends the GET request and returns the HTML document as an object of the Document class. Send a GET request to the page by invoking the get() method.

  • Retrieve the contents of the obtained document into a String as −

String result = doc.body().text();

Example

import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class JsoupExample {
   public static void main(String args[]) throws IOException {
      String page = "http://www.something.com/";
      //Connecting to the web page
      Connection conn = Jsoup.connect(page);
      //executing the get request
      Document doc = conn.get();
      //Retrieving the contents (body) of the web page
      String result = doc.body().text();
      System.out.println(result);
   }
}

Output

It works!
raja
Published on 10-Oct-2019 11:03:43