jsoup - Quick Guide



jsoup - Overview

Introduction

jsoup is a Java-based library for working with HTML content. It provides a convenient API to extract and manipulate data, using the best of DOM, CSS, and jQuery-like methods. It implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.

Functionalities of jsoup

The jsoup library provides the following functionality.

  • Multiple Read Support − It reads and parses HTML from a URL, file, or string.

  • CSS Selectors − It can find and extract data, using DOM traversal or CSS selectors.

  • DOM Manipulation − It can manipulate the HTML elements, attributes, and text.

  • Prevent XSS attacks − It can clean user-submitted content against a given safelist (called a whitelist in older jsoup releases), to prevent XSS attacks.

  • Tidy − It outputs tidy HTML.

  • Handles invalid data − It can handle unclosed tags and implicit tags, and reliably creates the document structure.
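
As a rough illustration of the XSS clean-up item above, the following sketch passes untrusted markup through Jsoup.clean() with the built-in Safelist.basic() rules. The class name CleanSketch and the sample input are invented for this illustration; Safelist is the class name used by recent jsoup releases.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class CleanSketch {
   // Returns the markup with everything outside the "basic" safelist removed.
   static String sanitize(String untrustedHtml) {
      return Jsoup.clean(untrustedHtml, Safelist.basic());
   }

   public static void main(String[] args) {
      // Hypothetical user input containing a script element.
      String dirty = "<p>Hello <b>there</b><script>alert('xss')</script></p>";
      System.out.println(sanitize(dirty));
   }
}
```

The paragraph and bold tags are allowed by the basic safelist, so they should survive, while the script element and its contents are dropped.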

jsoup - Environment Setup

This chapter explains how to prepare a development environment for working with jsoup. It first covers setting up the JDK on your machine, and then setting up jsoup itself −

Setup Java Development Kit (JDK)

You can download the latest version of the JDK from Oracle's Java site − Java SE Downloads. Installation instructions are included with the download; follow them to install and configure the JDK. Finally, set the PATH and JAVA_HOME environment variables to refer to the directory that contains java and javac, typically java_install_dir/bin and java_install_dir respectively.

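If you are running Windows and have installed the JDK in C:\jdk-24, you can set both variables for the current Command Prompt session with the following commands. Note that PATH must point to the bin folder, which contains java and javac.

set PATH=C:\jdk-24\bin;%PATH%
set JAVA_HOME=C:\jdk-24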

Alternatively, to set the variables permanently, right-click on This PC (My Computer on older versions of Windows), select Properties → Advanced system settings → Environment Variables. Then update the PATH value and click the OK button.

On Unix (Solaris, Linux, etc.), if the JDK is installed in /usr/local/jdk-24 and you use the C shell, put the following into your .cshrc file.

setenv PATH /usr/local/jdk-24/bin:$PATH 
setenv JAVA_HOME /usr/local/jdk-24

Alternatively, if you use an Integrated Development Environment (IDE) like Borland JBuilder, Eclipse, IntelliJ IDEA, or Sun ONE Studio, compile and run a simple program to confirm that the IDE knows where Java is installed. If it does not, follow the setup steps given in the IDE's documentation.
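
A simple program for that check could look like the sketch below. It uses only the standard library; the class name JavaCheck is an arbitrary choice for this illustration.

```java
class JavaCheck {
   // Describes the Java runtime that is executing this program.
   static String describeRuntime() {
      return "Java version: " + System.getProperty("java.version")
         + ", java.home: " + System.getProperty("java.home");
   }

   public static void main(String[] args) {
      System.out.println(describeRuntime());
      // JAVA_HOME is read from the environment and prints null when it is not set.
      System.out.println("JAVA_HOME: " + System.getenv("JAVA_HOME"));
   }
}
```

If the reported version or location is not the JDK you installed, the PATH or the IDE settings probably still point to a different Java installation.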

Popular Java Editors

To write your Java programs, you need a text editor. There are many sophisticated IDEs available, but for now you can consider one of the following −

  • Notepad − On a Windows machine, you can use any simple text editor such as Notepad (recommended for this tutorial) or TextPad.

  • NetBeans − A free, open-source Java IDE, which can be downloaded from www.netbeans.org/index.html.

  • Eclipse − A Java IDE developed by the Eclipse open-source community, which can be downloaded from www.eclipse.org.

jsoup Environment

Download the latest version of the jsoup jar file.

At the time of writing this tutorial, we have copied it into the C:\jsoup folder.

OS        Archive name
Windows   jsoup-1.21.2.jar
Linux     jsoup-1.21.2.jar
Mac       jsoup-1.21.2.jar

Set CLASSPATH Variable

Set the CLASSPATH environment variable to point to the jsoup jar location. Assuming you have stored jsoup and related jars in a jsoup folder, set the variable on each operating system as follows.

OS        Setting
Windows   Set the environment variable CLASSPATH to %CLASSPATH%;C:\jsoup\jsoup-1.21.2.jar;.;
Linux     export CLASSPATH=$CLASSPATH:jsoup/jsoup-1.21.2.jar:.
Mac       export CLASSPATH=$CLASSPATH:jsoup/jsoup-1.21.2.jar:.
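
One way to confirm that the CLASSPATH setting took effect is to ask the class loader whether it can find the main jsoup class. The sketch below is only an illustration; ClasspathCheck and isOnClasspath are made-up names, and the program itself needs only the standard library to compile.

```java
class ClasspathCheck {
   // Returns true when the named class can be located, without initializing it.
   static boolean isOnClasspath(String className) {
      try {
         Class.forName(className, false, ClasspathCheck.class.getClassLoader());
         return true;
      } catch (ClassNotFoundException e) {
         return false;
      }
   }

   public static void main(String[] args) {
      // org.jsoup.Jsoup is the entry-point class used throughout this guide.
      System.out.println("jsoup found: " + isOnClasspath("org.jsoup.Jsoup"));
   }
}
```

A result of false usually means the jar path in CLASSPATH is wrong or the variable is not visible to the shell that started Java.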

jsoup - Parsing String

Overview

The Jsoup.parse(String) method parses the input HTML into a new Document. This Document object can be used to traverse the HTML DOM and read its details.

The following example shows how to parse an HTML String into a Document object.

Syntax

Document document = Jsoup.parse(html);

Where

  • document − document object represents the HTML DOM.

  • Jsoup − main class to parse the given HTML String.

  • html − HTML String.

Get the tags using Document object

String title = document.title();
Elements paragraphs = document.getElementsByTag("p");

Read tag values

for (Element paragraph : paragraphs) {
   System.out.println(paragraph.text());
}
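
The parser is lenient about incomplete markup. The sketch below, with an invented class name and sample input, parses a string that has unclosed paragraph tags and no html, head, or body tags, and then inspects the resulting Document.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class LenientParseSketch {
   // Parses markup with unclosed <p> tags and no <html>, <head>, or <body>.
   static Document parseIncomplete() {
      return Jsoup.parse("<p>First paragraph<p>Second paragraph");
   }

   public static void main(String[] args) {
      Document document = parseIncomplete();
      Elements paragraphs = document.getElementsByTag("p");
      System.out.println("Paragraphs found: " + paragraphs.size());
      System.out.println("Body present: " + (document.body() != null));
      // title() returns an empty string when the document has no <title>.
      System.out.println("Title: \"" + document.title() + "\"");
   }
}
```

Each unclosed paragraph is closed implicitly when the next one starts, so two separate paragraph elements are expected.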

Example - Parsing an HTML String to get Title of HTML

JsoupTester.java

package com.tutorialspoint;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupTester {
   public static void main(String[] args) {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body><p>Sample Content</p></body></html>";
      Document document = Jsoup.parse(html);
      System.out.println(document.title());
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Sample Title

Example - Parsing an HTML String to get Body of HTML

JsoupTester.java

package com.tutorialspoint;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTester {
   public static void main(String[] args) {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body><p>Sample Content</p></body></html>";
      Document document = Jsoup.parse(html);
      Elements paragraphs = document.getElementsByTag("p");
      for (Element paragraph : paragraphs) {
         System.out.println(paragraph.text());
      }
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Sample Content

jsoup - Parsing HTML Fragment/Body

Overview

The Jsoup.parseBodyFragment(String) method parses an HTML fragment on the assumption that it forms the body of a page, and returns a new Document with the fragment placed inside the body element. This Document object can be used to traverse the fragment and read its details.

The following example shows how to parse an HTML fragment into a Document object.

Syntax

Document document = Jsoup.parseBodyFragment(html);

Where

  • document − document object represents the HTML DOM.

  • Jsoup − main class to parse the given HTML String.

  • html − HTML Fragment String.

Get the body using document object

Element body = document.body();

Here body is the document's body element. The call document.getElementsByTag("body") returns an Elements collection that contains the same element.

Read tag values

Elements paragraphs = body.getElementsByTag("p");

for (Element paragraph : paragraphs) {
   System.out.println(paragraph.text());
}
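
To see what kind of Document a fragment produces, the following sketch counts the children of the head and body elements after parsing. FragmentSketch is a hypothetical class name used only here.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FragmentSketch {
   // Wraps the fragment in an otherwise empty document shell.
   static Document parseFragment(String fragment) {
      return Jsoup.parseBodyFragment(fragment);
   }

   public static void main(String[] args) {
      Document document = parseFragment("<p>Sample Content</p>");
      System.out.println("Head children: " + document.head().children().size());
      System.out.println("Body children: " + document.body().children().size());
      System.out.println("First body child: " + document.body().child(0).tagName());
   }
}
```

The head stays empty and the fragment ends up as the only child of the body, which is why the examples in this chapter read from document.body().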

Example - Parsing an HTML Fragment String to read paragraphs

JsoupTester.java

package com.tutorialspoint;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTester {
   public static void main(String[] args) {
   
      String html = "<div><p>Sample Content</p></div>";
      Document document = Jsoup.parseBodyFragment(html);
      Element body = document.body();
      Elements paragraphs = body.getElementsByTag("p");
      for (Element paragraph : paragraphs) {
         System.out.println(paragraph.text());
      }
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Sample Content

Example - Parsing an HTML Fragment String to read Div tags

JsoupTester.java

package com.tutorialspoint;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTester {
   public static void main(String[] args) {
   
      String html = "<div>Sample Content</div>";
      Document document = Jsoup.parseBodyFragment(html);
      Element body = document.body();
      Elements divs = body.getElementsByTag("div");
      for (Element div : divs) {
         System.out.println(div.text());
      }
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Sample Content

jsoup - Loading URL

Overview

The Jsoup.connect(url) method creates a Connection to the URL, and the Connection's get() method fetches the page with an HTTP GET request and parses the response into a Document.

Syntax

String url = "http://www.google.com";
Document document = Jsoup.connect(url).get();

Where

  • document − document object represents the HTML DOM.

  • Jsoup − main class to connect to the URL and get the HTML content.

  • url − URL of the HTML page to load.

Get the data using document object

Element body = document.body();

Here body is the document's body element. The call document.getElementsByTag("body") returns an Elements collection that contains the same element.

Read tag values

Elements divs = body.getElementsByTag("div");
for (Element div : divs) {
   System.out.println(div.text());
}
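
In practice a request is often configured before it is sent. The sketch below sets a user agent and a timeout on the Connection and handles the IOException that get() can throw. The class name FetchSketch, the user agent string, and the 10 second timeout are arbitrary choices for this illustration.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FetchSketch {
   // Builds a configured Connection; nothing is sent until get() is called.
   static Connection prepare(String url) {
      return Jsoup.connect(url)
         .userAgent("Mozilla/5.0 (jsoup tutorial sketch)")
         .timeout(10_000); // milliseconds
   }

   public static void main(String[] args) {
      try {
         Document document = prepare("https://www.google.com").get();
         System.out.println("Title: " + document.title());
      } catch (IOException e) {
         // Covers network failures, timeouts, and HTTP error responses.
         System.err.println("Could not load page: " + e.getMessage());
      }
   }
}
```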

Example - Connecting and loading HTML title

JsoupTester.java

package com.tutorialspoint;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupTester {
   public static void main(String[] args) throws IOException {
   
      String url = "http://www.google.com";
      Document document = Jsoup.connect(url).get();
      System.out.println("Title: " + document.title());
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Title: Google

Example - Connecting and loading HTML Body

JsoupTester.java

package com.tutorialspoint;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTester {
   public static void main(String[] args) throws IOException {
   
      String url = "http://www.google.com";
      Document document = Jsoup.connect(url).get();
      Element body = document.body();
      Elements divs = body.getElementsByTag("div");
      for (Element div : divs) {
         System.out.println(div.text());
      }
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

AboutStore Gmail Images Sign in AI Mode 
...

jsoup - Loading File

Overview

The Jsoup.parse(File, String) method loads an HTML file from the file system and parses it using the character encoding named by the second argument, for example "UTF-8".

Syntax

Document document = Jsoup.parse(inputFile, "UTF-8");
System.out.println(document.title());

Where

  • document − document object represents the HTML DOM.

  • Jsoup − main class to parse the given HTML file.

  • inputFile − File object representing the file on file system.

Get the data using document object

Element body = document.body();

Here body is the document's body element. The call document.getElementsByTag("body") returns an Elements collection that contains the same element.

Read tag values

Elements paragraphs = body.getElementsByTag("p");
for (Element paragraph : paragraphs) {
   System.out.println(paragraph.text());
}

Following is the HTML file used in this example, saved as test.htm in the working directory −

<html>
   <head>
      <title>Sample Title</title>
   </head>
   <body>
      <p>Sample Content</p>
   </body>
</html>
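
If you would rather not create test.htm by hand, the same flow can be tried with a temporary file. The sketch below writes markup to a temporary file, parses it with Jsoup.parse(File, String), and deletes the file afterwards. FileLoadSketch and titleViaTempFile are names made up for this illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FileLoadSketch {
   // Writes the markup to a temporary file, parses that file, and returns the title.
   static String titleViaTempFile(String html) {
      try {
         Path temp = Files.createTempFile("jsoup-sample", ".htm");
         try {
            Files.write(temp, html.getBytes(StandardCharsets.UTF_8));
            File input = temp.toFile();
            Document document = Jsoup.parse(input, "UTF-8");
            return document.title();
         } finally {
            Files.deleteIfExists(temp);
         }
      } catch (IOException e) {
         throw new UncheckedIOException(e);
      }
   }

   public static void main(String[] args) {
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body><p>Sample Content</p></body></html>";
      System.out.println(titleViaTempFile(html));
   }
}
```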

Example - Parsing a local HTML file to read Title of HTML

JsoupTester.java

package com.tutorialspoint;

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupTester {
   public static void main(String[] args) throws IOException {
      File input = new File("test.htm");
      Document document = Jsoup.parse(input, "UTF-8");
      System.out.println(document.title());
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Sample Title

Example - Parsing a local HTML file to read Body of HTML

JsoupTester.java

package com.tutorialspoint;

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTester {
   public static void main(String[] args) throws IOException {
      File input = new File("test.htm");
      Document document = Jsoup.parse(input, "UTF-8");
      Element body = document.body();
      Elements paragraphs = body.getElementsByTag("p");
      for (Element paragraph : paragraphs) {
         System.out.println(paragraph.text());
      } 
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Sample Content

jsoup - Using DOM Methods

Overview

The Jsoup.parse(String) method parses the input HTML into a new Document. The Document and its Element objects provide DOM-style methods, such as getElementById() and getElementsByTag(), to traverse the HTML DOM and read its details.

Syntax

Document document = Jsoup.parse(html);
Element sampleDiv = document.getElementById("sampleDiv");
Elements links = sampleDiv.getElementsByTag("a");

Where

  • document − document object represents the HTML DOM.

  • Jsoup − main class to parse the given HTML String.

  • html − HTML string.

  • sampleDiv − Element object that represents the HTML node identified by the id "sampleDiv".

  • links − Elements object that represents the node elements with the tag "a" inside sampleDiv.

Get an element by ID

Element sampleDiv = document.getElementById("sampleDiv");

Get elements by Tag

Elements links = sampleDiv.getElementsByTag("a");
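
Note that getElementById() returns null when no element has the given id, so it is worth checking the result before calling methods on it. The sketch below shows one way to do that; DomSketch and hrefsInside are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class DomSketch {
   // Collects the href values of all links inside the element with the given id.
   static List<String> hrefsInside(String html, String id) {
      Document document = Jsoup.parse(html);
      Element container = document.getElementById(id);
      List<String> hrefs = new ArrayList<>();
      if (container == null) {
         return hrefs; // no element with that id, so nothing to collect
      }
      for (Element link : container.getElementsByTag("a")) {
         hrefs.add(link.attr("href"));
      }
      return hrefs;
   }

   public static void main(String[] args) {
      String html = "<div id='sampleDiv'><a href='www.google.com'>Google</a></div>";
      System.out.println(hrefsInside(html, "sampleDiv"));
      System.out.println(hrefsInside(html, "missingDiv"));
   }
}
```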

Example - Parsing an HTML String to read Paragraphs

JsoupTester.java

package com.tutorialspoint;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTester {
   public static void main(String[] args) {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<p>Sample Content</p>"
         + "<div id='sampleDiv'><a href='www.google.com'>Google</a></div>"
         +"</body></html>";
      Document document = Jsoup.parse(html);
      System.out.println(document.title());
      Elements paragraphs = document.getElementsByTag("p");
      for (Element paragraph : paragraphs) {
         System.out.println(paragraph.text());
      }
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Sample Title
Sample Content

Example - Parsing an HTML String to read a particular Div

JsoupTester.java

package com.tutorialspoint;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupTester {
   public static void main(String[] args) {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<p>Sample Content</p>"
         + "<div id='sampleDiv'><a href='www.google.com'>Google</a></div>"
         +"</body></html>";
      Document document = Jsoup.parse(html);
     
      Element sampleDiv = document.getElementById("sampleDiv");
      System.out.println("Data: " + sampleDiv.text());
     
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Data: Google

Example - Parsing an HTML String to read links

JsoupTester.java

package com.tutorialspoint;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTester {
   public static void main(String[] args) {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<p>Sample Content</p>"
         + "<div id='sampleDiv'><a href='www.google.com'>Google</a></div>"
         +"</body></html>";
      Document document = Jsoup.parse(html);
      Element sampleDiv = document.getElementById("sampleDiv");
      Elements links = sampleDiv.getElementsByTag("a");

      for (Element link : links) {
         System.out.println("Href: " + link.attr("href"));
         System.out.println("Text: " + link.text());
      }
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Href: www.google.com
Text: Google

jsoup - Using Selector Syntax

Overview

The document.select(expression) method evaluates the given CSS selector expression against the HTML DOM and returns the matching elements as an Elements collection.

Syntax

Document document = Jsoup.parse(html);
// select images with src ending .png
Elements pngs = document.select("img[src$=.png]");

Where

  • document − document object represents the HTML DOM.

  • Jsoup − main class to parse the given HTML String.

  • html − HTML string

Select PNG Images

Elements pngs = document.select("img[src$=.png]");

Select div by class header

Element headerDiv = document.select("div.header").first();
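
Two related calls can shorten common selector code: selectFirst() returns the first match (or null when nothing matches), and Elements.eachAttr() collects one attribute from every match. The sketch below uses them with invented class and method names.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SelectorSketch {
   // Returns the src values of all images whose src ends with .png.
   static List<String> pngSources(String html) {
      Document document = Jsoup.parse(html);
      return document.select("img[src$=.png]").eachAttr("src");
   }

   // Returns the id of the first div with class "header", or "" if there is none.
   static String headerId(String html) {
      Element header = Jsoup.parse(html).selectFirst("div.header");
      return header == null ? "" : header.id();
   }

   public static void main(String[] args) {
      String html = "<div id='imageDiv' class='header'>"
         + "<img name='google' src='google.png' />"
         + "<img name='yahoo' src='yahoo.jpg' /></div>";
      System.out.println(pngSources(html));
      System.out.println(headerId(html));
   }
}
```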

Example - Selecting Hyperlinks

JsoupTester.java

package com.tutorialspoint;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTester {
   public static void main(String[] args) {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<p>Sample Content</p>"
         + "<div id='sampleDiv'><a href='www.google.com'>Google</a>"
         + "<h3><a>Sample</a></h3>"
         +"</div>"
         + "<div id='imageDiv' class='header'><img name='google' src='google.png' />"
         + "<img name='yahoo' src='yahoo.jpg' />"
         +"</div>"
         +"</body></html>";
      Document document = Jsoup.parse(html);

      //a with href
      Elements links = document.select("a[href]");

      for (Element link : links) {
         System.out.println("Href: " + link.attr("href"));
         System.out.println("Text: " + link.text());
      }
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Href: www.google.com
Text: Google

Example - Selecting PNG Images

JsoupTester.java

package com.tutorialspoint;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTester {
   public static void main(String[] args) {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<p>Sample Content</p>"
         + "<div id='sampleDiv'><a href='www.google.com'>Google</a>"
         + "<h3><a>Sample</a></h3>"
         +"</div>"
         + "<div id='imageDiv' class='header'><img name='google' src='google.png' />"
         + "<img name='yahoo' src='yahoo.jpg' />"
         +"</div>"
         +"</body></html>";
      Document document = Jsoup.parse(html);

      // img with src ending .png
      Elements pngs = document.select("img[src$=.png]");

      for (Element png : pngs) {
         System.out.println("Name: " + png.attr("name"));
      }
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Name: google

Example - Selecting elements by CSS class

JsoupTester.java

package com.tutorialspoint;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTester {
   public static void main(String[] args) {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<p>Sample Content</p>"
         + "<div id='sampleDiv'><a href='www.google.com'>Google</a>"
         + "<h3><a>Sample</a></h3>"
         +"</div>"
         + "<div id='imageDiv' class='header'><img name='google' src='google.png' />"
         + "<img name='yahoo' src='yahoo.jpg' />"
         +"</div>"
         +"</body></html>";
      Document document = Jsoup.parse(html);

      // div with class=header
      Element headerDiv = document.select("div.header").first();
      System.out.println("Id: " + headerDiv.id());
   
      // a elements that are direct children of an h3
      Elements sampleLinks = document.select("h3 > a"); 

      for (Element link : sampleLinks) {
         System.out.println("Text: " + link.text());
      }
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Id: imageDiv
Text: Sample

jsoup - Extracting Attributes

Overview

An Element object represents a DOM element and provides various methods to get the attributes of that element.

Syntax

Document document = Jsoup.parse(html);
Element link = document.select("a").first();
System.out.println("Href: " + link.attr("href"));

Where

  • document − document object represents the HTML DOM.

  • Jsoup − main class to parse the given HTML String.

  • html − HTML string

  • link − Element object that represents the anchor tag.

  • link.attr() − attr(attribute) method retrieves the value of the given attribute of the element.
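
Besides attr(), an element can report whether an attribute exists with hasAttr() and expose all of its attributes through attributes(). The sketch below collects them into a map; AttributeSketch and attributesOf are hypothetical names. Note that attr() returns an empty string, not null, for a missing attribute.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;

class AttributeSketch {
   // Copies every attribute of the first matching element into an ordered map.
   static Map<String, String> attributesOf(String html, String cssQuery) {
      Element element = Jsoup.parse(html).selectFirst(cssQuery);
      Map<String, String> result = new LinkedHashMap<>();
      if (element == null) {
         return result;
      }
      for (Attribute attribute : element.attributes()) {
         result.put(attribute.getKey(), attribute.getValue());
      }
      return result;
   }

   public static void main(String[] args) {
      String html = "<img key='google1' name='google' src='google.png' />";
      System.out.println(attributesOf(html, "img"));

      Element img = Jsoup.parse(html).selectFirst("img");
      System.out.println("Has alt: " + img.hasAttr("alt"));
      System.out.println("Alt value: \"" + img.attr("alt") + "\"");
   }
}
```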

Example - Selecting Hyperlinks

JsoupTester.java

package com.tutorialspoint;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTester {
   public static void main(String[] args) {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<p>Sample Content</p>"
         + "<div id='sampleDiv'><a href='www.google.com'>Google</a>"
         + "<h3><a>Sample</a></h3>"
         +"</div>"
         + "<div id='imageDiv' class='header'><img name='google' src='google.png' />"
         + "<img name='yahoo' src='yahoo.jpg' />"
         +"</div>"
         +"</body></html>";
      Document document = Jsoup.parse(html);

      //a with href
      Elements links = document.select("a[href]");

      for (Element link : links) {
         System.out.println("Href: " + link.attr("href"));
      }
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Href: www.google.com

Example - Selecting PNG Images

JsoupTester.java

package com.tutorialspoint;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTester {
   public static void main(String[] args) {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<p>Sample Content</p>"
         + "<div id='sampleDiv'><a href='www.google.com'>Google</a>"
         + "<h3><a>Sample</a></h3>"
         +"</div>"
         + "<div id='imageDiv' class='header'><img name='google' src='google.png' />"
         + "<img name='yahoo' src='yahoo.jpg' />"
         +"</div>"
         +"</body></html>";
      Document document = Jsoup.parse(html);

      // img with src ending .png
      Elements pngs = document.select("img[src$=.png]");

      for (Element png : pngs) {
         System.out.println("Src: " + png.attr("src"));
      }
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Src: google.png

Example - Selecting custom attributes

JsoupTester.java

package com.tutorialspoint;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTester {
   public static void main(String[] args) {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<p>Sample Content</p>"
         + "<div id='sampleDiv'><a href='www.google.com'>Google</a>"
         + "<h3><a>Sample</a></h3>"
         +"</div>"
         + "<div id='imageDiv' class='header'><img key='google1' name='google' src='google.png' />"
         + "<img key='yahoo1' name='yahoo' src='yahoo.jpg' />"
         +"</div>"
         +"</body></html>";
      Document document = Jsoup.parse(html);

      // select images
      Elements images = document.select("img");

      for (Element image : images) {
         System.out.println("Key: " + image.attr("key"));
      }
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Key: google1
Key: yahoo1

jsoup - Extracting HTML

Overview

An Element object represents a DOM element and provides various methods to get the inner HTML as well as the outer HTML (complete HTML) of that element.

Syntax

Document document = Jsoup.parse(html);
Element link = document.select("a").first();         

System.out.println("Outer HTML: " + link.outerHtml());
System.out.println("Inner HTML: " + link.html());

Where

  • document − document object represents the HTML DOM.

  • Jsoup − main class to parse the given HTML String.

  • html − HTML string.

  • link − Element object that represents the anchor tag.

  • link.outerHtml() − outerHtml() method retrieves the complete HTML of the element, including its own tag.

  • link.html() − html() method retrieves the inner HTML of the element.
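
The difference between the HTML views and the plain text of an element is easiest to see side by side. The following sketch prints all three for one paragraph; HtmlSketch and its method names are assumptions made for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class HtmlSketch {
   // Finds the first paragraph in the given markup.
   static Element firstParagraph(String html) {
      return Jsoup.parse(html).selectFirst("p");
   }

   // Inner HTML: the markup between the element's opening and closing tags.
   static String innerHtml(String html) {
      return firstParagraph(html).html();
   }

   // Plain text: the combined text of the element and its descendants.
   static String plainText(String html) {
      return firstParagraph(html).text();
   }

   public static void main(String[] args) {
      String html = "<p>Hello <b>there</b></p>";
      System.out.println("Outer HTML: " + firstParagraph(html).outerHtml());
      System.out.println("Inner HTML: " + innerHtml(html));
      System.out.println("Text: " + plainText(html));
   }
}
```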

Example - Selecting Complete html of a Tag

JsoupTester.java

package com.tutorialspoint;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTester {
   public static void main(String[] args) {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<p>Sample Content</p>"
         + "<div id='sampleDiv'><a href='www.google.com'>Google</a>"
         + "<h3><a>Sample</a></h3>"
         +"</div>"
         + "<div id='imageDiv' class='header'><img name='google' src='google.png' />"
         + "<img name='yahoo' src='yahoo.jpg' />"
         +"</div>"
         +"</body></html>";
      Document document = Jsoup.parse(html);

      //a with href
      Elements links = document.select("a[href]");

      for (Element link : links) {
         System.out.println(link.outerHtml());
      }
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

<a href="www.google.com">Google</a>

Example - Getting Inner HTML of a tag

JsoupTester.java

package com.tutorialspoint;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTester {
   public static void main(String[] args) {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<p>Sample Content</p>"
         + "<div id='sampleDiv'><a href='www.google.com'>Google</a>"
         + "<h3><a>Sample</a></h3>"
         +"</div>"
         + "<div id='imageDiv' class='header'><img name='google' src='google.png' />"
         + "<img name='yahoo' src='yahoo.jpg' />"
         +"</div>"
         +"</body></html>";
      Document document = Jsoup.parse(html);

      //a with href
      Elements links = document.select("a[href]");

      for (Element link : links) {
         System.out.println(link.html());
      }
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Google

Example - Selecting HTML of a DIV

JsoupTester.java

package com.tutorialspoint;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTester {
   public static void main(String[] args) {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<p>Sample Content</p>"
         + "<div id='sampleDiv'><a href='www.google.com'>Google</a>"
         + "<h3><a>Sample</a></h3>"
         +"</div>"
         + "<div id='imageDiv' class='header'><img key='google1' name='google' src='google.png' />"
         + "<img key='yahoo1' name='yahoo' src='yahoo.jpg' />"
         +"</div>"
         +"</body></html>";
      Document document = Jsoup.parse(html);

      // select divs
      Elements divs = document.select("div");

      for (Element div : divs) {
         System.out.println("Outer HTML: " + div.outerHtml());
         System.out.println("Inner HTML: " + div.html());
      }
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Outer HTML: <div id="sampleDiv">
 <a href="www.google.com">Google</a>
 <h3><a>Sample</a></h3>
</div>
Inner HTML: <a href="www.google.com">Google</a>
<h3><a>Sample</a></h3>
Outer HTML: <div id="imageDiv" class="header">
 <img key="google1" name="google" src="google.png"><img key="yahoo1" name="yahoo" src="yahoo.jpg">
</div>
Inner HTML: <img key="google1" name="google" src="google.png"><img key="yahoo1" name="yahoo" src="yahoo.jpg">

jsoup - Extracting Text

Overview

An Element object represents a DOM element and provides various methods to get the text of that element.

Syntax

Document document = Jsoup.parse(html);
Element link = document.select("a").first();
System.out.println("Text: " + link.text());

Where

  • document − document object represents the HTML DOM.

  • Jsoup − main class to parse the given HTML String.

  • html − HTML string

  • link − Element object that represents the anchor tag.

  • link.text() − text() method retrieves the combined text of the element and its children.
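
When only the text that belongs directly to an element is wanted, without the text of its child elements, ownText() can be used instead of text(). The sketch below contrasts the two; TextSketch and its method names are invented for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class TextSketch {
   // Text of the first paragraph including the text of its child elements.
   static String fullText(String html) {
      Element paragraph = Jsoup.parse(html).selectFirst("p");
      return paragraph.text();
   }

   // Text owned by the first paragraph itself, skipping child elements.
   static String ownTextOnly(String html) {
      Element paragraph = Jsoup.parse(html).selectFirst("p");
      return paragraph.ownText();
   }

   public static void main(String[] args) {
      String html = "<p>Hello <b>there</b> now!</p>";
      System.out.println("text():    " + fullText(html));
      System.out.println("ownText(): " + ownTextOnly(html));
   }
}
```

For this input, text() yields "Hello there now!" while ownText() yields "Hello now!", because the word inside the bold element belongs to the child.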

Example - Selecting Hyperlinks title

JsoupTester.java

package com.tutorialspoint;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTester {
   public static void main(String[] args) {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<p>Sample Content</p>"
         + "<div id='sampleDiv'><a href='www.google.com'>Google</a>"
         + "<h3><a>Sample</a></h3>"
         +"</div>"
         + "<div id='imageDiv' class='header'><img name='google' src='google.png' />"
         + "<img name='yahoo' src='yahoo.jpg' />"
         +"</div>"
         +"</body></html>";
      Document document = Jsoup.parse(html);

      //a with href
      Elements links = document.select("a[href]");

      for (Element link : links) {
         System.out.println("Title: " + link.text());
      }
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Title: Google

Example - Getting Text of Divs

JsoupTester.java

package com.tutorialspoint;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTester {
   public static void main(String[] args) {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<p>Sample Content</p>"
         + "<div id='sampleDiv'><a href='www.google.com'>Google</a>"
         + "<h3><a>Sample</a></h3>"
         +"</div>"
         + "<div id='imageDiv' class='header'><img name='google' src='google.png' />"
         + "<img name='yahoo' src='yahoo.jpg' />"
         +"</div>"
         +"</body></html>";
      Document document = Jsoup.parse(html);

      // select div
      Elements divs = document.select("div");

      for (Element div : divs) {
         System.out.println(div.text());
      }
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Google Sample

Example - Selecting text of paragraph

JsoupTester.java

package com.tutorialspoint;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTester {
   public static void main(String[] args) {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<p>Sample Content</p>"
         + "<div id='sampleDiv'><a href='www.google.com'>Google</a>"
         + "<h3><a>Sample</a></h3>"
         +"</div>"
         + "<div id='imageDiv' class='header'><img key='google1' name='google' src='google.png' />"
         + "<img key='yahoo1' name='yahoo' src='yahoo.jpg' />"
         +"</div>"
         +"</body></html>";
      Document document = Jsoup.parse(html);

      // select paragraph
      Elements paragraphs = document.select("p");

      for (Element paragraph : paragraphs) {
         System.out.println("Text: " + paragraph.text());
      }
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Text: Sample Content

jsoup - Working with URLs

Overview

An Element object represents a DOM element and provides methods to get the relative as well as the absolute URLs present in the HTML page.

Syntax

String url = "http://www.google.com/";
Document document = Jsoup.connect(url).get();
Element link = document.select("a").first();         

System.out.println("Relative Link: " + link.attr("href"));
System.out.println("Absolute Link: " + link.attr("abs:href"));
System.out.println("Absolute Link: " + link.absUrl("href"));

Where

  • document − document object represents the HTML DOM.

  • Jsoup − main class to connect to a url and get the html content.

  • link − Element object representing the anchor tag.

  • link.attr("href") − provides the value of href present in anchor tag. It may be relative or absolute.

  • link.attr("abs:href") − provides the absolute url after resolving against the document's base URI.

  • link.absUrl("href") − provides the absolute url after resolving against the document's base URI.
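
The resolution against the base URI can also be seen without a network connection. The following is a minimal sketch, assuming the HTML is parsed from a string with Jsoup.parse(html, baseUri); the class name BaseUriSketch, the helper firstAbsoluteLink and the example.com addresses are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BaseUriSketch {
   // Hypothetical helper: parses the HTML against baseUri and
   // returns the absolute URL of the first anchor tag.
   static String firstAbsoluteLink(String html, String baseUri) {
      Document document = Jsoup.parse(html, baseUri);
      Element link = document.select("a").first();
      return link.absUrl("href");
   }

   public static void main(String[] args) {
      String html = "<a href='/about'>About</a>";

      // With a base URI, the relative href is resolved against it.
      System.out.println("With base URI: "
         + firstAbsoluteLink(html, "http://example.com/docs/"));

      // Without a base URI there is nothing to resolve against,
      // so absUrl() returns an empty string.
      System.out.println("Without base URI: ["
         + firstAbsoluteLink(html, "") + "]");
   }
}
```

When a page is fetched with Jsoup.connect(url).get(), the base URI is taken from the fetched location, so this step is normally not needed there.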

Example - Selecting Attributes of a URL after Connecting

JsoupTester.java

package com.tutorialspoint;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupTester {
   public static void main(String[] args) throws IOException {
   
      String url = "http://www.google.com/";
      Document document = Jsoup.connect(url).get();

      Element link = document.select("a").first();
      System.out.println("Relative Link: " + link.attr("href"));
      System.out.println("Absolute Link: " + link.attr("abs:href"));
      System.out.println("Href: " + link.absUrl("href"));
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Relative Link: https://about.google/?fg=1&utm_source=google-IN&utm_medium=referral&utm_campaign=hp-header
Absolute Link: https://about.google/?fg=1&utm_source=google-IN&utm_medium=referral&utm_campaign=hp-header
Href: https://about.google/?fg=1&utm_source=google-IN&utm_medium=referral&utm_campaign=hp-header

Example - Getting Exception while Connecting

A connection attempt can also fail with an exception. For example, requesting tutorialspoint.com over http instead of https fails with an HttpStatusException (status 403), as the run below shows.

JsoupTester.java

package com.tutorialspoint;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupTester {
   public static void main(String[] args) throws IOException {
   
      String url = "http://www.tutorialspoint.com/";
      Document document = Jsoup.connect(url).get();

      Element link = document.select("a").first();
      System.out.println("Relative Link: " + link.attr("href"));
      System.out.println("Absolute Link: " + link.attr("abs:href"));
      System.out.println("Href: " + link.absUrl("href"));
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=[http://www.tutorialspoint.com/]
	at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:913)
	at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:866)
	at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:365)
	at org.jsoup.helper.HttpConnection.get(HttpConnection.java:350)
	at com.tutorialspoint.JsoupTester.main(JsoupTester.java:13)
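
Instead of letting the exception end the program, it can be caught and reported. The following is a minimal sketch, not a complete error-handling strategy; the class name SafeFetchSketch, the helper describe and the 10-second timeout are assumptions chosen for illustration. The URL is read from the command line so that nothing is fetched unless one is supplied.

```java
import java.io.IOException;

import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SafeFetchSketch {
   // Hypothetical helper: turns a connection failure into a short message.
   static String describe(IOException e) {
      if (e instanceof HttpStatusException) {
         // The server answered, but with an error status such as 403 or 404.
         HttpStatusException statusError = (HttpStatusException) e;
         return "HTTP " + statusError.getStatusCode() + " for " + statusError.getUrl();
      }
      // Any other I/O problem, for example a timeout or an unknown host.
      return "I/O failure: " + e.getMessage();
   }

   public static void main(String[] args) {
      if (args.length == 0) {
         System.out.println("Usage: java SafeFetchSketch <url>");
         return;
      }
      try {
         // The timeout is given in milliseconds; 10 seconds is an arbitrary value.
         Document document = Jsoup.connect(args[0]).timeout(10_000).get();
         System.out.println("Title: " + document.title());
      } catch (IOException e) {
         System.out.println(describe(e));
      }
   }
}
```

HttpStatusException is a subclass of IOException, so a single catch block is enough here and the helper decides how to word the message.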

jsoup - Setting Attributes

Overview

An Element object represents a DOM element and provides various methods to set the attributes of that element.

Syntax

Document document = Jsoup.parse(html);
Element link = document.select("a").first();         
link.attr("href","www.yahoo.com");     
link.addClass("header"); 
link.removeClass("header");

Where

  • document − document object represents the HTML DOM.

  • Jsoup − main class to parse the given HTML String.

  • html − HTML String.

  • link − Element object representing the anchor tag.

  • link.attr() − attr(attribute,value) method sets the element's attribute to the given value.

  • link.addClass() − addClass(class) method adds the class to the element's class attribute.

  • link.removeClass() − removeClass(class) method removes the class from the element's class attribute.
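
Each of these setters returns the Element it was called on, so the calls can be chained, and hasClass() can be used to check the result. The following is a minimal sketch; the class name AttributeChainSketch and the helper update are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AttributeChainSketch {
   // Hypothetical helper: updates the first anchor tag and returns it.
   static Element update(String html) {
      Document document = Jsoup.parse(html);
      Element link = document.select("a").first();
      // attr() and addClass() both return the same Element, so they chain.
      return link.attr("href", "www.yahoo.com").addClass("header");
   }

   public static void main(String[] args) {
      Element link = update("<a href='www.google.com'>Google</a>");
      System.out.println("href: " + link.attr("href"));
      System.out.println("Has class header: " + link.hasClass("header"));

      link.removeClass("header");
      System.out.println("Has class header after removal: " + link.hasClass("header"));
   }
}
```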

Example - Setting an Attribute of a Link

JsoupTester.java

package com.tutorialspoint;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupTester {
   public static void main(String[] args) throws IOException {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<p>Sample Content</p>"
         + "<div id='sampleDiv'><a id='googleA' href='www.google.com'>Google</a></div>"
         + "<div class='comments'><a href='www.sample1.com'>Sample1</a>"
         + "<a href='www.sample2.com'>Sample2</a>"
         + "<a href='www.sample3.com'>Sample3</a><div>"
         +"</div>"
         + "<div id='imageDiv' class='header'><img name='google' src='google.png' />"
         + "<img name='yahoo' src='yahoo.jpg' />"
         +"</div>"
         +"</body></html>";
      Document document = Jsoup.parse(html);

      //Example: set attribute
      Element link = document.getElementById("googleA");
      System.out.println("Outer HTML Before Modification :"  + link.outerHtml());
      link.attr("href","www.yahoo.com");      
      System.out.println("Outer HTML After Modification :"  + link.outerHtml());
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Outer HTML Before Modification :<a id="googleA" href="www.google.com">Google</a>
Outer HTML After Modification :<a id="googleA" href="www.yahoo.com">Google</a>

Example - Adding and Removing CSS Class of an Element

JsoupTester.java

package com.tutorialspoint;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupTester {
   public static void main(String[] args) throws IOException {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<p>Sample Content</p>"
         + "<div id='sampleDiv'><a id='googleA' href='www.google.com'>Google</a></div>"
         + "<div class='comments'><a href='www.sample1.com'>Sample1</a>"
         + "<a href='www.sample2.com'>Sample2</a>"
         + "<a href='www.sample3.com'>Sample3</a><div>"
         +"</div>"
         + "<div id='imageDiv' class='header'><img name='google' src='google.png' />"
         + "<img name='yahoo' src='yahoo.jpg' />"
         +"</div>"
         +"</body></html>";
      Document document = Jsoup.parse(html);

      // Example: add class
      Element div = document.getElementById("sampleDiv");
      System.out.println("Outer HTML Before Modification :"  + div.outerHtml());
      div.addClass("header");      
      System.out.println("Outer HTML After Modification :"  + div.outerHtml());
      System.out.println("---");
      
      // Example: remove class
      Element div1 = document.getElementById("imageDiv");
      System.out.println("Outer HTML Before Modification :"  + div1.outerHtml());
      div1.removeClass("header");      
      System.out.println("Outer HTML After Modification :"  + div1.outerHtml());
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Outer HTML Before Modification :<div id="sampleDiv">
 <a id="googleA" href="www.google.com">Google</a>
</div>
Outer HTML After Modification :<div id="sampleDiv" class="header">
 <a id="googleA" href="www.google.com">Google</a>
</div>
---
Outer HTML Before Modification :<div id="imageDiv" class="header">
 <img name="google" src="google.png"><img name="yahoo" src="yahoo.jpg">
</div>
Outer HTML After Modification :<div id="imageDiv">
 <img name="google" src="google.png"><img name="yahoo" src="yahoo.jpg">
</div>

Example - Multiple Updates

JsoupTester.java

package com.tutorialspoint;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupTester {
   public static void main(String[] args) throws IOException {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<p>Sample Content</p>"
         + "<div id='sampleDiv'><a id='googleA' href='www.google.com'>Google</a></div>"
         + "<div class='comments'><a href='www.sample1.com'>Sample1</a>"
         + "<a href='www.sample2.com'>Sample2</a>"
         + "<a href='www.sample3.com'>Sample3</a><div>"
         +"</div>"
         + "<div id='imageDiv' class='header'><img name='google' src='google.png' />"
         + "<img name='yahoo' src='yahoo.jpg' />"
         +"</div>"
         +"</body></html>";
      Document document = Jsoup.parse(html);

      //Example: bulk update
      Elements links = document.select("div.comments a");
      System.out.println("Outer HTML Before Modification :"  + links.outerHtml());
      links.attr("rel", "nofollow");
      System.out.println("Outer HTML After Modification :"  + links.outerHtml());
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Outer HTML Before Modification :<a href="www.sample1.com">Sample1</a>
<a href="www.sample2.com">Sample2</a>
<a href="www.sample3.com">Sample3</a>
Outer HTML After Modification :<a href="www.sample1.com" rel="nofollow">Sample1</a>
<a href="www.sample2.com" rel="nofollow">Sample2</a>
<a href="www.sample3.com" rel="nofollow">Sample3</a>

jsoup - Setting HTML

Overview

An Element object represents a DOM element and provides various methods to set, prepend or append HTML to that element.

Syntax

Document document = Jsoup.parse(html);
Element div = document.getElementById("sampleDiv");     
div.html("<p>This is a sample content.</p>");   
div.prepend("<p>Initial Text</p>");
div.append("<p>End Text</p>");   

Where

  • document − document object represents the HTML DOM.

  • Jsoup − main class to parse the given HTML String.

  • html − HTML String.

  • div − Element object representing the div element.

  • div.html() − html(content) method replaces the element's inner html with the given content; the element itself is kept.

  • div.prepend() − prepend(content) method adds the content at the start of the element's inner html.

  • div.append() − append(content) method adds the content at the end of the element's inner html.
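
All three methods work on the inside of the element, which can be checked by applying them in sequence and listing the resulting children. The following is a minimal sketch; the class name InnerHtmlSketch and the helper rebuild are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class InnerHtmlSketch {
   // Hypothetical helper: replaces the content of #sampleDiv,
   // then adds one paragraph at the start and one at the end.
   static Element rebuild(String html) {
      Document document = Jsoup.parse(html);
      Element div = document.getElementById("sampleDiv");
      div.html("<p>This is a sample content.</p>");
      div.prepend("<p>Initial Text</p>");
      div.append("<p>End Text</p>");
      return div;
   }

   public static void main(String[] args) {
      Element div = rebuild(
         "<div id='sampleDiv'><a href='www.google.com'>Google</a></div>");

      // The div and its id are still there; only its children changed.
      System.out.println("Id: " + div.id());
      for (Element child : div.children()) {
         System.out.println(child.tagName() + ": " + child.text());
      }
   }
}
```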

Example - Change HTML Content

JsoupTester.java

package com.tutorialspoint;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupTester {
   public static void main(String[] args) throws IOException {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<div id='sampleDiv'><a id='googleA' href='www.google.com'>Google</a></div>"
         +"</body></html>";
      Document document = Jsoup.parse(html);

      Element div = document.getElementById("sampleDiv");
      System.out.println("Outer HTML Before Modification :\n"  + div.outerHtml());
      div.html("<p>This is a sample content.</p>");
      System.out.println("Outer HTML After Modification :\n"  + div.outerHtml());
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Outer HTML Before Modification :
<div id="sampleDiv">
 <a id="googleA" href="www.google.com">Google</a>
</div>
Outer HTML After Modification :
<div id="sampleDiv">
 <p>This is a sample content.</p>
</div>

Example - Prepending HTML

JsoupTester.java

package com.tutorialspoint;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupTester {
   public static void main(String[] args) throws IOException {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<div id='sampleDiv'><a id='googleA' href='www.google.com'>Google</a></div>"
         +"</body></html>";
      Document document = Jsoup.parse(html);
      Element div = document.getElementById("sampleDiv");
      System.out.println("Outer HTML Before Modification :\n"  + div.outerHtml());
      div.prepend("<p>Initial Text</p>");
      System.out.println("After Prepend :\n"  + div.outerHtml());
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Outer HTML Before Modification :
<div id="sampleDiv">
 <a id="googleA" href="www.google.com">Google</a>
</div>
After Prepend :
<div id="sampleDiv">
 <p>Initial Text</p>
 <a id="googleA" href="www.google.com">Google</a>
</div>

Example - Appending HTML

JsoupTester.java

package com.tutorialspoint;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupTester {
   public static void main(String[] args) throws IOException {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<div id='sampleDiv'><a id='googleA' href='www.google.com'>Google</a></div>"
         +"</body></html>";
      Document document = Jsoup.parse(html);
      Element div = document.getElementById("sampleDiv");
      System.out.println("Outer HTML Before Modification :\n"  + div.outerHtml());
      
      div.append("<p>End Text</p>");
      System.out.println("After Append :\n"  + div.outerHtml());  
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Outer HTML Before Modification :
<div id="sampleDiv">
 <a id="googleA" href="www.google.com">Google</a>
</div>
After Append :
<div id="sampleDiv">
 <a id="googleA" href="www.google.com">Google</a>
 <p>End Text</p>
</div>

jsoup - Setting Text Content

Overview

An Element object represents a DOM element and provides various methods to set, prepend or append text to that element.

Syntax

Document document = Jsoup.parse(html);
Element div = document.getElementById("sampleDiv");
div.text("This is a sample content.");   
div.prepend("Initial Text.");
div.append("End Text.");  

Where

  • document − document object represents the HTML DOM.

  • Jsoup − main class to parse the given HTML String.

  • html − HTML String.

  • div − Element object representing the div element.

  • div.text() − text(content) method replaces the element's content with the given text.

  • div.prepend() − prepend(content) method adds the content at the start of the element's inner html.

  • div.append() − append(content) method adds the content at the end of the element's inner html.
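
Note that text() treats its argument as plain text, whereas prepend() and append() parse their argument as HTML. When the content must stay literal text, Element also offers prependText() and appendText(). The following is a minimal sketch of the difference; the class name TextVersusHtmlSketch and the helper setText are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class TextVersusHtmlSketch {
   // Hypothetical helper: replaces the content of #sampleDiv with plain text.
   static Element setText(String content) {
      Element div = Jsoup
         .parse("<div id='sampleDiv'><a href='www.google.com'>Google</a></div>")
         .getElementById("sampleDiv");
      div.text(content);
      return div;
   }

   public static void main(String[] args) {
      Element div = setText("<b>not bold</b>");

      // text() stored the markup literally, so no <b> element was created.
      System.out.println("Child elements: " + div.children().size());
      System.out.println("Text: " + div.text());

      // prependText() and appendText() add literal text at either end.
      div.prependText("Start: ").appendText(" :End");
      System.out.println("Text after additions: " + div.text());
   }
}
```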

Example - Modifying Content

JsoupTester.java

package com.tutorialspoint;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupTester {
   public static void main(String[] args) throws IOException {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<div id='sampleDiv'><a id='googleA' href='www.google.com'>Google</a></div>"       
         +"</body></html>";
      Document document = Jsoup.parse(html);

      Element div = document.getElementById("sampleDiv");
      System.out.println("Outer HTML Before Modification :\n"  + div.outerHtml());
      div.text("This is a sample content.");
      System.out.println("Outer HTML After Modification :\n"  + div.outerHtml());
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Outer HTML Before Modification :
<div id="sampleDiv">
 <a id="googleA" href="www.google.com">Google</a>
</div>
Outer HTML After Modification :
<div id="sampleDiv">This is a sample content.</div>

Example - Prepending Text Content

JsoupTester.java

package com.tutorialspoint;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupTester {
   public static void main(String[] args) throws IOException {
   
      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<div id='sampleDiv'><a id='googleA' href='www.google.com'>Google</a></div>"       
         +"</body></html>";
      Document document = Jsoup.parse(html);

      Element div = document.getElementById("sampleDiv");
      System.out.println("Outer HTML Before Modification :\n"  + div.outerHtml());
	  
      div.prepend("Initial Text.");
      System.out.println("After Prepend :\n"  + div.outerHtml());
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Outer HTML Before Modification :
<div id="sampleDiv">
 <a id="googleA" href="www.google.com">Google</a>
</div>
After Prepend :
<div id="sampleDiv">
 Initial Text.<a id="googleA" href="www.google.com">Google</a>
</div>

Example - Appending Text Content

JsoupTester.java

package com.tutorialspoint;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupTester {
   public static void main(String[] args) throws IOException {

      String html = "<html><head><title>Sample Title</title></head>"
         + "<body>"
         + "<div id='sampleDiv'><a id='googleA' href='www.google.com'>Google</a></div>"       
         +"</body></html>";
      Document document = Jsoup.parse(html);

      Element div = document.getElementById("sampleDiv");
      System.out.println("Outer HTML Before Modification :\n"  + div.outerHtml());
      div.append("End Text.");
      System.out.println("After Append :\n"  + div.outerHtml());     
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Outer HTML Before Modification :
<div id="sampleDiv">
 <a id="googleA" href="www.google.com">Google</a>
</div>
After Append :
<div id="sampleDiv">
 <a id="googleA" href="www.google.com">Google</a>End Text.
</div>

jsoup - Sanitizing HTML

Overview

The Jsoup.clean() method sanitizes HTML according to a Safelist configuration (named Whitelist in older jsoup releases). It helps prevent cross-site scripting (XSS) attacks.

Syntax

String safeHtml =  Jsoup.clean(html, Safelist.basic());   

Where

  • Jsoup − main class to parse the given HTML String.

  • html − Initial HTML String.

  • safeHtml − Cleaned HTML.

  • Safelist − class that provides predefined configurations of the tags and attributes allowed in the cleaned html.

  • clean() − cleans the html using the given Safelist.
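
Safelist.basic() is only one of the predefined configurations. The following is a minimal sketch, assuming a jsoup release that ships the Safelist class; it uses Safelist.none() to keep text only and Jsoup.isValid() to test whether the input already conforms to a safelist. The class name SafelistSketch and the helpers textOnly and isAlreadySafe are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class SafelistSketch {
   // Hypothetical helper: strips all tags and keeps only the text.
   static String textOnly(String html) {
      return Jsoup.clean(html, Safelist.none());
   }

   // Hypothetical helper: true if the html contains nothing that
   // Safelist.basic() would remove.
   static boolean isAlreadySafe(String html) {
      return Jsoup.isValid(html, Safelist.basic());
   }

   public static void main(String[] args) {
      String html = "<p><a href='http://example.com/'"
         + " onclick='checkData()'>Link</a></p>";

      // The onclick attribute is not allowed by basic(), so the input is not valid.
      System.out.println("Already safe: " + isAlreadySafe(html));

      // none() allows no tags at all, so only the link text remains.
      System.out.println("Text only: " + textOnly(html));
   }
}
```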

Example - Sanitize an HTML Content

JsoupTester.java

package com.tutorialspoint;

import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class JsoupTester {
   public static void main(String[] args) {
      String html = "<p><a href='http://example.com/'"
         +" onclick='checkData()'>Link</a></p>";

      System.out.println("Initial HTML: " + html);
      String safeHtml =  Jsoup.clean(html, Safelist.basic());
      System.out.println("Cleaned HTML: " +safeHtml);
   }
}

Verify the result

Compile and run the JsoupTester to verify the result −

Initial HTML: <p><a href='http://example.com/' onclick='checkData()'>Link</a></p>
Cleaned HTML: <p><a href="http://example.com/" rel="nofollow">Link</a></p>