XPath selector on HTML with Java

Created: 2020-09-30 | 6 min read


I wanted to explore how easy it is to extract elements from HTML files, or directly from website URLs, programmatically in Java. I found a third-party library that gave me the desired results without much difficulty.

It’s XSoup.

It selects elements within the HTML using XPath expressions, with Jsoup as the underlying HTML parser.

In the example below, it extracts the posts of this website either directly from the URL (https://ckinan.com), or from a copy of that page's HTML downloaded into the repository.

#The code

Repo: https://github.com/ckinan/learning/tree/main/java/parse-html

Here is what we end up with:

#App.java

I tried to explain the code with Javadoc and inline comments to point out what the lines and methods are supposed to do.

package com.ckinan;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import us.codecraft.xsoup.Xsoup;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

/**
 * Code snippet to find elements within an HTML document (from a file or a URL) using XPath.
 *
 * It uses a third-party library called XSoup. Links:
 * - Maven repository: https://mvnrepository.com/artifact/us.codecraft/xsoup
 * - GitHub repository: https://github.com/code4craft/xsoup
 *
 * This example reads a list of blog posts using a fixed XPath expression and prints the results to the console.
 * */
public class App {

    public static final String XPATH_BLOG_POST = "//div[contains(@class, 'leading-snug')]";

    /**
     * The main method of the program. It starts and orchestrates the process.
     * It gets Document instances from specific sources and reads specific elements
     * from them by XPath expression.
     *
     * It retrieves documents from:
     * - a file (by specifying the full path)
     * - a URL (of the website to be parsed)
     *
     * @param  args
     *         Arguments of our application. Not used at all.
     *
     * @throws  java.io.IOException
     */
    public static void main(String[] args) throws IOException {
        // Call the static methods that get an instance of the HTML document and then evaluate
        // the XPath expression to get a list of blog posts. First with an HTML file, then with the
        // website URL directly.
        Document doc1 = App.getDocumentFromFile("src/main/resources/blog.html");
        List<String> postsFromFile = App.readDocument(doc1, App.XPATH_BLOG_POST);
        System.out.println(postsFromFile);

        Document doc2 = App.getDocumentFromUrl("https://ckinan.com/blog");
        List<String> postsFromUrl = App.readDocument(doc2, App.XPATH_BLOG_POST);
        System.out.println(postsFromUrl);
    }

    /**
     * Reads the file at the given full path using "collect" from the Java 8+ Stream API.
     *
     * Refs:
     * - https://docs.oracle.com/javase/8/docs/api/java/nio/file/Paths.html#get-java.lang.String-java.lang.String...-
     * - https://docs.oracle.com/javase/8/docs/api/java/nio/file/Files.html#lines-java.nio.file.Path-
     * - https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#collect-java.util.stream.Collector-
     *
     * It then uses Jsoup to parse the file content into a Document instance.
     *
     * @param  fullPath
     *         The full path of the file to be read and parsed using Jsoup.
     *
     * @throws  java.io.IOException
     *
     * @return  The parsed document, which contains all the elements of the HTML file.
     */
    public static Document getDocumentFromFile(String fullPath) throws IOException {
        System.out.println("Reading document from file: " + fullPath);
        // Read the file content using Java 8+ streams.
        String html = Files.lines(Paths.get(fullPath)).collect(Collectors.joining(System.lineSeparator()));
        return Jsoup.parse(html);
    }

    /**
     * Reads the content of the given website URL using Jsoup.
     *
     * @param  url
     *         The URL of the website HTML to be read and parsed using Jsoup.
     *
     * @throws  java.io.IOException
     *
     * @return  The parsed document, which contains all the elements of the HTML page.
     */
    public static Document getDocumentFromUrl(String url) throws IOException {
        System.out.println("Reading document from url: " + url);
        return Jsoup.connect(url).get();
    }

    /**
     * Evaluates the given XPath expression on the given Document. For now it assumes the
     * result is a list of Elements, and returns the text of each match.
     *
     * @param  doc
     *         The parsed Jsoup Document to evaluate.
     *
     * @param  xpath
     *         The XPath expression to evaluate against the document.
     *
     * @return  The list of string values resulting from the XPath evaluation.
     */
    public static List<String> readDocument(Document doc, String xpath) {
        List<String> result = new ArrayList<>();

        // First extract the Element instances from the HTML content using the given Document and the
        // XPath expression.
        // Note: Xsoup uses Jsoup as the HTML parser. Xsoup "evaluates" a Document instance created
        //       by Jsoup.
        List<Element> elements =
                Xsoup.compile(xpath).evaluate(doc).getElements();

        for (Element e : elements) {
            // Each Element object has multiple properties available to be extracted, for example the
            // HTML tag name. For this code snippet, we are interested in extracting the text of the element.
            // Example: <p>Hello world</p> -> e.text() returns "Hello world"
            result.add(e.text());
        }

        return result;
    }

}
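Besides element text, Xsoup expressions ending in `@attr` can return attribute values directly as strings, without going through `getElements()`. A minimal sketch, parsing an inline HTML string so it runs without any file or network access (the HTML and the `AttributeDemo` class name are made up for illustration):

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import us.codecraft.xsoup.Xsoup;

public class AttributeDemo {
    public static void main(String[] args) {
        // Inline HTML so the sketch needs no file or network access
        String html = "<html><body>"
                + "<a href='https://ckinan.com/blog'>Blog</a>"
                + "<a href='https://github.com/ckinan'>GitHub</a>"
                + "</body></html>";
        Document doc = Jsoup.parse(html);

        // An expression ending in @href yields the attribute values as plain strings
        List<String> hrefs = Xsoup.compile("//a/@href").evaluate(doc).list();
        System.out.println(hrefs);
    }
}
```

This prints the two href values in document order, which is handy when the attribute itself, not the element text, is what you are after.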

#pom.xml (dependencies section only)

<dependencies>
  <!-- https://mvnrepository.com/artifact/us.codecraft/xsoup -->
  <dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>xsoup</artifactId>
    <version>0.3.1</version>
  </dependency>
</dependencies>
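If your project happens to use Gradle instead of Maven, the same coordinates and version should translate to (as far as I can tell, Xsoup declares Jsoup as its own dependency, so Jsoup comes in transitively either way):

```groovy
dependencies {
    // Same artifact as the Maven snippet above; Jsoup arrives transitively
    implementation 'us.codecraft:xsoup:0.3.1'
}
```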

#Testing

Run the App class, and you should see this printed in the console:

Reading document from file: src/main/resources/blog.html
[Sat 22:Remote Debug Spring Boot App in a Docker Container using IntelliJ IDEA, Thu 13:Cheat Sheet - Git Commands, Mon 20:Spring Security - Credentials from JSON request, Fri 03:Spring Security - Filter Chain, Sun 28:Spring Security - Form Login & Logout, Wed 17:KnockoutJS - Register Component, Mon 01:Spring Security - Getting Started, Sun 17:Tailwind CSS - First Steps, Fri 24:Project Github Pull Request Viewer, Fri 13:Github API - OAuth tokens for apps, Sat 07:Netlify Functions, Fri 07:Github API - First Steps, Sun 05:Auth0 - Simple Example with Vanilla JS, Tue 08:Intro]
Reading document from url: https://ckinan.com/blog
[Sat 22:Remote Debug Spring Boot App in a Docker Container using IntelliJ IDEA, Thu 13:Cheat Sheet - Git Commands, Mon 20:Spring Security - Credentials from JSON request, Fri 03:Spring Security - Filter Chain, Sun 28:Spring Security - Form Login & Logout, Wed 17:KnockoutJS - Register Component, Mon 01:Spring Security - Getting Started, Sun 17:Tailwind CSS - First Steps, Fri 24:Project Github Pull Request Viewer, Fri 13:Github API - OAuth tokens for apps, Sat 07:Netlify Functions, Fri 07:Github API - First Steps, Sun 05:Auth0 - Simple Example with Vanilla JS, Tue 08:Intro]

#Tooling

Something useful for "testing" an XPath expression, to see whether it is valid and actually matches what you expect, is this Chrome extension: XPath Helper. For the example above, I entered the XPath expression //div[contains(@class, 'leading-snug')] and checked that the results made sense.
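If you'd rather do a quick check from Java itself, the JDK's built-in javax.xml.xpath API can compile an expression without touching any HTML; compilation throws for malformed expressions. A small sketch under that assumption (the `XPathCheck` class is made up here, and note this only validates syntax, it doesn't tell you whether the expression matches anything in your page):

```java
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

public class XPathCheck {
    // Returns true if the expression compiles, i.e. is syntactically valid XPath
    public static boolean isValidXPath(String expr) {
        try {
            XPathFactory.newInstance().newXPath().compile(expr);
            return true;
        } catch (XPathExpressionException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidXPath("//div[contains(@class, 'leading-snug')]")); // true
        System.out.println(isValidXPath("//div[contains(@class,"));                  // false
    }
}
```

This catches typos like an unclosed bracket before you ever run the scraper, while a tool like XPath Helper remains the better way to check what the expression actually selects.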

#Final thoughts

I needed an XPath selector, and Xsoup worked well without a lot of boilerplate, which is exactly what I was looking for. So thanks, Xsoup!