XPath selector on HTML with Java
Created: 2020-09-30 | 6 min read
I wanted to explore how easy it is to extract elements from HTML files, or directly from website URLs, programmatically using Java. I found a third-party library that gave me the desired results without much difficulty.
It’s XSoup.
- Github Repo: https://github.com/code4craft/xsoup
- Maven Repo: https://mvnrepository.com/artifact/us.codecraft/xsoup
With XPath expressions it can select elements within the HTML, using Jsoup as the HTML parser.
In the example below, it extracts the posts of this website either from the URL (https://ckinan.com) or from the HTML of that exact page downloaded into the repository.
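As a quick sketch of what that looks like (an illustrative snippet, assuming the XSoup dependency shown later is on the classpath; the class name and sample HTML are made up for the example):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import us.codecraft.xsoup.Xsoup;

import java.util.ArrayList;
import java.util.List;

public class QuickSketch {
    /** Returns the text of every element matched by the XPath expression. */
    public static List<String> select(String html, String xpath) {
        // Jsoup parses the (possibly messy) HTML; XSoup evaluates the XPath on the result.
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element e : Xsoup.compile(xpath).evaluate(doc).getElements()) {
            texts.add(e.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<div class='post'>First post</div><div class='post'>Second post</div>";
        System.out.println(select(html, "//div[@class='post']"));
        // prints [First post, Second post]
    }
}
```

The full version of this idea, with file and URL sources, is the `App.java` shown below.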
#The code
Repo: https://github.com/ckinan/learning/tree/main/java/parse-html
We got:
- One single file, App.java, to place our logic
- One single dependency in our pom.xml file: XSoup
#App.java
I tried to explain the code with the Javadoc and inline comments there, to point out what the lines and methods are supposed to do.
package com.ckinan;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import us.codecraft.xsoup.Xsoup;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

/**
 * Code snippet to find elements within an HTML document (from a file and from a URL) using XPath.
 *
 * It uses a third-party library called XSoup. Links:
 * - mvn repository: https://mvnrepository.com/artifact/us.codecraft/xsoup
 * - github repository: https://github.com/code4craft/xsoup
 *
 * This example reads a list of blog posts using a fixed XPath expression and prints the results to the console.
 */
public class App {

    public static final String XPATH_BLOG_POST = "//div[contains(@class, 'leading-snug')]";

    /**
     * The main method of the program. It starts and orchestrates the process:
     * it gets Document instances from specific sources and reads specific elements
     * from them by XPath expression.
     *
     * It retrieves documents from:
     * - Files (by specifying the full path)
     * - URLs (of the website to be parsed)
     *
     * @param args Arguments of our application. Won't be used at all.
     * @throws java.io.IOException
     */
    public static void main(String[] args) throws IOException {
        // Call the static methods that get an instance of the HTML document and then evaluate
        // the XPath expression to get a list of blog posts. First with an HTML file, then with
        // the website URL directly.
        Document doc1 = App.getDocumentFromFile("src/main/resources/blog.html");
        List<String> postsFromFile = App.readDocument(doc1, App.XPATH_BLOG_POST);
        System.out.println(postsFromFile);

        Document doc2 = App.getDocumentFromUrl("https://ckinan.com/blog");
        List<String> postsFromUrl = App.readDocument(doc2, App.XPATH_BLOG_POST);
        System.out.println(postsFromUrl);
    }

    /**
     * Reads the content of the file at the given full path using "collect" from the Java 8+ Stream API.
     *
     * Refs:
     * - https://docs.oracle.com/javase/8/docs/api/java/nio/file/Paths.html#get-java.lang.String-java.lang.String...-
     * - https://docs.oracle.com/javase/8/docs/api/java/nio/file/Files.html#lines-java.nio.file.Path-
     * - https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#collect-java.util.stream.Collector-
     *
     * Then it uses Jsoup to parse the file content into a Document instance.
     *
     * @param fullPath The full path of the file to be read and parsed using Jsoup.
     * @throws java.io.IOException
     * @return The parsed document, which contains all the elements of the HTML file.
     */
    public static Document getDocumentFromFile(String fullPath) throws IOException {
        System.out.println("Reading document from file: " + fullPath);
        // Read the file content using Java 8+ Streams.
        String html = Files.lines(Paths.get(fullPath)).collect(Collectors.joining(System.lineSeparator()));
        return Jsoup.parse(html);
    }

    /**
     * Reads the content of the given website URL using Jsoup.
     *
     * @param url The URL of the website HTML to be read and parsed using Jsoup.
     * @throws java.io.IOException
     * @return The parsed document, which contains all the elements of the HTML page.
     */
    public static Document getDocumentFromUrl(String url) throws IOException {
        System.out.println("Reading document from url: " + url);
        return Jsoup.connect(url).get();
    }

    /**
     * Evaluates the given XPath expression on the given Document. Right now it is limited to
     * expressions that select a list of elements. Returns the results of the XPath evaluation.
     *
     * @param doc The Document to evaluate the XPath expression against.
     * @param xpath The XPath expression to evaluate.
     * @return The list of string values resulting from the XPath expression.
     */
    public static List<String> readDocument(Document doc, String xpath) {
        List<String> result = new ArrayList<>();

        // First, extract the Element instances from the HTML content using the given Document and
        // the XPath expression.
        // Note: Xsoup uses Jsoup as the HTML parser. Xsoup "evaluates" a Document, which is an
        // instance created by Jsoup.
        List<Element> elements =
            Xsoup.compile(xpath).evaluate(doc).getElements();

        for (Element e : elements) {
            // Each Element object has multiple properties available to be extracted, for example
            // the HTML tag name. For this code snippet, we are interested in extracting the text
            // of the element.
            // Example: <p>Hello world</p> -> e.text() returns "Hello world"
            result.add(e.text());
        }

        return result;
    }

}
#pom.xml (dependencies section only)
<dependencies>
    <!-- https://mvnrepository.com/artifact/us.codecraft/xsoup -->
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>xsoup</artifactId>
        <version>0.3.1</version>
    </dependency>
</dependencies>
#Testing
Run the App.java class, and you should see this printed in the console:
Reading document from file: src/main/resources/blog.html
[Sat 22:Remote Debug Spring Boot App in a Docker Container using IntelliJ IDEA, Thu 13:Cheat Sheet - Git Commands, Mon 20:Spring Security - Credentials from JSON request, Fri 03:Spring Security - Filter Chain, Sun 28:Spring Security - Form Login & Logout, Wed 17:KnockoutJS - Register Component, Mon 01:Spring Security - Getting Started, Sun 17:Tailwind CSS - First Steps, Fri 24:Project Github Pull Request Viewer, Fri 13:Github API - OAuth tokens for apps, Sat 07:Netlify Functions, Fri 07:Github API - First Steps, Sun 05:Auth0 - Simple Example with Vanilla JS, Tue 08:Intro]
Reading document from url: https://ckinan.com/blog
[Sat 22:Remote Debug Spring Boot App in a Docker Container using IntelliJ IDEA, Thu 13:Cheat Sheet - Git Commands, Mon 20:Spring Security - Credentials from JSON request, Fri 03:Spring Security - Filter Chain, Sun 28:Spring Security - Form Login & Logout, Wed 17:KnockoutJS - Register Component, Mon 01:Spring Security - Getting Started, Sun 17:Tailwind CSS - First Steps, Fri 24:Project Github Pull Request Viewer, Fri 13:Github API - OAuth tokens for apps, Sat 07:Netlify Functions, Fri 07:Github API - First Steps, Sun 05:Auth0 - Simple Example with Vanilla JS, Tue 08:Intro]
#Tooling
Something useful for "testing" an XPath expression, to see whether it's valid and whether it will work, is this Chrome extension: XPath Helper. For the example above, I entered the XPath expression //div[contains(@class, 'leading-snug')] and checked that the results made sense.
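Another way to sanity-check an expression without installing anything is the JDK's built-in javax.xml.xpath package. It only accepts well-formed XML (unlike Jsoup, which tolerates real-world HTML), but for a small hand-written fragment that's fine. A minimal sketch (the class name and sample markup are made up for the example):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class XPathCheck {
    /** Counts the nodes matched by an XPath expression in a well-formed XML string. */
    public static int countMatches(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // NODESET asks the evaluator for the full list of matching nodes.
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        // A tiny well-formed fragment standing in for the blog page.
        String xml = "<html><body>"
                + "<div class=\"leading-snug\">Post one</div>"
                + "<div class=\"leading-snug\">Post two</div>"
                + "<div class=\"other\">Sidebar</div>"
                + "</body></html>";
        System.out.println(countMatches(xml, "//div[contains(@class, 'leading-snug')]"));
        // prints 2
    }
}
```

If the expression matches the nodes you expect here, it should behave the same way in XSoup for the supported XPath subset.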

#Final thoughts
I needed an XPath selector, and XSoup worked well and without a lot of boilerplate, which is exactly what I was looking for. So thanks, XSoup!
#Links:
- XSoup: https://github.com/code4craft/xsoup
- Jsoup: https://jsoup.org/
- Repo of the code snippet: https://github.com/ckinan/learning/tree/main/java/parse-html
- This tutorial helped me to understand the basics of XPath: https://www.w3schools.com/xml/xpath_syntax.asp