Introduction to Web Scraping With Java

Kevin Sahin - Dec 30 '19 - - Dev Community

Web scraping, or crawling, means fetching data from a third-party website by downloading and parsing its HTML to extract the data you want.

Since not every website offers a clean API, or an API at all, web scraping can be the only way to extract a website's information.
Lots of companies use it for competitor price monitoring, news aggregation, mass email collect…

Almost everything can be extracted from HTML; the only information that is “difficult” to extract is inside images or other media.

In this post, we are going to see basic techniques for fetching and parsing data in Java.

Prerequisites

  • Basic Java understanding
  • Basic XPath

Tools

You will need Java 8 and HtmlUnit:

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.19</version>
</dependency>


If you are using Eclipse, I suggest you configure the max length in the detail pane (when you click in the Variables tab) so that you can see the entire HTML of your current page.

Let's scrape Craigslist

For our first example, we are going to fetch items from Craigslist, since it doesn't seem to offer an API, collect their names, prices, and images, and export them to JSON.

First, let's take a look at what happens when you search for an item on Craigslist. Open Chrome DevTools and click on the Network tab:

The search URL is:

https://newyork.craigslist.org/search/moa?is_paid=all&search_distance_type=mi&query=iphone+6s

You can also use:

https://newyork.craigslist.org/search/sss?sort=rel&query=iphone+6s  
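The `+` in these URLs comes from form-encoding the search text. Here is a minimal sketch of how such a URL can be built with the JDK alone; the class and method names (`SearchUrlSketch`, `buildSearchUrl`) are made up for illustration, and only the base URL comes from the example above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SearchUrlSketch {
    // Base search URL taken from the article; only the query value changes.
    static final String SEARCH_BASE =
        "https://newyork.craigslist.org/search/sss?sort=rel&query=";

    // Form-encodes the query: spaces become '+', reserved characters are percent-escaped.
    static String buildSearchUrl(String query) {
        try {
            return SEARCH_BASE + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed by the Java platform, so this should never happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildSearchUrl("iphone 6s"));
    }
}
```

Encoding the query this way keeps characters such as `&` or `=` in the search text from being read as URL parameters.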

Now open your favorite IDE; it is time to code. HtmlUnit needs a WebClient to make a request. There are many options (proxy settings, browser, redirects enabled...).

We are going to disable JavaScript since it's not required for our example, and disabling it makes the page load faster:

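import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.net.URLEncoder;

String searchQuery = "Iphone 6s";

// CSS and JavaScript are not needed here, and disabling them speeds up page loads
WebClient client = new WebClient();
client.getOptions().setCssEnabled(false);
client.getOptions().setJavaScriptEnabled(false);
try {
  // Encode the query so that spaces and special characters are valid in the URL
  String searchUrl = "https://newyork.craigslist.org/search/sss?sort=rel&query=" + URLEncoder.encode(searchQuery, "UTF-8");
  HtmlPage page = client.getPage(searchUrl);
} catch (Exception e) {
  e.printStackTrace();
}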

The HtmlPage object will contain the HTML code; you can access it with the asXml() method.

Now we are going to fetch titles, images, and prices. We need to inspect the DOM structure for an item :

With HtmlUnit you have several options to select an HTML tag:

  • getHtmlElementById(String id)
  • getFirstByXPath(String Xpath)
  • getByXPath(String XPath) which returns a List
  • many others, rtfm !

Since there isn't any ID we could use, we have to write an XPath expression to select the tags we want.

XPath is a query language for selecting XML nodes (HTML in our case).
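To get a feel for XPath before pointing it at a real page, here is a small sketch that uses the XPath engine bundled with the JDK (`javax.xml.xpath`) instead of HtmlUnit. The markup in `SAMPLE` is made up and only mimics a list of results; the class name `XPathSketch` and the method `titles` are also illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {
    // Made-up, well-formed markup that mimics a result list; not real Craigslist HTML.
    static final String SAMPLE =
        "<ul>"
      + "<li class='result-row'><p class='result-info'><a href='/a.html'>iPhone 6s 64GB</a></p></li>"
      + "<li class='result-row'><p class='result-info'><a href='/b.html'>iPhone 6s case</a></p></li>"
      + "<li class='other'><p class='result-info'><a href='/c.html'>Ignored</a></p></li>"
      + "</ul>";

    static List<String> titles(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Absolute expression: every <li> whose class attribute is exactly "result-row"
            NodeList rows = (NodeList) xpath.evaluate(
                "//li[@class='result-row']",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                Node row = rows.item(i);
                // Relative expression (leading dot): search only inside the current row
                result.add(xpath.evaluate(".//p[@class='result-info']/a", row));
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath or XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titles(SAMPLE)); // prints [iPhone 6s 64GB, iPhone 6s case]
    }
}
```

The leading dot in `.//p[...]` matters: without it, `//p[...]` searches the whole document again instead of the current row.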

First, we are going to select all the <li> tags that have the class result-row.

Then we will iterate through this list, and for each item select the name, price, and URL, and then print them.

List<HtmlElement> items = (List<HtmlElement>) page.getByXPath("//li[@class='result-row']");
if (items.isEmpty()) {
  System.out.println("No items found !");
} else {
  for (HtmlElement htmlItem : items) {
    // Relative XPath (leading dot): search only inside the current row
    HtmlAnchor itemAnchor = (HtmlAnchor) htmlItem.getFirstByXPath(".//p[@class='result-info']/a");
    HtmlElement spanPrice = (HtmlElement) htmlItem.getFirstByXPath(".//a/span[@class='result-price']");

    String itemName = itemAnchor.asText();
    String itemUrl = itemAnchor.getHrefAttribute();

    // It is possible that an item doesn't have any price
    String itemPrice = spanPrice == null ? "0.0" : spanPrice.asText();

    System.out.println(String.format("Name : %s Url : %s Price : %s", itemName, itemUrl, itemPrice));
  }
}

Then, instead of just printing the results, we are going to convert them to JSON, using the Jackson library to map items to JSON format.

We need a POJO (plain old Java object) to represent items:

Item.java

import java.math.BigDecimal;

public class Item {
    private String title;
    private BigDecimal price;
    private String url;
    // getters and setters
}

Then add this to your pom.xml:

<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-databind</artifactId>
  <version>2.7.0</version>
</dependency>

Now all we have to do is create an Item, set its attributes, and convert it to a JSON string (or a file...), adapting the previous code a little bit:

// Create the mapper once and reuse it for every item
ObjectMapper mapper = new ObjectMapper();

for (HtmlElement htmlItem : items) {
   HtmlAnchor itemAnchor = (HtmlAnchor) htmlItem.getFirstByXPath(".//p[@class='result-info']/a");
   HtmlElement spanPrice = (HtmlElement) htmlItem.getFirstByXPath(".//a/span[@class='result-price']");

   // It is possible that an item doesn't have any price,
   // we set the price to 0.0 in this case
   String itemPrice = spanPrice == null ? "0.0" : spanPrice.asText();

   Item item = new Item();
   item.setTitle(itemAnchor.asText());
   // baseUrl is not defined in this snippet; it is assumed to be declared earlier and to hold the site root
   item.setUrl(baseUrl + itemAnchor.getHrefAttribute());
   item.setPrice(new BigDecimal(itemPrice.replace("$", "")));

   // writeValueAsString throws a checked JsonProcessingException,
   // so this loop has to stay inside the try block
   String jsonString = mapper.writeValueAsString(item);
   System.out.println(jsonString);
}
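One thing to watch in the loop above: `new BigDecimal(...)` throws a `NumberFormatException` on anything that is not a plain number, for example a price with a thousands separator such as `$1,200`. Here is a possible sketch of a more tolerant conversion. The helper name `parsePrice` is made up, and it assumes US-style prices (comma for thousands, dot for decimals).

```java
import java.math.BigDecimal;

class PriceSketch {
    // Converts text such as "$1,200" to a BigDecimal.
    // Returns zero when the price is missing or cannot be parsed.
    static BigDecimal parsePrice(String raw) {
        if (raw == null) {
            return BigDecimal.ZERO;
        }
        // Keep digits and the decimal point only, dropping "$", "," and whitespace
        String cleaned = raw.replaceAll("[^0-9.]", "");
        if (cleaned.isEmpty()) {
            return BigDecimal.ZERO;
        }
        try {
            return new BigDecimal(cleaned);
        } catch (NumberFormatException e) {
            // Leftovers such as "." or "1.2.3" are not valid numbers
            return BigDecimal.ZERO;
        }
    }

    public static void main(String[] args) {
        System.out.println(parsePrice("$1,200")); // prints 1200
        System.out.println(parsePrice(null));     // prints 0
    }
}
```

With such a helper, the loop could call `item.setPrice(parsePrice(spanPrice == null ? null : spanPrice.asText()))`. Whether a missing price should map to zero or to `null` is a design choice; zero simply mirrors the example above.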

Go further

This example is not perfect; there are many things that can be improved:

  • Multi-city search
  • Handling pagination
  • Multi-criteria search
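As a starting point for the first two ideas, here is a rough sketch that generates one search URL per city and per result page. It rests on assumptions that are not verified in this post: that each city lives on its own subdomain (as `newyork` does in the URLs above), and that results are paged with an offset parameter named `s`. The names `SearchPlanSketch` and `searchUrls`, the city `boston`, and the page size of 120 are all illustrative; check the real URLs in the Network tab before relying on them.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SearchPlanSketch {
    // Builds one URL per (city, page) pair.
    // ASSUMPTION: "s" is the result offset parameter; verify it against the live site.
    static List<String> searchUrls(List<String> cities, String query, int pages, int pageSize) {
        String encoded;
        try {
            encoded = URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed by the Java platform
            throw new IllegalStateException(e);
        }
        List<String> urls = new ArrayList<>();
        for (String city : cities) {
            for (int page = 0; page < pages; page++) {
                urls.add("https://" + city + ".craigslist.org/search/sss?sort=rel&query="
                    + encoded + "&s=" + (page * pageSize));
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // "boston" and the page size of 120 are placeholders for illustration
        for (String url : searchUrls(Arrays.asList("newyork", "boston"), "iphone 6s", 2, 120)) {
            System.out.println(url);
        }
    }
}
```

Each generated URL can then be passed to `client.getPage(...)` and parsed with the same XPath expressions as before.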

You can find the code in this GitHub repo.

This was my first blog post. I hope you enjoyed it; feel free to give me any feedback in the comments.

Further reading

I recently wrote a blog post, Web Scraping without getting blocked, which explains the different techniques for hiding your scrapers. Check it out!
