49 Days of Ruby: Day 35 - Web Scraping

Ben Greenberg - Apr 30 '21 - - Dev Community

Welcome to day 35 of the 49 Days of Ruby! 🎉

Now that we know a bit about HTTP and making HTTP requests in Ruby, today we'll discuss how to use that knowledge to scrape the web!

Web scraping is where you write some code that fetches a resource off the web and gives you some content from that website. It is an alternative to using APIs (more about that tomorrow) and is often used when there is no API available.

tl;dr Today's resources come from this excellent blog post by Sylwia, a DEV community member and friend:

Making the HTTP Request

If you recall from yesterday, we made HTTP requests using the net/http library. Today, we will use open-uri, which also ships with Ruby's standard library:

require "open-uri"

# URI.open returns an IO-like object containing the page's HTML.
# (Plain Kernel#open no longer accepts URLs as of Ruby 3.0.)
html = URI.open("https://en.wikipedia.org/wiki/Douglas_Adams")

The above example looks a lot like our fetching of the blog post yesterday, except even more condensed. The variable html now holds an IO-like object containing the HTML of the Wikipedia page for Douglas Adams, which we can hand straight to a parser.

Our next step is to parse that HTML.

Parsing the HTML

A popular gem for parsing HTML is Nokogiri. It is very powerful, and its usage can grow considerably more complex as you build out more intricate applications.

In our case, we will try to pare down our usage of it:

require "nokogiri"

# Parse the IO object (or a String of HTML) into a searchable document.
response = Nokogiri::HTML(html)

The response variable now holds a Nokogiri::HTML::Document object. This is the same HTML, but parsed into a tree of nested nodes that we can search and traverse.

We now have our HTML in a structure that we can scrape some data from.

Scrape Away

For our example, we'll get just the main body text for Douglas Adams.

We do that by finding some kind of identifier on the Wikipedia page that we can target. HTML is the language that defines a page's structure. Another language, which we are not covering in depth but need to mention, is CSS, which styles that structure. CSS selectors can target tags, classes, and ids, and we can use those same selectors to identify the part of the page we want to scrape.

In the case of the Wikipedia page, the text lives inside p tags. We can call Nokogiri's #css method on our parsed document, passing "p" as the selector, to get just that text:

# Note that we query the parsed document (response), not the raw IO object.
text = response.css("p").text

Now, if you inspect text you will see it contains the entire description for Douglas Adams from Wikipedia. You've successfully scraped a site!

If you want to read more about this, I highly recommend Sylwia's post. She goes into a lot more detail than our format provides. Continue to share your learnings with the community using the hashtag #49daysofruby!

Come back tomorrow for the next installment of 49 Days of Ruby! You can join the conversation on Twitter with the hashtag #49daysofruby.
