Web scraping can often seem like a daunting task, especially when dealing with complex websites with many different elements. However, using Ruby and the Nokogiri gem, we can simplify this task and scrape websites more effectively.
In this post, I will illustrate how to scrape Wikipedia to fetch specific elements from a webpage. Please remember that web scraping should always be done responsibly, in compliance with the website's terms of service.
Setting Up:
First, you'll need to install the nokogiri gem. You can do this with the following command: gem install nokogiri
The Code:
require 'nokogiri'
require 'open-uri'

def scrape_wikipedia(url)
  # Download the page and parse it into a Nokogiri document
  document = Nokogiri::HTML(URI.open(url))

  # The first h1 on the page holds the article title
  page_title = document.css('h1').first.text
  puts "Page Title: #{page_title}"

  # The infobox is the table carrying both the infobox and vevent classes
  infobox_vevent = document.css('.infobox.vevent')
  infobox_title = infobox_vevent.css('.infobox-title.summary').text
  puts "Infobox Title: #{infobox_title}"

  # Pick the third and fourth rows (indices are zero-based)
  tbody_tr_elements = infobox_vevent.css('tbody tr')
  third_tr_element = tbody_tr_elements[2]
  fourth_tr_element = tbody_tr_elements[3]

  # Print the label/data pairs only when both rows exist
  if third_tr_element && fourth_tr_element
    third_label = third_tr_element.css('.infobox-label').text
    third_data = third_tr_element.css('.infobox-data').text
    puts "#{third_label}: #{third_data}"

    fourth_label = fourth_tr_element.css('.infobox-label').text
    fourth_data = fourth_tr_element.css('.infobox-data').text
    puts "#{fourth_label}: #{fourth_data}"
  end
end

scrape_wikipedia('https://en.wikipedia.org/wiki/Ruby_(programming_language)')
Code Breakdown:
The script above uses Ruby's open-uri module and the nokogiri gem to scrape data from Wikipedia. The scrape_wikipedia method receives a URL and fetches the HTML content from that URL. Nokogiri then parses this HTML content into a document object we can query from Ruby.
We use CSS selectors to target specific elements on the page. document.css('h1').first.text fetches the text of the first h1 element, which is usually the page's title.
The '.infobox.vevent' and '.infobox-title.summary' selectors are used to fetch the infobox and its title on Wikipedia pages. Each one matches a single element that carries both classes. The infobox generally holds summary information about the page's topic.
We also fetch specific rows within the infobox's tbody (the 3rd and 4th rows, which sit at the zero-based indices 2 and 3) and extract the labels and data. The values are printed only when both rows exist.
Conclusion:
And there you have it! A simple way to extract specific information from a Wikipedia page using Ruby and Nokogiri.
Remember, while web scraping can be a powerful tool, it's essential to use it responsibly to respect the website's terms and resources.
Happy scraping!