Web scraping is a skill that comes in handy in a number of situations, mainly when you need to pull a particular set of data from a website. It is probably used most often in engineering and the sciences for retrieving data such as statistics or articles containing specific keywords. In this tutorial I will teach you how to scrape a website for the latter: articles with specific keywords.
Before we begin, I want to introduce web scraping and some of its limitations. Web scraping, also known as web harvesting or web data extraction, is a method of automatically extracting data from websites over the internet. The method of parsing I will be teaching you today is HTML parsing, which means our web scraper will look at the HTML content of a page and extract the information that matches the class we want to retrieve information from (if this doesn't make sense yet, don't worry; I'll go into more detail later!). This method of web scraping is limited by the fact that not all websites store all of their information in HTML. Much of what we see today is dynamic, built after the page has loaded. Seeing that information requires a more sophisticated web crawler, typically one that loads and renders the page like a browser, which is beyond the scope of this tutorial.
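To make the idea of HTML parsing concrete before we touch any libraries, here is a toy sketch using only the standard library and invented markup. It shows what "look at the HTML and pull out elements with a given class" means in the simplest possible terms. (A regex is good enough for this tidy fragment; for real pages we'll use AngleSharp, which handles messy real-world HTML correctly.)

```csharp
using System;
using System.Text.RegularExpressions;

class HtmlParsingIdea
{
    static void Main()
    {
        // A made-up fragment of page HTML; real pages are far larger.
        string html =
            "<div class=\"article\"><a href=\"/a\">Ocean Currents</a></div>" +
            "<div class=\"sidebar\"><a href=\"/b\">Site Map</a></div>" +
            "<div class=\"article\"><a href=\"/c\">Nature Watch</a></div>";

        // "HTML parsing" here just means: find the elements whose class
        // matches, then pull out the text and link we care about.
        var matches = Regex.Matches(html,
            "<div class=\"article\"><a href=\"([^\"]*)\">([^<]*)</a></div>");

        foreach (Match m in matches)
            Console.WriteLine($"{m.Groups[2].Value} - {m.Groups[1].Value}");
        // Prints:
        // Ocean Currents - /a
        // Nature Watch - /c
    }
}
```

Note how the sidebar link is skipped: only elements with the class we asked for come back, which is exactly the behavior we'll get from AngleSharp later.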
I chose to build a web scraper in C# because the majority of tutorials build their web scrapers in Python. Although Python is likely the ideal language for the job, I wanted to prove to myself that it can be done in C#. I also hope to help others learn to build their own web scrapers by providing one of only a few C# web scraping tutorials (as of the time of writing).
Building a Web Scraper
The website we will be scraping is Ocean Networks Canada, a website dedicated to providing information about the ocean and our planet. If you are using this project to scrape the internet for articles and data, you will find that this website follows a model similar to many other websites you will encounter.
1. Launch Visual Studio and create a new C# .NET Windows Forms Application.
2. Design a basic Form with a Button to start the scraper and a RichTextBox for printing the results.
3. Open the NuGet Package Manager by right-clicking your project name in the Solution Explorer and selecting "Manage NuGet Packages". Search for "AngleSharp" and click Install.
4. Add an array of query terms (the words you want your article titles to contain) and create a method where we will set up our document to scrape. Your code should look like the following:
```csharp
// At the top of your Form's .cs file (namespaces for recent AngleSharp versions):
// using System.IO;
// using System.Net.Http;
// using System.Threading;
// using AngleSharp.Html.Dom;
// using AngleSharp.Html.Parser;

private string Title { get; set; }
private string Url { get; set; }
private string siteUrl = "https://www.oceannetworks.ca/news/stories";

public string[] QueryTerms { get; } = { "Ocean", "Nature", "Pollution" };

internal async void ScrapeWebsite()
{
    CancellationTokenSource cancellationToken = new CancellationTokenSource();
    HttpClient httpClient = new HttpClient();

    HttpResponseMessage request = await httpClient.GetAsync(siteUrl);
    cancellationToken.Token.ThrowIfCancellationRequested();

    Stream response = await request.Content.ReadAsStreamAsync();
    cancellationToken.Token.ThrowIfCancellationRequested();

    HtmlParser parser = new HtmlParser();
    IHtmlDocument document = parser.ParseDocument(response);
}
```
CancellationTokenSource provides a token that signals when cancellation has been requested by a task or thread.
HttpClient provides a base class for sending HTTP requests and receiving HTTP responses from a URI-identified resource.
HttpResponseMessage represents an HTTP response message and includes the status code and data.
HtmlParser and IHtmlDocument are AngleSharp classes that allow you to build and parse documents from website HTML content.
5. Create another method to get and display the results from your AngleSharp document. Here we will parse the document and retrieve any articles that match our QueryTerms. This can be tricky, as no two websites use the same HTML naming conventions, and it can take some trial and error to get the "articleLink" LINQ query correct:
```csharp
// Also requires: using System.Linq; using AngleSharp.Dom;
private void GetScrapeResults(IHtmlDocument document)
{
    IEnumerable<IElement> articleLink;

    foreach (var term in QueryTerms)
    {
        articleLink = document.All.Where(x =>
            x.ClassName == "views-field views-field-nothing" &&
            (x.ParentElement.InnerHtml.Contains(term) ||
             x.ParentElement.InnerHtml.Contains(term.ToLower())));

        // Check inside the loop; otherwise only the last term's
        // results would ever be inspected.
        if (articleLink.Any())
        {
            // Print Results: See Next Step
        }
    }
}
```
If you aren't sure what happened here, I'll explain in more detail: We are looping through each of our QueryTerms (Ocean, Nature, and Pollution) and parsing through our document to find all instances where the ClassName is "views-field views-field-nothing" and where the ParentElement.InnerHtml contains the term we're currently querying.
If you're unfamiliar with how to see the HTML of a webpage, you can find it by navigating to your desired URL, right-clicking anywhere on the page, and choosing "View Page Source". Some pages have a small amount of HTML; others have tens of thousands of lines. You will need to sift through all of this to find where the article headers are stored, then determine the class that holds them. A trick I use is searching for part of one of the article headers, then moving up a few lines.
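You can apply that same "search for part of a header, then look just above it" trick in code. This sketch uses only the standard library, and the markup is invented for illustration: it locates a known article title in a raw HTML string, then scans backwards for the nearest class attribute, which is the class you would feed to your LINQ query.

```csharp
using System;

class FindClassTrick
{
    static void Main()
    {
        // Pretend this is a slice of the page source you're inspecting.
        string pageSource =
            "<span class=\"field-content\"><div><a href=\"/news/x\">" +
            "Mapping the Ocean Floor</a></div></span>";

        // Search for part of the article header...
        int titleIndex = pageSource.IndexOf("Mapping the Ocean", StringComparison.Ordinal);

        // ...then scan backwards for the nearest class attribute.
        int classIndex = pageSource.LastIndexOf("class=\"", titleIndex, StringComparison.Ordinal);
        int start = classIndex + "class=\"".Length;
        int end = pageSource.IndexOf('"', start);

        Console.WriteLine(pageSource.Substring(start, end - start));
        // Prints: field-content
    }
}
```

On a real page you would paste the downloaded source (or a chunk of it) into pageSource and search for a title you saw on the site.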
6. Now, if our query terms were fruitful, we should have a list of several chunks of HTML, inside of which are our article titles and URLs. Create a new method to print your results to the RichTextBox.
```csharp
public void PrintResults(string term, IEnumerable<IElement> articleLink)
{
    foreach (var result in articleLink)
    {
        // Clean Up Results: See Next Step

        // Append (+=) rather than assign, so earlier results aren't overwritten.
        resultsTextbox.Text += $"{Title} - {Url}{Environment.NewLine}";
    }
}
```
7. If we printed our results as-is, they would come out looking like raw HTML markup, with all the tags, angle brackets, and other non-human-friendly clutter. We need a method that cleans up our results before we print them to the form and, as in step 5, the markup will vary widely by website.
```csharp
private void CleanUpResults(IElement result)
{
    // Note: ReplaceFirst is a custom string extension method, not part of .NET.
    string htmlResult = result.InnerHtml.ReplaceFirst(
        " <span class=\"field-content\"><div><a href=\"", "https://www.oceannetworks.ca");
    htmlResult = htmlResult.ReplaceFirst("\">", "*");
    htmlResult = htmlResult.ReplaceFirst("</a></div>\n<div class=\"article-title-top\">", "-");
    htmlResult = htmlResult.ReplaceFirst("</div>\n<hr></span> ", "");

    // Split Results: See Next Step
}
```
So what happened here? I examined the InnerHtml of the incoming result object to see what extra markup needed to be removed from what I actually wanted to display: a Title and a URL. Working from left to right, I simply replaced each chunk of HTML with an empty string ("nothing"), except for the chunk between the URL and the title, which I replaced with a "*" as a placeholder to split the string on later. Each of these ReplaceFirst() calls will differ from website to website, and they may not work flawlessly on every article of a particular site. You can continue to add new replacements, or just ignore the stragglers if they are uncommon enough.
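One catch worth spelling out: ReplaceFirst() is not a built-in .NET string method (string.Replace replaces every occurrence). The code above assumes a custom extension method along these lines; this is a minimal sketch, and any implementation that replaces only the first occurrence will do:

```csharp
using System;

public static class StringExtensions
{
    // Replaces only the first occurrence of 'search' in 'text'.
    // string.Replace would replace every occurrence, which is not what
    // we want when trimming markup away piece by piece.
    public static string ReplaceFirst(this string text, string search, string replace)
    {
        int pos = text.IndexOf(search, StringComparison.Ordinal);
        if (pos < 0)
            return text; // nothing found: leave the string untouched

        return text.Substring(0, pos) + replace + text.Substring(pos + search.Length);
    }
}
```

For example, "a-b-c".ReplaceFirst("-", "+") yields "a+b-c": only the first hyphen is touched.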
8. I'm sure you noticed in the previous step that there's one last method to add before we can print a clean result to our textbox. Now that we've cleaned up our result string, we can use the "*" placeholder to split it into two strings: a Title and a URL.
```csharp
private void SplitResults(string htmlResult)
{
    string[] splitResults = htmlResult.Split('*');
    Url = splitResults[0];
    Title = splitResults[1];
}
```
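To see the clean-up and split stages work end to end, here's a self-contained sketch run on an invented InnerHtml string. The markup here is illustrative, not the site's exact HTML, which is exactly why the replacements need tuning per website:

```csharp
using System;

class CleanUpAndSplitDemo
{
    // Same idea as the ReplaceFirst extension used in the tutorial.
    static string ReplaceFirst(string text, string search, string replace)
    {
        int pos = text.IndexOf(search, StringComparison.Ordinal);
        return pos < 0 ? text
            : text.Substring(0, pos) + replace + text.Substring(pos + search.Length);
    }

    static void Main()
    {
        // Invented InnerHtml for a single scraped result.
        string htmlResult = "<a href=\"/news/deep-sea\">Deep Sea Vents</a>";

        // Clean up: turn the markup into "url*title", with * as placeholder.
        htmlResult = ReplaceFirst(htmlResult, "<a href=\"", "https://www.oceannetworks.ca");
        htmlResult = ReplaceFirst(htmlResult, "\">", "*");
        htmlResult = ReplaceFirst(htmlResult, "</a>", "");

        // Split on the placeholder into a URL and a Title.
        string[] parts = htmlResult.Split('*');
        Console.WriteLine($"{parts[1]} - {parts[0]}");
        // Prints: Deep Sea Vents - https://www.oceannetworks.ca/news/deep-sea
    }
}
```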
9. Finally, we have a clean, human-friendly result! If all went well and the articles haven't changed drastically since the time of writing, running your code should fill the textbox with the matching article titles and URLs (and more; there were a lot!) that your application has scraped from Ocean Networks.
I hope this tutorial has given you some insight into the world of web scraping. If there's enough interest, I can continue this series and teach you how to set up your application to do a fresh scrape at specific time intervals and send you a newsletter-style email with a day's or week's worth of results.
If you'd like to catch up with me on social media, come find me over on Twitter or LinkedIn and say hello!