Crawling websites with Elixir and Crawly

Codegram - Aug 13 '20 - Dev Community

In this post, I'd like to introduce Crawly, an Elixir library for crawling websites.

My idea for this post is to give a quick introduction to how we can use Crawly. For this little example we're going to extract the latest post titles from our website and write them to a file, so let's do it.

First of all, we need Elixir installed; in case you don't have it yet, you can check this guide.
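
You can verify the installation by checking the version from your terminal:

elixir --version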

Once we have Elixir installed, let's create our Elixir application with a built-in supervisor:

mix new crawler --sup

In order to add the Crawly dependencies to our project, we are going to change the deps function in the mix.exs file so it looks like this:

defp deps do
    [
        {:crawly, "~> 0.10.0"},
        {:floki, "~> 0.26.0"} # used to parse html
    ]
end

We need to install the dependencies we just added by running the command below:

mix deps.get

Let's create a spider file lib/crawler/blog_spider.ex that is going to make a request to our blog, query the HTML response to get the post titles, and then return a ParsedItem, which contains items and requests. We are going to leave requests as an empty list to keep it simple (there's a sketch of how it could be filled in right after this snippet).

defmodule BlogSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://www.codegram.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://www.codegram.com/blog"]] # urls that are going to be parsed

  @impl Crawly.Spider
  def parse_item(response) do
    {:ok, document} = Floki.parse_document(response.body)

    items =
      document
      |> Floki.find("h5.card-content__title") # query h5 elements with class card-content__title
      |> Enum.map(&Floki.text/1)
      |> Enum.map(fn title -> %{title: title} end)

    %Crawly.ParsedItem{items: items, requests: []}
  end
end
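
If we later wanted the spider to follow links as well (for example, to crawl each post's own page), the requests list is where those would go. Here's a minimal sketch of a drop-in replacement for parse_item, assuming the post cards are wrapped in a.card-content links (a made-up selector for illustration) and using Crawly's URL helpers:

  @impl Crawly.Spider
  def parse_item(response) do
    {:ok, document} = Floki.parse_document(response.body)

    items =
      document
      |> Floki.find("h5.card-content__title")
      |> Enum.map(&Floki.text/1)
      |> Enum.map(fn title -> %{title: title} end)

    # Hypothetical: turn the post links into follow-up requests
    requests =
      document
      |> Floki.find("a.card-content") # assumed selector for the post card links
      |> Floki.attribute("href") # relative URLs such as "/blog/some-post"
      |> Crawly.Utils.build_absolute_urls(base_url())
      |> Crawly.Utils.requests_from_urls()

    %Crawly.ParsedItem{items: items, requests: requests}
  end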

Now that we have our spider created, it would be nice to save what we're extracting to a file. To do this, we can use a pipeline provided by Crawly called Crawly.Pipelines.WriteToFile. For that, we need a config folder and a config.exs file:

mkdir config # creates the config directory
touch config/config.exs # creates an empty file called config.exs inside the config folder
mkdir -p priv/output # creates output folder inside priv where we are going to store our files

Now let's write the configuration that saves the items extracted by our spider into a file:

use Mix.Config

config :crawly,
  pipelines: [
    Crawly.Pipelines.JSONEncoder, # encode each item into json
    {Crawly.Pipelines.WriteToFile, folder: "priv/output/", extension: "jl"} # stores the items into a file inside the folder specified
  ]
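
Crawly ships with a few other useful pipelines too. As a sketch (the option names below follow the Crawly 0.10 docs, so double-check them against the version you install), we could also validate that every item has a title and drop duplicates before writing to the file:

use Mix.Config

config :crawly,
  pipelines: [
    {Crawly.Pipelines.Validate, fields: [:title]}, # drop items that are missing a title
    {Crawly.Pipelines.DuplicatesFilter, item_id: :title}, # skip titles we've already seen
    Crawly.Pipelines.JSONEncoder, # encode each item into json
    {Crawly.Pipelines.WriteToFile, folder: "priv/output/", extension: "jl"} # stores the items into a file
  ]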

Now that we are good to go, we can open the Elixir REPL:

iex -S mix

And then we can execute our spider:

Crawly.Engine.start_spider(BlogSpider)
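
The engine can also tell us which spiders are running and stop one manually when we're done (the exact return value of running_spiders/0 may differ between Crawly versions):

Crawly.Engine.running_spiders() # list the spiders that are currently running
Crawly.Engine.stop_spider(BlogSpider) # stop our spider manually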

The spider is going to be executed by a supervisor, and then we should see a new file inside the priv/output folder. In my case, the latest posts showing on the first page are:

{"title":"\"High tech, high touch\": A communication toolkit for virtual team"}
{"title":"My learning experience in a fully remote company as a Junior Developer"}
{"title":"Finding similar documents with transformers"}
{"title":"UX… What?"}
{"title":"Slice Machine from Prismic"}
{"title":"Stop (ab)using z-index"}
{"title":"Angular for Junior Backend Devs"}
{"title":"Jumping into the world of UX 🦄"}
{"title":"Gettin' jiggy wit' Git - Part 1"}

This is just a simple example of what's possible with Crawly. I hope you enjoyed this introduction, and remember to be responsible when extracting data from websites.
