I thought it might be fun to show how I spent an hour or so yesterday throwing together a simple web site that is hosted on GitHub Pages and is updated daily using GitHub Actions.
An itch to scratch
So many web sites start out with an itch to scratch, and this one is no different. In this case, it was about wanting to stay more informed.
Each day, the BBC News web site publishes a story that shows the front pages of all of the British newspapers. Although a lot of the British press isn't particularly trustworthy, I still think it's good to get an overview of what they're saying about the day's news. So finding and reading this story on the BBC site is one of my morning rituals.
But they don't make it easy to find the archive of those stories. So it's hard to read anything other than the current day's front pages - and even that story tends to vanish from the BBC site by lunchtime. I decided I'd like a page that contains an archive of links to these stories.
Scraping the site
The BBC don't publish an API for their web site, so we need to resort to screen-scraping. That, of course, makes the process inherently fragile, but it seems to be the best we can do at this stage.
It wasn't hard to create a program that pulls what I want from the web site using Web::Query (my tool of choice for scraping web sites).
Of course, having scraped the data, we need to store it somewhere. I decided to store it in a JSON document and worry about displaying it later.
So here's the code I wrote:
#!/usr/bin/perl

use strict;
use warnings;
use feature 'say';

use Time::Piece;
use Web::Query;
use JSON;

my $site  = 'https://bbc.co.uk';
my $start = "$site/news/";
my $file  = 'docs/papers.json';

my $js_p = JSON->new->pretty->canonical;

# Load any existing data so new links get appended to it
my $data;
if (-e $file) {
  open my $fh, '<', $file or die "$!\n";
  my $json = do { local $/; <$fh> };
  $data = $js_p->decode($json);
}
$data //= [];    # start with an empty list the first time we run

my $start_len = @$data;

my $q = wq($start);

$q->find('a')
  ->each(sub {
    my ($i, $elem) = @_;

    # We only want links to the daily front pages round-up
    return unless $elem->text =~ /^The Papers:/;

    push @$data, {
      date => localtime->strftime('%Y-%m-%d'),
      text => ($elem->text =~ s/^The Papers:\s*//r),
      link => $site . $elem->attr('href'),
    };
  });

if (@$data == $start_len) {
  warn "No new article found\n";
} else {
  open my $fh, '>', $file or die "$!\n";
  print $fh $js_p->encode($data);
}
Nothing too complex there. We look for all of the <a> tags in the page and ignore the ones that don't contain text starting with "The Papers:". We then extract the information we want (the text, the link and the date) and store that all in the JSON document.
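For reference, each entry in papers.json ends up looking something like this (the values here are made up for illustration; the canonical option just keeps the keys in a predictable order):
[
   {
      "date" : "2023-07-20",
      "link" : "https://bbc.co.uk/news/an-example-story-path",
      "text" : "An example front pages round-up headline"
   }
]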
At that point, I could run the code to create the JSON file. I then created the GitHub repository and turned on GitHub Pages for the repo. Once that was all working, I could browse to https://davorg.github.io/bbc_papers/papers.json to see the JSON.
Doing it every day
We need to run this code every day. That's simple enough with GitHub Actions. We simply add a workflow definition file to the repo. The workflow looks like this:
name: Overnight processing

on:
  schedule:
    - cron: '0 9 * * *'
  workflow_dispatch:

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - name: Install cpanm
        run: sudo apt update && sudo apt install cpanminus

      - name: Install dependencies
        run: sudo cpanm -n --installdeps .

      - name: Add data
        run: ./get_link

      - name: Commit new page
        run: |
          GIT_STATUS=$(git status --porcelain)
          echo $GIT_STATUS

          git config user.name github-actions[bot]
          git config user.email 41898282+github-actions[bot]@users.noreply.github.com
          git add docs/
          if [ "$GIT_STATUS" != "" ]; then git commit -m "Overnight job"; fi
          if [ "$GIT_STATUS" != "" ]; then git push; fi
Again, this is all pretty much standard stuff. We run the workflow on an Ubuntu runner. It's a three-phase process:
- Install everything we need (see the note about dependencies below)
- Run the program to get the new data
- Commit the new data file to the repo
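A note on that first step: the cpanm -n --installdeps . line assumes there's a cpanfile (or similar) in the repository listing the modules the program uses. I haven't shown it here, but a minimal sketch would just need the non-core modules:
requires 'Web::Query';
requires 'JSON';
(Time::Piece has been in the Perl core for years, so it doesn't need listing.)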
Note that the workflow is triggered in two ways:
- It runs at 09:00 (UTC) every morning
- You can run it manually from the Actions tab in the repo (that's what the workflow_dispatch line does)
Displaying our wares
Having got a daily build of the data, we need to create a web page to display it. I'm not a web designer, so this was necessarily going to be basic. I threw together a simple page using Bootstrap.
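I won't reproduce my whole page here, but the only part that matters for what follows is that it contains a table with an empty tbody for the JavaScript to fill in. A rough sketch (the Bootstrap version and the papers.js filename are just placeholders, not necessarily what I used):
<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>The Papers</title>
  <!-- Any recent Bootstrap CSS will do; the version here is just an example -->
  <link rel="stylesheet"
        href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css">
</head>
<body>
  <div class="container">
    <h1>The Papers</h1>
    <table class="table">
      <thead>
        <tr><th>Date</th><th>Story</th></tr>
      </thead>
      <!-- The script below appends one row per entry in papers.json -->
      <tbody></tbody>
    </table>
  </div>
  <!-- papers.js is an assumed filename for the script shown below -->
  <script src="papers.js"></script>
</body>
</html>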
The next step was to grab the papers.json document, parse it and then display it on the page. Now, I can wrangle JavaScript pretty successfully most of the time, but I wanted to get this working as quickly as possible, so I asked ChatGPT for some help. It only took a few iterations for it to give me this code:
document.addEventListener("DOMContentLoaded", function () {
  fetch("papers.json")
    .then(response => response.json())
    .then(data => {
      const tableBody = document.querySelector("table tbody");

      data.forEach(item => {
        const row = document.createElement("tr");

        const date = document.createElement("td");
        date.textContent = item.date;
        row.appendChild(date);

        const textLink = document.createElement("td");
        const a = document.createElement("a");
        a.href = item.link;
        a.textContent = item.text;
        textLink.appendChild(a);
        row.appendChild(textLink);

        tableBody.appendChild(row);
      });
    });
});
It didn't work first time. But that's because I'm an idiot and didn't tell ChatGPT the name of my JSON document or how to correctly identify the <table> element in the HTML. But once I'd corrected my errors, it all worked perfectly.
Summing up
So that's how I spent yesterday's lunch break. I can now see an archive of the BBC's stories about each day's front pages by just going to https://davorg.github.io/bbc_papers/.
Of course, just after I'd written it, I (once again) had a look to see if someone else had created something similar and found the BBC's page listing all of the stories. Ah well, I had fun putting my version together.
All of this code is available on GitHub.