Challenges, Solutions and more Challenges and more Solutions

Jonas Brømsø - Oct 23 '22 - - Dev Community

I am maintaining two Perl distributions, which are using C-bindings.

I do not do much day-to-day maintenance, but on occasion there is a PR that needs to be processed, or a report from cpan-testers indicating a failing test that requires further investigation. The thing is that the surroundings of these distributions change constantly. Circumstances involving change got me involved with the maintenance of the two distributions in the first place: I was a mere user, the platforms I was using these components on were being updated continuously, and we simply needed to keep up.

Apparently this never stops. Over time I could see tests failing because the toolchains used around these distributions kept evolving. The toolchain issue was often related to clang, so I visited the clang documentation on several occasions.

By adjusting the command line parameters, I could keep the tests passing.

Here are some examples of the mentioned documentation:

To see whether a given option was available in a version, or when it was introduced (or deprecated), I surfed across the pages with multiple tabs open, cross-checking command line options and so on.

An example: -Wunreachable-code-fallthrough

Present in version 14 (https://releases.llvm.org/14.0.0/tools/clang/docs/DiagnosticsReference.html#wunreachable-code-fallthrough) but not in earlier versions.

Whereas -Wall is present in all versions; some examples below:

  • 14 (https://releases.llvm.org/14.0.0/tools/clang/docs/DiagnosticsReference.html#wall)
  • 13 (https://releases.llvm.org/13.0.0/tools/clang/docs/DiagnosticsReference.html#wall)
  • 4 (https://releases.llvm.org/4.0.0/tools/clang/docs/DiagnosticsReference.html#wall)

I had a challenge: I needed to keep an overview of the compiler diagnostic parameters. At the same time I could see a pattern emerge, as the URL structure was uniform and the pages had the same structure. So I decided to make a matrix of all of the diagnostic command line flags. The uniform structure was clearly a benefit, so if you are creating similar documentation, please keep this in mind as a use-case.
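The uniform URL structure can be illustrated with a small Go sketch. This is not code from the generator (which is written in Perl); the function name `docsURL` is made up for the example, and the anchor rule (drop the leading dash and lowercase the flag) is only inferred from the example URLs above. It also assumes every release is published under `<major>.0.0`, which matches the examples shown here but may not hold for every version.

```go
package main

import (
	"fmt"
	"strings"
)

// docsURL builds the DiagnosticsReference URL for a clang major version and
// a diagnostic flag such as "-Wall". The anchor rule is an assumption
// inferred from the example URLs in the post.
func docsURL(major int, flag string) string {
	anchor := strings.ToLower(strings.TrimPrefix(flag, "-"))
	return fmt.Sprintf(
		"https://releases.llvm.org/%d.0.0/tools/clang/docs/DiagnosticsReference.html#%s",
		major, anchor)
}

func main() {
	// Print the -Wall documentation URL for a few versions.
	for _, v := range []int{4, 13, 14} {
		fmt.Println(docsURL(v, "-Wall"))
	}
}
```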

Anyway, this lets me introduce: "clang diagnostic flags matrix generator", a Perl application that iterates over a set of available web pages (one for each version of clang), extracts/scrapes the information and inserts it into a data structure, from which I can print a matrix expressed as a Markdown table. Do note that not all versions of clang are represented, but data for the relevant versions is available from version 4 and up. Please see the source code for details.
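To give an idea of the data structure and the table output, here is a minimal Go sketch. It is an illustration only, not the Perl implementation: the type `matrix`, the function `render` and the table layout are assumptions made for the example, and the scraping step is left out entirely.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// matrix maps a flag name to the set of major versions that document it.
type matrix map[string]map[int]bool

// render returns the matrix as a Markdown table: one row per flag, one
// column per version, "X" where the flag is documented and "-" otherwise.
func render(m matrix, versions []int) string {
	var b strings.Builder
	b.WriteString("| Flag |")
	for _, v := range versions {
		fmt.Fprintf(&b, " %d |", v)
	}
	b.WriteString("\n|---|")
	for range versions {
		b.WriteString("---|")
	}
	b.WriteString("\n")

	// Sort the flags so the output is stable between runs.
	flags := make([]string, 0, len(m))
	for f := range m {
		flags = append(flags, f)
	}
	sort.Strings(flags)

	for _, f := range flags {
		fmt.Fprintf(&b, "| %s |", f)
		for _, v := range versions {
			if m[f][v] {
				b.WriteString(" X |")
			} else {
				b.WriteString(" - |")
			}
		}
		b.WriteString("\n")
	}
	return b.String()
}

func main() {
	// Data taken from the two examples mentioned earlier in the post.
	m := matrix{
		"-Wall":                          {13: true, 14: true},
		"-Wunreachable-code-fallthrough": {14: true},
	}
	fmt.Print(render(m, []int{13, 14}))
}
```

In the real matrix each "X" is also a link to the documentation for that flag in that version, which is where most of the file size comes from.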

And it worked. One can discuss the readability due to the size of the matrix, but I had a challenge and I came up with a solution. The matrix was inserted into my TIL collection under the clang category; it is also available in the clang diagnostic flags matrix generator repository.

A new problem occurred, however. The matrix would not render correctly on GitHub; it would stop at some point. In the beginning I thought this was a transient error, but it persisted. I did not observe the issue when using GitHub Pages or the Markdown preview in Visual Studio Code, so the problem had to be with GitHub. So I reported it as a bug to GitHub and I got an answer, brief and to the point: my Markdown exceeded the file size limit for rendering on GitHub.

Text files over 512 KB are always displayed as plain text. Code is not syntax highlighted, and prose files are not converted to HTML (such as Markdown, AsciiDoc, etc.).

REF: GitHub Documentation

I checked my file and it did exceed the 512 KB limit, with a size of more than 1 MB.
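A check like this is easy to automate. The following Go sketch is a hedged example: the file name `matrix.md` is a placeholder, and it assumes that GitHub's "512 KB" means 512 * 1024 bytes, which the documentation quoted above does not spell out.

```go
package main

import (
	"fmt"
	"os"
)

// renderLimit is the documented 512 KB threshold, assumed here to mean
// 512 * 1024 bytes.
const renderLimit = 512 * 1024

// tooLarge reports whether a file of the given size in bytes is over the
// limit and would be displayed as plain text instead of being rendered.
func tooLarge(size int64) bool {
	return size > renderLimit
}

func main() {
	// "matrix.md" is a placeholder file name.
	info, err := os.Stat("matrix.md")
	if err != nil {
		fmt.Println("cannot stat file:", err)
		return
	}
	fmt.Printf("%d bytes, too large: %v\n", info.Size(), tooLarge(info.Size()))
}
```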

New challenge: how do I decrease the size of the generated Markdown table?

I started out by eliminating much of the use of spaces and emojis; the latter I exchanged for ASCII characters. The size decreased, but then version 15.0.0 of clang came along and the size increased. It was easy to spot the culprit, though: all of the command line flags link to their respective documentation per version, meaning that the URLs carried a lot of redundant information, which was repeated a lot.

After thinking a little I came to the conclusion that I had to shorten the URL, boiling down all the redundant information like a compression algorithm. I did an experiment where I just rewrote the URL to a short fake domain name. I could immediately see an effect, so I decided to implement support for redirecting via a short URL to the longer URL.

I did some basic checks, since I could isolate the Markdown matrix/table output from the generator. The data is based on a matrix covering versions from 4 to 14.

  • 947393 bytes with emojis and original (long) URLs
  • 926691 bytes with emojis exchanged for ASCII
  • 418901 bytes with no emojis and URLs shortened

As mentioned, version 15 of clang was introduced around the same time I was looking into this, so it gave me the opportunity to calculate the approximate size cost of a new version.

  • 462850 bytes with no emojis and URLs shortened, including version 15

So the cost of version 15 is:

462850 - 418901 = 43949 bytes, or roughly 43.9 KB

Meaning that within a few versions the maximum of 512 KB will be exceeded again, but I will look at that challenge when it becomes a problem.

Well, the solution required a way to shorten the URL. I ended up with a sort of proxy, which redirects from my short URL to the original, in effect reversing the change made by the clang diagnostic flags matrix generator.

Next up was understanding what the common parts were and what the variables were. Looking at the URLs mentioned above, one will spot:

  • version number
  • fragment

I could even abbreviate the version number, since the matrix only documents major versions, as command line options are not added or removed in minor or bugfix releases (semantic versioning for the win).

The service needs to support:

  • version as a short number (the major version only)
  • fragment, the complete fragment

And I came up with the following scheme: <domain>/<version>/<fragment>

An example:

  • https://releases.llvm.org/5.0.0/tools/clang/docs/DiagnosticsReference.html#rsanitize-address

Would be abbreviated to:

  • https://<domain>/5/rsanitize-address

Which could be expressed as:

  • [X](https://<domain>/5/rsanitize-address)
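The shortening step can be sketched in Go as follows. This is an illustration under assumptions, not the generator's actual code (which is Perl): the function name `shorten` is hypothetical, and the example uses pxy.fi, the domain that is introduced further down, in place of `<domain>`.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// shorten rewrites a long DiagnosticsReference URL into the
// <domain>/<version>/<fragment> scheme, keeping only the major version.
func shorten(long, domain string) (string, error) {
	u, err := url.Parse(long)
	if err != nil {
		return "", err
	}
	// The path starts with the full version, for example "/5.0.0/tools/...".
	parts := strings.Split(strings.TrimPrefix(u.Path, "/"), "/")
	major, _, _ := strings.Cut(parts[0], ".")
	if major == "" || u.Fragment == "" {
		return "", fmt.Errorf("unexpected URL: %s", long)
	}
	return fmt.Sprintf("https://%s/%s/%s", domain, major, u.Fragment), nil
}

func main() {
	long := "https://releases.llvm.org/5.0.0/tools/clang/docs/DiagnosticsReference.html#rsanitize-address"
	short, err := shorten(long, "pxy.fi")
	if err != nil {
		fmt.Println(err)
		return
	}
	// Emit the Markdown table cell with the shortened link.
	fmt.Printf("[X](%s)\n", short)
}
```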

I created a basic service implemented in Go: pxy-redirect. In addition I needed a short domain name and ended up registering pxy.fi. The complete solution is running at https://pxy.fi, which replaces the original domain name, https://releases.llvm.org/, and does the redirection by:

  • expanding the version number
  • and transporting the last part of the URL as a fragment

To recap:

  1. The URL: https://releases.llvm.org/5.0.0/tools/clang/docs/DiagnosticsReference.html#rsanitize-address is extracted as part of manual parsing
  2. The Markdown is generated with a shorter representation: [X](https://pxy.fi/5/rsanitize-address)
  3. When the link is clicked, the service rewrites from: https://pxy.fi/5/rsanitize-address to https://releases.llvm.org/5.0.0/tools/clang/docs/DiagnosticsReference.html#rsanitize-address

The service is deployed with DigitalOcean and is up and running, and I am watching its logs to spot any weird things.

To begin with, my code was aimed squarely at the proxy part and was very transparent, so I decided to introduce an index.html for the root, to introduce the service in case somebody hit that particular URL; then I could guide them.

Introducing index.html then resulted in requests for favicon.ico, and I recently added support for robots.txt, since I could see this was being requested.

I can see somebody is requesting /login, which is very sweet, but that is not a valid URL and it results in an error. JFYI, there is no need to crawl the site, since it is a very basic and transparent redirecting proxy and all of the code is open source and available on GitHub:

  • https://github.com/jonasbn/pxy-redirect

I examined the option to implement this as serverless functions with DigitalOcean, but that will require some more research. If I need to do some more redirection I can add an extra part to the URL so I can separate into namespaces, but I do not currently have this requirement, so it is not implemented.

The implementation has been really fun and I can highlight some of the key points:

  • Mojo::UserAgent, which is an awesome tool for HTTP client work
  • GitHub limitations
  • I am still in the process of learning Go, so it was fun to do an experiment that was not just another tutorial
  • URL fragments and their nature
  • Deploying on DigitalOcean; I want to dig into DigitalOcean's functions, because I believe this to be a good use-case for serverless functions over a server solution

Ideas and suggestions for improvements are most welcome. I am thinking about doing some follow-up posts on the different components mentioned, walking through the implementation and highlighting different aspects. I believe this could also be a good way to spot points of interest for improvements.
