🗜️ Using Cantor-Pairing for String Compression

creuser - Jan 22

This compression method may have already been invented, but I'll share it nonetheless.

Cantor-Pairing is a function that combines two non-negative integers into a single integer in a reversible way, which makes it usable for packing two characters into one. I worked out the approach below by experimenting in JavaScript.

During compression, the string is split into pairs of characters, with a single character left over if the length is odd:

hello => he, ll, o
world! => wo, rl, d!
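
The grouping step can be sketched as follows. The helper name `toPairs` is hypothetical and not taken from the gist; it simply splits a string into chunks of up to two characters.

```javascript
// Hypothetical helper (name is an assumption): split a string into
// chunks of up to two characters, leaving a single at the end if needed.
function toPairs(str) {
  // Array.from iterates by code point, so characters outside the BMP stay whole.
  const chars = Array.from(str);
  const pairs = [];
  for (let i = 0; i < chars.length; i += 2) {
    pairs.push(chars.slice(i, i + 2).join(""));
  }
  return pairs;
}

console.log(toPairs("hello"));  // he, ll, o
console.log(toPairs("world!")); // wo, rl, d!
```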

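Each pair is converted to a single number by applying the pairing function to the two characters' Unicode code points, and that number is then written out as one character. As a result, the compressed string consists of non-Latin characters such as Chinese characters, hieroglyphs, Arabic script, emoji, etc.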

// Cantor pairing: maps two non-negative integers to one unique integer.
function pair(a, b) {
  return 0.5 * (a + b) * (a + b + 1) + b;
}
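
As a rough illustration of how `pair` could drive the compression step, here is a minimal sketch. The function name `compress` is hypothetical, and two assumptions are baked in: the input is plain ASCII (so every paired value stays well below the surrogate range), and a trailing single character is paired with 0. The gist may handle both cases differently.

```javascript
// Cantor pairing, as defined in the post.
function pair(a, b) {
  return 0.5 * (a + b) * (a + b + 1) + b;
}

// Hypothetical compress(): assumes ASCII input (code units below 128).
function compress(str) {
  let out = "";
  for (let i = 0; i < str.length; i += 2) {
    const a = str.charCodeAt(i);
    // Assumption: a trailing single character is paired with 0.
    const b = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
    out += String.fromCodePoint(pair(a, b));
  }
  return out;
}

console.log(compress("hello").length);         // 3
console.log(compress("hello").codePointAt(0)); // 21216, i.e. pair(104, 101)
```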

For decompression, each character's code point is run through the inverse Cantor-Pairing function, which recovers the two original code points and therefore the original string.

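// Inverse Cantor pairing: recovers [a, b] from n = pair(a, b).
function unpair(n) {
  const w = Math.floor((Math.sqrt(8 * n + 1) - 1) / 2); // w = a + b
  const t = (w ** 2 + w) / 2; // triangular number for w
  return [w - (n - t), n - t];
}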
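
To show how the two functions fit together, here is a minimal round-trip sketch. The names `compress` and `decompress` are hypothetical, and the code assumes ASCII input without NUL characters, with a trailing single character paired with 0. The gist may do this differently.

```javascript
function pair(a, b) {
  return 0.5 * (a + b) * (a + b + 1) + b;
}

function unpair(n) {
  const w = Math.floor((Math.sqrt(8 * n + 1) - 1) / 2);
  const t = (w ** 2 + w) / 2;
  return [w - (n - t), n - t];
}

// Hypothetical compress(): assumes ASCII input; a trailing single is paired with 0.
function compress(str) {
  let out = "";
  for (let i = 0; i < str.length; i += 2) {
    const a = str.charCodeAt(i);
    const b = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
    out += String.fromCodePoint(pair(a, b));
  }
  return out;
}

// Hypothetical decompress(): reverses compress() under the same assumptions.
function decompress(packed) {
  let out = "";
  for (const ch of packed) { // for...of iterates by code point
    const [a, b] = unpair(ch.codePointAt(0));
    out += String.fromCodePoint(a);
    // b === 0 marks a trailing single character (assumption from compress()).
    if (b !== 0) out += String.fromCodePoint(b);
  }
  return out;
}

console.log(decompress(compress("hello world!")) === "hello world!"); // true
console.log(decompress(compress("hello")));                           // hello
```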

Here are the pros, cons, and considerations of this approach:

Pros:

Fast processing.
Roughly halves the string's length, since two characters become one. The size in bytes depends on the encoding.
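
The size claim can be checked with a quick measurement. This is only a sketch: `compress` is a hypothetical helper that assumes ASCII input and pairs a trailing single character with 0, and `utf8Bytes` is an illustrative helper built on the standard `TextEncoder`. It compares the string length with the UTF-8 byte count.

```javascript
function pair(a, b) {
  return 0.5 * (a + b) * (a + b + 1) + b;
}

// Hypothetical compress(): assumes ASCII input; a trailing single is paired with 0.
function compress(str) {
  let out = "";
  for (let i = 0; i < str.length; i += 2) {
    const a = str.charCodeAt(i);
    const b = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
    out += String.fromCodePoint(pair(a, b));
  }
  return out;
}

// Illustrative helper: number of bytes the string occupies when encoded as UTF-8.
const utf8Bytes = (s) => new TextEncoder().encode(s).length;

const input = "hello world!";
const packed = compress(input);

console.log(input.length, packed.length);         // 12 6
console.log(utf8Bytes(input), utf8Bytes(packed)); // 12 18
```

For this input the length halves, but each packed character needs three bytes in UTF-8, so the saving applies to character count (or UTF-16 storage) rather than to UTF-8 bytes.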

Cons:

Limited portability: the output contains unusual characters that not every system, font, or encoding handles well.

Considerations:

Avoid compressing an already compressed string: pairing the large code points it contains can produce values outside the valid Unicode range.
Be careful with short strings, as they may produce corrupted output.
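
The first consideration can be illustrated with a small check. This sketch assumes the compressed characters are produced with `String.fromCodePoint`, and the variable names are illustrative. An implementation based on `String.fromCharCode` would silently truncate the value instead of throwing.

```javascript
function pair(a, b) {
  return 0.5 * (a + b) * (a + b + 1) + b;
}

const MAX_CODE_POINT = 0x10ffff; // highest valid Unicode code point

// First pass: "h" (104) and "e" (101) pair to a valid code point.
const once = pair(104, 101);
// Second pass: pairing two already-paired values overshoots the valid range.
const twice = pair(once, once);

console.log(once <= MAX_CODE_POINT); // true
console.log(twice > MAX_CODE_POINT); // true

let threw = false;
try {
  String.fromCodePoint(twice);
} catch (e) {
  // fromCodePoint rejects values above 0x10FFFF with a RangeError.
  threw = e instanceof RangeError;
}
console.log(threw); // true
```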

If you are interested, check out the gist.