🗜️Using Cantor-Pairing as a String Compression

creuser - Jan 22 - - Dev Community

This compression method may have already been invented, but I'll share it nonetheless.

Cantor-Pairing, an algorithm combining two numbers into one, proves effective for string compression. Through experimentation in JavaScript, I discovered a solution.

During compression, characters are grouped into pairs (or singles):

  • hello => he, ll, o
  • world! => wo, rl, d!

These pairs convert to paired numbers using corresponding character Unicode. The resulting string includes non-Latin characters like Chinese, hieroglyphics, Arabic, emojis, etc.

function pair(a, b) {
  return 0.5 * (a + b) * (a + b + 1) + b;
}
Enter fullscreen mode Exit fullscreen mode

For decompression, characters' Unicode reverses via the inverse Cantor-Pairing algorithm, returning the original string.

function unpair(n) {
  var w = Math.floor((Math.sqrt(8 * n + 1) - 1) / 2);
  var t = (w ** 2 + w) / 2;
  return [w - (n - t), n - t];
}
Enter fullscreen mode Exit fullscreen mode

For further information about this algorithm, here are the pros, cons, and considerations:

Pros:

  1. Fast processing.
  2. Effective reduction of string size by half.

Cons:

  • Limited universality due to non-standard characters.

Considerations:

  1. Avoid compressing an already compressed string to prevent incorrect Unicode.
  2. Exercise caution with short strings, as they may lead to corrupted output.

If you are interested, check out the gist.

. . . . .