This compression method may have already been invented, but I'll share it nonetheless.
Cantor-Pairing, an algorithm combining two numbers into one, proves effective for string compression. Through experimentation in JavaScript, I discovered a solution.
During compression, characters are grouped into pairs (or singles):
-
hello
=>he, ll, o
-
world!
=>wo, rl, d!
These pairs convert to paired numbers using corresponding character Unicode. The resulting string includes non-Latin characters like Chinese, hieroglyphics, Arabic, emojis, etc.
function pair(a, b) {
return 0.5 * (a + b) * (a + b + 1) + b;
}
For decompression, characters' Unicode reverses via the inverse Cantor-Pairing algorithm, returning the original string.
function unpair(n) {
var w = Math.floor((Math.sqrt(8 * n + 1) - 1) / 2);
var t = (w ** 2 + w) / 2;
return [w - (n - t), n - t];
}
For further information about this algorithm, here are the pros, cons, and considerations:
Pros:
- Fast processing.
- Effective reduction of string size by half.
Cons:
- Limited universality due to non-standard characters.
Considerations:
- Avoid compressing an already compressed string to prevent incorrect Unicode.
- Exercise caution with short strings, as they may lead to corrupted output.
If you are interested, check out the gist.