A Primer on Data Compression

Andrew (he/him) - Jun 15 '19

Likely States

Let's talk about statistical mechanics for a second. Suppose we have a system of 4 particles, labeled a through d, each of which can be in one of two states. We can call them the 0 state and the 1 state. A particle has one "unit" of energy when it's in the 1 state, and no energy when it's in the 0 state.

1-energy:   x x 
0-energy: x     x
particle: a b c d

If we only measure the overall energy of the system, we could find anywhere from 0 units of energy (when all particles are in the 0 state) to 4 units of energy (when all particles are in the 1 state).

1-energy:              1-energy: x x x x
0-energy: x x x x  =>  0-energy:        
particle: a b c d      particle: a b c d
--> total energy: 0    --> total energy: 4

Suppose we measure the energy of the system and find 2 units of energy. This tells us that exactly two particles are in the 1 state, but it doesn't tell us which two. The overall energy of the system is known as a macrostate, while the specific configuration of each individual particle (a=0, b=1, c=0, ...) is known as a microstate. Multiple microstates can map to the same macrostate. For instance...

abcd = 0101

1-energy:   x   x
0-energy: x   x  
particle: a b c d
--> total energy: 2

and

abcd = 1100

1-energy: x x
0-energy:     x x
particle: a b c d
--> total energy: 2

...both have energy 2 (the same macrostate), but which particle is in the 0 or 1 state differs between the two microstates. In a system like this, with only two energy states, particle a can be in the 0 state or the 1 state (2 possible states), and (independently) particle b can be in the 0 state or the 1 state (2 possible states), and so on. These 2s all multiply together and give us 2^N possible microstates for a 2-state, N-particle system, where the particles are distinguishable (as they are in our case -- they're labeled).

If each particle has a 50/50 chance of being in the 0 state or the 1 state, then the 2-energy macrostate is the most probable one. With a 50/50 probability, every microstate is equally likely, so the most probable macrostate is simply the one with the most microstates mapping to it:

abcd -> total energy
0000 -> 0
0001 -> 1
0010 -> 1          x
0011 -> 2          x
0100 -> 1        x x x
0101 -> 2        x x x
0110 -> 2        x x x
0111 -> 3      x x x x x
1000 -> 1  =>  0 1 2 3 4
1001 -> 2
1010 -> 2       energy
1011 -> 3      histogram
1100 -> 2
1101 -> 3    2-energy state
1110 -> 3   occurs 6/16 times
1111 -> 4
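
If you'd like to check that 6/16 figure by brute force (or count microstates for bigger systems), here's a quick sketch in Python; it isn't part of the physics argument, just an illustration:

from itertools import product
from collections import Counter

N = 4  # number of particles

# Enumerate all 2^N microstates; the macrostate is just the total energy.
histogram = Counter(sum(microstate) for microstate in product((0, 1), repeat=N))

for energy in range(N + 1):
    print(f"energy {energy}: {histogram[energy]}/{2**N} microstates")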

While the 3-energy macrostate implies that a given particle has a 75% chance of being in the 1 state, the 2-energy macrostate implies that, on average, any given particle is equally likely to be in the 0 state or the 1 state.

So What?

Statistical mechanics lays the foundation for thermodynamics, which concerns itself with concepts like heat, pressure, and entropy -- the last of which is directly related to theories of information. It may seem strange that the laws of physics have anything to do with information, which feels like a uniquely human concept, but entropy ties the two fields together.

In our four-particle example above, we can calculate the entropy of the system by simply applying the formula:

S = k * ln(W)

...where W is the number of accessible microstates and k is the Boltzmann constant. ln() is the natural (base-e) logarithm. We assume that the environment has sufficient energy to give to our particles that any microstate is possible, and all are equally likely. The entropy of the four-particle, two-state system is then:

S = k * ln(16)
  =~ 2.77*k

This doesn't mean much until we compare it to the entropy of other systems. Entropy is, in a sense, related to information content. In physics, entropy is also related to how disordered a system is -- more disorder means more entropy. Let's look at another example. Suppose we cool the system down, removing all of the heat. In that case, there's probably not enough energy to bring any of the particles into the 1 state, and all are stuck at 0. There is only one allowed microstate when the energy of the system is zero, and the entropy becomes:

S = k * ln(1)
  = 0

The entropy is zero. The system is perfectly ordered; all particles are in the 0 state and there is no chance that that will change.
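
For the record, here's that calculation spelled out in Python (the function name is my own shorthand; only the ratio S/k matters when comparing systems, but the SI value of k is included for completeness):

import math

k = 1.380649e-23  # Boltzmann constant, in J/K

def boltzmann_entropy(num_microstates):
    # S = k * ln(W), where W is the number of accessible microstates
    return k * math.log(num_microstates)

print(boltzmann_entropy(16) / k)  # ~2.77 -- four particles, every microstate allowed
print(boltzmann_entropy(1) / k)   # 0.0  -- the frozen, perfectly ordered system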

Bits of Entropy, Bits of Information

Let's move to bigger collections of particles. Let's say we have the entire alphabet except (and you'll see why later) y and z -- 24 particles labeled a through x -- and each of them can have 0 energy or 1 energy, again with a 50/50 probability of being in either state. Intuitively, which configuration seems less likely to occur by chance...

case (A):
100101011010100101011010
abcdefghijklmnopqrstuvwx

or

case (B):
111111111111000000000000
abcdefghijklmnopqrstuvwx

...? If you think that case (B) looks really unlikely, you're not alone. Most people intuit that the second case is somehow "less likely" than the first -- even though we said each particle has a 50/50 chance of being in either state! From a purely probabilistic point of view, the two configurations are equally likely. (It's also why you shouldn't play the lottery if you think the numbers 1-2-3-4-5 are unlikely to win. They're just as probable as any other five-number combination.)

This mis-intuition occurs because we've ordered the particles. If I'd given them to you as...

case (C):
101010101010100110101011
itjukvlwmxanbopcdqerfsgh

...does case (C) look more or less likely than case (B)? (They're actually exactly equivalent.)

But if we insist on distinguishable particles, with some ordering, then we can "process" the particles in that order. And then there is a serious difference between case (A) and case (B). It has to do with how "surprised" we are when we see a value different from the ones we've seen in the past.

Let's process case (A) from left to right (that is, alphabetically). We can begin by assuming a 50/50 probability for the 0 and 1 states. Then, as we get new information, we'll adjust our expectations. The formula for the Shannon entropy of a two-state system is:

H = -(p * log2(p)) - (q * log2(q))

...where p is the probability of the 0 state and q is the probability of the 1 state (or vice versa). If we begin by assuming that each state is equally likely, then the first bit of information (particle a) gives us:

H = -(0.5 * log2(0.5)) - (0.5 * log2(0.5))
  = -(0.5 * -1) - (0.5 * -1)
  = -(-0.5) - (-0.5)
  = 1
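
Here's that formula as a small Python helper (the name shannon_entropy is my own, nothing standard), which makes the rest of this walk-through easy to check:

import math

def shannon_entropy(p):
    # Entropy, in bits, of a two-state system where one state has probability p.
    # By convention 0 * log2(0) is treated as 0, which handles p = 0 and p = 1.
    if p in (0, 1):
        return 0.0
    q = 1 - p
    return -(p * math.log2(p)) - (q * math.log2(q))

print(shannon_entropy(0.5))  # 1.0   -- a fair coin flip: one full bit of surprise
print(shannon_entropy(1/3))  # 0.918 -- the situation we'll hit at particle d below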

Exactly 1 bit of information! That's not too surprising, is it? The next bit is a bit trickier, though. We now have one data point (particle a), and we're trying to extrapolate from that. Simple frequentist probability now tells us that since 100% of the (one) bits we've seen so far have been 1 (a=1), we expect all future particles to be in the 1 state:

H = -(p * log2(p)) - (q * log2(q))
  = -(0 * (-infinity)) - 0
  = problem here

Really what we should do is use a Bayesian estimate of the probability. Very few things in life are 100% guaranteed to happen, so the formula above falls apart for all-or-nothing probabilities. Instead of 0% / 100% probabilities, let's let those probabilities get arbitrarily close to 0 and 1, but not actually reach them. Then what happens?

limit of H as e -> 0
H = -(e * log2(e)) - ((1-e) * log2(1-e))
  = 0 - (1 * log2(1))
  = 0 - 0 = 0
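
You can watch that limit happen numerically: as the assumed probability of the "surprising" state shrinks toward zero, so does the entropy (a tiny sketch reusing the same formula):

import math

def shannon_entropy(p):
    return 0.0 if p in (0, 1) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for e in (0.1, 0.01, 0.001, 0.0001):
    print(e, shannon_entropy(e))  # 0.469, 0.081, 0.011, 0.001 ... heading to zero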

If all the particles / bits we've seen in the past have had the same value, then we expect the next one will also have that same value. We don't need to look at it to "know" its value. The next particle gives us 0 bits of new information.

This changes, though, when we see a new value. After we've seen a and b of case (A), we now (by the frequentist approach) think that there's a 50/50 probability of either state, and c gives us 1 full bit of new information. Particle d is where things really get interesting. At this point, we've seen a=1, b=0, and c=0. So we assume that there's a 1/3 probability of seeing a 1 in d, and a 2/3 probability of a 0:

H = -(1/3 * log2(1/3)) - (2/3 * log2(2/3))
  = -(1/3 * -1.585) - (2/3 * -0.585)
  = -(-0.528) - (-0.390)
  = 0.918

Particle d gives us slightly less than 1 bit of information! Its value isn't a foregone conclusion, as b's was when we assumed a 100% probability of seeing a 1 after a, but it isn't a complete toss-up either, as c's was when we had a 50/50 shot of getting either a 0 or a 1. We've seen both 0s and 1s before, but slightly more 0s, so we expect a 0 while allowing that a 1 is still possible.

We can work our way down each bit of case (A) and calculate the amount (in bits) of "entropic" information contained within each particle / at each position:

case (A):
a:1 -> 1.000 bits (1/2 assumed probability of "1" energy state)
b:0 -> 0.000 bits (1/1 prior probability)
c:0 -> 1.000 bits (1/2 prior)
d:1 -> 0.918 bits (1/3)
e:0 -> 1.000 bits (2/4)
f:1 -> 0.970 bits (2/5)
g:0 -> 1.000 bits (3/6)
h:1 -> 0.985 bits (3/7)
i:1 -> 1.000 bits (4/8)
j:0 -> 0.991 bits (5/9)
k:1 -> 1.000 bits (5/10)
l:0 -> 0.994 bits (6/11)
m:1 -> 1.000 bits (6/12)
n:0 -> 0.996 bits (7/13)
o:0 -> 1.000 bits (7/14)
p:1 -> 0.997 bits (7/15)
q:0 -> 1.000 bits (8/16)
r:1 -> 0.997 bits (8/17)
s:0 -> 1.000 bits (9/18)
t:1 -> 0.998 bits (9/19)
u:1 -> 1.000 bits (10/20)
v:0 -> 0.998 bits (11/21)
w:1 -> 1.000 bits (11/22)
x:0 -> 0.999 bits (12/23)
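
If you want to reproduce that table (up to rounding in the last digit), here's a sketch of the running, frequentist estimate described above; the function names and structure are my own:

import math
import string

def shannon_entropy(p):
    return 0.0 if p in (0, 1) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def per_particle_information(bits):
    # Before examining each particle, estimate P(1) from the particles seen so far
    # (starting from an assumed 1/2), then score the new particle's "surprise" in bits.
    info = []
    ones = seen = 0
    for bit in bits:
        p = ones / seen if seen else 0.5
        info.append(shannon_entropy(p))
        ones += bit
        seen += 1
    return info

case_a = [int(c) for c in "100101011010100101011010"]
info = per_particle_information(case_a)
for label, bit, h in zip(string.ascii_lowercase, case_a, info):
    print(f"{label}:{bit} -> {h:.3f} bits")
print("average:", sum(info) / len(info))  # ~0.95 bits per particle

Feeding case (B)'s bit string into the same function reproduces the second table near the end of this post.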

Each particle gives, on average, about 0.95 bits of information. Close to -- but not exactly -- 1. If we wanted to compress this information, there's not much we could do on a bit-by-bit basis. However, you might notice patterns in the data:

case (A):
10 01 01 01 10 10 10 01 01 01 10 10
ab cd ef gh ij kl mn op qr st uv wx

This stream of bits is clearly compressible in that, if we break it up sequentially into 2-bit blocks, we never see the patterns 00 or 11. This means we could encode this stream as:

case (A):
 B  A  A  A  B  B  B  A  A  A  B  B
10 01 01 01 10 10 10 01 01 01 10 10
ab cd ef gh ij kl mn op qr st uv wx

where A = 01, B = 10
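
A minimal sketch of that substitution step (assuming both sides agree on the codebook ahead of time; the function names are just placeholders):

ENCODE = {"01": "A", "10": "B"}
DECODE = {symbol: bits for bits, symbol in ENCODE.items()}

def encode_pairs(bits):
    # Split the bit string into 2-bit blocks and replace each block with its symbol.
    # A block with no codebook entry (e.g. "00" or "11") raises KeyError -- see case (D) below.
    return "".join(ENCODE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode_pairs(symbols):
    return "".join(DECODE[s] for s in symbols)

case_a = "100101011010100101011010"
encoded = encode_pairs(case_a)
print(encoded)                          # BAAABBBAAABB
print(decode_pairs(encoded) == case_a)  # True -- nothing was lost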

We would, of course, have to know this encoding in order to decode the data, but it would save space when storing the information. Interestingly, this representation actually carries less entropic information per symbol (an average of 0.89 bits) than the uncompressed version does per bit:

case (A):
ab:B -> 1.000 bits (1/2 assumed)
cd:A -> 0.000 bits (1/1 prior)
ef:A -> 1.000 bits (1/2)
gh:A -> 0.918296 bits (1/3)
ij:B -> 0.811278 bits (1/4)
kl:B -> 0.970951 bits (2/5)
mn:B -> 1.000 bits (3/6)
op:A -> 0.985228 bits (4/7)
qr:A -> 1.000 bits (4/8)
st:A -> 0.991076 bits (4/9)
uv:B -> 0.970951 bits (4/10)
wx:B -> 0.99403 bits (5/11)

The reason for this is the long runs of As and Bs in the encoded stream. That stream can be compressed even further: we can next apply the substitutions C=BAA and D=ABB to get:

case (A):
       C        D        C        D
 B  A  A  A  B  B  B  A  A  A  B  B
10 01 01 01 10 10 10 01 01 01 10 10
ab cd ef gh ij kl mn op qr st uv wx

This is, in a nutshell, how information compression works for digital files. You try to find common patterns in the data, and replace those patterns with shorter representations, which can later be decoded to restore the original file.
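
Sticking with our toy example, both levels of substitution might look like this (a deliberately naive sketch with a hard-coded codebook; real compressors discover the repeated patterns and build the codebook for you):

SUBSTITUTIONS = [("BAA", "C"), ("ABB", "D")]  # pattern -> shorter symbol

def compress(data):
    for pattern, symbol in SUBSTITUTIONS:
        data = data.replace(pattern, symbol)
    return data

def decompress(data):
    # Undo the substitutions in reverse order to restore the original stream.
    for pattern, symbol in reversed(SUBSTITUTIONS):
        data = data.replace(symbol, pattern)
    return data

stream = "BAAABBBAAABB"
print(compress(stream))                        # CDCD
print(decompress(compress(stream)) == stream)  # True

Real-world formats do the same thing less naively: DEFLATE (used by zip, gzip, and PNG) pairs this kind of dictionary substitution (LZ77) with variable-length Huffman codes.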

This works well for the stream of bits above because, at each step, there are only two (or fewer) distinct "super-patterns", so each one can be encoded in a single bit. The first substitution cuts the data in half, and the second shrinks it by a further factor of three. If we had even a single 00 or 11 block in the original bit pattern, though, this scheme wouldn't work:

case (D):
      ??        D        C        D
 B  A  ?  A  B  B  B  A  A  A  B  B
10 01 11 01 10 10 10 01 01 01 10 10
ab cd ef gh ij kl mn op qr st uv wx

This single 11 block means we need a third symbol at the first level of compression, which would require a second bit per symbol and defeat the purpose of the compression in the first place. So we can't compress this data as 2-bit blocks at all. (Though it's possible that some other coding mechanism could do the trick.)

I said earlier that there was an obvious difference between case (A) and case (B), and we can see it clearly if we calculate the amount of entropy contained in each particle for case (B):

case (B):
a:1 -> 1.000 bits (1/2 assumed probability of "1" energy state)
b:1 -> 0.000 bits (1/1 prior probability)
c:1 -> 0.000 bits (2/2 prior)
d:1 -> 0.000 bits (3/3)
e:1 -> 0.000 bits (4/4)
f:1 -> 0.000 bits (5/5)
g:1 -> 0.000 bits (6/6)
h:1 -> 0.000 bits (7/7)
i:1 -> 0.000 bits (8/8)
j:1 -> 0.000 bits (9/9)
k:1 -> 0.000 bits (10/10)
l:1 -> 0.000 bits (11/11)
m:0 -> 0.000 bits (12/12)
n:0 -> 0.391244 bits (12/13)
o:0 -> 0.591673 bits (12/14)
p:0 -> 0.721928 bits (12/15)
q:0 -> 0.811278 bits (12/16)
r:0 -> 0.873981 bits (12/17)
s:0 -> 0.918296 bits (12/18)
t:0 -> 0.949452 bits (12/19)
u:0 -> 0.970951 bits (12/20)
v:0 -> 0.985228 bits (12/21)
w:0 -> 0.994030 bits (12/22)
x:0 -> 0.998636 bits (12/23)

...only 0.43 bits of information, on average, per particle. Long runs of repeated 0s or 1s are also ripe for compression.

Summary

So what have we learned? Information content, at a physical level, is related to the disorder inherent in a system. The more disordered a system is, the more information it contains. A file consisting of a million zeroes is perfectly ordered, and so it contains very little information.

Data compression relies on the fact that files often contain repeated strings of bits or characters, which can be encoded more compactly to reduce the size of the stored file. "Information density" (in the physical / entropic sense) is maximized when there is no discernible pattern and no regular, repeated runs of bits or characters. Two files may contain the same number of 1s and 0s, and therefore require the same amount of storage space on disk, but the order in which those 1s and 0s appear is crucial. Maximum information density looks like a file full of completely random 1s and 0s.
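
You can see this with any off-the-shelf compressor. As a quick illustration (using Python's built-in zlib, which isn't discussed above, just a convenient stand-in):

import os
import zlib

ordered = b"\x00" * 10_000    # a perfectly ordered "file": ten thousand zero bytes
random_ = os.urandom(10_000)  # the same length, but maximum-entropy content

print(len(zlib.compress(ordered)))  # a few dozen bytes -- almost no information
print(len(zlib.compress(random_)))  # ~10,000 bytes -- no pattern to exploit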
