Case Study: Data Compression

Paul Ngugi - Jul 23 - - Dev Community

Compressing data is a common task, and many utilities are available for compressing files. This section introduces Huffman coding, invented by David Huffman in 1952. Huffman coding compresses data by using fewer bits to encode characters that occur more frequently; the code for each character is constructed, based on the character's occurrences in the text, using a binary tree called the Huffman coding tree.

In ASCII, every character is encoded in 8 bits, so a text of 100 characters takes 800 bits to represent. The idea of Huffman coding is to use fewer bits to encode frequently used characters and more bits to encode rarely used characters, reducing the overall size of the file. Suppose the text is Mississippi. Its Huffman tree is shown in Figure below (a). The left and right edges of a node are assigned the values 0 and 1, respectively. Each character is a leaf in the tree, and the code for a character consists of the edge values on the path from the root to that leaf, as shown in Figure below (b). Since i and s appear more often than M and p in the text, they are assigned shorter codes. Based on the coding scheme in Figure below (b), Mississippi can be encoded in just 21 bits instead of 88.
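The encoding step can be sketched in a few lines. This is a minimal sketch assuming the codes read off the example tree (i = 1, s = 01, p = 001, with M = 000 inferred from the tree structure):

```java
import java.util.Map;

// Encode Mississippi with the per-character codes assumed from the figure.
public class HuffmanEncode {
    public static void main(String[] args) {
        Map<Character, String> codes =
                Map.of('i', "1", 's', "01", 'p', "001", 'M', "000");
        StringBuilder bits = new StringBuilder();
        for (char ch : "Mississippi".toCharArray())
            bits.append(codes.get(ch)); // look up each character's code
        System.out.println(bits + " (" + bits.length() + " bits)");
    }
}
```

The 11 characters come out as 21 bits, versus 88 bits at 8 bits per character in ASCII.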

[Figure: (a) the Huffman coding tree for Mississippi; (b) the resulting character codes]

The coding tree is also used for decoding a sequence of bits into characters. To do so, start with the first bit in the sequence and determine whether to go to the left or right branch of the tree’s root based on the bit value. Consider the next bit and continue to go down to the left or right branch based on the bit value. When you reach a leaf, you have found a character. The next bit in the stream is the first bit of the next character. For example, the stream 011001 is decoded to sip, with 01 matching s, 1 matching i, and 001 matching p.
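The decoding walk described above can be sketched directly, with the example tree hardcoded (a minimal sketch, assuming the codes i = 1, s = 01, p = 001, M = 000 from the figure):

```java
// Decode a bit string by walking the hardcoded Mississippi tree.
public class HuffmanDecode {
    static class Node {
        char element;
        Node left, right;
        Node(char e) { element = e; }                // leaf
        Node(Node l, Node r) { left = l; right = r; } // interior
        boolean isLeaf() { return left == null && right == null; }
    }

    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node current = root;
        for (char bit : bits.toCharArray()) {
            current = (bit == '0') ? current.left : current.right;
            if (current.isLeaf()) {        // reached a character
                out.append(current.element);
                current = root;            // restart at the root
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Left edge = 0, right edge = 1, as in the figure.
        Node root = new Node(
                new Node(new Node(new Node('M'), new Node('p')), new Node('s')),
                new Node('i'));
        System.out.println(decode(root, "011001")); // prints sip
    }
}
```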

To construct a Huffman coding tree, use the following algorithm:

  1. Begin with a forest of trees. Each tree contains a node for a character. The weight of the node is the frequency of the character in the text.
  2. Repeat the following action to combine trees until there is only one tree: choose the two trees with the smallest weights and create a new node as their parent. The weight of the new tree is the sum of the weights of the two subtrees.
  3. For each interior node, assign its left edge the value 0 and its right edge the value 1. All leaf nodes represent characters in the text.
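The three steps above can be sketched compactly with java.util.PriorityQueue standing in for a heap of trees. This is a minimal sketch, not the program's actual Heap class shown later; exact codes depend on how ties between equal weights are broken, but the total encoded length does not:

```java
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

public class HuffmanSketch {
    static class Node {
        char ch;
        int weight;
        Node left, right;
        Node(char ch, int w) { this.ch = ch; weight = w; }       // step 1: leaf
        Node(Node l, Node r) {                                   // step 2: parent
            left = l; right = r; weight = l.weight + r.weight;
        }
    }

    static Node buildTree(Map<Character, Integer> freq) {
        // Min-heap ordered by weight.
        PriorityQueue<Node> heap = new PriorityQueue<>((a, b) -> a.weight - b.weight);
        freq.forEach((ch, w) -> heap.add(new Node(ch, w)));      // step 1: forest
        while (heap.size() > 1)
            heap.add(new Node(heap.remove(), heap.remove()));    // step 2: combine
        return heap.remove();
    }

    static void collectCodes(Node n, String code, Map<Character, String> codes) {
        if (n.left == null) { codes.put(n.ch, code); return; }   // leaf
        collectCodes(n.left, code + "0", codes);                 // step 3: 0 left
        collectCodes(n.right, code + "1", codes);                // step 3: 1 right
    }

    public static void main(String[] args) {
        Map<Character, Integer> freq = new TreeMap<>();
        "Mississippi".chars().forEach(c -> freq.merge((char) c, 1, Integer::sum));
        Map<Character, String> codes = new TreeMap<>();
        collectCodes(buildTree(freq), "", codes);
        codes.forEach((ch, code) -> System.out.println(ch + ": " + code));
        int total = 0;
        for (Map.Entry<Character, String> e : codes.entrySet())
            total += freq.get(e.getKey()) * e.getValue().length();
        System.out.println("Total bits: " + total); // 21, vs 88 at 8 bits/char
    }
}
```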

Here is an example of building a coding tree for the text Mississippi. The frequency table for the characters is shown in Figure below (b). Initially the forest contains single-node trees, as shown in Figure below (a). The trees are repeatedly combined to form larger trees until only one tree is left, as shown in Figures below (b–d).

[Figure: (a) the initial forest of single-node trees; (b–d) repeatedly combining the two smallest-weight trees until one tree remains]

It is worth noting that no code is a prefix of another code. This property ensures that the streams can be decoded unambiguously.
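A quick brute-force check of the prefix-free property, using the codes assumed from the Mississippi example:

```java
// Verify that no code is a proper prefix of another code.
public class PrefixCheck {
    public static void main(String[] args) {
        String[] codes = {"1", "01", "001", "000"}; // i, s, p, M
        boolean prefixFree = true;
        for (int i = 0; i < codes.length; i++)
            for (int j = 0; j < codes.length; j++)
                if (i != j && codes[j].startsWith(codes[i])) // codes[i] prefixes codes[j]?
                    prefixFree = false;
        System.out.println(prefixFree); // prints true
    }
}
```

Because every character sits at a leaf, no character's path can continue on to another character's leaf, which is exactly why the property holds.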

The algorithm used here is an example of a greedy algorithm. Greedy algorithms are often used to solve optimization problems: at each step, the algorithm makes the locally optimal choice in the hope that this choice will lead to a globally optimal solution. In this case, the algorithm always chooses the two trees with the smallest weights and creates a new node as their parent. This locally optimal choice does in fact lead to a globally optimal solution for constructing a Huffman tree. As another example, consider making change with the fewest possible coins. A greedy algorithm takes the largest possible coin first. For 98¢, you would use three quarters to make 75¢, then two dimes to reach 95¢, and finally three pennies to reach 98¢. The greedy algorithm finds an optimal solution for this problem; however, a greedy algorithm does not always find the optimal result.
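The coin-changing strategy can be sketched in a few lines, using the denominations and amount from the example:

```java
// Greedy change-making: always take as many of the largest coin as possible.
public class GreedyChange {
    public static void main(String[] args) {
        int[] coins = {25, 10, 5, 1}; // quarters, dimes, nickels, pennies
        int amount = 98;
        for (int coin : coins) {
            int count = amount / coin; // how many of this coin fit
            amount %= coin;            // remainder still to be changed
            if (count > 0)
                System.out.println(count + " x " + coin + "¢");
        }
    }
}
```

For 98¢ this prints three quarters, two dimes, and three pennies, matching the hand calculation above.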

The code below gives a program that prompts the user to enter a string, displays the frequency table of the characters in the string, and displays the Huffman code for each character.

package demo;
import java.util.Scanner;

public class HuffmanCode {

    public static void main(String[] args) {
        Scanner input = new Scanner(System.in);
        System.out.print("Enter text: ");
        String text = input.nextLine();

        int[] counts = getCharacterFrequency(text); // Count frequency

        System.out.printf("%-15s%-15s%-15s%-15s\n", "ASCII Code", "Character", "Frequency", "Code");

        Tree tree = getHuffmanTree(counts); // Create a Huffman tree
        String[] codes = getCode(tree.root); // Get codes

        for (int i = 0; i < codes.length; i++)
            if (counts[i] != 0) // (char)i is not in text if counts[i] is 0
                System.out.printf("%-15d%-15s%-15d%-15s\n", i, (char)i + "", counts[i], codes[i]);
    }

    /** Get the Huffman codes for the characters.
     * This method is called once after a Huffman tree is built.
     */
    public static String[] getCode(Tree.Node root) {
        if (root == null) return null;
        String[] codes = new String[256]; // One code slot per character
        assignCode(root, codes);
        return codes;
    }

    /** Recursively assign codes to the leaf nodes */
    private static void assignCode(Tree.Node root, String[] codes) {
        if (root.left != null) {
            root.left.code = root.code + "0";
            assignCode(root.left, codes);

            root.right.code = root.code + "1";
            assignCode(root.right, codes);
        }
        else {
            codes[(int)root.element] = root.code;
        }
    }

    /** Get a Huffman tree from the codes */
    public static Tree getHuffmanTree(int[] counts) {
        // Create a heap to hold trees
        Heap<Tree> heap = new Heap<>(); // Heap class from Listing 23.9 (not shown here)
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0)
                heap.add(new Tree(counts[i], (char)i)); // A leaf node tree
        }

        while (heap.getSize() > 1) {
            Tree t1 = heap.remove(); // Remove the smallest-weight tree
            Tree t2 = heap.remove(); // Remove the next smallest
            heap.add(new Tree(t1, t2)); // Combine two trees
        }

        return heap.remove(); // The final tree
    }

    /** Get the frequency of the characters */
    public static int[] getCharacterFrequency(String text) {
        int[] counts = new int[256]; // 256 extended-ASCII characters

        for (int i = 0; i < text.length(); i++)
            counts[(int)text.charAt(i)]++; // Count the characters in text

        return counts;
    }

    /** Define a Huffman coding tree */
    public static class Tree implements Comparable<Tree> {
        Node root; // The root of the tree

        /** Create a tree with two subtrees */
        public Tree(Tree t1, Tree t2) {
            root = new Node();
            root.left = t1.root;
            root.right = t2.root;
            root.weight = t1.root.weight + t2.root.weight;
        }

        /** Create a tree containing a leaf node */
        public Tree(int weight, char element) {
            root = new Node(weight, element);
        }

        @Override /** Compare trees based on their weights */
        public int compareTo(Tree t) {
            if (root.weight < t.root.weight) // Purposely reverse the order
                return 1;
            else if (root.weight == t.root.weight)
                return 0;
            else
                return -1;
        }

        public class Node {
            char element; // Stores the character for a leaf node
            int weight; // weight of the subtree rooted at this node
            Node left; // Reference to the left subtree
            Node right; // Reference to the right subtree
            String code = ""; // The code of this node from the root

            /** Create an empty node */
            public Node() {
            }

            /** Create a node with the specified weight and character */
            public Node(int weight, char element) {
                this.weight = weight;
                this.element = element;
            }
        }
    }
}

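The program depends on a Heap class from Listing 23.9, which is not included in this post. Below is a minimal array-based sketch of a compatible max-heap; remove() returns the largest element by compareTo, and because Tree purposely reverses its comparison, that is the smallest-weight tree. The main method here is only a quick demonstration with integers:

```java
import java.util.ArrayList;

// A minimal max-heap compatible with the add/remove/getSize calls above.
public class Heap<E extends Comparable<E>> {
    private ArrayList<E> list = new ArrayList<>();

    public void add(E e) {
        list.add(e);
        int i = list.size() - 1;
        while (i > 0) {                           // sift the new element up
            int parent = (i - 1) / 2;
            if (list.get(i).compareTo(list.get(parent)) > 0) {
                E tmp = list.get(i);
                list.set(i, list.get(parent));
                list.set(parent, tmp);
                i = parent;
            } else break;
        }
    }

    public E remove() {
        if (list.isEmpty()) return null;
        E root = list.get(0);
        list.set(0, list.get(list.size() - 1));   // move last element to root
        list.remove(list.size() - 1);
        int i = 0;
        while (2 * i + 1 < list.size()) {         // sift the root down
            int max = 2 * i + 1;                  // pick the larger child
            if (max + 1 < list.size()
                    && list.get(max + 1).compareTo(list.get(max)) > 0)
                max++;
            if (list.get(i).compareTo(list.get(max)) < 0) {
                E tmp = list.get(i);
                list.set(i, list.get(max));
                list.set(max, tmp);
                i = max;
            } else break;
        }
        return root;
    }

    public int getSize() { return list.size(); }

    public static void main(String[] args) {
        Heap<Integer> h = new Heap<>();
        h.add(3); h.add(9); h.add(5);
        System.out.println(h.remove() + " " + h.remove() + " " + h.remove()); // prints 9 5 3
    }
}
```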

[Figure: a sample run of the program]

The program prompts the user to enter a text string and counts the frequency of the characters in the text. The getCharacterFrequency method creates an array counts to tally the occurrences of each of the 256 ASCII characters; each time a character appears in the text, its corresponding count is increased by 1.

The program then obtains a Huffman coding tree based on counts. The tree consists of linked nodes. Each Node has the properties element (the character, for a leaf node), weight (the weight of the subtree rooted at this node), left (a reference to the left subtree), right (a reference to the right subtree), and code (the Huffman code for the character). The Tree class contains the root property; from the root, you can access all the nodes in the tree. Tree implements Comparable, so trees can be compared based on their weights. The comparison order is purposely reversed in compareTo so that the smallest-weight tree is removed first from the heap of trees.

The getHuffmanTree method returns a Huffman coding tree. Initially, single-node trees are created and added to the heap. In each iteration of the while loop, the two smallest-weight trees are removed from the heap and combined into a bigger tree, and the new tree is added back to the heap. This process continues until the heap contains just one tree, which is the final Huffman tree for the text.

The assignCode method assigns a code to each node in the tree, and the getCode method collects the code for each character stored in a leaf node. The element codes[i] contains the code for the character (char)i, where i ranges from 0 to 255. Note that codes[i] is null if (char)i does not appear in the text.
