String compression using Huffman Coding Algorithm

Understand the Huffman Coding Algorithm and implement String Compression using this algorithm.

The Huffman algorithm is a widely used technique for data compression, which aims to reduce the size of data files without losing any information. It was developed by David A. Huffman in 1952 while he was a graduate student at MIT. The algorithm is specifically designed to efficiently represent characters in a way that minimizes the overall file size.

Here is the step by step explaination of the Huffman Coding Algorithm for string compression.

Step 1: Calculate Character Frequencies

The first step in the Huffman algorithm is to analyze the input data and calculate the frequency of occurrence for each character in the data. This can be done by scanning the input data and counting the occurrences of each character.

Step 2: Build the Priority Queue

After calculating the character frequencies, we create a priority queue (min heap) based on these frequencies. Each node in the priority queue represents a single character along with its frequency.

Step 3: Create Huffman Trees

Next, we create Huffman trees from the nodes in the priority queue. In each iteration, we take the two nodes with the lowest frequencies from the priority queue and merge them into a new internal node. The merged node's frequency is the sum of the frequencies of the two nodes being merged. The two nodes become the left and right children of the new internal node. We repeat this process until only one node remains in the priority queue.

Step 4: Generate Huffman Codes

Once we have the final Huffman tree, we assign binary codes to each character in the tree. The Huffman code is generated by traversing the tree from the root to each leaf node. Each left traversal corresponds to a '0' bit, and each right traversal corresponds to a '1' bit. The path from the root to each leaf node gives the unique binary code for that character.

Step 5: Encode the Data

Now that we have the Huffman codes for each character, we can encode the input data using these codes. Each character in the input data is replaced with its corresponding Huffman code. The encoded data is a sequence of '0' and '1' bits representing the original data.

Step 6: Store the Huffman Tree (Optional)

In some applications, it may be necessary to store the Huffman tree along with the encoded data to enable proper decoding. The tree can be represented in a compact form, as it can be uniquely reconstructed during decoding using the prefix property of Huffman codes.

Step 7: Decode the Data

To decode the encoded data back to its original form, we use the stored Huffman tree (if available). Starting from the root of the tree, we traverse the tree using the encoded bits. Whenever we encounter a '0', we move to the left child, and whenever we encounter a '1', we move to the right child. When we reach a leaf node, we output the corresponding character and continue traversing from the root until all encoded bits are processed.

Step 8: Verify Data Integrity

After decoding the data, we should verify its integrity by comparing the decoded data with the original input data. If the Huffman algorithm is implemented correctly, the decoded data should match the original data, ensuring a lossless compression process.

Following is the example code written in C for String Compression using Huffman Coding Algorithm.

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>

// Structure for the Huffman tree node
struct Node {
    char data;
    unsigned frequency;
    struct Node *left, *right;
};

// Structure for the priority queue node
struct MinHeapNode {
    struct Node* node;
    struct MinHeapNode* next;
};

// Structure for the priority queue
struct MinHeap {
    struct MinHeapNode* head;
};

// Function to create a new node for the Huffman tree
struct Node* createNode(char data, unsigned frequency) {
    struct Node* newNode = (struct Node*)malloc(sizeof(struct Node));
    newNode->data = data;
    newNode->frequency = frequency;
    newNode->left = newNode->right = NULL;
    return newNode;
}

// Function to create a new priority queue node
struct MinHeapNode* createMinHeapNode(struct Node* node) {
    struct MinHeapNode* newNode = (struct MinHeapNode*)malloc(sizeof(struct MinHeapNode));
    newNode->node = node;
    newNode->next = NULL;
    return newNode;
}

// Function to create an empty priority queue
struct MinHeap* createMinHeap() {
    struct MinHeap* minHeap = (struct MinHeap*)malloc(sizeof(struct MinHeap));
    minHeap->head = NULL;
    return minHeap;
}

// Function to swap two MinHeapNode pointers
void swapMinHeapNodes(struct MinHeapNode** a, struct MinHeapNode** b) {
    struct MinHeapNode* temp = *a;
    *a = *b;
    *b = temp;
}

// Function to maintain the heap property of the priority queue
void minHeapify(struct MinHeap* minHeap) {
    struct MinHeapNode* current = minHeap->head;
    while (current->next != NULL) {
        struct MinHeapNode* smallest = current;
        struct MinHeapNode* next = current->next;
        while (next != NULL) {
            if (next->node->frequency < smallest->node->frequency) {
                smallest = next;
            }
            next = next->next;
        }
        if (smallest != current) {
            swapMinHeapNodes(&current, &smallest);
        }
        current = current->next;
    }
}

// Function to check if the priority queue has only one node
bool isSizeOne(struct MinHeap* minHeap) {
    return minHeap->head->next == NULL;
}

// Function to extract the minimum node from the priority queue
struct MinHeapNode* extractMin(struct MinHeap* minHeap) {
    struct MinHeapNode* temp = minHeap->head;
    minHeap->head = minHeap->head->next;
    return temp;
}

// Function to insert a new node into the priority queue
void insertMinHeap(struct MinHeap* minHeap, struct Node* node) {
    struct MinHeapNode* newNode = createMinHeapNode(node);
    newNode->next = minHeap->head;
    minHeap->head = newNode;
    minHeapify(minHeap);
}

// Function to build the Huffman tree from character frequencies
struct Node* buildHuffmanTree(char data[], unsigned frequency[], int size) {
    struct MinHeap* minHeap = createMinHeap();
    for (int i = 0; i < size; ++i) {
        insertMinHeap(minHeap, createNode(data[i], frequency[i]));
    }
    while (!isSizeOne(minHeap)) {
        struct Node* left = extractMin(minHeap)->node;
        struct Node* right = extractMin(minHeap)->node;
        struct Node* newNode = createNode('$', left->frequency + right->frequency);
        newNode->left = left;
        newNode->right = right;
        insertMinHeap(minHeap, newNode);
    }
    return extractMin(minHeap)->node;
}

// Function to compress the input string using Huffman codes
void compressString(char* inputString) {
    int frequency[256] = {0};
    char data[256];
    int data_size = 0;

    // Calculate character frequencies
    for (int i = 0; inputString[i] != '\0'; ++i) {
        frequency[(unsigned char)inputString[i]]++;
    }

    // Create data array from characters with non-zero frequency
    for (int i = 0; i < 256; ++i) {
        if (frequency[i] > 0) {
            data[data_size++] = (char)i;
        }
    }

    // Build Huffman tree and perform compression
    struct Node* root = buildHuffmanTree(data, frequency, data_size);
    int arr[100], top = 0;

    printf("Huffman Codes:\n");
    printHuffmanCodes(root, arr, top);

    printf("Compressed string: ");
    for (int i = 0; inputString[i] != '\0'; ++i) {
        // Search for the character in the data array
        for (int j = 0; j < data_size; ++j) {
            if (inputString[i] == data[j]) {
                for (int k = 0; k < top; ++k) {
                    printf("%d", arr[k]);
                }
            }
        }
    }
    printf("\n");
}

int main() {
    char inputString[] = "abacda";
    compressString(inputString);

    return 0;
}

In conclusion, the Huffman algorithm is a powerful data compression technique that efficiently reduces data size while preserving data integrity. By generating optimal binary codes based on character frequencies, the Huffman algorithm plays a significant role in various applications, from file compression to data transmission.