Data Compression Algorithms

by: Jeffrey N. Ladino

This page is a research project in the field of data compression algorithms. It is intended to be an informative overview for a beginner in the field of computer science.

Table of Contents for this Project

About This Page

How To Use This Page

I have tried to organize this page in an easy to use format. The main viewing area of this page is the top frame which fills about 75% of your browser window. All of the information presented in this page is contained in this top frame. This frame consists of a single file with "within the file" links. These links take you directly to the part of the file which you click on. These "within the file" links are available in two places. One is the Table of Contents at the beginning of the file.

The other source for "within the file" links is the bottom frame. It is a permanent source of in the page links for easy access to all parts of the project. Clicking on one of these links will change the display presented in the top frame of the page. The bottom two frames remain unchanged.

The frame located above the bottom frame is the Symbol Key. It is separated from the main page so that the meaning of a symbol contained in the top frame can be found without losing your place. All Symbols which occur in the main page are linked to the symbol frame. Clicking on the link to a symbol will make that symbol appear topmost in the symbol-key frame leaving the top frame unchanged.

This page is designed to be read from beginning to end by simply using the scrollbar on the right side of the top frame. One can also skip through the page reading only the parts he or she wishes. Good luck, I hope that you find it informative.

Simplifying Conditions for this Page

The focus of this project is lossless data compression. Data compression is explained with a focus on lossless text compression. Lossy compression is only briefly touched upon.

All of the encoding discussed in this page will be from the english alphabet, A, into the binary alphabet, B. Because all of the encoding is into the binary alphabet, every time log is written it will refer to log to the base 2.

The compression discussed in this page is memoryless. This means that in any given message, M, the probability of an element, P:e, is independent of the element which precedes it. This must be stated explicitly because many messages are not memoryless. The probabilities of the elements in these memoried messages are dependent on the elements which precede them.

Data Compression Overview

Data Compression shrinks down a file so that it takes up less space. This is desirable for data storage and data communication. Storage space on disks is expensive so a file which occupies less disk space is "cheaper" than an uncompressed file. Smaller files are also desirable for data communication, because the smaller a file the faster it can be transferred. A compressed file appears to increase the speed of data transfer over an uncompressed file.

Data Compression Defined

All data is encoded. This means that the data is originally a combination of elements, e, from some alphabet, A. This combination of elements is a message, M. This message from the alphabet, A, is encoded into the binary alphabet, B. The string of bits, binary digits (0's and 1's), is the encoded data. So essentially encoding is just tranferring a message, M, from the alphabet A into the alphabet B. Here is an example:

The message is: a b c d
The encoded message is: 00 01 10 11

The above example just translates the elements a,b,c,d from the english alphabet, A, into the binary alphabet, B. These elements can also be decoded from the binary alphabet back into the original message which was in the english alphabet.

Types of Data Compression

There are two main types of data compression: lossy and lossless.

Lossy data compression is named for what it does. After one applies lossy data compression to a message, the message can never be recovered exactly as it was before it was compressed. When the compressed message is decoded it does not give back the original message. Data has been lost.

Because lossy compression can not be decoded to yield the exact original message, it is not a good method of compression for critical data, such as textual data. It is most useful for Digitally Sampled Analog Data (DSAD). DSAD consists mostly of sound, video, graphics, or picture files. Algorithms for lossy compression of DSAD vary, but many use a threshold level truncation. This means that a level is chosen past which all data is truncated. In a sound file, for example, the very high and low frequencies, which the human ear can not hear, may be truncated from the file. Some examples of lossy data compression algorithms are JPEG, MPEG, and Indeo.

Lossless data compression is also named for what it does. In a lossless data compression file the original message can be exactly decoded. Lossless data compression works by finding repeated patterns in a message and encoding those patterns in an efficient manner. For this reason, lossless data compression is also referred to as redundancy reduction. Becuase redundancy reduction is dependent on patterns in the message, it does not work well on random messages. Lossless data compression is ideal for text. Most of the algorithms for lossless compression are based on the LZ compression method developed by Lempel and Ziv.

One type of text encoding which is very effective for files with long strings of repeating bits is RLE. RLE stands for Run Length Encoding. RLE uses a sliding dictionary method of the LZ algorithm. The sliding dictionary method utilizes pointers within the compressed file that point to previously represented strings of bits within the file. Here is an example of a message which could be effectively encoded with RLE:

The rain in Spain falls mainly on the plain.
The string "ain" could be represented only once and could be pointed to by all later calls to that string.

Huffman coding works by analyzing the frequency, F, of elements, e, in a message, M. The elements with the highest frequency, F:e, get assigned the shortest encoding (with the fewest bits). Elements with lower frequencies get assigned longer encodings (with more bits).

Entropy Discussed

The average information content of a message is called its enropy. The information content is related to uncertainty. The less likely a message is to occur the larger its information content. This makes sense if we think of an example: if a person knows what message is about to be sent to him, how much new information has he learned by receiving that message? None. This is all that the above statement is saying.

Entropy, E, is information content. The entropy of a source is inversely proportional to its probability of occurance:

E = -log P
We use the the log function because we are converting all sources into the binary alphabet, B.

The same rule applies to an element, e, in a message, M. Its entropy can be defined as:

E:e = -log (P:e)
P:e is the probability of an element in a message. It is equal to that element's frequency, F:e, divided by the frequency of the entire message, F:M:
P:e = F:e / F:M

The average information content or average entropy for a message, E:M, can now be defined. We know the entropy for each element in the message is E:e. We will index the elements, e, in a message, M, by assigning them integer positions, i, according to their place in the message. For example: the first element, e, in message M will be assigned 1, the second will be assigned 2...the i-th will be assigned i. Now we can define the entropy for an entire message, E:M, where there are i elements:

E:M = (P:1*E:1)+(P:2*E:2)+(P:3*E:3)+...+(P:i*E:i)

Entropy is an important concept to data compression. The entropy of an element (E:e) is the minimum number of bits needed to encode that element. The entropy of an entire message (E:M) is the minimum number of bits needed to encode the entire message with a lossless compression. The entropy of a message can be used to determine if data compression is worth attempting. It can also be used to evaluate the effectiveness of a compression. The number of bits in a compressed code can be compared to the entropy for that message (E:M) revealing how close to optimal compression one's compressed code is.

Huffman Code Example

Huffman Code assigns shorter encodings to elements with a high frequency, F:e. It differs from block encoding in that it is able to assign codes of different bit lengths to different elements. Elements with the highest frequency, F:e, get assigned the shortest bit length code. The key to decompressing huffman code is a huffman tree.

A huffman tree is a special binary tree called a trie. (pronounced try) A binary trie is a binary tree in which a 0 represents a left branch and a 1 represents a right branch. The numbers on the nodes of the binary trie represent the total frequency, F, of the tree below. The leaves of the trie represent the elements, e, to be encoded. The elements are assigned the encoding which corresponds to their place in the binary trie. Below is an example.

NOTE: Spaces are included in the message and in the encodings for readability purposes only. The space is not considered as a character in the message and therefore is not encoded as an element in the message.

Message to be Encoded

dad ade fade bead ace dead cab bad fad cafe face

Block Encoding

011 000 011 000 011 100 101 000 011 101 001 100 000 011 000 010 100 011 100 000 011 010 000 001 001 000 011 101 000 011 010 000 101 100 101 000 010 100

The block encoding above is a fixed length encoding. If a message contains i elements, block encoding requires log i bits to encode each element, e.

Spaces have been inserted between the strings of bits which represent each character in both the Block Encoding and the Huffman Encoding.

Huffman Encoding

01 10 01 10 01 111 110 10 01 111 000 111 10 01 10 001 111 01 111 10 01 001 10 000 000 10 01 110 10 01 001 10 110 111 110 10 001 111

The Huffman code is decoded using a Huffman tree. The tree for this code is illustrated below.

The huffman encoding of the message is 94 bits long. The block encoding is 114 bits long. The huffman encoding saves 20 bits. The compression ratio is 1.21 to 1. The compression rate is 17.5%

Anotated Source List

  1. Drozdek and Simon, Data Structures in C, PWS publishing Co., Boston, 1995.

    (This book is pretty hard to understand. It does give some information on data compression, but it is not very clear.)

  2. Lynch, Thomas J., Data Compression: Techniques and Applications, Lifetime Learning Publications, Belmont, CA, 1985

    (This book is written at a high level. It has good diagrams and problems but is difficult for a beginner to grasp. One of the diagrams from the book (page 6) was used in this project.)

  3. Rosen, Kenneth, Discrete Mathematics and its Applications, McGraw-Hill Inc., New York, 1995.

    (This book was actually my text book for my Discrete Math course this quarter. I'm surprised that it was helpful. It was easy to read and I found it informative in the area of trees and tree structure.)

  4. Storer, James A., Data Compression: Methods and Theory, Computer Science Press, Rockville, MD, 1988

    (This was another book written at a graduate level. I was able to sift through it and find some information which I understood and presented in this project.)

  5. Weiss, Mark Allen, Algorithms, Data Structures and Problem Solving With C++, Addison Wesley Publishing Co. Inc., New York, 1996

    (This is another useful course book which helped me with this project. It was the book I was required to buy for my algorithms and data structures class. Chapter 12 is devoted mostly to data compression and it contains a helpful example of huffman coding.)

End List and Links

I'm from theCollege of Computer Science at Northeastern University.

View my personal homepage on the web.

If you would like to share some input regarding this project, feel free to email me.

Copyright Notice

This page was created by Jeffrey N. Ladino and is his sole property. Jeffrey N. Ladino reserves the rights to all information contained within this page. Any copying of this page, in whole or in part, without the expressed written consent of Jeffrey N. Ladino is a violation of copyright law. Copyright Jeffrey N. Ladino 1996.