Information Theory

Information theory is the quantitative study of information, pioneered by Claude Shannon in 1948 with the publication of his paper "A Mathematical Theory of Communication". In the study of communication, it is useful to have a concrete, mathematical measure of information. In its earliest form, information theory was created to investigate the properties of data communication, but since its creation it has found applications in many other disciplines unrelated to data communication.



How to Quantify Information
There are two ways of looking at a quantitative measurement of information: a heuristic, non-engineering viewpoint and a more mathematical engineering viewpoint.

Non-Engineering Way
When looking at information heuristically, one can view the usefulness of information in terms of the surprise content of the message being conveyed. For example, consider the following two messages:

1. Taxes will be due on April 15 this year.

2. This year, no taxes will be collected.

The first message should come as no surprise to anyone (at least anyone who pays taxes), and thus conveys very little information. The second message, however, would come as a very large surprise to most people and would convey a great amount of information. It can be seen from this example that news of an event with a very high probability (event 1) conveys little information, while news of an event with low probability (event 2) conveys a great deal of information. Mathematically speaking, if P is the probability of an event occurring and I is the "value" of the information contained in the message of the event occurring, then:
 * $$P \rightarrow 1 \Rightarrow I \rightarrow 0$$
 * $$P \rightarrow 0 \Rightarrow I \rightarrow \infty$$
 * $$ I \sim \log \left(\frac {1}{P}\right) $$

Engineering Way
Looking at the problem of quantifying information from an engineering standpoint, it is more useful to consider the time it takes to transmit the message. An engineer should be able to transmit information as quickly and efficiently as possible. The information in a message is therefore proportional to the minimum time it takes to transmit the message, i.e. a message with higher probability can be transmitted faster than a message with low probability. Similar to the heuristic way of quantifying information, this definition leads to the following relationship:


 * $$ I = \log_{r} \left(\frac {1}{P}\right) $$

where r refers to an "r-ary" digit used to convey the information. For example, if a message were to be sent in binary, the above relationship could be written as:


 * $$ I = \log_{2} \left(\frac {1}{P}\right) $$
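As a quick sketch of this relationship (the function name below is illustrative, not from the text), the following Python snippet evaluates I = log2(1/P) for a few probabilities:

```python
import math

def information_bits(p):
    """Information I = log2(1/P) conveyed by an event of probability P."""
    return math.log2(1 / p)

print(information_bits(0.5))    # a coin flip: 1.0 bit
print(information_bits(0.125))  # a 1-in-8 event: 3.0 bits
```

Note that as P approaches 1 the result approaches 0, and as P approaches 0 the result grows without bound, matching the limits stated earlier.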

Morse Code Example
A good illustration of quantifying information the "engineering way" is the transmission of messages (in English, for simplicity) in Morse code. Morse code uses a series of dots (short pulses) and dashes (long pulses) in different combinations to represent the letters of the English alphabet as well as the numerals of the decimal system. Because Morse code is a binary system being used to represent 26 letters and 10 numerals, all but two of the characters require multiple bits (the numerals are each represented with 5 bits). As can be seen from the chart on the left, the two characters represented by a single bit are E and T. The reason for this is that E and T are two of the most frequently used letters in the English alphabet (i.e. with a higher probability of occurring and a lower information content), and therefore they are assigned the codes that take the least amount of time to transmit. On the other hand, a less frequently used letter such as Z or Q is assigned 4 bits, which take longer to transmit.
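The E-versus-Z contrast can be made concrete with the information formula. The letter frequencies below are rough, commonly cited figures for English text (roughly 12.7% for E and 0.07% for Z), not values taken from the chart:

```python
import math

# Approximate English letter frequencies (rounded, illustrative values)
freq = {"E": 0.127, "Z": 0.0007}

for letter, p in freq.items():
    bits = math.log2(1 / p)  # I = log2(1/P)
    print(f"{letter}: {bits:.1f} bits of information")
```

The common letter E carries only about 3 bits of information per occurrence, while the rare letter Z carries over 10, which is why an efficient code gives E the shortest symbol.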

Entropy
An important concept in information theory is entropy. Simply stated, entropy (as it applies to information theory) is the average amount of information (usually measured in bits) required to transmit (or store) a message. A fair coin has an entropy of 1 bit, i.e. the possible outcomes (heads or tails) can be represented by a 1 or a 0. Similarly, a fair 8-sided die has an entropy of 3 bits, i.e. the 8 unique outcomes of the die can each be represented by a unique combination of 3 bits.

It is important to note that in both examples above, the coin and the die are fair, and thus each outcome is equally probable. It can be shown that the maximum amount of entropy is found when all outcomes of an event (i.e. all "messages" generated by a source) have equal probability.

Mathematically, the entropy H(m) of a message source m that can produce n distinct messages, where the i-th message has probability P_i and information content I_i, can be defined as

$$ H\left(m\right)=\sum_{i=1}^n \, P_i \, I_i $$

which, substituting the "engineering" relationship between information and probability and using bits to encode the message, can be re-written as

$$ H\left(m\right)=\sum_{i=1}^n \, P_i \, \log_{2} \left(\frac {1}{P_i}\right) $$
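As a quick check of this formula (a sketch, not from the original text), the coin and die examples above come out to exactly 1 and 3 bits, and a biased coin falls below the fair coin's maximum:

```python
import math

def entropy(probs):
    """H(m) = sum of P_i * log2(1/P_i), in bits."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy([1/8] * 8))   # fair 8-sided die: 3.0 bits
print(entropy([0.9, 0.1]))  # biased coin: less than 1 bit
```

The biased-coin case illustrates the earlier point that entropy is maximized when all outcomes are equally probable.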

Compact Codes
The idea of entropy can be applied to the coding of messages and can be used to find the most efficient way to encode a set of messages. According to the source coding theorem, when encoding a source with an entropy of H(m), an average of at least H(m) binary bits per message, or code word, will be needed. The most efficient source code that can be created (given a set of messages to be encoded) is the Huffman code.

How to Create Huffman Code
Huffman code can be determined using a recursive reduction procedure. First, one must know the full set of messages that a source can transmit and the probability of each message. In this example, there are five messages, each with its own probability of occurring, and the coding scheme used is binary. To start, arrange the messages in order of probability from most likely to least likely (as shown in the figure below). Then combine the bottom two messages, add their probabilities, and insert the "new" probability into the list (re-arranging the list so it remains ordered from most likely to least likely). In this example, m4 and m5 are combined into a message with a probability of 0.14. This procedure is repeated to form columns R2 and R3 in the table. Note that the combined probability (0.31) of m3 and the merged m4/m5 message from R1 is higher than the probability of m2 in R1 (0.23), so in R2 the 0.31 entry is placed before 0.23 to keep the list in descending order.



After the above procedure has been repeated enough times, the original list of messages will be reduced to two messages. Assign the first message in column R3 (probability 0.54) a 0 and the second (probability 0.46) a 1. Then go to the code column for R2. The message with a probability of 0.46 is still coded as a 1, but the message with a probability of 0.54 in column R3 splits back into two messages, one with probability 0.31 and one with probability 0.23. To code these messages, take the 0 from the 0.54-probability message as the first bit, then append a 0 for the 0.31-probability message to form "00" and a 1 for the 0.23-probability message to form "01". In R1, the 0.31-probability message from R2 is broken down again, and its code is assigned by starting with the 00 from R2 and appending a 0 or a 1 to form a 3-bit binary code. In this way, each message is recursively assigned a code until all messages have one.
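The reduction procedure above can be sketched in code. The probabilities for m1 through m3 follow the worked example; the individual probabilities of m4 and m5 are an assumption (0.07 each), since the text gives only their combined probability of 0.14:

```python
import heapq

def huffman_code(probs):
    """Build a binary Huffman code for a dict {message: probability}.

    Mirrors the reduction procedure: repeatedly merge the two least
    probable entries, prepending a distinguishing bit to each side.
    """
    # Heap entries: (probability, tie-breaker, {message: partial code})
    heap = [(p, i, {m: ""}) for i, (m, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        p1, _, least = heapq.heappop(heap)   # least probable group
        p2, _, second = heapq.heappop(heap)  # second least probable group
        merged = {m: "0" + c for m, c in second.items()}
        merged.update({m: "1" + c for m, c in least.items()})
        heapq.heappush(heap, (p1 + p2, tie, merged))
        tie += 1
    return heap[0][2]

# m4/m5 split is an assumption; the text gives only their sum, 0.14
probs = {"m1": 0.46, "m2": 0.23, "m3": 0.17, "m4": 0.07, "m5": 0.07}
codes = huffman_code(probs)
avg_len = sum(probs[m] * len(c) for m, c in codes.items())
print(codes)    # code lengths come out to 1, 2, 3, 4, 4 bits
print(avg_len)  # average code length in bits
```

Under this assumed split, the average code length (1.99 bits) sits just above the source entropy (about 1.97 bits), as the source coding theorem requires.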