## Information Theory Demystified

Scott Uminsky (updated 12/04)

Over the years many contributions have been made to the field of communication theory. For example, in 1928 R. V. L. Hartley published a paper titled “Transmission of Information”, and in 1946 Dennis Gabor published a paper titled “Theory of Communication”. In 1948, Claude Shannon published the now famous paper “A Mathematical Theory of Communication”, which first appeared in two parts in the July and October issues of the Bell System Technical Journal. Since
then communication theory or information theory as it is sometimes called has
become an accepted field of research. Many books and articles have been
published on the subject since Shannon’s original paper, most notably those by
Leon Brillouin. The following discussion is derived in part from the
second edition of “Science and Information Theory” by Leon Brillouin
[1], as well as from “An Introduction to Information Theory” by John R.
Pierce [2].
Before we proceed, it’s important not to confuse the ordinary use of the word “information” with the word “information” in “information theory”. The latter is called “Shannon information”, which is narrowly defined. Authors sometimes use the ordinary word “information” as if it meant the same thing as “information” in “information theory”. This all too often occurs within the same article and has become a problem in certain bodies of literature. The same is true for the word “entropy”: authors will sometimes use the word without saying whether they mean thermodynamic entropy or “Shannon entropy”, and vice versa. The following discussion is an attempt to clarify these as well as other problems in this area.

### Communication Theory
In communication theory there are three basic elements: the transmitter, the channel, and the receiver. For example, a person who writes a letter is the transmitter (or message source), the postal system is the channel, and of course the person who receives the letter is the receiver. The same can be said for cell phone text messaging, email, and a whole host of other examples. With respect to communication/information theory, each transmitter is generally viewed as a source of a finite number of messages. The content of these messages doesn’t matter. For example, if a message source is only capable of sending two messages, it doesn’t matter (as far as “information theory” is concerned) whether one of the messages contains random letters and the other contains a grammatically correct sentence. In fact, the random sequence could still have an agreed-upon meaning between the users; nevertheless, the meaning of that message is beyond the scope of “information theory”. The question then becomes: what is the scope of information theory? Information theory is only concerned with which message is sent out of a group of possible messages.
That is, the more messages that a message source is capable of transmitting,
the more uncertain the receiver can be about which particular
message will be sent. Thus, the more uncertain the receiver is about which
message is sent, the more (Shannon) information there is.
Note that the length of a
message can be one symbol long or several symbols long, with no impact on the
amount of Shannon information.
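The relationship described above, between the number of possible messages and the receiver’s uncertainty, can be sketched in a few lines of Python. This is an illustrative sketch, not part of the article; the function name is ours, and it uses the identity −log₂(1/N) = log₂(N) for N equally likely messages, which the examples below work through by hand.

```python
import math

def shannon_information_equally_likely(num_messages: int) -> float:
    """Bits of Shannon information (uncertainty) when every one of
    num_messages is equally likely: I = -log2(1/N) = log2(N)."""
    return math.log2(num_messages)

# The more messages a source can send, the more uncertain the
# receiver is about which one will arrive -- and the more bits.
for n in (1, 2, 10):
    print(n, "messages ->", round(shannon_information_equally_likely(n), 2), "bits")
```

A source limited to a single message yields 0 bits; doubling the number of possible messages adds one bit each time.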
### Messages: Equally Likely

#### Example 1
If a message source can send only one possible message, the chance that the receiver
will receive that particular message is one out of one, or 100%. Therefore, the receiver is 100% certain
about which message will be received. Another way of saying this is that the
receiver has 0% uncertainty about which message will be received.

This can be calculated using the equation I = -log₂ P₀:

I = -log₂ (1/1) = 0 bits

Thus, we have 0 bits of “Shannon information”, which corresponds to 0 bits of uncertainty. Note that in communication theory this uncertainty is taken as a measure of the amount of “information” transmitted by the source, and that this amount of information is called “the entropy” or, for the sake of clarity, “Shannon entropy”. Thus, in communication theory the words “information”, “entropy” and “uncertainty” are synonymous and mathematically defined. As an aside, many writers have legitimately compared “thermodynamic entropy” with “Shannon entropy”. The basis for this comparison is the statistical nature of both, i.e. in terms of randomness. Nevertheless, it is still important to identify which one we are talking about, or whether we are using the words synonymously.

#### Example 2
Let’s say that our message source can send only two possible messages and that it’s equally
likely that either one or the other will be sent. In this case, the chance that the receiver will receive one particular message out of the two is one out of two, or 50%. Thus, the receiver is less certain (or more uncertain, compared to example 1) about which message will be sent.

This can also be calculated using the equation I = -log₂ P₀:

I = -log₂ (1/2) = 1 bit

Thus, we have 1 bit of Shannon information, which corresponds to 1 bit of uncertainty or 1 bit of Shannon entropy.

#### Example 3
Let’s say that this time our message source can send only ten possible messages and that it’s
equally likely that any one of the ten will be sent. In this case, the chance that the receiver will receive one particular message out of the ten is one out of ten, or 10%. Therefore, the receiver is even more uncertain (compared to example 2) about which message will be sent. This can also be calculated using the equation I = -log₂ P₀:

I = -log₂ (1/10) ≈ 3.3 bits

Thus, we have 3.3 bits of Shannon information corresponding to 3.3 bits of uncertainty or 3.3 bits of Shannon entropy.

#### Summary

From these three examples, we conclude that the number of bits of information (or entropy) increases as the uncertainty about which message will be sent increases. Note that the (Shannon) information/uncertainty/entropy is the same whether these messages are random sequences or grammatically correct sentences.

### Messages: Not Equally Likely

#### Example 1

Suppose that our message source can send only ten possible messages, with nine of the ten
messages identical and one remaining message different from the nine.
In this case, the chances are nine out of ten, or 90%, that the receiver will receive one of the nine identical messages, and one out of ten, or 10%, that the receiver will receive the one different message. Thus, there is a high degree of certainty (not much uncertainty) that the message received will be one of the nine identical messages. This can be calculated using the equation I = -(P₀ log₂ P₀ + P₁ log₂ P₁):

I = -( (1/10) log₂ (1/10) + (9/10) log₂ (9/10) )
  = -[ (1/10)(-3.32) + (9/10)(-0.152) ]
  ≈ 0.47 bits

Thus, we have 0.47 bits of Shannon information corresponding to 0.47 bits of uncertainty or 0.47 bits of Shannon entropy.

#### Example 2

Suppose that our message source can again send only ten possible messages, with six of the
ten messages identical and the remaining four of the ten identical but
different from the six.
In this case, the chances are six out of ten, or 60%, that the receiver will receive one of the six identical messages, and four out of ten, or 40%, that the receiver will receive one of the four identical messages. Thus, the receiver is more
uncertain (compared to example 1) about whether the message received will be one of
the six identical messages or one of the four identical messages.
This can be calculated using the equation I = -(P₀ log₂ P₀ + P₁ log₂ P₁):

I = -( (4/10) log₂ (4/10) + (6/10) log₂ (6/10) )
  = -[ (4/10)(-1.32) + (6/10)(-0.737) ]
  ≈ 0.97 bits

Thus, we have 0.97 bits of Shannon information corresponding to 0.97 bits of uncertainty or 0.97 bits of Shannon entropy.

#### Summary

From these two examples, we again conclude that the number of bits of information (or entropy) increases as the uncertainty about which message will be sent increases. Thus, the greater the number of bits, the greater the uncertainty. Note again that the (Shannon) information/uncertainty/entropy is the same whether any of these messages are random sequences or grammatically correct sentences. Note again that a message can be one symbol long or several symbols long.

### Messages: Equally Likely vs. Not Equally Likely
Notice that with the “equally likely” messages the Shannon information is greater (i.e. 3.3 bits) than with the “not equally likely” messages (i.e. 0.47 bits and 0.97 bits). This is because the more constraints we impose, the more certainty (less uncertainty) we have about which message will be sent. Thus, the amount of Shannon information decreases the more we constrain or limit the system.

### Messages: Random vs. Recognizable
#### Letters: Equally Likely

A complete alphabet contains the usual 26 letters plus a “blank” for the space between words, giving a total of 27 symbols. In the following examples we will consider these 27 symbols to be 27 possible messages, each one symbol long.

#### Example 1
Suppose that we wanted to produce a random sequence of letters and spaces with equal probabilities. We could do this by taking a set of cards and marking each one with a letter or a space until all 27 symbols were used. We would then put the marked cards into a drum, mix them up, draw a card, record the symbol, return it to the drum, and repeat. Let’s say that we only want to calculate the number of bits of Shannon information for one
letter. In this case, we would simply draw one card out of the drum. Since the chances are one out of twenty-seven that any one particular letter (or space) will be drawn, we would once again use the equation I = -log₂ P₀, as follows:

I = -log₂ (1/27) ≈ 4.76 bits per letter

Thus, we have 4.76 bits of Shannon information corresponding to 4.76 bits of uncertainty or 4.76 bits of Shannon entropy per letter.

#### Example 2
Suppose that we want to repeat example 1, only this time calculating the number of bits of Shannon information for three letters. In this case, let’s continue to draw letters out of the drum randomly until we have three letters, and
let’s say that those letters are “CAT”, corresponding to three messages that are
one symbol long. Since each letter has 4.76 bits of Shannon information, we
would simply do the following:

I = 3 × ( -log₂ (1/27) ) ≈ 14.28 bits

For the sake of argument, let’s say that the three letters we obtained were “XFO” instead of “CAT”. In this case the calculation is the same, i.e. I = 3 × ( -log₂ (1/27) ) ≈ 14.28 bits.
#### Summary
In this case the number of bits of Shannon information is the same whether
the three letters spell out a recognizable word or not.
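The point above, that a recognizable word and a random string of the same length carry the same number of bits when all 27 symbols are equally likely, can be checked with a short sketch. The helper name is ours, not part of the article:

```python
import math

ALPHABET_SIZE = 27  # 26 letters plus the space symbol

def bits_for_string(s: str) -> float:
    """Total Shannon information of a string drawn from 27 equally
    likely symbols: each symbol contributes -log2(1/27) bits."""
    return len(s) * math.log2(ALPHABET_SIZE)

# Content is irrelevant here: only the number of symbols matters.
assert bits_for_string("CAT") == bits_for_string("XFO")
```

Any three-symbol message from this source carries the same total, 3 × log₂(27) bits, whether or not it spells a word.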
#### Letters: Not Equally Likely
It is well known that in any given sentence some letters occur more frequently than others, and that the space between words occurs a certain percentage of the time. For example, a space occurs roughly 20% of the time, the letter “E” occurs 10.5% of the time, the letter “T” occurs 7.2% of the time, and so on for all the letters of the alphabet.

#### Example 1

Let’s say that we want to calculate the number of bits of Shannon information for one
letter. Once again we mark a set of cards, but in this case we have to account
for the different percentages, as well as for 10ths and in some
cases 100ths of a percent. To do this we will have to put a much
larger number of cards than 27 in the drum. We also note that the larger the total
number of cards we put in, the better we are able to account for the 10ths
and 100ths of a percent.
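The need for a larger drum can be illustrated with a tiny sketch (the function name is ours): a 27-card drum cannot represent a frequency like 10.5% except very crudely, while a drum of roughly 10,000 cards resolves 10ths of a percent into whole cards.

```python
def cards_needed(percent: float, total_cards: int) -> int:
    """Whole cards representing a symbol that occurs `percent`
    of the time, in a drum holding `total_cards` cards."""
    return round(percent / 100 * total_cards)

# A letter occurring 10.5% of the time: a 27-card drum rounds this
# crudely to 3 cards, while a 10,000-card drum captures the 0.5%.
assert cards_needed(10.5, 27) == 3
assert cards_needed(10.5, 10_000) == 1050
```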
In this case, we will have to put 9,679 cards in the drum in order to account for the 100ths of a percent. The breakdown is as follows: 2000 cards for the ‘space’ because it occurs roughly 20% of the time, 1050 cards for the letter ‘E’, 720 cards for the letter ‘T’, and so on for all the letters of the alphabet, until we have 9,679 cards in the drum.
To calculate this, the equation I = -(P₀ log₂ P₀ + P₁ log₂ P₁) is extended so that we can account for the percentage of occurrence of all 27 symbols. To do this we write the equation as follows, adding each individual term from the highest probability to the lowest for all 27 symbols:

I = -( P_space log₂ P_space + P_E log₂ P_E + P_T log₂ P_T + … + P_Z log₂ P_Z )
  = -( (2000/9679) log₂ (2000/9679) + (1050/9679) log₂ (1050/9679) + (720/9679) log₂ (720/9679) + … + (10/9679) log₂ (10/9679) )
  ≈ 3.9 bits per letter

Thus, we have approximately 3.9 bits of Shannon information corresponding to 3.9 bits of uncertainty or 3.9 bits of Shannon entropy per letter. Note, however, that this is an average value.
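The average-bits calculation above can be reproduced in Python. The table below is deliberately partial, containing just the three card counts quoted in the text, so it does not reach the full ≈ 3.9 bits; summing over all 27 symbols with their real frequencies would. The names and the partial table are ours:

```python
import math

# Partial table: card counts from the drum of 9,679 cards above.
counts = {" ": 2000, "E": 1050, "T": 720}
TOTAL_CARDS = 9679

def entropy_bits(counts: dict, total: int) -> float:
    """H = -sum(p * log2(p)) over the listed symbols, in bits."""
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# The three most common symbols alone contribute about 1.1 bits;
# the remaining 24 symbols bring the average up toward 3.9.
partial = entropy_bits(counts, TOTAL_CARDS)
```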
#### Example 2
Now let’s say that we want to calculate the number of bits of Shannon information for three
different letters. In this case, we draw letters out of the same drum randomly until we have three letters, and let’s say
that those letters are “XFO”. Let’s also say that these letters can be drawn
out in any order, i.e. “OFX”, “FXO”, etc. Each letter, however, has a unique
probability/percentage associated with it, so we must use those values in our
calculation instead of the average value of 3.9 bits per letter. That is, since
“X” occurs roughly 0.2% of the time, we will have 20 out of 9,679
cards marked with an “X”, and so on for the letters “F” and “O”.
I = -( P_X log₂ P_X + P_F log₂ P_F + P_O log₂ P_O )
  = -( (20/9679) log₂ (20/9679) + (225/9679) log₂ (225/9679) + (654/9679) log₂ (654/9679) )
  = 2.23 bits

#### Example 3
Now let’s say that we are again going to draw three letters out of the drum, but this time
the three letters must be “CAT”, and in that order. This particular scenario is
much more constrained compared to the others that we’ve looked at so far.
Ordinarily it would be highly unlikely that we would draw the letter “C” out of the drum on the first try, then “A” on the second try, and then “T” on the third try. In fact, to do this properly, we will need to draw a card, read it, and not put it back in
the drum. This way, we increase the chances of finding the desired letter on
each subsequent try because the total number of cards is decreasing.
Because we have 9,679 cards, we would typically have to draw tens if not hundreds of cards for this to work. Nevertheless, it’s still “possible” to draw C, A and T on the first, second and third tries, whether we return the cards or not. For this example we will not return the cards to the drum, even though the impact on the actual calculation is negligible. Based upon these considerations, we assume the best possible set of favorable outcomes and take the following result to represent the best-case scenario, i.e. the maximum number of bits. Again, each letter has a unique percentage associated with it, so we must use those values in our calculation instead of the average value of 3.9 bits per letter. That is, since “C” occurs roughly 2.3% of the time, we will have 230 out of 9,679 cards marked with a “C”, and so on for the letters “A” and “T”. To calculate this we do the following, with the total number of cards decreasing as stated above:

I = -( P_C log₂ P_C + P_A log₂ P_A + P_T log₂ P_T )
  = -( (230/9679) log₂ (230/9679) + (630/9678) log₂ (630/9678) + (720/9677) log₂ (720/9677) )
  ≈ 0.67 bits maximum

Note that the typical number of bits for this calculation will be far less than the “best possible” or maximum value given above. In fact, the amount of uncertainty will decrease and tend to zero (bits) as the number of attempts to draw a desired letter increases.

#### Summary

From the examples above, we observe that the average number of bits is the same for both the recognizable word “CAT” and the random string “XFO”, i.e. 3 × 3.9 ≈ 11.7 bits. However, when we compare the amount of Shannon information for the randomly selected letters “XFO” with that for the specifically selected letters “CAT”, we see that the recognizable word CAT has fewer bits than the random XFO. This is consistent with the notion that as the amount of Shannon information increases, the amount of uncertainty increases. That is, more bits equals more uncertainty, which equals less constraint or more randomness.

### Words: Not Equally Likely
It can also be shown that the same is true for words in a sentence. That is, we can put words in a drum according to their percentage of occurrence, draw them out, arrange them in a sentence, and calculate the amount of Shannon information. We can do this as we did above, once for a random sentence and once for a specifically selected recognizable sentence, and still find that the recognizable sentence has much less Shannon information than the random sentence. Note that if we use the value for the average number of bits per word, the two sentences will
also have the same amount of Shannon information.
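A sketch of this word-level idea, using the standard per-word surprisal -log₂ P(w) rather than the article’s card-drum procedure, and using invented, purely illustrative word probabilities:

```python
import math

# Invented word probabilities, for illustration only.
word_p = {"the": 0.06, "cat": 0.0012, "sat": 0.0004,
          "flurp": 1e-7, "zyxt": 5e-8, "qwomb": 1e-7}

def sentence_bits(words, p):
    """Self-information of a specific word sequence: -log2 P(w)
    summed over the words, assuming independent draws."""
    return sum(-math.log2(p[w]) for w in words)

recognizable = sentence_bits(["the", "cat", "sat"], word_p)
random_seq = sentence_bits(["flurp", "zyxt", "qwomb"], word_p)
assert recognizable < random_seq  # common words carry fewer bits
```

Under these assumed probabilities, the sequence of common words requires fewer bits than the sequence of rare, random-looking words, consistent with the comparison above.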
### Concluding Remarks
We have shown that the ordinary use of the word “information” is very different from the word “information” in “information theory”. We have also explained that thermodynamic “entropy” and the “entropy” of information theory are comparable in terms of increasing randomness. Nevertheless, these words have different meanings when narrowed to specifics. With respect to information theory, it doesn’t matter whether the content of a message is random or recognizable, because the average number of bits is the same. Moreover, the words “information”, “entropy” and “uncertainty” are synonymous and correspond to increasing randomness or uncertainty as the number of bits increases, i.e. in the non-average cases. Alternatively, the more we constrain a system, the more certainty there is about the outcome. This increase in certainty always corresponds to a decrease in Shannon information. Moreover, this decrease is said to represent the advance information (in the ordinary use of the word) that we have about the system. Hopefully this paper has clarified some of these issues.

S.U. 11/04

### Bibliography
[1] Leon Brillouin, “Science and Information Theory”, Second Edition, Academic Press, 1962.

[2] John R. Pierce, “An Introduction to Information Theory”, Second Revised Edition, Dover Publications, 1980.

Special thanks to Ryan Huxley and Casey Luskin for suggesting ways to improve this article.

Please also see the article Information Theory: Common Misconceptions.