## Information Theory Demystified

Scott Uminsky (updated 12/04)

Over the years many contributions have been made to the field of communication theory. For example, in 1928 R. V. L. Hartley published a paper titled “Transmission of Information”, and in 1946 Dennis Gabor published a paper titled “Theory of Communication”. In 1948, Claude Shannon published the now famous paper “A Mathematical Theory of Communication”, which first appeared in two parts in the July and October issues of the Bell System Technical Journal. Since
then communication theory or information theory as it is sometimes called has
become an accepted field of research. Many books and articles have been
published on the subject since Shannon’s original paper, most notably those by
Leon Brillouin. The following discussion is derived in part from the
second edition of “Science and Information Theory” by Leon Brillouin
[1], as well as from “An Introduction to Information Theory” by John R.
Pierce [2].
Before we proceed, it’s important not to confuse the ordinary use of the word “information” with the word “information” in “information theory”. The latter is called “Shannon information”, which is narrowly defined. Authors sometimes use the ordinary word “information” as if it meant the same thing as “information” in “information theory”. This all too often occurs within the same article and has become a problem in certain bodies of literature. The same is true for the word “entropy”: authors will sometimes use the word without saying whether they mean thermodynamic entropy or “Shannon entropy”, and vice versa. The following discussion is an attempt to clarify these as well as other problems in this area.

### Communication Theory
In communication theory there are three basic elements: the transmitter, the channel, and the receiver. For example, a person who writes a letter is the transmitter (or message source), the postal system is the channel, and of course the person who receives the letter is the receiver. The same can be said for cell phone text messaging, email, and a whole host of other examples. With respect to communication/information theory, each transmitter is generally viewed as a source of a finite number of messages. The content of these messages doesn’t matter. For example, if a message source is only capable of sending two messages, it doesn’t matter (as far as “information theory” is concerned) whether one of the messages contains random letters and the other contains a grammatically correct sentence. In fact, the random sequence could still have an agreed-upon meaning between the users; nevertheless, the meaning of that message is beyond the scope of “information theory”. The question then becomes: what is the scope of information theory? Information theory is only concerned with which message is sent out of a group of possible messages.
That is, the more messages that a message source is capable of transmitting,
the more uncertain the receiver can be about which particular
message will be sent. Thus, the more uncertain the receiver is about which
message is sent, the more (Shannon) information there is.
Note that the length of a
message can be one symbol long or several symbols long, with no impact on the
amount of Shannon information.
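The relationship described above, between the number of possible messages and the receiver’s uncertainty, can be sketched in a few lines of Python. This is an illustrative sketch, not part of the article; the function name is ours, and it uses the identity −log₂(1/N) = log₂(N) for N equally likely messages, which the examples below work through by hand.

```python
import math

def shannon_information_equally_likely(num_messages: int) -> float:
    """Bits of Shannon information (uncertainty) when every one of
    num_messages is equally likely: I = -log2(1/N) = log2(N)."""
    return math.log2(num_messages)

# The more messages a source can send, the more uncertain the
# receiver is about which one will arrive -- and the more bits.
for n in (1, 2, 10):
    print(n, "messages ->", round(shannon_information_equally_likely(n), 2), "bits")
```

A source limited to a single message yields 0 bits; doubling the number of possible messages adds one bit each time.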
### Messages: Equally Likely

#### Example 1
If a message source can send only one possible message, the chance that the receiver
will receive that particular message is one out of one, or 100%. Therefore, the receiver is 100% certain
about which message will be received. Another way of saying this is that the
receiver has 0% uncertainty about which message will be received.

This can be calculated using the equation I = -log₂ P₀:

I = -log₂ (1/1) = 0 bits

Thus, we have 0 bits of “Shannon information”, which corresponds to 0 bits of uncertainty. Note that in communication theory this uncertainty is taken as a measure of the amount of “information” transmitted by the source, and that this amount of information is called “the entropy” or, for the sake of clarity, “Shannon entropy”. Thus, in communication theory the words “information”, “entropy” and “uncertainty” are synonymous and mathematically defined. As an aside, many writers have legitimately compared “thermodynamic entropy” with “Shannon entropy”. The basis for this comparison is the statistical nature of both, i.e. in terms of randomness. Nevertheless, it is still important to identify which one we are talking about, or whether we are using the words synonymously.

#### Example 2
Let’s say that our message source can send only two possible messages and that it’s equally
likely that either one or the other will be sent. In this case, the chance that the receiver will receive one particular message out of the two is one out of two, or 50%. Thus, the receiver is less certain (or more uncertain, compared to example 1) about which message will be sent.

This can also be calculated using the equation I = -log₂ P₀:

I = -log₂ (1/2) = 1 bit

Thus, we have 1 bit of Shannon information, which corresponds to 1 bit of uncertainty or 1 bit of Shannon entropy.

#### Example 3
Let’s say that this time our message source can send only ten possible messages and that it’s
equally likely that any one of the ten will be sent. In this case, the chance that the receiver will receive one particular message out of the ten is one out of ten, or 10%. Therefore, the receiver is even more uncertain (compared to example 2) about which message will be sent. This can also be calculated using the equation I = -log₂ P₀:

I = -log₂ (1/10) ≈ 3.3 bits

Thus, we have 3.3 bits of Shannon information corresponding to 3.3 bits of uncertainty or 3.3 bits of Shannon entropy.

#### Summary

From these three examples, we conclude that the number of bits of information (or entropy) increases as the uncertainty about which message will be sent increases. Note that the (Shannon) information/uncertainty/entropy is the same whether these messages are random sequences or grammatically correct sentences.

### Messages: Not Equally Likely

#### Example 1

Suppose that our message source can send only ten possible messages, with nine of the ten
messages identical and one remaining message different from the nine.
In this case, the chances are nine out of ten, or 90%, that the receiver will receive one of the nine identical messages, and one out of ten, or 10%, that the receiver will receive the one different message. Thus, there is a high degree of certainty (not much uncertainty) that the message received will be one of the nine identical messages. This can be calculated using the equation I = -(P₀ log₂ P₀ + P₁ log₂ P₁):

I = -( (1/10) log₂ (1/10) + (9/10) log₂ (9/10) )
  = -[ (1/10)(-3.32) + (9/10)(-0.152) ]
  ≈ 0.47 bits

Thus, we have 0.47 bits of Shannon information corresponding to 0.47 bits of uncertainty or 0.47 bits of Shannon entropy.

#### Example 2

Suppose that our message source can again send only ten possible messages, with six of the
ten messages identical and the remaining four of the ten identical but
different from the six.
In this case, the chances are six out of ten, or 60%, that the receiver will receive one of the six identical messages, and four out of ten, or 40%, that the receiver will receive one of the four identical messages. Thus, the receiver is more
uncertain (compared to example 1) about whether the message received will be one of
the six identical messages or one of the four identical messages.
This can be calculated using the equation I = -(P₀ log₂ P₀ + P₁ log₂ P₁):

I = -( (4/10) log₂ (4/10) + (6/10) log₂ (6/10) )
  = -[ (4/10)(-1.32) + (6/10)(-0.737) ]
  ≈ 0.97 bits

Thus, we have 0.97 bits of Shannon information corresponding to 0.97 bits of uncertainty or 0.97 bits of Shannon entropy.

#### Summary

From these two examples, we again conclude that the number of bits of information (or entropy) increases as the uncertainty about which message will be sent increases. Thus, the greater the number of bits, the greater the uncertainty. Note again that the (Shannon) information/uncertainty/entropy is the same whether any of these messages are random sequences or grammatically correct sentences. Note again that a message can be one symbol long or several symbols long.

### Messages: Equally Likely vs. Not Equally Likely
Notice that with the “equally likely” messages the Shannon information is greater (i.e. 3.3 bits) than with the “not equally likely” messages (i.e. 0.47 bits and 0.97 bits). This is because the more constraints we impose, the more certainty (less uncertainty) we have about which message will be sent. Thus, the amount of Shannon information decreases the more we constrain or limit the system.

### Messages: Random vs. Recognizable
#### Letters: Equally Likely

A complete alphabet contains the usual 26 letters plus a “blank” for the space between words, giving a total of 27 symbols. In the following examples we will consider these 27 symbols to be 27 possible messages, each one symbol long.

#### Example 1
Suppose that we wanted to produce a random sequence of letters and spaces with equal probabilities. We could do this by taking a set of cards and marking each one with a letter or a space until all 27 symbols were used. We would then put the marked cards into a drum, mix them up, draw a card, record the symbol, return it to the drum, and repeat. Let’s say that we only want to calculate the number of bits of Shannon information for one
letter. In this case, we would simply draw one card out of the drum. Since the chances are one out of twenty-seven that any one particular letter (or space) will be drawn, we would once again use the equation I = -log₂ P₀, as follows:

I = -log₂ (1/27) ≈ 4.76 bits per letter

Thus, we have 4.76 bits of Shannon information corresponding to 4.76 bits of uncertainty or 4.76 bits of Shannon entropy per letter.

#### Example 2
Suppose that we want to repeat example 1, only this time calculating the number of bits of Shannon information for three letters. In this case, let’s continue to draw letters out of the drum randomly until we have three letters, and
let’s say that those letters are “CAT”, corresponding to three messages that are
one symbol long. Since each letter has 4.76 bits of Shannon information, we
would simply do the following:

I = 3 × ( -log₂ (1/27) ) ≈ 14.28 bits

For the sake of argument, let’s say that the three letters we obtained were “XFO” instead of “CAT”. In this case the calculation is the same, i.e. I = 3 × ( -log₂ (1/27) ) ≈ 14.28 bits.
#### Summary
In this case the number of bits of Shannon information is the same whether
the three letters spell out a recognizable word or not.
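The point above, that a recognizable word and a random string of the same length carry the same number of bits when all 27 symbols are equally likely, can be checked with a short sketch. The helper name is ours, not part of the article:

```python
import math

ALPHABET_SIZE = 27  # 26 letters plus the space symbol

def bits_for_string(s: str) -> float:
    """Total Shannon information of a string drawn from 27 equally
    likely symbols: each symbol contributes -log2(1/27) bits."""
    return len(s) * math.log2(ALPHABET_SIZE)

# Content is irrelevant here: only the number of symbols matters.
assert bits_for_string("CAT") == bits_for_string("XFO")
```

Any three-symbol message from this source carries the same total, 3 × log₂(27) bits, whether or not it spells a word.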
#### Letters: Not Equally Likely
It is well known that in any given sentence some letters occur more frequently than others, and that the space between words occurs a certain percentage of the time. For example, a space occurs roughly 20% of the time, the letter “E” occurs 10.5% of the time, the letter “T” occurs 7.2% of the time, and so on for all the letters of the alphabet.

#### Example 1

Let’s say that we want to calculate the number of bits of Shannon information for one
letter. Once again we mark a set of cards, but in this case we have to account
for the different percentages, as well as for 10ths and in some
cases 100ths of a percent. To do this we will have to put a much
larger number of cards than 27 in the drum. We also note that the larger the total
number of cards we put in, the better we are able to account for the 10ths
and 100ths of a percent.
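The need for a larger drum can be illustrated with a tiny sketch (the function name is ours): a 27-card drum cannot represent a frequency like 10.5% except very crudely, while a drum of roughly 10,000 cards resolves 10ths of a percent into whole cards.

```python
def cards_needed(percent: float, total_cards: int) -> int:
    """Whole cards representing a symbol that occurs `percent`
    of the time, in a drum holding `total_cards` cards."""
    return round(percent / 100 * total_cards)

# A letter occurring 10.5% of the time: a 27-card drum rounds this
# crudely to 3 cards, while a 10,000-card drum captures the 0.5%.
assert cards_needed(10.5, 27) == 3
assert cards_needed(10.5, 10_000) == 1050
```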
In this case, we will have to put 9,679 cards in the drum in order to account for the 100ths of a percent. The breakdown is as follows: 2000 cards for the ‘space’ because it occurs roughly 20% of the time, 1050 cards for the letter ‘E’, 720 cards for the letter ‘T’, and so on for all the letters of the alphabet, until we have 9,679 cards in the drum.
To calculate this, the equation I = -(P₀ log₂ P₀ + P₁ log₂ P₁) is extended so that we can account for the percentage of occurrence of all 27 symbols. To do this we write the equation as follows, adding each individual term from the highest probability to the lowest for all 27 symbols:

I = -( P_space log₂ P_space + P_E log₂ P_E + P_T log₂ P_T + … + P_Z log₂ P_Z )
  = -( (2000/9679) log₂ (2000/9679) + (1050/9679) log₂ (1050/9679) + (720/9679) log₂ (720/9679) + … + (10/9679) log₂ (10/9679) )
  ≈ 3.9 bits per letter

Thus, we have approximately 3.9 bits of Shannon information corresponding to 3.9 bits of uncertainty or 3.9 bits of Shannon entropy per letter. Note, however, that this is an average value.
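The average-bits calculation above can be reproduced in Python. The table below is deliberately partial, containing just the three card counts quoted in the text, so it does not reach the full ≈ 3.9 bits; summing over all 27 symbols with their real frequencies would. The names and the partial table are ours:

```python
import math

# Partial table: card counts from the drum of 9,679 cards above.
counts = {" ": 2000, "E": 1050, "T": 720}
TOTAL_CARDS = 9679

def entropy_bits(counts: dict, total: int) -> float:
    """H = -sum(p * log2(p)) over the listed symbols, in bits."""
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# The three most common symbols alone contribute about 1.1 bits;
# the remaining 24 symbols bring the average up toward 3.9.
partial = entropy_bits(counts, TOTAL_CARDS)
```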
#### Example 2
Now let’s say that we want to calculate the number of bits of Shannon information for three
different letters. In this case, we draw letters out of the same drum randomly until we have three letters, and let’s say
that those letters are “XFO”. Let’s also say that these letters can be drawn
out in any order, i.e. “OFX”, “FXO”, etc. Each letter, however, has a unique
probability/percentage associated with it, so we must use those values in our
calculation instead of the average value of 3.9 bits per letter. That is, since
“X” occurs roughly 0.2% of the time, we will have 20 out of 9,679
cards marked with an “X”, and so on for the letters “F” and “O”.
I = -( P_X log₂ P_X + P_F log₂ P_F + P_O log₂ P_O )
  = -( (20/9679) log₂ (20/9679) + (225/9679) log₂ (225/9679) + (654/9679) log₂ (654/9679) )
  = 2.23 bits

#### Example 3
Now let’s say that we are again going to draw three letters out of the drum, but this time
the three letters must be “CAT”, and in that order. This particular scenario is
much more constrained compared to the others that we’ve looked at so far.
Ordinarily it would be highly unlikely that we would draw the letter “C” out of the drum on the first try, then “A” on the second try, and then “T” on the third try. In fact, to do this properly, we will need to draw a card, read it, and not put it back in
the drum. This way, we increase the chances of finding the desired letter on
each subsequent try because the total number of cards is decreasing.
Because we have 9,679 cards, we would typically have to draw tens if not hundreds of cards for this to work. Nevertheless, it’s still “possible” to draw C, A and T on the first, second and third tries, whether we return the cards or not. For this example we will not return the cards to the drum, even though the impact on the actual calculation is negligible. Based upon these considerations, we assume the best possible set of favorable outcomes and take the following result to represent the best-case scenario, i.e. the maximum number of bits. Again, each letter has a unique percentage associated with it, so we must use those values in our calculation instead of the average value of 3.9 bits per letter. That is, since “C” occurs roughly 2.3% of the time, we will have 230 out of 9,679 cards marked with a “C”, and so on for the letters “A” and “T”. To calculate this we do the following, with the total number of cards decreasing as stated above:

I = -( P_C log₂ P_C + P_A log₂ P_A + P_T log₂ P_T )
  = -( (230/9679) log₂ (230/9679) + (630/9678) log₂ (630/9678) + (720/9677) log₂ (720/9677) )
  ≈ 0.67 bits maximum

Note that the typical number of bits for this calculation will be far less than the “best possible” or maximum value given above. In fact, the amount of uncertainty will decrease and tend to zero (bits) as the number of attempts to draw a desired letter increases.

#### Summary

From the examples above, we observe that the average number of bits is the same for both the recognizable word “CAT” and the random string “XFO”, i.e. 3 × 3.9 ≈ 11.7 bits. However, when we compare the amount of Shannon information for the randomly selected letters “XFO” with that for the specifically selected letters “CAT”, we see that the recognizable word CAT has fewer bits than the random XFO. This is consistent with the notion that as the amount of Shannon information increases, the amount of uncertainty increases. That is, more bits equals more uncertainty, which equals less constraint or more randomness.

### Words: Not Equally Likely
It can also be shown that the same is true for words in a sentence. That is, we can put words in a drum according to their percentage of occurrence, draw them out, arrange them in a sentence, and calculate the amount of Shannon information. We can do this as we did above, once for a random sentence and once for a specifically selected recognizable sentence, and still find that the recognizable sentence has much less Shannon information than the random sentence. Note that if we use the value for the average number of bits per word, the two sentences will
also have the same amount of Shannon information.
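A sketch of this word-level idea, using the standard per-word surprisal -log₂ P(w) rather than the article’s card-drum procedure, and using invented, purely illustrative word probabilities:

```python
import math

# Invented word probabilities, for illustration only.
word_p = {"the": 0.06, "cat": 0.0012, "sat": 0.0004,
          "flurp": 1e-7, "zyxt": 5e-8, "qwomb": 1e-7}

def sentence_bits(words, p):
    """Self-information of a specific word sequence: -log2 P(w)
    summed over the words, assuming independent draws."""
    return sum(-math.log2(p[w]) for w in words)

recognizable = sentence_bits(["the", "cat", "sat"], word_p)
random_seq = sentence_bits(["flurp", "zyxt", "qwomb"], word_p)
assert recognizable < random_seq  # common words carry fewer bits
```

Under these assumed probabilities, the sequence of common words requires fewer bits than the sequence of rare, random-looking words, consistent with the comparison above.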
### Concluding Remarks
We have shown that the ordinary use of the word “information” is very different from the word “information” in “information theory”. We have also explained that thermodynamic “entropy” and the “entropy” of information theory are comparable in terms of increasing randomness. Nevertheless, these words have different meanings when narrowed to specifics. With respect to information theory, it doesn’t matter whether the content of a message is random or recognizable, because the average number of bits is the same. Moreover, the words “information”, “entropy” and “uncertainty” are synonymous and correspond to increasing randomness or uncertainty as the number of bits increases, i.e. in the non-average cases. Alternatively, the more we constrain a system, the more certainty there is about the outcome. This increase in certainty always corresponds to a decrease in Shannon information. Moreover, this decrease is said to represent the advance information (in the ordinary use of the word) that we have about the system. Hopefully this paper has clarified some of these issues.

S.U. 11/04

### Bibliography
[1] Leon Brillouin, “Science and Information Theory”, Second Edition, Academic Press, 1962.

[2] John R. Pierce, “An Introduction to Information Theory”, Second Revised Edition, Dover Publications, 1980.

Special thanks to Ryan Huxley and Casey Luskin for suggesting ways to improve this article.

Please also see the article Information Theory: Common Misconceptions.