Claude Shannon, mathematician and computer scientist, born on April 30 1916, and died on February 24 2001, was the one who create the mathematical foundations which laid down the general rules of modern information theory. In his fundamental paper of 1948, A Mathematical Theory of Communication, a measure of the uncertainty associated with a random memoryless source, called entropy, is proposed. Here we are interested in the use of the entropy concept to analyze texts at the level of its words variety.
We define the entropy of a text T, with l
words and n different ones, by the formula
where pi, i = 1,¼,n, is
the frequency of each i-word in the text T, that is, pi
is the number of times that the i-word happens to occur in the
given text. If we consider that a text of length l (a text with l
words) is as much richer as much larger is the number n of
different words and, among the texts with the same number l of words and the same number n of
different words, is richer the one where the words have less
variation in frequency, one can easily conclude that the entropy
is indeed a very useful measure to compare the richness of two or
more texts. To compare texts with different number of words l, we introduce a kind of ``relative
entropy'' Erel, defined as the quotient between the
entropy ET of the text and the maximum entropy Emax,
and multiplying by 100 if one wants a percentage:
|
The maximum entropy Emax is just the entropy of a text with the same number l of words and in which each word occurs exactly once (i.e., n : = l, pi : = 1):
Given a text T, write a program that computes the total number l of words in T, the entropy ET of the text, and its relative entropy Erel. In order to determine the required numbers, your program must be case insensitive (for example, words like ``House'', ``house'' or ``HOUSE'' must be considered to be the same); and ignore the punctuation marks , . : ; ! ? " ( ) as well as spaces and newlines ('\n').
The input contains a text T, necessarily with more than one word (l > 1). You can assume that the maximum length of the words is 20 characters long and the text does not have more than 100 000 words.
In the output write three lines: the first with the total number l of words in T; the second with the text entropy ET rounded to one decimal digit; and the last one with the relative entropy Erel, in percentage, and rounded to be an integer.
Midnight, not a sound from the pavement Has the moon lost her memory? She is smiling alone In the lamplight, the withered leaves collect at my feet And the wind begins to moan Memory, all alone in the moonlight I can dream of the old days Life was beautiful then I remember the time I knew what happiness was Let the memory live again
64 1.6 89
1Barbra
Streisand, Memory (first two verses).
MIUP'2002: 2nd. Portuguese National ACM Programming Contest