In my physics degree, we were asked to calculate the entropy of a chess board. Students smarter than me snorted at this silly exercise. Yet entropy is just a statistical measure. You cannot measure it directly (there are no entropy-ometers), but it exists everywhere and can be used in odd places like machine learning (e.g. maximum entropy classifiers, which "from all the models that fit our training data, selects the one which has the largest entropy").

Alternatively, you might want to find the configuration with the smallest entropy. An example is here where the quality of a clustering algorithm (k-means in this case) is assessed by looking at the entropy of the detected clusters. "As an external criteria, entropy uses external information — class labels in this case. Indeed, entropy measures the purity of the clusters with respect to the given class labels. Thus, if every cluster consists of objects with only a single class label, the entropy is 0. However, as the class labels of objects in a cluster become more varied, the entropy value increases."
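To make the quoted idea concrete, here is a minimal sketch of scoring a clustering by the entropy of the class labels inside each cluster. The function names, the toy labels and the choice of a size-weighted average over clusters are my own assumptions for illustration; the quoted source does not spell out its exact formula.

```python
import math
from collections import Counter

def cluster_entropy(labels):
    """Entropy in bits of the class labels found inside one cluster."""
    counts = Counter(labels)
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

def clustering_entropy(clusters):
    """Size-weighted average of the per-cluster entropies (an assumed convention)."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)

# Hypothetical toy data: each inner list holds the true class labels
# of the objects that ended up in one cluster.
pure = [["cat", "cat", "cat"], ["dog", "dog", "dog"]]
mixed = [["cat", "dog", "cat"], ["dog", "cat", "dog"]]

pure_score = clustering_entropy(pure)
mixed_score = clustering_entropy(mixed)

print(f"pure clusters:  {pure_score:.4f} bits")
print(f"mixed clusters: {mixed_score:.4f} bits")
```

The pure clustering scores zero, and the score rises as the labels within a cluster become more varied, which is the behaviour the quote describes.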

For instance, say you are trying to find the parameters for an equation such that it best fits the data. "At the very least, you need to provide ... a score for each candidate parameter it tries. This score assignment is commonly called a cost function. The higher the cost, the worse the model parameter will be... The cost function is derived from the principle of maximum entropy." [1]

__What is Entropy__

I found this description of heads (H) and tails (T) from tossing a coin enlightening:

"If the frequency [f] of H is 0.5 and f(T) is 0.5, the entropy E, in bits per toss, is

$$-0.5 \log_2 0.5$$

for heads, and a similar value for tails. The values add up (in this case) to 1.0. The intuitive meaning of 1.0 (the Shannon entropy) is that a single coin toss conveys 1.0 bit of information.

Contrast this with the situation that prevails when using a "weighted" or unfair penny that lands heads-up 70% of the time. We know intuitively that tossing such a coin will produce less information because we can predict the outcome (heads), to a degree.

**Something that's predictable is uninformative**. Shannon's equation gives

$$-0.7 \log_2(0.7) = 0.3602$$

for heads and

$$-0.3 \log_2(0.3) = 0.5211$$

for tails, for an entropy of 0.8813 bits per toss. In this case we can say that a toss is 11.87% [1.0 - 0.8813] redundant."
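The arithmetic in the quote is easy to reproduce. The snippet below is a small sketch (the helper name `shannon_entropy` is mine, not from the quoted text) that sums the per-outcome terms for the fair and the weighted coin.

```python
import math

def shannon_entropy(probabilities):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    # Outcomes with p == 0 are skipped: they contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

fair = shannon_entropy([0.5, 0.5])
biased = shannon_entropy([0.7, 0.3])
redundancy = 1.0 - biased

print(f"fair coin:   {fair:.4f} bits per toss")
print(f"biased coin: {biased:.4f} bits per toss")
print(f"redundancy:  {redundancy:.2%}")
```

This should give 1.0000 and 0.8813 bits per toss and a redundancy of 11.87%, matching the figures quoted above.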

Here's another example:

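$$X = \begin{cases} \text{'a'} & \text{with probability } 0.5 \\ \text{'b'} & \text{with probability } 0.25 \\ \text{'c'} & \text{with probability } 0.125 \\ \text{'d'} & \text{with probability } 0.125 \end{cases}$$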

The entropy of this configuration is:

$$H(X) = -0.5 \log_2 0.5 - 0.25 \log_2 0.25 - 0.125 \log_2 0.125 - 0.125 \log_2 0.125 = 1.75 \text{ bits}$$

What does this actually mean? Well, suppose we try to determine X by asking a sequence of yes/no questions ("Is X 'a'? If not, is it 'b'? ..."). Then "the resulting expected number of binary questions required is 1.75" [2].
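Both numbers can be checked in a few lines. This is only a sketch: the `questions_needed` table encodes my reading of the questioning strategy ("Is X 'a'?", then "Is X 'b'?", then "Is X 'c'?"), under which 'c' and 'd' both take three questions because a third "no" already identifies 'd'.

```python
import math

# The distribution of X from the example.
distribution = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Shannon entropy in bits.
entropy_bits = -sum(p * math.log2(p) for p in distribution.values())

# Number of yes/no questions needed to pin down each symbol
# (assumed strategy: ask about 'a', then 'b', then 'c').
questions_needed = {"a": 1, "b": 2, "c": 3, "d": 3}

expected_questions = sum(
    distribution[symbol] * questions_needed[symbol] for symbol in distribution
)

print(entropy_bits, expected_questions)
```

For this particular distribution the two values coincide at 1.75 because every probability is a power of 1/2, so each symbol needs exactly $-\log_2 p$ questions.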

__Derivation of entropy__

Basically, we want entropy to be extensive. That is "parameters that scale with the system. In other words U(aS,aV,aN)=aU(S,V,N)".

So, if $S_X$ is the entropy of system X, then the combined entropy of two systems, A and B, would be:

$$S_C = S_A + S_B$$

Second, we want it to be largest when all the states are equally probable. Let's write the entropy as the average value of some function f of the state probabilities:

$$S = \langle f \rangle = \sum_i p_i f(p_i) \qquad \text{(Equation 1)}$$

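Now, given two independent sub-systems, A and B, additivity means the system C they make up will have entropy $S_C = S_A + S_B$. Because the probabilities of each sub-system sum to 1, this can be written as a double sum:

$$S_C = \sum_i \sum_j p_i p_j \left[ f(p_i) + f(p_j) \right] \qquad \text{(Equation 2)}$$

that is, we are summing over the states where A is in state i and B is in state j.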

For equation 2 to conform to the form of equation 1, let's introduce the variable $p_{ij} = p_i p_j$. Then:

$$S_C = \sum_i \sum_j p_{ij} f(p_{ij})$$

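For this to agree with the additivity requirement $S_C = S_A + S_B$, we need $f(p_i p_j) = f(p_i) + f(p_j)$. This holds for $f(p) = C \ln p$, since ln(ab) = ln(a) + ln(b).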
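As a sanity check on the argument, the sketch below verifies numerically that with f = C ln p the entropy of two independent sub-systems is the sum of their separate entropies. The two distributions and the choice C = -1 (entropy in nats) are arbitrary values picked for illustration.

```python
import math

def entropy(probabilities, C=-1.0):
    """S = sum_i p_i f(p_i) with f(p) = C ln p; C = -1 gives entropy in nats."""
    return sum(p * C * math.log(p) for p in probabilities if p > 0)

p_a = [0.7, 0.3]          # illustrative distribution for sub-system A
p_b = [0.5, 0.25, 0.25]   # illustrative distribution for sub-system B

# Independent sub-systems: the joint probability is p_ij = p_i * p_j.
p_c = [p_i * p_j for p_i in p_a for p_j in p_b]

s_a = entropy(p_a)
s_b = entropy(p_b)
s_c = entropy(p_c)

# Additivity: S_C should equal S_A + S_B up to floating-point error.
print(math.isclose(s_c, s_a + s_b))
```

Any negative C keeps the entropy positive; choosing C = -1/ln 2 converts the result to bits, as used in the coin examples.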

This is the argument found here.

[1] Machine Learning with TensorFlow.

[2] Elements of Information Theory.