Agile Java Man: Figaro

Saturday, November 19, 2016

Probabilistic Programming - Part 2

In part 1, I looked at the first chapter of Bayesian Methods for Hackers. In this post, I'll conduct the same analysis on the same data but instead of using Python I'll be using Scala and Figaro.

To recap: we have somebody's text usage and we want to know when (if at all) there was a change in usage. All we have is the number of texts sent each day over a 74 day period.

We suppose that the distribution of texts per day is a Poisson distribution as we assume the text messages are independent of each other, it's a discrete distribution and it's got the nice property that the one constant it depends on (λ) is also its mean.

Unfortunately, we don't know what this parameter is so we need to model that stochastically too. This λ parameter is not discrete but continuous, so we choose the exponential distribution to model as it too has nice properties. It too depends on a constant parameter (α) and it has the nice property that its mean is α^-1.

Because of these nice properties the distributions have, we are helped in our initial guess at what their values are. In this case, we can take the mean number of texts (let's call it m) and then we know λ will (roughly) be that value. However, λ is not a fixed value, it's a stochastic variable modeled by an exponential function. That's OK though since we know that the mean of an exponential distribution is α^-1then α=m^-1.

Finally, we are assuming that on some day indicated by a value tau, we switch from one Poisson distribution to another so we need two of everything. Regarding the s, "because they’re parameters of a parameter, they’re sometimes called hyper-parameters." [1]

OK, enough of the maths, here's the code:

val data: List[Int] = // load the data from a file

val mean = data.sum.toDouble / data.length.toDouble
val alpha = 1.0 / mean

val lambda1: Element[Double] = Exponential(alpha)
val lambda2: Element[Double] = Exponential(alpha)

val poisson1 = Poisson(lambda1)
val poisson2 = Poisson(lambda2)

We don't know the initial value of tau so we guess it's initially likely to be any value in the 74 day period. We express that in Figaro thus:

val tau: AtomicUniform[Int] = Uniform(0 to n_count_data: _*)

And then we build our model to describe the situation outlined in the first few paragraphs thus:

def pDay(day: Int): Element[Int] = If (tau.map(value => value > day), poisson1, poisson2)

val modelData: Seq[Element[Int]] = (1 to n_count_data).map(d => pDay(d))

Finally, we tell the model what actual data is to help it know what it's shooting for:

modelData.zip(data).foreach { case(test, real) =>
test.observe(real) // WARNING! This does not play nicely with MH (see below)
}

Then we hand it off to an algorithm that solves the equation for us, Metropolis-Hastings.

val mh = MetropolisHastings(numSamples = 40000,
ProposalScheme(tau, lambda1, lambda2),
burnIn = 10000,
tau)
mh.start()

Oops. It gets stuck. Even if I reduce the number of samples from 40 000 to 10. Hmm. Well, it prompted me to ask Avi Pfeffer, the creator of Figaro, what the problem was.

"You have 70+ variables representing different days with 70+ possible values for each variable. When you observe all of these, you get an extremely unlikely observation. I think MH is struggling with finding any state that causes the observations to be satisfied."

That is, the chances of getting the actual data given the model we have built is staggeringly small. Annoyingly, I already knew this as I had read Dr Pfeffer's book where he built a probabilistic model of movie actors winning awards. Naturally, only one person can win the Oscar for best actor, so the model gets stuck as MH cannot explore any other actor winning the Oscar as it's a two step process to get there: before we select a new actor the original actor is no longer the winner (zero actors win the award - that's impossible) or we choose another actor who wins the award but before we say the original is not the winner (two actors win the award - also impossible).

There are two solutions. "The first is to use proposal schemes that are different from the default scheme. These custom schemes can, for example, propose to change two elements at once. The second idea is to avoid hard conditions that can never be violated." [1]

I chose the latter and wrote this:

def fitnessFnPowered(real: Int, power: Double): (Int) => Double = x => 1.0 / pow((abs(real - x) + 1), power)

That is, we calculate a value in the range [0,1] where 1 indicates a direct hit and anything else is less than this depending on how much off target it was. How quickly wrong proposals deviate from 1.0 is determined by the value of power.

modelData.zip(data).foreach { case(test, real) =>
test.addConstraint(fitnessFnPowered(real, ???))
}

And now my algorithm runs to completion. Well, it does if I give it a power rather than ???. What is a good value? By trial and error, I got:

And the probability became 1.0 (ie, certain) when the power value was set to 5.0. Although the distributions differ to the BM4H book, both frameworks agree on the most probable value for the day the usage change: day 45.

[1] Practical Probabilistic Programming, Avi Pfeffer.

Monday, July 18, 2016

The theory that dare not speak its name

From XKCD (https://xkcd.com/1132/)

The author of the excellent XKCD cartoons misrepresents the arguments of Frequentists and Bayesians for the purpose of jocularity but found he'd "kicked a hornet's nest" by doing so.

Bayes Theorem has been controversial for hundreds of years. In fact, books have been written on it (see here for a recent and popular one). One reason is that it depends on subjectivity. But Allen B Downey, author of Think Bayes (freely downloadable from here), thinks this is a good thing. "Science is always based on modeling decisions, and modeling decisions are always subjective. Bayesian methods make these decisions explicit, and that’s a feature, not a bug. But all hope for objectivity is not lost. Even if you and I start with different priors, if we see enough data (and agree on how to interpret it) our beliefs will converge. And if we don’t have enough data to converge, we’ll be able to quantify how much uncertainty remains." [1]

Bayesian Networks

Taking the examples from Probabilistic Graphical Models available at Coursera, imagine this scenario: a student wants a letter of recommendation which depends on their grades, intelligence, SAT scores and course difficulty like so:

The course uses SamIam but I'm going to use Figaro as it's written in Scala. Representing these probability dependencies then looks like this:

class ProbabilisticGraphicalModels {

implicit val universe = Universe.createNew()

val Easy = "Easy"
val Hard = "Hard"
val Dumb = "Dumb"
val Smart = "Smart"
val A = "A"
val B = "B"
val C = "C"
val GoodSat = "GoodSat"
val BadSat = "BadSat"
val Letter = "Letter"
val NoLetter = "NoLetter"

def chancesOfDifficultIs(d: Double): Chain[Boolean, String]
= Chain(Flip(d), (b: Boolean) => if (b) Constant(Hard) else Constant(Easy))

def chancesOfSmartIs(d: Double): Chain[Boolean, String]
= Chain(Flip(d), (b: Boolean) => if (b) Constant(Smart) else Constant(Dumb))

def gradeDistributionWhen(intelligence: Chain[Boolean, String] = defaultIntelligence,
difficulty: Chain[Boolean, String] = defaultDifficulty): CPD2[String, String, String]
= CPD(intelligence, difficulty,
(Dumb, Easy) -> Select(0.3 -> A, 0.4 -> B, 0.3 -> C),
(Dumb, Hard) -> Select(0.05 -> A, 0.25 -> B, 0.7 -> C),
(Smart, Easy) -> Select(0.9 -> A, 0.08 -> B, 0.02 -> C),
(Smart, Hard) -> Select(0.5 -> A, 0.3 -> B, 0.2 -> C)
)

def satDist(intelligence: Chain[Boolean, String] = defaultIntelligence): CPD1[String, String]
= CPD(intelligence,
Dumb -> Select(0.95 -> BadSat, 0.05 -> GoodSat),
Smart -> Select(0.2 -> BadSat, 0.8 -> GoodSat)
)

def letterDist(gradeDist: CPD2[String, String, String] = defaultGradeDist): CPD1[String, String]
= CPD(gradeDist,
A -> Select(0.1 -> NoLetter, 0.9 -> Letter),
B -> Select(0.4 -> NoLetter, 0.6 -> Letter),
C -> Select(0.99 -> NoLetter, 0.01 -> Letter)
)
.
.

And we can query it like so:

.
.
def probabilityOf[T](target: Element[T], fn: (T) => Boolean): Double = {
val ve = VariableElimination(target)
ve.start()
ve.probability(target, fn)
}
.
.

Finally, let's add some sugar so I can use it in ScalaTests:

.
.
val defaultDifficulty = chancesOfDifficultIs(0.6)
val easier = chancesOfDifficultIs(0.5)
val defaultIntelligence = chancesOfSmartIs(0.7)
val defaultGradeDist = gradeDistributionWhen(intelligence = defaultIntelligence, difficulty = defaultDifficulty)
val defaultLetterDist = letterDist(defaultGradeDist)
val defaultSatDist = satDist(defaultIntelligence)

def being(x: String): (String) => Boolean = _ == x

def whenGettingAnABecomesHarder(): Unit = defaultGradeDist.addConstraint(x => if (x == A) 0.1 else 0.9)

def whenTheCourseIsMoreLikelyToBeHard(): Unit = defaultDifficulty.addConstraint(x => if (x == Hard) 0.99 else 0.01)

def whenLetterBecomesLessLikely(): Unit = defaultLetterDist.addConstraint(x => if (x == Letter) 0.1 else 0.9)

def whenTheSatIsKnownToBeGood(): Unit = defaultSatDist.observe(GoodSat)

}

Flow of Probabilistic Influence

If we wanted to know the chances of receiving a letter of recommendation, we'd just have to run probabilityOf(letterDist, being(Letter)) (which equals about 0.603656). The interesting thing is what happens if we observe some facts - does this change the outcome?

The answer is: it depends.

If we observe the SAT score for an individual, then their probability of receiving the letter changes. For example, executing satDist.observe(GoodSat) means that the chancesOf method now returns a probability of about 0.712 (and similarly a BadSat reduces the probability to about 0.457).

The general idea is given in this screen grab:

From Daphne Koller's Probabilistic Graphical Models

The X-> Y and X <- Y are easy. "In general, probabilistic influence is symmetrical. That is, if X can influence Y, Y can influence X." [2]

Probabilities can flow through nodes in this Bayesian network so X -> W -> Y and X <- W <- Y are not terribly surprising if we know nothing about the intermediate step (the column on the left hand side). Conversely, if we do know something about it, that's all we need and it doesn't matter what the first step does.

The V-Structure (X <- W -> Y) is more interesting. Here, we can use ScalaTest to demonstrate what's going on. The first case is when we have no observed evidence. Here, Koller tells us

"If I tell you that a student took a class and the class is difficult, does that tell you anything about the student's intelligence? And the answer is 'no'."

And sure enough, Figaro agrees:

"probabilities" should {

"flow X -> W <- Y ('V-structure')" in new ProbabilisticGraphicalModels {

val smartBefore = probabilityOf(defaultIntelligence, being(Smart))

whenTheCourseIsMoreLikelyToBeHard()

val smartAfter = probabilityOf(defaultIntelligence, being(Smart))

smartBefore shouldEqual smartAfter

}

with the caveat that this is true if and only if

"... W and all of its descendants are not observed."

if it is observed, then the following is true:

"flow X -> W <- Y ('V-structure')" in new ProbabilisticGraphicalModels {
// Can difficulty influence intelligence via letter?
val smartBefore = probabilityOf(defaultIntelligence, being(Smart))
whenLetterBecomesLessLikely() // "this too activates the V-structure"
val smartAfter = probabilityOf(defaultIntelligence, being(Smart))
smartBefore should be > smartAfter
}

As mentioned, if we do know something about the intermediate step, the first step can't influence the outcome. Take these examples:

"I know the student got an A in the class, now I'm telling you that the class is really hard.does that change the probability of the distribution of the letter? No because ... the letter only depends on the grade."

"Given evidence about Z" should {

"make no difference in (X -> W -> Y)" in new ProbabilisticGraphicalModels {
defaultGradeDist.observe(A)
val letterChanceBefore = probabilityOf(defaultLetterDist, being(Letter))
whenTheCourseIsMoreLikelyToBeHard()
val letterChanceAfter = probabilityOf(defaultLetterDist, being(Letter))
letterChanceBefore shouldEqual letterChanceAfter
}

"If I tell you that the student is intelligent, then there is no way the SAT can influence the probability influence in grade."

"make no difference in X <- W -> Y" in new ProbabilisticGraphicalModels {

defaultIntelligence.observe(Smart)

val chancesOfABefore = probabilityOf(defaultGradeDist, being(A))

whenTheSatIsKnownToBeGood()

val chancesOfAAfter = probabilityOf(defaultGradeDist, being(A))

chancesOfAAfter shouldEqual chancesOfABefore

}

More notes to come.

[1] Prof Allen B Downey's blog
[2] Daphne Koller, Coursera.

Agile Java Man

Saturday, November 19, 2016

Probabilistic Programming - Part 2

Monday, July 18, 2016

The theory that dare not speak its name

Blog Archive

About Me