7/14/10

Collapsed Gibbs Sampling for LDA and Bayesian Naive Bayes

 
 

Sent to you by Jeff via Google Reader:

 
 

via LingPipe Blog by lingpipe on 7/13/10

I've uploaded a short (though dense) tech report that works through the collapsing of Gibbs samplers for latent Dirichlet allocation (LDA) and the Bayesian formulation of naive Bayes (NB).

Thomas L. Griffiths and Mark Steyvers used the collapsed sampler for LDA in their (old enough to be open access) PNAS paper, Finding scientific topics. They show the final form, but don't derive the integral or provide a citation.

I suppose these 25-step integrals are supposed to be child's play. Maybe they are if you're a physicist or theoretical statistician. But that was a whole lot of choppin' with the algebra and the calculus for a simple country computational linguist like me.

On to Bayesian Naive Bayes

My whole motivation for doing the derivation was that someone told me that it wasn't possible to integrate out the multinomials in naive Bayes (actually, they told me you'd be left with residual \Gamma functions). It seemed to me like it should be possible because the structure of the problem was so much like the LDA setup.

It turned out to be a little trickier than I expected and I had to generalize the LDA case a bit. But in the end, the result's interesting. I didn't wind up with what I expected. Instead, the derivation led to me to see that the collapsed sampler uses Bayesian updating at a per-token level within a doc. Thus the second token will be more likely than the first because the topic multinomial parameter will have been updated to take account of the assignment of the first item.

This is so cool. I actually learned something from a derivation.

In my prior post, Bayesian Naive Bayes, aka Dirichlet-Multinomial Classifiers, I provided a proper Bayesian definition of naive Bayes classifiers (though the model is also appropriate for soft clustering with Gibbs sampling replacing EM). Don't get me started on the misappropriation of the term "Bayesian" by any system that applies Bayes's rule, but do check out another of my previous posts, What is Bayesian Statistical Inference? if you're curious.

Thanks to Wikipedia

I couldn't have done the derivation for LDA (or NB) without the help of

It pointed me (implicitly) at a handful of invaluable tricks, such as

  • multiplying through by the appropriate Dirichlet normalizers to reduce an integral over a Dirichlet density kernel to a constant,
  • unfolding products based on position, unfolding a \Gamma() function for the position at hand, then refolding the rest back up so it could be dropped out, and
  • reparameterizing products for total counts based on sufficient statistics.

Does anyone know of another source that works through equations more gently? I went through the same exegesis for SGD estimation for multinomial logistic regression with priors a while back.

But Wikipedia's Derivation is Wrong!

At least I'm pretty sure it is as of 5 PM EST, 13 July 2010.

Wikipedia's calculation problem starts in the move from the fifth equation before the end to the fourth. At this stage, we've already eliminated all the integrals, but still have a mess of \Gamma functions left. The only hint at what's going on is in the text above which says it drops terms that don't depend on k (the currently considered topic assignment for the n-th word of the m-th document). The Wikipedia's calculation then proceeds to drop the term \prod_{i \neq k} \Gamma(n^{i,-(m,n)}_{m,(\cdot)} + \alpha_i) without any justification. It clearly depends on k.

The problems continue in the move from the third equation before the end to the penultimate equation, where a whole bunch of \Gamma function applications are dropped, such as \Gamma(n^{k,-(m,n)}_{m,(\cdot)} + \alpha_k), which even more clearly depend on k.

It took me a few days to see what was going on, and I figured out how to eliminate the variables properly. I also explain each and every step for those of you like me who don't eat, sleep and breathe differential equations. I also use the more conventional stats numbering system where the loop variable m ranges from 1 to M so you don't need to keep (as large) a symbol table in your head.

I haven't changed the Wikipedia page for two reasons: (1) I'd like some confirmation that I haven't gone off the rails myself anywhere, and (2) their notation is hard to follow and the Wikipedia interface not so clean, so I worry I'd introduce typos or spend all day entering it.

LaTeX Rocks

I also don't think I could've done this derivation without LaTeX. The equations are just too daunting for pencil and paper. The non-interactive format of LaTeX did get me thinking that there might be some platform for sketching math out there that would split the difference between paper and LaTeX. Any suggestions?



 
 

Things you can do from here:

 
 

4/12/10

Approaches of tackling the problem of synonymy in IR

Approaches of tackling the problem of synonymy in IR

1. Structrue-based retrieval model
2. synonymy query expansion

1/19/10

Marginalizing Latent Variables in EM

 
 

Sent to you by Jeffye via Google Reader:

 
 

via LingPipe Blog by lingpipe on 1/19/10


After getting carried away in my last post on this topic, Maximum Likelihood Estimates of Expression in RNA-Seq are Winner-Take-All Biased, I want to explain what went wrong when I "turned the crank" (an idiom in mathematics for solving an equation by brute force without thinking too much about it).

Expression Mixture Model

Recall that we were looking at a mixture model, which is the bread and butter of maximum likelihood estimators like EM. We defined a joint likelihood function for a sequence of observed reads r, gene or splice variant sources g, and mRNA expression \theta, setting

p(r,g|\theta) = \prod_{n=1}^N p(r_n,g_n|\theta) = \prod_{n=1}^N p(r_n|g_n) \ p(g_n|\theta).

Maximum Likelihood

Maximum likelihood is usually stated as finding the model parameters that maximize the likelihood function consisting of the probability of the observed data given the model parameters.

What's a Parameter?

So that's what I did, calculating

g^*, \theta^* = \arg \max_{g,\theta} p(r|g,\theta).

The problem is that there's an implicit "except the latent parameters" inside the usual understanding the MLE.

What I should have done is marginalized out the latent gene sources g, taking

\theta^* = \arg \max_{\theta} p(r|\theta).

That requires marginalizing out the latent mapping of reads r_n to gene or splice variant sources g_n,

p(r|\theta) = \prod_{n=1}^N p(r_n|\theta_n) = \prod_{n=1}^N \sum_{k = 1}^K p(r_n|k) \ p(k|\theta).

On a log scale, that's

\log p(r|\theta) = \sum_{n=1}^N \log \sum_{k = 1}^K p(r_n|k) \ p(k|\theta).

The Example

I'll repeat the example here for convenience:

Isoform Expression Example

Suppose I have a gene with two isoforms, A and B, spliced from three exons, 1, 2, and 3. For simplicitly, we'll suppose all the exons are the same length. Suppose isoform A is made up of exon 1 and exon 2, and isoform B of exon 1 and 3. Now suppose that you run your RNA-Seq experiment and find 70 reads mapping to exon 1, 20 reads mapping to exon 2, and 50 reads mapping to exon 3. In table form, we have the following mapping counts:

  Exon 1 Exon 2 Exon 3 Total
Iso 1 N 20 0 20+N
Iso 2 70-N 0 50 120-N
Total 70 20 50 140

The 20 reads mapping to exon 2 must be part of isoform 1, and similarly the 50 reads mapping to exon 3 must belong to isoform 2. That leaves the 70 reads falling in exon 1 to spread in some proportion between isoform 1 and isoform 2.

Turning the Crank In the Right Direction

By assuming each splice variant consists of two exons of the same length, the probability of a read in an exon is 1/2 for each exon in the splice variant and 0 for the exon not in the variant.

Now when we turn the crank, we see that

\log p(r|\theta) = \sum_{n=1}^N \log \sum_{k=1}^K p(r_n|k) p(k|\theta)

{} \ \ \ \ =  70 \log \sum_{k=1}^K p(1|k) p(k|\theta) + 20 \log \sum_{k=1}^K p(2|k) p(k|\theta)  + 50 \log \sum_{k=1}^K p(3|k) p(k\theta)

= 20 \log \frac{\theta_1}{2} + 50 \log \frac{\theta_2}{2} + 70 \log (\frac{\theta_1}{2} + \frac{\theta_2}{2})

= 20 \log \theta_1 + 50 \log \theta_2 + (70 \log 1 -20\log 2 - 50 \log 2 - 70 \log 2).

The reduction on the third line is because the probability of a read in the second exon, p(2|k), is only non-zero for isoform k=1, and similarly for a read in the third exon, where p(3|k) is nly non-zero for the second gene or splice variant, k=2. The final reduction follows because \theta_1 + \theta_2 = 1.

Not so Biased After All

After marginalizing out the read mappings g, the MLE for expression is right where we'd expect it, at the solution of

\frac{d}{d\theta} \log p(r|\theta) = 0,

which is

\theta^*=20/70.

It turns out EM isn't so biased for these discrete mixtures, at least when you turn the crank the right way.


 
 

Things you can do from here: