18 October 2010

LDS General Conference: The Software Guesses the Speaker

Let's say that you stripped the author and photo from the online transcript of any given LDS General Conference talk. Let's also say that you didn't see or hear that talk delivered. Could you tell who gave the talk just by reading it?

My computer can. Here's how:

It turns out that modern text-processing software is getting pretty good at this kind of thing. You already know that your inbox can tell, with fair accuracy, whether or not a random email is spam. But have you used cutting-edge software like Zemanta, which classifies and categorizes your blog post as you type it? Or how about the companies whose software preprocesses incoming customer-feedback emails to decide whether you are happy with the product?

Even the most basic approaches can be very precise in some domains; in this example, the domain is LDS General Conference talks.

The first thing that the computer needs is a set of training text. This text is like giving the computer the test questions and the answer key, which the computer can use to try to learn the subject matter. While "teaching to the test" might not be the best for our student population, it works great for computers that can make appropriate inferences from the smallest details.

The way I got my training text was to use a web crawler that would go to www.lds.org, download general conference talks, strip out all the HTML, and save the raw, unformatted text of each talk into a separate file. Each file's name included the speaker's name, which is where the software would look to check its answers.
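Here's a minimal sketch of that step in Python. The URL, the page handling, and the "Speaker_title.txt" filename scheme are placeholders I'm making up for illustration; the real crawler had to deal with lds.org's actual layout.

    # Minimal sketch of the download-and-strip step. The URL below is a
    # made-up placeholder; the filename embeds the answer key (the speaker).
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class TextExtractor(HTMLParser):
        """Collects the text content of an HTML page, dropping the tags."""
        def __init__(self):
            super().__init__()
            self.chunks = []

        def handle_data(self, data):
            self.chunks.append(data)

    def save_talk(url, speaker, title):
        html = urlopen(url).read().decode("utf-8", errors="ignore")
        extractor = TextExtractor()
        extractor.feed(html)
        with open("%s_%s.txt" % (speaker, title), "w") as out:
            out.write(" ".join(extractor.chunks))

    save_talk("https://www.lds.org/some-talk-page", "Eyring", "example-talk")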

The second thing the computer needs to know is which features in the text matter to you. The most basic approach is to ask it to note individual words in the document. For example, one feature might be "the article has the word 'commandments' in it". Another might be "the article has the word 'scriptures' in it". There are a lot more ways to look at a document than just its words, though. How about "this article uses the passive voice" or "this article has long sentences" or "this article references '2 Corinthians'"? It simply depends on how much effort you put into teaching the computer what to look for.

It's not quite that involved at the most basic level, though. The program that I wrote simply takes all the words in each document in the training set and treats each unique word as a feature. It builds the feature set automatically, so we get "this article has 'humble'" as well as "this article has 'seemed'" as features.
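As a sketch, that feature-building step might look like this. The two "talks" here are made-up stand-ins, and the feature-dictionary style is an assumption on my part, modeled after NLTK (whose output format the odds table later in this post happens to match):

    # Each unique word across the training set becomes a boolean
    # "contains(word)" feature.
    training_documents = [
        ("be humble and keep the commandments", "Eyring"),
        ("it seemed a grand and glorious day", "Hinckley"),
    ]

    vocabulary = set()
    for text, speaker in training_documents:
        vocabulary.update(text.lower().split())

    def word_features(text, vocabulary):
        words = set(text.lower().split())
        # True/False for "this article has word w", for every known w.
        return dict(("contains(%s)" % w, w in words) for w in vocabulary)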

Third, the computer needs to know what it is trying to guess at. It needs a list of possible answers.

In our case, the answers are "Eyring," "Monson," "Uchtdorf," etc.
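Under the filename scheme sketched above, that list of answers can be recovered straight from the saved files:

    import glob

    # The possible answers, pulled from the "Speaker_title.txt" filenames.
    speakers = set(path.split("_")[0] for path in glob.glob("*.txt"))
    print(speakers)  # e.g. {'Eyring', 'Monson', 'Uchtdorf'}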

Fourth, the computer needs to search for and tally those features in the training text.

The approach that I used is called Naive Bayes. The idea is simple. For each feature, tally each document in the training text (for which it has the answer key) into one of four categories (a concrete sketch follows the list):

1. This document has word X AND it is a talk by speaker Y
2. This document doesn't have word X AND it is a talk by speaker Y
3. This document has word X AND it is not a talk by speaker Y
4. This document doesn't have word X AND it is not a talk by speaker Y
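
To make those four categories concrete, here is a sketch of the tally for a single word and a single speaker, using the same hypothetical (text, speaker) pairs as above:

    # Tally the four categories for one (word, speaker) pair. These
    # counts become the probabilities that Naive Bayes multiplies together.
    def tally(documents, word, speaker):
        counts = {"has_word, speaker": 0, "no_word, speaker": 0,
                  "has_word, other": 0, "no_word, other": 0}
        for text, who in documents:
            has_word = word in text.lower().split()
            if has_word and who == speaker:
                counts["has_word, speaker"] += 1
            elif not has_word and who == speaker:
                counts["no_word, speaker"] += 1
            elif has_word:
                counts["has_word, other"] += 1
            else:
                counts["no_word, other"] += 1
        return counts

    print(tally(training_documents, "commandments", "Eyring"))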

Now, with all of that tallied, we can give it text that it hasn't seen before, and it will guess the speaker whose tallies best explain the new document's words. Given enough data to work with, even this simple approach can be very accurate.

In fact, with my corpus of the last ten years of Eyring, Hinckley, and Monson talks, my computer is 88% accurate!
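For the curious: the odds table below is exactly the shape of table that NLTK's Naive Bayes classifier prints, so here's a sketch of the full pipeline under the assumption that you're using NLTK, reusing the word_features function from above. The accuracy number comes from talks held out of training.

    import glob
    import random
    import nltk

    # Rebuild (text, speaker) pairs from the "Speaker_title.txt" files.
    labeled_talks = [(open(path).read(), path.split("_")[0])
                     for path in glob.glob("*.txt")]
    random.shuffle(labeled_talks)

    vocabulary = set()
    for text, _ in labeled_talks:
        vocabulary.update(text.lower().split())

    featuresets = [(word_features(text, vocabulary), speaker)
                   for text, speaker in labeled_talks]

    # Hold out a tenth of the talks to measure accuracy on unseen text.
    cutoff = len(featuresets) // 10
    test_set, train_set = featuresets[:cutoff], featuresets[cutoff:]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))   # e.g. 0.88
    classifier.show_most_informative_features(17)         # the table below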

Here is some nifty data that it found:

Feature                           Comparison            Odds
contains(prophets) = True         Eyring : Hinckley     15.1 : 1.0
contains(wants) = True            Eyring : Hinckley     12.4 : 1.0
contains(evidence) = True         Eyring : Monson       11.6 : 1.0
contains(promised) = True         Eyring : Hinckley     11.1 : 1.0
contains(seemed) = True           Eyring : Hinckley      9.8 : 1.0
contains(qualify) = True          Eyring : Hinckley      9.8 : 1.0
contains(commandments) = True     Eyring : Hinckley      9.8 : 1.0
contains(start) = True            Eyring : Monson        9.1 : 1.0
contains(commandments) = False    Hinckley : Eyring      8.6 : 1.0
contains(answers) = True          Eyring : Hinckley      8.5 : 1.0
contains(gifts) = True            Eyring : Hinckley      8.5 : 1.0
contains(lifetime) = True         Eyring : Hinckley      8.5 : 1.0
contains(chose) = True            Eyring : Hinckley      8.5 : 1.0
contains(simple) = True           Eyring : Hinckley      8.3 : 1.0
contains(memory) = True           Eyring : Monson        7.9 : 1.0
contains(whatsoever) = True       Eyring : Monson        7.9 : 1.0
contains(resurrected) = True      Eyring : Monson        7.9 : 1.0


This table shows what the computer found to be the most helpful features (in our case, words) in determining who gave the talk. The first column is the feature; the second is the pair of speakers, A : B, being compared; and the third is the odds that the talk is by speaker A rather than speaker B, given that the feature is satisfied.

(Note to self: Do not show this to brother-in-law lest he decide that he can use these odds to place bets on the April general conference address.)

What do you see that's interesting to you? I think it's really interesting to see "basic" words mixed in with religious terms. It's also interesting that most of the odds involve President Eyring. While I haven't dug into it yet, I would guess that this simply means Eyring's vocabulary is more easily distinguished from the other two's. That said, on my first go-round I used only President Monson's and President Hinckley's talks, and I was still at about 85% accuracy.


