Übung zur Vorlesung Maschinelles Lernen
Übung vom 8.6.2000
Assistentin: Julia Vogel
Using a Hidden Markov Model
Description: In this exercise, we use a hidden Markov model (HMM)
as a model of word generation from part-of-speech sequences. We will:
-
Train an HMM on a sample of English-like text
-
Inspect the resulting model
-
Generate sentences at random from the model
-
Create test sentences and find the most likely hidden state sequence
Credits: The HMM
package we are using in this exercise was written by Tapas
Kanungo, and this exercise, the accompanying scripts, etc. were written
by by Philip Resnik.
Getting the code
-
Download the code and the examples from the Machine Learning webpage: http://www.vision.ethz.ch/ml/uebung_ml.html
-
Extract the code from the file
% tar xvf hmm.tar
-
Now go into directory example0:
% cd example0
-
Run llink to make the HMM programs available for running in this
directory
% ./llink
Creating the Training Data
-
Look at file example0.train, which contains a very small corpus
generated using a very simple English-like grammar.
% more example0.train
Alternatively, open the file in an editor. What parts of speech would you
say are represented in this file? Write down a list of parts of speech
and the words you think belong to each one. Notice that the same word can
appear in multiple parts of speech.
Turn example0.train into training input for the HMM code, by running
the
create_key.pl program. This reads in the training file and converts
each unique word into a unique number, since the HMM program uses numbers
rather than symbols. For example, "the tall man saw the short man" might
be translated into the sequence "1 2 3 4 1 5 3".
% ./create_key.pl example0.key < example0.train > example0.seq
Feel free to look at file example0.key, which contains the mapping
from words to their corresponding numbers, and at file example0.seq,
which now contains the training input represented as numbers rather than
words. The "T=" value at the top of example0.seq tells you how many symbols
there are in the training sequence. (In this case the number should be
590.)
Training the HMM
(Problem 3: estimate model parameters from given training sequence using
the Baum-Welch method)
-
To train the model we will use the esthmm program. This program
needs to know the number of symbols in the alphabet of the HMM (that is,
symbols that can be emitted); you can obtain this by typing (wc = word
count)
% wc -l example0.key
In this case the number of symbols is 13. The number of states is something
you can choose. Recall that for a HMM-based part-of-speech model like this
one, each state corresponds to a part of speech (i.e. a syntactic category
like noun, verb, etc.), so the number of states to choose corresponds to
the number of parts of speech you believe are represented in the corpus.
In this case, let's use 6 states. We run the esthmm program as follows:
% ./esthmm -v -N 6 -M 13 example0.seq > example0.hmm
This creates file example0.hmm, which contains the trained model.
If you'd like, look at file example0.hmm -- it's not the easiest thing
in the world to read, but you can see how the model is represented there.
At the top are specified the number of states and the number of symbols
(M=13 symbols, N=6 states). Then you have the complete A-matrix,
i.e. the 6-by-6 transition probability matrix. Next you have the 6-by-13
emission probability matrix, B. Finally you have the pi-vector,
giving initial probabilities for the 6 states.
Inspecting the Model
-
To view the model you've just created a bit more readably, keeping only
the most useful information, run the
viewhmm.pl program. This shows
the state-to-state transition probabilities when they are above a certain
threshold probability (0.01 in this case), since very low transition probabilities
probably don't tell you much about the structure of the model. Similarly,
for the emission probabilities, it shows only the symbols emitted with
a sufficiently high probability (also >0.01). (And it shows them as words,
not numbers, for easier readability.) Run viewhmm.pl as follows:
% ./viewhmm.pl example0.hmm example0.key 0.01 0.01
Chances are it scrolls by too fast, you you may want to do this instead:
% ./viewhmm.pl example0.hmm example0.key 0.01 0.01 | more
Or, you might want to save the output to a file:
% ./viewhmm.pl example0.hmm example0.key 0.01 0.01 > example0.view
Based on what you see when you run viewhmm.pl, draw, on paper, a
transition diagram for this HMM. That is, write each state as a node labeled
by the state number, and write probabilities on the arcs from state to
state.
-
Now, based on what you see when you run
viewhmm.pl, label each state
in your transition diagram with a part of speech. How good a match is there
between your intuitions, earlier, and the way the model has automatically
decided which are the high-probability symbols for each state? If there
are mismatches, are they linguistically interesting?
Generating Sentences at Random from the Model
("Problem" 4: generator of observations)
-
Using your transition diagram and the output of
viewhmm.pl, start
at the the most likely start state, and write down the most likely symbol
to be emitted there. (Break ties at random.) Then follow the most likely
arc to the next state, and write down the most likely symbol to be emitted
there.
Continue in this fashion until you emit a punctuation mark, or until you
get bored.
-
Now we'll have the computer do this same process. Since it doesn't ever
get bored, we'll have to tell it how many symbols to emit before it stops
-- say, 100. To do this, run the
genseq program:
% ./genseq -T 100 example0.hmm
Notice that the output isn't very readable, since the program generates
symbols as numbers. We can take that output, though, and run it through
the ints2words.pl program to replace the numbers with the corresponding
words:
% ./genseq -T 100 example0.hmm | ints2words.pl example0.key
Finding the Hidden State Sequence for a Test Sentence
(Problem 2: uncover most likely state sequence to a given observation using
the Viterbi algorithm)
-
Let's create a sentence to use as input to the model. First, create a file
example0.test.words
containing the word sequence that you got when you ran through the model
state by state by hand, above. You can do this using an editor, or you
can do it by typing
% cat > example0.test.words
then typing the sentence in, hitting return, and then typing control-D.
Don't forget to make sure everything is lowercase, and make sure the punctuation
mark is a separate word, not attached to the last word of the sentence.
Turn the file you've just created into the right format for the HMM programs,
by running words2seq.pl program as follows:
% ./words2seq.pl example0.key < example0.test.words > example0.test
If you're interested, take a look at file example0.test to see what the
input sequence looks like.
Find the sequence of hidden states most likely to have generated the symbol
sequence in example0.test, by using the
testvit program:
% ./testvit example0.hmm example0.test
The program reports the probability of the symbol sequence, given the most
likely sequence of states, and it also reports the optimal state sequence.
Take that optimal state sequence, and replace each state number with the
part-of-speech label you assigned to that state.
-
Congratulations! You've just done some HMM-based part-of-speech tagging.
Does the sequence of parts of speech correspond to what you expected?
Time for Fun
Now that you've gone through this exercise, here are some suggestions for
further exploration:
-
Try getting the state sequence (part-of-speech tags) for some more sentences
-- you can look at example0.key to see what words you're allowed
to use. To speed things up, you can skip the step of creating example0.test.words
by executing:
% cat | words2seq.pl example0.key > example0.testA
Typing your sentence in, hitting return, and then typing control-D to end
and create file example0.testA. (You can use suffixes testA, testB,
etc. for new sentences.) As before, don't forget to make sure everything
is lowercase, and make sure punctuation are separated from other words
by spaces. Once you've created example0.testA, run the
testvit
program on that file, as described above.
Go back to "Training the HMM", but this time increase the number of states
to 7 or 8 or 9. Go through the rest of the exercise of labeling the resulting
states with part-of-speech tags. What does the model do when it has more
states to play with, for this training set? What do you think are are some
possible consequences of having more states, in terms of the model's ability
to tag accurately, and in terms of the linguistic facts the model captures?
-
Go back to "Training the HMM", but this time decrease the number of states
to, say, 3 or 4. Same questions as in the previous paragraph: what are
the consequences?
-
Let's look at some more interesting data, using a more interesting English-like
language. Go into the
example1 directory:
% cd ../example1
Now do the following:
-
Run the ./llink program as described earlier. This time, though,
you
won't do the "Creating the Training Data" or "Training the HMM"
steps, because the training would take too long. (Could be a few hours,
depending on the machine.) Instead, go straight to "Inspecting the Model".
Notice that this is a bigger HMM with a richer language: there are 36 symbols
in the vocabulary, and the HMM has 12 states.
-
Do some of the same steps as we did for example0, particularly assigning
a part-of-speech label by hand to each state, and generating some text
at random using the genseq.pl program.
-
Look at the file example1.train, and see if you can come up with a context-free
grammar that generates this language, or something close to it. (If you
want to cheat, look at file gen.lisp in your hmm directory.)
-
Looking at the random text you generated using the genseq.pl program,
do you see some sentences that could not have been generated by the context-free
grammar? Why do those sentences get generated by the HMM but not the CFG?
Information concerning the software
This software contains code for understanding the basics of hidden
Markov models (HMM). The notation used is very similar to that used by
Rabiner and Juang in:
-
Rabiner, L. R. and B. H. Juang, "Fundamentals of Speech Recognition," Prentice
Hall, 1993.
-
Rabiner, L. R., "A Tutorial on Hidden Markov Models and Selected Applications
in Speech Recognition, Prov. of IEEE, vol. 77, no. 2, pp. 257-286, 1989.
-
Rabiner, L. R., and B. H. Juang, "An Introduction to Hidden Markov Models,"
IEEE ASSP Magazine, vol. 3, no. 1, pp. 4-16, Jan. 1986.
Executables:
-
esthmm: Estimates the HMM from a given symbol sequence using BaumWelch.
-
genseq: Generates a symbol sequence using the specified
model.
-
testvit: Generates the most likely state sequence
for a given symbol sequence, given the HMM, using Viterbi.
Note:
The model test.hmm and the sequence test.seq solve exercise 1.5 of the
HMM exercise. Just execute the command:
% ./testvit test.hmm test.seq
The output is the solution of the exercise
HMM file format:
M = <number of symbols>
N = <number of states>
A:
a11 a12 ... a1N
a21 a22 ... a2N
. . . .
. . . .
. . . .
aN1 aN2 ... aNN
B:
b11 b12 ... b1M
b21 b22 ... b2M
. . . .
. . . .
. . . .
bN1 bN2 ... bNM
pi:
pi1 pi2 ... piN
Sample HMM file:
M= 2
N= 3
A:
0.333 0.333 0.333
0.333 0.333 0.333
0.333 0.333 0.333
B:
0.5 0.5
0.75 0.25
0.25 0.75
pi:
0.333 0.333 0.333
Sequence file format:
T=<seqence lenght>
o1 o2 o3 . . . oT
Sample sequence file:
T= 10
1 1 1 1 2 1 2 2 2 2