Melanie Tosik

25 February 2018

Viterbi part-of-speech (POS) tagger

For my natural language processing (NLP) class, we were asked to implement and train a part-of-speech (POS) tagger, as described in the 3rd edition of “Speech and Language Processing” (Jurafsky and Martin). I actually distinctly remember a very similar assignment in my undergraduate NLP class, where I felt so overwhelmed by the data structures and implementation details of the Viterbi algorithm that I ended up skipping the assignment entirely (which was allowed, one time). I’d been meaning to revisit this problem ever since, so I was really excited about finally getting a chance to do so.

In a nutshell, a POS tagger reads in text in a given language and assigns a POS tag to each word (or token), such as noun, verb, or adjective (although in practice these are usually divided into more fine-grained categories). This is a hard problem because many (if not most) English words are ambiguous in their POS. For example, the word “experience” frequently occurs as both a noun and a verb, “abstract” can be a noun and an adjective, and so on.
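To make the ambiguity concrete, here is a tiny sketch (the corpus and the helper function are made up for illustration; tags are Penn Treebank-style labels) that counts how often each word appears with each tag in annotated data:

```python
def tag_counts(tagged_sentences):
    """Count how often each word appears with each POS tag."""
    counts = {}
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts.setdefault(word, {}).setdefault(tag, 0)
            counts[word][tag] += 1
    return counts

# Two toy sentences: "experience" occurs once as a noun, once as a verb.
corpus = [
    [("Years", "NNS"), ("of", "IN"), ("experience", "NN"), ("help", "VBP")],
    [("We", "PRP"), ("experience", "VBP"), ("delays", "NNS")],
]

print(tag_counts(corpus)["experience"])  # → {'NN': 1, 'VBP': 1}
```

A tagger's job is to pick the right tag for each occurrence from its context, not just the most frequent one overall.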

In general, there are two parts to building a (supervised) POS tagger: training and decoding. For this project, training was done using a hidden Markov model (HMM), which essentially involves estimating transition and emission probabilities for the hidden states (POS tags) that correspond to each observation (token) in the annotated training data. To decode new observations, we can then use the Viterbi algorithm, which computes the most likely sequence of POS tags given the current input sequence/sentence.
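The two parts can be sketched roughly as follows. This is a minimal, unsmoothed version (the `<s>` start marker and the `1e-12` floor for unseen events are my own simplifications, not part of the assignment spec); training is just maximum-likelihood counting, and decoding is Viterbi in log space:

```python
import math
from collections import defaultdict

def train_hmm(tagged_sentences):
    """Estimate transition and emission probabilities (MLE) from
    sentences of (word, tag) pairs. "<s>" marks the start state."""
    transitions = defaultdict(lambda: defaultdict(int))
    emissions = defaultdict(lambda: defaultdict(int))
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            prev = tag
    trans_p = {p: {t: c / sum(d.values()) for t, c in d.items()}
               for p, d in transitions.items()}
    emit_p = {t: {w: c / sum(d.values()) for w, c in d.items()}
              for t, d in emissions.items()}
    return trans_p, emit_p

def viterbi(words, tags, trans_p, emit_p):
    """Return the most likely tag sequence for `words`.
    Unseen probabilities get a tiny floor (1e-12) so logs stay finite."""
    V = [{}]      # V[i][tag] = best log prob of tagging words[:i+1] ending in tag
    back = [{}]   # backpointers to recover the best path
    for tag in tags:
        tp = trans_p.get("<s>", {}).get(tag, 1e-12)
        ep = emit_p.get(tag, {}).get(words[0], 1e-12)
        V[0][tag] = math.log(tp) + math.log(ep)
        back[0][tag] = None
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for tag in tags:
            ep = emit_p.get(tag, {}).get(words[i], 1e-12)
            best_prev, best_score = max(
                ((p, V[i - 1][p] + math.log(trans_p.get(p, {}).get(tag, 1e-12)))
                 for p in tags),
                key=lambda x: x[1])
            V[i][tag] = best_score + math.log(ep)
            back[i][tag] = best_prev
    # Trace back from the best final state
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```

On a toy corpus like `[[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]]`, training and then decoding a new sentence recovers the expected DT–NN–VBZ sequence. The real assignment of course uses a full treebank and a proper tagset.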

Both the HMM and the Viterbi algorithm are relatively straightforward to implement. The major challenge is dealing with previously unseen observations, for which no probabilities were estimated from the training data. Additive smoothing is one standard way to deal with unknown words, but there are many more advanced techniques out there that might yield even better results (such as the Good-Turing method, Katz’s backoff, or smoothing by linear interpolation).
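Additive smoothing itself fits in a few lines. The sketch below (function name and parameters are my own; `alpha=1` gives classic Laplace smoothing) reserves probability mass for unseen words by adding a constant pseudo-count to every word in the vocabulary:

```python
def smoothed_emission(counts, vocab_size, alpha=1.0):
    """Return a function computing additively smoothed P(word | tag).

    counts:     dict mapping tag -> {word: count}
    vocab_size: vocabulary size, including a slot for unknown words
    alpha:      additive constant (alpha=1 is Laplace smoothing)
    """
    totals = {tag: sum(words.values()) for tag, words in counts.items()}

    def prob(word, tag):
        # Every word, seen or not, gets at least alpha pseudo-counts.
        return (counts[tag].get(word, 0) + alpha) / (totals[tag] + alpha * vocab_size)

    return prob
```

With counts `{"NN": {"dog": 3, "cat": 1}}` and a vocabulary of 5, a seen word like "dog" gets (3+1)/(4+5) = 4/9, while any unseen word still gets a nonzero 1/9, so the Viterbi log probabilities never blow up on unknown tokens.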

View project on GitHub ☺︎

Til next time,