Language Models: Spellchecking and Autocorrection

Sanket Patil
12 min read · Aug 10, 2017


“Anyone who can only think of one way to spell a word obviously lacks imagination.” — Mark Twain

How many times have you sent a message to a friend that contained odd spelling mistakes, only to follow it up with, “Oops! Typo… I meant…”? And how many times have you given your phone’s autocorrect free rein, only for it to send an even stranger message, leaving you to go, “Oh fork you, autocorrect!”? I am sure you have also been a victim of Freudian Autocorrect multiple times. (That’s when you decided for the umpteenth time: “No more drunk texting!” Oh well.) In fact, autocorrect errors are so pervasive that there are multiple websites aggregating all sorts of them, ranging from mildly amusing to profoundly embarrassing.

While a source of consternation at times, autocorrect tools do work very well most of the time, and we end up using them everywhere: email, search queries, messengers, word processors, even when writing code. So, how do spellchecking and autocorrection work? Now that we are equipped with a reasonably sound understanding of language models, let’s try and apply it to this very common problem.

Spelling Errors

Let’s define the job of a spellchecker and an autocorrector. A word needs to be checked for spelling correctness and corrected if necessary, many a time in the context of the surrounding words. A spellchecker points out spelling errors and possibly suggests alternatives. An autocorrector usually goes a step further and automatically picks the most likely word; if the correct word has already been typed, it is retained as-is. So, in practice, an autocorrector is a bit more aggressive than a spellchecker, but this is more of an implementation detail: tools allow you to configure the behaviour. In theory, there is not much difference between the two, so the discussion in the rest of this post applies to both.

There are different types of spelling errors. Let’s try to classify them a bit formally. Note that the classification below is for the purpose of illustration, and is not a rigorous linguistic approach to classifying spelling errors.

  1. Non-word Errors: These are the most common type of errors. You either miss a few keystrokes or let your fingers hurtle a bit longer. E.g., typing langage when you meant language, or hurryu when you meant hurry.
  2. Real Word Errors: If you have fat fingers, sometimes instead of creating a non-word, you end up creating a real word, but one you didn’t intend. E.g., typing buckled when you meant bucked. Or your fingers are a tad wonky, and you type in three when you meant there.
  3. Cognitive Errors: Unlike the previous two types, cognitive errors stem from ignorance of a word or its correct spelling. The words piece and peace are homophones (they sound the same), so you are not sure which one is which. Sometimes you’re damn sure about your spellings, even when a few grammar nazis claim you’re not.
  4. Short forms/Slang/Lingo: These are possibly not even spelling errors. Maybe u r just being kewl. Or you are trying hard to fit everything within a text message or a tweet and must commit a spelling sin. We mention them here for the sake of completeness.
  5. Intentional Typos: Well, because you are clever. You type in teh and pwned and zomg carefully and frown if they get autocorrected. It could be a marketing trick, one that probably even backfired.

In the rest of the post, we will cover an approach that largely focuses on the first three types. Of course, given the nature of the approach, the other two get covered too.

Unigram Frequencies to Maximize Likelihood

Given that a word w was typed, what was the intended word c? Supposing w = “the”, then almost certainly, c = “the”. However, supposing w = “thew”, then what might c be? It could be one of the following:

    c ∈ {the, thaw, threw, chew, thew, …}

There are many possibilities. Perhaps you actually meant to type “thew” (a real, if rare, word for muscle or strength) and there was no spelling mistake at all. How do we know?

Since by now we know that the language model must have something to do with this, let’s consult it. Recall that a language model learns frequencies of n-grams from large text corpora of a language, like English. Once trained, it can be used to evaluate the validity of an n-gram from that language, or to probabilistically generate new n-grams (word sequences or sentences) from that language.
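As a minimal sketch of the training step, a unigram model is little more than a frequency table. The function below is illustrative; which corpus you feed it is up to you (Norvig’s essay, cited below, uses a file called big.txt).

    import re
    from collections import Counter

    def train_unigram_model(text):
        """Estimate P(w) for every word by simple counting."""
        words = re.findall(r"[a-z]+", text.lower())
        counts = Counter(words)
        total = sum(counts.values())
        return {w: n / total for w, n in counts.items()}

    # P = train_unigram_model(open("big.txt").read())
    # P.get("the", 0.0) will be the largest value in the table.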

Given an input unigram (an n-gram with n = 1, i.e., a single word) w, we generate a set of candidate unigrams C (we will show you how to generate this set shortly), and then choose the candidate

    c* = argmax_{c ∈ C} P(c)

that maximizes the probability using the language model.

This essentially means: pick the most frequent word among the set of candidates. For w = “thew”, the language model will tell us:

    c* = argmax_{c ∈ C} P(c) = the

Perhaps this is right. After all, “the” is the most common word in English, and someone mistyping it does not seem unreasonable. On the other hand, perhaps the intended word was “thaw” and we assumed it was “the” because our language model said so.

Suppose your friend sends you the following message, “Hey, I’m getting lates.” Are you going to automatically assume that she’s tardy as always, or do you anticipate a hot cup of coffee coming your way?

The language model will always pick the most frequent word given a set of candidates. So, given

    C = {late, lattes, gates, dates, …}

it will pick late. Your friend does not stand a chance. This does not seem right.
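Here is the failure in code form, as a small sketch (the candidate set is the hypothetical one above, and P is the unigram table from the earlier sketch):

    def correct_unigram(candidates, P):
        """Pick the candidate with the highest unigram probability."""
        return max(candidates, key=lambda c: P.get(c, 0.0))

    # "late" dwarfs "lattes" in frequency in general English text, so the
    # unigram model picks it no matter what your friend actually meant:
    # correct_unigram({"late", "lattes", "gates", "dates"}, P)  ->  "late"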

The Noisy Channel Model or The Error Model

When we consider the unigram probabilities alone for spelling corrections, there is a subtle but profound mistake that we are making. There are two factors that affect the choice of the right candidate.

  1. How probable is c in the absence of any other information? This is given by its unigram frequency, i.e., P(c).
  2. How likely is it that someone would type w when they intended c (because of a fat finger, a mistake, confusion, etc.)? This is the conditional probability P(w | c).

Previously, we essentially conflated the two factors. We need to consider them separately to do better spell correction.

Let’s think about this a bit more. There are patterns to spelling errors, and reasons for them, some of which we listed earlier. Some of these reasons are: you are typing with your thumbs and they slipped onto an adjoining character; you accidentally longpressed a key, thus producing an unintended special character; you misheard a word; or you got confused about whether it’s e or i that goes first.

[Image: a standard mobile QWERTY keypad layout. Source of original image: http://www.yorku.ca/mack/mhci2013g.html]

For instance, in the standard keypad layout above, you can see that your finger flirting with w when you are on e is much more likely than it jumping off to x. In other words:

    P(typed w | intended e) > P(typed x | intended e)

The noisy channel model, or the error model, defines these probabilities: the likelihood of a particular type of error given a certain intended word. Errors could be due to multiple factors, including the input device and ignorance of the language. Combined together, these form the noisy channel.
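As a toy illustration (a hand-built sketch, not a trained model; the adjacency map covers only a couple of keys), a keyboard-aware single-character error model might look like this:

    # Toy single-character error model based on keyboard adjacency.
    ADJACENT = {
        "e": {"w", "r", "s", "d"},
        "a": {"q", "w", "s", "z"},
        # ... remaining keys omitted for brevity
    }

    def p_typed_given_intended(typed, intended,
                               p_hit=0.95, p_slip=0.04, p_other=0.01):
        """Rough P(typed | intended): right key, adjacent slip, or anything else."""
        if typed == intended:
            return p_hit
        neighbours = ADJACENT.get(intended, set())
        if typed in neighbours:
            return p_slip / len(neighbours)
        return p_other / 26

    # p_typed_given_intended("w", "e") > p_typed_given_intended("x", "e")  # True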

The idea of a noisy channel comes from communication theory (remember Shannon from an earlier post?). A message s is sent through a channel that is noisy. The noise arises from the form factor of the device, or it could be the “noise” in the user’s head. The channel messes up the message in certain ways (some of which we can estimate and quantify). The message received on the other end is s’. We don’t know the intended message s. Given that we have observed s’, our job is to recover s.

Building the Error Model

In other words, given that we have observed the word w, we need to compute the intended word c.

That is the spelling correction problem. Mathematically, it’s the following:

    c* = argmax_{c ∈ C} P(c | w) = argmax_{c ∈ C} P(c) × P(w | c)

(applying Bayes’ rule and dropping the denominator P(w), which is constant across candidates).

Finding the right candidate boils down to maximizing the above joint probability: the unigram probability P(c) (given by the language model), multiplied by the probability P(w | c) that w is a spelling error of c (given by the error model).
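Combining the two models is then a one-liner. In this sketch, error_prob(w, c) stands in for whatever error model we build next; the name is ours, not from any library:

    def correct_noisy_channel(w, candidates, P, error_prob):
        """Pick c maximizing P(c) * P(w | c): language model times error model."""
        return max(candidates, key=lambda c: P.get(c, 0.0) * error_prob(w, c))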

Computing P(c) is straightforward: we already have the language model. How do we build the error model? That is where this whole business becomes tricky. Building a language model is easy because there’s a lot of data available. Building an error model is harder, since we don’t know what spelling errors are generally made and at what frequency. Fortunately, there are a few datasets available, like the Birkbeck spelling error corpus built by Roger Mitton. Some examples are below. Along with the spelling mistakes, other contextual information is captured: an example sentence where the mistake occurred, which native-language speakers are more likely to make it, and so on. Apparently, the mistakes below are likely to be made by Tamil speakers.

  • aero → arrow: “they throw a *”
  • ansion → ancient: “an * method of hunting”
  • dear → deer: “after the *”
  • flu → flew: “the birds * up”
  • noice → noise: “making any *”
  • stright → straight: “bring it * to the hunters”

Now that we have a dataset of common spelling errors, we can start building the error model. Just look up combinations like

    P(aero | arrow), P(dear | deer), P(noice | noise)

and so on, keep track of their frequencies, and we are done. Right? Well, not so fast. This approach won’t work because regardless of how good the spelling error corpus is, it’s not going to cover all cases. Moreover, there are always new kinds of errors being introduced.

Instead of going for whole-word

    P(misspelled word | intended word)

combinations, let’s look at the patterns. Ignore the common letters, and go for the exact mistake. In the dear/deer example above, we notice that there’s an a vs e confusion; in noice/noise, a c vs s confusion. So, instead of

    P(dear | deer)

if we consider the character-level

    P(a | e)

we will be able to arrive at a more usable error model.
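Here is a minimal sketch of harvesting such character-level confusions. It handles only same-length pairs differing by a single substitution; a full model would align each pair with an edit-distance backtrace and also count deletions, insertions, and transpositions:

    from collections import Counter

    def build_confusion_counts(pairs):
        """Count single-character substitutions in (error, correct) pairs."""
        subs = Counter()
        for error, correct in pairs:
            if len(error) == len(correct):
                diffs = [(e, c) for e, c in zip(error, correct) if e != c]
                if len(diffs) == 1:
                    subs[diffs[0]] += 1  # (typed char, intended char)
        return subs

    pairs = [("dear", "deer"), ("noice", "noise")]
    print(build_confusion_counts(pairs))
    # Counter({('a', 'e'): 1, ('c', 's'): 1})

Normalizing these counts (e.g., dividing the count for (a, e) by how often e appears in the corpus) turns them into the P(w | c) estimates the error model needs.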

Once we have our error model, spellchecking/correction is a matter of computing the conditional probability we arrived at previously to choose the best candidate. But where did we get these candidates from to begin with?

Candidate Model

A misspelled word looks similar to the intended word most of the time: some characters are missing, superfluous characters are added, adjacent characters are swapped, and so on. We model the different types of spelling errors people make using a distance metric called Damerau-Levenshtein distance, and generate candidates based on that.

Damerau-Levenshtein distance measures the distance between two strings (a word and its misspelling, for instance) in terms of the minimum number of basic edit operations required to transform one string into the other. These edit operations are (a small implementation sketch follows the list):

  1. Substitution: Substitute a single character by another (e.g., pwned → owned)
  2. Deletion: Delete a single character (e.g., thew → the)
  3. Addition: Add a single character (e.g., langage → language)
  4. Transposition: Transpose or exchange a single pair of adjacent characters (e.g., recieve → receive)
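For the curious, here is a compact sketch of the computation, in the common “optimal string alignment” variant of Damerau-Levenshtein distance:

    def dl_distance(a, b):
        """Minimum substitutions, deletions, additions, and adjacent
        transpositions needed to turn string a into string b."""
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # addition
                              d[i - 1][j - 1] + cost) # substitution
                if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                        and a[i - 2] == b[j - 1]):
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
        return d[len(a)][len(b)]

    # dl_distance("thew", "the") == 1; dl_distance("recieve", "receive") == 1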

If you get the drift, you can see that we can generate the set of candidates for spelling correction by transforming a word using the above operations. So, the candidate set when w = “thew” might be:

    C = {the, thaw, chew, whew, them, then, thee, threw, thews, hew, tehw, thwe, …}

This can become quite a large set. If the length of the word is n, there are n deletions, 26 × n substitutions, 26 × (n + 1) insertions, and n - 1 transpositions possible. That is 54n + 25 candidates!

So, in order to resolve a small word like thew, we will have to deal with 241 candidates. Of course, these are just the candidates at a Damerau-Levenshtein distance of 1 (i.e., only 1 edit is allowed). The amount of computation increases with the number of edits allowed.

We make a certain assumption about the edit distance. People typically tend to make a lot of spelling mistakes, but each mistake itself does not deviate too much from the actual word. The chances of someone missing 4 characters of a word are slim, for instance. Therefore, we make allowance for at most 2 mistakes; that is, we generate candidates that are at most 2 edits away. Experiments have shown that this is a reasonable assumption, as most errors are covered by it.

The number of candidates is still very high. But if you go back and look at the candidate set, C, above, you will notice that there are many candidates that are not valid words. So, we prune this set to include only the words that are present in our language model. As you can imagine, this reduces the size of the candidate set by a huge margin.
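A sketch of the generate-and-prune step, along the lines of the Norvig essay cited below (here the language model’s vocabulary is simply the keys of the unigram table P):

    import string

    def edits1(word):
        """All strings one edit (delete, transpose, substitute, insert) away."""
        letters = string.ascii_lowercase
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        substitutes = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + substitutes + inserts)

    def candidates(word, P):
        """Known words at edit distance 1, else at distance 2, else the word itself."""
        known1 = {w for w in edits1(word) if w in P}
        if known1:
            return known1
        known2 = {w2 for w1 in edits1(word) for w2 in edits1(w1) if w2 in P}
        return known2 or {word}

Note that len(edits1("thew")) comes out a little under the 241 upper bound, because the set removes duplicate strings (substituting a character with itself, for instance, just reproduces thew).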

Additional Optimizations and Improvements

Equipped with the language model, the error model, and the candidate model, it’s straightforward to write a spelling corrector that works fairly well. Peter Norvig has written an excellent article on writing a simple spelling corrector in Python.

You can, of course, make further improvements to the performance of your spell corrector. The list below is not exhaustive, but indicative.

  • In the candidate model, allow whitespace as well as hyphen. E.g., inlaw might be in law or in-law
  • Include words with similar spellings and pronunciations as candidates. This might require a phonetic model of the language. E.g., tuf is probably tough
  • Consider the input device when building the error model. E.g., tömàto is the result of someone longpressing characters on a phone’s keypad.
  • Instead of using a unigram language model, use a bigram or a trigram language model. The greater the n, the more the context, and hence potentially better spelling error resolution. E.g., “your” has a better chance of getting corrected to “you’re” in the context “lol your stupid” (see the sketch after this list).
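To illustrate the last point, here is a minimal context-aware sketch. bigram_P is a hypothetical table mapping (previous word, word) pairs to probabilities, trained just like the unigram model but over consecutive word pairs:

    def correct_in_context(prev_word, w, cands, bigram_P, error_prob):
        """Score candidates by P(c | prev) * P(w | c) instead of P(c) * P(w | c)."""
        return max(cands, key=lambda c: bigram_P.get((prev_word, c), 1e-9)
                                        * error_prob(w, c))

    # In "lol your stupid", the bigram ("lol", "you're") being relatively
    # common lets "you're" beat "your", something a unigram model alone,
    # which sees "your" as a perfectly frequent word, would never do.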

Epilogue

While autocorrect tools do have a mind of their own, spellchecking and autocorrection are well-addressed problems for English and other European languages. However, spell correction has a long way to go for other languages, especially Indian languages.

One of the major challenges in building error models for languages other than English is the lack of datasets like the Birkbeck corpus. There are also syntax-related challenges. Indian languages are phonetic, and it is not clear what kinds of spelling error patterns exist. This is an area that needs to be studied a lot more.

Logs collected from multilingual applications, such as Reverie’s Swalekh input tool, multilingual search, and localization APIs, might give us some insight into error patterns. User-generated content (UGC) from social media and forums is another useful source for building the error model.

Another challenge unique to India, and one that we deal with at Reverie, is that Indians are by and large polyglots. When we speak or write, we switch between multiple languages instinctively. That means language detection is needed even before we get into spell correction. Also, our English is extremely pliable. Many a time, the English we encounter is a heady concoction of English and one or more Indian languages. I’ll leave you with a few examples of actual search queries from our logs.

  • micromax लो बजट मोबाइल फोन अंडर थाउजेंड (“micromax low budget mobile phone under thousand”: English words typed in Devanagari)
  • saibaba marble styachyu (styachyu = statue)
  • p1000 ಕಿಡ್ಸ್ ಅಜುಕೆಶನಲ್ ಟ್ಯಾಬ್‌ಲೆಟ್… (“p1000 kids educational tablet” in Kannada script)
  • कंडोम जेली (“condom jelly”)
  • सिलाई मचिन (“silai machin”, i.e., sewing machine)
  • kandam के पैक (“kandam ke pack”)
  • mobile kabar

Damn interesting, don’t you think? And colorful too. How do you go from mobile kabar to mobile phone cover, or kandam के पैक to pack of condoms? These and other such hard NLP problems keep us going. Until next time then.

References

  1. Dan Jurafsky’s detailed set of slides, Spelling Correction and the Noisy Channel [pdf]
  2. Peter Norvig’s excellent essay with detailed code, How to Write a Spelling Corrector
  3. Peter Norvig’s book chapter on language models, Natural Language Corpus Data [pdf]
  4. Birkbeck Spelling Error Corpus, a great source for building error models.
  5. The Wikipedia article on different string distance measures is instructive.

Originally published at reverieinc.com.
