Last of the Careless Men: Spelling Corrector Thoughts

Last night while I couldn't sleep I was pondering the spelling corrector code. One thing I thought was an advantage over the Python version is that my Perl 6 version used <alpha> to match words, rather than [a-z]. My understanding (though I'm having a hard time finding it in the spec at the moment) is that <alpha> is Unicode-aware, which seems like a big win in terms of power.

Unfortunately, my code, following Norvig's version, uses just a to z when it wants to substitute or add an arbitrary letter. I have no idea how to do a Unicode equivalent of that -- and even if I did, the combinatorial problem which already cripples Rakudo's ability to handle the two-edit case would just get horribly worse. How to handle that mismatch, I wondered?

That's when it occurred to me: instead of constructing a candidate list of psuedo-words that might match real words in our dictionary, construct a psuedo-word regular expression and grep it against the list of words in the dictionary. I'm guessing it will be slower, but it will be every bit as Unicode-aware as <alpha> is, and I'm pretty sure it can be done with only a few more lines of code.

Okay, now that I've thought of it, it's time to actually try to write it...

Last of the Careless Men

Monday, October 26, 2009

Spelling Corrector Thoughts

No comments:

Post a Comment

Followers

Blog Archive