Monday, October 26, 2009

Spelling Corrector Thoughts

Last night while I couldn't sleep I was pondering the spelling corrector code. One thing I thought was an advantage over the Python version is that my Perl 6 version used <alpha> to match words, rather than [a-z]. My understanding (though I'm having a hard time finding it in the spec at the moment) is that <alpha> is Unicode-aware, which seems like a big win in terms of power.

Unfortunately, my code, following Norvig's version, uses just a to z when it wants to substitute or add an arbitrary letter. I have no idea how to do a Unicode equivalent of that -- and even if I did, the combinatorial problem which already cripples Rakudo's ability to handle the two-edit case would just get horribly worse. How to handle that mismatch, I wondered?

That's when it occurred to me: instead of constructing a candidate list of psuedo-words that might match real words in our dictionary, construct a psuedo-word regular expression and grep it against the list of words in the dictionary. I'm guessing it will be slower, but it will be every bit as Unicode-aware as <alpha> is, and I'm pretty sure it can be done with only a few more lines of code.

Okay, now that I've thought of it, it's time to actually try to write it...

No comments:

Post a Comment