Sunday, November 1, 2009

Spelling Corrector, Broken Unicode Regex Edition

I'm throwing this out there in the hope that wiser Perl 6 heads can help me sort out what to do with it. As I recounted two posts ago, it occurred to me that the spelling corrector could properly support Unicode if, instead of generating the list of letter combinations you might have meant, it generated a list of regexes that match them. This drastically cuts down on the combinatorial explosion from calling the edit routine twice, and it makes the script handle Unicode properly. On the downside, it is likely to be a good bit slower.
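To make the idea concrete, here is a rough sketch in Python (the language of Norvig's original, not the Perl 6 of this post). It is an illustration under assumptions, not the script discussed here: the names `WILD`, `edits1`, `to_regex`, and `candidates` are made up for the example, and it only covers a single round of edits. Wherever the original approach would loop over an alphabet, the sketch emits a wildcard that matches any one character, so no alphabet is ever needed.

```python
import re

WILD = None  # hypothetical sentinel token meaning "any single character"

def edits1(tokens):
    """Return every token tuple one edit away from `tokens`.

    Replacements and insertions use WILD instead of looping over an alphabet.
    """
    splits = [(tokens[:i], tokens[i:]) for i in range(len(tokens) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + (b[1], b[0]) + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + (WILD,) + b[1:] for a, b in splits if b]
    inserts = [a + (WILD,) + b for a, b in splits]
    return set(deletes + transposes + replaces + inserts)

def to_regex(tokens):
    """Compile a token tuple into a regex; literal characters are escaped."""
    return re.compile("".join("." if t is WILD else re.escape(t) for t in tokens))

def candidates(word, dictionary):
    """Return the dictionary words that lie within one edit of `word`."""
    patterns = [to_regex(t) for t in edits1(tuple(word))]
    return {w for w in dictionary if any(p.fullmatch(w) for p in patterns)}
```

For example, `candidates("naive", {"na\u00efve", "nave", "knave", "native"})` keeps everything except "knave", which is two edits away. It also shows the downside mentioned above: every dictionary word has to be tried against every pattern, rather than looked up directly.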

This script implements it. Unfortunately, Rakudo does not yet support variable interpolation in regexes, so I can't test the script. I also suspect I have mucked up the combination of any and grep, but it's hard to be sure without testing the script.


Anyway, for those of you keeping score at home, Norvig's original Python script does this task in 21 lines of code. This quasi-correct Perl 6 version adds full Unicode support with just 3 additional lines of code, for 24 total. And that's counting 4 lines which are just }, and the semi-optional use v6; line as well. Assuming fixing the issues doesn't require additional lines, this looks like a clear win for Perl 6. And I'm quite sure this code can be made a good bit better and clearer...
