
Russian Input For Word Count

Ok, so this is what I have (special thx to Tushar Gupta, for fixing the code) HTML STS

Solution 1:

The \b notation is defined in terms of “word boundaries”, but a “word” here means a sequence of ASCII letters, digits, and underscores (the characters matched by \w), so it cannot be used for Russian text. A simple approach is to count sequences of Cyrillic letters instead; the range from U+0400 to U+0481 covers the Cyrillic letters used in Russian (along with some other Cyrillic letters). So replace the lines

var matches = this.value.match(/\b/g);
wordCounts[this.id] = matches ? matches.length / 2 : 0;

by the lines

var matches = this.value.match(/[\u0400-\u0481]+/g);
wordCounts[this.id] = matches ? matches.length : 0;

You should perhaps treat a hyphen as a letter (and therefore add \- inside the brackets), so that a hyphenated compound is counted as one word, but this is debatable: is “жили-были”, for example, two words or one?
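As a rough sketch of the approach above, the counting logic can be pulled into a standalone function. The name countCyrillicWords and the joinHyphens flag are made up for illustration and are not part of the original code. Note also that the hyphen handling here is a variant of the suggestion above: instead of adding \- inside the brackets, it only allows a hyphen between two runs of letters, so that a free-standing dash is not counted as a word.

```javascript
// Hypothetical helper (not from the original code): counts runs of
// Cyrillic letters in the range U+0400 to U+0481.
// When joinHyphens is true, a hyphen between two letter runs keeps the
// compound together, so "жили-были" is counted as one word.
function countCyrillicWords(text, joinHyphens = false) {
  const pattern = joinHyphens
    ? /[\u0400-\u0481]+(?:-[\u0400-\u0481]+)*/g
    : /[\u0400-\u0481]+/g;
  const matches = text.match(pattern);
  // match() returns null when nothing matches, hence the fallback to 0.
  return matches ? matches.length : 0;
}

console.log(countCyrillicWords("жили-были дед да баба"));       // 5
console.log(countCyrillicWords("жили-были дед да баба", true)); // 4
```

Which of the two counts is "right" depends on how you want to treat hyphenated compounds, as discussed above.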

Solution 2:

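The problem is in your regex: in JavaScript, \b is defined relative to \w, which matches only ASCII letters, digits, and the underscore, so it finds no word boundaries in Cyrillic text.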

Try changing this:

var matches = this.value.match(/\b/g);

To this:

var matches = this.value.match(/[^\s\.\!\?]+/g);

and see if that gives a result for Cyrillic input. If it works, you no longer need to divide by 2 to get the word count, because each match is now a whole word rather than a single boundary.
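To see the difference, here is a small side-by-side sketch of the two regexes. The function names countByBoundary and countByDelimiters are invented for this example; the first mirrors the original \b code, the second the replacement suggested above (with the unnecessary backslashes inside the brackets dropped, which does not change what it matches).

```javascript
// Original approach: count \b positions and divide by 2.
// Works for ASCII words, but finds nothing in Cyrillic text.
function countByBoundary(text) {
  const matches = text.match(/\b/g);
  return matches ? matches.length / 2 : 0;
}

// Suggested approach: count runs of characters that are not
// whitespace, '.', '!' or '?'. Each match is one word.
function countByDelimiters(text) {
  const matches = text.match(/[^\s.!?]+/g);
  return matches ? matches.length : 0;
}

console.log(countByBoundary("hello world"));               // 2
console.log(countByBoundary("привет мир"));                // 0
console.log(countByDelimiters("Привет, мир! Как дела?"));  // 4
```

One caveat with the delimiter-based regex: any run of other characters, such as a free-standing dash, is also counted as a word, so it may overcount slightly on punctuation-heavy text.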
