Friday, 15 January 2010

What characters are allowed in Twitter hashtags?



In developing an iOS app containing a Twitter client, I must allow user-generated hashtags (which may be created elsewhere within the app, not just in the tweet body).

I would like to ensure any such hashtags are valid for Twitter, so I need to error-check the entered value for invalid characters. Bear in mind that users may be from non-English-speaking countries.

I am aware of the usual limitations, such as not beginning a hashtag with a number and no special punctuation characters, but I am wondering if there is a known list of additional characters that are technically allowed within hashtags (i.e. international characters).

As Karl has rightly pointed out, any word in any language can be a valid Twitter hashtag (as long as it meets a number of basic criteria). As such, you are asking for a list of all valid international word characters. I'm sure someone has compiled such a list somewhere, but using it would not be the most efficient approach to reaching what appears to be your initial goal: ensuring that a given hashtag is valid for Twitter.

I believe what you are looking for is a regular expression that can match all word characters within the Unicode range. Such an expression would not be dependent on locale, and would match all characters in modern typography that can appear as part of a word.

You didn't specify what language you are writing your app in, so I can't help you with a language-specific implementation. However, the basic approach is as follows:

Check if any of the bracket expressions or character classes support Unicode character ranges in your language. If yes, use them.

Check if there is a regex modifier that can enable Unicode character range support for your language.
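The first check can be done empirically by testing a non-ASCII letter against each construct. Here is a minimal Ruby sketch; the sample character, the PROBES table and the helper name unicode_aware_constructs are assumptions made for illustration, not part of any library.

```ruby
# Probe which regex constructs treat a non-ASCII letter as a word character.
# "\u00e9" is "é" (LATIN SMALL LETTER E WITH ACUTE); the sample character
# and the PROBES table are illustrative assumptions.
SAMPLE = "\u00e9"

PROBES = {
  '\w'          => /\w/,
  '[[:alpha:]]' => /[[:alpha:]]/,
  '\p{L}'       => /\p{L}/
}.freeze

# Hypothetical helper: returns the names of the constructs that matched.
def unicode_aware_constructs(sample = SAMPLE)
  PROBES.select { |_name, pattern| pattern.match?(sample) }.keys
end

puts unicode_aware_constructs.inspect
```

In Ruby, \w is limited to [a-zA-Z0-9_], so only the POSIX bracket expression and the Unicode property class are reported. The same kind of probe can be adapted to whatever regex engine your platform offers.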

Most modern languages implement regular expressions in a similar way, and a lot of them borrow heavily from Perl, so I hope the following two examples will set you on the right track:

Perl:

Use POSIX bracket expressions (e.g. [[:alpha:]], [[:alnum:]], [[:digit:]], etc.), as they give you greater control over the characters you want to match, compared to character classes (e.g. \w).

Use the /u modifier to enable Unicode support when pattern matching. Under this modifier, the ASCII platform effectively becomes a Unicode platform; hence, for example, \w will match any of the more than 100,000 word characters in Unicode.

See the Perl documentation for more info:

http://perldoc.perl.org/perlre.html#character-set-modifiers
http://perldoc.perl.org/perlrecharclass.html#posix-character-classes

Ruby:

Use POSIX bracket expressions, as they encompass non-ASCII characters. For instance, /\d/ matches only the ASCII decimal digits (0-9), whereas /[[:digit:]]/ matches any character in the Unicode Nd category.

See the Ruby documentation for more info:

http://www.ruby-doc.org/core-2.1.1/regexp.html#class-regexp-label-character+classes
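The difference between /\d/ and /[[:digit:]]/ is easy to see with a digit from another script. This is only a small sketch; the constant names and the digit_matches helper are made up for illustration.

```ruby
ASCII_DIGIT  = "7"
ARABIC_DIGIT = "\u0663" # ARABIC-INDIC DIGIT THREE, Unicode category Nd

# Hypothetical helper: reports how each digit class treats one character.
def digit_matches(char)
  {
    backslash_d: char.match?(/\d/),
    posix_digit: char.match?(/[[:digit:]]/)
  }
end

p digit_matches(ASCII_DIGIT)
p digit_matches(ARABIC_DIGIT)
```

Both classes accept the ASCII digit, but only the POSIX bracket expression accepts the Arabic-Indic one.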

Examples:

Given a list of hashtags, the following regex will match all hashtags that start with a word character (inc. international word characters) and are followed by at least one other word character, number or underscore:

m/^#[[:alpha:]][[:alnum:]_]+$/u    # Perl
/^#[[:alpha:]][[:alnum:]_]+$/      # Ruby
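For checking a single user-entered value rather than a list, the Ruby pattern can be wrapped in a small helper. The sketch below assumes the rules stated above are the ones you want to enforce; it is not Twitter's official hashtag grammar, and the names HASHTAG_PATTERN and valid_hashtag? are hypothetical. It swaps ^ and $ for \A and \z, because in Ruby ^ and $ match at every line boundary and would let a multi-line string through.

```ruby
# Same character rules as the pattern above, anchored with \A and \z so the
# whole string must match (in Ruby, ^ and $ match at each line boundary).
HASHTAG_PATTERN = /\A#[[:alpha:]][[:alnum:]_]+\z/

# Hypothetical helper: true when the candidate satisfies HASHTAG_PATTERN.
def valid_hashtag?(candidate)
  HASHTAG_PATTERN.match?(candidate)
end

samples = [
  "#caf\u00e9",          # Latin letters with an accent
  "#\u65e5\u672c\u8a9e", # Japanese ideographs
  "#2010",               # starts with a number
  "#hello world",        # contains a space
  "ok\n#tag"             # valid-looking tag hidden on a second line
]
samples.each { |tag| puts "#{tag.inspect}: #{valid_hashtag?(tag)}" }
```

The first two samples pass and the last three are rejected. Note that the + quantifier requires at least two characters after the #, so a one-letter tag such as "#a" is also rejected; change + to * if that is not what you want.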
