Question:
I have some strings that contain a mix of English and non-English letters. For example:
w='_1991_اف_جي2'
How can I recognize strings of this type using regex or any other fast method in Python?
I would prefer not to compare the letters of the string one by one against a list of letters, but to do it quickly in one shot.
Answer #1:
You can simply check whether the string can be encoded using only ASCII characters (the Latin alphabet plus some other characters). If it cannot, it contains characters from some other alphabet.
Note the comment # -*- coding: utf-8 -*- at the top of the Python file. Without it, Python 2 raises an encoding error for the non-ASCII literals; Python 3 assumes UTF-8 source by default.
# -*- coding: utf-8 -*-
def isEnglish(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True
assert not isEnglish('slabiky, ale liší se podle významu')
assert isEnglish('English')
assert not isEnglish('_1991_اف_جي2')
assert not isEnglish('how about this one : جي asf')
assert isEnglish('?fd4))45s&')
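A slightly shorter variant of the same idea, as a sketch: encoding straight to ASCII raises UnicodeEncodeError at the first non-ASCII character, so the UTF-8 round trip is not needed. The name is_ascii_encodable is hypothetical, and Python 3 str input is assumed.

```python
def is_ascii_encodable(s):
    """Return True if every character of s fits in ASCII (Python 3 str assumed)."""
    try:
        s.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True


print(is_ascii_encodable('English'))       # True
print(is_ascii_encodable('_1991_اف_جي2'))  # False
```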
Answer #2:
If you work with byte strings (not unicode objects) in Python 2, you can strip the punctuation with translate() and check the result with isalnum(), which is better than throwing exceptions:
import string
def isEnglish(s):
    return s.translate(None, string.punctuation).isalnum()
print isEnglish('slabiky, ale liší se podle významu')
print isEnglish('English')
print isEnglish('?? ???????? ?? ?????? ??')
print isEnglish('how about this one : ? asf?')
print isEnglish('?fd4))45s&')
print isEnglish('????? ?? ???????')
> False
> True
> False
> False
> True
> False
You can also filter non-ASCII characters out of a string with this function:
ascii = set(string.printable)
def remove_non_ascii(s):
    return filter(lambda x: x in ascii, s)
remove_non_ascii('slabiky, ale liší se podle významu')
> slabiky, ale li se podle vznamu
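The code above relies on Python 2 behaviour (str.translate(None, ...), the print statement, filter() returning a string). A rough Python 3 equivalent might look like the sketch below. The names is_english_py3 and remove_non_ascii_py3 are hypothetical, and the extra isascii() call (Python 3.7+) is an assumption added because str.isalnum() accepts non-ASCII letters in Python 3.

```python
import string

# Translation table that deletes ASCII punctuation (Python 3 form of translate).
_STRIP_PUNCT = str.maketrans('', '', string.punctuation)
_PRINTABLE = frozenset(string.printable)


def is_english_py3(s):
    # isascii() is needed because isalnum() alone is True for letters like 'š'.
    return s.isascii() and s.translate(_STRIP_PUNCT).isalnum()


def remove_non_ascii_py3(s):
    # filter() returns an iterator in Python 3, so join the kept characters.
    return ''.join(ch for ch in s if ch in _PRINTABLE)


print(is_english_py3('English'))
print(remove_non_ascii_py3('slabiky, ale liší se podle významu'))
```

As with the original, a string containing spaces is reported as not English, because isalnum() rejects whitespace.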
Answer #3:
IMHO this is the simplest solution:
def isEnglish(s):
    return s.isascii()
print(isEnglish("Test"))
print(isEnglish("_1991_اف_جي2"))
Output:
True
False
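Note that str.isascii() was added in Python 3.7. For older interpreters, a fallback along these lines should behave the same way; the helper name is_ascii is hypothetical.

```python
def is_ascii(s):
    """Return True if all code points of s are below 128."""
    if hasattr(s, 'isascii'):  # available on str from Python 3.7
        return s.isascii()
    # Fallback: all() stops at the first non-ASCII character.
    return all(ord(ch) < 128 for ch in s)


print(is_ascii("Test"))          # True
print(is_ascii("_1991_اف_جي2"))  # False
```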
Answer #4:
import re
# Matches only when the whole string consists of ASCII characters.
english_check = re.compile(r'[\x00-\x7F]*')
if english_check.fullmatch(w):
    print("english", w)
else:
    print("other:", w)
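A regex can also look for the first character outside the ASCII range instead of matching the allowed ones, which lets the search stop early. This is a minimal sketch; NON_ASCII and contains_non_ascii are hypothetical names.

```python
import re

# Any single character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r'[^\x00-\x7F]')


def contains_non_ascii(s):
    """Return True if s has at least one non-ASCII character."""
    return NON_ASCII.search(s) is not None


w = '_1991_اف_جي2'
print(contains_non_ascii(w))            # True
print(contains_non_ascii('_1991_fg2'))  # False
```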
Answer #5:
w.isidentifier()
The docs describe the method as follows:
Return true if the string is a valid identifier according to the language definition, section Identifiers and keywords.
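One caveat worth checking before relying on this: Python 3 identifiers may contain non-ASCII letters, so isidentifier() on its own accepts the string from the question. The sketch below combines it with isascii() (Python 3.7+); is_ascii_identifier is a hypothetical helper name.

```python
w = '_1991_اف_جي2'

# Arabic letters are valid identifier characters in Python 3.
print(w.isidentifier())  # True


def is_ascii_identifier(s):
    # Restrict the identifier check to pure-ASCII strings.
    return s.isascii() and s.isidentifier()


print(is_ascii_identifier(w))            # False
print(is_ascii_identifier('_1991_fg2'))  # True
```

This only suits inputs that look like identifiers; a string with spaces or punctuation fails the check even when it is plain English.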
Answer #6:
I believe this one has minimal runtime, since it stops as soon as it finds a character that is not a Latin letter. It also uses a generator, which keeps memory usage low.
import string
def has_only_latin_letters(name):
    char_set = string.ascii_letters
    # all() short-circuits on the first character outside char_set.
    return all(x in char_set for x in name)
>>> has_only_latin_letters('_1991_اف_جي2')
False
>>> has_only_latin_letters('bla bla')
False
>>> has_only_latin_letters('blä blä')
False
>>> has_only_latin_letters('????????')
False
>>> has_only_latin_letters('also a string with numbers and punctuation 1, 2, 4')
False
You can also use a different set of characters:
>>> string.ascii_letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> string.ascii_lowercase
'abcdefghijklmnopqrstuvwxyz'
>>> string.ascii_uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> string.digits
'0123456789'
>>> string.digits + string.ascii_lowercase
'0123456789abcdefghijklmnopqrstuvwxyz'
>>> string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
To add accented Latin letters, you can refer to this post.
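One possible way to accept accented Latin letters, which may differ from the approach in the linked post, is to test each character's Unicode name. This is only a sketch: has_only_latin_letters_ext is a hypothetical name, and it assumes precomposed (NFC) input, since combining marks are not counted as letters.

```python
import unicodedata


def has_only_latin_letters_ext(name):
    # A character qualifies if it is alphabetic and its Unicode name
    # starts with 'LATIN' (e.g. 'LATIN SMALL LETTER A WITH DIAERESIS').
    return all(
        ch.isalpha() and unicodedata.name(ch, '').startswith('LATIN')
        for ch in name
    )


print(has_only_latin_letters_ext('blä'))      # True
print(has_only_latin_letters_ext('bla bla'))  # False (space is not a letter)
print(has_only_latin_letters_ext('اف'))       # False
```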