Consider the following list:
    a_list = ['🤔 🙈 me así, bla es se 😌 ds 💕👭👙']
How can I extract all the emojis inside it into a new list?
    new_lis = ['🤔 🙈 😌 💕 👭 👙']
I tried to use regex, but I do not have all the possible emoji encodings.
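You can use the emoji library. You can check whether a single code point is an emoji code point by checking if it is contained in emoji.UNICODE_EMOJI:

    import emoji

    def extract_emojis(s):
        return ''.join(c for c in s if c in emoji.UNICODE_EMOJI)

Note: emoji.UNICODE_EMOJI was removed in version 2.0 of the library; there, emoji.is_emoji(c) performs the same check.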
I think it's important to point out that the previous answers won't work with emojis like 👨‍👩‍👦‍👦, because it consists of 4 emojis joined together, and using ... in emoji.UNICODE_EMOJI will return 4 different emojis. The same goes for emojis with a skin color, like 🙅🏽.
My solution includes the emoji and regex modules. The regex module supports recognizing grapheme clusters (sequences of Unicode code points rendered as a single character), so we can count emojis like 👨‍👩‍👦‍👦 as one:

    import emoji
    import regex

    def split_count(text):
        emoji_list = []
        # \X matches one extended grapheme cluster
        data = regex.findall(r'\X', text)
        for word in data:
            if any(char in emoji.UNICODE_EMOJI for char in word):
                emoji_list.append(word)
        return emoji_list
Testing (with more emojis with skin color):

    line = ["🤔 🙈 me así, se 😌 ds 💕👭👙 hello 👩🏾‍🎓 emoji hello 👨‍👩‍👦‍👦 how are 😊 you today🙅🏽🙅🏽"]
    # split_count expects a string, so pass the list's only element
    counter = split_count(line[0])
    print(' '.join(emoji for emoji in counter))

    🤔 🙈 😌 💕 👭 👙 👩🏾‍🎓 👨‍👩‍👦‍👦 😊 🙅🏽 🙅🏽
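If installing regex is not an option, the grouping idea can be approximated with the standard re module. The following is only a rough sketch: the code point ranges in EMOJI_BASE are a hand-picked assumption rather than the full Unicode emoji list, and extract_emoji_sequences is a hypothetical name. It treats a base emoji, an optional skin tone modifier or variation selector, and any further emojis attached with a zero-width joiner (U+200D) as one match.

```python
import re

# Assumption: a coarse approximation of "emoji" code points, not the full Unicode list.
EMOJI_BASE = '[\U0001F300-\U0001FAFF\u2600-\u27BF]'
# Optional skin tone modifier (U+1F3FB..U+1F3FF) or emoji variation selector (U+FE0F).
SUFFIX = '(?:[\U0001F3FB-\U0001F3FF]|\uFE0F)?'
ZWJ = '\u200D'  # zero-width joiner

# One base emoji with an optional suffix, then any number of ZWJ-joined emojis.
EMOJI_SEQUENCE = re.compile(f'{EMOJI_BASE}{SUFFIX}(?:{ZWJ}{EMOJI_BASE}{SUFFIX})*')

def extract_emoji_sequences(text):
    """Return emoji sequences, keeping ZWJ and skin tone sequences together."""
    return EMOJI_SEQUENCE.findall(text)

family = '\U0001F468\u200D\U0001F469\u200D\U0001F466\u200D\U0001F466'
thumbs_up_medium = '\U0001F44D\U0001F3FD'
sample = f'hello {family} how are you {thumbs_up_medium}'
print(extract_emoji_sequences(sample) == [family, thumbs_up_medium])
```

Unlike \X, this pattern knows nothing about grapheme cluster rules in general, so it should be read as an illustration of why sequences need special handling, not as a replacement for regex.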
If you want to include flags, like 🇵🇰, the Unicode range would be from 🇦 to 🇿, so add:

    flags = regex.findall(u'[\U0001F1E6-\U0001F1FF]', text)

to the function above, and return emoji_list + flags.
See this post for more information about the flags.
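Note that a character class over the regional indicator range matches one symbol at a time, so a flag comes back as two list items. A small sketch of a variant that keeps each flag whole, assuming well-formed text in which the indicators always come in pairs (find_flags is a hypothetical helper name, and it uses the standard re module instead of regex):

```python
import re

# Two consecutive regional indicator symbols (U+1F1E6..U+1F1FF) form one flag.
FLAG = re.compile('[\U0001F1E6-\U0001F1FF]{2}')

def find_flags(text):
    """Return every pair of regional indicator symbols found in text."""
    return FLAG.findall(text)

print(find_flags('Hello \U0001F1F7\U0001F1FA hello'))
```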
If you don't want to use an external library, a Pythonic alternative is to use regular expressions and re.findall() with a suitable regex to find the emojis:

    In [1]: import re

    In [2]: re.findall(r'[^\w\s,]', a_list[0])
    Out[2]: ['🤔', '🙈', '😌', '💕', '👭', '👙']

The regular expression r'[^\w\s,]' is a negated character class that matches any character that is not a word character, whitespace or a comma.
As I mentioned in a comment, text generally contains word characters and punctuation, which this approach handles easily; for other cases you can just add the extra characters to the character class manually. Note that since you can specify a range of characters in a character class, you can make it even shorter and more flexible.
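To make the trade-off concrete, here is a small sketch of the negated character class on two inputs. It needs only the standard library; the second sample string is made up for illustration and shows that punctuation not listed in the class is returned alongside the emojis.

```python
import re

NOT_WORD_SPACE_COMMA = re.compile(r'[^\w\s,]')

question_text = '🤔 🙈 me así, bla es se 😌 ds 💕👭👙'
print(NOT_WORD_SPACE_COMMA.findall(question_text))

# Hypothetical input: '!' and '?' are neither word characters, whitespace nor commas,
# so they are matched too.
noisy_text = 'hola! 😌 que tal?'
print(NOT_WORD_SPACE_COMMA.findall(noisy_text))

# Adding the unwanted punctuation to the class filters it out again.
print(re.findall(r'[^\w\s,!?]', noisy_text))
```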
Another solution is, instead of a negated character class that excludes the non-emoji characters, to use a character class that accepts emojis (that is, one without the leading ^). Since there are a lot of emojis with different Unicode values, you just need to add their ranges to the character class. If you want to match more emojis, here is a good reference that lists all the standard emojis with their respective ranges: http://apps.timwhitlock.info/emoji/tables/unicode
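A minimal sketch of such a positive character class follows. The ranges are a partial, hand-picked selection of Unicode blocks that contain emojis, chosen here for illustration; they do not cover every emoji, and EMOJI_RANGES is a hypothetical name.

```python
import re

# Assumption: a partial set of Unicode blocks that contain emojis.
EMOJI_RANGES = re.compile(
    '['
    '\U0001F300-\U0001F5FF'  # Miscellaneous Symbols and Pictographs
    '\U0001F600-\U0001F64F'  # Emoticons
    '\U0001F680-\U0001F6FF'  # Transport and Map Symbols
    '\U0001F900-\U0001F9FF'  # Supplemental Symbols and Pictographs
    '\u2600-\u26FF'          # Miscellaneous Symbols
    '\u2700-\u27BF'          # Dingbats
    ']'
)

a_list = ['🤔 🙈 me así, bla es se 😌 ds 💕👭👙']
new_list = [' '.join(EMOJI_RANGES.findall(s)) for s in a_list]
print(new_list)
```

Because the class matches one code point at a time, multi-code-point emojis such as flags or skin tone variants would still be split or missed with this sketch.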
The top rated answer does not always work. For example, flag emojis will not be found. Consider the string:

    s = u'Hello \U0001f1f7\U0001f1fa hello'

What would work better is:

    import re
    import emoji

    # Longest sequences first, so multi-code-point emojis win over their parts
    emojis_list = sorted(
        (''.join(x.split()) for x in emoji.UNICODE_EMOJI.keys()),
        key=len, reverse=True)
    r = re.compile('|'.join(re.escape(p) for p in emojis_list))
    print(' '.join(r.findall(s)))
Another way to do it using emoji is to use emoji.demojize to convert the emojis into their text representations. For example, 😀 will be converted to :grinning_face:. Then find all :.*: patterns, and use emoji.emojize on those.
    # -*- coding: utf-8 -*-
    import emoji
    import re

    text = """
    Of course, too many emoji characters
    😀 like 😀, #@^!*&#@^# 😀 helps 😀 people read 😀aa😀aaa😀a #douchebag
    """
    # Turn each emoji into its :name: form, collect the names, then convert back
    text = emoji.demojize(text)
    text = re.findall(r'(:[^:]*:)', text)
    list_emoji = [emoji.emojize(x) for x in text]
    print(list_emoji)

This might be a redundant way, but it's an example of how emoji.demojize can be used.
The solution that gets exactly what tumbleweed asked for is a mix between the top rated answer and user594836's answer. This is the code that works for me in Python 3.6:

    import emoji
    import re

    test_list = ['🤔 🙈 me así,bla es,se 😌 ds 💕👭👙']

    ## Create the function to extract the emojis
    def extract_emojis(a_list):
        # Longest sequences first, so multi-code-point emojis win over their parts
        emojis_list = sorted(
            (''.join(x.split()) for x in emoji.UNICODE_EMOJI.keys()),
            key=len, reverse=True)
        r = re.compile('|'.join(re.escape(p) for p in emojis_list))
        aux = [' '.join(r.findall(s)) for s in a_list]
        return aux

    ## Execute the function
    extract_emojis(test_list)

    ## the output
    ['🤔 🙈 😌 💕 👭 👙']
Step 1: Make sure that your text is decoded as UTF-8 (this answer targets Python 2, where a plain string literal holds bytes):

    decode = text.decode('utf-8')

Step 2: Locate all the emojis in your text. To do that, separate the text character by character:

    allchars = [ch for ch in decode]

Step 3: Save all the emojis in a list:

    emojis = [c for c in allchars if c in emoji.UNICODE_EMOJI]

Full example below:

    # -*- coding: utf-8 -*-
    import emoji

    text = "🤔 🙈 me así, bla es se 😌 ds 💕👭👙"
    decode = text.decode('utf-8')
    allchars = [ch for ch in decode]
    emojis = [c for c in allchars if c in emoji.UNICODE_EMOJI]
    print emojis

    [u'\U0001f914', u'\U0001f648', u'\U0001f60c', u'\U0001f495', u'\U0001f46d', u'\U0001f459']

If you want to remove the emojis from the text:

    # Drop every whitespace-separated token that contains an emoji
    filtered = [word for word in decode.split() if not any(i in word for i in emojis)]
    clean_text = ' '.join(filtered)
    print clean_text

    me así, bla es se ds
    from emoji import UNICODE_EMOJI

    # Build the lookup set once, at import time
    EMOJI_SET = set(UNICODE_EMOJI)

    # Check whether a string contains at least one emoji
    def is_emoji(s):
        for letter in s:
            if letter in EMOJI_SET:
                return True
        return False

This is a better solution when working with large datasets, since you don't have to loop through all the emojis each time. I found this gave me better results 🙂
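The point about large datasets comes down to set membership being a hash lookup. A small sketch of the idea, using a tiny hand-written set as a stand-in for the library's full emoji table (KNOWN_EMOJIS and contains_emoji are hypothetical names):

```python
# Stand-in for the full emoji table; a real run would build this from the emoji library.
KNOWN_EMOJIS = frozenset('🤔🙈😌💕👭👙')

def contains_emoji(text):
    """Return True if text shares at least one character with KNOWN_EMOJIS."""
    # Each character of text is looked up in the set by hash, so the cost
    # does not grow with the number of known emojis.
    return not KNOWN_EMOJIS.isdisjoint(text)

rows = ['me así, bla es se', 'ds 💕👭👙', 'hello']
print([contains_emoji(row) for row in rows])
```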