I have 3 APIs that return JSON data into 3 dictionary variables. I take some of the values from the dictionaries and read the ones I want into the list
valuelist. One of the steps is to remove the punctuation from them. I normally use
string.translate(None, string.punctuation) for this, but because the dictionary data is unicode I get the error:
    wordlist = [s.translate(None, string.punctuation) for s in valuelist]
    TypeError: translate() takes exactly one argument (2 given)
Is there a way around this? Either by encoding the unicode or with a replacement for string.translate?
The translate method works differently on Unicode objects than on byte-string objects:
    >>> help(unicode.translate)
    S.translate(table) -> unicode

    Return a copy of the string S, where all characters have been mapped
    through the given translation table, which must be a mapping of
    Unicode ordinals to Unicode ordinals, Unicode strings or None.
    Unmapped characters are left untouched. Characters mapped to None
    are deleted.
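To make the three kinds of table values concrete, here is a small sketch. The characters and replacements are made up purely for illustration, and it is written so that it should run on Python 3 as well, where every str is a Unicode string:

```python
# Hypothetical table showing each kind of value the help text mentions.
table = {
    ord(u"a"): ord(u"A"),  # ordinal -> ordinal: replace one character
    ord(u"b"): u"<b>",     # ordinal -> string: replace with several characters
    ord(u"c"): None,       # ordinal -> None: delete the character
}

# "d" is not in the table, so it is left untouched.
result = u"abcd".translate(table)
print(result)
```

Note that the keys are always ordinals (integers), never the characters themselves.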
So your example would become:
    remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
    word_list = [s.translate(remove_punctuation_map) for s in value_list]
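If you are on Python 3 (an assumption, since the question is about Python 2 unicode objects), the same table can be built with str.maketrans, whose third argument lists characters to delete. A minimal sketch, with made-up sample values standing in for the API data:

```python
import string

# Third argument of str.maketrans: characters that should map to None.
remove_punctuation_table = str.maketrans("", "", string.punctuation)

# Hypothetical sample data in place of the values read from the APIs.
value_list = ["Hello, world!", "it's (fine)."]
word_list = [s.translate(remove_punctuation_table) for s in value_list]
print(word_list)
```

The resulting table is an ordinary dict of ordinals, so it behaves the same way as the hand-built map.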
Note, however, that string.punctuation only contains ASCII punctuation. Full Unicode has many more punctuation characters, but it all depends on your use case.
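If you do need the non-ASCII punctuation as well, one possible approach is to build the table from the Unicode character categories instead. This is only a sketch, assuming Python 3 (on Python 2 you would use unichr instead of chr); the function name is my own:

```python
import sys
import unicodedata

# Map every code point whose general category starts with "P"
# (Pc, Pd, Ps, Pe, Pi, Pf, Po) to None, i.e. mark it for deletion.
unicode_punctuation_map = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P")
)

def strip_unicode_punctuation(text):
    """Remove all characters in a Unicode punctuation category."""
    return text.translate(unicode_punctuation_map)

print(strip_unicode_punctuation(u"\u00abHello\u00bb, world\u2026 \u00bfqu\u00e9?"))
```

Be aware that this is not a superset of string.punctuation: characters such as $, + and ~ are classified as symbols (category S), not punctuation, so they survive unless you add them to the map yourself.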
I noticed that string.translate is deprecated. Since you are removing punctuation, not actually translating characters, you can use the re.sub function.
    >>> import re
    >>> s1 = "this.is a.string, with; (punctuation)."
    >>> s1
    'this.is a.string, with; (punctuation).'
    >>> re.sub(r"[.\t,:;()]", "", s1)
    'thisis astring with punctuation'
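Rather than typing the character class by hand, you could probably build it from string.punctuation. A minimal sketch, assuming ASCII punctuation is all you need; re.escape takes care of characters such as ], ^, - and the backslash, which are special inside a character class. The helper name is hypothetical:

```python
import re
import string

# Character class matching any single ASCII punctuation character.
punctuation_re = re.compile("[" + re.escape(string.punctuation) + "]")

def remove_punctuation(text):
    """Delete every ASCII punctuation character from text."""
    return punctuation_re.sub("", text)

print(remove_punctuation("this.is a.string, with; (punctuation)."))
```

Compiling the pattern once is worthwhile if you run it over a whole list of values.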
This version lets you map individual characters to other characters:
    def trans(to_translate):
        tabin = u'??????'
        tabout = u'??????'
        tabin = [ord(char) for char in tabin]
        translate_table = dict(zip(tabin, tabout))
        return to_translate.translate(translate_table)
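As an illustration of the same idea with concrete characters, here is a sketch that takes the two character sets as parameters. The accented letters and the names make_translator and strip_accents are hypothetical choices of mine, not the characters used above:

```python
def make_translator(tabin, tabout):
    """Build a function mapping each character of tabin to the
    character at the same position in tabout."""
    # unicode.translate / str.translate needs ordinals as keys.
    table = dict(zip((ord(char) for char in tabin), tabout))

    def translate(text):
        return text.translate(table)

    return translate

# Hypothetical example: replace a few accented vowels with plain ones.
strip_accents = make_translator(u"\u00e1\u00e9\u00ed\u00f3\u00fa", u"aeiou")
print(strip_accents(u"canci\u00f3n \u00fatil"))
```

Building the table once and reusing the returned function avoids rebuilding the dictionary on every call.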
The re module allows a function to be used as the replacement argument; it should take a
Match object and return a suitable replacement string. We can use this to build a custom character translation function:
    import re

    def mk_replacer(oldchars, newchars):
        """A function to build a replacement function"""
        mapping = dict(zip(oldchars, newchars))

        def replacer(match):
            """A replacement function to pass to re.sub()"""
            return mapping.get(match.group(0), "")

        return replacer
An example. Match all lower-case letters ([a-z]), translate ‘h’ and ‘i’ to ‘H’ and ‘I’ respectively, and delete other matches:
    >>> re.sub("[a-z]", mk_replacer("hi", "HI"), "hail")
    'HI'
As you can see, it may be used with short (incomplete) replacement sets, and it may be used to delete some characters.
A Unicode example (in Python 2, \W without the re.UNICODE flag matches the Cyrillic letters):
    >>> re.sub(r"[\W]", mk_replacer(u'\u0435\u0438\u043f\u0440\u0442\u0432', u"EIPRTV"), u'\u043f\u0440\u0438\u0432\u0435\u0442')
    u'PRIVET'
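If you would rather leave unlisted characters alone instead of deleting everything the pattern matches, one variation is to build the pattern from the characters being translated. This is only a sketch of that idea; translate_chars is a hypothetical name:

```python
import re

def translate_chars(text, oldchars, newchars):
    """Replace each character of oldchars with the one at the same
    position in newchars. Characters of oldchars beyond the length of
    newchars are deleted; all other characters are left untouched."""
    if not oldchars:
        return text  # "[]" would not be a valid pattern
    mapping = dict(zip(oldchars, newchars))
    pattern = "[" + re.escape(oldchars) + "]"
    return re.sub(pattern, lambda m: mapping.get(m.group(0), ""), text)

print(translate_chars("hail", "hi", "HI"))
print(translate_chars("hail", "hia", "HI"))
```

In the second call ‘a’ has no counterpart in the replacement set, so it is deleted, while ‘l’ is never matched and stays as it is.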
As I stumbled upon the same problem and Simon’s answer was the one that helped me to solve my case, I thought of showing an easier example just for clarification:
from collections import defaultdict
And then for the translation, say you’d like to remove the ‘@’ and ‘\r’ characters. The keys of the map must be Unicode ordinals, hence ord():
    remove_chars_map = defaultdict()
    remove_chars_map[ord('@')] = None
    remove_chars_map[ord('\r')] = None
    new_string = old_string.translate(remove_chars_map)
And an example:
    old_string = u"word1@\r word2@\r word3@\r"
    new_string = u"word1 word2 word3"
‘@’ and ‘\r’ removed.
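For what it is worth, a plain dict should work just as well as a defaultdict here, and dict.fromkeys makes the table a one-liner. A small sketch wrapping it in a helper (remove_chars is a hypothetical name):

```python
def remove_chars(text, chars):
    """Return text with every character listed in chars deleted."""
    # dict.fromkeys gives every key the value None, which translate()
    # interprets as "delete this character".
    table = dict.fromkeys(map(ord, chars))
    return text.translate(table)

old_string = u"word1@\r word2@\r word3@\r"
new_string = remove_chars(old_string, u"@\r")
print(new_string)
```

On Python 2 the text has to be a unicode object for this form of translate to apply.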