Python unicode codepoint to unicode character

Question :

I’m trying to write some Chinese, Russian, or other non-English characters out to a flat file for testing purposes. I’m stuck on how to turn a Unicode hexadecimal or decimal value into its corresponding character.

For example, in Python, if you had a hard-coded set of characters like абвгдежзийкл you would assign value = u"абвгдежзийкл" and have no problem.

If, however, you had a single decimal or hexadecimal value like 1081 / 0439 stored in a variable and wanted to print its corresponding character (and not just output 0x439), how would this be done? The Unicode decimal/hex value above refers to й.

Answer #1:

Python 2: Use unichr():

>>> print(unichr(1081))
й

Python 3: Use chr():

>>> print(chr(1081))
й
Answered By: NPE
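
To tie this back to the original goal of writing test characters to a flat file, here is a minimal Python 3 sketch. The helper name codepoints_to_text, the file name, and the sample code points are illustrative assumptions, not part of the answer above.

```python
import os
import tempfile


def codepoints_to_text(codepoints):
    """Join integer code points into a string (hypothetical helper)."""
    return "".join(chr(cp) for cp in codepoints)


# 1081 (decimal) and 0x0439 (hex) are the same integer, so both give 'й'.
text = codepoints_to_text([1081, 0x0439, 0x897F])

# Write with an explicit encoding; the platform default may not be UTF-8.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as fh:
    fh.write(text)

# Read the file back to confirm the characters survive the round trip.
with open(path, encoding="utf-8") as fh:
    round_trip = fh.read()
```

Passing encoding="utf-8" explicitly is probably the safest choice here, since the default encoding used by open() can differ between platforms.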

Answer #2:

So the answer to the question is:

  1. convert the hexadecimal string to an integer with int(hex_value, 16)
  2. then get the corresponding string with chr().

To sum up:

>>> print(chr(int('0x897F', 16)))
西
Answered By: Édouard Lopez
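
Hex values often arrive in slightly different notations. The sketch below wraps the two steps above in a small function; char_from_hex is a hypothetical helper name, and the set of accepted notations ('U+0439', '0x439', '439') is an assumption made for illustration.

```python
def char_from_hex(hex_value):
    """Return the character for a hex code point string (hypothetical helper).

    Accepts 'U+0439', '0x439' or '439'.
    """
    cleaned = hex_value.strip()
    if cleaned.upper().startswith("U+"):
        cleaned = cleaned[2:]
    # int(..., 16) accepts an optional '0x' prefix on its own.
    return chr(int(cleaned, 16))


samples = [char_from_hex(s) for s in ("0439", "0x439", "U+0439", "0x897F")]
```

Invalid input still raises ValueError from int() or chr(), which seems reasonable for test tooling.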

Answer #3:

If you run into the error:

ValueError: unichr() arg not in range(0x10000) (narrow Python build)

while trying to convert your hex value with unichr() in Python 2, you can get around it by building a \U escape sequence and decoding it:

>>> n = int('0001f600', 16)
>>> s = '\\U{:0>8X}'.format(n)
>>> s
'\\U0001F600'
>>> binary = s.decode('unicode-escape')
>>> print(binary)
😀
Answered By: Jaymon
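
For comparison, Python 3.3 and later have no narrow builds, so the workaround should not be needed there. A minimal sketch, assuming Python 3 and the same code point as above:

```python
n = int("0001f600", 16)

# Python 3.3+ has no narrow builds, so chr() covers the full range up to 0x10FFFF.
s = chr(n)

# One code point in the string, four bytes once encoded as UTF-8.
length = len(s)
utf8_bytes = s.encode("utf-8")
```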

Answer #4:

While working on a project that involved parsing some JSON, I encountered a similar problem. I had a lot of strings in which every non-ASCII character was escaped, like this:

>>> print(content)
\u0412\u044B j\u0435\u0441\u0442\u0435 \u0438\u0437 \u0420\u043E\u0441\u0441\u0438\u0438?
...
>>> print(content)
\u010Cemu jesi na\u010Dinal izu\u010Dati med\u017Euslovjansky jezyk?

Converting such mixes symbol-by-symbol with unichr() would be tedious. The solution I eventually decided on:

content.encode("utf8").decode("unicode-escape")

The first operation (encoding) produces byte strings like this:

b'\\u0412\\u044B j\\u0435\\u0441\\u0442\\u0435 \\u0438\\u0437 \\u0420\\u043E\\u0441\\u0441\\u0438\\u0438?'
b'\\u010Cemu jesi na\\u010Dinal izu\\u010Dati med\\u017Euslovjansky jezyk?'

and the second operation (decoding) turns the byte string back into a Unicode string, interpreting each \uXXXX escape sequence as the character it names. This “unpacks” the characters, giving a result like this:

Вы jесте из России?
Čemu jesi načinal izučati medžuslovjansky jezyk?
Answered By: Victor Bulatov
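
One caveat with the encode/decode round trip: the unicode-escape codec treats non-escape bytes as Latin-1, so any character that is already non-ASCII in the input comes out garbled. A possible alternative, sketched here for Python 3, replaces only the literal \uXXXX sequences with chr(). The helper name unescape_u and the sample string (which mixes escaped and unescaped characters) are assumptions made for illustration.

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_u(text):
    """Replace literal \\uXXXX sequences with their characters (hypothetical helper)."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# Raw string: the backslashes are literal, as in the escaped input above.
# The 'ž' is deliberately left unescaped to show that it is preserved.
content = r"\u010Cemu jesi na\u010Dinal izu\u010Dati medžuslovjansky jezyk?"
decoded = unescape_u(content)
```

This sketch does not merge surrogate pairs written as two consecutive \uXXXX escapes. If the input is actual JSON, json.loads already handles those escapes, including surrogate pairs.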
