Saving utf-8 texts with json.dumps as UTF8, not as \u escape sequence

Sample code:

>>> import json
>>> json_string = json.dumps("ברי צקלה")
>>> print(json_string)
"\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"

The problem: it’s not human readable. My (smart) users want to verify or even edit text files with JSON dumps (and I’d rather not use XML).

Is there a way to serialize objects into UTF-8 JSON strings (instead of \uXXXX)?

Answer #1:

Use the ensure_ascii=False switch to json.dumps(), then encode the value to UTF-8 manually:

>>> json_string = json.dumps("ברי צקלה", ensure_ascii=False).encode('utf8')
>>> json_string
b'"\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94"'
>>> print(json_string.decode())
"ברי צקלה"

If you are writing to a file, just use json.dump() and leave it to the file object to encode:

with open('filename', 'w', encoding='utf8') as json_file:
    json.dump("ברי צקלה", json_file, ensure_ascii=False)
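
As a quick check of the file-based approach, the sketch below writes the sample string to a temporary file and then inspects the raw bytes that landed on disk. The temporary directory and the file name sample.json are assumptions made purely for illustration.

```python
import json
import os
import tempfile

text = "ברי צקלה"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")  # hypothetical file name

    # The file object does the UTF-8 encoding; json.dump only writes str.
    with open(path, "w", encoding="utf8") as json_file:
        json.dump(text, json_file, ensure_ascii=False)

    # Read the raw bytes to see what was actually stored.
    with open(path, "rb") as raw_file:
        raw_bytes = raw_file.read()

    # Reading the file back as JSON returns the original string.
    with open(path, encoding="utf8") as json_file:
        loaded = json.load(json_file)

print(raw_bytes)       # UTF-8 bytes, no \uXXXX escapes
print(loaded == text)
```

Passing encoding='utf8' explicitly matters here: without it, open() falls back to a platform-dependent default encoding.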

Caveats for Python 2

For Python 2, there are some more caveats to take into account. If you are writing this to a file, you can use io.open() instead of open() to produce a file object that encodes Unicode values for you as you write, then use json.dump() instead to write to that file:

with io.open('filename', 'w', encoding='utf8') as json_file:
    json.dump(u"ברי צקלה", json_file, ensure_ascii=False)

Do note that there is a bug in the json module where the ensure_ascii=False flag can produce a mix of unicode and str objects. The workaround for Python 2 then is:

with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(u"ברי צקלה", ensure_ascii=False)
    # unicode(data) auto-decodes data to unicode if str
    json_file.write(unicode(data))

In Python 2, when using byte strings (type str), encoded to UTF-8, make sure to also set the encoding keyword:

>>> d={ 1: "ברי צקלה", 2: u"ברי צקלה" }
>>> d
{1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94', 2: u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'}
>>> s=json.dumps(d, ensure_ascii=False, encoding='utf8')
>>> s
u'{"1": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4", "2": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"}'
>>> json.loads(s)['1']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> json.loads(s)['2']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> print json.loads(s)['1']
ברי צקלה
>>> print json.loads(s)['2']
ברי צקלה
Answered By: Martijn Pieters

Answer #2:

To write to a file

import codecs
import json
with codecs.open('your_file.txt', 'w', encoding='utf-8') as f:
    json.dump({"message":"xin chào việt nam"}, f, ensure_ascii=False)

To print to stdout

import json
print(json.dumps({"message":"xin chào việt nam"}, ensure_ascii=False))
Answered By: Trần Quang Hiệp
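
Both the escaped and the readable form are valid JSON and decode to the same value; ensure_ascii only changes how non-ASCII characters are written. A minimal sketch to confirm this, using a Vietnamese sample message like the one in the answer above (the variable names are illustrative, and str.isascii() assumes Python 3.7 or later):

```python
import json

payload = {"message": "xin chào việt nam"}

escaped = json.dumps(payload)                       # default: ensure_ascii=True
readable = json.dumps(payload, ensure_ascii=False)  # keeps non-ASCII characters

# The escaped form is pure ASCII; the readable form is not.
print(escaped.isascii(), readable.isascii())

# Both decode back to the same Python object.
print(json.loads(escaped) == json.loads(readable) == payload)
```

So switching to ensure_ascii=False is safe for any consumer that reads the output as UTF-8; only the on-disk representation differs.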

Answer #3:

UPDATE: This is a wrong answer, but it is still useful to understand why it is wrong. See the comments.

How about unicode-escape?

>>> d = {1: "ברי צקלה", 2: u"ברי צקלה"}
>>> json_str = json.dumps(d).decode('unicode-escape').encode('utf8')
>>> print json_str
{"1": "ברי צקלה", "2": "ברי צקלה"}
Answered By: monitorius
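
One way to see the problem with this approach, sketched here for Python 3 (where str has no .decode(), so the JSON text is first encoded to ASCII bytes): unicode-escape undoes every backslash escape, including the ones JSON needs for quotes and backslashes, so the result may no longer be valid JSON. The sample value is an assumption chosen to trigger the failure.

```python
import json

data = {"quote": 'she said "שלום"'}

# Default dumps output is ASCII, with \uXXXX and \" escapes.
escaped = json.dumps(data)

# Python 3 equivalent of the answer's decode('unicode-escape') trick.
unescaped = escaped.encode("ascii").decode("unicode-escape")

try:
    json.loads(unescaped)
    still_valid = True
except json.JSONDecodeError:
    # The \" escapes became bare quotes, which ends the string early.
    still_valid = False

# ensure_ascii=False keeps the escapes that JSON still requires.
safe = json.dumps(data, ensure_ascii=False)

print(still_valid)
print(json.loads(safe) == data)
```

The trick only appears to work when the data contains no quotes, backslashes, or control characters.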

Answer #4:

Pieters’ Python 2 workaround fails on an edge case:

d = {u'keyword': u'bad credit  \xe7redit cards'}
with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(d, ensure_ascii=False).decode('utf8')
    try:
        json_file.write(data)
    except TypeError:
        # Decode data to Unicode first
        json_file.write(data.decode('utf8'))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 25: ordinal not in range(128)

It was crashing on the .decode('utf8') part of line 3. I fixed the problem by making the program much simpler: it avoids that step and no longer special-cases ASCII:

with io.open('filename', 'w', encoding='utf8') as json_file:
  data = json.dumps(d, ensure_ascii=False, encoding='utf8')
  json_file.write(unicode(data))
cat filename
{"keyword": "bad credit  çredit cards"}
Answered By: Jonathan Ray

Answer #5:

As of Python 3.7 the following code works fine:

from json import dumps
result = {"symbol": "ƒ"}
json_string = dumps(result, sort_keys=True, indent=2, ensure_ascii=False)
print(json_string)

Output:

{
  "symbol": "ƒ"
}
Answered By: Nik

Answer #6:

The following is my understanding from reading the answers above and searching Google.

# coding:utf-8
r"""
@update: 2017-01-09 14:44:39
@explain: str, unicode, bytes in python2to3
    #python2 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 7: ordinal not in range(128)
    #1.reload
    #importlib,sys
    #importlib.reload(sys)
    #sys.setdefaultencoding('utf-8') #python3 don't have this attribute.
    #not suggest even in python2 #see:http://stackoverflow.com/questions/3828723/why-should-we-not-use-sys-setdefaultencodingutf-8-in-a-py-script
    #2.overwrite /usr/lib/python2.7/sitecustomize.py or (sitecustomize.py and PYTHONPATH=".:$PYTHONPATH" python)
    #too complex
    #3.control by your own (best)
    #==> all string must be unicode like python3 (u'xx'|b'xx'.encode('utf-8')) (unicode 's disappeared in python3)
    #see: http://blog.ernest.me/post/python-setdefaultencoding-unicode-bytes
    #how to Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence
    #http://stackoverflow.com/questions/18337407/saving-utf-8-texts-in-json-dumps-as-utf8-not-as-u-escape-sequence
"""
from __future__ import print_function
import json
a = {"b": u"中文"}  # add u for python2 compatibility
print('%r' % a)
print('%r' % json.dumps(a))
print('%r' % (json.dumps(a).encode('utf8')))
a = {"b": u"中文"}
print('%r' % json.dumps(a, ensure_ascii=False))
print('%r' % (json.dumps(a, ensure_ascii=False).encode('utf8')))
# print(a.encode('utf8')) #AttributeError: 'dict' object has no attribute 'encode'
print('')
# python2_bytes=str; python3:bytes
b = a['b'].encode('utf-8')
print('%r' % b)
print('%r' % b.decode("utf-8"))
print('')
# python2:unicode; python3_str=unicode
c = b.decode('utf-8')
print('%r' % c)
print('%r' % c.encode('utf-8'))
"""
#python2
{'b': u'\u4e2d\u6587'}
'{"b": "\\u4e2d\\u6587"}'
'{"b": "\\u4e2d\\u6587"}'
u'{"b": "\u4e2d\u6587"}'
'{"b": "\xe4\xb8\xad\xe6\x96\x87"}'
'\xe4\xb8\xad\xe6\x96\x87'
u'\u4e2d\u6587'
u'\u4e2d\u6587'
'\xe4\xb8\xad\xe6\x96\x87'
#python3
{'b': '中文'}
'{"b": "\\u4e2d\\u6587"}'
b'{"b": "\\u4e2d\\u6587"}'
'{"b": "中文"}'
b'{"b": "\xe4\xb8\xad\xe6\x96\x87"}'
b'\xe4\xb8\xad\xe6\x96\x87'
'中文'
'中文'
b'\xe4\xb8\xad\xe6\x96\x87'
"""
Answered By: Cheney

Answer #7:

Here’s my solution using json.dump() (written for Python 2; json.dump() in Python 3 does not accept the encoding argument):

def jsonWrite(p, pyobj, ensure_ascii=False, encoding=SYSTEM_ENCODING, **kwargs):
    with codecs.open(p, 'wb', 'utf_8') as fileobj:
        json.dump(pyobj, fileobj, ensure_ascii=ensure_ascii,encoding=encoding, **kwargs)

where SYSTEM_ENCODING is set to:

locale.setlocale(locale.LC_ALL, '')
SYSTEM_ENCODING = locale.getlocale()[1]
Answered By: Neit Sabes
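
For Python 3, a comparable helper can rely on the file object alone. The following is a minimal sketch, not the answer author's code; the name json_write, the file name out.json, and the sample data are hypothetical.

```python
import json
import os
import tempfile


def json_write(path, pyobj, ensure_ascii=False, **kwargs):
    """Write pyobj to path as UTF-8 encoded JSON (hypothetical helper)."""
    # Text mode with an explicit encoding avoids depending on the locale.
    with open(path, "w", encoding="utf-8") as fileobj:
        json.dump(pyobj, fileobj, ensure_ascii=ensure_ascii, **kwargs)


# Demonstration with a throwaway file.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "out.json")  # hypothetical file name
    json_write(target, {"symbol": "ƒ"}, indent=2)
    with open(target, encoding="utf-8") as fileobj:
        written = fileobj.read()

print(json.loads(written))
```

Fixing the file encoding to UTF-8, rather than taking it from the system locale, keeps the output identical across machines; that is a design choice of this sketch rather than something the original answer prescribes.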

Answer #8:

Use codecs if possible:

with codecs.open('file_path', 'a+', 'utf-8') as fp:
    fp.write(json.dumps(res, ensure_ascii=False))
Answered By: Yulin GUO
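
Note that appending several json.dumps() results back to back does not produce a single valid JSON document. If the intent is to append records over time, one common convention is to write one JSON value per line. A sketch under that assumption, using the built-in open() on Python 3; the helper name append_record, the file name, and the sample records are illustrative and not part of the answer.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one JSON value per line, keeping non-ASCII text readable."""
    with open(path, "a", encoding="utf-8") as fp:
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    log_path = os.path.join(tmp_dir, "records.jsonl")  # hypothetical file name
    append_record(log_path, {"message": "中文"})
    append_record(log_path, {"message": "ƒ"})

    # Each line parses on its own.
    with open(log_path, encoding="utf-8") as fp:
        records = [json.loads(line) for line in fp]

print(len(records))
```

This works because json.dumps() without indent never emits a raw newline; newlines inside strings are written as \n escapes.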
