How to determine the encoding of text?

Solving problems is about exposing yourself to as many situations as possible, like "How to determine the encoding of text?", and practicing these strategies over and over. With time, it becomes second nature and the natural way you approach any problem. Big or small, always start with a plan, and use the strategies mentioned here until you are confident and ready to code the solution.
In this post, my aim is to share an overview of the topic "How to determine the encoding of text?", which you can come back to at any time. The discussion below is easy to follow.

How to determine the encoding of text?

I received some text that is encoded, but I don’t know what charset was used. Is there a way to determine the encoding of a text file using Python? (The question How can I detect the encoding/codepage of a text file covers the same ground for C#.)

Asked By: Nope


Answer #1:

EDIT: chardet seems to be unmaintained, but most of this answer still applies. See https://pypi.org/project/charset-normalizer/ for an alternative.

Correctly detecting the encoding every time is impossible.

(From the chardet FAQ:)

However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn’t English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text’s language.

There is the chardet library that uses that study to try to detect encoding. chardet is a port of the auto-detection code in Mozilla.
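For instance, a minimal sketch with chardet (the filename is just a placeholder; charset-normalizer exposes a compatible detect() helper, so it can be dropped in the same way):

import chardet

# Detection works on raw bytes, so read the file in binary mode
with open('unknown.txt', 'rb') as f:
    raw = f.read()

result = chardet.detect(raw)
print(result)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
text = raw.decode(result['encoding'])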

You can also use UnicodeDammit from Beautiful Soup. It will try the following methods (a short usage sketch follows the list):

  • An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
  • An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
  • An encoding sniffed by the chardet library, if you have it installed.
  • UTF-8
  • Windows-1252
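A minimal sketch (the byte string and the suggested encodings are just illustrative):

from bs4 import UnicodeDammit

# UnicodeDammit tries the suggested encodings first, then falls back
# to the detection steps listed above
dammit = UnicodeDammit(b"Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
print(dammit.unicode_markup)     # Sacré bleu!
print(dammit.original_encoding)  # latin-1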
Answered By: nosklo

Answer #2:

Another option for working out the encoding is to use libmagic (which is the code behind the file command). There is a profusion of Python bindings available.

The Python bindings that live in the file source tree are available as the python-magic (or python3-magic) Debian package. They can determine the encoding of a file like this:

import magic
blob = open('unknown-file', 'rb').read()
m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()
encoding = m.buffer(blob)  # "utf-8" "us-ascii" etc

There is an identically named, but incompatible, python-magic pip package on PyPI that also uses libmagic. It can also get the encoding:

import magic
blob = open('unknown-file', 'rb').read()
m = magic.Magic(mime_encoding=True)
encoding = m.from_buffer(blob)
Answered By: Hamish Downer

Answer #3:

Some encoding strategies; uncomment to taste:

#!/bin/bash
#
tmpfile=$1
echo '-- info about the file ........'
file -i "$tmpfile"
enca -g "$tmpfile"
echo 'recoding ........'
#iconv -f iso-8859-2 -t utf-8 back_test.xml > "$tmpfile"
#enca -x utf-8 "$tmpfile"
#enca -g "$tmpfile"
recode CP1250..UTF-8 "$tmpfile"

You might like to check the encoding by opening and reading the file in a loop… but you might need to check the file size first:

# PYTHON
import codecs

encodings = ['utf-8', 'windows-1250', 'windows-1252']  # add more
for e in encodings:
    try:
        fh = codecs.open('file.txt', 'r', encoding=e)
        fh.readlines()
        fh.seek(0)
    except UnicodeDecodeError:
        print('got unicode error with %s , trying different encoding' % e)
    else:
        print('opening the file with encoding:  %s ' % e)
        break
Answered By: zzart

Answer #4:

Here is an example of reading, and taking at face value, a chardet encoding prediction, reading n_lines from the file in case it is large.

chardet also gives you a probability (i.e. confidence) for its encoding prediction (I haven’t looked at how they come up with that), which is returned alongside the prediction from chardet.detect(), so you could work that in somehow if you like.

import chardet

def predict_encoding(file_path, n_lines=20):
    '''Predict a file's encoding using chardet'''
    # Open the file as binary data
    with open(file_path, 'rb') as f:
        # Join binary lines for specified number of lines
        rawdata = b''.join([f.readline() for _ in range(n_lines)])
    return chardet.detect(rawdata)['encoding']
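If you want the confidence as well, a minimal variant (the function name is hypothetical) can return chardet's full result dict instead:

def predict_encoding_with_confidence(file_path, n_lines=20):
    '''Like predict_encoding, but returns chardet's full result dict'''
    with open(file_path, 'rb') as f:
        rawdata = b''.join([f.readline() for _ in range(n_lines)])
    # The dict holds 'encoding' plus a 'confidence' between 0 and 1
    return chardet.detect(rawdata)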
Answered By: ryanjdillon

Answer #5:

This might be helpful

from bs4 import UnicodeDammit
with open('automate_data/billboard.csv', 'rb') as file:
    content = file.read()
suggestion = UnicodeDammit(content)
suggestion.original_encoding
#'iso-8859-1'
Answered By: richinex

Answer #6:

# Function: OpenRead(file)
# A text file can be encoded using:
#   (1) The default operating system code page, or
#   (2) utf8 with a BOM header
#
#  If a text file is encoded with utf8 and does not have a BOM header,
#  the user can manually add a BOM header to the text file
#  using a text editor such as Notepad++ and rerun the Python script;
#  otherwise the file is read as a codepage file with the
#  invalid codepage characters removed
import sys
if sys.version_info[0] != 3:
    print('Aborted: Python 3.x required')
    sys.exit(1)

def bomType(file):
    """
    returns file encoding string for open() function
    EXAMPLE:
        bom = bomType(file)
        open(file, encoding=bom, errors='ignore')
    """
    f = open(file, 'rb')
    b = f.read(4)
    f.close()
    if b[0:3] == b'\xef\xbb\xbf':
        # utf-8-sig strips the BOM on read
        return "utf-8-sig"
    # Check utf32 before utf16: the utf-32-le BOM starts
    # with the utf-16-le BOM
    if (b[0:4] == b'\x00\x00\xfe\xff') or (b[0:4] == b'\xff\xfe\x00\x00'):
        return "utf32"
    # Python automatically detects endianness if a utf-16 BOM is present;
    # write endianness is generally determined by the endianness of the CPU
    if (b[0:2] == b'\xfe\xff') or (b[0:2] == b'\xff\xfe'):
        return "utf16"
    # If no BOM is provided, then assume it's the codepage
    #     used by your operating system
    return "cp1252"
    # For the United States it's: cp1252

def OpenRead(file):
    bom = bomType(file)
    return open(file, 'r', encoding=bom, errors='ignore')

#######################
# Testing it
#######################
fout = open("myfile1.txt", "w", encoding="cp1252")
fout.write("* hi there (cp1252)")
fout.close()

# utf-8-sig writes the utf8 BOM that bomType() looks for
fout = open("myfile2.txt", "w", encoding="utf-8-sig")
fout.write("\u2022 hi there (utf8)")
fout.close()

# this case is still treated like codepage cp1252
#   (User responsible for making sure that all utf8 files
#   have a BOM header)
fout = open("badboy.txt", "wb")
fout.write(b"hi there.  barf(\x81\x8D\x90\x9D)")
fout.close()

# Read example file with BOM detection
fin = OpenRead("myfile1.txt")
L = fin.readline()
print(L)
fin.close()

# Read example file with BOM detection
fin = OpenRead("myfile2.txt")
L = fin.readline()
print(L)  # requires QtConsole to view; Cmd.exe is cp1252
fin.close()

# Read cp1252 with a few undefined chars without barfing
fin = OpenRead("badboy.txt")
L = fin.readline()
print(L)
fin.close()

# Check that the bad characters are still in the badboy codepage file
fin = open("badboy.txt", "rb")
fin.read(20)
fin.close()
Answered By: Bimo

Answer #7:

Depending on your platform, you can just opt for the Linux shell file command. This works for me since I use it in a script that exclusively runs on one of our Linux machines.

Obviously this isn’t an ideal solution or answer, but it can be modified to fit your needs. In my case I just need to determine whether a file is UTF-8 or not.

import subprocess

def is_utf8(filename):
    file_cmd = ['file', filename]
    p = subprocess.Popen(file_cmd, stdout=subprocess.PIPE)
    cmd_output = p.stdout.readlines()
    # The first output line looks like b"test.txt: UTF-8 Unicode text";
    # it is bytes, so decode it before splitting off the file type
    x = cmd_output[0].decode().split(": ")[1]
    return x.startswith('UTF-8')
Answered By: MikeD

Answer #8:

It is, in principle, impossible to determine the encoding of a text file in the general case. So no, there is no standard Python library that does it for you.

If you have more specific knowledge about the text file (e.g. that it is XML), there might be library functions.
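For instance, a minimal sketch for XML (the helper name and regex are illustrative, and it assumes the declaration is in an ASCII-compatible encoding):

import re

def xml_declared_encoding(path):
    # The XML declaration, if present, sits in the first bytes of the file
    # and names the encoding, e.g. <?xml version="1.0" encoding="iso-8859-1"?>
    with open(path, 'rb') as f:
        head = f.read(200)
    m = re.search(br'encoding=["\']([A-Za-z0-9._-]+)["\']', head)
    return m.group(1).decode('ascii') if m else None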

Answered By: Martin v. Löwis
