Heitor's log

Convert files to UTF-8

Whenever we work with lots of text files, we may have to deal with their encodings. Reading a file with the wrong encoding feeds garbled data to our code, and the results become unpredictable.

For example, saving the string Kalsarikännit as a UTF-8 text file and then reading it in Python as an ASCII file raises an error:

>>> with open('k', 'r', encoding='ascii') as afile:
...     afile.read()
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python3.8/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)
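
The failure is easy to reproduce end to end. The sketch below writes the string to a temporary file as UTF-8 and reads it back three ways. The temporary path and the choice of ISO-8859-1 as the second "wrong" encoding are illustrative assumptions, not part of the original session. The ASCII read fails loudly, while the ISO-8859-1 read succeeds and silently returns mangled text, which is arguably the more dangerous case.

```python
import os
import tempfile

text = "Kalsarikännit"

# Write the string as UTF-8 to a throwaway file (the path is arbitrary).
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# 1. Correct encoding: the text round-trips unchanged.
with open(path, encoding="utf-8") as f:
    as_utf8 = f.read()

# 2. ASCII: decoding fails on 0xc3, the first byte of the UTF-8 "ä".
try:
    with open(path, encoding="ascii") as f:
        f.read()
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# 3. ISO-8859-1: every byte is a valid character there, so no error is
#    raised, but the two bytes of "ä" come back as two separate characters.
with open(path, encoding="iso-8859-1") as f:
    as_latin1 = f.read()

os.remove(path)
```

After running this, as_utf8 equals the original string, ascii_failed is True, and as_latin1 is one character longer than the original because the "ä" was split in two.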

I ran into this problem last week, when I had to parse hundreds of CSV files. Most of them were fine, but some produced weird errors similar to the one shown above. It turned out those problematic files had been saved on a Windows machine, with ISO-8859-1 encoding.

One way to display a file's MIME type and encoding is the file command, which comes with most free operating systems (Linux, BSD, etc.):

$ file --mime *
bla.txt: text/plain; charset=us-ascii
ble.txt: text/plain; charset=iso-8859-1
k.csv:   text/plain; charset=utf-8

Or display only the encoding:

$ file --mime-encoding *
bla.txt: us-ascii
ble.txt: iso-8859-1
k.csv:   utf-8

Or only the MIME type:

$ file --mime-type *
bla.txt: text/plain
ble.txt: text/plain
k.csv:   text/plain

In these examples, we can see three plain-text files, each with a different encoding.

To convert a text file between encodings, we can use iconv. So, to convert from ISO-8859-1 to UTF-8:

$ iconv -f iso-8859-1 -t utf-8 infile -o outfile
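
For cases where the conversion has to happen inside a Python program, the same step can be sketched with the standard library alone. The function name convert_to_utf8 and its parameters are hypothetical, chosen for illustration; the source encoding still has to be known in advance, just as with iconv -f.

```python
from pathlib import Path


def convert_to_utf8(src, dst, src_encoding="iso-8859-1"):
    """Re-encode the text file `src` into `dst` as UTF-8.

    `src_encoding` must be supplied by the caller; nothing is detected here.
    """
    # Work on raw bytes so that line endings are passed through untouched.
    raw = Path(src).read_bytes()
    # Strict codecs (e.g. "ascii", "utf-8") raise UnicodeDecodeError on a bad
    # guess; ISO-8859-1 accepts any byte, so a wrong guess goes unnoticed.
    text = raw.decode(src_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

Calling convert_to_utf8("infile", "outfile") mirrors the iconv command above.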

To convert all files in the current directory from a different encoding to UTF-8:

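for f in *; do
	# Ask file for the encoding and strip the "name: " prefix
	enc=$(file --mime-encoding "$f" | sed -E 's/.*: //')
	if [ "$enc" != "utf-8" ]; then
		echo "Converting from $enc to UTF-8: $f"
		# Replace the original only if the conversion succeeded
		iconv -f "$enc" -t utf-8 "$f" -o "$f".utf && mv "$f".utf "$f"
	fi
done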
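
Where the file command is not available, a rough Python counterpart of the loop above might look like the sketch below. It is not equivalent: Python's standard library has no encoding detector, so this version swaps in a simple heuristic, assuming that any file which fails to decode as UTF-8 is in a single fallback encoding (ISO-8859-1 here, matching the files in my case). The function name convert_dir_to_utf8 and its parameters are hypothetical, and binary files are not recognised, so it should only be pointed at directories known to contain text.

```python
import os
from pathlib import Path


def convert_dir_to_utf8(directory=".", fallback="iso-8859-1"):
    """Rewrite every non-UTF-8 file directly inside `directory` as UTF-8.

    Files that already decode as UTF-8 (plain ASCII included) are left alone.
    Returns the names of the files that were converted.
    """
    converted = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        raw = path.read_bytes()
        try:
            raw.decode("utf-8")
            continue  # already valid UTF-8, nothing to do
        except UnicodeDecodeError:
            pass
        # Assumption: anything that is not UTF-8 is in the fallback encoding.
        text = raw.decode(fallback)
        # Write to a sibling file first, then swap it in, like the mv above.
        tmp = path.with_name(path.name + ".utf")
        tmp.write_bytes(text.encode("utf-8"))
        os.replace(tmp, path)
        converted.append(path.name)
    return converted
```

Writing to a sibling file and then replacing the original means an interrupted run leaves at worst a stray .utf file next to an untouched original.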

A small bash script combining sed, file and iconv is a versatile way to bring many text files to a uniform encoding. And it saves the day.