Heitor's log

Whenever we work with many text files, we may have to deal with their encodings. Reading a file with the wrong encoding feeds garbled data to our code, and the result becomes unpredictable.

For example, saving the string Kalsarikännit as a UTF-8 text file and then reading it in Python as an ASCII file raises an error:

>>> with open('k', 'r', encoding='ascii') as afile:
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python3.8/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)
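An error is actually the friendly outcome here. A minimal sketch of the three cases, assuming a throwaway temporary file instead of the file k used above (the variable names are mine): ASCII fails loudly, ISO-8859-1 accepts every byte and silently returns the wrong text, and UTF-8 gives the string back.

```python
import os
import tempfile

text = "Kalsarikännit"

# Save the string as UTF-8 in a throwaway file (stand-in for the file 'k').
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write(text)
    path = tmp.name

try:
    # Wrong codec, strict: ASCII has no byte 0xc3, so decoding fails.
    try:
        with open(path, "r", encoding="ascii") as afile:
            afile.read()
        ascii_error = None
    except UnicodeDecodeError as exc:
        ascii_error = exc

    # Wrong codec, permissive: every byte is valid ISO-8859-1,
    # so there is no error, only wrong characters.
    with open(path, "r", encoding="iso-8859-1") as lfile:
        mojibake = lfile.read()

    # Right codec: the original string comes back.
    with open(path, "r", encoding="utf-8") as ufile:
        roundtrip = ufile.read()
finally:
    os.remove(path)

print(ascii_error)
print(mojibake)   # KalsarikÃ¤nnit
print(roundtrip)  # Kalsarikännit
```

The middle case is the dangerous one, because nothing tells us the data is wrong.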


I ran into this problem last week, when I had to parse hundreds of CSV files. Most of them were fine, but some files caused weird problems similar to the one shown above. It turned out the problematic files had been saved on a Windows machine with ISO-8859-1 encoding.

One way to display the mime-type of a file and its encoding is the file command, which comes with most free operating systems (Linux, BSD, etc.):

$ file --mime *
bla.txt: text/plain; charset=us-ascii
ble.txt: text/plain; charset=iso-8859-1
k.csv:   text/plain; charset=utf-8

Or display only the encoding:

$ file --mime-encoding *
bla.txt: us-ascii
ble.txt: iso-8859-1
k.csv:   utf-8
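To get a feeling for what such a check involves, here is a rough Python sketch. It is not how file works internally; guess_encoding is a hypothetical helper of mine, and it can only tell ASCII and valid UTF-8 apart from everything else, because any byte sequence is valid ISO-8859-1.

```python
def guess_encoding(data: bytes) -> str:
    """Very rough stand-in for `file --mime-encoding` (hypothetical helper)."""
    try:
        data.decode("ascii")
        return "us-ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Could be ISO-8859-1, Windows-1252, ...: bytes alone cannot say.
        return "unknown-8bit"


samples = {
    "bla.txt": b"bla",
    "ble.txt": "Kalsarikännit".encode("iso-8859-1"),
    "k.csv": "Kalsarikännit".encode("utf-8"),
}
for name, data in samples.items():
    print(f"{name}: {guess_encoding(data)}")
```

The label unknown-8bit is just the name I picked for the "neither ASCII nor UTF-8" case; telling the legacy 8-bit encodings apart needs heuristics like the ones file applies.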


The file command can also display only the mime-type:

$ file --mime-type *
bla.txt: text/plain
ble.txt: text/plain
k.csv:   text/plain

In these examples, we can see three plain-text files, each with a different encoding.

To convert a text file from one encoding to another, we can use iconv. So, to convert from ISO-8859-1 to UTF-8:

$ iconv -f iso-8859-1 -t utf-8 infile -o outfile
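If iconv is not at hand, the same conversion can be sketched in Python: decode the bytes with the source encoding, then encode them with the target one. The function convert_encoding and its parameter names are my own, and the demo uses a temporary directory rather than real files.

```python
import tempfile
from pathlib import Path


def convert_encoding(infile, outfile, from_enc="iso-8859-1", to_enc="utf-8"):
    """Rewrite infile as outfile in another encoding (hypothetical helper)."""
    # Work on bytes so that line endings are left untouched.
    text = Path(infile).read_bytes().decode(from_enc)
    Path(outfile).write_bytes(text.encode(to_enc))


with tempfile.TemporaryDirectory() as tmpdir:
    infile = Path(tmpdir) / "infile"
    outfile = Path(tmpdir) / "outfile"
    infile.write_bytes("Kalsarikännit".encode("iso-8859-1"))

    convert_encoding(infile, outfile)
    result = outfile.read_bytes()

print(result)  # b'Kalsarik\xc3\xa4nnit'
```

As with iconv, the source encoding has to be known: decoding with the wrong from_enc produces a valid but wrong UTF-8 file.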


To convert all files in the current directory from a different encoding to UTF-8:

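for f in *
do
  # Ask file for the encoding and strip the "name: " prefix
  enc=$(file --mime-encoding "$f" | sed -E 's/.*: //g')
  if [ "$enc" != "utf-8" ]
  then
    echo "Converting from $enc to UTF-8: $f"
    # Replace the original only if iconv succeeded
    iconv -f "$enc" -t utf-8 "$f" -o "$f".utf && mv "$f".utf "$f"
  fi
done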
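For comparison, a Python sketch of the same loop. It does not detect the encoding the way file does: it assumes that anything which is not valid UTF-8 is in a single fallback encoding (ISO-8859-1 here, as in my case), and that the directory holds only text files. The name convert_dir_to_utf8 is hypothetical.

```python
import tempfile
from pathlib import Path


def convert_dir_to_utf8(directory, fallback="iso-8859-1"):
    """Rewrite every non-UTF-8 file in directory as UTF-8 (hypothetical helper).

    Assumes all non-UTF-8 files use the `fallback` encoding.
    """
    converted = []
    # sorted() takes a snapshot, so the temporary .utf files are not visited.
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        data = path.read_bytes()
        try:
            data.decode("utf-8")
            continue  # already valid UTF-8 (plain ASCII included)
        except UnicodeDecodeError:
            pass
        tmp = path.with_name(path.name + ".utf")
        tmp.write_bytes(data.decode(fallback).encode("utf-8"))
        tmp.replace(path)  # same role as mv "$f".utf "$f"
        converted.append(path.name)
    return converted


with tempfile.TemporaryDirectory() as tmpdir:
    root = Path(tmpdir)
    (root / "bla.txt").write_bytes(b"plain ascii")
    (root / "ble.txt").write_bytes("Kalsarikännit".encode("iso-8859-1"))
    (root / "k.csv").write_bytes("Kalsarikännit".encode("utf-8"))

    converted = convert_dir_to_utf8(root)
    ble_after = (root / "ble.txt").read_text(encoding="utf-8")

print(converted)  # ['ble.txt']
```

One difference from the shell version: ASCII files are skipped instead of converted, since ASCII is already valid UTF-8.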


A small bash script combining sed, file and iconv is a versatile way to normalize the encoding of many text files, and it can save the day.