Whenever we work with lots of text files, we may have to deal with their encoding. Reading a file with the wrong encoding feeds garbled data to our code, and the results become unpredictable.
For example, saving the string Kalsarikännit as a UTF-8 text file and then reading it in Python as an ASCII file gives an error:
>>> with open('k', 'r', encoding='ascii') as afile:
...     afile.read()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python3.8/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)
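The bytes themselves are fine; they just need the right codec. A minimal sketch, re-creating the bytes in memory instead of reading the file:

```python
# Encode the example string to UTF-8 bytes, then decode with both codecs.
data = 'Kalsarikännit'.encode('utf-8')

try:
    data.decode('ascii')
except UnicodeDecodeError as exc:
    # 0xc3 at position 8 is the first byte of the two-byte UTF-8 sequence for 'ä'
    print(exc)

print(data.decode('utf-8'))  # Kalsarikännit
```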
I ran into this problem last week, when I had to parse hundreds of CSV files. Most of them were fine, but in some files I hit weird problems, similar to the one shown above. It turned out those problematic files had been saved from a Windows machine with a different encoding. We can check the encoding with the file command:
$ file --mime *
bla.txt: text/plain; charset=us-ascii
ble.txt: text/plain; charset=iso-8859-1
k.csv:   text/plain; charset=utf-8
Or display only the encoding:
$ file --mime-encoding *
bla.txt: us-ascii
ble.txt: iso-8859-1
k.csv:   utf-8
Or only the mime-type:
$ file --mime-type *
bla.txt: text/plain
ble.txt: text/plain
k.csv:   text/plain
In these examples, we can see 3 plain-text files, each one with a different encoding.
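If file is not available, one rough way to guess an encoding in Python is to try a list of candidate codecs and keep the first that decodes cleanly. This is just a heuristic sketch, not what file actually does:

```python
def guess_encoding(raw: bytes, candidates=('ascii', 'utf-8', 'iso-8859-1')):
    """Return the first candidate codec that decodes `raw` without error.

    Note: ISO-8859-1 accepts any byte sequence, so it acts as a catch-all
    fallback here; this cannot distinguish it from other 8-bit encodings.
    """
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    return None

print(guess_encoding('Kalsarikännit'.encode('utf-8')))  # utf-8
```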
To convert a text file between encodings, we can use iconv. So, to convert from ISO-8859-1 to UTF-8:
$ iconv -f iso-8859-1 -t utf-8 infile -o outfile
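For completeness, a rough Python equivalent of that iconv call. The names infile and outfile are just the placeholders from the command above, so this sketch creates a sample infile first:

```python
# Set up a sample ISO-8859-1 file to stand in for 'infile'.
with open('infile', 'wb') as f:
    f.write('Kalsarikännit'.encode('iso-8859-1'))

# Decode with the source codec, re-encode as UTF-8.
with open('infile', encoding='iso-8859-1') as src:
    text = src.read()

with open('outfile', 'w', encoding='utf-8') as dst:
    dst.write(text)
```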
To convert all files in the current directory to UTF-8, we can loop over them:
for f in *
do
    enc=$(file --mime-encoding "$f" | sed -E 's/.*: //g')
    if [ "$enc" != "utf-8" ]
    then
        echo "Converting from $enc to UTF-8: $f"
        iconv -f "$enc" -t utf-8 "$f" -o "$f".utf
        mv "$f".utf "$f"
    fi
done
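The same idea can be ported to Python. This is a self-contained sketch: instead of shelling out to file, it guesses among an assumed candidate list, and it demos on a throwaway directory rather than the current one:

```python
import tempfile
from pathlib import Path

CANDIDATES = ('utf-8', 'iso-8859-1')  # assumed source encodings, tried in order

def to_utf8(path: Path) -> None:
    """Rewrite `path` in place as UTF-8, trying CANDIDATES until one decodes."""
    raw = path.read_bytes()
    for enc in CANDIDATES:
        try:
            text = raw.decode(enc)
            break
        except UnicodeDecodeError:
            continue  # try the next candidate
    path.write_text(text, encoding='utf-8')

# Demo on a throwaway directory holding one ISO-8859-1 file.
workdir = Path(tempfile.mkdtemp())
sample = workdir / 'bla.txt'
sample.write_bytes('Kalsarikännit'.encode('iso-8859-1'))

for f in workdir.iterdir():
    to_utf8(f)

print(sample.read_text(encoding='utf-8'))  # Kalsarikännit
```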
A small bash script with iconv is a versatile way to normalize the encoding of several text files. And it saves the day.