Convert files to UTF-8
Whenever we work with lots of text files, we may have to deal with their encodings. Reading a file with the wrong encoding feeds garbled data to our code, and the result becomes unpredictable.
For example, saving the string Kalsarikännit as a UTF-8 text file and then reading it in Python as an ASCII file gives an error:
>>> with open('k', 'r', encoding='ascii') as afile:
...     afile.read()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python3.8/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)
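The same failure can be reproduced without touching the disk, which makes it easier to see what is going on. The following is only a minimal sketch: it encodes the string as UTF-8 and then decodes the bytes with different codecs. The variable names are illustrative.

```python
text = "Kalsarikännit"
data = text.encode("utf-8")  # the bytes a UTF-8 text file would contain

# Strict ASCII decoding stops at the first non-ASCII byte,
# which is 0xc3, the first of the two bytes that encode "ä"
try:
    data.decode("ascii")
except UnicodeDecodeError as err:
    print(err.start, hex(data[err.start]))  # 8 0xc3

# A wrong 8-bit codec does not fail at all: it silently returns garbage
garbled = data.decode("iso-8859-1")
print(garbled)  # KalsarikÃ¤nnit

# The right codec gives the original string back
assert data.decode("utf-8") == text
```

The second case is the more dangerous one, since no exception tells us that anything went wrong.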
I ran into this problem last week, when I had to parse hundreds of CSV files. Most of them were fine, but some produced odd errors similar to the one shown above. It turned out that the problematic files had been saved on a Windows machine with ISO-8859-1 encoding.
One way to display the MIME type of a file and its encoding is the file command, which comes with most free operating systems (Linux, BSD, etc.):
$ file --mime *
bla.txt: text/plain; charset=us-ascii
ble.txt: text/plain; charset=iso-8859-1
k.csv: text/plain; charset=utf-8
Or display only the encoding:
$ file --mime-encoding *
bla.txt: us-ascii
ble.txt: iso-8859-1
k.csv: utf-8
Or only the MIME type:
$ file --mime-type *
bla.txt: text/plain
ble.txt: text/plain
k.csv: text/plain
In these examples, we can see 3 plain-text files, each one with a different encoding.
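When the file command is not available, a rough guess can be made from Python by trial decoding. This is a sketch under a strong assumption, not a reimplementation of what file does: guess_encoding is a hypothetical helper, and it assumes that anything that is neither ASCII nor valid UTF-8 is ISO-8859-1.

```python
def guess_encoding(data: bytes) -> str:
    """Return a rough guess of the encoding of `data` (hypothetical helper)."""
    # Try the strictest codec first: ASCII text is also valid UTF-8
    for candidate in ("ascii", "utf-8"):
        try:
            data.decode(candidate)
            return candidate
        except UnicodeDecodeError:
            continue
    # Assumption: everything else is ISO-8859-1. This codec accepts any
    # byte sequence, so it can only be a last resort, never a real check.
    return "iso-8859-1"


print(guess_encoding(b"plain text"))                          # ascii
print(guess_encoding("Kalsarikännit".encode("utf-8")))        # utf-8
print(guess_encoding("Kalsarikännit".encode("iso-8859-1")))   # iso-8859-1
```

The fallback is the weak point: a file in Windows-1252 or any other 8-bit encoding would also be reported as iso-8859-1.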
To convert a text file between encodings, we can use iconv. So, to convert from ISO-8859-1 to UTF-8:
$ iconv -f iso-8859-1 -t utf-8 infile -o outfile
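The same single-file conversion can be sketched in Python, which is handy when the files are being parsed there anyway. convert_to_utf8 is a hypothetical helper, and the file names below are throwaway examples.

```python
import os
import tempfile


def convert_to_utf8(infile: str, outfile: str, source_encoding: str) -> None:
    """Re-encode `infile` as UTF-8 and write the result to `outfile`."""
    # newline="" turns off newline translation, so line endings are kept as-is
    with open(infile, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with open(outfile, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Usage with a temporary ISO-8859-1 file
with tempfile.TemporaryDirectory() as tmp:
    infile = os.path.join(tmp, "infile")
    outfile = os.path.join(tmp, "outfile")
    with open(infile, "wb") as f:
        f.write("Kalsarikännit\n".encode("iso-8859-1"))
    convert_to_utf8(infile, outfile, "iso-8859-1")
    with open(outfile, "rb") as f:
        converted = f.read()

print(converted == "Kalsarikännit\n".encode("utf-8"))  # True
```

This reads the whole file into memory, which is fine for typical CSV files but would need chunked reads for very large ones.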
To convert all files in the current directory from a different encoding to UTF-8:
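for f in *
do
    # Skip directories and other non-regular files
    [ -f "$f" ] || continue
    enc=$(file --mime-encoding "$f" | sed -E 's/.*: //')
    if [ "$enc" != "utf-8" ]
    then
        echo "Converting from $enc to UTF-8: $f"
        # Replace the original only if the conversion succeeded
        iconv -f "$enc" -t utf-8 "$f" -o "$f".utf && mv "$f".utf "$f"
    fi
done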
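For comparison, here is a rough Python sketch of the same batch conversion. It does not call file or iconv. Instead it assumes that every file which is not valid UTF-8 is ISO-8859-1, as was the case with my CSV files. Both function names are hypothetical.

```python
import os
import shutil
import tempfile
from pathlib import Path


def to_utf8_in_place(path: Path, fallback: str = "iso-8859-1") -> bool:
    """Rewrite `path` as UTF-8; return True if the file was changed."""
    data = path.read_bytes()
    try:
        data.decode("utf-8")  # plain ASCII passes this check too
        return False          # already valid UTF-8, nothing to do
    except UnicodeDecodeError:
        text = data.decode(fallback)  # assumption: the rest uses `fallback`
    # Write to a temporary file first and swap it in afterwards, so a
    # failure never leaves a half-written original behind
    fd, tmp = tempfile.mkstemp(dir=str(path.parent))
    with os.fdopen(fd, "wb") as out:
        out.write(text.encode("utf-8"))
    shutil.copymode(path, tmp)  # keep the original permission bits
    os.replace(tmp, path)
    return True


def convert_directory(directory: str) -> list:
    """Convert every regular file in `directory`; return the names changed."""
    changed = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and to_utf8_in_place(path):
            changed.append(path.name)
    return changed


# Usage on a temporary directory that mimics the three files shown earlier
with tempfile.TemporaryDirectory() as d:
    Path(d, "bla.txt").write_bytes(b"plain ascii")
    Path(d, "ble.txt").write_bytes("Kalsarikännit".encode("iso-8859-1"))
    Path(d, "k.csv").write_bytes("Kalsarikännit".encode("utf-8"))
    changed = convert_directory(d)
    ble_after = Path(d, "ble.txt").read_bytes()

print(changed)  # ['ble.txt']
```

Only ble.txt is rewritten: ASCII is a subset of UTF-8, so bla.txt needs no conversion.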
A small bash script with sed, file and iconv is a versatile way to standardize the encoding of several text files. And it saves the day.