Convert a file to UTF-8 or any other encoding with Python
You are not a native English speaker, and your language uses non-Latin characters for some or all of the letters of its alphabet (Greek, Chinese, Japanese, etc.).
You've recently switched from Windows to Ubuntu and most of your text files are in a Windows encoding, or you've just downloaded a subtitle file in your preferred language for one of your movies (most subtitles are in a Windows encoding; I have never found one in UTF-8, but that doesn't mean there aren't any).
Then you open that file with, let's say, gedit, and all you see are funny characters. That happens because Ubuntu uses UTF-8 as its default for decoding and encoding, so there is a mismatch with your file's encoding.
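To see where the funny characters come from, here is a minimal sketch (assuming Python 3; the sample word is only an illustration). It encodes a Greek word as 'windows-1253' and then decodes the same bytes as UTF-8, which is roughly what happens when a UTF-8 editor opens such a file:

```python
# A Greek word, stored the way a Windows program would typically save it.
text = 'Ελληνικά'
raw = text.encode('windows-1253')

# Decoding those bytes as UTF-8 does not give the text back.
# errors='replace' substitutes U+FFFD for every byte sequence that
# is not valid UTF-8 instead of raising UnicodeDecodeError.
garbled = raw.decode('utf-8', errors='replace')

# Decoding with the encoding the bytes were written in recovers the text.
recovered = raw.decode('windows-1253')

print(repr(garbled))
print(recovered)
```

The bytes on disk are fine in both cases; only the encoding used to read them differs.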
So, how can I programmatically convert my file(s) to UTF-8 so that all applications in Ubuntu display them properly?
First you must know which encoding your file is in, or at least narrow it down to a few guesses. For example, I'm Greek, so most of my text files from Windows and the Greek subtitles I download are in either 'windows-1253' or 'iso-8859-7', two almost identical encodings. Yours will be different, but it's not difficult to find out which encoding it is or might be. For a full list of the encodings that Python can handle see the docs here.
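If you are not sure which of your guesses is right, one way to narrow them down is to check which ones can decode the file's bytes at all. The following is a rough sketch, assuming Python 3; `candidate_encodings` is a hypothetical helper name, not part of the utility in this post. Keep in mind that single-byte encodings such as 'windows-1253' and 'iso-8859-7' accept almost any byte, so a successful decode only means the encoding is possible, not that it is the right one.

```python
def candidate_encodings(raw, encodings):
    """Return the encodings from `encodings` that decode `raw` without error."""
    matches = []
    for enc in encodings:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
        matches.append(enc)
    return matches


# Sample bytes: a Greek word as a Windows program would store it.
sample = 'Καλημέρα'.encode('windows-1253')
print(candidate_encodings(sample, ('ascii', 'utf-8', 'windows-1253', 'iso-8859-7')))
```

The strict encodings (ASCII and UTF-8) are rejected, while both Greek encodings remain as candidates because they agree on the letters in this sample. Looking at the decoded text is still the final check.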
So let's start coding. We'll make a simple utility that first makes a backup of our original file and then converts it to UTF-8.
#!/usr/bin/env python
import os
import sys
import shutil


def convert_to_utf8(filename):
    # gather the encodings you think the file may be
    # encoded in inside a tuple
    encodings = ('windows-1253', 'iso-8859-7', 'macgreek')

    # try to read the file as raw bytes and exit if an IOError occurs
    try:
        with open(filename, 'rb') as source:
            raw = source.read()
    except IOError:
        sys.exit(1)

    # now start iterating over our encodings tuple and try to
    # decode the file
    for enc in encodings:
        try:
            # try to decode the bytes with the current encoding.
            # if it succeeds then it will reach break, so we
            # will be out of the loop (something we want on
            # success).
            # the data variable will hold our decoded text
            data = raw.decode(enc)
            break
        except UnicodeDecodeError:
            # if this encoding fails, continue with the next
            # encoding from the tuple, and so on until one succeeds.
            # if for some reason we reach the last encoding of
            # our tuple without success, then exit the program.
            if enc == encodings[-1]:
                sys.exit(1)
            continue

    # now get the absolute path of our file and append .bak
    # to the end of it (for our backup file)
    fpath = os.path.abspath(filename)
    newfilename = fpath + '.bak'
    # and make our backup file with shutil
    shutil.copy(filename, newfilename)

    # and at last convert it to utf-8, writing the bytes back
    f = open(filename, 'wb')
    try:
        f.write(data.encode('utf-8'))
    except Exception as e:
        print(e)
    finally:
        f.close()
Of course you can also encode it in an encoding other than UTF-8: just put the one you want in the f.write(data.encode('utf-8')) line. Also change the tuple elements according to your needs.
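As a sketch of that idea, the target encoding and the tuple of guesses can become parameters instead of being edited in the source. This assumes Python 3 and works on bytes rather than on a file; the name `transcode` and its arguments are illustrative, not part of the script above.

```python
def transcode(raw, source_encodings, target='utf-8'):
    """Decode `raw` with the first encoding that works, then re-encode it.

    Raises ValueError if none of `source_encodings` can decode the bytes.
    """
    for enc in source_encodings:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            # Not valid in this encoding; try the next guess.
            continue
        return text.encode(target)
    raise ValueError('none of the given encodings could decode the data')


greek = 'Αθήνα'.encode('iso-8859-7')
as_utf8 = transcode(greek, ('windows-1253', 'iso-8859-7'))
as_utf16 = transcode(greek, ('windows-1253', 'iso-8859-7'), target='utf-16')
print(as_utf8.decode('utf-8'))
```

Raising an exception instead of calling sys.exit() is a deliberate choice in this sketch, so that the caller can decide what to do with a file that cannot be decoded.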
A simple use of our function might look like the following. First add these lines at the end of our .py file and save it as, let's say, 'converter.py':
if __name__ == '__main__':
    # sys.argv[0] is the script name; the file to convert is sys.argv[1]
    convert_to_utf8(sys.argv[1])
Here we ask the user to pass the file to be converted as the first argument on the command line. For example:
python converter.py /home/user/my_movies/my_movie.sub
or, if it's in the same directory as our converter.py:
python converter.py my_movie.sub
And of course, if our file has spaces in its name, like 'Cape Fear.srt', then we put the name inside either ' ' or " ".
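To see why the quotes matter, the standard shlex module can split a command line the way a POSIX shell would, which shows how the arguments get separated. This is only an illustration, assuming Python 3:

```python
import shlex

# Without quotes the name is split at the space into two arguments.
unquoted = shlex.split("python converter.py Cape Fear.srt")

# With quotes the whole name arrives as one argument.
quoted = shlex.split("python converter.py 'Cape Fear.srt'")

print(unquoted)
print(quoted)
```

In the unquoted case our script would receive 'Cape' as the file name and fail to find it.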
And that's all. Later we can change our function so that it converts and backs up a whole directory instead of a single file. But that's a good beginning.
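One possible shape for that directory version is sketched below. It assumes Python 3, and every name in it (`convert_directory`, the `suffixes` filter, the default encodings) is a placeholder to adapt, not a final design. It does not descend into subdirectories.

```python
import os
import shutil


def convert_directory(dirname, encodings=('windows-1253', 'iso-8859-7'),
                      suffixes=('.srt', '.sub', '.txt')):
    """Back up and convert every matching file in `dirname` to UTF-8.

    Returns the list of file names that were converted.
    """
    converted = []
    for name in sorted(os.listdir(dirname)):
        path = os.path.join(dirname, name)
        # Skip directories and files whose suffix is not in our filter.
        if not os.path.isfile(path) or not name.lower().endswith(suffixes):
            continue
        with open(path, 'rb') as source:
            raw = source.read()
        for enc in encodings:
            try:
                text = raw.decode(enc)
            except UnicodeDecodeError:
                continue
            break
        else:
            # No encoding worked: leave this file untouched.
            continue
        # Keep a backup next to the original before overwriting it.
        shutil.copy(path, path + '.bak')
        with open(path, 'wb') as target:
            target.write(text.encode('utf-8'))
        converted.append(name)
    return converted
```

The .bak files are skipped on later runs only because of the suffix filter. Running the function a second time may mangle files that are already UTF-8, so a real version would want to check for that first.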
UPDATE [7/10/08]: If you just want to convert your files and you are not interested in the programming solution, I've just released cpConverter, a utility that converts files between 84 code pages. Info and download link here.