Convert a file to UTF-8 or any other encoding with Python

Problem:
Your native language isn’t English and it uses non-Latin characters for some or all of the letters of its alphabet (like Greek, Chinese, Japanese, etc.).
You’ve recently switched from Windows to Ubuntu and most of your text files are in a Windows encoding, or you’ve just downloaded a subtitle file in your preferred language for one of your movies (most subtitles are in a Windows encoding. I’ve never found one in UTF-8, but that doesn’t mean there aren’t any).
You open that file with, let’s say, gedit and all you see are funny characters. That happens because Ubuntu uses UTF-8 as the default for decoding and encoding, so there is a mismatch with your file’s encoding.
So, how can I programmatically convert my file(s) to UTF-8 so that all applications in Ubuntu display them properly?
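
To make the mismatch concrete, here is a minimal sketch (assuming Python 3; the sample word and the helper name decodes_as_utf8 are only illustrations, not part of the utility below). It takes a Greek word stored as windows-1253 bytes and checks whether those bytes are valid UTF-8:

```python
# A Greek word as it would be stored by a Windows program.
raw = 'Καλημέρα'.encode('windows-1253')

def decodes_as_utf8(data):
    """Return True if data is valid UTF-8, False otherwise."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

# The windows-1253 bytes are not valid UTF-8 ...
print(decodes_as_utf8(raw))
# ... while the same word encoded as UTF-8 is.
print(decodes_as_utf8('Καλημέρα'.encode('utf-8')))

# Forcing the decode anyway replaces every undecodable byte with
# U+FFFD, which is roughly the "funny characters" an editor shows.
forced = raw.decode('utf-8', errors='replace')
```

An editor that assumes UTF-8 is in the same position as the forced decode above: it has to show something for bytes that don’t make sense in UTF-8.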

Solution:
First you must know which encoding your file is in, or at least narrow it down to a few candidates. For example, I’m Greek, so most of my text files from Windows and the Greek subtitles I download are in either 'windows-1253' or 'iso-8859-7', two almost identical encodings. Yours will be different, but it’s not difficult to find out which encoding it is or might be. For a full list of the encodings that Python can handle, see the docs here.
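
How close are two "almost identical" encodings? The following sketch (assuming Python 3; the sample word is arbitrary) encodes the same Greek word with both of them and compares the bytes. It also shows that decoding with the wrong one of the two can succeed without any error and still give the wrong text, which is why the order of the candidate encodings is itself a guess:

```python
text = 'Άλφα'
as_cp1253 = text.encode('windows-1253')
as_iso = text.encode('iso-8859-7')

# The plain lowercase letters have the same byte values in both ...
same_tail = as_cp1253[1:] == as_iso[1:]
# ... but the accented capital letter does not.
same_head = as_cp1253[:1] == as_iso[:1]
print(same_tail, same_head)

# Decoding windows-1253 bytes as iso-8859-7 raises no error here,
# yet the first character comes out wrong.
wrong = as_cp1253.decode('iso-8859-7')
print(wrong == text)
```

So a decode that doesn’t fail is not proof that the encoding was the right one. Put the encoding you consider most likely first.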

So let’s start coding… We’ll make a simple utility that first makes a backup of our original file and then converts it to UTF-8.

#!/usr/bin/env python

import os
import sys
import shutil

def convert_to_utf8(filename):
    # gather the encodings you think the file may be
    # encoded in, inside a tuple
    encodings = ('windows-1253', 'iso-8859-7', 'macgreek')

    # try to read the file as raw bytes and exit if some
    # IOError occurs
    try:
        with open(filename, 'rb') as f:
            raw = f.read()
    except IOError as e:
        sys.exit(str(e))

    # now start iterating over our encodings tuple and try to
    # decode the file
    for enc in encodings:
        try:
            # try to decode the bytes with the current encoding
            # from the tuple.
            # if it succeeds then it will reach break, so we
            # will be out of the loop (something we want on
            # success).
            # the data variable will hold our decoded text
            data = raw.decode(enc)
            break
        except UnicodeDecodeError:
            # if this encoding fails, continue with the next
            # encoding from the tuple and so on... until one
            # succeeds.
            continue
    else:
        # the else of a for loop runs only when the loop ends
        # without a break, i.e. no encoding succeeded, so
        # exit the program.
        sys.exit('could not decode %s' % filename)

    # now get the absolute path of our filename and append .bak
    # to the end of it (for our backup file)
    fpath = os.path.abspath(filename)
    newfilename = fpath + '.bak'
    # and make our backup file with shutil
    shutil.copy(filename, newfilename)

    # and at last convert it to utf-8, writing the bytes back
    f = open(filename, 'wb')
    try:
        f.write(data.encode('utf-8'))
    except Exception as e:
        print(e)
    finally:
        f.close()

Of course you can also encode it in an encoding other than UTF-8: just put the one you want in the f.write(data.encode('utf-8')) line. Also change the tuple elements according to your needs.
A simple use of our function might look like the following. But first add these lines at the end of our .py file and save it as, let’s say, 'converter.py':
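
If you expect to change the target encoding often, one option is to pass it as a parameter. This is only a sketch, assuming Python 3. The name reencode and its signature are made up for this example and are not part of the utility above. It works on bytes only, so it touches no files:

```python
def reencode(raw, encodings, target='utf-8'):
    """Decode raw bytes with the first candidate encoding that works
    and return (bytes in the target encoding, encoding that worked)."""
    for enc in encodings:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            # this candidate failed, try the next one
            continue
        return text.encode(target), enc
    raise ValueError('none of the candidate encodings could decode the data')


greek = 'Καλημέρα'.encode('windows-1253')
candidates = ('windows-1253', 'iso-8859-7', 'macgreek')

as_utf8, used = reencode(greek, candidates)
as_utf16, _ = reencode(greek, candidates, target='utf-16')
print(used)
```

Raising an exception instead of calling sys.exit() leaves the decision about what to do on failure to the caller.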

if __name__ == '__main__':
    convert_to_utf8(sys.argv[1])

Here we ask the user to pass the file to be converted as the first argument on the command line. For example:

python converter.py /home/user/my_movies/my_movie.sub

or, if it’s in the same directory as our converter.py:

python converter.py my_movie.sub

And of course, if our file has spaces in its name, like 'Cape Fear.srt', then we put the name inside either single quotes ' ' or double quotes " ".

And that’s all. Later we can change our function so that it converts and backs up a whole directory and not just a single file. But that’s a good beginning.
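
As a rough idea of how the directory version could look, here is a sketch, assuming Python 3. The name convert_directory, its parameters and the default extensions are all hypothetical. It walks a directory tree and calls a conversion function, passed in as an argument, on every file with a matching extension:

```python
import os

def convert_directory(dirname, convert, extensions=('.srt', '.sub', '.txt')):
    """Call convert(path) for every file under dirname whose name ends
    with one of the given extensions. Returns the list of paths handled."""
    handled = []
    for root, _dirs, files in os.walk(dirname):
        for name in files:
            # compare in lower case so 'MOVIE.SRT' matches too
            if name.lower().endswith(extensions):
                path = os.path.join(root, name)
                convert(path)
                handled.append(path)
    return handled
```

With the function from this post it could be called as convert_directory('/home/user/my_movies', convert_to_utf8). Two things to keep in mind: convert_to_utf8 calls sys.exit() when it fails, which would stop the whole run at the first bad file, and running it a second time on files that are already UTF-8 will most likely garble them, since their bytes usually decode without error as windows-1253 too.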

Cheers…

UPDATE [7/10/08]: If you just want to convert your files and you are not interested in the programming solution, I’ve just released cpConverter, a utility that converts files between 84 code pages. Info and download link here.

12 comments so far

  1. Jeroen Ruigrok van der Werven on

    Or you can try and see if the Universal Encoding Detector can guess the encoding of the file, which you can subsequently use in the conversion.

    http://chardet.feedparser.org/

  2. gomputor on

    Hey Jeroen, nice library, didn’t know it existed. :)
    But I’ve got some problems with some Greek files I loaded for testing. The files are in the windows-1253 code page and the detector, with 'confidence': 0.95822640519414115, detects them as 'ISO-8859-7'.
    That’s not a very big deal, both code pages are almost identical, but if you decode a file that is actually 'windows-1253' with 'ISO-8859-7' you’ll be missing some letters, like capital alpha with tonos (Ά).
    In its place you see only a space ' '. That’s a well-known problem for us Greeks.
    But again, great library. :)

  3. dimitris on

    I downloaded some txt files, and it seems that although the editor opens them as 8859-7, they are handled as plain ASCII…
    I don’t really get it… No Greek, just ASCII.

  4. gomputor on

    @dimitris
    1. Which editor are you referring to?
    2. It doesn’t matter which encoding you open a file with. What matters is whether the file you want to open is encoded for the language you want.
    If the file you opened isn’t encoded with a Greek encoding (i.e. windows-1253, iso-8859-7, etc.) then you won’t see any Greek, no matter which encoding you open it with.
    So, most probably the files you are talking about aren’t encoded in Greek. But then again, I can only guess, since I don’t have the files you are talking about.

  5. depika on

    How can I run this code from a web server that runs on Linux and has Python installed?

  6. Joshua on

    Hi!

    Good blog.

    I have a problem. I need to write a txt file with special characters in Python. I wrote the following script:

    outp = open("text.txt", "w")
    outp.write("á")

    The console says: "Non-ASCII character '\xf1' in file"

    Then I tried the command:

    specialcharacter = unicode("año", "utf-8")

    And now the error message is: "unicode decode error: 'utf8' codec can't decode byte 0xf1"

    Can you help me please?

    Regards.

    Joshua.

    PS: Sorry for my English.

  7. S.Selvam on

    Have you set the encoding of the Python file? If not, you need to add the following as the first line of your Python code:

    # coding: utf-8

    Note: I tried your code in my Python terminal and it works fine.

  8. tuxey on

    As an alternative: iconv (from command line) can also be used to convert between many encodings.

  10. Yorgos on

    Friend, I read this post today and I have the same problem in the year 2013!!
    I’ll try to sort it out based on your approach.
    Thanks :)

  12. Umar on

    Hi, I am new to Python, so I just used the above as it is, and I end up with the error "NameError: name 'encodings' is not defined". Can you please assist me in getting this fixed?

