Archive for September, 2008|Monthly archive page

Change the extension of multiple files in a chosen directory and its sub-directories with Python

To change the extension of one or more files, we just have to use the os module and some of its useful functions.
The following function changes the extension of all the files in the directory that you specify (and in its sub-directories if you want):

import os

def change_file_ext(cur_dir, old_ext, new_ext, sub_dirs=False):
    if sub_dirs:
        for root, dirs, files in os.walk(cur_dir):
            for filename in files:
                base, file_ext = os.path.splitext(filename)
                if old_ext == file_ext:
                    oldname = os.path.join(root, filename)
                    # swap only the trailing extension, not every
                    # occurrence of old_ext in the whole path
                    newname = os.path.join(root, base + new_ext)
                    os.rename(oldname, newname)
    else:
        for filename in os.listdir(cur_dir):
            base, file_ext = os.path.splitext(filename)
            if old_ext == file_ext:
                # join with cur_dir so the rename works no matter
                # what the current working directory is
                oldname = os.path.join(cur_dir, filename)
                newname = os.path.join(cur_dir, base + new_ext)
                os.rename(oldname, newname)

The first argument, cur_dir, is the directory that your files are in. The second one, old_ext, is the extension that you want to change. The third argument, new_ext, is the new extension that you want your files to have, and the last argument, sub_dirs, says whether the changes should also be applied to the sub-directories of your chosen directory. Extensions are given with the leading dot, the way os.path.splitext() returns them. sub_dirs defaults to False (so it is optional), which means no changes take place in the sub-dirs. You can of course change the default to True if you want, but I prefer it this way: if I ever forget about it and all I want is to change files in the current directory only, a default of True would also apply the changes to the sub-dirs, and that’s something I may not want to happen.
Let’s look at two examples of its use:

# change all .txt files to .html only in this directory
change_file_ext('/home/user/my_files', '.txt', '.html')

# change all .txt files to .html also in sub-directories
change_file_ext('/home/user/my_files', '.txt', '.html', True)
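
Since a rename is hard to undo, it can help to look at what would be renamed before touching anything. The following is only a sketch of that idea: plan_renames is a hypothetical helper (it is not part of the function above) that takes the same arguments and returns the (old, new) path pairs without calling os.rename. It uses print() with a single argument, so it should run on Python 3 as well.

```python
import os

def plan_renames(cur_dir, old_ext, new_ext, sub_dirs=False):
    """Return (old_path, new_path) pairs without renaming anything.

    Hypothetical helper for previewing; same arguments as change_file_ext.
    """
    pairs = []
    for root, dirs, files in os.walk(cur_dir):
        for filename in sorted(files):
            base, ext = os.path.splitext(filename)
            if ext == old_ext:
                pairs.append((os.path.join(root, filename),
                              os.path.join(root, base + new_ext)))
        if not sub_dirs:
            # os.walk yields the top directory first, so stop after it
            break
    return pairs

if __name__ == '__main__':
    # the path is the example path used in this post
    for old, new in plan_renames('/home/user/my_files', '.txt', '.html'):
        print(old + ' -> ' + new)
```

If the list looks right, you can then run the real function with the same arguments.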

Now, if we want to change multiple files with different extensions to a single extension all at once, i.e. change some ‘.txt’, ‘.htm’ and ‘.info’ files to ‘.html’, then we can use the following slight modification of our function:

import os

def change_multi_file_ext(cur_dir, extensions, new_ext, sub_dirs=False):
    if sub_dirs:
        for root, dirs, files in os.walk(cur_dir):
            for filename in files:
                base, file_ext = os.path.splitext(filename)
                # extensions is a tuple or list, so 'in' checks them all
                if file_ext in extensions:
                    oldname = os.path.join(root, filename)
                    # swap only the trailing extension
                    newname = os.path.join(root, base + new_ext)
                    os.rename(oldname, newname)
    else:
        for filename in os.listdir(cur_dir):
            base, file_ext = os.path.splitext(filename)
            if file_ext in extensions:
                # join with cur_dir so the rename works no matter
                # what the current working directory is
                oldname = os.path.join(cur_dir, filename)
                newname = os.path.join(cur_dir, base + new_ext)
                os.rename(oldname, newname)

Here all the arguments are the same except the second one. The second argument is a tuple (or a list) that holds the extensions we want to change, so we must define it somewhere in our code before we call the function.
For example:

# define our extensions that we want to change
extensions = ('.txt', '.htm', '.info')
# change all .txt, .htm, .info files to .html
# only in this directory
change_multi_file_ext('/home/user/my_files', extensions, '.html')

# change all .txt, .htm, .info files to .html
# also in sub-directories
change_multi_file_ext('/home/user/my_files', extensions, '.html', True)

Both functions are pretty simple to use and easy to fit into your own code.
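
For readers on Python 3, the same idea can also be sketched with the pathlib module. This is an alternative, not the approach of this post: change_exts is a hypothetical name, and the sketch assumes Python 3.4 or newer. Note that a rename may silently overwrite an existing file with the target name on some systems, so treat it as a starting point.

```python
from pathlib import Path

def change_exts(cur_dir, extensions, new_ext, sub_dirs=False):
    """Rename files whose suffix is in `extensions`; return the new paths.

    Hypothetical pathlib-based variant of change_multi_file_ext.
    """
    wanted = set(extensions)
    top = Path(cur_dir)
    candidates = top.rglob('*') if sub_dirs else top.iterdir()
    renamed = []
    # sorted() reads the whole listing first, so the renames below
    # cannot disturb the directory iteration
    for path in sorted(candidates):
        if path.is_file() and path.suffix in wanted:
            target = path.with_suffix(new_ext)
            path.rename(target)
            renamed.append(target)
    return renamed
```

The is_file() check means sub-directories are never renamed, even if their name happens to end in one of the extensions.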

IMPORTANT NOTE: You can change a file’s extension to nothing if you want, i.e. change ‘my_file.txt’ to just ‘my_file’ (remove the extension), but not the opposite, at least not with the functions I provided. An empty old_ext matches every name that has no extension, and in the non-recursive case that includes sub-directories, so a lot of files and sub-dirs in your selected dir can end up renamed in ways you never intended and may look lost. So never, never, never even think of trying it. Be warned.
You can do this if you want:

# remove the extension from the .txt files
change_file_ext('/home/user/my_files', '.txt', '')

BUT NEVER NEVER NEVER DO THAT:

# most probably you will lose
# both files and sub-dirs with this
change_file_ext('/home/user/my_files', '', '.html')

The functions I presented here apply only to files THAT HAVE an extension; don’t ever forget it.

Hope you found this post useful.
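
If you really do need to add an extension to files that have none, a separate helper that looks at each entry first is a safer route than passing an empty old_ext. The sketch below is only an illustration under that assumption: add_ext_to_bare_files is a hypothetical name, it works on one directory only, and it skips sub-directories and hidden dot-files.

```python
import os

def add_ext_to_bare_files(cur_dir, new_ext):
    """Append new_ext to regular files in cur_dir that have no extension.

    Hypothetical helper; returns the new file names.
    """
    renamed = []
    for filename in sorted(os.listdir(cur_dir)):
        path = os.path.join(cur_dir, filename)
        # skip sub-directories, hidden files and names with an extension
        if not os.path.isfile(path):
            continue
        if filename.startswith('.') or os.path.splitext(filename)[1]:
            continue
        os.rename(path, path + new_ext)
        renamed.append(filename + new_ext)
    return renamed
```

The os.path.isfile() check is what keeps the sub-directories out of harm’s way.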
Cheers!

Search and Replace multiple words or characters with Python

A frequently asked question is how to replace all occurrences of a word or a character inside a string or a file.
If you just want to replace a single character or word, then all you have to do is use the replace() method that Python strings provide for that purpose. The replace() method takes 2 arguments and an optional third one. Its definition is:

replace(old, new[, count])

As the documentation says, it returns a copy of the string with all occurrences of the substring ‘old’ replaced by ‘new’. If the optional argument ‘count’ is given, only the first count occurrences are replaced. If you leave that optional argument out, all occurrences are replaced.
An example of this method might be:

my_text = 'Hello everyone. Say "Hello" to me!'
my_text = my_text.replace('Hello', 'Goodbye')
print my_text
#prints 'Goodbye everyone. Say "Goodbye" to me!'

or:

my_text = 'Hello everyone. Say "Hello" to me!'
print my_text.replace('Hello', 'Goodbye')

or:

print 'Hello everyone. Say "Hello" to me!'.replace('Hello', 'Goodbye')

It’s pretty much the same. Use what you think is best for the situation or your style.
It’s a very simple and straightforward method that you will not find difficult to include in your code for simple replacements.
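
The optional count argument mentioned above can be seen in a short sketch. It uses the same sample sentence; print() is written with parentheses so that it also runs on Python 3.

```python
my_text = 'Hello everyone. Say "Hello" to me!'

# count=1: only the first occurrence is replaced
first_only = my_text.replace('Hello', 'Goodbye', 1)
print(first_only)  # Goodbye everyone. Say "Hello" to me!

# replace() returns a copy, so the original string is untouched
print(my_text)     # Hello everyone. Say "Hello" to me!
```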

But what if we want to replace multiple characters or words inside a string or file?
My implementation is a simple one. With it you can replace all occurrences of a single character or word, like the Python replace() method mentioned above, but also multiple characters and words inside a string or a whole file.
Let’s see it:

def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text

Our method, replace_all(), takes 2 arguments. The first one, text, is the string (or the text of a file) in which the replacement will take place. The second one, dic, is a dictionary with the word or character(s) to be replaced as the key, and the replacement word or character(s) as the value of that key. This dictionary can have just one key:value pair if you want to replace just one word or character, or multiple key:value pairs if you want to replace multiple words or characters at once.
A sample dictionary looks like this:

reps = {'a':'@', 'e':'3', 's':'5'}

With this dictionary we define that we want to replace ‘a’ with ‘@’, ‘e’ with ‘3’ and ‘s’ with ‘5’. Of course you’ll make your own dictionary with your custom key:values.

So, let’s make a working example and see if it works so far, before we see how our method works.

# define our method
def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text

# the text in which the replacement will take place
my_text = 'Hello everybody.'

# our dictionary with our key:values.
# we want to replace 'H' with '|-|'
# 'e' with '3' and 'o' with '0'
reps = {'H':'|-|', 'e':'3', 'o':'0'}

# bind the returned text of the method
# to a variable and print it
txt = replace_all(my_text, reps)
print txt    # it prints '|-|3ll0 3v3ryb0dy.'

# of course we can print the result
# at once with:
# print replace_all(my_text, reps)

Save it and run it from the console and see what you got.
Pretty simple so far, isn’t it? So let’s get inside our replace_all() method and see how it works.
First we start iterating in our dictionary using the iteritems() method that Python provides for a dictionary:

for i, j in dic.iteritems():

With the iteritems() method you can retrieve each key and its corresponding value at the same time, which is why we use ‘for i, j’ and not a single loop variable. As we iterate, we bind the current key to ‘i’ and its corresponding value to ‘j’.

Next is the replacement itself. Here we use the replace() method that we mentioned in the beginning.
We simply say that we want to replace the ‘i’ key with its corresponding ‘j’ value in our text, and then bind the copy returned by replace() to text again, so it is always updated with the replacements that took place so far:

text = text.replace(i, j)

And lastly we return our text so that we can use it.
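
A note for readers on Python 3: dictionaries there no longer have iteritems(), and items() plays the same role. A minimal sketch of the same function under that assumption (the longer variable names are just for readability):

```python
def replace_all(text, dic):
    # items() yields (key, value) pairs, like iteritems() on Python 2
    for old, new in dic.items():
        text = text.replace(old, new)
    return text

print(replace_all('Hello everybody.', {'H': '|-|', 'e': '3', 'o': '0'}))
# |-|3ll0 3v3ryb0dy.
```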

Here I must warn you about something that you may face using the dictionary with your custom key:values.
If you use this method for simple character or word replace, single or multiple, then it will work as expected.
For example both of the following dictionaries work as expected:

# for search & replace of whole words
dic = {'hello':'goodbye', 'bad':'good', 'yes':'no'}
# for search & replace of characters
dic = {'a':'@', 'e':'3', 'o':'0', 's':'5'}

We fill our dictionaries with as many words as we want to be searched and replaced in the first case, and as many characters in the second one, and everything works fine, because no key appears inside another key or inside one of the replacement values.

But what if, on rare occasions, we mix characters and words together and our dictionary looks like the following one?

# assuming that the string in which the replacement
# will take place is 'hello everybody'
dic = {'hel':'HEL', 'e':'3', 'o':'0'}

It will work, and replacements will take place, but not the way you might expect. You’d expect it to return something like ‘H3Ll0 3v3ryb0dy’ or ‘HELl0 3v3ryb0dy’, I guess, eh?
No. Most likely it will return ‘h3ll0 3v3ryb0dy’; at least that’s what I get when I run it on my machine. But why?
This happens because our ‘e’ key overlaps with the ‘hel’ key, and because when Python iterates through a dictionary, the order in which the keys and values come back is not something you can rely on (newer Python versions, from 3.7 on, keep the insertion order, but the overlap problem remains). That means you can’t be sure in which order the keys are processed, and most of the time it is not the order in which you wrote them in your dictionary. Dictionaries are hash tables, and how their internal layout determines the iteration order is too involved to discuss here. So our ‘e’ key may come before our ‘hel’ key, or on other occasions the opposite. In our example ‘e’ comes first, so first it finds every ‘e’ in our text and replaces it with ‘3’. Then it searches for the ‘hel’ key. But now there is no ‘hel’ in our text, because the previous replacement modified it. Now we have ‘h3l’, so no ‘hel’:‘HEL’ replacement can take place. Understood?
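
The effect of the order can be reproduced on purpose by feeding the replacements as a list of pairs, so that the order is fixed. This is only a demonstration sketch; replace_in_order is a hypothetical helper, written so that it also runs on Python 3.

```python
def replace_in_order(text, pairs):
    # apply the (old, new) pairs strictly in the order given
    for old, new in pairs:
        text = text.replace(old, new)
    return text

text = 'hello everybody'
e_first = replace_in_order(text, [('e', '3'), ('hel', 'HEL'), ('o', '0')])
hel_first = replace_in_order(text, [('hel', 'HEL'), ('e', '3'), ('o', '0')])
print(e_first)    # h3ll0 3v3ryb0dy
print(hel_first)  # HELl0 3v3ryb0dy
```

Same text, same pairs, different order, different result.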
So don’t use keys that overlap, which in practice usually means don’t mix single characters with words in one dictionary. On the rare occasions when you need to, it’s better to define two dictionaries, one for the words and one for the characters, and then call replace_all() twice. So for our previous example, you can do something like this:

text = 'hello everybody'
w_dic = {'hel':'HEL'}
c_dic = {'E':'3', 'e':'3', 'o':'0'}
text = replace_all(text, w_dic)
text = replace_all(text, c_dic)
print text  # prints 'H3Ll0 3v3ryb0dy'

And of course we must always remember that the Python replace() method is case sensitive; don’t ever forget that. That’s the reason we included both ‘E’ and ‘e’ in our dictionary.
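
A different technique, not the one used in this post, avoids the ordering question altogether: build one regular expression from all the keys and go through the text a single time, so that text which has already been replaced is never looked at again. The sketch below assumes the re module; replace_all_once is a hypothetical name.

```python
import re

def replace_all_once(text, dic):
    """Replace every key of dic in a single pass over text."""
    if not dic:
        return text
    # longest keys first, so 'hel' wins over 'e' where both could match
    keys = sorted(dic, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    # look up the replacement for whichever key matched
    return pattern.sub(lambda m: dic[m.group(0)], text)

print(replace_all_once('hello everybody', {'hel': 'HEL', 'e': '3', 'o': '0'}))
# HELl0 3v3ryb0dy
```

Because each replacement is written straight to the result, a value such as ‘eight’ is not touched again by an ‘e’ key.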

You can use this method in whatever way you want inside your code; just find the one that suits you.
And of course if all you want is to replace a single character or word, you can use the Python replace() method mentioned in the beginning, so you don’t need a dictionary or my method. Keep the method I provided for multiple word or character replacements.
Both of course can be used to replace occurrences inside a simple string or a file’s text. That is up to you and very easy to implement.
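
For a file, the usual pattern is to read the text, run the replacements and write the result back. A minimal sketch under some assumptions: replace_all_in_file is a hypothetical name, the file is assumed to fit in memory, and its encoding is assumed to be utf-8 unless you say otherwise.

```python
import io

def replace_all_in_file(path, dic, encoding='utf-8'):
    """Apply every key -> value replacement of dic to the file at path."""
    # newline='' keeps the file's line endings exactly as they are
    with io.open(path, 'r', encoding=encoding, newline='') as f:
        text = f.read()
    for old, new in dic.items():
        text = text.replace(old, new)
    with io.open(path, 'w', encoding=encoding, newline='') as f:
        f.write(text)
```

The file is overwritten in place, so keep a copy of anything you can’t afford to lose.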

Hope you found this post interesting. For any questions, recommendations or whatever, just leave a comment.

[PS: And once again I must apologize for my bad use of the english language.]

Convert a file to UTF-8 or any encoding with Python

Problem:
You are not English, and your language uses characters other than Latin ones for some or all of the letters of its alphabet (like Greek, Chinese, Japanese, etc.).
You’ve recently switched from Windows to Ubuntu and most of your text files are in a Windows encoding, or you’ve just downloaded a subtitle in your preferred language for one of your movies (most subtitles are in a Windows encoding; I have never found one in utf-8, but that doesn’t mean there aren’t any).
You open that file with, let’s say, gedit and all you see are funny characters. That happens because Ubuntu uses utf-8 as the default for decoding and encoding, so there is a mismatch with your file’s encoding.
So, how can I programmatically convert my file(s) to utf-8 so that all applications in Ubuntu display them properly?

Solution:
First you must know what encoding your file is in, or make a guess among a few candidates. For example, I’m Greek, so my text files from Windows and the Greek subtitles I download are mostly in either ‘windows-1253’ or ‘iso-8859-7’, two almost identical encodings. Yours will be different, but it’s not difficult to find out what encoding it is or might be. For a full list of the encodings that Python can handle see the docs here.

So let’s start coding… We’ll make a simple utility that first makes a backup of our original file and then converts it to utf-8.

#!/usr/bin/env python

import os
import sys
import shutil

def convert_to_utf8(filename):
    # gather in a tuple the encodings you think the file
    # may be in
    encodings = ('windows-1253', 'iso-8859-7', 'macgreek')

    # try to open the file and exit if some IOError occurs
    try:
        f = open(filename, 'r').read()
    except Exception:
        sys.exit(1)

    # now start iterating in our encodings tuple and try to
    # decode the file
    for enc in encodings:
        try:
            # try to decode the file with the first encoding
            # from the tuple.
            # if it succeeds then it will reach break, so we
            # will be out of the loop (something we want on
            # success).
            # the data variable will hold our decoded text
            data = f.decode(enc)
            break
        except Exception:
            # if the first encoding fails, the continue keyword
            # makes the loop start again with the second encoding
            # from the tuple, and so on... until one succeeds.
            # if for some reason it reaches the last encoding of
            # our tuple without success, then exit the program.
            if enc == encodings[-1]:
                sys.exit(1)
            continue

    # now get the absolute path of our filename and append .bak
    # to the end of it (for our backup file)
    fpath = os.path.abspath(filename)
    newfilename = fpath + '.bak'
    # and make our backup file with shutil
    shutil.copy(filename, newfilename)

    # and at last convert it to utf-8
    f = open(filename, 'w')
    try:
        f.write(data.encode('utf-8'))
    except Exception, e:
        print e
    finally:
        f.close()

Of course you can also encode it in an encoding other than utf-8; just put the one you want in the f.write(data.encode('utf-8')) line. Also change the tuple elements according to your needs.
A simple use of our function might look like the following. But first add these lines at the end of our .py file and save it as, let’s say, ‘converter.py’:

if __name__ == '__main__':
    convert_to_utf8(sys.argv[1])

Here we ask the user to pass the file he/she wants converted as the first argument on the command line. For example:

python converter.py /home/user/my_movies/my_movie.sub

or, if it’s in the same dir as our converter.py:

python converter.py my_movie.sub

And of course if our file has spaces in its name, like ‘Cape Fear.srt’, then we put the name inside either single (' ') or double (" ") quotes.

And that’s all. Later we can change our function so that it converts and backs up a whole directory and not just a single file. But that’s a good beginning.
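
The function above is written for Python 2, where the text you read from a file has a decode() method. On Python 3 you would read the file as bytes instead. The following is a rough sketch of the same steps under that assumption; convert_to_utf8_py3 is a hypothetical name, and it raises an error instead of exiting when no encoding fits. Keep in mind that single-byte encodings accept most byte sequences, so a decode that succeeds is not proof that the guess was right.

```python
import shutil

def convert_to_utf8_py3(filename,
                        encodings=('windows-1253', 'iso-8859-7', 'macgreek')):
    """Back up filename as filename.bak, then rewrite it as utf-8.

    Returns the name of the encoding that decoded the file.
    """
    with open(filename, 'rb') as f:
        raw = f.read()
    for enc in encodings:
        try:
            text = raw.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    else:
        # the loop finished without a break: nothing could decode it
        raise ValueError('none of %r could decode %s' % (encodings, filename))
    # make the backup before touching the original
    shutil.copy(filename, filename + '.bak')
    with open(filename, 'wb') as f:
        f.write(text.encode('utf-8'))
    return enc
```

The returned encoding name is handy for checking which guess was actually used.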

Cheers…

UPDATE [7/10/08]: If you just want to convert your files and you are not interested in the programming solution, I’ve just released cpConverter, a utility that converts files between 84 code pages. Info and download link here.

How to find if a specific font is installed in a user’s system (with PyGtk)

After 10 years of working on Windows I recently switched to Linux, specifically Ubuntu (‘working’ is not the right word, ’cause I don’t actually work with a pc and I’m not a professional programmer; I learned programming by myself and for the fun of it. I just use my pc for fun, entertainment and to broaden my knowledge in various fields.)
But so far I’ve found that in Ubuntu I’m missing a lot of things from Windows. No, I don’t miss Windows as an OS, not at all, but I do miss a lot of applications written for that specific OS (and I don’t want to use Wine to run them in Ubuntu; I want native applications.)
So, I found that Linux lacks a decent .nfo and ascii art viewer. Well, I know that there is NFO Viewer, but you must download it first, then compile it and then install it. OK, I did that, and it’s a very decent program, but I’m not fully satisfied with it. So what can I do? Write an .nfo viewer myself, of course! Even if only for the fun of it and for learning a lot of things in the process, about pyGtk, Linux and Python.
So, I started. But I came across my first problem. That is, .nfo files and ascii art in general use codepage cp437 (or IBM437) for the encoding, and in Windows to view them properly you must select the Terminal font in your application, a fixed-size font used in the Windows command line that has the special characters an .nfo file uses. But Ubuntu has no font that is close to the Windows Terminal font. All the native Ubuntu fixed-size fonts are way too far from the Windows one. I think the closest native font is DejaVu Sans Mono, but it still falls short of displaying .nfo and ascii art properly.
But I searched… and at last found the Terminus font, a font that I must say is almost identical to the Windows one. But you must download it and install it if you want to use it. And as far as my program is concerned, I don’t want to force the user to download and install it (a hypothetical story of course, if I’m ever going to release that program.)
So, back to the code and to my problem: I wanted to check if the user has that specific font on his/her system and then load it. If I don’t find it, I set another font as the default one, so that the user still gets close to seeing what ascii art looks like. And all of this because the default font that gtk+ and pango use is not a fixed-size one (I think it uses one from the Sans family).
But how do I check if the user has that font? I didn’t know, so I started googling and reading the docs, both the pyGtk ones and Python’s. I also posted my question in a couple of forums. But I didn’t get any answer. Something was missing. What was that? Well, the answer was simple after all, and it was lying inside the pango docs that I hadn’t read carefully at first. So I read them all over again and found the answer (and realized once more that I must read the docs carefully first.)

And here it is if you ever need to search for a specific font in the user’s system and set it as the default for your application if found, or otherwise set a secondary as default:


# first you must import pango of course
import pango

# The get_pango_context() method returns the pango.Context with
# the appropriate colormap, font description and base direction
# for the selected widget (text_view in our case, assuming
# text_view is your gtk.TextView)

context = text_view.get_pango_context()

# list_families() returns a tuple of the font families
# (pango.FontFamily objects) that pango can use with the selected widget

fonts = context.list_families()

# so we iterate over the font tuple
for font in fonts:
    # if our font is found (get_name returns a string)
    if font.get_name() == 'Terminus':
        # set that font for our text_view and break the loop
        text_view.modify_font(pango.FontDescription('Terminus 8'))
        break
else:
    # the loop ended without a break, so Terminus is not
    # installed: set our second choice font
    text_view.modify_font(pango.FontDescription('DejaVu Sans Mono 8'))

Pretty simple, isn’t it?
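
If you have more than two fonts in mind, the choice itself can be separated from the GTK calls, which also makes it easy to try without a running GUI. The sketch below is only an illustration: pick_font is a hypothetical helper that works on plain font names, and the commented lines show how it might be fed from the context used above.

```python
def pick_font(available, preferred):
    """Return the first name in `preferred` found in `available`, else None."""
    names = set(available)
    for name in preferred:
        if name in names:
            return name
    return None

# With pango this could be used along these lines (not run here):
#   names = [family.get_name() for family in context.list_families()]
#   choice = pick_font(names, ['Terminus', 'DejaVu Sans Mono'])
#   if choice:
#       text_view.modify_font(pango.FontDescription(choice + ' 8'))

print(pick_font(['Sans', 'DejaVu Sans Mono'], ['Terminus', 'DejaVu Sans Mono']))
# DejaVu Sans Mono
```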

Hope it’ll help someone as it helped me.

[PS: First I must apologize for the long, boring and tiring story just to present 9 lines of code and second I must apologize for my bad english. But you see, english is not my native language.]