Normalizing Strings

A fifteen-minute implementation for standardizing strings.

While this works well most of the time, it is not perfect. For example, the Unicode character ç has a defined decomposition, but œ does not, so the ASCII conversion silently drops it. Exceptions like this must be handled separately when standardizing.
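As a small illustration of that gap, the sketch below checks both characters with the standard `unicodedata` module. The helper name `to_ascii` is only for this example; it mirrors the NFKD-then-encode approach used in the script further down, without the upper-casing step.

```python
# -*- coding: utf-8 -*-
from unicodedata import decomposition, normalize


def to_ascii(text):
    """Decompose the text, then drop every code point outside ASCII."""
    return normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')


# U+00E7 (ç) decomposes into "c" plus a combining cedilla.
has_mapping = decomposition(u'\u00e7')
# U+0153 (œ) has no decomposition, so this is an empty string.
no_mapping = decomposition(u'\u0153')

print(to_ascii(u'ma\u00e7\u00e3'))  # maçã: the accents are stripped
print(to_ascii(u'c\u0153ur'))       # cœur: the ligature is dropped entirely
```

The second call loses a letter rather than converting it, which is the kind of exception that has to be handled by hand.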

I believe that whatever strategy is chosen to normalize strings must be based on both the audience and the supporting data. For example, if none of your data contains the character ß and your intended audience does not have ß on their keyboards, a mapping for it is probably unnecessary.
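If your data does call for such mappings, one possible approach is to apply an explicit replacement table before the Unicode decomposition. This is only a sketch: `CUSTOM_MAPPINGS` and `to_ascii_with_mappings` are hypothetical names, and the table contents are an assumption that should be driven by your own data and audience.

```python
# -*- coding: utf-8 -*-
from unicodedata import normalize

# Hypothetical mapping table: extend or trim it to match your own data.
CUSTOM_MAPPINGS = {
    u'\u00df': u'ss',  # ß
    u'\u0153': u'oe',  # œ
    u'\u00e6': u'ae',  # æ
}


def to_ascii_with_mappings(text, mappings=CUSTOM_MAPPINGS):
    """Apply the explicit mappings first, then decompose and drop the rest."""
    for character, replacement in mappings.items():
        text = text.replace(character, replacement)
    return normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')


print(to_ascii_with_mappings(u'stra\u00dfe'))
print(to_ascii_with_mappings(u'c\u0153ur'))
```

The table only covers the lower-case forms listed; anything not in it still falls through to the decomposition step and is dropped if it has no ASCII base.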

The tracer decorator is where I spent the majority of my time. It allows you to keep track of the standardization methods that have changed the provided string.

# -*- coding: utf-8 -*-
from collections import OrderedDict
from functools import wraps
from unicodedata import normalize
import re


def tracer(f):
    """
    Decorator function to keep track of the normalization methods that
    change the string.

    Given the following example:

    @tracer
    def my_normalization_method(self):
        self.word = self.word.lower()

    The tracer will add a value to the changes ordered dictionary only if
    the method actually changes the word.

    So, if the original word is "lowercase", the tracer will not add a
    value to the changes dictionary, as the method does not modify the
    string. However, if the original word is "UPPERCASE",
    my_normalization_method will change the word to "uppercase",
    and an entry will be created.
    """
    @wraps(f)
    def wrapper(self):
        pre_method_word_value = self.word
        f(self)

        post_method_word_value = self.word
        if pre_method_word_value != post_method_word_value:
            method_name = f.__name__
            self.changes[method_name] = post_method_word_value

    return wrapper


class StandardizeWord(object):
    def __init__(self, word):
        self.word = word
        self.changes = OrderedDict()

    @tracer
    def strip_leading_trailing_spaces(self):
        """
        Remove leading and trailing spaces
        """
        self.word = self.word.strip()

    @tracer
    def word_to_ascii(self):
        """
        Convert non-ASCII characters to their ASCII equivalent
        Upper case string
        """
        self.word = normalize('NFKD', self.word).encode('ascii',
                                                        'ignore').upper()

    @tracer
    def replace_multiple_spaces(self):
        """
        Simple regular expression to replace multiple spaces with a single
        one
        """
        regular_expression = re.compile(r' {2,}')
        self.word = regular_expression.sub(' ', self.word)

    def run(self):
        """
        Run all standardization methods in order
        """
        normalizing_methods = ('strip_leading_trailing_spaces',
                               'word_to_ascii', 'replace_multiple_spaces')
        for method in normalizing_methods:
            getattr(self, method)()

    def __str__(self):
        return self.word

tests = (u'   montréal    and quebec  ',
         u'ÁÂÀÄÅÃ Ç ÉÊÈË ÍÎÌÏ Ñ ÓÔÒÖÕ ÚÛÙÜ Ý',
         u'naïve café',
         u'maçã')


for test in tests:
    standard_word = StandardizeWord(word=test)
    standard_word.run()
    print(standard_word.changes)