
Fixing common Unicode mistakes with Python – after they've been made


Update: Not only can you fix Unicode mistakes with Python, you can fix Unicode mistakes with our open source Python package, “ftfy”.

You have almost certainly seen text on a computer that looks something like this:

If numbers aren’t beautiful, I don’t know what is. –Paul ErdÅ‘s

Somewhere, a computer got hold of a list of numbers that were intended to constitute a quotation and did something distinctly un-beautiful with it. A person reading that can deduce that it was actually supposed to say this:

If numbers aren’t beautiful, I don’t know what is. –Paul Erdős

Here’s what’s going on. A modern computer has the ability to display text that uses over 100,000 different characters, but unfortunately that text sometimes passes through a doddering old program that believes there are only the 256 that it can fit in a single byte. The program doesn’t even bother to check what encoding the text is in; it just uses its own favorite encoding and turns a bunch of characters into strings of completely different characters.

Now, you’re not the programmer causing the encoding problems, right? Because you’ve read something like Joel Spolsky’s The Absolute Minimum Every Developer Absolutely, Positively Must Know About Unicode And Character Sets or the Python Unicode HOWTO and you’ve learned the difference between text and bytestrings and how to get them right.

But the problem is that sometimes you might have to deal with text that comes out of other code. We deal with this a lot at Luminoso, where the text our customers want us to analyze has often passed through several different pieces of software, each with their own quirks, probably with Microsoft Office somewhere in the chain.

So this post isn’t about how to do Unicode right. It’s about a tool we came up with for damage control after some other program does Unicode wrong. It detects some of the most common encoding mistakes and does what it can to undo them.

Here’s the type of Unicode mistake we’re fixing.

Some text, somewhere, was encoded into bytes using UTF-8 (which is quickly becoming the standard encoding for text on the Internet). The software that received this text wasn’t expecting UTF-8. It instead decodes the bytes in an encoding with only 256 characters. The simplest of these encodings is the one called “ISO-8859-1”, or “Latin-1” among friends. In Latin-1, you map the 256 possible bytes to the first 256 Unicode characters. This encoding can arise naturally from software that doesn’t even consider that different encodings exist. The result is that every non-ASCII character turns into two or three garbage characters.
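Here's a tiny reproduction of that failure mode, shown in Python 3 (the word "más" is just an example; any non-ASCII text will do):

>>> text = 'más'                             # what the author wrote
>>> text.encode('utf-8')                     # the bytes that travel over the wire
b'm\xc3\xa1s'
>>> text.encode('utf-8').decode('latin-1')   # what a Latin-1-only program produces
'más'

One accented letter has become two garbage characters, and nothing raised an error along the way.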

The three most commonly-confused codecs are UTF-8, Latin-1, and Windows-1252. There are lots of other codecs in use in the world, but they are so obviously different from these three that everyone can tell when they’ve gone wrong. We’ll focus on fixing cases where text was encoded as one of these three codecs and decoded as another.
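To make the difference between those two single-byte codecs concrete, here is one em dash run through both wrong decodings (Python 3 shown; nothing here is specific to the fixer):

>>> '—'.encode('utf-8')
b'\xe2\x80\x94'
>>> '—'.encode('utf-8').decode('latin-1')
'â\x80\x94'
>>> '—'.encode('utf-8').decode('windows-1252')
'â€”'

Latin-1 turns two of the bytes into invisible control characters, while Windows-1252 produces the familiar three-character garbage.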

A first attempt

When you look at the kind of junk that’s produced by this process, the character sequences seem so ugly and meaningless that you could just replace anything that looks like it should have been UTF-8. Just find those sequences, replace them unconditionally with what they would be in UTF-8, and you’re done. In fact, that’s what my first version did. Skipping a bunch of edge cases and error handling, it looked something like this:

# A table telling us how to interpret the first word of a letter's Unicode
# name. The number indicates how frequently we expect this script to be used
# on computers. Many scripts not included here are assumed to have a frequency
# of "0" -- if you're going to write in Linear B using Unicode, you're
# probably aware enough of encoding issues to get it right.
#
# The lowercase name is a general category -- for example, Han characters and
# Hiragana characters are very frequently adjacent in Japanese, so they all go
# into category 'cjk'. Letters of different categories are assumed not to
# appear next to each other often.
SCRIPT_TABLE = {
    'LATIN': (3, 'latin'),
    'CJK': (2, 'cjk'),
    'ARABIC': (2, 'arabic'),
    'CYRILLIC': (2, 'cyrillic'),
    'GREEK': (2, 'greek'),
    'HEBREW': (2, 'hebrew'),
    'KATAKANA': (2, 'cjk'),
    'HIRAGANA': (2, 'cjk'),
    'HIRAGANA-KATAKANA': (2, 'cjk'),
    'HANGUL': (2, 'cjk'),
    'DEVANAGARI': (2, 'devanagari'),
    'THAI': (2, 'thai'),
    'FULLWIDTH': (2, 'cjk'),
    'MODIFIER': (2, None),
    'HALFWIDTH': (1, 'cjk'),
    'BENGALI': (1, 'bengali'),
    'LAO': (1, 'lao'),
    'KHMER': (1, 'khmer'),
    'TELUGU': (1, 'telugu'),
    'MALAYALAM': (1, 'malayalam'),
    'SINHALA': (1, 'sinhala'),
    'TAMIL': (1, 'tamil'),
    'GEORGIAN': (1, 'georgian'),
    'ARMENIAN': (1, 'armenian'),
    'KANNADA': (1, 'kannada'),   # mostly used for looks of disapproval
    'MASCULINE': (1, 'latin'),
    'FEMININE': (1, 'latin')
}

An intelligent Unicode fixer

Because encoded text can actually be ambiguous, we have to figure out whether the text is better when we fix it or when we leave it alone. The venerable Mark Pilgrim has a key insight when discussing his chardet module:

Encoding detection is really language detection in drag. –Mark Pilgrim, Dive Into Python 3

The reason the word “Bront녔” is so clearly wrong is that the first five characters are Roman letters, while the last one is Hangul, and most words in most languages don’t mix two different scripts like that.

This is where Python’s standard library starts to shine. The unicodedata module can tell us lots of things we want to know about any given character:

>>> import unicodedata
>>> unicodedata.category(u't')
'Ll'
>>> unicodedata.name(u't')
'LATIN SMALL LETTER T'
>>> unicodedata.category(u'녔')
'Lo'
>>> unicodedata.name(u'녔')
'HANGUL SYLLABLE NYEOSS'
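Continuing that session, the first word of a character's Unicode name makes a workable script label, which is exactly the signal the SCRIPT_TABLE shown earlier keys on:

>>> unicodedata.name(u't').split()[0]
'LATIN'
>>> unicodedata.name(u'녔').split()[0]
'HANGUL'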

Now we can write a more complicated but much more principled Unicode fixer by following some rules of thumb:

We want to apply a consistent transformation that minimizes the number of "weird things" that happen in a string:

- Obscure single-byte characters, such as ¶ and ƒ, are weird.
- Math and currency symbols adjacent to other symbols are weird.
- Having two adjacent letters from different scripts is very weird.
- Causing new decoding errors that turn normal characters into � is unacceptable and should count for much more than any other problem.
- Favor shorter strings over longer ones, as long as the shorter string isn't weirder.
- Favor correctly-decoded Windows-1252 gremlins over incorrectly-decoded ones.
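Here is a minimal sketch of what a "weirdness" score along these lines could look like, using nothing but unicodedata. It is an illustration of the heuristic, not the fixer's actual scoring code; the function name and the point values are invented for the example:

import unicodedata

def weirdness(text):
    """A rough 'weirdness' score for a candidate string (illustrative only).

    Replacement characters are heavily penalized, adjacent letters from
    different scripts are very weird, and stray symbols are a little weird.
    """
    score = 0
    prev_script = None
    for char in text:
        if char == '\ufffd':
            # A replacement character means information was destroyed.
            score += 100
            prev_script = None
            continue
        category = unicodedata.category(char)
        if category.startswith('L'):
            # Use the first word of the Unicode name as a rough script label.
            name = unicodedata.name(char, '')
            script = name.split()[0] if name else None
            if prev_script is not None and script is not None and script != prev_script:
                score += 5      # two adjacent letters from different scripts
            prev_script = script
        else:
            if category in ('Sm', 'Sc', 'So', 'Co', 'Cn'):
                score += 2      # math/currency/other symbols, unassigned characters
            prev_script = None
    return score

With a score like this, a candidate re-decoding wins only if it comes out less weird than what we started with: weirdness(u'Bront녔') is higher than weirdness(u'Brontë'), so the version without the stray Hangul syllable is preferred.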

That leads us to a complete Unicode fixer that applies these rules. It does an excellent job at fixing files full of garble line-by-line, such as the University of Leeds Internet Spanish frequency list, which picked up that "más" is a really common word in Spanish text because there is so much incorrect Unicode on the Web.

The code we arrive at appears below. (But as I edit this post six years later, I should remind you that this was 2012! We’ve gotten much fancier about this, so you should try our full-featured Unicode fixing library, ftfy.)

# -*- coding: utf-8 -*-
#
# This code has become part of the "ftfy" library:
#
#     http://ftfy.readthedocs.io/en/latest/
#
# That library is actively maintained and works on Python 2 or 3. This recipe
# is not.

import unicodedata


def fix_bad_unicode(text):
    u"""
    Something you will find all over the place, in real-world text, is text
    that's mistakenly encoded as utf-8, decoded in some ugly format like
    latin-1 or even Windows codepage 1252, and encoded as utf-8 again.

    This causes your perfectly good Unicode-aware code to end up with garbage
    text because someone else (or maybe "someone else") made a mistake.

    This function looks for the evidence of that having happened and fixes it.
    It determines whether it should replace nonsense sequences of single-byte
    characters that were really meant to be UTF-8 characters, and if so, turns
    them into the correctly-encoded Unicode character that they were meant to
    represent.

    The input to the function must be Unicode. It's not going to try to
    auto-decode bytes for you -- then it would just create the problems it's
    supposed to fix.

        >>> print fix_bad_unicode(u'único')
        único

        >>> print fix_bad_unicode(u'This text is fine already :þ')
        This text is fine already :þ

    Because these characters often come from Microsoft products, we allow for
    the possibility that we get not just Unicode characters 128-255, but also
    Windows's conflicting idea of what characters 128-160 are.

        >>> print fix_bad_unicode(u'This â€” should be an em dash')
        This — should be an em dash

    We might have to deal with both Windows characters and raw control
    characters at the same time, especially when dealing with characters like
    \x81 that have no mapping in Windows.

        >>> print fix_bad_unicode(u'This text is sad .â\x81”.')
        This text is sad .⁔.

    This function even fixes multiple levels of badness:

        >>> wtf = u'\xc3\xa0\xc2\xb2\xc2\xa0_\xc3\xa0\xc2\xb2\xc2\xa0'
        >>> print fix_bad_unicode(wtf)
        ಠ_ಠ

    However, it has safeguards against fixing sequences of letters and
    punctuation that can occur in valid text:

        >>> print fix_bad_unicode(u'not such a fan of Charlotte Brontë…”')
        not such a fan of Charlotte Brontë…”

    Cases of genuine ambiguity can sometimes be addressed by finding other
    characters that are not double-encoding, and expecting the encoding to be
    consistent:

        >>> print fix_bad_unicode(u'AHÅ™, the new sofa from IKEA®')
        AHÅ™, the new sofa from IKEA®

    Finally, we handle the case where the text is in a single-byte encoding
    that was intended as Windows-1252 all along but read as Latin-1:

        >>> print fix_bad_unicode(u'This text was never Unicode at all\x85')
        This text was never Unicode at all…
    """
    if not isinstance(text, unicode):
        raise TypeError("This isn't even decoded into Unicode yet. "
                        "Decode it first.")
    if len(text) == 0:
        return text
    maxord = max(ord(char) for char in text)
    tried_fixing = []
    if maxord
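As the header comment says, this recipe's logic went on to live in the maintained ftfy library. For illustration, here is roughly what the modern one-call API does with the quotation from the top of this post (exact output may vary a little between ftfy versions):

>>> import ftfy
>>> print(ftfy.fix_text('If numbers aren’t beautiful, I don’t know what is. –Paul ErdÅ‘s'))
If numbers aren't beautiful, I don't know what is. –Paul Erdős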

