except UnicodeError:
  # A practical guide to fighting Unicode demons

                               Aram Dulyan (@Aramgutang)
                              Sydney Python Users group (SyPy)
                                                05 APR 2012
Except UnicodeError: battling Unicode demons in Python
What is Unicode?
Looking inside:
In Python:




  class unicode(basestring):
    ...
The great escapes:


  >>> 'e' == u'e'
  True

  >>> '\xc9' == u'\xc9'
  False

  >>> u'\xc9' == u'\u00c9' == u'\U000000c9'
  True
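
For readers on Python 3, where every str is already Unicode and the u prefix is optional, the comparisons above might be sketched roughly as follows. This assumes Python 3, and the variable names are illustrative only.

```python
# Python 3 sketch: three escape forms for the same code point, U+00C9.
short_form = '\xc9'        # \x takes two hex digits
medium_form = '\u00c9'     # \u takes four hex digits
long_form = '\U000000c9'   # \U takes eight hex digits

same_character = short_form == medium_form == long_form

# A bytes literal is a different type, so it does not compare equal to a str.
bytes_equal_text = (b'\xc9' == '\xc9')

print(same_character, bytes_equal_text)
```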
UTF-8
   There is no difference between an ASCII-encoded and a UTF-8 encoded
    file if no extended characters appear in it.
   Except if there's a BOM (byte order mark):
       UTF-8: EF BB BF ( ï»¿ )
       UTF-16: FE FF big-endian, FF FE little-endian ( the byte-swapped value U+FFFE is a reserved noncharacter for this very purpose )
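
A small sketch of both points, assuming Python 3 and the standard codecs module. The sample text and variable names are made up for illustration.

```python
import codecs

text = 'plain ASCII text'

# With no extended characters, the ASCII and UTF-8 encodings are identical bytes.
same_bytes = text.encode('ascii') == text.encode('utf-8')

# The 'utf-8-sig' codec writes a BOM on encode and strips one on decode.
with_bom = text.encode('utf-8-sig')
has_bom = with_bom.startswith(codecs.BOM_UTF8)   # BOM_UTF8 is b'\xef\xbb\xbf'
round_trip = with_bom.decode('utf-8-sig')

# Decoding the same bytes as plain UTF-8 keeps the BOM as a leading U+FEFF.
kept_bom = with_bom.decode('utf-8')
```

Reading files of unknown origin with 'utf-8-sig' is one way to tolerate a BOM whether or not it is present.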




    NOT HELPFUL:
Encode/decode:


 Encode to bytes
 Decode to unicode




   or, forget decode completely:
    >>> 'fort\xc3\xa9'.decode('utf-8')
    u'fort\xe9'
    >>> unicode('fort\xc3\xa9', 'utf-8')
    u'fort\xe9'
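
The same round trip in Python 3, as a hedged sketch: bytes and str are separate types there, and the str constructor plays the role of unicode(). The sample bytes are the UTF-8 encoding of an e with an acute accent; the names are illustrative.

```python
raw = b'fort\xc3\xa9'            # UTF-8 bytes, as read from a file or socket

text = raw.decode('utf-8')       # decode: bytes -> str
also_text = str(raw, 'utf-8')    # same result via the str constructor
back = text.encode('utf-8')      # encode: str -> bytes

print(repr(text), repr(back))
```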
This is why we declare encodings:



                                 RIGHT SINGLE QUOTATION MARK
                                            U+2019




 >>> u'\u2019'.encode('utf-8')
 '\xe2\x80\x99'
 >>> '\xe2\x80\x99'.decode('cp1252')
 u'\xe2\u20ac\u2122'
 >>> print u'\xe2\u20ac\u2122'
 â€™



 All because of a missing <meta charset="utf-8">
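
If the garbled text is all you have, the mistake can sometimes be undone by reversing it. The sketch below assumes Python 3 and assumes the damage was exactly one wrong cp1252 decode of UTF-8 bytes. A few byte values are undefined in cp1252, so this will not work for every string.

```python
original = '\u2019'                       # RIGHT SINGLE QUOTATION MARK

wire_bytes = original.encode('utf-8')     # b'\xe2\x80\x99'
garbled = wire_bytes.decode('cp1252')     # wrong codec: three characters

# Reverse the wrong step, then apply the right one.
repaired = garbled.encode('cp1252').decode('utf-8')

print(repr(garbled), repr(repaired))
```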
If you REALLY need ASCII:


  >>> print u'r\xe9sum\xe9'
  résumé
  >>> print u'r\xe9sum\xe9'.encode(errors='ignore')
  rsum
  >>> print u'r\xe9sum\xe9'.encode(errors='replace')
  r?sum?


  $ pip install unidecode
  >>> from unidecode import unidecode
  >>> print unidecode(u'r\xe9sum\xe9')
  resume
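
Where installing unidecode is not an option, a standard-library approximation is to decompose accented letters and drop the combining marks. This is not what unidecode does internally and it is much weaker: it only folds characters that have a decomposition and does no transliteration. The sketch assumes Python 3, and fold_to_ascii is a hypothetical helper name.

```python
import unicodedata

def fold_to_ascii(text):
    """Strip accents via NFKD decomposition, dropping anything non-ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

ignored = 'r\xe9sum\xe9'.encode('ascii', errors='ignore')    # accents dropped
replaced = 'r\xe9sum\xe9'.encode('ascii', errors='replace')  # accents become ?
folded = fold_to_ascii('r\xe9sum\xe9')

# A letter with no decomposition (o with stroke) is simply lost.
lost = fold_to_ascii('\xf8')
```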
The u prefix:
  >>> '%s %s' % (u'unicode', 'string')
  u'unicode string'
  >>> 'string ' + u'unicode'
  u'string unicode'


  class Loonie(object):
      def __str__(self):
          return 'Throatwobbler Mangrove'
      def __unicode__(self):
          return u'Richard Luxuryyacht'

  >>> '%s' % Loonie()
  'Throatwobbler Mangrove'
  >>> u'%s' % Loonie()
  u'Richard Luxuryyacht'

  >>> '%s %s' % (Loonie(), u'is silly')
  u'Throatwobbler Mangrove is silly'
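
For contrast, Python 3 removes this implicit promotion. The sketch below assumes Python 3 run without the -b flag, and shows that mixing the two types either fails loudly or formats the bytes repr rather than silently decoding.

```python
# Concatenating str and bytes raises instead of promoting to Unicode.
try:
    'string ' + b'bytes'
    mixed_ok = True
except TypeError:
    mixed_ok = False

# %-formatting inserts the repr of the bytes object, not its decoded text.
formatted = '%s' % b'bytes'

# The explicit decode is what Python 2 was doing behind the scenes with ASCII.
explicit = 'string ' + b'bytes'.decode('ascii')
```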
Combining marks:


LATIN SMALL LETTER E WITH DIAERESIS (U+00EB)
  = LATIN SMALL LETTER E (U+0065) + COMBINING DIAERESIS (U+0308)


>>> print u'Zo\xeb'
Zoë
>>> print u'Zoe\u0308'
Zoë

>>> from unicodedata import normalize
>>> normalize('NFC', u'Zoe\u0308')
u'Zo\xeb'
>>> normalize('NFD', u'Zo\xeb')
u'Zoe\u0308'


OS X on HFS+ normalises filenames, others don't
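
The practical consequence is that two strings can print identically and still compare unequal. A minimal sketch, assuming Python 3; same_text is a hypothetical helper that normalises both sides before comparing.

```python
from unicodedata import normalize

composed = 'Zo\xeb'         # one code point for the accented letter
decomposed = 'Zoe\u0308'    # plain e followed by a combining diaeresis

equal_raw = composed == decomposed
lengths = (len(composed), len(decomposed))

def same_text(a, b):
    """Compare two strings after normalising both to NFC."""
    return normalize('NFC', a) == normalize('NFC', b)

equal_normalised = same_text(composed, decomposed)
```

Normalising filenames this way before comparing them is one way to cope with files that have passed through HFS+.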
Warning:
PEP-8
Code in the core Python distribution should always use the ASCII or Latin-1
encoding (a.k.a. ISO-8859-1). For Python 3.0 and beyond, UTF-8 is
preferred over Latin-1, see PEP 3120.
Files using ASCII should not have a coding cookie. Latin-1 (or UTF-8)
should only be used when a comment or docstring needs to mention an
author name that requires Latin-1; otherwise, using \x, \u or \U escapes is the
preferred way to include non-ASCII data in string literals.
For Python 3.0 and beyond, the following policy is prescribed for the
standard library (see PEP 3131): All identifiers in the Python standard
library MUST use ASCII-only identifiers, and SHOULD use English words
wherever feasible (in many cases, abbreviations and technical terms are used
which aren't English). In addition, string literals and comments must also be
in ASCII. The only exceptions are (a) test cases testing the non-ASCII
features, and (b) names of authors. Authors whose names are not based on
the latin alphabet MUST provide a latin transliteration of their names.
Libraries:

   unidecode
       For when you absolutely need ASCII: folds accents and
        transliterates from many languages.
   chardet
       Guesses most likely character encoding of a given bytestring.
        Based on Mozilla's code.
   unicode-nazi
       Yells about any implicit unicode/bytestring conversion in your
        code. Useful when porting code to Python 3.
Links:

   All About Python and Unicode
       A detailed reference on all things pertaining to Python and Unicode.
   Pragmatic Unicode
       PyCon 2012 talk on Unicode in Python, covering v3 as well.
   Love Hotels and Unicode
       A look at the inside politics and other quirky aspects of Unicode.
   Python Unicode - Fixing UTF-8 encoded as Latin-1
       Another poor soul who ran into this problem.
   Why the Obama tweet was garbled
       A quick explanation with comments from the people responsible.
   Unicode Support Shootout
       An advanced treatise on how most languages (including Python) fail at Unicode.
