????? ????????
????? ???? ??????? ?????? ??????

MashhadLUG.org

@bijan@quitter.se
???? ??
????
“The Unicode”

????? ??? ??? ?????? ??????
?? ????? ???
??????????
“The Introduction”
???? ??? ??
?????
“Why Unicode?”
???? ??? ? ???? ¡­?
“If you think ...”

???? ?????
???
???? ???

ASCII

???????? )? ????(?

??????? ???? ?????????? ????? ???? ???? ????
??? ???? ???? ???? ???? ?? ????? ???????? ????
???? ?? ???¡­?
“If by now ...”
???????? ????????
“The Character Encoding”
???????? ??????? ??????
“What is character encoding?”
[Figure: scattered letters and symbols (A to Z, #, !, ?) shown above rows of binary digits]
???????? ??? ??
????
“The Unicode History”
???????? ??
?????
“ASCII Encoding”
???

?? ??? )??? ???????(?

???

??? ??????? ??????

???

??? ??????? ????

???

?? ??? ??????
???? ? ??? ??????
“ASCII and the extra bit”
???

?????? ????? ??? ? ????

???

???? ???? ????? ????? ????????
[Figure: bit positions 1 to 7, each marked x]

???

???? ??????? ??????
???? ? ??? ??????
“ASCII and the extra bit”
???? ? ???? ??
?????? ???

Codepage dependent

Fixed

“ANSI and the birth of the code pages”

CP437 (IBM)

CP1256 (Arabic)
????? ? ??
?????? ???
“Code page problems”
???

????????? ???? ??? ??????? ????

???

????? ?? ?????? ?? ????? ??? ????? ?????

???

???????? ???? ???? ??? ??????
????? ? ???? ??????? ????? ???
“Difference between code page characters”

$ python
>>> print chr(202).decode('cp437')
╩
>>> print chr(202).decode('cp1256')
ت
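
The same experiment can be repeated in Python 3, where bytes and text are separate types. This is a minimal sketch assuming Python 3 (the slide uses Python 2 syntax); the variable names are illustrative only.

```python
# One byte, two code pages: the value 202 (0xCA) maps to different characters
# depending on which code page is used to interpret it.
raw = bytes([202])

as_cp437 = raw.decode("cp437")    # IBM PC code page
as_cp1256 = raw.decode("cp1256")  # Windows Arabic code page

print(hex(raw[0]), repr(as_cp437), repr(as_cp1256))
```

The byte never changes; only the table used to read it does, which is the core problem with code pages.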
?????????? ??? ??
????
“The Unicode standard”
??????

日本語

English

Slovenčina

?????????

???????

Русский

???????

Polski

中文
Български

...
?????????? ??? ??
????
“The Unicode standard”

Basic Latin

A

Letter
ISO Control
Uppercase
Lowercase

Whitespace

Digit
Left to Right
...
AlphaNumeric

Mirrored

Code Point
U+hexadecimal
??????? ? ???????????
???? ??
????
“Unicode character properties”
$ python
>>> import unicodedata as ud
>>> ud.name(u"ب")
'ARABIC LETTER BEH'
>>> ud.category(u"ب")
'Lo'
>>> ud.numeric(u"۳")
3.0
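
For readers on Python 3, the property lookups above might look like the sketch below. The digit is an assumption: the slide's original character did not survive extraction, so a Persian digit three (U+06F3) is used here for illustration.

```python
import unicodedata as ud

beh = "\u0628"    # ARABIC LETTER BEH
three = "\u06f3"  # EXTENDED ARABIC-INDIC DIGIT THREE (assumed example digit)

properties = {
    "code point": f"U+{ord(beh):04X}",  # the U+hexadecimal notation
    "name": ud.name(beh),
    "category": ud.category(beh),       # 'Lo' = Letter, other
    "numeric": ud.numeric(three),       # numeric value of a digit character
}
for key, value in properties.items():
    print(key, value)
```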
?????????? ??? ??
????
“The Unicode standard”

Code Point
U+hexadecimal

[Figure: rows of binary digits]
???????? ??? ??? ??
????
“The Unicode encodings”

“Hello”

U+0048 U+0065 U+006C U+006C U+006F

Byte Order Mark (BOM)

FEFF 0048 0065 006C 006C 006F (big-endian)

FFFE 4800 6500 6C00 6C00 6F00 (little-endian)

UCS-2
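
The two byte orders and their byte order marks can be checked with Python's built-in codecs. This is a small sketch assuming Python 3.8 or later (for the grouped `hex` output); the variable names are illustrative.

```python
import codecs

text = "Hello"
big = text.encode("utf-16-be")     # most significant byte first
little = text.encode("utf-16-le")  # least significant byte first

print(big.hex(" ", 2))     # two-byte groups, big-endian order
print(little.hex(" ", 2))  # same code units with the bytes swapped

# The BOM is U+FEFF written in the stream's own byte order.
print(codecs.BOM_UTF16_BE.hex(), codecs.BOM_UTF16_LE.hex())

# With a BOM in front, the generic "utf-16" decoder picks the right order.
decoded = (codecs.BOM_UTF16_BE + big).decode("utf-16")
```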
????? ? ??????? ??? ??? ??
????
“The Unicode encodings cons”

???

???? ????? ?? ??? ?? ??????? ??? ??? ????

???

???? ??????? ?? ??????? ????

???

???? ??????? ?? ?????? ??? ??????
UTF-8 ?????????
“The UTF-8 encoding”
UTF-8 ?????????
“UTF-8 encoding”

???? ??????
Bits  First      Last        Bytes  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
7     U+0000     U+007F      1      0xxxxxxx
11    U+0080     U+07FF      2      110xxxxx  10xxxxxx
16    U+0800     U+FFFF      3      1110xxxx  10xxxxxx  10xxxxxx
21    U+10000    U+1FFFFF    4      11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
26    U+200000   U+3FFFFFF   5      111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
31    U+4000000  U+7FFFFFFF  6      1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
UTF-8 ?????????
“UTF-8 encoding”
????? ?????

Byte 1 (one of): 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx, 111110xx, 1111110x

????? ??? ??????

Bytes 2 to 6 (each): 10xxxxxx

????? ????? ???? ??? ??????
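
The bit patterns above can be turned into a small hand-written encoder. This is only a sketch for illustration: the function name `utf8_encode` is hypothetical, it stops at four bytes because current UTF-8 (RFC 3629) ends at U+10FFFF, whereas the table shows the original design of up to six bytes, and it does not reject surrogate code points the way Python's built-in codec does.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling the lead/continuation bit patterns."""
    if code_point < 0x80:        # 7 bits  -> 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 11 bits -> 110xxxxx + 1 continuation byte
        lead, count = 0xC0, 1
    elif code_point < 0x10000:   # 16 bits -> 1110xxxx + 2 continuation bytes
        lead, count = 0xE0, 2
    elif code_point < 0x110000:  # 21 bits -> 11110xxx + 3 continuation bytes
        lead, count = 0xF0, 3
    else:
        raise ValueError("beyond U+10FFFF; not handled in this sketch")
    # Each continuation byte is 10xxxxxx and carries 6 payload bits.
    tail = [0x80 | ((code_point >> (6 * i)) & 0x3F)
            for i in range(count - 1, -1, -1)]
    # The lead byte carries whatever bits remain above the continuation bytes.
    return bytes([lead | (code_point >> (6 * count))] + tail)


for cp in (0x48, 0x628, 0x20AC, 0x1F600):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" "))
```

Because the lead byte announces the sequence length and every continuation byte starts with 10, a decoder can resynchronise from any position in the stream.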
??????? ??????? 8?UTF??
“The UTF-8 encoding pros”

???
???
???

?????? ?? ??????? ????
???????? ??? ????
???????? ???? ?? ??????? ????
ASCII: “Hello” = 48 65 6C 6C 6F
UTF-8: “Hello” = 48 65 6C 6C 6F
????? ??????? ??? ??? ??
????
“Other Unicode encodings”
• UCS-2 (LE/BE) + BOM
• UTF-16 (LE/BE) + BOM
• UTF-32 (LE/BE) + BOM
• UTF-7
????? ??????? ??? ??? ??
????
“Other Unicode encodings”
$ python
>>> unichr(202).encode('utf-16le')
'\xca\x00'
>>> unichr(202).encode('utf-16be')
'\x00\xca'
>>> unichr(202).encode('utf-16')
'\xff\xfe\xca\x00'
>>> unichr(202).encode('utf-32')
'\xff\xfe\x00\x00\xca\x00\x00\x00'
>>> unichr(202).encode('utf-7')
'+AMo-'
?????? ??????? ???
“Character encoding conversion”
??? ???? ?????
“A golden note”
???? ????? ???? ?? ????
????? ???? ???? ??????
“There is no such thing as plain text in memory”
??????? ?? ???? ?????
??????? ??? ?????? ???
??? ??? ????
“It does not make sense to have a string without knowing what encoding it uses”
??. ????? ??? ????????
“Sending the encoding type”
• HTTP

Content-Type: text/html; charset=UTF-8

• HTML 4

<meta http-equiv="Content-Type" content="text/html; charset=...">
??. ????? ??? ????????
“Sending the encoding type”
$ curl -I http://google.com
HTTP/1.1 301 Moved Permanently
Location: http://www.google.com/
Content-Type: text/html; charset=UTF-8
Date: Mon, 24 Feb 2014 12:32:10 GMT
Expires: Wed, 26 Mar 2014 12:32:10 GMT
Cache-Control: public, max-age=2592000
Server: gws
Content-Length: 219
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
$
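
On the receiving side, the declared charset has to be pulled out of the `Content-Type` header before the body can be decoded. The sketch below uses only the Python 3 standard library; the helper name `charset_from_content_type` is hypothetical, and real HTTP clients usually expose this value for you.

```python
from email.message import Message


def charset_from_content_type(header_value: str):
    """Return the declared charset in lower case, or None if none is declared."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

When the header carries no charset, the receiver has to fall back to a declaration inside the document or to guessing.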
??. ?????? ??? ????????
“Detecting the encoding type”

???? ??????? ???? ?????

?

?????? ? ??? ??????? ????? ??????? ?????

?

https://en.wikipedia.org/wiki/Charset_detection
http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html
??????? ?????? ??????? ??????
“Mozilla universal charset detection”

???
???
????? ???

?

????? ????? ?????

?

????? ????? ?? ??? ??????

??????? ?????? ??????? ??????
“Mozilla universal charset detection”

$ python
>>> import chardet
>>> "hello world".encode("utf16")
'\xff\xfeh\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00'
>>> chardet.detect("\xff\xfeh\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00")
{'confidence': 1.0, 'encoding': 'UTF-16LE'}
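
The detection above is certain (confidence 1.0) because the data begins with a byte order mark. That single case can be sketched with the standard library alone. This is not how chardet works in general (it also relies on statistical models for data without a BOM), and the name `sniff_bom` is hypothetical.

```python
import codecs

# Longer BOMs first: the UTF-32-LE BOM starts with the UTF-16-LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return an encoding name if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


sample = codecs.BOM_UTF16_LE + "hello world".encode("utf-16-le")
print(sniff_bom(sample))
```

Data without a BOM returns `None` here, which is exactly where a statistical detector has to take over.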
!??. ??? ??? ??? ????????
“Guess the type of encoding!”
???????? ?? ??? ??
??? ???
???? ?? ???? ?????
“Is using Unicode everywhere enough?”
Unicode, character encodings in programming and standard Persian keyboard layout
$ str="&#216;&#170;&#216;&#179;&#217;&#136;&#217;&#138;&#217;&#135;&#217;&#130;&#216;&#168;&#217;&#136;&#216;&#182;&#216;&#162;&#216;&#168;"
$ dec2hex(){ echo "obase=16; $1" | bc; }
$ echo "$str" | grep -o "[0-9]*" | while read num; do printf "\\x$(dec2hex "$num")"; done | chardet
stdin: utf-8 (confidence: 0.99)
$ python
>>> real_string = (chr(216)+chr(170)+chr(216)+chr(179)+...).decode("utf8").encode("ascii", "xmlcharrefreplace")
>>> real_string
'&#1578;&#1587;&#1608;&#1610;&#1607;&#1602;&#1576;&#1608;&#1590;&#1570;&#1576;'
>>> print real_string
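
The same repair can be written in Python 3. The sketch below assumes, as in the example above, that each numeric reference actually holds one byte of UTF-8 data. It parses the numbers with a regular expression rather than `html.unescape`, because `html.unescape` remaps references in the 128 to 159 range to Windows-1252 characters, which would corrupt the bytes.

```python
import re

# Each "&#NNN;" below is really one byte of a UTF-8 sequence.
broken = ("&#216;&#170;&#216;&#179;&#217;&#136;&#217;&#138;&#217;&#135;"
          "&#217;&#130;&#216;&#168;&#217;&#136;&#216;&#182;&#216;&#162;"
          "&#216;&#168;")

raw = bytes(int(n) for n in re.findall(r"&#(\d+);", broken))
text = raw.decode("utf-8")  # the text that was intended

# Re-emit proper references: one per code point instead of one per byte.
fixed_refs = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(len(raw), len(text))
print(fixed_refs)
```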
?????? ???? ?? ??????? ????? ?????? ?????
?????? ??? ?????
!??????????? ????? ???????
“You're not always the only content producer!”
???? ??
??? ? ????????
“Unicode and multilingual”
???? ??????
“Bi-Directional Text”

?? ??????Bi-Directional ????? ?? ????
RTL

LTR

RTL
???? ??????????
“Character directionality”

??????????? ?????
Weak Characters
???????

??????????? ?????
Strong Characters

??????????? ????
Neutral Characters

????? ??????

?????? ??? ????
????? ??????
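
The strong, weak, and neutral groups correspond to the bidirectional classes that Unicode assigns to every character, which Python exposes through `unicodedata.bidirectional`. A rough sketch follows; the grouping sets reflect the Unicode Bidirectional Algorithm's categories, and the helper name `direction_kind` is hypothetical.

```python
import unicodedata

STRONG = {"L", "R", "AL"}                           # letters with a fixed direction
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}  # digits, number punctuation, marks
NEUTRAL = {"B", "S", "WS", "ON"}                    # separators, spaces, other symbols


def direction_kind(ch: str) -> str:
    """Classify one character by its Unicode bidirectional class."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in STRONG:
        return "strong"
    if bidi in WEAK:
        return "weak"
    if bidi in NEUTRAL:
        return "neutral"
    return "explicit formatting or unassigned"


for ch in ("A", "\u0628", "1", " ", "!"):
    print(repr(ch), unicodedata.bidirectional(ch), direction_kind(ch))
```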
??????? ????????? ?????
“Persian standard keyboard”
?????? ????????? ??
??????? ? ???? ??????
“Institute of Standards and Industrial Research of Iran”
????????????? ???
????? ???? )??????( ?????
“National standards of Persian layout”
ISIRI 820

?????? ???? ???
?????? ???????
??????
ISIRI 2901

????? ?????????? ???? ??
????? ???? ???? ?? ????
????? ??????
??????
ISIRI 9147

???????? ???? ? ?????
????? ?? ???? ?????
?????????
?????????? ?? ?????
“ISIRI 9147”
???

???
????? ?? ????????? ??? ???

???

??????? ?? ????????? ?????

???

?????? ??? ????????? ??? ?????

???

?????? ?? ???? ???? ????????? ???? ? ????
??????? ????????? ?????
“Persian standard keyboard”
??????? ????????? ?????
“Persian standard keyboard”
?????
?????? ??????

???????? ???????
???
ZERO WIDTH JOINER

??? ????? ???????
???
???
U+200D

HTML code
&zwj;
????? ??? ?? ??? ??
??????? ????????? ?????
???
???

???
?`?

?`?

???
???

?`?

???

?`?
????? ??? ?? ??? ??
??????? ????????? ?????
????
?1?

?2?

?21?
?1?

??Num Lock??

?2?

??Num Lock??
??????? ????????? ?????
“Persian standard keyboard”
??????? ????????? ?????
“Persian standard keyboard”
?????
?????? ??????
?????? ??????

???????? ???????
???
ZERO WIDTH JOINER
ZERO WIDTH NON-JOINER

??? ????? ???????
???
???
U+200D (ZERO WIDTH JOINER)

HTML code
&zwj;

U+200C (ZERO WIDTH NON-JOINER)

&zwnj;
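
Both characters are invisible, so it can help to inspect them programmatically. The sketch below assumes Python 3; the sample word (Persian for "I want", written with a zero width non-joiner after the prefix) is an illustration and not taken from the slides.

```python
import html
import unicodedata

zwnj = html.unescape("&zwnj;")  # U+200C
zwj = html.unescape("&zwj;")    # U+200D

for ch in (zwnj, zwj):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# The non-joiner keeps the prefix from connecting to the next letter;
# it is invisible but still counts as a character in the string.
word = "\u0645\u06cc" + zwnj + "\u062e\u0648\u0627\u0647\u0645"
print(len(word), [f"U+{ord(c):04X}" for c in word])
```

Because the non-joiner is a real character, two strings that look identical on screen can still compare unequal if only one of them contains it.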
????? ??? ?? ??? ??
??????? ????????? ?????
???????
???

???

???

???

???
????? ??? ?? ??? ??
??????? ????????? ?????
?? ????
???

???

??Space??

???

???

???
????? ??? ?? ??? ??
??????? ????????? ?????
?? ????
???

???

Shift + Space

???

???

???
????? ??? ?? ??? ??
??????? ????????? ?????
?¡°??? ????¡±?
?¡°?

???

???

???

???

???

???

?¡°?
????? ??? ?? ??? ??
??????? ????????? ?????
????? ??????
??Shift??
???

?+?
???

??L??
???

???

???
??Shift??

???
?+?

??K??
??????? ????????? ?????
“Persian standard keyboard”
??????? ????????? ?????
“Persian standard keyboard”
HTML code  Unicode  Name
&rlm;      U+200F   RIGHT-TO-LEFT MARK
&lrm;      U+200E   LEFT-TO-RIGHT MARK
&#8235;    U+202B   RIGHT-TO-LEFT EMBEDDING
&#8234;    U+202A   LEFT-TO-RIGHT EMBEDDING
&#8236;    U+202C   POP DIRECTIONAL FORMATTING
??????? ????????? ?????
“Persian standard keyboard”

HTML code  Unicode  Name
&#8238;    U+202E   RIGHT-TO-LEFT OVERRIDE
&#8237;    U+202D   LEFT-TO-RIGHT OVERRIDE
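
The code points, decimal HTML references, and official names of the directional control characters can be cross-checked against Python's Unicode database. This is a short sketch assuming Python 3; the list and table names are illustrative.

```python
import unicodedata

# Directional marks, embeddings, pop, and overrides.
controls = ["\u200e", "\u200f", "\u202a", "\u202b", "\u202c", "\u202d", "\u202e"]

table = [
    (f"U+{ord(c):04X}", f"&#{ord(c)};", unicodedata.name(c))
    for c in controls
]
for code_point, html_ref, name in table:
    print(code_point, html_ref, name)
```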
????? ??? ?? ??? ??
??????? ????????? ?????
?? ...?
???

??Space??

?.?

?.?

?.?
????? ??? ?? ??? ??
??????? ????????? ?????
??¡­?
???

??Right Alt??

?+?

??m??
????? ??? ?? ??? ??
??????? ????????? ?????
??? ????? ??? ?Hello??
??H??

??e??

??l??

??l??

??o??

???

???

???

???

???

???

???

???

???

???

???
????? ??? ?? ??? ??
??????? ????????? ?????
?? Hello??? ????? ????
??+ right Alt??
???
???

???

?0?

??H??

??e??

??l??

??l??

??o??

???

???

???

???

???

???

???

???
????? ??? ?? ??? ??
??????? ????????? ?????
????? ????? ? /:c??????
???

???

???

???

???

???

???

??c??

?:?

?/?

???

???

???

???

???

???
????? ??? ?? ??? ??
??????? ????????? ?????
????? ????? /:? c??????
???

???

??+ right Alt??
???

???

???

???

???

???

?[?

??c??

?:?

?/?

???

???

???

???

??+ right Alt??

???
??p??
??????? ????????? ???? ??
???????? ?? ??????
Persian Standard keyboard
and backward compatibility
??? ???? ?? ???? ? ????? ????
Thank you for your time and patience
????? ? ?????
“Question/Answer”

???
???? ???
????? ???? ????????
