Daily Read: "Defining Python Source Code Encodings"


Official PEP text:

Abstract:
This PEP proposes to introduce a syntax to declare the encoding of a Python source file. The encoding information is then used by the Python parser to interpret the file using the given encoding. Most notably this enhances the interpretation of Unicode literals in the source code and makes it possible to write Unicode literals using e.g. UTF-8 directly in an Unicode aware editor.

Problem:
In Python 2.1, Unicode literals can only be written using the Latin-1 based encoding "unicode-escape". This makes the programming environment rather unfriendly to Python users who live and work in non-Latin-1 locales such as many of the Asian countries. Programmers can write their 8-bit strings using the favorite encoding, but are bound to the "unicode-escape" encoding for Unicode literals.

Proposed Solution:
I propose to make the Python source code encoding both visible and changeable on a per-source file basis by using a special comment at the top of the file to declare the encoding. To make Python aware of this encoding declaration a number of concept changes are necessary with respect to the handling of Python source code data.

Defining the Encoding:
Python will default to ASCII as standard encoding if no other encoding hints are given. To define a source code encoding, a magic comment must be placed into the source files either as first or second line in the file, such as:
# coding=<encoding name>
or (using formats recognized by popular editors):

#!/usr/bin/python
# -*- coding: <encoding name> -*-
or:

#!/usr/bin/python
# vim: set fileencoding=<encoding name> :
More precisely, the first or second line must match the following regular expression:

^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)
The first group of this expression is then interpreted as encoding name. If the encoding is unknown to Python, an error is raised during compilation. There must not be any Python statement on the line that contains the encoding declaration. If the first line matches the second line is ignored.
To aid with platforms such as Windows, which add Unicode BOM marks to the beginning of Unicode files, the UTF-8 signature \xef\xbb\xbf will be interpreted as 'utf-8' encoding as well (even if no magic encoding comment is given).
If a source file uses both the UTF-8 BOM mark signature and a magic encoding comment, the only allowed encoding for the comment is 'utf-8'. Any other encoding will cause an error.
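Translation notes: (my English reading ability is limited, so treat the following plain-language rendering only as a reading aid)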

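Abstract:
This PEP proposes a syntax for declaring the encoding of a Python source file. The Python parser then uses the declared encoding to interpret the file. Most notably, this improves how Unicode literals in source code are interpreted, and it makes it possible to write Unicode literals directly (in UTF-8, for example) in a Unicode-aware editor.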

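Problem:
In Python 2.1, Unicode literals can only be written using the Latin-1 based "unicode-escape" encoding. That is unfriendly to Python users in non-Latin-1 locales, such as many Asian countries. Programmers can write their 8-bit strings in whatever encoding they prefer, but Unicode literals are tied to "unicode-escape".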

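Proposed solution:
The author proposes declaring the encoding with a special comment at the top of the source file, so that the encoding of each Python source file is both visible and changeable on a per-file basis.
For Python to be aware of this declaration, several conceptual changes are needed in how Python source code data is handled.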

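Defining the encoding:
If no other encoding hint is given, Python defaults to ASCII as the standard encoding.
To define a source code encoding, a magic comment must be placed on the first or second line of the source file, for example: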

# coding=<your encoding>
or (using formats recognized by popular editors):

#!/usr/bin/python
# -*- coding: <your encoding> -*-
or:

#!/usr/bin/python
# vim: set fileencoding=<your encoding> :
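More precisely, the first or second line must match the following regular expression:

^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)
The first group of this expression is taken as the encoding name. If Python does not know the encoding, an error is raised at compile time. The line that contains the encoding declaration must not also contain a Python statement. If the first line matches, the second line is ignored.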
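The matching rule can be tried out with a short Python sketch. The function name declared_encoding is made up for illustration, and the pattern is written with the spaces inside the character classes, as PEP 263 specifies. It only mirrors the "first or second line" rule and does not check whether the named codec actually exists.

```python
import re

# Pattern from PEP 263 as quoted in this post; note the literal spaces
# inside the two character classes.
CODING_RE = re.compile(r"^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source_lines):
    """Return the encoding named on line 1 or 2, or None if neither line matches.

    Hypothetical helper for illustration only; it is not part of the
    standard library.
    """
    for line in source_lines[:2]:  # only the first two lines are inspected
        match = CODING_RE.match(line)
        if match:
            return match.group(1)  # first line wins; the second is ignored
    return None


print(declared_encoding(["#!/usr/bin/python\n", "# -*- coding: utf-8 -*-\n"]))
print(declared_encoding(["import os\n", "x = 1\n", "# coding: utf-8\n"]))
```

The second call returns None because the declaration sits on the third line, which the rule never looks at.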
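To help with platforms such as Windows, which add a Unicode BOM to the beginning of Unicode files, the UTF-8 signature \xef\xbb\xbf is also interpreted as 'utf-8' encoding (even if no magic encoding comment is given).
If a source file uses both the UTF-8 BOM and a magic encoding comment, the only encoding allowed in the comment is 'utf-8'. Any other encoding causes an error.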
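The BOM rule can be observed from Python 3 with the standard library's tokenize.detect_encoding, which applies the same detection logic to raw bytes. This is a rough sketch: the helper name sniff and the sample byte strings are made up for illustration.

```python
import codecs
import io
import tokenize


def sniff(source_bytes):
    """Return the encoding that tokenize.detect_encoding reports for raw source bytes."""
    encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# A file that starts with the UTF-8 BOM and has no magic comment.
bom_only = codecs.BOM_UTF8 + b"print('hi')\n"

# A file that starts with the UTF-8 BOM but declares a different encoding.
bom_conflict = codecs.BOM_UTF8 + b"# -*- coding: latin-1 -*-\n"

print(sniff(bom_only))

try:
    sniff(bom_conflict)
except SyntaxError as exc:
    # The BOM and the declaration disagree, so detection fails.
    print("rejected:", exc)
```

On current CPython versions the BOM-only file is reported as utf-8-sig, the codec name for UTF-8 with a signature, and the conflicting file raises SyntaxError.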

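Summary:
This PEP is a proposal about declaring the encoding of a source file. It is where this familiar line comes from:
#-*- coding:utf-8 -*-
The line tells the parser that the file is encoded in UTF-8, which is what allows Chinese comments and string literals to appear in the source.
One thing worth knowing: the encoding of a Python source file is not fixed and can be changed per file.
Note that the Chinese-encoding problems discussed above mainly affect the Python 2 series. In Python 3 the default source encoding is UTF-8 rather than ASCII, so a file saved as UTF-8 no longer needs the
#-*- coding:utf-8 -*- comment; a file saved in any other encoding still needs a declaration.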
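The Python 3 behaviour can be checked by compiling raw source bytes directly. This is a minimal sketch, assuming Python 3; the helper name run_source and the sample assignment are made up for illustration.

```python
def run_source(source_bytes):
    """Compile and execute raw source bytes, returning the resulting namespace."""
    namespace = {}
    exec(compile(source_bytes, "<demo>", "exec"), namespace)
    return namespace


# One assignment whose string literal holds two Chinese characters
# (written with escapes here so this example stays pure ASCII).
text = "s = '\u4e2d\u6587'\n"

# UTF-8 bytes with no declaration: accepted, since UTF-8 is the default.
utf8_ns = run_source(text.encode("utf-8"))

# GBK bytes with a matching declaration: accepted as well.
gbk_ns = run_source(("# -*- coding: gbk -*-\n" + text).encode("gbk"))

# GBK bytes without a declaration: not valid UTF-8, so compilation fails.
try:
    run_source(text.encode("gbk"))
    gbk_undeclared_ok = True
except SyntaxError:
    gbk_undeclared_ok = False

print(utf8_ns["s"] == gbk_ns["s"], gbk_undeclared_ok)
```

The first two cases produce the same string, while the third fails, which is why the declaration still matters for files that are not saved as UTF-8.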
