Python 3 Source Code


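8.1. How a Python Program Is Executed

Whenever the Python interpreter runs a Python program file, it first compiles the source code in that file. The main product of compilation is a sequence of Python bytecode, which is handed to the Python virtual machine (VM). The VM then executes the bytecode instructions one by one, in order, and that is how the program actually runs.

From the compiler's point of view, the PyCodeObject is the real result of compilation; a pyc file is merely the on-disk representation of that object. The two are different forms of the same thing: the result of compiling a source file.

While the program is running, the compilation result lives in memory as a PyCodeObject. For a module that is imported, Python also saves the result to a pyc file. The next time the same module is loaded, Python rebuilds the in-memory PyCodeObject directly from the pyc file, provided the source has not changed, instead of compiling the source again.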
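The compile-then-execute split described above can be observed from Python itself. The following is a minimal sketch that uses only the built-in compile() and exec(); the source string and the "<demo>" filename are made up for illustration.

```python
import types

source = "x = 6 * 7\n"

# Step 1: compile the source text into a code object (the in-memory PyCodeObject).
code = compile(source, "<demo>", "exec")

# Step 2: hand the code object to the interpreter, which executes its bytecode.
namespace = {}
exec(code, namespace)

print(type(code).__name__)  # code
print(namespace["x"])       # 42
```

The value returned by compile() is the Python-level view of the PyCodeObject; no pyc file is involved at this point.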

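Once the overall flow is clear, it is entirely feasible to write a tool that parses a pyc file generated by Python 3.7 and presents its contents as JSON.

[Figure: contents of a pyc file organized as JSON]

The tool exists only to deepen understanding of the flow. In practice, Python's dis module produces more detailed and more readable output.

[Figure: output of the dis module]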
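As a rough illustration of what the dis module offers, the sketch below lists the instructions of a tiny compiled snippet. The exact opcodes differ between Python versions, so no particular listing should be assumed; the source string is invented for the example.

```python
import dis

code = compile("a = 1\n", "<demo>", "exec")

# dis.get_instructions yields one Instruction per bytecode operation.
instructions = list(dis.get_instructions(code))
for ins in instructions:
    print(ins.offset, ins.opname, ins.argrepr)

# Module-level assignments are stored by name.
stored_names = [ins.argval for ins in instructions if ins.opname == "STORE_NAME"]
```

Calling dis.dis(code) prints a similar listing directly, including nested code objects.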

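8.2. PyCodeObject Source

    // code.h
    typedef struct {
        PyObject_HEAD
        int co_argcount;            /* #arguments, except *args */
        int co_kwonlyargcount;      /* #keyword only arguments */
        int co_nlocals;             /* #local variables */
        int co_stacksize;           /* #entries needed for evaluation stack */
        int co_flags;               /* CO_... flags */
        int co_firstlineno;         /* first source line number */
        PyObject *co_code;          /* instruction opcodes */
        PyObject *co_consts;        /* constants used */
        PyObject *co_names;         /* names used */
        PyObject *co_varnames;      /* local variable names */
        PyObject *co_freevars;      /* free variable names */
        PyObject *co_cellvars;      /* cell variable names */
        Py_ssize_t *co_cell2arg;    /* maps cell vars which are arguments */
        PyObject *co_filename;      /* where the code was loaded from */
        PyObject *co_name;          /* name, for reference */
        PyObject *co_lnotab;        /* address <-> line number mapping */
        void *co_zombieframe;
        PyObject *co_weakreflist;
        void *co_extra;
    } PyCodeObject;

Code block: when the Python compiler compiles source code, it creates one PyCodeObject for each code block. Entering a new namespace, or scope, means entering a new code block. For example, the code below has three code blocks: one for the whole test.py file, one for class A, and one for def Fun.

    # test.py
    class A:
        pass

    def Fun():
        pass

    a = A()
    Fun()

Namespace: a namespace is the context in which a symbol lives, and the meaning of a symbol depends on its namespace. More concretely, the value bound to a variable name is not fixed in Python; it is determined by the namespace. Each code block corresponds to one namespace and to one PyCodeObject.

The code object in Python: Python exposes a code object that corresponds to the C-level PyCodeObject. It is a thin wrapper around the C struct, and through it the fields of the PyCodeObject can be accessed.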
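The claim that test.py yields three code blocks can be checked from Python. This is a small sketch, assuming that nested code objects are stored in the enclosing object's co_consts (which is how CPython arranges them); iter_code_objects is a helper name introduced here, not part of any library.

```python
import types

SOURCE = """\
class A:
    pass

def Fun():
    pass

a = A()
Fun()
"""

def iter_code_objects(code):
    """Yield a code object and, recursively, every code object nested in its constants."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_code_objects(const)

module_code = compile(SOURCE, "test.py", "exec")
block_names = [c.co_name for c in iter_code_objects(module_code)]
print(block_names)
```

The module-level code object reports its name as "<module>"; the other two carry the names of the class and the function.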

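8.3. Generating a pyc File

    # pyc_generator.py
    import imp
    import sys


    def generate_pyc(name):
        # Locate the module by name, then load it; loading triggers compilation
        # and, as a side effect, the writing of the pyc file.
        fp, pathname, description = imp.find_module(name)
        try:
            imp.load_module(name, fp, pathname, description)
        finally:
            if fp:
                fp.close()


    if __name__ == '__main__':
        generate_pyc(sys.argv[1])

Running the following command on the command line generates the pyc file:

    $ ./python3.7 pyc_generator.py test

8.3.1. The Flow That Produces the PyCodeObject and the pyc File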

Starting from imp.load_module in pyc_generator.py above, the call sequence is:

    // imp.py
    load_module
      => load_source
    // _bootstrap.py [1]
      => _load
      => _load_unlocked
    // _bootstrap_external.py
      => exec_module
      => get_code

get_code calls source_to_code to produce the PyCodeObject, calls _code_to_timestamp_pyc to turn the PyCodeObject into binary data, and calls _cache_bytecode to write that binary data to a file.
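The three steps performed by get_code can be approximated with public APIs. The sketch below is not the importlib implementation; build_timestamp_pyc and write_pyc are names invented here, and the 16-byte header layout is the one used by _code_to_timestamp_pyc in section 8.3.3 (Python 3.7 and later).

```python
import importlib.util
import marshal
import struct

def build_timestamp_pyc(source_text, filename, mtime=0):
    """Return pyc bytes for source_text, mirroring the three-step flow above."""
    # Step 1 (source_to_code): compile the source into a code object.
    code = compile(source_text, filename, "exec")
    # Step 2 (_code_to_timestamp_pyc): 16-byte header, then the marshalled code.
    source_size = len(source_text.encode("utf-8"))
    header = importlib.util.MAGIC_NUMBER + struct.pack(
        "<III", 0, mtime & 0xFFFFFFFF, source_size & 0xFFFFFFFF
    )
    return header + marshal.dumps(code)

def write_pyc(path, data):
    # Step 3 (_cache_bytecode): persist the bytes; the real code writes atomically.
    with open(path, "wb") as fp:
        fp.write(data)

data = build_timestamp_pyc("y = 2 + 3\n", "demo.py")

# Reading the code back: skip the header, unmarshal, and execute.
namespace = {}
exec(marshal.loads(data[16:]), namespace)
print(namespace["y"])  # 5
```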

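It is worth noting that a real Python build does not call _load from _bootstrap.py (marked [1] in the call sequence above). In Lib/importlib/__init__.py:

    # __init__.py
    try:
        import _frozen_importlib as _bootstrap
    except ImportError:
        from . import _bootstrap
        _bootstrap._setup(sys, _imp)
    else:
        pass  # do sth
    try:
        import _frozen_importlib_external as _bootstrap_external
    except ImportError:
        from . import _bootstrap_external
        _bootstrap_external._setup(_bootstrap)
        _bootstrap._bootstrap_external = _bootstrap_external
    else:
        pass  # do sth

So what actually runs is _load from _frozen_importlib, not _load from _bootstrap. The contents of that library are defined in Python/importlib.h. The reason for this arrangement is not obvious at first; a likely explanation is that the import machinery must be available before any module can be imported from disk, so a precompiled ("frozen") copy of _bootstrap.py is embedded in the interpreter. For the purposes of this analysis the frozen module is treated as _bootstrap, because the Python source is easier to read.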
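Whether a given interpreter uses the frozen module can be checked at runtime. This is a small sketch that assumes CPython and relies on the private attribute importlib._bootstrap, which may change between versions.

```python
import importlib
import sys

# On CPython, importlib binds its _bootstrap attribute to the frozen module
# when that module is available.
frozen = sys.modules.get("_frozen_importlib")
uses_frozen = frozen is not None and importlib._bootstrap is frozen
print(uses_frozen)
```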

The following sections look in detail at generating the PyCodeObject, converting it to binary data, and writing that data to a file.

8.3.2. Generating the PyCodeObject

    // _bootstrap_external.py
    source_to_code
    // _bootstrap.py
      => _call_with_frames_removed
    // bltinmodule.c
      => builtin_compile_impl

The C source of builtin_compile_impl is:

    // bltinmodule.c
    static PyObject *
    builtin_compile_impl(PyObject *module, PyObject *source, PyObject *filename,
                         const char *mode, int flags, int dont_inherit,
                         int optimize)
    {
        PyObject *source_copy;
        const char *str;
        int compile_mode = -1;
        int is_ast;
        PyCompilerFlags cf;
        int start[] = {Py_file_input, Py_eval_input, Py_single_input};
        PyObject *result;

        cf.cf_flags = flags | PyCF_SOURCE_IS_UTF8;

        if (flags &
            ~(PyCF_MASK | PyCF_MASK_OBSOLETE | PyCF_DONT_IMPLY_DEDENT | PyCF_ONLY_AST))
        {
            PyErr_SetString(PyExc_ValueError,
                            "compile(): unrecognised flags");
            goto error;
        }
        /* XXX Warn if (supplied_flags & PyCF_MASK_OBSOLETE) != 0? */

        if (optimize < -1 || optimize > 2) {
            PyErr_SetString(PyExc_ValueError,
                            "compile(): invalid optimize value");
            goto error;
        }

        if (!dont_inherit) {
            PyEval_MergeCompilerFlags(&cf);
        }

        /* Map the mode string to an index into start[]. */
        if (strcmp(mode, "exec") == 0)
            compile_mode = 0;
        else if (strcmp(mode, "eval") == 0)
            compile_mode = 1;
        else if (strcmp(mode, "single") == 0)
            compile_mode = 2;
        else {
            PyErr_SetString(PyExc_ValueError,
                            "compile() mode must be 'exec', 'eval' or 'single'");
            goto error;
        }

        is_ast = PyAST_Check(source);
        if (is_ast == -1)
            goto error;
        if (is_ast) {
            // do sth.
        }

        str = source_as_string(source, "compile", "string, bytes or AST", &cf, &source_copy);
        if (str == NULL)
            goto error;

        result = Py_CompileStringObject(str, filename, start[compile_mode], &cf, optimize);
        Py_XDECREF(source_copy);
        goto finally;

    error:
        result = NULL;
    finally:
        Py_DECREF(filename);
        return result;
    }
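The argument checks in builtin_compile_impl are visible from the Python side through the built-in compile(). The following sketch exercises two accepted modes and two rejected inputs; the snippets being compiled are arbitrary.

```python
# "exec" compiles a module body and "eval" a single expression;
# "single" (not shown) compiles one interactive statement.
exec_code = compile("value = 1 + 2", "<demo>", "exec")
eval_code = compile("1 + 2", "<demo>", "eval")

namespace = {}
exec(exec_code, namespace)
result = eval(eval_code)

# Any other mode string is rejected with ValueError, as in the C code above.
try:
    compile("1 + 2", "<demo>", "run")
    bad_mode_rejected = False
except ValueError:
    bad_mode_rejected = True

# optimize must lie in the range -1..2.
try:
    compile("1 + 2", "<demo>", "eval", optimize=5)
    bad_optimize_rejected = False
except ValueError:
    bad_optimize_rejected = True
```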

Here:

source_as_string is called to load the test.py source shown earlier into memory.

Py_CompileStringObject is called to produce the PyCodeObject:

    // pythonrun.c
    PyObject *
    Py_CompileStringObject(const char *str, PyObject *filename, int start,
                           PyCompilerFlags *flags, int optimize)
    {
        PyCodeObject *co;
        mod_ty mod;
        PyArena *arena = PyArena_New();
        if (arena == NULL)
            return NULL;

        /* Source text -> AST */
        mod = PyParser_ASTFromStringObject(str, filename, start, flags, arena);
        if (mod == NULL) {
            PyArena_Free(arena);
            return NULL;
        }
        if (flags && (flags->cf_flags & PyCF_ONLY_AST)) {
            PyObject *result = PyAST_mod2obj(mod);
            PyArena_Free(arena);
            return result;
        }
        /* AST -> code object */
        co = PyAST_CompileObject(mod, filename, flags, optimize, arena);
        PyArena_Free(arena);
        return (PyObject *)co;
    }

PyParser_ASTFromStringObject builds the syntax tree, and PyAST_CompileObject turns it into the PyCodeObject. Parsing and compilation are not analyzed further here.
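The same two stages are reachable from Python through the ast module, which gives a convenient way to look at the intermediate tree. A minimal sketch, with an invented one-line source:

```python
import ast

source = "total = sum(range(4))\n"

# Stage 1: source text -> abstract syntax tree.
tree = ast.parse(source, filename="<demo>", mode="exec")

# Stage 2: abstract syntax tree -> code object.
code = compile(tree, "<demo>", "exec")

namespace = {}
exec(code, namespace)
print(namespace["total"])  # 6

# With PyCF_ONLY_AST the compiler stops after stage 1 and returns the tree,
# which corresponds to the PyCF_ONLY_AST branch in Py_CompileStringObject.
only_ast = compile(source, "<demo>", "exec", flags=ast.PyCF_ONLY_AST)
```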

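8.3.3. Converting the PyCodeObject to Binary Data

_code_to_timestamp_pyc converts the PyCodeObject into binary data. Its source is:

    # _bootstrap_external.py
    def _code_to_timestamp_pyc(code, mtime=0, source_size=0):
        "Produce the data for a timestamp-based pyc."
        data = bytearray(MAGIC_NUMBER)
        data.extend(_w_long(0))
        data.extend(_w_long(mtime))
        data.extend(_w_long(source_size))
        data.extend(marshal.dumps(code))
        return data

A pyc file therefore consists of the following parts:

MAGIC_NUMBER: each version of the Python implementation defines its own MAGIC_NUMBER, for example 3392 for Python 3.7a0 and 3360 for Python 3.6a0, which prevents incompatible pyc files from being loaded.
0: a 32-bit flags field, added in Python 3.7 by PEP 552. The value 0 marks a timestamp-based pyc; non-zero values mark hash-based pycs.
mtime: the time the py file was created or last modified. If the modification time has not changed, the pyc file is considered up to date and does not need to be rewritten.
source_size: the size of the source file.
marshal.dumps(code): the binary stream of the PyCodeObject.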
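The header fields listed above can be read back from a real pyc file. The sketch below compiles a throwaway file with py_compile, forcing the timestamp mode so that the flags field is 0, and unpacks the first 16 bytes; read_pyc_header is a helper name made up for this example, and the layout assumes Python 3.7 or later.

```python
import importlib.util
import os
import py_compile
import struct
import tempfile

def read_pyc_header(pyc_path):
    """Return (magic, flags, mtime, source_size) from a Python 3.7+ pyc file."""
    with open(pyc_path, "rb") as fp:
        header = fp.read(16)
    magic = header[:4]
    # Three little-endian unsigned 32-bit integers follow the magic number.
    flags, mtime, source_size = struct.unpack("<III", header[4:16])
    return magic, flags, mtime, source_size

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo.py")
    with open(src, "w", encoding="utf-8") as fp:
        fp.write("x = 1\n")
    pyc = py_compile.compile(
        src,
        cfile=os.path.join(tmp, "demo.pyc"),
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
    )
    magic, flags, mtime, source_size = read_pyc_header(pyc)
    expected_mtime = int(os.stat(src).st_mtime) & 0xFFFFFFFF
    expected_size = os.path.getsize(src)

print(int.from_bytes(magic[:2], "little"), flags, mtime, source_size)
```

The first two bytes of the magic field hold the version number; the remaining two are the characters "\r\n".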

marshal.dumps calls marshal_dumps_impl:

    // marshal.c
    static PyObject *
    marshal_dumps_impl(PyObject *module, PyObject *value, int version)
    /*[clinic end generated code: output=9c200f98d7256cad input=a2139ea8608e9b27]*/
    {
        return PyMarshal_WriteObjectToString(value, version);
    }

The source of PyMarshal_WriteObjectToString is:

    // marshal.c
    PyObject *
    PyMarshal_WriteObjectToString(PyObject *x, int version)
    {
        WFILE wf;

        memset(&wf, 0, sizeof(wf));
        /* Start with a 50-byte buffer; it grows as data is written. */
        wf.str = PyBytes_FromStringAndSize((char *)NULL, 50);
        if (wf.str == NULL)
            return NULL;
        wf.ptr = wf.buf = PyBytes_AS_STRING((PyBytesObject *)wf.str);
        wf.end = wf.ptr + PyBytes_Size(wf.str);
        wf.error = WFERR_OK;
        wf.version = version;
        if (w_init_refs(&wf, version)) {
            Py_DECREF(wf.str);
            return NULL;
        }
        w_object(x, &wf);
        w_clear_refs(&wf);
        if (wf.str != NULL) {
            char *base = PyBytes_AS_STRING((PyBytesObject *)wf.str);
            if (wf.ptr - base > PY_SSIZE_T_MAX) {
                Py_DECREF(wf.str);
                PyErr_SetString(PyExc_OverflowError,
                                "too much marshal data for a bytes object");
                return NULL;
            }
            /* Trim the buffer to the number of bytes actually written. */
            if (_PyBytes_Resize(&wf.str, (Py_ssize_t)(wf.ptr - base)) < 0)
                return NULL;
        }
        if (wf.error != WFERR_OK) {
            Py_XDECREF(wf.str);
            if (wf.error == WFERR_NOMEMORY)
                PyErr_NoMemory();
            else
                PyErr_SetString(PyExc_ValueError,
                  (wf.error==WFERR_UNMARSHALLABLE)?"unmarshallable object"
                   :"object too deeply nested to marshal");
            return NULL;
        }
        return wf.str;
    }
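From Python, the effect of PyMarshal_WriteObjectToString can be observed through marshal.dumps and marshal.loads. The sketch below round-trips a code object and also triggers the "unmarshallable object" error path; the compiled snippet is arbitrary, and the mask 0x7F is used on the assumption that the high bit of the type byte carries the reference flag.

```python
import marshal
import types

code = compile("answer = 40 + 2\n", "<demo>", "exec")
data = marshal.dumps(code)

# The first byte is the type tag; masking with 0x7F drops the high flag bit.
type_tag = chr(data[0] & 0x7F)

restored = marshal.loads(data)
namespace = {}
exec(restored, namespace)

# Objects marshal cannot serialise raise ValueError (the WFERR_UNMARSHALLABLE case).
try:
    marshal.dumps(object())
    rejected = False
except ValueError:
    rejected = True

print(type_tag, namespace["answer"], rejected)
```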

The key function here is w_object, which calls w_complex_object. The actual conversion of the PyCodeObject into binary data happens inside w_complex_object:

    // marshal.c
    static void
    w_complex_object(PyObject *v, char flag, WFILE *p)
    {
        // do sth.
        else if (PyCode_Check(v)) {
            PyCodeObject *co = (PyCodeObject *)v;
            W_TYPE(TYPE_CODE, p);
            w_long(co->co_argcount, p);
            w_long(co->co_kwonlyargcount, p);
            w_long(co->co_nlocals, p);
            w_long(co->co_stacksize, p);
            w_long(co->co_flags, p);
            w_object(co->co_code, p);
            w_object(co->co_consts, p);
            w_object(co->co_names, p);
            w_object(co->co_varnames, p);
            w_object(co->co_freevars, p);
            w_object(co->co_cellvars, p);
            w_object(co->co_filename, p);
            w_object(co->co_name, p);
            w_long(co->co_firstlineno, p);
            w_object(co->co_lnotab, p);
        }
        // do sth.
    }

Several things follow from this:

The type tag of a PyCodeObject is TYPE_CODE. The test.py file from section 8.2 produces three PyCodeObject objects, with one PyCodeObject nesting the other two.

Integer fields such as co_argcount and co_kwonlyargcount are written with w_long (which calls w_byte to write four bytes). Object fields such as co_code and co_consts are written with w_object (which in turn ends up calling w_long, w_string and similar functions). The meaning of each field will be analyzed in depth later.

There is one special type worth noting, TYPE_REF, which saves storage space. Take co_filename, which holds the full path of the py file. These are the co_filename values in the pyc file generated for test.py:

    // class A
    "co_filename": {
        "type": "unicode",
        "size": 49,
        "value": "/Users/l.wang/Documents/pythonindepth/bin/test.py"
    }
    // def Fun
    "co_filename": {
        "type": "ref",
        "ref": 6
    }
    // test.py
    "co_filename": {
        "type": "ref",
        "ref": 6
    }

The sharing is implemented by w_ref, whose source follows. It maintains a hash table whose key is the address of an object and whose value is an index. If an object with the same address is already in the table, w_ref writes the TYPE_REF type and the index instead of the object, which saves space.

    // marshal.c
    static int
    w_ref(PyObject *v, char *flag, WFILE *p)
    {
        _Py_hashtable_entry_t *entry;
        int w;

        if (p->version < 3 || p->hashtable == NULL) {
            return 0; /* not writing object references */
        }

        /* if it has only one reference, it definitely isn't shared */
        if (Py_REFCNT(v) == 1) {
            return 0;
        }

        entry = _Py_HASHTABLE_GET_ENTRY(p->hashtable, v);
        if (entry != NULL) {
            /* write the reference index to the stream */
            _Py_HASHTABLE_ENTRY_READ_DATA(p->hashtable, entry, w);
            /* we don't store "long" indices in the dict */
            assert(0 <= w && w <= 0x7fffffff);
            w_byte(TYPE_REF, p);
            w_long(w, p);
            return 1;
        } else {
            size_t s = p->hashtable->entries;
            /* we don't support long indices */
            if (s >= 0x7fffffff) {
                PyErr_SetString(PyExc_ValueError, "too many objects");
                goto err;
            }
            w = (int)s;
            Py_INCREF(v);
            if (_Py_HASHTABLE_SET(p->hashtable, v, w) < 0) {
                Py_DECREF(v);
                goto err;
            }
            *flag |= FLAG_REF;
            return 0;
        }
    err:
        p->error = WFERR_UNMARSHALLABLE;
        return 1;
    }
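The space saving from TYPE_REF can be seen by marshalling the same object twice under different format versions. This sketch relies on the version check at the top of w_ref (references are only written for version 3 and above); the path-like string is invented for the example.

```python
import marshal

# A string that is referenced more than once, so marshal treats it as shareable.
shared = "/some/long/path/" * 4
pair = (shared, shared)

with_refs = marshal.dumps(pair, 4)     # version >= 3: second occurrence becomes a reference
without_refs = marshal.dumps(pair, 2)  # version < 3: the string is written twice

print(len(with_refs), len(without_refs))

# On loading, a reference resolves to the very same object as the first occurrence.
first, second = marshal.loads(with_refs)
```

With references enabled the second occurrence costs a type byte plus a four-byte index rather than a second copy of the string.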

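The reverse process is implemented as follows. If flag is non-zero, the actual value is appended to the refs list. If the type is TYPE_REF, the real value is fetched from the refs list using the index that was read.

    // marshal.c
    static PyObject *
    r_object(RFILE *p)
    {
        PyObject *v, *v2;
        Py_ssize_t idx = 0;
        long i, n;
        int type, code = r_byte(p);
        int flag, is_interned = 0;
        PyObject *retval = NULL;

        if (code == EOF) {
            PyErr_SetString(PyExc_EOFError,
                            "EOF read where object expected");
            return NULL;
        }

        p->depth++;

        if (p->depth > MAX_MARSHAL_STACK_DEPTH) {
            p->depth--;
            PyErr_SetString(PyExc_ValueError, "recursion limit exceeded");
            return NULL;
        }

        /* Split the leading byte into the reference flag and the type tag. */
        flag = code & FLAG_REF;
        type = code & ~FLAG_REF;

    #define R_REF(O) do{\
        if (flag) \
            O = r_ref(O, flag, p);\
    } while (0)

        switch (type) {
        // do sth.
        case TYPE_REF:
            n = r_long(p);
            if (n < 0 || n >= PyList_GET_SIZE(p->refs)) {
                if (n == -1 && PyErr_Occurred())
                    break;
                PyErr_SetString(PyExc_ValueError, "bad marshal data (invalid reference)");
                break;
            }
            v = PyList_GET_ITEM(p->refs, n);
            if (v == Py_None) {
                PyErr_SetString(PyExc_ValueError, "bad marshal data (invalid reference)");
                break;
            }
            Py_INCREF(v);
            retval = v;
            break;
        // do sth.
        }
    }

One open question remains: why does w_ref not consult flag, as r_object does, to decide which objects go into the hash table? A likely answer, judging from the code above, is that the two sides play different roles. w_ref is where the flag is produced: it registers every object with more than one reference the first time it sees it and sets FLAG_REF in that object's type byte. r_object only consumes the flag, to learn which objects the writer registered.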

8.3.4. Writing the Binary Data to a File

_cache_bytecode writes the binary data to a file. Its source is:

    # _bootstrap_external.py
    def _cache_bytecode(self, source_path, bytecode_path, data):
        # Adapt between the two APIs
        mode = _calc_mode(source_path)
        return self.set_data(bytecode_path, data, _mode=mode)

The source of set_data is:

    def set_data(self, path, data, *, _mode=0o666):
        """Write bytes data to a file."""
        parent, filename = _path_split(path)
        path_parts = []
        # Figure out what directories are missing.
        while parent and not _path_isdir(parent):
            parent, part = _path_split(parent)
            path_parts.append(part)
        # Create needed directories.
        for part in reversed(path_parts):
            parent = _path_join(parent, part)
            try:
                _os.mkdir(parent)
            except FileExistsError:
                # Probably another Python process already created the dir.
                continue
            except OSError as exc:
                # Could be a permission error, read-only filesystem: just forget
                # about writing the data.
                _bootstrap._verbose_message('could not create {!r}: {!r}',
                                            parent, exc)
                return
        try:
            _write_atomic(path, data, _mode)
            _bootstrap._verbose_message('created {!r}', path)
        except OSError as exc:
            # Same as above: just don't write the bytecode.
            _bootstrap._verbose_message('could not create {!r}: {!r}', path,
                                        exc)
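The best-effort behaviour of set_data (create missing directories, write through a temporary file, and silently give up on OS errors) can be sketched with public os calls. This is an illustration rather than the importlib code: best_effort_write is a name invented here, and using the process id in the temporary filename is an arbitrary choice for the sketch.

```python
import os
import tempfile

def best_effort_write(path, data):
    """Write data to path via a temporary file and a rename; return True on success."""
    try:
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    except OSError:
        return False  # e.g. permission error or read-only filesystem: give up quietly
    tmp_path = "{}.{}".format(path, os.getpid())  # hypothetical temp-name scheme
    try:
        fp = open(tmp_path, "xb")  # "x" fails if the temporary file already exists
    except OSError:
        return False
    try:
        with fp:
            fp.write(data)
        os.replace(tmp_path, path)  # rename over the target; atomic on POSIX
        return True
    except OSError:
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        return False

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "cache", "nested", "demo.pyc")
    ok = best_effort_write(target, b"abc")
    with open(target, "rb") as fp:
        written = fp.read()
    leftovers = os.listdir(os.path.dirname(target))
```

Because the data only appears under its final name after the rename, a reader never sees a partially written file.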

The key function for writing the file is _write_atomic, shown below. It writes to a temporary file and then renames it, which guarantees one of two outcomes: either an exception occurs and no file with the target name is produced, or no exception occurs and the file with the specified name is created.

    def _write_atomic(path, data, mode=0o666):
        """Best-effort function to write data to a path atomically.
        Be prepared to handle a FileExistsError if concurrent writing of the
        temporary file is attempted."""
        # id() is used to generate a pseudo-random filename.
        path_tmp = '{}.{}'.format(path, id(path))
        fd = _os.open(path_tmp,
                      _os.O_EXCL | _os.O_CREAT | _os.O_WRONLY, mode & 0o666)
        try:
            # We first write data to a temporary file, and then use os.replace() to
            # perform an atomic rename.
            with _io.FileIO(fd, 'wb') as file:
                file.write(data)
            _os.replace(path_tmp, path)
        except OSError:
            try:
                _os.unlink(path_tmp)
            except OSError:
                pass
            raise

8.4. References

Python源码剖析

8.5. Appendix

With the pyc generation flow understood, the tool mentioned in section 8.1 can be implemented. Its source is as follows:

    # -*- coding:utf-8 -*-
    import json
    import datetime
    import sys

    FLAG_REF = ord('\x80')
    TYPE_CODE = ord('c')
    TYPE_STRING = ord('s')
    TYPE_SMALL_TUPLE = ord(')')
    TYPE_INT = ord('i')
    TYPE_SHORT_ASCII = ord('z')
    TYPE_SHORT_ASCII_INTERNED = ord('Z')
    TYPE_REF = ord('r')
    TYPE_NONE = ord('N')

    REFS_HASH = {}


    def parse_code(fp):
        # The leading byte holds the type tag plus the FLAG_REF bit.
        code = int.from_bytes(fp.read(1), 'little')
        code_type = code & ~FLAG_REF
        code_flag = code & FLAG_REF
        idx = len(REFS_HASH)
        if code_flag:
            # Reserve the slot now so nested objects get later indices.
            REFS_HASH[idx] = None
        code_dict = {}
        if code_type == TYPE_CODE:
            code_dict['type'] = 'code'
            code_dict['co_argcount'] = int.from_bytes(fp.read(4), 'little')
            code_dict['co_kwonlyargcount'] = int.from_bytes(fp.read(4), 'little')
            code_dict['co_nlocals'] = int.from_bytes(fp.read(4), 'little')
            code_dict['co_stacksize'] = int.from_bytes(fp.read(4), 'little')
            code_dict['co_flags'] = int.from_bytes(fp.read(4), 'little')
            code_dict['co_code'] = parse_code(fp)
            code_dict['co_consts'] = parse_code(fp)
            code_dict['co_names'] = parse_code(fp)
            code_dict['co_varnames'] = parse_code(fp)
            code_dict['co_freevars'] = parse_code(fp)
            code_dict['co_cellvars'] = parse_code(fp)
            code_dict['co_filename'] = parse_code(fp)
            code_dict['co_name'] = parse_code(fp)
            code_dict['co_firstlineno'] = int.from_bytes(fp.read(4), 'little')
            code_dict['co_lnotab'] = parse_code(fp)
        elif code_type == TYPE_STRING:
            code_dict['type'] = 'string'
            length = int.from_bytes(fp.read(4), 'little')
            code_dict['length'] = length
            # todo
            value = fp.read(length)
            code_dict['value'] = str(value)
            if code_flag:
                REFS_HASH[idx] = code_dict['value']
        elif code_type == TYPE_SMALL_TUPLE:
            code_dict['type'] = 'tuple'
            size = int.from_bytes(fp.read(1), 'little')
            code_dict['size'] = size
            items = []
            for _ in range(size):
                items.append(parse_code(fp))
            code_dict['items'] = items
            if code_flag:
                REFS_HASH[idx] = code_dict['items']
        elif code_type == TYPE_INT:
            code_dict['type'] = 'long'
            value = int.from_bytes(fp.read(4), 'little')
            code_dict['value'] = value
            if code_flag:
                REFS_HASH[idx] = code_dict['value']
        elif code_type == TYPE_SHORT_ASCII:
            code_dict['type'] = 'unicode'
            size = int.from_bytes(fp.read(1), 'little')
            code_dict['size'] = size
            code_dict['value'] = fp.read(size).decode()
            if code_flag:
                REFS_HASH[idx] = code_dict['value']
        elif code_type == TYPE_SHORT_ASCII_INTERNED:
            code_dict['type'] = 'unicode'
            size = int.from_bytes(fp.read(1), 'little')
            code_dict['size'] = size
            code_dict['value'] = fp.read(size).decode()
            if code_flag:
                REFS_HASH[idx] = code_dict['value']
        elif code_type == TYPE_REF:
            code_dict['type'] = 'ref'
            code_dict['ref'] = int.from_bytes(fp.read(4), 'little')
            code_dict['value'] = REFS_HASH[code_dict['ref']]
        elif code_type == TYPE_NONE:
            code_dict['type'] = 'none'
        else:
            print(code_type)
        return code_dict


    def parse_pyc(file_name):
        pyc_dict = {}
        with open(file_name, 'rb') as fp:
            magic_number = int.from_bytes(fp.read(2), 'little')
            if magic_number >= 3390 and magic_number
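For comparison, a much shorter route to a JSON summary is to let the marshal module do the decoding. The sketch below is not the tool above: code_to_dict and load_pyc are names introduced here, the 16-byte header assumes Python 3.7 or later, and marshal.loads can only read pyc files produced by the same Python version that runs the script.

```python
import importlib.util
import json
import marshal
import os
import py_compile
import tempfile
import types

def code_to_dict(code):
    """Summarise a code object, recursing into nested code objects in co_consts."""
    return {
        "co_name": code.co_name,
        "co_filename": code.co_filename,
        "co_firstlineno": code.co_firstlineno,
        "co_names": list(code.co_names),
        "nested": [code_to_dict(c) for c in code.co_consts
                   if isinstance(c, types.CodeType)],
    }

def load_pyc(pyc_path):
    """Return (magic_number, summary_dict) for a Python 3.7+ pyc file."""
    with open(pyc_path, "rb") as fp:
        header = fp.read(16)             # magic, flags, and two 32-bit fields
        code = marshal.loads(fp.read())  # the rest is the marshalled code object
    magic_number = int.from_bytes(header[:2], "little")
    return magic_number, code_to_dict(code)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "test.py")
    with open(src, "w", encoding="utf-8") as fp:
        fp.write("class A:\n    pass\n\ndef Fun():\n    pass\n\na = A()\nFun()\n")
    pyc = py_compile.compile(src, cfile=os.path.join(tmp, "test.pyc"))
    magic_number, summary = load_pyc(pyc)

print(json.dumps(summary, indent=2))
```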
