'''"Executable documentation" for the pickle module.
Extensive comments about the pickle protocols and pickle-machine opcodes
can be found here. Some functions meant for external use:
genops(pickle)
   Generate all the opcodes in a pickle, as (opcode, arg, position) triples.

dis(pickle, out=None, memo=None, indentlevel=4)
   Print a symbolic disassembly of a pickle.
'''

__all__ = ['dis', 'genops', 'optimize']
# Other ideas:
#
# - A pickle verifier: read a pickle and check it exhaustively for
# well-formedness. dis() does a lot of this already.
# - A protocol identifier: examine a pickle and return its protocol number
# (== the highest .proto attr value among all the opcodes in the pickle).
# dis() already prints this info at the end.
# - A pickle optimizer: for example, tuple-building code is sometimes more
# elaborate than necessary, catering for the possibility that the tuple
# is recursive. Or lots of times a PUT is generated that's never accessed
# by a later GET.
"A pickle" is a program for a virtual pickle machine (PM, but more accurately
called an unpickling machine). It's a sequence of opcodes, interpreted by the
PM, building an arbitrarily complex Python object.
For the most part, the PM is very simple: there are no looping, testing, or
conditional instructions, no arithmetic and no function calls. Opcodes are
executed once each, from first to last, until a STOP opcode is reached.
The PM has two data areas, "the stack" and "the memo".
Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
integer object on the stack, whose value is gotten from a decimal string
literal immediately following the INT opcode in the pickle bytestream. Other
opcodes take Python objects off the stack. The result of unpickling is
whatever object is left on the stack when the final STOP opcode is executed.
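
To make that concrete, here is a minimal sketch (assuming CPython 2.x, where
pickles are ordinary byte strings) of the smallest interesting pickle:
protocol 0 encodes the integer 42 as "I42\n." -- the INT opcode, its
decimal-string argument terminated by a newline, then STOP:

    import pickle
    assert pickle.dumps(42, 0) == 'I42\n.'   # INT with arg "42", then STOP
    assert pickle.loads('I42\n.') == 42      # the PM leaves 42 on the stack
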
The memo is simply an array of objects, or it can be implemented as a dict
mapping little integers to objects. The memo serves as the PM's "long term
memory", and the little integers indexing the memo are akin to variable
names. Some opcodes pop a stack object into the memo at a given index,
and others push a memo object at a given index onto the stack again.
At heart, that's all the PM has. Subtleties arise for these reasons:
+ Object identity. Objects can be arbitrarily complex, and subobjects
may be shared (for example, the list [a, a] refers to the same object a
twice). It can be vital that unpickling recreate an isomorphic object
graph, faithfully reproducing sharing (see the sketch after this list).
+ Recursive objects. For example, after "L = []; L.append(L)", L is a
list, and L[0] is the same list. This is related to the object identity
point, and some sequences of pickle opcodes are subtle in order to
get the right result in all cases.
+ Things pickle doesn't know everything about. Examples of things pickle
does know everything about are Python's builtin scalar and container
types, like ints and tuples. They generally have opcodes dedicated to
them. For things like module references and instances of user-defined
classes, pickle's knowledge is limited. Historically, many enhancements
have been made to the pickle protocol in order to do a better (faster,
and/or more compact) job on those.
+ Backward compatibility and micro-optimization. As explained below,
pickle opcodes never go away, not even when better ways to do a thing
get invented. The repertoire of the PM just keeps growing over time.
For example, protocol 0 had two opcodes for building Python integers (INT
and LONG), protocol 1 added three more for more-efficient pickling of short
integers, and protocol 2 added two more for more-efficient pickling of
long integers (before protocol 2, the only ways to pickle a Python long
took time quadratic in the number of digits, for both pickling and
unpickling). "Opcode bloat" isn't so much a subtlety as a source of
For compatibility, the meaning of a pickle opcode never changes. Instead new
pickle opcodes get added, and each version's unpickler can handle all the
pickle opcodes in all protocol versions to date. So old pickles continue to
be readable forever. The pickler can generally be told to restrict itself to
the subset of opcodes available under previous protocol versions too, so that
users can create pickles under the current version readable by older
versions. However, a pickle does not contain its version number embedded
within it. If an older unpickler tries to read a pickle using a later
protocol, the result is most likely an exception due to seeing an unknown (in
the older unpickler) opcode.
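
One practical wrinkle (a sketch, assuming CPython 2.x): protocol 2 pickles
do announce themselves, via the PROTO opcode described later, so a protocol 2
pickle always begins with the two bytes '\x80\x02':

    import pickle
    assert pickle.dumps(42, 2)[:2] == '\x80\x02'   # PROTO opcode + version
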
The original pickle used what's now called "protocol 0", and what was called
"text mode" before Python 2.3. The entire pickle bytestream is made up of
printable 7-bit ASCII characters, plus the newline character, in protocol 0.
That's why it was called text mode. Protocol 0 is small and elegant, but
sometimes painfully inefficient.
The second major set of additions is now called "protocol 1", and was called
"binary mode" before Python 2.3. This added many opcodes with arguments
consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
bytes. Binary mode pickles can be substantially smaller than equivalent
text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
int as 4 bytes following the opcode, which is cheaper to unpickle than the
(perhaps) 11-character decimal string attached to INT. Protocol 1 also added
a number of opcodes that operate on many stack elements at once (like APPENDS
and SETITEMS), and "shortcut" opcodes (like EMPTY_DICT and EMPTY_TUPLE).
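
A quick way to see the difference (a sketch, assuming CPython 2.x; exact
sizes depend on the data being pickled):

    import pickle
    data = range(100)
    s0 = pickle.dumps(data, 0)   # text mode: decimal strings and newlines
    s1 = pickle.dumps(data, 1)   # binary mode: BININT1 and friends
    assert len(s1) < len(s0)     # binary pickles are typically much smaller
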
The third major set of additions came in Python 2.3, and is called "protocol
2". This added:
- A better way to pickle instances of new-style classes (NEWOBJ).
- A way for a pickle to identify its protocol (PROTO).
- Time- and space- efficient pickling of long ints (LONG{1,4}).
- Shortcuts for small tuples (TUPLE{1,2,3}).
- Dedicated opcodes for bools (NEWTRUE, NEWFALSE).
- The "extension registry", a vector of popular objects that can be pushed
efficiently by index (EXT{1,2,4}). This is akin to the memo and GET, but
the registry contents are predefined (there's nothing akin to the memo's
PUT).
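
For example (a sketch, assuming CPython 2.x), the dedicated bool opcodes make
protocol 2 pickles of bools as small as they can get: PROTO, then NEWTRUE or
NEWFALSE, then STOP:

    import pickle
    assert pickle.dumps(True, 2) == '\x80\x02\x88.'    # PROTO 2, NEWTRUE, STOP
    assert pickle.dumps(False, 2) == '\x80\x02\x89.'   # PROTO 2, NEWFALSE, STOP
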
Another independent change with Python 2.3 is the abandonment of any
pretense that it might be safe to load pickles received from untrusted
parties -- no sufficient security analysis has been done to guarantee
this and there isn't a use case that warrants the expense of such an
analysis.
To this end, all tests for __safe_for_unpickling__ or for
copy_reg.safe_constructors are removed from the unpickling code.
References to these variables in the descriptions below are to be seen
as describing unpickling in Python 2.2 and before.
"""
# Meta-rule: Descriptions are stored in instances of descriptor objects,
# with plain constructors. No meta-language is defined from which
# descriptors could be constructed. If you want, e.g., XML, write a little
# program to generate XML from the objects.
##############################################################################
# Some pickle opcodes have an argument, following the opcode in the
# bytestream. An argument is of a specific type, described by an instance
# of ArgumentDescriptor. These are not to be confused with arguments taken
# off the stack -- ArgumentDescriptor applies only to arguments embedded in
# the opcode stream, immediately following an opcode.
# Represents the number of bytes consumed by an argument delimited by the
# next newline character.
UP_TO_NEWLINE = -1
# Represents the number of bytes consumed by a two-argument opcode where
# the first argument gives the number of bytes in the second argument.
TAKEN_FROM_ARGUMENT1 = -2 # num bytes is 1-byte unsigned int
TAKEN_FROM_ARGUMENT4 = -3 # num bytes is 4-byte signed little-endian int
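
# To illustrate how a reader might act on these values (a sketch, not part of
# this module's API; `f` is a file-like object positioned just past an opcode,
# and `d` is that opcode's ArgumentDescriptor):
#
#     if d.n >= 0:                        # fixed-width argument
#         data = f.read(d.n)
#     elif d.n == UP_TO_NEWLINE:          # delimited by the next newline
#         data = f.readline()
#     elif d.n == TAKEN_FROM_ARGUMENT1:   # 1-byte unsigned length prefix
#         data = f.read(ord(f.read(1)))
#     else:                               # TAKEN_FROM_ARGUMENT4: 4-byte signed
#         data = f.read(read_int4(f))     # little-endian prefix (read_int4 below)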
class ArgumentDescriptor(object):
    __slots__ = (
        # name of descriptor record, also a module global name; a string
        'name',
        # length of argument, in bytes; an int; UP_TO_NEWLINE and
        # TAKEN_FROM_ARGUMENT{1,4} are negative values for variable-length
        # cases
        'n',
        # a function taking a file-like object, reading this kind of argument
        # from the object at the current position, advancing the current
        # position by n bytes, and returning the value of the argument
        'reader',
        # human-readable docs for this arg descriptor; a string
        'doc',
    )
    def __init__(self, name, n, reader, doc):
        assert isinstance(name, str)
        self.name = name
        assert isinstance(n, (int, long)) and (n >= 0 or
               n in (UP_TO_NEWLINE, TAKEN_FROM_ARGUMENT1, TAKEN_FROM_ARGUMENT4))
        self.n = n
        self.reader = reader
        assert isinstance(doc, str)
        self.doc = doc
from struct import unpack as _unpack
def read_uint1(f):
    r"""
    >>> import StringIO
    >>> read_uint1(StringIO.StringIO('\xff'))
    255
    """
    data = f.read(1)
    if data:
        return ord(data)
    raise ValueError("not enough data in stream to read uint1")

uint1 = ArgumentDescriptor(
    name='uint1', n=1, reader=read_uint1,
    doc="One-byte unsigned integer.")
def read_uint2(f):
    r"""
    >>> import StringIO
    >>> read_uint2(StringIO.StringIO('\xff\x00'))
    255
    >>> read_uint2(StringIO.StringIO('\xff\xff'))
    65535
    """
    data = f.read(2)
    if len(data) == 2:
        return _unpack("<H", data)[0]
    raise ValueError("not enough data in stream to read uint2")

uint2 = ArgumentDescriptor(
    name='uint2', n=2, reader=read_uint2,
    doc="Two-byte unsigned integer, little-endian.")
def read_int4(f):
    r"""
    >>> import StringIO
    >>> read_int4(StringIO.StringIO('\xff\x00\x00\x00'))
    255
    >>> read_int4(StringIO.StringIO('\x00\x00\x00\x80')) == -(2**31)
    True
    """
    data = f.read(4)
    if len(data) == 4:
        return _unpack("<i", data)[0]
    raise ValueError("not enough data in stream to read int4")

int4 = ArgumentDescriptor(
    name='int4', n=4, reader=read_int4,
    doc="Four-byte signed integer, little-endian, 2's complement.")
def read_stringnl(f, decode=True, stripquotes=True):
    r"""
    >>> import StringIO
    >>> read_stringnl(StringIO.StringIO("'abcd'\nefg\n"))
    'abcd'

    >>> read_stringnl(StringIO.StringIO("\n"))
    Traceback (most recent call last):
    ...
    ValueError: no string quotes around ''

    >>> read_stringnl(StringIO.StringIO("\n"), stripquotes=False)
    ''

    >>> read_stringnl(StringIO.StringIO("''\n"))
    ''

    >>> read_stringnl(StringIO.StringIO('"abcd"'))
    Traceback (most recent call last):
    ...
    ValueError: no newline found when trying to read stringnl

    Embedded escapes are undone in the result.
    >>> read_stringnl(StringIO.StringIO(r"'a\n\\b\x00c\td'" + "\n'e'"))
    'a\n\\b\x00c\td'
    """
    data = f.readline()
    if not data.endswith('\n'):
        raise ValueError("no newline found when trying to read stringnl")
    data = data[:-1]    # lose the newline

    if stripquotes:
        for q in "'\"":
            if data.startswith(q):
                if not data.endswith(q):
                    raise ValueError("string quote %r not found at both "
                                     "ends of %r" % (q, data))
                data = data[1:-1]
                break
        else:
            raise ValueError("no string quotes around %r" % data)

    # I'm not sure when 'string_escape' was added to the std codecs; it's
    # crazy not to use it if it's there.
    if decode:
        data = data.decode('string_escape')
    return data
stringnl = ArgumentDescriptor(
    name='stringnl', n=UP_TO_NEWLINE, reader=read_stringnl,
    doc="""A newline-terminated string.

        This is a repr-style string, with embedded escapes, and
        bracketing quotes.
        """)
def read_stringnl_noescape(f):
    return read_stringnl(f, decode=False, stripquotes=False)

stringnl_noescape = ArgumentDescriptor(
    name='stringnl_noescape',
    n=UP_TO_NEWLINE,
    reader=read_stringnl_noescape,
    doc="""A newline-terminated string.

        This is a str-style string, without embedded escapes,
        or bracketing quotes.  It should consist solely of
        printable ASCII characters.
        """)
def read_stringnl_noescape_pair(f):
    r"""
    >>> import StringIO
    >>> read_stringnl_noescape_pair(StringIO.StringIO("Queue\nEmpty\njunk"))
    'Queue Empty'
    """
    return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f))

stringnl_noescape_pair = ArgumentDescriptor(
    name='stringnl_noescape_pair',
    n=UP_TO_NEWLINE,
    reader=read_stringnl_noescape_pair,
    doc="""A pair of newline-terminated strings.

        These are str-style strings, without embedded
        escapes, or bracketing quotes.  They should
        consist solely of printable ASCII characters.
        The pair is returned as a single string, with
        a single blank separating the two strings.
        """)
def read_string4(f):
    r"""
    >>> import StringIO
    >>> read_string4(StringIO.StringIO("\x00\x00\x00\x00abc"))
    ''
    >>> read_string4(StringIO.StringIO("\x03\x00\x00\x00abcdef"))
    'abc'
    >>> read_string4(StringIO.StringIO("\x00\x00\x00\x03abcdef"))
    Traceback (most recent call last):
    ...
    ValueError: expected 50331648 bytes in a string4, but only 6 remain
    """
    n = read_int4(f)
    if n < 0:
        raise ValueError("string4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a string4, but only %d remain" %
                     (n, len(data)))

string4 = ArgumentDescriptor(
    name="string4", n=TAKEN_FROM_ARGUMENT4, reader=read_string4,
    doc="""A counted string.

        The first argument is a 4-byte little-endian signed int giving
        the number of bytes in the string, and the second argument is
        that many bytes.
        """)
def read_string1(f):
    r"""
    >>> import StringIO
    >>> read_string1(StringIO.StringIO("\x00"))
    ''
    >>> read_string1(StringIO.StringIO("\x03abcdef"))
    'abc'
    """
    n = read_uint1(f)
    assert n >= 0
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a string1, but only %d remain" %
                     (n, len(data)))

string1 = ArgumentDescriptor(
    name="string1", n=TAKEN_FROM_ARGUMENT1, reader=read_string1,
    doc="""A counted string.

        The first argument is a 1-byte unsigned int giving the number
        of bytes in the string, and the second argument is that many
        bytes.
        """)
def read_unicodestringnl(f):
    r"""
    >>> import StringIO
    >>> read_unicodestringnl(StringIO.StringIO("abc\uabcd\njunk"))
    u'abc\uabcd'
    """
    data = f.readline()
    if not data.endswith('\n'):
        raise ValueError("no newline found when trying to read "
                         "unicodestringnl")
    data = data[:-1]    # lose the newline
    return unicode(data, 'raw-unicode-escape')

unicodestringnl = ArgumentDescriptor(
    name='unicodestringnl',
    n=UP_TO_NEWLINE,
    reader=read_unicodestringnl,
    doc="""A newline-terminated Unicode string.

        This is raw-unicode-escape encoded, so consists of
        printable ASCII characters, and may contain embedded
        escape sequences.
        """)
def read_unicodestring4(f):
    r"""
    >>> import StringIO
    >>> s = u'abcd\uabcd'
    >>> enc = s.encode('utf-8')
    >>> enc
    'abcd\xea\xaf\x8d'
    >>> n = chr(len(enc)) + chr(0) * 3  # little-endian 4-byte length
    >>> t = read_unicodestring4(StringIO.StringIO(n + enc + 'junk'))
    >>> s == t
    True

    >>> read_unicodestring4(StringIO.StringIO(n + enc[:-1]))
    Traceback (most recent call last):
    ...
    ValueError: expected 7 bytes in a unicodestring4, but only 6 remain
    """
    n = read_int4(f)
    if n < 0:
        raise ValueError("unicodestring4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return unicode(data, 'utf-8')
    raise ValueError("expected %d bytes in a unicodestring4, but only %d "
                     "remain" % (n, len(data)))

unicodestring4 = ArgumentDescriptor(
    name='unicodestring4',
    n=TAKEN_FROM_ARGUMENT4,
    reader=read_unicodestring4,
    doc="""A counted Unicode string.

        The first argument is a 4-byte little-endian signed int
        giving the number of bytes in the string, and the second
        argument-- the UTF-8 encoding of the Unicode string --
        contains that many bytes.
        """)
def read_decimalnl_short(f):
    r"""
    >>> import StringIO
    >>> read_decimalnl_short(StringIO.StringIO("1234\n56"))
    1234

    >>> read_decimalnl_short(StringIO.StringIO("1234L\n56"))
    Traceback (most recent call last):
    ...
    ValueError: trailing 'L' not allowed in '1234L'
    """
    s = read_stringnl(f, decode=False, stripquotes=False)
    if s.endswith("L"):
        raise ValueError("trailing 'L' not allowed in %r" % s)