'''"Executable documentation" for the pickle module.

Extensive comments about the pickle protocols and pickle-machine opcodes
can be found here. Some functions meant for external use:

genops(pickle)
   Generate all the opcodes in a pickle, as (opcode, arg, position) triples.

dis(pickle, out=None, memo=None, indentlevel=4)
   Print a symbolic disassembly of a pickle.
'''

__all__ = ['dis', 'genops', 'optimize']

# Other ideas:
#
# - A pickle verifier: read a pickle and check it exhaustively for
#   well-formedness. dis() does a lot of this already.
#
# - A protocol identifier: examine a pickle and return its protocol number
#   (== the highest .proto attr value among all the opcodes in the pickle).
#   dis() already prints this info at the end.
#
# - A pickle optimizer: for example, tuple-building code is sometimes more
#   elaborate than necessary, catering for the possibility that the tuple
#   is recursive. Or lots of times a PUT is generated that's never accessed
#   by a later GET.


"""
"A pickle" is a program for a virtual pickle machine (PM, but more accurately
called an unpickling machine). It's a sequence of opcodes, interpreted by the
PM, building an arbitrarily complex Python object.

For the most part, the PM is very simple: there are no looping, testing, or
conditional instructions, no arithmetic and no function calls. Opcodes are
executed once each, from first to last, until a STOP opcode is reached.

The PM has two data areas, "the stack" and "the memo".

Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
integer object on the stack, whose value is taken from a decimal string
literal immediately following the INT opcode in the pickle bytestream. Other
opcodes take Python objects off the stack. The result of unpickling is
whatever object is left on the stack when the final STOP opcode is executed.
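
As a rough illustration of the stack model (a sketch only, assuming a
Python 3 interpreter and its standard pickle and pickletools modules), the
hand-written protocol 0 pickle below holds a single INT opcode with its
decimal argument, followed by STOP:

```python
import pickle
import pickletools

# Hand-assembled protocol 0 pickle: INT with the decimal literal "42",
# terminated by a newline, then STOP.
raw = b"I42\n."

# List each opcode with its embedded argument and byte position.
ops = [(op.name, arg, pos) for op, arg, pos in pickletools.genops(raw)]
print(ops)

# The unpickler returns whatever INT left on the stack when STOP ran.
value = pickle.loads(raw)
print(value)
```

INT pushes the integer, and STOP hands back that one remaining stack entry.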

The memo is simply an array of objects, or it can be implemented as a dict
mapping little integers to objects. The memo serves as the PM's "long term
memory", and the little integers indexing the memo are akin to variable
names. Some opcodes store the object on top of the stack into the memo at a
given index (without popping it), and others push a memo object at a given
index onto the stack again.
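
The interplay of stack and memo can be sketched with a hand-assembled
protocol 0 pickle for the list [a, a], where a is an empty list. This is an
illustrative sketch assuming Python 3's standard pickle and pickletools
modules; the byte string is written by hand here and is not claimed to be
what any particular pickler emits:

```python
import pickle
import pickletools

# Hand-assembled protocol 0 pickle for [a, a] where a == []:
#   ( l p0   build the outer list and store it in memo slot 0
#   ( l p1   build the inner list and store it in memo slot 1
#   a        append the inner list to the outer list
#   g1 a     fetch memo slot 1 again and append it a second time
#   .        stop
raw = b"(lp0\n(lp1\nag1\na."

names = [op.name for op, arg, pos in pickletools.genops(raw)]
print(names)

result = pickle.loads(raw)
# Both elements are the same object, not merely equal ones.
print(result, result[0] is result[1])
```

The GET opcode pushes the memoized inner list again, so the unpickled list
shares one object between both positions.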

At heart, that's all the PM has. Subtleties arise for these reasons:

+ Object identity. Objects can be arbitrarily complex, and subobjects
  may be shared (for example, the list [a, a] refers to the same object a
  twice). It can be vital that unpickling recreate an isomorphic object
  graph, faithfully reproducing sharing.

+ Recursive objects. For example, after "L = []; L.append(L)", L is a
  list, and L[0] is the same list. This is related to the object identity
  point, and some sequences of pickle opcodes are subtle in order to
  get the right result in all cases.

+ Things pickle doesn't know everything about. Examples of things pickle
  does know everything about are Python's builtin scalar and container
  types, like ints and tuples. They generally have opcodes dedicated to
  them. For things like module references and instances of user-defined
  classes, pickle's knowledge is limited. Historically, many enhancements
  have been made to the pickle protocol in order to do a better (faster,
  and/or more compact) job on those.

+ Backward compatibility and micro-optimization. As explained below,
  pickle opcodes never go away, not even when better ways to do a thing
  get invented. The repertoire of the PM just keeps growing over time.
  For example, protocol 0 had two opcodes for building Python integers (INT
  and LONG), protocol 1 added three more for more-efficient pickling of short
  integers, and protocol 2 added two more for more-efficient pickling of
  long integers (before protocol 2, the only ways to pickle a Python long
  took time quadratic in the number of digits, for both pickling and
  unpickling). "Opcode bloat" isn't so much a subtlety as a source of
  wearying complication.
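
The recursive-object case from the list above can be checked with a short
round trip. This is only a sketch, assuming Python 3's standard pickle
module; protocol 0 is chosen arbitrarily:

```python
import pickle

# The recursive list from the text: L contains itself.
L = []
L.append(L)

# Round-trip through protocol 0.
M = pickle.loads(pickle.dumps(L, protocol=0))

# The copy is a new list whose only element is the copy itself.
print(M is not L, M[0] is M)
```

The unpickled list is a distinct object from L, yet it reproduces the
self-reference rather than nesting a second copy.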


Pickle protocols:

For compatibility, the meaning of a pickle opcode never changes. Instead new
pickle opcodes get added, and each version's unpickler can handle all the
pickle opcodes in all protocol versions to date. So old pickles continue to
be readable forever. The pickler can generally be told to restrict itself to
the subset of opcodes available under previous protocol versions too, so that
users can create pickles under the current version readable by older
versions. However, a pickle does not contain its version number embedded
within it. If an older unpickler tries to read a pickle using a later
protocol, the result is most likely an exception due to seeing an unknown (in
the older unpickler) opcode.

The original pickle used what's now called "protocol 0", and what was called
"text mode" before Python 2.3. The entire pickle bytestream is made up of
printable 7-bit ASCII characters, plus the newline character, in protocol 0.
That's why it was called text mode. Protocol 0 is small and elegant, but
sometimes painfully inefficient.

The second major set of additions is now called "protocol 1", and was called
"binary mode" before Python 2.3. This added many opcodes with arguments
consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
bytes. Binary mode pickles can be substantially smaller than equivalent
text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
int as 4 bytes following the opcode, which is cheaper to unpickle than the
(perhaps) 11-character decimal string attached to INT. Protocol 1 also added
a number of opcodes that operate on many stack elements at once (like APPENDS
and SETITEMS), and "shortcut" opcodes (like EMPTY_DICT and EMPTY_TUPLE).
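
The size difference between text mode and binary mode can be seen by
pickling the same integer under both protocols. A minimal sketch, assuming
Python 3's standard pickle module; the value is an arbitrary choice that
still fits a signed 4-byte int:

```python
import pickle

n = 2**31 - 1  # largest value that fits a signed 4-byte int

text = pickle.dumps(n, protocol=0)    # decimal string form
binary = pickle.dumps(n, protocol=1)  # fixed-width binary form

print(len(text), text)
print(len(binary), binary)
```

Both pickles load back to the same integer; the binary form should be the
shorter of the two.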

The third major set of additions came in Python 2.3, and is called "protocol
2". This added:

- A better way to pickle instances of new-style classes (NEWOBJ).

- A way for a pickle to identify its protocol (PROTO).

- Time- and space-efficient pickling of long ints (LONG{1,4}).

- Shortcuts for small tuples (TUPLE{1,2,3}).

- Dedicated opcodes for bools (NEWTRUE, NEWFALSE).

- The "extension registry", a vector of popular objects that can be pushed
  efficiently by index (EXT{1,2,4}). This is akin to the memo and GET, but
  the registry contents are predefined (there's nothing akin to the memo's
  PUT).
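
Two of these additions, PROTO and NEWTRUE, show up when True is pickled
under protocol 2. The sketch below assumes Python 3's standard pickle and
pickletools modules and lists the opcode names for protocols 0 and 2 side
by side:

```python
import pickle
import pickletools

for proto in (0, 2):
    raw = pickle.dumps(True, protocol=proto)
    # Collect just the opcode names found in this pickle.
    names = [op.name for op, arg, pos in pickletools.genops(raw)]
    print(proto, raw, names)
```

Under protocol 2 the pickle opens with PROTO and uses the dedicated bool
opcode, whereas protocol 0 has to fall back on an older opcode.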

Another independent change with Python 2.3 is the abandonment of any
pretense that it might be safe to load pickles received from untrusted
parties -- no sufficient security analysis has been done to guarantee
this and there isn't a use case that warrants the expense of such an
analysis.

To this end, all tests for __safe_for_unpickling__ or for
copy_reg.safe_constructors are removed from the unpickling code.
References to these variables in the descriptions below are to be seen
as describing unpickling in Python 2.2 and before.
"""

# Meta-rule: Descriptions are stored in instances of descriptor objects,
# with plain constructors. No meta-language is defined from which
# descriptors could be constructed. If you want, e.g., XML, write a little
# program to generate XML from the objects.

##############################################################################
# Some pickle opcodes have an argument, following the opcode in the
# bytestream. An argument is of a specific type, described by an instance
# of ArgumentDescriptor. These are not to be confused with arguments taken
# off the stack -- ArgumentDescriptor applies only to arguments embedded in
# the opcode stream, immediately following an opcode.

# Represents the number of bytes consumed by an argument delimited by the
# next newline character.
UP_TO_NEWLINE = -1

# Represents the number of bytes consumed by a two-argument opcode where
# the first argument gives the number of bytes in the second argument.
TAKEN_FROM_ARGUMENT1 = -2  # num bytes is 1-byte unsigned int
TAKEN_FROM_ARGUMENT4 = -3  # num bytes is 4-byte signed little-endian int

class ArgumentDescriptor(object):
    __slots__ = (
        # name of descriptor record, also a module global name; a string
        'name',

        # length of argument, in bytes; an int; UP_TO_NEWLINE and
        # TAKEN_FROM_ARGUMENT{1,4} are negative values for variable-length
        # cases
        'n',

        # a function taking a file-like object, reading this kind of argument
        # from the object at the current position, advancing the current
        # position by n bytes, and returning the value of the argument
        'reader',

        # human-readable docs for this arg descriptor; a string
        'doc',
    )

    def __init__(self, name, n, reader, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(n, (int, long)) and (n >= 0 or
                                               n in (UP_TO_NEWLINE,
                                                     TAKEN_FROM_ARGUMENT1,
                                                     TAKEN_FROM_ARGUMENT4))
        self.n = n

        self.reader = reader

        assert isinstance(doc, str)
        self.doc = doc

from struct import unpack as _unpack

def read_uint1(f):
    r"""
    >>> import StringIO
    >>> read_uint1(StringIO.StringIO('\xff'))
    255
    """

    data = f.read(1)
    if data:
        return ord(data)
    raise ValueError("not enough data in stream to read uint1")

uint1 = ArgumentDescriptor(
    name='uint1',
    n=1,
    reader=read_uint1,
    doc="One-byte unsigned integer.")


def read_uint2(f):
    r"""
    >>> import StringIO
    >>> read_uint2(StringIO.StringIO('\xff\x00'))
    255
    >>> read_uint2(StringIO.StringIO('\xff\xff'))
    65535
    """

    data = f.read(2)
    if len(data) == 2:
        return _unpack("<H", data)[0]
    raise ValueError("not enough data in stream to read uint2")

uint2 = ArgumentDescriptor(
    name='uint2',
    n=2,
    reader=read_uint2,
    doc="Two-byte unsigned integer, little-endian.")


def read_int4(f):
    r"""
    >>> import StringIO
    >>> read_int4(StringIO.StringIO('\xff\x00\x00\x00'))
    255
    >>> read_int4(StringIO.StringIO('\x00\x00\x00\x80')) == -(2**31)
    True
    """

    data = f.read(4)
    if len(data) == 4:
        return _unpack("<i", data)[0]
    raise ValueError("not enough data in stream to read int4")

int4 = ArgumentDescriptor(
    name='int4',
    n=4,
    reader=read_int4,
    doc="Four-byte signed integer, little-endian, 2's complement.")


def read_stringnl(f, decode=True, stripquotes=True):
    r"""
    >>> import StringIO
    >>> read_stringnl(StringIO.StringIO("'abcd'\nefg\n"))
    'abcd'

    >>> read_stringnl(StringIO.StringIO("\n"))
    Traceback (most recent call last):
    ...
    ValueError: no string quotes around ''

    >>> read_stringnl(StringIO.StringIO("\n"), stripquotes=False)
    ''

    >>> read_stringnl(StringIO.StringIO("''\n"))
    ''

    >>> read_stringnl(StringIO.StringIO('"abcd"'))
    Traceback (most recent call last):
    ...
    ValueError: no newline found when trying to read stringnl

    Embedded escapes are undone in the result.
    >>> read_stringnl(StringIO.StringIO(r"'a\n\\b\x00c\td'" + "\n'e'"))
    'a\n\\b\x00c\td'
    """

    data = f.readline()
    if not data.endswith('\n'):
        raise ValueError("no newline found when trying to read stringnl")
    data = data[:-1]  # lose the newline

    if stripquotes:
        for q in "'\"":
            if data.startswith(q):
                if not data.endswith(q):
                    raise ValueError("string quote %r not found at both "
                                     "ends of %r" % (q, data))
                data = data[1:-1]
                break
        else:
            raise ValueError("no string quotes around %r" % data)

    # I'm not sure when 'string_escape' was added to the std codecs; it's
    # crazy not to use it if it's there.
    if decode:
        data = data.decode('string_escape')
    return data

stringnl = ArgumentDescriptor(
    name='stringnl',
    n=UP_TO_NEWLINE,
    reader=read_stringnl,
    doc="""A newline-terminated string.

    This is a repr-style string, with embedded escapes, and
    bracketing quotes.
    """)

def read_stringnl_noescape(f):
    return read_stringnl(f, decode=False, stripquotes=False)

stringnl_noescape = ArgumentDescriptor(
    name='stringnl_noescape',
    n=UP_TO_NEWLINE,
    reader=read_stringnl_noescape,
    doc="""A newline-terminated string.

    This is a str-style string, without embedded escapes,
    or bracketing quotes. It should consist solely of
    printable ASCII characters.
    """)

def read_stringnl_noescape_pair(f):
    r"""
    >>> import StringIO
    >>> read_stringnl_noescape_pair(StringIO.StringIO("Queue\nEmpty\njunk"))
    'Queue Empty'
    """

    return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f))

stringnl_noescape_pair = ArgumentDescriptor(
    name='stringnl_noescape_pair',
    n=UP_TO_NEWLINE,
    reader=read_stringnl_noescape_pair,
    doc="""A pair of newline-terminated strings.

    These are str-style strings, without embedded
    escapes, or bracketing quotes. They should
    consist solely of printable ASCII characters.
    The pair is returned as a single string, with
    a single blank separating the two strings.
    """)

def read_string4(f):
    r"""
    >>> import StringIO
    >>> read_string4(StringIO.StringIO("\x00\x00\x00\x00abc"))
    ''
    >>> read_string4(StringIO.StringIO("\x03\x00\x00\x00abcdef"))
    'abc'
    >>> read_string4(StringIO.StringIO("\x00\x00\x00\x03abcdef"))
    Traceback (most recent call last):
    ...
    ValueError: expected 50331648 bytes in a string4, but only 6 remain
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("string4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a string4, but only %d remain" %
                     (n, len(data)))

string4 = ArgumentDescriptor(
    name="string4",
    n=TAKEN_FROM_ARGUMENT4,
    reader=read_string4,
    doc="""A counted string.

    The first argument is a 4-byte little-endian signed int giving
    the number of bytes in the string, and the second argument is
    that many bytes.
    """)


def read_string1(f):
    r"""
    >>> import StringIO
    >>> read_string1(StringIO.StringIO("\x00"))
    ''
    >>> read_string1(StringIO.StringIO("\x03abcdef"))
    'abc'
    """

    n = read_uint1(f)
    assert n >= 0
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a string1, but only %d remain" %
                     (n, len(data)))

string1 = ArgumentDescriptor(
    name="string1",
    n=TAKEN_FROM_ARGUMENT1,
    reader=read_string1,
    doc="""A counted string.

    The first argument is a 1-byte unsigned int giving the number
    of bytes in the string, and the second argument is that many
    bytes.
    """)


def read_unicodestringnl(f):
    r"""
    >>> import StringIO
    >>> read_unicodestringnl(StringIO.StringIO("abc\uabcd\njunk"))
    u'abc\uabcd'
    """

    data = f.readline()
    if not data.endswith('\n'):
        raise ValueError("no newline found when trying to read "
                         "unicodestringnl")
    data = data[:-1]  # lose the newline
    return unicode(data, 'raw-unicode-escape')

unicodestringnl = ArgumentDescriptor(
    name='unicodestringnl',
    n=UP_TO_NEWLINE,
    reader=read_unicodestringnl,
    doc="""A newline-terminated Unicode string.

    This is raw-unicode-escape encoded, so consists of
    printable ASCII characters, and may contain embedded
    escape sequences.
    """)

def read_unicodestring4(f):
    r"""
    >>> import StringIO
    >>> s = u'abcd\uabcd'
    >>> enc = s.encode('utf-8')
    >>> enc
    'abcd\xea\xaf\x8d'
    >>> n = chr(len(enc)) + chr(0) * 3  # little-endian 4-byte length
    >>> t = read_unicodestring4(StringIO.StringIO(n + enc + 'junk'))
    >>> s == t
    True

    >>> read_unicodestring4(StringIO.StringIO(n + enc[:-1]))
    Traceback (most recent call last):
    ...
    ValueError: expected 7 bytes in a unicodestring4, but only 6 remain
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("unicodestring4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return unicode(data, 'utf-8')
    raise ValueError("expected %d bytes in a unicodestring4, but only %d "
                     "remain" % (n, len(data)))

unicodestring4 = ArgumentDescriptor(
    name="unicodestring4",
    n=TAKEN_FROM_ARGUMENT4,
    reader=read_unicodestring4,
    doc="""A counted Unicode string.

    The first argument is a 4-byte little-endian signed int
    giving the number of bytes in the string, and the second
    argument -- the UTF-8 encoding of the Unicode string --
    contains that many bytes.
    """)


def read_decimalnl_short(f):
    r"""
    >>> import StringIO
    >>> read_decimalnl_short(StringIO.StringIO("1234\n56"))
    1234

    >>> read_decimalnl_short(StringIO.StringIO("1234L\n56"))
    Traceback (most recent call last):
    ...
    ValueError: trailing 'L' not allowed in '1234L'
    """

    s = read_stringnl(f, decode=False, stripquotes=False)
    if s.endswith("L"):
        raise ValueError("trailing 'L' not allowed in %r" % s)
