"""
[0] Fix | Delete
Module difflib -- helpers for computing deltas between objects.
[1] Fix | Delete
[2] Fix | Delete
Function get_close_matches(word, possibilities, n=3, cutoff=0.6):
[3] Fix | Delete
Use SequenceMatcher to return list of the best "good enough" matches.
[4] Fix | Delete
[5] Fix | Delete
Function context_diff(a, b):
[6] Fix | Delete
For two lists of strings, return a delta in context diff format.
[7] Fix | Delete
[8] Fix | Delete
Function ndiff(a, b):
[9] Fix | Delete
Return a delta: the difference between `a` and `b` (lists of strings).
[10] Fix | Delete
[11] Fix | Delete
Function restore(delta, which):
[12] Fix | Delete
Return one of the two sequences that generated an ndiff delta.
[13] Fix | Delete
[14] Fix | Delete
Function unified_diff(a, b):
[15] Fix | Delete
For two lists of strings, return a delta in unified diff format.
[16] Fix | Delete
[17] Fix | Delete
Class SequenceMatcher:
[18] Fix | Delete
A flexible class for comparing pairs of sequences of any type.
[19] Fix | Delete
[20] Fix | Delete
Class Differ:
[21] Fix | Delete
For producing human-readable deltas from sequences of lines of text.
[22] Fix | Delete
[23] Fix | Delete
Class HtmlDiff:
[24] Fix | Delete
For producing HTML side by side comparison with change highlights.
[25] Fix | Delete
"""

__all__ = ['get_close_matches', 'ndiff', 'restore', 'SequenceMatcher',
           'Differ', 'IS_CHARACTER_JUNK', 'IS_LINE_JUNK', 'context_diff',
           'unified_diff', 'HtmlDiff', 'Match']

import heapq
from collections import namedtuple as _namedtuple
from functools import reduce

Match = _namedtuple('Match', 'a b size')

def _calculate_ratio(matches, length):
    if length:
        return 2.0 * matches / length
    return 1.0

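A quick illustration of the helper above (not part of difflib itself; Python 3 syntax, checked against the stdlib `SequenceMatcher`): `"abcd"` and `"bcde"` share 3 elements (`"bcd"`) out of 8 total, so the ratio is `2*3/8 == 0.75`.

```python
from difflib import SequenceMatcher

# The same formula as _calculate_ratio: 2.0 * matches / total length.
matches, total = 3, 8
ratio = 2.0 * matches / total if total else 1.0

s = SequenceMatcher(None, "abcd", "bcde")
assert ratio == 0.75
assert abs(s.ratio() - ratio) < 1e-9   # SequenceMatcher agrees
```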
class SequenceMatcher:

    """
    SequenceMatcher is a flexible class for comparing pairs of sequences of
    any type, so long as the sequence elements are hashable.  The basic
    algorithm predates, and is a little fancier than, an algorithm
    published in the late 1980's by Ratcliff and Obershelp under the
    hyperbolic name "gestalt pattern matching".  The basic idea is to find
    the longest contiguous matching subsequence that contains no "junk"
    elements (R-O doesn't address junk).  The same idea is then applied
    recursively to the pieces of the sequences to the left and to the right
    of the matching subsequence.  This does not yield minimal edit
    sequences, but does tend to yield matches that "look right" to people.

    SequenceMatcher tries to compute a "human-friendly diff" between two
    sequences.  Unlike e.g. UNIX(tm) diff, the fundamental notion is the
    longest *contiguous* & junk-free matching subsequence.  That's what
    catches peoples' eyes.  The Windows(tm) windiff has another interesting
    notion, pairing up elements that appear uniquely in each sequence.
    That, and the method here, appear to yield more intuitive difference
    reports than does diff.  This method appears to be the least vulnerable
    to synching up on blocks of "junk lines", though (like blank lines in
    ordinary text files, or maybe "<P>" lines in HTML files).  That may be
    because this is the only method of the 3 that has a *concept* of
    "junk" <wink>.

    Example, comparing two strings, and considering blanks to be "junk":

    >>> s = SequenceMatcher(lambda x: x == " ",
    ...                     "private Thread currentThread;",
    ...                     "private volatile Thread currentThread;")
    >>>

    .ratio() returns a float in [0, 1], measuring the "similarity" of the
    sequences.  As a rule of thumb, a .ratio() value over 0.6 means the
    sequences are close matches:

    >>> print round(s.ratio(), 3)
    0.866
    >>>

    If you're only interested in where the sequences match,
    .get_matching_blocks() is handy:

    >>> for block in s.get_matching_blocks():
    ...     print "a[%d] and b[%d] match for %d elements" % block
    a[0] and b[0] match for 8 elements
    a[8] and b[17] match for 21 elements
    a[29] and b[38] match for 0 elements

    Note that the last tuple returned by .get_matching_blocks() is always a
    dummy, (len(a), len(b), 0), and this is the only case in which the last
    tuple element (number of elements matched) is 0.

    If you want to know how to change the first sequence into the second,
    use .get_opcodes():

    >>> for opcode in s.get_opcodes():
    ...     print "%6s a[%d:%d] b[%d:%d]" % opcode
     equal a[0:8] b[0:8]
    insert a[8:8] b[8:17]
     equal a[8:29] b[17:38]

    See the Differ class for a fancy human-friendly file differencer, which
    uses SequenceMatcher both to compare sequences of lines, and to compare
    sequences of characters within similar (near-matching) lines.

    See also function get_close_matches() in this module, which shows how
    simple code building on SequenceMatcher can be used to do useful work.

    Timing:  Basic R-O is cubic time worst case and quadratic time expected
    case.  SequenceMatcher is quadratic time for the worst case and has
    expected-case behavior dependent in a complicated way on how many
    elements the sequences have in common; best case time is linear.

    Methods:

    __init__(isjunk=None, a='', b='')
        Construct a SequenceMatcher.

    set_seqs(a, b)
        Set the two sequences to be compared.

    set_seq1(a)
        Set the first sequence to be compared.

    set_seq2(b)
        Set the second sequence to be compared.

    find_longest_match(alo, ahi, blo, bhi)
        Find longest matching block in a[alo:ahi] and b[blo:bhi].

    get_matching_blocks()
        Return list of triples describing matching subsequences.

    get_opcodes()
        Return list of 5-tuples describing how to turn a into b.

    ratio()
        Return a measure of the sequences' similarity (float in [0,1]).

    quick_ratio()
        Return an upper bound on .ratio() relatively quickly.

    real_quick_ratio()
        Return an upper bound on ratio() very quickly.
    """

    def __init__(self, isjunk=None, a='', b='', autojunk=True):
        """Construct a SequenceMatcher.

        Optional arg isjunk is None (the default), or a one-argument
        function that takes a sequence element and returns true iff the
        element is junk.  None is equivalent to passing "lambda x: 0", i.e.
        no elements are considered to be junk.  For example, pass
            lambda x: x in " \\t"
        if you're comparing lines as sequences of characters, and don't
        want to synch up on blanks or hard tabs.

        Optional arg a is the first of two sequences to be compared.  By
        default, an empty string.  The elements of a must be hashable.  See
        also .set_seqs() and .set_seq1().

        Optional arg b is the second of two sequences to be compared.  By
        default, an empty string.  The elements of b must be hashable.  See
        also .set_seqs() and .set_seq2().

        Optional arg autojunk should be set to False to disable the
        "automatic junk heuristic" that treats popular elements as junk
        (see module documentation for more information).
        """

        # Members:
        # a
        #      first sequence
        # b
        #      second sequence; differences are computed as "what do
        #      we need to do to 'a' to change it into 'b'?"
        # b2j
        #      for x in b, b2j[x] is a list of the indices (into b)
        #      at which x appears; junk elements do not appear
        # fullbcount
        #      for x in b, fullbcount[x] == the number of times x
        #      appears in b; only materialized if really needed (used
        #      only for computing quick_ratio())
        # matching_blocks
        #      a list of (i, j, k) triples, where a[i:i+k] == b[j:j+k];
        #      ascending & non-overlapping in i and in j; terminated by
        #      a dummy (len(a), len(b), 0) sentinel
        # opcodes
        #      a list of (tag, i1, i2, j1, j2) tuples, where tag is
        #      one of
        #          'replace'   a[i1:i2] should be replaced by b[j1:j2]
        #          'delete'    a[i1:i2] should be deleted
        #          'insert'    b[j1:j2] should be inserted
        #          'equal'     a[i1:i2] == b[j1:j2]
        # isjunk
        #      a user-supplied function taking a sequence element and
        #      returning true iff the element is "junk" -- this has
        #      subtle but helpful effects on the algorithm, which I'll
        #      get around to writing up someday <0.9 wink>.
        #      DON'T USE!  Only __chain_b uses this.  Use isbjunk.
        # isbjunk
        #      for x in b, isbjunk(x) == isjunk(x) but much faster;
        #      it's really the __contains__ method of a hidden dict.
        #      DOES NOT WORK for x in a!
        # isbpopular
        #      for x in b, isbpopular(x) is true iff b is reasonably long
        #      (at least 200 elements) and x accounts for more than 1 + 1% of
        #      its elements (when autojunk is enabled).
        #      DOES NOT WORK for x in a!

        self.isjunk = isjunk
        self.a = self.b = None
        self.autojunk = autojunk
        self.set_seqs(a, b)

    def set_seqs(self, a, b):
        """Set the two sequences to be compared.

        >>> s = SequenceMatcher()
        >>> s.set_seqs("abcd", "bcde")
        >>> s.ratio()
        0.75
        """

        self.set_seq1(a)
        self.set_seq2(b)

    def set_seq1(self, a):
        """Set the first sequence to be compared.

        The second sequence to be compared is not changed.

        >>> s = SequenceMatcher(None, "abcd", "bcde")
        >>> s.ratio()
        0.75
        >>> s.set_seq1("bcde")
        >>> s.ratio()
        1.0
        >>>

        SequenceMatcher computes and caches detailed information about the
        second sequence, so if you want to compare one sequence S against
        many sequences, use .set_seq2(S) once and call .set_seq1(x)
        repeatedly for each of the other sequences.

        See also set_seqs() and set_seq2().
        """

        if a is self.a:
            return
        self.a = a
        self.matching_blocks = self.opcodes = None

    def set_seq2(self, b):
        """Set the second sequence to be compared.

        The first sequence to be compared is not changed.

        >>> s = SequenceMatcher(None, "abcd", "bcde")
        >>> s.ratio()
        0.75
        >>> s.set_seq2("abcd")
        >>> s.ratio()
        1.0
        >>>

        SequenceMatcher computes and caches detailed information about the
        second sequence, so if you want to compare one sequence S against
        many sequences, use .set_seq2(S) once and call .set_seq1(x)
        repeatedly for each of the other sequences.

        See also set_seqs() and set_seq1().
        """

        if b is self.b:
            return
        self.b = b
        self.matching_blocks = self.opcodes = None
        self.fullbcount = None
        self.__chain_b()

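The caching advice in the docstrings above (call `set_seq2(S)` once, then `set_seq1(x)` repeatedly) is exactly how `get_close_matches()` in this module works. A minimal sketch of the pattern (Python 3 syntax, stdlib difflib; `closest` is an illustrative helper, not part of this module):

```python
from difflib import SequenceMatcher

def closest(word, candidates):
    """Return the candidate most similar to word (ties keep the first)."""
    s = SequenceMatcher()
    s.set_seq2(word)              # expensive preprocessing, done once
    best, best_ratio = None, -1.0
    for cand in candidates:
        s.set_seq1(cand)          # cheap; only seq2 is preprocessed
        r = s.ratio()
        if r > best_ratio:
            best, best_ratio = cand, r
    return best

assert closest("appel", ["ape", "apple", "peach"]) == "apple"
```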
    # For each element x in b, set b2j[x] to a list of the indices in
    # b where x appears; the indices are in increasing order; note that
    # the number of times x appears in b is len(b2j[x]) ...
    # when self.isjunk is defined, junk elements don't show up in this
    # map at all, which stops the central find_longest_match method
    # from starting any matching block at a junk element ...
    # also creates the fast isbjunk function ...
    # b2j also does not contain entries for "popular" elements, meaning
    # elements that account for more than 1 + 1% of the total elements, and
    # when the sequence is reasonably large (>= 200 elements); this can
    # be viewed as an adaptive notion of semi-junk, and yields an enormous
    # speedup when, e.g., comparing program files with hundreds of
    # instances of "return NULL;" ...
    # note that this is only called when b changes; so for cross-product
    # kinds of matches, it's best to call set_seq2 once, then set_seq1
    # repeatedly

    def __chain_b(self):
        # Because isjunk is a user-defined (not C) function, and we test
        # for junk a LOT, it's important to minimize the number of calls.
        # Before the tricks described here, __chain_b was by far the most
        # time-consuming routine in the whole module!  If anyone sees
        # Jim Roskind, thank him again for profile.py -- I never would
        # have guessed that.
        # The first trick is to build b2j ignoring the possibility
        # of junk.  I.e., we don't call isjunk at all yet.  Throwing
        # out the junk later is much cheaper than building b2j "right"
        # from the start.
        b = self.b
        self.b2j = b2j = {}

        for i, elt in enumerate(b):
            indices = b2j.setdefault(elt, [])
            indices.append(i)

        # Purge junk elements
        junk = set()
        isjunk = self.isjunk
        if isjunk:
            for elt in list(b2j.keys()):  # using list() since b2j is modified
                if isjunk(elt):
                    junk.add(elt)
                    del b2j[elt]

        # Purge popular elements that are not junk
        popular = set()
        n = len(b)
        if self.autojunk and n >= 200:
            ntest = n // 100 + 1
            for elt, idxs in list(b2j.items()):
                if len(idxs) > ntest:
                    popular.add(elt)
                    del b2j[elt]

        # Now for x in b, isjunk(x) == x in junk, but the latter is much faster.
        # Since the number of *unique* junk elements is probably small, the
        # memory burden of keeping this set alive is likely trivial compared to
        # the size of b2j.
        self.isbjunk = junk.__contains__
        self.isbpopular = popular.__contains__
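The effect of the junk purge is easy to observe: after `__chain_b` runs, junk elements simply have no entry in `b2j`, so no match can start on them. An illustration (Python 3 syntax, stdlib difflib; `b2j` is an internal attribute, inspected here only for demonstration):

```python
from difflib import SequenceMatcher

# Declare blanks to be junk; b = "ab ab" has 'a' at 0 and 3, 'b' at 1 and 4,
# and a space at 2.  The space is purged from b2j by the junk pass.
s = SequenceMatcher(lambda x: x == " ", "", "ab ab")
assert s.b2j == {'a': [0, 3], 'b': [1, 4]}
assert ' ' not in s.b2j
```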

    def find_longest_match(self, alo, ahi, blo, bhi):
        """Find longest matching block in a[alo:ahi] and b[blo:bhi].

        If isjunk is not defined:

        Return (i,j,k) such that a[i:i+k] is equal to b[j:j+k], where
            alo <= i <= i+k <= ahi
            blo <= j <= j+k <= bhi
        and for all (i',j',k') meeting those conditions,
            k >= k'
            i <= i'
            and if i == i', j <= j'

        In other words, of all maximal matching blocks, return one that
        starts earliest in a, and of all those maximal matching blocks that
        start earliest in a, return the one that starts earliest in b.

        >>> s = SequenceMatcher(None, " abcd", "abcd abcd")
        >>> s.find_longest_match(0, 5, 0, 9)
        Match(a=0, b=4, size=5)

        If isjunk is defined, first the longest matching block is
        determined as above, but with the additional restriction that no
        junk element appears in the block.  Then that block is extended as
        far as possible by matching (only) junk elements on both sides.  So
        the resulting block never matches on junk except as identical junk
        happens to be adjacent to an "interesting" match.

        Here's the same example as before, but considering blanks to be
        junk.  That prevents " abcd" from matching the " abcd" at the tail
        end of the second sequence directly.  Instead only the "abcd" can
        match, and matches the leftmost "abcd" in the second sequence:

        >>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd")
        >>> s.find_longest_match(0, 5, 0, 9)
        Match(a=1, b=0, size=4)

        If no blocks match, return (alo, blo, 0).

        >>> s = SequenceMatcher(None, "ab", "c")
        >>> s.find_longest_match(0, 2, 0, 1)
        Match(a=0, b=0, size=0)
        """

        # CAUTION:  stripping common prefix or suffix would be incorrect.
        # E.g.,
        #    ab
        #    acab
        # Longest matching block is "ab", but if common prefix is
        # stripped, it's "a" (tied with "b").  UNIX(tm) diff does so
        # strip, so ends up claiming that ab is changed to acab by
        # inserting "ca" in the middle.  That's minimal but unintuitive:
        # "it's obvious" that someone inserted "ac" at the front.
        # Windiff ends up at the same place as diff, but by pairing up
        # the unique 'b's and then matching the first two 'a's.

        a, b, b2j, isbjunk = self.a, self.b, self.b2j, self.isbjunk
        besti, bestj, bestsize = alo, blo, 0
        # find longest junk-free match
        # during an iteration of the loop, j2len[j] = length of longest
        # junk-free match ending with a[i-1] and b[j]
        j2len = {}
        nothing = []
        for i in xrange(alo, ahi):
            # look at all instances of a[i] in b; note that because
            # b2j has no junk keys, the loop is skipped if a[i] is junk
            j2lenget = j2len.get
            newj2len = {}
            for j in b2j.get(a[i], nothing):
                # a[i] matches b[j]
                if j < blo:
                    continue
                if j >= bhi:
                    break
                k = newj2len[j] = j2lenget(j-1, 0) + 1
                if k > bestsize:
                    besti, bestj, bestsize = i-k+1, j-k+1, k
            j2len = newj2len

        # Extend the best by non-junk elements on each end.  In particular,
        # "popular" non-junk elements aren't in b2j, which greatly speeds
        # the inner loop above, but also means "the best" match so far
        # doesn't contain any junk *or* popular non-junk elements.
        while besti > alo and bestj > blo and \
              not isbjunk(b[bestj-1]) and \
              a[besti-1] == b[bestj-1]:
            besti, bestj, bestsize = besti-1, bestj-1, bestsize+1
        while besti+bestsize < ahi and bestj+bestsize < bhi and \
              not isbjunk(b[bestj+bestsize]) and \
              a[besti+bestsize] == b[bestj+bestsize]:
            bestsize += 1

        # Now that we have a wholly interesting match (albeit possibly
        # empty!), we may as well suck up the matching junk on each
        # side of it too.  Can't think of a good reason not to, and it
        # saves post-processing the (possibly considerable) expense of
        # figuring out what to do with it.  In the case of an empty
        # interesting match, this is clearly the right thing to do,
        # because no other kind of match is possible in the regions.
        while besti > alo and bestj > blo and \
              isbjunk(b[bestj-1]) and \
              a[besti-1] == b[bestj-1]:
            besti, bestj, bestsize = besti-1, bestj-1, bestsize+1
        while besti+bestsize < ahi and bestj+bestsize < bhi and \
              isbjunk(b[bestj+bestsize]) and \
              a[besti+bestsize] == b[bestj+bestsize]:
            bestsize = bestsize + 1

        return Match(besti, bestj, bestsize)

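The junk-free search above is a dict-based longest-common-substring dynamic program: `j2len[j]` carries the length of the match ending at `b[j]`, and each row is rebuilt from `j2len[j-1]`. A minimal standalone sketch of the same recurrence (Python 3 syntax, no junk handling or `alo/ahi` windowing, purely illustrative):

```python
def longest_match(a, b):
    """Longest contiguous match between a and b: returns (i, j, size)."""
    b2j = {}
    for j, x in enumerate(b):
        b2j.setdefault(x, []).append(j)
    besti = bestj = bestsize = 0
    j2len = {}                      # j2len[j]: longest match ending at b[j]
    for i, x in enumerate(a):
        newj2len = {}
        for j in b2j.get(x, []):
            # extend the match that ended at b[j-1], or start a new one
            k = newj2len[j] = j2len.get(j - 1, 0) + 1
            if k > bestsize:
                besti, bestj, bestsize = i - k + 1, j - k + 1, k
        j2len = newj2len
    return besti, bestj, bestsize

# Same answer as the docstring's first (junk-free) example:
assert longest_match(" abcd", "abcd abcd") == (0, 4, 5)
```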
    def get_matching_blocks(self):
        """Return list of triples describing matching subsequences.

        Each triple is of the form (i, j, n), and means that
        a[i:i+n] == b[j:j+n].  The triples are monotonically increasing in
        i and in j.  New in Python 2.5, it's also guaranteed that if
        (i, j, n) and (i', j', n') are adjacent triples in the list, and
        the second is not the last triple in the list, then i+n != i' or
        j+n != j'.  IOW, adjacent triples never describe adjacent equal
        blocks.

        The last triple is a dummy, (len(a), len(b), 0), and is the only
        triple with n==0.

        >>> s = SequenceMatcher(None, "abxcd", "abcd")
        >>> s.get_matching_blocks()
        [Match(a=0, b=0, size=2), Match(a=3, b=2, size=2), Match(a=5, b=4, size=0)]
        """

        if self.matching_blocks is not None:
            return self.matching_blocks
        la, lb = len(self.a), len(self.b)

        # This is most naturally expressed as a recursive algorithm, but
        # at least one user bumped into extreme use cases that exceeded
        # the recursion limit on their box.  So, now we maintain a list
        # (`queue`) of blocks we still need to look at, and append partial
        # results to `matching_blocks` in a loop; the matches are sorted
        # at the end.
        queue = [(0, la, 0, lb)]
        matching_blocks = []
        while queue:
            alo, ahi, blo, bhi = queue.pop()
            i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
            # a[alo:i] vs b[blo:j] unknown
            # a[i:i+k] same as b[j:j+k]
            # a[i+k:ahi] vs b[j+k:bhi] unknown
            if k:   # if k is 0, there was no matching block
                matching_blocks.append(x)
                if alo < i and blo < j:
                    queue.append((alo, i, blo, j))
                if i+k < ahi and j+k < bhi:
                    queue.append((i+k, ahi, j+k, bhi))