
"""
Module difflib -- helpers for computing deltas between objects.

Function get_close_matches(word, possibilities, n=3, cutoff=0.6):
    Use SequenceMatcher to return list of the best "good enough" matches.

Function context_diff(a, b):
    For two lists of strings, return a delta in context diff format.

Function ndiff(a, b):
    Return a delta: the difference between `a` and `b` (lists of strings).

Function restore(delta, which):
    Return one of the two sequences that generated an ndiff delta.

Function unified_diff(a, b):
    For two lists of strings, return a delta in unified diff format.

Class SequenceMatcher:
    A flexible class for comparing pairs of sequences of any type.

Class Differ:
    For producing human-readable deltas from sequences of lines of text.

Class HtmlDiff:
    For producing HTML side by side comparison with change highlights.
"""
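
The module-level helpers summarized in the docstring above can be tried with a short sketch like the following. The word list and the `before`/`after` line lists are made-up illustration data, not part of the module.

```python
import difflib

# Fuzzy lookup: rank candidates by similarity to a misspelled word.
candidates = ["ape", "apple", "peach", "puppy"]
close = difflib.get_close_matches("appel", candidates)

# Line-based deltas between two small "files" (lists of lines).
before = ["one\n", "two\n", "three\n"]
after = ["one\n", "2\n", "three\n"]
diff = list(difflib.unified_diff(before, after,
                                 fromfile="before", tofile="after"))

# An ndiff delta carries both inputs; restore(delta, 2) recovers the second.
restored = list(difflib.restore(difflib.ndiff(before, after), 2))

print(close)
print("".join(diff))
```

Here `close` should list the best matches first, and `restored` should equal `after`, since an ndiff delta keeps enough information to rebuild either input.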

__all__ = ['get_close_matches', 'ndiff', 'restore', 'SequenceMatcher',
           'Differ', 'IS_CHARACTER_JUNK', 'IS_LINE_JUNK', 'context_diff',
           'unified_diff', 'HtmlDiff', 'Match']

import heapq
from collections import namedtuple as _namedtuple
from functools import reduce

Match = _namedtuple('Match', 'a b size')

def _calculate_ratio(matches, length):
    if length:
        return 2.0 * matches / length
    return 1.0
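
As a rough worked example of the formula in `_calculate_ratio`, the sketch below recomputes a similarity score by hand. It assumes `matches` is the total number of matched elements and `length` is the combined length of both sequences; `calculate_ratio` is a local copy of the helper made only for illustration.

```python
from difflib import SequenceMatcher

def calculate_ratio(matches, length):
    # Same formula as the helper above: 2*M / T,
    # falling back to 1.0 when both sequences are empty.
    if length:
        return 2.0 * matches / length
    return 1.0

a, b = "abcd", "bcde"
matcher = SequenceMatcher(None, a, b)

# Sum the sizes of all matching blocks; the trailing dummy block adds 0.
matched = sum(block.size for block in matcher.get_matching_blocks())
ratio = calculate_ratio(matched, len(a) + len(b))

print(matched, ratio)
```

For these inputs the only real match is "bcd", so 3 matched elements over a combined length of 8 gives 2 * 3 / 8 = 0.75.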

class SequenceMatcher:

    """
    SequenceMatcher is a flexible class for comparing pairs of sequences of
    any type, so long as the sequence elements are hashable. The basic
    algorithm predates, and is a little fancier than, an algorithm
    published in the late 1980s by Ratcliff and Obershelp under the
    hyperbolic name "gestalt pattern matching". The basic idea is to find
    the longest contiguous matching subsequence that contains no "junk"
    elements (R-O doesn't address junk). The same idea is then applied
    recursively to the pieces of the sequences to the left and to the right
    of the matching subsequence. This does not yield minimal edit
    sequences, but does tend to yield matches that "look right" to people.

    SequenceMatcher tries to compute a "human-friendly diff" between two
    sequences. Unlike e.g. UNIX(tm) diff, the fundamental notion is the
    longest *contiguous* & junk-free matching subsequence. That's what
    catches people's eyes. The Windows(tm) windiff has another interesting
    notion, pairing up elements that appear uniquely in each sequence.
    That, and the method here, appear to yield more intuitive difference
    reports than does diff. This method appears to be the least vulnerable
    to synching up on blocks of "junk lines", though (like blank lines in
    ordinary text files, or maybe "<P>" lines in HTML files). That may be
    because this is the only method of the 3 that has a *concept* of
    "junk" <wink>.

    Example, comparing two strings, and considering blanks to be "junk":

    >>> s = SequenceMatcher(lambda x: x == " ",
    ...                     "private Thread currentThread;",
    ...                     "private volatile Thread currentThread;")
    >>>

    .ratio() returns a float in [0, 1], measuring the "similarity" of the
    sequences. As a rule of thumb, a .ratio() value over 0.6 means the
    sequences are close matches:

    >>> print(round(s.ratio(), 3))
    0.866
    >>>

    If you're only interested in where the sequences match,
    .get_matching_blocks() is handy:

    >>> for block in s.get_matching_blocks():
    ...     print("a[%d] and b[%d] match for %d elements" % block)
    a[0] and b[0] match for 8 elements
    a[8] and b[17] match for 21 elements
    a[29] and b[38] match for 0 elements

    Note that the last tuple returned by .get_matching_blocks() is always a
    dummy, (len(a), len(b), 0), and this is the only case in which the last
    tuple element (number of elements matched) is 0.

    If you want to know how to change the first sequence into the second,
    use .get_opcodes():

    >>> for opcode in s.get_opcodes():
    ...     print("%6s a[%d:%d] b[%d:%d]" % opcode)
     equal a[0:8] b[0:8]
    insert a[8:8] b[8:17]
     equal a[8:29] b[17:38]

    See the Differ class for a fancy human-friendly file differencer, which
    uses SequenceMatcher both to compare sequences of lines, and to compare
    sequences of characters within similar (near-matching) lines.

    See also function get_close_matches() in this module, which shows how
    simple code building on SequenceMatcher can be used to do useful work.

    Timing: Basic R-O is cubic time worst case and quadratic time expected
    case. SequenceMatcher is quadratic time for the worst case and has
    expected-case behavior dependent in a complicated way on how many
    elements the sequences have in common; best case time is linear.

    Methods:

    __init__(isjunk=None, a='', b='', autojunk=True)
        Construct a SequenceMatcher.

    set_seqs(a, b)
        Set the two sequences to be compared.

    set_seq1(a)
        Set the first sequence to be compared.

    set_seq2(b)
        Set the second sequence to be compared.

    find_longest_match(alo, ahi, blo, bhi)
        Find longest matching block in a[alo:ahi] and b[blo:bhi].

    get_matching_blocks()
        Return list of triples describing matching subsequences.

    get_opcodes()
        Return list of 5-tuples describing how to turn a into b.

    ratio()
        Return a measure of the sequences' similarity (float in [0,1]).

    quick_ratio()
        Return an upper bound on .ratio() relatively quickly.

    real_quick_ratio()
        Return an upper bound on .ratio() very quickly.
    """

    def __init__(self, isjunk=None, a='', b='', autojunk=True):
        """Construct a SequenceMatcher.

        Optional arg isjunk is None (the default), or a one-argument
        function that takes a sequence element and returns true iff the
        element is junk. None is equivalent to passing "lambda x: 0", i.e.
        no elements are considered to be junk. For example, pass
            lambda x: x in " \\t"
        if you're comparing lines as sequences of characters, and don't
        want to synch up on blanks or hard tabs.

        Optional arg a is the first of two sequences to be compared. By
        default, an empty string. The elements of a must be hashable. See
        also .set_seqs() and .set_seq1().

        Optional arg b is the second of two sequences to be compared. By
        default, an empty string. The elements of b must be hashable. See
        also .set_seqs() and .set_seq2().

        Optional arg autojunk should be set to False to disable the
        "automatic junk heuristic" that treats popular elements as junk
        (see module documentation for more information).
        """

        # Members:
        # a
        #      first sequence
        # b
        #      second sequence; differences are computed as "what do
        #      we need to do to 'a' to change it into 'b'?"
        # b2j
        #      for x in b, b2j[x] is a list of the indices (into b)
        #      at which x appears; junk elements do not appear
        # fullbcount
        #      for x in b, fullbcount[x] == the number of times x
        #      appears in b; only materialized if really needed (used
        #      only for computing quick_ratio())
        # matching_blocks
        #      a list of (i, j, k) triples, where a[i:i+k] == b[j:j+k];
        #      ascending & non-overlapping in i and in j; terminated by
        #      a dummy (len(a), len(b), 0) sentinel
        # opcodes
        #      a list of (tag, i1, i2, j1, j2) tuples, where tag is
        #      one of
        #          'replace'   a[i1:i2] should be replaced by b[j1:j2]
        #          'delete'    a[i1:i2] should be deleted
        #          'insert'    b[j1:j2] should be inserted
        #          'equal'     a[i1:i2] == b[j1:j2]
        # isjunk
        #      a user-supplied function taking a sequence element and
        #      returning true iff the element is "junk" -- this has
        #      subtle but helpful effects on the algorithm, which I'll
        #      get around to writing up someday <0.9 wink>.
        #      DON'T USE! Only __chain_b uses this. Use isbjunk.
        # isbjunk
        #      for x in b, isbjunk(x) == isjunk(x) but much faster;
        #      it's really the __contains__ method of a hidden dict.
        #      DOES NOT WORK for x in a!
        # isbpopular
        #      for x in b, isbpopular(x) is true iff b is reasonably long
        #      (at least 200 elements) and x accounts for more than 1 + 1% of
        #      its elements (when autojunk is enabled).
        #      DOES NOT WORK for x in a!

        self.isjunk = isjunk
        self.a = self.b = None
        self.autojunk = autojunk
        self.set_seqs(a, b)

    def set_seqs(self, a, b):
        """Set the two sequences to be compared.

        >>> s = SequenceMatcher()
        >>> s.set_seqs("abcd", "bcde")
        >>> s.ratio()
        0.75
        """

        self.set_seq1(a)
        self.set_seq2(b)

    def set_seq1(self, a):
        """Set the first sequence to be compared.

        The second sequence to be compared is not changed.

        >>> s = SequenceMatcher(None, "abcd", "bcde")
        >>> s.ratio()
        0.75
        >>> s.set_seq1("bcde")
        >>> s.ratio()
        1.0
        >>>

        SequenceMatcher computes and caches detailed information about the
        second sequence, so if you want to compare one sequence S against
        many sequences, use .set_seq2(S) once and call .set_seq1(x)
        repeatedly for each of the other sequences.
