"""Heap queue algorithm (a.k.a. priority queue).

Heaps are arrays for which a[k] <= a[2*k+1] and a[k] <= a[2*k+2] for
all k, counting elements from 0.  For the sake of comparison,
non-existing elements are considered to be infinite.  The interesting
property of a heap is that a[0] is always its smallest element.

Usage:

heap = []            # creates an empty heap
heappush(heap, item) # pushes a new item on the heap
item = heappop(heap) # pops the smallest item from the heap
item = heap[0]       # smallest item on the heap without popping it
heapify(x)           # transforms list into a heap, in-place, in linear time
item = heapreplace(heap, item) # pops and returns smallest item, and adds
                               # new item; the heap size is unchanged

Our API differs from textbook heap algorithms as follows:

- We use 0-based indexing.  This makes the relationship between the
  index for a node and the indexes for its children slightly less
  obvious, but is more suitable since Python uses 0-based indexing.

- Our heappop() method returns the smallest item, not the largest.

These two make it possible to view the heap as a regular Python list
without surprises: heap[0] is the smallest item, and heap.sort()
maintains the heap invariant!

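The usage summary above can be exercised with a short sketch.  It
assumes this module is importable as the standard `heapq'; the sample
values and the helper is_heap() are illustrative only and are not part
of the module.

```python
from heapq import heapify, heappop, heappush, heapreplace

# Arbitrary sample data, chosen only for illustration.
heap = []
for value in [5, 1, 4, 2, 3]:
    heappush(heap, value)

smallest = heap[0]                # peek at the smallest item without popping
popped = heappop(heap)            # removes and returns the smallest item
replaced = heapreplace(heap, 10)  # pops the smallest item, then adds 10

data = [9, 7, 8, 6]
heapify(data)                     # in-place, linear time


def is_heap(a):
    # Hypothetical helper: check a[k] <= a[2*k+1] and a[k] <= a[2*k+2]
    # wherever the child index exists.
    n = len(a)
    return all(a[k] <= a[c]
               for k in range(n)
               for c in (2 * k + 1, 2 * k + 2)
               if c < n)


heap.sort()                       # a sorted list still satisfies the invariant
```

Because a sorted list trivially satisfies a[k] <= a[2*k+1] and
a[k] <= a[2*k+2], is_heap(heap) still holds after heap.sort().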

"""

# Original code by Kevin O'Connor, augmented by Tim Peters and Raymond Hettinger

__about__ = """Heap queues

[explanation by François Pinard]

Heaps are arrays for which a[k] <= a[2*k+1] and a[k] <= a[2*k+2] for
all k, counting elements from 0.  For the sake of comparison,
non-existing elements are considered to be infinite.  The interesting
property of a heap is that a[0] is always its smallest element.

The strange invariant above is meant to be an efficient memory
representation for a tournament.  The numbers below are `k', not a[k]:

                                   0

                  1                                 2

          3               4                5               6

      7       8       9       10      11      12      13      14

    15 16   17 18   19 20   21 22   23 24   25 26   27 28   29 30


In the tree above, each cell `k' tops cells `2*k+1' and `2*k+2'.  In
the usual binary tournament we see in sports, each cell is the winner
over the two cells it tops, and we can trace the winner down the tree
to see all opponents s/he had.  However, in many computer applications
of such tournaments, we do not need to trace the history of a winner.
To be more memory efficient, when a winner is promoted, we try to
replace it by something else at a lower level, and the rule becomes
that a cell and the two cells it tops contain three different items,
but the top cell "wins" over the two topped cells.

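The index arithmetic behind the tree can be sketched in a few lines.
The helper names children(), parent() and levels() are hypothetical
and exist only for this illustration; they are not part of the module.

```python
def children(k):
    # Cell k tops cells 2*k+1 and 2*k+2.
    return 2 * k + 1, 2 * k + 2


def parent(k):
    # Inverse relation for k >= 1; the same arithmetic as (k - 1) >> 1.
    return (k - 1) // 2


def levels(n):
    # Group indices 0..n-1 into tree rows holding 1, 2, 4, 8, ... cells.
    rows, start, width = [], 0, 1
    while start < n:
        rows.append(list(range(start, min(start + width, n))))
        start += width
        width *= 2
    return rows
```

For 31 cells, levels(31) gives five rows, the last one running from
15 to 30, and parent(30) is 14 because cell 14 tops cells 29 and 30.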

If this heap invariant is protected at all times, index 0 is clearly
the overall winner.  The simplest algorithmic way to remove it and
find the "next" winner is to move some loser (let's say cell 30 in the
diagram above) into the 0 position, and then percolate this new 0 down
the tree, exchanging values, until the invariant is re-established.
This is clearly logarithmic in the total number of items in the tree.
By iterating over all items, you get an O(n ln n) sort.

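A minimal sketch of that sort, assuming this module is importable as
the standard `heapq'.  The function name heapsort() is hypothetical
and is not one of the names exported by the module.

```python
from heapq import heapify, heappop


def heapsort(iterable):
    # Build the heap in linear time, then remove the winner n times.
    # Each heappop() costs O(log n), giving O(n log n) overall.
    heap = list(iterable)
    heapify(heap)
    return [heappop(heap) for _ in range(len(heap))]
```

The input is copied first, so the caller's data is left untouched.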

A nice feature of this sort is that you can efficiently insert new
items while the sort is going on, provided that the inserted items are
not "better" than the last 0'th element you extracted.  This is
especially useful in simulation contexts, where the tree holds all
incoming events, and the "win" condition means the smallest scheduled
time.  When an event schedules other events for execution, they are
scheduled into the future, so they can easily go into the heap.  So, a
heap is a good structure for implementing schedulers (this is what I
used for my MIDI sequencer :-).

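A small event loop along these lines might look like the sketch below.
It assumes this module is importable as the standard `heapq'; the
function run(), its parameters and the event names are hypothetical
and only illustrate the idea of scheduling into the future.

```python
from heapq import heappop, heappush


def run(initial_events, follow_ups, limit=100):
    # initial_events: iterable of (time, name) pairs.
    # follow_ups: dict mapping name -> list of (delay, name) pairs.  Delays
    # must be non-negative so new events never precede the current time.
    heap = []
    counter = 0  # tie-breaker: equal times fire in insertion order
    for time, name in initial_events:
        heappush(heap, (time, counter, name))
        counter += 1
    fired = []
    while heap and len(fired) < limit:
        time, _, name = heappop(heap)  # smallest scheduled time "wins"
        fired.append((time, name))
        for delay, next_name in follow_ups.get(name, []):
            heappush(heap, (time + delay, counter, next_name))
            counter += 1
    return fired


fired = run([(0, 'start'), (5, 'tick')],
            {'start': [(2, 'a'), (7, 'b')], 'a': [(1, 'c')]})
```

The counter in each tuple is a design choice of this sketch: it keeps
the comparison from ever reaching the event names when two times tie.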

Various structures for implementing schedulers have been extensively
studied, and heaps are good for this, as they are reasonably speedy,
the speed is almost constant, and the worst case is not much different
from the average case.  However, there are other representations which
are more efficient overall, yet the worst cases might be terrible.

Heaps are also very useful in big disk sorts.  You most probably all
know that a big sort implies producing "runs" (which are pre-sorted
sequences, whose size is usually related to the amount of CPU memory),
followed by merging passes for these runs, which merging is often
very cleverly organised[1].  It is very important that the initial
sort produces the longest runs possible.  Tournaments are a good way
to achieve that.  If, using all the memory available to hold a
tournament, you replace and percolate items that happen to fit the
current run, you'll produce runs which are twice the size of the
memory for random input, and much better for input fuzzily ordered.

Moreover, if you output the 0'th item on disk and get an input which
may not fit in the current tournament (because the value "wins" over
the last output value), it cannot fit in the heap, so the size of the
heap decreases.  The freed memory could be cleverly reused immediately
for progressively building a second heap, which grows at exactly the
same rate the first heap is melting.  When the first heap completely
vanishes, you switch heaps and start a new run.  Clever and quite
effective!

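The two-heap scheme can be sketched as follows.  This is only an
in-memory illustration under stated assumptions: the generator name
replacement_selection() and its `memory' parameter are hypothetical,
runs are returned as lists rather than written to disk, and `memory'
is assumed to be at least 1.

```python
from heapq import heapify, heappop, heappush


def replacement_selection(items, memory):
    # Yield sorted runs while buffering at most `memory` items (memory >= 1).
    it = iter(items)
    # range() comes first so zip() does not consume one item too many.
    current = [x for _, x in zip(range(memory), it)]
    heapify(current)
    pending = []   # second heap: items that must wait for the next run
    run = []
    for item in it:
        smallest = heappop(current)
        run.append(smallest)
        if item >= smallest:
            heappush(current, item)   # still fits the current run
        else:
            heappush(pending, item)   # "wins" over the last output value
        if not current:
            # The first heap has vanished: switch heaps, start a new run.
            yield run
            run, current, pending = [], pending, []
    # Input exhausted: flush what is left in both heaps.
    while current:
        run.append(heappop(current))
    if run:
        yield run
    if pending:
        yield [heappop(pending) for _ in range(len(pending))]


runs = list(replacement_selection([5, 3, 8, 1, 9, 2, 7, 4, 6], 3))
```

With room for only three items, this nine-item input comes out as two
runs of four and five items; the runs could then be combined with the
merge() function of this module.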

In a word, heaps are useful memory structures to know.  I use them in
a few applications, and I think it is good to keep a `heap' module
around. :-)

--------------------
[1] The disk balancing algorithms which are current, nowadays, are
more annoying than clever, and this is a consequence of the seeking
capabilities of the disks.  On devices which cannot seek, like big
tape drives, the story was quite different, and one had to be very
clever to ensure (far in advance) that each tape movement will be the
most effective possible (that is, will best participate in
"progressing" the merge).  Some tapes were even able to read
backwards, and this was also used to avoid the rewinding time.
Believe me, really good tape sorts were quite spectacular to watch!
From all times, sorting has always been a Great Art! :-)
"""

__all__ = ['heappush', 'heappop', 'heapify', 'heapreplace', 'merge',
           'nlargest', 'nsmallest', 'heappushpop']

def heappush(heap, item):
    """Push item onto heap, maintaining the heap invariant."""
    heap.append(item)
    _siftdown(heap, 0, len(heap)-1)

def heappop(heap):
    """Pop the smallest item off the heap, maintaining the heap invariant."""
    lastelt = heap.pop()    # raises appropriate IndexError if heap is empty
    if heap:
        returnitem = heap[0]
        heap[0] = lastelt
        _siftup(heap, 0)
        return returnitem
    return lastelt

def heapreplace(heap, item):
    """Pop and return the current smallest value, and add the new item.

    This is more efficient than heappop() followed by heappush(), and can be
    more appropriate when using a fixed-size heap.  Note that the value
    returned may be larger than item!  That constrains reasonable uses of
    this routine unless written as part of a conditional replacement:

        if item > heap[0]:
            item = heapreplace(heap, item)
    """
    returnitem = heap[0]    # raises appropriate IndexError if heap is empty
    heap[0] = item
    _siftup(heap, 0)
    return returnitem

def heappushpop(heap, item):
    """Fast version of a heappush followed by a heappop."""
    if heap and heap[0] < item:
        item, heap[0] = heap[0], item
        _siftup(heap, 0)
    return item

def heapify(x):
    """Transform list into a heap, in-place, in O(len(x)) time."""
    n = len(x)
    # Transform bottom-up.  The largest index there's any point to looking at
    # is the largest with a child index in-range, so must have 2*i + 1 < n,
    # or i < (n-1)/2.  If n is even = 2*j, this is (2*j-1)/2 = j-1/2 so
    # j-1 is the largest, which is n//2 - 1.  If n is odd = 2*j+1, this is
    # (2*j+1-1)/2 = j so j-1 is the largest, and that's again n//2-1.
    for i in reversed(range(n//2)):
        _siftup(x, i)

def _heappop_max(heap):
    """Maxheap version of a heappop."""
    lastelt = heap.pop()    # raises appropriate IndexError if heap is empty
    if heap:
        returnitem = heap[0]
        heap[0] = lastelt
        _siftup_max(heap, 0)
        return returnitem
    return lastelt

def _heapreplace_max(heap, item):
    """Maxheap version of a heappop followed by a heappush."""
    returnitem = heap[0]    # raises appropriate IndexError if heap is empty
    heap[0] = item
    _siftup_max(heap, 0)
    return returnitem

def _heapify_max(x):
    """Transform list into a maxheap, in-place, in O(len(x)) time."""
    n = len(x)
    for i in reversed(range(n//2)):
        _siftup_max(x, i)

# 'heap' is a heap at all indices >= startpos, except possibly for pos.  pos
# is the index of a leaf with a possibly out-of-order value.  Restore the
# heap invariant.
def _siftdown(heap, startpos, pos):
    newitem = heap[pos]
    # Follow the path to the root, moving parents down until finding a place
    # newitem fits.
    while pos > startpos:
        parentpos = (pos - 1) >> 1
        parent = heap[parentpos]
        if newitem < parent:
            heap[pos] = parent
            pos = parentpos
            continue
        break
    heap[pos] = newitem

# The child indices of heap index pos are already heaps, and we want to make
# a heap at index pos too.  We do this by bubbling the smaller child of
# pos up (and so on with that child's children, etc) until hitting a leaf,
# then using _siftdown to move the oddball originally at index pos into place.
#
# We *could* break out of the loop as soon as we find a pos where newitem <=
# both its children, but turns out that's not a good idea, and despite that
# many books write the algorithm that way.  During a heap pop, the last array
# element is sifted in, and that tends to be large, so that comparing it
# against values starting from the root usually doesn't pay (= usually doesn't
# get us out of the loop early).  See Knuth, Volume 3, where this is
# explained and quantified in an exercise.
#
# Cutting the # of comparisons is important, since these routines have no
# way to extract "the priority" from an array element, so that intelligence
# is likely to be hiding in custom comparison methods, or in array elements
# storing (priority, record) tuples.  Comparisons are thus potentially
# expensive.
#
# On random arrays of length 1000, making this change cut the number of
# comparisons made by heapify() a little, and those made by exhaustive
# heappop() a lot, in accord with theory.  Here are typical results from 3
# runs (3 just to demonstrate how small the variance is):
#
# Compares needed by heapify     Compares needed by 1000 heappops
# --------------------------     --------------------------------
# 1837 cut to 1663               14996 cut to 8680
# 1855 cut to 1659               14966 cut to 8678
# 1847 cut to 1660               15024 cut to 8703
#
# Building the heap by using heappush() 1000 times instead required
# 2198, 2148, and 2219 compares:  heapify() is more efficient, when
# you can use it.
#
# The total compares needed by list.sort() on the same lists were 8627,
# 8627, and 8632 (this should be compared to the sum of heapify() and
# heappop() compares):  list.sort() is (unsurprisingly!) more efficient
# for sorting.

def _siftup(heap, pos):
    endpos = len(heap)
    startpos = pos
    newitem = heap[pos]
    # Bubble up the smaller child until hitting a leaf.
    childpos = 2*pos + 1    # leftmost child position
    while childpos < endpos:
        # Set childpos to index of smaller child.
        rightpos = childpos + 1
        if rightpos < endpos and not heap[childpos] < heap[rightpos]:
            childpos = rightpos
        # Move the smaller child up.
        heap[pos] = heap[childpos]
        pos = childpos
        childpos = 2*pos + 1
    # The leaf at pos is empty now.  Put newitem there, and bubble it up
    # to its final resting place (by sifting its parents down).
    heap[pos] = newitem
    _siftdown(heap, startpos, pos)

def _siftdown_max(heap, startpos, pos):
    'Maxheap variant of _siftdown'
    newitem = heap[pos]
    # Follow the path to the root, moving parents down until finding a place
    # newitem fits.
    while pos > startpos:
        parentpos = (pos - 1) >> 1
        parent = heap[parentpos]
        if parent < newitem:
            heap[pos] = parent
            pos = parentpos
            continue
        break
    heap[pos] = newitem

def _siftup_max(heap, pos):
    'Maxheap variant of _siftup'
    endpos = len(heap)
    startpos = pos
    newitem = heap[pos]
    # Bubble up the larger child until hitting a leaf.
    childpos = 2*pos + 1    # leftmost child position
    while childpos < endpos:
        # Set childpos to index of larger child.
        rightpos = childpos + 1
        if rightpos < endpos and not heap[rightpos] < heap[childpos]:
            childpos = rightpos
        # Move the larger child up.
        heap[pos] = heap[childpos]
        pos = childpos
        childpos = 2*pos + 1
    # The leaf at pos is empty now.  Put newitem there, and bubble it up
    # to its final resting place (by sifting its parents down).
    heap[pos] = newitem
    _siftdown_max(heap, startpos, pos)

def merge(*iterables, key=None, reverse=False):
    '''Merge multiple sorted inputs into a single sorted output.

    Similar to sorted(itertools.chain(*iterables)) but returns a generator,
    does not pull the data into memory all at once, and assumes that each of
    the input streams is already sorted (smallest to largest).

    >>> list(merge([1,3,5,7], [0,2,4,8], [5,10,15,20], [], [25]))
    [0, 1, 2, 3, 4, 5, 5, 7, 8, 10, 15, 20, 25]

    If *key* is not None, applies a key function to each element to determine
    its sort order.

    >>> list(merge(['dog', 'horse'], ['cat', 'fish', 'kangaroo'], key=len))
    ['dog', 'cat', 'fish', 'horse', 'kangaroo']

    '''

    h = []
    h_append = h.append

    if reverse:
        _heapify = _heapify_max
        _heappop = _heappop_max
        _heapreplace = _heapreplace_max
        direction = -1
    else:
        _heapify = heapify
        _heappop = heappop
        _heapreplace = heapreplace
        direction = 1

    if key is None:
        for order, it in enumerate(map(iter, iterables)):
            try:
                next = it.__next__
                h_append([next(), order * direction, next])
            except StopIteration:
                pass
        _heapify(h)
        while len(h) > 1:
            try:
                while True:
                    value, order, next = s = h[0]
                    yield value
                    s[0] = next()           # raises StopIteration when exhausted
                    _heapreplace(h, s)      # restore heap condition
            except StopIteration:
                _heappop(h)                 # remove empty iterator
        if h:
            # fast case when only a single iterator remains
            value, order, next = h[0]
            yield value
            yield from next.__self__
        return

    for order, it in enumerate(map(iter, iterables)):
        try:
            next = it.__next__
            value = next()
            h_append([key(value), order * direction, value, next])
        except StopIteration:
            pass
    _heapify(h)
    while len(h) > 1:
        try:
            while True:
                key_value, order, value, next = s = h[0]
                yield value
                value = next()
                s[0] = key(value)
                s[2] = value
                _heapreplace(h, s)
        except StopIteration:
            _heappop(h)
    if h:
        key_value, order, value, next = h[0]
        yield value
        yield from next.__self__

# Algorithm notes for nlargest() and nsmallest()
# ==============================================
#
# Make a single pass over the data while keeping the k most extreme values
# in a heap.  Memory consumption is limited to keeping k values in a list.
#
# Measured performance for random inputs:
#
#                                   number of comparisons
#    n inputs     k-extreme values  (average of 5 trials)   % more than min()
# -------------   ----------------  ---------------------   -----------------
#      1,000           100                  3,317               231.7%
#     10,000           100                 14,046                40.5%
#    100,000           100                105,749                 5.7%
#  1,000,000           100              1,007,751                 0.8%
# 10,000,000           100             10,009,401                 0.1%

# Theoretical number of comparisons for k smallest of n random inputs:
#
# Step   Comparisons                  Action
# ----   --------------------------   ---------------------------
#  1     1.66 * k                     heapify the first k-inputs
#  2     n - k                        compare remaining elements to top of heap
#  3     k * (1 + lg2(k)) * ln(n/k)   replace the topmost value on the heap
#  4     k * lg2(k) - (k/2)           final sort of the k most extreme values
#
# Combining and simplifying for a rough estimate gives:
#
#        comparisons = n + k * (log(k, 2) * log(n/k) + log(k, 2) + log(n/k))

# Computing the number of comparisons for step 3:
# -----------------------------------------------
# * For the i-th new value from the iterable, the probability of being in the
#   k most extreme values is k/i.  For example, the probability of the 101st
#   value seen being in the 100 most extreme values is 100/101.
# * If the value is a new extreme value, the cost of inserting it into the
#   heap is 1 + log(k, 2).
# * The probability times the cost gives:
#            (k/i) * (1 + log(k, 2))
# * Summing across the remaining n-k elements gives:
#            sum((k/i) * (1 + log(k, 2)) for i in range(k+1, n+1))
# * This reduces to:
#            (H(n) - H(k)) * k * (1 + log(k, 2))
# * Where H(n) is the n-th harmonic number estimated by:
#            gamma = 0.5772156649
#            H(n) = log(n, e) + gamma + 1 / (2 * n)
#   http://en.wikipedia.org/wiki/Harmonic_series_(mathematics)#Rate_of_divergence
# * Substituting the H(n) formula:
#            comparisons = k * (1 + log(k, 2)) * (log(n/k, e) + (1/n - 1/k) / 2)
#
# Worst-case for step 3:
# ----------------------
# In the worst case, the input data is reverse sorted so that every new element
# must be inserted in the heap:
#
#             comparisons = 1.66 * k + log(k, 2) * (n - k)
#
# Alternative Algorithms
# ----------------------
# Other algorithms were not used because they:
# 1) Took much more auxiliary memory.
# 2) Made multiple passes over the data.
# 3) Made more comparisons in common cases (small k, large n, semi-random input).
# See the more detailed comparison of approach at:
# http://code.activestate.com/recipes/577573-compare-algorithms-for-heapqsmallest

def nsmallest(n, iterable, key=None):
    """Find the n smallest elements in a dataset.

    Equivalent to:  sorted(iterable, key=key)[:n]
    """

    # Short-cut for n==1 is to use min()
    if n == 1:
        it = iter(iterable)
        sentinel = object()
        result = min(it, default=sentinel, key=key)
        return [] if result is sentinel else [result]

    # When n>=size, it's faster to use sorted()
    try:
        size = len(iterable)
    except (TypeError, AttributeError):
        pass
    else:
        if n >= size:
            return sorted(iterable, key=key)[:n]

    # When key is none, use simpler decoration
    if key is None:
        it = iter(iterable)
        # put the range(n) first so that zip() doesn't
        # consume one too many elements from the iterator
        result = [(elem, i) for i, elem in zip(range(n), it)]
        if not result:
            return result
        _heapify_max(result)
        top = result[0][0]
        order = n
        _heapreplace = _heapreplace_max
        for elem in it:
            if elem < top:
                _heapreplace(result, (elem, order))
                top, _order = result[0]
                order += 1
        result.sort()