/brz/remove-bazaar

To get this branch, use:
bzr branch http://gegoxaren.bato24.eu/bzr/brz/remove-bazaar
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
1
# Copyright (C) 2008 Canonical Ltd
2
#
3
# This program is free software; you can redistribute it and/or modify
4
# it under the terms of the GNU General Public License as published by
5
# the Free Software Foundation; either version 2 of the License, or
6
# (at your option) any later version.
7
#
8
# This program is distributed in the hope that it will be useful,
9
# but WITHOUT ANY WARRANTY; without even the implied warranty of
10
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
11
# GNU General Public License for more details.
12
#
13
# You should have received a copy of the GNU General Public License
14
# along with this program; if not, write to the Free Software
15
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
16
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
17
"""Persistent maps from tuple_of_strings->string using CHK stores.
18
19
Overview and current status:
20
21
The CHKMap class implements a dict from tuple_of_strings->string by using a trie
22
with internal nodes of 8-bit fan out; The key tuples are mapped to strings by
23
joining them by \x00, and \x00 padding shorter keys out to the length of the
24
longest key. Leaf nodes are packed as densely as possible, and internal nodes
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
25
are all and additional 8-bits wide leading to a sparse upper tree.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
26
27
Updates to a CHKMap are done preferentially via the apply_delta method, to
28
allow optimisation of the update operation; but individual map/unmap calls are
29
possible and supported. All changes via map/unmap are buffered in memory until
30
the _save method is called to force serialisation of the tree. apply_delta
31
performs a _save implicitly.
32
33
TODO:
34
-----
35
36
Densely packed upper nodes.
37
38
"""
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
39
3735.2.31 by Robert Collins
CHKMap.iter_changes
40
import heapq
3735.9.18 by John Arbash Meinel
Make the versionedfile import lazy.
41
from bzrlib import lazy_import
42
lazy_import.lazy_import(globals(), """
43
from bzrlib import versionedfile
44
""")
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
45
46
47
class CHKMap(object):
48
    """A persistent map from string to string backed by a CHK store."""
49
50
    def __init__(self, store, root_key):
51
        """Create a CHKMap object.
52
53
        :param store: The store the CHKMap is stored in.
54
        :param root_key: The root key of the map. None to create an empty
55
            CHKMap.
56
        """
57
        self._store = store
58
        if root_key is None:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
59
            self._root_node = LeafNode()
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
60
        else:
3735.2.41 by Robert Collins
Make the parent_id_basename index be updated during CHKInventory.apply_delta.
61
            self._root_node = self._node_key(root_key)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
62
63
    def apply_delta(self, delta):
64
        """Apply a delta to the map.
65
66
        :param delta: An iterable of old_key, new_key, new_value tuples.
67
            If new_key is not None, then new_key->new_value is inserted
68
            into the map; if old_key is not None, then the old mapping
69
            of old_key is removed.
70
        """
71
        for old, new, value in delta:
72
            if old is not None and old != new:
73
                # unmap
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
74
                self.unmap(old)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
75
        for old, new, value in delta:
76
            if new is not None:
77
                # map
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
78
                self.map(new, value)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
79
        return self._save()
80
81
    def _ensure_root(self):
82
        """Ensure that the root node is an object not a key."""
83
        if type(self._root_node) == tuple:
84
            # Demand-load the root
3735.2.31 by Robert Collins
CHKMap.iter_changes
85
            self._root_node = self._get_node(self._root_node)
86
87
    def _get_node(self, node):
88
        """Get a node.
89
90
        Node that this does not update the _items dict in objects containing a
91
        reference to this node. As such it does not prevent subsequent IO being
92
        performed.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
93
3735.2.31 by Robert Collins
CHKMap.iter_changes
94
        :param node: A tuple key or node object.
95
        :return: A node object.
96
        """
97
        if type(node) == tuple:
98
            bytes = self._read_bytes(node)
99
            return _deserialise(bytes, node)
100
        else:
101
            return node
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
102
103
    def _read_bytes(self, key):
104
        stream = self._store.get_record_stream([key], 'unordered', True)
105
        return stream.next().get_bytes_as('fulltext')
106
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
107
    def _dump_tree(self):
108
        """Return the tree in a string representation."""
109
        self._ensure_root()
110
        res = self._dump_tree_node(self._root_node, prefix='', indent='')
3735.11.9 by John Arbash Meinel
Switch _dump_tree to returning trailing '\n' for nicer results
111
        res.append('') # Give a trailing '\n'
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
112
        return '\n'.join(res)
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
113
114
    def _dump_tree_node(self, node, prefix, indent):
115
        """For this node and all children, generate a string representation."""
116
        result = []
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
117
        node_key = node.key()
118
        if node_key is not None:
119
            node_key = node_key[0]
120
        result.append('%s%r %s %s' % (indent, prefix, node.__class__.__name__,
121
                                        node_key))
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
122
        if isinstance(node, InternalNode):
123
            # Trigger all child nodes to get loaded
124
            list(node._iter_nodes(self._store))
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
125
            for prefix, sub in sorted(node._items.iteritems()):
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
126
                result.extend(self._dump_tree_node(sub, prefix, indent + '  '))
127
        else:
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
128
            for key, value in sorted(node._items.iteritems()):
129
                result.append('      %r %r' % (key, value))
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
130
        return result
131
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
132
    @classmethod
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
133
    def from_dict(klass, store, initial_value, maximum_size=0, key_width=1):
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
134
        """Create a CHKMap in store with initial_value as the content.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
135
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
136
        :param store: The store to record initial_value in, a VersionedFiles
137
            object with 1-tuple keys supporting CHK key generation.
138
        :param initial_value: A dict to store in store. Its keys and values
139
            must be bytestrings.
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
140
        :param maximum_size: The maximum_size rule to apply to nodes. This
141
            determines the size at which no new data is added to a single node.
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
142
        :param key_width: The number of elements in each key_tuple being stored
143
            in this map.
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
144
        :return: The root chk of te resulting CHKMap.
145
        """
146
        result = CHKMap(store, None)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
147
        result._root_node.set_maximum_size(maximum_size)
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
148
        result._root_node._key_width = key_width
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
149
        delta = []
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
150
        for key, value in initial_value.items():
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
151
            delta.append((None, key, value))
152
        result.apply_delta(delta)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
153
        return result._save()
154
3735.2.30 by Robert Collins
Start iter_changes between CHKMap instances.
155
    def iter_changes(self, basis):
156
        """Iterate over the changes between basis and self.
157
158
        :return: An iterator of tuples: (key, old_value, new_value). Old_value
159
            is None for keys only in self; new_value is None for keys only in
160
            basis.
161
        """
3735.2.31 by Robert Collins
CHKMap.iter_changes
162
        # Overview:
163
        # Read both trees in lexographic, highest-first order.
164
        # Any identical nodes we skip
165
        # Any unique prefixes we output immediately.
166
        # values in a leaf node are treated as single-value nodes in the tree
167
        # which allows them to be not-special-cased. We know to output them
168
        # because their value is a string, not a key(tuple) or node.
169
        #
170
        # corner cases to beware of when considering this function:
171
        # *) common references are at different heights.
172
        #    consider two trees:
173
        #    {'a': LeafNode={'aaa':'foo', 'aab':'bar'}, 'b': LeafNode={'b'}}
174
        #    {'a': InternalNode={'aa':LeafNode={'aaa':'foo', 'aab':'bar'}, 'ab':LeafNode={'ab':'bar'}}
175
        #     'b': LeafNode={'b'}}
176
        #    the node with aaa/aab will only be encountered in the second tree
177
        #    after reading the 'a' subtree, but it is encountered in the first
178
        #    tree immediately. Variations on this may have read internal nodes like this.
179
        #    we want to cut the entire pending subtree when we realise we have a common node.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
180
        #    For this we use a list of keys - the path to a node - and check the entire path is
3735.2.31 by Robert Collins
CHKMap.iter_changes
181
        #    clean as we process each item.
182
        if self._node_key(self._root_node) == self._node_key(basis._root_node):
183
            return
184
        self._ensure_root()
185
        basis._ensure_root()
186
        excluded_keys = set()
187
        self_node = self._root_node
188
        basis_node = basis._root_node
189
        # A heap, each element is prefix, node(tuple/NodeObject/string),
190
        # key_path (a list of tuples, tail-sharing down the tree.)
191
        self_pending = []
192
        basis_pending = []
193
        def process_node(prefix, node, path, a_map, pending):
194
            # take a node and expand it
195
            node = a_map._get_node(node)
196
            if type(node) == LeafNode:
197
                path = (node._key, path)
198
                for key, value in node._items.items():
199
                    heapq.heappush(pending, ('\x00'.join(key), value, path))
200
            else:
201
                # type(node) == InternalNode
202
                path = (node._key, path)
203
                for prefix, child in node._items.items():
204
                    heapq.heappush(pending, (prefix, child, path))
205
        process_node(None, self_node, None, self, self_pending)
206
        process_node(None, basis_node, None, basis, basis_pending)
207
        self_seen = set()
208
        basis_seen = set()
209
        excluded_keys = set()
210
        def check_excluded(key_path):
211
            # Note that this is N^2, it depends on us trimming trees
212
            # aggressively to not become slow.
213
            # A better implementation would probably have a reverse map
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
214
            # back to the children of a node, and jump straight to it when
3735.2.31 by Robert Collins
CHKMap.iter_changes
215
            # a common node is detected, the proceed to remove the already
216
            # pending children. bzrlib.graph has a searcher module with a
217
            # similar problem.
218
            while key_path is not None:
219
                key, key_path = key_path
220
                if key in excluded_keys:
221
                    return True
222
            return False
223
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
224
        loop_counter = 0
3735.2.31 by Robert Collins
CHKMap.iter_changes
225
        while self_pending or basis_pending:
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
226
            loop_counter += 1
3735.2.31 by Robert Collins
CHKMap.iter_changes
227
            if not self_pending:
228
                # self is exhausted: output remainder of basis
229
                for prefix, node, path in basis_pending:
230
                    if check_excluded(path):
231
                        continue
232
                    node = basis._get_node(node)
233
                    if type(node) == str:
234
                        # a value
235
                        yield (tuple(prefix.split('\x00')), node, None)
236
                    else:
237
                        # subtree - fastpath the entire thing.
238
                        for key, value in node.iteritems(basis._store):
239
                            yield (key, value, None)
240
                return
241
            elif not basis_pending:
242
                # basis is exhausted: output remainder of self.
243
                for prefix, node, path in self_pending:
244
                    if check_excluded(path):
245
                        continue
246
                    node = self._get_node(node)
247
                    if type(node) == str:
248
                        # a value
249
                        yield (tuple(prefix.split('\x00')), None, node)
250
                    else:
251
                        # subtree - fastpath the entire thing.
252
                        for key, value in node.iteritems(self._store):
253
                            yield (key, None, value)
254
                return
255
            else:
256
                # XXX: future optimisation - yield the smaller items
257
                # immediately rather than pushing everything on/off the
258
                # heaps. Applies to both internal nodes and leafnodes.
259
                if self_pending[0][0] < basis_pending[0][0]:
260
                    # expand self
261
                    prefix, node, path = heapq.heappop(self_pending)
262
                    if check_excluded(path):
263
                        continue
264
                    if type(node) == str:
265
                        # a value
266
                        yield (tuple(prefix.split('\x00')), None, node)
267
                    else:
268
                        process_node(prefix, node, path, self, self_pending)
269
                        continue
270
                elif self_pending[0][0] > basis_pending[0][0]:
271
                    # expand basis
272
                    prefix, node, path = heapq.heappop(basis_pending)
273
                    if check_excluded(path):
274
                        continue
275
                    if type(node) == str:
276
                        # a value
277
                        yield (tuple(prefix.split('\x00')), node, None)
278
                    else:
279
                        process_node(prefix, node, path, basis, basis_pending)
280
                        continue
281
                else:
282
                    # common prefix: possibly expand both
283
                    if type(self_pending[0][1]) != str:
284
                        # process next self
285
                        read_self = True
286
                    else:
287
                        read_self = False
288
                    if type(basis_pending[0][1]) != str:
289
                        # process next basis
290
                        read_basis = True
291
                    else:
292
                        read_basis = False
293
                    if not read_self and not read_basis:
294
                        # compare a common value
295
                        self_details = heapq.heappop(self_pending)
296
                        basis_details = heapq.heappop(basis_pending)
297
                        if self_details[1] != basis_details[1]:
298
                            yield (tuple(self_details[0].split('\x00')),
299
                                basis_details[1], self_details[1])
300
                        continue
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
301
                    # At least one side wasn't a string.
302
                    if (self._node_key(self_pending[0][1]) ==
303
                        self._node_key(basis_pending[0][1])):
304
                        # Identical pointers, skip (and don't bother adding to
305
                        # excluded, it won't turn up again.
306
                        heapq.heappop(self_pending)
307
                        heapq.heappop(basis_pending)
308
                        continue
309
                    # Now we need to expand this node before we can continue
3735.2.31 by Robert Collins
CHKMap.iter_changes
310
                    if read_self:
311
                        prefix, node, path = heapq.heappop(self_pending)
312
                        if check_excluded(path):
313
                            continue
314
                        process_node(prefix, node, path, self, self_pending)
315
                    if read_basis:
316
                        prefix, node, path = heapq.heappop(basis_pending)
317
                        if check_excluded(path):
318
                            continue
319
                        process_node(prefix, node, path, basis, basis_pending)
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
320
        # print loop_counter
3735.2.30 by Robert Collins
Start iter_changes between CHKMap instances.
321
3735.2.9 by Robert Collins
Get a working chk_map using inventory implementation bootstrapped.
322
    def iteritems(self, key_filter=None):
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
323
        """Iterate over the entire CHKMap's contents."""
324
        self._ensure_root()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
325
        return self._root_node.iteritems(self._store, key_filter=key_filter)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
326
3735.2.12 by Robert Collins
Implement commit-via-deltas for split inventory repositories.
327
    def key(self):
328
        """Return the key for this map."""
329
        if isinstance(self._root_node, tuple):
330
            return self._root_node
331
        else:
332
            return self._root_node._key
333
3735.2.17 by Robert Collins
Cache node length to avoid full iteration on __len__ calls.
334
    def __len__(self):
335
        self._ensure_root()
336
        return len(self._root_node)
337
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
338
    def map(self, key, value):
339
        """Map a key tuple to value."""
340
        # Need a root object.
341
        self._ensure_root()
342
        prefix, node_details = self._root_node.map(self._store, key, value)
343
        if len(node_details) == 1:
344
            self._root_node = node_details[0][1]
345
        else:
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
346
            self._root_node = InternalNode(prefix)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
347
            self._root_node.set_maximum_size(node_details[0][1].maximum_size)
348
            self._root_node._key_width = node_details[0][1]._key_width
349
            for split, node in node_details:
350
                self._root_node.add_node(split, node)
351
3735.2.31 by Robert Collins
CHKMap.iter_changes
352
    def _node_key(self, node):
353
        """Get the key for a node whether its a tuple o r node."""
354
        if type(node) == tuple:
355
            return node
356
        else:
357
            return node._key
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
358
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
359
    def unmap(self, key):
360
        """remove key from the map."""
361
        self._ensure_root()
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
362
        unmapped = self._root_node.unmap(self._store, key)
363
        self._root_node = unmapped
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
364
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
365
    def _save(self):
366
        """Save the map completely.
367
368
        :return: The key of the root node.
369
        """
370
        if type(self._root_node) == tuple:
371
            # Already saved.
372
            return self._root_node
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
373
        keys = list(self._root_node.serialise(self._store))
374
        return keys[-1]
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
375
376
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
377
class Node(object):
378
    """Base class defining the protocol for CHK Map nodes."""
379
380
    def __init__(self, key_width=1):
381
        """Create a node.
382
383
        :param key_width: The width of keys for this node.
384
        """
385
        self._key = None
386
        # Current number of elements
387
        self._len = 0
388
        self._maximum_size = 0
389
        self._key_width = 1
390
        # current size in bytes
391
        self._size = 0
392
        # The pointers/values this node has - meaning defined by child classes.
393
        self._items = {}
394
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
395
    def __repr__(self):
396
        items_str = sorted(self._items)
397
        if len(items_str) > 20:
398
            items_str = items_str[16] + '...]'
399
        return '%s(key:%s len:%s size:%s max:%s items:%s)' % (
400
            self.__class__.__name__, self._key, self._len, self._size,
401
            self._maximum_size, items_str)
402
3735.2.38 by Robert Collins
Sufficient fixes to allow bzr-search to index a dev3 format repository.
403
    def key(self):
404
        return self._key
405
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
406
    def __len__(self):
407
        return self._len
408
409
    @property
410
    def maximum_size(self):
411
        """What is the upper limit for adding references to a node."""
412
        return self._maximum_size
413
414
    def set_maximum_size(self, new_size):
415
        """Set the size threshold for nodes.
416
417
        :param new_size: The size at which no data is added to a node. 0 for
418
            unlimited.
419
        """
420
        self._maximum_size = new_size
421
422
423
class LeafNode(Node):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
424
    """A node containing actual key:value pairs.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
425
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
426
    :ivar _items: A dict of key->value items. The key is in tuple form.
427
    """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
428
429
    def __init__(self):
430
        Node.__init__(self)
431
        # The size of a leaf node with default values and no children.
432
        self._size = 12
433
434
    def _current_size(self):
435
        """Answer the current serialised size of this node."""
436
        return (self._size + len(str(self._len)) + len(str(self._key_width)) +
437
            len(str(self._maximum_size)))
438
439
    @classmethod
440
    def deserialise(klass, bytes, key):
441
        """Deserialise bytes, with key key, into a LeafNode.
442
443
        :param bytes: The bytes of the node.
444
        :param key: The key that the serialised node has.
445
        """
446
        result = LeafNode()
447
        lines = bytes.splitlines()
448
        items = {}
449
        if lines[0] != 'chkleaf:':
450
            raise ValueError("not a serialised leaf node: %r" % bytes)
451
        maximum_size = int(lines[1])
452
        width = int(lines[2])
453
        length = int(lines[3])
454
        for line in lines[4:]:
3735.2.25 by Robert Collins
CHKInventory core tests passing.
455
            elements = line.split('\x00', width)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
456
            items[tuple(elements[:-1])] = elements[-1]
457
        if len(items) != length:
458
            raise AssertionError("item count mismatch")
459
        result._items = items
460
        result._len = length
461
        result._maximum_size = maximum_size
462
        result._key = key
463
        result._key_width = width
464
        result._size = len(bytes)
465
        return result
466
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
467
    def iteritems(self, store, key_filter=None):
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
468
        """Iterate over items in the node.
469
470
        :param key_filter: A filter to apply to the node. It should be a
471
            list/set/dict or similar repeatedly iterable container.
472
        """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
473
        if key_filter is not None:
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
474
            # Adjust the filter - short elements go to a prefix filter. Would this
475
            # be cleaner explicitly? That would be no harder for InternalNode..
476
            # XXX: perhaps defaultdict? Profiling<rinse and repeat>
477
            filters = {}
478
            for key in key_filter:
479
                length_filter = filters.setdefault(len(key), set())
480
                length_filter.add(key)
481
            filters = filters.items()
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
482
            for item in self._items.iteritems():
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
483
                for length, length_filter in filters:
484
                    if item[0][:length] in length_filter:
485
                        yield item
486
                        break
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
487
        else:
488
            for item in self._items.iteritems():
489
                yield item
490
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
491
    def _key_value_len(self, key, value):
492
        # TODO: Should probably be done without actually joining the key, but
493
        #       then that can be done via the C extension
494
        return 2 + len('\x00'.join(key)) + len(value)
495
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
496
    def _map_no_split(self, key, value):
497
        """Map a key to a value.
498
499
        This assumes either the key does not already exist, or you have already
500
        removed its size and length from self.
501
502
        :return: True if adding this node should cause us to split.
503
        """
504
        self._items[key] = value
505
        self._size += self._key_value_len(key, value)
506
        self._len += 1
507
        if (self._len > 1
508
            and self._maximum_size
509
            and self._current_size() > self._maximum_size):
510
            return True
511
        return False
512
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
513
    def map(self, store, key, value):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
514
        """Map key to value."""
515
        if key in self._items:
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
516
            self._size -= self._key_value_len(key, self._items[key])
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
517
            self._len -= 1
518
        self._key = None
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
519
        if self._map_no_split(key, value):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
520
            common_prefix = self.unique_serialised_prefix()
521
            split_at = len(common_prefix) + 1
522
            result = {}
523
            for key, value in self._items.iteritems():
524
                serialised_key = self._serialised_key(key)
525
                prefix = serialised_key[:split_at]
3735.11.7 by John Arbash Meinel
Turns out that the LeafNode split code needs to be aware of having a key
526
                # TODO: Generally only 1 key can be exactly the right length,
527
                #       which means we can only have 1 key in the node pointed
528
                #       at by the 'prefix\0' key. We might want to consider
529
                #       folding it into the containing InternalNode rather than
530
                #       having a fixed length-1 node.
3735.11.8 by John Arbash Meinel
more comment
531
                #       Note this is probably not true for hash keys, as they
532
                #       may get a '\00' node anywhere, but won't have keys of
533
                #       different lengths.
3735.11.7 by John Arbash Meinel
Turns out that the LeafNode split code needs to be aware of having a key
534
                if len(prefix) < split_at:
535
                    prefix += '\x00'*(split_at - len(prefix))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
536
                if prefix not in result:
537
                    node = LeafNode()
538
                    node.set_maximum_size(self._maximum_size)
539
                    node._key_width = self._key_width
540
                    result[prefix] = node
541
                else:
542
                    node = result[prefix]
543
                node.map(store, key, value)
544
            return common_prefix, result.items()
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
545
        else:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
546
            return self.unique_serialised_prefix(), [("", self)]
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
547
548
    def serialise(self, store):
549
        """Serialise the tree to store.
550
551
        :param store: A VersionedFiles honouring the CHK extensions.
552
        :return: An iterable of the keys inserted by this operation.
553
        """
554
        lines = ["chkleaf:\n"]
555
        lines.append("%d\n" % self._maximum_size)
556
        lines.append("%d\n" % self._key_width)
557
        lines.append("%d\n" % self._len)
558
        for key, value in sorted(self._items.items()):
559
            lines.append("%s\x00%s\n" % ('\x00'.join(key), value))
560
        sha1, _, _ = store.add_lines((None,), (), lines)
561
        self._key = ("sha1:" + sha1,)
562
        return [self._key]
563
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
564
    def _serialised_key(self, key):
565
        """Return the serialised key for key in this node."""
566
        return '\x00'.join(key)
567
3735.2.26 by Robert Collins
CHKInventory migrated to new CHKMap code.
568
    def refs(self):
569
        """Return the references to other CHK's held by this node."""
570
        return []
571
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
572
    def unique_serialised_prefix(self):
573
        """Return the unique key prefix for this node.
574
575
        :return: A bytestring of the longest serialised key prefix that is
576
            unique within this node.
577
        """
578
        # may want to cache this eventually :- but wait for enough
579
        # functionality to profile.
580
        keys = list(self._items.keys())
581
        if not keys:
582
            return ""
583
        current_prefix = self._serialised_key(keys.pop(-1))
584
        while current_prefix and keys:
585
            next_key = self._serialised_key(keys.pop(-1))
586
            for pos, (left, right) in enumerate(zip(current_prefix, next_key)):
587
                if left != right:
588
                    pos -= 1
589
                    break
590
            current_prefix = current_prefix[:pos + 1]
591
        return current_prefix
592
593
    def unmap(self, store, key):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
594
        """Unmap key from the node."""
595
        self._size -= 2 + len('\x00'.join(key)) + len(self._items[key])
596
        self._len -= 1
597
        del self._items[key]
598
        self._key = None
599
        return self
600
601
602
class InternalNode(Node):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
603
    """A node that contains references to other nodes.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
604
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
605
    An InternalNode is responsible for mapping serialised key prefixes to child
606
    nodes. It is greedy - it will defer splitting itself as long as possible.
607
    """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
608
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
609
    def __init__(self, prefix=''):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
610
        Node.__init__(self)
611
        # The size of an internalnode with default values and no children.
612
        # self._size = 12
613
        # How many octets key prefixes within this node are.
614
        self._node_width = 0
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
615
        self._prefix = prefix
616
617
    def __repr__(self):
618
        items_str = sorted(self._items)
619
        if len(items_str) > 20:
620
            items_str = items_str[16] + '...]'
621
        return '%s(key:%s len:%s size:%s max:%s prefix:%s items:%s)' % (
622
            self.__class__.__name__, self._key, self._len, self._size,
623
            self._maximum_size, self._prefix, items_str)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
624
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
625
    def add_node(self, prefix, node):
626
        """Add a child node with prefix prefix, and node node.
627
628
        :param prefix: The serialised key prefix for node.
629
        :param node: The node being added.
630
        """
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
631
        assert self._prefix is not None
632
        assert prefix.startswith(self._prefix)
633
        assert len(prefix) == len(self._prefix) + 1
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
634
        self._len += len(node)
635
        if not len(self._items):
636
            self._node_width = len(prefix)
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
637
        assert self._node_width == len(self._prefix) + 1
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
638
        self._items[prefix] = node
639
        self._key = None
640
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
641
    def _current_size(self):
642
        """Answer the current serialised size of this node."""
643
        return (self._size + len(str(self._len)) + len(str(self._key_width)) +
644
            len(str(self._maximum_size)))
645
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
646
    @classmethod
647
    def deserialise(klass, bytes, key):
3735.2.25 by Robert Collins
CHKInventory core tests passing.
648
        """Deserialise bytes to an InternalNode, with key key.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
649
650
        :param bytes: The bytes of the node.
651
        :param key: The key that the serialised node has.
652
        :return: An InternalNode instance.
653
        """
654
        result = InternalNode()
655
        lines = bytes.splitlines()
656
        items = {}
657
        if lines[0] != 'chknode:':
658
            raise ValueError("not a serialised internal node: %r" % bytes)
659
        maximum_size = int(lines[1])
660
        width = int(lines[2])
661
        length = int(lines[3])
662
        for line in lines[4:]:
663
            prefix, flat_key = line.rsplit('\x00', 1)
664
            items[prefix] = (flat_key,)
665
        result._items = items
666
        result._len = length
667
        result._maximum_size = maximum_size
668
        result._key = key
669
        result._key_width = width
670
        result._size = len(bytes)
671
        result._node_width = len(prefix)
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
672
        result._prefix = result.unique_serialised_prefix()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
673
        return result
674
675
    def iteritems(self, store, key_filter=None):
676
        for node in self._iter_nodes(store, key_filter=key_filter):
677
            for item in node.iteritems(store, key_filter=key_filter):
678
                yield item
679
680
    def _iter_nodes(self, store, key_filter=None):
681
        """Iterate over node objects which match key_filter.
682
683
        :param store: A store to use for accessing content.
684
        :param key_filter: A key filter to filter nodes. Only nodes that might
685
            contain a key in key_filter will be returned.
686
        :return: An iterable of nodes.
687
        """
688
        nodes = []
3735.2.31 by Robert Collins
CHKMap.iter_changes
689
        keys = {}
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
690
        if key_filter is None:
3735.2.31 by Robert Collins
CHKMap.iter_changes
691
            for prefix, node in self._items.iteritems():
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
692
                if type(node) == tuple:
3735.2.31 by Robert Collins
CHKMap.iter_changes
693
                    keys[node] = prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
694
                else:
695
                    nodes.append(node)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
696
        else:
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
697
            # XXX defaultdict ?
698
            length_filters = {}
699
            for key in key_filter:
700
                serialised_key = self._serialised_prefix_filter(key)
701
                length_filter = length_filters.setdefault(len(serialised_key),
702
                    set())
703
                length_filter.add(serialised_key)
704
            length_filters = length_filters.items()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
705
            for prefix, node in self._items.iteritems():
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
706
                for length, length_filter in length_filters:
707
                    if prefix[:length] in length_filter:
708
                        if type(node) == tuple:
709
                            keys[node] = prefix
710
                        else:
711
                            nodes.append(node)
712
                        break
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
713
        if keys:
714
            # demand load some pages.
715
            stream = store.get_record_stream(keys, 'unordered', True)
716
            for record in stream:
717
                node = _deserialise(record.get_bytes_as('fulltext'), record.key)
718
                nodes.append(node)
3735.2.31 by Robert Collins
CHKMap.iter_changes
719
                self._items[keys[record.key]] = node
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
720
        return nodes
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
721
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
722
    def map(self, store, key, value):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
723
        """Map key to value."""
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
724
        if not len(self._items):
725
            raise AssertionError("cant map in an empty InternalNode.")
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
726
        serialised_key = self._serialised_key(key)
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
727
        assert self._node_width == len(self._prefix) + 1
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
728
        if not serialised_key.startswith(self._prefix):
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
729
            # This key doesn't fit in this index, so we need to split at the
730
            # point where it would fit.
731
            # XXX: Do we need the serialised_key in its maximum length?
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
732
            new_prefix = self.unique_serialised_prefix(serialised_key)
733
            new_parent = InternalNode(new_prefix)
734
            new_parent.set_maximum_size(self._maximum_size)
735
            new_parent._key_width = self._key_width
736
            new_parent.add_node(self._prefix[:len(new_prefix)+1], self)
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
737
            assert new_parent._node_width == len(new_parent._prefix) + 1
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
738
            return new_parent.map(store, key, value)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
739
        children = self._iter_nodes(store, key_filter=[key])
740
        if children:
741
            child = children[0]
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
742
            # if isinstance(child, InternalNode):
743
            #     child_prefix = child._prefix
744
            #     child_ser_key = child._serialised_key(key)
745
            #     if not child_ser_key.startswith(child_prefix):
746
            #         import pdb; pdb.set_trace()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
747
        else:
748
            # new child needed:
749
            child = self._new_child(serialised_key, LeafNode)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
750
        old_len = len(child)
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
751
        if isinstance(child, LeafNode):
752
            old_size = child._current_size()
753
        else:
754
            old_size = None
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
755
        prefix, node_details = child.map(store, key, value)
756
        if len(node_details) == 1:
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
757
            # child may have shrunk, or might be a new node
758
            child = node_details[0][1]
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
759
            self._len = self._len - old_len + len(child)
760
            self._items[serialised_key] = child
3735.2.29 by Robert Collins
Untested code is broken code.
761
            self._key = None
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
762
            new_node = self
763
            if (isinstance(child, LeafNode)
764
                and (old_size is None or child._current_size() < old_size)):
765
                # The old node was an InternalNode which means it has now
766
                # collapsed, so we need to check if it will chain to a collapse
767
                # at this level. Or the LeafNode has shrunk in size, so we need
768
                # to check that as well.
769
                new_node = self._check_remap(store)
770
            return new_node.unique_serialised_prefix(), [("", new_node)]
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
771
        # child has overflown - create a new intermediate node.
772
        # XXX: This is where we might want to try and expand our depth
773
        # to refer to more bytes of every child (which would give us
774
        # multiple pointers to child nodes, but less intermediate nodes)
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
775
        # TODO: Is this mapped as serialised_key or as prefix?
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
776
        child = self._new_child(serialised_key, InternalNode)
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
777
        child._prefix = prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
778
        for split, node in node_details:
779
            child.add_node(split, node)
780
        self._len = self._len - old_len + len(child)
3735.2.29 by Robert Collins
Untested code is broken code.
781
        self._key = None
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
782
        return self.unique_serialised_prefix(), [("", self)]
783
784
    def _new_child(self, serialised_key, klass):
785
        """Create a new child node of type klass."""
786
        child = klass()
787
        child.set_maximum_size(self._maximum_size)
788
        child._key_width = self._key_width
789
        self._items[serialised_key] = child
790
        return child
791
792
    def serialise(self, store):
793
        """Serialise the node to store.
794
795
        :param store: A VersionedFiles honouring the CHK extensions.
796
        :return: An iterable of the keys inserted by this operation.
797
        """
798
        for node in self._items.itervalues():
799
            if type(node) == tuple:
800
                # Never deserialised.
801
                continue
802
            if node._key is not None:
803
                # Never altered
804
                continue
805
            for key in node.serialise(store):
806
                yield key
807
        lines = ["chknode:\n"]
808
        lines.append("%d\n" % self._maximum_size)
809
        lines.append("%d\n" % self._key_width)
810
        lines.append("%d\n" % self._len)
811
        for prefix, node in sorted(self._items.items()):
812
            if type(node) == tuple:
813
                key = node[0]
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
814
            else:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
815
                key = node._key[0]
816
            lines.append("%s\x00%s\n" % (prefix, key))
817
        sha1, _, _ = store.add_lines((None,), (), lines)
818
        self._key = ("sha1:" + sha1,)
819
        yield self._key
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
820
821
    def _serialised_key(self, key):
822
        """Return the serialised key for key in this node."""
823
        return ('\x00'.join(key) + '\x00'*self._node_width)[:self._node_width]
824
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
825
    def _serialised_prefix_filter(self, key):
826
        """Serialise key for use as a prefix filter in iteritems."""
827
        if len(key) == self._key_width:
828
            return self._serialised_key(key)
829
        return '\x00'.join(key)[:self._node_width]
830
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
831
    def _split(self, offset):
832
        """Split this node into smaller nodes starting at offset.
833
834
        :param offset: The offset to start the new child nodes at.
835
        :return: An iterable of (prefix, node) tuples. prefix is a byte
836
            prefix for reaching node.
837
        """
838
        if offset >= self._node_width:
839
            for node in self._items.values():
840
                for result in node._split(offset):
841
                    yield result
842
            return
843
        for key, node in self._items.items():
844
            pass
845
3735.2.26 by Robert Collins
CHKInventory migrated to new CHKMap code.
846
    def refs(self):
847
        """Return the references to other CHK's held by this node."""
848
        if self._key is None:
849
            raise AssertionError("unserialised nodes have no refs.")
850
        refs = []
851
        for value in self._items.itervalues():
852
            if type(value) == tuple:
853
                refs.append(value)
854
            else:
855
                refs.append(value.key())
856
        return refs
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
857
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
858
    def unique_serialised_prefix(self, extra_key=None):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
859
        """Return the unique key prefix for this node.
860
861
        :return: A bytestring of the longest serialised key prefix that is
862
            unique within this node.
863
        """
864
        # may want to cache this eventually :- but wait for enough
865
        # functionality to profile.
866
        keys = list(self._items.keys())
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
867
        if extra_key is not None:
868
            keys.append(extra_key)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
869
        if not keys:
870
            return ""
871
        current_prefix = keys.pop(-1)
872
        while current_prefix and keys:
873
            next_key = keys.pop(-1)
874
            for pos, (left, right) in enumerate(zip(current_prefix, next_key)):
875
                if left != right:
876
                    pos -= 1
877
                    break
878
            current_prefix = current_prefix[:pos + 1]
879
        return current_prefix
880
881
    def unmap(self, store, key):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
882
        """Remove key from this node and it's children."""
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
883
        if not len(self._items):
884
            raise AssertionError("cant unmap in an empty InternalNode.")
885
        children = self._iter_nodes(store, key_filter=[key])
886
        if children:
887
            child = children[0]
888
        else:
889
            raise KeyError(key)
890
        self._len -= 1
891
        unmapped = child.unmap(store, key)
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
892
        self._key = None
893
        serialised_key = self._serialised_key(key)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
894
        if len(unmapped) == 0:
895
            # All child nodes are gone, remove the child:
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
896
            del self._items[serialised_key]
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
897
            unmapped = None
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
898
        else:
899
            # Stash the returned node
900
            self._items[serialised_key] = unmapped
3735.2.23 by Robert Collins
Test unmapping with one child left but multiple keys.
901
        if len(self._items) == 1:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
902
            # this node is no longer needed:
903
            return self._items.values()[0]
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
904
        if isinstance(unmapped, InternalNode):
905
            return self
906
        return self._check_remap(store)
907
908
    def _check_remap(self, store):
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
909
        """Check if all keys contained by children fit in a single LeafNode.
910
911
        :param store: A store to use for reading more nodes
912
        :return: Either self, or a new LeafNode which should replace self.
913
        """
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
914
        # Logic for how we determine when we need to rebuild
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
915
        # 1) Implicitly unmap() is removing a key which means that the child
916
        #    nodes are going to be shrinking by some extent.
917
        # 2) If all children are LeafNodes, it is possible that they could be
918
        #    combined into a single LeafNode, which can then completely replace
919
        #    this internal node with a single LeafNode
920
        # 3) If *one* child is an InternalNode, we assume it has already done
921
        #    all the work to determine that its children cannot collapse, and
922
        #    we can then assume that those nodes *plus* the current nodes don't
923
        #    have a chance of collapsing either.
924
        #    So a very cheap check is to just say if 'unmapped' is an
925
        #    InternalNode, we don't have to check further.
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
926
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
927
        # TODO: Another alternative is to check the total size of all known
928
        #       LeafNodes. If there is some formula we can use to determine the
929
        #       final size without actually having to read in any more
930
        #       children, it would be nice to have. However, we have to be
931
        #       careful with stuff like nodes that pull out the common prefix
932
        #       of each key, as adding a new key can change the common prefix
933
        #       and cause size changes greater than the length of one key.
934
        #       So for now, we just add everything to a new Leaf until it
935
        #       splits, as we know that will give the right answer
936
        new_leaf = LeafNode()
937
        new_leaf.set_maximum_size(self._maximum_size)
938
        new_leaf._key_width = self._key_width
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
939
        keys = {}
940
        # There is some overlap with _iter_nodes here, but not a lot, and it
941
        # allows us to do quick evaluation without paging everything in
942
        for prefix, node in self._items.iteritems():
943
            if type(node) == tuple:
944
                keys[node] = prefix
945
            else:
946
                if isinstance(node, InternalNode):
947
                    # Without looking at any leaf nodes, we are sure
948
                    return self
949
                for key, value in node._items.iteritems():
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
950
                    if new_leaf._map_no_split(key, value):
951
                        # Adding this key would cause a split, so we know we
952
                        # don't need to collapse
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
953
                        return self
954
        # So far, everything fits. Page in the rest of the nodes, and see if it
955
        # holds true.
956
        if keys:
3735.11.12 by John Arbash Meinel
Add a TODO describing a possible optimization.
957
            # TODO: Consider looping over a limited set of keys (like 25 or so
958
            #       at a time). If we have to read more than 25 we almost
959
            #       certainly won't fit them all into a single new LeafNode, so
960
            #       reading the extra nodes is a waste.
961
            #       This will probably matter more with hash serialised keys,
962
            #       as we will get InternalNodes with more references.
963
            #       The other argument is that unmap() is uncommon, so we don't
964
            #       need to optimize it. But map() with a slightly shorter
965
            #       value may happen a lot.
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
966
            stream = store.get_record_stream(keys, 'unordered', True)
967
            nodes = []
968
            # Fully consume the stream, even if we could determine that we
969
            # don't need to continue. We requested the bytes, we may as well
970
            # use them
971
            for record in stream:
972
                node = _deserialise(record.get_bytes_as('fulltext'), record.key)
973
                self._items[keys[record.key]] = node
974
                nodes.append(node)
975
            for node in nodes:
976
                if isinstance(node, InternalNode):
977
                    # We know we won't fit
978
                    return self
979
                for key, value in node._items.iteritems():
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
980
                    if new_leaf._map_no_split(key, value):
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
981
                        return self
982
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
983
        # We have gone to every child, and everything fits in a single leaf
984
        # node, we no longer need this internal node
985
        return new_leaf
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
986
987
3735.2.16 by Robert Collins
Untested extensions to support repodetails
988
def _deserialise(bytes, key):
989
    """Helper for repositorydetails - convert bytes to a node."""
3735.2.24 by Robert Collins
test_chk_map tests all passing.
990
    if bytes.startswith("chkleaf:\n"):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
991
        return LeafNode.deserialise(bytes, key)
992
    elif bytes.startswith("chknode:\n"):
993
        return InternalNode.deserialise(bytes, key)
3735.2.16 by Robert Collins
Untested extensions to support repodetails
994
    else:
995
        raise AssertionError("Unknown node type.")
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
996
997
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
998
def _find_children_info(store, interesting_keys, uninteresting_keys,
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
999
                        adapter, pb):
3735.9.7 by John Arbash Meinel
Cleanup pass.
1000
    """Read the associated records, and determine what is interesting."""
1001
    uninteresting_keys = set(uninteresting_keys)
1002
    chks_to_read = uninteresting_keys.union(interesting_keys)
1003
    next_uninteresting = set()
1004
    next_interesting = set()
1005
    uninteresting_items = set()
1006
    interesting_items = set()
1007
    interesting_records = []
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1008
    # records_read = set()
3735.9.7 by John Arbash Meinel
Cleanup pass.
1009
    for record in store.get_record_stream(chks_to_read, 'unordered', True):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1010
        # records_read.add(record.key())
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1011
        if pb is not None:
1012
            pb.tick()
1013
        if record.storage_kind != 'fulltext':
1014
            bytes = adapter.get_bytes(record,
1015
                        record.get_bytes_as(record.storage_kind))
1016
        else:
1017
            bytes = record.get_bytes_as('fulltext')
1018
        node = _deserialise(bytes, record.key)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1019
        if record.key in uninteresting_keys:
1020
            if isinstance(node, InternalNode):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1021
                next_uninteresting.update(node.refs())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1022
            else:
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1023
                # We know we are at a LeafNode, so we can pass None for the
1024
                # store
1025
                uninteresting_items.update(node.iteritems(None))
3735.9.7 by John Arbash Meinel
Cleanup pass.
1026
        else:
1027
            interesting_records.append(record)
1028
            if isinstance(node, InternalNode):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1029
                next_interesting.update(node.refs())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1030
            else:
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1031
                interesting_items.update(node.iteritems(None))
1032
    # TODO: Filter out records that have already been read, as node splitting
1033
    #       can cause us to reference the same nodes via shorter and longer
1034
    #       paths
3735.9.7 by John Arbash Meinel
Cleanup pass.
1035
    return (next_uninteresting, uninteresting_items,
1036
            next_interesting, interesting_records, interesting_items)
1037
1038
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1039
def _find_all_uninteresting(store, interesting_root_keys,
1040
                            uninteresting_root_keys, adapter, pb):
1041
    """Determine the full set of uninteresting keys."""
1042
    # What about duplicates between interesting_root_keys and
1043
    # uninteresting_root_keys?
1044
    if not uninteresting_root_keys:
1045
        # Shortcut case. We know there is nothing uninteresting to filter out
1046
        # So we just let the rest of the algorithm do the work
1047
        # We know there is nothing uninteresting, and we didn't have to read
1048
        # any interesting records yet.
1049
        return (set(), set(), set(interesting_root_keys), [], set())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1050
    all_uninteresting_chks = set(uninteresting_root_keys)
1051
    all_uninteresting_items = set()
1052
1053
    # First step, find the direct children of both the interesting and
1054
    # uninteresting set
1055
    (uninteresting_keys, uninteresting_items,
1056
     interesting_keys, interesting_records,
1057
     interesting_items) = _find_children_info(store, interesting_root_keys,
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1058
                                              uninteresting_root_keys,
1059
                                              adapter=adapter, pb=pb)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1060
    all_uninteresting_chks.update(uninteresting_keys)
1061
    all_uninteresting_items.update(uninteresting_items)
1062
    del uninteresting_items
1063
    # Note: Exact matches between interesting and uninteresting do not need
1064
    #       to be search further. Non-exact matches need to be searched in case
1065
    #       there is a future exact-match
1066
    uninteresting_keys.difference_update(interesting_keys)
1067
3735.9.6 by John Arbash Meinel
Add a first pass to the interesting search.
1068
    # Second, find the full set of uninteresting bits reachable by the
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1069
    # uninteresting roots
3735.9.7 by John Arbash Meinel
Cleanup pass.
1070
    chks_to_read = uninteresting_keys
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1071
    while chks_to_read:
1072
        next_chks = set()
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1073
        for record in store.get_record_stream(chks_to_read, 'unordered', False):
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1074
            # TODO: Handle 'absent'
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1075
            if pb is not None:
1076
                pb.tick()
1077
            if record.storage_kind != 'fulltext':
1078
                bytes = adapter.get_bytes(record,
1079
                            record.get_bytes_as(record.storage_kind))
1080
            else:
1081
                bytes = record.get_bytes_as('fulltext')
1082
            node = _deserialise(bytes, record.key)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1083
            if isinstance(node, InternalNode):
1084
                # uninteresting_prefix_chks.update(node._items.iteritems())
1085
                chks = node._items.values()
1086
                # TODO: We remove the entries that are already in
1087
                #       uninteresting_chks ?
1088
                next_chks.update(chks)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1089
                all_uninteresting_chks.update(chks)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1090
            else:
3735.9.7 by John Arbash Meinel
Cleanup pass.
1091
                all_uninteresting_items.update(node._items.iteritems())
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1092
        chks_to_read = next_chks
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1093
    return (all_uninteresting_chks, all_uninteresting_items,
1094
            interesting_keys, interesting_records, interesting_items)
1095
1096
1097
def iter_interesting_nodes(store, interesting_root_keys,
1098
                           uninteresting_root_keys, pb=None):
1099
    """Given root keys, find interesting nodes.
1100
1101
    Evaluate nodes referenced by interesting_root_keys. Ones that are also
1102
    referenced from uninteresting_root_keys are not considered interesting.
1103
1104
    :param interesting_root_keys: keys which should be part of the
1105
        "interesting" nodes (which will be yielded)
1106
    :param uninteresting_root_keys: keys which should be filtered out of the
1107
        result set.
1108
    :return: Yield
1109
        (interesting records, interesting chk's, interesting key:values)
1110
    """
1111
    # TODO: consider that it may be more memory efficient to use the 20-byte
1112
    #       sha1 string, rather than tuples of hexidecimal sha1 strings.
1113
1114
    # A way to adapt from the compressed texts back into fulltexts
1115
    # In a way, this seems like a layering inversion to have CHKMap know the
1116
    # details of versionedfile
1117
    adapter_class = versionedfile.adapter_registry.get(
1118
        ('knit-ft-gz', 'fulltext'))
1119
    adapter = adapter_class(store)
1120
1121
    (all_uninteresting_chks, all_uninteresting_items, interesting_keys,
1122
     interesting_records, interesting_items) = _find_all_uninteresting(store,
1123
        interesting_root_keys, uninteresting_root_keys, adapter, pb)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1124
1125
    # Now that we know everything uninteresting, we can yield information from
1126
    # our first request
1127
    interesting_items.difference_update(all_uninteresting_items)
1128
    records = dict((record.key, record) for record in interesting_records
1129
                    if record.key not in all_uninteresting_chks)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1130
    if records or interesting_items:
1131
        yield records, interesting_items
3735.9.7 by John Arbash Meinel
Cleanup pass.
1132
    interesting_keys.difference_update(all_uninteresting_chks)
1133
1134
    chks_to_read = interesting_keys
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1135
    while chks_to_read:
1136
        next_chks = set()
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1137
        for record in store.get_record_stream(chks_to_read, 'unordered', False):
1138
            if pb is not None:
1139
                pb.tick()
3735.9.7 by John Arbash Meinel
Cleanup pass.
1140
            # TODO: Handle 'absent'?
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1141
            if record.storage_kind != 'fulltext':
1142
                bytes = adapter.get_bytes(record,
1143
                            record.get_bytes_as(record.storage_kind))
1144
            else:
1145
                bytes = record.get_bytes_as('fulltext')
1146
            node = _deserialise(bytes, record.key)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1147
            if isinstance(node, InternalNode):
3735.9.15 by John Arbash Meinel
Found a bug in iter_interesting_nodes and its test suite.
1148
                chks = set(node.refs())
1149
                chks.difference_update(all_uninteresting_chks)
1150
                # Is set() and .difference_update better than:
1151
                # chks = [chk for chk in node.refs()
1152
                #              if chk not in all_uninteresting_chks]
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1153
                next_chks.update(chks)
1154
                # These are now uninteresting everywhere else
3735.9.7 by John Arbash Meinel
Cleanup pass.
1155
                all_uninteresting_chks.update(chks)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1156
                interesting_items = []
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1157
            else:
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1158
                interesting_items = [item for item in node._items.iteritems()
1159
                                     if item not in all_uninteresting_items]
3735.9.15 by John Arbash Meinel
Found a bug in iter_interesting_nodes and its test suite.
1160
                # TODO: Do we need to filter out items that we have already
1161
                #       seen on other pages? We don't really want to buffer the
1162
                #       whole thing, but it does mean that callers need to
1163
                #       understand they may get duplicate values.
3735.9.7 by John Arbash Meinel
Cleanup pass.
1164
                # all_uninteresting_items.update(interesting_items)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1165
            yield {record.key: record}, interesting_items
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1166
        chks_to_read = next_chks