/brz/remove-bazaar

To get this branch, use:
bzr branch http://gegoxaren.bato24.eu/bzr/brz/remove-bazaar
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
1
# Copyright (C) 2008 Canonical Ltd
2
#
3
# This program is free software; you can redistribute it and/or modify
4
# it under the terms of the GNU General Public License as published by
5
# the Free Software Foundation; either version 2 of the License, or
6
# (at your option) any later version.
7
#
8
# This program is distributed in the hope that it will be useful,
9
# but WITHOUT ANY WARRANTY; without even the implied warranty of
10
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
11
# GNU General Public License for more details.
12
#
13
# You should have received a copy of the GNU General Public License
14
# along with this program; if not, write to the Free Software
15
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
16
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
17
"""Persistent maps from tuple_of_strings->string using CHK stores.
18
19
Overview and current status:
20
21
The CHKMap class implements a dict from tuple_of_strings->string by using a trie
22
with internal nodes of 8-bit fan out; The key tuples are mapped to strings by
23
joining them by \x00, and \x00 padding shorter keys out to the length of the
24
longest key. Leaf nodes are packed as densely as possible, and internal nodes
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
25
are all and additional 8-bits wide leading to a sparse upper tree.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
26
27
Updates to a CHKMap are done preferentially via the apply_delta method, to
28
allow optimisation of the update operation; but individual map/unmap calls are
29
possible and supported. All changes via map/unmap are buffered in memory until
30
the _save method is called to force serialisation of the tree. apply_delta
31
performs a _save implicitly.
32
33
TODO:
34
-----
35
36
Densely packed upper nodes.
37
38
"""
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
39
3735.2.31 by Robert Collins
CHKMap.iter_changes
40
import heapq
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
41
3735.9.18 by John Arbash Meinel
Make the versionedfile import lazy.
42
from bzrlib import lazy_import
43
lazy_import.lazy_import(globals(), """
44
from bzrlib import versionedfile
45
""")
3735.14.2 by John Arbash Meinel
Finish using the page cache as part of _check_remap, add debugging functions
46
from bzrlib import lru_cache
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
47
48
# approx 2MB
3735.14.5 by John Arbash Meinel
Change _check_remap to only page in a batch of children at a time.
49
# If each line is 50 bytes, and you have 255 internal pages, with 255-way fan
50
# out, it takes 3.1MB to cache the layer.
51
_PAGE_CACHE_SIZE = 4*1024*1024
3735.14.2 by John Arbash Meinel
Finish using the page cache as part of _check_remap, add debugging functions
52
# We are caching bytes so len(value) is perfectly accurate
53
_page_cache = lru_cache.LRUSizeCache(_PAGE_CACHE_SIZE)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
54
55
56
class CHKMap(object):
57
    """A persistent map from string to string backed by a CHK store."""
58
59
    def __init__(self, store, root_key):
60
        """Create a CHKMap object.
61
62
        :param store: The store the CHKMap is stored in.
63
        :param root_key: The root key of the map. None to create an empty
64
            CHKMap.
65
        """
66
        self._store = store
67
        if root_key is None:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
68
            self._root_node = LeafNode()
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
69
        else:
3735.2.41 by Robert Collins
Make the parent_id_basename index be updated during CHKInventory.apply_delta.
70
            self._root_node = self._node_key(root_key)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
71
72
    def apply_delta(self, delta):
73
        """Apply a delta to the map.
74
75
        :param delta: An iterable of old_key, new_key, new_value tuples.
76
            If new_key is not None, then new_key->new_value is inserted
77
            into the map; if old_key is not None, then the old mapping
78
            of old_key is removed.
79
        """
80
        for old, new, value in delta:
81
            if old is not None and old != new:
82
                # unmap
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
83
                self.unmap(old)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
84
        for old, new, value in delta:
85
            if new is not None:
86
                # map
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
87
                self.map(new, value)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
88
        return self._save()
89
90
    def _ensure_root(self):
91
        """Ensure that the root node is an object not a key."""
92
        if type(self._root_node) == tuple:
93
            # Demand-load the root
3735.2.31 by Robert Collins
CHKMap.iter_changes
94
            self._root_node = self._get_node(self._root_node)
95
96
    def _get_node(self, node):
97
        """Get a node.
98
99
        Node that this does not update the _items dict in objects containing a
100
        reference to this node. As such it does not prevent subsequent IO being
101
        performed.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
102
3735.2.31 by Robert Collins
CHKMap.iter_changes
103
        :param node: A tuple key or node object.
104
        :return: A node object.
105
        """
106
        if type(node) == tuple:
107
            bytes = self._read_bytes(node)
108
            return _deserialise(bytes, node)
109
        else:
110
            return node
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
111
112
    def _read_bytes(self, key):
113
        stream = self._store.get_record_stream([key], 'unordered', True)
114
        return stream.next().get_bytes_as('fulltext')
115
3735.15.16 by John Arbash Meinel
Properly fix up the dump_tree tests, we now suppress the keys by default.
116
    def _dump_tree(self, include_keys=False):
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
117
        """Return the tree in a string representation."""
118
        self._ensure_root()
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
119
        res = self._dump_tree_node(self._root_node, prefix='', indent='',
120
                                   include_keys=include_keys)
3735.11.9 by John Arbash Meinel
Switch _dump_tree to returning trailing '\n' for nicer results
121
        res.append('') # Give a trailing '\n'
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
122
        return '\n'.join(res)
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
123
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
124
    def _dump_tree_node(self, node, prefix, indent, include_keys=True):
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
125
        """For this node and all children, generate a string representation."""
126
        result = []
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
127
        if not include_keys:
128
            key_str = ''
129
        else:
130
            node_key = node.key()
131
            if node_key is not None:
132
                key_str = ' %s' % (node_key[0],)
133
            else:
134
                key_str = ' None'
135
        result.append('%s%r %s%s' % (indent, prefix, node.__class__.__name__,
136
                                     key_str))
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
137
        if isinstance(node, InternalNode):
138
            # Trigger all child nodes to get loaded
139
            list(node._iter_nodes(self._store))
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
140
            for prefix, sub in sorted(node._items.iteritems()):
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
141
                result.extend(self._dump_tree_node(sub, prefix, indent + '  ',
142
                                                   include_keys=include_keys))
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
143
        else:
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
144
            for key, value in sorted(node._items.iteritems()):
145
                result.append('      %r %r' % (key, value))
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
146
        return result
147
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
148
    @classmethod
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
149
    def from_dict(klass, store, initial_value, maximum_size=0, key_width=1):
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
150
        """Create a CHKMap in store with initial_value as the content.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
151
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
152
        :param store: The store to record initial_value in, a VersionedFiles
153
            object with 1-tuple keys supporting CHK key generation.
154
        :param initial_value: A dict to store in store. Its keys and values
155
            must be bytestrings.
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
156
        :param maximum_size: The maximum_size rule to apply to nodes. This
157
            determines the size at which no new data is added to a single node.
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
158
        :param key_width: The number of elements in each key_tuple being stored
159
            in this map.
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
160
        :return: The root chk of te resulting CHKMap.
161
        """
162
        result = CHKMap(store, None)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
163
        result._root_node.set_maximum_size(maximum_size)
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
164
        result._root_node._key_width = key_width
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
165
        delta = []
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
166
        for key, value in initial_value.items():
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
167
            delta.append((None, key, value))
168
        result.apply_delta(delta)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
169
        return result._save()
170
3735.2.30 by Robert Collins
Start iter_changes between CHKMap instances.
171
    def iter_changes(self, basis):
172
        """Iterate over the changes between basis and self.
173
174
        :return: An iterator of tuples: (key, old_value, new_value). Old_value
175
            is None for keys only in self; new_value is None for keys only in
176
            basis.
177
        """
3735.2.31 by Robert Collins
CHKMap.iter_changes
178
        # Overview:
179
        # Read both trees in lexographic, highest-first order.
180
        # Any identical nodes we skip
181
        # Any unique prefixes we output immediately.
182
        # values in a leaf node are treated as single-value nodes in the tree
183
        # which allows them to be not-special-cased. We know to output them
184
        # because their value is a string, not a key(tuple) or node.
185
        #
186
        # corner cases to beware of when considering this function:
187
        # *) common references are at different heights.
188
        #    consider two trees:
189
        #    {'a': LeafNode={'aaa':'foo', 'aab':'bar'}, 'b': LeafNode={'b'}}
190
        #    {'a': InternalNode={'aa':LeafNode={'aaa':'foo', 'aab':'bar'}, 'ab':LeafNode={'ab':'bar'}}
191
        #     'b': LeafNode={'b'}}
192
        #    the node with aaa/aab will only be encountered in the second tree
193
        #    after reading the 'a' subtree, but it is encountered in the first
194
        #    tree immediately. Variations on this may have read internal nodes like this.
195
        #    we want to cut the entire pending subtree when we realise we have a common node.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
196
        #    For this we use a list of keys - the path to a node - and check the entire path is
3735.2.31 by Robert Collins
CHKMap.iter_changes
197
        #    clean as we process each item.
198
        if self._node_key(self._root_node) == self._node_key(basis._root_node):
199
            return
200
        self._ensure_root()
201
        basis._ensure_root()
202
        excluded_keys = set()
203
        self_node = self._root_node
204
        basis_node = basis._root_node
205
        # A heap, each element is prefix, node(tuple/NodeObject/string),
206
        # key_path (a list of tuples, tail-sharing down the tree.)
207
        self_pending = []
208
        basis_pending = []
209
        def process_node(prefix, node, path, a_map, pending):
210
            # take a node and expand it
211
            node = a_map._get_node(node)
212
            if type(node) == LeafNode:
213
                path = (node._key, path)
214
                for key, value in node._items.items():
215
                    heapq.heappush(pending, ('\x00'.join(key), value, path))
216
            else:
217
                # type(node) == InternalNode
218
                path = (node._key, path)
219
                for prefix, child in node._items.items():
220
                    heapq.heappush(pending, (prefix, child, path))
221
        process_node(None, self_node, None, self, self_pending)
222
        process_node(None, basis_node, None, basis, basis_pending)
223
        self_seen = set()
224
        basis_seen = set()
225
        excluded_keys = set()
226
        def check_excluded(key_path):
227
            # Note that this is N^2, it depends on us trimming trees
228
            # aggressively to not become slow.
229
            # A better implementation would probably have a reverse map
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
230
            # back to the children of a node, and jump straight to it when
3735.2.31 by Robert Collins
CHKMap.iter_changes
231
            # a common node is detected, the proceed to remove the already
232
            # pending children. bzrlib.graph has a searcher module with a
233
            # similar problem.
234
            while key_path is not None:
235
                key, key_path = key_path
236
                if key in excluded_keys:
237
                    return True
238
            return False
239
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
240
        loop_counter = 0
3735.2.31 by Robert Collins
CHKMap.iter_changes
241
        while self_pending or basis_pending:
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
242
            loop_counter += 1
3735.2.31 by Robert Collins
CHKMap.iter_changes
243
            if not self_pending:
244
                # self is exhausted: output remainder of basis
245
                for prefix, node, path in basis_pending:
246
                    if check_excluded(path):
247
                        continue
248
                    node = basis._get_node(node)
249
                    if type(node) == str:
250
                        # a value
251
                        yield (tuple(prefix.split('\x00')), node, None)
252
                    else:
253
                        # subtree - fastpath the entire thing.
254
                        for key, value in node.iteritems(basis._store):
255
                            yield (key, value, None)
256
                return
257
            elif not basis_pending:
258
                # basis is exhausted: output remainder of self.
259
                for prefix, node, path in self_pending:
260
                    if check_excluded(path):
261
                        continue
262
                    node = self._get_node(node)
263
                    if type(node) == str:
264
                        # a value
265
                        yield (tuple(prefix.split('\x00')), None, node)
266
                    else:
267
                        # subtree - fastpath the entire thing.
268
                        for key, value in node.iteritems(self._store):
269
                            yield (key, None, value)
270
                return
271
            else:
272
                # XXX: future optimisation - yield the smaller items
273
                # immediately rather than pushing everything on/off the
274
                # heaps. Applies to both internal nodes and leafnodes.
275
                if self_pending[0][0] < basis_pending[0][0]:
276
                    # expand self
277
                    prefix, node, path = heapq.heappop(self_pending)
278
                    if check_excluded(path):
279
                        continue
280
                    if type(node) == str:
281
                        # a value
282
                        yield (tuple(prefix.split('\x00')), None, node)
283
                    else:
284
                        process_node(prefix, node, path, self, self_pending)
285
                        continue
286
                elif self_pending[0][0] > basis_pending[0][0]:
287
                    # expand basis
288
                    prefix, node, path = heapq.heappop(basis_pending)
289
                    if check_excluded(path):
290
                        continue
291
                    if type(node) == str:
292
                        # a value
293
                        yield (tuple(prefix.split('\x00')), node, None)
294
                    else:
295
                        process_node(prefix, node, path, basis, basis_pending)
296
                        continue
297
                else:
298
                    # common prefix: possibly expand both
299
                    if type(self_pending[0][1]) != str:
300
                        # process next self
301
                        read_self = True
302
                    else:
303
                        read_self = False
304
                    if type(basis_pending[0][1]) != str:
305
                        # process next basis
306
                        read_basis = True
307
                    else:
308
                        read_basis = False
309
                    if not read_self and not read_basis:
310
                        # compare a common value
311
                        self_details = heapq.heappop(self_pending)
312
                        basis_details = heapq.heappop(basis_pending)
313
                        if self_details[1] != basis_details[1]:
314
                            yield (tuple(self_details[0].split('\x00')),
315
                                basis_details[1], self_details[1])
316
                        continue
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
317
                    # At least one side wasn't a string.
318
                    if (self._node_key(self_pending[0][1]) ==
319
                        self._node_key(basis_pending[0][1])):
320
                        # Identical pointers, skip (and don't bother adding to
321
                        # excluded, it won't turn up again.
322
                        heapq.heappop(self_pending)
323
                        heapq.heappop(basis_pending)
324
                        continue
325
                    # Now we need to expand this node before we can continue
3735.2.31 by Robert Collins
CHKMap.iter_changes
326
                    if read_self:
327
                        prefix, node, path = heapq.heappop(self_pending)
328
                        if check_excluded(path):
329
                            continue
330
                        process_node(prefix, node, path, self, self_pending)
331
                    if read_basis:
332
                        prefix, node, path = heapq.heappop(basis_pending)
333
                        if check_excluded(path):
334
                            continue
335
                        process_node(prefix, node, path, basis, basis_pending)
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
336
        # print loop_counter
3735.2.30 by Robert Collins
Start iter_changes between CHKMap instances.
337
3735.2.9 by Robert Collins
Get a working chk_map using inventory implementation bootstrapped.
338
    def iteritems(self, key_filter=None):
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
339
        """Iterate over the entire CHKMap's contents."""
340
        self._ensure_root()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
341
        return self._root_node.iteritems(self._store, key_filter=key_filter)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
342
3735.2.12 by Robert Collins
Implement commit-via-deltas for split inventory repositories.
343
    def key(self):
344
        """Return the key for this map."""
345
        if isinstance(self._root_node, tuple):
346
            return self._root_node
347
        else:
348
            return self._root_node._key
349
3735.2.17 by Robert Collins
Cache node length to avoid full iteration on __len__ calls.
350
    def __len__(self):
351
        self._ensure_root()
352
        return len(self._root_node)
353
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
354
    def map(self, key, value):
355
        """Map a key tuple to value."""
356
        # Need a root object.
357
        self._ensure_root()
358
        prefix, node_details = self._root_node.map(self._store, key, value)
359
        if len(node_details) == 1:
360
            self._root_node = node_details[0][1]
361
        else:
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
362
            self._root_node = InternalNode(prefix)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
363
            self._root_node.set_maximum_size(node_details[0][1].maximum_size)
364
            self._root_node._key_width = node_details[0][1]._key_width
365
            for split, node in node_details:
366
                self._root_node.add_node(split, node)
367
3735.2.31 by Robert Collins
CHKMap.iter_changes
368
    def _node_key(self, node):
369
        """Get the key for a node whether its a tuple o r node."""
370
        if type(node) == tuple:
371
            return node
372
        else:
373
            return node._key
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
374
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
375
    def unmap(self, key):
376
        """remove key from the map."""
377
        self._ensure_root()
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
378
        unmapped = self._root_node.unmap(self._store, key)
379
        self._root_node = unmapped
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
380
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
381
    def _save(self):
382
        """Save the map completely.
383
384
        :return: The key of the root node.
385
        """
386
        if type(self._root_node) == tuple:
387
            # Already saved.
388
            return self._root_node
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
389
        keys = list(self._root_node.serialise(self._store))
390
        return keys[-1]
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
391
392
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
393
class Node(object):
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
394
    """Base class defining the protocol for CHK Map nodes.
395
396
    :ivar _raw_size: The total size of the serialized key:value data, before
397
        adding the header bytes, and without prefix compression.
398
    """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
399
400
    def __init__(self, key_width=1):
401
        """Create a node.
402
403
        :param key_width: The width of keys for this node.
404
        """
405
        self._key = None
406
        # Current number of elements
407
        self._len = 0
408
        self._maximum_size = 0
409
        self._key_width = 1
410
        # current size in bytes
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
411
        self._raw_size = 0
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
412
        # The pointers/values this node has - meaning defined by child classes.
413
        self._items = {}
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
414
        # The common search prefix
415
        self._search_prefix = None
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
416
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
417
    def __repr__(self):
418
        items_str = sorted(self._items)
419
        if len(items_str) > 20:
420
            items_str = items_str[16] + '...]'
3735.15.3 by John Arbash Meinel
repr update
421
        return '%s(key:%s len:%s size:%s max:%s prefix:%s items:%s)' % (
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
422
            self.__class__.__name__, self._key, self._len, self._raw_size,
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
423
            self._maximum_size, self._search_prefix, items_str)
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
424
3735.2.38 by Robert Collins
Sufficient fixes to allow bzr-search to index a dev3 format repository.
425
    def key(self):
426
        return self._key
427
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
428
    def __len__(self):
429
        return self._len
430
431
    @property
432
    def maximum_size(self):
433
        """What is the upper limit for adding references to a node."""
434
        return self._maximum_size
435
436
    def set_maximum_size(self, new_size):
437
        """Set the size threshold for nodes.
438
439
        :param new_size: The size at which no data is added to a node. 0 for
440
            unlimited.
441
        """
442
        self._maximum_size = new_size
443
3735.15.5 by John Arbash Meinel
Change the nomenclature.
444
    @classmethod
445
    def common_prefix(cls, prefix, key):
446
        """Given 2 strings, return the longest prefix common to both.
447
448
        :param prefix: This has been the common prefix for other keys, so it is
449
            more likely to be the common prefix in this case as well.
450
        :param key: Another string to compare to
451
        """
452
        if key.startswith(prefix):
453
            return prefix
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
454
        # Is there a better way to do this?
3735.15.5 by John Arbash Meinel
Change the nomenclature.
455
        for pos, (left, right) in enumerate(zip(prefix, key)):
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
456
            if left != right:
457
                pos -= 1
458
                break
3735.15.5 by John Arbash Meinel
Change the nomenclature.
459
        assert pos <= len(prefix)
460
        assert pos <= len(key)
461
        common = prefix[:pos+1]
462
        assert key.startswith(common)
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
463
        return common
464
3735.15.5 by John Arbash Meinel
Change the nomenclature.
465
    @classmethod
466
    def common_prefix_for_keys(cls, keys):
467
        """Given a list of keys, find their common prefix.
468
469
        :param keys: An iterable of strings.
470
        :return: The longest common prefix of all keys.
471
        """
472
        common_prefix = None
473
        for key in keys:
474
            if common_prefix is None:
475
                common_prefix = key
476
                continue
477
            common_prefix = cls.common_prefix(common_prefix, key)
478
            if not common_prefix:
479
                # if common_prefix is the empty string, then we know it won't
480
                # change further
481
                return ''
482
        return common_prefix
483
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
484
485
class LeafNode(Node):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
486
    """A node containing actual key:value pairs.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
487
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
488
    :ivar _items: A dict of key->value items. The key is in tuple form.
3735.15.4 by John Arbash Meinel
Clean up some little bits.
489
    :ivar _size: The number of bytes that would be used by serializing all of
490
        the key/value pairs.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
491
    """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
492
493
    def __init__(self):
494
        Node.__init__(self)
3735.15.4 by John Arbash Meinel
Clean up some little bits.
495
        # All of the keys in this leaf node share this common prefix
3735.15.5 by John Arbash Meinel
Change the nomenclature.
496
        self._common_serialised_prefix = None
497
        self._serialise_key = '\x00'.join
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
498
499
    def _current_size(self):
3735.15.4 by John Arbash Meinel
Clean up some little bits.
500
        """Answer the current serialised size of this node.
501
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
502
        This differs from self._raw_size in that it includes the bytes used for
503
        the header.
3735.15.4 by John Arbash Meinel
Clean up some little bits.
504
        """
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
505
        if self._common_serialised_prefix is None:
506
            bytes_for_items = 0
507
        else:
508
            # We will store a single string with the common prefix
509
            # And then that common prefix will not be stored in any of the
510
            # entry lines
511
            prefix_len = len(self._common_serialised_prefix)
512
            bytes_for_items = (self._raw_size - (prefix_len * (self._len - 1)))
513
        return (13 # bytes overhead for the header and separators
3735.15.8 by John Arbash Meinel
Add asserts so that when serializing and deserializing
514
            + len(str(self._maximum_size))
515
            + len(str(self._key_width))
3735.15.4 by John Arbash Meinel
Clean up some little bits.
516
            + len(str(self._len))
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
517
            + bytes_for_items)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
518
519
    @classmethod
520
    def deserialise(klass, bytes, key):
521
        """Deserialise bytes, with key key, into a LeafNode.
522
523
        :param bytes: The bytes of the node.
524
        :param key: The key that the serialised node has.
525
        """
526
        result = LeafNode()
3735.2.72 by John Arbash Meinel
Change deserialise to properly handle when there is a '\r' in the key.
527
        # Splitlines can split on '\r' so don't use it, split('\n') adds an
528
        # extra '' if the bytes ends in a final newline.
529
        lines = bytes.split('\n')
530
        assert lines[-1] == ''
531
        lines.pop(-1)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
532
        items = {}
533
        if lines[0] != 'chkleaf:':
534
            raise ValueError("not a serialised leaf node: %r" % bytes)
535
        maximum_size = int(lines[1])
536
        width = int(lines[2])
537
        length = int(lines[3])
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
538
        prefix = lines[4]
539
        for line in lines[5:]:
540
            line = prefix + line
3735.2.25 by Robert Collins
CHKInventory core tests passing.
541
            elements = line.split('\x00', width)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
542
            items[tuple(elements[:-1])] = elements[-1]
543
        if len(items) != length:
3735.14.3 by John Arbash Meinel
Properly remove keys that are found in the page cache. And add some debugging.
544
            raise AssertionError("item count (%d) mismatch for key %s,"
545
                " bytes %r" % (length, key, bytes))
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
546
        result._items = items
547
        result._len = length
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
548
        assert length == len(lines) - 5
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
549
        result._maximum_size = maximum_size
550
        result._key = key
551
        result._key_width = width
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
552
        result._raw_size = (sum(map(len, lines[5:])) # the length of the suffix
553
            + (length)*(len(prefix)+1)) # prefix + '\n'
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
554
        result._compute_search_prefix()
3735.15.5 by John Arbash Meinel
Change the nomenclature.
555
        result._compute_serialised_prefix()
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
556
        if len(bytes) != result._current_size():
557
            import pdb; pdb.set_trace()
3735.15.8 by John Arbash Meinel
Add asserts so that when serializing and deserializing
558
        assert len(bytes) == result._current_size()
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
559
        return result
560
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
561
    def iteritems(self, store, key_filter=None):
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
562
        """Iterate over items in the node.
563
564
        :param key_filter: A filter to apply to the node. It should be a
565
            list/set/dict or similar repeatedly iterable container.
566
        """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
567
        if key_filter is not None:
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
568
            # Adjust the filter - short elements go to a prefix filter. Would this
569
            # be cleaner explicitly? That would be no harder for InternalNode..
570
            # XXX: perhaps defaultdict? Profiling<rinse and repeat>
571
            filters = {}
572
            for key in key_filter:
573
                length_filter = filters.setdefault(len(key), set())
574
                length_filter.add(key)
575
            filters = filters.items()
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
576
            for item in self._items.iteritems():
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
577
                for length, length_filter in filters:
578
                    if item[0][:length] in length_filter:
579
                        yield item
580
                        break
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
581
        else:
582
            for item in self._items.iteritems():
583
                yield item
584
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
585
    def _key_value_len(self, key, value):
586
        # TODO: Should probably be done without actually joining the key, but
587
        #       then that can be done via the C extension
3735.15.5 by John Arbash Meinel
Change the nomenclature.
588
        return 2 + len(self._serialise_key(key)) + len(value)
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
589
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
590
    def _map_no_split(self, key, value):
591
        """Map a key to a value.
592
593
        This assumes either the key does not already exist, or you have already
594
        removed its size and length from self.
595
596
        :return: True if adding this node should cause us to split.
597
        """
598
        self._items[key] = value
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
599
        self._raw_size += self._key_value_len(key, value)
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
600
        self._len += 1
3735.15.5 by John Arbash Meinel
Change the nomenclature.
601
        serialised_key = self._serialise_key(key)
602
        if self._common_serialised_prefix is None:
603
            self._common_serialised_prefix = serialised_key
604
        else:
605
            self._common_serialised_prefix = self.common_prefix(
606
                self._common_serialised_prefix, serialised_key)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
607
        search_key = self._search_key(key)
608
        if self._search_prefix is None:
609
            self._search_prefix = search_key
3735.15.5 by John Arbash Meinel
Change the nomenclature.
610
        else:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
611
            self._search_prefix = self.common_prefix(
612
                self._search_prefix, search_key)
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
613
        if (self._len > 1
614
            and self._maximum_size
615
            and self._current_size() > self._maximum_size):
616
            return True
617
        return False
618
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
619
    def _split(self, store):
620
        """We have overflowed.
621
622
        Split this node into multiple LeafNodes, return it up the stack so that
623
        the next layer creates a new InternalNode and references the new nodes.
624
3735.15.5 by John Arbash Meinel
Change the nomenclature.
625
        :return: (common_serialised_prefix, [(node_serialised_prefix, node)])
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
626
        """
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
627
        common_prefix = self._search_prefix
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
628
        split_at = len(common_prefix) + 1
629
        result = {}
630
        for key, value in self._items.iteritems():
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
631
            search_key = self._search_key(key)
632
            prefix = search_key[:split_at]
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
633
            # TODO: Generally only 1 key can be exactly the right length,
634
            #       which means we can only have 1 key in the node pointed
635
            #       at by the 'prefix\0' key. We might want to consider
636
            #       folding it into the containing InternalNode rather than
637
            #       having a fixed length-1 node.
638
            #       Note this is probably not true for hash keys, as they
639
            #       may get a '\00' node anywhere, but won't have keys of
640
            #       different lengths.
641
            if len(prefix) < split_at:
642
                prefix += '\x00'*(split_at - len(prefix))
643
            if prefix not in result:
644
                node = LeafNode()
645
                node.set_maximum_size(self._maximum_size)
646
                node._key_width = self._key_width
647
                result[prefix] = node
648
            else:
649
                node = result[prefix]
650
            node.map(store, key, value)
651
        return common_prefix, result.items()
652
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
653
    def map(self, store, key, value):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
654
        """Map key to value."""
655
        if key in self._items:
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
656
            self._raw_size -= self._key_value_len(key, self._items[key])
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
657
            self._len -= 1
658
        self._key = None
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
659
        if self._map_no_split(key, value):
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
660
            return self._split(store)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
661
        else:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
662
            return self._search_prefix, [("", self)]
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
663
664
    def serialise(self, store):
665
        """Serialise the tree to store.
666
667
        :param store: A VersionedFiles honouring the CHK extensions.
668
        :return: An iterable of the keys inserted by this operation.
669
        """
670
        lines = ["chkleaf:\n"]
671
        lines.append("%d\n" % self._maximum_size)
672
        lines.append("%d\n" % self._key_width)
673
        lines.append("%d\n" % self._len)
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
674
        if self._common_serialised_prefix is None:
675
            lines.append('\n')
676
        else:
677
            lines.append('%s\n' % (self._common_serialised_prefix,))
678
            prefix_len = len(self._common_serialised_prefix)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
679
        for key, value in sorted(self._items.items()):
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
680
            serialized = "%s\x00%s\n" % (self._serialise_key(key), value)
681
            assert serialized.startswith(self._common_serialised_prefix)
682
            lines.append(serialized[prefix_len:])
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
683
        sha1, _, _ = store.add_lines((None,), (), lines)
684
        self._key = ("sha1:" + sha1,)
3735.15.8 by John Arbash Meinel
Add asserts so that when serializing and deserializing
685
        bytes = ''.join(lines)
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
686
        if len(bytes) != self._current_size():
687
            import pdb; pdb.set_trace()
3735.15.8 by John Arbash Meinel
Add asserts so that when serializing and deserializing
688
        assert len(bytes) == self._current_size()
689
        _page_cache.add(self._key, bytes)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
690
        return [self._key]
691
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
692
    def _search_key(self, key):
693
        """Return the search key for a key in this node."""
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
694
        return '\x00'.join(key)
695
3735.2.26 by Robert Collins
CHKInventory migrated to new CHKMap code.
696
    def refs(self):
697
        """Return the references to other CHK's held by this node."""
698
        return []
699
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
700
    def _compute_search_prefix(self):
701
        """Determine the common search prefix for all keys in this node.
3735.15.5 by John Arbash Meinel
Change the nomenclature.
702
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
703
        :return: A bytestring of the longest search key prefix that is
3735.15.5 by John Arbash Meinel
Change the nomenclature.
704
            unique within this node.
705
        """
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
706
        search_keys = [self._search_key(key) for key in self._items]
707
        self._search_prefix = self.common_prefix_for_keys(search_keys)
708
        return self._search_prefix
3735.15.5 by John Arbash Meinel
Change the nomenclature.
709
710
    def _compute_serialised_prefix(self):
711
        """Determine the common prefix for serialised keys in this node.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
712
713
        :return: A bytestring of the longest serialised key prefix that is
714
            unique within this node.
715
        """
3735.15.5 by John Arbash Meinel
Change the nomenclature.
716
        serialised_keys = [self._serialise_key(key) for key in self._items]
717
        self._common_serialised_prefix = self.common_prefix_for_keys(
718
            serialised_keys)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
719
720
    def unmap(self, store, key):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
721
        """Unmap key from the node."""
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
722
        self._raw_size -= self._key_value_len(key, self._items[key])
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
723
        self._len -= 1
724
        del self._items[key]
725
        self._key = None
3735.15.2 by John Arbash Meinel
Change LeafNode to also cache its unique serialized prefix.
726
        # Recompute from scratch
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
727
        self._compute_search_prefix()
3735.15.5 by John Arbash Meinel
Change the nomenclature.
728
        self._compute_serialised_prefix()
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
729
        return self
730
731
732
class InternalNode(Node):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
733
    """A node that contains references to other nodes.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
734
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
735
    An InternalNode is responsible for mapping search key prefixes to child
3735.15.5 by John Arbash Meinel
Change the nomenclature.
736
    nodes.
3735.15.4 by John Arbash Meinel
Clean up some little bits.
737
3735.15.5 by John Arbash Meinel
Change the nomenclature.
738
    :ivar _items: serialised_key => node dictionary. node may be a tuple,
3735.15.4 by John Arbash Meinel
Clean up some little bits.
739
        LeafNode or InternalNode.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
740
    """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
741
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
742
    def __init__(self, prefix=''):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
743
        Node.__init__(self)
744
        # The size of an internalnode with default values and no children.
745
        # How many octets key prefixes within this node are.
746
        self._node_width = 0
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
747
        self._search_prefix = prefix
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
748
749
    def __repr__(self):
750
        items_str = sorted(self._items)
751
        if len(items_str) > 20:
752
            items_str = items_str[16] + '...]'
753
        return '%s(key:%s len:%s size:%s max:%s prefix:%s items:%s)' % (
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
754
            self.__class__.__name__, self._key, self._len, self._raw_size,
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
755
            self._maximum_size, self._search_prefix, items_str)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
756
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
757
    def add_node(self, prefix, node):
758
        """Add a child node with prefix prefix, and node node.
759
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
760
        :param prefix: The search key prefix for node.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
761
        :param node: The node being added.
762
        """
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
763
        assert self._search_prefix is not None
764
        assert prefix.startswith(self._search_prefix)
765
        assert len(prefix) == len(self._search_prefix) + 1
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
766
        self._len += len(node)
767
        if not len(self._items):
768
            self._node_width = len(prefix)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
769
        assert self._node_width == len(self._search_prefix) + 1
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
770
        self._items[prefix] = node
771
        self._key = None
772
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
773
    def _current_size(self):
774
        """Answer the current serialised size of this node."""
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
775
        return (self._raw_size + len(str(self._len)) + len(str(self._key_width)) +
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
776
            len(str(self._maximum_size)))
777
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
778
    @classmethod
779
    def deserialise(klass, bytes, key):
3735.2.25 by Robert Collins
CHKInventory core tests passing.
780
        """Deserialise bytes to an InternalNode, with key key.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
781
782
        :param bytes: The bytes of the node.
783
        :param key: The key that the serialised node has.
784
        :return: An InternalNode instance.
785
        """
786
        result = InternalNode()
3735.2.72 by John Arbash Meinel
Change deserialise to properly handle when there is a '\r' in the key.
787
        # Splitlines can split on '\r' so don't use it, remove the extra ''
788
        # from the result of split('\n') because we should have a trailing
789
        # newline
790
        lines = bytes.split('\n')
791
        assert lines[-1] == ''
792
        lines.pop(-1)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
793
        items = {}
794
        if lines[0] != 'chknode:':
795
            raise ValueError("not a serialised internal node: %r" % bytes)
796
        maximum_size = int(lines[1])
797
        width = int(lines[2])
798
        length = int(lines[3])
3735.15.11 by John Arbash Meinel
Change the InternalNodes to also pull out the common prefix.
799
        common_prefix = lines[4]
800
        for line in lines[5:]:
801
            line = common_prefix + line
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
802
            prefix, flat_key = line.rsplit('\x00', 1)
803
            items[prefix] = (flat_key,)
804
        result._items = items
805
        result._len = length
806
        result._maximum_size = maximum_size
807
        result._key = key
808
        result._key_width = width
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
809
        # XXX: InternalNodes don't really care about their size, and this will
810
        #      change if we add prefix compression
811
        result._raw_size = None # len(bytes)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
812
        result._node_width = len(prefix)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
813
        result._compute_search_prefix()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
814
        return result
815
816
    def iteritems(self, store, key_filter=None):
817
        for node in self._iter_nodes(store, key_filter=key_filter):
818
            for item in node.iteritems(store, key_filter=key_filter):
819
                yield item
820
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
821
    def _iter_nodes(self, store, key_filter=None, batch_size=None):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
822
        """Iterate over node objects which match key_filter.
823
824
        :param store: A store to use for accessing content.
825
        :param key_filter: A key filter to filter nodes. Only nodes that might
826
            contain a key in key_filter will be returned.
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
827
        :param batch_size: If not None, then we will return the nodes that had
828
            to be read using get_record_stream in batches, rather than reading
829
            them all at once.
830
        :return: An iterable of nodes. This function does not have to be fully
831
            consumed.  (There will be no pending I/O when items are being returned.)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
832
        """
3735.2.31 by Robert Collins
CHKMap.iter_changes
833
        keys = {}
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
834
        if key_filter is None:
3735.2.31 by Robert Collins
CHKMap.iter_changes
835
            for prefix, node in self._items.iteritems():
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
836
                if type(node) == tuple:
3735.2.31 by Robert Collins
CHKMap.iter_changes
837
                    keys[node] = prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
838
                else:
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
839
                    yield node
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
840
        else:
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
841
            # XXX defaultdict ?
842
            length_filters = {}
843
            for key in key_filter:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
844
                search_key = self._search_prefix_filter(key)
845
                length_filter = length_filters.setdefault(
846
                                    len(search_key), set())
847
                length_filter.add(search_key)
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
848
            length_filters = length_filters.items()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
849
            for prefix, node in self._items.iteritems():
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
850
                for length, length_filter in length_filters:
851
                    if prefix[:length] in length_filter:
852
                        if type(node) == tuple:
853
                            keys[node] = prefix
854
                        else:
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
855
                            yield node
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
856
                        break
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
857
        if keys:
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
858
            # Look in the page cache for some more bytes
859
            found_keys = set()
860
            for key in keys:
861
                try:
862
                    bytes = _page_cache[key]
863
                except KeyError:
864
                    continue
865
                else:
866
                    node = _deserialise(bytes, key)
867
                    self._items[keys[key]] = node
868
                    found_keys.add(key)
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
869
                    yield node
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
870
            for key in found_keys:
871
                del keys[key]
872
        if keys:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
873
            # demand load some pages.
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
874
            if batch_size is None:
875
                # Read all the keys in
876
                batch_size = len(keys)
877
            key_order = list(keys)
878
            for batch_start in range(0, len(key_order), batch_size):
879
                batch = key_order[batch_start:batch_start + batch_size]
880
                # We have to fully consume the stream so there is no pending
881
                # I/O, so we buffer the nodes for now.
882
                stream = store.get_record_stream(batch, 'unordered', True)
883
                nodes = []
884
                for record in stream:
885
                    bytes = record.get_bytes_as('fulltext')
886
                    node = _deserialise(bytes, record.key)
887
                    nodes.append(node)
888
                    self._items[keys[record.key]] = node
889
                    _page_cache.add(record.key, bytes)
890
                for node in nodes:
891
                    yield node
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
892
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
893
    def map(self, store, key, value):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
894
        """Map key to value."""
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
895
        if not len(self._items):
896
            raise AssertionError("cant map in an empty InternalNode.")
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
897
        search_key = self._search_key(key)
898
        assert self._node_width == len(self._search_prefix) + 1
899
        if not search_key.startswith(self._search_prefix):
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
900
            # This key doesn't fit in this index, so we need to split at the
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
901
            # point where it would fit, insert self into that internal node,
902
            # and then map this key into that node.
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
903
            new_prefix = self.common_prefix(self._search_prefix,
904
                                            search_key)
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
905
            new_parent = InternalNode(new_prefix)
906
            new_parent.set_maximum_size(self._maximum_size)
907
            new_parent._key_width = self._key_width
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
908
            new_parent.add_node(self._search_prefix[:len(new_prefix)+1],
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
909
                                self)
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
910
            return new_parent.map(store, key, value)
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
911
        children = list(self._iter_nodes(store, key_filter=[key]))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
912
        if children:
913
            child = children[0]
914
        else:
915
            # new child needed:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
916
            child = self._new_child(search_key, LeafNode)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
917
        old_len = len(child)
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
918
        if isinstance(child, LeafNode):
919
            old_size = child._current_size()
920
        else:
921
            old_size = None
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
922
        prefix, node_details = child.map(store, key, value)
923
        if len(node_details) == 1:
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
924
            # child may have shrunk, or might be a new node
925
            child = node_details[0][1]
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
926
            self._len = self._len - old_len + len(child)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
927
            self._items[search_key] = child
3735.2.29 by Robert Collins
Untested code is broken code.
928
            self._key = None
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
929
            new_node = self
930
            if (isinstance(child, LeafNode)
931
                and (old_size is None or child._current_size() < old_size)):
932
                # The old node was an InternalNode which means it has now
933
                # collapsed, so we need to check if it will chain to a collapse
934
                # at this level. Or the LeafNode has shrunk in size, so we need
935
                # to check that as well.
936
                new_node = self._check_remap(store)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
937
            assert new_node._search_prefix is not None
938
            return new_node._search_prefix, [('', new_node)]
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
939
        # child has overflown - create a new intermediate node.
940
        # XXX: This is where we might want to try and expand our depth
941
        # to refer to more bytes of every child (which would give us
942
        # multiple pointers to child nodes, but less intermediate nodes)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
943
        child = self._new_child(search_key, InternalNode)
944
        child._search_prefix = prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
945
        for split, node in node_details:
946
            child.add_node(split, node)
947
        self._len = self._len - old_len + len(child)
3735.2.29 by Robert Collins
Untested code is broken code.
948
        self._key = None
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
949
        return self._search_prefix, [("", self)]
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
950
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
951
    def _new_child(self, search_key, klass):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
952
        """Create a new child node of type klass."""
953
        child = klass()
954
        child.set_maximum_size(self._maximum_size)
955
        child._key_width = self._key_width
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
956
        self._items[search_key] = child
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
957
        return child
958
959
    def serialise(self, store):
960
        """Serialise the node to store.
961
962
        :param store: A VersionedFiles honouring the CHK extensions.
963
        :return: An iterable of the keys inserted by this operation.
964
        """
965
        for node in self._items.itervalues():
966
            if type(node) == tuple:
967
                # Never deserialised.
968
                continue
969
            if node._key is not None:
970
                # Never altered
971
                continue
972
            for key in node.serialise(store):
973
                yield key
974
        lines = ["chknode:\n"]
975
        lines.append("%d\n" % self._maximum_size)
976
        lines.append("%d\n" % self._key_width)
977
        lines.append("%d\n" % self._len)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
978
        assert self._search_prefix is not None
979
        lines.append('%s\n' % (self._search_prefix,))
980
        prefix_len = len(self._search_prefix)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
981
        for prefix, node in sorted(self._items.items()):
982
            if type(node) == tuple:
983
                key = node[0]
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
984
            else:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
985
                key = node._key[0]
3735.15.11 by John Arbash Meinel
Change the InternalNodes to also pull out the common prefix.
986
            serialised = "%s\x00%s\n" % (prefix, key)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
987
            assert serialised.startswith(self._search_prefix)
3735.15.11 by John Arbash Meinel
Change the InternalNodes to also pull out the common prefix.
988
            lines.append(serialised[prefix_len:])
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
989
        sha1, _, _ = store.add_lines((None,), (), lines)
990
        self._key = ("sha1:" + sha1,)
3735.2.63 by Robert Collins
Divert writes into the CHK page cache as well.
991
        _page_cache.add(self._key, ''.join(lines))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
992
        yield self._key
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
993
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
994
    def _search_key(self, key):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
995
        """Return the serialised key for key in this node."""
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
996
        # search keys are fixed width. All will be self._node_width wide, so we
3735.15.5 by John Arbash Meinel
Change the nomenclature.
997
        # pad as necessary.
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
998
        return ('\x00'.join(key) + '\x00'*self._node_width)[:self._node_width]
999
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1000
    def _search_prefix_filter(self, key):
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
1001
        """Serialise key for use as a prefix filter in iteritems."""
1002
        if len(key) == self._key_width:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1003
            return self._search_key(key)
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
1004
        return '\x00'.join(key)[:self._node_width]
1005
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1006
    def _split(self, offset):
1007
        """Split this node into smaller nodes starting at offset.
1008
1009
        :param offset: The offset to start the new child nodes at.
1010
        :return: An iterable of (prefix, node) tuples. prefix is a byte
1011
            prefix for reaching node.
1012
        """
1013
        if offset >= self._node_width:
1014
            for node in self._items.values():
1015
                for result in node._split(offset):
1016
                    yield result
1017
            return
1018
        for key, node in self._items.items():
1019
            pass
1020
3735.2.26 by Robert Collins
CHKInventory migrated to new CHKMap code.
1021
    def refs(self):
1022
        """Return the references to other CHK's held by this node."""
1023
        if self._key is None:
1024
            raise AssertionError("unserialised nodes have no refs.")
1025
        refs = []
1026
        for value in self._items.itervalues():
1027
            if type(value) == tuple:
1028
                refs.append(value)
1029
            else:
1030
                refs.append(value.key())
1031
        return refs
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1032
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1033
    def _compute_search_prefix(self, extra_key=None):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1034
        """Return the unique key prefix for this node.
1035
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1036
        :return: A bytestring of the longest search key prefix that is
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1037
            unique within this node.
1038
        """
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1039
        self._search_prefix = self.common_prefix_for_keys(self._items)
1040
        return self._search_prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1041
1042
    def unmap(self, store, key):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1043
        """Remove key from this node and it's children."""
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1044
        if not len(self._items):
1045
            raise AssertionError("cant unmap in an empty InternalNode.")
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
1046
        children = list(self._iter_nodes(store, key_filter=[key]))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1047
        if children:
1048
            child = children[0]
1049
        else:
1050
            raise KeyError(key)
1051
        self._len -= 1
1052
        unmapped = child.unmap(store, key)
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1053
        self._key = None
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1054
        search_key = self._search_key(key)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1055
        if len(unmapped) == 0:
1056
            # All child nodes are gone, remove the child:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1057
            del self._items[search_key]
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1058
            unmapped = None
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1059
        else:
1060
            # Stash the returned node
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1061
            self._items[search_key] = unmapped
3735.2.23 by Robert Collins
Test unmapping with one child left but multiple keys.
1062
        if len(self._items) == 1:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1063
            # this node is no longer needed:
1064
            return self._items.values()[0]
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1065
        if isinstance(unmapped, InternalNode):
1066
            return self
1067
        return self._check_remap(store)
1068
1069
    def _check_remap(self, store):
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
1070
        """Check if all keys contained by children fit in a single LeafNode.
1071
1072
        :param store: A store to use for reading more nodes
1073
        :return: Either self, or a new LeafNode which should replace self.
1074
        """
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1075
        # Logic for how we determine when we need to rebuild
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1076
        # 1) Implicitly unmap() is removing a key which means that the child
1077
        #    nodes are going to be shrinking by some extent.
1078
        # 2) If all children are LeafNodes, it is possible that they could be
1079
        #    combined into a single LeafNode, which can then completely replace
1080
        #    this internal node with a single LeafNode
1081
        # 3) If *one* child is an InternalNode, we assume it has already done
1082
        #    all the work to determine that its children cannot collapse, and
1083
        #    we can then assume that those nodes *plus* the current nodes don't
1084
        #    have a chance of collapsing either.
1085
        #    So a very cheap check is to just say if 'unmapped' is an
1086
        #    InternalNode, we don't have to check further.
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1087
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1088
        # TODO: Another alternative is to check the total size of all known
1089
        #       LeafNodes. If there is some formula we can use to determine the
1090
        #       final size without actually having to read in any more
1091
        #       children, it would be nice to have. However, we have to be
1092
        #       careful with stuff like nodes that pull out the common prefix
1093
        #       of each key, as adding a new key can change the common prefix
1094
        #       and cause size changes greater than the length of one key.
1095
        #       So for now, we just add everything to a new Leaf until it
1096
        #       splits, as we know that will give the right answer
1097
        new_leaf = LeafNode()
1098
        new_leaf.set_maximum_size(self._maximum_size)
1099
        new_leaf._key_width = self._key_width
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
1100
        # A batch_size of 16 was chosen because:
1101
        #   a) In testing, a 4k page held 14 times. So if we have more than 16
1102
        #      leaf nodes we are unlikely to hold them in a single new leaf
1103
        #      node. This still allows for 1 round trip
1104
        #   b) With 16-way fan out, we can still do a single round trip
1105
        #   c) With 255-way fan out, we don't want to read all 255 and destroy
1106
        #      the page cache, just to determine that we really don't need it.
1107
        for node in self._iter_nodes(store, batch_size=16):
1108
            if isinstance(node, InternalNode):
1109
                # Without looking at any leaf nodes, we are sure
1110
                return self
1111
            for key, value in node._items.iteritems():
1112
                if new_leaf._map_no_split(key, value):
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1113
                    return self
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1114
        return new_leaf
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1115
1116
3735.2.16 by Robert Collins
Untested extensions to support repodetails
1117
def _deserialise(bytes, key):
1118
    """Helper for repositorydetails - convert bytes to a node."""
3735.2.24 by Robert Collins
test_chk_map tests all passing.
1119
    if bytes.startswith("chkleaf:\n"):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1120
        return LeafNode.deserialise(bytes, key)
1121
    elif bytes.startswith("chknode:\n"):
1122
        return InternalNode.deserialise(bytes, key)
3735.2.16 by Robert Collins
Untested extensions to support repodetails
1123
    else:
1124
        raise AssertionError("Unknown node type.")
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1125
1126
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1127
def _find_children_info(store, interesting_keys, uninteresting_keys, pb):
3735.9.7 by John Arbash Meinel
Cleanup pass.
1128
    """Read the associated records, and determine what is interesting."""
1129
    uninteresting_keys = set(uninteresting_keys)
1130
    chks_to_read = uninteresting_keys.union(interesting_keys)
1131
    next_uninteresting = set()
1132
    next_interesting = set()
1133
    uninteresting_items = set()
1134
    interesting_items = set()
1135
    interesting_records = []
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1136
    # records_read = set()
3735.9.7 by John Arbash Meinel
Cleanup pass.
1137
    for record in store.get_record_stream(chks_to_read, 'unordered', True):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1138
        # records_read.add(record.key())
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1139
        if pb is not None:
1140
            pb.tick()
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1141
        bytes = record.get_bytes_as('fulltext')
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1142
        node = _deserialise(bytes, record.key)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1143
        if record.key in uninteresting_keys:
1144
            if isinstance(node, InternalNode):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1145
                next_uninteresting.update(node.refs())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1146
            else:
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1147
                # We know we are at a LeafNode, so we can pass None for the
1148
                # store
1149
                uninteresting_items.update(node.iteritems(None))
3735.9.7 by John Arbash Meinel
Cleanup pass.
1150
        else:
1151
            interesting_records.append(record)
1152
            if isinstance(node, InternalNode):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1153
                next_interesting.update(node.refs())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1154
            else:
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1155
                interesting_items.update(node.iteritems(None))
1156
    # TODO: Filter out records that have already been read, as node splitting
1157
    #       can cause us to reference the same nodes via shorter and longer
1158
    #       paths
3735.9.7 by John Arbash Meinel
Cleanup pass.
1159
    return (next_uninteresting, uninteresting_items,
1160
            next_interesting, interesting_records, interesting_items)
1161
1162
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1163
def _find_all_uninteresting(store, interesting_root_keys,
1164
                            uninteresting_root_keys, adapter, pb):
1165
    """Determine the full set of uninteresting keys."""
1166
    # What about duplicates between interesting_root_keys and
1167
    # uninteresting_root_keys?
1168
    if not uninteresting_root_keys:
1169
        # Shortcut case. We know there is nothing uninteresting to filter out
1170
        # So we just let the rest of the algorithm do the work
1171
        # We know there is nothing uninteresting, and we didn't have to read
1172
        # any interesting records yet.
1173
        return (set(), set(), set(interesting_root_keys), [], set())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1174
    all_uninteresting_chks = set(uninteresting_root_keys)
1175
    all_uninteresting_items = set()
1176
1177
    # First step, find the direct children of both the interesting and
1178
    # uninteresting set
1179
    (uninteresting_keys, uninteresting_items,
1180
     interesting_keys, interesting_records,
1181
     interesting_items) = _find_children_info(store, interesting_root_keys,
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1182
                                              uninteresting_root_keys,
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1183
                                              pb=pb)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1184
    all_uninteresting_chks.update(uninteresting_keys)
1185
    all_uninteresting_items.update(uninteresting_items)
1186
    del uninteresting_items
1187
    # Note: Exact matches between interesting and uninteresting do not need
1188
    #       to be search further. Non-exact matches need to be searched in case
1189
    #       there is a future exact-match
1190
    uninteresting_keys.difference_update(interesting_keys)
1191
3735.9.6 by John Arbash Meinel
Add a first pass to the interesting search.
1192
    # Second, find the full set of uninteresting bits reachable by the
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1193
    # uninteresting roots
3735.9.7 by John Arbash Meinel
Cleanup pass.
1194
    chks_to_read = uninteresting_keys
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1195
    while chks_to_read:
1196
        next_chks = set()
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1197
        for record in store.get_record_stream(chks_to_read, 'unordered', False):
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1198
            # TODO: Handle 'absent'
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1199
            if pb is not None:
1200
                pb.tick()
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1201
            if record.storage_kind in ('fulltext', 'chunked'):
1202
                bytes = record.get_bytes_as('fulltext')
1203
            else:
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1204
                bytes = adapter.get_bytes(record,
1205
                            record.get_bytes_as(record.storage_kind))
1206
            node = _deserialise(bytes, record.key)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1207
            if isinstance(node, InternalNode):
1208
                # uninteresting_prefix_chks.update(node._items.iteritems())
1209
                chks = node._items.values()
1210
                # TODO: We remove the entries that are already in
1211
                #       uninteresting_chks ?
1212
                next_chks.update(chks)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1213
                all_uninteresting_chks.update(chks)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1214
            else:
3735.9.7 by John Arbash Meinel
Cleanup pass.
1215
                all_uninteresting_items.update(node._items.iteritems())
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1216
        chks_to_read = next_chks
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1217
    return (all_uninteresting_chks, all_uninteresting_items,
1218
            interesting_keys, interesting_records, interesting_items)
1219
1220
1221
def iter_interesting_nodes(store, interesting_root_keys,
1222
                           uninteresting_root_keys, pb=None):
1223
    """Given root keys, find interesting nodes.
1224
1225
    Evaluate nodes referenced by interesting_root_keys. Ones that are also
1226
    referenced from uninteresting_root_keys are not considered interesting.
1227
1228
    :param interesting_root_keys: keys which should be part of the
1229
        "interesting" nodes (which will be yielded)
1230
    :param uninteresting_root_keys: keys which should be filtered out of the
1231
        result set.
1232
    :return: Yield
1233
        (interesting records, interesting chk's, interesting key:values)
1234
    """
1235
    # TODO: consider that it may be more memory efficient to use the 20-byte
1236
    #       sha1 string, rather than tuples of hexidecimal sha1 strings.
3735.2.68 by John Arbash Meinel
Add a TODO about avoiding all of the get_record_stream calls.
1237
    # TODO: Try to factor out a lot of the get_record_stream() calls into a
1238
    #       helper function similar to _read_bytes. This function should be
1239
    #       able to use nodes from the _page_cache as well as actually
1240
    #       requesting bytes from the store.
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1241
1242
    # A way to adapt from the compressed texts back into fulltexts
1243
    # In a way, this seems like a layering inversion to have CHKMap know the
1244
    # details of versionedfile
1245
    adapter_class = versionedfile.adapter_registry.get(
1246
        ('knit-ft-gz', 'fulltext'))
1247
    adapter = adapter_class(store)
1248
1249
    (all_uninteresting_chks, all_uninteresting_items, interesting_keys,
1250
     interesting_records, interesting_items) = _find_all_uninteresting(store,
1251
        interesting_root_keys, uninteresting_root_keys, adapter, pb)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1252
1253
    # Now that we know everything uninteresting, we can yield information from
1254
    # our first request
1255
    interesting_items.difference_update(all_uninteresting_items)
1256
    records = dict((record.key, record) for record in interesting_records
1257
                    if record.key not in all_uninteresting_chks)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1258
    if records or interesting_items:
1259
        yield records, interesting_items
3735.9.7 by John Arbash Meinel
Cleanup pass.
1260
    interesting_keys.difference_update(all_uninteresting_chks)
1261
1262
    chks_to_read = interesting_keys
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1263
    while chks_to_read:
1264
        next_chks = set()
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1265
        for record in store.get_record_stream(chks_to_read, 'unordered', False):
1266
            if pb is not None:
1267
                pb.tick()
3735.9.7 by John Arbash Meinel
Cleanup pass.
1268
            # TODO: Handle 'absent'?
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1269
            if record.storage_kind in ('fulltext', 'chunked'):
1270
                bytes = record.get_bytes_as('fulltext')
1271
            else:
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1272
                bytes = adapter.get_bytes(record,
1273
                            record.get_bytes_as(record.storage_kind))
1274
            node = _deserialise(bytes, record.key)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1275
            if isinstance(node, InternalNode):
3735.9.15 by John Arbash Meinel
Found a bug in iter_interesting_nodes and its test suite.
1276
                chks = set(node.refs())
1277
                chks.difference_update(all_uninteresting_chks)
1278
                # Is set() and .difference_update better than:
1279
                # chks = [chk for chk in node.refs()
1280
                #              if chk not in all_uninteresting_chks]
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1281
                next_chks.update(chks)
1282
                # These are now uninteresting everywhere else
3735.9.7 by John Arbash Meinel
Cleanup pass.
1283
                all_uninteresting_chks.update(chks)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1284
                interesting_items = []
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1285
            else:
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1286
                interesting_items = [item for item in node._items.iteritems()
1287
                                     if item not in all_uninteresting_items]
3735.9.15 by John Arbash Meinel
Found a bug in iter_interesting_nodes and its test suite.
1288
                # TODO: Do we need to filter out items that we have already
1289
                #       seen on other pages? We don't really want to buffer the
1290
                #       whole thing, but it does mean that callers need to
1291
                #       understand they may get duplicate values.
3735.9.7 by John Arbash Meinel
Cleanup pass.
1292
                # all_uninteresting_items.update(interesting_items)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1293
            yield {record.key: record}, interesting_items
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1294
        chks_to_read = next_chks