/brz/remove-bazaar

To get this branch, use:
bzr branch http://gegoxaren.bato24.eu/bzr/brz/remove-bazaar
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
1
# Copyright (C) 2008 Canonical Ltd
2
#
3
# This program is free software; you can redistribute it and/or modify
4
# it under the terms of the GNU General Public License as published by
5
# the Free Software Foundation; either version 2 of the License, or
6
# (at your option) any later version.
7
#
8
# This program is distributed in the hope that it will be useful,
9
# but WITHOUT ANY WARRANTY; without even the implied warranty of
10
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
11
# GNU General Public License for more details.
12
#
13
# You should have received a copy of the GNU General Public License
14
# along with this program; if not, write to the Free Software
15
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
16
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
17
"""Persistent maps from tuple_of_strings->string using CHK stores.
18
19
Overview and current status:
20
21
The CHKMap class implements a dict from tuple_of_strings->string by using a trie
22
with internal nodes of 8-bit fan out; The key tuples are mapped to strings by
23
joining them by \x00, and \x00 padding shorter keys out to the length of the
24
longest key. Leaf nodes are packed as densely as possible, and internal nodes
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
25
are all and additional 8-bits wide leading to a sparse upper tree.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
26
27
Updates to a CHKMap are done preferentially via the apply_delta method, to
28
allow optimisation of the update operation; but individual map/unmap calls are
29
possible and supported. All changes via map/unmap are buffered in memory until
30
the _save method is called to force serialisation of the tree. apply_delta
31
performs a _save implicitly.
32
33
TODO:
34
-----
35
36
Densely packed upper nodes.
37
38
"""
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
39
3735.2.31 by Robert Collins
CHKMap.iter_changes
40
import heapq
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
41
3735.9.18 by John Arbash Meinel
Make the versionedfile import lazy.
42
from bzrlib import lazy_import
43
lazy_import.lazy_import(globals(), """
44
from bzrlib import versionedfile
45
""")
3735.14.2 by John Arbash Meinel
Finish using the page cache as part of _check_remap, add debugging functions
46
from bzrlib import lru_cache
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
47
48
# approx 2MB
3735.14.5 by John Arbash Meinel
Change _check_remap to only page in a batch of children at a time.
49
# If each line is 50 bytes, and you have 255 internal pages, with 255-way fan
50
# out, it takes 3.1MB to cache the layer.
51
_PAGE_CACHE_SIZE = 4*1024*1024
3735.14.2 by John Arbash Meinel
Finish using the page cache as part of _check_remap, add debugging functions
52
# We are caching bytes so len(value) is perfectly accurate
53
_page_cache = lru_cache.LRUSizeCache(_PAGE_CACHE_SIZE)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
54
55
56
class CHKMap(object):
57
    """A persistent map from string to string backed by a CHK store."""
58
59
    def __init__(self, store, root_key):
60
        """Create a CHKMap object.
61
62
        :param store: The store the CHKMap is stored in.
63
        :param root_key: The root key of the map. None to create an empty
64
            CHKMap.
65
        """
66
        self._store = store
67
        if root_key is None:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
68
            self._root_node = LeafNode()
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
69
        else:
3735.2.41 by Robert Collins
Make the parent_id_basename index be updated during CHKInventory.apply_delta.
70
            self._root_node = self._node_key(root_key)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
71
72
    def apply_delta(self, delta):
73
        """Apply a delta to the map.
74
75
        :param delta: An iterable of old_key, new_key, new_value tuples.
76
            If new_key is not None, then new_key->new_value is inserted
77
            into the map; if old_key is not None, then the old mapping
78
            of old_key is removed.
79
        """
80
        for old, new, value in delta:
81
            if old is not None and old != new:
82
                # unmap
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
83
                self.unmap(old)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
84
        for old, new, value in delta:
85
            if new is not None:
86
                # map
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
87
                self.map(new, value)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
88
        return self._save()
89
90
    def _ensure_root(self):
91
        """Ensure that the root node is an object not a key."""
92
        if type(self._root_node) == tuple:
93
            # Demand-load the root
3735.2.31 by Robert Collins
CHKMap.iter_changes
94
            self._root_node = self._get_node(self._root_node)
95
96
    def _get_node(self, node):
97
        """Get a node.
98
99
        Node that this does not update the _items dict in objects containing a
100
        reference to this node. As such it does not prevent subsequent IO being
101
        performed.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
102
3735.2.31 by Robert Collins
CHKMap.iter_changes
103
        :param node: A tuple key or node object.
104
        :return: A node object.
105
        """
106
        if type(node) == tuple:
107
            bytes = self._read_bytes(node)
108
            return _deserialise(bytes, node)
109
        else:
110
            return node
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
111
112
    def _read_bytes(self, key):
113
        stream = self._store.get_record_stream([key], 'unordered', True)
114
        return stream.next().get_bytes_as('fulltext')
115
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
116
    def _dump_tree(self):
117
        """Return the tree in a string representation."""
118
        self._ensure_root()
119
        res = self._dump_tree_node(self._root_node, prefix='', indent='')
3735.11.9 by John Arbash Meinel
Switch _dump_tree to returning trailing '\n' for nicer results
120
        res.append('') # Give a trailing '\n'
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
121
        return '\n'.join(res)
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
122
123
    def _dump_tree_node(self, node, prefix, indent):
124
        """For this node and all children, generate a string representation."""
125
        result = []
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
126
        node_key = node.key()
127
        if node_key is not None:
128
            node_key = node_key[0]
129
        result.append('%s%r %s %s' % (indent, prefix, node.__class__.__name__,
130
                                        node_key))
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
131
        if isinstance(node, InternalNode):
132
            # Trigger all child nodes to get loaded
133
            list(node._iter_nodes(self._store))
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
134
            for prefix, sub in sorted(node._items.iteritems()):
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
135
                result.extend(self._dump_tree_node(sub, prefix, indent + '  '))
136
        else:
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
137
            for key, value in sorted(node._items.iteritems()):
138
                result.append('      %r %r' % (key, value))
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
139
        return result
140
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
141
    @classmethod
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
142
    def from_dict(klass, store, initial_value, maximum_size=0, key_width=1):
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
143
        """Create a CHKMap in store with initial_value as the content.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
144
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
145
        :param store: The store to record initial_value in, a VersionedFiles
146
            object with 1-tuple keys supporting CHK key generation.
147
        :param initial_value: A dict to store in store. Its keys and values
148
            must be bytestrings.
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
149
        :param maximum_size: The maximum_size rule to apply to nodes. This
150
            determines the size at which no new data is added to a single node.
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
151
        :param key_width: The number of elements in each key_tuple being stored
152
            in this map.
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
153
        :return: The root chk of te resulting CHKMap.
154
        """
155
        result = CHKMap(store, None)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
156
        result._root_node.set_maximum_size(maximum_size)
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
157
        result._root_node._key_width = key_width
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
158
        delta = []
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
159
        for key, value in initial_value.items():
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
160
            delta.append((None, key, value))
161
        result.apply_delta(delta)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
162
        return result._save()
163
3735.2.30 by Robert Collins
Start iter_changes between CHKMap instances.
164
    def iter_changes(self, basis):
165
        """Iterate over the changes between basis and self.
166
167
        :return: An iterator of tuples: (key, old_value, new_value). Old_value
168
            is None for keys only in self; new_value is None for keys only in
169
            basis.
170
        """
3735.2.31 by Robert Collins
CHKMap.iter_changes
171
        # Overview:
172
        # Read both trees in lexographic, highest-first order.
173
        # Any identical nodes we skip
174
        # Any unique prefixes we output immediately.
175
        # values in a leaf node are treated as single-value nodes in the tree
176
        # which allows them to be not-special-cased. We know to output them
177
        # because their value is a string, not a key(tuple) or node.
178
        #
179
        # corner cases to beware of when considering this function:
180
        # *) common references are at different heights.
181
        #    consider two trees:
182
        #    {'a': LeafNode={'aaa':'foo', 'aab':'bar'}, 'b': LeafNode={'b'}}
183
        #    {'a': InternalNode={'aa':LeafNode={'aaa':'foo', 'aab':'bar'}, 'ab':LeafNode={'ab':'bar'}}
184
        #     'b': LeafNode={'b'}}
185
        #    the node with aaa/aab will only be encountered in the second tree
186
        #    after reading the 'a' subtree, but it is encountered in the first
187
        #    tree immediately. Variations on this may have read internal nodes like this.
188
        #    we want to cut the entire pending subtree when we realise we have a common node.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
189
        #    For this we use a list of keys - the path to a node - and check the entire path is
3735.2.31 by Robert Collins
CHKMap.iter_changes
190
        #    clean as we process each item.
191
        if self._node_key(self._root_node) == self._node_key(basis._root_node):
192
            return
193
        self._ensure_root()
194
        basis._ensure_root()
195
        excluded_keys = set()
196
        self_node = self._root_node
197
        basis_node = basis._root_node
198
        # A heap, each element is prefix, node(tuple/NodeObject/string),
199
        # key_path (a list of tuples, tail-sharing down the tree.)
200
        self_pending = []
201
        basis_pending = []
202
        def process_node(prefix, node, path, a_map, pending):
203
            # take a node and expand it
204
            node = a_map._get_node(node)
205
            if type(node) == LeafNode:
206
                path = (node._key, path)
207
                for key, value in node._items.items():
208
                    heapq.heappush(pending, ('\x00'.join(key), value, path))
209
            else:
210
                # type(node) == InternalNode
211
                path = (node._key, path)
212
                for prefix, child in node._items.items():
213
                    heapq.heappush(pending, (prefix, child, path))
214
        process_node(None, self_node, None, self, self_pending)
215
        process_node(None, basis_node, None, basis, basis_pending)
216
        self_seen = set()
217
        basis_seen = set()
218
        excluded_keys = set()
219
        def check_excluded(key_path):
220
            # Note that this is N^2, it depends on us trimming trees
221
            # aggressively to not become slow.
222
            # A better implementation would probably have a reverse map
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
223
            # back to the children of a node, and jump straight to it when
3735.2.31 by Robert Collins
CHKMap.iter_changes
224
            # a common node is detected, the proceed to remove the already
225
            # pending children. bzrlib.graph has a searcher module with a
226
            # similar problem.
227
            while key_path is not None:
228
                key, key_path = key_path
229
                if key in excluded_keys:
230
                    return True
231
            return False
232
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
233
        loop_counter = 0
3735.2.31 by Robert Collins
CHKMap.iter_changes
234
        while self_pending or basis_pending:
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
235
            loop_counter += 1
3735.2.31 by Robert Collins
CHKMap.iter_changes
236
            if not self_pending:
237
                # self is exhausted: output remainder of basis
238
                for prefix, node, path in basis_pending:
239
                    if check_excluded(path):
240
                        continue
241
                    node = basis._get_node(node)
242
                    if type(node) == str:
243
                        # a value
244
                        yield (tuple(prefix.split('\x00')), node, None)
245
                    else:
246
                        # subtree - fastpath the entire thing.
247
                        for key, value in node.iteritems(basis._store):
248
                            yield (key, value, None)
249
                return
250
            elif not basis_pending:
251
                # basis is exhausted: output remainder of self.
252
                for prefix, node, path in self_pending:
253
                    if check_excluded(path):
254
                        continue
255
                    node = self._get_node(node)
256
                    if type(node) == str:
257
                        # a value
258
                        yield (tuple(prefix.split('\x00')), None, node)
259
                    else:
260
                        # subtree - fastpath the entire thing.
261
                        for key, value in node.iteritems(self._store):
262
                            yield (key, None, value)
263
                return
264
            else:
265
                # XXX: future optimisation - yield the smaller items
266
                # immediately rather than pushing everything on/off the
267
                # heaps. Applies to both internal nodes and leafnodes.
268
                if self_pending[0][0] < basis_pending[0][0]:
269
                    # expand self
270
                    prefix, node, path = heapq.heappop(self_pending)
271
                    if check_excluded(path):
272
                        continue
273
                    if type(node) == str:
274
                        # a value
275
                        yield (tuple(prefix.split('\x00')), None, node)
276
                    else:
277
                        process_node(prefix, node, path, self, self_pending)
278
                        continue
279
                elif self_pending[0][0] > basis_pending[0][0]:
280
                    # expand basis
281
                    prefix, node, path = heapq.heappop(basis_pending)
282
                    if check_excluded(path):
283
                        continue
284
                    if type(node) == str:
285
                        # a value
286
                        yield (tuple(prefix.split('\x00')), node, None)
287
                    else:
288
                        process_node(prefix, node, path, basis, basis_pending)
289
                        continue
290
                else:
291
                    # common prefix: possibly expand both
292
                    if type(self_pending[0][1]) != str:
293
                        # process next self
294
                        read_self = True
295
                    else:
296
                        read_self = False
297
                    if type(basis_pending[0][1]) != str:
298
                        # process next basis
299
                        read_basis = True
300
                    else:
301
                        read_basis = False
302
                    if not read_self and not read_basis:
303
                        # compare a common value
304
                        self_details = heapq.heappop(self_pending)
305
                        basis_details = heapq.heappop(basis_pending)
306
                        if self_details[1] != basis_details[1]:
307
                            yield (tuple(self_details[0].split('\x00')),
308
                                basis_details[1], self_details[1])
309
                        continue
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
310
                    # At least one side wasn't a string.
311
                    if (self._node_key(self_pending[0][1]) ==
312
                        self._node_key(basis_pending[0][1])):
313
                        # Identical pointers, skip (and don't bother adding to
314
                        # excluded, it won't turn up again.
315
                        heapq.heappop(self_pending)
316
                        heapq.heappop(basis_pending)
317
                        continue
318
                    # Now we need to expand this node before we can continue
3735.2.31 by Robert Collins
CHKMap.iter_changes
319
                    if read_self:
320
                        prefix, node, path = heapq.heappop(self_pending)
321
                        if check_excluded(path):
322
                            continue
323
                        process_node(prefix, node, path, self, self_pending)
324
                    if read_basis:
325
                        prefix, node, path = heapq.heappop(basis_pending)
326
                        if check_excluded(path):
327
                            continue
328
                        process_node(prefix, node, path, basis, basis_pending)
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
329
        # print loop_counter
3735.2.30 by Robert Collins
Start iter_changes between CHKMap instances.
330
3735.2.9 by Robert Collins
Get a working chk_map using inventory implementation bootstrapped.
331
    def iteritems(self, key_filter=None):
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
332
        """Iterate over the entire CHKMap's contents."""
333
        self._ensure_root()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
334
        return self._root_node.iteritems(self._store, key_filter=key_filter)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
335
3735.2.12 by Robert Collins
Implement commit-via-deltas for split inventory repositories.
336
    def key(self):
337
        """Return the key for this map."""
338
        if isinstance(self._root_node, tuple):
339
            return self._root_node
340
        else:
341
            return self._root_node._key
342
3735.2.17 by Robert Collins
Cache node length to avoid full iteration on __len__ calls.
343
    def __len__(self):
344
        self._ensure_root()
345
        return len(self._root_node)
346
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
347
    def map(self, key, value):
348
        """Map a key tuple to value."""
349
        # Need a root object.
350
        self._ensure_root()
351
        prefix, node_details = self._root_node.map(self._store, key, value)
352
        if len(node_details) == 1:
353
            self._root_node = node_details[0][1]
354
        else:
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
355
            self._root_node = InternalNode(prefix)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
356
            self._root_node.set_maximum_size(node_details[0][1].maximum_size)
357
            self._root_node._key_width = node_details[0][1]._key_width
358
            for split, node in node_details:
359
                self._root_node.add_node(split, node)
360
3735.2.31 by Robert Collins
CHKMap.iter_changes
361
    def _node_key(self, node):
362
        """Get the key for a node whether its a tuple o r node."""
363
        if type(node) == tuple:
364
            return node
365
        else:
366
            return node._key
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
367
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
368
    def unmap(self, key):
369
        """remove key from the map."""
370
        self._ensure_root()
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
371
        unmapped = self._root_node.unmap(self._store, key)
372
        self._root_node = unmapped
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
373
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
374
    def _save(self):
375
        """Save the map completely.
376
377
        :return: The key of the root node.
378
        """
379
        if type(self._root_node) == tuple:
380
            # Already saved.
381
            return self._root_node
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
382
        keys = list(self._root_node.serialise(self._store))
383
        return keys[-1]
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
384
385
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
386
class Node(object):
387
    """Base class defining the protocol for CHK Map nodes."""
388
389
    def __init__(self, key_width=1):
390
        """Create a node.
391
392
        :param key_width: The width of keys for this node.
393
        """
394
        self._key = None
395
        # Current number of elements
396
        self._len = 0
397
        self._maximum_size = 0
398
        self._key_width = 1
399
        # current size in bytes
400
        self._size = 0
401
        # The pointers/values this node has - meaning defined by child classes.
402
        self._items = {}
403
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
404
    def __repr__(self):
405
        items_str = sorted(self._items)
406
        if len(items_str) > 20:
407
            items_str = items_str[16] + '...]'
408
        return '%s(key:%s len:%s size:%s max:%s items:%s)' % (
409
            self.__class__.__name__, self._key, self._len, self._size,
410
            self._maximum_size, items_str)
411
3735.2.38 by Robert Collins
Sufficient fixes to allow bzr-search to index a dev3 format repository.
412
    def key(self):
413
        return self._key
414
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
415
    def __len__(self):
416
        return self._len
417
418
    @property
419
    def maximum_size(self):
420
        """What is the upper limit for adding references to a node."""
421
        return self._maximum_size
422
423
    def set_maximum_size(self, new_size):
424
        """Set the size threshold for nodes.
425
426
        :param new_size: The size at which no data is added to a node. 0 for
427
            unlimited.
428
        """
429
        self._maximum_size = new_size
430
431
432
class LeafNode(Node):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
433
    """A node containing actual key:value pairs.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
434
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
435
    :ivar _items: A dict of key->value items. The key is in tuple form.
436
    """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
437
438
    def __init__(self):
439
        Node.__init__(self)
440
        # The size of a leaf node with default values and no children.
441
        self._size = 12
442
443
    def _current_size(self):
444
        """Answer the current serialised size of this node."""
445
        return (self._size + len(str(self._len)) + len(str(self._key_width)) +
446
            len(str(self._maximum_size)))
447
448
    @classmethod
449
    def deserialise(klass, bytes, key):
450
        """Deserialise bytes, with key key, into a LeafNode.
451
452
        :param bytes: The bytes of the node.
453
        :param key: The key that the serialised node has.
454
        """
455
        result = LeafNode()
456
        lines = bytes.splitlines()
457
        items = {}
458
        if lines[0] != 'chkleaf:':
459
            raise ValueError("not a serialised leaf node: %r" % bytes)
460
        maximum_size = int(lines[1])
461
        width = int(lines[2])
462
        length = int(lines[3])
463
        for line in lines[4:]:
3735.2.25 by Robert Collins
CHKInventory core tests passing.
464
            elements = line.split('\x00', width)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
465
            items[tuple(elements[:-1])] = elements[-1]
466
        if len(items) != length:
3735.14.3 by John Arbash Meinel
Properly remove keys that are found in the page cache. And add some debugging.
467
            raise AssertionError("item count (%d) mismatch for key %s,"
468
                " bytes %r" % (length, key, bytes))
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
469
        result._items = items
470
        result._len = length
471
        result._maximum_size = maximum_size
472
        result._key = key
473
        result._key_width = width
474
        result._size = len(bytes)
475
        return result
476
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
477
    def iteritems(self, store, key_filter=None):
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
478
        """Iterate over items in the node.
479
480
        :param key_filter: A filter to apply to the node. It should be a
481
            list/set/dict or similar repeatedly iterable container.
482
        """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
483
        if key_filter is not None:
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
484
            # Adjust the filter - short elements go to a prefix filter. Would this
485
            # be cleaner explicitly? That would be no harder for InternalNode..
486
            # XXX: perhaps defaultdict? Profiling<rinse and repeat>
487
            filters = {}
488
            for key in key_filter:
489
                length_filter = filters.setdefault(len(key), set())
490
                length_filter.add(key)
491
            filters = filters.items()
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
492
            for item in self._items.iteritems():
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
493
                for length, length_filter in filters:
494
                    if item[0][:length] in length_filter:
495
                        yield item
496
                        break
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
497
        else:
498
            for item in self._items.iteritems():
499
                yield item
500
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
501
    def _key_value_len(self, key, value):
502
        # TODO: Should probably be done without actually joining the key, but
503
        #       then that can be done via the C extension
504
        return 2 + len('\x00'.join(key)) + len(value)
505
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
506
    def _map_no_split(self, key, value):
507
        """Map a key to a value.
508
509
        This assumes either the key does not already exist, or you have already
510
        removed its size and length from self.
511
512
        :return: True if adding this node should cause us to split.
513
        """
514
        self._items[key] = value
515
        self._size += self._key_value_len(key, value)
516
        self._len += 1
517
        if (self._len > 1
518
            and self._maximum_size
519
            and self._current_size() > self._maximum_size):
520
            return True
521
        return False
522
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
523
    def map(self, store, key, value):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
524
        """Map key to value."""
525
        if key in self._items:
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
526
            self._size -= self._key_value_len(key, self._items[key])
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
527
            self._len -= 1
528
        self._key = None
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
529
        if self._map_no_split(key, value):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
530
            common_prefix = self.unique_serialised_prefix()
531
            split_at = len(common_prefix) + 1
532
            result = {}
533
            for key, value in self._items.iteritems():
534
                serialised_key = self._serialised_key(key)
535
                prefix = serialised_key[:split_at]
3735.11.7 by John Arbash Meinel
Turns out that the LeafNode split code needs to be aware of having a key
536
                # TODO: Generally only 1 key can be exactly the right length,
537
                #       which means we can only have 1 key in the node pointed
538
                #       at by the 'prefix\0' key. We might want to consider
539
                #       folding it into the containing InternalNode rather than
540
                #       having a fixed length-1 node.
3735.11.8 by John Arbash Meinel
more comment
541
                #       Note this is probably not true for hash keys, as they
542
                #       may get a '\00' node anywhere, but won't have keys of
543
                #       different lengths.
3735.11.7 by John Arbash Meinel
Turns out that the LeafNode split code needs to be aware of having a key
544
                if len(prefix) < split_at:
545
                    prefix += '\x00'*(split_at - len(prefix))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
546
                if prefix not in result:
547
                    node = LeafNode()
548
                    node.set_maximum_size(self._maximum_size)
549
                    node._key_width = self._key_width
550
                    result[prefix] = node
551
                else:
552
                    node = result[prefix]
553
                node.map(store, key, value)
554
            return common_prefix, result.items()
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
555
        else:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
556
            return self.unique_serialised_prefix(), [("", self)]
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
557
558
    def serialise(self, store):
559
        """Serialise the tree to store.
560
561
        :param store: A VersionedFiles honouring the CHK extensions.
562
        :return: An iterable of the keys inserted by this operation.
563
        """
564
        lines = ["chkleaf:\n"]
565
        lines.append("%d\n" % self._maximum_size)
566
        lines.append("%d\n" % self._key_width)
567
        lines.append("%d\n" % self._len)
568
        for key, value in sorted(self._items.items()):
569
            lines.append("%s\x00%s\n" % ('\x00'.join(key), value))
570
        sha1, _, _ = store.add_lines((None,), (), lines)
571
        self._key = ("sha1:" + sha1,)
3735.2.63 by Robert Collins
Divert writes into the CHK page cache as well.
572
        _page_cache.add(self._key, ''.join(lines))
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
573
        return [self._key]
574
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
575
    def _serialised_key(self, key):
576
        """Return the serialised key for key in this node."""
577
        return '\x00'.join(key)
578
3735.2.26 by Robert Collins
CHKInventory migrated to new CHKMap code.
579
    def refs(self):
580
        """Return the references to other CHK's held by this node."""
581
        return []
582
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
583
    def unique_serialised_prefix(self):
584
        """Return the unique key prefix for this node.
585
586
        :return: A bytestring of the longest serialised key prefix that is
587
            unique within this node.
588
        """
589
        # may want to cache this eventually :- but wait for enough
590
        # functionality to profile.
591
        keys = list(self._items.keys())
592
        if not keys:
593
            return ""
594
        current_prefix = self._serialised_key(keys.pop(-1))
595
        while current_prefix and keys:
596
            next_key = self._serialised_key(keys.pop(-1))
597
            for pos, (left, right) in enumerate(zip(current_prefix, next_key)):
598
                if left != right:
599
                    pos -= 1
600
                    break
601
            current_prefix = current_prefix[:pos + 1]
602
        return current_prefix
603
604
    def unmap(self, store, key):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
605
        """Unmap key from the node."""
606
        self._size -= 2 + len('\x00'.join(key)) + len(self._items[key])
607
        self._len -= 1
608
        del self._items[key]
609
        self._key = None
610
        return self
611
612
613
class InternalNode(Node):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
614
    """A node that contains references to other nodes.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
615
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
616
    An InternalNode is responsible for mapping serialised key prefixes to child
617
    nodes. It is greedy - it will defer splitting itself as long as possible.
618
    """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
619
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
620
    def __init__(self, prefix=''):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
621
        Node.__init__(self)
622
        # The size of an internalnode with default values and no children.
623
        # self._size = 12
624
        # How many octets key prefixes within this node are.
625
        self._node_width = 0
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
626
        self._prefix = prefix
627
628
    def __repr__(self):
629
        items_str = sorted(self._items)
630
        if len(items_str) > 20:
631
            items_str = items_str[16] + '...]'
632
        return '%s(key:%s len:%s size:%s max:%s prefix:%s items:%s)' % (
633
            self.__class__.__name__, self._key, self._len, self._size,
634
            self._maximum_size, self._prefix, items_str)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
635
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
636
    def add_node(self, prefix, node):
637
        """Add a child node with prefix prefix, and node node.
638
639
        :param prefix: The serialised key prefix for node.
640
        :param node: The node being added.
641
        """
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
642
        assert self._prefix is not None
643
        assert prefix.startswith(self._prefix)
644
        assert len(prefix) == len(self._prefix) + 1
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
645
        self._len += len(node)
646
        if not len(self._items):
647
            self._node_width = len(prefix)
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
648
        assert self._node_width == len(self._prefix) + 1
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
649
        self._items[prefix] = node
650
        self._key = None
651
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
652
    def _current_size(self):
653
        """Answer the current serialised size of this node."""
654
        return (self._size + len(str(self._len)) + len(str(self._key_width)) +
655
            len(str(self._maximum_size)))
656
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
657
    @classmethod
658
    def deserialise(klass, bytes, key):
3735.2.25 by Robert Collins
CHKInventory core tests passing.
659
        """Deserialise bytes to an InternalNode, with key key.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
660
661
        :param bytes: The bytes of the node.
662
        :param key: The key that the serialised node has.
663
        :return: An InternalNode instance.
664
        """
665
        result = InternalNode()
666
        lines = bytes.splitlines()
667
        items = {}
668
        if lines[0] != 'chknode:':
669
            raise ValueError("not a serialised internal node: %r" % bytes)
670
        maximum_size = int(lines[1])
671
        width = int(lines[2])
672
        length = int(lines[3])
673
        for line in lines[4:]:
674
            prefix, flat_key = line.rsplit('\x00', 1)
675
            items[prefix] = (flat_key,)
676
        result._items = items
677
        result._len = length
678
        result._maximum_size = maximum_size
679
        result._key = key
680
        result._key_width = width
681
        result._size = len(bytes)
682
        result._node_width = len(prefix)
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
683
        result._prefix = result.unique_serialised_prefix()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
684
        return result
685
686
    def iteritems(self, store, key_filter=None):
687
        for node in self._iter_nodes(store, key_filter=key_filter):
688
            for item in node.iteritems(store, key_filter=key_filter):
689
                yield item
690
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
691
    def _iter_nodes(self, store, key_filter=None, batch_size=None):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
692
        """Iterate over node objects which match key_filter.
693
694
        :param store: A store to use for accessing content.
695
        :param key_filter: A key filter to filter nodes. Only nodes that might
696
            contain a key in key_filter will be returned.
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
697
        :param batch_size: If not None, then we will return the nodes that had
698
            to be read using get_record_stream in batches, rather than reading
699
            them all at once.
700
        :return: An iterable of nodes. This function does not have to be fully
701
            consumed.  (There will be no pending I/O when items are being returned.)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
702
        """
3735.2.31 by Robert Collins
CHKMap.iter_changes
703
        keys = {}
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
704
        if key_filter is None:
3735.2.31 by Robert Collins
CHKMap.iter_changes
705
            for prefix, node in self._items.iteritems():
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
706
                if type(node) == tuple:
3735.2.31 by Robert Collins
CHKMap.iter_changes
707
                    keys[node] = prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
708
                else:
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
709
                    yield node
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
710
        else:
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
711
            # XXX defaultdict ?
712
            length_filters = {}
713
            for key in key_filter:
714
                serialised_key = self._serialised_prefix_filter(key)
715
                length_filter = length_filters.setdefault(len(serialised_key),
716
                    set())
717
                length_filter.add(serialised_key)
718
            length_filters = length_filters.items()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
719
            for prefix, node in self._items.iteritems():
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
720
                for length, length_filter in length_filters:
721
                    if prefix[:length] in length_filter:
722
                        if type(node) == tuple:
723
                            keys[node] = prefix
724
                        else:
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
725
                            yield node
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
726
                        break
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
727
        if keys:
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
728
            # Look in the page cache for some more bytes
729
            found_keys = set()
730
            for key in keys:
731
                try:
732
                    bytes = _page_cache[key]
733
                except KeyError:
734
                    continue
735
                else:
736
                    node = _deserialise(bytes, key)
737
                    self._items[keys[key]] = node
738
                    found_keys.add(key)
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
739
                    yield node
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
740
            for key in found_keys:
741
                del keys[key]
742
        if keys:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
743
            # demand load some pages.
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
744
            if batch_size is None:
745
                # Read all the keys in
746
                batch_size = len(keys)
747
            key_order = list(keys)
748
            for batch_start in range(0, len(key_order), batch_size):
749
                batch = key_order[batch_start:batch_start + batch_size]
750
                # We have to fully consume the stream so there is no pending
751
                # I/O, so we buffer the nodes for now.
752
                stream = store.get_record_stream(batch, 'unordered', True)
753
                nodes = []
754
                for record in stream:
755
                    bytes = record.get_bytes_as('fulltext')
756
                    node = _deserialise(bytes, record.key)
757
                    nodes.append(node)
758
                    self._items[keys[record.key]] = node
759
                    _page_cache.add(record.key, bytes)
760
                for node in nodes:
761
                    yield node
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
762
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
763
    def map(self, store, key, value):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
764
        """Map key to value."""
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
765
        if not len(self._items):
766
            raise AssertionError("cant map in an empty InternalNode.")
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
767
        serialised_key = self._serialised_key(key)
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
768
        assert self._node_width == len(self._prefix) + 1
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
769
        if not serialised_key.startswith(self._prefix):
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
770
            # This key doesn't fit in this index, so we need to split at the
771
            # point where it would fit.
772
            # XXX: Do we need the serialised_key in its maximum length?
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
773
            new_prefix = self.unique_serialised_prefix(serialised_key)
774
            new_parent = InternalNode(new_prefix)
775
            new_parent.set_maximum_size(self._maximum_size)
776
            new_parent._key_width = self._key_width
777
            new_parent.add_node(self._prefix[:len(new_prefix)+1], self)
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
778
            assert new_parent._node_width == len(new_parent._prefix) + 1
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
779
            return new_parent.map(store, key, value)
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
780
        children = list(self._iter_nodes(store, key_filter=[key]))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
781
        if children:
782
            child = children[0]
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
783
            # if isinstance(child, InternalNode):
784
            #     child_prefix = child._prefix
785
            #     child_ser_key = child._serialised_key(key)
786
            #     if not child_ser_key.startswith(child_prefix):
787
            #         import pdb; pdb.set_trace()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
788
        else:
789
            # new child needed:
790
            child = self._new_child(serialised_key, LeafNode)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
791
        old_len = len(child)
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
792
        if isinstance(child, LeafNode):
793
            old_size = child._current_size()
794
        else:
795
            old_size = None
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
796
        prefix, node_details = child.map(store, key, value)
797
        if len(node_details) == 1:
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
798
            # child may have shrunk, or might be a new node
799
            child = node_details[0][1]
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
800
            self._len = self._len - old_len + len(child)
801
            self._items[serialised_key] = child
3735.2.29 by Robert Collins
Untested code is broken code.
802
            self._key = None
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
803
            new_node = self
804
            if (isinstance(child, LeafNode)
805
                and (old_size is None or child._current_size() < old_size)):
806
                # The old node was an InternalNode which means it has now
807
                # collapsed, so we need to check if it will chain to a collapse
808
                # at this level. Or the LeafNode has shrunk in size, so we need
809
                # to check that as well.
810
                new_node = self._check_remap(store)
811
            return new_node.unique_serialised_prefix(), [("", new_node)]
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
812
        # child has overflown - create a new intermediate node.
813
        # XXX: This is where we might want to try and expand our depth
814
        # to refer to more bytes of every child (which would give us
815
        # multiple pointers to child nodes, but less intermediate nodes)
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
816
        # TODO: Is this mapped as serialised_key or as prefix?
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
817
        child = self._new_child(serialised_key, InternalNode)
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
818
        child._prefix = prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
819
        for split, node in node_details:
820
            child.add_node(split, node)
821
        self._len = self._len - old_len + len(child)
3735.2.29 by Robert Collins
Untested code is broken code.
822
        self._key = None
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
823
        return self.unique_serialised_prefix(), [("", self)]
824
825
    def _new_child(self, serialised_key, klass):
826
        """Create a new child node of type klass."""
827
        child = klass()
828
        child.set_maximum_size(self._maximum_size)
829
        child._key_width = self._key_width
830
        self._items[serialised_key] = child
831
        return child
832
833
    def serialise(self, store):
834
        """Serialise the node to store.
835
836
        :param store: A VersionedFiles honouring the CHK extensions.
837
        :return: An iterable of the keys inserted by this operation.
838
        """
839
        for node in self._items.itervalues():
840
            if type(node) == tuple:
841
                # Never deserialised.
842
                continue
843
            if node._key is not None:
844
                # Never altered
845
                continue
846
            for key in node.serialise(store):
847
                yield key
848
        lines = ["chknode:\n"]
849
        lines.append("%d\n" % self._maximum_size)
850
        lines.append("%d\n" % self._key_width)
851
        lines.append("%d\n" % self._len)
852
        for prefix, node in sorted(self._items.items()):
853
            if type(node) == tuple:
854
                key = node[0]
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
855
            else:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
856
                key = node._key[0]
857
            lines.append("%s\x00%s\n" % (prefix, key))
858
        sha1, _, _ = store.add_lines((None,), (), lines)
859
        self._key = ("sha1:" + sha1,)
3735.2.63 by Robert Collins
Divert writes into the CHK page cache as well.
860
        _page_cache.add(self._key, ''.join(lines))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
861
        yield self._key
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
862
863
    def _serialised_key(self, key):
864
        """Return the serialised key for key in this node."""
865
        return ('\x00'.join(key) + '\x00'*self._node_width)[:self._node_width]
866
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
867
    def _serialised_prefix_filter(self, key):
868
        """Serialise key for use as a prefix filter in iteritems."""
869
        if len(key) == self._key_width:
870
            return self._serialised_key(key)
871
        return '\x00'.join(key)[:self._node_width]
872
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
873
    def _split(self, offset):
874
        """Split this node into smaller nodes starting at offset.
875
876
        :param offset: The offset to start the new child nodes at.
877
        :return: An iterable of (prefix, node) tuples. prefix is a byte
878
            prefix for reaching node.
879
        """
880
        if offset >= self._node_width:
881
            for node in self._items.values():
882
                for result in node._split(offset):
883
                    yield result
884
            return
885
        for key, node in self._items.items():
886
            pass
887
3735.2.26 by Robert Collins
CHKInventory migrated to new CHKMap code.
888
    def refs(self):
889
        """Return the references to other CHK's held by this node."""
890
        if self._key is None:
891
            raise AssertionError("unserialised nodes have no refs.")
892
        refs = []
893
        for value in self._items.itervalues():
894
            if type(value) == tuple:
895
                refs.append(value)
896
            else:
897
                refs.append(value.key())
898
        return refs
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
899
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
900
    def unique_serialised_prefix(self, extra_key=None):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
901
        """Return the unique key prefix for this node.
902
903
        :return: A bytestring of the longest serialised key prefix that is
904
            unique within this node.
905
        """
906
        # may want to cache this eventually :- but wait for enough
907
        # functionality to profile.
908
        keys = list(self._items.keys())
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
909
        if extra_key is not None:
910
            keys.append(extra_key)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
911
        if not keys:
912
            return ""
913
        current_prefix = keys.pop(-1)
914
        while current_prefix and keys:
915
            next_key = keys.pop(-1)
916
            for pos, (left, right) in enumerate(zip(current_prefix, next_key)):
917
                if left != right:
918
                    pos -= 1
919
                    break
920
            current_prefix = current_prefix[:pos + 1]
921
        return current_prefix
922
923
    def unmap(self, store, key):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
924
        """Remove key from this node and it's children."""
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
925
        if not len(self._items):
926
            raise AssertionError("cant unmap in an empty InternalNode.")
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
927
        children = list(self._iter_nodes(store, key_filter=[key]))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
928
        if children:
929
            child = children[0]
930
        else:
931
            raise KeyError(key)
932
        self._len -= 1
933
        unmapped = child.unmap(store, key)
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
934
        self._key = None
935
        serialised_key = self._serialised_key(key)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
936
        if len(unmapped) == 0:
937
            # All child nodes are gone, remove the child:
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
938
            del self._items[serialised_key]
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
939
            unmapped = None
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
940
        else:
941
            # Stash the returned node
942
            self._items[serialised_key] = unmapped
3735.2.23 by Robert Collins
Test unmapping with one child left but multiple keys.
943
        if len(self._items) == 1:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
944
            # this node is no longer needed:
945
            return self._items.values()[0]
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
946
        if isinstance(unmapped, InternalNode):
947
            return self
948
        return self._check_remap(store)
949
950
    def _check_remap(self, store):
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
951
        """Check if all keys contained by children fit in a single LeafNode.
952
953
        :param store: A store to use for reading more nodes
954
        :return: Either self, or a new LeafNode which should replace self.
955
        """
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
956
        # Logic for how we determine when we need to rebuild
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
957
        # 1) Implicitly unmap() is removing a key which means that the child
958
        #    nodes are going to be shrinking by some extent.
959
        # 2) If all children are LeafNodes, it is possible that they could be
960
        #    combined into a single LeafNode, which can then completely replace
961
        #    this internal node with a single LeafNode
962
        # 3) If *one* child is an InternalNode, we assume it has already done
963
        #    all the work to determine that its children cannot collapse, and
964
        #    we can then assume that those nodes *plus* the current nodes don't
965
        #    have a chance of collapsing either.
966
        #    So a very cheap check is to just say if 'unmapped' is an
967
        #    InternalNode, we don't have to check further.
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
968
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
969
        # TODO: Another alternative is to check the total size of all known
970
        #       LeafNodes. If there is some formula we can use to determine the
971
        #       final size without actually having to read in any more
972
        #       children, it would be nice to have. However, we have to be
973
        #       careful with stuff like nodes that pull out the common prefix
974
        #       of each key, as adding a new key can change the common prefix
975
        #       and cause size changes greater than the length of one key.
976
        #       So for now, we just add everything to a new Leaf until it
977
        #       splits, as we know that will give the right answer
978
        new_leaf = LeafNode()
979
        new_leaf.set_maximum_size(self._maximum_size)
980
        new_leaf._key_width = self._key_width
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
981
        # A batch_size of 16 was chosen because:
982
        #   a) In testing, a 4k page held 14 times. So if we have more than 16
983
        #      leaf nodes we are unlikely to hold them in a single new leaf
984
        #      node. This still allows for 1 round trip
985
        #   b) With 16-way fan out, we can still do a single round trip
986
        #   c) With 255-way fan out, we don't want to read all 255 and destroy
987
        #      the page cache, just to determine that we really don't need it.
988
        for node in self._iter_nodes(store, batch_size=16):
989
            if isinstance(node, InternalNode):
990
                # Without looking at any leaf nodes, we are sure
991
                return self
992
            for key, value in node._items.iteritems():
993
                if new_leaf._map_no_split(key, value):
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
994
                    return self
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
995
        return new_leaf
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
996
997
3735.2.16 by Robert Collins
Untested extensions to support repodetails
998
def _deserialise(bytes, key):
999
    """Helper for repositorydetails - convert bytes to a node."""
3735.2.24 by Robert Collins
test_chk_map tests all passing.
1000
    if bytes.startswith("chkleaf:\n"):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1001
        return LeafNode.deserialise(bytes, key)
1002
    elif bytes.startswith("chknode:\n"):
1003
        return InternalNode.deserialise(bytes, key)
3735.2.16 by Robert Collins
Untested extensions to support repodetails
1004
    else:
1005
        raise AssertionError("Unknown node type.")
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1006
1007
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1008
def _find_children_info(store, interesting_keys, uninteresting_keys, pb):
3735.9.7 by John Arbash Meinel
Cleanup pass.
1009
    """Read the associated records, and determine what is interesting."""
1010
    uninteresting_keys = set(uninteresting_keys)
1011
    chks_to_read = uninteresting_keys.union(interesting_keys)
1012
    next_uninteresting = set()
1013
    next_interesting = set()
1014
    uninteresting_items = set()
1015
    interesting_items = set()
1016
    interesting_records = []
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1017
    # records_read = set()
3735.9.7 by John Arbash Meinel
Cleanup pass.
1018
    for record in store.get_record_stream(chks_to_read, 'unordered', True):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1019
        # records_read.add(record.key())
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1020
        if pb is not None:
1021
            pb.tick()
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1022
        bytes = record.get_bytes_as('fulltext')
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1023
        node = _deserialise(bytes, record.key)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1024
        if record.key in uninteresting_keys:
1025
            if isinstance(node, InternalNode):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1026
                next_uninteresting.update(node.refs())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1027
            else:
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1028
                # We know we are at a LeafNode, so we can pass None for the
1029
                # store
1030
                uninteresting_items.update(node.iteritems(None))
3735.9.7 by John Arbash Meinel
Cleanup pass.
1031
        else:
1032
            interesting_records.append(record)
1033
            if isinstance(node, InternalNode):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1034
                next_interesting.update(node.refs())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1035
            else:
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1036
                interesting_items.update(node.iteritems(None))
1037
    # TODO: Filter out records that have already been read, as node splitting
1038
    #       can cause us to reference the same nodes via shorter and longer
1039
    #       paths
3735.9.7 by John Arbash Meinel
Cleanup pass.
1040
    return (next_uninteresting, uninteresting_items,
1041
            next_interesting, interesting_records, interesting_items)
1042
1043
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1044
def _find_all_uninteresting(store, interesting_root_keys,
1045
                            uninteresting_root_keys, adapter, pb):
1046
    """Determine the full set of uninteresting keys."""
1047
    # What about duplicates between interesting_root_keys and
1048
    # uninteresting_root_keys?
1049
    if not uninteresting_root_keys:
1050
        # Shortcut case. We know there is nothing uninteresting to filter out
1051
        # So we just let the rest of the algorithm do the work
1052
        # We know there is nothing uninteresting, and we didn't have to read
1053
        # any interesting records yet.
1054
        return (set(), set(), set(interesting_root_keys), [], set())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1055
    all_uninteresting_chks = set(uninteresting_root_keys)
1056
    all_uninteresting_items = set()
1057
1058
    # First step, find the direct children of both the interesting and
1059
    # uninteresting set
1060
    (uninteresting_keys, uninteresting_items,
1061
     interesting_keys, interesting_records,
1062
     interesting_items) = _find_children_info(store, interesting_root_keys,
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1063
                                              uninteresting_root_keys,
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1064
                                              pb=pb)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1065
    all_uninteresting_chks.update(uninteresting_keys)
1066
    all_uninteresting_items.update(uninteresting_items)
1067
    del uninteresting_items
1068
    # Note: Exact matches between interesting and uninteresting do not need
1069
    #       to be search further. Non-exact matches need to be searched in case
1070
    #       there is a future exact-match
1071
    uninteresting_keys.difference_update(interesting_keys)
1072
3735.9.6 by John Arbash Meinel
Add a first pass to the interesting search.
1073
    # Second, find the full set of uninteresting bits reachable by the
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1074
    # uninteresting roots
3735.9.7 by John Arbash Meinel
Cleanup pass.
1075
    chks_to_read = uninteresting_keys
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1076
    while chks_to_read:
1077
        next_chks = set()
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1078
        for record in store.get_record_stream(chks_to_read, 'unordered', False):
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1079
            # TODO: Handle 'absent'
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1080
            if pb is not None:
1081
                pb.tick()
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1082
            if record.storage_kind in ('fulltext', 'chunked'):
1083
                bytes = record.get_bytes_as('fulltext')
1084
            else:
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1085
                bytes = adapter.get_bytes(record,
1086
                            record.get_bytes_as(record.storage_kind))
1087
            node = _deserialise(bytes, record.key)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1088
            if isinstance(node, InternalNode):
1089
                # uninteresting_prefix_chks.update(node._items.iteritems())
1090
                chks = node._items.values()
1091
                # TODO: We remove the entries that are already in
1092
                #       uninteresting_chks ?
1093
                next_chks.update(chks)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1094
                all_uninteresting_chks.update(chks)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1095
            else:
3735.9.7 by John Arbash Meinel
Cleanup pass.
1096
                all_uninteresting_items.update(node._items.iteritems())
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1097
        chks_to_read = next_chks
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1098
    return (all_uninteresting_chks, all_uninteresting_items,
1099
            interesting_keys, interesting_records, interesting_items)
1100
1101
1102
def iter_interesting_nodes(store, interesting_root_keys,
1103
                           uninteresting_root_keys, pb=None):
1104
    """Given root keys, find interesting nodes.
1105
1106
    Evaluate nodes referenced by interesting_root_keys. Ones that are also
1107
    referenced from uninteresting_root_keys are not considered interesting.
1108
1109
    :param interesting_root_keys: keys which should be part of the
1110
        "interesting" nodes (which will be yielded)
1111
    :param uninteresting_root_keys: keys which should be filtered out of the
1112
        result set.
1113
    :return: Yield
1114
        (interesting records, interesting chk's, interesting key:values)
1115
    """
1116
    # TODO: consider that it may be more memory efficient to use the 20-byte
1117
    #       sha1 string, rather than tuples of hexidecimal sha1 strings.
3735.2.68 by John Arbash Meinel
Add a TODO about avoiding all of the get_record_stream calls.
1118
    # TODO: Try to factor out a lot of the get_record_stream() calls into a
1119
    #       helper function similar to _read_bytes. This function should be
1120
    #       able to use nodes from the _page_cache as well as actually
1121
    #       requesting bytes from the store.
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1122
1123
    # A way to adapt from the compressed texts back into fulltexts
1124
    # In a way, this seems like a layering inversion to have CHKMap know the
1125
    # details of versionedfile
1126
    adapter_class = versionedfile.adapter_registry.get(
1127
        ('knit-ft-gz', 'fulltext'))
1128
    adapter = adapter_class(store)
1129
1130
    (all_uninteresting_chks, all_uninteresting_items, interesting_keys,
1131
     interesting_records, interesting_items) = _find_all_uninteresting(store,
1132
        interesting_root_keys, uninteresting_root_keys, adapter, pb)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1133
1134
    # Now that we know everything uninteresting, we can yield information from
1135
    # our first request
1136
    interesting_items.difference_update(all_uninteresting_items)
1137
    records = dict((record.key, record) for record in interesting_records
1138
                    if record.key not in all_uninteresting_chks)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1139
    if records or interesting_items:
1140
        yield records, interesting_items
3735.9.7 by John Arbash Meinel
Cleanup pass.
1141
    interesting_keys.difference_update(all_uninteresting_chks)
1142
1143
    chks_to_read = interesting_keys
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1144
    while chks_to_read:
1145
        next_chks = set()
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1146
        for record in store.get_record_stream(chks_to_read, 'unordered', False):
1147
            if pb is not None:
1148
                pb.tick()
3735.9.7 by John Arbash Meinel
Cleanup pass.
1149
            # TODO: Handle 'absent'?
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1150
            if record.storage_kind in ('fulltext', 'chunked'):
1151
                bytes = record.get_bytes_as('fulltext')
1152
            else:
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1153
                bytes = adapter.get_bytes(record,
1154
                            record.get_bytes_as(record.storage_kind))
1155
            node = _deserialise(bytes, record.key)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1156
            if isinstance(node, InternalNode):
3735.9.15 by John Arbash Meinel
Found a bug in iter_interesting_nodes and its test suite.
1157
                chks = set(node.refs())
1158
                chks.difference_update(all_uninteresting_chks)
1159
                # Is set() and .difference_update better than:
1160
                # chks = [chk for chk in node.refs()
1161
                #              if chk not in all_uninteresting_chks]
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1162
                next_chks.update(chks)
1163
                # These are now uninteresting everywhere else
3735.9.7 by John Arbash Meinel
Cleanup pass.
1164
                all_uninteresting_chks.update(chks)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1165
                interesting_items = []
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1166
            else:
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1167
                interesting_items = [item for item in node._items.iteritems()
1168
                                     if item not in all_uninteresting_items]
3735.9.15 by John Arbash Meinel
Found a bug in iter_interesting_nodes and its test suite.
1169
                # TODO: Do we need to filter out items that we have already
1170
                #       seen on other pages? We don't really want to buffer the
1171
                #       whole thing, but it does mean that callers need to
1172
                #       understand they may get duplicate values.
3735.9.7 by John Arbash Meinel
Cleanup pass.
1173
                # all_uninteresting_items.update(interesting_items)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1174
            yield {record.key: record}, interesting_items
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1175
        chks_to_read = next_chks