/brz/remove-bazaar

To get this branch, use:
bzr branch http://gegoxaren.bato24.eu/bzr/brz/remove-bazaar
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
1
# Copyright (C) 2008 Canonical Ltd
2
#
3
# This program is free software; you can redistribute it and/or modify
4
# it under the terms of the GNU General Public License as published by
5
# the Free Software Foundation; either version 2 of the License, or
6
# (at your option) any later version.
7
#
8
# This program is distributed in the hope that it will be useful,
9
# but WITHOUT ANY WARRANTY; without even the implied warranty of
10
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
11
# GNU General Public License for more details.
12
#
13
# You should have received a copy of the GNU General Public License
14
# along with this program; if not, write to the Free Software
15
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
16
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
17
"""Persistent maps from tuple_of_strings->string using CHK stores.
18
19
Overview and current status:
20
21
The CHKMap class implements a dict from tuple_of_strings->string by using a trie
22
with internal nodes of 8-bit fan out; The key tuples are mapped to strings by
23
joining them by \x00, and \x00 padding shorter keys out to the length of the
24
longest key. Leaf nodes are packed as densely as possible, and internal nodes
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
25
are all and additional 8-bits wide leading to a sparse upper tree.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
26
27
Updates to a CHKMap are done preferentially via the apply_delta method, to
28
allow optimisation of the update operation; but individual map/unmap calls are
29
possible and supported. All changes via map/unmap are buffered in memory until
30
the _save method is called to force serialisation of the tree. apply_delta
31
performs a _save implicitly.
32
33
TODO:
34
-----
35
36
Densely packed upper nodes.
37
38
"""
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
39
3735.2.31 by Robert Collins
CHKMap.iter_changes
40
import heapq
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
41
3735.9.18 by John Arbash Meinel
Make the versionedfile import lazy.
42
from bzrlib import lazy_import
43
lazy_import.lazy_import(globals(), """
3735.16.3 by John Arbash Meinel
Add functions for _search_key_16 and _search_key_255 and some basic tests for them.
44
import zlib
45
import struct
46
3735.9.18 by John Arbash Meinel
Make the versionedfile import lazy.
47
from bzrlib import versionedfile
48
""")
3735.16.7 by John Arbash Meinel
Start parameterizing CHKInventory and CHKSerializer so that we can
49
from bzrlib import (
50
    lru_cache,
51
    registry,
52
    )
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
53
54
# approx 2MB
3735.14.5 by John Arbash Meinel
Change _check_remap to only page in a batch of children at a time.
55
# If each line is 50 bytes, and you have 255 internal pages, with 255-way fan
56
# out, it takes 3.1MB to cache the layer.
57
_PAGE_CACHE_SIZE = 4*1024*1024
3735.14.2 by John Arbash Meinel
Finish using the page cache as part of _check_remap, add debugging functions
58
# We are caching bytes so len(value) is perfectly accurate
59
_page_cache = lru_cache.LRUSizeCache(_PAGE_CACHE_SIZE)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
60
61
3735.16.6 by John Arbash Meinel
Include a _search_key_plain function.
62
def _search_key_plain(key):
63
    """Map the key tuple into a search string that just uses the key bytes."""
64
    return '\x00'.join(key)
65
66
3735.16.3 by John Arbash Meinel
Add functions for _search_key_16 and _search_key_255 and some basic tests for them.
67
def _search_key_16(key):
68
    """Map the key tuple into a search key string which has 16-way fan out."""
69
    return '\x00'.join(['%08X' % abs(zlib.crc32(bit)) for bit in key])
70
71
72
def _search_key_255(key):
73
    """Map the key tuple into a search key string which has 255-way fan out.
74
75
    We use 255-way because '\n' is used as a delimiter, and causes problems
76
    while parsing.
77
    """
78
    bytes = '\x00'.join([struct.pack('>i', zlib.crc32(bit)) for bit in key])
79
    return bytes.replace('\n', '_')
80
81
3735.16.7 by John Arbash Meinel
Start parameterizing CHKInventory and CHKSerializer so that we can
82
search_key_registry = registry.Registry()
83
search_key_registry.register('plain', _search_key_plain)
84
search_key_registry.register('hash-16-way', _search_key_16)
85
search_key_registry.register('hash-255-way', _search_key_255)
86
87
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
88
class CHKMap(object):
89
    """A persistent map from string to string backed by a CHK store."""
90
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
91
    def __init__(self, store, root_key, search_key_func=None):
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
92
        """Create a CHKMap object.
93
94
        :param store: The store the CHKMap is stored in.
95
        :param root_key: The root key of the map. None to create an empty
96
            CHKMap.
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
97
        :param search_key_func: A function mapping a key => bytes. These bytes
98
            are then used by the internal nodes to split up leaf nodes into
99
            multiple pages.
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
100
        """
101
        self._store = store
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
102
        if search_key_func is None:
3735.16.6 by John Arbash Meinel
Include a _search_key_plain function.
103
            search_key_func = _search_key_plain
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
104
        self._search_key_func = search_key_func
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
105
        if root_key is None:
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
106
            self._root_node = LeafNode(search_key_func=search_key_func)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
107
        else:
3735.2.41 by Robert Collins
Make the parent_id_basename index be updated during CHKInventory.apply_delta.
108
            self._root_node = self._node_key(root_key)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
109
110
    def apply_delta(self, delta):
111
        """Apply a delta to the map.
112
113
        :param delta: An iterable of old_key, new_key, new_value tuples.
114
            If new_key is not None, then new_key->new_value is inserted
115
            into the map; if old_key is not None, then the old mapping
116
            of old_key is removed.
117
        """
118
        for old, new, value in delta:
119
            if old is not None and old != new:
120
                # unmap
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
121
                self.unmap(old)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
122
        for old, new, value in delta:
123
            if new is not None:
124
                # map
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
125
                self.map(new, value)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
126
        return self._save()
127
128
    def _ensure_root(self):
129
        """Ensure that the root node is an object not a key."""
130
        if type(self._root_node) == tuple:
131
            # Demand-load the root
3735.2.31 by Robert Collins
CHKMap.iter_changes
132
            self._root_node = self._get_node(self._root_node)
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
133
            # XXX: Shouldn't this be put into _deserialize?
134
            self._root_node._search_key_func = self._search_key_func
3735.2.31 by Robert Collins
CHKMap.iter_changes
135
136
    def _get_node(self, node):
137
        """Get a node.
138
139
        Node that this does not update the _items dict in objects containing a
140
        reference to this node. As such it does not prevent subsequent IO being
141
        performed.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
142
3735.2.31 by Robert Collins
CHKMap.iter_changes
143
        :param node: A tuple key or node object.
144
        :return: A node object.
145
        """
146
        if type(node) == tuple:
147
            bytes = self._read_bytes(node)
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
148
            return _deserialise(bytes, node,
149
                search_key_func=self._search_key_func)
3735.2.31 by Robert Collins
CHKMap.iter_changes
150
        else:
151
            return node
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
152
153
    def _read_bytes(self, key):
154
        stream = self._store.get_record_stream([key], 'unordered', True)
155
        return stream.next().get_bytes_as('fulltext')
156
3735.15.16 by John Arbash Meinel
Properly fix up the dump_tree tests, we now suppress the keys by default.
157
    def _dump_tree(self, include_keys=False):
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
158
        """Return the tree in a string representation."""
159
        self._ensure_root()
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
160
        res = self._dump_tree_node(self._root_node, prefix='', indent='',
161
                                   include_keys=include_keys)
3735.11.9 by John Arbash Meinel
Switch _dump_tree to returning trailing '\n' for nicer results
162
        res.append('') # Give a trailing '\n'
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
163
        return '\n'.join(res)
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
164
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
165
    def _dump_tree_node(self, node, prefix, indent, include_keys=True):
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
166
        """For this node and all children, generate a string representation."""
167
        result = []
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
168
        if not include_keys:
169
            key_str = ''
170
        else:
171
            node_key = node.key()
172
            if node_key is not None:
173
                key_str = ' %s' % (node_key[0],)
174
            else:
175
                key_str = ' None'
176
        result.append('%s%r %s%s' % (indent, prefix, node.__class__.__name__,
177
                                     key_str))
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
178
        if isinstance(node, InternalNode):
179
            # Trigger all child nodes to get loaded
180
            list(node._iter_nodes(self._store))
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
181
            for prefix, sub in sorted(node._items.iteritems()):
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
182
                result.extend(self._dump_tree_node(sub, prefix, indent + '  ',
183
                                                   include_keys=include_keys))
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
184
        else:
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
185
            for key, value in sorted(node._items.iteritems()):
186
                result.append('      %r %r' % (key, value))
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
187
        return result
188
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
189
    @classmethod
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
190
    def from_dict(klass, store, initial_value, maximum_size=0, key_width=1):
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
191
        """Create a CHKMap in store with initial_value as the content.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
192
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
193
        :param store: The store to record initial_value in, a VersionedFiles
194
            object with 1-tuple keys supporting CHK key generation.
195
        :param initial_value: A dict to store in store. Its keys and values
196
            must be bytestrings.
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
197
        :param maximum_size: The maximum_size rule to apply to nodes. This
198
            determines the size at which no new data is added to a single node.
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
199
        :param key_width: The number of elements in each key_tuple being stored
200
            in this map.
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
201
        :return: The root chk of te resulting CHKMap.
202
        """
203
        result = CHKMap(store, None)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
204
        result._root_node.set_maximum_size(maximum_size)
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
205
        result._root_node._key_width = key_width
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
206
        delta = []
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
207
        for key, value in initial_value.items():
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
208
            delta.append((None, key, value))
209
        result.apply_delta(delta)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
210
        return result._save()
211
3735.2.30 by Robert Collins
Start iter_changes between CHKMap instances.
212
    def iter_changes(self, basis):
213
        """Iterate over the changes between basis and self.
214
215
        :return: An iterator of tuples: (key, old_value, new_value). Old_value
216
            is None for keys only in self; new_value is None for keys only in
217
            basis.
218
        """
3735.2.31 by Robert Collins
CHKMap.iter_changes
219
        # Overview:
220
        # Read both trees in lexographic, highest-first order.
221
        # Any identical nodes we skip
222
        # Any unique prefixes we output immediately.
223
        # values in a leaf node are treated as single-value nodes in the tree
224
        # which allows them to be not-special-cased. We know to output them
225
        # because their value is a string, not a key(tuple) or node.
226
        #
227
        # corner cases to beware of when considering this function:
228
        # *) common references are at different heights.
229
        #    consider two trees:
230
        #    {'a': LeafNode={'aaa':'foo', 'aab':'bar'}, 'b': LeafNode={'b'}}
231
        #    {'a': InternalNode={'aa':LeafNode={'aaa':'foo', 'aab':'bar'}, 'ab':LeafNode={'ab':'bar'}}
232
        #     'b': LeafNode={'b'}}
233
        #    the node with aaa/aab will only be encountered in the second tree
234
        #    after reading the 'a' subtree, but it is encountered in the first
235
        #    tree immediately. Variations on this may have read internal nodes like this.
236
        #    we want to cut the entire pending subtree when we realise we have a common node.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
237
        #    For this we use a list of keys - the path to a node - and check the entire path is
3735.2.31 by Robert Collins
CHKMap.iter_changes
238
        #    clean as we process each item.
239
        if self._node_key(self._root_node) == self._node_key(basis._root_node):
240
            return
241
        self._ensure_root()
242
        basis._ensure_root()
243
        excluded_keys = set()
244
        self_node = self._root_node
245
        basis_node = basis._root_node
246
        # A heap, each element is prefix, node(tuple/NodeObject/string),
247
        # key_path (a list of tuples, tail-sharing down the tree.)
248
        self_pending = []
249
        basis_pending = []
250
        def process_node(prefix, node, path, a_map, pending):
251
            # take a node and expand it
252
            node = a_map._get_node(node)
253
            if type(node) == LeafNode:
254
                path = (node._key, path)
255
                for key, value in node._items.items():
256
                    heapq.heappush(pending, ('\x00'.join(key), value, path))
257
            else:
258
                # type(node) == InternalNode
259
                path = (node._key, path)
260
                for prefix, child in node._items.items():
261
                    heapq.heappush(pending, (prefix, child, path))
262
        process_node(None, self_node, None, self, self_pending)
263
        process_node(None, basis_node, None, basis, basis_pending)
264
        self_seen = set()
265
        basis_seen = set()
266
        excluded_keys = set()
267
        def check_excluded(key_path):
268
            # Note that this is N^2, it depends on us trimming trees
269
            # aggressively to not become slow.
270
            # A better implementation would probably have a reverse map
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
271
            # back to the children of a node, and jump straight to it when
3735.2.31 by Robert Collins
CHKMap.iter_changes
272
            # a common node is detected, the proceed to remove the already
273
            # pending children. bzrlib.graph has a searcher module with a
274
            # similar problem.
275
            while key_path is not None:
276
                key, key_path = key_path
277
                if key in excluded_keys:
278
                    return True
279
            return False
280
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
281
        loop_counter = 0
3735.2.31 by Robert Collins
CHKMap.iter_changes
282
        while self_pending or basis_pending:
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
283
            loop_counter += 1
3735.2.31 by Robert Collins
CHKMap.iter_changes
284
            if not self_pending:
285
                # self is exhausted: output remainder of basis
286
                for prefix, node, path in basis_pending:
287
                    if check_excluded(path):
288
                        continue
289
                    node = basis._get_node(node)
290
                    if type(node) == str:
291
                        # a value
292
                        yield (tuple(prefix.split('\x00')), node, None)
293
                    else:
294
                        # subtree - fastpath the entire thing.
295
                        for key, value in node.iteritems(basis._store):
296
                            yield (key, value, None)
297
                return
298
            elif not basis_pending:
299
                # basis is exhausted: output remainder of self.
300
                for prefix, node, path in self_pending:
301
                    if check_excluded(path):
302
                        continue
303
                    node = self._get_node(node)
304
                    if type(node) == str:
305
                        # a value
306
                        yield (tuple(prefix.split('\x00')), None, node)
307
                    else:
308
                        # subtree - fastpath the entire thing.
309
                        for key, value in node.iteritems(self._store):
310
                            yield (key, None, value)
311
                return
312
            else:
313
                # XXX: future optimisation - yield the smaller items
314
                # immediately rather than pushing everything on/off the
315
                # heaps. Applies to both internal nodes and leafnodes.
316
                if self_pending[0][0] < basis_pending[0][0]:
317
                    # expand self
318
                    prefix, node, path = heapq.heappop(self_pending)
319
                    if check_excluded(path):
320
                        continue
321
                    if type(node) == str:
322
                        # a value
323
                        yield (tuple(prefix.split('\x00')), None, node)
324
                    else:
325
                        process_node(prefix, node, path, self, self_pending)
326
                        continue
327
                elif self_pending[0][0] > basis_pending[0][0]:
328
                    # expand basis
329
                    prefix, node, path = heapq.heappop(basis_pending)
330
                    if check_excluded(path):
331
                        continue
332
                    if type(node) == str:
333
                        # a value
334
                        yield (tuple(prefix.split('\x00')), node, None)
335
                    else:
336
                        process_node(prefix, node, path, basis, basis_pending)
337
                        continue
338
                else:
339
                    # common prefix: possibly expand both
340
                    if type(self_pending[0][1]) != str:
341
                        # process next self
342
                        read_self = True
343
                    else:
344
                        read_self = False
345
                    if type(basis_pending[0][1]) != str:
346
                        # process next basis
347
                        read_basis = True
348
                    else:
349
                        read_basis = False
350
                    if not read_self and not read_basis:
351
                        # compare a common value
352
                        self_details = heapq.heappop(self_pending)
353
                        basis_details = heapq.heappop(basis_pending)
354
                        if self_details[1] != basis_details[1]:
355
                            yield (tuple(self_details[0].split('\x00')),
356
                                basis_details[1], self_details[1])
357
                        continue
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
358
                    # At least one side wasn't a string.
359
                    if (self._node_key(self_pending[0][1]) ==
360
                        self._node_key(basis_pending[0][1])):
361
                        # Identical pointers, skip (and don't bother adding to
362
                        # excluded, it won't turn up again.
363
                        heapq.heappop(self_pending)
364
                        heapq.heappop(basis_pending)
365
                        continue
366
                    # Now we need to expand this node before we can continue
3735.2.31 by Robert Collins
CHKMap.iter_changes
367
                    if read_self:
368
                        prefix, node, path = heapq.heappop(self_pending)
369
                        if check_excluded(path):
370
                            continue
371
                        process_node(prefix, node, path, self, self_pending)
372
                    if read_basis:
373
                        prefix, node, path = heapq.heappop(basis_pending)
374
                        if check_excluded(path):
375
                            continue
376
                        process_node(prefix, node, path, basis, basis_pending)
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
377
        # print loop_counter
3735.2.30 by Robert Collins
Start iter_changes between CHKMap instances.
378
3735.2.9 by Robert Collins
Get a working chk_map using inventory implementation bootstrapped.
379
    def iteritems(self, key_filter=None):
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
380
        """Iterate over the entire CHKMap's contents."""
381
        self._ensure_root()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
382
        return self._root_node.iteritems(self._store, key_filter=key_filter)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
383
3735.2.12 by Robert Collins
Implement commit-via-deltas for split inventory repositories.
384
    def key(self):
385
        """Return the key for this map."""
386
        if isinstance(self._root_node, tuple):
387
            return self._root_node
388
        else:
389
            return self._root_node._key
390
3735.2.17 by Robert Collins
Cache node length to avoid full iteration on __len__ calls.
391
    def __len__(self):
392
        self._ensure_root()
393
        return len(self._root_node)
394
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
395
    def map(self, key, value):
396
        """Map a key tuple to value."""
397
        # Need a root object.
398
        self._ensure_root()
399
        prefix, node_details = self._root_node.map(self._store, key, value)
400
        if len(node_details) == 1:
401
            self._root_node = node_details[0][1]
402
        else:
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
403
            self._root_node = InternalNode(prefix,
404
                                search_key_func=self._search_key_func)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
405
            self._root_node.set_maximum_size(node_details[0][1].maximum_size)
406
            self._root_node._key_width = node_details[0][1]._key_width
407
            for split, node in node_details:
408
                self._root_node.add_node(split, node)
409
3735.2.31 by Robert Collins
CHKMap.iter_changes
410
    def _node_key(self, node):
411
        """Get the key for a node whether its a tuple o r node."""
412
        if type(node) == tuple:
413
            return node
414
        else:
415
            return node._key
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
416
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
417
    def unmap(self, key):
418
        """remove key from the map."""
419
        self._ensure_root()
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
420
        unmapped = self._root_node.unmap(self._store, key)
421
        self._root_node = unmapped
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
422
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
423
    def _save(self):
424
        """Save the map completely.
425
426
        :return: The key of the root node.
427
        """
428
        if type(self._root_node) == tuple:
429
            # Already saved.
430
            return self._root_node
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
431
        keys = list(self._root_node.serialise(self._store))
432
        return keys[-1]
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
433
434
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
435
class Node(object):
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
436
    """Base class defining the protocol for CHK Map nodes.
437
438
    :ivar _raw_size: The total size of the serialized key:value data, before
439
        adding the header bytes, and without prefix compression.
440
    """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
441
442
    def __init__(self, key_width=1):
443
        """Create a node.
444
445
        :param key_width: The width of keys for this node.
446
        """
447
        self._key = None
448
        # Current number of elements
449
        self._len = 0
450
        self._maximum_size = 0
451
        self._key_width = 1
452
        # current size in bytes
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
453
        self._raw_size = 0
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
454
        # The pointers/values this node has - meaning defined by child classes.
455
        self._items = {}
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
456
        # The common search prefix
457
        self._search_prefix = None
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
458
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
459
    def __repr__(self):
460
        items_str = sorted(self._items)
461
        if len(items_str) > 20:
462
            items_str = items_str[16] + '...]'
3735.15.3 by John Arbash Meinel
repr update
463
        return '%s(key:%s len:%s size:%s max:%s prefix:%s items:%s)' % (
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
464
            self.__class__.__name__, self._key, self._len, self._raw_size,
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
465
            self._maximum_size, self._search_prefix, items_str)
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
466
3735.2.38 by Robert Collins
Sufficient fixes to allow bzr-search to index a dev3 format repository.
467
    def key(self):
468
        return self._key
469
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
470
    def __len__(self):
471
        return self._len
472
473
    @property
474
    def maximum_size(self):
475
        """What is the upper limit for adding references to a node."""
476
        return self._maximum_size
477
478
    def set_maximum_size(self, new_size):
479
        """Set the size threshold for nodes.
480
481
        :param new_size: The size at which no data is added to a node. 0 for
482
            unlimited.
483
        """
484
        self._maximum_size = new_size
485
3735.15.5 by John Arbash Meinel
Change the nomenclature.
486
    @classmethod
487
    def common_prefix(cls, prefix, key):
488
        """Given 2 strings, return the longest prefix common to both.
489
490
        :param prefix: This has been the common prefix for other keys, so it is
491
            more likely to be the common prefix in this case as well.
492
        :param key: Another string to compare to
493
        """
494
        if key.startswith(prefix):
495
            return prefix
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
496
        # Is there a better way to do this?
3735.15.5 by John Arbash Meinel
Change the nomenclature.
497
        for pos, (left, right) in enumerate(zip(prefix, key)):
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
498
            if left != right:
499
                pos -= 1
500
                break
3735.15.5 by John Arbash Meinel
Change the nomenclature.
501
        assert pos <= len(prefix)
502
        assert pos <= len(key)
503
        common = prefix[:pos+1]
504
        assert key.startswith(common)
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
505
        return common
506
3735.15.5 by John Arbash Meinel
Change the nomenclature.
507
    @classmethod
508
    def common_prefix_for_keys(cls, keys):
509
        """Given a list of keys, find their common prefix.
510
511
        :param keys: An iterable of strings.
512
        :return: The longest common prefix of all keys.
513
        """
514
        common_prefix = None
515
        for key in keys:
516
            if common_prefix is None:
517
                common_prefix = key
518
                continue
519
            common_prefix = cls.common_prefix(common_prefix, key)
520
            if not common_prefix:
521
                # if common_prefix is the empty string, then we know it won't
522
                # change further
523
                return ''
524
        return common_prefix
525
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
526
527
class LeafNode(Node):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
528
    """A node containing actual key:value pairs.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
529
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
530
    :ivar _items: A dict of key->value items. The key is in tuple form.
3735.15.4 by John Arbash Meinel
Clean up some little bits.
531
    :ivar _size: The number of bytes that would be used by serializing all of
532
        the key/value pairs.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
533
    """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
534
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
535
    def __init__(self, search_key_func=None):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
536
        Node.__init__(self)
3735.15.4 by John Arbash Meinel
Clean up some little bits.
537
        # All of the keys in this leaf node share this common prefix
3735.15.5 by John Arbash Meinel
Change the nomenclature.
538
        self._common_serialised_prefix = None
539
        self._serialise_key = '\x00'.join
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
540
        if search_key_func is None:
3735.16.6 by John Arbash Meinel
Include a _search_key_plain function.
541
            self._search_key_func = _search_key_plain
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
542
        else:
543
            self._search_key_func = search_key_func
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
544
545
    def _current_size(self):
3735.15.4 by John Arbash Meinel
Clean up some little bits.
546
        """Answer the current serialised size of this node.
547
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
548
        This differs from self._raw_size in that it includes the bytes used for
549
        the header.
3735.15.4 by John Arbash Meinel
Clean up some little bits.
550
        """
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
551
        if self._common_serialised_prefix is None:
552
            bytes_for_items = 0
553
        else:
554
            # We will store a single string with the common prefix
555
            # And then that common prefix will not be stored in any of the
556
            # entry lines
557
            prefix_len = len(self._common_serialised_prefix)
558
            bytes_for_items = (self._raw_size - (prefix_len * (self._len - 1)))
559
        return (13 # bytes overhead for the header and separators
3735.15.8 by John Arbash Meinel
Add asserts so that when serializing and deserializing
560
            + len(str(self._maximum_size))
561
            + len(str(self._key_width))
3735.15.4 by John Arbash Meinel
Clean up some little bits.
562
            + len(str(self._len))
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
563
            + bytes_for_items)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
564
565
    @classmethod
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
566
    def deserialise(klass, bytes, key, search_key_func=None):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
567
        """Deserialise bytes, with key key, into a LeafNode.
568
569
        :param bytes: The bytes of the node.
570
        :param key: The key that the serialised node has.
571
        """
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
572
        result = LeafNode(search_key_func=search_key_func)
3735.2.72 by John Arbash Meinel
Change deserialise to properly handle when there is a '\r' in the key.
573
        # Splitlines can split on '\r' so don't use it, split('\n') adds an
574
        # extra '' if the bytes ends in a final newline.
575
        lines = bytes.split('\n')
576
        assert lines[-1] == ''
577
        lines.pop(-1)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
578
        items = {}
579
        if lines[0] != 'chkleaf:':
580
            raise ValueError("not a serialised leaf node: %r" % bytes)
581
        maximum_size = int(lines[1])
582
        width = int(lines[2])
583
        length = int(lines[3])
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
584
        prefix = lines[4]
585
        for line in lines[5:]:
586
            line = prefix + line
3735.2.25 by Robert Collins
CHKInventory core tests passing.
587
            elements = line.split('\x00', width)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
588
            items[tuple(elements[:-1])] = elements[-1]
589
        if len(items) != length:
3735.14.3 by John Arbash Meinel
Properly remove keys that are found in the page cache. And add some debugging.
590
            raise AssertionError("item count (%d) mismatch for key %s,"
591
                " bytes %r" % (length, key, bytes))
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
592
        result._items = items
593
        result._len = length
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
594
        assert length == len(lines) - 5
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
595
        result._maximum_size = maximum_size
596
        result._key = key
597
        result._key_width = width
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
598
        result._raw_size = (sum(map(len, lines[5:])) # the length of the suffix
599
            + (length)*(len(prefix)+1)) # prefix + '\n'
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
600
        result._compute_search_prefix()
3735.15.5 by John Arbash Meinel
Change the nomenclature.
601
        result._compute_serialised_prefix()
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
602
        if len(bytes) != result._current_size():
603
            import pdb; pdb.set_trace()
3735.15.8 by John Arbash Meinel
Add asserts so that when serializing and deserializing
604
        assert len(bytes) == result._current_size()
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
605
        return result
606
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
607
    def iteritems(self, store, key_filter=None):
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
608
        """Iterate over items in the node.
609
610
        :param key_filter: A filter to apply to the node. It should be a
611
            list/set/dict or similar repeatedly iterable container.
612
        """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
613
        if key_filter is not None:
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
614
            # Adjust the filter - short elements go to a prefix filter. Would this
615
            # be cleaner explicitly? That would be no harder for InternalNode..
616
            # XXX: perhaps defaultdict? Profiling<rinse and repeat>
617
            filters = {}
618
            for key in key_filter:
619
                length_filter = filters.setdefault(len(key), set())
620
                length_filter.add(key)
621
            filters = filters.items()
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
622
            for item in self._items.iteritems():
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
623
                for length, length_filter in filters:
624
                    if item[0][:length] in length_filter:
625
                        yield item
626
                        break
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
627
        else:
628
            for item in self._items.iteritems():
629
                yield item
630
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
631
    def _key_value_len(self, key, value):
632
        # TODO: Should probably be done without actually joining the key, but
633
        #       then that can be done via the C extension
3735.15.5 by John Arbash Meinel
Change the nomenclature.
634
        return 2 + len(self._serialise_key(key)) + len(value)
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
635
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
636
    def _search_key(self, key):
637
        return self._search_key_func(key)
638
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
639
    def _map_no_split(self, key, value):
640
        """Map a key to a value.
641
642
        This assumes either the key does not already exist, or you have already
643
        removed its size and length from self.
644
645
        :return: True if adding this node should cause us to split.
646
        """
647
        self._items[key] = value
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
648
        self._raw_size += self._key_value_len(key, value)
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
649
        self._len += 1
3735.15.5 by John Arbash Meinel
Change the nomenclature.
650
        serialised_key = self._serialise_key(key)
651
        if self._common_serialised_prefix is None:
652
            self._common_serialised_prefix = serialised_key
653
        else:
654
            self._common_serialised_prefix = self.common_prefix(
655
                self._common_serialised_prefix, serialised_key)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
656
        search_key = self._search_key(key)
657
        if self._search_prefix is None:
658
            self._search_prefix = search_key
3735.15.5 by John Arbash Meinel
Change the nomenclature.
659
        else:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
660
            self._search_prefix = self.common_prefix(
661
                self._search_prefix, search_key)
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
662
        if (self._len > 1
663
            and self._maximum_size
3735.16.10 by John Arbash Meinel
Don't track state for an infrequent edge case.
664
            and self._current_size() > self._maximum_size):
665
            # Check to see if all of the search_keys for this node are
666
            # identical. We allow the node to grow under that circumstance
667
            # (we could track this as common state, but it is infrequent)
668
            if (search_key != self._search_prefix
669
                or not self._are_search_keys_identical()):
670
                return True
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
671
        return False
672
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
673
    def _split(self, store):
674
        """We have overflowed.
675
676
        Split this node into multiple LeafNodes, return it up the stack so that
677
        the next layer creates a new InternalNode and references the new nodes.
678
3735.15.5 by John Arbash Meinel
Change the nomenclature.
679
        :return: (common_serialised_prefix, [(node_serialised_prefix, node)])
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
680
        """
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
681
        common_prefix = self._search_prefix
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
682
        split_at = len(common_prefix) + 1
683
        result = {}
684
        for key, value in self._items.iteritems():
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
685
            search_key = self._search_key(key)
686
            prefix = search_key[:split_at]
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
687
            # TODO: Generally only 1 key can be exactly the right length,
688
            #       which means we can only have 1 key in the node pointed
689
            #       at by the 'prefix\0' key. We might want to consider
690
            #       folding it into the containing InternalNode rather than
691
            #       having a fixed length-1 node.
692
            #       Note this is probably not true for hash keys, as they
693
            #       may get a '\00' node anywhere, but won't have keys of
694
            #       different lengths.
695
            if len(prefix) < split_at:
696
                prefix += '\x00'*(split_at - len(prefix))
697
            if prefix not in result:
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
698
                node = LeafNode(search_key_func=self._search_key_func)
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
699
                node.set_maximum_size(self._maximum_size)
700
                node._key_width = self._key_width
701
                result[prefix] = node
702
            else:
703
                node = result[prefix]
704
            node.map(store, key, value)
705
        return common_prefix, result.items()
706
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
707
    def map(self, store, key, value):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
708
        """Map key to value."""
709
        if key in self._items:
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
710
            self._raw_size -= self._key_value_len(key, self._items[key])
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
711
            self._len -= 1
712
        self._key = None
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
713
        if self._map_no_split(key, value):
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
714
            return self._split(store)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
715
        else:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
716
            return self._search_prefix, [("", self)]
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
717
718
    def serialise(self, store):
719
        """Serialise the tree to store.
720
721
        :param store: A VersionedFiles honouring the CHK extensions.
722
        :return: An iterable of the keys inserted by this operation.
723
        """
724
        lines = ["chkleaf:\n"]
725
        lines.append("%d\n" % self._maximum_size)
726
        lines.append("%d\n" % self._key_width)
727
        lines.append("%d\n" % self._len)
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
728
        if self._common_serialised_prefix is None:
729
            lines.append('\n')
730
        else:
731
            lines.append('%s\n' % (self._common_serialised_prefix,))
732
            prefix_len = len(self._common_serialised_prefix)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
733
        for key, value in sorted(self._items.items()):
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
734
            serialized = "%s\x00%s\n" % (self._serialise_key(key), value)
735
            assert serialized.startswith(self._common_serialised_prefix)
736
            lines.append(serialized[prefix_len:])
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
737
        sha1, _, _ = store.add_lines((None,), (), lines)
738
        self._key = ("sha1:" + sha1,)
3735.15.8 by John Arbash Meinel
Add asserts so that when serializing and deserializing
739
        bytes = ''.join(lines)
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
740
        if len(bytes) != self._current_size():
741
            import pdb; pdb.set_trace()
3735.15.8 by John Arbash Meinel
Add asserts so that when serializing and deserializing
742
        assert len(bytes) == self._current_size()
743
        _page_cache.add(self._key, bytes)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
744
        return [self._key]
745
3735.2.26 by Robert Collins
CHKInventory migrated to new CHKMap code.
746
    def refs(self):
747
        """Return the references to other CHK's held by this node."""
748
        return []
749
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
750
    def _compute_search_prefix(self):
751
        """Determine the common search prefix for all keys in this node.
3735.15.5 by John Arbash Meinel
Change the nomenclature.
752
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
753
        :return: A bytestring of the longest search key prefix that is
3735.15.5 by John Arbash Meinel
Change the nomenclature.
754
            unique within this node.
755
        """
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
756
        search_keys = [self._search_key(key) for key in self._items]
757
        self._search_prefix = self.common_prefix_for_keys(search_keys)
758
        return self._search_prefix
3735.15.5 by John Arbash Meinel
Change the nomenclature.
759
3735.16.10 by John Arbash Meinel
Don't track state for an infrequent edge case.
760
    def _are_search_keys_identical(self):
761
        """Check to see if the search keys for all entries are the same.
762
763
        When using a hash as the search_key it is possible for non-identical
764
        keys to collide. If that happens enough, we may try overflow a
765
        LeafNode, but as all are collisions, we must not split.
766
        """
767
        common_search_key = None
768
        for key in self._items:
769
            search_key = self._search_key(key)
770
            if common_search_key is None:
771
                common_search_key = search_key
772
            elif search_key != common_search_key:
773
                return False
774
        return True
775
3735.15.5 by John Arbash Meinel
Change the nomenclature.
776
    def _compute_serialised_prefix(self):
777
        """Determine the common prefix for serialised keys in this node.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
778
779
        :return: A bytestring of the longest serialised key prefix that is
780
            unique within this node.
781
        """
3735.15.5 by John Arbash Meinel
Change the nomenclature.
782
        serialised_keys = [self._serialise_key(key) for key in self._items]
783
        self._common_serialised_prefix = self.common_prefix_for_keys(
784
            serialised_keys)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
785
786
    def unmap(self, store, key):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
787
        """Unmap key from the node."""
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
788
        self._raw_size -= self._key_value_len(key, self._items[key])
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
789
        self._len -= 1
790
        del self._items[key]
791
        self._key = None
3735.15.2 by John Arbash Meinel
Change LeafNode to also cache its unique serialized prefix.
792
        # Recompute from scratch
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
793
        self._compute_search_prefix()
3735.15.5 by John Arbash Meinel
Change the nomenclature.
794
        self._compute_serialised_prefix()
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
795
        return self
796
797
798
class InternalNode(Node):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
799
    """A node that contains references to other nodes.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
800
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
801
    An InternalNode is responsible for mapping search key prefixes to child
3735.15.5 by John Arbash Meinel
Change the nomenclature.
802
    nodes.
3735.15.4 by John Arbash Meinel
Clean up some little bits.
803
3735.15.5 by John Arbash Meinel
Change the nomenclature.
804
    :ivar _items: serialised_key => node dictionary. node may be a tuple,
3735.15.4 by John Arbash Meinel
Clean up some little bits.
805
        LeafNode or InternalNode.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
806
    """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
807
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
808
    def __init__(self, prefix='', search_key_func=None):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
809
        Node.__init__(self)
810
        # The size of an internalnode with default values and no children.
811
        # How many octets key prefixes within this node are.
812
        self._node_width = 0
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
813
        self._search_prefix = prefix
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
814
        if search_key_func is None:
3735.16.6 by John Arbash Meinel
Include a _search_key_plain function.
815
            self._search_key_func = _search_key_plain
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
816
        else:
817
            self._search_key_func = search_key_func
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
818
819
    def __repr__(self):
820
        items_str = sorted(self._items)
821
        if len(items_str) > 20:
822
            items_str = items_str[16] + '...]'
823
        return '%s(key:%s len:%s size:%s max:%s prefix:%s items:%s)' % (
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
824
            self.__class__.__name__, self._key, self._len, self._raw_size,
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
825
            self._maximum_size, self._search_prefix, items_str)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
826
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
827
    def add_node(self, prefix, node):
828
        """Add a child node with prefix prefix, and node node.
829
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
830
        :param prefix: The search key prefix for node.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
831
        :param node: The node being added.
832
        """
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
833
        assert self._search_prefix is not None
834
        assert prefix.startswith(self._search_prefix)
835
        assert len(prefix) == len(self._search_prefix) + 1
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
836
        self._len += len(node)
837
        if not len(self._items):
838
            self._node_width = len(prefix)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
839
        assert self._node_width == len(self._search_prefix) + 1
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
840
        self._items[prefix] = node
841
        self._key = None
842
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
843
    def _current_size(self):
844
        """Answer the current serialised size of this node."""
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
845
        return (self._raw_size + len(str(self._len)) + len(str(self._key_width)) +
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
846
            len(str(self._maximum_size)))
847
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
848
    @classmethod
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
849
    def deserialise(klass, bytes, key, search_key_func=None):
3735.2.25 by Robert Collins
CHKInventory core tests passing.
850
        """Deserialise bytes to an InternalNode, with key key.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
851
852
        :param bytes: The bytes of the node.
853
        :param key: The key that the serialised node has.
854
        :return: An InternalNode instance.
855
        """
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
856
        result = InternalNode(search_key_func=search_key_func)
3735.2.72 by John Arbash Meinel
Change deserialise to properly handle when there is a '\r' in the key.
857
        # Splitlines can split on '\r' so don't use it, remove the extra ''
858
        # from the result of split('\n') because we should have a trailing
859
        # newline
860
        lines = bytes.split('\n')
861
        assert lines[-1] == ''
862
        lines.pop(-1)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
863
        items = {}
864
        if lines[0] != 'chknode:':
865
            raise ValueError("not a serialised internal node: %r" % bytes)
866
        maximum_size = int(lines[1])
867
        width = int(lines[2])
868
        length = int(lines[3])
3735.15.11 by John Arbash Meinel
Change the InternalNodes to also pull out the common prefix.
869
        common_prefix = lines[4]
870
        for line in lines[5:]:
871
            line = common_prefix + line
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
872
            prefix, flat_key = line.rsplit('\x00', 1)
873
            items[prefix] = (flat_key,)
874
        result._items = items
875
        result._len = length
876
        result._maximum_size = maximum_size
877
        result._key = key
878
        result._key_width = width
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
879
        # XXX: InternalNodes don't really care about their size, and this will
880
        #      change if we add prefix compression
881
        result._raw_size = None # len(bytes)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
882
        result._node_width = len(prefix)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
883
        result._compute_search_prefix()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
884
        return result
885
886
    def iteritems(self, store, key_filter=None):
887
        for node in self._iter_nodes(store, key_filter=key_filter):
888
            for item in node.iteritems(store, key_filter=key_filter):
889
                yield item
890
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
891
    def _iter_nodes(self, store, key_filter=None, batch_size=None):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
892
        """Iterate over node objects which match key_filter.
893
894
        :param store: A store to use for accessing content.
895
        :param key_filter: A key filter to filter nodes. Only nodes that might
896
            contain a key in key_filter will be returned.
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
897
        :param batch_size: If not None, then we will return the nodes that had
898
            to be read using get_record_stream in batches, rather than reading
899
            them all at once.
900
        :return: An iterable of nodes. This function does not have to be fully
901
            consumed.  (There will be no pending I/O when items are being returned.)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
902
        """
3735.2.31 by Robert Collins
CHKMap.iter_changes
903
        keys = {}
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
904
        if key_filter is None:
3735.2.31 by Robert Collins
CHKMap.iter_changes
905
            for prefix, node in self._items.iteritems():
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
906
                if type(node) == tuple:
3735.2.31 by Robert Collins
CHKMap.iter_changes
907
                    keys[node] = prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
908
                else:
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
909
                    yield node
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
910
        else:
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
911
            # XXX defaultdict ?
912
            length_filters = {}
913
            for key in key_filter:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
914
                search_key = self._search_prefix_filter(key)
915
                length_filter = length_filters.setdefault(
916
                                    len(search_key), set())
917
                length_filter.add(search_key)
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
918
            length_filters = length_filters.items()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
919
            for prefix, node in self._items.iteritems():
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
920
                for length, length_filter in length_filters:
921
                    if prefix[:length] in length_filter:
922
                        if type(node) == tuple:
923
                            keys[node] = prefix
924
                        else:
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
925
                            yield node
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
926
                        break
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
927
        if keys:
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
928
            # Look in the page cache for some more bytes
929
            found_keys = set()
930
            for key in keys:
931
                try:
932
                    bytes = _page_cache[key]
933
                except KeyError:
934
                    continue
935
                else:
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
936
                    node = _deserialise(bytes, key,
937
                        search_key_func=self._search_key_func)
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
938
                    self._items[keys[key]] = node
939
                    found_keys.add(key)
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
940
                    yield node
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
941
            for key in found_keys:
942
                del keys[key]
943
        if keys:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
944
            # demand load some pages.
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
945
            if batch_size is None:
946
                # Read all the keys in
947
                batch_size = len(keys)
948
            key_order = list(keys)
949
            for batch_start in range(0, len(key_order), batch_size):
950
                batch = key_order[batch_start:batch_start + batch_size]
951
                # We have to fully consume the stream so there is no pending
952
                # I/O, so we buffer the nodes for now.
953
                stream = store.get_record_stream(batch, 'unordered', True)
954
                nodes = []
955
                for record in stream:
956
                    bytes = record.get_bytes_as('fulltext')
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
957
                    node = _deserialise(bytes, record.key,
958
                        search_key_func=self._search_key_func)
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
959
                    nodes.append(node)
960
                    self._items[keys[record.key]] = node
961
                    _page_cache.add(record.key, bytes)
962
                for node in nodes:
963
                    yield node
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
964
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
965
    def map(self, store, key, value):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
966
        """Map key to value."""
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
967
        if not len(self._items):
968
            raise AssertionError("cant map in an empty InternalNode.")
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
969
        search_key = self._search_key(key)
970
        assert self._node_width == len(self._search_prefix) + 1
971
        if not search_key.startswith(self._search_prefix):
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
972
            # This key doesn't fit in this index, so we need to split at the
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
973
            # point where it would fit, insert self into that internal node,
974
            # and then map this key into that node.
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
975
            new_prefix = self.common_prefix(self._search_prefix,
976
                                            search_key)
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
977
            new_parent = InternalNode(new_prefix,
978
                search_key_func=self._search_key_func)
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
979
            new_parent.set_maximum_size(self._maximum_size)
980
            new_parent._key_width = self._key_width
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
981
            new_parent.add_node(self._search_prefix[:len(new_prefix)+1],
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
982
                                self)
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
983
            return new_parent.map(store, key, value)
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
984
        children = list(self._iter_nodes(store, key_filter=[key]))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
985
        if children:
986
            child = children[0]
987
        else:
988
            # new child needed:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
989
            child = self._new_child(search_key, LeafNode)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
990
        old_len = len(child)
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
991
        if isinstance(child, LeafNode):
992
            old_size = child._current_size()
993
        else:
994
            old_size = None
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
995
        prefix, node_details = child.map(store, key, value)
996
        if len(node_details) == 1:
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
997
            # child may have shrunk, or might be a new node
998
            child = node_details[0][1]
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
999
            self._len = self._len - old_len + len(child)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1000
            self._items[search_key] = child
3735.2.29 by Robert Collins
Untested code is broken code.
1001
            self._key = None
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
1002
            new_node = self
1003
            if (isinstance(child, LeafNode)
1004
                and (old_size is None or child._current_size() < old_size)):
1005
                # The old node was an InternalNode which means it has now
1006
                # collapsed, so we need to check if it will chain to a collapse
1007
                # at this level. Or the LeafNode has shrunk in size, so we need
1008
                # to check that as well.
1009
                new_node = self._check_remap(store)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1010
            assert new_node._search_prefix is not None
1011
            return new_node._search_prefix, [('', new_node)]
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1012
        # child has overflown - create a new intermediate node.
1013
        # XXX: This is where we might want to try and expand our depth
1014
        # to refer to more bytes of every child (which would give us
1015
        # multiple pointers to child nodes, but less intermediate nodes)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1016
        child = self._new_child(search_key, InternalNode)
1017
        child._search_prefix = prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1018
        for split, node in node_details:
1019
            child.add_node(split, node)
1020
        self._len = self._len - old_len + len(child)
3735.2.29 by Robert Collins
Untested code is broken code.
1021
        self._key = None
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1022
        return self._search_prefix, [("", self)]
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1023
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1024
    def _new_child(self, search_key, klass):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1025
        """Create a new child node of type klass."""
1026
        child = klass()
1027
        child.set_maximum_size(self._maximum_size)
1028
        child._key_width = self._key_width
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1029
        child._search_key_func = self._search_key_func
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1030
        self._items[search_key] = child
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1031
        return child
1032
1033
    def serialise(self, store):
1034
        """Serialise the node to store.
1035
1036
        :param store: A VersionedFiles honouring the CHK extensions.
1037
        :return: An iterable of the keys inserted by this operation.
1038
        """
1039
        for node in self._items.itervalues():
1040
            if type(node) == tuple:
1041
                # Never deserialised.
1042
                continue
1043
            if node._key is not None:
1044
                # Never altered
1045
                continue
1046
            for key in node.serialise(store):
1047
                yield key
1048
        lines = ["chknode:\n"]
1049
        lines.append("%d\n" % self._maximum_size)
1050
        lines.append("%d\n" % self._key_width)
1051
        lines.append("%d\n" % self._len)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1052
        assert self._search_prefix is not None
1053
        lines.append('%s\n' % (self._search_prefix,))
1054
        prefix_len = len(self._search_prefix)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1055
        for prefix, node in sorted(self._items.items()):
1056
            if type(node) == tuple:
1057
                key = node[0]
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1058
            else:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1059
                key = node._key[0]
3735.15.11 by John Arbash Meinel
Change the InternalNodes to also pull out the common prefix.
1060
            serialised = "%s\x00%s\n" % (prefix, key)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1061
            assert serialised.startswith(self._search_prefix)
3735.15.11 by John Arbash Meinel
Change the InternalNodes to also pull out the common prefix.
1062
            lines.append(serialised[prefix_len:])
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1063
        sha1, _, _ = store.add_lines((None,), (), lines)
1064
        self._key = ("sha1:" + sha1,)
3735.2.63 by Robert Collins
Divert writes into the CHK page cache as well.
1065
        _page_cache.add(self._key, ''.join(lines))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1066
        yield self._key
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1067
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1068
    def _search_key(self, key):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1069
        """Return the serialised key for key in this node."""
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1070
        # search keys are fixed width. All will be self._node_width wide, so we
3735.15.5 by John Arbash Meinel
Change the nomenclature.
1071
        # pad as necessary.
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
1072
        return (self._search_key_func(key) + '\x00'*self._node_width)[:self._node_width]
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1073
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1074
    def _search_prefix_filter(self, key):
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
1075
        """Serialise key for use as a prefix filter in iteritems."""
1076
        if len(key) == self._key_width:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1077
            return self._search_key(key)
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
1078
        return '\x00'.join(key)[:self._node_width]
1079
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1080
    def _split(self, offset):
1081
        """Split this node into smaller nodes starting at offset.
1082
1083
        :param offset: The offset to start the new child nodes at.
1084
        :return: An iterable of (prefix, node) tuples. prefix is a byte
1085
            prefix for reaching node.
1086
        """
1087
        if offset >= self._node_width:
1088
            for node in self._items.values():
1089
                for result in node._split(offset):
1090
                    yield result
1091
            return
1092
        for key, node in self._items.items():
1093
            pass
1094
3735.2.26 by Robert Collins
CHKInventory migrated to new CHKMap code.
1095
    def refs(self):
1096
        """Return the references to other CHK's held by this node."""
1097
        if self._key is None:
1098
            raise AssertionError("unserialised nodes have no refs.")
1099
        refs = []
1100
        for value in self._items.itervalues():
1101
            if type(value) == tuple:
1102
                refs.append(value)
1103
            else:
1104
                refs.append(value.key())
1105
        return refs
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1106
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1107
    def _compute_search_prefix(self, extra_key=None):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1108
        """Return the unique key prefix for this node.
1109
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1110
        :return: A bytestring of the longest search key prefix that is
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1111
            unique within this node.
1112
        """
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1113
        self._search_prefix = self.common_prefix_for_keys(self._items)
1114
        return self._search_prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1115
1116
    def unmap(self, store, key):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1117
        """Remove key from this node and it's children."""
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1118
        if not len(self._items):
1119
            raise AssertionError("cant unmap in an empty InternalNode.")
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
1120
        children = list(self._iter_nodes(store, key_filter=[key]))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1121
        if children:
1122
            child = children[0]
1123
        else:
1124
            raise KeyError(key)
1125
        self._len -= 1
1126
        unmapped = child.unmap(store, key)
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1127
        self._key = None
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1128
        search_key = self._search_key(key)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1129
        if len(unmapped) == 0:
1130
            # All child nodes are gone, remove the child:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1131
            del self._items[search_key]
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1132
            unmapped = None
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1133
        else:
1134
            # Stash the returned node
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1135
            self._items[search_key] = unmapped
3735.2.23 by Robert Collins
Test unmapping with one child left but multiple keys.
1136
        if len(self._items) == 1:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1137
            # this node is no longer needed:
1138
            return self._items.values()[0]
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1139
        if isinstance(unmapped, InternalNode):
1140
            return self
1141
        return self._check_remap(store)
1142
1143
    def _check_remap(self, store):
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
1144
        """Check if all keys contained by children fit in a single LeafNode.
1145
1146
        :param store: A store to use for reading more nodes
1147
        :return: Either self, or a new LeafNode which should replace self.
1148
        """
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1149
        # Logic for how we determine when we need to rebuild
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1150
        # 1) Implicitly unmap() is removing a key which means that the child
1151
        #    nodes are going to be shrinking by some extent.
1152
        # 2) If all children are LeafNodes, it is possible that they could be
1153
        #    combined into a single LeafNode, which can then completely replace
1154
        #    this internal node with a single LeafNode
1155
        # 3) If *one* child is an InternalNode, we assume it has already done
1156
        #    all the work to determine that its children cannot collapse, and
1157
        #    we can then assume that those nodes *plus* the current nodes don't
1158
        #    have a chance of collapsing either.
1159
        #    So a very cheap check is to just say if 'unmapped' is an
1160
        #    InternalNode, we don't have to check further.
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1161
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1162
        # TODO: Another alternative is to check the total size of all known
1163
        #       LeafNodes. If there is some formula we can use to determine the
1164
        #       final size without actually having to read in any more
1165
        #       children, it would be nice to have. However, we have to be
1166
        #       careful with stuff like nodes that pull out the common prefix
1167
        #       of each key, as adding a new key can change the common prefix
1168
        #       and cause size changes greater than the length of one key.
1169
        #       So for now, we just add everything to a new Leaf until it
1170
        #       splits, as we know that will give the right answer
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1171
        new_leaf = LeafNode(search_key_func=self._search_key_func)
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1172
        new_leaf.set_maximum_size(self._maximum_size)
1173
        new_leaf._key_width = self._key_width
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
1174
        # A batch_size of 16 was chosen because:
1175
        #   a) In testing, a 4k page held 14 times. So if we have more than 16
1176
        #      leaf nodes we are unlikely to hold them in a single new leaf
1177
        #      node. This still allows for 1 round trip
1178
        #   b) With 16-way fan out, we can still do a single round trip
1179
        #   c) With 255-way fan out, we don't want to read all 255 and destroy
1180
        #      the page cache, just to determine that we really don't need it.
1181
        for node in self._iter_nodes(store, batch_size=16):
1182
            if isinstance(node, InternalNode):
1183
                # Without looking at any leaf nodes, we are sure
1184
                return self
1185
            for key, value in node._items.iteritems():
1186
                if new_leaf._map_no_split(key, value):
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1187
                    return self
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1188
        return new_leaf
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1189
1190
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1191
def _deserialise(bytes, key, search_key_func):
3735.2.16 by Robert Collins
Untested extensions to support repodetails
1192
    """Helper for repositorydetails - convert bytes to a node."""
3735.2.24 by Robert Collins
test_chk_map tests all passing.
1193
    if bytes.startswith("chkleaf:\n"):
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1194
        return LeafNode.deserialise(bytes, key, search_key_func=search_key_func)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1195
    elif bytes.startswith("chknode:\n"):
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1196
        return InternalNode.deserialise(bytes, key,
1197
            search_key_func=search_key_func)
3735.2.16 by Robert Collins
Untested extensions to support repodetails
1198
    else:
1199
        raise AssertionError("Unknown node type.")
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1200
1201
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1202
def _find_children_info(store, interesting_keys, uninteresting_keys, pb):
3735.9.7 by John Arbash Meinel
Cleanup pass.
1203
    """Read the associated records, and determine what is interesting."""
1204
    uninteresting_keys = set(uninteresting_keys)
1205
    chks_to_read = uninteresting_keys.union(interesting_keys)
1206
    next_uninteresting = set()
1207
    next_interesting = set()
1208
    uninteresting_items = set()
1209
    interesting_items = set()
1210
    interesting_records = []
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1211
    # records_read = set()
3735.9.7 by John Arbash Meinel
Cleanup pass.
1212
    for record in store.get_record_stream(chks_to_read, 'unordered', True):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1213
        # records_read.add(record.key())
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1214
        if pb is not None:
1215
            pb.tick()
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1216
        bytes = record.get_bytes_as('fulltext')
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1217
        # We don't care about search_key_func for this code, because we only
1218
        # care about external references.
1219
        node = _deserialise(bytes, record.key, search_key_func=None)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1220
        if record.key in uninteresting_keys:
1221
            if isinstance(node, InternalNode):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1222
                next_uninteresting.update(node.refs())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1223
            else:
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1224
                # We know we are at a LeafNode, so we can pass None for the
1225
                # store
1226
                uninteresting_items.update(node.iteritems(None))
3735.9.7 by John Arbash Meinel
Cleanup pass.
1227
        else:
1228
            interesting_records.append(record)
1229
            if isinstance(node, InternalNode):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1230
                next_interesting.update(node.refs())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1231
            else:
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1232
                interesting_items.update(node.iteritems(None))
1233
    # TODO: Filter out records that have already been read, as node splitting
1234
    #       can cause us to reference the same nodes via shorter and longer
1235
    #       paths
3735.9.7 by John Arbash Meinel
Cleanup pass.
1236
    return (next_uninteresting, uninteresting_items,
1237
            next_interesting, interesting_records, interesting_items)
1238
1239
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1240
def _find_all_uninteresting(store, interesting_root_keys,
1241
                            uninteresting_root_keys, adapter, pb):
1242
    """Determine the full set of uninteresting keys."""
1243
    # What about duplicates between interesting_root_keys and
1244
    # uninteresting_root_keys?
1245
    if not uninteresting_root_keys:
1246
        # Shortcut case. We know there is nothing uninteresting to filter out
1247
        # So we just let the rest of the algorithm do the work
1248
        # We know there is nothing uninteresting, and we didn't have to read
1249
        # any interesting records yet.
1250
        return (set(), set(), set(interesting_root_keys), [], set())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1251
    all_uninteresting_chks = set(uninteresting_root_keys)
1252
    all_uninteresting_items = set()
1253
1254
    # First step, find the direct children of both the interesting and
1255
    # uninteresting set
1256
    (uninteresting_keys, uninteresting_items,
1257
     interesting_keys, interesting_records,
1258
     interesting_items) = _find_children_info(store, interesting_root_keys,
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1259
                                              uninteresting_root_keys,
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1260
                                              pb=pb)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1261
    all_uninteresting_chks.update(uninteresting_keys)
1262
    all_uninteresting_items.update(uninteresting_items)
1263
    del uninteresting_items
1264
    # Note: Exact matches between interesting and uninteresting do not need
1265
    #       to be search further. Non-exact matches need to be searched in case
1266
    #       there is a future exact-match
1267
    uninteresting_keys.difference_update(interesting_keys)
1268
3735.9.6 by John Arbash Meinel
Add a first pass to the interesting search.
1269
    # Second, find the full set of uninteresting bits reachable by the
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1270
    # uninteresting roots
3735.9.7 by John Arbash Meinel
Cleanup pass.
1271
    chks_to_read = uninteresting_keys
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1272
    while chks_to_read:
1273
        next_chks = set()
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1274
        for record in store.get_record_stream(chks_to_read, 'unordered', False):
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1275
            # TODO: Handle 'absent'
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1276
            if pb is not None:
1277
                pb.tick()
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1278
            if record.storage_kind in ('fulltext', 'chunked'):
1279
                bytes = record.get_bytes_as('fulltext')
1280
            else:
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1281
                bytes = adapter.get_bytes(record,
1282
                            record.get_bytes_as(record.storage_kind))
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1283
            # We don't care about search_key_func for this code, because we
1284
            # only care about external references.
1285
            node = _deserialise(bytes, record.key, search_key_func=None)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1286
            if isinstance(node, InternalNode):
1287
                # uninteresting_prefix_chks.update(node._items.iteritems())
1288
                chks = node._items.values()
1289
                # TODO: We remove the entries that are already in
1290
                #       uninteresting_chks ?
1291
                next_chks.update(chks)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1292
                all_uninteresting_chks.update(chks)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1293
            else:
3735.9.7 by John Arbash Meinel
Cleanup pass.
1294
                all_uninteresting_items.update(node._items.iteritems())
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1295
        chks_to_read = next_chks
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1296
    return (all_uninteresting_chks, all_uninteresting_items,
1297
            interesting_keys, interesting_records, interesting_items)
1298
1299
1300
def iter_interesting_nodes(store, interesting_root_keys,
1301
                           uninteresting_root_keys, pb=None):
1302
    """Given root keys, find interesting nodes.
1303
1304
    Evaluate nodes referenced by interesting_root_keys. Ones that are also
1305
    referenced from uninteresting_root_keys are not considered interesting.
1306
1307
    :param interesting_root_keys: keys which should be part of the
1308
        "interesting" nodes (which will be yielded)
1309
    :param uninteresting_root_keys: keys which should be filtered out of the
1310
        result set.
1311
    :return: Yield
1312
        (interesting records, interesting chk's, interesting key:values)
1313
    """
1314
    # TODO: consider that it may be more memory efficient to use the 20-byte
1315
    #       sha1 string, rather than tuples of hexidecimal sha1 strings.
3735.2.68 by John Arbash Meinel
Add a TODO about avoiding all of the get_record_stream calls.
1316
    # TODO: Try to factor out a lot of the get_record_stream() calls into a
1317
    #       helper function similar to _read_bytes. This function should be
1318
    #       able to use nodes from the _page_cache as well as actually
1319
    #       requesting bytes from the store.
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1320
1321
    # A way to adapt from the compressed texts back into fulltexts
1322
    # In a way, this seems like a layering inversion to have CHKMap know the
1323
    # details of versionedfile
1324
    adapter_class = versionedfile.adapter_registry.get(
1325
        ('knit-ft-gz', 'fulltext'))
1326
    adapter = adapter_class(store)
1327
1328
    (all_uninteresting_chks, all_uninteresting_items, interesting_keys,
1329
     interesting_records, interesting_items) = _find_all_uninteresting(store,
1330
        interesting_root_keys, uninteresting_root_keys, adapter, pb)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1331
1332
    # Now that we know everything uninteresting, we can yield information from
1333
    # our first request
1334
    interesting_items.difference_update(all_uninteresting_items)
1335
    records = dict((record.key, record) for record in interesting_records
1336
                    if record.key not in all_uninteresting_chks)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1337
    if records or interesting_items:
1338
        yield records, interesting_items
3735.9.7 by John Arbash Meinel
Cleanup pass.
1339
    interesting_keys.difference_update(all_uninteresting_chks)
1340
1341
    chks_to_read = interesting_keys
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1342
    while chks_to_read:
1343
        next_chks = set()
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1344
        for record in store.get_record_stream(chks_to_read, 'unordered', False):
1345
            if pb is not None:
1346
                pb.tick()
3735.9.7 by John Arbash Meinel
Cleanup pass.
1347
            # TODO: Handle 'absent'?
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1348
            if record.storage_kind in ('fulltext', 'chunked'):
1349
                bytes = record.get_bytes_as('fulltext')
1350
            else:
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1351
                bytes = adapter.get_bytes(record,
1352
                            record.get_bytes_as(record.storage_kind))
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1353
            # We don't care about search_key_func for this code, because we
1354
            # only care about external references.
1355
            node = _deserialise(bytes, record.key, search_key_func=None)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1356
            if isinstance(node, InternalNode):
3735.9.15 by John Arbash Meinel
Found a bug in iter_interesting_nodes and its test suite.
1357
                chks = set(node.refs())
1358
                chks.difference_update(all_uninteresting_chks)
1359
                # Is set() and .difference_update better than:
1360
                # chks = [chk for chk in node.refs()
1361
                #              if chk not in all_uninteresting_chks]
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1362
                next_chks.update(chks)
1363
                # These are now uninteresting everywhere else
3735.9.7 by John Arbash Meinel
Cleanup pass.
1364
                all_uninteresting_chks.update(chks)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1365
                interesting_items = []
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1366
            else:
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1367
                interesting_items = [item for item in node._items.iteritems()
1368
                                     if item not in all_uninteresting_items]
3735.9.15 by John Arbash Meinel
Found a bug in iter_interesting_nodes and its test suite.
1369
                # TODO: Do we need to filter out items that we have already
1370
                #       seen on other pages? We don't really want to buffer the
1371
                #       whole thing, but it does mean that callers need to
1372
                #       understand they may get duplicate values.
3735.9.7 by John Arbash Meinel
Cleanup pass.
1373
                # all_uninteresting_items.update(interesting_items)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1374
            yield {record.key: record}, interesting_items
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1375
        chks_to_read = next_chks