/brz/remove-bazaar

To get this branch, use:
bzr branch http://gegoxaren.bato24.eu/bzr/brz/remove-bazaar
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
1
# Copyright (C) 2008 Canonical Ltd
2
#
3
# This program is free software; you can redistribute it and/or modify
4
# it under the terms of the GNU General Public License as published by
5
# the Free Software Foundation; either version 2 of the License, or
6
# (at your option) any later version.
7
#
8
# This program is distributed in the hope that it will be useful,
9
# but WITHOUT ANY WARRANTY; without even the implied warranty of
10
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
11
# GNU General Public License for more details.
12
#
13
# You should have received a copy of the GNU General Public License
14
# along with this program; if not, write to the Free Software
15
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
16
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
17
"""Persistent maps from tuple_of_strings->string using CHK stores.
18
19
Overview and current status:
20
21
The CHKMap class implements a dict from tuple_of_strings->string by using a trie
22
with internal nodes of 8-bit fan out; The key tuples are mapped to strings by
23
joining them by \x00, and \x00 padding shorter keys out to the length of the
24
longest key. Leaf nodes are packed as densely as possible, and internal nodes
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
25
are all and additional 8-bits wide leading to a sparse upper tree.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
26
27
Updates to a CHKMap are done preferentially via the apply_delta method, to
28
allow optimisation of the update operation; but individual map/unmap calls are
29
possible and supported. All changes via map/unmap are buffered in memory until
30
the _save method is called to force serialisation of the tree. apply_delta
31
performs a _save implicitly.
32
33
TODO:
34
-----
35
36
Densely packed upper nodes.
37
38
"""
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
39
3735.2.31 by Robert Collins
CHKMap.iter_changes
40
import heapq
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
41
3735.9.18 by John Arbash Meinel
Make the versionedfile import lazy.
42
from bzrlib import lazy_import
43
lazy_import.lazy_import(globals(), """
3735.16.3 by John Arbash Meinel
Add functions for _search_key_16 and _search_key_255 and some basic tests for them.
44
import zlib
45
import struct
46
3735.9.18 by John Arbash Meinel
Make the versionedfile import lazy.
47
from bzrlib import versionedfile
48
""")
3735.16.7 by John Arbash Meinel
Start parameterizing CHKInventory and CHKSerializer so that we can
49
from bzrlib import (
50
    lru_cache,
51
    registry,
52
    )
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
53
54
# approx 2MB
3735.14.5 by John Arbash Meinel
Change _check_remap to only page in a batch of children at a time.
55
# If each line is 50 bytes, and you have 255 internal pages, with 255-way fan
56
# out, it takes 3.1MB to cache the layer.
57
_PAGE_CACHE_SIZE = 4*1024*1024
3735.14.2 by John Arbash Meinel
Finish using the page cache as part of _check_remap, add debugging functions
58
# We are caching bytes so len(value) is perfectly accurate
59
_page_cache = lru_cache.LRUSizeCache(_PAGE_CACHE_SIZE)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
60
61
3735.16.6 by John Arbash Meinel
Include a _search_key_plain function.
62
def _search_key_plain(key):
63
    """Map the key tuple into a search string that just uses the key bytes."""
64
    return '\x00'.join(key)
65
66
3735.16.3 by John Arbash Meinel
Add functions for _search_key_16 and _search_key_255 and some basic tests for them.
67
def _search_key_16(key):
68
    """Map the key tuple into a search key string which has 16-way fan out."""
69
    return '\x00'.join(['%08X' % abs(zlib.crc32(bit)) for bit in key])
70
71
72
def _search_key_255(key):
73
    """Map the key tuple into a search key string which has 255-way fan out.
74
75
    We use 255-way because '\n' is used as a delimiter, and causes problems
76
    while parsing.
77
    """
3735.2.83 by Vincent Ladeuil
Better fix with explanation for zlib.crc32.
78
    # Depending on python version and platform, zlib.crc32 will return either a
79
    # signed (<= 2.5 >= 3.0) or an unsigned (2.5, 2.6).
80
    # http://docs.python.org/library/zlib.html recommends using a mask to force
81
    # an unsigned value to ensure the same numeric value (unsigned) is obtained
82
    # across all python versions and platforms.
83
    bytes = '\x00'.join([struct.pack('>L', zlib.crc32(bit)&0xFFFFFFFF)
84
                         for bit in key])
3735.16.3 by John Arbash Meinel
Add functions for _search_key_16 and _search_key_255 and some basic tests for them.
85
    return bytes.replace('\n', '_')
86
87
3735.16.7 by John Arbash Meinel
Start parameterizing CHKInventory and CHKSerializer so that we can
88
search_key_registry = registry.Registry()
89
search_key_registry.register('plain', _search_key_plain)
90
search_key_registry.register('hash-16-way', _search_key_16)
91
search_key_registry.register('hash-255-way', _search_key_255)
92
93
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
94
class CHKMap(object):
95
    """A persistent map from string to string backed by a CHK store."""
96
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
97
    def __init__(self, store, root_key, search_key_func=None):
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
98
        """Create a CHKMap object.
99
100
        :param store: The store the CHKMap is stored in.
101
        :param root_key: The root key of the map. None to create an empty
102
            CHKMap.
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
103
        :param search_key_func: A function mapping a key => bytes. These bytes
104
            are then used by the internal nodes to split up leaf nodes into
105
            multiple pages.
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
106
        """
107
        self._store = store
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
108
        if search_key_func is None:
3735.16.6 by John Arbash Meinel
Include a _search_key_plain function.
109
            search_key_func = _search_key_plain
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
110
        self._search_key_func = search_key_func
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
111
        if root_key is None:
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
112
            self._root_node = LeafNode(search_key_func=search_key_func)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
113
        else:
3735.2.41 by Robert Collins
Make the parent_id_basename index be updated during CHKInventory.apply_delta.
114
            self._root_node = self._node_key(root_key)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
115
116
    def apply_delta(self, delta):
117
        """Apply a delta to the map.
118
119
        :param delta: An iterable of old_key, new_key, new_value tuples.
120
            If new_key is not None, then new_key->new_value is inserted
121
            into the map; if old_key is not None, then the old mapping
122
            of old_key is removed.
123
        """
124
        for old, new, value in delta:
125
            if old is not None and old != new:
126
                # unmap
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
127
                self.unmap(old)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
128
        for old, new, value in delta:
129
            if new is not None:
130
                # map
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
131
                self.map(new, value)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
132
        return self._save()
133
134
    def _ensure_root(self):
135
        """Ensure that the root node is an object not a key."""
136
        if type(self._root_node) == tuple:
137
            # Demand-load the root
3735.2.31 by Robert Collins
CHKMap.iter_changes
138
            self._root_node = self._get_node(self._root_node)
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
139
            # XXX: Shouldn't this be put into _deserialize?
140
            self._root_node._search_key_func = self._search_key_func
3735.2.31 by Robert Collins
CHKMap.iter_changes
141
142
    def _get_node(self, node):
143
        """Get a node.
144
145
        Node that this does not update the _items dict in objects containing a
146
        reference to this node. As such it does not prevent subsequent IO being
147
        performed.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
148
3735.2.31 by Robert Collins
CHKMap.iter_changes
149
        :param node: A tuple key or node object.
150
        :return: A node object.
151
        """
152
        if type(node) == tuple:
153
            bytes = self._read_bytes(node)
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
154
            return _deserialise(bytes, node,
155
                search_key_func=self._search_key_func)
3735.2.31 by Robert Collins
CHKMap.iter_changes
156
        else:
157
            return node
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
158
159
    def _read_bytes(self, key):
160
        stream = self._store.get_record_stream([key], 'unordered', True)
161
        return stream.next().get_bytes_as('fulltext')
162
3735.15.16 by John Arbash Meinel
Properly fix up the dump_tree tests, we now suppress the keys by default.
163
    def _dump_tree(self, include_keys=False):
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
164
        """Return the tree in a string representation."""
165
        self._ensure_root()
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
166
        res = self._dump_tree_node(self._root_node, prefix='', indent='',
167
                                   include_keys=include_keys)
3735.11.9 by John Arbash Meinel
Switch _dump_tree to returning trailing '\n' for nicer results
168
        res.append('') # Give a trailing '\n'
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
169
        return '\n'.join(res)
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
170
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
171
    def _dump_tree_node(self, node, prefix, indent, include_keys=True):
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
172
        """For this node and all children, generate a string representation."""
173
        result = []
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
174
        if not include_keys:
175
            key_str = ''
176
        else:
177
            node_key = node.key()
178
            if node_key is not None:
179
                key_str = ' %s' % (node_key[0],)
180
            else:
181
                key_str = ' None'
182
        result.append('%s%r %s%s' % (indent, prefix, node.__class__.__name__,
183
                                     key_str))
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
184
        if isinstance(node, InternalNode):
185
            # Trigger all child nodes to get loaded
186
            list(node._iter_nodes(self._store))
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
187
            for prefix, sub in sorted(node._items.iteritems()):
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
188
                result.extend(self._dump_tree_node(sub, prefix, indent + '  ',
189
                                                   include_keys=include_keys))
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
190
        else:
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
191
            for key, value in sorted(node._items.iteritems()):
192
                result.append('      %r %r' % (key, value))
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
193
        return result
194
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
195
    @classmethod
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
196
    def from_dict(klass, store, initial_value, maximum_size=0, key_width=1):
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
197
        """Create a CHKMap in store with initial_value as the content.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
198
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
199
        :param store: The store to record initial_value in, a VersionedFiles
200
            object with 1-tuple keys supporting CHK key generation.
201
        :param initial_value: A dict to store in store. Its keys and values
202
            must be bytestrings.
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
203
        :param maximum_size: The maximum_size rule to apply to nodes. This
204
            determines the size at which no new data is added to a single node.
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
205
        :param key_width: The number of elements in each key_tuple being stored
206
            in this map.
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
207
        :return: The root chk of te resulting CHKMap.
208
        """
209
        result = CHKMap(store, None)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
210
        result._root_node.set_maximum_size(maximum_size)
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
211
        result._root_node._key_width = key_width
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
212
        delta = []
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
213
        for key, value in initial_value.items():
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
214
            delta.append((None, key, value))
215
        result.apply_delta(delta)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
216
        return result._save()
217
3735.2.30 by Robert Collins
Start iter_changes between CHKMap instances.
218
    def iter_changes(self, basis):
219
        """Iterate over the changes between basis and self.
220
221
        :return: An iterator of tuples: (key, old_value, new_value). Old_value
222
            is None for keys only in self; new_value is None for keys only in
223
            basis.
224
        """
3735.2.31 by Robert Collins
CHKMap.iter_changes
225
        # Overview:
226
        # Read both trees in lexographic, highest-first order.
227
        # Any identical nodes we skip
228
        # Any unique prefixes we output immediately.
229
        # values in a leaf node are treated as single-value nodes in the tree
230
        # which allows them to be not-special-cased. We know to output them
231
        # because their value is a string, not a key(tuple) or node.
232
        #
233
        # corner cases to beware of when considering this function:
234
        # *) common references are at different heights.
235
        #    consider two trees:
236
        #    {'a': LeafNode={'aaa':'foo', 'aab':'bar'}, 'b': LeafNode={'b'}}
237
        #    {'a': InternalNode={'aa':LeafNode={'aaa':'foo', 'aab':'bar'}, 'ab':LeafNode={'ab':'bar'}}
238
        #     'b': LeafNode={'b'}}
239
        #    the node with aaa/aab will only be encountered in the second tree
240
        #    after reading the 'a' subtree, but it is encountered in the first
241
        #    tree immediately. Variations on this may have read internal nodes like this.
242
        #    we want to cut the entire pending subtree when we realise we have a common node.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
243
        #    For this we use a list of keys - the path to a node - and check the entire path is
3735.2.31 by Robert Collins
CHKMap.iter_changes
244
        #    clean as we process each item.
245
        if self._node_key(self._root_node) == self._node_key(basis._root_node):
246
            return
247
        self._ensure_root()
248
        basis._ensure_root()
249
        excluded_keys = set()
250
        self_node = self._root_node
251
        basis_node = basis._root_node
252
        # A heap, each element is prefix, node(tuple/NodeObject/string),
253
        # key_path (a list of tuples, tail-sharing down the tree.)
254
        self_pending = []
255
        basis_pending = []
256
        def process_node(prefix, node, path, a_map, pending):
257
            # take a node and expand it
258
            node = a_map._get_node(node)
259
            if type(node) == LeafNode:
260
                path = (node._key, path)
261
                for key, value in node._items.items():
262
                    heapq.heappush(pending, ('\x00'.join(key), value, path))
263
            else:
264
                # type(node) == InternalNode
265
                path = (node._key, path)
266
                for prefix, child in node._items.items():
267
                    heapq.heappush(pending, (prefix, child, path))
268
        process_node(None, self_node, None, self, self_pending)
269
        process_node(None, basis_node, None, basis, basis_pending)
270
        self_seen = set()
271
        basis_seen = set()
272
        excluded_keys = set()
273
        def check_excluded(key_path):
274
            # Note that this is N^2, it depends on us trimming trees
275
            # aggressively to not become slow.
276
            # A better implementation would probably have a reverse map
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
277
            # back to the children of a node, and jump straight to it when
3735.2.31 by Robert Collins
CHKMap.iter_changes
278
            # a common node is detected, the proceed to remove the already
279
            # pending children. bzrlib.graph has a searcher module with a
280
            # similar problem.
281
            while key_path is not None:
282
                key, key_path = key_path
283
                if key in excluded_keys:
284
                    return True
285
            return False
286
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
287
        loop_counter = 0
3735.2.31 by Robert Collins
CHKMap.iter_changes
288
        while self_pending or basis_pending:
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
289
            loop_counter += 1
3735.2.31 by Robert Collins
CHKMap.iter_changes
290
            if not self_pending:
291
                # self is exhausted: output remainder of basis
292
                for prefix, node, path in basis_pending:
293
                    if check_excluded(path):
294
                        continue
295
                    node = basis._get_node(node)
296
                    if type(node) == str:
297
                        # a value
298
                        yield (tuple(prefix.split('\x00')), node, None)
299
                    else:
300
                        # subtree - fastpath the entire thing.
301
                        for key, value in node.iteritems(basis._store):
302
                            yield (key, value, None)
303
                return
304
            elif not basis_pending:
305
                # basis is exhausted: output remainder of self.
306
                for prefix, node, path in self_pending:
307
                    if check_excluded(path):
308
                        continue
309
                    node = self._get_node(node)
310
                    if type(node) == str:
311
                        # a value
312
                        yield (tuple(prefix.split('\x00')), None, node)
313
                    else:
314
                        # subtree - fastpath the entire thing.
315
                        for key, value in node.iteritems(self._store):
316
                            yield (key, None, value)
317
                return
318
            else:
319
                # XXX: future optimisation - yield the smaller items
320
                # immediately rather than pushing everything on/off the
321
                # heaps. Applies to both internal nodes and leafnodes.
322
                if self_pending[0][0] < basis_pending[0][0]:
323
                    # expand self
324
                    prefix, node, path = heapq.heappop(self_pending)
325
                    if check_excluded(path):
326
                        continue
327
                    if type(node) == str:
328
                        # a value
329
                        yield (tuple(prefix.split('\x00')), None, node)
330
                    else:
331
                        process_node(prefix, node, path, self, self_pending)
332
                        continue
333
                elif self_pending[0][0] > basis_pending[0][0]:
334
                    # expand basis
335
                    prefix, node, path = heapq.heappop(basis_pending)
336
                    if check_excluded(path):
337
                        continue
338
                    if type(node) == str:
339
                        # a value
340
                        yield (tuple(prefix.split('\x00')), node, None)
341
                    else:
342
                        process_node(prefix, node, path, basis, basis_pending)
343
                        continue
344
                else:
345
                    # common prefix: possibly expand both
346
                    if type(self_pending[0][1]) != str:
347
                        # process next self
348
                        read_self = True
349
                    else:
350
                        read_self = False
351
                    if type(basis_pending[0][1]) != str:
352
                        # process next basis
353
                        read_basis = True
354
                    else:
355
                        read_basis = False
356
                    if not read_self and not read_basis:
357
                        # compare a common value
358
                        self_details = heapq.heappop(self_pending)
359
                        basis_details = heapq.heappop(basis_pending)
360
                        if self_details[1] != basis_details[1]:
361
                            yield (tuple(self_details[0].split('\x00')),
362
                                basis_details[1], self_details[1])
363
                        continue
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
364
                    # At least one side wasn't a string.
365
                    if (self._node_key(self_pending[0][1]) ==
366
                        self._node_key(basis_pending[0][1])):
367
                        # Identical pointers, skip (and don't bother adding to
368
                        # excluded, it won't turn up again.
369
                        heapq.heappop(self_pending)
370
                        heapq.heappop(basis_pending)
371
                        continue
372
                    # Now we need to expand this node before we can continue
3735.2.31 by Robert Collins
CHKMap.iter_changes
373
                    if read_self:
374
                        prefix, node, path = heapq.heappop(self_pending)
375
                        if check_excluded(path):
376
                            continue
377
                        process_node(prefix, node, path, self, self_pending)
378
                    if read_basis:
379
                        prefix, node, path = heapq.heappop(basis_pending)
380
                        if check_excluded(path):
381
                            continue
382
                        process_node(prefix, node, path, basis, basis_pending)
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
383
        # print loop_counter
3735.2.30 by Robert Collins
Start iter_changes between CHKMap instances.
384
3735.2.9 by Robert Collins
Get a working chk_map using inventory implementation bootstrapped.
385
    def iteritems(self, key_filter=None):
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
386
        """Iterate over the entire CHKMap's contents."""
387
        self._ensure_root()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
388
        return self._root_node.iteritems(self._store, key_filter=key_filter)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
389
3735.2.12 by Robert Collins
Implement commit-via-deltas for split inventory repositories.
390
    def key(self):
391
        """Return the key for this map."""
392
        if isinstance(self._root_node, tuple):
393
            return self._root_node
394
        else:
395
            return self._root_node._key
396
3735.2.17 by Robert Collins
Cache node length to avoid full iteration on __len__ calls.
397
    def __len__(self):
398
        self._ensure_root()
399
        return len(self._root_node)
400
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
401
    def map(self, key, value):
402
        """Map a key tuple to value."""
403
        # Need a root object.
404
        self._ensure_root()
405
        prefix, node_details = self._root_node.map(self._store, key, value)
406
        if len(node_details) == 1:
407
            self._root_node = node_details[0][1]
408
        else:
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
409
            self._root_node = InternalNode(prefix,
410
                                search_key_func=self._search_key_func)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
411
            self._root_node.set_maximum_size(node_details[0][1].maximum_size)
412
            self._root_node._key_width = node_details[0][1]._key_width
413
            for split, node in node_details:
414
                self._root_node.add_node(split, node)
415
3735.2.31 by Robert Collins
CHKMap.iter_changes
416
    def _node_key(self, node):
417
        """Get the key for a node whether its a tuple o r node."""
418
        if type(node) == tuple:
419
            return node
420
        else:
421
            return node._key
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
422
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
423
    def unmap(self, key):
424
        """remove key from the map."""
425
        self._ensure_root()
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
426
        unmapped = self._root_node.unmap(self._store, key)
427
        self._root_node = unmapped
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
428
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
429
    def _save(self):
430
        """Save the map completely.
431
432
        :return: The key of the root node.
433
        """
434
        if type(self._root_node) == tuple:
435
            # Already saved.
436
            return self._root_node
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
437
        keys = list(self._root_node.serialise(self._store))
438
        return keys[-1]
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
439
440
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
441
class Node(object):
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
442
    """Base class defining the protocol for CHK Map nodes.
443
444
    :ivar _raw_size: The total size of the serialized key:value data, before
445
        adding the header bytes, and without prefix compression.
446
    """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
447
448
    def __init__(self, key_width=1):
449
        """Create a node.
450
451
        :param key_width: The width of keys for this node.
452
        """
453
        self._key = None
454
        # Current number of elements
455
        self._len = 0
456
        self._maximum_size = 0
457
        self._key_width = 1
458
        # current size in bytes
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
459
        self._raw_size = 0
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
460
        # The pointers/values this node has - meaning defined by child classes.
461
        self._items = {}
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
462
        # The common search prefix
463
        self._search_prefix = None
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
464
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
465
    def __repr__(self):
466
        items_str = sorted(self._items)
467
        if len(items_str) > 20:
468
            items_str = items_str[16] + '...]'
3735.15.3 by John Arbash Meinel
repr update
469
        return '%s(key:%s len:%s size:%s max:%s prefix:%s items:%s)' % (
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
470
            self.__class__.__name__, self._key, self._len, self._raw_size,
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
471
            self._maximum_size, self._search_prefix, items_str)
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
472
3735.2.38 by Robert Collins
Sufficient fixes to allow bzr-search to index a dev3 format repository.
473
    def key(self):
474
        return self._key
475
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
476
    def __len__(self):
477
        return self._len
478
479
    @property
480
    def maximum_size(self):
481
        """What is the upper limit for adding references to a node."""
482
        return self._maximum_size
483
484
    def set_maximum_size(self, new_size):
485
        """Set the size threshold for nodes.
486
487
        :param new_size: The size at which no data is added to a node. 0 for
488
            unlimited.
489
        """
490
        self._maximum_size = new_size
491
3735.15.5 by John Arbash Meinel
Change the nomenclature.
492
    @classmethod
493
    def common_prefix(cls, prefix, key):
494
        """Given 2 strings, return the longest prefix common to both.
495
496
        :param prefix: This has been the common prefix for other keys, so it is
497
            more likely to be the common prefix in this case as well.
498
        :param key: Another string to compare to
499
        """
500
        if key.startswith(prefix):
501
            return prefix
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
502
        # Is there a better way to do this?
3735.15.5 by John Arbash Meinel
Change the nomenclature.
503
        for pos, (left, right) in enumerate(zip(prefix, key)):
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
504
            if left != right:
505
                pos -= 1
506
                break
3735.15.5 by John Arbash Meinel
Change the nomenclature.
507
        assert pos <= len(prefix)
508
        assert pos <= len(key)
509
        common = prefix[:pos+1]
510
        assert key.startswith(common)
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
511
        return common
512
3735.15.5 by John Arbash Meinel
Change the nomenclature.
513
    @classmethod
514
    def common_prefix_for_keys(cls, keys):
515
        """Given a list of keys, find their common prefix.
516
517
        :param keys: An iterable of strings.
518
        :return: The longest common prefix of all keys.
519
        """
520
        common_prefix = None
521
        for key in keys:
522
            if common_prefix is None:
523
                common_prefix = key
524
                continue
525
            common_prefix = cls.common_prefix(common_prefix, key)
526
            if not common_prefix:
527
                # if common_prefix is the empty string, then we know it won't
528
                # change further
529
                return ''
530
        return common_prefix
531
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
532
533
class LeafNode(Node):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
534
    """A node containing actual key:value pairs.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
535
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
536
    :ivar _items: A dict of key->value items. The key is in tuple form.
3735.15.4 by John Arbash Meinel
Clean up some little bits.
537
    :ivar _size: The number of bytes that would be used by serializing all of
538
        the key/value pairs.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
539
    """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
540
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
541
    def __init__(self, search_key_func=None):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
542
        Node.__init__(self)
3735.15.4 by John Arbash Meinel
Clean up some little bits.
543
        # All of the keys in this leaf node share this common prefix
3735.15.5 by John Arbash Meinel
Change the nomenclature.
544
        self._common_serialised_prefix = None
545
        self._serialise_key = '\x00'.join
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
546
        if search_key_func is None:
3735.16.6 by John Arbash Meinel
Include a _search_key_plain function.
547
            self._search_key_func = _search_key_plain
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
548
        else:
549
            self._search_key_func = search_key_func
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
550
551
    def _current_size(self):
3735.15.4 by John Arbash Meinel
Clean up some little bits.
552
        """Answer the current serialised size of this node.
553
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
554
        This differs from self._raw_size in that it includes the bytes used for
555
        the header.
3735.15.4 by John Arbash Meinel
Clean up some little bits.
556
        """
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
557
        if self._common_serialised_prefix is None:
558
            bytes_for_items = 0
559
        else:
560
            # We will store a single string with the common prefix
561
            # And then that common prefix will not be stored in any of the
562
            # entry lines
563
            prefix_len = len(self._common_serialised_prefix)
564
            bytes_for_items = (self._raw_size - (prefix_len * (self._len - 1)))
565
        return (13 # bytes overhead for the header and separators
3735.15.8 by John Arbash Meinel
Add asserts so that when serializing and deserializing
566
            + len(str(self._maximum_size))
567
            + len(str(self._key_width))
3735.15.4 by John Arbash Meinel
Clean up some little bits.
568
            + len(str(self._len))
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
569
            + bytes_for_items)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
570
571
    @classmethod
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
572
    def deserialise(klass, bytes, key, search_key_func=None):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
573
        """Deserialise bytes, with key key, into a LeafNode.
574
575
        :param bytes: The bytes of the node.
576
        :param key: The key that the serialised node has.
577
        """
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
578
        result = LeafNode(search_key_func=search_key_func)
3735.2.72 by John Arbash Meinel
Change deserialise to properly handle when there is a '\r' in the key.
579
        # Splitlines can split on '\r' so don't use it, split('\n') adds an
580
        # extra '' if the bytes ends in a final newline.
581
        lines = bytes.split('\n')
582
        assert lines[-1] == ''
583
        lines.pop(-1)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
584
        items = {}
585
        if lines[0] != 'chkleaf:':
586
            raise ValueError("not a serialised leaf node: %r" % bytes)
587
        maximum_size = int(lines[1])
588
        width = int(lines[2])
589
        length = int(lines[3])
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
590
        prefix = lines[4]
591
        for line in lines[5:]:
592
            line = prefix + line
3735.2.25 by Robert Collins
CHKInventory core tests passing.
593
            elements = line.split('\x00', width)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
594
            items[tuple(elements[:-1])] = elements[-1]
595
        if len(items) != length:
3735.14.3 by John Arbash Meinel
Properly remove keys that are found in the page cache. And add some debugging.
596
            raise AssertionError("item count (%d) mismatch for key %s,"
597
                " bytes %r" % (length, key, bytes))
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
598
        result._items = items
599
        result._len = length
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
600
        assert length == len(lines) - 5
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
601
        result._maximum_size = maximum_size
602
        result._key = key
603
        result._key_width = width
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
604
        result._raw_size = (sum(map(len, lines[5:])) # the length of the suffix
605
            + (length)*(len(prefix)+1)) # prefix + '\n'
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
606
        result._compute_search_prefix()
3735.15.5 by John Arbash Meinel
Change the nomenclature.
607
        result._compute_serialised_prefix()
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
608
        if len(bytes) != result._current_size():
609
            import pdb; pdb.set_trace()
3735.15.8 by John Arbash Meinel
Add asserts so that when serializing and deserializing
610
        assert len(bytes) == result._current_size()
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
611
        return result
612
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
613
    def iteritems(self, store, key_filter=None):
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
614
        """Iterate over items in the node.
615
616
        :param key_filter: A filter to apply to the node. It should be a
617
            list/set/dict or similar repeatedly iterable container.
618
        """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
619
        if key_filter is not None:
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
620
            # Adjust the filter - short elements go to a prefix filter. Would this
621
            # be cleaner explicitly? That would be no harder for InternalNode..
622
            # XXX: perhaps defaultdict? Profiling<rinse and repeat>
623
            filters = {}
624
            for key in key_filter:
625
                length_filter = filters.setdefault(len(key), set())
626
                length_filter.add(key)
627
            filters = filters.items()
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
628
            for item in self._items.iteritems():
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
629
                for length, length_filter in filters:
630
                    if item[0][:length] in length_filter:
631
                        yield item
632
                        break
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
633
        else:
634
            for item in self._items.iteritems():
635
                yield item
636
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
637
    def _key_value_len(self, key, value):
638
        # TODO: Should probably be done without actually joining the key, but
639
        #       then that can be done via the C extension
3735.15.5 by John Arbash Meinel
Change the nomenclature.
640
        return 2 + len(self._serialise_key(key)) + len(value)
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
641
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
642
    def _search_key(self, key):
643
        return self._search_key_func(key)
644
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
645
    def _map_no_split(self, key, value):
646
        """Map a key to a value.
647
648
        This assumes either the key does not already exist, or you have already
649
        removed its size and length from self.
650
651
        :return: True if adding this node should cause us to split.
652
        """
653
        self._items[key] = value
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
654
        self._raw_size += self._key_value_len(key, value)
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
655
        self._len += 1
3735.15.5 by John Arbash Meinel
Change the nomenclature.
656
        serialised_key = self._serialise_key(key)
657
        if self._common_serialised_prefix is None:
658
            self._common_serialised_prefix = serialised_key
659
        else:
660
            self._common_serialised_prefix = self.common_prefix(
661
                self._common_serialised_prefix, serialised_key)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
662
        search_key = self._search_key(key)
663
        if self._search_prefix is None:
664
            self._search_prefix = search_key
3735.15.5 by John Arbash Meinel
Change the nomenclature.
665
        else:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
666
            self._search_prefix = self.common_prefix(
667
                self._search_prefix, search_key)
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
668
        if (self._len > 1
669
            and self._maximum_size
3735.16.10 by John Arbash Meinel
Don't track state for an infrequent edge case.
670
            and self._current_size() > self._maximum_size):
671
            # Check to see if all of the search_keys for this node are
672
            # identical. We allow the node to grow under that circumstance
673
            # (we could track this as common state, but it is infrequent)
674
            if (search_key != self._search_prefix
675
                or not self._are_search_keys_identical()):
676
                return True
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
677
        return False
678
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
679
    def _split(self, store):
680
        """We have overflowed.
681
682
        Split this node into multiple LeafNodes, return it up the stack so that
683
        the next layer creates a new InternalNode and references the new nodes.
684
3735.15.5 by John Arbash Meinel
Change the nomenclature.
685
        :return: (common_serialised_prefix, [(node_serialised_prefix, node)])
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
686
        """
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
687
        common_prefix = self._search_prefix
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
688
        split_at = len(common_prefix) + 1
689
        result = {}
690
        for key, value in self._items.iteritems():
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
691
            search_key = self._search_key(key)
692
            prefix = search_key[:split_at]
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
693
            # TODO: Generally only 1 key can be exactly the right length,
694
            #       which means we can only have 1 key in the node pointed
695
            #       at by the 'prefix\0' key. We might want to consider
696
            #       folding it into the containing InternalNode rather than
697
            #       having a fixed length-1 node.
698
            #       Note this is probably not true for hash keys, as they
699
            #       may get a '\00' node anywhere, but won't have keys of
700
            #       different lengths.
701
            if len(prefix) < split_at:
702
                prefix += '\x00'*(split_at - len(prefix))
703
            if prefix not in result:
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
704
                node = LeafNode(search_key_func=self._search_key_func)
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
705
                node.set_maximum_size(self._maximum_size)
706
                node._key_width = self._key_width
707
                result[prefix] = node
708
            else:
709
                node = result[prefix]
710
            node.map(store, key, value)
711
        return common_prefix, result.items()
712
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
713
    def map(self, store, key, value):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
714
        """Map key to value."""
715
        if key in self._items:
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
716
            self._raw_size -= self._key_value_len(key, self._items[key])
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
717
            self._len -= 1
718
        self._key = None
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
719
        if self._map_no_split(key, value):
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
720
            return self._split(store)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
721
        else:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
722
            return self._search_prefix, [("", self)]
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
723
724
    def serialise(self, store):
725
        """Serialise the tree to store.
726
727
        :param store: A VersionedFiles honouring the CHK extensions.
728
        :return: An iterable of the keys inserted by this operation.
729
        """
730
        lines = ["chkleaf:\n"]
731
        lines.append("%d\n" % self._maximum_size)
732
        lines.append("%d\n" % self._key_width)
733
        lines.append("%d\n" % self._len)
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
734
        if self._common_serialised_prefix is None:
735
            lines.append('\n')
736
        else:
737
            lines.append('%s\n' % (self._common_serialised_prefix,))
738
            prefix_len = len(self._common_serialised_prefix)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
739
        for key, value in sorted(self._items.items()):
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
740
            serialized = "%s\x00%s\n" % (self._serialise_key(key), value)
741
            assert serialized.startswith(self._common_serialised_prefix)
742
            lines.append(serialized[prefix_len:])
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
743
        sha1, _, _ = store.add_lines((None,), (), lines)
744
        self._key = ("sha1:" + sha1,)
3735.15.8 by John Arbash Meinel
Add asserts so that when serializing and deserializing
745
        bytes = ''.join(lines)
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
746
        if len(bytes) != self._current_size():
747
            import pdb; pdb.set_trace()
3735.15.8 by John Arbash Meinel
Add asserts so that when serializing and deserializing
748
        assert len(bytes) == self._current_size()
749
        _page_cache.add(self._key, bytes)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
750
        return [self._key]
751
3735.2.26 by Robert Collins
CHKInventory migrated to new CHKMap code.
752
    def refs(self):
753
        """Return the references to other CHK's held by this node."""
754
        return []
755
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
756
    def _compute_search_prefix(self):
757
        """Determine the common search prefix for all keys in this node.
3735.15.5 by John Arbash Meinel
Change the nomenclature.
758
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
759
        :return: A bytestring of the longest search key prefix that is
3735.15.5 by John Arbash Meinel
Change the nomenclature.
760
            unique within this node.
761
        """
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
762
        search_keys = [self._search_key(key) for key in self._items]
763
        self._search_prefix = self.common_prefix_for_keys(search_keys)
764
        return self._search_prefix
3735.15.5 by John Arbash Meinel
Change the nomenclature.
765
3735.16.10 by John Arbash Meinel
Don't track state for an infrequent edge case.
766
    def _are_search_keys_identical(self):
767
        """Check to see if the search keys for all entries are the same.
768
769
        When using a hash as the search_key it is possible for non-identical
770
        keys to collide. If that happens enough, we may try overflow a
771
        LeafNode, but as all are collisions, we must not split.
772
        """
773
        common_search_key = None
774
        for key in self._items:
775
            search_key = self._search_key(key)
776
            if common_search_key is None:
777
                common_search_key = search_key
778
            elif search_key != common_search_key:
779
                return False
780
        return True
781
3735.15.5 by John Arbash Meinel
Change the nomenclature.
782
    def _compute_serialised_prefix(self):
783
        """Determine the common prefix for serialised keys in this node.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
784
785
        :return: A bytestring of the longest serialised key prefix that is
786
            unique within this node.
787
        """
3735.15.5 by John Arbash Meinel
Change the nomenclature.
788
        serialised_keys = [self._serialise_key(key) for key in self._items]
789
        self._common_serialised_prefix = self.common_prefix_for_keys(
790
            serialised_keys)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
791
792
    def unmap(self, store, key):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
793
        """Unmap key from the node."""
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
794
        self._raw_size -= self._key_value_len(key, self._items[key])
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
795
        self._len -= 1
796
        del self._items[key]
797
        self._key = None
3735.15.2 by John Arbash Meinel
Change LeafNode to also cache its unique serialized prefix.
798
        # Recompute from scratch
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
799
        self._compute_search_prefix()
3735.15.5 by John Arbash Meinel
Change the nomenclature.
800
        self._compute_serialised_prefix()
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
801
        return self
802
803
804
class InternalNode(Node):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
805
    """A node that contains references to other nodes.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
806
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
807
    An InternalNode is responsible for mapping search key prefixes to child
3735.15.5 by John Arbash Meinel
Change the nomenclature.
808
    nodes.
3735.15.4 by John Arbash Meinel
Clean up some little bits.
809
3735.15.5 by John Arbash Meinel
Change the nomenclature.
810
    :ivar _items: serialised_key => node dictionary. node may be a tuple,
3735.15.4 by John Arbash Meinel
Clean up some little bits.
811
        LeafNode or InternalNode.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
812
    """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
813
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
814
    def __init__(self, prefix='', search_key_func=None):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
815
        Node.__init__(self)
816
        # The size of an internalnode with default values and no children.
817
        # How many octets key prefixes within this node are.
818
        self._node_width = 0
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
819
        self._search_prefix = prefix
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
820
        if search_key_func is None:
3735.16.6 by John Arbash Meinel
Include a _search_key_plain function.
821
            self._search_key_func = _search_key_plain
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
822
        else:
823
            self._search_key_func = search_key_func
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
824
825
    def __repr__(self):
826
        items_str = sorted(self._items)
827
        if len(items_str) > 20:
828
            items_str = items_str[16] + '...]'
829
        return '%s(key:%s len:%s size:%s max:%s prefix:%s items:%s)' % (
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
830
            self.__class__.__name__, self._key, self._len, self._raw_size,
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
831
            self._maximum_size, self._search_prefix, items_str)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
832
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
833
    def add_node(self, prefix, node):
834
        """Add a child node with prefix prefix, and node node.
835
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
836
        :param prefix: The search key prefix for node.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
837
        :param node: The node being added.
838
        """
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
839
        assert self._search_prefix is not None
840
        assert prefix.startswith(self._search_prefix)
841
        assert len(prefix) == len(self._search_prefix) + 1
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
842
        self._len += len(node)
843
        if not len(self._items):
844
            self._node_width = len(prefix)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
845
        assert self._node_width == len(self._search_prefix) + 1
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
846
        self._items[prefix] = node
847
        self._key = None
848
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
849
    def _current_size(self):
850
        """Answer the current serialised size of this node."""
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
851
        return (self._raw_size + len(str(self._len)) + len(str(self._key_width)) +
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
852
            len(str(self._maximum_size)))
853
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
854
    @classmethod
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
855
    def deserialise(klass, bytes, key, search_key_func=None):
3735.2.25 by Robert Collins
CHKInventory core tests passing.
856
        """Deserialise bytes to an InternalNode, with key key.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
857
858
        :param bytes: The bytes of the node.
859
        :param key: The key that the serialised node has.
860
        :return: An InternalNode instance.
861
        """
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
862
        result = InternalNode(search_key_func=search_key_func)
3735.2.72 by John Arbash Meinel
Change deserialise to properly handle when there is a '\r' in the key.
863
        # Splitlines can split on '\r' so don't use it, remove the extra ''
864
        # from the result of split('\n') because we should have a trailing
865
        # newline
866
        lines = bytes.split('\n')
867
        assert lines[-1] == ''
868
        lines.pop(-1)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
869
        items = {}
870
        if lines[0] != 'chknode:':
871
            raise ValueError("not a serialised internal node: %r" % bytes)
872
        maximum_size = int(lines[1])
873
        width = int(lines[2])
874
        length = int(lines[3])
3735.15.11 by John Arbash Meinel
Change the InternalNodes to also pull out the common prefix.
875
        common_prefix = lines[4]
876
        for line in lines[5:]:
877
            line = common_prefix + line
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
878
            prefix, flat_key = line.rsplit('\x00', 1)
879
            items[prefix] = (flat_key,)
880
        result._items = items
881
        result._len = length
882
        result._maximum_size = maximum_size
883
        result._key = key
884
        result._key_width = width
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
885
        # XXX: InternalNodes don't really care about their size, and this will
886
        #      change if we add prefix compression
887
        result._raw_size = None # len(bytes)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
888
        result._node_width = len(prefix)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
889
        result._compute_search_prefix()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
890
        return result
891
892
    def iteritems(self, store, key_filter=None):
893
        for node in self._iter_nodes(store, key_filter=key_filter):
894
            for item in node.iteritems(store, key_filter=key_filter):
895
                yield item
896
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
897
    def _iter_nodes(self, store, key_filter=None, batch_size=None):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
898
        """Iterate over node objects which match key_filter.
899
900
        :param store: A store to use for accessing content.
901
        :param key_filter: A key filter to filter nodes. Only nodes that might
902
            contain a key in key_filter will be returned.
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
903
        :param batch_size: If not None, then we will return the nodes that had
904
            to be read using get_record_stream in batches, rather than reading
905
            them all at once.
906
        :return: An iterable of nodes. This function does not have to be fully
907
            consumed.  (There will be no pending I/O when items are being returned.)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
908
        """
3735.2.31 by Robert Collins
CHKMap.iter_changes
909
        keys = {}
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
910
        if key_filter is None:
3735.2.31 by Robert Collins
CHKMap.iter_changes
911
            for prefix, node in self._items.iteritems():
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
912
                if type(node) == tuple:
3735.2.31 by Robert Collins
CHKMap.iter_changes
913
                    keys[node] = prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
914
                else:
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
915
                    yield node
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
916
        else:
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
917
            # XXX defaultdict ?
918
            length_filters = {}
919
            for key in key_filter:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
920
                search_key = self._search_prefix_filter(key)
921
                length_filter = length_filters.setdefault(
922
                                    len(search_key), set())
923
                length_filter.add(search_key)
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
924
            length_filters = length_filters.items()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
925
            for prefix, node in self._items.iteritems():
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
926
                for length, length_filter in length_filters:
927
                    if prefix[:length] in length_filter:
928
                        if type(node) == tuple:
929
                            keys[node] = prefix
930
                        else:
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
931
                            yield node
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
932
                        break
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
933
        if keys:
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
934
            # Look in the page cache for some more bytes
935
            found_keys = set()
936
            for key in keys:
937
                try:
938
                    bytes = _page_cache[key]
939
                except KeyError:
940
                    continue
941
                else:
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
942
                    node = _deserialise(bytes, key,
943
                        search_key_func=self._search_key_func)
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
944
                    self._items[keys[key]] = node
945
                    found_keys.add(key)
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
946
                    yield node
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
947
            for key in found_keys:
948
                del keys[key]
949
        if keys:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
950
            # demand load some pages.
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
951
            if batch_size is None:
952
                # Read all the keys in
953
                batch_size = len(keys)
954
            key_order = list(keys)
955
            for batch_start in range(0, len(key_order), batch_size):
956
                batch = key_order[batch_start:batch_start + batch_size]
957
                # We have to fully consume the stream so there is no pending
958
                # I/O, so we buffer the nodes for now.
959
                stream = store.get_record_stream(batch, 'unordered', True)
960
                nodes = []
961
                for record in stream:
962
                    bytes = record.get_bytes_as('fulltext')
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
963
                    node = _deserialise(bytes, record.key,
964
                        search_key_func=self._search_key_func)
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
965
                    nodes.append(node)
966
                    self._items[keys[record.key]] = node
967
                    _page_cache.add(record.key, bytes)
968
                for node in nodes:
969
                    yield node
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
970
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
971
    def map(self, store, key, value):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
972
        """Map key to value."""
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
973
        if not len(self._items):
974
            raise AssertionError("cant map in an empty InternalNode.")
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
975
        search_key = self._search_key(key)
976
        assert self._node_width == len(self._search_prefix) + 1
977
        if not search_key.startswith(self._search_prefix):
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
978
            # This key doesn't fit in this index, so we need to split at the
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
979
            # point where it would fit, insert self into that internal node,
980
            # and then map this key into that node.
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
981
            new_prefix = self.common_prefix(self._search_prefix,
982
                                            search_key)
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
983
            new_parent = InternalNode(new_prefix,
984
                search_key_func=self._search_key_func)
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
985
            new_parent.set_maximum_size(self._maximum_size)
986
            new_parent._key_width = self._key_width
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
987
            new_parent.add_node(self._search_prefix[:len(new_prefix)+1],
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
988
                                self)
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
989
            return new_parent.map(store, key, value)
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
990
        children = list(self._iter_nodes(store, key_filter=[key]))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
991
        if children:
992
            child = children[0]
993
        else:
994
            # new child needed:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
995
            child = self._new_child(search_key, LeafNode)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
996
        old_len = len(child)
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
997
        if isinstance(child, LeafNode):
998
            old_size = child._current_size()
999
        else:
1000
            old_size = None
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1001
        prefix, node_details = child.map(store, key, value)
1002
        if len(node_details) == 1:
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
1003
            # child may have shrunk, or might be a new node
1004
            child = node_details[0][1]
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1005
            self._len = self._len - old_len + len(child)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1006
            self._items[search_key] = child
3735.2.29 by Robert Collins
Untested code is broken code.
1007
            self._key = None
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
1008
            new_node = self
1009
            if (isinstance(child, LeafNode)
1010
                and (old_size is None or child._current_size() < old_size)):
1011
                # The old node was an InternalNode which means it has now
1012
                # collapsed, so we need to check if it will chain to a collapse
1013
                # at this level. Or the LeafNode has shrunk in size, so we need
1014
                # to check that as well.
1015
                new_node = self._check_remap(store)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1016
            assert new_node._search_prefix is not None
1017
            return new_node._search_prefix, [('', new_node)]
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1018
        # child has overflown - create a new intermediate node.
1019
        # XXX: This is where we might want to try and expand our depth
1020
        # to refer to more bytes of every child (which would give us
1021
        # multiple pointers to child nodes, but less intermediate nodes)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1022
        child = self._new_child(search_key, InternalNode)
1023
        child._search_prefix = prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1024
        for split, node in node_details:
1025
            child.add_node(split, node)
1026
        self._len = self._len - old_len + len(child)
3735.2.29 by Robert Collins
Untested code is broken code.
1027
        self._key = None
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1028
        return self._search_prefix, [("", self)]
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1029
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1030
    def _new_child(self, search_key, klass):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1031
        """Create a new child node of type klass."""
1032
        child = klass()
1033
        child.set_maximum_size(self._maximum_size)
1034
        child._key_width = self._key_width
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1035
        child._search_key_func = self._search_key_func
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1036
        self._items[search_key] = child
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1037
        return child
1038
1039
    def serialise(self, store):
1040
        """Serialise the node to store.
1041
1042
        :param store: A VersionedFiles honouring the CHK extensions.
1043
        :return: An iterable of the keys inserted by this operation.
1044
        """
1045
        for node in self._items.itervalues():
1046
            if type(node) == tuple:
1047
                # Never deserialised.
1048
                continue
1049
            if node._key is not None:
1050
                # Never altered
1051
                continue
1052
            for key in node.serialise(store):
1053
                yield key
1054
        lines = ["chknode:\n"]
1055
        lines.append("%d\n" % self._maximum_size)
1056
        lines.append("%d\n" % self._key_width)
1057
        lines.append("%d\n" % self._len)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1058
        assert self._search_prefix is not None
1059
        lines.append('%s\n' % (self._search_prefix,))
1060
        prefix_len = len(self._search_prefix)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1061
        for prefix, node in sorted(self._items.items()):
1062
            if type(node) == tuple:
1063
                key = node[0]
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1064
            else:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1065
                key = node._key[0]
3735.15.11 by John Arbash Meinel
Change the InternalNodes to also pull out the common prefix.
1066
            serialised = "%s\x00%s\n" % (prefix, key)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1067
            assert serialised.startswith(self._search_prefix)
3735.15.11 by John Arbash Meinel
Change the InternalNodes to also pull out the common prefix.
1068
            lines.append(serialised[prefix_len:])
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1069
        sha1, _, _ = store.add_lines((None,), (), lines)
1070
        self._key = ("sha1:" + sha1,)
3735.2.63 by Robert Collins
Divert writes into the CHK page cache as well.
1071
        _page_cache.add(self._key, ''.join(lines))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1072
        yield self._key
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1073
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1074
    def _search_key(self, key):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1075
        """Return the serialised key for key in this node."""
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1076
        # search keys are fixed width. All will be self._node_width wide, so we
3735.15.5 by John Arbash Meinel
Change the nomenclature.
1077
        # pad as necessary.
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
1078
        return (self._search_key_func(key) + '\x00'*self._node_width)[:self._node_width]
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1079
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1080
    def _search_prefix_filter(self, key):
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
1081
        """Serialise key for use as a prefix filter in iteritems."""
1082
        if len(key) == self._key_width:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1083
            return self._search_key(key)
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
1084
        return '\x00'.join(key)[:self._node_width]
1085
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1086
    def _split(self, offset):
1087
        """Split this node into smaller nodes starting at offset.
1088
1089
        :param offset: The offset to start the new child nodes at.
1090
        :return: An iterable of (prefix, node) tuples. prefix is a byte
1091
            prefix for reaching node.
1092
        """
1093
        if offset >= self._node_width:
1094
            for node in self._items.values():
1095
                for result in node._split(offset):
1096
                    yield result
1097
            return
1098
        for key, node in self._items.items():
1099
            pass
1100
3735.2.26 by Robert Collins
CHKInventory migrated to new CHKMap code.
1101
    def refs(self):
1102
        """Return the references to other CHK's held by this node."""
1103
        if self._key is None:
1104
            raise AssertionError("unserialised nodes have no refs.")
1105
        refs = []
1106
        for value in self._items.itervalues():
1107
            if type(value) == tuple:
1108
                refs.append(value)
1109
            else:
1110
                refs.append(value.key())
1111
        return refs
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1112
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1113
    def _compute_search_prefix(self, extra_key=None):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1114
        """Return the unique key prefix for this node.
1115
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1116
        :return: A bytestring of the longest search key prefix that is
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1117
            unique within this node.
1118
        """
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1119
        self._search_prefix = self.common_prefix_for_keys(self._items)
1120
        return self._search_prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1121
1122
    def unmap(self, store, key):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1123
        """Remove key from this node and it's children."""
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1124
        if not len(self._items):
1125
            raise AssertionError("cant unmap in an empty InternalNode.")
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
1126
        children = list(self._iter_nodes(store, key_filter=[key]))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1127
        if children:
1128
            child = children[0]
1129
        else:
1130
            raise KeyError(key)
1131
        self._len -= 1
1132
        unmapped = child.unmap(store, key)
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1133
        self._key = None
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1134
        search_key = self._search_key(key)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1135
        if len(unmapped) == 0:
1136
            # All child nodes are gone, remove the child:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1137
            del self._items[search_key]
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1138
            unmapped = None
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1139
        else:
1140
            # Stash the returned node
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1141
            self._items[search_key] = unmapped
3735.2.23 by Robert Collins
Test unmapping with one child left but multiple keys.
1142
        if len(self._items) == 1:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1143
            # this node is no longer needed:
1144
            return self._items.values()[0]
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1145
        if isinstance(unmapped, InternalNode):
1146
            return self
1147
        return self._check_remap(store)
1148
1149
    def _check_remap(self, store):
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
1150
        """Check if all keys contained by children fit in a single LeafNode.
1151
1152
        :param store: A store to use for reading more nodes
1153
        :return: Either self, or a new LeafNode which should replace self.
1154
        """
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1155
        # Logic for how we determine when we need to rebuild
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1156
        # 1) Implicitly unmap() is removing a key which means that the child
1157
        #    nodes are going to be shrinking by some extent.
1158
        # 2) If all children are LeafNodes, it is possible that they could be
1159
        #    combined into a single LeafNode, which can then completely replace
1160
        #    this internal node with a single LeafNode
1161
        # 3) If *one* child is an InternalNode, we assume it has already done
1162
        #    all the work to determine that its children cannot collapse, and
1163
        #    we can then assume that those nodes *plus* the current nodes don't
1164
        #    have a chance of collapsing either.
1165
        #    So a very cheap check is to just say if 'unmapped' is an
1166
        #    InternalNode, we don't have to check further.
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1167
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1168
        # TODO: Another alternative is to check the total size of all known
1169
        #       LeafNodes. If there is some formula we can use to determine the
1170
        #       final size without actually having to read in any more
1171
        #       children, it would be nice to have. However, we have to be
1172
        #       careful with stuff like nodes that pull out the common prefix
1173
        #       of each key, as adding a new key can change the common prefix
1174
        #       and cause size changes greater than the length of one key.
1175
        #       So for now, we just add everything to a new Leaf until it
1176
        #       splits, as we know that will give the right answer
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1177
        new_leaf = LeafNode(search_key_func=self._search_key_func)
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1178
        new_leaf.set_maximum_size(self._maximum_size)
1179
        new_leaf._key_width = self._key_width
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
1180
        # A batch_size of 16 was chosen because:
1181
        #   a) In testing, a 4k page held 14 times. So if we have more than 16
1182
        #      leaf nodes we are unlikely to hold them in a single new leaf
1183
        #      node. This still allows for 1 round trip
1184
        #   b) With 16-way fan out, we can still do a single round trip
1185
        #   c) With 255-way fan out, we don't want to read all 255 and destroy
1186
        #      the page cache, just to determine that we really don't need it.
1187
        for node in self._iter_nodes(store, batch_size=16):
1188
            if isinstance(node, InternalNode):
1189
                # Without looking at any leaf nodes, we are sure
1190
                return self
1191
            for key, value in node._items.iteritems():
1192
                if new_leaf._map_no_split(key, value):
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1193
                    return self
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1194
        return new_leaf
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1195
1196
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1197
def _deserialise(bytes, key, search_key_func):
3735.2.16 by Robert Collins
Untested extensions to support repodetails
1198
    """Helper for repositorydetails - convert bytes to a node."""
3735.2.24 by Robert Collins
test_chk_map tests all passing.
1199
    if bytes.startswith("chkleaf:\n"):
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1200
        return LeafNode.deserialise(bytes, key, search_key_func=search_key_func)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1201
    elif bytes.startswith("chknode:\n"):
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1202
        return InternalNode.deserialise(bytes, key,
1203
            search_key_func=search_key_func)
3735.2.16 by Robert Collins
Untested extensions to support repodetails
1204
    else:
1205
        raise AssertionError("Unknown node type.")
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1206
1207
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1208
def _find_children_info(store, interesting_keys, uninteresting_keys, pb):
3735.9.7 by John Arbash Meinel
Cleanup pass.
1209
    """Read the associated records, and determine what is interesting."""
1210
    uninteresting_keys = set(uninteresting_keys)
1211
    chks_to_read = uninteresting_keys.union(interesting_keys)
1212
    next_uninteresting = set()
1213
    next_interesting = set()
1214
    uninteresting_items = set()
1215
    interesting_items = set()
1216
    interesting_records = []
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1217
    # records_read = set()
3735.9.7 by John Arbash Meinel
Cleanup pass.
1218
    for record in store.get_record_stream(chks_to_read, 'unordered', True):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1219
        # records_read.add(record.key())
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1220
        if pb is not None:
1221
            pb.tick()
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1222
        bytes = record.get_bytes_as('fulltext')
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1223
        # We don't care about search_key_func for this code, because we only
1224
        # care about external references.
1225
        node = _deserialise(bytes, record.key, search_key_func=None)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1226
        if record.key in uninteresting_keys:
1227
            if isinstance(node, InternalNode):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1228
                next_uninteresting.update(node.refs())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1229
            else:
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1230
                # We know we are at a LeafNode, so we can pass None for the
1231
                # store
1232
                uninteresting_items.update(node.iteritems(None))
3735.9.7 by John Arbash Meinel
Cleanup pass.
1233
        else:
1234
            interesting_records.append(record)
1235
            if isinstance(node, InternalNode):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1236
                next_interesting.update(node.refs())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1237
            else:
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1238
                interesting_items.update(node.iteritems(None))
1239
    # TODO: Filter out records that have already been read, as node splitting
1240
    #       can cause us to reference the same nodes via shorter and longer
1241
    #       paths
3735.9.7 by John Arbash Meinel
Cleanup pass.
1242
    return (next_uninteresting, uninteresting_items,
1243
            next_interesting, interesting_records, interesting_items)
1244
1245
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1246
def _find_all_uninteresting(store, interesting_root_keys,
1247
                            uninteresting_root_keys, adapter, pb):
1248
    """Determine the full set of uninteresting keys."""
1249
    # What about duplicates between interesting_root_keys and
1250
    # uninteresting_root_keys?
1251
    if not uninteresting_root_keys:
1252
        # Shortcut case. We know there is nothing uninteresting to filter out
1253
        # So we just let the rest of the algorithm do the work
1254
        # We know there is nothing uninteresting, and we didn't have to read
1255
        # any interesting records yet.
1256
        return (set(), set(), set(interesting_root_keys), [], set())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1257
    all_uninteresting_chks = set(uninteresting_root_keys)
1258
    all_uninteresting_items = set()
1259
1260
    # First step, find the direct children of both the interesting and
1261
    # uninteresting set
1262
    (uninteresting_keys, uninteresting_items,
1263
     interesting_keys, interesting_records,
1264
     interesting_items) = _find_children_info(store, interesting_root_keys,
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1265
                                              uninteresting_root_keys,
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1266
                                              pb=pb)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1267
    all_uninteresting_chks.update(uninteresting_keys)
1268
    all_uninteresting_items.update(uninteresting_items)
1269
    del uninteresting_items
1270
    # Note: Exact matches between interesting and uninteresting do not need
1271
    #       to be search further. Non-exact matches need to be searched in case
1272
    #       there is a future exact-match
1273
    uninteresting_keys.difference_update(interesting_keys)
1274
3735.9.6 by John Arbash Meinel
Add a first pass to the interesting search.
1275
    # Second, find the full set of uninteresting bits reachable by the
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1276
    # uninteresting roots
3735.9.7 by John Arbash Meinel
Cleanup pass.
1277
    chks_to_read = uninteresting_keys
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1278
    while chks_to_read:
1279
        next_chks = set()
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1280
        for record in store.get_record_stream(chks_to_read, 'unordered', False):
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1281
            # TODO: Handle 'absent'
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1282
            if pb is not None:
1283
                pb.tick()
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1284
            if record.storage_kind in ('fulltext', 'chunked'):
1285
                bytes = record.get_bytes_as('fulltext')
1286
            else:
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1287
                bytes = adapter.get_bytes(record,
1288
                            record.get_bytes_as(record.storage_kind))
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1289
            # We don't care about search_key_func for this code, because we
1290
            # only care about external references.
1291
            node = _deserialise(bytes, record.key, search_key_func=None)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1292
            if isinstance(node, InternalNode):
1293
                # uninteresting_prefix_chks.update(node._items.iteritems())
1294
                chks = node._items.values()
1295
                # TODO: We remove the entries that are already in
1296
                #       uninteresting_chks ?
1297
                next_chks.update(chks)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1298
                all_uninteresting_chks.update(chks)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1299
            else:
3735.9.7 by John Arbash Meinel
Cleanup pass.
1300
                all_uninteresting_items.update(node._items.iteritems())
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1301
        chks_to_read = next_chks
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1302
    return (all_uninteresting_chks, all_uninteresting_items,
1303
            interesting_keys, interesting_records, interesting_items)
1304
1305
1306
def iter_interesting_nodes(store, interesting_root_keys,
1307
                           uninteresting_root_keys, pb=None):
1308
    """Given root keys, find interesting nodes.
1309
1310
    Evaluate nodes referenced by interesting_root_keys. Ones that are also
1311
    referenced from uninteresting_root_keys are not considered interesting.
1312
1313
    :param interesting_root_keys: keys which should be part of the
1314
        "interesting" nodes (which will be yielded)
1315
    :param uninteresting_root_keys: keys which should be filtered out of the
1316
        result set.
1317
    :return: Yield
1318
        (interesting records, interesting chk's, interesting key:values)
1319
    """
1320
    # TODO: consider that it may be more memory efficient to use the 20-byte
1321
    #       sha1 string, rather than tuples of hexidecimal sha1 strings.
3735.2.68 by John Arbash Meinel
Add a TODO about avoiding all of the get_record_stream calls.
1322
    # TODO: Try to factor out a lot of the get_record_stream() calls into a
1323
    #       helper function similar to _read_bytes. This function should be
1324
    #       able to use nodes from the _page_cache as well as actually
1325
    #       requesting bytes from the store.
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1326
1327
    # A way to adapt from the compressed texts back into fulltexts
1328
    # In a way, this seems like a layering inversion to have CHKMap know the
1329
    # details of versionedfile
1330
    adapter_class = versionedfile.adapter_registry.get(
1331
        ('knit-ft-gz', 'fulltext'))
1332
    adapter = adapter_class(store)
1333
1334
    (all_uninteresting_chks, all_uninteresting_items, interesting_keys,
1335
     interesting_records, interesting_items) = _find_all_uninteresting(store,
1336
        interesting_root_keys, uninteresting_root_keys, adapter, pb)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1337
1338
    # Now that we know everything uninteresting, we can yield information from
1339
    # our first request
1340
    interesting_items.difference_update(all_uninteresting_items)
1341
    records = dict((record.key, record) for record in interesting_records
1342
                    if record.key not in all_uninteresting_chks)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1343
    if records or interesting_items:
1344
        yield records, interesting_items
3735.9.7 by John Arbash Meinel
Cleanup pass.
1345
    interesting_keys.difference_update(all_uninteresting_chks)
1346
1347
    chks_to_read = interesting_keys
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1348
    while chks_to_read:
1349
        next_chks = set()
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1350
        for record in store.get_record_stream(chks_to_read, 'unordered', False):
1351
            if pb is not None:
1352
                pb.tick()
3735.9.7 by John Arbash Meinel
Cleanup pass.
1353
            # TODO: Handle 'absent'?
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1354
            if record.storage_kind in ('fulltext', 'chunked'):
1355
                bytes = record.get_bytes_as('fulltext')
1356
            else:
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1357
                bytes = adapter.get_bytes(record,
1358
                            record.get_bytes_as(record.storage_kind))
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1359
            # We don't care about search_key_func for this code, because we
1360
            # only care about external references.
1361
            node = _deserialise(bytes, record.key, search_key_func=None)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1362
            if isinstance(node, InternalNode):
3735.9.15 by John Arbash Meinel
Found a bug in iter_interesting_nodes and its test suite.
1363
                chks = set(node.refs())
1364
                chks.difference_update(all_uninteresting_chks)
1365
                # Is set() and .difference_update better than:
1366
                # chks = [chk for chk in node.refs()
1367
                #              if chk not in all_uninteresting_chks]
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1368
                next_chks.update(chks)
1369
                # These are now uninteresting everywhere else
3735.9.7 by John Arbash Meinel
Cleanup pass.
1370
                all_uninteresting_chks.update(chks)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1371
                interesting_items = []
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1372
            else:
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1373
                interesting_items = [item for item in node._items.iteritems()
1374
                                     if item not in all_uninteresting_items]
3735.9.15 by John Arbash Meinel
Found a bug in iter_interesting_nodes and its test suite.
1375
                # TODO: Do we need to filter out items that we have already
1376
                #       seen on other pages? We don't really want to buffer the
1377
                #       whole thing, but it does mean that callers need to
1378
                #       understand they may get duplicate values.
3735.9.7 by John Arbash Meinel
Cleanup pass.
1379
                # all_uninteresting_items.update(interesting_items)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1380
            yield {record.key: record}, interesting_items
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1381
        chks_to_read = next_chks