/brz/remove-bazaar

To get this branch, use:
bzr branch http://gegoxaren.bato24.eu/bzr/brz/remove-bazaar
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
1
# Copyright (C) 2008 Canonical Ltd
2
#
3
# This program is free software; you can redistribute it and/or modify
4
# it under the terms of the GNU General Public License as published by
5
# the Free Software Foundation; either version 2 of the License, or
6
# (at your option) any later version.
7
#
8
# This program is distributed in the hope that it will be useful,
9
# but WITHOUT ANY WARRANTY; without even the implied warranty of
10
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
11
# GNU General Public License for more details.
12
#
13
# You should have received a copy of the GNU General Public License
14
# along with this program; if not, write to the Free Software
15
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
16
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
17
"""Persistent maps from tuple_of_strings->string using CHK stores.
18
19
Overview and current status:
20
21
The CHKMap class implements a dict from tuple_of_strings->string by using a trie
22
with internal nodes of 8-bit fan out; The key tuples are mapped to strings by
23
joining them by \x00, and \x00 padding shorter keys out to the length of the
24
longest key. Leaf nodes are packed as densely as possible, and internal nodes
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
25
are all and additional 8-bits wide leading to a sparse upper tree.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
26
27
Updates to a CHKMap are done preferentially via the apply_delta method, to
28
allow optimisation of the update operation; but individual map/unmap calls are
29
possible and supported. All changes via map/unmap are buffered in memory until
30
the _save method is called to force serialisation of the tree. apply_delta
31
performs a _save implicitly.
32
33
TODO:
34
-----
35
36
Densely packed upper nodes.
37
38
"""
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
39
3735.2.31 by Robert Collins
CHKMap.iter_changes
40
import heapq
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
41
3735.9.18 by John Arbash Meinel
Make the versionedfile import lazy.
42
from bzrlib import lazy_import
43
lazy_import.lazy_import(globals(), """
3735.16.3 by John Arbash Meinel
Add functions for _search_key_16 and _search_key_255 and some basic tests for them.
44
import zlib
45
import struct
46
3735.9.18 by John Arbash Meinel
Make the versionedfile import lazy.
47
from bzrlib import versionedfile
48
""")
3735.16.7 by John Arbash Meinel
Start parameterizing CHKInventory and CHKSerializer so that we can
49
from bzrlib import (
50
    lru_cache,
51
    registry,
52
    )
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
53
54
# approx 2MB
3735.14.5 by John Arbash Meinel
Change _check_remap to only page in a batch of children at a time.
55
# If each line is 50 bytes, and you have 255 internal pages, with 255-way fan
56
# out, it takes 3.1MB to cache the layer.
57
_PAGE_CACHE_SIZE = 4*1024*1024
3735.14.2 by John Arbash Meinel
Finish using the page cache as part of _check_remap, add debugging functions
58
# We are caching bytes so len(value) is perfectly accurate
59
_page_cache = lru_cache.LRUSizeCache(_PAGE_CACHE_SIZE)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
60
61
3735.16.6 by John Arbash Meinel
Include a _search_key_plain function.
62
def _search_key_plain(key):
63
    """Map the key tuple into a search string that just uses the key bytes."""
64
    return '\x00'.join(key)
65
66
3735.16.3 by John Arbash Meinel
Add functions for _search_key_16 and _search_key_255 and some basic tests for them.
67
def _search_key_16(key):
68
    """Map the key tuple into a search key string which has 16-way fan out."""
69
    return '\x00'.join(['%08X' % abs(zlib.crc32(bit)) for bit in key])
70
71
72
def _search_key_255(key):
73
    """Map the key tuple into a search key string which has 255-way fan out.
74
75
    We use 255-way because '\n' is used as a delimiter, and causes problems
76
    while parsing.
77
    """
3735.2.83 by Vincent Ladeuil
Better fix with explanation for zlib.crc32.
78
    # Depending on python version and platform, zlib.crc32 will return either a
79
    # signed (<= 2.5 >= 3.0) or an unsigned (2.5, 2.6).
80
    # http://docs.python.org/library/zlib.html recommends using a mask to force
81
    # an unsigned value to ensure the same numeric value (unsigned) is obtained
82
    # across all python versions and platforms.
3735.2.84 by John Arbash Meinel
Comment about using using 0xFFFFFFFF as part of _search_key_255
83
    # Note: However, on 32-bit platforms this causes an upcast to PyLong, which
84
    #       are generally slower than PyInts. However, if performance becomes
85
    #       critical, we should probably write the whole thing as an extension
86
    #       anyway.
87
    #       Though we really don't need that 32nd bit of accuracy. (even 2**24
88
    #       is probably enough node fan out for realistic trees.)
3735.2.83 by Vincent Ladeuil
Better fix with explanation for zlib.crc32.
89
    bytes = '\x00'.join([struct.pack('>L', zlib.crc32(bit)&0xFFFFFFFF)
90
                         for bit in key])
3735.16.3 by John Arbash Meinel
Add functions for _search_key_16 and _search_key_255 and some basic tests for them.
91
    return bytes.replace('\n', '_')
92
93
3735.16.7 by John Arbash Meinel
Start parameterizing CHKInventory and CHKSerializer so that we can
94
search_key_registry = registry.Registry()
95
search_key_registry.register('plain', _search_key_plain)
96
search_key_registry.register('hash-16-way', _search_key_16)
97
search_key_registry.register('hash-255-way', _search_key_255)
98
99
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
100
class CHKMap(object):
101
    """A persistent map from string to string backed by a CHK store."""
102
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
103
    def __init__(self, store, root_key, search_key_func=None):
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
104
        """Create a CHKMap object.
105
106
        :param store: The store the CHKMap is stored in.
107
        :param root_key: The root key of the map. None to create an empty
108
            CHKMap.
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
109
        :param search_key_func: A function mapping a key => bytes. These bytes
110
            are then used by the internal nodes to split up leaf nodes into
111
            multiple pages.
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
112
        """
113
        self._store = store
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
114
        if search_key_func is None:
3735.16.6 by John Arbash Meinel
Include a _search_key_plain function.
115
            search_key_func = _search_key_plain
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
116
        self._search_key_func = search_key_func
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
117
        if root_key is None:
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
118
            self._root_node = LeafNode(search_key_func=search_key_func)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
119
        else:
3735.2.41 by Robert Collins
Make the parent_id_basename index be updated during CHKInventory.apply_delta.
120
            self._root_node = self._node_key(root_key)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
121
122
    def apply_delta(self, delta):
123
        """Apply a delta to the map.
124
125
        :param delta: An iterable of old_key, new_key, new_value tuples.
126
            If new_key is not None, then new_key->new_value is inserted
127
            into the map; if old_key is not None, then the old mapping
128
            of old_key is removed.
129
        """
130
        for old, new, value in delta:
131
            if old is not None and old != new:
132
                # unmap
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
133
                self.unmap(old)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
134
        for old, new, value in delta:
135
            if new is not None:
136
                # map
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
137
                self.map(new, value)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
138
        return self._save()
139
140
    def _ensure_root(self):
141
        """Ensure that the root node is an object not a key."""
142
        if type(self._root_node) == tuple:
143
            # Demand-load the root
3735.2.31 by Robert Collins
CHKMap.iter_changes
144
            self._root_node = self._get_node(self._root_node)
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
145
            # XXX: Shouldn't this be put into _deserialize?
146
            self._root_node._search_key_func = self._search_key_func
3735.2.31 by Robert Collins
CHKMap.iter_changes
147
148
    def _get_node(self, node):
149
        """Get a node.
150
151
        Node that this does not update the _items dict in objects containing a
152
        reference to this node. As such it does not prevent subsequent IO being
153
        performed.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
154
3735.2.31 by Robert Collins
CHKMap.iter_changes
155
        :param node: A tuple key or node object.
156
        :return: A node object.
157
        """
158
        if type(node) == tuple:
159
            bytes = self._read_bytes(node)
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
160
            return _deserialise(bytes, node,
161
                search_key_func=self._search_key_func)
3735.2.31 by Robert Collins
CHKMap.iter_changes
162
        else:
163
            return node
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
164
165
    def _read_bytes(self, key):
166
        stream = self._store.get_record_stream([key], 'unordered', True)
167
        return stream.next().get_bytes_as('fulltext')
168
3735.15.16 by John Arbash Meinel
Properly fix up the dump_tree tests, we now suppress the keys by default.
169
    def _dump_tree(self, include_keys=False):
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
170
        """Return the tree in a string representation."""
171
        self._ensure_root()
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
172
        res = self._dump_tree_node(self._root_node, prefix='', indent='',
173
                                   include_keys=include_keys)
3735.11.9 by John Arbash Meinel
Switch _dump_tree to returning trailing '\n' for nicer results
174
        res.append('') # Give a trailing '\n'
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
175
        return '\n'.join(res)
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
176
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
177
    def _dump_tree_node(self, node, prefix, indent, include_keys=True):
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
178
        """For this node and all children, generate a string representation."""
179
        result = []
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
180
        if not include_keys:
181
            key_str = ''
182
        else:
183
            node_key = node.key()
184
            if node_key is not None:
185
                key_str = ' %s' % (node_key[0],)
186
            else:
187
                key_str = ' None'
188
        result.append('%s%r %s%s' % (indent, prefix, node.__class__.__name__,
189
                                     key_str))
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
190
        if isinstance(node, InternalNode):
191
            # Trigger all child nodes to get loaded
192
            list(node._iter_nodes(self._store))
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
193
            for prefix, sub in sorted(node._items.iteritems()):
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
194
                result.extend(self._dump_tree_node(sub, prefix, indent + '  ',
195
                                                   include_keys=include_keys))
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
196
        else:
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
197
            for key, value in sorted(node._items.iteritems()):
198
                result.append('      %r %r' % (key, value))
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
199
        return result
200
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
201
    @classmethod
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
202
    def from_dict(klass, store, initial_value, maximum_size=0, key_width=1):
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
203
        """Create a CHKMap in store with initial_value as the content.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
204
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
205
        :param store: The store to record initial_value in, a VersionedFiles
206
            object with 1-tuple keys supporting CHK key generation.
207
        :param initial_value: A dict to store in store. Its keys and values
208
            must be bytestrings.
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
209
        :param maximum_size: The maximum_size rule to apply to nodes. This
210
            determines the size at which no new data is added to a single node.
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
211
        :param key_width: The number of elements in each key_tuple being stored
212
            in this map.
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
213
        :return: The root chk of te resulting CHKMap.
214
        """
215
        result = CHKMap(store, None)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
216
        result._root_node.set_maximum_size(maximum_size)
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
217
        result._root_node._key_width = key_width
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
218
        delta = []
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
219
        for key, value in initial_value.items():
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
220
            delta.append((None, key, value))
221
        result.apply_delta(delta)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
222
        return result._save()
223
3735.2.30 by Robert Collins
Start iter_changes between CHKMap instances.
224
    def iter_changes(self, basis):
225
        """Iterate over the changes between basis and self.
226
227
        :return: An iterator of tuples: (key, old_value, new_value). Old_value
228
            is None for keys only in self; new_value is None for keys only in
229
            basis.
230
        """
3735.2.31 by Robert Collins
CHKMap.iter_changes
231
        # Overview:
232
        # Read both trees in lexographic, highest-first order.
233
        # Any identical nodes we skip
234
        # Any unique prefixes we output immediately.
235
        # values in a leaf node are treated as single-value nodes in the tree
236
        # which allows them to be not-special-cased. We know to output them
237
        # because their value is a string, not a key(tuple) or node.
238
        #
239
        # corner cases to beware of when considering this function:
240
        # *) common references are at different heights.
241
        #    consider two trees:
242
        #    {'a': LeafNode={'aaa':'foo', 'aab':'bar'}, 'b': LeafNode={'b'}}
243
        #    {'a': InternalNode={'aa':LeafNode={'aaa':'foo', 'aab':'bar'}, 'ab':LeafNode={'ab':'bar'}}
244
        #     'b': LeafNode={'b'}}
245
        #    the node with aaa/aab will only be encountered in the second tree
246
        #    after reading the 'a' subtree, but it is encountered in the first
247
        #    tree immediately. Variations on this may have read internal nodes like this.
248
        #    we want to cut the entire pending subtree when we realise we have a common node.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
249
        #    For this we use a list of keys - the path to a node - and check the entire path is
3735.2.31 by Robert Collins
CHKMap.iter_changes
250
        #    clean as we process each item.
251
        if self._node_key(self._root_node) == self._node_key(basis._root_node):
252
            return
253
        self._ensure_root()
254
        basis._ensure_root()
255
        excluded_keys = set()
256
        self_node = self._root_node
257
        basis_node = basis._root_node
258
        # A heap, each element is prefix, node(tuple/NodeObject/string),
259
        # key_path (a list of tuples, tail-sharing down the tree.)
260
        self_pending = []
261
        basis_pending = []
262
        def process_node(prefix, node, path, a_map, pending):
263
            # take a node and expand it
264
            node = a_map._get_node(node)
265
            if type(node) == LeafNode:
266
                path = (node._key, path)
267
                for key, value in node._items.items():
268
                    heapq.heappush(pending, ('\x00'.join(key), value, path))
269
            else:
270
                # type(node) == InternalNode
271
                path = (node._key, path)
272
                for prefix, child in node._items.items():
273
                    heapq.heappush(pending, (prefix, child, path))
274
        process_node(None, self_node, None, self, self_pending)
275
        process_node(None, basis_node, None, basis, basis_pending)
276
        self_seen = set()
277
        basis_seen = set()
278
        excluded_keys = set()
279
        def check_excluded(key_path):
280
            # Note that this is N^2, it depends on us trimming trees
281
            # aggressively to not become slow.
282
            # A better implementation would probably have a reverse map
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
283
            # back to the children of a node, and jump straight to it when
3735.2.31 by Robert Collins
CHKMap.iter_changes
284
            # a common node is detected, the proceed to remove the already
285
            # pending children. bzrlib.graph has a searcher module with a
286
            # similar problem.
287
            while key_path is not None:
288
                key, key_path = key_path
289
                if key in excluded_keys:
290
                    return True
291
            return False
292
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
293
        loop_counter = 0
3735.2.31 by Robert Collins
CHKMap.iter_changes
294
        while self_pending or basis_pending:
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
295
            loop_counter += 1
3735.2.31 by Robert Collins
CHKMap.iter_changes
296
            if not self_pending:
297
                # self is exhausted: output remainder of basis
298
                for prefix, node, path in basis_pending:
299
                    if check_excluded(path):
300
                        continue
301
                    node = basis._get_node(node)
302
                    if type(node) == str:
303
                        # a value
304
                        yield (tuple(prefix.split('\x00')), node, None)
305
                    else:
306
                        # subtree - fastpath the entire thing.
307
                        for key, value in node.iteritems(basis._store):
308
                            yield (key, value, None)
309
                return
310
            elif not basis_pending:
311
                # basis is exhausted: output remainder of self.
312
                for prefix, node, path in self_pending:
313
                    if check_excluded(path):
314
                        continue
315
                    node = self._get_node(node)
316
                    if type(node) == str:
317
                        # a value
318
                        yield (tuple(prefix.split('\x00')), None, node)
319
                    else:
320
                        # subtree - fastpath the entire thing.
321
                        for key, value in node.iteritems(self._store):
322
                            yield (key, None, value)
323
                return
324
            else:
325
                # XXX: future optimisation - yield the smaller items
326
                # immediately rather than pushing everything on/off the
327
                # heaps. Applies to both internal nodes and leafnodes.
328
                if self_pending[0][0] < basis_pending[0][0]:
329
                    # expand self
330
                    prefix, node, path = heapq.heappop(self_pending)
331
                    if check_excluded(path):
332
                        continue
333
                    if type(node) == str:
334
                        # a value
335
                        yield (tuple(prefix.split('\x00')), None, node)
336
                    else:
337
                        process_node(prefix, node, path, self, self_pending)
338
                        continue
339
                elif self_pending[0][0] > basis_pending[0][0]:
340
                    # expand basis
341
                    prefix, node, path = heapq.heappop(basis_pending)
342
                    if check_excluded(path):
343
                        continue
344
                    if type(node) == str:
345
                        # a value
346
                        yield (tuple(prefix.split('\x00')), node, None)
347
                    else:
348
                        process_node(prefix, node, path, basis, basis_pending)
349
                        continue
350
                else:
351
                    # common prefix: possibly expand both
352
                    if type(self_pending[0][1]) != str:
353
                        # process next self
354
                        read_self = True
355
                    else:
356
                        read_self = False
357
                    if type(basis_pending[0][1]) != str:
358
                        # process next basis
359
                        read_basis = True
360
                    else:
361
                        read_basis = False
362
                    if not read_self and not read_basis:
363
                        # compare a common value
364
                        self_details = heapq.heappop(self_pending)
365
                        basis_details = heapq.heappop(basis_pending)
366
                        if self_details[1] != basis_details[1]:
367
                            yield (tuple(self_details[0].split('\x00')),
368
                                basis_details[1], self_details[1])
369
                        continue
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
370
                    # At least one side wasn't a string.
371
                    if (self._node_key(self_pending[0][1]) ==
372
                        self._node_key(basis_pending[0][1])):
373
                        # Identical pointers, skip (and don't bother adding to
374
                        # excluded, it won't turn up again.
375
                        heapq.heappop(self_pending)
376
                        heapq.heappop(basis_pending)
377
                        continue
378
                    # Now we need to expand this node before we can continue
3735.2.31 by Robert Collins
CHKMap.iter_changes
379
                    if read_self:
380
                        prefix, node, path = heapq.heappop(self_pending)
381
                        if check_excluded(path):
382
                            continue
383
                        process_node(prefix, node, path, self, self_pending)
384
                    if read_basis:
385
                        prefix, node, path = heapq.heappop(basis_pending)
386
                        if check_excluded(path):
387
                            continue
388
                        process_node(prefix, node, path, basis, basis_pending)
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
389
        # print loop_counter
3735.2.30 by Robert Collins
Start iter_changes between CHKMap instances.
390
3735.2.9 by Robert Collins
Get a working chk_map using inventory implementation bootstrapped.
391
    def iteritems(self, key_filter=None):
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
392
        """Iterate over the entire CHKMap's contents."""
393
        self._ensure_root()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
394
        return self._root_node.iteritems(self._store, key_filter=key_filter)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
395
3735.2.12 by Robert Collins
Implement commit-via-deltas for split inventory repositories.
396
    def key(self):
397
        """Return the key for this map."""
398
        if isinstance(self._root_node, tuple):
399
            return self._root_node
400
        else:
401
            return self._root_node._key
402
3735.2.17 by Robert Collins
Cache node length to avoid full iteration on __len__ calls.
403
    def __len__(self):
404
        self._ensure_root()
405
        return len(self._root_node)
406
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
407
    def map(self, key, value):
408
        """Map a key tuple to value."""
409
        # Need a root object.
410
        self._ensure_root()
411
        prefix, node_details = self._root_node.map(self._store, key, value)
412
        if len(node_details) == 1:
413
            self._root_node = node_details[0][1]
414
        else:
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
415
            self._root_node = InternalNode(prefix,
416
                                search_key_func=self._search_key_func)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
417
            self._root_node.set_maximum_size(node_details[0][1].maximum_size)
418
            self._root_node._key_width = node_details[0][1]._key_width
419
            for split, node in node_details:
420
                self._root_node.add_node(split, node)
421
3735.2.31 by Robert Collins
CHKMap.iter_changes
422
    def _node_key(self, node):
423
        """Get the key for a node whether its a tuple o r node."""
424
        if type(node) == tuple:
425
            return node
426
        else:
427
            return node._key
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
428
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
429
    def unmap(self, key):
430
        """remove key from the map."""
431
        self._ensure_root()
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
432
        unmapped = self._root_node.unmap(self._store, key)
433
        self._root_node = unmapped
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
434
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
435
    def _save(self):
436
        """Save the map completely.
437
438
        :return: The key of the root node.
439
        """
440
        if type(self._root_node) == tuple:
441
            # Already saved.
442
            return self._root_node
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
443
        keys = list(self._root_node.serialise(self._store))
444
        return keys[-1]
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
445
446
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
447
class Node(object):
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
448
    """Base class defining the protocol for CHK Map nodes.
449
450
    :ivar _raw_size: The total size of the serialized key:value data, before
451
        adding the header bytes, and without prefix compression.
452
    """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
453
454
    def __init__(self, key_width=1):
455
        """Create a node.
456
457
        :param key_width: The width of keys for this node.
458
        """
459
        self._key = None
460
        # Current number of elements
461
        self._len = 0
462
        self._maximum_size = 0
463
        self._key_width = 1
464
        # current size in bytes
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
465
        self._raw_size = 0
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
466
        # The pointers/values this node has - meaning defined by child classes.
467
        self._items = {}
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
468
        # The common search prefix
469
        self._search_prefix = None
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
470
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
471
    def __repr__(self):
472
        items_str = sorted(self._items)
473
        if len(items_str) > 20:
474
            items_str = items_str[16] + '...]'
3735.15.3 by John Arbash Meinel
repr update
475
        return '%s(key:%s len:%s size:%s max:%s prefix:%s items:%s)' % (
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
476
            self.__class__.__name__, self._key, self._len, self._raw_size,
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
477
            self._maximum_size, self._search_prefix, items_str)
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
478
3735.2.38 by Robert Collins
Sufficient fixes to allow bzr-search to index a dev3 format repository.
479
    def key(self):
480
        return self._key
481
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
482
    def __len__(self):
483
        return self._len
484
485
    @property
486
    def maximum_size(self):
487
        """What is the upper limit for adding references to a node."""
488
        return self._maximum_size
489
490
    def set_maximum_size(self, new_size):
491
        """Set the size threshold for nodes.
492
493
        :param new_size: The size at which no data is added to a node. 0 for
494
            unlimited.
495
        """
496
        self._maximum_size = new_size
497
3735.15.5 by John Arbash Meinel
Change the nomenclature.
498
    @classmethod
499
    def common_prefix(cls, prefix, key):
500
        """Given 2 strings, return the longest prefix common to both.
501
502
        :param prefix: This has been the common prefix for other keys, so it is
503
            more likely to be the common prefix in this case as well.
504
        :param key: Another string to compare to
505
        """
506
        if key.startswith(prefix):
507
            return prefix
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
508
        # Is there a better way to do this?
3735.15.5 by John Arbash Meinel
Change the nomenclature.
509
        for pos, (left, right) in enumerate(zip(prefix, key)):
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
510
            if left != right:
511
                pos -= 1
512
                break
3735.15.5 by John Arbash Meinel
Change the nomenclature.
513
        assert pos <= len(prefix)
514
        assert pos <= len(key)
515
        common = prefix[:pos+1]
516
        assert key.startswith(common)
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
517
        return common
518
3735.15.5 by John Arbash Meinel
Change the nomenclature.
519
    @classmethod
520
    def common_prefix_for_keys(cls, keys):
521
        """Given a list of keys, find their common prefix.
522
523
        :param keys: An iterable of strings.
524
        :return: The longest common prefix of all keys.
525
        """
526
        common_prefix = None
527
        for key in keys:
528
            if common_prefix is None:
529
                common_prefix = key
530
                continue
531
            common_prefix = cls.common_prefix(common_prefix, key)
532
            if not common_prefix:
533
                # if common_prefix is the empty string, then we know it won't
534
                # change further
535
                return ''
536
        return common_prefix
537
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
538
539
class LeafNode(Node):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
540
    """A node containing actual key:value pairs.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
541
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
542
    :ivar _items: A dict of key->value items. The key is in tuple form.
3735.15.4 by John Arbash Meinel
Clean up some little bits.
543
    :ivar _size: The number of bytes that would be used by serializing all of
544
        the key/value pairs.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
545
    """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
546
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
547
    def __init__(self, search_key_func=None):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
548
        Node.__init__(self)
3735.15.4 by John Arbash Meinel
Clean up some little bits.
549
        # All of the keys in this leaf node share this common prefix
3735.15.5 by John Arbash Meinel
Change the nomenclature.
550
        self._common_serialised_prefix = None
551
        self._serialise_key = '\x00'.join
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
552
        if search_key_func is None:
3735.16.6 by John Arbash Meinel
Include a _search_key_plain function.
553
            self._search_key_func = _search_key_plain
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
554
        else:
555
            self._search_key_func = search_key_func
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
556
557
    def _current_size(self):
3735.15.4 by John Arbash Meinel
Clean up some little bits.
558
        """Answer the current serialised size of this node.
559
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
560
        This differs from self._raw_size in that it includes the bytes used for
561
        the header.
3735.15.4 by John Arbash Meinel
Clean up some little bits.
562
        """
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
563
        if self._common_serialised_prefix is None:
564
            bytes_for_items = 0
565
        else:
566
            # We will store a single string with the common prefix
567
            # And then that common prefix will not be stored in any of the
568
            # entry lines
569
            prefix_len = len(self._common_serialised_prefix)
570
            bytes_for_items = (self._raw_size - (prefix_len * (self._len - 1)))
571
        return (13 # bytes overhead for the header and separators
3735.15.8 by John Arbash Meinel
Add asserts so that when serializing and deserializing
572
            + len(str(self._maximum_size))
573
            + len(str(self._key_width))
3735.15.4 by John Arbash Meinel
Clean up some little bits.
574
            + len(str(self._len))
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
575
            + bytes_for_items)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
576
577
    @classmethod
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
578
    def deserialise(klass, bytes, key, search_key_func=None):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
579
        """Deserialise bytes, with key key, into a LeafNode.
580
581
        :param bytes: The bytes of the node.
582
        :param key: The key that the serialised node has.
583
        """
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
584
        result = LeafNode(search_key_func=search_key_func)
3735.2.72 by John Arbash Meinel
Change deserialise to properly handle when there is a '\r' in the key.
585
        # Splitlines can split on '\r' so don't use it, split('\n') adds an
586
        # extra '' if the bytes ends in a final newline.
587
        lines = bytes.split('\n')
588
        assert lines[-1] == ''
589
        lines.pop(-1)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
590
        items = {}
591
        if lines[0] != 'chkleaf:':
592
            raise ValueError("not a serialised leaf node: %r" % bytes)
593
        maximum_size = int(lines[1])
594
        width = int(lines[2])
595
        length = int(lines[3])
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
596
        prefix = lines[4]
597
        for line in lines[5:]:
598
            line = prefix + line
3735.2.25 by Robert Collins
CHKInventory core tests passing.
599
            elements = line.split('\x00', width)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
600
            items[tuple(elements[:-1])] = elements[-1]
601
        if len(items) != length:
3735.14.3 by John Arbash Meinel
Properly remove keys that are found in the page cache. And add some debugging.
602
            raise AssertionError("item count (%d) mismatch for key %s,"
603
                " bytes %r" % (length, key, bytes))
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
604
        result._items = items
605
        result._len = length
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
606
        assert length == len(lines) - 5
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
607
        result._maximum_size = maximum_size
608
        result._key = key
609
        result._key_width = width
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
610
        result._raw_size = (sum(map(len, lines[5:])) # the length of the suffix
611
            + (length)*(len(prefix)+1)) # prefix + '\n'
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
612
        result._compute_search_prefix()
3735.15.5 by John Arbash Meinel
Change the nomenclature.
613
        result._compute_serialised_prefix()
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
614
        if len(bytes) != result._current_size():
615
            import pdb; pdb.set_trace()
3735.15.8 by John Arbash Meinel
Add asserts so that when serializing and deserializing
616
        assert len(bytes) == result._current_size()
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
617
        return result
618
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
619
    def iteritems(self, store, key_filter=None):
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
620
        """Iterate over items in the node.
621
622
        :param key_filter: A filter to apply to the node. It should be a
623
            list/set/dict or similar repeatedly iterable container.
624
        """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
625
        if key_filter is not None:
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
626
            # Adjust the filter - short elements go to a prefix filter. Would this
627
            # be cleaner explicitly? That would be no harder for InternalNode..
628
            # XXX: perhaps defaultdict? Profiling<rinse and repeat>
629
            filters = {}
630
            for key in key_filter:
631
                length_filter = filters.setdefault(len(key), set())
632
                length_filter.add(key)
633
            filters = filters.items()
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
634
            for item in self._items.iteritems():
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
635
                for length, length_filter in filters:
636
                    if item[0][:length] in length_filter:
637
                        yield item
638
                        break
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
639
        else:
640
            for item in self._items.iteritems():
641
                yield item
642
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
643
    def _key_value_len(self, key, value):
644
        # TODO: Should probably be done without actually joining the key, but
645
        #       then that can be done via the C extension
3735.15.5 by John Arbash Meinel
Change the nomenclature.
646
        return 2 + len(self._serialise_key(key)) + len(value)
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
647
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
648
    def _search_key(self, key):
649
        return self._search_key_func(key)
650
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
651
    def _map_no_split(self, key, value):
652
        """Map a key to a value.
653
654
        This assumes either the key does not already exist, or you have already
655
        removed its size and length from self.
656
657
        :return: True if adding this node should cause us to split.
658
        """
659
        self._items[key] = value
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
660
        self._raw_size += self._key_value_len(key, value)
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
661
        self._len += 1
3735.15.5 by John Arbash Meinel
Change the nomenclature.
662
        serialised_key = self._serialise_key(key)
663
        if self._common_serialised_prefix is None:
664
            self._common_serialised_prefix = serialised_key
665
        else:
666
            self._common_serialised_prefix = self.common_prefix(
667
                self._common_serialised_prefix, serialised_key)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
668
        search_key = self._search_key(key)
669
        if self._search_prefix is None:
670
            self._search_prefix = search_key
3735.15.5 by John Arbash Meinel
Change the nomenclature.
671
        else:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
672
            self._search_prefix = self.common_prefix(
673
                self._search_prefix, search_key)
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
674
        if (self._len > 1
675
            and self._maximum_size
3735.16.10 by John Arbash Meinel
Don't track state for an infrequent edge case.
676
            and self._current_size() > self._maximum_size):
677
            # Check to see if all of the search_keys for this node are
678
            # identical. We allow the node to grow under that circumstance
679
            # (we could track this as common state, but it is infrequent)
680
            if (search_key != self._search_prefix
681
                or not self._are_search_keys_identical()):
682
                return True
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
683
        return False
684
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
685
    def _split(self, store):
686
        """We have overflowed.
687
688
        Split this node into multiple LeafNodes, return it up the stack so that
689
        the next layer creates a new InternalNode and references the new nodes.
690
3735.15.5 by John Arbash Meinel
Change the nomenclature.
691
        :return: (common_serialised_prefix, [(node_serialised_prefix, node)])
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
692
        """
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
693
        common_prefix = self._search_prefix
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
694
        split_at = len(common_prefix) + 1
695
        result = {}
696
        for key, value in self._items.iteritems():
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
697
            search_key = self._search_key(key)
698
            prefix = search_key[:split_at]
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
699
            # TODO: Generally only 1 key can be exactly the right length,
700
            #       which means we can only have 1 key in the node pointed
701
            #       at by the 'prefix\0' key. We might want to consider
702
            #       folding it into the containing InternalNode rather than
703
            #       having a fixed length-1 node.
704
            #       Note this is probably not true for hash keys, as they
705
            #       may get a '\00' node anywhere, but won't have keys of
706
            #       different lengths.
707
            if len(prefix) < split_at:
708
                prefix += '\x00'*(split_at - len(prefix))
709
            if prefix not in result:
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
710
                node = LeafNode(search_key_func=self._search_key_func)
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
711
                node.set_maximum_size(self._maximum_size)
712
                node._key_width = self._key_width
713
                result[prefix] = node
714
            else:
715
                node = result[prefix]
716
            node.map(store, key, value)
717
        return common_prefix, result.items()
718
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
719
    def map(self, store, key, value):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
720
        """Map key to value."""
721
        if key in self._items:
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
722
            self._raw_size -= self._key_value_len(key, self._items[key])
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
723
            self._len -= 1
724
        self._key = None
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
725
        if self._map_no_split(key, value):
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
726
            return self._split(store)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
727
        else:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
728
            return self._search_prefix, [("", self)]
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
729
730
    def serialise(self, store):
731
        """Serialise the tree to store.
732
733
        :param store: A VersionedFiles honouring the CHK extensions.
734
        :return: An iterable of the keys inserted by this operation.
735
        """
736
        lines = ["chkleaf:\n"]
737
        lines.append("%d\n" % self._maximum_size)
738
        lines.append("%d\n" % self._key_width)
739
        lines.append("%d\n" % self._len)
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
740
        if self._common_serialised_prefix is None:
741
            lines.append('\n')
742
        else:
743
            lines.append('%s\n' % (self._common_serialised_prefix,))
744
            prefix_len = len(self._common_serialised_prefix)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
745
        for key, value in sorted(self._items.items()):
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
746
            serialized = "%s\x00%s\n" % (self._serialise_key(key), value)
747
            assert serialized.startswith(self._common_serialised_prefix)
748
            lines.append(serialized[prefix_len:])
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
749
        sha1, _, _ = store.add_lines((None,), (), lines)
750
        self._key = ("sha1:" + sha1,)
3735.15.8 by John Arbash Meinel
Add asserts so that when serializing and deserializing
751
        bytes = ''.join(lines)
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
752
        if len(bytes) != self._current_size():
753
            import pdb; pdb.set_trace()
3735.15.8 by John Arbash Meinel
Add asserts so that when serializing and deserializing
754
        assert len(bytes) == self._current_size()
755
        _page_cache.add(self._key, bytes)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
756
        return [self._key]
757
3735.2.26 by Robert Collins
CHKInventory migrated to new CHKMap code.
758
    def refs(self):
759
        """Return the references to other CHK's held by this node."""
760
        return []
761
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
762
    def _compute_search_prefix(self):
763
        """Determine the common search prefix for all keys in this node.
3735.15.5 by John Arbash Meinel
Change the nomenclature.
764
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
765
        :return: A bytestring of the longest search key prefix that is
3735.15.5 by John Arbash Meinel
Change the nomenclature.
766
            unique within this node.
767
        """
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
768
        search_keys = [self._search_key(key) for key in self._items]
769
        self._search_prefix = self.common_prefix_for_keys(search_keys)
770
        return self._search_prefix
3735.15.5 by John Arbash Meinel
Change the nomenclature.
771
3735.16.10 by John Arbash Meinel
Don't track state for an infrequent edge case.
772
    def _are_search_keys_identical(self):
773
        """Check to see if the search keys for all entries are the same.
774
775
        When using a hash as the search_key it is possible for non-identical
776
        keys to collide. If that happens enough, we may try overflow a
777
        LeafNode, but as all are collisions, we must not split.
778
        """
779
        common_search_key = None
780
        for key in self._items:
781
            search_key = self._search_key(key)
782
            if common_search_key is None:
783
                common_search_key = search_key
784
            elif search_key != common_search_key:
785
                return False
786
        return True
787
3735.15.5 by John Arbash Meinel
Change the nomenclature.
788
    def _compute_serialised_prefix(self):
789
        """Determine the common prefix for serialised keys in this node.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
790
791
        :return: A bytestring of the longest serialised key prefix that is
792
            unique within this node.
793
        """
3735.15.5 by John Arbash Meinel
Change the nomenclature.
794
        serialised_keys = [self._serialise_key(key) for key in self._items]
795
        self._common_serialised_prefix = self.common_prefix_for_keys(
796
            serialised_keys)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
797
798
    def unmap(self, store, key):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
799
        """Unmap key from the node."""
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
800
        self._raw_size -= self._key_value_len(key, self._items[key])
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
801
        self._len -= 1
802
        del self._items[key]
803
        self._key = None
3735.15.2 by John Arbash Meinel
Change LeafNode to also cache its unique serialized prefix.
804
        # Recompute from scratch
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
805
        self._compute_search_prefix()
3735.15.5 by John Arbash Meinel
Change the nomenclature.
806
        self._compute_serialised_prefix()
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
807
        return self
808
809
810
class InternalNode(Node):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
811
    """A node that contains references to other nodes.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
812
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
813
    An InternalNode is responsible for mapping search key prefixes to child
3735.15.5 by John Arbash Meinel
Change the nomenclature.
814
    nodes.
3735.15.4 by John Arbash Meinel
Clean up some little bits.
815
3735.15.5 by John Arbash Meinel
Change the nomenclature.
816
    :ivar _items: serialised_key => node dictionary. node may be a tuple,
3735.15.4 by John Arbash Meinel
Clean up some little bits.
817
        LeafNode or InternalNode.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
818
    """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
819
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
820
    def __init__(self, prefix='', search_key_func=None):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
821
        Node.__init__(self)
822
        # The size of an internalnode with default values and no children.
823
        # How many octets key prefixes within this node are.
824
        self._node_width = 0
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
825
        self._search_prefix = prefix
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
826
        if search_key_func is None:
3735.16.6 by John Arbash Meinel
Include a _search_key_plain function.
827
            self._search_key_func = _search_key_plain
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
828
        else:
829
            self._search_key_func = search_key_func
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
830
831
    def __repr__(self):
832
        items_str = sorted(self._items)
833
        if len(items_str) > 20:
834
            items_str = items_str[16] + '...]'
835
        return '%s(key:%s len:%s size:%s max:%s prefix:%s items:%s)' % (
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
836
            self.__class__.__name__, self._key, self._len, self._raw_size,
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
837
            self._maximum_size, self._search_prefix, items_str)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
838
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
839
    def add_node(self, prefix, node):
840
        """Add a child node with prefix prefix, and node node.
841
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
842
        :param prefix: The search key prefix for node.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
843
        :param node: The node being added.
844
        """
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
845
        assert self._search_prefix is not None
846
        assert prefix.startswith(self._search_prefix)
847
        assert len(prefix) == len(self._search_prefix) + 1
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
848
        self._len += len(node)
849
        if not len(self._items):
850
            self._node_width = len(prefix)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
851
        assert self._node_width == len(self._search_prefix) + 1
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
852
        self._items[prefix] = node
853
        self._key = None
854
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
855
    def _current_size(self):
856
        """Answer the current serialised size of this node."""
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
857
        return (self._raw_size + len(str(self._len)) + len(str(self._key_width)) +
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
858
            len(str(self._maximum_size)))
859
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
860
    @classmethod
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
861
    def deserialise(klass, bytes, key, search_key_func=None):
3735.2.25 by Robert Collins
CHKInventory core tests passing.
862
        """Deserialise bytes to an InternalNode, with key key.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
863
864
        :param bytes: The bytes of the node.
865
        :param key: The key that the serialised node has.
866
        :return: An InternalNode instance.
867
        """
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
868
        result = InternalNode(search_key_func=search_key_func)
3735.2.72 by John Arbash Meinel
Change deserialise to properly handle when there is a '\r' in the key.
869
        # Splitlines can split on '\r' so don't use it, remove the extra ''
870
        # from the result of split('\n') because we should have a trailing
871
        # newline
872
        lines = bytes.split('\n')
873
        assert lines[-1] == ''
874
        lines.pop(-1)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
875
        items = {}
876
        if lines[0] != 'chknode:':
877
            raise ValueError("not a serialised internal node: %r" % bytes)
878
        maximum_size = int(lines[1])
879
        width = int(lines[2])
880
        length = int(lines[3])
3735.15.11 by John Arbash Meinel
Change the InternalNodes to also pull out the common prefix.
881
        common_prefix = lines[4]
882
        for line in lines[5:]:
883
            line = common_prefix + line
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
884
            prefix, flat_key = line.rsplit('\x00', 1)
885
            items[prefix] = (flat_key,)
886
        result._items = items
887
        result._len = length
888
        result._maximum_size = maximum_size
889
        result._key = key
890
        result._key_width = width
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
891
        # XXX: InternalNodes don't really care about their size, and this will
892
        #      change if we add prefix compression
893
        result._raw_size = None # len(bytes)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
894
        result._node_width = len(prefix)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
895
        result._compute_search_prefix()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
896
        return result
897
898
    def iteritems(self, store, key_filter=None):
899
        for node in self._iter_nodes(store, key_filter=key_filter):
900
            for item in node.iteritems(store, key_filter=key_filter):
901
                yield item
902
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
903
    def _iter_nodes(self, store, key_filter=None, batch_size=None):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
904
        """Iterate over node objects which match key_filter.
905
906
        :param store: A store to use for accessing content.
907
        :param key_filter: A key filter to filter nodes. Only nodes that might
908
            contain a key in key_filter will be returned.
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
909
        :param batch_size: If not None, then we will return the nodes that had
910
            to be read using get_record_stream in batches, rather than reading
911
            them all at once.
912
        :return: An iterable of nodes. This function does not have to be fully
913
            consumed.  (There will be no pending I/O when items are being returned.)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
914
        """
3735.2.31 by Robert Collins
CHKMap.iter_changes
915
        keys = {}
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
916
        if key_filter is None:
3735.2.31 by Robert Collins
CHKMap.iter_changes
917
            for prefix, node in self._items.iteritems():
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
918
                if type(node) == tuple:
3735.2.31 by Robert Collins
CHKMap.iter_changes
919
                    keys[node] = prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
920
                else:
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
921
                    yield node
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
922
        else:
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
923
            # XXX defaultdict ?
924
            length_filters = {}
925
            for key in key_filter:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
926
                search_key = self._search_prefix_filter(key)
927
                length_filter = length_filters.setdefault(
928
                                    len(search_key), set())
929
                length_filter.add(search_key)
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
930
            length_filters = length_filters.items()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
931
            for prefix, node in self._items.iteritems():
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
932
                for length, length_filter in length_filters:
933
                    if prefix[:length] in length_filter:
934
                        if type(node) == tuple:
935
                            keys[node] = prefix
936
                        else:
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
937
                            yield node
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
938
                        break
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
939
        if keys:
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
940
            # Look in the page cache for some more bytes
941
            found_keys = set()
942
            for key in keys:
943
                try:
944
                    bytes = _page_cache[key]
945
                except KeyError:
946
                    continue
947
                else:
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
948
                    node = _deserialise(bytes, key,
949
                        search_key_func=self._search_key_func)
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
950
                    self._items[keys[key]] = node
951
                    found_keys.add(key)
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
952
                    yield node
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
953
            for key in found_keys:
954
                del keys[key]
955
        if keys:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
956
            # demand load some pages.
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
957
            if batch_size is None:
958
                # Read all the keys in
959
                batch_size = len(keys)
960
            key_order = list(keys)
961
            for batch_start in range(0, len(key_order), batch_size):
962
                batch = key_order[batch_start:batch_start + batch_size]
963
                # We have to fully consume the stream so there is no pending
964
                # I/O, so we buffer the nodes for now.
965
                stream = store.get_record_stream(batch, 'unordered', True)
966
                nodes = []
967
                for record in stream:
968
                    bytes = record.get_bytes_as('fulltext')
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
969
                    node = _deserialise(bytes, record.key,
970
                        search_key_func=self._search_key_func)
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
971
                    nodes.append(node)
972
                    self._items[keys[record.key]] = node
973
                    _page_cache.add(record.key, bytes)
974
                for node in nodes:
975
                    yield node
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
976
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
977
    def map(self, store, key, value):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
978
        """Map key to value."""
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
979
        if not len(self._items):
980
            raise AssertionError("cant map in an empty InternalNode.")
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
981
        search_key = self._search_key(key)
982
        assert self._node_width == len(self._search_prefix) + 1
983
        if not search_key.startswith(self._search_prefix):
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
984
            # This key doesn't fit in this index, so we need to split at the
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
985
            # point where it would fit, insert self into that internal node,
986
            # and then map this key into that node.
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
987
            new_prefix = self.common_prefix(self._search_prefix,
988
                                            search_key)
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
989
            new_parent = InternalNode(new_prefix,
990
                search_key_func=self._search_key_func)
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
991
            new_parent.set_maximum_size(self._maximum_size)
992
            new_parent._key_width = self._key_width
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
993
            new_parent.add_node(self._search_prefix[:len(new_prefix)+1],
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
994
                                self)
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
995
            return new_parent.map(store, key, value)
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
996
        children = list(self._iter_nodes(store, key_filter=[key]))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
997
        if children:
998
            child = children[0]
999
        else:
1000
            # new child needed:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1001
            child = self._new_child(search_key, LeafNode)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1002
        old_len = len(child)
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
1003
        if isinstance(child, LeafNode):
1004
            old_size = child._current_size()
1005
        else:
1006
            old_size = None
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1007
        prefix, node_details = child.map(store, key, value)
1008
        if len(node_details) == 1:
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
1009
            # child may have shrunk, or might be a new node
1010
            child = node_details[0][1]
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1011
            self._len = self._len - old_len + len(child)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1012
            self._items[search_key] = child
3735.2.29 by Robert Collins
Untested code is broken code.
1013
            self._key = None
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
1014
            new_node = self
1015
            if (isinstance(child, LeafNode)
1016
                and (old_size is None or child._current_size() < old_size)):
1017
                # The old node was an InternalNode which means it has now
1018
                # collapsed, so we need to check if it will chain to a collapse
1019
                # at this level. Or the LeafNode has shrunk in size, so we need
1020
                # to check that as well.
1021
                new_node = self._check_remap(store)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1022
            assert new_node._search_prefix is not None
1023
            return new_node._search_prefix, [('', new_node)]
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1024
        # child has overflown - create a new intermediate node.
1025
        # XXX: This is where we might want to try and expand our depth
1026
        # to refer to more bytes of every child (which would give us
1027
        # multiple pointers to child nodes, but less intermediate nodes)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1028
        child = self._new_child(search_key, InternalNode)
1029
        child._search_prefix = prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1030
        for split, node in node_details:
1031
            child.add_node(split, node)
1032
        self._len = self._len - old_len + len(child)
3735.2.29 by Robert Collins
Untested code is broken code.
1033
        self._key = None
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1034
        return self._search_prefix, [("", self)]
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1035
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1036
    def _new_child(self, search_key, klass):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1037
        """Create a new child node of type klass."""
1038
        child = klass()
1039
        child.set_maximum_size(self._maximum_size)
1040
        child._key_width = self._key_width
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1041
        child._search_key_func = self._search_key_func
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1042
        self._items[search_key] = child
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1043
        return child
1044
1045
    def serialise(self, store):
1046
        """Serialise the node to store.
1047
1048
        :param store: A VersionedFiles honouring the CHK extensions.
1049
        :return: An iterable of the keys inserted by this operation.
1050
        """
1051
        for node in self._items.itervalues():
1052
            if type(node) == tuple:
1053
                # Never deserialised.
1054
                continue
1055
            if node._key is not None:
1056
                # Never altered
1057
                continue
1058
            for key in node.serialise(store):
1059
                yield key
1060
        lines = ["chknode:\n"]
1061
        lines.append("%d\n" % self._maximum_size)
1062
        lines.append("%d\n" % self._key_width)
1063
        lines.append("%d\n" % self._len)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1064
        assert self._search_prefix is not None
1065
        lines.append('%s\n' % (self._search_prefix,))
1066
        prefix_len = len(self._search_prefix)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1067
        for prefix, node in sorted(self._items.items()):
1068
            if type(node) == tuple:
1069
                key = node[0]
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1070
            else:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1071
                key = node._key[0]
3735.15.11 by John Arbash Meinel
Change the InternalNodes to also pull out the common prefix.
1072
            serialised = "%s\x00%s\n" % (prefix, key)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1073
            assert serialised.startswith(self._search_prefix)
3735.15.11 by John Arbash Meinel
Change the InternalNodes to also pull out the common prefix.
1074
            lines.append(serialised[prefix_len:])
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1075
        sha1, _, _ = store.add_lines((None,), (), lines)
1076
        self._key = ("sha1:" + sha1,)
3735.2.63 by Robert Collins
Divert writes into the CHK page cache as well.
1077
        _page_cache.add(self._key, ''.join(lines))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1078
        yield self._key
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1079
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1080
    def _search_key(self, key):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1081
        """Return the serialised key for key in this node."""
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1082
        # search keys are fixed width. All will be self._node_width wide, so we
3735.15.5 by John Arbash Meinel
Change the nomenclature.
1083
        # pad as necessary.
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
1084
        return (self._search_key_func(key) + '\x00'*self._node_width)[:self._node_width]
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1085
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1086
    def _search_prefix_filter(self, key):
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
1087
        """Serialise key for use as a prefix filter in iteritems."""
1088
        if len(key) == self._key_width:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1089
            return self._search_key(key)
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
1090
        return '\x00'.join(key)[:self._node_width]
1091
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1092
    def _split(self, offset):
1093
        """Split this node into smaller nodes starting at offset.
1094
1095
        :param offset: The offset to start the new child nodes at.
1096
        :return: An iterable of (prefix, node) tuples. prefix is a byte
1097
            prefix for reaching node.
1098
        """
1099
        if offset >= self._node_width:
1100
            for node in self._items.values():
1101
                for result in node._split(offset):
1102
                    yield result
1103
            return
1104
        for key, node in self._items.items():
1105
            pass
1106
3735.2.26 by Robert Collins
CHKInventory migrated to new CHKMap code.
1107
    def refs(self):
1108
        """Return the references to other CHK's held by this node."""
1109
        if self._key is None:
1110
            raise AssertionError("unserialised nodes have no refs.")
1111
        refs = []
1112
        for value in self._items.itervalues():
1113
            if type(value) == tuple:
1114
                refs.append(value)
1115
            else:
1116
                refs.append(value.key())
1117
        return refs
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1118
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1119
    def _compute_search_prefix(self, extra_key=None):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1120
        """Return the unique key prefix for this node.
1121
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1122
        :return: A bytestring of the longest search key prefix that is
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1123
            unique within this node.
1124
        """
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1125
        self._search_prefix = self.common_prefix_for_keys(self._items)
1126
        return self._search_prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1127
1128
    def unmap(self, store, key):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1129
        """Remove key from this node and it's children."""
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1130
        if not len(self._items):
1131
            raise AssertionError("cant unmap in an empty InternalNode.")
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
1132
        children = list(self._iter_nodes(store, key_filter=[key]))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1133
        if children:
1134
            child = children[0]
1135
        else:
1136
            raise KeyError(key)
1137
        self._len -= 1
1138
        unmapped = child.unmap(store, key)
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1139
        self._key = None
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1140
        search_key = self._search_key(key)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1141
        if len(unmapped) == 0:
1142
            # All child nodes are gone, remove the child:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1143
            del self._items[search_key]
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1144
            unmapped = None
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1145
        else:
1146
            # Stash the returned node
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1147
            self._items[search_key] = unmapped
3735.2.23 by Robert Collins
Test unmapping with one child left but multiple keys.
1148
        if len(self._items) == 1:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1149
            # this node is no longer needed:
1150
            return self._items.values()[0]
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1151
        if isinstance(unmapped, InternalNode):
1152
            return self
1153
        return self._check_remap(store)
1154
1155
    def _check_remap(self, store):
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
1156
        """Check if all keys contained by children fit in a single LeafNode.
1157
1158
        :param store: A store to use for reading more nodes
1159
        :return: Either self, or a new LeafNode which should replace self.
1160
        """
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1161
        # Logic for how we determine when we need to rebuild
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1162
        # 1) Implicitly unmap() is removing a key which means that the child
1163
        #    nodes are going to be shrinking by some extent.
1164
        # 2) If all children are LeafNodes, it is possible that they could be
1165
        #    combined into a single LeafNode, which can then completely replace
1166
        #    this internal node with a single LeafNode
1167
        # 3) If *one* child is an InternalNode, we assume it has already done
1168
        #    all the work to determine that its children cannot collapse, and
1169
        #    we can then assume that those nodes *plus* the current nodes don't
1170
        #    have a chance of collapsing either.
1171
        #    So a very cheap check is to just say if 'unmapped' is an
1172
        #    InternalNode, we don't have to check further.
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1173
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1174
        # TODO: Another alternative is to check the total size of all known
1175
        #       LeafNodes. If there is some formula we can use to determine the
1176
        #       final size without actually having to read in any more
1177
        #       children, it would be nice to have. However, we have to be
1178
        #       careful with stuff like nodes that pull out the common prefix
1179
        #       of each key, as adding a new key can change the common prefix
1180
        #       and cause size changes greater than the length of one key.
1181
        #       So for now, we just add everything to a new Leaf until it
1182
        #       splits, as we know that will give the right answer
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1183
        new_leaf = LeafNode(search_key_func=self._search_key_func)
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1184
        new_leaf.set_maximum_size(self._maximum_size)
1185
        new_leaf._key_width = self._key_width
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
1186
        # A batch_size of 16 was chosen because:
1187
        #   a) In testing, a 4k page held 14 times. So if we have more than 16
1188
        #      leaf nodes we are unlikely to hold them in a single new leaf
1189
        #      node. This still allows for 1 round trip
1190
        #   b) With 16-way fan out, we can still do a single round trip
1191
        #   c) With 255-way fan out, we don't want to read all 255 and destroy
1192
        #      the page cache, just to determine that we really don't need it.
1193
        for node in self._iter_nodes(store, batch_size=16):
1194
            if isinstance(node, InternalNode):
1195
                # Without looking at any leaf nodes, we are sure
1196
                return self
1197
            for key, value in node._items.iteritems():
1198
                if new_leaf._map_no_split(key, value):
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1199
                    return self
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1200
        return new_leaf
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1201
1202
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1203
def _deserialise(bytes, key, search_key_func):
3735.2.16 by Robert Collins
Untested extensions to support repodetails
1204
    """Helper for repositorydetails - convert bytes to a node."""
3735.2.24 by Robert Collins
test_chk_map tests all passing.
1205
    if bytes.startswith("chkleaf:\n"):
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1206
        return LeafNode.deserialise(bytes, key, search_key_func=search_key_func)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1207
    elif bytes.startswith("chknode:\n"):
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1208
        return InternalNode.deserialise(bytes, key,
1209
            search_key_func=search_key_func)
3735.2.16 by Robert Collins
Untested extensions to support repodetails
1210
    else:
1211
        raise AssertionError("Unknown node type.")
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1212
1213
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1214
def _find_children_info(store, interesting_keys, uninteresting_keys, pb):
3735.9.7 by John Arbash Meinel
Cleanup pass.
1215
    """Read the associated records, and determine what is interesting."""
1216
    uninteresting_keys = set(uninteresting_keys)
1217
    chks_to_read = uninteresting_keys.union(interesting_keys)
1218
    next_uninteresting = set()
1219
    next_interesting = set()
1220
    uninteresting_items = set()
1221
    interesting_items = set()
1222
    interesting_records = []
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1223
    # records_read = set()
3735.9.7 by John Arbash Meinel
Cleanup pass.
1224
    for record in store.get_record_stream(chks_to_read, 'unordered', True):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1225
        # records_read.add(record.key())
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1226
        if pb is not None:
1227
            pb.tick()
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1228
        bytes = record.get_bytes_as('fulltext')
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1229
        # We don't care about search_key_func for this code, because we only
1230
        # care about external references.
1231
        node = _deserialise(bytes, record.key, search_key_func=None)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1232
        if record.key in uninteresting_keys:
1233
            if isinstance(node, InternalNode):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1234
                next_uninteresting.update(node.refs())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1235
            else:
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1236
                # We know we are at a LeafNode, so we can pass None for the
1237
                # store
1238
                uninteresting_items.update(node.iteritems(None))
3735.9.7 by John Arbash Meinel
Cleanup pass.
1239
        else:
1240
            interesting_records.append(record)
1241
            if isinstance(node, InternalNode):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1242
                next_interesting.update(node.refs())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1243
            else:
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1244
                interesting_items.update(node.iteritems(None))
1245
    # TODO: Filter out records that have already been read, as node splitting
1246
    #       can cause us to reference the same nodes via shorter and longer
1247
    #       paths
3735.9.7 by John Arbash Meinel
Cleanup pass.
1248
    return (next_uninteresting, uninteresting_items,
1249
            next_interesting, interesting_records, interesting_items)
1250
1251
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1252
def _find_all_uninteresting(store, interesting_root_keys,
1253
                            uninteresting_root_keys, adapter, pb):
1254
    """Determine the full set of uninteresting keys."""
1255
    # What about duplicates between interesting_root_keys and
1256
    # uninteresting_root_keys?
1257
    if not uninteresting_root_keys:
1258
        # Shortcut case. We know there is nothing uninteresting to filter out
1259
        # So we just let the rest of the algorithm do the work
1260
        # We know there is nothing uninteresting, and we didn't have to read
1261
        # any interesting records yet.
1262
        return (set(), set(), set(interesting_root_keys), [], set())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1263
    all_uninteresting_chks = set(uninteresting_root_keys)
1264
    all_uninteresting_items = set()
1265
1266
    # First step, find the direct children of both the interesting and
1267
    # uninteresting set
1268
    (uninteresting_keys, uninteresting_items,
1269
     interesting_keys, interesting_records,
1270
     interesting_items) = _find_children_info(store, interesting_root_keys,
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1271
                                              uninteresting_root_keys,
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1272
                                              pb=pb)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1273
    all_uninteresting_chks.update(uninteresting_keys)
1274
    all_uninteresting_items.update(uninteresting_items)
1275
    del uninteresting_items
1276
    # Note: Exact matches between interesting and uninteresting do not need
1277
    #       to be search further. Non-exact matches need to be searched in case
1278
    #       there is a future exact-match
1279
    uninteresting_keys.difference_update(interesting_keys)
1280
3735.9.6 by John Arbash Meinel
Add a first pass to the interesting search.
1281
    # Second, find the full set of uninteresting bits reachable by the
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1282
    # uninteresting roots
3735.9.7 by John Arbash Meinel
Cleanup pass.
1283
    chks_to_read = uninteresting_keys
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1284
    while chks_to_read:
1285
        next_chks = set()
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1286
        for record in store.get_record_stream(chks_to_read, 'unordered', False):
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1287
            # TODO: Handle 'absent'
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1288
            if pb is not None:
1289
                pb.tick()
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1290
            if record.storage_kind in ('fulltext', 'chunked'):
1291
                bytes = record.get_bytes_as('fulltext')
1292
            else:
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1293
                bytes = adapter.get_bytes(record,
1294
                            record.get_bytes_as(record.storage_kind))
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1295
            # We don't care about search_key_func for this code, because we
1296
            # only care about external references.
1297
            node = _deserialise(bytes, record.key, search_key_func=None)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1298
            if isinstance(node, InternalNode):
1299
                # uninteresting_prefix_chks.update(node._items.iteritems())
1300
                chks = node._items.values()
1301
                # TODO: We remove the entries that are already in
1302
                #       uninteresting_chks ?
1303
                next_chks.update(chks)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1304
                all_uninteresting_chks.update(chks)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1305
            else:
3735.9.7 by John Arbash Meinel
Cleanup pass.
1306
                all_uninteresting_items.update(node._items.iteritems())
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1307
        chks_to_read = next_chks
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1308
    return (all_uninteresting_chks, all_uninteresting_items,
1309
            interesting_keys, interesting_records, interesting_items)
1310
1311
1312
def iter_interesting_nodes(store, interesting_root_keys,
1313
                           uninteresting_root_keys, pb=None):
1314
    """Given root keys, find interesting nodes.
1315
1316
    Evaluate nodes referenced by interesting_root_keys. Ones that are also
1317
    referenced from uninteresting_root_keys are not considered interesting.
1318
1319
    :param interesting_root_keys: keys which should be part of the
1320
        "interesting" nodes (which will be yielded)
1321
    :param uninteresting_root_keys: keys which should be filtered out of the
1322
        result set.
1323
    :return: Yield
1324
        (interesting records, interesting chk's, interesting key:values)
1325
    """
1326
    # TODO: consider that it may be more memory efficient to use the 20-byte
1327
    #       sha1 string, rather than tuples of hexidecimal sha1 strings.
3735.2.68 by John Arbash Meinel
Add a TODO about avoiding all of the get_record_stream calls.
1328
    # TODO: Try to factor out a lot of the get_record_stream() calls into a
1329
    #       helper function similar to _read_bytes. This function should be
1330
    #       able to use nodes from the _page_cache as well as actually
1331
    #       requesting bytes from the store.
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1332
1333
    # A way to adapt from the compressed texts back into fulltexts
1334
    # In a way, this seems like a layering inversion to have CHKMap know the
1335
    # details of versionedfile
1336
    adapter_class = versionedfile.adapter_registry.get(
1337
        ('knit-ft-gz', 'fulltext'))
1338
    adapter = adapter_class(store)
1339
1340
    (all_uninteresting_chks, all_uninteresting_items, interesting_keys,
1341
     interesting_records, interesting_items) = _find_all_uninteresting(store,
1342
        interesting_root_keys, uninteresting_root_keys, adapter, pb)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1343
1344
    # Now that we know everything uninteresting, we can yield information from
1345
    # our first request
1346
    interesting_items.difference_update(all_uninteresting_items)
1347
    records = dict((record.key, record) for record in interesting_records
1348
                    if record.key not in all_uninteresting_chks)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1349
    if records or interesting_items:
1350
        yield records, interesting_items
3735.9.7 by John Arbash Meinel
Cleanup pass.
1351
    interesting_keys.difference_update(all_uninteresting_chks)
1352
1353
    chks_to_read = interesting_keys
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1354
    while chks_to_read:
1355
        next_chks = set()
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1356
        for record in store.get_record_stream(chks_to_read, 'unordered', False):
1357
            if pb is not None:
1358
                pb.tick()
3735.9.7 by John Arbash Meinel
Cleanup pass.
1359
            # TODO: Handle 'absent'?
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1360
            if record.storage_kind in ('fulltext', 'chunked'):
1361
                bytes = record.get_bytes_as('fulltext')
1362
            else:
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1363
                bytes = adapter.get_bytes(record,
1364
                            record.get_bytes_as(record.storage_kind))
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1365
            # We don't care about search_key_func for this code, because we
1366
            # only care about external references.
1367
            node = _deserialise(bytes, record.key, search_key_func=None)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1368
            if isinstance(node, InternalNode):
3735.9.15 by John Arbash Meinel
Found a bug in iter_interesting_nodes and its test suite.
1369
                chks = set(node.refs())
1370
                chks.difference_update(all_uninteresting_chks)
1371
                # Is set() and .difference_update better than:
1372
                # chks = [chk for chk in node.refs()
1373
                #              if chk not in all_uninteresting_chks]
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1374
                next_chks.update(chks)
1375
                # These are now uninteresting everywhere else
3735.9.7 by John Arbash Meinel
Cleanup pass.
1376
                all_uninteresting_chks.update(chks)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1377
                interesting_items = []
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1378
            else:
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1379
                interesting_items = [item for item in node._items.iteritems()
1380
                                     if item not in all_uninteresting_items]
3735.9.15 by John Arbash Meinel
Found a bug in iter_interesting_nodes and its test suite.
1381
                # TODO: Do we need to filter out items that we have already
1382
                #       seen on other pages? We don't really want to buffer the
1383
                #       whole thing, but it does mean that callers need to
1384
                #       understand they may get duplicate values.
3735.9.7 by John Arbash Meinel
Cleanup pass.
1385
                # all_uninteresting_items.update(interesting_items)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1386
            yield {record.key: record}, interesting_items
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1387
        chks_to_read = next_chks