/brz/remove-bazaar

To get this branch, use:
bzr branch http://gegoxaren.bato24.eu/bzr/brz/remove-bazaar
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
1
# Copyright (C) 2008 Canonical Ltd
2
#
3
# This program is free software; you can redistribute it and/or modify
4
# it under the terms of the GNU General Public License as published by
5
# the Free Software Foundation; either version 2 of the License, or
6
# (at your option) any later version.
7
#
8
# This program is distributed in the hope that it will be useful,
9
# but WITHOUT ANY WARRANTY; without even the implied warranty of
10
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
11
# GNU General Public License for more details.
12
#
13
# You should have received a copy of the GNU General Public License
14
# along with this program; if not, write to the Free Software
15
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
16
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
17
"""Persistent maps from tuple_of_strings->string using CHK stores.
18
19
Overview and current status:
20
21
The CHKMap class implements a dict from tuple_of_strings->string by using a trie
22
with internal nodes of 8-bit fan out; The key tuples are mapped to strings by
23
joining them by \x00, and \x00 padding shorter keys out to the length of the
24
longest key. Leaf nodes are packed as densely as possible, and internal nodes
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
25
are all and additional 8-bits wide leading to a sparse upper tree.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
26
27
Updates to a CHKMap are done preferentially via the apply_delta method, to
28
allow optimisation of the update operation; but individual map/unmap calls are
29
possible and supported. All changes via map/unmap are buffered in memory until
30
the _save method is called to force serialisation of the tree. apply_delta
31
performs a _save implicitly.
32
33
TODO:
34
-----
35
36
Densely packed upper nodes.
37
38
"""
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
39
3735.2.31 by Robert Collins
CHKMap.iter_changes
40
import heapq
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
41
3735.9.18 by John Arbash Meinel
Make the versionedfile import lazy.
42
from bzrlib import lazy_import
43
lazy_import.lazy_import(globals(), """
3735.16.3 by John Arbash Meinel
Add functions for _search_key_16 and _search_key_255 and some basic tests for them.
44
import zlib
45
import struct
46
3735.9.18 by John Arbash Meinel
Make the versionedfile import lazy.
47
from bzrlib import versionedfile
48
""")
3735.14.2 by John Arbash Meinel
Finish using the page cache as part of _check_remap, add debugging functions
49
from bzrlib import lru_cache
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
50
51
# approx 2MB
3735.14.5 by John Arbash Meinel
Change _check_remap to only page in a batch of children at a time.
52
# If each line is 50 bytes, and you have 255 internal pages, with 255-way fan
53
# out, it takes 3.1MB to cache the layer.
54
_PAGE_CACHE_SIZE = 4*1024*1024
3735.14.2 by John Arbash Meinel
Finish using the page cache as part of _check_remap, add debugging functions
55
# We are caching bytes so len(value) is perfectly accurate
56
_page_cache = lru_cache.LRUSizeCache(_PAGE_CACHE_SIZE)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
57
58
3735.16.3 by John Arbash Meinel
Add functions for _search_key_16 and _search_key_255 and some basic tests for them.
59
def _search_key_16(key):
60
    """Map the key tuple into a search key string which has 16-way fan out."""
61
    return '\x00'.join(['%08X' % abs(zlib.crc32(bit)) for bit in key])
62
63
64
def _search_key_255(key):
65
    """Map the key tuple into a search key string which has 255-way fan out.
66
67
    We use 255-way because '\n' is used as a delimiter, and causes problems
68
    while parsing.
69
    """
70
    bytes = '\x00'.join([struct.pack('>i', zlib.crc32(bit)) for bit in key])
71
    return bytes.replace('\n', '_')
72
73
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
74
class CHKMap(object):
75
    """A persistent map from string to string backed by a CHK store."""
76
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
77
    def __init__(self, store, root_key, search_key_func=None):
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
78
        """Create a CHKMap object.
79
80
        :param store: The store the CHKMap is stored in.
81
        :param root_key: The root key of the map. None to create an empty
82
            CHKMap.
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
83
        :param search_key_func: A function mapping a key => bytes. These bytes
84
            are then used by the internal nodes to split up leaf nodes into
85
            multiple pages.
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
86
        """
87
        self._store = store
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
88
        if search_key_func is None:
89
            search_key_func = '\x00'.join
90
        self._search_key_func = search_key_func
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
91
        if root_key is None:
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
92
            self._root_node = LeafNode(search_key_func=search_key_func)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
93
        else:
3735.2.41 by Robert Collins
Make the parent_id_basename index be updated during CHKInventory.apply_delta.
94
            self._root_node = self._node_key(root_key)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
95
96
    def apply_delta(self, delta):
97
        """Apply a delta to the map.
98
99
        :param delta: An iterable of old_key, new_key, new_value tuples.
100
            If new_key is not None, then new_key->new_value is inserted
101
            into the map; if old_key is not None, then the old mapping
102
            of old_key is removed.
103
        """
104
        for old, new, value in delta:
105
            if old is not None and old != new:
106
                # unmap
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
107
                self.unmap(old)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
108
        for old, new, value in delta:
109
            if new is not None:
110
                # map
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
111
                self.map(new, value)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
112
        return self._save()
113
114
    def _ensure_root(self):
115
        """Ensure that the root node is an object not a key."""
116
        if type(self._root_node) == tuple:
117
            # Demand-load the root
3735.2.31 by Robert Collins
CHKMap.iter_changes
118
            self._root_node = self._get_node(self._root_node)
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
119
            # XXX: Shouldn't this be put into _deserialize?
120
            self._root_node._search_key_func = self._search_key_func
3735.2.31 by Robert Collins
CHKMap.iter_changes
121
122
    def _get_node(self, node):
123
        """Get a node.
124
125
        Node that this does not update the _items dict in objects containing a
126
        reference to this node. As such it does not prevent subsequent IO being
127
        performed.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
128
3735.2.31 by Robert Collins
CHKMap.iter_changes
129
        :param node: A tuple key or node object.
130
        :return: A node object.
131
        """
132
        if type(node) == tuple:
133
            bytes = self._read_bytes(node)
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
134
            return _deserialise(bytes, node,
135
                search_key_func=self._search_key_func)
3735.2.31 by Robert Collins
CHKMap.iter_changes
136
        else:
137
            return node
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
138
139
    def _read_bytes(self, key):
140
        stream = self._store.get_record_stream([key], 'unordered', True)
141
        return stream.next().get_bytes_as('fulltext')
142
3735.15.16 by John Arbash Meinel
Properly fix up the dump_tree tests, we now suppress the keys by default.
143
    def _dump_tree(self, include_keys=False):
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
144
        """Return the tree in a string representation."""
145
        self._ensure_root()
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
146
        res = self._dump_tree_node(self._root_node, prefix='', indent='',
147
                                   include_keys=include_keys)
3735.11.9 by John Arbash Meinel
Switch _dump_tree to returning trailing '\n' for nicer results
148
        res.append('') # Give a trailing '\n'
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
149
        return '\n'.join(res)
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
150
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
151
    def _dump_tree_node(self, node, prefix, indent, include_keys=True):
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
152
        """For this node and all children, generate a string representation."""
153
        result = []
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
154
        if not include_keys:
155
            key_str = ''
156
        else:
157
            node_key = node.key()
158
            if node_key is not None:
159
                key_str = ' %s' % (node_key[0],)
160
            else:
161
                key_str = ' None'
162
        result.append('%s%r %s%s' % (indent, prefix, node.__class__.__name__,
163
                                     key_str))
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
164
        if isinstance(node, InternalNode):
165
            # Trigger all child nodes to get loaded
166
            list(node._iter_nodes(self._store))
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
167
            for prefix, sub in sorted(node._items.iteritems()):
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
168
                result.extend(self._dump_tree_node(sub, prefix, indent + '  ',
169
                                                   include_keys=include_keys))
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
170
        else:
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
171
            for key, value in sorted(node._items.iteritems()):
172
                result.append('      %r %r' % (key, value))
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
173
        return result
174
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
175
    @classmethod
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
176
    def from_dict(klass, store, initial_value, maximum_size=0, key_width=1):
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
177
        """Create a CHKMap in store with initial_value as the content.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
178
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
179
        :param store: The store to record initial_value in, a VersionedFiles
180
            object with 1-tuple keys supporting CHK key generation.
181
        :param initial_value: A dict to store in store. Its keys and values
182
            must be bytestrings.
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
183
        :param maximum_size: The maximum_size rule to apply to nodes. This
184
            determines the size at which no new data is added to a single node.
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
185
        :param key_width: The number of elements in each key_tuple being stored
186
            in this map.
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
187
        :return: The root chk of te resulting CHKMap.
188
        """
189
        result = CHKMap(store, None)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
190
        result._root_node.set_maximum_size(maximum_size)
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
191
        result._root_node._key_width = key_width
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
192
        delta = []
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
193
        for key, value in initial_value.items():
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
194
            delta.append((None, key, value))
195
        result.apply_delta(delta)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
196
        return result._save()
197
3735.2.30 by Robert Collins
Start iter_changes between CHKMap instances.
198
    def iter_changes(self, basis):
199
        """Iterate over the changes between basis and self.
200
201
        :return: An iterator of tuples: (key, old_value, new_value). Old_value
202
            is None for keys only in self; new_value is None for keys only in
203
            basis.
204
        """
3735.2.31 by Robert Collins
CHKMap.iter_changes
205
        # Overview:
206
        # Read both trees in lexographic, highest-first order.
207
        # Any identical nodes we skip
208
        # Any unique prefixes we output immediately.
209
        # values in a leaf node are treated as single-value nodes in the tree
210
        # which allows them to be not-special-cased. We know to output them
211
        # because their value is a string, not a key(tuple) or node.
212
        #
213
        # corner cases to beware of when considering this function:
214
        # *) common references are at different heights.
215
        #    consider two trees:
216
        #    {'a': LeafNode={'aaa':'foo', 'aab':'bar'}, 'b': LeafNode={'b'}}
217
        #    {'a': InternalNode={'aa':LeafNode={'aaa':'foo', 'aab':'bar'}, 'ab':LeafNode={'ab':'bar'}}
218
        #     'b': LeafNode={'b'}}
219
        #    the node with aaa/aab will only be encountered in the second tree
220
        #    after reading the 'a' subtree, but it is encountered in the first
221
        #    tree immediately. Variations on this may have read internal nodes like this.
222
        #    we want to cut the entire pending subtree when we realise we have a common node.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
223
        #    For this we use a list of keys - the path to a node - and check the entire path is
3735.2.31 by Robert Collins
CHKMap.iter_changes
224
        #    clean as we process each item.
225
        if self._node_key(self._root_node) == self._node_key(basis._root_node):
226
            return
227
        self._ensure_root()
228
        basis._ensure_root()
229
        excluded_keys = set()
230
        self_node = self._root_node
231
        basis_node = basis._root_node
232
        # A heap, each element is prefix, node(tuple/NodeObject/string),
233
        # key_path (a list of tuples, tail-sharing down the tree.)
234
        self_pending = []
235
        basis_pending = []
236
        def process_node(prefix, node, path, a_map, pending):
237
            # take a node and expand it
238
            node = a_map._get_node(node)
239
            if type(node) == LeafNode:
240
                path = (node._key, path)
241
                for key, value in node._items.items():
242
                    heapq.heappush(pending, ('\x00'.join(key), value, path))
243
            else:
244
                # type(node) == InternalNode
245
                path = (node._key, path)
246
                for prefix, child in node._items.items():
247
                    heapq.heappush(pending, (prefix, child, path))
248
        process_node(None, self_node, None, self, self_pending)
249
        process_node(None, basis_node, None, basis, basis_pending)
250
        self_seen = set()
251
        basis_seen = set()
252
        excluded_keys = set()
253
        def check_excluded(key_path):
254
            # Note that this is N^2, it depends on us trimming trees
255
            # aggressively to not become slow.
256
            # A better implementation would probably have a reverse map
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
257
            # back to the children of a node, and jump straight to it when
3735.2.31 by Robert Collins
CHKMap.iter_changes
258
            # a common node is detected, the proceed to remove the already
259
            # pending children. bzrlib.graph has a searcher module with a
260
            # similar problem.
261
            while key_path is not None:
262
                key, key_path = key_path
263
                if key in excluded_keys:
264
                    return True
265
            return False
266
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
267
        loop_counter = 0
3735.2.31 by Robert Collins
CHKMap.iter_changes
268
        while self_pending or basis_pending:
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
269
            loop_counter += 1
3735.2.31 by Robert Collins
CHKMap.iter_changes
270
            if not self_pending:
271
                # self is exhausted: output remainder of basis
272
                for prefix, node, path in basis_pending:
273
                    if check_excluded(path):
274
                        continue
275
                    node = basis._get_node(node)
276
                    if type(node) == str:
277
                        # a value
278
                        yield (tuple(prefix.split('\x00')), node, None)
279
                    else:
280
                        # subtree - fastpath the entire thing.
281
                        for key, value in node.iteritems(basis._store):
282
                            yield (key, value, None)
283
                return
284
            elif not basis_pending:
285
                # basis is exhausted: output remainder of self.
286
                for prefix, node, path in self_pending:
287
                    if check_excluded(path):
288
                        continue
289
                    node = self._get_node(node)
290
                    if type(node) == str:
291
                        # a value
292
                        yield (tuple(prefix.split('\x00')), None, node)
293
                    else:
294
                        # subtree - fastpath the entire thing.
295
                        for key, value in node.iteritems(self._store):
296
                            yield (key, None, value)
297
                return
298
            else:
299
                # XXX: future optimisation - yield the smaller items
300
                # immediately rather than pushing everything on/off the
301
                # heaps. Applies to both internal nodes and leafnodes.
302
                if self_pending[0][0] < basis_pending[0][0]:
303
                    # expand self
304
                    prefix, node, path = heapq.heappop(self_pending)
305
                    if check_excluded(path):
306
                        continue
307
                    if type(node) == str:
308
                        # a value
309
                        yield (tuple(prefix.split('\x00')), None, node)
310
                    else:
311
                        process_node(prefix, node, path, self, self_pending)
312
                        continue
313
                elif self_pending[0][0] > basis_pending[0][0]:
314
                    # expand basis
315
                    prefix, node, path = heapq.heappop(basis_pending)
316
                    if check_excluded(path):
317
                        continue
318
                    if type(node) == str:
319
                        # a value
320
                        yield (tuple(prefix.split('\x00')), node, None)
321
                    else:
322
                        process_node(prefix, node, path, basis, basis_pending)
323
                        continue
324
                else:
325
                    # common prefix: possibly expand both
326
                    if type(self_pending[0][1]) != str:
327
                        # process next self
328
                        read_self = True
329
                    else:
330
                        read_self = False
331
                    if type(basis_pending[0][1]) != str:
332
                        # process next basis
333
                        read_basis = True
334
                    else:
335
                        read_basis = False
336
                    if not read_self and not read_basis:
337
                        # compare a common value
338
                        self_details = heapq.heappop(self_pending)
339
                        basis_details = heapq.heappop(basis_pending)
340
                        if self_details[1] != basis_details[1]:
341
                            yield (tuple(self_details[0].split('\x00')),
342
                                basis_details[1], self_details[1])
343
                        continue
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
344
                    # At least one side wasn't a string.
345
                    if (self._node_key(self_pending[0][1]) ==
346
                        self._node_key(basis_pending[0][1])):
347
                        # Identical pointers, skip (and don't bother adding to
348
                        # excluded, it won't turn up again.
349
                        heapq.heappop(self_pending)
350
                        heapq.heappop(basis_pending)
351
                        continue
352
                    # Now we need to expand this node before we can continue
3735.2.31 by Robert Collins
CHKMap.iter_changes
353
                    if read_self:
354
                        prefix, node, path = heapq.heappop(self_pending)
355
                        if check_excluded(path):
356
                            continue
357
                        process_node(prefix, node, path, self, self_pending)
358
                    if read_basis:
359
                        prefix, node, path = heapq.heappop(basis_pending)
360
                        if check_excluded(path):
361
                            continue
362
                        process_node(prefix, node, path, basis, basis_pending)
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
363
        # print loop_counter
3735.2.30 by Robert Collins
Start iter_changes between CHKMap instances.
364
3735.2.9 by Robert Collins
Get a working chk_map using inventory implementation bootstrapped.
365
    def iteritems(self, key_filter=None):
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
366
        """Iterate over the entire CHKMap's contents."""
367
        self._ensure_root()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
368
        return self._root_node.iteritems(self._store, key_filter=key_filter)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
369
3735.2.12 by Robert Collins
Implement commit-via-deltas for split inventory repositories.
370
    def key(self):
371
        """Return the key for this map."""
372
        if isinstance(self._root_node, tuple):
373
            return self._root_node
374
        else:
375
            return self._root_node._key
376
3735.2.17 by Robert Collins
Cache node length to avoid full iteration on __len__ calls.
377
    def __len__(self):
378
        self._ensure_root()
379
        return len(self._root_node)
380
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
381
    def map(self, key, value):
382
        """Map a key tuple to value."""
383
        # Need a root object.
384
        self._ensure_root()
385
        prefix, node_details = self._root_node.map(self._store, key, value)
386
        if len(node_details) == 1:
387
            self._root_node = node_details[0][1]
388
        else:
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
389
            self._root_node = InternalNode(prefix,
390
                                search_key_func=self._search_key_func)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
391
            self._root_node.set_maximum_size(node_details[0][1].maximum_size)
392
            self._root_node._key_width = node_details[0][1]._key_width
393
            for split, node in node_details:
394
                self._root_node.add_node(split, node)
395
3735.2.31 by Robert Collins
CHKMap.iter_changes
396
    def _node_key(self, node):
397
        """Get the key for a node whether its a tuple o r node."""
398
        if type(node) == tuple:
399
            return node
400
        else:
401
            return node._key
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
402
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
403
    def unmap(self, key):
404
        """remove key from the map."""
405
        self._ensure_root()
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
406
        unmapped = self._root_node.unmap(self._store, key)
407
        self._root_node = unmapped
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
408
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
409
    def _save(self):
410
        """Save the map completely.
411
412
        :return: The key of the root node.
413
        """
414
        if type(self._root_node) == tuple:
415
            # Already saved.
416
            return self._root_node
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
417
        keys = list(self._root_node.serialise(self._store))
418
        return keys[-1]
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
419
420
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
421
class Node(object):
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
422
    """Base class defining the protocol for CHK Map nodes.
423
424
    :ivar _raw_size: The total size of the serialized key:value data, before
425
        adding the header bytes, and without prefix compression.
426
    """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
427
428
    def __init__(self, key_width=1):
429
        """Create a node.
430
431
        :param key_width: The width of keys for this node.
432
        """
433
        self._key = None
434
        # Current number of elements
435
        self._len = 0
436
        self._maximum_size = 0
437
        self._key_width = 1
438
        # current size in bytes
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
439
        self._raw_size = 0
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
440
        # The pointers/values this node has - meaning defined by child classes.
441
        self._items = {}
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
442
        # The common search prefix
443
        self._search_prefix = None
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
444
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
445
    def __repr__(self):
446
        items_str = sorted(self._items)
447
        if len(items_str) > 20:
448
            items_str = items_str[16] + '...]'
3735.15.3 by John Arbash Meinel
repr update
449
        return '%s(key:%s len:%s size:%s max:%s prefix:%s items:%s)' % (
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
450
            self.__class__.__name__, self._key, self._len, self._raw_size,
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
451
            self._maximum_size, self._search_prefix, items_str)
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
452
3735.2.38 by Robert Collins
Sufficient fixes to allow bzr-search to index a dev3 format repository.
453
    def key(self):
454
        return self._key
455
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
456
    def __len__(self):
457
        return self._len
458
459
    @property
460
    def maximum_size(self):
461
        """What is the upper limit for adding references to a node."""
462
        return self._maximum_size
463
464
    def set_maximum_size(self, new_size):
465
        """Set the size threshold for nodes.
466
467
        :param new_size: The size at which no data is added to a node. 0 for
468
            unlimited.
469
        """
470
        self._maximum_size = new_size
471
3735.15.5 by John Arbash Meinel
Change the nomenclature.
472
    @classmethod
473
    def common_prefix(cls, prefix, key):
474
        """Given 2 strings, return the longest prefix common to both.
475
476
        :param prefix: This has been the common prefix for other keys, so it is
477
            more likely to be the common prefix in this case as well.
478
        :param key: Another string to compare to
479
        """
480
        if key.startswith(prefix):
481
            return prefix
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
482
        # Is there a better way to do this?
3735.15.5 by John Arbash Meinel
Change the nomenclature.
483
        for pos, (left, right) in enumerate(zip(prefix, key)):
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
484
            if left != right:
485
                pos -= 1
486
                break
3735.15.5 by John Arbash Meinel
Change the nomenclature.
487
        assert pos <= len(prefix)
488
        assert pos <= len(key)
489
        common = prefix[:pos+1]
490
        assert key.startswith(common)
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
491
        return common
492
3735.15.5 by John Arbash Meinel
Change the nomenclature.
493
    @classmethod
494
    def common_prefix_for_keys(cls, keys):
495
        """Given a list of keys, find their common prefix.
496
497
        :param keys: An iterable of strings.
498
        :return: The longest common prefix of all keys.
499
        """
500
        common_prefix = None
501
        for key in keys:
502
            if common_prefix is None:
503
                common_prefix = key
504
                continue
505
            common_prefix = cls.common_prefix(common_prefix, key)
506
            if not common_prefix:
507
                # if common_prefix is the empty string, then we know it won't
508
                # change further
509
                return ''
510
        return common_prefix
511
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
512
513
class LeafNode(Node):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
514
    """A node containing actual key:value pairs.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
515
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
516
    :ivar _items: A dict of key->value items. The key is in tuple form.
3735.15.4 by John Arbash Meinel
Clean up some little bits.
517
    :ivar _size: The number of bytes that would be used by serializing all of
518
        the key/value pairs.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
519
    """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
520
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
521
    def __init__(self, search_key_func=None):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
522
        Node.__init__(self)
3735.15.4 by John Arbash Meinel
Clean up some little bits.
523
        # All of the keys in this leaf node share this common prefix
3735.15.5 by John Arbash Meinel
Change the nomenclature.
524
        self._common_serialised_prefix = None
525
        self._serialise_key = '\x00'.join
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
526
        if search_key_func is None:
527
            self._search_key_func = self._serialise_key
528
        else:
529
            self._search_key_func = search_key_func
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
530
531
    def _current_size(self):
3735.15.4 by John Arbash Meinel
Clean up some little bits.
532
        """Answer the current serialised size of this node.
533
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
534
        This differs from self._raw_size in that it includes the bytes used for
535
        the header.
3735.15.4 by John Arbash Meinel
Clean up some little bits.
536
        """
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
537
        if self._common_serialised_prefix is None:
538
            bytes_for_items = 0
539
        else:
540
            # We will store a single string with the common prefix
541
            # And then that common prefix will not be stored in any of the
542
            # entry lines
543
            prefix_len = len(self._common_serialised_prefix)
544
            bytes_for_items = (self._raw_size - (prefix_len * (self._len - 1)))
545
        return (13 # bytes overhead for the header and separators
3735.15.8 by John Arbash Meinel
Add asserts so that when serializing and deserializing
546
            + len(str(self._maximum_size))
547
            + len(str(self._key_width))
3735.15.4 by John Arbash Meinel
Clean up some little bits.
548
            + len(str(self._len))
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
549
            + bytes_for_items)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
550
551
    @classmethod
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
552
    def deserialise(klass, bytes, key, search_key_func=None):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
553
        """Deserialise bytes, with key key, into a LeafNode.
554
555
        :param bytes: The bytes of the node.
556
        :param key: The key that the serialised node has.
557
        """
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
558
        result = LeafNode(search_key_func=search_key_func)
3735.2.72 by John Arbash Meinel
Change deserialise to properly handle when there is a '\r' in the key.
559
        # Splitlines can split on '\r' so don't use it, split('\n') adds an
560
        # extra '' if the bytes ends in a final newline.
561
        lines = bytes.split('\n')
562
        assert lines[-1] == ''
563
        lines.pop(-1)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
564
        items = {}
565
        if lines[0] != 'chkleaf:':
566
            raise ValueError("not a serialised leaf node: %r" % bytes)
567
        maximum_size = int(lines[1])
568
        width = int(lines[2])
569
        length = int(lines[3])
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
570
        prefix = lines[4]
571
        for line in lines[5:]:
572
            line = prefix + line
3735.2.25 by Robert Collins
CHKInventory core tests passing.
573
            elements = line.split('\x00', width)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
574
            items[tuple(elements[:-1])] = elements[-1]
575
        if len(items) != length:
3735.14.3 by John Arbash Meinel
Properly remove keys that are found in the page cache. And add some debugging.
576
            raise AssertionError("item count (%d) mismatch for key %s,"
577
                " bytes %r" % (length, key, bytes))
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
578
        result._items = items
579
        result._len = length
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
580
        assert length == len(lines) - 5
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
581
        result._maximum_size = maximum_size
582
        result._key = key
583
        result._key_width = width
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
584
        result._raw_size = (sum(map(len, lines[5:])) # the length of the suffix
585
            + (length)*(len(prefix)+1)) # prefix + '\n'
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
586
        result._compute_search_prefix()
3735.15.5 by John Arbash Meinel
Change the nomenclature.
587
        result._compute_serialised_prefix()
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
588
        if len(bytes) != result._current_size():
589
            import pdb; pdb.set_trace()
3735.15.8 by John Arbash Meinel
Add asserts so that when serializing and deserializing
590
        assert len(bytes) == result._current_size()
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
591
        return result
592
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
593
    def iteritems(self, store, key_filter=None):
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
594
        """Iterate over items in the node.
595
596
        :param key_filter: A filter to apply to the node. It should be a
597
            list/set/dict or similar repeatedly iterable container.
598
        """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
599
        if key_filter is not None:
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
600
            # Adjust the filter - short elements go to a prefix filter. Would this
601
            # be cleaner explicitly? That would be no harder for InternalNode..
602
            # XXX: perhaps defaultdict? Profiling<rinse and repeat>
603
            filters = {}
604
            for key in key_filter:
605
                length_filter = filters.setdefault(len(key), set())
606
                length_filter.add(key)
607
            filters = filters.items()
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
608
            for item in self._items.iteritems():
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
609
                for length, length_filter in filters:
610
                    if item[0][:length] in length_filter:
611
                        yield item
612
                        break
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
613
        else:
614
            for item in self._items.iteritems():
615
                yield item
616
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
617
    def _key_value_len(self, key, value):
618
        # TODO: Should probably be done without actually joining the key, but
619
        #       then that can be done via the C extension
3735.15.5 by John Arbash Meinel
Change the nomenclature.
620
        return 2 + len(self._serialise_key(key)) + len(value)
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
621
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
622
    def _search_key(self, key):
623
        return self._search_key_func(key)
624
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
625
    def _map_no_split(self, key, value):
626
        """Map a key to a value.
627
628
        This assumes either the key does not already exist, or you have already
629
        removed its size and length from self.
630
631
        :return: True if adding this node should cause us to split.
632
        """
633
        self._items[key] = value
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
634
        self._raw_size += self._key_value_len(key, value)
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
635
        self._len += 1
3735.15.5 by John Arbash Meinel
Change the nomenclature.
636
        serialised_key = self._serialise_key(key)
637
        if self._common_serialised_prefix is None:
638
            self._common_serialised_prefix = serialised_key
639
        else:
640
            self._common_serialised_prefix = self.common_prefix(
641
                self._common_serialised_prefix, serialised_key)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
642
        search_key = self._search_key(key)
643
        if self._search_prefix is None:
644
            self._search_prefix = search_key
3735.15.5 by John Arbash Meinel
Change the nomenclature.
645
        else:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
646
            self._search_prefix = self.common_prefix(
647
                self._search_prefix, search_key)
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
648
        if (self._len > 1
649
            and self._maximum_size
650
            and self._current_size() > self._maximum_size):
651
            return True
652
        return False
653
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
654
    def _split(self, store):
655
        """We have overflowed.
656
657
        Split this node into multiple LeafNodes, return it up the stack so that
658
        the next layer creates a new InternalNode and references the new nodes.
659
3735.15.5 by John Arbash Meinel
Change the nomenclature.
660
        :return: (common_serialised_prefix, [(node_serialised_prefix, node)])
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
661
        """
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
662
        common_prefix = self._search_prefix
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
663
        split_at = len(common_prefix) + 1
664
        result = {}
665
        for key, value in self._items.iteritems():
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
666
            search_key = self._search_key(key)
667
            prefix = search_key[:split_at]
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
668
            # TODO: Generally only 1 key can be exactly the right length,
669
            #       which means we can only have 1 key in the node pointed
670
            #       at by the 'prefix\0' key. We might want to consider
671
            #       folding it into the containing InternalNode rather than
672
            #       having a fixed length-1 node.
673
            #       Note this is probably not true for hash keys, as they
674
            #       may get a '\00' node anywhere, but won't have keys of
675
            #       different lengths.
676
            if len(prefix) < split_at:
677
                prefix += '\x00'*(split_at - len(prefix))
678
            if prefix not in result:
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
679
                node = LeafNode(search_key_func=self._search_key_func)
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
680
                node.set_maximum_size(self._maximum_size)
681
                node._key_width = self._key_width
682
                result[prefix] = node
683
            else:
684
                node = result[prefix]
685
            node.map(store, key, value)
686
        return common_prefix, result.items()
687
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
688
    def map(self, store, key, value):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
689
        """Map key to value."""
690
        if key in self._items:
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
691
            self._raw_size -= self._key_value_len(key, self._items[key])
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
692
            self._len -= 1
693
        self._key = None
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
694
        if self._map_no_split(key, value):
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
695
            return self._split(store)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
696
        else:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
697
            return self._search_prefix, [("", self)]
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
698
699
    def serialise(self, store):
700
        """Serialise the tree to store.
701
702
        :param store: A VersionedFiles honouring the CHK extensions.
703
        :return: An iterable of the keys inserted by this operation.
704
        """
705
        lines = ["chkleaf:\n"]
706
        lines.append("%d\n" % self._maximum_size)
707
        lines.append("%d\n" % self._key_width)
708
        lines.append("%d\n" % self._len)
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
709
        if self._common_serialised_prefix is None:
710
            lines.append('\n')
711
        else:
712
            lines.append('%s\n' % (self._common_serialised_prefix,))
713
            prefix_len = len(self._common_serialised_prefix)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
714
        for key, value in sorted(self._items.items()):
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
715
            serialized = "%s\x00%s\n" % (self._serialise_key(key), value)
716
            assert serialized.startswith(self._common_serialised_prefix)
717
            lines.append(serialized[prefix_len:])
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
718
        sha1, _, _ = store.add_lines((None,), (), lines)
719
        self._key = ("sha1:" + sha1,)
3735.15.8 by John Arbash Meinel
Add asserts so that when serializing and deserializing
720
        bytes = ''.join(lines)
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
721
        if len(bytes) != self._current_size():
722
            import pdb; pdb.set_trace()
3735.15.8 by John Arbash Meinel
Add asserts so that when serializing and deserializing
723
        assert len(bytes) == self._current_size()
724
        _page_cache.add(self._key, bytes)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
725
        return [self._key]
726
3735.2.26 by Robert Collins
CHKInventory migrated to new CHKMap code.
727
    def refs(self):
728
        """Return the references to other CHK's held by this node."""
729
        return []
730
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
731
    def _compute_search_prefix(self):
732
        """Determine the common search prefix for all keys in this node.
3735.15.5 by John Arbash Meinel
Change the nomenclature.
733
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
734
        :return: A bytestring of the longest search key prefix that is
3735.15.5 by John Arbash Meinel
Change the nomenclature.
735
            unique within this node.
736
        """
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
737
        search_keys = [self._search_key(key) for key in self._items]
738
        self._search_prefix = self.common_prefix_for_keys(search_keys)
739
        return self._search_prefix
3735.15.5 by John Arbash Meinel
Change the nomenclature.
740
741
    def _compute_serialised_prefix(self):
742
        """Determine the common prefix for serialised keys in this node.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
743
744
        :return: A bytestring of the longest serialised key prefix that is
745
            unique within this node.
746
        """
3735.15.5 by John Arbash Meinel
Change the nomenclature.
747
        serialised_keys = [self._serialise_key(key) for key in self._items]
748
        self._common_serialised_prefix = self.common_prefix_for_keys(
749
            serialised_keys)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
750
751
    def unmap(self, store, key):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
752
        """Unmap key from the node."""
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
753
        self._raw_size -= self._key_value_len(key, self._items[key])
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
754
        self._len -= 1
755
        del self._items[key]
756
        self._key = None
3735.15.2 by John Arbash Meinel
Change LeafNode to also cache its unique serialized prefix.
757
        # Recompute from scratch
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
758
        self._compute_search_prefix()
3735.15.5 by John Arbash Meinel
Change the nomenclature.
759
        self._compute_serialised_prefix()
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
760
        return self
761
762
763
class InternalNode(Node):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
764
    """A node that contains references to other nodes.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
765
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
766
    An InternalNode is responsible for mapping search key prefixes to child
3735.15.5 by John Arbash Meinel
Change the nomenclature.
767
    nodes.
3735.15.4 by John Arbash Meinel
Clean up some little bits.
768
3735.15.5 by John Arbash Meinel
Change the nomenclature.
769
    :ivar _items: serialised_key => node dictionary. node may be a tuple,
3735.15.4 by John Arbash Meinel
Clean up some little bits.
770
        LeafNode or InternalNode.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
771
    """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
772
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
773
    def __init__(self, prefix='', search_key_func=None):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
774
        Node.__init__(self)
775
        # The size of an internalnode with default values and no children.
776
        # How many octets key prefixes within this node are.
777
        self._node_width = 0
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
778
        self._search_prefix = prefix
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
779
        if search_key_func is None:
780
            self._search_key_func = '\x00'.join
781
        else:
782
            self._search_key_func = search_key_func
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
783
784
    def __repr__(self):
785
        items_str = sorted(self._items)
786
        if len(items_str) > 20:
787
            items_str = items_str[16] + '...]'
788
        return '%s(key:%s len:%s size:%s max:%s prefix:%s items:%s)' % (
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
789
            self.__class__.__name__, self._key, self._len, self._raw_size,
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
790
            self._maximum_size, self._search_prefix, items_str)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
791
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
792
    def add_node(self, prefix, node):
793
        """Add a child node with prefix prefix, and node node.
794
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
795
        :param prefix: The search key prefix for node.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
796
        :param node: The node being added.
797
        """
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
798
        assert self._search_prefix is not None
799
        assert prefix.startswith(self._search_prefix)
800
        assert len(prefix) == len(self._search_prefix) + 1
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
801
        self._len += len(node)
802
        if not len(self._items):
803
            self._node_width = len(prefix)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
804
        assert self._node_width == len(self._search_prefix) + 1
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
805
        self._items[prefix] = node
806
        self._key = None
807
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
808
    def _current_size(self):
809
        """Answer the current serialised size of this node."""
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
810
        return (self._raw_size + len(str(self._len)) + len(str(self._key_width)) +
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
811
            len(str(self._maximum_size)))
812
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
813
    @classmethod
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
814
    def deserialise(klass, bytes, key, search_key_func=None):
3735.2.25 by Robert Collins
CHKInventory core tests passing.
815
        """Deserialise bytes to an InternalNode, with key key.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
816
817
        :param bytes: The bytes of the node.
818
        :param key: The key that the serialised node has.
819
        :return: An InternalNode instance.
820
        """
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
821
        result = InternalNode(search_key_func=search_key_func)
3735.2.72 by John Arbash Meinel
Change deserialise to properly handle when there is a '\r' in the key.
822
        # Splitlines can split on '\r' so don't use it, remove the extra ''
823
        # from the result of split('\n') because we should have a trailing
824
        # newline
825
        lines = bytes.split('\n')
826
        assert lines[-1] == ''
827
        lines.pop(-1)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
828
        items = {}
829
        if lines[0] != 'chknode:':
830
            raise ValueError("not a serialised internal node: %r" % bytes)
831
        maximum_size = int(lines[1])
832
        width = int(lines[2])
833
        length = int(lines[3])
3735.15.11 by John Arbash Meinel
Change the InternalNodes to also pull out the common prefix.
834
        common_prefix = lines[4]
835
        for line in lines[5:]:
836
            line = common_prefix + line
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
837
            prefix, flat_key = line.rsplit('\x00', 1)
838
            items[prefix] = (flat_key,)
839
        result._items = items
840
        result._len = length
841
        result._maximum_size = maximum_size
842
        result._key = key
843
        result._key_width = width
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
844
        # XXX: InternalNodes don't really care about their size, and this will
845
        #      change if we add prefix compression
846
        result._raw_size = None # len(bytes)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
847
        result._node_width = len(prefix)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
848
        result._compute_search_prefix()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
849
        return result
850
851
    def iteritems(self, store, key_filter=None):
852
        for node in self._iter_nodes(store, key_filter=key_filter):
853
            for item in node.iteritems(store, key_filter=key_filter):
854
                yield item
855
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
856
    def _iter_nodes(self, store, key_filter=None, batch_size=None):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
857
        """Iterate over node objects which match key_filter.
858
859
        :param store: A store to use for accessing content.
860
        :param key_filter: A key filter to filter nodes. Only nodes that might
861
            contain a key in key_filter will be returned.
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
862
        :param batch_size: If not None, then we will return the nodes that had
863
            to be read using get_record_stream in batches, rather than reading
864
            them all at once.
865
        :return: An iterable of nodes. This function does not have to be fully
866
            consumed.  (There will be no pending I/O when items are being returned.)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
867
        """
3735.2.31 by Robert Collins
CHKMap.iter_changes
868
        keys = {}
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
869
        if key_filter is None:
3735.2.31 by Robert Collins
CHKMap.iter_changes
870
            for prefix, node in self._items.iteritems():
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
871
                if type(node) == tuple:
3735.2.31 by Robert Collins
CHKMap.iter_changes
872
                    keys[node] = prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
873
                else:
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
874
                    yield node
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
875
        else:
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
876
            # XXX defaultdict ?
877
            length_filters = {}
878
            for key in key_filter:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
879
                search_key = self._search_prefix_filter(key)
880
                length_filter = length_filters.setdefault(
881
                                    len(search_key), set())
882
                length_filter.add(search_key)
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
883
            length_filters = length_filters.items()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
884
            for prefix, node in self._items.iteritems():
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
885
                for length, length_filter in length_filters:
886
                    if prefix[:length] in length_filter:
887
                        if type(node) == tuple:
888
                            keys[node] = prefix
889
                        else:
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
890
                            yield node
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
891
                        break
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
892
        if keys:
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
893
            # Look in the page cache for some more bytes
894
            found_keys = set()
895
            for key in keys:
896
                try:
897
                    bytes = _page_cache[key]
898
                except KeyError:
899
                    continue
900
                else:
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
901
                    node = _deserialise(bytes, key,
902
                        search_key_func=self._search_key_func)
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
903
                    self._items[keys[key]] = node
904
                    found_keys.add(key)
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
905
                    yield node
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
906
            for key in found_keys:
907
                del keys[key]
908
        if keys:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
909
            # demand load some pages.
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
910
            if batch_size is None:
911
                # Read all the keys in
912
                batch_size = len(keys)
913
            key_order = list(keys)
914
            for batch_start in range(0, len(key_order), batch_size):
915
                batch = key_order[batch_start:batch_start + batch_size]
916
                # We have to fully consume the stream so there is no pending
917
                # I/O, so we buffer the nodes for now.
918
                stream = store.get_record_stream(batch, 'unordered', True)
919
                nodes = []
920
                for record in stream:
921
                    bytes = record.get_bytes_as('fulltext')
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
922
                    node = _deserialise(bytes, record.key,
923
                        search_key_func=self._search_key_func)
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
924
                    nodes.append(node)
925
                    self._items[keys[record.key]] = node
926
                    _page_cache.add(record.key, bytes)
927
                for node in nodes:
928
                    yield node
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
929
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
930
    def map(self, store, key, value):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
931
        """Map key to value."""
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
932
        if not len(self._items):
933
            raise AssertionError("cant map in an empty InternalNode.")
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
934
        search_key = self._search_key(key)
935
        assert self._node_width == len(self._search_prefix) + 1
936
        if not search_key.startswith(self._search_prefix):
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
937
            # This key doesn't fit in this index, so we need to split at the
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
938
            # point where it would fit, insert self into that internal node,
939
            # and then map this key into that node.
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
940
            new_prefix = self.common_prefix(self._search_prefix,
941
                                            search_key)
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
942
            new_parent = InternalNode(new_prefix,
943
                search_key_func=self._search_key_func)
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
944
            new_parent.set_maximum_size(self._maximum_size)
945
            new_parent._key_width = self._key_width
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
946
            new_parent.add_node(self._search_prefix[:len(new_prefix)+1],
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
947
                                self)
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
948
            return new_parent.map(store, key, value)
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
949
        children = list(self._iter_nodes(store, key_filter=[key]))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
950
        if children:
951
            child = children[0]
952
        else:
953
            # new child needed:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
954
            child = self._new_child(search_key, LeafNode)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
955
        old_len = len(child)
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
956
        if isinstance(child, LeafNode):
957
            old_size = child._current_size()
958
        else:
959
            old_size = None
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
960
        prefix, node_details = child.map(store, key, value)
961
        if len(node_details) == 1:
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
962
            # child may have shrunk, or might be a new node
963
            child = node_details[0][1]
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
964
            self._len = self._len - old_len + len(child)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
965
            self._items[search_key] = child
3735.2.29 by Robert Collins
Untested code is broken code.
966
            self._key = None
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
967
            new_node = self
968
            if (isinstance(child, LeafNode)
969
                and (old_size is None or child._current_size() < old_size)):
970
                # The old node was an InternalNode which means it has now
971
                # collapsed, so we need to check if it will chain to a collapse
972
                # at this level. Or the LeafNode has shrunk in size, so we need
973
                # to check that as well.
974
                new_node = self._check_remap(store)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
975
            assert new_node._search_prefix is not None
976
            return new_node._search_prefix, [('', new_node)]
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
977
        # child has overflown - create a new intermediate node.
978
        # XXX: This is where we might want to try and expand our depth
979
        # to refer to more bytes of every child (which would give us
980
        # multiple pointers to child nodes, but less intermediate nodes)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
981
        child = self._new_child(search_key, InternalNode)
982
        child._search_prefix = prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
983
        for split, node in node_details:
984
            child.add_node(split, node)
985
        self._len = self._len - old_len + len(child)
3735.2.29 by Robert Collins
Untested code is broken code.
986
        self._key = None
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
987
        return self._search_prefix, [("", self)]
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
988
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
989
    def _new_child(self, search_key, klass):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
990
        """Create a new child node of type klass."""
991
        child = klass()
992
        child.set_maximum_size(self._maximum_size)
993
        child._key_width = self._key_width
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
994
        child._search_key_func = self._search_key_func
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
995
        self._items[search_key] = child
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
996
        return child
997
998
    def serialise(self, store):
999
        """Serialise the node to store.
1000
1001
        :param store: A VersionedFiles honouring the CHK extensions.
1002
        :return: An iterable of the keys inserted by this operation.
1003
        """
1004
        for node in self._items.itervalues():
1005
            if type(node) == tuple:
1006
                # Never deserialised.
1007
                continue
1008
            if node._key is not None:
1009
                # Never altered
1010
                continue
1011
            for key in node.serialise(store):
1012
                yield key
1013
        lines = ["chknode:\n"]
1014
        lines.append("%d\n" % self._maximum_size)
1015
        lines.append("%d\n" % self._key_width)
1016
        lines.append("%d\n" % self._len)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1017
        assert self._search_prefix is not None
1018
        lines.append('%s\n' % (self._search_prefix,))
1019
        prefix_len = len(self._search_prefix)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1020
        for prefix, node in sorted(self._items.items()):
1021
            if type(node) == tuple:
1022
                key = node[0]
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1023
            else:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1024
                key = node._key[0]
3735.15.11 by John Arbash Meinel
Change the InternalNodes to also pull out the common prefix.
1025
            serialised = "%s\x00%s\n" % (prefix, key)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1026
            assert serialised.startswith(self._search_prefix)
3735.15.11 by John Arbash Meinel
Change the InternalNodes to also pull out the common prefix.
1027
            lines.append(serialised[prefix_len:])
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1028
        sha1, _, _ = store.add_lines((None,), (), lines)
1029
        self._key = ("sha1:" + sha1,)
3735.2.63 by Robert Collins
Divert writes into the CHK page cache as well.
1030
        _page_cache.add(self._key, ''.join(lines))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1031
        yield self._key
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1032
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1033
    def _search_key(self, key):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1034
        """Return the serialised key for key in this node."""
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1035
        # search keys are fixed width. All will be self._node_width wide, so we
3735.15.5 by John Arbash Meinel
Change the nomenclature.
1036
        # pad as necessary.
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
1037
        return (self._search_key_func(key) + '\x00'*self._node_width)[:self._node_width]
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1038
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1039
    def _search_prefix_filter(self, key):
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
1040
        """Serialise key for use as a prefix filter in iteritems."""
1041
        if len(key) == self._key_width:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1042
            return self._search_key(key)
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
1043
        return '\x00'.join(key)[:self._node_width]
1044
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1045
    def _split(self, offset):
1046
        """Split this node into smaller nodes starting at offset.
1047
1048
        :param offset: The offset to start the new child nodes at.
1049
        :return: An iterable of (prefix, node) tuples. prefix is a byte
1050
            prefix for reaching node.
1051
        """
1052
        if offset >= self._node_width:
1053
            for node in self._items.values():
1054
                for result in node._split(offset):
1055
                    yield result
1056
            return
1057
        for key, node in self._items.items():
1058
            pass
1059
3735.2.26 by Robert Collins
CHKInventory migrated to new CHKMap code.
1060
    def refs(self):
1061
        """Return the references to other CHK's held by this node."""
1062
        if self._key is None:
1063
            raise AssertionError("unserialised nodes have no refs.")
1064
        refs = []
1065
        for value in self._items.itervalues():
1066
            if type(value) == tuple:
1067
                refs.append(value)
1068
            else:
1069
                refs.append(value.key())
1070
        return refs
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1071
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1072
    def _compute_search_prefix(self, extra_key=None):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1073
        """Return the unique key prefix for this node.
1074
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1075
        :return: A bytestring of the longest search key prefix that is
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1076
            unique within this node.
1077
        """
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1078
        self._search_prefix = self.common_prefix_for_keys(self._items)
1079
        return self._search_prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1080
1081
    def unmap(self, store, key):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1082
        """Remove key from this node and it's children."""
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1083
        if not len(self._items):
1084
            raise AssertionError("cant unmap in an empty InternalNode.")
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
1085
        children = list(self._iter_nodes(store, key_filter=[key]))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1086
        if children:
1087
            child = children[0]
1088
        else:
1089
            raise KeyError(key)
1090
        self._len -= 1
1091
        unmapped = child.unmap(store, key)
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1092
        self._key = None
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1093
        search_key = self._search_key(key)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1094
        if len(unmapped) == 0:
1095
            # All child nodes are gone, remove the child:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1096
            del self._items[search_key]
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1097
            unmapped = None
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1098
        else:
1099
            # Stash the returned node
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1100
            self._items[search_key] = unmapped
3735.2.23 by Robert Collins
Test unmapping with one child left but multiple keys.
1101
        if len(self._items) == 1:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1102
            # this node is no longer needed:
1103
            return self._items.values()[0]
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1104
        if isinstance(unmapped, InternalNode):
1105
            return self
1106
        return self._check_remap(store)
1107
1108
    def _check_remap(self, store):
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
1109
        """Check if all keys contained by children fit in a single LeafNode.
1110
1111
        :param store: A store to use for reading more nodes
1112
        :return: Either self, or a new LeafNode which should replace self.
1113
        """
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1114
        # Logic for how we determine when we need to rebuild
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1115
        # 1) Implicitly unmap() is removing a key which means that the child
1116
        #    nodes are going to be shrinking by some extent.
1117
        # 2) If all children are LeafNodes, it is possible that they could be
1118
        #    combined into a single LeafNode, which can then completely replace
1119
        #    this internal node with a single LeafNode
1120
        # 3) If *one* child is an InternalNode, we assume it has already done
1121
        #    all the work to determine that its children cannot collapse, and
1122
        #    we can then assume that those nodes *plus* the current nodes don't
1123
        #    have a chance of collapsing either.
1124
        #    So a very cheap check is to just say if 'unmapped' is an
1125
        #    InternalNode, we don't have to check further.
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1126
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1127
        # TODO: Another alternative is to check the total size of all known
1128
        #       LeafNodes. If there is some formula we can use to determine the
1129
        #       final size without actually having to read in any more
1130
        #       children, it would be nice to have. However, we have to be
1131
        #       careful with stuff like nodes that pull out the common prefix
1132
        #       of each key, as adding a new key can change the common prefix
1133
        #       and cause size changes greater than the length of one key.
1134
        #       So for now, we just add everything to a new Leaf until it
1135
        #       splits, as we know that will give the right answer
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1136
        new_leaf = LeafNode(search_key_func=self._search_key_func)
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1137
        new_leaf.set_maximum_size(self._maximum_size)
1138
        new_leaf._key_width = self._key_width
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
1139
        # A batch_size of 16 was chosen because:
1140
        #   a) In testing, a 4k page held 14 times. So if we have more than 16
1141
        #      leaf nodes we are unlikely to hold them in a single new leaf
1142
        #      node. This still allows for 1 round trip
1143
        #   b) With 16-way fan out, we can still do a single round trip
1144
        #   c) With 255-way fan out, we don't want to read all 255 and destroy
1145
        #      the page cache, just to determine that we really don't need it.
1146
        for node in self._iter_nodes(store, batch_size=16):
1147
            if isinstance(node, InternalNode):
1148
                # Without looking at any leaf nodes, we are sure
1149
                return self
1150
            for key, value in node._items.iteritems():
1151
                if new_leaf._map_no_split(key, value):
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1152
                    return self
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1153
        return new_leaf
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1154
1155
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1156
def _deserialise(bytes, key, search_key_func):
3735.2.16 by Robert Collins
Untested extensions to support repodetails
1157
    """Helper for repositorydetails - convert bytes to a node."""
3735.2.24 by Robert Collins
test_chk_map tests all passing.
1158
    if bytes.startswith("chkleaf:\n"):
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1159
        return LeafNode.deserialise(bytes, key, search_key_func=search_key_func)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1160
    elif bytes.startswith("chknode:\n"):
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1161
        return InternalNode.deserialise(bytes, key,
1162
            search_key_func=search_key_func)
3735.2.16 by Robert Collins
Untested extensions to support repodetails
1163
    else:
1164
        raise AssertionError("Unknown node type.")
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1165
1166
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1167
def _find_children_info(store, interesting_keys, uninteresting_keys, pb):
3735.9.7 by John Arbash Meinel
Cleanup pass.
1168
    """Read the associated records, and determine what is interesting."""
1169
    uninteresting_keys = set(uninteresting_keys)
1170
    chks_to_read = uninteresting_keys.union(interesting_keys)
1171
    next_uninteresting = set()
1172
    next_interesting = set()
1173
    uninteresting_items = set()
1174
    interesting_items = set()
1175
    interesting_records = []
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1176
    # records_read = set()
3735.9.7 by John Arbash Meinel
Cleanup pass.
1177
    for record in store.get_record_stream(chks_to_read, 'unordered', True):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1178
        # records_read.add(record.key())
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1179
        if pb is not None:
1180
            pb.tick()
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1181
        bytes = record.get_bytes_as('fulltext')
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1182
        # We don't care about search_key_func for this code, because we only
1183
        # care about external references.
1184
        node = _deserialise(bytes, record.key, search_key_func=None)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1185
        if record.key in uninteresting_keys:
1186
            if isinstance(node, InternalNode):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1187
                next_uninteresting.update(node.refs())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1188
            else:
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1189
                # We know we are at a LeafNode, so we can pass None for the
1190
                # store
1191
                uninteresting_items.update(node.iteritems(None))
3735.9.7 by John Arbash Meinel
Cleanup pass.
1192
        else:
1193
            interesting_records.append(record)
1194
            if isinstance(node, InternalNode):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1195
                next_interesting.update(node.refs())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1196
            else:
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1197
                interesting_items.update(node.iteritems(None))
1198
    # TODO: Filter out records that have already been read, as node splitting
1199
    #       can cause us to reference the same nodes via shorter and longer
1200
    #       paths
3735.9.7 by John Arbash Meinel
Cleanup pass.
1201
    return (next_uninteresting, uninteresting_items,
1202
            next_interesting, interesting_records, interesting_items)
1203
1204
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1205
def _find_all_uninteresting(store, interesting_root_keys,
1206
                            uninteresting_root_keys, adapter, pb):
1207
    """Determine the full set of uninteresting keys."""
1208
    # What about duplicates between interesting_root_keys and
1209
    # uninteresting_root_keys?
1210
    if not uninteresting_root_keys:
1211
        # Shortcut case. We know there is nothing uninteresting to filter out
1212
        # So we just let the rest of the algorithm do the work
1213
        # We know there is nothing uninteresting, and we didn't have to read
1214
        # any interesting records yet.
1215
        return (set(), set(), set(interesting_root_keys), [], set())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1216
    all_uninteresting_chks = set(uninteresting_root_keys)
1217
    all_uninteresting_items = set()
1218
1219
    # First step, find the direct children of both the interesting and
1220
    # uninteresting set
1221
    (uninteresting_keys, uninteresting_items,
1222
     interesting_keys, interesting_records,
1223
     interesting_items) = _find_children_info(store, interesting_root_keys,
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1224
                                              uninteresting_root_keys,
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1225
                                              pb=pb)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1226
    all_uninteresting_chks.update(uninteresting_keys)
1227
    all_uninteresting_items.update(uninteresting_items)
1228
    del uninteresting_items
1229
    # Note: Exact matches between interesting and uninteresting do not need
1230
    #       to be search further. Non-exact matches need to be searched in case
1231
    #       there is a future exact-match
1232
    uninteresting_keys.difference_update(interesting_keys)
1233
3735.9.6 by John Arbash Meinel
Add a first pass to the interesting search.
1234
    # Second, find the full set of uninteresting bits reachable by the
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1235
    # uninteresting roots
3735.9.7 by John Arbash Meinel
Cleanup pass.
1236
    chks_to_read = uninteresting_keys
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1237
    while chks_to_read:
1238
        next_chks = set()
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1239
        for record in store.get_record_stream(chks_to_read, 'unordered', False):
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1240
            # TODO: Handle 'absent'
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1241
            if pb is not None:
1242
                pb.tick()
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1243
            if record.storage_kind in ('fulltext', 'chunked'):
1244
                bytes = record.get_bytes_as('fulltext')
1245
            else:
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1246
                bytes = adapter.get_bytes(record,
1247
                            record.get_bytes_as(record.storage_kind))
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1248
            # We don't care about search_key_func for this code, because we
1249
            # only care about external references.
1250
            node = _deserialise(bytes, record.key, search_key_func=None)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1251
            if isinstance(node, InternalNode):
1252
                # uninteresting_prefix_chks.update(node._items.iteritems())
1253
                chks = node._items.values()
1254
                # TODO: We remove the entries that are already in
1255
                #       uninteresting_chks ?
1256
                next_chks.update(chks)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1257
                all_uninteresting_chks.update(chks)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1258
            else:
3735.9.7 by John Arbash Meinel
Cleanup pass.
1259
                all_uninteresting_items.update(node._items.iteritems())
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1260
        chks_to_read = next_chks
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1261
    return (all_uninteresting_chks, all_uninteresting_items,
1262
            interesting_keys, interesting_records, interesting_items)
1263
1264
1265
def iter_interesting_nodes(store, interesting_root_keys,
1266
                           uninteresting_root_keys, pb=None):
1267
    """Given root keys, find interesting nodes.
1268
1269
    Evaluate nodes referenced by interesting_root_keys. Ones that are also
1270
    referenced from uninteresting_root_keys are not considered interesting.
1271
1272
    :param interesting_root_keys: keys which should be part of the
1273
        "interesting" nodes (which will be yielded)
1274
    :param uninteresting_root_keys: keys which should be filtered out of the
1275
        result set.
1276
    :return: Yield
1277
        (interesting records, interesting chk's, interesting key:values)
1278
    """
1279
    # TODO: consider that it may be more memory efficient to use the 20-byte
1280
    #       sha1 string, rather than tuples of hexidecimal sha1 strings.
3735.2.68 by John Arbash Meinel
Add a TODO about avoiding all of the get_record_stream calls.
1281
    # TODO: Try to factor out a lot of the get_record_stream() calls into a
1282
    #       helper function similar to _read_bytes. This function should be
1283
    #       able to use nodes from the _page_cache as well as actually
1284
    #       requesting bytes from the store.
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1285
1286
    # A way to adapt from the compressed texts back into fulltexts
1287
    # In a way, this seems like a layering inversion to have CHKMap know the
1288
    # details of versionedfile
1289
    adapter_class = versionedfile.adapter_registry.get(
1290
        ('knit-ft-gz', 'fulltext'))
1291
    adapter = adapter_class(store)
1292
1293
    (all_uninteresting_chks, all_uninteresting_items, interesting_keys,
1294
     interesting_records, interesting_items) = _find_all_uninteresting(store,
1295
        interesting_root_keys, uninteresting_root_keys, adapter, pb)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1296
1297
    # Now that we know everything uninteresting, we can yield information from
1298
    # our first request
1299
    interesting_items.difference_update(all_uninteresting_items)
1300
    records = dict((record.key, record) for record in interesting_records
1301
                    if record.key not in all_uninteresting_chks)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1302
    if records or interesting_items:
1303
        yield records, interesting_items
3735.9.7 by John Arbash Meinel
Cleanup pass.
1304
    interesting_keys.difference_update(all_uninteresting_chks)
1305
1306
    chks_to_read = interesting_keys
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1307
    while chks_to_read:
1308
        next_chks = set()
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1309
        for record in store.get_record_stream(chks_to_read, 'unordered', False):
1310
            if pb is not None:
1311
                pb.tick()
3735.9.7 by John Arbash Meinel
Cleanup pass.
1312
            # TODO: Handle 'absent'?
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1313
            if record.storage_kind in ('fulltext', 'chunked'):
1314
                bytes = record.get_bytes_as('fulltext')
1315
            else:
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1316
                bytes = adapter.get_bytes(record,
1317
                            record.get_bytes_as(record.storage_kind))
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1318
            # We don't care about search_key_func for this code, because we
1319
            # only care about external references.
1320
            node = _deserialise(bytes, record.key, search_key_func=None)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1321
            if isinstance(node, InternalNode):
3735.9.15 by John Arbash Meinel
Found a bug in iter_interesting_nodes and its test suite.
1322
                chks = set(node.refs())
1323
                chks.difference_update(all_uninteresting_chks)
1324
                # Is set() and .difference_update better than:
1325
                # chks = [chk for chk in node.refs()
1326
                #              if chk not in all_uninteresting_chks]
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1327
                next_chks.update(chks)
1328
                # These are now uninteresting everywhere else
3735.9.7 by John Arbash Meinel
Cleanup pass.
1329
                all_uninteresting_chks.update(chks)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1330
                interesting_items = []
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1331
            else:
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1332
                interesting_items = [item for item in node._items.iteritems()
1333
                                     if item not in all_uninteresting_items]
3735.9.15 by John Arbash Meinel
Found a bug in iter_interesting_nodes and its test suite.
1334
                # TODO: Do we need to filter out items that we have already
1335
                #       seen on other pages? We don't really want to buffer the
1336
                #       whole thing, but it does mean that callers need to
1337
                #       understand they may get duplicate values.
3735.9.7 by John Arbash Meinel
Cleanup pass.
1338
                # all_uninteresting_items.update(interesting_items)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1339
            yield {record.key: record}, interesting_items
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1340
        chks_to_read = next_chks