/brz/remove-bazaar

To get this branch, use:
bzr branch http://gegoxaren.bato24.eu/bzr/brz/remove-bazaar
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
1
# Copyright (C) 2008 Canonical Ltd
2
#
3
# This program is free software; you can redistribute it and/or modify
4
# it under the terms of the GNU General Public License as published by
5
# the Free Software Foundation; either version 2 of the License, or
6
# (at your option) any later version.
7
#
8
# This program is distributed in the hope that it will be useful,
9
# but WITHOUT ANY WARRANTY; without even the implied warranty of
10
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
11
# GNU General Public License for more details.
12
#
13
# You should have received a copy of the GNU General Public License
14
# along with this program; if not, write to the Free Software
15
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
16
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
17
"""Persistent maps from tuple_of_strings->string using CHK stores.
18
19
Overview and current status:
20
21
The CHKMap class implements a dict from tuple_of_strings->string by using a trie
22
with internal nodes of 8-bit fan out; The key tuples are mapped to strings by
23
joining them by \x00, and \x00 padding shorter keys out to the length of the
24
longest key. Leaf nodes are packed as densely as possible, and internal nodes
3735.19.1 by Ian Clatworthy
CHKMap cleanups
25
are all an additional 8-bits wide leading to a sparse upper tree.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
26
27
Updates to a CHKMap are done preferentially via the apply_delta method, to
28
allow optimisation of the update operation; but individual map/unmap calls are
29
possible and supported. All changes via map/unmap are buffered in memory until
30
the _save method is called to force serialisation of the tree. apply_delta
31
performs a _save implicitly.
32
33
TODO:
34
-----
35
36
Densely packed upper nodes.
37
38
"""
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
39
3735.2.31 by Robert Collins
CHKMap.iter_changes
40
import heapq
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
41
3735.9.18 by John Arbash Meinel
Make the versionedfile import lazy.
42
from bzrlib import lazy_import
43
lazy_import.lazy_import(globals(), """
3735.16.3 by John Arbash Meinel
Add functions for _search_key_16 and _search_key_255 and some basic tests for them.
44
import zlib
45
import struct
46
3735.9.18 by John Arbash Meinel
Make the versionedfile import lazy.
47
from bzrlib import versionedfile
48
""")
3735.16.7 by John Arbash Meinel
Start parameterizing CHKInventory and CHKSerializer so that we can
49
from bzrlib import (
3735.2.98 by John Arbash Meinel
Merge bzr.dev 4032. Resolve the new streaming fetch.
50
    errors,
3735.16.7 by John Arbash Meinel
Start parameterizing CHKInventory and CHKSerializer so that we can
51
    lru_cache,
3735.17.1 by John Arbash Meinel
Change the serialized form for leaf nodes.
52
    osutils,
3735.16.7 by John Arbash Meinel
Start parameterizing CHKInventory and CHKSerializer so that we can
53
    registry,
3735.2.112 by John Arbash Meinel
Merge Ian's try/except helper for aiding in debugging strange failures.
54
    trace,
3735.16.7 by John Arbash Meinel
Start parameterizing CHKInventory and CHKSerializer so that we can
55
    )
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
56
3735.19.1 by Ian Clatworthy
CHKMap cleanups
57
# approx 4MB
3735.14.5 by John Arbash Meinel
Change _check_remap to only page in a batch of children at a time.
58
# If each line is 50 bytes, and you have 255 internal pages, with 255-way fan
59
# out, it takes 3.1MB to cache the layer.
60
_PAGE_CACHE_SIZE = 4*1024*1024
3735.14.2 by John Arbash Meinel
Finish using the page cache as part of _check_remap, add debugging functions
61
# We are caching bytes so len(value) is perfectly accurate
62
_page_cache = lru_cache.LRUSizeCache(_PAGE_CACHE_SIZE)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
63
3735.2.123 by Ian Clatworthy
only check for remap if changes are interesting in size
64
# If a ChildNode falls below this many bytes, we check for a remap
65
_INTERESTING_NEW_SIZE = 50
66
# If a ChildNode shrinks by more than this amount, we check for a remap
67
_INTERESTING_SHRINKAGE_LIMIT = 20
68
# If we delete more than this many nodes applying a delta, we check for a remap
69
_INTERESTING_DELETES_LIMIT = 5
70
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
71
3735.16.6 by John Arbash Meinel
Include a _search_key_plain function.
72
def _search_key_plain(key):
73
    """Map the key tuple into a search string that just uses the key bytes."""
74
    return '\x00'.join(key)
75
76
3735.2.87 by Vincent Ladeuil
Same player shoots again, zlib.crc32, we'll get you.
77
def _crc32(bit):
3735.2.83 by Vincent Ladeuil
Better fix with explanation for zlib.crc32.
78
    # Depending on python version and platform, zlib.crc32 will return either a
79
    # signed (<= 2.5 >= 3.0) or an unsigned (2.5, 2.6).
80
    # http://docs.python.org/library/zlib.html recommends using a mask to force
81
    # an unsigned value to ensure the same numeric value (unsigned) is obtained
82
    # across all python versions and platforms.
3735.2.84 by John Arbash Meinel
Comment about using using 0xFFFFFFFF as part of _search_key_255
83
    # Note: However, on 32-bit platforms this causes an upcast to PyLong, which
84
    #       are generally slower than PyInts. However, if performance becomes
85
    #       critical, we should probably write the whole thing as an extension
86
    #       anyway.
87
    #       Though we really don't need that 32nd bit of accuracy. (even 2**24
88
    #       is probably enough node fan out for realistic trees.)
3735.2.87 by Vincent Ladeuil
Same player shoots again, zlib.crc32, we'll get you.
89
    return zlib.crc32(bit)&0xFFFFFFFF
90
91
92
def _search_key_16(key):
93
    """Map the key tuple into a search key string which has 16-way fan out."""
94
    return '\x00'.join(['%08X' % _crc32(bit) for bit in key])
95
96
97
def _search_key_255(key):
98
    """Map the key tuple into a search key string which has 255-way fan out.
99
100
    We use 255-way because '\n' is used as a delimiter, and causes problems
101
    while parsing.
102
    """
103
    bytes = '\x00'.join([struct.pack('>L', _crc32(bit)) for bit in key])
3735.16.3 by John Arbash Meinel
Add functions for _search_key_16 and _search_key_255 and some basic tests for them.
104
    return bytes.replace('\n', '_')
105
106
3735.16.7 by John Arbash Meinel
Start parameterizing CHKInventory and CHKSerializer so that we can
107
search_key_registry = registry.Registry()
108
search_key_registry.register('plain', _search_key_plain)
109
search_key_registry.register('hash-16-way', _search_key_16)
110
search_key_registry.register('hash-255-way', _search_key_255)
111
112
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
113
class CHKMap(object):
114
    """A persistent map from string to string backed by a CHK store."""
115
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
116
    def __init__(self, store, root_key, search_key_func=None):
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
117
        """Create a CHKMap object.
118
119
        :param store: The store the CHKMap is stored in.
120
        :param root_key: The root key of the map. None to create an empty
121
            CHKMap.
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
122
        :param search_key_func: A function mapping a key => bytes. These bytes
123
            are then used by the internal nodes to split up leaf nodes into
124
            multiple pages.
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
125
        """
126
        self._store = store
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
127
        if search_key_func is None:
3735.16.6 by John Arbash Meinel
Include a _search_key_plain function.
128
            search_key_func = _search_key_plain
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
129
        self._search_key_func = search_key_func
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
130
        if root_key is None:
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
131
            self._root_node = LeafNode(search_key_func=search_key_func)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
132
        else:
3735.2.41 by Robert Collins
Make the parent_id_basename index be updated during CHKInventory.apply_delta.
133
            self._root_node = self._node_key(root_key)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
134
135
    def apply_delta(self, delta):
136
        """Apply a delta to the map.
137
138
        :param delta: An iterable of old_key, new_key, new_value tuples.
139
            If new_key is not None, then new_key->new_value is inserted
140
            into the map; if old_key is not None, then the old mapping
141
            of old_key is removed.
142
        """
3735.2.123 by Ian Clatworthy
only check for remap if changes are interesting in size
143
        delete_count = 0
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
144
        for old, new, value in delta:
145
            if old is not None and old != new:
3735.2.122 by Ian Clatworthy
don't check_remap on every unmap call in CHKMap.apply_delta()
146
                self.unmap(old, check_remap=False)
3735.2.123 by Ian Clatworthy
only check for remap if changes are interesting in size
147
                delete_count += 1
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
148
        for old, new, value in delta:
149
            if new is not None:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
150
                self.map(new, value)
3735.2.123 by Ian Clatworthy
only check for remap if changes are interesting in size
151
        if delete_count > _INTERESTING_DELETES_LIMIT:
152
            trace.mutter("checking remap as %d deletions", delete_count)
3735.2.122 by Ian Clatworthy
don't check_remap on every unmap call in CHKMap.apply_delta()
153
            self._check_remap()
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
154
        return self._save()
155
156
    def _ensure_root(self):
157
        """Ensure that the root node is an object not a key."""
158
        if type(self._root_node) == tuple:
159
            # Demand-load the root
3735.2.31 by Robert Collins
CHKMap.iter_changes
160
            self._root_node = self._get_node(self._root_node)
161
162
    def _get_node(self, node):
163
        """Get a node.
164
3735.19.1 by Ian Clatworthy
CHKMap cleanups
165
        Note that this does not update the _items dict in objects containing a
3735.2.31 by Robert Collins
CHKMap.iter_changes
166
        reference to this node. As such it does not prevent subsequent IO being
167
        performed.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
168
3735.2.31 by Robert Collins
CHKMap.iter_changes
169
        :param node: A tuple key or node object.
170
        :return: A node object.
171
        """
172
        if type(node) == tuple:
173
            bytes = self._read_bytes(node)
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
174
            return _deserialise(bytes, node,
175
                search_key_func=self._search_key_func)
3735.2.31 by Robert Collins
CHKMap.iter_changes
176
        else:
177
            return node
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
178
179
    def _read_bytes(self, key):
3735.2.124 by Ian Clatworthy
use the page cache in CHKMap._read_bytes()
180
        try:
181
            return _page_cache[key]
182
        except KeyError:
183
            stream = self._store.get_record_stream([key], 'unordered', True)
3735.23.1 by John Arbash Meinel
If you are going to read from the page cache,
184
            bytes = stream.next().get_bytes_as('fulltext')
185
            _page_cache[key] = bytes
186
            return bytes
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
187
3735.15.16 by John Arbash Meinel
Properly fix up the dump_tree tests, we now suppress the keys by default.
188
    def _dump_tree(self, include_keys=False):
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
189
        """Return the tree in a string representation."""
190
        self._ensure_root()
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
191
        res = self._dump_tree_node(self._root_node, prefix='', indent='',
192
                                   include_keys=include_keys)
3735.11.9 by John Arbash Meinel
Switch _dump_tree to returning trailing '\n' for nicer results
193
        res.append('') # Give a trailing '\n'
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
194
        return '\n'.join(res)
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
195
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
196
    def _dump_tree_node(self, node, prefix, indent, include_keys=True):
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
197
        """For this node and all children, generate a string representation."""
198
        result = []
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
199
        if not include_keys:
200
            key_str = ''
201
        else:
202
            node_key = node.key()
203
            if node_key is not None:
204
                key_str = ' %s' % (node_key[0],)
205
            else:
206
                key_str = ' None'
207
        result.append('%s%r %s%s' % (indent, prefix, node.__class__.__name__,
208
                                     key_str))
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
209
        if isinstance(node, InternalNode):
210
            # Trigger all child nodes to get loaded
211
            list(node._iter_nodes(self._store))
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
212
            for prefix, sub in sorted(node._items.iteritems()):
3735.15.15 by John Arbash Meinel
Change child_child to use _dump_tree,
213
                result.extend(self._dump_tree_node(sub, prefix, indent + '  ',
214
                                                   include_keys=include_keys))
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
215
        else:
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
216
            for key, value in sorted(node._items.iteritems()):
3735.2.109 by Vincent Ladeuil
Fixed as per John's comments
217
                # Don't use prefix nor indent here to line up when used in
218
                # tests in conjunction with assertEqualDiff
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
219
                result.append('      %r %r' % (key, value))
3735.9.2 by John Arbash Meinel
Add a _dump_tree helper that assists in debugging what is going on.
220
        return result
221
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
222
    @classmethod
3735.19.1 by Ian Clatworthy
CHKMap cleanups
223
    def from_dict(klass, store, initial_value, maximum_size=0, key_width=1,
224
        search_key_func=None):
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
225
        """Create a CHKMap in store with initial_value as the content.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
226
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
227
        :param store: The store to record initial_value in, a VersionedFiles
228
            object with 1-tuple keys supporting CHK key generation.
229
        :param initial_value: A dict to store in store. Its keys and values
230
            must be bytestrings.
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
231
        :param maximum_size: The maximum_size rule to apply to nodes. This
232
            determines the size at which no new data is added to a single node.
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
233
        :param key_width: The number of elements in each key_tuple being stored
234
            in this map.
3735.19.1 by Ian Clatworthy
CHKMap cleanups
235
        :param search_key_func: A function mapping a key => bytes. These bytes
236
            are then used by the internal nodes to split up leaf nodes into
237
            multiple pages.
238
        :return: The root chk of the resulting CHKMap.
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
239
        """
3735.19.1 by Ian Clatworthy
CHKMap cleanups
240
        result = CHKMap(store, None, search_key_func=search_key_func)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
241
        result._root_node.set_maximum_size(maximum_size)
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
242
        result._root_node._key_width = key_width
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
243
        delta = []
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
244
        for key, value in initial_value.items():
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
245
            delta.append((None, key, value))
3735.19.1 by Ian Clatworthy
CHKMap cleanups
246
        return result.apply_delta(delta)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
247
3735.2.30 by Robert Collins
Start iter_changes between CHKMap instances.
248
    def iter_changes(self, basis):
249
        """Iterate over the changes between basis and self.
250
251
        :return: An iterator of tuples: (key, old_value, new_value). Old_value
252
            is None for keys only in self; new_value is None for keys only in
253
            basis.
254
        """
3735.2.31 by Robert Collins
CHKMap.iter_changes
255
        # Overview:
256
        # Read both trees in lexographic, highest-first order.
257
        # Any identical nodes we skip
258
        # Any unique prefixes we output immediately.
259
        # values in a leaf node are treated as single-value nodes in the tree
260
        # which allows them to be not-special-cased. We know to output them
261
        # because their value is a string, not a key(tuple) or node.
262
        #
263
        # corner cases to beware of when considering this function:
264
        # *) common references are at different heights.
265
        #    consider two trees:
266
        #    {'a': LeafNode={'aaa':'foo', 'aab':'bar'}, 'b': LeafNode={'b'}}
267
        #    {'a': InternalNode={'aa':LeafNode={'aaa':'foo', 'aab':'bar'}, 'ab':LeafNode={'ab':'bar'}}
268
        #     'b': LeafNode={'b'}}
269
        #    the node with aaa/aab will only be encountered in the second tree
270
        #    after reading the 'a' subtree, but it is encountered in the first
271
        #    tree immediately. Variations on this may have read internal nodes like this.
272
        #    we want to cut the entire pending subtree when we realise we have a common node.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
273
        #    For this we use a list of keys - the path to a node - and check the entire path is
3735.2.31 by Robert Collins
CHKMap.iter_changes
274
        #    clean as we process each item.
275
        if self._node_key(self._root_node) == self._node_key(basis._root_node):
276
            return
277
        self._ensure_root()
278
        basis._ensure_root()
279
        excluded_keys = set()
280
        self_node = self._root_node
281
        basis_node = basis._root_node
282
        # A heap, each element is prefix, node(tuple/NodeObject/string),
283
        # key_path (a list of tuples, tail-sharing down the tree.)
284
        self_pending = []
285
        basis_pending = []
3735.25.3 by John Arbash Meinel
Pre-filter when the nodes are identical.
286
        def process_node(node, path, a_map, pending):
3735.2.31 by Robert Collins
CHKMap.iter_changes
287
            # take a node and expand it
288
            node = a_map._get_node(node)
289
            if type(node) == LeafNode:
290
                path = (node._key, path)
291
                for key, value in node._items.items():
3735.25.2 by John Arbash Meinel
Change the data that is put on the queue.
292
                    # For a LeafNode, the key is a serialized_key, rather than
293
                    # a search_key, but the heap is using search_keys
294
                    search_key = node._search_key_func(key)
295
                    heapq.heappush(pending, (search_key, key, value, path))
3735.2.31 by Robert Collins
CHKMap.iter_changes
296
            else:
297
                # type(node) == InternalNode
298
                path = (node._key, path)
299
                for prefix, child in node._items.items():
3735.25.2 by John Arbash Meinel
Change the data that is put on the queue.
300
                    heapq.heappush(pending, (prefix, None, child, path))
3735.25.3 by John Arbash Meinel
Pre-filter when the nodes are identical.
301
        def process_common_internal_nodes(self_node, basis_node):
302
            self_items = set(self_node._items.items())
303
            basis_items = set(basis_node._items.items())
304
            path = (self_node._key, None)
305
            for prefix, child in self_items - basis_items:
306
                heapq.heappush(self_pending, (prefix, None, child, path))
307
            path = (basis_node._key, None)
308
            for prefix, child in basis_items - self_items:
309
                heapq.heappush(basis_pending, (prefix, None, child, path))
310
        def process_common_leaf_nodes(self_node, basis_node):
311
            self_items = set(self_node._items.items())
312
            basis_items = set(basis_node._items.items())
313
            path = (self_node._key, None)
314
            for key, value in self_items - basis_items:
315
                prefix = self._search_key_func(key)
316
                heapq.heappush(self_pending, (prefix, key, value, path))
317
            path = (basis_node._key, None)
318
            for key, value in basis_items - self_items:
319
                prefix = basis._search_key_func(key)
320
                heapq.heappush(basis_pending, (prefix, key, value, path))
321
        def process_common_prefix_nodes(self_node, self_path,
322
                                        basis_node, basis_path):
323
            # Would it be more efficient if we could request both at the same
324
            # time?
325
            self_node = self._get_node(self_node)
326
            basis_node = basis._get_node(basis_node)
327
            if (type(self_node) == InternalNode
328
                and type(basis_node) == InternalNode):
329
                # Matching internal nodes
330
                process_common_internal_nodes(self_node, basis_node)
331
            elif (type(self_node) == LeafNode
332
                  and type(basis_node) == LeafNode):
333
                process_common_leaf_nodes(self_node, basis_node)
334
            else:
335
                process_node(self_node, self_path, self, self_pending)
336
                process_node(basis_node, basis_path, basis, basis_pending)
337
        process_common_prefix_nodes(self_node, None, basis_node, None)
3735.2.31 by Robert Collins
CHKMap.iter_changes
338
        self_seen = set()
339
        basis_seen = set()
340
        excluded_keys = set()
341
        def check_excluded(key_path):
342
            # Note that this is N^2, it depends on us trimming trees
343
            # aggressively to not become slow.
344
            # A better implementation would probably have a reverse map
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
345
            # back to the children of a node, and jump straight to it when
3735.2.31 by Robert Collins
CHKMap.iter_changes
346
            # a common node is detected, the proceed to remove the already
347
            # pending children. bzrlib.graph has a searcher module with a
348
            # similar problem.
349
            while key_path is not None:
350
                key, key_path = key_path
351
                if key in excluded_keys:
352
                    return True
353
            return False
354
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
355
        loop_counter = 0
3735.2.31 by Robert Collins
CHKMap.iter_changes
356
        while self_pending or basis_pending:
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
357
            loop_counter += 1
3735.2.31 by Robert Collins
CHKMap.iter_changes
358
            if not self_pending:
359
                # self is exhausted: output remainder of basis
3735.25.2 by John Arbash Meinel
Change the data that is put on the queue.
360
                for prefix, key, node, path in basis_pending:
3735.2.31 by Robert Collins
CHKMap.iter_changes
361
                    if check_excluded(path):
362
                        continue
363
                    node = basis._get_node(node)
3735.25.2 by John Arbash Meinel
Change the data that is put on the queue.
364
                    if key is not None:
3735.2.31 by Robert Collins
CHKMap.iter_changes
365
                        # a value
3735.25.2 by John Arbash Meinel
Change the data that is put on the queue.
366
                        yield (key, node, None)
3735.2.31 by Robert Collins
CHKMap.iter_changes
367
                    else:
368
                        # subtree - fastpath the entire thing.
369
                        for key, value in node.iteritems(basis._store):
370
                            yield (key, value, None)
371
                return
372
            elif not basis_pending:
373
                # basis is exhausted: output remainder of self.
3735.25.2 by John Arbash Meinel
Change the data that is put on the queue.
374
                for prefix, key, node, path in self_pending:
3735.2.31 by Robert Collins
CHKMap.iter_changes
375
                    if check_excluded(path):
376
                        continue
377
                    node = self._get_node(node)
3735.25.2 by John Arbash Meinel
Change the data that is put on the queue.
378
                    if key is not None:
3735.2.31 by Robert Collins
CHKMap.iter_changes
379
                        # a value
3735.25.2 by John Arbash Meinel
Change the data that is put on the queue.
380
                        yield (key, None, node)
3735.2.31 by Robert Collins
CHKMap.iter_changes
381
                    else:
382
                        # subtree - fastpath the entire thing.
383
                        for key, value in node.iteritems(self._store):
384
                            yield (key, None, value)
385
                return
386
            else:
387
                # XXX: future optimisation - yield the smaller items
388
                # immediately rather than pushing everything on/off the
389
                # heaps. Applies to both internal nodes and leafnodes.
390
                if self_pending[0][0] < basis_pending[0][0]:
391
                    # expand self
3735.25.2 by John Arbash Meinel
Change the data that is put on the queue.
392
                    prefix, key, node, path = heapq.heappop(self_pending)
3735.2.31 by Robert Collins
CHKMap.iter_changes
393
                    if check_excluded(path):
394
                        continue
3735.25.2 by John Arbash Meinel
Change the data that is put on the queue.
395
                    if key is not None:
3735.2.31 by Robert Collins
CHKMap.iter_changes
396
                        # a value
3735.25.2 by John Arbash Meinel
Change the data that is put on the queue.
397
                        yield (key, None, node)
3735.2.31 by Robert Collins
CHKMap.iter_changes
398
                    else:
3735.25.3 by John Arbash Meinel
Pre-filter when the nodes are identical.
399
                        process_node(node, path, self, self_pending)
3735.2.31 by Robert Collins
CHKMap.iter_changes
400
                        continue
401
                elif self_pending[0][0] > basis_pending[0][0]:
402
                    # expand basis
3735.25.2 by John Arbash Meinel
Change the data that is put on the queue.
403
                    prefix, key, node, path = heapq.heappop(basis_pending)
3735.2.31 by Robert Collins
CHKMap.iter_changes
404
                    if check_excluded(path):
405
                        continue
3735.25.2 by John Arbash Meinel
Change the data that is put on the queue.
406
                    if key is not None:
3735.2.31 by Robert Collins
CHKMap.iter_changes
407
                        # a value
3735.25.2 by John Arbash Meinel
Change the data that is put on the queue.
408
                        yield (key, node, None)
3735.2.31 by Robert Collins
CHKMap.iter_changes
409
                    else:
3735.25.3 by John Arbash Meinel
Pre-filter when the nodes are identical.
410
                        process_node(node, path, basis, basis_pending)
3735.2.31 by Robert Collins
CHKMap.iter_changes
411
                        continue
412
                else:
413
                    # common prefix: possibly expand both
3735.25.2 by John Arbash Meinel
Change the data that is put on the queue.
414
                    if self_pending[0][1] is None:
3735.2.31 by Robert Collins
CHKMap.iter_changes
415
                        # process next self
416
                        read_self = True
417
                    else:
418
                        read_self = False
3735.25.2 by John Arbash Meinel
Change the data that is put on the queue.
419
                    if basis_pending[0][1] is None:
3735.2.31 by Robert Collins
CHKMap.iter_changes
420
                        # process next basis
421
                        read_basis = True
422
                    else:
423
                        read_basis = False
424
                    if not read_self and not read_basis:
425
                        # compare a common value
426
                        self_details = heapq.heappop(self_pending)
427
                        basis_details = heapq.heappop(basis_pending)
3735.25.2 by John Arbash Meinel
Change the data that is put on the queue.
428
                        if self_details[2] != basis_details[2]:
429
                            yield (self_details[1],
430
                                basis_details[2], self_details[2])
3735.2.31 by Robert Collins
CHKMap.iter_changes
431
                        continue
3735.25.3 by John Arbash Meinel
Pre-filter when the nodes are identical.
432
                    # At least one side wasn't a simple value
3735.25.2 by John Arbash Meinel
Change the data that is put on the queue.
433
                    if (self._node_key(self_pending[0][2]) ==
434
                        self._node_key(basis_pending[0][2])):
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
435
                        # Identical pointers, skip (and don't bother adding to
436
                        # excluded, it won't turn up again.
437
                        heapq.heappop(self_pending)
438
                        heapq.heappop(basis_pending)
439
                        continue
440
                    # Now we need to expand this node before we can continue
3735.25.3 by John Arbash Meinel
Pre-filter when the nodes are identical.
441
                    if read_self and read_basis:
442
                        # Both sides start with the same prefix, so process
443
                        # them in parallel
444
                        self_prefix, _, self_node, self_path = heapq.heappop(
445
                            self_pending)
446
                        basis_prefix, _, basis_node, basis_path = heapq.heappop(
447
                            basis_pending)
448
                        assert self_prefix == basis_prefix
449
                        process_common_prefix_nodes(
450
                            self_node, self_path,
451
                            basis_node, basis_path)
452
                        continue
3735.2.31 by Robert Collins
CHKMap.iter_changes
453
                    if read_self:
3735.25.2 by John Arbash Meinel
Change the data that is put on the queue.
454
                        prefix, key, node, path = heapq.heappop(self_pending)
3735.2.31 by Robert Collins
CHKMap.iter_changes
455
                        if check_excluded(path):
456
                            continue
3735.25.3 by John Arbash Meinel
Pre-filter when the nodes are identical.
457
                        process_node(node, path, self, self_pending)
3735.2.31 by Robert Collins
CHKMap.iter_changes
458
                    if read_basis:
3735.25.2 by John Arbash Meinel
Change the data that is put on the queue.
459
                        prefix, key, node, path = heapq.heappop(basis_pending)
3735.2.31 by Robert Collins
CHKMap.iter_changes
460
                        if check_excluded(path):
461
                            continue
3735.25.3 by John Arbash Meinel
Pre-filter when the nodes are identical.
462
                        process_node(node, path, basis, basis_pending)
3735.2.32 by Robert Collins
Activate test for common node skipping. - 50 times performance improvement.
463
        # print loop_counter
3735.2.30 by Robert Collins
Start iter_changes between CHKMap instances.
464
3735.2.9 by Robert Collins
Get a working chk_map using inventory implementation bootstrapped.
465
    def iteritems(self, key_filter=None):
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
466
        """Iterate over the entire CHKMap's contents."""
467
        self._ensure_root()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
468
        return self._root_node.iteritems(self._store, key_filter=key_filter)
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
469
3735.2.12 by Robert Collins
Implement commit-via-deltas for split inventory repositories.
470
    def key(self):
471
        """Return the key for this map."""
472
        if isinstance(self._root_node, tuple):
473
            return self._root_node
474
        else:
475
            return self._root_node._key
476
3735.2.17 by Robert Collins
Cache node length to avoid full iteration on __len__ calls.
477
    def __len__(self):
478
        self._ensure_root()
479
        return len(self._root_node)
480
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
481
    def map(self, key, value):
482
        """Map a key tuple to value."""
483
        # Need a root object.
484
        self._ensure_root()
485
        prefix, node_details = self._root_node.map(self._store, key, value)
486
        if len(node_details) == 1:
487
            self._root_node = node_details[0][1]
488
        else:
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
489
            self._root_node = InternalNode(prefix,
490
                                search_key_func=self._search_key_func)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
491
            self._root_node.set_maximum_size(node_details[0][1].maximum_size)
492
            self._root_node._key_width = node_details[0][1]._key_width
493
            for split, node in node_details:
494
                self._root_node.add_node(split, node)
495
3735.2.31 by Robert Collins
CHKMap.iter_changes
496
    def _node_key(self, node):
3735.19.1 by Ian Clatworthy
CHKMap cleanups
497
        """Get the key for a node whether it's a tuple or node."""
3735.2.31 by Robert Collins
CHKMap.iter_changes
498
        if type(node) == tuple:
499
            return node
500
        else:
501
            return node._key
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
502
3735.2.122 by Ian Clatworthy
don't check_remap on every unmap call in CHKMap.apply_delta()
503
    def unmap(self, key, check_remap=True):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
504
        """remove key from the map."""
505
        self._ensure_root()
3735.2.122 by Ian Clatworthy
don't check_remap on every unmap call in CHKMap.apply_delta()
506
        if isinstance(self._root_node, InternalNode):
507
            unmapped = self._root_node.unmap(self._store, key,
508
                check_remap=check_remap)
509
        else:
510
            unmapped = self._root_node.unmap(self._store, key)
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
511
        self._root_node = unmapped
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
512
3735.2.122 by Ian Clatworthy
don't check_remap on every unmap call in CHKMap.apply_delta()
513
    def _check_remap(self):
514
        """Check if nodes can be collapsed."""
515
        self._ensure_root()
516
        if isinstance(self._root_node, InternalNode):
517
            self._root_node._check_remap(self._store)
518
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
519
    def _save(self):
520
        """Save the map completely.
521
522
        :return: The key of the root node.
523
        """
524
        if type(self._root_node) == tuple:
525
            # Already saved.
526
            return self._root_node
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
527
        keys = list(self._root_node.serialise(self._store))
528
        return keys[-1]
3735.2.8 by Robert Collins
New chk_map module for use in tree based inventory storage.
529
530
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
531
class Node(object):
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
532
    """Base class defining the protocol for CHK Map nodes.
533
534
    :ivar _raw_size: The total size of the serialized key:value data, before
535
        adding the header bytes, and without prefix compression.
536
    """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
537
538
    def __init__(self, key_width=1):
539
        """Create a node.
540
541
        :param key_width: The width of keys for this node.
542
        """
543
        self._key = None
544
        # Current number of elements
545
        self._len = 0
546
        self._maximum_size = 0
3735.19.1 by Ian Clatworthy
CHKMap cleanups
547
        self._key_width = key_width
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
548
        # current size in bytes
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
549
        self._raw_size = 0
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
550
        # The pointers/values this node has - meaning defined by child classes.
551
        self._items = {}
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
552
        # The common search prefix
553
        self._search_prefix = None
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
554
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
555
    def __repr__(self):
3735.2.111 by John Arbash Meinel
Merge Ian's updates to chk_map and chk_inventory.create_by_apply_delta.
556
        items_str = str(sorted(self._items))
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
557
        if len(items_str) > 20:
558
            items_str = items_str[16] + '...]'
3735.15.3 by John Arbash Meinel
repr update
559
        return '%s(key:%s len:%s size:%s max:%s prefix:%s items:%s)' % (
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
560
            self.__class__.__name__, self._key, self._len, self._raw_size,
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
561
            self._maximum_size, self._search_prefix, items_str)
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
562
3735.2.38 by Robert Collins
Sufficient fixes to allow bzr-search to index a dev3 format repository.
563
    def key(self):
564
        return self._key
565
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
566
    def __len__(self):
567
        return self._len
568
569
    @property
570
    def maximum_size(self):
571
        """What is the upper limit for adding references to a node."""
572
        return self._maximum_size
573
574
    def set_maximum_size(self, new_size):
575
        """Set the size threshold for nodes.
576
577
        :param new_size: The size at which no data is added to a node. 0 for
578
            unlimited.
579
        """
580
        self._maximum_size = new_size
581
3735.15.5 by John Arbash Meinel
Change the nomenclature.
582
    @classmethod
583
    def common_prefix(cls, prefix, key):
584
        """Given 2 strings, return the longest prefix common to both.
585
586
        :param prefix: This has been the common prefix for other keys, so it is
587
            more likely to be the common prefix in this case as well.
588
        :param key: Another string to compare to
589
        """
590
        if key.startswith(prefix):
591
            return prefix
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
592
        # Is there a better way to do this?
3735.15.5 by John Arbash Meinel
Change the nomenclature.
593
        for pos, (left, right) in enumerate(zip(prefix, key)):
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
594
            if left != right:
3735.2.89 by Vincent Ladeuil
Fix the bogus previous fix.
595
                pos -= 1
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
596
                break
3735.2.89 by Vincent Ladeuil
Fix the bogus previous fix.
597
        common = prefix[:pos+1]
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
598
        return common
599
3735.15.5 by John Arbash Meinel
Change the nomenclature.
600
    @classmethod
601
    def common_prefix_for_keys(cls, keys):
602
        """Given a list of keys, find their common prefix.
603
604
        :param keys: An iterable of strings.
605
        :return: The longest common prefix of all keys.
606
        """
607
        common_prefix = None
608
        for key in keys:
609
            if common_prefix is None:
610
                common_prefix = key
611
                continue
612
            common_prefix = cls.common_prefix(common_prefix, key)
613
            if not common_prefix:
614
                # if common_prefix is the empty string, then we know it won't
615
                # change further
616
                return ''
617
        return common_prefix
618
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
619
3735.2.131 by John Arbash Meinel
Only compute LeafNode._search_prefix when we will use it.
620
# Singleton indicating we have not computed _search_prefix yet
621
_unknown = object()
622
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
623
class LeafNode(Node):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
624
    """A node containing actual key:value pairs.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
625
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
626
    :ivar _items: A dict of key->value items. The key is in tuple form.
3735.15.4 by John Arbash Meinel
Clean up some little bits.
627
    :ivar _size: The number of bytes that would be used by serializing all of
628
        the key/value pairs.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
629
    """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
630
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
631
    def __init__(self, search_key_func=None):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
632
        Node.__init__(self)
3735.15.4 by John Arbash Meinel
Clean up some little bits.
633
        # All of the keys in this leaf node share this common prefix
3735.15.5 by John Arbash Meinel
Change the nomenclature.
634
        self._common_serialised_prefix = None
635
        self._serialise_key = '\x00'.join
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
636
        if search_key_func is None:
3735.16.6 by John Arbash Meinel
Include a _search_key_plain function.
637
            self._search_key_func = _search_key_plain
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
638
        else:
639
            self._search_key_func = search_key_func
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
640
3735.20.7 by Ian Clatworthy
include keywidth in repr for LeafNode
641
    def __repr__(self):
642
        items_str = sorted(self._items)
643
        if len(items_str) > 20:
644
            items_str = items_str[16] + '...]'
645
        return \
646
            '%s(key:%s len:%s size:%s max:%s prefix:%s keywidth:%s items:%s)' \
647
            % (self.__class__.__name__, self._key, self._len, self._raw_size,
648
            self._maximum_size, self._search_prefix, self._key_width, items_str)
649
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
650
    def _current_size(self):
3735.15.4 by John Arbash Meinel
Clean up some little bits.
651
        """Answer the current serialised size of this node.
652
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
653
        This differs from self._raw_size in that it includes the bytes used for
654
        the header.
3735.15.4 by John Arbash Meinel
Clean up some little bits.
655
        """
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
656
        if self._common_serialised_prefix is None:
657
            bytes_for_items = 0
3735.17.1 by John Arbash Meinel
Change the serialized form for leaf nodes.
658
            prefix_len = 0
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
659
        else:
660
            # We will store a single string with the common prefix
661
            # And then that common prefix will not be stored in any of the
662
            # entry lines
663
            prefix_len = len(self._common_serialised_prefix)
3735.17.1 by John Arbash Meinel
Change the serialized form for leaf nodes.
664
            bytes_for_items = (self._raw_size - (prefix_len * self._len))
665
        return (9 # 'chkleaf:\n'
666
            + len(str(self._maximum_size)) + 1
667
            + len(str(self._key_width)) + 1
668
            + len(str(self._len)) + 1
669
            + prefix_len + 1
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
670
            + bytes_for_items)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
671
672
    @classmethod
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
673
    def deserialise(klass, bytes, key, search_key_func=None):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
674
        """Deserialise bytes, with key key, into a LeafNode.
675
676
        :param bytes: The bytes of the node.
677
        :param key: The key that the serialised node has.
678
        """
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
679
        result = LeafNode(search_key_func=search_key_func)
3735.2.72 by John Arbash Meinel
Change deserialise to properly handle when there is a '\r' in the key.
680
        # Splitlines can split on '\r' so don't use it, split('\n') adds an
681
        # extra '' if the bytes ends in a final newline.
682
        lines = bytes.split('\n')
3735.2.99 by John Arbash Meinel
Merge bzr.dev 4034. Whitespace cleanup
683
        trailing = lines.pop()
684
        if trailing != '':
685
            raise AssertionError('We did not have a final newline for %s'
686
                                 % (key,))
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
687
        items = {}
688
        if lines[0] != 'chkleaf:':
689
            raise ValueError("not a serialised leaf node: %r" % bytes)
690
        maximum_size = int(lines[1])
691
        width = int(lines[2])
692
        length = int(lines[3])
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
693
        prefix = lines[4]
3735.17.1 by John Arbash Meinel
Change the serialized form for leaf nodes.
694
        pos = 5
695
        while pos < len(lines):
3735.2.99 by John Arbash Meinel
Merge bzr.dev 4034. Whitespace cleanup
696
            line = prefix + lines[pos]
697
            elements = line.split('\x00')
3735.17.1 by John Arbash Meinel
Change the serialized form for leaf nodes.
698
            pos += 1
3735.2.99 by John Arbash Meinel
Merge bzr.dev 4034. Whitespace cleanup
699
            if len(elements) != width + 1:
3735.20.6 by Ian Clatworthy
more helpful deserialize assertion msg
700
                raise AssertionError(
701
                    'Incorrect number of elements (%d vs %d) for: %r'
702
                    % (len(elements), width + 1, line))
3735.17.1 by John Arbash Meinel
Change the serialized form for leaf nodes.
703
            num_value_lines = int(elements[-1])
704
            value_lines = lines[pos:pos+num_value_lines]
705
            pos += num_value_lines
706
            value = '\n'.join(value_lines)
707
            items[tuple(elements[:-1])] = value
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
708
        if len(items) != length:
3735.14.3 by John Arbash Meinel
Properly remove keys that are found in the page cache. And add some debugging.
709
            raise AssertionError("item count (%d) mismatch for key %s,"
710
                " bytes %r" % (length, key, bytes))
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
711
        result._items = items
712
        result._len = length
713
        result._maximum_size = maximum_size
714
        result._key = key
715
        result._key_width = width
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
716
        result._raw_size = (sum(map(len, lines[5:])) # the length of the suffix
3735.17.1 by John Arbash Meinel
Change the serialized form for leaf nodes.
717
            + (length)*(len(prefix))
718
            + (len(lines)-5))
3735.23.2 by John Arbash Meinel
Avoid computing a known prefix on each deserialise.
719
        if not items:
3735.2.131 by John Arbash Meinel
Only compute LeafNode._search_prefix when we will use it.
720
            result._search_prefix = None
3735.23.2 by John Arbash Meinel
Avoid computing a known prefix on each deserialise.
721
            result._common_serialised_prefix = None
722
        else:
3735.2.131 by John Arbash Meinel
Only compute LeafNode._search_prefix when we will use it.
723
            result._search_prefix = _unknown
3735.23.2 by John Arbash Meinel
Avoid computing a known prefix on each deserialise.
724
            result._common_serialised_prefix = prefix
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
725
        if len(bytes) != result._current_size():
3735.2.99 by John Arbash Meinel
Merge bzr.dev 4034. Whitespace cleanup
726
            raise AssertionError('_current_size computed incorrectly')
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
727
        return result
728
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
729
    def iteritems(self, store, key_filter=None):
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
730
        """Iterate over items in the node.
731
732
        :param key_filter: A filter to apply to the node. It should be a
733
            list/set/dict or similar repeatedly iterable container.
734
        """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
735
        if key_filter is not None:
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
736
            # Adjust the filter - short elements go to a prefix filter. Would this
737
            # be cleaner explicitly? That would be no harder for InternalNode..
738
            # XXX: perhaps defaultdict? Profiling<rinse and repeat>
739
            filters = {}
740
            for key in key_filter:
741
                length_filter = filters.setdefault(len(key), set())
742
                length_filter.add(key)
743
            filters = filters.items()
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
744
            for item in self._items.iteritems():
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
745
                for length, length_filter in filters:
746
                    if item[0][:length] in length_filter:
747
                        yield item
748
                        break
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
749
        else:
750
            for item in self._items.iteritems():
751
                yield item
752
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
753
    def _key_value_len(self, key, value):
754
        # TODO: Should probably be done without actually joining the key, but
755
        #       then that can be done via the C extension
3735.17.1 by John Arbash Meinel
Change the serialized form for leaf nodes.
756
        return (len(self._serialise_key(key)) + 1
757
                + len(str(value.count('\n'))) + 1
758
                + len(value) + 1)
3735.9.4 by John Arbash Meinel
Some small cleanups, and fix _dump_tree to handle in-progress nodes.
759
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
760
    def _search_key(self, key):
761
        return self._search_key_func(key)
762
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
763
    def _map_no_split(self, key, value):
764
        """Map a key to a value.
765
766
        This assumes either the key does not already exist, or you have already
767
        removed its size and length from self.
768
769
        :return: True if adding this node should cause us to split.
770
        """
771
        self._items[key] = value
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
772
        self._raw_size += self._key_value_len(key, value)
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
773
        self._len += 1
3735.15.5 by John Arbash Meinel
Change the nomenclature.
774
        serialised_key = self._serialise_key(key)
775
        if self._common_serialised_prefix is None:
776
            self._common_serialised_prefix = serialised_key
777
        else:
778
            self._common_serialised_prefix = self.common_prefix(
779
                self._common_serialised_prefix, serialised_key)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
780
        search_key = self._search_key(key)
3735.2.131 by John Arbash Meinel
Only compute LeafNode._search_prefix when we will use it.
781
        if self._search_prefix is _unknown:
782
            self._compute_search_prefix()
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
783
        if self._search_prefix is None:
784
            self._search_prefix = search_key
3735.15.5 by John Arbash Meinel
Change the nomenclature.
785
        else:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
786
            self._search_prefix = self.common_prefix(
787
                self._search_prefix, search_key)
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
788
        if (self._len > 1
789
            and self._maximum_size
3735.16.10 by John Arbash Meinel
Don't track state for an infrequent edge case.
790
            and self._current_size() > self._maximum_size):
791
            # Check to see if all of the search_keys for this node are
792
            # identical. We allow the node to grow under that circumstance
793
            # (we could track this as common state, but it is infrequent)
794
            if (search_key != self._search_prefix
795
                or not self._are_search_keys_identical()):
796
                return True
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
797
        return False
798
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
799
    def _split(self, store):
800
        """We have overflowed.
801
802
        Split this node into multiple LeafNodes, return it up the stack so that
803
        the next layer creates a new InternalNode and references the new nodes.
804
3735.15.5 by John Arbash Meinel
Change the nomenclature.
805
        :return: (common_serialised_prefix, [(node_serialised_prefix, node)])
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
806
        """
3735.2.131 by John Arbash Meinel
Only compute LeafNode._search_prefix when we will use it.
807
        assert self._search_prefix is not _unknown
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
808
        common_prefix = self._search_prefix
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
809
        split_at = len(common_prefix) + 1
810
        result = {}
811
        for key, value in self._items.iteritems():
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
812
            search_key = self._search_key(key)
813
            prefix = search_key[:split_at]
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
814
            # TODO: Generally only 1 key can be exactly the right length,
815
            #       which means we can only have 1 key in the node pointed
816
            #       at by the 'prefix\0' key. We might want to consider
817
            #       folding it into the containing InternalNode rather than
818
            #       having a fixed length-1 node.
819
            #       Note this is probably not true for hash keys, as they
820
            #       may get a '\00' node anywhere, but won't have keys of
821
            #       different lengths.
822
            if len(prefix) < split_at:
823
                prefix += '\x00'*(split_at - len(prefix))
824
            if prefix not in result:
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
825
                node = LeafNode(search_key_func=self._search_key_func)
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
826
                node.set_maximum_size(self._maximum_size)
827
                node._key_width = self._key_width
828
                result[prefix] = node
829
            else:
830
                node = result[prefix]
831
            node.map(store, key, value)
832
        return common_prefix, result.items()
833
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
834
    def map(self, store, key, value):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
835
        """Map key to value."""
836
        if key in self._items:
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
837
            self._raw_size -= self._key_value_len(key, self._items[key])
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
838
            self._len -= 1
839
        self._key = None
3735.11.13 by John Arbash Meinel
Refactor the LeafNode.map() code so we can do _check_remap more cheaply.
840
        if self._map_no_split(key, value):
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
841
            return self._split(store)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
842
        else:
3735.2.131 by John Arbash Meinel
Only compute LeafNode._search_prefix when we will use it.
843
            assert self._search_prefix is not _unknown
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
844
            return self._search_prefix, [("", self)]
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
845
846
    def serialise(self, store):
3735.20.7 by Ian Clatworthy
include keywidth in repr for LeafNode
847
        """Serialise the LeafNode to store.
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
848
849
        :param store: A VersionedFiles honouring the CHK extensions.
850
        :return: An iterable of the keys inserted by this operation.
851
        """
852
        lines = ["chkleaf:\n"]
853
        lines.append("%d\n" % self._maximum_size)
854
        lines.append("%d\n" % self._key_width)
855
        lines.append("%d\n" % self._len)
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
856
        if self._common_serialised_prefix is None:
857
            lines.append('\n')
3735.2.99 by John Arbash Meinel
Merge bzr.dev 4034. Whitespace cleanup
858
            if len(self._items) != 0:
859
                raise AssertionError('If _common_serialised_prefix is None'
860
                    ' we should have no items')
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
861
        else:
862
            lines.append('%s\n' % (self._common_serialised_prefix,))
863
            prefix_len = len(self._common_serialised_prefix)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
864
        for key, value in sorted(self._items.items()):
3735.20.7 by Ian Clatworthy
include keywidth in repr for LeafNode
865
            # Always add a final newline
3735.17.1 by John Arbash Meinel
Change the serialized form for leaf nodes.
866
            value_lines = osutils.chunks_to_lines([value + '\n'])
867
            serialized = "%s\x00%s\n" % (self._serialise_key(key),
868
                                         len(value_lines))
3735.2.99 by John Arbash Meinel
Merge bzr.dev 4034. Whitespace cleanup
869
            if not serialized.startswith(self._common_serialised_prefix):
870
                raise AssertionError('We thought the common prefix was %r'
871
                    ' but entry %r does not have it in common'
872
                    % (self._common_serialised_prefix, serialized))
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
873
            lines.append(serialized[prefix_len:])
3735.17.1 by John Arbash Meinel
Change the serialized form for leaf nodes.
874
            lines.extend(value_lines)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
875
        sha1, _, _ = store.add_lines((None,), (), lines)
876
        self._key = ("sha1:" + sha1,)
3735.15.8 by John Arbash Meinel
Add asserts so that when serializing and deserializing
877
        bytes = ''.join(lines)
3735.15.9 by John Arbash Meinel
(broken) Initial prototype of leaf pages which pull out their common prefix.
878
        if len(bytes) != self._current_size():
3735.2.99 by John Arbash Meinel
Merge bzr.dev 4034. Whitespace cleanup
879
            raise AssertionError('Invalid _current_size')
3735.15.8 by John Arbash Meinel
Add asserts so that when serializing and deserializing
880
        _page_cache.add(self._key, bytes)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
881
        return [self._key]
882
3735.2.26 by Robert Collins
CHKInventory migrated to new CHKMap code.
883
    def refs(self):
884
        """Return the references to other CHK's held by this node."""
885
        return []
886
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
887
    def _compute_search_prefix(self):
888
        """Determine the common search prefix for all keys in this node.
3735.15.5 by John Arbash Meinel
Change the nomenclature.
889
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
890
        :return: A bytestring of the longest search key prefix that is
3735.15.5 by John Arbash Meinel
Change the nomenclature.
891
            unique within this node.
892
        """
3735.2.131 by John Arbash Meinel
Only compute LeafNode._search_prefix when we will use it.
893
        search_keys = [self._search_key_func(key) for key in self._items]
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
894
        self._search_prefix = self.common_prefix_for_keys(search_keys)
895
        return self._search_prefix
3735.15.5 by John Arbash Meinel
Change the nomenclature.
896
3735.16.10 by John Arbash Meinel
Don't track state for an infrequent edge case.
897
    def _are_search_keys_identical(self):
898
        """Check to see if the search keys for all entries are the same.
899
900
        When using a hash as the search_key it is possible for non-identical
901
        keys to collide. If that happens enough, we may try overflow a
902
        LeafNode, but as all are collisions, we must not split.
903
        """
904
        common_search_key = None
905
        for key in self._items:
906
            search_key = self._search_key(key)
907
            if common_search_key is None:
908
                common_search_key = search_key
909
            elif search_key != common_search_key:
910
                return False
911
        return True
912
3735.15.5 by John Arbash Meinel
Change the nomenclature.
913
    def _compute_serialised_prefix(self):
914
        """Determine the common prefix for serialised keys in this node.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
915
916
        :return: A bytestring of the longest serialised key prefix that is
917
            unique within this node.
918
        """
3735.15.5 by John Arbash Meinel
Change the nomenclature.
919
        serialised_keys = [self._serialise_key(key) for key in self._items]
920
        self._common_serialised_prefix = self.common_prefix_for_keys(
921
            serialised_keys)
3735.19.1 by Ian Clatworthy
CHKMap cleanups
922
        return self._common_serialised_prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
923
924
    def unmap(self, store, key):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
925
        """Unmap key from the node."""
3735.2.112 by John Arbash Meinel
Merge Ian's try/except helper for aiding in debugging strange failures.
926
        try:
927
            self._raw_size -= self._key_value_len(key, self._items[key])
928
        except KeyError:
929
            trace.mutter("key %s not found in %r", key, self._items)
930
            raise
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
931
        self._len -= 1
932
        del self._items[key]
933
        self._key = None
3735.15.2 by John Arbash Meinel
Change LeafNode to also cache its unique serialized prefix.
934
        # Recompute from scratch
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
935
        self._compute_search_prefix()
3735.15.5 by John Arbash Meinel
Change the nomenclature.
936
        self._compute_serialised_prefix()
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
937
        return self
938
939
940
class InternalNode(Node):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
941
    """A node that contains references to other nodes.
3735.11.1 by John Arbash Meinel
Clean up some trailing whitespace.
942
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
943
    An InternalNode is responsible for mapping search key prefixes to child
3735.15.5 by John Arbash Meinel
Change the nomenclature.
944
    nodes.
3735.15.4 by John Arbash Meinel
Clean up some little bits.
945
3735.15.5 by John Arbash Meinel
Change the nomenclature.
946
    :ivar _items: serialised_key => node dictionary. node may be a tuple,
3735.15.4 by John Arbash Meinel
Clean up some little bits.
947
        LeafNode or InternalNode.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
948
    """
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
949
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
950
    def __init__(self, prefix='', search_key_func=None):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
951
        Node.__init__(self)
952
        # The size of an internalnode with default values and no children.
953
        # How many octets key prefixes within this node are.
954
        self._node_width = 0
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
955
        self._search_prefix = prefix
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
956
        if search_key_func is None:
3735.16.6 by John Arbash Meinel
Include a _search_key_plain function.
957
            self._search_key_func = _search_key_plain
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
958
        else:
959
            self._search_key_func = search_key_func
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
960
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
961
    def add_node(self, prefix, node):
962
        """Add a child node with prefix prefix, and node node.
963
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
964
        :param prefix: The search key prefix for node.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
965
        :param node: The node being added.
966
        """
3735.2.126 by Ian Clatworthy
replace asserts in chk_map.py with AssertionErrors
967
        if self._search_prefix is None:
968
            raise AssertionError("_search_prefix should not be None")
969
        if not prefix.startswith(self._search_prefix):
970
            raise AssertionError("prefixes mismatch: %s must start with %s"
971
                % (prefix,self._search_prefix))
972
        if len(prefix) != len(self._search_prefix) + 1:
973
            raise AssertionError("prefix wrong length: len(%s) is not %d" %
974
                (prefix, len(self._search_prefix) + 1))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
975
        self._len += len(node)
976
        if not len(self._items):
977
            self._node_width = len(prefix)
3735.2.126 by Ian Clatworthy
replace asserts in chk_map.py with AssertionErrors
978
        if self._node_width != len(self._search_prefix) + 1:
979
            raise AssertionError("node width mismatch: %d is not %d" %
980
                (self._node_width, len(self._search_prefix) + 1))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
981
        self._items[prefix] = node
982
        self._key = None
983
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
984
    def _current_size(self):
985
        """Answer the current serialised size of this node."""
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
986
        return (self._raw_size + len(str(self._len)) + len(str(self._key_width)) +
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
987
            len(str(self._maximum_size)))
988
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
989
    @classmethod
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
990
    def deserialise(klass, bytes, key, search_key_func=None):
3735.2.25 by Robert Collins
CHKInventory core tests passing.
991
        """Deserialise bytes to an InternalNode, with key key.
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
992
993
        :param bytes: The bytes of the node.
994
        :param key: The key that the serialised node has.
995
        :return: An InternalNode instance.
996
        """
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
997
        result = InternalNode(search_key_func=search_key_func)
3735.2.72 by John Arbash Meinel
Change deserialise to properly handle when there is a '\r' in the key.
998
        # Splitlines can split on '\r' so don't use it, remove the extra ''
999
        # from the result of split('\n') because we should have a trailing
1000
        # newline
1001
        lines = bytes.split('\n')
3735.2.126 by Ian Clatworthy
replace asserts in chk_map.py with AssertionErrors
1002
        if lines[-1] != '':
1003
            raise AssertionError("last line must be ''")
3735.2.72 by John Arbash Meinel
Change deserialise to properly handle when there is a '\r' in the key.
1004
        lines.pop(-1)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1005
        items = {}
1006
        if lines[0] != 'chknode:':
1007
            raise ValueError("not a serialised internal node: %r" % bytes)
1008
        maximum_size = int(lines[1])
1009
        width = int(lines[2])
1010
        length = int(lines[3])
3735.15.11 by John Arbash Meinel
Change the InternalNodes to also pull out the common prefix.
1011
        common_prefix = lines[4]
1012
        for line in lines[5:]:
1013
            line = common_prefix + line
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1014
            prefix, flat_key = line.rsplit('\x00', 1)
1015
            items[prefix] = (flat_key,)
1016
        result._items = items
1017
        result._len = length
1018
        result._maximum_size = maximum_size
1019
        result._key = key
1020
        result._key_width = width
3735.15.6 by John Arbash Meinel
Add tests that LeafNodes track the common prefix for both their lookup keys
1021
        # XXX: InternalNodes don't really care about their size, and this will
1022
        #      change if we add prefix compression
1023
        result._raw_size = None # len(bytes)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1024
        result._node_width = len(prefix)
3735.23.2 by John Arbash Meinel
Avoid computing a known prefix on each deserialise.
1025
        assert len(items) > 0
1026
        result._search_prefix = common_prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1027
        return result
1028
1029
    def iteritems(self, store, key_filter=None):
3735.2.115 by John Arbash Meinel
Fix the root cause of why _iter_nodes was not handling key_filter.
1030
        for node in self._iter_nodes(store, key_filter=key_filter):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1031
            for item in node.iteritems(store, key_filter=key_filter):
1032
                yield item
1033
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
1034
    def _iter_nodes(self, store, key_filter=None, batch_size=None):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1035
        """Iterate over node objects which match key_filter.
1036
1037
        :param store: A store to use for accessing content.
1038
        :param key_filter: A key filter to filter nodes. Only nodes that might
1039
            contain a key in key_filter will be returned.
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
1040
        :param batch_size: If not None, then we will return the nodes that had
1041
            to be read using get_record_stream in batches, rather than reading
1042
            them all at once.
1043
        :return: An iterable of nodes. This function does not have to be fully
1044
            consumed.  (There will be no pending I/O when items are being returned.)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1045
        """
3735.2.31 by Robert Collins
CHKMap.iter_changes
1046
        keys = {}
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1047
        if key_filter is None:
3735.2.31 by Robert Collins
CHKMap.iter_changes
1048
            for prefix, node in self._items.iteritems():
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1049
                if type(node) == tuple:
3735.2.31 by Robert Collins
CHKMap.iter_changes
1050
                    keys[node] = prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1051
                else:
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
1052
                    yield node
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1053
        else:
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
1054
            # XXX defaultdict ?
1055
            length_filters = {}
1056
            for key in key_filter:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1057
                search_key = self._search_prefix_filter(key)
1058
                length_filter = length_filters.setdefault(
1059
                                    len(search_key), set())
1060
                length_filter.add(search_key)
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
1061
            length_filters = length_filters.items()
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1062
            for prefix, node in self._items.iteritems():
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
1063
                for length, length_filter in length_filters:
1064
                    if prefix[:length] in length_filter:
1065
                        if type(node) == tuple:
1066
                            keys[node] = prefix
1067
                        else:
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
1068
                            yield node
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
1069
                        break
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1070
        if keys:
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
1071
            # Look in the page cache for some more bytes
1072
            found_keys = set()
1073
            for key in keys:
1074
                try:
1075
                    bytes = _page_cache[key]
1076
                except KeyError:
1077
                    continue
1078
                else:
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1079
                    node = _deserialise(bytes, key,
1080
                        search_key_func=self._search_key_func)
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
1081
                    self._items[keys[key]] = node
1082
                    found_keys.add(key)
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
1083
                    yield node
3735.2.62 by Robert Collins
Create a rudimentary CHK page cache.
1084
            for key in found_keys:
1085
                del keys[key]
1086
        if keys:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1087
            # demand load some pages.
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
1088
            if batch_size is None:
1089
                # Read all the keys in
1090
                batch_size = len(keys)
1091
            key_order = list(keys)
1092
            for batch_start in range(0, len(key_order), batch_size):
1093
                batch = key_order[batch_start:batch_start + batch_size]
1094
                # We have to fully consume the stream so there is no pending
1095
                # I/O, so we buffer the nodes for now.
1096
                stream = store.get_record_stream(batch, 'unordered', True)
1097
                nodes = []
1098
                for record in stream:
1099
                    bytes = record.get_bytes_as('fulltext')
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1100
                    node = _deserialise(bytes, record.key,
1101
                        search_key_func=self._search_key_func)
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
1102
                    nodes.append(node)
1103
                    self._items[keys[record.key]] = node
1104
                    _page_cache.add(record.key, bytes)
1105
                for node in nodes:
1106
                    yield node
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1107
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1108
    def map(self, store, key, value):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1109
        """Map key to value."""
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1110
        if not len(self._items):
3735.2.122 by Ian Clatworthy
don't check_remap on every unmap call in CHKMap.apply_delta()
1111
            raise AssertionError("can't map in an empty InternalNode.")
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1112
        search_key = self._search_key(key)
3735.2.126 by Ian Clatworthy
replace asserts in chk_map.py with AssertionErrors
1113
        if self._node_width != len(self._search_prefix) + 1:
1114
            raise AssertionError("node width mismatch: %d is not %d" %
1115
                (self._node_width, len(self._search_prefix) + 1))
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1116
        if not search_key.startswith(self._search_prefix):
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
1117
            # This key doesn't fit in this index, so we need to split at the
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
1118
            # point where it would fit, insert self into that internal node,
1119
            # and then map this key into that node.
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1120
            new_prefix = self.common_prefix(self._search_prefix,
1121
                                            search_key)
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1122
            new_parent = InternalNode(new_prefix,
1123
                search_key_func=self._search_key_func)
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
1124
            new_parent.set_maximum_size(self._maximum_size)
1125
            new_parent._key_width = self._key_width
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1126
            new_parent.add_node(self._search_prefix[:len(new_prefix)+1],
3735.15.1 by John Arbash Meinel
Change InternalNode to always cache its serialized_prefix.
1127
                                self)
3735.9.5 by John Arbash Meinel
Don't allow an InternalNode to add a key that doesn't fit.
1128
            return new_parent.map(store, key, value)
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
1129
        children = list(self._iter_nodes(store, key_filter=[key]))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1130
        if children:
1131
            child = children[0]
1132
        else:
1133
            # new child needed:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1134
            child = self._new_child(search_key, LeafNode)
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1135
        old_len = len(child)
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
1136
        if isinstance(child, LeafNode):
1137
            old_size = child._current_size()
1138
        else:
1139
            old_size = None
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1140
        prefix, node_details = child.map(store, key, value)
1141
        if len(node_details) == 1:
3735.9.11 by John Arbash Meinel
Handle when an InternalNode decides it needs to split.
1142
            # child may have shrunk, or might be a new node
1143
            child = node_details[0][1]
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1144
            self._len = self._len - old_len + len(child)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1145
            self._items[search_key] = child
3735.2.29 by Robert Collins
Untested code is broken code.
1146
            self._key = None
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
1147
            new_node = self
3735.2.123 by Ian Clatworthy
only check for remap if changes are interesting in size
1148
            if isinstance(child, LeafNode):
1149
                if old_size is None:
1150
                    # The old node was an InternalNode which means it has now
1151
                    # collapsed, so we need to check if it will chain to a
1152
                    # collapse at this level.
1153
                    trace.mutter("checking remap as InternalNode -> LeafNode")
1154
                    new_node = self._check_remap(store)
1155
                else:
1156
                    # If the LeafNode has shrunk in size, we may want to run
1157
                    # a remap check. Checking for a remap is expensive though
1158
                    # and the frequency of a successful remap is very low.
1159
                    # Shrinkage by small amounts is common, so we only do the
1160
                    # remap check if the new_size is low or the shrinkage
1161
                    # amount is over a configurable limit.
1162
                    new_size = child._current_size()
1163
                    shrinkage = old_size - new_size
1164
                    if (shrinkage > 0 and new_size < _INTERESTING_NEW_SIZE
1165
                        or shrinkage > _INTERESTING_SHRINKAGE_LIMIT):
1166
                        trace.mutter(
1167
                            "checking remap as size shrunk by %d to be %d",
1168
                            shrinkage, new_size)
1169
                        new_node = self._check_remap(store)
3735.2.126 by Ian Clatworthy
replace asserts in chk_map.py with AssertionErrors
1170
            if new_node._search_prefix is None:
1171
                raise AssertionError("_search_prefix should not be None")
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1172
            return new_node._search_prefix, [('', new_node)]
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1173
        # child has overflown - create a new intermediate node.
1174
        # XXX: This is where we might want to try and expand our depth
1175
        # to refer to more bytes of every child (which would give us
1176
        # multiple pointers to child nodes, but less intermediate nodes)
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1177
        child = self._new_child(search_key, InternalNode)
1178
        child._search_prefix = prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1179
        for split, node in node_details:
1180
            child.add_node(split, node)
1181
        self._len = self._len - old_len + len(child)
3735.2.29 by Robert Collins
Untested code is broken code.
1182
        self._key = None
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1183
        return self._search_prefix, [("", self)]
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1184
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1185
    def _new_child(self, search_key, klass):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1186
        """Create a new child node of type klass."""
1187
        child = klass()
1188
        child.set_maximum_size(self._maximum_size)
1189
        child._key_width = self._key_width
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1190
        child._search_key_func = self._search_key_func
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1191
        self._items[search_key] = child
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1192
        return child
1193
1194
    def serialise(self, store):
1195
        """Serialise the node to store.
1196
1197
        :param store: A VersionedFiles honouring the CHK extensions.
1198
        :return: An iterable of the keys inserted by this operation.
1199
        """
1200
        for node in self._items.itervalues():
1201
            if type(node) == tuple:
1202
                # Never deserialised.
1203
                continue
1204
            if node._key is not None:
1205
                # Never altered
1206
                continue
1207
            for key in node.serialise(store):
1208
                yield key
1209
        lines = ["chknode:\n"]
1210
        lines.append("%d\n" % self._maximum_size)
1211
        lines.append("%d\n" % self._key_width)
1212
        lines.append("%d\n" % self._len)
3735.2.126 by Ian Clatworthy
replace asserts in chk_map.py with AssertionErrors
1213
        if self._search_prefix is None:
1214
            raise AssertionError("_search_prefix should not be None")
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1215
        lines.append('%s\n' % (self._search_prefix,))
1216
        prefix_len = len(self._search_prefix)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1217
        for prefix, node in sorted(self._items.items()):
1218
            if type(node) == tuple:
1219
                key = node[0]
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1220
            else:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1221
                key = node._key[0]
3735.15.11 by John Arbash Meinel
Change the InternalNodes to also pull out the common prefix.
1222
            serialised = "%s\x00%s\n" % (prefix, key)
3735.2.126 by Ian Clatworthy
replace asserts in chk_map.py with AssertionErrors
1223
            if not serialised.startswith(self._search_prefix):
1224
                raise AssertionError("prefixes mismatch: %s must start with %s"
1225
                    % (serialised, self._search_prefix))
3735.15.11 by John Arbash Meinel
Change the InternalNodes to also pull out the common prefix.
1226
            lines.append(serialised[prefix_len:])
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1227
        sha1, _, _ = store.add_lines((None,), (), lines)
1228
        self._key = ("sha1:" + sha1,)
3735.2.63 by Robert Collins
Divert writes into the CHK page cache as well.
1229
        _page_cache.add(self._key, ''.join(lines))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1230
        yield self._key
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1231
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1232
    def _search_key(self, key):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1233
        """Return the serialised key for key in this node."""
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1234
        # search keys are fixed width. All will be self._node_width wide, so we
3735.15.5 by John Arbash Meinel
Change the nomenclature.
1235
        # pad as necessary.
3735.16.1 by John Arbash Meinel
(broken) Start tracking down more code that needs to pass around the 'search_key_func'
1236
        return (self._search_key_func(key) + '\x00'*self._node_width)[:self._node_width]
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1237
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1238
    def _search_prefix_filter(self, key):
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
1239
        """Serialise key for use as a prefix filter in iteritems."""
3735.2.115 by John Arbash Meinel
Fix the root cause of why _iter_nodes was not handling key_filter.
1240
        return self._search_key_func(key)[:self._node_width]
3735.2.43 by Robert Collins
Teach CHKMap how to iter items in 2-tuple keyspaces.
1241
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1242
    def _split(self, offset):
1243
        """Split this node into smaller nodes starting at offset.
1244
1245
        :param offset: The offset to start the new child nodes at.
1246
        :return: An iterable of (prefix, node) tuples. prefix is a byte
1247
            prefix for reaching node.
1248
        """
1249
        if offset >= self._node_width:
1250
            for node in self._items.values():
1251
                for result in node._split(offset):
1252
                    yield result
1253
            return
1254
        for key, node in self._items.items():
1255
            pass
1256
3735.2.26 by Robert Collins
CHKInventory migrated to new CHKMap code.
1257
    def refs(self):
1258
        """Return the references to other CHK's held by this node."""
1259
        if self._key is None:
1260
            raise AssertionError("unserialised nodes have no refs.")
1261
        refs = []
1262
        for value in self._items.itervalues():
1263
            if type(value) == tuple:
1264
                refs.append(value)
1265
            else:
1266
                refs.append(value.key())
1267
        return refs
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1268
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1269
    def _compute_search_prefix(self, extra_key=None):
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1270
        """Return the unique key prefix for this node.
1271
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1272
        :return: A bytestring of the longest search key prefix that is
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1273
            unique within this node.
1274
        """
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1275
        self._search_prefix = self.common_prefix_for_keys(self._items)
1276
        return self._search_prefix
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1277
3735.2.122 by Ian Clatworthy
don't check_remap on every unmap call in CHKMap.apply_delta()
1278
    def unmap(self, store, key, check_remap=True):
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1279
        """Remove key from this node and it's children."""
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1280
        if not len(self._items):
3735.2.126 by Ian Clatworthy
replace asserts in chk_map.py with AssertionErrors
1281
            raise AssertionError("can't unmap in an empty InternalNode.")
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
1282
        children = list(self._iter_nodes(store, key_filter=[key]))
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1283
        if children:
1284
            child = children[0]
1285
        else:
1286
            raise KeyError(key)
1287
        self._len -= 1
1288
        unmapped = child.unmap(store, key)
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1289
        self._key = None
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1290
        search_key = self._search_key(key)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1291
        if len(unmapped) == 0:
1292
            # All child nodes are gone, remove the child:
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1293
            del self._items[search_key]
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1294
            unmapped = None
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1295
        else:
1296
            # Stash the returned node
3735.15.13 by John Arbash Meinel
Change the term 'lookup' to the term 'search', as it is closer to what Robert envisioned.
1297
            self._items[search_key] = unmapped
3735.2.23 by Robert Collins
Test unmapping with one child left but multiple keys.
1298
        if len(self._items) == 1:
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1299
            # this node is no longer needed:
1300
            return self._items.values()[0]
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1301
        if isinstance(unmapped, InternalNode):
1302
            return self
3735.2.122 by Ian Clatworthy
don't check_remap on every unmap call in CHKMap.apply_delta()
1303
        if check_remap:
1304
            return self._check_remap(store)
1305
        else:
1306
            return self
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1307
1308
    def _check_remap(self, store):
3735.11.11 by John Arbash Meinel
Add logic to map() so that it can also collapse when necessary.
1309
        """Check if all keys contained by children fit in a single LeafNode.
1310
1311
        :param store: A store to use for reading more nodes
1312
        :return: Either self, or a new LeafNode which should replace self.
1313
        """
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1314
        # Logic for how we determine when we need to rebuild
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1315
        # 1) Implicitly unmap() is removing a key which means that the child
1316
        #    nodes are going to be shrinking by some extent.
1317
        # 2) If all children are LeafNodes, it is possible that they could be
1318
        #    combined into a single LeafNode, which can then completely replace
1319
        #    this internal node with a single LeafNode
1320
        # 3) If *one* child is an InternalNode, we assume it has already done
1321
        #    all the work to determine that its children cannot collapse, and
1322
        #    we can then assume that those nodes *plus* the current nodes don't
1323
        #    have a chance of collapsing either.
1324
        #    So a very cheap check is to just say if 'unmapped' is an
1325
        #    InternalNode, we don't have to check further.
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1326
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1327
        # TODO: Another alternative is to check the total size of all known
1328
        #       LeafNodes. If there is some formula we can use to determine the
1329
        #       final size without actually having to read in any more
1330
        #       children, it would be nice to have. However, we have to be
1331
        #       careful with stuff like nodes that pull out the common prefix
1332
        #       of each key, as adding a new key can change the common prefix
1333
        #       and cause size changes greater than the length of one key.
1334
        #       So for now, we just add everything to a new Leaf until it
1335
        #       splits, as we know that will give the right answer
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1336
        new_leaf = LeafNode(search_key_func=self._search_key_func)
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1337
        new_leaf.set_maximum_size(self._maximum_size)
1338
        new_leaf._key_width = self._key_width
3735.14.7 by John Arbash Meinel
Change _iter_nodes into a generator.
1339
        # A batch_size of 16 was chosen because:
1340
        #   a) In testing, a 4k page held 14 times. So if we have more than 16
1341
        #      leaf nodes we are unlikely to hold them in a single new leaf
1342
        #      node. This still allows for 1 round trip
1343
        #   b) With 16-way fan out, we can still do a single round trip
1344
        #   c) With 255-way fan out, we don't want to read all 255 and destroy
1345
        #      the page cache, just to determine that we really don't need it.
1346
        for node in self._iter_nodes(store, batch_size=16):
1347
            if isinstance(node, InternalNode):
1348
                # Without looking at any leaf nodes, we are sure
1349
                return self
1350
            for key, value in node._items.iteritems():
1351
                if new_leaf._map_no_split(key, value):
3735.11.10 by John Arbash Meinel
Change how _check_remap works so it doesn't have to load all keys.
1352
                    return self
3735.2.123 by Ian Clatworthy
only check for remap if changes are interesting in size
1353
        trace.mutter("remap generated a new LeafNode")
3735.11.3 by John Arbash Meinel
At the end of unmap() see if children can be packed into a single Leaf.
1354
        return new_leaf
3735.2.18 by Robert Collins
Partial multi-layer chk dictionary trees.
1355
1356
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1357
def _deserialise(bytes, key, search_key_func):
3735.2.16 by Robert Collins
Untested extensions to support repodetails
1358
    """Helper for repositorydetails - convert bytes to a node."""
3735.2.24 by Robert Collins
test_chk_map tests all passing.
1359
    if bytes.startswith("chkleaf:\n"):
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1360
        return LeafNode.deserialise(bytes, key, search_key_func=search_key_func)
3735.2.21 by Robert Collins
BROKEN: multi level CHKMap tries, unfinished, subsystem in flux.
1361
    elif bytes.startswith("chknode:\n"):
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1362
        return InternalNode.deserialise(bytes, key,
1363
            search_key_func=search_key_func)
3735.2.16 by Robert Collins
Untested extensions to support repodetails
1364
    else:
1365
        raise AssertionError("Unknown node type.")
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1366
1367
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1368
def _find_children_info(store, interesting_keys, uninteresting_keys, pb):
3735.9.7 by John Arbash Meinel
Cleanup pass.
1369
    """Read the associated records, and determine what is interesting."""
1370
    uninteresting_keys = set(uninteresting_keys)
1371
    chks_to_read = uninteresting_keys.union(interesting_keys)
1372
    next_uninteresting = set()
1373
    next_interesting = set()
1374
    uninteresting_items = set()
1375
    interesting_items = set()
1376
    interesting_records = []
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1377
    # records_read = set()
3735.9.7 by John Arbash Meinel
Cleanup pass.
1378
    for record in store.get_record_stream(chks_to_read, 'unordered', True):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1379
        # records_read.add(record.key())
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1380
        if pb is not None:
1381
            pb.tick()
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1382
        bytes = record.get_bytes_as('fulltext')
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1383
        # We don't care about search_key_func for this code, because we only
1384
        # care about external references.
1385
        node = _deserialise(bytes, record.key, search_key_func=None)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1386
        if record.key in uninteresting_keys:
1387
            if isinstance(node, InternalNode):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1388
                next_uninteresting.update(node.refs())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1389
            else:
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1390
                # We know we are at a LeafNode, so we can pass None for the
1391
                # store
1392
                uninteresting_items.update(node.iteritems(None))
3735.9.7 by John Arbash Meinel
Cleanup pass.
1393
        else:
1394
            interesting_records.append(record)
1395
            if isinstance(node, InternalNode):
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1396
                next_interesting.update(node.refs())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1397
            else:
3735.9.14 by John Arbash Meinel
Start using the iter_interesting_nodes.
1398
                interesting_items.update(node.iteritems(None))
1399
    # TODO: Filter out records that have already been read, as node splitting
1400
    #       can cause us to reference the same nodes via shorter and longer
1401
    #       paths
3735.9.7 by John Arbash Meinel
Cleanup pass.
1402
    return (next_uninteresting, uninteresting_items,
1403
            next_interesting, interesting_records, interesting_items)
1404
1405
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1406
def _find_all_uninteresting(store, interesting_root_keys,
1407
                            uninteresting_root_keys, adapter, pb):
1408
    """Determine the full set of uninteresting keys."""
1409
    # What about duplicates between interesting_root_keys and
1410
    # uninteresting_root_keys?
1411
    if not uninteresting_root_keys:
1412
        # Shortcut case. We know there is nothing uninteresting to filter out
1413
        # So we just let the rest of the algorithm do the work
1414
        # We know there is nothing uninteresting, and we didn't have to read
1415
        # any interesting records yet.
1416
        return (set(), set(), set(interesting_root_keys), [], set())
3735.9.7 by John Arbash Meinel
Cleanup pass.
1417
    all_uninteresting_chks = set(uninteresting_root_keys)
1418
    all_uninteresting_items = set()
1419
1420
    # First step, find the direct children of both the interesting and
1421
    # uninteresting set
1422
    (uninteresting_keys, uninteresting_items,
1423
     interesting_keys, interesting_records,
1424
     interesting_items) = _find_children_info(store, interesting_root_keys,
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1425
                                              uninteresting_root_keys,
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1426
                                              pb=pb)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1427
    all_uninteresting_chks.update(uninteresting_keys)
1428
    all_uninteresting_items.update(uninteresting_items)
1429
    del uninteresting_items
1430
    # Note: Exact matches between interesting and uninteresting do not need
1431
    #       to be search further. Non-exact matches need to be searched in case
1432
    #       there is a future exact-match
1433
    uninteresting_keys.difference_update(interesting_keys)
1434
3735.9.6 by John Arbash Meinel
Add a first pass to the interesting search.
1435
    # Second, find the full set of uninteresting bits reachable by the
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1436
    # uninteresting roots
3735.9.7 by John Arbash Meinel
Cleanup pass.
1437
    chks_to_read = uninteresting_keys
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1438
    while chks_to_read:
1439
        next_chks = set()
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1440
        for record in store.get_record_stream(chks_to_read, 'unordered', False):
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1441
            # TODO: Handle 'absent'
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1442
            if pb is not None:
1443
                pb.tick()
3735.2.98 by John Arbash Meinel
Merge bzr.dev 4032. Resolve the new streaming fetch.
1444
            try:
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1445
                bytes = record.get_bytes_as('fulltext')
3735.2.98 by John Arbash Meinel
Merge bzr.dev 4032. Resolve the new streaming fetch.
1446
            except errors.UnavailableRepresentation:
1447
                bytes = adapter.get_bytes(record)
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1448
            # We don't care about search_key_func for this code, because we
1449
            # only care about external references.
1450
            node = _deserialise(bytes, record.key, search_key_func=None)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1451
            if isinstance(node, InternalNode):
1452
                # uninteresting_prefix_chks.update(node._items.iteritems())
1453
                chks = node._items.values()
1454
                # TODO: We remove the entries that are already in
1455
                #       uninteresting_chks ?
1456
                next_chks.update(chks)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1457
                all_uninteresting_chks.update(chks)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1458
            else:
3735.9.7 by John Arbash Meinel
Cleanup pass.
1459
                all_uninteresting_items.update(node._items.iteritems())
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1460
        chks_to_read = next_chks
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1461
    return (all_uninteresting_chks, all_uninteresting_items,
1462
            interesting_keys, interesting_records, interesting_items)
1463
1464
1465
def iter_interesting_nodes(store, interesting_root_keys,
1466
                           uninteresting_root_keys, pb=None):
1467
    """Given root keys, find interesting nodes.
1468
1469
    Evaluate nodes referenced by interesting_root_keys. Ones that are also
1470
    referenced from uninteresting_root_keys are not considered interesting.
1471
1472
    :param interesting_root_keys: keys which should be part of the
1473
        "interesting" nodes (which will be yielded)
1474
    :param uninteresting_root_keys: keys which should be filtered out of the
1475
        result set.
1476
    :return: Yield
1477
        (interesting records, interesting chk's, interesting key:values)
1478
    """
1479
    # TODO: consider that it may be more memory efficient to use the 20-byte
1480
    #       sha1 string, rather than tuples of hexidecimal sha1 strings.
3735.2.68 by John Arbash Meinel
Add a TODO about avoiding all of the get_record_stream calls.
1481
    # TODO: Try to factor out a lot of the get_record_stream() calls into a
1482
    #       helper function similar to _read_bytes. This function should be
1483
    #       able to use nodes from the _page_cache as well as actually
1484
    #       requesting bytes from the store.
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1485
1486
    # A way to adapt from the compressed texts back into fulltexts
1487
    # In a way, this seems like a layering inversion to have CHKMap know the
1488
    # details of versionedfile
1489
    adapter_class = versionedfile.adapter_registry.get(
1490
        ('knit-ft-gz', 'fulltext'))
1491
    adapter = adapter_class(store)
1492
1493
    (all_uninteresting_chks, all_uninteresting_items, interesting_keys,
1494
     interesting_records, interesting_items) = _find_all_uninteresting(store,
1495
        interesting_root_keys, uninteresting_root_keys, adapter, pb)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1496
1497
    # Now that we know everything uninteresting, we can yield information from
1498
    # our first request
1499
    interesting_items.difference_update(all_uninteresting_items)
1500
    records = dict((record.key, record) for record in interesting_records
1501
                    if record.key not in all_uninteresting_chks)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1502
    if records or interesting_items:
1503
        yield records, interesting_items
3735.9.7 by John Arbash Meinel
Cleanup pass.
1504
    interesting_keys.difference_update(all_uninteresting_chks)
1505
1506
    chks_to_read = interesting_keys
3735.18.1 by John Arbash Meinel
Change the fetch logic to properly use the child_pb for child ops.
1507
    counter = 0
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1508
    while chks_to_read:
1509
        next_chks = set()
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1510
        for record in store.get_record_stream(chks_to_read, 'unordered', False):
3735.18.1 by John Arbash Meinel
Change the fetch logic to properly use the child_pb for child ops.
1511
            counter += 1
3735.9.17 by John Arbash Meinel
Pass around a progress bar and switch to using an adapter.
1512
            if pb is not None:
3735.18.1 by John Arbash Meinel
Change the fetch logic to properly use the child_pb for child ops.
1513
                pb.update('find chk pages', counter)
3735.9.7 by John Arbash Meinel
Cleanup pass.
1514
            # TODO: Handle 'absent'?
3735.2.98 by John Arbash Meinel
Merge bzr.dev 4032. Resolve the new streaming fetch.
1515
            try:
3735.2.67 by John Arbash Meinel
Merge bzr.dev 3903 which brings in 'chunked' encoding.
1516
                bytes = record.get_bytes_as('fulltext')
3735.2.98 by John Arbash Meinel
Merge bzr.dev 4032. Resolve the new streaming fetch.
1517
            except errors.UnavailableRepresentation:
1518
                bytes = adapter.get_bytes(record)
3735.16.2 by John Arbash Meinel
Start passing around the search_key_func in more places.
1519
            # We don't care about search_key_func for this code, because we
1520
            # only care about external references.
1521
            node = _deserialise(bytes, record.key, search_key_func=None)
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1522
            if isinstance(node, InternalNode):
3735.9.15 by John Arbash Meinel
Found a bug in iter_interesting_nodes and its test suite.
1523
                chks = set(node.refs())
1524
                chks.difference_update(all_uninteresting_chks)
1525
                # Is set() and .difference_update better than:
1526
                # chks = [chk for chk in node.refs()
1527
                #              if chk not in all_uninteresting_chks]
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1528
                next_chks.update(chks)
1529
                # These are now uninteresting everywhere else
3735.9.7 by John Arbash Meinel
Cleanup pass.
1530
                all_uninteresting_chks.update(chks)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1531
                interesting_items = []
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1532
            else:
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1533
                interesting_items = [item for item in node._items.iteritems()
1534
                                     if item not in all_uninteresting_items]
3735.9.15 by John Arbash Meinel
Found a bug in iter_interesting_nodes and its test suite.
1535
                # TODO: Do we need to filter out items that we have already
1536
                #       seen on other pages? We don't really want to buffer the
1537
                #       whole thing, but it does mean that callers need to
1538
                #       understand they may get duplicate values.
3735.9.7 by John Arbash Meinel
Cleanup pass.
1539
                # all_uninteresting_items.update(interesting_items)
3735.9.19 by John Arbash Meinel
Refactor iter_interesting a little bit.
1540
            yield {record.key: record}, interesting_items
3735.9.1 by John Arbash Meinel
Start working on an iter_interesting_nodes, which can find nodes to transmit
1541
        chks_to_read = next_chks