bzr branch
http://gegoxaren.bato24.eu/bzr/brz/remove-bazaar
| 
2621.1.1
by Aaron Bentley
 Add performance analysis of diff  | 
1  | 
diff Performance Analysis  | 
2  | 
=========================  | 
|
3  | 
||
4  | 
.. contents:: :local:  | 
|
5  | 
||
6  | 
Minimal Work  | 
|
7  | 
------------  | 
|
8  | 
||
9  | 
Reuse of historical comparisons  | 
|
10  | 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  | 
|
11  | 
A significant part of the work done by diff is sequence matching. This  | 
|
12  | 
scales O(n^2) with the number of lines in the file. Therefore, it  | 
|
13  | 
is worthwile to avoid content comparisons as much as possible.  | 
|
14  | 
||
15  | 
Our current knit format contains content comparisons, and this data can  | 
|
16  | 
be converted into lists of matching blocks. Other future formats such as  | 
|
17  | 
mpdiff may also support such conversion. So it is possible to reuse past  | 
|
18  | 
comparisons.  | 
|
19  | 
||
20  | 
It is also possible to combine sequential comparisons. So given a comparison  | 
|
21  | 
of "foo" to "bar", and "bar" to "baz", it is possible to derive a comparison of  | 
|
22  | 
"foo" to "baz".  | 
|
23  | 
||
24  | 
Reuse of historical comparisons will scale with the number of uncommon  | 
|
25  | 
build-parents between the two historical revisions. This will typically be  | 
|
26  | 
proportional to the amount of change that the file has undergone. Therefore,  | 
|
27  | 
in the common case, reuse of historical comparisons will scale with the  | 
|
28  | 
amount of change.  | 
|
29  | 
||
30  | 
The downside of such reuse is that it ties the comparison to the historical  | 
|
31  | 
data. But given the performance improvement, it seems to be worth  | 
|
32  | 
consideration. Fresh comparisons can be performed if the user requests them.  | 
|
33  | 
||
34  | 
It may also be possible to accelerate comparisons by including annotation data,  | 
|
35  | 
thus increasing the number of unique lines.  | 
|
36  | 
||
37  | 
Historical Tree Against Historical Tree  | 
|
38  | 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  | 
|
39  | 
This operation should be strictly proportional to the amount of change, because  | 
|
40  | 
a comparison has already been done at commit time. Achieving that performance  | 
|
41  | 
requires the committed data to be properly structured, so that the comparison  | 
|
42  | 
can be extracted and combined with other comparisons. This comparision  | 
|
43  | 
extraction should be possible at the inventory and file-content levels.  | 
|
44  | 
||
45  | 
Minimum work:  | 
|
46  | 
||
47  | 
1. Extract and combine inventory comparisons  | 
|
48  | 
2. Extract and combine text comparisions for modified texts  | 
|
49  | 
||
50  | 
Basis Against Historical Tree  | 
|
51  | 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  | 
|
52  | 
This is another case of Historical Tree Against Historical Tree.  | 
|
53  | 
||
54  | 
Basis Against Basis  | 
|
55  | 
~~~~~~~~~~~~~~~~~~~  | 
|
56  | 
This is another case of Historical Tree Against Historical Tree.  | 
|
57  | 
||
58  | 
Working Tree Against Basis  | 
|
59  | 
~~~~~~~~~~~~~~~~~~~~~~~~~~  | 
|
60  | 
This must scale with the number of versioned files, unless the user indicates  | 
|
61  | 
that only certain files should be compared.  | 
|
62  | 
||
63  | 
Performance can be further improved by caching comparisons to avoid repeating  | 
|
64  | 
them. Caching could potentially be performed by ``diff`` and perhaps by  | 
|
65  | 
``merge``. Merge is aware of the relationship of a text merge's result to  | 
|
66  | 
the THIS value, and the THIS value is generally the basis value. So the  | 
|
67  | 
comparison is latent, but present. The only issue is extracting it.  | 
|
68  | 
||
69  | 
The cache could be indexed by sha1sum pairs. It could also be indexed by  | 
|
70  | 
file-id, to facilitate removal of stale data.  | 
|
71  | 
||
72  | 
Minimum work:  | 
|
73  | 
||
74  | 
1. Scan working tree for modified files  | 
|
75  | 
2. Retrieve cached comparisons  | 
|
76  | 
3. Perform comparisons on files with no cached comparisons  | 
|
77  | 
4. Cache comparisons for files with no cached comparisons  | 
|
78  | 
||
79  | 
Working Tree Against Historical Tree  | 
|
80  | 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  | 
|
81  | 
This can be structured as a comparison of working tree against basis tree,  | 
|
82  | 
followed by basis tree against historical tree. Therefore, it combines the  | 
|
83  | 
performance characteristics of "Working Tree Against Basis" with "Basis Against  | 
|
84  | 
Historical Tree".  | 
|
85  | 
||
86  | 
Working Tree Against Working Tree  | 
|
87  | 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  | 
|
88  | 
This can be structured as two comparisons against basis, and one comparison  | 
|
89  | 
of basis against basis. Its performance is therefore similar to Working Tree  | 
|
90  | 
Against Historical Tree.  | 
|
91  | 
||
92  | 
API Changes  | 
|
93  | 
-----------  | 
|
94  | 
||
95  | 
Desired API:  | 
|
96  | 
||
97  | 
- Tree.get_comparision(file_id, tree)  | 
|
98  | 
||
99  | 
This probably entails:  | 
|
100  | 
||
101  | 
- WorkingTree.store_comparison(file_id, revision_id, sha1, comparison)  | 
|
102  | 
- WorkingTree.get_comparison(file_id, revision_id, sha1)  | 
|
103  | 
- Repository.get_comparision(file_id, revision_id, revision_id)  | 
|
104  | 
- merge_comparisions(comparison, comparision)  | 
|
105  | 
||
106  | 
Storage considerations  | 
|
107  | 
----------------------  | 
|
108  | 
It must be cheap (e.g. scale with number of intermediate revisions) to perform  | 
|
109  | 
comparison of two historical texts. It must be cheap to perform comparison of  | 
|
110  | 
the inventories of two historical trees.  |