bzr branch
http://gegoxaren.bato24.eu/bzr/brz/remove-bazaar
|
2511.1.4
by Ian Clatworthy
updated NEWS and added commit performance notes to doc/developers |
1 |
Commit Performance Notes |
2 |
======================== |
|
3 |
||
4 |
.. contents:: |
|
5 |
||
6 |
Commit: The Minimum Work Required |
|
7 |
--------------------------------- |
|
8 |
||
9 |
This is Martin Pool's email to the mailing list on the minimum work that |
|
10 |
commit needs to do. Be sure to check the mailing list archives |
|
11 |
(https://lists.ubuntu.com/archives/bazaar/2007q2/025791.html) |
|
12 |
for follow-up comments from Robert and others ... |
|
13 |
||
14 |
Here is a description of the minimum work that commit must do. We |
|
15 |
want to make sure that our design doesn't cost too much more than this |
|
16 |
minimum. I am trying to do this without making too many assumptions |
|
17 |
about the underlying storage, but am assuming that the ui and basic |
|
18 |
architecture (wt, branch, repo) stays about the same. |
|
19 |
||
20 |
The basic purpose of commit is to: |
|
21 |
||
22 |
1. create and store a new revision based on the contents of the working tree |
|
23 |
2. make this the new basis revision for the working tree |
|
24 |
||
25 |
We can do a selected commit of only some files or subtrees. |
|
26 |
||
27 |
The best performance we could hope for is: |
|
28 |
- stat each versioned selected working file once |
|
29 |
- read from the workingtree and write into the repository any new file texts |
|
30 |
- in general, do work proportional to the size of the shape (eg |
|
31 |
inventory) of the old and new selected trees, and to the total size of |
|
32 |
the modified files |
|
33 |
||
34 |
In more detail: |
|
35 |
||
36 |
1.0 - Store new file texts: if a versioned file contains a new text |
|
37 |
there is no avoiding storing it. To determine which ones have changed |
|
38 |
we must go over the workingtree and at least stat each file. If the |
|
39 |
file is modified since it was last hashed, it must be read in. |
|
40 |
Ideally we would read it only once, and either notice that it has not |
|
41 |
changed, or store it at that point. |
|
42 |
||
43 |
On the other hand we want new code to be able to handle files that are |
|
44 |
larger than will fit in memory. We may then need to read each file up |
|
45 |
to two times: once to determine if there is a new text and calculate |
|
46 |
its hash, and again to store it. |
|
47 |
||
48 |
1.1 - Store a tree-shape description (ie inventory or similar.) This |
|
49 |
describes the non-file objects, and provides a reference from the |
|
50 |
Revision to the texts within it. |
|
51 |
||
52 |
1.2 - Generate and store a new revision object. |
|
53 |
||
54 |
1.3 - Do delta-compression on the stored objects. (git notably does |
|
55 |
not do this at commit time, deferring this entirely until later.) |
|
56 |
This requires finding the appropriate basis for each modified file: in |
|
57 |
the current scheme we get the file id, last-revision from the |
|
58 |
dirstate, look into the knit for that text, extract that text in |
|
59 |
total, generate a delta, then store that into the knit. Most delta |
|
60 |
operations are O(n2) to O(n3) in the size of the modified files. |
|
61 |
||
62 |
1.4 - Cache annotation information for the changes: at the moment this |
|
63 |
is done as part of the delta storage. There are some flaws in that |
|
64 |
approach, such as that it is not updated when ghosts are filled, and |
|
65 |
the annotation can't be re-run with new diff parameters. |
|
66 |
||
67 |
2.1 - Make the new revision the basis for the tree, and clear the list |
|
68 |
of parents. Strictly this is all that's logically necessary, unless |
|
69 |
the working tree format requires more work. |
|
70 |
||
71 |
The dirstate format does require more work, because it caches the |
|
72 |
parent tree data for each file within the working tree data. In |
|
73 |
practice this means that every commit rewrites the entire dirstate |
|
74 |
file - we could try to avoid rewriting the whole file but this may be |
|
75 |
difficult because variable-length data (the last-changed revision id) |
|
76 |
is inserted into many rows. |
|
77 |
||
78 |
The current dirstate design then seems to mean that any commit of a |
|
79 |
single file imposes a cost proportional to the size of the current |
|
80 |
workingtree. Maybe there are other benefits that outweigh this. |
|
81 |
Alternatively if it was fast enough for operations to always look at |
|
82 |
the original storage of the parent trees we could do without the |
|
83 |
cache. |
|
84 |
||
85 |
2.2 - Record the observed file hashes into the workingtree control |
|
86 |
files. For the files that we just committed, we have the information |
|
87 |
to store a valid hash cache entry: we know their stat information and |
|
88 |
the sha1 of the file contents. This is not strictly necessary to the |
|
89 |
speed of commit, but it will be useful later in avoiding reading those |
|
90 |
files, and the only cost of doing it now is writing it out. |
|
91 |
||
92 |
In fact there are some user interface niceties that complicate this: |
|
93 |
||
94 |
3 - Before starting the commit proper, we prompt for a commit message |
|
95 |
and in that commit message editor we show a list of the files that |
|
96 |
will be committed: basically the output of bzr status. This is |
|
97 |
basically the same as the list of changes we detect while storing the |
|
98 |
commit, but because the user will sometimes change the tree after |
|
99 |
opening the commit editor and expect the final state to be committed I |
|
100 |
think we do have to look for changes twice. Since it takes the user a |
|
101 |
while to enter a message this is not a big problem as long as both the |
|
102 |
status summary and the commit are individually fast. |
|
103 |
||
104 |
4 - As the commit proceeds (or after?) we show another status-like |
|
105 |
summary. Just printing the names of modified files as they're stored |
|
106 |
would be easy. Recording deleted and renamed files or directories is |
|
107 |
more work: this can only be done by reference to the primary parent |
|
108 |
tree and requires it be read in. Worse, reporting renames requires |
|
109 |
searching by id across the entire parent tree. Possibly full |
|
110 |
reporting should be a default-off verbose option because it does |
|
111 |
require more work beyond the commit itself. |
|
112 |
||
113 |
5 - Bazaar currently allows for missing files to be automatically |
|
114 |
marked as removed at the time of commit. Leaving aside the ui |
|
115 |
consequences, this means that we have to update the working inventory |
|
116 |
to mark these files as removed. Since as discussed above we always |
|
117 |
have to rewrite the dirstate on commit this is not substantial, though |
|
118 |
we should make sure we do this in one pass, not two. I have |
|
119 |
previously proposed to make this behaviour a non-default option. |
|
120 |
||
121 |
We may need to run hooks or generate signatures during commit, but |
|
122 |
they don't seem to have substantial performance consequences. |
|
123 |
||
124 |
If one wanted to optimize solely for the speed of commit I think |
|
125 |
hash-addressed file-per-text storage like in git (or bzr 0.1) is very |
|
126 |
good. Remarkably, it does not need to read the inventory for the |
|
127 |
previous revision. For each versioned file, we just need to get its |
|
128 |
hash, either by reading the file or validating its stat data. If that |
|
129 |
hash is not already in the repository, the file is just copied in and |
|
130 |
compressed. As directories are traversed, they're turned into texts |
|
131 |
and stored as well, and then finally the revision is too. This does |
|
132 |
depend on later doing some delta compression of these texts. |
|
133 |
||
134 |
Variations on this are possible. Rather than writing a single file |
|
135 |
into the repository for each text, we could fold them into a single |
|
136 |
collation or pack file. That would create a smaller number of files |
|
137 |
in the repository, but looking up a single text would require looking |
|
138 |
into their indexes rather than just asking the filesystem. |
|
139 |
||
140 |
Rather than using hashes we can use file-id/rev-id pairs as at |
|
141 |
present, which has several consequences pro and con. |
|
142 |
||
143 |
||
144 |
Commit vs Status |
|
145 |
---------------- |
|
146 |
||
147 |
At first glance, commit simply stores the changes status reports. In fact, |
|
148 |
this isn't technically correct: commit considers some files modified that |
|
149 |
status does not. The notes below were put together by John Arbash Meinel |
|
150 |
and Aaron Bentley in May 2007 to explain the finer details of commit to |
|
151 |
Ian Clatworthy. They are recorded here as they are likely to be useful to |
|
152 |
others new to Bazaar ... |
|
153 |
||
154 |
1) **Unknown files have a different effect.** With --no-strict (the default) |
|
155 |
they have no effect and can be completely ignored. With --strict they |
|
156 |
should cause the commit to abort (so you don't forget to add the two new |
|
157 |
test files that you just created). |
|
158 |
||
159 |
2) **Multiple parents.** 'status' always compares 2 trees, typically the |
|
160 |
last-committed tree and the current working tree. 'commit' will compare |
|
161 |
more trees if there has been a merge. |
|
162 |
||
163 |
a) The "last modified" property for files. |
|
164 |
A file may be marked as changed since the last commit, but that |
|
165 |
change may have come in from the merge, and the change could have |
|
166 |
happened several commits back. There are several edge cases to be |
|
167 |
handled here, like if both branches modified the same file, or if |
|
168 |
just one branch modified it. |
|
169 |
||
170 |
b) The trickier case is when a file appears unmodified since last |
|
171 |
commit, but it was modified versus one of the merged branches. I |
|
172 |
believe there are a few ways this can happen, like if a merged |
|
173 |
branch changes a file and then reverts it back (you still update |
|
174 |
the 'last modified' field). |
|
175 |
In general, if both sides disagree on the 'last-modified' flag, |
|
176 |
then you need to generate a new entry pointing 'last-modified' at |
|
177 |
this revision (because you are resolving the differences between |
|
178 |
the 2 parents). |
|
179 |
||
180 |
3) **Automatic deletion of 'missing' files.** This is a point that we go |
|
181 |
back and forth on. I think the basic idea is that 'bzr commit' by |
|
182 |
default should abort if it finds a 'missing' file (in case that file was |
|
183 |
renamed rather than deleted), but 'bzr commit --auto' can add unknown |
|
184 |
files and remove missing files automatically. |
|
185 |
||
186 |
4) **sha1 for newly added files.** status doesn't really need this: it should |
|
187 |
only care that the file is not present in base, but is present now. In |
|
188 |
some ways commit doesn't care either, since it needs to read and sha the |
|
189 |
file itself anyway. |
|
190 |
||
191 |
5) **Nested trees.** status doesn't recurse into nested trees, but commit does. |
|
192 |
This is just because not all of the nested-trees work has been merged yet. |
|
193 |
||
194 |
A tree-reference is considered modified if the subtree has been |
|
195 |
committed since the last containing-tree commit. But commit needs to |
|
196 |
recurse into every subtree, to ensure that a commit is done if the |
|
197 |
subtree has changed since its last commit. _iter_changes only reports |
|
198 |
on tree-references that are modified, so it can't be used for doing |
|
199 |
subtree commits. |
|
200 |
||
201 |
||
202 |
Avoiding Work: Smarter Change Detection |
|
203 |
--------------------------------------- |
|
204 |
||
205 |
Commit currently walks through every file building an inventory. Here is |
|
206 |
Aaron's brain dump on a better way ... |
|
207 |
||
208 |
_iter_changes won't tell us about tree references that haven't changed, |
|
209 |
even if those subtrees have changed. (Unless we ask for unchanged |
|
210 |
files, which we don't want to do, of course.) |
|
211 |
||
212 |
There is an iter_references method, but using it looks just as expensive |
|
213 |
as calling kind(). |
|
214 |
||
215 |
I did some work on updating commit to use iter_changes, but found for |
|
216 |
multi-parent trees, I had to fall back to the slow inventory comparison |
|
217 |
approach. |
|
218 |
||
219 |
Really, I think we need a call akin to iter_changes that handles |
|
220 |
multiple parents, and knows to emit entries when InventoryEntry.revision |
|
221 |
is all that's changed. |
|
222 |
||
223 |
||
224 |
Avoiding Work: Better Layering |
|
225 |
------------------------------ |
|
226 |
||
227 |
For each file, commit is currently doing more work than it should. Here is |
|
228 |
John's take on a better way ... |
|
229 |
||
230 |
Note that "_iter_changes" *does* have to touch every path on disk, but |
|
231 |
it just can do it in a more efficient manner. (It doesn't have to create |
|
232 |
an InventoryEntry for all the ones that haven't changed). |
|
233 |
||
234 |
I agree with Aaron that we need something a little different than |
|
235 |
_iter_changes. Both because of handling multiple parents, as well as we |
|
236 |
don't want it to actually read the files if we have a stat-cache miss. |
|
237 |
||
238 |
Specifically, the commit code *has* to read the files because it is |
|
239 |
going to add the text to the repository, and we want it to compute the |
|
240 |
sha1 at *that* time, so we are guaranteed to have the valid sha (rather |
|
241 |
than just whatever the last cached one was). So we want the code to |
|
242 |
return 'None' if it doesn't have an up-to-date sha1, rather than reading |
|
243 |
the file and computing it, just before it returns it to the parent. |
|
244 |
||
245 |
The commit code (0.16) should really be restructured. It's layering is |
|
246 |
pretty wrong. |
|
247 |
||
248 |
Specifically, calling "kind()" requires a stat of the file. But we have |
|
249 |
to do a stat to get the size/whether the record is up-to-date, etc. So |
|
250 |
we really need to have a "create_an_up_to_date_inventory()" function. |
|
251 |
But because we are accessing every object on disk, we want to be working |
|
252 |
in tuples rather than Inventory objects. And because DirState already |
|
253 |
has the parent records next to the current working inventory, it can do |
|
254 |
all the work to do really fast comparison and throw-away of unimportant |
|
255 |
records. |
|
256 |
||
257 |
The way I made "bzr status" fast is by moving the 'ignore this record' |
|
258 |
ability as deep into the stack as I could get. Status has the property |
|
259 |
that you don't care about most of the records, just like commit. So the |
|
260 |
sooner you can stop evaluating the 99% that you don't care about, the |
|
261 |
less work you do. |
|
262 |