bzr branch
http://gegoxaren.bato24.eu/bzr/brz/remove-bazaar
|
2513.1.1
by Martin Pool
(in progress) analysis of commit |
1 |
Commit |
2 |
====== |
|
3 |
||
4 |
The basic purpose of commit is to |
|
5 |
1 - create and store a new revision based on the contents of the working tree |
|
6 |
2 - make this the new basis revision for the working tree |
|
7 |
||
8 |
We can do a selected commit of only some files or subtrees. |
|
9 |
||
10 |
Minimum work |
|
11 |
------------ |
|
12 |
||
13 |
The best performance we could hope for is: |
|
14 |
- stat each versioned selected working file once |
|
15 |
- read from the workingtree and write into the repository any new file texts |
|
16 |
- in general, do work proportional to the size of the shape (eg |
|
17 |
inventory) of the old and new selected trees, and to the total size of |
|
18 |
the modified files |
|
19 |
||
20 |
In more detail: |
|
21 |
||
22 |
1.0 - Store new file texts: if a versioned file contains a new text |
|
23 |
there is no avoiding storing it. To determine which ones have changed |
|
24 |
we must go over the workingtree and at least stat each file. If the |
|
25 |
file is modified since it was last hashed, it must be read in. |
|
26 |
Ideally we would read it only once, and either notice that it has not |
|
27 |
changed, or store it at that point. |
|
28 |
||
29 |
On the other hand we want new code to be able to handle files that are |
|
30 |
larger than will fit in memory. We may then need to read each file up |
|
31 |
to two times: once to determine if there is a new text and calculate |
|
32 |
its hash, and again to store it. |
|
33 |
||
34 |
1.1 - Store a tree-shape description (ie inventory or similar.) This |
|
35 |
describes the non-file objects, and provides a reference from the |
|
36 |
Revision to the texts within it. |
|
37 |
||
38 |
1.2 - Generate and store a new revision object. |
|
39 |
||
40 |
1.3 - Do delta-compression on the stored objects. (git notably does |
|
41 |
not do this at commit time, deferring this entirely until later.) |
|
42 |
This requires finding the appropriate basis for each modified file: in |
|
43 |
the current scheme we get the file id, last-revision from the |
|
44 |
dirstate, look into the knit for that text, extract that text in |
|
45 |
total, generate a delta, then store that into the knit. Most delta |
|
46 |
operations are O(n^2) to O(n^3) in the size of the modified files. |
|
47 |
||
48 |
1.4 - Cache annotation information for the changes: at the moment this |
|
49 |
is done as part of the delta storage. There are some flaws in that |
|
50 |
approach, such as that it is not updated when ghosts are filled, and |
|
51 |
the annotation can't be re-run with new diff parameters. |
|
52 |
||
53 |
2.1 - Make the new revision the basis for the tree, and clear the list |
|
54 |
of parents. Strictly this is all that's logically necessary, unless |
|
55 |
the working tree format requires more work. |
|
56 |
||
57 |
The dirstate format does require more work, because it caches the |
|
58 |
parent tree data for each file within the working tree data. In |
|
59 |
practice this means that every commit rewrites the entire dirstate |
|
60 |
file - we could try to avoid rewriting the whole file but this may be |
|
61 |
difficult because variable-length data (the last-changed revision id) |
|
62 |
is inserted into many rows. |
|
63 |
||
64 |
The current dirstate design then seems to mean that any commit of a |
|
65 |
single file imposes a cost proportional to the size of the current |
|
66 |
workingtree. Maybe there are other benefits that outweigh this. |
|
67 |
Alternatively if it was fast enough for operations to always look at |
|
68 |
the original storage of the parent trees we could do without the |
|
69 |
cache. |
|
70 |
||
71 |
2.2 - Record the observed file hashes into the workingtree control |
|
72 |
files. For the files that we just committed, we have the information |
|
73 |
to store a valid hash cache entry: we know their stat information and |
|
74 |
the sha1 of the file contents. This is not strictly necessary to the |
|
75 |
speed of commit, but it will be useful later in avoiding reading those |
|
76 |
files, and the only cost of doing it now is writing it out. |
|
77 |
||
78 |
In fact there are some user interface niceties that complicate this: |
|
79 |
||
80 |
3 - Before starting the commit proper, we prompt for a commit message |
|
81 |
and in that commit message editor we show a list of the files that |
|
82 |
will be committed: basically the output of bzr status. This is |
|
83 |
basically the same as the list of changes we detect while storing the |
|
84 |
commit, but because the user will sometimes change the tree after |
|
85 |
opening the commit editor and expect the final state to be committed I |
|
86 |
think we do have to look for changes twice. Since it takes the user a |
|
87 |
while to enter a message this is not a big problem as long as both the |
|
88 |
status summary and the commit are individually fast. |
|
89 |
||
90 |
4 - As the commit proceeds (or after?) we show another status-like |
|
91 |
summary. Just printing the names of modified files as they're stored |
|
92 |
would be easy. Recording deleted and renamed files or directories is |
|
93 |
more work: this can only be done by reference to the primary parent |
|
94 |
tree and requires it be read in. Worse, reporting renames requires |
|
95 |
searching by id across the entire parent tree. Possibly full |
|
96 |
reporting should be a default-off verbose option because it does |
|
97 |
require more work beyond the commit itself. |
|
98 |
||
99 |
5 - Bazaar currently allows for missing files to be automatically |
|
100 |
marked as removed at the time of commit. Leaving aside the ui |
|
101 |
consequences, this means that we have to update the working inventory |
|
102 |
to mark these files as removed. Since as discussed above we always |
|
103 |
have to rewrite the dirstate on commit this is not substantial, though |
|
104 |
we should make sure we do this in one pass, not two. I have |
|
105 |
previously proposed to make this behaviour a non-default option. |
|
106 |
||
107 |
We may need to run hooks or generate signatures during commit, but |
|
108 |
they don't seem to have substantial performance consequences. |
|
109 |
||
110 |
If one wanted to optimize solely for the speed of commit I think |
|
111 |
hash-addressed file-per-text storage like in git (or bzr 0.1) is very |
|
112 |
good. Remarkably, it does not need to read the inventory for the |
|
113 |
previous revision. For each versioned file, we just need to get its |
|
114 |
hash, either by reading the file or validating its stat data. If that |
|
115 |
hash is not already in the repository, the file is just copied in and |
|
116 |
compressed. As directories are traversed, they're turned into texts |
|
117 |
and stored as well, and then finally the revision is too. This does |
|
118 |
depend on later doing some delta compression of these texts. |
|
119 |
||
120 |
Variations on this are possible. Rather than writing a single file |
|
121 |
into the repository for each text, we could fold them into a single |
|
122 |
collation or pack file. That would create a smaller number of files |
|
123 |
in the repository, but looking up a single text would require looking |
|
124 |
into their indexes rather than just asking the filesystem. |
|
125 |
||
126 |
Rather than using hashes we can use file-id/rev-id pairs as at |
|
127 |
present, which has several consequences pro and con. |
|
128 |
||
129 |
Interface stack |
|
130 |
--------------- |
|
131 |
||
132 |
The commit api is invoked by the command interface, and copies information |
|
133 |
from the tree into the branch and its repository, possibly updating the |
|
134 |
WorkingTree afterwards. |
|
135 |
||
136 |
The command interface passes: |
|
137 |
||
138 |
* a commit message (from an option, if any), |
|
139 |
* or an indication that it should be read interactively from the ui object; |
|
140 |
* a list of files to commit |
|
141 |
* an option for a dry-run commit |
|
142 |
* verbose option, or callback to indicate |
|
143 |
* timestamp, timezone, committer, chosen revision id |
|
144 |
* config (for what?) |
|
145 |
* option for local-only commit on a bound branch |
|
146 |
* option for strict commits (fail if there are unknown or missing files) |
|
147 |
* option to allow "pointless" commits (with no tree changes) |
|
148 |
||
149 |
>>> Branch.commit(from_tree, message, files_to_commit) |
|
150 |
||
151 |
There will be different implementations of this for different Branch |
|
152 |
classes, whether for foreign branches or Bazaar repositories using |
|
153 |
different storage methods. |
|
154 |
||
155 |
Most of the commit should occur during a single lockstep iteration across |
|
156 |
the workingtree and parent trees. The WorkingTree interface needs to |
|
157 |
provide methods that give commit all it needs. Some of these methods |
|
158 |
(such as answering the file's last change revision) may be deprecated in |
|
159 |
newer working trees and there we have a choice of either calculating the |
|
160 |
value from the data that is present, or refusing to support commit to |
|
161 |
newer repositories. |
|
162 |
||
163 |
For a dirstate tree the iteration of changes from the parent can easily be |
|
164 |
done within its own iter_changes. |
|
165 |
||
166 |
XXX: We currently don't support selective-file commit of a merge; this |
|
167 |
could be done if we decide how it should be recorded - is this to be |
|
168 |
stored as an overall merge revision; as a preliminary non-merge revisions; |
|
169 |
or will the per-file graph diverge from the revision graph. |
|
170 |
||
171 |
Other things commit needs to do: |
|
172 |
||
173 |
* check if there are any conflicts in the tree - if so, commit cannot |
|
174 |
continue |
|
175 |
||
176 |
* check if there are any unknown files, if --strict or automatic add is |
|
177 |
turned on |
|
178 |
||
179 |
* check the working tree basis version is up to date with the branch tip |
|
180 |
||
181 |
* when automatically adding new files or deleting missing files during |
|
182 |
commit, they must be noted during commit and written into the working |
|
183 |
tree at some point |
|
184 |
||
185 |
* refuse "pointless" commits with no file changes - should be easy by |
|
186 |
just refusing to do the final step of storing a new overall inventory |
|
187 |
and revision object |
|
188 |
||
189 |
* heuristic detection of renames between add and delete (out of scope for |
|
190 |
this change) |
|
191 |
||
192 |
* pushing changes to a master branch if any |
|
193 |
||
194 |
* running hooks, pre and post commit |
|
195 |
||
196 |
* prompting for a commit message if necessary, including a list of the |
|
197 |
changes that have already been observed |
|
198 |
||
199 |
* if there are tree references and recursing into them is enabled, then |
|
200 |
do so |
|
201 |
||
202 |
Updates that need to be made in the working tree, either on conclusion |
|
203 |
of commit or during the scan, include |
|
204 |
||
205 |
* Changes made to the tree shape, including automatic adds, renames or |
|
206 |
deletes |
|
207 |
||
208 |
* For trees (eg dirstate) that cache parent inventories, the old parent |
|
209 |
information must be removed and the new one inserted |
|
210 |
||
211 |
* The tree hashcache information should be updated to reflect the stat |
|
212 |
value at which the file was the same as the committed version. This |
|
213 |
needs to be done carefully to prevent inconsistencies if the file is |
|
214 |
modified during or shortly after the commit. Perhaps it would work to |
|
215 |
read the mtime of the file before we read its text to commit. |
|
216 |
||
217 |
Dirstate inventories may be most easily updated in a single operation at |
|
218 |
the end; however it may be best to accumulate data as we proceed through |
|
219 |
the tree rather than revisiting it at the end. |
|
220 |
||
221 |
Showing a progress bar for commit may not be necessary if we report files |
|
222 |
as they are committed. Alternatively we could transiently show a progress |
|
223 |
bar for each directory that's scanned, even if no changes are observed. |
|
224 |
||
225 |
This needs to collect a list of added/changed/removed files, each of which |
|
226 |
must have its text stored (if any) and containing directory updated. This |
|
227 |
can be done by calling Tree._iter_changes on the source tree, asking for |
|
228 |
changes |
|
229 |
||
230 |
In the 0.17 model the commit operation needs to know the per-file parents |
|
231 |
and per-file last-changed revision. |
|
232 |
||
233 |
XXX: If we want to retain explicitly stored per-file graphs, it would seem |
|
234 |
that we do need to record per-file parents. We have not yet finally |
|
235 |
settled that we do want to remove them or treat them as a cache. This api |
|
236 |
stack is still ok whether we do or not, but the internals of it may |
|
237 |
change. |
|
238 |
||
239 |
(In this and other operations we must avoid having multiple layers walk |
|
240 |
over the tree separately. For example, it is no good to have the Command |
|
241 |
layer walk the tree to generate a list of all file ids to commit, because |
|
242 |
the tree will also be walked later. The layers that do need to operate |
|
243 |
per-file should probably be bound together in a per-dirblock iterator, |
|
244 |
rather than each iterating independently.) |