/brz/remove-bazaar : contents of doc/developers/commit.txt at revision 2513.1.1

Commit
======

The basic purpose of commit is to:
1 - create and store a new revision based on the contents of the working tree
2 - make this the new basis revision for the working tree

We can do a selected commit of only some files or subtrees.

Minimum work
------------

The best performance we could hope for is:
 - stat each versioned selected working file once
 - read from the workingtree and write into the repository any new file texts
 - in general, do work proportional to the size of the shape (eg
inventory) of the old and new selected trees, and to the total size of
the modified files

In more detail:

1.0 - Store new file texts: if a versioned file contains a new text
there is no avoiding storing it.  To determine which ones have changed
we must go over the workingtree and at least stat each file.  If the
file is modified since it was last hashed, it must be read in.
Ideally we would read it only once, and either notice that it has not
changed, or store it at that point.

On the other hand we want new code to be able to handle files that are
larger than will fit in memory.  We may then need to read each file up
to two times: once to determine if there is a new text and calculate
its hash, and again to store it.
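
The stat-then-hash check in step 1.0 can be sketched roughly as below.
This is only an illustration, not Bazaar's hashcache code: the cache
layout (a dict mapping path to size, mtime and sha1) and the names
``file_sha1`` and ``current_sha1`` are assumptions made for the example.
Hashing in fixed-size chunks keeps memory bounded for large files, which
is why a second read is needed if the text then has to be stored.

```python
import hashlib
import os

CHUNK = 64 * 1024  # read size; keeps memory use bounded for large files


def file_sha1(path):
    """Hash a file in chunks so it never has to fit in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.hexdigest()


def current_sha1(path, cache):
    """Return (sha1, was_read) for path.

    cache is a hypothetical dict: path -> (size, mtime_ns, sha1).
    The file is only read if its stat fingerprint differs from the
    cached one.
    """
    st = os.stat(path)
    fingerprint = (st.st_size, st.st_mtime_ns)
    cached = cache.get(path)
    if cached is not None and cached[:2] == fingerprint:
        return cached[2], False
    sha1 = file_sha1(path)
    cache[path] = fingerprint + (sha1,)
    return sha1, True
```

An unchanged file then costs one stat call and no read on the next
commit.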

1.1 - Store a tree-shape description (ie inventory or similar.)  This
describes the non-file objects, and provides a reference from the
Revision to the texts within it.

1.2 - Generate and store a new revision object.

1.3 - Do delta-compression on the stored objects.  (git notably does
not do this at commit time, deferring this entirely until later.)
This requires finding the appropriate basis for each modified file: in
the current scheme we get the (file id, last-revision) pair from the
dirstate, look into the knit for that text, extract the full text,
generate a delta, then store that into the knit.  Most delta
operations are O(n^2) to O(n^3) in the size of the modified files.
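
As a rough illustration of "extract the basis, generate a delta, store
the delta", the sketch below builds a line-based delta with Python's
standard ``difflib``.  This is a stand-in: it is not the knit format and
not necessarily the diff algorithm Bazaar uses, and the function names
are invented for the example.

```python
from difflib import SequenceMatcher


def line_delta(basis_lines, new_lines):
    """Return a list of (start, end, replacement_lines) against basis."""
    matcher = SequenceMatcher(None, basis_lines, new_lines, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Replace basis[i1:i2] with the corresponding new lines.
            delta.append((i1, i2, new_lines[j1:j2]))
    return delta


def apply_delta(basis_lines, delta):
    """Rebuild the new text from the basis and a delta."""
    out = []
    pos = 0
    for start, end, replacement in delta:
        out.extend(basis_lines[pos:start])  # unchanged run before the hunk
        out.extend(replacement)
        pos = end
    out.extend(basis_lines[pos:])
    return out
```

Note that generating the delta needs the whole basis text in hand, which
is the extraction cost described above.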

1.4 - Cache annotation information for the changes: at the moment this
is done as part of the delta storage.  There are some flaws in that
approach, such as that it is not updated when ghosts are filled, and
the annotation can't be re-run with new diff parameters.

2.1 - Make the new revision the basis for the tree, and clear the list
of parents.  Strictly this is all that's logically necessary, unless
the working tree format requires more work.

The dirstate format does require more work, because it caches the
parent tree data for each file within the working tree data.  In
practice this means that every commit rewrites the entire dirstate
file - we could try to avoid rewriting the whole file but this may be
difficult because variable-length data (the last-changed revision id)
is inserted into many rows.

The current dirstate design then seems to mean that any commit of a
single file imposes a cost proportional to the size of the current
workingtree.  Maybe there are other benefits that outweigh this.
Alternatively, if it were fast enough for operations to always look at
the original storage of the parent trees, we could do without the
cache.

2.2 - Record the observed file hashes into the workingtree control
files.  For the files that we just committed, we have the information
to store a valid hash cache entry: we know their stat information and
the sha1 of the file contents.  This is not strictly necessary to the
speed of commit, but it will be useful later in avoiding reading those
files, and the only cost of doing it now is writing it out.

In fact there are some user interface niceties that complicate this:

3 - Before starting the commit proper, we prompt for a commit message
and in that commit message editor we show a list of the files that
will be committed: essentially the output of bzr status.  This is
much the same as the list of changes we detect while storing the
commit, but because the user will sometimes change the tree after
opening the commit editor and expect the final state to be committed, I
think we do have to look for changes twice.  Since it takes the user a
while to enter a message, this is not a big problem as long as both the
status summary and the commit are individually fast.

4 - As the commit proceeds (or after?) we show another status-like
summary.  Just printing the names of modified files as they're stored
would be easy.  Recording deleted and renamed files or directories is
more work: this can only be done by reference to the primary parent
tree and requires it to be read in.  Worse, reporting renames requires
searching by id across the entire parent tree.  Possibly full
reporting should be a default-off verbose option because it does
require more work beyond the commit itself.
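
To see why full reporting needs the whole parent tree, consider this
sketch, which assumes each tree can be reduced to a hypothetical
``{file_id: path}`` dict (not an actual Bazaar inventory object).
Deletes and renames only become visible once every id in the parent has
been compared against the working tree.

```python
def classify_changes(parent, working):
    """Compare two hypothetical {file_id: path} maps.

    Returns (added, removed, renamed) where renamed is a list of
    (old_path, new_path) pairs.
    """
    added = sorted(working[i] for i in working.keys() - parent.keys())
    removed = sorted(parent[i] for i in parent.keys() - working.keys())
    renamed = sorted(
        (parent[i], working[i])
        for i in parent.keys() & working.keys()
        if parent[i] != working[i]
    )
    return added, removed, renamed
```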

5 - Bazaar currently allows for missing files to be automatically
marked as removed at the time of commit.  Leaving aside the ui
consequences, this means that we have to update the working inventory
to mark these files as removed.  Since, as discussed above, we always
have to rewrite the dirstate on commit, this extra cost is not
substantial, though we should make sure we do this in one pass, not
two.  I have previously proposed to make this behaviour a non-default
option.

We may need to run hooks or generate signatures during commit, but
they don't seem to have substantial performance consequences.

If one wanted to optimize solely for the speed of commit I think
hash-addressed file-per-text storage like in git (or bzr 0.1) is very
good.  Remarkably, it does not need to read the inventory for the
previous revision.  For each versioned file, we just need to get its
hash, either by reading the file or validating its stat data.  If that
hash is not already in the repository, the file is just copied in and
compressed.  As directories are traversed, they're turned into texts
and stored as well, and then finally the revision is too.  This does
depend on later doing some delta compression of these texts.
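
A minimal sketch of hash-addressed file-per-text storage follows.  The
directory layout is loosely modelled on the idea, but it is not git's
actual object format (git hashes a header together with the content),
and ``store_text`` and ``load_text`` are names invented for the example.

```python
import hashlib
import os
import zlib


def store_text(repo_dir, data):
    """Store data under its sha1; return (sha1, newly_written)."""
    sha1 = hashlib.sha1(data).hexdigest()
    path = os.path.join(repo_dir, sha1[:2], sha1[2:])
    if os.path.exists(path):
        return sha1, False  # text already present, nothing to write
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(data))
    return sha1, True


def load_text(repo_dir, sha1):
    """Read a stored text back by its hash."""
    with open(os.path.join(repo_dir, sha1[:2], sha1[2:]), "rb") as f:
        return zlib.decompress(f.read())
```

Nothing here consults the previous revision: whether a text is new is
decided purely by whether its hash already exists in the store.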

Variations on this are possible.  Rather than writing a single file
into the repository for each text, we could fold them into a single
collation or pack file.  That would create a smaller number of files
in the repository, but looking up a single text would require looking
into their indexes rather than just asking the filesystem.
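
The pack variation might look something like this in miniature.  It is
an in-memory sketch with an invented layout (one concatenated blob plus
a dict index of offsets and lengths), not any real Bazaar or git pack
format.

```python
import io


def write_pack(texts):
    """Concatenate texts into one blob plus an index.

    texts is a dict: key -> bytes.  Returns (pack_bytes, index) where
    index maps key -> (offset, length).
    """
    pack = io.BytesIO()
    index = {}
    for key, data in texts.items():
        index[key] = (pack.tell(), len(data))
        pack.write(data)
    return pack.getvalue(), index


def read_from_pack(pack_bytes, index, key):
    """Look up one text through the index rather than the filesystem."""
    offset, length = index[key]
    return pack_bytes[offset:offset + length]
```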

Rather than using hashes we can use file-id/rev-id pairs as at
present, which has several consequences pro and con.

Interface stack
---------------

The commit api is invoked by the command interface, and copies information
from the tree into the branch and its repository, possibly updating the
WorkingTree afterwards.

The command interface passes:
 
 * a commit message (from an option, if any),
 * or an indication that it should be read interactively from the ui object;
 * a list of files to commit
 * an option for a dry-run commit
 * verbose option, or callback to indicate 
 * timestamp, timezone, committer, chosen revision id
 * config (for what?)
 * option for local-only commit on a bound branch
 * option for strict commits (fail if there are unknown or missing files)
 * option to allow "pointless" commits (with no tree changes)

>>> Branch.commit(from_tree, message, files_to_commit)

There will be different implementations of this for different Branch
classes, whether for foreign branches or Bazaar repositories using
different storage methods.

Most of the commit should occur during a single lockstep iteration across
the workingtree and parent trees.  The WorkingTree interface needs to
provide methods that give commit all it needs.  Some of these methods
(such as answering the file's last change revision) may be deprecated in
newer working trees and there we have a choice of either calculating the
value from the data that is present, or refusing to support commit to
newer repositories.

For a dirstate tree the iteration of changes from the parent can easily be
done within its own iter_changes.
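
The lockstep iteration can be pictured as a merge over two sorted
sequences.  The sketch below is a simplification made for illustration:
it pairs entries by path, whereas real trees are matched by file id, and
the ``(path, sha1)`` tuples are an assumed representation rather than
what iter_changes yields.

```python
def lockstep(parent_entries, working_entries):
    """Walk two path-sorted lists of (path, sha1) once, in step.

    Yields (path, parent_sha1, working_sha1); a side that lacks the
    path yields None for its sha1.
    """
    p = w = 0
    while p < len(parent_entries) or w < len(working_entries):
        p_path = parent_entries[p][0] if p < len(parent_entries) else None
        w_path = working_entries[w][0] if w < len(working_entries) else None
        if w_path is None or (p_path is not None and p_path < w_path):
            # Present only in the parent: removed in the working tree.
            yield p_path, parent_entries[p][1], None
            p += 1
        elif p_path is None or w_path < p_path:
            # Present only in the working tree: newly added.
            yield w_path, None, working_entries[w][1]
            w += 1
        else:
            yield p_path, parent_entries[p][1], working_entries[w][1]
            p += 1
            w += 1
```

Each tree is visited exactly once, so the cost stays proportional to the
combined size of the two trees.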

XXX: We currently don't support selective-file commit of a merge; this
could be done if we decide how it should be recorded - is this to be
stored as an overall merge revision; as preliminary non-merge revisions;
or will the per-file graph diverge from the revision graph?

Other things commit needs to do:

 * check if there are any conflicts in the tree - if so, commit cannot
   continue

 * check if there are any unknown files, if --strict or automatic add is
   turned on

 * check the working tree basis version is up to date with the branch tip

 * when automatically adding new files or deleting missing files during
   commit, they must be noted during commit and written into the working
   tree at some point

 * refuse "pointless" commits with no file changes - should be easy by
   just refusing to do the final step of storing a new overall inventory
   and revision object

 * heuristic detection of renames between add and delete (out of scope for
   this change)

 * pushing changes to a master branch if any

 * running hooks, pre and post commit

 * prompting for a commit message if necessary, including a list of the
   changes that have already been observed

 * if there are tree references and recursing into them is enabled, then
   do so
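
Several of the checks in the list above could be gathered into one
precondition function, sketched here.  Everything in it is hypothetical:
the ``CommitRefused`` exception, the argument names, and the idea of
passing plain lists and revision ids are assumptions for the example,
not the existing commit api.

```python
class CommitRefused(Exception):
    """Raised when a precondition for commit does not hold."""


def check_preconditions(conflicts, unknowns, basis_revid, branch_tip,
                        changes, strict=False, allow_pointless=False):
    """Raise CommitRefused if the commit must not proceed."""
    if conflicts:
        raise CommitRefused("conflicts in tree: " + ", ".join(conflicts))
    if strict and unknowns:
        raise CommitRefused("unknown files: " + ", ".join(unknowns))
    if basis_revid != branch_tip:
        raise CommitRefused("working tree is out of date with branch")
    if not changes and not allow_pointless:
        raise CommitRefused("no changes to commit")
```

In a real implementation the "pointless" test would more likely fall out
of the tree scan itself, as the list suggests, rather than need a
precomputed list of changes.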

Updates that need to be made in the working tree, either on conclusion
of commit or during the scan, include:

 * Changes made to the tree shape, including automatic adds, renames or
   deletes

 * For trees (eg dirstate) that cache parent inventories, the old parent
   information must be removed and the new one inserted

 * The tree hashcache information should be updated to reflect the stat
   value at which the file was the same as the committed version.  This
   needs to be done carefully to prevent inconsistencies if the file is
   modified during or shortly after the commit.  Perhaps it would work to
   read the mtime of the file before we read its text to commit.
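
One way to read the "stat before read" suggestion is sketched below.
The second stat after the read is an extra assumption added for the
example, as is the ``read_for_commit`` name; the sketch also reads the
whole file into memory for brevity, which a large-file-safe
implementation would avoid.

```python
import hashlib
import os


def read_for_commit(path):
    """Return (data, sha1, fingerprint_or_None).

    The fingerprint is (size, mtime_ns) taken before reading.  It is
    returned only if the file looks unchanged after the read;
    otherwise None signals that no hash cache entry should be written.
    """
    before = os.stat(path)
    with open(path, "rb") as f:
        data = f.read()
    after = os.stat(path)
    sha1 = hashlib.sha1(data).hexdigest()
    stable = (before.st_size, before.st_mtime_ns) == (
        after.st_size, after.st_mtime_ns)
    fingerprint = (before.st_size, before.st_mtime_ns) if stable else None
    return data, sha1, fingerprint
```

This does not close every race: a file rewritten within the timestamp
granularity of the filesystem can still slip through, so the stored
fingerprint should be treated as a hint.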

Dirstate inventories may be most easily updated in a single operation at
the end; however it may be best to accumulate data as we proceed through
the tree rather than revisiting it at the end.

Showing a progress bar for commit may not be necessary if we report files
as they are committed.  Alternatively we could transiently show a progress
bar for each directory that's scanned, even if no changes are observed.

Commit needs to collect a list of added/changed/removed files, each of which
must have its text stored (if any) and containing directory updated.  This
can be done by calling Tree._iter_changes on the source tree, asking for
changes 

In the 0.17 model the commit operation needs to know the per-file parents
and per-file last-changed revision.  

XXX: If we want to retain explicitly stored per-file graphs, it would seem
that we do need to record per-file parents.  We have not yet settled
whether we want to remove them or treat them as a cache.  This api
stack is still ok whether we do or not, but the internals of it may
change.

(In this and other operations we must avoid having multiple layers walk
over the tree separately.  For example, it is no good to have the Command
layer walk the tree to generate a list of all file ids to commit, because
the tree will also be walked later.  The layers that do need to operate
per-file should probably be bound together in a per-dirblock iterator,
rather than each iterating independently.)
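
A small sketch of what binding the layers to one walk could mean: the
tree is traversed a single time, directory by directory, and each file
is handed to every interested layer in turn.  The use of ``os.walk``
and the names ``iter_dirblocks`` and ``walk_once`` are assumptions for
the example and do not correspond to the dirstate's own dirblock
structures.

```python
import os


def iter_dirblocks(root):
    """Yield (relative_dir, sorted_file_names) once per directory."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make traversal order deterministic
        yield os.path.relpath(dirpath, root), sorted(filenames)


def walk_once(root, consumers):
    """Feed every file to every consumer during a single tree walk."""
    for reldir, names in iter_dirblocks(root):
        for name in names:
            relpath = os.path.normpath(os.path.join(reldir, name))
            for consume in consumers:
                consume(relpath)
```

For instance, a hashing layer and a reporting layer can both be passed
as consumers, and neither needs its own pass over the tree.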

2513.1.1 by Martin Pool: (in progress) analysis of commit