/brz/remove-bazaar : contents of doc/developers/performance-commit.txt at revision 2535.2.1

: (revision 2535.2.1)

To get this branch, use:

bzr branch
http://gegoxaren.bato24.eu/bzr/brz/remove-bazaar

Commit Performance Notes
========================

.. contents::

Commit: The Minimum Work Required
---------------------------------

This is Martin Pool's email to the mailing list on the minimum work that
commit needs to do. Be sure to check the mailing list archives 
(https://lists.ubuntu.com/archives/bazaar/2007q2/025791.html)
for follow-up comments from Robert and others ...

Here is a description of the minimum work that commit must do.  We
want to make sure that our design doesn't cost too much more than this
minimum.  I am trying to do this without making too many assumptions
about the underlying storage, but am assuming that the ui and basic
architecture (wt, branch, repo) stays about the same.

The basic purpose of commit is to:

1. create and store a new revision based on the contents of the working tree
2. make this the new basis revision for the working tree

We can do a selected commit of only some files or subtrees.

The best performance we could hope for is:
- stat each versioned selected working file once
- read from the workingtree and write into the repository any new file texts
- in general, do work proportional to the size of the shape (eg
inventory) of the old and new selected trees, and to the total size of
the modified files

In more detail:

1.0 - Store new file texts: if a versioned file contains a new text
there is no avoiding storing it.  To determine which ones have changed
we must go over the workingtree and at least stat each file.  If the
file is modified since it was last hashed, it must be read in.
Ideally we would read it only once, and either notice that it has not
changed, or store it at that point.

On the other hand we want new code to be able to handle files that are
larger than will fit in memory.  We may then need to read each file up
to two times: once to determine if there is a new text and calculate
its hash, and again to store it.

1.1 - Store a tree-shape description (ie inventory or similar.)  This
describes the non-file objects, and provides a reference from the
Revision to the texts within it.

1.2 - Generate and store a new revision object.

1.3 - Do delta-compression on the stored objects.  (git notably does
not do this at commit time, deferring this entirely until later.)
This requires finding the appropriate basis for each modified file: in
the current scheme we get the file id, last-revision from the
dirstate, look into the knit for that text, extract that text in
total, generate a delta, then store that into the knit.  Most delta
operations are O(n2) to O(n3) in the size of the modified files.

1.4 - Cache annotation information for the changes: at the moment this
is done as part of the delta storage.  There are some flaws in that
approach, such as that it is not updated when ghosts are filled, and
the annotation can't be re-run with new diff parameters.

2.1 - Make the new revision the basis for the tree, and clear the list
of parents.  Strictly this is all that's logically necessary, unless
the working tree format requires more work.

The dirstate format does require more work, because it caches the
parent tree data for each file within the working tree data.  In
practice this means that every commit rewrites the entire dirstate
file - we could try to avoid rewriting the whole file but this may be
difficult because variable-length data (the last-changed revision id)
is inserted into many rows.

The current dirstate design then seems to mean that any commit of a
single file imposes a cost proportional to the size of the current
workingtree.  Maybe there are other benefits that outweigh this.
Alternatively if it was fast enough for operations to always look at
the original storage of the parent trees we could do without the
cache.

2.2 - Record the observed file hashes into the workingtree control
files.  For the files that we just committed, we have the information
to store a valid hash cache entry: we know their stat information and
the sha1 of the file contents.  This is not strictly necessary to the
speed of commit, but it will be useful later in avoiding reading those
files, and the only cost of doing it now is writing it out.

In fact there are some user interface niceties that complicate this:

3 - Before starting the commit proper, we prompt for a commit message
and in that commit message editor we show a list of the files that
will be committed: basically the output of bzr status.  This is
basically the same as the list of changes we detect while storing the
commit, but because the user will sometimes change the tree after
opening the commit editor and expect the final state to be committed I
think we do have to look for changes twice.  Since it takes the user a
while to enter a message this is not a big problem as long as both the
status summary and the commit are individually fast.

4 - As the commit proceeds (or after?) we show another status-like
summary.  Just printing the names of modified files as they're stored
would be easy.  Recording deleted and renamed files or directories is
more work: this can only be done by reference to the primary parent
tree and requires it be read in.  Worse, reporting renames requires
searching by id across the entire parent tree.   Possibly full
reporting should be a default-off verbose option because it does
require more work beyond the commit itself.

5 - Bazaar currently allows for missing files to be automatically
marked as removed at the time of commit.  Leaving aside the ui
consequences, this means that we have to update the working inventory
to mark these files as removed.  Since as discussed above we always
have to rewrite the dirstate on commit this is not substantial, though
we should make sure we do this in one pass, not two.  I have
previously proposed to make this behaviour a non-default option.

We may need to run hooks or generate signatures during commit, but
they don't seem to have substantial performance consequences.

If one wanted to optimize solely for the speed of commit I think
hash-addressed  file-per-text storage like in git (or bzr 0.1) is very
good.  Remarkably, it does not need to read the inventory for the
previous revision.  For each versioned file, we just need to get its
hash, either by reading the file or validating its stat data.  If that
hash is not already in the repository, the file is just copied in and
compressed.  As directories are traversed, they're turned into texts
and stored as well, and then finally the revision is too.  This does
depend on later doing some delta compression of these texts.

Variations on this are possible.  Rather than writing a single file
into the repository for each text, we could fold them into a single
collation or pack file.  That would create a smaller number of files
in the repository, but looking up a single text would require looking
into their indexes rather than just asking the filesystem.

Rather than using hashes we can use file-id/rev-id pairs as at
present, which has several consequences pro and con.


Commit vs Status
----------------

At first glance, commit simply stores the changes status reports. In fact,
this isn't technically correct: commit considers some files modified that
status does not. The notes below were put together by John Arbash Meinel
and Aaron Bentley in May 2007 to explain the finer details of commit to
Ian Clatworthy. They are recorded here as they are likely to be useful to
others new to Bazaar ...

1) **Unknown files have a different effect.** With --no-strict (the default)
   they have no effect and can be completely ignored. With --strict they
   should cause the commit to abort (so you don't forget to add the two new
   test files that you just created).

2) **Multiple parents.** 'status' always compares 2 trees, typically the
   last-committed tree and the current working tree. 'commit' will compare
   more trees if there has been a merge.

  a) The "last modified" property for files.
     A file may be marked as changed since the last commit, but that
     change may have come in from the merge, and the change could have
     happened several commits back. There are several edge cases to be
     handled here, like if both branches modified the same file, or if
     just one branch modified it.
     
  b) The trickier case is when a file appears unmodified since last
     commit, but it was modified versus one of the merged branches. I
     believe there are a few ways this can happen, like if a merged
     branch changes a file and then reverts it back (you still update
     the 'last modified' field).
     In general, if both sides disagree on the 'last-modified' flag,
     then you need to generate a new entry pointing 'last-modified' at
     this revision (because you are resolving the differences between
     the 2 parents).

3) **Automatic deletion of 'missing' files.** This is a point that we go
   back and forth on. I think the basic idea is that 'bzr commit' by
   default should abort if it finds a 'missing' file (in case that file was
   renamed rather than deleted), but 'bzr commit --auto' can add unknown
   files and remove missing files automatically.

4) **sha1 for newly added files.** status doesn't really need this: it should
   only care that the file is not present in base, but is present now. In
   some ways commit doesn't care either, since it needs to read and sha the
   file itself anyway.

5) **Nested trees.** status doesn't recurse into nested trees, but commit does.
   This is just because not all of the nested-trees work has been merged yet.

   A tree-reference is considered modified if the subtree has been
   committed since the last containing-tree commit.  But commit needs to
   recurse into every subtree, to ensure that a commit is done if the
   subtree has changed since its last commit.  _iter_changes only reports
   on tree-references that are modified, so it can't be used for doing
   subtree commits.


Avoiding Work: Smarter Change Detection
---------------------------------------

Commit currently walks through every file building an inventory. Here is
Aaron's brain dump on a better way ...

_iter_changes won't tell us about tree references that haven't changed,
even if those subtrees have changed.  (Unless we ask for unchanged
files, which we don't want to do, of course.)

There is an iter_references method, but using it looks just as expensive
as calling kind().

I did some work on updating commit to use iter_changes, but found for
multi-parent trees, I had to fall back to the slow inventory comparison
approach.

Really, I think we need a call akin to iter_changes that handles
multiple parents, and knows to emit entries when InventoryEntry.revision
is all that's changed.


Avoiding Work: Better Layering
------------------------------

For each file, commit is currently doing more work than it should. Here is
John's take on a better way ...

Note that "_iter_changes" *does* have to touch every path on disk, but
it just can do it in a more efficient manner. (It doesn't have to create
an InventoryEntry for all the ones that haven't changed).

I agree with Aaron that we need something a little different than
_iter_changes. Both because of handling multiple parents, as well as we
don't want it to actually read the files if we have a stat-cache miss.

Specifically, the commit code *has* to read the files because it is
going to add the text to the repository, and we want it to compute the
sha1 at *that* time, so we are guaranteed to have the valid sha (rather
than just whatever the last cached one was). So we want the code to
return 'None' if it doesn't have an up-to-date sha1, rather than reading
the file and computing it, just before it returns it to the parent.

The commit code (0.16) should really be restructured. It's layering is
pretty wrong.

Specifically, calling "kind()" requires a stat of the file. But we have
to do a stat to get the size/whether the record is up-to-date, etc. So
we really need to have a "create_an_up_to_date_inventory()" function.
But because we are accessing every object on disk, we want to be working
in tuples rather than Inventory objects. And because DirState already
has the parent records next to the current working inventory, it can do
all the work to do really fast comparison and throw-away of unimportant
records.

The way I made "bzr status" fast is by moving the 'ignore this record'
ability as deep into the stack as I could get. Status has the property
that you don't care about most of the records, just like commit. So the
sooner you can stop evaluating the 99% that you don't care about, the
less work you do.


2511.1.4 by Ian Clatworthy updated NEWS and added commit performance notes to doc/developers	1	Commit Performance Notes
	2	========================
	3
	4	.. contents::
	5
	6	Commit: The Minimum Work Required
	7	---------------------------------
	8
	9	This is Martin Pool's email to the mailing list on the minimum work that
	10	commit needs to do. Be sure to check the mailing list archives
	11	(https://lists.ubuntu.com/archives/bazaar/2007q2/025791.html)
	12	for follow-up comments from Robert and others ...
	13
	14	Here is a description of the minimum work that commit must do. We
	15	want to make sure that our design doesn't cost too much more than this
	16	minimum. I am trying to do this without making too many assumptions
	17	about the underlying storage, but am assuming that the ui and basic
	18	architecture (wt, branch, repo) stays about the same.
	19
	20	The basic purpose of commit is to:
	21
	22	1. create and store a new revision based on the contents of the working tree
	23	2. make this the new basis revision for the working tree
	24
	25	We can do a selected commit of only some files or subtrees.
	26
	27	The best performance we could hope for is:
	28	- stat each versioned selected working file once
	29	- read from the workingtree and write into the repository any new file texts
	30	- in general, do work proportional to the size of the shape (eg
	31	inventory) of the old and new selected trees, and to the total size of
	32	the modified files
	33
	34	In more detail:
	35
	36	1.0 - Store new file texts: if a versioned file contains a new text
	37	there is no avoiding storing it. To determine which ones have changed
	38	we must go over the workingtree and at least stat each file. If the
	39	file is modified since it was last hashed, it must be read in.
	40	Ideally we would read it only once, and either notice that it has not
	41	changed, or store it at that point.
	42
	43	On the other hand we want new code to be able to handle files that are
	44	larger than will fit in memory. We may then need to read each file up
	45	to two times: once to determine if there is a new text and calculate
	46	its hash, and again to store it.
	47
	48	1.1 - Store a tree-shape description (ie inventory or similar.) This
	49	describes the non-file objects, and provides a reference from the
	50	Revision to the texts within it.
	51
	52	1.2 - Generate and store a new revision object.
	53
	54	1.3 - Do delta-compression on the stored objects. (git notably does
	55	not do this at commit time, deferring this entirely until later.)
	56	This requires finding the appropriate basis for each modified file: in
	57	the current scheme we get the file id, last-revision from the
	58	dirstate, look into the knit for that text, extract that text in
	59	total, generate a delta, then store that into the knit. Most delta
	60	operations are O(n2) to O(n3) in the size of the modified files.
	61
	62	1.4 - Cache annotation information for the changes: at the moment this
	63	is done as part of the delta storage. There are some flaws in that
	64	approach, such as that it is not updated when ghosts are filled, and
65	the annotation can't be re-run with new diff parameters.
66
67	2.1 - Make the new revision the basis for the tree, and clear the list
68	of parents. Strictly this is all that's logically necessary, unless
69	the working tree format requires more work.
70
71	The dirstate format does require more work, because it caches the
72	parent tree data for each file within the working tree data. In
73	practice this means that every commit rewrites the entire dirstate
74	file - we could try to avoid rewriting the whole file but this may be
75	difficult because variable-length data (the last-changed revision id)
76	is inserted into many rows.
77
78	The current dirstate design then seems to mean that any commit of a
79	single file imposes a cost proportional to the size of the current
80	workingtree. Maybe there are other benefits that outweigh this.
81	Alternatively if it was fast enough for operations to always look at
82	the original storage of the parent trees we could do without the
83	cache.
84
85	2.2 - Record the observed file hashes into the workingtree control
86	files. For the files that we just committed, we have the information
87	to store a valid hash cache entry: we know their stat information and
88	the sha1 of the file contents. This is not strictly necessary to the
89	speed of commit, but it will be useful later in avoiding reading those
90	files, and the only cost of doing it now is writing it out.
91
92	In fact there are some user interface niceties that complicate this:
93
94	3 - Before starting the commit proper, we prompt for a commit message
95	and in that commit message editor we show a list of the files that
96	will be committed: basically the output of bzr status. This is
97	basically the same as the list of changes we detect while storing the
98	commit, but because the user will sometimes change the tree after
99	opening the commit editor and expect the final state to be committed I
100	think we do have to look for changes twice. Since it takes the user a
101	while to enter a message this is not a big problem as long as both the
102	status summary and the commit are individually fast.
103
104	4 - As the commit proceeds (or after?) we show another status-like
105	summary. Just printing the names of modified files as they're stored
106	would be easy. Recording deleted and renamed files or directories is
107	more work: this can only be done by reference to the primary parent
108	tree and requires it be read in. Worse, reporting renames requires
109	searching by id across the entire parent tree. Possibly full
110	reporting should be a default-off verbose option because it does
111	require more work beyond the commit itself.
112
113	5 - Bazaar currently allows for missing files to be automatically
114	marked as removed at the time of commit. Leaving aside the ui
115	consequences, this means that we have to update the working inventory
116	to mark these files as removed. Since as discussed above we always
117	have to rewrite the dirstate on commit this is not substantial, though
118	we should make sure we do this in one pass, not two. I have
119	previously proposed to make this behaviour a non-default option.
120
121	We may need to run hooks or generate signatures during commit, but
122	they don't seem to have substantial performance consequences.
123
124	If one wanted to optimize solely for the speed of commit I think
125	hash-addressed file-per-text storage like in git (or bzr 0.1) is very
126	good. Remarkably, it does not need to read the inventory for the
127	previous revision. For each versioned file, we just need to get its
128	hash, either by reading the file or validating its stat data. If that
129	hash is not already in the repository, the file is just copied in and
130	compressed. As directories are traversed, they're turned into texts
131	and stored as well, and then finally the revision is too. This does
132	depend on later doing some delta compression of these texts.
133
134	Variations on this are possible. Rather than writing a single file
135	into the repository for each text, we could fold them into a single
136	collation or pack file. That would create a smaller number of files
137	in the repository, but looking up a single text would require looking
138	into their indexes rather than just asking the filesystem.
139
140	Rather than using hashes we can use file-id/rev-id pairs as at
141	present, which has several consequences pro and con.
142
143
144	Commit vs Status
145	----------------
146
147	At first glance, commit simply stores the changes status reports. In fact,
148	this isn't technically correct: commit considers some files modified that
149	status does not. The notes below were put together by John Arbash Meinel
150	and Aaron Bentley in May 2007 to explain the finer details of commit to
151	Ian Clatworthy. They are recorded here as they are likely to be useful to
152	others new to Bazaar ...
153
154	1) Unknown files have a different effect. With --no-strict (the default)
155	they have no effect and can be completely ignored. With --strict they
156	should cause the commit to abort (so you don't forget to add the two new
157	test files that you just created).
158
159	2) Multiple parents. 'status' always compares 2 trees, typically the
160	last-committed tree and the current working tree. 'commit' will compare
161	more trees if there has been a merge.
162
163	a) The "last modified" property for files.
164	A file may be marked as changed since the last commit, but that
165	change may have come in from the merge, and the change could have
166	happened several commits back. There are several edge cases to be
167	handled here, like if both branches modified the same file, or if
168	just one branch modified it.
169
170	b) The trickier case is when a file appears unmodified since last
171	commit, but it was modified versus one of the merged branches. I
172	believe there are a few ways this can happen, like if a merged
173	branch changes a file and then reverts it back (you still update
174	the 'last modified' field).
175	In general, if both sides disagree on the 'last-modified' flag,
176	then you need to generate a new entry pointing 'last-modified' at
177	this revision (because you are resolving the differences between
178	the 2 parents).
179
180	3) Automatic deletion of 'missing' files. This is a point that we go
181	back and forth on. I think the basic idea is that 'bzr commit' by
182	default should abort if it finds a 'missing' file (in case that file was
183	renamed rather than deleted), but 'bzr commit --auto' can add unknown
184	files and remove missing files automatically.
185
186	4) sha1 for newly added files. status doesn't really need this: it should
187	only care that the file is not present in base, but is present now. In
188	some ways commit doesn't care either, since it needs to read and sha the
189	file itself anyway.
190
191	5) Nested trees. status doesn't recurse into nested trees, but commit does.
192	This is just because not all of the nested-trees work has been merged yet.
193
194	A tree-reference is considered modified if the subtree has been
195	committed since the last containing-tree commit. But commit needs to
196	recurse into every subtree, to ensure that a commit is done if the
197	subtree has changed since its last commit. _iter_changes only reports
198	on tree-references that are modified, so it can't be used for doing
199	subtree commits.
200
201
202	Avoiding Work: Smarter Change Detection
203	---------------------------------------
204
205	Commit currently walks through every file building an inventory. Here is
206	Aaron's brain dump on a better way ...
207
208	_iter_changes won't tell us about tree references that haven't changed,
209	even if those subtrees have changed. (Unless we ask for unchanged
210	files, which we don't want to do, of course.)
211
212	There is an iter_references method, but using it looks just as expensive
213	as calling kind().
214
215	I did some work on updating commit to use iter_changes, but found for
216	multi-parent trees, I had to fall back to the slow inventory comparison
217	approach.
218
219	Really, I think we need a call akin to iter_changes that handles
220	multiple parents, and knows to emit entries when InventoryEntry.revision
221	is all that's changed.
222
223
224	Avoiding Work: Better Layering
225	------------------------------
226
227	For each file, commit is currently doing more work than it should. Here is
228	John's take on a better way ...
229
230	Note that "_iter_changes" does have to touch every path on disk, but
231	it just can do it in a more efficient manner. (It doesn't have to create
232	an InventoryEntry for all the ones that haven't changed).
233
234	I agree with Aaron that we need something a little different than
235	_iter_changes. Both because of handling multiple parents, as well as we
236	don't want it to actually read the files if we have a stat-cache miss.
237
238	Specifically, the commit code has to read the files because it is
239	going to add the text to the repository, and we want it to compute the
240	sha1 at that time, so we are guaranteed to have the valid sha (rather
241	than just whatever the last cached one was). So we want the code to
242	return 'None' if it doesn't have an up-to-date sha1, rather than reading
243	the file and computing it, just before it returns it to the parent.
244
245	The commit code (0.16) should really be restructured. It's layering is
246	pretty wrong.
247
248	Specifically, calling "kind()" requires a stat of the file. But we have
249	to do a stat to get the size/whether the record is up-to-date, etc. So
250	we really need to have a "create_an_up_to_date_inventory()" function.
251	But because we are accessing every object on disk, we want to be working
252	in tuples rather than Inventory objects. And because DirState already
253	has the parent records next to the current working inventory, it can do
254	all the work to do really fast comparison and throw-away of unimportant
255	records.
256
257	The way I made "bzr status" fast is by moving the 'ignore this record'
258	ability as deep into the stack as I could get. Status has the property
259	that you don't care about most of the records, just like commit. So the
260	sooner you can stop evaluating the 99% that you don't care about, the
261	less work you do.
262