5
The unit for compressed storage in bzr is a *revfile*, whose design
6
was suggested by Matt Mackall.
12
Compressed storage is a tradeoff between several goals:
14
* Reasonably compact storage of long histories.
16
* Robustness and simplicity.
18
* Fast extraction of versions and addition of new versions (preferably
19
without rewriting the whole file, or reading the whole history.)
21
* Fast and precise annotations.
23
* Storage of files of at least a few hundred MB.
29
revfiles store the history of a single logical file, which is
30
identified in bzr by its file-id. In this sense they are similar to
31
an RCS or CVS ``,v`` file or an SCCS sfile.
33
Each state of the file is called a *text*.
35
Renaming, adding and deleting this file is handled at a higher level
36
by the inventory system, and is outside the scope of the revfile. The
37
revfile name is typically based on the file id which is itself
38
typically based on the name the file had when it was first added. But
39
this is purely cosmetic.
41
For example a file now called ``frob.c`` may have the id
42
``frobber.c-12873`` because it was originally called
43
``frobber.c``. Its texts are kept in the revfile
44
``.bzr/revfiles/frobber.c-12873.revs``.
46
When the file is deleted from the inventory the revfile does not
47
change. It's just not used in reproducing trees from that point
50
The revfile does not record the date when the text was added, a commit
51
message, properties, or any other metadata. That is handled in the
52
higher-level revision history.
54
Inventories and other metadata files that vary from one version to the
55
next can themselves be stored in revfiles.
57
revfiles store files as simple byte streams, with no consideration of
58
translating character sets, line endings, or keywords. Those are also
59
handled at a higher level. However, the revfile may make use of
60
knowledge that a file is line-based in generating a diff.
62
(The Python builtin difflib is too slow when generating a purely
63
byte-by-byte delta so we always make a line-by-line diff; when this
64
is fixed it may be feasible to use line-by-line diffs for all
67
Files whose text does not change from one revision to the next are
68
stored as just a single text in the revfile. This can happen even if
69
the file was renamed or other properties were changed in the
72
The revfile is held on disk as two files: an *index* and a *data*
73
file. The index file is short and always read completely into memory;
74
the data file is much longer and only the relevant bits of it,
75
identified by the index file, need to be read.
77
In previous versions, the index file identified texts by their
78
SHA-1 digest. This was unsatisfying for two reasons. Firstly it
79
assumes that SHA-1 will not collide, which is not an assumption we
80
wish to make in long-lived files. Secondly for annotations we need
81
to be able to map from file versions back to a revision.
83
Texts are identified by the name of the revfile and a UUID
84
corresponding to the first revision in which they were first
85
introduced. This means that given a text we can identify which
86
revision it belongs to, and annotations can use the index within the
87
revfile to identify where a region was first introduced.
89
We cannot identify texts by the integer revision number, because
90
that would limit us to only referring to a file in a particular
93
I'd like to just use the revision-id, but those are variable-length
94
strings, and I'd like the revfile index to be fixed-length and
95
relatively short. UUIDs can be encoded in binary as only 16 bytes.
96
Perhaps we should just use UUIDs for revisions and be done?
98
This is meant to scale to hold 100,000 revisions of a single file, by
99
which time the index file will be ~4.8MB and a bit big to read
102
Some of the reserved fields could be used to implement a (semi?)
103
balanced tree indexed by SHA1 so we can much more efficiently find the
104
index associated with a particular hash. For 100,000 revs we would be
105
able to find it in about 17 random reads, which is not too bad.
107
This performs pretty well except when trying to calculate deltas of
108
really large files. For that the main thing would be to plug in
109
something faster than difflib, which is after all pure Python.
110
Another approach is to just store the gzipped full text of big files,
111
though perhaps that's too perverse?
119
Because the basis of a delta does not need to be the text's logical
120
predecessor, we can adjust the deltas to avoid ever needing to apply
121
too many deltas to reproduce a particular file.
127
Annotations indicate which revision of a file first inserted a line
128
(or region of bytes).
130
Given a string, we can write annotations on it like so: a sequence of
131
*(index, length)* pairs, giving the *index* of the revision which
132
introduced the next run of *length* bytes. The sum of the lengths
133
must equal the length of the string. For text files the regions will
134
typically fall on line breaks. This can be transformed in memory to
135
other structures, such as a list of *(index, content)* pairs.
137
When a line was inserted from a merge revision then the annotation for
138
that line should still be the source in the merged branch, rather than
139
just being the revision in which the merge took place.
141
They can cheaply be calculated when inserting a new text, but are
142
expensive to calculate after the fact because that requires searching
143
back through all previous text and all texts which were merged in. It
144
therefore seems sensible to calculate them once and store them.
146
To do this we need two operators which update an existing annotated
149
A. Given an annotated file and a working text, update the annotation to
150
mark regions inserted in the working file as new in this revision.
152
B. Given two annotated files, merge them to produce an annotated
153
result. When there are conflicts, both texts should be included
156
These may be repeated: after a merge there may be another merge, or
157
there may be manual fixups or conflict resolutions.
159
So what we require is given a diff or a diff3 between two files, map
160
the regions of bytes changed into corresponding updates to the origin
167
* revfiles use unsigned 32-bit integers both in diffs and the index.
168
This should be more than enough for any reasonable source file but
169
perhaps not enough for large binaries that are frequently committed.
171
Perhaps for those files there should be an option to continue to use
172
the text-store. There is unlikely to be any benefit in holding
173
deltas between them, and deltas will anyhow be hard to calculate.
175
* The append-only design does not allow for destroying committed data,
176
as when confidential information is accidentally added. That could
177
be fixed by creating the fixed repository as a separate branch, into
178
which only the preserved revisions are exported.
180
* Should annotations also indicate where text was deleted?
182
* This design calls for only one annotation per line, which seems
183
standard. However, this is lacking in at least two cases:
185
- Lines which originate in the same way in more than one revision,
186
through being independently introduced. In this case we would
187
apparently have to make an arbitrary choice; I suppose branches
188
could prefer to assume lines originated in their own history.
190
- It might be useful to directly indicate which mergers included
191
which lines. We do have that information in the revision history
192
though, so there seems no need to store it for every line.