/brz/remove-bazaar : revision 327

To get this branch, use:

bzr branch http://gegoxaren.bato24.eu/bzr/brz/remove-bazaar


Viewing changes to doc/revfile.txt

Committer: Martin Pool
Date: 2005-05-03 01:43:08 UTC
Revision ID: mbp@sourcefrog.net-20050503014308-965efd065589fcb8

todo

files added:
.bzrignore

.rsyncexclude

NEWS

README

TODO

build-api

bzrlib

bzrlib/__init__.py

bzrlib/add.py

bzrlib/branch.py

bzrlib/check.py

bzrlib/commands.py

bzrlib/diff.py

bzrlib/errors.py

bzrlib/info.py

bzrlib/inventory.py

bzrlib/mdiff.py

bzrlib/newinventory.py

bzrlib/osutils.py

bzrlib/remotebranch.py

bzrlib/revfile.py

bzrlib/revision.py

bzrlib/store.py

bzrlib/tests.py

bzrlib/textinv.py

bzrlib/textui.py

bzrlib/trace.py

bzrlib/tree.py

bzrlib/xml.py

doc/Makefile

doc/adoption.txt

doc/bitkeeper.txt

doc/changelogs.txt

doc/cherry-picking.txt

doc/cmdref.txt

doc/common-format.txt

doc/compared-aegis.txt

doc/compared-codeville.txt

doc/compared-cvsnt.txt

doc/compared-opencm.txt

doc/compared-prcs.txt

doc/compared-teamware.txt

doc/compression.txt

doc/config-specs.txt

doc/conflicts.txt

doc/costs.txt

doc/darcs.txt

doc/deadly-sins.txt

doc/default.css

doc/design.txt

doc/extra-commands.txt

doc/faq.txt

doc/formats.txt

doc/hashes.txt

doc/ignore.txt

doc/index.txt

doc/interrupted.txt

doc/intro.txt

doc/inventory.txt

doc/join-branches.txt

doc/kill-version.txt

doc/layers.txt

doc/library-interface.txt

doc/merge.txt

doc/mirroring.txt

doc/monotone.txt

doc/news.txt

doc/optional-edit.txt

doc/partial-commit.txt

doc/pool.txt

doc/purpose.txt

doc/python.txt

doc/quickref.txt

doc/quilt.txt

doc/quotes.txt

doc/random.txt

doc/requirements.txt

doc/revfile.txt

doc/revision-syntax.txt

doc/rollup.txt

doc/scalability.txt

doc/security.txt

doc/shared-branches.txt

doc/short-demo.txt

doc/supportability.txt

doc/svk.txt

doc/tagging.txt

doc/taxonomy.txt

doc/thanks.txt

doc/todo-from-arch.txt

doc/unchanged.txt

doc/unrelated-merge.txt

doc/usability.txt

doc/use-cases.txt

doc/web-interface.txt

doc/workflow.txt

doc/yaml.txt

elementtree

elementtree/ElementTree.py

elementtree/__init__.py

notes

notes/new-inventory-sample.xml

notes/performance.txt

setup.py

test.sh

testbzr

urlgrabber

urlgrabber/__init__.py

urlgrabber/byterange.py

urlgrabber/grabber.py

urlgrabber/keepalive.py

urlgrabber/mirror.py

urlgrabber/progress.py

files removed:
.bzrignore

COPYING

INSTALL

Makefile

README

TODO

__init__.py

branch.py

bzr-receive-pack

bzr-upload-pack

commands.py

converter.py

dir.py

errors.py

fetch.py

foreign

foreign/.bzrignore

foreign/TODO

foreign/__init__.py

foreign/test_versionedfiles.py

foreign/upgrade.py

foreign/versionedfiles.py

mapping.py

notes

notes/roundtripping.txt

remote.py

repository.py

revspec.py

server.py

setup.py

shamap.py

tests

tests/__init__.py

tests/test_blackbox.py

tests/test_branch.py

tests/test_builder.py

tests/test_dir.py

tests/test_fetch.py

tests/test_ids.py

tests/test_repository.py

versionedfiles.py

workingtree.py


doc/revfile.txt

********
Revfiles
********

The unit for compressed storage in bzr is a *revfile*, whose design
was suggested by Matt Mackall.

Requirements
============

Compressed storage is a tradeoff between several goals:

* Reasonably compact storage of long histories.

* Robustness and simplicity.

* Fast extraction of versions and addition of new versions (preferably
  without rewriting the whole file, or reading the whole history.)

* Fast and precise annotations.

* Storage of files of at least a few hundred MB.

Design
======

Revfiles store the history of a single logical file, which is
identified in bzr by its file-id. In this sense they are similar to
an RCS or CVS ``,v`` file or an SCCS sfile.

Each state of the file is called a *text*.

Renaming, adding and deleting this file is handled at a higher level
by the inventory system, and is outside the scope of the revfile. The
revfile name is typically based on the file id, which is itself
typically based on the name the file had when it was first added, but
this is purely cosmetic.

For example, a file now called ``frob.c`` may have the id
``frobber.c-12873`` because it was originally called
``frobber.c``. Its texts are kept in the revfile
``.bzr/revfiles/frobber.c-12873.revs``.

When the file is deleted from the inventory the revfile does not
change. It's just not used in reproducing trees from that point
onwards.

The revfile does not record the date when the text was added, a commit
message, properties, or any other metadata. That is handled in the
higher-level revision history.

Inventories and other metadata files that vary from one version to the
next can themselves be stored in revfiles.

Revfiles store files as simple byte streams, with no consideration of
translating character sets, line endings, or keywords. Those are also
handled at a higher level. However, the revfile may make use of
knowledge that a file is line-based in generating a diff.

(Python's standard-library difflib is too slow when generating a purely
byte-by-byte delta, so we always make a line-by-line diff; when this
is fixed it may be feasible to use byte-by-byte diffs for all
files.)

Files whose text does not change from one revision to the next are
stored as just a single text in the revfile. This can happen even if
the file was renamed or other properties were changed in the
inventory.

The revfile is held on disk as two files: an *index* and a *data*
file. The index file is short and always read completely into memory;
the data file is much longer and only the relevant bits of it,
identified by the index file, need to be read.
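
The split between the two files can be sketched roughly as follows. The
record layout used here (``RECORD``, a 16-byte text id followed by four
unsigned 32-bit fields and 16 reserved bytes) is a hypothetical
illustration and not the actual on-disk format; the helper names
``read_index`` and ``read_stored_bytes`` are likewise made up for this
example. The point is only that the index is loaded whole, while the data
file is read with a seek to one entry.

```python
import io
import struct

# Hypothetical fixed-size index record (NOT the real revfile format):
# 16-byte text id, then base index, flags, data offset and data length
# as unsigned 32-bit big-endian integers, then 16 reserved bytes.
RECORD = struct.Struct(">16sIIII16x")


def read_index(index_file):
    """Read the whole index into memory as a list of tuples."""
    raw = index_file.read()
    return [RECORD.unpack_from(raw, pos)
            for pos in range(0, len(raw), RECORD.size)]


def read_stored_bytes(index, data_file, i):
    """Fetch only the bytes belonging to entry i from the data file."""
    _text_id, _base, _flags, offset, length = index[i]
    data_file.seek(offset)
    return data_file.read(length)


# Build a tiny in-memory "revfile" with two entries, then read one back.
chunks = [b"hello\n", b"hello world\n"]
index_buf, data_buf = io.BytesIO(), io.BytesIO()
for n, chunk in enumerate(chunks):
    index_buf.write(RECORD.pack(bytes([n]) * 16, 0xFFFFFFFF, 0,
                                data_buf.tell(), len(chunk)))
    data_buf.write(chunk)

index_buf.seek(0)
index = read_index(index_buf)
second = read_stored_bytes(index, data_buf, 1)
```

Because every record has the same size, entry *i* sits at a fixed position
in the index and new entries can simply be appended.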

In previous versions, the index file identified texts by their
SHA-1 digest. This was unsatisfying for two reasons. Firstly, it
assumes that SHA-1 will not collide, which is not an assumption we
wish to make in long-lived files. Secondly, for annotations we need
to be able to map from file versions back to a revision.

Texts are identified by the name of the revfile and a UUID
corresponding to the revision in which they were first
introduced. This means that given a text we can identify which
revision it belongs to, and annotations can use the index within the
revfile to identify where a region was first introduced.

We cannot identify texts by the integer revision number, because
that would limit us to only referring to a file in a particular
branch.

I'd like to just use the revision-id, but those are variable-length
strings, and I'd like the revfile index to be fixed-length and
relatively short. UUIDs can be encoded in binary as only 16 bytes.
Perhaps we should just use UUIDs for revisions and be done?
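
The size argument can be checked with Python's standard ``uuid`` module.
This is only an illustration of the encoding; using a random (version 4)
UUID for a revision is an assumption made for the example, not something
the design has settled on.

```python
import uuid

# A random UUID standing in for a revision identifier (assumption).
revision_uuid = uuid.uuid4()

# Fixed-length binary form, suitable for a fixed-size index record.
packed = revision_uuid.bytes

# The binary form round-trips back to the same UUID.
restored = uuid.UUID(bytes=packed)

# For comparison, the usual hyphenated text form is much longer.
text_form = str(revision_uuid)
```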

This is meant to scale to hold 100,000 revisions of a single file, by
which time the index file will be ~4.8MB and a bit big to read
sequentially.

Some of the reserved fields could be used to implement a (semi?)
balanced tree indexed by SHA1 so we can much more efficiently find the
index associated with a particular hash. For 100,000 revs we would be
able to find it in about 17 random reads, which is not too bad.
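
The figures quoted above can be reproduced with a little arithmetic. The
sketch below assumes "~4.8MB" means 4,800,000 bytes and that the tree is a
perfectly balanced binary tree; both are simplifying assumptions.

```python
import math

revisions = 100_000
index_bytes = 4_800_000  # "~4.8MB", taken as decimal megabytes

# Implied size of one fixed-length index entry.
bytes_per_entry = index_bytes / revisions

# Depth of a balanced binary tree over that many entries, which bounds
# the number of random reads needed to locate one hash.
tree_depth = math.ceil(math.log2(revisions))
```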

This performs pretty well except when trying to calculate deltas of
really large files. For that the main thing would be to plug in
something faster than difflib, which is after all pure Python.
Another approach is to just store the gzipped full text of big files,
though perhaps that's too perverse?

Skip-deltas
-----------

Because the basis of a delta does not need to be the text's logical
predecessor, we can adjust the deltas to avoid ever needing to apply
too many deltas to reproduce a particular file.
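
The text does not say how the basis should be chosen. One possible rule,
shown here purely as an assumption, is to delta text *i* against the text
whose index is *i* with its lowest set bit cleared, storing text 0 in
full. The function names ``skip_delta_basis`` and ``delta_chain`` are
hypothetical.

```python
def skip_delta_basis(i):
    """Basis for text i: i with its lowest set bit cleared.

    Text 0 has no basis and is assumed to be stored as a full text.
    """
    return i & (i - 1)


def delta_chain(i):
    """Indexes visited when reconstructing text i, ending at full text 0."""
    chain = []
    while i != 0:
        chain.append(i)
        i = skip_delta_basis(i)
    chain.append(0)
    return chain


# Number of deltas applied to rebuild the last of 100,000 texts.
deltas_needed = len(delta_chain(99_999)) - 1

# Worst case over the first 100,000 texts.
worst_case = max(len(delta_chain(i)) - 1 for i in range(100_000))
```

With this rule the number of deltas to apply equals the number of set bits
in the index, so it grows only logarithmically with the history length.
The price is that individual deltas are larger, because the basis may be
many versions older than the text.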

Annotations
-----------

Annotations indicate which revision of a file first inserted a line
(or region of bytes).

Given a string, we can write annotations on it like so: a sequence of
*(index, length)* pairs, giving the *index* of the revision which
introduced the next run of *length* bytes. The sum of the lengths
must equal the length of the string. For text files the regions will
typically fall on line breaks. This can be transformed in memory to
other structures, such as a list of *(index, content)* pairs.
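
A minimal sketch of that transformation, assuming the annotation is held
as a Python list of tuples (the function name ``expand_annotations`` and
the sample data are made up for illustration):

```python
def expand_annotations(text, runs):
    """Turn (index, length) runs over text into (index, content) pairs."""
    if sum(length for _, length in runs) != len(text):
        raise ValueError("run lengths must sum to the length of the text")
    pairs = []
    pos = 0
    for index, length in runs:
        pairs.append((index, text[pos:pos + length]))
        pos += length
    return pairs


# Three lines: the first and last introduced by revision 0,
# the middle one by revision 2.
text = b"one\ntwo\nthree\n"
runs = [(0, 4), (2, 4), (0, 6)]
pairs = expand_annotations(text, runs)
```

The length check enforces the invariant stated above, that the runs cover
the string exactly.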

When a line was inserted from a merge revision then the annotation for
that line should still be the source in the merged branch, rather than
just being the revision in which the merge took place.

Annotations can be calculated cheaply when inserting a new text, but
are expensive to calculate after the fact because that requires
searching back through all previous texts and all texts which were
merged in. It therefore seems sensible to calculate them once and
store them.

To do this we need two operators which update an existing annotated
file:

A. Given an annotated file and a working text, update the annotation to
   mark regions inserted in the working file as new in this revision.

B. Given two annotated files, merge them to produce an annotated
   result. When there are conflicts, both texts should be included
   and annotated.

These may be repeated: after a merge there may be another merge, or
there may be manual fixups or conflict resolutions.

So what we require is a way, given a diff or a diff3 between two
files, to map the regions of bytes changed into corresponding updates
to the origin annotations.
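
Operator A can be sketched with a line-based diff from Python's standard
``difflib``. This is an illustration under assumptions: the annotated file
is represented as a list of *(index, line)* pairs, and the function name
``annotate_update`` is hypothetical. Operator B (merging two annotated
files) is not covered here.

```python
import difflib


def annotate_update(old_pairs, new_lines, new_index):
    """Carry annotations from an annotated file over to a working text.

    old_pairs is a list of (index, line) pairs; new_lines is the working
    text split into lines. Lines that the diff reports as unchanged keep
    their original index; inserted or replaced lines get new_index.
    """
    old_lines = [line for _, line in old_pairs]
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines,
                                      autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.extend(old_pairs[i1:i2])
        elif tag in ("insert", "replace"):
            result.extend((new_index, line) for line in new_lines[j1:j2])
        # "delete": the old lines are gone, so nothing is carried over.
    return result


old = [(0, "a\n"), (0, "b\n"), (1, "c\n")]
working = ["a\n", "x\n", "c\n", "d\n"]
updated = annotate_update(old, working, 5)
```

Each opcode maps a changed region of the diff to an update of the origin
annotations, which is the mapping the paragraph above asks for.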

Open issues
===========

* Revfiles use unsigned 32-bit integers both in diffs and the index.
  This should be more than enough for any reasonable source file but
  perhaps not enough for large binaries that are frequently committed.

  Perhaps for those files there should be an option to continue to use
  the text-store. There is unlikely to be any benefit in holding
  deltas between them, and deltas will anyhow be hard to calculate.

* The append-only design does not allow for destroying committed data,
  as when confidential information is accidentally added. That could
  be fixed by creating the fixed repository as a separate branch, into
  which only the preserved revisions are exported.

* Should annotations also indicate where text was deleted?

* This design calls for only one annotation per line, which seems
  standard. However, this is lacking in at least two cases:

  - Lines which originate in the same way in more than one revision,
    through being independently introduced. In this case we would
    apparently have to make an arbitrary choice; I suppose branches
    could prefer to assume lines originated in their own history.

  - It might be useful to directly indicate which mergers included
    which lines. We do have that information in the revision history
    though, so there seems no need to store it for every line.
