/brz/remove-bazaar : revision 3350.3.1

To get this branch, use:

bzr branch http://gegoxaren.bato24.eu/bzr/brz/remove-bazaar

Viewing changes to doc/developers/repository-stream.txt

Committer: Robert Collins
Date: 2008-04-11 01:06:29 UTC
mto: (3350.7.1 KeyMapper) (3221.12.10 Development1) (3517.4.1 annotate)
mto: This revision was merged to the branch mainline in revision 3424.
Revision ID: robertc@robertcollins.net-20080411010629-j07mncp10h10obg8

Draft up an interface for repository streams that is more capable than the
current one.

files added:
doc/developers/repository-stream.txt

files modified:
doc/developers/index.txt

doc/developers/repository-stream.txt

==================
Repository Streams
==================

Status
======

:Date: 2008-04-11

This document describes the proposed programming interface for streaming
data from and into repositories. It should provide a single interface for
both pulling data from and inserting data into a Bazaar repository.

.. contents::

Motivation
==========

The goal is to eliminate the current requirement that extracting data from
a repository means either using a slow format or knowing the formats of
both the source repository and the target repository.

Use Cases
=========

Here's a brief description of the use cases this interface is intended to
support.

Fetch operations
----------------

We fetch data between repositories as part of push/pull/branch operations.
Fetching data is currently a very interactive process with lots of
requests. Supplying the data as a stream will improve the performance of
push and pull to remote servers. For purely local operations the
streaming logic should help reduce memory pressure. In fetch operations
we always know the formats of both the source and target.

Smart server operations
~~~~~~~~~~~~~~~~~~~~~~~

With the smart server we support one streaming format, but this is only
usable when both the client and server have the same model of data, and
requires non-optimal IO ordering for pack to pack operations. Ideally we
can

Bundles
-------

Bundles also create a stream of data for revisions from a repository.
Unlike fetch operations we do not know the format of the target at the
time the stream is created. It would be good to be able to treat bundles
as frozen branches and repositories, so a serialised stream should be
suitable for this.

Data conversion
---------------

At this point we are not trying to integrate data conversion into this
interface, though it is likely possible.

Characteristics
===============

Some key aspects of the described interface are discussed in this section.

Single round trip
-----------------

All users of this interface should be able to create an appropriate stream
in a single round trip.

Forward-only reads
------------------

There should be no need to seek in a stream when inserting data from it
into a repository. This places an ordering constraint on streams which
some repositories do not need.
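
To illustrate that ordering constraint, here is a minimal sketch. It assumes
each stream item can be reduced to a ``(key, parents)`` pair; the function
name, the pair shape and the ``already_present`` argument are hypothetical
and not part of the proposed interface. It reports the first item whose
parents have not been read yet, which is the point where a forward-only
consumer would be forced to seek.

```python
def first_ordering_violation(items, already_present=()):
    """Return the first key whose parents have not been seen yet, or None.

    items: iterable of (key, parents) pairs in stream order (assumed shape).
    already_present: keys the target repository is assumed to hold already.
    """
    seen = set(already_present)
    for key, parents in items:
        # A forward-only consumer can only rely on parents it has already
        # read from the stream or that the target already stores.
        if any(parent not in seen for parent in parents):
            return key
        seen.add(key)
    return None


ordered = [(("rev-1",), ()), (("rev-2",), (("rev-1",),))]
reversed_items = list(reversed(ordered))
print(first_ordering_violation(ordered))         # None
print(first_ordering_violation(reversed_items))  # ('rev-2',)
```

A repository that can insert atomically would not need this check, which is
why the constraint is stricter than some targets require.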

Serialisation
=============

At this point serialisation of a repository stream has not been specified.
Some considerations about serialisation are worth noting, however.

Weaves
------

While there shouldn't be too many users of weave repositories anymore,
avoiding pathological behaviour when a weave is being read is a good idea.
Having the weave itself embedded in the stream is very straightforward
and does not need expensive on-the-fly extraction and re-diffing to take
place.

Bundles
-------

Being able to perform random reads from a repository stream which is a
bundle would allow stacking a bundle and a real repository together. This
will need the pack container format to be used in such a way that we can
avoid reading more data than needed within the pack container's readv
interface.

Specification
=============

This describes the interface for requesting a stream, and the programming
interface a stream must provide. Streams that have been serialised should
expose the same interface.

Requesting a stream
-------------------

To request a stream, three parameters are needed:

* A revision search to select the revisions to include.
* A data ordering flag. There are two values for this - 'unordered' and
  'topological'. 'unordered' streams are useful when inserting into
  repositories that have the ability to perform atomic insertions.
  'topological' streams are useful when converting data, or when
  inserting into repositories that cannot perform atomic insertions (such
  as knit or weave based repositories).
* A complete_inventory flag. When provided this flag signals the stream
  generator to include all the data needed to construct the inventory of
  each revision included in the stream, rather than just deltas. This is
  useful when converting data from a repository with a different
  inventory serialisation, as pure deltas would not be able to be
  reconstructed.
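
The three parameters above could be grouped as in the following minimal
sketch. ``StreamRequest`` and ``choose_ordering`` are hypothetical names used
only for illustration, and a plain list stands in for the revision search;
none of this is part of the proposed interface.

```python
from dataclasses import dataclass

UNORDERED = "unordered"
TOPOLOGICAL = "topological"


@dataclass(frozen=True)
class StreamRequest:
    """Hypothetical container for the three request parameters."""
    search: object                  # revision search selecting the revisions
    ordering: str = UNORDERED       # 'unordered' or 'topological'
    complete_inventory: bool = False

    def __post_init__(self):
        # Only the two ordering values named in the text are accepted.
        if self.ordering not in (UNORDERED, TOPOLOGICAL):
            raise ValueError("unknown ordering: %r" % (self.ordering,))


def choose_ordering(target_has_atomic_insert):
    """Pick an ordering flag from the target's insertion capability."""
    return UNORDERED if target_has_atomic_insert else TOPOLOGICAL


# A knit or weave based target cannot insert atomically, so it needs a
# topological stream.
request = StreamRequest(search=["rev-1", "rev-2"],
                        ordering=choose_ordering(False),
                        complete_inventory=True)
print(request.ordering)  # topological
```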

Structure of a stream
---------------------

A stream is an object. It can be consistency checked via the ``check``
method (which consumes the stream). The ``iter_contents`` method can be
used to iterate the contents of the stream. The contents of the stream are
a series of top level records, each of which contains one or more
bytestrings (potentially as a delta against another item in the
repository) and some optional metadata.

Consuming a stream
------------------

To consume a stream, obtain an iterator from the stream's
``iter_contents`` method. This iterator will yield the top level records.
Each record has two attributes. One is ``key_prefix`` which is a tuple key
prefix for the names of each of the bytestrings in the record. The other
attribute is ``entries``, an iterator of the individual items in the
record. Each item that the iterator yields is a two-tuple with a meta-data
dict and the compressed bytestring data.

In pseudocode::

    stream = repository.get_repository_stream(search, UNORDERED, False)
    for record in stream.iter_contents():
        # Each entry is a (metadata dict, compressed bytestring) pair.
        for metadata, data in record.entries:
            print("Object %s, compression type %s, %d bytes long." % (
                record.key_prefix + metadata['key'],
                metadata['storage_kind'], len(data)))

This structure should allow stream adapters to be written which can coerce
all records to the type of compression that a particular client needs. For
instance, inserting into weaves requires fulltexts, so an adapter that
applies knit records and extracts them to fulltexts will avoid weaves
needing to know about all potential storage kinds. Likewise, inserting
into knits would use an adapter that gives everything as either matching
knit records or full texts.
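
A stream adapter of this kind might look like the following minimal sketch.
The ``Record`` namedtuple, the ``adapt_to_fulltext`` function and the
``expanders`` mapping are hypothetical, and the expander used in the example
is only a placeholder: a real one would apply the knit delta to its basis
text.

```python
from collections import namedtuple

# Hypothetical record shape with the two attributes described above.
Record = namedtuple("Record", "key_prefix entries")


def adapt_to_fulltext(records, expanders):
    """Yield records whose entries all have storage_kind 'fulltext'.

    expanders maps a storage_kind to a callable(metadata, data) returning
    fulltext bytes; the callables stand in for real delta application.
    """
    for record in records:
        yield Record(record.key_prefix,
                     _expand_entries(record.entries, expanders))


def _expand_entries(entries, expanders):
    # Entries are converted lazily, so the stream is still read forward-only.
    for metadata, data in entries:
        kind = metadata["storage_kind"]
        if kind != "fulltext":
            data = expanders[kind](metadata, data)
            metadata = dict(metadata, storage_kind="fulltext")
        yield metadata, data


records = [Record(("file-id",), iter([
    ({"key": ("rev-1",), "storage_kind": "fulltext", "parents": (),
      "compressor_data": None}, b"hello\n"),
    ({"key": ("rev-2",), "storage_kind": "knit-delta",
      "parents": (("rev-1",),), "compressor_data": None}, b"world\n"),
]))]
# Placeholder expander, NOT a real knit delta implementation.
expanders = {"knit-delta": lambda metadata, data: b"hello\n" + data}
adapted = [(record.key_prefix, list(record.entries))
           for record in adapt_to_fulltext(records, expanders)]
```

With this shape the consumer only has to understand ``fulltext`` entries,
and support for a new storage kind is added by registering one more expander.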

bytestring metadata
~~~~~~~~~~~~~~~~~~~

Valid keys in the metadata dict are:

* sha1: Optional ASCII representation of the sha1 of the bytestring (after
  delta reconstruction).
* storage_kind: Required kind of storage compression that has been used
  on the bytestring. One of ``mpdiff``, ``knit-annotated-ft``,
  ``knit-annotated-delta``, ``knit-ft``, ``knit-delta``, ``fulltext``.
* parents: Required graph parents to associate with this bytestring.
* compressor_data: Required opaque data relevant to the storage_kind.
  (This is set to None when the compressor has no special state needed)
* key: The key for this bytestring. Like each parent this is a tuple that
  should have the key_prefix prepended to it to give the unified
  repository key name.
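
The metadata rules above can be checked with a minimal sketch such as the
following. The helper names are hypothetical. It also makes two assumptions
that the text does not state: ``key`` is treated as required, and the bytes
of a ``fulltext`` entry are taken to be the plain text, so that the optional
``sha1`` can be compared against them directly.

```python
import hashlib

STORAGE_KINDS = frozenset([
    "mpdiff", "knit-annotated-ft", "knit-annotated-delta",
    "knit-ft", "knit-delta", "fulltext"])
# Assumption: 'key' is treated as required alongside the documented ones.
REQUIRED_KEYS = ("storage_kind", "parents", "compressor_data", "key")


def unified_key(key_prefix, metadata):
    """Prepend the record's key_prefix to the entry's key tuple."""
    return tuple(key_prefix) + tuple(metadata["key"])


def metadata_problems(metadata, data=None):
    """Return a list of problems found; an empty list means none."""
    problems = ["missing %s" % name for name in REQUIRED_KEYS
                if name not in metadata]
    kind = metadata.get("storage_kind")
    if kind is not None and kind not in STORAGE_KINDS:
        problems.append("unknown storage_kind %r" % (kind,))
    # The sha1 describes the text after delta reconstruction, so it is only
    # compared here against the bytes of a fulltext entry.
    if kind == "fulltext" and data is not None and "sha1" in metadata:
        if hashlib.sha1(data).hexdigest() != metadata["sha1"]:
            problems.append("sha1 mismatch")
    return problems


text = b"hello\n"
meta = {"key": ("rev-1",), "storage_kind": "fulltext", "parents": (),
        "compressor_data": None, "sha1": hashlib.sha1(text).hexdigest()}
print(unified_key(("file-id",), meta))  # ('file-id', 'rev-1')
print(metadata_problems(meta, text))    # []
```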


vim: ft=rst tw=74 ai
