KnitPack repository format
==========================

Bazaar 0.92 adds a new format (experimental at first) implemented in
the ``bzrlib.repofmt.pack_repo`` module.

This format provides a knit-like interface which is quite compatible
with knit format repositories: you can get a VersionedFile for a
particular file-id, or for revisions, or for the inventory, even though
these do not correspond to single files on disk.

On disk, the repository directory contains these files and
subdirectories:

==================== =============================================
packs/               completed readonly packs
indices/             indices for completed packs
upload/              temporary files for packs currently being 
                     written
obsolete_packs/      packs that have been repacked and are no 
                     longer normally needed
pack-names           index of all live packs
lock/                lockdir
==================== =============================================

Note that for consistency we always write "indices" not "indexes".

This is implemented on top of pack files, which are written once from
start to end, then left alone.  A pack consists of a body file plus
four index files.  The index files share the same basename and carry an
extension indicating the purpose of the index:

======== ========== ======================== ==========================
extn     Purpose    Key                      References
======== ========== ======================== ==========================
``.tix`` File texts ``file_id, revision_id`` per-file parents,
                                             compression base
``.six`` Signatures ``revision_id,``         -
``.rix`` Revisions  ``revision_id,``         revision parents
``.iix`` Inventory  ``revision_id,``         revision parents,
                                             compression base
======== ========== ======================== ==========================
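As a rough illustration of the layout and the extension table above,
the sketch below maps a pack name to the files that belong to it.  The
function name ``pack_file_paths`` is hypothetical, and the ``.pack``
extension for the body file is an assumption that is not stated in this
document.

```python
import posixpath

# Index extensions from the table above, keyed by what each index holds.
INDEX_EXTENSIONS = {
    "texts": ".tix",
    "signatures": ".six",
    "revisions": ".rix",
    "inventory": ".iix",
}


def pack_file_paths(pack_name):
    """Return repository-relative paths for one finished pack.

    Assumption: the body file is stored as ``packs/<name>.pack``.
    """
    paths = {"body": posixpath.join("packs", pack_name + ".pack")}
    for purpose, extension in INDEX_EXTENSIONS.items():
        # Every index shares the pack's basename and differs only in
        # its extension.
        paths[purpose] = posixpath.join("indices", pack_name + extension)
    return paths
```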

Indices are accessed through the ``bzrlib.index.GraphIndex`` class.  
Indices are stored as sorted files on disk.  Each line is one record,
and contains:

 * key fields
 * a value string - for all these indices, this is an ASCII decimal pair
   of "offset length" giving the position of the referenced data within
   the pack body file
 * a list of zero or more reference lists

The reference lists let a graph be stored within the index.  Each
reference list entry points to another entry in the same index.  Each
reference is represented as the byte offset of its target within the
index file.
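The record structure described above can be modelled in memory roughly
as follows.  This is a simplified sketch, not the ``GraphIndex``
implementation: the class and function names are hypothetical, and the
references are kept as keys rather than as byte offsets into the index
file.

```python
from dataclasses import dataclass
from typing import Tuple

Key = Tuple[str, ...]


@dataclass(frozen=True)
class IndexRecord:
    key: Key
    value: str  # "offset length" as ASCII decimal
    reference_lists: Tuple[Tuple[Key, ...], ...]

    def body_span(self):
        """Parse the value string into (offset, length) integers."""
        offset_text, length_text = self.value.split(" ")
        return int(offset_text), int(length_text)


def read_record_data(body: bytes, record: IndexRecord) -> bytes:
    """Slice the data for one record out of the pack body."""
    offset, length = record.body_span()
    return body[offset:offset + length]
```

For example, a record with the value ``"4 6"`` selects six bytes of the
body starting at byte offset four.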

When a compression base is given, it indicates that the body of the text
or inventory is a forward delta from the referenced revision.  The
compression base list must have length 0 or 1.
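Because each delta names at most one compression base, reconstructing a
text means following the bases back to a record with no base and then
applying the deltas forward.  The sketch below only computes that
order; the ``index`` mapping (key to compression base list) and the
name ``delta_chain`` are assumptions made for illustration.

```python
def delta_chain(index, key):
    """Return the keys to read, full text first, ending with ``key``.

    ``index`` maps each key to its compression base list, which must
    have length 0 (a full text) or 1 (a delta against that base).
    """
    chain = []
    seen = set()
    current = key
    while current is not None:
        if current in seen:
            raise ValueError("compression base cycle at %r" % (current,))
        seen.add(current)
        chain.append(current)
        bases = index[current]
        if len(bases) > 1:
            raise ValueError("more than one compression base for %r"
                             % (current,))
        current = bases[0] if bases else None
    # Deltas are forward deltas, so apply them oldest first.
    chain.reverse()
    return chain
```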

Like packs, indices are written only once and never modified
afterwards.  A GraphIndex builder is a mutable in-memory graph that can
be sorted, cross-referenced and written out when the write group
completes.

There can also be index entries with a value of 'a' for absent.  These
records exist just to be pointed to in a graph.  This is used, for
example, to give the revision-parent pointer when the parent revision is
in a previous pack.
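A toy builder makes the role of absent entries concrete.  This is a
minimal sketch under stated assumptions, not the ``bzrlib`` builder:
``ToyIndexBuilder`` and its methods are hypothetical, and the step that
turns references into byte offsets when writing the file is omitted.

```python
ABSENT = "a"  # value used for entries that exist only as graph targets


class ToyIndexBuilder:
    """Mutable in-memory index that is sorted when finished."""

    def __init__(self):
        self._nodes = {}

    def add_node(self, key, value, reference_lists=()):
        if key in self._nodes:
            raise ValueError("duplicate key %r" % (key,))
        frozen = tuple(tuple(refs) for refs in reference_lists)
        self._nodes[key] = (value, frozen)

    def finish(self):
        """Return sorted (key, value, reference_lists) records."""
        records = dict(self._nodes)
        for _value, reference_lists in self._nodes.values():
            for refs in reference_lists:
                for ref in refs:
                    # A target stored elsewhere (for example in an
                    # earlier pack) still needs an entry to point at.
                    if ref not in records:
                        records[ref] = (ABSENT, ())
        return [(key,) + records[key] for key in sorted(records)]
```

Adding only ``rev-2`` with a parent of ``rev-1`` therefore yields two
records: an absent one for ``rev-1`` and the real one for ``rev-2``.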

The data content for each record is a knit data chunk.  The knits are
always unannotated - the annotations must be generated when needed.
(We'd like to cache/memoize the annotations.)  The data chunks can be
moved between packs without needing to recompress them.

It is not possible to regenerate an index from the body file, because
the index contains information that is not in the body.  (In
particular, the per-file graph is only stored in the index.)
We would like to change this in a future format.

The lock is a regular LockDir lock.  It is held only for a much reduced
scope: while the pack-names file is being updated.  The bulk of the
insertion can be done without the repository locked.  This is an
implementation detail; the repository user should still call
``repository.lock_write`` at the regular time but be aware this does not
correspond to a physical mutex.

Read locks control caching but do not affect writers.

The newly-added repository write group concept is very important to
KnitPack repositories.  When ``start_write_group`` is called, a new
temporary pack is created and all modifications to the repository will 
go into it until either ``commit_write_group`` or ``abort_write_group``
is called, at which time it is either finished and moved into place or
discarded, respectively.  Write groups cannot be nested: only one can be
underway at a time on a Repository instance, and they must occur within
a write lock.
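The write group rules can be sketched with a small in-memory stand-in.
``ToyPackRepository`` is hypothetical and keeps packs as plain lists;
only the method names ``lock_write``, ``start_write_group``,
``commit_write_group`` and ``abort_write_group`` come from the text
above.

```python
class ToyPackRepository:
    def __init__(self):
        self.packs = []             # finished packs
        self._write_locked = False
        self._pending = None        # temporary pack of the open group

    def lock_write(self):
        self._write_locked = True

    def unlock(self):
        if self._pending is not None:
            raise RuntimeError("write group still open")
        self._write_locked = False

    def start_write_group(self):
        if not self._write_locked:
            raise RuntimeError("write groups need a write lock")
        if self._pending is not None:
            raise RuntimeError("write groups cannot be nested")
        self._pending = []

    def add(self, record):
        if self._pending is None:
            raise RuntimeError("no write group is open")
        self._pending.append(record)

    def commit_write_group(self):
        if self._pending is None:
            raise RuntimeError("no write group is open")
        self.packs.append(self._pending)   # finished and moved into place
        self._pending = None

    def abort_write_group(self):
        if self._pending is None:
            raise RuntimeError("no write group is open")
        self._pending = None               # temporary pack is discarded


repo = ToyPackRepository()
repo.lock_write()
repo.start_write_group()
try:
    repo.add("revision-1 data")
except Exception:
    repo.abort_write_group()
    raise
else:
    repo.commit_write_group()
finally:
    repo.unlock()
```

The try/except/else shape ensures that every started group ends in
exactly one of commit or abort before the lock is released.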

Normally the data for each revision will be entirely within a single
pack but this is not required.

When a pack is finished, it gets a final name based on the MD5 of all
the data written into the pack body file.
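One way to picture this step: hash the body while it is written into
``upload/``, then rename the finished file into ``packs/``.  This is a
sketch, assuming that the name is the plain hexadecimal MD5 digest and
that the body carries a ``.pack`` extension; neither detail is stated
here, and ``write_pack`` is a hypothetical helper.

```python
import hashlib
import os
import tempfile


def write_pack(repo_dir, chunks):
    """Write chunks to a temporary pack, then move it into place."""
    upload_dir = os.path.join(repo_dir, "upload")
    packs_dir = os.path.join(repo_dir, "packs")
    os.makedirs(upload_dir, exist_ok=True)
    os.makedirs(packs_dir, exist_ok=True)

    digest = hashlib.md5()
    fd, temp_path = tempfile.mkstemp(dir=upload_dir)
    with os.fdopen(fd, "wb") as body:
        for chunk in chunks:
            body.write(chunk)
            digest.update(chunk)    # hash exactly what was written

    # Assumed naming scheme: hex digest plus ".pack".
    name = digest.hexdigest()
    os.replace(temp_path, os.path.join(packs_dir, name + ".pack"))
    return name
```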

The ``pack-names`` file gives the list of all finished non-obsolete
packs.  (This should always be the same as the list of files in the
``packs/`` directory, but the file is needed for readonly HTTP clients
that can't easily list directories, and it includes other information.)
The constraint on the ``pack-names`` list is that every file mentioned
must exist in the ``packs/`` directory.  

In rare cases, when a writer is interrupted, about-to-be-removed packs
may still be present in the directory but removed from the list.

As well as the list of names, the pack-names file also contains the
size, in bytes, of each of the four indices.  This is used to bootstrap
bisection search within the indices.
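The constraint and the recorded index sizes can be illustrated with a
plain dictionary standing in for the ``pack-names`` file.  The real
file layout is not described here, so the data structure, the
``.pack`` extension and both function names are assumptions.

```python
def check_pack_names(pack_names, packs_dir_listing):
    """Compare the listed packs with the files found in ``packs/``.

    Returns (missing, leftover).  Any ``missing`` name breaks the
    constraint; ``leftover`` packs are tolerated, since an interrupted
    writer may leave about-to-be-removed packs behind.
    """
    suffix = ".pack"    # assumed body file extension
    on_disk = {name[:-len(suffix)] for name in packs_dir_listing
               if name.endswith(suffix)}
    listed = set(pack_names)
    return sorted(listed - on_disk), sorted(on_disk - listed)


def bisection_bounds(pack_names, pack_name, extension):
    """Return the initial (low, high) byte range for bisecting an index.

    The size comes from ``pack-names``, so a reader does not have to
    ask the transport for the file length first.
    """
    return 0, pack_names[pack_name][extension]
```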

In normal use, one pack will be created for each commit to a repository.
This would build up to an inefficient number of files over time, so a
``repack`` operation is available to recombine them, by producing larger
files containing data on multiple revisions.  This can be done manually
by running ``bzr pack``, and it also may happen automatically when a
write group is committed.

The repacking strategy used at the moment tries to balance not doing too
much work during commit with not having too many small files left in the
repository.  The algorithm is roughly this: the total number of
revisions in the repository is expressed as a decimal number, e.g.
"532", and each digit gives the number of packs wanted at the
corresponding power of ten.  For 532 we'll repack until we have five
packs containing a hundred revisions each, three packs containing ten
revisions each, and two packs with single revisions.  This means that
each revision will normally initially be created in a single-revision
pack, then moved to a ten-revision pack, then to a hundred-revision
pack, and so on.
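The digit rule can be written down directly.  The sketch below computes
only the target pack sizes for a given revision count; it does not
decide which existing packs to combine, and ``target_pack_sizes`` is a
hypothetical name.

```python
def target_pack_sizes(total_revisions):
    """Return the desired pack sizes, largest first.

    Each decimal digit of ``total_revisions`` gives the number of packs
    holding the matching power of ten revisions.
    """
    if total_revisions < 0:
        raise ValueError("revision count cannot be negative")
    digits = str(total_revisions)
    sizes = []
    for position, digit in enumerate(digits):
        pack_size = 10 ** (len(digits) - position - 1)
        sizes.extend([pack_size] * int(digit))
    return sizes
```

For 532 revisions this gives five entries of 100, three of 10 and two
of 1, matching the example above.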

As with other repositories, in normal use data is only inserted.
However, in some circumstances we may want to garbage-collect or prune
existing data, or reconcile indices.

  vim: tw=72 ft=rest expandtab
