global/git

Fork 0

mirror of https://github.com/git/git.git synced 2026-03-05 14:59:04 +01:00

Commit Graph

Author	SHA1	Message	Date
Taylor Blau	b2ec8e90c2	midx: do not require packs to be sorted in lexicographic order The MIDX file format currently requires that pack files be identified by the lexicographic ordering of their names (that is, a pack having a checksum beginning with "abc" would have a numeric pack_int_id which is smaller than the same value for a pack beginning with "bcd"). As a result, it is impossible to combine adjacent MIDX layers together without permuting bits from bitmaps that are in more recent layer(s). To see why, consider the following example: \| packs \| preferred pack --------+-------------+--------------- MIDX #0 \| { X, Y, Z } \| Y MIDX #1 \| { A, B, C } \| B MIDX #2 \| { D, E, F } \| D , where MIDX #2's base MIDX is MIDX #1, and so on. Suppose that we want to combine MIDX layers #0 and #1, to create a new layer #0' containing the packs from both layers. With the original three MIDX layers, objects are laid out in the bitmap in the order they appear in their source pack, and the packs themselves are arranged according to the pseudo-pack order. In this case, that ordering is Y, X, Z, B, A, C. But recall that the pseudo-pack ordering is defined by the order that packs appear in the MIDX, with the exception of the preferred pack, which sorts ahead of all other packs regardless of its position within the MIDX. In the above example, that means that pack 'Y' could be placed anywhere (so long as it is designated as preferred), however, all other packs must be placed in the location listed above. Because that ordering isn't sorted lexicographically, it is impossible to compact MIDX layers in the above configuration without permuting the object-to-bit-position mapping. Changing this mapping would affect all bitmaps belonging to newer layers, rendering the bitmaps associated with MIDX #2 unreadable. One of the goals of MIDX compaction is that we are able to shrink the length of the MIDX chain without invalidating bitmaps that belong to newer layers, and the lexicographic ordering constraint is at odds with this goal. However, packs do not need to be lexicographically ordered within the MIDX. As far as I can gather, the only reason they are sorted lexically is to make it possible to perform a binary search over the pack names in a MIDX, necessary to make `midx_contains_pack()`'s performance logarithmic in the number of packs rather than linear. Relax this constraint by allowing MIDX writes to proceed with packs that are not arranged in lexicographic order. `midx_contains_pack()` will lazily instantiate a `pack_names_sorted` array on the MIDX, which will be used to implement the binary search over pack names. This change produces MIDXs which may not be correctly read with external tools or older versions of Git. Though older versions of Git know how to gracefully degrade and ignore any MIDX(s) they consider corrupt, external tools may not be as robust. To avoid unintentionally breaking any such tools, guard this change behind a version bump in the MIDX's on-disk format. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2026-02-24 11:16:33 -08:00
brian m. carlson	24d46f8633	docs: improve ambiguous areas of pack format documentation It is fair to say that our pack and indexing code is quite complex. Contributors who wish to work on this code or implementors of other implementations would benefit from clear, unambiguous documentation about how our data formats are structured and encoded and what data is used in the computation of certain values. Unfortunately, some of this data is missing, which leads to confusion and frustration. Let's document some of this data to help clarify things. Specify over what data CRC32 values are computed and also note which CRC32 algorithm is used, since Wikipedia mentions at least four 32-bit CRC algorithms and notes that it's possible to use different bit orderings. In addition, note how we encode objects in the pack. One might be led to believe that packed objects are always stored with the "<type> <size>\0" prefix of loose objects, but that is not the case, although for obvious reasons this data is included in the computation of the object ID. Explain why this is for the curious reader. Finally, indicate what the size field of the packed object represents. Otherwise, a reader might think that the size of a delta is the size of the full object or that it might contain the offset or object ID, neither of which are the case. Explain clearly, however, that the values represent uncompressed sizes to avoid confusion. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2025-10-09 17:46:14 -07:00
brian m. carlson	1f010d6bdf	doc: use .adoc extension for AsciiDoc files We presently use the ".txt" extension for our AsciiDoc files. While not wrong, most editors do not associate this extension with AsciiDoc, meaning that contributors don't get automatic editor functionality that could be useful, such as syntax highlighting and prose linting. It is much more common to use the ".adoc" extension for AsciiDoc files, since this helps editors automatically detect files and also allows various forges to provide rich (HTML-like) rendering. Let's do that here, renaming all of the files and updating the includes where relevant. Adjust the various build scripts and makefiles to use the new extension as well. Note that this should not result in any user-visible changes to the documentation. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2025-01-21 12:56:06 -08:00

Author

SHA1

Message

Date

Taylor Blau

b2ec8e90c2

midx: do not require packs to be sorted in lexicographic order

The MIDX file format currently requires that pack files be identified by
the lexicographic ordering of their names (that is, a pack having a
checksum beginning with "abc" would have a numeric pack_int_id which is
smaller than the same value for a pack beginning with "bcd").

As a result, it is impossible to combine adjacent MIDX layers together
without permuting bits from bitmaps that are in more recent layer(s).

To see why, consider the following example:

          | packs       | preferred pack
  --------+-------------+---------------
  MIDX #0 | { X, Y, Z } | Y
  MIDX #1 | { A, B, C } | B
  MIDX #2 | { D, E, F } | D

, where MIDX #2's base MIDX is MIDX #1, and so on. Suppose that we want
to combine MIDX layers #0 and #1, to create a new layer #0' containing
the packs from both layers. With the original three MIDX layers, objects
are laid out in the bitmap in the order they appear in their source
pack, and the packs themselves are arranged according to the pseudo-pack
order. In this case, that ordering is Y, X, Z, B, A, C.

But recall that the pseudo-pack ordering is defined by the order that
packs appear in the MIDX, with the exception of the preferred pack,
which sorts ahead of all other packs regardless of its position within
the MIDX. In the above example, that means that pack 'Y' could be placed
anywhere (so long as it is designated as preferred), however, all other
packs must be placed in the location listed above.

Because that ordering isn't sorted lexicographically, it is impossible
to compact MIDX layers in the above configuration without permuting the
object-to-bit-position mapping. Changing this mapping would affect all
bitmaps belonging to newer layers, rendering the bitmaps associated with
MIDX #2 unreadable.

One of the goals of MIDX compaction is that we are able to shrink the
length of the MIDX chain *without* invalidating bitmaps that belong to
newer layers, and the lexicographic ordering constraint is at odds with
this goal.

However, packs do not *need* to be lexicographically ordered within the
MIDX. As far as I can gather, the only reason they are sorted lexically
is to make it possible to perform a binary search over the pack names in
a MIDX, necessary to make `midx_contains_pack()`'s performance
logarithmic in the number of packs rather than linear.

Relax this constraint by allowing MIDX writes to proceed with packs that
are not arranged in lexicographic order. `midx_contains_pack()` will
lazily instantiate a `pack_names_sorted` array on the MIDX, which will
be used to implement the binary search over pack names.

This change produces MIDXs which may not be correctly read with external
tools or older versions of Git. Though older versions of Git know how to
gracefully degrade and ignore any MIDX(s) they consider corrupt,
external tools may not be as robust. To avoid unintentionally breaking
any such tools, guard this change behind a version bump in the MIDX's
on-disk format.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>

2026-02-24 11:16:33 -08:00

brian m. carlson

24d46f8633

docs: improve ambiguous areas of pack format documentation

It is fair to say that our pack and indexing code is quite complex.
Contributors who wish to work on this code or implementors of other
implementations would benefit from clear, unambiguous documentation
about how our data formats are structured and encoded and what data is
used in the computation of certain values.  Unfortunately, some of this
data is missing, which leads to confusion and frustration.

Let's document some of this data to help clarify things.  Specify over
what data CRC32 values are computed and also note which CRC32 algorithm
is used, since Wikipedia mentions at least four 32-bit CRC algorithms
and notes that it's possible to use different bit orderings.

In addition, note how we encode objects in the pack.  One might be led
to believe that packed objects are always stored with the "<type>
<size>\0" prefix of loose objects, but that is not the case, although
for obvious reasons this data is included in the computation of the
object ID.  Explain why this is for the curious reader.

Finally, indicate what the size field of the packed object represents.
Otherwise, a reader might think that the size of a delta is the size of
the full object or that it might contain the offset or object ID,
neither of which are the case.  Explain clearly, however, that the
values represent uncompressed sizes to avoid confusion.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>

2025-10-09 17:46:14 -07:00

brian m. carlson

1f010d6bdf

doc: use .adoc extension for AsciiDoc files

We presently use the ".txt" extension for our AsciiDoc files.  While not
wrong, most editors do not associate this extension with AsciiDoc,
meaning that contributors don't get automatic editor functionality that
could be useful, such as syntax highlighting and prose linting.

It is much more common to use the ".adoc" extension for AsciiDoc files,
since this helps editors automatically detect files and also allows
various forges to provide rich (HTML-like) rendering.  Let's do that
here, renaming all of the files and updating the includes where
relevant.  Adjust the various build scripts and makefiles to use the new
extension as well.

Note that this should not result in any user-visible changes to the
documentation.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>

2025-01-21 12:56:06 -08:00

3 Commits