When filling packs from an existing MIDX, `fill_packs_from_midx()`
handles preparing a MIDX'd pack, and reading out its pack name from the
existing MIDX.
MIDX compaction will want to perform an identical operation, though the
caller will look quite different than `fill_packs_from_midx()`. To
reduce any future code duplication, extract `fill_pack_from_midx()`
from `fill_packs_from_midx()` to prepare to call our new helper function
in a future change.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `ctx->pack_perm` array can be considered as a permutation between
the original `pack_int_id` of some given pack to its position in the
`ctx->info` array containing all packs.
Today we can always index into this array with any known `pack_int_id`,
since there is never a `pack_int_id` which is greater than or equal to
the value `ctx->nr`.
That is not necessarily the case with MIDX compaction. For example,
suppose we have a MIDX chain with three layers, each containing three
packs. The base of the MIDX chain will have packs with IDs 0, 1, and 2,
the next layer 3, 4, and 5, and so on. If we are compacting the topmost
two layers, we'll have input `pack_int_id` values between [3, 8], but
`ctx->nr` will only be 6.
In that example, if we want to know where the pack whose original
`pack_int_id` value was, say, 7, we would compute `ctx->pack_perm[7]`,
leading to an uninitialized read, since there are only 6 entries
allocated in that array.
To address this, there are a couple of options:
- We could allocate enough entries in `ctx->pack_perm` to accommodate
the largest `orig_pack_int_id` value.
- Or, we could internally shift the input values by the number of packs
in the base layer of the lower end of the MIDX compaction range.
This patch prepare us to take the latter approach, since it does not
allocate more memory than strictly necessary. (In our above example, the
base of the lower end of the compaction range is the first MIDX layer
(having three packs), so we would end up indexing `ctx->pack_perm[7-3]`,
which is a valid read.)
Note that this patch does not actually implement that approach yet, but
merely performs a behavior-preserving refactoring which will make the
change easier to carry out in the future.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The MIDX file format currently requires that pack files be identified by
the lexicographic ordering of their names (that is, a pack having a
checksum beginning with "abc" would have a numeric pack_int_id which is
smaller than the same value for a pack beginning with "bcd").
As a result, it is impossible to combine adjacent MIDX layers together
without permuting bits from bitmaps that are in more recent layer(s).
To see why, consider the following example:
| packs | preferred pack
--------+-------------+---------------
MIDX #0 | { X, Y, Z } | Y
MIDX #1 | { A, B, C } | B
MIDX #2 | { D, E, F } | D
, where MIDX #2's base MIDX is MIDX #1, and so on. Suppose that we want
to combine MIDX layers #0 and #1, to create a new layer #0' containing
the packs from both layers. With the original three MIDX layers, objects
are laid out in the bitmap in the order they appear in their source
pack, and the packs themselves are arranged according to the pseudo-pack
order. In this case, that ordering is Y, X, Z, B, A, C.
But recall that the pseudo-pack ordering is defined by the order that
packs appear in the MIDX, with the exception of the preferred pack,
which sorts ahead of all other packs regardless of its position within
the MIDX. In the above example, that means that pack 'Y' could be placed
anywhere (so long as it is designated as preferred), however, all other
packs must be placed in the location listed above.
Because that ordering isn't sorted lexicographically, it is impossible
to compact MIDX layers in the above configuration without permuting the
object-to-bit-position mapping. Changing this mapping would affect all
bitmaps belonging to newer layers, rendering the bitmaps associated with
MIDX #2 unreadable.
One of the goals of MIDX compaction is that we are able to shrink the
length of the MIDX chain *without* invalidating bitmaps that belong to
newer layers, and the lexicographic ordering constraint is at odds with
this goal.
However, packs do not *need* to be lexicographically ordered within the
MIDX. As far as I can gather, the only reason they are sorted lexically
is to make it possible to perform a binary search over the pack names in
a MIDX, necessary to make `midx_contains_pack()`'s performance
logarithmic in the number of packs rather than linear.
Relax this constraint by allowing MIDX writes to proceed with packs that
are not arranged in lexicographic order. `midx_contains_pack()` will
lazily instantiate a `pack_names_sorted` array on the MIDX, which will
be used to implement the binary search over pack names.
This change produces MIDXs which may not be correctly read with external
tools or older versions of Git. Though older versions of Git know how to
gracefully degrade and ignore any MIDX(s) they consider corrupt,
external tools may not be as robust. To avoid unintentionally breaking
any such tools, guard this change behind a version bump in the MIDX's
on-disk format.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In the MIDX writing code, there are four functions which perform some
sort of MIDX write operation. They are:
- write_midx_file()
- write_midx_file_only()
- expire_midx_packs()
- midx_repack()
All of these functions are thin wrappers over `write_midx_internal()`,
which implements the bulk of these routines. As a result, the
`write_midx_internal()` function takes six arguments.
Future commits in this series will want to add additional arguments, and
in general this function's signature will be the union of parameters
among *all* possible ways to write a MIDX.
Instead of adding yet more arguments to this function to support MIDX
compaction, introduce a `struct write_midx_opts`, which has the same
struct members as `write_midx_internal()`'s arguments.
Adding additional fields to the `write_midx_opts` struct is preferable
to adding additional arguments to `write_midx_internal()`. This is
because the callers below all zero-initialize the struct, so each time
we add a new piece of information, we do not have to pass the zero value
for it in all other call-sites that do not care about it.
For now, no functional changes are included in this patch.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In midx_pack_order(), we compute for each bitmapped pack the first bit
to correspond to an object in that pack, along with how many bits were
assigned to object(s) in that pack.
Initially, each bitmap_nr value is set to zero, and each bitmap_pos
value is set to the sentinel BITMAP_POS_UNKNOWN. This is done to ensure
that there are no packs who have an unknown bit position but a somehow
non-zero number of objects (cf. `write_midx_bitmapped_packs()` in
midx-write.c).
Once the pack order is fully determined, midx_pack_order() sets the
bitmap_pos field for any bitmapped packs to zero if they are still
listed as BITMAP_POS_UNKNOWN.
However, we enumerate the bitmapped packs in order of `ctx->pack_perm`.
This is fine for existing cases, since the only time the
`ctx->pack_perm` array holds a value outside of the addressable range of
`ctx->info` is when there are expired packs, which only occurs via 'git
multi-pack-index expire', which does not support writing MIDX bitmaps.
As a result, the range of ctx->pack_perm covers all values in [0,
`ctx->nr`), so enumerating in this order isn't an issue.
A future change necessary for compaction will complicate this further by
introducing a wrapper around the `ctx->pack_perm` array, which turns the
given `pack_int_id` into one that is relative to the lower end of the
compaction range. As a result, indexing into `ctx->pack_perm` through
this helper, say, with "0" will produce a crash when the lower end of
the compaction range has >0 pack(s) in its base layer, since the
subtraction will wrap around the 32-bit unsigned range, resulting in an
uninitialized read.
But the process is completely unnecessary in the first place: we are
enumerating all values of `ctx->info`, and there is no reason to process
them in a different order than they appear in memory. Index `ctx->info`
directly to reflect that.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Commit d4bf1d88b9 (multi-pack-index: verify missing pack, 2018-09-13)
adds a new test to the MIDX test script to test how we handle missing
packs.
While the commit itself describes the test as "verify missing pack[s]",
the test itself is actually called "verify packnames out of order",
despite that not being what it tests.
Likely this was a copy-and-paste of the test immediately above it of the
same name. Correct this by renaming the test to match the commit
message.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Since c39fffc1c9 (tests: start asserting that *.txt SYNOPSIS matches -h
output, 2022-10-13), the manual page for 'git multi-pack-index' has a
SYNOPSIS section which differs from 'git multi-pack-index -h'.
Correct this while also documenting additional options accepted by the
'write' sub-command.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Since fcb2205b77 (midx: implement support for writing incremental MIDX
chains, 2024-08-06), the command-line options '--incremental' and
'--bitmap' were declared to be incompatible with one another when
running 'git multi-pack-index write'.
However, since 27afc272c4 (midx: implement writing incremental MIDX
bitmaps, 2025-03-20), that incompatibility no longer exists, despite the
documentation saying so. Correct this by removing the stale reference to
their incompatibility.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
All multi-pack-index sub-commands (write, verify, repack, and expire)
support a '--progress' command-line option, despite not listing it as
one of the common options in `common_opts`.
As a result each sub-command declares its own `OPT_BIT()` for a
"--progress" command-line option. Centralize this within the
`common_opts` to avoid re-declaring it in each sub-command.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When trying to print out, say, the hexadecimal representation of a
MIDX's hash, our code will do something like:
hash_to_hex_algop(midx_get_checksum_hash(m),
m->source->odb->repo->hash_algo);
, which is both cumbersome and repetitive. In fact, all but a handful of
callers to `midx_get_checksum_hash()` do exactly the above. Reduce the
repetitive nature of calling `midx_get_checksum_hash()` by having it
return a pointer into a static buffer containing the above result.
For the handful of callers that do need to compare the raw bytes and
don't want to deal with an encoded copy (e.g., because they are passing
it to hasheq() or similar), they may still rely on
`midx_get_checksum_hash()` which returns the raw bytes.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Since 541204aabe (Documentation: document naming schema for structs and
their functions, 2024-07-30), we have adopted a naming convention for
functions that would prefer a name like, say, `midx_get_checksum()` over
`get_midx_checksum()`.
Adopt this convention throughout the midx.h API. Since this function
returns a raw (that is, non-hex encoded) hash, let's suffix the function
with "_hash()" to make this clear. As a side effect, this prepares us
for the subsequent change which will introduce a "_hex()" variant that
encodes the checksum itself.
Suggested-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
To make clear that the function `get_midx_checksum()` does not do
anything to modify its argument, mark the MIDX pointer as const.
The following commit will rename this function altogether to make clear
that it returns the raw bytes of the checksum, not a hex-encoded copy of
it.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Small clean-up of xdiff library to remove unnecessary data
duplication.
* pw/xdiff-cleanups:
xdiff: remove unused data from xdlclass_t
xdiff: remove "line_hash" field from xrecord_t
"git merge-file" can be run outside a repository, but it ignored
all configuration, even the per-user ones. The command now uses
available configuration files to find its customization.
* yt/merge-file-outside-a-repository:
merge-file: honor merge.conflictStyle outside of a repository
A handful of documentation pages have been modernized to use the
"synopsis" style.
* ja/doc-synopsis-style-even-more:
doc: convert git-show to synopsis style
doc: fix some style issues in git-clone and for-each-ref-options
doc: finalize git-clone documentation conversion to synopsis style
doc: convert git-submodule to synopsis style
Allow recording process ID of the process that holds the lock next
to a lockfile for diagnosis.
* pc/lockfile-pid:
lockfile: add PID file for debugging stale locks
"git merge-ours" is taught to work better in a sparse checkout.
* sb/merge-ours-sparse:
merge-ours: integrate with sparse-index
merge-ours: drop USE_THE_REPOSITORY_VARIABLE
Test contrib/ things in CI to catch breakages before they enter the
"next" branch.
* jc/ci-test-contrib-too:
: Some of our downstream folks run more tests than we do and catch
: breakages in them, namely, where contrib/*/Makefile has "test" target.
: Let's make sure we fail upon accepting a new topic that break them in
: 'seen'.
ci: ubuntu: use GNU coreutils for dirname
test: optionally test contrib in CI
Transaction to create objects (or not) is currently tied to the
repository, but in the future a repository can have multiple object
sources, which may have different transaction mechanisms. Make the
odb transaction API per object source.
* jt/odb-transaction-per-source:
odb: transparently handle common transaction behavior
odb: prepare `struct odb_transaction` to become generic
object-file: rename transaction functions
odb: store ODB source in `struct odb_transaction`
Rename three functions around the commit_list data structure.
* ps/commit-list-functions-renamed:
commit: rename `free_commit_list()` to conform to coding guidelines
commit: rename `reverse_commit_list()` to conform to coding guidelines
commit: rename `copy_commit_list()` to conform to coding guidelines
Giving "git last-modified" a tree (not a commit-ish) died an
uncontrolled death, which has been corrected.
* tc/last-modified-not-a-tree:
last-modified: verify revision argument is a commit-ish
last-modified: remove double error message
last-modified: fix memory leak when more than one commit is given
last-modified: rewrite error message when more than one commit given
ISO C23 redefines strchr and friends that tradiotionally took
a const pointer and returned a non-const pointer derived from it to
preserve constness (i.e., if you ask for a substring in a const
string, you get a const pointer to the substring). Update code
paths that used non-const pointer to receive their results that did
not have to be non-const to adjust.
* cf/c23-const-preserving-strchr-updates-0:
gpg-interface: remove an unnecessary NULL initialization
global: constify some pointers that are not written to
Replace assertion-style 'test -f' checks with Git's
test_path_is_file() helper for clearer failures and
consistency.
Signed-off-by: Ashwani Kumar Kamal <ashwanikamal.im421@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
For a line to be an anchor it has to appear in each of the files being
diffed exactly once. With that in mind lets delay checking whether
a line is an anchor until we know there is exactly one instance of
the line in each file. As each line is checked at most once, there
is no need to cache the result of is_anchor() and we can drop that
field from the hashmap entries. When diffing 5000 recent commits in
git.git this gives a modest speedup of ~2%. In the (rather extreme)
example below that consists largely of deletions the speedup is ~16%.
seq 0 10000000 >old
printf '%s\n' 300000 100000 200000 >new
git diff --no-index --anchored=300000 old new
Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We do not use // comments in our C code, which is implied by the
description of multi-line comment rule and its examples, but is not
explicitly spelled out. Spell it out.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git blame --ignore-revs=... --color-lines" did not account for
ignored revisions passing blame to the same commit an adjacent line
gets blamed for.
* rs/blame-ignore-colors-fix:
blame: fix coloring for repeated suspects
GitHub repository banner update.
* am/doc-github-contributiong-link-to-submittingpatches:
.github/CONTRIBUTING.md: link to SubmittingPatches on git-scm.com