When reachability bitmap coverage exists in a repository, Git will use a
different (and hopefully faster) traversal to compute revision walks.
Consider a set of positive and negative tips (which we'll refer to with
their standard bitmap parlance by "wants", and "haves"). In order to
figure out what objects exist between the tips, the existing traversal
in `prepare_bitmap_walk()` does something like:
1. Consider if we can even compute the set of objects with bitmaps,
and fall back to the usual traversal if we cannot. For example,
pathspec limiting traversals can't be computed using bitmaps (since
they don't know which objects are at which paths). The same is true
of certain kinds of non-trivial object filters.
2. If we can compute the traversal with bitmaps, partition the
(dereferenced) tips into two object lists, "haves", and "wants",
based on whether or not the objects have the UNINTERESTING flag,
respectively.
3. Fall back to the ordinary object traversal if either (a) there are
more than zero haves, none of which are in the bitmapped pack or
MIDX, or (b) there are no wants.
4. Construct a reachability bitmap for the "haves" side by walking
from the revision tips down to any existing bitmaps, OR-ing in any
bitmaps as they are found.
5. Then do the same for the "wants" side, stopping at any objects that
appear in the "haves" bitmap.
6. Filter the results if any object filter (that can be easily
computed with bitmaps alone) was given, and then return back to the
caller.
When there is good bitmap coverage relative to the traversal tips, this
walk is often significantly faster than an ordinary object traversal
because it can visit far fewer objects.
But in certain cases, it can be significantly *slower* than the usual
object traversal. Why? Because we need to compute complete bitmaps on
either side of the walk. If either one (or both) of the sides require
walking many (or all!) objects before they get to an existing bitmap,
the extra bitmap machinery is mostly or all overhead.
One of the benefits, however, is that even if the walk is slower, bitmap
traversals are guaranteed to provide an *exact* answer. Unlike the
traditional object traversal algorithm, which can over-count the results
by not opening trees for older commits, the bitmap walk builds an exact
reachability bitmap for either side, meaning the results are never
over-counted.
But producing non-exact results is OK for our traversal here (both in
the bitmap case and not), as long as the results are over-counted, not
under.
Relaxing the bitmap traversal to allow it to produce over-counted
results gives us the opportunity to make some significant improvements.
Instead of the above, the new algorithm only has to walk from the
*boundary* down to the nearest bitmap, instead of from each of the
UNINTERESTING tips.
The boundary-based approach still has degenerate cases, but we'll show
in a moment that it is often a significant improvement.
The new algorithm works as follows:
1. Build a (partial) bitmap of the haves side by first OR-ing any
bitmap(s) that already exist for UNINTERESTING commits between the
haves and the boundary.
2. For each commit along the boundary, add it as a fill-in traversal
tip (where the traversal terminates once an existing bitmap is
found), and perform fill-in traversal.
3. Build up a complete bitmap of the wants side as usual, stopping any
time we intersect the (partial) haves side.
4. Return the results.
And is more-or-less equivalent to using the *old* algorithm with this
invocation:
$ git rev-list --objects --use-bitmap-index $WANTS --not \
$(git rev-list --objects --boundary $WANTS --not $HAVES |
perl -lne 'print $1 if /^-(.*)/')
The new result performs significantly better in many cases, particularly
when the distance from the boundary commit(s) to an existing bitmap is
shorter than the distance from (all of) the have tips to the nearest
bitmapped commit.
Note that when using the old bitmap traversal algorithm, the results can
be *slower* than without bitmaps! Under the new algorithm, the result is
computed faster with bitmaps than without (at the cost of over-counting
the true number of objects in a similar fashion as the non-bitmap
traversal):
# (Computing the number of tagged objects not on any branches
# without bitmaps).
$ time git rev-list --count --objects --tags --not --branches
20
real 0m1.388s
user 0m1.092s
sys 0m0.296s
# (Computing the same query using the old bitmap traversal).
$ time git rev-list --count --objects --tags --not --branches --use-bitmap-index
19
real 0m22.709s
user 0m21.628s
sys 0m1.076s
# (this commit)
$ time git.compile rev-list --count --objects --tags --not --branches --use-bitmap-index
19
real 0m1.518s
user 0m1.234s
sys 0m0.284s
The new algorithm is still slower than not using bitmaps at all, but it
is nearly a 15-fold improvement over the existing traversal.
In a more realistic setting (using my local copy of git.git), I can
observe a similar (if more modest) speed-up:
$ argv="--count --objects --branches --not --tags"
hyperfine \
-n 'no bitmaps' "git.compile rev-list $argv" \
-n 'existing traversal' "git.compile rev-list --use-bitmap-index $argv" \
-n 'boundary traversal' "git.compile -c pack.useBitmapBoundaryTraversal=true rev-list --use-bitmap-index $argv"
Benchmark 1: no bitmaps
Time (mean ± σ): 124.6 ms ± 2.1 ms [User: 103.7 ms, System: 20.8 ms]
Range (min … max): 122.6 ms … 133.1 ms 22 runs
Benchmark 2: existing traversal
Time (mean ± σ): 368.6 ms ± 3.0 ms [User: 325.3 ms, System: 43.1 ms]
Range (min … max): 365.1 ms … 374.8 ms 10 runs
Benchmark 3: boundary traversal
Time (mean ± σ): 167.6 ms ± 0.9 ms [User: 139.5 ms, System: 27.9 ms]
Range (min … max): 166.1 ms … 169.2 ms 17 runs
Summary
'no bitmaps' ran
1.34 ± 0.02 times faster than 'boundary traversal'
2.96 ± 0.05 times faster than 'existing traversal'
Here, the new algorithm is also still slower than not using bitmaps, but
represents a more than 2-fold improvement over the existing traversal in
a more modest example.
Since this algorithm was originally written (nearly a year and a half
ago, at the time of writing), the bitmap lookup table shipped, making
the new algorithm's result more competitive. A few other future
directions for improving bitmap traversal times beyond not using bitmaps
at all:
- Decrease the cost to decompress and OR together many bitmaps
together (particularly when enumerating the uninteresting side of
the walk). Here we could explore more efficient bitmap storage
techniques, like Roaring+Run and/or use SIMD instructions to speed
up ORing them together.
- Store pseudo-merge bitmaps, which could allow us to OR together
fewer "summary" bitmaps (which would also help with the above).
Helped-by: Jeff King <peff@peff.net>
Helped-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When "gc" needs to retain unreachable objects, packing them into
cruft packs (instead of exploding them into loose object files) has
been offered as a more efficient option for some time. Now the use
of cruft packs has been made the default and no longer considered
an experimental feature.
* tb/enable-cruft-packs-by-default:
repository.h: drop unused `gc_cruft_packs`
builtin/gc.c: make `gc.cruftPacks` enabled by default
t/t9300-fast-import.sh: prepare for `gc --cruft` by default
t/t6500-gc.sh: add additional test cases
t/t6500-gc.sh: refactor cruft pack tests
t/t6501-freshen-objects.sh: prepare for `gc --cruft` by default
t/t5304-prune.sh: prepare for `gc --cruft` by default
builtin/gc.c: ignore cruft packs with `--keep-largest-pack`
builtin/repack.c: fix incorrect reference to '-C'
pack-write.c: plug a leak in stage_tmp_packfiles()
Instead of the time the formatter was run, show the timestamp
recorded in the commit in the documentation.
* fc/doc-use-datestamp-in-commit:
doc: set actual revdate for manpages
The on-disk reverse index that allows mapping from the pack offset
to the object name for the object stored at the offset has been
enabled by default.
* tb/pack-revindex-on-disk:
t: invert `GIT_TEST_WRITE_REV_INDEX`
config: enable `pack.writeReverseIndex` by default
pack-revindex: introduce `pack.readReverseIndex`
pack-revindex: introduce GIT_TEST_REV_INDEX_DIE_ON_DISK
pack-revindex: make `load_pack_revindex` take a repository
t5325: mark as leak-free
pack-write.c: plug a leak in stage_tmp_packfiles()
The sendemail-validate validate hook learned to pass the total
number of input files and where in the sequence each invocation is
via environment variables.
* rj/send-email-validate-hook-count-messages:
send-email: export patch counters in validate environment
"git branch --format=..." and "git format-patch --format=..."
learns "--omit-empty" to hide refs that whose formatting result
becomes an empty string from the output.
* ow/ref-filter-omit-empty:
branch, for-each-ref, tag: add option to omit empty lines
"git clone --local" stops copying from an original repository that
has symbolic links inside its $GIT_DIR; an error message when that
happens has been updated.
* gc/better-error-when-local-clone-fails-with-symlink:
clone: error specifically with --local and symlinked objects
The examples are an ordered list, however, they are complex enough that
a callout is inside example 1, and that confuses the parsers as the list
continuation (`+`) is unclear (are we continuing the previous list item,
or the previous callout?).
We could use an open block as the asciidoctor documentation suggests,
but that has a tiny formatting issue (a newline is missing).
To simplify things for everyone (the reader, the writer, and the parser)
let's use subsections.
After this change, the HTML documentation generated with asciidoc has
the right indentation.
Cc: Jeff King <peff@peff.net>
Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The callouts are directly tied to the listing above, remove spaces to
make it clear they are one and the same.
Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Back in 5b92477f89 (builtin/gc.c: conditionally avoid pruning objects
via loose, 2022-05-20), `git gc` learned the `--cruft` option and
`gc.cruftPacks` configuration to opt-in to writing cruft packs when
collecting or pruning unreachable objects.
Cruft packs were introduced with the merge in a50036da1a (Merge branch
'tb/cruft-packs', 2022-06-03). They address the problem of "loose object
explosions", where Git will write out many individual loose objects when
there is a large number of unreachable objects that have not yet aged
past `--prune=<date>`.
Instead of keeping track of those unreachable yet recent objects via
their loose object file's mtime, cruft packs collect all unreachable
objects into a single pack with a corresponding `*.mtimes` file that
acts as a table to store the mtimes of all unreachable objects. This
prevents the need to store unreachable objects as loose as they age out
of the repository, and avoids the problem of loose object explosions.
Beyond avoiding loose object explosions, cruft packs also act as a more
efficient mechanism to store unreachable objects as they age out of a
repository. This is because pairs of similar unreachable objects serve
as delta bases for one another.
In 5b92477f89, the feature was introduced as experimental. Since then,
GitHub has been running these patches in every repository generating
hundreds of millions of cruft packs along the way. The feature is
battle-tested, and avoids many pathological cases such as above. Users
who either run `git gc` manually, or via `git maintenance` can benefit
from having cruft packs.
As such, enable cruft pack generation to take place by default (by
making `gc.cruftPacks` have the default of "true" rather than "false).
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When cruft packs were implemented, we never adjusted the code for `git
gc`'s `--keep-largest-pack` and `gc.bigPackThreshold` to ignore cruft
packs. This option and configuration option share a common
implementation, but including cruft packs is wrong in both cases:
- Running `git gc --keep-largest-pack` in a repository where the
largest pack is the cruft pack itself will make it impossible for
`git gc` to prune objects, since the cruft pack itself is kept.
- The same is true for `gc.bigPackThreshold`, if the size of the cruft
pack exceeds the limit set by the caller.
In the future, it is possible that `gc.bigPackThreshold` could be used
to write a separate cruft pack containing any new unreachable objects
that entered the repository since the last time a cruft pack was
written.
There are some complexities to doing so, mainly around handling
pruning objects that are in an existing cruft pack that is above the
threshold (which would either need to be rewritten, or else delay
pruning). Rewriting a substantially similar cruft pack isn't ideal, but
it is significantly better than the status-quo.
If users have large cruft packs that they don't want to rewrite, they
can mark them as `*.keep` packs. But in general, if a repository has a
cruft pack that is so large it is slowing down GC's, it should probably
be pruned anyway.
In the meantime, ignore cruft packs in the common implementation for
both of these options, and add a pair of tests to prevent any future
regressions here.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git mergetool" and "git difftool" learns a new configuration
guiDefault to optionally favor configured guitool over non-gui-tool
automatically when $DISPLAY is set.
* tk/mergetool-gui-default-config:
mergetool: new config guiDefault supports auto-toggling gui by DISPLAY
* maint-2.39: (34 commits)
Git 2.39.3
Git 2.38.5
Git 2.37.7
Git 2.36.6
Git 2.35.8
Makefile: force -O0 when compiling with SANITIZE=leak
Git 2.34.8
Git 2.33.8
Git 2.32.7
Git 2.31.8
tests: avoid using `test_i18ncmp`
Git 2.30.9
gettext: avoid using gettext if the locale dir is not present
apply --reject: overwrite existing `.rej` symlink if it exists
http.c: clear the 'finished' member once we are done with it
clone.c: avoid "exceeds maximum object size" error with GCC v12.x
t5604: GETTEXT_POISON fix, conclusion
t5604: GETTEXT_POISON fix, part 1
t5619: GETTEXT_POISON fix
range-diff: use ssize_t for parsed "len" in read_patches()
...
* maint-2.38: (32 commits)
Git 2.38.5
Git 2.37.7
Git 2.36.6
Git 2.35.8
Git 2.34.8
Git 2.33.8
Git 2.32.7
Git 2.31.8
tests: avoid using `test_i18ncmp`
Git 2.30.9
gettext: avoid using gettext if the locale dir is not present
apply --reject: overwrite existing `.rej` symlink if it exists
http.c: clear the 'finished' member once we are done with it
clone.c: avoid "exceeds maximum object size" error with GCC v12.x
range-diff: use ssize_t for parsed "len" in read_patches()
range-diff: handle unterminated lines in read_patches()
range-diff: drop useless "offset" variable from read_patches()
t5604: GETTEXT_POISON fix, conclusion
t5604: GETTEXT_POISON fix, part 1
t5619: GETTEXT_POISON fix
t0003: GETTEXT_POISON fix, conclusion
...
* maint-2.37: (31 commits)
Git 2.37.7
Git 2.36.6
Git 2.35.8
Git 2.34.8
Git 2.33.8
Git 2.32.7
Git 2.31.8
tests: avoid using `test_i18ncmp`
Git 2.30.9
gettext: avoid using gettext if the locale dir is not present
apply --reject: overwrite existing `.rej` symlink if it exists
http.c: clear the 'finished' member once we are done with it
clone.c: avoid "exceeds maximum object size" error with GCC v12.x
range-diff: use ssize_t for parsed "len" in read_patches()
range-diff: handle unterminated lines in read_patches()
range-diff: drop useless "offset" variable from read_patches()
t5604: GETTEXT_POISON fix, conclusion
t5604: GETTEXT_POISON fix, part 1
t5619: GETTEXT_POISON fix
t0003: GETTEXT_POISON fix, conclusion
t0003: GETTEXT_POISON fix, part 1
...
* maint-2.36: (30 commits)
Git 2.36.6
Git 2.35.8
Git 2.34.8
Git 2.33.8
Git 2.32.7
Git 2.31.8
tests: avoid using `test_i18ncmp`
Git 2.30.9
gettext: avoid using gettext if the locale dir is not present
apply --reject: overwrite existing `.rej` symlink if it exists
http.c: clear the 'finished' member once we are done with it
clone.c: avoid "exceeds maximum object size" error with GCC v12.x
range-diff: use ssize_t for parsed "len" in read_patches()
range-diff: handle unterminated lines in read_patches()
range-diff: drop useless "offset" variable from read_patches()
t5604: GETTEXT_POISON fix, conclusion
t5604: GETTEXT_POISON fix, part 1
t5619: GETTEXT_POISON fix
t0003: GETTEXT_POISON fix, conclusion
t0003: GETTEXT_POISON fix, part 1
t0033: GETTEXT_POISON fix
...
* maint-2.35: (29 commits)
Git 2.35.8
Git 2.34.8
Git 2.33.8
Git 2.32.7
Git 2.31.8
tests: avoid using `test_i18ncmp`
Git 2.30.9
gettext: avoid using gettext if the locale dir is not present
apply --reject: overwrite existing `.rej` symlink if it exists
http.c: clear the 'finished' member once we are done with it
clone.c: avoid "exceeds maximum object size" error with GCC v12.x
range-diff: use ssize_t for parsed "len" in read_patches()
range-diff: handle unterminated lines in read_patches()
range-diff: drop useless "offset" variable from read_patches()
t5604: GETTEXT_POISON fix, conclusion
t5604: GETTEXT_POISON fix, part 1
t5619: GETTEXT_POISON fix
t0003: GETTEXT_POISON fix, conclusion
t0003: GETTEXT_POISON fix, part 1
t0033: GETTEXT_POISON fix
http: support CURLOPT_PROTOCOLS_STR
...
* maint-2.34: (28 commits)
Git 2.34.8
Git 2.33.8
Git 2.32.7
Git 2.31.8
tests: avoid using `test_i18ncmp`
Git 2.30.9
gettext: avoid using gettext if the locale dir is not present
apply --reject: overwrite existing `.rej` symlink if it exists
http.c: clear the 'finished' member once we are done with it
clone.c: avoid "exceeds maximum object size" error with GCC v12.x
range-diff: use ssize_t for parsed "len" in read_patches()
range-diff: handle unterminated lines in read_patches()
range-diff: drop useless "offset" variable from read_patches()
t5604: GETTEXT_POISON fix, conclusion
t5604: GETTEXT_POISON fix, part 1
t5619: GETTEXT_POISON fix
t0003: GETTEXT_POISON fix, conclusion
t0003: GETTEXT_POISON fix, part 1
t0033: GETTEXT_POISON fix
http: support CURLOPT_PROTOCOLS_STR
http: prefer CURLOPT_SEEKFUNCTION to CURLOPT_IOCTLFUNCTION
...
* maint-2.33: (27 commits)
Git 2.33.8
Git 2.32.7
Git 2.31.8
tests: avoid using `test_i18ncmp`
Git 2.30.9
gettext: avoid using gettext if the locale dir is not present
apply --reject: overwrite existing `.rej` symlink if it exists
http.c: clear the 'finished' member once we are done with it
clone.c: avoid "exceeds maximum object size" error with GCC v12.x
range-diff: use ssize_t for parsed "len" in read_patches()
range-diff: handle unterminated lines in read_patches()
range-diff: drop useless "offset" variable from read_patches()
t5604: GETTEXT_POISON fix, conclusion
t5604: GETTEXT_POISON fix, part 1
t5619: GETTEXT_POISON fix
t0003: GETTEXT_POISON fix, conclusion
t0003: GETTEXT_POISON fix, part 1
t0033: GETTEXT_POISON fix
http: support CURLOPT_PROTOCOLS_STR
http: prefer CURLOPT_SEEKFUNCTION to CURLOPT_IOCTLFUNCTION
http-push: prefer CURLOPT_UPLOAD to CURLOPT_PUT
...
* maint-2.32: (26 commits)
Git 2.32.7
Git 2.31.8
tests: avoid using `test_i18ncmp`
Git 2.30.9
gettext: avoid using gettext if the locale dir is not present
apply --reject: overwrite existing `.rej` symlink if it exists
http.c: clear the 'finished' member once we are done with it
clone.c: avoid "exceeds maximum object size" error with GCC v12.x
range-diff: use ssize_t for parsed "len" in read_patches()
range-diff: handle unterminated lines in read_patches()
range-diff: drop useless "offset" variable from read_patches()
t5604: GETTEXT_POISON fix, conclusion
t5604: GETTEXT_POISON fix, part 1
t5619: GETTEXT_POISON fix
t0003: GETTEXT_POISON fix, conclusion
t0003: GETTEXT_POISON fix, part 1
t0033: GETTEXT_POISON fix
http: support CURLOPT_PROTOCOLS_STR
http: prefer CURLOPT_SEEKFUNCTION to CURLOPT_IOCTLFUNCTION
http-push: prefer CURLOPT_UPLOAD to CURLOPT_PUT
ci: install python on ubuntu
...
* maint-2.31: (25 commits)
Git 2.31.8
tests: avoid using `test_i18ncmp`
Git 2.30.9
gettext: avoid using gettext if the locale dir is not present
apply --reject: overwrite existing `.rej` symlink if it exists
http.c: clear the 'finished' member once we are done with it
clone.c: avoid "exceeds maximum object size" error with GCC v12.x
range-diff: use ssize_t for parsed "len" in read_patches()
range-diff: handle unterminated lines in read_patches()
range-diff: drop useless "offset" variable from read_patches()
t5604: GETTEXT_POISON fix, conclusion
t5604: GETTEXT_POISON fix, part 1
t5619: GETTEXT_POISON fix
t0003: GETTEXT_POISON fix, conclusion
t0003: GETTEXT_POISON fix, part 1
t0033: GETTEXT_POISON fix
http: support CURLOPT_PROTOCOLS_STR
http: prefer CURLOPT_SEEKFUNCTION to CURLOPT_IOCTLFUNCTION
http-push: prefer CURLOPT_UPLOAD to CURLOPT_PUT
ci: install python on ubuntu
ci: use the same version of p4 on both Linux and macOS
...
* maint-2.30: (23 commits)
Git 2.30.9
gettext: avoid using gettext if the locale dir is not present
apply --reject: overwrite existing `.rej` symlink if it exists
http.c: clear the 'finished' member once we are done with it
clone.c: avoid "exceeds maximum object size" error with GCC v12.x
range-diff: use ssize_t for parsed "len" in read_patches()
range-diff: handle unterminated lines in read_patches()
range-diff: drop useless "offset" variable from read_patches()
t5604: GETTEXT_POISON fix, conclusion
t5604: GETTEXT_POISON fix, part 1
t5619: GETTEXT_POISON fix
t0003: GETTEXT_POISON fix, conclusion
t0003: GETTEXT_POISON fix, part 1
t0033: GETTEXT_POISON fix
http: support CURLOPT_PROTOCOLS_STR
http: prefer CURLOPT_SEEKFUNCTION to CURLOPT_IOCTLFUNCTION
http-push: prefer CURLOPT_UPLOAD to CURLOPT_PUT
ci: install python on ubuntu
ci: use the same version of p4 on both Linux and macOS
ci: remove the pipe after "p4 -V" to catch errors
github-actions: run gcc-8 on ubuntu-20.04 image
...
When sending patch series (with a cover-letter or not)
sendemail-validate is called with every email/patch file independently
from the others. When one of the patches depends on a previous one, it
may not be possible to use this hook in a meaningful way. A hook that
wants to check some property of the whole series needs to know which
patch is the final one.
Expose the current and total number of patches to the hook via the
GIT_SENDEMAIL_PATCH_COUNTER and GIT_SENDEMAIL_PATCH_TOTAL environment
variables so that both incremental and global validation is possible.
Sharing any other state between successive invocations of the validate
hook must be done via external means. For example, by storing it in
a git config sendemail.validateWorktree entry.
Add a sample script with placeholder validations and update tests to
check that the counters are properly exported.
Suggested-by: Phillip Wood <phillip.wood123@gmail.com>
Signed-off-by: Robin Jarry <robin@jarry.cc>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
manpages expect the date of the last revision, if that is not found
DocBook Stylesheets go through a series of hacks to generate one with
the format `%d/%d/%Y` which is not ideal.
In addition to this format not being standard, different tools generate
dates with different formats.
There's no need for any confusion if we specify the revision date, so
let's do so.
Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
If the given format string expands to the empty string, a newline is
still printed. This makes using the output linewise more tedious. For
example, git update-ref --stdin does not accept empty lines.
Add options to "git branch", "git for-each-ref", and "git tag" to
not print these empty lines. The default behavior remains the same.
Signed-off-by: Øystein Walle <oystwa@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Back in e37d0b8730 (builtin/index-pack.c: write reverse indexes,
2021-01-25), Git learned how to read and write a pack's reverse index
from a file instead of in-memory.
A pack's reverse index is a mapping from pack position (that is, the
order that objects appear together in a ".pack") to their position in
lexical order (that is, the order that objects are listed in an ".idx"
file).
Reverse indexes are consulted often during pack-objects, as well as
during auxiliary operations that require mapping between pack offsets,
pack order, and index index.
They are useful in GitHub's infrastructure, where we have seen a
dramatic increase in performance when writing ".rev" files[1]. In
particular:
- an ~80% reduction in the time it takes to serve fetches on a popular
repository, Homebrew/homebrew-core.
- a ~60% reduction in the peak memory usage to serve fetches on that
same repository.
- a collective savings of ~35% in CPU time across all pack-objects
invocations serving fetches across all repositories in a single
datacenter.
Reverse indexes are also beneficial to end-users as well as forges. For
example, the time it takes to generate a pack containing the objects for
the 10 most recent commits in linux.git (representing a typical push) is
significantly faster when on-disk reverse indexes are available:
$ { git rev-parse HEAD && printf '^' && git rev-parse HEAD~10 } >in
$ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} pack-objects --delta-base-offset --revs --stdout <in >/dev/null'
Benchmark 1: git.compile -c pack.readReverseIndex=false pack-objects --delta-base-offset --revs --stdout <in >/dev/null
Time (mean ± σ): 543.0 ms ± 20.3 ms [User: 616.2 ms, System: 58.8 ms]
Range (min … max): 521.0 ms … 577.9 ms 10 runs
Benchmark 2: git.compile -c pack.readReverseIndex=true pack-objects --delta-base-offset --revs --stdout <in >/dev/null
Time (mean ± σ): 245.0 ms ± 11.4 ms [User: 335.6 ms, System: 31.3 ms]
Range (min … max): 226.0 ms … 259.6 ms 13 runs
Summary
'git.compile -c pack.readReverseIndex=true pack-objects --delta-base-offset --revs --stdout <in >/dev/null' ran
2.22 ± 0.13 times faster than 'git.compile -c pack.readReverseIndex=false pack-objects --delta-base-offset --revs --stdout <in >/dev/null'
The same is true of writing a pack containing the objects for the 30
most-recent commits:
$ { git rev-parse HEAD && printf '^' && git rev-parse HEAD~30 } >in
$ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} pack-objects --delta-base-offset --revs --stdout <in >/dev/null'
Benchmark 1: git.compile -c pack.readReverseIndex=false pack-objects --delta-base-offset --revs --stdout <in >/dev/null
Time (mean ± σ): 866.5 ms ± 16.2 ms [User: 1414.5 ms, System: 97.0 ms]
Range (min … max): 839.3 ms … 886.9 ms 10 runs
Benchmark 2: git.compile -c pack.readReverseIndex=true pack-objects --delta-base-offset --revs --stdout <in >/dev/null
Time (mean ± σ): 581.6 ms ± 10.2 ms [User: 1181.7 ms, System: 62.6 ms]
Range (min … max): 567.5 ms … 599.3 ms 10 runs
Summary
'git.compile -c pack.readReverseIndex=true pack-objects --delta-base-offset --revs --stdout <in >/dev/null' ran
1.49 ± 0.04 times faster than 'git.compile -c pack.readReverseIndex=false pack-objects --delta-base-offset --revs --stdout <in >/dev/null'
...and savings on trivial operations like computing the on-disk size of
a single (packed) object are even more dramatic:
$ git rev-parse HEAD >in
$ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} cat-file --batch-check="%(objectsize:disk)" <in'
Benchmark 1: git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" <in
Time (mean ± σ): 305.8 ms ± 11.4 ms [User: 264.2 ms, System: 41.4 ms]
Range (min … max): 290.3 ms … 331.1 ms 10 runs
Benchmark 2: git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" <in
Time (mean ± σ): 4.0 ms ± 0.3 ms [User: 1.7 ms, System: 2.3 ms]
Range (min … max): 1.6 ms … 4.6 ms 1155 runs
Summary
'git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" <in' ran
76.96 ± 6.25 times faster than 'git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" <in'
In the more than two years since e37d0b8730 was merged, Git's
implementation of on-disk reverse indexes has been thoroughly tested,
both from users enabling `pack.writeReverseIndexes`, and from GitHub's
deployment of the feature. The latter has been running without incident
for more than two years.
This patch changes Git's behavior to write on-disk reverse indexes by
default when indexing a pack, which should make the above operations
faster for everybody's Git installation after a repack.
(The previous commit explains some potential drawbacks of using on-disk
reverse indexes in certain limited circumstances, that essentially boil
down to a trade-off between time to generate, and time to access. For
those limited cases, the `pack.readReverseIndex` escape hatch can be
used).
[1]: https://github.blog/2021-04-29-scaling-monorepo-maintenance/#reverse-indexes
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>