global/git

mirror of https://github.com/git/git.git synced 2026-02-28 10:47:33 +00:00

Author	SHA1	Message	Date
Junio C Hamano	e8b2fc9974	Merge branch 'ps/receive-pack-shallow-optim' into jch The code to accept shallow "git push" has been optimized. * ps/receive-pack-shallow-optim: commit: use commit graph in `lookup_commit_reference_gently()` commit: make `repo_parse_commit_no_graph()` more robust commit: avoid parsing non-commits in `lookup_commit_reference_gently()`	2026-02-23 14:25:51 -08:00
Patrick Steinhardt	f23ac77a43	commit: avoid parsing non-commits in `lookup_commit_reference_gently()` The function `lookup_commit_reference_gently()` can be used to look up a committish by object ID. As such, the function knows to peel for example tag objects so that we eventually end up with the commit. The function is used quite a lot throughout our tree. One such user is "shallow.c" via `assign_shallow_commits_to_refs()`. The intent of this function is to figure out whether a shallow push is missing any objects that are required to satisfy the ref updates, and if so, which of the ref updates is missing objects. This is done by painting the tree with `UNINTERESTING`. We start painting by calling `refs_for_each_ref()` so that we can mark all existing referenced objects as the boundary of objects that we already have, and which are supposed to be fully connected. The reference tips are then parsed via `lookup_commit_reference_gently()`, and the commit is then marked as uninteresting. But references may not necessarily point to a committish, and if a lot of them aren't then this step takes a lot of time. This is mostly due to the way that `lookup_commit_reference_gently()` is implemented: before we learn about the type of the object we already call `parse_object()` on the object ID. This has two consequences: - We parse all objects, including trees and blobs, even though we don't even need the contents of them. - More importantly though, `parse_object()` will cause us to check whether the object ID matches its contents. Combined this means that we deflate and hash every non-committish object, and that of course ends up being both CPU- and memory-intensive. Improve the logic so that we first use `peel_object()`. This function won't parse the object for us, and thus it allows us to learn about the object's type before we parse and return it. The following benchmark pushes a single object from a shallow clone into a repository that has 100,000 refs. These refs were created by listing all objects via `git rev-list(1) --objects --all` and creating refs for a subset of them, so lots of those refs will cover non-commit objects. Benchmark 1: git-receive-pack (rev = HEAD~) Time (mean ± σ): 62.571 s ± 0.413 s [User: 58.331 s, System: 4.053 s] Range (min … max): 62.191 s … 63.010 s 3 runs Benchmark 2: git-receive-pack (rev = HEAD) Time (mean ± σ): 38.339 s ± 0.192 s [User: 36.220 s, System: 1.992 s] Range (min … max): 38.176 s … 38.551 s 3 runs Summary git-receive-pack . </tmp/input (rev = HEAD) ran 1.63 ± 0.01 times faster than git-receive-pack . </tmp/input (rev = HEAD~) This leads to a sizeable speedup as we now skip reading and parsing non-commit objects. Before this change we spent around 40% of the time in `assign_shallow_commits_to_refs()`, after the change we only spend around 1.2% of the time in there. Almost the entire remainder of the time is spent in git-rev-list(1) to perform the connectivity checks. Despite the speedup though, this also leads to a massive reduction in allocations. Before: HEAP SUMMARY: in use at exit: 352,480,441 bytes in 97,185 blocks total heap usage: 2,793,820 allocs, 2,696,635 frees, 67,271,456,983 bytes allocated And after: HEAP SUMMARY: in use at exit: 17,524,978 bytes in 22,393 blocks total heap usage: 33,313 allocs, 10,920 frees, 407,774,251 bytes allocated Note that when all references refer to commits performance stays roughly the same, as expected. The following benchmark was executed with 600k commits: Benchmark 1: git-receive-pack (rev = HEAD~) Time (mean ± σ): 9.101 s ± 0.006 s [User: 8.800 s, System: 0.520 s] Range (min … max): 9.095 s … 9.106 s 3 runs Benchmark 2: git-receive-pack (rev = HEAD) Time (mean ± σ): 9.128 s ± 0.094 s [User: 8.820 s, System: 0.522 s] Range (min … max): 9.019 s … 9.188 s 3 runs Summary git-receive-pack (rev = HEAD~) ran 1.00 ± 0.01 times faster than git-receive-pack (rev = HEAD) This will be improved in the next commit. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2026-02-19 09:34:16 -08:00
Derrick Stolee	b4e8f60a3c	revision: add --maximal-only option When inspecting a range of commits from some set of starting references, it is sometimes useful to learn which commits are not reachable from any other commits in the selected range. One such application is in the creation of a sequence of bundles for the bundle URI feature. Creating a stack of bundles representing different slices of time includes defining which references to include. If all references are used, then this may be overwhelming or redundant. Instead, selecting commits that are maximal to the range could help defining a smaller reference set to use in the bundle header. Add a new '--maximal-only' option to restrict the output of a revision range to be only the commits that are not reachable from any other commit in the range, based on the reachability definition of the walk. This is accomplished by adding a new 28th bit flag, CHILD_VISITED, that is set as we walk. This does extend the bit range in object.h, but using an earlier bit may collide with another feature. The tests demonstrate the behavior of the feature with a positive-only range, ranges with negative references, and walk-modifying flags like --first-parent and --exclude-first-parent-only. Since the --boundary option would not increase any results when used with the --maximal-only option, mark them as incompatible. Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2026-01-22 10:58:14 -08:00
Junio C Hamano	c62d2d3810	Merge branch 'kn/maintenance-is-needed' "git maintenance" command learned "is-needed" subcommand to tell if it is necessary to perform various maintenance tasks. * kn/maintenance-is-needed: maintenance: add 'is-needed' subcommand maintenance: add checking logic in `pack_refs_condition()` refs: add a `optimize_required` field to `struct ref_storage_be` reftable/stack: add function to check if optimization is required reftable/stack: return stack segments directly	2025-11-21 09:14:17 -08:00
Junio C Hamano	ee27005905	Merge branch 'ps/ref-peeled-tags-fixes' Another fix-up to "peeled-tags" topic. * ps/ref-peeled-tags-fixes: object: fix performance regression when peeling tags	2025-11-19 10:55:42 -08:00
Junio C Hamano	13134cecb0	Merge branch 'ps/ref-peeled-tags' Some ref backend storage can hold not just the object name of an annotated tag, but the object name of the object the tag points at. The code to handle this information has been streamlined. * ps/ref-peeled-tags: t7004: do not chdir around in the main process ref-filter: fix stale parsed objects ref-filter: parse objects on demand ref-filter: detect broken tags when dereferencing them refs: don't store peeled object IDs for invalid tags object: add flag to `peel_object()` to verify object type refs: drop infrastructure to peel via iterators refs: drop `current_ref_iter` hack builtin/show-ref: convert to use `reference_get_peeled_oid()` ref-filter: propagate peeled object ID upload-pack: convert to use `reference_get_peeled_oid()` refs: expose peeled object ID via the iterator refs: refactor reference status flags refs: fully reset `struct ref_iterator::ref` on iteration refs: introduce `.ref` field for the base iterator refs: introduce wrapper struct for `each_ref_fn`	2025-11-19 10:55:39 -08:00
Karthik Nayak	8c1ce2204c	maintenance: add checking logic in `pack_refs_condition()` The 'git-maintenance(1)' command supports an '--auto' flag. Usage of the flag ensures to run maintenance tasks only if certain thresholds are met. The heuristic is defined on a task level, wherein each task defines an 'auto_condition', which states if the task should be run. The 'pack-refs' task is hard-coded to return 1 as: 1. There was never a way to check if the reference backend needs to be optimized without actually performing the optimization. 2. We can pass in the '--auto' flag to 'git-pack-refs(1)' which would optimize based on heuristics. The previous commit added a `refs_optimize_required()` function, which can be used to check if a reference backend required optimization. Use this within `pack_refs_condition()`. This allows us to add a 'git maintenance is-needed' subcommand which can notify the user if maintenance is needed without actually performing the optimization. Without this change, the reference backend would always state that optimization is needed. Since we import 'revision.h', we need to remove the definition for 'SEEN' which is duplicated in the included header. Signed-off-by: Karthik Nayak <karthik.188@gmail.com> Acked-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2025-11-10 09:28:48 -08:00
Patrick Steinhardt	7048e74609	object: fix performance regression when peeling tags Our Bencher dashboards [1] have recently alerted us about a bunch of performance regressions when writing references, specifically with the reftable backend. There is a 3x regression when writing many refs with preexisting refs in the reftable format, and a 10x regression when migrating refs between backends in either of the formats. Bisecting the issue lands us at `6ec4c0b45b` (refs: don't store peeled object IDs for invalid tags, 2025-10-23). The gist of the commit is that we may end up storing peeled objects in both reftables and packed-refs for corrupted tags, where the claimed tagged object type is different than the actual tagged object type. This will then cause us to create the `struct object *` with a wrong type, as well, and obviously nothing good comes out of that. The fix for this issue was to introduce a new flag to `peel_object()` that causes us to verify the tagged object's type before writing it into the refdb -- if the tag is corrupt, we skip writing the peeled value. To verify whether the peeled value is correct we have to look up the object type via the ODB and compare the actual type with the claimed type, and that additional object lookup is costly. This also explains why we see the regression only when writing refs with the reftable backend, but we see the regression with both backends when migrating refs: - The reftable backend knows to store peeled values in the new table immediately, so it has to try and peel each ref it's about to write to the transaction. So the performance regression is visible for all writes. - The files backend only stores peeled values when writing the packed-refs file, so it wouldn't hit the performance regression for normal writes. But on ref migrations we know to write all new values into the packed-refs file immediately, and that's why we see the regression for both backends there. Taking a step back though reveals an oddity in the new verification logic: we not only verify the _tagged_ object's type, but we also verify the type of the tag itself. But this isn't really needed, as we wouldn't hit the bug in such a case anyway, as we only hit the issue with corrupt tags claiming an invalid type for the tagged object. The consequence of this is that we now started to look up the target object of every single reference we're about to write, regardless of whether it even is a tag or not. And that is of course quite costly. Fix the issue by only verifying the type of the tagged objects. This means that we of course still have a performance hit for actual tags. But this only happens for writes anyway, and I'd claim it's preferable to not store corrupted data in the refdb than to be fast here. Rename the flag accordingly to clarify that we only verify the tagged object's type. This fix brings performance back to previous levels: Benchmark 1: baseline Time (mean ± σ): 46.0 ms ± 0.4 ms [User: 40.0 ms, System: 5.7 ms] Range (min … max): 45.0 ms … 47.1 ms 54 runs Benchmark 2: regression Time (mean ± σ): 140.2 ms ± 1.3 ms [User: 77.5 ms, System: 60.5 ms] Range (min … max): 138.0 ms … 142.7 ms 20 runs Benchmark 3: fix Time (mean ± σ): 46.2 ms ± 0.4 ms [User: 40.2 ms, System: 5.7 ms] Range (min … max): 45.0 ms … 47.3 ms 55 runs Summary update-ref: baseline 1.00 ± 0.01 times faster than fix 3.05 ± 0.04 times faster than regression [1]: https://bencher.dev/perf/git/plots Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2025-11-06 10:54:34 -08:00
Patrick Steinhardt	7ec85185b1	object: add flag to `peel_object()` to verify object type When peeling a tag to a non-tag object we repeatedly call `parse_object()` on the tagged object until we find the first object that isn't a tag. While this feels sensible at first, there is a big catch here: `parse_object()` doesn't actually verify the type of the tagged object. The relevant code path here eventually ends up in `parse_tag_buffer()`. Here, we parse the various fields of the tag, including the "type". Once we've figured out the type and the tagged object ID, we call one of the `lookup_${type}()` functions for whatever type we have found. There is two possible outcomes in the successful case: 1. The object is already part of our cached objects. In that case we double-check whether the type we're trying to look up matches the type that was cached. 2. The object is _not_ part of our cached objects. In that case, we simply create a new object with the expected type, but we don't parse that object. In the first case we might notice type mismatches, but only in the case where our cache has the object with the correct type. In the second case, we'll blindly assume that the type is correct and then go with it. We'll only notice that the type might be wrong when we try to parse the object at a later point. Now arguably, we could change `parse_tag_buffer()` to verify the tagged object's type for us. But that would have the effect that such a tag cannot be parsed at all anymore, and we have a small bunch of tests for exactly this case that assert we still can open such tags. So this change does not feel like something we can retroactively tighten, even though one shouldn't ever hit such corrupted tags. Instead, add a new `flags` field to `peel_object()` that allows the caller to opt in to strict object verification. This will be wired up at a subset of callsites over the next few commits. Note that this change also inlines `deref_tag_noverify()`. There's only been two callsites of that function, the one we're changing and one in our test helpers. The latter callsite can trivially use `deref_tag()` instead, so by inlining the function we avoid having to pass down the flag. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2025-11-04 07:32:25 -08:00
Toon Claes	2a04e8c293	last-modified: implement faster algorithm The current implementation of git-last-modified(1) works by doing a revision walk, and inspecting the diff at each level of that walk to annotate entries remaining in the hashmap of paths. In other words, if the diff at some level touches a path which has not yet been associated with a commit, then that commit becomes associated with the path. While a perfectly reasonable implementation, it can perform poorly in either one of two scenarios: 1. There are many entries of interest, in which case there is simply a lot of work to do. 2. Or, there are (even a few) entries which have not been updated in a long time, and so we must walk through a lot of history in order to find a commit that touches that path. This patch rewrites the last-modified implementation that addresses the second point. The idea behind the algorithm is to propagate a set of 'active' paths (a path is 'active' if it does not yet belong to a commit) up to parents and do a truncated revision walk. The walk is truncated because it does not produce a revision for every change in the original pathspec, but rather only for active paths. More specifically, consider a priority queue of commits sorted by generation number. First, enqueue the set of boundary commits with all paths in the original spec marked as interesting. Then, while the queue is not empty, do the following: 1. Pop an element, say, 'c', off of the queue, making sure that 'c' isn't reachable by anything in the '--not' set. 2. For each parent 'p' (with index 'parent_i') of 'c', do the following: a. Compute the diff between 'c' and 'p'. b. Pass any active paths that are TREESAME from 'c' to 'p'. c. If 'p' has any active paths, push it onto the queue. 3. Any path that remains active on 'c' is associated to that commit. This ends up being equivalent to doing something like 'git log -1 -- $path' for each path simultaneously. But, it allows us to go much faster than the original implementation by limiting the number of diffs we compute, since we can avoid parts of history that would have been considered by the revision walk in the original implementation, but are known to be uninteresting to us because we have already marked all paths in that area to be inactive. To avoid computing many first-parent diffs, add another trick on top of this and check if all paths active in 'c' are DEFINITELY NOT in c's Bloom filter. Since the commit-graph only stores first-parent diffs in the Bloom filters, we can only apply this trick to first-parent diffs. Comparing the performance of this new algorithm shows about a 2.5x improvement on git.git: Benchmark 1: master no bloom Time (mean ± σ): 2.868 s ± 0.023 s [User: 2.811 s, System: 0.051 s] Range (min … max): 2.847 s … 2.926 s 10 runs Benchmark 2: master with bloom Time (mean ± σ): 949.9 ms ± 15.2 ms [User: 907.6 ms, System: 39.5 ms] Range (min … max): 933.3 ms … 971.2 ms 10 runs Benchmark 3: HEAD no bloom Time (mean ± σ): 782.0 ms ± 6.3 ms [User: 740.7 ms, System: 39.2 ms] Range (min … max): 776.4 ms … 798.2 ms 10 runs Benchmark 4: HEAD with bloom Time (mean ± σ): 307.1 ms ± 1.7 ms [User: 276.4 ms, System: 29.9 ms] Range (min … max): 303.7 ms … 309.5 ms 10 runs Summary HEAD with bloom ran 2.55 ± 0.02 times faster than HEAD no bloom 3.09 ± 0.05 times faster than master with bloom 9.34 ± 0.09 times faster than master no bloom In short, the existing implementation is comparably fast with Bloom filters as the new implementation is without Bloom filters. So, most repositories should get a dramatic speed-up by just deploying this (even without computing Bloom filters), and all repositories should get faster still when computing Bloom filters. When comparing a more extreme example of `git last-modified -- COPYING t`, the difference is even 5 times better: Benchmark 1: master Time (mean ± σ): 4.372 s ± 0.057 s [User: 4.286 s, System: 0.062 s] Range (min … max): 4.308 s … 4.509 s 10 runs Benchmark 2: HEAD Time (mean ± σ): 826.3 ms ± 22.3 ms [User: 784.1 ms, System: 39.2 ms] Range (min … max): 810.6 ms … 881.2 ms 10 runs Summary HEAD ran 5.29 ± 0.16 times faster than master As an added benefit, results are more consistent now. For example implementation in 'master' gives: $ git log --max-count=1 --format=%H -- pkt-line.h `15df15fe07` $ git last-modified -- pkt-line.h `15df15fe07` pkt-line.h $ git last-modified \| grep pkt-line.h `5b49c1af03` pkt-line.h With the changes in this patch the results of git-last-modified(1) always match those of `git log --max-count=1`. One thing to note though, the results might be outputted in a different order than before. This is not considerd to be an issue because nowhere is documented the order is guaranteed. Based-on-patches-by: Derrick Stolee <stolee@gmail.com> Based-on-patches-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Toon Claes <toon@iotcl.com> Acked-by: Taylor Blau <me@ttaylorr.com> [jc: tweaked use of xcalloc() to unbreak coccicheck] Signed-off-by: Junio C Hamano <gitster@pobox.com>	2025-11-03 07:25:41 -08:00
Junio C Hamano	bb74c0abbc	Merge branch 'kn/bundle-dedup-optim' Optimize the code to dedup references recorded in a bundle file. * kn/bundle-dedup-optim: bundle: fix non-linear performance scaling with refs t6020: test for duplicate refnames in bundle creation	2025-04-23 13:58:50 -07:00
Karthik Nayak	a52d459e72	bundle: fix non-linear performance scaling with refs The 'git bundle create' command has non-linear performance with the number of refs in the repository. Benchmarking the command shows that a large portion of the time (~75%) is spent in the `object_array_remove_duplicates()` function. The `object_array_remove_duplicates()` function was added in `b2a6d1c686` (bundle: allow the same ref to be given more than once, 2009-01-17) to skip duplicate refs provided by the user from being written to the bundle. Since this is an O(N^2) algorithm, in repos with large number of references, this can take up a large amount of time. Let's instead use a 'strset' to skip duplicates inside `write_bundle_refs()`. This improves the performance by around 6 times when tested against in repository with 100000 refs: Benchmark 1: bundle (refcount = 100000, revision = master) Time (mean ± σ): 14.653 s ± 0.203 s [User: 13.940 s, System: 0.762 s] Range (min … max): 14.237 s … 14.920 s 10 runs Benchmark 2: bundle (refcount = 100000, revision = HEAD) Time (mean ± σ): 2.394 s ± 0.023 s [User: 1.684 s, System: 0.798 s] Range (min … max): 2.364 s … 2.425 s 10 runs Summary bundle (refcount = 100000, revision = HEAD) ran 6.12 ± 0.10 times faster than bundle (refcount = 100000, revision = master) Previously, `object_array_remove_duplicates()` ensured that both the refname and the object it pointed to were checked for duplicates. The new approach, implemented within `write_bundle_refs()`, eliminates duplicate refnames without comparing the objects they reference. This works because, for bundle creation, we only need to prevent duplicate refs from being written to the bundle header. The `revs->pending` array can contain duplicates of multiple types. First, references which resolve to the same refname. For e.g. "git bundle create out.bdl master master" or "git bundle create out.bdl refs/heads/master refs/heads/master" or "git bundle create out.bdl master refs/heads/master". In these scenarios we want to prevent writing "refs/heads/master" twice to the bundle header. Since both the refnames here would point to the same object (unless there is a race), we do not need to check equality of the object. Second, refnames which are duplicates but do not point to the same object. This can happen when we use an exclusion criteria. For e.g. "git bundle create out.bdl master master^!", Here `revs->pending` would contain two elements, both with refname set to "master". However, each of them would be pointing to an INTERESTING and UNINTERESTING object respectively. Since we only write refnames with INTERESTING objects to the bundle header, we perform our duplicate checks only on such objects. Signed-off-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2025-04-08 14:21:49 -07:00
Patrick Steinhardt	74d414c9f1	object: stop depending on `the_repository` There are a couple of functions exposed by "object.c" that implicitly depend on `the_repository`. Remove this dependency by injecting the repository via a parameter. Adapt callers accordingly by simply using `the_repository`, except in cases where the subsystem is already free of the repository. In that case, we instead pass the repository provided by the caller's context. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2025-03-10 13:16:18 -07:00
Patrick Steinhardt	0d1d22f5a3	object: clear grafts when clearing parsed object pool We do not clear grafts part of the parsed object pool when clearing the pool itself, which can lead to memory leaks when a repository is being cleared. Fix this by moving `reset_commit_grafts()` into "object.c" and making it part of the `struct parsed_object_pool` interface such that we can call it from `parsed_object_pool_clear()`. Adapt `parsed_object_pool_new()` to take and store a reference to its owning repository, which is needed by `unparse_commit()`. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2024-09-05 08:49:11 -07:00
Junio C Hamano	ecf7fc600a	Merge branch 'tb/path-filter-fix' The Bloom filter used for path limited history traversal was broken on systems whose "char" is unsigned; update the implementation and bump the format version to 2. * tb/path-filter-fix: bloom: introduce `deinit_bloom_filters()` commit-graph: reuse existing Bloom filters where possible object.h: fix mis-aligned flag bits table commit-graph: new Bloom filter version that fixes murmur3 commit-graph: unconditionally load Bloom filters bloom: prepare to discard incompatible Bloom filters bloom: annotate filters with hash version repo-settings: introduce commitgraph.changedPathsVersion t4216: test changed path filters with high bit paths t/helper/test-read-graph: implement `bloom-filters` mode bloom.h: make `load_bloom_filter_from_graph()` public t/helper/test-read-graph.c: extract `dump_graph_info()` gitformat-commit-graph: describe version 2 of BDAT commit-graph: ensure Bloom filters are read with consistent settings revision.c: consult Bloom filters for root commits t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()`	2024-07-08 14:53:10 -07:00
Junio C Hamano	7b472da915	Merge branch 'ps/use-the-repository' A CPP macro USE_THE_REPOSITORY_VARIABLE is introduced to help transition the codebase to rely less on the availability of the singleton the_repository instance. * ps/use-the-repository: hex: guard declarations with `USE_THE_REPOSITORY_VARIABLE` t/helper: remove dependency on `the_repository` in "proc-receive" t/helper: fix segfault in "oid-array" command without repository t/helper: use correct object hash in partial-clone helper compat/fsmonitor: fix socket path in networked SHA256 repos replace-object: use hash algorithm from passed-in repository protocol-caps: use hash algorithm from passed-in repository oidset: pass hash algorithm when parsing file http-fetch: don't crash when parsing packfile without a repo hash-ll: merge with "hash.h" refs: avoid include cycle with "repository.h" global: introduce `USE_THE_REPOSITORY_VARIABLE` macro hash: require hash algorithm in `empty_tree_oid_hex()` hash: require hash algorithm in `is_empty_{blob,tree}_oid()` hash: make `is_null_oid()` independent of `the_repository` hash: convert `oidcmp()` and `oideq()` to compare whole hash global: ensure that object IDs are always padded hash: require hash algorithm in `oidread()` and `oidclr()` hash: require hash algorithm in `hasheq()`, `hashcmp()` and `hashclr()` hash: drop (mostly) unused `is_empty_{blob,tree}_sha1()` functions	2024-07-02 09:59:00 -07:00
Taylor Blau	5421e7c3a1	commit-graph: reuse existing Bloom filters where possible In an earlier commit, a bug was described where it's possible for Git to produce non-murmur3 hashes when the platform's "char" type is signed, and there are paths with characters whose highest bit is set (i.e. all characters >= 0x80). That patch allows the caller to control which version of Bloom filters are read and written. However, even on platforms with a signed "char" type, it is possible to reuse existing Bloom filters if and only if there are no changed paths in any commit's first parent tree-diff whose characters have their highest bit set. When this is the case, we can reuse the existing filter without having to compute a new one. This is done by marking trees which are known to have (or not have) any such paths. When a commit's root tree is verified to not have any such paths, we mark it as such and declare that the commit's Bloom filter is reusable. Note that this heuristic only goes in one direction. If neither a commit nor its first parent have any paths in their trees with non-ASCII characters, then we know for certain that a path with non-ASCII characters will not appear in a tree-diff against that commit's first parent. The reverse isn't necessarily true: just because the tree-diff doesn't contain any such paths does not imply that no such paths exist in either tree. So we end up recomputing some Bloom filters that we don't strictly have to (i.e. their bits are the same no matter which version of murmur3 we use). But culling these out is impossible, since we'd have to perform the full tree-diff, which is the same effort as computing the Bloom filter from scratch. But because we can cache our results in each tree's flag bits, we can often avoid recomputing many filters, thereby reducing the time it takes to run $ git commit-graph write --changed-paths --reachable when upgrading from v1 to v2 Bloom filters. To benchmark this, let's generate a commit-graph in linux.git with v1 changed-paths in generation order[^1]: $ git clone git@github.com:torvalds/linux.git $ cd linux $ git commit-graph write --reachable --changed-paths $ graph=".git/objects/info/commit-graph" $ mv $graph{,.bak} Then let's time how long it takes to go from v1 to v2 filters (with and without the upgrade path enabled), resetting the state of the commit-graph each time: $ git config commitGraph.changedPathsVersion 2 $ hyperfine -p 'cp -f $graph.bak $graph' -L v 0,1 \ 'GIT_TEST_UPGRADE_BLOOM_FILTERS={v} git.compile commit-graph write --reachable --changed-paths' On linux.git (where there aren't any non-ASCII paths), the timings indicate that this patch represents a speed-up over recomputing all Bloom filters from scratch: Benchmark 1: GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths Time (mean ± σ): 124.873 s ± 0.316 s [User: 124.081 s, System: 0.643 s] Range (min … max): 124.621 s … 125.227 s 3 runs Benchmark 2: GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths Time (mean ± σ): 79.271 s ± 0.163 s [User: 74.611 s, System: 4.521 s] Range (min … max): 79.112 s … 79.437 s 3 runs Summary 'GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths' ran 1.58 ± 0.01 times faster than 'GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths' On git.git, we do have some non-ASCII paths, giving us a more modest improvement from 4.163 seconds to 3.348 seconds, for a 1.24x speed-up. On my machine, the stats for git.git are: - 8,285 Bloom filters computed from scratch - 10 Bloom filters generated as empty - 4 Bloom filters generated as truncated due to too many changed paths - 65,114 Bloom filters were reused when transitioning from v1 to v2. [^1]: Note that this is is important, since `--stdin-packs` or `--stdin-commits` orders commits in the commit-graph by their pack position (with `--stdin-packs`) or in the raw input (with `--stdin-commits`). Since we compute Bloom filters in the same order that commits appear in the graph, we must see a commit's (first) parent before we process the commit itself. This is only guaranteed to happen when sorting commits by their generation number. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2024-06-25 13:52:06 -07:00
Taylor Blau	df3df2dcf4	object.h: fix mis-aligned flag bits table Bit position 23 is one column too far to the left. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2024-06-25 13:52:06 -07:00
Junio C Hamano	ffa47b75cf	Merge branch 'tb/pseudo-merge-reachability-bitmap' The pseudo-merge reachability bitmap to help more efficient storage of the reachability bitmap in a repository with too many refs has been added. * tb/pseudo-merge-reachability-bitmap: (26 commits) pack-bitmap.c: ensure pseudo-merge offset reads are bounded Documentation/technical/bitmap-format.txt: add missing position table t/perf: implement performance tests for pseudo-merge bitmaps pseudo-merge: implement support for finding existing merges ewah: `bitmap_equals_ewah()` pack-bitmap: extra trace2 information pack-bitmap.c: use pseudo-merges during traversal t/test-lib-functions.sh: support `--notick` in `test_commit_bulk()` pack-bitmap: implement test helpers for pseudo-merge ewah: implement `ewah_bitmap_popcount()` pseudo-merge: implement support for reading pseudo-merge commits pack-bitmap.c: read pseudo-merge extension pseudo-merge: scaffolding for reads pack-bitmap: extract `read_bitmap()` function pack-bitmap-write.c: write pseudo-merge table pseudo-merge: implement support for selecting pseudo-merge commits config: introduce `git_config_double()` pack-bitmap: make `bitmap_writer_push_bitmapped_commit()` public pack-bitmap: implement `bitmap_writer_has_bitmapped_object_id()` pack-bitmap-write: support storing pseudo-merge commits ...	2024-06-24 16:39:13 -07:00
Patrick Steinhardt	8a676bdc5c	hash-ll: merge with "hash.h" The "hash-ll.h" header was introduced via `d1cbe1e6d8` (hash-ll.h: split out of hash.h to remove dependency on repository.h, 2023-04-22) to make explicit the split between hash-related functions that rely on the global `the_repository`, and those that don't. This split is no longer necessary now that we we have removed the reliance on `the_repository`. Merge "hash-ll.h" back into "hash.h". This causes some code units to not include "repository.h" anymore, which requires us to add some forward declarations. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2024-06-14 10:26:33 -07:00
Junio C Hamano	988499e295	Merge branch 'ps/refs-without-the-repository-updates' Further clean-up the refs subsystem to stop relying on the_repository, and instead use the repository associated to the ref_store object. * ps/refs-without-the-repository-updates: refs/packed: remove references to `the_hash_algo` refs/files: remove references to `the_hash_algo` refs/files: use correct repository refs: remove `dwim_log()` refs: drop `git_default_branch_name()` refs: pass repo when peeling objects refs: move object peeling into "object.c" refs: pass ref store when detecting dangling symrefs refs: convert iteration over replace refs to accept ref store refs: retrieve worktree ref stores via associated repository refs: refactor `resolve_gitlink_ref()` to accept a repository refs: pass repo when retrieving submodule ref store refs: track ref stores via strmap refs: implement releasing ref storages refs: rename `init_db` callback to avoid confusion refs: adjust names for `init` and `init_db` callbacks	2024-05-30 14:15:13 -07:00
Taylor Blau	0d41b18317	pack-bitmap-write: support storing pseudo-merge commits Prepare to write pseudo-merge bitmaps by annotating individual bitmapped commits (which are represented by the `bitmapped_commit` structure) with an extra bit indicating whether or not they are a pseudo-merge. In subsequent commits, pseudo-merge bitmaps will be generated by allocating a fake commit node with parents covering the full set of commits represented by the pseudo-merge bitmap. These commits will be added to the set of "selected" commits as usual, but will be written specially instead of being included with the rest of the selected commits. Mechanically speaking, there are two parts of this change: - The bitmapped_commit struct gets a new bit indicating whether it is a pseudo-merge, or an ordinary commit selected for bitmaps. - A handful of changes to only write out the non-pseudo-merge commits when enumerating through the selected array (see the new `bitmap_writer_selected_nr()` function). Pseudo-merge commits appear after all non-pseudo-merge commits, so it is safe to enumerate through the selected array like so: for (i = 0; i < bitmap_writer_selected_nr(); i++) if (writer.selected[i].pseudo_merge) BUG("unexpected pseudo-merge"); without encountering the BUG(). Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2024-05-24 11:40:42 -07:00
Patrick Steinhardt	30aaff437f	refs: pass repo when peeling objects Both `peel_object()` and `peel_iterated_oid()` implicitly rely on `the_repository` to look up objects. Despite the fact that we want to get rid of `the_repository`, it also leads to some restrictions in our ref iterators when trying to retrieve the peeled value for a repository other than `the_repository`. Refactor these functions such that both take a repository as argument and remove the now-unnecessary restrictions. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2024-05-17 10:33:39 -07:00
Patrick Steinhardt	19c76e8235	refs: move object peeling into "object.c" Peeling an object has nothing to do with refs, but we still have the code in "refs.c". Move it over into "object.c", which is a more natural place to put it. Ideally, we'd also move `peel_iterated_oid()` over into "object.c". But this function is tied to the refs interfaces because it uses a global ref iterator variable to optimize peeling when the iterator already has the peeled object ID readily available. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2024-05-17 10:33:39 -07:00
Taylor Blau	b174a97a54	object.h: add flags allocated by pack-bitmap.h In commit `7cc8f97108` (pack-objects: implement bitmap writing, 2013-12-21) the NEEDS_BITMAP flag was introduced into pack-bitmap.h, but no object flags allocation table existed at the time. In `208acbfb82` (object.h: centralize object flag allocation, 2014-03-25) when that table was first introduced, we never added the flags from `7cc8f97108`, which has remained the case since. Rectify this by including the flag bit used by pack-bitmap.h into the centralized table in object.h. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2024-05-15 06:52:31 -07:00
Junio C Hamano	1002f28a52	Merge branch 'eb/hash-transition' Work to support a repository that work with both SHA-1 and SHA-256 hash algorithms has started. * eb/hash-transition: (30 commits) t1016-compatObjectFormat: add tests to verify the conversion between objects t1006: test oid compatibility with cat-file t1006: rename sha1 to oid test-lib: compute the compatibility hash so tests may use it builtin/ls-tree: let the oid determine the output algorithm object-file: handle compat objects in check_object_signature tree-walk: init_tree_desc take an oid to get the hash algorithm builtin/cat-file: let the oid determine the output algorithm rev-parse: add an --output-object-format parameter repository: implement extensions.compatObjectFormat object-file: update object_info_extended to reencode objects object-file-convert: convert commits that embed signed tags object-file-convert: convert commit objects when writing object-file-convert: don't leak when converting tag objects object-file-convert: convert tag objects when writing object-file-convert: add a function to convert trees between algorithms object: factor out parse_mode out of fast-import and tree-walk into in object.h cache: add a function to read an OID of a specific algorithm tag: sign both hashes commit: export add_header_signature to support handling signatures on tags ...	2024-03-28 14:13:50 -07:00
Jeff King	6cd05e768b	upload-pack: free tree buffers after parsing When a client sends us a "want" or "have" line, we call parse_object() to get an object struct. If the object is a tree, then the parsed state means that tree->buffer points to the uncompressed contents of the tree. But we don't really care about it. We only really need to parse commits and tags; for trees and blobs, the important output is just a "struct object" with the correct type. But much worse, we do not ever free that tree buffer. It's not leaked in the traditional sense, in that we still have a pointer to it from the global object hash. But if the client requests many trees, we'll hold all of their contents in memory at the same time. Nobody really noticed because it's rare for clients to directly request a tree. It might happen for a lightweight tag pointing straight at a tree, or it might happen for a "tree:depth" partial clone filling in missing trees. But it's also possible for a malicious client to request a lot of trees, causing upload-pack's memory to balloon. For example, without this patch, requesting every tree in git.git like: pktline() { local msg="$*" printf "%04x%s\n" $((1+4+${#msg})) "$msg" } want_trees() { pktline command=fetch printf 0001 git cat-file --batch-all-objects --batch-check='%(objectname) %(objecttype)' \| while read oid type; do test "$type" = "tree" \|\| continue pktline want $oid done pktline done printf 0000 } want_trees \| GIT_PROTOCOL=version=2 valgrind --tool=massif ./git upload-pack . >/dev/null shows a peak heap usage of ~3.7GB. Which is just about the sum of the sizes of all of the uncompressed trees. For linux.git, it's closer to 17GB. So the obvious thing to do is to call free_tree_buffer() after we realize that we've parsed a tree. We know that upload-pack won't need it later. But let's push the logic into parse_object_with_flags(), telling it to discard the tree buffer immediately. There are two reasons for this. One, all of the relevant call-sites already call the with_options variant to pass the SKIP_HASH flag. So it actually ends up as less code than manually free-ing in each spot. And two, it enables an extra optimization that I'll discuss below. I've touched all of the sites that currently use SKIP_HASH in upload-pack. That drops the peak heap of the upload-pack invocation above from 3.7GB to ~24MB. I've also modified the caller in get_reference(); a partial clone benefits from its use in pack-objects for the reasons given in `0bc2557951` (upload-pack: skip parse-object re-hashing of "want" objects, 2022-09-06), where we were measuring blob requests. But note that the results of get_reference() are used for traversing, as well; so we really would _eventually_ use the tree contents. That makes this at first glance a space/time tradeoff: we won't hold all of the trees in memory at once, but we'll have to reload them each when it comes time to traverse. And here's where our extra optimization comes in. If the caller is not going to immediately look at the tree contents, and it doesn't care about checking the hash, then parse_object() can simply skip loading the tree entirely, just like we do for blobs! And now it's not a space/time tradeoff in get_reference() anymore. It's just a lazy-load: we're delaying reading the tree contents until it's time to actually traverse them one by one. And of course for upload-pack, this optimization means we never load the trees at all, saving lots of CPU time. Timing the "every tree from git.git" request above shows upload-pack dropping from 32 seconds of CPU to 19 (the remainder is mostly due to pack-objects actually sending the pack; timing just the upload-pack portion shows we go from 13s to ~0.28s). These are all highly gamed numbers, of course. For real-world partial-clone requests we're saving only a small bit of time in practice. But it does help harden upload-pack against malicious denial-of-service attacks. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2024-02-28 14:42:01 -08:00
Eric W. Biederman	45b3b12141	object: factor out parse_mode out of fast-import and tree-walk into in object.h builtin/fast-import.c and tree-walk.c have almost identical version of get_mode. The two functions started out the same but have diverged slightly. The version in fast-import changed mode to a uint16_t to save memory. The version in tree-walk started erroring if no mode was present. As far as I can tell both of these changes are valid for both of the callers, so add the both changes and place the common parsing helper in object.h Rename the helper from get_mode to parse_mode so it does not conflict with another helper named get_mode in diff-no-index.c This will be used shortly in a new helper decode_tree_entry_raw which is used to compute cmpatibility objects as part of the sha256 transition. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-10-02 14:57:39 -07:00
Junio C Hamano	f2ffc74186	Merge branch 'tb/pack-bitmap-traversal-with-boundary' The object traversal using reachability bitmap done by "pack-object" has been tweaked to take advantage of the fact that using "boundary" commits as representative of all the uninteresting ones can save quite a lot of object enumeration. * tb/pack-bitmap-traversal-with-boundary: pack-bitmap.c: use commit boundary during bitmap traversal pack-bitmap.c: extract `fill_in_bitmap()` object: add object_array initializer helper function	2023-06-22 16:29:05 -07:00
Taylor Blau	fe90355361	object: add object_array initializer helper function The object_array API has an OBJECT_ARRAY_INIT macro, but lacks a function to initialize an object_array at a given location in memory. Introduce `object_array_init()` to implement such a function. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-05-08 12:05:55 -07:00
Elijah Newren	d1cbe1e6d8	hash-ll.h: split out of hash.h to remove dependency on repository.h hash.h depends upon and includes repository.h, due to the definition and use of the_hash_algo (defined as the_repository->hash_algo). However, most headers trying to include hash.h are only interested in the layout of the structs like object_id. Move the parts of hash.h that do not depend upon repository.h into a new file hash-ll.h (the "low level" parts of hash.h), and adjust other files to use this new header where the convenience inline functions aren't needed. This allows hash.h and object.h to be fairly small, minimal headers. It also exposes a lot of hidden dependencies on both path.h (which was brought in by repository.h) and repository.h (which was previously implicitly brought in by object.h), so also adjust other files to be more explicit about what they depend upon. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-04-24 12:47:32 -07:00
Elijah Newren	8876ea83a7	object.h: move some inline functions and defines from cache.h The object_type() inline function is very tied to the enum object_type declaration within object.h, and just seemed to make more sense to live there. That makes S_ISGITLINK and some other defines make sense to go with it, as well as the create_ce_mode() and canon_mode() inline functions. Move all these inline functions and defines from cache.h to object.h. Signed-off-by: Elijah Newren <newren@gmail.com> Acked-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-04-11 08:52:10 -07:00
Elijah Newren	a64215b6cd	object.h: stop depending on cache.h; make cache.h depend on object.h Things should be able to depend on object.h without pulling in all of cache.h. Move an enum to allow this. Note that a couple files previously depended on things brought in through cache.h indirectly (revision.h -> commit.h -> object.h -> cache.h). As such, this change requires making existing dependencies more explicit in half a dozen files. The inclusion of strbuf.h in some headers if of particular note: these headers directly embedded a strbuf in some new structs, meaning they should have been including strbuf.h all along but were indirectly getting the necessary definitions. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-02-23 17:25:29 -08:00
Jeff King	c868d8e91f	parse_object(): allow skipping hash check The parse_object() function checks the object hash of any object it parses. This is a nice feature, as it means we may catch bit corruption during normal use, rather than waiting for specific fsck operations. But it also can be slow. It's particularly noticeable for blobs, where except for the hash check, we could return without loading the object contents at all. Now one may wonder what is the point of calling parse_object() on a blob in the first place then, but usually it's not intentional: we were fed an oid from somewhere, don't know the type, and want an object struct. For commits and trees, the parsing is usually helpful; we're about to look at the contents anyway. But this is less true for blobs, where we may be collecting them as part of a reachability traversal, etc, and don't actually care what's in them. And blobs, of course, tend to be larger. We don't want to just throw out the hash-checks for blobs, though. We do depend on them in some circumstances (e.g., rev-list --verify-objects uses parse_object() to check them). It's only the callers that know how they're going to use the result. And so we can help them by providing a special flag to skip the hash check. We could just apply this to blobs, as they're going to be the main source of performance improvement. But if a caller doesn't care about checking the hash, we might as well skip it for other object types, too. Even though we can't avoid reading the object contents, we can still skip the actual hash computation. If this seems like it is making Git a little bit less safe against corruption, it may be. But it's part of a series of tradeoffs we're already making. For instance, "rev-list --objects" does not open the contents of blobs it prints. And when a commit graph is present, we skip opening most commits entirely. The important thing will be to use this flag in cases where it's safe to skip the check. For instance, when serving a pack for a fetch, we know the client will fully index the objects and do a connectivity check itself. There's little to be gained from the server side re-hashing a blob itself. And indeed, most of the time we don't! The revision machinery won't open up a blob reached by traversal, but only one requested directly with a "want" line. So applied properly, this new feature shouldn't make anything less safe in practice. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-09-07 12:18:57 -07:00
Elijah Newren	257418c590	revision: allow --ancestry-path to take an argument We have long allowed users to run e.g. git log --ancestry-path master..seen which shows all commits which satisfy all three of these criteria: * are an ancestor of seen * are not an ancestor of master * have master as an ancestor This commit allows another variant: git log --ancestry-path=$TOPIC master..seen which shows all commits which satisfy all of these criteria: * are an ancestor of seen * are not an ancestor of master * have $TOPIC in their ancestry-path that last bullet can be defined as commits meeting any of these criteria: * are an ancestor of $TOPIC * have $TOPIC as an ancestor * are $TOPIC This also allows multiple --ancestry-path arguments, which can be used to find commits with any of the given topics in their ancestry path. Signed-off-by: Elijah Newren <newren@gmail.com> Acked-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-08-19 10:45:08 -07:00
John Cai	7d3d226e70	reflog: libify delete reflog function and helpers Currently stash shells out to reflog in order to delete refs. In an effort to reduce how much we shell out to a subprocess, libify the functionality that stash needs into reflog.c. Add a reflog_delete function that is pretty much the logic in the while loop in builtin/reflog.c cmd_reflog_delete(). This is a function that builtin/reflog.c and builtin/stash.c can both call. Also move functions needed by reflog_delete and export them. Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: John Cai <johncai86@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-03-02 15:24:47 -08:00
Ævar Arnfjörð Bjarmason	9865b6e6a4	.[ch] _INIT macros: use { 0 } for a "zero out" idiom In C it isn't required to specify that all members of a struct are zero'd out to 0, NULL or '\0', just providing a "{ 0 }" will accomplish that. Let's also change code that provided N zero'd fields to just provide one, and change e.g. "{ NULL }" to "{ 0 }" for consistency. I.e. even if the first member is a pointer let's use "0" instead of "NULL". The point of using "0" consistently is to pick one, and to not have the reader wonder why we're not using the same pattern everywhere. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-09-27 14:47:59 -07:00
Taylor Blau	b0173340c6	builtin/pack-objects.c: remove duplicate hash lookup In the original code from `08cdfb1337` (pack-objects --keep-unreachable, 2007-09-16), we add each object to the packing list with type `obj->type`, where `obj` comes from `lookup_unknown_object()`. Unless we had already looked up and parsed the object, this will be `OBJ_NONE`. That's fine, since oe_set_type() sets the type_valid bit to '0', and we determine the real type later on. So the only thing we need from the object lookup is access to the `flags` field so that we can mark that we've added the object with `OBJECT_ADDED` to avoid adding it again (we can just pass `OBJ_NONE` directly instead of grabbing it from the object). But add_object_entry() already rejects duplicates! This has been the behavior since `7a979d99ba` (Thin pack - create packfile with missing delta base., 2006-02-19), but `08cdfb1337` didn't take advantage of it. Moreover, to do the OBJECT_ADDED check, we have to do a hash lookup in `obj_hash`. So we can drop the lookup_unknown_object() call completely, and the OBJECT_ADDED flag, too, since the spot we're touching here is the only location that checks it. In the end, we perform the same number of hash lookups, but with the added bonus that we don't waste memory allocating an OBJ_NONE object (if we were traversing, we'd need it eventually, but the whole point of this code path is not to traverse). Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-08-29 23:25:43 -07:00
Jeff King	7463064b28	object.h: add lookup_object_by_type() function In some cases it's useful for efficiency reasons to get the type of an object before deciding whether to parse it, but we still want an object struct. E.g., in reachable.c, bitmaps give us the type, but we just want to mark flags on each object. Likewise, we may loop over every object and only parse tags in order to peel them; checking the type first lets us avoid parsing the non-tags. But our lookup_blob(), etc, functions make getting an object struct annoying: we have to call the right function for every type. And we cannot just use the generic lookup_object(), because it only returns an already-seen object; it won't allocate a new object struct. Let's provide a function that dispatches to the correct lookup_* function based on a run-time type. In fact, reachable.c already has such a helper, so we'll just make that public. I did change the return type from "void " to "struct object ". While the former is a clever way to avoid casting inside the function, it's less safe and less informative to people reading the function declaration. The next commit will add a new caller. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-06-28 20:30:18 -07:00
Jeff King	542d6abbb4	object.h: expand docstring for lookup_unknown_object() The lookup_unknown_object() system is not often used and is somewhat confusing. Let's try to explain it a bit more (which is especially important as I'm adding a related but slightly different function in the next commit). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-06-28 20:30:17 -07:00
Junio C Hamano	644f4a2046	Merge branch 'jt/push-negotiation' "git push" learns to discover common ancestor with the receiving end over protocol v2. * jt/push-negotiation: send-pack: support push negotiation fetch: teach independent negotiation (no packfile) fetch-pack: refactor command and capability write fetch-pack: refactor add_haves() fetch-pack: refactor process_acks()	2021-05-16 21:05:22 +09:00
Jonathan Tan	9c1e657a8f	fetch: teach independent negotiation (no packfile) Currently, the packfile negotiation step within a Git fetch cannot be done independent of sending the packfile, even though there is at least one application wherein this is useful. Therefore, make it possible for this negotiation step to be done independently. A subsequent commit will use this for one such application - push negotiation. This feature is for protocol v2 only. (An implementation for protocol v0 would require a separate implementation in the fetch, transport, and transport helper code.) In the protocol, the main hindrance towards independent negotiation is that the server can unilaterally decide to send the packfile. This is solved by a "wait-for-done" argument: the server will then wait for the client to say "done". In practice, the client will never say it; instead it will cease requests once it is satisfied. In the client, the main change lies in the transport and transport helper code. fetch_refs_via_pack() performs everything needed - protocol version and capability checks, and the negotiation itself. There are 2 code paths that do not go through fetch_refs_via_pack() that needed to be individually excluded: the bundle transport (excluded through requiring smart_options, which the bundle transport doesn't support) and transport helpers that do not support takeover. If or when we support independent negotiation for protocol v0, we will need to modify these 2 code paths to support it. But for now, report failure if independent negotiation is requested in these cases. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-05-05 10:41:29 +09:00
Jeff King	45a187cc34	lookup_unknown_object(): take a repository argument All of the other lookup_foo() functions take a repository argument, but lookup_unknown_object() was never converted, and it uses the_repository internally. Let's fix that. We could leave a wrapper that uses the_repository, but there aren't that many calls, so we'll just convert them all. I looked briefly at each site to see if we had a repository struct (besides the_repository) we could pass, but none of them do (so this conversion to pass the_repository is a pure noop in each case, though it does take us one step closer to eventually getting rid of the_repository). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-04-13 13:18:46 -07:00
René Scharfe	cd8888452c	object: allow clear_commit_marks_all to handle any repo Allow callers to specify the repository to use. Rename the function to repo_clear_commit_marks to document its new scope. No functional change intended. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-10-31 10:46:34 -07:00
Derrick Stolee	4ddc79b2da	maintenance: add auto condition for commit-graph task Instead of writing a new commit-graph in every 'git maintenance run --auto' process (when maintenance.commit-graph.enalbed is configured to be true), only write when there are "enough" commits not in a commit-graph file. This count is controlled by the maintenance.commit-graph.auto config option. To compute the count, use a depth-first search starting at each ref, and leaving markers using the SEEN flag. If this count reaches the limit, then terminate early and start the task. Otherwise, this operation will peel every ref and parse the commit it points to. If these are all in the commit-graph, then this is typically a very fast operation. Users with many refs might feel a slow-down, and hence could consider updating their limit to be very small. A negative value will force the step to run every time. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-09-17 11:30:05 -07:00
Junio C Hamano	24ecfdf206	Merge branch 'tb/fix-persistent-shallow' into master When "fetch.writeCommitGraph" configuration is set in a shallow repository and a fetch moves the shallow boundary, we wrote out broken commit-graph files that do not match the reality, which has been corrected. * tb/fix-persistent-shallow: commit.c: don't persist substituted parents when unshallowing	2020-07-09 14:00:44 -07:00
Taylor Blau	ce16364e89	commit.c: don't persist substituted parents when unshallowing Since `37b9dcabfc` (shallow.c: use '{commit,rollback}_shallow_file', 2020-04-22), Git knows how to reset stat-validity checks for the $GIT_DIR/shallow file, allowing it to change between a shallow and non-shallow state in the same process (e.g., in the case of 'git fetch --unshallow'). However, when $GIT_DIR/shallow changes, Git does not alter or remove any grafts (nor substituted parents) in memory. This comes up in a "git fetch --unshallow" with fetch.writeCommitGraph set to true. Ordinarily in a shallow repository (and before `37b9dcabfc`, even in this case), commit_graph_compatible() would return false, indicating that the repository should not be used to write a commit-graphs (since commit-graph files cannot represent a shallow history). But since `37b9dcabfc`, in an --unshallow operation that check succeeds. Thus even though the repository isn't shallow any longer (that is, we have all of the objects), the in-core representation of those objects still has munged parents at the shallow boundaries. When the commit-graph write proceeds, we use the incorrect parentage, producing wrong results. There are two ways for a user to work around this: either (1) set 'fetch.writeCommitGraph' to 'false', or (2) drop the commit-graph after unshallowing. One way to fix this would be to reset the parsed object pool entirely (flushing the cache and thus preventing subsequent reads from modifying their parents) after unshallowing. That would produce a problem when callers have a now-stale reference to the old pool, and so this patch implements a different approach. Instead, attach a new bit to the pool, 'substituted_parent', which indicates if the repository ever stored a commit which had its parents modified (i.e., the shallow boundary prior to unshallowing). This bit needs to be sticky because all reads subsequent to modifying a commit's parents are unreliable when unshallowing. Modify the check in 'commit_graph_compatible' to take this bit into account, and correctly avoid generating commit-graphs in this case, thus solving the bug. Helped-by: Derrick Stolee <dstolee@microsoft.com> Helped-by: Jonathan Nieder <jrnieder@gmail.com> Reported-by: Jay Conrod <jayconrod@google.com> Reviewed-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-07-08 16:13:46 -07:00
Junio C Hamano	480e78595e	Merge branch 'rs/pack-bits-in-object-better' By renumbering object flag bits, "struct object" managed to lose bloated inter-field padding. * rs/pack-bits-in-object-better: revision: reallocate TOPO_WALK object flags	2020-07-06 22:09:17 -07:00
Junio C Hamano	67d99b82de	Merge branch 'bc/http-push-flagsfix' The code to push changes over "dumb" HTTP had a bad interaction with the commit reachability code due to incorrect allocation of object flag bits, which has been corrected. * bc/http-push-flagsfix: http-push: ensure unforced pushes fail when data would be lost	2020-07-06 22:09:17 -07:00
René Scharfe	23c4319f0d	revision: reallocate TOPO_WALK object flags The bit fields in struct object have an unfortunate layout. Here's what pahole reports on x86_64 GNU/Linux: struct object { unsigned int parsed:1; /* 0: 0 4 / unsigned int type:3; / 0: 1 4 / / XXX 28 bits hole, try to pack / / Force alignment to the next boundary: / unsigned int :0; unsigned int flags:29; / 4: 0 4 / / XXX 3 bits hole, try to pack / struct object_id oid; / 8 32 / / size: 40, cachelines: 1, members: 4 / / sum members: 32 / / sum bitfield members: 33 bits, bit holes: 2, sum bit holes: 31 bits / / last cacheline: 40 bytes / }; Notice the 1+3+29=33 bits in bit fields and 28+3=31 bits in holes. There are holes inside the flags bit field as well -- while some object flags are used for more than one purpose, 22, 23 and 24 are still free. Use 23 and 24 instead of 27 and 28 for TOPO_WALK_EXPLORED and TOPO_WALK_INDEGREE. This allows us to reduce FLAG_BITS by one so that all bitfields combined fit into a single 32-bit slot: struct object { unsigned int parsed:1; / 0: 0 4 / unsigned int type:3; / 0: 1 4 / unsigned int flags:28; / 0: 4 4 / struct object_id oid; / 4 32 / / size: 36, cachelines: 1, members: 4 / / last cacheline: 36 bytes */ }; With this tight packing the size of struct object is reduced by 10%. Other architectures probably benefit as well. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-06-24 09:09:44 -07:00

1 2 3 4

163 Commits