Commit Graph

75277 Commits

Patrick Steinhardt
eb22c1b46b reftable/stack: add mechanism to notify callers on reload
Reftable stacks are reloaded in two cases:

  - When calling `reftable_stack_reload()`, if the stat-cache tells us
    that the stack has been modified.

  - When committing a reftable addition.

While callers can figure out the second case, they do not have a
mechanism to figure out whether `reftable_stack_reload()` led to an
actual reload of the on-disk data. Thus, all they can do is assume that
the data is reloaded every time.

Improve the situation by introducing a new `on_reload()` callback to the
reftable options. If provided, the function will be invoked every time
the stack has indeed been reloaded. This allows callers to invalidate
data that depends on the current stack data.
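
For illustration, here is a minimal sketch of how such a notification
hook could be wired up. The struct and field names are assumptions for
the sake of the example, not necessarily the exact API:

    struct reftable_write_options {
            /* ... existing options ... */

            /*
             * If set, invoked after the stack has actually been reloaded,
             * so that callers can drop caches derived from the old data.
             */
            void (*on_reload)(void *payload);
            void *on_reload_payload;
    };

    /* Called from the reload path once the new tables have been read. */
    static void notify_on_reload(struct reftable_stack *st)
    {
            if (st->opts.on_reload)
                    st->opts.on_reload(st->opts.on_reload_payload);
    }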

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-26 17:18:38 +09:00
Patrick Steinhardt
96e7cb83b6 refs/reftable: refactor reflog expiry to use reftable backend
Refactor the callback function that expires reflog entries in the
reftable backend to use `reftable_backend_read_ref()` instead of
accessing the reftable stack directly. This ensures that the function
will benefit from the new caching layer that we're about to introduce.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-26 17:18:37 +09:00
Patrick Steinhardt
ad6c41f4b7 refs/reftable: refactor reading symbolic refs to use reftable backend
Refactor the callback function that reads symbolic references in the
reftable backend to use `reftable_backend_read_ref()` instead of
accessing the reftable stack directly. This ensures that the function
will benefit from the new caching layer that we're about to introduce.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-26 17:18:37 +09:00
Patrick Steinhardt
27fdf8f4ed refs/reftable: read references via struct reftable_backend
Refactor `read_ref_without_reload()` to accept `struct reftable_backend`
as parameter instead of `struct reftable_stack`. Rename the function to
`reftable_backend_read_ref()` to clarify its scope and move it close to
other functions operating on `struct reftable_backend`.

This change allows us to implement an additional caching layer when
reading refs where we can reuse reftable iterators.
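
Roughly, the calling convention changes as sketched below; the parameter
lists are approximations of the real ones:

    /* Before: callers passed the reftable stack directly. */
    static int read_ref_without_reload(struct reftable_stack *stack,
                                       const char *refname,
                                       struct object_id *oid,
                                       struct strbuf *referent,
                                       unsigned int *type);

    /* After: callers go through the backend wrapper. */
    static int reftable_backend_read_ref(struct reftable_backend *be,
                                         const char *refname,
                                         struct object_id *oid,
                                         struct strbuf *referent,
                                         unsigned int *type);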

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-26 17:18:37 +09:00
Patrick Steinhardt
3ec8022bb0 refs/reftable: figure out hash via reftable_stack
The function `read_ref_without_reload()` accepts a ref store as input
only so that we can figure out the hash function used by it. This is
duplicate information though because the reftable stack knows about its
hash function, too.

Drop the superfluous parameter to simplify the calling convention a bit.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-26 17:18:37 +09:00
Patrick Steinhardt
c9f76fc7d1 reftable/stack: add accessor for the hash ID
Add an accessor function that allows callers to access the hash ID of a
reftable stack. This function will be used in a subsequent commit.
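
A sketch of what such an accessor might look like; the function name,
return type and plumbing are assumptions:

    enum reftable_hash reftable_stack_hash_id(struct reftable_stack *st)
    {
            /* the stack already knows which hash its tables use */
            return reftable_merged_table_hash_id(reftable_stack_merged_table(st));
    }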

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-26 17:18:36 +09:00
Patrick Steinhardt
46b5f67019 refs/reftable: handle reloading stacks in the reftable backend
When accessing a stack we almost always have to reload the stack before
reading data from it. This is mostly because Git does not have a
notification mechanism for when underlying data has been changed, and
thus we are forced to opportunistically reload the stack every single
time to account for any changes that may have happened concurrently.

Handle the reload internally in `backend_for()`. For one, this forces
callsites to think about whether or not they need to reload the stack.
Second, it makes the logic to access stacks more self-contained by
letting each `struct reftable_backend` manage itself.

Update callsites where we don't reload the stack to document why we
don't. In some cases it's unclear whether it is the right thing to do in
the first place, but fixing that is outside of the scope of this patch
series.
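
A sketch of the reworked lookup; the signature is an approximation and
`backend_for_refname()` stands in for whatever helper resolves the
worktree-specific stack:

    static int backend_for(struct reftable_backend **out,
                           struct reftable_ref_store *store,
                           const char *refname,
                           const char **rewritten_ref,
                           unsigned int reload)
    {
            struct reftable_backend *be;
            int ret;

            /* resolve whether refname lives in the main or a worktree stack */
            be = backend_for_refname(store, refname, rewritten_ref);

            if (reload) {
                    ret = reftable_stack_reload(be->stack);
                    if (ret)
                            return ret;
            }

            *out = be;
            return 0;
    }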

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-26 17:18:36 +09:00
Patrick Steinhardt
ad0986c676 refs/reftable: encapsulate reftable stack
The reftable ref store needs to keep track of multiple stacks, one for
the main worktree and an arbitrary number of stacks for worktrees. This
is done by storing pointers to `struct reftable_stack`, which we then
access directly.

Wrap the stack in a new `struct reftable_backend`. This will allow us to
attach more data to each respective stack in subsequent commits.
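
A minimal sketch of the wrapper; any fields beyond the stack pointer are
illustrative placeholders for what later commits attach:

    struct reftable_backend {
            struct reftable_stack *stack;
            /*
             * Room for per-stack state added by subsequent commits, for
             * example a cached iterator used when reading refs.
             */
    };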

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-26 17:18:36 +09:00
Junio C Hamano
b8558e6abd Merge branch 'ps/reftable-detach' into ps/reftable-iterator-reuse
* ps/reftable-detach:
  reftable/system: provide thin wrapper for lockfile subsystem
  reftable/stack: drop only use of `get_locked_file_path()`
  reftable/system: provide thin wrapper for tempfile subsystem
  reftable/stack: stop using `fsync_component()` directly
  reftable/system: stop depending on "hash.h"
  reftable: explicitly handle hash format IDs
  reftable/system: move "dir.h" to its only user
2024-11-19 12:24:33 +09:00
Patrick Steinhardt
988e7f5e95 reftable/system: provide thin wrapper for lockfile subsystem
We use the lockfile subsystem to write lockfiles for "tables.list". As
with the tempfile subsystem, the lockfile subsystem also hooks into our
infrastructure to prune stale locks via atexit(3p) or signal handlers.

Furthermore, the lockfile subsystem also handles locking timeouts, which
do add quite a bit of logic. Having to reimplement that in the context
of Git wouldn't make a whole lot of sense, and it is quite likely that
downstream users of the reftable library may have a better idea for how
exactly to implement timeouts.

So again, provide a thin wrapper for the lockfile subsystem instead such
that the compatibility shim is fully self-contained.
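
A sketch of what such a shim in "reftable/system.c" could look like. The
wrapper names are assumptions; the lockfile.h calls underneath are Git's
existing API:

    struct reftable_flock {
            struct lock_file lk;
            int fd;
    };

    int reftable_flock_acquire(struct reftable_flock *l,
                               const char *path, long timeout_ms)
    {
            l->fd = hold_lock_file_for_update_timeout(&l->lk, path, 0,
                                                      timeout_ms);
            return l->fd < 0 ? -1 : 0;
    }

    int reftable_flock_commit(struct reftable_flock *l)
    {
            return commit_lock_file(&l->lk);
    }

    void reftable_flock_release(struct reftable_flock *l)
    {
            rollback_lock_file(&l->lk);
    }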

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-19 12:23:11 +09:00
Patrick Steinhardt
6361226b79 reftable/stack: drop only use of get_locked_file_path()
We've got a single callsite where we call `get_locked_file_path()`. As
we're about to convert our usage of the lockfile subsystem to go through
a compatibility shim, we'd have to implement more logic for this single
callsite. While that would be okay if Git were the only intended user of
the reftable library, it's a bit more awkward considering that we would
eventually have to reimplement this functionality for every user of the
library.

Refactor the code such that we don't call `get_locked_file_path()`
anymore.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-19 12:23:10 +09:00
Patrick Steinhardt
01e49941d6 reftable/system: provide thin wrapper for tempfile subsystem
We use the tempfile subsystem to write temporary tables, but given that
we're in the process of converting the reftable library to become
standalone we cannot use this subsystem directly anymore. While we could
in theory convert the code to use mkstemp(3p) instead, we'd lose access
to our infrastructure that automatically prunes tempfiles via atexit(3p)
or signal handlers.

Provide a thin wrapper for the tempfile subsystem instead. Like this,
the compatibility shim is fully self-contained in "reftable/system.c".
Downstream users of the reftable library would have to implement their
own tempfile shims by replacing "system.c" with a custom version.
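
A sketch of such a shim; the wrapper names are assumptions, while the
tempfile.h functions used underneath are Git's existing API:

    struct reftable_tmpfile {
            struct tempfile *t;
    };

    int reftable_tmpfile_create(struct reftable_tmpfile *out,
                                const char *pattern)
    {
            out->t = mks_tempfile(pattern);
            return out->t ? get_tempfile_fd(out->t) : -1;
    }

    int reftable_tmpfile_rename(struct reftable_tmpfile *t, const char *path)
    {
            return rename_tempfile(&t->t, path);
    }

    void reftable_tmpfile_delete(struct reftable_tmpfile *t)
    {
            delete_tempfile(&t->t);
    }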

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-19 12:23:10 +09:00
Patrick Steinhardt
86b770b0bb reftable/stack: stop using fsync_component() directly
We're executing `fsync_component()` directly in the reftable library so
that we can fsync data to disk depending on "core.fsync". But as we're
in the process of converting the reftable library to become standalone
we cannot use that function in the library anymore.

Refactor the code such that users of the library can inject a custom
fsync function via the write options. This allows us to get rid of the
dependency on "write-or-die.h".
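
As a sketch, the write options could gain a hook like the following; the
field and function names are assumptions:

    struct reftable_write_options {
            /* ... existing options ... */

            /* If NULL, the library falls back to plain fsync(2). */
            int (*fsync)(int fd);
    };

    /* Git-side hook that preserves "core.fsync" behaviour: */
    static int reftable_be_fsync(int fd)
    {
            return fsync_component(FSYNC_COMPONENT_REFERENCE, fd);
    }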

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-19 12:23:10 +09:00
Patrick Steinhardt
c2f08236ed reftable/system: stop depending on "hash.h"
We include "hash.h" in "reftable/system.h" such that we can use hash
format IDs as well as the raw size of SHA1 and SHA256. As we are in the
process of converting the reftable library to become standalone we of
course cannot rely on those constants anymore.

Introduce a new `enum reftable_hash` to replace internal uses of the
hash format IDs and new constants that replace internal uses of the hash
size. Adapt the reftable backend to set up the correct hash function.
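
A sketch of the new definitions; the exact enumerator values are
illustrative:

    enum reftable_hash {
            REFTABLE_HASH_SHA1   = 1,
            REFTABLE_HASH_SHA256 = 2,
    };

    #define REFTABLE_HASH_SIZE_SHA1   20
    #define REFTABLE_HASH_SIZE_SHA256 32
    #define REFTABLE_HASH_SIZE_MAX    REFTABLE_HASH_SIZE_SHA256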

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-19 12:23:10 +09:00
Patrick Steinhardt
88e297275b reftable: explicitly handle hash format IDs
The hash format IDs are used for two different things across the
reftable codebase:

  - They are used as a 32 bit unsigned integer when reading and writing
    the header in order to identify the hash function.

  - They are used internally to identify which hash function is in use.

When one only considers the second use case, one might think that the
representation of those hash IDs can easily be changed. But because
these IDs end up in the reftable header and footer on disk, it is
important that they never change.

Create separate constants `REFTABLE_FORMAT_ID_*` and use them in
contexts where we read or write reftable headers. This serves multiple
purposes:

  - It allows us to more easily discern cases where we actually use
    those constants for the on-disk format.

  - It detangles us from the same constants that are defined in
    libgit.a, which is another required step to convert the reftable
    library to become standalone.

  - It makes the next step easier where we stop using `GIT_*_FORMAT_ID`
    constants in favor of a custom enum.
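
As a sketch, the new constants simply spell out the format IDs that
already live in the on-disk format (the ASCII bytes "sha1" and "s256"),
which is why they must never change:

    #define REFTABLE_FORMAT_ID_SHA1   ((uint32_t)0x73686131) /* "sha1" */
    #define REFTABLE_FORMAT_ID_SHA256 ((uint32_t)0x73323536) /* "s256" */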

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-19 12:23:09 +09:00
Patrick Steinhardt
17e8039878 reftable/system: move "dir.h" to its only user
We still include "dir.h" in "reftable/system.h" even though it is not
used by anything but a single unit test. Move it over into that unit
test so that we don't accidentally use any functionality it provides
within the reftable codebase.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-19 12:23:08 +09:00
Taylor Blau
8f8d6eee53 The seventh batch 2024-11-01 12:59:31 -04:00
Taylor Blau
1c5a712f26 Merge branch 'jk/dumb-http-finalize'
The dumb-http code regressed when the result of re-indexing a pack
yielded an *.idx file that differs in content from the *.idx file it
downloaded from the remote. This has been corrected by no longer
relying on the *.idx file we got from the remote.

* jk/dumb-http-finalize:
  packfile: use oidread() instead of hashcpy() to fill object_id
  packfile: use object_id in find_pack_entry_one()
  packfile: convert find_sha1_pack() to use object_id
  http-walker: use object_id instead of bare hash
  packfile: warn people away from parse_packed_git()
  packfile: drop sha1_pack_index_name()
  packfile: drop sha1_pack_name()
  packfile: drop has_pack_index()
  dumb-http: store downloaded pack idx as tempfile
  t5550: count fetches in "previously-fetched .idx" test
  midx: avoid duplicate packed_git entries
2024-11-01 12:53:32 -04:00
Taylor Blau
6d81fe64dd Merge branch 'kh/update-ref'
Documentation updates to 'git-update-ref(1)'.

* kh/update-ref:
  Documentation: mutually link update-ref and symbolic-ref
  Documentation/git-update-ref.txt: discuss symbolic refs
  Documentation/git-update-ref.txt: remove confusing paragraph
  Documentation/git-update-ref.txt: demote symlink to last section
  Documentation/git-update-ref.txt: remove safety paragraphs
  Documentation/git-update-ref.txt: drop “flag”
2024-11-01 12:53:30 -04:00
Taylor Blau
aebc4bd8ce Merge branch 'ak/more-typofixes'
More typofixes.

* ak/more-typofixes:
  t: fix typos
2024-11-01 12:53:29 -04:00
Taylor Blau
43ac23945c Merge branch 'rs/grep-lookahead'
Fix 'git grep' regression on macOS by disabling lookahead when
encountering invalid UTF-8 byte sequences.

* rs/grep-lookahead:
  grep: disable lookahead on error
2024-11-01 12:53:28 -04:00
Taylor Blau
81a5461518 Merge branch 'ak/t1016-cleanup'
Test cleanup.

* ak/t1016-cleanup:
  t1016: clean up style
2024-11-01 12:53:27 -04:00
Taylor Blau
59dc0ab83c Merge branch 'ua/atoi'
Replace various calls to atoi() with strtol_i() and strtoul_ui(), and
add improved error handling.

* ua/atoi:
  imap: replace atoi() with strtol_i() for UIDVALIDITY and UIDNEXT parsing
  merge: replace atoi() with strtol_i() for marker size validation
  daemon: replace atoi() with strtoul_ui() and strtol_i()
2024-11-01 12:53:26 -04:00
Taylor Blau
20ab7fa3b6 Merge branch 'sa/notes-edit'
Teach 'git notes add' and 'git notes append' a new '-e' flag,
instructing them to open the note in $GIT_EDITOR before saving.

* sa/notes-edit:
  notes: teach the -e option to edit messages in editor
2024-11-01 12:53:25 -04:00
Taylor Blau
8237b49ade Merge branch 'sk/t9101-cleanup'
Test cleanup.

* sk/t9101-cleanup:
  t9101: ensure no whitespace after redirect
2024-11-01 12:53:24 -04:00
Taylor Blau
7b1f01f02e Merge branch 'ss/duplicate-typos'
Typofixes.

* ss/duplicate-typos:
  global: Fix duplicate word typos
2024-11-01 12:53:23 -04:00
Taylor Blau
aabbcf2783 Merge branch 'ps/upload-pack-doc'
Documentation update to clarify that 'uploadpack.allowAnySHA1InWant'
implies both 'allowTipSHA1InWant' and 'allowReachableSHA1InWant'.

* ps/upload-pack-doc:
  doc: document how uploadpack.allowAnySHA1InWant impact other allow options
2024-11-01 12:53:22 -04:00
Taylor Blau
07c6066f82 Merge branch 'kh/mv-breakage'
Demonstrate an assertion failure in 'git mv'.

* kh/mv-breakage:
  t7001: add failure test which triggers assertion
2024-11-01 12:53:21 -04:00
Taylor Blau
787297b396 Merge branch 'rj/cygwin-exit'
Treat ECONNABORTED the same as ECONNRESET in 'git credential-cache' to
work around a possible Cygwin regression. This resolves a race condition
caused by changes in Cygwin's handling of socket closures, allowing the
client to exit cleanly when encountering ECONNABORTED.

* rj/cygwin-exit:
  credential-cache: treat ECONNABORTED like ECONNRESET
2024-11-01 12:53:19 -04:00
Taylor Blau
a524cc77ad Merge branch 'ua/t3404-cleanup'
Test update.

* ua/t3404-cleanup:
  t3404: replace test with test_line_count()
  t3404: avoid losing exit status with focus on `git show` and `git cat-file`
2024-11-01 12:53:18 -04:00
Taylor Blau
268fd2fe58 Merge branch 'ps/platform-compat-fixes'
Various platform compatibility fixes split out of the larger effort
to use Meson as the primary build tool.

* ps/platform-compat-fixes:
  t6006: fix prereq handling with `test_format ()`
  http: fix build error on FreeBSD
  builtin/credential-cache: fix missing parameter for stub function
  t7300: work around platform-specific behaviour with long paths on MinGW
  t5500, t5601: skip tests which exercise paths with '[::1]' on Cygwin
  t3404: work around platform-specific behaviour on macOS 10.15
  t1401: make invocation of tar(1) work with Win32-provided one
  t/lib-gpg: fix setup of GNUPGHOME in MinGW
  t/lib-gitweb: test against the build version of gitweb
  t/test-lib: wire up NO_ICONV prerequisite
  t/test-lib: fix quoting of TEST_RESULTS_SAN_FILE
2024-11-01 12:53:17 -04:00
Taylor Blau
47c3170a3e Merge branch 'jc/breaking-changes-early-adopter-option'
Describe the policy to introduce breaking changes.

* jc/breaking-changes-early-adopter-option:
  BreakingChanges: early adopter option
2024-11-01 12:53:14 -04:00
Taylor Blau
23d289d273 The sixth batch 2024-10-30 13:36:44 -04:00
Taylor Blau
6aa984af3b Merge branch 'sk/t7011-cleanup'
Test cleanup.

* sk/t7011-cleanup:
  t7011: ensure no whitespace after redirect
2024-10-30 13:08:07 -04:00
Taylor Blau
8d305fdaac Merge branch 'co/t6050-pipefix'
Avoid losing the exit status of the Git commands being tested by not
placing them on the upstream side of a pipe.

* co/t6050-pipefix:
  t6050: avoid pipes with upstream Git commands
2024-10-30 13:08:06 -04:00
Taylor Blau
cec4461e3f Merge branch 'ks/t4205-fixup'
Testfix.

* ks/t4205-fixup:
  t4205: fix typo in 'NUL termination with --stat'
2024-10-30 13:08:05 -04:00
Taylor Blau
9947803926 Merge branch 'kh/submitting-patches'
Docfix.

* kh/submitting-patches:
  SubmittingPatches: tags -> trailers
2024-10-30 13:08:04 -04:00
Taylor Blau
6f763d798b Merge branch 'ps/ref-filter-sort'
Teaches the ref-filter machinery to recognize and avoid cases where
sorting would be redundant.

* ps/ref-filter-sort:
  ref-filter: format iteratively with lexicographic refname sorting
2024-10-30 13:08:02 -04:00
Taylor Blau
bc627658b0 Merge branch 'ps/reftable-strbuf'
Implements a new reftable-specific strbuf replacement to reduce
reftable's dependency on Git-specific data structures.

* ps/reftable-strbuf:
  reftable: handle trivial `reftable_buf` errors
  reftable/stack: adapt `stack_filename()` to handle allocation failures
  reftable/record: adapt `reftable_record_key()` to handle allocation failures
  reftable/stack: adapt `format_name()` to handle allocation failures
  t/unit-tests: check for `reftable_buf` allocation errors
  reftable/blocksource: adapt interface name
  reftable: convert from `strbuf` to `reftable_buf`
  reftable/basics: provide new `reftable_buf` interface
  reftable: stop using `strbuf_addf()`
  reftable: stop using `strbuf_addbuf()`
2024-10-30 13:08:01 -04:00
Patrick Steinhardt
dd6003f200 t6006: fix prereq handling with test_format ()
In df383b5842 (t/test-lib: wire up NO_ICONV prerequisite, 2024-10-16) we
introduced a new NO_ICONV prerequisite that makes us skip tests in
case Git is not compiled with support for iconv. This change subtly
broke t6006: while the test suite still passes, some of its tests won't
execute because they run into an error.

    ./t6006-rev-list-format.sh: line 92: test_expect_%e: command not found

The broken tests use `test_format ()`, and the mentioned commit simply
prepended the new prerequisite to its arguments. But that does not work,
as the function is not aware of prereqs at all and will now treat all of
its arguments incorrectly.

Fix this by making the function aware of prereqs by accepting an
optional fourth argument. Adapt the callsites accordingly.

Reported-by: Josh Steadmon <steadmon@google.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-28 13:44:38 -04:00
Jeff King
863f2459a2 packfile: use oidread() instead of hashcpy() to fill object_id
When chasing a REF_DELTA, we need to pull the raw hash bytes out of the
mmap'd packfile into an object_id struct. We do that with a raw
hashcpy() of the appropriate length (that happens directly now, though
before the previous commit it happened inside find_pack_entry_one(),
also using a hashcpy).

But I think this creates a potentially dangerous situation due to
d4d364b2c7 (hash: convert `oidcmp()` and `oideq()` to compare whole
hash, 2024-06-14). When using sha1, we'll have uninitialized bytes in
the latter part of the object_id.hash buffer, which could fool oideq(),
etc.

We should use oidread() instead, which correctly zero-pads the extra
bytes, as of c98d762ed9 (global: ensure that object IDs are always
padded, 2024-06-14).

As far as I can see, this has not been a problem in practice because the
object_id we feed to find_pack_entry_one() is never used with oideq(),
etc. It is being compared to the bytes mmap'd from a pack idx file,
which of course do not have the extra padding bytes themselves. So
there's no bug here, but this just puzzled me while looking at the code.
We should do the more obviously safe thing, both for future-proofing and
to avoid confusing readers.
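
To illustrate the difference with simplified, hypothetical helpers (this
is not Git's actual API), the point is the zero-padding of the tail of
the fixed-size buffer:

    #include <string.h>

    #define MAX_HASH_LEN 32   /* sized for SHA-256 */
    struct oid { unsigned char hash[MAX_HASH_LEN]; };

    /*
     * hashcpy()-style: copies only the algorithm's length (20 bytes for
     * SHA-1), leaving bytes 20..31 with whatever was there before.
     */
    static void copy_raw(struct oid *dst, const unsigned char *src, size_t len)
    {
            memcpy(dst->hash, src, len);
    }

    /*
     * oidread()-style: also zero-pads the tail, so that whole-buffer
     * comparisons behave deterministically.
     */
    static void copy_padded(struct oid *dst, const unsigned char *src, size_t len)
    {
            memcpy(dst->hash, src, len);
            memset(dst->hash + len, 0, MAX_HASH_LEN - len);
    }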

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-25 17:35:46 -04:00
Jeff King
479ab76c9f packfile: use object_id in find_pack_entry_one()
The main function we use to search a pack index for an object is
find_pack_entry_one(). That function still takes a bare pointer to the
hash, despite the fact that its underlying bsearch_pack() function needs
an object_id struct. And so we end up making an extra copy of the hash
into the struct just to do a lookup.

As it turns out, all callers but one already have such an object_id. So
we can just take a pointer to that struct and use it directly. This
avoids the extra copy and provides a more type-safe interface.

The one exception is get_delta_base() in packfile.c, when we are chasing
a REF_DELTA from inside the pack (and thus we have a pointer directly to
the mmap'd pack memory, not a struct). We can just bump the hashcpy()
from inside find_pack_entry_one() to this one caller that needs it.
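
Roughly, the signature changes as sketched below (the declarations are
approximations):

    /* before: a bare hash pointer, copied into an object_id internally */
    off_t find_pack_entry_one(const unsigned char *hash, struct packed_git *p);

    /* after: callers hand in the object_id they already have */
    off_t find_pack_entry_one(const struct object_id *oid, struct packed_git *p);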

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-25 17:35:46 -04:00
Jeff King
4d99559147 packfile: convert find_sha1_pack() to use object_id
The find_sha1_pack() function has a few problems:

  - it's badly named, since it works with any object hash

  - it takes the hash as a bare pointer rather than an object_id struct

We can fix both of these easily, as all callers actually have a real
object_id anyway.

I also found the existence of this function somewhat confusing, as it is
about looking in an arbitrary set of linked packed_git structs. It's
good for things like dumb-http which are looking in downloaded remote
packs, and not our local packs. But despite the name, it is not a good
way to find the pack which contains a local object (it skips the use of
the midx, the pack mru list, and so on).

So let's also add an explanatory comment above the declaration that may
point people in the right direction.

I suspect the calls in fast-import.c, which use the packed_git list from
the repository struct, could actually just be using find_pack_entry().
But since we'd need to keep it anyway for dumb-http, I didn't dig
further there. If we eventually drop dumb-http support, then it might be
worth examining them to see if we can get rid of the function entirely.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-25 17:35:46 -04:00
Jeff King
0af861e0c8 http-walker: use object_id instead of bare hash
We long ago switched most code to using object_id structs instead of
bare "unsigned char *" hashes. This gives us more type safety from the
compiler, and generally makes it easier to understand what we expect in
each parameter.

But the dumb-http code has lagged behind. And indeed, the whole "walker"
subsystem interface has the same problem, though http-walker is the only
user left.

So let's update the walker interface to pass object_id structs (which we
already have anyway at all call sites!), and likewise use those within
the http-walker methods that it calls.

This cleans up the dumb-http code a bit, but will also let us fix a few
more commonly used helper functions.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-25 17:35:46 -04:00
Jeff King
6b2fc22050 packfile: warn people away from parse_packed_git()
With a name like parse_packed_git(), you might think it's the right way
to access a local pack index and its associated objects. But not so!
It's a one-off used by the dumb-http code to access pack idx files we've
downloaded from the remote, but whose packs we might not have.

There's only one caller left for this function, and ideally we'd drop it
completely and just inline it there. But that would require exposing
other internals from packfile.[ch], like alloc_packed_git() and
check_packed_git_idx().

So let's leave it be for now, and just warn people that it's probably
not what they're looking for. Perhaps in the long run if we eventually
drop dumb-http support, we can remove the function entirely then.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-25 17:35:46 -04:00
Jeff King
4390fea963 packfile: drop sha1_pack_index_name()
Like sha1_pack_name() that we dropped in the previous commit, this
function uses an error-prone static strbuf and the somewhat misleading
name "sha1".

The only caller left is in pack-redundant.c. While this command is
marked for potential removal in our BreakingChanges document, we still
have it for now. But it's simple enough to convert it to use its own
strbuf with the underlying odb_pack_name() function, letting us drop the
otherwise obsolete function.

Note that odb_pack_name() does its own strbuf_reset(), so it's safe to
use directly within a loop like this.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-25 17:35:46 -04:00
Jeff King
c2dc4c9fbb packfile: drop sha1_pack_name()
The sha1_pack_name() function has a few ugly bits:

  - it writes into a static strbuf (and not even a ring buffer of them),
    which can lead to subtle invalidation problems

  - it uses the term "sha1", but it's really using the_hash_algo, which
    could be sha256

There's only one caller of it left. And in fact that caller is better
off using the underlying odb_pack_name() function itself, since it's
just copying the result into its own strbuf anyway.

Converting that caller lets us get rid of this now-obsolete function.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-25 17:35:46 -04:00
Jeff King
03b8eed7f5 packfile: drop has_pack_index()
The has_pack_index() function has several oddities that may make it
surprising if you are trying to find out if we have a pack with some
$hash:

  - it is not looking for a valid pack that we found while searching
    object directories. It just looks for any pack-$hash.idx file in the
    pack directory.

  - it only looks in the local directory, not any alternates

  - it takes a bare "unsigned char" hash, which we try to avoid these
    days

The only caller it has is in the dumb http code; it wants to know if we
already have the pack idx in question. This can happen if we downloaded
the pack (and generated its index) during a previous fetch.

Before the previous patch ("dumb-http: store downloaded pack idx as
tempfile"), it could also happen if we downloaded the .idx from the
remote but didn't get the matching .pack. But since that patch, we don't
hold on to those .idx files. So there's no need to look for the .idx
file in the filesystem; we can just scan through the packed_git list to
see if we have it.

That lets us simplify the dumb http code a bit, as we know that if we
have the .idx we have the matching .pack already. And it lets us get rid
of this odd function that is unlikely to be needed again.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-25 17:35:46 -04:00
Jeff King
63aca3f7f1 dumb-http: store downloaded pack idx as tempfile
This patch fixes a regression in b1b8dfde69 (finalize_object_file():
implement collision check, 2024-09-26) where fetching a v1 pack idx file
over the dumb-http protocol would cause the fetch to fail.

The core of the issue is that dumb-http stores the idx we fetch from the
remote at the same path that will eventually hold the idx we generate
from "index-pack --stdin". The sequence is something like this:

  0. We realize we need some object X, which we don't have locally, and
     nor does the other side have it as a loose object.

  1. We download the list of remote packs from objects/info/packs.

  2. For each entry in that file, we download each pack index and store
     it locally in .git/objects/pack/pack-$hash.idx (the $hash is not
     something we can verify yet and is given to us by the remote).

  3. We check each pack index we got to see if it has object X. When we
     find a match, we download the matching .pack file from the remote
     to a tempfile. We feed that to "index-pack --stdin", which
     reindexes the pack, rather than trusting that it has what the other
     side claims it does. In most cases, this will end up generating the
     exact same (byte-for-byte) pack index which we'll store at the same
     pack-$hash.idx path, because the index generation and $hash id are
     computed based on what's in the packfile. But:

       a. The other side might have used other options to generate the
          index. For instance we use index v2 by default, but long ago
          it was v1 (and you can still ask for v1 explicitly).

       b. The other side might even use a different mechanism to
          determine $hash. E.g., long ago it was based on the sorted
          list of objects in the packfile, but we switched to using the
          pack checksum in 1190a1acf8 (pack-objects: name pack files
          after trailer hash, 2013-12-05).

The regression we saw in the real world was (3a). A recent client
fetching from a server with a v1 index downloaded that index, then
complained about trying to overwrite it with its own v2 index. This
collision is otherwise harmless; we know we want to replace the remote
version with our local one, but the collision check doesn't realize
that.

There are a few options to fix it:

  - we could teach index-pack a command-line option to ignore only pack
    idx collisions, and use it when the dumb-http code invokes
    index-pack. This would be an awkward thing to expose users to and
    would involve a lot of boilerplate to get the option down to the
    collision code.

  - we could delete the remote .idx file right before running
    index-pack. It should be redundant at that point (since we've just
    downloaded the matching pack). But it feels risky to delete
    something from our own .git/objects based on what the other side has
    said. I'm not entirely positive that a malicious server couldn't lie
    about which pack-$hash.idx it has and get us to delete something
    precious.

  - we can stop co-mingling the downloaded idx files in our local
    objects directory. This is a slightly bigger change but I think
    fixes the root of the problem more directly.

This patch implements the third option. The big design questions are:
where do we store the downloaded files, and how do we manage their
lifetimes?

There are some additional quirks to the dumb-http system we should
consider. Remember that in step 2 we downloaded every pack index, but in
step 3 we may only download some of the matching packs. What happens to
those other idx files now? They sit in the .git/objects/pack directory,
possibly waiting to be used at a later date. That may save bandwidth for
a subsequent fetch, but it also creates a lot of weird corner cases:

  - our local object directory now has semi-untrusted .idx files sitting
    around, without their matching .pack

  - in case 3b, we noted that we might not generate the same hash as the
    other side. In that case even if we download the matching pack,
    our index-pack invocation will store it in a different
    pack-$hash.idx file. And the unmatched .idx will sit there forever.

  - if the server repacks, it may delete the old packs. Now we have
    these orphaned .idx files sitting around locally that will never be
    used (nor deleted).

  - if we repack locally we may delete our local version of the server's
    pack index and not realize we have it. So we'll download it again,
    even though we have all of the objects it mentions.

I think the right solution here is probably some more complex cache
management system: download the remote .idx files to their own storage
directory, mark them as "seen" when we get their matching pack (to avoid
re-downloading even if we repack), and then delete them when the
server's objects/info/packs no longer mentions them.

But since the dumb http protocol is so ancient and so inferior to the
smart http protocol, I don't think it's worth spending a lot of time
creating such a system. For this patch I'm just downloading the idx
files to .git/objects/tmp_pack_*, and marking them as tempfiles to be
deleted when we exit (and due to the name, any we miss due to a crash,
etc, should eventually be removed by "git gc" runs based on timestamps).

That is slightly worse for one case: if we download an idx but not the
matching pack, we won't retain that idx for subsequent runs. But the
flip side is that we're making other cases better (we never hold on to
useless idx files forever). I suspect that worse case does not even come
up often, since it implies that the packs are generated to match
distinct parts of history (i.e., in practice even in a repo with many
packs you're going to end up grabbing all of those packs to do a clone).
If somebody really cares about that, I think the right path forward is a
managed cache directory as above, and this patch is providing the first
step in that direction anyway (by moving things out of the objects/pack/
directory).

There are two test changes. One demonstrates the broken v1 index case
(it double-checks the resulting clone with fsck to be careful, but prior
to this patch it actually fails at the clone step). The other tweaks the
expectation for a test that covers the "slightly worse" case to
accommodate the extra index download.

The code changes are fairly simple. We stop using finalize_object_file()
to copy the remote's index file into place, and leave it as a tempfile.
We give the tempfile a real ".idx" name, since the packfile code expects
that, and thus we make sure it is out of the usual packs/ directory (so
we'd never mistake it for a real local .idx).

We also have to change parse_pack_index(), which creates a temporary
packed_git to access our index (we need this because all of the pack idx
code assumes we have that struct). It reads the index data from the
tempfile, but prior to this patch would speculatively write the
finalized name into the packed_git struct using the pack-$hash we expect
to use.

I was mildly surprised that this worked at all, since we call
verify_pack_index() on the packed_git which mentions the final name
before moving the file into place! But it works because
parse_pack_index() leaves the mmap-ed data in the struct, so the
lazy-open in verify_pack_index() never triggers, and we read from the
tempfile, ignoring the filename in the struct completely. Hacky, but it
works.

After this patch, parse_pack_index() now uses the index filename we pass
in to derive a matching .pack name. This is OK to change because there
are only two callers, both in the dumb http code (and the other passes
in an existing pack-$hash.idx name, so the derived name is going to be
pack-$hash.pack, which is what we were using anyway).

I'll follow up with some more cleanups in that area, but this patch is
sufficient to fix the regression.

Reported-by: fox <fox.gbr@townlong-yak.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-25 17:35:46 -04:00
Jeff King
019b21d402 t5550: count fetches in "previously-fetched .idx" test
We have a test in t5550 that looks at index fetching over dumb http. It
creates two branches, each of which is completely stored in its own
pack, then fetches the branches independently. What should (and does)
happen is that the first fetch grabs both .idx files and one .pack file,
and then the fetch of the second branch re-uses the previously
downloaded .idx files (fetching none) and grabs the now-required .pack
file.

Since the next few patches will be touching this area of the code, let's
beef up the test a little by checking that we're downloading the
expected items at each step.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
2024-10-25 17:35:46 -04:00