mirror of
https://github.com/git/git.git
synced 2026-03-04 06:27:36 +01:00
builtin/pack-objects: reduce lock contention when writing packfile data
When running `git pack-objects --stdout` we feed the data through
`hashfd_ext()` with a progress meter and a smaller-than-usual buffer
length of 8kB so that we can track throughput more granularly. But as
packfiles tend to be on the larger side, this small buffer size may
cause a ton of write(3p) syscalls.
Originally, the buffer we used in `hashfd()` was 8kB for all use cases.
This was changed though in 2ca245f8be (csum-file.h: increase hashfile
buffer size, 2021-05-18) because we noticed that the number of writes
can have an impact on performance. So the buffer size was increased to
128kB, which improved performance a bit for some use cases.
But the commit didn't touch the buffer size for `hashfd_throughput()`.
The reasoning here was that callers expect the progress indicator to
update frequently, and a larger buffer size would of course reduce the
update frequency especially on slow networks.
While that is of course true, there was (and still is, even though it's
now a call to `hashfd_ext()`) only a single caller of this function in
git-pack-objects(1). This command is responsible for writing packfiles,
and those packfiles are often on the bigger side. So arguably:
- The user won't care about increments of 8kB when packfiles tend to
be megabytes or even gigabytes in size.
- Reducing the number of syscalls would be even more valuable here
than it would be for multi-pack indices, which was the benchmark
done in the mentioned commit, as MIDXs are typically significantly
smaller than packfiles.
- Nowadays, many internet connections should be able to transfer data
at a rate significantly higher than 8kB per second.
Update the buffer to instead have a size of `LARGE_PACKET_DATA_MAX - 1`,
which translates to ~64kB. This limit was chosen because `git
pack-objects --stdout` is most often used when sending packfiles via
git-upload-pack(1), where packfile data is chunked into pktlines when
using the sideband. Furthermore, most internet connections should have a
bandwidth significantly higher than 64kB/s, so we'd still be able to
observe progress updates at a rate of at least once per second.
This change significantly reduces the number of write(3p) syscalls from
355,000 to 44,000 when packing the Linux repository. While this results
in a small performance improvement on an otherwise-unused system, this
improvement is mostly negligible. More importantly though, it will
reduce lock contention in the kernel on an extremely busy system where
we have many processes writing data at once.
Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
commit b095803c90
parent b7b5cde063
committed by Junio C Hamano
@@ -41,6 +41,7 @@
 #include "promisor-remote.h"
 #include "pack-mtimes.h"
 #include "parse-options.h"
+#include "pkt-line.h"
 #include "blob.h"
 #include "tree.h"
 #include "path-walk.h"
@@ -1333,14 +1334,17 @@ static void write_pack_file(void)
 
 	if (pack_to_stdout) {
 		/*
-		 * Since we are expecting to report progress of the
-		 * write into this hashfile, use a smaller buffer
-		 * size so the progress indicators arrive at a more
-		 * frequent rate.
+		 * This command is most often invoked via
+		 * git-upload-pack(1), which will typically chunk data
+		 * into pktlines. As such, we use the maximum data
+		 * length of them as buffer length.
+		 *
+		 * Note that we need to subtract one though to
+		 * accomodate for the sideband byte.
 		 */
 		struct hashfd_options opts = {
 			.progress = progress_state,
-			.buffer_len = 8 * 1024,
+			.buffer_len = LARGE_PACKET_DATA_MAX - 1,
		};
 		f = hashfd_ext(the_repository->hash_algo, 1,
 			       "<stdout>", &opts);