mirror of
https://github.com/git/git.git
synced 2026-03-04 06:27:36 +01:00
builtin/pack-objects: reduce lock contention when writing packfile data
When running `git pack-objects --stdout` we feed the data through
`hashfd_ext()` with a progress meter and a smaller-than-usual buffer
length of 8kB so that we can track throughput more granularly. But as
packfiles tend to be on the larger side, this small buffer size may
cause a ton of write(3p) syscalls.
Originally, the buffer we used in `hashfd()` was 8kB for all use cases.
This was changed though in 2ca245f8be (csum-file.h: increase hashfile
buffer size, 2021-05-18) because we noticed that the number of writes
can have an impact on performance. So the buffer size was increased to
128kB, which improved performance a bit for some use cases.
But the commit didn't touch the buffer size for `hashfd_throughput()`.
The reasoning here was that callers expect the progress indicator to
update frequently, and a larger buffer size would of course reduce the
update frequency especially on slow networks.
While that is of course true, there was (and still is, even though it's
now a call to `hashfd_ext()`) only a single caller of this function in
git-pack-objects(1). This command is responsible for writing packfiles,
and those packfiles are often on the bigger side. So arguably:
- The user won't care about increments of 8kB when packfiles tend to
be megabytes or even gigabytes in size.
- Reducing the number of syscalls would be even more valuable here
than it would be for multi-pack indices, which was the benchmark
done in the mentioned commit, as MIDXs are typically significantly
smaller than packfiles.
- Nowadays, many internet connections should be able to transfer data
at a rate significantly higher than 8kB per second.
Update the buffer to instead have a size of `LARGE_PACKET_DATA_MAX - 1`,
which translates to ~64kB. This limit was chosen because `git
pack-objects --stdout` is most often used when sending packfiles via
git-upload-pack(1), where packfile data is chunked into pktlines when
using the sideband. Furthermore, most internet connections should have a
bandwidth significantly higher than 64kB/s, so we'd still be able to
observe progress updates at a rate of at least once per second.
This change significantly reduces the number of write(3p) syscalls from
355,000 to 44,000 when packing the Linux repository. While this results
in a small performance improvement on an otherwise-unused system, this
improvement is mostly negligible. More importantly though, it will
reduce lock contention in the kernel on an extremely busy system where
we have many processes writing data at once.
Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
commit b095803c90
parent b7b5cde063
committed by Junio C Hamano
@@ -41,6 +41,7 @@
 #include "promisor-remote.h"
 #include "pack-mtimes.h"
 #include "parse-options.h"
+#include "pkt-line.h"
 #include "blob.h"
 #include "tree.h"
 #include "path-walk.h"
@@ -1333,14 +1334,17 @@ static void write_pack_file(void)
 
 	if (pack_to_stdout) {
 		/*
-		 * Since we are expecting to report progress of the
-		 * write into this hashfile, use a smaller buffer
-		 * size so the progress indicators arrive at a more
-		 * frequent rate.
+		 * This command is most often invoked via
+		 * git-upload-pack(1), which will typically chunk data
+		 * into pktlines. As such, we use the maximum data
+		 * length of them as buffer length.
+		 *
+		 * Note that we need to subtract one though to
+		 * accomodate for the sideband byte.
 		 */
 		struct hashfd_options opts = {
 			.progress = progress_state,
-			.buffer_len = 8 * 1024,
+			.buffer_len = LARGE_PACKET_DATA_MAX - 1,
		};
 		f = hashfd_ext(the_repository->hash_algo, 1,
 			       "<stdout>", &opts);