Derrick Stolee 30696be71f p5313: add size comparison test
As custom options are added to 'git pack-objects' and 'git repack' to
adjust how compression is done, use this new performance test script to
demonstrate their effectiveness in performance and size.

The recently-added --name-hash-version option allows for testing
different name hash functions. Version 2 intends to preserve some of the
locality of version 1 while more often breaking collisions due to long
filenames.

Distinguishing objects by more of the path is critical when there are
many name hash collisions and several versions of the same path in the
full history, giving a significant boost to the full repack case. The
locality of the hash function is critical to compressing something like
a shallow clone or a thin pack representing a push of a single commit.

This can be seen by running pt5313 on the open source fluentui
repository [1]. Most commits will have this kind of output for the thin
and big pack cases, though certain commits (such as [2]) will have
problematic thin pack size for other reasons.

[1] https://github.com/microsoft/fluentui
[2] a637a06df05360ce5ff21420803f64608226a875

Checked out at the parent of [2], I see the following statistics:

Test                                         HEAD
---------------------------------------------------------------
5313.2: thin pack with version 1             0.37(0.44+0.02)
5313.3: thin pack size with version 1                   1.2M
5313.4: big pack with version 1              2.04(7.77+0.23)
5313.5: big pack size with version 1                   20.4M
5313.6: shallow fetch pack with version 1    1.41(2.94+0.11)
5313.7: shallow pack size with version 1               34.4M
5313.8: repack with version 1                95.70(676.41+2.87)
5313.9: repack size with version 1                    439.3M
5313.10: thin pack with version 2            0.12(0.12+0.06)
5313.11: thin pack size with version 2                 22.0K
5313.12: big pack with version 2             2.80(5.43+0.34)
5313.13: big pack size with version 2                  25.9M
5313.14: shallow fetch pack with version 2   1.77(2.80+0.19)
5313.15: shallow pack size with version 2              33.7M
5313.16: repack with version 2               33.68(139.52+2.58)
5313.17: repack size with version 2                   160.5M

To make comparisons easier, I will reformat this output into a different
table style:

| Test         | V1 Time | V2 Time | V1 Size | V2 Size |
|--------------|---------|---------|---------|---------|
| Thin Pack    |  0.37 s |  0.12 s |   1.2 M |  22.0 K |
| Big Pack     |  2.04 s |  2.80 s |  20.4 M |  25.9 M |
| Shallow Pack |  1.41 s |  1.77 s |  34.4 M |  33.7 M |
| Repack       | 95.70 s | 33.68 s | 439.3 M | 160.5 M |

The v2 hash function successfully differentiates the CHANGELOG.md files
from each other, which leads to significant improvements in the thin
pack (simulating a push of this commit) and the full repack. There is
some bloat in the "big pack" scenario and essentially the same results
for the shallow pack.

In the case of the Git repository, these numbers show some of the issues
with this approach:

| Test         | V1 Time | V2 Time | V1 Size | V2 Size |
|--------------|---------|---------|---------|---------|
| Thin Pack    |  0.02 s |  0.02 s |   1.1 K |   1.1 K |
| Big Pack     |  1.69 s |  1.95 s |  13.5 M |  14.5 M |
| Shallow Pack |  1.26 s |  1.29 s |  12.0 M |  12.2 M |
| Repack       | 29.51 s | 29.01 s | 237.7 M | 238.2 M |

Here, the attempts to remove conflicts in the v2 function seem to cause
slight bloat to these sizes. This shows that the Git repository benefits
a lot from cross-path delta pairs.

The results are similar with the nodejs/node repo:

| Test         | V1 Time | V2 Time | V1 Size | V2 Size |
|--------------|---------|---------|---------|---------|
| Thin Pack    |  0.02 s |  0.02 s |   1.6 K |   1.6 K |
| Big Pack     |  4.61 s |  3.26 s |  56.0 M |  52.8 M |
| Shallow Pack |  7.82 s |  7.51 s | 104.6 M | 107.0 M |
| Repack       | 88.90 s | 73.75 s | 740.1 M | 764.5 M |

Here, the v2 name-hash causes some size bloat more often than it reduces
the size, but it also universally improves performance time, which is an
interesting reversal. This must mean that it is helping to short-circuit
some delta computations even if it is not finding the most efficient
ones. The performance improvement cannot be explained only due to the
I/O cost of writing the resulting packfile.

The Linux kernel repository was the initial target of the default name
hash value, and its naming conventions are practically build to take the
most advantage of the default name hash values:

| Test         | V1 Time  | V2 Time  | V1 Size | V2 Size |
|--------------|----------|----------|---------|---------|
| Thin Pack    |   0.17 s |   0.07 s |   4.6 K |   4.6 K |
| Big Pack     |  17.88 s |  12.35 s | 201.1 M | 159.1 M |
| Shallow Pack |  11.05 s |  22.94 s | 269.2 M | 273.8 M |
| Repack       | 727.39 s | 566.95 s |   2.5 G |   2.5 G |

Here, the thin and big packs gain some performance boosts in time, with
a modest gain in the size of the big pack. The shallow pack, however, is
more expensive to compute, likely because similarly-named files across
different directories are farther apart in the name hash ordering in v2.
The repack also gains benefits in computation time but no meaningful
change to the full size.

Finally, an internal Javascript repo of moderate size shows significant
gains when repacking with --name-hash-version=2 due to it having many name
hash collisions. However, it's worth noting that only the full repack
case has significant differences from the v1 name hash:

| Test      | V1 Time   | V2 Time  | V1 Size | V2 Size |
|-----------|-----------|----------|---------|---------|
| Thin Pack |    8.28 s |   7.28 s |  16.8 K |  16.8 K |
| Big Pack  |   12.81 s |  11.66 s |  29.1 M |  29.1 M |
| Shallow   |    4.86 s |   4.06 s |  42.5 M |  44.1 M |
| Repack    | 3126.50 s | 496.33 s |   6.2 G | 855.6 M |

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-27 13:21:43 -08:00
2024-11-01 12:53:23 -04:00
2024-10-25 14:02:44 -04:00
2024-05-24 11:40:44 -07:00
2024-10-04 14:21:40 -07:00
2024-10-21 16:05:04 -04:00
2025-01-27 13:21:43 -08:00
2024-09-20 14:40:41 -07:00
2024-09-06 09:31:15 -07:00
2024-09-25 10:37:10 -07:00
2024-09-23 10:35:07 -07:00
2024-06-17 15:55:55 -07:00
2024-07-08 14:53:10 -07:00
2024-09-23 10:35:07 -07:00
2024-09-03 09:15:00 -07:00
2024-09-16 10:46:00 -07:00
2024-06-14 10:26:33 -07:00
2024-06-14 10:26:33 -07:00
2024-06-14 10:26:33 -07:00
2024-06-14 10:26:33 -07:00
2024-08-23 09:02:33 -07:00
2024-04-05 15:21:14 -07:00
2024-10-10 14:22:30 -07:00
2024-10-10 14:22:30 -07:00
2024-10-02 07:46:26 -07:00
2024-08-08 09:36:53 -07:00
2024-10-21 16:05:04 -04:00
2024-09-19 13:46:00 -07:00
2024-09-19 13:46:00 -07:00
2024-10-10 14:22:30 -07:00
2024-10-22 12:45:49 -04:00
2024-09-23 10:35:09 -07:00
2024-11-01 12:53:32 -04:00
2024-04-19 12:38:50 +02:00
2024-06-14 10:26:33 -07:00
2024-08-09 08:47:34 -07:00
2024-09-06 10:38:49 -07:00
2024-07-08 14:53:10 -07:00
2024-07-08 14:53:10 -07:00
2024-07-08 14:53:10 -07:00
2024-06-14 10:26:33 -07:00
2024-10-21 16:05:04 -04:00
2024-06-14 10:26:33 -07:00
2024-07-25 09:03:00 -07:00
2024-05-24 11:40:42 -07:00
2024-05-24 11:40:42 -07:00
2024-09-25 10:37:12 -07:00
2024-08-09 08:47:34 -07:00
2024-05-11 17:22:17 +02:00
2024-09-19 13:46:01 -07:00
2024-04-05 15:21:14 -07:00
2024-09-25 10:37:11 -07:00
2024-09-25 10:37:12 -07:00
2024-10-10 14:22:30 -07:00
2024-06-14 10:26:33 -07:00
2024-06-14 10:26:33 -07:00
2024-09-30 11:23:07 -07:00
2024-09-19 13:46:12 -07:00
2024-09-19 13:46:12 -07:00
2024-10-10 14:22:29 -07:00
2024-09-20 11:16:30 -07:00
2024-09-30 11:23:03 -07:00
2024-06-14 10:26:33 -07:00
2024-06-12 13:37:15 -07:00
2024-10-02 07:46:26 -07:00
2024-05-17 10:33:39 -07:00
2024-05-23 11:04:27 -07:00
2024-06-14 10:26:33 -07:00
2024-04-05 15:16:27 -07:00
2024-09-04 08:03:24 -07:00
2024-06-14 10:26:33 -07:00

Build status

Git - fast, scalable, distributed revision control system

Git is a fast, scalable, distributed revision control system with an unusually rich command set that provides both high-level operations and full access to internals.

Git is an Open Source project covered by the GNU General Public License version 2 (some parts of it are under different licenses, compatible with the GPLv2). It was originally written by Linus Torvalds with help of a group of hackers around the net.

Please read the file INSTALL for installation instructions.

Many Git online resources are accessible from https://git-scm.com/ including full documentation and Git related tools.

See Documentation/gittutorial.txt to get started, then see Documentation/giteveryday.txt for a useful minimum set of commands, and Documentation/git-<commandname>.txt for documentation of each command. If git has been correctly installed, then the tutorial can also be read with man gittutorial or git help tutorial, and the documentation of each command with man git-<commandname> or git help <commandname>.

CVS users may also want to read Documentation/gitcvs-migration.txt (man gitcvs-migration or git help cvs-migration if git is installed).

The user discussion and development of Git take place on the Git mailing list -- everyone is welcome to post bug reports, feature requests, comments and patches to git@vger.kernel.org (read Documentation/SubmittingPatches for instructions on patch submission and Documentation/CodingGuidelines).

Those wishing to help with error message, usage and informational message string translations (localization l10) should see po/README.md (a po file is a Portable Object file that holds the translations).

To subscribe to the list, send an email to git+subscribe@vger.kernel.org (see https://subspace.kernel.org/subscribing.html for details). The mailing list archives are available at https://lore.kernel.org/git/, https://marc.info/?l=git and other archival sites.

Issues which are security relevant should be disclosed privately to the Git Security mailing list git-security@googlegroups.com.

The maintainer frequently sends the "What's cooking" reports that list the current status of various development topics to the mailing list. The discussion following them give a good reference for project status, development direction and remaining tasks.

The name "git" was given by Linus Torvalds when he wrote the very first version. He described the tool as "the stupid content tracker" and the name as (depending on your mood):

  • random three-letter combination that is pronounceable, and not actually used by any common UNIX command. The fact that it is a mispronunciation of "get" may or may not be relevant.
  • stupid. contemptible and despicable. simple. Take your pick from the dictionary of slang.
  • "global information tracker": you're in a good mood, and it actually works for you. Angels sing, and a light suddenly fills the room.
  • "goddamn idiotic truckload of sh*t": when it breaks
Description
No description provided
Readme 725 MiB
Languages
C 50.5%
Shell 38.8%
Perl 4.4%
Tcl 3.2%
Python 0.8%
Other 2.1%