Teach "add" to use preload-index and fscache features
to improve performance on very large repositories.
During an "add", a call is made to run_diff_files()
which calls check_remove() for each index-entry. This
calls lstat(). On Windows, the fscache code intercepts
the lstat() calls and builds a private cache using the
FindFirst/FindNext routines, which are much faster.
Somewhat independent of this, is the preload-index code
which distributes some of the start-up costs across
multiple threads.
We need to keep the call to read_cache() before parsing the
pathspecs (and hence cannot use the pathspecs to limit any preload)
because parse_pathspec() is using the index to determine whether a
pathspec is, in fact, in a submodule. If we would not read the index
first, parse_pathspec() would not error out on a path that is inside
a submodule, and t7400-submodule-basic.sh would fail with
not ok 47 - do not add files from a submodule
We still want the nice preload performance boost, though, so we simply
call read_cache_preload(&pathspecs) after parsing the pathspecs.
Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The purpose of this function is to stat() the files listed in the index
in a multi-threaded fashion. It is called directly after reading the
index in the read_index_preloaded() function.
However, in some cases we may want to separate the index reading from
the preloading step, e.g. in builtin/add.c, where we need to load the
index before we parse the pathspecs (which needs to error out if one of
the pathspecs refers to a path within a submodule, for which the index
must have been read already), and only then will we want to preload,
possibly limited by the just-parsed pathspecs.
So let's just export that function to allow calling it separately.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
If multiple threads access a directory that is not yet in the cache, the
directory will be loaded by each thread. Only one of the results is added
to the cache, all others are leaked. This wastes performance and memory.
On cache miss, add a future object to the cache to indicate that the
directory is currently being loaded. Subsequent threads register themselves
with the future object and wait. When the first thread has loaded the
directory, it replaces the future object with the result and notifies
waiting threads.
Signed-off-by: Karsten Blees <blees@dcon.de>
Checking the work tree status is quite slow on Windows, due to slow lstat
emulation (git calls lstat once for each file in the index). Windows
operating system APIs seem to be much better at scanning the status
of entire directories than checking single files.
Add an lstat implementation that uses a cache for lstat data. Cache misses
read the entire parent directory and add it to the cache. Subsequent lstat
calls for the same directory are served directly from the cache.
Also implement opendir / readdir / closedir so that they create and use
directory listings in the cache.
The cache doesn't track file system changes and doesn't plug into any
modifying file APIs, so it has to be explicitly enabled for git functions
that don't modify the working copy.
Note: in an earlier version of this patch, the cache was always active and
tracked file system changes via ReadDirectoryChangesW. However, this was
much more complex and had negative impact on the performance of modifying
git commands such as 'git checkout'.
Signed-off-by: Karsten Blees <blees@dcon.de>
Add a macro to mark code sections that only read from the file system,
along with a config option and documentation.
This facilitates implementation of relatively simple file system level
caches without the need to synchronize with the file system.
Enable read-only sections for 'git status' and preload_index.
Signed-off-by: Karsten Blees <blees@dcon.de>
Emulating the POSIX lstat API on Windows via GetFileAttributes[Ex] is quite
slow. Windows operating system APIs seem to be much better at scanning the
status of entire directories than checking single files. A caching
implementation may improve performance by bulk-reading entire directories
or reusing data obtained via opendir / readdir.
Make the lstat implementation pluggable so that it can be switched at
runtime, e.g. based on a config option.
Signed-off-by: Karsten Blees <blees@dcon.de>
Emulating the POSIX dirent API on Windows via FindFirstFile/FindNextFile is
pretty staightforward, however, most of the information provided in the
WIN32_FIND_DATA structure is thrown away in the process. A more
sophisticated implementation may cache this data, e.g. for later reuse in
calls to lstat.
Make the dirent implementation pluggable so that it can be switched at
runtime, e.g. based on a config option.
Define a base DIR structure with pointers to readdir/closedir that match
the opendir implementation (i.e. similar to vtable pointers in OOP).
Define readdir/closedir so that they call the function pointers in the DIR
structure. This allows to choose the opendir implementation on a
call-by-call basis.
Move the fixed sized dirent.d_name buffer to the dirent-specific DIR
structure, as d_name may be implementation specific (e.g. a caching
implementation may just set d_name to point into the cache instead of
copying the entire file name string).
Signed-off-by: Karsten Blees <blees@dcon.de>
Git for Windows ships with its own Perl interpreter, and insists on
using it, so it will most likely wreak havoc if PERL5LIB is set before
launching Git.
Let's just unset that environment variables when spawning processes.
To make this feature extensible (and overrideable), there is a new
config setting `core.unsetenvvars` that allows specifying a
comma-separated list of names to unset before spawning processes.
Reported by Gabriel Fuhrmann.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
In the Git for Windows project, we have ample precendent for config
settings that apply to Windows, and to Windows only.
Let's formalize this concept by introducing a platform_core_config()
function that can be #define'd in a platform-specific manner.
This will allow us to contain platform-specific code better, as the
corresponding variables no longer need to be exported so that they can
be defined in environment.c and be set in config.c
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This is the convention elsewhere (and prepares for the case where we may
need to pass callback data).
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
In git reset --hard, unpack-trees() is called with oneway_merge().
oneway_merge calls lstat for each files in a repository.
It is bottleneck of git reset --hard, especially in large repository.
This patch improves time by using fscache.
In chromium repository, time of git reset --hard is changed like below.
I took 3 times stats in the repository.
master:
TotalSeconds: 21.0337971
TotalSeconds: 20.0046612
TotalSeconds: 20.6501752
Avg: 20.5628778333333
this patch:
TotalSeconds: 4.8552376
TotalSeconds: 4.8722343
TotalSeconds: 4.9268245
Avg: 4.88476546666667
Signed-off-by: Takuto Ikuta <tikuta@chromium.org>
When I do git fetch, git call file stats under .git/objects for each
refs. This takes time when there are many refs.
By enabling fscache, git takes file stats by directory traversing and that
improved the speed of fetch-pack for repository having large number of
refs.
In my windows workstation, this improves the time of `git fetch` for
chromium repository like below. I took stats 3 times.
* With this patch
TotalSeconds: 9.9825165
TotalSeconds: 9.1862075
TotalSeconds: 10.1956256
Avg: 9.78811653333333
* Without this patch
TotalSeconds: 15.8406702
TotalSeconds: 15.6248053
TotalSeconds: 15.2085938
Avg: 15.5580231
Signed-off-by: Takuto Ikuta <tikuta@chromium.org>
In 89a70b80 ("t0302 & t3900: add forgotten quotes", 2018-01-03), quotes
were added to protect against spaces in $HOME. In the test_when_finished
command, two files are deleted which must be quoted individually.
[jc: with \$HOME in the test_when_finished command quoted, as
pointed out by j6t].
Signed-off-by: Beat Bolli <dev+git@drbeat.li>
Helped-by: Johannes Sixt <j6t@kdbg.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git merge -s recursive" did not correctly abort when the index is
dirty, if the merged tree happened to be the same as the current
HEAD, which has been fixed.
* ew/empty-merge-with-dirty-index:
merge-recursive: do not look at the index during recursive merge
"git rebase -p -X<option>" did not propagate the option properly
down to underlying merge strategy backend.
* js/fix-merge-arg-quoting-in-rebase-p:
rebase -p: fix quoting when calling `git merge`
Git's assumption that all path lists are colon-separated is not only
wrong on Windows, it is not even an assumption that is compatible with
POSIX.
In the interest of time, let's not try to fix this properly but simply
work around the obvious breakage on Windows, where the MSYS2 Bash used
by Git for Windows to interpret the Git's Unix shell scripts will
automagically convert path lists in the environment to
semicolon-separated lists of Windows paths (with drive letter and the
corresponding colon and all that jazz).
In other words, we simply look whether there is a semicolon in
GITPERLLIB and split by semicolons if found instead of colons. This is
not fool-proof, of course, as the path list could consist of a single
path. But that is not the case in Git for Windows' test suite, there are
always two paths in GITPERLLIB.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Especially in huge code bases with fast-moving `master`, it can be
prohibitively expensive to calculate whether an upstream branch of
a local branch is ahead, behind or diverged.
This topic branch introduces a set of flags to avoid that computation
when we're not even interested in it to begin with.
This merge commit takes the feature early, therefore it is marked
experimental.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Teach long (normal) status format to respect the --no-ahead-behind
parameter and skip the possibly expensive ahead/behind computation
between the branch and the upstream.
Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Teach "git status --short --branch" to respect "--no-ahead-behind"
parameter to skip computing ahead/behind counts for the branch and
its upstream and just report '[different]'.
Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Teach "git status" and "git commit" to accept "--no-ahead-behind"
and "--ahead-behind" arguments to request quick or full ahead/behind
reporting.
When "--no-ahead-behind" is given, the existing porcelain V2 line
"branch.ab +x -y" is replaced with a new "branch.ab +? -?" line.
This indicates that the branch and its upstream are or are not equal
without the expense of computing the full ahead/behind values.
Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Extend stat_tracking_info() to return +1 when branches are not equal and to
take a new "enum ahead_behind_flags" argument to allow skipping the (possibly
expensive) ahead/behind computation.
This will be used in the next commit to allow "git status" to avoid full
ahead/behind calculations for performance reasons.
Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>