Saturday, August 4, 2012

Bringing a bit more sanity to "alternates"?


The "alternates" mechanism lets you keep a single object store (not necessarily a git repository on its own, but just the objects/ part of it) on a machine and have multiple repositories on the same machine share objects from it.  This saves the network bandwidth spent when cloning from remote repositories and the disk space used by the local repositories.  A repository created by "clone --reference" or "clone -s" uses this mechanism to borrow objects from the object store of another repository.  A user can also manually add new entries to $GIT_DIR/objects/info/alternates to borrow from other object stores.
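
As a minimal sketch of what "clone -s" leaves behind, the following creates a throwaway repository and a borrowing clone, then prints the borrower's alternates file.  The directory names "origin" and "borrower" and the committer identity are made up for illustration.

```shell
set -e
work=$(mktemp -d)          # scratch area; nothing outside it is touched
cd "$work"

# A small repository to borrow from ("origin" is an illustrative name).
git init -q origin
git -C origin -c user.name=Example -c user.email=example@example.invalid \
    commit -q --allow-empty -m "initial"

# "clone -s" copies no objects; it records where to find them instead.
git clone -q -s origin borrower

# The file holds one object directory per line.
cat borrower/.git/objects/info/alternates
```

The printed line should be the path of origin's objects directory; the borrower itself starts out with no objects of its own.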

The UI for this mechanism, however, has some room for improvement, and we may want to start improving it for the next release after the upcoming Git 1.7.12 (or even Git 2.0 if the change is a large one that may be backward incompatible but gives us a vast improvement).

Here are some random thoughts as a discussion starter (the real discussion is on the git mailing list git@vger.kernel.org; see http://thread.gmane.org/gmane.comp.version-control.git/202902).

By design, the borrowed object store MUST NOT ever lose any object, as such a loss can corrupt the borrowing repositories.  In theory, it is OK for the object store whose objects are borrowed by repositories to acquire new objects, but losing existing objects is an absolute no-no.

But the UI of "clone -s" encourages users to borrow from the object store of a repository that the user may actively develop in.  It is perfectly normal for users to perform operations that make objects that used to be reachable from the tips of its branches unreachable (e.g. rebase, reset, "branch -d") in a repository that is used for active development, but a "gc" after such an operation will lose objects that were originally available in the repository.  If objects lost that way were still needed by the repositories that borrow from it, the borrowing repository becomes corrupt immediately.
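
The hazard can be reproduced in a scratch directory.  This is only a sketch under stated assumptions: the names "origin", "borrower", "topic" and "keep" are invented, and the reflog is expired by hand so that "gc --prune=now" discards the unreachable commit right away instead of after the usual grace period.

```shell
set -e
work=$(mktemp -d)
cd "$work"

# Helper (illustrative): make an empty commit in repository $1.
commit() {
    git -C "$1" -c user.name=Example -c user.email=example@example.invalid \
        commit -q --allow-empty -m "$2"
}

git init -q origin
commit origin "base"
git -C origin checkout -q -b topic
commit origin "work on topic"
git -C origin checkout -q -           # back to the initial branch

# The borrower keeps its own branch pointing at the topic commit,
# but the commit object itself lives only in origin's object store.
git clone -q -s origin borrower
git -C borrower branch -q --no-track keep origin/topic

# Ordinary housekeeping in the actively developed repository.
git -C origin branch -q -D topic
git -C origin reflog expire --expire=now --all
git -C origin gc -q --prune=now

# The borrower's branch now names an object that no longer exists.
if git -C borrower cat-file -e keep 2>/dev/null
then
    status=intact
else
    status=corrupt
fi
echo "borrower is $status"
```

Nothing in origin records that the borrower exists, so none of the commands run there can know that the commit is still needed elsewhere.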

In practice, this means that users who use "clone -s" to make a new repository can *never* prune the original repository without risking corruption of the borrowing repository [1].

Some ideas:
  • Make "clone --reference" without "-s" not borrow from the reference repository.  E.g. if you have a clone of Linus's repository at /git/linux.git/, cloning a related repository using it as --reference:

    $ git clone --reference /git/linux.git git://k.org/linux-next.git

    should still take advantage of /git/linux.git/{refs,objects} to reduce the transfer cost of fetching from k.org, but the resulting repository should not point at /git/linux.git with its objects/info/alternates file.
  • Make the distinction between a regular repository and an object store that is meant to be used for object sharing stronger.
    Perhaps a configuration item "core.objectstore = readonly" can be introduced, and we forbid "clone -s" from pointing at a repository without such a configuration.  We also forbid object pruning operations such as "gc" and "repack" from being run in a repository marked as such.

    It may be necessary to allow some special kind of repacking of such a "readonly" object store, in order to reduce the number of packfiles (and get rid of loose object files); it needs to be implemented carefully not to lose any object, regardless of local reachability.

When you have a repository and one or more repositories that borrow from it, you may want to dissociate the borrowing repositories from the borrowed one (e.g. so that you can repack or prune the original repository safely, or even remove it).

I think "git repack -a -d [-f]" in the borrowing repository happens to be the way to do this, but it is not clear to the users why.
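
A sketch of that dissociation, assuming a borrower made with "clone -s" as before (the names "origin" and "borrower" are illustrative).  The repack, run without "-l", copies the borrowed objects into the borrower's own pack; removing the alternates file afterwards is the step that actually cuts the tie.

```shell
set -e
work=$(mktemp -d)
cd "$work"

git init -q origin
git -C origin -c user.name=Example -c user.email=example@example.invalid \
    commit -q --allow-empty -m "base"
git clone -q -s origin borrower

# Pack everything reachable, including objects that so far were
# only available through the alternate.
git -C borrower repack -a -d -q

# Stop borrowing; the local pack is now the only source of objects.
rm borrower/.git/objects/info/alternates

# Even with the original gone, the former borrower stays usable.
rm -rf origin
git -C borrower fsck --no-progress
```

The order matters: removing the alternates file before the repack would leave the borrower without the objects it needs.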

Some ideas:

  • It might not be a bad idea to have a dedicated new command to help users manage alternates ("git alternates"?); if we go that route, this operation would obviously be one of its subcommands, e.g. "git alternates detach".
  • Or just an entry in the documentation is sufficient?

When you have two or more repositories that do not share objects, you may want to rearrange things so that they share their objects from a single common object store.

There is no direct UI to do this, as far as I know.  You can obviously create a new bare repository, push there from all of these repositories, and then borrow from there, e.g.

    git --bare init shared.git &&
    for r in a.git b.git c.git ...
    do
        (
            cd "$r" &&
            git push ../shared.git "refs/*:refs/remotes/$r/*" &&
            echo ../../../shared.git/objects >.git/objects/info/alternates
        )
    done

And then repack shared.git once.

Some ideas:
  • (obvious: give a canned command to do the above, perhaps then set the core.objectstore=readonly in the resulting shared.git)

When you have one object store and a repository that does not yet borrow from it, you may want to make the repository borrow from the object store.  Obviously you can run "echo" like the sample script in the previous item above, but it is not obvious how to perform the logical next step of shrinking $GIT_DIR/objects of the repository that now borrows the objects.

[edit: This is supported as "git repack -a -d -l"]

I think "git repack -a -d" is the way to do this, but if you compare this command to "git repack -a -d -f" we saw previously in this message, it is not surprising that users would be confused; it is not obvious at all.
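
A sketch of the whole sequence, using the "-l" form from the edit note above.  The names "repo" and "store.git" are invented, and the object count is a rough one taken with "find" purely for illustration.

```shell
set -e
work=$(mktemp -d)
cd "$work"

# A repository with its own objects, and a packed shared store that
# already holds the same objects.
git init -q repo
git -C repo -c user.name=Example -c user.email=example@example.invalid \
    commit -q --allow-empty -m "base"
git init -q --bare store.git
git -C repo push -q ../store.git 'refs/*:refs/remotes/repo/*'
git -C store.git repack -a -d -q

# Count object and pack files, ignoring objects/info/.
count_objects() {
    find "$1/.git/objects" -type f ! -path '*/info/*' | wc -l | tr -d ' '
}

# Start borrowing from the store.
echo "$work/store.git/objects" >repo/.git/objects/info/alternates
before=$(count_objects repo)

# "-l" leaves out objects that are available from the alternate, and
# "-d" then drops the local copies that have become redundant.
git -C repo repack -a -d -l -q
after=$(count_objects repo)

echo "object files before: $before, after: $after"
git -C repo fsck --no-progress
```

The only difference from the dissociating repack is the "-l" flag, which is why the two commands are so easy to mix up.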

Some ideas:
  • (obvious: give a canned subcommand to do this)

[Footnote]

[1] Making the borrowed object store aware of all the repositories that borrow from it, so that operations like "gc" and "repack" in the repository with the borrowed object store can keep objects that are needed by borrowing repositories, is theoretically possible, but is not a workable approach in practice, as (1) borrowers may not have write access to the shared object store to add such a back pointer to begin with, (2) "gc"/"repack" in the borrowed object store and normal operations in the borrowing repositories can easily race with each other, without any coordination between the users, and (3) a casual "borrowing" can simply be done with a simple "echo" as shown in the main text of this message, and there is no way to ensure a back pointer from the borrowed object store to such a borrowing repository.