Using Python Dulwich to read and write git repositories

Python Dulwich provides direct read-write access to git repositories in pure Python. As such, it is seriously low-level, good examples of how to work with it are very hard to find, and getting it wrong puts a git repository at risk of corruption and destruction.

The object store tutorial gives some hints on how to commit an object, and the repo tutorial gives some hints on how to add new files by using the "index". However, the object store tutorial effectively replaces whatever file and directory structure was in the repository with a single commit comprising a single file (the blob being committed). The repo example does add a file to a pre-existing repository, but requires the example file to be written out to disk first.

This tutorial therefore shows how to add a file to a repository without having to write it out to local storage, whilst also maintaining the existing hierarchy as represented by the current master. Here is the code:

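    from time import time

    from dulwich.index import commit_tree
    from dulwich.objects import Blob, Commit, parse_timezone

    def add_file(repo, fname, text, author, message):
        # add_file is an illustrative name. fname, text, author and
        # message are byte strings; repo is a non-bare dulwich Repo.
        blob = Blob.from_string(text)
        bloblen = blob.raw_length()

        idx = repo.open_index()

        # Nothing stores the blob for us here, so add it explicitly.
        object_store = repo.object_store
        object_store.add_object(blob)

        # Field order: ctime, mtime, dev, ino, mode, uid, gid, size, sha, flags.
        # Only mode, size and sha matter; the rest are placeholders.
        # The entry layout has changed between Dulwich releases, so check
        # dulwich.index for the version in use.
        idx[fname] = ((0, 0), (0, 0),
                      0, 0, 0o100644, 0, 0, bloblen, blob.id, 0)

        # Build Tree objects for every path in the index, not just the new one.
        tree_id = commit_tree(object_store, idx.iterblobs())

        commit = Commit()
        commit.tree = tree_id
        commit.author = commit.committer = author
        commit.commit_time = commit.author_time = int(time())
        tz = parse_timezone(b'+0000')[0]
        commit.commit_timezone = commit.author_timezone = tz
        commit.encoding = b"UTF-8"
        commit.message = message
        try:
            # Assumes HEAD points at refs/heads/master.
            commit.parents = [repo.refs[b'HEAD']]
        except KeyError:
            commit.parents = []  # first commit in the repository

        object_store.add_object(commit)

        repo.refs[b'refs/heads/master'] = commit.id

        idx.write()  # writes .git/index only; the working tree is not touched

        return commit.id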

This looks fairly straightforward - however there are a couple of things that need to be understood. The first is that this code critically depends on having a "non-bare" repository, i.e. one that has what is called an "index" (see the .git/index file in any non-bare git repository). The "index" mirrors the file structure of what has been checked out into the working tree (i.e. everything that isn't in the .git directory).

Dulwich has a convenient function, open_index, for reading the index file and creating a dictionary-like object representing the file hierarchy. Each key is a file's full path relative to the repository root, and each value holds the file information: owner, permissions, inode and, crucially, a repository blob id.

The repo.stage function modifies the index, but it does so by reading files from the working directory of the git repository. This may not be appropriate: for example, the above code is used in the context of a JSONRPC service for an online wiki, where the new page data is received over a network, not read from a local filestore. Having to store that data on-disk first is therefore inappropriate.

Instead, a blob is created directly from the data. Then, once the index has been read, the new blob is added directly to the dulwich Index object, with fake values for most fields but, crucially, correct values for the important parameters: the new blob's id, the data length and the file's mode.
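To make the "blob id" and "data length" less mysterious, here is a minimal sketch of how git derives a blob's id from in-memory data, using only hashlib. It does not use Dulwich; git_blob_id is a hypothetical helper name, and the expectation that Dulwich's blob.id and blob.raw_length() report the same two values is an assumption worth checking against the installed version.

```python
import hashlib


def git_blob_id(data):
    """Return the hex SHA-1 that git assigns to a blob holding `data`."""
    # git hashes a small header ("blob <length>" plus a NUL byte)
    # followed by the raw content; nothing is read from disk.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


page = b"hello world\n"   # e.g. page text received over the network
size = len(page)          # the "size" field of the index entry
blob_id = git_blob_id(page)
print(size, blob_id)
```

The id depends only on the content and its length, which is why the timestamps, device and inode fields of the index entry can safely be faked while the id, size and mode cannot.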

The next critical insight is the use of the commit_tree function, which takes the entries of the dulwich Index object and creates a hierarchy of Tree objects that reference the Blobs. Each commit in a git repository must reference a complete hierarchy of objects (Trees and Blobs) for that commit, not just the files that changed. Whilst this may be obvious to people familiar with the internals of git, it most definitely isn't obvious to the average developer.
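As a rough illustration of what building that hierarchy involves, the sketch below turns a flat list of (path, sha, mode) tuples into nested tree objects using only hashlib. It is not Dulwich's implementation: build_tree and its store argument are hypothetical names, and the in-memory dict merely stands in for the object store.

```python
import hashlib

DIR_MODE = 0o040000  # mode git records for a sub-tree entry


def build_tree(entries, store):
    """Build nested git trees from flat (path, hex_sha, mode) tuples.

    Paths are bytes using b"/" as separator. Serialized tree bodies are
    saved in `store` under their hex id; the root tree id is returned.
    """
    files = {}
    subdirs = {}
    for path, sha, mode in entries:
        head, sep, rest = path.partition(b"/")
        if sep:
            # Entry lives in a sub-directory: defer it to a child tree.
            subdirs.setdefault(head, []).append((rest, sha, mode))
        else:
            files[head] = (mode, sha)
    for name, children in subdirs.items():
        files[name] = (DIR_MODE, build_tree(children, store))

    def sort_key(item):
        # git sorts tree entries by name, treating directories as "name/".
        name, (mode, _sha) = item
        return name + b"/" if mode == DIR_MODE else name

    body = b"".join(
        b"%o %s\0" % (mode, name) + bytes.fromhex(sha)
        for name, (mode, sha) in sorted(files.items(), key=sort_key)
    )
    tree_id = hashlib.sha1(b"tree %d\0" % len(body) + body).hexdigest()
    store[tree_id] = body
    return tree_id


EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"  # id of a zero-length blob
store = {}
root = build_tree(
    [(b"README", EMPTY_BLOB, 0o100644), (b"docs/a.txt", EMPTY_BLOB, 0o100644)],
    store,
)
print(root, len(store))
```

Two trees are produced here, one for the root and one for docs. Leaving existing paths out of the list would leave them out of the commit, which is why the whole index, not just the new file, is passed to commit_tree.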

The rest of the code is near-identical to that in the dulwich tutorial, with the exception of the line that sets up the commit parent. In the case of the dulwich tutorial, where the commit parent is not set, the consequences become very clear once the repository is viewed with another tool: all commits bar the one that has just been committed using the tutorial code are no longer visible! Running "git fsck" will show the prior commits as "dangling commit" entries. The commit history is therefore effectively a linked list, and it is essential that the new list head refers to the previous commit. If however this is the first commit, then that list is empty - hence the try / except.
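The linked-list behaviour can be shown with a toy model. In the sketch below the commit ids are plain labels rather than real SHA-1s, and reachable is a hypothetical helper that follows parent links from a branch head, roughly as a history viewer would.

```python
def reachable(commits, head):
    """Return the commit ids reachable from `head`, newest first."""
    seen = []
    stack = [head]
    while stack:
        commit_id = stack.pop()
        if commit_id in seen:
            continue
        seen.append(commit_id)
        # A commit may have several parents (a merge), so history is
        # strictly a graph; with one parent each it is a linked list.
        stack.extend(commits[commit_id])
    return seen


# Existing history: c1 <- c2 <- c3, each value being the list of parents.
history = {"c1": [], "c2": ["c1"], "c3": ["c2"]}

with_parent = dict(history, c4=["c3"])   # new head points at the old head
without_parent = dict(history, c4=[])    # parent left unset

print(reachable(with_parent, "c4"))
print(reachable(without_parent, "c4"))
```

With the parent set, all four commits are found from the new head. Without it, only c4 is found: the older commits still exist in the object store, but nothing leads to them from the branch.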

The only other thing worth mentioning is that the blob has to be added to the object store manually (unlike with repo.stage). The reason for this is that repo.stage reads each file from the working tree, creates the blob and adds it to the object store itself before updating the index, whereas the code above bypasses repo.stage and so has to do that step explicitly.

Overall, then: this is a good way to add files to an existing git repository, without having to store the data on-disk as an intermediate step.