This is a “how to” on a process for managing changes that you keep locally (and only locally) on a fork of a git repository.
I can’t believe I’m writing this as a “how to”, but I keep wanting to refer people to this kind of information, so I guess if I want it out here, I should do something about it.
To be very clear, 99% of the time, you should NOT DO THIS. If you’re intentionally going here, be very aware of the additional work you just signed up for – both now, and in the future.
In the world of open source software, you make forks all the time. GitHub has made it super easy, and more importantly, it’s how they (and git) have arranged for people to collaborate on software. This “how to” is for when you decide that you want to maintain your own fork, with changes that add to, or diverge from, the original project. For most cases, you are going to be much better off submitting your changes back. Be damned sure you need to keep your changes to yourself.
If you want to keep your changes locally and just for yourself, then immediately recognize that you have just taken on “technical debt”. The interest rate for this debt could be high or low. Following the “debt” metaphor, the cost is based on how much activity and change happens in the repository from which you forked and want to take future changes.
I’m writing this article presuming you want or need to keep a fork with “a few changes added”, and you want to keep it otherwise up to date with the changes happening by others in the open source community. So let’s get to it – I’ll start with some terms and git basics:
First and foremost, realize that git repositories are sequences of diffs that make up a history (or streams of histories). This is all source control 101 stuff, but getting that you’re dealing with sequences of diffs, applied one after another, is a key insight.
If you’re new to git, do yourself a favor and read through some docs and guides. The site at https://help.github.com/articles/good-resources-for-learning-git-and-github/ is a pretty good “getting started” reference.
When you make a “fork”, either on github or by cloning locally, you’re making a copy of that history of diffs – in fact, of all the histories that are contained within the repository.
And it’s the same thing that’s happening if you invoke “git clone https://github.com/rackhd/rackhd” on the command line, except that the copy you’re making is local to your machine and filesystem rather than hosted on github.
Most commonly, you won’t have access to make changes to someone else’s repository on github. You generally “make a fork” because when you make a copy, you do have permissions to make changes to that instance. It’s what the whole github flow concept is about – make a fork, push your changes up to that repository, and then make a pull request to propose them back to someone else’s repository.
Setting up your own forks and their branches
Github makes it easy to make forks on github, but in reality you can fork and push changes in and out of any git repository – so it’s entirely possible to make a clone out of github and keep it in a local git source control system – maybe GitHub Enterprise or Bitbucket, whatever you’re using – as long as it supports git.
What github doesn’t make easy is keeping your fork “up to date” with remote changes, whether or not it’s on the hosted github site. You do this work yourself.
So let’s talk some descriptive terms for the moving parts here:
“Upstream” is a common term in open source software representing the project that you’re taking changes from, and to which you might contribute back, but maybe don’t have full contributor access to. “Upstream” includes an intentional concept that it’s a source of truth that you want to follow.
When you make a fork on github you don’t see the “links” but they’re there. You do, however, see the links when you make a fork using “git clone”. And by links, I mean the thing that’s called a remote in git.
When you make a clone, the repository will store within it where that clone came from. You can also add your own “remotes” – so you can keep track of many different repositories. That allows you to keep an insane number of remotes if you want to – for your sanity, I recommend you only keep a few, and you keep it simple.
By default, when you run a “git clone…” command, the name of the remote is set to “origin”. So for a local clone of my fork, the “origin” remote is https://github.com/heckj/rackhd. If you look at the remotes for any local git repository (using the command “git remote -v”), you’ll see something like this:
$ git remote -v
origin https://github.com/heckj/rackhd (fetch)
origin https://github.com/heckj/rackhd (push)
Git lets you separate where you’re getting updates from and where you’re pushing updates to. This gives you all the rope you need to really hang yourself with complexity, so do yourself a favor and keep your process simple! I never divide fetch and push on a remote, so I’m not going into that level of detail here.
What I recommend doing is adding a remote for the upstream repository to your local clone. And because I’m nice and predictable that way, I like to name them consistently – so I call it “upstream”. You can do this like so:
$ git remote add upstream https://github.com/rackhd/rackhd
and then “git remote -v” shows:
origin https://github.com/heckj/rackhd (fetch)
origin https://github.com/heckj/rackhd (push)
upstream https://github.com/rackhd/rackhd (fetch)
upstream https://github.com/rackhd/rackhd (push)
What you end up with is a sort of triangle of repositories all linked to each other.
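To try the triangle out without touching github, here is a minimal sketch that builds all three corners as throwaway local repositories. The directory names (`upstream`, `fork.git`, `local`), the demo identity, and the use of a bare clone to stand in for a hosted fork are assumptions made for illustration; with real projects you would use the github URLs shown above.

```shell
set -eu
# Throwaway identity so commits work in a clean environment (assumption)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

# "upstream": the original project, with one commit on master
git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/master
echo "hello" > "$work/upstream/README"
git -C "$work/upstream" add README
git -C "$work/upstream" commit -q -m "upstream: initial commit"

# "fork": a bare copy, standing in for the fork hosted on github
git clone -q --bare "$work/upstream" "$work/fork.git"

# "local": a clone of the fork; git names this remote "origin"
git clone -q "$work/fork.git" "$work/local"

# Add the third side of the triangle
git -C "$work/local" remote add upstream "$work/upstream"
git -C "$work/local" remote -v
```

The last command should list both `origin` (the fork) and `upstream` (the original project), each with a fetch and a push line.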
Handling the Changes
Now it’s down to handling the changes. When you make changes, you add them as “commits” to a branch. If you’re going to be managing a long-running set of additions over something upstream and taking the upstream changes, I recommend doing this in a branch. It’ll just make some of the commands you need to run easier.
The three commands you need to know how to use are “fetch”, “merge” and “rebase”. Fetch is about getting data from remote repositories, merge is about combining the commits from another branch into the branch you have checked out, and rebase is about replaying sequences of commits in new orders, or moving where they’re attached. The very common “git pull” command is, by default, just “git fetch” and “git merge” combined into a single command for convenience.
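As a rough illustration of the “pull is fetch plus merge” point, this sketch makes two identical clones of a throwaway repository, updates one with `git pull` and the other with `git fetch` followed by `git merge`, and prints the resulting commit for each. The repository layout, file name, and demo identity are assumptions; `--no-rebase` is passed only to spell out the default merge behaviour.

```shell
set -eu
# Throwaway identity so commits work in a clean environment (assumption)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/master
echo "one" > "$work/upstream/notes.txt"
git -C "$work/upstream" add notes.txt
git -C "$work/upstream" commit -q -m "first"

# Two identical clones, taken before the next upstream commit
git clone -q "$work/upstream" "$work/a"
git clone -q "$work/upstream" "$work/b"

echo "two" >> "$work/upstream/notes.txt"
git -C "$work/upstream" commit -q -am "second"

# Clone a: one step
git -C "$work/a" pull -q --no-rebase origin master

# Clone b: the same thing in two explicit steps
git -C "$work/b" fetch -q origin
git -C "$work/b" merge -q origin/master

# Both clones should now report the same commit
git -C "$work/a" rev-parse HEAD
git -C "$work/b" rev-parse HEAD
```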
For the purposes of this example, I’m going to assume you made a branch (which I’m labelling as “mystuff”) and you’re leaving the branch “master” to track changes from upstream. In this example, the green boxes represent a couple of commits upstream that you want to take into your local project, where you’re managing your additional change (the dark red commits).
The first thing you want to do is get the changes from upstream. I’m going to do this in sort of pedantic commands here to make it clear what’s happening:
git fetch --all --prune
This is my “go-to” command – it gets all the updates from all the remotes and removes any “remote” branches that are now gone on the remotes, but doesn’t “change” your working copy. It’s getting all the info, but not making any explicit changes to your local branches.
To take the changes in, switch to the master branch and “merge” them in. If you’re being “safe”, you’ll use the option “--ff-only”, which means the changes are only added if your branch can simply be moved forward to match upstream, that is, if you have no local commits that upstream lacks. “ff-only” means “fast forward only”, and implies no attempt at creating a merge commit.
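A small sketch of what that fetch does and does not touch, again using throwaway local repositories in place of github (the paths, the `old-feature` branch name, and the demo identity are all assumptions). Upstream gains a commit and loses a branch; after the fetch, the remote-tracking refs reflect both changes while the checked-out commit stays where it was.

```shell
set -eu
# Throwaway identity so commits work in a clean environment (assumption)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/master
echo "one" > "$work/upstream/notes.txt"
git -C "$work/upstream" add notes.txt
git -C "$work/upstream" commit -q -m "first"
git -C "$work/upstream" branch old-feature

# Fork (bare copy) and local clone, with upstream added as a second remote
git clone -q --bare "$work/upstream" "$work/fork.git"
git clone -q "$work/fork.git" "$work/local"
git -C "$work/local" remote add upstream "$work/upstream"
git -C "$work/local" fetch -q --all

# Upstream moves on: a new commit lands and a branch is deleted
echo "two" >> "$work/upstream/notes.txt"
git -C "$work/upstream" commit -q -am "second"
git -C "$work/upstream" branch -q -D old-feature

before=$(git -C "$work/local" rev-parse HEAD)
git -C "$work/local" fetch -q --all --prune
after=$(git -C "$work/local" rev-parse HEAD)

# HEAD is unchanged; only the remote-tracking branches were updated
echo "HEAD before: $before"
echo "HEAD after:  $after"
git -C "$work/local" branch -r
```

Note that `upstream/old-feature` is pruned, while `origin/old-feature` remains because the fork still has that branch.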
git checkout master
git merge upstream/master --ff-only
When this is done, you’ll have the changes from “upstream” in your master branch, but your “mystuff” branch won’t reflect those yet.
If you plan on making some contributions back to the upstream project, then you’ll probably want to update your fork on github with those changes, although that’s not critical for the main challenge we have here. If you wanted to do that, the command to push those changes is:
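The following sketch shows both sides of “--ff-only” under some stated assumptions: a throwaway upstream repository, a local clone whose remote is named `upstream` via `git clone --origin`, and a `mystuff` branch carrying one local commit. The master branch fast-forwards cleanly, while the same merge attempted on `mystuff` is refused because that branch has diverged.

```shell
set -eu
# Throwaway identity so commits work in a clean environment (assumption)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/master
echo "one" > "$work/upstream/notes.txt"
git -C "$work/upstream" add notes.txt
git -C "$work/upstream" commit -q -m "upstream: first"

# Clone directly, naming the remote "upstream" to keep the sketch short
git clone -q --origin upstream "$work/upstream" "$work/local"
cd "$work/local"

# A local-only commit on the mystuff branch
git checkout -q -b mystuff
echo "mine" > local.txt
git add local.txt
git commit -q -m "mystuff: local change"

# Upstream gains a commit, and we fetch it
echo "two" >> "$work/upstream/notes.txt"
git -C "$work/upstream" commit -q -am "upstream: second"
git fetch -q --all --prune

# master has no local commits, so it can fast-forward
git checkout -q master
git merge -q --ff-only upstream/master

# mystuff has diverged, so --ff-only refuses instead of creating a merge commit
git checkout -q mystuff
mystuff_before=$(git rev-parse mystuff)
if git merge --ff-only upstream/master >/dev/null 2>&1; then
  ff_refused=no
else
  ff_refused=yes
fi
echo "fast-forward refused on mystuff: $ff_refused"
```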
git push origin master
The last thing you want to do is move your changes to be applied after the changes from the upstream repository. I’m specifically suggesting “after” because if you’re maintaining a branch for a long time, you’ll need to repeat this process every time you want to take changes from the “upstream” project. Remember these are applying “diffs” in sequence – so every time you do this, if the diffs you’re managing conflict with diffs that got added, you’ll need to resolve those conflicts and update those commits. That’s the core of the technical debt – managing those conflicts with every update from the upstream project.
The way I like to do this is using the “rebase” command with the “-i” flag:
# switch to the 'mystuff' branch
git checkout mystuff
# start the rebase
git rebase -i master mystuff
The order of the arguments to the rebase command is important: the first is the branch to rebase onto (master), and the second is the branch being rebased (mystuff). The “-i” flag will list out the changes that you’re about to apply in a text list, and generally I just take it as is. Save and quit that list, and the rebase will start applying. If it goes smoothly, you’re all done. If there’s a conflict with the changes, then you need to make the relevant updates, stage them with “git add”, and invoke “git rebase --continue” until the rebase is complete. Atlassian has a nice little tutorial on the rebase process.
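Here is a minimal sketch of the conflict case, so you can see the resolve, stage, continue loop once before you hit it for real. Several things are assumptions made to keep it self-contained: a single throwaway repository, an upstream update simulated by committing directly to master, a hypothetical `config.txt` file, and `GIT_SEQUENCE_EDITOR=true` and `GIT_EDITOR=true` standing in for “save and quit” in the editor.

```shell
set -eu
# Throwaway identity so commits work in a clean environment (assumption)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/master

echo "port=80" > config.txt
git add config.txt
git commit -q -m "upstream: initial config"

# Local-only change on mystuff
git checkout -q -b mystuff
echo "port=8080" > config.txt
git commit -q -am "mystuff: use port 8080"

# A conflicting change arrives on master (simulated upstream update)
git checkout -q master
echo "port=443" > config.txt
git commit -q -am "upstream: use port 443"

# Start the rebase; 'true' as the sequence editor accepts the -i list unchanged
git checkout -q mystuff
if ! GIT_SEQUENCE_EDITOR=true git rebase -i master mystuff >/dev/null 2>&1; then
  # The rebase stopped on a conflict: fix the file, stage it, and continue
  echo "port=8080" > config.txt
  git add config.txt
  GIT_EDITOR=true git rebase --continue >/dev/null
fi

# mystuff now sits directly on top of master
git log --oneline mystuff
```

With more than one local commit, the rebase may stop several times; the same three steps repeat until it finishes, and `git rebase --abort` puts the branch back where it started if you want to bail out.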
The quick version
Now that I’ve walked you through the long way, there’s a compressed version of this process you can use if you want to:
git checkout mystuff
git fetch upstream --prune
git rebase -i upstream/master mystuff
And you’ll achieve the same result for your mystuff branch (minus updating your local master, and pushing that master branch up to your fork).
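To see the quick version end to end, this sketch runs it against throwaway local repositories. The paths, file names, demo identity, and the use of `git clone --origin upstream` to name the remote are assumptions; `GIT_SEQUENCE_EDITOR=true` simply accepts the “-i” list as is.

```shell
set -eu
# Throwaway identity so commits work in a clean environment (assumption)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/master
echo "one" > "$work/upstream/notes.txt"
git -C "$work/upstream" add notes.txt
git -C "$work/upstream" commit -q -m "upstream: first"

git clone -q --origin upstream "$work/upstream" "$work/local"
cd "$work/local"

# One local-only commit on mystuff
git checkout -q -b mystuff
echo "mine" > local.txt
git add local.txt
git commit -q -m "mystuff: local change"

# Upstream moves ahead
echo "two" >> "$work/upstream/notes.txt"
git -C "$work/upstream" commit -q -am "upstream: second"

# The quick version
git checkout -q mystuff
git fetch -q upstream --prune
GIT_SEQUENCE_EDITOR=true git rebase -i upstream/master mystuff >/dev/null 2>&1

# mystuff is on top of upstream/master; local master has not moved
git log --oneline --graph mystuff master
```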