Octokit and Noise Reduction with Pull Requests

Last time in this series on Octokit we looked at how to get the commits that have been made between one release and another. Usually, these commits will contain noise such as lazy commit messages and merge flog ("Fixed it", "Corrected spelling", etc.), merge commits, or commits that formed part of a larger feature change submitted via pull request. Rather than include all this noise in our release note generation, I want to filter those commits and either remove them entirely, or replace them with their associated pull request (which hopefully will be a little less noisy).

Before we filter out the noise, it seems prudent to reduce the commits to be filtered by matching them to pull requests. As with commits, we can query pull requests using a specific set of criteria; however, though we can request the results be sorted a certain way, we cannot specify a date range. To get all the pull requests that were merged before our release, we need to query for all the pull requests and then filter by date locally.

This query can be slow, since we are getting all closed pull requests in the repository. We could speed it up by providing a base branch name in the query criteria. However, to remove as much commit noise as possible, I would like to include pull requests that were merged to branches other than the release branch¹. We could make things more performant by managing a list of active release branches and then querying pull requests for each of those branches only rather than the entire repository, but for now, we will stick with the less optimal approach as it keeps the code examples a little cleaner.

var prRequest = new PullRequestRequest
{
    State = ItemState.Closed,
    SortDirection = SortDirection.Descending,
    SortProperty = PullRequestSort.Updated
};

var pullRequests = await gitHubClient.PullRequest.GetAllForRepository("RepositoryOwner", "RepositoryName", prRequest);
var pullRequestsPriorToRelease = pullRequests
    .Where(pr => pr.MergedAt < mostRecentTwoReleases[0].CreatedAt);
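
One detail worth spelling out is what happens to pull requests that were closed without ever being merged. The following is a minimal sketch using made-up tuples in place of the API results, and it assumes the merge date is nullable (absent for unmerged pull requests); the numbers and dates are purely illustrative.

```csharp
using System;
using System.Linq;

// Hypothetical stand-ins for the pull requests returned by the API.
// Assumption: a closed pull request that was never merged has no merge date.
var pullRequests = new (int Number, DateTimeOffset? MergedAt)[]
{
    (101, new DateTimeOffset(2016, 1, 10, 0, 0, 0, TimeSpan.Zero)), // merged before the release
    (102, null),                                                    // closed without merging
    (103, new DateTimeOffset(2016, 3, 5, 0, 0, 0, TimeSpan.Zero)),  // merged after the release
};

var releaseCreatedAt = new DateTimeOffset(2016, 2, 1, 0, 0, 0, TimeSpan.Zero);

// Comparing a null date with "<" is always false, so unmerged pull requests
// drop out either way; the HasValue check just makes that intent explicit.
var mergedBeforeRelease = pullRequests
    .Where(pr => pr.MergedAt.HasValue && pr.MergedAt.Value < releaseCreatedAt)
    .Select(pr => pr.Number)
    .ToArray();

Console.WriteLine(string.Join(", ", mergedBeforeRelease)); // prints "101"
```

In other words, the local date filter also weeds out pull requests that were closed but never merged, which is what we want here.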

Before we can start filtering our commits against the pull requests, we need to get the commits that comprise each pull request. When requesting a collection of items (as we did for pull requests), the GitHub API returns just enough information about each item for us to filter and identify the ones we really care about. Before we can do things with other properties on the items, we have to request additional information. More information about a specific pull request can be obtained using the `Get`, `Commits`, `Files`, and `Merged` calls. The `Get` call returns the same type of object as the `GetAllForRepository` method, except that all the data is now populated instead of just a few select properties; the `Merged` call returns a Boolean value indicating whether the PR has been merged (equivalent to the `Merged` property populated by `Get`); the `Files` method returns the files changed by that pull request; and the `Commits` method returns the commits.

var commitsForPullRequest = await gitHubClient.PullRequest.Commits("RepositoryOwner", "RepositoryName", pullRequest.Number);

At this point, things are looking pretty good: we can get a list of commits in the release and a list of pull requests that might be in the release. Now, we want to filter that list of commits to remove items that are covered by a pull request. This is easy; we just compare the hashes and remove the matches.

var commitsNotInPullRequest = from commit in commitsInRelease
                              join prCommit in prCommits on commit.Sha equals prCommit.Sha into matchedCommits
                              from match in matchedCommits.DefaultIfEmpty()
                              where match == null
                              select commit;

Using the collection of commits for the latest release, we join the commits from the pull requests using the SHA hash and then select all release commits that have no matching commit in the pull requests². However, we don't want to lose information just because we're losing noise, so we have to maintain a list of the pull requests that were matched so that we can build our release note history. To keep track, we will hold off on discarding any information by pairing up commits in the release with their prospective pull requests instead of just dropping them.
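
To see the join in isolation, here is a small sketch that swaps the commit objects for plain SHA strings (the values are invented for illustration). It runs the same outer join and, for comparison, an equivalent set lookup that I find easier to read when only the unmatched items are needed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up SHAs standing in for the commit objects.
var commitsInRelease = new[] { "a1", "b2", "c3", "d4" };
var prCommits = new[] { "b2", "d4", "e5" };

// Same shape as the query above: group the matches for each release commit,
// substitute a single null when the group is empty, and keep only those.
var viaOuterJoin =
    (from sha in commitsInRelease
     join prSha in prCommits on sha equals prSha into matchedCommits
     from match in matchedCommits.DefaultIfEmpty()
     where match == null
     select sha).ToArray();

// Alternative: a hash set lookup gives the same result without the join.
var prShas = new HashSet<string>(prCommits);
var viaLookup = commitsInRelease.Where(sha => !prShas.Contains(sha)).ToArray();

Console.WriteLine(string.Join(", ", viaOuterJoin)); // prints "a1, c3"
Console.WriteLine(string.Join(", ", viaLookup));    // prints "a1, c3"
```

The join form earns its keep once we want to hold on to the matched side as well, since the lookup only answers "is it in a pull request or not".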

Going back to where we had a list of pull requests merged prior to our release, let us revisit getting the commits for those pull requests and this time, pairing them with the commits in the release to retain information.

var commitsFromPullRequests = from pr in pullRequestsPriorToRelease
                              from commit in gitHubClient.PullRequest.Commits("RepositoryOwner", "RepositoryName", pr.Number).Result
                              select new {commit,pr};

var commitsWithPRs = from commit in commitsInRelease
                     join prCommit in commitsFromPullRequests on commit.Sha equals prCommit.commit.Sha into matchedPrCommits
                     from matchedPrCommit in matchedPrCommits.DefaultIfEmpty()
                     select new
                     {
                         // Null when the commit did not arrive via a pull request
                         PullRequest = matchedPrCommit?.pr,
                         Commit = commit
                     };

Now we have a list of commits paired with their parent pull request, if there is one. Using this we can build a more meaningful set of changes for a release. If I run this on the latest release of the Octokit.NET repository and then group the commits by their paired pull request, I can see that the original list of 135 commits would be reduced to just 58 if each commit that belonged to a pull request were bundled into just one entry.
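
The grouping step can be sketched roughly as follows. This uses invented SHA and pull request number pairs rather than real Octokit objects, so the names and the shape of the tuples are assumptions for illustration only.

```csharp
using System;
using System.Linq;

// Hypothetical pairs: the pull request number is null for unmatched commits.
var commitsWithPRs = new (string Sha, int? PullRequestNumber)[]
{
    ("a1", null), ("b2", 7), ("c3", 7), ("d4", 9), ("e5", null),
};

// Commits that share a pull request collapse into one entry;
// commits without a pull request each keep an entry of their own.
var entries = commitsWithPRs
    .GroupBy(c => c.PullRequestNumber.HasValue
        ? $"PR #{c.PullRequestNumber.Value}"
        : $"commit {c.Sha}")
    .Select(g => g.Key)
    .ToArray();

Console.WriteLine($"{commitsWithPRs.Length} commits -> {entries.Length} entries"); // prints "5 commits -> 4 entries"
```

Keying unmatched commits by their own SHA matters: grouping on the pull request alone would lump every unmatched commit into a single null group.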

Next, we need to process the commits to remove those representing merges and other noise. These are things to discuss in the next post of this series where perhaps we will take stock and see whether this effort has been valuable in producing more meaningful release note generation. Until then, thanks for reading and don't forget to leave a comment.

¹ Often changes are merged forward from one branch to another, especially if there are multiple release branches to support patch development and such.
² The `join` in this example is a left outer join: the `into` clause groups the matches for each release commit, and `DefaultIfEmpty()` supplies a single default (`null`) entry when there was nothing to join, which is what the `where match == null` test picks up.

Octokit and the Content of Releases

I started out my series on Octokit by defining a goal: to use GitHub repository history to build a basic summary of changes contained in a release. In order to do this, we need to define what a release is and then determine how we get the pertinent information to say what changes that release contains.

At a basic level, a release is a tagged point in the git repository. GitHub takes this one step further by making a release a first-class concept: a lightweight git tag with additional attributes like a title and release notes. Octokit even allows first-class access to GitHub releases in a repository, like so:

var releases = await gitHubClient.Release.GetAll("RepositoryOwner", "RepositoryName");

Great! With a little extra code, we can determine which release was the latest and then get all the commits in that release.

var latestRelease = releases.MaxBy(r => r.CreatedAt);

var commitRequest = new CommitRequest
{
    Until = latestRelease.CreatedAt,
    Sha = latestRelease.TagName
};
var commits = await gitHubClient.Repository.Commits.GetAll("RepositoryOwner", "RepositoryName", commitRequest);

In the above code, we use MoreLinq to get the most recent release and then request all the commits in the repository on the same branch as that release up until the date the release was created. We request these commits using a `CommitRequest` object that specifies the query parameters. In this case, we want all the commits until the date of the release for the tag on which the release was made¹. Of course, this will include everything ever done in that branch since the beginning of time, which is a bit of information overload. What we really want are the commits since the previous release.

var mostRecentTwoReleases = releases
    .OrderByDescending(r => r.CreatedAt)
    .Take(2)
    .ToArray();

var commitRequest = new CommitRequest
{
    Until = mostRecentTwoReleases[0].CreatedAt,
    Sha = mostRecentTwoReleases[0].TagName,
    Since = mostRecentTwoReleases[1].CreatedAt
};
var commits = await gitHubClient.Repository.Commits.GetAll("RepositoryOwner", "RepositoryName", commitRequest);

Now we have taken the releases, used their `CreatedAt` dates to determine the most recent two, and used the previous release date to set the `Since` date in our request. However, this code still has a flaw: we never said what branch the releases should be from. For all we know, the most recent two releases are on entirely different branches. To fix that, we need to filter the releases to just the branch we want.

var mostRecentTwoReleases = releases
    .Where(r => r.TargetCommitish == "myBranch")
    .OrderByDescending(r => r.CreatedAt)
    .Take(2)
    .ToArray();

var commitRequest = new CommitRequest
{
    Until = mostRecentTwoReleases[0].CreatedAt,
    Sha = mostRecentTwoReleases[0].TagName,
    Since = mostRecentTwoReleases[1].CreatedAt
};
var commits = await gitHubClient.Repository.Commits.GetAll("RepositoryOwner", "RepositoryName", commitRequest);

The `Where` clause is where we filter on the appropriate branch (it took some investigation to discover that the `TargetCommitish` property of a release is its branch name). We now have just the commits for the release branch we care about between the most recent release and the one before it.
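
One thing the snippets above gloss over is a branch with fewer than two releases, where indexing the second element would throw. Below is a rough sketch of the selection logic with a guard for that case. It uses invented tuples in place of Octokit's release objects, and the tag names, branch names, and dates are all made up.

```csharp
using System;
using System.Linq;

// Hypothetical releases; only the three properties used above are modelled.
var releases = new (string TagName, string TargetCommitish, DateTimeOffset CreatedAt)[]
{
    ("v1.0", "master", new DateTimeOffset(2016, 1, 1, 0, 0, 0, TimeSpan.Zero)),
    ("v2.0-beta", "develop", new DateTimeOffset(2016, 3, 1, 0, 0, 0, TimeSpan.Zero)),
    ("v1.1", "master", new DateTimeOffset(2016, 2, 1, 0, 0, 0, TimeSpan.Zero)),
};

var mostRecentTwoReleases = releases
    .Where(r => r.TargetCommitish == "master")
    .OrderByDescending(r => r.CreatedAt)
    .Take(2)
    .ToArray();

// With only one release on the branch there is no previous release,
// so leave the lower bound empty rather than indexing past the end.
var since = mostRecentTwoReleases.Length > 1
    ? mostRecentTwoReleases[1].CreatedAt
    : (DateTimeOffset?)null;

if (mostRecentTwoReleases.Length == 0)
{
    Console.WriteLine("No releases found on that branch.");
}
else
{
    var sinceText = since.HasValue ? since.Value.ToString("u") : "the beginning of the branch";
    Console.WriteLine($"Commits for {mostRecentTwoReleases[0].TagName} since {sinceText}");
}
```

Note that the beta release on the other branch is the newest overall, which is exactly the case the branch filter is there to handle.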

In the next post, we will look at reducing the noise in the commit history using pull requests. Until then, thank you for stopping by and don't forget to leave a comment.

¹ The `Sha` property of the `CommitRequest` can be either a commit hash or branch/tag name.

Change Requests

Much like my previous post on Meeting Etiquette, this is a topic I feel strongly about. I am sure there are good reasons that people will hate some of my suggestions and I'd love to hear them, but here are my views on change requests¹ based on my personal experiences.

All work is a change

I loathe projects that differentiate new work from changes to existing work. It creates two different process flows for little gain, creating points for confusion and mistakes. If all work items, whether new features, bug fixes, or enhancements to existing features, are raised as change requests, the workflow is the same. Everything should be tied back to requirements, regardless of the type of work, so arguments that claim there is a difference just don't wash with me. Consider a new feature as a change from not having it to having it; after all, that's exactly what it is.

Whatever the work is that is being conducted must still be implemented, reviewed and tested against requirements. Why make it harder than it needs to be?

Specify requirements, not solutions

There are many times I've been assigned a requirement that tells me how to fix something, not what needs fixing. Let's face it, everyone has an opinion but change requests are not the place to express them (except perhaps as a suggestion in the comments somewhere). A change request should clearly state the requirements that drive the change (i.e. the things that can be used to identify when the change request has been resolved) and any other information that may help (for example, steps to reproduce a bug or some rationale behind the change required).

Be descriptive

If I see one more change request with a summary or title like "Change to menu dropdown" or "Display control update", I will be rather miffed and may hurt someone (I'm British, "miffed" is just above "peeved" on the British Standards Anger Scale²). The title of a change request is very important and should give a clear indication of what the change request actually requires. Think of it a bit like Twitter; it's much nicer reading some useful information in a tweet than it is to learn that someone just had a coffee. If the title is not clear, time is wasted in going to look at the description every time someone sees that change request. Every status meeting, every discussion, click click click. Save everyone the effort and get it right first time, and if you spot a title that isn't clear enough, fix it right away.

Add value

Finally, when adding comments, additional description, attachments or anything else to a change request, make sure it adds value. Leave a trail for those who follow in your footsteps so that they can discover what changed and why. Document important discussions and decisions. If you don't, you are destined to go around in circles.

Manage releases by managing change

Target changes at releases and review new changes regularly. This way, new requests raised during that release cycle can be considered for inclusion and deferred changes can be ignored until after the release. Each time a new release is started, review all the open requests and determine if they should be rejected, deferred or included in that release. Justify and document rejections in case a duplicate is raised and make sure to link duplicate issues as they can add value to one another.

Have meaningful states

I feel that any sane process for managing change requests needs only the following states.

Raised
Assigned
In progress
Ready for review
Passed review
Merged to trunk
Rejected
Closed

These are clear, unambiguous states.

If something is marked as "Raised", it hasn't been assigned to any release and no work should be happening on it.
If it's "Assigned", it should be targeted at a release (even if it's only intended for investigation at first – it can always be removed from the release back to "Raised" or rejected and closed).
If someone is working on something, that something should be marked as "In progress" as this helps to track progress at a glance and can also be useful if resources become available and things need reassigning.
If something passes testing, close it, unless you really don't trust the test team, in which case have a "Passed test" state and then review the results before closing³.

Considering these basic states, the workflow looks something like this:

You could add additional states if you so desired but I feel that these cover the bases well enough and provide an easy-to-follow workflow.
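
For anyone who prefers code to diagrams, the states can also be expressed as a small transition table. This is only a sketch of one plausible reading of the descriptions above: the exact set of allowed moves (for example, sending a failed review back to "In progress", or rejecting straight from "Raised") is my assumption for illustration, and `CanMove` is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;

// Assumed transitions between the states listed above; adjust to taste.
var allowedTransitions = new Dictionary<string, string[]>
{
    ["Raised"] = new[] { "Assigned", "Rejected" },
    ["Assigned"] = new[] { "In progress", "Raised", "Rejected" },
    ["In progress"] = new[] { "Ready for review" },
    ["Ready for review"] = new[] { "Passed review", "In progress" },
    ["Passed review"] = new[] { "Merged to trunk" },
    ["Merged to trunk"] = new[] { "Closed" },
    ["Rejected"] = new[] { "Closed" },
    ["Closed"] = Array.Empty<string>(), // nothing leaves "Closed"
};

// Hypothetical helper: true when the table allows moving from one state to another.
bool CanMove(string from, string to) =>
    allowedTransitions.TryGetValue(from, out var targets) && Array.IndexOf(targets, to) >= 0;

Console.WriteLine(CanMove("Assigned", "Raised")); // prints "True"
Console.WriteLine(CanMove("Closed", "Raised"));   // prints "False"
```

Keeping the table this small is rather the point: if a process needs many more rows than this, it is probably more complicated than it needs to be.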

Closed means closed

Change request management gets messy when the process allows for closed requests to be re-opened. If a closed request seems like it really is now needed, raise a new request. Don't close change requests just because they aren't being done right away; if it's a real issue, then it should remain open until it is resolved.

Not always, but mostly

The guiding principle for me when it comes to change requests is simplicity. Don't make your process more complicated than it needs to be. Make it easy to follow and hard to get wrong. While I'm yet to encounter a project that required anything different to what I've suggested, I am certain there are exceptions, so if you have any, let me know. This is often a contentious, polarising topic, so I expect someone, somewhere to emphatically insist I am wrong. I can accept that, so I look forward to finding out even better ways to manage change.

¹ You may also know them as issues, bugs, defects or some other moniker that ultimately means "a repository of things to do".
² I made that up.
³ Flippant remarks aside, this may be valuable if you need to perform a round of customer acceptance testing after internal testing before closing out change requests.