Package Management – somewhat abstract

🙇🏻‍♂️ Introducing checksync

Have you ever written code in more than one place that needs to stay in sync? Perhaps there is a tool in your framework of choice that can generate multiple files from a single source of truth, like T4 templates in the .NET world; perhaps not. Even if there is such a tool, it adds a layer of complexity that is not necessarily easy to grok. If you look at the output files or the template itself, it may not be clear what files are affected or related.

At Khan Academy, we have a linter, written in Python, that is executed whenever we create a new diff for review. It runs across a subset of our files and looks for blocks of text that are marked up with a custom comment format that identifies those blocks as being synchronized with other target blocks. Included in that markup is a checksum of the target block content such that if the target changes, we will get an error from the linter. This is our signal to check if further changes are need and then update the checksums that are invalidated. The only bugbear folks seem to have is that instead of offering an option to auto-fix checksums in need of update, it outputs a perl script that has to be copied and run for that purpose.

Small bugbear aside, this tool is fantastic. It enables us to link code blocks that need to be synchronized and catches when we change them with reasonably low overhead. Though I believe it is hugely useful, it is sadly custom to our codebase. I have long wanted to address that and create an open source version for everyone to use. checksync is that open source version.

🤔 The Requirements

Before writing checksync, I started out with the following requirements:

It should work with existing marked up code in the Khan Academy codebase; specifically,
1. File paths are relative to the project root directory
2. Checksums are calculated using Adler-32
3. Both // and # style comments are used to comment the markup tags
4. Start tag format is:
  sync-start:<ID> <CHECKSUM> <TARGET_FILE_PATH>
5. End tag format is:
  sync-end:<ID>
6. Multiple start tags can exist for the same tag ID but with different target files
7. Sync tags are not included in the checksum'd content
8. An extra line of blank content is included in the checksum'd content (due to a holdover from an earlier implementation)
9. .gitignore files should be ignored
10. Additional files can be ignored
It should be comparably performant to the existing linter
- The linter ran over the entire Khan Academy website codebase in less than 15 seconds
It should auto-update invalid checksums if asked to do so
It should output file paths such that editors like Visual Studio Code can open them on the correct line
It should support more comment styles
It should generally support any text file
It should run on Node 8 and above
- Some of our projects are still using Node 8 and I wanted to support those uses

With these requirements in mind, I implemented checksync (and ancesdir, which I ended up needing to ensure project root-relative file paths). By making it compatible with the existing Khan Academy linter, I could leverage the existing Khan Academy codebase to help measure performance and verify that things worked correctly. After a few changes to address various bugs and performance issues, it is still mildly slower than the Python equivalent, but the added features it provides more than make up for that (especially the fact that it is available to folks outside of our organization).

🎉 Check It Out

checksync includes a --help option to get information on usage. I have included the output below to give an overview of usage and the options available to customize how checksync runs.

checksync --help

checksync ✅ 🔗

Checksync uses tags in your files to identify blocks that need to remain
synchronised. It works on any text file as long as it can find the tags.

Tag Format

Each tagged block is identified by one or more sync-start tags and a single
sync-end tag.

The sync-start tags take the form:

    <comment> sync-start:<marker_id> <?checksum> <target_file>

The sync-end tags take the form:

    <comment> sync-end:<marker_id>

Each marker_idcan have multiple sync-start tags, each with a different
target file, but there must be only one corresponding sync-endtag.

Where:

    <comment>       is one of the comment tokens provided by the --comment
                    argument

    <marker_id>     is the unique identifier for this marker

    <checksum>      is the expected checksum of the corresponding block in
                    the target file

    <target_file>   is the path from your package root to the target file
                    with a corresponding sync block with the same marker_id

Usage

checksync <arguments> <include_globs>

Where:

    <arguments>       are the arguments you provide (see below)

    <include_globs>   are glob patterns for identifying files to check

Arguments

    --comments,-c      A string containing comma-separated tokens that
                       indicate the start of lines where tags appear.
                       Defaults to "//,#".

    --dry-run,-n       Ignored unless supplied with --update-tags.

    --help,-h          Outputs this help text.

    --ignore,-i        A string containing comma-separated globs that identify
                       files that should not be checked.

    --ignore-files     A comma-separated list of .gitignore-like files that
                       provide path patterns to be ignored. These will be
                       combined with the --ignore globs.
                       Ignored if --no-ignore-file is present.
                       Defaults to .gitignore.

    --no-ignore-file   When true, does not use any ignore file. This is
                       useful when the default value for --ignore-file is not
                       wanted.

    --root-marker,-m   By default, the root directory (used to generate
                       interpret and generate target paths for sync-start
                       tags) for your project is determined by the nearest
                       ancestor directory to the processed files that
                       contains a package.json file. If you want to
                       use a different file or directory to identify your
                       root directory, specify that using this argument.
                       For example, --root-marker .gitignore would mean
                       the first ancestor directory containing a
                       .gitignore file.

    --update-tags,-u   Updates tags with incorrect target checksums. This
                       modifies files in place; run with --dry-run to see what
                       files will change without modifying them.

    --verbose          More details will be added to the output when this
                       option is provided. This is useful when determining if
                       provided glob patterns are applying as expected, for
                       example.

And here is a simple example (taken from the checksync code repository) of running checksync against a directory with two files, using the defaults. The two files are given below to show how they are marked up for use with checksync. In this example, the checksums do not match the tagged content (though you are not expected to know that just by looking at the files – that's what checksync is for).

// This is a a javascript (or similar language) file

// sync-start:update_me 45678 __examples__/checksums_need_updating/b.py
const someCode = "does a thing";
console.log(someCode);
// sync-end:update_me

# Test file in Python style

# sync-start:update_me 4567 __examples__/checksums_need_updating/a.js
code = 1
# sync-end:update_me

Example output showing mismatched checksums

Additional examples that demonstrate various synchronization conditions and error cases can be found in the checksync code repository. To give checksync a try for yourself:

Install it from the npmjs.com repository:
yarn install checksync
Get the source from github.com/somewhatabstract/checksync and follow the usage instructions.

I hope you find this tool useful, and if you do or you have any questions, please do comment on this blog.

🙇🏻‍♂️ Introducing ancesdir

Photo by Maksym Kaharlytskyi on Unsplash

After many years of software development, I finally published my own NPM package. In fact, I published two. I was working on my checksync tool when I realised that I needed the package that this blog introduces. More on checksync in the next entry.

https://www.npmjs.com/package/ancesdir

🤔 What is root? Where is root?

Quite often, when working on some projects at Khan Academy, we need to know the root directory of the project. This enables us to write tools, linters, and tests that use root-relative paths, which in turn can make it much easier to refactor code. However, determining the root path of a project is not necessarily simple.

First, there is working out what identifies the root of a project. Is it the node_modules directory? The package.json file? The existence of .git folder? It may seem obvious to use one of these, but all these things have something in common; they don't necessarily exist. We can configure our package manager to have package.json and node_modules in non-standard places and we might change our source control, or not even run our code from within a clone of our repository. Determining the root folder by relying on any of these things as a marker is potentially not going to work.

Second, the code to walk the directory structure to find the given "marker" file or directory is not trivial. Sharing a common implementation within your project means everything that needs it, needs to locate it; in JavaScript, that means a relative path, at which point, you may as well just use a relative path to the known root directory and skip the shared approach all together. Yet, if you don't share a common implementation from a single location, then the code has to be duplicated everywhere you need it. I don't know about you, but that feels wrong.

💁🏻‍♂️ Solution: ancesdir

The issue of sharing a common implementation is easiest to solve. If that common implementation is installed as an NPM package, we don't need to include it via a relative path; we can just import it by its package name. There are packages out there that do this, but the ones I found all assumed some level of default setup, failing to acknowledge that this may change. In turn, they did not support a monorepo setup where there could be multiple sub-projects. How could one find the root folder of the monorepo from within a sub-project if all we used to identify the root folder were package.json? What if we wanted to sometimes get the root of the sub-project and sometimes the root of the monorepo?

I needed a way to identify a specific ancestor directory based on a known marker file or directory that would work even with non-standard setups. At Khan Academy, we have a marker file at the root of the project that is there solely to identify its parent directory as the project root. This file is agnostic of tech stack; it's just an empty file. It is solely there to say "this directory is the root directory". No tooling changes are going to render this mechanism broken unexpectedly unless they happen to use the same filename, which is unlikely. This way, we can find the repository root easily by locating that file. I wanted a package that could work just as easily with this custom marker file as it could with package.json.

I created ancesdir to fulfill these requirements¹.

yarn add ancesdir

The API is simple. In the default case, all you need to do is:

import ancesdir from "ancesdir";

console.log(`ancesdir's root directory is ${ancesdir()}`);

If you have a standard setup, with a package.json file, you will get the ancestor directory of the ancesdir package that contains that package.json file.

However, if you want the ancestor directory of the current file or a different path, you might use ancesdir like this:

import ancesdir from "ancesdir";

console.log(`This file's root directory is ${ancesdir(__dirname)}`);

In this example, we have given ancesdir a path from which to being its search. Of course, that still only works if there is an ancestor directory that contains a package.json file. What if that's not what you want?

For the more complex scenarios, like monorepos, for example, you can use ancesdir with a marker name, like this:

import ancesdir from "ancesdir";

console.log(`The monorepo root directory is ${ancesdir(__dirname, ".my_unique_root_marker_file")}`);

ancesdir will then give you the directory you seek (or null if it cannot be found). Not only that, but repeated requests will work faster as the results are cached as the directory tree is traversed.

Conclusion

If you find yourself needing a utility like this, checkout ancesdir. I hope y'all find it useful and I would love to hear if you do. You can checkout the source on GitHub.

The name is a play on the word "ancestor", while also attempting indicate that it has something to do with directories. I know, clever, right? [↩]

Getting Information About Your Git Repository With C#

During a hackathon not so long ago, I wanted to incorporate some source control data into my .NET assembly version information for the purposes of troubleshooting installations, making it easier for people to report the code in which they found a bug, and making it easier for people to find the code in which a bug was found¹. The plan was to automatically encode the branch, the commit hash, and whether there were local commits or local changes into the `AssemblyConfiguration` attribute of my assemblies during the build.

At the time, I hacked together the `RepositoryInformation` class below that wraps the command line tool to extract the required information. This class supported detecting if the directory is a repository, checking for local commits and changes, getting the branch name and the name of the upstream branch, and enumerating the log. Though it felt a little wrong just wrapping the command line (and seemed pretty fragile too), it worked. Unfortunately, it was dependent on git being installed on the build system; I would prefer the build to get everything it needs using package management like NuGet and npm².

class RepositoryInformation : IDisposable
{
    public static RepositoryInformation GetRepositoryInformationForPath(string path, string gitPath = null)
    {
        var repositoryInformation = new RepositoryInformation(path, gitPath);
        if (repositoryInformation.IsGitRepository)
        {
            return repositoryInformation;
        }
        return null;
    }
    
    public string CommitHash
    {
        get
        {
            return RunCommand("rev-parse HEAD");
        }
    }
    
    public string BranchName
    {
        get
        {
            return RunCommand("rev-parse --abbrev-ref HEAD");
        }
    }
    
    public string TrackedBranchName
    {
        get
        {
            return RunCommand("rev-parse --abbrev-ref --symbolic-full-name @{u}");
        }
    }
    
    public bool HasUnpushedCommits
    {
        get
        {
            return !String.IsNullOrWhiteSpace(RunCommand("log @{u}..HEAD"));
        }
    }
    
    public bool HasUncommittedChanges
    {
        get
        {
            return !String.IsNullOrWhiteSpace(RunCommand("status --porcelain"));
        }
    }
    
    public IEnumerable<string> Log
    {
        get
        {
            int skip = 0;
            while (true)
            {
                string entry = RunCommand(String.Format("log --skip={0} -n1", skip++));
                if (String.IsNullOrWhiteSpace(entry))
                {
                    yield break;
                }
                
                yield return entry;
            }
        }
    }
    
    public void Dispose()
    {
        if (!_disposed)
        {
            _disposed = true;
            _gitProcess.Dispose();
        }
    }
    
    private RepositoryInformation(string path, string gitPath)
    {
        var processInfo = new ProcessStartInfo
        {
            UseShellExecute = false,
            RedirectStandardOutput = true,
            FileName = Directory.Exists(gitPath) ? gitPath : "git.exe",
            CreateNoWindow = true,
            WorkingDirectory = (path != null && Directory.Exists(path)) ? path : Environment.CurrentDirectory
        };
        
        _gitProcess = new Process();
        _gitProcess.StartInfo = processInfo;
    }
    
    private bool IsGitRepository
    {
        get
        {
            return !String.IsNullOrWhiteSpace(RunCommand("log -1"));
        }
    }
    
    private string RunCommand(string args)
    {
        _gitProcess.StartInfo.Arguments = args;
        _gitProcess.Start();
        string output = _gitProcess.StandardOutput.ReadToEnd().Trim();
        _gitProcess.WaitForExit();
        return output;
    }
    
    private bool _disposed;
    private readonly Process _gitProcess;
}

If I were to approach this again today, I would use the LibGit2Sharp NuGet package or something similar³. Below is an updated version of `RepositoryInformation` that uses LibGit2Sharp instead of git command line. Clearly, you could forego any type of wrapper for LibGit2Sharp and I probably would if I were incorporating this into a bigger task like the one I originally had planned.

class RepositoryInformation : IDisposable
{
    public static RepositoryInformation GetRepositoryInformationForPath(string path)
    {
        if (LibGit2Sharp.Repository.IsValid(path))
        {
            return new RepositoryInformation(path);
        }
        return null;
    }
    
    public string CommitHash
    {
        get
        {
            return _repo.Head.Tip.Sha;
        }
    }
    
    public string BranchName
    {
        get
        {
            return _repo.Head.Name;
        }
    }
    
    public string TrackedBranchName
    {
        get
        {
            return _repo.Head.IsTracking ? _repo.Head.TrackedBranch.Name : String.Empty;
        }
    }
    
    public bool HasUnpushedCommits
    {
        get
        {
            return _repo.Head.TrackingDetails.AheadBy > 0;
        }
    }
    
    public bool HasUncommittedChanges
    {
        get
        {
            return _repo.RetrieveStatus().Any(s => s.State != FileStatus.Ignored);
        }
    }
    
    public IEnumerable<Commit> Log
    {
        get
        {
            return _repo.Head.Commits;
        }
    }
    
    public void Dispose()
    {
        if (!_disposed)
        {
            _disposed = true;
            _repo.Dispose();
        }
    }
    
    private RepositoryInformation(string path)
    {
        _repo = new Repository(path);
    }

    private bool _disposed;
    private readonly Repository _repo;
}

I have yet to use any of this outside of my hackathon work or this blog entry, but now that I have resurrected it from my library of coding exploits past to write about, I might just resurrect the original plans I had too. Whether that happens or not, I hope you found this useful or at least a little interesting; if so, or if you have some suggestions related to this post, please let me know in the comments.

Sometimes, like a squirrel, you want to know which branch you were on [↩]
I had looked at NuGet packages when I was working on the original hackathon project, but had decided not to use one for some reason or another (perhaps the available packages did not do everything I wanted at that time) [↩]
PowerShell could be a viable replacement for my initial approach, but it would suffer from the same issue of needing git on the build system; by using a NuGet package, the build includes everything it needs [↩]