LINQ: Understanding Your Query Chain

This is part of a short series on the basics of LINQ:

Part One – Clarity, complexity, and understanding
Part Two – Deferred Execution
Part Three – Understanding Your Query Chain
Part Four – Notation, Syntax, and Snags

This is the third part in my small series on LINQ and it covers what I feel is the most important thing to understand when using LINQ, query chains. We are going to build on the deferred execution concepts discussed in the last entry and look at why it is important to know your query operations.

Each method in a LINQ query is either immediately executed or deferred. When deferred, a method is either lazily evaluated one element at a time or eagerly evaluated as the entire collection. Usually, you can determine which is which from the documentation or, if that fails, a little experimentation. Why does it matter? This question from StackOverflow provides us with an example:

For those that did not read it or do not understand the problem, let me summarize. The original poster had a problem where values they had obtained from a LINQ query result, when passed into the `Except()` method on that same query, did not actually exclude anything. It was as if they had taken the sequence `1,2,3,4`, called `Exclude(2)`, and that had returned `1,2,3,4` instead of the expected `1,3,4`. On the surface, the code looked like it should work, so what was going on? To explain, we need a little more detail.

The example code has a class that described a user. An XML file contained user details and this is loaded into a sequence of `User` instances using LINQ-to-XML.

IEnumerable<IUser> users = doc.Element("Users").Elements("User").Select
        (u => new User
            { ID = (int)u.Attribute("id")
              Name = (string)u.Attribute("name")
            }
        ).OfType<IUser>();       //still a query, objects have not been materialized

As noted in the commentary, the poster understood that at this point, the query is not yet evaluated. With their query ready to be iterated, they use it to determine which users should be excluded using a different query.

public static IEnumerable<IUser> GetMatchingUsers(IEnumerable<IUser> users)
{
     IEnumerable<IUser> localList = new List<User>
     {
        new User{ ID=4, Name="James"},
        new User{ ID=5, Name="Tom"}
     }.OfType<IUser>();
     var matches = from u in users
                   join lu in localList
                       on u.ID equals lu.ID
                   select u;
     return matches;
}

And then using those results, the originally loaded list of users is filtered.

var excludes = users.Except(matches);    // excludes should contain 6 users but here it contains 8 users

Now, it is clear this code is not perfect and we could rewrite it to function without so many LINQ queries (I will give an example of that later in this post), but we do not care about the elegance of the solution; we are using this code as an example of why it is important to understand what a LINQ query is doing.

As noted in the commentary, when the line declaring the initial `users` query is executed, the query it defines has not. The query does not actually become a real list of users until it gets consumed¹. Where does that happen? Go on, guess.

If you guessed `GetMatchingUsers`, you are wrong. All that method does is build an additional level of querying off the initial query and return that new query. If you guessed the `Except()` method, that's wrong too, because `Except()` is also deferred. In fact, the example only implies that something eventually looks at the results of `Except()` and as such, the query is never evaluated. So, for us to continue, let's assume that after the `excludes` variable (containing yet another unexecuted query), we have some code like this to consume the results of the query:

foreach (var user in excludes)
{
   Console.WriteLine(user.ID);
}

By iterating over `excludes`, the query is executed and gives us some results. Now that we are looking at the query results, what happens?

First, the `Except()` method takes the very first element from the `users` query, which in turn, takes the very first `User` element from the XML document and turns it into a `User` instance. This instance is then cast to `IUser` using `OfType`².

Next, the `Except()` method takes each of the elements in the `matches` query result and compares it to the item just retrieved from the `users` collection. This means the entire `matches` query is turned into a concrete list. This causes the `users` query to be reevaluated to extract the matched users. The instances of `User` created from the `matches` query are compared with each instance from the `users` query and the ones that do not match are returned for us to output to the console.

It seems like it should work, but it does not, and the key to why is in how queries and, more importantly, deferred execution work.

Each evaluation of a deferred query is unique. It is very important to remember this when using LINQ. If there is one thing to take away from reading my blog today, it is this. In fact, it's so important, I'll repeat it:

Each evaluation of a deferred query is unique.

It is important because it means that each evaluation of a deferred query (in most cases) results in an entirely new sequence with entirely new items. Consider the following iterator method:

IEnumerable<object> GetObject()
{
    yield return new Object();
}

It returns an enumerable that upon being consumed will produce a single `Object` instance. If we had a variable of the result of `GetObject()`, such as `var obj = GetObject()` and then iterated `obj` several times, each time would give us a different `Object` instance. They would not match because on each iteration, the deferred execution is reevaluated.

If we go back to the question from StackOverflow armed with this knowledge, we can identify that `users` is evaluated twice by the `Except()` call. One time to get the list of exceptions out of the `matches` query and another to process the list that is being filtered. It is the equivalent of this:

IEnumerable<object> GetObjects()
{
    return new[]
    {
        new Object(),
        new Object(),
        new Object(),
        new Object(),
        new Object()
    };
}

void main()
{
   var objects = GetObjects().Except(GetObjects());
   foreach (var o in objects)
   {
        Console.WriteLine(o);
   }
}

From this code, we would never expect `objects` to contain nothing since the two calls to the immediately executed `GetObjects` would return collections of completely different instances. When execution is deferred, we get the same effect; each evaluation of a query is as if it were a separate method call.

To fix this problem, we need to make sure our query is executed once to make the results "concrete", then use those concrete results to do the rest of the work. This is not only important to ensure that the objects being manipulated are the same in all uses of the queried data, but also to ensure that we don't do work more than once³. To make the query concrete, we call an immediately executed method such as `ToList()`, evaluating the query and capturing its results in a collection.

This is our solution, as the original poster of our StackOverflow question indicated. If we change the original `users` query to be evaluated and stored, everything works as it should. With the help of some investigation and knowledge of how LINQ works, we now also know why.

Now that we understand a little more about LINQ we can consider how we might rewrite the original poster's example code. For example, we really should not to iterate the `users` list twice at all; we should see the failure of `Except()` as a code smell that we are iterating the collection too often. Though making it concrete with `ToList()` fixes the bug, it does not fix this inefficiency.

To do that, we can rewrite it to something like this:

public static bool IsMatchingUser(User user)
{
     var localList = new List<User> {
        new User{ ID=4, Name="James"},
        new User{ ID=5, Name="Tom"} }
      .Cast<IUser>()
      .AsReadOnly();

    return localList.Any(u=>u.ID == user.ID && u.Name == user.Name);
}

var users = doc
    .Element("Users")
    .Elements("User")
    .Select(u => new User {
        ID = (int)u.Attribute("id")
        Name = (string)u.Attribute("name") })
    .Where(u => IsMatchingUser(u));

This update only iterates over each user once, resulting in a collection that excludes the users we don't want⁴.

In conclusion…

My intention here was to show why it is fundamental to know which methods are immediately executed, which ones are deferred, and whether those deferred methods are lazily or eagerly evaluated. At the end of this post are some examples of each kind of LINQ method, but a good rule of thumb is that if the method returns a type other than `IEnumerable` or `IQueryable` (e.g. `int` or `List`), it is immediately executed; all other cases are probably using deferred execution. If a method does use deferred execution, it is also helpful to know which ones iterate the entire collection every time and which ones stop iterating when they have their answer, but for this you will need to consult documentation and possibly experiment with some code.

Just knowing these different types of methods can be a part of your query will often be enough to help you write better LINQ and debug queries faster. By knowing your LINQ methods, you can improve performance and reduce memory overhead, especially when working with large data sets and slow network resources. Without this knowledge, you are likely to evaluate queries and iterate sequences too often, and instantiate objects too many times.

Hopefully you were able to follow this post and it has helped you get a better grasp on LINQ. In the final post of this series, I will ease up on the deep code analysis, and look at query syntax versus dot notation (aka fluent notation). In the meantime, if you have any comments, I'd love to read them.

Examples of LINQ Method Varieties

Immediate Execution

`Count()`, `Last()`, `ToList()`, `ToDictionary()`, `Max()`, `Aggregate()`
Immediate iterate the entire collection every time

`Any()`, `All()`, `First()`, `Single()`
Iterate the collection until a condition is or is not met

Deferred Execution, Eager Evaluation

`Distinct()`, `OrderBy()`, `GroupBy()`
Iterate the entire collection but only when the query is evaluated

Deferred Execution, Lazy Evaluation

`Select()`,`Where()`
Iterate the entire collection

`Take()`,`Skip()`
Iterate until the specified count is reached

"consumed" is often used as an alternative to "iterated" [↩]
`Cast()` should have been used here since all the objects loaded are of the same type [↩]
This is something that becomes very important when working with large queries that can be time or resource consuming to run [↩]
with more effort, I am certain there are more things that could be done to improve this code, but we're not here for that, so I'll leave is as an exercise for you; I'm generous like that [↩]

LINQ: Deferred Execution

This is part of a short series on the basics of LINQ:

Part One – Clarity, complexity, and understanding
Part Two – Deferred Execution
Part Three – Understanding Your Query Chain
Part Four – Notation, Syntax, and Snags

In the first ~~rant~~ post of this short series on LINQ, I explained the motivation behind writing this series in the first place, which can be summarised as:

People don't know LINQ and that impacts my ability to make use of it; I should try to fix that.

To start, I'm going to explain what I believe is the most important concept in LINQ; deferred execution.

So, what is deferred execution?

Deferred execution code is not executed until the result is needed; the execution is put off (deferred) until later. By doing this we can combine a series of actions without actually executing any of them, then execute them at the time we need a result. This allows us to limit the execution of computationally expensive operations until we absolutely need them.

That's my description, here is one from an MSDN tutorial on LINQ-to-XML that perhaps puts it more clearly:

Deferred execution means that the evaluation of an expression is delayed until its realized value is actually required. Deferred execution can greatly improve performance when you have to manipulate large data collections, especially in programs that contain a series of chained queries or manipulations. In the best case, deferred execution enables only a single iteration through the source collection.

Some may be surprised to know that deferred execution was not new when LINQ arrived, it had already been around for quite some time in the form of iterator methods. In fact, it is iterator methods that give LINQ its deferred execution. Before we look at an iterator method, let's look at an example of immediate execution. For this example, we will give ourselves the task of taking a collection of people and outputting a collection of unique last names for all those born before 1980.

public struct Person
{
    public Person(string first, string last, DateTimeOffset dateOfBirth) : this()
    {
        FirstName = first;
        LastName = last;
        DateOfBirth = dateOfBirth;
    }
	
    public string FirstName { get; private set; }
    public string LastName { get; private set; }
    public DateTimeOffset DateOfBirth { get; private set; }
}

public static class Data
{
    public static IEnumerable<Person> People = new[] {
        new Person("John", "Smith", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 34 )),
        new Person("Bill", "Smith", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 20 )),
        new Person("Sarah", "Allans", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 19 )),
        new Person("John", "Johnson", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 44 )),
        new Person("James", "Jones", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 78 )),
        new Person("Alex", "Jones", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 30 )),
        new Person("Mabel", "Thomas", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 29 )),
        new Person("Sarah", "Brown", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 23 )),
        new Person("Gareth", "Smythe", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 100 )),
        new Person("Gregory", "Drake", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 90 )),
        new Person("Michael", "Johnson", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 56 )),
        new Person("Alex", "Smith", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 22 )),
        new Person("William", "Pickwick", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 17 )),
        new Person("Marcy", "Dickens", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 18 )),
        new Person("Erica", "Waters", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 26 ))
    };
}

IEnumerable<string> GetLastNamesImmediate()
{
    var distinctNamesToReturn = new List<string>();
    foreach (var person in Data.People)
    {
        if (person.DateOfBirth.Year < 1980 && !distinctNamesToReturn.Contains(person.LastName))
        {
            distinctNamesToReturn.Add(person.LastName);
        }
    }
    return distinctNamesToReturn;
}

When this method is called, it iterates over the entire collection and then returns its result, which also can then be iterated. If the collection of people were huge and we only cared about the first five names, this would be incredibly slow. To turn this into deferred execution, we can write it like this:

IEnumerable<string> GetLastNamesLazyDeferredWithoutLINQ()
{
    var lastNamesWeHaveSeen = new List<string>();
    foreach (var person in Data.People)
    {
        if (person.DateOfBirth.Year < 1980 && !lastNamesWeHaveSeen.Contains(person.LastName))
        {
            lastNamesWeHaveSeen.Add(person.LastName);
            yield return person.LastName;
        }
    }
}

Now, none of the code in this method is executed until the first time something calls `MoveNext()` on the returned enumerable. This means we could take the first five names without processing the entire collection of people, giving us a potentially enormous performance gain. If each item in the generated collection were computationally expensive to produce, without lazy evaluation, that expense would be multiplied by the total number of items in the collection on every call to the generator method; however, with lazy evaluation, the consumer of the collection gets to decide how many items are computed and therefore, how much work gets done. This ability to defer and control computationally expensive operations is the power of deferred execution.

However, not every deferred action necessarily has low overhead. Deferred execution actually comes in two flavours; eager evaluation and lazy evaluation (the example above is an example of lazy evaluation)¹. Every action in a deferred execution chain uses either lazy or eager evaluation. Though lazy evaluation is preferred, sometimes it is not possible to evaluate one item at a time, such as when sorting. Eagerly evaluated deferred execution allows us to at least defer the effort until we want it done.

An eagerly evaluated version of the iterator method we have been looking at might look like this:

IEnumerable<string> GetLastNamesEagerDeferredWithoutLINQ()
{
    var distinctNamesToReturn = new List<string>();
    foreach (var person in Data.People)
    {
        if (person.DateOfBirth.Year < 1980 && !distinctNamesToReturn.Contains(person.LastName))
        {
            distinctNamesToReturn.Add(person.LastName);
        }
    }
	
    foreach (var name in distinctNamesToReturn)
    {
        yield return name;
    }
}

In this example, because it is still an iterator method (it returns its results using `yield return`), none of the code is executed until the very first time the `MoveNext()` method is called on the returned enumerable, and therefore, the execution is deferred. When `MoveNext()` gets called for the first time, the entire collection of data is processed at once and then the results are output one by one as needed. The difference between this and the immediate execution equivalent we first looked at is that in this version, no work is done until a result is demanded.

Allowing the consumer of a collection to control how much work is done rather than work being dictated by the collection generator allows us to manage data more efficiently by building chains of operations and then processing the result in one go when needed. Lazy evaluation gives us the additional ability to spread the effort across each call to `MoveNext()`. The key to writing good LINQ is understanding which actions are immediate, which are deferred and lazily evaluated, which are deferred and eagerly evaluated, and why it matters. We will take a look at that next time.

Quite often, people use 'deferred execution' and 'lazy evaluation' interchangeably, but they are not actually synonymous, nor is 'immediate execution' synonymous with 'eager evaluation'. [↩]

LINQ: Clarity, complexity, and understanding

This is part of a short series on the basics of LINQ:

Part One – Clarity, complexity, and understanding
Part Two – Deferred Execution
Part Three – Understanding Your Query Chain
Part Four – Notation, Syntax, and Snags

At CareEvolution, we tend to develop using JavaScript on the front end, and C# on the back end (with some Python, PowerShell, CoffeeScript, R, SQL, and other languages thrown in when appropriate or technical debt dictates). We have hackathons every eight weeks where we get to be creative without the constraints of day-to-day work. We have a brown bag lunch talk every Wednesday. We work hard at embracing change, exploring new ways of doing things, and sharing what we have learned with each other. Quite often, leading figures in a particular technology emerge within our organisation: Brian knows JavaScript, Chris knows CSS, Brad knows SQL. While I doubt I know even half of the things about LINQ and its various implementations for database, Web API, or file interaction, I know enough to make it useful in my day to day work and I seem to be the one that employs it most in their code. I know LINQ.

I am certain this is going to sound familiar to many, but while my colleagues and I embrace all things as a collective, quite often a specific technology or its use will be avoided, derided, and hated by some. Whether driven by ignorance, a particular terrible experience, or prejudice¹, these deep-seated feelings can create conflict and occasionally hinder progress. For me, my use of LINQ has been a cause for contention during code reviews. I have faced comments like "LINQ is too hard to understand", "loops are clearer", "it's too easy to get burned using LINQ", and "I don't know how to use it so I'd prefer not to see it". And that's all true; LINQ can be confusing, it can be complicated, it can be a debugging nightmare. LINQ can suck. Whether you use the C# language keywords or the dot notation (a debate almost as passionate as tabs versus spaces), LINQ can tie you up in knots and leave you wondering what you did to deserve this fresh hell. Yet any technology could be described the same way when one doesn't know anything about it or when early mistakes have left a bitter aftertaste.

In response to these dissenting voices, I usually indicate the years of academic learning and professional experience it takes us to learn how to code at all. None of it is particularly easy and straightforward without some education. Don't believe me? Go stick your mum or dad in front of Visual Studio and, assuming they have never learned anything about C# or programming, see how far they get on writing Hello World without your help. Without educational instruction, we would not know any of it and LINQ is no different. When review comments inevitably request that I change my code to use less LINQ, none at all, or more understood language features like `foreach` and `while` loops, it frustrates me. It frustrates me because I usually feel that LINQ was the right choice for the job. I feel like I am being told, "use something I already know so I don't have to learn."

Of course, this interpretation is hyperbole. In actuality, when presented with opposing views to our own, it is easy to commit the black or white fallacy and assume one must be right and the other wrong, when really we should accept that we both may have a point (or neither) and learn more about the opposing view. Since I find, when used appropriately, LINQ can provide the best, most sublime, most elegant solution to problems that require the manipulation of collections in C#, I desperately want others to see that. It is as much on me as anyone else to try and correct for the disparity between what I see and what others see when I write LINQ. So, with my next post we will begin a journey into the basics of LINQ, when to use it², when to use dot notation over language keywords (or vice versa), and how to avoid some of the more common traps. We will begin with the cause of many confusing experiences; deferred execution.

we all know someone in the "That's new, I hate it" crowd [↩]
even I recognize LINQ is not a golden hammer; it's more of a chainsaw that kicks a little [↩]

Testing AngularJS: Asynchrony

So far, we have looked at some techniques for testing simple AngularJS factories and directives. However, things are rarely simple when it comes to web development and one area that complicates things is that of asynchronous operations such as web requests, timeouts and promises.

Eventually, when writing AngularJS, you will rely on the $timeout, $interval, or $q services to defer an action by some interval or indefinitely using promises. I will not go very deep into their use here, you can read much of that on the AngularJS documentation, but since it is likely that you will use them, how do you test them? How do you test asynchronous code without horribly complex and unreliable tests?

$timeout

Consider this simple example where we have a controller that defers some action using $timeout.

somewhatAbstract.controller('DeferredController', ['$scope', '$timeout', function ($scope, $timeout) {
    $scope.started = false;

    $timeout(function() {
        $scope.started = true;
    } );
}

Here we have a variable, started that is initialised to false and a deferred method that changes that value to true. A first stab at testing this might look a little like this:

describe 'deferred execution tests', ->
  describe 'started should be set to true', ->
    Given => module 'somewhatAbstract'
    Given => inject ($rootScope) =>
      @scope = $rootScope.$new()
    When => inject ($timeout, $controller) =>
      $controller 'deferredController', { $scope: @scope, $timeout: $timeout }
    Then => expect(@scope.started).toBeTruthy()

Unfortunately, such a test will not pass because the deferred code would not execute until after the expectation was tested. We can mitigate this by using some AngularJS magic provided by the ngMock module.

The ngMock module adds the $timeout.flush() method so that code deferred using $timeout can be executed deterministically¹. The test can therefore be modified such that it passes by adding the highlighted line below.

describe 'deferred execution tests', ->
  describe 'started should be set to true', ->
    Given => module 'somewhatAbstract'
    Given => inject ($rootScope) =>
      @scope = $rootScope.$new()
    When => inject ($timeout, $controller) =>
      $controller 'deferredController', { $scope: @scope, $timeout: $timeout }
      $timeout.flush()
    Then => expect(@scope.started).toBeTruthy()

$q

For promises that were deferred using $q (including the promise returned from using $interval), we can use $scope.$apply() to complete a resolved or rejected promise and execute any code depending on that promise.

In the following contrived example, we have a controller with a start() method that returns a promise and a started() method that resolves that promise.

somewhatAbstract.controller('deferredController', ['$scope', '$timeout', '$q', function ($scope, $timeout, $q) {
    var deferred = $q.defer();

    $scope.start = function () {
        return deferred.promise;
    };

    $scope.started = function() {
        deferred.resolve({started:true}); };
    };
}]);

describe 'deferred execution tests', ->
  describe 'started should be set to true when promise resolves', ->
    Given => module 'somewhatAbstract'
    Given => inject ($rootScope, $q, $controller) =>
      @scope = $rootScope.$new()
      $controller 'deferredController', { $scope: @scope, $q: $q }
    When =>
      @scope.start().then (result) => @actual = result
      @scope.started()
      @scope.$apply()
    Then => expect(@result).toBe { started: true }

The preceding test validates our controller and its promise. If you delete the highlighted line, you would see that the test fails because the resolved promise is never completed.

Finally…

In this post, we have taken a brief look at how AngularJS supports the testing of asynchronous code execution deferred using $timeout, $interval, or $q. The ability to synchronously control otherwise asynchronous actions not only allows us to test that deferred code, but also to prevent it executing at all. This can be incredibly useful when isolating different parts of our code by reducing how much of it has to run to validate a specific method.

Of course, quite often, a promise is only resolved after an HTTP request responds or fails, such as when using $resource. When writing unit tests, you may not have nor want the luxury of a back-end server that responds appropriately to test requests. Instead, you either want to fake out $resource, or fake out and validate the HTTP requests and responses². In upcoming posts, we'll look at a simple $resource fake for the former and the special $httpBackend service that AngularJS provides for the latter. Until then, please leave your comments.

The `flush()` method even takes a delay parameter to control which timeouts will execute and a similar method exists for `$interval` [↩]
The requests that you expect your code to make and the responses that your code expects to receive – or doesn't, as the case may be [↩]