LINQ: Notation, Syntax, and Snags

Welcome to the final post in my four part series on LINQ. So far, we've talked about:

For our last look into LINQ (at least for this mini-series), I want to tackle the mini-war of "dot notation" versus "query syntax", and look at some of the pitfalls that can be avoided by using LINQ responsibly.

Let Battle Commence…

For anyone who has written LINQ using C# (or VB.NET), you are probably aware that there is more than one way to express the query (two of which, sane people might use):

  1. Old school static method calls
  2. Method syntax
  3. Query syntax

No one in their right mind should be using the first of these options; extension methods were invented to alleviate the pain that would be caused by writing LINQ this way1. Extension methods, static methods that can be called as though member methods, are the reason why we have the second option of method syntax (more commonly known as dot notation or fluent notation). The final option, query syntax, is also known as "syntactical sugar", some language keywords that can make coding easier. These keywords map to concepts found in LINQ methods and query syntax is what gives LINQ it's name; Language INtegrated Query2.

They all map to the same thing, a sequence of methods that can be executed, or translated into an expression tree, evaluated by a LINQ provider, and executed. Anything written in one of these approaches can be written using the others. There is often contention on whether to use dot notation or query syntax, as if one is inherently better than the other, but as we all know, only the Sith deal in absolutes3.  Hopefully, by the end of these examples you will see how each has its merits.

Why are LINQ queries not always called like regular methods?

Because sometimes, such as in LINQ-to-SQL or LINQ-to-Entity Framework, the method calls need to be translated into SQL or some other querying syntax, allowing queries to take advantage of server-side querying optimizations. For a more in-depth look at all things LINQ, including the way the language keywords map to the method calls, I recommend looking at Jon Skeet's Edulinq series, which is available as a handy e-book.

Before we begin, here is a quick summary of the C# keywords that we have for writing queries in query syntax: `from`, `group`, `orderby`, `let`, `join`, `where` and `select`.  There are also contextual keywords to be used in conjunction with one or two of the main keywords:`in`, `into`, `ascending`, `descending`, `by`, `on` and `equals`. Each of these keywords has a corresponding equivalent method or methods in LINQ although it can sometimes be a little more complicated as we shall see.

So, let us look at an example and see how it can be expressed using dot notation and query syntax4). For an example, let us look at a simple projection of people to their last names.

public struct Person
{
    public Person(string first, string last, DateTimeOffset dateOfBirth) : this()
    {
        FirstName = first;
        LastName = last;
        DateOfBirth = dateOfBirth;
    }
    
    public string FirstName { get; private set; }
    public string LastName { get; private set; }
    public DateTimeOffset DateOfBirth { get; private set; }
}

public static class Data
{
    public static IEnumerable<Person> People = new[] {
        new Person("John", "Smith", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 34 )),
        new Person("Bill", "Smith", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 20 )),
        new Person("Sarah", "Allans", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 19 )),
        new Person("John", "Johnson", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 44 )),
        new Person("James", "Jones", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 78 )),
        new Person("Alex", "Jones", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 30 )),
        new Person("Mabel", "Thomas", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 29 )),
        new Person("Sarah", "Brown", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 23 )),
        new Person("Gareth", "Smythe", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 100 )),
        new Person("Gregory", "Drake", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 90 )),
        new Person("Michael", "Johnson", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 56 )),
        new Person("Alex", "Smith", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 22 )),
        new Person("William", "Pickwick", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 17 )),
        new Person("Marcy", "Dickens", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 18 )),
        new Person("Erica", "Waters", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 26 ))
    };
}
Data.People.Select(p => p.LastName);
from person in Data.People
select person.LastName;

These two queries do the exact same thing, but I find that the dot notation wins out because it takes less typing and it looks clearer. However, if we decide we want to only get the ones that were born before 1980, things look a little more even.

Data.People
    .Where(p => p.DateOfBirth.Year < 1980)
    .Select(p => p.LastName)
from person in Data.People
where person.DateOfBirth.Year < 1980
select person.LastName;

Here, there is not much difference between them, so I'd probably leave this to personal preference5. However, as soon as we want a distinct list, the dot notation starts to win out again because C# does not contain a `distinct` keyword (though VB.NET does).

Mixing dot notation and query syntax in a single query can look messy, as shown here:

(from person in Data.People
where person.DateOfBirth.Year < 1980
select person.LastName).Distinct()

So, I prefer to settle on just one style of LINQ declaration for any particular query, or to use intermediate variables and separate the query into parts (this is especially useful on complex queries as it also provides clarity; being terse is cool, but it is unnecessary, and a great way to get people to hate you and your code).

Data.People
    .Where(p => p.DateOfBirth.Year < 1980)
    .Select(p => p.LastName)
    .Distinct();
var lastNames = from person in Data.People
                where person.DateOfBirth.Year < 1980
                select person.LastName;
lastNames.Distinct();

The `Distinct()` method is not the only LINQ method that has no query syntax alternative, there are plenty of others like `Aggregate()`, `Except()`, or `Range()`. This often means dot notation wins out or is at least part of a query written in query syntax. So, thus far, dot notation seems to have the advantage in the battle against query syntax. It is starting to look like some of my colleagues are right, query syntax sucks. Even if we use ordering or grouping, dot notation seems to be our friend or at least is no more painful than query syntax:

Data.People
    .OrderBy (p => p.LastName)
    .ThenBy (p => p.FirstName)
    .GroupBy(p=>p.DateOfBirth.Year);
from person in Data.People
orderby person.LastName,
        person.FirstName
group person by person.DateOfBirth.Year;

However, it is not always so easy. What if we want to introduce variables, group something other than the original object, or use more than one source collection? It is in these scenarios where query syntax irons a lot more of the complexity. Let's assume we have another collection containing newsletters that we need to send out to all our people. To generate the individual mailings, we would need to combine these two collections6.

Data.People.SelectMany(
    person => newsletters,
    (person, newsletter) => new {person,newsletter} );
from person in Data.People
from newsletter in newsletters
select new {person, newsletter};

I know which one is clearer to read and easier to remember when I need to write a similar query. The dot notation example makes me think for a minute what it is doing; projecting each person to the newsletters collection and, using `SelectMany()`, flattening the list then selecting one result per person/newsletter combination. Our query syntax example is doing the same thing, but I don't need to think too hard to see that. Query syntax is starting to look useful.

If we were to throw in some mid-query variables (useful to avoid calculating something multiple times or to improve clarity), or join collections, query syntax becomes really useful. What if each newsletter is on a different topic and we only want to send newsletters to people who are interested in that topic?

people.SelectMany(
    person => person.Value,
    ( person, topic ) => new
    {
        person,
        topic
    } ).Join(
        newsletters,
        t => t.topic,
        newsletter => newsletter.Value,
        ( t, newsletter ) => new
        {
            t.person,
            newsletter
        } );
from person in Data.People
from topic in person.Topics
join newsletter in newsletters on topic equals newsletter.Topic
select new {person, newsletter};

I know for sure I would need to go look up how to do that in dot notation7. Query syntax is an easier way to write more complex queries like this and provided that you understand your query chain, you can declare clear, performant queries.

 

In conclusion…

In this post I have attempted to show how both dot notation and query syntax (aka fluent notation) have their vices and their virtues, and in turn, armed you with the knowledge to choose wisely.

So, think about whether someone can read and maintain what you have written. Break down complex queries into parts. Consider moving some things to lazily evaluated methods. Understand what you are writing; if you look at it and have to think about why it works, it probably needs reworking. Always favour clarity and simplicity over dogma and cleverness; to draw inspiration from Jurassic Park, even though you could, stop to think whether you should.

LINQ is a complex feature of C# and .NET (and all the other .NET languages) and there are many things I have not covered. So, if you have any questions, please leave a comment. If I can't answer it, I will hopefully be able to direct you to someone who can. Alternatively, check out Edulinq by the inimitable Jon Skeet, head over to StackOverflow where there is an Internet of people waiting to help (including Jon Skeet), or get binging (googling, yahooing, altavistaring, whatever…)8.

And that brings us to the end of this series on LINQ. From deferred execution and the query chain to dot notation versus query syntax, I hope that I have managed to paint a favourable picture of LINQ, and helped to clear up some of the prejudices and confusions that surround it. LINQ is a powerful weapon in the arsenal of a .NET programmer; to not use it, would be a waste.

  1. Just the thought of the nested method calls or high number of placeholder variables makes me shudder []
  2. I guess LIQ was too suggestive for Microsoft []
  3. That statement is an absolute, Obi Sith Kenobi []
  4. I am definitely leaving the nested static methods approach to you as an exercise (in futility []
  5. Though if you changed the `person` variable to `p`, there is less typing in the query syntax , if that is a metric you are concerned with []
  6. Yes, a nested `foreach` can achieve this simple example, but this is just illustrative, and I'd argue cleaner than a `foreach` approach []
  7. That's why I cheated and wrote it in query syntax, then used Resharper to change it to dot notation for me []
  8. Back in my day, it was called searching…grumble grumble []

LINQ: Deferred Execution

This is part of a short series on the basics of LINQ:

In the first rant post of this short series on LINQ, I explained the motivation behind writing this series in the first place, which can be summarised as:

People don't know LINQ and that impacts my ability to make use of it; I should try to fix that.

To start, I'm going to explain what I believe is the most important concept in LINQ; deferred execution.

So, what is deferred execution?

Deferred execution code is not executed until the result is needed; the execution is put off (deferred) until later. By doing this we can combine a series of actions without actually executing any of them, then execute them at the time we need a result. This allows us to limit the execution of computationally expensive operations until we absolutely need them.

That's my description, here is one from an MSDN tutorial on LINQ-to-XML that perhaps puts it more clearly:

Deferred execution means that the evaluation of an expression is delayed until its realized value is actually required. Deferred execution can greatly improve performance when you have to manipulate large data collections, especially in programs that contain a series of chained queries or manipulations. In the best case, deferred execution enables only a single iteration through the source collection.

Some may be surprised to know that deferred execution was not new when LINQ arrived, it had already been around for quite some time in the form of iterator methods. In fact, it is iterator methods that give LINQ its deferred execution. Before we look at an iterator method, let's look at an example of immediate execution. For this example, we will give ourselves the task of taking a collection of people and outputting a collection of unique last names for all those born before 1980.

public struct Person
{
    public Person(string first, string last, DateTimeOffset dateOfBirth) : this()
    {
        FirstName = first;
        LastName = last;
        DateOfBirth = dateOfBirth;
    }
	
    public string FirstName { get; private set; }
    public string LastName { get; private set; }
    public DateTimeOffset DateOfBirth { get; private set; }
}

public static class Data
{
    public static IEnumerable<Person> People = new[] {
        new Person("John", "Smith", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 34 )),
        new Person("Bill", "Smith", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 20 )),
        new Person("Sarah", "Allans", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 19 )),
        new Person("John", "Johnson", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 44 )),
        new Person("James", "Jones", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 78 )),
        new Person("Alex", "Jones", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 30 )),
        new Person("Mabel", "Thomas", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 29 )),
        new Person("Sarah", "Brown", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 23 )),
        new Person("Gareth", "Smythe", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 100 )),
        new Person("Gregory", "Drake", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 90 )),
        new Person("Michael", "Johnson", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 56 )),
        new Person("Alex", "Smith", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 22 )),
        new Person("William", "Pickwick", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 17 )),
        new Person("Marcy", "Dickens", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 18 )),
        new Person("Erica", "Waters", DateTimeOffset.Now - TimeSpan.FromDays( 365 * 26 ))
    };
}
IEnumerable<string> GetLastNamesImmediate()
{
    var distinctNamesToReturn = new List<string>();
    foreach (var person in Data.People)
    {
        if (person.DateOfBirth.Year < 1980 && !distinctNamesToReturn.Contains(person.LastName))
        {
            distinctNamesToReturn.Add(person.LastName);
        }
    }
    return distinctNamesToReturn;
}

When this method is called, it iterates over the entire collection and then returns its result, which also can then be iterated. If the collection of people were huge and we only cared about the first five names, this would be incredibly slow. To turn this into deferred execution, we can write it like this:

IEnumerable<string> GetLastNamesLazyDeferredWithoutLINQ()
{
    var lastNamesWeHaveSeen = new List<string>();
    foreach (var person in Data.People)
    {
        if (person.DateOfBirth.Year < 1980 && !lastNamesWeHaveSeen.Contains(person.LastName))
        {
            lastNamesWeHaveSeen.Add(person.LastName);
            yield return person.LastName;
        }
    }
}

Now, none of the code in this method is executed until the first time something calls `MoveNext()` on the returned enumerable. This means we could take the first five names without processing the entire collection of people, giving us a potentially enormous performance gain. If each item in the generated collection were computationally expensive to produce, without lazy evaluation, that expense would be multiplied by the total number of items in the collection on every call to the generator method; however, with lazy evaluation, the consumer of the collection gets to decide how many items are computed and therefore, how much work gets done. This ability to defer and control computationally expensive operations is the power of deferred execution.

However, not every deferred action necessarily has low overhead. Deferred execution actually comes in two flavours; eager evaluation and lazy evaluation (the example above is an example of lazy evaluation)1. Every action in a deferred execution chain uses either lazy or eager evaluation. Though lazy evaluation is preferred, sometimes it is not possible to evaluate one item at a time, such as when sorting. Eagerly evaluated deferred execution allows us to at least defer the effort until we want it done.

An eagerly evaluated version of the iterator method we have been looking at might look like this:

IEnumerable<string> GetLastNamesEagerDeferredWithoutLINQ()
{
    var distinctNamesToReturn = new List<string>();
    foreach (var person in Data.People)
    {
        if (person.DateOfBirth.Year < 1980 && !distinctNamesToReturn.Contains(person.LastName))
        {
            distinctNamesToReturn.Add(person.LastName);
        }
    }
	
    foreach (var name in distinctNamesToReturn)
    {
        yield return name;
    }
}

In this example, because it is still an iterator method (it returns its results using `yield return`), none of the code is executed until the very first time the `MoveNext()` method is called on the returned enumerable, and therefore, the execution is deferred. When `MoveNext()` gets called for the first time, the entire collection of data is processed at once and then the results are output one by one as needed. The difference between this and the immediate execution equivalent we first looked at is that in this version, no work is done until a result is demanded.

Allowing the consumer of a collection to control how much work is done rather than work being dictated by the collection generator allows us to manage data more efficiently by building chains of operations and then processing the result in one go when needed. Lazy evaluation gives us the additional ability to spread the effort across each call to `MoveNext()`. The key to writing good LINQ is understanding which actions are immediate, which are deferred and lazily evaluated, which are deferred and eagerly evaluated, and why it matters. We will take a look at that next time.

  1. Quite often, people use 'deferred execution' and 'lazy evaluation' interchangeably, but they are not actually synonymous, nor is 'immediate execution' synonymous with 'eager evaluation'. []