LINQ: Understanding Your Query Chain

This is part of a short series on the basics of LINQ:

This is the third part in my small series on LINQ and it covers what I feel is the most important thing to understand when using LINQ, query chains. We are going to build on the deferred execution concepts discussed in the last entry and look at why it is important to know your query operations.

Each method in a LINQ query is either immediately executed or deferred. When deferred, a method is either lazily evaluated one element at a time or eagerly evaluated as the entire collection. Usually, you can determine which is which from the documentation or, if that fails, a little experimentation. Why does it matter? This question from StackOverflow provides us with an example:

For those that did not read it or do not understand the problem, let me summarize. The original poster had a problem where values they had obtained from a LINQ query result, when passed into the `Except()` method on that same query, did not actually exclude anything. It was as if they had taken the sequence `1,2,3,4`, called `Exclude(2)`, and that had returned `1,2,3,4` instead of the expected `1,3,4`. On the surface, the code looked like it should work, so what was going on? To explain, we need a little more detail.

The example code has a class that described a user. An XML file contained user details and this is loaded into a sequence of `User` instances using LINQ-to-XML.

IEnumerable<IUser> users = doc.Element("Users").Elements("User").Select
        (u => new User
            { ID = (int)u.Attribute("id")
              Name = (string)u.Attribute("name")
            }
        ).OfType<IUser>();       //still a query, objects have not been materialized

As noted in the commentary, the poster understood that at this point, the query is not yet evaluated. With their query ready to be iterated, they use it to determine which users should be excluded using a different query.

public static IEnumerable<IUser> GetMatchingUsers(IEnumerable<IUser> users)
{
     IEnumerable<IUser> localList = new List<User>
     {
        new User{ ID=4, Name="James"},
        new User{ ID=5, Name="Tom"}
     }.OfType<IUser>();
     var matches = from u in users
                   join lu in localList
                       on u.ID equals lu.ID
                   select u;
     return matches;
}

And then using those results, the originally loaded list of users is filtered.

var excludes = users.Except(matches);    // excludes should contain 6 users but here it contains 8 users

Now, it is clear this code is not perfect and we could rewrite it to function without so many LINQ queries (I will give an example of that later in this post), but we do not care about the elegance of the solution; we are using this code as an example of why it is important to understand what a LINQ query is doing.

As noted in the commentary, when the line declaring the initial `users` query is executed, the query it defines has not. The query does not actually become a real list of users until it gets consumed1. Where does that happen? Go on, guess.

If you guessed `GetMatchingUsers`, you are wrong. All that method does is build an additional level of querying off the initial query and return that new query. If you guessed the `Except()` method, that's wrong too, because `Except()` is also deferred. In fact, the example only implies that something eventually looks at the results of `Except()` and as such, the query is never evaluated. So, for us to continue, let's assume that after the `excludes` variable (containing yet another unexecuted query), we have some code like this to consume the results of the query:

foreach (var user in excludes)
{
   Console.WriteLine(user.ID);
}

By iterating over `excludes`,  the query is executed and gives us some results. Now that we are looking at the query results, what happens?

First, the `Except()` method takes the very first element from the `users` query, which in turn, takes the very first `User` element from the XML document and turns it into a `User` instance. This instance is then cast to `IUser` using `OfType`2.

Next, the `Except()` method takes each of the elements in the `matches` query result and compares it to the item just retrieved from the `users` collection. This means the entire `matches` query is turned into a concrete list. This causes the `users` query to be reevaluated to extract the matched users. The instances of `User` created from the `matches` query are compared with each instance from the `users` query and the ones that do not match are returned for us to output to the console.

It seems like it should work, but it does not, and the key to why is in how queries and, more importantly, deferred execution work.

Each evaluation of a deferred query is unique. It is very important to remember this when using LINQ. If there is one thing to take away from reading my blog today, it is this. In fact, it's so important, I'll repeat it:

Each evaluation of a deferred query is unique.

It is important because it means that each evaluation of a deferred query (in most cases) results in an entirely new sequence with entirely new items. Consider the following iterator method:

IEnumerable<object> GetObject()
{
    yield return new Object();
}

It returns an enumerable that upon being consumed will produce a single `Object` instance. If we had a variable of the result of `GetObject()`, such as `var obj  = GetObject()` and then iterated `obj` several times, each time would give us a different `Object` instance. They would not match because on each iteration, the deferred execution is reevaluated.

If we go back to the question from StackOverflow armed with this knowledge, we can identify that `users` is evaluated twice by the `Except()` call. One time to get the list of exceptions out of the `matches` query and another to process the list that is being filtered. It is the equivalent of this:

IEnumerable<object> GetObjects()
{
    return new[]
    {
        new Object(),
        new Object(),
        new Object(),
        new Object(),
        new Object()
    };
}

void main()
{
   var objects = GetObjects().Except(GetObjects());
   foreach (var o in objects)
   {
        Console.WriteLine(o);
   }
}

From this code, we would never expect `objects` to contain nothing since the two calls to the immediately executed `GetObjects` would return collections of completely different instances. When execution is deferred, we get the same effect; each evaluation of a query is as if it were a separate method call.

To fix this problem, we need to make sure our query is executed once to make the results "concrete", then use those concrete results to do the rest of the work. This is not only important to ensure that the objects being manipulated are the same in all uses of the queried data, but also to ensure that we don't do work more than once3. To make the query concrete, we call an immediately executed method such as `ToList()`, evaluating the query and capturing its results in a collection.

This is our solution, as the original poster of our StackOverflow question indicated. If we change the original `users` query to be evaluated and stored, everything works as it should. With the help of some investigation and knowledge of how LINQ works, we now also know why.

Now that we understand a little more about LINQ we can consider how we might rewrite the original poster's example code. For example, we really should not to iterate the `users` list twice at all; we should see the failure of `Except()` as a code smell that we are iterating the collection too often. Though making it concrete with `ToList()` fixes the bug, it does not fix this inefficiency.

To do that, we can rewrite it to something like this:

public static bool IsMatchingUser(User user)
{
     var localList = new List<User> {
        new User{ ID=4, Name="James"},
        new User{ ID=5, Name="Tom"} }
      .Cast<IUser>()
      .AsReadOnly();

    return localList.Any(u=>u.ID == user.ID && u.Name == user.Name);
}

var users = doc
    .Element("Users")
    .Elements("User")
    .Select(u => new User {
        ID = (int)u.Attribute("id")
        Name = (string)u.Attribute("name") })
    .Where(u => IsMatchingUser(u));

This update only iterates over each user once, resulting in a collection that excludes the users we don't want4.

In conclusion…

My intention here was to show why it is fundamental to know which methods are immediately executed, which ones are deferred, and whether those deferred methods are lazily or eagerly evaluated. At the end of this post are some examples of each kind of LINQ method, but a good rule of thumb is that if the method returns a type other than `IEnumerable` or `IQueryable` (e.g. `int` or `List`), it is immediately executed; all other cases are probably using deferred execution. If a method does use deferred execution, it is also helpful to know which ones iterate the entire collection every time and which ones stop iterating when they have their answer, but for this you will need to consult documentation and possibly experiment with some code.

Just knowing these different types of methods can be a part of your query will often be enough to help you write better LINQ and debug queries faster.  By knowing your LINQ methods, you can improve performance and reduce memory overhead, especially when working with large data sets and slow network resources. Without this knowledge, you are likely to evaluate queries and iterate sequences too often, and instantiate objects too many times.

Hopefully you were able to follow this post and it has helped you get a better grasp on LINQ. In the final post of this series, I will ease up on the deep code analysis, and look at query syntax versus dot notation (aka fluent notation). In the meantime, if you have any comments, I'd love to read them.

Examples of LINQ Method Varieties

Immediate Execution

`Count()`, `Last()`, `ToList()`, `ToDictionary()`, `Max()`, `Aggregate()`
Immediate iterate the entire collection every time

`Any()`, `All()`, `First()`, `Single()`
Iterate the collection until a condition is or is not met

Deferred Execution, Eager Evaluation

`Distinct()`, `OrderBy()`, `GroupBy()`
Iterate the entire collection but only when the query is evaluated

Deferred Execution, Lazy Evaluation

`Select()`,`Where()`
Iterate the entire collection

`Take()`,`Skip()`
Iterate until the specified count is reached

  1. "consumed" is often used as an alternative to "iterated" []
  2. `Cast()` should have been used here since all the objects loaded are of the same type []
  3. This is something that becomes very important when working with large queries that can be time or resource consuming to run []
  4. with more effort, I am certain there are more things that could be done to improve this code, but we're not here for that, so I'll leave is as an exercise for you; I'm generous like that []

Resolving XML references from embedded resources

Recently, I wanted to validate some XML via an XSD schema. Due to some product constraints and intentions regarding versioning, the schema is an embedded resource and is referenced via the noNamespaceSchemaLocation attribute.

When loading XML in .NET, you can specify an XmlResolver via the XmlReaderSettings . As stated in MSDN, the default uses a new XmlUrlResolver without credentials. This works great when the file is local on disk, but not when it is squirreled away inside my resources.

What I needed was a special version of XmlResolver that understood how to find my embedded schemas. So I created a derivation, XmlEmbeddedResourceResolver, to do just that.

public class XmlEmbeddedResourceResolver : XmlUrlResolver
{
	public XmlEmbeddedResourceResolver( Assembly resourceAssembly )
	{
		_resourceAssembly = resourceAssembly;
	}

	public override object GetEntity( Uri absoluteUri, string role, Type ofObjectToReturn )
	{
		if ( absoluteUri == null ) throw new ArgumentNullException( "absoluteUri", "Must provide a URI" );

		if ( ShouldAttemptResourceLoad( absoluteUri, ofObjectToReturn ) )
		{
			var resourceStream = GetSchemaStreamFromEmbeddedResources( absoluteUri.AbsolutePath );
			if ( resourceStream != null )
			{
				return resourceStream;
			}
		}
		return base.GetEntity( absoluteUri, role, ofObjectToReturn );
	}

	public override async Task<object> GetEntityAsync( Uri absoluteUri, string role, Type ofObjectToReturn )
	{
		if ( absoluteUri == null ) throw new ArgumentNullException( "absoluteUri", "Must provide a URI" );

		if ( ShouldAttemptResourceLoad( absoluteUri, ofObjectToReturn ) )
		{
			var resourceStream = await GetSchemaStreamFromEmbeddedResourcesAsync( absoluteUri.AbsolutePath );
			if ( resourceStream != null )
			{
				return resourceStream;
			}
		}

		return await base.GetEntityAsync( absoluteUri, role, ofObjectToReturn );
	}

	private static bool ShouldAttemptResourceLoad( Uri absoluteUri, Type ofObjectToReturn )
	{
		return ( absoluteUri.Scheme == Uri.UriSchemeFile && ofObjectToReturn == null || ofObjectToReturn == typeof( Stream ) || ofObjectToReturn == typeof( object ) );
	}

	private Stream GetSchemaStreamFromEmbeddedResources( string uriPath )
	{
		var schemaFileName = Path.GetFileName( uriPath );
		var schemaResourceName = _resourceAssembly.GetManifestResourceNames().FirstOrDefault( n => n.EndsWith( schemaFileName ) );
		if ( schemaResourceName != null )
		{
			return _resourceAssembly.GetManifestResourceStream( schemaResourceName );
		}
		return null;
	}

	private Task<object> GetSchemaStreamFromEmbeddedResourcesAsync( string uriPath )
	{
		return Task.Run( () => (object)GetSchemaStreamFromEmbeddedResources( uriPath ) );
	}

	private readonly Assembly _resourceAssembly;
}

When asked to find a file-based reference, this steps in and looks in embedded resources first for a file of the same name. Since the file could be namespaced anywhere in the resources, I opted to look for any resource in any namespace with the same file name. If it is there, it loads it, otherwise it defers to the base implementation. This means there is no easy way to override the embedded file with a local one; however, that could be redressed by calling the base implementation first and then only searching embedded resources if that failed.

Note that I also implemented the async methods. I am certain my implementation is a little naive, but it generally works. Just be careful if you allow this to be used asynchronously as I discovered you can very easily create a deadlock when used in conjunction with locks. This is not necessarily a caveat of my implementation, but of asynchronous programming in general.

Hopefully, others will find this useful. Let me know in the comments if you use this or something similar.

Caching with LINQPad.Extensions.Cache

One of the tools that I absolutely adore during my day-to-day development is LINQPad . If you are not familiar with this tool and you are a .NET developer, you should go to www.linqpad.net right now and install it. The basic version is free and feature-packed, though I recommend upgrading to the professional version. Not only is it inexpensive, but it also adds some great features like Intellisense1 and Nuget package support.

I generally use LINQPad as a simple coding environment for poking around my data sources, crafting quick coding experiments, and debugging. Because LINQPad does not have the overhead of a solution or project, like a development-oriented tool such as Visual Studio, it is easy to get stuck into a task. I no longer write throwaway console or WinForms apps; instead I just throw together a quick LinqPad query. I could continue on the virtues of this tool2, but I would like to touch on one of its utility features.

As part of LINQPad , you get some useful methods and types for extending LINQPad , dumping information to LINQPad's output window, and more. Two of these methods are LINQPad.Extensions.Cache and Utils.Cache. Using either Cache method, you can execute some code and cache the result locally, then use the cached value for all subsequent runs of that query. This is incredibly useful for caching the results of an expensive database query or computationally-intensive calculation. To cache an IEnumerable<T>  or IObservable<T>  you can do something like this:

var thingThatTakesALongTime = from x in myDB.Thingymabobs
                              where x.Whatsit == "thingy"
                              select x;
var myThing = LINQPad.Extensions.Cache(thingThatTakesALongTime);

Or, since it's an extension method,

var myThing = thingThatTakesALongTime.Cache();

For other types, Util.Cache  will cache the result of an expression.

var x = Util.Cache(()=> { /* Something expensive */ });

The first time I run my LINQPad code, my lazily evaluated query or the expression is executed by the Cache method and the result is cached. From then on, each subsequent run of the code uses the cached value. Both Cache methods also take an optional name for the cached item, in case you want to differentiate items that might otherwise be indistinguishable (such as caching a loop computation).

This is, as I alluded earlier, one of many utilities provided within LINQPad that make it a joy to use. What tools do you find invaluable? Do you already use LINQPad ? What makes it a must have tool for you? I would love to hear your responses in the comments.

Updated to correct casing of LINQPad, draw attention to Cache being an extension method for some uses, and adding note of Util.Cache3.

  1. including for imported data types from your data sources []
  2. such as its support for F#, C#, SQL, etc. or its built-in IL disassembly []
  3. because, apparently, I am not observant to this stuff the first time around. SMH []

Creating and using your own AngularJS filters

I have been working on the client-side portion of a rather complex feature and I found myself needing to trim certain things off a string when binding it in my AngularJS code. This sounded like a perfect job for a filter. For those familiar with XAML development on .NET-related platforms like WPF, Silverlight and WinRT, a filter in Angular is similar to a ValueConverter. The set of built in filters for Angular is pretty limited and did not support my desired functionality, so I decided to write new filter of my own called trim. I even wrote some simple testing for it, just to make sure it works.

Testing

For the sake of argument, let's presume I followed TDD or BDD principles and wrote my test spec up front. I used jasmine to describe each of the behaviours I wanted1.

describe('trim filter tests', function () {
	beforeEach(module('awesome'));

	it('should trim whitespace', inject(function (trimFilter) {
		expect(trimFilter(' string with whitespace ')).toBe('string with whitespace');
	}));
		
	it('should trim given token', inject(function (trimFilter) {
		expect(trimFilter('stringtoken', 'token')).toBe('string');
	}));
		
	it('should trim token and remaining whitespace', inject(function (trimFilter) {
		expect(trimFilter(' string token ', 'token')).toBe('string');
	}));
});

An important point to note here is that for your filter to be injected, you have to append the word Filter onto the end. So if your filter is called bob, your test should have bobFilter as its injected parameter.

Implementing the Filter

With the test spec written, I could implement the filter. Like many things in Angular that aren't directives, filters are pretty easy to write. They are a specialization of a factory, returning a function that takes an input and some arbitrary parameters, and returning the filter output.

You add a filter to a module using the filter method. Below is the skeleton for my filter, trim.

var myModule = angular.module('awesome');

myModule.filter( 'trim', function() {
    return function (input, tokenToTrim) {
        var output = input;
        // Do stuff and return the result
        return output;
    };
});

Here I have created a module called awesome and then added a new filter called trim. My filter takes the input and a token that is to be trimmed from the input. However, currently, the filter does nothing with that token; it just returns the input. We can use this filter in an Angular binding as below.

<p style'font-style:italic'>Add More {{someValue | trim:'Awesome'}} Awesome</p>

You can see that I am applying the trim filter and passing the token, "Awesome". If someValue was "Awesome", this would output:

Add More Awesome Awesome

You can see that "Awesome" was not trimmed because we didn't actually implement the filter yet. Here is the implementation.

myModule.filter('trim', function () {
	return function (input, token) {
		var output = input.trim();

		if (token && output.substr(output.length - token.length) === token) {
			output = output.substr(0, output.length - token.length).trim();
		}
		return output;
	};
});

This takes the input and removes any extra spaces from the start and end. If we have a token and the trimmed input value ends with the token value, we take the token off the end, trim and trailing space and return that value. Our binding now gives us:

Add More Awesome

Perfect.

  1. Try not to get hung up on the quality of my tests, I know you are in awe []

Unit testing attribute driven late-binding

I've been working on a RESTful API using ASP WebAPI. It has been a great experience so far. Behind the API is a custom framework that involves some late-binding. I decorate certain types with an attribute that associates the decorated type with another type1. The class orchestrating the late-binding takes a collection of IDecorated instances. It uses reflection to look at their attributes to determine the type they are decorated with and then instantiates that type.

It's not terribly complicated. At least it wasn't until I tried to test it. As part of my development I have been using TDD, so I wanted unit tests for my late-binding code, but I soon hit a hurdle. In mocking IDecorated, how do I make sure the mocked concrete type has the appropriate attribute?

var mockedObject = new Mock();

// TODO: Add attribute

binder.DoSpecialThing( mockedObject.Object ).Should().BeAwesome();

I am using Moq for my mocking framework accompanied by FluentAssertions for my asserts2. Up until this point, Moq seemed to have everything covered, yet try as I might I couldn't resolve this problem of decorating the generated type. After some searching around I eventually found a helpful Stack Overflow question and answer that directed me to TypeDescriptor.AddAttribute, a .NET-framework method that provides one with the means to add attributes at run-time!

var mockedObject = new Mock();

TypeDescriptor.AddAttribute(
    mockedObject.Object.GetType(),
    new MyDecoratorAttribute( typeof(SuperCoolThing) );

binder.DoSpecialThing( new [] { mockedObject.Object } )
    .Should()
    .BeAwesome();

Yes! Runtime modification of type decoration. Brilliant.

So, why didn't it work?

My binding class that I was testing looked a little like this:

public IEnumerable<Blah> DoSpecialThing( IEnumerable<IDecorated> decoratedThings )
{
    return from thing in decoratedThings
           let converter = GetBlahConverter( d.GetType() )
           where d != null
           select converter.Convert( d );
}

private IConverter GetBlahConverter( Type type )
{
    var blahConverterAttribute = Attribute
        .GetCustomAttributes( type, true )
        .Cast<BlahConverterAttribute>()
        .FirstOrDefault();

    if ( blahConverterAttribute != null )
    {
        return blahConverterAttribute.ConverterType;
    }

    return null;
}

Looks fine, right? Yet when I ran it in the debugger and took a look, the result of GetCustomAttributes was an empty array. I was stumped.

After more time trying different things that didn't work than I'd care to admit, I returned to the StackOverflow question and started reading the comments; why was the answer accepted answer when it clearly didn't work? Lurking in the comments was the missing detail; if you use TypeDescriptor.AddAttributes to modify the attributes then you have to use TypeDescriptor.GetAttributes to retrieve them.

I promptly refactored my code with this detail in mind.

public IEnumerable<Blah> DoSpecialThing( IEnumerable<IDecorated> decoratedThings )
{
    return from thing in decoratedThings
           let converter = GetBlahConverter( d.GetType() )
           where d != null
           select converter.Convert( d );
}

private IConverter GetBlahConverter( Type type )
{
    var blahConverterAttribute = TypeDescriptor
        .GetAttributes( type )
        .OfType<BlahConverterAttribute>()
        .FirstOrDefault();

    if ( blahConverterAttribute != null )
    {
        return blahConverterAttribute.ConverterType;
    }

    return null;
}

Voila! My test passed and my code worked. This was one of those things that had me stumped for longer than it should have. I am sharing it in the hopes of making sure there are more hits when someone else goes Internet fishing for help. Now, I'm off to update Stack Overflow so that this is clearer there too.

  1. Something similar to TypeConverterAttribute usage in the BCL []
  2. Though I totally made up the BeAwesome() assertion in this blog post []