Write-through and Generational Caching Make a Great Team

Sep 28, 2012   //   by Jeff Treuting   //   Blog  //  No Comments

We all know that a great way to speed up any website is to throw some caching at it, right?  When considering caching strategies there are a lot of questions you should be asking.  What type of caching is right for my specific situation?  Do we have to worry about serving up stale data, and if so how do we go about handling that?  Plus another dozen or so questions.  My goal in this post is to show a generic caching strategy that can be implemented in most web applications and just work.

Caching Considerations

“So what exactly are you looking for when it comes to a generic non-website specific caching strategy?” you might probably didn’t ask.  Good question.

  1. Generic implementation
    By that I mean I can just throw it at any new project I’m working on and don’t have to customize it for the type of data I’m dealing with, or any specific cache implementation (i.e. Memcached, in-memory cache, etc.)
  2. Stays out of my way
    I don’t want to have to specifically clear out the cache for my entity whenever it is updated, or manage cache keys to invalidate previous queries.
  3. Don’t serve up stale data
    There are certain situations where stale data is acceptable, but for a generic implementation we want to be conservative and only serve up to date data.

There are 2 types of main data you will be dealing with in your application: single entities (think CRUD) and queries (give me all the posts in a category).  I’ll be dealing with these types differently.  For single entities, I’ll implement write-through caching and for queries I’ll be using generational caching.

Write-Through Caching

Here’s how to sum up write-through caching: when you write to the database you write to the cache.  Done and done.

Write-through caching is great for serving up single entities, not queries.  The single entity is always in cache so it will serve up really fast.  A great place for the write-through caching logic is in your repository layer and even better is to include it in a generic repository that is the base class for all your specific repository instances (that keeps it DRY), but that is information for another post.

Pretty simple, write? (terrible pun used for my satisfaction only I’m sure)

Let’s get to some code

I’m using a generic repository that uses Entity Framework 5 to access the database, so my methods will have a generic for the entity type.  If you implement this directly in a typed repository you will just need to change accordingly.

Let’s start with the caching provider.  Here I have the interface that we will code against.  It provides a generic caching interface and we will provide the concrete class via injection (personally I like using StructureMap).  The beauty of this approach is that you can start out with a simple in-memory caching provider and then scale up to a distributed cache like Memcached or something else later very easily.

    public interface ICachingProvider : IDisposable
    {
        void Set<T>(string key, T value, int? timeoutInSeconds = null);
        void Clear(string key);
        bool Exists(string key);
        bool Get<T>(string key, out T value);
        int Increment(string key, int defaultValue, int value);
    }

And the in-memory implementation looks like this.  It uses the built-in .NET runtime caching.

    /// <summary>
    /// Uses the .NET built-in MemoryCache as the caching provider.
    /// </summary>
    public class InMemoryCachingProvider : ICachingProvider
    {
        private static ObjectCache Cache
        {
            get { return MemoryCache.Default; }
        }

        private static readonly object LockObject = new object();

        public void Set<T>(string key, T value, int? cacheTime = null)
        {
            if (String.IsNullOrEmpty(key)) throw new ArgumentNullException("key");

            var policy = new CacheItemPolicy();

            if (cacheTime.HasValue)
            {
                policy.AbsoluteExpiration = DateTime.Now + TimeSpan.FromSeconds(cacheTime.Value);
            }

            Cache.Set(key, value, policy);
        }

        public void Clear(string key)
        {
            if (String.IsNullOrEmpty(key)) throw new ArgumentNullException("key");

            Cache.Remove(key);
        }

        public bool Exists(string key)
        {
            if (String.IsNullOrEmpty(key)) throw new ArgumentNullException("key");

            return Cache.Any(x => x.Key == key);
        }

        public bool Get<T>(string key, out T value)
        {
            if (String.IsNullOrEmpty(key)) throw new ArgumentNullException("key");

            value = default(T);

            try
            {
                if (!Exists(key))
                    return false;

                value = (T)Cache[key];
            }
            catch (Exception)
            {
                // ignore and use default
                return false;
            }

            return true;
        }

        public int Increment(string key, int defaultValue, int value)
        {
            if (String.IsNullOrEmpty(key)) throw new ArgumentNullException("key");

            lock (LockObject)
            {
                int current;
                if (!Get(key, out current))
                {
                    current = defaultValue;
                }

                var newValue = current + value;
                Set(key, newValue);
                return newValue;
            }
        }

        public void Dispose()
        {
            // no need to do anything
        }
    }

Now onto the repository bits.  I’m not going to show the whole thing because I don’t want the details of the generic repository to take away from the caching bits (plus I’m using a simplified version for my example here), so I’ll only be showing parts I think are necessary to understand what is going on.

Let’s get started with some helper methods we use for the write-through caching.  There is a method for setting the cache value and for making the cache key based on the type of the entity and the primary key.

        private void SetWriteThroughCache(int id, T item)
        {
            _cachingProvider.Set(CacheKey(id), item);
        }

        private string CacheKey(int id)
        {
            return String.Concat(_typeName, "/", id);
        }

The Get method is pretty straightforward I think.  It checks to see if the entity is already in cache and if not, get it from the database and put it into cache so it’s there the next time.

        public T Get(int id)
        {
            T item;

            // if the item is in the cache already then just return it
            if (_cachingProvider.Get(CacheKey(id), out item))
            {
                Debug.WriteLine("Found in cache for {0} with id of {1}", _typeName, id);
                return item;
            }

            Debug.WriteLine("NOT FOUND in cache for {0} with id of {1}", _typeName, id);

            // otherwise we need to call EF and get the item
            item = DbSet.Find(id); 

            // do the write-through caching so next time it won't hit the DB
            SetWriteThroughCache(id, item);

            return item;
        }

I’ll just show one of the CRUD operations since they are basically the same.

        public void Update(int id, T item)
        {
            // update the entity using EF
            Context.Entry(item).State = EntityState.Modified;
            Context.SaveChanges();

            // do the write-through caching
            SetWriteThroughCache(id, item);
        }

When we update the item in the database, we need to also update the cache, simple as that.  Same goes for the Add method, while in the Delete method we will just remove it from cache and then delete from the database.  All fairly straightforward.  Okay, so how can I query my entities instead of just getting one item at a time?  Wow, you ask great questions.  That leads right into my next part.

Generational Caching

The basic idea behind generational caching is that the majority of sites will have more reads than writes on a particular table.   If this is true, then there is a performance benefit to caching queries against that table until it is written to, then we clear out that cache.  Where generational caching shines is in the management of what information should be kept in cache and what should be cleared out.  Keeping track of what cached queries are stale because of a table update is tough.  To do this manually you’d have to loop through your cache (which caches are generally not designed for) look at each query and determine which tables it used or what rows within the table it used and if they have been updated or not.  Not fun.

Generational caching takes a simple, conservative approach.  If a table gets written to (insert, update or delete), then all cached queries for that table are no longer valid.  This is conservative because if I update a single row in a table then all queries (even those that don’t involve that row) are no longer used.  But it’s this simplicity that makes it a beautiful approach.  We don’t have to manage that process and we can get some good performance benefits as long as our site has more reads than writes in general.

Here’s how we do this. For each entity type we are tracking (most times it maps to one database table per entity) we setup a generation for each one that initially starts at 1 and is incremented each time the table is written to.  The generation will be stored in cache and will use a key like “Post/Generation”.  Then each time we cache queries or look for query results we use this generation as part of its key.  That’s it, that’s all there is to it.  “Really?  But you aren’t dealing with clearing the cache at all.”  Exactly, that’s the best part.  Since my generation is updated each time the table is written to and my cache lookup logic is using that generation, I will never serve up stale data and the old data won’t ever be accessed again.  The caching provider being used will clear out old data when the cache fills up so those old queries will get flushed on their own, no need to manually remove them.

Time for more code

Again we’ll start with the helper methods that deal with the generational part of generational caching.  The GetGeneration method gets the current generation from cache for this particular entity type (each type of entity gets its own generation and is independent of the others, so if you update the Post table it doesn’t affect the User cache).  NextGeneration is in charge of incrementing the generation by one, while the CacheKey builds up the appropriate cache key to use.

        private int GetGeneration()
        {
            int generation;
            return !_cachingProvider.Get(GetGenerationKey(), out generation) ? 1 : generation;
        }

        private int NextGeneration()
        {
            return _cachingProvider.Increment(GetGenerationKey(), 1, 1);
        }

        private string GetGenerationKey()
        {
            return String.Format("{0}/Generation", _typeName);
        }

        private string CacheKey(Expression<Func<T, bool>> predicate = null)
        {
            var key = predicate == null ? "All" : Md5Helper.CalculateMd5(predicate.ToString());
            return String.Concat(_typeName, "/", GetGeneration(), "/", key);
        }

The CacheKey version that accepts an Expression might need some additional explanation.  In our repository, we have a GetAll method that accepts an Expression (think LINQ) as a way to get only certain entities from a table.  To make a unique cache key for this we use the current generation (like we do for all generational caching keys) as well as an MD5 hash of the predicate itself.  Calling ToString() on the Expression returns a string representation of the query (isn’t that clever) which includes the specific values passed in, so it will be unique each time, and taking an MD5 of a unique string will itself be unique, so we are good there.

Only a couple more parts to go, I promise.

The GetAll method has an optional Expression parameter which lets us query our repository using LINQ syntax (or call GetAll() without an Expression to get everything).  We do the same basic logic in here as the Get() method where we check the cache and if it exists return it directly or query the database, cache the results and then return.  Pretty simple.

        public IEnumerable<T> GetAll(Expression<Func<T, bool>> predicate = null)
        {
            IEnumerable<T> items;

            // get the appropriate cache key for this query
            var key = CacheKey(predicate);

            // check if it's in the cache already
            if (!_cachingProvider.Get(key, out items))
            {
                Debug.WriteLine(String.Format("NOT FOUND cache for: {0}", key));

                // if not, then run the query
                items = predicate == null ? DbSet.ToList() : DbSet.Where(predicate).ToList();

                // and cache the results for next time
                _cachingProvider.Set(key, items);
            }
            else
            {
                Debug.WriteLine(String.Format("Found cache for: {0}", key));
            }

            // return the results either from cache or that were just run
            return items;
        }

So the most important piece that brings this all together is updating the generation whenever the table is updated.  So let’s add that to the repository and then we are done.  In Add(), Update() and Delete() we need to add a call to NextGeneration() and that takes care of it.

Here’s our updated Update method (couldn’t resist) but each one is similar.

        public void Update(int id, T item)
        {
            // udpate the entity using EF
            Context.Entry(item).State = EntityState.Modified;
            Context.SaveChanges();

            // do the write-through caching
            SetWriteThroughCache(id, item);

            // update the generation
            NextGeneration();
        }

To sum it all up

The combination of a repository layer, write-through caching and generational caching is a pretty powerful thing.  It is a straightforward way to gain some performance without really having to think too much.  It’s a conservative, generic approach that will work for most web applications, especially those that have many more reads than writes (which I’m betting is most sites you develop).

In the future, think about adding this strategy when doing new development or add it to an existing repository layer to gain a little boost right now.

Leave a comment

(will not be published)

CAPTCHA Image
*

Recent Posts