Sharding refers to horizontal partitioning of data across multiple machines. The idea is to split the load across many commodity machines, instead of buying huge expensive machines.
Raven has full support for sharding, and you can utilize sharding out of the box.
For the purpose of this document, we will use a shared blog model, with documents representing users, blogs and posts. We will shard the information among 5 Raven instances. Rather than sharding all documents across all Raven instances, we will dedicate one instance to user documents, one instance to blog documents and three instances to posts. (See more about why we chose this strategy in the Sharding Approaches section below.)
Sharding strategy credits
Much of the sharding strategy design was taken from the Hibernate Shards project.
Full example
You can see the entire source code for this example in the Demo Apps/Raven.Sample.ComplexSharding project that is part of Raven's source.
Setting up the client for sharding
The first thing to do is to create a Shards instance. A Shards instance contains all the information that the Raven Client API needs to access each Raven sharded instance.
var shards = new Shards
{
    new DocumentStore
    {
        Identifier = "Users",
        Url = "http://localhost:8081",
        Conventions =
        {
            GenerateDocumentKey = user => "users/" + ((User) user).Name
        }
    },
    new DocumentStore {Identifier = "Blogs", Url = "http://localhost:8082"},
    new DocumentStore {Identifier = "Posts #1", Url = "http://localhost:8083"},
    new DocumentStore {Identifier = "Posts #2", Url = "http://localhost:8084"},
    new DocumentStore {Identifier = "Posts #3", Url = "http://localhost:8085"}
};
You can see that Shards is pretty simple: it is just a container for the standard DocumentStore that you already know from the Raven Client API. The only slightly interesting thing here is the way we modify the default key generation strategy for users. A user document key is "users/[user name]" instead of "users/[identity generated by the database]", which is the default Raven Client API approach.
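The key convention for users boils down to a function from an entity to a string key. The following is a minimal standalone sketch of that function, without the Raven client. The User class is a stand-in for the sample's entity, and UserKeys is a hypothetical helper that is not part of Raven:

```csharp
using System;

// Stand-in for the sample's User entity; only Name matters here.
public class User
{
    public string Name { get; set; }
}

// Hypothetical helper, not part of the Raven Client API.
public static class UserKeys
{
    // Mirrors the convention: "users/" followed by the user name.
    public static string KeyFor(object entity)
    {
        var user = entity as User;
        if (user == null)
            throw new ArgumentException("Expected a User instance", nameof(entity));
        return "users/" + user.Name;
    }

    public static void Main()
    {
        Console.WriteLine(KeyFor(new User { Name = "ayende" })); // users/ayende
    }
}
```

Because the key is derived from the user name, it is known before the document is saved, which is why the later examples can load "users/ayende" directly.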
Now that we have the Shards data, we can move to the next phase, setting up the sharding strategy:
var shardStrategy = new ShardStrategy
{
    ShardAccessStrategy = new ParallelShardAccessStrategy(),
    ShardSelectionStrategy = new BlogShardSelectionStrategy(3),
    ShardResolutionStrategy = new BlogShardResolutionStrategy(3),
};
The shard strategy controls several important aspects of how the Raven Client API behaves in the presence of sharding. ShardAccessStrategy dictates whether we access the shards sequentially or in parallel. ShardSelectionStrategy lets Raven know how to go from an entity instance to the appropriate shard. ShardResolutionStrategy is used to figure out which shards we need to query when executing Raven's queries. We will deal with each of those in detail in a short while.
With the shard strategy at hand, we are almost ready. We just need to create the ShardedDocumentStore and initialize it:
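To make the difference between sequential and parallel shard access concrete, here is a rough sketch. It assumes that a shard operation can be modeled as a function from a shard identifier to a result. ShardAccessSketch, InSequence and InParallel are hypothetical names, and this is not how the Raven client implements its access strategies:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical illustration; not the Raven Client API.
public static class ShardAccessSketch
{
    // Runs the operation against one shard at a time, in order.
    public static IList<T> InSequence<T>(IList<string> shardIds, Func<string, T> operation)
    {
        return shardIds.Select(operation).ToList();
    }

    // Starts the operation on every shard, then waits for all of them.
    // Results come back in the same order as the shard ids.
    public static IList<T> InParallel<T>(IList<string> shardIds, Func<string, T> operation)
    {
        var tasks = shardIds.Select(id => Task.Run(() => operation(id))).ToArray();
        Task.WaitAll(tasks);
        return tasks.Select(task => task.Result).ToList();
    }

    public static void Main()
    {
        var shardIds = new[] { "Posts #1", "Posts #2", "Posts #3" };
        foreach (var line in InParallel(shardIds, id => "queried " + id))
            Console.WriteLine(line);
    }
}
```

With parallel access, the total time is roughly that of the slowest shard rather than the sum of all shards, which matters most for queries that have to touch every posts shard.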
var documentStore = new ShardedDocumentStore(shardStrategy, shards);
documentStore.Initialize();
Using sharding
With that, we are pretty much done. We can start working with the sharded Raven setup as if we were using the standard Client API:
using (var session = documentStore.OpenSession())
{
    var user = new User { Name = "ayende" };
    var blog = new Blog { Name = "Ayende @ Rahien" };
    session.Store(user);
    session.Store(blog);
    // we have to save to Raven to get the generated id for the blog instance
    session.SaveChanges();
    var posts = new List<Post>();
    for (var i = 0; i < 6; i++)
    {
        var post = new Post
        {
            BlogId = blog.Id,
            UserId = user.Id,
            Content = "Just a post",
            Title = "Post #" + (i + 1)
        };
        posts.Add(post);
        session.Store(post);
    }
    session.SaveChanges();
}
From just the code above, you can't tell whether you are talking to a single Raven instance or to several. That is the idea: after the initial setup, you can pretty much ignore it.
The Raven Client API outputs the following log for this code snippet:
Saving 1 changes to Users
Saving 1 changes to Blogs
Saving 2 changes to Posts #1
Saving 2 changes to Posts #2
Saving 2 changes to Posts #3
All five Raven instances are involved, and you can see that Raven has batched all changes going to the same server into a single call.
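The batching idea can be sketched as grouping pending changes by their target shard, so that each shard receives a single call. This is only an illustration under the assumption that each pending change is tagged with a shard identifier. BatchingSketch and DescribeBatches are hypothetical names, not part of the Raven Client API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration; not the Raven Client API.
public static class BatchingSketch
{
    // Groups pending changes by target shard and describes one batch per shard,
    // in the order in which each shard first appears.
    public static IList<string> DescribeBatches(
        IEnumerable<(string ShardId, string DocumentKey)> pending)
    {
        return pending
            .GroupBy(change => change.ShardId)
            .Select(batch => "Saving " + batch.Count() + " changes to " + batch.Key)
            .ToList();
    }

    public static void Main()
    {
        var pending = new[]
        {
            (ShardId: "Posts #1", DocumentKey: "posts/1/"),
            (ShardId: "Posts #2", DocumentKey: "posts/2/"),
            (ShardId: "Posts #1", DocumentKey: "posts/1/"),
            (ShardId: "Posts #2", DocumentKey: "posts/2/"),
        };
        foreach (var line in DescribeBatches(pending))
            Console.WriteLine(line);
    }
}
```

Four pending changes turn into two batches here, one per shard, regardless of the order in which the objects were stored.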
Now, let us see how we can query in a sharded environment.
using (var session = documentStore.OpenSession())
{
    session.Query<User>().ToArray();
    session.Query<Blog>().ToArray();
    session.Query<Post>().ToArray();
}
The code above will get us all the users, blogs and posts. This is the log output:
Executing query 'Tag:Users' on index 'Raven/DocumentsByEntityName' in 'Users'
Executing query 'Tag:Blogs' on index 'Raven/DocumentsByEntityName' in 'Blogs'
Executing query 'Tag:Posts' on index 'Raven/DocumentsByEntityName' in 'Posts #1'
Executing query 'Tag:Posts' on index 'Raven/DocumentsByEntityName' in 'Posts #2'
Executing query 'Tag:Posts' on index 'Raven/DocumentsByEntityName' in 'Posts #3'
As you can see, these three queries turned into five actual requests. The first two queries only need to hit the Users and Blogs shards, respectively, and they did. But when we want to get all the posts, we need to query all three shards dedicated to them.
Finally, what happens when we try to load a document by id?
using (var session = documentStore.OpenSession())
{
    session.Load<User>("users/ayende");
    session.Load<Blog>("blogs/1");
    session.Load<Post>("posts/1/2");
    session.Load<Post>("posts/2/2");
}
These four calls will result in the following four requests:
Loading document [users/ayende] from Users
Loading document [blogs/1] from Blogs
Loading document [posts/1/2] from Posts #1
Loading document [posts/2/2] from Posts #2
So far, it seems that the Raven Client API is pretty intelligent. But how does it do that?
The sharding strategy
You probably remember that when we talked about the shard strategy, I promised to explain its parts later. First, let us recall how we defined the shard strategy:
var shardStrategy = new ShardStrategy
{
    ShardAccessStrategy = new ParallelShardAccessStrategy(),
    ShardSelectionStrategy = new BlogShardSelectionStrategy(3),
    ShardResolutionStrategy = new BlogShardResolutionStrategy(3),
};
Only the ParallelShardAccessStrategy is part of Raven. The two other classes are implemented by you, and they are responsible for teaching Raven how to understand your sharding strategy. (While generic sharding strategies apply, they tend to be a poor fit compared to one tailor-made for your scenario. You can read more on that in the Sharding Approaches section below.)
Shard Selection Strategy
A shard selection strategy lets Raven know which shard a new or existing object should go on.
The following is an implementation of the (fairly complex) sharding strategy that we used so successfully. The implementation is discussed in detail below.
public class BlogShardSelectionStrategy : IShardSelectionStrategy
{
    private readonly int numberOfShardsForPosts;
    private int currentNewShardId;

    public BlogShardSelectionStrategy(int numberOfShardsForPosts)
    {
        this.numberOfShardsForPosts = numberOfShardsForPosts;
    }

    public string ShardIdForNewObject(object obj)
    {
        var shardId = GetShardIdFromObjectType(obj);
        if (obj is Post)
        {
            // round robin over the post shards; masking the sign bit keeps the result
            // non-negative if the counter ever wraps around
            var nextPostShardId = (Interlocked.Increment(ref currentNewShardId) & int.MaxValue) % numberOfShardsForPosts;
            nextPostShardId += 1; // to ensure base 1
            ((Post)obj).Id = "posts/" + nextPostShardId + "/"; // encode the shard id in the prefix
            shardId += nextPostShardId;
        }
        return shardId;
    }

    public string ShardIdForExistingObject(object obj)
    {
        var shardId = GetShardIdFromObjectType(obj);
        if (obj is Post)
        {
            // the format of a post id is: 'posts' / 'shard id' / 'post id'
            var id = ((Post)obj).Id.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
            shardId += id[1]; // add shard id
        }
        return shardId;
    }

    private static string GetShardIdFromObjectType(object instance)
    {
        if (instance is User)
            return "Users";
        if (instance is Blog)
            return "Blogs";
        if (instance is Post)
            return "Posts #";
        throw new ArgumentException("Cannot get shard id for '" + instance + "' because it is not a User, Blog or Post");
    }
}
The ShardIdForNewObject implementation is straightforward, using GetShardIdFromObjectType to get the appropriate shard based on the instance type. In the case of a Post, the name alone is not enough: we also need to select a shard out of those dedicated to posts. We use simple round robin logic to distribute new posts among all the posts shards. Note that we use Interlocked.Increment to increment the value, because a ShardSelectionStrategy must be thread safe.
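The round robin logic can be tried in isolation. The sketch below is a standalone illustration, not part of the sample project. RoundRobinSketch is a hypothetical class, and masking the sign bit is an addition made here so that the result stays non-negative if the counter ever wraps around:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical illustration of thread safe round robin shard selection.
public class RoundRobinSketch
{
    private readonly int shardCount;
    private int counter;

    public RoundRobinSketch(int shardCount)
    {
        this.shardCount = shardCount;
    }

    // Returns a 1-based shard number; safe to call from many threads.
    public int Next()
    {
        // Interlocked.Increment hands every caller a distinct ticket.
        // Masking the sign bit keeps the ticket non-negative after overflow.
        var ticket = Interlocked.Increment(ref counter) & int.MaxValue;
        return ticket % shardCount + 1;
    }

    public static void Main()
    {
        var picker = new RoundRobinSketch(3);
        var hits = new int[4]; // index 0 is unused, shards are numbered from 1
        Parallel.For(0, 3000, _ => Interlocked.Increment(ref hits[picker.Next()]));
        Console.WriteLine(string.Join(", ", hits.Skip(1))); // 1000, 1000, 1000
    }
}
```

Because every call receives a distinct ticket, 3000 calls spread evenly over three shards even when they arrive from many threads at once. A plain `counter++` would not give that guarantee.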
Of particular interest is the id that we set for the new post:
((Post) obj).Id = "posts/" + nextPostShardId + "/";
Why are we giving it this id? Raven treats a new document id that ends with a '/' as a prefix and generates an identity value for it. So when saving a document with the id 'posts/1/', the actual id would be something like 'posts/1/1338'.
By making the shard id part of the document key, we ensure that we can always know which shard to access when loading a document or saving it back. It also keeps the sharding stable when new shards are added later on. This is discussed in more detail in the Sharding Approaches section below.
The second method of interest in this shard selection strategy is ShardIdForExistingObject, which is how we find out which shard an existing instance should go to.
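The prefix behavior can be imitated with a small in-memory sketch. This is not Raven's implementation. IdentitySketch is a hypothetical class, and it assumes that a separate counter is kept for each prefix:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for server-side identity generation.
public class IdentitySketch
{
    private readonly Dictionary<string, long> countersByPrefix =
        new Dictionary<string, long>();

    public string Resolve(string requestedId)
    {
        // Ids that do not end with '/' are used as given.
        if (!requestedId.EndsWith("/", StringComparison.Ordinal))
            return requestedId;

        // Assumption: one counter per prefix; a missing prefix starts at zero.
        countersByPrefix.TryGetValue(requestedId, out var last);
        var next = last + 1;
        countersByPrefix[requestedId] = next;
        return requestedId + next;
    }

    public static void Main()
    {
        var identities = new IdentitySketch();
        Console.WriteLine(identities.Resolve("posts/1/"));     // posts/1/1
        Console.WriteLine(identities.Resolve("posts/1/"));     // posts/1/2
        Console.WriteLine(identities.Resolve("posts/2/"));     // posts/2/1
        Console.WriteLine(identities.Resolve("users/ayende")); // users/ayende
    }
}
```

The shard number chosen by the selection strategy ends up as the middle segment of the final key, while the last segment is filled in when the document is saved.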
For Users and Blogs, the decision is easy, based on their type. For posts, it is slightly more complicated, but still fairly easy, since we can extract the shard id from the id of the document.
ShardSelectionStrategy helps when we want to save new or existing objects, but ShardResolutionStrategy works for querying and loading.
Shard Resolution Strategy
When Raven needs to query on top of a sharded data store, it needs to know what shards to query. Querying all shards, all the time, is possible, but it is slow and requires that all shards contain the same indexes as others (which you may wish to avoid because they store different sets of documents).
To decide what shards it needs to query, Raven uses the ShardResolutionStrategy. You can see an example of one below:
public class BlogShardResolutionStrategy : IShardResolutionStrategy
{
    private readonly int numberOfShardsForPosts;

    public BlogShardResolutionStrategy(int numberOfShardsForPosts)
    {
        this.numberOfShardsForPosts = numberOfShardsForPosts;
    }

    public IList<string> SelectShardIds(ShardResolutionStrategyData srsd)
    {
        if (srsd.EntityType == typeof(User))
            return new[] { "Users" };
        if (srsd.EntityType == typeof(Blog))
            return new[] { "Blogs" };
        if (srsd.EntityType == typeof(Post))
        {
            if (srsd.Key == null) // general query
                return Enumerable.Range(0, numberOfShardsForPosts).Select(i => "Posts #" + (i + 1)).ToArray();
            // we can optimize better, since the key has the shard id
            // key structure is 'posts' / 'shard id' / 'post id'
            var parts = srsd.Key.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
            return new[] { "Posts #" + parts[1] };
        }
        throw new ArgumentException("Cannot get shard id for '" + srsd.EntityType +
                                    "' because it is not a User, Blog or Post");
    }
}
The interesting bits are in SelectShardIds, where we decide, based on the shard resolution data, what is the appropriate set of shards to query. In the case of User or Blog, the answer is easy, since they each live on a separate shard.
In the case of a Post, we distinguish between a request for a specific document (in which case we can tell the exact shard, because the shard id is encoded in the document key) and a query on all posts, in which case we return all the shards that hold posts.
Putting this logic in the ShardResolutionStrategy and ShardSelectionStrategy isn't hard, and by using those two classes we gain an enormous level of sophistication inside Raven, since it is able to talk to just the relevant shards at any point in time.
Cross Shard Transactions
Raven is a transactional system, but when you are using shards, you are talking to several instances at the same time. Each instance is independent and shares no data with the other instances. That may result in surprising behavior in situations such as this one:
using (var session = documentStore.OpenSession())
{
    var user = new User { Name = "ayende" };
    var blog = new Blog { Name = "Ayende @ Rahien" };
    session.Store(user);
    session.Store(blog);
    session.SaveChanges();
}
When using a single Raven instance, failing to save the Blog will cause the save of the User instance to be rolled back. But when using sharding, it is possible that the User will be saved (and committed) on one shard, while the Blog fails to save on another shard. The failure to save the Blog on the second server will not result in a rollback on the first server.
If you do want all-or-nothing behavior across shards, you need to utilize System.Transactions' distributed transaction support, by wrapping the calls inside a TransactionScope, like this:
using (var tx = new TransactionScope())
using (var session = documentStore.OpenSession())
{
    var user = new User { Name = "ayende" };
    var blog = new Blog { Name = "Ayende @ Rahien" };
    session.Store(user);
    session.Store(blog);
    session.SaveChanges();
    tx.Complete();
}
Now we can relax, knowing that either both changes will complete successfully or both will be rolled back.
Sharding Approaches
This section is intentionally the last part of this document, since we wanted you to have some idea about how sharding works before we talk about the design concepts related to sharding. By now, I hope that it is clear why Raven doesn't come with a generic implementation of ShardSelectionStrategy and ShardResolutionStrategy. It would be easy to provide ones that use some sort of modulus logic to scatter documents among all shards, but that wouldn't lead to optimal results. Instead, by letting you define the sharding strategies, we gain a lot of flexibility and performance. It also ensures that people think before just applying modulus to their entire data set.
How to shard your data
So far, we have seen two ways in which we can shard our data. We can either shard it by role (Users on one shard, Blogs on another) or use horizontal partitioning (Posts are partitioned among three servers). Another way would be to split all data among all servers, which seems to be the common approach for people who are used to sharding relational databases. That approach is usually driven by the relational nature of the data, since you need to store related data on the same server to ensure that you can join Users to Blogs to Posts.
Raven isn't relational, and if you follow Raven's best practices, each document is independent. That negates the need to store some data together for joining purposes. Once we are free from that, we can consider the sharding approach with a clean slate.
In general, I prefer to shard using the method outlined so far: dedicating certain shards to certain types of documents. The logic here is simple: different types of documents have drastically different access and growth patterns. While the number of Users and Blogs grows slowly, the number of Posts grows much more rapidly. By assigning Posts their own shard instances, I ensure that I can increase my ability to handle Posts independently of the ability to handle Users.
That makes managing deployment easier, since we can easily look at the hot spots and manage them on a case by case basis. The problem of "we need to increase capacity to store Posts" is much simpler than "the database is overloaded", after all.
Stable sharding
One of the common problems when using shards is what to do when you need to add a new shard.
This tends to be a problem because in many sharding implementations, the selection criterion for the shard used to store a particular piece of data is: primary_key % number_of_shards. When a new shard is introduced, the result of that calculation changes drastically.
The approach that we have used so far results in a stable sharding environment. We only use modulus when we create a new object, to select which shard it will go to. Once that is done, the shard id is encoded in the document key. That means that adding a new shard will not affect existing documents, and we will still be able to locate them easily.
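The difference between the two approaches can be shown with a small standalone sketch. StableShardingSketch and its methods are hypothetical names used only for this illustration. The key layout is the 'posts' / 'shard id' / 'post id' format used throughout this document:

```csharp
using System;
using System.Linq;

// Hypothetical illustration comparing modulus placement with encoded keys.
public static class StableShardingSketch
{
    // Naive placement: the shard is recomputed from the key on every access.
    public static int ModulusShard(int primaryKey, int numberOfShards)
    {
        return primaryKey % numberOfShards;
    }

    // Stable placement: the shard was chosen once and stored in the key.
    public static string EncodedShard(string documentKey)
    {
        var parts = documentKey.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
        return "Posts #" + parts[1];
    }

    // Counts the keys in 1..keyCount whose modulus shard changes
    // when the number of shards goes from 'before' to 'after'.
    public static int CountMoved(int keyCount, int before, int after)
    {
        return Enumerable.Range(1, keyCount)
            .Count(key => ModulusShard(key, before) != ModulusShard(key, after));
    }

    public static void Main()
    {
        Console.WriteLine(CountMoved(12, 3, 4) + " of 12 keys would move"); // 9 of 12 keys would move
        Console.WriteLine(EncodedShard("posts/2/1338")); // Posts #2
    }
}
```

With pure modulus placement, going from three to four shards changes the computed shard for 9 of the first 12 keys, so those documents would have to be moved or could no longer be found. The encoded key keeps resolving to "Posts #2" no matter how many shards are added.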
Data locality
Finally, there is one last consideration to take into account when working with shards. When you have several data centers, it often makes sense to shard the data not based on some essentially arbitrary criterion, but based on where it will be accessed most often. That tends to result in a significant performance increase.
For example, if a user is located in China, it is probably safe to assume that most of that user's access will come from China, which means that we want to locate that user's data on the Asia shards. That user's documents will probably have keys similar to "asia/22/invoices/34".
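A locality-aware strategy could resolve the shard from the first segment of such a key. The sketch below is only an illustration. LocalitySketch, the region names other than "asia", and the shard identifiers are all hypothetical, and the key layout 'region' / 'user id' / 'collection' / 'document id' is an assumption inferred from the example key:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of resolving a shard from a region prefix.
public static class LocalitySketch
{
    // Hypothetical mapping from the region segment of a key to a shard identifier.
    private static readonly Dictionary<string, string> ShardByRegion =
        new Dictionary<string, string>
        {
            { "asia", "Asia" },
            { "europe", "Europe" },
        };

    // Assumed key layout: 'region' / 'user id' / 'collection' / 'document id'.
    public static string ShardFor(string documentKey)
    {
        var parts = documentKey.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length == 0 || !ShardByRegion.TryGetValue(parts[0], out var shardId))
            throw new ArgumentException("No shard is known for key '" + documentKey + "'");
        return shardId;
    }

    public static void Main()
    {
        Console.WriteLine(ShardFor("asia/22/invoices/34")); // Asia
    }
}
```

As with the posts example, the shard is encoded in the key itself, so the same stable sharding property holds: adding a data center later does not change where existing documents live.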