Imagine you want to learn more about database performance, and you have in your hands a very large book about databases in general. How can you search for your topic of interest?
More often than not, the answer is: go to the index (usually located at the back of the book), look up the topic (usually in an alphabetical list), and the index will tell you the page where that topic is discussed.
If the book doesn't have an index, you probably need to go through each page and try to find relevant information about performance - that would be a tedious and long endeavour.
Similarly, when you ask for some document in a database, the database tries to use an index to quickly find the results for you. If there's no index to use as reference, it has to check each document, the same way you would have to if your book didn't have an index. The database can probably handle that if there aren't a lot of documents to search, but when your database holds close to 1 billion documents and is queried thousands of times per second... then it becomes a problem.
Indexes are data structures that allow databases to quickly find documents in a collection. In Mongo you define an index with a command like this:
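A sketch of that command in the mongo shell (the collection and field names come from the example below; the index options are illustrative):

```javascript
// Create an index on the `action` field of the `events` collection.
// `background: true` asks the server to build it without blocking other work.
db.events.createIndex({ action: 1 }, { background: true })
```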
The above tells the database to construct an index on values of the action property in the events collection, and to build that index in the background. Building an index in the foreground is a blocking operation: for very large collections it can take many hours to finish, during which the database stops answering other queries. Building in the background avoids this, at the cost of a longer build.
With the above index the database will be able to efficiently retrieve results when querying by action. For example, this query:
```javascript
db.events.find({ action: 'send email' })
```
will be very fast, because the database can quickly find all documents matching the action send email and return them to you.
Now, if you want to view send email events since one month ago:
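Such a query might look like this (the `date` field name is an assumption for illustration):

```javascript
// Find 'send email' events from the last 30 days.
db.events.find({
  action: 'send email',
  date: { $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) }
})
```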
The database will be able to retrieve all send email events very quickly using the index. However, if send email is a very common event, the query will be very slow! This is because you can't further filter by date using that index, so the database will have to check each of the millions of send email events.
Defining efficient indexes
To fix the case where you want to filter events by date range, you can build on top of the previously defined index and define it like this:
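Sketched in the mongo shell:

```javascript
// Compound index: documents are ordered by action first, then by date.
db.events.createIndex({ action: 1, date: 1 })
```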
The above compound index will improve the performance of the query that also filters by date. However, while it looks correct at a quick glance, if you visualize how the database would look for documents using the index, you can probably guess why it's not actually the best index for that query.
Imagine that you have a book of events. In the book's index, you see maybe 20 different types of events, and for each event type a very long list of pages, ordered by date, where that event can be found. Assuming that your events are more or less evenly distributed by action, and your collection size is 1 billion events, if you search by event type first, you reduce the search area to roughly 1/20 of the collection. This means that, out of an original 1 billion documents, you now have to scan 50 million documents.
What if the index instead showed you the date first instead of the action? You would be able to find events from a given day quickly, and then get those events grouped by action. Assuming you have 4 years' worth of historical event data and are querying for the events of a single day, indexing by date first reduces your search area to 1/1460 in the first step! Now you have to scan only ~685,000 documents - around 73 times fewer than if you filter by event type first. The index would look like this:
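In the mongo shell:

```javascript
// Compound index with the higher-cardinality field first.
db.events.createIndex({ date: 1, action: 1 })
```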
When creating a compound index like the one in this example, ask yourself this question: "Which property of my find query is the most 'unique' one?" In the above example, date is more unique than action, because each day brings a new unique value. This 'uniqueness' of a document attribute is called 'cardinality'. The higher the cardinality of the leading properties in a compound index, the better it will perform, because high-cardinality fields do a better job of reducing the search area of the query.
Ensuring that indexes are efficiently used
Your collection now has well-defined indexes with high-cardinality fields on top, ensuring that your search space is reduced significantly from the start. Great! But how can you ensure that your database uses the index as efficiently as possible?
For indexes to be efficiently used, you want them to fit in the RAM available on the database server. RAM in Mongo is mostly used for keeping the most frequently requested data and indexes - this is known as the working set. With the WiredTiger storage engine, the default cache size is 50% of (RAM - 1 GB), or 256 MB, whichever is larger. Assuming you have a server with 32 GB of RAM, this means there are about 15.5 GB for cache. Mongo uses this space to juggle the most commonly retrieved data and indexes (it can load a subset of an index). Mongo also uses memory for other tasks, like managing connections and handling aggregations, not to mention other processes running on the machine besides Mongo.
It's not uncommon that, in order to support the variety of ways to query a collection, you'll need to define many indexes. It is very tempting to improve lookup queries by adding indexes without much consideration, but this is also an easy way to bloat the database with many indexes. You can inspect your database's overall index size like so:
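One way to check this is via `db.stats()`, whose `indexSize` field reports the total size of all indexes (the number shown is illustrative):

```javascript
db.stats().indexSize
// e.g. 69793218560  (~65 GB)
```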
This is the size of all indexes in the database, in bytes. In this example it's 65 GB. This is not an ideal size for a server with 32 GB of RAM as per the example above! Because these indexes can't fit in memory, you'll be performing reads from disk and will be severely limited by disk I/O throughput.
It is not easy to know how much memory your database needs. Some questions you might want to consider:
How large is your data?
How frequently is data requested (to determine approximate working set sizes)?
How large are your indexes?
What is your projected data growth in the short/mid term?
Strategies to keep indexes size small
Here are a few ways to keep index sizes small, roughly ordered by ascending difficulty:
Remove unused indexes
You can examine the indexes of a given collection and their usage like so:
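The `$indexStats` aggregation stage reports per-index usage counters. The command below is real; the output shown in comments is abbreviated and illustrative, matching the scenario described next:

```javascript
db.events.aggregate([{ $indexStats: {} }])
// Illustrative, abbreviated output:
// { name: 'date_1_action_1', accesses: { ops: 1548930, since: <server start> } }
// { name: 'action_1',        accesses: { ops: 0,       since: <server start> } }
```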
In the above example, there are two indexes. Under accesses, you can see that the first index has been used many times, while the second has not been used at all: it is a candidate for removal. Suppose there were a third index with very few accesses, say around 100. That index might also be a candidate for removal. However, it is important to understand which queries used the index, so you know the repercussions of removing it at the application level.
Note that the ops number might be deceptive, because uses are counted since the time the Mongo server process started, in this case, since November 2nd.
Remove redundant indexes
Similar to above, you can inspect the definition of your indexes. For example in this output:
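You can list index definitions with `getIndexes()`; for the example discussed below, the output might look roughly like this (abbreviated, illustrative):

```javascript
db.events.getIndexes()
// [
//   { v: 2, key: { _id: 1 },             name: '_id_' },
//   { v: 2, key: { action: 1 },          name: 'action_1' },
//   { v: 2, key: { action: 1, date: 1 }, name: 'action_1_date_1' }
// ]
```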
You can see that both indexes are used, so at first glance both are needed. However, the second index makes the first one redundant, since queries on action alone can use the second index without problems. In general, a query can use a compound index as long as the fields it filters on form a prefix of the indexed fields, in order. For instance, a query on date only will not be able to use the second index above, because date is the second indexed property in the compound index.
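To make the prefix rule concrete, a sketch assuming the compound index `{ action: 1, date: 1 }`:

```javascript
// Can use the compound index: `action` alone is a prefix of (action, date).
db.events.find({ action: 'send email' })

// Can use the full compound index: both fields form the complete prefix.
db.events.find({ action: 'send email', date: { $gte: ISODate('2019-11-01') } })

// Cannot use it efficiently: `date` alone is not a prefix of the index.
db.events.find({ date: { $gte: ISODate('2019-11-01') } })
```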
Use partial indexes
Index sizes can be significantly reduced by making indexes partial. When defining an index, you can apply a constraint that tells the index which documents it should cover; this option is named partialFilterExpression. For example, in the scenario of querying events by type and date, suppose the product requirement is to support searching events that were triggered by manual user interaction. Since you don't care about events triggered through automation or API usage, you can define an index like so:
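A sketch of such an index; the `source` field and its `'user'` value are assumptions for illustration:

```javascript
// Only index events triggered by direct user interaction.
db.events.createIndex(
  { action: 1, date: 1 },
  { partialFilterExpression: { source: 'user' } }
)
```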
Assuming that 60% of events are triggered by direct user interaction, the index covers only 60% of the documents in the collection, saving memory.
Reduce collection size
The less data in a database, the smaller the indexes will be, and the less memory is required to keep them in RAM and ensure snappy responses. You can reduce collection size by moving old data away to "cold" storage. In the case of the events collection, there are four years' worth of data, and it is likely that there isn't much demand for retrieving old data. You can move that data out of the database to another place for archiving, as long as you provide a means to retrieve it (with the understanding that retrieval will be slower).
Keep indexes simple
Compound indexes are powerful, since they support more granular filters. However, compound indexes are more complex to maintain and larger by nature. Try to minimize the number of fields in a compound index in order to keep it small.
This is probably easier said than done: it requires a good deal of time spent on the design of the database schema. Just because Mongo is considered "schemaless" doesn't mean that you should just start writing code, disregard schema design, and figure it out along the way. To define a good schema, it is important to have a solid understanding of the product, considering current requirements as well as foreseeing future ones. Designing a good schema requires some ability to read the future.
Sharding the database is another option. What this basically means is that data is partitioned by some criterion (a shard key) and spread across multiple servers. This option may be even more complicated than the previous one: it requires a good understanding of the schema and, moreover, a plan for supporting sharding from the moment the schema is designed.
For a sharded setup to be effective, the shard key must guarantee an even distribution of data across the shards. For example, imagine you have 5 shards storing data for 5 different action types. If 70% of events are send message, one shard will receive 70% of your data, while the remaining 30% is spread across the other four.
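Sharding a collection boils down to enabling sharding and picking the shard key; as a sketch, hashed sharding on an assumed high-cardinality field (`userId` here is hypothetical) spreads data evenly:

```javascript
sh.enableSharding('mydb')
sh.shardCollection('mydb.events', { userId: 'hashed' })
```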
It can be surprising how often you can find low-hanging fruit when it comes to improving database performance. Simply removing unused and redundant indexes can be a big boost. Furthermore, spending time designing a solid, future-proof schema is rewarding in the long term, allowing you to run your database on smaller servers and helping you implement other performance improvements such as sharding.
Do you like working on interesting performance problems? We’re hiring!