Azure Cosmos DB TTL, your data cleanup manager

In every scenario where you store data, you need to make sure your data is as clean and relevant as possible, especially when storing it in the cloud, where every byte stored and transferred costs money. Storage is not the only thing with a price tag: irrelevant, old, or obsolete data also impacts overall performance, so you might need to buy more capacity to keep the performance you need.

So for several reasons it's important to keep your data clean and tidy.

In general this is handled by maintenance jobs. A maintenance job can be a scheduled task or, in the case of SQL, a SQL job, but Cosmos DB has no such jobs. Other options are a time-triggered Function that runs a cleanup script, or an Azure DevOps pipeline that runs a script as a kind of “scheduled task”.

Cosmos DB, however, has had a rather little-known feature for this since 2018, called TTL (Time To Live). It lets you give a document a lifetime in seconds, counted from its last modification, after which the Cosmos DB engine removes it. The nice thing is that it doesn't cost you anything extra, as the cleanup of expired documents is handled with leftover capacity on your Cosmos DB instance. This does mean documents might not be deleted the instant they expire, so TTL is not suitable where an immediate delete is required. Although expired documents might not be deleted immediately, they are flagged as expired, so they won't show up in query results.

Configure TTL on Container level

If you have played with Cosmos DB before, you have probably seen the TTL settings page.

Container level TTL settings

There are three options regarding TTL:

  • Off -> entire feature is disabled
  • On (no default) -> feature is enabled, but is set by default to no expiration
  • On -> feature is enabled, and you need to specify a time to live in seconds

When you select the bottom option ‘On’, a textbox is displayed to key in the expiration in seconds.

Container level TTL on

The minimum is 1 second; the maximum is the INT32 max value (2,147,483,647 seconds), which is roughly 68 years. In the screenshot it's set to 10 seconds, which means any document you create in this container becomes invisible after 10 seconds and is removed soon after. Although it's a nice feature, it is an all-or-nothing approach, which is not very suitable for document maintenance.
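The three portal options can also be expressed as a single nullable integer, which is how the container's default TTL is represented. Below is a minimal sketch of that mapping. The enum and helper names are hypothetical, and the commented usage assumes the Microsoft.Azure.Cosmos .NET SDK, which this post does not otherwise name.

```csharp
using System;

// Hypothetical enum mirroring the three options on the portal's TTL settings page.
public enum ContainerTtlMode { Off, OnNoDefault, On }

public static class ContainerTtl
{
    // Maps a portal option to the container's default TTL value:
    //   null -> Off, -1 -> On (no default), n -> On with n seconds.
    public static int? ToDefaultTimeToLive(ContainerTtlMode mode, int seconds = 0)
    {
        switch (mode)
        {
            case ContainerTtlMode.Off:
                return null;
            case ContainerTtlMode.OnNoDefault:
                return -1;
            case ContainerTtlMode.On:
                if (seconds < 1)
                {
                    throw new ArgumentOutOfRangeException(nameof(seconds), "TTL must be at least 1 second.");
                }
                return seconds;
            default:
                throw new ArgumentOutOfRangeException(nameof(mode));
        }
    }
}

// Assumed usage with the Microsoft.Azure.Cosmos package
// (container id and partition key path are made up for illustration):
// var properties = new ContainerProperties("uploads", "/id")
// {
//     DefaultTimeToLive = ContainerTtl.ToDefaultTimeToLive(ContainerTtlMode.On, 10)
// };
```

Keeping the mapping in one place makes it harder to confuse "Off" (null) with "On (no default)" (-1), which behave very differently once documents carry their own ttl value.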

Configure TTL on Document level

The TTL feature really becomes valuable when you set it per document, so that each document in the container can have its own expiration value. GDPR is one example where this is useful, as you have 30 days between a data delete request and the moment the data needs to be gone.

If you want to benefit from document level TTL, you need to do only two things:

  1. configure TTL as ‘On (no default)’

Container level TTL on, without default

  2. add a field named ttl to your document

Document level TTL

It is that simple. The ttl value is added to the document's _ts value, a system field in every document that holds its last-modified time in epoch seconds, and once that moment has passed the document is marked as expired.
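To make the interplay between the container setting and the document's ttl field concrete, here is a small sketch of the rules as I understand them. It is not the engine's implementation: the class and method names are hypothetical, and treating a document as expired exactly when the clock reaches _ts + ttl is an assumption about the boundary.

```csharp
using System;

public static class TtlRules
{
    // Returns the effective lifetime in seconds, or null when the document never expires.
    //   containerDefaultTtl: null = Off, -1 = On (no default), n = On with n seconds.
    //   documentTtl: null = no ttl field on the document, -1 = never expire, n = n seconds.
    public static int? EffectiveTtl(int? containerDefaultTtl, int? documentTtl)
    {
        // With TTL off on the container, a ttl field on the document is ignored.
        if (containerDefaultTtl == null) return null;

        // A ttl on the document overrides the container default.
        int ttl = documentTtl ?? containerDefaultTtl.Value;
        return ttl == -1 ? (int?)null : ttl;
    }

    // ts is the document's _ts value (last modified, epoch seconds).
    public static bool IsExpired(long ts, int? containerDefaultTtl, int? documentTtl, long nowEpochSeconds)
    {
        int? ttl = EffectiveTtl(containerDefaultTtl, documentTtl);
        return ttl.HasValue && nowEpochSeconds >= ts + ttl.Value;
    }
}
```

Because the lifetime is counted from _ts, every write to a document restarts its clock.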

In code, the entity or domain model you store needs to have this ttl field so that it ends up in the document. You need it not only to set the TTL, but also to update or disable it after the document has been created. After all, if you set the ttl field to 1000, you can change that value as long as the document hasn't expired, and even disable the TTL by setting the value to -1.

The piece of code below is an example of the ttl field where the default value is set to -1 which means ‘no expiration’.

/// <summary>
/// Specify the number of seconds this document may live in storage.
/// When expired, it will be deleted automatically.
/// By default the value is -1, which means no expiration.
/// </summary>
[JsonProperty("ttl")]
public int TimeToLiveInSeconds { get; set; } = -1;

So by default we have the same behavior as disabling the TTL feature entirely, but it allows us to change that behavior later on.

Use case

This blog post began by mentioning data cleanup maintenance jobs, and that's where I see the most utility in this feature. It is a nice and cheap way to handle this task.

In our scenario we have mobile clients that need to upload data. Before they can upload, they request an upload URL. The uploaded data requires processing, so for each requested upload URL we create a document in Cosmos DB that holds metadata about the processing status. If for whatever reason the upload fails, that metadata document is abandoned. Uploads can fail because, for example, the network connection is unstable or dropped, or the app was closed prematurely. The app keeps trying to upload the data, but requests a new upload URL on every attempt. This leaves us with abandoned metadata documents for uploads that will never take place.

One option we had was to run a Function-based maintenance job, but we found TTL to be a much more elegant solution. When the client app requests an upload URL, we set the TTL to a certain value. When the data upload is received, we reset the TTL to -1 so the document never expires from then on. This approach results in abandoned documents being removed automatically, and as this is not time critical, we're fine with it not happening immediately.
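The flow described above could look roughly like the sketch below. It is not our production code: the class names, the status values, and the 24-hour grace period are all made up for illustration, and persisting the document to Cosmos DB is left out.

```csharp
using System;

// Hypothetical metadata document for one requested upload.
public class UploadMetadata
{
    public string Id { get; set; } = Guid.NewGuid().ToString();
    public string Status { get; set; } = "Requested";

    // Stored as "ttl" in the document, e.g. via [JsonProperty("ttl")].
    public int TimeToLiveInSeconds { get; set; } = -1;
}

public static class UploadTracking
{
    // Assumed grace period before an unused upload URL counts as abandoned.
    public const int AbandonedAfterSeconds = 24 * 60 * 60;

    // Called when the client app requests an upload URL.
    public static UploadMetadata OnUploadUrlRequested()
    {
        return new UploadMetadata
        {
            Status = "Requested",
            TimeToLiveInSeconds = AbandonedAfterSeconds
        };
    }

    // Called when the upload has actually been received.
    public static void OnUploadReceived(UploadMetadata metadata)
    {
        metadata.Status = "Received";
        metadata.TimeToLiveInSeconds = -1; // never expire from now on
    }
}
```

In both steps the document still has to be written back to the container for the new ttl value to take effect.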

Other use cases are around scenarios where data is no longer relevant after a certain amount of time. That can be logging data or for example telemetry data.

Final thoughts

I really found this to be an elegant way to get rid of documents that are not supposed to be there anymore. It is especially nice that the RUs needed to delete the expired documents are consumed from unused capacity, so using this feature costs you nothing extra.

If you have any comments or remarks, you can reach me on Twitter @jeanpaulsmit.