with Kiran Kumar Naidu Vaddadi
If you have APIs that are consumed internally and/or published externally, you would want to collect a set of metrics to further your insight and inform your go-forward investments. These metrics can uncover opportunities, improve user experience, and assist in capacity planning. While not comprehensive, here are a few that might be of interest:
- Usage by Customer
- Usage by API
- User Experience
- Response Time
- Error Rate
The above metrics can easily be modeled as time-series data: a series of numerical data points over a time horizon. Think of it as a graph where one axis is always time and the other axis carries some measurement (e.g., Response Time). These metrics are typically generated and ingested at a very high resolution. For popular APIs, it could be billions of data points per day, if not more. Therefore, selecting an appropriate database is a key design choice, one that should be deliberated upon carefully.
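To make that concrete, here is a minimal sketch of a single time-series sample (the field names are illustrative, not our actual schema): a timestamp, the dimensions that identify the series, and the measured value.

```python
from datetime import datetime, timezone

# One time-series sample: a timestamp plus a measurement, tagged with
# a dimension that identifies which series the sample belongs to.
sample = {
    "timestamp": datetime(2023, 5, 1, 12, 0, 30, tzinfo=timezone.utc).isoformat(),
    "api": "orders-api",      # dimension: which API was called
    "responseTimeMs": 142,    # metric: the measured value
}

# A series is simply many such samples ordered by time; at billions of
# points per day, storage layout and compression dominate the design.
series = [sample]
```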
You can use a SQL/relational database or a NoSQL database to store the time series data. However, we must ask ourselves if this is the best choice. Let me elaborate. Many NoSQL databases can do “joins” to some extent like their SQL counterparts and, conversely, most SQL databases offer support for document-style data. Nevertheless, they are each optimized for different data types and access patterns: SQL databases handle joins more efficiently, and NoSQL databases handle document-style data more naturally. One size does not fit all. As an application developer, you need to choose the database that is appropriate for your specific requirement.
Here are requirements based on the characteristics of the metrics data:
- Handle continuous, write-heavy, and highly concurrent load
- Support low-latency queries
- Read data in large chunks (time ranges)
- Distinguish between recent and older data
- Minimize storage via flexible compression schemes
There are some unique challenges here. While a general-purpose database can be morphed and twisted to solve the functional problem, it will have a hard time doing so. Moreover, it might compromise non-functional aspects such as performance, scalability, and availability. If only there were a special-purpose database for time series data that could address the requirements above!
Time Series Database
A Time Series Database is optimized for time series data (duh!). It is very well suited to address the requirements listed above. The scale at which this type of data accumulates makes it very hard for traditional databases to keep up. When dealing with this kind of volume, data organization, storage, and compression become extremely important. A time-series database offers column-oriented storage and partitions the data based on time. As a result, time-based queries are significantly faster than in traditional databases. It offers more opportunities for compression during ingestion via rollups, and it provides hooks for dropping data as it ages out.
In API Management we have similar requirements. Our customers deploy their APIs onto our Gateways, and they use our Portal to publish the APIs to their consumers. They (the customers) would like us to capture API metrics so that they can visualize and analyze them on the Portal. Our original solution leveraged Elasticsearch, but it fell short when it came to extreme scale. We needed to revisit our choice of technology. For the reasons stated earlier, we thought a Time Series Database would fit the bill. There are quite a few choices out there, but after doing our due diligence, we chose Apache Druid. We are not alone; a host of companies and products are powered by Druid.
The Druid website introduces the technology as “an open-source distributed data store that combines ideas from data warehouses, time-series databases, and search systems to create a high-performance real-time analytics database for a broad range of use cases”. Apache Druid can handle incredibly high ingestion loads. Case in point: Netflix processes 2 million events per second and stores 2 trillion events using Druid!
API Management Analytics Architecture
The picture above is a simplified representation of our revamped real-time analytics architecture. A fleet of API Gateways (potentially hundreds) sends events within a time window (1 minute) to the ingestion server in a zipped format. The ingestion server unzips the payload in memory and streams each event onto Kafka using the Kafka Streams API. The data is then streamed out to Druid, which stores it in MinIO (an Amazon S3-compatible, server-side software storage stack). These metrics are then visualized, sliced, and diced in the Portal by issuing queries against the Druid database.
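The unzip-and-fan-out step can be sketched in a few lines. This is a simplified stand-in, not our production code: the event format and topic name are hypothetical, a plain callback stands in for the Kafka producer, and gzip stands in for whatever compression the gateways use.

```python
import gzip
import io
import json

def unzip_events(payload: bytes) -> list[dict]:
    """Decompress a gzipped batch of newline-delimited JSON events in memory."""
    with gzip.open(io.BytesIO(payload), "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def stream_events(events: list[dict], produce) -> int:
    """Hand each event to a producer callback (stands in for a Kafka producer)."""
    for event in events:
        produce("api-events", event)  # topic name is illustrative
    return len(events)

# Simulate one gateway batch: two events for a 1-minute window, gzipped.
batch = gzip.compress(
    b'{"api": "orders", "responseTimeMs": 120}\n'
    b'{"api": "orders", "responseTimeMs": 95}\n'
)
sent = []
count = stream_events(unzip_events(batch), lambda topic, e: sent.append((topic, e)))
```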
In Druid, data is stored in what is called a Datasource, which is similar to a table in a traditional RDBMS. Each row must have a primary timestamp column, which is used for partitioning and sorting the data. Every other column is either a dimension or a metric. While metrics can be continuous, dimensions must have discrete values. In our solution, we are interested in the following dimensions: Application, API, Key, HTTP Verb, and HTTP Response Code. As for metrics, Response Time, Error Rate, and API Usage are good examples.
API Management customers want to solve two distinct problems with analytics solutions: Root Cause Analysis (RCA) and business reporting (analytics). They need high precision on recent data to do justice to RCA but are fine with low precision for older data. So, we decided to store API events in two Datasources with identical schemas, one with a query granularity of HOUR and the other of MINUTE, both with rollup enabled (more on that in a moment). The hourly Datasource is sparse but retained for 2 years to facilitate trend analysis over a long period of time, whereas the minute Datasource is dense but retained for 24 hours only. This separation allows for efficient use of storage.
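For illustration, here is roughly what the minute-granularity Datasource's Kafka ingestion spec could look like, built as a Python dict. The Datasource name, topic, and metric field names are assumptions; the dimensions come from the list above.

```python
# Sketch of a Druid Kafka ingestion spec (names are illustrative).
minute_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "api_metrics_minute",  # assumed Datasource name
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {
                # the dimensions called out in the text above
                "dimensions": ["application", "api", "key", "httpVerb", "httpResponseCode"]
            },
            "metricsSpec": [
                {"type": "count", "name": "hits"},
                {"type": "longSum", "name": "totalResponseTimeMs", "fieldName": "responseTimeMs"},
            ],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "HOUR",
                "queryGranularity": "MINUTE",  # the hourly Datasource would use "HOUR"
                "rollup": True,
            },
        },
        "ioConfig": {"type": "kafka", "topic": "api-events"},  # assumed topic
    },
}
```

The hourly twin would differ only in its `dataSource` name, `queryGranularity`, and retention rules.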
Rollup works much like the GROUP BY clause in SQL. The rows that share the same value for the dimensions would be collapsed into one row per minute. Figure 2 below shows the data being ingested.
The first 3 rows will be collapsed into one row. The reason the 5th row is not included is that it falls in a different time (minute) boundary. Figure 3 below shows the resulting set of rows.
Since both Datasources share the same schema, we will reuse the incoming data shown in figure 2. This time the first 3 rows and the 5th row arrived within the same time boundary (hour), so they collapse into one row while the other row stays as it is, as shown in figure 4.
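Rollup itself can be simulated in a few lines: truncate each event's timestamp to the query granularity, group by (time bucket, dimension values), and aggregate the metrics. This toy version (event fields are illustrative) mirrors figures 2 through 4: five incoming rows collapse to three at minute granularity and to two at hour granularity.

```python
from collections import defaultdict

def rollup(events, granularity_minutes):
    """Collapse events sharing a time bucket and dimension values into one row."""
    grouped = defaultdict(lambda: {"hits": 0, "totalResponseTimeMs": 0})
    for e in events:
        bucket = e["minute"] // granularity_minutes * granularity_minutes
        key = (bucket, e["api"], e["responseCode"])
        grouped[key]["hits"] += 1
        grouped[key]["totalResponseTimeMs"] += e["responseTimeMs"]
    return grouped

# Five events: the first three share dimensions and a minute, the fourth has a
# different response code, and the fifth crosses into the next minute boundary.
events = [
    {"minute": 0, "api": "orders", "responseCode": 200, "responseTimeMs": 100},
    {"minute": 0, "api": "orders", "responseCode": 200, "responseTimeMs": 120},
    {"minute": 0, "api": "orders", "responseCode": 200, "responseTimeMs": 80},
    {"minute": 0, "api": "orders", "responseCode": 500, "responseTimeMs": 30},
    {"minute": 1, "api": "orders", "responseCode": 200, "responseTimeMs": 90},
]

by_minute = rollup(events, granularity_minutes=1)   # 3 rows
by_hour = rollup(events, granularity_minutes=60)    # 2 rows: minutes 0 and 1 merge
```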
Now that we have taken care of storing our time series data, we need to be able to surface it. Druid offers both SQL and native JSON queries. We chose the native JSON queries (REST API) to display the charts on the Portal. Here is an example of time-series data we are capturing and displaying for API publishers.
Here is another example of time series data shown in an API portal:
The chart above was generated by the query below. Let’s look at it from the bottom up. First, we have the “dimensions” (line 33), which specifies the list of dimensions to group by; in this case, it is just the API ID. Then we have the aggregation (line 27), where we sum up all the page hits. Finally, we have the filter (line 6), where we specify a tenant, an API, and a response code.
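The gist of such a query, expressed as a Druid native groupBy request, looks roughly like this; the Datasource, dimension names, and filter values are placeholders, not our production names.

```python
import json

# A Druid native groupBy query: filter, then aggregate, then group by dimension.
# In a real deployment this JSON is POSTed to the Broker's /druid/v2 endpoint.
query = {
    "queryType": "groupBy",
    "dataSource": "api_metrics_minute",  # placeholder Datasource name
    "granularity": "minute",
    "intervals": ["2023-05-01T00:00:00Z/2023-05-02T00:00:00Z"],
    "filter": {
        "type": "and",
        "fields": [
            {"type": "selector", "dimension": "tenant", "value": "acme"},
            {"type": "selector", "dimension": "api", "value": "orders-api"},
            {"type": "selector", "dimension": "httpResponseCode", "value": "200"},
        ],
    },
    "aggregations": [{"type": "longSum", "name": "hits", "fieldName": "hits"}],
    "dimensions": ["api"],
}

payload = json.dumps(query)  # request body for the POST
```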
Our original solution relied on Elasticsearch, and we learned the hard way that it was not the optimal choice at large scale. It is important to call out that Elasticsearch is a fine technology that can solve a plethora of problems aptly; it is just that for our use case, a time series database such as Druid is a better fit. We were able to prove this beyond doubt through thorough performance and scalability tests.
All else being equal, here is how the two compared:
This architecture can handle a far bigger load since all the components of the solution are horizontally scalable.
Choosing the right technology is paramount; it could be the difference between a superior user experience and an inferior one. With the advent of special-purpose databases, careful consideration must be given to choosing the database that best fits the use case in question. We found Druid, a Time Series Database, to be very effective in capturing and reporting on API metrics, and our testing validated that it scales far better. It is always great to be able to deliver an easy-to-use, cost-effective, and scalable solution for your customers.