Elevating Tech Reliability at Scale: How We Build a Dependable Daily Performance Testing Platform

Published in GovTech Edu · Aug 29, 2023


Writers: Hamnah Suhaeri & Muhammad Aditya Ridharrahman

Background

Performance tests are often conducted when an application or a service is about to face extraordinary events. They play a pivotal role in evaluating the resilience of our applications and verifying whether our application and underlying infrastructure can effectively handle user traffic. For instance, at GovTech Edu, we rigorously conduct performance tests before the public release of the teachers’ super-app Merdeka Mengajar and before the nationwide online assessment for students (ANBK). We make sure the application and infrastructure design can handle hundreds of thousands of requests at a time from students and teachers all across Indonesia.

When the performance test result falls below expectations, improvements and optimizations often need to be carried out, ranging from infrastructure architecture and configuration down to code structure.

Optimizing code structure is often challenging and cannot be done quickly, especially when there have been numerous code changes and we only realize their impact on performance later. We need to analyze and pinpoint as precisely as possible which parts of the code need to be optimized, as those changes may require risk assessment and functional regression testing. One solution that can minimize this issue is to perform regular performance tests. With regular execution, we can provide fast feedback to our product engineers on how their code implementation performs.

Typically, when performing an event-driven performance test, we trigger the script manually and need collaborative oversight with the infrastructure team to monitor the test result. That kind of execution can be exhausting and cannot be done on a regular basis, so we need a mechanism and supporting tools or a platform to make execution more efficient and autonomous. Furthermore, the platform must operate on a self-serve basis so that each product engineering squad can define and execute its performance tests according to its own needs.

In response to those challenges, we developed a mechanism and platform called the Daily Performance Test, or simply DPT. DPT is a testing activity that assesses the performance metrics of microservice APIs on a scheduled, continuous basis so we can detect performance degradation as early as possible, whenever a result surpasses certain thresholds. Here’s a glimpse of how we built it.

DPT Environment

The Daily Performance Test at GovTech Edu can be applied to all microservices and is executed in the staging environment. Staging was chosen so we can identify changes before they reach the production environment, and because we do not want to add invalid data and metrics to the monitoring system in production. The staging environment is also more customizable and manageable, especially for test data management.

However, when DPT is executed in staging, it is challenging to define how much traffic load we need to generate for the microservices, since all the data references we have are based on the production environment configuration. Scaling up the test environment to production specifications for DPT is never an option, since it would cost a lot. The more realistic option we chose is to keep the staging environment at the minimum specification required during the load test and to adjust the load generation according to the scale comparison between production and staging, i.e., generating proportionally less traffic when staging runs on a fraction of production’s capacity.
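
As an illustration of how that scale comparison could be wired into the load configuration, here is a rough sketch; the environment variable names, numbers, and endpoint are placeholders for illustration, not our actual values:

import http from 'k6/http';

// Hypothetical scaling sketch: derive the staging arrival rate from the
// production baseline and the production-to-staging capacity ratio.
const prodPeakRps = Number(__ENV.PROD_PEAK_RPS || 200);        // observed peak rate in production
const scaleRatio = Number(__ENV.PROD_TO_STAGING_RATIO || 10);  // staging runs at ~1/10 of production capacity

export const options = {
  scenarios: {
    daily_load: {
      executor: 'constant-arrival-rate',
      rate: Math.max(1, Math.round(prodPeakRps / scaleRatio)), // scaled-down requests per timeUnit
      timeUnit: '1s',
      duration: '5m',
      preAllocatedVUs: 50,
    },
  },
};

export default function () {
  http.get(`${__ENV.BASE_URL}/health`); // placeholder endpoint
}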

DPT Schedule

For DPT, we need a consistent environment every time we perform the test so that each result can be validly benchmarked against previous results; consistency is the foundation of DPT. Currently, we do not have a dedicated and fully isolated environment for DPT, so we opt for nighttime testing when activity and traffic in staging are low. Scheduling a pipeline is easy to achieve in GitLab with the scheduled pipeline menu.

Although we execute the tests at nighttime, all the reports and alerts are sent in the morning, encouraging developers to do a regular morning check.

DPT Script

At GovTech Edu, we utilize k6 for every performance testing activity, including DPT. The test script used in DPT is the same as the typical k6 script we use in event-based performance tests. What we designed to make DPT development easier, faster, and more manageable are the base load configuration and the report output.
Most microservices that come from the same team want to be tested with the same load configuration so that, at some point, their results can be benchmarked against each other. Below is an illustration of the base load configuration.

Base load-config illustration
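
Since the illustration above is an image, here is a rough sketch of what such a shared base load configuration could look like as a reusable k6 module; the file name, scenario names, and numbers are placeholders rather than our actual values:

// base-load-config.js -- hypothetical shared module reused by a team's test scripts
export const baseScenarios = {
  // Short sanity run
  smoke: {
    executor: 'constant-vus',
    vus: 1,
    duration: '1m',
  },
  // Default profile for the scheduled daily run
  daily: {
    executor: 'ramping-arrival-rate',
    startRate: 1,
    timeUnit: '1s',
    preAllocatedVUs: 50,
    stages: [
      { target: 20, duration: '2m' }, // ramp up
      { target: 20, duration: '5m' }, // steady state
      { target: 0, duration: '1m' },  // ramp down
    ],
  },
};

// Default thresholds that individual scripts can extend or override
export const baseThresholds = {
  http_req_duration: ['p(95)<500'],
  http_req_failed: ['rate<0.01'],
};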

And here is a snippet of the k6 daily performance test script:

Sample of DPT Script

The k6 script typically consists of four parts (a minimal sketch follows this list):

  • The test script, where we define the API calls and assertions
  • The load or executor configuration, where we define the load profile
  • The metrics threshold configuration
  • The report configuration, where we define what kind of data we want to collect
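
Because the script in the image is not reproduced here, the sketch below shows how those four parts could fit together, reusing the hypothetical base-load-config module from the earlier sketch; the service URL and endpoint name are assumptions:

import http from 'k6/http';
import { check } from 'k6';
// Hypothetical shared module from the base load-config sketch above
import { baseScenarios, baseThresholds } from './base-load-config.js';

export const options = {
  // 2) Load/executor configuration
  scenarios: { daily: baseScenarios.daily },
  // 3) Metrics threshold configuration
  thresholds: baseThresholds,
  // 4) Report configuration: which summary statistics to keep
  summaryTrendStats: ['avg', 'min', 'max', 'p(90)', 'p(95)', 'p(99)'],
};

// 1) Test script: the API call and its assertions
export default function () {
  const res = http.get(`${__ENV.BASE_URL}/api/v1/health`, {
    tags: { name: 'GetHealth' },
  });
  check(res, {
    'status is 200': (r) => r.status === 200,
  });
}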

Pipeline Implementation

The DPT is executed in a single pipeline that can accommodate all test scripts developed by various teams from different tribes.

DPT is executed in parallel and in isolation per tribe within the same pipeline, instead of being distributed across each microservice's pipeline; this approach makes the pipeline easier to manage and monitor. Here is the DPT main pipeline:

DPT main pipeline

Each tribe has its own downstream pipeline containing performance test jobs for each microservice it maintains, so customization can be carried out according to the needs of each tribe.

Sub-pipeline that contains a job for each microservice

After the test is done, we collect the data, process it, and distribute it across several kinds of reporting media such as HTML reports, Slack channels, and BigQuery. We utilize the k6 feature for storing test metrics data in JSON format so we can extract and process the data as we need:

k6 run <script-path> --out json=target/report.json
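
The JSON output is a newline-delimited stream of metric samples. As a rough sketch of the kind of post-processing that could follow (the file path, tag names, and this particular aggregation are illustrative, not our actual pipeline code):

// parse-report.js -- illustrative post-processing of k6's JSON output with Node.js
const fs = require('fs');

const lines = fs
  .readFileSync('target/report.json', 'utf8')
  .split('\n')
  .filter(Boolean);

const durationsByEndpoint = {};

for (const line of lines) {
  const entry = JSON.parse(line);
  // k6 emits one "Point" object per metric sample
  if (entry.type === 'Point' && entry.metric === 'http_req_duration') {
    const name = (entry.data.tags && entry.data.tags.name) || 'untagged';
    if (!durationsByEndpoint[name]) durationsByEndpoint[name] = [];
    durationsByEndpoint[name].push(entry.data.value);
  }
}

// Nearest-rank p95 per endpoint, e.g. as input for a Slack alert or a BigQuery row
function p95(values) {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.ceil(0.95 * sorted.length) - 1];
}

for (const [name, values] of Object.entries(durationsByEndpoint)) {
  console.log(`${name}: p95=${p95(values).toFixed(1)}ms over ${values.length} samples`);
}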

The image below shows the end-to-end flow of the DPT that we designed.

DPT end-to-end flow

We use three reporting media, each with its own specific purpose:

  • HTML report: a detailed report per execution
  • Slack report: an alert, sent only when a result exceeds a certain threshold
  • BigQuery: a table storing historical results as source data for the Looker Studio dashboard

Here are snippets from the GitLab pipeline configuration for the DPT pipeline:

Main pipeline configuration
Downstream pipeline configuration
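
Because the actual configuration is shown as images above, the YAML below is only a rough sketch of the parent/child, per-tribe structure described earlier; the job names and file paths are assumptions:

# Main pipeline (.gitlab-ci.yml) -- illustrative sketch, not our actual configuration
stages:
  - performance-test

# One trigger job per tribe; each tribe's jobs run in its own child pipeline
dpt-tribe-a:
  stage: performance-test
  trigger:
    include: tribes/tribe-a/dpt.gitlab-ci.yml
    strategy: depend   # parent pipeline waits for and mirrors the child pipeline result

dpt-tribe-b:
  stage: performance-test
  trigger:
    include: tribes/tribe-b/dpt.gitlab-ci.yml
    strategy: depend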

Threshold

Performance thresholds are pre-defined performance limits or benchmarks that indicate acceptable performance levels. Defining them is essential so we are aware when our service metrics cross a value that we cannot tolerate. Thresholds also become the reference for deciding when to send an alert to the team via a Slack channel.

Performance threshold standards can differ for every microservice, and even for every endpoint. The DPT at GovTech Edu is designed with a mechanism for setting different thresholds per microservice, or even per endpoint.

Two kinds of threshold configurations can be utilized: a global threshold and a per-endpoint threshold. The global threshold is the value that becomes the reference for all endpoints when a per-endpoint threshold is not defined, while the per-endpoint threshold is used when you need to define specific metrics for certain endpoints. In our design, the global threshold is defined in the ENV configuration, and the per-endpoint threshold is defined at the test script level.

Here is a snippet of how we implement the threshold at the endpoint level in the k6 script. We utilize k6's tag-based threshold feature to make the script more concise.

Threshold implementation on the script
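
Since the snippet above is shown as an image, the script below is a rough sketch of what this tag-based threshold configuration could look like; the GLOBAL_P95_MS variable and the endpoint names are placeholders:

import http from 'k6/http';

// Global threshold taken from an environment variable, with a fallback value
const globalP95 = __ENV.GLOBAL_P95_MS || '500';

export const options = {
  thresholds: {
    // Global thresholds: apply to all requests in the script
    http_req_duration: [`p(95)<${globalP95}`],
    http_req_failed: ['rate<0.01'],
    // Per-endpoint thresholds, scoped to the `name` tag set on each request
    'http_req_duration{name:GetStudentProfile}': ['p(95)<300', 'avg<200'],
    'http_req_duration{name:SubmitAssessment}': ['p(99)<1000', 'max<2000'],
  },
};

export default function () {
  http.get(`${__ENV.BASE_URL}/students/profile`, { tags: { name: 'GetStudentProfile' } });
  http.post(`${__ENV.BASE_URL}/assessments`, JSON.stringify({}), { tags: { name: 'SubmitAssessment' } });
}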

From the test result, we collect various metrics such as average response time, maximum response time, error rate, and percentile information (90th, 95th, and 99th). All of those metrics can be used as thresholds by the product team according to their needs.

Looker Studio Dashboard

Looker Studio was chosen as the data visualization tool because it is easy to customize, has stunning visualizations, and integrates well with the other Google tools in the ecosystem we use. In Looker Studio, we show all the test result metrics in a table that users can filter with several options we provide.

The performance metrics information we provide in the dashboard includes:

  • The number of endpoints tested
  • The test status (status note and remarks)
  • Test metrics (p90, p95, p99, min, max, and average response time)
  • RPS-related data (test sample count, duration, number of failed samples)
  • Graph visualization of test metrics over time
  • Test configuration
DPT data table

Besides the test metrics, we also provide supporting information such as the URL of the execution pipeline and remarks when an endpoint breaches a certain threshold. Here is a snippet of how a failed endpoint is displayed to the user:

Sample data with exceeding threshold condition

When there is a failed test or an anomaly in the result, the development team should add a remark or comment on the data. Currently, we use a simple approach: a Google Sheet that we join with the table in Looker Studio.

For data trends over time, we visualize some of the metrics, such as the average and p95 response time trends. All of the visualized metrics are customizable depending on the user's needs; below are snippets of the visualization.

Besides performance metrics, we also track the test configuration of each microservice so we can be aware if there are changes.

Conclusion

The Daily Performance Test is not overhead; it can become a cost- and time-saving approach if managed correctly. Regular execution and continuous evaluation mean that each day you only need to allocate a small portion of your time to analyzing smaller, more isolated issues and changes. By implementing DPT, we can monitor and identify changes in performance metrics caused by code or configuration changes over time, and from that we gain several benefits, for instance:

Early Regression Detection

DPT helps pinpoint the root cause of performance degradation sooner and more easily than relying only on an “event-based performance test”. The development team has more time for improvement, so they can consider many options instead of relying heavily on scaling the infrastructure alone.

Baseline and Constant Awareness

With DPT, we have a performance baseline for our microservices, so the development team is constantly aware of how the application performs. When an extraordinary event occurs in the future, the development team can perform risk assessments more efficiently and confidently.

Continuous Performance Tuning

Having DPT metrics data allows the development team to make recommendations for performance improvements even during development cycles.

We also find that DPT helps us discover issues that can't be caught by normal API functional tests. In the image below, you can see a high error rate and high latency on a specific endpoint even though the test load is not very high.

Endpoint with a high error rate

This case can't be caught by a simple API test in the deployment pipeline: a functional test only calls the API once at a time, and the microservice still returns HTTP 200, just with high latency (and most functional tests don't implement latency assertions either).

After the development team performed an investigation, the cause turned out to be unoptimized code that triggered OOM in the microservice, and after the fix, we can see a corresponding improvement in the performance metrics.

About Writers

Hamnah Suhaeri

Engineering Manager at GovTech Edu. Her engineering background is as a software engineer in test. With over 5 years of experience in quality engineering, she developed a keen interest in moving into the managerial realm of the Quality Engineering platform, or Core QA, at GovTech Edu.

Muhammad Aditya Ridharrahman

Senior Software Development Engineer in the Core QA Squad at GovTech Edu, whose current main focus is building and maintaining the test platform. He has more than 7 years of experience in the software quality domain and is shaping himself to have a balanced skill set across technical and quality assurance processes.
