Background Problems
The QA team needed to verify that the internal URL links for roughly ~6,000 datasets no longer contain a UUID in their URL prefixes, and to do so quickly with limited time and manpower (we had just one full day and two QA members to test). The reason we do this is that, according to the SEO team, URL links that carry an ID in them may hurt SEO performance. The engineering team had already resolved the issue, so we, as the QA team, were tasked with verifying the given datasets.
Proposed Solution
There are better ways to verify these URL links than randomly cherry-picking which ones to test. One thing we can do is utilize Stratified Random Sampling combined with Boundary Value Analysis. By doing this, we pick a subset of URLs from the given datasets that represents the characteristics of the entire dataset. The full set of steps is:
- First, apply Boundary Value Analysis. We need to select which URL links to test, and we can choose based on several criteria. In this case, we differentiate by:
  - URL length: the URL link to be tested can be categorized as short, medium, or long. The criteria for these boundaries depend on our needs and requirements.
  - URL format: in this case, the previous URLs carried an additional identifier such as a UUID to avoid duplicate processing. Beyond that, it comes back to our requirements.
  - URL characters: the characteristics of the given URL, e.g. whether it contains hyphens, alphanumerics, or something else.
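Since the core check is whether a URL still carries a UUID, it can be automated with a small script. Here is a minimal sketch in Python, assuming the standard 8-4-4-4-12 hexadecimal UUID layout; the exact format embedded in the real dataset's URLs may differ, so adjust the pattern accordingly:

```python
import re

# Assumption: UUIDs appear in the canonical 8-4-4-4-12 hex form.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)

def has_uuid(url: str) -> bool:
    """Return True if the URL still carries a UUID-like segment."""
    return bool(UUID_RE.search(url))

print(has_uuid("https://example.com/company/acme-sdn-bhd-550e8400-e29b-41d4-a716-446655440000"))  # True
print(has_uuid("https://example.com/company/acme-sdn-bhd"))  # False
```

The URLs above are hypothetical placeholders; the point is that each sampled URL can be run through a check like this instead of being eyeballed manually.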
After a quick check, the only practical way for us to identify domain boundaries for choosing the right URLs is by the type of business entity. We can group the URLs into at least 3 categories (along with their counts):

| Entity Type | Count |
|---|---|
| Sdn Bhd | 4069 entities |
| Berhad | 295 entities |
| Group | 593 entities |
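The grouping step itself can be scripted. A hedged sketch: bucket each URL into a stratum by the entity keyword in its slug. The keyword matching and the sample URLs here are assumptions; real slugs may need stricter rules.

```python
from collections import Counter

def stratum_of(url: str) -> str:
    # Assumption: the entity type appears as a hyphenated keyword in the slug.
    slug = url.lower().rsplit("/", 1)[-1]
    if "sdn-bhd" in slug:
        return "Sdn Bhd"
    if "berhad" in slug:
        return "Berhad"
    if "group" in slug:
        return "Group"
    return "Other"

urls = [
    "https://example.com/company/acme-sdn-bhd",
    "https://example.com/company/maju-berhad",
    "https://example.com/company/nusantara-group",
    "https://example.com/company/sinar-sdn-bhd",
]
print(dict(Counter(stratum_of(u) for u in urls)))  # {'Sdn Bhd': 2, 'Berhad': 1, 'Group': 1}
```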
- Choose the confidence level and margin of error, respectively: 95% and 5%.
- Determine the sample size. Note that this step carries some bias: we did not account for unexpected URLs that could vary in length, have unstructured company names, parent-child company formats, and so on. The sample size follows the standard (Cochran's) formula:

  n = Z² × p × (1 − p) / E²

  Where:
- n = the required sample size
- Z = the Z-score for the desired confidence level (1.96 for a 95% confidence level)
- p = the estimated proportion of the population; we assume an equal chance of catching or missing a bug (maximum variability), so we set p = 0.5
- E = the margin of error
Substituting the numbers defined above gives a result of 385 (384.16, rounded up).
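The calculation can be reproduced in a few lines of Python:

```python
import math

# Cochran's sample-size formula: n = Z^2 * p * (1 - p) / E^2
Z = 1.96   # Z-score for a 95% confidence level
p = 0.5    # assumed proportion (maximum variability)
E = 0.05   # 5% margin of error

n = Z ** 2 * p * (1 - p) / E ** 2
print(round(n, 2))   # 384.16
print(math.ceil(n))  # 385
```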
- Since we already know the size of each identified stratum by business-entity type, we can calculate the sample size for each stratum. We do this to get proportional, smaller numbers of URL links to test while still capturing the diversity of the final result. The per-stratum sample size is:

  n_h = (N_h / N) × n

  Where:
- L = the number of strata (h = 1, …, L indexes them)
- N_h = the population size of stratum h (the number of URLs with that characteristic)
- N = the total population of the dataset (~6000)
- n = the overall sample size (385, from the previous step)
Substituting the numbers defined above gives the sample size for each stratum:

| Stratum | Sample Size |
|---|---|
| Stratum-1 (Sdn Bhd) | 261 |
| Stratum-2 (Berhad) | 19 |
| Stratum-3 (Group) | 38 |
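The proportional allocation can also be reproduced in Python. Taking N as the full dataset of ~6000 URLs is an assumption, but it reproduces the per-stratum figures above:

```python
# Proportional allocation: n_h = (N_h / N) * n
strata = {"Sdn Bhd": 4069, "Berhad": 295, "Group": 593}
N = 6000  # assumed total population (the ~6000 datasets)
n = 385   # overall sample size from Cochran's formula

allocation = {name: round(N_h / N * n) for name, N_h in strata.items()}
print(allocation)  # {'Sdn Bhd': 261, 'Berhad': 19, 'Group': 38}
```

From here, the actual URLs per stratum could be drawn with `random.sample`, which samples without replacement.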
With this, we can start testing by randomly checking around 261 URL links containing the Sdn Bhd format, 19 URL links containing the Berhad format, and lastly 38 URL links with the Group format. For us, the 385 URLs are far more doable and sufficient to test given our team's situation and constraints.
Testing Results
Once we finished the steps above, we collected all of the URLs required for testing. Due to several circumstances (and considerations), we picked up only 131 of the URLs containing the Sdn Bhd prefix, but for the other strata we still managed to test all of them. The testing results from applying these techniques so far:
| Stratum | Sample Size | Number of URLs | % Successful |
|---|---|---|---|
| Stratum-1 (Sdn Bhd) | 261 | 131 | 99.73% |
| Stratum-2 (Berhad) | 19 | 0 | 100% |
| Stratum-3 (Group) | 36 | 2 | 99.73% |
While the success rate itself is quite high (above 99%), in reality we ran into some confusion and a few "mind-blown" moments, because some URLs still can't be captured by this method: for example, URLs containing the string "group" but not separated by a hyphen (-), which isn't a valid business entity, and cases where a company doesn't even have a slug prefix because its name is written in Mandarin.
Are these even valid companies? Unfortunately, yes.
With this new information, we can carry out an iterative testing process that adds the variables we've found.
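A hedged sketch of the follow-up checks we might add after these findings: flag URLs where "group" appears fused into another word (not hyphen-separated, so not a real entity suffix) and URLs with no slug at all (e.g. companies whose names are written in Mandarin). The function name and URLs are hypothetical.

```python
import re

def needs_review(url: str) -> bool:
    slug = url.rsplit("/", 1)[-1].lower()
    if not slug:
        return True  # empty slug: the company name produced no prefix
    if re.search(r"[a-z]group|group[a-z]", slug):
        return True  # "group" fused inside a word, not a separate entity token
    return False

print(needs_review("https://example.com/company/"))                 # True
print(needs_review("https://example.com/company/supergroup"))       # True
print(needs_review("https://example.com/company/nusantara-group"))  # False
```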
Caveats
There are several reasons we needed to do this. Since the dataset itself is huge, several stakeholders and other QA engineers previously tried simple random sampling, and it turned out that issues were still missed. Thus, we needed an alternative way to test scenarios like this.
It should be noted that this method requires one or two additional rounds of test data and iterative monitoring to ensure the latest URLs comply with the criteria we identified earlier. Additionally, there are situations where this method isn't suitable, which can serve as reflections for future scenarios:
- The dataset is overly heterogeneous and its size is still uncertain (possibly always changing), making it hard to map out all of the given characteristics
- Resources are too limited to test and carry out all of these steps from the beginning. For instance, the required sample for stratum 1 alone can be considered large; we could reduce it by allowing the margin of error to rise to 10%, but by doing this, the chance of missing issues grows considerably
- For further adoption, consider using spreadsheet functions to automate the testing process
- We opted not to apply boundary analysis to URL length and URL format, since it turned out to be quite difficult to define how many characters to limit and which character types to allow; these vary too much (which is why we ignored those two criteria). Ideally, we believe, introducing them would considerably increase the sample size to test
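To illustrate the margin-of-error trade-off mentioned above, widening E from 5% to 10% shrinks the required sample roughly fourfold:

```python
import math

def cochran_n(Z: float, p: float, E: float) -> int:
    # Cochran's formula, rounded up to a whole sample.
    return math.ceil(Z ** 2 * p * (1 - p) / E ** 2)

print(cochran_n(1.96, 0.5, 0.05))  # 385
print(cochran_n(1.96, 0.5, 0.10))  # 97
```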
Big thanks to mas Arif for reviewing this article and collaborating with me during the ideation and development phase!