# Test Data Generation Techniques

## Data Masking
Data masking generates a sanitized version of the data for testing and development, ensuring sensitive information remains protected. Below are some techniques to handle data masking:
| Technique | Description |
|---|---|
| Substitution | Replace actual sensitive data with fictional or anonymized values. |
| Shuffling | Randomly shuffle the order of data records to break associations between sensitive information and other data elements. |
| Encryption | Use encryption algorithms to transform sensitive data into unreadable ciphertext. Only authorized users with decryption keys can access the original data. |
| Tokenization | Replace sensitive data with randomly generated tokens. Tokens map to the original data, allowing reversible access by authorized users. |
| Character Masking | Mask specific characters within sensitive data, revealing only a portion of the information. |
| Dynamic Data Masking | Dynamically control and limit the exposure of confidential data in real-time during query execution. In other words, sensitive data is masked at the moment of retrieval, just before being presented to the user (usually the masking logic is based on user roles). |
| Randomization | Introduce randomness to the values of sensitive data for creating diverse test datasets. |
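A few of the techniques above can be sketched in a short, self-contained way. The snippet below is a minimal illustration of substitution, shuffling, and character masking using only the standard library; the function names, the sample card number, and the fictional-name pool are assumptions for the example, not part of any particular masking product.

```python
import hashlib
import random

def mask_characters(value: str, visible: int = 4) -> str:
    """Character masking: hide all but the last `visible` characters."""
    return "*" * (len(value) - visible) + value[-visible:]

def substitute_name(original: str, pool: list) -> str:
    """Substitution: map a real name to a fictional one.

    Hashing the original makes the mapping deterministic, so the same
    input always yields the same fake value across test runs.
    """
    index = int(hashlib.sha256(original.encode()).hexdigest(), 16) % len(pool)
    return pool[index]

def shuffle_column(values: list, seed: int = 0) -> list:
    """Shuffling: reorder a column to break row-level associations."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Example: mask a (fake) card number, leaving only the last four digits.
print(mask_characters("4111111111111111"))
```

Note that substitution and shuffling preserve the shape and statistics of the data better than encryption, which matters when the masked dataset must still exercise realistic code paths.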
## Data Subsetting
Data subsetting involves creating a smaller, yet representative, portion of a production database for testing and development purposes. Performing data subsetting provides the following benefits:
- Reduce data volume. A smaller dataset minimizes storage and compute requirements for testing, which in turn reduces maintenance effort.
- Preserve data integrity. Subsetting a dataset does not change the relationship between rows, columns, and any entities within it.
- Easily include/exclude data based on specific criteria relevant to the team’s testing needs. Depending on the requirements, the dataset can be used to verify and replicate specific scenarios that may be present in production.
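A minimal sketch of criterion-based subsetting, assuming two hypothetical in-memory tables (`customers` and `orders`) linked by `customer_id`: rows are selected by a test-relevant criterion, and referential integrity is preserved by also keeping every customer the selected orders reference.

```python
# Hypothetical production-like tables linked by customer_id.
customers = [{"customer_id": i, "name": "cust%d" % i} for i in range(100)]
orders = [
    {"order_id": i, "customer_id": i % 100, "region": "EU" if i % 2 else "US"}
    for i in range(500)
]

# Criterion-based subset: only orders relevant to the scenario under test.
subset_orders = [o for o in orders if o["region"] == "EU"]

# Preserve integrity: include every customer referenced by the subset,
# so no order in the subset points at a missing parent row.
referenced_ids = {o["customer_id"] for o in subset_orders}
subset_customers = [c for c in customers if c["customer_id"] in referenced_ids]
```

In a real database this selection is usually expressed as SQL with joins that follow foreign keys outward from the driving table, but the principle is the same: filter first, then close over the relationships.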
## Synthetic Data Generation
Synthetic data generation involves creating artificial datasets that replicate real-world data while excluding sensitive or confidential information. This method is typically used when acquiring real data is difficult (such as financial, medical, or legal data) or risky (like employee personal information).
In such cases, generating entirely new datasets for testing is a more practical and cost-effective approach, since only a set of criteria is needed to tailor the synthetic data to the scenario under test.
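As a minimal sketch of this idea, the snippet below generates artificial patient-style records from scratch using only the standard library. The field names and value ranges are assumptions chosen for illustration; a seeded generator keeps the output reproducible across test runs.

```python
import random

def generate_patients(n: int, seed: int = 42) -> list:
    """Generate n synthetic patient records with no real-world source data."""
    rng = random.Random(seed)  # seeded for reproducible test datasets
    first_names = ["Alex", "Sam", "Jordan", "Casey"]
    last_names = ["Lee", "Kim", "Patel", "Garcia"]
    return [
        {
            "id": i,
            "name": "%s %s" % (rng.choice(first_names), rng.choice(last_names)),
            "age": rng.randint(18, 90),
            "systolic_bp": rng.randint(90, 160),
        }
        for i in range(n)
    ]

records = generate_patients(5)
```

Because every value is drawn from a declared range rather than copied from production, the dataset carries no PII by construction, while still exercising realistic value distributions.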
## Tools that Aid Test Data Generation
- Faker. A JavaScript library that generates fake (but plausible) data for uses such as unit testing, performance testing, building demos, and working without a completed backend. Ports exist for other languages, such as Bogus for C#.
- Copilot can also be used to generate test data based on a specific set of conditions or scenarios. However, be mindful that using AI to generate test data must conform to the company's data policies and comply with the General Data Protection Regulation (GDPR), ensuring that no personally identifiable information (PII) or other sensitive data is used.