10 Data Desensitization How to Ensure Safe Access to Sensitive Data

Starting with this lesson, we introduce the data desensitization feature in ShardingSphere. Data desensitization transforms sensitive information according to desensitization rules so that private data is reliably protected. Data security has always been an important and sensitive topic in day-to-day development. Compared with traditional privately deployed systems, internet applications place higher demands on data security and expose a wider range of data. What counts as sensitive varies with the industry and business scenario, but personal information such as ID numbers, phone numbers, card numbers, user names, and account passwords generally needs to be desensitized.

How does ShardingSphere abstract data desensitization? #

Data desensitization is easy to understand as a concept, but there are many ways to implement it. Before walking through the development process, it helps to understand how the implementation can be abstracted. Here, I break the process down along three dimensions: how sensitive data is stored, how it is encrypted and decrypted, and how encryption and decryption are embedded in business code.

Drawing 0.png

For each dimension, I will describe the abstraction that ShardingSphere provides, which should make the framework easier to use. Let’s take a look.

How are sensitive data stored? #

The question here is whether sensitive data needs to be stored in plaintext in the database. The answer is not absolute.

Let’s start with the first scenario. Some sensitive data should obviously be stored only as ciphertext, so that the plaintext can never be obtained from the database. The most typical example is user passwords. We usually process them with an irreversible one-way hash algorithm such as MD5 and work only with the resulting ciphertext, never handling the plaintext directly.

However, for information such as user names and phone numbers, we cannot simply apply an irreversible algorithm, because statistical analysis and similar needs require access to the plaintext. One common approach is to use two columns for one field, with one column storing the plaintext and the other storing the ciphertext. This is the second scenario.

Clearly, we can consider the first scenario as a special case of the second scenario. In the first scenario, there is no plaintext column, only a ciphertext column.

ShardingSphere also abstracts based on these two scenarios. It names the plaintext column “plainColumn” and the ciphertext column “cipherColumn”. The plainColumn is optional, while the cipherColumn is required. Meanwhile, ShardingSphere also proposes the concept of a logical column “logicColumn”, which represents a virtual column only used for programming by developers:

Drawing 2.png
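The relationship between the three kinds of columns can be sketched in a few lines of Java. This is only an illustrative model, not ShardingSphere code: the ColumnMapping class, its physicalColumns method, and the sample column names are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of one desensitized column; not a ShardingSphere class.
final class ColumnMapping {
    final String logicColumn;   // virtual column used by application code
    final String plainColumn;   // optional, may be null
    final String cipherColumn;  // required

    ColumnMapping(String logicColumn, String plainColumn, String cipherColumn) {
        if (cipherColumn == null || cipherColumn.isEmpty()) {
            throw new IllegalArgumentException("cipherColumn is required");
        }
        this.logicColumn = logicColumn;
        this.plainColumn = plainColumn;
        this.cipherColumn = cipherColumn;
    }

    // Physical columns that back the logical column in the table.
    List<String> physicalColumns() {
        List<String> columns = new ArrayList<>();
        if (plainColumn != null) {
            columns.add(plainColumn);
        }
        columns.add(cipherColumn);
        return columns;
    }
}

class ColumnMappingDemo {
    public static void main(String[] args) {
        // Second scenario: plaintext and ciphertext are both stored.
        ColumnMapping phone = new ColumnMapping("phone", "phone_plain", "phone_cipher");
        // First scenario: only the ciphertext is stored.
        ColumnMapping password = new ColumnMapping("password", null, "password_cipher");
        System.out.println(phone.logicColumn + " -> " + phone.physicalColumns());
        System.out.println(password.logicColumn + " -> " + password.physicalColumns());
    }
}
```

The model makes the earlier point concrete: the first scenario is simply the second one with the plaintext column left out.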

How is sensitive data encrypted and decrypted? #

Data desensitization is essentially an application scenario of encryption and decryption technology, so it naturally involves encapsulating various encryption and decryption algorithms. There are two traditional encryption methods: symmetric encryption, including DES and AES, and asymmetric encryption, including RSA.

ShardingSphere also abstracts a ShardingEncryptor component internally to specifically encapsulate various encryption and decryption operations:

public interface ShardingEncryptor extends TypeBasedSPI {
    // Initialization
    void init();
    // Encryption
    String encrypt(Object plaintext);
    // Decryption
    Object decrypt(String ciphertext);
}

Currently, ShardingSphere has two built-in implementations of ShardingEncryptor: AESShardingEncryptor and MD5ShardingEncryptor. Because ShardingEncryptor extends the TypeBasedSPI interface, developers can also implement custom ShardingEncryptors and load them dynamically, based on the microkernel architecture and the SPI mechanism provided by the JDK. We will discuss both in detail in the lesson “Microkernel Architecture: How Does ShardingSphere Implement System Extensibility?”.
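To make the difference between a reversible and an irreversible encryptor concrete, here is a minimal sketch that uses only the JDK. It does not reproduce ShardingSphere's built-in classes: the SimpleEncryptor interface, both implementations, and the way the AES key is derived from the configured string (SHA-256, truncated to 128 bits) are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical local interface with an encrypt/decrypt pair; not the ShardingSphere type.
interface SimpleEncryptor {
    String encrypt(String plaintext);
    String decrypt(String ciphertext);
}

// Reversible: the ciphertext can be turned back into the plaintext.
final class AesSketchEncryptor implements SimpleEncryptor {
    private final SecretKeySpec key;

    AesSketchEncryptor(String keyValue) {
        try {
            // Assumption: hash the configured string and keep 16 bytes as the AES key.
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(keyValue.getBytes(StandardCharsets.UTF_8));
            this.key = new SecretKeySpec(Arrays.copyOf(digest, 16), "AES");
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    @Override
    public String encrypt(String plaintext) {
        return Base64.getEncoder().encodeToString(
                run(Cipher.ENCRYPT_MODE, plaintext.getBytes(StandardCharsets.UTF_8)));
    }

    @Override
    public String decrypt(String ciphertext) {
        return new String(run(Cipher.DECRYPT_MODE, Base64.getDecoder().decode(ciphertext)),
                StandardCharsets.UTF_8);
    }

    private byte[] run(int mode, byte[] input) {
        try {
            // ECB is deterministic: equal plaintexts give equal ciphertexts.
            Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
            cipher.init(mode, key);
            return cipher.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}

// Irreversible: only the hash is kept, so "decrypt" cannot recover the input.
final class Md5SketchEncryptor implements SimpleEncryptor {
    @Override
    public String encrypt(String plaintext) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(plaintext.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    @Override
    public String decrypt(String ciphertext) {
        // A hash cannot be reversed, so the stored value is returned unchanged.
        return ciphertext;
    }
}
```

With the AES sketch, decrypt(encrypt(x)) returns x, while the MD5 sketch can only hand back the hash. That is why a hash suits passwords but not columns whose plaintext must be shown again. The deterministic ECB mode is used here only to keep the sketch short; it reveals which rows hold equal values.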

How to Embed Data Desensitization in Business Code? #

The last abstract point of data desensitization is how to embed the data desensitization process in the business code. This process should be as automated as possible, have low intrusiveness, and be transparent enough to developers.

We can describe the execution flow of data desensitization through a specific example. Suppose there is a user table in the system, which includes a user_name column. We consider this user_name column as sensitive data and need to desensitize it. According to the data storage scheme discussed earlier, two fields can be set in the user table: one representing the plaintext user_name (user_name_plain) and one representing the ciphertext user_name (user_name_cipher). Then, the application interacts with the database table through the logical column user_name:

Drawing 4.png

For this interaction process, we hope that there is a mechanism that can automatically map the logical column user_name to the columns user_name_plain and user_name_cipher. At the same time, we want to provide a configuration mechanism that allows developers to flexibly specify various encryption and decryption algorithms used in the desensitization process.

As an excellent open-source framework, ShardingSphere provides such a mechanism. So, how does it achieve this?

First, ShardingSphere parses the SQL passed in from the application, and based on the desensitization configuration provided by developers, it rewrites the SQL to automatically encrypt the plaintext data and store the encrypted ciphertext data in the database. When we query data, it retrieves the ciphertext data from the database and automatically decrypts it, finally returning the decrypted plaintext data to the user. ShardingSphere provides an automated and transparent data desensitization process, and business developers can use desensitized data just like ordinary data without having to worry about the implementation details of data desensitization.
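The write and read paths described above can be imitated with a small, self-contained sketch. It works on in-memory rows instead of SQL, and the DesensitizingRowMapper class is hypothetical rather than part of ShardingSphere. The encrypt and decrypt functions are passed in by the caller, and the demo uses string reversal purely as a placeholder for a real cipher.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch of the transparent write/read path; not a ShardingSphere API.
final class DesensitizingRowMapper {
    private final String logicColumn;
    private final String plainColumn;   // may be null
    private final String cipherColumn;
    private final UnaryOperator<String> encrypt;
    private final UnaryOperator<String> decrypt;

    DesensitizingRowMapper(String logicColumn, String plainColumn, String cipherColumn,
                           UnaryOperator<String> encrypt, UnaryOperator<String> decrypt) {
        this.logicColumn = logicColumn;
        this.plainColumn = plainColumn;
        this.cipherColumn = cipherColumn;
        this.encrypt = encrypt;
        this.decrypt = decrypt;
    }

    // Write path: replace the logical column with its physical columns.
    Map<String, String> toPhysical(Map<String, String> logicalRow) {
        Map<String, String> physical = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : logicalRow.entrySet()) {
            if (entry.getKey().equals(logicColumn)) {
                if (plainColumn != null) {
                    physical.put(plainColumn, entry.getValue());
                }
                physical.put(cipherColumn, encrypt.apply(entry.getValue()));
            } else {
                physical.put(entry.getKey(), entry.getValue());
            }
        }
        return physical;
    }

    // Read path: decrypt the cipher column and expose it under the logical name.
    Map<String, String> toLogical(Map<String, String> physicalRow) {
        Map<String, String> logical = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : physicalRow.entrySet()) {
            if (entry.getKey().equals(cipherColumn)) {
                logical.put(logicColumn, decrypt.apply(entry.getValue()));
            } else if (!entry.getKey().equals(plainColumn)) {
                logical.put(entry.getKey(), entry.getValue());
            }
        }
        return logical;
    }
}

class RowMapperDemo {
    public static void main(String[] args) {
        // Placeholder cipher: reversing the string stands in for real encryption.
        UnaryOperator<String> reverse = s -> new StringBuilder(s).reverse().toString();
        DesensitizingRowMapper mapper = new DesensitizingRowMapper(
                "user_name", "user_name_plain", "user_name_cipher", reverse, reverse);

        Map<String, String> logicalRow = new LinkedHashMap<>();
        logicalRow.put("user_id", "1");
        logicalRow.put("user_name", "alice");

        Map<String, String> stored = mapper.toPhysical(logicalRow);
        System.out.println(stored);
        System.out.println(mapper.toLogical(stored));
    }
}
```

The application only ever sees the user_name key. The two physical columns appear and disappear inside the mapper, which is the kind of transparency the SQL rewriting aims for.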

System Transformation: How to Implement Data Desensitization? #

Next, let’s continue to transform the system and add the data desensitization feature. This process mainly consists of three steps: preparing for data desensitization, configuring data desensitization, and executing data desensitization.

Preparing for Data Desensitization #

To demonstrate the data desensitization feature, we define a new EncryptUser entity class. It includes commonly used fields such as the user name and password, which correspond to the columns of the encrypt_user table in the database:

public class EncryptUser {
    //User ID
    private Long userId;
    //Username (encrypted)
    private String userName;
    //Username (plaintext)
    private String userNamePlain;
    //Password (encrypted)
    private String pwd;
    ...
}

Next, let’s look at the resultMap definition and the insert statement in the EncryptUserMapper, as shown below:

<mapper namespace="com.tianyilan.shardingsphere.demo.repository.EncryptUserRepository">
    <resultMap id="encryptUserMap" type="com.tianyilan.shardingsphere.demo.entity.EncryptUser">
        <result column="user_id" property="userId" jdbcType="INTEGER"/>
        <result column="user_name" property="userName" jdbcType="VARCHAR"/>
        <result column="pwd" property="pwd" jdbcType="VARCHAR"/>
    </resultMap> 
    <insert id="addEntity">
        INSERT INTO encrypt_user (user_id, user_name, pwd) VALUES (#{userId,jdbcType=INTEGER}, #{userName,jdbcType=VARCHAR}, #{pwd,jdbcType=VARCHAR})
    </insert>
    ...
</mapper>

Please note that we did not specify the user_name_plain field in resultMap, and similarly, this field is not specified in the insert statement.

With the Mapper in place, we can now build the Service layer components. In the EncryptUserServiceImpl class, we provide the processEncryptUsers and getEncryptUsers methods to insert and retrieve user lists, respectively.

@Service
public class EncryptUserServiceImpl implements EncryptUserService {
    @Autowired
    private EncryptUserRepository encryptUserRepository;

    @Override
    public void processEncryptUsers() throws SQLException {
        insertEncryptUsers();
    }

    // Inserts ten sample users and returns their IDs
    private List<Long> insertEncryptUsers() throws SQLException {
        List<Long> result = new ArrayList<>(10);
        for (long i = 1L; i <= 10; i++) {
            EncryptUser encryptUser = new EncryptUser();
            encryptUser.setUserId(i);
            encryptUser.setUserName("username_" + i);
            encryptUser.setPwd("pwd" + i);
            encryptUserRepository.addEntity(encryptUser);
            result.add(encryptUser.getUserId());
        }

        return result;
    }
    ...

    // Queries all users through the logical columns
    @Override
    public List<EncryptUser> getEncryptUsers() throws SQLException {
        return encryptUserRepository.findEntities();
    }
}

Now, the business layer code is ready. Since the data encryption function is embedded in sharding-jdbc-spring-boot-starter, we do not need to introduce additional dependencies.

Configure data encryption #

In the overall architecture, similar to sharding and read-write separation, the entry point exposed by data encryption is also an EncryptDataSource object that conforms to the JDBC specifications. As shown in the following code, ShardingSphere provides the EncryptDataSourceFactory factory class to build the EncryptDataSource object:

public final class EncryptDataSourceFactory {

    public static DataSource createDataSource(final DataSource dataSource, final EncryptRuleConfiguration encryptRuleConfiguration, final Properties props) throws SQLException {
        return new EncryptDataSource(dataSource, new EncryptRule(encryptRuleConfiguration), props);
    }
}

As you can see, there is an EncryptRuleConfiguration class here, which contains two maps for configuring the list of encryptors and the list of encrypted table configurations:

// List of encryptor configurations
private final Map<String, EncryptorRuleConfiguration> encryptors;

// List of encrypted table configurations
private final Map<String, EncryptTableRuleConfiguration> tables;

EncryptorRuleConfiguration extends TypeBasedSPIConfiguration, a common abstract class in ShardingSphere that contains the fields type and properties:

// Type (e.g., MD5/AES encryptor)
private final String type;

// Properties (e.g., key value used for AES encryptor)
private final Properties properties;

EncryptTableRuleConfiguration internally holds a map containing multiple EncryptColumnRuleConfiguration objects. EncryptColumnRuleConfiguration is ShardingSphere’s configuration for an encrypted column and defines plainColumn, cipherColumn, assistedQueryColumn, and encryptor:

public final class EncryptColumnRuleConfiguration {
    // Column storing plain text
    private final String plainColumn;
    
    // Column storing cipher text
    private final String cipherColumn;
    
    // Auxiliary query column
    private final String assistedQueryColumn;
    
    // Encryptor name
    private final String encryptor;
}

To sum up, we have listed the relationships between various configuration classes and the configurations required for data encryption using the following diagram:

Drawing 6.png

Now let’s get back to the code. To implement data encryption, we first need to define a data source, named dsencrypt:

spring.shardingsphere.datasource.names=dsencrypt
spring.shardingsphere.datasource.dsencrypt.type=com.zaxxer.hikari.HikariDataSource
spring.shardingsphere.datasource.dsencrypt.driver-class-name=com.mysql.jdbc.Driver
spring.shardingsphere.datasource.dsencrypt.jdbc-url=jdbc:mysql://localhost:3306/dsencrypt
spring.shardingsphere.datasource.dsencrypt.username=root
spring.shardingsphere.datasource.dsencrypt.password=root

After successful configuration, we configure the encryptors. Here, we define two encryptors, name_encryptor and pwd_encryptor, which are used to encrypt and decrypt the user_name and pwd columns, respectively. Note that in the following code, we use the AES symmetric encryption algorithm for the name_encryptor, while we directly use the irreversible MD5 hash algorithm for the pwd_encryptor:

spring.shardingsphere.encrypt.encryptors.name_encryptor.type=aes
spring.shardingsphere.encrypt.encryptors.name_encryptor.props.aes.key.value=123456
spring.shardingsphere.encrypt.encryptors.pwd_encryptor.type=md5

Next, we need to configure the encrypted table. For the scenario in the case study, we can choose to set the plainColumn, cipherColumn, and encryptor properties for the user_name column, while for the pwd column, since we don’t want to store plain text in the database, we only need to configure the cipherColumn and encryptor properties:

spring.shardingsphere.encrypt.tables.encrypt_user.columns.user_name.plainColumn=user_name_plain
spring.shardingsphere.encrypt.tables.encrypt_user.columns.user_name.cipherColumn=user_name
spring.shardingsphere.encrypt.tables.encrypt_user.columns.user_name.encryptor=name_encryptor
spring.shardingsphere.encrypt.tables.encrypt_user.columns.pwd.cipherColumn=pwd
spring.shardingsphere.encrypt.tables.encrypt_user.columns.pwd.encryptor=pwd_encryptor
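Note that the encryptor value of each column is a reference by name to an entry defined under encryptors. The following sketch shows one way to check that every reference resolves. It is a hypothetical helper written for this lesson, not something ShardingSphere provides, and it takes plain maps rather than the real configuration classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical consistency check; the class and method names are assumptions.
final class EncryptorReferenceCheck {

    // encryptorTypes: encryptor name -> type (for example "aes" or "md5")
    // columnToEncryptor: "table.column" -> encryptor name used by that column
    static List<String> findDanglingReferences(Map<String, String> encryptorTypes,
                                               Map<String, String> columnToEncryptor) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, String> entry : columnToEncryptor.entrySet()) {
            if (!encryptorTypes.containsKey(entry.getValue())) {
                problems.add(entry.getKey() + " -> " + entry.getValue());
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, String> encryptorTypes = new LinkedHashMap<>();
        encryptorTypes.put("name_encryptor", "aes");
        encryptorTypes.put("pwd_encryptor", "md5");

        Map<String, String> columnToEncryptor = new LinkedHashMap<>();
        columnToEncryptor.put("encrypt_user.user_name", "name_encryptor");
        columnToEncryptor.put("encrypt_user.pwd", "pwd_encryptor");

        // Both references resolve, so the list of problems is empty.
        System.out.println(findDanglingReferences(encryptorTypes, columnToEncryptor));
    }
}
```

A misspelled encryptor name in the properties file would show up in the returned list.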

Finally, ShardingSphere provides a property switch for tables that store both plaintext and ciphertext. It determines whether a query reads the plaintext column and returns it directly, or reads the ciphertext column and decrypts it before returning:

spring.shardingsphere.props.query.with.cipher.column=true

Perform data encryption #

Now that the configuration is ready, let’s run the test case. First, execute the data insertion operation. As the table below shows, the corresponding columns store the encrypted ciphertext:

Drawing 8.png Encrypted table data result

During this process, ShardingSphere will convert the original SQL statements into target statements for data encryption:

Drawing 9.png Diagram illustrating automatic SQL conversion

Then, let’s execute the query statement and get the console log:

2020-05-30 15:10:59.174  INFO 31808 --- [           main] ShardingSphere-SQL                       : Rule Type: encrypt
2020-05-30 15:10:59.174  INFO 31808 --- [           main] ShardingSphere-SQL                       : SQL: SELECT * FROM encrypt_user;
user_id: 1, user_name: username_1, pwd: 99024280cab824efca53a5d1341b9210
user_id: 2, user_name: username_2, pwd: 36ddda5af915d91549d3ab5bff1bafec
...

As you can see, the rule type here is “encrypt”, and the retrieved user_name is the decrypted plaintext rather than the ciphertext stored in the database. This is the effect of the spring.shardingsphere.props.query.with.cipher.column=true configuration. If this configuration is set to false, the returned result will be the ciphertext.

Conclusion #

Data encryption is an important topic in database management and data access control. In this lesson, we explained the technical solution that ShardingSphere provides for it. There are many ways to implement data encryption; ShardingSphere adopts an automated and transparent approach that integrates sensitive data storage, encryption, and application code seamlessly. We also walked through the encryption transformation of a specific case study, with concrete configuration items and execution steps.

Here’s a question for you to think about: when using the data encryption module of ShardingSphere, in what ways can you configure a data item that needs to be encrypted?

This concludes the content of this lesson. In the next lesson, we will introduce the auxiliary functions related to orchestration and governance in ShardingSphere, focusing on the analysis of the configuration center.