30 Data Desensitization: How to Implement Low-Intrusiveness Data Desensitization Solutions Based on Rewriting Engines #

Today, we will discuss the data desensitization module in ShardingSphere. As introduced in the lesson “10 | Data Desensitization: How to Ensure Secure Access to Sensitive Data?”, ShardingSphere provides a set of automatic data encryption and decryption mechanisms to achieve transparent data desensitization.

Overall Architecture of the Data Desensitization Module #

As in the ordinary programming model, data desensitization starts by obtaining a DataSource, which serves as the entry point for the entire process. This is not an ordinary DataSource, however, but an EncryptDataSource designed specifically for data desensitization. As with the earlier explanations of ShardingDataSource, ShardingConnection, and ShardingStatement, we will walk through the data desensitization module from the top down.

Let’s review with the help of the following diagram:

[Figure: class structure of the data desensitization module]

In the diagram, the classes of the data desensitization module each inherit from an abstract class. We already covered these abstract classes when explaining ShardingDataSource, ShardingConnection, ShardingStatement, and so on. Therefore, for the data desensitization module, we will focus on a few key classes and only briefly review the topics already covered.

Based on the diagram above, let’s start with the EncryptDataSource. The creation of EncryptDataSource relies on the EncryptDataSourceFactory, which is implemented as follows:

public final class EncryptDataSourceFactory {

    public static DataSource createDataSource(final DataSource dataSource, final EncryptRuleConfiguration encryptRuleConfiguration, final Properties props) throws SQLException {
        return new EncryptDataSource(dataSource, new EncryptRule(encryptRuleConfiguration), props);
    }
}

Here, an EncryptDataSource is directly created, depending on the EncryptRule configuration object. Now, let’s clarify what is included in the EncryptRule.

EncryptRule #

EncryptRule is a core object in the data desensitization module, which deserves a separate explanation. In the EncryptRule, the following three core variables are defined:

// Encryption and decryption engines
private final Map<String, ShardingEncryptor> encryptors = new LinkedHashMap<>();
// Desensitized tables
private final Map<String, EncryptTable> tables = new LinkedHashMap<>();
// Desensitization rule configuration
private EncryptRuleConfiguration ruleConfiguration;

We can divide these three variables into two parts. ShardingEncryptor is used for encryption and decryption, while EncryptTable and EncryptRuleConfiguration are more related to the configuration system of data desensitization.

Next, I will explain these two parts separately.

1. ShardingEncryptor #

In the EncryptRule, ShardingEncryptor is the interface that every concrete encryptor implements. Here is the definition of the interface:

public interface ShardingEncryptor extends TypeBasedSPI {
    // Initialization
    void init();
    // Encryption
    String encrypt(Object plaintext);
    // Decryption
    Object decrypt(String ciphertext);
}

The ShardingEncryptor interface contains a pair of methods for encryption and decryption. The interface also inherits from the TypeBasedSPI interface, which means it will be dynamically loaded through the Service Provider Interface (SPI) mechanism.
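To make the idea of type-based lookup concrete, here is a minimal sketch that uses only the JDK. All names in it (TypedEncryptor, ReverseEncryptor, TypedEncryptorRegistry) are hypothetical and are not ShardingSphere classes. In a real SPI setup the implementations would typically be discovered with java.util.ServiceLoader from a configuration file under META-INF/services; to keep the sketch in a single file, the candidates are simply passed in.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a type-based SPI contract.
interface TypedEncryptor {

    String getType();

    String encrypt(Object plaintext);

    Object decrypt(String ciphertext);
}

// Toy implementation that reverses the text; it exists only to make the sketch runnable.
class ReverseEncryptor implements TypedEncryptor {

    @Override
    public String getType() {
        return "REVERSE";
    }

    @Override
    public String encrypt(final Object plaintext) {
        return new StringBuilder(String.valueOf(plaintext)).reverse().toString();
    }

    @Override
    public Object decrypt(final String ciphertext) {
        return new StringBuilder(ciphertext).reverse().toString();
    }
}

// Registry that resolves an implementation by the type name it reports.
class TypedEncryptorRegistry {

    private final Map<String, TypedEncryptor> byType = new LinkedHashMap<>();

    TypedEncryptorRegistry(final List<TypedEncryptor> candidates) {
        for (TypedEncryptor each : candidates) {
            byType.put(each.getType(), each);
        }
    }

    TypedEncryptor find(final String type) {
        TypedEncryptor result = byType.get(type);
        if (null == result) {
            throw new IllegalArgumentException("No encryptor registered for type: " + type);
        }
        return result;
    }
}
```

The point of the design is that callers only name a type such as "REVERSE" in configuration and never reference the implementing class directly.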

ShardingEncryptorServiceLoader handles this process, and in the sharding-core-common project, we can also find the SPI configuration file, as follows:

[Figure: SPI configuration file for ShardingEncryptor]

Here, there are two implementation classes: MD5ShardingEncryptor and AESShardingEncryptor. MD5 is a one-way hash algorithm, which means the plaintext cannot be recovered from the digest. That is why the decrypt method below simply returns its input unchanged. The MD5ShardingEncryptor class is implemented as follows:

public final class MD5ShardingEncryptor implements ShardingEncryptor {

    private Properties properties = new Properties();

    @Override
    public String getType() {
        return "MD5";
    }

    @Override
    public void init() {
    }

    @Override
    public String encrypt(final Object plaintext) {
        return DigestUtils.md5Hex(String.valueOf(plaintext));
    }

    @Override
    public Object decrypt(final String ciphertext) {
        return ciphertext;
    }
}
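For readers who want to try the same idea without commons-codec, the following is a minimal sketch based on the JDK's MessageDigest. The class name Md5Sketch and its methods are hypothetical; only the behavior (hex-encoded MD5 for encrypt, pass-through for decrypt) mirrors the class above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical JDK-only counterpart of an MD5 "encryptor".
class Md5Sketch {

    // Returns the MD5 digest of the value's string form as 32 lowercase hex characters.
    static String encrypt(final Object plaintext) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            byte[] hash = digest.digest(String.valueOf(plaintext).getBytes(StandardCharsets.UTF_8));
            StringBuilder result = new StringBuilder(hash.length * 2);
            for (byte each : hash) {
                // Mask to 0..255 so negative bytes are rendered as two hex digits.
                result.append(String.format("%02x", each & 0xff));
            }
            return result.toString();
        } catch (final NoSuchAlgorithmException ex) {
            throw new IllegalStateException("MD5 is not available", ex);
        }
    }

    // A hash cannot be reversed, so "decryption" can only hand back the stored digest.
    static Object decrypt(final String ciphertext) {
        return ciphertext;
    }
}
```

Because the digest is deterministic, equal plaintexts always map to equal stored values, which is what allows equality comparisons against the hashed column.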

On the other hand, AES is a symmetric encryption algorithm, which means it can be reversed back to plaintext from the ciphertext. The corresponding AESShardingEncryptor is as follows:

public final class AESShardingEncryptor implements ShardingEncryptor {

    private static final String AES_KEY = "aes.key.value";

    private Properties properties = new Properties();

    @Override
    public String getType() {
        return "AES";
    }

    @Override
    public void init() {
    }

    @Override
    @SneakyThrows
    public String encrypt(final Object plaintext) {
        byte[] result = getCipher(Cipher.ENCRYPT_MODE).doFinal(StringUtils.getBytesUtf8(String.valueOf(plaintext)));
        // Encode the AES ciphertext as a Base64 string
        return Base64.encodeBase64String(result);
    }

    @Override
    @SneakyThrows
    public Object decrypt(final String ciphertext) {
        if (null == ciphertext) {
            return null;
        }
        // Decode the Base64 string, then decrypt it with AES
        byte[] result = getCipher(Cipher.DECRYPT_MODE).doFinal(Base64.decodeBase64(String.valueOf(ciphertext)));
        return new String(result, StandardCharsets.UTF_8);
    }

    private Cipher getCipher(final int decryptMode) throws NoSuchPaddingException, NoSuchAlgorithmException, InvalidKeyException {
        Preconditions.checkArgument(properties.containsKey(AES_KEY), "No available secret key for `%s`.", AESShardingEncryptor.class.getName());
        Cipher result = Cipher.getInstance(getType());
        // ...
    }
}

[...]

DefaultSQLRewriteEngine rewriter = new DefaultSQLRewriteEngine();
return rewriter.rewrite(sqlRewriteContext).getSql();

Object cipherValue = getEncryptRule().getEncryptValues(tableName, columnName, Collections.singletonList(originalValue)).iterator().next();

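Note that the getEncryptValues method works on collections: it accepts a list of original values and returns a collection of encrypted values. In this scenario we pass in a single-element list, so we only take the first value of the result.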

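Next, we call the addReplacedParameters method of parameterBuilder to replace the parameter with its ciphertext value, and we add the assisted query column and the plaintext column to parameterBuilder as well.

The analysis above shows how one concrete ParameterRewriter is implemented. This mechanism is a typical application of the decorator pattern: it enhances functionality on top of the existing methods.

It is worth mentioning that EncryptParameterRewriterBuilder uses the Builder pattern to construct the collection of EncryptParameterRewriter objects, which are then applied step by step in the order in which SQLRewriteContext calls them. Thanks to this mechanism, the original classes only need to provide the basic functionality, and the encryption and decryption of specific parameters is added afterwards in the form of decorators.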
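The following is a minimal sketch of that idea: a list of rewriters is applied in order to a parameter builder, and one of them replaces a plaintext parameter with its ciphertext. Every name here (ParameterBuilderSketch, ParameterRewriterSketch, EncryptParameterRewriterSketch, RewriteChainSketch) is hypothetical and heavily simplified; it is not the ShardingSphere API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical holder for the parameters of one SQL statement.
class ParameterBuilderSketch {

    private final List<Object> parameters;

    ParameterBuilderSketch(final List<Object> original) {
        // Copy the list so the caller's parameters are never modified.
        parameters = new ArrayList<>(original);
    }

    void replace(final int index, final Object value) {
        parameters.set(index, value);
    }

    List<Object> getParameters() {
        return parameters;
    }
}

// Hypothetical contract: each rewriter adds one enhancement to the builder.
interface ParameterRewriterSketch {

    void rewrite(ParameterBuilderSketch builder);
}

// Replaces the parameter at a fixed index with its encrypted form.
class EncryptParameterRewriterSketch implements ParameterRewriterSketch {

    private final int index;

    private final Function<Object, String> encryptor;

    EncryptParameterRewriterSketch(final int index, final Function<Object, String> encryptor) {
        this.index = index;
        this.encryptor = encryptor;
    }

    @Override
    public void rewrite(final ParameterBuilderSketch builder) {
        Object original = builder.getParameters().get(index);
        builder.replace(index, encryptor.apply(original));
    }
}

// Applies every rewriter in registration order and returns the rewritten parameters.
class RewriteChainSketch {

    static List<Object> rewrite(final List<Object> parameters, final List<ParameterRewriterSketch> rewriters) {
        ParameterBuilderSketch builder = new ParameterBuilderSketch(parameters);
        for (ParameterRewriterSketch each : rewriters) {
            each.rewrite(builder);
        }
        return builder.getParameters();
    }
}
```

New behavior is added by registering another rewriter rather than by changing the chain itself, which is what keeps the approach low in intrusiveness.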

public String getRewriteSQL(
        final String dataSource,
        final ShardingRule shardingRule, final ConfigurationProperties props, final List<Object> parameters, final boolean showSQL) throws SQLException {
    SQLRewriteContext sqlRewriteContext = new SQLRewriteEngine(shardingRule, dataSource, getShardingTableMetaData(dataSource), props, parameters, parameters).rewrite();
    return sqlRewriteContext.generateSQL(showSQL);
}

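The getRewriteSQL method creates a SQLRewriteEngine object and calls its rewrite() method to rewrite the SQL.

Inside the rewrite method of SQLRewriteEngine, when SQLBuilder.buildToSQLNode is executed, a different SQLBuilder object is created depending on the statement type. SQLBuilder is an abstract class with one implementation per statement type, such as SelectSQLBuilder, InsertSQLBuilder, and UpdateSQLBuilder.

Taking SelectSQLBuilder as an example, let's look at the implementation of its buildToSQLNode method: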

@Override
public SQLBuilderResult buildToSQLNode(final List<Object> parameters, final boolean isEncryptParameter) {
    queryResult = new SQLBuilderResult();
    queryResult.removeOrderBy();
    if (sqlStatement.getOrderBy().isGenerated() && sqlStatement.getOrderByItems().isEmpty()) {
        appendGeneratedOrderBy(queryResult, sqlStatementGroup.getSqlContexts(), parameters, isEncryptParameter);
    }
    queryResult.appendLiterals(sqlStatement.getSql());
    if (!sqlStatement.getOrderByItems().isEmpty()) {
        appendOrderBy(queryResult, parameters, isEncryptParameter);
    }
    appendQueryRouters(parameters, isEncryptParameter);
    appendUnion();
    return queryResult;
}

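The buildToSQLNode method first calls appendGeneratedOrderBy to produce the ORDER BY clause when that clause was generated rather than written by the user. It then calls appendLiterals to add the original SQL to the result, appends the ORDER BY items if there are any, and finally appends the query routing information and the UNION clause.

The SQLRewriteContext class relies on a ToStringSQLVisitor class, which converts a SQLNode into a SQL string. The generateSQL method calls the visit method of ToStringSQLVisitor to perform this conversion.

Within ToStringSQLVisitor, the visit method that matches the statement type is called, for example visitSelect, visitInsert, or visitUpdate. These visit methods traverse the SQLNode and convert it into a SQL string.

Next, let's look at the visit method of ToStringSQLVisitor: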

@Override
public String visit(final SQLNode sqlNode) {
    String className = sqlNode.getClass().getName();
    int visitorIndex = className.lastIndexOf(".") + 1;
    String visitorName = className.substring(visitorIndex);
    String visitMethodName = String.format("visit%s", visitorName);
    try {
        Method method = getClass().getMethod(visitMethodName, sqlNode.getClass());
        return (String) method.invoke(this, sqlNode);
    } catch (final ReflectiveOperationException ex) {
        throw new UnsupportedOperationException(String.format("Cannot support visitor for class [%s]", sqlNode.getClass()), ex);
    }
}

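The visit method of ToStringSQLVisitor is implemented with reflection. It first obtains the class name of the SQLNode, then builds the name of the matching visit method from that class name, and finally invokes that method reflectively, passing in the SQLNode as the argument.

In this way, by traversing the SQLNode and calling the appropriate visit methods, the SQLNode is eventually converted into a SQL string.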
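The reflective dispatch described above can be sketched in isolation as follows. The class ReflectiveVisitorSketch and its node types are hypothetical. Two details are assumptions of this sketch rather than something the lesson states: the method name is derived from the node's simple class name, so the visit methods are named after the full class name (visitSelectNode, not visitSelect), and getDeclaredMethod is used because getMethod only finds public methods.

```java
import java.lang.reflect.Method;

// Hypothetical visitor that picks a visit method by the node's simple class name.
class ReflectiveVisitorSketch {

    static class SelectNode { }

    static class InsertNode { }

    // No visit method exists for this node type, so visiting it is unsupported.
    static class DeleteNode { }

    String visit(final Object node) {
        // "SelectNode" -> "visitSelectNode"
        String methodName = "visit" + node.getClass().getSimpleName();
        try {
            // getDeclaredMethod also finds private methods declared on this class.
            Method method = ReflectiveVisitorSketch.class.getDeclaredMethod(methodName, node.getClass());
            return (String) method.invoke(this, node);
        } catch (final ReflectiveOperationException ex) {
            throw new UnsupportedOperationException(
                    String.format("Cannot support visitor for class [%s]", node.getClass().getName()), ex);
        }
    }

    private String visitSelectNode(final SelectNode node) {
        return "SELECT ...";
    }

    private String visitInsertNode(final InsertNode node) {
        return "INSERT ...";
    }
}
```

The trade-off of this style is that a missing or misnamed visit method is only detected at runtime, when the reflective lookup fails.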

In the ToStringSQLVisitor class, each SQLNode type has its own visit method, for example:

private String visitSelect(final SelectStatement selectStatement) {
    append(selectStatement.getWithClause());
    if (null != selectStatement.getTable()) {
        appendVisit(selectStatement.getTable());
    }
    append(selectStatement.getProjections());
    append(selectStatement.getFrom());
    append(selectStatement.getWhere());
    append(selectStatement.getGroupBy());
    append(selectStatement.getWindow());
    append(selectStatement.getOrderBy());
    append(selectStatement.getLimit());
    return getSQLBuilder().toString();
}

private String visitInsert(final InsertStatement insertStatement) {
    append(insertStatement.getWithClause());
    append(insertStatement.getTable());
    append(insertStatement.getColumns());
    append(insertStatement.getValues());
    append(insertStatement.getSetAssignment());
    append(insertStatement.getQuery());
    return getSQLBuilder().toString();
}

private String visitUpdate(final UpdateStatement updateStatement) {
    append(updateStatement.getWithClause());
    append(updateStatement.getTable());
    for (AssignmentSegment each : updateStatement.getSetAssignment().getAssignments()) {
        appendVisit(each);
    }
    append(updateStatement.getWhere());
    return getSQLBuilder().toString();
}

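These are some of the visit methods in the ToStringSQLVisitor class. Each SQLNode type has its own processing logic. For a SelectStatement node, for example, the method first calls append to add the WithClause to the result, then appends the SQL text of each remaining part in turn, and finally calls getSQLBuilder().toString() to produce the SQL string that is returned.

Through this process, the SQLNode is finally converted into a SQL string and returned to the caller.

Let’s go back to the executeQuery method of EncryptStatement. We have the following statement: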

ResultSet resultSet = statement.executeQuery(getRewriteSQL(sql));

After executing the executeQuery method, we obtain a ResultSet. However, we don’t directly return this resultSet object. Instead, we need to encapsulate it and create an EncryptResultSet object as shown below:

this.resultSet = new EncryptResultSet(connection.getRuntimeContext(), sqlStatementContext, this, resultSet);

EncryptResultSet extends the AbstractUnsupportedOperationResultSet class, which extends the AbstractUnsupportedUpdateOperationResultSet class, which in turn extends the WrapperAdapter class and implements the ResultSet interface. So EncryptResultSet is essentially an adapter as well, just like EncryptDataSource and EncryptConnection.

EncryptResultSet contains a large number of get methods that need no detailed introduction. The key point is the following call in its constructor:

mergedResult = createMergedResult(queryWithCipherColumn, resultSet);

As we know, in ShardingSphere, the execution engine is followed by the merge engine. In EncryptResultSet, we use the merge engine to generate a MergedResult.

EncryptResultSet first checks whether the incoming SQLStatement is a DALStatement. If it is, DALEncryptMergeEngine is used to merge the results; otherwise, DQLEncryptMergeEngine is used. Let’s focus on DQLEncryptMergeEngine.

public final class DQLEncryptMergeEngine implements MergeEngine {
    private final EncryptorMetaData metaData;
    private final MergedResult mergedResult;
    private final boolean queryWithCipherColumn;

    @Override
    public MergedResult merge() {
        return new EncryptMergedResult(metaData, mergedResult, queryWithCipherColumn);
    }
}

DQLEncryptMergeEngine is very simple. Its merge method only constructs an EncryptMergedResult object and returns it. The core method getValue in EncryptMergedResult is shown below:

@Override
public Object getValue(final int columnIndex, final Class<?> type) throws SQLException {
    Object value = mergedResult.getValue(columnIndex, type);
    if (null == value || !queryWithCipherColumn) {
        return value;
    }
    Optional<ShardingEncryptor> encryptor = metaData.findEncryptor(columnIndex);
    return encryptor.isPresent() ? encryptor.get().decrypt(value.toString()) : value;
}

From the above process, we can see that the merging implementation in the data desensitization module is actually calling the decrypt method of ShardingEncryptor to decrypt the ciphertext of the encrypted column into plaintext.
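The same decrypt-on-read idea can be sketched without any ShardingSphere types. In the sketch below, RowResultSketch, ListRowResultSketch, and DecryptingRowResultSketch are hypothetical names, and a plain Function stands in for the encryptor's decrypt method. Column indexes are 1-based, following the JDBC convention.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical minimal view of one result row.
interface RowResultSketch {

    Object getValue(int columnIndex);
}

// Row backed by an in-memory list.
class ListRowResultSketch implements RowResultSketch {

    private final List<Object> values;

    ListRowResultSketch(final List<Object> values) {
        this.values = values;
    }

    @Override
    public Object getValue(final int columnIndex) {
        return values.get(columnIndex - 1);
    }
}

// Decorator that decrypts values of encrypted columns as they are read.
class DecryptingRowResultSketch implements RowResultSketch {

    private final RowResultSketch delegate;

    private final Map<Integer, Function<String, Object>> decryptors;

    private final boolean queryWithCipherColumn;

    DecryptingRowResultSketch(final RowResultSketch delegate,
                              final Map<Integer, Function<String, Object>> decryptors,
                              final boolean queryWithCipherColumn) {
        this.delegate = delegate;
        this.decryptors = decryptors;
        this.queryWithCipherColumn = queryWithCipherColumn;
    }

    @Override
    public Object getValue(final int columnIndex) {
        Object value = delegate.getValue(columnIndex);
        // Nulls are returned as they are, and so is everything when cipher-column querying is off.
        if (null == value || !queryWithCipherColumn) {
            return value;
        }
        Optional<Function<String, Object>> decryptor = Optional.ofNullable(decryptors.get(columnIndex));
        return decryptor.isPresent() ? decryptor.get().apply(value.toString()) : value;
    }
}
```

Columns without a registered decryptor are returned untouched, so the wrapper is safe to apply to every row regardless of which columns are encrypted.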

With that, we have finished the introduction to the overall flow of the executeQuery method in EncryptStatement. After understanding the implementation process of this method, it becomes easier to comprehend other methods in EncryptStatement and EncryptPreparedStatement.

From Source Code Analysis to Daily Development #

For today's topic, what carries over directly to daily development is the way ShardingEncryptor is abstracted and how encryption and decryption are implemented behind it. ShardingSphere uses the DigestUtils utility class to compute MD5 digests. In the AES implementation, the encryption itself is done by the Cipher class, while the Base64 utility class encodes the resulting ciphertext as a string and decodes it again before decryption.

Both utility classes can be used as they are in our own systems, giving us mature building blocks for encryption and decryption.
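As a starting point for such reuse, here is a minimal AES round-trip sketch that needs only the JDK (javax.crypto and java.util.Base64 instead of commons-codec). The class name AesSketch is hypothetical, and the key derivation (SHA-256 of the secret, truncated to 16 bytes) is an assumption made for this sketch, not necessarily what ShardingSphere does.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical JDK-only AES helper: encrypt to Base64 text, decrypt back to a string.
class AesSketch {

    private final SecretKeySpec key;

    AesSketch(final String secret) {
        try {
            // Assumption of this sketch: derive a 128-bit key from the secret via SHA-256.
            byte[] hash = MessageDigest.getInstance("SHA-256").digest(secret.getBytes(StandardCharsets.UTF_8));
            key = new SecretKeySpec(Arrays.copyOf(hash, 16), "AES");
        } catch (final GeneralSecurityException ex) {
            throw new IllegalStateException("Cannot derive AES key", ex);
        }
    }

    String encrypt(final Object plaintext) {
        byte[] encrypted = run(Cipher.ENCRYPT_MODE, String.valueOf(plaintext).getBytes(StandardCharsets.UTF_8));
        // Base64 only makes the binary ciphertext storable as text; it adds no secrecy.
        return Base64.getEncoder().encodeToString(encrypted);
    }

    String decrypt(final String ciphertext) {
        if (null == ciphertext) {
            return null;
        }
        byte[] decrypted = run(Cipher.DECRYPT_MODE, Base64.getDecoder().decode(ciphertext));
        return new String(decrypted, StandardCharsets.UTF_8);
    }

    private byte[] run(final int mode, final byte[] input) {
        try {
            // With the standard JDK provider, plain "AES" resolves to AES/ECB/PKCS5Padding.
            Cipher cipher = Cipher.getInstance("AES");
            cipher.init(mode, key);
            return cipher.doFinal(input);
        } catch (final GeneralSecurityException ex) {
            throw new IllegalStateException("AES operation failed", ex);
        }
    }
}
```

Note that ECB mode is deterministic: equal plaintexts produce equal ciphertexts. That makes equality queries on the encrypted column possible, but it also reveals which rows share a value, so it is worth weighing before adopting this in your own system.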

Summary and Preview #

Today, we discussed the underlying principles of the data desensitization mechanism in ShardingSphere. We found that the data desensitization module relies on the rewriting engine and the merge engine of the sharding engine, with the rewriting engine playing the crucial role. Through column supplementation and transparent SQL conversion, it automatically encrypts plaintext on the way in and decrypts ciphertext on the way out.

Here’s a question for you to think about: How does the data desensitization module collaborate with the rewriting engine and the merge engine in ShardingSphere? Feel free to discuss it in the comment section, and I will provide feedback on each answer.

Having covered the data desensitization mechanism today, in the next lesson we will discuss another useful function: orchestration and governance. We will explore the underlying principles of parsing configuration information and managing it dynamically through the configuration center.