feat(mask): mask funcs #33479

stephenkgu · 2025-11-05T09:15:44Z

Description

Please briefly describe the code changes in this pull request.

Checklist

Please check the items in the checklist if applicable.

Is the user manual updated?
Are the test cases passed and automated?
Is there no significant decrease in test coverage?

… into fix/TS-7235-sha2

gemini-code-assist · 2025-11-05T09:16:18Z

Summary of Changes

Hello @stephenkgu, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the system's data manipulation capabilities by integrating new cryptographic hash functions (SHA-1, SHA-2) and a data masking feature. It also refines existing Base64 encoding/decoding, making these operations more accessible and robust, particularly within SQL insert statements. These changes enhance data security and flexibility for handling various data types.

Highlights

New Cryptographic Hash Functions: Introduced support for SHA-1 and SHA-2 (SHA-224, SHA-256, SHA-384, SHA-512) hashing algorithms, adding FUNCTION_TYPE_SHA1 and FUNCTION_TYPE_SHA2.
Data Masking Function: Added a new mask_full function (FUNCTION_TYPE_MASK_FULL) to facilitate data masking operations.
Base64 Enhancements: Refactored and improved Base64 encoding and decoding functionalities, including a new from_base64 function for decoding.
Integration with SQL Inserts: Modified the SQL parser to allow direct use of from_base64, to_base64, md5, sha, sha1, and sha2 functions within INSERT statements for binary columns.
Internal Utility Updates: New tsha.h and tsha.c files were added for SHA algorithm implementations, and tutil.h was updated to expose new SHA hash creation utilities.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces new hashing (SHA1, SHA2) and masking functions, along with a refactoring of Base64 handling. The implementation is largely functional, but I have identified several areas for improvement concerning code quality, maintainability, and potential bugs. Notably, there's a critical issue with a main function included in a library file which will cause linker errors. I've also found a high-severity issue related to function type aliasing that could lead to incorrect behavior, and several medium-severity issues regarding code duplication, inefficiency, and inconsistent coding practices. Addressing these points will significantly improve the robustness and clarity of the codebase.

gemini-code-assist · 2025-11-05T09:19:23Z

source/util/src/tsha.c

+#ifdef TEST_VECTORS
+
+/* FIPS 180-2 Validation tests */
+
+#include <stdio.h>
+#include <stdlib.h>
+
+void test(const char *vector, uint8 *digest, uint32 digest_size) {
+  char output[2 * SHA512_DIGEST_SIZE + 1];
+  int  i;
+
+  output[2 * digest_size] = '\0';
+
+  for (i = 0; i < (int)digest_size; i++) {
+    sprintf(output + 2 * i, "%02x", digest[i]);
+  }
+
+  printf("H: %s\n", output);
+  if (strcmp(vector, output)) {
+    fprintf(stderr, "Test failed.\n");
+    exit(EXIT_FAILURE);
+  }
+}
+
+static void test_sha224_message4(uint8 *digest) {
+  /* Message of 929271 bytes */
+
+  sha224_ctx ctx;
+  uint8      message[1000];
+  int        i;
+
+  memset(message, 'a', sizeof(message));
+
+  sha224_init(&ctx);
+  for (i = 0; i < 929; i++) {
+    sha224_update(&ctx, message, sizeof(message));
+  }
+  sha224_update(&ctx, message, 271);
+
+  sha224_final(&ctx, digest);
+}
+
+static void test_sha256_message4(uint8 *digest) {
+  /* Message of 929271 bytes */
+
+  sha256_ctx ctx;
+  uint8      message[1000];
+  int        i;
+
+  memset(message, 'a', sizeof(message));
+
+  sha256_init(&ctx);
+  for (i = 0; i < 929; i++) {
+    sha256_update(&ctx, message, sizeof(message));
+  }
+  sha256_update(&ctx, message, 271);
+
+  sha256_final(&ctx, digest);
+}
+
+static void test_sha384_message4(uint8 *digest) {
+  /* Message of 929271 bytes */
+
+  sha384_ctx ctx;
+  uint8      message[1000];
+  int        i;
+
+  memset(message, 'a', sizeof(message));
+
+  sha384_init(&ctx);
+  for (i = 0; i < 929; i++) {
+    sha384_update(&ctx, message, sizeof(message));
+  }
+  sha384_update(&ctx, message, 271);
+
+  sha384_final(&ctx, digest);
+}
+
+static void test_sha512_message4(uint8 *digest) {
+  /* Message of 929271 bytes */
+
+  sha512_ctx ctx;
+  uint8      message[1000];
+  int        i;
+
+  memset(message, 'a', sizeof(message));
+
+  sha512_init(&ctx);
+  for (i = 0; i < 929; i++) {
+    sha512_update(&ctx, message, sizeof(message));
+  }
+  sha512_update(&ctx, message, 271);
+
+  sha512_final(&ctx, digest);
+}
+
+#ifdef TEST_VECTORS_LONG
+
+/* Validation tests with a message of 10 GB */
+
+static void test_sha224_long_message(uint8 *digest) {
+  sha224_ctx ctx;
+  uint8      message[1000];
+  int        i;
+
+  memset(message, 'a', sizeof(message));
+
+  sha224_init(&ctx);
+  for (i = 0; i < 10000000; i++) {
+    sha224_update(&ctx, message, sizeof(message));
+  }
+  sha224_final(&ctx, digest);
+}
+
+static void test_sha256_long_message(uint8 *digest) {
+  sha256_ctx ctx;
+  uint8      message[1000];
+  int        i;
+
+  memset(message, 'a', sizeof(message));
+
+  sha256_init(&ctx);
+  for (i = 0; i < 10000000; i++) {
+    sha256_update(&ctx, message, sizeof(message));
+  }
+  sha256_final(&ctx, digest);
+}
+
+static void test_sha384_long_message(uint8 *digest) {
+  sha384_ctx ctx;
+  uint8      message[1000];
+  int        i;
+
+  memset(message, 'a', sizeof(message));
+
+  sha384_init(&ctx);
+  for (i = 0; i < 10000000; i++) {
+    sha384_update(&ctx, message, sizeof(message));
+  }
+  sha384_final(&ctx, digest);
+}
+
+static void test_sha512_long_message(uint8 *digest) {
+  sha512_ctx ctx;
+  uint8      message[1000];
+  int        i;
+
+  memset(message, 'a', sizeof(message));
+
+  sha512_init(&ctx);
+  for (i = 0; i < 10000000; i++) {
+    sha512_update(&ctx, message, sizeof(message));
+  }
+  sha512_final(&ctx, digest);
+}
+
+#endif /* TEST_VECTORS_LONG */
+
+int main(void) {
+  static const char *vectors[4][5] = {/* SHA-224 */
+                                      {
+                                          "23097d223405d8228642a477bda255b32aadbce4bda0b3f7e36c9da7",
+                                          "75388b16512776cc5dba5da1fd890150b0c6455cb4f58b1952522525",
+                                          "20794655980c91d8bbb4c1ea97618a4bf03f42581948b2ee4ee7ad67",
+                                          "c84cf4761583afa849ffd562c52a2e2a5104f1a4071dab6c53560d4f",
+                                          "b7bdc6c1f4f789f1456e68a005779a6c1f6199211008bee2801baf0d",
+                                      },
+                                      /* SHA-256 */
+                                      {
+                                          "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad",
+                                          "248d6a61d20638b8e5c026930c3e6039a33ce45964ff2167f6ecedd419db06c1",
+                                          "cdc76e5c9914fb9281a1c7e284d73e67f1809a48a497200e046d39ccc7112cd0",
+                                          "8c14f43ad81026351e9b60025b5420e6072ff617f5c72145b179599211514947",
+                                          "53748286337dbe36f5df22e7ef1af3ad71530398cf569adc7eb5fefa7af7003c",
+                                      },
+                                      /* SHA-384 */
+                                      {
+                                          "cb00753f45a35e8bb5a03d699ac65007272c32ab0eded1631a8b605a43ff5bed"
+                                          "8086072ba1e7cc2358baeca134c825a7",
+                                          "09330c33f71147e83d192fc782cd1b4753111b173b3b05d22fa08086e3b0f712"
+                                          "fcc7c71a557e2db966c3e9fa91746039",
+                                          "9d0e1809716474cb086e834e310a4a1ced149e9c00f248527972cec5704c2a5b"
+                                          "07b8b3dc38ecc4ebae97ddd87f3d8985",
+                                          "3de4a44dba8627278c376148b6d45be0a3a410337330ef3e1d9ca34c4593ecfc"
+                                          "8ce7a8415aefaca6b39d1112078cc3e0",
+                                          "073de8e641532032b2922c4af165baa88dfe5fdafb09575657406894b4b94996"
+                                          "8975eef50c1ef5be59ca0ecdaa996496",
+                                      },
+                                      /* SHA-512 */
+                                      {"ddaf35a193617abacc417349ae20413112e6fa4e89a97ea20a9eeee64b55d39a"
+                                       "2192992a274fc1a836ba3c23a3feebbd454d4423643ce80e2a9ac94fa54ca49f",
+                                       "8e959b75dae313da8cf4f72814fc143f8f7779c6eb9f7fa17299aeadb6889018"
+                                       "501d289e4900f7e4331b99dec4b5433ac7d329eeb6dd26545e96e55b874be909",
+                                       "e718483d0ce769644e2e42c7bc15b4638e1f98b13b2044285632a803afa973eb"
+                                       "de0ff244877ea60a4cb0432ce577c31beb009c5c2c49aa2e4eadb217ad8cc09b",
+                                       "a7e464bfd1f27201d7d0575c1a302cecef0004828e923e4255c8de6bae958a01"
+                                       "9294e3f851bf9a03013963cd1268687c549916438e465433957d4480bcaa8572",
+                                       "922637075a3ee5d87a7ce3ae7e04083d6daea7b541f264512290157ce3f81f9b"
+                                       "afcd3f9dc2d4fe0a6248a071709b96d0128d96c48220b2ab919b99187cb16fbf"}};
+
+  static const char message1[] = "abc";
+  static const char message2a[] =
+      "abcdbcdecdefdefgefghfghighijhi"
+      "jkijkljklmklmnlmnomnopnopq";
+  static const char message2b[] =
+      "abcdefghbcdefghicdefghijdefghijkefghij"
+      "klfghijklmghijklmnhijklmnoijklmnopjklm"
+      "nopqklmnopqrlmnopqrsmnopqrstnopqrstu";
+  uint8 *message3;
+  uint32 message3_len = 1000000;
+  uint8  digest[SHA512_DIGEST_SIZE];
+
+  message3 = malloc(message3_len);
+  if (message3 == NULL) {
+    fprintf(stderr, "Can't allocate memory\n");
+    return -1;
+  }
+  memset(message3, 'a', message3_len);
+
+  printf("SHA-2 FIPS 180-2 Validation tests\n\n");
+  printf("SHA-224 Test vectors\n");
+
+  sha224((const uint8 *)message1, strlen(message1), digest);
+  test(vectors[0][0], digest, SHA224_DIGEST_SIZE);
+  sha224((const uint8 *)message2a, strlen(message2a), digest);
+  test(vectors[0][1], digest, SHA224_DIGEST_SIZE);
+  sha224(message3, message3_len, digest);
+  test(vectors[0][2], digest, SHA224_DIGEST_SIZE);
+  test_sha224_message4(digest);
+  test(vectors[0][3], digest, SHA224_DIGEST_SIZE);
+#ifdef TEST_VECTORS_LONG
+  test_sha224_long_message(digest);
+  test(vectors[0][4], digest, SHA224_DIGEST_SIZE);
+#endif
+  printf("\n");
+
+  printf("SHA-256 Test vectors\n");
+
+  sha256((const uint8 *)message1, strlen(message1), digest);
+  test(vectors[1][0], digest, SHA256_DIGEST_SIZE);
+  sha256((const uint8 *)message2a, strlen(message2a), digest);
+  test(vectors[1][1], digest, SHA256_DIGEST_SIZE);
+  sha256(message3, message3_len, digest);
+  test(vectors[1][2], digest, SHA256_DIGEST_SIZE);
+  test_sha256_message4(digest);
+  test(vectors[1][3], digest, SHA256_DIGEST_SIZE);
+#ifdef TEST_VECTORS_LONG
+  test_sha256_long_message(digest);
+  test(vectors[1][4], digest, SHA256_DIGEST_SIZE);
+#endif
+  printf("\n");
+
+  printf("SHA-384 Test vectors\n");
+
+  sha384((const uint8 *)message1, strlen(message1), digest);
+  test(vectors[2][0], digest, SHA384_DIGEST_SIZE);
+  sha384((const uint8 *)message2b, strlen(message2b), digest);
+  test(vectors[2][1], digest, SHA384_DIGEST_SIZE);
+  sha384(message3, message3_len, digest);
+  test(vectors[2][2], digest, SHA384_DIGEST_SIZE);
+  test_sha384_message4(digest);
+  test(vectors[2][3], digest, SHA384_DIGEST_SIZE);
+#ifdef TEST_VECTORS_LONG
+  test_sha384_long_message(digest);
+  test(vectors[2][4], digest, SHA384_DIGEST_SIZE);
+#endif
+  printf("\n");
+
+  printf("SHA-512 Test vectors\n");
+
+  sha512((const uint8 *)message1, strlen(message1), digest);
+  test(vectors[3][0], digest, SHA512_DIGEST_SIZE);
+  sha512((const uint8 *)message2b, strlen(message2b), digest);
+  test(vectors[3][1], digest, SHA512_DIGEST_SIZE);
+  sha512(message3, message3_len, digest);
+  test(vectors[3][2], digest, SHA512_DIGEST_SIZE);
+  test_sha512_message4(digest);
+  test(vectors[3][3], digest, SHA512_DIGEST_SIZE);
+#ifdef TEST_VECTORS_LONG
+  test_sha512_long_message(digest);
+  test(vectors[3][4], digest, SHA512_DIGEST_SIZE);
+#endif
+  printf("\n");
+
+  printf("All tests passed.\n");
+
+  return 0;
+}
+
+#endif /* TEST_VECTORS */


This new library file tsha.c contains a main function and test vectors within an #ifdef TEST_VECTORS block. Including a main function in a library source file is a critical issue as it will lead to linker errors if another file in the project defines a main function. Test code should be kept separate from library code, in its own test file and target. Please move this entire test block to a dedicated test file.

gemini-code-assist · 2025-11-05T09:19:23Z

source/libs/function/src/builtins.c

  },
+  {
+    .name = "from_base64",
+    .type = FUNCTION_TYPE_BASE64,


The from_base64 function is assigned FUNCTION_TYPE_BASE64, which is the same type used for the to_base64 function. Since these are inverse operations, they should have distinct function types to avoid confusion and potential bugs, especially if the function dispatch logic relies on the type. I recommend defining a new type, such as FUNCTION_TYPE_FROM_BASE64, in functionMgt.h and using it here.

.type = FUNCTION_TYPE_FROM_BASE64,

gemini-code-assist · 2025-11-05T09:19:23Z

include/util/tsha.h

+typedef unsigned char      uint8;
+typedef unsigned int       uint32;
+typedef unsigned long long uint64;


The custom integer type definitions uint8, uint32, and uint64 are inconsistent with the rest of the codebase, which appears to use standard stdint.h types like uint32_t (as seen in T_SHA1_CTX). Using unsigned int for uint32 is not portable as it's not guaranteed to be 32 bits on all platforms. Since os.h is already included and it provides <stdint.h>, it's better to use the standard fixed-width integer types (uint8_t, uint32_t, uint64_t) for portability and consistency.

Suggested change

typedef unsigned char uint8;

typedef unsigned int uint32;

typedef unsigned long long uint64;

typedef uint8_t uint8;

typedef uint32_t uint32;

typedef uint64_t uint64;

gemini-code-assist · 2025-11-05T09:19:24Z

include/util/tutil.h

+static FORCE_INLINE int32_t taosCreateSHA2Hash(char *pBuf, int32_t len, uint32_t digestSize) {
+  uint8_t result[2 * SHA512_DIGEST_SIZE + 1] = {0};
+
+  switch (digestSize / 8) {
+    case SHA224_DIGEST_SIZE:
+      sha224((const uint8_t *)pBuf, len, result);
+      return sprintf(pBuf,
+                     "%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%"
+                     "02x%02x%02x%02x",
+                     result[0], result[1], result[2], result[3], result[4], result[5], result[6], result[7], result[8],
+                     result[9], result[10], result[11], result[12], result[13], result[14], result[15], result[16],
+                     result[17], result[18], result[19], result[20], result[21], result[22], result[23], result[24],
+                     result[25], result[26], result[27]);
+    case SHA256_DIGEST_SIZE:
+      sha256((const uint8_t *)pBuf, len, result);
+      return sprintf(pBuf,
+                     "%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%"
+                     "02x%02x%02x%02x%02x%02x%02x%02x",
+                     result[0], result[1], result[2], result[3], result[4], result[5], result[6], result[7], result[8],
+                     result[9], result[10], result[11], result[12], result[13], result[14], result[15], result[16],
+                     result[17], result[18], result[19], result[20], result[21], result[22], result[23], result[24],
+                     result[25], result[26], result[27], result[28], result[29], result[30], result[31]);
+    case SHA384_DIGEST_SIZE:
+      sha384((const uint8_t *)pBuf, len, result);
+      return sprintf(pBuf,
+                     "%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%"
+                     "02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x",
+                     result[0], result[1], result[2], result[3], result[4], result[5], result[6], result[7], result[8],
+                     result[9], result[10], result[11], result[12], result[13], result[14], result[15], result[16],
+                     result[17], result[18], result[19], result[20], result[21], result[22], result[23], result[24],
+                     result[25], result[26], result[27], result[28], result[29], result[30], result[31], result[32],
+                     result[33], result[34], result[35], result[36], result[37], result[38], result[39], result[40],
+                     result[41], result[42], result[43], result[44], result[45], result[46], result[47]);
+    case SHA512_DIGEST_SIZE:
+      sha512((const uint8_t *)pBuf, len, result);
+      return sprintf(pBuf,
+                     "%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%"
+                     "02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%"
+                     "02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x",
+                     result[0], result[1], result[2], result[3], result[4], result[5], result[6], result[7], result[8],
+                     result[9], result[10], result[11], result[12], result[13], result[14], result[15], result[16],
+                     result[17], result[18], result[19], result[20], result[21], result[22], result[23], result[24],
+                     result[25], result[26], result[27], result[28], result[29], result[30], result[31], result[32],
+                     result[33], result[34], result[35], result[36], result[37], result[38], result[39], result[40],
+                     result[41], result[42], result[43], result[44], result[45], result[46], result[47], result[48],
+                     result[49], result[50], result[51], result[52], result[53], result[54], result[55], result[56],
+                     result[57], result[58], result[59], result[60], result[61], result[62], result[63]);
+    default:
+      break;
+  }
+
+  return 0;
+}


The functions taosCreateMD5Hash, taosCreateSHA1Hash, and taosCreateSHA2Hash contain very long and hard-to-read sprintf calls to convert byte arrays to hexadecimal strings. This approach is error-prone and involves significant code duplication. I recommend creating a helper function to handle the hex conversion, which would make this code much cleaner, more maintainable, and less susceptible to errors. A similar refactoring should be applied to taosCreateMD5Hash and taosCreateSHA1Hash as well.

gemini-code-assist · 2025-11-05T09:19:24Z

source/libs/parser/src/parInsertSql.c

+static int32_t parseBinary(SInsertParseContext* pCxt, const char** ppSql, SToken* pToken, SColVal* pVal,
+                           SSchema* pSchema) {
+  int32_t   bytes = pSchema->bytes;
+  uint8_t** pData = &pVal->value.pData;
+  uint32_t* nData = &pVal->value.nData;
+
+  if (TK_NK_ID == pToken->type) {
+    char*   input = NULL;
+    int32_t inputBytes = 0;
+    char    tmpTokenBuf[TSDB_MAX_BYTES_PER_ROW];
+
+    if (0 == strncasecmp(pToken->z, "from_base64(", 12)) {
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NK_LP != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, "( expected", pToken->z);
+      }
+
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NULL == pToken->type) {
+        NEXT_VALID_TOKEN(*ppSql, *pToken);
+        if (TK_NK_RP == pToken->type) {
+          pVal->flag = CV_FLAG_NULL;
+
+          return TSDB_CODE_SUCCESS;
+        }
+      } else if (TK_NK_STRING != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, "string expected", pToken->z);
+      }
+
+      inputBytes = trimString(pToken->z, pToken->n, tmpTokenBuf, TSDB_MAX_BYTES_PER_ROW);
+      input = tmpTokenBuf;
+
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NK_RP != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, ") expected", pToken->z);
+      }
+
+      int32_t outputBytes = tbase64_decode_len(inputBytes);
+      if (outputBytes + VARSTR_HEADER_SIZE > bytes) {
+        return generateSyntaxErrMsg(&pCxt->msg, TSDB_CODE_PAR_VALUE_TOO_LONG, pSchema->name);
+      }
+
+      *pData = taosMemoryMalloc(outputBytes);
+      if (NULL == *pData) {
+        return terrno;
+      }
+
+      if (TSDB_CODE_SUCCESS != tbase64_decode(*pData, (const uint8_t*)input, inputBytes, outputBytes)) {
+        taosMemoryFree(*pData);
+        return generateSyntaxErrMsg(&pCxt->msg, TSDB_CODE_INVALID_DATA_FMT);
+      }
+      *nData = outputBytes;
+    } else if (0 == strncasecmp(pToken->z, "to_base64(", 10)) {
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NK_LP != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, "( expected", pToken->z);
+      }
+
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NULL == pToken->type) {
+        NEXT_VALID_TOKEN(*ppSql, *pToken);
+        if (TK_NK_RP == pToken->type) {
+          pVal->flag = CV_FLAG_NULL;
+
+          return TSDB_CODE_SUCCESS;
+        }
+      } else if (TK_NK_STRING != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, "string expected", pToken->z);
+      }
+
+      inputBytes = trimString(pToken->z, pToken->n, tmpTokenBuf, TSDB_MAX_BYTES_PER_ROW);
+      input = tmpTokenBuf;
+
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NK_RP != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, ") expected", pToken->z);
+      }
+
+      int32_t outputBytes = tbase64_encode_len(inputBytes);
+      if (outputBytes + VARSTR_HEADER_SIZE > bytes) {
+        return generateSyntaxErrMsg(&pCxt->msg, TSDB_CODE_PAR_VALUE_TOO_LONG, pSchema->name);
+      }
+
+      *pData = taosMemoryMalloc(outputBytes);
+      if (NULL == *pData) {
+        return terrno;
+      }
+
+      tbase64_encode(*pData, input, inputBytes, outputBytes);
+      *nData = outputBytes;
+    } else if (0 == strncasecmp(pToken->z, "md5(", 4)) {
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NK_LP != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, "( expected", pToken->z);
+      }
+
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NULL == pToken->type) {
+        NEXT_VALID_TOKEN(*ppSql, *pToken);
+        if (TK_NK_RP == pToken->type) {
+          pVal->flag = CV_FLAG_NULL;
+
+          return TSDB_CODE_SUCCESS;
+        }
+      } else if (TK_NK_STRING != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, "string expected", pToken->z);
+      }
+
+      inputBytes = trimString(pToken->z, pToken->n, tmpTokenBuf, TSDB_MAX_BYTES_PER_ROW);
+      input = tmpTokenBuf;
+
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NK_RP != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, ") expected", pToken->z);
+      }
+
+      if (MD5_OUTPUT_LEN + VARSTR_HEADER_SIZE > bytes) {
+        return generateSyntaxErrMsg(&pCxt->msg, TSDB_CODE_PAR_VALUE_TOO_LONG, pSchema->name);
+      }
+
+      int32_t bufLen = TMAX(MD5_OUTPUT_LEN + VARSTR_HEADER_SIZE + 1, inputBytes);
+      *pData = taosMemoryMalloc(bufLen);
+      if (NULL == *pData) {
+        return terrno;
+      }
+
+      (void)memcpy(*pData, input, inputBytes);
+      int32_t len = taosCreateMD5Hash(*pData, inputBytes);
+      *nData = len;
+    } else if (0 == strncasecmp(pToken->z, "sha2(", 5)) {
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NK_LP != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, "( expected", pToken->z);
+      }
+
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NULL == pToken->type) {
+        NEXT_VALID_TOKEN(*ppSql, *pToken);
+        if (TK_NK_RP == pToken->type) {
+          pVal->flag = CV_FLAG_NULL;
+
+          return TSDB_CODE_SUCCESS;
+        }
+      } else if (TK_NK_STRING != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, "string expected", pToken->z);
+      }
+
+      inputBytes = trimString(pToken->z, pToken->n, tmpTokenBuf, TSDB_MAX_BYTES_PER_ROW);
+      input = tmpTokenBuf;
+
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NK_COMMA != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, ", expected", pToken->z);
+      }
+
+      int64_t digestLen = 224;
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NK_INTEGER != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, "integer [224, 256, 384, 512] expected", pToken->z);
+      } else {
+        if (TSDB_CODE_SUCCESS != toInteger(pToken->z, pToken->n, 10, &digestLen)) {
+          return buildSyntaxErrMsg(&pCxt->msg, "invalid integer format", pToken->z);
+        }
+
+        if (224 != digestLen && 256 != digestLen && 384 != digestLen && 512 != digestLen) {
+          return buildSyntaxErrMsg(&pCxt->msg, "sha2 digest length must be one of 224, 256, 384, 512", pToken->z);
+        }
+      }
+
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NK_RP != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, ") expected", pToken->z);
+      }
+
+      if (SHA2_OUTPUT_LEN + VARSTR_HEADER_SIZE > bytes) {
+        return generateSyntaxErrMsg(&pCxt->msg, TSDB_CODE_PAR_VALUE_TOO_LONG, pSchema->name);
+      }
+
+      int32_t bufLen = TMAX(SHA2_OUTPUT_LEN + VARSTR_HEADER_SIZE + 1, inputBytes);
+      *pData = taosMemoryMalloc(bufLen);
+      if (NULL == *pData) {
+        return terrno;
+      }
+
+      (void)memcpy(*pData, input, inputBytes);
+      int32_t len = taosCreateSHA2Hash(*pData, inputBytes, digestLen);
+      *nData = len;
+    } else if (0 == strncasecmp(pToken->z, "sha(", 4) || 0 == strncasecmp(pToken->z, "sha1(", 5)) {
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NK_LP != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, "( expected", pToken->z);
+      }
+
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NULL == pToken->type) {
+        NEXT_VALID_TOKEN(*ppSql, *pToken);
+        if (TK_NK_RP == pToken->type) {
+          pVal->flag = CV_FLAG_NULL;
+
+          return TSDB_CODE_SUCCESS;
+        }
+      } else if (TK_NK_STRING != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, "string expected", pToken->z);
+      }
+
+      inputBytes = trimString(pToken->z, pToken->n, tmpTokenBuf, TSDB_MAX_BYTES_PER_ROW);
+      input = tmpTokenBuf;
+
+      NEXT_VALID_TOKEN(*ppSql, *pToken);
+      if (TK_NK_RP != pToken->type) {
+        return buildSyntaxErrMsg(&pCxt->msg, ") expected", pToken->z);
+      }
+
+      if (SHA1_OUTPUT_LEN + VARSTR_HEADER_SIZE > bytes) {
+        return generateSyntaxErrMsg(&pCxt->msg, TSDB_CODE_PAR_VALUE_TOO_LONG, pSchema->name);
+      }
+
+      int32_t bufLen = TMAX(SHA1_OUTPUT_LEN + VARSTR_HEADER_SIZE + 1, inputBytes);
+      *pData = taosMemoryMalloc(bufLen);
+      if (NULL == *pData) {
+        return terrno;
+      }
+
+      (void)memcpy(*pData, input, inputBytes);
+      int32_t len = taosCreateSHA1Hash(*pData, inputBytes);
+      *nData = len;
+    } else {
+      return buildSyntaxErrMsg(&pCxt->msg, "invalid identifier", pToken->z);
+    }
+  } else {
+    if (pToken->n + VARSTR_HEADER_SIZE > bytes) {
+      return generateSyntaxErrMsg(&pCxt->msg, TSDB_CODE_PAR_VALUE_TOO_LONG, pSchema->name);
+    }
+
+    *pData = taosMemoryMalloc(pToken->n);
+    if (NULL == *pData) {
+      return terrno;
+    }
+    memcpy(*pData, pToken->z, pToken->n);
+    *nData = pToken->n;
+  }
+
+  pVal->flag = CV_FLAG_VALUE;
+  return TSDB_CODE_SUCCESS;
+}


The parseBinary function is over 200 lines long and contains a significant amount of duplicated code for parsing function calls with a single string argument (e.g., from_base64, to_base64, md5, sha1). This makes the function difficult to read and maintain. I suggest refactoring this by extracting the common logic for parsing a function with a single string argument into a helper function. This would greatly simplify parseBinary, reduce redundancy, and improve maintainability.

gemini-code-assist · 2025-11-05T09:19:24Z

source/libs/scalar/src/sclfunc.c

+int32_t shaFunction(SScalarParam *pInput, int32_t inputNum, SScalarParam *pOutput) {
+  SColumnInfoData *pInputData = pInput->columnData;
+  SColumnInfoData *pOutputData = pOutput->columnData;
+  int32_t          bufLen = TMAX(SHA1_OUTPUT_LEN + VARSTR_HEADER_SIZE + 1, pInputData->info.bytes);
+  char            *pOutputBuf = taosMemoryMalloc(bufLen);
+  if (!pOutputBuf) {
+    qError("sha function alloc memory failed");
+    return terrno;
+  }
+  for (int32_t i = 0; i < pInput->numOfRows; ++i) {
+    if (colDataIsNull_s(pInputData, i)) {
+      colDataSetNULL(pOutputData, i);
+      continue;
+    }
+    char *input = colDataGetData(pInput[0].columnData, i);
+    if (bufLen < varDataLen(input) + VARSTR_HEADER_SIZE) {
+      bufLen = varDataLen(input) + VARSTR_HEADER_SIZE;
+      pOutputBuf = taosMemoryRealloc(pOutputBuf, bufLen);
+      if (!pOutputBuf) {
+        qError("sha function alloc memory failed");
+        return terrno;
+      }
+    }
+    char *output = pOutputBuf;
+    (void)memcpy(varDataVal(output), varDataVal(input), varDataLen(input));
+    int32_t len = taosCreateSHA1Hash(varDataVal(output), varDataLen(input));
+    varDataSetLen(output, len);
+    int32_t code = colDataSetVal(pOutputData, i, output, false);
+    if (TSDB_CODE_SUCCESS != code) {
+      taosMemoryFree(pOutputBuf);
+      SCL_ERR_RET(code);
+    }
+  }
+  pOutput->numOfRows = pInput->numOfRows;
+  taosMemoryFree(pOutputBuf);
+  return TSDB_CODE_SUCCESS;
+}


The implementation of shaFunction (and similarly sha2Function) involves copying the input data into a buffer which is then used as both input and output by the taosCreateSHA1Hash function. This memcpy for every row is inefficient. While the taosCreate...Hash functions are designed to work in-place, a more efficient approach would be to modify them to accept separate input and output buffers. This would eliminate the need for the memcpy in this loop, improving performance, especially with a large number of small rows. As taosCreateSHA1Hash is an inline function in a header, changing its signature to taosCreateSHA1Hash(char *outBuf, const char *inBuf, int32_t len) and updating its callers would be a beneficial refactoring.

stephenkgu and others added 22 commits October 27, 2025 16:47

feat(encoding): new security funcs

8b3608b

Merge branch '3.0' into fix/TS-7235

36b1751

fix null with {from, to}_base64

c154911

feat(md5): support md5 in inserting

25876aa

fix failed decoding cases

096c37a

new sha/sha1 digesting funcs

d6c20c2

fix CI compiling issue

1174581

feat(sha2): new sha funcs

21c59f5

Merge branch '3.0' into fix/TS-7235-sha2

ff235a7

fix string format printing

9befeea

Merge branch 'fix/TS-7235-sha2' of https://github.com/taosdata/TDengine…

964deba

… into fix/TS-7235-sha2

fix func name in error logs

3ec33b2

fix char type conversions

34b81f8

fix format strings

ab5561a

fix format string

6c8e3d5

fix

0292444

fix

c62f648

fix

3be1841

fix Mark & Mask

3620dfb

fix clang-query

fc79257

Merge branch '3.0' into fix/TS-7235-sha2

7967e6a

feat(mask): mask funcs

7c1a035

stephenkgu requested review from dapan1121, guanshengliang and hzcheng as code owners November 5, 2025 09:15

taosdata-bot bot added the team engine label Nov 5, 2025

gemini-code-assist bot reviewed Nov 5, 2025

View reviewed changes

base64: new func type for from_base64

265c6c7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(mask): mask funcs #33479

feat(mask): mask funcs #33479

Uh oh!

stephenkgu commented Nov 5, 2025

Uh oh!

gemini-code-assist bot commented Nov 5, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Nov 5, 2025

Uh oh!

gemini-code-assist bot Nov 5, 2025

Uh oh!

gemini-code-assist bot Nov 5, 2025

Uh oh!

gemini-code-assist bot Nov 5, 2025

Uh oh!

gemini-code-assist bot Nov 5, 2025

Uh oh!

gemini-code-assist bot Nov 5, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

feat(mask): mask funcs #33479

Are you sure you want to change the base?

feat(mask): mask funcs #33479

Uh oh!

Conversation

stephenkgu commented Nov 5, 2025

Description

Checklist

Uh oh!

gemini-code-assist bot commented Nov 5, 2025

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Nov 5, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 5, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 5, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 5, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 5, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 5, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants