Conversation


@danielhumanmod danielhumanmod commented Nov 30, 2025

Which issue does this PR close?

Closes #2708.

Rationale for this change

Spark’s RegExpExtract and RegExpExtractAll expressions were previously not accelerated through Comet.
This PR adds full Rust-native implementations and the corresponding Spark–Comet bridging logic, enabling these two UDFs to run on the native engine while preserving Spark-compatible semantics.

What changes are included in this PR?

  1. Spark-side serde / bridge
    • Added CometRegExpExtract and CometRegExpExtractAll in strings.scala
  2. Rust-native UDF implementation
    • Introduced SparkRegExpExtract and SparkRegExpExtractAll as ScalarUDFImpl.
  3. QueryPlanSerde integration
    • Wired both expressions into the Comet proto conversion pipeline.

How are these changes tested?

  • Added new test cases in CometStringExpressionSuite
  • Added Rust-side unit tests for core extract and extract_all semantics.
  • All existing string expression tests continue to pass.

@danielhumanmod danielhumanmod changed the title feat: Add support for RegExpExtract/RegExpExtractAll (WIP) feat: Add support for RegExpExtract/RegExpExtractAll Dec 1, 2025
@danielhumanmod danielhumanmod marked this pull request as ready for review December 1, 2025 06:21
fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
// regexp_extract always returns String
Ok(DataType::Utf8)
}
Contributor

@danielhumanmod, could we verify we don't return a LargeUtf8 as well?
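One way this could be handled, sketched with a stand-in `DataType` enum rather than Arrow's actual type (so the snippet is self-contained, not the PR's real code): derive the output type from the subject argument instead of hard-coding `Utf8`, so a LargeUtf8 input is either preserved or rejected explicitly.

```rust
// Stand-in for Arrow's DataType; only the variants needed for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
    Int32,
}

// Derive the return type from the subject column so LargeUtf8 inputs are
// never silently narrowed to Utf8.
fn return_type(arg_types: &[DataType]) -> Result<DataType, String> {
    match arg_types.first() {
        Some(DataType::Utf8) => Ok(DataType::Utf8),
        Some(DataType::LargeUtf8) => Ok(DataType::LargeUtf8), // keep 64-bit offsets
        other => Err(format!("regexp_extract expects a string subject, got {other:?}")),
    }
}
```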

ColumnarValue::Scalar(ScalarValue::Int32(Some(i))) => *i as usize,
_ => {
return exec_err!("regexp_extract idx must be an integer literal");
}
Contributor

Are we sure that this is only Int32 and not wider/narrower ints like Int64, Int8, etc.?

Author

Good point. I checked Spark's implementation; the index there is cast to i32 only. I was wondering if we should follow their behavior:

https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L811C21-L811C33

"parquet.enable.dictionary" -> "true") {
// Use repeated values to trigger dictionary encoding
val data = (0 until 1000).map(i => {
val text = if (i % 3 == 0) "a1b2c3" else if (i % 3 == 1) "x5y6" else "no-match"
Contributor

I believe the tests should be a little more exhaustive and cover very large strings, mixed inputs, and multiple complex regex patterns.

// Check if the pattern is compatible with Spark or allow incompatible patterns
expr.regexp match {
case Literal(pattern, DataTypes.StringType) =>
if (!RegExp.isSupportedPattern(pattern.toString) &&
Contributor

What are the limitations of RegExp.isSupportedPattern(pattern.toString) here? Could we make sure that we are in absolute coherence with Spark, perhaps?

Author

For now it is a placeholder introduced in this PR.

@andygrove when you have a chance, could you share a bit more context on its purpose? I’d really appreciate it. Thank you!

case Literal(_, DataTypes.IntegerType) =>
Compatible()
case _ =>
Unsupported(Some("Only literal group index is supported"))
Contributor

Could the input for group idx be a different numeric type (byte / short / int / long, etc.)?

Author

Good catch, will add support for these data types.
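One shape that support could take on the native side, sketched with a simplified stand-in for DataFusion's `ScalarValue` so it runs standalone: widen every integer literal to `i64` first, which also makes the null and negative-index checks uniform across types.

```rust
// Simplified stand-in for DataFusion's ScalarValue integer variants.
enum ScalarValue {
    Int8(Option<i8>),
    Int16(Option<i16>),
    Int32(Option<i32>),
    Int64(Option<i64>),
}

// Normalize any non-null integer literal to a usize group index,
// rejecting nulls and negative values with a user-facing error.
fn idx_as_usize(v: &ScalarValue) -> Result<usize, String> {
    let i: i64 = match v {
        ScalarValue::Int8(Some(i)) => i64::from(*i),
        ScalarValue::Int16(Some(i)) => i64::from(*i),
        ScalarValue::Int32(Some(i)) => i64::from(*i),
        ScalarValue::Int64(Some(i)) => *i,
        _ => return Err("regexp_extract idx must be a non-null integer literal".into()),
    };
    usize::try_from(i).map_err(|_| format!("group index must be non-negative, got {i}"))
}
```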

// Pattern must be a literal string
let pattern_str = match pattern {
ColumnarValue::Scalar(ScalarValue::Utf8(Some(s))) => s.clone(),
_ => {
Contributor

Might have to check if the string input is LargeUtf8 as well or ensure we are only sending Utf8 from spark side

match &args.args[2] {
ColumnarValue::Scalar(ScalarValue::Int32(Some(i))) => *i as usize,
_ => {
return exec_err!("regexp_extract_all idx must be an integer literal");
Contributor

We might want to have one common DataFusion::InternalError to make sure we throw the right exception all along.

internal_datafusion_err!("Invalid regex pattern '{}': {}", pattern_str, e)
})?;

match subject {
Contributor

We could probably remove some duplication in the code with regex parsing here

Author

Will extract some shared code, thanks for the suggestion!
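One shape the shared code could take (stand-in types for DataFusion's `ColumnarValue`/`ScalarValue`; the real helper would likely return a compiled `Regex`): a single function that extracts the literal pattern string, so both UDFs reject arrays, nulls, and non-Utf8 scalars with identical wording.

```rust
// Simplified stand-ins for DataFusion's ColumnarValue / ScalarValue.
enum ScalarValue {
    Utf8(Option<String>),
}
enum ColumnarValue {
    Scalar(ScalarValue),
    Array,
}

// Shared by regexp_extract and regexp_extract_all: pull out the literal
// pattern string, rejecting anything else uniformly.
fn literal_pattern(v: &ColumnarValue, fn_name: &str) -> Result<String, String> {
    match v {
        ColumnarValue::Scalar(ScalarValue::Utf8(Some(s))) => Ok(s.clone()),
        _ => Err(format!("{fn_name} pattern must be a literal string")),
    }
}
```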

fn extract_group(text: &str, regex: &Regex, idx: usize) -> Result<String> {
match regex.captures(text) {
Some(caps) => {
// Spark behavior: throw error if group index is out of bounds
Member

Author

Thanks for the reference! After checking the Spark code:

  1. If the index is within the valid range
    • Spark returns the matched content if it exists
    • otherwise (i.e., the group is null), Spark returns an empty string
  2. If the index is out of range, Spark throws an exception (code)

The implementation should be aligned with this behavior.
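These Spark semantics can be modeled standalone (capture groups represented as a slice of `Option<&str>` rather than the regex crate's `Captures`, so no external dependency is assumed):

```rust
// Spark-compatible group extraction semantics:
//  - pattern does not match        -> empty string
//  - group matched                 -> its content
//  - group did not participate     -> empty string
//  - group index out of range      -> error
fn extract_group(caps: Option<&[Option<&str>]>, idx: usize) -> Result<String, String> {
    match caps {
        None => Ok(String::new()), // pattern did not match the subject
        Some(groups) => {
            if idx >= groups.len() {
                return Err(format!(
                    "regexp_extract: group index {idx} exceeds group count {}",
                    groups.len().saturating_sub(1)
                ));
            }
            Ok(groups[idx].unwrap_or("").to_string())
        }
    }
}
```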


// idx must be a literal int
let idx_val = match idx {
ColumnarValue::Scalar(ScalarValue::Int32(Some(i))) => *i as usize,
Member

Negative i will lead to a big idx_val here. Does it need some kind of validation?

Author

Good call, will fix this, thanks!
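A minimal sketch of the missing validation (plain function, name is an assumption): check the sign before the `as usize` cast, since `-1 as usize` silently wraps to a huge value.

```rust
// Validate a literal group index before casting: on a 64-bit target,
// `-1i32 as usize` wraps to 18446744073709551615 and would masquerade
// as an out-of-bounds group instead of a clear user error.
fn validate_group_idx(i: i32) -> Result<usize, String> {
    if i < 0 {
        Err(format!("regexp_extract: group index must be non-negative, got {i}"))
    } else {
        Ok(i as usize)
    }
}
```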


// Compile regex once
let regex = Regex::new(&pattern_str).map_err(|e| {
internal_datafusion_err!("Invalid regex pattern '{}': {}", pattern_str, e)
Member

Suggested change
internal_datafusion_err!("Invalid regex pattern '{}': {}", pattern_str, e)
exec_err!("Invalid regex pattern '{}': {}", pattern_str, e)

Invalid regex is a user error.


// Compile regex once
let regex = Regex::new(&pattern_str).map_err(|e| {
internal_datafusion_err!("Invalid regex pattern '{}': {}", pattern_str, e)
Member

Suggested change
internal_datafusion_err!("Invalid regex pattern '{}': {}", pattern_str, e)
exec_err!("Invalid regex pattern '{}': {}", pattern_str, e)

user error


codecov-commenter commented Dec 1, 2025

Codecov Report

❌ Patch coverage is 9.09091% with 40 lines in your changes missing coverage. Please review.
✅ Project coverage is 54.24%. Comparing base (f09f8af) to head (ff1ebd6).
⚠️ Report is 730 commits behind head on main.

Files with missing lines Patch % Lines
...rc/main/scala/org/apache/comet/serde/strings.scala 4.76% 40 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2831      +/-   ##
============================================
- Coverage     56.12%   54.24%   -1.88%     
- Complexity      976     1435     +459     
============================================
  Files           119      167      +48     
  Lines         11743    15232    +3489     
  Branches       2251     2531     +280     
============================================
+ Hits           6591     8263    +1672     
- Misses         4012     5752    +1740     
- Partials       1140     1217      +77     


1. More data type support on the Scala side
2. Unify errors as execution ones
3. Reduce code duplication
4. Negative index check