
Conversation

@timsaucer (Member) commented Nov 3, 2025

Which issue does this PR close?

Closes #1172

Rationale for this change

Since we now have the ability to pass Field information instead of just DataType with ScalarUDFs, this feature adds similar support for Python-written UDFs. Without it, you must write your UDFs in Rust and expose them to Python to get Field-level information. This enhancement greatly expands the use cases where PyArrow data can be leveraged.

What changes are included in this PR?

  • Adds a ScalarUDF implementation for Python-based UDFs instead of relying on the create_udf function
  • Adds support for converting to a PyArrow array via FFI, including the Field schema instead of only the data type
  • Adds unit tests

Are there any user-facing changes?

This expands on the current API and is backwards compatible.
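
For illustration, here is a minimal sketch of what the expanded API enables, assuming (per the description above) that udf() now accepts pa.Field wherever a pa.DataType was previously accepted; pa.uuid() requires a recent PyArrow (18+), and the function bodies are placeholders:

import pyarrow as pa
import pyarrow.compute as pc
from datafusion import udf

# Backwards compatible: bare DataTypes still work as before.
double_it = udf(lambda a: pc.multiply(a, 2), [pa.int64()], pa.int64(), "immutable")

# New in this PR: Fields carry extension metadata and nullability
# across the UDF boundary instead of being reduced to a DataType.
uuid_field = pa.field("id", pa.uuid(), nullable=False)
passthrough = udf(lambda a: a, [uuid_field], uuid_field, "immutable")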

@timsaucer (Member, Author)

@kosiew Here is an alternate approach. Instead of relying on extension type features, it passes the Field information when creating the FFI array. This will capture PyArrow extension types as well as any other metadata that a user assigns to the input.

I'm going to leave it in draft until I can finish up those additional items on my checklist.

What do you think?

cc @paleolimbot

@paleolimbot (Member)

What do you think?

Definitely! Passing the argument fields/return fields should do it. Using __arrow_c_schema__ might be more flexible than isinstance(x, pa.Field) (arro3, nanoarrow, and polars types would work too).

We have a slightly different signature model in SedonaDB ("type matchers") because the existing signature matching doesn't consider metadata, but at the Arrow/FFI level we're doing approximately the same thing: apache/sedona-db#228. We do use the concept of SedonaType for arguments and return type (but these are serializable to/deserializable from fields).
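
For reference, a duck-typed coercion along the lines suggested above might look like the following sketch. The helper name _as_field is hypothetical, and it assumes a PyArrow version whose pa.field() can consume objects implementing the __arrow_c_schema__ PyCapsule protocol:

import pyarrow as pa

def _as_field(obj) -> pa.Field:
    """Coerce a field-like object into a pyarrow.Field (hypothetical helper)."""
    if isinstance(obj, pa.Field):
        return obj
    if isinstance(obj, pa.DataType):
        # Backwards compatibility: wrap a bare DataType in a nullable field.
        return pa.field("", obj)
    if hasattr(obj, "__arrow_c_schema__"):
        # Accepts arro3, nanoarrow, polars, etc. via the PyCapsule protocol,
        # assuming pa.field() supports importing from it.
        return pa.field(obj)
    raise TypeError(f"expected a field-like object, got {type(obj)!r}")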

src/udf.rs Outdated
Comment on lines 106 to 110:

    "_import_from_c",
    (
        addr_of!(array) as Py_uintptr_t,
        addr_of!(schema) as Py_uintptr_t,
    ),
Contributor:

Is the use of PyArrow's private _import_from_c advisable?

@timsaucer (Member, Author):

This code is a near-duplicate of how we already convert ArrayData into a PyArrow object. You can see the original here. The difference in this function is that we know the Field instead of only the data type.

Member:

A more modern way is to use __arrow_c_schema__ (although I think import_from_c will be around for a while). It's only a few lines:

https://github.com/apache/sedona-db/blob/main/python/sedonadb/src/import_from.rs#L151-L157

@timsaucer (Member, Author) commented Nov 6, 2025

Also worth evaluating while we're doing this: for scalar values, is it possible for them to contain metadata? If I do pa.scalar(uuid.uuid4().bytes, type=pa.uuid()) and check the type, I should have the extension data. Maybe this is already supported, but I want to evaluate that as part of this PR as well.

Opened a new issue so there isn't too much scope creep.
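
A sketch of the check described above (assumes PyArrow 18+, where pa.uuid() is a canonical extension type; whether the extension type actually survives is exactly what the new issue is meant to evaluate):

import uuid
import pyarrow as pa

s = pa.scalar(uuid.uuid4().bytes, type=pa.uuid())
# If extension metadata is preserved, the scalar's type should be the
# extension type rather than its fixed_size_binary(16) storage type.
print(s.type)
print(isinstance(s.type, pa.ExtensionType))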

timsaucer force-pushed the feat/propagate-metadata branch from 39754e5 to dcb9e25 on January 19, 2026
timsaucer marked this pull request as ready for review on January 19, 2026
timsaucer requested a review from Copilot on January 19, 2026
Copilot AI (Contributor) left a comment

Pull request overview

This PR enhances Python UDFs to support PyArrow Field information (including metadata and nullability) instead of only DataType information, enabling more sophisticated data type handling in Python-written scalar UDFs.

Changes:

  • Implements a custom ScalarUDFImpl for Python UDFs instead of using the generic create_udf function
  • Adds PyArrowArrayExportable struct to support FFI conversion with Field schema information
  • Updates Python API to accept pa.Field objects while maintaining backwards compatibility with pa.DataType

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.

Summary per file:

  • src/udf.rs: Replaces create_udf with a custom PythonFunctionScalarUDF implementation supporting Field-based metadata
  • src/lib.rs: Adds the new array module to the crate
  • src/array.rs: Implements PyArrowArrayExportable for FFI conversion with Field information
  • python/tests/test_udf.py: Adds tests for UUID metadata handling and nullability preservation
  • python/datafusion/user_defined.py: Updates the API to accept Field/DataType with helper conversion functions
  • pyproject.toml: Adds a minimum PyArrow version constraint
  • docs/source/user-guide/common-operations/udf-and-udfa.rst: Documents Field vs DataType usage and references the Rust UDF blog post
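
To make the metadata/nullability expectation concrete, here is a hedged, test-style sketch (the table, UDF, and column names are invented; it assumes the DataFrame's schema() exposes the returned pa.Field):

def test_field_is_preserved(ctx, passthrough_udf, uuid_field):
    # Hypothetical pytest-style check mirroring the tests listed above.
    ctx.register_udf(passthrough_udf)
    out = ctx.sql("SELECT passthrough(id) AS out FROM t").schema().field("out")
    assert out.nullable == uuid_field.nullable
    assert out.type == uuid_field.type  # extension type, not its storage type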


}

fn return_type(&self, _arg_types: &[DataType]) -> datafusion::common::Result<DataType> {
    unimplemented!()
Copilot AI commented on Jan 19, 2026:

The return_type method is unimplemented but may be called by DataFusion internals. Consider either implementing it by returning self.return_field.data_type().clone() or adding a comment explaining why it's safe to leave unimplemented.

Suggested change:
- unimplemented!()
+ Ok(self.return_field.data_type().clone())

def _decorator(
    input_types: list[pa.DataType],
    return_type: _R,
    input_fields: list[pa.DataType],
Copilot AI commented on Jan 19, 2026:

The type hint for input_fields parameter in _decorator should be Sequence[pa.DataType | pa.Field] | pa.DataType | pa.Field to match the signature of _function and maintain consistency with the actual accepted types.

Suggested change:
- input_fields: list[pa.DataType],
+ input_fields: Sequence[pa.DataType | pa.Field] | pa.DataType | pa.Field,



Development

Successfully merging this pull request may close these issues.

ScalarUDFs created using datafusion.udf() do not propagate extension type metadata
