fix: Use binary(16) for UUID type to ensure Spark compatibility #2881
base: main
@@ -1500,7 +1500,7 @@ def test_table_write_schema_with_valid_upcast(
             pa.field("list", pa.list_(pa.int64()), nullable=False),
             pa.field("map", pa.map_(pa.string(), pa.int64()), nullable=False),
             pa.field("double", pa.float64(), nullable=True),  # can support upcasting float to double
-            pa.field("uuid", pa.uuid(), nullable=True),
+            pa.field("uuid", pa.binary(16), nullable=True),
         )
     )
 )
@@ -2138,7 +2138,7 @@ def test_uuid_partitioning(session_catalog: Catalog, spark: SparkSession, transf
     tbl.append(arr_table)

     lhs = [r[0] for r in spark.table(identifier).collect()]
-    rhs = [str(u.as_py()) for u in tbl.scan().to_arrow()["uuid"].combine_chunks()]
+    rhs = [str(uuid.UUID(bytes=u.as_py())) for u in tbl.scan().to_arrow()["uuid"].combine_chunks()]
     assert lhs == rhs
@@ -2530,3 +2530,63 @@ def test_v3_write_and_read_row_lineage(spark: SparkSession, session_catalog: Cat
     assert tbl.metadata.next_row_id == initial_next_row_id + len(test_data), (
         "Expected next_row_id to be incremented by the number of added rows"
     )
+
+
+@pytest.mark.integration
+def test_write_uuid_in_pyiceberg_and_scan(session_catalog: Catalog, spark: SparkSession) -> None:
+    """Test UUID compatibility between PyIceberg and Spark.
+
+    UUIDs must be written as binary(16) for Spark compatibility since Java Arrow
+    metadata differs from Python Arrow metadata for UUID types.
+    """
+    identifier = "default.test_write_uuid_in_pyiceberg_and_scan"
+
+    catalog = load_catalog("default", type="in-memory")
Contributor: nit: do we need these lines with the session catalog?
+    catalog.create_namespace("ns")
+
+    schema = Schema(NestedField(field_id=1, name="uuid_col", field_type=UUIDType(), required=False))
+
+    test_data_with_null = {
+        "uuid_col": [
+            uuid.UUID("00000000-0000-0000-0000-000000000000").bytes,
+            None,
+            uuid.UUID("11111111-1111-1111-1111-111111111111").bytes,
+        ]
+    }
+
+    try:
+        session_catalog.drop_table(identifier=identifier)
+    except NoSuchTableError:
+        pass
+
+    table = _create_table(session_catalog, identifier, {"format-version": "2"}, schema=schema)
+
+    arrow_table = pa.table(test_data_with_null, schema=schema.as_arrow())
+
+    # Write with pyarrow
+    table.append(arrow_table)
+
+    # Write with pyspark
+    spark.sql(
+        f"""
+        INSERT INTO {identifier} VALUES ("22222222-2222-2222-2222-222222222222")
+        """
+    )
+    df = spark.table(identifier)
+
+    table.refresh()
+
+    assert df.count() == 4
+    assert len(table.scan().to_arrow()) == 4
+
+    result = df.where("uuid_col = '00000000-0000-0000-0000-000000000000'")
+    assert result.count() == 1
Contributor: I ran this test on the current main branch with 1.10.1 and this is the stacktrace. It is different from the stacktrace in #2007.
Contributor: I also downloaded the 2 data files.
Collaborator (Author): The stacktrace has changed because of the fix that I made in the Java implementation. This PR (apache/iceberg#14027) has more details about the problem, and in issue #2372 I explain the problem from the PyIceberg side and why we are changing back to binary(16).
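For illustration, a minimal sketch (not PyIceberg code; it assumes only pyarrow and the standard-library uuid module) of the round trip the updated tests rely on: the raw 16 bytes are stored in a binary(16) column and rebuilt into the canonical string form for comparison against Spark's output.

```python
import uuid

import pyarrow as pa

u = uuid.UUID("11111111-1111-1111-1111-111111111111")

# Store the UUID's raw 16 bytes in a fixed-size binary(16) column, the
# representation this PR writes for Spark compatibility.
arr = pa.array([u.bytes, None], type=pa.binary(16))

# Read back and rebuild the canonical string form, mirroring the
# str(uuid.UUID(bytes=u.as_py())) change in the diff above.
rhs = [str(uuid.UUID(bytes=v.as_py())) if v.is_valid else None for v in arr]
assert rhs == [str(u), None]
```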
Contributor: The Parquet files look wrong, and not sure what happened there. UUID should annotate
Collaborator (Author): The pa.binary(16) change fixed the comparison issue but broke Parquet spec compliance by removing the UUID logical type annotation. We can go back to UUID in the visitor and raise an exception with a better message when the user tries to filter a UUID column, since PyArrow does not support filtering on it.
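A hypothetical sketch of that suggestion (the helper name and wiring are illustrative, not actual PyIceberg visitor code): detect PyArrow's UUID extension type on the filtered column and raise a descriptive error up front rather than letting PyArrow fail later.

```python
import pyarrow as pa


def _reject_uuid_filter(field: pa.Field) -> None:
    # Hypothetical guard: pa.uuid() is PyArrow's UUID extension type; row
    # filters cannot be evaluated against it, so fail with a clear message.
    if field.type == pa.uuid():
        raise NotImplementedError(
            f"Filtering on UUID column {field.name!r} is not supported by PyArrow; "
            "compare against the 16-byte binary value instead."
        )
```

Such a check could run on each referenced field before the row filter is translated, e.g. `_reject_uuid_filter(schema.field("uuid_col"))`.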
+
+    result = df.where("uuid_col = '22222222-2222-2222-2222-222222222222'")
+    assert result.count() == 1
+
+    result = table.scan(row_filter=EqualTo("uuid_col", uuid.UUID("00000000-0000-0000-0000-000000000000").bytes)).to_arrow()
+    assert len(result) == 1
+
+    result = table.scan(row_filter=EqualTo("uuid_col", uuid.UUID("22222222-2222-2222-2222-222222222222").bytes)).to_arrow()
+    assert len(result) == 1