@funcpp commented Jan 22, 2026

Summary

This PR adds several Databricks Delta Lake SQL syntax features:

1. OPTIMIZE statement support

Adds support for the Databricks OPTIMIZE statement syntax:

OPTIMIZE table_name [WHERE predicate] [ZORDER BY (col1, col2, ...)]

Reference: https://docs.databricks.com/en/sql/language-manual/delta-optimize.html

Key difference from ClickHouse: Databricks omits the TABLE keyword after OPTIMIZE.

2. PARTITIONED BY with optional column types

Databricks allows a partition column to omit its type when it references a column already defined in the table. The typed form is still accepted:

CREATE TABLE t (col1 STRING, col2 INT) PARTITIONED BY (col1)
CREATE TABLE t (name STRING) PARTITIONED BY (year INT, month INT)

Reference: https://docs.databricks.com/en/sql/language-manual/sql-ref-partition.html
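The optional-type rule can be illustrated with a small standalone sketch: after the column name, a comma or closing paren means no type follows. This is not the PR's code. `Tok`, `PartitionColumn`, `parse_partition_column`, and `parse_partition_list` are hypothetical names over a toy token list, and data types are reduced to a single word.

```rust
/// Hypothetical, simplified token type for this sketch (not sqlparser's `Token`).
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Word(String),
    Comma,
    RParen,
}

/// A partition column: a name plus an optional data type.
#[derive(Debug, PartialEq)]
struct PartitionColumn {
    name: String,
    data_type: Option<String>,
}

/// Parses one partition column starting at `*pos`, advancing `*pos` past it.
fn parse_partition_column(tokens: &[Tok], pos: &mut usize) -> Result<PartitionColumn, String> {
    let name = match tokens.get(*pos) {
        Some(Tok::Word(w)) => w.clone(),
        other => return Err(format!("expected column name, found {:?}", other)),
    };
    *pos += 1;
    // A comma or closing paren right after the name means no type follows.
    let data_type = match tokens.get(*pos) {
        Some(Tok::Comma) | Some(Tok::RParen) => None,
        Some(Tok::Word(ty)) => {
            let ty = ty.clone();
            *pos += 1;
            Some(ty)
        }
        None => return Err("unexpected end of input".to_string()),
    };
    Ok(PartitionColumn { name, data_type })
}

/// Parses the column list of `PARTITIONED BY ( ... )`, starting just after the `(`.
fn parse_partition_list(tokens: &[Tok]) -> Result<Vec<PartitionColumn>, String> {
    let mut pos = 0;
    let mut cols = Vec::new();
    loop {
        cols.push(parse_partition_column(tokens, &mut pos)?);
        match tokens.get(pos) {
            Some(Tok::Comma) => pos += 1,
            Some(Tok::RParen) => return Ok(cols),
            other => return Err(format!("expected ',' or ')', found {:?}", other)),
        }
    }
}
```

With this sketch, the tokens for `col1)` produce one column with `data_type: None`, while the tokens for `year INT, month INT)` produce two columns that each carry `Some("INT")`.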

3. STRUCT type with colon syntax

Databricks uses Hive-style colon separator for struct field definitions:

STRUCT<field_name: field_type, ...>
ARRAY<STRUCT<finish_flag: STRING, survive_flag: STRING, score: INT>>

Reference: https://docs.databricks.com/en/sql/language-manual/data-types/struct-type.html

The colon is optional per the spec, so both field: type and field type syntaxes are now accepted.
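To make the optional-colon behavior concrete, here is a minimal string-based sketch. It is not the parser's actual implementation, which works on tokens. `split_top_level` and `parse_struct_field` are hypothetical helpers, and field names are assumed to be plain unquoted identifiers.

```rust
/// Splits the inside of `STRUCT<...>` into fields at top-level commas,
/// so commas inside nested `<...>` are left alone.
fn split_top_level(body: &str) -> Vec<&str> {
    let mut parts = Vec::new();
    let mut depth = 0usize;
    let mut start = 0;
    for (i, c) in body.char_indices() {
        match c {
            '<' => depth += 1,
            '>' => depth = depth.saturating_sub(1),
            ',' if depth == 0 => {
                parts.push(body[start..i].trim());
                start = i + 1;
            }
            _ => {}
        }
    }
    parts.push(body[start..].trim());
    parts
}

/// Parses one field as (name, type), accepting `name: type` and `name type`.
fn parse_struct_field(field: &str) -> Option<(String, String)> {
    let field = field.trim();
    // The name ends at the first colon or whitespace character.
    let end = field.find(|c: char| c == ':' || c.is_whitespace())?;
    let (name, rest) = field.split_at(end);
    // Skip whitespace, then at most one colon, then whitespace again.
    let rest = rest.trim_start();
    let rest = rest.strip_prefix(':').unwrap_or(rest).trim_start();
    if name.is_empty() || rest.is_empty() {
        return None;
    }
    Some((name.to_string(), rest.to_string()))
}
```

Both `parse_struct_field("score: INT")` and `parse_struct_field("score INT")` yield the pair `("score", "INT")`, and a nested type such as `ARRAY<STRUCT<a: INT, b: INT>>` is kept whole as the type of its field.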

Changes

  • Extended OptimizeTable AST to support Databricks-specific fields (predicate, zorder, has_table_keyword)
  • Added parse_column_def_for_partition() to handle optional column types in PARTITIONED BY
  • Added DatabricksDialect to STRUCT type parsing
  • Modified parse_struct_field_def() to accept optional colon separator
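As a rough illustration of the first change, the sketch below models an OPTIMIZE node that carries the three new fields and prints either dialect's form. It is a simplified stand-in, not the crate's real definition: the struct here uses plain strings for the table name, predicate, and columns, and it omits the existing ClickHouse-specific fields.

```rust
use std::fmt;

/// Hypothetical, simplified stand-in for the extended `OptimizeTable` node.
/// Expressions and identifiers are plain strings to keep the sketch small.
#[derive(Debug, PartialEq)]
struct OptimizeTable {
    name: String,
    /// true for `OPTIMIZE TABLE t` (ClickHouse), false for `OPTIMIZE t` (Databricks)
    has_table_keyword: bool,
    /// Optional `WHERE` predicate
    predicate: Option<String>,
    /// Optional `ZORDER BY (...)` column list
    zorder: Option<Vec<String>>,
}

impl fmt::Display for OptimizeTable {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "OPTIMIZE ")?;
        if self.has_table_keyword {
            write!(f, "TABLE ")?;
        }
        write!(f, "{}", self.name)?;
        if let Some(predicate) = &self.predicate {
            write!(f, " WHERE {}", predicate)?;
        }
        if let Some(cols) = &self.zorder {
            write!(f, " ZORDER BY ({})", cols.join(", "))?;
        }
        Ok(())
    }
}
```

Recording `has_table_keyword` on the node lets a statement round-trip to the same text it was parsed from, whichever dialect produced it.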

Test plan

  • Added tests for OPTIMIZE statement variations
  • Added tests for PARTITIONED BY with/without column types
  • Added tests for STRUCT type with colon syntax
  • Verified existing ClickHouse and BigQuery tests still pass
  • All tests pass (cargo test)

Add support for Databricks Delta Lake OPTIMIZE statement syntax:
- OPTIMIZE table_name [WHERE predicate] [ZORDER BY (col1, ...)]

This extends the existing OptimizeTable AST to support both ClickHouse
and Databricks syntax by adding:
- has_table_keyword: distinguishes OPTIMIZE TABLE (ClickHouse) from
  OPTIMIZE (Databricks)
- predicate: optional WHERE clause for partition filtering
- zorder: optional ZORDER BY clause for data colocation

Databricks allows partition columns to be specified without types when
referencing columns already defined in the table specification:

  CREATE TABLE t (col1 STRING, col2 INT) PARTITIONED BY (col1)
  CREATE TABLE t (name STRING) PARTITIONED BY (year INT, month INT)

This change introduces parse_column_def_for_partition() which makes the
data type optional by checking if the next token is a comma or closing
paren (indicating no type follows the column name).

Add support for Databricks/Hive-style STRUCT field syntax using colons:
  STRUCT<field_name: field_type, ...>

Changes:
- Add DatabricksDialect to STRUCT type parsing (alongside BigQuery/Generic)
- Modify parse_struct_field_def to handle optional colon separator between
  field name and type, supporting both:
  - BigQuery style: STRUCT<field_name field_type>
  - Databricks/Hive style: STRUCT<field_name: field_type>

This enables parsing complex nested types like:
  ARRAY<STRUCT<finish_flag: STRING, survive_flag: STRING, score: INT>>

@funcpp funcpp force-pushed the feature/databricks-delta-support branch from 1dd3157 to 77f4fae Compare January 22, 2026 06:06