Support calendar-based chunking #9119
Codecov Report ❌
Coverage diff, main vs. #9119:
              main     #9119    +/-
Coverage      82.42%   82.55%   +0.12%
Files         244      246      +2
Lines         47953    48597    +644
Branches      12235    12438    +203
Hits          39525    40117    +592
Misses        3553     3608     +55
Partials      4875     4872     -3
@fabriziomello, @natalya-aksman: please review this pull request.
Going to work on tests for continuous aggregates and other things.
Is it the same thing as time_bucket, i.e., does it generate the same intervals for chunks if configured with the same bucket width?
Yes, it is similar. But there are also differences, including (but probably not limited to):
Btw, when comparing to time_bucket(), I also found a potential issue with time_bucket(): #9136
Add support for specifying an origin point for aligning chunk boundaries. This allows chunks to be aligned to a specific reference point instead of the Unix epoch (or zero for integer dimensions).

Changes:
- Add interval_origin column to dimension catalog table
- Add origin parameter to create_hypertable(), set_chunk_time_interval(), set_partitioning_interval(), and add_dimension() SQL functions
- Add time_origin and integer_origin columns to dimensions view
- Modify chunk boundary calculation to use origin when specified
- Add type validation for origin (integer vs timestamp compatibility)
- Add chunk_origin test for origin parameter functionality
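For illustration, the origin-aligned boundary calculation for a fixed-size interval can be sketched as below. This is a minimal sketch and not the code in this PR: `chunk_range` and its parameters are hypothetical names, and values are plain integers, as they would be for an integer dimension.

```python
def chunk_range(value, width, origin=0):
    """Return the half-open range [start, end) of size `width` containing `value`.

    Boundaries are aligned to `origin` (hypothetical parameter name); the
    default of 0 mirrors alignment to zero for integer dimensions.
    """
    # Floor division also gives the right chunk for values before the origin.
    index = (value - origin) // width
    start = origin + index * width
    return start, start + width


print(chunk_range(25, 10))            # (20, 30): aligned to zero
print(chunk_range(25, 10, origin=3))  # (23, 33): aligned to the origin
print(chunk_range(-1, 10, origin=3))  # (-7, 3): value before the origin
```

The same arithmetic applies to timestamps stored as a fixed-size count since a reference point; it stops working once the interval itself varies in length, which is what calendar-based chunking has to deal with.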
Hypertables can now be configured to create chunks that align with the start and end of days, weeks, months, and years in the local time zone. This includes days and months that vary in length due to daylight savings or month of the year.

This "calendar-based chunking" is achieved by anchoring chunk ranges at a user-configurable "origin" and calculating chunk ranges using local time zones and in units of, e.g., variable-length days and months. Therefore, a day-sized chunk can sometimes be 23 or 25 hours if it covers a daylight savings change.

Currently, calendar-based chunking is guarded by a GUC, and turned off by default to preserve the existing behavior. To use calendar-based chunking, the GUC must be turned on and the hypertable configured with an Interval-type chunk interval. Existing hypertables are not affected by the GUC setting.

The default origin is set to '2001-01-01 00:00' because that is the start of a new year and a Monday, so it also aligns with the start of a week (ISO). This means that chunk intervals set to `1 week` will lead to chunks that start on a Monday and end on a Sunday.

The origin-based approach was chosen because of the flexibility it gives; setting a different origin allows, e.g., daily chunks to start at noon instead of midnight. It also makes supporting chunk intervals of multiple months easy, as opposed to a truncation-based approach (e.g., date_trunc()), which only works with singular days, weeks, or months.

Implementation-wise, a challenge of the origin-based approach is to calculate a chunk range for a point in the future from the given origin. Since, e.g., a `1 month` interval varies in size depending on which month it is, simple fixed-size interval arithmetic cannot be used to calculate the Nth chunk range from the origin. Instead, the approach taken is to break down the calculations into full month, day, and sub-day units. But this only works for intervals that are non-fractional units of months, days, etc.

As a fallback for arbitrary intervals, the range for a particular chunk is calculated from the origin by iteratively adding intervals until the desired point in time is covered. This iterative approach is optimized by (under-)estimating the number of intervals to the desired point, and then iterating from there.

Since the iterative approach works for all types of intervals, the question is whether this approach is good enough for all cases, and the "broken-down" calculations are not needed. However, for this change, both approaches exist together, although this decision can be revisited in a future change.

Changes include:
- Add `interval` column to dimension catalog for INTERVAL type storage
- Add `partition_origin` parameter to `by_range()` for origin specification
- Create chunk_range.c/h for calendar-based interval calculations
- Add GUC `timescaledb.enable_calendar_chunking` (default off)
- Update dimension handling to support both fixed and calendar intervals
- Update SQL API functions to support calendar intervals
- Make caggs inherit calendar chunking from main hypertable
- Add calendar_chunking regression test
- Add calendar_chunking_integration test
A hypertable uses either legacy or calendar-based chunking, as determined at hypertable creation time. This setting is sticky (stored in metadata) and does not change even if the calendar chunking GUC is turned on. To allow users to switch to calendar-based chunking (and back), add a parameter `calendar_chunking=>true` to `set_chunk_time_interval()` and `set_partitioning_interval()`.
Hypertables can now be configured to create chunks that align with the start and end of days, weeks, months, and years in the local time zone. This includes days and months that vary in length due to daylight savings or month of the year.
This "calendar-based chunking" is achieved by anchoring chunk ranges at a user-configurable "origin" and calculating chunk ranges using local time zones and in units of, e.g., variable-length days and months. Therefore, a day-sized chunk can sometimes be 23 or 25 hours if it covers a daylight savings change.
Currently, calendar-based chunking is guarded by a GUC, and turned off by default to preserve the existing behavior. To use calendar-based chunking, the GUC must be turned on and the hypertable configured with an Interval-type chunk interval. Existing hypertables are not affected by the GUC setting.
The default origin is set to '2001-01-01 00:00' because that is the start of a new year and a Monday, so it also aligns with the start of a week (ISO). This means that chunk intervals set to `1 week` will lead to chunks that start on a Monday and end on a Sunday.
The origin-based approach was chosen because of the flexibility it gives; setting a different origin allows, e.g., daily chunks to start at noon instead of midnight. It also makes supporting chunk intervals of multiple months easy, as opposed to a truncation-based approach (e.g., date_trunc()), which only works with singular days, weeks, or months.
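To make the effect of the origin concrete, here is a small sketch that anchors weekly and daily chunks at an origin. It is only an illustration under simplifying assumptions: `aligned_chunk` is a hypothetical helper, and naive `datetime` values stand in for local wall-clock time, so daylight savings changes are ignored.

```python
from datetime import datetime, timedelta

# Default origin quoted in the description: a Monday and the start of a year.
DEFAULT_ORIGIN = datetime(2001, 1, 1, 0, 0)


def aligned_chunk(ts, interval, origin=DEFAULT_ORIGIN):
    """Return [start, end) of the fixed-length chunk containing `ts`."""
    n = (ts - origin) // interval  # whole intervals between origin and ts
    start = origin + n * interval
    return start, start + interval


ts = datetime(2025, 3, 15, 8, 0)  # a Saturday morning

# Weekly chunks anchored at the default origin start on a Monday.
week_start, week_end = aligned_chunk(ts, timedelta(weeks=1))
print(week_start, week_end)  # 2025-03-10 00:00:00 2025-03-17 00:00:00

# Moving the origin to noon shifts daily chunk boundaries to noon.
noon_origin = datetime(2001, 1, 1, 12, 0)
day_start, day_end = aligned_chunk(ts, timedelta(days=1), origin=noon_origin)
print(day_start, day_end)  # 2025-03-14 12:00:00 2025-03-15 12:00:00
```

The weekly chunk ends just before the following Monday, so its last full day is a Sunday. A truncation-based scheme would need a separate rule for each unit, whereas here the origin alone determines the alignment.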
Implementation-wise, a challenge of the origin-based approach is to calculate a chunk range for a point in the future from the given origin. Since, e.g., a `1 month` interval varies in size depending on which month it is, simple fixed-size interval arithmetic cannot be used to calculate the Nth chunk range from the origin. Instead, the approach taken is to break down the calculations into full month, day, and sub-day units. But this only works for intervals that are non-fractional units of months, days, etc. As a fallback for arbitrary intervals, the range for a particular chunk is calculated from the origin by iteratively adding intervals until the desired point in time is covered. This iterative approach is optimized by (under-)estimating the number of intervals to the desired point, and then iterating from there.
Since the iterative approach works for all types of intervals, the question is whether this approach is good enough for all cases, and the "broken-down" calculations are not needed. However, for this change, both approaches exist together, although this decision can be revisited in a future change.
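The two strategies described above can be compared with a rough sketch. This is not the implementation in chunk_range.c; the helper names (`add_months`, `month_chunk_direct`, `month_chunk_iterative`) are hypothetical, both functions are restricted to whole-month intervals so their results can be compared, naive datetimes are used, and the iterative variant assumes the timestamp is not before the origin.

```python
import calendar
from datetime import datetime


def add_months(ts, months):
    """Add calendar months, clamping the day to the target month's length."""
    total = ts.year * 12 + (ts.month - 1) + months
    year, month0 = divmod(total, 12)
    day = min(ts.day, calendar.monthrange(year, month0 + 1)[1])
    return ts.replace(year=year, month=month0 + 1, day=day)


def month_chunk_direct(ts, origin, months):
    """Broken-down calculation: count whole months, then floor to the interval."""
    elapsed = (ts.year - origin.year) * 12 + (ts.month - origin.month)
    if add_months(origin, elapsed) > ts:
        elapsed -= 1  # ts falls before the origin's day/time within its month
    index = elapsed // months
    return add_months(origin, index * months), add_months(origin, (index + 1) * months)


def month_chunk_iterative(ts, origin, months):
    """Fallback: under-estimate the chunk index, then step forward."""
    if ts < origin:
        raise ValueError("this sketch only handles ts >= origin")
    # No month is longer than 31 days, so this estimate never overshoots.
    index = (ts - origin).days // (31 * months)
    start = add_months(origin, index * months)
    end = add_months(origin, (index + 1) * months)
    while end <= ts:
        index += 1
        start, end = end, add_months(origin, (index + 1) * months)
    return start, end


origin = datetime(2001, 1, 1)
ts = datetime(2025, 3, 15, 8, 0)
print(month_chunk_direct(ts, origin, 1))     # March 2025
print(month_chunk_iterative(ts, origin, 3))  # January through March 2025
```

Both functions always add months to the origin itself rather than to the previous boundary, so day clamping at the end of short months does not accumulate drift. The under-estimate keeps the loop to a handful of steps even when the origin is decades in the past.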
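Changes include:
- Add `interval` column to dimension catalog for INTERVAL type storage
- Add `partition_origin` parameter to `by_range()` for origin specification
- Create chunk_range.c/h for calendar-based interval calculations
- Add GUC `timescaledb.enable_calendar_chunking` (default off)
- Update dimension handling to support both fixed and calendar intervals
- Update SQL API functions to support calendar intervals
- Make caggs inherit calendar chunking from main hypertable
- Add calendar_chunking regression test
- Add calendar_chunking_integration test

The PR is divided into two commits. The first commit introduces the origin parameter: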
Add support for specifying an origin point for aligning chunk boundaries. This allows chunks to be aligned to a specific reference point instead of the Unix epoch (or zero for integer dimensions).
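Changes:
- Add interval_origin column to dimension catalog table
- Add origin parameter to create_hypertable(), set_chunk_time_interval(), set_partitioning_interval(), and add_dimension() SQL functions
- Add time_origin and integer_origin columns to dimensions view
- Modify chunk boundary calculation to use origin when specified
- Add type validation for origin (integer vs timestamp compatibility)
- Add chunk_origin test for origin parameter functionality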
Disable-check: commit-count
Closes: #1500
Example usage: