ADR-0028: High-volume tables are range-partitioned ahead of need

ODD Platform creates range partitions ahead of need: one job runs at boot (under a Postgres advisory lock) and nightly (under ShedLock), creating double-width partitions for each high-volume table.

Status

Accepted. Reconstructed from the codebase on 2026-05-30; the decision is live in the source today.

Context

The activity audit trail and the data-collaboration message table grow without bound and are append-heavy. PostgreSQL range partitioning keeps them manageable, but only if a partition always exists for the date a row lands on; a missing partition means an insert failure. So partitions must be created ahead of need, must be created once across a multi-replica cluster (not once per replica), and must be created for several tables without the job hard-coding each one. There are also two distinct moments at which coverage can lapse: right after a deployment or scale-up (a fresh replica may start against a schema whose partitions have not been extended), and on the rolling daily boundary.

Decision

The same partition-creation job runs at two triggers — at application boot and nightly — each serialised so only one creator runs at a time across the cluster, but by a different single-runner mechanism appropriate to each trigger.

  • At boot, a @PostConstruct init() acquires a Postgres advisory lock (partition.advisory-lock-id, default 90) via the leader-election manager, then creates any missing partitions. The advisory lock serialises concurrent replica boots, and running at startup means a freshly deployed or scaled replica gets coverage immediately rather than waiting for the next nightly run.

  • Nightly, a @Scheduled(cron = "0 1 0 * * *") run() is guarded by ShedLock (@SchedulerLock(name = "partitionCreationJob") plus a LockAssert.assertLocked() check), so exactly one replica performs the daily extension even though every replica's scheduler fires.

Both triggers call the same per-table creation logic, which uses double-width, single-cadence forward coverage: each created partition spans 2 × partitionDaysPeriod days while a new partition is appended every partitionDaysPeriod days, walking from the last existing partition up to now + partitionDaysPeriod. The 2:1 width-to-cadence ratio guarantees a partition always exists for any near-future insert — there is no boundary window where a row has nowhere to land.
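
The walk described above can be sketched in plain Java. This is a minimal illustration, not the code in AbstractPartitionManager: the class name, the Range record and the plan(...) signature are hypothetical, `now` stands in for the baseline date, and the sketch only computes the date ranges a run would request rather than issuing any DDL.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

final class PartitionWindowSketch {

    /** Date range one created partition would span (hypothetical type). */
    record Range(LocalDate from, LocalDate to) {}

    /**
     * Plans the partitions a single run would create.
     *
     * @param lastPartitionDate where the walk starts (assumed to come from the last existing partition)
     * @param now               the run date, used as the baseline
     * @param periodDays        stands in for partitionDaysPeriod
     */
    static List<Range> plan(LocalDate lastPartitionDate, LocalDate now, int periodDays) {
        if (periodDays <= 0) {
            throw new IllegalArgumentException("periodDays must be positive");
        }
        // Coverage target: one cadence step beyond the baseline.
        LocalDate bufferDate = now.plusDays(periodDays);
        List<Range> planned = new ArrayList<>();
        LocalDate cursor = lastPartitionDate;
        while (cursor.isBefore(bufferDate)) {
            // Double width: each range spans 2 x periodDays.
            planned.add(new Range(cursor, cursor.plusDays(periodDays * 2L)));
            // Single cadence: the cursor advances by only 1 x periodDays.
            cursor = cursor.plusDays(periodDays);
        }
        return planned;
    }

    public static void main(String[] args) {
        LocalDate now = LocalDate.of(2026, 1, 1);
        // Last partition started 10 days ago, with a 30-day period.
        plan(now.minusDays(10), now, 30).forEach(System.out::println);
    }
}
```

With these inputs the loop plans two ranges, one starting 10 days before `now` and one starting 20 days after it. When the cursor is already at or past `now + periodDays`, the loop body never runs, so repeated boot and nightly runs plan nothing new.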

Two further choices complete the design:

  • List-injection extensibility. The job consumes List<PartitionManager> — every Spring @Component extending AbstractPartitionManager is discovered automatically. ActivityTablePartitionManager registers unconditionally; MessageTablePartitionManager registers only when Data Collaboration is enabled (@ConditionalOnDataCollaboration). Adding a partitioned table is an add-a-class change.

  • Continue-on-failure across tables. A failure creating one table's partitions is caught per-manager, logged at ERROR, and the loop proceeds to the next table — the job maximises coverage across all tables rather than aborting on the first failure.
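
The two choices above combine into a loop over an injected list with a per-manager try/catch. The sketch below shows that shape without Spring; the PartitionManager interface here is a hypothetical stand-in (the real contract may differ), java.util.logging replaces the project's logger, and returning the failed table names is an addition made only so the behaviour is observable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

final class ContinueOnFailureSketch {

    private static final Logger LOG = Logger.getLogger(ContinueOnFailureSketch.class.getName());

    /** Hypothetical stand-in for the partition manager contract. */
    interface PartitionManager {
        String tableName();
        void createPartitionIfNotExists() throws Exception;
    }

    /**
     * Runs every manager in order. A failure in one manager is logged and
     * skipped so that later tables still get their partitions.
     *
     * @return names of the tables whose partition creation failed
     */
    static List<String> createAll(List<PartitionManager> managers) {
        List<String> failed = new ArrayList<>();
        for (PartitionManager manager : managers) {
            try {
                manager.createPartitionIfNotExists();
            } catch (Exception e) {
                // Log and move on instead of aborting the whole job.
                LOG.log(Level.SEVERE, "Partition creation failed for " + manager.tableName(), e);
                failed.add(manager.tableName());
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        PartitionManager failing = new PartitionManager() {
            public String tableName() { return "activity"; }
            public void createPartitionIfNotExists() { throw new IllegalStateException("simulated failure"); }
        };
        PartitionManager healthy = new PartitionManager() {
            public String tableName() { return "message"; }
            public void createPartitionIfNotExists() { System.out.println("created partitions for message"); }
        };
        System.out.println("failed tables: " + createAll(List.of(failing, healthy)));
    }
}
```

In a Spring context the list would be supplied by injection rather than built by hand, so adding a table means adding another implementation and leaving the loop untouched.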

Consequences

  • Inserts into the activity and message tables always find a partition: coverage is created ahead of cadence at both boot and nightly, and the 2× overlap absorbs the gap between runs. A freshly deployed/scaled replica is covered at startup, not only after the next nightly cron.

  • Two single-runner mechanisms coexist by design — an advisory lock at boot (every replica boots and contends; the lock picks one) and ShedLock for the scheduled run (cluster-wide election for the cron). They use distinct primitives because the triggers differ; the advisory-lock id (90) lives in its own per-subsystem namespace.

  • The job only creates partitions; it never drops them. There is no retention or DROP path here, so partitions and their data accumulate until something else removes them. Reclaiming space is a separate concern: the housekeeping subsystem drops empty past partitions (see ADR-0045), and dropping non-empty aged partitions for retention is left to operators.

  • Continue-on-failure means a single table's partition-creation failure surfaces only in ERROR logs, not as a request failure — coverage for the failing table can silently lapse until a later successful run. (A connection-level failure in the boot init() does throw and fail startup; a per-table failure inside the loop is caught and skipped.)

Evidence

  • odd-platform-api/.../partition/PostgreSQLPartitionCreationJob.java:26-27 declares @Value("${partition.advisory-lock-id}") private long activityLockId; at :29-38, @PostConstruct init() opens leaderElectionManager.acquire(activityLockId, false) and creates partitions at boot under the advisory lock.

  • odd-platform-api/.../partition/PostgreSQLPartitionCreationJob.java:40-43 combines @Scheduled(cron = "0 1 0 * * *") + @SchedulerLock(name = "partitionCreationJob", lockAtLeastFor = "10m", lockAtMostFor = "10m") + LockAssert.assertLocked(): the nightly ShedLock-guarded run.

  • odd-platform-api/.../partition/PostgreSQLPartitionCreationJob.java:22 declares private final List<PartitionManager> partitionManagers; at :53-61, createPartitionIfNotExists(...) wraps each manager in try { … } catch (Exception e) { log.error(...); } (continue-on-failure).

  • odd-platform-api/.../partition/manager/AbstractPartitionManager.java:35 new TablePartition(lastPartitionDate, lastPartitionDate.plusDays(partitionDaysPeriod * 2L)) (double width); :30 bufferDate = baseline.plusDays(partitionDaysPeriod), :33 while (lastPartitionDate.isBefore(bufferDate)), :37 lastPartitionDate = lastPartitionDate.plusDays(partitionDaysPeriod) (single cadence).

  • odd-platform-api/.../partition/manager/ActivityTablePartitionManager.java:9-13 (@Component, reads ${odd.activity.partition-period:30}) and .../MessageTablePartitionManager.java:16-21 (@Component @ConditionalOnDataCollaboration, reads ${datacollaboration.message-partition-period:30}), both extending AbstractPartitionManager.

  • odd-platform-api/src/main/resources/application.yml:197-198 sets partition: advisory-lock-id: 90; :212-213 sets odd.activity.partition-period: 30.
