Gen1 filesystem and mount strategy
Purpose
Define the storage, mount, and tuning strategy for the “nine files = nine dimensions” virtual DPU model. Optimize for low jitter, predictable write latency, and simple, reproducible boot‑time setup.
Filesystem selection
- Default: ext4 without journal on loopback
  - Why: Ubiquitous and predictable, with low metadata overhead once the journal is disabled.
  - Tradeoff: Volumes are ephemeral, so crash safety is not required; we give up durability for performance.
- Ultra‑low jitter: tmpfs
  - Use when: A dimension demands RAM‑speed and minimal IO variance.
  - Caveat: Enforce size ceilings to prevent OOM.
- Avoid for Gen1: btrfs, XFS, ZFS (excellent features, but we explicitly avoid snapshots and extra metadata complexity).
Creation and mounts
- Image creation:
  - Preallocate: fallocate -l 4G /var/lib/vcg/d1.img
  - Loop device: losetup -fP --show --direct-io=on /var/lib/vcg/d1.img (--show prints the allocated /dev/loopX)
  - Format (ext4 no journal): mkfs.ext4 -F -E lazy_itable_init=0,lazy_journal_init=0 -O ^has_journal /dev/loopX
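- Mount options (ext4):
  - Options: noatime,nodiratime,barrier=0,discard
  - Rationale: Minimize metadata writes, disable barriers (safe only for ephemeral volumes), and pass TRIM through so freed blocks are reclaimed on SSD/NVMe. On a loop device, discards punch holes in the backing image.
  - Note: data=writeback is a journaling mode. It has no effect on a filesystem created without a journal, and kernels may refuse data= options on such a filesystem, so add it only if the journal is kept.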
- Mount tmpfs (when selected):
  - mount -t tmpfs -o size=2G,mpol=prefer:NUMA_NODE tmpfs /vcg/dX
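The creation and mount steps above can be sketched as one provisioning script. This is a minimal sketch, not a definitive implementation: the function names, the `DRY_RUN`, `IMG_DIR`, `MNT_ROOT`, and `EXT4_OPTS` variables, and the `run` helper are illustrative assumptions. It defaults to a dry run that only prints commands, and it leaves out `data=writeback` on the assumption that a journaling mode does not apply to a journal-less filesystem.

```shell
#!/bin/sh
# Sketch: provision one dimension volume. Dry run by default: commands are
# printed, not executed. Set DRY_RUN=0 and run as root to apply them.
set -eu

DRY_RUN="${DRY_RUN:-1}"
IMG_DIR="${IMG_DIR:-/var/lib/vcg}"   # image directory used in this document
MNT_ROOT="${MNT_ROOT:-/vcg}"         # mount root used in this document
# ASSUMPTION: journal-less ext4, so no data= option is passed.
EXT4_OPTS="${EXT4_OPTS:-noatime,nodiratime,barrier=0,discard}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# provision_ext4 <n> [size]: preallocate, attach, format, and mount d<n>.
provision_ext4() {
    local n size img mnt dev
    n="$1"; size="${2:-4G}"
    img="$IMG_DIR/d$n.img"; mnt="$MNT_ROOT/d$n"

    run fallocate -l "$size" "$img"
    if [ "$DRY_RUN" = "1" ]; then
        dev="/dev/loopX"   # placeholder; the real name comes from --show
        run losetup -fP --show --direct-io=on "$img"
    else
        dev="$(losetup -fP --show --direct-io=on "$img")"
    fi
    run mkfs.ext4 -F -E lazy_itable_init=0,lazy_journal_init=0 \
        -O '^has_journal' "$dev"
    run mkdir -p "$mnt"
    run mount -t ext4 -o "$EXT4_OPTS" "$dev" "$mnt"
}

# provision_tmpfs <n> <numa_node> [size]: RAM-backed volume with a size cap.
provision_tmpfs() {
    local n node size mnt
    n="$1"; node="$2"; size="${3:-2G}"
    mnt="$MNT_ROOT/d$n"
    run mkdir -p "$mnt"
    run mount -t tmpfs -o "size=$size,mpol=prefer:$node" tmpfs "$mnt"
}
```

For example, `provision_ext4 1` prints the five commands for d1, and `provision_tmpfs 2 0 1G` prints a 1G tmpfs mount for d2 that prefers NUMA node 0. Capturing the device name from `losetup --show` avoids guessing which /dev/loopX was allocated.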
Performance tuning
- I/O scheduler:
  - NVMe: echo none > /sys/block/nvme0n1/queue/scheduler
  - SATA SSD: echo mq-deadline > /sys/block/sdX/queue/scheduler
- Readahead:
  - Dimension mounts: Set a small readahead to reduce cache pollution for random IO: blockdev --setra 64 /dev/loopX (64 sectors of 512 bytes = 32 KiB).
- NUMA and CPU affinity:
  - Co‑locate IO and compute: Run each group (D1–D3, D4–D6, D7–D9) under numactl --cpunodebind=N --membind=N, where N is that group's NUMA node.
  - Pin cores: Assign non‑overlapping CPU sets per dimension to stabilize cadence.
- Barrier notes:
  - barrier=0: Acceptable because volumes are ephemeral and reset every boot. Do not use for persistent data.
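The tuning steps above could be wrapped in a few helpers, sketched below. The function names are illustrative, and two assumptions are baked in: every sd* device is treated as a SATA SSD, and the three dimension groups are taken to sit on NUMA nodes 0, 1, and 2. Neither mapping is stated in the text, so adjust both to the actual hardware.

```shell
#!/bin/sh
# Sketch: per-device tuning helpers. Dry run by default (prints commands).
set -eu

DRY_RUN="${DRY_RUN:-1}"

# pick_scheduler <disk>: scheduler for a backing disk, by device-name prefix.
# ASSUMPTION: every sd* device is a SATA SSD.
pick_scheduler() {
    case "$1" in
        nvme*) echo none ;;
        sd*)   echo mq-deadline ;;
        *)     return 1 ;;   # unknown class: keep the kernel default
    esac
}

# numa_node_for <dim>: ASSUMPTION: groups D1-D3, D4-D6, D7-D9 sit on
# NUMA nodes 0, 1, 2 respectively.
numa_node_for() {
    echo $(( ($1 - 1) / 3 ))
}

# tune_device <disk> <loopdev>: set the scheduler and a 64-sector readahead.
tune_device() {
    local disk loopdev sched
    disk="$1"; loopdev="$2"
    if sched="$(pick_scheduler "$disk")"; then
        if [ "$DRY_RUN" = "1" ]; then
            echo "echo $sched > /sys/block/$disk/queue/scheduler"
        else
            echo "$sched" > "/sys/block/$disk/queue/scheduler"
        fi
    fi
    if [ "$DRY_RUN" = "1" ]; then
        echo "blockdev --setra 64 $loopdev"
    else
        blockdev --setra 64 "$loopdev"
    fi
}

# numa_wrap <dim> <command...>: print the command wrapped in numactl so CPU
# and memory stay on the dimension's node.
numa_wrap() {
    local dim node
    dim="$1"; shift
    node="$(numa_node_for "$dim")"
    echo "numactl --cpunodebind=$node --membind=$node $*"
}
```

For example, `tune_device nvme0n1 /dev/loop0` prints the scheduler and readahead commands for that pair, and `numa_wrap 4 ./worker` prints a numactl line bound to node 1 (`./worker` is a placeholder command).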
Ephemeral discipline and snapshots
- Boot‑fresh invariant:
  - Recreate images: At boot, reformat or verify images; do not restore previous state.
  - Logs: Default to /run (tmpfs) with optional mirroring to persistent storage.
- Snapshot‑on‑failure (opt‑in):
  - Flow: On failure, remount /vcg/dX read‑only, then copy its backing image to /var/log/vcg/snapshots/<timestamp>-dX.img.
  - Toggle: Environment variable or config flag per dimension.
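A minimal sketch of the snapshot flow above follows. The toggle name `VCG_SNAPSHOT_D<n>` is hypothetical (the text only says an environment variable or config flag exists), and the sketch assumes the snapshot is a copy of the backing image at /var/lib/vcg/dN.img, which applies to loopback ext4 volumes but not to tmpfs. It prints commands unless `DRY_RUN=0`.

```shell
#!/bin/sh
# Sketch: opt-in snapshot of a failed dimension. Dry run by default.
set -eu

DRY_RUN="${DRY_RUN:-1}"
IMG_DIR="${IMG_DIR:-/var/lib/vcg}"
MNT_ROOT="${MNT_ROOT:-/vcg}"
SNAP_DIR="${SNAP_DIR:-/var/log/vcg/snapshots}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# snapshot_enabled <n>: true when VCG_SNAPSHOT_D<n>=1.
# HYPOTHETICAL toggle name, chosen for illustration only.
snapshot_enabled() {
    local val
    case "$1" in [1-9]) ;; *) return 1 ;; esac   # nine dimensions: 1..9
    eval "val=\${VCG_SNAPSHOT_D$1:-0}"
    [ "$val" = "1" ]
}

# snapshot_on_failure <n> [timestamp]: freeze the mount, then copy the image.
snapshot_on_failure() {
    local n ts
    n="$1"
    snapshot_enabled "$n" || return 0   # opt-in: do nothing unless enabled
    ts="${2:-$(date -u +%Y%m%dT%H%M%SZ)}"
    run mount -o remount,ro "$MNT_ROOT/d$n"
    run mkdir -p "$SNAP_DIR"
    # ASSUMPTION: the snapshot is a sparse copy of the backing image file.
    run cp --sparse=always "$IMG_DIR/d$n.img" "$SNAP_DIR/$ts-d$n.img"
}
```

With `VCG_SNAPSHOT_D1=1` set, `snapshot_on_failure 1` prints the remount, mkdir, and copy commands for d1; for a dimension without the toggle it prints nothing. Remounting read‑only before the copy keeps the image from changing while it is read.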
Validation checklist
- Filesystem health:
  - Check: Run fsck.ext4 -n (read‑only check) on each loop device before mounting it; results from a mounted filesystem are unreliable.
  - Telemetry: Export mount latency, IO errors, and readahead settings.
- Jitter probe:
  - Test: Microbenchmark random‑write latency with fio, with and without RTD token gates.
  - Pass: p95/p99 within target bounds relative to baseline.
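The jitter probe above might be scripted along these lines. This is a sketch under stated assumptions: the block size, job size, runtime, and I/O engine are illustrative choices, and the pass rule (measured latency no more than a given percentage above baseline) is one possible reading of "within target bounds"; the text does not define the bound. On tmpfs mounts, `--direct=1` may need to be dropped.

```shell
#!/bin/sh
# Sketch: build the fio invocation and apply a pass/fail bound.
set -eu

# fio_cmd <n>: print a random-write latency probe against /vcg/d<n>.
# ASSUMPTION: block size, job size, runtime, and engine are illustrative.
fio_cmd() {
    echo "fio --name=jitter-d$1 --directory=/vcg/d$1" \
         "--rw=randwrite --bs=4k --size=256M --ioengine=psync --direct=1" \
         "--time_based --runtime=30 --percentile_list=95:99" \
         "--output-format=json"
}

# within_bound <measured_ns> <baseline_ns> <max_regress_pct>:
# succeeds when measured <= baseline + baseline * pct / 100 (integer math).
within_bound() {
    [ "$1" -le $(( $2 + $2 * $3 / 100 )) ]
}
```

For example, run the command printed by `fio_cmd 1` once with and once without the gates, read the p95 and p99 completion latencies from the JSON output, and compare them with `within_bound "$measured" "$baseline" 10`. Integer nanoseconds keep the comparison free of floating-point tooling.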