-
Notifications
You must be signed in to change notification settings - Fork 11
ephemeral: Replace systemd.volatile=overlay with fine-grained mounts #189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request replaces the broad systemd.volatile=overlay with more granular, fine-grained tmpfs and overlayfs mounts for /var and /etc respectively. This is a well-reasoned change that solves a real-world issue with nested overlayfs when using podman inside the ephemeral VM. The implementation using systemd credentials injected via SMBIOS is clean and effective. The addition of an integration test to verify the new mount layout is excellent. My feedback is minor, focusing on improving code consistency and fixing a small typo in a comment.
fe6c2e0 to
b3cd27b
Compare
5516e29 to
cbec809
Compare
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request refactors the ephemeral VM setup to use fine-grained mounts for /etc and /var instead of a single systemd.volatile=overlay. This change is crucial for allowing Podman to use overlayfs for container storage inside /var/lib/containers, resolving previous 'too many levels of symbolic links' errors. The implementation introduces a new cpio module to embed systemd units directly into the initramfs, ensuring compatibility with older systemd versions. Integration tests have been added to verify the new mount layout. Overall, the changes are well-structured and address a significant functional limitation.
Instead of using systemd.volatile=overlay which overlaid all of / with a single tmpfs-backed overlayfs, set up /etc and /var separately: - /etc: overlayfs with tmpfs upper (transient changes, lost on reboot) - /var: real tmpfs with content copied from image (not overlayfs) The key benefit is that /var is now a real tmpfs, allowing podman to use overlayfs for container storage inside /var/lib/containers. With the old approach, the nested overlayfs caused "too many levels of symbolic links" errors. Implementation: The initramfs units are embedded in a CPIO archive that gets appended to the existing initramfs. This uses the Linux kernel's ability to concatenate multiple CPIO archives. Services running in initramfs (before switch-root): - bcvk-etc-overlay.service: Sets up overlay on /sysroot/etc using a bind-mounted lowerdir to avoid self-referential mount issues on older kernels. Uses index=off,metacopy=off for virtiofs compat. - bcvk-var-ephemeral.service: Copies /sysroot/var to tmpfs and bind mounts it back. - bcvk-copy-units.service: Copies bcvk-journal-stream.service to /run/systemd/system/ for systemd <256 compatibility. The /run tmpfs is preserved across switch-root via MS_MOVE. For systemd 256+, the journal-stream unit is created via SMBIOS credentials (systemd.extra-unit.*). For older versions like CentOS Stream 9 (systemd 252), the unit is copied from the initramfs since credential-based unit creation isn't supported. The execute service target is changed from default.target to multi-user.target with ConditionPathExists=!/etc/initrd-release to ensure it runs after switch-root, not in the initramfs. This is Phase 1 of issue bootc-dev#22, making ephemeral VMs more bootc-like. SELinux is still disabled (selinux=0); Phase 2 will add composefs support to enable proper SELinux labeling. Closes: bootc-dev#22 (Phase 1) Assisted-by: OpenCode (Sonnet 4) Signed-off-by: Colin Walters <walters@verbum.org>
cbec809 to
d616cc6
Compare
| Type=oneshot | ||
| RemainAfterExit=yes | ||
| TimeoutStartSec=60 | ||
| ExecStart=/usr/bin/mkdir -p /run/var-ephemeral |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was wondering if it would make sense to add an explicit tmpfs mount with a reasonable default size limit (e.g. ~2G), possibly configurable via a --var-size-limit option or something. Wouldn't a bootc image with a large /var (e.g. 10 GB) end up consuming that much RAM and potentially destabilize the VM or host.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RAM usage is capped by qemu which comes from bcvk ephemeral run --memory. The tmpfs usage is inside that VM. There's nothing new here in that respect, it can't change anything about how we interact with the host.
As far as inside the VM and sizing of tmpfs; per https://www.kernel.org/doc/html/v6.3/filesystems/tmpfs.html
The limit of allocated bytes for this tmpfs instance. The default is half of your physical RAM without swap.
Though this relates to 3631ab0 and ...actually, we should be able to tweak the code here to also account for swap space when sizing the tmpfs and then just drop that code entirely.
(But really for that the real fix is again in containers/container-libs#144 so we aren't serializing a whole tarball into temporary memory)
There's really only heuristics here; if we allow the tmpfs to grow to 100% of RAM as the doc says it can cause failures in low memory situations.
In the end this PR isn't regressing anything, because we already had a tmpfs for / which was sized in the same way, we just shift that to /etc and /var.
But there's definitely followups we could do here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I was wrong, systemd defaults to different sizes for its tmpfs, and previously (before this PR) we used 25% not the 50% default https://github.com/systemd/systemd/blob/0910cb93244f4546a9c29158a86ec81efc790755/src/basic/mountpoint-util.h#L31
gursewak1997
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Instead of using systemd.volatile=overlay which overlaid all of / with a single tmpfs-backed overlayfs, set up /etc and /var separately:
The key benefit is that /var is now a real tmpfs, allowing podman to use overlayfs for container storage inside /var/lib/containers. With the old approach, the nested overlayfs caused "too many levels of symbolic links" errors.
Implementation uses systemd credentials to inject units that run in the initramfs before switch-root:
Both units use ConditionPathExists=/etc/initrd-release to only run in the initramfs context.
This is Phase 1 of issue #22, making ephemeral VMs more bootc-like. SELinux is still disabled (selinux=0); Phase 2 will add composefs support to enable proper SELinux labeling.
xref: #22 (Phase 1)
Assisted-by: OpenCode (Sonnet 4)