feat: adding firecracker micro-VM support #1382
Reference
forgejo/runner!1382
Firstly, I would like to apologise for the size of this PR. This is something we needed internally with some urgency, otherwise we would have engaged in more discussion first. My hope is that it has the redeeming traits of 1) being a separate fairly isolated feature that can be marked as 'experimental' in the release notes, and 2) actually already being used in production (by ourselves).
This was a lot of work to get working, but now it is serving us pretty solidly. I've also taken pains to try to minimize the number of "WTF" moments in the code, but I am sure there are still some there. I would love to get this merged in, but it does mean I need to rely on someone with better knowledge to point out the remaining WTF-points. Once pointed out I will endeavor to fix them!
What this adds
A new firecracker label scheme that runs CI jobs inside per-job Firecracker microVMs. Each job gets an isolated VM with its own filesystem, network, and (via jailer) cgroup memory limits.
Needs either bare metal, or nested virtualization support.
Why?
We developed this because of the pain of needing to use both Docker and Tailscale within our CI workflows. This is relatively easy in VMs, but doing it in containers is somewhere between tricky and impossible.
This also gives:
Previous work
There was previous discussion of adding Firecracker support in #152, coupled with a design change. This PR keeps the design the same, and starts VMs much like LXC containers. The runner then connects into the VMs via SSH to execute the jobs.
Interesting features
There is certainly room for further development, but this was the minimum-viably-useful implementation for our purposes.
Label format
Where small maps to a named profile in the runner config:
Workflows then use runs-on: small to select a VM size.
Platform support
Firecracker is Linux-only. The package compiles on all platforms via stub files (firecracker_other.go, memory_other.go) that return clear errors. No existing behaviour is changed on non-Linux systems.
Additional fixes
There are some additional issues that came up while working on Firecracker support. They are also in this PR, but we can likely split them out if need be (the following 3 points were AI generated, because it is more thorough than I am).
Bug: Cleanup skipped on job timeout
act/runner/job_executor.go:168: Post-steps (cache save, etc.) only run when a job is canceled, not when it times out. Changed ctx.Err() == context.Canceled to ctx.Err() != nil.
Improvement: Set TERM=dumb
act/runner/run_context.go:1524: Adds TERM=dumb to job environment. Standard CI practice (GitHub Actions does this). Prevents ANSI escape codes polluting logs.
Code quality: PickLabel() helper
internal/pkg/labels/labels.go:89-100: New method that finds which label name matches a runs-on value. Used by Firecracker but generic enough to be useful elsewhere.
Adam's AI Transparency Statement
See also: AI Agreement. I have also taken the time to read the preceding conversation, and the various views shared.
The code within this PR is the result of extensive planning between myself and Opus 4.5, and was written by the same (in this specific case, this was a lot of work). In doing this I exercised my 20+ years of experience as a software & infrastructure engineer to guide this tool with the attention and diligence that I would guide a junior engineer. I stake my name and reputation to the code I am submitting, in the context of whatever comments I have made above.
Every word written in the associated issues and PRs was written by my own hand (unless indicated otherwise). Every sentiment I express is my own.
Regarding licensing, I have consulted with the Anthropic consumer terms, which state:
I propose this shows that I have made "an effort to verify that you can submit this under the license of the repository", as required by section 2 of the AI Agreement. Moreover, I think it shows that I can indeed do so.
My hope in making this statement is to show that I can responsibly and honestly contribute to the Forgejo project, as I would very much like to continue to do. I will attach this statement to all PRs to which it applies.
I agree to the Forgejo Developer Certificate of Origin. Let me know if there is a formal signing procedure I need to follow.
I'm not sure what you're expecting when you drop a 6000 LOC PR without any participation in discussion about the feature, and without any technical design discussion. Especially considering that contributors (myself included) have expressed clearly in that feature request that we don't believe the current project resources can adequately support the complexity added by another execution backend.
"This is something we needed internally with some urgency" sounds a lot like you've developed this feature and now you want to offload the maintenance to an open source project.
As an experienced engineer, please consider this from my angle, and help me understand what you expect to happen with this PR?
Hello @mfenniak,
This PR certainly wasn't submitted with any expectation whatsoever, I'm sorry if that did not come across.
Given we had developed it already, I thought it sensible to present it as something we had. We developed this 100% in the knowledge that if the community did not wish to merge it, then it would fall on us to maintain it. I want to be super clear that at no point did we want to "offload" this. We are totally open to hearing a "no", or an "interesting proof of concept, but needs more thought," or anything else really.
I confess I had not seen the issue you linked to, I had only seen 152 in the forgejo repository where it appeared that there was some interest but that it ultimately didn't go anywhere. I also saw a little interest on HN so wanted to get the PR out as it sounded like something others may have use of, merged or otherwise.
I had hoped that my other discussion would have shown my intention to be a positive and contributing member to this community.
I expected people to contribute whatever they were willing to give on the matter, and no more. We are all volunteers here, after all.
I know I'm new here, but I have been making a strong effort to be receptive to feedback and respond above and beyond what was asked. I started what I think has been a very productive conversation on the environments feature, produced mock-ups, and made narrated demo videos.
Please also consider this from my angle: a new member of the community, who is already clearly responsive to feedback, who then receives a fairly hostile and shaming response in a public forum 😢
I of course accept the decision that the community lacks the resources to support another execution backend. If anything this PR may help demonstrate this with its MVP solution of a sizable 6000 LOC.
I do however see there is discussion of a plugin system. I would therefore hope that the presence of this PR could positively contribute to the design of such plugin system in future.
Until then we will maintain an internal fork. Should anyone wish this to be publicly available just let me know and I will see what I can arrange.
@adamcharnock wrote in #1382 (comment):
I'm not a native speaker and depending on the day and topic I might struggle to phrase things properly. I also don't want to speak for others. When in doubt, please ask.
I appreciate your efforts and your responsiveness. I found the discussion about environments exemplary. I wish it would be like that every time. Unfortunately, this PR did not start well. To a certain extent, it's the project's fault because there's no clear communication about what's expected and what should happen before opening a PR. Finding issues is hard because they are scattered all over the place. There are multiple similar issues and it's not reasonable to expect that people do PhD level research before posting something. I prefer if people do things wrong instead of not at all and I'm happy to guide them in the right direction.
The initial PR description had some problems. Not discussing the feature on purpose together with the expressed urgency and the expectation to land this PR can be interpreted as a code drop. I don't point that out to shame you or to assign blame, but to help you figure out where the problem might have originated. From my side, you haven't lost any reputation or standing and I'd be happy if you would continue working with us.
About the PR itself. Again, only speaking for myself.
The feature itself is interesting. As already indicated, it should ultimately be a plug-in. I want to stress that it shouldn't be interpreted as a lack of interest or an effort to get rid of annoying feature requests. Having isolated back-ends helps with testing and maintenance. It should also enable others to extend Forgejo Runner without being dependent on its maintainers.
The PR itself is very useful because it helps shape a potential plug-in API. Unfortunately, I expect that it won't be easy to get it right. There are many topics in this PR that have to be resolved before it can be turned into a reusable feature. Not because the PR was bad, but because Forgejo Runner didn't have to deal with these things before: SSH, CPU/memory limits, operating system images, extended label configuration (we have identified that need before). Would you be willing to help? If not, this PR by itself is most likely dead. Again, not because it's bad, but because it's unlikely to land as is. Which would be a shame.
While I have worked with various VM technologies in the past, I haven't yet worked with Firecracker. So it will take me some time to get up to speed. Apart from that, I'm currently working on something else, so it might take some weeks before I can offer substantial technical feedback.
On condition that everybody agrees with the general direction and is willing to help: Does anybody have an idea where and how to start? 😅
Re-reading my PR, I can see that I said "I would love to get this merged in", which I imagine could have come off as pressuring or a wish to offload. If that was the case, I apologize.
My intention was only to indicate enthusiasm, and an overall positive feeling around what we have working here. It was pretty exciting to get it all functioning after quite a while of going around the houses trying to get a lot of different jigsaw pieces to fit together.
6k lines is clearly a lot. FWIW, the bulk of the code (in firecracker/) is about 1500 lines, plus some additions elsewhere. The rest is a lot of tests and documentation.
EDIT: I think we crossed posts @aahlenst, thank you for taking the time. It is late here, and I'll catch up properly in the morning.
I've been very pleased with the discussion, effort, and time put into https://codeberg.org/forgejo/discussions/issues/440, and I absolutely don't want to lose the possibility of a new contributor in the community with ambitious goals. And of course, I'm technically interested in this specific feature as well. Let's figure out a plan forward.
My first thought is that the firecracker support could be maintained in a fork, for the short-to-medium term. This would accomplish a few things:
We could host the fork in the forgejo-contrib org in Codeberg, which is an area where Forgejo's Code of Conduct is applied by the moderation team, but no technical maintenance is taken on by Forgejo contributors as a whole.
We can link to the project in various places, such as the runner's README, the Forgejo monthly report, the Forgejo docs, and describe it as an experimental fork. Along with those links, make an explicit request for feedback asking people to share their experience even if it's uninteresting, perhaps onto a pinned issue on the repo.
Overall, I think we'd get the opportunity to learn and defer all other decisions to a time when we've learned more.
Thank you for the responses everyone, I really appreciate it. I'd be very happy for this to live on as a fork or as a plug-in. My attention has been taken elsewhere over the last week on to some other tasks, and it'll probably stay that way for another week or so, but I will come back here and pick this up.
Until then, I'll occasionally pop in here and drop a comment just to leave things that I am learning about this new system as I go. The first of which is:
Learning: This does not support the containers: directive. Eg: