Feature Request: Pluggable Backend Architecture #107
forgejo/forgejo-actions-feature-requests#107
How to use this feature request
First hand experience
Needs and benefits
Forgejo Runner currently supports Docker, Host, and LXC backends, all hardcoded into the runner. There is community interest in additional backends (Kubernetes, Podman, Firecracker, Proxmox, etc.), but each one would require modifying core files and adding backend-specific knowledge to the runner codebase.
A plugin architecture would:
Feature Description
Each backend is a separate binary communicating with the runner over gRPC, using something like HashiCorp go-plugin. The runner would launch the plugin binary when needed, and the plugin would implement a defined gRPC interface to manage execution environments.
How it would work:
The runner discovers plugins through config:
When a job's `runs-on` label matches a plugin scheme, the runner launches the corresponding binary. The plugin handles the full environment lifecycle (create, exec, copy files, remove). The runner communicates via a gRPC protocol.
gRPC interface sketch:
The current `Container` interface has 15 methods, but several are Docker-internal (`ConnectToNetwork`, `ReplaceLogWriter`, `Pull` as a separate step). A plugin protocol could be simpler, roughly 9 RPCs:
- `Capabilities`: declares what the backend supports (Docker actions? own networking? default paths?). This replaces the scattered `IsHostEnv()`/`IsK8sEnv()` identity checks with a single upfront declaration.
- `Create`: provisions the environment (image pull folded in). Returns metadata the runner needs (paths, OS info). Service containers are passed here so the plugin handles them in whatever way makes sense for its platform.
- `Start`: boots the environment.
- `Exec`: runs a command, streams stdout/stderr back.
- `CopyIn`/`CopyOut`: stream tar data for file transfers.
- `UpdateEnv`: reads an env file inside the environment, returns parsed variables.
- `IsHealthy`: health check.
- `Remove`: tears down everything.
On the runner side, a gRPC client adapter would implement `ExecutionsEnvironment` by translating calls to plugin RPCs. This adapter is the only runner code that knows about go-plugin. Plugin authors write a standalone binary with its own `go.mod` and dependencies; the runner never imports K8s/Podman/etc. code.
Docker and Host backends would stay built-in. The plugin system is additive.
Approach 2: External orchestration + host mode
Instead of the runner managing execution environments, external tooling creates the environment and runs the runner inside it in host mode.
For example with Kubernetes: a K8s operator watches for pending jobs, creates a pod with the runner binary, the runner starts in host mode, executes the job, and the pod gets cleaned up.
This is compelling for remote infrastructure (OpenStack, Proxmox) where controlling processes from outside a VM is unreliable or impossible — it's much easier to control a job from inside.
Trade-offs: Simpler for the runner (just host mode), natural for remote VMs. But more moving parts in deployment, service containers become the orchestrator's responsibility, and it loses the "configure a label and go" UX.
Both could coexist
Incremental steps that help either way
Some small refactoring would improve extensibility regardless of which approach is chosen:
- Replace `IsHostEnv()`/`IsK8sEnv()` identity checks with capability queries: makes adding any backend safer and avoids the "missed a guard" class of bugs.
- Open up label schemes so that `labels.Parse()` and `PickPlatform()` don't reject unknown schemes.
- Decouple `NewContainerInput` from Docker-specific types (`nat.PortSet`/`nat.PortMap`).
- Replace `GetLXC()`/`GetK8s()` on `ExecutionsEnvironment` with something generic (e.g. `BackendName() string`).
What needs to happen before a feature request is ready to be implemented?
Users can complete the first step (accumulating first-hand experience) on their own, even if this feature request did not catch the eye of someone with the necessary skills to implement it. And when it reaches that point, it will stand out and have a much higher chance of being implemented.
To fully grasp the scope of a feature request, and to brainstorm possible solutions, a feature request will generally wait until several users have provided their perspective.
Thumbs-up reactions help gauge popularity, but do not provide the same amount of useful information.
Results from discussions and additional user experiences are incorporated into a final summary to provide a single reference for the developers working on this change.
This can be done by the author of the issue or anyone else in a followup comment.
`Stage/Idea` is changed to `Stage/Ready`.
Depending on the feature request it can be in Forgejo or Forgejo runner.
A copy/paste of the "Needs and benefits" and "Feature description" should be used, with a link to this issue so the developer knows where to find more details if they need to.
I'm more of a fan of "Approach 1", to start with. I don't find the responsibilities between different software components clear in Approach 2 -- the "orchestration" system would either require 30% of the logic that the runner has today, or, would need to be a version of the runner.
Regarding Approach 1, it's a little hard to foresee what (if anything) will make this difficult. At a high-level, it sounds good to me. Let me see if I can tease out some complexity though...
I think that we'll need cancellation capabilities that are managed by the runner, and that are more fine-grained than `Remove`. For example, a step can be defined with `timeout-minutes`, and the next step in the job can be set as `if: failure()`. In this case, the step, and whatever command it is executing, needs to be cancelled while the job continues to live.
I see that you've included an `IsHealthy` API in the design -- but as I'm reviewing `go-plugin`, I see that there's a standard gRPC health checking service that is required for the protocol. Does that serve the same purpose?
The runner will need to be prepared at any time for a situation where the plugin panics, crashes, or receives `SIGTERM` or `SIGKILL`. Do you have any thoughts on what that would look like? Less on the technical side of how, but more generally what happens to the task on Forgejo, are there resources that can be cleaned up still, etc.
How do you envision a gRPC interface method like `Exec` working to stream log data back to the runner?
You've noted that "Docker and Host backends would stay built-in.", and I agree, that makes sense. However, Forgejo Runner will need some capability to test the plugin capability. Perhaps we'd need a "test plugin" that can be used in an end-to-end test, and can trigger some of the more complex interactions to ensure that the runner supports them correctly.
Hey, i'm currently testing a POC of a plugin implementation here https://git.erwanleboucher.dev/eleboucher/runner-k8s-plugin. I decided to implement the gRPC interface myself and not use go-plugin (because i don't like the way hashicorp play with licenses, and a plain gRPC service is simpler — the plugin is just a standalone server the runner connects to, no managed process).
I went the full streaming route where i pipe the logger of the plugin directly to the response of the Exec stream (gRPC streams are bidirectional). Each Write() on stdout/stderr sends a chunk back to the runner in real time. Had to add mutex protection since both can write concurrently.
For cancellation — gRPC context cancellation propagates naturally, so when the runner cancels (e.g. timeout-minutes expires), the in-flight Exec stream gets terminated. The plugin can clean up the command while keeping the environment alive for the next step. This is more granular than Remove which tears down everything.
For IsHealthy vs gRPC health checking — IsHealthy in my design checks whether a specific environment (pod, container) is still running, it's per-job. Standard gRPC health would be for the server itself. They serve different purposes, but we could add the standard one alongside.
For plugin crash/SIGKILL — if the plugin dies mid-job, all RPCs fail immediately with gRPC Unavailable. The runner already handles container errors and reports them to Forgejo, so the job would be marked failed. Orphaned resources (pods etc) would need the plugin to clean up on restart or some TTL/garbage collection. Haven't solved that part yet.
For testing — agreed, a test plugin would be useful. I already have adapter tests with a mock gRPC server implementing the full interface. A "dummy" plugin that just runs commands on the host (like the Host backend) would let us test the plugin lifecycle in CI without K8s.
My poc is not perfect yet, as we speak i still have some issues with passing the data around and the setup job is much slower than in-tree because of gRPC overhead on file transfers. I'm thinking for the sidecar case to add a CopyLocal RPC where the plugin just reads from a shared volume path instead of streaming compressed data over gRPC.
edit: I managed to have my poc working and the slow file transfer was because of caching that i didn't set correctly for the plugin :D
I think approach 2 is pretty well covered by external tooling. People are already using KEDA, and integration for GARM is being worked on. Some refinements are still necessary. But that's a separate discussion.
There seems to be strong demand for approach 1. It has the added benefit that we can move LXC out of core and hopefully fix some of its problems.
Regarding labels, #105 should help with supporting arbitrary plug-ins.
One thing we have to consider is how plug-ins are going to be discovered and what additional configuration might be necessary within Forgejo Runner.
Hey @adamcharnock, this discussion might be of interest for you because of your Firecracker efforts.
here is the RPC protocol for the MVP plugin https://git.erwanleboucher.dev/eleboucher/runner/src/branch/main/act/plugin/proto/v1/plugin.proto , i'm currently testing it for my homelab https://git.erwanleboucher.dev/eleboucher/homelab/src/branch/main/kubernetes/apps/selfhosted/forgejo/runner/helmrelease.yaml and trying to get people from my community to use it as well
@eleboucher wrote in #107 (comment)...
Just reading through these notes, and it all sounds promising. I think to some extent a POC is needed to understand the problems in depth -- I would be doing the same thing -- but just a note of caution to not get too locked into a solution since we may still have design thoughts that would change direction. 🙂
I'm not easily convinced that this is the right choice. go-plugin solves a lot of problems, and it is broadly used making it a solid base of known functionality. While hashicorp has done dumb licensing things in the past, their open source licensed codebases have lived on in forks quite happily, and the MPL licensed go-plugin would still be usable.
🤔 I don't understand this note. They're different processes on each side of the gRPC -- why would you need a mutex, which is a single-process concurrency mechanism?
Good point, I've actually reconsidered and implemented go-plugin support as a second transport option (pluginsv2). The plain gRPC approach stays as v1 (plugin is a standalone server the runner connects to), and go-plugin is v2 (runner launches the plugin binary as a subprocess). Both share the same proto interface, and the same plugin binary supports either mode. So users can choose based on their deployment model, sidecar container (v1) or embedded binary (v2), until we decide on one.
Both mutexes are within a single process, not cross-process:
There's also the Caddy option, where you build the main binary with the plugins injected as direct dependencies by generating a suitable `main.go` on the fly.
@whitequark wrote in #107 (comment):
Interesting approach... the core technical details are documented in their main.go which helped me interpret this description.
I can see the advantages of this approach:
- `context.Context` and `chan` could be parameters/returns, if needed).
- `.../v13/...` as a module, and the plugins can continue to work against v12 and upgrade to v13 when they're ready to incorporate those changes.
Caddy has a complexity in the use of an external build tool (xcaddy) to support this workflow, but I don't think that complexity is really necessary for us.
gRPC gives a theoretical advantage that plugins could be written in a language other than Go. But I'm not sure if that's a realistic use-case or a need.
From extensively using Caddy, I can say that the utility provided by xcaddy is very marginal in practice and for a tool consumed by a more specialized audience like Forgejo Actions runner it's really not essential; I would expect everyone capable of using the runner to also be capable of using the Go toolchain in a basic manner. (We should of course also document it, if we go for this approach.)
i guess for me it's also that we have a contract which acts as documentation for how to write a plugin. and well, we never know, maybe some people might need their plugin running in a cobol env. I'm not a huge fan of playing with a binary like xcaddy or go-plugin as it can be used maliciously, whereas a simple grpc client does the trick and limits the attack surface. you can see in my 2 implementations that it's pretty straightforward and rather easy to expand.
the thing with xcaddy is that it means people will use a 3rd-party-distributed forgejo runner. is that ok?
if you don't match the grpc proto then it wouldn't compile either. also, you import the proto file from a specific version, see https://git.erwanleboucher.dev/eleboucher/runner-k8s-plugin/src/branch/main/k8s_pod.go#L18
In what way would using gRPC here improve security? Could you state a specific threat model please?
Fair point, I should be more precise. My concern is less about a strict threat model and more about attack surface and trust boundaries.
With the Caddy/go-plugin approach, you’re executing arbitrary compiled code directly in the same process (or a subprocess spawned with elevated trust). If a plugin is malicious or compromised, it has full access to the runner’s memory space, environment variables, file descriptors, etc. — there’s no isolation boundary.
With gRPC, the plugin runs as a separate process with its own isolation. The runner communicates with it over a well-defined protocol. A compromised plugin can still do damage, but it’s limited to what you explicitly expose through the proto contract — you don’t get implicit access to the runner’s internals just by being loaded into it.
That said, I’ll admit this is more of a defense-in-depth argument than a hard security boundary — if someone’s running a malicious plugin on their own runner, they already have code execution anyway. So the security delta is real but probably not the strongest argument for gRPC in this context.
The stronger arguments for me remain the explicit contract as documentation and language agnosticism — even if the latter is theoretical for now.
We are talking about plugins for things like "running Firecracker VMs" here, right? In other words, plugins that are already more than capable of extracting job secrets by substituting the code that actually runs in a VM (or the VM image) with something malicious.
Yes you are right. I’m still not a huge fan of the xcaddy solution for the other point above
I want to do #105. Because labels affect back-ends and their configuration, it blends into this one. #105 is only about the user-facing aspects. So I'll dump my current plan here because it affects plug-in registration and plug-in instance creation.
It's very early. Almost no code exists, everything in here is pseudo-code, and sudden U-turns should be expected.
Happy to read comments and answer questions.
There's a `Backend`:
Examples of `Backend` are `DockerBackend`, `LXCBackend`, or `HostBackend`. `Backend` is the static companion of `ExecutionsEnvironment` (to be renamed to `ExecutionEnvironment`, without the S in the middle). `Backend` instances have a static configuration in the runner configuration:
That static configuration is called `BackendConfiguration`:
Each `Backend` has to implement its own `BackendConfiguration`, for example, `DockerConfiguration`. In the case of `Docker`, it corresponds to today's `config.Container`.
In addition to the `BackendConfiguration` that applies to the `Backend` for its entire lifetime, a `Backend` has to know how to turn a label like `debian-latest` into a virtual machine or container. As an added complication, `debian-latest` can mean something when a job comes from `example.com` and something completely different when it comes from `codeberg.org`, because Forgejo Runner supports connection-specific label configurations:
So far, labels were interpreted and validated by Forgejo Runner. That is no longer feasible with plugins and when there are many `Backend`-specific options. Therefore, from now on it is up to the `Backend` to interpret the `LabelConfiguration`. That's why `Backend` has a bound function called `ValidateLabelConfiguration()`.
A label configuration can come in two different shapes:
Or a single string for backwards compatibility and when using CLI options, which will be validated with `ValidateLabelString()`:
As far as I know, there is no standardized way to encode arrays as query parameters. We'll use whatever Go understands. Possible alternative: `ParseLabelConfigurationString(str string) (LabelConfiguration, error)`. Forgejo Runner would then have to store the parsed `LabelConfiguration`.
The runner has to know which `Backend` instances are available. So, each `Backend` has to register itself. Unfortunately, we have to read the runner configuration before we can initialize a `Backend`. We also need something that knows how to initialize a `Backend`. The runner doesn't know how to do that. An interface method on `Backend` is only available after the runner has created the `Backend`. So, we need a `Factory` that each `Backend` has to provide.
`map[string]any` is what comes out of the runner configuration. How do we find the correct `Factory`?
We also need something to hold onto all the `Backend` instances:
I don't love those global variables. But it's most likely the easiest approach to get started.
When the runner starts up, the various `Factory` instances register themselves by calling `FactoryRegistry.Register()`, for example, `FactoryRegistry.Register("docker", &DockerFactory{})`. They can either do that in an init function or whatever other mechanism we ultimately use for loading plug-ins.
So, now everything is in place for initializing the runner:
1. Ask `Configuration` for all `Backend`s that should be enabled. `Configuration` figures that out by looping over all label configurations and collecting all backend IDs in a set.
2. For each `Backend` ID:
   - Look up the `Factory` using `FactoryRegistry.Get(id)`.
   - Call `Factory.CreateBackend()` and pass the contents of `config.Backend[id]` as argument.
   - Store the `Backend` instance in `BackendRegistry`.
3. For each `LabelConfiguration`:
   - Call `GetBackend()` on the `LabelConfiguration`.
   - Ask `BackendRegistry` for the `Backend` with the given ID.
   - Call `Validate()` on the `Backend`.
When a new `ExecutionEnvironment` is required to run a job, the runner asks the `Backend` for a new instance by invoking `CreateExecutionEnvironment()`. The details are hazy because the runner is currently aware of the type of `Backend` it is talking to and I have yet to figure out how to make that agnostic.
The type `Label` becomes much simpler and will semantically match the `Label` in Forgejo, which should lead to less confusion.
All that is supposed to be fully backwards-compatible.
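Since the plan above is pseudo-code by its own admission, the following is only a rough sketch of how the `Backend`/`Factory`/registry pieces and the start-up sequence might fit together. The names `Backend`, `Factory`, `FactoryRegistry`, `BackendRegistry`, `LabelConfiguration`, `GetBackend()`, `CreateBackend()`, and `ValidateLabelConfiguration()` come from the description. All field names, signatures, the toy `dockerBackend`, and `initBackends` are assumptions.

```go
package main

import "fmt"

// LabelConfiguration is a hypothetical parsed label entry.
type LabelConfiguration struct {
	Label     string
	BackendID string         // e.g. "docker"
	Options   map[string]any // backend-specific options
}

// GetBackend returns the ID of the backend responsible for this label.
func (l LabelConfiguration) GetBackend() string { return l.BackendID }

// Backend is the static companion of an execution environment.
type Backend interface {
	ValidateLabelConfiguration(cfg LabelConfiguration) error
}

// Factory creates a Backend from its section of the runner configuration.
type Factory interface {
	CreateBackend(cfg map[string]any) (Backend, error)
}

type factoryRegistry struct{ factories map[string]Factory }

func (r *factoryRegistry) Register(id string, f Factory) { r.factories[id] = f }

func (r *factoryRegistry) Get(id string) (Factory, bool) {
	f, ok := r.factories[id]
	return f, ok
}

// Global registries, as in the description.
var (
	FactoryRegistry = &factoryRegistry{factories: map[string]Factory{}}
	BackendRegistry = map[string]Backend{}
)

// dockerBackend and dockerFactory are toy implementations for illustration.
type dockerBackend struct{}

func (dockerBackend) ValidateLabelConfiguration(cfg LabelConfiguration) error {
	if _, ok := cfg.Options["image"].(string); !ok {
		return fmt.Errorf("label %q: docker needs an image", cfg.Label)
	}
	return nil
}

type dockerFactory struct{}

func (dockerFactory) CreateBackend(map[string]any) (Backend, error) {
	return dockerBackend{}, nil
}

func init() { FactoryRegistry.Register("docker", dockerFactory{}) }

// initBackends follows the start-up steps: collect backend IDs, create each
// backend through its factory, then let every backend validate its labels.
func initBackends(backendCfg map[string]map[string]any, labels []LabelConfiguration) error {
	ids := map[string]struct{}{}
	for _, l := range labels {
		ids[l.GetBackend()] = struct{}{}
	}
	for id := range ids {
		f, ok := FactoryRegistry.Get(id)
		if !ok {
			return fmt.Errorf("no factory registered for backend %q", id)
		}
		b, err := f.CreateBackend(backendCfg[id])
		if err != nil {
			return err
		}
		BackendRegistry[id] = b
	}
	for _, l := range labels {
		if err := BackendRegistry[l.GetBackend()].ValidateLabelConfiguration(l); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	labels := []LabelConfiguration{{
		Label:     "debian-latest",
		BackendID: "docker",
		Options:   map[string]any{"image": "debian:latest"},
	}}
	fmt.Println(initBackends(map[string]map[string]any{"docker": {}}, labels))
}
```

Registering from `init()` is just one of the two options mentioned. An out-of-process plug-in loader could call `FactoryRegistry.Register()` at a later point instead.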
Quick follow up, i want to push for keeping just the plain gRPC approach (plugin is a standalone server, runner connects to it) and drop the go-plugin path.
Both are MVPs in my fork right now, nothing is production yet, i kept go-plugin around to keep options open while we figured it out. So this isn't ripping out finished work, just picking a direction before we invest more.
The thing that pushed me there is that the plugin system is about whatever backend the community wants to build: Firecracker, Podman, Proxmox, Nomad, weird in-house stuff. Each of those has its own deployment shape. A k8s plugin wants to live in the cluster with its own ServiceAccount, a Firecracker one probably runs as a daemon on the hypervisor host, a Proxmox one needs API access from somewhere. The plain gRPC approach fits all of them because the runner just connects to an address (unix socket, local tcp, remote mTLS, whatever). The go-plugin path forces the plugin to be a child process of the runner with the runner's identity, that works for "binary on a host" but breaks the moment a backend wants its own creds, own lifecycle, or to be shared across runners.
A standalone gRPC server can hold state across jobs — k8s informers, watch caches, kubeconfig, leader leases, connection pools. With go-plugin the subprocess is killed at the end of each job (Close() runs in cleanUpJobContainer), so every job re-parses the kubeconfig, re-opens the API connections, re-warms whatever caches the plugin has. That's real per-job latency for any backend that talks to a remote control plane.
And on "is it easy to write", my k8s plugin's main.go is ~150 lines, plain gRPC server, health service, listen on a socket, handle SIGTERM. Anyone who has touched gRPC in Go can ship a working plugin in an afternoon, the proto really does act as the docs. it also means that anyone can implement the proto in Rust/Python/whatever
I'll open a PR to integrate the gRPC plugin.
@eleboucher wrote in #107 (comment):
To me, it sounds like the problem here is the supposed plug-in abstraction if it doesn't allow the plug-in to hold onto state for the duration of Forgejo Runner's lifetime. That should be redesigned.
(I have no opinion on gRPC or one of the alternatives, at least not yet.)
I'll chime in and say that the xcaddy model is absolutely not ideal, because it forces the users to build and manage the image with the relevant plugin. Even if you build the image, for Kubernetes, you would need to push the modified image to a private registry for your nodes to pull it.
Usually people do not bother and rely on docker images built by third parties, which can obviously create supply chain issues. If a plugin architecture is used, the separation of concerns is way better: the users are using the official Forgejo image, and then add the relevant plugins they want to use, which it can then talk to over gRPC or equivalent.
Using a Kubernetes runner is also not something marginal, it's widely used across all forges from homelabs to enterprises, I think it's useful to mention it anyways.
Thanks !
by this i meant more job state. for example, if someone makes a custom plugin for their own company, they will be able to isolate their state, etc.
i opened the PR here forgejo/runner#1500
A plug-in interface is a significant commitment. Evolving Forgejo Runner is not easy. It already has to retain compatibility with GitHub Actions, multiple Forgejo versions, and previous versions of Forgejo Runner. A plug-in interface reduces the wiggle room even more because it creates an internal boundary that can only be changed rarely and carefully. Therefore, a lot has to happen before I am willing to consider committing to a plug-in interface:
Right now, none of these points has been met. Exposing a plug-in interface is one of the last steps, not the first.
In the meantime, the existing interfaces can be cleaned up and improved. Perhaps other changes can be made to Forgejo Runner that make it easier to maintain forks with alternative back-ends and pave the way for a plug-in interface, like additional tests.
I think a couple of these are actually closer than they look:
On testing: my PR already has act/plugin/testplugin/, a ~500 LOC host-mode reference plugin that implements the full proto with its own tests. Right now it's just a unit-test fixture. I can split that out into its own PR if it's useful to have that signal independently of whether the rest of the big PR lands. If you have something more specific in mind for what a testing plan should cover, I'd rather know upfront than guess at it.
On config/discovery: yeah, the loading story isn't fully written. My read is it fits naturally on top of the Backend/Factory/LabelConfiguration design from your discussions. A gRPC plugin is just a Factory where CreateBackend dials a remote endpoint. I'd rather write that as a follow-up to your design than propose something that competes with it.
On demand: you're right, but there is also a huge demand for the kubernetes plugin https://codeberg.org/forgejo/discussions/issues/66 .
On the VM POC: only the k8s one exists and I can't realistically build a Firecracker plugin myself in any reasonable timeframe. If that's a hard gate, the most realistic path is probably @adamcharnock or whoever else is interested in that direction. I'm happy to help with proto changes, review, shaping the interface around what a VM backend actually needs, just can't be the one writing it.
I guess what we can try is to actually move docker and LXC onto this interface too, so we have it as the source of truth and only one thing to maintain. Docker and LXC would be 1st-party plugins that don't need to be specified explicitly, and the rest would be community plugins.
Anyway, I'm not trying to rush #1500. I just want to make sure I'm working on the right things while the gates get resolved rather than waiting around. If the most useful thing right now is the incremental cleanup from the original issue (capability queries, opening label schemes, decoupling from Docker types) I can start there.
I'm interested in Firecracker support, but have no hard ETA
@aahlenst made forgejo/runner#1503 to clean up and improve the interface as discussed
After looking at the interface changes proposed by @eleboucher (thanks again!), I think we have to spend some more time on improving the interface further.
A GitHub Actions workflow comes with certain expectations. `runs-on` is a VM and everything else runs inside that VM. `container`, `services`, and `uses` are all containers built around the semantics of Docker and compatible container runtimes. A plug-in interface has to preserve those semantics while allowing plug-ins to ignore them to a certain extent. For example, Forgejo Runner already cheats a little by not honouring `runs-on` when `container.image` is defined. However, the overall semantics are preserved. On the other hand, LXC violates most of the semantics, which leads to a pretty bad user experience.
Let's take an extreme example: Tart. It creates macOS virtual machines and interacts with them over SSH. If I wanted to create a plug-in for Tart, I would only want to start a Tart VM for `runs-on`. For everything else, I would have to somehow start containers inside that macOS VM, either using Docker Desktop, Podman Desktop, or Apple's container. That means that the interface has to express "Start an execution environment for a job", "Start a container for a job", "Start a service container", "Start a step container". Otherwise, I cannot preserve the workflow semantics. As far as I can see, `ExecutionsEnvironment` doesn't provide that information.
We need a better separation of concerns. For example, `ConnectToNetwork()` doesn't really make sense because not every plug-in has Docker-style networks. Some methods are too narrow. Forgejo Runner doesn't know when it's the right time to pull an image. That's up to the plug-ins.
@eleboucher wrote in #107 (comment):
That is great. I do not want to discourage anybody or discount the existing achievements. My intent is to provide guidance and to manage expectations, including those of bystanders. However, that is my POV. It does not necessarily represent the POV of the other contributors.
In any case, I do not expect that you @eleboucher solve all problems on your own.
My personal preference right now is to concentrate on internal improvements of Forgejo Runner to prepare it for plug-ins. That should ideally happen in small, targeted steps like forgejo/runner#1503. That is vastly easier to digest and much easier to provide meaningful feedback.
My hunch is that we have to move in that direction and that it is a good idea, anyway.
fair points, the gap is real. looking at my proto (act/plugin/proto/v1/plugin.proto in the plugin branch), Create bundles runs-on + job container + services into one RPC, and step-level docker actions aren't modelled at all, just gated by a supports_docker_actions capability. that works for k8s where the pod is everything, but it doesn't fit Tart where the plugin only owns the VM.
so yeah splitting helps:
a k8s plugin implements CreateJobEnvironment and absorbs the rest, a Tart plugin only implements CreateJobEnvironment and the runner falls back to a default docker driver for the others. that's the case that makes the split actually worth it: the existing docker code becomes a default driver any VM plugin can reuse instead of reimplementing it over SSH.
on ConnectToNetwork and image pull: already aligned on the proto side (no ConnectToNetwork RPC, pull folded into Create with a force_pull hint, plugin owns the timing). same treatment fits the in-tree interface.
Agreed on incremental, that's what i'd rather be doing. once #1503 is merged, rough sequence from here:
Each lands in isolation, and the Backend/Factory machinery from #105 slots in once the interface is in better shape. moving docker and LXC onto the same interface plugins use ends up being the natural endpoint, docker as default driver, LXC as built-in plugin, Tart only overrides the outer layer.
happy to drive any of these, let me know which is most useful next.
actually ConnectToNetwork is dead code
I guess thinking about Tart, we want to express GitHub's 4 execution layers: runs-on, container:, services:, uses: docker://. therefore i think this interface would suit well
where Back-ends that don't support a kind return ErrUnsupported, gated upfront via Capabilities so the runner can fail early or ignore.
If we manage to make docker reusable then Tart, Firecracker, anything that boots a VM with Docker inside becomes trivial.
Cross-checked against GitLab Runner and Woodpecker — both refuse nested docker actions on k8s by default; users opt into DinD via a privileged service container. So k8s refusing step containers is consistent with the rest of the industry.
@eleboucher wrote in #107 (comment):
To me, it sounds like you're proposing that a workflow with `runs-on: firecracker` could get a service container outside the firecracker VM, provided by a Docker-compatible container engine.
I doubt this is a good choice, at least not by default. Either, administrators have to be able to disable that (what happens in that case?) or it should lead to errors outright. Otherwise, workflow authors would be able to escape the confinement that `runs-on` should provide without the runner administrator being able to do something against that.
For the time being, my preference is to be as predictable and secure as possible. That means that plug-ins either have to implement a capability or it leads to errors, period. There shouldn't be any opt-out, either.
I like your proposal for the interfaces.
I am not sure about that one. `host` basically imitates the GitHub Actions Runner. That means `runs-on` is ignored, all three container factories should produce actual containers as it does today.
How do you imagine that to work? With a remote Docker host? That is a very interesting idea 🤔 It would certainly simplify the implementation of plug-ins and be a lot easier than sending Docker commands to a VM. But it raises a lot of questions, especially when plug-ins do not live inside the same process.
I think i misled with the "runner falls back" phrasing. What i meant is that a plug-in explicitly returns a DockerExecutionEnvironment from its own CreateExecutionEnvironment. The runner never substitutes anything. If a plug-in says "no CreateStepContainer", that's an error, full stop.
That makes the security story straightforward: the plug-in owns the boundary. A Tart plug-in pointing its Docker client at the in-VM socket means every container runs inside the VM. The runner never reaches outside on its own.
You're right, my bad. So host should implement all three inner factories, effectively the same impl as the docker back-end. That is actually a useful data point: host and docker would be able to share the same DockerExecutionEnvironment internally, which is exactly what we want out of pulling docker into a reusable piece.
On Tart, yes: a remote or tunneled docker host. Concretely, the plug-in would boot the VM, open an SSH tunnel to the docker unix socket inside it (or talk to a TLS-protected docker daemon on the VM's network), and point its docker client at that endpoint. Containers spawned via that client run inside the VM.
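To illustrate the "point the client at that endpoint" part, here is a rough Go sketch using only the standard library. It assumes the plug-in has already set up the tunnel and exposes the in-VM daemon as a local unix socket; the socket path and the function names are made up for the example. GET /_ping is the Docker Engine API's liveness endpoint.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// NewUnixSocketClient returns an HTTP client that sends every request
// to the given unix socket, regardless of the host in the URL.
func NewUnixSocketClient(socketPath string) *http.Client {
	return &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", socketPath)
			},
		},
	}
}

// PingDocker checks that a Docker-compatible daemon answers on the socket.
func PingDocker(socketPath string) error {
	// The host part ("docker") is a placeholder; the dialer ignores it.
	req, err := http.NewRequestWithContext(context.Background(),
		http.MethodGet, "http://docker/_ping", nil)
	if err != nil {
		return err
	}
	resp, err := NewUnixSocketClient(socketPath).Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	return nil
}

func main() {
	// Hypothetical path where the plug-in's tunnel ends on the host.
	if err := PingDocker("/run/forgejo-plugin/tart-vm.sock"); err != nil {
		fmt.Println("in-VM docker daemon not reachable:", err)
	}
}
```

Because the client only ever dials the socket it was given, it cannot accidentally reach the host's own daemon, which is the property the security argument relies on.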
I think the cleanest way to make "use the docker environment" work cross-process is to flip who owns the docker calls. Instead of the plug-in implementing CreateJobContainer / CreateServiceContainer / CreateStepContainer itself and proxying to docker, the plug-in just exposes a docker endpoint and the runner uses it. To keep things simple, we always proxy.
concretely:
This answers the cross-process question because the plug-in doesn't need any docker logic at all, no matter what language it's written in. Plug-in authors only need to know how to start a proxy and expose a socket.
It also keeps the security story clean: the plug-in is responsible for what it exposes. As long as the proxy points at the in-VM daemon and not the host's, all containers run inside the VM. The runner connects to whatever the plug-in told it to and never picks an endpoint on its own.
We'd express it in capabilities and the proto: a VM-like plug-in returns a delegate_to_docker block in CreateExecutionEnvironmentResponse with the endpoint + TLS material. Other plug-ins (k8s, etc.) don't return that block, and the runner sends them the inner Create RPCs as normal.
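As a rough sketch of what the runner-side decision could look like in Go: the struct and field names below are assumptions modelled on the delegate_to_docker block described above, not the actual proto, and RouteFor is a hypothetical helper.

```go
package main

import (
	"errors"
	"fmt"
)

// DelegateToDocker mirrors the proposed delegate_to_docker block.
// Field names are assumptions for this sketch.
type DelegateToDocker struct {
	Endpoint                      string // e.g. "unix:///path/to/proxy.sock"
	CACert, ClientCert, ClientKey []byte // TLS material, empty for unix sockets
}

// CreateExecutionEnvironmentResponse is a hand-written stand-in for the
// generated gRPC message.
type CreateExecutionEnvironmentResponse struct {
	EnvironmentID    string
	DelegateToDocker *DelegateToDocker // nil for k8s-style plug-ins
}

// ContainerRoute says who creates job, service, and step containers.
type ContainerRoute int

const (
	RouteViaPluginRPC      ContainerRoute = iota // send inner Create RPCs to the plug-in
	RouteViaDockerEndpoint                       // runner drives the plug-in's docker endpoint
)

// RouteFor never falls back to a daemon of the runner's own choosing:
// a present but empty block is an error, not a reason to guess.
func RouteFor(resp CreateExecutionEnvironmentResponse) (ContainerRoute, error) {
	d := resp.DelegateToDocker
	if d == nil {
		return RouteViaPluginRPC, nil
	}
	if d.Endpoint == "" {
		return RouteViaPluginRPC, errors.New("delegate_to_docker without endpoint")
	}
	return RouteViaDockerEndpoint, nil
}

func main() {
	vm := CreateExecutionEnvironmentResponse{
		EnvironmentID:    "tart-1",
		DelegateToDocker: &DelegateToDocker{Endpoint: "unix:///tmp/tart-1.sock"},
	}
	route, err := RouteFor(vm)
	fmt.Println(route == RouteViaDockerEndpoint, err)
}
```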
The extracted docker logic could look like this: forgejo/runner#1507
@eleboucher wrote in #107 (comment):
That's something we should at least try manually before finalizing the API.
Sounds sensible.
I've seen in your proposal that you're binding the client to the context. I'm not a fan, at least not yet, because it's implicit. GetDockerClient() is called all over the place and when the injected client isn't present, it creates a new instance using different rules. So it looks like a source of subtle bugs to me. Passing it as a function argument would be very explicit and match what we're trying to do: "Use this client to stop container X."
I don't know yet what exactly should be done instead. It's weird that a Client instance is attached to a containerReference and that there are so many free-floating functions. That makes it hard to scope and cache some information. Perhaps there could be some additional type like DockerEndpoint (not a great name) that is bound to a particular Docker socket and provides methods like RunnerArch() that can also be cached. Right now, we're calling GetHostInfo() over and over again and that tanks performance with Podman.
Is there a particular reason for the introduction of the alias dockercontainer? That makes the diff rather verbose.
@whitequark wrote in #107 (comment):
What about QEMU instead? Its MicroVM back-end is slower, but QEMU as a whole is way more versatile and is able to host other guests than Linux.
I do want non-Linux guests (Windows is the major priority) so I am interested in that. I was initially planning to start with Firecracker to get decent Linux performance and then start doing harder things, but I am quite ignorant of the field all things considered so this may not be a good plan. (Other things have taken priority so I'm still in the "I really need to set up a lab and compare the available solutions" phase of the development.)
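Coming back to the DockerEndpoint idea from earlier in this comment, a minimal Go sketch of a type bound to one socket that caches host info could look like the following. Everything here is hypothetical: HostInfo is a stand-in with a single field, and InfoFetcher abstracts whatever expensive daemon call the runner makes today.

```go
package main

import (
	"fmt"
	"sync"
)

// HostInfo is a stand-in for the daemon's host information;
// only the field needed for this sketch is included.
type HostInfo struct{ Architecture string }

// InfoFetcher abstracts the expensive round trip to the daemon.
type InfoFetcher func() (HostInfo, error)

// DockerEndpoint is bound to exactly one socket and caches what it learns.
type DockerEndpoint struct {
	Socket string
	fetch  InfoFetcher

	mu   sync.Mutex
	info *HostInfo // nil until the first successful fetch
}

func NewDockerEndpoint(socket string, fetch InfoFetcher) *DockerEndpoint {
	return &DockerEndpoint{Socket: socket, fetch: fetch}
}

// RunnerArch asks the daemon at most once per endpoint on success.
// A failed fetch is not cached, so the next call retries.
func (e *DockerEndpoint) RunnerArch() (string, error) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.info == nil {
		info, err := e.fetch()
		if err != nil {
			return "", err
		}
		e.info = &info
	}
	return e.info.Architecture, nil
}

func main() {
	calls := 0
	ep := NewDockerEndpoint("unix:///var/run/docker.sock", func() (HostInfo, error) {
		calls++ // counts round trips to the (fake) daemon
		return HostInfo{Architecture: "arm64"}, nil
	})
	for i := 0; i < 3; i++ {
		arch, _ := ep.RunnerArch()
		fmt.Println(arch)
	}
	fmt.Println("daemon calls:", calls)
}
```

Functions that act on containers would then take the endpoint as an explicit argument instead of looking up a client from the context.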
@aahlenst I updated my PR with your feedback. I think we are now in a good place, if you can have a look again.
@aahlenst Looks like the plugin PR is next, or do you want me to handle #105 first?
@eleboucher Oh, that would be terrific if you could start with #105. I don't have the cycles necessary to do it at the moment. ❤️
@aahlenst wrote in #107 (comment):
forgejo/runner#1571/files here you go