Prerequisites
- I have searched existing issues and discussions to avoid duplicates
- I am using the latest version (or have tested against main/nightly)
Description
Bifrost has a bug where worker goroutines and streaming goroutines don't clean up on context cancellation, causing them to block indefinitely and leak.
[reported using Claude]
- **Worker Loop Doesn't Monitor Context**: The `requestWorker` function (`bifrost.go:2016-2135`) uses `for req := range queue`, which only exits when the queue channel is closed during `Shutdown()`. Workers never check `bifrost.ctx.Done()` for cancellation, so they continue blocking on `<-queue`, waiting for requests even after the context is cancelled.
- **Streaming Goroutines Block on I/O**: The streaming goroutine in `HandleOpenAIChatCompletionStreaming` (`openai.go:786-939`) blocks on `scanner.Scan()` at line 804. The context cancellation check happens inside the loop, so it never executes while the goroutine is blocked on I/O. When the context times out, the goroutine stays permanently blocked waiting for data.
Steps to reproduce
https://gist.github.com/pjcdawkins/6f63fad7eea19c3d698b2740aaf21959
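The gist above is the full reproducer; below is only a minimal sketch of the measurement approach. The Bifrost initialization and streaming call are elided, since the exact setup lives in the gist — only the goroutine accounting is shown here.

```go
package main

import (
    "context"
    "fmt"
    "runtime"
    "time"
)

func main() {
    before := runtime.NumGoroutine()

    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    // ... initialize Bifrost with ctx and issue one streaming request here (see gist) ...
    <-ctx.Done() // wait for the context to time out

    time.Sleep(2 * time.Second) // give goroutines a chance to exit
    after := runtime.NumGoroutine()
    fmt.Printf("goroutines: before=%d after=%d\n", before, after)
    // Expected: after returns to roughly before.
    // Actual:   after stays elevated (leaked workers and streaming goroutines).
}
```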
Expected behavior
The goroutine count should return to baseline after context cancellation. When `bifrost.Init()` is called with a context and that context is cancelled, all worker goroutines and streaming goroutines should clean up and exit.
Actual behavior
Multiple goroutines leak (workers + streaming goroutines). They remain blocked indefinitely until process exit, causing memory growth and resource exhaustion.
Affected area(s)
Core (Go)
Version
v1.2.22
Environment
- Go version: 1.25
- OS: Linux
- Affected providers: any providers using streaming
Relevant logs/output
### Root Cause Analysis
**Issue 1: Worker Loop (`bifrost.go:2026`)**
The worker loop only exits on channel close:
```go
for req := range queue { // ONLY EXITS ON CHANNEL CLOSE
    // ... request processing
}
```
This loop only exits when the queue channel is closed in `Shutdown()` at line 2514. When a context is cancelled, the worker doesn't check `bifrost.ctx.Done()` and continues blocking.
**Why passing context to `Init()` doesn't fix it**: While `bifrost.Init()` accepts and stores a context in `bifrost.ctx`, the `requestWorker` goroutines don't monitor this context. They only monitor the per-request context (`req.Context`) for individual operations, not for the worker lifecycle itself.
**Issue 2: Streaming I/O Blocking (`openai.go:804`)**
```go
for scanner.Scan() { // BLOCKS HERE ON I/O
    // Context check happens AFTER Scan() completes
    select {
    case <-ctx.Done():
        return
    default:
    }
    // ... process line
}
```
The `ctx.Done()` check is inside the loop, so it never executes while the goroutine is blocked on `scanner.Scan()`. When the context times out, the HTTP client may not immediately close the connection, and the goroutine stays blocked indefinitely.
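For illustration, here is a self-contained sketch of the same blocking behavior using only the standard library (an `io.Pipe` stands in for a streaming response body that never sends more data; this is not Bifrost's actual client code): `Scan()` does not observe context cancellation and only returns once the underlying reader is closed from another goroutine.

```go
package main

import (
    "bufio"
    "context"
    "fmt"
    "io"
    "time"
)

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
    defer cancel()

    pr, _ := io.Pipe() // write side discarded: the "server" never sends more data

    // Watcher goroutine: this is what the proposed fix adds. Without it, the
    // Scan() below stays blocked even after the context times out.
    go func() {
        <-ctx.Done()
        pr.Close() // forces the blocked Scan() to return with an error
    }()

    scanner := bufio.NewScanner(pr)
    for scanner.Scan() {
        // A context check placed here never runs while Scan() is blocked.
    }
    fmt.Println("scanner unblocked after cancellation:", scanner.Err())
}
```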
### Proposed Fixes
**Fix 1: Make Workers Monitor Context Cancellation**
In `bifrost.go:requestWorker`:
```go
func (bifrost *Bifrost) requestWorker(...) {
    // Monitor both queue closure AND context cancellation
    for {
        select {
        case <-bifrost.ctx.Done():
            bifrost.logger.Debug("worker exiting due to context cancellation")
            return
        case req, ok := <-queue:
            if !ok {
                return // Queue closed - shutdown
            }
            // Process request...
        }
    }
}
```
**Fix 2: Make Streaming Goroutines Respect Context**
Monitor context in parallel with I/O and force-close response body on cancellation:
```go
go func() {
    done := make(chan struct{})
    defer close(done)

    // Monitor context and force cleanup
    go func() {
        select {
        case <-ctx.Done():
            if resp.BodyStream() != nil {
                resp.BodyStream().Close()
            }
        case <-done:
        }
    }()

    // Existing streaming logic...
}()
```
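A regression test along these lines could guard against reintroducing the leak. This is only a sketch: the Bifrost setup is a placeholder comment, not the real API, and `go.uber.org/goleak` is one possible leak detector.

```go
package bifrost_test

import (
    "context"
    "testing"
    "time"

    "go.uber.org/goleak"
)

func TestWorkersExitOnContextCancel(t *testing.T) {
    defer goleak.VerifyNone(t) // fails the test if goroutines are still running

    ctx, cancel := context.WithCancel(context.Background())
    _ = ctx // would be passed to the Bifrost init and a streaming call in the real test

    cancel()
    time.Sleep(500 * time.Millisecond) // allow workers and streams to observe cancellation
}
```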
## Additional Notes
The channel send pattern at lines 2094-2129 correctly prevents workers from blocking on sends using select with timeout. However, this doesn't help when workers are already blocked on channel receives (`<-queue`) or I/O operations (`scanner.Scan()`).
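To make the distinction concrete, here is a minimal, self-contained illustration (not the actual Bifrost code) of why a timeout-guarded send cannot rescue a goroutine that is already parked on an unguarded receive:

```go
package main

import (
    "fmt"
    "time"
)

func main() {
    results := make(chan string) // no receiver: a bare send would block forever
    queue := make(chan string)   // no sender: a bare receive blocks forever

    // Guarded send (analogous to the select-with-timeout pattern): bounded by
    // the timeout, so this goroutine always exits.
    go func() {
        select {
        case results <- "done":
        case <-time.After(100 * time.Millisecond):
            fmt.Println("send abandoned after timeout")
        }
    }()

    // Unguarded receive (analogous to the worker's <-queue): parked until the
    // channel is closed or a value arrives; no timeout elsewhere can reach it.
    go func() {
        v := <-queue
        fmt.Println("received:", v)
    }()

    time.Sleep(200 * time.Millisecond)
    fmt.Println("the receiving goroutine above is still blocked")
}
```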
## Workarounds
Until fixed, consumers can:
1. Create fresh Bifrost instances for each isolated operation
2. Explicitly call `Shutdown()` when done with a Bifrost instance
3. Accept the leak for short-lived processes that exit soon anyway
4. Set aggressive timeouts at the HTTP client level
Regression?
No response
Severity
Medium (some functionality impaired)