
[Bug]: Goroutines leak on context cancellation #828

@pjcdawkins


Prerequisites

  • I have searched existing issues and discussions to avoid duplicates
  • I am using the latest version (or have tested against main/nightly)

Description

Bifrost has a bug where worker goroutines and streaming goroutines don't clean up on context cancellation, causing them to block indefinitely and leak.

[reported using Claude]

  1. Worker Loop Doesn't Monitor Context: The `requestWorker` function (`bifrost.go:2016-2135`) uses `for req := range queue`, which only exits when the queue channel is closed during `Shutdown()`. Workers never check `bifrost.ctx.Done()` for cancellation, so they continue blocking on `<-queue` waiting for requests even after the context is cancelled.

  2. Streaming Goroutines Block on I/O: The streaming goroutine in `HandleOpenAIChatCompletionStreaming` (`openai.go:786-939`) blocks on `scanner.Scan()` at line 804. The context cancellation check happens inside the loop, so it never executes while the goroutine is blocked on I/O. When the context times out, the goroutine stays permanently blocked waiting for data.

Steps to reproduce

https://gist.github.com/pjcdawkins/6f63fad7eea19c3d698b2740aaf21959

Expected behavior

Goroutines should return to baseline after context cancellation. When bifrost.Init() is called with a context and that context is cancelled, all worker goroutines and streaming goroutines should clean up and exit.

Actual behavior

Multiple goroutines leak (workers + streaming goroutines). They remain blocked indefinitely until process exit, causing memory growth and resource exhaustion.

Affected area(s)

Core (Go)

Version

v1.2.22

Environment

- Go version: 1.25
- OS: Linux
- Affected providers: any provider that uses streaming

Relevant logs/output

### Root Cause Analysis

**Issue 1: Worker Loop (`bifrost.go:2026`)**

The worker loop only exits on channel close:

```go
for req := range queue {  // ONLY EXITS ON CHANNEL CLOSE
    // ... request processing
}
```


This loop only exits when the queue channel is closed in `Shutdown()` at line 2514. When a context is cancelled, the worker doesn't check `bifrost.ctx.Done()` and continues blocking.

**Why passing context to `Init()` doesn't fix it**: While `bifrost.Init()` accepts and stores a context in `bifrost.ctx`, the `requestWorker` goroutines don't monitor this context. They only monitor the per-request context (`req.Context`) for individual operations, not for the worker lifecycle itself.
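
In other words, the stored context never reaches the worker's receive loop. A simplified sketch of the current shape (illustrative only; `processRequest` is a stand-in name, not the exact code):

```go
// Simplified sketch of the current behaviour (not the exact Bifrost code):
// bifrost.ctx is stored at Init() but never selected on in this loop; only the
// per-request context is consulted, and only while a request is being processed.
for req := range queue {
    processRequest(req.Context, req) // honours req.Context for this request only
    // bifrost.ctx is never checked here, so cancelling it cannot stop the loop
}
```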

**Issue 2: Streaming I/O Blocking (`openai.go:804`)**


```go
for scanner.Scan() {  // BLOCKS HERE ON I/O
    // Context check happens AFTER scan completes
    select {
    case <-ctx.Done():
        return
    default:
    }
    // ... process line
}
```


The `ctx.Done()` check is inside the loop, so it never executes while the goroutine is blocked on `scanner.Scan()`. When the context times out, the HTTP client may not immediately close the connection, and the goroutine stays blocked indefinitely.
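
A standalone sketch (not Bifrost code) makes this concrete: with the same in-loop check, the goroutine stays stuck in `scanner.Scan()` after cancellation and only exits once the underlying reader is closed.

```go
package main

import (
    "bufio"
    "context"
    "fmt"
    "io"
    "time"
)

func main() {
    pr, pw := io.Pipe() // stands in for the streaming HTTP response body
    ctx, cancel := context.WithCancel(context.Background())

    exited := make(chan struct{})
    go func() {
        defer close(exited)
        scanner := bufio.NewScanner(pr)
        for scanner.Scan() { // blocks here waiting for data
            select {
            case <-ctx.Done():
                return
            default:
            }
        }
    }()

    cancel() // cancel the context; no data arrives and the reader stays open

    select {
    case <-exited:
        fmt.Println("goroutine exited") // does not happen at this point
    case <-time.After(500 * time.Millisecond):
        fmt.Println("still blocked in scanner.Scan() despite cancellation")
    }

    pw.Close() // closing the stream is what finally unblocks Scan
    <-exited
    fmt.Println("goroutine exited only after the reader was closed")
}
```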


### Proposed Fixes

**Fix 1: Make Workers Monitor Context Cancellation**

In `bifrost.go:requestWorker`:

```go
func (bifrost *Bifrost) requestWorker(...) {
    // Monitor both queue closure AND context cancellation
    for {
        select {
        case <-bifrost.ctx.Done():
            bifrost.logger.Debug("worker exiting due to context cancellation")
            return
        case req, ok := <-queue:
            if !ok {
                return  // Queue closed - shutdown
            }
            // Process request...
        }
    }
}
```


**Fix 2: Make Streaming Goroutines Respect Context**

Monitor context in parallel with I/O and force-close response body on cancellation:

```go
go func() {
    done := make(chan struct{})
    defer close(done)

    // Monitor context and force cleanup
    go func() {
        select {
        case <-ctx.Done():
            if resp.BodyStream() != nil {
                resp.BodyStream().Close()
            }
        case <-done:
        }
    }()

    // Existing streaming logic...
}()
```
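
As a generic, self-contained version of the same idea (illustrative only, not the actual Bifrost code), a watcher goroutine closes the stream on cancellation, which unblocks `scanner.Scan()`, and the `done` channel stops the watcher on the normal path so the watcher itself cannot leak:

```go
package stream

import (
    "bufio"
    "context"
    "io"
)

// streamLines is an illustrative sketch, not Bifrost's actual code: a watcher
// goroutine closes body when ctx is cancelled, which forces the blocked
// scanner.Scan() to return; closing done on return stops the watcher so it
// does not leak on the normal (non-cancelled) path.
func streamLines(ctx context.Context, body io.ReadCloser, handle func(string)) error {
    done := make(chan struct{})
    defer close(done)

    go func() {
        select {
        case <-ctx.Done():
            body.Close() // unblocks the Scan call below
        case <-done: // streaming finished normally; stop watching
        }
    }()

    scanner := bufio.NewScanner(body)
    for scanner.Scan() {
        handle(scanner.Text())
    }
    return scanner.Err()
}
```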


## Additional Notes

The channel send pattern at lines 2094-2129 correctly prevents workers from blocking on sends using select with timeout. However, this doesn't help when workers are already blocked on channel receives (`<-queue`) or I/O operations (`scanner.Scan()`).
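
For reference, that guarded-send pattern has roughly this shape (a generic sketch; the actual code at `bifrost.go:2094-2129` may differ, and `sendResult` is a hypothetical name):

```go
package worker

import (
    "context"
    "time"
)

// sendResult is a generic sketch (not the exact Bifrost code) of a send that
// cannot block forever: it gives up on context cancellation or after a timeout.
func sendResult[T any](ctx context.Context, ch chan<- T, v T, sendTimeout time.Duration) bool {
    select {
    case ch <- v:
        return true // delivered to the consumer
    case <-ctx.Done():
        return false // caller gave up; drop the result
    case <-time.After(sendTimeout):
        return false // consumer is not draining the channel; avoid a permanent block
    }
}
```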

## Workarounds

Until fixed, consumers can:
1. Create fresh Bifrost instances for each isolated operation
2. Explicitly call `Shutdown()` when done with a Bifrost instance
3. Accept the leak for short-lived processes that exit soon anyway
4. Set aggressive timeouts at the HTTP client level

Regression?

No response

Severity

Medium (some functionality impaired)
