Performance Optimization

Memory Usage

High memory consumption by backend services

Problem: Backend processes consume excessive RAM (>4GB).

Solution:

Monitor memory usage:

# Check process memory
ps aux | grep -E "langgraph|uvicorn|python" | grep -v grep | awk '{print $4"% "$11}'

# Docker container memory
docker stats --no-stream

# System memory
free -h  # Linux
vm_stat  # macOS
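
To track these numbers over time instead of reading them by eye, the `ps aux` output can be parsed in a few lines of Python. This is a minimal sketch and not part of DeerFlow: `memory_by_process` is a hypothetical helper, and it assumes the usual 11-column `ps aux` layout in which column 4 is %MEM and column 6 is RSS in KiB.

```python
import re


def memory_by_process(ps_output, pattern=r"langgraph|uvicorn|python"):
    """Return (command, %MEM, RSS in MiB) tuples for matching processes, largest first."""
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)  # 11 columns; the last is the full command line
        if len(fields) < 11:
            continue
        command = fields[10]
        if re.search(pattern, command):
            rows.append((command, float(fields[3]), int(fields[5]) / 1024))
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Live input can be obtained with `subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout`.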

Enable context summarization:

# In config.yaml
summarization:
  enabled: true  # ✅ Enable to reduce memory
  trigger:
    - type: tokens
      value: 15564  # Trigger when reaching ~16K tokens
  keep:
    type: messages
    value: 10  # Keep recent 10 messages only
  trim_tokens_to_summarize: 15564
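
The trigger and keep settings can be read as a simple rule: summarize when a trigger fires, then keep only the newest messages verbatim. The sketch below illustrates that reading and is not DeerFlow's implementation. `Trigger`, `should_summarize`, and `split_for_summary` are hypothetical names, and the "any trigger fires" semantics is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Trigger:
    type: str      # "tokens", "messages", or "fraction"
    value: float


def should_summarize(triggers, token_count, message_count, context_window):
    """Return True if any trigger fires (assumed OR semantics)."""
    for trigger in triggers:
        if trigger.type == "tokens" and token_count >= trigger.value:
            return True
        if trigger.type == "messages" and message_count >= trigger.value:
            return True
        if trigger.type == "fraction" and token_count >= trigger.value * context_window:
            return True
    return False


def split_for_summary(messages, keep):
    """Split into (messages to summarize, most recent `keep` messages kept verbatim)."""
    if keep <= 0:
        return list(messages), []
    return list(messages[:-keep]), list(messages[-keep:])
```

Under these assumptions, a 12-message conversation with `keep` set to 10 would have its two oldest messages folded into the summary.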

Reduce memory injection:

# In config.yaml
memory:
  enabled: true
  max_injection_tokens: 2000  # ✅ Limit memory in prompt
  max_facts: 100  # Limit stored facts
  fact_confidence_threshold: 0.7  # Only high-confidence facts

Limit concurrent subagents:

# In backend/src/subagents/executor.py (default: 3)
MAX_CONCURRENT_SUBAGENTS = 2  # Reduce from 3 to 2

Configure thread pool sizes:

# In backend/src/subagents/executor.py
_scheduler_pool = ThreadPoolExecutor(max_workers=2)  # Default: 3
_execution_pool = ThreadPoolExecutor(max_workers=2)  # Default: 3
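
Both settings bound how many subagent tasks run at the same time, which is what limits peak memory. The effect of `max_workers` can be checked with a small standalone experiment. `peak_concurrency` is a hypothetical helper for illustration and is unrelated to DeerFlow's executor code.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def peak_concurrency(max_workers, tasks, work_seconds=0.02):
    """Run `tasks` short jobs and return the highest number that ran simultaneously."""
    running = 0
    peak = 0
    lock = threading.Lock()

    def job():
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(work_seconds)  # stand-in for real subagent work
        with lock:
            running -= 1

    # Leaving the `with` block waits for all submitted jobs to finish
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _ in range(tasks):
            pool.submit(job)
    return peak
```

However many jobs are submitted, the returned peak never exceeds `max_workers`; the remaining jobs wait in the pool's queue.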

Use lightweight models for non-critical tasks:

models:
  - name: gpt-4  # Primary model for agent
    # ...
  - name: gpt-4o-mini  # Lightweight for summarization/memory
    use: langchain_openai:ChatOpenAI
    model: gpt-4o-mini
    max_tokens: 2048

# Use lightweight model for background tasks
summarization:
  model_name: gpt-4o-mini
memory:
  model_name: gpt-4o-mini
title:
  model_name: gpt-4o-mini

Memory leak in long-running sessions

Problem: Memory usage grows continuously over multiple conversations.

Solution:

Clear conversation history periodically:

# Start fresh thread for new task
# Frontend: Click "New Conversation"
# API: Use new thread_id

Enable aggressive summarization:

summarization:
  enabled: true
  trigger:
    - type: messages
      value: 20  # Summarize after 20 messages (more aggressive)
  keep:
    type: messages
    value: 5  # Keep only 5 recent messages

Clean up old threads:

# Remove old thread data
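# Assumes one directory per thread directly under threads/
find backend/.deer-flow/threads -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +
# Removes thread directories not modified in the last 7 days (never the threads/ directory itself)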
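
For a cleanup that can be previewed before anything is deleted, the same idea can be written in Python with a dry-run flag. This is a sketch under assumptions: `remove_old_threads` is a hypothetical helper, and it treats each direct subdirectory of the threads directory as one thread.

```python
import shutil
import time
from pathlib import Path


def remove_old_threads(threads_dir, max_age_days=7, dry_run=True, now=None):
    """Return thread directories older than max_age_days; delete them unless dry_run."""
    current = time.time() if now is None else now
    cutoff = current - max_age_days * 86400
    stale = []
    for entry in Path(threads_dir).iterdir():
        # Age is judged by the directory's own modification time
        if entry.is_dir() and entry.stat().st_mtime < cutoff:
            stale.append(entry)
            if not dry_run:
                shutil.rmtree(entry)
    return sorted(stale)
```

Calling it first with `dry_run=True` lists what would be removed; passing `dry_run=False` performs the deletion.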

Restart services periodically:

# Automated restart script
make stop
sleep 5
make dev

Monitor and set memory limits (Docker):

# In docker-compose-dev.yaml
services:
  langgraph:
    deploy:
      resources:
        limits:
          memory: 4G  # Limit to 4GB
        reservations:
          memory: 2G  # Reserve 2GB
  gateway:
    deploy:
      resources:
        limits:
          memory: 2G

Out of memory errors in sandbox containers

Problem: OOMKilled or other memory errors in sandbox containers.

Solution:

Increase Docker memory limit:

# Docker Desktop: Settings → Resources → Memory
# Increase to at least 4GB (8GB recommended)

Configure container resource limits:

# In config.yaml for Docker sandbox
sandbox:
  use: src.community.aio_sandbox:AioSandboxProvider
  # Add resource limits (if supported by your setup)
  memory_limit: "2g"  # 2GB per sandbox
  cpu_limit: 2  # 2 CPU cores

For Kubernetes provisioner mode:

# In docker/provisioner/app/provisioner.py
# Adjust Pod resource limits:
resources=client.V1ResourceRequirements(
    requests={"memory": "512Mi", "cpu": "500m"},
    limits={"memory": "2Gi", "cpu": "2000m"}  # Increase limits
)

Clean up sandbox artifacts:

# Clear outputs from old threads
find backend/.deer-flow/threads -type f -name "*.png" -mtime +1 -delete
find backend/.deer-flow/threads -type f -name "*.pdf" -mtime +1 -delete

Limit file sizes in sandbox:

# Agent instruction in system prompt
# "Keep generated files under 10MB each"
# "Use compression for large data files"

Context Window Optimization

Context length exceeded errors

Problem: "Context length exceeded" or "Maximum token limit" errors.

Solution:

Enable summarization (most important):

# In config.yaml
summarization:
  enabled: true  # ✅ Critical for long conversations
  trigger:
    - type: fraction
      value: 0.8  # Trigger at 80% of model's limit
  keep:
    type: messages
    value: 10

Use models with larger context windows:

models:
  # GPT-4 Turbo: 128K context
  - name: gpt-4-turbo
    use: langchain_openai:ChatOpenAI
    model: gpt-4-turbo-preview

  # Claude 3.5 Sonnet: 200K context
  - name: claude-3.5-sonnet
    use: langchain_anthropic:ChatAnthropic
    model: claude-3-5-sonnet-20241022

  # Gemini 2.5 Pro: 1M context
  - name: gemini-2.5-pro
    use: langchain_google_genai:ChatGoogleGenerativeAI
    model: gemini-2.5-pro

Reduce memory injection tokens:

memory:
  max_injection_tokens: 1000  # Reduce from 2000

Limit skill injection:

# In extensions_config.json, disable unused skills (for example image
# generation, if not needed) to reduce prompt size. Standard JSON does not
# allow comments, so keep the file itself comment-free:
{
  "skills": {
    "research": {"enabled": true},
    "report-generation": {"enabled": true},
    "image-generation": {"enabled": false},
    "video-generation": {"enabled": false},
    "slide-creation": {"enabled": false}
  }
}

Optimize tool descriptions:

# Keep tool descriptions concise
# Tools with long descriptions add to context
# Edit tool docstrings in backend/src/sandbox/tools.py

Use subagents for isolated context:

# Each subagent has its own context
# Break complex tasks into subagents instead of long single conversation
# Agent will automatically delegate when appropriate

Slow response times due to large context

Problem: The model takes more than 30 seconds to respond as the conversation grows.

Solution:

Aggressive summarization:

summarization:
  enabled: true
  trigger:
    - type: tokens
      value: 8000  # Lower threshold for faster response
    - type: messages
      value: 15  # Or after 15 messages
  keep:
    type: messages
    value: 5  # Keep fewer messages

Start new thread for new topics:

# Don't try to do everything in one thread
# Use frontend "New Conversation" button
# Or create new thread_id via API

Offload data to files:

# Instead of keeping large data in conversation:
# 1. Write to /mnt/user-data/workspace/data.json
# 2. Reference file path in subsequent messages
# 3. Read only what's needed
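
The three steps above can be sketched in Python. This is an illustration under assumptions: `offload` and `read_slice` are hypothetical helpers, and the data is stored as JSON Lines (one record per line) rather than a single JSON document so that a slice can be read without loading the whole file.

```python
import json
from pathlib import Path


def offload(records, path):
    """Write one JSON object per line and return a short summary for the conversation."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    return f"{len(records)} records written to {path}"


def read_slice(path, start, count):
    """Read only records start..start+count-1, stopping as soon as they are collected."""
    rows = []
    with Path(path).open(encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            if index >= start + count:
                break
            if index >= start:
                rows.append(json.loads(line))
    return rows
```

Only the one-line summary returned by `offload` needs to enter the conversation; later steps call `read_slice` for the rows they actually use.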

Disable thinking mode for simple queries:

# Thinking mode adds significant tokens
# Only enable for complex reasoning tasks
models:
  - name: deepseek-v3
    supports_thinking: true
    # Manually toggle thinking via frontend toggle

Sandbox Performance

Slow container startup times

Problem: The sandbox container takes more than 10 seconds to start.

Solution:

Pre-pull sandbox image (most important):

# Run before first use
make setup-sandbox

# Or manually:
docker pull enterprise-public-cn-beijing.cr.volces.com/vefaas-public/all-in-one-sandbox:latest

Use Apple Container on macOS (faster than Docker):

# Install Apple Container
# Download from: https://github.com/apple/container/releases

# DeerFlow automatically uses it if available
container --version
container system start

Keep existing sandbox instead of recreating:

# In config.yaml - use existing sandbox URL
sandbox:
  use: src.community.aio_sandbox:AioSandboxProvider
  base_url: http://localhost:8080  # Reuse existing sandbox
  auto_start: false  # Don't start new containers

Optimize Docker storage driver:

# Check current driver
docker info | grep "Storage Driver"

# overlay2 is fastest (default on modern systems)
# If using devicemapper or aufs, consider migration

Use local sandbox for development:

# Fastest option - no container overhead
sandbox:
  use: src.sandbox.local:LocalSandboxProvider
# Note: Less isolation, use only for development

Slow file operations in sandbox

Problem: Reading and writing files in the sandbox is sluggish.

Solution:

Check mount type (macOS):

# macOS: Avoid volumes, use bind mounts
# Already configured correctly in DeerFlow
# Verify in docker inspect output

Reduce file I/O:

# Read files once and cache in conversation
# Avoid repeated read_file calls for same file

# Write batch files together:
# - Better: Write all outputs at end
# - Worse: Write each file individually during processing

Use smaller files:

# Large files (>100MB) slow down sandbox
# Compress before uploading
# Split large datasets into chunks

For Kubernetes provisioner - use local paths:

# Ensure SKILLS_HOST_PATH and THREADS_HOST_PATH
# point to local SSD, not network storage

Optimize Docker Desktop settings (macOS):

# Docker Desktop → Settings → Resources
# Use VirtioFS instead of gRPC FUSE (faster)
# Enable "Use new Virtualization framework"

Bash commands execute slowly in sandbox

Problem: Shell commands take longer than expected in the sandbox.

Solution:

Increase container CPU allocation:

# Docker Desktop → Settings → Resources
# Increase CPUs to 4+ (from default 2)

Use local sandbox for CPU-intensive tasks:

# For development/testing
sandbox:
  use: src.sandbox.local:LocalSandboxProvider

Optimize commands:

# Avoid: Slow commands
find / -name "*.py"  # Searches entire filesystem

# Better: Targeted commands
find /mnt/user-data -name "*.py"  # Only search workspace

Use bash agent for complex command sequences:

# Instead of multiple bash tool calls:
# Delegate to bash subagent for multi-step shell tasks
# Bash agent is optimized for command execution

Scaling Considerations

Multiple concurrent users or threads

Problem: Performance degrades with multiple simultaneous conversations.

Solution:

Use provisioner mode with Kubernetes:

# In config.yaml
sandbox:
  use: src.community.aio_sandbox:AioSandboxProvider
  provisioner_url: http://provisioner:8002
  # Each thread gets isolated Pod

Configure resource limits per thread:

# In docker/provisioner/app/provisioner.py
# Set appropriate limits based on your cluster capacity
resources=client.V1ResourceRequirements(
    requests={"memory": "256Mi", "cpu": "250m"},  # Per sandbox
    limits={"memory": "1Gi", "cpu": "1000m"}
)

Implement rate limiting (advanced):

# Add rate limiting in backend/src/gateway/app.py
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app = FastAPI()
app.state.limiter = limiter
# Return HTTP 429 when a client exceeds its limit
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/api/chat")
@limiter.limit("10/minute")  # Max 10 requests per minute per client IP
async def chat(request: Request):  # slowapi needs the `request` argument
    ...  # existing handler logic
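
Rate limiting rejects excess requests. To make them wait instead, a concurrency gate can sit in front of the handler. The following is a minimal sketch using `asyncio.Semaphore`, assuming an async handler; `RequestGate` is a hypothetical name and not a DeerFlow or FastAPI API.

```python
import asyncio


class RequestGate:
    """Let at most `max_concurrent` handlers run at once; the rest wait their turn."""

    def __init__(self, max_concurrent):
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self.active = 0  # handlers currently running
        self.peak = 0    # highest value `active` has reached

    async def run(self, handler, *args):
        async with self._semaphore:  # waits here while the gate is full
            self.active += 1
            self.peak = max(self.peak, self.active)
            try:
                return await handler(*args)
            finally:
                self.active -= 1
```

Creating the gate inside the running event loop (for example in an application startup hook) avoids loop-binding issues on older Python versions.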

Use multiple model providers:

# Distribute load across providers
models:
  - name: openai-gpt4
    use: langchain_openai:ChatOpenAI
    # ...
  - name: anthropic-claude
    use: langchain_anthropic:ChatAnthropic
    # ...
  - name: google-gemini
    use: langchain_google_genai:ChatGoogleGenerativeAI
    # ...

Horizontal scaling (advanced):

# Run multiple backend instances behind load balancer
# Use Redis for shared state/checkpointing
# Configure in docker-compose or Kubernetes

Database/storage bottlenecks (memory.json, threads)

Problem: Slow memory updates or thread data access.

Solution:

Increase memory debounce time:

# In config.yaml
memory:
  debounce_seconds: 60  # Increase from 30 to reduce writes
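
Debouncing reduces writes because updates that arrive close together collapse into a single write once things go quiet. The sketch below shows the general technique with an explicit clock so the behavior is easy to follow. `DebouncedWriter` is a hypothetical class, and DeerFlow's actual debounce logic may differ.

```python
class DebouncedWriter:
    """Hold the latest value and write it only after `debounce_seconds` without updates."""

    def __init__(self, write, debounce_seconds):
        self._write = write
        self._delay = debounce_seconds
        self._pending = None
        self._last_update = None
        self.writes = 0

    def update(self, value, now):
        """Record a new value; this restarts the quiet period."""
        self._pending = value
        self._last_update = now

    def tick(self, now):
        """Flush if the quiet period has elapsed. Returns True when a write happened."""
        if self._last_update is None or now - self._last_update < self._delay:
            return False
        self._write(self._pending)
        self.writes += 1
        self._pending = None
        self._last_update = None
        return True
```

With a 60-second delay, updates at t=0, t=30, and t=70 produce one write at or after t=130, containing only the last value.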

Use SSD for thread storage:

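# Ensure backend/.deer-flow/ is on SSD, not HDD
# Find the backing device and filesystem (Linux):
df -Th backend/.deer-flow/
# Then check whether that device is rotational (1 = HDD, 0 = SSD):
lsblk -d -o NAME,ROTA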

Clean up old threads regularly:

# Cron job to clean threads older than 30 days
0 2 * * * find /path/to/backend/.deer-flow/threads -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +

Limit fact storage:

memory:
  max_facts: 50  # Reduce from 100
  fact_confidence_threshold: 0.8  # Higher threshold = fewer facts

Use external database (advanced):

# Replace file-based memory with PostgreSQL/MongoDB
# Implement custom memory backend in backend/src/agents/memory/
# Use LangGraph's PostgresSaver for checkpointing

Network latency to LLM providers

Problem: Slow responses due to network issues.

Solution:

Use geographically closer endpoints:

# For providers with regional endpoints
models:
  - name: gpt-4
    use: langchain_openai:ChatOpenAI
    base_url: https://api.openai.com/v1  # Default
    # Some providers offer regional endpoints

Increase timeout for slow connections:

models:
  - name: gpt-4
    timeout: 120  # Increase from 60 to 120 seconds
    max_retries: 3  # Retry on timeout
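
Raising both settings also raises how long a request can hang before it finally fails. A rough upper bound can be computed as follows, assuming `max_retries` counts additional attempts after the first and each attempt may run for the full timeout. Provider SDKs differ in the details, and `worst_case_wait` is a hypothetical helper.

```python
def worst_case_wait(timeout_seconds, max_retries, backoff_seconds=0.0):
    """Upper bound on total wait: every attempt times out, plus a fixed delay per retry."""
    attempts = max_retries + 1  # the first try plus each retry
    return attempts * timeout_seconds + max_retries * backoff_seconds
```

Under these assumptions, `worst_case_wait(120, 3)` gives 480 seconds, so a frontend or proxy timeout shorter than that would cut the request off before the retries are exhausted.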

Use local models (Ollama, LM Studio):

models:
  - name: ollama-local
    use: langchain_openai:ChatOpenAI
    model: llama3.2
    base_url: http://localhost:11434/v1
    api_key: ollama
    # No network round trip to a remote provider

Implement caching (advanced):

# Cache LLM responses for repeated queries
# Use LangChain's caching layer:
from langchain_community.cache import SQLiteCache  # older LangChain releases: from langchain.cache import SQLiteCache
from langchain.globals import set_llm_cache

set_llm_cache(SQLiteCache(database_path=".langchain.db"))

Monitor provider status:

# Check provider status pages
# OpenAI: https://status.openai.com/
# Anthropic: https://status.anthropic.com/
# If provider is degraded, switch to backup model

Performance Monitoring

How to monitor and profile DeerFlow performance

Solution:

Enable detailed logging:

# In backend/src/ files
import logging
logging.basicConfig(level=logging.DEBUG)
# Check logs/backend.log for bottlenecks

Monitor resource usage:

# Real-time monitoring
watch -n 1 'ps aux | grep -E "langgraph|uvicorn" | grep -v grep'

# Docker stats
docker stats

# System monitoring
htop  # Linux (if installed)
top -o mem  # macOS (or open Activity Monitor)

Track API response times:

# Add timing to requests
time curl -X POST http://localhost:2026/api/langgraph/threads/xxx/runs/stream \
  -H "Content-Type: application/json" \
  -d '{...}'
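
A single `time` measurement is noisy, so it helps to repeat the call and look at the spread. The following is a minimal sketch with a hypothetical `measure` helper; the callable passed in is whatever performs one request.

```python
import statistics
import time


def measure(call, runs=5):
    """Time `call` several times and return min, median, and max duration in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }
```

A large gap between the median and the maximum usually points at intermittent causes such as cold starts or provider latency rather than steady overhead.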

Profile Python code (advanced):

# Add profiling to backend
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
# ... code to profile ...
profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20 functions by cumulative time

Use LangSmith for LLM tracing (advanced):

# Set environment variables
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=your-langsmith-key
export LANGCHAIN_PROJECT=deerflow

# Restart DeerFlow
# View traces at https://smith.langchain.com

Benchmark specific operations:

# Test sandbox performance
# Single quotes keep $i from being expanded by the host shell
time docker exec <sandbox-container> bash -c 'for i in {1..100}; do echo test > /tmp/test_$i.txt; done'

# Test model latency
cd backend
time uv run python -c "from langchain_openai import ChatOpenAI; llm = ChatOpenAI(model='gpt-4'); print(llm.invoke('test').content)"

Next Steps

Common Issues - General troubleshooting
Sandbox Errors - Container-specific issues
Configuration Guide - Optimize your configuration
Architecture - Understand system design for better optimization
