Ollama Source Code Structure and Architecture: A Deep Dive

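Overview

Ollama is an open-source runtime, written in Go, for running large language models (LLMs) locally. This article walks through its core architecture, design goals, and implementation.

Core Architecture Overview

System Architecture Pattern

Ollama follows a classic client-server model: the CLI and other clients talk to a local HTTP server. The design is modular, with each concern isolated in its own component.

Design Goals

Extensibility - new features and new models are easy to add
Resource efficiency - makes full use of the available hardware (CPU, GPU, memory)
Cross-platform compatibility - supports macOS, Linux, and Windows
Ease of use - a simple CLI and REST API

Architecture Layers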

┌─────────────────────────────────────────┐
│         CLI / REST API Client           │
└─────────────────┬───────────────────────┘
                  │
┌─────────────────▼───────────────────────┐
│          Server Layer (HTTP)            │
│  - API Routes (server/routes.go)        │
│  - Request Handling                     │
└─────────────────┬───────────────────────┘
                  │
┌─────────────────▼───────────────────────┐
│       Request Scheduling Layer          │
│  - Queue Management                     │
│  - Concurrency Control                  │
│  - Priority Handling                    │
└─────────────────┬───────────────────────┘
                  │
┌─────────────────▼───────────────────────┐
│       Model Execution Layer             │
│  - LLM Inference Engine                 │
│  - GPU Acceleration (CUDA/Metal/ROCm)   │
│  - GGML Backend                         │
└─────────────────┬───────────────────────┘
                  │
┌─────────────────▼───────────────────────┐
│         Storage Layer                   │
│  - Model Registry                       │
│  - Blob Storage                         │
│  - Manifest Management                  │
└─────────────────────────────────────────┘

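Core Components in Detail

1. Server Layer

The server layer is Ollama's entry point and handles every HTTP request.

Main Responsibilities

Receive and parse HTTP requests
Route each request to the matching handler
Return responses to the client
Handle errors and exceptional conditions

Core File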

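// server/routes.go (simplified sketch, not the verbatim source)
func (s *Server) GenerateHandler(c *gin.Context) {
    var req api.GenerateRequest
    if err := c.ShouldBindJSON(&req); err != nil {
        c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
        return
    }

    // Hand the request to the scheduling queue and report a rejected submission
    if err := s.sched.Submit(&req); err != nil {
        c.JSON(http.StatusServiceUnavailable, gin.H{"error": err.Error()})
    }
}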

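API Endpoints

Endpoint	Method	Purpose
`/api/generate`	POST	Generate a completion
`/api/chat`	POST	Chat completion
`/api/pull`	POST	Download a model
`/api/push`	POST	Upload a model
`/api/tags`	GET	List local models
`/api/show`	POST	Show model information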

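2. Request Scheduling Layer

The scheduling layer manages concurrent requests and resource allocation, and is one of the core innovations in Ollama's design.

Concurrency Control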

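// Simplified sketch of a scheduler; req.ID is an illustrative field.
type Scheduler struct {
    maxConcurrent int                 // maximum concurrent requests (10 in this sketch)
    queue         chan *Request       // buffered FIFO of waiting requests
    running       map[string]*Request // in-flight requests, keyed by ID
    mu            sync.Mutex
}

func (s *Scheduler) Submit(req *Request) error {
    s.mu.Lock()
    defer s.mu.Unlock()

    if len(s.running) >= s.maxConcurrent {
        // At capacity: enqueue without blocking, because the lock is held
        select {
        case s.queue <- req:
            return nil
        default:
            return errors.New("scheduler queue is full")
        }
    }

    // Capacity available: track the request and run it immediately
    s.running[req.ID] = req
    go s.execute(req)
    return nil
}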

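Queueing

FIFO queue - first in, first out, for fair scheduling
Priority support - requests can be assigned a priority
Timeout control - keeps a request from holding resources indefinitely
Cancellation - clients can cancel an in-flight request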
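
The concurrency limit and timeout described above can be sketched in a few lines. This is a minimal illustration in Python, not Ollama's Go implementation: `BoundedScheduler`, `submit`, and `demo` are hypothetical names, a semaphore stands in for the concurrency limit, and `asyncio.wait_for` supplies the timeout and cancels the job when it expires.

```python
import asyncio


class BoundedScheduler:
    """Runs at most max_concurrent jobs at once; extra jobs wait their turn."""

    def __init__(self, max_concurrent):
        self._slots = asyncio.Semaphore(max_concurrent)
        self.running = 0  # jobs currently executing
        self.peak = 0     # highest concurrency observed so far

    async def submit(self, job, timeout=None):
        # Wait for a free slot, then run the job under an optional timeout.
        async with self._slots:
            self.running += 1
            self.peak = max(self.peak, self.running)
            try:
                return await asyncio.wait_for(job(), timeout)
            finally:
                self.running -= 1


async def demo():
    sched = BoundedScheduler(max_concurrent=2)

    async def job(i):
        await asyncio.sleep(0.01)  # stand-in for model inference
        return i

    results = await asyncio.gather(
        *(sched.submit(lambda i=i: job(i)) for i in range(5))
    )
    return list(results), sched.peak
```

Running `asyncio.run(demo())` submits five jobs, but no more than two execute at the same time; the rest wait for a slot to be released.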

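3. Model Execution Layer

The execution layer performs the actual LLM inference and largely determines performance.

GGML Integration

Ollama uses GGML (named after its author, Georgi Gerganov, who also wrote llama.cpp) as its inference backend.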

cpp

// llm/llama.cpp (simplified example)
struct llama_context {
    llama_model model;
    llama_vocab vocab;
    llama_kv_cache kv_cache;
    
    // GPU-related state
    ggml_backend_t backend;
    ggml_backend_buffer_t buffer;
};

// Inference function
int llama_decode(
    struct llama_context * ctx,
    struct llama_batch batch
) {
    // 1. Prepare the inputs
    // 2. Run the forward pass
    // 3. Produce the outputs
    // 4. Update the KV cache
    return 0;  // 0 indicates success
}

GPU Acceleration

Ollama supports several GPU acceleration backends:

Platform	Backend	Status
NVIDIA	CUDA	✅ Fully supported
Apple Silicon	Metal	✅ Fully supported
AMD	ROCm	✅ Partially supported
Intel	oneAPI	🚧 Experimental

// gpu/gpu.go
func InitGPU() (*GPU, error) {
    // Probe for an available GPU backend, in order of preference
    if hasCUDA() {
        return initCUDA()
    } else if hasMetal() {
        return initMetal()
    } else if hasROCm() {
        return initROCm()
    }
    
    // Fall back to the CPU
    return initCPU(), nil
}

4. Storage Layer

The storage layer manages model files and their metadata.

The Modelfile

A Modelfile is similar to a Dockerfile: it declares a model's base and configuration.

dockerfile

FROM llama3.2:3b

# Set the system prompt
SYSTEM """
You are a professional AI assistant
"""

# Set sampling parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40

# Set the prompt template
TEMPLATE """{{ .System }}

User: {{ .Prompt }}
Assistant:"""

Storage Layout

~/.ollama/
├── models/
│   ├── manifests/          # model manifests
│   │   └── registry.ollama.ai/
│   │       └── library/
│   │           └── llama3.2/
│   │               └── 3b
│   └── blobs/              # model data blobs, named by digest
│       ├── sha256-abc123...
│       ├── sha256-def456...
│       └── ...
└── history                 # input history of the interactive CLI
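
The tree above implies a mapping from a model name such as `llama3.2:3b` to a manifest path. A minimal sketch of that mapping follows. The function name `manifest_path` is hypothetical, the default registry and namespace are read off the tree, and the `latest` default tag is an assumption of this sketch.

```python
from pathlib import PurePosixPath

DEFAULT_REGISTRY = "registry.ollama.ai"  # taken from the tree above
DEFAULT_NAMESPACE = "library"            # taken from the tree above
DEFAULT_TAG = "latest"                   # assumption: tag used when none is given


def manifest_path(name):
    """Map 'llama3.2:3b' or 'namespace/model:tag' to its manifest path."""
    repo, _, tag = name.partition(":")
    namespace, _, model = repo.rpartition("/")
    return PurePosixPath(
        "models",
        "manifests",
        DEFAULT_REGISTRY,
        namespace or DEFAULT_NAMESPACE,
        model,
        tag or DEFAULT_TAG,
    )
```

For example, `manifest_path("llama3.2:3b")` yields `models/manifests/registry.ollama.ai/library/llama3.2/3b`, which matches the layout shown.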

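Blob Management

type BlobStore struct {
    path string // root directory, e.g. ~/.ollama/models
}

// Get opens the blob stored under the given digest (e.g. "sha256-abc123...").
func (bs *BlobStore) Get(digest string) (io.ReadCloser, error) {
    return os.Open(filepath.Join(bs.path, "blobs", digest))
}

// Put streams r into a temporary file while hashing it, then renames the
// file to its content digest, so identical data is stored only once.
func (bs *BlobStore) Put(r io.Reader) (string, error) {
    h := sha256.New()

    tmp, err := os.CreateTemp(filepath.Join(bs.path, "blobs"), "blob-*")
    if err != nil {
        return "", err
    }
    defer os.Remove(tmp.Name()) // has no effect once the rename has succeeded

    // TeeReader feeds every byte copied into tmp through the hash as well
    if _, err := io.Copy(tmp, io.TeeReader(r, h)); err != nil {
        tmp.Close()
        return "", err
    }
    if err := tmp.Close(); err != nil {
        return "", err
    }

    // Name the blob after its digest, matching the sha256-... files above
    digest := "sha256-" + hex.EncodeToString(h.Sum(nil))
    if err := os.Rename(tmp.Name(), filepath.Join(bs.path, "blobs", digest)); err != nil {
        return "", err
    }
    return digest, nil
}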
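
The same content-addressing idea can be tried out quickly in Python. This is a minimal sketch under stated assumptions, not Ollama's code: the class and method names are hypothetical, and the `sha256-<hex>` file naming follows the storage layout shown earlier.

```python
import hashlib
import os
import tempfile
from pathlib import Path


class BlobStore:
    """Content-addressed store: each blob is named after its SHA-256 digest."""

    def __init__(self, root):
        self.blobs = Path(root) / "blobs"
        self.blobs.mkdir(parents=True, exist_ok=True)

    def put(self, stream, chunk_size=1 << 20):
        # Hash the data while writing it to a temporary file in the same directory.
        hasher = hashlib.sha256()
        fd, tmp_name = tempfile.mkstemp(dir=self.blobs, prefix="blob-")
        try:
            with os.fdopen(fd, "wb") as tmp:
                while chunk := stream.read(chunk_size):
                    hasher.update(chunk)
                    tmp.write(chunk)
            digest = "sha256-" + hasher.hexdigest()
            # Atomic rename; identical content maps to the same final name.
            os.replace(tmp_name, self.blobs / digest)
        except BaseException:
            if os.path.exists(tmp_name):
                os.unlink(tmp_name)
            raise
        return digest

    def get(self, digest):
        return open(self.blobs / digest, "rb")
```

Because the file name is derived from the content, storing the same bytes twice yields the same digest and a single file on disk, which is what lets several models share common layers.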

Technology Stack Analysis

Why Go?

1. Concurrency Support

Goroutines and channels make concurrent programming straightforward:

// Process requests concurrently with a pool of worker goroutines
func (s *Server) handleRequests() {
    for i := 0; i < s.workers; i++ {
        go func() {
            for req := range s.queue {
                s.process(req)
            }
        }()
    }
}

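2. Fast Compilation

Go compiles quickly, which shortens the development loop:

bash

# Build the Go code of the Ollama project
time go build
# real    0m15.234s  (sample output; varies by machine, and far shorter than a comparable C++ build)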

3. Static Typing

Type safety catches a class of errors at compile time instead of at runtime:

type GenerateRequest struct {
    Model   string  `json:"model"`
    Prompt  string  `json:"prompt"`
    Stream  bool    `json:"stream"`
    Options Options `json:"options"`
}

// Types are checked at compile time
func generate(req GenerateRequest) error {
    // req.Model is guaranteed to be a string
    return nil
}

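4. Cross-Platform

A single toolchain cross-compiles for every target by setting GOOS and GOARCH (builds that use cgo additionally need a C toolchain for the target):

bash

# Build for macOS
GOOS=darwin GOARCH=arm64 go build

# Build for Linux
GOOS=linux GOARCH=amd64 go build

# Build for Windows
GOOS=windows GOARCH=amd64 go build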

Hybrid Build System

Ollama uses a hybrid build that combines the Go toolchain with a C/C++ build:

makefile

# Makefile (simplified)
.PHONY: build
build: build-llama build-go

build-llama:
	cd llm && cmake -B build && cmake --build build

build-go:
	go build -o ollama cmd/ollama/main.go

CMake for the C/C++ Components

cmake

# llm/CMakeLists.txt
cmake_minimum_required(VERSION 3.12)
project(llama.cpp)

# CUDA support
if(LLAMA_CUDA)
    enable_language(CUDA)
    add_definitions(-DGGML_USE_CUDA)
endif()

# Metal support
if(LLAMA_METAL)
    add_definitions(-DGGML_USE_METAL)
endif()

add_library(llama SHARED llama.cpp ggml.c)

Native Go Build

// Build tags select which files are compiled
//go:build cuda
// +build cuda

package gpu

// #cgo LDFLAGS: -lcuda -lcublas
// #include "cuda_runtime.h"
import "C"

func initCUDA() (*GPU, error) {
    // CUDA initialization code goes here
    return &GPU{}, nil
}

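Directory Structure in Detail

ollama/                    # simplified, illustrative layout; the real repository differs between versions
├── api/                    # API definitions
│   ├── types.go           # data types
│   └── client.go          # client implementation
├── cmd/                   # command-line tool
│   └── ollama/
│       └── main.go        # entry point
├── server/                # server implementation
│   ├── routes.go          # API routes
│   ├── images.go          # model image management
│   └── modelpath.go       # model path resolution
├── llm/                   # LLM inference engine
│   ├── llama.cpp          # llama.cpp integration
│   ├── ggml.c             # GGML backend
│   └── generate.go        # Go interface
├── gpu/                   # GPU acceleration
│   ├── gpu.go             # GPU abstraction layer
│   ├── gpu_cuda.go        # CUDA implementation
│   ├── gpu_metal.go       # Metal implementation
│   └── gpu_rocm.go        # ROCm implementation
├── format/                # model formats
│   ├── gguf.go            # GGUF format parsing
│   └── safetensors.go     # SafeTensors support
├── parser/                # Modelfile parsing
│   └── parser.go
├── progress/              # progress tracking
│   └── progress.go
└── version/               # version information
    └── version.go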

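Performance Optimization Techniques

1. KV Cache

The KV cache is central to fast decoding: the keys and values computed for earlier tokens are stored, so each new token only needs its own attention computation.

cpp

struct llama_kv_cache {
    struct ggml_tensor * k;  // key cache
    struct ggml_tensor * v;  // value cache
    
    int n;     // number of tokens currently cached
    int size;  // maximum capacity, in tokens
};

// Use the KV cache to avoid recomputation (illustrative signature)
void llama_decode_with_cache(
    struct llama_context * ctx,
    llama_token token
) {
    // Compute attention only for the new token,
    // reusing the cached keys and values of earlier tokens
}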
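
To make the reuse concrete, here is a minimal single-head sketch in plain Python. It is an illustration, not llama.cpp's implementation: `attend`, `KVCache`, and `decode_all_without_cache` are hypothetical names, and the query, key, and value vectors are passed in precomputed, so the sketch shows only the cache bookkeeping and that the cached path matches a full recomputation.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query over all given positions."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


class KVCache:
    """Keeps the keys and values of earlier tokens so they are never rebuilt."""

    def __init__(self):
        self.keys = []
        self.values = []

    def decode(self, query, key, value):
        # Append only the new token's key/value, then attend over the cache.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


def decode_all_without_cache(steps):
    """Reference path: rebuild the full key/value lists at every step."""
    outputs = []
    for t in range(len(steps)):
        keys = [k for _, k, _ in steps[: t + 1]]
        values = [v for _, _, v in steps[: t + 1]]
        outputs.append(attend(steps[t][0], keys, values))
    return outputs
```

In a real model, the saving comes from not re-running the key and value projections for earlier tokens; with the cache, the per-token cost grows linearly with context length instead of quadratically.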

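2. Batching

Processing several requests in one batch raises throughput:

type Batch struct {
    tokens    []int32 // token IDs of all sequences, flattened
    positions []int32 // position of each token within its sequence
    seqIDs    []int32 // sequence (request) each token belongs to
}

func (e *Engine) ProcessBatch(batch Batch) []Response {
    // One forward pass serves several requests
    outputs := e.forward(batch)
    
    // Route the results back to their requests
    return e.distribute(outputs)
}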
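
The three parallel arrays in `Batch` are easiest to understand by building them. The sketch below uses hypothetical helpers `pack` and `split_outputs` to flatten several token sequences into one batch and then group per-token outputs back by sequence ID. It mirrors the field layout above and is not taken from Ollama's source.

```python
from dataclasses import dataclass, field


@dataclass
class Batch:
    tokens: list = field(default_factory=list)     # flattened token IDs
    positions: list = field(default_factory=list)  # index of each token in its sequence
    seq_ids: list = field(default_factory=list)    # owning sequence of each token


def pack(sequences):
    """Flatten {seq_id: [token, ...]} into one batch of parallel arrays."""
    batch = Batch()
    for seq_id, tokens in sequences.items():
        for position, token in enumerate(tokens):
            batch.tokens.append(token)
            batch.positions.append(position)
            batch.seq_ids.append(seq_id)
    return batch


def split_outputs(batch, outputs):
    """Group one output per token back into per-sequence lists."""
    grouped = {}
    for seq_id, output in zip(batch.seq_ids, outputs):
        grouped.setdefault(seq_id, []).append(output)
    return grouped
```

Packing `{0: [11, 12, 13], 1: [21, 22]}` gives five tokens whose positions restart at zero for the second sequence, which is how one forward pass can serve two independent requests.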

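3. Quantization

Quantizing a model reduces its memory footprint and speeds up inference. The memory figures below are nominal: block formats also store per-block scales, so Q8_0 and Q4_0 actually use about 8.5 and 4.5 bits per weight. The speed column is illustrative and depends on hardware.

Format	Nominal precision	Memory (vs. F32)	Relative speed	Quality
F32	32-bit	100%	1x	Best
F16	16-bit	50%	1.5x	Excellent
Q8_0	8-bit	~25%	2x	Good
Q4_0	4-bit	~12.5%	3x	Acceptable
Q2_K	2-bit	~6.25%	4x	Poor

// Load a quantized model (LoadModel is an illustrative helper)
model, err := LoadModel("llama3.2:3b-q4_0")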
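
The arithmetic behind the table can be checked with a small sketch of 8-bit block quantization. It follows the Q8_0 idea of 32 weights per block sharing one scale, but it is a simplification under stated assumptions: the scale is kept as a Python float rather than a 16-bit value, and all function names are hypothetical.

```python
BLOCK_SIZE = 32  # weights per block

# 32 int8 values plus one 16-bit scale, averaged over the block
Q8_0_BITS_PER_WEIGHT = (BLOCK_SIZE * 8 + 16) / BLOCK_SIZE


def quantize_q8_block(block):
    """Quantize one block of floats to integers in [-127, 127] plus a scale."""
    largest = max(abs(x) for x in block)
    scale = largest / 127 if largest else 1.0
    return scale, [round(x / scale) for x in block]


def dequantize_block(scale, quants):
    """Recover approximate floats from the stored integers."""
    return [scale * q for q in quants]


def weight_bytes(n_params, bits_per_weight):
    """Memory needed for the weights alone, in bytes."""
    return n_params * bits_per_weight / 8
```

Each restored weight differs from the original by at most half a quantization step (`scale / 2`), and `weight_bytes(3_000_000_000, Q8_0_BITS_PER_WEIGHT)` shows why a 3B-parameter model needs a little over 3 GB at 8 bits rather than exactly a quarter of its F32 size.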

Practical Applications

Building Custom Applications on Ollama

1. FastAPI Backend Integration

python

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import httpx

app = FastAPI()

OLLAMA_BASE_URL = "http://localhost:11434"

class ChatRequest(BaseModel):
    message: str
    model: str = "llama3.2:3b"

@app.post("/chat")
async def chat(request: ChatRequest):
    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"{OLLAMA_BASE_URL}/api/generate",
            json={
                "model": request.model,
                "prompt": request.message,
                "stream": False
            },
            timeout=60.0
        )
        
        if response.status_code != 200:
            raise HTTPException(status_code=500, detail="Ollama error")
        
        return response.json()

2. Streaming Responses

python

from fastapi.responses import StreamingResponse

@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
    async def generate():
        # timeout=None disables httpx's 5-second default, which long generations would exceed
        async with httpx.AsyncClient(timeout=None) as client:
            async with client.stream(
                "POST",
                f"{OLLAMA_BASE_URL}/api/generate",
                json={
                    "model": request.model,
                    "prompt": request.message,
                    "stream": True
                }
            ) as response:
                async for line in response.aiter_lines():
                    if line:
                        yield f"data: {line}\n\n"
    
    return StreamingResponse(
        generate(),
        media_type="text/event-stream"
    )
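
With `"stream": True`, `/api/generate` returns newline-delimited JSON: each line is an object carrying a `response` fragment, and the last one has `done` set to true. A minimal sketch of reassembling the text on the consumer side follows; `collect_stream` is a hypothetical helper that works on any iterable of lines, so it can be tried without a running server.

```python
import json


def collect_stream(lines):
    """Concatenate the 'response' fragments of an NDJSON stream until 'done'."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of the generation
    return "".join(parts)
```

A client that prefers the complete text over incremental display can feed the lines it receives into this function instead of forwarding them one by one.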

3. Model Management

python

@app.get("/models")
async def list_models():
    async with httpx.AsyncClient() as client:
        response = await client.get(f"{OLLAMA_BASE_URL}/api/tags")
        return response.json()

@app.post("/models/pull")
async def pull_model(model: str):
    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"{OLLAMA_BASE_URL}/api/pull",
            # stream=False returns one JSON object instead of a stream of progress lines
            json={"name": model, "stream": False},
            timeout=300.0
        )
        return response.json()

Further Study

1. Suggested Reading Order

Entry point - cmd/ollama/main.go
Server - server/routes.go
API definitions - api/types.go
Model loading - llm/llama.go
GPU acceleration - gpu/gpu.go

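2. Debugging Tips

bash

# Enable debug logging
OLLAMA_DEBUG=1 ollama serve

# Print timing statistics for each response
ollama run llama3.2:3b --verbose

# CPU profiling (works only if the server exposes the net/http/pprof handlers)
go tool pprof http://localhost:11434/debug/pprof/profile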

3. Contributing Code

bash

# Clone the repository
git clone https://github.com/ollama/ollama.git
cd ollama

# Install dependencies
go mod download

# Build
make build

# Run the tests
go test ./...

# Prepare a branch for a pull request
git checkout -b feature/my-feature
git commit -m "Add my feature"
git push origin feature/my-feature

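Summary

Through a carefully layered architecture and an efficient implementation, Ollama makes running LLMs locally simple and reliable. Its main strengths are:

Modular design - a clear layered structure that is easy to understand and extend
Efficient concurrency - Go's goroutines handle many concurrent requests well
GPU acceleration - support for several GPU platforms makes full use of the hardware
Ease of use - a Docker-like workflow keeps the learning curve low

Understanding Ollama's architecture helps you use it more effectively and offers a useful reference for building your own AI applications.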

📅 Published: February 11, 2026

📂 Category: New Concepts

🏷️ Tags: Ollama, Architecture Design, Go, LLM, Open Source
