# RAG Recall and Relevance Evaluation Guide

This guide explains how to evaluate the recall and relevance of a RAG system.

---

## 📊 Core Concepts

### 1. Recall

Recall measures: **of all the relevant documents, how many were actually retrieved?**

```
Recall@k = (number of relevant documents in the top k results) / (total number of relevant documents)
```

For example:
- There are 5 relevant documents in total
- The retriever returns 10 documents, 3 of which are relevant
- Recall@10 = 3/5 = 60%

### 2. Precision

Precision measures: **of the documents that were retrieved, how many are relevant?**

```
Precision@k = (number of relevant documents in the top k results) / k
```

For example:
- The retriever returns 10 documents, 3 of which are relevant
- Precision@10 = 3/10 = 30%

### 3. F1 Score

The F1 score is the harmonic mean of recall and precision:

```
F1@k = 2 * Recall@k * Precision@k / (Recall@k + Precision@k)
```
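
To make the relationship between the three metrics concrete, here is a minimal, self-contained sketch that reproduces the worked example above (5 relevant documents in total, 3 of which appear in the top 10). The helper name is illustrative only, not part of the project.

```python
def precision_recall_f1_at_k(retrieved_ids, relevant_ids, k):
    """Compute Precision@k, Recall@k and F1@k for a single query (illustrative helper)."""
    top_k = retrieved_ids[:k]
    hits = len(set(top_k) & set(relevant_ids))
    precision = hits / k
    recall = hits / len(relevant_ids)
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Worked example: 5 relevant docs in total, 3 of them show up in the top 10
retrieved = ["d9", "d1", "d8", "d2", "d7", "d6", "d3", "d10", "d11", "d12"]
relevant = ["d1", "d2", "d3", "d4", "d5"]
p, r, f1 = precision_recall_f1_at_k(retrieved, relevant, k=10)
print(f"Precision@10={p:.0%}, Recall@10={r:.0%}, F1@10={f1:.3f}")  # 30%, 60%, 0.400
```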

### 4. Mean Reciprocal Rank (MRR)

MRR measures the rank position of the first relevant document:

```
MRR = 1/m * sum(1/rank_i for i=1..m)
```

where rank_i is the position of the first relevant document for the i-th test case, and m is the number of test cases.

For example:
- Test case 1: first relevant document at position 2 → 1/2 = 0.5
- Test case 2: first relevant document at position 1 → 1/1 = 1.0
- Test case 3: first relevant document at position 3 → 1/3 ≈ 0.333
- MRR = (0.5 + 1.0 + 0.333) / 3 ≈ 0.611
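
The same calculation as a minimal sketch (the helper is illustrative; by the usual convention, a test case with no retrieved relevant document contributes 0):

```python
def mean_reciprocal_rank(runs):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per test case."""
    reciprocal_ranks = []
    for retrieved_ids, relevant_ids in runs:
        rr = 0.0  # convention: 0 when no relevant document is retrieved
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in relevant_ids:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Reproduces the example above: first relevant doc at ranks 2, 1 and 3
runs = [
    (["a", "rel", "b"], {"rel"}),
    (["rel", "a", "b"], {"rel"}),
    (["a", "b", "rel"], {"rel"}),
]
print(f"MRR = {mean_reciprocal_rank(runs):.3f}")  # (0.5 + 1.0 + 0.333) / 3 ≈ 0.611
```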

### 5. Relevance Scoring

Relevance scoring measures how relevant each retrieved document is to the query, typically via:
- Human evaluation
- LLM-as-a-Judge
- A relevance model (cross-encoder)

---

## 🛠️ How to Evaluate

### Method 1: Use the Built-in Evaluation Module

The project already ships with an evaluation module, `app.rag.evaluate`.

#### 1. Prepare Test Cases

First, prepare test cases with relevance annotations:

```python
from app.rag.evaluate import RetrievalTestCase

test_cases = [
    RetrievalTestCase(
        query="什么是 RAG 系统?",
        relevant_doc_ids=["doc_rag_1", "doc_rag_2", "doc_rag_3"],
        expected_answer="RAG 是 Retrieval-Augmented Generation 的缩写..."
    ),
    RetrievalTestCase(
        query="如何使用 LangChain?",
        relevant_doc_ids=["doc_langchain_1", "doc_langchain_2"],
        expected_answer="LangChain 的使用步骤包括..."
    ),
    # More test cases...
]
```

**Important notes:**
- For each query you need to know which documents are relevant
- Every relevant document needs a unique ID
- expected_answer is optional and is used to evaluate answer quality

#### 2. Run the Evaluation

```python
import asyncio
from app.rag.evaluate import RAGEvaluator, generate_test_report

# Initialize the evaluator
evaluator = RAGEvaluator(rag_pipeline, test_cases)

# Run the evaluation
metrics = asyncio.run(evaluator.evaluate_retrieval(k_list=[1, 3, 5, 10]))

# Generate the report
report = generate_test_report(metrics)
print(report)
```

#### 3. Run the Example Script

```bash
cd backend
python scripts/evaluate_rag.py
```

---

### Method 2: Compute Recall by Hand

If you want to compute it by hand, the steps are as follows:

#### Step 1: Prepare Test Data

Prepare a list of test queries, each mapped to the IDs of its relevant documents:

```python
test_queries = [
    {
        "query": "什么是 RAG?",
        "relevant_ids": ["doc1", "doc3", "doc5"]
    },
    {
        "query": "如何优化 RAG?",
        "relevant_ids": ["doc2", "doc4"]
    }
]
```

#### Step 2: Run Retrieval

For each query, run RAG retrieval and record the IDs of the returned documents:

```python
def run_retrieval(query):
    """Run retrieval and return the list of document IDs."""
    docs = rag_pipeline.retrieve(query)
    return [doc.metadata["id"] for doc in docs]
```

#### Step 3: Compute Recall

```python
def calculate_recall(retrieved_ids, relevant_ids, k):
    """Compute Recall@k."""
    top_k = retrieved_ids[:k]
    relevant_in_top_k = set(top_k) & set(relevant_ids)
    recall = len(relevant_in_top_k) / len(relevant_ids)
    return recall

# Example
retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevant = ["doc1", "doc3", "doc5"]
print(f"Recall@3: {calculate_recall(retrieved, relevant, k=3):.2%}")  # 2/3 = 66.67%
print(f"Recall@5: {calculate_recall(retrieved, relevant, k=5):.2%}")  # 3/3 = 100%
```

#### Step 4: Aggregate the Results

```python
import numpy as np

all_recalls_at_1 = []
all_recalls_at_3 = []
all_recalls_at_5 = []

for test_case in test_queries:
    retrieved = run_retrieval(test_case["query"])
    recall_1 = calculate_recall(retrieved, test_case["relevant_ids"], k=1)
    recall_3 = calculate_recall(retrieved, test_case["relevant_ids"], k=3)
    recall_5 = calculate_recall(retrieved, test_case["relevant_ids"], k=5)

    all_recalls_at_1.append(recall_1)
    all_recalls_at_3.append(recall_3)
    all_recalls_at_5.append(recall_5)

print(f"Average Recall@1: {np.mean(all_recalls_at_1):.2%}")
print(f"Average Recall@3: {np.mean(all_recalls_at_3):.2%}")
print(f"Average Recall@5: {np.mean(all_recalls_at_5):.2%}")
```

---

### Method 3: Evaluate Relevance

There are several ways to evaluate relevance:

#### Option A: LLM-as-a-Judge

```python
from app.rag.evaluate import RelevanceEvaluator

# Initialize the evaluator
evaluator = RelevanceEvaluator(llm)

# Evaluate relevance
score, reason = asyncio.run(evaluator.evaluate_relevance(query, document))

print(f"Relevance score: {score}/5")
print(f"Reason: {reason}")
```
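
If you want to see what such a judge does conceptually, or roll your own outside the built-in module, the idea can be sketched as follows. The prompt wording, the 0-5 scale mapping, and the `call_llm` callable are assumptions for illustration, not part of the project API:

```python
# Illustrative judging prompt (assumed wording, not the built-in one)
JUDGE_PROMPT = """You are grading search results.
Query: {query}
Document: {document}
On a scale of 0 (completely unrelated) to 5 (directly answers the query),
how relevant is the document to the query?
Reply as: <score>|<one-sentence reason>"""

def judge_relevance(call_llm, query, document):
    """call_llm: any callable that takes a prompt string and returns the model's text reply."""
    reply = call_llm(JUDGE_PROMPT.format(query=query, document=document))
    score_text, _, reason = reply.partition("|")
    return int(score_text.strip()), reason.strip()
```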

#### Option B: Score with the Reranking Model

The reranking model itself can produce relevance scores:

```python
from app.model_services import get_rerank_service

rerank_service = get_rerank_service()

# Get relevance scores
scores = rerank_service.compute_scores(
    query="什么是 RAG?",
    documents=["doc1", "doc2", "doc3"]
)
```
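
Continuing that example, the returned scores can be paired with their documents to rank them, or binarized with a cutoff so they can feed the Recall/Precision calculations above (the 0.5 cutoff is only an assumption; calibrate it on labeled data):

```python
# Pair each document with its score and rank from most to least relevant
docs = ["doc1", "doc2", "doc3"]
ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)

# Treat everything above an (assumed) threshold as relevant for the binary metrics
relevant_ids = [doc for doc, score in ranked if score >= 0.5]
```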

#### Option C: Human Annotation

The most accurate, but also most time-consuming, approach is to have human annotators label relevance:

```python
# Relevance rating scale
relevance_levels = {
    5: "Fully relevant, directly answers the question",
    4: "Highly relevant, contains the key information",
    3: "Partially relevant, contains some relevant information",
    2: "Weakly relevant, mentions the topic but is of little use",
    1: "Not relevant, essentially unrelated",
    0: "Completely unrelated"
}
```
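
Graded human labels can be folded back into the binary Recall/Precision metrics by picking a cutoff, for example treating ratings of 3 and above as relevant (the cutoff and the sample labels below are assumptions; match them to your annotation guideline):

```python
# human_labels: doc_id -> rating on the 0-5 scale above (illustrative data)
human_labels = {"doc1": 5, "doc2": 1, "doc3": 3, "doc4": 0}

# Binarize: anything rated >= 3 counts as relevant for Recall/Precision
relevant_ids = [doc_id for doc_id, rating in human_labels.items() if rating >= 3]
print(relevant_ids)  # ['doc1', 'doc3']
```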

---

## 📈 How to Interpret the Results

### What If Recall Is Low?

If Recall@k is low, possible causes include:

1. **The retriever is not recalling enough relevant documents**
   - The embedding model is a poor fit
   - The retrieval algorithm is too simple
   - Fix: switch to a better embedding model, or use hybrid retrieval (see the sketch after this list)

2. **Queries are not understood well enough**
   - Query rewriting is not effective
   - Fix: increase the diversity of query rewrites

3. **The document chunking strategy is poor**
   - Chunks are too small or too large
   - Fix: tune chunk_size, or use parent-child chunking
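
A common form of the hybrid retrieval mentioned in point 1 is to merge a keyword ranking (e.g. BM25) with the vector ranking via Reciprocal Rank Fusion. A minimal sketch, assuming both retrievers already return ranked document IDs (k=60 is the conventional RRF constant):

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked lists of document IDs into one fused ranking."""
    fused = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Example: keyword-based ranking and vector-based ranking for the same query
bm25_ids = ["doc2", "doc7", "doc1"]
vector_ids = ["doc1", "doc3", "doc2"]
print(reciprocal_rank_fusion([bm25_ids, vector_ids]))  # doc1 and doc2 rise to the top
```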

### What If Precision Is Low?

If Precision@k is low, possible causes include:

1. **The retrieval results are noisy**
   - Fix: strengthen reranking

2. **Document splitting is problematic**
   - Irrelevant fragments are also being retrieved
   - Fix: improve the splitting strategy

---

## 🎯 Evaluation Best Practices

### 1. Building Test Cases

- ✅ **Cover diverse query types**: factual, conceptual, and procedural
- ✅ **Give each query multiple relevant documents**: avoid depending on a single document
- ✅ **Include hard cases**: test edge cases
- ✅ **Update regularly**: refresh the test cases as the knowledge base changes

### 2. Choosing Metrics

- **Fast iteration**: focus on Recall@3 and Recall@5
- **Formal releases**: run the full evaluation across all metrics
- **User experience**: also evaluate answer quality

### 3. A/B Testing

When you improve the RAG system, use A/B testing to compare versions:

```python
# Version A (old version)
metrics_a = evaluator.evaluate_retrieval()

# Version B (new version)
metrics_b = evaluator_new.evaluate_retrieval()

# Compare
print(f"Recall@5 improvement: {metrics_b.recall_at_k[5] - metrics_a.recall_at_k[5]:.2%}")
```

---

## 📝 Sample Evaluation Report

After running the evaluation, a report like the following is generated:

```
================================================================================
                        RAG System Evaluation Report
================================================================================

[Recall@k]
  Recall@1:  60.00%
  Recall@3:  85.00%
  Recall@5:  95.00%
  Recall@10: 100.00%

[Precision@k]
  Precision@1:  100.00%
  Precision@3:  90.00%
  Precision@5:  80.00%
  Precision@10: 55.00%

[F1@k]
  F1@1:  0.7500
  F1@3:  0.8718
  F1@5:  0.8636
  F1@10: 0.7097

[Mean Reciprocal Rank (MRR)]: 0.8500

================================================================================
Metric definitions:
- Recall@k: the fraction of all relevant documents that appear in the top k results
- Precision@k: the fraction of the top k results that are relevant documents
- F1@k: the harmonic mean of recall and precision
- MRR: the average of the reciprocal rank of the first relevant document
================================================================================
```

---

## 🔗 Related Files

- `backend/app/rag/evaluate.py` - evaluation module
- `backend/scripts/evaluate_rag.py` - example evaluation script
- `backend/app/rag/pipeline.py` - RAG pipeline
- `backend/app/model_services/` - model services