Databricks' Evaluation of Long-Context RAG

Overall Results

[Figure: average accuracy across the four evaluation sets]

[Figure: accuracy on DocsQA]

[Figure: accuracy on HotpotQA]

Evaluation Setup

Main settings of the evaluation

Retrieval stage

  • Embedding: text-embedding-3-large
  • Chunk size: 512 tokens
  • Chunk overlap: 256 tokens
  • Vector store: FAISS
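The chunking settings above (512-token chunks with a 256-token overlap) imply a sliding window with a stride of 256 tokens. A minimal sketch of that windowing, not Databricks' actual pipeline code; the function name is illustrative:

```python
def chunk_tokens(tokens, chunk_size=512, overlap=256):
    """Split a token sequence into overlapping chunks.

    With chunk_size=512 and overlap=256, consecutive chunks share
    256 tokens, i.e. the window advances by stride = 512 - 256 = 256.
    """
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks
```

Each chunk would then be embedded (here, with text-embedding-3-large) and indexed in FAISS.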

Generation stage

  • Models: gpt-4o, claude-3-5-sonnet, claude-3-opus, etc.
  • Temperature: 0
  • Max output tokens: 1024
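As a config fragment, the generation settings map onto the usual chat-completion request parameters roughly like this (a hypothetical illustration, not the blog's actual harness code):

```python
# Request parameters mirroring the blog's generation settings.
generation_params = {
    "model": "gpt-4o",     # one of the evaluated models
    "temperature": 0,      # greedy / deterministic decoding
    "max_tokens": 1024,    # the blog's max_output_tokens
}
```

Temperature 0 keeps the evaluation reproducible across runs; the 1024-token cap bounds answer length without affecting the retrieved context.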

Recall@k

| Recall@k \ Context length | 2k | 4k | 8k | 16k | 32k | 64k | 96k | 128k | 160k | 192k |
|---|---|---|---|---|---|---|---|---|---|---|
| # Retrieved chunks | 1 | 5 | 13 | 29 | 61 | 125 | 189 | 253 | 317 | 381 |
| Databricks DocsQA | 0.547 | 0.856 | 0.906 | 0.957 | 0.978 | 0.986 | 0.993 | 0.993 | 0.993 | 0.993 |
| FinanceBench | 0.097 | 0.287 | 0.493 | 0.603 | 0.764 | 0.856 | 0.916 | 0.916 | 0.916 | 0.916 |
| NQ | 0.845 | 0.992 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| HotPotQA | 0.382 | 0.672 | 0.751 | 0.797 | 0.833 | 0.864 | 0.880 | 0.890 | 0.890 | 0.890 |
| Average | 0.468 | 0.702 | 0.788 | 0.839 | 0.894 | 0.927 | 0.947 | 0.95 | 0.95 | 0.95 |

The retrieved-chunk counts are consistent with filling the context window with 512-token chunks after reserving roughly 1.5k tokens for the prompt, i.e. k ≈ (L − 1536) / 512; for example, at 4k context, (4096 − 1536) / 512 = 5.
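Recall@k here measures what fraction of the gold passages for a question appear among the top-k retrieved chunks. A minimal sketch of the metric, with illustrative names (the blog does not publish its scoring code):

```python
def recall_at_k(retrieved_ids, gold_ids, k):
    """Fraction of gold passages present in the top-k retrieved chunks."""
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for g in gold_ids if g in top_k)
    return hits / len(gold_ids)
```

As the table shows, larger context windows admit more chunks, so recall rises monotonically with context length and saturates once nearly all gold passages fit.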

Original post

https://www.databricks.com/blog/long-context-rag-performance-llms
