Vector database

This commit is contained in:
2026-04-18 16:56:23 +08:00
parent 0470afce13
commit c18e8a9860
11 changed files with 1121 additions and 0 deletions

2
.gitignore vendored

@@ -13,6 +13,8 @@
!frontend/**
!scripts/
!scripts/**
!rag_indexer/
!rag_indexer/**
!docker/
!docker/**
!.gitea/

109
rag_indexer/README.md Normal file

@@ -0,0 +1,109 @@
# Offline RAG Indexer
This module implements stage one of the RAG system: **offline index construction**. It cleans external unstructured data (documents, PDFs, web pages, etc.), splits it into chunks, converts the chunks into vectors, and finally writes them into the vector database.
## 📊 System Workflow
```mermaid
graph TD
    A[Raw document collection <br> PDF / Word / Markdown] --> B(DocumentLoader)
    B --> C{Splitting strategy}
    C -->|Basic| D1[Fixed-length character split <br> Recursive Split]
    C -->|Intermediate| D2[Semantic boundary split <br> Semantic Chunking]
    C -->|Advanced| D3[Parent-child split <br> Parent-Child / Auto-merging]
    D1 & D2 & D3 --> E[Embedder <br> llama.cpp: embeddinggemma]
    E --> F[(Qdrant vector database)]
    subgraph "Metadata management"
        G[Extract metadata: author, date, page numbers] -.attach.-> E
    end
```
---
## 🎯 Roadmap and Core Algorithms
### Level 1: Basic Recursive Splitting
- **Core algorithm**: recursive character splitting. It tries a predefined list of separators (e.g. `["\n\n", "\n", " ", ""]`) from coarsest to finest until every chunk satisfies the maximum length limit.
- **Trade-offs**: extremely simple to implement and fast, but it easily cuts a sentence in half, losing contextual meaning.
- **Implementation guide** (a minimal sketch follows this section):
  - Import `RecursiveCharacterTextSplitter` from `langchain.text_splitter` (or the newer `langchain_text_splitters` package).
  - Instantiate it with `chunk_size` (e.g. 500) and `chunk_overlap` (e.g. 50), then call `.split_documents(raw_docs)` directly.
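A minimal sketch of the Level 1 pipeline; the loader class and the file path are illustrative placeholders, not part of this repository:

```python
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load a raw document (any loader that returns Document objects works here).
raw_docs = TextLoader("data/user_docs/sample.txt").load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # maximum characters per chunk
    chunk_overlap=50,  # overlap so sentences are not cut cold at chunk borders
)
chunks = splitter.split_documents(raw_docs)
print(f"{len(raw_docs)} documents -> {len(chunks)} chunks")
```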
### Level 2: Semantic Chunking
- **Core algorithm**: sentence-level similarity thresholding.
  1. Split the text into sentences at punctuation boundaries.
  2. Embed every sentence with a lightweight embedding model.
  3. Compute the cosine similarity between each pair of adjacent sentences.
  4. When the similarity drops below a configured threshold (i.e. the two sentences are no longer about the same thing and the topic has shifted), cut at that point and start a new chunk.
- **Trade-offs**: preserves semantic coherence within each chunk very well, which makes the chunks much friendlier for LLM answering; but because the embedding model is already invoked at splitting time, it is noticeably slower.
- **Implementation guide** (see the sketch below):
  - Import `SemanticChunker` from `langchain_experimental.text_splitter`.
  - Instantiate it with your already-configured embedding model instance (e.g. the local llama.cpp model wrapped via `OpenAIEmbeddings`) and set threshold parameters such as `breakpoint_threshold_type="percentile"`.
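A minimal sketch of semantic chunking against a local llama.cpp embedding endpoint; the URL and model name are assumptions (any `Embeddings` instance works), and `raw_docs` comes from the Level 1 sketch:

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# llama.cpp exposes an OpenAI-compatible /v1/embeddings endpoint.
embeddings = OpenAIEmbeddings(
    openai_api_base="http://127.0.0.1:8082/v1",  # assumed local endpoint
    openai_api_key="not-needed",
    model="embeddinggemma-300M-Q8_0",
)

splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",  # cut where similarity drops below the Nth percentile
)
semantic_chunks = splitter.split_documents(raw_docs)
```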
### Level 3: Parent-Child / Auto-merging
- **Core algorithm**: hierarchical dual storage with ID mapping.
- **Splitting**: first split the document coarsely into larger "parent chunks" (~1000 words), then split each parent into smaller "child chunks" (~200 words).
- **Storage**: only the **child** vectors go into Qdrant for precise distance computation; the original **parent** content is kept in memory or in a document store (e.g. a KV database), mapped to the children via UUIDs.
- **Key idea**: resolves the classic RAG tension: at retrieval time, smaller chunks hit more precisely (less noise); at generation time, larger chunks give the LLM richer context.
- **Implementation guide** (see the sketch below):
  - Use the `ParentDocumentRetriever` from `langchain.retrievers`.
  - At write time you need both an underlying `VectorStore` (Qdrant here) and a `BaseStore` (e.g. the built-in `InMemoryStore` or `Redis`).
  - Assign two different `TextSplitter` instances to the retriever's `child_splitter` and `parent_splitter`, then call `.add_documents()`; the mapping is handled automatically.
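A minimal sketch of the parent-child setup; it reuses `embeddings` and `raw_docs` from the previous sketches, assumes the Qdrant collection already exists, and uses an in-memory docstore for brevity (collection name, URL and query are illustrative):

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_qdrant import QdrantVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient

client = QdrantClient(url="http://127.0.0.1:6333")  # assumed local Qdrant
vectorstore = QdrantVectorStore(
    client=client,
    collection_name="rag_documents",  # must already exist
    embedding=embeddings,
)

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,           # child-chunk vectors live here
    docstore=InMemoryStore(),          # parent chunks live here
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=200),
    parent_splitter=RecursiveCharacterTextSplitter(chunk_size=1000),
)
retriever.add_documents(raw_docs)      # splits, maps and indexes in one call
parents = retriever.invoke("How do I reset the device?")  # returns parent chunks
```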
### Level 4: GraphRAG and Multi-modal
- **Core algorithm**: LLM-based entity and relation extraction (NER & Relation Extraction).
- **Key idea**: addresses the weakness of pure vector retrieval on cross-document relational reasoning ("Who is the CEO of company A, and what is the main business of company B that he owns?" -- questions that jump across many PDF pages).
- **Implementation guide** (see the sketch below):
  - Use a local LLM (e.g. `Gemma-4-E2B`) together with the `langchain_community.graphs` module.
  - With the `LLMGraphTransformer` component, a predefined prompt forces the model to extract entities (nodes) and relations (edges) while reading the documents, and the result is written into a graph database such as Neo4j instead of the traditional Qdrant vector store.
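A rough sketch of the GraphRAG ingestion path. Note that `LLMGraphTransformer` lives in `langchain_experimental.graph_transformers` while the Neo4j wrapper is in `langchain_community.graphs`; the local chat endpoint, credentials and model name below are assumptions, and the chosen LLM must support structured output for the extraction prompt to work:

```python
from langchain_community.graphs import Neo4jGraph
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

# Any chat model works; a local llama.cpp server is assumed here.
llm = ChatOpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed", model="gemma")

transformer = LLMGraphTransformer(llm=llm)
graph_documents = transformer.convert_to_graph_documents(raw_docs)  # extract nodes/edges

graph = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password="password")
graph.add_graph_documents(graph_documents)  # write entities and relations to Neo4j
```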
---
## 📦 Dependencies and Installation
To support full document parsing and Qdrant writes, install the following Python packages:
```bash
# Core libraries
pip install langchain langchain-core langchain-openai langchain-qdrant
# Complex document parsing (PDF, Word, Excel, ...)
pip install unstructured pdf2image pdfminer.six
# Semantic chunking (optional)
pip install langchain-experimental
```
---
## 📂 Architecture and File Layout
Create the following core files under `rag_indexer/`:
```text
rag_indexer/
├── __init__.py
├── loaders.py        # parses different file types via unstructured
├── splitters.py      # implements the Recursive, Semantic and Parent-Child splitting logic
├── embedders.py      # wraps the embedding interface to the local llama.cpp service
├── vector_store.py   # wraps Qdrant collection init, upsert and write operations
└── builder.py        # orchestration: wires the modules above into a pipeline
```
---
### Wiring and Triggering
Outside your LangGraph system, create an execution script `scripts/run_indexer.py`:
```bash
# Run from a terminal to push a local PDF manual into the vector database
export QDRANT_URL="http://115.190.121.151:6333"
python scripts/run_indexer.py --file data/user_docs/tech_manual.pdf
```
This is the system's background **"offline learning" stage**: you can attach a scheduled job at any time to scan a folder and incrementally update the knowledge base.
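A minimal sketch of what `scripts/run_indexer.py` could look like; it simply delegates to the `IndexBuilder` pipeline described above, and the argument names are illustrative:

```python
#!/usr/bin/env python
"""Thin wrapper that feeds files into the offline indexing pipeline."""
import argparse

from rag_indexer.builder import IndexBuilder
from rag_indexer.splitters import SplitterType


def main() -> None:
    parser = argparse.ArgumentParser(description="Offline RAG indexer entry point")
    parser.add_argument("--file", required=True, help="Path to the document to index")
    parser.add_argument("--collection", default="rag_documents")
    args = parser.parse_args()

    # QDRANT_URL is read from the environment inside the vector store wrapper.
    builder = IndexBuilder(collection_name=args.collection,
                           splitter_type=SplitterType.RECURSIVE)
    count = builder.build_from_file(args.file)
    print(f"Indexed {count} chunks into '{args.collection}'")


if __name__ == "__main__":
    main()
```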

25
rag_indexer/__init__.py Normal file

@@ -0,0 +1,25 @@
"""
Offline RAG Indexer module.
"""
from .loaders import DocumentLoader
from .splitters import (
RecursiveSplitter,
SemanticSplitter,
ParentChildSplitter,
SplitterType,
)
from .embedders import LlamaCppEmbedder
from .vector_store import QdrantVectorStore
from .builder import IndexBuilder
__all__ = [
"DocumentLoader",
"RecursiveSplitter",
"SemanticSplitter",
"ParentChildSplitter",
"SplitterType",
"LlamaCppEmbedder",
"QdrantVectorStore",
"IndexBuilder",
]

277
rag_indexer/builder.py Normal file

@@ -0,0 +1,277 @@
"""
Core pipeline builder for offline RAG index construction.
Now supports LangChain's ParentDocumentRetriever for parent-child chunking.
"""
import logging
from pathlib import Path
from typing import List, Union, Optional, Tuple
from dataclasses import dataclass
from langchain_core.documents import Document
from langchain.retrievers import ParentDocumentRetriever
from langchain_core.stores import BaseStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from .loaders import DocumentLoader
from .splitters import SplitterType, get_splitter, ParentChildSplitter
from .embedders import LlamaCppEmbedder
from .vector_store import QdrantVectorStore
from .docstore_manager import get_docstore, PostgresDocStore, create_docstore
logger = logging.getLogger(__name__)
@dataclass
class ParentChildConfig:
"""Configuration for parent-child splitting."""
parent_chunk_size: int = 1000
child_chunk_size: int = 200
parent_chunk_overlap: int = 100
child_chunk_overlap: int = 20
search_k: int = 5
docstore_path: str = None
docstore_type: str = "local"
docstore_conn_string: str = None
class IndexBuilder:
"""Main pipeline for RAG index construction."""
def __init__(
self,
collection_name: str = "rag_documents",
qdrant_url: str = None,
splitter_type: SplitterType = SplitterType.RECURSIVE,
**splitter_kwargs,
):
self.collection_name = collection_name
self.qdrant_url = qdrant_url
self.splitter_type = splitter_type
self.splitter_kwargs = splitter_kwargs
# Components
self.loader = DocumentLoader()
self.embedder = LlamaCppEmbedder()
self.embeddings = self.embedder.as_langchain_embeddings()
self.vector_store = QdrantVectorStore(
collection_name=collection_name,
embeddings=self.embeddings,
qdrant_url=qdrant_url,
)
        # Splitter (parent-child is handled separately via ParentDocumentRetriever)
        if splitter_type != SplitterType.PARENT_CHILD:
            if splitter_type == SplitterType.SEMANTIC:
                splitter_kwargs["embeddings"] = self.embeddings
            self.splitter = get_splitter(splitter_type, **splitter_kwargs)
        else:
            self.splitter = None
            # Pass the configured chunk sizes / docstore settings on to the retriever
            self._init_parent_child_retriever(**splitter_kwargs)
def _init_parent_child_retriever(self, **kwargs):
"""
Initialize ParentDocumentRetriever for parent-child chunking.
This replaces the custom ParentChildSplitter logic.
"""
# Parse kwargs for parent-child config
parent_size = kwargs.get("parent_chunk_size", 1000)
child_size = kwargs.get("child_chunk_size", 200)
parent_overlap = kwargs.get("parent_chunk_overlap", kwargs.get("chunk_overlap", 100))
child_overlap = kwargs.get("child_chunk_overlap", kwargs.get("chunk_overlap", 20))
# Define splitters
self.parent_splitter = RecursiveCharacterTextSplitter(
chunk_size=parent_size,
chunk_overlap=parent_overlap,
)
self.child_splitter = RecursiveCharacterTextSplitter(
chunk_size=child_size,
chunk_overlap=child_overlap,
)
# Vector store (for child chunks)
self.vector_store_obj = self.vector_store.get_langchain_vectorstore()
# Document store (for parent chunks)
docstore_path = kwargs.get("docstore_path")
docstore_type = kwargs.get("docstore_type", "local")
docstore_conn = kwargs.get("docstore_conn_string")
if docstore_type == "postgres" and docstore_conn:
self.docstore = PostgresDocStore(docstore_conn)
self._docstore_conn = docstore_conn
else:
self.docstore = get_docstore(docstore_path)
self._docstore_conn = None
# Create retriever
self.retriever = ParentDocumentRetriever(
vectorstore=self.vector_store_obj,
docstore=self.docstore,
child_splitter=self.child_splitter,
parent_splitter=self.parent_splitter,
search_kwargs={"k": kwargs.get("search_k", 5)},
)
def build_from_file(self, file_path: Union[str, Path]) -> int:
logger.info("Loading file: %s", file_path)
documents = self.loader.load_file(file_path)
logger.info("Loaded %d documents", len(documents))
return self._process_documents(documents)
def build_from_directory(self, directory_path: Union[str, Path], recursive: bool = True) -> int:
logger.info("Loading directory: %s (recursive=%s)", directory_path, recursive)
documents = self.loader.load_directory(directory_path, recursive=recursive)
logger.info("Loaded %d documents from directory", len(documents))
return self._process_documents(documents)
def _process_documents(self, documents: List[Document]) -> int:
if not documents:
logger.warning("No documents to process")
return 0
if self.splitter_type == SplitterType.PARENT_CHILD:
logger.info("Using LangChain ParentDocumentRetriever")
# Ensure collection exists for child chunks
self.vector_store.create_collection()
# Use ParentDocumentRetriever to add documents
# This automatically handles parent-child splitting, mapping, and retrieval
self.retriever.add_documents(documents)
# Log estimated chunk counts
            estimated_child_chunks = len(documents) * (
                self.parent_splitter._chunk_size // self.child_splitter._chunk_size
            )
            logger.info(
                "Indexed with ParentDocumentRetriever: ~%d parent documents, ~%d child chunks",
                len(documents),
                estimated_child_chunks,
            )
return len(documents)
else:
logger.info("Splitting documents using %s", self.splitter_type)
chunks = self.splitter.split_documents(documents)
logger.info("Split into %d chunks", len(chunks))
self.vector_store.create_collection()
self.vector_store.add_documents(chunks)
return len(chunks)
def get_collection_info(self):
return self.vector_store.get_collection_info()
def search(self, query: str, k: int = 5) -> List[Document]:
"""Standard search - returns child chunks."""
return self.vector_store.similarity_search(query, k=k)
def search_with_parent_context(self, query: str, k: int = 5) -> List[Document]:
"""
Search with parent context - returns full parent chunks.
This is the main retrieval method when using parent-child splitting.
"""
if self.splitter_type != SplitterType.PARENT_CHILD:
raise RuntimeError(
"search_with_parent_context only available with PARENT_CHILD splitter. "
"Use search() for standard retrieval."
)
        # k controls how many child chunks are fetched; the retriever then swaps the
        # matched children for their parent documents.
        self.retriever.search_kwargs["k"] = k
        return self.retriever.invoke(query)
def retrieve(self, query: str, return_parent: bool = True) -> List[Document]:
"""
Unified retrieval interface.
Args:
query: Search query
return_parent: If True and using parent-child splitter, return parent chunks
If False, always return child chunks
Returns:
List of relevant documents
"""
if self.splitter_type == SplitterType.PARENT_CHILD and return_parent:
return self.search_with_parent_context(query)
else:
return self.search(query)
def get_retriever(self) -> ParentDocumentRetriever:
"""
Get the ParentDocumentRetriever instance directly.
Useful for advanced use cases where you want to access the retriever
outside of IndexBuilder.
"""
if self.splitter_type != SplitterType.PARENT_CHILD:
raise RuntimeError(
"get_retriever() only available with PARENT_CHILD splitter. "
"Use search() or search_with_parent_context() for standard retrieval."
)
return self.retriever
def get_child_splitter(self) -> "RecursiveCharacterTextSplitter":
"""Get the child splitter for reconfiguration."""
if self.splitter_type != SplitterType.PARENT_CHILD:
return self.splitter
return self.child_splitter
def get_parent_splitter(self) -> "RecursiveCharacterTextSplitter":
"""Get the parent splitter for reconfiguration."""
if self.splitter_type != SplitterType.PARENT_CHILD:
raise RuntimeError(
"Parent splitter only available with PARENT_CHILD splitter."
)
return self.parent_splitter
def get_docstore(self) -> BaseStore:
"""Get the document store for parent chunks."""
if self.splitter_type != SplitterType.PARENT_CHILD:
raise RuntimeError(
"Docstore only available with PARENT_CHILD splitter."
)
return self.docstore
def get_docstore_path(self) -> str:
"""Get the document store path."""
if self.splitter_type != SplitterType.PARENT_CHILD:
raise RuntimeError(
"Docstore path only available with PARENT_CHILD splitter."
)
        # LocalFileStore keeps its directory in root_path (possibly behind the
        # create_kv_docstore wrapper); other store types have no filesystem path.
        store = getattr(self.docstore, "store", self.docstore)
        root = getattr(store, "root_path", None)
        return str(root) if root is not None else ""
def close(self):
"""Close resources."""
        docstore = getattr(self, "docstore", None)
        if isinstance(docstore, PostgresDocStore):
            docstore.close()
            logger.info("Closed PostgreSQL connection")
def __enter__(self):
return self
def __exit__(self, exc_type, exc_val, exc_tb):
self.close()
return False
if __name__ == "__main__":
# Example usage
builder = IndexBuilder(
splitter_type=SplitterType.PARENT_CHILD,
parent_chunk_size=1000,
child_chunk_size=200,
docstore_path="./my_parent_docs",
)
print("Parent splitter:", builder.get_parent_splitter().chunk_size)
print("Child splitter:", builder.get_child_splitter().chunk_size)
print("Docstore path:", builder.get_docstore_path())
print("Retriever:", builder.get_retriever())

102
rag_indexer/cli.py Executable file

@@ -0,0 +1,102 @@
"""
Command-line interface for the RAG index builder.
"""
import argparse
import logging
import sys
from builder import IndexBuilder
from splitters import SplitterType
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
def main():
parser = argparse.ArgumentParser(description="Offline RAG Index Builder")
parser.add_argument("--file", type=str, help="Path to file to index")
parser.add_argument("--dir", type=str, help="Path to directory to index")
parser.add_argument("--recursive", action="store_true", default=True,
help="Recursively process directories (default: True)")
parser.add_argument("--collection", type=str, default="rag_documents",
help="Qdrant collection name (default: rag_documents)")
parser.add_argument("--qdrant-url", type=str,
help="Qdrant server URL (default: http://127.0.0.1:6333)")
parser.add_argument("--splitter", type=str,
choices=["recursive", "semantic", "parent_child"],
default="recursive",
help="Text splitting strategy (default: recursive)")
parser.add_argument("--chunk-size", type=int, default=500,
help="Chunk size for recursive/parent splitter (default: 500)")
parser.add_argument("--chunk-overlap", type=int, default=50,
parser.add_argument("--docstore-path", type=str,
default=None,
help="Path to store parent documents for parent-child splitter (default: ./parent_docs or HERMES_HOME/parent_docs)")
parser.add_argument("--docstore-type", type=str,
choices=["local", "postgres"],
default="local",
help="Type of docstore: 'local' (default) or 'postgres' for PostgreSQL-backed storage")
parser.add_argument("--docstore-conn", type=str,
default=None,
help="PostgreSQL connection string for postgres docstore")
help="Chunk overlap (default: 50)")
parser.add_argument("--parent-size", type=int, default=1000,
help="Parent chunk size for parent-child splitter (default: 1000)")
parser.add_argument("--child-size", type=int, default=200,
help="Child chunk size for parent-child splitter (default: 200)")
args = parser.parse_args()
if not args.file and not args.dir:
print("Error: Either --file or --dir must be specified", file=sys.stderr)
parser.print_help()
sys.exit(1)
splitter_map = {
"recursive": SplitterType.RECURSIVE,
"semantic": SplitterType.SEMANTIC,
"parent_child": SplitterType.PARENT_CHILD,
}
splitter_type = splitter_map[args.splitter]
splitter_kwargs = {}
if splitter_type == SplitterType.RECURSIVE:
splitter_kwargs["chunk_size"] = args.chunk_size
splitter_kwargs["chunk_overlap"] = args.chunk_overlap
elif splitter_type == SplitterType.PARENT_CHILD:
splitter_kwargs["parent_chunk_size"] = args.parent_size
splitter_kwargs["child_chunk_size"] = args.child_size
splitter_kwargs["parent_chunk_overlap"] = args.chunk_overlap
splitter_kwargs["child_chunk_overlap"] = args.chunk_overlap // 2
splitter_kwargs["docstore_path"] = args.docstore_path
splitter_kwargs["docstore_type"] = args.docstore_type
splitter_kwargs["docstore_conn_string"] = args.docstore_conn
builder = IndexBuilder(
collection_name=args.collection,
qdrant_url=args.qdrant_url,
splitter_type=splitter_type,
**splitter_kwargs
)
try:
if args.file:
chunk_count = builder.build_from_file(args.file)
else:
chunk_count = builder.build_from_directory(args.dir, args.recursive)
print(f"Indexing completed. Total chunks indexed: {chunk_count}")
info = builder.get_collection_info()
print(f"Collection '{info['name']}' has {info['vectors_count']} vectors (dim={info['vector_size']})")
except Exception as e:
logging.exception("Indexing failed")
sys.exit(1)
if __name__ == "__main__":
main()

142
rag_indexer/docstore_manager.py Normal file

@@ -0,0 +1,142 @@
"""
Document store manager for ParentDocumentRetriever.
Supports both LocalFileStore (default) and custom PostgreSQL-backed stores.
"""
import json
import os
from typing import Iterator, List, Optional, Sequence, Tuple
from langchain.storage import LocalFileStore, create_kv_docstore
from langchain_core.documents import Document
from langchain_core.stores import BaseStore
def get_docstore(persist_path: str = None) -> BaseStore:
"""
Create and return a document store for parent chunks.
Args:
persist_path: Path to store parent documents. Defaults to ./parent_docs
or HERMES_HOME/parent_docs if set.
"""
if persist_path is None:
# Use HERMES_HOME if available, otherwise default to current directory
persist_path = os.getenv("HERMES_HOME")
if persist_path:
persist_path = os.path.join(persist_path, "parent_docs")
else:
persist_path = "./parent_docs"
    os.makedirs(persist_path, exist_ok=True)
    # LocalFileStore persists raw bytes, so it is wrapped with create_kv_docstore()
    # to (de)serialize the Document objects that ParentDocumentRetriever hands over.
    return create_kv_docstore(LocalFileStore(persist_path))
class PostgresDocStore(BaseStore[str, Document]):
    """
    PostgreSQL-backed document store for parent chunks.
    Implements the BaseStore interface (mget/mset/mdelete/yield_keys) that
    ParentDocumentRetriever expects, storing each parent Document as JSONB.
    This is an optional advanced feature. For most use cases,
    LocalFileStore is sufficient and simpler.
    """
    def __init__(self, connection_string: str):
        """
        Initialize PostgreSQL document store.
        Args:
            connection_string: PostgreSQL connection URL
        """
        self.conn_string = connection_string
        self._conn = None
        # Create table if not exists
        self._create_table()
    def _ensure_connection(self):
        """Ensure we have an open connection."""
        import psycopg2  # imported lazily so psycopg2 stays an optional dependency
        if self._conn is None or self._conn.closed:
            self._conn = psycopg2.connect(self.conn_string)
    def _create_table(self):
        """Create the parent documents table if it does not exist."""
        try:
            self._ensure_connection()
            cursor = self._conn.cursor()
            cursor.execute("""
                CREATE TABLE IF NOT EXISTS parent_documents (
                    key TEXT PRIMARY KEY,
                    value JSONB NOT NULL,
                    created_at TIMESTAMPTZ DEFAULT NOW()
                )
            """)
            self._conn.commit()
            cursor.close()
        except Exception as e:
            raise RuntimeError(f"Failed to create PostgreSQL table: {e}")
    def mget(self, keys: Sequence[str]) -> List[Optional[Document]]:
        """Retrieve documents by key; missing keys yield None."""
        self._ensure_connection()
        cursor = self._conn.cursor()
        results: List[Optional[Document]] = []
        for key in keys:
            cursor.execute("SELECT value FROM parent_documents WHERE key = %s", (key,))
            row = cursor.fetchone()
            if row:
                # psycopg2 usually decodes JSONB to a dict already
                payload = row[0] if isinstance(row[0], dict) else json.loads(row[0])
                results.append(Document(**payload))
            else:
                results.append(None)
        cursor.close()
        return results
    def mset(self, key_value_pairs: Sequence[Tuple[str, Document]]) -> None:
        """Store documents, upserting on key conflicts."""
        self._ensure_connection()
        cursor = self._conn.cursor()
        for key, doc in key_value_pairs:
            payload = json.dumps(
                {"page_content": doc.page_content, "metadata": doc.metadata}
            )
            cursor.execute(
                "INSERT INTO parent_documents (key, value) VALUES (%s, %s) "
                "ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value",
                (key, payload),
            )
        self._conn.commit()
        cursor.close()
    def mdelete(self, keys: Sequence[str]) -> None:
        """Delete documents by key."""
        self._ensure_connection()
        cursor = self._conn.cursor()
        cursor.execute("DELETE FROM parent_documents WHERE key = ANY(%s)", (list(keys),))
        self._conn.commit()
        cursor.close()
    def yield_keys(self, *, prefix: Optional[str] = None) -> Iterator[str]:
        """Yield all stored keys, optionally filtered by prefix."""
        self._ensure_connection()
        cursor = self._conn.cursor()
        if prefix:
            cursor.execute("SELECT key FROM parent_documents WHERE key LIKE %s", (prefix + "%",))
        else:
            cursor.execute("SELECT key FROM parent_documents")
        rows = cursor.fetchall()
        cursor.close()
        for (key,) in rows:
            yield key
    def close(self):
        """Close the connection."""
        if self._conn and not self._conn.closed:
            self._conn.close()
# Factory function for creating custom docstores
# Returns a tuple: (BaseStore instance, connection_string or None)
def create_docstore(
store_type: str = "local",
persist_path: str = None,
connection_string: str = None
) -> tuple:
"""
Factory function to create different types of document stores.
Args:
store_type: "local" (default), "postgres"
persist_path: Path for local file store
connection_string: PostgreSQL connection string
Returns:
Tuple of (BaseStore instance, connection_string or None)
"""
if store_type == "postgres" and connection_string:
return (PostgresDocStore(connection_string), connection_string)
else:
return (get_docstore(persist_path), None)

68
rag_indexer/embedders.py Normal file

@@ -0,0 +1,68 @@
"""
Embedding model wrapper for llama.cpp service.
"""
import os
from typing import List, Optional
from urllib.parse import urljoin
from langchain_openai import OpenAIEmbeddings
class LlamaCppEmbedder:
"""Wrapper for llama.cpp embedding service via OpenAI-compatible API."""
def __init__(
self,
base_url: Optional[str] = None,
api_key: Optional[str] = None,
model: str = "embeddinggemma-300M-Q8_0",
):
self.base_url = base_url or os.getenv("LLAMACPP_EMBEDDING_URL", "http://127.0.0.1:8082")
        # llama.cpp ignores the key; use a non-empty placeholder since some OpenAI
        # client versions reject an empty key.
        self.api_key = api_key or os.getenv("LLAMACPP_API_KEY", "no-key")
self.model = model
# Ensure URL ends with /v1
self.base_url = urljoin(self.base_url.rstrip("/") + "/", "v1")
def as_langchain_embeddings(self) -> OpenAIEmbeddings:
"""Create LangChain OpenAIEmbeddings instance."""
return OpenAIEmbeddings(
openai_api_base=self.base_url,
openai_api_key=self.api_key,
model=self.model,
)
def embed_documents(self, texts: List[str]) -> List[List[float]]:
"""Embed a list of documents."""
emb = self.as_langchain_embeddings()
return emb.embed_documents(texts)
def embed_query(self, text: str) -> List[float]:
"""Embed a single query."""
emb = self.as_langchain_embeddings()
return emb.embed_query(text)
def get_embedding_dimension(self) -> int:
"""Get embedding dimension by embedding a test string."""
test_embedding = self.embed_query("test")
return len(test_embedding)
class MockEmbedder:
"""Mock embedder for testing without a real service."""
def __init__(self, dimension: int = 768):
self.dimension = dimension
def as_langchain_embeddings(self) -> OpenAIEmbeddings:
raise NotImplementedError("MockEmbedder cannot be used as LangChain embeddings")
def embed_documents(self, texts: List[str]) -> List[List[float]]:
return [[0.0] * self.dimension for _ in texts]
def embed_query(self, text: str) -> List[float]:
return [0.0] * self.dimension
def get_embedding_dimension(self) -> int:
return self.dimension


@@ -0,0 +1,124 @@
"""
Example demonstrating ParentDocumentRetriever usage.
This script shows how to:
1. Build an index with parent-child chunking
2. Search with child chunks (fast, precise)
3. Search with parent context (large context)
4. Access the retriever directly for advanced use cases
"""
import logging
from pathlib import Path
# Configure logging
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
from builder import IndexBuilder
from splitters import SplitterType
def main():
print("=" * 70)
print("ParentDocumentRetriever Example")
print("=" * 70)
# Step 1: Create IndexBuilder with parent-child splitting
print("\n1. Creating IndexBuilder with parent-child splitting...")
builder = IndexBuilder(
collection_name="parent_child_demo",
splitter_type=SplitterType.PARENT_CHILD,
parent_chunk_size=1000, # Parent chunks: larger context
child_chunk_size=200, # Child chunks: smaller for precision
docstore_path="./my_parent_docs", # Where to store parent chunks
search_k=5, # Number of child chunks to retrieve
)
print(f" Parent splitter: chunk_size={builder.get_parent_splitter().chunk_size}")
print(f" Child splitter: chunk_size={builder.get_child_splitter().chunk_size}")
print(f" Docstore path: {builder.get_docstore_path()}")
print(f" Search k: {builder.retriever.search_kwargs['k']}")
# Step 2: Build index from a sample file
print("\n2. Building index from sample file...")
# Create a test document
test_content = """
This is a test document for demonstrating ParentDocumentRetriever.
Parent chunks contain larger portions of text (1000 characters),
while child chunks are smaller (200 characters) for precise retrieval.
When you search with ParentDocumentRetriever:
- It first retrieves relevant child chunks
- Then replaces them with their corresponding parent chunks
- This gives you large context while maintaining precision
Example search queries:
- "ParentDocumentRetriever"
- "child chunks"
- "large context"
- "precise retrieval"
"""
test_file = Path("./test_document.txt")
test_file.write_text(test_content)
chunk_count = builder.build_from_file(str(test_file))
print(f" Indexed {chunk_count} documents")
# Step 3: Search with child chunks (fast, precise)
print("\n3. Searching with child chunks (fast, precise)...")
child_results = builder.search("ParentDocumentRetriever", k=3)
print(f" Found {len(child_results)} child chunks:")
for i, doc in enumerate(child_results, 1):
print(f" [{i}] {doc.page_content[:100]}...")
# Step 4: Search with parent context (large context)
print("\n4. Searching with parent context (large context)...")
parent_results = builder.search_with_parent_context("ParentDocumentRetriever", k=3)
print(f" Found {len(parent_results)} parent chunks:")
for i, doc in enumerate(parent_results, 1):
print(f" [{i}] {doc.page_content[:150]}...")
# Step 5: Compare results
print("\n5. Comparing child vs parent results...")
print(f" Child chunks total length: {sum(len(d.page_content) for d in child_results)}")
print(f" Parent chunks total length: {sum(len(d.page_content) for d in parent_results)}")
print(f" Ratio: parent/child = {sum(len(d.page_content) for d in parent_results) / max(sum(len(d.page_content) for d in child_results), 1):.2f}x larger")
# Step 6: Access retriever directly
print("\n6. Accessing retriever directly...")
retriever = builder.get_retriever()
print(f" Retriever type: {type(retriever).__name__}")
print(f" Vectorstore: {retriever.vectorstore}")
print(f" Docstore: {retriever.docstore}")
# Step 7: Unified retrieval interface
print("\n7. Using unified retrieval interface...")
unified_results = builder.retrieve("ParentDocumentRetriever", return_parent=True)
print(f" Retrieved {len(unified_results)} documents (with parent context)")
# Step 8: Collection info
print("\n8. Collection info...")
info = builder.get_collection_info()
print(f" Collection: {info['name']}")
print(f" Vectors: {info['vectors_count']}")
print(f" Vector size: {info['vector_size']}")
# Cleanup
print("\n9. Cleaning up...")
builder.close()
print("\n" + "=" * 70)
print("Example completed successfully!")
print("=" * 70)
return builder
if __name__ == "__main__":
builder = main()

91
rag_indexer/loaders.py Normal file

@@ -0,0 +1,91 @@
"""
Document loaders using unstructured library.
"""
import logging
from pathlib import Path
from typing import List, Union
from langchain_core.documents import Document
from unstructured.partition.auto import partition
logger = logging.getLogger(__name__)
class DocumentLoader:
"""Load documents from various file formats."""
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".doc", ".txt", ".md", ".html", ".pptx", ".xlsx"}
def __init__(self, extract_images: bool = False):
"""
Args:
extract_images: Whether to extract images from documents (requires additional dependencies)
"""
self.extract_images = extract_images
def load_file(self, file_path: Union[str, Path]) -> List[Document]:
"""Load a single file into LangChain Document objects."""
file_path = Path(file_path).resolve()
if not file_path.exists():
raise FileNotFoundError(f"File not found: {file_path}")
suffix = file_path.suffix.lower()
if suffix not in self.SUPPORTED_EXTENSIONS:
raise ValueError(
f"Unsupported file extension: {suffix}. Supported: {self.SUPPORTED_EXTENSIONS}"
)
# Parse with unstructured
elements = partition(
filename=str(file_path),
extract_images_in_pdf=self.extract_images,
)
documents = []
for elem in elements:
text = getattr(elem, "text", "")
if not text or not text.strip():
continue
# Base metadata
metadata = {
"source": str(file_path),
"file_name": file_path.name,
"file_type": suffix,
}
            # Merge element-specific metadata without overwriting base fields.
            # unstructured exposes an ElementMetadata object, not a dict.
            elem_meta = getattr(elem, "metadata", None)
            elem_meta = elem_meta.to_dict() if elem_meta is not None else {}
            for key, value in elem_meta.items():
                if value and key not in metadata:
                    metadata[key] = value
documents.append(Document(page_content=text, metadata=metadata))
if not documents:
logger.warning("No text content extracted from %s", file_path)
return []
return documents
def load_directory(
self, directory_path: Union[str, Path], recursive: bool = True
) -> List[Document]:
"""Load all supported files from a directory."""
directory_path = Path(directory_path).resolve()
if not directory_path.is_dir():
raise NotADirectoryError(f"Not a directory: {directory_path}")
all_documents = []
pattern = "**/*" if recursive else "*"
for file_path in directory_path.glob(pattern):
if file_path.is_file() and file_path.suffix.lower() in self.SUPPORTED_EXTENSIONS:
try:
docs = self.load_file(file_path)
all_documents.extend(docs)
except Exception as e:
logger.error("Failed to load %s: %s", file_path, e)
return all_documents

71
rag_indexer/splitters.py Normal file

@@ -0,0 +1,71 @@
"""
Text splitters for chunking documents.
"""
from enum import Enum
from typing import List, Optional
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
class SplitterType(str, Enum):
RECURSIVE = "recursive"
SEMANTIC = "semantic"
PARENT_CHILD = "parent_child"
def get_splitter(splitter_type: SplitterType, **kwargs):
"""Factory function to create a text splitter."""
if splitter_type == SplitterType.RECURSIVE:
chunk_size = kwargs.get("chunk_size", 500)
chunk_overlap = kwargs.get("chunk_overlap", 50)
return RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
            separators=["\n\n", "\n", "。", "！", "？", " ", ""],
)
elif splitter_type == SplitterType.SEMANTIC:
# Requires embeddings for semantic splitting
embeddings = kwargs.get("embeddings")
if embeddings is None:
raise ValueError("Semantic splitter requires 'embeddings' parameter")
return SemanticChunker(embeddings=embeddings)
else:
raise ValueError(f"Unsupported splitter type: {splitter_type}")
class ParentChildSplitter:
"""
Splits documents into parent (large) and child (small) chunks.
Child chunks are indexed for retrieval, parent chunks are stored for context.
"""
def __init__(
self,
parent_chunk_size: int = 1000,
child_chunk_size: int = 200,
parent_chunk_overlap: int = 100,
child_chunk_overlap: int = 20,
):
self.parent_splitter = RecursiveCharacterTextSplitter(
chunk_size=parent_chunk_size,
chunk_overlap=parent_chunk_overlap,
)
self.child_splitter = RecursiveCharacterTextSplitter(
chunk_size=child_chunk_size,
chunk_overlap=child_chunk_overlap,
)
def split_documents(self, documents: List[Document]) -> tuple[List[Document], List[Document]]:
"""
Returns:
(parent_chunks, child_chunks)
"""
parent_chunks = self.parent_splitter.split_documents(documents)
child_chunks = self.child_splitter.split_documents(documents)
# Link child chunks to parent IDs (optional metadata)
# In a real implementation, you'd map each child to a parent chunk ID.
return parent_chunks, child_chunks

110
rag_indexer/vector_store.py Normal file

@@ -0,0 +1,110 @@
"""
Qdrant vector store wrapper.
"""
import logging
import os
from typing import List, Optional, Dict, Any
from langchain_core.documents import Document
from langchain_qdrant import QdrantVectorStore as LangchainQdrantVS
from qdrant_client import QdrantClient
from qdrant_client.http import models
from qdrant_client.http.models import Distance, VectorParams
from .embedders import LlamaCppEmbedder
logger = logging.getLogger(__name__)
class QdrantVectorStore:
"""Wrapper for Qdrant vector database operations."""
def __init__(
self,
collection_name: str,
embeddings: Optional[Any] = None,
qdrant_url: Optional[str] = None,
api_key: Optional[str] = None,
):
self.collection_name = collection_name
self.qdrant_url = qdrant_url or os.getenv("QDRANT_URL", "http://127.0.0.1:6333")
self.api_key = api_key
# Embeddings
if embeddings is None:
embedder = LlamaCppEmbedder()
self.embeddings = embedder.as_langchain_embeddings()
else:
self.embeddings = embeddings
# Qdrant client
self.client = QdrantClient(url=self.qdrant_url, api_key=self.api_key)
        # LangChain vector store (QdrantVectorStore takes `embedding`; collection
        # validation is skipped so this wrapper can be built before create_collection)
        self.vector_store = LangchainQdrantVS(
            client=self.client,
            collection_name=self.collection_name,
            embedding=self.embeddings,
            validate_collection_config=False,
        )
def create_collection(self, vector_size: Optional[int] = None, force_recreate: bool = False):
"""Create collection with appropriate vector size."""
if vector_size is None:
            # Probe the configured embeddings (not a fresh default embedder) for the dimension
            vector_size = len(self.embeddings.embed_query("test"))
collections = self.client.get_collections().collections
exists = any(c.name == self.collection_name for c in collections)
if exists and force_recreate:
self.client.delete_collection(self.collection_name)
exists = False
if not exists:
self.client.create_collection(
collection_name=self.collection_name,
vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE),
)
logger.info("Collection '%s' created (dim=%d)", self.collection_name, vector_size)
else:
logger.info("Collection '%s' already exists", self.collection_name)
def add_documents(self, documents: List[Document], batch_size: int = 100):
"""Add documents to vector store."""
if not documents:
return []
self.create_collection()
ids = self.vector_store.add_documents(documents, batch_size=batch_size)
logger.info("Added %d documents to '%s'", len(ids), self.collection_name)
return ids
def similarity_search(self, query: str, k: int = 5) -> List[Document]:
return self.vector_store.similarity_search(query, k=k)
def similarity_search_with_score(self, query: str, k: int = 5) -> List[tuple[Document, float]]:
return self.vector_store.similarity_search_with_score(query, k=k)
def delete_collection(self):
self.client.delete_collection(self.collection_name)
logger.info("Collection '%s' deleted", self.collection_name)
    def get_collection_info(self) -> Dict[str, Any]:
        info = self.client.get_collection(self.collection_name)
        return {
            # CollectionInfo has no name field; points_count is the reliable count
            # (vectors_count can be None on newer Qdrant servers).
            "name": self.collection_name,
            "vectors_count": info.points_count,
            "status": info.status,
            "vector_size": info.config.params.vectors.size,
        }
def as_langchain_vectorstore(self):
return self.vector_store
def get_langchain_vectorstore(self):
"""返回 LangChain Qdrant 向量存储对象(别名)"""
return self.vector_store
def get_qdrant_client(self):
"""返回原生 Qdrant 客户端(如需手动管理 collection"""
return self.client