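After parsing a document, we obtain structured or semi-structured data. The main task now is to break that data into smaller chunks to extract detailed features, and then embed those features to represent their semantics. The position of this step in RAG is shown in Figure 1: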





  • Embedding-based
  • Model-based
  • LLM-based




pip install llama-index-core
pip install llama-index-readers-file
pip install llama-index-embeddings-openai


(llamaindex_010) Florian:~ Florian$ pip list | grep llama
llama-index-core 0.10.12
llama-index-embeddings-openai 0.1.6
llama-index-readers-file 0.1.5
llamaindex-py-client 0.1.13


from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import SimpleDirectoryReader

import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPEN_AI_KEY"

# load documents
dir_path = "YOUR_DIR_PATH"
documents = SimpleDirectoryReader(dir_path).load_data()

embed_model = OpenAIEmbedding()
splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)

nodes = splitter.get_nodes_from_documents(documents)
for node in nodes:
    print('-' * 100)
    print(node.get_content())



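  • sentence: the current sentence;
  • index: the sequence number of the current sentence;
  • combined_sentence: a sliding window made up of the three sentences at [index - self.buffer_size, index, index + self.buffer_size] (by default, self.buffer_size = 1). It is a tool for computing the semantic relatedness between sentences. Combining the preceding and following sentences reduces noise and better captures the relationship between consecutive sentences;
  • combined_sentence_embedding: the embedding of combined_sentence.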
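
The fields above, together with the breakpoint_percentile_threshold parameter, can be illustrated with a small pure-Python sketch. It is not LlamaIndex's implementation: the function names are hypothetical, the embeddings are hand-written 2-D toy vectors rather than real model output, and the percentile uses plain linear interpolation. It only shows the idea of building the sliding window, measuring the cosine distance between adjacent windows, and cutting wherever the distance exceeds the chosen percentile.

```python
import math

def build_combined_sentences(sentences, buffer_size=1):
    """Attach to each sentence a window of its neighbours (buffer_size per side)."""
    groups = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        groups.append({
            "sentence": sentence,
            "index": i,
            "combined_sentence": " ".join(sentences[start:end]),
        })
    return groups

def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def split_by_breakpoints(groups, embeddings, threshold_percentile=95):
    """Start a new chunk after every window whose distance to the next exceeds the cutoff."""
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    if not distances:
        return [" ".join(g["sentence"] for g in groups)]
    cutoff = percentile(distances, threshold_percentile)
    chunks, current = [], []
    for i, group in enumerate(groups):
        current.append(group["sentence"])
        if i < len(distances) and distances[i] > cutoff:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

# Toy data: three sentences on one topic, then two on another.
sentences = ["BERT is bidirectional.", "It uses masked LM.", "It also uses NSP.",
             "Today is sunny.", "The park is crowded."]
# Hypothetical embeddings, one per combined_sentence (assumed, not model output).
toy_embeddings = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.05], [0.0, 1.0], [0.1, 1.0]]

groups = build_combined_sentences(sentences, buffer_size=1)
chunks = split_by_breakpoints(groups, toy_embeddings, threshold_percentile=95)
for chunk in chunks:
    print(chunk)
```

With a high percentile such as 95, only the largest jumps in distance become breakpoints, which is consistent with the coarse chunks observed in the test output.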



(llamaindex_010) Florian:~ Florian$ python /Users/Florian/Documents/june_pdf_loader/test_semantic_chunk.py
......
----------------------------------------------------------------------------------------------------
We argue that current techniques restrict the
power of the pre-trained representations, espe-
cially for the fine-tuning approaches. The ma-
jor limitation is that standard language models are
unidirectional, and this limits the choice of archi-
tectures that can be used during pre-training. For
example, in OpenAI GPT, the authors use a left-to-
right architecture, where every token can only at-
tend to previous tokens in the self-attention layers
of the Transformer (Vaswani et al., 2017). Such re-
strictions are sub-optimal for sentence-level tasks,
and could be very harmful when applying fine-tuning based approaches to token-level tasks such
as question answering, where it is crucial to incor-
porate context from both directions.
In this paper, we improve the fine-tuning based
approaches by proposing BERT: Bidirectional
Encoder Representations from Transformers.
BERT alleviates the previously mentioned unidi-
rectionality constraint by using a “masked lan-
guage model” (MLM) pre-training objective, in-
spired by the Cloze task (Taylor, 1953). The
masked language model randomly masks some of
the tokens from the input, and the objective is to
predict the original vocabulary id of the masked
arXiv:1810.04805v2 [cs.CL] 24 May 2019
----------------------------------------------------------------------------------------------------
word based only on its context. Unlike left-to-right language model pre-training, the MLM ob-
jective enables the representation to fuse the left
and the right context, which allows us to pre-train a deep bidirectional Transformer. In addi-
tion to the masked language model, we also use
a “next sentence prediction” task that jointly pre-trains text-pair representations.
The contributions
of our paper are as follows:
• We demonstrate the importance of bidirectional
pre-training for language representations. Un-
like Radford et al. (2018), which uses unidirec-
tional language models for pre-training, BERT
uses masked language models to enable pre-trained deep bidirectional representations. This
is also in contrast to Peters et al.
----------------------------------------------------------------------------------------------------
......


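  • The test results show that the chunks are relatively coarse-grained.
  • Figure 2 also shows that this method is page-based and does not directly address chunks that span multiple pages.
  • In general, the performance of embedding-based methods depends heavily on the embedding model. The actual effect needs further evaluation.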


2.1 Naive BERT






2.2 Cross Segment Attention

The paper "Text Segmentation by Cross Segment Attention" [3] proposes three cross-segment attention models, as shown in Figure 4:





2.3 SeqModel

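The cross-segment model vectorizes each sentence independently and does not take any broader context into account. A further enhancement is proposed in SeqModel, described in the paper "Sequence Model with Self-Adaptive Sliding Window for Efficient Spoken Document Segmentation" [5].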



from modelscope.outputs import OutputKeys
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

p = pipeline(
    task=Tasks.document_segmentation,
    model='damo/nlp_bert_document-segmentation_english-base'
)

print('-' * 100)

result = p(documents='We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pretrained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs. • We show that pre-trained representations reduce the need for many heavily-engineered taskspecific architectures. BERT is the first finetuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures. Today is a good day')

# Print the segmented text returned by the pipeline
print(result[OutputKeys.TEXT])

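An unrelated sentence, "Today is a good day", was appended to the end of the test data, but the result does not separate "Today is a good day" from the rest of the text.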

(modelscope) Florian:~ Florian$ python /Users/Florian/Documents/june_pdf_loader/test_seqmodel.py
2024-02-24 17:09:36,288 - modelscope - INFO - PyTorch version 2.2.1 Found.
2024-02-24 17:09:36,288 - modelscope - INFO - Loading ast index from /Users/Florian/.cache/modelscope/ast_indexer
......
----------------------------------------------------------------------------------------------------
......
 We demonstrate the importance of bidirectional pre-training for language representations.Unlike Radford et al.(2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pretrained deep bidirectional representations.This is also in contrast to Peters et al.(2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.• We show that pre-trained representations reduce the need for many heavily-engineered taskspecific architectures.BERT is the first finetuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures.Today is a good day


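The paper "Dense X Retrieval: What Retrieval Granularity Should We Use?" [8] introduces a new retrieval unit called the proposition. Propositions are defined as atomic expressions within the text; each one encapsulates a distinct fact and is presented in a concise, self-contained natural-language format.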




PROPOSITIONS_PROMPT = PromptTemplate(
    """Decompose the "Content" into clear and simple propositions, ensuring they are interpretable out of
context.
1. Split compound sentence into simple sentences. Maintain the original phrasing from the input
whenever possible.
2. For any named entity that is accompanied by additional descriptive information, separate this
information into its own distinct proposition.
3. Decontextualize the proposition by adding necessary modifier to nouns or entire sentences
and replacing pronouns (e.g., "it", "he", "she", "they", "this", "that") with the full name of the
entities they refer to.
4. Present the results as a list of strings, formatted in JSON.

Input: Title: ¯Eostre. Section: Theories and interpretations, Connection to Easter Hares. Content:
The earliest evidence for the Easter Hare (Osterhase) was recorded in south-west Germany in
1678 by the professor of medicine Georg Franck von Franckenau, but it remained unknown in
other parts of Germany until the 18th century. Scholar Richard Sermon writes that "hares were
frequently seen in gardens in spring, and thus may have served as a convenient explanation for the
origin of the colored eggs hidden there for children. Alternatively, there is a European tradition
that hares laid eggs, since a hare’s scratch or form and a lapwing’s nest look very similar, and
both occur on grassland and are first seen in the spring. In the nineteenth century the influence
of Easter cards, toys, and books was to make the Easter Hare/Rabbit popular throughout Europe.
German immigrants then exported the custom to Britain and America where it evolved into the
Easter Bunny."
Output: [ "The earliest evidence for the Easter Hare was recorded in south-west Germany in
1678 by Georg Franck von Franckenau.", "Georg Franck von Franckenau was a professor of
medicine.", "The evidence for the Easter Hare remained unknown in other parts of Germany until
the 18th century.", "Richard Sermon was a scholar.", "Richard Sermon writes a hypothesis about
the possible explanation for the connection between hares and the tradition during Easter", "Hares
were frequently seen in gardens in spring.", "Hares may have served as a convenient explanation
for the origin of the colored eggs hidden in gardens for children.", "There is a European tradition
that hares laid eggs.", "A hare’s scratch or form and a lapwing’s nest look very similar.", "Both
hares and lapwing’s nests occur on grassland and are first seen in the spring.", "In the nineteenth
century the influence of Easter cards, toys, and books was to make the Easter Hare/Rabbit popular
throughout Europe.", "German immigrants exported the custom of the Easter Hare/Rabbit to
Britain and America.", "The custom of the Easter Hare/Rabbit evolved into the Easter Bunny in
Britain and America." ]

Input: {node_text}
Output:"""
)

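In the earlier section on the embedding-based method, we installed the key components of LlamaIndex 0.10.12. To use DenseXRetrievalPack, we also need to run pip install llama-index-llms-openai. After installation, the LlamaIndex-related components are as follows: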

(llamaindex_010) Florian:~ Florian$ pip list | grep llama
llama-index-core 0.10.12
llama-index-embeddings-openai 0.1.6
llama-index-llms-openai 0.1.6
llama-index-readers-file 0.1.5
llamaindex-py-client 0.1.13


from llama_index.core.readers import SimpleDirectoryReader
from llama_index.core.llama_pack import download_llama_pack

import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"

# Download and install dependencies
DenseXRetrievalPack = download_llama_pack(
    "DenseXRetrievalPack", "./dense_pack"
)

# If you have already downloaded DenseXRetrievalPack, you can import it directly.
# from llama_index.packs.dense_x_retrieval import DenseXRetrievalPack

# Load documents
dir_path = "YOUR_DIR_PATH"
documents = SimpleDirectoryReader(dir_path).load_data()

# Use LLM to extract propositions from every document/node
dense_pack = DenseXRetrievalPack(documents)

response = dense_pack.run("YOUR_QUERY")


class DenseXRetrievalPack(BaseLlamaPack):
    def __init__(
        self,
        documents: List[Document],
        proposition_llm: Optional[LLM] = None,
        query_llm: Optional[LLM] = None,
        embed_model: Optional[BaseEmbedding] = None,
        text_splitter: TextSplitter = SentenceSplitter(),
        similarity_top_k: int = 4,
    ) -> None:
        """Init params."""
        self._proposition_llm = proposition_llm or OpenAI(
            model="gpt-3.5-turbo",
            temperature=0.1,
            max_tokens=750,
        )

        embed_model = embed_model or OpenAIEmbedding(embed_batch_size=128)

        nodes = text_splitter.get_nodes_from_documents(documents)
        sub_nodes = self._gen_propositions(nodes)

        all_nodes = nodes + sub_nodes
        all_nodes_dict = {n.node_id: n for n in all_nodes}

        service_context = ServiceContext.from_defaults(
            llm=query_llm or OpenAI(),
            embed_model=embed_model,
            num_output=self._proposition_llm.metadata.num_output,
        )

        self.vector_index = VectorStoreIndex(
            all_nodes, service_context=service_context, show_progress=True
        )

        self.retriever = RecursiveRetriever(
            "vector",
            retriever_dict={
                "vector": self.vector_index.as_retriever(
                    similarity_top_k=similarity_top_k
                )
            },
            node_dict=all_nodes_dict,
        )

        self.query_engine = RetrieverQueryEngine.from_args(
            self.retriever, service_context=service_context
        )



> /Users/Florian/anaconda3/envs/llamaindex_010/lib/python3.11/site-packages/llama_index/packs/dense_x_retrieval/base.py(91)__init__()
     90
---> 91         all_nodes = nodes + sub_nodes
     92         all_nodes_dict = {n.node_id: n for n in all_nodes}

ipdb> sub_nodes[20]
IndexNode(id_='ecf310c7-76c8-487a-99f3-f78b273e00d9', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Our paper demonstrates the importance of bidirectional pre-training for language representations.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)
ipdb> sub_nodes[21]
IndexNode(id_='4911332e-8e30-47d8-a5bc-ed7cbaa8e042', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Radford et al. (2018) uses unidirectional language models for pre-training.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)
ipdb> sub_nodes[22]
IndexNode(id_='83aa82f8-384a-4b06-92c8-d6277c4162bf', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='BERT uses masked language models to enable pre-trained deep bidirectional representations.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)
ipdb> sub_nodes[23]
IndexNode(id_='2ac635c2-ccb0-4e62-88c7-bcbaef3ef38a', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Peters et al. (2018a) uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)
ipdb> sub_nodes[24]
IndexNode(id_='e37b17cf-30dd-4114-a3c5-9921b8cf0a77', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Pre-trained representations reduce the need for many heavily-engineered task-specific architectures.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)
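
All five propositions in the debugger output share the same index_id, which is what links each small proposition back to the larger chunk it was extracted from. The following toy sketch shows that small-to-big lookup in plain Python. The Chunk and Proposition classes and the resolve_to_chunks helper are hypothetical stand-ins, not LlamaIndex classes; in the pack itself this resolution is handled by RecursiveRetriever using node_dict.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    node_id: str
    text: str

@dataclass
class Proposition:
    text: str
    index_id: str  # node_id of the chunk this proposition was extracted from

def resolve_to_chunks(hits, chunks_by_id):
    """Map retrieved propositions to their parent chunks, de-duplicated, in hit order."""
    seen = set()
    resolved = []
    for hit in hits:
        if hit.index_id not in seen:
            seen.add(hit.index_id)
            resolved.append(chunks_by_id[hit.index_id])
    return resolved

# Hypothetical data: one parent chunk and two propositions derived from it.
parent = Chunk(node_id="chunk-1", text="The contributions of our paper are as follows: ...")
chunks_by_id = {parent.node_id: parent}
hits = [
    Proposition("BERT uses masked language models.", index_id="chunk-1"),
    Proposition("Radford et al. (2018) uses unidirectional language models.", index_id="chunk-1"),
]

context = resolve_to_chunks(hits, chunks_by_id)
print([c.node_id for c in context])
```

Matching is done on the small, focused propositions, while the text handed to the LLM is the larger parent chunk, so two hits from the same chunk contribute that chunk only once.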


The small-to-big index structure is built by self._gen_propositions. The code is as follows:

async def _aget_proposition(self, node: TextNode) -> List[TextNode]:
    """Get proposition."""
    inital_output = await self._proposition_llm.apredict(
        PROPOSITIONS_PROMPT, node_text=node.text
    )
    outputs = inital_output.split("\n")

    all_propositions = []

    for output in outputs:
        if not output.strip():
            continue
        # Repair lines that are missing a closing quote or bracket
        if not output.strip().endswith("]"):
            if not output.strip().endswith('"') and not output.strip().endswith(
                ","
            ):
                output = output + '"'
            output = output + " ]"
        # Repair lines that are missing an opening bracket or quote
        if not output.strip().startswith("["):
            if not output.strip().startswith('"'):
                output = '"' + output
            output = "[ " + output

        try:
            propositions = json.loads(output)
        except Exception:
            # fallback to yaml
            try:
                propositions = yaml.safe_load(output)
            except Exception:
                # fallback to next output
                continue

        if not isinstance(propositions, list):
            continue

        # Collect the propositions parsed from this line
        all_propositions.extend(propositions)

    assert isinstance(all_propositions, list)
    nodes = [TextNode(text=prop) for prop in all_propositions if prop]

    return [IndexNode.from_text_node(n, node.node_id) for n in nodes]

def _gen_propositions(self, nodes: List[TextNode]) -> List[TextNode]:
    """Get propositions."""
    sub_nodes = asyncio.run(
        run_jobs(
            [self._aget_proposition(node) for node in nodes],
            show_progress=True,
            workers=8,
        )
    )

    # Flatten list
    return [node for sub_node in sub_nodes for node in sub_node]
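
The line-repair logic in the loop above can be tried in isolation. The sketch below is a simplified, standard-library-only variant under stated assumptions: parse_propositions is a hypothetical name, each line is stripped before repair, and the YAML fallback is left out, so a line that is still invalid JSON after repair (for example one ending in a trailing comma) is skipped instead of being recovered.

```python
import json

def parse_propositions(raw_output):
    """Parse LLM output line by line, repairing truncated JSON lists where possible."""
    propositions = []
    for line in raw_output.split("\n"):
        line = line.strip()
        if not line:
            continue
        # Close a list that was cut off before its closing quote or bracket.
        if not line.endswith("]"):
            if not line.endswith('"') and not line.endswith(","):
                line += '"'
            line += " ]"
        # Open a list that is missing its opening bracket or quote.
        if not line.startswith("["):
            if not line.startswith('"'):
                line = '"' + line
            line = "[ " + line
        try:
            parsed = json.loads(line)
        except json.JSONDecodeError:
            continue  # no YAML fallback in this sketch
        if isinstance(parsed, list):
            propositions.extend(p for p in parsed if p)
    return propositions

# Hypothetical LLM output: one complete list, one blank line, one truncated list.
raw = (
    '[ "BERT uses masked language models.", "Radford et al. use unidirectional models." ]\n'
    "\n"
    '[ "The output was cut off'
)
for prop in parse_propositions(raw):
    print(prop)
```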

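For each original node, self._aget_proposition is called asynchronously. It sends PROPOSITIONS_PROMPT to the LLM and receives inital_output, then parses the propositions out of inital_output and builds a TextNode for each. Finally, these TextNodes are associated with the original node via [IndexNode.from_text_node(n, node.node_id) for n in nodes].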
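
The concurrency pattern described here (many per-node LLM calls, at most workers=8 in flight, results flattened afterwards) can be sketched with asyncio alone. This is an illustration of bounded concurrency and not the library's run_jobs implementation; run_bounded, fake_get_propositions, and gen_propositions are hypothetical names, and the fake coroutine stands in for the real LLM call.

```python
import asyncio

async def run_bounded(jobs, workers=8):
    """Await all jobs, allowing at most `workers` of them to run at once."""
    semaphore = asyncio.Semaphore(workers)

    async def guarded(job):
        async with semaphore:
            return await job

    # gather returns results in the same order as the input jobs
    return await asyncio.gather(*(guarded(job) for job in jobs))

async def fake_get_propositions(node_id, texts):
    """Stand-in for the LLM call: pair each proposition with its source node id."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [(text, node_id) for text in texts]

def gen_propositions(nodes, workers=8):
    """Run one job per node, then flatten the per-node lists into one list."""
    per_node = asyncio.run(
        run_bounded(
            [fake_get_propositions(node_id, texts) for node_id, texts in nodes],
            workers=workers,
        )
    )
    return [pair for pairs in per_node for pair in pairs]

# Hypothetical input: (node_id, propositions that the LLM would return for that node)
nodes = [("n1", ["a", "b"]), ("n2", ["c"])]
flat = gen_propositions(nodes, workers=2)
print(flat)
```

Capping the number of in-flight requests keeps the extraction step from overwhelming the LLM endpoint while still processing nodes in parallel.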








