Skip to content

Chunk

class Chunk(
    content: str, 
    container_uid: str, 
    title: Optional[str] = None, 
    level: Optional[int] = None, 
    document_title : Optional[str] = None, 
    section_number: Optional[str] = None)

Overview

A chunk is a representation of a segment of a document with specific attributes. This is the representation used in RAG application to store data into the database.

Parameters

  • content : str

    • The content of the chunk.
  • container_uid : str

    • The unique identifier of the container where the content belongs.
  • title : Optional[str]

    • The title of the container where the content belongs.
  • level : Optional[int]

    • The level of the container where the content belongs.
  • document_title : Optional[str]

    • The title of the document where the content belongs.
  • section_number : Optional[str]

    • The section number of the container where the content belongs.

Attributes

  • uid : str

    • The unique identifier of the chunk.
  • content : str

    • The content of the chunk.
  • container_uid : str

    • The unique identifier of the container where the content belongs.
  • title : Optional[str]

    • The title of the container where the content belongs.
  • level : Optional[int]

    • The level of the container where the content belongs.
  • document_title : Optional[str]

    • The title of the document where the content belongs.
  • section_number : Optional[str]

    • The section number of the container where the content belongs.
  • embeddings : Optional[ndarray]

    • The embeddings of the content.
  • metadata : Optional[Dict[str, Any]]

    • The metadata of the chunk.

Methods

def add_metadata(
    self, 
    key: str, 
    value: Any
    ) -> None
Add custom metadatas to the chunk. This can be used to store additional information about the chunk.

def generate_embeddings(
    self, 
    ll_agent: LlmAgent
    ) -> None
Use a language model to generate embeddings for the content of the chunk.

def to_dict(self) -> Dict[str, Any]
Serialize the chunk into a dictionary.

Usage Example

chunks = []
for i in range(num_chunks):
            start_idx = i * ChunkExtractor.MAX_TOKENS
            end_ids = start_idx + ChunkExtractor.MAX_TOKENS
            chunk_content = tokenizer.decode(tokens[start_idx:end_ids])
            chunk = Chunk(
                content=chunk_content,
                container_uid=container_uid,
                title=title,
                level=level,
                document_title=document_title,
                section_number=section_number
            ) # Creating a chunk
            chunks.append(chunk)