Tokenizer

class Tokenizer()

Overview

The Tokenizer class is responsible for tokenizing text. It is used in particular by the ChunkExtractor class to extract chunks from a document. The default hexamind Tokenizer wraps the MistralTokenizer.

You can implement your own tokenizer by creating a subclass of the ITokenizer interface, defined as follows:

from abc import ABC, abstractmethod
from typing import List

class ITokenizer(ABC):
    @abstractmethod
    def tokenize(self, text: str) -> List[str]:
        pass

    @abstractmethod
    def decode(self, tokens: List[str]) -> str:
        pass

    @abstractmethod
    def count_tokens(self, text: str) -> int:
        pass
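As a sketch, a custom tokenizer only needs to implement the three abstract methods. The example below uses a naive whitespace split; the ITokenizer definition is reproduced so the snippet is self-contained, and WhitespaceTokenizer is a hypothetical name, not part of hexamind:

```python
from abc import ABC, abstractmethod
from typing import List

# ITokenizer interface, reproduced from the documentation above.
class ITokenizer(ABC):
    @abstractmethod
    def tokenize(self, text: str) -> List[str]:
        pass

    @abstractmethod
    def decode(self, tokens: List[str]) -> str:
        pass

    @abstractmethod
    def count_tokens(self, text: str) -> int:
        pass

# Hypothetical minimal implementation: splits on whitespace.
class WhitespaceTokenizer(ITokenizer):
    def tokenize(self, text: str) -> List[str]:
        # Each whitespace-separated word becomes one token.
        return text.split()

    def decode(self, tokens: List[str]) -> str:
        # Rejoin tokens with single spaces (lossy for original spacing).
        return " ".join(tokens)

    def count_tokens(self, text: str) -> int:
        return len(self.tokenize(text))
```

Any class implementing these three methods can then be swapped in wherever the default Tokenizer is used.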

Attributes

  • tokenizer : The underlying tokenizer used to tokenize text.
    • Here MistralTokenizer.from_model("open-mixtral-8x22b")

Methods

def tokenize(self, text: str) -> List[int]
Transforms a sequence of characters into a sequence of tokens.

def decode(self, tokens: List[int]) -> str
Transforms a sequence of tokens back into a sequence of characters.

def count_tokens(self, text: str) -> int
Counts the number of tokens in a text.

Usage Example

tokenizer = Tokenizer()
tokens = tokenizer.tokenize("Hello, World!")
n_tokens = tokenizer.count_tokens("Hello, World!")
text = tokenizer.decode(tokens)