Tokenizer
Overview
The Tokenizer class is responsible for tokenizing text. It is used in particular by the ChunkExtractor class to extract chunks from a document.
The default hexamind Tokenizer uses the MistralTokenizer.
You can implement your own tokenizer by creating a subclass of the ITokenizer interface, defined as follows:
```python
from abc import ABC, abstractmethod
from typing import List


class ITokenizer(ABC):
    @abstractmethod
    def tokenize(self, text: str) -> List[str]:
        pass

    @abstractmethod
    def decode(self, tokens: List[str]) -> str:
        pass

    @abstractmethod
    def count_tokens(self, text: str) -> int:
        pass
```
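As a sketch, a minimal implementation of this interface could split on whitespace. The `WhitespaceTokenizer` class below is purely illustrative and not part of hexamind; it only shows how the three abstract methods fit together.

```python
from abc import ABC, abstractmethod
from typing import List


class ITokenizer(ABC):
    @abstractmethod
    def tokenize(self, text: str) -> List[str]:
        pass

    @abstractmethod
    def decode(self, tokens: List[str]) -> str:
        pass

    @abstractmethod
    def count_tokens(self, text: str) -> int:
        pass


class WhitespaceTokenizer(ITokenizer):
    """Toy tokenizer that treats whitespace-separated words as tokens."""

    def tokenize(self, text: str) -> List[str]:
        # Split the text on any run of whitespace.
        return text.split()

    def decode(self, tokens: List[str]) -> str:
        # Rejoin tokens with single spaces (lossy for original spacing).
        return " ".join(tokens)

    def count_tokens(self, text: str) -> int:
        return len(self.tokenize(text))
```

For example, `WhitespaceTokenizer().tokenize("hello world")` returns `["hello", "world"]`, and `decode` reverses it. A real implementation would typically wrap a subword tokenizer instead.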
Attributes
tokenizer: the tokenizer instance used to tokenize the text; by default `MistralTokenizer.from_model("open-mixtral-8x22b")`.
Methods
tokenize: transforms a sequence of characters into a sequence of tokens.
decode: transforms a sequence of tokens back into a sequence of characters.
count_tokens: returns the number of tokens in a text.