TokenDagger is a high-performance text tokenization tool designed to accelerate natural language processing (NLP) tasks. It is an alternative to OpenAI's TikToken that optimizes tokenization speed and performs particularly well on code and large-scale text. Open-sourced on GitHub by developer Matthew Wolfe, the project provides a 100% TikToken-compatible interface that developers can adopt without modifying existing code. TokenDagger uses the PCRE2 engine to optimize regular-expression matching and simplifies the byte-pair encoding (BPE) algorithm, dramatically improving performance. Tests have shown that it tokenizes code 4x faster than TikToken and delivers 2-3x higher throughput when processing 1 GB text files. The project is suitable for developers, data scientists, and AI researchers who need efficient text processing.
Features
- Efficient tokenization: built on the PCRE2 engine with optimized regular-expression matching, significantly speeding up text tokenization.
- TikToken compatibility: works as a drop-in replacement for TikToken and can be integrated without modifying existing code.
- Simplified BPE algorithm: optimized byte-pair encoding that reduces the performance overhead of special-token handling.
- Open source: the full source code is available, allowing developers to customize it or contribute improvements.
- Cross-platform support: runs on Linux, macOS, and other systems, making it easy to deploy in a variety of development environments.
- Test suite: built-in benchmarking tools validate tokenization performance and compare it against TikToken.
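To make the BPE idea concrete, here is a minimal, self-contained Python sketch of a single byte-pair merge step. This is a generic illustration of the algorithm TokenDagger streamlines, not TokenDagger's actual implementation:

```python
from collections import Counter

def bpe_merge_step(tokens):
    """Merge the most frequent adjacent pair once (illustrative BPE step)."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
            merged.append(tokens[i] + tokens[i + 1])  # fuse the pair
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from individual characters, as byte-level BPE starts from bytes.
print(bpe_merge_step(list("banana")))  # → ['b', 'an', 'an', 'a']
```

Real tokenizers repeat this merge step many thousands of times according to a learned merge table; the per-step pair counting and merging is where optimized implementations gain their speed.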
Usage Guide
Installation
TokenDagger is easy to install for developers familiar with Python and Git. Here are the detailed installation steps, based on Ubuntu/Debian systems (other systems will need to adjust the dependency installation commands accordingly):
- Clone the repository
Use Git to clone the TokenDagger repository locally:
git clone git@github.com:M4THYOU/TokenDagger.git
This downloads the latest TokenDagger source code.
- Install the PCRE2 development library
TokenDagger uses PCRE2 for efficient regular-expression matching, so the development library is required:
sudo apt install libpcre2-dev
- Update submodules
The project depends on several external components; initialize and update the submodules:
git submodule update --init --recursive
- Install the Python development environment
Ensure that your system has a Python 3 development environment:
sudo apt update && sudo apt install -y python3-dev
- Install TikToken (optional)
To run the test suite and compare performance with TikToken, install TikToken:
pip3 install tiktoken
- Build and install
Enter the project directory and install the package:
cd TokenDagger
python3 setup.py install
Once installed, TokenDagger can be imported and used via Python.
Usage
The core feature of TokenDagger is efficient tokenization, suitable for processing code, documents, or large-scale text. The main workflows are as follows:
1. Integration into existing projects
TokenDagger is fully compatible with TikToken's API. Developers simply replace the TikToken import statement with TokenDagger, e.g.:
# original code
from tiktoken import encoding_for_model
# replace with
from tokendagger import encoding_for_model
With no further code changes, TokenDagger takes over tokenization and delivers faster processing.
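Because the two APIs are interchangeable, a project can also select whichever backend is installed at runtime. A minimal sketch (the helper name `pick_tokenizer_backend` is hypothetical; it only checks which module is importable):

```python
import importlib.util

def pick_tokenizer_backend():
    """Return the first available tokenizer module name, or None."""
    for name in ("tokendagger", "tiktoken"):
        if importlib.util.find_spec(name) is not None:
            return name
    return None

backend = pick_tokenizer_backend()
print(backend)  # "tokendagger" if installed, else "tiktoken" or None
```

This pattern lets a codebase prefer the faster backend while still running on machines where only TikToken is available.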
2. Tokenization operations
TokenDagger supports standard tokenization operations. Here is a simple example:
from tokendagger import encoding_for_model
encoder = encoding_for_model("gpt-3.5-turbo")
text = "Hello, this is a sample text for tokenization."
tokens = encoder.encode(text)
print(tokens)
This code converts the input text into a list of tokens, and it runs faster than TikToken, especially when dealing with long text or code.
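Conceptually, byte-level tokenizers like TikToken start from the raw UTF-8 bytes of the text and then apply BPE merges on top. This plain-Python snippet shows just that first byte-mapping step (an illustration of the idea, not TokenDagger's API):

```python
# The raw byte ids a byte-level tokenizer starts from, before any BPE merges.
text = "Hello"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)  # → [72, 101, 108, 108, 111]
```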
3. Tokenizing code
TokenDagger is particularly good at tokenizing code. Suppose you need to process Python code:
code = """
def hello_world():
    print("Hello, World!")
"""
tokens = encoder.encode(code)
print(len(tokens))  # print the token count
Tests have shown that TokenDagger processes similar code 4x faster than TikToken, making it suitable for scenarios that require fast code parsing.
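The regular-expression stage that TokenDagger accelerates with PCRE2 splits raw text into word-like chunks before BPE runs. A simplified sketch of that pre-tokenization idea using Python's `re` module (this toy pattern is deliberately much simpler than the patterns real tokenizers use):

```python
import re

# Toy pre-tokenization pattern: word runs, punctuation runs, whitespace runs.
PAT = re.compile(r"\w+|[^\w\s]+|\s+")

chunks = PAT.findall("def hello_world(): pass")
print(chunks)  # → ['def', ' ', 'hello_world', '():', ' ', 'pass']
```

BPE then operates within each chunk, which is why the speed of this regex pass matters so much for overall throughput, especially on punctuation-heavy source code.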
4. Running benchmark tests
TokenDagger provides a built-in test suite where developers can verify performance:
python3 -m tokendagger.benchmark
The test results will show a speed comparison between TokenDagger and TikToken on different datasets, such as 1GB text files or code samples.
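If you prefer to measure on your own data rather than the bundled suite, a small timing harness is easy to write. A generic sketch (the function under test is a stand-in; swap in `encoder.encode` for a real comparison):

```python
import time

def best_time(fn, data, repeats=5):
    """Return the best wall-clock time of fn(data) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in workload; replace str.split with the encoder you want to measure.
sample = "tokenize me " * 10_000
print(f"{best_time(str.split, sample):.6f} s")
```

Taking the best of several runs reduces noise from caches and background load, which matters when comparing two tokenizers whose per-call times are small.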
5. Custom development
Developers can modify the TokenDagger source code to fit specific needs. The project has a clear directory structure, with the core tokenization logic located in tokendagger/core. Developers can adapt the PCRE2 regular expressions or the BPE algorithm to optimize for specific use cases.
Caveats
- Environment requirements: ensure that Python 3.6+ and the PCRE2 library are installed on your system.
- Performance test environment: official benchmarks were conducted on an AMD EPYC 4584PX processor; actual performance may vary by hardware.
- Community Support: If you have problems, file an issue on GitHub or check the documentation.
Application Scenarios
- AI model development
TokenDagger is suitable for the preprocessing phase of large language models (LLMs), quickly converting text into tokens to improve training efficiency. For example, AI developers can use it to process large-scale datasets and reduce data-preprocessing time.
- Code analysis tools
In code review or static-analysis tools, TokenDagger quickly parses source code and generates token sequences for building syntax highlighting, code completion, or error-detection features.
- Big-data text processing
Data scientists can use TokenDagger to process massive amounts of text, such as log files or social-media data. Its high throughput significantly reduces processing time.
- Education and research
Students and researchers can use TokenDagger to study tokenization algorithms or experiment with NLP; the project is open source and well documented for academic exploration.
FAQ
- What is the difference between TokenDagger and TikToken?
TokenDagger is a high-performance alternative to TikToken that uses the PCRE2 engine and an optimized BPE algorithm. It is significantly faster: about 4x on code tokenization, with 2-3x higher throughput on text processing.
- Do I need to change my code to use TokenDagger?
No. TokenDagger is fully compatible with TikToken's API, so you can switch seamlessly by simply replacing the import statement.
- What programming languages does TokenDagger support?
It is designed primarily for Python developers, but the tokenizer can handle any text, including code in a variety of programming languages.
- How can I verify the performance of TokenDagger?
Run the built-in benchmarks:
python3 -m tokendagger.benchmark
The results let you compare the speed of TokenDagger and TikToken.