TokenDagger is a high-performance tokenizer designed to speed up natural language processing (NLP) workloads. It is an alternative to OpenAI's TikToken that focuses on tokenization speed and performs particularly well on source code and large volumes of text. Open-sourced on GitHub by developer Matthew Wolfe, the project provides a 100% TikToken-compatible interface that developers can use without modifying existing code. TokenDagger uses the PCRE2 engine to speed up regular expression matching and simplifies the byte-pair encoding (BPE) algorithm to improve performance. According to the project's tests, it tokenizes code about 4 times faster than TikToken and achieves 2-3 times higher throughput when processing a 1GB text file. The project is aimed at developers, data scientists, and AI researchers who need efficient text processing.

Feature List

  • Efficient tokenization: uses the PCRE2 engine to speed up regular expression matching, significantly improving tokenization speed.
  • Compatible with TikToken: works as a direct replacement for TikToken and can be integrated without modifying existing code.
  • Simplified BPE algorithm: optimized byte-pair encoding that reduces the performance overhead of handling special tokens.
  • Open source: the full source code is available, so developers can customize it or contribute improvements.
  • Cross-platform support: runs on Linux, macOS, and other systems, making it easy to deploy in a variety of development environments.
  • Test suite: includes a built-in benchmarking tool to validate tokenization performance and compare it with TikToken.

 

Usage Guide

Installation process

TokenDagger is easy to install for developers familiar with Python and Git. Here are the detailed installation steps, based on Ubuntu/Debian systems (other systems will need to adjust the dependency installation commands accordingly):

  1. Clone the code repository
    Use Git to clone the TokenDagger repository locally, then enter the project directory:

    git clone git@github.com:M4THYOU/TokenDagger.git
    cd TokenDagger

    This downloads the latest TokenDagger source code. The remaining commands are run from inside this directory.

  2. Install the PCRE2 development library
    TokenDagger uses PCRE2 for efficient regular expression matching and requires its development library:

    sudo apt install libpcre2-dev

  3. Update submodules
    The project relies on external components tracked as Git submodules, which need to be initialized and updated:

    git submodule update --init --recursive

  4. Install the Python development environment
    Ensure that your system has a Python 3 development environment:

    sudo apt update && sudo apt install -y python3-dev

  5. Install TikToken (optional)
    If you want to run the test suite and compare performance with TikToken, install TikToken as well:

    pip3 install tiktoken

  6. Build and install
    From the project directory, build and install the package:

    python3 setup.py install

    Once installed, TokenDagger can be imported and used from Python.

Usage

The core function of TokenDagger is fast tokenization of code, documents, and large-scale text. The main workflows are described below:

1. Integration into existing projects

TokenDagger is fully compatible with TikToken's API. Developers simply replace the TikToken import statement with the TokenDagger one, for example:

# Original code
from tiktoken import encoding_for_model
# Replace with
from tokendagger import encoding_for_model

Without any subsequent code changes, TokenDagger takes over the task of word segmentation, providing faster processing.
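
Because the switch is a one-line import change, a project can also fall back to TikToken on machines where TokenDagger is not installed. The following is a minimal sketch of that pattern, not something taken from the project itself. The helper name load_tokenizer_module is hypothetical, and the sketch assumes, as stated above, that the tokendagger package exposes the same entry points as tiktoken.

```python
import importlib


def load_tokenizer_module(candidates=("tokendagger", "tiktoken")):
    """Return the first importable module from `candidates`, or None.

    Hypothetical helper: the default order prefers TokenDagger and falls
    back to TikToken when TokenDagger is not installed.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # Try the next candidate.
    return None
```

If the returned module is not None, calling its encoding_for_model function should then work the same way regardless of which package was found, under the compatibility assumption above.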

2. Tokenizing text

TokenDagger supports the standard tokenization operations. Here is a simple example:

from tokendagger import encoding_for_model
encoder = encoding_for_model("gpt-3.5-turbo")
text = "Hello, this is a sample text for tokenization."
tokens = encoder.encode(text)
print(tokens)

This code converts the input text into a list of tokens. According to the project, the conversion is faster than with TikToken, especially for long text or code.
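
A quick sanity check after encoding is to decode the tokens and confirm that the original text comes back. The sketch below is written against any object with encode and decode methods, which is the shape of TikToken's Encoding class. ByteEncoder is a hypothetical stand-in used only so the example runs without either library installed.

```python
class ByteEncoder:
    """Hypothetical stand-in tokenizer: one token per UTF-8 byte."""

    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, tokens):
        return bytes(tokens).decode("utf-8")


def roundtrip_ok(encoder, text):
    """Return True if decoding the encoded text reproduces it exactly."""
    return encoder.decode(encoder.encode(text)) == text


sample = "Hello, this is a sample text for tokenization."
print(roundtrip_ok(ByteEncoder(), sample))
```

With a real library, the object returned by encoding_for_model would be passed in place of ByteEncoder().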

3. Tokenizing code

TokenDagger is particularly good at tokenizing source code. Suppose you need to process a piece of Python code:

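from tokendagger import encoding_for_model
encoder = encoding_for_model("gpt-3.5-turbo")
code = """
def hello_world():
    print("Hello, World!")
"""
tokens = encoder.encode(code)
print(len(tokens))  # Print the number of tokens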

According to the project's tests, TokenDagger tokenizes code of this kind about 4x faster than TikToken, which makes it suitable for scenarios that require fast code tokenization.

4. Running benchmark tests

TokenDagger provides a built-in test suite where developers can verify performance:

python3 -m tokendagger.benchmark

The test results will show a speed comparison between TokenDagger and TikToken on different datasets, such as 1GB text files or code samples.
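
For a quick measurement on your own data, independent of the built-in suite, a small timing harness is enough. The sketch below is not TokenDagger's benchmark. Both measure_throughput and the stand-in whitespace_encode function are hypothetical, and any callable that takes a string can be passed in, such as the encode method of a TokenDagger or TikToken encoder.

```python
import time


def measure_throughput(encode, text, repeats=3):
    """Return the best observed throughput of `encode` in MB/s.

    `encode` is any callable taking a str. The best of `repeats` runs is
    used to reduce noise from other processes.
    """
    size_mb = len(text.encode("utf-8")) / 1_000_000
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        encode(text)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from a very coarse clock.
    return size_mb / max(best, 1e-9)


def whitespace_encode(text):
    """Hypothetical stand-in tokenizer so the sketch runs anywhere."""
    return text.split()


sample = "def hello_world():\n    print('Hello, World!')\n" * 20_000
print(f"{measure_throughput(whitespace_encode, sample):.1f} MB/s")
```

Dividing the MB/s figure for one encoder by the figure for another on the same input gives the speedup ratio on that input.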

5. Custom development

Developers can modify the TokenDagger source code to fit specific needs. The project has a clear directory structure, and the core tokenization logic is located in tokendagger/core. Developers can adapt the PCRE2 regular expressions or the BPE algorithm to optimize for specific use cases.
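
For readers new to the algorithm being customized, the core idea of byte-pair encoding can be shown in a few lines: split the text into pieces with a regular expression, then repeatedly merge the adjacent pair with the lowest rank. The sketch below is a toy illustration, not TokenDagger's implementation. It works on characters rather than bytes, the split pattern is a simplified stand-in for the more elaborate patterns real tokenizers use, and the merge table is made up for the example.

```python
import re

# Simplified pre-tokenization pattern (an assumption, far simpler than real
# tokenizer patterns): a word or a single symbol, each with an optional
# leading space, or a run of whitespace.
SPLIT_PATTERN = re.compile(r" ?\w+| ?[^\w\s]|\s+")

# Made-up merge table: a lower rank means the pair is merged earlier.
MERGE_RANKS = {("l", "l"): 0, ("h", "e"): 1, ("he", "ll"): 2, ("hell", "o"): 3}


def bpe_merge(piece, ranks):
    """Greedily merge adjacent symbol pairs of one piece by lowest rank."""
    symbols = list(piece)
    while len(symbols) > 1:
        pairs = [(ranks.get(pair, float("inf")), i)
                 for i, pair in enumerate(zip(symbols, symbols[1:]))]
        best_rank, i = min(pairs)
        if best_rank == float("inf"):
            break  # No known pair is left to merge.
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


def tokenize(text):
    """Split text with the regex, then apply BPE merges to each piece."""
    return [sym for piece in SPLIT_PATTERN.findall(text)
            for sym in bpe_merge(piece, MERGE_RANKS)]


print(tokenize("hello world"))
```

With this merge table, "hello" collapses into a single token while " world" stays as individual characters, because none of its pairs appear in the table. Changing the split pattern or the merge ranks changes the resulting tokens, which is the kind of adaptation described above.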

Caveats

  • Environment requirements: ensure that Python 3.6+ and the PCRE2 library are installed on your system.
  • Benchmark environment: the official benchmarks were run on an AMD EPYC 4584PX processor; actual performance may vary with hardware.
  • Community support: if you run into problems, file an issue on GitHub or check the documentation.
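
The Python requirement above can be checked from Python itself before running anything else. This is a minimal sketch with a hypothetical helper name. It only tests the interpreter version and whether a package named tokendagger is importable, and it does not verify the PCRE2 system library.

```python
import importlib.util
import sys


def check_environment(package="tokendagger", min_version=(3, 6)):
    """Report whether Python is new enough and `package` is importable."""
    return {
        "python_ok": sys.version_info >= min_version,
        "package_found": importlib.util.find_spec(package) is not None,
    }


print(check_environment())
```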

 

Application Scenarios

  1. AI Model Development
    TokenDagger is suitable for the preprocessing phase of large language models (LLMs), where it quickly converts text into tokens to improve training efficiency. For example, AI developers can use it to process large-scale datasets and reduce data preprocessing time.
  2. Code Analysis Tools
    In code review or static analysis tools, TokenDagger quickly parses source code and generates token sequences for building syntax highlighting, code completion, or error detection features.
  3. Big Data Text Processing
    Data scientists can use TokenDagger to process massive amounts of text, such as log files or social media data. Its high throughput significantly reduces processing time.
  4. Education and Research
    Students and researchers can use TokenDagger to study tokenization algorithms or run NLP experiments. The project is open source and well documented, which makes it suitable for academic exploration.
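
For the large-file scenarios above, text usually has to be fed to the tokenizer in pieces rather than loaded whole. The following sketch counts the tokens in a file by reading it in batches of lines. The helper count_tokens_in_file is hypothetical, and encode can be any callable that returns a list of tokens. Because the text is split at batch boundaries, the total may differ slightly from tokenizing the entire file in one call.

```python
import os
import tempfile


def count_tokens_in_file(path, encode, lines_per_batch=10_000):
    """Count tokens in a text file without loading it all into memory."""
    total = 0
    batch = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            batch.append(line)
            if len(batch) >= lines_per_batch:
                total += len(encode("".join(batch)))
                batch = []
    if batch:  # Encode whatever is left after the last full batch.
        total += len(encode("".join(batch)))
    return total


# Demo with a temporary file and a whitespace splitter standing in for a
# real encoder's encode method.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("INFO service started\n" * 25)
demo_path = tmp.name
print(count_tokens_in_file(demo_path, str.split, lines_per_batch=10))
os.remove(demo_path)
```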

 

FAQ

  1. What is the difference between TokenDagger and TikToken?
    TokenDagger is a high-performance alternative to TikToken that uses the PCRE2 engine and an optimized BPE algorithm. According to the project's tests, it is about 4 times faster at tokenizing code and delivers 2-3 times the throughput on large text.
  2. Do I need to change my code to use TokenDagger?
    No. TokenDagger is fully compatible with TikToken's API, so you can switch seamlessly by simply replacing the import statement.
  3. What programming languages does TokenDagger support?
    It is designed primarily for Python developers, but the tokenizer can handle any text, including code written in a variety of programming languages.
  4. How can I verify the performance of TokenDagger?
    Run the built-in benchmark with python3 -m tokendagger.benchmark to compare the speed of TokenDagger and TikToken.
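
Alongside speed, it is worth confirming that both libraries return identical tokens on your own inputs before switching. The sketch below compares two encode callables over a list of samples. The helper first_mismatch is hypothetical, and the two stand-in functions exist only so the example runs without either library installed.

```python
def first_mismatch(encode_a, encode_b, samples):
    """Return the first sample the two encoders tokenize differently, or None."""
    for text in samples:
        if list(encode_a(text)) != list(encode_b(text)):
            return text
    return None


def bytes_encode(text):
    """Stand-in encoder: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def codepoint_encode(text):
    """Stand-in encoder: one token per Unicode code point."""
    return [ord(ch) for ch in text]


samples = ["def f(): pass", "plain ascii", "naïve café"]
print(first_mismatch(bytes_encode, codepoint_encode, samples))
```

The two stand-ins agree on ASCII input and diverge on the accented sample, so the helper reports that string. In practice, the encode methods of a TikToken encoder and a TokenDagger encoder for the same model would be passed in, and a result of None means no difference was found on the samples tried.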