
SimpleDeepSearcher is an open-source framework designed to enhance the capabilities of Large Language Models (LLMs) on complex information retrieval tasks. It generates high-quality reasoning and search traces by simulating real web search behavior, allowing models to be trained efficiently without large amounts of data. Compared with traditional Retrieval-Augmented Generation (RAG) or Reinforcement Learning (RL) methods, SimpleDeepSearcher uses a small amount of curated data to teach the model to autonomously perform complex reasoning and search tasks through knowledge distillation and self-distillation. Developed by the RUCAIBox team and released under the MIT license, the project is intended for researchers and developers who want to optimize the search capabilities of large language models. The official documentation and code are hosted on GitHub, last updated in April 2025.

SimpleDeepSearcher: an intelligent retrieval tool that enhances large language models through web search

 

Function List

  • Web search simulation: simulates the search behavior of real users in an open web environment, generating multi-round reasoning and search trajectories.
  • Data synthesis and screening: generates high-quality training data through diversity-aware query sampling and multi-dimensional response filtering.
  • Efficient supervised fine-tuning: supervised fine-tuning (SFT) can be completed with only a small amount of curated data, reducing computational cost.
  • Compatibility with multiple models: supports existing base and chat large language models without additional cold-start instruction fine-tuning.
  • Open-source code and documentation: complete training code, inference code, and model checkpoints are provided for developers' ease of use and research.

 

Usage Help

Installation process

SimpleDeepSearcher is a Python-based open-source project; the runtime environment requires Python 3.10 or above. The detailed installation steps are as follows:

  1. Cloning the project repository
    Clone the SimpleDeepSearcher repository locally by running the following commands in a terminal:

    git clone https://github.com/RUCAIBox/SimpleDeepSearcher.git
    cd SimpleDeepSearcher
    
  2. Creating a Virtual Environment
    Use conda to create and activate a virtual environment to keep dependencies isolated:

    conda create --name simpledeepsearcher python=3.10
    conda activate simpledeepsearcher
    
  3. Installing dependencies
    Install the core dependency libraries required by the project, such as vLLM, DeepSpeed, and datasets, by running the following commands:

    pip install vllm==0.6.5
    pip install packaging ninja flash-attn --no-build-isolation
    pip install deepspeed accelerate datasets
    
  4. Configuring the Search API
    SimpleDeepSearcher uses the Google Search API for online search. Users need to obtain a subscription key and endpoint URL for Google Search and configure them when running the scripts (a minimal connectivity check is sketched after these steps). Example:

    export GOOGLE_SUBSCRIPTION_KEY="YOUR_KEY"
    export GOOGLE_ENDPOINT="https://google.serper.dev/search"
    
  5. Preparing the model paths
    The user needs to specify the paths to the reasoning model and the summarization model. Example:

    export MODEL_PATH="/path/to/your/reasoning_model"
    export SUMMARIZATION_MODEL_PATH="/path/to/your/summarization_model"
    
  6. Running inference scripts
    After completing the configuration, run the inference script for testing or deployment:

    export CUDA_VISIBLE_DEVICES=0,1
    python -u inference/inference.py \
    --dataset_name YOUR_DATASET_NAME \
    --cache_dir_base cache \
    --output_dir_base output \
    --model_path "$MODEL_PATH" \
    --summarization_model_path "$SUMMARIZATION_MODEL_PATH" \
    --summarization_model_url YOUR_SUMMARIZATION_MODEL_URL \
    --google_subscription_key "$GOOGLE_SUBSCRIPTION_KEY" \
    --google_endpoint "$GOOGLE_ENDPOINT" > output/output.log 2>&1
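
The following is a minimal connectivity check for step 4. It assumes the endpoint behaves like Serper's public Google Search API (a POST request authenticated with an X-API-KEY header); the exact request SimpleDeepSearcher sends may differ, and the script name is hypothetical:

    # check_search_api.py -- hypothetical helper, not part of the repository.
    # Sends one query to the configured endpoint to confirm the key works.
    import os

    import requests

    def google_search(query: str) -> dict:
        """Send a single search query and return the parsed JSON response."""
        response = requests.post(
            os.environ["GOOGLE_ENDPOINT"],
            headers={
                "X-API-KEY": os.environ["GOOGLE_SUBSCRIPTION_KEY"],
                "Content-Type": "application/json",
            },
            json={"q": query},
            timeout=10,
        )
        response.raise_for_status()
        return response.json()

    if __name__ == "__main__":
        results = google_search("SimpleDeepSearcher RUCAIBox")
        # Print the first few organic results to confirm the configuration.
        for item in results.get("organic", [])[:3]:
            print(item.get("title"), "-", item.get("link"))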
    

Main Functions

The core functionality of SimpleDeepSearcher is to enhance the reasoning of large language models through web search. The main features work as follows:

  1. Data synthesis and search trajectory generation
    SimpleDeepSearcher generates multi-round reasoning trajectories by simulating user search behavior in a real web environment. Users can configure the data_synthesis module, specifying search terms and question types. The system automatically samples diverse questions from open-domain QA resources and retrieves relevant web content through the Google search API. The generated data, including questions, search terms, web page results, and reasoning paths, is saved in the cache folder; a minimal sketch of the underlying reason-search-summarize loop follows this list.
    Operational steps:

    • Edit data_synthesis_config.json to set the query sampling parameters (e.g., domain diversity, keyword complexity).
    • Run python data_synthesis.py to generate the initial data.
    • Inspect the cache/synthesis_data directory to ensure the data was generated completely.
  2. Data filtering and optimization
    The program provides multi-dimensional response filtering to ensure the quality of the training data. Users can run the response_curation.py script to filter data based on criteria such as question difficulty, reasoning path length, and search effectiveness; a toy filtering sketch also follows this list.
    Operational steps:

    • Run the following command to start filtering:
      python response_curation.py --input_dir cache/synthesis_data --output_dir cache/curated_data
      
    • The filtered data is saved in the cache/curated_data directory, in which only high-quality training samples are retained.
  3. Supervised fine-tuning of the model
    SimpleDeepSearcher uses supervised fine-tuning (SFT) to optimize large language models. Users need to prepare a base model (e.g., Qwen2.5-32B) and use the filtered data for fine-tuning.
    Operational steps:

    • Configure sft_config.json, specifying the model path and training parameters (e.g., learning rate, batch size).
    • Run the following command to start fine-tuning:
      python sft_train.py --config sft_config.json
      
    • After training completes, the model checkpoints are saved in the output/checkpoints directory.
  4. Reasoning and testing
    Users can test the search and reasoning capabilities of the model through the inference script. The inference results, including the generated answers and reasoning paths, are written to the output/results directory.
    Operational steps:

    • Configure the inference parameters (e.g., dataset name, output directory).
    • Run the inference script (see the commands in the installation process).
    • Inspect output/output.log to check the reasoning results.
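
To make the workflow above concrete, here is a minimal sketch of the multi-round reason-search-summarize loop that steps 1 and 4 rely on. All names (generate_step, web_search, summarize, and the <search>/<answer> tags) are illustrative assumptions; the actual control flow lives in inference/inference.py:

    # Hypothetical sketch of the reasoning-and-search loop, not the repo's code.
    from typing import Callable

    def deep_search(question: str,
                    generate_step: Callable[[str], str],
                    web_search: Callable[[str], str],
                    summarize: Callable[[str], str],
                    max_rounds: int = 5) -> str:
        """Alternate between model reasoning and web search until the model
        emits a final answer or the round budget is exhausted."""
        trajectory = f"Question: {question}\n"
        for _ in range(max_rounds):
            step = generate_step(trajectory)   # the model continues the trace
            trajectory += step
            if "<answer>" in step:             # the model committed to an answer
                break
            if "<search>" in step:             # the model requested evidence
                query = step.split("<search>")[1].split("</search>")[0]
                docs = web_search(query)       # e.g., the Google search API
                trajectory += f"\n<result>{summarize(docs)}</result>\n"
        return trajectory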
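
Likewise, the multi-dimensional filtering in step 2 can be pictured as a set of per-trajectory predicates. The field names below are assumptions chosen for illustration; the real criteria are implemented in response_curation.py:

    # Hypothetical response-filtering sketch, not the repository's code.
    def keep_sample(sample: dict,
                    max_search_calls: int = 5,
                    max_reasoning_tokens: int = 2048) -> bool:
        """Retain a trajectory only if it is correct, uses few search calls,
        and keeps the reasoning path reasonably short."""
        return (
            sample["is_correct"]
            and sample["num_search_calls"] <= max_search_calls
            and sample["reasoning_tokens"] <= max_reasoning_tokens
        )

    samples = [
        {"is_correct": True, "num_search_calls": 3, "reasoning_tokens": 900},
        {"is_correct": False, "num_search_calls": 2, "reasoning_tokens": 400},
    ]
    curated = [s for s in samples if keep_sample(s)]  # keeps only the first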

Featured Functions

  • Diversity query sampling: SimpleDeepSearcher uses a diversity-aware query sampling strategy that selects questions based on domain heterogeneity, keyword diversity, and knowledge unit complexity. Users can adjust the sampling parameters in query_sampling_config.json to ensure that the generated questions cover a wide range of domains and difficulty levels; a toy sampling sketch follows this list.
  • Knowledge distillation and self-distillation: the project generates high-quality training data from powerful reasoning models through knowledge distillation. Users can specify a strong pre-trained model (e.g., from the LLaMA or GPT families) as the teacher model and run the distillation.py script to generate data.
  • Real-time web search: SimpleDeepSearcher supports real-time web search, combining with the Google search API to fetch the latest information dynamically. Users need to make sure the API key is valid and check the network connection.
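
As a toy illustration of the diversity-aware sampling described above, the greedy selector below prefers questions that introduce a new domain or mostly-new keywords. The fields ("domain", "keywords") and the acceptance rule are assumptions; the real strategy is configured through query_sampling_config.json:

    # Hypothetical greedy diversity sampler, not the repository's code.
    def sample_diverse(pool: list[dict], k: int) -> list[dict]:
        """Greedily pick up to k questions with low domain/keyword overlap."""
        selected: list[dict] = []
        seen_domains: set[str] = set()
        seen_keywords: set[str] = set()
        for q in sorted(pool, key=lambda q: len(q["keywords"]), reverse=True):
            if len(selected) == k:
                break
            new_kw = set(q["keywords"]) - seen_keywords
            # Accept if it opens a new domain or brings mostly-new keywords.
            if q["domain"] not in seen_domains or 2 * len(new_kw) >= len(q["keywords"]):
                selected.append(q)
                seen_domains.add(q["domain"])
                seen_keywords |= set(q["keywords"])
        return selected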

Caveats

  • Ensure that the network environment is stable enough to support real-time web searches.
  • Check the model path and API key configuration to avoid runtime errors.
  • Keep dependency library versions up to date to ensure compatibility.
  • The project code and documentation follow the MIT license; users are asked to cite the source (e.g., the paper @article{sun2025simpledeepsearcher}).

 

Application Scenarios

  1. Academic research
    SimpleDeepSearcher helps researchers optimize the performance of large language models in information retrieval tasks. For example, in paper retrieval or data analysis, the model can quickly access relevant literature or datasets through web searches to improve research efficiency.
  2. Q&A System Development
    Developers can use SimpleDeepSearcher to build intelligent Q&A systems. The system simulates user search behavior and generates accurate answers, making it suitable for customer service bots or educational platforms.
  3. Complex problem reasoning
    For problems that require multi-step reasoning (e.g., math or logic problems), SimpleDeepSearcher helps models provide more accurate answers by generating reasoning trajectories, suitable for online education or competition platforms.

 

QA

  1. What prerequisites does SimpleDeepSearcher require?
    Users need a Python 3.10+ environment, a Google search API key, and a pre-trained large language model. Make sure to install all dependency libraries and configure the correct model paths.
  2. How do you ensure the quality of the data generated?
    The project provides multi-dimensional response filtering; users can adjust the filtering parameters (e.g., question difficulty, reasoning path length) to retain only high-quality data.
  3. Are other search APIs supported?
    The current version mainly supports the Google search API. Users can modify the code to adapt it to other search services, but the API call logic needs to be adjusted.
  4. How long does the training take?
    Training time depends on the model size and the amount of data. Fine-tuning a Qwen2.5-32B model with 871 high-quality samples typically takes several hours in a GPU environment.