BAGEL is an open source multimodal base model developed by the ByteDance Seed team and hosted on GitHub. It integrates text comprehension, image generation, and image editing to support cross-modal tasks. The model has 7B active parameters (14B parameters in total) and uses Mixture-of-Tra...
RealtimeVoiceChat is an open source project focused on real-time, natural voice conversations with artificial intelligence. The user speaks into a microphone, the browser captures the audio, the system quickly transcribes it to text, a large language model (LLM) generates a reply, and the reply is converted back to speech output, the whole...
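The capture → transcribe → generate → synthesize loop described above can be sketched as follows. Note that `transcribe`, `generate_reply`, and `synthesize` are hypothetical stand-ins for the real STT, LLM, and TTS components, not RealtimeVoiceChat's actual API.

```python
# Hypothetical sketch of a single voice-chat turn: audio in, audio out.
# The three stage functions are placeholders, not the project's real API.

def transcribe(audio: bytes) -> str:
    """Stand-in for a speech-to-text engine (e.g. a Whisper model)."""
    return "hello there"          # a real engine would decode the audio

def generate_reply(text: str) -> str:
    """Stand-in for a large language model call."""
    return f"You said: {text}"

def synthesize(text: str) -> bytes:
    """Stand-in for a text-to-speech engine."""
    return text.encode("utf-8")   # a real engine would return audio samples

def voice_turn(audio: bytes) -> bytes:
    """One conversational turn: STT -> LLM -> TTS."""
    return synthesize(generate_reply(transcribe(audio)))
```

Swapping any stage for a streaming implementation (partial transcripts, token-by-token TTS) is what makes the loop feel real-time.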
Stepsailor is a developer tool built around an AI command bar. Developers can use it to make their software products understand natural-language commands: for example, when a user says "add new task", the software executes the action automatically. It is integrated into SaaS products through a simple SDK, and does not require developers to know ...
OpenAvatarChat is an open source project developed by the HumanAIGC-Engineering team and hosted on GitHub. It is a modular digital-human conversation tool whose full functionality can run on a single PC. The project combines real-time video, speech recognition, and digital human technology...
VideoMind is an open source multimodal AI tool focused on reasoning, question answering, and summary generation for long videos. It was developed by Ye Liu of The Hong Kong Polytechnic University together with a team from Show Lab at the National University of Singapore. The tool mimics the way humans understand video by splitting the task into planning, localization, checking...
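The divide-and-conquer flow (plan, then localize a segment, then check it before answering) might be sketched like this. The role names, signatures, and return values here are illustrative assumptions, not VideoMind's actual interfaces.

```python
# Illustrative sketch of splitting a long-video question into roles.
# Role functions and their outputs are assumptions for demonstration.

def plan(question: str) -> list[str]:
    """Decide which sub-steps the question needs."""
    return ["localize", "check", "answer"]

def localize(question: str, video_len_s: int) -> tuple[int, int]:
    """Pick the time window (in seconds) most relevant to the question."""
    return (0, min(60, video_len_s))  # placeholder: first minute

def check(window: tuple[int, int]) -> bool:
    """Verify the chosen window is plausible before answering."""
    start, end = window
    return end > start

def answer_question(question: str, video_len_s: int) -> str:
    """Run the role pipeline: plan -> localize -> check -> answer."""
    steps = plan(question)
    window = localize(question, video_len_s) if "localize" in steps else (0, video_len_s)
    if "check" in steps and not check(window):
        return "unable to localize a relevant segment"
    return f"answer based on segment {window[0]}-{window[1]}s"
```

The point of the decomposition is that each role can be a separate model call, so errors in localization can be caught before an answer is produced.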
MoshiVis is an open source project developed by Kyutai Labs and hosted on GitHub. It builds on the Moshi speech-text model (7B parameters), adding about 206 million new adaptation parameters and a frozen PaliGemma2 vision encoder (400M parameters), allowing the model...
Qwen2.5-Omni is an open source multimodal AI model developed by the Alibaba Cloud Qwen team. It can process multiple kinds of input, including text, images, audio, and video, and generate text or natural speech responses in real time. The model was released on March 26, 2025, and its code and model files are hosted on GitH...
xiaozhi-esp32-server is a tool that provides the backend service for the Xiaozhi AI chatbot (xiaozhi-esp32). It is written in Python and based on the WebSocket protocol, helping users quickly build a server to control ESP32 devices. This project is suitable ...
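A Python backend of this shape typically centers on a message dispatcher behind an async server loop. The sketch below is a generic illustration; the JSON message schema and handler names are invented here and are not the actual xiaozhi-esp32 protocol.

```python
# Toy message dispatcher for a device-control backend.
# The schema ("ping"/"speak") is invented for illustration only.
import asyncio
import json

def handle_message(raw: str) -> str:
    """Dispatch one JSON message from a device and build the reply."""
    msg = json.loads(raw)
    if msg.get("type") == "ping":
        return json.dumps({"type": "pong"})
    if msg.get("type") == "speak":
        return json.dumps({"type": "tts", "text": msg.get("text", "")})
    return json.dumps({"type": "error", "reason": "unknown message type"})

async def serve_device(reader: asyncio.StreamReader,
                       writer: asyncio.StreamWriter) -> None:
    """Toy line-delimited server loop; a real server would speak WebSocket."""
    while line := await reader.readline():
        writer.write((handle_message(line.decode()) + "\n").encode())
        await writer.drain()
```

Keeping the dispatcher a pure function of the incoming message makes it easy to unit-test without any network in the loop.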
Baichuan-Audio is an open source project developed by Baichuan Intelligence (baichuan-inc), hosted on GitHub and focused on end-to-end voice interaction technology. The project provides a complete audio processing framework that converts speech input into discrete audio tokens, which are then passed through a large model to generate a pair of ...
PowerAgents is an AI agent platform focused on web automation tasks. It allows users to create and deploy AI agents capable of clicking, typing, and extracting data. The platform supports scheduling tasks to run hourly, daily, or weekly, and users can watch the agents at work in real time. It doesn't...
Step-Audio is an open source intelligent speech interaction framework designed to provide out-of-the-box speech understanding and generation capabilities for production environments. The framework supports multilingual dialogue (e.g., Chinese, English, Japanese), emotional speech (e.g., happy, sad), regional dialects (e.g., Cantonese, Sichuanese), adjustable speech rate...
Gemini Cursor is a desktop AI assistant based on Google's Gemini 2.0 Flash (experimental) model. It enables visual, auditory, and voice interaction via a multimodal API, providing a real-time, low-latency user experience. The project, created by @13point5, aims to pass...
DeepSeek-VL2 is a series of advanced Mixture-of-Experts (MoE) vision-language models that significantly improve on the performance of their predecessor, DeepSeek-VL. The models excel at tasks such as visual question answering, optical character recognition, document/table/chart comprehension, and visual grounding. De...
AI Web Operator is an open source AI browser-operator tool designed to simplify the user's experience in the browser by integrating multiple AI technologies and SDKs. Built on Browserbase and the Vercel AI SDK, the tool supports a variety of large language models (LLMs)...
SpeechGPT 2.0-preview is the first anthropomorphic real-time interaction system introduced by OpenMOSS, trained on millions of hours of speech data. SpeechGPT 2.0-previ...
OpenAI Realtime Agents is an open source project that demonstrates how OpenAI's Realtime API can be used to build multi-agent voice applications. It provides a high-level agent pattern (borrowed from OpenAI Swarm) that allows developers to build complex multi-agent voice systems in a short period of time. The project ...
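The Swarm-style pattern of agents handing a conversation off to one another can be sketched as below. The agent names and the handoff convention (an agent returns either a final reply or another agent) are assumptions for illustration, not the project's actual code.

```python
# Minimal sketch of a Swarm-style handoff: an agent either answers the
# query itself or returns another agent to take over. Illustrative only.
from typing import Callable, Union

Agent = Callable[[str], Union[str, "Agent"]]

def billing_agent(query: str) -> Union[str, "Agent"]:
    """Specialist agent: always produces a final reply."""
    return "Your invoice is ready."

def triage_agent(query: str) -> Union[str, "Agent"]:
    """Front-door agent: hands billing questions to the specialist."""
    if "invoice" in query:
        return billing_agent
    return "How can I help?"

def run(agent: "Agent", query: str, max_hops: int = 5) -> str:
    """Follow handoffs until some agent returns a final string reply."""
    for _ in range(max_hops):
        result = agent(query)
        if isinstance(result, str):
            return result
        agent = result
    raise RuntimeError("too many handoffs")
```

In a realtime voice system each "agent" would also carry its own instructions and tools, but the handoff mechanic is the same.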
Bailing is an open-source voice conversation assistant designed to engage in natural spoken conversations with users. The project combines automatic speech recognition (ASR), voice activity detection (VAD), a large language model (LLM), and text-to-speech (TTS) synthesis to implement a voice conversation bot similar to GPT-4o...
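In a pipeline like this, VAD is the gate that decides which audio frames are worth sending to ASR. A minimal energy-based version could look as follows; the threshold and frame format are illustrative assumptions, not Bailing's actual implementation.

```python
# Naive energy-based voice activity detection over 16-bit PCM frames.
# Threshold and framing are illustrative; production VADs (e.g. the
# webrtcvad package) use far more robust features than raw energy.
import array

def frame_energy(pcm: bytes) -> float:
    """Mean absolute amplitude of a little-endian 16-bit PCM frame."""
    samples = array.array("h", pcm)
    if not samples:
        return 0.0
    return sum(abs(s) for s in samples) / len(samples)

def is_speech(pcm: bytes, threshold: float = 500.0) -> bool:
    """Gate a frame: only frames above the energy threshold go to ASR."""
    return frame_energy(pcm) > threshold
```

Dropping silent frames early keeps the ASR and LLM stages from burning compute on background noise, which is what makes the turn-taking feel responsive.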
Weebo is an open source real-time voice chatbot that uses Whisper Small for speech recognition, Llama 3.2 for natural language generation, and Kokoro-82M for speech synthesis. The project was developed by Amanvir Parhar with the aim of providing a native...
OmAgent is a multimodal agent framework developed by Om AI Lab to provide powerful AI-powered functionality for smart devices. The project enables developers to create efficient, real-time interactive experiences on a wide range of smart devices by integrating state-of-the-art multimodal foundation models and agent algorithms. OmAgent does...