Seedance 1.0
Seedance 1.0 is an AI video generation tool developed by ByteDance's Seed team, focused on turning text or images into high-quality video content. Users simply enter a text description or upload an image, and Seedance generates video at resolutions up to 1080p, suitable for creative content production, ...
Gemma 3n
Google is expanding its footprint for inclusive AI with the release of Gemma 3 and Gemma 3 QAT, open source models that run on a single cloud or desktop accelerator. If Gemma 3 brought powerful cloud and desktop capabilities to developers, this May 20, 2025 release...
MoviiGen 1.1
MoviiGen 1.1 is an open source AI tool developed by ZuluVision that focuses on generating high-quality video from text. It supports 720P and 1080P resolutions and is especially suitable for professional video production that requires cinematic visual effects. Users can generate videos from simple text descriptions, with natural dynamic...
HiDream-I1
HiDream-I1 is an open source image generation foundation model with 17 billion parameters that quickly generates high-quality images. Users only need to enter a text description, and the model can produce images in a variety of styles, including realistic, cartoon, artistic, and more. Developed by the HiDream.ai team and hosted on GitHub, the project picks...
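As a rough illustration of the text-to-image workflow, here is a minimal sketch using the Hugging Face diffusers API. The repo id, dtype, and single-pipeline loading are assumptions, not details from the entry above; the official repo may also require supplying the Llama text encoder separately.

```python
# Minimal text-to-image sketch with Hugging Face diffusers.
# The repo id "HiDream-ai/HiDream-I1-Full" is an assumption; check the
# project's GitHub page. Some HiDream checkpoints require passing the
# Llama text encoder components explicitly (see the project README).
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "HiDream-ai/HiDream-I1-Full",
    torch_dtype=torch.bfloat16,  # 17B params: a large GPU is assumed
)
pipe.to("cuda")

# One prompt, one image; style is steered purely by the text description.
image = pipe(prompt="a cartoon-style lighthouse at sunset").images[0]
image.save("lighthouse.png")
```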
Imagen 4
Google DeepMind's recently launched Imagen 4 model, the latest iteration of its image generation technology, is quickly becoming an industry focal point. The model has made significant progress in improving the richness, accuracy of detail, and speed of image generation, working to bring the user's imagination to life in ways never before...
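For context, prompt-to-image calls to Google's Imagen models go through the google-genai SDK. The sketch below follows that SDK's documented generate_images pattern; the Imagen 4 model id string is an assumption and may differ from the actual release name.

```python
# Prompt-to-image call via the google-genai SDK. The model id string
# is an assumption (Imagen 4 naming may differ); generate_images
# itself follows the SDK's documented pattern.
from google import genai
from google.genai import types

client = genai.Client()  # reads GOOGLE_API_KEY from the environment
resp = client.models.generate_images(
    model="imagen-4.0-generate-001",      # assumed model id
    prompt="a detailed watercolor of a red panda reading a map",
    config=types.GenerateImagesConfig(number_of_images=1),
)

# The SDK returns raw bytes for each generated image.
with open("panda.png", "wb") as f:
    f.write(resp.generated_images[0].image.image_bytes)
```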
BAGEL
BAGEL is an open source multimodal foundation model developed by the ByteDance Seed team and hosted on GitHub. It integrates text comprehension, image generation, and editing capabilities to support cross-modal tasks. The model has 7B active parameters (14B parameters in total) and uses a Mixture-of-Tra...
MiniMax Speech 02
With the continuous evolution of AI technology, personalized and highly natural voice interaction has become a key requirement for many intelligent applications. However, existing text-to-speech (TTS) systems still face challenges in delivering large-scale personalized voices, multilingual coverage, and highly realistic emotional expression. To address these limitations...
Windsurf SWE-1
SWE-1: A New Generation of Cutting-Edge Models for Software Engineering. Recently, the much-anticipated SWE-1 family of models was released. Designed to optimize the entire software engineering process, this family of models goes far beyond the traditional task of writing code. Currently, the SWE-1 family consists of three well-positioned models:...
VideoMind
VideoMind is an open source multimodal AI tool focused on reasoning, Q&A, and summary generation for long videos. It was developed by Ye Liu of The Hong Kong Polytechnic University together with a team from Show Lab at the National University of Singapore. The tool mimics the way humans understand video by splitting the task into planning, localization, checking...
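That planning/localization/checking split can be pictured as a simple control loop. The sketch below is purely conceptual: plan, localize, verify, and answer are hypothetical stand-ins for VideoMind's agents, not its actual API.

```python
# Conceptual sketch of a plan -> localize -> verify -> answer loop.
# All four callables are hypothetical placeholders; VideoMind's real
# agents are LLM-based components, not simple functions.
from typing import Callable

def answer_long_video(video: str, question: str,
                      plan: Callable, localize: Callable,
                      verify: Callable, answer: Callable) -> str:
    steps = plan(question)                  # break the query into sub-tasks
    for step in steps:
        span = localize(video, step)        # find a candidate time span
        if verify(video, span, step):       # double-check the evidence
            return answer(video, span, question)
    return "insufficient evidence in the video"
```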
MoshiVis
MoshiVis is an open source project developed by Kyutai Labs and hosted on GitHub. It builds on the Moshi speech-text model (7B parameters), adding about 206 million new adaptation parameters alongside a frozen PaliGemma2 vision encoder (400M parameters), allowing the model...
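The parameter split described here, two frozen backbones plus a small set of trainable adaptation parameters, follows a common adapter pattern. Below is a generic PyTorch sketch of that pattern, not MoshiVis's actual code: only the cross-attention adapter would be trained.

```python
# Generic adapter pattern: freeze both backbones, train only a small
# cross-attention adapter that injects visual features into the speech
# model's hidden states. An illustration of the idea, not MoshiVis's
# actual implementation.
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    def __init__(self, d_speech: int, d_vision: int, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(d_vision, d_speech)
        self.cross_attn = nn.MultiheadAttention(d_speech, n_heads,
                                                batch_first=True)

    def forward(self, speech_h: torch.Tensor,
                vision_h: torch.Tensor) -> torch.Tensor:
        v = self.proj(vision_h)
        mixed, _ = self.cross_attn(speech_h, v, v)
        return speech_h + mixed  # residual: adapter can be a no-op at init

# The backbones stay frozen; only the adapter's parameters train:
# for p in speech_model.parameters(): p.requires_grad = False
# for p in vision_encoder.parameters(): p.requires_grad = False
# adapter = VisionAdapter(d_speech=4096, d_vision=1152)
```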
Qwen2.5-Omni
Qwen2.5-Omni is an open source multimodal AI model developed by the Alibaba Cloud Qwen team. It can process multiple kinds of input, such as text, images, audio, and video, and generate text or natural speech responses in real time. The model was released on March 26, 2025, and the code and model files are hosted on GitH...
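A minimal text-only round trip might look like the following. The class names follow the model card and assume a recent transformers release; treat them as assumptions if your version differs.

```python
# Text-only round trip with Qwen2.5-Omni via Hugging Face transformers.
# Class names and the return_audio flag follow the model card; they
# require a recent transformers release.
from transformers import (Qwen2_5OmniForConditionalGeneration,
                          Qwen2_5OmniProcessor)

model_id = "Qwen/Qwen2.5-Omni-7B"
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto")
processor = Qwen2_5OmniProcessor.from_pretrained(model_id)

messages = [{"role": "user",
             "content": [{"type": "text",
                          "text": "Summarize what omni-modal means."}]}]
text = processor.apply_chat_template(messages, tokenize=False,
                                     add_generation_prompt=True)
inputs = processor(text=text, return_tensors="pt").to(model.device)

# return_audio=False skips speech synthesis and yields text ids only.
out = model.generate(**inputs, max_new_tokens=128, return_audio=False)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```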
StarVector
StarVector is an open source project created by Juan A. Rodriguez and other developers to convert images and text into Scalable Vector Graphics (SVG). The tool uses a vision-language model that understands image content and textual instructions to generate high-quality SVG code. Its core ...
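An image-to-SVG call might look like the sketch below, which follows the pattern shown in StarVector's README. The repo id and the process_images / generate_im2svg helpers are assumptions taken from that README, not verified here.

```python
# Image-to-SVG sketch following the pattern in StarVector's README.
# The repo id and both helper methods are assumptions; verify against
# the project before relying on them.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

model_id = "starvector/starvector-1b-im2svg"  # assumed HF repo id
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True)
model.cuda().eval()

image = Image.open("icon.png").convert("RGB")
batch = {"image": model.process_images([image]).cuda()}   # assumed helper
svg_code = model.generate_im2svg(batch, max_length=4000)[0]  # assumed helper
print(svg_code)
```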
LaWGPT
LaWGPT is an open source project supported by the Machine Learning and Data Mining research group at Nanjing University, dedicated to building large language models grounded in Chinese legal knowledge. It extends the vocabulary with legal-domain terms on top of general-purpose Chinese base models (such as Chinese-LLaMA and ChatGLM) and, through large-scale...
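The vocabulary-extension step can be illustrated with the standard transformers recipe. The base model id and term list below are placeholders, not LaWGPT's actual artifacts.

```python
# Generic recipe for domain vocabulary extension, the technique LaWGPT
# applies to Chinese base models. Model id and terms are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "hfl/chinese-llama-2-7b"  # placeholder base model id
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

legal_terms = ["不当得利", "无因管理", "缔约过失责任"]  # sample legal terms
added = tokenizer.add_tokens(legal_terms)

# New rows in the embedding matrix are created for the added tokens;
# continued pre-training on legal corpora then learns their meanings.
model.resize_token_embeddings(len(tokenizer))
print(f"added {added} legal-domain tokens")
```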
Baichuan-Audio
Baichuan-Audio is an open source project developed by Baichuan Intelligence (baichuan-inc) and hosted on GitHub, focusing on end-to-end voice interaction technology. The project provides a complete audio processing framework that converts speech input into discrete audio tokens, which a large model then uses to generate the corresponding ...
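The tokenize, generate, resynthesize flow can be sketched generically. Everything below (tokenizer, lm, vocoder) is a hypothetical placeholder illustrating the end-to-end pattern, not Baichuan-Audio's API.

```python
# Conceptual end-to-end speech interaction loop. tokenizer, lm, and
# vocoder are hypothetical placeholders for Baichuan-Audio's actual
# components (audio tokenizer, large model, audio decoder).
import numpy as np

def speech_turn(wave_in: np.ndarray, tokenizer, lm, vocoder) -> np.ndarray:
    audio_tokens = tokenizer.encode(wave_in)   # speech -> discrete tokens
    reply_tokens = lm.generate(audio_tokens)   # LLM answers in token space
    return vocoder.decode(reply_tokens)        # tokens -> output waveform
```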
Step-Audio
Step-Audio is an open source intelligent speech interaction framework designed to provide out-of-the-box speech understanding and generation capabilities for production environments. The framework supports multilingual dialogue (e.g., Chinese, English, Japanese), emotional speech (e.g., happy, sad), regional dialects (e.g., Cantonese, Sichuanese), and adjustable speech rate...
DeepSeek-VL2
DeepSeek-VL2 is a series of advanced Mixture-of-Experts (MoE) vision-language models that significantly improve on the performance of their predecessor, DeepSeek-VL. The models excel at tasks such as visual question answering, optical character recognition, document/table/diagram comprehension, and visual grounding. De...
Hibiki
Hibiki is a high-fidelity, real-time speech translation model developed by Kyutai Labs. Unlike traditional offline translation, Hibiki generates natural spoken translation in the target language while the user is still speaking, and provides text translation as well. The model uses a multi-stream architecture that can simultaneously process the input speech...
VITA
VITA is a leading open source interactive multimodal large language model project, pioneering true full-modal interaction. The project launched VITA-1.0 in August 2024 as the first open source interactive omni-modal large language model. In December 2024, the project launched a major upgrade...
AnyText
AnyText is a multilingual visual text generation and editing tool built on a diffusion model. It generates natural, high-quality multilingual text within images and supports flexible text editing. Developed by a team of researchers, it received a Spotlight at the ICLR 2024 conference...
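Generating text inside an image might look like the sketch below, which follows the usage pattern in AnyText's README via ModelScope. The task name, model id, and input keys are assumptions taken from that README; verify them against the project.

```python
# Text-in-image generation via ModelScope, following the pattern in
# AnyText's README. Task name, model id, and input keys are
# assumptions; check the project repo for the current interface.
from modelscope.pipelines import pipeline

pipe = pipeline('my-anytext-task',
                model='damo/cv_anytext_text_generation_editing')

input_data = {
    "prompt": 'a photo of a mug with the words "Any" "Text" on it',
    "draw_pos": 'pos_mask.png',   # mask image marking where the text goes
}
results, rtn_code, rtn_warning, debug_info = pipe(
    input_data, mode='text-generation', image_count=1)
```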