综合介绍 OneFileLLM 是一个开源命令行工具,旨在将多种数据源整合成单一文本文件,方便输入大语言模型(LLM)。它支持处理 GitHub 仓库、ArXiv 论文、YouTube 视频转录、网页内容、Sci-Hub 论文和本地文件,自...
综合介绍 Chatlog 是一个开源工具,专注于从微信本地数据库提取和查询聊天记录。它支持微信 3.x 和 4.0 版本,覆盖 Windows 和 macOS 系统。用户可以通过命令行、终端界面或 HTTP API 操作,查看聊天记录、联系...
Comprehensive Introduction Versatile OCR Program is an open source Optical Character Recognition (OCR) tool designed specifically for processing complex academic and educational documents. It can extract text, tables, mathematical formulas, diagrams and schematics from PDF, images and other documents and generate structured suitable for machine learning training...
General Introduction DevDocs is a completely free open source tool developed by the CyberAGI team and hosted on GitHub. It is designed for programmers and software developers to start from the URL of a technical document, automatically crawl the relevant pages and organize them into concise Markdown or JSON files. It has a built-in...
Comprehensive Introduction It can automatically analyze the layout of PDF documents, identify text, titles, images, tables, formulas and other elements in the page, and determine their correct order. The tool supports OCR functionality , you can convert scanned PDF to searchable text. It runs on Docker , provides two models: visual model (Vis...
综合介绍 serverless-markdown-convertor 是一个免费的开源工具,基于 Cloudflare Worker 和 Workers AI 开发,能将多种文件转换为 Markdown 格式。它支持 PDF、图片、Offi...
综合介绍 GPT-Crawler 是由 BuilderIO 团队开发的一个开源工具,托管在 GitHub 上。它通过输入一个或多个网站 URL,爬取页面内容,生成结构化的知识文件(output.json),用于创建自定义 GPT 或 AI ...
General Introduction pure.md is a tool designed for AI agents and developers that focuses on quickly converting web content or files to Markdown format. It bypasses anti-crawler restrictions through proxy services, extracts the core data of a web page, and outputs a concise Markdown file. Whether it's a dynamic web page, PDF file...
General Introduction Cloudsquid is a company founded in 2023 in Berlin, Germany, focused on simplifying document processing with artificial intelligence. Its core product is an online data extraction platform that allows users to upload PDFs, images, audio, video, etc., and simply state what data needs to be extracted, e.g., "Find...
General Introduction PDF Craft is an open source tool designed for scanning PDFs of books and converting them to Markdown format. It is developed by oomol-lab and hosted on GitHub for users who like to organize their eBooks. The tool runs through a local AI model without the need for an Internet connection, which is both privacy-preserving and square...
Comprehensive Introduction Supametas.AI is a data processing platform that specializes in organizing web pages, documents, audio and video, and other cluttered information into structured data that AI can use. It supports collecting data from multiple sources, including web links, APIs, local files, etc., and then exporting it to JSON or Markdown format. Platform...
综合介绍 MarkPDFDown 是一个开源工具。它利用多模态大语言模型,把 PDF 文件转为 Markdown 格式。开发者是 GitHub 用户 jorben。这个工具的目标很简单:让 PDF 文档变得更易编辑和分享。它能识别文档中的标...
Comprehensive Introduction SmolDocling is a Visual Language Model (VLM) developed by the ds4sd team in collaboration with IBM, based on SmolVLM-256M and hosted on the Hugging Face platform. It is the world's smallest VLM with only 256M parameters....
The goal of table recognition is to parse tables in images, accurately identify table structures and cell locations, and reduce them to structured table formats (e.g., HTML). In today's information age, a large amount of important tabular data still exists in an unstructured state (e.g., pictures of information statistics in scanned documents, pd...
In the long history of human civilization, every leap in the way information is acquired and parsed has profoundly driven social progress. From the ancient hieroglyphics, to the portable papyrus, to the later emergence of the printing press and today's wave of digitization, each technological innovation has greatly expanded the transmission of human knowledge...
综合介绍 Firecrawl MCP Server 是由 MendableAI 开发的一款开源工具,基于 Model Context Protocol (MCP) 协议实现,与 Firecrawl API 集成,提供强大的网页抓取和数据提取...
综合介绍 olmOCR 是由 Allen Institute for Artificial Intelligence (AI2) 的 AllenNLP 团队开发的一款开源工具,专注于将 PDF 文件转换为线性化文本,特别适合用于大规模语言模...
综合介绍 par_scrape 是一个基于 Python 的开源网页爬虫工具,由开发者 Paul Robello 在 GitHub 上推出,旨在帮助用户从网页中智能提取数据。它整合了 Selenium 和 Playwright 两种强大的浏...
综合介绍 PDF-Extract-Kit 是一个由 OpenDataLab 团队开发的开源项目,专注于从复杂多样的 PDF 文档中高效提取高质量内容。它集成了先进的文档解析技术,支持布局检测、公式识别、表格提取和 OCR 等功能,适用于.....