
E2M: Convert Multiple File Formats to Markdown for Consistent Document Formatting

2024-12-11

General Introduction

E2M (Everything to Markdown) is an open-source Python library that converts a wide range of file formats to Markdown. Supported types include doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a. E2M uses a parser-converter architecture: each format has a dedicated parser and converter, where the parser extracts text and images from the file and the converter turns the extracted content into Markdown. Flexible configuration options make the output suitable for retrieval-augmented generation (RAG) and for model training or fine-tuning. The project's goal is to provide high-quality data conversion that simplifies unifying document formats.
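
The two-stage design described above can be illustrated with a toy sketch. The classes below (``ParsedData``, ``ToyParser``, ``ToyConverter``) and the ``[image: ...]`` marker are hypothetical and are not part of E2M; they only show how a parser stage that extracts text and images can hand its result to a converter stage that emits Markdown.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedData:
    """Hypothetical container for what a parser extracts from a file."""
    text: str
    images: list[str] = field(default_factory=list)  # image file names


class ToyParser:
    """Stage 1 (illustrative only): separate text from image references."""

    def parse(self, raw: str) -> ParsedData:
        text_lines, images = [], []
        for line in raw.splitlines():
            # Lines such as "[image: chart.png]" stand in for embedded images.
            if line.startswith("[image:") and line.endswith("]"):
                images.append(line[len("[image:"):-1].strip())
            else:
                text_lines.append(line)
        return ParsedData(text="\n".join(text_lines), images=images)


class ToyConverter:
    """Stage 2 (illustrative only): turn parsed content into Markdown."""

    def convert(self, data: ParsedData) -> str:
        lines = data.text.splitlines()
        parts = []
        if lines:
            # The first line becomes the title; the rest stays as body text.
            parts.append(f"# {lines[0]}")
            parts.extend(lines[1:])
        parts.extend(f"![{name}]({name})" for name in data.images)
        return "\n\n".join(part for part in parts if part.strip())


raw = "Quarterly Report\nRevenue grew.\n[image: chart.png]"
markdown = ToyConverter().convert(ToyParser().parse(raw))
print(markdown)
```

Keeping the stages separate means a different parser (for example, one per file type) can feed the same converter, which is presumably why E2M pairs each format with its own parser and converter.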


 

Function List

  • File parsing: Parses multiple file types and extracts both text and image data.
  • Format conversion: Converts the parsed data into Markdown.
  • Multiple parsers and converters: Offers parsers and converters backed by different engines and strategies.
  • Open source and flexible configuration: The source code is open, and the configuration options can be customized by the user.
  • API service: Provides an API service for easy integration into other applications.

 

Usage Guide

Installation process

  1. Create the environment::

       conda create -n e2m python=3.10
       conda activate e2m

  2. Update pip::

       pip install --upgrade pip

  3. Install E2M using one of the following methods:

     • Install via git (recommended)::

         pip install git+https://github.com/wisupai/e2m.git --index-url https://pypi.org/simple

     • Install via pip::

         pip install --upgrade wisup_e2m

     • Manual installation::

         git clone https://github.com/wisupai/e2m.git
         cd e2m
         pip install poetry
         poetry build
         pip install dist/wisup_e2m-0.1.63-py3-none-any.whl
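
After installing, a quick check from Python can confirm that the package is visible to the active interpreter. This is a minimal sketch that assumes the importable module name is ``wisup_e2m``, as the import statements later in this article suggest; the helper name ``is_installed`` is made up for illustration.

```python
import importlib.util


def is_installed(module_name: str = "wisup_e2m") -> bool:
    """Return True if a top-level module can be found by the current interpreter."""
    # find_spec returns None when a top-level module is not importable.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("E2M installed:", is_installed())
```

If this prints ``False`` inside the ``e2m`` conda environment, the most likely cause is that pip installed into a different environment than the one currently activated.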

Usage

  1. Start the API service::

       gunicorn wisup_e2m.api.main:app --workers 4 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000

  2. View the API documentation: open http://127.0.0.1:8000/docs in your browser to see the API documentation and usage examples.
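
To confirm from a script that the service has started, one option is to request the documentation page and check for a successful response. The sketch below uses only the standard library; the helper name ``service_is_up`` and the 3-second default timeout are assumptions chosen for illustration, and the URL is the one given above.

```python
import urllib.error
import urllib.request


def service_is_up(url: str = "http://127.0.0.1:8000/docs", timeout: float = 3.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("API reachable:", service_is_up())
```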

Main Workflow

  1. File parsing and conversion:

     • Parse the contents of the file with a parser::

         from wisup_e2m.parsers import PdfParser

         # Extract the content of the PDF.
         parser = PdfParser()
         text_data = parser.parse('example.pdf')

     • Convert the parsed content to Markdown with a converter::

         from wisup_e2m.converters import TextConverter

         # Turn the extracted content into Markdown.
         converter = TextConverter()
         markdown_data = converter.convert(text_data)

  2. Customized configuration:

     • Edit the configuration file config.yaml to adjust the parser and converter parameters as needed::

         parsers:
           pdf:
             engine: 'unstructured'
         converters:
           text:
             engine: 'litellm'

  3. Integration into other applications:

     • Use the API service to integrate E2M into other applications by sending HTTP requests for file parsing and conversion::

         import requests

         # Upload the file; the with-block closes it once the request completes.
         with open('example.pdf', 'rb') as f:
             response = requests.post('http://127.0.0.1:8000/convert', files={'file': f})
         markdown_data = response.text
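
The API call above can be wrapped in a small helper that also checks for HTTP errors and saves the result next to the input file. This is a sketch under assumptions: the ``/convert`` endpoint and the plain-text response are taken from the example above and should be verified against the ``/docs`` page of a running service. The helper names ``markdown_path_for`` and ``convert_via_api`` and the 120-second timeout are made up for illustration.

```python
from pathlib import Path


def markdown_path_for(source: str) -> Path:
    """Map an input file path to a sibling path with a .md suffix."""
    return Path(source).with_suffix(".md")


def convert_via_api(source: str, endpoint: str = "http://127.0.0.1:8000/convert") -> Path:
    """Upload a file to the conversion endpoint and save the response as Markdown."""
    # Third-party dependency, imported here so the path helper works without it.
    import requests

    with open(source, "rb") as handle:
        response = requests.post(endpoint, files={"file": handle}, timeout=120)
    # Fail loudly on 4xx/5xx instead of saving an error page as Markdown.
    response.raise_for_status()
    target = markdown_path_for(source)
    target.write_text(response.text, encoding="utf-8")
    return target
```

With the service running, ``convert_via_api("example.pdf")`` would write ``example.md`` in the same directory, provided the endpoint behaves as the example above implies.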
