
magic-html: extract main body content from HTML, output plain text or Markdown

2024-09-01

General Introduction

magic-html is a Python library designed to simplify extracting the main body content from HTML. Whether the page structure is complex or simple, the library aims to provide a convenient, efficient interface. It supports multimodal extraction and multiple layout extractors (articles, forums, and WeChat articles), and it can also extract and convert LaTeX formulas.

Function List

  • Extract HTML body area content
  • Support for multimodal extraction
  • Support for article, forum, and WeChat article layouts
  • Support for LaTeX formula extraction and conversion
  • Customizable output in plain text or Markdown format

 

Usage Guide

Installation

To install magic-html, use the pip command:

pip install magic-html
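
To confirm that the installation worked before running the examples, a quick check like the following can help. This is a minimal sketch that uses only the Python standard library; the helper name is_installed is made up for illustration, and the import name magic_html is taken from the usage code in this article.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# Note the difference: the pip package is "magic-html",
# while the importable module is "magic_html".
if is_installed("magic_html"):
    print("magic_html is importable")
else:
    print("magic_html not found; run: pip install magic-html")
```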

Usage

Once installed, it can be used with the following code:

from magic_html import GeneralExtractor

# Initialize the extractor
extractor = GeneralExtractor()

# Sample HTML content
html = """
<!doctype html>
<html>
<head>
    <title>Example Domain</title>
    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
</head>
<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
"""

# Extract the data
data = extractor.extract(html)
print(data)
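
The example above embeds the HTML in a string literal. In practice the HTML often comes from a saved page, so a small loader may be useful. The sketch below assumes a locally saved file; the function name load_html and the file name page.html are hypothetical, and nothing here is part of magic-html itself.

```python
from pathlib import Path


def load_html(path: str, encoding: str = "utf-8") -> str:
    """Read a saved HTML file into a string.

    Undecodable bytes are replaced rather than raising an error, since
    saved pages do not always match their declared encoding.
    """
    return Path(path).read_text(encoding=encoding, errors="replace")


# Hypothetical usage, assuming a file named page.html exists:
# html = load_html("page.html")
# data = extractor.extract(html)
```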

Workflow

  1. Initialize the extractor: Import GeneralExtractor from the magic_html package and create an instance.
  2. Prepare the HTML content: Provide the HTML to extract from as a string.
  3. Call the extraction method: Use the extract method to extract the body content. Different HTML types, such as articles, forums, or WeChat articles, can be specified as needed.
  4. Output the result: The extraction result can be in plain text or Markdown format, depending on the user's needs.
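
The steps above can be wrapped in a small helper, sketched below. The function name extract_body is hypothetical. The keyword name html_type used to select a layout is also an assumption: this article does not confirm it, so check the library's own documentation before relying on it. The helper accepts any object with an extract method, which allows it to be tried out without the library installed.

```python
from typing import Any, Optional


def extract_body(extractor: Any, html: str, html_type: Optional[str] = None) -> Any:
    """Pass an HTML string to an already-initialized extractor (steps 2 to 4).

    `extractor` is any object with an `extract` method, for example a
    GeneralExtractor instance created in step 1.
    """
    if not html.strip():
        raise ValueError("html must be a non-empty string")
    if html_type is None:
        # Same call as in the usage example: extractor.extract(html)
        return extractor.extract(html)
    # ASSUMPTION: the layout is selected with a keyword named `html_type`.
    return extractor.extract(html, html_type=html_type)
```

With the library installed, a call might look like extract_body(GeneralExtractor(), html). Whatever the extractor returns is passed through unchanged.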

