[PYTHON/LANGCHAIN] HTMLHeaderTextSplitter class: Getting a list of documents from an HTML string using the constructor's return_each_element argument

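■ Shows how to get a list of documents from an HTML string by passing the return_each_element argument to the HTMLHeaderTextSplitter constructor. With return_each_element set to True, each HTML element is returned as its own document together with its associated headers.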

▶ main.py


from langchain_text_splitters import HTMLHeaderTextSplitter

htmlString = """
<!DOCTYPE html>
<html>
    <body>
        <div>
            <h1>Foo</h1>
            <p>Some intro text about Foo.</p>
            <div>
                <h2>Bar main section</h2>
                <p>Some intro text about Bar.</p>
                <h3>Bar subsection 1</h3>
                <p>Some text about the first subtopic of Bar.</p>
                <h3>Bar subsection 2</h3>
                <p>Some text about the second subtopic of Bar.</p>
            </div>
            <div>
                <h2>Baz</h2>
                <p>Some text about Baz</p>
            </div>
            <br>
            <p>Some concluding text about Foo</p>
        </div>
    </body>
</html>
"""

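# (HTML tag, metadata key) pairs: the headers to split on and the key each header is recorded under.
headerTupleListToSplitOn = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

# return_each_element = True returns each HTML element as a separate document with its associated headers.
htmlHeaderTextSplitter = HTMLHeaderTextSplitter(headerTupleListToSplitOn, return_each_element = True)

# Split the HTML string into a list of Document objects.
documentList = htmlHeaderTextSplitter.split_text(htmlString)

print(documentList)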

"""
[
    Document(page_content='Foo'),
    Document(page_content='Some intro text about Foo.', metadata={'Header 1': 'Foo'}),
    Document(page_content='Bar main section Bar subsection 1 Bar subsection 2', metadata={'Header 1': 'Foo'}),
    Document(page_content='Some intro text about Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}),
    Document(page_content='Some text about the first subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}),
    Document(page_content='Some text about the second subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}),
    Document(page_content='Baz', metadata={'Header 1': 'Foo'}),
    Document(page_content='Some text about Baz', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}),
    Document(page_content='Some concluding text about Foo', metadata={'Header 1': 'Foo'})
]
"""

▶ requirements.txt


annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
lxml==5.2.2
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2

※ These packages were installed by running the pip install langchain-text-splitters lxml command.
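
To confirm that the environment matches the pinned versions, the installed versions can be read with the standard library's importlib.metadata. This is only a sketch: installed_version and PINNED are hypothetical names introduced for illustration, and only the two directly installed packages are checked.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Versions pinned in requirements.txt for the two packages installed directly.
PINNED = {"langchain-text-splitters": "0.2.2", "lxml": "5.2.2"}

for name, pinned in PINNED.items():
    print(name, "pinned:", pinned, "installed:", installed_version(name))
```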