■ Shows how to obtain a list of documents from an HTML string with the create_documents method of the RecursiveCharacterTextSplitter class.
▶ main.py
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import Language

codeString = """
<!DOCTYPE html>
<html>
    <head>
        <title>🦜️🔗 LangChain</title>
        <style>
            body {
                font-family: Arial, sans-serif;
            }
            h1 {
                color: darkblue;
            }
        </style>
    </head>
    <body>
        <div>
            <h1>🦜️🔗 LangChain</h1>
            <p>⚡ Building applications with LLMs through composability ⚡</p>
        </div>
        <div>
            As an open-source project in a rapidly developing field, we are extremely open to contributions.
        </div>
    </body>
</html>
"""

# HTML-aware splitter: prefers splitting at tag boundaries, with chunks of at most 60 characters.
recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter.from_language(language = Language.HTML, chunk_size = 60, chunk_overlap = 0)

# Split the HTML string into a list of Document objects.
documentList = recursiveCharacterTextSplitter.create_documents([codeString])

print(documentList)
"""
[Document(page_content='<!DOCTYPE html>\n<html>'),
 Document(page_content='<head>\n        <title>🦜️🔗 LangChain</title>'),
 Document(page_content='<style>\n            body {\n                font-family: Aria'),
 Document(page_content='l, sans-serif;\n            }\n            h1 {'),
 Document(page_content='color: darkblue;\n            }\n        </style>\n    </head'),
 Document(page_content='>'),
 Document(page_content='<body>'),
 Document(page_content='<div>\n            <h1>🦜️🔗 LangChain</h1>'),
 Document(page_content='<p>⚡ Building applications with LLMs through composability ⚡'),
 Document(page_content='</p>\n        </div>'),
 Document(page_content='<div>\n            As an open-source project in a rapidly dev'),
 Document(page_content='eloping field, we are extremely open to contributions.'),
 Document(page_content='</div>\n    </body>\n</html>')]
"""
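For reference, the HTML-specific separator list that from_language(Language.HTML) applies can be inspected with the get_separators_for_language static method. The snippet below is a minimal sketch of this; the splitter tries these separators in order, which is why the chunks above break at tags such as <div>, <h1>, and <p> before falling back to character-level splits. The variable name separatorList is chosen here only for illustration.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import Language

# Print the tag-based separators used for Language.HTML (e.g. '<body', '<div', '<p', ...).
separatorList = RecursiveCharacterTextSplitter.get_separators_for_language(Language.HTML)

print(separatorList)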
▶ requirements.txt
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
async-timeout==4.0.3
attrs==23.2.0
certifi==2024.6.2
charset-normalizer==3.3.2
frozenlist==1.4.1
greenlet==3.0.3
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.2.6
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
multidict==6.0.5
numpy==1.26.4
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
SQLAlchemy==2.0.31
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
yarl==1.9.4
※ The command pip install langchain langchain-text-splitters was run.
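※ Since langchain-text-splitters is installed as its own package (see requirements.txt above), the same classes can also be imported from langchain_text_splitters directly. The sketch below assumes langchain-text-splitters 0.2.x and is equivalent to the imports used in main.py.

from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

# Same splitter configuration as in main.py, using the standalone package's import path.
recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter.from_language(language = Language.HTML, chunk_size = 60, chunk_overlap = 0)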