[PYTHON/LANGCHAIN] SemanticChunker class : getting a document list based on semantic similarity using the create_documents method
■ Shows how to use the create_documents method of the SemanticChunker class to get a document list split on semantic similarity. ※ The OPENAI_API_KEY environment variable value is defined in the .env file.
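The SemanticChunker example described by the title and the note above is not shown in this post; below is a minimal sketch (not the original post's code) of how the create_documents method could be used, assuming the langchain_experimental, langchain-openai, and python-dotenv packages and a state_of_the_union.txt input file. ▶ main.py

from dotenv import load_dotenv

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

load_dotenv() # loads the OPENAI_API_KEY environment variable from the .env file

with open("state_of_the_union.txt") as textIOWrapper:
    fileContent = textIOWrapper.read()

# Chunk boundaries are placed where the embedding distance between adjacent sentences is large.
semanticChunker = SemanticChunker(OpenAIEmbeddings())

documentList = semanticChunker.create_documents([fileContent])

for document in documentList[:3]:
    print(document.page_content[:100])
    print()

※ Assumed install for this sketch : pip install langchain_experimental langchain-openai python-dotenv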
■ Shows how to use the convert_lists argument of the split_json method of the RecursiveJsonSplitter class so that lists are also converted and included in the split. ▶ main.py
import requests

from langchain_text_splitters import RecursiveJsonSplitter

response = requests.get("https://api.smith.langchain.com/openapi.json")

jsonDictionary = response.json()

recursiveJsonSplitter = RecursiveJsonSplitter(max_chunk_size = 300)

jsonDictionaryList = recursiveJsonSplitter.split_json(json_data = jsonDictionary, convert_lists = True)

for splitDictionary in jsonDictionaryList[:3]:
    print(splitDictionary)
    print()

"""
{'openapi': '3.1.0', 'info': {'title': 'LangSmith', 'version': '0.1.0'}, 'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': {'0': 'tracer-sessions'}, 'summary': 'Read Tracer Session', 'description': 'Get a specific session.'}}}}

{'paths': {'/api/v1/sessions/{session_id}': {'get': {'operationId': 'read_tracer_session_api_v1_sessions__session_id__get', 'security': {'0': {'API Key': {}}, '1': {'Tenant ID': {}}, '2': {'Bearer Auth': {}}}}}}}

{'paths': {'/api/v1/sessions/{session_id}': {'get': {'parameters': {'0': {'name': 'session_id', 'in': 'path', 'required': True, 'schema': {'type': 'string', 'format': 'uuid', 'title': 'Session Id'}}}}}}}
"""
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters
■ Shows how to use the create_documents method of the RecursiveJsonSplitter class to get a document list from a JSON dictionary. ▶ main.py
import requests

from langchain_text_splitters import RecursiveJsonSplitter

response = requests.get("https://api.smith.langchain.com/openapi.json")

jsonDictionary = response.json()

recursiveJsonSplitter = RecursiveJsonSplitter(max_chunk_size = 300)

documentList = recursiveJsonSplitter.create_documents(texts = [jsonDictionary])

for document in documentList[:3]:
    print(document)
    print()

"""
page_content='{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session."}}}}'

page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"operationId": "read_tracer_session_api_v1_sessions__session_id__get", "security": [{"API Key": []}, {"Tenant ID": []}, {"Bearer Auth": []}]}}}}'

page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "include_stats", "in": "query", "required": false, "schema": {"type": "boolean", "default": false, "title": "Include Stats"}}, {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}}]}}}}'
"""
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters
■ Shows how to use the create_documents method of the RecursiveCharacterTextSplitter class to get a document list from a C# source code string. ▶ main.py
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import Language

codeString = """
using System;
class Program
{
    static void Main()
    {
        int age = 30; // Change the age value as needed

        // Categorize the age without any console output
        if (age < 18)
        {
            // Age is under 18
        }
        else if (age >= 18 && age < 65)
        {
            // Age is an adult
        }
        else
        {
            // Age is a senior citizen
        }
    }
}
"""

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter.from_language(language = Language.CSHARP, chunk_size = 128, chunk_overlap = 0)

documentList = recursiveCharacterTextSplitter.create_documents([codeString])

print(documentList)

"""
[Document(page_content='using System;'), Document(page_content='class Program\n{\n    static void Main()\n    {\n        int age = 30; // Change the age value as needed'), Document(page_content='// Categorize the age without any console output\n        if (age < 18)\n        {\n            // Age is under 18'), Document(page_content='}\n        else if (age >= 18 && age < 65)\n        {\n            // Age is an adult\n        }\n        else\n        {'), Document(page_content='// Age is a senior citizen\n        }\n    }\n}')]
"""
▶ requirements.txt
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
async-timeout==4.0.3
attrs==23.2.0
certifi==2024.6.2
charset-normalizer==3.3.2
frozenlist==1.4.1
greenlet==3.0.3
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.2.6
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
multidict==6.0.5
numpy==1.26.4
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
SQLAlchemy==2.0.31
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
yarl==1.9.4
■ Shows how to use the split_json method of the RecursiveJsonSplitter class to get a list of JSON dictionaries split from a JSON dictionary. ▶ main.py
import requests

from langchain_text_splitters import RecursiveJsonSplitter

response = requests.get("https://api.smith.langchain.com/openapi.json")

jsonDictionary = response.json()

recursiveJsonSplitter = RecursiveJsonSplitter(max_chunk_size = 300)

jsonDictionaryList = recursiveJsonSplitter.split_json(json_data = jsonDictionary)

for splitDictionary in jsonDictionaryList[:3]:
    print(splitDictionary)
    print()

"""
{'openapi': '3.1.0', 'info': {'title': 'LangSmith', 'version': '0.1.0'}, 'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions'], 'summary': 'Read Tracer Session', 'description': 'Get a specific session.'}}}}

{'paths': {'/api/v1/sessions/{session_id}': {'get': {'operationId': 'read_tracer_session_api_v1_sessions__session_id__get', 'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}]}}}}

{'paths': {'/api/v1/sessions/{session_id}': {'get': {'parameters': [{'name': 'session_id', 'in': 'path', 'required': True, 'schema': {'type': 'string', 'format': 'uuid', 'title': 'Session Id'}}, {'name': 'include_stats', 'in': 'query', 'required': False, 'schema': {'type': 'boolean', 'default': False, 'title': 'Include Stats'}}, {'name': 'accept', 'in': 'header', 'required': False, 'schema': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'title': 'Accept'}}]}}}}
"""
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters
■ Shows how to use a RecursiveCharacterTextSplitter object to constrain the chunk size of the document list produced by the MarkdownHeaderTextSplitter class. ▶ main.py
from langchain_text_splitters import MarkdownHeaderTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

codeString = "# Foo\n\n ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"

headerListToSplitOn = [
    ("#"  , "Header 1"),
    ("##" , "Header 2"),
    ("###", "Header 3")
]

markdownHeaderTextSplitter = MarkdownHeaderTextSplitter(headerListToSplitOn, return_each_line = True)

documentList1 = markdownHeaderTextSplitter.split_text(codeString)

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter(chunk_size = 260, chunk_overlap = 30)

documentList2 = recursiveCharacterTextSplitter.split_documents(documentList1)

for document in documentList2:
    print(document)

"""
page_content='Hi this is Jim' metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}
page_content='Hi this is Joe' metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}
page_content='Hi this is Lance' metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}
page_content='Hi this is Molly' metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}
"""
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters
■ Shows how to use the return_each_line argument of the MarkdownHeaderTextSplitter class constructor to get a document list with one document per markdown line. ▶ main.py
from langchain_text_splitters import MarkdownHeaderTextSplitter

codeString = "# Foo\n\n ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"

headerListToSplitOn = [
    ("#"  , "Header 1"),
    ("##" , "Header 2"),
    ("###", "Header 3")
]

markdownHeaderTextSplitter = MarkdownHeaderTextSplitter(headerListToSplitOn, return_each_line = True)

documentList = markdownHeaderTextSplitter.split_text(codeString)

for document in documentList:
    print(document)

"""
page_content='Hi this is Jim' metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}
page_content='Hi this is Joe' metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}
page_content='Hi this is Lance' metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}
page_content='Hi this is Molly' metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}
"""
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters
■ Shows how to use the split_text method of the MarkdownHeaderTextSplitter class to get a document list from a markdown string. ▶ main.py
from langchain_text_splitters import MarkdownHeaderTextSplitter

codeString = "# Foo\n\n ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"

headerListToSplitOn = [
    ("#"  , "Header 1"),
    ("##" , "Header 2"),
    ("###", "Header 3")
]

markdownHeaderTextSplitter = MarkdownHeaderTextSplitter(headerListToSplitOn)

documentList = markdownHeaderTextSplitter.split_text(codeString)

for document in documentList:
    print(document)

"""
page_content='Hi this is Jim \nHi this is Joe' metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}
page_content='Hi this is Lance' metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}
page_content='Hi this is Molly' metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}
"""
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters
■ Shows how to use the create_documents method of the RecursiveCharacterTextSplitter class to get a document list from a PHP source code string. ▶ main.py
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import Language

codeString = """
<?php
namespace foo;
class Hello {
    public function __construct() { }
}
function hello() {
    echo "Hello World!";
}
interface Human {
    public function breath();
}
trait Foo { }
enum Color
{
    case Red;
    case Blue;
}
"""

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter.from_language(language = Language.PHP, chunk_size = 50, chunk_overlap = 0)

documentList = recursiveCharacterTextSplitter.create_documents([codeString])

print(documentList)

"""
[Document(page_content='<?php\nnamespace foo;'), Document(page_content='class Hello {'), Document(page_content='public function __construct() { }\n}'), Document(page_content='function hello() {\n    echo "Hello World!";\n}'), Document(page_content='interface Human {\n    public function breath();\n}'), Document(page_content='trait Foo { }\nenum Color\n{\n    case Red;'), Document(page_content='case Blue;\n}')]
"""
▶ requirements.txt
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
async-timeout==4.0.3
attrs==23.2.0
certifi==2024.6.2
charset-normalizer==3.3.2
frozenlist==1.4.1
greenlet==3.0.3
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.2.6
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
multidict==6.0.5
numpy==1.26.4
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
SQLAlchemy==2.0.31
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
yarl==1.9.4
■ Shows how to use the create_documents method of the RecursiveCharacterTextSplitter class to get a document list from a Haskell source code string. ▶ main.py
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import Language

codeString = """
main :: IO ()
main = do
  putStrLn "Hello, World!"
-- Some sample functions
add :: Int -> Int -> Int
add x y = x + y
"""

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter.from_language(language = Language.HASKELL, chunk_size = 50, chunk_overlap = 0)

documentList = recursiveCharacterTextSplitter.create_documents([codeString])

print(documentList)

"""
[Document(page_content='main :: IO ()'), Document(page_content='main = do\n  putStrLn "Hello, World!"\n-- Some'), Document(page_content='sample functions\nadd :: Int -> Int -> Int\nadd x y'), Document(page_content='= x + y')]
"""
▶ requirements.txt
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
async-timeout==4.0.3
attrs==23.2.0
certifi==2024.6.2
charset-normalizer==3.3.2
frozenlist==1.4.1
greenlet==3.0.3
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.2.6
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
multidict==6.0.5
numpy==1.26.4
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
SQLAlchemy==2.0.31
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
yarl==1.9.4
■ Shows how to use the create_documents method of the RecursiveCharacterTextSplitter class to get a document list from a Solidity source code string. ▶ main.py
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import Language

codeString = """
pragma solidity ^0.8.20;
contract HelloWorld {
   function add(uint a, uint b) pure public returns(uint) {
       return a + b;
   }
}
"""

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter.from_language(language = Language.SOL, chunk_size = 128, chunk_overlap = 0)

documentList = recursiveCharacterTextSplitter.create_documents([codeString])

print(documentList)

"""
[Document(page_content='pragma solidity ^0.8.20;'), Document(page_content='contract HelloWorld {\n   function add(uint a, uint b) pure public returns(uint) {\n       return a + b;\n   }\n}')]
"""
▶ requirements.txt
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
async-timeout==4.0.3
attrs==23.2.0
certifi==2024.6.2
charset-normalizer==3.3.2
frozenlist==1.4.1
greenlet==3.0.3
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.2.6
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
multidict==6.0.5
numpy==1.26.4
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
SQLAlchemy==2.0.31
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
yarl==1.9.4
■ Shows how to use the create_documents method of the RecursiveCharacterTextSplitter class to get a document list from an HTML string. ▶ main.py
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import Language

codeString = """
<!DOCTYPE html>
<html>
    <head>
        <title>🦜️🔗 LangChain</title>
        <style>
            body {
                font-family: Arial, sans-serif;
            }
            h1 {
                color: darkblue;
            }
        </style>
    </head>
    <body>
        <div>
            <h1>🦜️🔗 LangChain</h1>
            <p>⚡ Building applications with LLMs through composability ⚡</p>
        </div>
        <div>
            As an open-source project in a rapidly developing field, we are extremely open to contributions.
        </div>
    </body>
</html>
"""

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter.from_language(language = Language.HTML, chunk_size = 60, chunk_overlap = 0)

documentList = recursiveCharacterTextSplitter.create_documents([codeString])

print(documentList)

"""
[Document(page_content='<!DOCTYPE html>\n<html>'), Document(page_content='<head>\n        <title>🦜️🔗 LangChain</title>'), Document(page_content='<style>\n            body {\n                font-family: Aria'), Document(page_content='l, sans-serif;\n            }\n            h1 {'), Document(page_content='color: darkblue;\n            }\n        </style>\n    </head'), Document(page_content='>'), Document(page_content='<body>'), Document(page_content='<div>\n            <h1>🦜️🔗 LangChain</h1>'), Document(page_content='<p>⚡ Building applications with LLMs through composability ⚡'), Document(page_content='</p>\n        </div>'), Document(page_content='<div>\n            As an open-source project in a rapidly dev'), Document(page_content='eloping field, we are extremely open to contributions.'), Document(page_content='</div>\n    </body>\n</html>')]
"""
▶ requirements.txt
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
async-timeout==4.0.3
attrs==23.2.0
certifi==2024.6.2
charset-normalizer==3.3.2
frozenlist==1.4.1
greenlet==3.0.3
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.2.6
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
multidict==6.0.5
numpy==1.26.4
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
SQLAlchemy==2.0.31
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
yarl==1.9.4
※ pip install langchain
■ Shows how to use the create_documents method of the RecursiveCharacterTextSplitter class to get a document list from a LaTeX string. ▶ main.py
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import Language

codeString = """
\documentclass{article}

\begin{document}

\maketitle

\section{Introduction}
Large language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in a variety of natural language processing tasks, including language translation, text generation, and sentiment analysis.

\subsection{History of LLMs}
The earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.

\subsection{Applications of LLMs}
LLMs have many applications in industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for research in linguistics, psychology, and computational linguistics.

\end{document}
"""

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter.from_language(language = Language.LATEX, chunk_size = 60, chunk_overlap = 0)

documentList = recursiveCharacterTextSplitter.create_documents([codeString])

print(documentList)

"""
[Document(page_content='\\documentclass{article}\n\n\x08egin{document}\n\n\\maketitle'), Document(page_content='\\section{Introduction}\nLarge language models (LLMs) are a'), Document(page_content='type of machine learning model that can be trained on vast'), Document(page_content='amounts of text data to generate human-like language. In'), Document(page_content='recent years, LLMs have made significant advances in a'), Document(page_content='variety of natural language processing tasks, including'), Document(page_content='language translation, text generation, and sentiment'), Document(page_content='analysis.'), Document(page_content='\\subsection{History of LLMs}\nThe earliest LLMs were'), Document(page_content='developed in the 1980s and 1990s, but they were limited by'), Document(page_content='the amount of data that could be processed and the'), Document(page_content='computational power available at the time. In the past'), Document(page_content='decade, however, advances in hardware and software have'), Document(page_content='made it possible to train LLMs on massive datasets, leading'), Document(page_content='to significant improvements in performance.'), Document(page_content='\\subsection{Applications of LLMs}\nLLMs have many'), Document(page_content='applications in industry, including chatbots, content'), Document(page_content='creation, and virtual assistants. They can also be used in'), Document(page_content='academia for research in linguistics, psychology, and'), Document(page_content='computational linguistics.\n\n\\end{document}')]
"""
▶ requirements.txt
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
async-timeout==4.0.3
attrs==23.2.0
certifi==2024.6.2
charset-normalizer==3.3.2
frozenlist==1.4.1
greenlet==3.0.3
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.2.6
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
multidict==6.0.5
numpy==1.26.4
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
SQLAlchemy==2.0.31
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
yarl==1.9.4
※ pip install langchain
■ Shows how to use the create_documents method of the RecursiveCharacterTextSplitter class to get a document list from a markdown string. ▶ main.py
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import Language

codeString = """
# 🦜️🔗 LangChain

⚡ Building applications with LLMs through composability ⚡

## Quick Install

```bash
# Hopefully this code block isn't split
pip install langchain
```

As an open-source project in a rapidly developing field, we are extremely open to contributions.
"""

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter.from_language(language = Language.MARKDOWN, chunk_size = 60, chunk_overlap = 0)

documentList = recursiveCharacterTextSplitter.create_documents([codeString])

print(documentList)

"""
[Document(page_content='# 🦜️🔗 LangChain'), Document(page_content='⚡ Building applications with LLMs through composability ⚡'), Document(page_content='## Quick Install\n\n```bash'), Document(page_content="# Hopefully this code block isn't split"), Document(page_content='pip install langchain'), Document(page_content='```'), Document(page_content='As an open-source project in a rapidly developing field, we'), Document(page_content='are extremely open to contributions.')]
"""
▶ requirements.txt
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
async-timeout==4.0.3
attrs==23.2.0
certifi==2024.6.2
charset-normalizer==3.3.2
frozenlist==1.4.1
greenlet==3.0.3
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.2.6
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
multidict==6.0.5
numpy==1.26.4
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
SQLAlchemy==2.0.31
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
yarl==1.9.4
※ pip install langchain
■ Shows how to use the create_documents method of the RecursiveCharacterTextSplitter class to get a document list from a TypeScript source code string. ▶ main.py
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import Language

codeString = """
function helloWorld(): void {
  console.log("Hello, World!");
}

// Call the function
helloWorld();
"""

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter.from_language(language = Language.TS, chunk_size = 60, chunk_overlap = 0)

documentList = recursiveCharacterTextSplitter.create_documents([codeString])

print(documentList)

"""
[Document(page_content='function helloWorld(): void {'), Document(page_content='console.log("Hello, World!");\n}'), Document(page_content='// Call the function\nhelloWorld();')]
"""
▶ requirements.txt
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
async-timeout==4.0.3
attrs==23.2.0
certifi==2024.6.2
charset-normalizer==3.3.2
frozenlist==1.4.1
greenlet==3.0.3
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.2.6
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
multidict==6.0.5
numpy==1.26.4
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
SQLAlchemy==2.0.31
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
yarl==1.9.4
■ Shows how to use the create_documents method of the RecursiveCharacterTextSplitter class to get a document list from a JavaScript source code string. ▶ main.py
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import Language

codeString = """
function helloWorld() {
  console.log("Hello, World!");
}

// Call the function
helloWorld();
"""

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter.from_language(language = Language.JS, chunk_size = 60, chunk_overlap = 0)

documentList = recursiveCharacterTextSplitter.create_documents([codeString])

print(documentList)

"""
[Document(page_content='function helloWorld() {\n  console.log("Hello, World!");\n}'), Document(page_content='// Call the function\nhelloWorld();')]
"""
▶ requirements.txt
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
async-timeout==4.0.3
attrs==23.2.0
certifi==2024.6.2
charset-normalizer==3.3.2
frozenlist==1.4.1
greenlet==3.0.3
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.2.6
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
multidict==6.0.5
numpy==1.26.4
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
SQLAlchemy==2.0.31
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
yarl==1.9.4
■ Shows how to use the create_documents method of the RecursiveCharacterTextSplitter class to get a document list from a Python source code string. ▶ main.py
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import Language

codeString = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter.from_language(language = Language.PYTHON, chunk_size = 50, chunk_overlap = 0)

documentList = recursiveCharacterTextSplitter.create_documents([codeString])

print(documentList)

"""
[Document(page_content='def hello_world():\n    print("Hello, World!")'), Document(page_content='# Call the function\nhello_world()')]
"""
▶ requirements.txt
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
async-timeout==4.0.3
attrs==23.2.0
certifi==2024.6.2
charset-normalizer==3.3.2
frozenlist==1.4.1
greenlet==3.0.3
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.2.6
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
multidict==6.0.5
numpy==1.26.4
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
SQLAlchemy==2.0.31
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
yarl==1.9.4
■ Shows how to use the from_language static method of the RecursiveCharacterTextSplitter class to create a RecursiveCharacterTextSplitter object. ▶ main.py
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import Language

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter.from_language(language = Language.PYTHON, chunk_size = 50, chunk_overlap = 0)
▶ requirements.txt
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
async-timeout==4.0.3
attrs==23.2.0
certifi==2024.6.2
charset-normalizer==3.3.2
frozenlist==1.4.1
greenlet==3.0.3
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.2.6
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
multidict==6.0.5
numpy==1.26.4
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
SQLAlchemy==2.0.31
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
yarl==1.9.4
※ pip install langchain
■ Shows how to use the get_separators_for_language static method of the RecursiveCharacterTextSplitter class to get the list of separators for a given language. ▶ main.py
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_text_splitters import Language

separatorList = RecursiveCharacterTextSplitter.get_separators_for_language(Language.CSHARP)

print(separatorList)

"""
['\ninterface ', '\nenum ', '\nimplements ', '\ndelegate ', '\nevent ', '\nclass ', '\nabstract ', '\npublic ', '\nprotected ', '\nprivate ', '\nstatic ', '\nreturn ', '\nif ', '\ncontinue ', '\nfor ', '\nforeach ', '\nwhile ', '\nswitch ', '\nbreak ', '\ncase ', '\nelse ', '\ntry ', '\nthrow ', '\nfinally ', '\ncatch ', '\n\n', '\n', ' ', '']
"""
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters
■ Shows how to get the values of the Language enumeration. ▶ main.py
from langchain.text_splitter import Language

languageList = [e.value for e in Language]

print(languageList)

"""
['cpp', 'go', 'java', 'kotlin', 'js', 'ts', 'php', 'proto', 'python', 'rst', 'ruby', 'rust', 'scala', 'swift', 'markdown', 'latex', 'html', 'sol', 'csharp', 'cobol', 'c', 'lua', 'perl', 'haskell', 'elixir']
"""
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
※ The pip install langchain-text-splitters command was executed.
■ Shows how to use the split_text method of the CharacterTextSplitter class to get a string list from a string. ▶ main.py
from langchain_text_splitters import CharacterTextSplitter

with open("state_of_the_union.txt") as textIOWrapper:
    fileContent = textIOWrapper.read()

characterTextSplitter = CharacterTextSplitter(
    separator          = "\n\n",
    chunk_size         = 1000,
    chunk_overlap      = 200,
    length_function    = len,
    is_separator_regex = False
)

stringList = characterTextSplitter.split_text(fileContent)

for string in stringList[:3]:
    print(string)

"""
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.

Last year COVID-19 kept us apart. This year we are finally together again.

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.

With a duty to one another to the American people to the Constitution.

And with an unwavering resolve that freedom will always triumph over tyranny.

Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated.

He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined.

He met the Ukrainian people.

From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.

He met the Ukrainian people.

From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.

Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland.

In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight.

Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world.

Please rise if you are able and show that, Yes, we the United States of America stand with the Ukrainian people.

Throughout our history we’ve learned this lesson when dictators do not pay a price for their aggression they cause more chaos. They keep moving.

And the costs and the threats to America and the world keep rising.

They keep moving.

And the costs and the threats to America and the world keep rising.

That’s why the NATO Alliance was created to secure peace and stability in Europe after World War 2. The United States is a member along with 29 other nations.

It matters. American diplomacy matters. American resolve matters.

Putin’s latest attack on Ukraine was premeditated and unprovoked.

He rejected repeated efforts at diplomacy.

He thought the West and NATO wouldn’t respond. And he thought he could divide us at home. Putin was wrong. We were ready. Here is what we did.

We prepared extensively and carefully. We spent months building a coalition of other freedom-loving nations from Europe and the Americas to Asia and Africa to confront Putin.

I spent countless hours unifying our European allies. We shared with the world in advance what we knew Putin was planning and precisely how he would try to falsely justify his aggression.

We countered Russia’s lies with truth.
"""
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters
■ Shows how to use the metadatas argument of the create_documents method of the CharacterTextSplitter class to set metadata on the resulting documents. ▶ main.py
from langchain_text_splitters import CharacterTextSplitter

with open("state_of_the_union.txt") as textIOWrapper:
    fileContent = textIOWrapper.read()

characterTextSplitter = CharacterTextSplitter(
    separator          = "\n\n",
    chunk_size         = 1000,
    chunk_overlap      = 200,
    length_function    = len,
    is_separator_regex = False
)

documentList = characterTextSplitter.create_documents(
    [fileContent, fileContent], # one metadata dictionary is applied per input text
    metadatas = [{"document" : 1}, {"document" : 2}]
)

for document in documentList:
    print(document.metadata)
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters
■ Shows how to use the create_documents method of the CharacterTextSplitter class to get a document list from a list of strings. ▶ main.py
from langchain_text_splitters import CharacterTextSplitter

with open("state_of_the_union.txt") as textIOWrapper:
    fileContent = textIOWrapper.read()

characterTextSplitter = CharacterTextSplitter(
    separator          = "\n\n",
    chunk_size         = 1000,
    chunk_overlap      = 200,
    length_function    = len,
    is_separator_regex = False
)

documentList = characterTextSplitter.create_documents([fileContent])

print(len(documentList))

"""
49
"""
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters
■ Shows how to use a RecursiveCharacterTextSplitter object to constrain the chunk size of the document list produced by the HTMLSectionSplitter class. ▶ main.py
from langchain_text_splitters import HTMLSectionSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

htmlString = """
<!DOCTYPE html>
<html>
    <body>
        <div>
            <h1>Foo</h1>
            <p>Some intro text about Foo.</p>
            <div>
                <h2>Bar main section</h2>
                <p>Some intro text about Bar.</p>
                <h3>Bar subsection 1</h3>
                <p>Some text about the first subtopic of Bar.</p>
                <h3>Bar subsection 2</h3>
                <p>Some text about the second subtopic of Bar.</p>
            </div>
            <div>
                <h2>Baz</h2>
                <p>Some text about Baz</p>
            </div>
            <br>
            <p>Some concluding text about Foo</p>
        </div>
    </body>
</html>
"""

headerTupeListToSplitOn = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

htmlSectionSplitter = HTMLSectionSplitter(headerTupeListToSplitOn)

documentList1 = htmlSectionSplitter.split_text(htmlString)

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter(chunk_size = 100, chunk_overlap = 10)

documentList2 = recursiveCharacterTextSplitter.split_documents(documentList1)

print(documentList2)

"""
[
    Document(page_content='Foo \n Some intro text about Foo.', metadata={'Header 1': 'Foo'}),
    Document(page_content='Bar main section \n Some intro text about Bar.', metadata={'Header 2': 'Bar main section'}),
    Document(page_content='Bar subsection 1 \n Some text about the first subtopic of Bar.', metadata={'Header 3': 'Bar subsection 1'}),
    Document(page_content='Bar subsection 2 \n Some text about the second subtopic of Bar.', metadata={'Header 3': 'Bar subsection 2'}),
    Document(page_content='Baz \n Some text about Baz \n \n \n Some concluding text about Foo', metadata={'Header 2': 'Baz'})
]
"""
▶ requirements.txt
annotated-types==0.7.0
beautifulsoup4==4.12.3
bs4==0.0.2
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
lxml==5.2.2
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
soupsieve==2.5
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters bs4 lxml
■ Shows how to use the split_text method of the HTMLSectionSplitter class to get a document list from an HTML string. ▶ main.py
from langchain_text_splitters import HTMLSectionSplitter

htmlString = """
<!DOCTYPE html>
<html>
    <body>
        <div>
            <h1>Foo</h1>
            <p>Some intro text about Foo.</p>
            <div>
                <h2>Bar main section</h2>
                <p>Some intro text about Bar.</p>
                <h3>Bar subsection 1</h3>
                <p>Some text about the first subtopic of Bar.</p>
                <h3>Bar subsection 2</h3>
                <p>Some text about the second subtopic of Bar.</p>
            </div>
            <div>
                <h2>Baz</h2>
                <p>Some text about Baz</p>
            </div>
            <br>
            <p>Some concluding text about Foo</p>
        </div>
    </body>
</html>
"""

headerTupeListToSplitOn = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

htmlSectionSplitter = HTMLSectionSplitter(headerTupeListToSplitOn)

documentList = htmlSectionSplitter.split_text(htmlString)

print(documentList)

"""
[
    Document(page_content='Foo \n Some intro text about Foo.', metadata={'Header 1': 'Foo'}),
    Document(page_content='Bar main section \n Some intro text about Bar.', metadata={'Header 2': 'Bar main section'}),
    Document(page_content='Bar subsection 1 \n Some text about the first subtopic of Bar.', metadata={'Header 3': 'Bar subsection 1'}),
    Document(page_content='Bar subsection 2 \n Some text about the second subtopic of Bar.', metadata={'Header 3': 'Bar subsection 2'}),
    Document(page_content='Baz \n Some text about Baz \n \n \n Some concluding text about Foo', metadata={'Header 2': 'Baz'})
]
"""
▶ requirements.txt
annotated-types==0.7.0
beautifulsoup4==4.12.3
bs4==0.0.2
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
lxml==5.2.2
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
soupsieve==2.5
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters bs4 lxml