■ Shows how to create a recursive character text splitter using the chunk_size/chunk_overlap arguments of the RecursiveCharacterTextSplitter class constructor. ▶ Example code (PY)

from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

webBaseLoader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")

documentList = webBaseLoader.load()

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 0)
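For reference, a minimal usage sketch (not part of the original example) showing how the splitter created above could be applied to the loaded documents with the standard split_documents method. ▶ Example code (PY)

# Split the loaded web documents into chunks of at most 500 characters each.
chunkedDocumentList = recursiveCharacterTextSplitter.split_documents(documentList)

print(len(chunkedDocumentList))             # number of chunks produced
print(chunkedDocumentList[0].page_content)  # text of the first chunk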
■ Shows how to create a character text splitter using the chunk_size/chunk_overlap arguments of the CharacterTextSplitter class constructor. ▶ Example code (PY)

from langchain_text_splitters import CharacterTextSplitter

characterTextSplitter = CharacterTextSplitter(chunk_size = 1000, chunk_overlap = 0)
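As a usage sketch (not part of the original example; the input file name is an assumption): CharacterTextSplitter first splits on its default "\n\n" separator and then merges pieces up to roughly chunk_size characters. ▶ Example code (PY)

# Hypothetical input file; any plain-text file works.
with open("state_of_the_union.txt") as textIOWrapper:
    fileContent = textIOWrapper.read()

stringList = characterTextSplitter.split_text(fileContent)

print(len(stringList))  # number of chunks
print(stringList[0])    # first chunk, at most about 1000 characters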
■ Shows how to use the CharacterTextSplitter class's split_text method to get a list of strings from a string. ▶ main.py

from transformers import GPT2TokenizerFast
from langchain_text_splitters import CharacterTextSplitter

with open("sample.txt") as textIOWrapper:
    fileContent = textIOWrapper.read()

gpt2TokenizerFast = GPT2TokenizerFast.from_pretrained("gpt2")

characterTextSplitter = CharacterTextSplitter.from_huggingface_tokenizer(gpt2TokenizerFast, chunk_size = 100, chunk_overlap = 0)

stringList = characterTextSplitter.split_text(fileContent)

for string in stringList:
    print(string)
    print()

"""
'잔느'는 귀족의 딸로 부모의 사랑을 받으며 어려움 없이 자란 소녀였다. 열일곱 살인 그녀는, 수녀원 부속 여학교를 졸업하고 행복에 대한 기대로 가득 차 있었다. 아버지인 '시몽 자크 르 페르튀 데 보' 남작은 선량하고 다정했고, 어머니는 따뜻했다. 다만 어머니는 심장비대증으로 고생 중이었는데, 로잘리가 잘 부축했다. 따뜻한 집안 분위기로 인해, 하녀인 '로잘리'는 둘째 딸 같은 대접을 받았다. 마을의 피코 신부는 이제 막 수녀원을 졸업한 잔느에게 한 청년을 소개해 주었다. 그는 '줄리앙 장 드 라마르' 자작으로, 검소하고 외모가 출중한 청년이었다. 줄리앙과 함께 소풍을 다녀온 잔느는 줄리앙을 사랑하게 되고, 결국 둘은 결혼을 한다. 그러나 결혼식 후, 그들의 첫날밤은 달콤하지 않았다. 잔느는 난폭하게 자신의 욕구만 채우고 잠든 줄리앙에게 심한 모욕감을 느낀다. 심지어 신혼여행 중 줄리앙은 잔느의 용돈 2천 프랑을 맡아준다며 가져가지만, 잔느가 쇼핑을 할 때마다 심한 눈치를 주었다. 결국 잔느는 쇼핑도 제대로 못하고, 줄리앙에게 자신의 용돈을 고스란히 빼앗기고 만다.
"""
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
filelock==3.15.4
fsspec==2024.6.1
huggingface-hub==0.23.4
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
numpy==1.26.4
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
regex==2024.5.15
requests==2.32.3
safetensors==0.4.3
tenacity==8.4.2
tokenizers==0.19.1
tqdm==4.66.4
transformers==4.42.3
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters transformers
■ Shows how to use the CharacterTextSplitter class's from_huggingface_tokenizer method to create a CharacterTextSplitter object. ▶ main.py

from transformers import GPT2TokenizerFast
from langchain_text_splitters import CharacterTextSplitter

gpt2TokenizerFast = GPT2TokenizerFast.from_pretrained("gpt2")

characterTextSplitter = CharacterTextSplitter.from_huggingface_tokenizer(gpt2TokenizerFast, chunk_size = 100, chunk_overlap = 0)
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
filelock==3.15.4
fsspec==2024.6.1
huggingface-hub==0.23.4
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
numpy==1.26.4
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
regex==2024.5.15
requests==2.32.3
safetensors==0.4.3
tenacity==8.4.2
tokenizers==0.19.1
tqdm==4.66.4
transformers==4.42.3
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters transformers
■ Shows how to use the KonlpyTextSplitter class's split_text method to get a list of Korean strings from a Korean string. ▶ main.py

from langchain_text_splitters import KonlpyTextSplitter

with open("sample.txt") as textIOWrapper:
    fileContent = textIOWrapper.read()

konlpyTextSplitter = KonlpyTextSplitter()

stringList = konlpyTextSplitter.split_text(fileContent)

for string in stringList[:3]:
    print(string)
    print()

"""
' 잔느' 는 귀족의 딸로 부모의 사랑을 받으며 어려움 없이 자란 소녀였다. 열일곱 살인 그녀는, 수녀원 부속 여학교를 졸업하고 행복에 대한 기대로 가득 차 있었다. 아버지인 ' 시 몽 자크 르 페르튀 데 보' 남작은 선량하고 다정했고, 어머니는 따뜻했다. 다만 어머니는 심장 비대증으로 고생 중이었는데, 로잘리가 잘 부축했다. 따뜻한 집안 분위기로 인해, 하녀인 ' 로잘리' 는 둘째 딸 같은 대접을 받았다. 마을의 피코 신부는 이제 막 수녀원을 졸업한 잔느에게 한 청년을 소개해 주었다. 그는 ' 줄 리 앙 장 드 라 마르' 자작으로, 검소하고 외모가 출중한 청년이었다. 줄리앙과 함께 소풍을 다녀온 잔 느는 줄리앙을 사랑하게 되고, 결국 둘은 결혼을 한다. 그러나 결혼식 후, 그들의 첫날 밤은 달콤하지 않았다. 잔 느는 난폭하게 자신의 욕구만 채우고 잠든 줄리앙에게 심한 모욕감을 느낀다. 심지어 신혼여행 중 줄리앙은 잔느의 용돈 2천 프랑을 맡아 준다며 가져가지만, 잔느가 쇼핑을 할 때마다 심한 눈치를 주었다. 결국 잔 느는 쇼핑도 제대로 못하고, 줄리앙에게 자신의 용돈을 고스란히 빼앗기고 만다.
"""
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
JPype1==1.5.0
jsonpatch==1.33
jsonpointer==3.0.0
konlpy==0.6.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
lxml==5.2.2
numpy==2.0.0
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters konlpy
■ Shows how to use the NLTKTextSplitter class's split_text method to get a list of strings from a string. ▶ main.py

from langchain_text_splitters import NLTKTextSplitter

with open("state_of_the_union.txt") as textIOWrapper:
    fileContent = textIOWrapper.read()

nltkTextSplitter = NLTKTextSplitter(chunk_size = 1000)

stringList = nltkTextSplitter.split_text(fileContent)

print(stringList[0])

"""
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. Last year COVID-19 kept us apart. This year we are finally together again. Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. With a duty to one another to the American people to the Constitution. And with an unwavering resolve that freedom will always triumph over tyranny. Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. Groups of citizens blocking tanks with their bodies.
"""
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
click==8.1.7
idna==3.7
joblib==1.4.2
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
nltk==3.8.1
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
regex==2024.5.15
requests==2.32.3
tenacity==8.4.2
tqdm==4.66.4
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters nltk
■ Shows how to use the SentenceTransformersTokenTextSplitter class's count_tokens method to get a token count. ▶ main.py

from langchain_text_splitters import SentenceTransformersTokenTextSplitter

with open("state_of_the_union.txt") as textIOWrapper:
    fileContent = textIOWrapper.read()

sentenceTransformersTokenTextSplitter = SentenceTransformersTokenTextSplitter(
    model_name       = "sentence-transformers/all-mpnet-base-v2",
    tokens_per_chunk = 384,
    chunk_overlap    = 32
)

tokenCount = sentenceTransformersTokenTextSplitter.count_tokens(text = fileContent)

print(tokenCount)

"""
8093
"""
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
filelock==3.15.4
fsspec==2024.6.1
huggingface-hub==0.23.4
idna==3.7
Jinja2==3.1.4
joblib==1.4.2
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
MarkupSafe==2.1.5
mpmath==1.3.0
networkx==3.3
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.40
nvidia-nvtx-cu12==12.1.105
orjson==3.10.5
packaging==24.1
pillow==10.3.0
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
regex==2024.5.15
requests==2.32.3
safetensors==0.4.3
scikit-learn==1.5.0
scipy==1.14.0
sentence-transformers==3.0.1
sympy==1.12.1
tenacity==8.4.2
threadpoolctl==3.5.0
tokenizers==0.19.1
torch==2.3.1
tqdm==4.66.4
transformers==4.42.3
triton==2.3.1
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters sentence-transformers
■ Shows how to create a sentence-transformers token text splitter using the SentenceTransformersTokenTextSplitter class. ▶ main.py

from langchain_text_splitters import SentenceTransformersTokenTextSplitter

sentenceTransformersTokenTextSplitter = SentenceTransformersTokenTextSplitter(
    model_name       = "sentence-transformers/all-mpnet-base-v2",
    tokens_per_chunk = 384,
    chunk_overlap    = 32
)
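A minimal usage sketch for the splitter created above (not part of the original example; the input file name is an assumption). split_text returns chunks of at most tokens_per_chunk tokens, counted with the model's own tokenizer. ▶ Example code (PY)

# Hypothetical input file; any plain-text file works.
with open("state_of_the_union.txt") as textIOWrapper:
    fileContent = textIOWrapper.read()

stringList = sentenceTransformersTokenTextSplitter.split_text(fileContent)

print(len(stringList))  # number of chunks
print(stringList[0])    # first chunk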
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
filelock==3.15.4
fsspec==2024.6.1
huggingface-hub==0.23.4
idna==3.7
Jinja2==3.1.4
joblib==1.4.2
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
MarkupSafe==2.1.5
mpmath==1.3.0
networkx==3.3
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.40
nvidia-nvtx-cu12==12.1.105
orjson==3.10.5
packaging==24.1
pillow==10.3.0
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
regex==2024.5.15
requests==2.32.3
safetensors==0.4.3
scikit-learn==1.5.0
scipy==1.14.0
sentence-transformers==3.0.1
sympy==1.12.1
tenacity==8.4.2
threadpoolctl==3.5.0
tokenizers==0.19.1
torch==2.3.1
tqdm==4.66.4
transformers==4.42.3
triton==2.3.1
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters sentence-transformers
■ Shows how to use the TokenTextSplitter class's split_text method to get a list of strings from a string. ▶ main.py

from langchain_text_splitters import TokenTextSplitter

with open("state_of_the_union.txt") as textIOWrapper:
    fileContent = textIOWrapper.read()

tokenTextSplitter = TokenTextSplitter(chunk_size = 10, chunk_overlap = 0)

stringList = tokenTextSplitter.split_text(fileContent)

for string in stringList[:3]:
    print(string)
    print()

"""
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court.
"""
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
regex==2024.5.15
requests==2.32.3
tenacity==8.4.2
tiktoken==0.7.0
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters tiktoken
■ Shows how to use the RecursiveCharacterTextSplitter class's split_text method to get a list of strings from a string. ▶ main.py

from langchain_text_splitters import RecursiveCharacterTextSplitter

with open("state_of_the_union.txt") as textIOWrapper:
    fileContent = textIOWrapper.read()

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name    = "gpt-4",
    chunk_size    = 100,
    chunk_overlap = 0
)

stringList = recursiveCharacterTextSplitter.split_text(fileContent)

for string in stringList[:3]:
    print(string)
    print()

"""
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. Last year COVID-19 kept us apart. This year we are finally together again. Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. With a duty to one another to the American people to the Constitution. And with an unwavering resolve that freedom will always triumph over tyranny. Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight.
"""
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
regex==2024.5.15
requests==2.32.3
tenacity==8.4.2
tiktoken==0.7.0
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters tiktoken
■ Shows how to use the RecursiveCharacterTextSplitter class's from_tiktoken_encoder method to create a RecursiveCharacterTextSplitter object for token-based splitting. ▶ main.py

from langchain_text_splitters import RecursiveCharacterTextSplitter

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name    = "gpt-4",
    chunk_size    = 100,
    chunk_overlap = 0
)
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
regex==2024.5.15
requests==2.32.3
tenacity==8.4.2
tiktoken==0.7.0
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters tiktoken
■ Shows how to use the CharacterTextSplitter class's split_text method to get a list of strings from a string. ▶ main.py

from langchain_text_splitters import CharacterTextSplitter

with open("state_of_the_union.txt") as textIOWrapper:
    fileContent = textIOWrapper.read()

characterTextSplitter = CharacterTextSplitter.from_tiktoken_encoder(encoding_name = "cl100k_base", chunk_size = 100, chunk_overlap = 0)

stringList = characterTextSplitter.split_text(fileContent)

for string in stringList[:3]:
    print(string)
    print()
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
regex==2024.5.15
requests==2.32.3
tenacity==8.4.2
tiktoken==0.7.0
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters tiktoken
■ Shows how to use the CharacterTextSplitter class's from_tiktoken_encoder method to create a CharacterTextSplitter object for token-based splitting. ▶ main.py

from langchain_text_splitters import CharacterTextSplitter

characterTextSplitter = CharacterTextSplitter.from_tiktoken_encoder(encoding_name = "cl100k_base", chunk_size = 100, chunk_overlap = 0)
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
regex==2024.5.15
requests==2.32.3
tenacity==8.4.2
tiktoken==0.7.0
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters tiktoken
■ Shows how to use the breakpoint_threshold_type argument in the SemanticChunker class constructor to set the threshold criterion for splitting documents. ※ The breakpoint_threshold_type argument defaults to "percentile"; the supported values are "percentile", "standard_deviation", "interquartile", and "gradient".
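A minimal sketch of setting this argument (not from the original post; it assumes the langchain_experimental and langchain-openai packages are installed and that OPENAI_API_KEY is available). ▶ Example code (PY)

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Split where the embedding distance between adjacent sentence groups exceeds the
# threshold computed according to breakpoint_threshold_type.
semanticChunker = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type = "standard_deviation")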
■ Shows how to use the SemanticChunker class's create_documents method to get a list of documents split by semantic similarity. ※ The OPENAI_API_KEY environment variable value is defined in a .env file.
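A minimal sketch (not from the original post; it assumes the langchain_experimental, langchain-openai, and python-dotenv packages, and the input file name is an assumption). ▶ Example code (PY)

from dotenv import load_dotenv
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

load_dotenv()  # loads OPENAI_API_KEY from the .env file

with open("state_of_the_union.txt") as textIOWrapper:  # hypothetical input file
    fileContent = textIOWrapper.read()

semanticChunker = SemanticChunker(OpenAIEmbeddings())

documentList = semanticChunker.create_documents([fileContent])

print(documentList[0].page_content)  # first semantically coherent chunk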
■ Shows how to use the convert_lists argument of the RecursiveJsonSplitter class's split_json method so that lists are also converted and split. ▶ main.py

import requests

from langchain_text_splitters import RecursiveJsonSplitter

response = requests.get("https://api.smith.langchain.com/openapi.json")

jsonDictionary = response.json()

recursiveJsonSplitter = RecursiveJsonSplitter(max_chunk_size = 300)

jsonDictionaryList = recursiveJsonSplitter.split_json(json_data = jsonDictionary, convert_lists = True)

for splitDictionary in jsonDictionaryList[:3]:
    print(splitDictionary)
    print()

"""
{'openapi': '3.1.0', 'info': {'title': 'LangSmith', 'version': '0.1.0'}, 'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': {'0': 'tracer-sessions'}, 'summary': 'Read Tracer Session', 'description': 'Get a specific session.'}}}}

{'paths': {'/api/v1/sessions/{session_id}': {'get': {'operationId': 'read_tracer_session_api_v1_sessions__session_id__get', 'security': {'0': {'API Key': {}}, '1': {'Tenant ID': {}}, '2': {'Bearer Auth': {}}}}}}}

{'paths': {'/api/v1/sessions/{session_id}': {'get': {'parameters': {'0': {'name': 'session_id', 'in': 'path', 'required': True, 'schema': {'type': 'string', 'format': 'uuid', 'title': 'Session Id'}}}}}}}
"""
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters requests
■ Shows how to use the RecursiveJsonSplitter class's create_documents method to get a list of documents from a JSON dictionary. ▶ main.py

import requests

from langchain_text_splitters import RecursiveJsonSplitter

response = requests.get("https://api.smith.langchain.com/openapi.json")

jsonDictionary = response.json()

recursiveJsonSplitter = RecursiveJsonSplitter(max_chunk_size = 300)

documentList = recursiveJsonSplitter.create_documents(texts = [jsonDictionary])

for document in documentList[:3]:
    print(document)
    print()

"""
page_content='{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session."}}}}'

page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"operationId": "read_tracer_session_api_v1_sessions__session_id__get", "security": [{"API Key": []}, {"Tenant ID": []}, {"Bearer Auth": []}]}}}}'

page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "include_stats", "in": "query", "required": false, "schema": {"type": "boolean", "default": false, "title": "Include Stats"}}, {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}}]}}}}'
"""
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters requests
■ Shows how to use the RecursiveCharacterTextSplitter class's create_documents method to get a list of documents from a C# source code string. ▶ main.py

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import Language

codeString = """
using System;
class Program
{
    static void Main()
    {
        int age = 30; // Change the age value as needed
        // Categorize the age without any console output
        if (age < 18)
        {
            // Age is under 18
        }
        else if (age >= 18 && age < 65)
        {
            // Age is an adult
        }
        else
        {
            // Age is a senior citizen
        }
    }
}
"""

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter.from_language(language = Language.CSHARP, chunk_size = 128, chunk_overlap = 0)

documentList = recursiveCharacterTextSplitter.create_documents([codeString])

print(documentList)

"""
[Document(page_content='using System;'), Document(page_content='class Program\n{\n    static void Main()\n    {\n        int age = 30; // Change the age value as needed'), Document(page_content='// Categorize the age without any console output\n        if (age < 18)\n        {\n            // Age is under 18'), Document(page_content='}\n        else if (age >= 18 && age < 65)\n        {\n            // Age is an adult\n        }\n        else\n        {'), Document(page_content='// Age is a senior citizen\n        }\n    }\n}')]
"""
▶ requirements.txt
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
async-timeout==4.0.3
attrs==23.2.0
certifi==2024.6.2
charset-normalizer==3.3.2
frozenlist==1.4.1
greenlet==3.0.3
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.2.6
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
multidict==6.0.5
numpy==1.26.4
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
SQLAlchemy==2.0.31
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
yarl==1.9.4
■ Shows how to use the RecursiveJsonSplitter class's split_json method to get a list of JSON dictionaries split from a JSON dictionary. ▶ main.py

import requests

from langchain_text_splitters import RecursiveJsonSplitter

response = requests.get("https://api.smith.langchain.com/openapi.json")

jsonDictionary = response.json()

recursiveJsonSplitter = RecursiveJsonSplitter(max_chunk_size = 300)

jsonDictionaryList = recursiveJsonSplitter.split_json(json_data = jsonDictionary)

for splitDictionary in jsonDictionaryList[:3]:
    print(splitDictionary)
    print()

"""
{'openapi': '3.1.0', 'info': {'title': 'LangSmith', 'version': '0.1.0'}, 'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions'], 'summary': 'Read Tracer Session', 'description': 'Get a specific session.'}}}}

{'paths': {'/api/v1/sessions/{session_id}': {'get': {'operationId': 'read_tracer_session_api_v1_sessions__session_id__get', 'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}]}}}}

{'paths': {'/api/v1/sessions/{session_id}': {'get': {'parameters': [{'name': 'session_id', 'in': 'path', 'required': True, 'schema': {'type': 'string', 'format': 'uuid', 'title': 'Session Id'}}, {'name': 'include_stats', 'in': 'query', 'required': False, 'schema': {'type': 'boolean', 'default': False, 'title': 'Include Stats'}}, {'name': 'accept', 'in': 'header', 'required': False, 'schema': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'title': 'Accept'}}]}}}}
"""
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters requests
■ Shows how to limit chunk size by passing the document list produced by the MarkdownHeaderTextSplitter class through a RecursiveCharacterTextSplitter object. ▶ main.py

from langchain_text_splitters import MarkdownHeaderTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

codeString = "# Foo\n\n ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"

headerListToSplitOn = [
    ("#" , "Header 1"),
    ("##" , "Header 2"),
    ("###", "Header 3")
]

markdownHeaderTextSplitter = MarkdownHeaderTextSplitter(headerListToSplitOn, return_each_line = True)

documentList1 = markdownHeaderTextSplitter.split_text(codeString)

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter(chunk_size = 260, chunk_overlap = 30)

documentList2 = recursiveCharacterTextSplitter.split_documents(documentList1)

for document in documentList2:
    print(document)

"""
page_content='Hi this is Jim' metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}
page_content='Hi this is Joe' metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}
page_content='Hi this is Lance' metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}
page_content='Hi this is Molly' metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}
"""
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters
■ Shows how to use the return_each_line argument in the MarkdownHeaderTextSplitter class constructor to get a document list with one document per markdown line. ▶ main.py

from langchain_text_splitters import MarkdownHeaderTextSplitter

codeString = "# Foo\n\n ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"

headerListToSplitOn = [
    ("#" , "Header 1"),
    ("##" , "Header 2"),
    ("###", "Header 3")
]

markdownHeaderTextSplitter = MarkdownHeaderTextSplitter(headerListToSplitOn, return_each_line = True)

documentList = markdownHeaderTextSplitter.split_text(codeString)

for document in documentList:
    print(document)

"""
page_content='Hi this is Jim' metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}
page_content='Hi this is Joe' metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}
page_content='Hi this is Lance' metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}
page_content='Hi this is Molly' metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}
"""
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters
■ Shows how to use the MarkdownHeaderTextSplitter class's split_text method to get a list of documents from a markdown string. ▶ main.py

from langchain_text_splitters import MarkdownHeaderTextSplitter

codeString = "# Foo\n\n ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"

headerListToSplitOn = [
    ("#" , "Header 1"),
    ("##" , "Header 2"),
    ("###", "Header 3")
]

markdownHeaderTextSplitter = MarkdownHeaderTextSplitter(headerListToSplitOn)

documentList = markdownHeaderTextSplitter.split_text(codeString)

for document in documentList:
    print(document)

"""
page_content='Hi this is Jim \nHi this is Joe' metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}
page_content='Hi this is Lance' metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}
page_content='Hi this is Molly' metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}
"""
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-text-splitters
■ Shows how to use the RecursiveCharacterTextSplitter class's create_documents method to get a list of documents from a PHP source code string. ▶ main.py

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import Language

codeString = """
<?php
namespace foo;
class Hello {
    public function __construct() { }
}
function hello() {
    echo "Hello World!";
}
interface Human {
    public function breath();
}
trait Foo { }
enum Color
{
    case Red;
    case Blue;
}
"""

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter.from_language(language = Language.PHP, chunk_size = 50, chunk_overlap = 0)

documentList = recursiveCharacterTextSplitter.create_documents([codeString])

print(documentList)

"""
[Document(page_content='<?php\nnamespace foo;'), Document(page_content='class Hello {'), Document(page_content='public function __construct() { }\n}'), Document(page_content='function hello() {\n    echo "Hello World!";\n}'), Document(page_content='interface Human {\n    public function breath();\n}'), Document(page_content='trait Foo { }\nenum Color\n{\n    case Red;'), Document(page_content='case Blue;\n}')]
"""
▶ requirements.txt
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
async-timeout==4.0.3
attrs==23.2.0
certifi==2024.6.2
charset-normalizer==3.3.2
frozenlist==1.4.1
greenlet==3.0.3
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.2.6
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
multidict==6.0.5
numpy==1.26.4
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
SQLAlchemy==2.0.31
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
yarl==1.9.4
■ Shows how to use the RecursiveCharacterTextSplitter class's create_documents method to get a list of documents from a Haskell source code string. ▶ main.py

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import Language

codeString = """
main :: IO ()
main = do
  putStrLn "Hello, World!"
-- Some sample functions
add :: Int -> Int -> Int
add x y = x + y
"""

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter.from_language(language = Language.HASKELL, chunk_size = 50, chunk_overlap = 0)

documentList = recursiveCharacterTextSplitter.create_documents([codeString])

print(documentList)

"""
[Document(page_content='main :: IO ()'), Document(page_content='main = do\n  putStrLn "Hello, World!"\n-- Some'), Document(page_content='sample functions\nadd :: Int -> Int -> Int\nadd x y'), Document(page_content='= x + y')]
"""
▶ requirements.txt
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
async-timeout==4.0.3
attrs==23.2.0
certifi==2024.6.2
charset-normalizer==3.3.2
frozenlist==1.4.1
greenlet==3.0.3
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.2.6
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
multidict==6.0.5
numpy==1.26.4
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
SQLAlchemy==2.0.31
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
yarl==1.9.4
■ Shows how to use the RecursiveCharacterTextSplitter class's create_documents method to get a list of documents from a Solidity source code string. ▶ main.py

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import Language

codeString = """
pragma solidity ^0.8.20;
contract HelloWorld {
    function add(uint a, uint b) pure public returns(uint) {
        return a + b;
    }
}
"""

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter.from_language(language = Language.SOL, chunk_size = 128, chunk_overlap = 0)

documentList = recursiveCharacterTextSplitter.create_documents([codeString])

print(documentList)

"""
[Document(page_content='pragma solidity ^0.8.20;'), Document(page_content='contract HelloWorld {\n    function add(uint a, uint b) pure public returns(uint) {\n        return a + b;\n    }\n}')]
"""
▶ requirements.txt
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
async-timeout==4.0.3
attrs==23.2.0
certifi==2024.6.2
charset-normalizer==3.3.2
frozenlist==1.4.1
greenlet==3.0.3
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.2.6
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
multidict==6.0.5
numpy==1.26.4
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
SQLAlchemy==2.0.31
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
yarl==1.9.4