■ Shows how to use the mode argument of the UnstructuredMarkdownLoader constructor to load a document split into individual elements.
▶ main.py
from langchain_community.document_loaders import UnstructuredMarkdownLoader

# mode="elements" returns one Document per parsed element instead of one per file.
unstructuredMarkdownLoader = UnstructuredMarkdownLoader("list.md", mode="elements")

documentList = unstructuredMarkdownLoader.load()

for document in documentList:
    print(document)

"""
page_content='첫번째' metadata={'source': 'list.md', 'category_depth': 1, 'last_modified': '2024-06-27T19:30:15', 'languages': ['kor'], 'filetype': 'text/markdown', 'filename': 'list.md', 'category': 'ListItem'}
page_content='두번째' metadata={'source': 'list.md', 'category_depth': 1, 'last_modified': '2024-06-27T19:30:15', 'languages': ['kor'], 'filetype': 'text/markdown', 'filename': 'list.md', 'category': 'ListItem'}
page_content='세번째' metadata={'source': 'list.md', 'category_depth': 1, 'last_modified': '2024-06-27T19:30:15', 'languages': ['kor'], 'filetype': 'text/markdown', 'filename': 'list.md', 'category': 'ListItem'}
page_content='항목1\n\n항목 2\n항목 3\n항목 4' metadata={'source': 'list.md', 'category_depth': 1, 'last_modified': '2024-06-27T19:30:15', 'languages': ['kor'], 'filetype': 'text/markdown', 'filename': 'list.md', 'category': 'ListItem'}
"""
▶ requirements.txt
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.4.0
async-timeout==4.0.3
attrs==23.2.0
backoff==2.2.1
beautifulsoup4==4.12.3
certifi==2024.6.2
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
dataclasses-json==0.6.7
deepdiff==7.0.1
emoji==2.12.1
exceptiongroup==1.2.1
filetype==1.2.0
frozenlist==1.4.1
greenlet==3.0.3
h11==0.14.0
httpcore==1.0.5
httpx==0.27.0
idna==3.7
joblib==1.4.2
jsonpatch==1.33
jsonpath-python==1.0.6
jsonpointer==3.0.0
langchain==0.2.6
langchain-community==0.2.6
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langdetect==1.0.9
langsmith==0.1.82
lxml==5.2.2
Markdown==3.6
marshmallow==3.21.3
multidict==6.0.5
mypy-extensions==1.0.0
nest-asyncio==1.6.0
nltk==3.8.1
numpy==1.26.4
ordered-set==4.1.0
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
pypdf==4.2.0
python-dateutil==2.9.0.post0
python-iso639==2024.4.27
python-magic==0.4.27
PyYAML==6.0.1
rapidfuzz==3.9.3
regex==2024.5.15
requests==2.32.3
requests-toolbelt==1.0.0
six==1.16.0
sniffio==1.3.1
soupsieve==2.5
SQLAlchemy==2.0.31
tabulate==0.9.0
tenacity==8.4.2
tqdm==4.66.4
typing-inspect==0.9.0
typing_extensions==4.12.2
unstructured==0.14.8
unstructured-client==0.23.7
urllib3==2.2.2
wrapt==1.16.0
yarl==1.9.4
※ pip
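In elements mode each Document carries a category entry in its metadata (ListItem in the output above), so it is possible to keep only certain kinds of elements. The following is a minimal sketch of that idea. It uses plain dictionaries as stand-ins for the Document objects so that it runs without LangChain installed; the helper name filter_by_category and the extra Title record are hypothetical.

```python
from typing import Dict, List

# Stand-ins for the Document objects printed above, reduced to the fields used here.
# The "Title" record is a hypothetical addition so that the filter has something to drop.
elements: List[Dict] = [
    {"page_content": "목록", "metadata": {"category": "Title"}},
    {"page_content": "첫번째", "metadata": {"category": "ListItem"}},
    {"page_content": "두번째", "metadata": {"category": "ListItem"}},
    {"page_content": "세번째", "metadata": {"category": "ListItem"}},
]


def filter_by_category(items: List[Dict], category: str) -> List[str]:
    """Return the text of every element whose metadata category matches."""
    return [
        item["page_content"]
        for item in items
        if item["metadata"].get("category") == category
    ]


list_items = filter_by_category(elements, "ListItem")
print(list_items)
```

With real Document objects the same check would read document.metadata.get("category") and document.page_content.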
■ Shows how to load a Markdown (.md) file with the load method of the UnstructuredMarkdownLoader class.
▶ main.py
from langchain_community.document_loaders import UnstructuredMarkdownLoader

# Without a mode argument the whole file comes back as a single Document.
unstructuredMarkdownLoader = UnstructuredMarkdownLoader("list.md")

documentList = unstructuredMarkdownLoader.load()

for document in documentList:
    print(document)

"""
page_content='첫번째\n\n두번째\n\n세번째\n\n항목1\n\n항목 2\n항목 3\n항목 4' metadata={'source': 'list.md'}
"""
▶ requirements.txt
Identical to the requirements.txt shown for the UnstructuredMarkdownLoader mode example above.
※ pip install langchain-community unstructured[md]
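Comparing this output with the output of the mode="elements" example for the same list.md suggests that the default mode returns the element texts joined by blank lines in one Document. The short check below confirms that observation using the strings copied from the two outputs on this page. The variable names are illustrative, and the check describes the printed output rather than the library's internals.

```python
# Element texts as printed by the mode="elements" example.
element_texts = ["첫번째", "두번째", "세번째", "항목1\n\n항목 2\n항목 3\n항목 4"]

# page_content as printed by the default-mode example.
single_text = "첫번째\n\n두번째\n\n세번째\n\n항목1\n\n항목 2\n항목 3\n항목 4"

# Join the elements with one blank line between them and compare.
joined = "\n\n".join(element_texts)
print(joined == single_text)
```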
■ Shows how to set TextLoader as the loader class with the loader_cls argument of the DirectoryLoader constructor.
▶ main.py
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders import DirectoryLoader

# loader_cls=TextLoader reads each matched file as plain text, so the Markdown markup is kept.
directoryLoader = DirectoryLoader(".", glob="*.md", loader_cls=TextLoader)

documentList = directoryLoader.load()

for document in documentList:
    print(document)
    print()

"""
page_content='# 제목 1단계\n## 제목 2단계 \n### 제목 3단계\n#### 제목 4단계\n##### 제목 5단계\n###### 제목 6단계' metadata={'source': 'header.md'}

page_content='1. 첫번째\n1. 두번째\n1. 세번째\n \n+ 항목1\n - 항목 2\n * 항목 3\n + 항목 4' metadata={'source': 'list.md'}
"""
▶ requirements.txt
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
async-timeout==4.0.3
attrs==23.2.0
certifi==2024.6.2
charset-normalizer==3.3.2
dataclasses-json==0.6.7
frozenlist==1.4.1
greenlet==3.0.3
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.2.6
langchain-community==0.2.6
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
marshmallow==3.21.3
multidict==6.0.5
mypy-extensions==1.0.0
numpy==1.26.4
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
SQLAlchemy==2.0.31
tenacity==8.4.2
typing-inspect==0.9.0
typing_extensions==4.12.2
urllib3==2.2.2
yarl==1.9.4
※ pip install
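TextLoader leaves the Markdown markup untouched, which is why the # and list markers survive in page_content above. Roughly the same result can be sketched with the standard library alone. The function name load_raw and the dictionary shape are hypothetical stand-ins for the loader and its Document objects, and the sample file is a shortened version of header.md written to a temporary directory for the demonstration.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def load_raw(directory: Path, pattern: str = "*.md") -> List[Dict]:
    """Read every matching file as UTF-8 text without interpreting the markup."""
    records = []
    for path in sorted(directory.glob(pattern)):
        records.append(
            {
                "page_content": path.read_text(encoding="utf-8"),
                "metadata": {"source": path.name},
            }
        )
    return records


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Sample content modeled on the first two lines of header.md shown above.
    (root / "header.md").write_text("# 제목 1단계\n## 제목 2단계", encoding="utf-8")
    records = load_raw(root)

print(records)
```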
■ Shows how to enable multithreading with the use_multithreading argument of the DirectoryLoader constructor.
▶ main.py
from langchain_community.document_loaders import DirectoryLoader

# use_multithreading=True loads the matched files in worker threads.
directoryLoader = DirectoryLoader(".", glob="*.md", use_multithreading=True)

documentList = directoryLoader.load()

for document in documentList:
    print(document)
    print()

"""
page_content='첫번째\n\n두번째\n\n세번째\n\n항목1\n\n항목 2\n항목 3\n항목 4' metadata={'source': 'list.md'}

page_content='제목 1단계\n\n제목 2단계\n\n제목 3단계\n\n제목 4단계\n\n제목 5단계\n\n제목 6단계' metadata={'source': 'header.md'}
"""
▶ requirements.txt
Identical to the requirements.txt shown for the UnstructuredMarkdownLoader mode example above.
※ pip install langchain-community unstructured[md]
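In the output above list.md is printed before header.md, the reverse of the other DirectoryLoader examples on this page. One plausible explanation is that results from worker threads are collected as each thread finishes rather than in submission order. The standard-library sketch below illustrates that effect. It does not use LangChain, and the function fake_load and its delays are hypothetical, chosen so that the second job normally finishes first.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List


def fake_load(name: str, delay: float) -> str:
    """Pretend to parse a file; the sleep stands in for parsing time."""
    time.sleep(delay)
    return name


# Hypothetical workload: header.md is made artificially slower than list.md.
jobs = [("header.md", 0.3), ("list.md", 0.05)]

completed: List[str] = []
with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(fake_load, name, delay) for name, delay in jobs]
    # as_completed yields futures as they finish, not in submission order.
    for future in as_completed(futures):
        completed.append(future.result())

print(completed)
```

If the order of the loaded documents matters, sorting them afterwards (for example by the source entry in their metadata) is a simple way to make it deterministic.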
■ Shows how to display document loading progress with the show_progress argument of the DirectoryLoader constructor.
▶ main.py
from langchain_community.document_loaders import DirectoryLoader

# show_progress=True displays a progress bar while the files are loaded.
directoryLoader = DirectoryLoader(".", glob="*.md", show_progress=True)

documentList = directoryLoader.load()

for document in documentList:
    print(document)
    print()

"""
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.92it/s]
page_content='제목 1단계\n\n제목 2단계\n\n제목 3단계\n\n제목 4단계\n\n제목 5단계\n\n제목 6단계' metadata={'source': 'header.md'}

page_content='첫번째\n\n두번째\n\n세번째\n\n항목1\n\n항목 2\n항목 3\n항목 4' metadata={'source': 'list.md'}
"""
▶ requirements.txt
Identical to the requirements.txt shown for the UnstructuredMarkdownLoader mode example above.
※ pip
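The bar in the output above has the familiar tqdm format (a percentage, a "2/2" counter, and a rate), and tqdm is pinned in requirements.txt. As a rough illustration of how such a bar wraps a loop, here is a small sketch that is independent of LangChain. The file_names list is a hypothetical stand-in for the matched files, and the fallback wrapper exists only so the sketch still runs where tqdm is not installed.

```python
from typing import Iterable, List

try:
    from tqdm import tqdm  # third-party progress bar, pinned in requirements.txt
except ImportError:
    # Fallback so the sketch still runs without tqdm: iterate with no bar.
    def tqdm(iterable: Iterable, **kwargs):
        return iterable


file_names: List[str] = ["header.md", "list.md"]  # hypothetical work items
loaded: List[str] = []

# One tick per file, similar in spirit to the "2/2" counter shown above.
for name in tqdm(file_names, total=len(file_names)):
    loaded.append(name)

print(loaded)
```

By default tqdm writes the bar to standard error, so it does not mix into output that is redirected from standard output.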
■ Shows how to load Markdown (.md) files with the glob argument of the DirectoryLoader constructor.
▶ main.py
from langchain_community.document_loaders import DirectoryLoader

# glob="*.md" restricts loading to the Markdown files in the given directory.
directoryLoader = DirectoryLoader(".", glob="*.md")

documentList = directoryLoader.load()

for document in documentList:
    print(document)
    print()

"""
page_content='제목 1단계\n\n제목 2단계\n\n제목 3단계\n\n제목 4단계\n\n제목 5단계\n\n제목 6단계' metadata={'source': 'header.md'}

page_content='첫번째\n\n두번째\n\n세번째\n\n항목1\n\n항목 2\n항목 3\n항목 4' metadata={'source': 'list.md'}
"""
▶ requirements.txt
Identical to the requirements.txt shown for the UnstructuredMarkdownLoader mode example above.
※ pip install
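The glob pattern decides which files in the directory are handed to a loader. The following standard-library sketch shows what "*.md" selects compared with a recursive pattern, assuming pathlib-style matching. The file names other than header.md and list.md are hypothetical.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # header.md and list.md mirror the examples; the other names are made up.
    for name in ["header.md", "list.md", "notes.txt"]:
        (root / name).write_text("sample", encoding="utf-8")
    (root / "sub").mkdir()
    (root / "sub" / "nested.md").write_text("sample", encoding="utf-8")

    # "*.md" matches only the top level; "**/*.md" also descends into sub/.
    top_level = sorted(p.name for p in root.glob("*.md"))
    recursive = sorted(p.name for p in root.glob("**/*.md"))

print(top_level)
print(recursive)
```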