■ Shows how to use the extract_images argument of the PyPDFLoader constructor to extract text from images when loading a PDF document.
▶ main.py
from langchain_community.document_loaders import PyPDFLoader

pyPDFLoader = PyPDFLoader("./nke-10k-2023.pdf", extract_images = True)

documentList = pyPDFLoader.load_and_split()

for document in documentList[:3]:
    print(document.page_content)
    print()
▶ requirements.txt
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
async-timeout==4.0.3
attrs==23.2.0
certifi==2024.6.2
charset-normalizer==3.3.2
coloredlogs==15.0.1
dataclasses-json==0.6.7
flatbuffers==24.3.25
frozenlist==1.4.1
greenlet==3.0.3
humanfriendly==10.0
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.2.6
langchain-community==0.2.6
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
marshmallow==3.21.3
mpmath==1.3.0
multidict==6.0.5
mypy-extensions==1.0.0
numpy==1.26.4
onnxruntime==1.18.0
opencv-python==4.10.0.84
orjson==3.10.5
packaging==24.1
pillow==10.3.0
protobuf==5.27.2
pyclipper==1.3.0.post5
pydantic==2.7.4
pydantic_core==2.18.4
pypdf==4.2.0
PyYAML==6.0.1
rapidocr-onnxruntime==1.3.22
requests==2.32.3
shapely==2.0.4
six==1.16.0
SQLAlchemy==2.0.31
sympy==1.12.1
tenacity==8.4.2
typing-inspect==0.9.0
typing_extensions==4.12.2
urllib3==2.2.2
yarl==1.9.4
※ The packages above were installed by running the pip install langchain-community pypdf rapidocr-onnxruntime command.
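※ Because main.py depends on all three installed packages, it can help to confirm they are importable before running it. The following is a minimal sketch, not part of the original example: the helper name find_missing_modules and the REQUIRED_MODULES list are hypothetical, and the import names are assumed to be the usual top-level module names for these pip packages.

```python
from importlib.util import find_spec

# Assumed top-level import names for the pip packages
# langchain-community, pypdf and rapidocr-onnxruntime.
REQUIRED_MODULES = ["langchain_community", "pypdf", "rapidocr_onnxruntime"]


def find_missing_modules(module_names):
    """Return the names in module_names that cannot be located for import."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All modules needed by main.py are importable.")
```

The check only looks the modules up without importing them, so it stays fast even though the OCR dependencies are large.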