[PYTHON/LANGCHAIN] create_react_agent function : Creating a CompiledStateGraph object and searching a Chroma vector store
■ Shows how to use the create_react_agent function to create a CompiledStateGraph object and search a Chroma vector store. ※ The value of the OPENAI_API_KEY environment variable is defined in the .env file.
■ Shows how to search a Chroma vector store while keeping a chat history. ※ The value of the OPENAI_API_KEY environment variable is defined in the .env file. ▶ main.py
import bs4

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.prompts import MessagesPlaceholder
from langchain.chains import create_history_aware_retriever
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.messages import AIMessage

load_dotenv()

chatOpenAI = ChatOpenAI(model = "gpt-4o")

# Load the blog post, split it into chunks, and index the chunks in Chroma.
webBaseLoader = WebBaseLoader(
    web_paths = ("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs = dict(parse_only = bs4.SoupStrainer(class_ = ("post-content", "post-title", "post-header")))
)

documentList = webBaseLoader.load()

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 200)

splitDocumentList = recursiveCharacterTextSplitter.split_documents(documentList)

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma.from_documents(documents = splitDocumentList, embedding = openAIEmbeddings)

vectorStoreRetriever = chroma.as_retriever()

# Step 1 : rewrite the latest question into a standalone question using the chat history.
systemMessage1 = "Given a chat history and the latest user question which might reference context in the chat history, formulate a standalone question which can be understood without the chat history. Do NOT answer the question, just reformulate it if needed and otherwise return it as is."

chatPromptTemplate1 = ChatPromptTemplate.from_messages(
    [
        ("system", systemMessage1),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}")
    ]
)

runnableBinding1 = create_history_aware_retriever(chatOpenAI, vectorStoreRetriever, chatPromptTemplate1)

# Step 2 : answer the question from the retrieved context.
systemMessage2 = "You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, say that you don't know. Use three sentences maximum and keep the answer concise.\n\n{context}"

chatPromptTemplate2 = ChatPromptTemplate.from_messages(
    [
        ("system", systemMessage2),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

runnableBinding2 = create_stuff_documents_chain(chatOpenAI, chatPromptTemplate2)

runnableBinding3 = create_retrieval_chain(runnableBinding1, runnableBinding2)

# Chat histories are kept in memory, keyed by session ID.
chatMessageHistoryDictionary = {}

def GetChatMessageHistoryDictionary(session_id : str) -> BaseChatMessageHistory:
    if session_id not in chatMessageHistoryDictionary:
        chatMessageHistoryDictionary[session_id] = ChatMessageHistory()
    return chatMessageHistoryDictionary[session_id]

runnableWithMessageHistory = RunnableWithMessageHistory(
    runnableBinding3,
    GetChatMessageHistoryDictionary,
    input_messages_key = "input",
    history_messages_key = "chat_history",
    output_messages_key = "answer",
)

responseDictionary1 = runnableWithMessageHistory.invoke(
    {"input" : "What is Task Decomposition?"},
    config = {"configurable" : {"session_id" : "abc123"}}
)

answer1 = responseDictionary1["answer"]

print(answer1)
print("-" * 50)

"""
Task Decomposition is a technique used to break down complex tasks into smaller, manageable steps. It is often implemented using methods like Chain of Thought (CoT) or Tree of Thoughts, which help in systematically exploring and reasoning through various possibilities. This approach enhances model performance by allowing a step-by-step analysis and execution of tasks.
"""

# The second question relies on the history ("it" refers to task decomposition).
responseDictionary2 = runnableWithMessageHistory.invoke(
    {"input" : "What are common ways of doing it?"},
    config = {"configurable" : {"session_id" : "abc123"}}
)

answer2 = responseDictionary2["answer"]

print(answer2)
print("-" * 50)

# Print the stored history for the session.
for message in chatMessageHistoryDictionary["abc123"].messages:
    if isinstance(message, AIMessage):
        prefix = "AI"
    else:
        prefix = "User"
    print(f"{prefix} : {message.content}")

print("-" * 50)

"""
Task decomposition is the process of breaking down a complex task into smaller, more manageable steps or subgoals. This approach, often used in conjunction with techniques like Chain of Thought (CoT), helps enhance model performance by enabling step-by-step reasoning. It can be achieved through prompting, task-specific instructions, or human inputs.
--------------------------------------------------
Common ways of performing task decomposition include using straightforward prompts like "Steps for XYZ.\n1." or "What are the subgoals for achieving XYZ?", employing task-specific instructions such as "Write a story outline" for writing a novel, and incorporating human inputs.
--------------------------------------------------
User : What is Task Decomposition?
AI : Task decomposition is the process of breaking down a complex task into smaller, more manageable steps or subgoals. This approach, often used in conjunction with techniques like Chain of Thought (CoT), helps enhance model performance by enabling step-by-step reasoning. It can be achieved through prompting, task-specific instructions, or human inputs.
User : What are common ways of doing it?
AI : Common ways of performing task decomposition include using straightforward prompts like "Steps for XYZ.\n1." or "What are the subgoals for achieving XYZ?", employing task-specific instructions such as "Write a story outline" for writing a novel, and incorporating human inputs.
--------------------------------------------------
"""
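The get-or-create pattern used for per-session chat history can be tried without an API key. The following is a minimal sketch, assuming a plain list of (role, text) tuples stands in for ChatMessageHistory; `get_history` and `record_turn` are hypothetical helper names, not LangChain APIs.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for ChatMessageHistory: one list of (role, text) per session.
history_by_session: Dict[str, List[Tuple[str, str]]] = {}

def get_history(session_id: str) -> List[Tuple[str, str]]:
    """Return the history for a session, creating an empty one on first use."""
    if session_id not in history_by_session:
        history_by_session[session_id] = []
    return history_by_session[session_id]

def record_turn(session_id: str, user_text: str, ai_text: str) -> None:
    """Append one user/AI exchange to the session's history."""
    history = get_history(session_id)
    history.append(("User", user_text))
    history.append(("AI", ai_text))

# Two turns in the same session share one history list.
record_turn("abc123", "What is Task Decomposition?", "(model answer)")
record_turn("abc123", "What are common ways of doing it?", "(model answer)")

for role, text in get_history("abc123"):
    print(f"{role} : {text}")
```

Because the history is looked up by session ID on every call, a second session ID would start from an empty list and would not see the "abc123" conversation.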
■ Shows how to create a custom retriever using the BaseRetriever class. ▶ main.py
import asyncio

from langchain_core.retrievers import BaseRetriever
from typing import List
from langchain_core.documents import Document
from langchain_core.callbacks import CallbackManagerForRetrieverRun

class CustomRetriever(BaseRetriever):
    """A toy retriever that contains the top k documents that contain the user query.

    This retriever only implements the sync method _get_relevant_documents.

    If the retriever were to involve file access or network access, it could benefit
    from a native async implementation of `_aget_relevant_documents`.

    As usual, with Runnables, there's a default async implementation that's provided
    that delegates to the sync implementation running on another thread.
    """

    documentList : List[Document]
    """List of documents to retrieve from."""

    k : int
    """Number of top results to return"""

    def _get_relevant_documents(self, query : str, *, run_manager : CallbackManagerForRetrieverRun) -> List[Document]:
        """Sync implementations for retriever."""
        matchingDocumentList = []
        for document in self.documentList:
            # Stop once k matches are collected (">" would return up to k + 1 documents).
            if len(matchingDocumentList) >= self.k:
                return matchingDocumentList
            if query.lower() in document.page_content.lower():
                matchingDocumentList.append(document)
        return matchingDocumentList

    # Optional : override _aget_relevant_documents to provide a more efficient native implementation.
    # async def _aget_relevant_documents(self, query : str, *, run_manager : AsyncCallbackManagerForRetrieverRun) -> List[Document]:
    #     """Asynchronously get documents relevant to a query.
    #     Args :
    #         query       : String to find relevant documents for
    #         run_manager : The callbacks handler to use
    #     Returns :
    #         List of relevant documents
    #     """

documentList = [
    Document(
        page_content = "Dogs are great companions, known for their loyalty and friendliness.",
        metadata = {"type" : "dog", "trait" : "loyalty"}
    ),
    Document(
        page_content = "Cats are independent pets that often enjoy their own space.",
        metadata = {"type" : "cat", "trait" : "independence"}
    ),
    Document(
        page_content = "Goldfish are popular pets for beginners, requiring relatively simple care.",
        metadata = {"type": "fish", "trait": "low maintenance"}
    ),
    Document(
        page_content = "Parrots are intelligent birds capable of mimicking human speech.",
        metadata = {"type": "bird", "trait": "intelligence"}
    ),
    Document(
        page_content = "Rabbits are social animals that need plenty of space to hop around.",
        metadata = {"type": "rabbit", "trait": "social"}
    ),
]

customRetriever = CustomRetriever(documentList = documentList, k = 3)

responseDocumentList = customRetriever.invoke("that")

print(responseDocumentList)
print("-" * 50)

responseDocumentListList = customRetriever.batch(["dog", "cat"])

print(responseDocumentListList)
print("-" * 50)

async def main():
    responseDocumentList = await customRetriever.ainvoke("that")
    print(responseDocumentList)
    print("-" * 50)

    async for eventDictionary in customRetriever.astream_events("bar", version = "v2"):
        print(eventDictionary)
    print("-" * 50)

asyncio.run(main())

"""
[Document(metadata={'type': 'cat', 'trait': 'independence'}, page_content='Cats are independent pets that often enjoy their own space.'), Document(metadata={'type': 'rabbit', 'trait': 'social'}, page_content='Rabbits are social animals that need plenty of space to hop around.')]
---------------------------------------------------
[[Document(metadata={'type': 'dog', 'trait': 'loyalty'}, page_content='Dogs are great companions, known for their loyalty and friendliness.')], [Document(metadata={'type': 'cat', 'trait': 'independence'}, page_content='Cats are independent pets that often enjoy their own space.')]]
---------------------------------------------------
[Document(metadata={'type': 'cat', 'trait': 'independence'}, page_content='Cats are independent pets that often enjoy their own space.'), Document(metadata={'type': 'rabbit', 'trait': 'social'}, page_content='Rabbits are social animals that need plenty of space to hop around.')]
---------------------------------------------------
{'event': 'on_retriever_start', 'data': {'input': 'bar'}, 'name': 'ToyRetriever', 'tags': [], 'run_id': '359f2805-45a0-422a-b8d9-ccc9067ea7de', 'metadata': {'ls_retriever_name': 'toy'}, 'parent_ids': []}
{'event': 'on_retriever_end', 'data': {'output': []}, 'run_id': '359f2805-45a0-422a-b8d9-ccc9067ea7de', 'name': 'ToyRetriever', 'tags': [], 'metadata': {'ls_retriever_name': 'toy'}, 'parent_ids': []}
---------------------------------------------------
"""
▶ requirements.txt
aiohappyeyeballs==2.4.3
aiohttp==3.11.7
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.6.2.post1
attrs==24.2.0
certifi==2024.8.30
charset-normalizer==3.4.0
frozenlist==1.5.0
greenlet==3.1.1
h11==0.14.0
httpcore==1.0.7
httpx==0.27.2
idna==3.10
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.3.7
langchain-core==0.3.19
langchain-text-splitters==0.3.2
langsmith==0.1.144
multidict==6.1.0
numpy==1.26.4
orjson==3.10.11
packaging==24.2
propcache==0.2.0
pydantic==2.10.1
pydantic_core==2.27.1
PyYAML==6.0.2
requests==2.32.3
requests-toolbelt==1.0.0
sniffio==1.3.1
SQLAlchemy==2.0.36
tenacity==9.0.0
typing_extensions==4.12.2
urllib3==2.2.3
yarl==1.18.0
※ The pip install langchain command was run.
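The matching rule of such a toy retriever (case-insensitive substring match, at most k results, in input order) can be checked in isolation. This is a minimal sketch on plain strings rather than Document objects; `top_k_substring_matches` is a hypothetical helper name, not part of LangChain.

```python
from typing import List

def top_k_substring_matches(texts: List[str], query: str, k: int) -> List[str]:
    """Return at most k texts that contain query (case-insensitive), in input order."""
    matches: List[str] = []
    for text in texts:
        # Stop as soon as k matches have been collected.
        if len(matches) >= k:
            break
        if query.lower() in text.lower():
            matches.append(text)
    return matches

pet_texts = [
    "Dogs are great companions, known for their loyalty and friendliness.",
    "Cats are independent pets that often enjoy their own space.",
    "Goldfish are popular pets for beginners, requiring relatively simple care.",
    "Parrots are intelligent birds capable of mimicking human speech.",
    "Rabbits are social animals that need plenty of space to hop around.",
]

# Fewer than k texts contain "that", so everything that matches is returned.
print(top_k_substring_matches(pet_texts, "that", 3))

# Every text contains "are", so the result is capped at k = 2.
print(top_k_substring_matches(pet_texts, "are", 2))
```

The second call is the useful check: a query that matches every text exercises the cap, which a query with few matches never does.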
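■ Shows how to use the create_retriever_tool function to create a FAISS vector store search tool. ※ The value of the OPENAI_API_KEY environment variable is defined in the .env file. ▶ main.py
■ Shows how to use the from_documents static method of the FAISS class to create a FAISS object. ※ The value of the OPENAI_API_KEY environment variable is defined in the .env file. ▶ main.py
■ Shows how to pass to the create_react_agent function a StructuredTool object created with the as_tool method of the RunnableSequence class. ※ The value of the OPENAI_API_KEY environment variable is defined in the .env file.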
■ Shows how to use the create_react_agent function to create a CompiledStateGraph object. ※ The value of the OPENAI_API_KEY environment variable is defined in the .env file. ▶ main.py
from dotenv import load_dotenv
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

load_dotenv()

documentList = [
    Document(page_content = "Dogs are great companions, known for their loyalty and friendliness."),
    Document(page_content = "Cats are independent pets that often enjoy their own space."         )
]

openAIEmbeddings = OpenAIEmbeddings()

inMemoryVectorStore = InMemoryVectorStore.from_documents(
    documentList,
    embedding = openAIEmbeddings
)

vectorStoreRetriever = inMemoryVectorStore.as_retriever(
    search_type = "similarity",
    search_kwargs = {"k" : 1}
)

chatOpenAI = ChatOpenAI(model = "gpt-4o-mini")

tool = vectorStoreRetriever.as_tool(
    name = "pet_info_retriever",
    description = "Get information about pets.",
)

toolList = [tool]

compiledStateGraph = create_react_agent(chatOpenAI, toolList)

for addableUpdatesDict in compiledStateGraph.stream({"messages" : [("human", "What are dogs known for?")]}):
    print(addableUpdatesDict)
    print("-" * 100)

"""
{
    'agent' : {
        'messages' : [
            AIMessage(
                content = 'Dogs are known for several characteristics and traits, including:\n\n1. **Companionship**: Dogs are often referred to as "man\'s best friend" due to their loyalty and companionship.\n\n2. **Intelligence**: Many dog breeds are highly intelligent and capable of learning a variety of commands and tricks.\n\n3. **Variety of Breeds**: There are hundreds of dog breeds, each with its own unique traits, sizes, and temperaments.\n\n4. **Working Abilities**: Dogs are used in various roles, such as service animals, search and rescue, therapy dogs, and police or military dogs.\n\n5. **Strong Sense of Smell**: Dogs have an exceptional sense of smell, which makes them excellent for tracking and detection purposes.\n\n6. **Social Behavior**: Dogs are social animals and often thrive in the company of humans and other pets.\n\n7. **Playfulness**: Many dogs enjoy playing and being active, which makes them great companions for outdoor activities.\n\n8. **Emotional Support**: Dogs are known to provide emotional support and comfort to their owners, often sensing when someone is feeling down.\n\n9. **Protectiveness**: Many dogs have a natural instinct to protect their home and family, making them good guard animals.\n\n10. **Communication**: Dogs communicate through a combination of vocalizations, body language, and facial expressions. \n\nOverall, dogs are appreciated for their loyalty, intelligence, and the deep bond they can form with humans.',
                additional_kwargs = {'refusal' : None},
                response_metadata = {
                    'token_usage' : {
                        'completion_tokens' : 299,
                        'prompt_tokens' : 58,
                        'total_tokens' : 357,
                        'completion_tokens_details' : {'reasoning_tokens' : 0}
                    },
                    'model_name' : 'gpt-4o-mini-2024-07-18',
                    'system_fingerprint' : 'fp_f85bea6784',
                    'finish_reason' : 'stop',
                    'logprobs' : None
                },
                id = 'run-b2d78792-6c54-422e-8739-07662d2eb56b-0',
                usage_metadata = {'input_tokens' : 58, 'output_tokens' : 299, 'total_tokens' : 357}
            )
        ]
    }
}
"""
■ Shows how to use the as_tool method of the VectorStoreRetriever class to create a Tool object. ▶ main.py
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore

documentList = [
    Document(page_content = "Dogs are great companions, known for their loyalty and friendliness."),
    Document(page_content = "Cats are independent pets that often enjoy their own space."         )
]

openAIEmbeddings = OpenAIEmbeddings()

inMemoryVectorStore = InMemoryVectorStore.from_documents(
    documentList,
    embedding = openAIEmbeddings
)

vectorStoreRetriever = inMemoryVectorStore.as_retriever(
    search_type = "similarity",
    search_kwargs = {"k" : 1}
)

tool = vectorStoreRetriever.as_tool(
    name = "pet_info_retriever",
    description = "Get information about pets.",
)
※ The pip install langchain-openai command was run.
■ Shows how to use the as_retriever method of the InMemoryVectorStore class to create a VectorStoreRetriever object. ▶ main.py
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore

documentList = [
    Document(page_content = "Dogs are great companions, known for their loyalty and friendliness."),
    Document(page_content = "Cats are independent pets that often enjoy their own space."         )
]

openAIEmbeddings = OpenAIEmbeddings()

inMemoryVectorStore = InMemoryVectorStore.from_documents(
    documentList,
    embedding = openAIEmbeddings
)

vectorStoreRetriever = inMemoryVectorStore.as_retriever(
    search_type = "similarity",
    search_kwargs = {"k" : 1}
)
※ The pip install langchain-openai command was run.
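What a similarity search with k = 1 returns can be illustrated without calling an embedding API. The sketch below assumes tiny hand-written 3-dimensional vectors in place of OpenAIEmbeddings output and ranks by cosine similarity; the vectors and the `top_k_by_cosine` helper are hypothetical, and this is not InMemoryVectorStore's actual code.

```python
import math
from typing import List, Sequence, Tuple

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_by_cosine(
    query_vector: Sequence[float],
    indexed: List[Tuple[str, Sequence[float]]],
    k: int,
) -> List[str]:
    """Return the texts of the k entries most similar to the query vector."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Made-up vectors: the first axis loosely means "dog-like", the second "cat-like".
indexed_documents = [
    ("Dogs are great companions, known for their loyalty and friendliness.", [0.9, 0.1, 0.0]),
    ("Cats are independent pets that often enjoy their own space.", [0.1, 0.9, 0.0]),
]

# Made-up vector standing in for the embedding of a question about dogs.
query_vector = [0.8, 0.2, 0.1]

print(top_k_by_cosine(query_vector, indexed_documents, 1))
```

With k = 1 only the single best-ranked text is returned, which is why the retriever above hands the agent one document per query.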
■ Shows how to use the mock_now function with the TimeWeightedVectorStoreRetriever class to set a mock current time. ▶ main.py
import faiss

from langchain_openai import OpenAIEmbeddings
from langchain_community.docstore import InMemoryDocstore
from langchain_community.vectorstores import FAISS
from langchain.retrievers import TimeWeightedVectorStoreRetriever
from datetime import datetime
from datetime import timedelta
from langchain_core.documents import Document
from langchain_core.utils import mock_now

openAIEmbeddings = OpenAIEmbeddings()

embeddingSize = 1536

indexFlatL2 = faiss.IndexFlatL2(embeddingSize)

inMemoryDocstore = InMemoryDocstore({})

# Named faissVectorStore so that the imported faiss module is not shadowed.
faissVectorStore = FAISS(openAIEmbeddings, indexFlatL2, inMemoryDocstore, {})

timeWeightedVectorStoreRetriever = TimeWeightedVectorStoreRetriever(vectorstore = faissVectorStore, decay_rate = 0.999, k = 1)

yesterday = datetime.now() - timedelta(days = 1)

timeWeightedVectorStoreRetriever.add_documents([Document(page_content = "hello world", metadata = {"last_accessed_at" : yesterday})])
timeWeightedVectorStoreRetriever.add_documents([Document(page_content = "hello foo")])

# Inside this block the retriever sees the mocked time as the current time.
with mock_now(datetime(2024, 9, 16, 10, 11)):
    # invoke replaces the deprecated get_relevant_documents method.
    resultDocumentList = timeWeightedVectorStoreRetriever.invoke("hello world")
    for resultDocument in resultDocumentList:
        print(resultDocument)
▶ requirements.txt
aiohappyeyeballs==2.4.0
aiohttp==3.10.5
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.4.0
attrs==24.2.0
certifi==2024.8.30
charset-normalizer==3.3.2
colorama==0.4.6
dataclasses-json==0.6.7
distro==1.9.0
faiss-cpu==1.8.0.post1
frozenlist==1.4.1
greenlet==3.1.0
h11==0.14.0
httpcore==1.0.5
httpx==0.27.2
idna==3.10
jiter==0.5.0
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.3.0
langchain-community==0.3.0
langchain-core==0.3.0
langchain-openai==0.2.0
langchain-text-splitters==0.3.0
langsmith==0.1.121
marshmallow==3.22.0
multidict==6.1.0
mypy-extensions==1.0.0
numpy==1.26.4
openai==1.45.1
orjson==3.10.7
packaging==24.1
pydantic==2.9.1
pydantic-settings==2.5.2
pydantic_core==2.23.3
python-dotenv==1.0.1
PyYAML==6.0.2
regex==2024.9.11
requests==2.32.3
sniffio==1.3.1
SQLAlchemy==2.0.35
tenacity==8.5.0
tiktoken==0.7.0
tqdm==4.66.5
typing-inspect==0.9.0
typing_extensions==4.12.2
urllib3==2.2.3
yarl==1.11.1
※ pip install langchain-community langchain-openai
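■ Shows how to retrieve documents with a high decay rate using the TimeWeightedVectorStoreRetriever class. ※ With a high decay rate (e.g., several 9s), the recency score quickly drops to 0!
■ Shows how to retrieve documents with a low decay rate using the TimeWeightedVectorStoreRetriever class. ※ A low decay rate (here set extremely close to 0) means that memories are "remembered" for longer.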
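The effect of the decay rate can be seen numerically. The sketch below assumes the scoring rule given in the LangChain documentation for this retriever, semantic_similarity + (1.0 - decay_rate) ^ hours_passed, and computes only the recency term; `recency_score` is a hypothetical helper name, not a LangChain function.

```python
def recency_score(decay_rate: float, hours_passed: float) -> float:
    """Recency term of the time-weighted score: (1 - decay_rate) ** hours_passed."""
    return (1.0 - decay_rate) ** hours_passed

# Compare a high decay rate, a moderate one, and one extremely close to 0.
for decay_rate in (0.999, 0.01, 0.0000000000000000000000001):
    scores = [recency_score(decay_rate, hours) for hours in (1, 24, 168)]
    print(f"decay_rate = {decay_rate} : {scores}")
```

With 0.999 the recency term is already about 0.001 after one hour, so a document that has not been accessed recently gets almost no recency bonus. With a rate this close to 0 the term stays at 1.0 in floating-point arithmetic, so every document keeps the same bonus and ranking is decided by similarity alone.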
■ Shows how to create a TimeWeightedVectorStoreRetriever object using the vectorstore/decay_rate arguments of the TimeWeightedVectorStoreRetriever constructor. ▶ main.py
import faiss

from langchain_openai import OpenAIEmbeddings
from langchain_community.docstore import InMemoryDocstore
from langchain_community.vectorstores import FAISS
from langchain.retrievers import TimeWeightedVectorStoreRetriever

openAIEmbeddings = OpenAIEmbeddings()

embeddingSize = 1536

indexFlatL2 = faiss.IndexFlatL2(embeddingSize)

inMemoryDocstore = InMemoryDocstore({})

# Named faissVectorStore so that the imported faiss module is not shadowed.
faissVectorStore = FAISS(openAIEmbeddings, indexFlatL2, inMemoryDocstore, {})

timeWeightedVectorStoreRetriever = TimeWeightedVectorStoreRetriever(vectorstore = faissVectorStore, decay_rate = 0.0000000000000000000000001, k = 1)
▶ requirements.txt
aiohappyeyeballs==2.4.0
aiohttp==3.10.5
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.4.0
attrs==24.2.0
certifi==2024.8.30
charset-normalizer==3.3.2
colorama==0.4.6
dataclasses-json==0.6.7
distro==1.9.0
faiss-cpu==1.8.0.post1
frozenlist==1.4.1
greenlet==3.1.0
h11==0.14.0
httpcore==1.0.5
httpx==0.27.2
idna==3.10
jiter==0.5.0
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.3.0
langchain-community==0.3.0
langchain-core==0.3.0
langchain-openai==0.2.0
langchain-text-splitters==0.3.0
langsmith==0.1.121
marshmallow==3.22.0
multidict==6.1.0
mypy-extensions==1.0.0
numpy==1.26.4
openai==1.45.1
orjson==3.10.7
packaging==24.1
pydantic==2.9.1
pydantic-settings==2.5.2
pydantic_core==2.23.3
python-dotenv==1.0.1
PyYAML==6.0.2
regex==2024.9.11
requests==2.32.3
sniffio==1.3.1
SQLAlchemy==2.0.35
tenacity==8.5.0
tiktoken==0.7.0
tqdm==4.66.5
typing-inspect==0.9.0
typing_extensions==4.12.2
urllib3==2.2.3
yarl==1.11.1
※ pip install langchain-community
■ Shows how to set an LCEL chain through the query_constructor argument of the SelfQueryRetriever constructor. ※ The value of the OPENAI_API_KEY environment variable is defined in the .env file. ▶ main.py
from dotenv import load_dotenv
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.chains.query_constructor.base import get_query_constructor_prompt
from langchain.chains.query_constructor.base import StructuredQueryOutputParser
from langchain_openai import ChatOpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_community.query_constructors.chroma import ChromaTranslator

load_dotenv()

documentList = [
    Document(
        page_content = "A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata = {"year" : 1993, "rating" : 7.7, "genre" : "science fiction"}
    ),
    Document(
        page_content = "Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata = {"year" : 2010, "director" : "Christopher Nolan", "rating" : 8.2}
    ),
    Document(
        page_content = "A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata = {"year" : 2006, "director" : "Satoshi Kon", "rating" : 8.6}
    ),
    Document(
        page_content = "A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata = {"year" : 2019, "director" : "Greta Gerwig", "rating" : 8.3}
    ),
    Document(
        page_content = "Toys come alive and have a blast doing so",
        metadata = {"year" : 1995, "genre" : "animated"}
    ),
    Document(
        page_content = "Three men walk into the Zone, three men walk out of the Zone",
        metadata = {"year" : 1979, "director" : "Andrei Tarkovsky", "genre" : "thriller", "rating" : 9.9}
    )
]

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma.from_documents(documentList, openAIEmbeddings)

attributeInfoList = [
    AttributeInfo(
        name = "genre",
        description = "The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type = "string"
    ),
    AttributeInfo(
        name = "year",
        description = "The year the movie was released",
        type = "integer"
    ),
    AttributeInfo(
        name = "director",
        description = "The name of the movie director",
        type = "string"
    ),
    AttributeInfo(
        name = "rating",
        description = "A 1-10 rating for the movie",
        type = "float"
    )
]

documentContentDescription = "Brief summary of a movie"

fewShotPromptTemplate = get_query_constructor_prompt(
    documentContentDescription,
    attributeInfoList
)

structuredQueryOutputParser = StructuredQueryOutputParser.from_components()

chatOpenAI = ChatOpenAI(temperature = 0)

runnableSequence = fewShotPromptTemplate | chatOpenAI | structuredQueryOutputParser

selfQueryRetriever = SelfQueryRetriever(
    query_constructor = runnableSequence,
    vectorstore = chroma,
    structured_query_translator = ChromaTranslator(),
)

resultDocumentList = selfQueryRetriever.invoke("What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated")

for resultDocument in resultDocumentList:
    print(resultDocument)
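■ Shows how to create a SelfQueryRetriever object using the enable_limit argument of the from_llm static method of the SelfQueryRetriever class. ※ The value of the OPENAI_API_KEY environment variable is defined in the .env file.
■ Shows how to create a self-query retriever using the SelfQueryRetriever class. ※ The value of the OPENAI_API_KEY environment variable is defined in the .env file. ▶ main.py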
from dotenv import load_dotenv
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.chains.query_constructor.base import AttributeInfo
from langchain_openai import ChatOpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever

load_dotenv()

documentList = [
    Document(
        page_content = "A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata = {"year" : 1993, "rating" : 7.7, "genre" : "science fiction"}
    ),
    Document(
        page_content = "Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata = {"year" : 2010, "director" : "Christopher Nolan", "rating" : 8.2}
    ),
    Document(
        page_content = "A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata = {"year" : 2006, "director" : "Satoshi Kon", "rating" : 8.6}
    ),
    Document(
        page_content = "A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata = {"year" : 2019, "director" : "Greta Gerwig", "rating" : 8.3}
    ),
    Document(
        page_content = "Toys come alive and have a blast doing so",
        metadata = {"year" : 1995, "genre" : "animated"}
    ),
    Document(
        page_content = "Three men walk into the Zone, three men walk out of the Zone",
        metadata = {"year" : 1979, "director" : "Andrei Tarkovsky", "genre" : "thriller", "rating" : 9.9}
    )
]

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma.from_documents(documentList, openAIEmbeddings)

attributeInfoList = [
    AttributeInfo(
        name = "genre",
        description = "The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type = "string"
    ),
    AttributeInfo(
        name = "year",
        description = "The year the movie was released",
        type = "integer"
    ),
    AttributeInfo(
        name = "director",
        description = "The name of the movie director",
        type = "string"
    ),
    AttributeInfo(
        name = "rating",
        description = "A 1-10 rating for the movie",
        type = "float"
    )
]

documentContentDescription = "Brief summary of a movie"

chatOpenAI = ChatOpenAI(temperature = 0)

selfQueryRetriever = SelfQueryRetriever.from_llm(
    chatOpenAI,
    chroma,
    documentContentDescription,
    attributeInfoList
)

# Specify only a filter.
resultDocumentList = selfQueryRetriever.invoke("I want to watch a movie rated higher than 8.5")

for resultDocument in resultDocumentList:
    print(resultDocument.page_content)

print()

# Specify a query and a filter.
resultDocumentList = selfQueryRetriever.invoke("Has Greta Gerwig directed any movies about women")

for resultDocument in resultDocumentList:
    print(resultDocument.page_content)

print()

# Specify a composite filter.
resultDocumentList = selfQueryRetriever.invoke("What's a highly rated (above 8.5) science fiction film?")

for resultDocument in resultDocumentList:
    print(resultDocument.page_content)

# Specify a query and a composite filter.
resultDocumentList = selfQueryRetriever.invoke("What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated")

print()

for resultDocument in resultDocumentList:
    print(resultDocument.page_content)
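The result of a self-query step can be pictured as a semantic query plus a metadata filter. The sketch below assumes that the question about a toy movie released after 1990 but before 2005 is translated into the filter year > 1990, year < 2005 and genre == "animated". The filter is written by hand here, whereas SelfQueryRetriever derives it with the LLM; `matches_filter` is a hypothetical helper and the movie list is a shortened copy of the sample data.

```python
from typing import Any, Dict, List

movies: List[Dict[str, Any]] = [
    {"summary": "A bunch of scientists bring back dinosaurs and mayhem breaks loose",
     "year": 1993, "rating": 7.7, "genre": "science fiction"},
    {"summary": "Toys come alive and have a blast doing so",
     "year": 1995, "genre": "animated"},
    {"summary": "Three men walk into the Zone, three men walk out of the Zone",
     "year": 1979, "director": "Andrei Tarkovsky", "genre": "thriller", "rating": 9.9},
]

def matches_filter(metadata: Dict[str, Any]) -> bool:
    """Hand-written composite filter: 1990 < year < 2005 and genre == 'animated'."""
    year = metadata.get("year")
    if year is None:
        return False
    return 1990 < year < 2005 and metadata.get("genre") == "animated"

# Only entries that pass the metadata filter would be ranked by similarity afterwards.
filtered_summaries = [movie["summary"] for movie in movies if matches_filter(movie)]

print(filtered_summaries)
```

Missing metadata keys are treated as non-matching, which matters for this data set because not every movie has a genre or a rating.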
■ Shows how to retrieve documents using the parent_splitter/child_splitter arguments of the ParentDocumentRetriever constructor. ※ The value of the OPENAI_API_KEY environment variable is defined in the .env file. ▶ main.py
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.storage import InMemoryStore
from langchain.retrievers import ParentDocumentRetriever

load_dotenv()

textLoaderList = [
    TextLoader("paul_graham_essay.txt" , encoding = "utf-8"),
    TextLoader("state_of_the_union.txt", encoding = "utf-8")
]

documentList = []

for textLoader in textLoaderList:
    documentList.extend(textLoader.load())

# Splitter 1 creates the parent chunks, splitter 2 creates the child chunks that are embedded.
recursiveCharacterTextSplitter1 = RecursiveCharacterTextSplitter(chunk_size = 2000)
recursiveCharacterTextSplitter2 = RecursiveCharacterTextSplitter(chunk_size = 400 )

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma(collection_name = "split_parents", embedding_function = openAIEmbeddings)

inMemoryStore = InMemoryStore()

parentDocumentRetriever = ParentDocumentRetriever(
    vectorstore = chroma,
    docstore = inMemoryStore,
    parent_splitter = recursiveCharacterTextSplitter1,
    child_splitter = recursiveCharacterTextSplitter2
)

parentDocumentRetriever.add_documents(documentList, ids = None)

# Number of parent chunks stored in the docstore.
print(len(list(inMemoryStore.yield_keys())))

# Searching the vector store directly returns a small child chunk.
resultSplitDocumentList = chroma.similarity_search("justice breyer")

print(len(resultSplitDocumentList[0].page_content))

# Searching through the retriever returns the larger parent chunk.
resultDocumentList = parentDocumentRetriever.invoke("justice breyer")

print(len(resultDocumentList[0].page_content))
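The relationship between child chunks and parent chunks can be sketched in plain Python. The example assumes fixed-width slicing instead of RecursiveCharacterTextSplitter and substring matching instead of embedding search; `split_fixed` and `build_index` are hypothetical helpers, not LangChain APIs.

```python
import uuid
from typing import Dict, List, Tuple

def split_fixed(text: str, chunk_size: int) -> List[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def build_index(
    document: str, parent_size: int, child_size: int
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Return (parent_store, child_index); each child entry keeps its parent's ID."""
    parent_store: Dict[str, str] = {}
    child_index: List[Tuple[str, str]] = []
    for parent in split_fixed(document, parent_size):
        parent_id = str(uuid.uuid4())
        parent_store[parent_id] = parent
        for child in split_fixed(parent, child_size):
            child_index.append((child, parent_id))
    return parent_store, child_index

# Synthetic document: 100 numbered sentences.
document = "".join(f"Sentence number {i:03d}. " for i in range(100))
parent_store, child_index = build_index(document, parent_size=200, child_size=40)

# Search the small chunks, then follow the stored ID to the larger parent chunk.
hits = [(child, parent_id) for child, parent_id in child_index if "number 042" in child]

for child, parent_id in hits[:1]:
    print(len(child), len(parent_store[parent_id]))
```

Small chunks keep the match precise, while the ID lookup hands back enough surrounding text to be useful as context.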
■ Shows how to retrieve parent documents using the ParentDocumentRetriever class. ※ The value of the OPENAI_API_KEY environment variable is defined in the .env file. ▶ main.py
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.storage import InMemoryStore
from langchain.retrievers import ParentDocumentRetriever

load_dotenv()

textLoaderList = [
    TextLoader("paul_graham_essay.txt" , encoding = "utf-8"),
    TextLoader("state_of_the_union.txt", encoding = "utf-8")
]

documentList = []

for textLoader in textLoaderList:
    documentList.extend(textLoader.load())

# Only a child splitter is set, so each full document acts as the parent.
recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter(chunk_size = 400)

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma(collection_name = "full_documents", embedding_function = openAIEmbeddings)

inMemoryStore = InMemoryStore()

parentDocumentRetriever = ParentDocumentRetriever(
    vectorstore = chroma,
    docstore = inMemoryStore,
    child_splitter = recursiveCharacterTextSplitter,
)

parentDocumentRetriever.add_documents(documentList, ids = None)

print(list(inMemoryStore.yield_keys()))

resultSplitDocumentList = chroma.similarity_search("justice breyer")

print(len(resultSplitDocumentList[0].page_content))

resultDocumentList = parentDocumentRetriever.invoke("justice breyer")

print(len(resultDocumentList[0].page_content))
▶ requirements.txt
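■ Improving retrieval with the MultiVectorRetriever class by generating hypothetical questions and linking them to documents ※ Define the OPENAI_API_KEY environment variable in the .env file. ▶ main.py
■ Shows how to associate summaries with documents for retrieval using the MultiVectorRetriever class.
※ Define the OPENAI_API_KEY environment variable in the .env file.
▶ main.py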
import uuid
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_core.documents import Document

load_dotenv()

idKey = "doc_id"

textLoaderList = [
    TextLoader("paul_graham_essay.txt" , encoding = "utf-8"),
    TextLoader("state_of_the_union.txt", encoding = "utf-8")
]

documentList = []

for textLoader in textLoaderList:
    documentList.extend(textLoader.load())

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter(chunk_size = 10000)

splitDocumentList = recursiveCharacterTextSplitter.split_documents(documentList)

# Assign one UUID per chunk and record it in the chunk metadata.
splitDocumentIDList = [str(uuid.uuid4()) for _ in splitDocumentList]

for i, splitDocument in enumerate(splitDocumentList):
    splitDocument.metadata[idKey] = splitDocumentIDList[i]

chatOpenAI = ChatOpenAI(model_name = "gpt-4o-mini")

runnableSequence = (
    {"doc" : lambda document : document.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document :\n\n{doc}")
    | chatOpenAI
    | StrOutputParser()
)

# Summarize every chunk, running at most 5 requests concurrently.
summaryList = runnableSequence.batch(splitDocumentList, {"max_concurrency" : 5})

# Each summary carries the ID of the chunk it was generated from.
summaryDocumentList = [
    Document(page_content = summary, metadata = {idKey : splitDocumentIDList[i]})
    for i, summary in enumerate(summaryList)
]

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma(collection_name = "summaries", embedding_function = openAIEmbeddings)

inMemoryByteStore = InMemoryByteStore()

multiVectorRetriever = MultiVectorRetriever(
    vectorstore = chroma,
    byte_store  = inMemoryByteStore,
    id_key      = idKey
)

# Summaries are embedded; the original chunks are stored in the docstore under the same IDs.
multiVectorRetriever.vectorstore.add_documents(summaryDocumentList)
multiVectorRetriever.docstore.mset(list(zip(splitDocumentIDList, splitDocumentList)))

# The original chunks are also embedded alongside the summaries.
multiVectorRetriever.vectorstore.add_documents(splitDocumentList)

resultDocumentList = multiVectorRetriever.vectorstore.similarity_search("justice breyer")

for resultDocument in resultDocumentList:
    print(resultDocument.metadata)
■ Shows how to run an MMR (Maximal Marginal Relevance) search using the search_type attribute of the MultiVectorRetriever class.
▶ main.py
import uuid
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.retrievers.multi_vector import SearchType

textLoaderList = [
    TextLoader("paul_graham_essay.txt" , encoding = "utf-8"),
    TextLoader("state_of_the_union.txt", encoding = "utf-8")
]

documentList = []

for textLoader in textLoaderList:
    documentList.extend(textLoader.load())

recursiveCharacterTextSplitter1 = RecursiveCharacterTextSplitter(chunk_size = 10000)

splitDocumentList = recursiveCharacterTextSplitter1.split_documents(documentList)

splitDocumentIDList = [str(uuid.uuid4()) for _ in splitDocumentList]

recursiveCharacterTextSplitter2 = RecursiveCharacterTextSplitter(chunk_size = 400)

totalSplitSplitDocumentList = []

for i, splitDocument in enumerate(splitDocumentList):
    splitDocumentID = splitDocumentIDList[i]
    splitSplitDocumentList = recursiveCharacterTextSplitter2.split_documents([splitDocument])
    # Tag each child chunk with the ID of its parent chunk.
    for splitSplitDocument in splitSplitDocumentList:
        splitSplitDocument.metadata["doc_id"] = splitDocumentID
    totalSplitSplitDocumentList.extend(splitSplitDocumentList)

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma(collection_name = "full_documents", embedding_function = openAIEmbeddings)

inMemoryByteStore = InMemoryByteStore()

multiVectorRetriever = MultiVectorRetriever(
    vectorstore = chroma,
    byte_store  = inMemoryByteStore,
    id_key      = "doc_id"
)

# Switch the child-chunk search from plain similarity to MMR.
multiVectorRetriever.search_type = SearchType.mmr

multiVectorRetriever.vectorstore.add_documents(totalSplitSplitDocumentList)
multiVectorRetriever.docstore.mset(list(zip(splitDocumentIDList, splitDocumentList)))

resultDocumentList = multiVectorRetriever.invoke("justice breyer")

resultDocument = resultDocumentList[0]

print(resultDocument)
▶ requirements.txt
aiohappyeyeballs==2.4.0 aiohttp==3.10.5 aiosignal==1.3.1 annotated-types==0.7.0 anyio==4.4.0 asgiref==3.8.1 attrs==24.2.0 backoff==2.2.1 bcrypt==4.2.0 build==1.2.2 cachetools==5.5.0 certifi==2024.8.30 charset-normalizer==3.3.2 chroma-hnswlib==0.7.3 chromadb==0.5.3 click==8.1.7 colorama==0.4.6 coloredlogs==15.0.1 dataclasses-json==0.6.7 Deprecated==1.2.14 distro==1.9.0 fastapi==0.114.2 filelock==3.16.0 flatbuffers==24.3.25 frozenlist==1.4.1 fsspec==2024.9.0 google-auth==2.34.0 googleapis-common-protos==1.65.0 greenlet==3.1.0 grpcio==1.66.1 h11==0.14.0 httpcore==1.0.5 httptools==0.6.1 httpx==0.27.2 huggingface-hub==0.24.7 humanfriendly==10.0 idna==3.9 importlib_metadata==8.4.0 importlib_resources==6.4.5 jiter==0.5.0 jsonpatch==1.33 jsonpointer==3.0.0 kubernetes==30.1.0 langchain==0.3.0 langchain-chroma==0.1.4 langchain-community==0.3.0 langchain-core==0.3.0 langchain-openai==0.2.0 langchain-text-splitters==0.3.0 langsmith==0.1.120 markdown-it-py==3.0.0 marshmallow==3.22.0 mdurl==0.1.2 mmh3==4.1.0 monotonic==1.6 mpmath==1.3.0 multidict==6.1.0 mypy-extensions==1.0.0 numpy==1.26.4 oauthlib==3.2.2 onnxruntime==1.19.2 openai==1.45.0 opentelemetry-api==1.27.0 opentelemetry-exporter-otlp-proto-common==1.27.0 opentelemetry-exporter-otlp-proto-grpc==1.27.0 opentelemetry-instrumentation==0.48b0 opentelemetry-instrumentation-asgi==0.48b0 opentelemetry-instrumentation-fastapi==0.48b0 opentelemetry-proto==1.27.0 opentelemetry-sdk==1.27.0 opentelemetry-semantic-conventions==0.48b0 opentelemetry-util-http==0.48b0 orjson==3.10.7 overrides==7.7.0 packaging==24.1 posthog==3.6.5 protobuf==4.25.4 pyasn1==0.6.1 pyasn1_modules==0.4.1 pydantic==2.9.1 pydantic-settings==2.5.2 pydantic_core==2.23.3 Pygments==2.18.0 PyPika==0.48.9 pyproject_hooks==1.1.0 pyreadline3==3.4.3 python-dateutil==2.9.0.post0 python-dotenv==1.0.1 PyYAML==6.0.2 regex==2024.9.11 requests==2.32.3 requests-oauthlib==2.0.0 rich==13.8.1 rsa==4.9 setuptools==74.1.2 shellingham==1.5.4 six==1.16.0 sniffio==1.3.1 
SQLAlchemy==2.0.34 starlette==0.38.5 sympy==1.13.2 tenacity==8.5.0 tiktoken==0.7.0 tokenizers==0.20.0 tqdm==4.66.5 typer==0.12.5 typing-inspect==0.9.0 typing_extensions==4.12.2 urllib3==2.2.3 uvicorn==0.30.6 watchfiles==0.24.0 websocket-client==1.8.0 websockets==13.0.1 wrapt==1.16.0 yarl==1.11.1 zipp==3.20.2 |
※ pip install langchain
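As background for the MMR example above, the selection rule can be sketched in plain Python. This is a minimal illustration of the general MMR idea, not the code LangChain or Chroma runs internally; the function names `cosine_similarity` and `mmr_select`, the parameter `lambda_mult`, and the toy vectors are assumptions made for this sketch, and all vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(query_vector, document_vectors, k = 2, lambda_mult = 0.5):
    """Return the indices of up to k documents picked by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(document_vectors)))
    while candidates and len(selected) < k:
        best_index = None
        best_score = -math.inf
        for index in candidates:
            # Relevance: similarity of the candidate to the query.
            relevance = cosine_similarity(query_vector, document_vectors[index])
            # Redundancy: highest similarity to anything already selected (0 if nothing is selected).
            redundancy = max(
                (cosine_similarity(document_vectors[index], document_vectors[chosen]) for chosen in selected),
                default = 0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_index, best_score = index, score
        selected.append(best_index)
        candidates.remove(best_index)
    return selected


# Hypothetical 2-dimensional embeddings: documents 0 and 1 are near-duplicates, document 2 is different.
query_vector = [1.0, 0.0]
document_vectors = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]

diverse_indices = mmr_select(query_vector, document_vectors, k = 2, lambda_mult = 0.3)
relevant_indices = mmr_select(query_vector, document_vectors, k = 2, lambda_mult = 1.0)

print(diverse_indices)
print(relevant_indices)
```

With a low `lambda_mult` the second pick skips the near-duplicate in favor of the different document; with `lambda_mult = 1.0` the rule reduces to plain similarity ranking.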
■ Shows how to retrieve parent documents using the invoke method of the MultiVectorRetriever class.
▶ main.py
import uuid
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever

textLoaderList = [
    TextLoader("paul_graham_essay.txt" , encoding = "utf-8"),
    TextLoader("state_of_the_union.txt", encoding = "utf-8")
]

documentList = []

for textLoader in textLoaderList:
    documentList.extend(textLoader.load())

recursiveCharacterTextSplitter1 = RecursiveCharacterTextSplitter(chunk_size = 10000)

splitDocumentList = recursiveCharacterTextSplitter1.split_documents(documentList)

splitDocumentIDList = [str(uuid.uuid4()) for _ in splitDocumentList]

recursiveCharacterTextSplitter2 = RecursiveCharacterTextSplitter(chunk_size = 400)

totalSplitSplitDocumentList = []

for i, splitDocument in enumerate(splitDocumentList):
    splitDocumentID = splitDocumentIDList[i]
    splitSplitDocumentList = recursiveCharacterTextSplitter2.split_documents([splitDocument])
    # Tag each child chunk with the ID of its parent chunk.
    for splitSplitDocument in splitSplitDocumentList:
        splitSplitDocument.metadata["doc_id"] = splitDocumentID
    totalSplitSplitDocumentList.extend(splitSplitDocumentList)

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma(collection_name = "full_documents", embedding_function = openAIEmbeddings)

inMemoryByteStore = InMemoryByteStore()

multiVectorRetriever = MultiVectorRetriever(
    vectorstore = chroma,
    byte_store  = inMemoryByteStore,
    id_key      = "doc_id"
)

# Child chunks are embedded; parent chunks are stored in the docstore under their IDs.
multiVectorRetriever.vectorstore.add_documents(totalSplitSplitDocumentList)
multiVectorRetriever.docstore.mset(list(zip(splitDocumentIDList, splitDocumentList)))

# invoke searches the child chunks and returns the matching parent chunks.
resultDocumentList = multiVectorRetriever.invoke("justice breyer")

resultDocument = resultDocumentList[0]

print(len(resultDocument.page_content))
print(resultDocument.metadata)
※ pip install langchain langchain-community
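The invoke example above relies on a two-step lookup: child chunks are matched first, then their doc_id values are used to fetch the parent chunks. The sketch below mimics that idea with plain dictionaries so it runs without an API key. It is only an approximation of the behavior under stated assumptions; `child_hits`, `docstore`, and `resolve_parents` are hypothetical names made up for this illustration and are not part of LangChain.

```python
# Hypothetical stand-ins: a ranked list of child chunks (as a vector store search might return them)
# and a dict acting as the docstore that maps doc_id to the parent text.
child_hits = [
    {"page_content" : "child chunk A1", "metadata" : {"doc_id" : "parent-a"}},
    {"page_content" : "child chunk B1", "metadata" : {"doc_id" : "parent-b"}},
    {"page_content" : "child chunk A2", "metadata" : {"doc_id" : "parent-a"}},
]

docstore = {
    "parent-a" : "full text of parent A",
    "parent-b" : "full text of parent B"
}


def resolve_parents(child_hits, docstore, id_key = "doc_id"):
    """Map ranked child hits to their parents, keeping first-seen order and dropping duplicates."""
    parent_ids = []
    for hit in child_hits:
        parent_id = hit["metadata"].get(id_key)
        if parent_id is not None and parent_id not in parent_ids:
            parent_ids.append(parent_id)
    # Skip IDs that have no entry in the docstore.
    return [docstore[parent_id] for parent_id in parent_ids if parent_id in docstore]


parents = resolve_parents(child_hits, docstore)

print(parents)
```

Two child hits from the same parent collapse into a single parent result, which is why the retriever can return fewer documents than the underlying child search.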
■ Shows how to run a similarity search over child documents using the vectorstore/docstore attributes of the MultiVectorRetriever class.
▶ main.py
import uuid
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever

textLoaderList = [
    TextLoader("paul_graham_essay.txt" , encoding = "utf-8"),
    TextLoader("state_of_the_union.txt", encoding = "utf-8")
]

documentList = []

for textLoader in textLoaderList:
    documentList.extend(textLoader.load())

recursiveCharacterTextSplitter1 = RecursiveCharacterTextSplitter(chunk_size = 10000)

splitDocumentList = recursiveCharacterTextSplitter1.split_documents(documentList)

splitDocumentIDList = [str(uuid.uuid4()) for _ in splitDocumentList]

recursiveCharacterTextSplitter2 = RecursiveCharacterTextSplitter(chunk_size = 400)

totalSplitSplitDocumentList = []

for i, splitDocument in enumerate(splitDocumentList):
    documentID = splitDocumentIDList[i]
    splitSplitDocumentList = recursiveCharacterTextSplitter2.split_documents([splitDocument])
    # Tag each child chunk with the ID of its parent chunk.
    for subsidiaryDocument in splitSplitDocumentList:
        subsidiaryDocument.metadata["doc_id"] = documentID
    totalSplitSplitDocumentList.extend(splitSplitDocumentList)

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma(collection_name = "full_documents", embedding_function = openAIEmbeddings)

inMemoryByteStore = InMemoryByteStore()

multiVectorRetriever = MultiVectorRetriever(
    vectorstore = chroma,
    byte_store  = inMemoryByteStore,
    id_key      = "doc_id"
)

multiVectorRetriever.vectorstore.add_documents(totalSplitSplitDocumentList)
multiVectorRetriever.docstore.mset(list(zip(splitDocumentIDList, splitDocumentList)))

# Searching the vector store directly returns child chunks rather than parent chunks.
resultDocumentList = multiVectorRetriever.vectorstore.similarity_search("justice breyer")

print(resultDocumentList[0])
※ pip install langchain
■ Shows how to create a MultiVectorRetriever object by passing the vectorstore/byte_store/id_key arguments to the MultiVectorRetriever constructor.
▶ Example code (PY)
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever

openAIEmbeddings = OpenAIEmbeddings()

# Vector store that holds the embedded child chunks.
chroma = Chroma(collection_name = "full_documents", embedding_function = openAIEmbeddings)

# Byte store that holds the parent documents.
inMemoryByteStore = InMemoryByteStore()

multiVectorRetriever = MultiVectorRetriever(
    vectorstore = chroma,
    byte_store  = inMemoryByteStore,
    id_key      = "doc_id"
)
※ pip install langchain langchain-chroma
■ Shows how to use the LongContextReorder class to reorder retrieved results and mitigate the "lost in the middle" effect.
▶ main.py
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain_community.document_transformers import LongContextReorder
from langchain_openai import OpenAI
from langchain_core.prompts import PromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain

huggingFaceEmbeddings = HuggingFaceEmbeddings(model_name = "all-MiniLM-L6-v2")

textList = [
    "Basquetball is a great sport.",
    "Fly me to the moon is one of my favourite songs.",
    "The Celtics are my favourite team.",
    "This is a document about the Boston Celtics",
    "I simply love going to the movies",
    "The Boston Celtics won the game by 20 points",
    "This is just a random text.",
    "Elden Ring is one of the best games in the last 15 years.",
    "L. Kornet is one of the best Celtics players.",
    "Larry Bird was an iconic NBA player."
]

chroma = Chroma.from_texts(textList, embedding = huggingFaceEmbeddings)

# Retrieve all 10 texts, ordered by relevance to the query.
vectorStoreRetriever = chroma.as_retriever(search_kwargs = {"k" : 10})

query = "What can you tell me about the Celtics?"

documentList = vectorStoreRetriever.invoke(query)

# Reorder so that less relevant documents sit in the middle of the context.
longContextReorder = LongContextReorder()

reorderedDocumentList = longContextReorder.transform_documents(documentList)

openAI = OpenAI()

templateString = """
Given these texts:
-----
{context}
-----
Please answer the following question:
{query}
"""

promptTemplate = PromptTemplate(
    template        = templateString,
    input_variables = ["context", "query"]
)

runnableBinding = create_stuff_documents_chain(openAI, promptTemplate)

responseString = runnableBinding.invoke({"context" : reorderedDocumentList, "query" : query})

print(responseString)
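To make the reordering step in the example above concrete, here is a minimal sketch of the "lost in the middle" mitigation in plain Python. It assumes the input list is sorted from most to least relevant and illustrates the idea only; the library's actual implementation may differ, and `reorder_for_long_context` is a hypothetical helper name chosen for this sketch.

```python
def reorder_for_long_context(ranked_items):
    """Place the most relevant items at both ends and the least relevant ones in the middle."""
    reordered = []
    # Walk from least to most relevant, alternating between the front and the back.
    for position, item in enumerate(reversed(ranked_items)):
        if position % 2 == 1:
            reordered.append(item)
        else:
            reordered.insert(0, item)
    return reordered


# Hypothetical ranking: "rank1" is the most relevant item, "rank5" the least.
ranked = ["rank1", "rank2", "rank3", "rank4", "rank5"]

reordered = reorder_for_long_context(ranked)

print(reordered)
```

The two best-ranked items end up first and last, where a model is most likely to attend to them, while the weakest item lands in the center.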
▶ requirements.txt
aiohappyeyeballs==2.4.0 aiohttp==3.10.5 aiosignal==1.3.1 annotated-types==0.7.0 anyio==4.4.0 asgiref==3.8.1 attrs==24.2.0 backoff==2.2.1 bcrypt==4.2.0 build==1.2.2 cachetools==5.5.0 certifi==2024.8.30 charset-normalizer==3.3.2 chroma-hnswlib==0.7.3 chromadb==0.5.3 click==8.1.7 colorama==0.4.6 coloredlogs==15.0.1 dataclasses-json==0.6.7 Deprecated==1.2.14 distro==1.9.0 fastapi==0.114.1 filelock==3.16.0 flatbuffers==24.3.25 frozenlist==1.4.1 fsspec==2024.9.0 google-auth==2.34.0 googleapis-common-protos==1.65.0 greenlet==3.1.0 grpcio==1.66.1 h11==0.14.0 httpcore==1.0.5 httptools==0.6.1 httpx==0.27.2 huggingface-hub==0.24.7 humanfriendly==10.0 idna==3.8 importlib_metadata==8.4.0 importlib_resources==6.4.5 Jinja2==3.1.4 jiter==0.5.0 joblib==1.4.2 jsonpatch==1.33 jsonpointer==3.0.0 kubernetes==30.1.0 langchain==0.2.16 langchain-chroma==0.1.3 langchain-community==0.2.16 langchain-core==0.2.39 langchain-huggingface==0.0.3 langchain-openai==0.1.23 langchain-text-splitters==0.2.4 langsmith==0.1.120 markdown-it-py==3.0.0 MarkupSafe==2.1.5 marshmallow==3.22.0 mdurl==0.1.2 mmh3==4.1.0 monotonic==1.6 mpmath==1.3.0 multidict==6.1.0 mypy-extensions==1.0.0 networkx==3.3 numpy==1.26.4 oauthlib==3.2.2 onnxruntime==1.19.2 openai==1.45.0 opentelemetry-api==1.27.0 opentelemetry-exporter-otlp-proto-common==1.27.0 opentelemetry-exporter-otlp-proto-grpc==1.27.0 opentelemetry-instrumentation==0.48b0 opentelemetry-instrumentation-asgi==0.48b0 opentelemetry-instrumentation-fastapi==0.48b0 opentelemetry-proto==1.27.0 opentelemetry-sdk==1.27.0 opentelemetry-semantic-conventions==0.48b0 opentelemetry-util-http==0.48b0 orjson==3.10.7 overrides==7.7.0 packaging==24.1 pillow==10.4.0 posthog==3.6.5 protobuf==4.25.4 pyasn1==0.6.1 pyasn1_modules==0.4.1 pydantic==2.9.1 pydantic_core==2.23.3 Pygments==2.18.0 PyPika==0.48.9 pyproject_hooks==1.1.0 pyreadline3==3.4.3 python-dateutil==2.9.0.post0 python-dotenv==1.0.1 PyYAML==6.0.2 regex==2024.9.11 requests==2.32.3 requests-oauthlib==2.0.0 rich==13.8.1 
rsa==4.9 safetensors==0.4.5 scikit-learn==1.5.2 scipy==1.14.1 sentence-transformers==3.1.0 setuptools==74.1.2 shellingham==1.5.4 six==1.16.0 sniffio==1.3.1 SQLAlchemy==2.0.34 starlette==0.38.5 sympy==1.13.2 tenacity==8.5.0 threadpoolctl==3.5.0 tiktoken==0.7.0 tokenizers==0.19.1 torch==2.4.1 tqdm==4.66.5 transformers==4.44.2 typer==0.12.5 typing-inspect==0.9.0 typing_extensions==4.12.2 urllib3==2.2.3 uvicorn==0.30.6 watchfiles==0.24.0 websocket-client==1.8.0 websockets==13.0.1 wrapt==1.16.0 yarl==1.11.1 zipp==3.20.1 |
※ pip install