[PYTHON/LANGCHAIN] create_react_agent function: creating a CompiledStateGraph object and searching a Chroma vector store
■ Shows how to use the create_react_agent function to create a CompiledStateGraph object and search a Chroma vector store.
※ Define the OPENAI_API_KEY environment variable in the .env file.
■ Shows how to search a Chroma vector store while keeping the chat history.
※ Define the OPENAI_API_KEY environment variable in the .env file.
▶ main.py
import bs4

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.prompts import MessagesPlaceholder
from langchain.chains import create_history_aware_retriever
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

load_dotenv()

chatOpenAI = ChatOpenAI(model = "gpt-4o")

webBaseLoader = WebBaseLoader(
    web_paths = ("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs = dict(parse_only = bs4.SoupStrainer(class_ = ("post-content", "post-title", "post-header")))
)

documentList = webBaseLoader.load()

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 200)

splitDocumentList = recursiveCharacterTextSplitter.split_documents(documentList)

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma.from_documents(documents = splitDocumentList, embedding = openAIEmbeddings)

vectorStoreRetriever = chroma.as_retriever()

systemMessage1 = "Given a chat history and the latest user question which might reference context in the chat history, formulate a standalone question which can be understood without the chat history. Do NOT answer the question, just reformulate it if needed and otherwise return it as is."

chatPromptTemplate1 = ChatPromptTemplate.from_messages(
    [
        ("system", systemMessage1),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}")
    ]
)

runnableBinding1 = create_history_aware_retriever(chatOpenAI, vectorStoreRetriever, chatPromptTemplate1)

systemMessage2 = "You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, say that you don't know. Use three sentences maximum and keep the answer concise.\n\n{context}"

chatPromptTemplate2 = ChatPromptTemplate.from_messages(
    [
        ("system", systemMessage2),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

runnableBinding2 = create_stuff_documents_chain(chatOpenAI, chatPromptTemplate2)

runnableBinding3 = create_retrieval_chain(runnableBinding1, runnableBinding2)

chatMessageHistoryDictionary = {}

def GetChatMessageHistoryDictionary(session_id : str) -> BaseChatMessageHistory:
    if session_id not in chatMessageHistoryDictionary:
        chatMessageHistoryDictionary[session_id] = ChatMessageHistory()
    return chatMessageHistoryDictionary[session_id]

runnableWithMessageHistory = RunnableWithMessageHistory(
    runnableBinding3,
    GetChatMessageHistoryDictionary,
    input_messages_key = "input",
    history_messages_key = "chat_history",
    output_messages_key = "answer",
)

responseDictionary1 = runnableWithMessageHistory.invoke(
    {"input" : "What is Task Decomposition?"},
    config = {"configurable" : {"session_id" : "abc123"}}
)

answer1 = responseDictionary1["answer"]

print(answer1)
print("-" * 50)

"""
Task Decomposition is a technique used to break down complex tasks into smaller, manageable steps. It is often implemented using methods like Chain of Thought (CoT) or Tree of Thoughts, which help in systematically exploring and reasoning through various possibilities. This approach enhances model performance by allowing a step-by-step analysis and execution of tasks.
"""

responseDictionary2 = runnableWithMessageHistory.invoke(
    {"input" : "What are common ways of doing it?"},
    config = {"configurable" : {"session_id" : "abc123"}}
)

answer2 = responseDictionary2["answer"]

print(answer2)
print("-" * 50)

from langchain_core.messages import AIMessage

for message in chatMessageHistoryDictionary["abc123"].messages:
    if isinstance(message, AIMessage):
        prefix = "AI"
    else:
        prefix = "User"
    print(f"{prefix} : {message.content}")

print("-" * 50)

"""
Task decomposition is the process of breaking down a complex task into smaller, more manageable steps or subgoals. This approach, often used in conjunction with techniques like Chain of Thought (CoT), helps enhance model performance by enabling step-by-step reasoning. It can be achieved through prompting, task-specific instructions, or human inputs.
--------------------------------------------------
Common ways of performing task decomposition include using straightforward prompts like "Steps for XYZ.\n1." or "What are the subgoals for achieving XYZ?", employing task-specific instructions such as "Write a story outline" for writing a novel, and incorporating human inputs.
--------------------------------------------------
User : What is Task Decomposition?
AI : Task decomposition is the process of breaking down a complex task into smaller, more manageable steps or subgoals. This approach, often used in conjunction with techniques like Chain of Thought (CoT), helps enhance model performance by enabling step-by-step reasoning. It can be achieved through prompting, task-specific instructions, or human inputs.
User : What are common ways of doing it?
AI : Common ways of performing task decomposition include using straightforward prompts like "Steps for XYZ.\n1." or "What are the subgoals for achieving XYZ?", employing task-specific instructions such as "Write a story outline" for writing a novel, and incorporating human inputs.
--------------------------------------------------
"""
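The per-session history lookup used above can be sketched without LangChain or an API key. This is a minimal illustration only: `session_store`, `get_history`, and `chat` are hypothetical names, the reply is a stub rather than a model call, and plain tuples stand in for ChatMessageHistory messages.

```python
from collections import defaultdict

# Hypothetical stand-in for ChatMessageHistory: one list of (role, text) pairs per session ID.
session_store: dict[str, list[tuple[str, str]]] = defaultdict(list)

def get_history(session_id: str) -> list[tuple[str, str]]:
    """Return the message list for a session, creating it on first use."""
    return session_store[session_id]

def chat(session_id: str, user_text: str) -> str:
    """Look up the session history, build a stub reply, and record both turns."""
    history = get_history(session_id)
    # A real chain would call the model here; the stub only reports how much history it saw.
    reply = f"(stub) seen {len(history)} earlier messages"
    history.append(("User", user_text))
    history.append(("AI", reply))
    return reply

first = chat("abc123", "What is Task Decomposition?")
second = chat("abc123", "What are common ways of doing it?")
other = chat("xyz789", "hello")

for role, text in session_store["abc123"]:
    print(f"{role} : {text}")
```

The second call in the same session sees the two messages recorded by the first call, while a different session ID starts from an empty history. That is the behavior the session_id entry in the config is meant to select.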
■ Shows how to chat using the invoke method of the RunnableWithMessageHistory class.
※ Define the OPENAI_API_KEY environment variable in the .env file.
▶ main.py
from dotenv import load_dotenv
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.tools.retriever import create_retriever_tool
from langchain_openai import ChatOpenAI
from langchain import hub
from langchain.agents import create_tool_calling_agent
from langchain.agents import AgentExecutor
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

load_dotenv()

chatOpenAI = ChatOpenAI(model = "gpt-4o")

tavilySearchResults = TavilySearchResults(max_results = 2)

webBaseLoader = WebBaseLoader("https://docs.smith.langchain.com/overview")

documentList = webBaseLoader.load()

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 200)

splitDocumentList = recursiveCharacterTextSplitter.split_documents(documentList)

openAIEmbeddings = OpenAIEmbeddings()

faiss = FAISS.from_documents(splitDocumentList, openAIEmbeddings)

vectorStoreRetriever = faiss.as_retriever()

vectorStoreRetrieverTool = create_retriever_tool(
    vectorStoreRetriever,
    "langsmith_search",
    "Search for information about LangSmith. For any questions about LangSmith, you must use this tool!",
)

toolList = [tavilySearchResults, vectorStoreRetrieverTool]

chatPromptTemplate = hub.pull("hwchase17/openai-functions-agent")

runnableSequence = create_tool_calling_agent(chatOpenAI, toolList, chatPromptTemplate)

agentExecutor = AgentExecutor(agent = runnableSequence, tools = toolList)

sessionIDDictionary = {}

def getChatMessageHistory(sessionID : str) -> BaseChatMessageHistory:
    if sessionID not in sessionIDDictionary:
        sessionIDDictionary[sessionID] = ChatMessageHistory()
    return sessionIDDictionary[sessionID]

runnableWithMessageHistory = RunnableWithMessageHistory(
    agentExecutor,
    getChatMessageHistory,
    input_messages_key = "input",
    history_messages_key = "chat_history",
)

responseDictionary1 = runnableWithMessageHistory.invoke(
    {"input" : "hi! I'm bob"},
    config = {"configurable" : {"session_id" : "<foo>"}}
)

print(responseDictionary1)
print("-" * 100)

responseDictionary2 = runnableWithMessageHistory.invoke(
    {"input" : "what's my name?"},
    config = {"configurable" : {"session_id" : "<foo>"}}
)

print(responseDictionary2)
print("-" * 100)

"""
{'input': "hi! I'm bob", 'chat_history': [], 'output': 'Hello Bob! How can I assist you today?'}
{'input': "what's
■ Shows how to use the create_retriever_tool function to create a FAISS vector store search tool.
※ Define the OPENAI_API_KEY environment variable in the .env file.
■ Shows how to use the from_documents static method of the FAISS class to create a FAISS object.
※ Define the OPENAI_API_KEY environment variable in the .env file.
■ Shows how to pass a StructuredTool object, created with the as_tool method of the RunnableSequence class, to the create_react_agent function.
※ Define the OPENAI_API_KEY environment variable in the .env file.
■ Shows how to use the create_react_agent function to create a CompiledStateGraph object.
※ Define the OPENAI_API_KEY environment variable in the .env file.
▶ main.py
from dotenv import load_dotenv
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

load_dotenv()

documentList = [
    Document(page_content = "Dogs are great companions, known for their loyalty and friendliness."),
    Document(page_content = "Cats are independent pets that often enjoy their own space.")
]

openAIEmbeddings = OpenAIEmbeddings()

inMemoryVectorStore = InMemoryVectorStore.from_documents(
    documentList,
    embedding = openAIEmbeddings
)

vectorStoreRetriever = inMemoryVectorStore.as_retriever(
    search_type = "similarity",
    search_kwargs = {"k" : 1}
)

chatOpenAI = ChatOpenAI(model = "gpt-4o-mini")

tool = vectorStoreRetriever.as_tool(
    name = "pet_info_retriever",
    description = "Get information about pets.",
)

toolList = [tool]

compiledStateGraph = create_react_agent(chatOpenAI, toolList)

for addableUpdatesDict in compiledStateGraph.stream({"messages" : [("human", "What are dogs known for?")]}):
    print(addableUpdatesDict)
    print("-" * 100)

"""
{
    'agent' : {
        'messages' : [
            AIMessage(
                content = 'Dogs are known for several characteristics and traits, including:\n\n1. **Companionship**: Dogs are often referred to as "man\'s best friend" due to their loyalty and companionship.\n\n2. **Intelligence**: Many dog breeds are highly intelligent and capable of learning a variety of commands and tricks.\n\n3. **Variety of Breeds**: There are hundreds of dog breeds, each with its own unique traits, sizes, and temperaments.\n\n4. **Working Abilities**: Dogs are used in various roles, such as service animals, search and rescue, therapy dogs, and police or military dogs.\n\n5. **Strong Sense of Smell**: Dogs have an exceptional sense of smell, which makes them excellent for tracking and detection purposes.\n\n6. **Social Behavior**: Dogs are social animals and often thrive in the company of humans and other pets.\n\n7. **Playfulness**: Many dogs enjoy playing and being active, which makes them great companions for outdoor activities.\n\n8. **Emotional Support**: Dogs are known to provide emotional support and comfort to their owners, often sensing when someone is feeling down.\n\n9. **Protectiveness**: Many dogs have a natural instinct to protect their home and family, making them good guard animals.\n\n10. **Communication**: Dogs communicate through a combination of vocalizations, body language, and facial expressions. \n\nOverall, dogs are appreciated for their loyalty, intelligence, and the deep bond they can form with humans.',
                additional_kwargs = {'refusal' : None},
                response_metadata = {
                    'token_usage' : {
                        'completion_tokens' : 299,
                        'prompt_tokens' : 58,
                        'total_tokens' : 357,
                        'completion_tokens_details' : {'reasoning_tokens' : 0}
                    },
                    'model_name' : 'gpt-4o-mini-2024-07-18',
                    'system_fingerprint' : 'fp_f85bea6784',
                    'finish_reason' : 'stop',
                    'logprobs' : None
                },
                id = 'run-b2d78792-6c54-422e-8739-07662d2eb56b-0',
                usage_metadata = {'input_tokens' : 58, 'output_tokens' : 299, 'total_tokens' : 357}
            )
        ]
    }
}
"""
----------------------------------------------------------------------------------------------------
■ Shows how to use the as_tool method of the VectorStoreRetriever class to create a Tool object.
▶ main.py
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore

documentList = [
    Document(page_content = "Dogs are great companions, known for their loyalty and friendliness."),
    Document(page_content = "Cats are independent pets that often enjoy their own space.")
]

openAIEmbeddings = OpenAIEmbeddings()

inMemoryVectorStore = InMemoryVectorStore.from_documents(
    documentList,
    embedding = openAIEmbeddings
)

vectorStoreRetriever = inMemoryVectorStore.as_retriever(
    search_type = "similarity",
    search_kwargs = {"k" : 1}
)

tool = vectorStoreRetriever.as_tool(
    name = "pet_info_retriever",
    description = "Get information about pets.",
)
※ The pip install langchain-openai command was run.
■ Shows how to use the as_retriever method of the InMemoryVectorStore class to create a VectorStoreRetriever object.
▶ main.py
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore

documentList = [
    Document(page_content = "Dogs are great companions, known for their loyalty and friendliness."),
    Document(page_content = "Cats are independent pets that often enjoy their own space.")
]

openAIEmbeddings = OpenAIEmbeddings()

inMemoryVectorStore = InMemoryVectorStore.from_documents(
    documentList,
    embedding = openAIEmbeddings
)

vectorStoreRetriever = inMemoryVectorStore.as_retriever(
    search_type = "similarity",
    search_kwargs = {"k" : 1}
)
※ The pip install langchain-openai command was run.
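To make the search_type = "similarity" and k = 1 settings concrete, here is a rough, library-free sketch of a top-k similarity search. The two-dimensional vectors are made-up stand-ins for real embeddings, and `cosine` and `top_k` are hypothetical helper names. It illustrates the idea, not how InMemoryVectorStore is implemented internally.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vector, documents, k=1):
    """Return the texts of the k documents whose vectors are most similar to the query."""
    ranked = sorted(documents, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy (text, vector) pairs; the vectors are invented for illustration only.
documents = [
    ("Dogs are great companions, known for their loyalty and friendliness.", [0.9, 0.1]),
    ("Cats are independent pets that often enjoy their own space.", [0.1, 0.9]),
]

result = top_k([1.0, 0.2], documents, k=1)
print(result)
```

With k = 1 only the single closest document is returned, which is why the retriever in the example above hands at most one document to the caller.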
■ Shows how to use the from_documents static method of the InMemoryVectorStore class to create an InMemoryVectorStore object.
▶ main.py
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore

documentList = [
    Document(page_content = "Dogs are great companions, known for their loyalty and friendliness."),
    Document(page_content = "Cats are independent pets that often enjoy their own space.")
]

openAIEmbeddings = OpenAIEmbeddings()

inMemoryVectorStore = InMemoryVectorStore.from_documents(
    documentList,
    embedding = openAIEmbeddings
)
※ The pip install langchain-openai command was run.
■ Shows how to set a mock current time with the mock_now function when using the TimeWeightedVectorStoreRetriever class.
▶ main.py
import faiss

from langchain_openai import OpenAIEmbeddings
from langchain_community.docstore import InMemoryDocstore
from langchain_community.vectorstores import FAISS
from langchain.retrievers import TimeWeightedVectorStoreRetriever
from datetime import datetime
from datetime import timedelta
from langchain_core.documents import Document
from langchain_core.utils import mock_now

openAIEmbeddings = OpenAIEmbeddings()

embeddingSize = 1536

indexFlatL2 = faiss.IndexFlatL2(embeddingSize)

inMemoryDocstore = InMemoryDocstore({})

faiss = FAISS(openAIEmbeddings, indexFlatL2, inMemoryDocstore, {})

timeWeightedVectorStoreRetriever = TimeWeightedVectorStoreRetriever(vectorstore = faiss, decay_rate = 0.999, k = 1)

yesterday = datetime.now() - timedelta(days = 1)

timeWeightedVectorStoreRetriever.add_documents([Document(page_content = "hello world", metadata = {"last_accessed_at" : yesterday})])
timeWeightedVectorStoreRetriever.add_documents([Document(page_content = "hello foo")])

with mock_now(datetime(2024, 9, 16, 10, 11)):
    resultDocumentList = timeWeightedVectorStoreRetriever.get_relevant_documents("hello world")
    for resultDocument in resultDocumentList:
        print(resultDocument)
▶ requirements.txt
aiohappyeyeballs==2.4.0
aiohttp==3.10.5
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.4.0
attrs==24.2.0
certifi==2024.8.30
charset-normalizer==3.3.2
colorama==0.4.6
dataclasses-json==0.6.7
distro==1.9.0
faiss-cpu==1.8.0.post1
frozenlist==1.4.1
greenlet==3.1.0
h11==0.14.0
httpcore==1.0.5
httpx==0.27.2
idna==3.10
jiter==0.5.0
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.3.0
langchain-community==0.3.0
langchain-core==0.3.0
langchain-openai==0.2.0
langchain-text-splitters==0.3.0
langsmith==0.1.121
marshmallow==3.22.0
multidict==6.1.0
mypy-extensions==1.0.0
numpy==1.26.4
openai==1.45.1
orjson==3.10.7
packaging==24.1
pydantic==2.9.1
pydantic-settings==2.5.2
pydantic_core==2.23.3
python-dotenv==1.0.1
PyYAML==6.0.2
regex==2024.9.11
requests==2.32.3
sniffio==1.3.1
SQLAlchemy==2.0.35
tenacity==8.5.0
tiktoken==0.7.0
tqdm==4.66.5
typing-inspect==0.9.0
typing_extensions==4.12.2
urllib3==2.2.3
yarl==1.11.1
※ pip install langchain-community langchain-openai
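■ Shows how to retrieve documents with a high decay rate using the TimeWeightedVectorStoreRetriever class.
※ With a high decay rate (for example, several 9s), the recency score quickly drops to 0!
■ Shows how to retrieve documents with a low decay rate using the TimeWeightedVectorStoreRetriever class.
※ A low decay rate (set here to an extreme value close to 0) means that memories are "remembered" for longer.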
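The effect of the decay rate can be checked with a few lines of arithmetic. The sketch below assumes the recency term documented for this retriever, (1 - decay_rate) ** hours_passed, which is added to the semantic similarity. `recency_score` is a hypothetical helper name, and 24 hours is just an example interval.

```python
def recency_score(decay_rate: float, hours_passed: float) -> float:
    """Recency term of the time-weighted score: (1 - decay_rate) ** hours_passed."""
    return (1.0 - decay_rate) ** hours_passed

# A document last accessed one day (24 hours) ago.
high_decay = recency_score(0.999, 24)
low_decay = recency_score(0.0000000000000000000000001, 24)

print(f"high decay rate : {high_decay:.3e}")
print(f"low decay rate  : {low_decay:.3f}")
```

With decay_rate = 0.999 the recency term is effectively zero after a day, so ranking falls back to semantic similarity alone. With a decay rate close to 0 the term stays at about 1 for every document, so older documents are not penalized.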
■ Shows how to create a TimeWeightedVectorStoreRetriever object using the vectorstore/decay_rate arguments of the TimeWeightedVectorStoreRetriever constructor.
▶ main.py
import faiss

from langchain_openai import OpenAIEmbeddings
from langchain_community.docstore import InMemoryDocstore
from langchain_community.vectorstores import FAISS
from langchain.retrievers import TimeWeightedVectorStoreRetriever

openAIEmbeddings = OpenAIEmbeddings()

embeddingSize = 1536

indexFlatL2 = faiss.IndexFlatL2(embeddingSize)

inMemoryDocstore = InMemoryDocstore({})

faiss = FAISS(openAIEmbeddings, indexFlatL2, inMemoryDocstore, {})

timeWeightedVectorStoreRetriever = TimeWeightedVectorStoreRetriever(vectorstore = faiss, decay_rate = 0.0000000000000000000000001, k = 1)
※ pip install langchain-community
■ Shows how to use the ChromaTranslator class to build Chroma vector database queries.
▶ Example code (PY)
from langchain_community.query_constructors.chroma import ChromaTranslator

chromaTranslator = ChromaTranslator()
※ The pip install langchain-community command was run.
■ Shows how to set an LCEL chain using the query_constructor argument of the SelfQueryRetriever constructor.
※ Define the OPENAI_API_KEY environment variable in the .env file.
▶ main.py
from dotenv import load_dotenv
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.chains.query_constructor.base import get_query_constructor_prompt
from langchain.chains.query_constructor.base import StructuredQueryOutputParser
from langchain_openai import ChatOpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_community.query_constructors.chroma import ChromaTranslator

load_dotenv()

documentList = [
    Document(
        page_content = "A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata = {"year" : 1993, "rating" : 7.7, "genre" : "science fiction"}
    ),
    Document(
        page_content = "Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata = {"year" : 2010, "director" : "Christopher Nolan", "rating" : 8.2}
    ),
    Document(
        page_content = "A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata = {"year" : 2006, "director" : "Satoshi Kon", "rating" : 8.6}
    ),
    Document(
        page_content = "A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata = {"year" : 2019, "director" : "Greta Gerwig", "rating" : 8.3}
    ),
    Document(
        page_content = "Toys come alive and have a blast doing so",
        metadata = {"year" : 1995, "genre" : "animated"}
    ),
    Document(
        page_content = "Three men walk into the Zone, three men walk out of the Zone",
        metadata = {"year" : 1979, "director" : "Andrei Tarkovsky", "genre" : "thriller", "rating" : 9.9}
    )
]

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma.from_documents(documentList, openAIEmbeddings)

attributeInfoList = [
    AttributeInfo(
        name = "genre",
        description = "The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type = "string"
    ),
    AttributeInfo(
        name = "year",
        description = "The year the movie was released",
        type = "integer"
    ),
    AttributeInfo(
        name = "director",
        description = "The name of the movie director",
        type = "string"
    ),
    AttributeInfo(
        name = "rating",
        description = "A 1-10 rating for the movie",
        type = "float"
    )
]

documentContentDescription = "Brief summary of a movie"

fewShotPromptTemplate = get_query_constructor_prompt(
    documentContentDescription,
    attributeInfoList
)

structuredQueryOutputParser = StructuredQueryOutputParser.from_components()

chatOpenAI = ChatOpenAI(temperature = 0)

runnableSequence = fewShotPromptTemplate | chatOpenAI | structuredQueryOutputParser

selfQueryRetriever = SelfQueryRetriever(
    query_constructor = runnableSequence,
    vectorstore = chroma,
    structured_query_translator = ChromaTranslator(),
)

resultDocumentList = selfQueryRetriever.invoke("What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated")

for resultDocument in resultDocumentList:
    print(resultDocument)
■ Shows how to create a SelfQueryRetriever object using the enable_limit argument of the from_llm static method of the SelfQueryRetriever class.
※ Define the OPENAI_API_KEY environment variable in the .env file.
■ Shows how to create a self-query retriever using the SelfQueryRetriever class.
※ Define the OPENAI_API_KEY environment variable in the .env file.
▶ main.py
from dotenv import load_dotenv
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.chains.query_constructor.base import AttributeInfo
from langchain_openai import ChatOpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever

load_dotenv()

documentList = [
    Document(
        page_content = "A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata = {"year" : 1993, "rating" : 7.7, "genre" : "science fiction"}
    ),
    Document(
        page_content = "Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata = {"year" : 2010, "director" : "Christopher Nolan", "rating" : 8.2}
    ),
    Document(
        page_content = "A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata = {"year" : 2006, "director" : "Satoshi Kon", "rating" : 8.6}
    ),
    Document(
        page_content = "A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata = {"year" : 2019, "director" : "Greta Gerwig", "rating" : 8.3}
    ),
    Document(
        page_content = "Toys come alive and have a blast doing so",
        metadata = {"year" : 1995, "genre" : "animated"}
    ),
    Document(
        page_content = "Three men walk into the Zone, three men walk out of the Zone",
        metadata = {"year" : 1979, "director" : "Andrei Tarkovsky", "genre" : "thriller", "rating" : 9.9}
    )
]

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma.from_documents(documentList, openAIEmbeddings)

attributeInfoList = [
    AttributeInfo(
        name = "genre",
        description = "The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type = "string"
    ),
    AttributeInfo(
        name = "year",
        description = "The year the movie was released",
        type = "integer"
    ),
    AttributeInfo(
        name = "director",
        description = "The name of the movie director",
        type = "string"
    ),
    AttributeInfo(
        name = "rating",
        description = "A 1-10 rating for the movie",
        type = "float"
    )
]

documentContentDescription = "Brief summary of a movie"

chatOpenAI = ChatOpenAI(temperature = 0)

selfQueryRetriever = SelfQueryRetriever.from_llm(
    chatOpenAI,
    chroma,
    documentContentDescription,
    attributeInfoList
)

# Specify a filter only.
resultDocumentList = selfQueryRetriever.invoke("I want to watch a movie rated higher than 8.5")

for resultDocument in resultDocumentList:
    print(resultDocument.page_content)

print()

# Specify a query and a filter.
resultDocumentList = selfQueryRetriever.invoke("Has Greta Gerwig directed any movies about women")

for resultDocument in resultDocumentList:
    print(resultDocument.page_content)

print()

# Specify a composite filter.
resultDocumentList = selfQueryRetriever.invoke("What's a highly rated (above 8.5) science fiction film?")

for resultDocument in resultDocumentList:
    print(resultDocument.page_content)

# Specify a query and a composite filter.
resultDocumentList = selfQueryRetriever.invoke("What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated")

print()

for resultDocument in resultDocumentList:
    print(resultDocument.page_content)
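The "filter" half of a self-query can be pictured as a list of metadata conditions applied to each document. The sketch below hand-writes those conditions instead of asking a model to produce them. `COMPARATORS`, `matches`, and `filter_movies` are hypothetical names, and the tuple format is an assumption made for illustration, not the structured query format LangChain uses.

```python
import operator

# Hypothetical comparator table; a real structured query is produced by the LLM chain.
COMPARATORS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}

# Movie summaries and metadata taken from the example above.
movies = [
    ("A bunch of scientists bring back dinosaurs and mayhem breaks loose",
     {"year": 1993, "rating": 7.7, "genre": "science fiction"}),
    ("A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
     {"year": 2006, "director": "Satoshi Kon", "rating": 8.6}),
    ("Toys come alive and have a blast doing so",
     {"year": 1995, "genre": "animated"}),
    ("Three men walk into the Zone, three men walk out of the Zone",
     {"year": 1979, "director": "Andrei Tarkovsky", "genre": "thriller", "rating": 9.9}),
]

def matches(metadata, conditions):
    """True when every (attribute, comparator, value) condition holds (logical AND)."""
    for attribute, comparator, value in conditions:
        if attribute not in metadata:
            return False  # a document without the attribute cannot satisfy the condition
        if not COMPARATORS[comparator](metadata[attribute], value):
            return False
    return True

def filter_movies(conditions):
    """Return the summaries of all movies whose metadata satisfies the conditions."""
    return [summary for summary, metadata in movies if matches(metadata, conditions)]

# "rated higher than 8.5" becomes a single comparison.
high_rated = filter_movies([("rating", "gt", 8.5)])

# "after 1990 but before 2005 ... animated" becomes a composite (AND) filter.
toys = filter_movies([("year", "gt", 1990), ("year", "lt", 2005), ("genre", "eq", "animated")])

print(high_rated)
print(toys)
```

In the real retriever the model also extracts a free-text query, and the vector store combines the semantic search with a filter of this kind.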
■ Shows how to retrieve documents using the parent_splitter/child_splitter arguments of the ParentDocumentRetriever constructor.
※ Define the OPENAI_API_KEY environment variable in the .env file.
▶ main.py
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.storage import InMemoryStore
from langchain.retrievers import ParentDocumentRetriever

load_dotenv()

textLoaderList = [
    TextLoader("paul_graham_essay.txt" , encoding = "utf-8"),
    TextLoader("state_of_the_union.txt", encoding = "utf-8")
]

documentList = []

for textLoader in textLoaderList:
    documentList.extend(textLoader.load())

recursiveCharacterTextSplitter1 = RecursiveCharacterTextSplitter(chunk_size = 2000)
recursiveCharacterTextSplitter2 = RecursiveCharacterTextSplitter(chunk_size = 400 )

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma(collection_name = "split_parents", embedding_function = openAIEmbeddings)

inMemoryStore = InMemoryStore()

parentDocumentRetriever = ParentDocumentRetriever(
    vectorstore = chroma,
    docstore = inMemoryStore,
    parent_splitter = recursiveCharacterTextSplitter1,
    child_splitter = recursiveCharacterTextSplitter2
)

parentDocumentRetriever.add_documents(documentList, ids = None)

# Number of parent chunks kept in the document store.
print(len(list(inMemoryStore.yield_keys())))

# Searching the vector store directly returns a small child chunk.
resultSplitDocumentList = chroma.similarity_search("justice breyer")

print(len(resultSplitDocumentList[0].page_content))

# Searching through the retriever returns the larger parent chunk.
resultDocumentList = parentDocumentRetriever.invoke("justice breyer")

print(len(resultDocumentList[0].page_content))
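The parent/child idea can be shown in miniature without embeddings: small child chunks are what gets searched, and each one carries the ID of the larger parent chunk that is returned. Everything here is an assumption made for illustration: the synthetic text, the fixed-width `split` helper, the chunk sizes 100 and 20, and the substring match that stands in for vector similarity.

```python
def split(text: str, size: int) -> list[str]:
    """Fixed-width splitter (a simplification of a real text splitter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Synthetic 400-character document: 40 tokens of exactly 10 characters each.
document = "".join(f"token{i:04d} " for i in range(40))

parents = {}    # parent_id -> parent chunk (plays the role of the docstore)
children = []   # (child chunk, parent_id) pairs (plays the role of the vector store)

for parent_id, parent_chunk in enumerate(split(document, 100)):
    parents[parent_id] = parent_chunk
    for child_chunk in split(parent_chunk, 20):
        children.append((child_chunk, parent_id))

def retrieve_parent(query: str):
    """Find the first child containing the query and return it with its parent chunk."""
    for child_chunk, parent_id in children:
        if query in child_chunk:  # stand-in for a similarity search over child chunks
            return child_chunk, parents[parent_id]
    return None

child, parent = retrieve_parent("token0013")

print(len(child))   # size of the matched child chunk
print(len(parent))  # size of the parent chunk handed back to the caller
```

As in the example above, the match is made on the small chunk, but the text handed back is the larger chunk that contains it.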
■ Shows how to retrieve parent documents using the ParentDocumentRetriever class.
※ Define the OPENAI_API_KEY environment variable in the .env file.
▶ main.py
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.storage import InMemoryStore
from langchain.retrievers import ParentDocumentRetriever

load_dotenv()

textLoaderList = [
    TextLoader("paul_graham_essay.txt" , encoding = "utf-8"),
    TextLoader("state_of_the_union.txt", encoding = "utf-8")
]

documentList = []

for textLoader in textLoaderList:
    documentList.extend(textLoader.load())

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter(chunk_size = 400)

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma(collection_name = "full_documents", embedding_function = openAIEmbeddings)

inMemoryStore = InMemoryStore()

parentDocumentRetriever = ParentDocumentRetriever(
    vectorstore = chroma,
    docstore = inMemoryStore,
    child_splitter = recursiveCharacterTextSplitter,
)

parentDocumentRetriever.add_documents(documentList, ids = None)

# Keys of the full documents kept in the document store.
print(list(inMemoryStore.yield_keys()))

# Searching the vector store directly returns a small child chunk.
resultSplitDocumentList = chroma.similarity_search("justice breyer")

print(len(resultSplitDocumentList[0].page_content))

# Searching through the retriever returns the full parent document.
resultDocumentList = parentDocumentRetriever.invoke("justice breyer")

print(len(resultDocumentList[0].page_content))
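■ Shows how to improve retrieval with the MultiVectorRetriever class by generating hypothetical questions and linking them to documents.
※ Define the OPENAI_API_KEY environment variable in the .env file.
■ Shows how to associate summaries with documents for retrieval using the MultiVectorRetriever class.
※ Define the OPENAI_API_KEY environment variable in the .env file.
▶ main.py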
import uuid from dotenv import load_dotenv from langchain_community.document_loaders import TextLoader from langchain_text_splitters import RecursiveCharacterTextSplitter from langchain_openai import ChatOpenAI from langchain_core.prompts import ChatPromptTemplate from langchain_core.output_parsers import StrOutputParser from langchain_openai import OpenAIEmbeddings from langchain_chroma import Chroma from langchain.storage import InMemoryByteStore from langchain.retrievers.multi_vector import MultiVectorRetriever from langchain_core.documents import Document load_dotenv() idKey = "doc_id" TextLoaderList = [ TextLoader("paul_graham_essay.txt" , encoding = "utf-8"), TextLoader("state_of_the_union.txt", encoding = "utf-8") ] documentList = [] for textLoader in TextLoaderList: documentList.extend(textLoader.load()) recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter(chunk_size = 10000) splitDocumentList = recursiveCharacterTextSplitter.split_documents(documentList) splitDocumentIDList = [str(uuid.uuid4()) for _ in splitDocumentList] for i, splitDocument in enumerate(splitDocumentList): splitDocument.metadata[idKey] = splitDocumentIDList[i] chatOpenAI = ChatOpenAI(model_name = "gpt-4o-mini") runnableSequence = ( {"doc" : lambda document : document.page_content} | ChatPromptTemplate.from_template("Summarize the following document :\n\n{doc}") | chatOpenAI | StrOutputParser() ) summaryList = runnableSequence.batch(splitDocumentList, {"max_concurrency" : 5}) summaryDocumentList = [ Document(page_content = summary, metadata = {idKey : splitDocumentIDList[i]}) for i, summary in enumerate(summaryList) ] openAIEmbeddings = OpenAIEmbeddings() chroma = Chroma(collection_name = "summaries", embedding_function = openAIEmbeddings) inMemoryByteStore = InMemoryByteStore() multiVectorRetriever = MultiVectorRetriever( vectorstore = chroma, byte_store = inMemoryByteStore, id_key = idKey, ) multiVectorRetriever.vectorstore.add_documents(summaryDocumentList) 
multiVectorRetriever.docstore.mset(list(zip(splitDocumentIDList, splitDocumentList))) multiVectorRetriever.vectorstore.add_documents(splitDocumentList) resultDocumentList = multiVectorRetriever.vectorstore.similarity_search("justice breyer") for resultDocument in resultDocumentList: print(resultDocument.metadata) |
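The `batch` call in the listing above passes `{"max_concurrency" : 5}`, which caps how many summarization requests run at the same time while the summaries still come back in the same order as the input chunks. The following is a minimal sketch of that idea using only the standard library. It is not LangChain's implementation; `summarize` and `boundedBatch` are hypothetical names, and the stand-in "summary" simply keeps the first sentence of each text.

```python
from concurrent.futures import ThreadPoolExecutor

def summarize(text):
    # Hypothetical stand-in for the LLM call : keep only the first sentence.
    return text.split(".")[0].strip() + "."

def boundedBatch(function, inputList, maxConcurrency):
    # Run function over inputList with at most maxConcurrency worker threads.
    # executor.map returns results in input order, whatever the completion order.
    with ThreadPoolExecutor(max_workers = maxConcurrency) as executor:
        return list(executor.map(function, inputList))

textList = [
    "First chunk. More text follows.",
    "Second chunk. Even more text.",
    "Third chunk. The end."
]

summaryList = boundedBatch(summarize, textList, 2)

print(summaryList)
```

Because the order is preserved, pairing `summaryList[i]` with `splitDocumentIDList[i]`, as the listing does, stays valid.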
■ Shows how to run an MMR (Maximal Marginal Relevance) search using the search_type variable of the MultiVectorRetriever class.
▶ main.py
import uuid

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.retrievers.multi_vector import SearchType

TextLoaderList = [
    TextLoader("paul_graham_essay.txt" , encoding = "utf-8"),
    TextLoader("state_of_the_union.txt", encoding = "utf-8")
]

documentList = []

for textLoader in TextLoaderList:
    documentList.extend(textLoader.load())

# Parent chunks : large pieces that are returned to the caller.
recursiveCharacterTextSplitter1 = RecursiveCharacterTextSplitter(chunk_size = 10000)

splitDocumentList = recursiveCharacterTextSplitter1.split_documents(documentList)

splitDocumentIDList = [str(uuid.uuid4()) for _ in splitDocumentList]

# Child chunks : small pieces that are embedded and searched.
recursiveCharacterTextSplitter2 = RecursiveCharacterTextSplitter(chunk_size = 400)

totalSplitSplitDocumentList = []

for i, splitDocument in enumerate(splitDocumentList):
    splitDocumentID = splitDocumentIDList[i]
    splitSplitDocumentList = recursiveCharacterTextSplitter2.split_documents([splitDocument])
    # Tag every child chunk with the ID of its parent chunk.
    for splitSplitDocument in splitSplitDocumentList:
        splitSplitDocument.metadata["doc_id"] = splitDocumentID
    totalSplitSplitDocumentList.extend(splitSplitDocumentList)

# OpenAIEmbeddings reads the OPENAI_API_KEY environment variable.
openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma(collection_name = "full_documents", embedding_function = openAIEmbeddings)

inMemoryByteStore = InMemoryByteStore()

multiVectorRetriever = MultiVectorRetriever(
    vectorstore = chroma,
    byte_store = inMemoryByteStore,
    id_key = "doc_id"
)

# Use MMR instead of the default similarity search.
multiVectorRetriever.search_type = SearchType.mmr

multiVectorRetriever.vectorstore.add_documents(totalSplitSplitDocumentList)
multiVectorRetriever.docstore.mset(list(zip(splitDocumentIDList, splitDocumentList)))

resultDocumentList = multiVectorRetriever.invoke("justice breyer")

resultDocument = resultDocumentList[0]

print(resultDocument)
▶ requirements.txt
aiohappyeyeballs==2.4.0
aiohttp==3.10.5
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.4.0
asgiref==3.8.1
attrs==24.2.0
backoff==2.2.1
bcrypt==4.2.0
build==1.2.2
cachetools==5.5.0
certifi==2024.8.30
charset-normalizer==3.3.2
chroma-hnswlib==0.7.3
chromadb==0.5.3
click==8.1.7
colorama==0.4.6
coloredlogs==15.0.1
dataclasses-json==0.6.7
Deprecated==1.2.14
distro==1.9.0
fastapi==0.114.2
filelock==3.16.0
flatbuffers==24.3.25
frozenlist==1.4.1
fsspec==2024.9.0
google-auth==2.34.0
googleapis-common-protos==1.65.0
greenlet==3.1.0
grpcio==1.66.1
h11==0.14.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.2
huggingface-hub==0.24.7
humanfriendly==10.0
idna==3.9
importlib_metadata==8.4.0
importlib_resources==6.4.5
jiter==0.5.0
jsonpatch==1.33
jsonpointer==3.0.0
kubernetes==30.1.0
langchain==0.3.0
langchain-chroma==0.1.4
langchain-community==0.3.0
langchain-core==0.3.0
langchain-openai==0.2.0
langchain-text-splitters==0.3.0
langsmith==0.1.120
markdown-it-py==3.0.0
marshmallow==3.22.0
mdurl==0.1.2
mmh3==4.1.0
monotonic==1.6
mpmath==1.3.0
multidict==6.1.0
mypy-extensions==1.0.0
numpy==1.26.4
oauthlib==3.2.2
onnxruntime==1.19.2
openai==1.45.0
opentelemetry-api==1.27.0
opentelemetry-exporter-otlp-proto-common==1.27.0
opentelemetry-exporter-otlp-proto-grpc==1.27.0
opentelemetry-instrumentation==0.48b0
opentelemetry-instrumentation-asgi==0.48b0
opentelemetry-instrumentation-fastapi==0.48b0
opentelemetry-proto==1.27.0
opentelemetry-sdk==1.27.0
opentelemetry-semantic-conventions==0.48b0
opentelemetry-util-http==0.48b0
orjson==3.10.7
overrides==7.7.0
packaging==24.1
posthog==3.6.5
protobuf==4.25.4
pyasn1==0.6.1
pyasn1_modules==0.4.1
pydantic==2.9.1
pydantic-settings==2.5.2
pydantic_core==2.23.3
Pygments==2.18.0
PyPika==0.48.9
pyproject_hooks==1.1.0
pyreadline3==3.4.3
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
PyYAML==6.0.2
regex==2024.9.11
requests==2.32.3
requests-oauthlib==2.0.0
rich==13.8.1
rsa==4.9
setuptools==74.1.2
shellingham==1.5.4
six==1.16.0
sniffio==1.3.1
SQLAlchemy==2.0.34
starlette==0.38.5
sympy==1.13.2
tenacity==8.5.0
tiktoken==0.7.0
tokenizers==0.20.0
tqdm==4.66.5
typer==0.12.5
typing-inspect==0.9.0
typing_extensions==4.12.2
urllib3==2.2.3
uvicorn==0.30.6
watchfiles==0.24.0
websocket-client==1.8.0
websockets==13.0.1
wrapt==1.16.0
yarl==1.11.1
zipp==3.20.2
※ pip install langchain langchain-community langchain-openai langchain-chroma
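MMR picks results one at a time, trading relevance to the query against similarity to results that were already picked, so near-duplicate chunks are less likely to fill the result list. The following is a minimal pure-Python sketch of that greedy selection, not the code the library runs. The function names, the toy two-dimensional vectors, and the use of cosine similarity are assumptions made for illustration.

```python
import math

def cosineSimilarity(vector1, vector2):
    # Cosine of the angle between two vectors.
    dotProduct = sum(a * b for a, b in zip(vector1, vector2))
    norm1 = math.sqrt(sum(a * a for a in vector1))
    norm2 = math.sqrt(sum(b * b for b in vector2))
    return dotProduct / (norm1 * norm2)

def selectByMMR(queryVector, candidateVectorList, k, lambdaMultiplier = 0.5):
    # Greedy MMR : repeatedly pick the candidate with the highest value of
    # lambda * relevance - (1 - lambda) * redundancy.
    selectedIndexList = []
    remainingIndexList = list(range(len(candidateVectorList)))
    while remainingIndexList and len(selectedIndexList) < k:
        bestIndex = None
        bestScore = -math.inf
        for index in remainingIndexList:
            relevance = cosineSimilarity(queryVector, candidateVectorList[index])
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max(
                (
                    cosineSimilarity(candidateVectorList[index], candidateVectorList[selectedIndex])
                    for selectedIndex in selectedIndexList
                ),
                default = 0.0
            )
            score = lambdaMultiplier * relevance - (1 - lambdaMultiplier) * redundancy
            if score > bestScore:
                bestScore = score
                bestIndex = index
        selectedIndexList.append(bestIndex)
        remainingIndexList.remove(bestIndex)
    return selectedIndexList

queryVector = [1.0, 0.0]

candidateVectorList = [
    [0.9,  0.1 ],  # most relevant
    [0.9,  0.12],  # near-duplicate of the first candidate
    [0.7, -0.7 ]   # less relevant but different
]

print(selectByMMR(queryVector, candidateVectorList, 2))
print(selectByMMR(queryVector, candidateVectorList, 2, lambdaMultiplier = 1.0))
```

With the default weight the second pick is the third candidate, because the near-duplicate is penalized for resembling the first pick. With `lambdaMultiplier = 1.0` the penalty disappears and the selection reduces to a plain relevance ranking.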
■ Shows how to retrieve parent documents using the invoke method of the MultiVectorRetriever class.
▶ main.py
import uuid

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever

TextLoaderList = [
    TextLoader("paul_graham_essay.txt" , encoding = "utf-8"),
    TextLoader("state_of_the_union.txt", encoding = "utf-8")
]

documentList = []

for textLoader in TextLoaderList:
    documentList.extend(textLoader.load())

# Parent chunks : large pieces that are returned to the caller.
recursiveCharacterTextSplitter1 = RecursiveCharacterTextSplitter(chunk_size = 10000)

splitDocumentList = recursiveCharacterTextSplitter1.split_documents(documentList)

splitDocumentIDList = [str(uuid.uuid4()) for _ in splitDocumentList]

# Child chunks : small pieces that are embedded and searched.
recursiveCharacterTextSplitter2 = RecursiveCharacterTextSplitter(chunk_size = 400)

totalSplitSplitDocumentList = []

for i, splitDocument in enumerate(splitDocumentList):
    splitDocumentID = splitDocumentIDList[i]
    splitSplitDocumentList = recursiveCharacterTextSplitter2.split_documents([splitDocument])
    # Tag every child chunk with the ID of its parent chunk.
    for splitSplitDocument in splitSplitDocumentList:
        splitSplitDocument.metadata["doc_id"] = splitDocumentID
    totalSplitSplitDocumentList.extend(splitSplitDocumentList)

# OpenAIEmbeddings reads the OPENAI_API_KEY environment variable.
openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma(collection_name = "full_documents", embedding_function = openAIEmbeddings)

inMemoryByteStore = InMemoryByteStore()

multiVectorRetriever = MultiVectorRetriever(
    vectorstore = chroma,
    byte_store = inMemoryByteStore,
    id_key = "doc_id"
)

multiVectorRetriever.vectorstore.add_documents(totalSplitSplitDocumentList)
multiVectorRetriever.docstore.mset(list(zip(splitDocumentIDList, splitDocumentList)))

# invoke searches the child chunks but returns the matching parent chunks.
resultDocumentList = multiVectorRetriever.invoke("justice breyer")

resultDocument = resultDocumentList[0]

print(len(resultDocument.page_content))
print(resultDocument.metadata)
▶ requirements.txt : identical to the requirements.txt listed above.
※ pip install langchain langchain-community langchain-openai langchain-chroma
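The example searches 400-character child chunks but prints a parent chunk, which can be up to 10,000 characters long. Conceptually, the step in between collects the `doc_id` of each child hit, drops repeated IDs while keeping the hit order, and looks the parents up in the document store. The following sketch mirrors that step with plain dictionaries. It is an illustration under those assumptions rather than the retriever's own code, and `resolveParentDocumentList` and the sample data are hypothetical.

```python
def resolveParentDocumentList(childHitList, parentStore, idKey = "doc_id"):
    # Collect parent IDs in hit order, skipping IDs that were already seen.
    parentIDList = []
    for childHit in childHitList:
        parentID = childHit["metadata"].get(idKey)
        if parentID is not None and parentID not in parentIDList:
            parentIDList.append(parentID)
    # Look up each parent, ignoring IDs that are missing from the store.
    return [parentStore[parentID] for parentID in parentIDList if parentID in parentStore]

# Hypothetical parent store : ID -> full parent text.
parentStore = {
    "p1" : "Full text of parent chunk 1",
    "p2" : "Full text of parent chunk 2"
}

# Hypothetical child hits, best match first.
childHitList = [
    {"page_content" : "small chunk a", "metadata" : {"doc_id" : "p2"}},
    {"page_content" : "small chunk b", "metadata" : {"doc_id" : "p2"}},
    {"page_content" : "small chunk c", "metadata" : {"doc_id" : "p1"}},
    {"page_content" : "small chunk d", "metadata" : {"doc_id" : "p2"}}
]

parentDocumentList = resolveParentDocumentList(childHitList, parentStore)

print(parentDocumentList)
```

Four child hits collapse into two parents here, which is why the number of documents returned by `invoke` can be smaller than the number of child chunks that matched.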
■ Shows how to run a similarity search over child documents using the vectorstore/docstore variables of the MultiVectorRetriever class.
▶ main.py
import uuid

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever

TextLoaderList = [
    TextLoader("paul_graham_essay.txt" , encoding = "utf-8"),
    TextLoader("state_of_the_union.txt", encoding = "utf-8")
]

documentList = []

for textLoader in TextLoaderList:
    documentList.extend(textLoader.load())

# Parent chunks : large pieces kept in the document store.
recursiveCharacterTextSplitter1 = RecursiveCharacterTextSplitter(chunk_size = 10000)

splitDocumentList = recursiveCharacterTextSplitter1.split_documents(documentList)

splitDocumentIDList = [str(uuid.uuid4()) for _ in splitDocumentList]

# Child chunks : small pieces that are embedded and searched.
recursiveCharacterTextSplitter2 = RecursiveCharacterTextSplitter(chunk_size = 400)

totalSplitSplitDocumentList = []

for i, splitDocument in enumerate(splitDocumentList):
    documentID = splitDocumentIDList[i]
    splitSplitDocumentList = recursiveCharacterTextSplitter2.split_documents([splitDocument])
    # Tag every child chunk with the ID of its parent chunk.
    for subsidiaryDocument in splitSplitDocumentList:
        subsidiaryDocument.metadata["doc_id"] = documentID
    totalSplitSplitDocumentList.extend(splitSplitDocumentList)

# OpenAIEmbeddings reads the OPENAI_API_KEY environment variable.
openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma(collection_name = "full_documents", embedding_function = openAIEmbeddings)

inMemoryByteStore = InMemoryByteStore()

multiVectorRetriever = MultiVectorRetriever(
    vectorstore = chroma,
    byte_store = inMemoryByteStore,
    id_key = "doc_id"
)

multiVectorRetriever.vectorstore.add_documents(totalSplitSplitDocumentList)
multiVectorRetriever.docstore.mset(list(zip(splitDocumentIDList, splitDocumentList)))

# Searching the vector store directly returns child chunks, not parent chunks.
resultDocumentList = multiVectorRetriever.vectorstore.similarity_search("justice breyer")

print(resultDocumentList[0])
▶ requirements.txt : identical to the requirements.txt listed above.
※ pip install langchain langchain-community langchain-openai langchain-chroma
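A similarity search over the vector store ranks the stored child chunks by how close their embeddings are to the query embedding and returns the closest ones together with their metadata, including `doc_id`. The following is a rough sketch of that ranking with toy two-dimensional vectors. Cosine similarity is an assumption chosen for illustration (the metric Chroma actually uses depends on how the collection is configured), and `searchSimilar` and the sample entries are hypothetical.

```python
import math

def cosineSimilarity(vector1, vector2):
    # Cosine of the angle between two vectors.
    dotProduct = sum(a * b for a, b in zip(vector1, vector2))
    norm1 = math.sqrt(sum(a * a for a in vector1))
    norm2 = math.sqrt(sum(b * b for b in vector2))
    return dotProduct / (norm1 * norm2)

def searchSimilar(queryVector, entryList, k = 4):
    # Sort entries by similarity to the query, highest first, and keep the top k.
    rankedEntryList = sorted(
        entryList,
        key = lambda entry : cosineSimilarity(queryVector, entry["vector"]),
        reverse = True
    )
    return rankedEntryList[:k]

# Hypothetical child chunks with toy embeddings and parent IDs.
entryList = [
    {"text" : "chunk about courts"  , "vector" : [0.9, 0.1], "metadata" : {"doc_id" : "parent-2"}},
    {"text" : "chunk about startups", "vector" : [0.1, 0.9], "metadata" : {"doc_id" : "parent-1"}},
    {"text" : "chunk about judges"  , "vector" : [0.8, 0.3], "metadata" : {"doc_id" : "parent-2"}}
]

resultList = searchSimilar([1.0, 0.0], entryList, k = 2)

for result in resultList:
    print(result["text"], result["metadata"])
```

Both hits point at the same parent ID, which is the link the retriever follows when it is asked for parent documents instead of child chunks.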