[PYTHON/LANGCHAIN] create_react_agent function: creating a CompiledStateGraph object and searching a CHROMA vector store
■ Shows how to use the create_react_agent function to create a CompiledStateGraph object and search a CHROMA vector store. ※ The OPENAI_API_KEY environment variable value is defined in the .env file.
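※ The full example code for this item is not included in this post; the block below is a minimal sketch of the pattern, assuming the langgraph package is installed. The sample text, tool name, tool description, and question are illustrative values, not part of the original example.

# Minimal sketch (assumption) : expose a Chroma retriever as a tool and pass it to
# create_react_agent, which returns a CompiledStateGraph that can be invoked with messages.
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.tools.retriever import create_retriever_tool
from langgraph.prebuilt import create_react_agent

load_dotenv()

chroma = Chroma.from_texts(
    ["Task decomposition breaks a complex task into smaller, manageable steps."],  # illustrative content
    OpenAIEmbeddings()
)

vectorStoreRetriever = chroma.as_retriever()

retrieverTool = create_retriever_tool(
    vectorStoreRetriever,
    "document_retriever",                                        # illustrative tool name
    "Searches the CHROMA vector store for relevant passages."    # illustrative tool description
)

chatOpenAI = ChatOpenAI(model = "gpt-4o")

compiledStateGraph = create_react_agent(chatOpenAI, [retrieverTool])  # CompiledStateGraph

responseDictionary = compiledStateGraph.invoke({"messages" : [("human", "What is task decomposition?")]})

print(responseDictionary["messages"][-1].content)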
■ Shows how to search a CHROMA vector store using chat history. ※ The OPENAI_API_KEY environment variable value is defined in the .env file. ▶ main.py
import bs4
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.prompts import MessagesPlaceholder
from langchain.chains import create_history_aware_retriever
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.messages import AIMessage

load_dotenv()

chatOpenAI = ChatOpenAI(model = "gpt-4o")

webBaseLoader = WebBaseLoader(
    web_paths = ("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs = dict(parse_only = bs4.SoupStrainer(class_ = ("post-content", "post-title", "post-header")))
)

documentList = webBaseLoader.load()

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 200)

splitDocumentList = recursiveCharacterTextSplitter.split_documents(documentList)

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma.from_documents(documents = splitDocumentList, embedding = openAIEmbeddings)

vectorStoreRetriever = chroma.as_retriever()

systemMessage1 = "Given a chat history and the latest user question which might reference context in the chat history, formulate a standalone question which can be understood without the chat history. Do NOT answer the question, just reformulate it if needed and otherwise return it as is."

chatPromptTemplate1 = ChatPromptTemplate.from_messages(
    [
        ("system", systemMessage1),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}")
    ]
)

runnableBinding1 = create_history_aware_retriever(chatOpenAI, vectorStoreRetriever, chatPromptTemplate1)

systemMessage2 = "You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, say that you don't know. Use three sentences maximum and keep the answer concise.\n\n{context}"

chatPromptTemplate2 = ChatPromptTemplate.from_messages(
    [
        ("system", systemMessage2),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

runnableBinding2 = create_stuff_documents_chain(chatOpenAI, chatPromptTemplate2)

runnableBinding3 = create_retrieval_chain(runnableBinding1, runnableBinding2)

chatMessageHistoryDictionary = {}

def GetChatMessageHistoryDictionary(session_id : str) -> BaseChatMessageHistory:
    if session_id not in chatMessageHistoryDictionary:
        chatMessageHistoryDictionary[session_id] = ChatMessageHistory()
    return chatMessageHistoryDictionary[session_id]

runnableWithMessageHistory = RunnableWithMessageHistory(
    runnableBinding3,
    GetChatMessageHistoryDictionary,
    input_messages_key = "input",
    history_messages_key = "chat_history",
    output_messages_key = "answer",
)

responseDictionary1 = runnableWithMessageHistory.invoke(
    {"input" : "What is Task Decomposition?"},
    config = {"configurable" : {"session_id" : "abc123"}}
)

answer1 = responseDictionary1["answer"]

print(answer1)
print("-" * 50)

"""
Task Decomposition is a technique used to break down complex tasks into smaller, manageable steps. It is often implemented using methods like Chain of Thought (CoT) or Tree of Thoughts, which help in systematically exploring and reasoning through various possibilities. This approach enhances model performance by allowing a step-by-step analysis and execution of tasks.
"""

responseDictionary2 = runnableWithMessageHistory.invoke(
    {"input" : "What are common ways of doing it?"},
    config = {"configurable" : {"session_id" : "abc123"}}
)

answer2 = responseDictionary2["answer"]

print(answer2)
print("-" * 50)

for message in chatMessageHistoryDictionary["abc123"].messages:
    if isinstance(message, AIMessage):
        prefix = "AI"
    else:
        prefix = "User"
    print(f"{prefix} : {message.content}")

print("-" * 50)

"""
Task decomposition is the process of breaking down a complex task into smaller, more manageable steps or subgoals. This approach, often used in conjunction with techniques like Chain of Thought (CoT), helps enhance model performance by enabling step-by-step reasoning. It can be achieved through prompting, task-specific instructions, or human inputs.
--------------------------------------------------
Common ways of performing task decomposition include using straightforward prompts like "Steps for XYZ.\n1." or "What are the subgoals for achieving XYZ?", employing task-specific instructions such as "Write a story outline" for writing a novel, and incorporating human inputs.
--------------------------------------------------
User : What is Task Decomposition?
AI : Task decomposition is the process of breaking down a complex task into smaller, more manageable steps or subgoals. This approach, often used in conjunction with techniques like Chain of Thought (CoT), helps enhance model performance by enabling step-by-step reasoning. It can be achieved through prompting, task-specific instructions, or human inputs.
User : What are common ways of doing it?
AI : Common ways of performing task decomposition include using straightforward prompts like "Steps for XYZ.\n1." or "What are the subgoals for achieving XYZ?", employing task-specific instructions such as "Write a story outline" for writing a novel, and incorporating human inputs.
--------------------------------------------------
"""
■ Shows how to use the ChromaTranslator class to build Chroma vector database queries. ▶ Example code (PY)
from langchain_community.query_constructors.chroma import ChromaTranslator

chromaTranslator = ChromaTranslator()
※ Run the pip install langchain-community command.
■ Shows how to use the query_constructor argument in the SelfQueryRetriever class constructor to configure the retriever with an LCEL chain. ※ The OPENAI_API_KEY environment variable value is defined in the .env file. ▶ main.py
from dotenv import load_dotenv
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.chains.query_constructor.base import get_query_constructor_prompt
from langchain.chains.query_constructor.base import StructuredQueryOutputParser
from langchain_openai import ChatOpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_community.query_constructors.chroma import ChromaTranslator

load_dotenv()

documentList = [
    Document(
        page_content = "A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata = {"year" : 1993, "rating" : 7.7, "genre" : "science fiction"}
    ),
    Document(
        page_content = "Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata = {"year" : 2010, "director" : "Christopher Nolan", "rating" : 8.2}
    ),
    Document(
        page_content = "A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata = {"year" : 2006, "director" : "Satoshi Kon", "rating" : 8.6}
    ),
    Document(
        page_content = "A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata = {"year" : 2019, "director" : "Greta Gerwig", "rating" : 8.3}
    ),
    Document(
        page_content = "Toys come alive and have a blast doing so",
        metadata = {"year" : 1995, "genre" : "animated"}
    ),
    Document(
        page_content = "Three men walk into the Zone, three men walk out of the Zone",
        metadata = {"year" : 1979, "director" : "Andrei Tarkovsky", "genre" : "thriller", "rating" : 9.9}
    )
]

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma.from_documents(documentList, openAIEmbeddings)

attributeInfoList = [
    AttributeInfo(
        name = "genre",
        description = "The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type = "string"
    ),
    AttributeInfo(
        name = "year",
        description = "The year the movie was released",
        type = "integer"
    ),
    AttributeInfo(
        name = "director",
        description = "The name of the movie director",
        type = "string"
    ),
    AttributeInfo(
        name = "rating",
        description = "A 1-10 rating for the movie",
        type = "float"
    )
]

documentContentDescription = "Brief summary of a movie"

fewShotPromptTemplate = get_query_constructor_prompt(
    documentContentDescription,
    attributeInfoList
)

structuredQueryOutputParser = StructuredQueryOutputParser.from_components()

chatOpenAI = ChatOpenAI(temperature = 0)

runnableSequence = fewShotPromptTemplate | chatOpenAI | structuredQueryOutputParser

selfQueryRetriever = SelfQueryRetriever(
    query_constructor = runnableSequence,
    vectorstore = chroma,
    structured_query_translator = ChromaTranslator(),
)

resultDocumentList = selfQueryRetriever.invoke("What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated")

for resultDocument in resultDocumentList:
    print(resultDocument)
■ Shows how to use the enable_limit argument of the SelfQueryRetriever class's from_llm static method to create a SelfQueryRetriever object. ※ The OPENAI_API_KEY environment variable value is defined in the .env file.
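※ The example code for this item is not shown in this post; the block below is a minimal sketch, assuming the chroma vector store, documentContentDescription, and attributeInfoList defined in the full SelfQueryRetriever example that follows.

# Minimal sketch (assumption) : reuses chroma, documentContentDescription and attributeInfoList
# from the full example below; enable_limit lets the generated query include a result count.
from langchain_openai import ChatOpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever

chatOpenAI = ChatOpenAI(temperature = 0)

selfQueryRetriever = SelfQueryRetriever.from_llm(
    chatOpenAI,
    chroma,
    documentContentDescription,
    attributeInfoList,
    enable_limit = True
)

# "two" can now be translated into a limit on the number of returned documents.
resultDocumentList = selfQueryRetriever.invoke("What are two movies about dinosaurs")

for resultDocument in resultDocumentList:
    print(resultDocument.page_content)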
■ Shows how to use the SelfQueryRetriever class to create a self-querying retriever. ※ The OPENAI_API_KEY environment variable value is defined in the .env file. ▶ main.py
from dotenv import load_dotenv
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.chains.query_constructor.base import AttributeInfo
from langchain_openai import ChatOpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever

load_dotenv()

documentList = [
    Document(
        page_content = "A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata = {"year" : 1993, "rating" : 7.7, "genre" : "science fiction"}
    ),
    Document(
        page_content = "Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata = {"year" : 2010, "director" : "Christopher Nolan", "rating" : 8.2}
    ),
    Document(
        page_content = "A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata = {"year" : 2006, "director" : "Satoshi Kon", "rating" : 8.6}
    ),
    Document(
        page_content = "A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata = {"year" : 2019, "director" : "Greta Gerwig", "rating" : 8.3}
    ),
    Document(
        page_content = "Toys come alive and have a blast doing so",
        metadata = {"year" : 1995, "genre" : "animated"}
    ),
    Document(
        page_content = "Three men walk into the Zone, three men walk out of the Zone",
        metadata = {"year" : 1979, "director" : "Andrei Tarkovsky", "genre" : "thriller", "rating" : 9.9}
    )
]

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma.from_documents(documentList, openAIEmbeddings)

attributeInfoList = [
    AttributeInfo(
        name = "genre",
        description = "The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type = "string"
    ),
    AttributeInfo(
        name = "year",
        description = "The year the movie was released",
        type = "integer"
    ),
    AttributeInfo(
        name = "director",
        description = "The name of the movie director",
        type = "string"
    ),
    AttributeInfo(
        name = "rating",
        description = "A 1-10 rating for the movie",
        type = "float"
    )
]

documentContentDescription = "Brief summary of a movie"

chatOpenAI = ChatOpenAI(temperature = 0)

selfQueryRetriever = SelfQueryRetriever.from_llm(
    chatOpenAI,
    chroma,
    documentContentDescription,
    attributeInfoList
)

# Specify a filter only.
resultDocumentList = selfQueryRetriever.invoke("I want to watch a movie rated higher than 8.5")

for resultDocument in resultDocumentList:
    print(resultDocument.page_content)

print()

# Specify a query and a filter.
resultDocumentList = selfQueryRetriever.invoke("Has Greta Gerwig directed any movies about women")

for resultDocument in resultDocumentList:
    print(resultDocument.page_content)

print()

# Specify a composite filter.
resultDocumentList = selfQueryRetriever.invoke("What's a highly rated (above 8.5) science fiction film?")

for resultDocument in resultDocumentList:
    print(resultDocument.page_content)

# Specify a query and a composite filter.
resultDocumentList = selfQueryRetriever.invoke("What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated")

print()

for resultDocument in resultDocumentList:
    print(resultDocument.page_content)
■ Shows how to use the parent_splitter/child_splitter arguments in the ParentDocumentRetriever class constructor to retrieve documents. ※ The OPENAI_API_KEY environment variable value is defined in the .env file. ▶ main.py
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.storage import InMemoryStore
from langchain.retrievers import ParentDocumentRetriever

load_dotenv()

textLoaderList = [
    TextLoader("paul_graham_essay.txt" , encoding = "utf-8"),
    TextLoader("state_of_the_union.txt", encoding = "utf-8")
]

documentList = []

for textLoader in textLoaderList:
    documentList.extend(textLoader.load())

recursiveCharacterTextSplitter1 = RecursiveCharacterTextSplitter(chunk_size = 2000)
recursiveCharacterTextSplitter2 = RecursiveCharacterTextSplitter(chunk_size = 400 )

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma(collection_name = "split_parents", embedding_function = openAIEmbeddings)

inMemoryStore = InMemoryStore()

parentDocumentRetriever = ParentDocumentRetriever(
    vectorstore = chroma,
    docstore = inMemoryStore,
    parent_splitter = recursiveCharacterTextSplitter1,
    child_splitter = recursiveCharacterTextSplitter2
)

parentDocumentRetriever.add_documents(documentList, ids = None)

print(len(list(inMemoryStore.yield_keys())))

resultSplitDocumentList = chroma.similarity_search("justice breyer")

print(len(resultSplitDocumentList[0].page_content))

resultDocumentList = parentDocumentRetriever.invoke("justice breyer")

print(len(resultDocumentList[0].page_content))
■ Shows how to use the ParentDocumentRetriever class to retrieve parent documents. ※ The OPENAI_API_KEY environment variable value is defined in the .env file. ▶ main.py
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.storage import InMemoryStore
from langchain.retrievers import ParentDocumentRetriever

load_dotenv()

textLoaderList = [
    TextLoader("paul_graham_essay.txt" , encoding = "utf-8"),
    TextLoader("state_of_the_union.txt", encoding = "utf-8")
]

documentList = []

for textLoader in textLoaderList:
    documentList.extend(textLoader.load())

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter(chunk_size = 400)

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma(collection_name = "full_documents", embedding_function = openAIEmbeddings)

inMemoryStore = InMemoryStore()

parentDocumentRetriever = ParentDocumentRetriever(
    vectorstore = chroma,
    docstore = inMemoryStore,
    child_splitter = recursiveCharacterTextSplitter,
)

parentDocumentRetriever.add_documents(documentList, ids = None)

print(list(inMemoryStore.yield_keys()))

resultSplitDocumentList = chroma.similarity_search("justice breyer")

print(len(resultSplitDocumentList[0].page_content))

resultDocumentList = parentDocumentRetriever.invoke("justice breyer")

print(len(resultDocumentList[0].page_content))
■ Improving retrieval by using the MultiVectorRetriever class to generate hypothetical questions and associate them with documents. ※ The OPENAI_API_KEY environment variable value is defined in the .env file. ▶ main.py
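※ The main.py for this item is not shown; below is a minimal sketch of the hypothetical-question pattern, assuming the same text files as the summary example that follows. The prompt wording and the line-splitting parser are assumptions (the LangChain documentation uses function calling for the same purpose).

# Minimal sketch (assumption) : generate hypothetical questions per document chunk,
# index the questions, and map them back to the original chunks by doc_id.
import uuid
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.documents import Document
from langchain_chroma import Chroma
from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever

load_dotenv()

idKey = "doc_id"

documentList = TextLoader("state_of_the_union.txt", encoding = "utf-8").load()

splitDocumentList = RecursiveCharacterTextSplitter(chunk_size = 10000).split_documents(documentList)

splitDocumentIDList = [str(uuid.uuid4()) for _ in splitDocumentList]

chatOpenAI = ChatOpenAI(model = "gpt-4o-mini")

# Ask the model for three hypothetical questions per chunk, one per line (prompt wording is an assumption).
runnableSequence = (
    {"doc" : lambda document : document.page_content}
    | ChatPromptTemplate.from_template("Generate exactly 3 hypothetical questions, one per line, that the document below could be used to answer :\n\n{doc}")
    | chatOpenAI
    | StrOutputParser()
)

questionTextList = runnableSequence.batch(splitDocumentList, {"max_concurrency" : 5})

questionDocumentList = []

for i, questionText in enumerate(questionTextList):
    for question in filter(None, questionText.strip().split("\n")):
        questionDocumentList.append(Document(page_content = question, metadata = {idKey : splitDocumentIDList[i]}))

chroma = Chroma(collection_name = "hypothetical_questions", embedding_function = OpenAIEmbeddings())

multiVectorRetriever = MultiVectorRetriever(vectorstore = chroma, byte_store = InMemoryByteStore(), id_key = idKey)

multiVectorRetriever.vectorstore.add_documents(questionDocumentList)
multiVectorRetriever.docstore.mset(list(zip(splitDocumentIDList, splitDocumentList)))

# Similar questions are matched in the vector store, but the full parent chunk is returned.
resultDocumentList = multiVectorRetriever.invoke("justice breyer")

print(len(resultDocumentList[0].page_content))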
■ Shows how to use the MultiVectorRetriever class to associate summaries with documents for retrieval. ※ The OPENAI_API_KEY environment variable value is defined in the .env file. ▶ main.py
import uuid
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_core.documents import Document

load_dotenv()

idKey = "doc_id"

textLoaderList = [
    TextLoader("paul_graham_essay.txt" , encoding = "utf-8"),
    TextLoader("state_of_the_union.txt", encoding = "utf-8")
]

documentList = []

for textLoader in textLoaderList:
    documentList.extend(textLoader.load())

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter(chunk_size = 10000)

splitDocumentList = recursiveCharacterTextSplitter.split_documents(documentList)

splitDocumentIDList = [str(uuid.uuid4()) for _ in splitDocumentList]

for i, splitDocument in enumerate(splitDocumentList):
    splitDocument.metadata[idKey] = splitDocumentIDList[i]

chatOpenAI = ChatOpenAI(model_name = "gpt-4o-mini")

runnableSequence = (
    {"doc" : lambda document : document.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document :\n\n{doc}")
    | chatOpenAI
    | StrOutputParser()
)

summaryList = runnableSequence.batch(splitDocumentList, {"max_concurrency" : 5})

summaryDocumentList = [
    Document(page_content = summary, metadata = {idKey : splitDocumentIDList[i]})
    for i, summary in enumerate(summaryList)
]

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma(collection_name = "summaries", embedding_function = openAIEmbeddings)

inMemoryByteStore = InMemoryByteStore()

multiVectorRetriever = MultiVectorRetriever(
    vectorstore = chroma,
    byte_store = inMemoryByteStore,
    id_key = idKey,
)

multiVectorRetriever.vectorstore.add_documents(summaryDocumentList)
multiVectorRetriever.docstore.mset(list(zip(splitDocumentIDList, splitDocumentList)))
multiVectorRetriever.vectorstore.add_documents(splitDocumentList)

resultDocumentList = multiVectorRetriever.vectorstore.similarity_search("justice breyer")

for resultDocument in resultDocumentList:
    print(resultDocument.metadata)
■ Shows how to use the MultiVectorRetriever class's search_type attribute to perform MMR (Maximal Marginal Relevance) search. ▶ main.py
import uuid
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.retrievers.multi_vector import SearchType

textLoaderList = [
    TextLoader("paul_graham_essay.txt" , encoding = "utf-8"),
    TextLoader("state_of_the_union.txt", encoding = "utf-8")
]

documentList = []

for textLoader in textLoaderList:
    documentList.extend(textLoader.load())

recursiveCharacterTextSplitter1 = RecursiveCharacterTextSplitter(chunk_size = 10000)

splitDocumentList = recursiveCharacterTextSplitter1.split_documents(documentList)

splitDocumentIDList = [str(uuid.uuid4()) for _ in splitDocumentList]

recursiveCharacterTextSplitter2 = RecursiveCharacterTextSplitter(chunk_size = 400)

totalSplitSplitDocumentList = []

for i, splitDocument in enumerate(splitDocumentList):
    splitDocumentID = splitDocumentIDList[i]
    splitSplitDocumentList = recursiveCharacterTextSplitter2.split_documents([splitDocument])
    for splitSplitDocument in splitSplitDocumentList:
        splitSplitDocument.metadata["doc_id"] = splitDocumentID
    totalSplitSplitDocumentList.extend(splitSplitDocumentList)

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma(collection_name = "full_documents", embedding_function = openAIEmbeddings)

inMemoryByteStore = InMemoryByteStore()

multiVectorRetriever = MultiVectorRetriever(
    vectorstore = chroma,
    byte_store = inMemoryByteStore,
    id_key = "doc_id"
)

multiVectorRetriever.search_type = SearchType.mmr

multiVectorRetriever.vectorstore.add_documents(totalSplitSplitDocumentList)
multiVectorRetriever.docstore.mset(list(zip(splitDocumentIDList, splitDocumentList)))

resultDocumentList = multiVectorRetriever.invoke("justice breyer")

resultDocument = resultDocumentList[0]

print(resultDocument)
▶ requirements.txt
aiohappyeyeballs==2.4.0 aiohttp==3.10.5 aiosignal==1.3.1 annotated-types==0.7.0 anyio==4.4.0 asgiref==3.8.1 attrs==24.2.0 backoff==2.2.1 bcrypt==4.2.0 build==1.2.2 cachetools==5.5.0 certifi==2024.8.30 charset-normalizer==3.3.2 chroma-hnswlib==0.7.3 chromadb==0.5.3 click==8.1.7 colorama==0.4.6 coloredlogs==15.0.1 dataclasses-json==0.6.7 Deprecated==1.2.14 distro==1.9.0 fastapi==0.114.2 filelock==3.16.0 flatbuffers==24.3.25 frozenlist==1.4.1 fsspec==2024.9.0 google-auth==2.34.0 googleapis-common-protos==1.65.0 greenlet==3.1.0 grpcio==1.66.1 h11==0.14.0 httpcore==1.0.5 httptools==0.6.1 httpx==0.27.2 huggingface-hub==0.24.7 humanfriendly==10.0 idna==3.9 importlib_metadata==8.4.0 importlib_resources==6.4.5 jiter==0.5.0 jsonpatch==1.33 jsonpointer==3.0.0 kubernetes==30.1.0 langchain==0.3.0 langchain-chroma==0.1.4 langchain-community==0.3.0 langchain-core==0.3.0 langchain-openai==0.2.0 langchain-text-splitters==0.3.0 langsmith==0.1.120 markdown-it-py==3.0.0 marshmallow==3.22.0 mdurl==0.1.2 mmh3==4.1.0 monotonic==1.6 mpmath==1.3.0 multidict==6.1.0 mypy-extensions==1.0.0 numpy==1.26.4 oauthlib==3.2.2 onnxruntime==1.19.2 openai==1.45.0 opentelemetry-api==1.27.0 opentelemetry-exporter-otlp-proto-common==1.27.0 opentelemetry-exporter-otlp-proto-grpc==1.27.0 opentelemetry-instrumentation==0.48b0 opentelemetry-instrumentation-asgi==0.48b0 opentelemetry-instrumentation-fastapi==0.48b0 opentelemetry-proto==1.27.0 opentelemetry-sdk==1.27.0 opentelemetry-semantic-conventions==0.48b0 opentelemetry-util-http==0.48b0 orjson==3.10.7 overrides==7.7.0 packaging==24.1 posthog==3.6.5 protobuf==4.25.4 pyasn1==0.6.1 pyasn1_modules==0.4.1 pydantic==2.9.1 pydantic-settings==2.5.2 pydantic_core==2.23.3 Pygments==2.18.0 PyPika==0.48.9 pyproject_hooks==1.1.0 pyreadline3==3.4.3 python-dateutil==2.9.0.post0 python-dotenv==1.0.1 PyYAML==6.0.2 regex==2024.9.11 requests==2.32.3 requests-oauthlib==2.0.0 rich==13.8.1 rsa==4.9 setuptools==74.1.2 shellingham==1.5.4 six==1.16.0 sniffio==1.3.1 SQLAlchemy==2.0.34 starlette==0.38.5 sympy==1.13.2 tenacity==8.5.0 tiktoken==0.7.0 tokenizers==0.20.0 tqdm==4.66.5 typer==0.12.5 typing-inspect==0.9.0 typing_extensions==4.12.2 urllib3==2.2.3 uvicorn==0.30.6 watchfiles==0.24.0 websocket-client==1.8.0 websockets==13.0.1 wrapt==1.16.0 yarl==1.11.1 zipp==3.20.2 |
※ pip install langchain
■ Shows how to use the invoke method of the MultiVectorRetriever class to retrieve parent documents. ▶ main.py
import uuid
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever

textLoaderList = [
    TextLoader("paul_graham_essay.txt" , encoding = "utf-8"),
    TextLoader("state_of_the_union.txt", encoding = "utf-8")
]

documentList = []

for textLoader in textLoaderList:
    documentList.extend(textLoader.load())

recursiveCharacterTextSplitter1 = RecursiveCharacterTextSplitter(chunk_size = 10000)

splitDocumentList = recursiveCharacterTextSplitter1.split_documents(documentList)

splitDocumentIDList = [str(uuid.uuid4()) for _ in splitDocumentList]

recursiveCharacterTextSplitter2 = RecursiveCharacterTextSplitter(chunk_size = 400)

totalSplitSplitDocumentList = []

for i, splitDocument in enumerate(splitDocumentList):
    splitDocumentID = splitDocumentIDList[i]
    splitSplitDocumentList = recursiveCharacterTextSplitter2.split_documents([splitDocument])
    for splitSplitDocument in splitSplitDocumentList:
        splitSplitDocument.metadata["doc_id"] = splitDocumentID
    totalSplitSplitDocumentList.extend(splitSplitDocumentList)

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma(collection_name = "full_documents", embedding_function = openAIEmbeddings)

inMemoryByteStore = InMemoryByteStore()

multiVectorRetriever = MultiVectorRetriever(
    vectorstore = chroma,
    byte_store = inMemoryByteStore,
    id_key = "doc_id"
)

multiVectorRetriever.vectorstore.add_documents(totalSplitSplitDocumentList)
multiVectorRetriever.docstore.mset(list(zip(splitDocumentIDList, splitDocumentList)))

resultDocumentList = multiVectorRetriever.invoke("justice breyer")

resultDocument = resultDocumentList[0]

print(len(resultDocument.page_content))
print(resultDocument.metadata)
▶ requirements.txt
aiohappyeyeballs==2.4.0 aiohttp==3.10.5 aiosignal==1.3.1 annotated-types==0.7.0 anyio==4.4.0 asgiref==3.8.1 attrs==24.2.0 backoff==2.2.1 bcrypt==4.2.0 build==1.2.2 cachetools==5.5.0 certifi==2024.8.30 charset-normalizer==3.3.2 chroma-hnswlib==0.7.3 chromadb==0.5.3 click==8.1.7 colorama==0.4.6 coloredlogs==15.0.1 dataclasses-json==0.6.7 Deprecated==1.2.14 distro==1.9.0 fastapi==0.114.2 filelock==3.16.0 flatbuffers==24.3.25 frozenlist==1.4.1 fsspec==2024.9.0 google-auth==2.34.0 googleapis-common-protos==1.65.0 greenlet==3.1.0 grpcio==1.66.1 h11==0.14.0 httpcore==1.0.5 httptools==0.6.1 httpx==0.27.2 huggingface-hub==0.24.7 humanfriendly==10.0 idna==3.9 importlib_metadata==8.4.0 importlib_resources==6.4.5 jiter==0.5.0 jsonpatch==1.33 jsonpointer==3.0.0 kubernetes==30.1.0 langchain==0.3.0 langchain-chroma==0.1.4 langchain-community==0.3.0 langchain-core==0.3.0 langchain-openai==0.2.0 langchain-text-splitters==0.3.0 langsmith==0.1.120 markdown-it-py==3.0.0 marshmallow==3.22.0 mdurl==0.1.2 mmh3==4.1.0 monotonic==1.6 mpmath==1.3.0 multidict==6.1.0 mypy-extensions==1.0.0 numpy==1.26.4 oauthlib==3.2.2 onnxruntime==1.19.2 openai==1.45.0 opentelemetry-api==1.27.0 opentelemetry-exporter-otlp-proto-common==1.27.0 opentelemetry-exporter-otlp-proto-grpc==1.27.0 opentelemetry-instrumentation==0.48b0 opentelemetry-instrumentation-asgi==0.48b0 opentelemetry-instrumentation-fastapi==0.48b0 opentelemetry-proto==1.27.0 opentelemetry-sdk==1.27.0 opentelemetry-semantic-conventions==0.48b0 opentelemetry-util-http==0.48b0 orjson==3.10.7 overrides==7.7.0 packaging==24.1 posthog==3.6.5 protobuf==4.25.4 pyasn1==0.6.1 pyasn1_modules==0.4.1 pydantic==2.9.1 pydantic-settings==2.5.2 pydantic_core==2.23.3 Pygments==2.18.0 PyPika==0.48.9 pyproject_hooks==1.1.0 pyreadline3==3.4.3 python-dateutil==2.9.0.post0 python-dotenv==1.0.1 PyYAML==6.0.2 regex==2024.9.11 requests==2.32.3 requests-oauthlib==2.0.0 rich==13.8.1 rsa==4.9 setuptools==74.1.2 shellingham==1.5.4 six==1.16.0 sniffio==1.3.1 SQLAlchemy==2.0.34 starlette==0.38.5 sympy==1.13.2 tenacity==8.5.0 tiktoken==0.7.0 tokenizers==0.20.0 tqdm==4.66.5 typer==0.12.5 typing-inspect==0.9.0 typing_extensions==4.12.2 urllib3==2.2.3 uvicorn==0.30.6 watchfiles==0.24.0 websocket-client==1.8.0 websockets==13.0.1 wrapt==1.16.0 yarl==1.11.1 zipp==3.20.2 |
※ pip install langchain langchain-community
■ Shows how to use the MultiVectorRetriever class's vectorstore/docstore attributes to run a similarity search over child documents. ▶ main.py
import uuid
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever

textLoaderList = [
    TextLoader("paul_graham_essay.txt" , encoding = "utf-8"),
    TextLoader("state_of_the_union.txt", encoding = "utf-8")
]

documentList = []

for textLoader in textLoaderList:
    documentList.extend(textLoader.load())

recursiveCharacterTextSplitter1 = RecursiveCharacterTextSplitter(chunk_size = 10000)

splitDocumentList = recursiveCharacterTextSplitter1.split_documents(documentList)

splitDocumentIDList = [str(uuid.uuid4()) for _ in splitDocumentList]

recursiveCharacterTextSplitter2 = RecursiveCharacterTextSplitter(chunk_size = 400)

totalSplitSplitDocumentList = []

for i, splitDocument in enumerate(splitDocumentList):
    documentID = splitDocumentIDList[i]
    splitSplitDocumentList = recursiveCharacterTextSplitter2.split_documents([splitDocument])
    for splitSplitDocument in splitSplitDocumentList:
        splitSplitDocument.metadata["doc_id"] = documentID
    totalSplitSplitDocumentList.extend(splitSplitDocumentList)

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma(collection_name = "full_documents", embedding_function = openAIEmbeddings)

inMemoryByteStore = InMemoryByteStore()

multiVectorRetriever = MultiVectorRetriever(
    vectorstore = chroma,
    byte_store = inMemoryByteStore,
    id_key = "doc_id"
)

multiVectorRetriever.vectorstore.add_documents(totalSplitSplitDocumentList)
multiVectorRetriever.docstore.mset(list(zip(splitDocumentIDList, splitDocumentList)))

resultDocumentList = multiVectorRetriever.vectorstore.similarity_search("justice breyer")

print(resultDocumentList[0])
▶ requirements.txt
aiohappyeyeballs==2.4.0 aiohttp==3.10.5 aiosignal==1.3.1 annotated-types==0.7.0 anyio==4.4.0 asgiref==3.8.1 attrs==24.2.0 backoff==2.2.1 bcrypt==4.2.0 build==1.2.2 cachetools==5.5.0 certifi==2024.8.30 charset-normalizer==3.3.2 chroma-hnswlib==0.7.3 chromadb==0.5.3 click==8.1.7 colorama==0.4.6 coloredlogs==15.0.1 dataclasses-json==0.6.7 Deprecated==1.2.14 distro==1.9.0 fastapi==0.114.2 filelock==3.16.0 flatbuffers==24.3.25 frozenlist==1.4.1 fsspec==2024.9.0 google-auth==2.34.0 googleapis-common-protos==1.65.0 greenlet==3.1.0 grpcio==1.66.1 h11==0.14.0 httpcore==1.0.5 httptools==0.6.1 httpx==0.27.2 huggingface-hub==0.24.7 humanfriendly==10.0 idna==3.9 importlib_metadata==8.4.0 importlib_resources==6.4.5 jiter==0.5.0 jsonpatch==1.33 jsonpointer==3.0.0 kubernetes==30.1.0 langchain==0.3.0 langchain-chroma==0.1.4 langchain-community==0.3.0 langchain-core==0.3.0 langchain-openai==0.2.0 langchain-text-splitters==0.3.0 langsmith==0.1.120 markdown-it-py==3.0.0 marshmallow==3.22.0 mdurl==0.1.2 mmh3==4.1.0 monotonic==1.6 mpmath==1.3.0 multidict==6.1.0 mypy-extensions==1.0.0 numpy==1.26.4 oauthlib==3.2.2 onnxruntime==1.19.2 openai==1.45.0 opentelemetry-api==1.27.0 opentelemetry-exporter-otlp-proto-common==1.27.0 opentelemetry-exporter-otlp-proto-grpc==1.27.0 opentelemetry-instrumentation==0.48b0 opentelemetry-instrumentation-asgi==0.48b0 opentelemetry-instrumentation-fastapi==0.48b0 opentelemetry-proto==1.27.0 opentelemetry-sdk==1.27.0 opentelemetry-semantic-conventions==0.48b0 opentelemetry-util-http==0.48b0 orjson==3.10.7 overrides==7.7.0 packaging==24.1 posthog==3.6.5 protobuf==4.25.4 pyasn1==0.6.1 pyasn1_modules==0.4.1 pydantic==2.9.1 pydantic-settings==2.5.2 pydantic_core==2.23.3 Pygments==2.18.0 PyPika==0.48.9 pyproject_hooks==1.1.0 pyreadline3==3.4.3 python-dateutil==2.9.0.post0 python-dotenv==1.0.1 PyYAML==6.0.2 regex==2024.9.11 requests==2.32.3 requests-oauthlib==2.0.0 rich==13.8.1 rsa==4.9 setuptools==74.1.2 shellingham==1.5.4 six==1.16.0 sniffio==1.3.1 SQLAlchemy==2.0.34 starlette==0.38.5 sympy==1.13.2 tenacity==8.5.0 tiktoken==0.7.0 tokenizers==0.20.0 tqdm==4.66.5 typer==0.12.5 typing-inspect==0.9.0 typing_extensions==4.12.2 urllib3==2.2.3 uvicorn==0.30.6 watchfiles==0.24.0 websocket-client==1.8.0 websockets==13.0.1 wrapt==1.16.0 yarl==1.11.1 zipp==3.20.2 |
※ pip install langchain
■ Shows how to use the vectorstore/byte_store/id_key arguments in the MultiVectorRetriever class constructor to create a MultiVectorRetriever object. ▶ Example code (PY)
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma(collection_name = "full_documents", embedding_function = openAIEmbeddings)

inMemoryByteStore = InMemoryByteStore()

multiVectorRetriever = MultiVectorRetriever(
    vectorstore = chroma,
    byte_store = inMemoryByteStore,
    id_key = "doc_id"
)
※ pip install langchain langchain-chroma
■ Shows how to use the collection_name/embedding_function arguments in the Chroma class constructor to create a Chroma object. ▶ Example code (PY)
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma(collection_name = "full_documents", embedding_function = openAIEmbeddings)
※ pip install langchain-openai langchain-chroma
■ Shows how to use the LongContextReorder class to reorder retrieved results and mitigate the "lost in the middle" effect. ▶ main.py
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain_community.document_transformers import LongContextReorder
from langchain_openai import OpenAI
from langchain_core.prompts import PromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain

huggingFaceEmbeddings = HuggingFaceEmbeddings(model_name = "all-MiniLM-L6-v2")

textList = [
    "Basquetball is a great sport.",
    "Fly me to the moon is one of my favourite songs.",
    "The Celtics are my favourite team.",
    "This is a document about the Boston Celtics",
    "I simply love going to the movies",
    "The Boston Celtics won the game by 20 points",
    "This is just a random text.",
    "Elden Ring is one of the best games in the last 15 years.",
    "L. Kornet is one of the best Celtics players.",
    "Larry Bird was an iconic NBA player."
]

chroma = Chroma.from_texts(textList, embedding = huggingFaceEmbeddings)

vectorStoreRetriever = chroma.as_retriever(search_kwargs = {"k" : 10})

query = "What can you tell me about the Celtics?"

documentList = vectorStoreRetriever.invoke(query)

longContextReorder = LongContextReorder()

reorderedDocumentList = longContextReorder.transform_documents(documentList)

openAI = OpenAI()

templateString = """
Given these texts:
-----
{context}
-----
Please answer the following question:
{query}
"""

promptTemplate = PromptTemplate(
    template = templateString,
    input_variables = ["context", "query"]
)

runnableBinding = create_stuff_documents_chain(openAI, promptTemplate)

responseString = runnableBinding.invoke({"context" : reorderedDocumentList, "query" : query})

print(responseString)
▶ requirements.txt
aiohappyeyeballs==2.4.0 aiohttp==3.10.5 aiosignal==1.3.1 annotated-types==0.7.0 anyio==4.4.0 asgiref==3.8.1 attrs==24.2.0 backoff==2.2.1 bcrypt==4.2.0 build==1.2.2 cachetools==5.5.0 certifi==2024.8.30 charset-normalizer==3.3.2 chroma-hnswlib==0.7.3 chromadb==0.5.3 click==8.1.7 colorama==0.4.6 coloredlogs==15.0.1 dataclasses-json==0.6.7 Deprecated==1.2.14 distro==1.9.0 fastapi==0.114.1 filelock==3.16.0 flatbuffers==24.3.25 frozenlist==1.4.1 fsspec==2024.9.0 google-auth==2.34.0 googleapis-common-protos==1.65.0 greenlet==3.1.0 grpcio==1.66.1 h11==0.14.0 httpcore==1.0.5 httptools==0.6.1 httpx==0.27.2 huggingface-hub==0.24.7 humanfriendly==10.0 idna==3.8 importlib_metadata==8.4.0 importlib_resources==6.4.5 Jinja2==3.1.4 jiter==0.5.0 joblib==1.4.2 jsonpatch==1.33 jsonpointer==3.0.0 kubernetes==30.1.0 langchain==0.2.16 langchain-chroma==0.1.3 langchain-community==0.2.16 langchain-core==0.2.39 langchain-huggingface==0.0.3 langchain-openai==0.1.23 langchain-text-splitters==0.2.4 langsmith==0.1.120 markdown-it-py==3.0.0 MarkupSafe==2.1.5 marshmallow==3.22.0 mdurl==0.1.2 mmh3==4.1.0 monotonic==1.6 mpmath==1.3.0 multidict==6.1.0 mypy-extensions==1.0.0 networkx==3.3 numpy==1.26.4 oauthlib==3.2.2 onnxruntime==1.19.2 openai==1.45.0 opentelemetry-api==1.27.0 opentelemetry-exporter-otlp-proto-common==1.27.0 opentelemetry-exporter-otlp-proto-grpc==1.27.0 opentelemetry-instrumentation==0.48b0 opentelemetry-instrumentation-asgi==0.48b0 opentelemetry-instrumentation-fastapi==0.48b0 opentelemetry-proto==1.27.0 opentelemetry-sdk==1.27.0 opentelemetry-semantic-conventions==0.48b0 opentelemetry-util-http==0.48b0 orjson==3.10.7 overrides==7.7.0 packaging==24.1 pillow==10.4.0 posthog==3.6.5 protobuf==4.25.4 pyasn1==0.6.1 pyasn1_modules==0.4.1 pydantic==2.9.1 pydantic_core==2.23.3 Pygments==2.18.0 PyPika==0.48.9 pyproject_hooks==1.1.0 pyreadline3==3.4.3 python-dateutil==2.9.0.post0 python-dotenv==1.0.1 PyYAML==6.0.2 regex==2024.9.11 requests==2.32.3 requests-oauthlib==2.0.0 rich==13.8.1 rsa==4.9 safetensors==0.4.5 scikit-learn==1.5.2 scipy==1.14.1 sentence-transformers==3.1.0 setuptools==74.1.2 shellingham==1.5.4 six==1.16.0 sniffio==1.3.1 SQLAlchemy==2.0.34 starlette==0.38.5 sympy==1.13.2 tenacity==8.5.0 threadpoolctl==3.5.0 tiktoken==0.7.0 tokenizers==0.19.1 torch==2.4.1 tqdm==4.66.5 transformers==4.44.2 typer==0.12.5 typing-inspect==0.9.0 typing_extensions==4.12.2 urllib3==2.2.3 uvicorn==0.30.6 watchfiles==0.24.0 websocket-client==1.8.0 websockets==13.0.1 wrapt==1.16.0 yarl==1.11.1 zipp==3.20.1 |
※ pip install
■ Shows how to use the LongContextReorder class's transform_documents method to reorder retrieved documents. ▶ main.py
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain_community.document_transformers import LongContextReorder

huggingFaceEmbeddings = HuggingFaceEmbeddings(model_name = "all-MiniLM-L6-v2")

textList = [
    "Basquetball is a great sport.",
    "Fly me to the moon is one of my favourite songs.",
    "The Celtics are my favourite team.",
    "This is a document about the Boston Celtics",
    "I simply love going to the movies",
    "The Boston Celtics won the game by 20 points",
    "This is just a random text.",
    "Elden Ring is one of the best games in the last 15 years.",
    "L. Kornet is one of the best Celtics players.",
    "Larry Bird was an iconic NBA player."
]

chroma = Chroma.from_texts(textList, embedding = huggingFaceEmbeddings)

vectorStoreRetriever = chroma.as_retriever(search_kwargs = {"k" : 10})

documentList = vectorStoreRetriever.invoke("What can you tell me about the Celtics?")

for document in documentList:
    print(document.page_content)

print()

longContextReorder = LongContextReorder()

reorderedDocumentList = longContextReorder.transform_documents(documentList)

for reorderedDocument in reorderedDocumentList:
    print(reorderedDocument.page_content)
▶ requirements.txt
aiohappyeyeballs==2.4.0 aiohttp==3.10.5 aiosignal==1.3.1 annotated-types==0.7.0 anyio==4.4.0 asgiref==3.8.1 attrs==24.2.0 backoff==2.2.1 bcrypt==4.2.0 build==1.2.2 cachetools==5.5.0 certifi==2024.8.30 charset-normalizer==3.3.2 chroma-hnswlib==0.7.3 chromadb==0.5.3 click==8.1.7 colorama==0.4.6 coloredlogs==15.0.1 dataclasses-json==0.6.7 Deprecated==1.2.14 distro==1.9.0 fastapi==0.114.1 filelock==3.16.0 flatbuffers==24.3.25 frozenlist==1.4.1 fsspec==2024.9.0 google-auth==2.34.0 googleapis-common-protos==1.65.0 greenlet==3.1.0 grpcio==1.66.1 h11==0.14.0 httpcore==1.0.5 httptools==0.6.1 httpx==0.27.2 huggingface-hub==0.24.7 humanfriendly==10.0 idna==3.8 importlib_metadata==8.4.0 importlib_resources==6.4.5 Jinja2==3.1.4 jiter==0.5.0 joblib==1.4.2 jsonpatch==1.33 jsonpointer==3.0.0 kubernetes==30.1.0 langchain==0.2.16 langchain-chroma==0.1.3 langchain-community==0.2.16 langchain-core==0.2.39 langchain-huggingface==0.0.3 langchain-openai==0.1.23 langchain-text-splitters==0.2.4 langsmith==0.1.120 markdown-it-py==3.0.0 MarkupSafe==2.1.5 marshmallow==3.22.0 mdurl==0.1.2 mmh3==4.1.0 monotonic==1.6 mpmath==1.3.0 multidict==6.1.0 mypy-extensions==1.0.0 networkx==3.3 numpy==1.26.4 oauthlib==3.2.2 onnxruntime==1.19.2 openai==1.45.0 opentelemetry-api==1.27.0 opentelemetry-exporter-otlp-proto-common==1.27.0 opentelemetry-exporter-otlp-proto-grpc==1.27.0 opentelemetry-instrumentation==0.48b0 opentelemetry-instrumentation-asgi==0.48b0 opentelemetry-instrumentation-fastapi==0.48b0 opentelemetry-proto==1.27.0 opentelemetry-sdk==1.27.0 opentelemetry-semantic-conventions==0.48b0 opentelemetry-util-http==0.48b0 orjson==3.10.7 overrides==7.7.0 packaging==24.1 pillow==10.4.0 posthog==3.6.5 protobuf==4.25.4 pyasn1==0.6.1 pyasn1_modules==0.4.1 pydantic==2.9.1 pydantic_core==2.23.3 Pygments==2.18.0 PyPika==0.48.9 pyproject_hooks==1.1.0 pyreadline3==3.4.3 python-dateutil==2.9.0.post0 python-dotenv==1.0.1 PyYAML==6.0.2 regex==2024.9.11 requests==2.32.3 requests-oauthlib==2.0.0 rich==13.8.1 rsa==4.9 safetensors==0.4.5 scikit-learn==1.5.2 scipy==1.14.1 sentence-transformers==3.1.0 setuptools==74.1.2 shellingham==1.5.4 six==1.16.0 sniffio==1.3.1 SQLAlchemy==2.0.34 starlette==0.38.5 sympy==1.13.2 tenacity==8.5.0 threadpoolctl==3.5.0 tiktoken==0.7.0 tokenizers==0.19.1 torch==2.4.1 tqdm==4.66.5 transformers==4.44.2 typer==0.12.5 typing-inspect==0.9.0 typing_extensions==4.12.2 urllib3==2.2.3 uvicorn==0.30.6 watchfiles==0.24.0 websocket-client==1.8.0 websockets==13.0.1 wrapt==1.16.0 yarl==1.11.1 zipp==3.20.1 |
※ pip install langchain-community langchain-huggingface
■ Shows how to use the model_name argument in the HuggingFaceEmbeddings class constructor to create a HuggingFaceEmbeddings object. ▶ main.py
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

huggingFaceEmbeddings = HuggingFaceEmbeddings(model_name = "all-MiniLM-L6-v2")

textList = [
    "Basquetball is a great sport.",
    "Fly me to the moon is one of my favourite songs.",
    "The Celtics are my favourite team.",
    "This is a document about the Boston Celtics",
    "I simply love going to the movies",
    "The Boston Celtics won the game by 20 points",
    "This is just a random text.",
    "Elden Ring is one of the best games in the last 15 years.",
    "L. Kornet is one of the best Celtics players.",
    "Larry Bird was an iconic NBA player.",
]

chroma = Chroma.from_texts(textList, embedding = huggingFaceEmbeddings)

vectorStoreRetriever = chroma.as_retriever(search_kwargs = {"k" : 10})

documentList = vectorStoreRetriever.invoke("What can you tell me about the Celtics?")

for document in documentList:
    print(document.page_content)
▶ requirements.txt
annotated-types==0.7.0 anyio==4.4.0 asgiref==3.8.1 backoff==2.2.1 bcrypt==4.2.0 build==1.2.2 cachetools==5.5.0 certifi==2024.8.30 charset-normalizer==3.3.2 chroma-hnswlib==0.7.3 chromadb==0.5.3 click==8.1.7 colorama==0.4.6 coloredlogs==15.0.1 Deprecated==1.2.14 distro==1.9.0 fastapi==0.114.1 filelock==3.16.0 flatbuffers==24.3.25 fsspec==2024.9.0 google-auth==2.34.0 googleapis-common-protos==1.65.0 grpcio==1.66.1 h11==0.14.0 httpcore==1.0.5 httptools==0.6.1 httpx==0.27.2 huggingface-hub==0.24.7 humanfriendly==10.0 idna==3.8 importlib_metadata==8.4.0 importlib_resources==6.4.5 Jinja2==3.1.4 jiter==0.5.0 joblib==1.4.2 jsonpatch==1.33 jsonpointer==3.0.0 kubernetes==30.1.0 langchain-chroma==0.1.3 langchain-core==0.2.39 langchain-huggingface==0.0.3 langchain-openai==0.1.23 langsmith==0.1.120 markdown-it-py==3.0.0 MarkupSafe==2.1.5 mdurl==0.1.2 mmh3==4.1.0 monotonic==1.6 mpmath==1.3.0 networkx==3.3 numpy==1.26.4 oauthlib==3.2.2 onnxruntime==1.19.2 openai==1.45.0 opentelemetry-api==1.27.0 opentelemetry-exporter-otlp-proto-common==1.27.0 opentelemetry-exporter-otlp-proto-grpc==1.27.0 opentelemetry-instrumentation==0.48b0 opentelemetry-instrumentation-asgi==0.48b0 opentelemetry-instrumentation-fastapi==0.48b0 opentelemetry-proto==1.27.0 opentelemetry-sdk==1.27.0 opentelemetry-semantic-conventions==0.48b0 opentelemetry-util-http==0.48b0 orjson==3.10.7 overrides==7.7.0 packaging==24.1 pillow==10.4.0 posthog==3.6.5 protobuf==4.25.4 pyasn1==0.6.1 pyasn1_modules==0.4.1 pydantic==2.9.1 pydantic_core==2.23.3 Pygments==2.18.0 PyPika==0.48.9 pyproject_hooks==1.1.0 pyreadline3==3.4.3 python-dateutil==2.9.0.post0 python-dotenv==1.0.1 PyYAML==6.0.2 regex==2024.9.11 requests==2.32.3 requests-oauthlib==2.0.0 rich==13.8.1 rsa==4.9 safetensors==0.4.5 scikit-learn==1.5.2 scipy==1.14.1 sentence-transformers==3.1.0 setuptools==74.1.2 shellingham==1.5.4 six==1.16.0 sniffio==1.3.1 starlette==0.38.5 sympy==1.13.2 tenacity==8.5.0 threadpoolctl==3.5.0 tiktoken==0.7.0 tokenizers==0.19.1 torch==2.4.1 tqdm==4.66.5 transformers==4.44.2 typer==0.12.5 typing_extensions==4.12.2 urllib3==2.2.3 uvicorn==0.30.6 watchfiles==0.24.0 websocket-client==1.8.0 websockets==13.0.1 wrapt==1.16.0 zipp==3.20.1 |
※ pip install langchain-huggingface
■ Shows how to use the MultiVectorRetriever class to associate multiple vectors with a single document and retrieve the parent document. ※ The OPENAI_API_KEY environment variable value is defined in the .env file.
■ Shows how the SelfQueryRetriever class uses an LLM to generate a structured query from a natural-language question. ※ The OPENAI_API_KEY environment variable value is defined in the .env file. ▶ main.py
■ Shows how to use the Chroma class's similarity_search_with_score method to include scores in search results. ※ The OPENAI_API_KEY environment variable value is defined in the .env file. ▶ main.py
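※ The main.py for this item is not included; a minimal sketch follows. The sample documents and query are illustrative. similarity_search_with_score returns (Document, score) tuples; for Chroma the score is a distance, so a lower value means a closer match.

# Minimal sketch (assumption) : build a small Chroma store and retrieve documents with scores.
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.documents import Document

load_dotenv()

documentList = [
    Document(page_content = "Toys come alive and have a blast doing so", metadata = {"year" : 1995}),
    Document(page_content = "A bunch of scientists bring back dinosaurs and mayhem breaks loose", metadata = {"year" : 1993})
]

chroma = Chroma.from_documents(documentList, OpenAIEmbeddings())

# Each result is a (Document, score) tuple; the score is a distance (lower = more similar).
resultTupleList = chroma.similarity_search_with_score("a movie about toys", k = 2)

for resultDocument, score in resultTupleList:
    print(score, ":", resultDocument.page_content)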
■ Shows how to use the invoke method of the MultiQueryRetriever class (example 2). ※ The OPENAI_API_KEY environment variable value is defined in the .env file. ▶ main.py
from dotenv import load_dotenv
from typing import List
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import BaseOutputParser
from langchain.retrievers.multi_query import MultiQueryRetriever

load_dotenv()

webBaseLoader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")

documentList = webBaseLoader.load()

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 0)

splitDocumentList = recursiveCharacterTextSplitter.split_documents(documentList)

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma.from_documents(documents = splitDocumentList, embedding = openAIEmbeddings)

promptTemplate = PromptTemplate(
    input_variables = ["question"],
    template = """You are an AI language model assistant. Your task is to generate five different versions of the given user question to retrieve relevant documents from a vector database. By generating multiple perspectives on the user question, your goal is to help the user overcome some of the limitations of the distance-based similarity search. Provide these alternative questions separated by newlines. Original question : {question}"""
)

chatOpenAI = ChatOpenAI(temperature = 0)

class LineListOutputParser(BaseOutputParser[List[str]]):
    """Output parser for a list of lines."""
    def parse(self, text : str) -> List[str]:
        lineList = text.strip().split("\n")
        return list(filter(None, lineList))

lineListOutputParser = LineListOutputParser()

runnableSequence = promptTemplate | chatOpenAI | lineListOutputParser

multiQueryRetriever = MultiQueryRetriever(retriever = chroma.as_retriever(), llm_chain = runnableSequence, parser_key = "lines")

responseDocumentList = multiQueryRetriever.invoke("What does the course say about regression?")

print(len(responseDocumentList))
■ Shows how to use the invoke method of the MultiQueryRetriever class. ※ The OPENAI_API_KEY environment variable value is defined in the .env file. ▶ main.py
import logging
from dotenv import load_dotenv
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_openai import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever

logging.basicConfig()

logger = logging.getLogger("langchain.retrievers.multi_query")
logger.setLevel(logging.INFO)

load_dotenv()

webBaseLoader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")

documentList = webBaseLoader.load()

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 0)

splitDocumentList = recursiveCharacterTextSplitter.split_documents(documentList)

openAIEmbeddings = OpenAIEmbeddings()

chroma = Chroma.from_documents(documents = splitDocumentList, embedding = openAIEmbeddings)

vectorStoreRetriever = chroma.as_retriever()

chatOpenAI = ChatOpenAI(temperature = 0)

multiQueryRetriever = MultiQueryRetriever.from_llm(retriever = vectorStoreRetriever, llm = chatOpenAI)

question = "What are the approaches to Task Decomposition?"

responseDocumentList = multiQueryRetriever.invoke(question)

print(len(responseDocumentList))
print(responseDocumentList[0].page_content)

"""
5
Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.
"""
▶ requirements.txt
aiohappyeyeballs==2.3.5 aiohttp==3.10.3 aiosignal==1.3.1 annotated-types==0.7.0 anyio==4.4.0 asgiref==3.8.1 async-timeout==4.0.3 attrs==24.2.0 backoff==2.2.1 bcrypt==4.2.0 beautifulsoup4==4.12.3 bs4==0.0.2 build==1.2.1 cachetools==5.4.0 certifi==2024.7.4 charset-normalizer==3.3.2 chroma-hnswlib==0.7.6 chromadb==0.5.5 click==8.1.7 coloredlogs==15.0.1 dataclasses-json==0.6.7 Deprecated==1.2.14 distro==1.9.0 exceptiongroup==1.2.2 fastapi==0.112.0 filelock==3.15.4 flatbuffers==24.3.25 frozenlist==1.4.1 fsspec==2024.6.1 google-auth==2.33.0 googleapis-common-protos==1.63.2 greenlet==3.0.3 grpcio==1.65.4 h11==0.14.0 httpcore==1.0.5 httptools==0.6.1 httpx==0.27.0 huggingface-hub==0.24.5 humanfriendly==10.0 idna==3.7 importlib_metadata==8.0.0 importlib_resources==6.4.0 jiter==0.5.0 jsonpatch==1.33 jsonpointer==3.0.0 kubernetes==30.1.0 langchain==0.2.12 langchain-chroma==0.1.2 langchain-community==0.2.11 langchain-core==0.2.29 langchain-openai==0.1.21 langchain-text-splitters==0.2.2 langsmith==0.1.98 markdown-it-py==3.0.0 marshmallow==3.21.3 mdurl==0.1.2 mmh3==4.1.0 monotonic==1.6 mpmath==1.3.0 multidict==6.0.5 mypy-extensions==1.0.0 numpy==1.26.4 oauthlib==3.2.2 onnxruntime==1.18.1 openai==1.40.3 opentelemetry-api==1.26.0 opentelemetry-exporter-otlp-proto-common==1.26.0 opentelemetry-exporter-otlp-proto-grpc==1.26.0 opentelemetry-instrumentation==0.47b0 opentelemetry-instrumentation-asgi==0.47b0 opentelemetry-instrumentation-fastapi==0.47b0 opentelemetry-proto==1.26.0 opentelemetry-sdk==1.26.0 opentelemetry-semantic-conventions==0.47b0 opentelemetry-util-http==0.47b0 orjson==3.10.7 overrides==7.7.0 packaging==24.1 posthog==3.5.0 protobuf==4.25.4 pyasn1==0.6.0 pyasn1_modules==0.4.0 pydantic==2.8.2 pydantic_core==2.20.1 Pygments==2.18.0 PyPika==0.48.9 pyproject_hooks==1.1.0 python-dateutil==2.9.0.post0 python-dotenv==1.0.1 PyYAML==6.0.2 regex==2024.7.24 requests==2.32.3 requests-oauthlib==2.0.0 rich==13.7.1 rsa==4.9 shellingham==1.5.4 six==1.16.0 sniffio==1.3.1 soupsieve==2.5 SQLAlchemy==2.0.32 starlette==0.37.2 sympy==1.13.1 tenacity==8.5.0 tiktoken==0.7.0 tokenizers==0.20.0 tomli==2.0.1 tqdm==4.66.5 typer==0.12.3 typing-inspect==0.9.0 typing_extensions==4.12.2 urllib3==2.2.2 uvicorn==0.30.5 uvloop==0.19.0 watchfiles==0.23.0 websocket-client==1.8.0 websockets==12.0 wrapt==1.16.0 yarl==1.9.4 zipp==3.19.2 |
■ Shows how to use the retriever/llm_chain/parser_key arguments in the MultiQueryRetriever class constructor to create a MultiQueryRetriever object. ※ The OPENAI_API_KEY environment variable value is defined in the .env file. ▶ Example
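※ The example code for this item is omitted; a minimal sketch follows, assuming the runnableSequence (prompt | LLM | line parser) and chroma objects built in the full example above.

# Minimal sketch (assumption) : runnableSequence and chroma are created as in the full example above.
from langchain.retrievers.multi_query import MultiQueryRetriever

multiQueryRetriever = MultiQueryRetriever(
    retriever = chroma.as_retriever(),
    llm_chain = runnableSequence,
    parser_key = "lines"  # key name kept from the example above
)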
■ Shows how to use the retriever/llm arguments of the from_llm function to create a MultiQueryRetriever object. ※ The OPENAI_API_KEY environment variable value is defined in the .env file. ▶ Example code
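※ The example code for this item is not shown; a minimal sketch follows, assuming vectorStoreRetriever and chatOpenAI objects like those created in the invoke example above.

# Minimal sketch (assumption) : vectorStoreRetriever and chatOpenAI are created as in the invoke example above.
from langchain.retrievers.multi_query import MultiQueryRetriever

multiQueryRetriever = MultiQueryRetriever.from_llm(retriever = vectorStoreRetriever, llm = chatOpenAI)

responseDocumentList = multiQueryRetriever.invoke("What are the approaches to Task Decomposition?")

print(len(responseDocumentList))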