■ Demonstrates how to compress retrieved content by post-processing the retrieved documents with the compress_documents method of the EmbeddingsFilter class.
※ The OPENAI_API_KEY environment variable is defined in the .env file.
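※ For reference, a minimal .env sketch follows; the key value is a placeholder, not a real key.

▶ .env
OPENAI_API_KEY=<your-openai-api-key>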
▶ main.py
from dotenv import load_dotenv
from langchain_community.retrievers import WikipediaRetriever
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain.retrievers.document_compressors import EmbeddingsFilter
from typing import List
from langchain_core.documents import Document
from langchain_core.runnables import RunnableParallel
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser

load_dotenv()

# Retrieves up to 6 Wikipedia articles, each truncated to 2,000 characters.
wikipediaRetriever = WikipediaRetriever(top_k_results = 6, doc_content_chars_max = 2000)

recursiveCharacterTextSplitter = RecursiveCharacterTextSplitter(
    chunk_size     = 400,
    chunk_overlap  = 0,
    separators     = ["\n\n", "\n", ".", " "],
    keep_separator = False
)

# Keeps only the 10 chunks most similar to the question.
embeddingsFilter = EmbeddingsFilter(embeddings = OpenAIEmbeddings(), k = 10)

def getStatefulDocumentListFromInputDictionary(inputDictionary) -> List[Document]:
    documentList         = inputDictionary["documentList"]
    question             = inputDictionary["question"]
    splitDocumentList    = recursiveCharacterTextSplitter.split_documents(documentList)
    statefulDocumentList = embeddingsFilter.compress_documents(splitDocumentList, question) # List[langchain_community.document_transformers.embeddings_redundant_filter._DocumentWithState]
    return [statefulDocument for statefulDocument in statefulDocumentList]

runnableSequence1 = (RunnableParallel(question = RunnablePassthrough(), documentList = wikipediaRetriever) | getStatefulDocumentListFromInputDictionary)
runnableSequence2 = (lambda x : x["input"]) | runnableSequence1

def getStringFromDocumentList(documentList : List[Document]):
    return "\n\n".join(document.page_content for document in documentList)

systemString = """You're a helpful AI assistant. Given a user question and some Wikipedia article snippets, answer the user question and provide citations. If none of the articles answer the question, just say you don't know.

Remember, you must return both an answer and citations. A citation consists of a VERBATIM quote that justifies the answer and the ID of the quote article. Return a citation for every quote across all articles that justify the answer. Use the following format for your final output :

<cited_answer>
    <answer></answer>
    <citations>
        <citation><source_id></source_id><quote></quote></citation>
        <citation><source_id></source_id><quote></quote></citation>
        ...
    </citations>
</cited_answer>

Here are the Wikipedia articles :
{context}"""

chatPromptTemplate = ChatPromptTemplate.from_messages(
    [
        ("system", systemString),
        ("human" , "{input}"   )
    ]
)

chatOpenAI = ChatOpenAI(model = "gpt-4o-mini")

runnableSequence3 = (
    RunnablePassthrough.assign(context = (lambda x : getStringFromDocumentList(x["context"])))
    | chatPromptTemplate
    | chatOpenAI
    | StrOutputParser()
)

# Returns both the compressed context documents and the model answer.
runnableSequence4 = RunnablePassthrough.assign(context = runnableSequence2).assign(answer = runnableSequence3)

result = runnableSequence4.invoke({"input" : "How fast are cheetahs?"})

print(result["answer"])

"""
<cited_answer>
    <answer>Cheetahs are capable of running at speeds of 93 to 104 km/h (58 to 65 mph).</answer>
    <citations>
        <citation>
            <source_id>1</source_id>
            <quote>The cheetah is capable of running at 93 to 104 km/h (58 to 65 mph); it has evolved specialized adaptations for speed, including a light build, long thin legs and a long tail</quote>
        </citation>
    </citations>
</cited_answer>
"""
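※ For reference, the sketch below calls compress_documents directly on a few hand-written documents, so the filtering step can be observed in isolation from the retrieval chain. The documents, the query, and k = 2 are illustrative choices, not part of the example above.

from dotenv import load_dotenv
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain.retrievers.document_compressors import EmbeddingsFilter

load_dotenv()

# Hand-written documents standing in for retrieved Wikipedia chunks.
documentList = [
    Document(page_content = "The cheetah is capable of running at 93 to 104 km/h."),
    Document(page_content = "Seoul is the capital of South Korea."),
    Document(page_content = "The cheetah is the fastest land animal.")
]

# Keep only the 2 documents most similar to the query.
embeddingsFilter = EmbeddingsFilter(embeddings = OpenAIEmbeddings(), k = 2)

compressedDocumentList = embeddingsFilter.compress_documents(documentList, "How fast are cheetahs?")

for compressedDocument in compressedDocumentList:
    print(compressedDocument.page_content) # The two cheetah sentences are expected; the Seoul sentence is filtered out.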
▶ requirements.txt
aiohappyeyeballs==2.4.4
aiohttp==3.11.9
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.6.2.post1
attrs==24.2.0
beautifulsoup4==4.12.3
certifi==2024.8.30
charset-normalizer==3.4.0
colorama==0.4.6
dataclasses-json==0.6.7
distro==1.9.0
frozenlist==1.5.0
greenlet==3.1.1
h11==0.14.0
httpcore==1.0.7
httpx==0.28.0
httpx-sse==0.4.0
idna==3.10
jiter==0.8.0
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.3.9
langchain-community==0.3.9
langchain-core==0.3.21
langchain-openai==0.2.10
langchain-text-splitters==0.3.2
langsmith==0.1.147
marshmallow==3.23.1
multidict==6.1.0
mypy-extensions==1.0.0
numpy==2.1.3
openai==1.56.0
orjson==3.10.12
packaging==24.2
propcache==0.2.1
pydantic==2.10.2
pydantic-settings==2.6.1
pydantic_core==2.27.1
python-dotenv==1.0.1
PyYAML==6.0.2
regex==2024.11.6
requests==2.32.3
requests-toolbelt==1.0.0
sniffio==1.3.1
soupsieve==2.6
SQLAlchemy==2.0.36
tenacity==9.0.0
tiktoken==0.8.0
tqdm==4.67.1
typing-inspect==0.9.0
typing_extensions==4.12.2
urllib3==2.2.3
wikipedia==1.4.0
yarl==1.18.3
※ The pip install python-dotenv langchain langchain-community langchain-openai wikipedia command was executed to install the packages.