■ Shows how to delete entities using the collection_name/filter arguments of the delete method of the MilvusClient class. ▶ main.py
import numpy as np
from pymilvus import MilvusClient

milvusClient = MilvusClient("test.db")

hasCollection = milvusClient.has_collection(collection_name = "temp")

if hasCollection:
    milvusClient.drop_collection(collection_name = "temp")

milvusClient.create_collection(
    collection_name = "temp",
    dimension = 384
)

stringList = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England."
]

stringVectorList = [[np.random.uniform(-1, 1) for _ in range(384)] for _ in range(len(stringList))] # NDArray list

itemList = [{"id" : i, "vector" : stringVectorList[i], "text" : stringList[i], "subject" : "history"} for i in range(len(stringVectorList))]

milvusClient.insert(
    collection_name = "temp",
    data = itemList
)

extraList1 = milvusClient.search(
    collection_name = "temp",
    data = [stringVectorList[0]],
    filter = "subject == 'history'",
    limit = 2,
    output_fields = ["text", "subject"]
)

print(extraList1[0][0])
print(extraList1[0][1])
print("-" * 100)

extraList2 = milvusClient.query(
    collection_name = "temp",
    filter = "subject == 'history'",
    output_fields = ["text", "subject"]
)

print(extraList2[0])
print(extraList2[1])
print(extraList2[2])
print("-" * 100)

deletedIDList = milvusClient.delete(
    collection_name = "temp",
    filter = "subject == 'history'"
) # int list

print(deletedIDList)
print("-" * 100)

"""
{'id': 0, 'distance': 1.0000003576278687, 'entity': {'text': 'Artificial intelligence was founded as an academic discipline in 1956.', 'subject': 'history'}}
{'id': 2, 'distance': 0.09245504438877106, 'entity': {'text': 'Born in Maida Vale, London, Turing was raised in southern England.', 'subject': 'history'}}
...
{'id': 0, 'text': 'Artificial intelligence was founded ...
"""
■ Shows how to delete entities using the collection_name/ids arguments of the delete method of the MilvusClient class. ▶ main.py
from pymilvus import MilvusClient
from pymilvus import model

milvusClient = MilvusClient("test.db")

hasCollection = milvusClient.has_collection(collection_name = "temp")

if hasCollection:
    milvusClient.drop_collection(collection_name = "temp")

milvusClient.create_collection(
    collection_name = "temp",
    dimension = 768
)

onnxEmbeddingFunction = model.DefaultEmbeddingFunction()

stringList1 = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England."
]

stringVectorList1 = onnxEmbeddingFunction.encode_documents(stringList1) # NDArray list

itemList = []

itemList.extend(
    [{"id" : i, "vector" : stringVectorList1[i], "text" : stringList1[i], "subject" : "history"} for i in range(len(stringVectorList1))]
)

stringList2 = [
    "Machine learning has been used for drug design.",
    "Computational synthesis with AI algorithms predicts molecular properties.",
    "DDR1 is involved in cancers and fibrosis."
]

stringVectorList2 = onnxEmbeddingFunction.encode_documents(stringList2) # NDArray list

itemList.extend(
    [{"id" : 3 + i, "vector" : stringVectorList2[i], "text" : stringList2[i], "subject" : "biology"} for i in range(len(stringVectorList2))]
)

milvusClient.insert(collection_name = "temp", data = itemList)

deletedIDList = milvusClient.delete(collection_name = "temp", ids = [0, 2]) # int list

print(deletedIDList)

"""
[0, 2]
"""
▶ requirements.txt
certifi==2024.8.30
charset-normalizer==3.3.2
coloredlogs==15.0.1
environs==9.5.0
filelock==3.16.1
flatbuffers==24.3.25
fsspec==2024.9.0
grpcio==1.66.2
huggingface-hub==0.25.1
humanfriendly==10.0
idna==3.10
marshmallow==3.22.0
milvus-lite==2.4.10
milvus-model==0.2.7
mpmath==1.3.0
numpy==2.1.2
onnxruntime==1.19.2
packaging==24.1
pandas==2.2.3
protobuf==5.28.2
pymilvus==2.4.7
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
pytz==2024.2
PyYAML==6.0.2
regex==2024.9.11
requests==2.32.3
safetensors==0.4.5
scipy==1.14.1
six==1.16.0
sympy==1.13.3
tokenizers==0.20.0
tqdm==4.66.5
transformers==4.45.1
typing_extensions==4.12.2
tzdata==2024.2
ujson==5.10.0
urllib3==2.2.3
※ pip install pymilvus[model]
■ Shows how to query using the ids argument of the query method of the MilvusClient class. ※ Retrieves entities directly by primary key. ▶ main.py
from pymilvus import MilvusClient
from pymilvus import model

milvusClient = MilvusClient("test.db")

hasCollection = milvusClient.has_collection(collection_name = "temp")

if hasCollection:
    milvusClient.drop_collection(collection_name = "temp")

milvusClient.create_collection(
    collection_name = "temp",
    dimension = 768
)

onnxEmbeddingFunction = model.DefaultEmbeddingFunction()

stringList1 = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England."
]

stringVectorList1 = onnxEmbeddingFunction.encode_documents(stringList1) # NDArray list

itemList = []

itemList.extend(
    [{"id" : i, "vector" : stringVectorList1[i], "text" : stringList1[i], "subject" : "history"} for i in range(len(stringVectorList1))]
)

stringList2 = [
    "Machine learning has been used for drug design.",
    "Computational synthesis with AI algorithms predicts molecular properties.",
    "DDR1 is involved in cancers and fibrosis."
]

stringVectorList2 = onnxEmbeddingFunction.encode_documents(stringList2) # NDArray list

itemList.extend(
    [{"id" : 3 + i, "vector" : stringVectorList2[i], "text" : stringList2[i], "subject" : "biology"} for i in range(len(stringVectorList2))]
)

milvusClient.insert(collection_name = "temp", data = itemList)

extraList = milvusClient.query(
    collection_name = "temp",
    ids = [0, 2],
    output_fields = ["text", "subject"]
)

print(extraList[0])
print(extraList[1])

"""
{'id': 0, 'text': 'Artificial intelligence was founded as an academic discipline in 1956.', 'subject': 'history'}
{'id': 2, 'text': 'Born in Maida Vale, London, Turing was raised in southern England.', 'subject': 'history'}
"""
■ Shows how to query using the collection_name/filter/output_fields arguments of the query method of the MilvusClient class. ※ Retrieves all entities that match a filter expression or criteria such as certain IDs …
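The body of this post is truncated above, so the following is a minimal sketch of a filter query, assuming the "temp" collection with a "subject" field populated as in the preceding examples.

from pymilvus import MilvusClient

milvusClient = MilvusClient("test.db")

# Retrieve every entity whose scalar field matches the filter expression.
entityList = milvusClient.query(
    collection_name = "temp",
    filter = "subject == 'history'",
    output_fields = ["text", "subject"]
)

for entity in entityList:
    print(entity)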
■ Shows how to search vectors using the collection_name/data/limit/output_fields arguments of the search method of the MilvusClient class. ※ Milvus accepts one or more vector search requests at the same time …
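Since the body of this post is cut off, here is a minimal sketch of a single-vector search, reusing the "temp" collection and the DefaultEmbeddingFunction from the surrounding examples.

from pymilvus import MilvusClient
from pymilvus import model

milvusClient = MilvusClient("test.db")

onnxEmbeddingFunction = model.DefaultEmbeddingFunction()

# One query vector is passed here; passing several vectors would batch the searches.
queryVectorList = onnxEmbeddingFunction.encode_queries(["Who is Alan Turing?"])

resultList = milvusClient.search(
    collection_name = "temp",
    data = queryVectorList,
    limit = 2,
    output_fields = ["text", "subject"]
)

# One result list per query vector; each hit holds id, distance and the requested fields.
for hit in resultList[0]:
    print(hit)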
■ Shows how to create a query vector list using the encode_queries method of the OnnxEmbeddingFunction class. ▶ main.py
from pymilvus import model

onnxEmbeddingFunction = model.DefaultEmbeddingFunction()

queryVectorList = onnxEmbeddingFunction.encode_queries(["Who is Alan Turing?"]) # NDArray list
※ pip install pymilvus[model]
■ Shows how to represent text with random vectors. ※ If the model cannot be downloaded because of network problems, text can be represented with random vectors as a temporary workaround …
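The post is truncated here; the sketch below shows the workaround, following the random-vector code used in the delete-by-filter example at the top of this section.

import numpy as np

stringList = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England."
]

# One random 384-dimensional vector per string stands in for a real embedding.
stringVectorList = [[np.random.uniform(-1, 1) for _ in range(384)] for _ in range(len(stringList))]

itemList = [{"id" : i, "vector" : stringVectorList[i], "text" : stringList[i], "subject" : "history"} for i in range(len(stringList))]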
■ Shows how to represent text with vectors. ▶ main.py
from pymilvus import model

# If the connection to https://huggingface.co/ fails, uncomment the following lines.
# import os
# os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

# Downloads the small embedding model "paraphrase-albert-small-v2" (~50 MB).
onnxEmbeddingFunction = model.DefaultEmbeddingFunction()

stringList = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England."
]

stringVectorList = onnxEmbeddingFunction.encode_documents(stringList) # NDArray list

itemList = [{"id" : i, "vector" : stringVectorList[i], "text" : stringList[i], "subject" : "history"} for i in range(len(stringVectorList))]

print("Data has", len(itemList), "entities, each with fields : ", itemList[0].keys())
print("Vector dim :", len(itemList[0]["vector"]))

"""
Data has 3 entities, each with fields :  dict_keys(['id', 'vector', 'text', 'subject'])
Vector dim : 768
"""
※ Ran the pip install pymilvus[model] command.
■ Shows how to get the dimension using the dim property of the OnnxEmbeddingFunction class. ▶ main.py
from pymilvus import model

# If the connection to https://huggingface.co/ fails, uncomment the following lines.
# import os
# os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

# Downloads the small embedding model "paraphrase-albert-small-v2" (~50 MB).
onnxEmbeddingFunction = model.DefaultEmbeddingFunction()

stringList = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England."
]

stringVectorList = onnxEmbeddingFunction.encode_documents(stringList) # NDArray list

print("Dimension :", onnxEmbeddingFunction.dim, stringVectorList[0].shape)

"""
Dimension : 768 (768,)
"""
※ Ran the pip install pymilvus[model] command.
■ Shows how to create a vector list using the encode_documents method of the OnnxEmbeddingFunction class. ▶ main.py
from pymilvus import model

# If the connection to https://huggingface.co/ fails, uncomment the following lines.
# import os
# os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

# Downloads the small embedding model "paraphrase-albert-small-v2" (~50 MB).
onnxEmbeddingFunction = model.DefaultEmbeddingFunction()

stringList = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England."
]

stringVectorList = onnxEmbeddingFunction.encode_documents(stringList) # NDArray list

print(len(stringVectorList))

"""
3
"""
※ Ran the pip install pymilvus[model] command.
■ Shows how to create an OnnxEmbeddingFunction object using the DefaultEmbeddingFunction class. ▶ main.py
from pymilvus import model

# If the connection to https://huggingface.co/ fails, uncomment the following lines.
# import os
# os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

# Downloads the small embedding model "paraphrase-albert-small-v2" (~50 MB).
onnxEmbeddingFunction = model.DefaultEmbeddingFunction()
※ Ran the pip install pymilvus[model] command.
■ Shows how to create a HuggingFaceEmbeddings object using the model_name argument of the HuggingFaceEmbeddings class constructor. ▶ main.py
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

huggingFaceEmbeddings = HuggingFaceEmbeddings(model_name = "all-MiniLM-L6-v2")

textList = [
    "Basquetball is a great sport.",
    "Fly me to the moon is one of my favourite songs.",
    "The Celtics are my favourite team.",
    "This is a document about the Boston Celtics",
    "I simply love going to the movies",
    "The Boston Celtics won the game by 20 points",
    "This is just a random text.",
    "Elden Ring is one of the best games in the last 15 years.",
    "L. Kornet is one of the best Celtics players.",
    "Larry Bird was an iconic NBA player.",
]

chroma = Chroma.from_texts(textList, embedding = huggingFaceEmbeddings)

vectorStoreRetriever = chroma.as_retriever(search_kwargs = {"k" : 10})

documentList = vectorStoreRetriever.invoke("What can you tell me about the Celtics?")

for document in documentList:
    print(document.page_content)
▶ requirements.txt
annotated-types==0.7.0
anyio==4.4.0
asgiref==3.8.1
backoff==2.2.1
bcrypt==4.2.0
build==1.2.2
cachetools==5.5.0
certifi==2024.8.30
charset-normalizer==3.3.2
chroma-hnswlib==0.7.3
chromadb==0.5.3
click==8.1.7
colorama==0.4.6
coloredlogs==15.0.1
Deprecated==1.2.14
distro==1.9.0
fastapi==0.114.1
filelock==3.16.0
flatbuffers==24.3.25
fsspec==2024.9.0
google-auth==2.34.0
googleapis-common-protos==1.65.0
grpcio==1.66.1
h11==0.14.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.2
huggingface-hub==0.24.7
humanfriendly==10.0
idna==3.8
importlib_metadata==8.4.0
importlib_resources==6.4.5
Jinja2==3.1.4
jiter==0.5.0
joblib==1.4.2
jsonpatch==1.33
jsonpointer==3.0.0
kubernetes==30.1.0
langchain-chroma==0.1.3
langchain-core==0.2.39
langchain-huggingface==0.0.3
langchain-openai==0.1.23
langsmith==0.1.120
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
mmh3==4.1.0
monotonic==1.6
mpmath==1.3.0
networkx==3.3
numpy==1.26.4
oauthlib==3.2.2
onnxruntime==1.19.2
openai==1.45.0
opentelemetry-api==1.27.0
opentelemetry-exporter-otlp-proto-common==1.27.0
opentelemetry-exporter-otlp-proto-grpc==1.27.0
opentelemetry-instrumentation==0.48b0
opentelemetry-instrumentation-asgi==0.48b0
opentelemetry-instrumentation-fastapi==0.48b0
opentelemetry-proto==1.27.0
opentelemetry-sdk==1.27.0
opentelemetry-semantic-conventions==0.48b0
opentelemetry-util-http==0.48b0
orjson==3.10.7
overrides==7.7.0
packaging==24.1
pillow==10.4.0
posthog==3.6.5
protobuf==4.25.4
pyasn1==0.6.1
pyasn1_modules==0.4.1
pydantic==2.9.1
pydantic_core==2.23.3
Pygments==2.18.0
PyPika==0.48.9
pyproject_hooks==1.1.0
pyreadline3==3.4.3
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
PyYAML==6.0.2
regex==2024.9.11
requests==2.32.3
requests-oauthlib==2.0.0
rich==13.8.1
rsa==4.9
safetensors==0.4.5
scikit-learn==1.5.2
scipy==1.14.1
sentence-transformers==3.1.0
setuptools==74.1.2
shellingham==1.5.4
six==1.16.0
sniffio==1.3.1
starlette==0.38.5
sympy==1.13.2
tenacity==8.5.0
threadpoolctl==3.5.0
tiktoken==0.7.0
tokenizers==0.19.1
torch==2.4.1
tqdm==4.66.5
transformers==4.44.2
typer==0.12.5
typing_extensions==4.12.2
urllib3==2.2.3
uvicorn==0.30.6
watchfiles==0.24.0
websocket-client==1.8.0
websockets==13.0.1
wrapt==1.16.0
zipp==3.20.1
※ pip install langchain-huggingface
■ Shows how to perform contextual compression retrieval using a DocumentCompressorPipeline object with the ContextualCompressionRetriever class. ※ The OPENAI_API_KEY environment variable value is defined in the .env file. ▶ main.py
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_transformers import EmbeddingsRedundantFilter
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain.retrievers import ContextualCompressionRetriever

load_dotenv()

def printDocumentList(documentList):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i + 1} :\n\n" + document.page_content for i, document in enumerate(documentList)]
        )
    )

textLoader = TextLoader("state_of_the_union.txt")
documentList = textLoader.load()

characterTextSplitter1 = CharacterTextSplitter(chunk_size = 1000, chunk_overlap = 0)
splitDocumentList = characterTextSplitter1.split_documents(documentList)

openAIEmbeddings = OpenAIEmbeddings()

faiss = FAISS.from_documents(splitDocumentList, openAIEmbeddings)
vectorStoreRetriever = faiss.as_retriever()

characterTextSplitter2 = CharacterTextSplitter(chunk_size = 300, chunk_overlap = 0, separator = ". ")
embeddingsRedundantFilter = EmbeddingsRedundantFilter(embeddings = openAIEmbeddings)
embeddingsFilter = EmbeddingsFilter(embeddings = openAIEmbeddings, similarity_threshold = 0.76)

documentCompressorPipeline = DocumentCompressorPipeline(transformers = [characterTextSplitter2, embeddingsRedundantFilter, embeddingsFilter])

contextualCompressionRetriever = ContextualCompressionRetriever(base_compressor = documentCompressorPipeline, base_retriever = vectorStoreRetriever)

resultDocumentList = contextualCompressionRetriever.invoke("What did the president say about Ketanji Jackson Brown")

printDocumentList(resultDocumentList)
■ Shows how to create a DocumentCompressorPipeline object using the transformers argument of the DocumentCompressorPipeline class constructor. ▶ Example code (PY)
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.document_transformers import EmbeddingsRedundantFilter
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain.retrievers.document_compressors import DocumentCompressorPipeline

openAIEmbeddings = OpenAIEmbeddings()

characterTextSplitter = CharacterTextSplitter(chunk_size = 300, chunk_overlap = 0, separator = ". ")

embeddingsRedundantFilter = EmbeddingsRedundantFilter(embeddings = openAIEmbeddings)

embeddingsFilter = EmbeddingsFilter(embeddings = openAIEmbeddings, similarity_threshold = 0.76)

documentCompressorPipeline = DocumentCompressorPipeline(transformers = [characterTextSplitter, embeddingsRedundantFilter, embeddingsFilter])
■ Shows how to create an EmbeddingsRedundantFilter object using the embeddings argument of the EmbeddingsRedundantFilter class constructor. ▶ Example code (PY)
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_transformers import EmbeddingsRedundantFilter

openAIEmbeddings = OpenAIEmbeddings()

embeddingsRedundantFilter = EmbeddingsRedundantFilter(embeddings = openAIEmbeddings)
■ Shows how to create an EmbeddingsFilter object using the embeddings/similarity_threshold arguments of the EmbeddingsFilter class constructor. ▶ Example code (PY)
from langchain_openai import OpenAIEmbeddings
from langchain.retrievers.document_compressors import EmbeddingsFilter

openAIEmbeddings = OpenAIEmbeddings()

embeddingsFilter = EmbeddingsFilter(embeddings = openAIEmbeddings, similarity_threshold = 0.76)
■ Shows how to perform contextual compression retrieval using an EmbeddingsFilter object with the ContextualCompressionRetriever class. ※ The OPENAI_API_KEY environment variable value is defined in the .env file. ▶ main.py
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain.retrievers import ContextualCompressionRetriever

load_dotenv()

def printDocumentList(documentList):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i + 1} :\n\n" + document.page_content for i, document in enumerate(documentList)]
        )
    )

textLoader = TextLoader("state_of_the_union.txt")
documentList = textLoader.load()

characterTextSplitter = CharacterTextSplitter(chunk_size = 1000, chunk_overlap = 0)
splitDocumentList = characterTextSplitter.split_documents(documentList)

openAIEmbeddings = OpenAIEmbeddings()

faiss = FAISS.from_documents(splitDocumentList, openAIEmbeddings)
vectorStoreRetriever = faiss.as_retriever()

embeddingsFilter = EmbeddingsFilter(embeddings = openAIEmbeddings, similarity_threshold = 0.76)

contextualCompressionRetriever = ContextualCompressionRetriever(base_compressor = embeddingsFilter, base_retriever = vectorStoreRetriever)

resultDocumentList = contextualCompressionRetriever.invoke("What did the president say about Ketanji Jackson Brown")

printDocumentList(resultDocumentList)
■ Shows how to create an in-memory byte store using the InMemoryByteStore class. ※ The OPENAI_API_KEY environment variable value is defined in the .env file. ▶ main.py
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain.storage import InMemoryByteStore
from langchain.embeddings import CacheBackedEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.vectorstores import FAISS

load_dotenv()

openAIEmbeddings = OpenAIEmbeddings()

inMemoryByteStore = InMemoryByteStore()

cacheBackedEmbeddings = CacheBackedEmbeddings.from_bytes_store(openAIEmbeddings, inMemoryByteStore, namespace = openAIEmbeddings.model)

textLoader = TextLoader("state_of_the_union.txt")
documentList = textLoader.load()

characterTextSplitter = CharacterTextSplitter(chunk_size = 1000, chunk_overlap = 0)
splitDocumentList = characterTextSplitter.split_documents(documentList)

list1 = list(inMemoryByteStore.yield_keys())

faiss = FAISS.from_documents(splitDocumentList, cacheBackedEmbeddings)

list2 = list(inMemoryByteStore.yield_keys())

print(len(list1))
print(len(list2))

"""
0
42
"""
■ Shows how to set up an embedding cache for a FAISS vector store using the CacheBackedEmbeddings class. ※ The OPENAI_API_KEY environment variable value is defined in the .env file. ▶ main.py
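The post body is cut off here; the following is a minimal sketch assembled from the InMemoryByteStore and LocalFileStore examples in this section, using a LocalFileStore so the cache survives restarts.

from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain.storage import LocalFileStore
from langchain.embeddings import CacheBackedEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.vectorstores import FAISS

load_dotenv()

openAIEmbeddings = OpenAIEmbeddings()

localFileStore = LocalFileStore("./cache/")

cacheBackedEmbeddings = CacheBackedEmbeddings.from_bytes_store(openAIEmbeddings, localFileStore, namespace = openAIEmbeddings.model)

documentList = TextLoader("state_of_the_union.txt").load()
splitDocumentList = CharacterTextSplitter(chunk_size = 1000, chunk_overlap = 0).split_documents(documentList)

# Embeddings are computed once and written under ./cache/; building the vector
# store again reads the cached vectors instead of calling the API.
faiss = FAISS.from_documents(splitDocumentList, cacheBackedEmbeddings)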
■ Shows how to create a CacheBackedEmbeddings object using the from_bytes_store static method of the CacheBackedEmbeddings class. ※ The OPENAI_API_KEY environment variable value is defined in the .env file. ▶ main.py
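Since the body is truncated, here is a minimal sketch of the from_bytes_store call, following the InMemoryByteStore example above.

from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain.storage import InMemoryByteStore
from langchain.embeddings import CacheBackedEmbeddings

load_dotenv()

openAIEmbeddings = OpenAIEmbeddings()

inMemoryByteStore = InMemoryByteStore()

# namespace keeps caches from different embedding models apart.
cacheBackedEmbeddings = CacheBackedEmbeddings.from_bytes_store(openAIEmbeddings, inMemoryByteStore, namespace = openAIEmbeddings.model)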
■ Shows how to create a local file store using the LocalFileStore class. ▶ main.py
from langchain.storage import LocalFileStore

localFileStore = LocalFileStore("./cache/")
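Although not shown in the original post, LocalFileStore implements LangChain's byte-store interface, so values can be written and read back with mset/mget; the key and value below are hypothetical.

localFileStore.mset([("key1", b"value1")]) # writes the file ./cache/key1
print(localFileStore.mget(["key1"])) # [b'value1']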
▶ requirements.txt
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
async-timeout==4.0.3
attrs==23.2.0
certifi==2024.6.2
charset-normalizer==3.3.2
frozenlist==1.4.1
greenlet==3.0.3
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.2.6
langchain-core==0.2.10
langchain-text-splitters==0.2.2
langsmith==0.1.82
multidict==6.0.5
numpy==1.26.4
orjson==3.10.5
packaging==24.1
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
requests==2.32.3
SQLAlchemy==2.0.31
tenacity==8.4.2
typing_extensions==4.12.2
urllib3==2.2.2
yarl==1.9.4
※ Ran the pip install langchain command.
■ Shows how to get a vector list from a string using the embed_query method of the HuggingFaceEmbeddings class. ▶ main.py
from langchain_huggingface import HuggingFaceEmbeddings

huggingFaceEmbeddings = HuggingFaceEmbeddings(model_name = "sentence-transformers/all-mpnet-base-v2")

embeddingList = huggingFaceEmbeddings.embed_query("What was the name mentioned in the conversation?")

print("Embedding list :", len(embeddingList))

"""
Embedding list : 768
"""
▶ requirements.txt
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
filelock==3.15.4
fsspec==2024.6.1
huggingface-hub==0.23.4
idna==3.7
Jinja2==3.1.4
joblib==1.4.2
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.2.10
langchain-huggingface==0.0.3
langsmith==0.1.82
MarkupSafe==2.1.5
mpmath==1.3.0
networkx==3.3
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.40
nvidia-nvtx-cu12==12.1.105
orjson==3.10.5
packaging==24.1
pillow==10.3.0
pydantic==2.7.4
pydantic_core==2.18.4
PyYAML==6.0.1
regex==2024.5.15
requests==2.32.3
safetensors==0.4.3
scikit-learn==1.5.0
scipy==1.14.0
sentence-transformers==3.0.1
sympy==1.12.1
tenacity==8.4.2
threadpoolctl==3.5.0
tokenizers==0.19.1
torch==2.3.1
tqdm==4.66.4
transformers==4.42.3
triton==2.3.1
typing_extensions==4.12.2
urllib3==2.2.2
※ pip install langchain-huggingface
■ Shows how to get a list of vector lists from a string list using the embed_documents method of the HuggingFaceEmbeddings class. ▶ main.py
from langchain_huggingface import HuggingFaceEmbeddings

huggingFaceEmbeddings = HuggingFaceEmbeddings(model_name = "sentence-transformers/all-mpnet-base-v2")

sourceList = [
    "Hi there!",
    "Oh, hello!",
    "What's your name?",
    "My friends call me World",
    "Hello World!"
]

embeddingListList = huggingFaceEmbeddings.embed_documents(sourceList)

print("Embedding list of lists :", len(embeddingListList))
print()

for embeddingList in embeddingListList:
    print("Embedding list :", len(embeddingList))

"""
Embedding list of lists : 5

Embedding list : 768
Embedding list : 768
Embedding list : 768
Embedding list : 768
Embedding list : 768
"""
※ pip install langchain-huggingface
■ Shows how to get a vector list from a string using the embed_query method of the OpenAIEmbeddings class. ※ The OPENAI_API_KEY environment variable value is defined in the .env file. ▶ main.py
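The body is truncated; a minimal sketch follows, mirroring the HuggingFaceEmbeddings embed_query example above. The vector length depends on the OpenAI embedding model in use.

from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings

load_dotenv()

openAIEmbeddings = OpenAIEmbeddings()

embeddingList = openAIEmbeddings.embed_query("What was the name mentioned in the conversation?")

# Length of the returned float list varies by model.
print("Embedding list :", len(embeddingList))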
■ Shows how to get a list of vector lists from a string list using the embed_documents method of the OpenAIEmbeddings class. ※ The OPENAI_API_KEY environment variable value is defined in the .env file.
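The body is truncated; a minimal sketch follows, mirroring the HuggingFaceEmbeddings embed_documents example above.

from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings

load_dotenv()

openAIEmbeddings = OpenAIEmbeddings()

embeddingListList = openAIEmbeddings.embed_documents(["Hi there!", "Oh, hello!"])

# One vector per input string.
print("Embedding list of lists :", len(embeddingListList))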