khoj/tests/conftest.py

# Standard Packages
import pytest
import torch

# Internal Packages
from src.search_type import asymmetric, image_search
from src.utils.rawconfig import (
    ContentConfig,
    TextContentConfig,
    ImageContentConfig,
    SearchConfig,
    SymmetricSearchConfig,
    AsymmetricSearchConfig,
    ImageSearchConfig,
)


@pytest.fixture(scope='session')
def search_config(tmp_path_factory):
    model_dir = tmp_path_factory.mktemp('data')

    search_config = SearchConfig()

    search_config.symmetric = SymmetricSearchConfig(
        encoder = "sentence-transformers/paraphrase-MiniLM-L6-v2",
        cross_encoder = "cross-encoder/ms-marco-MiniLM-L-6-v2",
        model_directory = model_dir
    )

    search_config.asymmetric = AsymmetricSearchConfig(
        encoder = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1",
        cross_encoder = "cross-encoder/ms-marco-MiniLM-L-6-v2",
        model_directory = model_dir
    )

    search_config.image = ImageSearchConfig(
        encoder = "clip-ViT-B-32",
        model_directory = model_dir
    )

    return search_config


@pytest.fixture(scope='session')
def model_dir(search_config):
    model_dir = search_config.asymmetric.model_directory
    device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

    # Generate Image Embeddings from Test Images
    content_config = ContentConfig()
    content_config.image = ImageContentConfig(
        input_directory = 'tests/data/images',
        embeddings_file = model_dir.joinpath('image_embeddings.pt'),
        batch_size = 10,
        use_xmp_metadata = False)

    image_search.setup(content_config.image, search_config.image, regenerate=False, verbose=True)

    # Generate Notes Embeddings from Test Notes
    content_config.org = TextContentConfig(
        input_files = None,
        input_filter = 'tests/data/notes/*.org',
        compressed_jsonl = model_dir.joinpath('notes.jsonl.gz'),
        embeddings_file = model_dir.joinpath('note_embeddings.pt'))

    asymmetric.setup(content_config.org, search_config.asymmetric, regenerate=False, device=device, verbose=True)

    return model_dir


@pytest.fixture(scope='session')
def content_config(model_dir):
    content_config = ContentConfig()
    content_config.org = TextContentConfig(
        input_files = None,
        input_filter = 'tests/data/notes/*.org',
        compressed_jsonl = model_dir.joinpath('notes.jsonl.gz'),
        embeddings_file = model_dir.joinpath('note_embeddings.pt'))

    content_config.image = ImageContentConfig(
        input_directory = 'tests/data/images',
        embeddings_file = model_dir.joinpath('image_embeddings.pt'),
        batch_size = 10,
        use_xmp_metadata = False)

    return content_config