# khoj/tests/conftest.py

# Standard Packages
import pytest

# Internal Packages
from src.search_type import image_search, text_search
from src.utils.config import SearchType
from src.utils.rawconfig import (
    ContentConfig,
    TextContentConfig,
    ImageContentConfig,
    SearchConfig,
    TextSearchConfig,
    ImageSearchConfig,
)
from src.processor.org_mode.org_to_jsonl import org_to_jsonl


@pytest.fixture(scope='session')
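def search_config(tmp_path_factory):
    """Search configuration shared across the test session; model_directory is a session-wide temporary directory."""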
    model_dir = tmp_path_factory.mktemp('data')

    search_config = SearchConfig()

    search_config.symmetric = TextSearchConfig(
        encoder = "sentence-transformers/all-MiniLM-L6-v2",
        cross_encoder = "cross-encoder/ms-marco-MiniLM-L-6-v2",
        model_directory = model_dir
    )

    search_config.asymmetric = TextSearchConfig(
        encoder = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1",
        cross_encoder = "cross-encoder/ms-marco-MiniLM-L-6-v2",
        model_directory = model_dir
    )

    search_config.image = ImageSearchConfig(
        encoder = "sentence-transformers/clip-ViT-B-32",
        model_directory = model_dir
    )

    return search_config


@pytest.fixture(scope='session')
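def model_dir(search_config):
    """Set up image and org notes search on the test data once per session, then return the model directory."""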
    model_dir = search_config.asymmetric.model_directory

    # Generate Image Embeddings from Test Images
    content_config = ContentConfig()
    content_config.image = ImageContentConfig(
        input_directories = ['tests/data/images'],
        embeddings_file = model_dir.joinpath('image_embeddings.pt'),
        batch_size = 10,
        use_xmp_metadata = False)

    image_search.setup(content_config.image, search_config.image, regenerate=False)

    # Generate Notes Embeddings from Test Notes
    content_config.org = TextContentConfig(
        input_files = None,
        input_filter = 'tests/data/org/*.org',
        compressed_jsonl = model_dir.joinpath('notes.jsonl.gz'),
        embeddings_file = model_dir.joinpath('note_embeddings.pt'))

    text_search.setup(org_to_jsonl, content_config.org, search_config.asymmetric, SearchType.Org, regenerate=False)

    return model_dir


@pytest.fixture(scope='session')
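def content_config(model_dir):
    """Content configuration for the org notes and image test data, with generated files placed under model_dir."""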
    content_config = ContentConfig()
    content_config.org = TextContentConfig(
        input_files = None,
        input_filter = 'tests/data/org/*.org',
        compressed_jsonl = model_dir.joinpath('notes.jsonl.gz'),
        embeddings_file = model_dir.joinpath('note_embeddings.pt'))

    content_config.image = ImageContentConfig(
        input_directories = ['tests/data/images'],
        embeddings_file = model_dir.joinpath('image_embeddings.pt'),
        batch_size = 1,
        use_xmp_metadata = False)

    return content_config


# Change history for this file (from the blame view, oldest first)
#
# 2021-10-02 19:46:29 -07:00  Create test directory with model data to reuse for pytest session
#   - Use pytest fixture with session scope
#   - Instantiate test directory with model data to reuse for tests
#
# 2021-10-02 20:28:33 -07:00  Split test_main into client and search type specific test files
#   - Move search config fixture to conftests.py to be shared across tests
#   - Move image search type specific tests to test_image_search.py file
#   - Move, create asymmetric search type specific tests in new file
#
# 2022-01-14 20:13:14 -05:00  Setup Search with Search_Config to Fix Tests
#   - Rename pytest fixture search_config to more appropriate content_config
#   - Create search_config pytest fixture
#   - Use search_config where search is being set up and used in tests
#
# 2022-01-14 20:54:38 -05:00  Rename RawConfig Types for Consistency
#   - Naming convention
#     - [ContentType][ConfigType]Config
#     - Where [ConfigType] ~ Content, Search, Processor
#     - Where [ContentType] ~ Text, Image, Asymmetric, Symmetric, Conversation
#   - Current Configs:
#     - Content: Org Notes, Org Music, Image, Ledger/Beancount
#     - Search: Asymmetric, Symmetric, Image
#     - Processor: Conversation
#
# 2022-01-29 01:57:08 -05:00  Improve test data organization and update corresponding conftests
#   - Put test data for each content type into separate directories
#   - Makes config.yml for docker and local host consistent
#   - Prepending tests to /data in sample_config.yml makes application run on local host using test data
#   - Allows mounting separate volume for each content type in docker-compose
#   - Ignore gitignore to only add tests content, not generated models or embeddings
#
# 2022-07-18 20:00:19 +04:00  Use a better model for asymmetric semantic search
#   - The multi-qa-MiniLM-L6-cos-v1 is more extensively benchmarked[1]
#   - It has the right mix of model query speed, size and performance on benchmarks
#   - On hugging face it has way more downloads and likes than the msmarco model[2]
#   - On very preliminary evaluation of the model
#     - It doubles the encoding speed of all entries (down from ~8min to 4mins)
#     - It gave more entries that stay relevant to the query (3/5 vs 1/5 earlier)
#   [1]: https://www.sbert.net/docs/pretrained_models.html
#   [2]: https://huggingface.co/sentence-transformers
#
# 2022-07-18 20:16:40 +04:00  Use a more accurate model for symmetric semantic search
#   - The all-MiniLM-L6-v2 is more accurate
#   - The exact previous model isn't benchmarked; the comparison is based on the performance
#     of the closest model to it. The new model may be similar in speed and size
#   - On very preliminary evaluation of the model, the new model seems faster, with pretty decent results
#
# 2022-07-21 18:05:43 +04:00  Merge Symmetric, Asymmetric Search Types into a single Text Search Type
#   - The code for both text search types was mostly the same.
#     It was done this way earlier for expedience while experimenting
#   - The minor differences were reconciled and merged into a single text_search type
#   - This simplifies the app and makes it easier to process other text types
#
# 2022-08-04 22:29:57 +03:00  Rename org-mode test data directory to more specific org/ from notes/
#
# 2022-08-04 22:54:39 +03:00  Set clip model name in conftest to sentence-transformers/clip as well
#
# 2022-08-20 14:21:04 +03:00  Re-enable tests for image search
#   - Verify if recent fixes resolve test flakiness
#
# 2022-09-03 22:13:25 +03:00  Update Tests to Configure Filters and Setup Text Search