Commit graph

139 commits

Author SHA1 Message Date
Debanjum Singh Solanky
dfb277ee37 Set skipif at module level if OpenAI API key not set for chat tests
- Remove stale message_to_prompt test
  It is too broad, reduces maintainability.
  Remove as it doesn't really need its own test right now
- Setting skipif at module level for chat actor, director tests
  reduces code duplication as earlier was using decorator on each chat
  test
2023-03-16 12:23:52 -06:00
Debanjum Singh Solanky
4e15b4e411 Create test notes dataset for chat testing
Combine hand-written custom notes and PG essays with personal
content to bulk up notes count

Delete old documentation markdown as not a representative dataset for
application (which is more tuned for personal notes)
2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky
1b4d562700 Test Chat Director Capabilities: Answer from notes, chat history etc
- Chat directors are broad agents.
  - Chat directors orchestrate narrow actor agents to synthesize
    final response for the user
  - Agents are Prompts + ML Model

- Test Chat Director Capabilities
  1. [X] Answer from retrieved notes
  2. [X] Answer from chat history
  3. [X] Answer general questions
  4. [X] Carry out multi-turn conversation
  5. [X] Say don't know when answer not in provided context
  6. [X] Answers that require current date awareness
     This test is expected to fail as the chat is not capable of doing
     this without the Search actor. But the test allows assessing chat quality
  7. [X] Date-aware aggregation across multiple different notes
     This test is expected to fail as the chat is not capable of doing
     this without the Search actor. But the test allows assessing chat quality
  8. [X] Ask clarification questions if no unambiguous answer in provided context
  9. [X] Retrieve answer from chat history beyond lookback window
     This test is expected to fail as the chat director is not capable
     of searching chat history yet. But the test allows assessing chat quality
 10. [X] Retrieve context for answer using multiple independent
         searches on knowledge base
     This test is expected to fail as the chat is not capable of doing
     this without the Search actor. But the test allows assessing chat quality
2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky
b6d63137f1 Setup Pytest fixture for conversation processor to test chat API
- Index markdown test data as knowledge base. As easier to get good
  markdown content (vs org)
- Setup markdown_content_config, processor_config and chat_client to
  test chat API
2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky
3f719c9e17 Rename Chat Model+Prompt tests to chat actor tests 2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky
7526a50dd4 Extract conversation processor utility funcs from gpt.py into utils.py 2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky
7c4d546039 Configure tests to mark chat quality tests & filter unhelpful warnings
- Mark chat quality tests, register custom mark for chat quality
- Filter unhelpful deprecation warnings from within dateparser library
- Error if tests use unregistered marks
2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky
c1128a1ad8 Test Chat Actor Capabilities; ability to answer from notes, chat logs etc
- Chat actors are narrow agents (prompt + ML model)
  Chat actors are different from the Chat director. who orchestrates
  the narrow actor agents to synthesize final response to the user

- Test Chat Actor Capabilities
  1. Answer from retrieved notes
  2. Answer from chat history
  3. Answer general questions
  4. Carry out multi-turn conversation
  5. Say don't know when answer not in provided context
  6. Answers that require current date awareness
  7. Date-aware aggregation across multiple different notes
  8. Ask clarification questions if no unambiguous answer in provided context
     This test is expected to fail as the chat is not capable of doing
     this consistently yet. But having the test allows assessing chat quality

- Use Openai API Key from OPENAI_API_KEY environment variable
- Gitignore .env file, python virtualenv directory
  Put OpenAI API Key in .env file to run chatbot tests via vscode
  The .env file is default location for importing env vars
2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky
9306cd901a Clean up chat tests to work with updated chat methods in gpt.py 2023-03-16 09:30:37 -06:00
Debanjum Singh Solanky
b6cdc5c7cb Do not expose answer API as a chat type in chat web interface or API
Answer does not rely on past conversations, just the knowledge base.
It is meant for one off interactions, like search rather than a
continuing conversation like chat

For now it is only exposed via API. Later it will be expose in the
interfaces as well

Remove ability to select different chat types from the chat web
interface as there is only a single chat type

Stop appending answers to the conversation logs
2023-03-05 18:21:59 -06:00
Debanjum Singh Solanky
211e460398 Output date filter from cache log at debug level. Remove unused imports
Other logs not directly useful to user have already been converted
to debug log levels in 1ae4016. Just forgot to convert this log line too
2023-03-02 15:41:32 -06:00
Debanjum Singh Solanky
c823f46d89 Test error on missing fields in ContentConfig pulled from Khoj.yml
Resolves #9
2023-03-02 15:35:39 -06:00
Debanjum Singh Solanky
fe03ba3dce Index intro text before headings in org files
- Text before headings was not being indexed due to buggy orgnode
  parsing logic
- Resolved indexing intro text from files with and without headings in
  them
- Ensure intro text node has heading set to all title lines collected
  from the file

Resolves #165
2023-03-01 12:11:33 -06:00
Debanjum Singh Solanky
2bed4c3b50 Fix configuring search types & /config/types API when no plugin configured
- Test /config/types API when no plugin configured, only plugin configured
  and no content configured scenarios
- Do not throw null reference exception while configuring search types
  when no plugin configured
- Do not throw null reference exception on calling /config/types API
  when no plugin configured

Resolves bug introduced by #173
2023-03-01 01:23:37 -06:00
Debanjum Singh Solanky
b09350c052 Fix to return only enabled content types via the new config/types API
- Previously was return all core content types even if they had not been
  setup
- Add test to validate only configured content types are returned by
  the api/config/types API endpoint
2023-02-28 22:08:26 -06:00
Debanjum Singh Solanky
ede6eb6879 Re-enable testing search and update API with image content type
It may have been disabled due to issues with image search earlier
2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky
88a9eadfba Use client pytest fixture to test API with plugin type configured 2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky
ab501a56c9 Create pytest fixture to configure app with plugin, search types 2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky
f944408e69 Update content_config pytest fixture to index plugin content 2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky
68bd5d9ebc Configure API routes after set up search types while configuring server
Configure app routes after configuring server.
Import API routers after search type is dynamically populated.
Allow API to recognize the dynamically populated plugin search types
as valid type query param.
Enable searching for plugin type content.
2023-02-28 20:25:51 -06:00
Debanjum Singh Solanky
55a032e8c4 Add processor to index entries from jsonl files for plugins
- Read, merge entries from input jsonl files and filters
- Mark new, modified entries for update
2023-02-24 02:54:12 -06:00
Debanjum Singh Solanky
fcbbe8c759 Read content plugin configs from Khoj config YAML
Configure external text content plugins via the Khoj YAML
Reuse existing TextContentConfig definition for external text content plugins
2023-02-23 23:57:32 -06:00
Debanjum Singh Solanky
47569da38e Fix usage of "\" in orgnode test string to resolve DeprecationWarning 2023-02-17 17:15:44 -06:00
Debanjum Singh Solanky
051f0e3fb5 Add, configure and run pre-commit locally and in test workflow 2023-02-17 13:31:36 -06:00
Debanjum Singh Solanky
5e83baab21 Use Black to format Khoj server code and tests 2023-02-17 11:55:17 -06:00
Debanjum Singh Solanky
af6d65a909 Create tagged Docker image on new tag/release 2023-02-14 20:04:06 -06:00
Debanjum Singh Solanky
bc7477ea3e Move Emacs, Obsidian plugin code out from under src/khoj directory
- What
  - The Emacs and Obsidian interfaces stay in their original
    directories under src/
  - src/khoj now only contains code meant for pypi packaging

- Benefits
  - This avoids having to update khoj MELPA, Obsidian plugin config as
    the Emacs, Obsidian code is under their original directories
  - It separates the code in src/khoj meant for python packaging from
    code for external interfaces like Emacs and Obsidian
2023-02-14 15:44:22 -06:00
Debanjum Singh Solanky
25a749ca1d Use the src/ layout to fix packaging Khoj for PyPi
- Why
  The khoj pypi packages should be installed in `khoj' directory.
  Previously it was being installed into `src' directory, which is a
  generic top level directory name that is discouraged from being used

- Changes
 - move src/* to src/khoj/*
 - update `setup.py' to `find_packages' in `src' instead of project root
 - rename imports to form `from khoj.*' in complete project
 - update `constants.web_directory' path to use `khoj' directory
 - rename root logger to `khoj' in `main.py'
 - fix image_search tests to use the newly rename `khoj' logger
 - update config, docs, workflows to reference new path `src/khoj'
2023-02-14 15:19:06 -06:00
Debanjum Singh Solanky
6908b6eed3 Truncate image queries below max tokens length supported by ML model
This would previously return the infamous tensor size mismatch error
Verify this error is not raised since adding the query truncation logic
2023-01-21 14:11:00 -03:00
Debanjum Singh Solanky
3d9ed91e42 Search by image at path only if query of form "file:/path/to/image"
Previously no query syntax helpers, like the "file:" prefix, were used
before checking if query contains file path.

This made query to image search brittle to misinterpretation and
pointless checking

Add test to verify search by image at file works as expected
2023-01-21 14:06:56 -03:00
Debanjum Singh Solanky
7b4f78776c Fix extracting Markdown Entries with Top Level Headings
- Previously top level headings would have get stripped of the
  space between heading text and the prefix # symbols. That is,
  `# Top Level Heading' would get converted to `#Top Level Heading'
- This would mess up their rendering as a heading in search results

- Add unit tests to text_to_jsonl processors to prevent regression
2023-01-17 13:06:28 -03:00
Debanjum Singh Solanky
d40076fcd6 Deduplicate test code, make teardown more robust using pytest fixtures 2023-01-09 19:47:27 -03:00
Debanjum Singh Solanky
237123d18c Fix tests for the conversation processor
- Use latest davinci model for tests
- Wrap prompt in triple quotes to improve legibilty
- `understand' method returns dictionary instead of string. Fix its test
- Fix prompt for new model to pass `chat_with_history' test
2023-01-09 00:22:26 -03:00
Debanjum Singh Solanky
826f9dc054 Drop long words from compiled entries to be within max token limit of models
Long words (>500 characters) provide less useful context to models.

Dropping very long words allow models to create better embeddings by
passing more of the useful context from the entry to the model
2023-01-07 23:13:56 -03:00
Debanjum Singh Solanky
24676f95d8 Fix comments, use minimal test case, regenerate test index, merge debug logs
- Remove property drawer from test entry for max_words splitting test
  - Property drawer is not required for the test
  - Keep minimal test case to reduce chance for confusion
2022-12-25 22:33:04 -03:00
Debanjum Singh Solanky
53cd2e5605 Regenerate initial model in asymmetric reload test to reduce flakyness
- Fix logger message when converting org node to entries
- Remove unused import from conftest
2022-12-25 21:36:15 -03:00
Debanjum Singh Solanky
c79919bd68 Split entries by max tokens while converting Org entries To JSONL
- Test usage the entry splitting by max tokens in text search
2022-12-25 21:36:00 -03:00
Debanjum Singh Solanky
e057c8e208 Add method to split entries by specified max tokens limit
- Issue
   ML Models truncate entries exceeding some max token limit.
   This lowers the quality of search results

- Fix
  Split entries by max tokens before indexing.
  This should improve searching for content in longer entries.

- Miscellaneous
  - Test method to split entries by max tokens
2022-12-23 16:24:04 -03:00
Debanjum Singh Solanky
d292bdcc11 Do not version API. Premature given current state of the codebase
- Reason
  - All clients that currently consume the API are part of Khoj
  - Any breaking API changes will be fixed in clients immediately
  - So decoupling client from API is not required
  - This removes the burden of maintaining muliple versions of the API
2022-10-08 16:32:46 +03:00
Debanjum Singh Solanky
2c548133f3 Remove unused imports, `embeddings' variable from text search tests 2022-10-08 12:06:05 +03:00
Debanjum Singh Solanky
7e9298f315 Use new Text Entry class to track text entries in Intermediate Format
- Context
  - The app maintains all text content in a standard, intermediate format
  - The intermediate format was loaded, passed around as a dictionary
    for easier, faster updates to the intermediate format schema initially
  - The intermediate format is reasonably stable now, given it's usage
    by all 3 text content types currently implemented

- Changes
  - Concretize text entries into `Entries' class instead of using dictionaries
    - Code is updated to load, pass around entries as `Entries' objects
      instead of as dictionaries
    - `text_search' and `text_to_jsonl' methods are annotated with
       type hints for the new `Entries' type
    - Code and Tests referencing entries are updated to use class style
      access patterns instead of the previous dictionary access patterns

  - Move `mark_entries_for_update' method into `TextToJsonl' base class
    - This is a more natural location for the method as it is only
      (to be) used by `text_to_jsonl' classes
    - Avoid circular reference issues on importing `Entries' class
2022-10-08 12:06:05 +03:00
Debanjum Singh Solanky
e42a38e825 Version Khoj API, Update frontends, tests and docs to reflect it
- Split router.py into v1.0, beta and frontend (no-prefix) api modules
  under new router package. Version tag in main.py via prefix
- Update frontends to use the versioned api endpoints
- Update tests to work with versioned api endpoints
- Update docs to mentioned, reference only versioned api endpoints
2022-09-28 20:08:38 +03:00
Debanjum Singh Solanky
02d944030f Use Base TextToJsonl class to standardize <text>_to_jsonl processors
- Start standardizing implementation of the `text_to_jsonl' processors
  - `text_to_jsonl; scripts already had a shared structure
  - This change starts to codify that implicit structure

- Benefits
  - Ease adding more `text_to_jsonl; processors
  - Allow merging shared functionality
  - Help with type hinting

- Drawbacks
  - Lower agility to change. But this was already an implicit issue as
    the text_to_jsonl processors got more deeply wired into the app
2022-09-16 00:53:11 +03:00
Debanjum Singh Solanky
bf1ae038cb Get XMP metadata from image using Pillow. Remove ExifTool dependency
- Pillow already supports reading XMP metadata from Images
- Removes need to maintain my fork of unmaintained PyExiftool
  - This also removes dependency on system Exiftool package for
    XMP metadata extraction
- Add test to verify XMP metadata extracted from test images
- Remove references to Exiftool from Documentation
2022-09-16 00:48:45 +03:00
Debanjum Singh Solanky
1bfe9c4ef2 Handle filter only queries. Short-circuit and return filtered results
- For queries with only filters in them short-circuit and return
  filtered results. No need to run semantic search, re-ranking.
- Add client test for filter only query and quote query in client tests
2022-09-12 17:13:05 +03:00
Debanjum Singh Solanky
536f03af8f Process text content files in sorted order for stable indexing
- Image search already uses a sorted list of images to process
- Prevents index of entries to desync when entries, embeddings
  generated by a separate server/app instance
2022-09-12 11:09:40 +03:00
Debanjum Singh Solanky
a701ad08b9 Support multiple input-filters to configure content to index via khoj.yml
- Update existings code, tests to process input-filters as list
  instead of str
- Test `text_to_jsonl' get files methods to work with combination of
  `input-files' and `input-filters'

Resolves #84
2022-09-12 11:08:59 +03:00
Debanjum Singh Solanky
9d369ae4df Fix OrgNode render of entries with property drawers and empty body
- Issue
  - Indent regex was previously catching escape sequences like newlines
  - This was resulting in entries with only escape sequences in body to
    be prepended to property drawers etc during rendering
- Fix
  - Update indent regex to only look for spaces in each line
  - Only render body when body contains non-escape characters
  - Create test to prevent this regression from silently resurfacing
2022-09-11 16:09:19 +03:00
Debanjum Singh Solanky
253c9eae9a Set index_heading_entries field in config to index entries with no body
- Previously heading entries were not indexed to maintain search quality
- But given that there are use-cases for indexing entries with no body
- Add a configurable `index_heading_entries' field to index heading entries
- This `TextContentConfig' field is currently only used for OrgMode content
2022-09-11 16:09:19 +03:00
Debanjum Singh Solanky
1d3b3d5f39 Convert field get/set methods in OrgNode class to @property
- Use more descriptive variable names in OrgNode parser and class
- Convert OrgNode fields to private/protected, use property methods to
  get/set them
2022-09-11 14:59:28 +03:00