Commit graph

3123 commits

Author SHA1 Message Date
Debanjum Singh Solanky
da17ff6ac8 Add Upgrade instructions for Khoj.el Readme. Fix version of khoj.el 2023-01-04 20:06:39 -03:00
Debanjum Singh Solanky
66ccd0c970 Create Obsidian plugin for Khoj
- Features
  - Search using Khoj from within the Obsidian app
    Allow Natural language search on your (markdown) notes in Obsidian Vault

  - Show search results as rendered (instead of raw) Markdown
    Improve legibility of the results

  - Jump to selected note from search result in Khoj search modal
    Simplify seeing result within its original note context

  - Automatically configure khoj to index markdown files in current vault
    Reduce khoj setup steps for plugin users by using reasonable defaults

    - Code updates the markdown config in khoj.yml and triggers index update
    - It can be configured by user in khoj plugin settings, if required

  - Add Demo and detailed Readme for the Obsidian plugin
    Ease setup and usage. Give context about capabilities

- Miscellaneous
  - Trying to keep a mono repo until the Khoj project is mature enough
    to reduce maintenance burden
2023-01-04 18:28:16 -03:00
Debanjum Singh Solanky
feddb6ce62 Add start_url to khoj webmanifest to show Khoj as PWA on Chrome 2023-01-04 13:37:56 -03:00
Debanjum Singh Solanky
3dee1aed9e Create /config/data/default API endpoint to serve default khoj config
This can ease configuring khoj from the different interfaces

- Don't need to know all the (default) config used by khoj.
- Just get default config by calling the above API endpoint.
- Then modify desired portions and call POST /api/config/data to
  configure khoj.
2023-01-03 21:52:34 -03:00
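The workflow in this commit message (fetch the defaults, change the desired portions, post the result back) can be sketched as below. This is a minimal sketch, not Khoj's client code: the `apply_overrides` helper, the config keys, and the sample values are hypothetical, and the HTTP calls are left out so only the merge step is shown.

```python
import copy


def apply_overrides(default_config: dict, overrides: dict) -> dict:
    """Return a copy of default_config with overrides merged in recursively."""
    merged = copy.deepcopy(default_config)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Descend into nested sections so untouched defaults survive
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical stand-in for the body returned by the default config endpoint
default_config = {"content-type": {"markdown": {"input-filter": None, "input-files": None}}}

# Only the portion the client wants to change
overrides = {"content-type": {"markdown": {"input-filter": ["~/vault/**/*.md"]}}}

new_config = apply_overrides(default_config, overrides)
```

A client would then send `new_config` as the body of the POST request described above.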
Debanjum Singh Solanky
ce945f7a90 Configure processors too on calling /update API
- Previously only search was being reconfigured
- But Processors are configured on app start too
- Match that behavior on calling /update API
2023-01-03 21:51:02 -03:00
Debanjum Singh Solanky
9d31988f42 Allow starting khoj in non-GUI mode without config file instantiated
- Start khoj server (in non-GUI mode) without needing config file
  already instantiated.
  - But throw warning to configure khoj to use it
- This allows plugins to configure the app via the /config/data APIs
- To be used by the Khoj obsidian plugin to configure markdown content
  in khoj
2023-01-03 21:36:59 -03:00
Debanjum Singh Solanky
52664dd96c Allow recursive glob pattern (**) to add files to search index
- Simplify configuring files to index for Obsidian/Org-Roam type
  systems with lots of small files in khoj.yml using `input-filter'
2023-01-03 01:32:58 -03:00
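Recursive `**` patterns can be illustrated with Python's standard `glob` module. This is a sketch of the general technique rather than the code in this commit; the `find_files` helper and the temporary vault layout are made up for the example.

```python
import glob
import os
import tempfile


def find_files(input_filter: str) -> list[str]:
    """Expand a glob pattern; with recursive=True, `**` spans nested directories."""
    return sorted(glob.glob(os.path.expanduser(input_filter), recursive=True))


with tempfile.TemporaryDirectory() as vault:
    # Build a small vault with notes at different depths
    os.makedirs(os.path.join(vault, "daily", "2023"))
    for relative_path in ["index.md", "daily/2023/jan.md", "daily/todo.org"]:
        open(os.path.join(vault, *relative_path.split("/")), "w").close()

    matches = find_files(os.path.join(vault, "**", "*.md"))
    relative_matches = [os.path.relpath(m, vault).replace(os.sep, "/") for m in matches]
```

A single pattern picks up Markdown files at every depth, so each folder does not need its own entry.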
Debanjum Singh Solanky
152e5f1661 Return the file of each search result in response
- Useful for enabling jump to note functionality in interfaces
- It will be used in the Khoj plugin for Obsidian
2023-01-03 01:25:34 -03:00
Debanjum Singh Solanky
c535953915 Update index automatically in non GUI mode too
- Poll scheduler every minute using threading.Timer
  - Use 60 seconds polling interval to avoid fork bombing
- Schedule next via the same poll scheduler
- Allow clean program interrupt by running scheduler in daemon mode
2023-01-01 21:03:19 -03:00
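The polling approach described in this commit can be sketched with `threading.Timer` as below. It is a minimal sketch under assumptions: `start_polling`, the `stop` event, and `fake_run_pending` are hypothetical names, and the demo uses a very short interval instead of the 60 seconds mentioned above.

```python
import threading


def start_polling(run_pending, interval_seconds: float = 60.0, stop=None) -> threading.Timer:
    """Call run_pending after the interval, then re-arm a timer for the next poll."""

    def poll():
        if stop is not None and stop.is_set():
            return  # stop requested: do not run or re-arm
        run_pending()
        start_polling(run_pending, interval_seconds, stop)  # schedule the next poll

    timer = threading.Timer(interval_seconds, poll)
    timer.daemon = True  # daemon threads do not block program exit
    timer.start()
    return timer


# Demo with a short interval so the polls happen quickly
calls = []
stop = threading.Event()
three_polls_done = threading.Event()


def fake_run_pending():
    calls.append("polled")
    if len(calls) >= 3:
        three_polls_done.set()


start_polling(fake_run_pending, interval_seconds=0.01, stop=stop)
finished = three_polls_done.wait(timeout=5)
stop.set()
```

Each poll arms exactly one follow-up timer, so the number of live timer threads stays constant.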
Debanjum Singh Solanky
701d92e17b Lock the index before updating it via API or Scheduler
- There are 3 paths to updating/setting the index (stored in state.model)
  - App start
  - API
  - Scheduler

- Put all updates to the index behind a lock, as multiple update paths
  could (potentially) run at the same time (via API or Scheduler)
2023-01-01 17:09:36 -03:00
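Putting index updates behind a lock could look like the sketch below. The `state` dictionary and `update_index` function are stand-ins invented for this example, not the project's `state.model` code.

```python
import threading

index_lock = threading.Lock()
state = {"model": None, "updates": 0}  # hypothetical stand-in for shared app state


def update_index(source: str) -> None:
    """Rebuild the (stand-in) index while holding the lock."""
    with index_lock:
        # Only one update path can be inside this block at a time
        current = state["updates"]
        state["model"] = f"index built by {source}"
        state["updates"] = current + 1


# Simulate the API and Scheduler paths triggering updates concurrently
threads = [threading.Thread(target=update_index, args=(source,)) for source in ["api", "scheduler"] * 50]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```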
Debanjum Singh Solanky
3b0783aab9 Automate updating embeddings, search index on an hourly schedule
- Use the schedule pypi package
- Use QTimer to poll schedule.run_pending() regularly for jobs to run
2023-01-01 17:09:36 -03:00
Debanjum
06c25682c9
Split text entries by max tokens supported by ML models
### Background
There is a limit to the maximum input tokens (words) that an ML model can encode into an embedding vector.
For the models used for text search in khoj, a max token size of 256 words is appropriate [1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1#:~:text=model%20was%20just%20trained%20on%20input%20text%20up%20to%20250%20word%20pieces),[2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2#:~:text=input%20text%20longer%20than%20256%20word%20pieces%20is%20truncated)

### Issue
Until now entries exceeding max token size would silently get truncated during embedding generation.
So the truncated portion of the entries would be ignored when matching queries with entries
This would degrade the quality of the results

### Fix
- e057c8e Add method to split entries by specified max tokens limit
- Split entries by max tokens while converting [Org](https://github.com/debanjum/khoj/commit/c79919b), [Markdown](https://github.com/debanjum/khoj/commit/f209e30) and [Beancount](https://github.com/debanjum/khoj/commit/17fa123) entries to JSONL
- b283650 Deduplicate results for user query by raw text before returning results

### Results
- The quality of the search results should improve
- Relevant, long entries should show up in results more often
2022-12-26 18:23:43 +00:00
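The splitting idea from the Fix section above can be sketched as follows. This assumes whitespace-separated words are an acceptable approximation of model tokens; `split_entry_by_max_tokens` is a hypothetical helper, not the method added in e057c8e.

```python
def split_entry_by_max_tokens(text: str, max_tokens: int = 256) -> list[str]:
    """Split text into chunks of at most max_tokens whitespace-separated words."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


# A 600 word entry no longer gets truncated after the first 256 words
long_entry = " ".join(f"word{i}" for i in range(600))
chunks = split_entry_by_max_tokens(long_entry, max_tokens=256)
chunk_sizes = [len(chunk.split()) for chunk in chunks]
```

Each chunk can then be embedded separately, so text beyond the model limit still contributes to search.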
Debanjum Singh Solanky
17fa123b4e Split entries by max tokens while converting Beancount entries To JSONL 2022-12-26 15:14:32 -03:00
Debanjum Singh Solanky
f209e30a3b Split entries by max tokens while converting Markdown entries To JSONL 2022-12-26 13:14:15 -03:00
Debanjum Singh Solanky
24676f95d8 Fix comments, use minimal test case, regenerate test index, merge debug logs
- Remove property drawer from test entry for max_words splitting test
  - Property drawer is not required for the test
  - Keep minimal test case to reduce chance for confusion
2022-12-25 22:33:04 -03:00
Debanjum Singh Solanky
b283650991 Deduplicate results for user query by raw text before returning results
- Required because entries are now split by the max_word count supported
  by the ML models
- This would now result in potentially duplicate hits, entries being
  returned to user
- Do deduplication after ranking to get the top ranked deduplicated
  results
2022-12-25 21:36:15 -03:00
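Deduplicating after ranking might be sketched as below. The hit dictionaries with `raw`, `compiled` and `score` keys are an assumed shape chosen for illustration, not the project's actual result type.

```python
def deduplicate_by_raw(ranked_hits: list[dict]) -> list[dict]:
    """Keep the first (highest ranked) hit for each distinct raw text."""
    seen = set()
    unique = []
    for hit in ranked_hits:
        if hit["raw"] not in seen:
            seen.add(hit["raw"])
            unique.append(hit)
    return unique


# Two chunks of the same long note match the query, plus one other note
hits = [
    {"raw": "* Long note", "compiled": "Long note (part 2)", "score": 0.4},
    {"raw": "* Other note", "compiled": "Other note", "score": 0.6},
    {"raw": "* Long note", "compiled": "Long note (part 1)", "score": 0.9},
]

ranked = sorted(hits, key=lambda hit: hit["score"], reverse=True)
results = deduplicate_by_raw(ranked)
```

Because ranking happens first, the surviving copy of each note is the one with its best score.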
Debanjum Singh Solanky
53cd2e5605 Regenerate initial model in asymmetric reload test to reduce flakiness
- Fix logger message when converting org node to entries
- Remove unused import from conftest
2022-12-25 21:36:15 -03:00
Debanjum Singh Solanky
c79919bd68 Split entries by max tokens while converting Org entries To JSONL
- Test usage of the entry splitting by max tokens in text search
2022-12-25 21:36:00 -03:00
Debanjum Singh Solanky
08dc5e3324 Update instructions in khoj.el to install it from MELPA stable
- The instructions suggest installing khoj-assistant via pip install.
  This installs the latest tagged/release version of khoj
- To match that version user should install khoj.el from MELPA stable
  instead of MELPA
2022-12-23 19:08:38 -03:00
Debanjum Singh Solanky
e057c8e208 Add method to split entries by specified max tokens limit
- Issue
   ML Models truncate entries exceeding some max token limit.
   This lowers the quality of search results

- Fix
  Split entries by max tokens before indexing.
  This should improve searching for content in longer entries.

- Miscellaneous
  - Test method to split entries by max tokens
2022-12-23 16:24:04 -03:00
Debanjum Singh Solanky
d3e175370f Update readme to install khoj.el from MELPA stable unless using pre-release khoj
Update readme to ask user to install khoj.el from MELPA when a
pre-release version of the main khoj app is installed. Else install
khoj.el from MELPA Stable
2022-12-20 23:29:22 -03:00
Debanjum Singh Solanky
cd463c5085 Update Khoj.el Install Instructions on Emacs 2022-12-20 11:06:33 -03:00
Debanjum Singh Solanky
23ca5a2d43 Improve (un-)quoting of funcs used in `khoj--get-enabled-content-types'
- Based on melpa package feedback for khoj.el
- Verified these changes don't affect behavior of the function
2022-12-19 18:02:23 -03:00
Debanjum Singh Solanky
5db3a67df5 Fix Khoj Emacs package URL in khoj.el 2022-12-14 22:49:19 -03:00
Debanjum Singh Solanky
abad6d5f44 Declare external khoj.el funcs. Remove undefined func warnings on install 2022-12-14 22:36:04 -03:00
Debanjum Singh Solanky
c52383b11c Delete stale, unused installation helper script 2022-12-03 13:36:47 -03:00
Debanjum Singh Solanky
1990d09032 Bump khoj version in setup.py, khoj.el to 0.2.0 2022-12-02 14:58:54 -03:00
Debanjum Singh Solanky
a9cfd8b800 Extract hash func for incremental text indexing into separate method 2022-10-26 13:56:58 +05:30
Debanjum Singh Solanky
0de2ff9c97 Add __init__.py to routers directory to register it as a package 2022-10-25 20:40:40 +05:30
Debanjum Singh Solanky
55d2fea9be Move Custom Formatter class for logger to util.helper module from main.py 2022-10-20 00:32:24 +05:30
Debanjum Singh Solanky
1c40f97114 Merge branch 'master' of github.com:debanjum/khoj into modularize-api-and-increase-typing
- Conflicts:
  - src/interface/emacs/khoj.el
    Use our update to `config-url', use their `url-request-method'
2022-10-19 16:46:53 +05:30
Debanjum Singh Solanky
e1b5a87920 Rename Frontend Router to Web Client. Fix logger usage in routers
- Use logger in api_beta router instead of print statements
- Remove unused logger in web client router
2022-10-19 16:36:48 +05:30
Debanjum
4abd51cb04
Merge pull request from telotortium/method
Explicitly set `url-request-method' to GET in khoj.el
2022-10-19 10:31:37 +00:00
Debanjum Singh Solanky
c467df8fa3 Setup `mypy' for static type checking 2022-10-08 17:33:13 +03:00
Debanjum Singh Solanky
d292bdcc11 Do not version API. Premature given current state of the codebase
- Reason
  - All clients that currently consume the API are part of Khoj
  - Any breaking API changes will be fixed in clients immediately
  - So decoupling client from API is not required
  - This removes the burden of maintaining multiple versions of the API
2022-10-08 16:32:46 +03:00
Debanjum Singh Solanky
7e9298f315 Use new Text Entry class to track text entries in Intermediate Format
- Context
  - The app maintains all text content in a standard, intermediate format
  - The intermediate format was loaded, passed around as a dictionary
    for easier, faster updates to the intermediate format schema initially
  - The intermediate format is reasonably stable now, given its usage
    by all 3 text content types currently implemented

- Changes
  - Concretize text entries into `Entries' class instead of using dictionaries
    - Code is updated to load, pass around entries as `Entries' objects
      instead of as dictionaries
    - `text_search' and `text_to_jsonl' methods are annotated with
       type hints for the new `Entries' type
    - Code and Tests referencing entries are updated to use class style
      access patterns instead of the previous dictionary access patterns

  - Move `mark_entries_for_update' method into `TextToJsonl' base class
    - This is a more natural location for the method as it is only
      (to be) used by `text_to_jsonl' classes
    - Avoid circular reference issues on importing `Entries' class
2022-10-08 12:06:05 +03:00
Debanjum Singh Solanky
99754970ab Type the /search API response to better document the response schema
- Both Text, Image Search were already giving list of entry, score
- This change just concretizes this change and exposes this in the API
  documentation (i.e OpenAPI, Swagger, Redocs)
2022-10-08 12:06:05 +03:00
Debanjum Singh Solanky
0521ea10d6 Put image score breakdown under `additional' field in search response
- Update web, emacs interfaces to consume the scores from new schema
2022-10-08 12:06:01 +03:00
Debanjum Singh Solanky
e42a38e825 Version Khoj API, Update frontends, tests and docs to reflect it
- Split router.py into v1.0, beta and frontend (no-prefix) api modules
  under new router package. Version tag in main.py via prefix
- Update frontends to use the versioned api endpoints
- Update tests to work with versioned api endpoints
- Update docs to mention, reference only versioned api endpoints
2022-09-28 20:08:38 +03:00
Robert Irelan
d25e1d8e86
fix: explicitly set url-request-method
In my installation, it appears that `url-request-method` is sometimes set
globally to POST.  Need to explicitly set it to ensure that GET is always
used as intended.
2022-09-19 15:46:46 -04:00
Debanjum Singh Solanky
ee65a4f2c7 Merge /reload, /regenerate into single /update API endpoint
- Pass force=true to /update API to force regenerating index from
scratch
- Otherwise calls to the /update API endpoint will result in an
incremental update to index
2022-09-16 00:53:19 +03:00
Debanjum Singh Solanky
02d944030f Use Base TextToJsonl class to standardize <text>_to_jsonl processors
- Start standardizing implementation of the `text_to_jsonl' processors
  - `text_to_jsonl' scripts already had a shared structure
  - This change starts to codify that implicit structure

- Benefits
  - Ease adding more `text_to_jsonl' processors
  - Allow merging shared functionality
  - Help with type hinting

- Drawbacks
  - Lower agility to change. But this was already an implicit issue as
    the text_to_jsonl processors got more deeply wired into the app
2022-09-16 00:53:11 +03:00
Debanjum Singh Solanky
c16ae9e344 Ignore "Legacy way to download model" warning for upstream dependency 2022-09-16 00:48:45 +03:00
Debanjum Singh Solanky
3169e3b78e Use ellipsis instead of pass in base filter abstract methods for aesthetic 2022-09-16 00:48:45 +03:00
Debanjum Singh Solanky
bf1ae038cb Get XMP metadata from image using Pillow. Remove ExifTool dependency
- Pillow already supports reading XMP metadata from Images
- Removes need to maintain my fork of unmaintained PyExiftool
  - This also removes dependency on system Exiftool package for
    XMP metadata extraction
- Add test to verify XMP metadata extracted from test images
- Remove references to Exiftool from Documentation
2022-09-16 00:48:45 +03:00
Debanjum Singh Solanky
8f57a62675 Remove unused imports. Fix typing and indentation
- Typing issues discovered using `mypy'. Fixed manually
- Unused imports discovered and fixed using `autoflake'
- Fix indentation in `org_to_jsonl' manually
2022-09-14 04:56:52 +03:00
Debanjum Singh Solanky
be57c711fd Revert OrgNode.hasTag func to method instead of property as accepts argument 2022-09-14 04:56:48 +03:00
Debanjum Singh Solanky
0109c7bd91 Disable ability to call <text>_to_jsonl, <type>_search packages directly
- This code is de-synced with expected args by above scripts
- Better to remove unused capability that needlessly increases
  maintenance burden
2022-09-14 04:56:48 +03:00
Debanjum Singh Solanky
1680a617da Reflect updates to query and results count in URL
- Simplify tracking khoj query history, saving/sharing links
- Do not execute search, when query only contains whitespaces
  - Prevents error when trying to process results of empty query
2022-09-13 23:39:24 +03:00
Debanjum Singh Solanky
34314e859a Call /reload instead of /regenerate API to update index from web interface
- As `/reload` updates index incrementally, it's relatively quick
- This makes exposing `/reload` endpoint a better default to expose
  via the web interface than the `/regenerate' endpoint
2022-09-12 23:39:10 +03:00
Debanjum Singh Solanky
13b5d5082f Create input field to set results count on the web interface
Resolves 
2022-09-12 23:24:46 +03:00
Debanjum Singh Solanky
1bfe9c4ef2 Handle filter only queries. Short-circuit and return filtered results
- For queries with only filters in them short-circuit and return
  filtered results. No need to run semantic search, re-ranking.
- Add client test for filter only query and quote query in client tests
2022-09-12 17:13:05 +03:00
Debanjum Singh Solanky
afc84de234 Make word filter regex explicit. Allow hyphen in word filters
Helps with 
2022-09-12 17:05:29 +03:00
Debanjum Singh Solanky
536f03af8f Process text content files in sorted order for stable indexing
- Image search already uses a sorted list of images to process
- Prevents index of entries to desync when entries, embeddings
  generated by a separate server/app instance
2022-09-12 11:09:40 +03:00
Debanjum Singh Solanky
a701ad08b9 Support multiple input-filters to configure content to index via khoj.yml
- Update existing code, tests to process input-filters as list
  instead of str
- Test `text_to_jsonl' get files methods to work with combination of
  `input-files' and `input-filters'

Resolves 
2022-09-12 11:08:59 +03:00
Debanjum Singh Solanky
940c8fac8c Use app LRU, not functools LRU decorator, to cache search results in router
- Provides more control to invalidate cache on update to entries, embeddings
- Allows logging when results are being returned from cache etc
- FastAPI, Swagger API docs look better as the `search' controller not
  wrapped in generically named function when using functools LRU decorator
2022-09-12 09:38:48 +03:00
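An application-level LRU cache of the kind this commit prefers over the `functools` decorator can be sketched as below. The `LRUCache` class and its method names are hypothetical; the point is that the app owns the cache and can clear it when entries or embeddings change.

```python
from collections import OrderedDict


class LRUCache:
    """Small least-recently-used cache that the app can inspect and invalidate."""

    def __init__(self, capacity: int = 128):
        self.capacity = capacity
        self._items = OrderedDict()

    def __contains__(self, key) -> bool:
        return key in self._items

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used item

    def clear(self) -> None:
        self._items.clear()


cache = LRUCache(capacity=2)
cache.put("query a", ["result 1"])
cache.put("query b", ["result 2"])
cache.get("query a")               # touch "query a" so "query b" becomes the oldest
cache.put("query c", ["result 3"])  # evicts "query b"
```

Calling `cache.clear()` after an index update is one way to avoid serving stale results.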
Debanjum Singh Solanky
c6fa09d8fc Fix querying with include word filter from web interface
- Not encoding the `query' string before querying the backend API with
  it was causing the "+" prefix for include word filter to be lost
2022-09-12 09:27:02 +03:00
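Why the "+" prefix gets lost can be shown with Python's standard `urllib.parse`. This sketch illustrates the general URL decoding behaviour (an unencoded "+" in a query string is read as a space); it is not the web interface code, and the `q` parameter name is an assumption.

```python
from urllib.parse import parse_qs, quote

query = '+"khoj" notes'

# Sent as-is: the server decodes the leading "+" as a space
unencoded = parse_qs("q=" + query)["q"][0]

# Percent-encoded first: "+" travels as %2B and survives decoding
encoded = parse_qs("q=" + quote(query))["q"][0]
```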
Debanjum Singh Solanky
1502fbc9e9 Add index_heading_entries flag to default and sample khoj configs 2022-09-11 17:33:37 +03:00
Debanjum Singh Solanky
7216cdff58 Add Date, Word filter for Org-Music content 2022-09-11 17:29:34 +03:00
Debanjum Singh Solanky
9d369ae4df Fix OrgNode render of entries with property drawers and empty body
- Issue
  - Indent regex was previously catching escape sequences like newlines
  - This was resulting in entries with only escape sequences in body to
    be prepended to property drawers etc during rendering
- Fix
  - Update indent regex to only look for spaces in each line
  - Only render body when body contains non-escape characters
  - Create test to prevent this regression from silently resurfacing
2022-09-11 16:09:19 +03:00
Debanjum Singh Solanky
253c9eae9a Set index_heading_entries field in config to index entries with no body
- Previously heading entries were not indexed to maintain search quality
- But given that there are use-cases for indexing entries with no body
- Add a configurable `index_heading_entries' field to index heading entries
- This `TextContentConfig' field is currently only used for OrgMode content
2022-09-11 16:09:19 +03:00
Debanjum Singh Solanky
1d3b3d5f39 Convert field get/set methods in OrgNode class to @property
- Use more descriptive variable names in OrgNode parser and class
- Convert OrgNode fields to private/protected, use property methods to
  get/set them
2022-09-11 14:59:28 +03:00
Debanjum Singh Solanky
db37e38df7 Create OrgNode hasBody method. Use it in org_to_jsonl checks 2022-09-11 12:50:03 +03:00
Debanjum Singh Solanky
b4878d76ea Extract entries from scratch when regenerate requested
- Do not rely on previously extracted entries to find new entries in
regenerate scenario
2022-09-11 12:50:03 +03:00
Debanjum Singh Solanky
52e3dd9835 Pass the whole TextContentConfig as argument to text_to_jsonl methods
- Let the specific text_to_jsonl method decide which of the
  TextContentConfig fields it needs to convert <text> type to jsonl
- This simplifies extending TextContentConfig for a specific type without
  modifying all text_to_jsonl methods
- It keeps the number of args being passed to the `text_to_jsonl'
  methods in check
2022-09-11 12:49:56 +03:00
Debanjum Singh Solanky
e951ba37ad Raise exception when org file not found
- No need to catch the IOError in OrgNode
2022-09-11 01:09:24 +03:00
Debanjum Singh Solanky
2e1bbe0cac Fix stripping empty escape sequences from strings
- Fix log message on jsonl write
2022-09-10 23:57:05 +03:00
Debanjum Singh Solanky
a7cf6c8458 Use dictionary instead of list to track entry to file maps 2022-09-10 23:08:30 +03:00
Debanjum Singh Solanky
3e1323971b Stack function calls in jsonl converters to avoid unneeded variables 2022-09-10 22:56:06 +03:00
Debanjum Singh Solanky
4eb84c7f51 Log performance metrics for beancount, markdown to jsonl conversion 2022-09-10 22:47:54 +03:00
Debanjum Singh Solanky
ebd5039bd1 Merge branch 'master' into support-incremental-updates-of-embeddings 2022-09-10 22:37:13 +03:00
Debanjum Singh Solanky
030fab9bb2 Support incremental update of Markdown entries, embeddings 2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky
91aac83c6a Support incremental update of Beancount transactions, embeddings 2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky
b01b4d7daa Extract logic to mark entries for embeddings update into helper function
- This could be re-used by other text_to_jsonl converters like
  markdown, beancount
2022-09-10 21:43:08 +03:00
Debanjum Singh Solanky
f97308bef2 Fix log message on writing JSONL data to file 2022-09-10 21:40:08 +03:00
Debanjum Singh Solanky
c17a0fd05b Do not store word filters index to file. Not necessary for now
- It's more of a hassle to not let word filter go stale on entry
  updates
- Generating index on 120K lines of notes takes 1s. Loading from file
  takes 0.2s. For less content load time difference will be even smaller
- Let go of startup time improvement for simplicity for now
2022-09-10 21:01:54 +03:00
Debanjum Singh Solanky
91d11ccb49 Only hash compiled entry to identify new/updated entries to update
- Comparing compiled entries is the appropriately narrow target to
  identify entries that need to encode their embedding vectors. Given we
  pass the compiled form of the entry to the model for encoding

- Hashing the whole entry along with its raw form was resulting in a
  bunch of entries being marked for update, as LINE: <entry_line_no>
  is a string added to each entry's raw format.

- This results in an update to a single entry resulting in all entries
  below it in the file being marked for update (as all their line
  numbers have changed)

- Log performance metrics for steps to convert org entries to jsonl
2022-09-10 21:01:44 +03:00
Debanjum Singh Solanky
b9a6e80629 Make OrgNode tags stable sorted to find new entries for incremental updates
- Having Tags as sets was returning them in a different order
  every time
- This resulted in spuriously identifying existing entries as new
  because their tags ordering changed
- Converting tags to list fixes the issue and identifies updated new
  entries for incremental update correctly
2022-09-10 20:59:52 +03:00
Debanjum Singh Solanky
2f7a6af56a Support incremental update of org-mode entries and embeddings
- What
  - Hash the entries and compare to find new/updated entries
  - Reuse embeddings encoded for existing entries
  - Only encode embeddings for updated or new entries
  - Merge the existing and new entries and embeddings to get the updated
    entries, embeddings

- Why
  - Given most note text entries are expected to be unchanged
    across time. Reusing their earlier encoded embeddings should
    significantly speed up embeddings updates
  - Previously we were regenerating embeddings for all entries,
    even if they had existed in previous runs
2022-09-10 20:58:33 +03:00
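The hash-and-reuse scheme listed under What can be sketched as follows. It is a simplified illustration: `entry_hash`, `incremental_update`, and the `fake_encode` stand-in for the embedding model are hypothetical, and embeddings are kept in a plain dictionary rather than a tensor.

```python
import hashlib


def entry_hash(compiled_entry: str) -> str:
    """Hash an entry so unchanged entries can be recognised across runs."""
    return hashlib.md5(compiled_entry.encode("utf-8")).hexdigest()


def incremental_update(previous: dict[str, list[float]], current_entries: list[str], encode) -> dict[str, list[float]]:
    """Return {hash: embedding} for current entries, encoding only unseen ones."""
    updated = {}
    for entry in current_entries:
        key = entry_hash(entry)
        # Reuse the earlier embedding when the entry is unchanged
        updated[key] = previous[key] if key in previous else encode(entry)
    return updated


encoded_log = []


def fake_encode(entry: str) -> list[float]:
    # Stand-in for the (expensive) model call; records what was encoded
    encoded_log.append(entry)
    return [float(len(entry))]


first = incremental_update({}, ["note one", "note two"], fake_encode)
second = incremental_update(first, ["note one", "note two edited", "note three"], fake_encode)
```

On the second run only the edited and the new entry reach the encoder; "note one" reuses its earlier embedding, and the deleted "note two" drops out of the result.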
Debanjum Singh Solanky
ec675d27d3 Suppress non-actionable HuggingFace FutureWarning shown on app start 2022-09-10 16:43:14 +03:00
Debanjum Singh Solanky
1ac6a71ff0 Add --version flag to show installed version of khoj 2022-09-10 16:40:19 +03:00
Debanjum Singh Solanky
976397bd82 Ignore empty #+TITLE, merge multiple #+TITLE for 0th level headings 2022-09-10 15:34:47 +03:00
Debanjum Singh Solanky
11917c6ddd Do not normalize absolute filenames for creating links in OrgNode 2022-09-10 15:34:31 +03:00
Debanjum Singh Solanky
07b98d35f1 Use filename or #+TITLE as heading for 0th level content in org files
- Set LINE, SOURCE link properties in property drawer correctly for
  content which falls under no heading
- See Issue  for more details
2022-09-10 15:34:31 +03:00
Debanjum Singh Solanky
d6bd7bf3e1 Fix initializing OrgNode level to string to parse org files
- Parsed `level` argument passed to OrgNode during init is expected to
  be a string, not an integer
- This was resulting in app failure only when parsing org files with
  no headings, like in issue , as level is set to string of `*`s
  the moment a heading is found in the current file
2022-09-10 14:21:08 +03:00
Debanjum Singh Solanky
d835467f2c Throw exception if no valid entries found in specified content files
- Previously we were failing if no valid entries while computing
  embeddings. This was obscuring the actual issue of no valid entries
  found in the specified content files
- Throwing an exception early with clear message when no entries found
  should clarify the issue to be fixed
- See issue  for details
2022-09-10 14:20:10 +03:00
Debanjum Singh Solanky
e00bb53336 Init word filter dictionary with default value as set to simplify code 2022-09-10 12:19:09 +03:00
Debanjum Singh Solanky
4d776d9c7a Bump khoj version to 0.1.9 2022-09-09 07:50:15 +03:00
Debanjum Singh Solanky
588f598949 Pass empty list of `input_files' to FileBrowser on first run
- Default config has `input_files' set to None
- This was being passed to `FileBrowser' on Initialization
- But `FileBrowser' expects `content_files' of list type, not None
- This resulted in an unexpected NoneType failure
2022-09-09 07:26:40 +03:00
Debanjum Singh Solanky
3ddffdfba4 Create config directory before setting up logging to file under it
- The logging to file code expects the config directory to already be setup
- But parent directory of config file was being set up later in code
- This resulted in app start failing with ~/.khoj dir does not exist error
2022-09-09 07:21:42 +03:00
Debanjum Singh Solanky
762607fc9f Log processed entries by org_to_jsonl only if verbosity > 2
Output too verbose for even debug mode logging. So gated behind -vvv
2022-09-06 23:03:29 +03:00
Debanjum Singh Solanky
490157cafa Setup File Filter for Markdown and Ledger content types
- Pass file associated with entries in markdown, beancount to json converters
- Add File, Word, Date Filters to Ledger, Markdown Types
  - Word, Date Filters were accidentally removed from the above types yesterday
  - File Filter is the only filter that newly got added
2022-09-06 15:31:26 +03:00
Debanjum Singh Solanky
94cf3e97f3 Log app logs to file for posthoc debugging and performance analysis 2022-09-06 14:51:48 +03:00
Debanjum Singh Solanky
3707a4cdd4 Improve date filter perf. Precompute date to entry map, Cache results
- Precompute date to entry map
- Cache results for faster recall
- Log performance timers in date filter
2022-09-05 18:21:29 +03:00
Debanjum Singh Solanky
31503e7afd Do not pass embeddings as argument to filter.apply method 2022-09-05 15:46:54 +03:00
Debanjum Singh Solanky
965bd052f1 Make search filters return entry ids satisfying filter
- Filter entries, embeddings by ids satisfying all filters in query
  func, after each filter has returned entry ids satisfying their
  individual acceptance criteria

- Previously each filter would return a filtered list of entries.
  Each filter would be applied on entries filtered by previous filters.
  This made the filtering order dependent

- Benefits
  - Filters can be applied independent of their order of execution
  - Precomputed indexes for each filter are not in danger of running
    into index out of bound errors, as filters run on original entries
    instead of on entries filtered by filters that have run before it
  - Extract entries satisfying filter only once instead of doing
    this for each filter

- Costs
  - Each filter has to process all entries even if previous filters
    may have already marked them as non-satisfactory
2022-09-05 15:21:40 +03:00
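The order-independent design described in this commit can be sketched as below: every filter inspects the original entries and returns a set of ids, and the query function intersects those sets. The filter factories and entry fields are hypothetical, chosen only for the example.

```python
def apply_filters(entries: list[dict], filters: list) -> list[dict]:
    """Intersect the id sets returned by each filter, then extract entries once."""
    included = set(range(len(entries)))
    for entry_filter in filters:
        included &= entry_filter(entries)  # each filter sees the original entries
    return [entries[i] for i in sorted(included)]


def word_filter(word: str):
    return lambda entries: {i for i, e in enumerate(entries) if word in e["text"].lower().split()}


def file_filter(name: str):
    return lambda entries: {i for i, e in enumerate(entries) if e["file"].endswith(name)}


entries = [
    {"text": "Buy milk", "file": "todo.org"},
    {"text": "Read khoj docs", "file": "notes.org"},
    {"text": "Write khoj plugin", "file": "todo.org"},
]

filters = [word_filter("khoj"), file_filter("todo.org")]
result = apply_filters(entries, filters)
result_reversed = apply_filters(entries, list(reversed(filters)))
```

Set intersection is commutative, so the result does not depend on the order in which filters run.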
Debanjum Singh Solanky
7dd20d764c Pre-compute file to entry map in file filter to mark ids to include faster 2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
2890b4cd44 Simplify extracting entries satisfying file filter 2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
7606724dbc Add file of each entry to entry dict in org_to_jsonl converter
- This will help filter query to org content type using file filter
- Do not explicitly specify items being extracted from json of each
  entry in text_search, as not all text search content types have
  file set in jsonl converters
2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
7e083d3e96 Cache results for file filters passed in query for faster filtering 2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
f634399f23 Convert simple file filters with no path separator into regex
- Specify just file name to get all notes associated with file at path
- E.g `query` with `file:"file1.org"` will return `entry1`
  if `entry1` is in `file1.org` at `~/notes/file1.org`

- Test
  - Test converting simple file name filter to regex for path match
  - Test file filter with space in file name
2022-09-05 15:21:40 +03:00
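Converting a bare file name into a path-matching regex might look like the sketch below. The exact regex used in this commit is not shown in the log, so `file_filter_to_regex` and its handling of filters that contain a path separator are assumptions.

```python
import re


def file_filter_to_regex(file_filter: str) -> re.Pattern:
    """Build a regex that matches full paths for a given file filter."""
    if "/" in file_filter:
        # Assumption: a filter with a path separator must match the whole path
        return re.compile(re.escape(file_filter) + r"$")
    # Bare file name: allow any directory prefix before the name
    return re.compile(r"(?:.*/)?" + re.escape(file_filter) + r"$")


pattern = file_filter_to_regex("file1.org")
matches_nested_path = bool(pattern.match("~/notes/file1.org"))
matches_other_file = bool(pattern.match("~/notes/other_file1.org"))

# File names with spaces are escaped, so they match literally
matches_name_with_space = bool(file_filter_to_regex("my notes.org").match("/home/user/my notes.org"))
```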
Debanjum Singh Solanky
092b9e329d Setup Filters when configuring Text Search for each Search Type
- Allows enabling different filters for different Text Search Types
- Use FileFilter in Text Search on Org Files
2022-09-05 15:21:40 +03:00
Debanjum Singh Solanky
1f9fd28b34 Create File Filter to filter files to query. Add tests for file filter 2022-09-05 01:09:20 +03:00
Debanjum Singh Solanky
e4418746f2 Create Abstract Base Class for Filters. Make Word, Date Filter Child of BaseFilter 2022-09-04 18:48:16 +03:00
Debanjum Singh Solanky
f930324350 Rename explicit filter to word filter to be more specific 2022-09-04 17:18:47 +03:00
Debanjum Singh Solanky
6087862521 Use LRU helper class for explicit filter cache 2022-09-04 16:42:28 +03:00
Debanjum Singh Solanky
8f3326c8d4 Create LRU helper class for caching 2022-09-04 16:31:46 +03:00
Debanjum Singh Solanky
191a656ed7 Use word to entry map, list comprehension to speed up explicit filter
- Code Changes
  - Use list comprehension and `torch.index_select' methods
    - to speed selection of entries, embedding tensors satisfying filter
    - avoid deep copy of entries, embeddings
    - avoid updating existing lists (of entries, embeddings)

  - Use word to entry map and set operations to mark entries satisfying
    inclusion, exclusion filters

- Results
  - Speed up explicit filtering by two orders of magnitude
  - Improve consistency of speed up across inclusion and exclusion filtering
2022-09-04 15:22:35 +03:00
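The word to entry map and the set operations mentioned above can be sketched in plain Python. This sketch leaves out the `torch.index_select' step for embeddings and uses hypothetical helper names; it only shows how inclusion and exclusion reduce to set intersection and difference.

```python
from collections import defaultdict


def build_word_to_entry_map(entries: list[str]) -> dict:
    """Map each lowercased word to the set of entry ids containing it."""
    word_map = defaultdict(set)
    for entry_id, entry in enumerate(entries):
        for word in entry.lower().split():
            word_map[word].add(entry_id)
    return word_map


def filter_ids(word_map: dict, entry_count: int, required: list[str], blocked: list[str]) -> set:
    """Return ids of entries containing all required and no blocked words."""
    included = set(range(entry_count))
    for word in required:
        included &= word_map.get(word, set())
    for word in blocked:
        included -= word_map.get(word, set())
    return included


entries = ["Emacs org mode notes", "Vim key bindings", "Emacs and Vim comparison"]
word_map = build_word_to_entry_map(entries)  # computed once, reused for every query
ids = filter_ids(word_map, len(entries), required=["emacs"], blocked=["vim"])
```

The map is built once, so each query costs a few set operations instead of a scan over every entry.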
Debanjum Singh Solanky
28d3dc1434 Deep copy entries, embeddings in filters. Defer till actual filtering
- Only the filter knows when entries, embeddings are to be manipulated.
  So move the responsibility to deep copy before manipulating entries,
  embeddings to the filters

- Create deep copy in filters. Avoids creating deep copy of entries,
  embeddings when filter results are being loaded from cache etc
2022-09-04 02:38:57 +03:00
Debanjum Singh Solanky
3308e68edf Cache explicitly filtered entries, embeddings by required, blocked words 2022-09-04 02:38:57 +03:00
Debanjum Singh Solanky
cdcee89ae5 Wrap words in quotes to trigger explicit filter from query
- Do not run the more expensive explicit filter until the word to be
  filtered is completed by user. This requires an end sequence marker
  to identify end of explicit word filter to trigger filtering

- Space isn't a good enough delimiter as the explicit filter could be
  at the end of the query in which case no space
2022-09-04 02:38:57 +03:00
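Using the closing quote as the end marker could be sketched with regexes as below. The patterns and the `extract_filters` helper are assumptions for illustration; the log does not show the regexes the project uses.

```python
import re

# Assumed syntax: +"word" requires a word, -"word" blocks it
REQUIRED_WORD = re.compile(r'\+"([\w-]+)"')
BLOCKED_WORD = re.compile(r'-"([\w-]+)"')


def extract_filters(query: str) -> tuple[list[str], list[str]]:
    """Return (required, blocked) words whose closing quote has been typed."""
    return REQUIRED_WORD.findall(query), BLOCKED_WORD.findall(query)


complete = extract_filters('notes on editors +"emacs" -"vim"')
still_typing = extract_filters('notes +"org-mo')   # no closing quote yet
hyphenated = extract_filters('+"org-mode"')
```

A filter that is still being typed has no closing quote, so it does not match and the expensive filtering is not triggered.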
Debanjum Singh Solanky
8d9f507df3 Load entries_by_word_set from file only once on first load of explicit filter 2022-09-04 00:37:37 +03:00
Debanjum Singh Solanky
858d86075b Use regexes to check if any explicit filters in query. Test can_filter 2022-09-03 23:47:28 +03:00
Debanjum Singh Solanky
546fad570d Use regex to extract include, exclude filter words from query 2022-09-03 23:41:43 +03:00
Debanjum Singh Solanky
ffb8e3988e Use Python Logging Framework to Time Performance of Explicit Filter 2022-09-03 22:24:10 +03:00
Debanjum Singh Solanky
c7de57b8ea Pre-compute entry word sets to improve explicit filter query performance 2022-09-03 16:16:31 +03:00
Debanjum Singh Solanky
094bd18e57 Use python standard logging framework for app logs
- Stop passing verbose flag around app methods
- Minor remap of verbosity levels to match python logging framework levels
  - verbose = 0 maps to logging.WARN
  - verbose = 1 maps to logging.INFO
  - verbose >=2 maps to logging.DEBUG
- Minor clean-up of app: unused modules, conversation file opening
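
The verbosity remap above could look roughly like this. The function names are hypothetical; only the three level mappings come from the commit message.

```python
import logging


def verbosity_to_log_level(verbose):
    """Map a count of verbose flags to a standard logging level."""
    if verbose <= 0:
        return logging.WARN
    if verbose == 1:
        return logging.INFO
    return logging.DEBUG


def configure_logging(verbose):
    # Set the level once at app start instead of passing verbose around
    logging.basicConfig(level=verbosity_to_log_level(verbose))
```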
2022-09-03 14:43:32 +03:00
Debanjum Singh Solanky
d0531c3064 Update URL QueryParam when Type set in Dropdown on Web Interface
- This also pushes the updated URL state to history
- Allows jumping back to the web interface after clicking on an image
  and having the type set to image search
- Previously type would get reset to the default search type on
  jumping back
2022-08-28 12:22:22 +03:00
Debanjum Singh Solanky
2eae32d743 Time, Log Image Search Performance 2022-08-28 00:28:46 +03:00
Debanjum Singh Solanky
c3ca99841b Scale down images to generate image embeddings faster, with less memory
- CLIP doesn't need full size images for generating embeddings with
  decent search results. The sentence transformers docs use images
  scaled to 640px width

- Benefits
  - Normalize image sizes
  - Increase image embeddings generation speed
  - Decrease memory usage while generating embeddings from images
2022-08-24 14:09:02 +03:00
Debanjum Singh Solanky
ea4fdd9134 Fix logic to ignore notes with no body. Add tests to prevent regression
- Notes with empty newlines in body were not being ignored
- Add regression tests to avoid above regression in org_to_jsonl conversion
2022-08-21 19:41:40 +03:00
Debanjum
144986ebfd Fix, Improve Desktop GUI Splash Screen and Main Window
- 5e6625a Fix file browser to not add empty line when no file/dir selected
- 8098b8c Bring main window to Top when open from System Tray
- 1c122a8 Place window near top so buttons are not hidden by OS bottom bar
- dfe2546 Set Khoj Icon on Main Desktop Window
- 1b1f8f9 Move Splash screen text below icon. Set the text color to black
- 450f644 Fix path to remove shared libraries when packaging the Windows app
2022-08-20 23:19:01 +00:00
Debanjum Singh Solanky
5e6625ac68 Fix file browser to not add empty line when no file/dir selected
- When no file selected in file browser an empty line/entry gets added
  to input entries list
- Bug got introduced due to insufficient update on change to add
  instead of insert
- Update is_none_or_empty helper method to also check for empty string
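
A sketch of what such a helper might look like, assuming it should treat `None`, empty collections and empty strings alike. `add_selected_paths` is a hypothetical caller, not a function from the Khoj code.

```python
def is_none_or_empty(item):
    """Return True for None, empty collections and empty strings."""
    if item is None:
        return True
    # Strings and collections both support len()
    if hasattr(item, "__len__"):
        return len(item) == 0
    return False


def add_selected_paths(existing, selected):
    # Skip the empty string that shows up when no file or directory is selected
    return existing + [path for path in selected if not is_none_or_empty(path)]
```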
2022-08-21 02:03:28 +03:00
Debanjum Singh Solanky
8098b8c3a8 Bring Configure Window to Top when Opened from System Tray
- Previously the window could get hidden behind other app windows when
  user clicked configure from the system tray
2022-08-20 23:38:43 +03:00
Debanjum Singh Solanky
1c122a8a91 Place window near top so buttons are not hidden by OS bottom bar 2022-08-20 22:38:06 +03:00
Debanjum Singh Solanky
dfe2546c04 Set Khoj Icon on Main Desktop Window 2022-08-20 20:36:15 +03:00
Debanjum Singh Solanky
82d2891765 Do not pass ML compute `device' around as argument to search funcs
- It is a non-user configurable, app state that is set on app start
- Reduce passing unneeded arguments around. Just set device where
  required by looking for ML compute device in global state
2022-08-20 14:44:53 +03:00
Debanjum Singh Solanky
acc9091260 Use MPS on Apple Mac M1 to GPU accelerate Encode, Query Performance
- Note: Support for MPS in Pytorch is currently in v1.13.0 nightly builds
  - Users will have to wait for PyTorch MPS support to land in stable builds
- Until then the code can be tweaked and tested to make use of the GPU
  acceleration on newer Macs
2022-08-20 14:44:06 +03:00
Debanjum Singh Solanky
7de9c58a1c Load models, corpus embeddings onto GPU device for text search, if available
- Pass device to load models onto from app state.
  - SentenceTransformer models accept device to load models onto during initialization
- Pass device to load corpus embeddings onto from app state
2022-08-20 14:04:18 +03:00
Debanjum Singh Solanky
dc8dcc94a6 Bump Khoj.el package version to 0.1.6 2022-08-19 20:48:42 +03:00
Debanjum Singh Solanky
ffbf15eff8 Add helper function to identify when app running as pyinstaller app
Useful when the app should behave differently when running as a pyinstaller
app with frozen python than it does in development scenarios
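
One common way to write such a helper is sketched below. It relies on the `sys.frozen` and `sys._MEIPASS` attributes that PyInstaller documents for bundled apps; the function name is an assumption.

```python
import sys


def is_pyinstaller_app():
    """Return True when running from a PyInstaller bundle."""
    # The PyInstaller bootloader sets sys.frozen and sys._MEIPASS at runtime
    return bool(getattr(sys, "frozen", False) and hasattr(sys, "_MEIPASS"))
```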
2022-08-19 19:17:54 +03:00
Debanjum Singh Solanky
6c5c1c33c1 Turn off Tokenizers Parallelism. Khoj doesn't support it right now
- Forking and multiprocess are problematic in frozen python
  scenarios. This will cause issues when running App packaged by
  pyinstaller
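
A minimal sketch of one way to turn this off, assuming the Hugging Face tokenizers library is what reads the `TOKENIZERS_PARALLELISM` environment variable. Where the app actually sets it is not stated in the commit.

```python
import os

# Must be set before the tokenizers library is first used
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```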
2022-08-19 19:17:54 +03:00
Debanjum Singh Solanky
d4072974d7 Use of XMP metadata in Khoj Image Search is broken. Disable by default
- CLIP Image score and XMP metadata score are not combining well.
  When combined they give nonsensical results. Enable only once
  we figure out how best to combine the two.

- Show scores with higher precision for image search
  - Image search scores seem to mostly be between 0.2 - 0.3 for some reason
  - Higher precision scores make it easier to understand the quality
    of returned results perceived by the model itself
2022-08-19 19:17:28 +03:00
Debanjum Singh Solanky
7c4417126c Append files, directories selected by user to config in Desktop GUI
- Allows adding multiple image directories via GUI
- Allow adding multiple files in different directories via GUI
- Previously users couldn't add multiple directories via GUI
  They'd have to manually append to input field if multiple files, directories
- To clear/overwrite is much easier.
  The user can just select text to delete in input area
2022-08-19 19:16:10 +03:00
Debanjum Singh Solanky
00ddcfdac8 Use .ico icon when packaging for Windows (and Linux) using PyInstaller 2022-08-19 19:16:10 +03:00
Debanjum Singh Solanky
60dacf3f2c Show splash screen on app start. Only supported on Windows, Linux 2022-08-19 19:16:10 +03:00
Debanjum Singh Solanky
0079c13bf7 Set input-directories in config for image search type on Desktop GUI
- Issue
  Fix configuring image search from Desktop GUI. It was broken before.
  The Desktop GUI was updating input-files field under content-type > image.
  This field is not used for image search. So image search couldn't be
  configured from the Desktop GUI

- Fix
  - Set input-directories when field of search type image is set from GUI
  - Otherwise set input-files field in config
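
The fix can be sketched as a small branch on the search type. `set_input_paths` and the plain dictionary config are hypothetical stand-ins; only the `input-directories` and `input-files` field names come from the commit message.

```python
def set_input_paths(content_config, search_type, paths):
    """Write paths chosen in the GUI to the field the search type reads."""
    # Image search reads directories; the other search types read files
    field = "input-directories" if search_type == "image" else "input-files"
    content_config.setdefault(search_type, {})[field] = list(paths)
    return content_config
```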
2022-08-18 18:29:55 +03:00
Debanjum Singh Solanky
c4fd661909 Move the experimental /chat API to under /beta/chat 2022-08-16 16:36:15 +03:00
Debanjum Singh Solanky
b8913476ba Fix if condition in router to trigger markdown search 2022-08-16 00:37:16 +03:00
Debanjum Singh Solanky
9bc4fd539e Set Web Interface URL from loaded state in Desktop GUIs. Not hard-coded 2022-08-16 00:37:16 +03:00
Debanjum Singh Solanky
7f479b0104 Improve Displaying Error to User on Khoj window in Desktop GUI
- Show a helpful error message in the GUI to the user, instead of
  crashing if loading config fails, e.g. if the file wasn't found
- Collate GUI errors into an ErrorType enum class
- Remove previous error messages before showing the new one
2022-08-16 00:37:16 +03:00
Debanjum Singh Solanky
873bb9dd97 Do not force the Khoj window to always be on top. It's needlessly annoying 2022-08-16 00:37:16 +03:00
Debanjum Singh Solanky
67ab40bb01 Regenerate embeddings every time user clicks configure in Desktop GUI
Previously if the embeddings were already there only the khoj.yml
config file would get updated. The embeddings would remain old.

1. This results in a stale app state where the config doesn't
   match the embeddings

2. Currently the user cannot update their config from the config
   screen. They'd have to use a combination of config screen and web
   interface>regenerate button to trigger it or delete their ~/.khoj dir

This commit should resolve the above issues
2022-08-16 00:37:16 +03:00
Debanjum Singh Solanky
2647e6bab4 Display re-ranked results triggered via keybinding in khoj.el
- Prevent immediate overwrite of re-ranked results by
  incremental-search without rerank triggered via post-command-hook.

- This triggers right after the reranking results are rendered, so
  user never ends up seeing them
2022-08-15 18:41:12 +03:00
Debanjum Singh Solanky
a91d2df300 Simplify Emacs interface to only rerank results on explicit command 2022-08-15 06:20:13 +03:00
Debanjum Singh Solanky
e846829a2e Reset Khoj.el version to align with Khoj package version 2022-08-15 06:20:13 +03:00
Debanjum Singh Solanky
fed0b591af Package Khoj as Debian app in Github Release Workflow 2022-08-14 05:07:58 +03:00
Debanjum Singh Solanky
541e03da3d Make khoj.el pass checkdoc, package-lint, flycheck checks
- Add docstrings, mention args in them. Make docstring crisper
- prefix funcs, variables with khoj--
- Require emacs >27.1 for json-parse-buffer
- Use lexical binding
- Add quickstart docs to elisp file itself
- Bump version of khoj.el
2022-08-13 21:37:41 +03:00
Debanjum Singh Solanky
3300378804 Minimal formatting to render beancount results legibly on web interface 2022-08-13 05:03:45 +03:00
Debanjum Singh Solanky
a0759dd923 Convert Configure Screen into the Main Application Window
- What
  - Convert the config screen into the main application window
    with configuration as just one of the functionality it provides
  - Rename config screen to main window to match new designation

- Why
  - System Tray isn't available everywhere (e.g Linux)
  - This requires moving functionality into a normal window for cross-compat
2022-08-13 02:05:52 +03:00
Debanjum Singh Solanky
684f497abe Handle no System Tray on Linux (Gnome)
- What
  - On Linux
    - Show Configure Screen, even if not first run experience
    - Do not show system tray on Linux
    - Quit app on closing Configure Screen
  - On Windows, Mac
    - Show Configure screen only if first run experience
    - Show system tray always
    - Do not quit app on closing Configure Screen

- Why
  - Configure screen is the only GUI element on Linux. So closing it
    should close the application
  - On Windows, Mac the system tray exists, so app should not be closed
    on closing configure screen
2022-08-13 01:00:20 +03:00
Debanjum Singh Solanky
c2815c5d09 Open Search from Khoj Configure Screen
- Start evolving configure screen away from just being a configure screen
  - Update Window Title to just say Khoj
- Allow Opening Web Interface to Search from Khoj configure screen
- Rename "Start" Button to more accurate "Configure"
- Disable Search button on first run and while configuring app
2022-08-13 00:43:49 +03:00
Debanjum Singh Solanky
28a91ad1fd Deep copy the default_config constant to prevent it being overwritten
- Issue
  - In the previous form, updates to self.current_config would update
    default_config as python does a shallow copy
  - So self.current_config is just referencing the values of default_config
  - Hence updates to current_config updates the default_config values too
  - This is not what we want

- Fix
  - Deep copy the default_config values. Now updates to
    self.current_config wouldn't affect the default_config
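
The issue and fix can be reproduced with a trimmed-down, hypothetical config dictionary. The keys below are placeholders, not the real default config.

```python
from copy import deepcopy

# Hypothetical, trimmed-down default config
default_config = {"content-type": {"org": {"input-files": None}}}

# Shallow copy: the nested dictionaries are still shared with default_config
shallow = dict(default_config)
shallow["content-type"]["org"]["input-files"] = ["shallow.org"]
leaked = default_config["content-type"]["org"]["input-files"]

# Reset, then deep copy: nested dictionaries are duplicated, nothing is shared
default_config["content-type"]["org"]["input-files"] = None
current_config = deepcopy(default_config)
current_config["content-type"]["org"]["input-files"] = ["notes.org"]
```

After the shallow copy, `leaked` holds the value written through `shallow`. After the deep copy, `default_config` keeps its original value.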
2022-08-12 23:54:16 +03:00
Debanjum Singh Solanky
62ac41ce3b Reload settings in a separate thread to not freeze Config Screen
- Generating embeddings takes time
- If user enables a content type and clicks start.
  The app starts to generate embeddings when loading the new settings
- Run this function in a separate thread to keep config screen responsive
- But disable start button to prevent re-entrant threads
- Also show a minimal visual indication that the app is saving state
2022-08-12 23:34:00 +03:00
Debanjum Singh Solanky
927547d0af Update Title of Configure Screen to follow "<Screen> - App" pattern 2022-08-12 22:53:10 +03:00
Debanjum Singh Solanky
32ac1ea1b6 Allow user to quit application from the terminal via SIGINT
Call python interpreter at regular interval to handle any interrupt
signals. Create custom handler to terminate server and application
2022-08-12 21:11:58 +03:00
Debanjum Singh Solanky
43301d488a Increase Width of Configure Screen 2022-08-12 18:34:47 +03:00
Debanjum Singh Solanky
9baea9c9fd Let Input Fields Wrap. Adjust Height based on Text in Field
- Convert Input Fields into PlainTextEdit
- Display Each Selected File on a Separate Line in Field
- Set Height of FileBrowser Input Field based on Number of Lines/Files
2022-08-12 18:33:56 +03:00
Debanjum Singh Solanky
b7b96110e9 Rename FileBrowser Button Text to "Select" instead of "Add" 2022-08-12 17:08:40 +03:00
Debanjum Singh Solanky
a1c58a9470 Create, Use a Labelled Text Field for the Conversation Input Field
- This fixes the field expanding when configure screen is expanded
- Allows for reusability of the labelled text field
- Simplifies the logic to save settings for conversation processor
2022-08-12 16:59:15 +03:00
Debanjum Singh Solanky
fa7e36cada Rename external *.js files to *.min.js to mark them as vendored
- Excludes from Github language stats.
  See linguists/vendor.yml for exclusion rules
- Signifies them as external for Khoj developers too
2022-08-12 04:08:50 +03:00
Debanjum Singh Solanky
110e3df0b7 Set default config in the constant module. Use from there to configure app
- Avoid having to pass the khoj_sample.yml data file into pip, native apps
- Packaging data files into python packages is annoying.
  - There's `MANIFEST.in`, `data_files` and `package_data` in setup.py
  - Bdist, wheel, generated source tarball use different set of these fields
    and put the data files in different locations
  - Rather just code the default config into a constant. Avoid
    pointless file reads as well this way
2022-08-12 02:18:46 +03:00
Debanjum Singh Solanky
fad2f3a2e7 Resolve config_file to absolute right at start on parsing args in cli
- Assume path is absolute in yaml util module while saving, loading file
  - This follows same convention as jsonl. Which just operates on
    passed file path, assuming it is of appropriate form.
    Responsibility to put it in appropriate form is on the caller, for now
2022-08-12 01:34:08 +03:00
Debanjum Singh Solanky
44fe70513a Handle situation where default config directory or file does not exist
- Include khoj_sample.yml in pip package to load default config from
- Create khoj config directory if it doesn't exist
- Load config from khoj_sample.yml if khoj.yml config doesn't exist
2022-08-12 01:17:34 +03:00
Debanjum Singh Solanky
41520e1608 Improve Docstring for Configure Screen and System Tray class, funcs 2022-08-11 23:36:02 +03:00
Debanjum Singh Solanky
a748acfeeb Merge branch 'master' of github.com:debanjum/khoj into create-native-gui
Conflicts:
- src/main.py
  - router functions have moved to router
  - move logic to handle null query perf timer variables into
    router.py
  - set main.py to current branch, not master
2022-08-11 21:09:42 +03:00
Debanjum Singh Solanky
6af2d6bb6d Add Flag to Start App without Native GUI 2022-08-11 20:59:57 +03:00
Debanjum Singh Solanky
b74ca1def6 Wrap error message instead of expanding screen to show message 2022-08-11 20:51:56 +03:00
Debanjum Singh Solanky
2646fa825b Get Files from File input line to match user expectation
- If a user manually edits the input file lines, clicking start should
  use that. Currently it just looks at the files selected last via file
  browser
- We want to allow users to manually enter file paths in field. Which
  is why the field hasn't been set to read-only
2022-08-11 20:48:45 +03:00
Debanjum Singh Solanky
dad9133598 Split save_settings method into smaller methods for modularization 2022-08-11 20:00:52 +03:00
Debanjum Singh Solanky
56ba91fec8 Remove unused methods in file browser widget. Improve name of existing 2022-08-11 19:46:09 +03:00
Debanjum Singh Solanky
fd4e41495c Use appropriate label for directory input types to minimize confusion 2022-08-11 19:45:19 +03:00
Debanjum Singh Solanky
c1e1466fb1 Validate new config before write. Show error if new config invalid 2022-08-11 19:18:22 +03:00
Debanjum Singh Solanky
1ff049599f Show current config on config screen. Load default config if config unset
- Track current (saved/loaded) config separate from the new config (to
  be written) when user clicks Start

- Fallback to using default config when no config for the specific
  content type or processor is specified in khoj.yml
  - Earlier were only loading default config on first run, not after

- Create Child CheckBox, LineEdit classes for Processor Widgets
  - Create ProcessorType, similar to SearchType
  - Track ProcessorType the widgets are associated with
  - Simplify update, save, load of config based on type
2022-08-11 19:11:25 +03:00
Debanjum Singh Solanky
23e06f483d Do not emit type tags when dumping config YAML to file 2022-08-11 19:08:36 +03:00
Debanjum Singh Solanky
678fb6a3c7 Add Settings Panel for Conversation Settings to Config Screen 2022-08-11 04:52:40 +03:00
Debanjum Singh Solanky
c1fcf44405 Initialize Settings on Config Screen with Existing Settings from File 2022-08-11 04:51:33 +03:00
Debanjum Singh Solanky
3cec6229ad Hot swap backend config via config screen start button click
- Update configuration to use by the backend, while app is running
- Trigger after user hits start button with their config.
  The config gets written to khoj.yml file first, then the updated
  config is loaded onto memory
2022-08-11 00:32:11 +03:00
Debanjum Singh Solanky
f7fdf8d8ce Refactor app start to start server even if backend not configured
- Decouple configuring backend from starting server.
  Backend search and processors can be configured after the backend
  server has started

- Set global state in main instead of in configure_server method.
  This allows the app to start even if configure_server exits early in
  the first run scenario, where no config available to configure server

- Now start server, even if no config, before GUI started in main

- This refactor of app startup flow will allow users to configure
  backend using the configure screen after server start
2022-08-11 00:13:14 +03:00
Debanjum Singh Solanky
34018c7d4b Store args passed from commandline at app start in global app state 2022-08-11 00:11:35 +03:00
Debanjum Singh Solanky
cc6ef0f450 Save configure screen settings to app config yaml on clicking Start 2022-08-10 23:10:39 +03:00
Debanjum Singh Solanky
dae65c5b6b Create child class of Qt CheckBox to track search type it enables/disables 2022-08-10 22:44:37 +03:00
Debanjum Singh Solanky
f42f54019b Type parent_layout passed as arguments to ConfigureScreen methods 2022-08-10 22:43:20 +03:00
Debanjum Singh Solanky
f63f11186f Pass config file for app to configure screen 2022-08-10 22:42:32 +03:00
Debanjum Singh Solanky
82a7059b6a Only setup conversation processor if it has configuration set 2022-08-10 22:34:03 +03:00
Debanjum Singh Solanky
9628ca073c Extract conversation processor from config into separate function
- Only pass processor config arg required by configure_processor. Not
  the unused full config object
- Type arguments passed to methods configure processors
- Import json for use by conversation processor to load logs
2022-08-10 22:33:33 +03:00
Debanjum Singh Solanky
62eb66b8ca Rename load_config_from_file to more descriptive parse_config_from_file 2022-08-10 22:28:51 +03:00
Debanjum Singh Solanky
328cc00439 Create global constant to store app root directory 2022-08-10 20:09:03 +03:00
Debanjum Singh Solanky
d2c7b28172 Extract code to load config from YAML file into new utils.yaml module 2022-08-10 20:07:44 +03:00
Debanjum Singh Solanky
150ae19660 Indent Timestamps, Drawers at Body Level in OrgNode Entry Representation 2022-08-10 18:55:37 +03:00
Debanjum Singh Solanky
fd31d339c1 Remove spurious space in Entries without Todo in OrgNode Entry Repr 2022-08-10 13:48:44 +03:00
Debanjum Singh Solanky
eddf88f818 Org buffer customization settings to tail of khoj.el results buffer
- Results get priority screen real estate
- Allows quick speed key based traversal of results as cursor
  on switching to buffer is at top level heading
  - E.g C-x o n n o 2 jumps to entry in actual file of second result
  - Unlike before when it is at the #+STARTUP org buffer customization
    settings
2022-08-10 12:57:37 +03:00
Debanjum Singh Solanky
daef276fd1 Add files for each search type. Extract config on clicking start
- Only allow adding files with appropriate file extension for each search type
  - e.g .org for org-mode search, directory for image search

- Extract file paths added to config and enablement state of each search type
  - This extracted state will be used to populate the khoj.yml config file
2022-08-10 03:27:22 +03:00
Debanjum Singh Solanky
d74134e6cc Reuse Single Method to Create Setting Panels for each Search Type 2022-08-09 23:50:43 +03:00
Debanjum Singh Solanky
509d52e2cd Toggle Editability instead of Visibility of Per Search Type Settings
- Simplifies the configure screen layout and allows it to be of constant width

- It was buggy, the configure screen would dynamically expand but not
  restore back to original size on disabling search type after enable
2022-08-09 23:34:54 +03:00
Debanjum Singh Solanky
3c788f1d29 Rename configure window to more generic configure screen 2022-08-09 22:44:05 +03:00
Debanjum Singh Solanky
c50ab7c3ad Split config settings GUI into functions. Convert Config Window to Dialog 2022-08-09 22:36:41 +03:00
Debanjum Singh Solanky
664713b24e Extract Qt GUI code from main.py into separate interface/desktop dir 2022-08-09 22:12:29 +03:00
Debanjum Singh Solanky
84c1fc701d Fix query timing variables from being referenced before assignment 2022-08-09 21:06:37 +03:00
Debanjum Singh Solanky
57026b802c Set size of rendered images using user customizable vars 2022-08-09 21:06:37 +03:00
Debanjum Singh Solanky
0a758c9f0f By default, wait for 2 seconds before initiating rerank in khoj.el
- Subjectively, previous default seems too aggressive based on usage
  Doesn't give time for user to think and type their query
2022-08-09 21:06:30 +03:00
Debanjum Singh Solanky
f01fb16ebb Use single hyphen in name of user configurable variables in khoj.el
- Follow convention, two hyphens indicate variable private to library
- Defcustom are user configurable variables. So they should have single -
- Use khoj-results-count variable directly in code
2022-08-09 20:49:34 +03:00
Debanjum Singh Solanky
cd59982c9c Add Qt Button to save Khoj configuration in Khoj Configuration Window 2022-08-09 20:42:44 +03:00
Debanjum Singh Solanky
2c77caf06c Group ledger, org setting widgets into child Qt widgets of config window 2022-08-09 20:42:44 +03:00
Debanjum Singh Solanky
027da719aa Open Configure Window on First Run or from System Tray
- Trigger FRE if no config loaded. Open Configure Window automatically
- Else user can manually open config window from App on System Tray
2022-08-09 17:05:27 +03:00
Debanjum Singh Solanky
a588a8e21f Make config_file an optional argument. It can be generated on FRE
- Make config_file an optional arg. It defaults to default khoj config dir
- Return args.config as None if no config_file explicitly passed by user
- Parent can use args.config = None as signal to trigger first run experience
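
A sketch of this argument handling with `argparse`. The `--config-file` flag name, the default path and the `parse_cli` function are assumptions; the commit only states that `args.config` is `None` when no config file is passed.

```python
import argparse
from pathlib import Path

DEFAULT_CONFIG_FILE = Path("~/.khoj/khoj.yml")  # assumed default location


def parse_cli(argv):
    parser = argparse.ArgumentParser(description="Sketch of an optional config file")
    # Hypothetical flag name; default=None records that nothing was passed
    parser.add_argument("--config-file", "-c", type=Path, default=None)
    args = parser.parse_args(argv)
    if args.config_file is None:
        # Nothing passed by the user: point at the default path, signal first run
        args.config_file = DEFAULT_CONFIG_FILE
        args.config = None
    else:
        args.config_file = args.config_file.resolve()
        args.config = args.config_file  # stand-in for the parsed config
    return args
```

The caller can then check `args.config is None` to decide whether to open the first run experience.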
2022-08-09 17:02:02 +03:00
Debanjum Singh Solanky
21af122447 Clean up unused methods, module imports. Add comments 2022-08-09 16:59:38 +03:00
Debanjum Singh Solanky
80fa9fde6a Quit GUI via SysTray instead of sys.exit to cleanly terminate server 2022-08-08 23:49:26 +03:00
Debanjum Singh Solanky
e5691f9d1d PyInstaller Spec to Wrap Khoj into a Basic Native App
- Verified functionality on MacOS

- Add ICNS Icon to use as MacOS App Icon
- Spec generated by PyInstaller:
  ```sh
  pyinstaller \
       src/main.py \
       --windowed \
       --onefile \
       --name "Khoj" \
       --target-arch arm64 \
       -i src/interface/web/assets/icons/favicon.icns \
       --add-data "src/interface/web:src/interface/web" \
       --copy-metadata tqdm \
       --copy-metadata regex \
       --copy-metadata requests \
       --copy-metadata packaging \
       --copy-metadata filelock \
       --copy-metadata numpy \
       --copy-metadata tokenizers
  ```
2022-08-08 23:23:02 +03:00
Debanjum Singh Solanky
ef009323e7 Use sys.exit to quit via system tray. Fix pip install cmd in Readme 2022-08-08 21:42:36 +03:00
Debanjum Singh Solanky
eacd95bebd Start Creating Native Configure Page using PyQt 2022-08-08 18:31:47 +03:00
Debanjum Singh Solanky
dddc57e132 Rename get-enabled-search-types to get-enabled-content-types as more appropriate 2022-08-07 18:53:14 +03:00
Debanjum Singh Solanky
127c6e78df Only show keybindings for enabled search types in simple info menu too
Convert the khoj--keybindings-info-message into a func
Dynamically generate info menu
Show keybindings for enabled search types only
2022-08-07 18:40:35 +03:00
Debanjum Singh Solanky
d08c25b62b Make default search type used in the Emacs interface configurable 2022-08-07 18:24:53 +03:00
Debanjum Singh Solanky
5a10c47499 Allow setting music as search type in khoj.el. Had forgotten to include it earlier 2022-08-07 18:24:53 +03:00
Debanjum Singh Solanky
ebee716026 Only show keybindings reference for enabled search types in khoj.el 2022-08-07 18:24:53 +03:00
Debanjum Singh Solanky
6dc9801f45 Get Khoj search-types enabled by user in Emacs 2022-08-07 18:24:53 +03:00
Debanjum Singh Solanky
f3c1512c38 Fix to let user start entering query right after initiating khoj on emacs
- Fix regression since moving to use `which-key-show-full-keymap`
- The above function reads user keypress, so eats up 1 keypress
  before starting to enter query
- No way to pass no-paging config via the external function to the
  internally used which-key--show-keymap function that does allow
  setting no-paging to not read user keypress
- So use the internal function instead and set no-paging arg to t
2022-08-07 15:57:08 +03:00
Debanjum Singh Solanky
e95686c89c Show complete Khoj keybindings when initiate search in Emacs
- The keybindings to select search types was previously confusing as
  it only highlighted the final symbol to press (the C-x was shown but
  it wasn't made apparent that it had to be pressed before)

- Previously some keybindings unrelated to khoj were also being shown
  in the which-key popup. Now only the khoj keybindings are visible
2022-08-06 16:36:57 +03:00
Debanjum Singh Solanky
4696eadc02 Fix definition of khoj--search-<content-type> functions in khoj.el 2022-08-06 15:19:01 +03:00
Debanjum Singh Solanky
c5bf051a29 Rename initialize_{search,processor,server} to configure_{search,processor,server}
- Search is being reconfigured multiple times in /regenerate and
  /reload. More appropriate name is configure_ rather than initialize_
  for it
- Standardize name of methods under configure.py
2022-08-06 03:23:02 +03:00
Debanjum Singh Solanky
7b04978f52 Put global state variables into separate state module
- Variables storing app, device state aren't constants.
  Do not mix with actual constants like empty_escape_sequence, web_directory
2022-08-06 03:13:18 +03:00
Debanjum Singh Solanky
b04c84721b Extract configure and routers from main.py into separate modules
- Main.py was becoming too big to manage. It had both
  controllers/routers and component configurations (search, processors)
  in it

- Now that the native app GUI code is also getting added to the main
  path, good time to split/modularize/clean main.py

- Put global state into a separate file to share across modules
2022-08-06 02:39:18 +03:00
Debanjum Singh Solanky
083fefdd07 Create Native Menu Bar with PyQt to open Search, Config webpages
- Run FastAPI server in a separate thread.
  - This allows starting both the server and gui in parallel

- Create System Tray for Khoj
  - Contains menu items that open search or config pages in browser

- Rearrange code to have only the code required to start Backend and
  GUI in the run() method
  - Move the backend setup code into a separate method
2022-08-06 01:00:25 +03:00
Debanjum Singh Solanky
9fa3345000 Show available Khoj keybindings to customize search using which-key
Fallback to showing simple khoj keybindings info message in echo area
when which-key not available
2022-08-05 20:24:29 +03:00
Debanjum Singh Solanky
6a8b2a6936 Do not run incremental search when query is empty 2022-08-05 19:35:42 +03:00
Debanjum Singh Solanky
609cd6e8bb Show keybindings to set khoj search type in echo area to assist user 2022-08-05 19:35:42 +03:00
Debanjum Singh Solanky
48e4a983c5 Allow switching search type in the middle of querying Khoj on Emacs
- More generally, this allows configuring the khoj search anytime
  while in khoj minibuffer window
- Earlier could only configure search type at the start of the search
2022-08-05 19:35:42 +03:00
Debanjum Singh Solanky
48c33b93cc Generalize khoj keymap to func that can update existing keybindings 2022-08-05 19:35:42 +03:00
Debanjum Singh Solanky
19c4701f3f Default to ledger search from files with .beancount extensions 2022-08-05 19:35:42 +03:00
Debanjum Singh Solanky
cc9a395e0a Keep name of buffer for Khoj results in a variable 2022-08-05 19:35:42 +03:00
Debanjum Singh Solanky
0a5c6d067a Do not prompt user to set search type before querying Khoj via Emacs
- What
  - Default to last used search type, when no search type specified
  - Allow user to change search type before they enter query (and
    after they've called khoj), if they want

- Why
  - Reduce time from intent to results by using reasonable defaults
  - Make interactions smoother, more intuitive
2022-08-05 19:35:38 +03:00
Debanjum Singh Solanky
24ccba74d4 Put type dropdown, regenerate button on same row. Regain screen space 2022-08-05 06:17:43 +03:00
Debanjum Singh Solanky
017e287b8a Remove redundant query as title in results section
- Regain screen real-estate
- Remove unused parameters, html being returned by org.js
2022-08-05 06:17:25 +03:00
Debanjum Singh Solanky
06afeec7e2 Hide stars of org entry results on Emacs to reduce visual clutter
They've all been normalized to the same level and hence don't hold much
data. So this is a good opportunity to reduce non-useful visual clutter
2022-08-05 05:27:57 +03:00
Saba
d1fe6353b5 Check whether processor_config exists during shutdown event 2022-08-04 21:57:36 -04:00
Debanjum Singh Solanky
4d4d2ff921 Ensure all org entries are unfolded in results buffer on Emacs 2022-08-05 04:54:29 +03:00
Debanjum Singh Solanky
49ef741d4b Prevent Zoom on Input in Web Interface. Document Pip upgrade in Readme
- Name /Reload API Controller Reload
2022-08-05 03:51:34 +03:00
Debanjum Singh Solanky
675e821d95 Make embeddings, jsonl paths absolute. Create directories if non-existent 2022-08-05 02:57:59 +03:00
Debanjum Singh Solanky
d5b43eb836 Use input filter in image search setup. Input filter wasn't used earlier 2022-08-05 02:40:03 +03:00
Debanjum Singh Solanky
ca5a8bd113 Make config file a positional argument, as it is required
- Test invalid config file path throws. Remove redundant cli test

- Simplify cli parser code
  - Do not need to explicitly check if args.config_file set.
    argparser checks for positional arguments automatically

- Use standard semantics for cli args
  - All positional args are required. Non positional args are optional

- Improve command line --help description
2022-08-05 01:09:40 +03:00
Debanjum Singh Solanky
1374065092 Mark all required fields for config. Throw if no input_* field specified
- Add custom validator to throw if neither input_filter or
  input_<files|directories> are specified

- Set field expecting paths to type Path

- Now that default_config isn't used in code. We can update
  fields in rawconfig to specify whether they're required or not.
  This lets pydantic validate config file and throw appropriate error
2022-08-05 01:08:48 +03:00
Debanjum Singh Solanky
f78d6ae754 Create khoj_sample file with all configurable fields in one place
- Reason
  - Simplifies code. No merge_dict required
  - 1 place for user to see all configurables, defaults and required values

- Details
  - Remove default_config from code. Set defaults in khoj_sample.yml itself
  - Keep fields required to be set by user as empty in khoj_sample to YAML
  - Set defaults for fields not requiring configuration by user
2022-08-05 01:08:33 +03:00
Debanjum Singh Solanky
3abf3e5ee0 Update merge_dicts to recursively merge the dictionaries
Previously it was only merging dictionary at the first/top level
2022-08-04 22:46:20 +03:00
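A possible shape for the recursive merge described in 3abf3e5ee0. The signature and the rule that values in `priority` win are assumptions; the commit only states that nested dictionaries are now merged too.

```python
from copy import deepcopy


def merge_dicts(priority: dict, default: dict) -> dict:
    """Recursively merge two dicts; values in `priority` win on conflict."""
    merged = deepcopy(priority)
    for key, default_value in default.items():
        if key not in merged:
            # Key only exists in the defaults, so copy it over
            merged[key] = deepcopy(default_value)
        elif isinstance(merged[key], dict) and isinstance(default_value, dict):
            # Both sides are dicts: descend instead of stopping at the top level
            merged[key] = merge_dicts(merged[key], default_value)
    return merged
```

A merge that only handles the top level would keep `priority["a"]` whole and drop any default keys nested under `"a"`.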
Debanjum Singh Solanky
61c26ba611 Only show large Khoj favicon on web interface
- Do not want browsers to use the small, grainy favicons
- Firefox for Android does use the bigger icon, when it's the only one available
- Update svg to match the 144x144 ratio just for consistency
2022-08-04 14:33:29 +03:00
Debanjum Singh Solanky
1649fa644c Autofocus on Query field in Web Interface. Improve time to query 2022-08-04 05:23:19 +03:00
Debanjum Singh Solanky
71fcb1087f Add icons for web interface to render on more browsers and as PWA
Safari, Firefox for Android etc don't support SVG Favicons yet
2022-08-03 18:52:41 +03:00
Debanjum Singh Solanky
5b6b7ec123 Delete khoj network connections on incremental search teardown on Emacs interface
Currently only get into this state when debug breakpoints on backend
are keeping the connection open and user exits khoj search from Emacs
Results in a number of open connections that slow khoj down.
2022-08-03 18:52:41 +03:00
Debanjum Singh Solanky
555c1088cc Cache queries in /search controller using LRU cache
- Most concretely right now,
  it eliminates the re-rank latency hit
  on re-rank triggered on user hitting enter
  after re-rank is already done on user idle
  in the emacs interface

- Improves search latency of (incremental) search
2022-08-03 18:52:41 +03:00
Debanjum Singh Solanky
38df727ef4 Fix escape sequence usage in strings. Remove unneeded import of os
Rename /config API method to config to match its purpose. UI is
anyway too generic, and not what it is doing
2022-08-03 18:51:55 +03:00
Debanjum Singh Solanky
f642450ed9 Disable Incremental Search for Images on Web
Bug introduced in commit da118b3fed
2022-08-03 11:52:51 +03:00
Debanjum Singh Solanky
b9e6273644 Include interfaces in pip package. Fix paths to web interface in app 2022-08-03 00:02:39 +03:00
Debanjum Singh Solanky
1b55462fb0 Convert search_filter, conversation dir to proper modules
Add __init__.py files to their directories
2022-08-02 20:23:42 +03:00
Debanjum Singh Solanky
5108d45951 Wrap application startup steps into a method 2022-08-02 20:13:14 +03:00
Debanjum Singh Solanky
0ebfbb43ce Nest org, md results at level 2 on Emacs interface. Improve readability
- Makes it easier to fold/unfold, traverse and read results
- This 2 level nesting is already being used on the web interface

- Previously we were using the original nesting depth of the entry.
  This was aimed at providing more of the original context of the
  results. But currently this additional information does not provide
  enough value to justify the decreased legibility of the results
2022-08-01 04:01:18 +03:00
Debanjum Singh Solanky
1201bfddf3 Simplify name of config css from config-style.css to config.css 2022-08-01 01:34:00 +03:00
Debanjum Singh Solanky
075dba5d64 Use Khoj Title, Favicon in Config Page for Consistency 2022-08-01 01:27:14 +03:00
Debanjum Singh Solanky
56a4429f01 Move web interface to configure application into src/interface/web directory
- Improve code layout by ensuring all web interface specific code
  under the src/interface/web directory
- Rename config API to more specific /config instead of /ui
- Rename config data GET, POST api to /config/data instead of /config
2022-08-01 00:53:42 +03:00
Debanjum
bb2ccec1ca
Populate type dropdown on the web interface with only enabled search types
- Previously we were statically populating types dropdown field in the web interface with all available search types
- This change populates the type dropdown field with only search types that are enabled/configured
- It queries the `/config` backend API to see which of the available search types are configured
2022-08-01 00:20:45 +03:00
Debanjum Singh Solanky
8b6058c879 Fix instantiating type field with value from URL query parameter
- Populate via `.then` after enabled search types in dropdown are
  populated
- Call to `/config` API is async and will usually complete after the value of type field is set from url
- So value of type field would earlier be overridden when search types
  dropdown is populated after the call to `/config` API completes
2022-08-01 00:04:50 +03:00
Debanjum Singh Solanky
be253bab39 Populate type dropdown with only enabled search types in web interface
- Get /config API and check config for which of the available search
  types are populated. This gives us the list of enabled search types
- Dynamically populate search type field with enabled search types only
2022-07-31 23:42:00 +03:00
Debanjum Singh Solanky
0abd40aeb7 Only set query field when appropriate query param passed via URL
- Setting query value to default option when query param wasn't
  passed via URL was overriding placeholder text in query field

- We wanted placeholder text in field, not the query field to actually
  be populated by placeholder text

- This clears field when user starts typing query into the query field,
  instead of them having to manually delete the default text populated
2022-07-31 22:29:23 +03:00
Debanjum Singh Solanky
17c38b526a Default config for each search type to None
- Setting up default compressed-jsonl, embeddings-file was only required
  for org search_type, while org-files and org-filter were allowed to be
  passed as command line argument
- This avoided having to set compressed-jsonl and embeddings-file via
  command line argument as well for org search type
- Now that all search types are only configurable via config file, we
  can default all search types to None. The default config for the
  rest of the search types wasn't being used anyway
2022-07-31 22:23:57 +03:00
Debanjum Singh Solanky
b83021a723 Improve code readability of merge_dicts helper method 2022-07-31 22:07:56 +03:00
Debanjum Singh Solanky
38aede68f2 Only configure org via config file for consistency across search types
- Previously org-files were configurable via cmdline args,
  whereas none of the other search types are
- This is an artifact of how the application grew
- It can be removed for better consistency and
  equal preference given all search types
2022-07-31 22:02:03 +03:00
Saba
b55159f5bd Fix URL for khoj.el quelpa setup instructions 2022-07-29 23:01:04 -04:00
Debanjum Singh Solanky
da118b3fed Simplify incremental search function used in web interface
Re-rank isn't passed to image search API in search function.
So don't need to check type in incremental_search function too
2022-07-29 23:18:01 +04:00
Debanjum Singh Solanky
3079614981 Allow set up of search form via query params in web interface
- Default search type to org, instead of images
2022-07-29 23:13:26 +04:00
Debanjum Singh Solanky
02ca2c05a1 Add Eagle Icon for Khoj to Web, Emacs Interfaces and Readme 2022-07-29 17:50:29 +04:00
Debanjum Singh Solanky
78314263a0 Add Table of Contents, Features, Performance Details to Readme 2022-07-29 17:08:17 +04:00
Debanjum Singh Solanky
ed181f47c9 Prettify rendering of org music results on Khoj web interface 2022-07-29 04:28:22 +04:00
Debanjum Singh Solanky
7e5291a38e Make org result headings at same level. Improve spacing of results
Having org-mode result headings change size based on their depth in
the source document makes for a confusing UI experience.

Improve font-size, line-spacing and margins of results to make
delineation between entries, and differentiating between entry heading
and its body easier to visually infer.

Do not white-space: pre-line. Improves rendering of Markdown results
2022-07-29 01:55:46 +04:00
Debanjum Singh Solanky
4d5183063c Create images directory if doesn't exist, to store image search results 2022-07-28 21:30:31 +04:00
Debanjum Singh Solanky
a9bc17a6b0 Prettify Render of Markdown Results in Web Interface 2022-07-28 20:56:37 +04:00
Debanjum Singh Solanky
a6ae74f52e Move JS files like org.js into a separate assets/ directory 2022-07-28 20:46:48 +04:00
Debanjum Singh Solanky
a12eaa4ce0 Move Khoj image results into a child images/ directory 2022-07-28 20:45:12 +04:00
Debanjum
a71253e137
Support Incremental Search on Web Interface
## Support Incremental Search on Khoj Web Interface
- Use default, fast path to query /search API while user is typing
- Upgrade to cross-encoder re-ranked results once user hits enter on search box

## Improve Render of Org Results on Web Interface
- We were previously just wrapping results from /search API into a pre formatted div field. This was not easy to read
- Use [org.js](https://mooz.github.io/org-js/) to render results from Khoj `/search` API as proper HTML
- Improve org.js to render all task states, stylize task tags and make org-mode results look more like original content

Closes  
2022-07-28 09:31:57 -07:00
Debanjum Singh Solanky
e8029bf415 Extract and Highlight org-mode tags in HTML render of search results 2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky
c6c248df26 Improve styling of org-mode results to original alignment, line breaks 2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky
9f59897eeb Highlight all org-mode task states in HTML. Not just TODO, DONE.
- Make logic to extract, mark todo state in org.js more generic
- Add default todo state styling to html
2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky
f040b3f65c Stylize TODO/DONE states with CSS 2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky
581b6097c7 Clean Results. Remove TOC, Heading Number and Property Drawers 2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky
965a93a2f2 Add Basic HTML Rendering of Org-Mode Results 2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky
1da44d4dfe Add Incremental Search to Khoj Web Interface 2022-07-28 19:55:15 +04:00
Debanjum Singh Solanky
af1dd31401 Do not pass verbose argument to image_search.query() as not supported 2022-07-28 19:52:58 +04:00
Debanjum Singh Solanky
80ac10835c Rerank results on normal minibuffer exit
In current state:
 - Rerank results:
   - If user idles while entering query OR
   - exits normally

 - Do not rerank results:
   - If user exits abnormally, e.g via C-g from query
2022-07-28 03:37:16 +04:00
Debanjum Singh Solanky
1b759597df Make incremental search more robust. Follow standard user expectations
- Rename functions to more standard, descriptive names
- Keep known, required code for incremental search
  - E.g Do not set buffer local flag in hooks on minibuffer setup

- Only query when user in khoj minibuffer
  - Use active-minibuffer-window and track khoj minibuffer
  - (minibuffer-prompt) is not useful for our use-case here

- (For now) Run re-rank only if user idle while querying
  - Do not run rerank on teardown/completion
    - The reranking lag (~2s) is annoying; hit enter,
      wait to see results
    - Also triggered when user exits abnormally,
      so C-g also results in rerank which is even more annoying
  - Emacs will still hang if re-ranking gets triggered on idle but
    that's better than always getting triggered. And better than not
    having mechanism to get results re-ranked via cross-encoder at all
2022-07-28 02:52:27 +04:00
Debanjum Singh Solanky
9a6eee31be Make number of results to get from Khoj API customizable in khoj.el 2022-07-27 18:55:18 +04:00
Debanjum Singh Solanky
9302b45fe0 Use khoj-incremental as the main khoj func. Rename khoj to khoj-simple
- Update khoj-simple to work with cross-encoder re-ranked results like before
- Increment major version as incremental search considered a breaking
  change and a major update to search capability
2022-07-27 18:18:17 +04:00
Debanjum Singh Solanky
09727ac3be Make bi-encoder return fewer results to reduce cross-encoder latency 2022-07-27 07:26:02 +04:00
Debanjum Singh Solanky
9ab3edf6d6 Re-rank incremental search results using cross-encoder if user idle
This provides a relatively smooth mechanism
- to improve relevance of results on idle
- while providing the rapid, incremental results while typing
2022-07-27 07:25:42 +04:00
Debanjum Singh Solanky
ad242cafa7 Support querying all text search types in incremental search
- Before incremental search was hard-coded to only query org
2022-07-27 07:25:42 +04:00
Debanjum Singh Solanky
bfcb962cbe Use post-command-hook to only query on user input
- Hooking into after-change-functions results in system logs triggering query
2022-07-27 07:25:42 +04:00
Debanjum Singh Solanky
0d49398954 Reuse code to query api, render results. Formalize method, arg names 2022-07-27 07:25:42 +04:00
Debanjum Singh Solanky
fd1963d781 Implement Basic Incremental Search Interface in Emacs for Org Mode Notes 2022-07-27 03:05:00 +04:00
Debanjum Singh Solanky
3fa7d8f03a Skeleton to allow incremental search on Khoj via Emacs 2022-07-27 02:48:27 +04:00
Debanjum Singh Solanky
1168244c92 Make cross-encoder re-rank results if query param set on /search API
- Improve search speed by ~10x
  Tested on corpus of 125K lines, 12.5K entries

- Allow cross-encoder to re-rank results by setting &?r=true when querying /search API
  - It's an optional param that defaults to False
  - Earlier all results were re-ranked by cross-encoder
  - Making this configurable allows for much faster results, if desired
    but for lower accuracy
2022-07-26 22:56:36 +04:00
Debanjum Singh Solanky
b1e64fd4a8 Improve search speed. Only apply filter if filter keywords in query
- Formalize filters into class with can_filter() and filter() methods

- Use can_filter() method to decide whether to apply filter and
  create deep copies of entries and embeddings for it

- Improve search speed for queries with no filters
  as deep copying entries, embeddings takes the most time
  after cross-encoder scoring when calling the /search API

  Earlier we would create deep copies of entries, embeddings
  even if the query did not contain any filter keywords
2022-07-26 22:47:26 +04:00
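A sketch of the `can_filter()` / `filter()` split from b1e64fd4a8. Only those two method names come from the commit message; the `+word` / `-word` query syntax, the class name and `apply_filters` are hypothetical.

```python
import re
from copy import deepcopy


class WordFilter:
    """Hypothetical filter: +word requires a word, -word blocks it."""

    pattern = re.compile(r"(?:^|\s)([+-])(\w+)")

    def can_filter(self, query: str) -> bool:
        # Cheap check that lets callers skip all filtering work
        return self.pattern.search(query) is not None

    def filter(self, query, entries, embeddings):
        terms = self.pattern.findall(query)
        required = {word.lower() for sign, word in terms if sign == "+"}
        blocked = {word.lower() for sign, word in terms if sign == "-"}
        cleaned_query = self.pattern.sub("", query).strip()
        keep = []
        for index, entry in enumerate(entries):
            words = set(re.findall(r"\w+", entry["raw"].lower()))
            if required <= words and not (blocked & words):
                keep.append(index)
        return cleaned_query, [entries[i] for i in keep], [embeddings[i] for i in keep]


def apply_filters(query, entries, embeddings, filters):
    applicable = [f for f in filters if f.can_filter(query)]
    if not applicable:
        # Fast path: no filter keywords, so skip the expensive deep copies
        return query, entries, embeddings
    entries, embeddings = deepcopy(entries), deepcopy(embeddings)
    for f in applicable:
        query, entries, embeddings = f.filter(query, entries, embeddings)
    return query, entries, embeddings
```

For a query without filter keywords, `apply_filters` returns the original lists untouched, which is where the speedup for unfiltered queries comes from.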
Debanjum Singh Solanky
f094c86204 Trace query response performance and display timings in verbose mode 2022-07-26 21:03:53 +04:00
Debanjum Singh Solanky
65fea7681a Rename notes search type to org search, now that markdown notes supported 2022-07-21 22:09:44 +04:00
Debanjum Singh Solanky
4c24202e42 Update documentation. Simplify, reflect current capabilities 2022-07-21 22:09:44 +04:00
Debanjum Singh Solanky
d4d7dbaca6 Support Natural Search on Markdown Files
- Reason:
  Allow natural search on markdown based notes, documentation,
  websites etc

- Details:
  - Create markdown processor to extract Markdown entries (identified by
    Heading) into standard jsonl format required by text_search
  - Update API, Configs to support interfacing with new markdown type
  - Update Emacs, Web clients to support interfacing with new markdown
    type via API
  - Update Readme to mention that markdown is also supported

Closes 
2022-07-21 22:07:05 +04:00
Debanjum Singh Solanky
0602d018c0 Merge Symmetric, Asymmetric Search Types into a single Text Search Type
- The code for both the text search types were mostly the same
  It was earlier done this way for expedience while experimenting
- The minor differences were reconciled and merged into a single
  text_search type
- This simplifies the app and makes it easier to process other
  text types
2022-07-21 21:19:52 +04:00
Debanjum Singh Solanky
0917f1574d Consolidate jsonl helper methods in a single file under utils module 2022-07-21 03:30:13 +04:00
Debanjum Singh Solanky
de726c4b6c Minor fixes to unused installer utility script 2022-07-21 03:30:13 +04:00
Debanjum Singh Solanky
5aad297286 Reuse logic to extract entries across symmetric, asymmetric search
Now that the logic to compile entries is in the processor layer, the
extract_entries method is standard across (text) search_types

Extract the load_jsonl method as a utility helper method.
Use it in (a)symmetric search types
2022-07-21 02:53:18 +04:00
Debanjum Singh Solanky
e220ecc00b Generate compiled form of each transaction directly in the beancount processor
- The logic for compiling a beancount entry (for later encoding) now
  completely resides in the beancount-to-jsonl processor layer

- This allows symmetric search to be generic and not be aware of
  beancount specific properties that were extracted by the
  beancount-to-jsonl processor layer

- Now symmetric search just expects the jsonl to (at least) have the
  'compiled' and 'raw' keys for each entry. What original text the
  entry was compiled from is irrelevant to it. The original text
  could be location, transaction, chat etc, it doesn't have to care
2022-07-21 02:43:28 +04:00
Debanjum Singh Solanky
06cf425314 Generate compiled form of each entry directly in the org-mode processor
- The logic for compiling an org-mode entry (for later encoding) now
  completely resides in the org-to-jsonl processor layer

- This allows asymmetric search to be generic and not be aware of
  org-mode specific properties that were extracted by the org-to-jsonl
  processor layer

- Now asymmetric search just expects the jsonl to (at least) have the
  'compiled' and 'raw' keys for each entry. What original text the
  entry was compiled from is irrelevant to it. The original text
  could be mail, chat, markdown, org-mode etc, it doesn't have to care
2022-07-21 02:08:02 +04:00
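The contract between processor and search layers described in 06cf425314 and e220ecc00b can be sketched as below. Only the 'compiled' and 'raw' keys come from the commit messages; the helper names and the validation step are assumptions, and any compression of the file is left out.

```python
import json

REQUIRED_KEYS = {"compiled", "raw"}


def dump_jsonl(entries) -> str:
    """Serialize entries as one JSON object per line."""
    return "\n".join(json.dumps(entry, ensure_ascii=False) for entry in entries)


def load_jsonl(text: str) -> list:
    """Parse JSONL text and check each entry has the keys search relies on."""
    entries = [json.loads(line) for line in text.splitlines() if line.strip()]
    for entry in entries:
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"Entry is missing keys: {sorted(missing)}")
    return entries


# The search layer only reads 'compiled' (for embeddings) and 'raw' (for display),
# so an org-mode heading and a ledger transaction look the same to it
entries = [
    {"compiled": "Went surfing with friends", "raw": "* DONE Went surfing with friends\n"},
    {"compiled": "Coffee with Sam", "raw": '2022-07-01 * "Cafe" "Coffee with Sam"\n'},
]
```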
Debanjum Singh Solanky
4ead79d272 Make Notes Search Natural Language Date Aware
- Pass Scheduled, Closed Dates of Entries to Include in Embeddings

- The (new?) model seems to understand dates. So can give more
  relevant entries if date in natural language mentioned in query
- E.g "Went Surfing with Friends" vs "Went Surfing with Friends in 1984"
  will give different results, with the second prioritizing
  entries with closed, scheduled dates from 1984
2022-07-21 01:06:49 +04:00
Debanjum Singh Solanky
d50bfb5188 Parse Logbook Entries in the OrgNode parser for Org-Mode. Update tests 2022-07-21 00:15:30 +04:00
Debanjum Singh Solanky
70e70d4b15 Rename 'embed' key to more generic 'compiled' for jsonl extracted results
- While it's true those strings are going to be used to generate
  embeddings, the more generic term allows them to be used elsewhere as
  well

- Their main property is that they are processed, compiled for
  usage by semantic search

- Unlike the 'raw' string which contains the external representation
  of the data, as is
2022-07-20 20:35:50 +04:00
Debanjum Singh Solanky
c1369233db Consistently use "entry", "score" in json response for all search types
- Had already made some progress on this earlier by updating the image
  search responses. But needed to update the text search responses to
  use lowercase entry and score

- Update khoj.el to consume the updated json response keys for text
  search
2022-07-20 20:33:27 +04:00
Debanjum Singh Solanky
d68a9dc445 Sort extracted images before computing their embeddings
- Image order returned by glob is OS dependent
- This prevented sharing image embeddings across machines running different OS
- A stable sort order for processed images allows sharing embeddings
  across machines.
- Use case:
  A more powerful, always on machine actually computes the image embeddings regularly
  The client machine just loads these periodically to provide semantic search functionality
2022-07-20 03:51:27 +04:00
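A sketch of the stable ordering from d68a9dc445. The function name and the default glob patterns are assumptions. Embeddings only line up across machines if the sorted paths are also the same on both machines.

```python
from pathlib import Path


def extract_image_paths(image_directories, patterns=("*.jpg", "*.jpeg")):
    """Collect image paths from all directories in a deterministic order."""
    paths = []
    for directory in image_directories:
        for pattern in patterns:
            paths.extend(Path(directory).expanduser().glob(pattern))
    # glob order depends on the OS and filesystem; sorting fixes which image
    # each row of the embeddings refers to
    return sorted(str(path) for path in paths)
```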
Debanjum Singh Solanky
c4c7f38b15 Fix extracting image names from multiple image directories 2022-07-20 03:40:49 +04:00
Debanjum Singh Solanky
bdc1b9f2bb Resolve edge case errors in encoding image metadata
- Handle case where current image batch smaller than batch_size
- Handle case where no XMP metadata for current image
  - return empty strings in such a scenario instead of ". "
2022-07-20 02:58:43 +04:00
Debanjum Singh Solanky
2a5445216c Image input directory not required by collate result as image_name already absolute path 2022-07-20 02:56:23 +04:00
Debanjum Singh Solanky
6c9ffdba57 Allow indexing multiple image directories for image search 2022-07-20 02:56:01 +04:00
Debanjum Singh Solanky
70221bb038 Allow filtering transactions by date in symmetric ledger 2022-07-19 20:58:24 +04:00
Debanjum Singh Solanky
b673d26a12 Extract Entries in a standardized format across text search types
Issue:
 - Had different schema of extracted entries for symmetric_ledger vs asymmetric

 - Entry extraction for asymmetric was dirty, relying on cryptic
   indices to store raw entry vs cleaned entry meant to be passed to embeddings

 - This was pushing the load of figuring out what property to extract
   from each entry to downstream processes like the filters

 - This limited the filters to only work for asymmetric search, not for
   symmetric_ledger

- Fix
   - Use consistent format for extracted entries
     {
       'embed': entry_string_meant_to_be_passed_to_model_and_get_embeddings,
       'raw'  : raw_entry_string_meant_to_be_passed_to_use
     }

 - Result
   - Now filters can be applied across search types, and the specific
     field they should be applied on can be configured by each search
     type
2022-07-19 20:52:25 +04:00
Debanjum Singh Solanky
e66cd5bf59 Only extract transactions from Beancount
- Earlier was extracting all entries starting with dates but the other
  type of entries like account open/close, asserts etc aren't useful for
  querying
2022-07-19 19:50:58 +04:00
Debanjum Singh Solanky
732b2d287f Give the project a short, less generic name. Rename it to Khoj
- Semantic Search was just a placeholder used to test the idea out
  Didn't want to get into naming at that point of time
2022-07-19 18:26:16 +04:00
Debanjum Singh Solanky
989526ae54 Use a more accurate model for symmetric semantic search
- The all-MiniLM-L6-v2 is more accurate
  - The exact previous model isn't benchmarked, but based on the
    performance of the closest model to it, the new model seems
    similar in speed and size

- On very preliminary evaluation of the model, the new model seems
  faster, with pretty decent results
2022-07-18 20:27:26 +04:00
Debanjum Singh Solanky
4a90972e38 Use a better model for asymmetric semantic search
- The multi-qa-MiniLM-L6-cos-v1 is more extensively benchmarked[1]
- It has the right mix of model query speed, size and performance on benchmarks
- On hugging face it has way more downloads and likes than the msmarco model[2]
- On very preliminary evaluation of the model
  - It doubles the encoding speed of all entries (down from ~8min to 4mins)
  - It gave more entries that stay relevant to the query (3/5 vs 1/5 earlier)

[1]: https://www.sbert.net/docs/pretrained_models.html
[2]: https://huggingface.co/sentence-transformers
2022-07-18 20:27:26 +04:00
Debanjum Singh Solanky
5e302dbcda Fix using 1 column layout on small screens 2022-07-18 02:40:16 +04:00
Debanjum Singh Solanky
7d16b673b1 Use Single Column Layout for Small Screens on Web Interface 2022-07-18 02:08:52 +04:00
Debanjum Singh Solanky
31a221a76b Auto focus cursor on query input box to simplify, speed interactions
- Avoids having to click the query input box
- Just open page, type whatever and hit enter to do image search
  - For other search types select appropriate type from dropdown
2022-07-16 19:39:15 +04:00
Debanjum Singh Solanky
06b0c720d6 Improve Rendering of Image Search Results in Emacs
- Use shr to render image response from html in result buffer
  Earlier was using org-mode. But rendering HTML with shr seems cleaner
- Use Headings to Add highlights
- Use Random to Force fetch of Image. Similar to what was done for Web interface
- Remove trailing elisp brackets from response
- Show query match scores by image model for each image in results
2022-07-16 19:31:49 +04:00
Debanjum Singh Solanky
28ec9af589 Extract image URL location from response in elisp after API update 2022-07-16 18:43:55 +04:00
Debanjum Singh Solanky
47613cba1f Improve Landing Page Look in General and Layout for Mobile
- Ask for 6 Images to Fill Grid into 3x2 Layout
- Submit Form on Hitting Enter
2022-07-16 16:55:13 +04:00
Debanjum Singh Solanky
cf207d6ebe Add title, heading to the semantic search web interface 2022-07-16 03:44:29 +04:00
Debanjum Singh Solanky
e0d8398b27 Normalize metadata match score to work better with image match score
- Metadata match scores were consistently higher than image match
  scores by a factor of ~3x. This was resulting in all
  results being from the metadata match with query and none from the
  image match with query.
- Scaling the metadata match scores down by a scaling factor seems to
  more consistently give a blend of results from both image and
  metadata matches
2022-07-16 03:39:33 +04:00
Debanjum Singh Solanky
a3fc82817d Log and continue on image metadata encoding error due to Tensor size mismatch 2022-07-16 03:39:19 +04:00
Debanjum Singh Solanky
f26d0ddbbd Minor fix to asymmetric search when no entries returned 2022-07-16 03:36:19 +04:00
Debanjum Singh Solanky
ca3f93e641 Add button on web interface to regenerate embeddings of specified type 2022-07-16 03:36:19 +04:00
Debanjum Singh Solanky
231cc91e14 Force reload of images every time user clicks search button
Adding a random, unused url param at the end of the img.src string
fixes the issue. As the browser thinks it's a new image and doesn't
use the image data that's already cached because of which it wasn't
even making the fetch call for the image
2022-07-16 03:36:19 +04:00
Debanjum Singh Solanky
a6aef62a99 Create Basic Landing Page to Query Semantic Search and Render Results
- Allow viewing image results returned by Semantic Search.
  Until now there wasn't any interface within the app to view image
  search results. For text results, we at least had the emacs interface

- This should help with debugging issues with image search too
  For text the Swagger interface was good enough
2022-07-16 03:36:19 +04:00
Debanjum Singh Solanky
4e27ae0577 Ease access to image result for given query by image_search
- Copy images to accessible directory
- Return URL paths to them to ease access
- This is to be used in the web interface to render image results
  directly in browser
- Return image, metadata scores for each image in response as well
  This should help get a better sense of image scores along both
  XMP metadata and whole image axis
2022-07-16 03:36:19 +04:00
Debanjum Singh Solanky
801e59a20d Allow explicit filters when querying Ledger transactions 2022-07-15 23:41:54 +04:00
Debanjum Singh Solanky
0e979587e0 Add configurable filter support to Symmetric Ledger Search 2022-07-14 23:40:41 +04:00
Debanjum Singh Solanky
85077bc1d1 Handle unparseable date range passed via date filter in query
- Do not reuse the same list
- Just create new list, so only parsed data is in it
2022-07-14 22:47:23 +04:00
Debanjum Singh Solanky
a60de2c02b Include date filter in asymmetric search on music as well 2022-07-14 22:37:17 +04:00
Debanjum Singh Solanky
c3b3e8959d Put entry splitting regex in explicit filter into a variable for code readability 2022-07-14 22:00:10 +04:00
Debanjum Singh Solanky
3aac3c7d52 Run explicit filter on raw entry, add more terms to split entries by
- With \t Last Word in Headings was suffixed by \t and so couldn't be
filtered by
- User interacts with raw entries, so run explicit filters on raw entry
   - For semantic search using the filtered entry is cleaner, still
2022-07-14 21:54:04 +04:00
Debanjum Singh Solanky
7640e2ab0c Wrap attempt to extract dates from entry in try/catch
- Not all YYYY-MM-DD strings in entry are necessarily dates
2022-07-14 21:38:00 +04:00
Debanjum Singh Solanky
9de2097182 Fix date filter usage with multi word queries. Simplify date regex 2022-07-14 21:34:33 +04:00
Debanjum Singh Solanky
dcb6fe479e Fix date_filter query, entry in query range check. Add tests for it
- Fix date_filter date_in_entry within query range check
  - Extracted_date_range is in [included_date, excluded_date) format
  - But check was checking for date_in_entry <= excluded_date
  - Fixed it to do date_in_entry < excluded_date

- Fix removal of date filter from query
- Add tests for date_filter
2022-07-14 20:01:35 +04:00
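The range check fixed in dcb6fe479e, sketched with plain `datetime.date` values. The function name and the tuple representation of the range are assumptions.

```python
from datetime import date


def date_in_range(entry_date: date, date_range: tuple) -> bool:
    """Range is [included_date, excluded_date): start inclusive, end exclusive."""
    included_date, excluded_date = date_range
    # Using <= on the upper bound would wrongly match the first day after the range
    return included_date <= entry_date < excluded_date


# A query for the year 1984 maps to [1984-01-01, 1985-01-01)
year_1984 = (date(1984, 1, 1), date(1985, 1, 1))
```

With the half-open range, `date(1984, 12, 31)` matches and `date(1985, 1, 1)` does not.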
Debanjum Singh Solanky
011f81fac5 Fix date_filter to handle non overlapping date ranges 2022-07-14 18:53:38 +04:00
Debanjum Singh Solanky
70ac35b2a5 Compute Date Range to filter entries to, from Comparators, Dates in Query 2022-07-14 18:20:09 +04:00
Debanjum Singh Solanky
e6db3e3d00 Prefer Dates From Future only when specific words in date string
- Default to looking at dates from past, as most notes are from past
- Look for dates in future for cases where it's obvious query is for
  dates in the future but dateparser's parse doesn't parse it at all.
  E.g parse('5 months from now') returns nothing

- Setting PREFER_DATES_FROM_FUTURE in this case and passing just
  parse('5 months') to dateparser.parse works as expected
2022-07-14 18:13:12 +04:00
Debanjum Singh Solanky
4a201d52af Add, test date filter regex and date parsing to get natural date range 2022-07-14 16:47:32 +04:00
Debanjum Singh Solanky
b54588717f Filter for entries with dates specified by user in query
- Create Date filter
  - Users can pass dates in YYYY-MM-DD format in their query
- Use it to filter asymmetric search to user specified dates
2022-07-14 00:51:02 +04:00
Debanjum Singh Solanky
b82aef26bf Make filters to apply before semantic search configurable
Details
--
- The filters to apply are configured for each type in the search controller
- Multiple filters can be applied on the query, entries etc before search
- The asymmetric query method now just applies the passed filters to the
  query, entries and embeddings before semantic search is performed

Reason
--
This abstraction will simplify adding other pre-search filters. E.g datetime filter
2022-07-13 16:37:09 +04:00
Debanjum Singh Solanky
c92789d20a Extract explicit pre-search filter function into a separate module
Details
--
- Move explicit_filters function into separate module under search_filter
- Update signature of explicit filter to take and return query, entries, embeddings
- Use this explicit_filter func from search_filters module in query

Reason
--
Abstraction will simplify adding other pre-search filters. E.g datetime filter
2022-07-13 16:20:04 +04:00
Debanjum Singh Solanky
6d7ab50113 Run Explicit Filter on Entries, Embeddings before Semantic Search for Query
- Issue
  - Explicit filtering was earlier being done after search by bi-encoder
    but before re-ranking by cross-encoder

  - This was limiting the quality of results being returned. As the
    bi-encoder returned results which were going to be excluded. So the
    burden of improving those limited results post filtering was on the
    cross-encoder by re-ranking the remaining results based on query

- Fix
  - Given the embeddings corresponding to an entry are at the same index
    in their respective lists. We can run the filter for blocked,
    required words before the search by the bi-encoder model. And limit
    entries, embeddings being considered for the current query

- Result
  - Semantic search by the bi-encoder gets to return most relevant
    results for the query, knowing that the results aren't going to be
    filtered out after. So the cross-encoder shoulders less of the
    burden of improving results

- Corollary
  - This pre-filtering technique allows us to apply other explicit
    filters on entries relevant for the current query
    - E.g limit search for entries within date/time specified in query
2022-07-12 18:25:42 +04:00
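A toy sketch of the ordering change in 6d7ab50113: filter entries and embeddings in lockstep first, then search. The function names, the substring matching and the dot-product ranking that stands in for the bi-encoder are all assumptions.

```python
def prefilter(entries, embeddings, required=(), blocked=()):
    """Keep only entry/embedding pairs whose text passes the word filters."""
    kept = [
        (entry, embedding)
        # Entry i and embedding i describe the same note, so filter them together
        for entry, embedding in zip(entries, embeddings)
        if all(word in entry.lower() for word in required)
        and not any(word in entry.lower() for word in blocked)
    ]
    return [entry for entry, _ in kept], [embedding for _, embedding in kept]


def top_k(query_embedding, entries, embeddings, k=2):
    """Stand-in for the bi-encoder search: rank entries by dot product."""
    scored = [
        (sum(q * e for q, e in zip(query_embedding, embedding)), entry)
        for entry, embedding in zip(entries, embeddings)
    ]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in ranked[:k]]
```

Because the blocked entries are removed before ranking, the top k slots are all filled with results that survive the filter, instead of being spent on entries that are discarded afterwards.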
Debanjum Singh Solanky
7677465f23 Fix passing of device to setup method in /reload, /regenerate API
- Use local variable to pass device to asymmetric.setup method via /reload, /regenerate API
- Set default argument to torch.device('cpu') instead of 'cpu' to be more formal
2022-06-30 01:32:56 +04:00
Debanjum Singh Solanky
eda4b65ddb Improve Query Speed. Normalize Embeddings, Moving them to Cuda GPU
- Move embeddings to CUDA GPU for compute, when available
- Normalize embeddings and Use Dot Product instead of Cosine
2022-06-30 00:59:57 +04:00
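Why normalizing embeddings lets eda4b65ddb swap cosine similarity for a dot product, shown in pure Python rather than with the tensor library the app presumably uses. The helper names are assumptions, and vectors are assumed to be non-zero.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
# Normalize once up front; each query then needs only the cheaper dot product
assert math.isclose(dot(normalize(a), normalize(b)), cosine(a, b))
```

Normalization is paid once when embeddings are generated or loaded, so the per-query cost drops to a dot product.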
Debanjum Singh Solanky
b89fc2f4ac Add /reload API to reload model embeddings and entries from file
- The reload API adds the ability to separate out the loading of
  embeddings from file without having to restart app or (re-)generate embeddings

- Before this the only way to load model from file was by restarting app
- The other way to reload the model embeddings, by regenerating them,
  was too expensive for larger datasets

- This unlocks at least 1 use-case, where
  - we regenerate model via an app instance running on a separate server and
  - just reload the generated embeddings on the client device

  - This allows us to offload the expensive embedding generation
    compute to a background server while letting

  - This avoids having to restart the application on the client device or
    be forced to generate embeddings on the client device itself

  - But it requires the model relevant files to be synced to the client device
    This can be done with any file syncing application like Syncthing

  - We can then call /regenerate on server and /reload on client on a
    regular schedule to keep our data up to date on semantic search
2022-06-29 23:47:17 +04:00
Debanjum Singh Solanky
f5d6d1e752 Tiny style fix to separate functions by 2 newlines 2022-06-29 23:47:17 +04:00
Debanjum Singh Solanky
85fbe1c42b Normalize org notes path to be relative to home directory
- This is still clunky but it should be committable
- General enough that it'll work even when a user's notes are not in the home directory
- While solving for the special case where:
  - Notes are being processed on a different machine and used on a different machine
  - But the notes directory is in the same location relative to home on both the machines
2022-06-28 19:16:11 +04:00
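One way the home-relative normalization above could look, as a sketch under assumptions rather than the commit's implementation. The function names are hypothetical, and POSIX-style paths are assumed.

```python
from pathlib import PurePosixPath


def relative_to_home(path: str, home: str) -> str:
    """Rewrite paths under home as ~/..., leave all other paths untouched."""
    try:
        return "~/" + PurePosixPath(path).relative_to(home).as_posix()
    except ValueError:
        # Notes outside the home directory keep their absolute path
        return path


def expand_home(path: str, home: str) -> str:
    """Resolve a ~/... path against the home directory of the current machine."""
    if path.startswith("~/"):
        return (PurePosixPath(home) / path[2:]).as_posix()
    return path


# Processed on one machine, used on another with a different home directory
stored = relative_to_home("/home/alice/notes/todo.org", home="/home/alice")
resolved = expand_home(stored, home="/Users/alice")
outside_home = relative_to_home("/srv/notes/todo.org", home="/home/alice")
```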
Debanjum Singh Solanky
094eaf3fcc Fix minor bugs in OrgNode parser
- Bugs discovered from writing org-node tests
2022-06-17 19:14:54 +03:00
Debanjum Singh Solanky
36495038dd Fix storing parsed CLOSED date in OrgNode
The CLOSED date was getting parsed but not stored
Adding setClosed at start also fixed the issue
2022-06-17 16:33:37 +03:00
Debanjum Singh Solanky
1c5754bf95 Simplify storing Tags in OrgNode object
- Use Set for Tags instead of dictionary with empty keys
- No Need to store First Tag separately
  - Remove properties methods associated with storing first tag separately
- Simplify extraction of tags string in org_to_jsonl
- Split notes_string creation into multiple f-string in separate line
  for code readability
2022-06-17 16:33:37 +03:00
Debanjum Singh Solanky
51a43245d3 Escape square brackets in file+heading based org-mode links 2022-06-17 16:20:19 +03:00
Debanjum Singh Solanky
04610f453a Include scheduled date, deadline date and close date in repr of org node
- Now that the times line is excluded from the raw body of the node,
  show it in repr so user can see it for reference

- But the model doesn't need to see it, as it would only confuse its
  embeddings
2022-06-17 05:13:48 +03:00
Debanjum Singh Solanky
367d7377df Ignore scheduled, closed, deadline time and logbook start, end in org node body
- Gives cleaner embeddings for semantic search
- Hopefully improves results and reduces size, compute
2022-06-17 05:13:09 +03:00
Debanjum Singh Solanky
b77ccadcba Make property key regex more strict. Property key has to be alphanumeric 2022-06-17 05:13:09 +03:00
Debanjum Singh Solanky
ac9d746444 Fix Tags extraction in Org Node parser
- Previous version required two tags at least to work, not sure why
- Fixed it to extract all tags, even if only one tag in heading
2022-06-17 04:21:22 +03:00
Debanjum Singh Solanky
fb86be8cd9 Add ID, File+Heading based Links to Org-Mode Entries
- Add links to property drawer
- This ensures results returned by semantic search contain these links
- This allows the user to jump to entry within original file for context
- The ID, file+heading based links are more robust for finding the relevant
  entry in the original file than the line number based link,
  as the user may edit the original files between embedding regenerations
2022-06-17 03:11:11 +03:00
Debanjum Singh Solanky
de23fc2051 Revert Add Scheduled, Deadline date to Model Embeddings for Date Aware Search
Sentence Transformer MSMarco Model isn't date aware
So no use of adding scheduled, deadline dates to model embeddings for consideration

This reverts commit a2a08d1354.
2022-06-17 02:57:28 +03:00
Debanjum Singh Solanky
a2a08d1354 Add Scheduled, Deadline date to Model Embeddings for Date Aware Search 2022-06-17 02:55:27 +03:00
Debanjum Singh Solanky
cfbd5c4ecc Update global model on regenerate via API 2022-06-17 00:49:06 +03:00
Debanjum Singh Solanky
c78bf84eef Introduce search api endpoint that auto infers search type intent
- Introduce prompt for GPT to automatically extract user's search intent
- Expose new search api endpoint to use that to set SearchType being
  passed to search API
- Currently meant as an experimental API to gauge usefulness,
  extendability. Evaluating for phone or voice use-case
2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky
8ef7917014 Fix json format passed in prompt to GPT 2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky
f57b7f65ea Wrap prompts for GPT in triple quotes to improve prompt readability
To improve prompt readability:
- Remove newline escape sequence and use actual newline directly
  - This avoids one long line of text as prompt
- Remove escaping of double quotes
2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky
1eba7b1c6f Use empty_escape_sequence constant to strip response text from gpt 2022-02-27 23:17:49 -05:00
Debanjum Singh Solanky
1c3a1420f8 Update asymmetric extract_entries method to handle uncompressed jsonl
This is similar to what was done for the symmetric extract_entries
method earlier
2022-02-27 19:03:31 -05:00
Debanjum Singh Solanky
3d8a07f252 Extract empty line escape sequences var into constants file for reuse 2022-02-27 19:01:49 -05:00
Debanjum Singh Solanky
bb5d0d8908 Improve Semantic Search Buffer Names in Emacs
- Allow multiple semantic searches buffers to exist simultaneously
  - Uniquify semantic search buffer names
- Add query and search-type to semantic search buffer name for easier
  disambiguation, and to search and find the appropriate buffer
2022-02-26 18:30:14 -05:00
Debanjum Singh Solanky
b68558651b Improve Extraction of Beancount Entries
- Only extract entries starting with YYYY-MM-DD from Beancount
- Strip Trailing Escape Sequences from Entries
2022-02-26 17:48:45 -05:00
Debanjum Singh Solanky
b3ac2dd730 Improve Results Rendered on Emacs from Semantic Search on Ledger
- Add search query to top of buffer as Beancount comment
- Remove trailing ) from response
- Separate entries by empty line
- Load beancount-mode in semantic search on ledger buffer
2022-02-26 17:48:45 -05:00
Debanjum Singh Solanky
502c68d4f8 Remove trailing escape sequence in ledger search response entries
- Fix loading entries from jsonl in extract_entries method
  - Only extract Title from jsonl of each entry
    This is the only thing written to the jsonl for symmetric ledger
  - This fixes the trailing escape seq in loaded entries
  - Remove the need for semantic-search.el response reader to do pointless complicated cleanup

- Make symmetric_ledger:extract_entries use beancount_to_jsonl:load_jsonl
  Both methods were doing similar work

- Make load_jsonl handle loading entries from both gzip and uncompressed jsonl
2022-02-26 17:48:45 -05:00
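A hedged sketch of a `load_jsonl` that handles both gzip and uncompressed jsonl, as the last bullet above describes. Detecting compression by the `.gz` suffix is an assumption made here; the actual method may decide differently. The sample entries are made up.

```python
import gzip
import json
import tempfile
from pathlib import Path


def load_jsonl(path):
    """Load one JSON object per line from a gzip-compressed or plain jsonl file."""
    path = Path(path)
    # Assumption: compressed files are recognised by a ".gz" suffix
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as jsonl_file:
        return [json.loads(line) for line in jsonl_file if line.strip()]


# Write the same entries in both formats and load them back
entries = [{"Title": "2022-02-26 * Coffee"}, {"Title": "2022-02-27 * Books"}]
lines = "".join(json.dumps(entry) + "\n" for entry in entries)
with tempfile.TemporaryDirectory() as directory:
    plain_path = Path(directory) / "ledger.jsonl"
    gzip_path = Path(directory) / "ledger.jsonl.gz"
    plain_path.write_text(lines, encoding="utf-8")
    with gzip.open(gzip_path, "wt", encoding="utf-8") as gzip_file:
        gzip_file.write(lines)
    from_plain = load_jsonl(plain_path)
    from_gzip = load_jsonl(gzip_path)
```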
Debanjum Singh Solanky
248aa632c0 Do not throw warning for beancount files with .beancount extension 2022-02-26 17:48:45 -05:00
Debanjum Singh Solanky
76cd63f4bd Fix count of processed jsonl entries shown to user by ledger processor
Count lines not chars
2022-02-26 17:46:06 -05:00
Saba
33bc62dc19 Fix type of use_xmp_metadata to be bool, rather than str 2022-01-24 21:53:26 -05:00
Debanjum Singh Solanky
179153dc5a Rename RawConfig Types for Consistency
- Naming convention - [ContentType][ConfigType]Config
  - Where [ConfigType] ~ Content, Search, Processor
  - Where [ContentType] ~ Text, Image, Asymmetric, Symmetric, Conversation

- Current Configs:
  - Content:
    - Org Notes
    - Org Music
    - Image
    - Ledger/Beancount

  - Search:
     - Asymmetric
     - Symmetric
     - Image

  - Processor:
    - Conversation
2022-01-14 20:54:38 -05:00
Debanjum Singh Solanky
c64e0c2965 Load model from HuggingFace if model_directory unset in config YAML
- Do not save/load the model to/from disk when model_directory unset
in config.yml
- Add symmetric search default config to cli.py
2022-01-14 17:36:59 -05:00
Debanjum Singh Solanky
510faa1904 Save Image Search Model to Disk 2022-01-14 17:36:59 -05:00
Debanjum Singh Solanky
934ec233b0 Add Search Config for Symmetric Model. Save Model to Disk 2022-01-14 17:36:59 -05:00
Debanjum Singh Solanky
b63026d97c Save Asymmetric Search Model to Disk
- Improve application load time
- Remove dependence on internet to startup application and perform semantic search
2022-01-14 17:36:27 -05:00
Debanjum Singh Solanky
2e53fbc844 Fix the user intent extraction prompt for GPT. Clean up chatbot test 2022-01-12 10:36:01 -05:00
Debanjum Singh Solanky
ea28897cdd Remove deprecated conversation_history field from config 2022-01-12 10:35:52 -05:00
Debanjum Singh Solanky
5a686b7be9 Add logs for chat bot in verbose mode 2022-01-12 10:35:52 -05:00
Debanjum Singh Solanky
6dc2a99d35 Merge branch 'master' of github.com:debanjum/semantic-search into add-summarize-capability-to-chat-bot
- Fix openai_api_key being set in ConfigProcessorConfig
- Merge addition of config UI and config instantiation updates
2021-12-20 13:30:42 +05:30
Debanjum Singh Solanky
65da7daf1f Load, Save Conversation Session Summaries to Log. s/chat_log/chat_session
Conversation logs structure now has session info too instead of just chat info
Session info will allow loading past conversation summaries as context for AI in new conversations

{
    "session": [
    {
        "summary": <chat_session_summary>,
        "session-start": <session_start_index_in_chat_log>,
        "session-end": <session_end_index_in_chat_log>
    }],
    "chat": [
    {
        "intent": <intent-object>,
        "trigger-emotion": <emotion-triggered-by-message>,
        "by": <AI|Human>,
        "message": <chat_message>,
        "created": <message_created_date>
    }]
}
2021-12-15 10:17:07 +05:30
Saba
97a6dfaa1e Use default value False for verbose parameter, and small changes
Pass config as parameter to initialize_search, change name of API methods to handle config CRUD operations, and initialize config to FullConfig
2021-12-11 14:13:14 -05:00
Saba
9536358d34 Fix key error model_name issue by upgrade sentence-transformers version
Refer to https://github.com/UKPLab/sentence-transformers/issues/1241
Also use verbose flag passed through function parameters in image_search
2021-12-11 11:58:19 -05:00
Saba
ce7a751e6b Fix passing verbose flag down in symmetric_ledger.py 2021-12-11 11:36:32 -05:00
Saba
d65190c3ee Update unit tests, files with removing model suffix to config types 2021-12-09 08:50:38 -05:00
Debanjum Singh Solanky
0ac1e5f372 Summarize chat logs and notes returned by semantic search via /chat API 2021-12-08 02:34:07 +05:30
Saba
76e9e9da2f Update unit tests to use the new BaseModel types 2021-12-05 09:31:39 -05:00
Saba
9b16cdbb41 Use past tense for verbose log 2021-12-04 11:45:44 -05:00
Saba
10e4065e05 Consolidate the search config models and pass verbose as a top level flag 2021-12-04 11:43:48 -05:00
Saba
43e647835b Append Model Suffix to config models 2021-12-04 10:51:21 -05:00
Saba
e068968b35 Update imports for raw config models in config.py 2021-12-04 10:44:55 -05:00
Saba
4d6284b0af Remove Test suffix from Config models 2021-12-04 10:44:13 -05:00
Saba
7fcc8d2cef Add null check for processor config 2021-12-04 10:11:00 -05:00
Saba
7ca4fc3453 Resolve merge conflicts with updated processor conversation data model 2021-11-28 16:22:52 -05:00
Saba
87a6c2d716 Use parse_obj instead of parse_raw as incoming data is in dict 2021-11-28 14:34:32 -05:00
Saba
5d50487d83 Linting
New line at end of config.html
Remove debug print statement
2021-11-28 13:32:56 -05:00
Saba
6f466c8d99 Use global config and add a regenerate button to the config ui 2021-11-28 13:28:22 -05:00
Saba
34d1e4199c Use alias generator when deserializing the config file 2021-11-28 13:05:48 -05:00
Saba
19b81e82f0 Write back to the raw config.yml file on update 2021-11-28 12:34:40 -05:00
Saba
8837b02de6 dump updated config to a yaml file 2021-11-28 12:26:07 -05:00
Saba
5b80b87379 Streamline None checking in initialize_search 2021-11-28 12:05:04 -05:00
Saba
bf8ae31e6a Streamline None checking in initialize_search 2021-11-28 11:59:45 -05:00
Saba
da52433d89 Update to re-use the raw config base models in config.py as well 2021-11-28 11:57:33 -05:00
Saba
6292fe4481 Update to re-use the raw config base models in config.py as well 2021-11-28 11:57:13 -05:00
Saba
311c4b7e7b Working API request body parsing to /post config! 2021-11-28 11:16:33 -05:00
Saba
66183cc298 Working API request body parsing to /post config! 2021-11-28 11:12:26 -05:00
Debanjum Singh Solanky
5cd920544d Add GPT method to summarize notes and chat logs 2021-11-28 13:08:05 +05:30
Debanjum Singh Solanky
1785047ea6 Improve understand primer and load understand response as dict 2021-11-28 13:04:16 +05:30
Saba
64645c3ac1 Begin type checking/input validation effort 2021-11-27 21:47:56 -05:00
Saba
9a0264b7fc Add a dummy POST config endpoint, integrate with editable UI 2021-11-27 20:36:03 -05:00
Saba
f3b03ea5b7 Make raw data reactive to changes 2021-11-27 19:17:15 -05:00
Debanjum Singh Solanky
67c3cd7372 Wire up GPT understand method to /chat API. Log conversation metadata too 2021-11-28 00:04:39 +05:30
Saba
3db06eee3f Basic example of serving config as JSON and retrieving on button click 2021-11-27 10:49:33 -05:00
Saba
3d4471e107 Merge branch 'master' of github.com:debanjum/semantic-search into saba/configui 2021-11-27 08:52:48 -05:00
Debanjum Singh Solanky
ccfb97e1a7 Wire up minimal conversation processor. Expose it over /chat API endpoint
Ensure conversation history persists across application restart
2021-11-27 18:12:01 +05:30
Debanjum Singh Solanky
a99b4b3434 Make conversation processor configurable 2021-11-27 18:12:01 +05:30
Debanjum Singh Solanky
d4e1120b22 Add GPT based conversation processor to understand intent and converse with user
- Allow conversing with user using GPT's contextually aware, generative capability
- Extract metadata, user intent from user's messages using GPT's general understanding
2021-11-27 18:12:01 +05:30
Saba
baee52648d Set up basic ui page with no functionality 2021-11-26 14:51:11 -05:00
debanjum
46661b3057 Ensure top_k never more than total entries to run symmetric search on 2021-11-16 11:32:21 -08:00
debanjum
8c858d1a94 Reduce symmetric search results for cross-encoder to re-rank to improve search speed 2021-11-16 11:31:19 -08:00
Debanjum Singh Solanky
f3fd5ae978 Improve code comments. Do not import unused modules in asymmetric search 2021-11-17 00:58:31 +05:30
Debanjum Singh Solanky
8cf2465e8e Ensure top_k never more than total entries to search from 2021-11-17 00:56:31 +05:30
Debanjum Singh Solanky
4d37ace3d6 Reduce search results for cross-encoder to re-rank to improve search speed
Search time on my notes reduced from 14s to 4s. Cross-encoder
re-ranking step takes majority time, not the cosine similarity search
2021-11-17 00:50:28 +05:30
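The two changes above (fewer candidates for the re-ranker, and `top_k` capped at the number of entries) can be sketched as a generic retrieve-then-re-rank loop. This is an illustration only: `cheap_score` and `expensive_score` are hypothetical word-overlap stand-ins for the bi-encoder similarity and the cross-encoder, and the `search` signature is assumed.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def search(query: str, entries: List[str], cheap_score: Scorer,
           expensive_score: Scorer, top_k: int = 10) -> List[str]:
    # Never ask for more candidates than there are entries
    top_k = min(top_k, len(entries))
    # Stage 1: cheap scoring over every entry, keep only the top_k candidates
    candidates = sorted(entries, key=lambda e: cheap_score(query, e), reverse=True)[:top_k]
    # Stage 2: expensive re-ranking runs on the short candidate list only
    return sorted(candidates, key=lambda e: expensive_score(query, e), reverse=True)


expensive_calls = []


def cheap_score(query: str, entry: str) -> float:
    return float(len(set(query.split()) & set(entry.split())))


def expensive_score(query: str, entry: str) -> float:
    expensive_calls.append(entry)  # record how often the slow scorer runs
    return cheap_score(query, entry)


entries = ["alpha beta", "beta gamma", "gamma delta", "delta epsilon"]
results = search("beta gamma", entries, cheap_score, expensive_score, top_k=2)
calls_for_small_top_k = len(expensive_calls)
all_results = search("beta gamma", entries, cheap_score, expensive_score, top_k=100)
```

Since the commit notes that re-ranking dominates query time, the cost scales with `top_k` rather than with the size of the whole collection.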
Debanjum Singh Solanky
1832e418e5 Use raw string for regex in orgnode to fix deprecation warning 2021-10-02 17:38:31 -07:00
Debanjum Singh Solanky
f59e321419 Update CLIP model load path 2021-10-02 16:50:06 -07:00
Debanjum Singh Solanky
c47a8cdf16 Allow configuring host, port or unix socket of server via CLI 2021-10-02 16:16:33 -07:00
Debanjum Singh Solanky
516f28b082 Merge branch 'master' of github.com:debanjum/semantic-search 2021-09-30 04:17:32 -07:00
Debanjum Singh Solanky
d2905c4be6 Move tests out to project root. Use absolute import in project
tests/ directory in project root is more standard.
Just had to use absolute path for internal module imports to get it to
work
2021-09-30 04:12:14 -07:00
Debanjum Singh Solanky
58bb420f69 Fix image_metadata argument ordering bug. Add E2E image search test
- Image search test seems a little flaky
- Interchanged argument was causing inaccurate results earlier
2021-09-30 03:30:47 -07:00
Debanjum Singh Solanky
d5597442f4 Modularize Code. Wrap Search, Model Config in Classes. Add Tests
Details
  - Rename method query_* to query in search_types for standardization
  - Wrapping Config code in classes simplified mocking test config
  - Reduce args being passed to a function by passing it as single
    argument wrapped in a class
  - Minimize setup in main.py:__main__. Put most of it into functions
    These functions can be mocked if required in tests later too

Setup Flow:
  CLI_Args|Config_YAML -> (Text|Image)SearchConfig -> (Text|Image)SearchModel
2021-09-30 02:04:04 -07:00
Debanjum Singh Solanky
f4dd9cd117 Use type specific model for other search types too. Expose them via SearchModels
- Wrap Image, Music, Ledger search into the type of SearchModel they use
  Similar to what was done for notes model by wrapping its config
  into an AsymmetricSearchModel.

- Use the uber wrapper class to expose all type specific search models
2021-09-29 21:09:42 -07:00
Debanjum Singh Solanky
352d2930ee Use multiple threads to generate model embeddings. Other minor formatting 2021-09-29 20:47:58 -07:00
Debanjum Singh Solanky
e22e0b41e3 Wrap asymmetric search model into SearchModels. Test notes search end-to-end
- Wrap asymmetric search model parameters into AsymmetricSearchModel class
- Create wrapper for all search type models. Put notes search model into it
- Test notes search end-to-end from client API layer to results.
  Use model built on test data
2021-09-29 20:47:35 -07:00
Debanjum Singh Solanky
cde11a2331 Wrap search type enablement status in a search settings class
- Cleaner, more idiomatic usage of a global variable
- Simplifies mocking when testing client in pytest as setting wrapped
  in object rather than a simple type. So passed around by reference
2021-09-29 19:18:33 -07:00
Debanjum Singh Solanky
81ce0cacc3 Only allow supported search types to /search, /regenerate APIs
- Use a SearchType to limit types that can be passed by user
- FastAPI automatically validates type passed in query param
- Available type options show up in Swagger UI, FastAPI docs
- controller code looks neater instead of doing string comparisons for type
- Test invalid, valid search types via pytest
2021-09-29 19:12:56 -07:00
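A stdlib-only sketch of the enum approach described above. The member names are assumptions based on the search types mentioned elsewhere in this log, and `parse_search_type` is a hypothetical helper. In FastAPI, declaring a query parameter with such an `Enum` type is what triggers the automatic validation and the option list in the docs.

```python
from enum import Enum
from typing import Optional


class SearchType(str, Enum):
    # Assumed members; the real enum may differ
    notes = "notes"
    ledger = "ledger"
    music = "music"
    image = "image"


def parse_search_type(value: Optional[str]) -> Optional[SearchType]:
    """Convert a raw query parameter to a SearchType; raises ValueError if unsupported."""
    return None if value is None else SearchType(value)


selected = parse_search_type("image")

# The controller can now compare enum members instead of raw strings
is_image_search = selected is SearchType.image
```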
Debanjum Singh Solanky
5db08c5293 Set query as heading of notes search results in Emacs Org buffer 2021-09-29 13:30:15 -07:00
Debanjum Singh Solanky
fdb60a8dcf Set Query as Heading of Image Search Results Emacs Buffer 2021-09-16 12:30:06 -07:00
Debanjum Singh Solanky
169ddcc8c6 Make Using XMP Metadata to Enhance Image Search Optional, Configurable
- Break the compute embeddings method into separate methods:
  compute_image_embeddings and compute_metadata_embeddings

- If image_metadata_embeddings isn't defined, do not use it to enhance
  search results. Given image_metadata_embeddings wouldn't be defined
  if use_xmp_metadata is False, we can avoid unnecessary addition of
  args to query method
2021-09-16 12:01:05 -07:00
Debanjum Singh Solanky
a4a23d7a72 Batch encode XMP metadata from images too for image_search 2021-09-16 11:11:36 -07:00
Debanjum Singh Solanky
3afe054312 Make image batch size to encode configurable via config.yml 2021-09-16 10:52:31 -07:00
Debanjum Singh Solanky
41c328dae0 Batch encode images to keep memory consumption manageable
- Issue:
  Process would get killed while encoding images
  for consuming too much memory

- Fix:
  - Encode images in batches and append to image_embeddings
  - No need to use copy or deep_copy anymore with batch processing.
    It would earlier throw too many files open error

Other Changes:
  - Use tqdm to see progress even when using batch
  - See progress bar of encoding independent of verbosity (for now)
2021-09-16 10:15:54 -07:00
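The batching fix above can be sketched as follows, under assumptions: `encode_batch` stands in for the model's encode call, and `fake_encode_batch` is a made-up encoder used only to show that no more than `batch_size` images are handled at once.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def encode_in_batches(
    image_paths: Sequence[str],
    encode_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int,
) -> List[List[float]]:
    image_embeddings: List[List[float]] = []
    for batch in batched(image_paths, batch_size):
        # Only one batch of images is open and in memory at a time
        image_embeddings.extend(encode_batch(batch))
    return image_embeddings


batch_sizes_seen = []


def fake_encode_batch(paths: Sequence[str]) -> List[List[float]]:
    batch_sizes_seen.append(len(paths))
    return [[float(len(path))] for path in paths]


image_paths = [f"image_{i}.jpg" for i in range(7)]
image_embeddings = encode_in_batches(image_paths, fake_encode_batch, batch_size=3)
```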
Debanjum Singh Solanky
d8abbc0552 Use XMP metadata in images to improve image search
- Details
  - The CLIP model can represent images, text in the same vector space

  - Enhance CLIP's image understanding by augmenting the plain image
    with its text based metadata.
    Specifically with any subject, description XMP tags on the image

  - Improve results by combining plain image similarity score with
    metadata similarity scores for the highest ranked images

- Minor Fixes
  - Convert verbose to integer from bool in image_search.
    It's already passed as integer from the main program entrypoint

  - Process images with ".jpeg" extensions too
2021-09-16 08:55:20 -07:00
Debanjum Singh Solanky
0e34c8f493 Allow semantic search on images from Emacs
Images are rendered inline in a temporary org-mode buffer
2021-09-10 01:14:34 -07:00
Debanjum Singh Solanky
7d5514ecaa Allow user to override inferred search type with other valid options 2021-09-10 00:58:24 -07:00
Debanjum Singh Solanky
3bdeeb1e19 Autoload main semantic-search function 2021-09-09 22:10:37 -07:00
Debanjum Singh Solanky
f4bde75249 Decouple results shown to user and text the model is trained on
- Previously:
  The text the model was trained on was being used to
  re-create a semblance of the original org-mode entry.

- Now:
  - Store raw entry as another key:value in each entry json too
    Only return actual raw org entries in results
    But create embeddings like before
  - Also add link to entry in file:<filename>::<line_number> form
    in property drawer of returned results
    This can be used to jump to actual entry in its original file
2021-08-29 06:06:54 -07:00
Debanjum Singh Solanky
7ee3007070 Get ID, QUERY, TYPE, CATEGORY properties from org property drawer when present 2021-08-29 06:06:28 -07:00
Debanjum Singh Solanky
0263d4d068 Enable semantic search for songs in org-music
Org-Music: https://github.com/debanjum/org-music
2021-08-29 06:06:28 -07:00
Debanjum Singh Solanky
fd7888f3d4 Resolve relative file paths to config YAML file in cli.py 2021-08-29 03:03:37 -07:00
Debanjum Singh Solanky
fc531a1915 Resolve relative file paths to model embeddings in all search types 2021-08-28 22:26:12 -07:00
Debanjum Singh Solanky
4daeddbbda Enable Semantic Search on Images 2021-08-22 21:42:37 -07:00
Debanjum Singh Solanky
fd217fe8b7 Enable Semantic Search for Beancount transactions 2021-08-22 21:36:06 -07:00
Debanjum Singh Solanky
97263b8209 Move CLI into a separate module. Move CLI tests into a separate file 2021-08-21 19:21:38 -07:00
Debanjum Singh Solanky
78a1f4ebb4 Use YAML file to allow user to configure application. Add tests
- YAML Config
  - Can specify all params[1] earlier being passed via cmd args in config YAML
  - Can now also configure sentence-transformer models to use etc for search
    - [1] Config params
       - org files
       - compressed entries file config path
       - embeddings file config path

  - Include sample_config.yaml
  - Include sample .org file from this repo's readmes

- CLI
  - Configuration Priority: Config via cmd > Config via YAML > Default Config
  - Test CLI, include test config.yml for the tests

- Set default type to None unless set via query param to API
  Run notes search if search_enabled, also if type is None (default)
  Prepares for running queries on all search types unless type
  specified in API query param

- Update Readme
2021-08-21 19:07:39 -07:00
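The configuration priority above (cmd > YAML > default) amounts to layered dictionary merging. A minimal sketch, assuming nested dicts with `None` meaning "not set"; the `merge_dicts` helper and the config keys shown are hypothetical, not the project's schema.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_dicts(priority: Dict[str, Any], default: Dict[str, Any]) -> Dict[str, Any]:
    """Return default overlaid with priority, merging nested dicts key by key."""
    merged = deepcopy(default)
    for key, value in priority.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_dicts(value, merged[key])
        elif value is not None:
            # Unset (None) values never override a lower-priority setting
            merged[key] = deepcopy(value)
    return merged


default_config = {
    "org": {"compressed-jsonl": "notes.jsonl.gz", "embeddings-file": "notes.pt"},
    "verbose": 0,
}
yaml_config = {"org": {"embeddings-file": "my_notes.pt"}, "verbose": 1}
cli_config = {"org": {"embeddings-file": None}, "verbose": 2}

# Apply the lowest priority first: defaults, then YAML, then command line
config = merge_dicts(cli_config, merge_dicts(yaml_config, default_config))
```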
Debanjum Singh Solanky
bafc86d583 Add helpers to merge dictionaries and get keys deep inside a dictionary 2021-08-21 18:27:50 -07:00
Debanjum Singh Solanky
252266b62a Pass type of item via regenerate API. Default type query param to None 2021-08-17 18:25:07 -07:00
Debanjum Singh Solanky
ff7207a6bd Extract commandline arguments into separate testable method 2021-08-17 04:11:03 -07:00
Debanjum Singh Solanky
a3a1100be9 Arrange modules in standardized ordering 2021-08-17 04:11:03 -07:00
Debanjum Singh Solanky
569e30b1c8 Create a few basic tests 2021-08-17 04:11:03 -07:00
Debanjum Singh Solanky
af9660f28e Move application files under src directory. Update Readmes
- Remove the command for calling the asymmetric search script directly.
  Calling it directly doesn't work anymore due to internal package
  import issues
2021-08-17 04:11:03 -07:00