Commit graph

380 commits

Author SHA1 Message Date
sabaimran
c08b9e89f0 Update test_db_lock with new function name 2024-08-08 13:03:01 +05:30
sabaimran
1a1d9c7257 Merge branch 'master' of github.com:khoj-ai/khoj into features/big-upgrade-chat-ux 2024-07-27 14:18:05 +05:30
sabaimran
44d34f9090 Update the unit test for the subscribed user 2024-07-26 19:59:01 +05:30
sabaimran
377f7668c5
Merge pull request #858 from khoj-ai/use-sse-instead-of-websocket
Use Single HTTP API for Robust, Generalizable Chat Streaming
2024-07-26 07:11:54 -07:00
Debanjum Singh Solanky
54b4203683 Update chat API client tests to mix testing of batch and streaming mode 2024-07-23 17:56:03 +05:30
Debanjum Singh Solanky
469a1cb6a2 Move API endpoints under /api/configure/content/ to /api/content/
Pull out /api/configure/content API endpoints into /api/content to
allow for more logical organization of API path hierarchy

This should make the url more succinct and API request intent more
understandable by using existing HTTP method semantics along with the
path.

The /configure URL path segment was either
- redundant (e.g POST /configure/notion) or
- incorrect (e.g GET /configure/files)

Some example of naming improvements:
- GET /configure/types -> GET /content/types
- GET /configure/files -> GET /content/files
- DELETE /configure/files -> DELETE /content/files

This should also align, merge better the the content indexing API
triggered via PUT, PATCH /content

Refactor Flow
1. Rename /api/configure/types -> /api/content/types
2. Rename /api/configure -> /api
3. Move /api/content to api_content from under api_config
2024-07-19 05:40:34 +05:30
Debanjum Singh Solanky
bba4e0b529 Accept file deletion requests by clients during sync
- Remove unused full_corpus boolean. The full_corpus=False code path
  wasn't being used (accept for in a test)
- The full_corpus=True code path used was ignoring file deletion
  requests sent by clients during sync. Unclear why this was done
- Added unit test to prevent regression and show file deletion by
  clients during sync not ignored now
2024-07-19 04:53:01 +05:30
Debanjum Singh Solanky
5923b6d89e Split /api/v1/index/update into /api/content PUT, PATCH API endpoints
- This utilizes PUT, PATCH HTTP method semantics to remove need for
  the "regenerate" query param and "/update" url suffix
- This should make the url more succinct and API request intent more
  understandable by using existing HTTP method semantics
2024-07-19 01:45:53 +05:30
Debanjum Singh Solanky
e9f86e320b Fix and improve offline chat actor, director tests
- Use updated references schema with compiled key
- Enable director tests that are now expected to pass and that do pass
  (with Gemma 2 at least)
2024-07-18 03:43:09 +05:30
Debanjum Singh Solanky
de15a7a3fc Rename API path /api/config to /api/configure
- Update clients calling /api/config to call /api/configure instead
2024-07-16 16:13:27 +05:30
Debanjum Singh Solanky
21fe1a917b Support syncing, searching images from Obsidian plugin 2024-07-11 16:22:31 +05:30
Debanjum Singh Solanky
010486fb36 Split current section once by heading to resolve org-mode indexing bug
- Split once by heading (=first_non_empty) to extract current section body
  Otherwise child headings with same prefix as current heading will
  cause the section split to go into infinite loop
- Also add check to prevent getting into recursive loop while trying
  to split entry into sub sections
2024-07-06 19:35:59 +05:30
Debanjum Singh Solanky
d5ceff2691 Update tests and documentation with Jina reader API usage and info
Update offline, openai chat actor, director tests to not require
Serper to run the online command tests

Update documentation for self-hosted online search to mention no setup
is required by default. But improvements can be made by using
Serper.dev or Olostep
2024-07-02 17:19:09 +05:30
Raghav Tirumale
8eccd8a5e4
Support Indexing Images via OCR (#823)
- Added support for uploading .jpeg, .jpg, and .png files to Khoj from Web, Desktop app
- Updating indexer to generate raw text and entries using RapidOCR
- Details
  * added support for indexing images via ocr
  * fixed pyproject.toml
  * Update src/khoj/processor/content/images/image_to_entries.py
     Co-authored-by: Debanjum <debanjum@gmail.com>
  * Update src/khoj/processor/content/images/image_to_entries.py
     Co-authored-by: Debanjum <debanjum@gmail.com>
  * removed redudant try except blocks
  * updated desktop js file to support image formats
  * added tests for jpg and png
  * Fix processing for image to entries files
  * Update unit tests with working image indexer
  * Change png test from version verificaition to open-cv verification

---------

Co-authored-by: Debanjum <debanjum@gmail.com>
Co-authored-by: sabaimran <narmiabas@gmail.com>
2024-07-01 06:00:00 -07:00
Debanjum Singh Solanky
732332a3c5 Spell fix s/e.g/e.g./ across code, tests and docs 2024-06-24 15:24:45 +05:30
Debanjum Singh Solanky
22f6db0a6b Upgrade RapidOCR and enable for Python 3.12. Fix PDF OCR test 2024-06-22 16:01:55 +05:30
Raghav Tirumale
bd3b590153
Support Indexing Docx Files (#801)
* Add support for indexing docx files and associated unit tests

---------

Co-authored-by: sabaimran <narmiabas@gmail.com>
2024-06-20 11:18:01 +05:30
Raghav Tirumale
d4e5c95711
Add Ability to Summarize Documents (#800)
* Uses entire file text and summarizer model to generate document summary.
* Uses the contents of the user's query to create a tailored summary.
* Integrates with File Filters #788 for a better UX.
2024-06-18 19:31:07 +05:30
Debanjum
6afbd8032e
Improve Intermediate Steps in Formulating Chat Response (#799)
# Major
- Disambiguate Text output mode to disambiguate from Default data source lookup
- Fix showing headings in intermediate step in generating chat response
- Remove "Path" prefix from org ancestor heading in compiled entry

# Minor
- Fix OpenAI chat actor, director unit tests
2024-06-09 07:55:01 +05:30
Debanjum Singh Solanky
f440ddbe1d Fix openai chat actor, director tests
- Update test ChatModelOptions setup since update to it's schema
- Fix stale function calls using their updated signatures
2024-06-09 07:24:47 +05:30
Debanjum Singh Solanky
5f2442450c Update truncation test to reduce flakyness in cloud tests
Removed dependency on faker, factory for the truncation tests as that
seems to be the point of flakiness
2024-06-07 19:42:48 +05:30
Debanjum Singh Solanky
18f7e6e7ed Remove "Path" prefix from org ancestor heading in compiled entry 2024-06-06 16:51:26 +05:30
Debanjum Singh Solanky
22289a0002 Improve task scheduling by using json mode and agent scratchpad
- The task scheduling actor was having trouble calculating the
  timezone. Giving the actor a scratchpad to improve correctness by
  thinking step by step
- Add more examples to reduce chances of the inferred query looping to
  create another reminder instead of running the query and sharing
  results with user
- Improve task scheduling chat actor test with more tests and
  by ensuring unexpected words not present in response
2024-05-01 08:30:10 +05:30
Debanjum Singh Solanky
7f5981594c Only notify when scheduled task results satisfy user's requirements
There's a difference between running a scheduled task and notifying
the user about the results of running the scheduled task.

Decide to notify the user only when the results of running the
scheduled task satisfy the user's requirements.

Use sync version of send_message_to_model_wrapper for scheduled tasks
2024-05-01 08:30:10 +05:30
Debanjum Singh Solanky
c28d7d3414 Add basic chat actor test to infer scheduled queries 2024-05-01 08:28:59 +05:30
Debanjum
17a06f152c
Support Llama 3 and Improve Offline Chat Actors (#724)
- Add support for Llama 3 in Khoj offline mode
- Make chat actors generate valid json with more local models
- Fix offline chat actor tests
2024-04-25 14:00:56 +05:30
Debanjum Singh Solanky
ec41482324 Upgrade default cross-encoder to mixedbread ai's mxbai-rerank-xsmall
Previous cross-encoder model was a few years old, newer models should
have improved in quality. Model size increases by 50% compared to
previous for better performance, at least on benchmarks
2024-04-24 09:50:09 +05:30
Debanjum Singh Solanky
f2db8d7d99 Fix offline chat actor tests
Do not check for original q in extracted questions. Since this was
removed in a previous commit
2024-04-24 09:40:00 +05:30
sabaimran
60658a8037 Get rid of enable flag for the offline chat processor config
- Default, assume that offline chat is enabled if there is an offline chat model option configured
2024-04-23 23:08:29 +05:30
sabaimran
6de4a4873a Fix image-related client unit test 2024-04-17 13:28:48 +05:30
sabaimran
3132430737 Add tests for the db lock 2024-04-17 13:22:41 +05:30
sabaimran
d11354f9c8 Remove additional references to image content config 2024-04-17 13:00:50 +05:30
sabaimran
87b9a93fa1 Update assertion line to match new logic 2024-04-12 13:09:19 +05:30
sabaimran
e58bd0e485 Remove mbox file from list of files expected to be included 2024-04-12 12:55:22 +05:30
Debanjum Singh Solanky
8291b898ca Standardize structure of text to entries to match other entry processors
Add process_single_plaintext_file func etc with similar signatures as
org_to_entries and markdown_to_entries processors

The standardization makes modifications, abstractions easier to create
2024-04-09 20:19:40 +05:30
Debanjum
11ce3e2268
Update Text Chunking Strategy to Improve Search Context (#645)
## Major
- Parse markdown, org parent entries as single entry if fit within max tokens
- Parse a file as single entry if it fits with max token limits
- Add parent heading ancestry to extracted markdown entries for context
- Chunk text in preference order of para, sentence, word, character

## Minor
- Create wrapper function to get entries from org, md, pdf & text files
- Remove unused Entry to Jsonl converter from text to entry class, tests
- Dedupe code by using single func to process an org file into entries

Resolves #620
2024-04-08 13:56:38 +05:30
Debanjum Singh Solanky
9239c2c2ed Update drop large words test to ensure newlines considerd word boundary
Prevent regression to #620
2024-04-08 13:38:08 +05:30
sabaimran
f57f9f672d
Address Notion, Image tech debt in indexing code path (#687)
* Add support for using OAuth2.0 in the Notion integration
* Add notion to the admin page
* Remove unnecessary content_index and image search/setup references
* Trigger background job to start indexing Notion after user configures it
* Add a log line when a new Notion integration is setup
* Fix references to the configure_content methods
2024-04-05 12:10:03 +05:30
Debanjum Singh Solanky
29c1c18042 Increase search distance to get relevant content for chat post indexer update
More content indexed per entry would result in an overall scores
lowering effect. Increase default search distance threshold to counter that

- Details
  - Fix expected results post indexing updates
  - Fix search with max distance post indexing updates

- Minor
  - Remove openai chat actor test for after: operator as it's not expected anymore
2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
ad4fa4b2f4 Fix adding file path instead of stem to markdown entries 2024-04-04 02:41:55 +05:30
sabaimran
720139c3c1 Fix all unit tests for test_text_search 2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
44b3247869 Update logical splitting of org-mode text into entries
- Major
  - Do not split org file, entry if it fits within the max token limits
    - Recurse down org file entries, one heading level at a time until
      reach leaf node or the current parent tree fits context window
    - Update `process_single_org_file' func logic to do this recursion

  - Convert extracted org nodes with children into entries
    - Previously org node to entry code just had to handle leaf entries
    - Now it recieve list of org node trees
    - Only add ancestor path to root org-node of each tree
    - Indent each entry trees headings by +1 level from base level (=2)

- Minor
  - Stop timing org-node parsing vs org-node to entry conversion
    Just time the wrapping function for org-mode entry extraction
    This standardizes what is being timed across at md, org etc.
  - Move try/catch to `extract_org_nodes' from `parse_single_org_file'
    func to standardize this also across md, org
2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
44eab74888 Dedupe code by using single func to process an org file into entries
Add type hints to orgnode and org-to-entries packages
2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
db2581459f Parse markdown parent entries as single entry if fit within max tokens
These changes improve context available to the search model.
Specifically this should improve entry context from short knowledge trees,
that is knowledge bases with sparse, short heading/entry trees

Previously we'd always split markdown files by headings, even if a
parent entry was small enough to fit entirely within the max token
limits of the search model. This used to reduce the context available
to the search model to select appropriate entries for a query,
especially from short entry trees

Revert back to using regex to parse through markdown file instead of
using MarkdownHeaderTextSplitter. It was easier to implement the
logical split using regexes rather than bend MarkdowHeaderTextSplitter
to implement it.
- DFS traverse the markdown knowledge tree, prefix ancestry to each entry
2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
982ac1859c Parse markdown file as single entry if it fits with max token limits
These changes improve entry context available to the search model
Specifically this should improve entry context from short knowledge trees,
that is knowledge bases with small files

Previously we split all markdown files by their headings,
even if the file was small enough to fit entirely within the max token
limits of the search model. This used to reduce the context available
to select the appropriate entries for a given query for the search model,
especially from short knowledge trees
2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
d8f01876e5 Add parent heading ancestory to extracted markdown entries for context
Improve, update the markdown to entries extractor tests
2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
86575b2946 Chunk text in preference order of para, sentence, word, character
- Previous simplistic chunking strategy of splitting text by space
  didn't capture notes with newlines, no spaces. For e.g in #620

- New strategy will try chunk the text at more natural points like
  paragraph, sentence, word first. If none of those work it'll split
  at character to fit within max token limit

- Drop long words while preserving original delimiters

Resolves #620
2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
a627f56a64 Remove unused Entry to Jsonl converter from text to entry class, tests
This was earlier used when the index was plaintext jsonl file. Now
that documents are indexed in a DB this func is not required.

Simplify org,md,pdf,plaintext to entries tests by removing the entry
to jsonl conversion step
2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
28105ee027 Create wrapper function to get entries from org, md, pdf & text files
- Convert extract_org_entries function to actually extract org entries
  Previously it was extracting intermediary org-node objects instead
  Now it extracts the org-node objects from files and converts them
  into entries
- Create separate, new function to extract_org_nodes from files
- Similarly create wrapper funcs for md, pdf, plaintext to entries

- Update org, md, pdf, plaintext to entries tests to use the new
  simplified wrapper function to extract org entries
2024-04-04 02:41:55 +05:30
Debanjum
215ab6e66a
Extract More Dates from entries to improve Date Filter (#683)
- Overview
  - Extract more structured date variants (e.g with dot(.) & slash(/) separators, 2-digit year)
  - Extract some natural, partial dates as well from entries
- Capability
  Add ability to extract the following additional date forms:
  - Natural Dates: 21st April 2000, February 29 2024
  - Partial Natural Dates: March 24, Mar 2024
  - Structured Dates: 20/12/24, 20.12.2024, 2024/12/20
  Note: Previously only YYYY-MM-DD ISO-8601 structured date form was extracted for date filters
- Performance
  Using regexes is MUCH faster than using the `dateparser' python library
  It's a little crude but gives acceptable performance for large datasets
2024-04-02 16:14:53 +05:30