Commit graph

1573 commits

Author SHA1 Message Date
Debanjum
e022910f31
Search PDF files with Khoj. Integrate with LangChain
- **Introduce Khoj to LangChain**: 
    Call GPT with LangChain for Khoj Chat
- **Search (and Chat about) PDF files with Khoj**
  - Create PDF to JSONL Processor: Convert PDF content into standardized JSONL format
  - Expose PDF search type via Khoj server API
  - Enable querying PDF files via Obsidian, Emacs and Web interfaces
2023-06-02 10:20:26 +05:30
Debanjum Singh Solanky
e9ed7a19fd Update search prompt to extract PDF search type. Fix extract_question prompt 2023-06-02 10:06:03 +05:30
Debanjum Singh Solanky
89fbfce20a Mention that PDFs are also supported in Khoj Readme 2023-06-01 21:42:48 +05:30
Debanjum Singh Solanky
bbe3bf9733 Render PDF search results in Khoj Obsidian interface
- Make plugin update khoj server config to index PDF files in vault too
- Make Obsidian plugin update index for PDF files in vault too
- Show PDF results in Khoj Search modal as well
  - Ensure combined results are sorted by score across both types
- Jump to PDF file when selecting its search result from the modal
2023-06-01 21:42:48 +05:30
Debanjum Singh Solanky
e3892945d4 Render PDF search results in Khoj.el Emacs interface 2023-06-01 21:42:48 +05:30
Debanjum Singh Solanky
85144006a1 Render PDF search results in khoj web interface 2023-06-01 21:42:48 +05:30
Debanjum Singh Solanky
acd14a5e41 Wire up PDF to jsonl processor to Khoj server layer (API, config)
- Specify PDF content to index via khoj.yml
- Index PDF content on app start, reconfigure
- Expose PDF as a search type via API
2023-06-01 21:42:48 +05:30
Debanjum Singh Solanky
d63194c3a9 Create tests for PDF to JSONL processor 2023-06-01 21:42:48 +05:30
Debanjum Singh Solanky
286b500f66 Create PDF to JSONL processor using PyPDF and LangChain
Switch `pydantic' to >= 1.9.1, else `langchain.document_loaders' starts
throwing typing errors on Python 3.8, 3.9
2023-06-01 21:41:49 +05:30
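A minimal sketch of the JSONL side of such a processor, assuming page text has already been extracted from the PDF (for example with PyPDF). The function name `pages_to_jsonl`, the one-entry-per-page split and the field names `compiled`, `raw` and `file` are illustrative assumptions, not taken from the commit.

```python
import json
from typing import Dict, List


def pages_to_jsonl(pages: List[str], filename: str) -> str:
    """Serialize one JSON object per PDF page, one object per line."""
    lines = []
    for page_text in pages:
        # Hypothetical entry schema: field names are assumptions for illustration.
        entry: Dict[str, str] = {
            "compiled": page_text,
            "raw": page_text,
            "file": filename,
        }
        # json.dumps escapes embedded newlines, so each entry stays on one line.
        lines.append(json.dumps(entry))
    return "\n".join(lines)
```

Each line of the result can be parsed independently with `json.loads`, which is what makes JSONL convenient for incremental indexing.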
Debanjum Singh Solanky
1b3effd8e6 Fork Markdown to JSONL processor as start template for PDF to JSONL processor 2023-06-01 09:13:31 +05:30
Debanjum Singh Solanky
1cd9ecd449 Truncate last message if still over max supported prompt size by model 2023-06-01 08:50:59 +05:30
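A rough sketch of the truncation described above. It assumes older messages are dropped first and that whitespace-separated words stand in for model tokens; both are assumptions, as are the names `count_tokens` and `truncate_messages`. A real implementation would count tokens with the model's tokenizer.

```python
from typing import List


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())


def truncate_messages(messages: List[str], max_tokens: int) -> List[str]:
    """Drop oldest messages until the prompt fits, then trim the last one if needed."""
    kept = list(messages)
    # Drop older messages first, always keeping the most recent message.
    while len(kept) > 1 and sum(count_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)
    # If the single remaining message is still too long, truncate it.
    if kept and count_tokens(kept[-1]) > max_tokens:
        kept[-1] = " ".join(kept[-1].split()[:max_tokens])
    return kept
```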
Debanjum Singh Solanky
ed4d0f9076 Simplify argument names used in khoj openai completion functions
- Match argument names passed to khoj openai completion funcs with
  arguments passed to langchain calls to OpenAI
- This simplifies the logic in the khoj openai completion funcs
2023-06-01 08:50:59 +05:30
Debanjum Singh Solanky
703a7c89c0 Reduce retry count and request timeout for faster response or failure
- Fix bug where both LangChain and Khoj retry requests 6 times each.
  So a total of 12 requests at >1 minute intervals for each chat
  response in case of OpenAI API being down

- Retrying too many times when the API is failing doesn't help
- The earlier 60 second request timeout was spacing out the interval
  between retries way too much. This slowed down chat response times
  quite a bit when API was being flaky

- With these updates you'll know if call to chat API failed in under a
  minute
2023-06-01 08:50:59 +05:30
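The idea above can be sketched as a single retry layer with a small attempt count and a short per-request timeout. The helper name `call_with_retries` and the default values (2 attempts, 20 second timeout) are hypothetical; the commit does not state the new values.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    request: Callable[[float], T],
    max_attempts: int = 2,   # hypothetical value, not from the commit
    timeout: float = 20.0,   # hypothetical value, not from the commit
    wait: float = 0.0,
) -> T:
    """Call request(timeout) up to max_attempts times, re-raising the last error."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return request(timeout)
        except Exception as error:
            last_error = error
            # Pause between attempts, but not after the final one.
            if wait > 0 and attempt < max_attempts - 1:
                time.sleep(wait)
    assert last_error is not None
    raise last_error
```

With no wait between attempts, the worst-case time before the caller learns of a failure is roughly `max_attempts * timeout`, so retrying in only one layer keeps that bound from compounding.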
Debanjum Singh Solanky
18081b3bc6 Use LangChain to call GPT over API 2023-06-01 08:50:59 +05:30
Debanjum Singh Solanky
277d2f5c96 Do not add "Notes:" suffix to chat messages when no notes retrieved
This was causing a spurious "Notes:" suffix to be added to Khoj Chat
responses
2023-06-01 08:50:59 +05:30
Debanjum Singh Solanky
334be4e600 Use LangChain to call OpenAI for Khoj Chat
- Use ChatModel and ChatOpenAI to call OpenAI chat model instead of
  using OpenAI package directly
- This is being done as part of migration to rely on LangChain for
  creating agents and managing their state
2023-06-01 08:50:59 +05:30
Debanjum Singh Solanky
efcf7d1508 Extract prompts as LangChain Prompt Templates into a separate module
Improves code modularity, cleanliness. Reduces bloat in GPT.py module
2023-06-01 08:50:58 +05:30
Debanjum Singh Solanky
b484953bb3 Import app state correctly to generate embeddings with OpenAI model
Resolves #216
2023-05-28 10:21:54 +05:30
Debanjum Singh Solanky
9cfaaf0941 Update docs to configure khoj.yml for using OpenAI model for embeddings 2023-05-28 10:21:54 +05:30
Debanjum Singh Solanky
a0d0dbaca7 Fix link to Khoj Obsidian Demo video in Readmes 2023-05-23 04:23:08 +05:30
Debanjum Singh Solanky
ebb5d7b8e5 Release Khoj version 0.6.2 2023-05-17 20:04:20 +05:30
Debanjum Singh Solanky
d02415edcc Write generated server id to env file when env file does not contain it 2023-05-17 19:38:44 +05:30
Debanjum Singh Solanky
dc0626856e Put the telemetry db in a separate directory by default 2023-05-17 18:58:47 +05:30
Debanjum
dc495babb3
Add Telemetry to Understand Khoj Usage
### Objective: 
Use telemetry to better understand Khoj usage.
This will motivate and prioritize work for Khoj.

Specific questions:
- Number of active deployments of khoj server
- How regularly is khoj used (hourly, daily, weekly etc)?
- How much is which feature used (chat, search)?
- Which UI interface is used most (obsidian, emacs, web ui)?

### Details
- Expose setting to disable telemetry logging in khoj.yml
- Create basic telemetry server to log data to a DB
- Log calls to Khoj API /search, /chat, /update endpoints
- Batch upload telemetry data to server at ~hourly interval
2023-05-17 19:09:50 +08:00
Debanjum Singh Solanky
55d72231b3 Generate docker image for telemetry server using Github workflow 2023-05-17 16:08:21 +05:30
Debanjum Singh Solanky
e9f04dc644 Add dockerfile to containerize telemetry server 2023-05-17 16:08:21 +05:30
Debanjum Singh Solanky
07b19964d4 Schedule jobs at (co-)prime intervals to reduce overlap in job runs 2023-05-17 16:08:21 +05:30
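A small worked example of why co-prime intervals reduce overlap: two jobs started together next coincide after the least common multiple of their intervals. The function name `first_overlap` and the interval values are illustrative assumptions, not the intervals used in Khoj.

```python
from math import gcd


def first_overlap(interval_a: int, interval_b: int) -> int:
    """Minutes until two jobs that started together next run at the same time (LCM)."""
    return interval_a * interval_b // gcd(interval_a, interval_b)


# Intervals sharing a factor coincide often; co-prime intervals coincide rarely.
print(first_overlap(60, 30))  # 60
print(first_overlap(61, 30))  # 1830
```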
Debanjum Singh Solanky
d42f0f5055 Add basic telemetry server for khoj 2023-05-17 16:08:21 +05:30
Debanjum Singh Solanky
134cce9d32 Batch upload telemetry data at regular interval instead of while querying 2023-05-17 16:08:21 +05:30
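Batching can be sketched as an in-memory buffer that request handlers append to and a scheduled job flushes. The class name `TelemetryBuffer`, the event fields and the `upload` callback are assumptions for illustration, not the actual Khoj helpers.

```python
from typing import Callable, Dict, List


class TelemetryBuffer:
    """Collect telemetry events in memory and upload them in one batch."""

    def __init__(self, upload: Callable[[List[Dict[str, str]]], None]) -> None:
        self._upload = upload
        self._events: List[Dict[str, str]] = []

    def log(self, api: str, client: str) -> None:
        # Cheap append on the request path; no network call here.
        self._events.append({"api": api, "client": client})

    def flush(self) -> int:
        """Upload buffered events; return how many were sent."""
        if not self._events:
            return 0
        batch = list(self._events)
        # If upload raises, the events stay buffered for the next flush.
        self._upload(batch)
        del self._events[: len(batch)]
        return len(batch)
```

A scheduler would call `flush()` at a regular interval, keeping the upload off the query path.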
Debanjum Singh Solanky
3ede919c66 Log usage of /search, /chat, /update API endpoints to telemetry server 2023-05-17 16:08:21 +05:30
Debanjum Singh Solanky
f2e89f6f46 Add khoj app helper methods to log app usage to a telemetry server 2023-05-17 16:08:21 +05:30
Debanjum Singh Solanky
9ca61d62ff Enable/disable logging telemetry by setting bool in khoj.yml config
We log usage telemetry by default, unless the setting is explicitly set in
khoj.yml
2023-05-15 23:26:38 +08:00
Debanjum Singh Solanky
131b8407b5 Allow Khoj Chat to respond to general queries not in reference notes
- Khoj chat will now respond to general queries if:
  1. no relevant reference notes available or
  2. when explicitly induced by prefixing the chat message with "@general"

- Previously Khoj Chat would often refuse to respond to
  general queries not answerable from reference notes or chat history

- Make chat quality tests more robust
  - Add more equivalent chat response options refusing to answer
  - Force haiku writing to not give any preamble, just the haiku
2023-05-12 18:42:40 +08:00
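The two conditions above can be sketched as a small routing function. The name `route_chat` and the returned mode labels are hypothetical; only the "@general" prefix comes from the commit message.

```python
from typing import List, Tuple

GENERAL_PREFIX = "@general"


def route_chat(message: str, reference_notes: List[str]) -> Tuple[str, str]:
    """Return (mode, cleaned_message), where mode is 'general' or 'notes'."""
    stripped = message.strip()
    # Case 2: the user explicitly asks for a general response.
    if stripped.startswith(GENERAL_PREFIX):
        return "general", stripped[len(GENERAL_PREFIX):].strip()
    # Case 1: no relevant reference notes were retrieved.
    if not reference_notes:
        return "general", stripped
    return "notes", stripped
```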
Debanjum Singh Solanky
cc75f986b2 Test text search index only updates on changes to text content 2023-05-12 17:37:34 +08:00
Debanjum Singh Solanky
f9ccce430e Allow configuring OpenAI chat model for Khoj chat
- Simplifies switching between different OpenAI chat models, e.g. GPT-4
- It was previously hard-coded to use gpt-3.5-turbo. Now it just
  defaults to using gpt-3.5-turbo, unless the chat-model field under
  conversation processor is updated in khoj.yml
2023-05-03 23:01:13 +08:00
Debanjum
f0253e2cbb
Include Filename, Entry Heading in All Compiled Entries to Improve Search Context
Merge pull request #214 from debanjum/add-filename-heading-to-compiled-entry-for-context

- Set filename as top heading in compiled org, markdown entries
  - Note: *Khoj was already indexing filenames in compiled markdown entries but they weren't set as top level headings but rather appended as bare text*. The updated structure should provide more schematic context of relevance
- Set entry heading as heading for compiled org, md entries, even if split by max tokens
- Snip prepended heading to avoid crossing model max_token limits
- Entries with no md headings should not get heading prefix prepended
2023-05-03 22:59:30 +08:00
Debanjum Singh Solanky
6b535cc345 Snip prepended heading to avoid crossing model max_token limits
Otherwise if heading > max_tokens then the search models will just see
a heading (with repeated filename) for each compiled entry and not
actual content.

100 characters should be sufficient to include filename (not path) and
entry heading. If longer, truncate it so that text unique to the entry
still reaches the model for search context
2023-05-03 22:53:13 +08:00
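A minimal sketch of the snipping step, using the 100 character limit mentioned above. The function names `snip_heading` and `compile_entry` and the newline separator are assumptions for illustration.

```python
def snip_heading(heading: str, max_chars: int = 100) -> str:
    """Keep at most max_chars characters of the prepended heading."""
    return heading[:max_chars]


def compile_entry(heading: str, body: str, max_chars: int = 100) -> str:
    """Prepend the snipped heading so the entry body still reaches the model."""
    return f"{snip_heading(heading, max_chars)}\n{body}"
```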
Debanjum Singh Solanky
02aeee60aa Set filename as top heading of org entries for better search context
Previously filename was only being appended to markdown entries.

Test filename getting prepended to compiled entry as heading
2023-05-03 22:53:13 +08:00
Debanjum Singh Solanky
94825a70b9 Set heading of md entries to improve search context for long entries
Otherwise if a markdown entry is longer than max_tokens, the split
entries (apart from first one) do not get their heading context set
2023-05-03 22:53:13 +08:00
Debanjum Singh Solanky
5de04621b5 Set filename as top heading of md entries for better search context
Previously filename was appended to the end of the compiled entry.
This didn't provide appropriate structured context

Test filename getting prepended as heading to compiled entry
2023-05-03 22:50:31 +08:00
Debanjum Singh Solanky
0e3fb59e09 Entries with no md headings should not get heading prefix prepended
Files with no headings would previously get their entries prefixed
with a markdown heading prefix (#)
2023-05-03 22:50:31 +08:00
Debanjum Singh Solanky
45a991d75c Prepend entry heading to all compiled org snippets to improve search context
All compiled snippets split by max tokens (apart from first) do not
get the heading as context.

This limits search context required to retrieve these continuation
entries
2023-05-03 22:50:31 +08:00
Debanjum Singh Solanky
3386cc92b5 Fix khoj server config update in khoj.el by unquoting list to cl-push to
- cl-push expects a generalized variable. Else throws (setf quote)
  undefined warning
- This results in the config call failing on calling khoj entrypoint
2023-05-03 15:10:56 +08:00
Debanjum Singh Solanky
948a4274e4 Fix documentation strings and simplify not null checks 2023-05-02 21:47:50 +08:00
Debanjum Singh Solanky
731ef5688f Use cl-pushnew to fix byte-compile errors with using add-to-list 2023-05-02 21:47:38 +08:00
Debanjum Singh Solanky
f046523b33 Improve khoj.el messages to convey state of khoj server
- Remove waiting for server message as it hides the messages from the
  server
- Fix the nil messages that were being rendered, by checking before
  showing messages from server
- Consistently prefix messages from khoj with khoj.el
2023-04-28 11:15:13 +08:00
Debanjum Singh Solanky
76df393eb5 Only call khoj server configure API from khoj.el when config updated
Previously khoj.el was calling the server configure API even when
config was same as before.
This had broken the khoj search-as-you-type experience in Emacs

Also show more details to user about what in khoj is being configured
2023-04-27 20:45:16 +08:00
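The change-detection idea can be sketched as follows. The actual fix lives in khoj.el (Emacs Lisp); this Python version, with the hypothetical class `ServerConfigurator` and callback `send_config`, only illustrates the logic.

```python
import copy
from typing import Any, Callable, Dict, Optional


class ServerConfigurator:
    """Send config to the server only when it differs from the last one sent."""

    def __init__(self, send_config: Callable[[Dict[str, Any]], None]) -> None:
        self._send_config = send_config
        self._last_sent: Optional[Dict[str, Any]] = None

    def configure(self, config: Dict[str, Any]) -> bool:
        """Return True if the config was sent, False if it was unchanged."""
        if self._last_sent is not None and config == self._last_sent:
            return False
        self._send_config(config)
        # Deep copy so later mutation of the caller's dict is still detected.
        self._last_sent = copy.deepcopy(config)
        return True
```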
Debanjum Singh Solanky
ceae06ae9d Fix khoj.el compilation warnings around unused variables 2023-04-27 20:45:16 +08:00
Debanjum Singh Solanky
8269adf849 Refactor khoj-setup in khoj.el for readability. No functional change 2023-04-27 20:45:00 +08:00
Debanjum Singh Solanky
865d12b6f2 Fix escaping quote in chat references to prevent it breaking out of html 2023-04-27 20:45:00 +08:00
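A sketch of the kind of escaping involved, using Python's standard `html.escape`. The actual fix may live in the web interface's own code; the function name `render_reference` and the markup are assumptions for illustration.

```python
import html


def render_reference(reference: str) -> str:
    """Wrap a reference note in a span, escaping it for use in an HTML attribute."""
    # quote=True also escapes double and single quotes, not just &, < and >.
    safe = html.escape(reference, quote=True)
    return f'<span class="reference" title="{safe}">ref</span>'
```

Without quote escaping, a `"` inside the reference text would close the attribute early and let the rest of the text be parsed as markup.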