mirror of https://github.com/khoj-ai/khoj.git synced 2024-12-23 20:58:09 +00:00
Commit graph

10 commits

Author SHA1 Message Date
sabaimran
ad197be70c Fix PDFs unit test, skip OCR 2024-10-20 22:25:41 -07:00
Debanjum Singh Solanky
22f6db0a6b Upgrade RapidOCR and enable for Python 3.12. Fix PDF OCR test 2024-06-22 16:01:55 +05:30
Raghav Tirumale
d4e5c95711 Add Ability to Summarize Documents ()
* Uses entire file text and summarizer model to generate document summary.
* Uses the contents of the user's query to create a tailored summary.
* Integrates with File Filters for a better UX.
2024-06-18 19:31:07 +05:30
Debanjum Singh Solanky
a627f56a64 Remove unused Entry to Jsonl converter from text to entry class, tests
This was earlier used when the index was plaintext jsonl file. Now
that documents are indexed in a DB this func is not required.

Simplify org,md,pdf,plaintext to entries tests by removing the entry
to jsonl conversion step
2024-04-04 02:41:55 +05:30
Debanjum Singh Solanky
28105ee027 Create wrapper function to get entries from org, md, pdf & text files
- Convert extract_org_entries function to actually extract org entries
  Previously it was extracting intermediary org-node objects instead
  Now it extracts the org-node objects from files and converts them
  into entries
- Create separate, new function to extract_org_nodes from files
- Similarly create wrapper funcs for md, pdf, plaintext to entries

- Update org, md, pdf, plaintext to entries tests to use the new
  simplified wrapper function to extract org entries
2024-04-04 02:41:55 +05:30
sabaimran
79913d4c17 Add isort to the pre-commit configuration and apply it to the whole project ()
* Apply isort to the entire repository
* Fix missing import issues in text_to_entries
* Fix imports in migration files
2023-12-28 18:04:02 +05:30
sabaimran
1e2af083f0 Rename the data_sources module to content 2023-11-21 22:11:32 -08:00
sabaimran
ec06d2c446 Move data indexer files into a separate folder under processor. Update assoc UTs 2023-11-16 17:19:55 -08:00
sabaimran
3d6e8d53fe Try adding dependencies for libgl in order to run OCR in github action unit tests 2023-11-05 15:09:40 -08:00
sabaimran
fdd727712f Rename test files from x_to_jsonl to x_to_entries 2023-11-05 14:33:07 -08:00
Renamed from tests/test_pdf_to_jsonl.py