125 commits

Author SHA1 Message Date
Sean Hatfield
f40309cfdb
Add id to all metadata to prevent errors in frontend document picker ()
add id to all metadata to prevent errors in frontend document picker

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2023-11-16 14:36:26 -08:00
timothycarambat
1e3d82e184 patch collector script 2023-11-16 10:25:23 -08:00
timothycarambat
c5dc68633b patch link scrape tool schema 2023-11-14 16:41:39 -08:00
Timothy Carambat
5441717294
normalize parser struct for all file types () 2023-11-01 16:44:02 -07:00
Francisco Bischoff
26dba59249
mbox parsing improvements v1 ()
* mbox parsing improvements v1

* autobots roll out!
2023-10-30 11:57:33 -07:00
Timothy Carambat
18798c5b64
prevent deletion of documents not in hotdir via directory traversal ()
resolves 
2023-09-29 11:04:47 -07:00
Timothy Carambat
a505928934
Display better error messages from document processor ()
pass messages to frontend on success/failure
resolves 
2023-09-18 16:50:20 -07:00
Timothy Carambat
3e78476739
Franzbischoff document improvements ()
* cosmetic changes to be compatible to hadolint

* common configuration for most editors until better plugins come up

* Changes on PDF metadata, using PyMuPDF (faster and more compatible)

* small changes on other file ingestions in order to try to keep the fields equal

* Lint, review, and review

* fixed unknown chars

* Use PyMuPDF for pdf loading for 200% speed increase
linting

---------

Co-authored-by: Francisco Bischoff <franzbischoff@gmail.com>
Co-authored-by: Francisco Bischoff <984592+franzbischoff@users.noreply.github.com>
2023-09-18 16:21:37 -07:00
Melroy van den Berg
16b8330fbf
Update requirements.txt ()
Upgrade fake-useragent to latest version (v1.2.1). Disclaimer: I'm the package maintainer.
2023-08-14 14:38:14 -07:00
Timothy Carambat
b42493c6de
Split large PDFs into subfolder in documents ()
append time value to folder name to prevent duplicate uploads
2023-08-03 18:57:50 -07:00
AntonioCiolino
31e5db7490
Twitter Feature ()
* .

* twitter feature update

* Key validation and operation
2023-07-06 14:05:50 -07:00
Timothy Carambat
d7315b0e53
be able to parse relative and FQDN links from root reliably () 2023-07-05 14:40:54 -07:00
mplawner
3efe55a720
Added mbox support ()
* Update filetypes.py

Added mbox format

* Created new file

Added support for mbox files as used by many email services, including Google Takeout's Gmail archive.

* Update filetypes.py

* Update as_mbox.py
2023-06-25 18:11:05 -07:00
AntonioCiolino
a52b0ae655
Updated Link scraper to avoid NoneType error. ()
* Enable web scraping based on a URL and a simple filter.

* ignore yarn

* Updated Link scraper to avoid NoneType error.
2023-06-19 12:07:26 -07:00
frasergr
4079020de0
dockerfile cleanup; enforce text LF line endings () 2023-06-17 20:18:01 -07:00
AntonioCiolino
e7ba028497
Enable web scraping based on a URL and a simple filter. () 2023-06-16 17:29:11 -07:00
timothycarambat
81b2159329 reorder docs 2023-06-16 17:26:42 -07:00
Timothy Carambat
c4eb46ca19
Upload and process documents via UI + document processor in docker image ()
* implement dnd uploader
show file upload progress
write files to hot directory
build simple flaskAPI to process files one off

* move document processor calls to util
build out dockerfile to run both procs at the same time
update UI to check for document processor before upload
* disable pragma update on boot
* dockerfile changes

* add filetype restrictions based on python app support response and show rejected files in the UI

* cleanup

* stub migrations on boot to prevent exit condition

* update CF template for AWS deploy
2023-06-16 16:01:27 -07:00
AntonioCiolino
537a6a91d2
Update __HOTDIR__.md ()
fixed typo for text.
2023-06-16 11:17:18 -07:00
Skid Vis
4118c9dcf3
Blocks images in sitemaps from being parsed. ()
* Adds ability to import sitemaps to include a website

* adds example sitemap url

* adds filter to bypass common image formats

* moves filetype ignoring to sitemap script
2023-06-14 23:00:03 -07:00
Skid Vis
bd32f97a21
Adds ability to import sitemaps to include a website ()
* Adds ability to import sitemaps to include a website

* adds example sitemap url
2023-06-14 11:04:17 -07:00
frasergr
9f33b3dfcb
Docker support ()
* Updates for Linux for frontend/server

* frontend/server docker

* updated Dockerfile for deps related to node vectordb

* updates for collector in docker

* docker deps for ODT processing

* ignore another collector dir

* storage mount improvements; run as UID

* fix pypandoc version typo

* permissions fixes
2023-06-13 11:26:11 -07:00
Fabio
d954d7a3d5
Fix pypandoc issue in requirements.txt ()
Co-authored-by: Carvalho, Fabio <Fabio_Carvalho@comcast.com>
2023-06-12 11:21:11 -07:00
timothycarambat
728eaff773 fix typo 2023-06-09 11:23:53 -07:00
timothycarambat
27c58541bd initial commit 2023-06-03 19:28:07 -07:00