Timothy Carambat
d1ca16f7f8
Add tokenizer improvements via Singleton class and estimation ( #3072 )
* Add tokenizer improvements via Singleton class
linting
* dev build
* Estimation fallback when string exceeds a fixed byte size
* Add notice to tiktoken on backend
2025-01-30 17:55:03 -08:00
Sean Hatfield
9bc01afa7d
Fix scraping failed bug in link/bulk link scrapers ( #2807 )
* fix scraping failed bug in link/bulk link scrapers
* reset submodule
* swap to networkidle2 as a safe mix for SPA and API-loaded pages that also does not hang on request-heavy pages
* lint
---------
Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-12-11 14:01:52 -08:00
timothycarambat
ab6f03ce1c
linting
2024-10-18 11:44:14 -07:00
Sean Hatfield
41522cdfb4
Handle non-ascii characters in single and bulk link scraper URLs ( #2495 )
handle non-ascii characters in urls
2024-10-17 17:04:00 -07:00
Sean Hatfield
2797298507
Fix depth handling in bulk link scraper ( #2096 )
fix depth handling in bulk link scraper
2024-08-12 11:44:35 -07:00
Sean Hatfield
fc375f4036
[FIX] Bulk link scraper bug fix ( #1800 )
patch website depth data connector to work for links other than the root URL
2024-07-01 16:59:28 -07:00
timothycarambat
b5ac944475
patch: bulk-scraper, update when folder is made and path creation params
2024-05-14 12:57:23 -07:00
Sean Hatfield
612a7e1662
[FEAT] Website depth scraping data connector ( #1191 )
* WIP website depth scraping, (sort of works)
* website depth data connector stable + add maxLinks option
* linting + loading small ui tweak
* refactor website depth data connector for stability, speed, & readability
* patch: remove console log
Guard clause on URL validity check
reasonable overrides
---------
Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2024-05-14 12:49:14 -07:00