Commit graph

8 commits

Author SHA1 Message Date
Timothy Carambat
d1ca16f7f8
Add tokenizer improvements via Singleton class and estimation ()
* Add tokenizer improvements via Singleton class
linting

* dev build

* Estimation fallback when string exceeds a fixed byte size

* Add notice to tiktoken on backend
2025-01-30 17:55:03 -08:00
Sean Hatfield
9bc01afa7d
Fix scraping failed bug in link/bulk link scrapers ()
* fix scraping failed bug in link/bulk link scrapers

* reset submodule

* swap to networkidle2 as a safe middle ground for SPA and API-loaded pages that also does not hang on request-heavy pages

* lint

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
2024-12-11 14:01:52 -08:00
timothycarambat
ab6f03ce1c
linting
2024-10-18 11:44:14 -07:00
Sean Hatfield
41522cdfb4
Handle non-ascii characters in single and bulk link scraper URLs ()
handle non-ascii characters in urls
2024-10-17 17:04:00 -07:00
Sean Hatfield
2797298507
Fix depth handling in bulk link scraper ()
fix depth handling in bulk link scraper
2024-08-12 11:44:35 -07:00
Sean Hatfield
fc375f4036
[FIX] Bulk link scraper bug fix ()
patch website depth data connector to work for other links that are not root url
2024-07-01 16:59:28 -07:00
timothycarambat
b5ac944475
patch: bulk-scraper, update when folder is made and path creation params
2024-05-14 12:57:23 -07:00
Sean Hatfield
612a7e1662
[FEAT] Website depth scraping data connector ()
* WIP website depth scraping, (sort of works)

* website depth data connector stable + add maxLinks option

* linting + loading small ui tweak

* refactor website depth data connector for stability, speed, & readability

* patch: remove console log
Guard clause on URL validity check
reasonable overrides

---------

Co-authored-by: Timothy Carambat <rambat1010@gmail.com>
2024-05-14 12:49:14 -07:00