Mirror of https://github.com/Mintplex-Labs/anything-llm.git (synced 2025-04-17 18:18:11 +00:00)
Document Processor v2 (#442)
* wip: init refactor of document processor to JS
* add NodeJS PDF support
* wip: parity with python processor; feat: add pptx support
* fix: forgot files
* Remove python scripts totally
* wip: update docker to boot new collector
* add package.json support
* update dockerfile for new build
* update gitignore and linting
* add more protections on file lookup
* update package.json
* test build
* update docker commands to use cap-add=SYS_ADMIN so web scraper can run; update all scripts to reflect this; remove docker build for branch
Parent: 5f6a013139
Commit: 719521c307
69 changed files with 3682 additions and 1925 deletions
Changed paths include:
- README.md
- cloud-deployments/aws/cloudformation
- cloud-deployments/digitalocean/terraform
- cloud-deployments/gcp/deployment
- collector/: .env.example, .gitignore, .nvmrc, README.md, api.py, hotdir/, index.js, main.py, nodemon.json, package.json, processLink/, processSingleFile/, requirements.txt, scripts/ (__init__.py, gitbook.py, link.py, link_utils.py, medium.py, medium_utils.py, sitemap.py, substack.py, substack_utils.py, twitter.py, utils.py, watch/, youtube.py, yt_utils.py), utils/, watch.py, wsgi.py, yarn.lock
- docker/
- package.json
- server/

README.md (12 changes)
|
@ -74,10 +74,10 @@ Some cool features of AnythingLLM
|
|||
|
||||
### Technical Overview
|
||||
This monorepo consists of three main sections:
|
||||
- `collector`: Python tools that enable you to quickly convert online resources or local documents into LLM useable format.
|
||||
- `frontend`: A viteJS + React frontend that you can run to easily create and manage all your content the LLM can use.
|
||||
- `server`: A nodeJS + express server to handle all the interactions and do all the vectorDB management and LLM interactions.
|
||||
- `server`: A NodeJS express server to handle all the interactions and do all the vectorDB management and LLM interactions.
|
||||
- `docker`: Docker instructions and build process + information for building from source.
|
||||
- `collector`: NodeJS express server that processes and parses documents from the UI.
|
||||
|
||||
### Minimum Requirements
|
||||
> [!TIP]
|
||||
|
@ -86,7 +86,6 @@ This monorepo consists of three main sections:
|
|||
> you will be storing (documents, vectors, models, etc). Minimum 10GB recommended.
|
||||
|
||||
- `yarn` and `node` on your machine
|
||||
- `python` 3.9+ for running scripts in `collector/`.
|
||||
- access to an LLM running locally or remotely.
|
||||
|
||||
*AnythingLLM by default uses a built-in vector database powered by [LanceDB](https://github.com/lancedb/lancedb)
|
||||
|
@ -112,6 +111,7 @@ export STORAGE_LOCATION="/var/lib/anythingllm" && \
|
|||
mkdir -p $STORAGE_LOCATION && \
|
||||
touch "$STORAGE_LOCATION/.env" && \
|
||||
docker run -d -p 3001:3001 \
|
||||
--cap-add SYS_ADMIN \
|
||||
-v ${STORAGE_LOCATION}:/app/server/storage \
|
||||
-v ${STORAGE_LOCATION}/.env:/app/server/.env \
|
||||
-e STORAGE_DIR="/app/server/storage" \
|
||||
|
@ -141,12 +141,6 @@ To boot the frontend locally (run commands from root of repo):
|
|||
|
||||
[Learn about vector caching](./server/storage/vector-cache/VECTOR_CACHE.md)
|
||||
|
||||
## Standalone scripts
|
||||
|
||||
This repo contains standalone scripts you can run to collect data from a YouTube channel, Medium articles, local text files, Word documents, and more. This is where you will use the `collector/` part of the repo.
|
||||
|
||||
[Go set up and run collector scripts](./collector/README.md)
|
||||
|
||||
## Contributing
|
||||
- create issue
|
||||
- create PR with branch name format of `<issue number>-<short name>`
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
# How to deploy a private AnythingLLM instance on AWS
|
||||
|
||||
With an AWS account you can easily deploy a private AnythingLLM instance on AWS. This will create a url that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys and they will not be exposed - however if you want your instance to be protected it is highly recommend that you set the `AUTH_TOKEN` and `JWT_SECRET` variables in the `docker/` ENV.
|
||||
With an AWS account you can easily deploy a private AnythingLLM instance on AWS. This will create a URL that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys and they will not be exposed - however, if you want your instance to be protected, it is highly recommended that you set a password once setup is complete.
|
||||
|
||||
**Quick Launch (EASY)**
|
||||
1. Log in to your AWS account
|
||||
|
@ -30,12 +30,11 @@ The output of this cloudformation stack will be:
|
|||
|
||||
**Requirements**
|
||||
- An AWS account with billing information.
|
||||
- AnythingLLM (GUI + document processor) requires at minimum a t2.small instance and a 10GiB SSD volume
|
||||
|
||||
## Please read this notice before submitting issues about your deployment
|
||||
|
||||
**Note:**
|
||||
Your instance will not be available instantly. Depending on the instance size you launched with it can take varying amounts of time to fully boot up.
|
||||
Your instance will not be available instantly. Depending on the instance size you launched with it can take 5-10 minutes to fully boot up.
|
||||
|
||||
If you want to check the instance's progress, navigate to [your deployed EC2 instances](https://us-west-1.console.aws.amazon.com/ec2/home) and connect to your instance via SSH in browser.
|
||||
|
||||
|
|
|
@ -89,7 +89,7 @@
|
|||
"touch /home/ec2-user/anythingllm/.env\n",
|
||||
"sudo chown ec2-user:ec2-user -R /home/ec2-user/anythingllm\n",
|
||||
"docker pull mintplexlabs/anythingllm:master\n",
|
||||
"docker run -d -p 3001:3001 -v /home/ec2-user/anythingllm:/app/server/storage -v /home/ec2-user/anythingllm/.env:/app/server/.env -e STORAGE_DIR=\"/app/server/storage\" mintplexlabs/anythingllm:master\n",
|
||||
"docker run -d -p 3001:3001 --cap-add SYS_ADMIN -v /home/ec2-user/anythingllm:/app/server/storage -v /home/ec2-user/anythingllm/.env:/app/server/.env -e STORAGE_DIR=\"/app/server/storage\" mintplexlabs/anythingllm:master\n",
|
||||
"echo \"Container ID: $(sudo docker ps --latest --quiet)\"\n",
|
||||
"export ONLINE=$(curl -Is http://localhost:3001/api/ping | head -n 1|cut -d$' ' -f2)\n",
|
||||
"echo \"Health check: $ONLINE\"\n",
|
||||
|
|
|
@ -1,8 +1,6 @@
|
|||
# How to deploy a private AnythingLLM instance on DigitalOcean using Terraform
|
||||
|
||||
With a DigitalOcean account, you can easily deploy a private AnythingLLM instance using Terraform. This will create a URL that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys, and they will not be exposed. However, if you want your instance to be protected, it is highly recommended that you set the `AUTH_TOKEN` and `JWT_SECRET` variables in the `docker/` ENV.
|
||||
|
||||
[Refer to .env.example](../../../docker/HOW_TO_USE_DOCKER.md) for data format.
|
||||
With a DigitalOcean account, you can easily deploy a private AnythingLLM instance using Terraform. This will create a URL that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys, and they will not be exposed. However, if you want your instance to be protected, it is highly recommended that you set a password once setup is complete.
|
||||
|
||||
The output of this Terraform configuration will be:
|
||||
- 1 DigitalOcean Droplet
|
||||
|
@ -12,8 +10,6 @@ The output of this Terraform configuration will be:
|
|||
- A DigitalOcean account with billing information
|
||||
- Terraform installed on your local machine
|
||||
- Follow the instructions in the [official Terraform documentation](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli) for your operating system.
|
||||
- `.env` file that is filled out with your settings and set up in the `docker/` folder
|
||||
|
||||
|
||||
## How to deploy on DigitalOcean
|
||||
Open your terminal and navigate to the `digitalocean/terraform` folder
|
||||
|
@ -36,7 +32,7 @@ terraform destroy
|
|||
## Please read this notice before submitting issues about your deployment
|
||||
|
||||
**Note:**
|
||||
Your instance will not be available instantly. Depending on the instance size you launched with it can take anywhere from 10-20 minutes to fully boot up.
|
||||
Your instance will not be available instantly. Depending on the instance size you launched with it can take anywhere from 5-10 minutes to fully boot up.
|
||||
|
||||
If you want to check the instance's progress, navigate to [your deployed instances](https://cloud.digitalocean.com/droplets) and connect to your instance via SSH in browser.
|
||||
|
||||
|
|
|
@ -12,7 +12,7 @@ mkdir -p /home/anythingllm
|
|||
touch /home/anythingllm/.env
|
||||
|
||||
sudo docker pull mintplexlabs/anythingllm:master
|
||||
sudo docker run -d -p 3001:3001 -v /home/anythingllm:/app/server/storage -v /home/anythingllm/.env:/app/server/.env -e STORAGE_DIR="/app/server/storage" mintplexlabs/anythingllm:master
|
||||
sudo docker run -d -p 3001:3001 --cap-add SYS_ADMIN -v /home/anythingllm:/app/server/storage -v /home/anythingllm/.env:/app/server/.env -e STORAGE_DIR="/app/server/storage" mintplexlabs/anythingllm:master
|
||||
echo "Container ID: $(sudo docker ps --latest --quiet)"
|
||||
|
||||
export ONLINE=$(curl -Is http://localhost:3001/api/ping | head -n 1|cut -d$' ' -f2)
|
||||
|
|
|
@ -1,8 +1,6 @@
|
|||
# How to deploy a private AnythingLLM instance on GCP
|
||||
|
||||
With a GCP account you can easily deploy a private AnythingLLM instance on GCP. This will create a url that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys and they will not be exposed - however if you want your instance to be protected it is highly recommend that you set the `AUTH_TOKEN` and `JWT_SECRET` variables in the `docker/` ENV.
|
||||
|
||||
[Refer to .env.example](../../../docker/HOW_TO_USE_DOCKER.md) for data format.
|
||||
With a GCP account you can easily deploy a private AnythingLLM instance on GCP. This will create a URL that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys and they will not be exposed - however, if you want your instance to be protected, it is highly recommended that you set a password once setup is complete.
|
||||
|
||||
The output of this cloudformation stack will be:
|
||||
- 1 GCP VM
|
||||
|
@ -11,19 +9,15 @@ The output of this cloudformation stack will be:
|
|||
|
||||
**Requirements**
|
||||
- A GCP account with billing information.
|
||||
- AnythingLLM (GUI + document processor) requires at minimum an n1-standard-1 instance and a 10GiB SSD volume
|
||||
- `.env` file that is filled out with your settings and set up in the `docker/` folder
|
||||
|
||||
## How to deploy on GCP
|
||||
Open your terminal
|
||||
1. Generate your specific cloudformation document by running `yarn generate:gcp_deployment` from the project root directory.
|
||||
2. This will create a new file (`gcp_deploy_anything_llm_with_env.yaml`) in the `gcp/deployment` folder.
|
||||
3. Log in to your GCP account using the following command:
|
||||
1. Log in to your GCP account using the following command:
|
||||
```
|
||||
gcloud auth login
|
||||
```
|
||||
|
||||
4. After successful login, Run the following command to create a deployment using the Deployment Manager CLI:
|
||||
2. After successful login, run the following command to create a deployment using the Deployment Manager CLI:
|
||||
|
||||
```
|
||||
|
||||
|
@ -57,5 +51,4 @@ If you want to check the instances progress, navigate to [your deployed instance
|
|||
|
||||
Once connected, run `sudo tail -f /var/log/cloud-init-output.log` and wait for the log to confirm that deployment of the docker image has completed.
|
||||
|
||||
|
||||
Additionally, your use of this deployment process means you are responsible for any costs of these GCP resources fully.
|
||||
Additionally, your use of this deployment process means you are fully responsible for any costs of these GCP resources.
|
|
@ -34,7 +34,7 @@ resources:
|
|||
touch /home/anythingllm/.env
|
||||
|
||||
sudo docker pull mintplexlabs/anythingllm:master
|
||||
sudo docker run -d -p 3001:3001 -v /home/anythingllm:/app/server/storage -v /home/anythingllm/.env:/app/server/.env -e STORAGE_DIR="/app/server/storage" mintplexlabs/anythingllm:master
|
||||
sudo docker run -d -p 3001:3001 --cap-add SYS_ADMIN -v /home/anythingllm:/app/server/storage -v /home/anythingllm/.env:/app/server/.env -e STORAGE_DIR="/app/server/storage" mintplexlabs/anythingllm:master
|
||||
echo "Container ID: $(sudo docker ps --latest --quiet)"
|
||||
|
||||
export ONLINE=$(curl -Is http://localhost:3001/api/ping | head -n 1|cut -d$' ' -f2)
|
||||
|
|
|
@ -1,61 +0,0 @@
|
|||
import fs from 'fs';
|
||||
import { fileURLToPath } from 'url';
|
||||
import path, { dirname } from 'path';
|
||||
import { exit } from 'process';
|
||||
const __dirname = dirname(fileURLToPath(import.meta.url));
|
||||
const REPLACEMENT_KEY = '!SUB::USER::CONTENT!'
|
||||
|
||||
const envPath = path.resolve(__dirname, `../../../docker/.env`)
|
||||
const envFileExists = fs.existsSync(envPath);
|
||||
|
||||
const chalk = {
|
||||
redBright: function (text) {
|
||||
return `\x1b[31m${text}\x1b[0m`
|
||||
},
|
||||
cyan: function (text) {
|
||||
return `\x1b[36m${text}\x1b[0m`
|
||||
},
|
||||
greenBright: function (text) {
|
||||
return `\x1b[32m${text}\x1b[0m`
|
||||
},
|
||||
blueBright: function (text) {
|
||||
return `\x1b[34m${text}\x1b[0m`
|
||||
}
|
||||
}
|
||||
|
||||
if (!envFileExists) {
|
||||
console.log(chalk.redBright('[ABORT]'), 'You do not have an .env file in your ./docker/ folder. You need to create it first.');
|
||||
console.log('You can start by running', chalk.cyan('cp -n ./docker/.env.example ./docker/.env'))
|
||||
exit(1);
|
||||
}
|
||||
|
||||
// Remove comments
|
||||
// Remove UID,GID,etc
|
||||
// Remove empty strings
|
||||
// Split into array
|
||||
const settings = fs.readFileSync(envPath, "utf8")
|
||||
.replace(/^#.*\n?/gm, '')
|
||||
.replace(/^UID.*\n?/gm, '')
|
||||
.replace(/^GID.*\n?/gm, '')
|
||||
.replace(/^CLOUD_BUILD.*\n?/gm, '')
|
||||
.replace(/^\s*\n/gm, "")
|
||||
.split('\n')
|
||||
.filter((i) => !!i);
|
||||
const formattedSettings = settings.map((i, index) => index === 0 ? i + '\n' : ' ' + i).join('\n');
|
||||
|
||||
// Read the existing GCP Deployment Manager template
|
||||
const templatePath = path.resolve(__dirname, `gcp_deploy_anything_llm.yaml`);
|
||||
const templateString = fs.readFileSync(templatePath, "utf8");
|
||||
|
||||
// Update the metadata section with the UserData content
|
||||
const updatedTemplateString = templateString.replace(REPLACEMENT_KEY, formattedSettings);
|
||||
|
||||
// Save the updated GCP Deployment Manager template
|
||||
const output = path.resolve(__dirname, `gcp_deploy_anything_llm_with_env.yaml`);
|
||||
fs.writeFileSync(output, updatedTemplateString, "utf8");
|
||||
|
||||
console.log(chalk.greenBright('[SUCCESS]'), 'Deploy AnythingLLM on GCP Deployment Manager using your template document.');
|
||||
console.log(chalk.greenBright('File Created:'), 'gcp_deploy_anything_llm_with_env.yaml in the output directory.');
|
||||
console.log(chalk.blueBright('[INFO]'), 'Refer to the GCP Deployment Manager documentation for how to use this file.');
|
||||
|
||||
exit();
|
|
@ -1 +0,0 @@
|
|||
GOOGLE_APIS_KEY=
|
collector/.gitignore (vendored, 10 changes)
|
@ -1,8 +1,6 @@
|
|||
outputs/*/*.json
|
||||
hotdir/*
|
||||
hotdir/processed/*
|
||||
hotdir/failed/*
|
||||
!hotdir/__HOTDIR__.md
|
||||
!hotdir/processed
|
||||
!hotdir/failed
|
||||
|
||||
yarn-error.log
|
||||
!yarn.lock
|
||||
outputs
|
||||
scripts
|
||||
|
|
collector/.nvmrc (new file, 1 line)
|
@ -0,0 +1 @@
|
|||
v18.13.0
|
|
@ -1,62 +0,0 @@
|
|||
# How to collect data for vectorizing
|
||||
This process should be run first. This will enable you to collect a ton of data across various sources. Currently the following services are supported:
|
||||
- [x] YouTube Channels
|
||||
- [x] Medium
|
||||
- [x] Substack
|
||||
- [x] Arbitrary Link
|
||||
- [x] Gitbook
|
||||
- [x] Local Files (.txt, .pdf, etc) [See full list](./hotdir/__HOTDIR__.md)
|
||||
_these resources are under development or require PR_
|
||||
- Twitter
|
||||

|
||||
|
||||
### Requirements
|
||||
- [ ] Python 3.8+
|
||||
- [ ] Google Cloud Account (for YouTube channels)
|
||||
- [ ] `brew install pandoc` [pandoc](https://pandoc.org/installing.html) (for .ODT document processing)
|
||||
|
||||
### Setup
|
||||
This example will be using python3.9, but it will work with 3.8+. Tested on macOS. Untested on Windows.
|
||||
- install virtualenv for python3.8+ first before any other steps. `python3.9 -m pip install virtualenv`
|
||||
- `cd collector` from root directory
|
||||
- `python3.9 -m virtualenv v-env`
|
||||
- `source v-env/bin/activate`
|
||||
- `pip install -r requirements.txt`
|
||||
- `cp .env.example .env`
|
||||
- `python main.py` for interactive collection or `python watch.py` to process local documents.
|
||||
- Select the option you want and follow the prompts - Done!
|
||||
- run `deactivate` to get back to regular shell
|
||||
|
||||
### Outputs
|
||||
All JSON file data is cached in the `output/` folder. This is to prevent redundant API calls to services which may have rate limits or quota caps. Clearing out the `output/` folder will execute the script as if there was no cache.
|
||||
|
||||
As files are processed you will see data being written to both the `collector/outputs` folder as well as the `server/documents` folder. Later in this process, once you boot up the server you will then bulk vectorize this content from a simple UI!
|
||||
|
||||
If collection fails at any point in the process it will pick up where it last bailed out so you are not reusing credits.
|
||||
|
||||
### Running the document processing API locally
|
||||
From the `collector` directory with the `v-env` active run `flask run --host '0.0.0.0' --port 8888`.
|
||||
Now uploads from the frontend will be processed as if you ran the `watch.py` script manually.
|
||||
|
||||
**Docker**: If you run this application via docker the API is already started for you and no additional action is needed.
|
||||
|
||||
### How to get a Google Cloud API Key (YouTube data collection only)
|
||||
**required to fetch YouTube transcripts and data**
|
||||
- Have a google account
|
||||
- [Visit the GCP Cloud Console](https://console.cloud.google.com/welcome)
|
||||
- Click on dropdown in top right > Create new project. Name it whatever you like
|
||||
- 
|
||||
- [Enable YouTube Data APIV3](https://console.cloud.google.com/apis/library/youtube.googleapis.com)
|
||||
- Once enabled generate a Credential key for this API
|
||||
- Paste your key after `GOOGLE_APIS_KEY=` in your `collector/.env` file.
|
||||
|
||||
### Using the Twitter API
|
||||
**required to get data from Twitter with tweepy**
|
||||
- Go to https://developer.twitter.com/en/portal/dashboard with your twitter account
|
||||
- Create a new Project App
|
||||
- Get your 4 keys and place them in your `collector/.env` file
|
||||
* TW_CONSUMER_KEY
|
||||
* TW_CONSUMER_SECRET
|
||||
* TW_ACCESS_TOKEN
|
||||
* TW_ACCESS_TOKEN_SECRET
|
||||
Populate the `.env` with these values.
|
|
@ -1,32 +0,0 @@
|
|||
import os
|
||||
from flask import Flask, json, request
|
||||
from scripts.watch.process_single import process_single
|
||||
from scripts.watch.filetypes import ACCEPTED_MIMES
|
||||
from scripts.link import process_single_link
|
||||
api = Flask(__name__)
|
||||
|
||||
WATCH_DIRECTORY = "hotdir"
|
||||
@api.route('/process', methods=['POST'])
|
||||
def process_file():
|
||||
content = request.json
|
||||
target_filename = os.path.normpath(content.get('filename')).lstrip(os.pardir + os.sep)
|
||||
print(f"Processing {target_filename}")
|
||||
success, reason = process_single(WATCH_DIRECTORY, target_filename)
|
||||
return json.dumps({'filename': target_filename, 'success': success, 'reason': reason})
|
||||
|
||||
@api.route('/process-link', methods=['POST'])
|
||||
async def process_link():
|
||||
content = request.json
|
||||
url = content.get('link')
|
||||
print(f"Processing {url}")
|
||||
success, reason = await process_single_link(url)
|
||||
return json.dumps({'url': url, 'success': success, 'reason': reason})
|
||||
|
||||
|
||||
@api.route('/accepts', methods=['GET'])
|
||||
def get_accepted_filetypes():
|
||||
return json.dumps(ACCEPTED_MIMES)
|
||||
|
||||
@api.route('/', methods=['GET'])
|
||||
def root():
|
||||
return "<p>Use POST /process with filename key in JSON body in order to process a file. File by that name must exist in hotdir already.</p>"
|
|
@ -1,17 +1,3 @@
|
|||
### What is the "Hot directory"
|
||||
|
||||
This is the location where you can dump all supported file types and have them automatically converted and prepared to be digested by the vectorizing service and selected from the AnythingLLM frontend.
|
||||
|
||||
Files dropped in here will only be processed when you are running `python watch.py` from the `collector` directory.
|
||||
|
||||
Once converted, the original file will be moved to the `hotdir/processed` folder so that it can still be linked to when referenced as a source document during chatting.
|
||||
|
||||
**Supported File types**
|
||||
- `.md`
|
||||
- `.txt`
|
||||
- `.pdf`
|
||||
|
||||
__requires more development__
|
||||
- `.png .jpg etc`
|
||||
- `.mp3`
|
||||
- `.mp4`
|
||||
This is a pre-set file location that documents will be written to when uploaded by AnythingLLM. There is really no need to touch it.
|
collector/index.js (new file, 78 lines)
|
@ -0,0 +1,78 @@
|
|||
process.env.NODE_ENV === "development"
|
||||
? require("dotenv").config({ path: `.env.${process.env.NODE_ENV}` })
|
||||
: require("dotenv").config();
|
||||
|
||||
const express = require("express");
|
||||
const bodyParser = require("body-parser");
|
||||
const cors = require("cors");
|
||||
const path = require("path");
|
||||
const { ACCEPTED_MIMES } = require("./utils/constants");
|
||||
const { reqBody } = require("./utils/http");
|
||||
const { processSingleFile } = require("./processSingleFile");
|
||||
const { processLink } = require("./processLink");
|
||||
const app = express();
|
||||
|
||||
app.use(cors({ origin: true }));
|
||||
app.use(
|
||||
bodyParser.text(),
|
||||
bodyParser.json(),
|
||||
bodyParser.urlencoded({
|
||||
extended: true,
|
||||
})
|
||||
);
|
||||
|
||||
app.post("/process", async function (request, response) {
|
||||
const { filename } = reqBody(request);
|
||||
try {
|
||||
const targetFilename = path
|
||||
.normalize(filename)
|
||||
.replace(/^(\.\.(\/|\\|$))+/, "");
|
||||
const { success, reason } = await processSingleFile(targetFilename);
|
||||
response.status(200).json({ filename: targetFilename, success, reason });
|
||||
} catch (e) {
|
||||
console.error(e);
|
||||
response.status(200).json({
|
||||
filename: filename,
|
||||
success: false,
|
||||
reason: "A processing error occurred.",
|
||||
});
|
||||
}
|
||||
return;
|
||||
});
|
||||
|
||||
app.post("/process-link", async function (request, response) {
|
||||
const { link } = reqBody(request);
|
||||
try {
|
||||
const { success, reason } = await processLink(link);
|
||||
response.status(200).json({ url: link, success, reason });
|
||||
} catch (e) {
|
||||
console.error(e);
|
||||
response.status(200).json({
|
||||
url: link,
|
||||
success: false,
|
||||
reason: "A processing error occurred.",
|
||||
});
|
||||
}
|
||||
return;
|
||||
});
|
||||
|
||||
app.get("/accepts", function (_, response) {
|
||||
response.status(200).json(ACCEPTED_MIMES);
|
||||
});
|
||||
|
||||
app.all("*", function (_, response) {
|
||||
response.sendStatus(200);
|
||||
});
|
||||
|
||||
app
|
||||
.listen(8888, async () => {
|
||||
console.log(`Document processor app listening on port 8888`);
|
||||
})
|
||||
.on("error", function (_) {
|
||||
process.once("SIGUSR2", function () {
|
||||
process.kill(process.pid, "SIGUSR2");
|
||||
});
|
||||
process.on("SIGINT", function () {
|
||||
process.kill(process.pid, "SIGINT");
|
||||
});
|
||||
});
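A minimal usage sketch (not part of this commit), assuming the collector above is running locally on its hard-coded port 8888: another Node 18+ process can exercise the two POST endpoints with the built-in `fetch`. The filename and URL below are placeholders.

```
// Minimal sketch (not part of this commit): calling the collector endpoints
// from another Node 18+ process using the built-in fetch API.
async function exerciseCollector() {
  // Ask the collector to parse a file that already exists in its hotdir.
  // "example.pdf" is a placeholder filename.
  const fileRes = await fetch("http://localhost:8888/process", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ filename: "example.pdf" }),
  });
  console.log(await fileRes.json()); // { filename, success, reason }

  // Ask the collector to scrape and convert a web page into a document.
  const linkRes = await fetch("http://localhost:8888/process-link", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ link: "https://example.com" }),
  });
  console.log(await linkRes.json()); // { url, success, reason }
}

exerciseCollector().catch(console.error);
```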
|
|
@ -1,84 +0,0 @@
|
|||
import os
|
||||
from InquirerPy import inquirer
|
||||
from scripts.youtube import youtube
|
||||
from scripts.link import link, links, crawler
|
||||
from scripts.substack import substack
|
||||
from scripts.medium import medium
|
||||
from scripts.gitbook import gitbook
|
||||
from scripts.sitemap import sitemap
|
||||
from scripts.twitter import twitter
|
||||
|
||||
def main():
|
||||
if os.name == 'nt':
|
||||
methods = {
|
||||
'1': 'YouTube Channel',
|
||||
'2': 'Article or Blog Link',
|
||||
'3': 'Substack',
|
||||
'4': 'Medium',
|
||||
'5': 'Gitbook',
|
||||
'6': 'Twitter',
|
||||
'7': 'Sitemap',
|
||||
}
|
||||
print("There are options for data collection to make this easier for you.\nType the number of the method you wish to execute.")
|
||||
print("1. YouTube Channel\n2. Article or Blog Link (Single)\n3. Substack\n4. Medium\n\n[In development]:\nTwitter\n\n")
|
||||
selection = input("Your selection: ")
|
||||
method = methods.get(str(selection))
|
||||
else:
|
||||
method = inquirer.select(
|
||||
message="What kind of data would you like to add to convert into long-term memory?",
|
||||
choices=[
|
||||
{"name": "YouTube Channel", "value": "YouTube Channel"},
|
||||
{"name": "Substack", "value": "Substack"},
|
||||
{"name": "Medium", "value": "Medium"},
|
||||
{"name": "Article or Blog Link(s)", "value": "Article or Blog Link(s)"},
|
||||
{"name": "Gitbook", "value": "Gitbook"},
|
||||
{"name": "Twitter", "value": "Twitter"},
|
||||
{"name": "Sitemap", "value": "Sitemap"},
|
||||
{"name": "Abort", "value": "Abort"},
|
||||
],
|
||||
).execute()
|
||||
|
||||
if 'Article or Blog Link' in method:
|
||||
method = inquirer.select(
|
||||
message="Do you want to scrape a single article/blog/url or many at once?",
|
||||
choices=[
|
||||
{"name": "Single URL", "value": "Single URL"},
|
||||
{"name": "Multiple URLs", "value": "Multiple URLs"},
|
||||
{"name": "URL Crawler", "value": "URL Crawler"},
|
||||
{"name": "Abort", "value": "Abort"},
|
||||
],
|
||||
).execute()
|
||||
if method == 'Single URL':
|
||||
link()
|
||||
exit(0)
|
||||
if method == 'Multiple URLs':
|
||||
links()
|
||||
exit(0)
|
||||
if method == 'URL Crawler':
|
||||
crawler()
|
||||
exit(0)
|
||||
|
||||
if method == 'Abort': exit(0)
|
||||
if method == 'YouTube Channel':
|
||||
youtube()
|
||||
exit(0)
|
||||
if method == 'Substack':
|
||||
substack()
|
||||
exit(0)
|
||||
if method == 'Medium':
|
||||
medium()
|
||||
exit(0)
|
||||
if method == 'Gitbook':
|
||||
gitbook()
|
||||
exit(0)
|
||||
if method == 'Sitemap':
|
||||
sitemap()
|
||||
exit(0)
|
||||
if method == 'Twitter':
|
||||
twitter()
|
||||
exit(0)
|
||||
print("Selection was not valid.")
|
||||
exit(1)
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
collector/nodemon.json (new file, 3 lines)
|
@ -0,0 +1,3 @@
|
|||
{
|
||||
"events": {}
|
||||
}
|
collector/package.json (new file, 42 lines)
|
@ -0,0 +1,42 @@
|
|||
{
|
||||
"name": "anything-llm-document-collector",
|
||||
"version": "0.2.0",
|
||||
"description": "Document collector server endpoints",
|
||||
"main": "index.js",
|
||||
"author": "Timothy Carambat (Mintplex Labs)",
|
||||
"license": "MIT",
|
||||
"private": false,
|
||||
"engines": {
|
||||
"node": ">=18.12.1"
|
||||
},
|
||||
"scripts": {
|
||||
"dev": "NODE_ENV=development nodemon --trace-warnings index.js",
|
||||
"start": "NODE_ENV=production node index.js",
|
||||
"lint": "yarn prettier --write ./processSingleFile ./processLink ./utils index.js"
|
||||
},
|
||||
"dependencies": {
|
||||
"@googleapis/youtube": "^9.0.0",
|
||||
"bcrypt": "^5.1.0",
|
||||
"body-parser": "^1.20.2",
|
||||
"cors": "^2.8.5",
|
||||
"dotenv": "^16.0.3",
|
||||
"express": "^4.18.2",
|
||||
"extract-zip": "^2.0.1",
|
||||
"js-tiktoken": "^1.0.8",
|
||||
"langchain": "0.0.201",
|
||||
"mammoth": "^1.6.0",
|
||||
"mbox-parser": "^1.0.1",
|
||||
"mime": "^3.0.0",
|
||||
"moment": "^2.29.4",
|
||||
"multer": "^1.4.5-lts.1",
|
||||
"officeparser": "^4.0.5",
|
||||
"pdf-parse": "^1.1.1",
|
||||
"puppeteer": "^21.6.1",
|
||||
"slugify": "^1.6.6",
|
||||
"uuid": "^9.0.0"
|
||||
},
|
||||
"devDependencies": {
|
||||
"nodemon": "^2.0.22",
|
||||
"prettier": "^2.4.1"
|
||||
}
|
||||
}
|
collector/processLink/convert/generic.js (new file, 72 lines)
|
@ -0,0 +1,72 @@
|
|||
const { v4 } = require("uuid");
|
||||
const {
|
||||
PuppeteerWebBaseLoader,
|
||||
} = require("langchain/document_loaders/web/puppeteer");
|
||||
const { writeToServerDocuments } = require("../../utils/files");
|
||||
const { tokenizeString } = require("../../utils/tokenizer");
|
||||
const { default: slugify } = require("slugify");
|
||||
|
||||
async function scrapeGenericUrl(link) {
|
||||
console.log(`-- Working URL ${link} --`);
|
||||
const content = await getPageContent(link);
|
||||
|
||||
if (!content.length) {
|
||||
console.error(`Resulting URL content was empty at ${link}.`);
|
||||
return { success: false, reason: `No URL content found at ${link}.` };
|
||||
}
|
||||
|
||||
const url = new URL(link);
|
||||
const filename = (url.host + "-" + url.pathname).replace(".", "_");
|
||||
|
||||
const data = {
|
||||
id: v4(),
|
||||
url: "file://" + slugify(filename) + ".html",
|
||||
title: slugify(filename) + ".html",
|
||||
docAuthor: "no author found",
|
||||
description: "No description found.",
|
||||
docSource: "URL link uploaded by the user.",
|
||||
chunkSource: slugify(link) + ".html",
|
||||
published: new Date().toLocaleString(),
|
||||
wordCount: content.split(" ").length,
|
||||
pageContent: content,
|
||||
token_count_estimate: tokenizeString(content).length,
|
||||
};
|
||||
|
||||
writeToServerDocuments(data, `url-${slugify(filename)}-${data.id}`);
|
||||
console.log(`[SUCCESS]: URL ${link} converted & ready for embedding.\n`);
|
||||
return { success: true, reason: null };
|
||||
}
|
||||
|
||||
async function getPageContent(link) {
|
||||
try {
|
||||
let pageContents = [];
|
||||
const loader = new PuppeteerWebBaseLoader(link, {
|
||||
launchOptions: {
|
||||
headless: "new",
|
||||
},
|
||||
gotoOptions: {
|
||||
waitUntil: "domcontentloaded",
|
||||
},
|
||||
async evaluate(page, browser) {
|
||||
const result = await page.evaluate(() => document.body.innerText);
|
||||
await browser.close();
|
||||
return result;
|
||||
},
|
||||
});
|
||||
|
||||
const docs = await loader.load();
|
||||
|
||||
for (const doc of docs) {
|
||||
pageContents.push(doc.pageContent);
|
||||
}
|
||||
|
||||
return pageContents.join(" ");
|
||||
} catch (error) {
|
||||
console.error("getPageContent failed!", error);
|
||||
}
|
||||
return null;
|
||||
}
|
||||
|
||||
module.exports = {
|
||||
scrapeGenericUrl,
|
||||
};
|
collector/processLink/index.js (new file, 11 lines)
|
@ -0,0 +1,11 @@
|
|||
const { validURL } = require("../utils/url");
|
||||
const { scrapeGenericUrl } = require("./convert/generic");
|
||||
|
||||
async function processLink(link) {
|
||||
if (!validURL(link)) return { success: false, reason: "Not a valid URL." };
|
||||
return await scrapeGenericUrl(link);
|
||||
}
|
||||
|
||||
module.exports = {
|
||||
processLink,
|
||||
};
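A minimal usage sketch (not part of this commit): `processLink` can also be called directly from a script inside `collector/`, bypassing the express route. The URL is a placeholder; anything failing `validURL` returns early without launching puppeteer.

```
// Minimal sketch (not part of this commit): invoking processLink() directly.
// Run from the collector root.
const { processLink } = require("./processLink");

(async () => {
  // Placeholder URL; an invalid URL returns { success: false, reason: "Not a valid URL." }.
  const { success, reason } = await processLink("https://example.com/some-article");
  console.log(success ? "Link converted and ready for embedding." : `Failed: ${reason}`);
})();
```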
|
collector/processSingleFile/convert/asDocx.js (new file, 51 lines)
|
@ -0,0 +1,51 @@
|
|||
const { v4 } = require("uuid");
|
||||
const { DocxLoader } = require("langchain/document_loaders/fs/docx");
|
||||
const {
|
||||
createdDate,
|
||||
trashFile,
|
||||
writeToServerDocuments,
|
||||
} = require("../../utils/files");
|
||||
const { tokenizeString } = require("../../utils/tokenizer");
|
||||
const { default: slugify } = require("slugify");
|
||||
|
||||
async function asDocX({ fullFilePath = "", filename = "" }) {
|
||||
const loader = new DocxLoader(fullFilePath);
|
||||
|
||||
console.log(`-- Working ${filename} --`);
|
||||
let pageContent = [];
|
||||
const docs = await loader.load();
|
||||
for (const doc of docs) {
|
||||
console.log(doc.metadata);
|
||||
console.log(`-- Parsing content from docx page --`);
|
||||
if (!doc.pageContent.length) continue;
|
||||
pageContent.push(doc.pageContent);
|
||||
}
|
||||
|
||||
if (!pageContent.length) {
|
||||
console.error(`Resulting text content was empty for ${filename}.`);
|
||||
trashFile(fullFilePath);
|
||||
return { success: false, reason: `No text content found in ${filename}.` };
|
||||
}
|
||||
|
||||
const content = pageContent.join("");
|
||||
const data = {
|
||||
id: v4(),
|
||||
url: "file://" + fullFilePath,
|
||||
title: filename,
|
||||
docAuthor: "no author found",
|
||||
description: "No description found.",
|
||||
docSource: "pdf file uploaded by the user.",
|
||||
chunkSource: filename,
|
||||
published: createdDate(fullFilePath),
|
||||
wordCount: content.split(" ").length,
|
||||
pageContent: content,
|
||||
token_count_estimate: tokenizeString(content).length,
|
||||
};
|
||||
|
||||
writeToServerDocuments(data, `${slugify(filename)}-${data.id}`);
|
||||
trashFile(fullFilePath);
|
||||
console.log(`[SUCCESS]: ${filename} converted & ready for embedding.\n`);
|
||||
return { success: true, reason: null };
|
||||
}
|
||||
|
||||
module.exports = asDocX;
|
collector/processSingleFile/convert/asMbox.js (new file, 65 lines)
|
@ -0,0 +1,65 @@
|
|||
const { v4 } = require("uuid");
|
||||
const fs = require("fs");
|
||||
const { mboxParser } = require("mbox-parser");
|
||||
const {
|
||||
createdDate,
|
||||
trashFile,
|
||||
writeToServerDocuments,
|
||||
} = require("../../utils/files");
|
||||
const { tokenizeString } = require("../../utils/tokenizer");
|
||||
const { default: slugify } = require("slugify");
|
||||
|
||||
async function asMbox({ fullFilePath = "", filename = "" }) {
|
||||
console.log(`-- Working ${filename} --`);
|
||||
|
||||
const mails = await mboxParser(fs.createReadStream(fullFilePath))
|
||||
.then((mails) => mails)
|
||||
.catch((error) => {
|
||||
console.log(`Could not parse mail items`, error);
|
||||
return [];
|
||||
});
|
||||
|
||||
if (!mails.length) {
|
||||
console.error(`Resulting mail items was empty for ${filename}.`);
|
||||
trashFile(fullFilePath);
|
||||
return { success: false, reason: `No mail items found in ${filename}.` };
|
||||
}
|
||||
|
||||
let item = 1;
|
||||
for (const mail of mails) {
|
||||
if (!mail.hasOwnProperty("text")) continue;
|
||||
|
||||
const content = mail.text;
|
||||
if (!content) continue;
|
||||
console.log(
|
||||
`-- Working on message "${mail.subject || "Unknown subject"}" --`
|
||||
);
|
||||
|
||||
const data = {
|
||||
id: v4(),
|
||||
url: "file://" + fullFilePath,
|
||||
title: mail?.subject
|
||||
? slugify(mail?.subject?.replace(".", "")) + ".mbox"
|
||||
: `msg_${item}-${filename}`,
|
||||
docAuthor: mail?.from?.text,
|
||||
description: "No description found.",
|
||||
docSource: "Mbox message file uploaded by the user.",
|
||||
chunkSource: filename,
|
||||
published: createdDate(fullFilePath),
|
||||
wordCount: content.split(" ").length,
|
||||
pageContent: content,
|
||||
token_count_estimate: tokenizeString(content).length,
|
||||
};
|
||||
|
||||
item++;
|
||||
writeToServerDocuments(data, `${slugify(filename)}-${data.id}-msg-${item}`);
|
||||
}
|
||||
|
||||
trashFile(fullFilePath);
|
||||
console.log(
|
||||
`[SUCCESS]: ${filename} messages converted & ready for embedding.\n`
|
||||
);
|
||||
return { success: true, reason: null };
|
||||
}
|
||||
|
||||
module.exports = asMbox;
|
collector/processSingleFile/convert/asOfficeMime.js (new file, 46 lines)
|
@ -0,0 +1,46 @@
|
|||
const { v4 } = require("uuid");
|
||||
const officeParser = require("officeparser");
|
||||
const {
|
||||
createdDate,
|
||||
trashFile,
|
||||
writeToServerDocuments,
|
||||
} = require("../../utils/files");
|
||||
const { tokenizeString } = require("../../utils/tokenizer");
|
||||
const { default: slugify } = require("slugify");
|
||||
|
||||
async function asOfficeMime({ fullFilePath = "", filename = "" }) {
|
||||
console.log(`-- Working ${filename} --`);
|
||||
let content = "";
|
||||
try {
|
||||
content = await officeParser.parseOfficeAsync(fullFilePath);
|
||||
} catch (error) {
|
||||
console.error(`Could not parse office or office-like file`, error);
|
||||
}
|
||||
|
||||
if (!content.length) {
|
||||
console.error(`Resulting text content was empty for ${filename}.`);
|
||||
trashFile(fullFilePath);
|
||||
return { success: false, reason: `No text content found in ${filename}.` };
|
||||
}
|
||||
|
||||
const data = {
|
||||
id: v4(),
|
||||
url: "file://" + fullFilePath,
|
||||
title: filename,
|
||||
docAuthor: "no author found",
|
||||
description: "No description found.",
|
||||
docSource: "Office file uploaded by the user.",
|
||||
chunkSource: filename,
|
||||
published: createdDate(fullFilePath),
|
||||
wordCount: content.split(" ").length,
|
||||
pageContent: content,
|
||||
token_count_estimate: tokenizeString(content).length,
|
||||
};
|
||||
|
||||
writeToServerDocuments(data, `${slugify(filename)}-${data.id}`);
|
||||
trashFile(fullFilePath);
|
||||
console.log(`[SUCCESS]: ${filename} converted & ready for embedding.\n`);
|
||||
return { success: true, reason: null };
|
||||
}
|
||||
|
||||
module.exports = asOfficeMime;
|
collector/processSingleFile/convert/asPDF.js (new file, 56 lines)
|
@ -0,0 +1,56 @@
|
|||
const { v4 } = require("uuid");
|
||||
const { PDFLoader } = require("langchain/document_loaders/fs/pdf");
|
||||
const {
|
||||
createdDate,
|
||||
trashFile,
|
||||
writeToServerDocuments,
|
||||
} = require("../../utils/files");
|
||||
const { tokenizeString } = require("../../utils/tokenizer");
|
||||
const { default: slugify } = require("slugify");
|
||||
|
||||
async function asPDF({ fullFilePath = "", filename = "" }) {
|
||||
const pdfLoader = new PDFLoader(fullFilePath, {
|
||||
splitPages: true,
|
||||
});
|
||||
|
||||
console.log(`-- Working ${filename} --`);
|
||||
const pageContent = [];
|
||||
const docs = await pdfLoader.load();
|
||||
for (const doc of docs) {
|
||||
console.log(
|
||||
`-- Parsing content from pg ${
|
||||
doc.metadata?.loc?.pageNumber || "unknown"
|
||||
} --`
|
||||
);
|
||||
if (!doc.pageContent.length) continue;
|
||||
pageContent.push(doc.pageContent);
|
||||
}
|
||||
|
||||
if (!pageContent.length) {
|
||||
console.error(`Resulting text content was empty for ${filename}.`);
|
||||
trashFile(fullFilePath);
|
||||
return { success: false, reason: `No text content found in ${filename}.` };
|
||||
}
|
||||
|
||||
const content = pageContent.join("");
|
||||
const data = {
|
||||
id: v4(),
|
||||
url: "file://" + fullFilePath,
|
||||
title: docs[0]?.metadata?.pdf?.info?.Title || filename,
|
||||
docAuthor: docs[0]?.metadata?.pdf?.info?.Creator || "no author found",
|
||||
description: "No description found.",
|
||||
docSource: "pdf file uploaded by the user.",
|
||||
chunkSource: filename,
|
||||
published: createdDate(fullFilePath),
|
||||
wordCount: content.split(" ").length,
|
||||
pageContent: content,
|
||||
token_count_estimate: tokenizeString(content).length,
|
||||
};
|
||||
|
||||
writeToServerDocuments(data, `${slugify(filename)}-${data.id}`);
|
||||
trashFile(fullFilePath);
|
||||
console.log(`[SUCCESS]: ${filename} converted & ready for embedding.\n`);
|
||||
return { success: true, reason: null };
|
||||
}
|
||||
|
||||
module.exports = asPDF;
|
collector/processSingleFile/convert/asTxt.js (new file, 46 lines)
|
@ -0,0 +1,46 @@
|
|||
const { v4 } = require("uuid");
|
||||
const fs = require("fs");
|
||||
const { tokenizeString } = require("../../utils/tokenizer");
|
||||
const {
|
||||
createdDate,
|
||||
trashFile,
|
||||
writeToServerDocuments,
|
||||
} = require("../../utils/files");
|
||||
const { default: slugify } = require("slugify");
|
||||
|
||||
async function asTxt({ fullFilePath = "", filename = "" }) {
|
||||
let content = "";
|
||||
try {
|
||||
content = fs.readFileSync(fullFilePath, "utf8");
|
||||
} catch (err) {
|
||||
console.error("Could not read file!", err);
|
||||
}
|
||||
|
||||
if (!content?.length) {
|
||||
console.error(`Resulting text content was empty for ${filename}.`);
|
||||
trashFile(fullFilePath);
|
||||
return { success: false, reason: `No text content found in ${filename}.` };
|
||||
}
|
||||
|
||||
console.log(`-- Working ${filename} --`);
|
||||
const data = {
|
||||
id: v4(),
|
||||
url: "file://" + fullFilePath,
|
||||
title: filename,
|
||||
docAuthor: "Unknown", // TODO: Find a better author
|
||||
description: "Unknown", // TODO: Find a better description
|
||||
docSource: "a text file uploaded by the user.",
|
||||
chunkSource: filename,
|
||||
published: createdDate(fullFilePath),
|
||||
wordCount: content.split(" ").length,
|
||||
pageContent: content,
|
||||
token_count_estimate: tokenizeString(content).length,
|
||||
};
|
||||
|
||||
writeToServerDocuments(data, `${slugify(filename)}-${data.id}`);
|
||||
trashFile(fullFilePath);
|
||||
console.log(`[SUCCESS]: ${filename} converted & ready for embedding.\n`);
|
||||
return { success: true, reason: null };
|
||||
}
|
||||
|
||||
module.exports = asTxt;
|
collector/processSingleFile/index.js (new file, 51 lines)
|
@ -0,0 +1,51 @@
|
|||
const path = require("path");
|
||||
const fs = require("fs");
|
||||
const {
|
||||
WATCH_DIRECTORY,
|
||||
SUPPORTED_FILETYPE_CONVERTERS,
|
||||
} = require("../utils/constants");
|
||||
const { trashFile } = require("../utils/files");
|
||||
|
||||
const RESERVED_FILES = ["__HOTDIR__.md"];
|
||||
|
||||
async function processSingleFile(targetFilename) {
|
||||
const fullFilePath = path.resolve(WATCH_DIRECTORY, targetFilename);
|
||||
if (RESERVED_FILES.includes(targetFilename))
|
||||
return {
|
||||
success: false,
|
||||
reason: "Filename is a reserved filename and cannot be processed.",
|
||||
};
|
||||
if (!fs.existsSync(fullFilePath))
|
||||
return {
|
||||
success: false,
|
||||
reason: "File does not exist in upload directory.",
|
||||
};
|
||||
|
||||
const fileExtension = path.extname(fullFilePath).toLowerCase();
|
||||
if (!fileExtension) {
|
||||
return {
|
||||
success: false,
|
||||
reason: `No file extension found. This file cannot be processed.`,
|
||||
};
|
||||
}
|
||||
|
||||
if (!Object.keys(SUPPORTED_FILETYPE_CONVERTERS).includes(fileExtension)) {
|
||||
trashFile(fullFilePath);
|
||||
return {
|
||||
success: false,
|
||||
reason: `File extension ${fileExtension} not supported for parsing.`,
|
||||
};
|
||||
}
|
||||
|
||||
const FileTypeProcessor = require(SUPPORTED_FILETYPE_CONVERTERS[
|
||||
fileExtension
|
||||
]);
|
||||
return await FileTypeProcessor({
|
||||
fullFilePath,
|
||||
filename: targetFilename,
|
||||
});
|
||||
}
|
||||
|
||||
module.exports = {
|
||||
processSingleFile,
|
||||
};
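A minimal usage sketch (not part of this commit): `processSingleFile` can be invoked directly, provided the target file already sits in the watched hotdir so the existence check above passes. The filename is a placeholder.

```
// Minimal sketch (not part of this commit): invoking processSingleFile()
// directly from a script in the collector root. "example.pdf" is a placeholder
// and must already exist in WATCH_DIRECTORY.
const { processSingleFile } = require("./processSingleFile");

(async () => {
  const { success, reason } = await processSingleFile("example.pdf");
  if (success) {
    console.log("Document converted and ready for embedding.");
  } else {
    console.error(`Processing failed: ${reason}`);
  }
})();
```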
|
|
@ -1,117 +0,0 @@
|
|||
about-time==4.2.1
|
||||
aiohttp==3.8.4
|
||||
aiosignal==1.3.1
|
||||
alive-progress==3.1.2
|
||||
anyio==3.7.0
|
||||
appdirs==1.4.4
|
||||
argilla==1.8.0
|
||||
asgiref==3.7.2
|
||||
async-timeout==4.0.2
|
||||
attrs==23.1.0
|
||||
backoff==2.2.1
|
||||
beautifulsoup4==4.12.2
|
||||
blinker==1.6.2
|
||||
bs4==0.0.1
|
||||
certifi==2023.5.7
|
||||
cffi==1.15.1
|
||||
chardet==5.1.0
|
||||
charset-normalizer==3.1.0
|
||||
click==8.1.3
|
||||
commonmark==0.9.1
|
||||
cryptography==41.0.1
|
||||
cssselect==1.2.0
|
||||
dataclasses-json==0.5.7
|
||||
Deprecated==1.2.14
|
||||
docx2txt==0.8
|
||||
et-xmlfile==1.1.0
|
||||
exceptiongroup==1.1.1
|
||||
fake-useragent==1.2.1
|
||||
Flask==2.3.2
|
||||
frozenlist==1.3.3
|
||||
grapheme==0.6.0
|
||||
greenlet==2.0.2
|
||||
gunicorn==20.1.0
|
||||
h11==0.14.0
|
||||
httpcore==0.16.3
|
||||
httpx==0.23.3
|
||||
idna==3.4
|
||||
importlib-metadata==6.6.0
|
||||
importlib-resources==5.12.0
|
||||
inquirerpy==0.3.4
|
||||
install==1.3.5
|
||||
itsdangerous==2.1.2
|
||||
Jinja2==3.1.2
|
||||
joblib==1.2.0
|
||||
langchain==0.0.189
|
||||
lxml==4.9.2
|
||||
Markdown==3.4.3
|
||||
MarkupSafe==2.1.3
|
||||
marshmallow==3.19.0
|
||||
marshmallow-enum==1.5.1
|
||||
monotonic==1.6
|
||||
msg-parser==1.2.0
|
||||
multidict==6.0.4
|
||||
mypy-extensions==1.0.0
|
||||
nltk==3.8.1
|
||||
numexpr==2.8.4
|
||||
numpy==1.23.5
|
||||
oauthlib==3.2.2
|
||||
olefile==0.46
|
||||
openapi-schema-pydantic==1.2.4
|
||||
openpyxl==3.1.2
|
||||
packaging==23.1
|
||||
pandas==1.5.3
|
||||
parse==1.19.0
|
||||
pdfminer.six==20221105
|
||||
pfzy==0.3.4
|
||||
Pillow==9.5.0
|
||||
prompt-toolkit==3.0.38
|
||||
pycparser==2.21
|
||||
pydantic==1.10.8
|
||||
pyee==8.2.2
|
||||
Pygments==2.15.1
|
||||
PyMuPDF==1.22.5
|
||||
pypandoc==1.4
|
||||
pyppeteer==1.0.2
|
||||
pyquery==2.0.0
|
||||
python-dateutil==2.8.2
|
||||
python-docx==0.8.11
|
||||
python-dotenv==0.21.1
|
||||
python-magic==0.4.27
|
||||
python-pptx==0.6.21
|
||||
python-slugify==8.0.1
|
||||
pytz==2023.3
|
||||
PyYAML==6.0
|
||||
regex==2023.5.5
|
||||
requests==2.31.0
|
||||
requests-html==0.10.0
|
||||
requests-oauthlib==1.3.1
|
||||
rfc3986==1.5.0
|
||||
rich==13.0.1
|
||||
six==1.16.0
|
||||
sniffio==1.3.0
|
||||
soupsieve==2.4.1
|
||||
SQLAlchemy==2.0.15
|
||||
tabulate==0.9.0
|
||||
tenacity==8.2.2
|
||||
text-unidecode==1.3
|
||||
tiktoken==0.4.0
|
||||
tqdm==4.65.0
|
||||
tweepy==4.14.0
|
||||
typer==0.9.0
|
||||
typing-inspect==0.9.0
|
||||
typing_extensions==4.6.3
|
||||
Unidecode==1.3.6
|
||||
unstructured==0.7.1
|
||||
urllib3==1.26.16
|
||||
uuid==1.30
|
||||
w3lib==2.1.1
|
||||
wcwidth==0.2.6
|
||||
websockets==10.4
|
||||
Werkzeug==2.3.6
|
||||
wrapt==1.14.1
|
||||
xlrd==2.0.1
|
||||
XlsxWriter==3.1.2
|
||||
yarl==1.9.2
|
||||
youtube-transcript-api==0.6.0
|
||||
zipp==3.15.0
|
|
@ -1,44 +0,0 @@
|
|||
import os, json
|
||||
from langchain.document_loaders import GitbookLoader
|
||||
from urllib.parse import urlparse
|
||||
from datetime import datetime
|
||||
from alive_progress import alive_it
|
||||
from .utils import tokenize
|
||||
from uuid import uuid4
|
||||
|
||||
def gitbook():
|
||||
url = input("Enter the URL of the GitBook you want to collect: ")
|
||||
if(url == ''):
|
||||
print("Not a gitbook URL")
|
||||
exit(1)
|
||||
|
||||
primary_source = urlparse(url)
|
||||
output_path = f"./outputs/gitbook-logs/{primary_source.netloc}"
|
||||
transaction_output_dir = f"../server/storage/documents/gitbook-{primary_source.netloc}"
|
||||
|
||||
if os.path.exists(output_path) == False:os.makedirs(output_path)
|
||||
if os.path.exists(transaction_output_dir) == False: os.makedirs(transaction_output_dir)
|
||||
loader = GitbookLoader(url, load_all_paths= primary_source.path in ['','/'])
|
||||
for doc in alive_it(loader.load()):
|
||||
metadata = doc.metadata
|
||||
content = doc.page_content
|
||||
source = urlparse(metadata.get('source'))
|
||||
name = 'home' if source.path in ['','/'] else source.path.replace('/','_')
|
||||
output_filename = f"doc-{name}.json"
|
||||
transaction_output_filename = f"doc-{name}.json"
|
||||
data = {
|
||||
'id': str(uuid4()),
|
||||
'url': metadata.get('source'),
|
||||
'title': metadata.get('title'),
|
||||
'description': metadata.get('title'),
|
||||
'published': datetime.today().strftime('%Y-%m-%d %H:%M:%S'),
|
||||
'wordCount': len(content),
|
||||
'pageContent': content,
|
||||
'token_count_estimate': len(tokenize(content))
|
||||
}
|
||||
|
||||
with open(f"{output_path}/{output_filename}", 'w', encoding='utf-8') as file:
|
||||
json.dump(data, file, ensure_ascii=True, indent=4)
|
||||
|
||||
with open(f"{transaction_output_dir}/{transaction_output_filename}", 'w', encoding='utf-8') as file:
|
||||
json.dump(data, file, ensure_ascii=True, indent=4)
|
|
@ -1,222 +0,0 @@
|
|||
import os, json, tempfile
|
||||
from urllib.parse import urlparse
|
||||
from requests_html import HTMLSession
|
||||
from langchain.document_loaders import UnstructuredHTMLLoader
|
||||
from .link_utils import append_meta, AsyncHTMLSessionFixed
|
||||
from .utils import tokenize, ada_v2_cost
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
# Example Channel URL https://tim.blog/2022/08/09/nft-insider-trading-policy/
|
||||
def link():
|
||||
totalTokens = 0
|
||||
print("[NOTICE]: The first time running this process it will download supporting libraries.\n\n")
|
||||
fqdn_link = input("Paste in the URL of an online article or blog: ")
|
||||
if(len(fqdn_link) == 0):
|
||||
print("Invalid URL!")
|
||||
exit(1)
|
||||
|
||||
session = HTMLSession()
|
||||
req = session.get(fqdn_link)
|
||||
if(req.ok == False):
|
||||
print("Could not reach this url!")
|
||||
exit(1)
|
||||
|
||||
req.html.render()
|
||||
full_text = None
|
||||
with tempfile.NamedTemporaryFile(mode = "w") as tmp:
|
||||
tmp.write(req.html.html)
|
||||
tmp.seek(0)
|
||||
loader = UnstructuredHTMLLoader(tmp.name)
|
||||
data = loader.load()[0]
|
||||
full_text = data.page_content
|
||||
tmp.close()
|
||||
|
||||
link = append_meta(req, full_text, True)
|
||||
if(len(full_text) > 0):
|
||||
totalTokens += len(tokenize(full_text))
|
||||
source = urlparse(req.url)
|
||||
output_filename = f"website-{source.netloc}-{source.path.replace('/','_')}.json"
|
||||
output_path = f"./outputs/website-logs"
|
||||
|
||||
transaction_output_filename = f"website-{source.path.replace('/','_')}.json"
|
||||
transaction_output_dir = f"../server/storage/documents/custom-documents"
|
||||
|
||||
if os.path.isdir(output_path) == False:
|
||||
os.makedirs(output_path)
|
||||
|
||||
if os.path.isdir(transaction_output_dir) == False:
|
||||
os.makedirs(transaction_output_dir)
|
||||
|
||||
full_text = append_meta(req, full_text)
|
||||
with open(f"{output_path}/{output_filename}", 'w', encoding='utf-8') as file:
|
||||
json.dump(link, file, ensure_ascii=True, indent=4)
|
||||
|
||||
with open(f"{transaction_output_dir}/{transaction_output_filename}", 'w', encoding='utf-8') as file:
|
||||
json.dump(link, file, ensure_ascii=True, indent=4)
|
||||
else:
|
||||
print("Could not parse any meaningful data from this link or url.")
|
||||
exit(1)
|
||||
|
||||
print(f"\n\n[Success]: article or link content fetched!")
|
||||
print(f"////////////////////////////")
|
||||
print(f"Your estimated cost to embed this data using OpenAI's text-embedding-ada-002 model at $0.0004 / 1K tokens will cost {ada_v2_cost(totalTokens)} using {totalTokens} tokens.")
|
||||
print(f"////////////////////////////")
|
||||
exit(0)
|
||||
|
||||
async def process_single_link(url):
|
||||
session = None
|
||||
try:
|
||||
print(f"Working on {url}...")
|
||||
session = AsyncHTMLSessionFixed()
|
||||
req = await session.get(url)
|
||||
await req.html.arender()
|
||||
await session.close()
|
||||
|
||||
if not req.ok:
|
||||
return False, "Could not reach this URL."
|
||||
|
||||
full_text = None
|
||||
with tempfile.NamedTemporaryFile(mode = "w") as tmp:
|
||||
tmp.write(req.html.html)
|
||||
tmp.seek(0)
|
||||
loader = UnstructuredHTMLLoader(tmp.name)
|
||||
data = loader.load()[0]
|
||||
full_text = data.page_content
|
||||
tmp.close()
|
||||
|
||||
if full_text:
|
||||
link_meta = append_meta(req, full_text, True)
|
||||
|
||||
source = urlparse(req.url)
|
||||
transaction_output_dir = "../server/storage/documents/custom-documents"
|
||||
transaction_output_filename = f"website-{source.netloc}-{source.path.replace('/', '_')}.json"
|
||||
|
||||
if not os.path.isdir(transaction_output_dir):
|
||||
os.makedirs(transaction_output_dir)
|
||||
|
||||
file_path = os.path.join(transaction_output_dir, transaction_output_filename)
|
||||
with open(file_path, 'w', encoding='utf-8') as file:
|
||||
json.dump(link_meta, file, ensure_ascii=False, indent=4)
|
||||
|
||||
|
||||
return True, "Content fetched and saved."
|
||||
|
||||
else:
|
||||
return False, "Could not parse any meaningful data from this URL."
|
||||
|
||||
except Exception as e:
|
||||
if session is not None:
|
||||
session.close() # Kill hanging session.
|
||||
return False, str(e)
|
||||
|
||||
def crawler():
|
||||
prompt = "Paste in root URI of the pages of interest: "
|
||||
new_link = input(prompt)
|
||||
filter_value = input("Add a filter value for the url to ensure links don't wander too far. eg: 'my-domain.com': ")
|
||||
#extract this from the uri provided
|
||||
root_site = urlparse(new_link).scheme + "://" + urlparse(new_link).hostname
|
||||
links = []
|
||||
urls = new_link
|
||||
links.append(new_link)
|
||||
grab = requests.get(urls)
|
||||
soup = BeautifulSoup(grab.text, 'html.parser')
|
||||
|
||||
# traverse paragraphs from soup
|
||||
for link in soup.find_all("a"):
|
||||
data = link.get('href')
|
||||
if (data is not None):
|
||||
fullpath = data if data[0] != '/' else f"{root_site}{data}"
|
||||
try:
|
||||
destination = urlparse(fullpath).scheme + "://" + urlparse(fullpath).hostname + (urlparse(fullpath).path if urlparse(fullpath).path is not None else '')
|
||||
if filter_value in destination:
|
||||
data = destination.strip()
|
||||
print (data)
|
||||
links.append(data)
|
||||
else:
|
||||
print (data + " does not apply for linking...")
|
||||
except:
|
||||
print (data + " does not apply for linking...")
|
||||
#parse the links found
|
||||
parse_links(links)
|
||||
|
||||
def links():
|
||||
links = []
|
||||
prompt = "Paste in the URL of an online article or blog: "
|
||||
done = False
|
||||
|
||||
while(done == False):
|
||||
new_link = input(prompt)
|
||||
if(len(new_link) == 0):
|
||||
done = True
|
||||
links = [*set(links)]
|
||||
continue
|
||||
|
||||
links.append(new_link)
|
||||
prompt = f"\n{len(links)} links in queue. Submit an empty value when done pasting in links to execute collection.\nPaste in the next URL of an online article or blog: "
|
||||
|
||||
if(len(links) == 0):
|
||||
print("No valid links provided!")
|
||||
exit(1)
|
||||
|
||||
parse_links(links)
|
||||
|
||||
|
||||
# parse links from array
|
||||
def parse_links(links):
|
||||
totalTokens = 0
|
||||
for link in links:
|
||||
print(f"Working on {link}...")
|
||||
session = HTMLSession()
|
||||
|
||||
req = session.get(link, timeout=20)
|
||||
|
||||
if not req.ok:
|
||||
print(f"Could not reach {link} - skipping!")
|
||||
continue
|
||||
|
||||
req.html.render(timeout=10)
|
||||
|
||||
full_text = None
|
||||
with tempfile.NamedTemporaryFile(mode="w") as tmp:
|
||||
tmp.write(req.html.html)
|
||||
tmp.seek(0)
|
||||
loader = UnstructuredHTMLLoader(tmp.name)
|
||||
data = loader.load()[0]
|
||||
full_text = data.page_content
|
||||
tmp.close()
|
||||
|
||||
link = append_meta(req, full_text, True)
|
||||
if len(full_text) > 0:
|
||||
source = urlparse(req.url)
|
||||
output_filename = f"website-{source.netloc}-{source.path.replace('/','_')}.json"
|
||||
output_path = f"./outputs/website-logs"
|
||||
|
||||
transaction_output_filename = f"website-{source.path.replace('/','_')}.json"
|
||||
transaction_output_dir = f"../server/storage/documents/custom-documents"
|
||||
|
||||
if not os.path.isdir(output_path):
|
||||
os.makedirs(output_path)
|
||||
|
||||
if not os.path.isdir(transaction_output_dir):
|
||||
os.makedirs(transaction_output_dir)
|
||||
|
||||
full_text = append_meta(req, full_text)
|
||||
tokenCount = len(tokenize(full_text))
|
||||
totalTokens += tokenCount
|
||||
|
||||
with open(f"{output_path}/{output_filename}", 'w', encoding='utf-8') as file:
|
||||
json.dump(link, file, ensure_ascii=True, indent=4)
|
||||
|
||||
with open(f"{transaction_output_dir}/{transaction_output_filename}", 'w', encoding='utf-8') as file:
|
||||
json.dump(link, file, ensure_ascii=True, indent=4)
|
||||
|
||||
req.session.close()
|
||||
else:
|
||||
print(f"Could not parse any meaningful data from {link}.")
|
||||
continue
|
||||
|
||||
print(f"\n\n[Success]: {len(links)} article or link contents fetched!")
|
||||
print(f"////////////////////////////")
|
||||
print(f"Your estimated cost to embed this data using OpenAI's text-embedding-ada-002 model at $0.0004 / 1K tokens will cost {ada_v2_cost(totalTokens)} using {totalTokens} tokens.")
|
||||
print(f"////////////////////////////")
|
|
@ -1,45 +0,0 @@
|
|||
import json, pyppeteer
|
||||
from datetime import datetime
|
||||
from .watch.utils import guid
|
||||
from dotenv import load_dotenv
|
||||
from .watch.utils import guid
|
||||
from .utils import tokenize
|
||||
from requests_html import AsyncHTMLSession
|
||||
|
||||
load_dotenv()
|
||||
|
||||
def normalize_url(url):
|
||||
if(url.endswith('.web')):
|
||||
return url
|
||||
return f"{url}.web"
|
||||
|
||||
def append_meta(request, text, metadata_only = False):
|
||||
meta = {
|
||||
'id': guid(),
|
||||
'url': normalize_url(request.url),
|
||||
'title': request.html.find('title', first=True).text if len(request.html.find('title')) != 0 else '',
|
||||
'docAuthor': 'N/A',
|
||||
'description': request.html.find('meta[name="description"]', first=True).attrs.get('content') if request.html.find('meta[name="description"]', first=True) != None else '',
|
||||
'docSource': 'web page',
|
||||
'chunkSource': request.url,
|
||||
'published':request.html.find('meta[property="article:published_time"]', first=True).attrs.get('content') if request.html.find('meta[property="article:published_time"]', first=True) != None else datetime.today().strftime('%Y-%m-%d %H:%M:%S'),
|
||||
'wordCount': len(text.split(' ')),
|
||||
'pageContent': text,
|
||||
'token_count_estimate':len(tokenize(text)),
|
||||
}
|
||||
return "Article JSON Metadata:\n"+json.dumps(meta)+"\n\n\nText Content:\n" + text if metadata_only == False else meta
|
||||
|
||||
class AsyncHTMLSessionFixed(AsyncHTMLSession):
|
||||
"""
|
||||
pip3 install websockets==6.0 --force-reinstall
|
||||
"""
|
||||
def __init__(self, **kwargs):
|
||||
super(AsyncHTMLSessionFixed, self).__init__(**kwargs)
|
||||
self.__browser_args = kwargs.get("browser_args", ["--no-sandbox"])
|
||||
|
||||
@property
|
||||
async def browser(self):
|
||||
if not hasattr(self, "_browser"):
|
||||
self._browser = await pyppeteer.launch(ignoreHTTPSErrors=not(self.verify), headless=True, handleSIGINT=False, handleSIGTERM=False, handleSIGHUP=False, args=self.__browser_args)
|
||||
|
||||
return self._browser
|
|
@@ -1,71 +0,0 @@
|
|||
import os, json
|
||||
from urllib.parse import urlparse
|
||||
from .utils import tokenize, ada_v2_cost
|
||||
from .medium_utils import get_username, fetch_recent_publications, append_meta
|
||||
from alive_progress import alive_it
|
||||
|
||||
# Example medium URL: https://medium.com/@yujiangtham or https://davidall.medium.com
|
||||
def medium():
|
||||
print("[NOTICE]: This method will only get the 10 most recent publishings.")
|
||||
author_url = input("Enter the medium URL of the author you want to collect: ")
|
||||
if(author_url == ''):
|
||||
print("Not a valid medium.com/@author URL")
|
||||
exit(1)
|
||||
|
||||
handle = get_username(author_url)
|
||||
if(handle is None):
|
||||
print("This does not appear to be a valid medium.com/@author URL")
|
||||
exit(1)
|
||||
|
||||
publications = fetch_recent_publications(handle)
|
||||
if(len(publications)==0):
|
||||
print("There are no public or free publications by this creator - nothing to collect.")
|
||||
exit(1)
|
||||
|
||||
totalTokenCount = 0
|
||||
transaction_output_dir = f"../server/storage/documents/medium-{handle}"
|
||||
if os.path.isdir(transaction_output_dir) == False:
|
||||
os.makedirs(transaction_output_dir)
|
||||
|
||||
for publication in alive_it(publications):
|
||||
pub_file_path = transaction_output_dir + f"/publication-{publication.get('id')}.json"
|
||||
if os.path.exists(pub_file_path) == True: continue
|
||||
|
||||
full_text = publication.get('pageContent')
|
||||
if full_text is None or len(full_text) == 0: continue
|
||||
|
||||
full_text = append_meta(publication, full_text)
|
||||
item = {
|
||||
'id': publication.get('id'),
|
||||
'url': publication.get('url'),
|
||||
'title': publication.get('title'),
|
||||
'published': publication.get('published'),
|
||||
'wordCount': len(full_text.split(' ')),
|
||||
'pageContent': full_text,
|
||||
}
|
||||
|
||||
tokenCount = len(tokenize(full_text))
|
||||
item['token_count_estimate'] = tokenCount
|
||||
|
||||
totalTokenCount += tokenCount
|
||||
with open(pub_file_path, 'w', encoding='utf-8') as file:
|
||||
json.dump(item, file, ensure_ascii=True, indent=4)
|
||||
|
||||
print(f"[Success]: {len(publications)} scraped and fetched!")
|
||||
print(f"\n\n////////////////////////////")
|
||||
print(f"Your estimated cost to embed all of this data using OpenAI's text-embedding-ada-002 model at $0.0004 / 1K tokens will cost {ada_v2_cost(totalTokenCount)} using {totalTokenCount} tokens.")
|
||||
print(f"////////////////////////////\n\n")
|
||||
exit(0)
@@ -1,71 +0,0 @@
|
|||
import os, json, requests, re
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
def get_username(author_url):
|
||||
if '@' in author_url:
|
||||
pattern = r"medium\.com/@([\w-]+)"
|
||||
match = re.search(pattern, author_url)
|
||||
return match.group(1) if match else None
|
||||
else:
|
||||
# Given subdomain
|
||||
pattern = r"([\w-]+).medium\.com"
|
||||
match = re.search(pattern, author_url)
|
||||
return match.group(1) if match else None
|
||||
|
||||
def get_docid(medium_docpath):
|
||||
pattern = r"medium\.com/p/([\w-]+)"
|
||||
match = re.search(pattern, medium_docpath)
|
||||
return match.group(1) if match else None
|
||||
|
||||
def fetch_recent_publications(handle):
|
||||
rss_link = f"https://medium.com/feed/@{handle}"
|
||||
response = requests.get(rss_link)
|
||||
if(response.ok == False):
|
||||
print(f"Could not fetch RSS results for author.")
|
||||
return []
|
||||
|
||||
xml = response.content
|
||||
soup = BeautifulSoup(xml, 'xml')
|
||||
items = soup.find_all('item')
|
||||
publications = []
|
||||
|
||||
if os.path.isdir("./outputs/medium-logs") == False:
|
||||
os.makedirs("./outputs/medium-logs")
|
||||
|
||||
file_path = f"./outputs/medium-logs/medium-{handle}.json"
|
||||
|
||||
if os.path.exists(file_path):
|
||||
with open(file_path, "r") as file:
|
||||
print(f"Returning cached data for Author {handle}. If you do not wish to use stored data then delete the file for this author to allow refetching.")
|
||||
return json.load(file)
|
||||
|
||||
for item in items:
|
||||
tags = []
|
||||
for tag in item.find_all('category'): tags.append(tag.text)
|
||||
content = BeautifulSoup(item.find('content:encoded').text, 'html.parser')
|
||||
data = {
|
||||
'id': get_docid(item.find('guid').text),
|
||||
'title': item.find('title').text,
|
||||
'url': item.find('link').text.split('?')[0],
|
||||
'tags': ','.join(tags),
|
||||
'published': item.find('pubDate').text,
|
||||
'pageContent': content.get_text()
|
||||
}
|
||||
publications.append(data)
|
||||
|
||||
with open(file_path, 'w+', encoding='utf-8') as json_file:
|
||||
json.dump(publications, json_file, ensure_ascii=True, indent=2)
|
||||
print(f"{len(publications)} articles found for author medium.com/@{handle}. Saved to medium-logs/medium-{handle}.json")
|
||||
|
||||
return publications
|
||||
|
||||
def append_meta(publication, text):
|
||||
meta = {
|
||||
'url': publication.get('url'),
|
||||
'tags': publication.get('tags'),
|
||||
'title': publication.get('title'),
|
||||
'createdAt': publication.get('published'),
|
||||
'wordCount': len(text.split(' '))
|
||||
}
|
||||
return "Article Metadata:\n"+json.dumps(meta)+"\n\nArticle Content:\n" + text
|
||||
|
|
@@ -1,39 +0,0 @@
|
|||
import requests
|
||||
import xml.etree.ElementTree as ET
|
||||
from scripts.link import parse_links
|
||||
import re
|
||||
|
||||
def parse_sitemap(url):
|
||||
response = requests.get(url)
|
||||
root = ET.fromstring(response.content)
|
||||
|
||||
urls = []
|
||||
for element in root.iter('{http://www.sitemaps.org/schemas/sitemap/0.9}url'):
|
||||
for loc in element.iter('{http://www.sitemaps.org/schemas/sitemap/0.9}loc'):
|
||||
if not has_extension_to_ignore(loc.text):
|
||||
urls.append(loc.text)
|
||||
else:
|
||||
print(f"Skipping filetype: {loc.text}")
|
||||
|
||||
return urls
|
||||
|
||||
# Example sitemap URL https://www.nerdwallet.com/blog/wp-sitemap-news-articles-1.xml
|
||||
def sitemap():
|
||||
sitemap_url = input("Enter the URL of the sitemap: ")
|
||||
|
||||
if(len(sitemap_url) == 0):
|
||||
print("No valid sitemap provided!")
|
||||
exit(1)
|
||||
|
||||
url_array = parse_sitemap(sitemap_url)
|
||||
|
||||
#parse links from array
|
||||
parse_links(url_array)
|
||||
|
||||
def has_extension_to_ignore(string):
|
||||
image_extensions = ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.pdf']
|
||||
|
||||
pattern = r'\b(' + '|'.join(re.escape(ext) for ext in image_extensions) + r')\b'
|
||||
match = re.search(pattern, string, re.IGNORECASE)
|
||||
|
||||
return match is not None
|
|
@@ -1,78 +0,0 @@
|
|||
import os, json
|
||||
from urllib.parse import urlparse
|
||||
from .utils import tokenize, ada_v2_cost
|
||||
from .substack_utils import fetch_all_publications, only_valid_publications, get_content, append_meta
|
||||
from alive_progress import alive_it
|
||||
|
||||
# Example substack URL: https://swyx.substack.com/
|
||||
def substack():
|
||||
author_url = input("Enter the substack URL of the author you want to collect: ")
|
||||
if(author_url == ''):
|
||||
print("Not a valid author.substack.com URL")
|
||||
exit(1)
|
||||
|
||||
source = urlparse(author_url)
|
||||
if('substack.com' not in source.netloc or len(source.netloc.split('.')) != 3):
|
||||
print("This does not appear to be a valid author.substack.com URL")
|
||||
exit(1)
|
||||
|
||||
subdomain = source.netloc.split('.')[0]
|
||||
publications = fetch_all_publications(subdomain)
|
||||
valid_publications = only_valid_publications(publications)
|
||||
|
||||
if(len(valid_publications)==0):
|
||||
print("There are no public or free preview newsletters by this creator - nothing to collect.")
|
||||
exit(1)
|
||||
|
||||
print(f"{len(valid_publications)} of {len(publications)} publications are readable publically text posts - collecting those.")
|
||||
|
||||
totalTokenCount = 0
|
||||
transaction_output_dir = f"../server/storage/documents/substack-{subdomain}"
|
||||
if os.path.isdir(transaction_output_dir) == False:
|
||||
os.makedirs(transaction_output_dir)
|
||||
|
||||
for publication in alive_it(valid_publications):
|
||||
pub_file_path = transaction_output_dir + f"/publication-{publication.get('id')}.json"
|
||||
if os.path.exists(pub_file_path) == True: continue
|
||||
|
||||
full_text = get_content(publication.get('canonical_url'))
|
||||
if full_text is None or len(full_text) == 0: continue
|
||||
|
||||
full_text = append_meta(publication, full_text)
|
||||
item = {
|
||||
'id': publication.get('id'),
|
||||
'url': publication.get('canonical_url'),
|
||||
'thumbnail': publication.get('cover_image'),
|
||||
'title': publication.get('title'),
|
||||
'subtitle': publication.get('subtitle'),
|
||||
'description': publication.get('description'),
|
||||
'published': publication.get('post_date'),
|
||||
'wordCount': publication.get('wordcount'),
|
||||
'pageContent': full_text,
|
||||
}
|
||||
|
||||
tokenCount = len(tokenize(full_text))
|
||||
item['token_count_estimate'] = tokenCount
|
||||
|
||||
totalTokenCount += tokenCount
|
||||
with open(pub_file_path, 'w', encoding='utf-8') as file:
|
||||
json.dump(item, file, ensure_ascii=True, indent=4)
|
||||
|
||||
print(f"[Success]: {len(valid_publications)} scraped and fetched!")
|
||||
print(f"\n\n////////////////////////////")
|
||||
print(f"Your estimated cost to embed all of this data using OpenAI's text-embedding-ada-002 model at $0.0004 / 1K tokens will cost {ada_v2_cost(totalTokenCount)} using {totalTokenCount} tokens.")
|
||||
print(f"////////////////////////////\n\n")
|
||||
exit(0)
@@ -1,88 +0,0 @@
|
|||
import os, json, requests, tempfile
|
||||
from requests_html import HTMLSession
|
||||
from langchain.document_loaders import UnstructuredHTMLLoader
|
||||
from .watch.utils import guid
|
||||
|
||||
def fetch_all_publications(subdomain):
|
||||
file_path = f"./outputs/substack-logs/substack-{subdomain}.json"
|
||||
|
||||
if os.path.isdir("./outputs/substack-logs") == False:
|
||||
os.makedirs("./outputs/substack-logs")
|
||||
|
||||
if os.path.exists(file_path):
|
||||
with open(file_path, "r") as file:
|
||||
print(f"Returning cached data for substack {subdomain}.substack.com. If you do not wish to use stored data then delete the file for this newsletter to allow refetching.")
|
||||
return json.load(file)
|
||||
|
||||
collecting = True
|
||||
offset = 0
|
||||
publications = []
|
||||
|
||||
while collecting is True:
|
||||
url = f"https://{subdomain}.substack.com/api/v1/archive?sort=new&offset={offset}"
|
||||
response = requests.get(url)
|
||||
if(response.ok == False):
|
||||
print("Bad response - exiting collection")
|
||||
collecting = False
|
||||
continue
|
||||
|
||||
data = response.json()
|
||||
|
||||
if(len(data) ==0 ):
|
||||
collecting = False
|
||||
continue
|
||||
|
||||
for publication in data:
|
||||
publications.append(publication)
|
||||
offset = len(publications)
|
||||
|
||||
with open(file_path, 'w+', encoding='utf-8') as json_file:
|
||||
json.dump(publications, json_file, ensure_ascii=True, indent=2)
|
||||
print(f"{len(publications)} publications found for author {subdomain}.substack.com. Saved to substack-logs/channel-{subdomain}.json")
|
||||
|
||||
return publications
|
||||
|
||||
def only_valid_publications(publications= []):
|
||||
valid_publications = []
|
||||
for publication in publications:
|
||||
is_paid = publication.get('audience') != 'everyone'
|
||||
if (is_paid and publication.get('should_send_free_preview') != True) or publication.get('type') != 'newsletter': continue
|
||||
valid_publications.append(publication)
|
||||
return valid_publications
|
||||
|
||||
def get_content(article_link):
|
||||
print(f"Fetching {article_link}")
|
||||
if(len(article_link) == 0):
|
||||
print("Invalid URL!")
|
||||
return None
|
||||
|
||||
session = HTMLSession()
|
||||
req = session.get(article_link)
|
||||
if(req.ok == False):
|
||||
print("Could not reach this url!")
|
||||
return None
|
||||
|
||||
req.html.render()
|
||||
|
||||
full_text = None
|
||||
with tempfile.NamedTemporaryFile(mode = "w") as tmp:
|
||||
tmp.write(req.html.html)
|
||||
tmp.seek(0)
|
||||
loader = UnstructuredHTMLLoader(tmp.name)
|
||||
data = loader.load()[0]
|
||||
full_text = data.page_content
|
||||
tmp.close()
|
||||
return full_text
|
||||
|
||||
def append_meta(publication, text):
|
||||
meta = {
|
||||
'id': guid(),
|
||||
'url': publication.get('canonical_url'),
|
||||
'thumbnail': publication.get('cover_image'),
|
||||
'title': publication.get('title'),
|
||||
'subtitle': publication.get('subtitle'),
|
||||
'description': publication.get('description'),
|
||||
'createdAt': publication.get('post_date'),
|
||||
'wordCount': publication.get('wordcount')
|
||||
}
|
||||
return "Newsletter Metadata:\n"+json.dumps(meta)+"\n\nArticle Content:\n" + text
|
|
@@ -1,103 +0,0 @@
|
|||
"""
|
||||
Tweepy implementation of twitter reader. Requires the 4 twitter keys to operate.
|
||||
"""
|
||||
|
||||
import tweepy
|
||||
import os, time
|
||||
import pandas as pd
|
||||
import json
|
||||
from .utils import tokenize, ada_v2_cost
|
||||
from .watch.utils import guid
|
||||
|
||||
def twitter():
|
||||
#get user and number of tweets to read
|
||||
username = input("user timeline to read from (blank to ignore): ")
|
||||
searchQuery = input("Search term, or leave blank to get user tweets (blank to ignore): ")
|
||||
tweetCount = input("Gather the last number of tweets: ")
|
||||
|
||||
# Read your API keys to call the API.
|
||||
consumer_key = os.environ.get("TW_CONSUMER_KEY")
|
||||
consumer_secret = os.environ.get("TW_CONSUMER_SECRET")
|
||||
access_token = os.environ.get("TW_ACCESS_TOKEN")
|
||||
access_token_secret = os.environ.get("TW_ACCESS_TOKEN_SECRET")
|
||||
|
||||
# Check if any of the required environment variables is missing.
|
||||
if not consumer_key or not consumer_secret or not access_token or not access_token_secret:
|
||||
raise EnvironmentError("One of the twitter API environment variables are missing.")
|
||||
|
||||
# Pass in our twitter API authentication key
|
||||
auth = tweepy.OAuth1UserHandler(
|
||||
consumer_key, consumer_secret, access_token, access_token_secret
|
||||
)
|
||||
|
||||
# Instantiate the tweepy API
|
||||
api = tweepy.API(auth, wait_on_rate_limit=True)
|
||||
|
||||
try:
|
||||
if (searchQuery == ''):
|
||||
tweets = api.user_timeline(screen_name=username, tweet_mode = 'extended', count=tweetCount)
|
||||
else:
|
||||
tweets = api.search_tweets(q=searchQuery, tweet_mode = 'extended', count=tweetCount)
|
||||
|
||||
# Pulling Some attributes from the tweet
|
||||
attributes_container = [
|
||||
[tweet.id, tweet.user.screen_name, tweet.created_at, tweet.favorite_count, tweet.source, tweet.full_text]
|
||||
for tweet in tweets
|
||||
]
|
||||
|
||||
# Creation of column list to rename the columns in the dataframe
|
||||
columns = ["id", "Screen Name", "Date Created", "Number of Likes", "Source of Tweet", "Tweet"]
|
||||
|
||||
# Creation of Dataframe
|
||||
tweets_df = pd.DataFrame(attributes_container, columns=columns)
|
||||
|
||||
totalTokens = 0
|
||||
for index, row in tweets_df.iterrows():
|
||||
meta_link = twitter_meta(row, True)
|
||||
output_filename = f"twitter-{username}-{row['Date Created']}.json"
|
||||
output_path = f"./outputs/twitter-logs"
|
||||
|
||||
transaction_output_filename = f"tweet-{username}-{row['id']}.json"
|
||||
transaction_output_dir = f"../server/storage/documents/twitter-{username}"
|
||||
|
||||
if not os.path.isdir(output_path):
|
||||
os.makedirs(output_path)
|
||||
|
||||
if not os.path.isdir(transaction_output_dir):
|
||||
os.makedirs(transaction_output_dir)
|
||||
|
||||
full_text = twitter_meta(row)
|
||||
tokenCount = len(tokenize(full_text))
|
||||
meta_link['pageContent'] = full_text
|
||||
meta_link['token_count_estimate'] = tokenCount
|
||||
totalTokens += tokenCount
|
||||
|
||||
with open(f"{output_path}/{output_filename}", 'w', encoding='utf-8') as file:
|
||||
json.dump(meta_link, file, ensure_ascii=True, indent=4)
|
||||
|
||||
with open(f"{transaction_output_dir}/{transaction_output_filename}", 'w', encoding='utf-8') as file:
|
||||
json.dump(meta_link, file, ensure_ascii=True, indent=4)
|
||||
|
||||
# print(f"{transaction_output_dir}/{transaction_output_filename}")
|
||||
|
||||
print(f"{tokenCount} tokens written over {tweets_df.shape[0]} records.")
|
||||
|
||||
except BaseException as e:
|
||||
print("Status Failed: ", str(e))
|
||||
time.sleep(3)
|
||||
|
||||
|
||||
def twitter_meta(row, metadata_only = False):
|
||||
# Note that /anyuser is a known twitter hack for not knowing the user's handle
|
||||
# https://stackoverflow.com/questions/897107/can-i-fetch-the-tweet-from-twitter-if-i-know-the-tweets-id
|
||||
url = f"http://twitter.com/anyuser/status/{row['id']}"
|
||||
title = f"Tweet {row['id']}"
|
||||
meta = {
|
||||
'id': guid(),
|
||||
'url': url,
|
||||
'title': title,
|
||||
'description': 'Tweet from ' + row["Screen Name"],
|
||||
'published': row["Date Created"].strftime('%Y-%m-%d %H:%M:%S'),
|
||||
'wordCount': len(row["Tweet"]),
|
||||
}
|
||||
return "Tweet JSON Metadata:\n"+json.dumps(meta)+"\n\n\nText Content:\n" + row["Tweet"] if metadata_only == False else meta
|
|
@@ -1,10 +0,0 @@
|
|||
import tiktoken
|
||||
encoder = tiktoken.encoding_for_model("text-embedding-ada-002")
|
||||
|
||||
def tokenize(fullText):
|
||||
return encoder.encode(fullText)
|
||||
|
||||
def ada_v2_cost(tokenCount):
|
||||
rate_per = 0.0004 / 1_000 # $0.0004 / 1K tokens
|
||||
total = tokenCount * rate_per
|
||||
return '${:,.2f}'.format(total) if total >= 0.01 else '< $0.01'
|
|
@@ -1,78 +0,0 @@
|
|||
import os
|
||||
from langchain.document_loaders import Docx2txtLoader, UnstructuredODTLoader
|
||||
from slugify import slugify
|
||||
from ..utils import guid, file_creation_time, write_to_server_documents, move_source
|
||||
from ...utils import tokenize
|
||||
|
||||
# Process all text-related documents.
|
||||
def as_docx(**kwargs):
|
||||
parent_dir = kwargs.get('directory', 'hotdir')
|
||||
filename = kwargs.get('filename')
|
||||
ext = kwargs.get('ext', '.txt')
|
||||
remove = kwargs.get('remove_on_complete', False)
|
||||
fullpath = f"{parent_dir}/{filename}{ext}"
|
||||
|
||||
loader = Docx2txtLoader(fullpath)
|
||||
data = loader.load()[0]
|
||||
content = data.page_content
|
||||
|
||||
if len(content) == 0:
|
||||
print(f"Resulting text content was empty for {filename}{ext}.")
|
||||
return(False, f"No text content found in {filename}{ext}")
|
||||
|
||||
print(f"-- Working {fullpath} --")
|
||||
data = {
|
||||
'id': guid(),
|
||||
'url': "file://"+os.path.abspath(f"{parent_dir}/processed/{filename}{ext}"),
|
||||
'title': f"{filename}{ext}",
|
||||
'docAuthor': 'Unknown', # TODO: Find a better author
|
||||
'description': 'Unknown', # TODO: Find a better description
|
||||
'docSource': 'Docx Text file uploaded by the user.',
|
||||
'chunkSource': f"{filename}{ext}",
|
||||
'published': file_creation_time(fullpath),
|
||||
'wordCount': len(content),
|
||||
'pageContent': content,
|
||||
'token_count_estimate': len(tokenize(content))
|
||||
}
|
||||
|
||||
write_to_server_documents(data, f"{slugify(filename)}-{data.get('id')}")
|
||||
move_source(parent_dir, f"{filename}{ext}", remove=remove)
|
||||
|
||||
print(f"[SUCCESS]: {filename}{ext} converted & ready for embedding.\n")
|
||||
return(True, None)
|
||||
|
||||
def as_odt(**kwargs):
|
||||
parent_dir = kwargs.get('directory', 'hotdir')
|
||||
filename = kwargs.get('filename')
|
||||
ext = kwargs.get('ext', '.txt')
|
||||
remove = kwargs.get('remove_on_complete', False)
|
||||
fullpath = f"{parent_dir}/{filename}{ext}"
|
||||
|
||||
loader = UnstructuredODTLoader(fullpath)
|
||||
data = loader.load()[0]
|
||||
content = data.page_content
|
||||
|
||||
if len(content) == 0:
|
||||
print(f"Resulting text content was empty for {filename}{ext}.")
|
||||
return(False, f"No text content found in {filename}{ext}")
|
||||
|
||||
print(f"-- Working {fullpath} --")
|
||||
data = {
|
||||
'id': guid(),
|
||||
'url': "file://"+os.path.abspath(f"{parent_dir}/processed/{filename}{ext}"),
|
||||
'title': f"{filename}{ext}",
|
||||
'docAuthor': 'Unknown', # TODO: Find a better author
|
||||
'description': 'Unknown', # TODO: Find a better description
|
||||
'docSource': 'ODT Text file uploaded by the user.',
|
||||
'chunkSource': f"{filename}{ext}",
|
||||
'published': file_creation_time(fullpath),
|
||||
'wordCount': len(content),
|
||||
'pageContent': content,
|
||||
'token_count_estimate': len(tokenize(content))
|
||||
}
|
||||
|
||||
write_to_server_documents(data, f"{slugify(filename)}-{data.get('id')}")
|
||||
move_source(parent_dir, f"{filename}{ext}", remove=remove)
|
||||
|
||||
print(f"[SUCCESS]: {filename}{ext} converted & ready for embedding.\n")
|
||||
return(True, None)
|
|
@@ -1,42 +0,0 @@
|
|||
import os, re
|
||||
from slugify import slugify
|
||||
from langchain.document_loaders import BSHTMLLoader
|
||||
from ..utils import guid, file_creation_time, write_to_server_documents, move_source
|
||||
from ...utils import tokenize
|
||||
|
||||
# Process all html-related documents.
|
||||
def as_html(**kwargs):
|
||||
parent_dir = kwargs.get('directory', 'hotdir')
|
||||
filename = kwargs.get('filename')
|
||||
ext = kwargs.get('ext', '.html')
|
||||
remove = kwargs.get('remove_on_complete', False)
|
||||
fullpath = f"{parent_dir}/{filename}{ext}"
|
||||
|
||||
loader = BSHTMLLoader(fullpath)
|
||||
document = loader.load()[0]
|
||||
content = re.sub(r"\n+", "\n", document.page_content)
|
||||
|
||||
if len(content) == 0:
|
||||
print(f"Resulting text content was empty for {filename}{ext}.")
|
||||
return(False, f"No text content found in {filename}{ext}")
|
||||
|
||||
print(f"-- Working {fullpath} --")
|
||||
data = {
|
||||
'id': guid(),
|
||||
'url': "file://"+os.path.abspath(f"{parent_dir}/processed/{filename}{ext}"),
|
||||
'title': document.metadata.get('title', f"{filename}{ext}"),
|
||||
'docAuthor': 'Unknown', # TODO: Find a better author
|
||||
'description': 'Unknown', # TODO: Find a better description
|
||||
'docSource': 'an HTML file uploaded by the user.',
|
||||
'chunkSource': f"{filename}{ext}",
|
||||
'published': file_creation_time(fullpath),
|
||||
'wordCount': len(content),
|
||||
'pageContent': content,
|
||||
'token_count_estimate': len(tokenize(content))
|
||||
}
|
||||
|
||||
write_to_server_documents(data, f"{slugify(filename)}-{data.get('id')}")
|
||||
move_source(parent_dir, f"{filename}{ext}", remove=remove)
|
||||
|
||||
print(f"[SUCCESS]: {filename}{ext} converted & ready for embedding.\n")
|
||||
return(True, None)
|
|
@@ -1,42 +0,0 @@
|
|||
import os
|
||||
from langchain.document_loaders import UnstructuredMarkdownLoader
|
||||
from slugify import slugify
|
||||
from ..utils import guid, file_creation_time, write_to_server_documents, move_source
|
||||
from ...utils import tokenize
|
||||
|
||||
# Process all text-related documents.
|
||||
def as_markdown(**kwargs):
|
||||
parent_dir = kwargs.get('directory', 'hotdir')
|
||||
filename = kwargs.get('filename')
|
||||
ext = kwargs.get('ext', '.txt')
|
||||
remove = kwargs.get('remove_on_complete', False)
|
||||
fullpath = f"{parent_dir}/{filename}{ext}"
|
||||
|
||||
loader = UnstructuredMarkdownLoader(fullpath)
|
||||
data = loader.load()[0]
|
||||
content = data.page_content
|
||||
|
||||
if len(content) == 0:
|
||||
print(f"Resulting page content was empty - no text could be extracted from {filename}{ext}.")
|
||||
return(False, f"No text could be extracted from {filename}{ext}.")
|
||||
|
||||
print(f"-- Working {fullpath} --")
|
||||
data = {
|
||||
'id': guid(),
|
||||
'url': "file://"+os.path.abspath(f"{parent_dir}/processed/{filename}{ext}"),
|
||||
'title': f"{filename}", # TODO: find a better metadata
|
||||
'docAuthor': 'Unknown', # TODO: find a better metadata
|
||||
'description': 'Unknown', # TODO: find a better metadata
|
||||
'docSource': 'markdown file uploaded by the user.',
|
||||
'chunkSource': f"{filename}{ext}",
|
||||
'published': file_creation_time(fullpath),
|
||||
'wordCount': len(content),
|
||||
'pageContent': content,
|
||||
'token_count_estimate': len(tokenize(content))
|
||||
}
|
||||
|
||||
write_to_server_documents(data, f"{slugify(filename)}-{data.get('id')}")
|
||||
move_source(parent_dir, f"{filename}{ext}", remove=remove)
|
||||
|
||||
print(f"[SUCCESS]: {filename}{ext} converted & ready for embedding.\n")
|
||||
return(True, None)
|
|
@@ -1,124 +0,0 @@
|
|||
import os
|
||||
import datetime
|
||||
import email.utils
|
||||
import re
|
||||
import quopri
|
||||
import base64
|
||||
from mailbox import mbox, mboxMessage
|
||||
from slugify import slugify
|
||||
from bs4 import BeautifulSoup
|
||||
from scripts.watch.utils import (
|
||||
guid,
|
||||
file_creation_time,
|
||||
write_to_server_documents,
|
||||
move_source,
|
||||
)
|
||||
from scripts.utils import tokenize
|
||||
|
||||
|
||||
def get_content(message: mboxMessage) -> str:
|
||||
content = "None"
|
||||
# if message.is_multipart():
|
||||
for part in message.walk():
|
||||
if part.get_content_type() == "text/plain":
|
||||
content = part.get_payload(decode=True)
|
||||
break
|
||||
elif part.get_content_type() == "text/html":
|
||||
soup = BeautifulSoup(part.get_payload(decode=True), "html.parser")
|
||||
content = soup.get_text()
|
||||
|
||||
if isinstance(content, bytes):
|
||||
try:
|
||||
content = content.decode("utf-8")
|
||||
except UnicodeDecodeError:
|
||||
content = content.decode("latin-1")
|
||||
|
||||
return content
|
||||
|
||||
|
||||
def parse_subject(subject: str) -> str:
|
||||
# Check if subject is Quoted-Printable encoded
|
||||
if subject.startswith("=?") and subject.endswith("?="):
|
||||
# Extract character set and encoding information
|
||||
match = re.match(r"=\?(.+)\?(.)\?(.+)\?=", subject)
|
||||
if match:
|
||||
charset = match.group(1)
|
||||
encoding = match.group(2)
|
||||
encoded_text = match.group(3)
|
||||
is_quoted_printable = encoding.upper() == "Q"
|
||||
is_base64 = encoding.upper() == "B"
|
||||
if is_quoted_printable:
|
||||
# Decode Quoted-Printable encoded text
|
||||
subject = quopri.decodestring(encoded_text).decode(charset)
|
||||
elif is_base64:
|
||||
# Decode Base64 encoded text
|
||||
subject = base64.b64decode(encoded_text).decode(charset)
|
||||
|
||||
return subject
|
||||
|
||||
|
||||
# Process all mbox-related documents.
|
||||
def as_mbox(**kwargs):
|
||||
parent_dir = kwargs.get("directory", "hotdir")
|
||||
filename = kwargs.get("filename")
|
||||
ext = kwargs.get("ext", ".mbox")
|
||||
remove = kwargs.get("remove_on_complete", False)
|
||||
|
||||
if filename is not None:
|
||||
filename = str(filename)
|
||||
else:
|
||||
print("[ERROR]: No filename provided.")
|
||||
return (False, "No filename provided.")
|
||||
|
||||
fullpath = f"{parent_dir}/{filename}{ext}"
|
||||
|
||||
print(f"-- Working {fullpath} --")
|
||||
box = mbox(fullpath)
|
||||
|
||||
for message in box:
|
||||
content = get_content(message)
|
||||
content = content.strip().replace("\r\n", "\n")
|
||||
|
||||
if len(content) == 0:
|
||||
print("[WARNING]: Mail with no content. Ignored.")
|
||||
continue
|
||||
|
||||
date_tuple = email.utils.parsedate_tz(message["Date"])
|
||||
if date_tuple:
|
||||
local_date = datetime.datetime.fromtimestamp(
|
||||
email.utils.mktime_tz(date_tuple)
|
||||
)
|
||||
date_sent = local_date.strftime("%a, %d %b %Y %H:%M:%S")
|
||||
else:
|
||||
date_sent = None
|
||||
|
||||
subject = message["Subject"]
|
||||
|
||||
if subject is None:
|
||||
print("[WARNING]: Mail with no subject. But has content.")
|
||||
subject = "None"
|
||||
else:
|
||||
subject = parse_subject(subject)
|
||||
|
||||
abs_path = os.path.abspath(
|
||||
f"{parent_dir}/processed/{slugify(filename)}-{guid()}{ext}"
|
||||
)
|
||||
data = {
|
||||
"id": guid(),
|
||||
"url": f"file://{abs_path}",
|
||||
"title": subject,
|
||||
"docAuthor": message["From"],
|
||||
"description": f"email from {message['From']} to {message['To']}",
|
||||
"docSource": "mbox file uploaded by the user.",
|
||||
"chunkSource": subject,
|
||||
"published": file_creation_time(fullpath),
|
||||
"wordCount": len(content),
|
||||
"pageContent": content,
|
||||
"token_count_estimate": len(tokenize(content)),
|
||||
}
|
||||
|
||||
write_to_server_documents(data, f"{slugify(filename)}-{data.get('id')}")
|
||||
|
||||
move_source(parent_dir, f"{filename}{ext}", remove=remove)
|
||||
print(f"[SUCCESS]: {filename}{ext} converted & ready for embedding.\n")
|
||||
return (True, None)
|
|
@@ -1,58 +0,0 @@
|
|||
import os, fitz
|
||||
from langchain.document_loaders import PyMuPDFLoader # better UTF support and metadata
|
||||
from slugify import slugify
|
||||
from ..utils import guid, file_creation_time, write_to_server_documents, move_source
|
||||
from ...utils import tokenize
|
||||
|
||||
# Process all PDF-related documents.
|
||||
def as_pdf(**kwargs):
|
||||
parent_dir = kwargs.get('directory', 'hotdir')
|
||||
filename = kwargs.get('filename')
|
||||
ext = kwargs.get('ext', '.txt')
|
||||
remove = kwargs.get('remove_on_complete', False)
|
||||
fullpath = f"{parent_dir}/{filename}{ext}"
|
||||
|
||||
print(f"-- Working {fullpath} --")
|
||||
loader = PyMuPDFLoader(fullpath)
|
||||
pages = loader.load()
|
||||
|
||||
if len(pages) == 0:
|
||||
print(f"{fullpath} parsing resulted in no pages - nothing to do.")
|
||||
return(False, f"No pages found for {filename}{ext}!")
|
||||
|
||||
# Set doc to the first page so we can still get the metadata from PyMuPDF but without all the unicode issues.
|
||||
doc = pages[0]
|
||||
del loader
|
||||
del pages
|
||||
|
||||
page_content = ''
|
||||
for page in fitz.open(fullpath):
|
||||
print(f"-- Parsing content from pg {page.number} --")
|
||||
page_content += str(page.get_text('text'))
|
||||
|
||||
if len(page_content) == 0:
|
||||
print(f"Resulting page content was empty - no text could be extracted from the document.")
|
||||
return(False, f"No text content could be extracted from {filename}{ext}!")
|
||||
|
||||
title = doc.metadata.get('title')
|
||||
author = doc.metadata.get('author')
|
||||
subject = doc.metadata.get('subject')
|
||||
data = {
|
||||
'id': guid(),
|
||||
'url': "file://"+os.path.abspath(f"{parent_dir}/processed/{filename}{ext}"),
|
||||
'title': title if title else f"{filename}{ext}",
|
||||
'docAuthor': author if author else 'No author found',
|
||||
'description': subject if subject else 'No description found.',
|
||||
'docSource': 'pdf file uploaded by the user.',
|
||||
'chunkSource': f"{filename}{ext}",
|
||||
'published': file_creation_time(fullpath),
|
||||
'wordCount': len(page_content), # Technically a letter count :p
|
||||
'pageContent': page_content,
|
||||
'token_count_estimate': len(tokenize(page_content))
|
||||
}
|
||||
|
||||
write_to_server_documents(data, f"{slugify(filename)}-{data.get('id')}")
|
||||
move_source(parent_dir, f"{filename}{ext}", remove=remove)
|
||||
|
||||
print(f"[SUCCESS]: {filename}{ext} converted & ready for embedding.\n")
|
||||
return(True, None)
|
|
@@ -1,38 +0,0 @@
|
|||
import os
|
||||
from slugify import slugify
|
||||
from ..utils import guid, file_creation_time, write_to_server_documents, move_source
|
||||
from ...utils import tokenize
|
||||
|
||||
# Process all text-related documents.
|
||||
def as_text(**kwargs):
|
||||
parent_dir = kwargs.get('directory', 'hotdir')
|
||||
filename = kwargs.get('filename')
|
||||
ext = kwargs.get('ext', '.txt')
|
||||
remove = kwargs.get('remove_on_complete', False)
|
||||
fullpath = f"{parent_dir}/{filename}{ext}"
|
||||
content = open(fullpath).read()
|
||||
|
||||
if len(content) == 0:
|
||||
print(f"Resulting text content was empty for {filename}{ext}.")
|
||||
return(False, f"No text content found in {filename}{ext}")
|
||||
|
||||
print(f"-- Working {fullpath} --")
|
||||
data = {
|
||||
'id': guid(),
|
||||
'url': "file://"+os.path.abspath(f"{parent_dir}/processed/{filename}{ext}"),
|
||||
'title': f"{filename}{ext}",
|
||||
'docAuthor': 'Unknown', # TODO: Find a better author
|
||||
'description': 'Unknown', # TODO: Find a better description
|
||||
'docSource': 'a text file uploaded by the user.',
|
||||
'chunkSource': f"{filename}{ext}",
|
||||
'published': file_creation_time(fullpath),
|
||||
'wordCount': len(content),
|
||||
'pageContent': content,
|
||||
'token_count_estimate': len(tokenize(content))
|
||||
}
|
||||
|
||||
write_to_server_documents(data, f"{slugify(filename)}-{data.get('id')}")
|
||||
move_source(parent_dir, f"{filename}{ext}", remove=remove)
|
||||
|
||||
print(f"[SUCCESS]: {filename}{ext} converted & ready for embedding.\n")
|
||||
return(True, None)
|
|
@@ -1,25 +0,0 @@
|
|||
from .convert.as_text import as_text
|
||||
from .convert.as_markdown import as_markdown
|
||||
from .convert.as_pdf import as_pdf
|
||||
from .convert.as_docx import as_docx, as_odt
|
||||
from .convert.as_mbox import as_mbox
|
||||
from .convert.as_html import as_html
|
||||
|
||||
FILETYPES = {
|
||||
'.txt': as_text,
|
||||
'.md': as_markdown,
|
||||
'.pdf': as_pdf,
|
||||
'.docx': as_docx,
|
||||
'.odt': as_odt,
|
||||
'.mbox': as_mbox,
|
||||
'.html': as_html,
|
||||
}
|
||||
|
||||
ACCEPTED_MIMES = {
|
||||
'text/plain': ['.txt', '.md'],
|
||||
'text/html': ['.html'],
|
||||
'application/vnd.openxmlformats-officedocument.wordprocessingml.document': ['.docx'],
|
||||
'application/vnd.oasis.opendocument.text': ['.odt'],
|
||||
'application/pdf': ['.pdf'],
|
||||
'application/mbox': ['.mbox'],
|
||||
}
|
|
@@ -1,22 +0,0 @@
|
|||
import os
|
||||
from .filetypes import FILETYPES
|
||||
from .utils import move_source
|
||||
|
||||
RESERVED = ['__HOTDIR__.md']
|
||||
def watch_for_changes(directory):
|
||||
for raw_doc in os.listdir(directory):
|
||||
if os.path.isdir(f"{directory}/{raw_doc}") or raw_doc in RESERVED: continue
|
||||
|
||||
filename, fileext = os.path.splitext(raw_doc)
|
||||
if filename in ['.DS_Store'] or fileext == '': continue
|
||||
|
||||
if fileext not in FILETYPES.keys():
|
||||
print(f"{fileext} not a supported file type for conversion. Removing from hot directory.")
|
||||
move_source(new_destination_filename=raw_doc, failed=True)
|
||||
continue
|
||||
|
||||
FILETYPES[fileext](
|
||||
directory=directory,
|
||||
filename=filename,
|
||||
ext=fileext,
|
||||
)
|
|
@@ -1,35 +0,0 @@
|
|||
import os
|
||||
from .filetypes import FILETYPES
|
||||
from .utils import move_source
|
||||
|
||||
RESERVED = ['__HOTDIR__.md']
|
||||
|
||||
# This script will do a one-off processing of a specific document that exists in hotdir.
|
||||
# For this function we remove the original source document since there is no need to keep it and it will
|
||||
# only occupy additional disk space.
|
||||
def process_single(directory, target_doc):
|
||||
if os.path.isdir(f"{directory}/{target_doc}") or target_doc in RESERVED: return (False, "Not a file")
|
||||
|
||||
if os.path.exists(f"{directory}/{target_doc}") is False:
|
||||
print(f"{directory}/{target_doc} does not exist.")
|
||||
return (False, f"{directory}/{target_doc} does not exist.")
|
||||
|
||||
filename, fileext = os.path.splitext(target_doc)
|
||||
if filename in ['.DS_Store'] or fileext == '': return False
|
||||
if fileext == '.lock':
|
||||
print(f"{filename} is locked - skipping until unlocked")
|
||||
return (False, f"{filename} is locked - skipping until unlocked")
|
||||
|
||||
if fileext not in FILETYPES.keys():
|
||||
print(f"{fileext} not a supported file type for conversion. It will not be processed.")
|
||||
move_source(new_destination_filename=target_doc, failed=True, remove=True)
|
||||
return (False, f"{fileext} not a supported file type for conversion. It will not be processed.")
|
||||
|
||||
# Returns Tuple of (Boolean, String|None) of success status and possible error message.
|
||||
# Error message will display to user.
|
||||
return FILETYPES[fileext](
|
||||
directory=directory,
|
||||
filename=filename,
|
||||
ext=fileext,
|
||||
remove_on_complete=True # remove source document to save disk space.
|
||||
)
|
|
@@ -1,35 +0,0 @@
|
|||
import os, json
|
||||
from datetime import datetime
|
||||
from uuid import uuid4
|
||||
|
||||
def guid():
|
||||
return str(uuid4())
|
||||
|
||||
def file_creation_time(path_to_file):
|
||||
try:
|
||||
if os.name == 'nt':
|
||||
return datetime.fromtimestamp(os.path.getctime(path_to_file)).strftime('%Y-%m-%d %H:%M:%S')
|
||||
else:
|
||||
stat = os.stat(path_to_file)
|
||||
return datetime.fromtimestamp(stat.st_birthtime).strftime('%Y-%m-%d %H:%M:%S')
|
||||
except AttributeError:
|
||||
return datetime.today().strftime('%Y-%m-%d %H:%M:%S')
|
||||
|
||||
def move_source(working_dir='hotdir', new_destination_filename='', failed=False, remove=False):
|
||||
if remove and os.path.exists(f"{working_dir}/{new_destination_filename}"):
|
||||
print(f"{new_destination_filename} deleted from filesystem")
|
||||
os.remove(f"{working_dir}/{new_destination_filename}")
|
||||
return
|
||||
|
||||
destination = f"{working_dir}/processed" if not failed else f"{working_dir}/failed"
|
||||
if os.path.exists(destination) == False:
|
||||
os.mkdir(destination)
|
||||
|
||||
os.replace(f"{working_dir}/{new_destination_filename}", f"{destination}/{new_destination_filename}")
|
||||
return
|
||||
|
||||
def write_to_server_documents(data, filename, override_destination = None):
|
||||
destination = f"../server/storage/documents/custom-documents" if override_destination == None else override_destination
|
||||
if os.path.exists(destination) == False: os.makedirs(destination)
|
||||
with open(f"{destination}/{filename}.json", 'w', encoding='utf-8') as file:
|
||||
json.dump(data, file, ensure_ascii=True, indent=4)
|
|
@@ -1,55 +0,0 @@
|
|||
import os, json
|
||||
from youtube_transcript_api import YouTubeTranscriptApi
|
||||
from youtube_transcript_api.formatters import TextFormatter, JSONFormatter
|
||||
from .utils import tokenize, ada_v2_cost
|
||||
from .yt_utils import fetch_channel_video_information, get_channel_id, clean_text, append_meta, get_duration
|
||||
from alive_progress import alive_it
|
||||
|
||||
# Example Channel URL https://www.youtube.com/channel/UCmWbhBB96ynOZuWG7LfKong
|
||||
# Example Channel URL https://www.youtube.com/@mintplex
|
||||
|
||||
def youtube():
|
||||
channel_link = input("Paste in the URL of a YouTube channel: ")
|
||||
channel_id = get_channel_id(channel_link)
|
||||
|
||||
if channel_id == None or len(channel_id) == 0:
|
||||
print("Invalid input - must be full YouTube channel URL")
|
||||
exit(1)
|
||||
|
||||
channel_data = fetch_channel_video_information(channel_id)
|
||||
transaction_output_dir = f"../server/storage/documents/youtube-{channel_data.get('channelTitle')}"
|
||||
|
||||
if os.path.isdir(transaction_output_dir) == False:
|
||||
os.makedirs(transaction_output_dir)
|
||||
|
||||
print(f"\nFetching transcripts for {len(channel_data.get('items'))} videos - please wait.\nStopping and restarting will not refetch known transcripts in case there is an error.\nSaving results to: {transaction_output_dir}.")
|
||||
totalTokenCount = 0
|
||||
for video in alive_it(channel_data.get('items')):
|
||||
video_file_path = transaction_output_dir + f"/video-{video.get('id')}.json"
|
||||
if os.path.exists(video_file_path) == True:
|
||||
continue
|
||||
|
||||
formatter = TextFormatter()
|
||||
json_formatter = JSONFormatter()
|
||||
try:
|
||||
transcript = YouTubeTranscriptApi.get_transcript(video.get('id'))
|
||||
raw_text = clean_text(formatter.format_transcript(transcript))
|
||||
duration = get_duration(json_formatter.format_transcript(transcript))
|
||||
|
||||
if(len(raw_text) > 0):
|
||||
fullText = append_meta(video, duration, raw_text)
|
||||
tokenCount = len(tokenize(fullText))
|
||||
video['pageContent'] = fullText
|
||||
video['token_count_estimate'] = tokenCount
|
||||
totalTokenCount += tokenCount
|
||||
with open(video_file_path, 'w', encoding='utf-8') as file:
|
||||
json.dump(video, file, ensure_ascii=True, indent=4)
|
||||
except:
|
||||
print("There was an issue getting the transcription of a video in the list - likely because captions are disabled. Skipping")
|
||||
continue
|
||||
|
||||
print(f"[Success]: {len(channel_data.get('items'))} video transcripts fetched!")
|
||||
print(f"\n\n////////////////////////////")
|
||||
print(f"Your estimated cost to embed all of this data using OpenAI's text-embedding-ada-002 model at $0.0004 / 1K tokens will cost {ada_v2_cost(totalTokenCount)} using {totalTokenCount} tokens.")
|
||||
print(f"////////////////////////////\n\n")
|
||||
exit(0)
|
|
@@ -1,122 +0,0 @@
|
|||
import json, requests, os, re
|
||||
from slugify import slugify
|
||||
from dotenv import load_dotenv
|
||||
from .watch.utils import guid
|
||||
load_dotenv()
|
||||
|
||||
def is_yt_short(videoId):
|
||||
url = 'https://www.youtube.com/shorts/' + videoId
|
||||
ret = requests.head(url)
|
||||
return ret.status_code == 200
|
||||
|
||||
def get_channel_id(channel_link):
|
||||
if('@' in channel_link):
|
||||
pattern = r'https?://www\.youtube\.com/(@\w+)/?'
|
||||
match = re.match(pattern, channel_link)
|
||||
if match is None: return None
|
||||
handle = match.group(1)
|
||||
print('Need to map username to channelId - this can take a while sometimes.')
|
||||
response = requests.get(f"https://yt.lemnoslife.com/channels?handle={handle}", timeout=20)
|
||||
|
||||
if(response.ok == False):
|
||||
print("Handle => ChannelId mapping endpoint is too slow - use regular youtube.com/channel URL")
|
||||
return None
|
||||
|
||||
json_data = response.json()
|
||||
return json_data.get('items')[0].get('id')
|
||||
else:
|
||||
pattern = r"youtube\.com/channel/([\w-]+)"
|
||||
match = re.search(pattern, channel_link)
|
||||
return match.group(1) if match else None
|
||||
|
||||
|
||||
def clean_text(text):
|
||||
return re.sub(r"\[.*?\]", "", text)
|
||||
|
||||
def append_meta(video, duration, text):
|
||||
meta = {
|
||||
'id': guid(),
|
||||
'youtubeURL': f"https://youtube.com/watch?v={video.get('id')}",
|
||||
'thumbnail': video.get('thumbnail'),
|
||||
'description': video.get('description'),
|
||||
'createdAt': video.get('published'),
|
||||
'videoDurationInSeconds': duration,
|
||||
}
|
||||
return "Video JSON Metadata:\n"+json.dumps(meta, indent=4)+"\n\n\nAudio Transcript:\n" + text
|
||||
|
||||
def get_duration(json_str):
|
||||
data = json.loads(json_str)
|
||||
return data[-1].get('start')
|
||||
|
||||
def fetch_channel_video_information(channel_id, windowSize = 50):
|
||||
if channel_id == None or len(channel_id) == 0:
|
||||
print("No channel id provided!")
|
||||
exit(1)
|
||||
|
||||
if os.path.isdir("./outputs/channel-logs") == False:
|
||||
os.makedirs("./outputs/channel-logs")
|
||||
|
||||
file_path = f"./outputs/channel-logs/channel-{channel_id}.json"
|
||||
if os.path.exists(file_path):
|
||||
with open(file_path, "r") as file:
|
||||
print(f"Returning cached data for channel {channel_id}. If you do not wish to use stored data then delete the file for this channel to allow refetching.")
|
||||
return json.load(file)
|
||||
|
||||
if(os.getenv('GOOGLE_APIS_KEY') == None):
|
||||
print("GOOGLE_APIS_KEY env variable not set!")
|
||||
exit(1)
|
||||
|
||||
done = False
|
||||
currentPage = None
|
||||
pageTokens = []
|
||||
items = []
|
||||
data = {
|
||||
'id': channel_id,
|
||||
}
|
||||
|
||||
print("Fetching first page of results...")
|
||||
while(done == False):
|
||||
url = f"https://www.googleapis.com/youtube/v3/search?key={os.getenv('GOOGLE_APIS_KEY')}&channelId={channel_id}&part=snippet,id&order=date&type=video&maxResults={windowSize}"
|
||||
if(currentPage != None):
|
||||
print(f"Fetching page ${currentPage}")
|
||||
url += f"&pageToken={currentPage}"
|
||||
|
||||
req = requests.get(url)
|
||||
if(req.ok == False):
|
||||
print("Could not fetch channel_id items!")
|
||||
exit(1)
|
||||
|
||||
response = req.json()
|
||||
currentPage = response.get('nextPageToken')
|
||||
if currentPage in pageTokens:
|
||||
print('All pages iterated and logged!')
|
||||
done = True
|
||||
break
|
||||
|
||||
for item in response.get('items'):
|
||||
if 'id' in item and 'videoId' in item.get('id'):
|
||||
if is_yt_short(item.get('id').get('videoId')):
|
||||
print(f"Filtering out YT Short {item.get('id').get('videoId')}")
|
||||
continue
|
||||
|
||||
if data.get('channelTitle') is None:
|
||||
data['channelTitle'] = slugify(item.get('snippet').get('channelTitle'))
|
||||
|
||||
newItem = {
|
||||
'id': item.get('id').get('videoId'),
|
||||
'url': f"https://youtube.com/watch?v={item.get('id').get('videoId')}",
|
||||
'title': item.get('snippet').get('title'),
|
||||
'description': item.get('snippet').get('description'),
|
||||
'thumbnail': item.get('snippet').get('thumbnails').get('high').get('url'),
|
||||
'published': item.get('snippet').get('publishTime'),
|
||||
}
|
||||
items.append(newItem)
|
||||
|
||||
pageTokens.append(currentPage)
|
||||
|
||||
data['items'] = items
|
||||
with open(file_path, 'w+', encoding='utf-8') as json_file:
|
||||
json.dump(data, json_file, ensure_ascii=True, indent=2)
|
||||
print(f"{len(items)} videos found for channel {data.get('channelTitle')}. Saved to channel-logs/channel-{channel_id}.json")
|
||||
|
||||
return data
|
50
collector/utils/asDocx.js
Normal file
|
@@ -0,0 +1,50 @@
|
|||
const { v4 } = require("uuid");
|
||||
const { DocxLoader } = require("langchain/document_loaders/fs/docx");
|
||||
const {
|
||||
createdDate,
|
||||
trashFile,
|
||||
writeToServerDocuments,
|
||||
} = require("../../utils/files");
|
||||
const { tokenizeString } = require("../../utils/tokenizer");
|
||||
const { default: slugify } = require("slugify");
|
||||
|
||||
async function asDocX({ fullFilePath = "", filename = "" }) {
|
||||
const loader = new DocxLoader(fullFilePath);
|
||||
|
||||
console.log(`-- Working ${filename} --`);
|
||||
let pageContent = [];
|
||||
const docs = await loader.load();
|
||||
for (const doc of docs) {
|
||||
console.log(doc.metadata);
|
||||
console.log(`-- Parsing content from docx page --`);
|
||||
if (!doc.pageContent.length) continue;
|
||||
pageContent.push(doc.pageContent);
|
||||
}
|
||||
|
||||
if (!pageContent.length) {
|
||||
console.error(`Resulting text content was empty for ${filename}.`);
|
||||
return { success: false, reason: `No text content found in ${filename}.` };
|
||||
}
|
||||
|
||||
const content = pageContent.join("");
|
||||
data = {
|
||||
id: v4(),
|
||||
url: "file://" + fullFilePath,
|
||||
title: filename,
|
||||
docAuthor: "no author found",
|
||||
description: "No description found.",
|
||||
docSource: "pdf file uploaded by the user.",
|
||||
chunkSource: filename,
|
||||
published: createdDate(fullFilePath),
|
||||
wordCount: content.split(" ").length,
|
||||
pageContent: content,
|
||||
token_count_estimate: tokenizeString(content).length,
|
||||
};
|
||||
|
||||
writeToServerDocuments(data, `${slugify(filename)}-${data.id}`);
|
||||
trashFile(fullFilePath);
|
||||
console.log(`[SUCCESS]: ${filename} converted & ready for embedding.\n`);
|
||||
return { success: true, reason: null };
|
||||
}
|
||||
|
||||
module.exports = asDocX;
|
40
collector/utils/constants.js
Normal file
|
@@ -0,0 +1,40 @@
|
|||
const WATCH_DIRECTORY = require("path").resolve(__dirname, "../hotdir");
|
||||
|
||||
const ACCEPTED_MIMES = {
|
||||
"text/plain": [".txt", ".md"],
|
||||
"text/html": [".html"],
|
||||
|
||||
"application/vnd.openxmlformats-officedocument.wordprocessingml.document": [
|
||||
".docx",
|
||||
],
|
||||
"application/vnd.openxmlformats-officedocument.presentationml.presentation": [
|
||||
".pptx",
|
||||
],
|
||||
|
||||
"application/vnd.oasis.opendocument.text": [".odt"],
|
||||
"application/vnd.oasis.opendocument.presentation": [".odp"],
|
||||
|
||||
"application/pdf": [".pdf"],
|
||||
"application/mbox": [".mbox"],
|
||||
};
|
||||
|
||||
const SUPPORTED_FILETYPE_CONVERTERS = {
|
||||
".txt": "./convert/asTxt.js",
|
||||
".md": "./convert/asTxt.js",
|
||||
".html": "./convert/asTxt.js",
|
||||
".pdf": "./convert/asPDF.js",
|
||||
|
||||
".docx": "./convert/asDocx.js",
|
||||
".pptx": "./convert/asOfficeMime.js",
|
||||
|
||||
".odt": "./convert/asOfficeMime.js",
|
||||
".odp": "./convert/asOfficeMime.js",
|
||||
|
||||
".mbox": "./convert/asMbox.js",
|
||||
};
|
||||
|
||||
module.exports = {
|
||||
SUPPORTED_FILETYPE_CONVERTERS,
|
||||
WATCH_DIRECTORY,
|
||||
ACCEPTED_MIMES,
|
||||
};
|
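`SUPPORTED_FILETYPE_CONVERTERS` maps each accepted extension to the converter module that handles it. As a rough illustration of how a caller could use this map (hypothetical wiring — the real dispatch lives in the collector's processing code, which is not part of this hunk), assuming every converter exports an async `({ fullFilePath, filename })` handler like `asDocX` above:

```js
// Hypothetical dispatch sketch built on the constants above.
const path = require("path");
const {
  WATCH_DIRECTORY,
  SUPPORTED_FILETYPE_CONVERTERS,
} = require("./constants"); // require path is illustrative

async function convertFromHotdir(targetFilename) {
  const fullFilePath = path.resolve(WATCH_DIRECTORY, targetFilename);
  const ext = path.extname(targetFilename).toLowerCase();

  // Unsupported extensions are rejected instead of being handed to a converter.
  if (!Object.keys(SUPPORTED_FILETYPE_CONVERTERS).includes(ext)) {
    return { success: false, reason: `${ext} is not a supported filetype.` };
  }

  const converter = require(SUPPORTED_FILETYPE_CONVERTERS[ext]);
  return await converter({ fullFilePath, filename: targetFilename });
}
```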
55
collector/utils/files/index.js
Normal file
|
@@ -0,0 +1,55 @@
|
|||
const fs = require("fs");
|
||||
const path = require("path");
|
||||
|
||||
function trashFile(filepath) {
|
||||
if (!fs.existsSync(filepath)) return;
|
||||
|
||||
try {
|
||||
const isDir = fs.lstatSync(filepath).isDirectory();
|
||||
if (isDir) return;
|
||||
} catch {
|
||||
return;
|
||||
}
|
||||
|
||||
fs.rmSync(filepath);
|
||||
return;
|
||||
}
|
||||
|
||||
function createdDate(filepath) {
|
||||
try {
|
||||
const { birthtimeMs, birthtime } = fs.statSync(filepath);
|
||||
if (birthtimeMs === 0) throw new Error("Invalid stat for file!");
|
||||
return birthtime.toLocaleString();
|
||||
} catch {
|
||||
return "unknown";
|
||||
}
|
||||
}
|
||||
|
||||
function writeToServerDocuments(
|
||||
data = {},
|
||||
filename,
|
||||
destinationOverride = null
|
||||
) {
|
||||
const destination = destinationOverride
|
||||
? path.resolve(destinationOverride)
|
||||
: path.resolve(
|
||||
__dirname,
|
||||
"../../../server/storage/documents/custom-documents"
|
||||
);
|
||||
if (!fs.existsSync(destination))
|
||||
fs.mkdirSync(destination, { recursive: true });
|
||||
const destinationFilePath = path.resolve(destination, filename);
|
||||
|
||||
fs.writeFileSync(
|
||||
destinationFilePath + ".json",
|
||||
JSON.stringify(data, null, 4),
|
||||
{ encoding: "utf-8" }
|
||||
);
|
||||
return;
|
||||
}
|
||||
|
||||
module.exports = {
|
||||
trashFile,
|
||||
createdDate,
|
||||
writeToServerDocuments,
|
||||
};
|
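These three helpers replace the old `move_source`/`write_to_server_documents` Python utilities: `createdDate` stamps the source file, `writeToServerDocuments` writes the JSON record the server picks up from `custom-documents`, and `trashFile` discards the processed upload. A minimal usage sketch (the file path, title, and require path are illustrative):

```js
// Illustrative only: persist one converted document the way the converters do.
const { v4 } = require("uuid");
const { createdDate, trashFile, writeToServerDocuments } = require("./files"); // path is illustrative

const fullFilePath = "/tmp/example.txt"; // hypothetical upload already sitting on disk
const content = "Text extracted from the upload.";

const data = {
  id: v4(),
  url: "file://" + fullFilePath,
  title: "example.txt",
  published: createdDate(fullFilePath),
  wordCount: content.split(" ").length,
  pageContent: content,
};

// Lands as <name>.json under server/storage/documents/custom-documents.
writeToServerDocuments(data, `example-txt-${data.id}`);
trashFile(fullFilePath); // remove the source now that it has been converted
```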
18
collector/utils/http/index.js
Normal file
|
@@ -0,0 +1,18 @@
|
|||
process.env.NODE_ENV === "development"
|
||||
? require("dotenv").config({ path: `.env.${process.env.NODE_ENV}` })
|
||||
: require("dotenv").config();
|
||||
|
||||
function reqBody(request) {
|
||||
return typeof request.body === "string"
|
||||
? JSON.parse(request.body)
|
||||
: request.body;
|
||||
}
|
||||
|
||||
function queryParams(request) {
|
||||
return request.query;
|
||||
}
|
||||
|
||||
module.exports = {
|
||||
reqBody,
|
||||
queryParams,
|
||||
};
|
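`reqBody` and `queryParams` are small Express request helpers; `reqBody` tolerates bodies that arrive as raw JSON strings. A sketch of an endpoint consuming them (the route, port, and require path are hypothetical — the collector's real routes live in `index.js`, outside this hunk):

```js
// Hypothetical endpoint showing how the helpers above might be consumed.
const express = require("express");
const { reqBody, queryParams } = require("./utils/http"); // path is illustrative

const app = express();
app.use(express.json());

app.post("/process", (request, response) => {
  const { filename } = reqBody(request); // object or JSON-string bodies both work
  const { debug } = queryParams(request);
  response.status(200).json({ filename, debug: Boolean(debug) });
});

app.listen(8888, () => console.log("collector listening on port 8888"));
```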
15
collector/utils/tokenizer/index.js
Normal file
|
@@ -0,0 +1,15 @@
|
|||
const { getEncoding } = require("js-tiktoken");
|
||||
|
||||
function tokenizeString(input = "") {
|
||||
try {
|
||||
const encoder = getEncoding("cl100k_base");
|
||||
return encoder.encode(input);
|
||||
} catch (e) {
|
||||
console.error("Could not tokenize string!");
|
||||
return [];
|
||||
}
|
||||
}
|
||||
|
||||
module.exports = {
|
||||
tokenizeString,
|
||||
};
|
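`tokenizeString` is the JS counterpart of the removed Python `tokenize` helper, using the `cl100k_base` encoding from `js-tiktoken`. A quick sketch of estimating token counts and the same ada-002 embedding cost the old scripts reported (the $0.0004 / 1K token rate is the one quoted throughout this diff; the require path is illustrative):

```js
// Estimate tokens and a rough embedding cost for extracted text.
const { tokenizeString } = require("./utils/tokenizer"); // path is illustrative

const pageContent = "Some text pulled out of a processed document...";
const tokenCount = tokenizeString(pageContent).length;

// Same pricing assumption as the removed Python collectors: $0.0004 per 1K tokens.
const estimatedCost = (tokenCount / 1000) * 0.0004;
console.log(
  `${tokenCount} tokens ~ $${estimatedCost.toFixed(4)} with text-embedding-ada-002`
);
```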
11
collector/utils/url/index.js
Normal file
|
@@ -0,0 +1,11 @@
|
|||
function validURL(url) {
|
||||
try {
|
||||
new URL(url);
|
||||
return true;
|
||||
} catch {}
|
||||
return false;
|
||||
}
|
||||
|
||||
module.exports = {
|
||||
validURL,
|
||||
};
|
|
@@ -1,21 +0,0 @@
|
|||
import _thread, time
|
||||
from scripts.watch.main import watch_for_changes
|
||||
|
||||
a_list = []
|
||||
WATCH_DIRECTORY = "hotdir"
|
||||
def input_thread(a_list):
|
||||
input()
|
||||
a_list.append(True)
|
||||
|
||||
def main():
|
||||
_thread.start_new_thread(input_thread, (a_list,))
|
||||
print(f"Watching '{WATCH_DIRECTORY}/' for new files.\n\nUpload files into this directory while this script is running to convert them.\nPress enter or crtl+c to exit script.")
|
||||
while not a_list:
|
||||
watch_for_changes(WATCH_DIRECTORY)
|
||||
time.sleep(1)
|
||||
|
||||
print("Stopping watching of hot directory.")
|
||||
exit(1)
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
|
@@ -1,4 +0,0 @@
|
|||
from api import api
|
||||
|
||||
if __name__ == '__main__':
|
||||
api.run(debug=False)
|
2925
collector/yarn.lock
Normal file
File diff suppressed because it is too large
|
@@ -8,7 +8,7 @@ ARG ARG_GID=1000
|
|||
# Install system dependencies
|
||||
RUN DEBIAN_FRONTEND=noninteractive apt-get update && \
|
||||
DEBIAN_FRONTEND=noninteractive apt-get install -yq --no-install-recommends \
|
||||
curl gnupg libgfortran5 python3 python3-pip tzdata netcat \
|
||||
curl gnupg libgfortran5 libgbm1 tzdata netcat \
|
||||
libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 \
|
||||
libgcc1 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libx11-6 libx11-xcb1 libxcb1 \
|
||||
libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 \
|
||||
|
@@ -21,13 +21,7 @@ RUN DEBIAN_FRONTEND=noninteractive apt-get update && \
|
|||
apt-get install -yq --no-install-recommends nodejs && \
|
||||
curl -LO https://github.com/yarnpkg/yarn/releases/download/v1.22.19/yarn_1.22.19_all.deb \
|
||||
&& dpkg -i yarn_1.22.19_all.deb \
|
||||
&& rm yarn_1.22.19_all.deb && \
|
||||
curl -LO https://github.com/jgm/pandoc/releases/download/3.1.3/pandoc-3.1.3-1-amd64.deb \
|
||||
&& dpkg -i pandoc-3.1.3-1-amd64.deb \
|
||||
&& rm pandoc-3.1.3-1-amd64.deb && \
|
||||
rm -rf /var/lib/apt/lists/* /usr/share/icons && \
|
||||
dpkg-reconfigure -f noninteractive tzdata && \
|
||||
python3 -m pip install --no-cache-dir virtualenv
|
||||
&& rm yarn_1.22.19_all.deb
|
||||
|
||||
# Create a group and user with specific UID and GID
|
||||
RUN groupadd -g $ARG_GID anythingllm && \
|
||||
|
@@ -81,10 +75,7 @@ COPY --from=build-stage /app/frontend/dist ./server/public
|
|||
COPY --chown=anythingllm:anythingllm ./collector/ ./collector/
|
||||
|
||||
# Install collector dependencies
|
||||
RUN cd /app/collector && \
|
||||
python3 -m virtualenv v-env && \
|
||||
. v-env/bin/activate && \
|
||||
pip install --no-cache-dir -r requirements.txt
|
||||
RUN cd /app/collector && yarn install --production && yarn cache clean
|
||||
|
||||
# Migrate and Run Prisma against known schema
|
||||
RUN cd ./server && npx prisma generate --schema=./prisma/schema.prisma
|
||||
|
@@ -92,7 +83,6 @@ RUN cd ./server && npx prisma migrate deploy --schema=./prisma/schema.prisma
|
|||
|
||||
# Setup the environment
|
||||
ENV NODE_ENV=production
|
||||
ENV PATH=/app/collector/v-env/bin:$PATH
|
||||
|
||||
# Expose the server port
|
||||
EXPOSE 3001
|
||||
|
|
|
@@ -24,6 +24,7 @@ export STORAGE_LOCATION="/var/lib/anythingllm" && \
|
|||
mkdir -p $STORAGE_LOCATION && \
|
||||
touch "$STORAGE_LOCATION/.env" && \
|
||||
docker run -d -p 3001:3001 \
|
||||
--cap-add SYS_ADMIN \
|
||||
-v ${STORAGE_LOCATION}:/app/server/storage \
|
||||
-v ${STORAGE_LOCATION}/.env:/app/server/.env \
|
||||
-e STORAGE_DIR="/app/server/storage" \
|
||||
|
@@ -45,16 +46,6 @@ Your docker host will show the image as online once the build process is completed.
 ## How to use the user interface
 - To access the full application, visit `http://localhost:3001` in your browser.

-## How to add files to my system using the standalone scripts
-- Upload files from the UI in your Workspace settings
-
-- To run the collector scripts to grab external data (articles, URLs, etc.)
-  - `docker exec -it --workdir=/app/collector anything-llm python main.py`
-
-- To run the collector watch script to process files from the hotdir
-  - `docker exec -it --workdir=/app/collector anything-llm python watch.py`
-  - Upload [compliant files](../collector/hotdir/__HOTDIR__.md) to `./collector/hotdir` and they will be processed and made available in the UI.
-
 ## About UID and GID in the ENV
 - The UID and GID are set to 1000 by default. This is the default user in the Docker container and on most host operating systems. If there is a mismatch between your host user UID and GID and what is set in the `.env` file, you may experience permission issues.
@@ -17,6 +17,8 @@ services:
       args:
         ARG_UID: ${UID:-1000}
         ARG_GID: ${GID:-1000}
+    cap_add:
+      - SYS_ADMIN
     volumes:
       - "./.env:/app/server/.env"
       - "../server/storage:/app/server/storage"
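The new `cap_add: SYS_ADMIN` here, like the `--cap-add SYS_ADMIN` added to the docker run snippet above, is the capability a sandboxed headless Chromium typically needs inside a container. Assuming the collector's link scraping drives a headless browser (for example via Puppeteer; the scraper code itself is not part of this section), a launch sketch might look like the following — an illustration, not the project's actual implementation:

```js
// Sketch only: assumes Puppeteer as the headless-browser driver, which this
// diff does not confirm. With SYS_ADMIN granted, Chromium's sandbox can start
// inside the container without falling back to --no-sandbox.
const puppeteer = require("puppeteer");

async function fetchPageText(link) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(link, { waitUntil: "networkidle2" });
    return await page.evaluate(() => document.body.innerText);
  } finally {
    await browser.close();
  }
}
```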
@@ -4,6 +4,6 @@
   npx prisma migrate deploy --schema=./prisma/schema.prisma &&\
   node /app/server/index.js
 } &
-{ FLASK_ENV=production FLASK_APP=wsgi.py cd collector && gunicorn --timeout 300 --workers 4 --bind 0.0.0.0:8888 wsgi:api; } &
+{ node /app/collector/index.js; } &
 wait -n
 exit $?
@@ -9,10 +9,11 @@
     "node": ">=18"
   },
   "scripts": {
-    "lint": "cd server && yarn lint && cd .. && cd frontend && yarn lint",
-    "setup": "cd server && yarn && cd ../frontend && yarn && cd .. && yarn setup:envs && yarn prisma:setup && echo \"Please run yarn dev:server and yarn dev:frontend in separate terminal tabs.\"",
+    "lint": "cd server && yarn lint && cd ../frontend && yarn lint && cd ../collector && yarn lint",
+    "setup": "cd server && yarn && cd ../collector && yarn && cd ../frontend && yarn && cd .. && yarn setup:envs && yarn prisma:setup && echo \"Please run yarn dev:server, yarn dev:collector, and yarn dev:frontend in separate terminal tabs.\"",
     "setup:envs": "cp -n ./frontend/.env.example ./frontend/.env && cp -n ./server/.env.example ./server/.env.development && cp -n ./collector/.env.example ./collector/.env && cp -n ./docker/.env.example ./docker/.env && echo \"All ENV files copied!\n\"",
     "dev:server": "cd server && yarn dev",
+    "dev:collector": "cd collector && yarn dev",
     "dev:frontend": "cd frontend && yarn start",
     "prisma:generate": "cd server && npx prisma generate",
     "prisma:migrate": "cd server && npx prisma migrate dev --name init",
@@ -2,7 +2,7 @@ const { Telemetry } = require("../../../models/telemetry");
 const { validApiKey } = require("../../../utils/middleware/validApiKey");
 const { setupMulter } = require("../../../utils/files/multer");
 const {
-  checkPythonAppAlive,
+  checkProcessorAlive,
   acceptedFileTypes,
   processDocument,
 } = require("../../../utils/files/documentProcessor");
@@ -60,14 +60,14 @@ function apiDocumentEndpoints(app) {
   */
       try {
         const { originalname } = request.file;
-        const processingOnline = await checkPythonAppAlive();
+        const processingOnline = await checkProcessorAlive();

         if (!processingOnline) {
           response
             .status(500)
             .json({
               success: false,
-              error: `Python processing API is not online. Document ${originalname} will not be processed automatically.`,
+              error: `Document processing API is not online. Document ${originalname} will not be processed automatically.`,
             })
             .end();
         }
@@ -4,7 +4,7 @@ process.env.NODE_ENV === "development"
 const { viewLocalFiles } = require("../utils/files");
 const { exportData, unpackAndOverwriteImport } = require("../utils/files/data");
 const {
-  checkPythonAppAlive,
+  checkProcessorAlive,
   acceptedFileTypes,
 } = require("../utils/files/documentProcessor");
 const { purgeDocument } = require("../utils/files/purgeDocument");
@@ -221,7 +221,7 @@ function systemEndpoints(app) {
     [validatedRequest],
     async (_, response) => {
       try {
-        const online = await checkPythonAppAlive();
+        const online = await checkProcessorAlive();
         response.sendStatus(online ? 200 : 503);
       } catch (e) {
         console.log(e.message, e);
@@ -7,7 +7,7 @@ const { convertToChatHistory } = require("../utils/chats");
 const { getVectorDbClass } = require("../utils/helpers");
 const { setupMulter } = require("../utils/files/multer");
 const {
-  checkPythonAppAlive,
+  checkProcessorAlive,
   processDocument,
   processLink,
 } = require("../utils/files/documentProcessor");
@@ -82,14 +82,14 @@ function workspaceEndpoints(app) {
     handleUploads.single("file"),
     async function (request, response) {
       const { originalname } = request.file;
-      const processingOnline = await checkPythonAppAlive();
+      const processingOnline = await checkProcessorAlive();

       if (!processingOnline) {
         response
           .status(500)
           .json({
             success: false,
-            error: `Python processing API is not online. Document ${originalname} will not be processed automatically.`,
+            error: `Document processing API is not online. Document ${originalname} will not be processed automatically.`,
           })
           .end();
         return;
@@ -114,14 +114,14 @@ function workspaceEndpoints(app) {
     [validatedRequest],
     async (request, response) => {
       const { link = "" } = reqBody(request);
-      const processingOnline = await checkPythonAppAlive();
+      const processingOnline = await checkProcessorAlive();

       if (!processingOnline) {
         response
           .status(500)
           .json({
             success: false,
-            error: `Python processing API is not online. Link ${link} will not be processed automatically.`,
+            error: `Document processing API is not online. Link ${link} will not be processed automatically.`,
           })
           .end();
         return;
@@ -2,15 +2,15 @@
 // of docker this endpoint is not exposed so it is only on the Docker instances internal network
 // so no additional security is needed on the endpoint directly. Auth is done however by the express
 // middleware prior to leaving the node-side of the application so that is good enough >:)
-const PYTHON_API = "http://0.0.0.0:8888";
-async function checkPythonAppAlive() {
-  return await fetch(`${PYTHON_API}`)
+const PROCESSOR_API = "http://0.0.0.0:8888";
+async function checkProcessorAlive() {
+  return await fetch(`${PROCESSOR_API}`)
     .then((res) => res.ok)
     .catch((e) => false);
 }

 async function acceptedFileTypes() {
-  return await fetch(`${PYTHON_API}/accepts`)
+  return await fetch(`${PROCESSOR_API}/accepts`)
     .then((res) => {
       if (!res.ok) throw new Error("Could not reach");
       return res.json();
@@ -21,7 +21,7 @@ async function acceptedFileTypes() {

 async function processDocument(filename = "") {
   if (!filename) return false;
-  return await fetch(`${PYTHON_API}/process`, {
+  return await fetch(`${PROCESSOR_API}/process`, {
     method: "POST",
     headers: {
       "Content-Type": "application/json",
@@ -41,7 +41,7 @@ async function processDocument(filename = "") {

 async function processLink(link = "") {
   if (!link) return false;
-  return await fetch(`${PYTHON_API}/process-link`, {
+  return await fetch(`${PROCESSOR_API}/process-link`, {
     method: "POST",
     headers: {
       "Content-Type": "application/json",
@@ -60,7 +60,7 @@ async function processLink(link = "") {
 }

 module.exports = {
-  checkPythonAppAlive,
+  checkProcessorAlive,
   processDocument,
   processLink,
   acceptedFileTypes,
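Taken together, the renamed helpers keep the calling pattern the endpoints above use: probe the processor first, then hand it a filename or a link. A condensed usage sketch, assuming a require path relative to `server/` and omitting the HTTP status handling the real endpoints in this diff wrap around it:

```js
// Condensed sketch of combining the exported helpers; not the literal
// endpoint code from this diff.
const {
  checkProcessorAlive,
  processDocument,
  processLink,
} = require("./utils/files/documentProcessor"); // path is an assumption

async function ingest({ filename, link }) {
  // Mirrors the guard used by the upload and link endpoints above.
  if (!(await checkProcessorAlive())) {
    throw new Error("Document processing API is not online.");
  }
  if (filename) return await processDocument(filename);
  if (link) return await processLink(link);
  return false;
}
```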