Federal Register Scraper
A Python tool that scrapes notices from the U.S. Federal Register API and stores them in a MinIO object store. The scraper focuses on environmental and natural resource agencies and maintains an index of abstracts for quick reference.
Features
- Scrapes notices from specified Federal Register agencies
- Stores PDFs in MinIO with organized folder structure
- Maintains a searchable index of abstracts
- Incremental updates by default (skips already processed notices)
- Optional full refresh mode with the `--all` flag
- Detailed logging with timing information
- Configurable via YAML file
Prerequisites
- Python 3.7+
- Access to a MinIO instance
- Federal Register API access (no authentication required)
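The Federal Register API is public, so a quick connectivity check needs no credentials. The sketch below is illustrative only: the query parameters and agency slug follow the public API documentation at federalregister.gov/developers and are not taken from frscraper.py.

```python
import requests

# Minimal reachability check against the public Federal Register API.
# Parameter names and the agency slug are illustrative; consult the
# official API documentation for the full set of query conditions.
resp = requests.get(
    "https://www.federalregister.gov/api/v1/documents.json",
    params={
        "conditions[type][]": "NOTICE",
        "conditions[agencies][]": "environmental-protection-agency",
        "per_page": 5,
        "order": "newest",
    },
    timeout=30,
)
resp.raise_for_status()
for doc in resp.json().get("results", []):
    print(doc["document_number"], doc["title"])
```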
Installation
- Clone the repository:
git clone https://sij.ai/sij/fedreg-scraper
cd fedreg-scraper
- Install dependencies:
pip install -r requirements.txt
Configuration
- Copy the example configuration file:
cp example-config.yaml config.yaml
- Edit `config.yaml` with your settings:
# MinIO connection settings
minio:
  endpoint: "your-minio-endpoint"
  access_key: "your-access-key"
  secret_key: "your-secret-key"
  region: "us-east-1"
  secure: true  # Use HTTPS

# Storage settings
bucket_name: "your-bucket-name"
parent_folder: "federal-register"

# Agencies to scrape (use agency short names or full names)
agencies:
  - "APHIS"
  - "BLM"
  - "EPA"
  # Add more agencies as needed
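To confirm the file parses before running the scraper, a small check along these lines works. This is a sketch assuming the config is standard YAML read with PyYAML; the key layout mirrors the example above and may differ from your copy.

```python
import yaml

# Load and sanity-check config.yaml before running the scraper.
# Key names mirror example-config.yaml; adjust if your copy differs.
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

minio_cfg = cfg["minio"]
for key in ("endpoint", "access_key", "secret_key"):
    if not minio_cfg.get(key):
        raise SystemExit(f"config.yaml is missing minio.{key}")

print(f"Bucket: {cfg['bucket_name']}, agencies: {', '.join(cfg['agencies'])}")
```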
MinIO Setup
- Create a new bucket in your MinIO instance
- Create an access policy for the scraper (example below)
- Create access credentials and note the access/secret keys
Example MinIO access policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name/*",
        "arn:aws:s3:::your-bucket-name"
      ]
    }
  ]
}
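To confirm the credentials and policy work before a full run, you can connect with the MinIO Python SDK and check the bucket. This is a minimal sketch; the endpoint, keys, and bucket name are placeholders standing in for the values from config.yaml.

```python
from minio import Minio

# Quick credential/bucket check using the MinIO Python SDK.
# The values below are placeholders; in practice they come from config.yaml.
client = Minio(
    "your-minio-endpoint",
    access_key="your-access-key",
    secret_key="your-secret-key",
    secure=True,
    region="us-east-1",
)

if client.bucket_exists("your-bucket-name"):
    print("Bucket reachable; credentials look good.")
else:
    print("Connected, but the bucket was not found.")
```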
Usage
Basic Usage
Run the scraper in incremental mode (stops when it encounters existing documents):
python frscraper.py
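Incremental mode depends on knowing whether a notice's PDF is already in the bucket. Conceptually, an existence check with the MinIO SDK looks like the sketch below; this is an illustration of the idea, not the scraper's exact code.

```python
from minio.error import S3Error

def object_exists(client, bucket, object_name):
    """Return True if the object is already stored in the bucket."""
    try:
        client.stat_object(bucket, object_name)
        return True
    except S3Error as exc:
        if exc.code == "NoSuchKey":
            return False
        raise

# Example: skip a notice that has already been uploaded.
# key = "federal-register/EPA/2024-00456 - Another Notice.pdf"
# if object_exists(client, "your-bucket-name", key):
#     print("Already processed; skipping.")
```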
Full Refresh
Run the scraper and process all documents, even if they already exist:
python frscraper.py --all
Storage Structure
The scraper organizes documents in MinIO as follows:
bucket_name/
└── federal-register/
    ├── abstracts.json
    ├── APHIS/
    │   ├── 2024-00123 - Notice Title.pdf
    │   └── ...
    ├── EPA/
    │   ├── 2024-00456 - Another Notice.pdf
    │   └── ...
    └── ...
- Each agency gets its own folder
- PDFs are named with their Federal Register document number and truncated title
- `abstracts.json` contains metadata for all documents
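A hypothetical helper that builds such an object key from a notice's metadata might look like the following. The 100-character title cap is an illustrative assumption, not a value taken from the scraper's code.

```python
def build_object_key(parent_folder, agency, document_number, title, max_title_len=100):
    """Build a key like 'federal-register/EPA/2024-00456 - Another Notice.pdf'.

    The title-length cap is illustrative; the scraper's actual truncation rule may differ.
    """
    safe_title = title.replace("/", "-").strip()[:max_title_len]
    return f"{parent_folder}/{agency}/{document_number} - {safe_title}.pdf"

print(build_object_key("federal-register", "EPA", "2024-00456", "Another Notice"))
# -> federal-register/EPA/2024-00456 - Another Notice.pdf
```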
Performance Considerations
- Initial load of abstracts.json can take 2-3 minutes for large collections
- Saving updated abstracts typically takes 20-30 seconds
- Each agency check takes less than 1 second
- Use the `--all` flag judiciously, as it will check every document
Logging
The scraper provides detailed logging with timestamps for:
- Script startup and configuration
- Agency processing progress
- Document downloads and uploads
- Performance metrics
- Error conditions
Example log output:
[2025-02-20 06:32:31.042] INFO: Starting Federal Register document scraper
[2025-02-20 06:32:31.042] INFO: Running with --all=False
[2025-02-20 06:32:31.043] INFO: Loading config from config.yaml
...
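If you want other tooling to emit matching timestamps, a logging setup along these lines reproduces the format shown above. This is a sketch; the scraper's own logging configuration may differ.

```python
import logging

# Reproduce the "[YYYY-MM-DD HH:MM:SS.mmm] LEVEL: message" style shown above.
logging.basicConfig(
    format="[%(asctime)s.%(msecs)03d] %(levelname)s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=logging.INFO,
)
logging.info("Starting Federal Register document scraper")
```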
Error Handling
The scraper handles several common error conditions:
- Invalid agency names in config
- Network connectivity issues
- MinIO access problems
- Missing PDFs on Federal Register
Check the logs for detailed error messages if issues occur.
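In practice, handling a missing PDF or a transient network failure means catching the error, logging it, and moving on rather than aborting the run. A hedged sketch of that pattern, not the scraper's exact code:

```python
import logging
import requests

def fetch_pdf(url):
    """Download a notice PDF, returning None if it is missing or unreachable."""
    try:
        resp = requests.get(url, timeout=60)
        resp.raise_for_status()
        return resp.content
    except requests.RequestException as exc:
        logging.error("Could not download %s: %s", url, exc)
        return None
```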
Contributing
Contributions are welcome! Please submit pull requests with:
- Clear description of changes
- Updated documentation
- Additional test coverage if applicable