RatCrawler

Advanced Web Crawling & Multi-Source Trending Analysis Platform

Intelligent batch processing • Real-time analytics • Professional-grade backlink analysis • AI-powered content intelligence

Python 3.8+ • Rust 1.70+ • SQLite Database • MIT License

Smart Batch Processing • Multi-Source Analytics • Cloud Database • Real-time Monitoring

🔄 Complete RatCrawler Data Flow

From seed URLs to intelligent insights: follow the complete journey of data through the crawling and analysis pipeline

Phase 1: Seed URLs

Initialize crawling process

Load seed URLs from JSON
Start initial web crawling
Store URLs in database
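Phase 1 can be sketched in a few lines. This is a minimal illustration, not RatCrawler's actual loader: it assumes seed_urls.json is a flat JSON array of URL strings (the real file format may differ).

```python
import json

def load_seed_urls(path="seed_urls.json"):
    """Load seed URLs, assuming the file is a flat JSON array of URL strings."""
    with open(path) as f:
        urls = json.load(f)
    # De-duplicate while preserving order before queueing for the crawler.
    return list(dict.fromkeys(urls))
```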

Phase 2: Backlink Crawl

Discover link relationships

Extract all outbound links
Build backlink network
Store backlink data
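The outbound-link extraction in Phase 2 can be sketched with the standard library alone. `LinkExtractor` is a hypothetical stand-in for RatCrawler's own parser; it collects absolute link targets plus anchor text, the two pieces of data a backlink network needs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

parser = LinkExtractor("https://example.com/blog/")
parser.feed('<a href="/about">About us</a>')
```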

Phase 3: AI Analysis

Extract intelligent insights

Extract trending topics
Calculate PageRank scores
Spam detection analysis

Phase 4: Data Storage

Persist & organize results

Store in Turso database
Index trending data
Generate analytics

Live Data Flow: Seeds → Crawl → Analyze → Store

Crawling Process

1. Load Seed URLs: initialize from the seed_urls.json file
2. Fetch & Parse: extract content and discover new links
3. Backlink Analysis: build a comprehensive link network

Data Storage

1. Turso Database: store crawled data in distributed SQLite
2. Trending Analysis: extract and index trending topics
3. Real-time Monitoring: live dashboard and API access

🚀 Quick Start Guide

Get RatCrawler running in minutes with these simple steps

Installation

1. Clone Repository

git clone https://github.com/swadhinbiswas/ratcrowler.git

2. Install Dependencies

pip install -r requirements.txt

3. Set Environment

export DASHBOARD_PASSWORD=swadhin

Launch System

1. Start Batch Crawler

python main.py

Starts intelligent batch processing with automatic resume

2. Monitor Dashboard

python run_dashboard.py

Access the dashboard at http://localhost:8501

3. View Logs (Optional)

python run_log_api.py

Real-time logs at http://localhost:8502

🎯 Common Use Cases

Web Crawling

Automatically crawl and analyze web content with intelligent batch processing

Smart Batch Processing

Trend Analysis

Monitor Google Trends and social media for real-time insights

Multi-source analytics

Backlink Analysis

Professional PageRank calculation and spam detection

Advanced algorithms

System Ready

Smart Batch Processing • Multi-Source Analytics • Real-time Monitoring • Auto Resume

🎉 Everything is configured and ready to go!

Your system will automatically resume from where it left off

What is RatCrawler?

RatCrawler is a sophisticated multi-source trending analysis platform that combines intelligent web crawling, Google Trends analysis, Twitter/X trends monitoring, and professional-grade backlink analysis.

Intelligent Batch Processing

Automatically processes URLs in optimized batch sizes, with progress persistence and auto-resume capability.

Multi-Source Analytics

Google Trends + Twitter/X + Web crawling integration for comprehensive trend analysis.

Professional Security

PageRank calculation, domain authority assessment, and advanced spam detection algorithms.

Built with Modern Technologies

Python 3.11+
SQLAlchemy
Streamlit
FastAPI
Turso Cloud

🏗️ Architecture

Modular and scalable system design

🏗️ System Architecture Overview

📥 Data Sources

Seed URLs
Input Sources
Twitter/X API
Social Trends

⚙️ Processing Engine

Crawler Engine
Web Scraping
Content Processor
Data Extraction

🧠 AI & Analytics

PageRank Calculator
Authority Scoring
Spam Detector
Quality Control
Trend Predictor
ML Analysis

💾 Data Storage

Turso Database
Distributed SQLite
Cache Layer
Fast Access

📊 Outputs

Web Dashboard
Real-time Monitoring
API Endpoints
Data Access
Reports
Analytics Export

Core Components

Crawler Engine

Handles web page discovery, content extraction, and link following

Backlink Processor

Analyzes link relationships, calculates PageRank, and detects spam

Trends Analyzer

Aggregates data from multiple sources for trending analysis

Database Layer

Manages data persistence, indexing, and query optimization

Data Flow

1. Seed URLs are loaded and prioritized
2. Pages are crawled respecting robots.txt
3. Content is extracted and stored in the database
4. Backlinks are discovered and analyzed
5. Trends data is aggregated and processed
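The robots.txt check in step 2 maps directly onto the standard library. `can_crawl` is a hypothetical helper for illustration; a real crawler would fetch each host's robots.txt before applying its rules.

```python
from urllib.robotparser import RobotFileParser

def can_crawl(robots_lines, agent, url):
    """Return True if the given robots.txt rules allow `agent` to fetch `url`."""
    rp = RobotFileParser()
    rp.parse(robots_lines)  # in practice, fed from https://<host>/robots.txt
    return rp.can_fetch(agent, url)

rules = ["User-agent: *", "Disallow: /private/"]
```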

Auto Batch Crawler

Intelligent URL batch processing from backlinks database with automatic progress tracking and resume capability.

Professional Crawler

Advanced async HTTP client with comprehensive content extraction, robots.txt respect, and error handling.

Trending Analysis

Real-time data from Google Trends and Twitter/X with cross-platform analytics and trend correlation.

Database Layer

Multi-database support with automatic schema migration, connection pooling, and cloud backup.

Monitoring Suite

Real-time dashboard and API monitoring with authentication, performance metrics, and health checks.

Security & Analysis

Advanced spam detection, PageRank calculation, domain authority assessment, and network analysis.

🚀 Features

Comprehensive web crawling and analysis capabilities

Advanced Web Crawling

  • Priority-based crawling frontier
  • Robots.txt compliance
  • Content deduplication
  • Incremental recrawling
  • Redirect chain handling
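Content deduplication typically hinges on hashing normalized page text. This is a minimal sketch of the idea, not RatCrawler's implementation:

```python
import hashlib

class Deduplicator:
    """Skip pages whose normalized content has been seen before."""
    def __init__(self):
        self._seen = set()

    def is_new(self, content: str) -> bool:
        # Collapse whitespace so trivial formatting changes don't defeat dedup.
        normalized = " ".join(content.split())
        digest = hashlib.sha256(normalized.encode()).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```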

Backlink Analysis

  • PageRank calculation
  • Domain authority scoring
  • Spam link detection
  • Anchor text analysis
  • Link graph visualization
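At its core, PageRank is a power iteration over the link graph. The sketch below shows the classic algorithm on a `{node: [outbound links]}` dict; RatCrawler's BacklinkProcessor may use a different formulation or library.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a {node: [outbound links]} graph."""
    nodes = list(graph)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the teleport share.
        new = {node: (1 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * ranks[node] / len(targets)
                for t in targets:
                    new[t] = new.get(t, 0.0) + share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for t in nodes:
                    new[t] += damping * ranks[node] / n
        ranks = new
    return ranks
```

A page with more inbound links (here "a") ends up with a higher score than one with none ("c").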

Trending Analysis

  • Google Trends integration
  • Social media trends
  • Financial market analysis
  • News aggregation
  • Real-time insights

Robust Database

  • SQLite with optimized schema
  • Session-based crawling
  • Comprehensive indexing
  • Export capabilities
  • Data persistence
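A session-based crawl schema with supporting indexes might look like the sketch below. The table and index names here are illustrative only; the actual schema lives in the rat.database module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE crawl_sessions (
    id INTEGER PRIMARY KEY,
    started_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE pages (
    id INTEGER PRIMARY KEY,
    session_id INTEGER REFERENCES crawl_sessions(id),
    url TEXT UNIQUE NOT NULL,
    content_hash TEXT,
    status_code INTEGER
);
-- Indexes back the most common lookups: per-session pages and dedup checks.
CREATE INDEX idx_pages_session ON pages(session_id);
CREATE INDEX idx_pages_hash ON pages(content_hash);
""")
conn.execute("INSERT INTO crawl_sessions DEFAULT VALUES")
conn.execute(
    "INSERT INTO pages (session_id, url, status_code) VALUES (1, 'https://example.com', 200)"
)
row = conn.execute("SELECT url, status_code FROM pages WHERE session_id = 1").fetchone()
```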

High Performance

  • Async/Await concurrency
  • Memory-efficient processing
  • Configurable delays
  • Parallel processing
  • Optimized algorithms
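The async concurrency pattern behind these features can be sketched with a semaphore-bounded gather. The `fetch` here is a network-free stand-in (a real crawler would use an aiohttp session); the constants mirror the names shown later in the configuration section but the values are demo-sized.

```python
import asyncio

MAX_CONCURRENT_REQUESTS = 10   # mirrors the main.py setting
DELAY_BETWEEN_REQUESTS = 0.01  # shortened here for the demo

async def fetch(url):
    # Stand-in for an HTTP request; sleeps instead of hitting the network.
    await asyncio.sleep(DELAY_BETWEEN_REQUESTS)
    return url, 200

async def crawl_batch(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

    async def bounded(url):
        async with sem:  # cap in-flight requests so hosts aren't hammered
            return await fetch(url)

    return await asyncio.gather(*(bounded(u) for u in urls))

results = asyncio.run(crawl_batch([f"https://example.com/{i}" for i in range(25)]))
```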

Scheduled Automation

  • Configurable scheduling
  • Automated recrawling
  • Background processing
  • System monitoring
  • Error recovery
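A recurring recrawl job can be expressed with the standard-library scheduler, as a rough sketch of the idea (RatCrawler's actual scheduling mechanism is not shown in this README):

```python
import sched
import time

runs = []

def recrawl(scheduler, interval, remaining):
    """Do one recrawl pass, then reschedule until `remaining` runs are done."""
    runs.append(time.monotonic())
    if remaining > 1:
        scheduler.enter(interval, 1, recrawl, (scheduler, interval, remaining - 1))

s = sched.scheduler(time.monotonic, time.sleep)
s.enter(0, 1, recrawl, (s, 0.01, 3))  # 3 passes, 10 ms apart (demo values)
s.run()
```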

🔧 Installation & Setup

Complete installation guide for RatCrawler multi-source analysis platform

📋 Prerequisites

System Requirements

  • Python: 3.8+ (Recommended: 3.13+)
  • Memory: 4GB RAM minimum
  • Storage: 2GB free space
  • Network: Stable internet connection

Optional Dependencies

  • Docker: For containerized deployment
  • Git: For source code management
  • Rust: For high-performance variant
  • Node.js: For additional tools

Python Setup (Recommended)

1 Clone Repository

git clone https://github.com/swadhinbiswas/ratcrowler.git
cd ratcrowler

2 Install Core Dependencies

pip install -r requirements.txt

Installs: SQLAlchemy, Streamlit, FastAPI, aiohttp, BeautifulSoup4, and more

3 Optional: Turso Cloud Database

pip install -r requirements_turso.txt

For cloud database integration and scaling

4 Set Environment Variables

export DASHBOARD_PASSWORD=swadhin
export TURSO_DATABASE_URL=your_url # Optional
export TURSO_AUTH_TOKEN=your_token # Optional

Launch System

python main.py # Start batch crawler
python run_dashboard.py # Launch dashboard

Advanced Setup

🐳 Docker Deployment

docker build -t ratcrawler .
docker run -p 8501:8501 -p 8502:8502 ratcrawler

🦀 Rust High-Performance Version

cd rust_version
cargo build --release
./target/release/rat-crawler

For maximum performance and memory efficiency

🏗️ Development Environment

python -m venv venv
source venv/bin/activate # Linux/Mac
venv\Scripts\activate # Windows
pip install -e .

🔧 Configuration Files

seed_urls.json - Initial URL list
trends.json - Trending data cache
website_crawler.db - Local database

✅ Installation Verification

Database Check

python -c "from rat.database import *; print('✓ Database OK')"

Crawler Test

python test_enhanced_crawler.py

Dashboard Test

curl http://localhost:8501

🛠️ Troubleshooting

Common Issues

  • Permission Error: Use a virtual environment (preferred) or pip install --user
  • Port Conflicts: Change the ports in the configuration files
  • Memory Issues: Reduce the batch size in settings
  • Network Timeouts: Check firewall and proxy settings


🎮 Usage & Examples

Complete guide to using RatCrawler's powerful features

Batch Processing

🚀 Start Auto Batch Crawler

python main.py

Processes 50 URLs at a time with automatic progress saving

📊 Monitor Progress

python run_dashboard.py

Real-time dashboard at http://localhost:8501

🔍 View Detailed Logs

python run_log_api.py

Detailed logging API at http://localhost:8502

✨ Key Features:
  • Automatic resume from last processed URL
  • Progress saved to JSON after each batch
  • Smart URL queue management
  • Intelligent error handling and retry logic
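The save-after-each-batch resume behavior can be sketched as follows. The progress file name and its `last_index` key are hypothetical; main.py's actual persistence format may differ.

```python
import json
import os

PROGRESS_FILE = "batch_progress.json"  # hypothetical name

def load_progress():
    """Resume from the last processed index, or start fresh."""
    if os.path.exists(PROGRESS_FILE):
        with open(PROGRESS_FILE) as f:
            return json.load(f).get("last_index", 0)
    return 0

def save_progress(last_index):
    with open(PROGRESS_FILE, "w") as f:
        json.dump({"last_index": last_index}, f)

def run_batches(urls, batch_size=50):
    start = load_progress()
    for i in range(start, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        # ... crawl the batch here ...
        save_progress(i + len(batch))  # persist after every batch
```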

Trending Analysis

📈 Google Trends

cd engine && python googletrends.py --limit 10

Analyze top trending topics with custom limits

🐦 Twitter/X Trends

cd engine && python xtrends.py

Social media trending analysis and correlation

🔗 Backlink Analysis

python test_backlink_storage.py

PageRank calculation and spam detection

🎯 Analytics Features:
  • Cross-platform trend correlation
  • Domain authority assessment
  • Network analysis with PageRank
  • Real-time data aggregation
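One naive way to correlate trends across platforms is to intersect the ranked topic lists and order by combined position. This is only a sketch of the idea; RatCrawler's correlation logic is not documented here.

```python
def correlate_trends(google_trends, x_trends):
    """Topics trending on both sources, ranked by combined position (lower is hotter)."""
    g_rank = {t.lower(): i for i, t in enumerate(google_trends)}
    x_rank = {t.lower(): i for i, t in enumerate(x_trends)}
    common = set(g_rank) & set(x_rank)
    return sorted(common, key=lambda t: g_rank[t] + x_rank[t])

hot = correlate_trends(
    ["AI agents", "Rust 2.0", "Elections"],
    ["elections", "AI Agents", "Football"],
)
```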

⚙️ Advanced Configuration

🎛️ Crawler Settings

# main.py configuration
BATCH_SIZE = 50 # URLs per batch
DELAY_BETWEEN_REQUESTS = 1 # seconds
MAX_CONCURRENT_REQUESTS = 10
RESPECT_ROBOTS_TXT = True
ENABLE_DETAILED_LOGGING = True

🗄️ Database Options

# Local SQLite (default)
DATABASE_URL = "sqlite:///website_crawler.db"

# Turso Cloud Database
TURSO_DATABASE_URL = "libsql://your-db.turso.io"
TURSO_AUTH_TOKEN = "your-auth-token"

🔐 Security & Authentication

# Environment variables
export DASHBOARD_PASSWORD=swadhin
export LOG_API_TOKEN=your-secret-token
export USER_AGENT="RatCrawler/2.0"
export RATE_LIMIT=true

📊 Monitoring Setup

# Dashboard: http://localhost:8501
# Log API: http://localhost:8502
# Health Check: /health
# Metrics: /metrics
# Status: /status

🌟 Real-World Use Cases

SEO Research

# Analyze competitor backlinks
python test_backlink_storage.py

# Monitor domain authority
python test_enhanced_crawler.py

Comprehensive SEO analysis with PageRank and domain authority metrics

Market Intelligence

# Track trending topics
cd engine && python googletrends.py

# Social media analysis
python xtrends.py --realtime

Real-time market trends and social media sentiment analysis

Data Collection

# Large-scale crawling
python main.py

# Export results
python dashboard.py --export

Automated data collection with smart queue management and intelligent batch processing

📚 API Reference

Core classes and methods

🐍 Python API

EnhancedProductionCrawler

  • comprehensive_crawl() - Full crawling with analysis
  • crawl_page_content() - Single page crawling
  • export_results() - Export crawl data
  • get_all_backlinks() - Retrieve backlinks

BacklinkProcessor

  • crawl_backlinks() - Discover backlinks
  • calculate_pagerank() - PageRank computation
  • calculate_domain_authority() - Domain scoring
  • detect_link_spam() - Spam detection

🦀 Rust API

WebsiteCrawler

  • crawl() - Async web crawling
  • crawl_single_page() - Single page processing
  • extract_urls() - URL discovery
  • can_crawl() - Robots.txt checking

BacklinkProcessor

  • analyze_backlinks() - Backlink analysis
  • calculate_domain_authority() - Authority scoring
  • detect_spam_links() - Spam detection