RatCrawler

Advanced Web Crawling & Multi-Source Trending Analysis Platform

Intelligent batch processing • Real-time analytics • Professional-grade backlink analysis • AI-powered content intelligence

Python 3.8+ • Rust 1.70+ • SQLite Database • MIT License

Smart Batch Processing • Multi-Source Analytics • Cloud Database • Real-time Monitoring

🔄 Complete RatCrawler Data Flow

From seed URLs to intelligent insights: follow the complete journey of data through the crawling and analysis pipeline

Phase 1: Seed URLs

Initialize crawling process

Load seed URLs from JSON
Start initial web crawling
Store URLs in database
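Phase 1 can be sketched in a few lines. This is a minimal illustration, not RatCrawler's actual loader: it assumes seed_urls.json is a flat JSON array of URL strings (the real file format may differ).

```python
import json

def load_seed_urls(path="seed_urls.json"):
    """Load seed URLs, assuming the file is a flat JSON array of URL strings."""
    with open(path) as f:
        urls = json.load(f)
    # De-duplicate while preserving order before queueing for the crawler.
    return list(dict.fromkeys(urls))
```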

Phase 2: Backlink Crawl

Discover link relationships

Extract all outbound links
Build backlink network
Store backlink data
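The outbound-link extraction in Phase 2 can be sketched with the standard library alone. `LinkExtractor` is a hypothetical stand-in for RatCrawler's own parser; it collects absolute link targets plus anchor text, the two pieces of data a backlink network needs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

parser = LinkExtractor("https://example.com/blog/")
parser.feed('<a href="/about">About us</a>')
```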

Phase 3: AI Analysis

Extract intelligent insights

Extract trending topics
Calculate PageRank scores
Spam detection analysis

Phase 4: Data Storage

Persist & organize results

Store in Turso database
Index trending data
Generate analytics

Live Data Flow: Seeds → Crawl → Analyze → Store

Crawling Process

1. Load Seed URLs: initialize from the seed_urls.json file
2. Fetch & Parse: extract content and discover new links
3. Backlink Analysis: build a comprehensive link network

Data Storage

1. Turso Database: store crawled data in distributed SQLite
2. Trending Analysis: extract and index trending topics
3. Real-time Monitoring: live dashboard and API access

🚀 Quick Start Guide

Get RatCrawler running in minutes with these simple steps

Installation

1. Clone Repository

git clone https://github.com/swadhinbiswas/ratcrowler.git

2. Install Dependencies

pip install -r requirements.txt

3. Set Environment

export DASHBOARD_PASSWORD=swadhin

Launch System

1. Start Batch Crawler

python main.py

Starts intelligent batch processing with automatic resume

2. Monitor Dashboard

python run_dashboard.py

Access the dashboard at http://localhost:8501

3. View Logs (Optional)

python run_log_api.py

Real-time logs at http://localhost:8502

🎯 Common Use Cases

Web Crawling

Automatically crawl and analyze web content with intelligent batch processing

Smart Batch Processing

Trend Analysis

Monitor Google Trends and social media for real-time insights

Multi-source analytics

Backlink Analysis

Professional PageRank calculation and spam detection

Advanced algorithms

System Ready

Smart Batch Processing • Multi-Source Analytics • Real-time Monitoring • Auto Resume

🎉 Everything is configured and ready to go!

Your system will automatically resume from where it left off

What is RatCrawler?

RatCrawler is a sophisticated multi-source trending analysis platform that combines intelligent web crawling, Google Trends analysis, Twitter/X trends monitoring, and professional-grade backlink analysis.

Intelligent Batch Processing

Automatically processes URLs in optimized batch sizes, with progress persistence and auto-resume capability.

Multi-Source Analytics

Google Trends + Twitter/X + Web crawling integration for comprehensive trend analysis.

Professional Security

PageRank calculation, domain authority assessment, and advanced spam detection algorithms.

Built with Modern Technologies

Python 3.11+
SQLAlchemy
Streamlit
FastAPI
Turso Cloud

🏗️ Architecture

Modular and scalable system design

🏗️ System Architecture Overview

📥 Data Sources

Seed URLs
Input Sources
Twitter/X API
Social Trends

⚙️ Processing Engine

Crawler Engine
Web Scraping
Content Processor
Data Extraction

🧠 AI & Analytics

PageRank Calculator
Authority Scoring
Spam Detector
Quality Control
Trend Predictor
ML Analysis

💾 Data Storage

Turso Database
Distributed SQLite
Cache Layer
Fast Access

📊 Outputs

Web Dashboard
Real-time Monitoring
API Endpoints
Data Access
Reports
Analytics Export

Core Components

Crawler Engine

Handles web page discovery, content extraction, and link following

Backlink Processor

Analyzes link relationships, calculates PageRank, and detects spam

Trends Analyzer

Aggregates data from multiple sources for trending analysis

Database Layer

Manages data persistence, indexing, and query optimization

Data Flow

1. Seed URLs are loaded and prioritized
2. Pages are crawled respecting robots.txt
3. Content is extracted and stored in the database
4. Backlinks are discovered and analyzed
5. Trends data is aggregated and processed
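The robots.txt check in step 2 maps directly onto the standard library. `can_crawl` is a hypothetical helper for illustration; a real crawler would fetch each host's robots.txt before applying its rules.

```python
from urllib.robotparser import RobotFileParser

def can_crawl(robots_lines, agent, url):
    """Return True if the given robots.txt rules allow `agent` to fetch `url`."""
    rp = RobotFileParser()
    rp.parse(robots_lines)  # in practice, fed from https://<host>/robots.txt
    return rp.can_fetch(agent, url)

rules = ["User-agent: *", "Disallow: /private/"]
```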

Auto Batch Crawler

Intelligent URL batch processing from backlinks database with automatic progress tracking and resume capability.

Professional Crawler

Advanced async HTTP client with comprehensive content extraction, robots.txt respect, and error handling.

Trending Analysis

Real-time data from Google Trends and Twitter/X with cross-platform analytics and trend correlation.

Database Layer

Multi-database support with automatic schema migration, connection pooling, and cloud backup.

Monitoring Suite

Real-time dashboard and API monitoring with authentication, performance metrics, and health checks.

Security & Analysis

Advanced spam detection, PageRank calculation, domain authority assessment, and network analysis.

🚀 Features

Comprehensive web crawling and analysis capabilities

Advanced Web Crawling

  • Priority-based crawling frontier
  • Robots.txt compliance
  • Content deduplication
  • Incremental recrawling
  • Redirect chain handling
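Content deduplication typically hinges on hashing normalized page text. This is a minimal sketch of the idea, not RatCrawler's implementation:

```python
import hashlib

class Deduplicator:
    """Skip pages whose normalized content has been seen before."""
    def __init__(self):
        self._seen = set()

    def is_new(self, content: str) -> bool:
        # Collapse whitespace so trivial formatting changes don't defeat dedup.
        normalized = " ".join(content.split())
        digest = hashlib.sha256(normalized.encode()).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```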

Backlink Analysis

  • PageRank calculation
  • Domain authority scoring
  • Spam link detection
  • Anchor text analysis
  • Link graph visualization
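At its core, PageRank is a power iteration over the link graph. The sketch below shows the classic algorithm on a `{node: [outbound links]}` dict; RatCrawler's BacklinkProcessor may use a different formulation or library.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a {node: [outbound links]} graph."""
    nodes = list(graph)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the teleport share.
        new = {node: (1 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * ranks[node] / len(targets)
                for t in targets:
                    new[t] = new.get(t, 0.0) + share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for t in nodes:
                    new[t] += damping * ranks[node] / n
        ranks = new
    return ranks
```

A page with more inbound links (here "a") ends up with a higher score than one with none ("c").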

Trending Analysis

  • Google Trends integration
  • Social media trends
  • Financial market analysis
  • News aggregation
  • Real-time insights

Robust Database

  • SQLite with optimized schema
  • Session-based crawling
  • Comprehensive indexing
  • Export capabilities
  • Data persistence
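A session-based crawl schema with supporting indexes might look like the sketch below. The table and index names here are illustrative only; the actual schema lives in the rat.database module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE crawl_sessions (
    id INTEGER PRIMARY KEY,
    started_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE pages (
    id INTEGER PRIMARY KEY,
    session_id INTEGER REFERENCES crawl_sessions(id),
    url TEXT UNIQUE NOT NULL,
    content_hash TEXT,
    status_code INTEGER
);
-- Indexes back the most common lookups: per-session pages and dedup checks.
CREATE INDEX idx_pages_session ON pages(session_id);
CREATE INDEX idx_pages_hash ON pages(content_hash);
""")
conn.execute("INSERT INTO crawl_sessions DEFAULT VALUES")
conn.execute(
    "INSERT INTO pages (session_id, url, status_code) VALUES (1, 'https://example.com', 200)"
)
row = conn.execute("SELECT url, status_code FROM pages WHERE session_id = 1").fetchone()
```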

High Performance

  • Async/Await concurrency
  • Memory-efficient processing
  • Configurable delays
  • Parallel processing
  • Optimized algorithms
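The async concurrency pattern behind these features can be sketched with a semaphore-bounded gather. The `fetch` here is a network-free stand-in (a real crawler would use an aiohttp session); the constants mirror the names shown later in the configuration section but the values are demo-sized.

```python
import asyncio

MAX_CONCURRENT_REQUESTS = 10   # mirrors the main.py setting
DELAY_BETWEEN_REQUESTS = 0.01  # shortened here for the demo

async def fetch(url):
    # Stand-in for an HTTP request; sleeps instead of hitting the network.
    await asyncio.sleep(DELAY_BETWEEN_REQUESTS)
    return url, 200

async def crawl_batch(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

    async def bounded(url):
        async with sem:  # cap in-flight requests so hosts aren't hammered
            return await fetch(url)

    return await asyncio.gather(*(bounded(u) for u in urls))

results = asyncio.run(crawl_batch([f"https://example.com/{i}" for i in range(25)]))
```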

Scheduled Automation

  • Configurable scheduling
  • Automated recrawling
  • Background processing
  • System monitoring
  • Error recovery
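A recurring recrawl job can be expressed with the standard-library scheduler, as a rough sketch of the idea (RatCrawler's actual scheduling mechanism is not shown in this README):

```python
import sched
import time

runs = []

def recrawl(scheduler, interval, remaining):
    """Do one recrawl pass, then reschedule until `remaining` runs are done."""
    runs.append(time.monotonic())
    if remaining > 1:
        scheduler.enter(interval, 1, recrawl, (scheduler, interval, remaining - 1))

s = sched.scheduler(time.monotonic, time.sleep)
s.enter(0, 1, recrawl, (s, 0.01, 3))  # 3 passes, 10 ms apart (demo values)
s.run()
```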

🔧 Installation & Setup

Complete installation guide for RatCrawler multi-source analysis platform

📋 Prerequisites

System Requirements

  • Python: 3.8+ (Recommended: 3.13+)
  • Memory: 4GB RAM minimum
  • Storage: 2GB free space
  • Network: Stable internet connection

Optional Dependencies

  • Docker: For containerized deployment
  • Git: For source code management
  • Rust: For high-performance variant
  • Node.js: For additional tools

Python Setup (Recommended)

1 Clone Repository

git clone https://github.com/swadhinbiswas/ratcrowler.git
cd ratcrowler

2 Install Core Dependencies

pip install -r requirements.txt

Installs: SQLAlchemy, Streamlit, FastAPI, aiohttp, BeautifulSoup4, and more

3 Optional: Turso Cloud Database

pip install -r requirements_turso.txt

For cloud database integration and scaling

4 Set Environment Variables

export DASHBOARD_PASSWORD=swadhin
export TURSO_DATABASE_URL=your_url # Optional
export TURSO_AUTH_TOKEN=your_token # Optional

Launch System

python main.py # Start batch crawler
python run_dashboard.py # Launch dashboard

Advanced Setup

🐳 Docker Deployment

docker build -t ratcrawler .
docker run -p 8501:8501 -p 8502:8502 ratcrawler

🦀 Rust High-Performance Version

cd rust_version
cargo build --release
./target/release/rat-crawler

For maximum performance and memory efficiency

🏗️ Development Environment

python -m venv venv
source venv/bin/activate # Linux/Mac
venv\Scripts\activate # Windows
pip install -e .

🔧 Configuration Files

seed_urls.json - Initial URL list
trends.json - Trending data cache
website_crawler.db - Local database

✅ Installation Verification

Database Check

python -c "from rat.database import *; print('✓ Database OK')"

Crawler Test

python test_enhanced_crawler.py

Dashboard Test

curl http://localhost:8501

🛠️ Troubleshooting

Common Issues

  • Permission Error: Use a virtual environment (preferred) or pip install --user
  • Port Conflicts: Change the ports in the configuration files
  • Memory Issues: Reduce the batch size in settings
  • Network Timeouts: Check firewall and proxy settings


🎮 Usage & Examples

Complete guide to using RatCrawler's powerful features

Batch Processing

🚀 Start Auto Batch Crawler

python main.py

Processes 50 URLs at a time with automatic progress saving

📊 Monitor Progress

python run_dashboard.py

Real-time dashboard at http://localhost:8501

🔍 View Detailed Logs

python run_log_api.py

Detailed logging API at http://localhost:8502

✨ Key Features:
  • Automatic resume from last processed URL
  • Progress saved to JSON after each batch
  • Smart URL queue management
  • Intelligent error handling and retry logic
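The save-after-each-batch resume behavior can be sketched as follows. The progress file name and its `last_index` key are hypothetical; main.py's actual persistence format may differ.

```python
import json
import os

PROGRESS_FILE = "batch_progress.json"  # hypothetical name

def load_progress():
    """Resume from the last processed index, or start fresh."""
    if os.path.exists(PROGRESS_FILE):
        with open(PROGRESS_FILE) as f:
            return json.load(f).get("last_index", 0)
    return 0

def save_progress(last_index):
    with open(PROGRESS_FILE, "w") as f:
        json.dump({"last_index": last_index}, f)

def run_batches(urls, batch_size=50):
    start = load_progress()
    for i in range(start, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        # ... crawl the batch here ...
        save_progress(i + len(batch))  # persist after every batch
```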

Trending Analysis

📈 Google Trends

cd engine && python googletrends.py --limit 10

Analyze top trending topics with custom limits

🐦 Twitter/X Trends

cd engine && python xtrends.py

Social media trending analysis and correlation

🔗 Backlink Analysis

python test_backlink_storage.py

PageRank calculation and spam detection

🎯 Analytics Features:
  • Cross-platform trend correlation
  • Domain authority assessment
  • Network analysis with PageRank
  • Real-time data aggregation
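One naive way to correlate trends across platforms is to intersect the ranked topic lists and order by combined position. This is only a sketch of the idea; RatCrawler's correlation logic is not documented here.

```python
def correlate_trends(google_trends, x_trends):
    """Topics trending on both sources, ranked by combined position (lower is hotter)."""
    g_rank = {t.lower(): i for i, t in enumerate(google_trends)}
    x_rank = {t.lower(): i for i, t in enumerate(x_trends)}
    common = set(g_rank) & set(x_rank)
    return sorted(common, key=lambda t: g_rank[t] + x_rank[t])

hot = correlate_trends(
    ["AI agents", "Rust 2.0", "Elections"],
    ["elections", "AI Agents", "Football"],
)
```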

⚙️ Advanced Configuration

🎛️ Crawler Settings

# main.py configuration
BATCH_SIZE = 50 # URLs per batch
DELAY_BETWEEN_REQUESTS = 1 # seconds
MAX_CONCURRENT_REQUESTS = 10
RESPECT_ROBOTS_TXT = True
ENABLE_DETAILED_LOGGING = True

🗄️ Database Options

# Local SQLite (default)
DATABASE_URL = "sqlite:///website_crawler.db"

# Turso Cloud Database
TURSO_DATABASE_URL = "libsql://your-db.turso.io"
TURSO_AUTH_TOKEN = "your-auth-token"

🔐 Security & Authentication

# Environment variables
export DASHBOARD_PASSWORD=swadhin
export LOG_API_TOKEN=your-secret-token
export USER_AGENT="RatCrawler/2.0"
export RATE_LIMIT=true

📊 Monitoring Setup

# Dashboard: http://localhost:8501
# Log API: http://localhost:8502
# Health Check: /health
# Metrics: /metrics
# Status: /status

🌟 Real-World Use Cases

SEO Research

# Analyze competitor backlinks
python test_backlink_storage.py

# Monitor domain authority
python test_enhanced_crawler.py

Comprehensive SEO analysis with PageRank and domain authority metrics

Market Intelligence

# Track trending topics
cd engine && python googletrends.py

# Social media analysis
python xtrends.py --realtime

Real-time market trends and social media sentiment analysis

Data Collection

# Large-scale crawling
python main.py

# Export results
python dashboard.py --export

Automated data collection with smart queue management and intelligent batch processing

📚 API Reference

Core classes and methods

🐍 Python API

EnhancedProductionCrawler

  • comprehensive_crawl() - Full crawling with analysis
  • crawl_page_content() - Single page crawling
  • export_results() - Export crawl data
  • get_all_backlinks() - Retrieve backlinks

BacklinkProcessor

  • crawl_backlinks() - Discover backlinks
  • calculate_pagerank() - PageRank computation
  • calculate_domain_authority() - Domain scoring
  • detect_link_spam() - Spam detection

🦀 Rust API

WebsiteCrawler

  • crawl() - Async web crawling
  • crawl_single_page() - Single page processing
  • extract_urls() - URL discovery
  • can_crawl() - Robots.txt checking

BacklinkProcessor

  • analyze_backlinks() - Backlink analysis
  • calculate_domain_authority() - Authority scoring
  • detect_spam_links() - Spam detection