Phishing URL Detector AI

A phishing URL detection system that checks each URL against a whitelist of trusted domains first and uses a machine learning model for domains the whitelist does not cover.

Features

Two-Layer Protection

Whitelist Database
- Multiple trusted domain sources (Umbrella, Tranco, Majestic, DomCop)
- Fast database lookups
- High confidence for legitimate domains
- Reduces false positives

AI Model
- BERT-based deep learning detection
- Works completely offline
- Catches sophisticated phishing attempts
- High accuracy for unknown domains
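
The two layers above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `hostname_of`, `check_url`, `toy_model`, the 0.5 threshold, and the `source` field are all assumptions made for the example, and the toy model only stands in for the BERT classifier.

```python
from typing import Callable, Dict, Set
from urllib.parse import urlparse


def hostname_of(url: str) -> str:
    """Return the lower-cased hostname of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def check_url(url: str, whitelist: Set[str],
              model: Callable[[str], float]) -> Dict[str, object]:
    """Layer 1: whitelist lookup. Layer 2: model score for unknown domains."""
    host = hostname_of(url)
    if host in whitelist:
        # Known trusted domain: skip the slower model entirely.
        return {"is_phishing": False, "confidence": 1.0, "source": "whitelist"}
    score = model(url)  # assumed: probability that the URL is phishing
    is_phishing = score >= 0.5  # assumed threshold
    return {
        "is_phishing": is_phishing,
        "confidence": score if is_phishing else 1.0 - score,
        "source": "model",
    }


def toy_model(url: str) -> float:
    """Placeholder for the BERT classifier; NOT a real detector."""
    return 0.9 if "login" in url else 0.1


trusted = {"example.com"}
print(check_url("https://www.example.com/login", trusted, toy_model))
print(check_url("http://examp1e-login.test/verify", trusted, toy_model))
```

The whitelist is consulted on the hostname only, so a trusted domain is accepted regardless of path, while anything else is passed to the model.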

Key Benefits

- Speed: Quick whitelist checks for known domains
- Accuracy: AI model for unknown domains
- Reliability: Trusted sources (Umbrella, Tranco, Majestic, DomCop) for whitelist
- Efficiency: Optimized database for fast lookups

Offline Setup

Clone the repository

git clone https://github.com/abriljordan/phishing-url-detector-ai.git
cd phishing-url-detector-ai

Download the AI Model
- Download the model from: Hugging Face Model
- Create a models directory in the project root
- Extract the model files into models/bert-finetuned-phishing
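
After extracting the model, a quick check like the one below can confirm that the directory is in place. It is only a sketch: the file names in `EXPECTED_FILES` are an assumption based on a typical Hugging Face BERT export, and `missing_model_files` is a hypothetical helper, so adjust both to what the download actually contains.

```python
from pathlib import Path
from typing import List, Sequence

# Assumed file names (typical Hugging Face BERT export); adjust as needed.
EXPECTED_FILES = ("config.json", "vocab.txt")


def missing_model_files(model_dir: str,
                        expected: Sequence[str] = EXPECTED_FILES) -> List[str]:
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


missing = missing_model_files("models/bert-finetuned-phishing")
if missing:
    print("Model directory incomplete, missing:", ", ".join(missing))
else:
    print("Model files found.")
```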

Set up the environment

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Initialize the Whitelist Database

# Create the data directory (if missing) and the database with its schema
mkdir -p data
sqlite3 data/whitelist.db < schema.sql

# Import whitelist data (choose sources as needed)

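# For Umbrella Top 1M:
# Save the archive under its .zip name so the unzip step can find it
wget -O data/top-1m.csv.zip https://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip
unzip -o data/top-1m.csv.zip -d data/
# --skip 1 drops the first row of the CSV; remove it if the file has no header row
sqlite3 data/whitelist.db ".mode csv" ".import --skip 1 data/top-1m.csv umbrella"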

# For DomCop Top 10M (optional, large file; the URL below is a placeholder):
# wget -O data/DomCoptop10milliondomains.csv.zip https://example.com/path/to/DomCoptop10milliondomains.csv
# unzip -o data/DomCoptop10milliondomains.csv.zip -d data/
# sqlite3 data/whitelist.db ".mode csv" ".import --skip 1 data/DomCoptop10milliondomains.csv domcop"
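
The same import can also be done from Python with the standard `csv` and `sqlite3` modules. The sketch below assumes a two-column `rank,domain` CSV and a hypothetical table layout (`list_rank`, `domain`); the project's real layout is defined in schema.sql and may differ, and `import_domains` is an illustrative helper rather than part of the repository.

```python
import csv
import io
import sqlite3
from typing import TextIO

# Table names cannot be bound as SQL parameters, so restrict them to a known set.
ALLOWED_TABLES = {"umbrella"}


def import_domains(conn: sqlite3.Connection, table: str, csv_file: TextIO) -> int:
    """Load 'rank,domain' rows into `table` and return the number of rows read."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table name: {table}")
    # Assumed layout for the example; the real one comes from schema.sql.
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(list_rank INTEGER, domain TEXT PRIMARY KEY)"
    )
    rows = [(int(row[0]), row[1].strip().lower())
            for row in csv.reader(csv_file) if len(row) == 2]
    with conn:  # commit on success, roll back on error
        conn.executemany(
            f"INSERT OR IGNORE INTO {table} (list_rank, domain) VALUES (?, ?)",
            rows,
        )
    return len(rows)


# In-memory demonstration with two sample rows instead of the real download.
conn = sqlite3.connect(":memory:")
sample = io.StringIO("1,google.com\n2,Example.com\n")
print(import_domains(conn, "umbrella", sample))
print(conn.execute("SELECT domain FROM umbrella ORDER BY list_rank").fetchall())
```

Lower-casing domains at import time keeps later lookups a plain equality match.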

Usage

from phishing_detector import PhishingDetector

# Initialize detector (will use offline model)
detector = PhishingDetector(use_offline=True)

# Check a URL
result = detector.check_url("https://example.com")
print(f"Is phishing: {result['is_phishing']}")
print(f"Confidence: {result['confidence']:.2%}")
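
When checking several URLs, the result dictionaries shown above can be filtered by confidence. The following is a sketch under assumptions: `flag_urls` and the 0.8 default threshold are hypothetical, and `fake_check` stands in for `detector.check_url` so the example runs without the model.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Result = Dict[str, object]


def flag_urls(urls: Iterable[str], check: Callable[[str], Result],
              min_confidence: float = 0.8) -> List[Tuple[str, float]]:
    """Return (url, confidence) for URLs judged phishing with enough confidence."""
    flagged = []
    for url in urls:
        result = check(url)
        if result["is_phishing"] and result["confidence"] >= min_confidence:
            flagged.append((url, result["confidence"]))
    # Most confident detections first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


def fake_check(url: str) -> Result:
    """Stand-in for detector.check_url so the sketch runs without the model."""
    suspicious = "verify" in url
    return {"is_phishing": suspicious, "confidence": 0.95 if suspicious else 0.6}


for url, confidence in flag_urls(
        ["https://example.com", "http://a.test/verify"], fake_check):
    print(f"{url}: {confidence:.2%}")
```

With the real detector, `detector.check_url` would be passed in place of `fake_check`.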

Data Sources

AI Model: BERT Finetuned for Phishing Detection
Whitelist Sources:
- Cisco Umbrella Top 1M
- Tranco Top 1M
- Majestic Million
- DomCop Top 10M (optional)

Notes

- The model (1.34GB) and whitelist databases should be kept in the models and data directories respectively.
- Add these directories to your .gitignore to avoid committing large files.
- For production use, consider using a more robust database like PostgreSQL.

Architecture

Components

Whitelist Manager
- SQLite database
- Optimized for fast lookups
- Multiple trusted domain sources
- Automatic updates

AI Model
- BERT-based deep learning architecture
- Feature extraction
- Real-time prediction
- Confidence scoring

Web Interface
- Modern, responsive design
- Real-time URL checking
- Detailed analysis view
- Batch processing

Database Schema

- umbrella: Trusted domains from Cisco Umbrella
- Optimized indexes for fast lookups
- Views for common queries
- Automatic timestamp updates
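
To make the schema elements above concrete, here is one way a table, index, view, and timestamp trigger could look in SQLite. Every name and column here (`list_rank`, `updated_at`, `idx_umbrella_domain`, `top_domains`, `umbrella_touch`) is an assumption for illustration; the authoritative definitions are in schema.sql.

```python
import sqlite3

SCHEMA = """
CREATE TABLE umbrella (
    domain     TEXT NOT NULL,
    list_rank  INTEGER,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);
-- Index so that lookups by domain do not scan the whole table.
CREATE UNIQUE INDEX idx_umbrella_domain ON umbrella(domain);
-- View for a common query: the highest-ranked domains.
CREATE VIEW top_domains AS
    SELECT domain, list_rank FROM umbrella WHERE list_rank <= 1000;
-- Trigger that refreshes the timestamp whenever a row's rank changes.
CREATE TRIGGER umbrella_touch AFTER UPDATE OF list_rank ON umbrella
BEGIN
    UPDATE umbrella SET updated_at = CURRENT_TIMESTAMP WHERE rowid = NEW.rowid;
END;
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO umbrella (domain, list_rank) VALUES (?, ?)",
             ("example.com", 5))

# EXPLAIN QUERY PLAN shows whether SQLite uses the index for the lookup.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT 1 FROM umbrella WHERE domain = ?",
    ("example.com",),
).fetchall()
print(plan[0][-1])
```

Printing the query plan is a cheap way to confirm that a lookup by domain goes through the index rather than a full table scan.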

Performance

- Whitelist lookup: < 1ms
- AI model prediction: ~100ms
- Batch processing: ~50ms per URL
- Database size: ~100MB
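
Figures like these depend on hardware and database size, so it can help to measure them locally. The sketch below times whitelist lookups against a synthetic in-memory table; `mean_lookup_ms`, the single-column table layout, and the generated domain names are assumptions for the example, and the result will not match numbers from the real database.

```python
import sqlite3
import time
from typing import Sequence


def mean_lookup_ms(conn: sqlite3.Connection, domains: Sequence[str],
                   repeats: int = 1000) -> float:
    """Average wall-clock time of one whitelist lookup, in milliseconds."""
    query = "SELECT 1 FROM umbrella WHERE domain = ? LIMIT 1"
    start = time.perf_counter()
    for i in range(repeats):
        conn.execute(query, (domains[i % len(domains)],)).fetchone()
    elapsed = time.perf_counter() - start
    return elapsed / repeats * 1000.0


# Synthetic in-memory table; real numbers depend on the actual database and disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE umbrella (domain TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO umbrella VALUES (?)",
                 ((f"site{i}.example",) for i in range(10_000)))
ms = mean_lookup_ms(conn, ["site42.example", "missing.example"])
print(f"{ms:.4f} ms per lookup")
```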

Contributing

1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request

License

MIT License - See LICENSE for details.


Repository: abriljordan/phishing-url-detector-ai
