A phishing URL detection system that combines a trusted-domain whitelist with a machine learning model: known domains are resolved by a fast whitelist lookup, and unknown domains are classified by the model.
- Whitelist Database
  - Multiple trusted domain sources (Umbrella, Tranco, Majestic, DomCop)
  - Fast database lookups
  - High confidence for legitimate domains
  - Reduces false positives
- AI Model
  - BERT-based deep learning detection
  - Works completely offline
  - Catches sophisticated phishing attempts
  - High accuracy for unknown domains
- Speed: Quick whitelist checks for known domains
- Accuracy: AI model for unknown domains
- Reliability: Trusted sources (Umbrella, Tranco, Majestic, DomCop) for whitelist
- Efficiency: Optimized database for fast lookups
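The two-stage flow described in the list above (whitelist first, model only for unknown domains) can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the names extract_domain and check_url_two_stage, the in-memory whitelist set, and the stand-in classifier are all hypothetical.

```python
from typing import Callable, Dict, Set
from urllib.parse import urlparse


def extract_domain(url: str) -> str:
    """Return the lowercased hostname of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def check_url_two_stage(
    url: str,
    whitelist: Set[str],
    classify: Callable[[str], float],
    threshold: float = 0.5,
) -> Dict[str, object]:
    """Hypothetical two-stage check: whitelist lookup, then model fallback."""
    domain = extract_domain(url)
    if domain in whitelist:
        # Known trusted domain: skip the (slower) model entirely.
        return {"is_phishing": False, "source": "whitelist"}
    # Unknown domain: ask the classifier for a phishing probability.
    score = classify(url)
    return {"is_phishing": score >= threshold, "source": "model", "score": score}


# Stand-in classifier for illustration only; the real project uses a BERT model.
def dummy_classifier(url: str) -> float:
    return 0.9 if "login" in url else 0.1


whitelist = {"example.com"}
known = check_url_two_stage("https://www.example.com/a", whitelist, dummy_classifier)
unknown = check_url_two_stage("http://secure-login.test/x", whitelist, dummy_classifier)
```

Passing the classifier in as a callable keeps the sketch independent of any particular model; a real deployment would also need a more careful rule for reducing hostnames to registrable domains.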
- Clone the repository
git clone https://github.com/yourusername/phishing-url-detector-ai.git
cd phishing-url-detector-ai
- Download the AI Model
  - Download the model from: Hugging Face Model
  - Create a models directory in the project root
  - Extract the model files into models/bert-finetuned-phishing
- Set up the environment
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
- Initialize the Whitelist Database
# Create database with schema
sqlite3 data/whitelist.db < schema.sql
# Import whitelist data (choose sources as needed)
# For Umbrella Top 1M:
wget -O data/top-1m.csv.zip https://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip
unzip -o data/top-1m.csv.zip -d data/
# Note: --skip 1 drops the first row; omit it if the CSV has no header line
sqlite3 data/whitelist.db ".mode csv" ".import --skip 1 data/top-1m.csv umbrella"
# For DomCop Top 10M (optional, large file):
# wget -O data/DomCoptop10milliondomains.csv.zip https://example.com/path/to/DomCoptop10milliondomains.csv
# unzip -o data/DomCoptop10milliondomains.csv.zip -d data/
# sqlite3 data/whitelist.db ".mode csv" ".import --skip 1 data/DomCoptop10milliondomains.csv domcop"

from phishing_detector import PhishingDetector
# Initialize detector (will use offline model)
detector = PhishingDetector(use_offline=True)
# Check a URL
result = detector.check_url("https://example.com")
print(f"Is phishing: {result['is_phishing']}")
print(f"Confidence: {result['confidence']:.2%}")

- AI Model: BERT Finetuned for Phishing Detection
- Whitelist Sources:
  - Cisco Umbrella Top 1M
  - Tranco Top 1M
  - Majestic Million
  - DomCop Top 10M (optional)
- The model (1.34 GB) and whitelist databases should be kept in the models and data directories respectively
- Add these directories to your .gitignore to avoid committing large files
- For production use, consider using a more robust database like PostgreSQL
- Whitelist Manager
  - SQLite database
  - Optimized for fast lookups
  - Multiple trusted domain sources
  - Automatic updates
- AI Model
  - BERT-based deep learning architecture
  - Feature extraction
  - Real-time prediction
  - Confidence scoring
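Confidence scoring for a two-class classifier is commonly derived by applying a softmax to the model's output logits. The sketch below shows that idea in plain Python. The logit values, the label order (legitimate, then phishing), and the function names are assumptions for illustration; the project's model may compute its confidence differently.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_prediction(logits: List[float]) -> Tuple[bool, float]:
    """Assumes index 0 = legitimate and index 1 = phishing (hypothetical order)."""
    probs = softmax(logits)
    is_phishing = probs[1] > probs[0]
    confidence = max(probs)  # probability of the predicted class
    return is_phishing, confidence


is_phishing, confidence = score_prediction([-1.2, 2.3])  # made-up logits
print(f"Is phishing: {is_phishing}")
print(f"Confidence: {confidence:.2%}")
```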
- Web Interface
  - Modern, responsive design
  - Real-time URL checking
  - Detailed analysis view
  - Batch processing
- umbrella: Trusted domains from Cisco Umbrella
- Optimized indexes for fast lookups
- Views for common queries
- Automatic timestamp updates
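For illustration, the schema features listed above (an index, a view, and automatic timestamp updates) could look something like the following in SQLite. The actual schema.sql is not shown here, so every column, index, view, and trigger name below is an assumption; only the table name umbrella comes from the text above.

```python
import sqlite3

# Hypothetical schema: column, index, view, and trigger names are assumptions.
SCHEMA = """
CREATE TABLE umbrella (
    domain_rank INTEGER,
    domain      TEXT NOT NULL,
    updated_at  TEXT DEFAULT CURRENT_TIMESTAMP
);
-- Index so that domain lookups avoid a full table scan.
CREATE UNIQUE INDEX idx_umbrella_domain ON umbrella(domain);
-- View for a common query: the highest-ranked domains.
CREATE VIEW top_domains AS
    SELECT domain, domain_rank FROM umbrella ORDER BY domain_rank LIMIT 100;
-- Trigger that refreshes updated_at whenever the rank of a row changes.
CREATE TRIGGER trg_umbrella_touch AFTER UPDATE OF domain_rank ON umbrella
BEGIN
    UPDATE umbrella SET updated_at = CURRENT_TIMESTAMP WHERE rowid = NEW.rowid;
END;
"""

conn = sqlite3.connect(":memory:")  # in-memory database for the example
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO umbrella (domain_rank, domain) VALUES (?, ?)",
    [(1, "example.com"), (2, "example.org")],
)


def is_whitelisted(domain: str) -> bool:
    """Return True if the domain is present in the umbrella table."""
    row = conn.execute(
        "SELECT 1 FROM umbrella WHERE domain = ? LIMIT 1", (domain,)
    ).fetchone()
    return row is not None
```

With this sample data, is_whitelisted("example.com") returns True, and the lookup is served by the index on domain instead of a table scan.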
- Whitelist lookup: < 1ms
- AI model prediction: ~100ms
- Batch processing: ~50ms per URL
- Database size: ~100MB
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
MIT License - See LICENSE for details.