
MangAI πŸ“šπŸŽ΅

A manga-to-audio application that converts English manga pages into immersive audio narratives using AI. Upload a manga image and generate a complete audio story with separate narrator and character voices, powered by OpenAI GPT-4 Vision, GPT-4 Text, and ElevenLabs multi-voice TTS!

Features

• YOLO-based manga frame detection with reading-order sorting
• English OCR with PaddleOCR and confidence filtering
• Scene analysis with OpenAI GPT-4 Vision
• Narrative script generation with GPT-4 Text
• Multi-voice audio (separate narrator and character voices) via ElevenLabs
• Structured, timestamped output folders for frames, OCR results, and audio

Architecture

πŸ“Š For detailed interactive diagrams, see architecture_diagram.md, which contains comprehensive Mermaid diagrams of the system architecture, data flow, and component interactions.

High-Level Flow

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Streamlit App     β”‚  ← Web Frontend with Multi-Voice Controls
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Frame Detector    β”‚  ← YOLO Models with Reading Order Detection
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   OCR Processor     β”‚  ← PaddleOCR (English) with Confidence Filtering
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   LLM Vision        β”‚  ← GPT-4 Vision for Scene Analysis
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   LLM Narrator      β”‚  ← GPT-4 Text for Script Generation
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Multi-Voice TTS    β”‚  ← ElevenLabs with Narrator & Character Voices
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
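
In code, a single page run might be orchestrated roughly like this (a minimal sketch; the class and method names are assumptions inferred from the module layout under Project Structure, not the actual interfaces):

# Illustrative orchestration only; real signatures in modules/ may differ
from config import Config
from modules.frame_detector import FrameDetector   # assumed class name
from modules.ocr_processor import OCRProcessor     # assumed class name
from modules.llm_processor import LLMProcessor
from modules.tts_generator import TTSGenerator

def process_page(image_path, output_dir):
    config = Config()
    frames = FrameDetector(config).process(image_path, output_dir)    # YOLO detection + reading order
    texts = OCRProcessor(config).process(frames, output_dir)          # PaddleOCR + confidence filtering
    script = LLMProcessor(config).process(frames, texts, output_dir)  # GPT-4 Vision analysis + narrative script
    return TTSGenerator(config).process(script, output_dir)           # ElevenLabs narrator/character audio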

πŸ”„ Migration from Original Code

The original yolov8Model.py functionality has been integrated into modules/frame_detector.py, adding reading-order detection for extracted frames, configurable confidence thresholds, and structured per-run output directories.

Quick Start

Prerequisites

Before running MangAI, you need to obtain API keys for OpenAI (GPT-4 Vision and GPT-4 Text) and ElevenLabs (multi-voice TTS).

Local Development Setup

  1. Clone and set up the project:

    git clone <repository-url>
    cd mangAI
    python3 -m venv virtualenv
    source virtualenv/bin/activate  # On macOS/Linux
    # or: virtualenv\Scripts\activate  # On Windows
    pip install --upgrade pip
    pip install -r requirements.txt
    
  2. Configure API credentials: Create a .env file in the project root:

    # OpenAI Configuration
    OPENAI_API_KEY=your_openai_api_key_here
    
    # ElevenLabs Configuration
    ELEVENLABS_API_KEY=your_elevenlabs_api_key_here
    ELEVENLABS_NARRATOR_VOICE_ID=voice_id_for_narrator
    ELEVENLABS_CHARACTER_VOICE_ID=voice_id_for_character
    
    # Application Settings
    DEFAULT_YOLO_MODEL=frame
    YOLO_CONFIDENCE_THRESHOLD=0.5
    OCR_CONFIDENCE_THRESHOLD=0.3
    
  3. Run the application:

    ./start.sh
    # or directly:
    streamlit run app.py
    
  4. Access the application: Open your browser and go to http://localhost:8501

The project includes a pre-configured virtual environment in the virtualenv/ directory:

# Activate the existing virtual environment
source virtualenv/bin/activate  # On macOS/Linux
# or: virtualenv\Scripts\activate  # On Windows

# Install any missing dependencies
pip install -r requirements.txt

# Run the application
./start.sh

macOS system dependencies:

brew install tesseract

Test the integration:

python test_integration.py

Project Structure

mangAI/
β”œβ”€β”€ app.py                    # Main Streamlit application with multi-voice interface
β”œβ”€β”€ config.py                 # Configuration management with directory creation
β”œβ”€β”€ requirements.txt          # Python dependencies (OpenAI, ElevenLabs, PaddleOCR)
β”œβ”€β”€ start.sh                  # Startup script for virtual environment
β”œβ”€β”€ architecture_diagram.md   # Comprehensive system architecture documentation
β”œβ”€β”€ README.md                 # Project documentation
β”œβ”€β”€ modules/                  # Core processing modules
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ frame_detector.py     # YOLO-based frame detection with reading order
β”‚   β”œβ”€β”€ ocr_processor.py      # PaddleOCR text extraction with confidence filtering
β”‚   β”œβ”€β”€ llm_processor.py      # OpenAI GPT-4 Vision and Text processing
β”‚   └── tts_generator.py      # ElevenLabs multi-voice TTS generation
β”œβ”€β”€ models/                   # YOLO model files
β”‚   β”œβ”€β”€ yolo8l_50epochs/
β”‚   β”œβ”€β”€ yolo8l_50epochs_frame/
β”‚   └── yolo8s_50epochs/
β”œβ”€β”€ images/                   # Test manga images
β”‚   β”œβ”€β”€ test1.jpg
β”‚   β”œβ”€β”€ test2.jpg
β”‚   └── ...
β”œβ”€β”€ audio_output/             # Structured processing outputs
β”‚   β”œβ”€β”€ processed_20240101_120000/
β”‚   β”‚   β”œβ”€β”€ frames/           # Extracted manga frames
β”‚   β”‚   β”œβ”€β”€ ocr/             # OCR results and combined text
β”‚   β”‚   └── audio/           # Multi-voice audio files and transcript
β”‚   └── processed_YYYYMMDD_HHMMSS/
β”œβ”€β”€ logs/                     # Application logs
└── virtualenv/               # Pre-configured Python environment
    β”œβ”€β”€ bin/
    β”œβ”€β”€ lib/
    └── ...

Configuration

Environment Variables

Create a .env file with the following configuration:

# OpenAI Configuration (Required)
OPENAI_API_KEY=your_openai_api_key_here

# ElevenLabs Configuration (Required)
ELEVENLABS_API_KEY=your_elevenlabs_api_key_here
ELEVENLABS_NARRATOR_VOICE_ID=voice_id_for_narrator
ELEVENLABS_CHARACTER_VOICE_ID=voice_id_for_character

# Application Settings
DEFAULT_YOLO_MODEL=frame
YOLO_CONFIDENCE_THRESHOLD=0.5
OCR_CONFIDENCE_THRESHOLD=0.3
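
Presumably config.py loads these values at startup; a minimal sketch of how that could look with python-dotenv (assuming python-dotenv is installed; the attribute names here are illustrative, not necessarily those in the real config.py):

# Illustrative only; the actual config.py may be structured differently
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the project root

class Config:
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
    ELEVENLABS_API_KEY = os.getenv("ELEVENLABS_API_KEY")
    ELEVENLABS_NARRATOR_VOICE_ID = os.getenv("ELEVENLABS_NARRATOR_VOICE_ID")
    ELEVENLABS_CHARACTER_VOICE_ID = os.getenv("ELEVENLABS_CHARACTER_VOICE_ID")

    DEFAULT_YOLO_MODEL = os.getenv("DEFAULT_YOLO_MODEL", "frame")
    YOLO_CONFIDENCE_THRESHOLD = float(os.getenv("YOLO_CONFIDENCE_THRESHOLD", "0.5"))
    OCR_CONFIDENCE_THRESHOLD = float(os.getenv("OCR_CONFIDENCE_THRESHOLD", "0.3"))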

API Configuration

  1. OpenAI API Setup:

    • Get your API key from OpenAI Platform
    • The app uses GPT-4 Vision for scene analysis and GPT-4 Text for narrative generation
  2. ElevenLabs API Setup:

    • Get your API key from ElevenLabs
    • Create or select voice IDs for narrator and character roles
    • The app generates separate audio tracks for different voices
  3. Model Configuration:

    • frame: Best for manga frame detection
    • yolo8l_50epochs: Alternative YOLO model
    • yolo8s_50epochs: Smaller, faster model option

Usage

  1. Upload Image: Select an English manga page image (PNG, JPG, JPEG)
  2. Configure Settings:
    • Choose YOLO detection model
    • Adjust confidence thresholds if needed
  3. Generate Audio: Click β€œGenerate Audio” to start processing
    • Frame detection and extraction
    • OCR text extraction with confidence filtering
    • AI scene analysis using GPT-4 Vision
    • Narrative script generation using GPT-4 Text
    • Multi-voice audio generation using ElevenLabs
  4. Review Results:
    • View processing statistics
    • Play separate narrator/character audio or combined version
    • Download individual audio files or complete transcript
  5. Explore Output: Browse the timestamped processing folder with organized frames, OCR results, and audio files

Development

Adding New Modules

Each processing module follows a consistent interface pattern:

# Assumes config.py exposes the Config class used by the other processors
from config import Config


class NewProcessor:
    def __init__(self, config=None):
        """Initialize the processor with configuration"""
        self.config = config or Config()

    def process(self, input_data, output_dir=None):
        """Main processing method with structured output"""
        pass

    def get_statistics(self):
        """Return processing statistics"""
        pass
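
A module following this pattern can be driven the same way as the existing ones; for example (the paths below are placeholders taken from the sample layout, not required values):

# Example usage of the interface above
from pathlib import Path

processor = NewProcessor()
run_dir = Path("audio_output") / "processed_20240101_120000"   # placeholder run folder
result = processor.process("images/test1.jpg", output_dir=run_dir)
print(processor.get_statistics())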

Model Integration

Adding New YOLO Models:

  1. Place model files in ./models/model_name/best.pt
  2. Update config.py MODEL_PATHS dictionary (see the sketch after this list)
  3. The frame detector will automatically load and use them
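
For reference, MODEL_PATHS presumably maps a model name to its weights file; something along these lines (the exact structure in config.py is an assumption):

# Assumed shape of MODEL_PATHS in config.py
MODEL_PATHS = {
    "frame": "./models/yolo8l_50epochs_frame/best.pt",
    "yolo8l_50epochs": "./models/yolo8l_50epochs/best.pt",
    "yolo8s_50epochs": "./models/yolo8s_50epochs/best.pt",
    # "my_new_model": "./models/my_new_model/best.pt",  # new entries follow the same pattern
}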

Integrating Alternative LLM Providers:

  1. Extend LLMProcessor class in modules/llm_processor.py (a rough sketch follows this list)
  2. Implement vision and text processing methods
  3. Add provider configuration in config.py
  4. Update frontend provider selection
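
A rough sketch of such an extension (the method names are assumptions about LLMProcessor's interface, not the real ones):

# Hypothetical subclass; actual method names in modules/llm_processor.py may differ
from modules.llm_processor import LLMProcessor

class AlternativeLLMProcessor(LLMProcessor):
    def analyze_scene(self, frame_image, ocr_text):
        """Swap the GPT-4 Vision call for the alternative provider's vision endpoint."""
        raise NotImplementedError

    def generate_narrative(self, scene_descriptions):
        """Swap the GPT-4 Text call for the alternative provider's text endpoint."""
        raise NotImplementedError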

Adding New TTS Providers:

  1. Extend TTSGenerator class in modules/tts_generator.py (see the sketch after this list)
  2. Implement multi-voice generation methods
  3. Add API configuration and voice settings
  4. Update frontend voice configuration options
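
The TTS case follows the same pattern; a minimal sketch, assuming a per-voice synthesis hook exists (the method name is an assumption):

# Hypothetical subclass; actual method names in modules/tts_generator.py may differ
from modules.tts_generator import TTSGenerator

class AlternativeTTSGenerator(TTSGenerator):
    def synthesize(self, text, voice_id, output_path):
        """Call the new provider's API instead of ElevenLabs for one narrator or character track."""
        raise NotImplementedError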

Directory Structure Standards

All processing modules should use the structured directory pattern:

processed_YYYYMMDD_HHMMSS/
β”œβ”€β”€ frames/          # Input frames and extraction results
β”œβ”€β”€ ocr/            # OCR results and text processing
└── audio/          # Audio files and transcripts
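
A small helper along these lines can create that layout per run (illustrative; the directory creation in config.py may already cover this):

# Illustrative helper for a timestamped run directory
from datetime import datetime
from pathlib import Path

def create_run_dirs(base="audio_output"):
    run_dir = Path(base) / f"processed_{datetime.now():%Y%m%d_%H%M%S}"
    for sub in ("frames", "ocr", "audio"):
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    return run_dir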

Performance

Total Processing Time: ~2-5 minutes per manga page (depending on frame count and text complexity)

Troubleshooting

Common Issues

  1. API Configuration:

    • Ensure OpenAI API key is valid and has GPT-4 access
    • Verify ElevenLabs API key and voice IDs are correct
    • Check API rate limits and quotas
  2. Model Files:

    • Ensure YOLO models are in the correct paths (./models/*/best.pt)
    • Check model file permissions and accessibility
  3. Processing Errors:

    • OCR confidence too low: Adjust OCR_CONFIDENCE_THRESHOLD
    • Frame detection issues: Try different YOLO models or adjust confidence
    • LLM processing failures: Check API keys and rate limits
  4. Audio Generation:

    • Voice ID not found: Verify ElevenLabs voice IDs in configuration
    • Audio quality issues: Check voice settings and text preprocessing
    • File output errors: Ensure write permissions to audio_output/ directory

Logs and Debugging

Check processing logs:

# View application logs
tail -f logs/app.log

# Check specific processing folder
ls -la audio_output/processed_*/

# Verify API connectivity
python -c "import openai; print('OpenAI OK')"
python -c "import elevenlabs; print('ElevenLabs OK')"

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Make your changes following the module interface patterns
  4. Test with sample manga images
  5. Update documentation if needed
  6. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

• Ultralytics YOLOv8 for manga frame detection
• PaddleOCR for English text extraction
• OpenAI GPT-4 Vision and GPT-4 Text for scene analysis and narration
• ElevenLabs for multi-voice text-to-speech
• Streamlit for the web interface

MangAI transforms static manga pages into immersive audio experiences using cutting-edge AI technologies. From visual analysis to multi-voice narration, experience your favorite manga like never before! πŸŽ­πŸ“šπŸŽ΅