
MangAI πŸ“šπŸŽ΅

A manga-to-audio application that converts English manga pages into immersive audio narratives using AI. Upload a manga image and generate a complete audio story with separate narrator and character voices, powered by OpenAI GPT-4 Vision, GPT-4 Text, and ElevenLabs multi-voice TTS!

Features

• YOLO-based manga frame detection with reading-order sorting
• English OCR with PaddleOCR and confidence filtering
• Scene analysis with OpenAI GPT-4 Vision
• Narrative script generation with GPT-4 Text
• Multi-voice audio (separate narrator and character voices) via ElevenLabs
• Structured, timestamped output folders for frames, OCR results, and audio

Architecture

πŸ“Š For detailed interactive diagrams, see architecture_diagram.md, which contains comprehensive Mermaid diagrams of the system architecture, data flow, and component interactions.

High-Level Flow

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Streamlit App     β”‚  ← Web Frontend with Multi-Voice Controls
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Frame Detector    β”‚  ← YOLO Models with Reading Order Detection
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   OCR Processor     β”‚  ← PaddleOCR (English) with Confidence Filtering
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   LLM Vision        β”‚  ← GPT-4 Vision for Scene Analysis
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   LLM Narrator      β”‚  ← GPT-4 Text for Script Generation
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Multi-Voice TTS    β”‚  ← ElevenLabs with Narrator & Character Voices
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
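
In code, a single page run might be orchestrated roughly like this (a minimal sketch; the class and method names are assumptions inferred from the module layout under Project Structure, not the actual interfaces):

# Illustrative orchestration only; real signatures in modules/ may differ
from config import Config
from modules.frame_detector import FrameDetector   # assumed class name
from modules.ocr_processor import OCRProcessor     # assumed class name
from modules.llm_processor import LLMProcessor
from modules.tts_generator import TTSGenerator

def process_page(image_path, output_dir):
    config = Config()
    frames = FrameDetector(config).process(image_path, output_dir)    # YOLO detection + reading order
    texts = OCRProcessor(config).process(frames, output_dir)          # PaddleOCR + confidence filtering
    script = LLMProcessor(config).process(frames, texts, output_dir)  # GPT-4 Vision analysis + narrative script
    return TTSGenerator(config).process(script, output_dir)           # ElevenLabs narrator/character audio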

πŸ”„ Migration from Original Code

The original yolov8Model.py functionality has been integrated into modules/frame_detector.py, adding reading-order detection for extracted frames, configurable confidence thresholds, and structured per-run output directories.

Quick Start

Prerequisites

Before running MangAI, you need to obtain API keys for OpenAI (GPT-4 Vision and GPT-4 Text) and ElevenLabs (multi-voice TTS).

Local Development Setup

  1. Clone and set up the project:

    git clone <repository-url>
    cd mangAI
    python3 -m venv virtualenv
    source virtualenv/bin/activate  # On macOS/Linux
    # or: virtualenv\Scripts\activate  # On Windows
    pip install --upgrade pip
    pip install -r requirements.txt
    
  2. Configure API credentials: Create a .env file in the project root:

    # OpenAI Configuration
    OPENAI_API_KEY=your_openai_api_key_here
    
    # ElevenLabs Configuration
    ELEVENLABS_API_KEY=your_elevenlabs_api_key_here
    ELEVENLABS_NARRATOR_VOICE_ID=voice_id_for_narrator
    ELEVENLABS_CHARACTER_VOICE_ID=voice_id_for_character
    
    # Application Settings
    DEFAULT_YOLO_MODEL=frame
    YOLO_CONFIDENCE_THRESHOLD=0.5
    OCR_CONFIDENCE_THRESHOLD=0.3
    
  3. Run the application:

    ./start.sh
    # or directly:
    streamlit run app.py
    
  4. Access the application: Open your browser and go to http://localhost:8501

The project includes a pre-configured virtual environment in the virtualenv/ directory:

# Activate the existing virtual environment
source virtualenv/bin/activate  # On macOS/Linux
# or: virtualenv\Scripts\activate  # On Windows

# Install any missing dependencies
pip install -r requirements.txt

# Run the application
./start.sh

macOS system dependencies:

brew install tesseract

Test the integration:

python test_integration.py

Project Structure

mangAI/
β”œβ”€β”€ app.py                    # Main Streamlit application with multi-voice interface
β”œβ”€β”€ config.py                 # Configuration management with directory creation
β”œβ”€β”€ requirements.txt          # Python dependencies (OpenAI, ElevenLabs, PaddleOCR)
β”œβ”€β”€ start.sh                  # Startup script for virtual environment
β”œβ”€β”€ architecture_diagram.md   # Comprehensive system architecture documentation
β”œβ”€β”€ README.md                 # Project documentation
β”œβ”€β”€ modules/                  # Core processing modules
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ frame_detector.py     # YOLO-based frame detection with reading order
β”‚   β”œβ”€β”€ ocr_processor.py      # PaddleOCR text extraction with confidence filtering
β”‚   β”œβ”€β”€ llm_processor.py      # OpenAI GPT-4 Vision and Text processing
β”‚   └── tts_generator.py      # ElevenLabs multi-voice TTS generation
β”œβ”€β”€ models/                   # YOLO model files
β”‚   β”œβ”€β”€ yolo8l_50epochs/
β”‚   β”œβ”€β”€ yolo8l_50epochs_frame/
β”‚   └── yolo8s_50epochs/
β”œβ”€β”€ images/                   # Test manga images
β”‚   β”œβ”€β”€ test1.jpg
β”‚   β”œβ”€β”€ test2.jpg
β”‚   └── ...
β”œβ”€β”€ audio_output/             # Structured processing outputs
β”‚   β”œβ”€β”€ processed_20240101_120000/
β”‚   β”‚   β”œβ”€β”€ frames/           # Extracted manga frames
β”‚   β”‚   β”œβ”€β”€ ocr/             # OCR results and combined text
β”‚   β”‚   └── audio/           # Multi-voice audio files and transcript
β”‚   └── processed_YYYYMMDD_HHMMSS/
β”œβ”€β”€ logs/                     # Application logs
└── virtualenv/               # Pre-configured Python environment
    β”œβ”€β”€ bin/
    β”œβ”€β”€ lib/
    └── ...

Configuration

Environment Variables

Create a .env file with the following configuration:

# OpenAI Configuration (Required)
OPENAI_API_KEY=your_openai_api_key_here

# ElevenLabs Configuration (Required)
ELEVENLABS_API_KEY=your_elevenlabs_api_key_here
ELEVENLABS_NARRATOR_VOICE_ID=voice_id_for_narrator
ELEVENLABS_CHARACTER_VOICE_ID=voice_id_for_character

# Application Settings
DEFAULT_YOLO_MODEL=frame
YOLO_CONFIDENCE_THRESHOLD=0.5
OCR_CONFIDENCE_THRESHOLD=0.3
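
Presumably config.py loads these values at startup; a minimal sketch of how that could look with python-dotenv (assuming python-dotenv is installed; the attribute names here are illustrative, not necessarily those in the real config.py):

# Illustrative only; the actual config.py may be structured differently
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the project root

class Config:
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
    ELEVENLABS_API_KEY = os.getenv("ELEVENLABS_API_KEY")
    ELEVENLABS_NARRATOR_VOICE_ID = os.getenv("ELEVENLABS_NARRATOR_VOICE_ID")
    ELEVENLABS_CHARACTER_VOICE_ID = os.getenv("ELEVENLABS_CHARACTER_VOICE_ID")

    DEFAULT_YOLO_MODEL = os.getenv("DEFAULT_YOLO_MODEL", "frame")
    YOLO_CONFIDENCE_THRESHOLD = float(os.getenv("YOLO_CONFIDENCE_THRESHOLD", "0.5"))
    OCR_CONFIDENCE_THRESHOLD = float(os.getenv("OCR_CONFIDENCE_THRESHOLD", "0.3"))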

API Configuration

  1. OpenAI API Setup:

    • Get your API key from OpenAI Platform
    • The app uses GPT-4 Vision for scene analysis and GPT-4 Text for narrative generation
  2. ElevenLabs API Setup:

    • Get your API key from ElevenLabs
    • Create or select voice IDs for narrator and character roles
    • The app generates separate audio tracks for different voices
  3. Model Configuration:

    • frame: Best for manga frame detection
    • yolo8l_50epochs: Alternative YOLO model
    • yolo8s_50epochs: Smaller, faster model option

Usage

  1. Upload Image: Select an English manga page image (PNG, JPG, JPEG)
  2. Configure Settings:
    • Choose YOLO detection model
    • Adjust confidence thresholds if needed
  3. Generate Audio: Click β€œGenerate Audio” to start processing
    • Frame detection and extraction
    • OCR text extraction with confidence filtering
    • AI scene analysis using GPT-4 Vision
    • Narrative script generation using GPT-4 Text
    • Multi-voice audio generation using ElevenLabs
  4. Review Results:
    • View processing statistics
    • Play separate narrator/character audio or combined version
    • Download individual audio files or complete transcript
  5. Explore Output: Browse the timestamped processing folder with organized frames, OCR results, and audio files

Development

Adding New Modules

Each processing module follows a consistent interface pattern:

# Assumes config.py exposes the Config class used by the other processors
from config import Config


class NewProcessor:
    def __init__(self, config=None):
        """Initialize the processor with configuration"""
        self.config = config or Config()

    def process(self, input_data, output_dir=None):
        """Main processing method with structured output"""
        pass

    def get_statistics(self):
        """Return processing statistics"""
        pass
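
A module following this pattern can be driven the same way as the existing ones; for example (the paths below are placeholders taken from the sample layout, not required values):

# Example usage of the interface above
from pathlib import Path

processor = NewProcessor()
run_dir = Path("audio_output") / "processed_20240101_120000"   # placeholder run folder
result = processor.process("images/test1.jpg", output_dir=run_dir)
print(processor.get_statistics())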

Model Integration

Adding New YOLO Models:

  1. Place model files in ./models/model_name/best.pt
  2. Update config.py MODEL_PATHS dictionary (see the sketch after this list)
  3. The frame detector will automatically load and use them
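
For reference, MODEL_PATHS presumably maps a model name to its weights file; something along these lines (the exact structure in config.py is an assumption):

# Assumed shape of MODEL_PATHS in config.py
MODEL_PATHS = {
    "frame": "./models/yolo8l_50epochs_frame/best.pt",
    "yolo8l_50epochs": "./models/yolo8l_50epochs/best.pt",
    "yolo8s_50epochs": "./models/yolo8s_50epochs/best.pt",
    # "my_new_model": "./models/my_new_model/best.pt",  # new entries follow the same pattern
}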

Integrating Alternative LLM Providers:

  1. Extend LLMProcessor class in modules/llm_processor.py (a rough sketch follows this list)
  2. Implement vision and text processing methods
  3. Add provider configuration in config.py
  4. Update frontend provider selection
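
A rough sketch of such an extension (the method names are assumptions about LLMProcessor's interface, not the real ones):

# Hypothetical subclass; actual method names in modules/llm_processor.py may differ
from modules.llm_processor import LLMProcessor

class AlternativeLLMProcessor(LLMProcessor):
    def analyze_scene(self, frame_image, ocr_text):
        """Swap the GPT-4 Vision call for the alternative provider's vision endpoint."""
        raise NotImplementedError

    def generate_narrative(self, scene_descriptions):
        """Swap the GPT-4 Text call for the alternative provider's text endpoint."""
        raise NotImplementedError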

Adding New TTS Providers:

  1. Extend TTSGenerator class in modules/tts_generator.py (see the sketch after this list)
  2. Implement multi-voice generation methods
  3. Add API configuration and voice settings
  4. Update frontend voice configuration options
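
The TTS case follows the same pattern; a minimal sketch, assuming a per-voice synthesis hook exists (the method name is an assumption):

# Hypothetical subclass; actual method names in modules/tts_generator.py may differ
from modules.tts_generator import TTSGenerator

class AlternativeTTSGenerator(TTSGenerator):
    def synthesize(self, text, voice_id, output_path):
        """Call the new provider's API instead of ElevenLabs for one narrator or character track."""
        raise NotImplementedError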

Directory Structure Standards

All processing modules should use the structured directory pattern:

processed_YYYYMMDD_HHMMSS/
β”œβ”€β”€ frames/          # Input frames and extraction results
β”œβ”€β”€ ocr/            # OCR results and text processing
└── audio/          # Audio files and transcripts
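
A small helper along these lines can create that layout per run (illustrative; the directory creation in config.py may already cover this):

# Illustrative helper for a timestamped run directory
from datetime import datetime
from pathlib import Path

def create_run_dirs(base="audio_output"):
    run_dir = Path(base) / f"processed_{datetime.now():%Y%m%d_%H%M%S}"
    for sub in ("frames", "ocr", "audio"):
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    return run_dir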

Performance

Total Processing Time: ~2-5 minutes per manga page (depending on frame count and text complexity)

Troubleshooting

Common Issues

  1. API Configuration:

    • Ensure OpenAI API key is valid and has GPT-4 access
    • Verify ElevenLabs API key and voice IDs are correct
    • Check API rate limits and quotas
  2. Model Files:

    • Ensure YOLO models are in the correct paths (./models/*/best.pt)
    • Check model file permissions and accessibility
  3. Processing Errors:

    • OCR confidence too low: Adjust OCR_CONFIDENCE_THRESHOLD
    • Frame detection issues: Try different YOLO models or adjust confidence
    • LLM processing failures: Check API keys and rate limits
  4. Audio Generation:

    • Voice ID not found: Verify ElevenLabs voice IDs in configuration
    • Audio quality issues: Check voice settings and text preprocessing
    • File output errors: Ensure write permissions to audio_output/ directory

Logs and Debugging

Check processing logs:

# View application logs
tail -f logs/app.log

# Check specific processing folder
ls -la audio_output/processed_*/

# Verify API connectivity
python -c "import openai; print('OpenAI OK')"
python -c "import elevenlabs; print('ElevenLabs OK')"

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Make your changes following the module interface patterns
  4. Test with sample manga images
  5. Update documentation if needed
  6. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

• Ultralytics YOLOv8 for manga frame detection
• PaddleOCR for English text extraction
• OpenAI GPT-4 Vision and GPT-4 Text for scene analysis and narration
• ElevenLabs for multi-voice text-to-speech
• Streamlit for the web interface

MangAI transforms static manga pages into immersive audio experiences using cutting-edge AI technologies. From visual analysis to multi-voice narration, experience your favorite manga like never before! πŸŽ­πŸ“šπŸŽ΅