A manga-to-audio application that converts English manga pages into immersive audio narratives using AI. Upload a manga image and generate a complete audio story with separate narrator and character voices using OpenAI GPT-4 Vision, GPT-4 Text, and ElevenLabs multi-voice TTS!
For detailed interactive diagrams, see architecture_diagram.md - contains comprehensive Mermaid diagrams showing system architecture, data flow, and component interactions.
```
┌───────────────────────┐
│     Streamlit App     │  ← Web Frontend with Multi-Voice Controls
└───────────┬───────────┘
            │
┌───────────▼───────────┐
│     Frame Detector    │  ← YOLO Models with Reading Order Detection
└───────────┬───────────┘
            │
┌───────────▼───────────┐
│     OCR Processor     │  ← PaddleOCR (English) with Confidence Filtering
└───────────┬───────────┘
            │
┌───────────▼───────────┐
│      LLM Vision       │  ← GPT-4 Vision for Scene Analysis
└───────────┬───────────┘
            │
┌───────────▼───────────┐
│      LLM Narrator     │  ← GPT-4 Text for Script Generation
└───────────┬───────────┘
            │
┌───────────▼───────────┐
│    Multi-Voice TTS    │  ← ElevenLabs with Narrator & Character Voices
└───────────────────────┘
```
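The stages above map onto the modules under `modules/`. As a rough sketch of how they chain together (the class names are illustrative stand-ins, not the exact public API of each module):

```python
# A rough sketch of the pipeline flow; class names are illustrative
# stand-ins for the code in modules/, not the exact public API.
from modules.frame_detector import FrameDetector
from modules.ocr_processor import OCRProcessor
from modules.llm_processor import LLMProcessor
from modules.tts_generator import TTSGenerator


def process_page(image_path: str, output_dir: str) -> str:
    frames = FrameDetector().process(image_path, output_dir)      # ordered frame crops
    texts = OCRProcessor().process(frames, output_dir)            # per-frame dialogue text
    script = LLMProcessor().process((frames, texts), output_dir)  # narrator/character script
    audio_path = TTSGenerator().process(script, output_dir)       # multi-voice audio file
    return audio_path
```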
For comparison, the original pipeline (before the multi-voice and PaddleOCR upgrades) looked like this:

```
┌───────────────────────┐
│     Streamlit App     │  ← Web Frontend
└───────────┬───────────┘
            │
┌───────────▼───────────┐
│     Frame Detector    │  ← YOLO Models (integrated from yolov8Model.py)
└───────────┬───────────┘
            │
┌───────────▼───────────┐
│     OCR Processor     │  ← Tesseract/OCR (English only)
└───────────┬───────────┘
            │
┌───────────▼───────────┐
│     Text Processor    │  ← Simple text combination
└───────────┬───────────┘
            │
┌───────────▼───────────┐
│     TTS Generator     │  ← Text-to-Speech (English)
└───────────────────────┘
```
The original yolov8Model.py functionality has been integrated into modules/frame_detector.py, with improvements such as reading-order detection and configurable confidence thresholds.
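For illustration, a minimal sketch of reading-order detection over YOLO boxes is shown below; the model path and the sorting details are assumptions, and modules/frame_detector.py may implement the ordering differently.

```python
# A minimal sketch, assuming the ultralytics package and a local model file;
# modules/frame_detector.py may implement the ordering differently.
from ultralytics import YOLO

model = YOLO("models/yolo8l_50epochs_frame/best.pt")
result = model("images/test1.jpg", conf=0.5)[0]
boxes = result.boxes.xyxy.tolist()  # [x1, y1, x2, y2] per detected frame


def reading_order(boxes, rtl=True, row_tolerance=60):
    """Sort frames top-to-bottom, right-to-left within a row (manga default)."""
    rows, current = [], []
    for box in sorted(boxes, key=lambda b: b[1]):  # sort by top edge
        if current and box[1] - current[-1][1] > row_tolerance:
            rows.append(current)  # top edge jumped: start a new row
            current = []
        current.append(box)
    if current:
        rows.append(current)
    ordered = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda b: b[0], reverse=rtl))
    return ordered


ordered_frames = reading_order(boxes)
```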
Before running MangAI, you need to obtain API keys for OpenAI and ElevenLabs.
1. **Clone and set up the project:**

   ```bash
   git clone <repository-url>
   cd mangAI
   python3 -m venv virtualenv
   source virtualenv/bin/activate   # On macOS/Linux
   # or: virtualenv\Scripts\activate   # On Windows
   pip install --upgrade pip
   pip install -r requirements.txt
   ```
2. **Configure API credentials:**

   Create a `.env` file in the project root:

   ```
   # OpenAI Configuration
   OPENAI_API_KEY=your_openai_api_key_here

   # ElevenLabs Configuration
   ELEVENLABS_API_KEY=your_elevenlabs_api_key_here
   ELEVENLABS_NARRATOR_VOICE_ID=voice_id_for_narrator
   ELEVENLABS_CHARACTER_VOICE_ID=voice_id_for_character

   # Application Settings
   DEFAULT_YOLO_MODEL=frame
   YOLO_CONFIDENCE_THRESHOLD=0.5
   OCR_CONFIDENCE_THRESHOLD=0.3
   ```
3. **Run the application:**

   ```bash
   ./start.sh
   # or directly:
   streamlit run app.py
   ```
4. **Access the application:**

   Open your browser and go to http://localhost:8501
The project includes a pre-configured virtual environment in the `virtualenv/` directory:

```bash
# Activate the existing virtual environment
source virtualenv/bin/activate   # On macOS/Linux
# or: virtualenv\Scripts\activate   # On Windows

# Install any missing dependencies
pip install -r requirements.txt

# Run the application
./start.sh
```
If your setup still relies on Tesseract for OCR, install it as a system dependency (macOS):

```bash
brew install tesseract
```
**Test the integration:**

```bash
python test_integration.py
./start.sh
# or
streamlit run app.py
```
```
mangAI/
├── app.py                    # Main Streamlit application with multi-voice interface
├── config.py                 # Configuration management with directory creation
├── requirements.txt          # Python dependencies (OpenAI, ElevenLabs, PaddleOCR)
├── start.sh                  # Startup script for virtual environment
├── architecture_diagram.md   # Comprehensive system architecture documentation
├── README.md                 # Project documentation
├── modules/                  # Core processing modules
│   ├── __init__.py
│   ├── frame_detector.py     # YOLO-based frame detection with reading order
│   ├── ocr_processor.py      # PaddleOCR text extraction with confidence filtering
│   ├── llm_processor.py      # OpenAI GPT-4 Vision and Text processing
│   └── tts_generator.py      # ElevenLabs multi-voice TTS generation
├── models/                   # YOLO model files
│   ├── yolo8l_50epochs/
│   ├── yolo8l_50epochs_frame/
│   └── yolo8s_50epochs/
├── images/                   # Test manga images
│   ├── test1.jpg
│   ├── test2.jpg
│   └── ...
├── audio_output/             # Structured processing outputs
│   ├── processed_20240101_120000/
│   │   ├── frames/           # Extracted manga frames
│   │   ├── ocr/              # OCR results and combined text
│   │   └── audio/            # Multi-voice audio files and transcript
│   └── processed_YYYYMMDD_HHMMSS/
├── logs/                     # Application logs
└── virtualenv/               # Pre-configured Python environment
    ├── bin/
    ├── lib/
    └── ...
```
Create a `.env` file with the following configuration:

```
# OpenAI Configuration (Required)
OPENAI_API_KEY=your_openai_api_key_here

# ElevenLabs Configuration (Required)
ELEVENLABS_API_KEY=your_elevenlabs_api_key_here
ELEVENLABS_NARRATOR_VOICE_ID=voice_id_for_narrator
ELEVENLABS_CHARACTER_VOICE_ID=voice_id_for_character

# Application Settings
DEFAULT_YOLO_MODEL=frame
YOLO_CONFIDENCE_THRESHOLD=0.5
OCR_CONFIDENCE_THRESHOLD=0.3
```
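A minimal sketch of how these values can be loaded, assuming python-dotenv (config.py may do this differently):

```python
# Minimal sketch of loading the .env values above with python-dotenv;
# variable names match the .env file, everything else is illustrative.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
ELEVENLABS_API_KEY = os.getenv("ELEVENLABS_API_KEY")
NARRATOR_VOICE_ID = os.getenv("ELEVENLABS_NARRATOR_VOICE_ID")
CHARACTER_VOICE_ID = os.getenv("ELEVENLABS_CHARACTER_VOICE_ID")
DEFAULT_YOLO_MODEL = os.getenv("DEFAULT_YOLO_MODEL", "frame")
YOLO_CONFIDENCE_THRESHOLD = float(os.getenv("YOLO_CONFIDENCE_THRESHOLD", "0.5"))
OCR_CONFIDENCE_THRESHOLD = float(os.getenv("OCR_CONFIDENCE_THRESHOLD", "0.3"))
```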
OpenAI API Setup:
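Once your OpenAI key is configured, a request of the kind the LLM Vision stage makes looks roughly like the sketch below; the model name and prompt are illustrative placeholders, not taken from modules/llm_processor.py.

```python
# Illustrative only -- a minimal GPT-4 Vision-style request;
# model name and prompt are placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("images/test1.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the scene in this manga frame."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)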
ElevenLabs API Setup:
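Once your ElevenLabs key and voice IDs are configured, a single narrator line can be synthesized against the v1 REST endpoint; this is a minimal sketch, and modules/tts_generator.py may use the official SDK instead.

```python
# Illustrative only -- calling the ElevenLabs v1 text-to-speech REST endpoint
# with the narrator voice; the character voice works the same way with
# ELEVENLABS_CHARACTER_VOICE_ID.
import os
import requests

voice_id = os.environ["ELEVENLABS_NARRATOR_VOICE_ID"]
response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    json={"text": "Our story begins on a quiet rooftop at dusk."},
)
response.raise_for_status()
with open("narrator_sample.mp3", "wb") as f:
    f.write(response.content)  # the endpoint returns MP3 audio bytes
```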
Model Configuration:
- `frame`: Best for manga frame detection
- `yolo8l_50epochs`: Alternative YOLO model
- `yolo8s_50epochs`: Smaller, faster model option

Each processing module follows a consistent interface pattern:
```python
from config import Config


class NewProcessor:
    def __init__(self, config=None):
        """Initialize the processor with configuration."""
        self.config = config or Config()

    def process(self, input_data, output_dir=None):
        """Main processing method with structured output."""
        pass

    def get_statistics(self):
        """Return processing statistics."""
        pass
```
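A hypothetical call against this interface, using OCRProcessor as a stand-in for any module (the exact input types each module accepts may differ):

```python
# Hypothetical usage of a processor following the interface above.
from modules.ocr_processor import OCRProcessor

processor = OCRProcessor()
results = processor.process(
    ["frames/frame_01.png", "frames/frame_02.png"],
    output_dir="audio_output/processed_20240101_120000/ocr",
)
print(processor.get_statistics())
```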
Adding New YOLO Models:
1. Place the model weights at `./models/model_name/best.pt`
2. Add an entry to the `MODEL_PATHS` dictionary in `config.py` (see the sketch below)
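A sketch of what the dictionary entry might look like; the existing key-to-path mapping is assumed from the `models/` directory layout, and `my_new_model` is the hypothetical addition:

```python
# config.py (sketch) -- paths assumed from the models/ directory layout
MODEL_PATHS = {
    "frame": "./models/yolo8l_50epochs_frame/best.pt",
    "yolo8l_50epochs": "./models/yolo8l_50epochs/best.pt",
    "yolo8s_50epochs": "./models/yolo8s_50epochs/best.pt",
    "my_new_model": "./models/my_new_model/best.pt",  # hypothetical new entry
}
```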
Integrating Alternative LLM Providers:
1. Extend the `LLMProcessor` class in `modules/llm_processor.py`
2. Update the provider configuration in `config.py`

Adding New TTS Providers:
1. Extend the `TTSGenerator` class in `modules/tts_generator.py`

All processing modules should use the structured directory pattern:
```
processed_YYYYMMDD_HHMMSS/
├── frames/   # Input frames and extraction results
├── ocr/      # OCR results and text processing
└── audio/    # Audio files and transcripts
```
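A minimal sketch of creating that structure (the helper name is illustrative; the timestamp format matches the `processed_YYYYMMDD_HHMMSS` naming above):

```python
# Illustrative helper for the structured output pattern shown above.
from datetime import datetime
from pathlib import Path


def create_output_dirs(base: str = "audio_output") -> Path:
    run_dir = Path(base) / f"processed_{datetime.now():%Y%m%d_%H%M%S}"
    for sub in ("frames", "ocr", "audio"):
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    return run_dir
```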
Total Processing Time: ~2-5 minutes per manga page (depending on frame count and text complexity)
API Configuration:
- Verify that `OPENAI_API_KEY` and `ELEVENLABS_API_KEY` are set in your `.env` file

Model Files:
- Make sure the model weights exist (`./models/*/best.pt`)

Processing Errors:
- Adjust `OCR_CONFIDENCE_THRESHOLD` if text extraction results look wrong

Audio Generation:
- Check the `audio_output/` directory for the generated files and transcript

Check processing logs:
```bash
# View application logs
tail -f logs/app.log

# Check specific processing folder
ls -la audio_output/processed_*/

# Verify API connectivity
python -c "import openai; print('OpenAI OK')"
python -c "import elevenlabs; print('ElevenLabs OK')"
```
To contribute, create a feature branch with `git checkout -b feature/your-feature`.

This project is licensed under the MIT License - see the LICENSE file for details.
MangAI transforms static manga pages into immersive audio experiences using cutting-edge AI technologies. From visual analysis to multi-voice narration, experience your favorite manga like never before!