# MangAI Architecture Diagrams

This document contains Mermaid diagrams showing the architecture and data flow of the MangAI application.

## 1. High-Level System Architecture

```mermaid
graph TD
    A[User Interface - Streamlit] --> B[Frame Detector Module]
    B --> C[OCR Processor Module]
    C --> D[LLM Vision Processor]
    D --> E[LLM Narrator Module]
    E --> F[Multi-Voice TTS Generator]
    F --> G[Structured Audio Output]
    
    H[Config Management] --> A
    H --> B
    H --> C
    H --> D
    H --> E
    H --> F
    
    I[YOLO Models] --> B
    J[PaddleOCR Engine] --> C
    K[OpenAI GPT Models] --> D
    K --> E
    L[ElevenLabs API] --> F
    
    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
    style E fill:#fce4ec
    style F fill:#f1f8e9
    style G fill:#fff8e1
    style H fill:#e3f2fd
```
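
The pipeline above is a strictly linear hand-off between modules. A minimal sketch of how `app.py` might orchestrate it (the module-level function names here are illustrative assumptions inferred from the diagram, not MangAI's actual API):

```python
# Hypothetical orchestration sketch -- function names are assumptions
# inferred from the pipeline above, not the real module interfaces.
from modules.frame_detector import detect_frames
from modules.ocr_processor import extract_text
from modules.llm_processor import describe_scenes, generate_script
from modules.tts_generator import synthesize_multi_voice

def process_page(image_path: str, output_dir: str) -> str:
    """Run the full pipeline on one manga page; return the primary audio path."""
    frames = detect_frames(image_path, out_dir=f"{output_dir}/frames")    # YOLO
    texts = extract_text(frames, out_dir=f"{output_dir}/ocr")             # PaddleOCR
    scenes = describe_scenes(frames)                                      # GPT-4 Vision
    script = generate_script(scenes, texts)                               # GPT-4 text
    return synthesize_multi_voice(script, out_dir=f"{output_dir}/audio")  # ElevenLabs
```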

## 2. Detailed Component Interaction

```mermaid
graph TB
    subgraph "MangAI Application"
        subgraph "Frontend Layer"
            UI[app.py - Streamlit UI]
            CONFIG[config.py - Configuration]
            ENV[.env - Environment Variables]
        end
        
        subgraph "Processing Pipeline"
            FD[Frame Detector<br/>modules/frame_detector.py]
            OCR[OCR Processor<br/>modules/ocr_processor.py]
            LLM_VIS[LLM Vision Processor<br/>modules/llm_processor.py]
            LLM_NAR[LLM Narrator Module<br/>modules/llm_processor.py]
            TTS[Multi-Voice TTS Generator<br/>modules/tts_generator.py]
        end
        
        subgraph "External Dependencies"
            YOLO[YOLO Models<br/>ultralytics]
            PADDLE[PaddleOCR<br/>paddlepaddle]
            OPENAI["OpenAI API<br/>GPT-4 Vision & Text"]
            ELEVEN[ElevenLabs API<br/>Multi-Voice TTS]
        end
        
        subgraph "File System"
            MODELS[models/<br/>*.pt YOLO files]
            IMAGES[images/<br/>test images]
            PROCESSED["processed_YYYYMMDD_HHMMSS/<br/>├── frames/<br/>├── ocr/<br/>└── audio/"]
        end
    end
    
    UI --> CONFIG
    CONFIG --> ENV
    UI --> FD
    FD --> OCR
    OCR --> LLM_VIS
    LLM_VIS --> LLM_NAR
    LLM_NAR --> TTS
    
    FD --> YOLO
    FD --> MODELS
    OCR --> PADDLE
    LLM_VIS --> OPENAI
    LLM_NAR --> OPENAI
    TTS --> ELEVEN
    TTS --> PROCESSED
    
    UI -.-> IMAGES
    
    style UI fill:#e3f2fd
    style CONFIG fill:#f3e5f5
    style ENV fill:#e8f5e8
    style FD fill:#fff3e0
    style OCR fill:#fce4ec
    style LLM_VIS fill:#f1f8e9
    style LLM_NAR fill:#fff8e1
    style TTS fill:#e1f5fe
```
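
The File System subgraph above centers on the timestamped processing directory. A sketch of how `config.py` could create it with the `pathlib` and `datetime` utilities it depends on (the helper name `create_processing_dir` is an assumption):

```python
# Sketch of the timestamped output layout shown above; the helper name
# is an assumption, not necessarily config.py's real interface.
from datetime import datetime
from pathlib import Path

def create_processing_dir(base: str = "audio_output") -> Path:
    """Create processed_YYYYMMDD_HHMMSS/ with frames/, ocr/, and audio/ subdirectories."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    root = Path(base) / f"processed_{stamp}"
    for sub in ("frames", "ocr", "audio"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root
```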

## 3. Data Flow Sequence

```mermaid
sequenceDiagram
    participant U as User
    participant UI as Streamlit UI
    participant FD as Frame Detector
    participant OCR as OCR Processor
    participant LLM_VIS as LLM Vision
    participant LLM_NAR as LLM Narrator
    participant TTS as Multi-Voice TTS
    participant FS as File System
    
    U->>UI: Upload manga image
    UI->>UI: Validate file & display preview
    U->>UI: Configure settings (model, speech rate)
    UI->>UI: Update configuration
    U->>UI: Click "Generate Audio"
    
    UI->>FS: Create timestamped processing directory
    FS-->>UI: Return directory structure (frames/, ocr/, audio/)
    
    UI->>FD: Process image for frame detection
    FD->>FD: Load YOLO model
    FD->>FD: Detect manga frames
    FD->>FD: Order frames by reading sequence
    FD->>FS: Save extracted frames to frames/
    FD-->>UI: Return ordered frame images & paths
    
    UI->>OCR: Extract text from each frame
    OCR->>OCR: Preprocess frame images
    OCR->>OCR: Run PaddleOCR (English)
    OCR->>OCR: Filter by confidence threshold
    OCR->>FS: Save OCR results to ocr/
    OCR-->>UI: Return text fragments with metadata
    
    UI->>LLM_VIS: Analyze frames for scene description
    LLM_VIS->>LLM_VIS: Process each frame with GPT-4 Vision
    LLM_VIS-->>UI: Return scene descriptions
    
    UI->>LLM_NAR: Generate manga script from scenes & text
    LLM_NAR->>LLM_NAR: Combine descriptions and text into narrative
    LLM_NAR->>LLM_NAR: Structure into narrator/character roles
    LLM_NAR-->>UI: Return structured script data
    
    UI->>TTS: Generate multi-voice audio from script
    TTS->>TTS: Parse narrator vs character segments
    TTS->>TTS: Generate narrator audio with narrator voice
    TTS->>TTS: Generate character audio with character voice
    TTS->>FS: Save separate audio files to audio/
    TTS->>FS: Create transcript file
    TTS-->>UI: Return primary audio file path
    
    UI->>U: Display results & multi-voice audio player
    U->>UI: Play audio, download files, or view transcript
```
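
The frame-detection steps in this sequence (load model, detect, order, crop) can be sketched with the `ultralytics` API. The reading-order heuristic below (top-to-bottom rows, right-to-left within a row, as in Japanese manga) is an assumption; the actual ordering logic lives in `modules/frame_detector.py`:

```python
# Sketch of frame detection and ordering; the row-bucketing heuristic and
# default model path are assumptions, not frame_detector.py's exact logic.
import cv2
from ultralytics import YOLO

def detect_and_order_frames(image_path: str,
                            model_path: str = "models/yolo8l_50epochs/best.pt"):
    model = YOLO(model_path)
    boxes = model(image_path)[0].boxes.xyxy.tolist()  # one [x1, y1, x2, y2] per frame
    # Bucket boxes into coarse rows by y, then read right-to-left within a row.
    boxes.sort(key=lambda b: (round(b[1] / 150), -b[0]))
    page = cv2.imread(image_path)
    return [page[int(y1):int(y2), int(x1):int(x2)] for x1, y1, x2, y2 in boxes]
```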

## 4. Module Dependencies

```mermaid
graph LR
    subgraph "Core Application"
        APP[app.py] --> CONFIG[config.py]
        APP --> ST[streamlit]
        APP --> PIL[PIL.Image]
        APP --> NP[numpy]
    end
    
    subgraph "Processing Modules"
        APP --> FD[modules.frame_detector]
        APP --> OCR[modules.ocr_processor]
        APP --> LLM[modules.llm_processor]
        APP --> TTS[modules.tts_generator]
    end
    
    subgraph "Frame Detection Stack"
        FD --> YOLO[ultralytics.YOLO]
        FD --> CV2[cv2/OpenCV]
        FD --> PIL2[PIL.Image]
        FD --> NP2[numpy]
    end
    
    subgraph "OCR Processing Stack"
        OCR --> PADDLE[paddleocr.PaddleOCR]
        OCR --> CV2B[cv2/OpenCV]
        OCR --> NP3[numpy]
    end
    
    subgraph "LLM Processing Stack"
        LLM --> OPENAI[openai.OpenAI]
        LLM --> BASE64[base64]
    end
    
    subgraph "TTS Generation Stack"
        TTS --> ELEVEN[elevenlabs.client]
        TTS --> PYDUB[pydub]
    end
    
    subgraph "System Utilities"
        CONFIG --> OS[os.path]
        CONFIG --> DATETIME[datetime]
        CONFIG --> PATHLIB[pathlib]
    end
    
    style APP fill:#e1f5fe
    style FD fill:#f3e5f5
    style OCR fill:#e8f5e8
    style LLM fill:#fff3e0
    style TTS fill:#fce4ec
    style YOLO fill:#ffebee
    style PADDLE fill:#ffebee
    style OPENAI fill:#ffebee
    style ELEVEN fill:#ffebee
```
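
The OCR stack above maps directly onto PaddleOCR's interface. A sketch of the extract-and-filter step, assuming the PaddleOCR 2.x API (the 0.5 confidence threshold is an illustrative assumption; the real value comes from the configuration):

```python
# Sketch of text extraction with PaddleOCR (2.x interface); the confidence
# threshold is an illustrative assumption.
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="en", use_angle_cls=True)  # initialize once, reuse per frame

def extract_fragments(frame_path: str, min_confidence: float = 0.5) -> list[str]:
    """Return recognized text fragments above the confidence threshold."""
    result = ocr.ocr(frame_path, cls=True)
    fragments = []
    for _box, (text, confidence) in result[0] or []:
        if confidence >= min_confidence:
            fragments.append(text)
    return fragments
```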

## 5. User Interaction Flow

```mermaid
stateDiagram-v2
    [*] --> ImageUpload
    ImageUpload --> FileValidation
    FileValidation --> PreviewDisplay
    PreviewDisplay --> ConfigurationSettings
    ConfigurationSettings --> ReadyToProcess
    ReadyToProcess --> ProcessingPipeline
    
    state ProcessingPipeline {
        [*] --> DirectoryCreation
        DirectoryCreation --> FrameDetection
        FrameDetection --> OCRExtraction
        OCRExtraction --> LLMVisionAnalysis
        LLMVisionAnalysis --> LLMNarrativeGeneration
        LLMNarrativeGeneration --> MultiVoiceTTSGeneration
        MultiVoiceTTSGeneration --> [*]
    }
    
    ProcessingPipeline --> ResultsDisplay
    ResultsDisplay --> MultiVoiceAudioPlayback
    MultiVoiceAudioPlayback --> FileDownload
    FileDownload --> [*]
    
    ImageUpload : User uploads manga image
    FileValidation : System validates file type/size
    PreviewDisplay : Display image preview
    ConfigurationSettings : User configures YOLO model & speech settings
    ReadyToProcess : Ready state with Generate button
    DirectoryCreation : Create timestamped processing directory
    FrameDetection : YOLO detects & extracts manga frames
    OCRExtraction : PaddleOCR extracts English text
    LLMVisionAnalysis : GPT-4 Vision analyzes frame scenes
    LLMNarrativeGeneration : GPT-4 Text creates structured script
    MultiVoiceTTSGeneration : ElevenLabs generates narrator & character audio
    ResultsDisplay : Show statistics & multi-voice breakdown
    MultiVoiceAudioPlayback : Play narrator/character audio separately or combined
    FileDownload : Download audio files & transcript
```
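
The `LLMVisionAnalysis` state corresponds to one vision call per frame. A sketch using the OpenAI Python SDK (the model name and prompt are placeholders; MangAI's configured model may differ):

```python
# Sketch of per-frame scene analysis; model name and prompt are placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment (.env)

def describe_frame(frame_path: str) -> str:
    with open(frame_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder for the configured GPT-4 vision model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the scene in this manga frame."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```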

## 6. File System Organization

```mermaid
graph TD
    ROOT[mangAI/] --> APP[app.py]
    ROOT --> CONFIG[config.py]
    ROOT --> REQ[requirements.txt]
    ROOT --> START[start.sh]
    ROOT --> README[README.md]
    ROOT --> ARCH[architecture_diagram.md]
    
    ROOT --> MODULES[modules/]
    MODULES --> INIT[__init__.py]
    MODULES --> FD_PY[frame_detector.py]
    MODULES --> OCR_PY[ocr_processor.py]
    MODULES --> LLM_PY[llm_processor.py]
    MODULES --> TTS_PY[tts_generator.py]
    
    ROOT --> MODELS[models/]
    MODELS --> YOLO8L[yolo8l_50epochs/]
    MODELS --> YOLO8L_FRAME[yolo8l_50epochs_frame/]
    MODELS --> YOLO8S[yolo8s_50epochs/]
    YOLO8L --> BEST1[best.pt]
    YOLO8L_FRAME --> BEST2[best.pt]
    YOLO8S --> BEST3[best.pt]
    
    ROOT --> IMAGES[images/]
    IMAGES --> TEST1[test1.jpg]
    IMAGES --> TEST2[test2.jpg]
    IMAGES --> TEST3[test3.jpg]
    IMAGES --> TEST4[test4.jpg]
    
    ROOT --> AUDIO[audio_output/]
    AUDIO --> PROC1[processed_20240101_120000/]
    AUDIO --> PROC2[processed_20240101_135000/]
    AUDIO --> PROCN[processed_YYYYMMDD_HHMMSS/]
    
    PROC1 --> FRAMES1[frames/]
    PROC1 --> OCR1[ocr/]
    PROC1 --> AUDIO1[audio/]
    
    FRAMES1 --> FRAME1[frame_0.jpg]
    FRAMES1 --> FRAME2[frame_1.jpg]
    FRAMES1 --> FRAMEX[frame_n.jpg]
    
    OCR1 --> OCR_JSON[ocr_results.json]
    OCR1 --> OCR_TXT[combined_text.txt]
    
    AUDIO1 --> NARRATOR[narrator_audio.wav]
    AUDIO1 --> CHARACTER[character_audio.wav]
    AUDIO1 --> COMBINED[combined_audio.wav]
    AUDIO1 --> TRANSCRIPT[transcript.txt]
    
    ROOT --> LOGS[logs/]
    ROOT --> VENV[virtualenv/]
    
    style ROOT fill:#e1f5fe
    style MODULES fill:#f3e5f5
    style MODELS fill:#e8f5e8
    style AUDIO fill:#fff3e0
    style PROC1 fill:#fce4ec
```
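
The `audio/` contents above (per-role files plus a combined track) suggest the shape of the TTS step. A sketch assuming the `elevenlabs` 1.x client and `pydub`; the voice IDs and the `(role, text)` segment format are placeholder assumptions:

```python
# Sketch of multi-voice synthesis and stitching; voice IDs and the segment
# format are assumptions, not tts_generator.py's real interface.
import io
from elevenlabs.client import ElevenLabs
from pydub import AudioSegment

client = ElevenLabs()  # reads the API key from the environment
VOICES = {"narrator": "narrator-voice-id", "character": "character-voice-id"}  # placeholders

def synthesize_combined(segments: list[tuple[str, str]], out_dir: str) -> str:
    """Stitch (role, text) segments into audio/combined_audio.wav."""
    combined = AudioSegment.silent(duration=0)
    for role, text in segments:
        audio = b"".join(client.text_to_speech.convert(voice_id=VOICES[role], text=text))
        combined += AudioSegment.from_file(io.BytesIO(audio), format="mp3")
    path = f"{out_dir}/combined_audio.wav"
    combined.export(path, format="wav")
    return path
```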

## 7. Processing Pipeline Detail

```mermaid
flowchart TD
    START([User Uploads Image]) --> VALIDATE{Validate File}
    VALIDATE -->|Valid| PREVIEW[Display Preview]
    VALIDATE -->|Invalid| ERROR[Show Error Message]
    ERROR --> START
    
    PREVIEW --> SETTINGS[Configure Settings]
    SETTINGS --> GENERATE[Click Generate Button]
    
    GENERATE --> INIT_FD[Initialize Frame Detector]
    INIT_FD --> LOAD_MODEL[Load YOLO Model]
    LOAD_MODEL --> DETECT[Detect Manga Frames]
    DETECT --> ORDER[Order Frames by Reading Sequence]
    ORDER --> CROP[Crop Individual Frames]
    
    CROP --> INIT_OCR[Initialize OCR Processor]
    INIT_OCR --> PREPROCESS[Preprocess Frame Images]
    PREPROCESS --> OCR_EXTRACT[Extract Text with Tesseract]
    OCR_EXTRACT --> FILTER[Filter by Confidence Threshold]
    
    FILTER --> COMBINE[Combine Text Fragments]
    COMBINE --> CLEAN[Clean Text for Speech]
    CLEAN --> OPTIMIZE[Optimize for TTS]
    
    OPTIMIZE --> INIT_TTS[Initialize TTS Generator]
    INIT_TTS --> CONFIG_VOICE[Configure Voice Settings]
    CONFIG_VOICE --> GENERATE_AUDIO[Generate Audio]
    GENERATE_AUDIO --> SAVE_FILE[Save Audio File]
    
    SAVE_FILE --> DISPLAY[Display Results]
    DISPLAY --> PLAY[Audio Player]
    DISPLAY --> DOWNLOAD[Download Link]
    PLAY --> FINISH([Process Complete])
    DOWNLOAD --> FINISH
    
    style START fill:#e8f5e8
    style FINISH fill:#e8f5e8
    style VALIDATE fill:#fff3e0
    style ERROR fill:#ffebee
```
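
The `Clean Text for Speech` step above normalizes raw OCR output before synthesis. A sketch of plausible cleanup rules (the exact rules in `modules/ocr_processor.py` may differ):

```python
# Sketch of OCR-to-speech text cleanup; the specific rules are assumptions.
import re

def clean_for_speech(fragments: list[str]) -> str:
    """Join OCR fragments and normalize them for TTS input."""
    text = " ".join(fragments)
    text = re.sub(r"[|_~^]+", " ", text)          # drop common OCR noise characters
    text = re.sub(r"([!?.])\1{1,}", r"\1", text)  # collapse repeated punctuation
    return re.sub(r"\s+", " ", text).strip()      # collapse whitespace / line breaks
```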

## Key Features Highlighted

- Modular five-stage pipeline: YOLO frame detection → PaddleOCR text extraction → GPT-4 Vision scene analysis → GPT-4 script generation → ElevenLabs multi-voice TTS
- Separate narrator and character audio tracks, a combined track, and a transcript file per run
- Timestamped processing directories (`processed_YYYYMMDD_HHMMSS/` with `frames/`, `ocr/`, and `audio/` subdirectories)
- YOLO model selection and speech settings configurable from the Streamlit UI

## Migration Notes

The functionality of the original yolov8Model.py script has been integrated into the modular architecture (see modules/frame_detector.py), preserving backward compatibility while improving maintainability.