GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing




Abstract


Despite the success achieved by existing image generation and editing methods, current models still struggle with complex problems including intricate text prompts, and the absence of verification and self-correction mechanisms makes the generated images unreliable. Meanwhile, a single model tends to specialize in particular tasks and possess the corresponding capabilities, making it inadequate for fulfilling all user requirements. We propose GenArtist, a unified image generation and editing system, coordinated by a multimodal large language model (MLLM) agent. We integrate a comprehensive range of existing models into the tool library and utilize the agent for tool selection and execution. For a complex problem, the MLLM agent decomposes it into simpler sub-problems and constructs a tree structure to systematically plan the procedure of generation, editing, and self-correction with step-by-step verification. By automatically generating missing position-related inputs and incorporating position information, the appropriate tool can be effectively employed to address each sub-problem. Experiments demonstrate that GenArtist can perform various generation and editing tasks, achieving state-of-the-art performance and surpassing existing models such as SDXL and DALL-E 3, as can be seen in Fig. 1. We will open-source the code for future research and applications.
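The abstract's point about missing position-related inputs can be illustrated with a minimal sketch. It is not the GenArtist implementation: the tool names, the `requires` field, the provider functions, and the default placement box are all hypothetical stand-ins chosen for illustration. Each tool declares which position inputs it needs, and any input the user did not supply is filled in by an auxiliary provider before the tool is called.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1) in pixels
Scene = Dict[str, Box]            # stand-in for detector output: object name -> box


@dataclass(frozen=True)
class ToolSpec:
    name: str
    requires: Tuple[str, ...]     # position-related inputs the tool needs


# Hypothetical tool library; entries are illustrative, not the paper's tool list.
TOOL_LIBRARY: Dict[str, ToolSpec] = {
    "remove_object": ToolSpec("remove_object", requires=("mask",)),
    "add_object": ToolSpec("add_object", requires=("bbox",)),
    "global_style": ToolSpec("global_style", requires=()),
}


def box_for(scene: Scene, target: str) -> Box:
    """Return the detected box, or an assumed default placement for a new object."""
    return scene.get(target, (0, 0, 64, 64))


def mask_for(scene: Scene, target: str) -> Tuple[str, Box]:
    """Rectangular mask built from the box (a real system would segment instead)."""
    return ("rect", box_for(scene, target))


# Maps each position-related input to the auxiliary step that can produce it.
PROVIDERS: Dict[str, Callable[[Scene, str], Any]] = {
    "bbox": box_for,
    "mask": mask_for,
}


def prepare_inputs(tool_name: str, target: str, scene: Scene,
                   given: Dict[str, Any]) -> Dict[str, Any]:
    """Keep user-supplied inputs and generate only the missing position inputs."""
    spec = TOOL_LIBRARY[tool_name]
    inputs = dict(given)
    for key in spec.requires:
        if key not in inputs:
            inputs[key] = PROVIDERS[key](scene, target)
    return inputs


scene = {"dog": (10, 20, 110, 140)}
removal = prepare_inputs("remove_object", "dog", scene, {})
addition = prepare_inputs("add_object", "cat", scene, {})
placed = prepare_inputs("add_object", "cat", scene, {"bbox": (5, 5, 50, 50)})
styled = prepare_inputs("global_style", "", scene, {"prompt": "watercolor"})
```

Declaring requirements per tool keeps the selection step simple: a tool only becomes usable once its position inputs exist, whether they came from the user or were generated automatically.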







Overview


Figure: Overview of GenArtist. The MLLM agent decomposes problems, plans with a tree structure, and then invokes tools to address each sub-problem. Employing the agent as the "brain" effectively realizes a unified generation and editing system.
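The plan-execute-verify loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: `PlanNode`, `execute`, the stub tools, and the `object=attribute` instruction format are hypothetical, and an image is modelled as a plain dictionary so the example runs without any model. Each node names a tool for one sub-problem, lists fallback tools to try if verification fails, and holds follow-up sub-problems as children.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Image = Dict[str, str]                          # stand-in image: object -> colour
Tool = Callable[[Optional[Image], str], Image]


@dataclass
class PlanNode:
    tool: str                                   # key into the tool library
    instruction: str                            # sub-problem handled by this node
    alternatives: List["PlanNode"] = field(default_factory=list)  # fallback tools
    children: List["PlanNode"] = field(default_factory=list)      # follow-up edits


def verify(image: Image, instruction: str) -> bool:
    """Stub verifier: 'cat' checks presence, 'cat=white' checks the attribute."""
    for clause in instruction.split(","):
        obj, _, attr = clause.partition("=")
        if obj not in image or (attr and image[obj] != attr):
            return False
    return True


def execute(node: PlanNode, image: Optional[Image], tools: Dict[str, Tool],
            log: List[Tuple[str, bool]]) -> Optional[Image]:
    """Try the node, then its alternatives, until one passes verification."""
    for candidate in [node, *node.alternatives]:
        result = tools[candidate.tool](image, candidate.instruction)
        ok = verify(result, candidate.instruction)
        log.append((candidate.tool, ok))
        if ok:
            image = result                      # accept only verified results
            break
    for child in node.children:                 # continue with follow-up sub-problems
        image = execute(child, image, tools, log)
    return image


def t2i(image, instruction):
    # Stub generator: draws every requested object, always in black.
    return {name: "black" for name in instruction.split(",")}


def attr_edit(image, instruction):
    # Stub editor that fails silently, forcing a fallback.
    return dict(image)


def replace(image, instruction):
    # Stub editor that sets the requested attribute.
    obj, attr = instruction.split("=")
    return {**image, obj: attr}


tools = {"t2i": t2i, "attr_edit": attr_edit, "replace": replace}
plan = PlanNode("t2i", "cat,dog", children=[
    PlanNode("attr_edit", "cat=white",
             alternatives=[PlanNode("replace", "cat=white")]),
])
log: List[Tuple[str, bool]] = []
final = execute(plan, None, tools, log)
```

In this run the generator passes its presence check, the first editing tool fails verification, and the fallback tool succeeds, so `log` records which tool ultimately solved each sub-problem.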



Generation Results


Figure: Given complex text prompts, GenArtist generates more accurate images than existing SOTA text-to-image models, including SDXL, LMD+, RPG, PixArt, Playground, Midjourney, and DALL-E 3.

Figure: For image editing tasks, GenArtist generates more accurate images given complex user instructions.

Figure: The basic workflow and illustration for image generation tasks.

Figure: The basic workflow and illustration for image editing tasks.

BibTeX

@article{wang2024div,
  author  = {Wang, Zhenyu and Li, Aoxue and Li, Zhenguo and Liu, Xihui},
  title   = {GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing},
  journal = {arXiv preprint arXiv:2407.05600},
  year    = {2024},
}