The DiT-VTON model supports diverse use cases, enabling inpainting within user-specified editing regions with content guided by a reference image. The model can semantically infer and generate expected objects and textures, perform local editing, and even identify specific body parts for virtual try-on tasks, including multi-garment try-on, showcasing its versatility in content-aware editing and synthesis. We have also pioneered the expansion of this research area beyond traditional garment virtual try-on to virtual try-all, extending its application to a wide range of product categories, including furniture, jewelry, shoes, and other wearables such as scarves, glasses, and handbags.
The rapid growth of e-commerce has intensified the demand for Virtual Try-On (VTO) technologies that enable customers to realistically visualize products overlaid on their own images. Despite recent advances, existing VTO models face challenges with fine-grained detail preservation, robustness to real-world imagery, efficient sampling, image editing capabilities, and generalization across diverse product categories. In this paper, we present DiT-VTON, a novel VTO framework built on a Diffusion Transformer (DiT) architecture, renowned for its performance on text-conditioned image generation (text-to-image), adapted here for the image-conditioned VTO task. We systematically explore multiple DiT configurations, including in-context token concatenation, channel concatenation, and ControlNet integration, to determine the best setup for VTO image conditioning. Our findings indicate that token concatenation combined with pose stitching yields the best performance. To enhance robustness, we train the model on an expanded dataset encompassing varied backgrounds, unstructured references, and non-garment categories, demonstrating the benefits of data scaling for VTO adaptability. DiT-VTON also redefines the VTO task beyond garment try-on, offering a versatile Virtual Try-All (VTA) solution capable of handling a wide range of product categories and supporting advanced image editing functionalities, such as pose preservation, precise localized region editing and refinement, texture transfer, and object-level customization. Experimental results show that our model surpasses state-of-the-art methods on the public VITON-HD and DressCode datasets for the VTO task, achieving superior detail preservation and robustness without relying on additional image condition encoders. It also surpasses state-of-the-art models with VTA and image editing capabilities on a diverse dataset spanning thousands of product categories. As a result, DiT-VTON significantly advances VTO applicability in diverse real-world scenarios, enhancing both the realism and personalization of online shopping experiences.
Illustration of the different DiT-VTON model configurations for effectively integrating image conditions.
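As a rough illustration of the in-context token concatenation configuration described above, the following PyTorch sketch shows how reference-image tokens can be concatenated with noisy target-image tokens along the sequence axis so that self-attention in a DiT-style block attends across both. All class and variable names (e.g., TokenConcatDiTBlock) and dimensions are hypothetical and do not reflect the authors' actual implementation.

```python
# Minimal sketch, assuming a simplified DiT-style block; names and sizes are illustrative.
import torch
import torch.nn as nn


class TokenConcatDiTBlock(nn.Module):
    """One transformer block where reference-image tokens are concatenated
    with noisy target-image tokens (in-context conditioning), so attention
    can flow between the target being denoised and the reference image."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, target_tokens, reference_tokens):
        # Concatenate along the sequence axis: [B, N_tgt + N_ref, D]
        x = torch.cat([target_tokens, reference_tokens], dim=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)   # joint self-attention over both token sets
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        # Split back: only the target positions continue through the denoising path
        n_tgt = target_tokens.shape[1]
        return x[:, :n_tgt], x[:, n_tgt:]


# Usage example: 256 noisy target tokens conditioned on 256 reference-image tokens
block = TokenConcatDiTBlock()
tgt = torch.randn(2, 256, 768)
ref = torch.randn(2, 256, 768)
tgt_out, ref_out = block(tgt, ref)
```

By contrast, the channel concatenation variant would stack the reference latent with the noisy target latent along the channel dimension before patchification, and the ControlNet variant would inject reference features through a separate conditioning branch; the token concatenation shown here is the configuration the paper reports as performing best when combined with pose stitching.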
@article{li2024ditvton,
  author = {Li, Qi and Qiu, Shuwen and Han, Julien and Koo, Kee Kiat and Bouyarmane, Karim},
  title  = {DiT-VTON: Diffusion Transformer Framework for Unified Multi-Category Virtual Try-On and Virtual Try-All with Integrated Image Editing},
  year   = {2024},
}