High-quality material generation is key to virtual environment authoring and inverse rendering. We propose MaterialPicker, a multi-modal material generator leveraging a Diffusion Transformer (DiT) architecture that improves and simplifies the creation of high-quality materials from text prompts and/or photographs. Our method can generate a material from an image crop of a material sample, even if the captured surface is distorted, viewed at an angle, or partially occluded, as is often the case in photographs of natural scenes. The user can additionally specify a text prompt to further guide the generation. We finetune a pre-trained DiT-based video generator into a material generator, treating each material map as a frame in a video sequence. We evaluate our approach both quantitatively and qualitatively and show that it enables more diverse material generation and better distortion correction than previous work.
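The core architectural idea, repurposing the temporal axis of a video DiT so that each material map plays the role of a video frame, can be sketched in a few lines. The snippet below is a minimal, hypothetical PyTorch illustration rather than the authors' implementation: the chosen map set, the `MaterialDiT` wrapper, and the `context=` conditioning interface are all assumptions for exposition.

```python
# Minimal sketch (assumptions, not the authors' code): a set of material maps is
# stacked along the temporal axis so a pre-trained video DiT can denoise them
# jointly, conditioned on a photo crop and an optional text prompt.
import torch
import torch.nn as nn

# Assumed material channels; the paper's exact map set may differ. A per-pixel
# material mask could plausibly be handled as one more "frame" in the same way.
MAP_NAMES = ["albedo", "normal", "roughness", "metallic", "height"]


def maps_to_video(maps: dict) -> torch.Tensor:
    """Stack per-map images (each B x 3 x H x W) into a pseudo-video of shape
    B x T x 3 x H x W, where T indexes material maps instead of time."""
    return torch.stack([maps[name] for name in MAP_NAMES], dim=1)


class MaterialDiT(nn.Module):
    """Hypothetical wrapper: a frozen or finetuned video DiT denoiser whose
    conditioning tokens now carry the photo crop and the text prompt."""

    def __init__(self, video_dit: nn.Module):
        super().__init__()
        self.video_dit = video_dit

    def forward(self, noisy_maps, timestep, photo_tokens, text_tokens):
        # Concatenate image and text tokens into one conditioning sequence,
        # mirroring how the video backbone consumes its own prompt tokens.
        cond = torch.cat([photo_tokens, text_tokens], dim=1)
        return self.video_dit(noisy_maps, timestep, context=cond)
```

Because the video backbone already attends across frames, the generated maps stay spatially aligned with one another, which is presumably why a video generator, rather than a per-image one, is the natural starting point for finetuning.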
We conduct tests on diverse material types in both indoor and outdoor scenes, demonstrating the generalization capability of our model. The first column shows real photographs captured by smartphones, before cropping; the box marks the cropped area, which is the image actually passed to the model, shown in the second column. The third through ninth columns show the generated material maps and renderings under two environment maps. The last column shows the mask of the dominant material location automatically predicted by our model. The left side of each row is labeled with the text prompt used as conditioning.
@misc{ma2024materialpicker,
  title={MaterialPicker: Multi-Modal Material Generation with Diffusion Transformers},
  author={Xiaohe Ma and Valentin Deschaintre and Miloš Hašan and Fujun Luan and Kun Zhou and Hongzhi Wu and Yiwei Hu},
  year={2024},
  eprint={2412.03225},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.03225},
}