Generative Image Inpainting with Contextual Attention

  • 1 University of Illinois at Urbana–Champaign
  • 2 Adobe Research

CelebA-HQ Demo:

Some notes:
1. Results are direct outputs from trained generative neural networks. No post-processing steps are applied.
2. The model is trained on CelebA-HQ, with 2k randomly sampled images held out as a validation set for the demo.
3. The current image resolution is 256x256; higher-resolution models are being trained.
4. Demo is for research purposes only.

In addition to research purposes, have fun as well! Tag #deepfill.

Now I am watching you! Smile! Remove watermark. Edit bangs! Swap eyes! No mustache!


Recent deep-learning-based approaches have shown promising results on the challenging task of filling in large missing regions in an image. These methods can generate visually plausible image structures and textures, but often create distorted structures or blurry textures inconsistent with surrounding areas. This is mainly due to the ineffectiveness of convolutional neural networks in explicitly borrowing or copying information from distant spatial locations. On the other hand, traditional texture and patch synthesis approaches are particularly suitable when textures need to be borrowed from the surrounding regions. Motivated by these observations, we propose a new deep generative model-based approach which can not only synthesize novel image structures but also explicitly utilize surrounding image features as references during network training to make better predictions. The model is a feed-forward, fully convolutional neural network which can process images with multiple holes at arbitrary locations and with variable sizes at test time. Experiments on multiple datasets including faces, textures and natural images demonstrate that the proposed approach generates higher-quality inpainting results than existing ones. Code and trained models will be released.

Example inpainting results of our method on images of natural scenes (Places2), faces (CelebA) and objects (ImageNet). Missing regions are shown in white. In each pair, the left image is the input and the right image is the direct output of our trained generative neural networks without any post-processing.

Contextual Attention

The contextual attention layer learns where to borrow or copy feature information from known background patches (orange pixels) to generate missing patches (blue pixels). First, convolution is used to compute matching scores between foreground patches and background patches (used as convolutional filters). Then softmax is applied to compare the scores and obtain an attention score for each pixel. Finally, foreground patches are reconstructed from background ones by performing deconvolution on the score map. The contextual attention layer is differentiable and fully convolutional.
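The three steps above (match, softmax, reconstruct) can be illustrated with a minimal NumPy sketch. This is not the released implementation. It assumes 3x3 patches, cosine similarity as the matching score, a softmax scale of 10, and that a background patch is usable only if it contains no missing pixels. The function names `extract_patches` and `contextual_attention` are hypothetical, and explicit loops stand in for the convolution and deconvolution a real network would use.

```python
import numpy as np

def extract_patches(feat, k=3):
    """Collect the k x k patch centered on every location of a (H, W, C) map.
    Uses zero padding, so the result has shape (H*W, k, k, C)."""
    H, W, C = feat.shape
    p = k // 2
    padded = np.pad(feat, ((p, p), (p, p), (0, 0)))
    patches = np.empty((H * W, k, k, C), dtype=feat.dtype)
    for y in range(H):
        for x in range(W):
            patches[y * W + x] = padded[y:y + k, x:x + k]
    return patches

def contextual_attention(feat, mask, k=3, scale=10.0):
    """feat: (H, W, C) features. mask: (H, W), 1 = missing, 0 = known.
    Returns the reconstructed features and the (H*W, N) attention map."""
    H, W, C = feat.shape
    p = k // 2
    patches = extract_patches(feat, k)
    patch_mask = extract_patches(mask[..., None].astype(feat.dtype), k)

    # Background patches: those that contain no missing pixel (assumption).
    valid = patch_mask.reshape(H * W, -1).sum(axis=1) == 0
    flat_all = patches.reshape(H * W, -1)
    flat_bg = flat_all[valid]                               # (N, k*k*C)

    # Step 1: matching score. Correlating every patch with the normalized
    # background patches is what a convolution with those patches as
    # filters would compute; here it is cosine similarity (assumption).
    unit_all = flat_all / (np.linalg.norm(flat_all, axis=1, keepdims=True) + 1e-8)
    unit_bg = flat_bg / (np.linalg.norm(flat_bg, axis=1, keepdims=True) + 1e-8)
    scores = scale * (unit_all @ unit_bg.T)                 # (H*W, N)

    # Step 2: softmax over background patches gives the attention scores.
    scores -= scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)

    # Step 3: reconstruction. Paste attention-weighted background patches
    # back (the role of the deconvolution) and average the overlaps.
    recon = (attn @ flat_bg).reshape(H * W, k, k, C)
    out = np.zeros((H + 2 * p, W + 2 * p, C))
    count = np.zeros((H + 2 * p, W + 2 * p, 1))
    for y in range(H):
        for x in range(W):
            out[y:y + k, x:x + k] += recon[y * W + x]
            count[y:y + k, x:x + k] += 1
    out = out[p:p + H, p:p + W] / count[p:p + H, p:p + W]

    # Only missing locations are replaced; known features pass through.
    return np.where(mask[..., None] == 1, out, feat), attn

# Toy example: random 8x8 features with a 2x2 hole in the middle.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 4))
mask = np.zeros((8, 8))
mask[3:5, 3:5] = 1
out, attn = contextual_attention(feat, mask)
```

Each row of `attn` is a distribution over background patches for one location, which is the quantity an attention visualization would display. Every operation is a matrix product, softmax, or weighted sum, so the same computation stays differentiable when written in an autodiff framework.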

Model Architecture

Main results (512x680 resolution) on Places2 are here.

Visual attention interpretation examples. The visualization (highlighted regions) shows which parts of the input image are attended to most. Each triad, from left to right, shows the input image, the result and the attention visualization.


More results (input, output and attention map) on CelebA:

More results (input, output and attention map) on ImageNet:


  title={Generative Image Inpainting with Contextual Attention},
  author={Yu, Jiahui and Lin, Zhe and Yang, Jimei and Shen, Xiaohui and Lu, Xin and Huang, Thomas S},
  journal={arXiv preprint arXiv:1801.07892},