This post summarizes the material in Luan, Paris, Schechtman, and Bala’s article “Deep Photo Style Transfer”
Without fail, every photo you see is edited in one way or another. Edges are retouched, and color levels are adjusted for aesthetics. These edits, however, aren’t usually globally conscious, i.e. they are not taking into account the entire picture when making pixel-level changes. An alternative to local retouching would be to make the entire photo look more similar to the “abstract idea” embodied by another image. This goal of a first picture (called the input picture) with the qualitative elements, like the color scheme and brightness, of another picture (called the reference picture), and, is called “style transfer,” and it can be accomplished using convolutional neural networks.1
A team of researchers from Cornell University and Adobe recently improved on earlier techniques for “style transfer” in two important ways. First, the new method doesn’t change the geometry of the image. Previous methods would take the straight edge of a building in the input image and, in the process of applying new colors, twist it in unnatural ways. Secondly, they manage to maintain “semantic accuracy”—that is, while the style transfer “knows” about the entire image, styles don’t bleed across important visual barriers. In other words, the reference image’s sky doesn’t affect the input image’s houses.
To understand the big contribution of the author’s “neural style” algorithm, we need to understand previous works. Suppose we have an input image and we want to apply the style from a reference image to create an output image. The usual objective function for “style transfer” minimizes a linear combination of two terms. The first term ensures that the output image is similar to the input image (it is the sum of all squared element-wise differences between the feature maps of the input and output images). The second term ensures that the output image is similar to the reference image (this term is the sum of all the squared element-wise differences between the Gram matrices of the output and style images). The second term is also multiplied by a constant that controls the trade-off between content (the first term) and style (the second term).
The issue with that method is that it creates output images that aren’t “photorealistic.” Those images have distortions, both in terms of shape and colors bleeding into areas they shouldn’t. To get around this, the authors note that “the input is already photorealistic.” By making sure the input image doesn’t change too much, they can exploit that photorealism.
The authors change the objective function by adding one term that makes sure the changes between the input and output images are locally linear in terms of color. They also make a change to the term that makes sure the output image is similar to the reference image. One of their concerns is that the style of the sky in the reference image will affect the style of the houses (for example) in the output image. To get around that, they use another neural net to tag areas of the input and reference images with descriptions like: lake, sky, house, tree, etc. The output-reference term then makes sure that similarly tagged areas from the output and reference images are similar. The new terms each get a constant that controls how much they affect the output image (style v. content v. local transformation).
In the images above, the left-most column is the input image, the center column is the style image and the right column is the output image. The results are pretty staggering, and the paper itself contains comparisons with older methods that make their improvements obvious. The deep blacks present in the folds of the rose transfer flawlessly, something older algorithms couldn’t manage to pull off.
This isn’t a post, even a little, about ethics, but it seems important to keep broader context in mind when thinking about methods that convincingly blur the line between what isn’t and what is. In a world where the internet can’t agree whether a dress is blue-and-black or white-and-gold, even without editing, the way powerful algorithms are changing the meaning of the word “photorealistic” can be unnerving. Deep-style transfer is incredible, but as the field improves social guidelines about content fidelity need to improve if we have any chance of telling output and input images apart.
1. At a very high level, convolutional neural networks are neural networks with particular rules on how neurons from a previous layer affect neurons in a subsequent layer (these rules apply to the convolutional layers). There are two rules. One, neurons can only affect other nearby neurons, and two, the way the effect of a neuron on another neuron is calculated is the same for all neurons. These rules are typically called “locally connected” and “shared weight.” A convolutional neural network is then just a neural network where most of the layers are locally connected and have shared weights. For a more thorough, better look at the subject: https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/