Interview with Professor James O. Berger

Dr. James O. Berger is the Arts and Sciences Professor of Statistics at Duke University. He has received numerous major academic honors for his research and service. He is a Fellow of the Institute of Mathematical Statistics (IMS), the American Statistical Association (ASA), the American Association for the Advancement of Science (AAAS), and the International Society for Bayesian Analysis (ISBA). He has served in many leadership roles, including President of the IMS, Chair of the Section on Bayesian Statistical Science of the ASA, President of the ISBA, and Chair of the Advisory Committee for the NSF Directorate for Mathematical and Physical Sciences. He has directed 38 doctorates in statistics.

How did you become interested in statistics?

Prof. Berger completed all his degrees at Cornell University. He was originally interested in mathematics. Quite fortunately, there were three prominent statisticians in the mathematics department who piqued his interest: Larry Brown, Jack Kiefer, and Jack Wolfowitz. Eventually, he completed his thesis with Larry Brown and became a statistician.

Who were your influences?

“There were the three mentioned: Larry Brown, Jack Kiefer, and Jack Wolfowitz. At Cornell math there was a fourth influential statistician, Roger Farrell. All four of them have passed away, but they were my big influences at Cornell. Then I came directly to Purdue as an assistant professor. At Purdue, Herman Rubin was my big scientific influence. Having gone through a math program, there were a lot of things I did not know about statistics. Herman was the one who taught me a lot about statistics and statistical computing. Then there was Shanti Gupta, who was the department chairman; he was my mentor in many ways. Of course, the entire faculty was very strong in mathematical statistics, so I learned a lot from everyone.”

What was it like being involved in the department at Duke?

Prof. Berger says he “was always extremely happy [at Purdue],” but a collection of reasons led him to move to Duke: “I moved to Duke [from Purdue] after my children had grown up and I had spent my entire scientific career (23 years [as a professor at Purdue]). I was interested in trying something new. Duke was also primarily, and this is still very much true today, a Bayesian statistics department. There were always Bayesians at Purdue, but I wondered what it would be like to be in an environment where everyone was a Bayesian. It was also the time that MCMC had just taken off. The Duke people were all developing the latest MCMC techniques. I was thinking that this would be something very useful for my research.”

How do we classify Bayesian thought?

“The three main classifications, I would say, are subjective Bayes, objective Bayes, and robust Bayes. Of course, there is also a huge difference between theoretical work and applied work in Bayesian analysis. Lately MCMC – and computation more generally – has become almost a separate world. Indeed, the MCMC computational tools have become so prominent that there are now purported frequentists who ‘do MCMC,’ not understanding that it is Bayesian.”

You were a contributor to the ASA Statement on Statistical Significance and P-Values. What are your thoughts on this?

“I did not disagree with the final statement, but I was disappointed that there was not more. The problem is that the statement does not say anything that was not known eighty years ago… These things have been stated repeatedly for at least eighty years. When the ASA made this effort, I was very hopeful that they would take the next step: to actually recommend something specific as an alternative to p-values.

One can always say ‘don’t do this, don’t do that,’ but the poor scientist is left wondering what to do instead and just falls back on continuing to use the p-value. I was hoping we could come up with some clear guidance on how to move forward.

The problem is that, while the contributors to the ASA statement could all agree that the way scientists use p-values is wrong and list the reasons it is wrong, they could not reach any consensus about what to do instead.

The ASA allowed everyone to write their suggestions as to what should be done instead, and there is a supplementary website with these suggestions; but, again, most of the suggestions differ rather strongly, leaving the scientist with no clear guidance. There was also a conference on this, and an upcoming issue of The American Statistician (TAS) is devoted to the issue. The continuing ASA effort on this is highly welcome.

There was one fairly broad consensus that arose from enlisting a number of leading scientists from a variety of disciplines, mainly the social sciences, together with statisticians. The 72 leaders contacted all agreed that statistical practice would be much improved if we declared significance at the .005 level rather than the current .05. Again, many of the leaders had differing reasons as to why this would be a good change, but all agreed with the change.” The resulting article is “Redefine statistical significance” by Daniel J. Benjamin et al., published in Nature Human Behaviour 2, 6-10 (2018).

How has Bayesian statistics changed in your career?

Prof. Berger chuckles and says “It’s a lot more fun!” He then elaborates with “I became a Bayesian through talking with Herman [Rubin] and others. From a foundational standpoint, I believed the axioms of rationality. It all made sense to me. At first I was a lip-service Bayesian, but eventually I came to understand that Bayesian statistics is a very natural way to address applied problems; it is much easier to answer the problem and it is much easier to communicate with scientists. On the frequentist side, you cannot answer every question the scientist poses. You can only answer a list of possible questions. If the scientist’s question is not on the list, you have to say sorry. On the Bayesian side, any question the scientist poses is one you can attempt to answer. Thus the more applications I became involved with, the more convinced I became that Bayes was the way to go.”

What books and websites would you recommend to a non-statistician who wants to learn about Bayesian statistics and integrate it into their research?

Prof. Berger recommends seeking out an introduction-to-Bayes book written in one’s own discipline as a starting point. There is a vast introductory Bayesian literature of this type.

For years you have expressed concern about false positives due to multiple testing effects and other biases. Do you see any progress in mitigating these problems (e.g., in the genetics field or in publication standards), and how do you see the way forward?

“Yes, there is definite progress. … From 1997-2007 in Genome Wide Association Studies, very few ‘discoveries’ replicated because they were not dealing with multiple testing right. Suddenly, in 2007, things started replicating because they started dealing with multiple testing properly. Many fields of science at first do not understand what a severe problem multiple testing is, and they do not adjust enough. Then gradually they realize all their science is wrong because they are not adjusting enough, and somebody does something to change the field. That is progress. This happens more in the hard sciences and biology and medicine. In the social sciences, there are a lot more failures to adjust for multiple testing and other biases, and there is very little checking for reproducibility. So the problems are still there.

The other issue is that adjusting for multiple testing can be so difficult that there are many problems for which we do not know how to do it. People feel the need to produce answers so they do something, even if they know they have not adjusted properly. This is a real area where we need more statistical research for the scientists: finding adjustments for multiple testing that the scientists can use.”

What do you think of the emerging field of data science?

“I was just at a conference, on a panel, and this was one of the questions asked of the panel. My response was ‘I don’t really want to talk about it.’ The reason is that I understand there is an emerging field of data science and I understand what it is all about. I also understand it is going to be extremely important. I am all in favor of statistics departments adapting themselves to this field, but I do not feel personally that I am capable of giving advice on how to adapt. There are still an enormous number of standard statistical problems that need to be solved. If someone wants to move in the data science direction, that is great, but (most?) statisticians do not need to do so.

For this reason, it is difficult to give statistics departments advice as to how to adapt to data science. The one recommendation I might make, regarding curriculum changes, is that it is important to give students options, but pursuing a traditional statistics degree should certainly be one of the options!”

How have statistics departments changed over your tenure?

“When I got my PhD, the highly respected thing was mathematical statistics. Applied statistics was acknowledged as important, but secondary (in academia) to mathematical statistics. When I was applying to assistant professorships, my mentors gave Purdue a strong recommendation because it was [at the time] one of the best departments in mathematical statistics. This primacy of mathematical statistics has changed over the years to where we are today, where all kinds of statistical research are respected. Indeed, the pendulum may have shifted the other way, with something done for pure mathematical statistics reasons being sometimes denigrated. Indeed, for that reason I encourage my students today to do more interdisciplinary research than theorem proving. It’s still good to prove a nice theorem or two, but you do not have to do so. Also, data scientists today have little to do with mathematical statistics (not a good thing).”

Do tenure processes fulfill the world’s current need for scholars in statistics?

“I think so. I do not see any problems with it. In times past there have been problems, particularly in the departments with a large number of people who only appreciated mathematical statistics, when someone highly interdisciplinary came up for tenure. But I don’t see this as much of a problem now.”

Are we pushing the boundaries of statistics at the same rate as before?

“We are probably faster. As we are engaging the problems of the world, rather than self-generated problems, we are forced to advance much more quickly… not that all the rushed advancement is necessarily good (witness the history of the multiple testing problem). One area I work in these days is uncertainty quantification. It is the interface of statistics and data with complex computer models of processes (e.g., climate models). There can almost never be a firm statistical answer for most problems in this area, but we have to proceed anyway … It can be unsettling to work in an area and not be sure if your answers are rigorously based.”

Do you think statistics communication poses difficulties that other science communication fields do not have, in the sense of statisticians communicating with non-statisticians? If yes, why?

“It is a learning process. I directed SAMSI (Statistical and Applied Mathematical Sciences Institute), and its entire purpose was to bring people from different fields together to communicate. We had an elaborate process, starting with the opening workshops, to encourage communication. It is a difficult process, but necessary.

It is probably easier for statisticians to communicate with other scientists than [for scientists in two other disciplines to communicate with each other]. Two scientists from other disciplines have a much worse time communicating [with each other] than with a statistician. Partly, this is because most scientists have some knowledge of statistics. And most statisticians have experience communicating with those in other disciplines. Indeed, I feel it is quite important for statistics graduate programs to give their students the experience of talking to people in other disciplines.”

What are ways statisticians can improve their communication skills?

“I think most graduate students have consulting experience. And many graduate students are involved with large, interdisciplinary collaborations these days, in which they can learn how to communicate. And there are programs like the SAMSI program that have structures for doing this. Today, unless graduate students stick their heads in the sand, they will get exposed to communication with other scientists.

That said, when one moves to interdisciplinary work in a new discipline, one has to go through the learning process all over. One may understand the communication pitfalls from previous experiences, but one still has to learn enough of the language of the other discipline to be able to communicate, and needs to assess the statistical level of the scientists to know what to communicate.”

— Interview by Will A. Eagan —


Interview with Professor Rebecca W. Doerge

Professor Rebecca W. Doerge is the Glen de Vries Dean’s Chair in the Mellon College of Science at Carnegie Mellon University (CMU). She holds faculty appointments in both the Department of Statistics and the Department of Biological Sciences. Previously, she served as Head of the Department of Statistics at Purdue University, where she was the Trent and Judith Anderson Distinguished Professor of Statistics and Professor of Agronomy. At Purdue, she also served as the President’s Fellow on Big Data and Simulation and as Director of the Statistical Bioinformatics Center. Her numerous honors for scholarship and service include election as a fellow of the American Statistical Association (ASA) and the American Association for the Advancement of Science (AAAS), and selection as a fellow of the Committee on Institutional Cooperation (CIC).

Your research area, statistical genetics, is by its very nature applied. How do you feel other disciplines’ perception of statistics has changed over your career?

Professor Doerge notes that the question has two directions: how statistics is viewed by other disciplines, and the reverse. Many of the applied disciplines are very “appreciative and respectful that statistics is a discipline in itself, and are very willing to work with a statistician.” She adds that many non-specialists appreciate the time that applied statisticians have invested in learning their disciplinary challenges and vocabulary. Specifically, in the biology and genetics world, “statisticians who have invested the time and effort are considered as equals.”

“Now if you flip it on its ear, and discuss how [applied disciplines] that use statistics are perceived by statisticians, the answer, in my opinion, is not well.” One reason is the perception that those in applied disciplines lack theoretical depth and rigor. She does believe that this is changing, and points to bioinformatics as an example where the very best statisticians learned the biology, and then proceeded to learn the computing along with the statistics.

“It is not an equal two-sided arrow.” One remedy she suggests is that card-carrying statisticians come to accept that a considerable number of non-statisticians know a lot about statistics and can still contribute to the field of statistics, specifically in their own disciplines (e.g., ecology, forensics, genetics). Statisticians need to get beyond the view that either “you are a statistician or you are not,” and begin to embrace collaborative science.

Commenting on her own research background as a statistical geneticist, she feels that her work is viewed as a contribution to the genetics community, but not to the statistics community. Even so, the fact that she is a statistician working in genetics is a contribution in its own right.

What have you learned in your time serving as a department head at Purdue and now as a dean at CMU?

First of all, she confesses that serving in administrative roles was never her plan. She considers her strength to be mentoring graduate students, and attributes her success in that capacity to “treating each independently” and knowing each student as a person. She developed this skill set starting as an assistant professor. As department head at Purdue, she learned that faculty, especially junior faculty, were much like graduate students: you must foster respect, establish rapport, and build relationships. The surprising, and disappointing, part of becoming a department head and then a dean is the negative view that faculty have of administrators. This is an issue she wishes to speak openly about, as she feels it is a message and attitude that must change.

Professor Doerge hopes to demonstrate that one can be successful as both an administrator and a researcher. One of her personal goals is to change the attitude about administrators. She strives to be a “cool administrator who does research” and to foster a teamwork atmosphere.

Professor Doerge feels that mentoring graduate students prepared her to become a department head. Further, she feels that being head of a department, especially one of the largest statistics departments in the country, is great preparation for serving as a dean, and that having been a department head makes her a more supportive and sympathetic dean. “It’s all about communication, respect, knowing people, not reacting, [considering] the person sitting in front of you, and transparency.”

Drawing from her experience, she believes that successful leadership is about people. “Everyone wants to be valued, respected, and informed.”

At Purdue, you held a joint appointment in Agronomy. At CMU, you hold a joint appointment in Biological Sciences. What benefits and challenges do you see for joint appointments for statisticians?

Professor Doerge believes her Purdue joint appointment between the Departments of Statistics and Agronomy, in the Colleges of Science and Agriculture, respectively, enabled her to become a visible leader on campus.

“There are faculty and administrators that do not like joint faculty appointments, but I like them a lot. For interdisciplinary statisticians, you have the benefit of the statistics community and the community of the application. This doubles the exposure. You serve as a connector for undergraduates, graduate students, and faculty. At a university, it allows those in joint appointments to become leaders in transdisciplinary collaboration. I see it as a huge positive all the way around.”

The ongoing issues with joint appointments typically arise in preparation for promotion. Specifically, the promotion process for junior faculty may become far more complicated when different departments, and perhaps colleges, are involved. In Professor Doerge’s opinion, joint appointments work well when there is a major and a minor appointment. At Purdue, 75-25 appointments are common, yet at Carnegie Mellon University it is quite common to have 50-50 appointments that require independent votes from both departments and colleges.

Carnegie Mellon is one of the very few universities to have its own department dedicated to machine learning. How does this influence the academic culture for a statistician?

Professor Doerge believes that having a CMU Department of Machine Learning enables and supports interdisciplinary collaborations between machine learning folks and scientists in specific disciplines, such as physicists and biologists, who do not always have the opportunity to interact with the statistics world. Having a department of machine learning fits the culture of CMU, which is, in her words, “notorious for solving real world problems with computing.”

What do you think of the emerging field of data science?

“My first reaction is: I do not understand what [data science] is because I am a statistician. My second reaction is that data science is more understandable as a discipline descriptor than statistics simply because ‘data’ is in the name. It brings a thoughtfulness about data, and what data can do for us … to a society/culture level. Data science as a group of words is modernizing the discipline of statistics.” She suggests that statistics departments include data science in their names, as Yale University and, more recently, Carnegie Mellon University have done.

Do tenure processes fulfill the world’s current need for scholars in statistics?

“No, it produces scholars of the past. Current-day scientists and academics are becoming increasingly collaborative; the tenure process should take this into account. The majority of tenure-track positions in the U.S. today are not delivering faculty who work in teams. Interdisciplinary research, teamwork, and working with students, faculty, and staff are the path to solving real problems. The driving question we must ask ourselves every day is: are we solving real world problems in the best manner possible? Are we putting the power of diversity of thought, training, and culture to work?”

Are we pushing the boundaries of statistics at the same rate as before?

“I think so… yes, and probably faster. Big data are pushing statistics both theoretically and in application. We are living in a reality where the magnitude of data available may be beyond asymptotics. No one ever thought we would collect this much data. Are we working beyond infinity simply because we are turning to machine learning, deep learning, and active learning, instead of asymptotic theory?”

Do you think statistics communication poses difficulties that other science communication fields do not have? If yes, why?

“I hate to generalize, but historically I think statisticians are not terribly gregarious and communicative.” She qualifies her generalization by stating, “There are always exceptions.” She adds, “There needs to be an appreciation within the statistics community that communication is important. As with technical skills like computing and mathematics, we need to value communication: written, verbal, presentation, and professionalism.”

What are ways statisticians can improve their communication skills?

Her answer is to “do it.” “We all shy away from things that make us uncomfortable. For graduate students, the best thing to do is teach. There are two different levels of communicating in this setting: lecturing and office hours. Lecturing allows for organizing thoughts, presenting them to the audience, seeing if they are paying attention, etc. Office hours allow for one-on-one communication.

Give talks, posters, and volunteer within your department, university, and community. Gaining experience with communication skills does not have to be limited to statistics. For written communication, you just need to write. … you learn about yourself when you write. Do it every day. It does not have to be restricted to statistics.”

As an example, in her own research group, graduate students write a weekly summary of their research progress. It is not necessarily about what they accomplished each week. Instead, it is about organizing thoughts and putting those thoughts, results, and ideas into writing. “The best way to gain confidence with both written and spoken communication is to practice, practice, practice every day.”


— Interview by Will Eagan–

A teaser to Ryan Martin’s “A Statistical Inference Course Based on p-Values”

This post is a segue into Ryan Martin’s “A statistical inference course based on p-values” (Martin, 2017).

Have you ever seen the authors of a science paper report p = 0.000? Is that possible? And what would it mean? Well, if it were possible, it would mean that there’s a zero percent chance of seeing results at least as extreme as the experiment’s if the null hypothesis were true; that is, with 100% certainty, the investigators believe they would never get a false positive when re-testing their hypothesis. But such an assumption is a little irrational, as nothing in life is so certain, even if we had the time and resources to survey every subject in a population. At best, these zero-p authors meant to express p < 0.05, to indicate that the p-value is below a desired significance level. If this is the case, then we can recite the standard textbook interpretation of p-values, which goes something like:

If the p-value is less than our significance level of 0.05, we have enough evidence to reject our null hypothesis.

Since there are multiple interpretations of the p-value, we define it here through a toy example: suppose we collect some observation y* that measures an unknown \theta of interest. Perhaps this elusive \theta represents the average number of credit cards an American owns. Under this experiment, y is available to us (we can select a representative group of Americans, check their credit reports for the number of credit accounts opened, and calculate an average), but \theta is only available through y. If we had all the time and money in the world, we’d collect as many ys as possible, because that would let us form the density function for y, upon which we can do inference on \theta. In this case, doing inference on \theta might entail figuring out where the majority of our ys lie. But since we don’t have all the time and money in the world to get all the ys possible, we use what’s available in our statistical toolbox, and that sometimes takes the form of a hypothesis test.

In order to carry out a proper hypothesis test, we must set up a null and an alternative hypothesis. Let’s say the status quo is \theta_0 = 7 credit cards, which we’ll take as our null hypothesis, \theta = 7. For simplicity, we’ll take our alternative as \theta \neq 7. Given some data point y* = 5, and some assumptions on how we believe y is distributed under the null hypothesis (for example, normal with a mean of 7 and a standard deviation of 1), we can construct the figure below:


Illustrating the p-value (shaded in red): the proportion of the distribution of y under the null hypothesis that falls at values at or more extreme than the observed y*.
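As a minimal sketch of the shaded area, the snippet below computes the two-sided p-value for the toy example, assuming (as above) that y is normal with mean \theta_0 = 7 and standard deviation 1 under the null hypothesis. It uses Python’s standard-library `statistics.NormalDist`; the function name `two_sided_p_value` is our own illustrative choice.

```python
from statistics import NormalDist

def two_sided_p_value(y_obs: float, theta0: float, sigma: float = 1.0) -> float:
    """P(|Y - theta0| >= |y_obs - theta0|) when Y ~ Normal(theta0, sigma^2)."""
    z = abs(y_obs - theta0) / sigma
    # Both tails: values at least as far from theta0 as y_obs, on either side
    return 2 * (1 - NormalDist().cdf(z))

p = two_sided_p_value(y_obs=5.0, theta0=7.0)
print(round(p, 4))  # 0.0455
```

Under the textbook rule above, this value (about 0.046) would lead us to reject \theta = 7 at the 0.05 level.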

While this example might seem overly simplified, we chose it because it illustrates the original intention of the p-value: simply a measure of the “statistical position” of the data given the null hypothesis (Fraser and Reid, 2016). Yet, despite this simplicity, calculating the p-value was challenging given the computational resources available when hypothesis testing gained popularity. As a result, reference tables were created in which p-values were evaluated relative to standard values, like 1%, 5%, 10%, and 20%, so that the end result of an investigation was simply an approximation of the p-value, leading to conclusions like “not significant at the 5% level” or “significant at the 1% level.” This unintentionally gave p-values more meaning than a measure of “statistical position”: they became a way to make a decision for, or against, some null model (Fraser and Reid, 2016).

As it caught on that p-values below 0.05 were meaningful enough to take a stand against a null hypothesis, researchers started to game the data pipeline to ensure that their experiments led to interesting results: in the selection of subjects, in the disclosure of results, and in methodological development (Leek and Peng, 2015). Our use and abuse of p-values has led to controversies around the inability to reproduce experimental results, to the extent that some journals have even banned the use of p-values altogether. Yet statisticians remain undeterred by such moves and continue to rely on this statistic, understanding its strength in constructing tests with a desired frequentist Type I error rate (the chance of a false positive).

Given this reliance on p-values, Martin and Liu have proposed a reassessment of the p-value (Martin and Liu, 2013). They’ve designed a framework for arriving at the plausibility of a specific hypothesis, which allows the more intuitive interpretation of the p-value as a quantitative index of the “truth” of a hypothesis \theta given the data y*. They recognize that while it’s difficult to prove whether some \theta is true given the data, it’s possible to ascribe some level of plausibility to \theta. Their framework thus allows for the construction of plausibility functions, which investigators can use to reach a conclusion about a hypothesis \theta by choosing some cutoff c: if the plausibility is at or below c, the hypothesis is sufficiently implausible. In mathematical terms, letting pl_D(A) denote the plausibility of some hypothesis A given the data (denoted D), Martin and Liu’s framework lets us draw subjective conclusions based on whether:

pl_D(A) \leq c
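The cutoff rule can be sketched for the toy problem. We assume, as the next paragraph notes, that the plausibility of a single value \theta coincides with the two-sided p-value for testing that value, and we scan a grid of \theta values, keeping those whose plausibility exceeds c. The grid, the cutoff c = 0.05, and the function names are illustrative choices of ours, not part of Martin and Liu’s framework.

```python
from statistics import NormalDist

def plausibility(theta: float, y_obs: float, sigma: float = 1.0) -> float:
    """Assumed plausibility of {theta}: the two-sided p-value in the normal toy model."""
    z = abs(y_obs - theta) / sigma
    return 2 * (1 - NormalDist().cdf(z))

def plausible_region(y_obs: float, thetas: list[float], c: float) -> list[float]:
    # Keep the theta values that are NOT sufficiently implausible (plausibility > c)
    return [t for t in thetas if plausibility(t, y_obs) > c]

grid = [i / 10 for i in range(20, 81)]  # candidate theta values 2.0, 2.1, ..., 8.0
region = plausible_region(5.0, grid, c=0.05)
print(min(region), max(region))  # 3.1 6.9
```

With c = 0.05 the surviving values run from 3.1 to 6.9, which matches, up to the grid spacing, the familiar 95% confidence interval 5 ± 1.96.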

From a practical point of view, this plausibility function, at least when applied to the toy problem above, coincides with the “p-value function,” or “confidence distribution function,” explored in Xie and Singh (2013), among others. The p-value function extends the p-value itself: instead of calculating a single p-value for one null value of \theta, imagine calculating it for a range of possible \theta values, which traces out a curve of p-values over \theta.


We thus avoid singling out one alternative, leaving it up to domain experts to choose the appropriate cutoff c, where \theta values whose plausibility falls below c are considered sufficiently implausible. Valid inference is then achieved when we connect such subjective opinions to a sound model. In other words, inferences are meaningful when they generalize beyond a single study, so a sound model must capture the variability in the observed data as well as in any other sample that would be obtained under similar conditions. If we can arrive at such a model, P_\theta, indexed by some parameter \theta, valid inference is achieved if:

P_\theta (pl_D(A) \leq c) \leq \alpha

where \alpha is some small value, so that, whenever A is true (that is, for every \theta in A), the probability under our proposed model P_\theta of arriving at the subjective opinion that A is sufficiently implausible is no greater than \alpha. We thus control the Type I error, guarding ourselves against saying the data support a particular alternative when in reality it is false. If this condition is satisfied, then our subjective opinions based on the plausibility function are valid. While the p-value function arises as a special case of this new inferential framework, plausibility functions are in fact more general: they extend beyond the hypothesis testing context and steer clear of the need for any asymptotic justification through the use of predictive random sets (the technical details are developed in Martin and Liu, 2013). Beyond these advantages, though, the potency of this new line of inferential thought lies in its detachment from exactitude: from using p = 0.05 or p = 0.005, from setting some arbitrary standard that determines how publishable or worthy an experiment is. It focuses, instead, on the development of plausibility functions that can be easily interpreted as quantitative indexes of “truth,” which researchers can use to guide their decisions. This underscores the subjectivity involved in doing statistics: we cannot free ourselves from the responsibility of using our own judgement.
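The validity condition can be checked by simulation in the toy problem. The sketch below assumes the normal model with standard deviation 1, takes A = {7}, generates data with A true, and estimates how often the plausibility of A falls at or below c. The seed, the number of simulations, and the function names are illustrative assumptions. For this model the plausibility (the two-sided p-value) is uniformly distributed when A is true, so the estimated rate should be close to c, meaning the condition holds with \alpha = c.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def plausibility(theta: float, y_obs: float, sigma: float = 1.0) -> float:
    """Assumed plausibility of {theta}: the two-sided p-value in the normal toy model."""
    z = abs(y_obs - theta) / sigma
    return 2 * (1 - STD_NORMAL.cdf(z))

def rejection_rate(theta_true: float, c: float, n_sims: int = 100_000, seed: int = 1) -> float:
    """Monte Carlo estimate of P_theta(pl_D({theta_true}) <= c) when {theta_true} is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        y = rng.gauss(theta_true, 1.0)  # data drawn from the model with A true
        if plausibility(theta_true, y) <= c:
            hits += 1  # A judged sufficiently implausible although it is true
    return hits / n_sims

rate = rejection_rate(theta_true=7.0, c=0.05)
print(rate)  # close to 0.05
```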


Crane, H., and R. Martin (2018), “Is statistics meeting the needs of science?” PsyArXiv.

Fraser, D., and N. Reid (2016), “Crisis in science? or crisis in statistics!”

Leek, J., and R. Peng (2015), “Statistics: P values are just the tip of the iceberg,” Nature, 520, 612.

Lilienfeld, S. O., et al. (2015), “Fifty psychological and psychiatric terms to avoid: a list of inaccurate, misleading, misused, ambiguous, and logically confused words and phrases,” Frontiers in Psychology, 6, 1110.

Martin, R. (2017), “A statistical inference course based on p-values,” The American Statistician, 71(2), 128-136.

Martin, R., and C. Liu (2013), “Inferential models: a framework for prior-free posterior probabilistic inference,” Journal of the American Statistical Association, 108, 301-313.

Xie, M., and K. Singh (2013), “Confidence distribution, the frequentist distribution estimator of a parameter: a review with discussion,” International Statistical Review, 81, 3-39.


Using neural networks to mix and match photographs

This post summarizes the material in Luan, Paris, Shechtman, and Bala’s article “Deep Photo Style Transfer.”

Almost every photo you see is edited in one way or another. Edges are retouched, and color levels are adjusted for aesthetics. These edits, however, aren’t usually globally conscious, i.e., they don’t take the entire picture into account when making pixel-level changes. An alternative to local retouching is to make the entire photo look more like the “abstract idea” embodied by another image. Combining a first picture (called the input picture) with the qualitative elements, like the color scheme and brightness, of another picture (called the reference picture) is called “style transfer,” and it can be accomplished using convolutional neural networks.¹

A team of researchers from Cornell University and Adobe recently improved on earlier techniques for style transfer in two important ways. First, the new method doesn’t change the geometry of the image. Previous methods would take the straight edge of a building in the input image and, in the process of applying new colors, twist it in unnatural ways. Second, they manage to maintain “semantic accuracy”; that is, while the style transfer “knows” about the entire image, styles don’t bleed across important visual barriers. In other words, the reference image’s sky doesn’t affect the input image’s houses.

To appreciate the authors’ contribution, we first need to understand the earlier “neural style” algorithm it builds on. Suppose we have an input image and we want to apply the style of a reference image to create an output image. The usual objective function for style transfer minimizes a linear combination of two terms. The first term keeps the output image similar to the input image in content: it is the sum of the squared element-wise differences between the feature maps (a convolutional network’s internal representations) of the input and output images. The second term keeps the output image similar to the reference image in style: it is the sum of the squared element-wise differences between the Gram matrices of the output and reference images’ feature maps. The second term is also multiplied by a constant that controls the trade-off between content (the first term) and style (the second term).
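The two-term objective can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the feature maps have already been extracted from some convolutional layer as arrays of shape (channels, height, width); real implementations use several layers of a pretrained network, add normalization constants, and optimize the output image by gradient descent, all of which are omitted here. The default `style_weight` is an arbitrary placeholder.

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a feature map with shape (channels, height, width).

    Entry (i, j) is the inner product of channels i and j over all spatial
    positions: it records which features co-occur, but not where.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T

def content_loss(out_feats, in_feats):
    """Sum of squared element-wise differences between feature maps."""
    return float(np.sum((out_feats - in_feats) ** 2))

def style_loss(out_feats, ref_feats):
    """Sum of squared element-wise differences between Gram matrices."""
    return float(np.sum((gram_matrix(out_feats) - gram_matrix(ref_feats)) ** 2))

def total_loss(out_feats, in_feats, ref_feats, style_weight=1e-3):
    """Linear combination: content term plus a weighted style term."""
    return content_loss(out_feats, in_feats) + style_weight * style_loss(out_feats, ref_feats)
```

Because the Gram matrix sums over positions, shuffling the pixels of a feature map leaves its style term unchanged while the content term grows. That is also a hint of why this objective alone can distort geometry.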

The issue with that method is that it creates output images that aren’t “photorealistic.” Those images have distortions, both in terms of shape and colors bleeding into areas they shouldn’t. To get around this, the authors note that “the input is already photorealistic.” By making sure the input image doesn’t change too much, they can exploit that photorealism.

The authors change the objective function by adding one term that makes sure the changes between the input and output images are locally affine in color space. They also modify the term that keeps the output image similar to the reference image. One of their concerns is that the style of the sky in the reference image will affect the style of, for example, the houses in the output image. To get around that, they use another neural network to label regions of the input and reference images with descriptions like lake, sky, house, and tree. The style term then compares only similarly labeled regions of the output and reference images. Each term gets a constant that controls how much it affects the output image (style vs. content vs. local transformation).
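Both modifications can be illustrated with a simplified NumPy sketch. It makes several assumptions: the segmentation labels are given as integer maps (the separate labeling network is not modeled), the per-region Gram matrices are normalized by region size (our choice), and the "locally affine" idea is shown as a per-patch least-squares fit. The paper's actual photorealism term is built from the Matting Laplacian of the input image, which this sketch does not implement.

```python
import numpy as np

def masked_gram(features, mask):
    """Gram matrix over only the positions where `mask` is True,
    normalized by the number of such positions (our choice, so that
    regions of different sizes are comparable).

    features: array of shape (channels, height, width)
    mask: boolean array of shape (height, width)
    """
    channels = features.shape[0]
    selected = features.reshape(channels, -1)[:, mask.reshape(-1)]
    return (selected @ selected.T) / max(selected.shape[1], 1)

def segmented_style_loss(out_feats, out_labels, ref_feats, ref_labels):
    """Style term that only compares regions sharing a label.

    out_labels / ref_labels are integer label maps (e.g. 0 = house, 1 = sky),
    assumed to come from a separate segmentation network. Labels present in
    only one of the two images are skipped in this sketch.
    """
    shared = np.intersect1d(np.unique(out_labels), np.unique(ref_labels))
    loss = 0.0
    for label in shared:
        g_out = masked_gram(out_feats, out_labels == label)
        g_ref = masked_gram(ref_feats, ref_labels == label)
        loss += float(np.sum((g_out - g_ref) ** 2))
    return loss

def local_affine_residual(inp_patch, out_patch):
    """Least-squares error of explaining a patch's new colors as an affine
    function of its old colors, out ≈ inp @ A + b.

    inp_patch, out_patch: arrays of shape (num_pixels, 3) holding RGB values.
    A residual of zero means the patch was recolored by an exactly affine map.
    """
    design = np.hstack([inp_patch, np.ones((inp_patch.shape[0], 1))])
    coeffs, *_ = np.linalg.lstsq(design, out_patch, rcond=None)
    return float(np.sum((design @ coeffs - out_patch) ** 2))
```

With this segmented term, changing the reference image's sky region leaves the loss for an output that contains only houses untouched. An affine recoloring of a patch costs nothing under the residual, while a nonlinear one does.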

In the images above, the left-most column is the input image, the center column is the style image and the right column is the output image. The results are pretty staggering, and the paper itself contains comparisons with older methods that make their improvements obvious. The deep blacks present in the folds of the rose transfer flawlessly, something older algorithms couldn’t manage to pull off.

This isn’t a post, even a little, about ethics, but it seems important to keep the broader context in mind when thinking about methods that convincingly blur the line between what is real and what isn’t. In a world where the internet can’t agree whether a dress is blue-and-black or white-and-gold, even without editing, the way powerful algorithms are changing the meaning of the word “photorealistic” can be unnerving. Deep photo style transfer is incredible, but as the field improves, social guidelines about content fidelity need to improve too if we are to have any chance of telling output and input images apart.


1. At a very high level, convolutional neural networks are neural networks with particular rules on how neurons in one layer affect neurons in the next layer (these rules apply to the convolutional layers). There are two rules. First, a neuron can only affect nearby neurons in the next layer; second, the same weights are used to compute that effect at every position. These rules are typically called “locally connected” and “shared weights.” A convolutional neural network is then just a neural network where most of the layers are locally connected and have shared weights.
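The two rules are easiest to see in one dimension. The sketch below is a simplification: a single-channel 1D layer with no padding, bias, or nonlinearity, whereas image networks use 2D kernels over many channels. It computes the layer directly and also writes it out as an ordinary fully connected weight matrix.

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """1D convolutional layer (cross-correlation, no padding, one channel)."""
    k = len(kernel)
    out = np.empty(len(signal) - k + 1)
    for i in range(len(out)):
        # Locally connected: output i only sees inputs i .. i+k-1.
        # Shared weights: the same kernel is reused at every position i.
        out[i] = np.dot(signal[i:i + k], kernel)
    return out

def as_dense_matrix(kernel, n_inputs):
    """The same layer written as a fully connected weight matrix."""
    k = len(kernel)
    W = np.zeros((n_inputs - k + 1, n_inputs))
    for i in range(W.shape[0]):
        W[i, i:i + k] = kernel  # every row holds the same weights, shifted
    return W
```

Printing `as_dense_matrix(np.array([1.0, 0.0, -1.0]), 6)` shows both rules at once: each row is mostly zeros (local connectivity), and every row repeats the same three weights (weight sharing).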

What does probability mean anyway?

This post gives a high-level summary of Philip Stark and David Freedman’s article “What is the chance of an earthquake?”

If I tell you an event has a 70% chance of happening, what does that mean? Does your interpretation differ depending on the event? When we learn probability, we typically think about flipping coins, rolling dice, picking cards, or pulling balls out of urns. We also connect probability to more complicated phenomena such as the chance of rain tomorrow. But does our probabilistic intuition hold up to even more complicated events? For example, how do we interpret the following statement?
“What is the chance that an earthquake of magnitude 6.7 or greater will occur before the year 2030 in the San Francisco Bay Area? The U.S. Geological Survey estimated the chance to be 0.7 \pm 0.1 (USGS, 1999).”

(USGS (1999), Working Group on California Earthquake Probabilities, “Earthquake Probabilities in the San Francisco Bay Region: 2000-2030, A Summary of Findings,” Open-File Report 99-517, USGS, Menlo Park, CA.)

Philip Stark and David Freedman use this example to walk us through the interpretation of the point estimate (0.7), the uncertainty estimate (\pm 0.1), and what probability (“chance”) means in this context.

The first interpretation of probability we usually encounter comes from games of chance, which give us the symmetry, or equally likely outcomes, interpretation. Consider a fair coin toss. Heads and tails must be equally likely, so the probability of getting tails must be 1/2. The same reasoning applies to a fair die: there are now six equally likely outcomes, so the probability of rolling a two is 1/6. If we try to apply this interpretation to the earthquake problem, we find no natural symmetry to exploit, so we must turn to other interpretations.
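The equally-likely-outcomes interpretation amounts to counting, as in this small sketch (the helper name is ours). Exact fractions avoid floating-point noise.

```python
from fractions import Fraction

def classical_probability(outcomes, event):
    """Equally-likely-outcomes probability:
    (number of outcomes in the event) / (total number of outcomes)."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

coin = ["heads", "tails"]
die = [1, 2, 3, 4, 5, 6]
print(classical_probability(coin, lambda o: o == "tails"))  # 1/2
print(classical_probability(die, lambda o: o == 2))         # 1/6
```

The function only works once someone supplies a list of symmetric outcomes, and that list is exactly what the earthquake problem fails to provide.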

Another natural interpretation comes from imagining many repetitions of an experiment and taking the proportion of times a given outcome occurs as the probability of that outcome. This is the frequentist approach. In the earthquake case, it seems nonsensical to imagine repeating the years until 2030 over and over again to estimate the probability of an earthquake. Again, we must look for another interpretation to make sense of the earthquake probability statement.
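The frequentist reading can be sketched by simulating repeated coin tosses and watching the proportion of heads settle down as the number of trials grows. Here `p_true` is a hypothetical value we set inside the simulation. The earthquake problem offers no analogous repeatable experiment.

```python
import random

def long_run_proportion(n_trials, p_true=0.5, seed=0):
    """Frequentist reading: repeat the experiment n_trials times and report
    the fraction of trials in which the outcome occurred."""
    rng = random.Random(seed)
    successes = sum(rng.random() < p_true for _ in range(n_trials))
    return successes / n_trials

for n in (10, 1_000, 100_000):
    print(n, long_run_proportion(n))
```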

An alternative to the frequentist approach is the Bayesian approach: instead of imagining many replications of a scenario, we interpret probability as a degree of belief in a certain outcome. If we believe something is impossible, it has probability zero; if we are fully confident an event will take place, it has probability one. In the earthquake scenario, however, we want an inherent probability, not a summary of opinions about how likely an earthquake is, even if these are the opinions of experts.

Another common interpretation is the Principle of Insufficient Reason, which states: “If there is no reason to believe that outcomes are not equally likely, take them to be equally likely.” To apply this principle, we must first define the set of possible outcomes. In the earthquake example, there are infinitely many time points between now and 2030 at which an earthquake could occur, and how we group these time points into outcomes affects the resulting probabilities. It seems strange for the probabilities to depend on how we chose to define the outcomes rather than on some fundamental property of earthquakes. This interpretation is also found wanting.
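The dependence on how outcomes are defined is easy to demonstrate. The sketch below applies the principle to two hypothetical descriptions of the same question: one with just two outcomes, and one that splits "an earthquake happens" into the year (2000 through 2029) in which the first one strikes.

```python
from fractions import Fraction

def insufficient_reason(outcomes, event):
    """Principle of Insufficient Reason: treat every listed outcome as
    equally likely, then count the outcomes in the event."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

def quake_happens(outcome):
    return not outcome.startswith("no quake")

# Description 1 (hypothetical): just two outcomes.
coarse = ["quake before 2030", "no quake before 2030"]

# Description 2 (hypothetical): the year of the first quake, or no quake at all.
by_year = [f"first quake in {year}" for year in range(2000, 2030)]
by_year.append("no quake before 2030")

p_coarse = insufficient_reason(coarse, quake_happens)
p_by_year = insufficient_reason(by_year, quake_happens)
print(p_coarse, p_by_year)  # 1/2 30/31
```

Nothing about earthquakes changed between the two calls; only our bookkeeping did, yet the "probability" moved from 1/2 to 30/31.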

Moving toward a more theoretical framework, probabilities can also be interpreted formally as mathematical probability. In this case probabilities must follow a certain set of rules. They must be non-negative (what would a negative probability mean?). The probabilities of all possible outcomes must sum to one (something must happen, even if that something is a “nothing happened” result). If outcomes cannot happen at the same time, then the probability that at least one of them happens is the sum of their individual probabilities. This formalism helps in the earthquake example, but we still need more structure than these rules provide.
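The rules can be written down for a distribution on a finite set of outcomes. This is a small sketch with helper names of our own choosing. Note that the rules only constrain which numbers are allowed; they say nothing about which numbers to assign to earthquakes.

```python
from fractions import Fraction

def is_valid_distribution(prob):
    """Check two of the rules: every probability is non-negative and the
    probabilities of all outcomes sum to one."""
    return all(p >= 0 for p in prob.values()) and sum(prob.values()) == 1

def P(prob, event):
    """Probability of an event (a set of outcomes) as the sum over its
    outcomes; defining P this way makes it additive over disjoint events."""
    return sum((prob[o] for o in event), Fraction(0))

fair_die = {face: Fraction(1, 6) for face in range(1, 7)}
negative = {"a": Fraction(1, 2), "b": Fraction(-1, 4), "c": Fraction(3, 4)}

print(is_valid_distribution(fair_die))  # True
print(is_valid_distribution(negative))  # False: "b" is negative, even though the total is 1
print(P(fair_die, {1, 2}) == P(fair_die, {1}) + P(fair_die, {2}))  # True
```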

Probability models build on the rules of mathematical probability, adding structure and letting us interpret probability as, in the words of Stark and Freedman, “just a property of a mathematical model intended to describe some features of the natural world.” We need to ensure that our model “matches” the phenomenon it aims to describe. Matching can be defined in frequentist terms: if we simulate from the proposed probability model many times, does the proportion of simulations in which the event occurs match the “true” probability of the event? For the earthquake example, we could design a model of earthquake behavior and use it to compute the probability of an occurrence. However, because large earthquakes are few and far between, we have little data on which to build and test the model, and so little assurance that its predictions match what might happen in reality.
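As a toy illustration of a probability model, suppose (purely hypothetically, and much more simply than anything the USGS used) that large Bay Area earthquakes arrive as a Poisson process with a fixed rate per year. The model then implies P(at least one quake in T years) = 1 - exp(-rate * T), and we can check that simulating from the model reproduces that number. The rate below is invented for illustration and is not fitted to any data.

```python
import math
import random

def model_probability(rate, years):
    """P(at least one event within `years`) for a Poisson process with `rate` per year."""
    return 1.0 - math.exp(-rate * years)

def simulated_probability(rate, years, n_sims=100_000, seed=0):
    """Simulate many histories from the model and count how often the
    first event (exponential waiting time) arrives before the horizon."""
    rng = random.Random(seed)
    hits = sum(rng.expovariate(rate) < years for _ in range(n_sims))
    return hits / n_sims

rate, years = 0.04, 30  # hypothetical values, not fitted to any data
print(model_probability(rate, years))
print(simulated_probability(rate, years))
```

The agreement between the two numbers only shows that the simulation matches the model. Whether the model matches real earthquakes is the harder question, and with so few large earthquakes on record there is little data to answer it.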

With these interpretations in mind, let’s revisit the USGS earthquake forecast. It seems the USGS estimate must belong to the probability model category: a mathematical model was built, and under that model the probability of an earthquake occurring was estimated to be 0.7.

An important feature of the probability model interpretation is that the proposed model must match well with the phenomenon it aims to illustrate in order to be meaningful and useful. If we are unsure about how well our model matches the truth (which is most often the case), we must be sure to communicate this uncertainty.

What does the uncertainty estimate mean in the earthquake case? Here, the USGS simulated many times from the probability model it built for the earthquake process (like the multiple realities imagined in the frequentist approach), and the 0.1 represents the variability in the outcomes of the simulations used to arrive at the point estimate of 0.7.
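One plausible way a spread like \pm 0.1 can arise is by repeatedly drawing an uncertain input parameter, rerunning the model, and summarizing the resulting probabilities. The sketch below reuses a hypothetical Poisson-process model with its rate drawn uniformly from an invented range. It is not the USGS procedure, and it captures only the parameter uncertainty built into the simulation, not uncertainty about whether the model itself is right.

```python
import math
import random
import statistics

def probability_spread(rate_low, rate_high, years=30, n_draws=10_000, seed=0):
    """Propagate uncertainty in one input parameter through a toy model.

    Each draw picks a rate uniformly between rate_low and rate_high (a
    hypothetical choice) and computes the Poisson-model probability of at
    least one event within `years`; the draws are summarized by their mean
    and standard deviation.
    """
    rng = random.Random(seed)
    probs = [1.0 - math.exp(-rng.uniform(rate_low, rate_high) * years)
             for _ in range(n_draws)]
    return statistics.mean(probs), statistics.stdev(probs)

mean_p, sd_p = probability_spread(0.02, 0.06)
print(f"{mean_p:.2f} +/- {sd_p:.2f}")
```

Fixing the rate (setting `rate_low` equal to `rate_high`) collapses the spread to zero, which shows that the reported uncertainty is only as broad as the sources we choose to simulate.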

However, Freedman and Stark point out that many more sources of uncertainty should be incorporated. The model itself is an imperfect representation of the true earthquake-generating process. This particular model is built from smaller models of various geologic features, each of which may be imperfect as well, and some of these sub-models require input parameters that are themselves uncertain. Realistically, the uncertainty should be much larger than the \pm 0.1 reported by the USGS.

The big take-aways from this thought exercise are that some convenient probability interpretations cannot be used to reason about certain complex, chance events, and any probabilities ascertained from the probability model interpretation should include the model building itself as a source of uncertainty.

Want to learn more? Read the details in the original paper.

Thoughts? Questions? Feedback? Let me know: @sastoudt