It is widely known that darker beers tend to be brewed with waters of higher alkalinity and thus higher residual alkalinity (RA), to the point that some have decided that a beer of a particular color 'requires' a particular level of alkalinity. This conclusion is (or should be) based on an observed correlation between the colors of a set of real beers and the RA of the water from which they were brewed. In this note we'll have a quick look at the dangers of drawing conclusions from a cursory examination of correlation data. The warning applies to any kind of data, and one should be wary not only with respect to brewing but elsewhere, especially in politics.
Not only can correlation be deceiving, as we will illustrate here, but it is also absolutely essential to understand that correlation does not imply causation. You can't have causation without correlation, but you can have correlation without causation. It has been observed, for example, that drowning deaths are correlated with ice cream consumption among children. A politician who campaigns to ban the sale of ice cream to children on this basis is serving his own ends - not the public's. Correlations such as this one are caused by a common factor; in this trivial example, weather. When the weather is warmer, kids go swimming more often and they eat more ice cream.
Another famous example was based on the observation that highway deaths went down appreciably when the government lowered the speed limit during a past 'fuel crisis', thus 'proving' that lower speeds saved lives. But they also went down in Germany, where the speed limits were not lowered. The common factor here was that people drove less because fuel was expensive and hard to get.
When a valid correlation is observed, one can form the hypothesis that there is causation and test that hypothesis. If one does that with either beer color and water RA or ice cream and drowning, causation will be rejected.
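
One informal way to see how a common factor produces correlation without causation is to simulate it. The sketch below is only an illustration: the variable names, the coefficients and the noise levels are all invented, and the residual check is one simple technique among several, not a full causal test. Two quantities are each driven by temperature but never by each other; their raw correlation is substantial, yet it largely disappears once the straight-line effect of temperature is removed from both.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

n = 1000
# Hypothetical common factor: daily temperature (arbitrary units, invented).
temperature = rng.normal(25.0, 5.0, n)
# Both quantities depend on temperature plus independent noise.
# Neither depends on the other. All coefficients are made up.
ice_cream = 2.0 * temperature + rng.normal(0.0, 5.0, n)
swimming_accidents = 0.5 * temperature + rng.normal(0.0, 2.0, n)


def pearson_r(x, y):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def residuals(y, x):
    """What is left of y after subtracting a least-squares line fitted on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


r_raw = pearson_r(ice_cream, swimming_accidents)
r_partial = pearson_r(residuals(ice_cream, temperature),
                      residuals(swimming_accidents, temperature))

print(f"r between the two quantities:         {r_raw:.2f}")
print(f"r after removing temperature's effect: {r_partial:.2f}")
```

If ice cream really caused the accidents, some correlation would be expected to survive the removal of temperature; in this constructed example it does not.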
We'll illustrate with some simulated data, by which we mean data generated in a computer. Suppose a brewer has color and water RA data on 50 beers and that, when he plots one against the other, he sees a chart like this one:
Remember that this is not data on real beers. It is computer-simulated data intended to illustrate a concept. The correlation between beer color and water RA in this figure is quite apparent, as shown by the straight line that best fits the data, which indicates that an increase in water RA of 100 corresponds to an increase in beer color of 25. But the correlation is not very strong. Beers made with water of RA close to 100 could have colors ranging from 20 to over 70. For RA near 50, beers could have colors ranging from 5 to 30. It should be clear from this that, while there is correlation, the 'model' (beer color ~ 0.25*RA) doesn't model the actual color of the beer very well. The number r (Pearson's r) in the title box of the graph is a measure of the tightness of the fit, i.e. how well the model describes the data. If beer color were exactly 0.25 times water RA, the two variables would be said to be completely correlated and r would equal 1. The larger r is, the more tightly bunched about the fit line the data points fall. The value of r² = 0.35 in this data set (that is, r of about 0.59) indicates that only 35% of the variance in beer colors is attributable to a linear relationship between beer color and RA. Put in other words, while one can say that in general darker beers tend to come from more alkaline waters, RA is not a very good predictor of beer color.
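
For readers who want to compute these quantities themselves, here is a minimal sketch of the best-fit line, Pearson's r and r². The function name `fit_and_r` and the seven data points are invented for illustration; they are not the data behind the figure.

```python
import numpy as np


def fit_and_r(x, y):
    """Least-squares line y ~ slope*x + intercept, plus Pearson's r and r^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean
    dy = y - y.mean()
    sxy = (dx * dy).sum()
    sxx = (dx * dx).sum()
    syy = (dy * dy).sum()
    slope = sxy / sxx
    intercept = y.mean() - slope * x.mean()
    r = sxy / np.sqrt(sxx * syy)
    return slope, intercept, r, r * r


# Invented example values (NOT the data plotted in the figure).
ra = [20, 50, 80, 100, 130, 160, 200]
color = [12, 6, 35, 20, 55, 30, 60]

slope, intercept, r, r2 = fit_and_r(ra, color)
print(f"color ~ {slope:.3f}*RA + {intercept:.1f}")
print(f"r = {r:.2f}, r^2 = {r2:.2f}")

# A perfectly linear relationship gives r = 1, as described above.
_, _, r_perfect, _ = fit_and_r([0, 100, 200], [0, 25, 50])
print(f"r for color = 0.25*RA exactly: {r_perfect:.2f}")
```

The value of r² is the fraction of the variance in color that the straight line accounts for; the remainder is scatter about the line.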
As a final warning about drawing conclusions from correlation data, I note that the data in the figure were drawn from independent random number generators and so, in fact, are not correlated at all. I ran 40,000 experiments, in each of which 50 (RA, color) pairs were drawn from independent generators. The odds against any one of these showing the correlation of the figure above are about 40,000:1 (in other words, I picked the one that showed the greatest correlation), and most show much lower levels (91% exhibit -0.2 < r < 0.2). Thus, even when appreciable correlation is seen, we must accept that there is a probability, even though quite small, that the variables being compared are, in fact, uncorrelated.
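
An experiment of this kind can be sketched in a few lines. The note does not say which generators or distributions were used, so the standard normal draws and the seed below are assumptions, and the numbers this sketch prints should not be expected to reproduce the ones quoted above.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, for repeatability

n_experiments = 40_000
n_points = 50

# Assumption: both variables are standard normal and fully independent.
x = rng.standard_normal((n_experiments, n_points))
y = rng.standard_normal((n_experiments, n_points))

# Pearson's r for every experiment at once (one experiment per row).
dx = x - x.mean(axis=1, keepdims=True)
dy = y - y.mean(axis=1, keepdims=True)
r = (dx * dy).sum(axis=1) / np.sqrt((dx**2).sum(axis=1) * (dy**2).sum(axis=1))

best = int(np.argmax(r))  # the experiment one would 'cherry-pick'
frac_small = float(np.mean(np.abs(r) < 0.2))

print(f"largest r found: {r[best]:.2f} (experiment {best})")
print(f"fraction with -0.2 < r < 0.2: {frac_small:.3f}")
```

Selecting the single most correlated run out of tens of thousands is exactly the selection that makes independent data look related.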