import pandas as pd
import numpy as np
from plotnine import (
ggplot,
aes,
after_stat,
geom_density,
geom_histogram,
geom_vline,
geom_rect,
labs,
annotate,
theme_tufte
)from plotnine.data import mpg
Density Plot
mpg.head()
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
1 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
2 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
3 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
4 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
The defaults are not exactly beautiful, but still quite clear.
='cty'))
(ggplot(mpg, aes(x+ geom_density()
)
<Figure Size: (640 x 480)>
Plotting multiple groups is straightforward, but as each group is plotted as an independent PDF summing to 1, the relative size of each group will be normalized.
='cty', color='drv', fill='drv'))
(ggplot(mpg, aes(x+ geom_density(alpha=0.1)
)
<Figure Size: (640 x 480)>
To plot multiple groups and scale them by their relative size, you can map the y
aesthetic to 'count'
(calculated by stat_density
).
='cty', color='drv', fill='drv'))
(ggplot(mpg, aes(x+ geom_density(aes(y=after_stat('count')), alpha=0.1)
)
<Figure Size: (640 x 480)>
Density Plot + Histogram
To overlay a histogram onto the density, the y
aesthetic of the density should be mapped to the 'count'
scaled by the binwidth
of the histograms.
Why?
The count
calculated by stat_density
is \(count = density * n\) where n
is the number of points . The density
curves have an area of 1 and have no information about the absolute frequency of the values along curve; only the relative frequencies. The count
curve reveals the absolute frequencies. The scale of this count
corresponds to the count
calculated by the stat_bin
for the histogram when the bins are 1 unit wide i.e. binwidth=1
. The count * binwidth
curve matches the scale of count
s for the histogram for a give binwidth
.
= 2 # The same for geom_density and geom_histogram
binwidth
='cty', color='drv', fill='drv'))
(ggplot(mpg, aes(x+ geom_density(aes(y=after_stat('count*binwidth')), alpha=0.1)
+ geom_histogram(aes(fill='drv', y=after_stat('count')), binwidth=binwidth, color='none', alpha=0.5)
# It is the histogram that gives us the meaningful y axis label
# i.e. 'count' and not 'count*2'
+ labs(y='count')
)
<Figure Size: (640 x 480)>
Shading a Region under a Density Curve
Extending geom_density to create an effect of a shaded region
Create some data and plot the density
= 101
n = pd.DataFrame({'x': np.arange(n)})
df
'x'))
(ggplot(df, aes(+ geom_density()
)
<Figure Size: (640 x 480)>
Suppose we want to mark a region as special e.g. (40, 60)
, we can use vertical lines to annotate
it.
= (40, 60)
region
'x'))
(ggplot(df, aes(+ geom_density()
+ annotate(geom_vline, xintercept=region) #new line
)
<Figure Size: (640 x 480)>
To make it standout more we can highlight. To do that, the first thought is to use a rectangle.
= (40, 60)
region
'x'))
(ggplot(df, aes(+ geom_density()
+ annotate(geom_rect, xmin=region[0], xmax=region[1], ymin=0, ymax=float('inf'), alpha=0.5) # new line
+ annotate(geom_vline, xintercept=region)
)
<Figure Size: (640 x 480)>
Since y
upper-bound varies along the curve, a rectangular highlight has to stretch up to the top of the panel.
To hightlight only within the density curve, we have to use a second density curve. We need to calculate the density as normal, but just before the curve & region are plotted, we should keep only the region we want.
We create our own geom_density_highlight
and override the setup_data
method. First, we override but do nothing, we only inspect the data to see what we have to work with.
# new class
class geom_density_highlight(geom_density):
def setup_data(self, data):
= super().setup_data(data)
data print(data)
return data
= (40, 60)
region
'x'))
(ggplot(df, aes(+ geom_density()
+ geom_density_highlight(fill='black', alpha=0.5) # new line
+ annotate(geom_vline, xintercept=region)
)
PANEL count density group n scaled x y
0 1 0.519038 0.005139 -1 101 0.519039 0.000000 0.005139 \
1 1 0.522757 0.005176 -1 101 0.522758 0.097752 0.005176
2 1 0.526473 0.005213 -1 101 0.526474 0.195503 0.005213
3 1 0.530187 0.005249 -1 101 0.530188 0.293255 0.005249
4 1 0.533899 0.005286 -1 101 0.533900 0.391007 0.005286
... ... ... ... ... ... ... ... ...
1019 1 0.533899 0.005286 -1 101 0.533900 99.608993 0.005286
1020 1 0.530187 0.005249 -1 101 0.530188 99.706745 0.005249
1021 1 0.526473 0.005213 -1 101 0.526474 99.804497 0.005213
1022 1 0.522757 0.005176 -1 101 0.522758 99.902248 0.005176
1023 1 0.519038 0.005139 -1 101 0.519039 100.000000 0.005139
ymin ymax
0 0 0.005139
1 0 0.005176
2 0 0.005213
3 0 0.005249
4 0 0.005286
... ... ...
1019 0 0.005286
1020 0 0.005249
1021 0 0.005213
1022 0 0.005176
1023 0 0.005139
[1024 rows x 10 columns]
<Figure Size: (640 x 480)>
The highlight has filled the whole region, but the printed data suggests that we can limit the rows to those where x
column is within our region.
class geom_density_highlight(geom_density):
# new method
def __init__(self, *args, region=(-np.inf, np.inf), **kwargs):
super().__init__(*args, **kwargs)
self.region = region
def setup_data(self, data):
= super().setup_data(data)
data = f'{self.region[0]} <= x <= {self.region[1]}' # new line
s = data.query(s).reset_index(drop=True) # new line
data return data
= (40, 60)
region
'x'))
(ggplot(df, aes(+ geom_density()
+ geom_density_highlight(region=region, fill='black', alpha=0.5) # modified line
+ annotate(geom_vline, xintercept=region)
)
<Figure Size: (640 x 480)>
That is it, but we can make it look better.
class geom_density_highlight(geom_density):
def __init__(self, *args, region=(-np.inf, np.inf), **kwargs):
super().__init__(*args, **kwargs)
self.region = region
def setup_data(self, data):
= super().setup_data(data)
data = f'{self.region[0]} <= x <= {self.region[1]}'
s = data.query(s).reset_index(drop=True)
data return data
= (40, 60)
region = '#029386'
teal
# Gallery Plot
'x'))
(ggplot(df, aes(+ geom_density_highlight(region=region, fill=teal+'88', color='none')
+ geom_density(fill=teal+'44', color=teal, size=.7)
+ annotate(geom_vline, xintercept=region, color=teal, size=.7)
+ theme_tufte()
)
<Figure Size: (640 x 480)>
This example was motivated by a question from github user Rishika-Ravindran.