Class Distribution
This class implements a distribution object defined by its pdf (probability density function).
Interestingly, I could not find a class in numpy/scipy that can build a distribution directly from its pdf. The idea of such an object is to be able to compute statistics of the distribution without any pain.
This class also implements basic operations (+ - / *) with scalars or other distributions, which comes in handy when doing probabilities. Normalization is left to the user but can be done quickly with the normalize() method.
Note that operations with scalars require writing the distribution on the left side of the operator.
Implementation notes:
- the pdf is given by a linear interpolation of the samples,
- the pdf’s norm is computed with scipy.integrate.simps (Simpson’s rule: fast and robust),
- the cdf is given by linear interpolation of the cumulative sum of the pdf samples,
- percentiles are computed directly by bracketing the cdf and linearly interpolating.
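A minimal sketch of such a class, following the implementation notes above. Names and method signatures are illustrative, not the module's actual API; scipy.integrate.simpson is the current name of the legacy scipy.integrate.simps.

```python
import numpy as np
from scipy.integrate import simpson  # "simps" in older scipy releases


class Distribution:
    """Sketch: a distribution defined by pdf samples on a grid x."""

    def __init__(self, x, pdf):
        self.x = np.asarray(x, dtype=float)
        self.pdf = np.asarray(pdf, dtype=float)

    def norm(self):
        # pdf norm via Simpson-rule integration of the samples
        return simpson(self.pdf, x=self.x)

    def normalize(self):
        return Distribution(self.x, self.pdf / self.norm())

    def __mul__(self, other):
        # product with a scalar or another distribution; the other
        # operators (+ - /) would follow the same pattern
        if isinstance(other, Distribution):
            o = np.interp(self.x, other.x, other.pdf, left=0.0, right=0.0)
            return Distribution(self.x, self.pdf * o)
        return Distribution(self.x, self.pdf * other)

    def _cdf_samples(self):
        # cdf from the cumulative sum of the pdf samples
        c = np.cumsum(self.pdf)
        return c / c[-1]

    def cdf(self, q):
        return np.interp(q, self.x, self._cdf_samples())

    def ppf(self, p):
        # percentiles by bracketing the cdf and linear interpolation
        return np.interp(p, self._cdf_samples(), self.x)
```

For example, multiplying a standard normal by itself and normalizing yields a narrower Gaussian, as expected for a product of Gaussians.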
Test case: combining 4 experimental measurements
Let’s have 4 independent measurements of the same quantity, each with Gaussian uncertainties. The measurements are samples of a single Gaussian distribution whose mean and dispersion we want to estimate.
A quick Bayesian inference (with uniform priors) shows that if all measurements come from the same distribution, the product of the 4 posterior distributions gives you the underlying Gaussian parameters.
We also recover the fact that a product of Gaussians is a Gaussian.
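A self-contained numerical check of this test case, with four made-up measurements (the values are illustrative): the gridded product of the four Gaussian posteriors matches the closed-form result, an inverse-variance weighted mean with combined sigma 1/sqrt(sum(1/sigma_i^2)).

```python
import numpy as np

# four hypothetical measurements (mean, sigma) of the same quantity
measurements = [(4.8, 0.5), (5.1, 0.3), (5.3, 0.4), (4.9, 0.6)]

x = np.linspace(3.0, 7.0, 4001)
dx = x[1] - x[0]

# product of the four Gaussian posteriors (uniform priors)
post = np.ones_like(x)
for mu, sig in measurements:
    post *= np.exp(-0.5 * ((x - mu) / sig) ** 2)
post /= post.sum() * dx  # normalize the product

# closed form: the product is a Gaussian with the
# inverse-variance weighted mean and combined sigma
w = np.array([1.0 / s ** 2 for _, s in measurements])
mu_comb = np.sum(w * np.array([m for m, _ in measurements])) / w.sum()
sig_comb = 1.0 / np.sqrt(w.sum())

# moments of the gridded product
mu_grid = np.sum(x * post) * dx
sig_grid = np.sqrt(np.sum((x - mu_grid) ** 2 * post) * dx)
```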
An FFT-based Gaussian kernel density estimate (KDE) for computing the KDE on a regular grid.
Note that this is a different use case than scipy’s original scipy.stats.kde.gaussian_kde.
Performs a Gaussian kernel density estimate over a regular grid using a convolution of the Gaussian kernel with a 2D histogram of the data.
It computes the sparse bi-dimensional histogram of two data samples, where x and y are 1-D sequences of the same length. If weights is None (default), this is a histogram of the number of occurrences of the observations at (x[i], y[i]). This sparse histogram is a faster implementation than numpy’s histogram functions, as it avoids intermediate copies and excessive memory usage.
This function is typically several orders of magnitude faster than scipy.stats.kde.gaussian_kde. For large (>1e7) numbers of points, it produces an essentially identical result.
Boundary conditions on the data are corrected by using a symmetric / reflection condition, so the limits of the dataset do not affect the pdf estimate.
- x, y: ndarray[ndim=1]
- The x-coords, y-coords of the input data points respectively
- gridsize: tuple
- A (nx,ny) tuple of the size of the output grid (default: 200x200)
- extents: (xmin, xmax, ymin, ymax) tuple
- tuple of the extents of output grid (default: extent of input data)
- nocorrelation: bool
- If True, the correlation between the x and y coords will be ignored when performing the KDE. (default: False)
- weights: ndarray[ndim=1]
- An array of the same shape as x & y that weights each sample (x_i, y_i) by each value in weights (w_i). Defaults to an array of ones the same size as x & y. (default: None)
- adjust: float
- An adjustment factor for the bandwidth. The bandwidth used is bw * adjust.
- g: ndarray[ndim=2]
- A gridded 2D kernel density estimate of the input points.
- e: (xmin, xmax, ymin, ymax) tuple
- Extents of g
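A simplified sketch of the grid-based approach described above (histogram, then Gaussian smoothing). This version uses scipy.ndimage.gaussian_filter with a diagonal Scott's-rule bandwidth instead of an explicit FFT convolution with the full (possibly correlated) kernel, and omits the nocorrelation/weights options; function and variable names are illustrative. mode='reflect' mirrors the symmetric boundary condition mentioned above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def fastkde2d(x, y, gridsize=(200, 200), adjust=1.0):
    """Sketch: 2D Gaussian KDE on a regular grid via a 2D histogram
    followed by Gaussian smoothing (diagonal bandwidth only)."""
    # histogram the data on the regular grid
    h, xe, ye = np.histogram2d(x, y, bins=gridsize)
    dx = xe[1] - xe[0]
    dy = ye[1] - ye[0]

    # Scott's rule bandwidth in data units, per axis
    n = len(x)
    bw_x = n ** (-1.0 / 6.0) * np.std(x) * adjust
    bw_y = n ** (-1.0 / 6.0) * np.std(y) * adjust

    # smooth with the Gaussian kernel expressed in pixel units;
    # 'reflect' gives the symmetric boundary condition
    g = gaussian_filter(h, sigma=(bw_x / dx, bw_y / dy), mode='reflect')
    g /= g.sum() * dx * dy  # normalize to a density

    return g, (xe[0], xe[-1], ye[0], ye[-1])
```

On a bivariate normal sample, the estimate integrates to one and peaks near the origin.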
An FFT-based Gaussian kernel density estimate (KDE) for computing the KDE on a regular grid, in one dimension.
Note that this is a different use case than scipy’s original scipy.stats.kde.gaussian_kde.
Performs a Gaussian kernel density estimate over a regular grid using a convolution of the Gaussian kernel with a 1D histogram of the data.
It computes the sparse one-dimensional histogram of the data sample, where xin is a 1-D sequence. If weights is None (default), this is a histogram of the number of occurrences of the observations xin[i]. This sparse histogram is a faster implementation than numpy.histogram, as it avoids intermediate copies and excessive memory usage.
This function is typically several orders of magnitude faster than scipy.stats.kde.gaussian_kde. For large (>1e7) numbers of points, it produces an essentially identical result.
Example usage and timing
from scipy.stats import gaussian_kde
def npkde(x, xe):
    kde = gaussian_kde(x)
    return kde.evaluate(xe)

x = np.random.normal(0, 1, int(1e6))

%timeit fastkde1D(x)
10 loops, best of 3: 31.9 ms per loop

%timeit npkde(x, xe)
1 loops, best of 3: 11.8 s per loop
That is roughly a 370x speed-up (11.8 s vs 31.9 ms)! To be fair, gaussian_kde is not optimized for this application.
Boundary conditions on the data are corrected by using a symmetric / reflection condition, so the limits of the dataset do not affect the pdf estimate.
- xin: ndarray[ndim=1]
- The x-coords of the input data points
- gridsize: int
- The number of points nx of the output grid (default: 200)
- extents: (xmin, xmax) tuple
- tuple of the extents of output grid (default: extent of input data)
- weights: ndarray[ndim=1]
- An array of the same shape as x that weights each sample x_i by w_i. Defaults to an array of ones the same size as x (default: None)
- adjust: float
- An adjustment factor for the bandwidth. The bandwidth used is bw * adjust.
- g: ndarray[ndim=1]
- A gridded 1D kernel density estimate of the input points.
- e: (xmin, xmax) tuple
- Extents of g
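The 1D variant can be sketched the same way: histogram on the grid, then smooth with a Gaussian kernel. This sketch uses scipy.ndimage.gaussian_filter1d with a Silverman's-rule bandwidth rather than an explicit FFT convolution; names are illustrative and do not match the module's actual function exactly.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d


def fastkde1d(xin, gridsize=200, extents=None, weights=None, adjust=1.0):
    """Sketch: 1D Gaussian KDE on a regular grid via a histogram
    followed by Gaussian smoothing."""
    xin = np.asarray(xin, dtype=float)
    if extents is None:
        extents = (xin.min(), xin.max())

    # histogram the data on the regular grid
    h, edges = np.histogram(xin, bins=gridsize, range=extents, weights=weights)
    dx = edges[1] - edges[0]

    # Silverman's rule bandwidth, converted to pixel units
    bw = 1.06 * np.std(xin) * len(xin) ** (-1.0 / 5.0) * adjust

    # 'reflect' gives the symmetric boundary condition
    g = gaussian_filter1d(h.astype(float), sigma=bw / dx, mode='reflect')
    g /= g.sum() * dx  # normalize to a density
    return g, extents
```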
Bayesian Blocks Implementation
By Jake Vanderplas. License: BSD. Based on the algorithm outlined in http://adsabs.harvard.edu/abs/2012arXiv1207.5578S
This is an incomplete implementation: it may fail for some datasets. Alternate fitness functions and prior forms can be found in the paper listed above.
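Since the text only points at the paper, here is a sketch of the dynamic-programming core for the event-data fitness function from Scargle et al. (2012), following Vanderplas' outline. The geometric prior uses eq. 21 of the paper with p0 = 0.05; this mirrors the standard implementation but, like the one described above, is incomplete (it assumes distinct event times and implements only one fitness function).

```python
import numpy as np


def bayesian_blocks(t):
    """Sketch: optimal block edges for event data (Scargle et al. 2012)."""
    t = np.sort(np.asarray(t, dtype=float))
    n = len(t)

    # candidate block edges: data ends plus midpoints between events
    edges = np.concatenate([t[:1], 0.5 * (t[1:] + t[:-1]), t[-1:]])
    block_length = t[-1] - edges

    # prior on the number of change points (eq. 21, p0 = 0.05)
    ncp_prior = 4 - np.log(73.53 * 0.05 * n ** -0.478)

    best = np.zeros(n)
    last = np.zeros(n, dtype=int)
    for k in range(n):
        # width and event count of each candidate final block ending at k
        width = block_length[: k + 1] - block_length[k + 1]
        count = np.arange(k + 1, 0, -1)
        fit = count * (np.log(count) - np.log(width)) - ncp_prior
        fit[1:] += best[:k]
        last[k] = np.argmax(fit)
        best[k] = fit[last[k]]

    # backtrack through the last-change-point array
    cp = []
    i = n
    while i > 0:
        cp.append(i)
        i = last[i - 1]
    cp.append(0)
    return edges[np.array(cp[::-1])]
```

On event data whose rate jumps, the recovered edges include a change point near the jump.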
Compute the sparse bi-dimensional histogram of two data samples, where x and y are 1-D sequences of the same length. If weights is None (default), this is a histogram of the number of occurrences of the observations at (x[i], y[i]).
If weights is specified, it gives values at the coordinates (x[i], y[i]); these values are accumulated for each bin and then reduced according to the reduce_w function, which defaults to numpy’s sum function (np.sum). (If weights is specified, it must also be a 1-D sequence of the same length as x and y.)
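One common way to get such a sparse 2D histogram without a dense intermediate is a scipy COO matrix, which sums duplicate (row, col) entries on conversion. The sketch below (illustrative names, equal-width bins over the data range) covers only the default np.sum reduction; other reduce_w functions would need explicit grouping.

```python
import numpy as np
from scipy.sparse import coo_matrix


def sparse_histogram2d(x, y, bins=10, weights=None):
    """Sketch: sparse 2D histogram via COO duplicate summation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if weights is None:
        weights = np.ones_like(x)

    # map each sample to its bin index (right edge goes to the last bin)
    xi = np.clip(np.digitize(x, np.linspace(x.min(), x.max(), bins + 1)) - 1,
                 0, bins - 1)
    yi = np.clip(np.digitize(y, np.linspace(y.min(), y.max(), bins + 1)) - 1,
                 0, bins - 1)

    # duplicate (xi, yi) pairs are summed when the COO matrix is converted
    return coo_matrix((weights, (xi, yi)), shape=(bins, bins)).tocsr()
```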
Determine the optimal binning of the data based on common estimators, returning either the number of bins or the width to use.
data: 1d dataset to estimate from
method: the method to use, a str in {sturge, scott, freedman}
ret: set to N to return the number of bins / edges; set to W to return the width
Sturges, H. A. (1926). “The choice of a class interval”. J. American Statistical Association, 21, 65-66
Scott, David W. (1979). “On optimal and data-based histograms”. Biometrika, 66, 605-610
Freedman, D. & Diaconis, P. (1981). “On the histogram as a density estimator: L2 theory”. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 57, 453-476
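The three estimators cited above can be sketched as follows (the function name and return conventions mirror the description, but the exact implementation is assumed):

```python
import numpy as np


def optbins(data, method='freedman', ret='N'):
    """Sketch: optimal histogram binning via common estimators."""
    data = np.asarray(data, dtype=float)
    n = data.size
    span = data.max() - data.min()

    if method == 'sturge':        # Sturges (1926): nbins = ceil(log2(n) + 1)
        nbins = int(np.ceil(np.log2(n) + 1))
        width = span / nbins
    elif method == 'scott':       # Scott (1979): width from the std deviation
        width = 3.49 * np.std(data) * n ** (-1.0 / 3.0)
        nbins = int(np.ceil(span / width))
    elif method == 'freedman':    # Freedman & Diaconis (1981): width from the IQR
        q75, q25 = np.percentile(data, [75, 25])
        width = 2.0 * (q75 - q25) * n ** (-1.0 / 3.0)
        nbins = int(np.ceil(span / width))
    else:
        raise ValueError("unknown method: %s" % method)

    return width if ret == 'W' else nbins
```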
Computes quantiles from an array.
Quantiles are points taken at regular intervals from the cumulative distribution function (CDF) of a random variable. Dividing ordered data into q essentially equal-sized data subsets is the motivation for q-quantiles; the quantiles are the data values marking the boundaries between consecutive subsets.
The quantile at fraction 0.5 is called the median (50% of the distribution lies below it).
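A quick illustration with numpy (np.percentile takes fractions in percent, so the 0.5 quantile is percentile 50):

```python
import numpy as np

data = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9.])

# the median is the 0.5 quantile, i.e. percentile 50
median = np.percentile(data, 50)

# 4-quantiles (quartiles) divide the ordered data into four subsets
quartiles = np.percentile(data, [25, 50, 75])
```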
Variations on boxplots.
Make a violin plot of each dataset in the data sequence.
A violin plot is a boxplot combined with a kernel density estimate of the probability density function per point.
A dictionary with plotting options. Any of the following can be provided, if not present in plot_opts the defaults will be used:
- 'violin_fc', MPL color. Fill color for violins. Default is 'y'.
- 'violin_ec', MPL color. Edge color for violins. Default is 'k'.
- 'violin_lw', scalar. Edge linewidth for violins. Default is 1.
- 'violin_alpha', float. Transparency of violins. Default is 0.5.
- 'cutoff', bool. If True, limit violin range to data range.
Default is False.
- 'cutoff_val', scalar. Where to cut off violins if `cutoff` is
True. Default is 1.5 standard deviations.
- 'cutoff_type', {'std', 'abs'}. Whether cutoff value is absolute,
or in standard deviations. Default is 'std'.
- 'violin_width', float. Relative width of violins. Max available
  space is 1, default is 0.8.
- 'label_fontsize', MPL fontsize. Adjusts fontsize only if given.
- 'label_rotation', scalar. Adjusts label rotation only if given.
Specify in degrees.
beanplot : Bean plot, builds on violinplot.
matplotlib.pyplot.boxplot : Standard boxplot.
The appearance of violins can be customized with plot_opts. If customization of boxplot elements is required, set show_boxplot to False and plot it on top of the violins by calling the Matplotlib boxplot function directly. For example:
violinplot(data, ax=ax, show_boxplot=False)
ax.boxplot(data, sym='cv', whis=2.5)
It can happen that the axis labels or tick labels fall outside the plot area, especially with rotated labels on the horizontal axis. With Matplotlib 1.1 or higher, this can easily be fixed by calling fig.tight_layout(). With older Matplotlib one has to use plt.rc or plt.rcParams to fix this, for example:
plt.rc('figure.subplot', bottom=0.25)
violinplot(data, ax=ax)
J.L. Hintze and R.D. Nelson, “Violin Plots: A Box Plot-Density Trace Synergism”, The American Statistician, Vol. 52, pp.181-84, 1998.
Make a bean plot of each dataset in the data sequence.
A bean plot is a combination of a violinplot (kernel density estimate of the probability density function per point) with a line-scatter plot of all individual data points.
A dictionary with plotting options. All the options for violinplot can be specified, they will simply be passed to violinplot. Options specific to beanplot are:
- 'bean_color', MPL color. Color of bean plot lines. Default is 'k'.
  Also used for jitter marker edge color if jitter is True.
- 'bean_size', scalar. Line length as a fraction of maximum length.
  Default is 0.5.
- 'bean_lw', scalar. Linewidth, default is 0.5.
- 'bean_show_mean', bool. If True (default), show mean as a line.
- 'bean_show_median', bool. If True (default), show median as a
  marker.
- 'bean_mean_color', MPL color. Color of mean line. Default is 'b'.
- 'bean_mean_lw', scalar. Linewidth of mean line, default is 2.
- 'bean_median_color', MPL color. Color of median marker. Default
  is 'r'.
- 'bean_median_marker', MPL marker. Marker type, default is '+'.
- 'jitter_marker', MPL marker. Marker type for jitter=True.
  Default is 'o'.
- 'jitter_marker_size', int. Marker size. Default is 4.
- 'jitter_fc', MPL color. Jitter marker face color. Default is None.
- 'bean_legend_text', str. If given, add a legend with given text.
violinplot : Violin plot, also used internally in beanplot.
matplotlib.pyplot.boxplot : Standard boxplot.
P. Kampstra, “Beanplot: A Boxplot Alternative for Visual Comparison of Distributions”, J. Stat. Soft., Vol. 28, pp. 1-9, 2008.
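The idea of a bean plot can be sketched with Matplotlib's built-in ax.violinplot rather than the beanplot function documented above (so none of the bean_* options apply here): a violin per dataset, overlaid with a short horizontal line per individual data point. The data and file name are illustrative.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for this sketch
import matplotlib.pyplot as plt

np.random.seed(0)
data = [np.random.normal(0.0, s, 50) for s in (1.0, 0.5)]

fig, ax = plt.subplots()
# violin part: kernel density estimate of each dataset
ax.violinplot(data, positions=[1, 2], showmeans=True, showextrema=False)
# bean part: a short horizontal line per individual data point
for pos, d in zip([1, 2], data):
    ax.plot(np.full(d.shape, pos), d, marker='_', linestyle='none',
            color='k', markersize=10)
fig.savefig('beanplot_sketch.png')
```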