faststats Package


distrib Module

Class Distribution

This class implements a distribution object defined by its pdf (probability density function).

Interestingly, I could not find in numpy/scipy a class that could implement a distribution just from its pdf. The idea of such an object is to make it possible to compute statistics of this distribution without any pain.

This class also implements basic operations such as +, -, *, and /, with scalars or other distributions, which comes in handy when doing probability calculations. Normalization is left to the user's discretion but can be done quickly with the normalize() method.

Note that operations with scalars require the distribution to be written on the left side of the operator.

Implementation notes:

  • the pdf is given by a linear interpolation of the samples,
  • the pdf’s norm is given by a scipy.integrate.simps integration (fast and robust)
  • the cdf is given by the linear interpolation of the cumulative sum of the pdf samples.
  • percentiles are calculated directly by bracketing the cdf and from linear interpolations
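
The scheme in the implementation notes can be sketched with plain numpy (this is an illustration of the approach, not the faststats code; the grid and the unnormalized Gaussian pdf are made-up):

```python
import numpy as np

# Sketch: a pdf sampled on a grid, the cdf from a cumulative sum, and
# percentiles obtained by interpolating the inverted cdf.
x = np.linspace(-5, 5, 1001)
pdf = np.exp(-0.5 * x ** 2)        # unnormalized Gaussian samples

cdf = np.cumsum(pdf)
cdf /= cdf[-1]                     # normalize so the cdf ends at 1

median = np.interp(0.5, cdf, x)    # ppf(0.5) via inverse interpolation
p975 = np.interp(0.975, cdf, x)    # near 1.96 for a standard Gaussian
```

The cumulative-sum cdf is accurate to roughly one grid step, which is why the class documents the pdf and cdf as linear interpolations of the samples.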
class faststats.distrib.Distribution(x, pdf, name=None, *args, **kwargs)[source]

Bases: object

cdf(x)[source]

Cumulative distribution function

isf(x)[source]

Inverse survival function (Complementary CDF inverse)

kurtosis[source]
mean[source]
moment(order, reduced=False)[source]

Non-central moments

normalize()[source]

Normalize the sampled pdf by its norm

pdf(x)[source]

Probability density function

ppf(x)[source]

Percentile point function (i.e. CDF inverse)

rvs(N)[source]

Random samples

sf(x)[source]

Survival function = complementary CDF

skew[source]
std[source]
variance[source]
faststats.distrib.main()[source]

Test case: combining 4 experimental measurements

Let’s have 4 independent measurements of the same quantity with Gaussian uncertainties. The measurements are samples of a given Gaussian distribution whose mean and dispersion we want to estimate.

A quick Bayesian inference (with uniform priors) shows that if all measurements are drawn from the same distribution, the product of the 4 posterior distributions gives you the underlying Gaussian parameters.

if mu = {mk} and sig = {sk} for k = 1..N:
p(m, s | mu, sig) ∝ p(m, s) prod_k p(mk, sk | m, s)

We also find that the product of Gaussians is again a Gaussian.
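
A minimal numerical sketch of this test case (the measurement values are made up, and this does not call faststats): multiplying N Gaussian likelihoods is equivalent to inverse-variance weighting of the means.

```python
import numpy as np

# Four measurements mk with Gaussian uncertainties sk (made-up numbers).
mk = np.array([1.0, 1.2, 0.9, 1.1])
sk = np.array([0.2, 0.3, 0.25, 0.2])

# The product of the four Gaussian posteriors (uniform priors) is a Gaussian
# with inverse-variance-weighted mean and a combined dispersion smaller than
# any single measurement's uncertainty.
w = 1.0 / sk ** 2
m = np.sum(w * mk) / np.sum(w)     # combined mean
s = 1.0 / np.sqrt(np.sum(w))       # combined dispersion
```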

fastkde Module

faststats.fastkde.fastkde(x, y, gridsize=(200, 200), extents=None, nocorrelation=False, weights=None, adjust=1.0)[source]

An FFT-based Gaussian kernel density estimate (KDE) for computing the KDE on a regular grid.

Note that this is a different use case than scipy’s original scipy.stats.kde.gaussian_kde.

Performs a Gaussian kernel density estimate over a regular grid by convolving a Gaussian kernel with a 2D histogram of the data.

It computes the sparse bi-dimensional histogram of two data samples, where x and y are 1-D sequences of the same length. If weights is None (default), this is a histogram of the number of occurrences of the observations at (x[i], y[i]). The histogramming step is a faster implementation than numpy.histogram, as it avoids intermediate copies and excessive memory usage.

This function is typically several orders of magnitude faster than scipy.stats.kde.gaussian_kde. For large numbers of points (>1e7), it produces an essentially identical result.

Boundary conditions on the data are corrected by using a symmetric/reflection condition, so the limits of the dataset do not affect the pdf estimate.

Parameters

x, y: ndarray[ndim=1]
The x-coords and y-coords of the input data points, respectively
gridsize: tuple
A (nx, ny) tuple giving the size of the output grid (default: 200x200)
extents: (xmin, xmax, ymin, ymax) tuple
Extents of the output grid (default: extent of the input data)
nocorrelation: bool
If True, the correlation between the x and y coords will be ignored when performing the KDE (default: False)
weights: ndarray[ndim=1]
An array of the same shape as x & y that weights each sample (x_i, y_i) by each value in weights (w_i). Defaults to an array of ones the same size as x & y (default: None)
adjust: float
An adjustment factor for the bandwidth. Bandwidth becomes bw * adjust.

Returns

g: ndarray[ndim=2]
A gridded 2D kernel density estimate of the input points
e: (xmin, xmax, ymin, ymax) tuple
Extents of g
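
The convolution idea behind fastkde can be sketched with numpy and scipy (an illustration only, not the faststats implementation: the real function also estimates the bandwidth from the data, handles covariance, and applies boundary reflection; the sigma below is a made-up stand-in):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 10_000)
y = rng.normal(0, 1, 10_000)

# Bin the points on a regular grid, then smooth the counts with a Gaussian
# kernel; sigma is given in grid cells.
hist, xedges, yedges = np.histogram2d(x, y, bins=200)
g = gaussian_filter(hist, sigma=5)
g /= g.sum()                        # normalize to a discrete pdf on the grid
```

Because the smoothing is a single (separable or FFT-based) convolution over a fixed grid, the cost no longer scales with the number of evaluation points times the number of samples, which is where the speed-up over gaussian_kde comes from.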
faststats.fastkde.fastkde1D(xin, gridsize=200, extents=None, weights=None, adjust=1.0)[source]

An FFT-based Gaussian kernel density estimate (KDE) for computing the KDE on a regular grid.

Note that this is a different use case than scipy’s original scipy.stats.kde.gaussian_kde.

Performs a Gaussian kernel density estimate over a regular grid by convolving a Gaussian kernel with a 1D histogram of the data.

It computes the sparse one-dimensional histogram of the data sample, where xin is a 1-D sequence. If weights is None (default), this is a histogram of the number of occurrences of the observations at xin[i]. The histogramming step is a faster implementation than numpy.histogram, as it avoids intermediate copies and excessive memory usage.

This function is typically several orders of magnitude faster than scipy.stats.kde.gaussian_kde. For large numbers of points (>1e7), it produces an essentially identical result.

Example usage and timing

from scipy.stats import gaussian_kde
import numpy as np

def npkde(x, xe):
    kde = gaussian_kde(x)
    return kde.evaluate(xe)

x = np.random.normal(0, 1, int(1e6))
xe = np.linspace(x.min(), x.max(), 200)

%timeit fastkde1D(x)
10 loops, best of 3: 31.9 ms per loop

%timeit npkde(x, xe)
1 loops, best of 3: 11.8 s per loop

A speed-up of roughly 370x on these timings! However, gaussian_kde is not optimized for this application.

Boundary conditions on the data are corrected by using a symmetric/reflection condition, so the limits of the dataset do not affect the pdf estimate.

Parameters

xin: ndarray[ndim=1]
The input data points
gridsize: int
Size nx of the output grid (default: 200)
extents: (xmin, xmax) tuple
Extents of the output grid (default: extent of the input data)
weights: ndarray[ndim=1]
An array of the same shape as xin that weights each sample xin_i by w_i. Defaults to an array of ones the same size as xin (default: None)
adjust: float
An adjustment factor for the bandwidth. Bandwidth becomes bw * adjust.

Returns

g: ndarray[ndim=1]
A gridded 1D kernel density estimate of the input points
e: (xmin, xmax) tuple
Extents of g
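
The same convolution idea in 1D, again as a hedged sketch rather than the faststats implementation (the sigma, in bins, stands in for the bandwidth that fastkde1D would estimate):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

x = np.random.default_rng(1).normal(0, 1, 100_000)

# Histogram on a regular grid, then smooth with a 1D Gaussian kernel.
hist, edges = np.histogram(x, bins=200)
g = gaussian_filter1d(hist.astype(float), sigma=3)
g /= g.sum()                        # normalize to a discrete pdf on the grid
```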

hist Module

faststats.hist.bayesian_blocks(t)[source]

Bayesian Blocks Implementation

By Jake Vanderplas. License: BSD. Based on the algorithm outlined in http://adsabs.harvard.edu/abs/2012arXiv1207.5578S

Parameters

t: ndarray, length N
data to be histogrammed

Returns

bins: ndarray
array containing the (N+1) bin edges

This is an incomplete implementation: it may fail for some datasets. Alternate fitness functions and prior forms can be found in the paper listed above.

faststats.hist.fast_histogram2d(x, y, bins=10, weights=None, reduce_w=None, NULL=None, reinterp=None)[source]

Compute the sparse bi-dimensional histogram of two data samples, where x and y are 1-D sequences of the same length. If weights is None (default), this is a histogram of the number of occurrences of the observations at (x[i], y[i]).

If weights is specified, it gives values at the coordinates (x[i], y[i]). These values are accumulated for each bin and then reduced according to the reduce_w function, which defaults to numpy’s sum function (np.sum). (If weights is specified, it must also be a 1-D sequence of the same length as x and y.)

Parameters

x: ndarray[ndim=1]
first data sample coordinates
y: ndarray[ndim=1]
second data sample coordinates
bins: int or [int, int]
int, the number of bins for both dimensions (nx = ny = bins), or [int, int], the number of bins in each dimension (nx, ny = bins)
weights: ndarray[ndim=1]
values w_i weighing each sample (x_i, y_i), accumulated and reduced (using reduce_w) per bin
reduce_w: callable
function that will reduce the weight values accumulated per bin; defaults to numpy’s sum function (np.sum)
NULL: value type
filling value for missing data
reinterp: str
one of [None, 'nn', 'linear']; if set, reinterpolation is done using mlab.griddata to fill missing data within the convex polygon that encloses the data

Returns

B: ndarray[ndim=2]
bi-dimensional histogram
extent: tuple(4)
(xmin, xmax, ymin, ymax) extent of the histogram
steps: tuple(2)
(dx, dy) bin size in the x and y directions
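
The weighted-accumulation behaviour described above can be illustrated with numpy.histogram2d as a stand-in, with reduce_w fixed at its default, np.sum (the data points and weights are made up):

```python
import numpy as np

x = np.array([0.1, 0.2, 0.8, 0.9])
y = np.array([0.1, 0.9, 0.1, 0.9])
w = np.array([1.0, 2.0, 3.0, 4.0])

# Each of the 2x2 bins accumulates the w_i of the samples falling in it and
# reduces them with a sum -- the default reduce_w behaviour.
B, xedges, yedges = np.histogram2d(x, y, bins=2, weights=w)
# B == [[1., 2.], [3., 4.]]
```

fast_histogram2d generalizes this by letting reduce_w be any callable (e.g. np.max or np.median) applied to the weights collected in each bin.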
faststats.hist.optbins(data, method='freedman', ret='N')[source]

Determine the optimal binning of the data based on common estimators and return either the number of bins or the width to use.

Parameters

data: ndarray[ndim=1]
1d dataset to estimate from
method: str in {sturge, scott, freedman}
the method to use
ret: str
set to 'N' to return the number of bins/edges, or to 'W' to return the width

References
  • Sturges, H. A. (1926). “The choice of a class interval”. J. American Statistical Association, 65-66

  • Scott, David W. (1979). “On optimal and data-based histograms”. Biometrika, 66, 605-610

  • Freedman, D.; Diaconis, P. (1981). “On the histogram as a density estimator: L2 theory”. Zeitschrift fur Wahrscheinlichkeitstheorie und verwandte Gebiete, 57, 453-476

  • Scargle, J.D. et al. (2012). “Studies in Astronomical Time Series Analysis. VI. Bayesian Block Representations.”
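
As an illustration of the Freedman-Diaconis rule cited above (a sketch of the formula, not the optbins code): the bin width is 2·IQR/N^(1/3).

```python
import numpy as np

data = np.random.default_rng(2).normal(0, 1, 1000)

# Freedman-Diaconis: bin width from the interquartile range and sample size,
# then the bin count from the data range.
q75, q25 = np.percentile(data, [75, 25])
width = 2.0 * (q75 - q25) / len(data) ** (1.0 / 3.0)
nbins = int(np.ceil((data.max() - data.min()) / width))
```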
faststats.hist.quantiles(x, qlist=[2.5, 25, 50, 75, 97.5])[source]

Computes quantiles from an array.

Quantiles := points taken at regular intervals from the cumulative distribution function (CDF) of a random variable. Dividing ordered data into q essentially equal-sized subsets is the motivation for q-quantiles; the quantiles are the data values marking the boundaries between consecutive subsets.

The quantile at fraction 50% is called the median (50% of the distribution lies below it).

Inputs:
x - variable to evaluate from
qlist - quantile fractions to estimate (in %)
Outputs:
Returns a dictionary of the requested quantiles from the array
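
The same computation can be sketched with numpy.percentile, producing the dictionary-of-quantiles shape described above:

```python
import numpy as np

x = np.arange(1, 101)                     # the values 1..100
qlist = [2.5, 25, 50, 75, 97.5]

# quantile fractions are given in %, matching np.percentile's convention
result = {q: float(np.percentile(x, q)) for q in qlist}
# result[50] == 50.5 (the median)
```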

plot Module

Variations on boxplots.

faststats.plot.violinplot(data, ax=None, labels=None, positions=None, side='both', show_boxplot=True, plot_opts={})[source]

Make a violin plot of each dataset in the data sequence.

A violin plot is a boxplot combined with a kernel density estimate of the probability density function per point.

data : sequence of ndarrays
Data arrays, one array per value in positions.
ax : Matplotlib AxesSubplot instance, optional
If given, this subplot is used to plot in instead of a new figure being created.
labels : list of str, optional
Tick labels for the horizontal axis. If not given, integers 1..len(data) are used.
positions : array_like, optional
Position array, used as the horizontal axis of the plot. If not given, spacing of the violins will be equidistant.
side : {‘both’, ‘left’, ‘right’}, optional
How to plot the violin. Default is ‘both’. The ‘left’, ‘right’ options can be used to create asymmetric violin plots.
show_boxplot : bool, optional
Whether or not to show normal box plots on top of the violins. Default is True.
plot_opts : dict, optional

A dictionary with plotting options. Any of the following can be provided, if not present in plot_opts the defaults will be used:

- 'violin_fc', MPL color.  Fill color for violins.  Default is 'y'.
- 'violin_ec', MPL color.  Edge color for violins.  Default is 'k'.
- 'violin_lw', scalar.  Edge linewidth for violins.  Default is 1.
- 'violin_alpha', float.  Transparency of violins.  Default is 0.5.
- 'cutoff', bool.  If True, limit violin range to data range.
      Default is False.
- 'cutoff_val', scalar.  Where to cut off violins if `cutoff` is
      True.  Default is 1.5 standard deviations.
- 'cutoff_type', {'std', 'abs'}.  Whether cutoff value is absolute,
      or in standard deviations.  Default is 'std'.
- 'violin_width' : float.  Relative width of violins.  Max available
      space is 1, default is 0.8.
- 'label_fontsize', MPL fontsize.  Adjusts fontsize only if given.
- 'label_rotation', scalar.  Adjusts label rotation only if given.
      Specify in degrees.
fig : Matplotlib figure instance
If ax is None, the created figure. Otherwise the figure to which ax is connected.

See also:

beanplot : Bean plot, builds on violinplot.
matplotlib.pyplot.boxplot : Standard boxplot.

The appearance of violins can be customized with plot_opts. If customization of boxplot elements is required, set show_boxplot to False and plot it on top of the violins by calling the Matplotlib boxplot function directly. For example:

violinplot(data, ax=ax, show_boxplot=False)
ax.boxplot(data, sym='cv', whis=2.5)

It can happen that the axis labels or tick labels fall outside the plot area, especially with rotated labels on the horizontal axis. With Matplotlib 1.1 or higher, this can easily be fixed by calling fig.tight_layout(). With older Matplotlib versions, one has to use plt.rc or plt.rcParams to fix this, for example:

plt.rc('figure.subplot', bottom=0.25)
violinplot(data, ax=ax)

J.L. Hintze and R.D. Nelson, “Violin Plots: A Box Plot-Density Trace Synergism”, The American Statistician, Vol. 52, pp.181-84, 1998.
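
What a single violin is made of, per the description above, can be sketched directly with matplotlib and scipy (a hedged illustration, not the faststats.plot code; the colors, the 0.4 half-width, and the tick position are made-up values mimicking the listed defaults):

```python
import matplotlib
matplotlib.use("Agg")                     # headless backend for this sketch
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

data = np.random.default_rng(3).normal(0, 1, 500)

# KDE of the sample, mirrored around the tick position x=1, plus a boxplot
# drawn on top -- which is exactly "boxplot combined with a KDE".
kde = gaussian_kde(data)
grid = np.linspace(data.min(), data.max(), 100)
dens = kde(grid)
dens = 0.4 * dens / dens.max()            # scale to the violin half-width

fig, ax = plt.subplots()
ax.fill_betweenx(grid, 1 - dens, 1 + dens,
                 facecolor="y", edgecolor="k", alpha=0.5)
ax.boxplot([data], positions=[1])
```

Filling only one side of the mirrored curve (1 - dens or 1 + dens alone) corresponds to the side='left' / side='right' options.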

faststats.plot.beanplot(data, ax=None, labels=None, positions=None, side='both', jitter=False, plot_opts={})[source]

Make a bean plot of each dataset in the data sequence.

A bean plot is a combination of a violinplot (kernel density estimate of the probability density function per point) with a line-scatter plot of all individual data points.

data : sequence of ndarrays
Data arrays, one array per value in positions.
ax : Matplotlib AxesSubplot instance, optional
If given, this subplot is used to plot in instead of a new figure being created.
labels : list of str, optional
Tick labels for the horizontal axis. If not given, integers 1..len(data) are used.
positions : array_like, optional
Position array, used as the horizontal axis of the plot. If not given, spacing of the violins will be equidistant.
side : {‘both’, ‘left’, ‘right’}, optional
How to plot the violin. Default is ‘both’. The ‘left’, ‘right’ options can be used to create asymmetric violin plots.
jitter : bool, optional
If True, jitter markers within violin instead of plotting regular lines around the center. This can be useful if the data is very dense.
plot_opts : dict, optional

A dictionary with plotting options. All the options for violinplot can be specified, they will simply be passed to violinplot. Options specific to beanplot are:

  • ‘bean_color’, MPL color. Color of bean plot lines. Default is ‘k’. Also used for jitter marker edge color if jitter is True.

  • ‘bean_size’, scalar. Line length as a fraction of maximum length. Default is 0.5.

  • ‘bean_lw’, scalar. Linewidth, default is 0.5.

  • ‘bean_show_mean’, bool. If True (default), show mean as a line.

  • ‘bean_show_median’, bool. If True (default), show median as a marker.

  • ‘bean_mean_color’, MPL color. Color of mean line. Default is ‘b’.

  • ‘bean_mean_lw’, scalar. Linewidth of mean line, default is 2.

  • ‘bean_median_color’, MPL color. Color of median marker. Default is ‘r’.

  • ‘bean_median_marker’, MPL marker. Marker type, default is ‘+’.

  • ‘jitter_marker’, MPL marker. Marker type for jitter=True. Default is ‘o’.

  • ‘jitter_marker_size’, int. Marker size. Default is 4.

  • ‘jitter_fc’, MPL color. Jitter marker face color. Default is None.

  • ‘bean_legend_text’, str. If given, add a legend with given text.

fig : Matplotlib figure instance
If ax is None, the created figure. Otherwise the figure to which ax is connected.

See also:

violinplot : Violin plot, also used internally in beanplot.
matplotlib.pyplot.boxplot : Standard boxplot.

P. Kampstra, “Beanplot: A Boxplot Alternative for Visual Comparison of Distributions”, J. Stat. Soft., Vol. 28, pp. 1-9, 2008.

faststats.plot.kde_plot(x, ax=None, orientation='horizontal', cutoff=False, log=False, cutoff_type='std', cutoff_val=1.5, pos=100, pos_marker='line', pos_width=0.05, pos_kwargs={}, **kwargs)[source]