Class Distribution
This class implements a distribution object defined by its pdf (probability density function).
Interestingly, I could not find a class in numpy/scipy that can build a distribution directly from its pdf. The idea of such an object is to be able to compute statistics of the distribution without any pain.
This class also implements basic operations (+ - / *) with scalars or other distributions, which comes in handy when doing probabilities. Normalization is left to the user but can be done quickly with the normalize() method.
Note that operations with scalars require writing the distribution on the left side of the operator.
Implementation notes:
- the pdf is given by a linear interpolation of the samples,
- the pdf’s norm is computed with scipy.integrate.simps (Simpson’s rule: fast and robust),
- the cdf is given by linear interpolation of the cumulative sum of the pdf samples,
- percentiles are computed directly by bracketing the cdf and linearly interpolating.
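A minimal sketch of such a class, following the implementation notes above. Names and method signatures are illustrative, not the module's actual API; scipy.integrate.simpson is the current name of the legacy scipy.integrate.simps.

```python
import numpy as np
from scipy.integrate import simpson  # "simps" in older scipy releases


class Distribution:
    """Sketch: a distribution defined by pdf samples on a grid x."""

    def __init__(self, x, pdf):
        self.x = np.asarray(x, dtype=float)
        self.pdf = np.asarray(pdf, dtype=float)

    def norm(self):
        # pdf norm via Simpson-rule integration of the samples
        return simpson(self.pdf, x=self.x)

    def normalize(self):
        return Distribution(self.x, self.pdf / self.norm())

    def __mul__(self, other):
        # product with a scalar or another distribution; the other
        # operators (+ - /) would follow the same pattern
        if isinstance(other, Distribution):
            o = np.interp(self.x, other.x, other.pdf, left=0.0, right=0.0)
            return Distribution(self.x, self.pdf * o)
        return Distribution(self.x, self.pdf * other)

    def _cdf_samples(self):
        # cdf from the cumulative sum of the pdf samples
        c = np.cumsum(self.pdf)
        return c / c[-1]

    def cdf(self, q):
        return np.interp(q, self.x, self._cdf_samples())

    def ppf(self, p):
        # percentiles by bracketing the cdf and linear interpolation
        return np.interp(p, self._cdf_samples(), self.x)
```

For example, multiplying a standard normal by itself and normalizing yields a narrower Gaussian, as expected for a product of Gaussians.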
Test case: combining 4 experimental measurements
Let’s have 4 independent measurements of the same quantity, each with Gaussian uncertainties. The measurements are samples of a single Gaussian distribution whose mean and dispersion we want to estimate.
A quick Bayesian inference (with uniform priors) shows that if all measurements come from the same distribution, the product of the 4 posterior distributions gives you the underlying Gaussian parameters.
We also recover the fact that a product of Gaussians is a Gaussian.
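A self-contained numerical check of this test case, with four made-up measurements (the values are illustrative): the gridded product of the four Gaussian posteriors matches the closed-form result, an inverse-variance weighted mean with combined sigma 1/sqrt(sum(1/sigma_i^2)).

```python
import numpy as np

# four hypothetical measurements (mean, sigma) of the same quantity
measurements = [(4.8, 0.5), (5.1, 0.3), (5.3, 0.4), (4.9, 0.6)]

x = np.linspace(3.0, 7.0, 4001)
dx = x[1] - x[0]

# product of the four Gaussian posteriors (uniform priors)
post = np.ones_like(x)
for mu, sig in measurements:
    post *= np.exp(-0.5 * ((x - mu) / sig) ** 2)
post /= post.sum() * dx  # normalize the product

# closed form: the product is a Gaussian with the
# inverse-variance weighted mean and combined sigma
w = np.array([1.0 / s ** 2 for _, s in measurements])
mu_comb = np.sum(w * np.array([m for m, _ in measurements])) / w.sum()
sig_comb = 1.0 / np.sqrt(w.sum())

# moments of the gridded product
mu_grid = np.sum(x * post) * dx
sig_grid = np.sqrt(np.sum((x - mu_grid) ** 2 * post) * dx)
```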
An FFT-based Gaussian kernel density estimate (KDE) for computing the KDE on a regular grid.
Note that this is a different use case than scipy’s original scipy.stats.kde.gaussian_kde.
Performs a Gaussian kernel density estimate over a regular grid using a convolution of the Gaussian kernel with a 2D histogram of the data.
It computes the sparse bi-dimensional histogram of two data samples, where x and y are 1-D sequences of the same length. If weights is None (default), this is a histogram of the number of occurrences of the observations at (x[i], y[i]). This sparse histogram is a faster implementation than numpy’s histogram functions, as it avoids intermediate copies and excessive memory usage.
This function is typically several orders of magnitude faster than scipy.stats.kde.gaussian_kde. For large (>1e7) numbers of points, it produces an essentially identical result.
Boundary conditions on the data are corrected by using a symmetric / reflection condition, so the limits of the dataset do not affect the pdf estimate.
- x, y: ndarray[ndim=1]
- The x-coords, y-coords of the input data points respectively
- gridsize: tuple
- A (nx,ny) tuple of the size of the output grid (default: 200x200)
- extents: (xmin, xmax, ymin, ymax) tuple
- tuple of the extents of output grid (default: extent of input data)
- nocorrelation: bool
- If True, the correlation between the x and y coords will be ignored when performing the KDE. (default: False)
- weights: ndarray[ndim=1]
- An array of the same shape as x & y that weights each sample (x_i, y_i) by each value in weights (w_i). Defaults to an array of ones the same size as x & y. (default: None)
- adjust: float
- An adjustment factor for the bandwidth. The bandwidth used is bw * adjust.
- g: ndarray[ndim=2]
- A gridded 2D kernel density estimate of the input points.
- e: (xmin, xmax, ymin, ymax) tuple
- Extents of g
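A simplified sketch of the grid-based approach described above (histogram, then Gaussian smoothing). This version uses scipy.ndimage.gaussian_filter with a diagonal Scott's-rule bandwidth instead of an explicit FFT convolution with the full (possibly correlated) kernel, and omits the nocorrelation/weights options; function and variable names are illustrative. mode='reflect' mirrors the symmetric boundary condition mentioned above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def fastkde2d(x, y, gridsize=(200, 200), adjust=1.0):
    """Sketch: 2D Gaussian KDE on a regular grid via a 2D histogram
    followed by Gaussian smoothing (diagonal bandwidth only)."""
    # histogram the data on the regular grid
    h, xe, ye = np.histogram2d(x, y, bins=gridsize)
    dx = xe[1] - xe[0]
    dy = ye[1] - ye[0]

    # Scott's rule bandwidth in data units, per axis
    n = len(x)
    bw_x = n ** (-1.0 / 6.0) * np.std(x) * adjust
    bw_y = n ** (-1.0 / 6.0) * np.std(y) * adjust

    # smooth with the Gaussian kernel expressed in pixel units;
    # 'reflect' gives the symmetric boundary condition
    g = gaussian_filter(h, sigma=(bw_x / dx, bw_y / dy), mode='reflect')
    g /= g.sum() * dx * dy  # normalize to a density

    return g, (xe[0], xe[-1], ye[0], ye[-1])
```

On a bivariate normal sample, the estimate integrates to one and peaks near the origin.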
An FFT-based Gaussian kernel density estimate (KDE) for computing the KDE on a regular grid, in one dimension.
Note that this is a different use case than scipy’s original scipy.stats.kde.gaussian_kde.
Performs a Gaussian kernel density estimate over a regular grid using a convolution of the Gaussian kernel with a 1D histogram of the data.
It computes the sparse one-dimensional histogram of the data sample, where xin is a 1-D sequence. If weights is None (default), this is a histogram of the number of occurrences of the observations xin[i]. This sparse histogram is a faster implementation than numpy.histogram, as it avoids intermediate copies and excessive memory usage.
This function is typically several orders of magnitude faster than scipy.stats.kde.gaussian_kde. For large (>1e7) numbers of points, it produces an essentially identical result.
Example usage and timing
from scipy.stats import gaussian_kde
def npkde(x, xe):
    kde = gaussian_kde(x)
    return kde.evaluate(xe)

x = np.random.normal(0, 1, int(1e6))

%timeit fastkde1D(x)
10 loops, best of 3: 31.9 ms per loop

%timeit npkde(x, xe)
1 loops, best of 3: 11.8 s per loop
That is roughly a 370x speed-up (11.8 s vs 31.9 ms)! To be fair, gaussian_kde is not optimized for this application.
Boundary conditions on the data are corrected by using a symmetric / reflection condition, so the limits of the dataset do not affect the pdf estimate.
- xin: ndarray[ndim=1]
- The x-coords of the input data points
- gridsize: int
- The number of points nx of the output grid (default: 200)
- extents: (xmin, xmax) tuple
- tuple of the extents of output grid (default: extent of input data)
- weights: ndarray[ndim=1]
- An array of the same shape as x that weights each sample x_i by w_i. Defaults to an array of ones the same size as x (default: None)
- adjust: float
- An adjustment factor for the bandwidth. The bandwidth used is bw * adjust.
- g: ndarray[ndim=1]
- A gridded 1D kernel density estimate of the input points.
- e: (xmin, xmax) tuple
- Extents of g
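The 1D variant can be sketched the same way: histogram on the grid, then smooth with a Gaussian kernel. This sketch uses scipy.ndimage.gaussian_filter1d with a Silverman's-rule bandwidth rather than an explicit FFT convolution; names are illustrative and do not match the module's actual function exactly.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d


def fastkde1d(xin, gridsize=200, extents=None, weights=None, adjust=1.0):
    """Sketch: 1D Gaussian KDE on a regular grid via a histogram
    followed by Gaussian smoothing."""
    xin = np.asarray(xin, dtype=float)
    if extents is None:
        extents = (xin.min(), xin.max())

    # histogram the data on the regular grid
    h, edges = np.histogram(xin, bins=gridsize, range=extents, weights=weights)
    dx = edges[1] - edges[0]

    # Silverman's rule bandwidth, converted to pixel units
    bw = 1.06 * np.std(xin) * len(xin) ** (-1.0 / 5.0) * adjust

    # 'reflect' gives the symmetric boundary condition
    g = gaussian_filter1d(h.astype(float), sigma=bw / dx, mode='reflect')
    g /= g.sum() * dx  # normalize to a density
    return g, extents
```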
Bayesian Blocks Implementation
By Jake Vanderplas. License: BSD. Based on the algorithm outlined in http://adsabs.harvard.edu/abs/2012arXiv1207.5578S
This is an incomplete implementation: it may fail for some datasets. Alternate fitness functions and prior forms can be found in the paper listed above.
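Since the text only points at the paper, here is a sketch of the dynamic-programming core for the event-data fitness function from Scargle et al. (2012), following Vanderplas' outline. The geometric prior uses eq. 21 of the paper with p0 = 0.05; this mirrors the standard implementation but, like the one described above, is incomplete (it assumes distinct event times and implements only one fitness function).

```python
import numpy as np


def bayesian_blocks(t):
    """Sketch: optimal block edges for event data (Scargle et al. 2012)."""
    t = np.sort(np.asarray(t, dtype=float))
    n = len(t)

    # candidate block edges: data ends plus midpoints between events
    edges = np.concatenate([t[:1], 0.5 * (t[1:] + t[:-1]), t[-1:]])
    block_length = t[-1] - edges

    # prior on the number of change points (eq. 21, p0 = 0.05)
    ncp_prior = 4 - np.log(73.53 * 0.05 * n ** -0.478)

    best = np.zeros(n)
    last = np.zeros(n, dtype=int)
    for k in range(n):
        # width and event count of each candidate final block ending at k
        width = block_length[: k + 1] - block_length[k + 1]
        count = np.arange(k + 1, 0, -1)
        fit = count * (np.log(count) - np.log(width)) - ncp_prior
        fit[1:] += best[:k]
        last[k] = np.argmax(fit)
        best[k] = fit[last[k]]

    # backtrack through the last-change-point array
    cp = []
    i = n
    while i > 0:
        cp.append(i)
        i = last[i - 1]
    cp.append(0)
    return edges[np.array(cp[::-1])]
```

On event data whose rate jumps, the recovered edges include a change point near the jump.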
Compute the sparse bi-dimensional histogram of two data samples, where x and y are 1-D sequences of the same length. If weights is None (default), this is a histogram of the number of occurrences of the observations at (x[i], y[i]).
If weights is specified, it gives values at the coordinates (x[i], y[i]); these values are accumulated for each bin and then reduced according to the reduce_w function, which defaults to numpy’s sum function (np.sum). (If weights is specified, it must also be a 1-D sequence of the same length as x and y.)
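One common way to get such a sparse 2D histogram without a dense intermediate is a scipy COO matrix, which sums duplicate (row, col) entries on conversion. The sketch below (illustrative names, equal-width bins over the data range) covers only the default np.sum reduction; other reduce_w functions would need explicit grouping.

```python
import numpy as np
from scipy.sparse import coo_matrix


def sparse_histogram2d(x, y, bins=10, weights=None):
    """Sketch: sparse 2D histogram via COO duplicate summation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if weights is None:
        weights = np.ones_like(x)

    # map each sample to its bin index (right edge goes to the last bin)
    xi = np.clip(np.digitize(x, np.linspace(x.min(), x.max(), bins + 1)) - 1,
                 0, bins - 1)
    yi = np.clip(np.digitize(y, np.linspace(y.min(), y.max(), bins + 1)) - 1,
                 0, bins - 1)

    # duplicate (xi, yi) pairs are summed when the COO matrix is converted
    return coo_matrix((weights, (xi, yi)), shape=(bins, bins)).tocsr()
```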
Determine the optimal binning of the data based on common estimators, returning either the number of bins or the width to use.
data: 1d dataset to estimate from
method: the method to use, a str in {sturge, scott, freedman}
ret: set to N to return the number of bins / edges; set to W to return the width
Sturges, H. A. (1926). “The choice of a class interval”. J. American Statistical Association, 21, 65-66
Scott, David W. (1979). “On optimal and data-based histograms”. Biometrika, 66, 605-610
Freedman, D. & Diaconis, P. (1981). “On the histogram as a density estimator: L2 theory”. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 57, 453-476
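The three estimators cited above can be sketched as follows (the function name and return conventions mirror the description, but the exact implementation is assumed):

```python
import numpy as np


def optbins(data, method='freedman', ret='N'):
    """Sketch: optimal histogram binning via common estimators."""
    data = np.asarray(data, dtype=float)
    n = data.size
    span = data.max() - data.min()

    if method == 'sturge':        # Sturges (1926): nbins = ceil(log2(n) + 1)
        nbins = int(np.ceil(np.log2(n) + 1))
        width = span / nbins
    elif method == 'scott':       # Scott (1979): width from the std deviation
        width = 3.49 * np.std(data) * n ** (-1.0 / 3.0)
        nbins = int(np.ceil(span / width))
    elif method == 'freedman':    # Freedman & Diaconis (1981): width from the IQR
        q75, q25 = np.percentile(data, [75, 25])
        width = 2.0 * (q75 - q25) * n ** (-1.0 / 3.0)
        nbins = int(np.ceil(span / width))
    else:
        raise ValueError("unknown method: %s" % method)

    return width if ret == 'W' else nbins
```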
Computes quantiles from an array.
Quantiles are points taken at regular intervals from the cumulative distribution function (CDF) of a random variable. Dividing ordered data into q essentially equal-sized data subsets is the motivation for q-quantiles; the quantiles are the data values marking the boundaries between consecutive subsets.
The quantile at fraction 0.5 is called the median (50% of the distribution lies below it).
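A quick illustration with numpy (np.percentile takes fractions in percent, so the 0.5 quantile is percentile 50):

```python
import numpy as np

data = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9.])

# the median is the 0.5 quantile, i.e. percentile 50
median = np.percentile(data, 50)

# 4-quantiles (quartiles) divide the ordered data into four subsets
quartiles = np.percentile(data, [25, 50, 75])
```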
Variations on boxplots.
Make a violin plot of each dataset in the data sequence.
A violin plot is a boxplot combined with a kernel density estimate of the probability density function per point.
A dictionary with plotting options. Any of the following can be provided, if not present in plot_opts the defaults will be used:
- 'violin_fc', MPL color. Fill color for violins. Default is 'y'.
- 'violin_ec', MPL color. Edge color for violins. Default is 'k'.
- 'violin_lw', scalar. Edge linewidth for violins. Default is 1.
- 'violin_alpha', float. Transparency of violins. Default is 0.5.
- 'cutoff', bool. If True, limit violin range to data range.
Default is False.
- 'cutoff_val', scalar. Where to cut off violins if `cutoff` is
True. Default is 1.5 standard deviations.
- 'cutoff_type', {'std', 'abs'}. Whether cutoff value is absolute,
or in standard deviations. Default is 'std'.
- 'violin_width', float. Relative width of violins. Max available
  space is 1, default is 0.8.
- 'label_fontsize', MPL fontsize. Adjusts fontsize only if given.
- 'label_rotation', scalar. Adjusts label rotation only if given.
Specify in degrees.
beanplot : Bean plot, builds on violinplot.
matplotlib.pyplot.boxplot : Standard boxplot.
The appearance of violins can be customized with plot_opts. If customization of boxplot elements is required, set show_boxplot to False and plot it on top of the violins by calling the Matplotlib boxplot function directly. For example:
violinplot(data, ax=ax, show_boxplot=False)
ax.boxplot(data, sym='cv', whis=2.5)
It can happen that the axis labels or tick labels fall outside the plot area, especially with rotated labels on the horizontal axis. With Matplotlib 1.1 or higher, this can easily be fixed by calling fig.tight_layout(). With older Matplotlib one has to use plt.rc or plt.rcParams to fix this, for example:
plt.rc('figure.subplot', bottom=0.25)
violinplot(data, ax=ax)
J.L. Hintze and R.D. Nelson, “Violin Plots: A Box Plot-Density Trace Synergism”, The American Statistician, Vol. 52, pp.181-84, 1998.
Make a bean plot of each dataset in the data sequence.
A bean plot is a combination of a violinplot (kernel density estimate of the probability density function per point) with a line-scatter plot of all individual data points.
A dictionary with plotting options. All the options for violinplot can be specified, they will simply be passed to violinplot. Options specific to beanplot are:
- 'bean_color', MPL color. Color of bean plot lines. Default is 'k'.
  Also used for jitter marker edge color if jitter is True.
- 'bean_size', scalar. Line length as a fraction of maximum length.
  Default is 0.5.
- 'bean_lw', scalar. Linewidth, default is 0.5.
- 'bean_show_mean', bool. If True (default), show mean as a line.
- 'bean_show_median', bool. If True (default), show median as a
  marker.
- 'bean_mean_color', MPL color. Color of mean line. Default is 'b'.
- 'bean_mean_lw', scalar. Linewidth of mean line, default is 2.
- 'bean_median_color', MPL color. Color of median marker. Default
  is 'r'.
- 'bean_median_marker', MPL marker. Marker type, default is '+'.
- 'jitter_marker', MPL marker. Marker type for jitter=True.
  Default is 'o'.
- 'jitter_marker_size', int. Marker size. Default is 4.
- 'jitter_fc', MPL color. Jitter marker face color. Default is None.
- 'bean_legend_text', str. If given, add a legend with given text.
violinplot : Violin plot, also used internally in beanplot.
matplotlib.pyplot.boxplot : Standard boxplot.
P. Kampstra, “Beanplot: A Boxplot Alternative for Visual Comparison of Distributions”, J. Stat. Soft., Vol. 28, pp. 1-9, 2008.
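The idea of a bean plot can be sketched with Matplotlib's built-in ax.violinplot rather than the beanplot function documented above (so none of the bean_* options apply here): a violin per dataset, overlaid with a short horizontal line per individual data point. The data and file name are illustrative.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for this sketch
import matplotlib.pyplot as plt

np.random.seed(0)
data = [np.random.normal(0.0, s, 50) for s in (1.0, 0.5)]

fig, ax = plt.subplots()
# violin part: kernel density estimate of each dataset
ax.violinplot(data, positions=[1, 2], showmeans=True, showextrema=False)
# bean part: a short horizontal line per individual data point
for pos, d in zip([1, 2], data):
    ax.plot(np.full(d.shape, pos), d, marker='_', linestyle='none',
            color='k', markersize=10)
fig.savefig('beanplot_sketch.png')
```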