madminer.limits module

class madminer.limits.AsymptoticLimits(filename=None, include_nuisance_parameters=False)[source]

Bases: madminer.analysis.DataAnalyzer

Statistical inference based on asymptotic properties of the likelihood ratio as the test statistic.

This class provides two high-level functions:

  • AsymptoticLimits.observed_limits() calculates p-values over a grid in parameter space for a given set of observed data.
  • AsymptoticLimits.expected_limits() calculates expected p-values over a grid in parameter space based on “Asimov data”, a large hypothetical data set drawn from a given parameter point. This method is typically used to define expected exclusion limits or significances.

Both functions support inference based on…

  • histograms of kinematic observables,
  • histograms of score vectors estimated with the madminer.ml.ScoreEstimator class (SALLY and SALLINO techniques),
  • likelihood or likelihood ratio functions estimated with the madminer.ml.LikelihoodEstimator and madminer.ml.ParameterizedRatioEstimator classes (NDE, SCANDAL, CARL, RASCAL, ALICES, and so on).

Currently, this class requires a morphing setup. It does not yet support nuisance parameters.

Parameters:
filename : str

Path to MadMiner file (for instance the output of madminer.delphes.DelphesProcessor.save()).

include_nuisance_parameters : bool, optional

If True, nuisance parameters are taken into account. Currently not implemented. Default value: False.
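A typical call sequence might look like the following minimal sketch. The file path `data/madminer_example.h5` and the observable name `pt_j1` are hypothetical placeholders, and the helper `run_expected_limits` is introduced only for illustration; the keyword arguments follow the expected_limits signature documented below. The MadMiner import is deferred into the helper so that the keyword dictionary can be inspected without MadMiner installed.

```python
# Keyword arguments for an expected-limits scan in "histo" mode.
# "pt_j1" is a hypothetical observable name; use one defined in your own file.
expected_kwargs = dict(
    mode="histo",
    theta_true=[0.0, 0.0],                   # parameter point assumed to be true
    grid_ranges=[(-1.0, 1.0), (-1.0, 1.0)],  # one (min, max) pair per parameter
    grid_resolutions=25,                     # 25 x 25 = 625 grid points
    hist_vars=["pt_j1"],
    luminosity=300000.0,                     # integrated luminosity in pb^-1
)


def run_expected_limits(filename, **kwargs):
    """Open a MadMiner file and run expected_limits with the given keywords."""
    import numpy as np
    from madminer.limits import AsymptoticLimits  # requires MadMiner

    # theta_true is documented as an ndarray, so convert the list here.
    kwargs["theta_true"] = np.asarray(kwargs["theta_true"], dtype=float)
    limits = AsymptoticLimits(filename)
    return limits.expected_limits(**kwargs)


# Example (hypothetical path, not executed here):
# results = run_expected_limits("data/madminer_example.h5", **expected_kwargs)
# parameter_grid, p_values, mle = results[0], results[1], results[2]
```

observed_limits() can be driven the same way, with x_observed in place of theta_true.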

Methods

asymptotic_p_value(self, log_likelihood_ratio) Calculates the p-value corresponding to a given log likelihood ratio and number of degrees of freedom assuming the asymptotic approximation.
event_loader(self[, start, end, batch_size, …]) Yields batches of events in the MadMiner file.
expected_limits(self, mode, theta_true[, …]) Calculates expected p-values over a grid in parameter space.
observed_limits(self, mode, x_observed[, …]) Calculates p-values over a grid in parameter space based on a given set of observed events.
weighted_events(self[, theta, nu, …]) Returns all events together with the benchmark weights (if theta is None) or weights for a given theta.
xsec_gradients(self, thetas[, nus, …]) Returns the gradient of total cross sections with respect to parameters.
xsecs(self[, thetas, nus, partition, …]) Returns the total cross sections for benchmarks or parameter points.
asymptotic_p_value(self, log_likelihood_ratio, dof=None)[source]

Calculates the p-value corresponding to a given log likelihood ratio and number of degrees of freedom assuming the asymptotic approximation.

Parameters:
log_likelihood_ratio : ndarray

Log likelihood ratio (without the factor -2).

dof : int or None, optional

Number of parameters / degrees of freedom. None means the overall number of parameters is used. Default value: None.

Returns:
p_values : ndarray

p-values.
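To illustrate the asymptotic approximation, the sketch below converts a log likelihood ratio into a p-value by treating q = -2 log r as chi-squared distributed with dof degrees of freedom (Wilks' theorem). It is a standalone illustration using only the standard library, not MadMiner's own implementation; the function names and the clipping of small negative q values to zero are assumptions made for this example.

```python
import math


def chi2_survival(q, dof):
    """Survival function of a chi-squared distribution with integer dof >= 1."""
    y = 0.5 * q
    if dof % 2 == 0:
        a, sf = 1.0, math.exp(-y)              # Q(1, y) = exp(-y)
    else:
        a, sf = 0.5, math.erfc(math.sqrt(y))   # Q(1/2, y) = erfc(sqrt(y))
    # Recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < 0.5 * dof:
        sf += y**a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return sf


def p_value_sketch(log_likelihood_ratio, dof):
    """p-value for a log likelihood ratio given without the factor -2."""
    # Clip at zero so that tiny positive log ratios from numerical noise
    # do not produce a negative test statistic (a choice made in this sketch).
    q = max(-2.0 * log_likelihood_ratio, 0.0)
    return chi2_survival(q, dof)
```

With one degree of freedom, a test statistic of q ≈ 3.84 corresponds to p ≈ 0.05, the usual threshold for a 95% confidence level exclusion.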

expected_limits(self, mode, theta_true, grid_ranges=None, grid_resolutions=25, include_xsec=True, model_file=None, hist_vars=None, score_components=None, hist_bins=None, thetaref=None, luminosity=300000.0, weighted_histo=True, n_histo_toys=100000, histo_theta_batchsize=1000, dof=None, test_split=0.2, return_histos=True, return_asimov=False, fix_adaptive_binning='auto-grid', sample_only_from_closest_benchmark=True, postprocessing=None, n_asimov=None, n_binning_toys=100000, thetas_eval=None)[source]

Calculates expected p-values over a grid in parameter space.

theta_true specifies which parameter point is assumed to be true. Based on this parameter point, the function generates a large artificial “Asimov data set”. p-values are then calculated with frequentist hypothesis tests using the likelihood ratio as test statistic. The asymptotic approximation is used, see https://arxiv.org/abs/1007.1727.

Depending on the keyword mode, the likelihood ratio is calculated with one of several different methods:

  • With mode="rate", MadMiner only calculates the Poisson likelihood of the total number of events.
  • With mode="histo", the kinematic likelihood is estimated with histograms of a small number of observables given by the keyword hist_vars. hist_bins determines the binning of the histograms. include_xsec sets whether the Poisson likelihood of the total number of events is included or not.
  • With mode="ml", the likelihood ratio is estimated with a parameterized neural network. model_file has to point to the filename of a saved LikelihoodEstimator or ParameterizedRatioEstimator instance or a corresponding Ensemble (i.e. be the same filename used when calling estimator.save()). include_xsec sets whether the Poisson likelihood of the total number of events is included or not.
  • With mode="sally", the likelihood ratio is estimated with histograms of the components of the estimated score vector. model_file has to point to the filename of a saved ScoreEstimator instance. With score_components, the histogram can be restricted to some components of the score. hist_bins defines the binning of the histograms. include_xsec sets whether the Poisson likelihood of the total number of events is included or not.
  • With mode="adaptive-sally", the likelihood ratio is estimated with histograms of the components of the estimated score vector. The approach is essentially the same as for "sally", but the histogram binning is optimized for every parameter point by adding a new h = score * (theta - thetaref) dimension to the histogram. include_xsec sets whether the Poisson likelihood of the total number of events is included or not.
  • With mode="sallino", the likelihood ratio is estimated with one-dimensional histograms of the scalar variable h = score * (theta - thetaref) for each point theta along the parameter grid. model_file has to point to the filename of a saved ScoreEstimator instance. hist_bins defines the binning of the histogram. include_xsec sets whether the Poisson likelihood of the total number of events is included or not.
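As an illustration of the rate-only likelihood, the following sketch evaluates a Poisson log likelihood for the total event count, with the expected count given by luminosity times cross section. The Asimov count is set to the expectation at the assumed true point. The cross section values and the names `poisson_log_likelihood`, `theta_true`, and `theta_alt` are hypothetical, and MadMiner's internal calculation may differ in detail.

```python
import math


def poisson_log_likelihood(n_events, xsec_pb, luminosity_inv_pb):
    """Poisson log likelihood of n_events given an expected luminosity * xsec."""
    mu = luminosity_inv_pb * xsec_pb
    # lgamma(n + 1) = log(n!) and also accepts non-integer Asimov counts.
    return n_events * math.log(mu) - mu - math.lgamma(n_events + 1.0)


luminosity = 300000.0                 # pb^-1, the documented default
xsecs = {"theta_true": 0.010,         # pb, hypothetical cross sections
         "theta_alt": 0.011}

# Asimov data: the "observed" count equals the expectation at theta_true.
n_asimov = luminosity * xsecs["theta_true"]

log_l = {name: poisson_log_likelihood(n_asimov, xsec, luminosity)
         for name, xsec in xsecs.items()}

# Log likelihood ratio with respect to the true (best-fit) point.
llr = {name: value - log_l["theta_true"] for name, value in log_l.items()}
```

Because the Asimov count matches the expectation at theta_true, the log likelihood ratio is zero there and negative at every other point.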

MadMiner calculates one p-value for every parameter point on an evenly spaced grid specified by grid_ranges and grid_resolutions. For instance, in a three-dimensional parameter space, grid_ranges=[(-1., 1.), (-2., 2.), (-3., 3.)] and grid_resolutions=[10,10,10] evaluates p-values at 10^3 parameter points in a box that spans (-1, 1) in the first parameter, (-2, 2) in the second, and (-3, 3) in the third.
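The grid described here can be pictured with a short standard-library sketch. It assumes that both endpoints of each range are included and that the last parameter varies fastest; the point ordering MadMiner actually uses is not specified in this section, and the helper names are hypothetical.

```python
from itertools import product


def linspace(lo, hi, n):
    """Return n evenly spaced values from lo to hi, endpoints included."""
    if n == 1:
        return [float(lo)]
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def make_grid(grid_ranges, grid_resolutions):
    """Build all grid points as a list of parameter points."""
    # A single int applies the same resolution along every dimension.
    if isinstance(grid_resolutions, int):
        grid_resolutions = [grid_resolutions] * len(grid_ranges)
    axes = [linspace(lo, hi, n)
            for (lo, hi), n in zip(grid_ranges, grid_resolutions)]
    return [list(point) for point in product(*axes)]


grid = make_grid([(-1.0, 1.0), (-2.0, 2.0), (-3.0, 3.0)], [10, 10, 10])
```

The result has one row per grid point, matching the documented (n_grid_points, n_parameters) layout of parameter_grid.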

Parameters:
mode : {“rate”, “histo”, “ml”, “sally”, “sallino”, “adaptive-sally”}

Defines how the likelihood ratio test statistic is calculated. See above.

theta_true : ndarray

Parameter point assumed to be true to calculate the Asimov data.

grid_ranges : list of (tuple of float) or None, optional

Specifies the boundaries of the parameter grid on which the p-values are evaluated. It should be [(min, max), (min, max), …, (min, max)], where the list goes over all parameters and min and max are float. If None, thetas_eval has to be given. Default: None.

grid_resolutions : int or list of int, optional

Resolution of the parameter space grid on which the p-values are evaluated. If int, the resolution is the same along every dimension of the hypercube. If list of int, the individual entries specify the number of points along each parameter individually. Default value: 25.

include_xsec : bool, optional

Whether the Poisson likelihood representing the total number of events is included in the analysis. Default value: True.

model_file : str or None, optional

Filename of a saved neural network estimating the likelihood, likelihood ratio, or score. Required if mode is anything except “rate” or “histo”. Default value: None.

hist_vars : list of str or None, optional

Kinematic variables used in the histograms when mode is “histo”. The names are the same as used for instance in DelphesReader. Default value: None.

score_components : None or list of int, optional

Defines the score components used when mode is “sally” or “adaptive-sally”. Default value: None.

hist_bins : int or list of (int or ndarray) or None, optional

Defines the histogram binning when mode is “histo”, “sally”, “adaptive-sally”, or “sallino”. If int, gives the number of bins automatically chosen for each summary statistic. If list, each entry corresponds to one summary statistic (e.g. kinematic variable specified by hist_vars or estimated score component); an int entry corresponds to the number of automatically chosen bins, an ndarray specifies the bin edges along this dimension explicitly. If None, the bins are chosen according to the defaults: for one summary statistic the default is 25 bins, for 2 it’s 8 bins along each direction, for more it’s 5 per dimension. Default value: None.
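The default rule for hist_bins=None can be written out as a small sketch. The function names are hypothetical and the code only mirrors the defaults stated above (25 bins for one summary statistic, 8 per dimension for two, 5 per dimension for more); it does not reproduce how MadMiner chooses the bin edges.

```python
def default_bins_per_dimension(n_summary_statistics):
    """Documented default number of bins per histogram dimension."""
    if n_summary_statistics == 1:
        return 25
    if n_summary_statistics == 2:
        return 8
    return 5


def resolve_hist_bins(hist_bins, n_summary_statistics):
    """Expand hist_bins into one entry per summary statistic."""
    if hist_bins is None:
        n_bins = default_bins_per_dimension(n_summary_statistics)
        return [n_bins] * n_summary_statistics
    if isinstance(hist_bins, int):
        return [hist_bins] * n_summary_statistics
    # A list is passed through: int entries are bin counts,
    # array entries are explicit bin edges.
    return list(hist_bins)
```

The defaults keep the total number of bins moderate as dimensions are added: 25 bins in one dimension and 64 in two.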

thetaref : ndarray or None, optional

Defines the reference parameter point at which the score is evaluated for mode “sallino” or “adaptive-sally”. If None, the origin in parameter space, [0., 0., …, 0.], is used. Default value: None.

luminosity : float, optional

Integrated luminosity in pb^{-1} assumed in the analysis. Default value: 300000.

weighted_histo : bool, optional

If True, the histograms used for the modes “histo”, “sally”, “sallino”, and “adaptive-sally” use one set of weighted events to construct the histograms at every point along the parameter grid, only with different weights for each parameter point on the grid. If False, independent unweighted event samples are drawn for each parameter point on the grid. Default value: True.

n_histo_toys : int or None, optional

Number of events drawn to construct the histograms used for the modes “histo”, “sally”, “sallino”, and “adaptive-sally”. If None and weighted_histo is True, all events in the training fraction of the MadMiner file are used. If None and weighted_histo is False, 100000 events are used. Default value: 100000.

histo_theta_batchsize : int or None, optional

Number of histograms constructed in parallel for the modes “histo”, “sally”, “sallino”, and “adaptive-sally” when weighted_histo is True. A larger number speeds up the calculation, but requires more memory. Default value: 1000.

dof : int or None, optional

If not None, sets the number of parameters for the calculation of the p-values. If None, the overall number of parameters is used. Default value: None.

test_split : float, optional

Fraction of weighted events in the MadMiner file reserved for evaluation. Default value: 0.2.

return_histos : bool, optional

If True and if mode is “histo”, “sally”, “adaptive-sally”, or “sallino”, the function returns histogram objects for each point along the grid. Default value: True.

fix_adaptive_binning : [False, “center”, “grid”, “auto-grid”, “auto-center”], optional

If not False and if mode is “histo”, “sally”, “adaptive-sally”, or “sallino”, the automatic histogram binning is the same for every point along the parameter grid. For “center”, the central point in the parameter grid is used to determine the binning, for “grid” all points in the parameter grid are combined for this. For “auto-grid” or “auto-center”, this option is turned on if mode is “histo” or “sally”, but not for “adaptive-sally” or “sallino”. Default value: “auto-grid”.
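One way to read the "auto" options is sketched below. The function name is hypothetical, and returning False for modes that do not use histograms ("rate" and "ml") is an assumption of this sketch rather than something stated above.

```python
HISTOGRAM_MODES = ("histo", "sally", "adaptive-sally", "sallino")


def resolve_fix_adaptive_binning(fix_adaptive_binning, mode):
    """Return False, "center", or "grid" for a given option and mode."""
    if mode not in HISTOGRAM_MODES:
        return False  # assumption: no histograms, so nothing to fix
    if fix_adaptive_binning in ("auto-grid", "auto-center"):
        # The auto options switch the fixed binning on only for
        # "histo" and "sally".
        if mode in ("histo", "sally"):
            return fix_adaptive_binning[len("auto-"):]
        return False
    return fix_adaptive_binning
```

Under this reading, the default "auto-grid" behaves like "grid" for "histo" and "sally" and like False for "adaptive-sally" and "sallino".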

sample_only_from_closest_benchmark : bool, optional

If True, only events originally generated from the closest benchmarks are used when generating the Asimov data (and, if weighted_histo is False, the histogram data). Default value: True.

return_asimov : bool, optional

Whether the values of the summary statistics in the Asimov (“expected observed”) data set are returned. Default value: False.

postprocessing : None or function, optional

If not None, points to a function that processes the summary statistics before being fed into histograms. Default value: None.

n_binning_toys : int or None, optional

Number of toy events used to determine the binning of adaptive histograms. Default value: 100000.

n_asimov : int or None, optional

Size of the Asimov sample. If None, all weighted events in the MadMiner file are used. Default value: None.

thetas_eval : ndarray or None, optional

Manually specifies the parameter points at which the likelihood and p-values are evaluated. If None, grid_ranges and grid_resolutions are used instead to construct a regular grid. Default value: None.

Returns:
parameter_grid : ndarray

Parameter points at which the p-values are evaluated with shape (n_grid_points, n_parameters).

p_values : ndarray

Expected p-values for each parameter point on the grid, with shape (n_grid_points,).

mle : int

Index of the parameter point with the best fit (largest p-value / smallest -2 log likelihood ratio).

log_likelihood_ratio_kin : ndarray or None

log likelihood ratio based only on kinematics for each point of the grid, with shape (n_grid_points,).

log_likelihood_rate : ndarray or None

log likelihood based only on the total rate for each point of the grid, with shape (n_grid_points,).

histos : None or list of Histogram

None if return_histos is False. Otherwise a list of histogram objects for each point on the grid. This can be useful for debugging or for plotting the histograms.
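The returned arrays can be turned into exclusion statements with a few lines of post-processing. The sketch below uses toy numbers in place of real output; the function name and the 95% confidence level are choices made for this example.

```python
def summarize_limits(parameter_grid, p_values, mle, confidence_level=0.95):
    """Split grid points into allowed and excluded at a confidence level."""
    threshold = 1.0 - confidence_level
    best_fit = parameter_grid[mle]  # mle is an index into the grid
    allowed = [pt for pt, p in zip(parameter_grid, p_values) if p >= threshold]
    excluded = [pt for pt, p in zip(parameter_grid, p_values) if p < threshold]
    return best_fit, allowed, excluded


# Toy stand-ins for the first three return values of expected_limits().
parameter_grid = [[-1.0], [-0.5], [0.0], [0.5], [1.0]]
p_values = [0.001, 0.2, 1.0, 0.3, 0.01]
mle = 2

best_fit, allowed, excluded = summarize_limits(parameter_grid, p_values, mle)
```

Grid points with a p-value below 0.05 are excluded at the 95% confidence level; the remaining points form the expected allowed region.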

observed_limits(self, mode, x_observed, grid_ranges=None, grid_resolutions=25, include_xsec=True, model_file=None, hist_vars=None, score_components=None, hist_bins=None, thetaref=None, luminosity=300000.0, weighted_histo=True, n_histo_toys=100000, histo_theta_batchsize=1000, n_observed=None, dof=None, test_split=0.2, return_histos=True, return_observed=False, fix_adaptive_binning='auto-grid', postprocessing=None, n_binning_toys=100000, thetas_eval=None)[source]

Calculates p-values over a grid in parameter space based on a given set of observed events.

x_observed specifies the observed data as an array of observables, using the same observables and their order as used throughout the MadMiner workflow.

The p-values are then calculated with frequentist hypothesis tests using the likelihood ratio as test statistic. The asymptotic approximation is used, see https://arxiv.org/abs/1007.1727.

Depending on the keyword mode, the likelihood ratio is calculated with one of several different methods:

  • With mode="rate", MadMiner only calculates the Poisson likelihood of the total number of events.
  • With mode="histo", the kinematic likelihood is estimated with histograms of a small number of observables given by the keyword hist_vars. hist_bins determines the binning of the histograms. include_xsec sets whether the Poisson likelihood of the total number of events is included or not.
  • With mode="ml", the likelihood ratio is estimated with a parameterized neural network. model_file has to point to the filename of a saved LikelihoodEstimator or ParameterizedRatioEstimator instance or a corresponding Ensemble (i.e. be the same filename used when calling estimator.save()). include_xsec sets whether the Poisson likelihood of the total number of events is included or not.
  • With mode="sally", the likelihood ratio is estimated with histograms of the components of the estimated score vector. model_file has to point to the filename of a saved ScoreEstimator instance. With score_components, the histogram can be restricted to some components of the score. hist_bins defines the binning of the histograms. include_xsec sets whether the Poisson likelihood of the total number of events is included or not.
  • With mode="adaptive-sally", the likelihood ratio is estimated with histograms of the components of the estimated score vector. The approach is essentially the same as for "sally", but the histogram binning is optimized for every parameter point by adding a new h = score * (theta - thetaref) dimension to the histogram. include_xsec sets whether the Poisson likelihood of the total number of events is included or not.
  • With mode="sallino", the likelihood ratio is estimated with one-dimensional histograms of the scalar variable h = score * (theta - thetaref) for each point theta along the parameter grid. model_file has to point to the filename of a saved ScoreEstimator instance. hist_bins defines the binning of the histogram. include_xsec sets whether the Poisson likelihood of the total number of events is included or not.
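The scalar variable h used by the "sallino" and "adaptive-sally" modes is the inner product of the estimated score with the displacement from the reference point. A minimal sketch for a single event follows; the function name is hypothetical, and plain lists stand in for the arrays MadMiner works with.

```python
def sallino_variable(score, theta, thetaref=None):
    """Compute h = score . (theta - thetaref) for one event."""
    if thetaref is None:
        # Documented default: the origin of parameter space.
        thetaref = [0.0] * len(theta)
    return sum(s * (t - r) for s, t, r in zip(score, theta, thetaref))


# Two-parameter toy example: score estimated at thetaref, evaluated at theta.
h = sallino_variable(score=[0.5, -2.0], theta=[1.0, 0.5])
```

Because h depends on theta, a separate one-dimensional histogram is needed for every point on the parameter grid, as described in the "sallino" entry.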

MadMiner calculates one p-value for every parameter point on an evenly spaced grid specified by grid_ranges and grid_resolutions. For instance, in a three-dimensional parameter space, grid_ranges=[(-1., 1.), (-2., 2.), (-3., 3.)] and grid_resolutions=[10,10,10] evaluates p-values at 10^3 parameter points in a box that spans (-1, 1) in the first parameter, (-2, 2) in the second, and (-3, 3) in the third.

Parameters:
mode : {“rate”, “histo”, “ml”, “sally”, “sallino”, “adaptive-sally”}

Defines how the likelihood ratio test statistic is calculated. See above.

x_observed : ndarray

Observed data with shape (n_events, n_observables). The observables have to be the same used throughout the MadMiner analysis, for instance specified in the DelphesReader class with add_observables.

grid_ranges : list of (tuple of float) or None, optional

Specifies the boundaries of the parameter grid on which the p-values are evaluated. It should be [(min, max), (min, max), …, (min, max)], where the list goes over all parameters and min and max are float. If None, thetas_eval has to be given. Default: None.

grid_resolutions : int or list of int, optional

Resolution of the parameter space grid on which the p-values are evaluated. If int, the resolution is the same along every dimension of the hypercube. If list of int, the individual entries specify the number of points along each parameter individually. Doesn’t have any effect if grid_ranges is None. Default value: 25.

include_xsec : bool, optional

Whether the Poisson likelihood representing the total number of events is included in the analysis. Default value: True.

model_file : str or None, optional

Filename of a saved neural network estimating the likelihood, likelihood ratio, or score. Required if mode is anything except “rate” or “histo”. Default value: None.

hist_vars : list of str or None, optional

Kinematic variables used in the histograms when mode is “histo”. The names are the same as used for instance in DelphesReader. Default value: None.

score_components : None or list of int, optional

Defines the score components used when mode is “sally” or “adaptive-sally”. Default value: None.

hist_bins : int or list of (int or ndarray) or None, optional

Defines the histogram binning when mode is “histo”, “sally”, “adaptive-sally”, or “sallino”. If int, gives the number of bins automatically chosen for each summary statistic. If list, each entry corresponds to one summary statistic (e.g. kinematic variable specified by hist_vars or estimated score component); an int entry corresponds to the number of automatically chosen bins, an ndarray specifies the bin edges along this dimension explicitly. If None, the bins are chosen according to the defaults: for one summary statistic the default is 25 bins, for 2 it’s 8 bins along each direction, for more it’s 5 per dimension. Default value: None.

thetaref : ndarray or None, optional

Defines the reference parameter point at which the score is evaluated for mode “sallino” or “adaptive-sally”. If None, the origin in parameter space, [0., 0., …, 0.], is used. Default value: None.

luminosity : float, optional

Integrated luminosity in pb^{-1} assumed in the analysis. Default value: 300000.

weighted_histo : bool, optional

If True, the histograms used for the modes “histo”, “sally”, “sallino”, and “adaptive-sally” use one set of weighted events to construct the histograms at every point along the parameter grid, only with different weights for each parameter point on the grid. If False, independent unweighted event samples are drawn for each parameter point on the grid. Default value: True.

n_histo_toys : int or None, optional

Number of events drawn to construct the histograms used for the modes “histo”, “sally”, “sallino”, and “adaptive-sally”. If None and weighted_histo is True, all events in the training fraction of the MadMiner file are used. If None and weighted_histo is False, 100000 events are used. Default value: 100000.

histo_theta_batchsize : int or None, optional

Number of histograms constructed in parallel for the modes “histo”, “sally”, “sallino”, and “adaptive-sally” when weighted_histo is True. A larger number speeds up the calculation, but requires more memory. Default value: 1000.

n_observed : int or None, optional

If not None, the likelihood ratio is rescaled to this number of observed events before calculating p-values. Default value: None.
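One plausible reading of this rescaling is sketched below: the summed per-event log likelihood ratio is scaled by n_observed divided by the number of events actually supplied, so that the result corresponds to a data set of n_observed events. This interpretation and the function name are assumptions; the section does not spell out the exact formula MadMiner applies.

```python
def rescale_log_likelihood_ratio(per_event_llr, n_observed=None):
    """Sum per-event log likelihood ratios, optionally rescaled to n_observed."""
    total = sum(per_event_llr)
    if n_observed is None:
        return total  # no rescaling: use the events as given
    # Equivalent to n_observed times the mean per-event log likelihood ratio.
    return total * n_observed / len(per_event_llr)


# Two supplied events, rescaled to ten observed events.
rescaled = rescale_log_likelihood_ratio([-0.1, -0.3], n_observed=10)
```

Under this reading, n_observed lets a small or oversized x_observed sample stand in for a data set of a different size.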

dof : int or None, optional

If not None, sets the number of parameters for the calculation of the p-values. If None, the overall number of parameters is used. Default value: None.

test_split : float, optional

Fraction of weighted events in the MadMiner file reserved for evaluation. Default value: 0.2.

return_histos : bool, optional

If True and if mode is “histo”, “sally”, “adaptive-sally”, or “sallino”, the function returns histogram objects for each point along the grid. Default value: True.

fix_adaptive_binning : [False, “center”, “grid”, “auto-grid”, “auto-center”], optional

If not False and if mode is “histo”, “sally”, “adaptive-sally”, or “sallino”, the automatic histogram binning is the same for every point along the parameter grid. For “center”, the central point in the parameter grid is used to determine the binning, for “grid” all points in the parameter grid are combined for this. For “auto-grid” or “auto-center”, this option is turned on if mode is “histo” or “sally”, but not for “adaptive-sally” or “sallino”. Default value: “auto-grid”.

return_observed : bool, optional

Whether the observed values of the summary statistics are returned. Default value: False.

postprocessing : None or function, optional

If not None, points to a function that processes the summary statistics before being fed into histograms. Default value: None.

n_binning_toys : int or None, optional

Number of toy events used to determine the binning of adaptive histograms. Default value: 100000.

thetas_eval : ndarray or None, optional

Manually specifies the parameter points at which the likelihood and p-values are evaluated. If None, grid_ranges and grid_resolutions are used instead to construct a regular grid. Default value: None.

Returns:
parameter_grid : ndarray

Parameter points at which the p-values are evaluated with shape (n_grid_points, n_parameters).

p_values : ndarray

Observed p-values for each parameter point on the grid, with shape (n_grid_points,).

mle : int

Index of the parameter point with the best fit (largest p-value / smallest -2 log likelihood ratio).

log_likelihood_ratio_kin : ndarray or None

log likelihood ratio based only on kinematics for each point of the grid, with shape (n_grid_points,).

log_likelihood_rate : ndarray or None

log likelihood based only on the total rate for each point of the grid, with shape (n_grid_points,).

histos : None or list of Histogram

None if return_histos is False. Otherwise a list of histogram objects for each point on the grid. This can be useful for debugging or for plotting the histograms.