Chair: Nicola De Filippis
Members: Veronica Sanz, Harrison Prosper, Sabine Kraml, Daniel Schmeier, Sezen Sekmen, Giacomo Polesello, Jory Sonneveld, Nishita Desai, Camillo Carillo, Sanjoy Biswas, Monika Wielers, Ilaria Brivio, Jose Miguel No
The goal of the study we started in LesHouches is:
Below there is a possible list of items to be addressed in preparation for the Les Houches proceedings. Please add more or correct my list if needed:
The lhe files for the different models are here:
ZZto2e2mu: (xsection=0.29084336591184917 pb)
https://cernbox.cern.ch/public.php?service=files&t=43cc4803a552ebe0188df8fd8a655672
ZZto4e: (xsection=0.13205154366855495 pb)
https://cernbox.cern.ch/public.php?service=files&t=fc2b0c901ecae46f25ee46fde49e6e4d
ZZto4mu: (xsection=0.13205154366855495 pb)
https://cernbox.cern.ch/public.php?service=files&t=17bd6e6956b86d475f1fa314f1d617f8
ZH, H→ZZ→4l: (xsection=2.62619200000000028e-04 pb)
https://cernbox.cern.ch/public.php?service=files&t=2d4ae1b2be68a7181f904e7dddd3e050
ttH, H→ZZ→4l: (xsection=1.53566999999999988e-04 pb)
https://cernbox.cern.ch/public.php?service=files&t=8aaba0f2678ef5d48cbf6d737f386ef6
WminusH, H→ZZ→4l:
https://cernbox.cern.ch/public.php?service=files&t=93f1c85817d3c06a90ef991ed0ba874d
WplusH, H→ZZ→4l: (xsection WplusH+WminusH= 4.16760000000000015e-04 pb)
https://cernbox.cern.ch/public.php?service=files&t=f3fbf337f9bde773f45436562904b825
ggH, H→ZZ→4l: (xsection=1.32638400000000007e-02 pb)
https://www.dropbox.com/sh/wauvsr4evou4zkt/AAB8z3YKN8ZMVvzMIBH3i7fua?dl=0
VBF, H→ZZ→4l: (xesction=0.001131896 pb )
https://cernbox.cern.ch/public.php?service=files&t=114cec87be1d3f480a937722b85fbca2
Help Needed for gamma gamma with this part, new background
Here are the .root files for some of the files above (more to be added).
The files in the directory itself are created using Standard Delphes 3.2.0 and with the PileUp events from below. Electrons, Muons and Photons are not isolated in this sample! Track and Tower information is available in the ROOT file to apply an external isolation routine. The used detector card can be found under “Auxiliary”.
Each signal .root file contains 1000 events, background samples contain 5000 events.
Cross sections x BR(H→ZZ→4l, l=e,mu only) (in pb) (2HDM, M_zprime=variable, M_A0=300 GeV, tan(beta)=1, gZ=0.8) | |
---|---|
Model File | sigma x Br (pb) |
MZP600_MA0300 | 0.04669*1.25E-04 |
MZP800_MA0300 | 0.05174*1.25E-04 |
MZP1000_MA0300 | 0.04197*1.25E-04 |
MZP1200_MA0300 | 0.03176*1.25E-04 |
MZP1400_MA0300 | 0.02356*1.25E-04 |
MZP1700_MA0300 | 0.01510*1.25E-04 |
MZP2000_MA0300 | 0.009734*1.25E-04 |
MZP2500_MA0300 | 0.004860*1.25E-04 |
Cross sections x BR(H→ZZ→4l) (in pb) (M_zprime=1TeV, M_scalar=1TeV) | |
---|---|
Model File | sigma x Br (pb) |
Higgs_hhxx_scalar_nohdecay_1GeV_13TeV | 1.065e-11 |
Higgs_hhxx_scalar_nohdecay_10GeV_13TeV | 1.0567e-11 |
Higgs_hhxx_scalar_nohdecay_100GeV_13TeV | 2.2276e-17 |
Higgs_hhxx_scalar_nohdecay_500GeV_13TeV | 5.789e-20 |
Higgs_hhxx_scalar_nohdecay_1000GeV_13TeV | 1.333e-21 |
Higgs_Zprime_nohdecay_1GeV_13TeV | 0.00000636732 |
Higgs_Zprime_nohdecay_10GeV_13TeV | 0.0000063590400 |
Higgs_Zprime_nohdecay_100GeV_13TeV | 0.0000002264028 |
Higgs_Zprime_nohdecay_500GeV_13TeV | 0.00000001475772 |
Higgs_Zprime_nohdecay_1000GeV_13TeV | 0.00000000003287160 |
Higgs_scalar_nohdecay_1GeV_13TeV | 0.000849144 |
Cross sections x BR(H→gg) (in pb) (M_zprime=1TeV, M_scalar=1TeV): | |
---|---|
Model File | sigma x Br (in pb) |
Higgs_hhxx_scalar_nohdecay_1GeV_13TeV | 8.05296e-11 |
Higgs_hhxx_scalar_nohdecay_10GeV_13TeV | 7.9777e-11 |
Higgs_hhxx_scalar_nohdecay_100GeV_13TeV | 1.6817e-16 |
Higgs_hhxx_scalar_nohdecay_500GeV_13TeV | 4.37076e-19 |
Higgs_hhxx_scalar_nohdecay_1000GeV_13TeV | 1.0064e-20 |
Higgs_Zprime_nohdecay_1GeV_13TeV | 0.0000525996 |
Higgs_Zprime_nohdecay_10GeV_13TeV | 0.0000525312 |
Higgs_Zprime_nohdecay_100GeV_13TeV | 0.000001870284 |
Higgs_Zprime_nohdecay_500GeV_13TeV | 0.0000001219116 |
Higgs_Zprime_nohdecay_1000GeV_13TeV | 0.000000000271548 |
Data-Card
Not Working with Standard Delphes
I'll try to give a quick tutorial on the necessary steps to do the above chain very conveniently, so that everyone can test the above LHE files. I suppose most people have the relevant tools already but I still try to cover everything. Let me start with the necessary installation steps:
add Branch TrackMerger/tracks Track Track add Branch Calorimeter/towers Tower Tower
as this information is used in the analysis (see github link below).
Main:numberOfEvents = 1000 Beams:frameType = 4 Beams:LHEF = /.../Higgs_hhxx_scalar_nohdecay_100GeV_13TeV.lhe
To fix the decays of the Higgs, you have to add the following to the above pythia input card:
For H→gamma gamma
25:onMode = off 25:onIfMatch = 22 22
For H → ZZ → 4l
25:onMode = off 25:onIfMatch = 23 23 23:onMode = off 23:onIfAny = 11 13 15
Macros for 4l analysis running on delphes samples:
git clone https://github.com/hbprosper/monoHiggs.git
List of plots that can be produced for 4l analysis:
a) Before all the cuts:
- Order the lepton in increasing pT and plot them on the same canvas.
- plots the relative isolation variable of the leptons
b) Then build the Z1 as reported in slide 1 (see References). Plot the Z1 mass . Plots the isolation variable of the leptons not included in the Z1 building. Plots the impact parameter if available.
c) Build the Z2 as pointed in the slide above
d) plots the 4l invariant mass after the full selection and the MET.
e) Plots the same stuff soft signal and background and try to over-impose.
See References for the cuts and a list of Delphes objects and their properties.
Slides of mon-Higgs+DM analysis results in the four lepton and MET final state:
Preliminary results on gamma gamma and 4muons:
Some summary slides for summary talk by Sacha Nikitenko:
https://cernbox.cern.ch/public.php?service=files&t=645c4898c882aa28434f6b7bd93d9576
Tool for statistical analysis by Harrison Prosper:
Daniel: This does not seem to be the standard way ATLAS/CMS set limits, which is based on the frequentist-CLs. Why are we not using that one?
Harrison: It seems reasonable to me to favor statistical methods that are well-founded, by which I mean methods that professional statisticians (rather than amateurs like us) have shown to be sound. Just as people defer to physicists about physics, I think it reasonable to defer to statisticians about statistics. The method implemented in this package is Bayesian, which is well-founded in this sense. By contrast, CLs is not. Moreover, 1) asymptotic CLs may be overly optimistic when observed counts are low; 2) the ATLAS/CMS method requires performing two fits per toy experiment and therefore requires a rule to deal with non-converging or bad fits; 3) the use of profile likelihoods, which underlies the ATLAS/CMS method, when there are multiple nuisance parameters in the problem can yield results that may be too optimistic, and 4) worst of all, CLs (when done correctly) is a huge sink of CPU and manpower for computing numbers which, in the end, simply say that nothing was found! Because of the enormous amount of effort consumed computing CLs during Run I, the CMS SUSY group is revisiting the question of whether limit setting using CLs is sensible for Run II. High energy physicists are prone to a “Masters of the Universe” complex, which is manifested in our dismissal of the wisdom of other professionals. Alas, since I am a high energy physicist ipso facto I am prone to some level of arrogance…but I am not so arrogant as to dismiss two and a half centuries of work on statistics in favor of something like CLs! Surely, Laplace, Fisher, Neyman, Bernardo, Berger and dozens more have useful things to teach us?
Daniel: Since you mention the disadvantages of CLs (which certainly are true!), I have to write a few lines to 'defend' it: Bayesian statistics might be mathematically well founded (more than CLs), but scientifically raises the huge issue that one has to put in a prior degree of belief in a theory via $\pi(\sigma)$. From a theoretical point of view, there is no reason why to assume this to be flat, lognormal, spiky or anything else. I could have a model which describes Nature (= the observation) perfectly for one particular parameter setting but completely fails for any other and if my prior is anything different than a delta-distribution, a Bayesian approach will exclude it due to the marginalisation.
I don't want to raise a discussion about which approach is the best, because neither one is. My only concern is - and that was the main point behind my original question - that for the sake of comparison, one should stick to the statistical methods which have been defined as standard, unless there is a strong reason why the standard approach is not applicable to the given scenario. For the models we have, I don't see any such reason. Moreover, the ATLAS study linked in the References section determines their limit using CLs and we most likely want to compare our result to theirs. If we use different means to quantify our results, I fear it will be hard to disentangle the physics from the statistics.
This standalone tool can be used to compute upper limits on a cross section (or signal strength) given one or more observed counts, $N$. The package can be downloaded from github
using
git clone https://github.com/hbprosper/limits.git
To build it, do
cd limits make
To initialize some environment variables, do
source setup.sh
and to test it do,
cd example ./example1.py 1 1 0 0 0 1
This calculates the 90% upper limit for count $N = 1$. (The other parameters, signal efficiency $\epsilon$, efficiency uncertainty $\delta\epsilon$, background $B$, background uncertainty $\delta B$, and an integrated luminosity ${\cal L}$ are set to 1, 0, 0, 0, and 1, respectively.) The 90% upper limit is the well known value 3.9 for a Poisson distribution with $N = 1$.
Usage
The package contains two C++ classes (callable from Python via PyROOT
) MultiPoisson
and Bayes
. The first class sets up a multi-Poisson model, while the second class computes (Bayesian) limits. The Python program blimit.py
(which is used in example1.py
) shows how to use these two classes. Here are the essential code snippets.
# load the liblimits shared library using PyROOT from ROOT import gSystem gSystem.Load('liblimits') # make the MultiPoisson and Bayes classes visible to Python from ROOT import MultiPoisson, Bayes # initialize multi-Poisson model # from an input file. model = MultiPoisson('inputs.dat', 1) # set up Bayes limit calculator data = model.counts() sigmamin = 0.0 sigmamax = 20.0 CL = 0.90 bayes = Bayes(model, data, sigmamin, sigmamax, CL) # compute upper limit limit = bayes.quantile() print "limit = %10.2f" % limit
Description
The limits
package computes cross section limits for the multi-Poisson model,
$$
p(D \,|\, \sigma, \epsilon, b, {\cal L}) = \prod_{i=1}^K \frac{e^{-n_i} \, n_i^{N_i}}{N_i!},
$$
where $D \equiv N_1,\cdots, N_K$ denotes the set of observed counts and
$$n_i = \sigma \, \epsilon_i \, {\cal L} + b_i$$
is the expected count per bin,
with $\sigma$, $\epsilon_i$, and $b_i$ the cross section, and the bin-by-bin expected efficiency and expected background, respectively. In order to use the program to compute limits on a signal strength, $\mu$, rather than a cross section, where $n_i = \mu \, s_i + b_i$, one need merely set ${\cal L} = 1$ and make the identifications $ \sigma \rightarrow \mu$, $\epsilon \rightarrow s_i$.
The program marginalizes over the model parameters $\epsilon, b, {\cal L}$,
$$
p(D \,|\, \sigma) = \int p(D \,|\, \sigma, \epsilon, b, {\cal L}) \, \pi(\epsilon, b, {\cal L}) \, d\epsilon \, db \, d {\cal L},
$$
with respect to an evidence-based prior density, $\pi(\epsilon, b, {\cal L})$, that encodes whatever is known about these parameters. The function $p(D \,|\, \sigma)$ is called the marginal (or integrated) likelihood. Bayes theorem,
$$
p(\sigma \,|\, D ) = p(D \,|\, \sigma) \, \pi(\sigma) \, / \, p(D),
$$
is then used
to compute the posterior density $p(\sigma \,|\, D)$. The posterior probability, $p(\sigma \,|\, D) \, d\sigma$, is the probability (interpreted as a degree of belief) that the true value of $\sigma$ lies in the interval $[\sigma, \sigma + d\sigma]$. The function $\pi(\sigma)$ is the prior density for the cross section. In principle, it encodes what we know about the cross section independently of the new data $D$. In practice, we usually set $\pi(\sigma) = 1$, though the program allows for more sophisticated choices.
The upper limit on the cross section $\sigma_{UP}$ at $100 \times (1-\alpha)$% CL is the solution of the equation,
$$\int_0^{\sigma_{UP}} p(\sigma \,|\, D) \, d\sigma = 1 - \alpha,$$ which is calculated by the class Bayes
.
The evidence-based prior density $\pi(\epsilon, b, {\cal L})$ can be very complicated. This is because the signal efficiencies and backgrounds typically depend on many underlying experimental and theoretical parameters that are not known precisely, such as the jet energy scale (JES), jet energy resolution (JER), the trigger efficiencies, the object identification efficiencies, numerous data to Monte Carlo correction factors, parton distribution functions (PDF), and artifacts such as the renormalization and factorization scales that enter theoretical calculations. ATLAS and CMS make valiant attempts to model explicitly how all uncertain quantities affect the signal efficiencies and backgrounds (in principle) across multiple signals, multiple backgrounds, multiple signal regions and multiple bins. The explicit modeling, however, of these dependencies quickly leads to enormous complexity.
The limits
package takes a different approach. In any given analysis, it is generally possible to identify the independent (or approximately independent) sources of uncertainty. For example, the uncertainty in the jet energy scale is surely independent of the uncertainty in the CT10nlo
PDFs. All uncertainties, whatever their provenance and whatever words are used to describe them, be it statistical, systematic, theoretical, best guess, etc., can be accounted for in the same way:
Consider the following example. Parton distribution functions are not known precisely and, therefore, are a source of uncertainty. The uncertainty in PDFs, for example, the CT10nlo
PDFs, can be
encapsulated in an ensemble of PDF sets that can be created using the LHAPDF
program hessian2replicas
. For each set of CT10nlo
PDF sets, one runs the entire analysis chain whose output is either a set of histograms with $K$ bins or, more generally, a flat ntuple from which histograms can be made. Every event is re-weighted by the ratio of the product of the new PDFs to the product of PDFs with which the events were generated. The resulting ensemble of histograms and, therefore, efficiencies and backgrounds incorporates the uncertainty due to our imperfect knowledge of the PDFs. Given this ensemble, the marginal likelihood is approximated by simply averaging the likelihood,
$$
p(D \,|\, \sigma) \approx \frac{1}{M} \sum_{j=1}^M p(D \,|\, \sigma, \theta_j),
$$
with respect to the parameters $\theta \equiv \epsilon, b$. Each term in the sum is associated with one random variation of the CT10nlo
PDF parameters.
The variations of imprecisely known quantities, such as the JES and PDFs, can, and should, be done simultaneously by running the analysis (or analyses) $M$ times, in parallel, each with a different random variation of the imprecisely known quantities. Varying them one by one in order to quantify the relative importance of each systematic effect is instructive. However, adding the resulting uncertainties in quadrature may not be adequate.
The key to this approach is to think carefully about how each randomly varied parameter affects the observables in the analysis. How does the jet energy scale and jet energy resolution affect jet observables and therefore the missing transverse energy? How do the data to Monte Carlo corrections affect the observables and the event weight? How does the trigger efficiency affect the event weight? How do the PDFs affect the event weight, and so on?
Final remarks
Ideally, every analysis chain comprises two independent modules: event by event, the first module takes as input the nominal (experiment or framework-dependent) objects and a particular point in the space of systematic parameters and outputs the re-calibrated framework-independent objects, ideally, as vectors of TParticle
objects, while the second module takes as input the re-calibrated TParticle
objects and outputs histograms and, or, flat ntuples. The second module should have no knowledge of the provenance of the TParticle
objects. If we wrote our analysis codes this way, they would become framework-independent reusable modules. But, I've just realized I have been dreaming!
We started to work on the proceeding. The deadline is end of February. You can find the last version at link: