“ICHEP MC Production” Post-Mortem
J-R Vlimant, on behalf of everyone else
6/26/12 2
Disclaimer
“Post-mortem” means after death/end, yet MC samples are still being produced as we speak. The body is still warm!
A full computing operation post-mortem analysis is planned for the Computing & Offline Management Meeting in Trieste, 25-27 July 2012.
The full post-mortem will be done by then, and the many lessons learned will be turned into action items.
It is easy to notice only what goes wrong.
https://twiki.cern.ch/twiki/bin/view/CMS/PdmVProductionSummer12
http://vlimant.web.cern.ch/vlimant/Directory/summer12/progress/Summer12_GEN-SIM_speed.html
tailed into beginning of May
✔ Validation samples (end of March - end of April)
✔ Low PU production (April 18 – May 21) (PU_S8 or E8TeV4BX50ns)
✔ TSG production (April 21 - May 14) (PU_S9)
✔ HPA production (end of March – today) (PU_S7 and PU_S6)
https://twiki.cern.ch/twiki/bin/view/CMS/PdmVProductionSummer12
http://vlimant.web.cern.ch/vlimant/Directory/summer12/progress/Summer12_START52_AODSIM_speed.html
✔ Defined with Physics Coordination
✗ Production overshot by <~1 week
➔ Data popularity analysis?
✔ Defined by all groups, filtered by Physics Coordination, compiled, and arranged for production
✔ 5 blocks + 1 block for the rest (see details in next slides)
✔ Everything else not on that list was frozen in production (or not attended to)
✗ Complications were met with samples already submitted in gen-sim, acquired in the queue with lower priority, inherited from the beginning of Summer12 (early Feb)
✔ Not much of an issue with Digi-Reco prioritization (since nothing had been started yet)
✔ Overall, the production went fine
✗ DiPhotonJets_7TeV-madgraph useless in Summer12
✗ TTJets_MassiveBinDECAY available in PU_S6 as requested, missing PU_S7
✔ 140M to AODSIM

✗ 4 Higgs requests still “new”: means not defined in PREP
✗ EWK: DY4JetsToLL_M-50 digi-reco stalled
✗ EWK: DY2JetsToLL_M-50 gen-sim extension stalled
✔ 200M to AODSIM

✗ 2 requests in “new”: means not defined in PREP
✗ JME: QCD_Pt-15to30 digi-reco stalled

NB: “stalled” = site issues, queue overhead; probably done by now.
http://vlimant.web.cern.ch/vlimant/Directory/summer12/summary.html?search=Block1
HPAx = Blockx in the URL
✗ BPH: BdToKK, BdToPiPi, BdToPiMuNu, LambdaBToPK, digi-reco stalled
✗ BPH: LambdaBToPMuNu gen-sim taking forever due to a very low filter efficiency
✗ EWK: DYToTauTau_M-20_CT10 digi-reco stalled
✗ SUS: QCD_HT-500To1000 completing
✗ SUS: TTWWJets, WZZNoGstarJets, WWWJets, TTGJets, TTWJets,
✗ Top: 7 systematic samples (TT/T/W scale up/down) digi-reco stalled
✗ Top: 3 systematic samples gen-sim stalled
✔ 550M to AODSIM

✗ SUS: QCD_HT-100To250 digi-reco stalled
✗ Top: 4 systematic samples (TT/W matching up/down) digi-reco stalled
✗ Higgs: VBF_HToZZTo2L2Nu_M-525 digi-reco stalled
✔ 52M to AODSIM
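The LambdaBToPMuNu item above blames a very low filter efficiency for the slow gen-sim step. A minimal sketch of why this hurts, assuming the usual relation that the number of events to generate is the number requested divided by the filter efficiency. The function name and the numbers are illustrative only and are not taken from PREP or from any actual request.

```python
import math


def events_to_generate(requested_events, filter_efficiency):
    """Estimate how many events must be generated so that
    `requested_events` survive a generator-level filter.

    Assumption (not from the slides): the filter keeps a fixed
    fraction `filter_efficiency` of the generated events.
    """
    if not 0 < filter_efficiency <= 1:
        raise ValueError("filter_efficiency must be in (0, 1]")
    return math.ceil(requested_events / filter_efficiency)


if __name__ == "__main__":
    # Hypothetical request: 1M events after the filter.
    for eff in (1.0, 0.5, 0.25):
        print(eff, events_to_generate(1_000_000, eff))
```

The cost scales as the inverse of the efficiency, so a request that looks modest in PREP can hide a much larger generation job.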
✔ No support from main developers, who have gone to work in industry
✔ Solved by definition with experience gained
✔ Lots of experience gained by both PdmV and Comp-Ops
➔ Computing full post-mortem at the end of July
➔ PREP2 project

✔ Ad-hoc monitoring pages will be turned into a consolidated third-party PREP/reqMng monitoring on a medium time scale (pre-PREP2)
➔ GlobalMonitor is being upgraded

✔ Ad-hoc chaining from PREP evolved into an ad-hoc operation summary
➔ Improvement of current PREP to speed up operation
➔ PREP2 / integration with request manager

✗ Daily assignment is a killer overhead
✗ Weekly assignment does not allow for quick turn-over
✗ Monthly assignment in early April severely delayed some samples
➔ Accumulate experience into automated procedures
➔ More from the July post-mortem
✗ Damaged the output dataset
✔ We won't do that again anytime soon
➔ Development of the system to allow for this feature

✗ Many cases of “change the priority”, then “the next day it was acquired”
✔ Ask for careful pre-planning in the future
✔ Tied to the lack of an approximate estimated time of delivery
➔ Development of the system to allow more flexibility

✔ Were dealt with in priority
➔ Add a link to a PAS in PREP2 to tie requests to analysis
➔ More careful planning needs to be made by the groups, early on

✔ Improve on preparation/documentation of special requests
➔ Implementing a gen-validation step as part of the submission procedure

✗ The first 10% of the samples was not reachable fast enough
✔ Numerous requests were staged, but the rest steals resources
✔ A handful of requests were extended
➔ Planning for two-speed submission of samples (10% high, 90% bulk) with PREP2
➔ Development on WM infrastructure to allow for safe extension of datasets
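The two-speed submission idea (10% high, 90% bulk) can be sketched as a simple split of a request into two slices. This is only an illustration under stated assumptions: the function name, the dictionary keys, and the rounding rule are hypothetical and do not describe PREP2 or the WM infrastructure.

```python
def split_two_speed(total_events, fast_percent=10):
    """Split a request into a high-priority slice and a bulk remainder.

    Hypothetical helper: `fast_percent` is the share (in percent) of
    events submitted at high priority; the rest goes to the bulk slice.
    Integer arithmetic is used so the two slices always add up exactly.
    """
    if total_events < 0:
        raise ValueError("total_events must be non-negative")
    if not 0 < fast_percent < 100:
        raise ValueError("fast_percent must be strictly between 0 and 100")
    # Round the fast slice up so that small requests still get one.
    high = (total_events * fast_percent + 99) // 100
    return {"high": high, "bulk": total_events - high}


if __name__ == "__main__":
    # Illustrative request size only.
    print(split_two_speed(10_000_000))
```

Rounding the fast slice up is a design choice for the sketch; it guarantees that the first slice is never empty for a non-empty request.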
✔ Clarified half-way
✔ Tied to resource downtime
➔ Planned to be automated “by date” in PREP2

✗ Due to filter efficiency, corrupted LHE,...
➔ Incorporate this as part of a gen-valid request

✗ History monitoring missing
✔ Weekly report from PREP
✔ Scanning scripts developed by the operators
✔ Thanks to the sharp eyes of some requesters, who made clear reports
➔ More from the July post-mortem

➔ Increase the usage of Fastsim

✔ Increase coordination between groups
✔ Follow up on important samples
✔ Propagation of operational information and news
➔ Monte-Carlo coordination meeting put in place