NP04 Powercut Recovery and Cold Starts

SLIDE 1

NP04 Powercut Recovery and Cold Starts

Pengfei Dine, Bonnie King, Geoff Savage DUNE DAQ meeting 29 July 2019

SLIDE 2

Power cut #1 July 23

  • Power cut with subsequent cooling failure brought down all servers except np04-srv-001 and np04-srv-004 (on UPS)

SLIDE 3

Power cut #1 July 23 recovery

  • Manually pressed the power button on servers without IPMI configured/cabled (and on the IPMI head nodes)
  • Issued power-on commands via IPMI where possible, in no particular order
  • The cron job to delay reboot was in place and mounts came back correctly
  • Had to restart supervisord where it came up before the NFS mount
  • Had to mount the CIFS mount manually
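
The remote power-on step can be scripted. The following is a minimal sketch, assuming the BMCs are reachable over the IPMI v2.0 LAN interface and that `ipmitool` is installed on the head node. The BMC hostnames, the user name, and the password file path are hypothetical placeholders, not values from the slides.

```python
import shlex
import subprocess


def ipmi_power_on_command(bmc_host, user, password_file):
    """Return the ipmitool argv that asks one BMC to power on its chassis."""
    return [
        "ipmitool",
        "-I", "lanplus",      # IPMI v2.0 over LAN
        "-H", bmc_host,       # BMC address (hypothetical naming scheme)
        "-U", user,
        "-f", password_file,  # read the password from a file, not the command line
        "chassis", "power", "on",
    ]


def power_on_all(bmc_hosts, user, password_file, dry_run=True):
    """Build (and optionally run) one power-on command per BMC, in list order."""
    commands = [ipmi_power_on_command(h, user, password_file) for h in bmc_hosts]
    if not dry_run:
        for cmd in commands:
            # check=False: keep going if a single BMC does not answer
            subprocess.run(cmd, check=False)
    return commands


if __name__ == "__main__":
    # Hypothetical BMC names; the real ones are not given in the slides.
    hosts = ["np04-srv-002-ipmi", "np04-srv-005-ipmi"]
    for cmd in power_on_all(hosts, "admin", "/root/.ipmi_pass", dry_run=True):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

With `dry_run=True` the script only prints the commands, which makes it easy to review the host list before anything is powered on. Because the list is processed in order, putting storage or NFS servers first would be one way to impose a start-up order.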

SLIDE 4

Power cut #2

  • Had to power down servers again due to a water pressure drop
  • This time, non-critical servers were gracefully shut down and the IPMI head nodes were kept up
  • Most servers came back, except srv-0[03, 04, 10, 11, 12, 21, 22, 24], due to a known issue (keystroke required to boot)
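
The bracketed shorthand for the affected servers can be expanded into full hostnames when scripting follow-up checks. This is a small sketch only; the `np04-` prefix is an assumption based on the hostnames used on the other slides.

```python
import re


def expand_hosts(pattern):
    """Expand 'prefix[a, b, c]suffix' into one hostname per bracketed item."""
    match = re.fullmatch(r"(.*)\[([^\]]*)\](.*)", pattern)
    if match is None:
        # No bracket group: treat the pattern as a single hostname.
        return [pattern]
    prefix, body, suffix = match.groups()
    items = [item.strip() for item in body.split(",") if item.strip()]
    return [f"{prefix}{item}{suffix}" for item in items]


if __name__ == "__main__":
    # 'np04-' prefix is assumed; the slide writes only 'srv-0[...]'.
    for host in expand_hosts("np04-srv-0[03, 04, 10, 11, 12, 21, 22, 24]"):
        print(host)
```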

SLIDE 5

Power cut #2 recovery

  • Some RAID volumes (not on UPS) got upset after the power cut
  • Recovered mounts on np04-srv-003 and np04-srv-004 by manually assembling volumes
  • Raw data was written to some mount areas while the mount was missing, filling up /
  • Moved data to the correct area
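
One way to keep a writer from filling up / when a data volume is absent is to check that the target directory is really a mount point before writing. The following is a minimal sketch of that guard; the function names and the `/data0` path are hypothetical and do not describe the actual DAQ software.

```python
import os


class MountMissingError(RuntimeError):
    """Raised when an expected mount point is only a plain directory."""


def require_mounted(mount_point):
    """Return mount_point if a filesystem is mounted there, else raise."""
    if not os.path.ismount(mount_point):
        raise MountMissingError(f"{mount_point} is not mounted; refusing to write")
    return mount_point


def write_raw_data(mount_point, relative_name, payload):
    """Write payload under mount_point, but only if the mount is present."""
    require_mounted(mount_point)
    target = os.path.join(mount_point, relative_name)
    with open(target, "wb") as handle:
        handle.write(payload)
    return target


if __name__ == "__main__":
    try:
        # '/data0' is a hypothetical data area.
        write_raw_data("/data0", "run0001.dat", b"\x00" * 16)
    except MountMissingError as err:
        print(f"skipped write: {err}")
```

Failing loudly here trades a lost write for a root filesystem that stays usable, which is usually the easier failure to recover from.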

Need to redirect boot loader and kernel init to serial console for IPMI access (plan to do this later today)

SLIDE 6

Planned improvements

  • Get serial console redirection to Serial over LAN working during boot (can send keystrokes remotely)
  • Configure supervisord with correct startup dependency
  • Audit Ansible playbooks; remove outdated playbooks, create playbook to provision a new node from scratch
  • Alerting in Prometheus
  • disk usage, missing mounts, etc.
  • Ansible roles for work done in test period
  • np04-onl-XXX cabled for IPMI?
  • np04-srv-007 IPMI cable
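
The missing-mount and disk-usage alerts could be fed by a small script that writes metrics in the Prometheus text exposition format, for example into a file picked up by the node_exporter textfile collector. The sketch below assumes that setup; the metric names (`np04_mount_present`, `np04_disk_used_ratio`) and the mount points are hypothetical.

```python
import shutil


def mounted_paths(proc_mounts_text):
    """Return the set of mount points listed in /proc/mounts-style text."""
    paths = set()
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            # Field 2 is the mount point (spaces appear escaped as \040).
            paths.add(fields[1])
    return paths


def render_metrics(expected_mounts, proc_mounts_text, usage_paths=("/",)):
    """Render gauges for expected mounts and for disk usage of usage_paths."""
    present = mounted_paths(proc_mounts_text)
    lines = ["# TYPE np04_mount_present gauge"]
    for mount_point in expected_mounts:
        value = int(mount_point in present)  # 1 if mounted, 0 if missing
        lines.append(f'np04_mount_present{{mountpoint="{mount_point}"}} {value}')
    lines.append("# TYPE np04_disk_used_ratio gauge")
    for path in usage_paths:
        usage = shutil.disk_usage(path)
        ratio = usage.used / usage.total if usage.total else 0.0
        lines.append(f'np04_disk_used_ratio{{path="{path}"}} {ratio:.4f}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    with open("/proc/mounts") as handle:
        mounts_text = handle.read()
    # Hypothetical list of mounts that should always be present.
    print(render_metrics(["/nfs/sw", "/data0"], mounts_text), end="")
```

An alert rule could then fire when `np04_mount_present` is 0 or when `np04_disk_used_ratio` for `/` stays above a chosen threshold.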
