SLIDE 1

WMO 7th verification workshop, May 8-11, 2017

The difficulty of verifying small improvements in forecast quality

Alan Geer
Satellite microwave assimilation team, Research Department, ECMWF (day job: all-sky assimilation)

Thanks to: Mike Fisher, Michael Rennie, Martin Janousek, Elias Holm, Stephen English, Erland Kallen, Tomas Wilhelmsson and Deborah Salmond


SLIDE 2

The viewpoint from an NWP research department

Not:
  • What is the skill of a forecast?
  • Is one NWP centre’s forecast better than another?
But this:
  • Is one experiment better than another?
  • Is the new cycle (upgrade) better than current operations?
Philosophy:
  • Lots of small improvements add up to generate better forecasts.


SLIDE 3

Research to operations

[Flow diagram: individual scientists test experiments (Expt. A, B, C, D) against a control; contributions from other scientists on the team feed a team merge (e.g. Team Z contribution vs. control); team contributions then feed a cycle merge (control, Team X contribution, Team X + Y contribution, Team X + Y + Z contribution), together with contributions from other teams, leading to an upgrade to the operational system]

SLIDE 4

“Iver”: an R&D-focused verification tool

[Plot: normalised change in RMS error in 500 hPa geopotential]

  • Control: current operational system (“Cycle 43R1”)
  • Experiments progressively adding different components for the cycle upgrade
  • 95% confidence intervals, based on the paired-difference t-test
  • Confidence interval inflation to account for time-correlations of paired differences in forecast errors
  • Correction for multiplicity: 4 separate experiments × 2 hemispheres × roughly 2 independent scores across days 3-10

(A sketch of the basic paired-difference test is given below.)
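
The paired-difference test behind these confidence intervals can be sketched in a few lines. This is a minimal illustration, not the Iver implementation: the function name, the synthetic data and the choice to normalise by the control's mean RMSE are assumptions, and no inflation or multiplicity correction is applied here.

```python
import numpy as np
from scipy import stats

def paired_difference_ci(rmse_exp, rmse_ctl, confidence=0.95):
    """Normalised change in RMS error, with a paired-difference t-test
    confidence interval. No time-correlation inflation and no multiplicity
    correction here; those are layered on top, as the slides describe."""
    d = rmse_exp - rmse_ctl                 # differences, paired by start date
    n = d.size
    se = d.std(ddof=1) / np.sqrt(n)         # standard error of the mean difference
    t_crit = stats.t.ppf(0.5 + confidence / 2.0, df=n - 1)
    norm = rmse_ctl.mean()                  # normalise by the control's mean RMSE
    return d.mean() / norm, t_crit * se / norm

# Synthetic day-5 Z500 RMSE values for roughly a year of paired forecasts:
rng = np.random.default_rng(0)
ctl = rng.normal(40.0, 8.0, 360)
exp = ctl + rng.normal(-0.2, 2.0, 360)      # a small mean improvement
change, halfwidth = paired_difference_ci(exp, ctl)
print(f"normalised change in RMSE: {change:+.4f} +/- {halfwidth:.4f}")
```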

SLIDE 5

Latitude-pressure verification

[Plot: normalised change in std. dev. of error in Z (experiment - control). Blue = reduction in error = experiment better than control. Cross-hatching: significant at 95% using a t-test with Šidák correction, assuming one panel contains 20 independent tests.]

A typical dilemma in NWP development:
  • Should we accept a degradation in stratospheric scores to improve tropospheric midlatitude scores?
  • Do we even believe the scores are meaningful?

SLIDE 6

Latitude-longitude verification

Because many improvements (and degradations) are local.

[Map: normalised change in RMS T error at 850 hPa; colour scale from -0.2 to +0.2]

Are these patterns statistically significant?
  • This requires a multiplicity correction: work in progress.

But are these patterns useful despite the lack of significance testing?
  • Yes: this turned out to be a problem associated with a new aerosol climatology that put too much optical depth over the Gulf of Guinea.
  • Too much optical depth = too much IR radiative heating at low levels = local temperatures too warm.

SLIDE 7

Statistical problems in NWP research & development

The issues:
  • Every cycle upgrade generates hundreds of experiments.
  • NWP systems are already VERY good: experiments usually test only minor modifications, with small expected benefits to forecast scores.
  • Much of what we do is (in the software sense) regression testing: we are checking for unexpected changes or interactions (bugs) anywhere in the atmosphere, at any scale.
  • Verification tools will generate 10,000+ plots, and each of those plots may itself contain multiple statistical tests.
  • Accurate hypothesis testing (significance testing) is critical:
  • Type I error = rejection of the null hypothesis when it is true = a false positive. Can be more frequent than expected due to multiple testing (multiplicity) and temporal correlation of forecast error.
  • Type II error = failure to reject the null hypothesis when it is false. Changes in forecast error are small, so many samples are required to gain significance.
  • Are our chosen scores meaningful and useful?


SLIDE 8

1. Multiple comparisons (multiplicity)

  • 95% confidence = 0.95 probability of NOT making a type I error.
  • What if we make 4 statistical tests at 95% confidence?
  • The probability of not making a type I error in any of the four tests is 0.95 × 0.95 × 0.95 × 0.95 = 0.81.
  • We have gone from 95% confidence to 81% confidence: there is now a 1 in 5 chance of at least one test falsely rejecting the null hypothesis (i.e. falsely showing “significant” results).
  • Šidák correction: P_TEST = (P_FAMILY)^(1/n)
  • If we want a family-wide confidence of 0.95, then each of the four tests should be performed at 0.95^(1/4) ≈ 0.987 (see the sketch below).
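
As a sanity check on the arithmetic above, here is a minimal sketch of the Šidák correction; the function name is illustrative.

```python
def sidak_per_test_confidence(family_confidence, n_tests):
    """Per-test confidence level needed so that n independent tests
    together retain the desired family-wide confidence (Sidak)."""
    return family_confidence ** (1.0 / n_tests)

# Four uncorrected 95% tests only give a family-wide confidence of:
print(f"{0.95 ** 4:.2f}")                           # 0.81
# To keep 95% family-wide confidence, run each of the four tests at:
print(f"{sidak_per_test_confidence(0.95, 4):.3f}")  # ~0.987
```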


SLIDE 9

Shouldn’t n be very large?

  • If we generate 10,000+ plots, why isn’t n > 10,000?
  • Because many of the forecast scores we examine are NOT independent.

SLIDE 10

Testing the statistical significance testing

Geer (2016, Tellus): “Significance of changes in forecast scores”

Three experiments with the full ECMWF NWP system, each run over 2.5 years:
  • Control.
  • AMSU-A denial: remove one AMSU-A (an important source of temperature information) from the observing system.
  • Chaos: change a technical aspect of the system (the number of processing elements) that causes initially tiny numerical differences in the results, which quickly grow. This is a representation of the null hypothesis: no scientific change.


SLIDE 11

Correlation of paired differences in other scores with paired differences in day-5 Z RMSE scores

  • All the dynamical scores are fairly correlated over the troposphere, and with one another.
→ Z500 RMSE is sufficient to verify tropospheric synoptic forecasts in the medium range.
  • But the stratospheric scores, and relative humidity, appear more independent (see the sketch below).
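
Estimating how many independent tests there really are comes down to correlating the paired differences of different scores. A minimal sketch with synthetic stand-ins: the score names, sample size and correlation structure are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 360                                   # paired forecasts
synoptic = rng.normal(0.0, 1.0, n)        # shared tropospheric signal

# Paired differences (experiment - control) per score; the tropospheric
# dynamical scores share a common component, while the stratospheric and
# humidity scores are made (nearly) independent of it.
diffs = {
    "Z500":  synoptic + 0.3 * rng.normal(0.0, 1.0, n),
    "T850":  synoptic + 0.4 * rng.normal(0.0, 1.0, n),
    "Z50":   rng.normal(0.0, 1.0, n),      # stratosphere
    "RH700": rng.normal(0.0, 1.0, n),      # relative humidity
}
names = list(diffs)
corr = np.corrcoef([diffs[k] for k in names])
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"corr({names[i]:5s}, {names[j]:5s}) = {corr[i, j]:+.2f}")
```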


SLIDE 12

Correlation of paired differences in scores at other time ranges with paired differences in day-5 Z RMSE scores

  • Scores are correlated over a few days through the time range.
→ Day-5 Z RMSE is sufficient to verify the quality of (roughly) the day 4 to day 6 forecasts.


SLIDE 13

What is a reasonable n?

  • For the regional scores, n is the product of:
  • the number of experiments
  • medium-range and long-range
  • two hemispheres
  • But why not also count the stratosphere, the tropics, and the lat-lon verification?
  • For the moment, n is computed independently for each style of plot.

SLIDE 14

2. Type I error (false rejection of the null hypothesis) due to time-correlation of forecast errors

[Plot: chaos – control, computed on 8 chunks of 230 forecasts; 95% t-test with k=1 (no inflation) vs. 95% t-test with k=1.22 (inflation for time-correlation)]

The chaos experiment should generate false positives at the chosen rate (e.g. 5% when testing at 95% confidence). Instead, naive testing generates false positives far more frequently. (A sketch of the inflation factor follows.)
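
The inflation factor k can be derived from the autocorrelation of the paired differences. Geer (2016) fits an AR(2) model; the sketch below uses the simpler AR(1) approximation instead, with synthetic data tuned so that k comes out near the slide's 1.22. Everything here is illustrative, not the operational code.

```python
import numpy as np

def ar1_inflation_factor(d):
    """Inflation factor k for the standard error of the mean of
    time-correlated paired differences, under an AR(1) approximation
    (the paper fits AR(2); this is the simplest version)."""
    d = d - d.mean()
    r1 = (d[:-1] * d[1:]).sum() / (d * d).sum()   # lag-1 autocorrelation
    return np.sqrt((1.0 + r1) / (1.0 - r1))

# Synthetic paired differences with modest day-to-day correlation:
rng = np.random.default_rng(2)
n, phi = 230, 0.2
d = np.empty(n)
d[0] = rng.normal()
for t in range(1, n):
    d[t] = phi * d[t - 1] + rng.normal()

print(f"k = {ar1_inflation_factor(d):.2f}")  # widen the t-test CI by this factor
```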

SLIDE 15

3. Type II error: failure to reject the null hypothesis

[Plot: AMSU-A denial – control, computed on 8 chunks of 230 forecasts]

Based on 2.5 years of testing, we know what the AMSU-A denial impact really is; but on 230 forecasts (about 4 months) we might see no significant impact: a type II error.

The AMSU-A denial experiment should degrade forecast scores: AMSU-A is a very important source of data, known to provide benefit to forecasts.

SLIDE 16

Fighting type II error: how many forecasts are required to get significance?

[Plot: forecasts required versus size of the change, for 1 independent test (e.g. we have one experiment and all we care about is NH day-5 RMSE). Three regimes are marked:]
  • Once in a while (e.g. moving from 3D-Var to 4D-Var)
  • A typical cycle upgrade?
  • A typical individual change, e.g. one AMSU-A

(A power-calculation sketch follows.)
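
The required number of forecasts follows from a standard power calculation for a paired test. A minimal sketch under a normal approximation; the effect sizes attached to each regime are illustrative guesses, not the slide's actual numbers.

```python
from scipy import stats

def forecasts_needed(effect, alpha=0.05, power=0.8):
    """Approximate forecasts needed for a paired t-test to detect a mean
    change of `effect` standard deviations of the paired differences
    (two-sided test, normal approximation, independent forecasts)."""
    z_a = stats.norm.ppf(1.0 - alpha / 2.0)
    z_b = stats.norm.ppf(power)
    return int(round(((z_a + z_b) / effect) ** 2))

# Illustrative effect sizes for the three regimes on the slide:
for label, effect in [("once in a while (3D-Var to 4D-Var)", 0.5),
                      ("a typical cycle upgrade", 0.2),
                      ("a typical individual change (one AMSU-A)", 0.1)]:
    print(f"{label}: ~{forecasts_needed(effect)} forecasts")
```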

SLIDE 17

4. Are our scores meaningful? Changing the reference changes the results

Problem areas: the tropics, the stratosphere, any short-range verification, and any verification of humidity.

[Plot panels: temperature, geopotential, vector wind and relative humidity scores, verified against own analysis vs. against operational analyses, for SH, tropics and NH]

SLIDE 18

Observational verification: “obstats”

Example: verification against aircraft temperature measurements (AIREP).

[Plot: change in std. dev. of error of the T+12 forecast, relative to control]

SLIDE 19

Summary: four issues in operational R&D verification

1. Type I error due to multiple comparisons:
  • Try to determine how many independent tests n are being made (e.g. compute the correlation between scores).
  • Paired differences in medium-range dynamical tropospheric scores are all quite correlated.
  • Paired differences are correlated across different forecast ranges.
  • Once n is estimated, use a Šidák correction.
2. Type I error due to time-correlated forecast error:
  • The chaos experiment was used to validate an AR(2) model for correcting time-correlations.
  • Note that at forecast day 10 this may not work: long-range time-correlations?


SLIDE 20

Summary: four issues in operational R&D verification

3. Type II error, because typical experiments test only small changes in forecast error:
  • 300-400 forecasts are now a minimum requirement for research experiments at ECMWF.
4. Are the forecast scores meaningful?
  • Own-analysis scores are accurate in the medium and long range, for midlatitude dynamical scores.
  • In other areas (e.g. the tropics, the stratosphere, the early forecast range) these scores are often measuring something very different from forecast skill.
  • Also check observation-based verification.

For more detail on issues 1-3, see Geer (2016, Tellus), “Significance of changes in forecast scores”.
