Inference, aggregation and graphics for top-k rank lists

SLIDE 1


Inference, aggregation and graphics for top-k rank lists

Michael G. Schimek¹, Eva Budinská², Shili Lin³, Alena Myšičková⁴

¹ Medical University of Graz and Danube University Krems, Austria
² Swiss Institute of Bioinformatics, Lausanne, Switzerland
³ Ohio State University, Columbus, USA
⁴ Humboldt University, Berlin, Germany

useR! 2009, Rennes, France, July 8-10, 2009

  • M. G. Schimek et al.

Inference, aggregation and graphics for top-k rank lists

SLIDE 2

Motivation

  • In various fields of application we are confronted with lists of distinct objects in rank order
  • The ordering might be due to a measure of strength of evidence, or to an assessment based on expert knowledge or a technical device
  • The ranking might also represent some measurement taken on the objects which might not be comparable across the lists, for instance because of different assessment technologies or levels of measurement error
  • Our aim is
      • to consolidate such lists of common objects
      • to provide computationally tractable solutions, hence appropriate algorithms and graphs
      • to develop an R package named TopkLists

SLIDE 3

General assumptions

  • Let us assume ℓ assessors or laboratories (j = 1, 2, ..., ℓ) assigning rank positions to the same set of N distinct objects
  • Assessment of N distinct objects according to the extent to which a particular attribute is present
  • All assessors, independently of each other, rank the same objects between 1 and N on the basis of relative performance
  • The ranking is from 1 to N, without ties
  • Missing assessments are allowed
  • The ℓ assessors produce ℓ ranked lists τⱼ
  • There are (ℓ² − ℓ)/2 possible pairs of such lists τⱼ
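As a small illustration of the setup above, the following Python sketch stores ℓ ranked lists and enumerates their unordered pairs. The toy data and the names rank_lists and list_pairs are hypothetical and not part of TopkLists; the only point is that the number of pairs matches (ℓ² − ℓ)/2.

```python
from itertools import combinations

# Hypothetical toy data: 4 assessors ranking the same N = 5 objects (no ties).
# Each list gives object identifiers ordered from rank 1 to rank N.
rank_lists = {
    "tau_1": ["A", "B", "C", "D", "E"],
    "tau_2": ["B", "A", "C", "E", "D"],
    "tau_3": ["A", "C", "B", "D", "E"],
    "tau_4": ["C", "A", "B", "E", "D"],
}


def list_pairs(names):
    """Return all unordered pairs of list names."""
    return list(combinations(sorted(names), 2))


pairs = list_pairs(rank_lists)
n_lists = len(rank_lists)
expected_pairs = (n_lists**2 - n_lists) // 2  # (l^2 - l) / 2 from the slide
print(len(pairs), expected_pairs)
```

Every later pairwise step (for instance choosing a cut-off per pair of lists) would iterate over such a collection of pairs.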

SLIDE 4

The problem

  • Our overall goal is to identify a subset of objects that is characterized by high conformity across the lists
  • It is implied that there is similarity between the rankings, which can be evaluated by a distance measure d (a permutation metric)
  • Such measures are
      • Kendall’s τ
      • Spearman’s footrule
  • In practice we have truncated lists and incomplete rankings of objects in some or all of the lists, caused by missing assignments
  • Because of that, penalized distance measures are required
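The two distance measures named above can be sketched in Python as follows. The first two functions assume complete lists over the same objects. The third is only one possible penalization for truncated lists: objects absent from a list are given a fixed penalty rank, which defaults to the longer list length plus one. That choice and all function names are assumptions made for illustration, not the penalization used in TopkLists.

```python
from itertools import combinations


def kendall_distance(list_a, list_b):
    """Kendall's tau distance: number of object pairs ordered differently."""
    pos_b = {obj: i for i, obj in enumerate(list_b)}
    return sum(
        1
        for x, y in combinations(list_a, 2)  # x precedes y in list_a
        if pos_b[x] > pos_b[y]               # but x follows y in list_b
    )


def footrule_distance(list_a, list_b):
    """Spearman's footrule: sum of absolute rank differences."""
    pos_a = {obj: i for i, obj in enumerate(list_a)}
    pos_b = {obj: i for i, obj in enumerate(list_b)}
    return sum(abs(pos_a[obj] - pos_b[obj]) for obj in pos_a)


def penalized_footrule(top_a, top_b, penalty_rank=None):
    """Footrule for truncated lists; absent objects get rank penalty_rank.

    Assumption: penalty_rank defaults to (longer list length + 1).
    """
    if penalty_rank is None:
        penalty_rank = max(len(top_a), len(top_b)) + 1
    rank_a = {obj: i + 1 for i, obj in enumerate(top_a)}
    rank_b = {obj: i + 1 for i, obj in enumerate(top_b)}
    universe = set(rank_a) | set(rank_b)
    return sum(
        abs(rank_a.get(obj, penalty_rank) - rank_b.get(obj, penalty_rank))
        for obj in universe
    )
```

For example, penalized_footrule(["A", "B", "C"], ["A", "C", "D"]) treats B as ranked 4th in the second list and D as ranked 4th in the first.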

SLIDE 5

The problem continued

  • In most applications, especially for large or huge numbers N of objects, it is unlikely that consensus prevails
  • As a result, only the top-ranked objects matter (the remaining ones show random ordering)
  • Quite often we observe a general decrease, not necessarily monotone, of the probability for consensus rankings with increasing distance from the top rank position
  • Typically there is reasonable conformity in the rankings for the first, say k, elements of the lists
  • This motivates the notion of top-k rank lists as known from the information retrieval literature
  • Important application field: integration and meta-analysis of gene expression data (microarray experiments)

SLIDE 6

Computational aspects and algorithms

  • List aggregation by means of brute force is limited to the situation where
      • N is very small
      • ℓ is very small
      • the k’s are equal and a priori known
  • Our purpose is to solve this computational problem for a realistic setting
  • There are 3 subtasks with corresponding algorithms:
      1. Selection of the k̂’s for all possible pairs of lists τⱼ
      2. Integration of partial information from the pairs of lists via a graphical tool
      3. Calculation of a set of objects characterized by rankings of high conformity across the lists up to some global index k̄
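To see why brute force only works for very small N, one can count the ordered top-k candidate lists that an exhaustive search would have to score, namely N!/(N − k)!. The sketch below uses Python's math.perm for this count; the function name n_candidate_lists is hypothetical, and the values N = 120 and k = 20 are taken from the gene expression example later in the talk.

```python
import math


def n_candidate_lists(n_objects, k):
    """Number of ordered top-k lists that can be formed from n_objects."""
    return math.perm(n_objects, k)  # n_objects! / (n_objects - k)!


# Tiny case: 5 objects, top-2 lists.
print(n_candidate_lists(5, 2))

# Setting of the gene expression example: N = 120, k = 20.
print(f"{n_candidate_lists(120, 20):.3e}")
```

Since each of the 20 factors in the second count exceeds 100, the result is larger than 10^40, which rules out enumeration.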

SLIDE 7

Selection of the k̂’s

  • Moderate deviation-based inference for random degeneration in paired rank lists (Hall and Schimek, 2009)
  • For the estimation of the point of degeneration j₀ into noise, independent Bernoulli random variables are assumed
  • A general decrease of the probability pⱼ (need not be monotone) for concordance of rankings with increasing distance j from the top rank is assumed
  • Several tuning parameters (δ, ν, ...) are required to account for the closeness of the assessors’ rankings and the degree of randomness in the assignments
  • The algorithm represents a simplified mathematical model; it is embedded in an iterative scheme to account for irregular rankings
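The idea behind the selection of k̂ can be illustrated with a strongly simplified Python sketch. It is not the Hall and Schimek algorithm: it omits the moderate deviation bounds and the iterative scheme, and it uses a fixed threshold of 0.5, which is an assumption. Concordance at position j is coded as a 0/1 indicator (rank difference at most δ), and the estimate is the number of positions before the first window of ν indicators whose mean drops to the threshold. All function and parameter names are hypothetical.

```python
def concordance_indicators(list_a, list_b, delta):
    """I_j = 1 if the object at rank j of list_a lies within delta
    positions of rank j in list_b, else 0 (also 0 if it is missing)."""
    pos_b = {obj: i for i, obj in enumerate(list_b)}
    indicators = []
    for j, obj in enumerate(list_a):
        other = pos_b.get(obj)  # None if obj is missing from list_b
        concordant = other is not None and abs(other - j) <= delta
        indicators.append(1 if concordant else 0)
    return indicators


def estimate_k(indicators, nu, threshold=0.5):
    """Return the number of leading positions before the first window of
    length nu whose mean concordance is <= threshold.

    Returns len(indicators) if no such window exists.
    Assumption: a fixed threshold replaces the moderate deviation bound.
    """
    for j in range(len(indicators) - nu + 1):
        window = indicators[j:j + nu]
        if sum(window) / nu <= threshold:
            return j
    return len(indicators)


# Hypothetical indicator sequence: concordant for 10 positions, then noise-free zeros.
example = [1] * 10 + [0] * 10
print(estimate_k(example, nu=4))
```

With this forward-looking window the estimate falls somewhat before the true change point (8 instead of 10 in the example), which hints at why the choice of ν matters.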

SLIDE 8

Graphical integration of paired ranked lists

  • Define a partial reference list L₁⁰: any one of the 2 lists with maxⱼ(k̂ⱼ) objects among all pairwise comparisons
  • L₁⁰ gives the ordering of the objects Oᵢ in the heatmap and defines the vertical axis
  • Take L₁⁰’s highest ranking {maxⱼ(k̂ⱼ) + δ} objects Oᵢ
  • The partial lists L₂, L₃, ..., L_ℓ are ordered from highest to lowest by their individual kⱼ when compared to the reference list L₁⁰ (one column per list)
  • In each cell we represent:
      (1) top-k membership: ’yes’ is denoted by color ’grey’ and ’no’ by ’white’
      (2) distance of a current object Oᵢ ∈ L₁⁰ from its position in the other list: color scale from ’red’ (identical) to ’yellow’ (far distant); the integer value denotes the distance, with negative sign if to the left and positive sign if to the right
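The content of one heatmap column can be sketched in Python as below. Two conventions here are assumptions, since the slide does not pin them down: the signed distance is computed as (rank in the other list) minus (rank in the reference list), and top-k membership is judged by the object's rank in the other list against that list's cut-off. The function name heatmap_column and its parameters are hypothetical.

```python
def heatmap_column(reference, other, k_other):
    """For each object of the reference list, return a tuple
    (signed distance or None, top-k membership flag) relative to `other`."""
    rank_other = {obj: i + 1 for i, obj in enumerate(other)}
    cells = []
    for ref_rank, obj in enumerate(reference, start=1):
        other_rank = rank_other.get(obj)
        if other_rank is None:
            cells.append((None, False))  # would be shown as NA
        else:
            distance = other_rank - ref_rank       # assumed sign convention
            in_top_k = other_rank <= k_other       # assumed membership rule
            cells.append((distance, in_top_k))
    return cells


# Hypothetical reference list and comparison list with cut-off 2.
print(heatmap_column(["A", "B", "C", "E"], ["B", "A", "D", "C"], k_other=2))
```

A plotting layer would then map the absolute distance to the red-to-yellow scale and the membership flag to grey or white.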

SLIDE 9

Calculation of a set of highly conforming objects

  • Cross-entropy Monte Carlo (CEMC) for consolidation of top-k objects (Lin and Ding, 2009)
  • Assume a random matrix X and a corresponding probability matrix p
  • Given the probability mass function Pᵥ(x), any realization x of X uniquely determines the corresponding top-k candidate list without reference to the probability matrix p
  • Stochastic search to find an ordering x* that corresponds to an optimal τ* satisfying the minimization criterion
  • Iterative CEMC algorithm in two steps:
      (i) simulation step, in which random samples from Pᵥ(x) are drawn
      (ii) update step for improved samples, increasingly concentrating around an x* (corresponding to the optimal τ*)
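The two-step iteration above can be sketched in Python as a generic cross-entropy search. This is a minimal illustration, not the Lin and Ding implementation: the objective (sum of penalized footrule distances with penalty rank k + 1), the sequential sampling without replacement, the elite fraction, and the smoothing weight are all assumptions, as are the function names. The input lists are assumed to be already truncated top-k lists.

```python
import random


def footrule_to_lists(candidate, input_lists, penalty_rank):
    """Sum of penalized footrule distances from candidate to every input list."""
    cand_rank = {obj: i + 1 for i, obj in enumerate(candidate)}
    total = 0
    for lst in input_lists:
        lst_rank = {obj: i + 1 for i, obj in enumerate(lst)}
        for obj in set(cand_rank) | set(lst_rank):
            total += abs(cand_rank.get(obj, penalty_rank)
                         - lst_rank.get(obj, penalty_rank))
    return total


def sample_candidate(objects, prob, k, rng):
    """Draw an ordered top-k list: position r picks a still unused object
    with probability proportional to prob[r][object]."""
    chosen = []
    for r in range(k):
        free = [o for o in objects if o not in chosen]
        weights = [prob[r][o] for o in free]
        chosen.append(rng.choices(free, weights=weights, k=1)[0])
    return chosen


def cemc_top_k(input_lists, k, n_samples=200, elite_frac=0.1,
               smooth=0.5, n_iter=30, seed=0):
    """Cross-entropy style stochastic search for a consensus top-k list."""
    rng = random.Random(seed)
    objects = sorted({o for lst in input_lists for o in lst})
    penalty_rank = k + 1
    # Probability matrix p: one row per position, uniform at the start.
    prob = [{o: 1.0 / len(objects) for o in objects} for _ in range(k)]
    best, best_score = None, float("inf")
    n_elite = max(1, int(elite_frac * n_samples))
    for _ in range(n_iter):
        # (i) Simulation step: draw candidate lists from the current p.
        samples = [sample_candidate(objects, prob, k, rng)
                   for _ in range(n_samples)]
        scored = sorted(
            ((footrule_to_lists(s, input_lists, penalty_rank), s)
             for s in samples),
            key=lambda t: t[0],
        )
        if scored[0][0] < best_score:
            best_score, best = scored[0]
        # (ii) Update step: move p towards the elite (lowest-score) samples.
        elite = [s for _, s in scored[:n_elite]]
        for r in range(k):
            for o in objects:
                freq = sum(1 for s in elite if s[r] == o) / n_elite
                prob[r][o] = (1 - smooth) * prob[r][o] + smooth * freq
    return best, best_score
```

Calling cemc_top_k([["A", "B", "C"], ["A", "B", "D"], ["B", "A", "C"]], k=3) returns the best list seen and its score; because the search is stochastic, the result depends on the seed and the tuning values.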

SLIDE 10

Graphics tool example: top-k integration of 5 gene expression lists (N = 120, k̂ⱼ ∈ [20, 38])

[Figure: heatmap of the pairwise list comparisons, with a color scale legend from 0.1 to 1. Four blocks with reference lists L3 (columns L1, L5, L2, L4), L1 (columns L5, L2, L4), L4 (columns L2, L5) and L2 (column L5); rows are rank positions 1 to 38; cells show signed integer rank distances or NA.]

SLIDE 11

Simulation with 2 lists and k=100, δ=40, N = 1000: Estimation of k for different ν

[Figure: two panels. (1) k_hat plotted against nu. (2) cumsum plotted against j, with a legend over nu = 4, 10, 20, ..., 100.]

SLIDE 12

Simulation with 5 lists and k=10, Spearman’s footrule (blue) vs. Kendall’s τ (red), N = 100: Top selected genes (objects)

[Figure: six panels, each with axis ticks 1 to 15 and 50, 100. Panel titles: delta=15, delta=20, delta=25, delta=15, delta=22, delta=25.]

SLIDE 13

Summary

The TopkLists package

  • is implemented in R, applying the grid package
  • is a computationally tractable and efficient approach to the top-k rank list problem
  • implements the Hall & Schimek algorithm
  • implements the Lin & Ding algorithm
  • implements graphical procedures for information integration
  • allows the user to interact with the data and to select an overall top-k set of objects
  • allows monitoring of the aggregation process
  • allows evaluation of the tuning parameter choice
