YouTube Revisited: On the Importance of Correct Measurement Methodology



SLIDE 1

YouTube Revisited: On the Importance of Correct Measurement Methodology

Ossi Karkulahti, Jussi Kangasharju
University of Helsinki
www.helsinki.fi/yliopisto

SLIDE 2

Introduction

  • Measuring large systems is challenging
  • Full system analysis is expensive, so we resort to sampling
  • The way sampling is conducted affects the results
  • Ideally, the sample is random and representative
  • Technological limitations may skew the sampling process
  • A biased sample may yield incorrect conclusions
  • It could also affect any derivative work
  • We will show the effects of three different sampling methods on YouTube

SLIDE 3

Motivation

  • Previous YouTube video metadata collection relied on:
  • selecting videos belonging to certain categories
  • crawling related videos
  • using the most recent videos
  • We argue that all these methods lead to a biased sample
  • The results are not representative in all aspects
  • Other work bases its assumptions on these results

SLIDE 4

Our Contributions

  • We have collected three datasets with three methods
  • We compare the methods for collecting YouTube video metadata
  • We demonstrate the differences in various metrics between the different datasets

SLIDE 5

Data Collection

  • We have collected metadata by three different methods:
  • 1. Most recent videos (MR)
  • 2. Related videos (BFS)
  • 3. Random string (RS)
  • A fourth method is to use videos from a certain category, which is obviously biased
  • M. Cha, H. Kwak, P. Rodriguez, Y.-Y. Ahn, and S. Moon. I tube, you tube, everybody tubes: Analyzing the world’s largest user generated content video system. IMC, 2007.

SLIDE 6

1. Most Recent Videos (MR)

  • Periodically collect metadata of the most recent videos
  • Included information: video ID, view count, length, category, publish date, etc.
  • Obviously limited to new videos
  • Previously used by, e.g.:
  • X. Cheng, J. Liu, and C. Dale. Understanding the characteristics of internet short video sharing: A YouTube-based measurement study. IEEE Transactions on Multimedia, 2013.
  • G. Szabo and B. A. Huberman. Predicting the popularity of online content. Communications of the ACM, 2010.

SLIDE 7

2. Related Videos (BFS)

  • Select a video ID, request its related videos, then the related videos of all those videos, and so on
  • We limited related videos to 50 per video
  • In theory, one seed yields ~125,000 videos (50x50x50)
  • The number of unique videos is lower because the related videos overlap
  • Can be seen as similar to breadth-first search (BFS)
  • Fast: most of the time one query returns metadata of tens of videos
  • X. Cheng, J. Liu, and C. Dale. Understanding the characteristics of internet short video sharing: A YouTube-based measurement study. IEEE Transactions on Multimedia, 2013.

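The related-videos crawl can be sketched as a breadth-first search with a depth limit. This is only an illustration under stated assumptions: `get_related` is a hypothetical stand-in for whatever API call returns related video IDs, and `TOY_GRAPH` is a hand-made graph, not real YouTube data. It shows why the number of unique videos stays below the theoretical 50x50x50 bound when related lists overlap.

```python
from collections import deque

def bfs_crawl(seed, get_related, max_depth=3, per_video=50):
    """Collect the video IDs reachable from `seed` within `max_depth` hops."""
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        video_id, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand videos at the last level
        # Keep at most `per_video` related videos, as on the slide (50).
        for related_id in get_related(video_id)[:per_video]:
            if related_id not in seen:  # overlapping results are counted once
                seen.add(related_id)
                queue.append((related_id, depth + 1))
    return seen

# Hypothetical stand-in for the API: a tiny related-videos graph.
TOY_GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "d"],
    "d": ["e"],
    "e": [],
}

def toy_related(video_id):
    return TOY_GRAPH.get(video_id, [])

crawled = bfs_crawl("a", toy_related, max_depth=2)
print(len(crawled), "unique videos; third-level bound with 50 per video:", 50 ** 3)
```

Because "b" and "c" share most of their related videos, the toy crawl finds fewer unique IDs than the number of related-video entries it reads.
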
SLIDE 8

3. Random Strings (RS)

  • Zhou et al. have used a similar method to estimate YouTube’s size (“Counting YouTube Videos via Random Prefix Sampling”, IMC 2011)
  • Generate a random character string and ask the API to return videos whose IDs include the string
  • Characters are drawn from ‘a-Z’, ‘0-9’, ‘-’, ‘_’; four-character strings work the best
  • On average, a random string matched 6.9 video IDs
  • For an unknown reason, the returned IDs include ‘-’

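Generating the query strings might look like the sketch below. It assumes the alphabet listed on the slide (letters in both cases, digits, ‘-’ and ‘_’) and the four-character length; `random_query` is a hypothetical helper name, and the API request itself is deliberately left out.

```python
import random
import string

# 26 + 26 letters, 10 digits, '-' and '_': 64 symbols in total.
ALPHABET = string.ascii_letters + string.digits + "-_"

def random_query(length=4, rng=random):
    """Draw a uniformly random string of `length` characters from ALPHABET."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))

query = random_query()
possible_queries = len(ALPHABET) ** 4  # number of distinct four-character strings
print(query, possible_queries)
```

Passing a seeded `random.Random` instance as `rng` makes a sampling run reproducible.
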
SLIDE 9

3. Random Strings (RS)

A random string w57j would match and return metadata for the following videos:

W57J-21gSSo XcY-W57J-Uo w57j-VVNAg0 W57J-msuors

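In the four example IDs, the query appears as a complete ‘-’-delimited part of the ID, regardless of case. The sketch below only checks that observation against the IDs on the slide; it is a guess that fits these examples, not a documented description of how the API matches strings, and `matches_as_token` is a hypothetical helper.

```python
# The four video IDs listed on the slide for the query "w57j".
EXAMPLE_IDS = ["W57J-21gSSo", "XcY-W57J-Uo", "w57j-VVNAg0", "W57J-msuors"]

def matches_as_token(video_id, query):
    """True if `query` equals, ignoring case, one '-'-delimited part of the ID."""
    parts = (part.lower() for part in video_id.split("-"))
    return query.lower() in parts

all_match = all(matches_as_token(video_id, "w57j") for video_id in EXAMPLE_IDS)
print(all_match)
```
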
SLIDE 10

Datasets

Dataset   Method               Time period            N
MR-09     Most recent videos   Summer 2009            9,405
MR-11     Most recent videos   Summer 2011            8,766
MR-14     Most recent videos   Late 2013-early 2014   10,000
RS        Random ID            Early 2014             ~5 million
BFS       Related videos       Early 2014             ~5 million

SLIDE 11

Results

  • Popularity
  • Views
  • Age
  • Categories
  • Length

SLIDE 12

Popularity

  • RS and BFS: very different view count distributions
  • BFS has a two-part distribution with a quick-dropping tail
  • RS follows Zipf more closely, with a truncated tail
  • BFS data seems to over-estimate view counts
  • RS: top 10 -> 5 % of all views, top 1000 -> 43 %, top 10,000 -> 74 %
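
Shares such as "top 10 -> 5 % of all views" can be computed as in the sketch below. The view counts here are synthetic, drawn up as a rough Zipf-like sequence for illustration; they are not the RS or BFS data, and `top_k_share` is a hypothetical helper name.

```python
def top_k_share(view_counts, k):
    """Fraction of all views that belongs to the k most viewed videos."""
    ordered = sorted(view_counts, reverse=True)
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0

# Synthetic Zipf-like data: the video at rank r gets about 1,000,000 / r views.
synthetic_views = [1_000_000 // rank for rank in range(1, 100_001)]

for k in (10, 1000, 10_000):
    print(f"top {k}: {top_k_share(synthetic_views, k):.1%} of all views")
```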

SLIDE 13

Popularity after 30 days

  • MR and BFS seem to over-estimate video popularity
  • However, MR-09 resembles RS

SLIDE 14

Views

  • The 5th percentile of BFS is higher than the median of RS and MR
  • BFS view counts are at least one order of magnitude higher than the RS ones

SLIDE 15

Views

  • The median, 5th and 95th percentiles for BFS and RS over eight years
  • BFS’s median is most of the time two orders of magnitude higher than RS’s
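
The percentile comparison can be reproduced along these lines. The view counts are made up for illustration, and the "inclusive" interpolation method is an assumption, since the slides do not say how the percentiles were computed.

```python
import math
import statistics

def percentile_summary(view_counts):
    """Return the (5th percentile, median, 95th percentile) of the view counts."""
    cuts = statistics.quantiles(view_counts, n=100, method="inclusive")
    return cuts[4], cuts[49], cuts[94]

def orders_of_magnitude(a, b):
    """Number of powers of ten between a and b (both must be positive)."""
    return math.log10(a / b)

# Made-up view counts, for illustration only.
toy_rs = [3, 10, 25, 60, 150, 400, 1_000, 5_000]
toy_bfs = [900, 2_500, 8_000, 20_000, 75_000, 300_000, 1_000_000, 9_000_000]

rs_p5, rs_median, rs_p95 = percentile_summary(toy_rs)
bfs_p5, bfs_median, bfs_p95 = percentile_summary(toy_bfs)
print("BFS 5th percentile above RS median:", bfs_p5 > rs_median)
print("median gap in orders of magnitude:", orders_of_magnitude(bfs_median, rs_median))
```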

SLIDE 16

Age Distribution

  • BFS has fewer videos newer than two years, but a lot of very recent videos
  • The drop in RS is an artifact of the method
  • RS: 29 % of videos are newer than a year; the majority are newer than two years

SLIDE 17

Categories (share of videos)

  • Most common category by dataset:
  • RS: People & Blogs (default category for an upload)
  • BFS: Music
  • MR: News & Politics

SLIDE 18

Categories (share of views)

  • The distribution of views across categories is more similar between the datasets
  • Music videos get the most views
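
The difference between a category's share of videos and its share of views can be illustrated as follows. The sample records are invented, and `category_shares` is a hypothetical helper; the sketch only shows how one category can dominate by count while another dominates by views.

```python
from collections import Counter

def category_shares(videos):
    """Return (share of videos, share of views) per category.

    `videos` is an iterable of (category, view_count) pairs.
    """
    video_counts = Counter()
    view_counts = Counter()
    for category, views in videos:
        video_counts[category] += 1
        view_counts[category] += views
    n_videos = sum(video_counts.values())
    n_views = sum(view_counts.values())
    by_videos = {c: n / n_videos for c, n in video_counts.items()}
    by_views = {c: (v / n_views if n_views else 0.0) for c, v in view_counts.items()}
    return by_videos, by_views

# Invented records, for illustration only.
sample = [
    ("People & Blogs", 10),
    ("People & Blogs", 30),
    ("People & Blogs", 60),
    ("Music", 900),
]
share_of_videos, share_of_views = category_shares(sample)
print(share_of_videos)
print(share_of_views)
```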

SLIDE 19

Popularity based on Category

SLIDE 20

Video Length

  • RS and MR: the most common length is 60 s or less
  • BFS: the most common length is 3-5 min (music videos?)
  • All: videos 3-5 min long get the most views
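
Finding the most common length class amounts to binning the durations. In the sketch below only the "60 s or less" and "3-5 min" classes come from the slide; the other two bins, the bin edges and the sample durations are assumptions made for illustration.

```python
from bisect import bisect_left
from collections import Counter

BIN_EDGES = [60, 180, 300]  # upper edges in seconds (assumed, inclusive)
BIN_LABELS = ["<= 60 s", "1-3 min", "3-5 min", "> 5 min"]

def length_bin(seconds):
    """Map a duration in seconds to its length class."""
    return BIN_LABELS[bisect_left(BIN_EDGES, seconds)]

def most_common_bin(lengths):
    """Return the length class that contains the most videos."""
    counts = Counter(length_bin(seconds) for seconds in lengths)
    return counts.most_common(1)[0][0]

# Invented durations in seconds, for illustration only.
print(most_common_bin([30, 45, 200, 400]))
```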

SLIDE 21

Summary of the Methods

BFS                                   MR                      RS
Tends to over-estimate some metrics   Over-estimates views    Most ‘reliable’
Fast, up to 100 per query             Slow                    Not that fast, ~7 per query
Mostly popular music videos?          Limited to new videos   Mysterious ‘-’ curiosity
                                      Mostly news clips?

SLIDE 22

Conclusion 1/2

  • We have used YouTube as an example, with three data collection methods
  • The datasets differ in many key metrics that have been used in past research (MR, BFS)
  • RS has not previously been used in this manner
  • Differences between RS and the others raise questions about the general applicability of the previous results
  • We believe that RS produces a representative sample

SLIDE 23

Conclusion 2/2

  • As the BFS dataset demonstrates, even large datasets are not immune to bias introduced by the method
  • The data collection method can have a significant impact on the results
  • Whatever sampling method is selected, be aware of its properties and weaknesses
  • Be careful when adopting results from earlier work
  • Time to accept more reappraisal work?

SLIDE 24

Questions?