COMPARISON OF CATEGORICAL PROPERTIES OFFERED BY MULTIPLE MOOC PLATFORMS



slide-1
SLIDE 1

COMPARISON OF CATEGORICAL PROPERTIES OFFERED BY MULTIPLE MOOC PLATFORMS

Using a Web Crawler in Python with Scrapy

Bachelor Thesis - Final Presentation Louis Mbuyu Aufgabensteller: Prof. Dr. François Bry Betreuer: Prof. Dr. François Bry, Yingding Wang

1

12.04.18

slide-2
SLIDE 2
2

AGENDA

  • 1. Introduction / Goal
  • 2. Defining MOOC model
  • 3. Web Scraper / Results
  • 4. Gold standard selection
  • 5. Text categorisation approach
  • 6. Gold standard evaluation
  • 7. Evaluation of all platforms
  • 8. Conclusion & Future work

slide-3
SLIDE 3

3

  • 1. Introduction / Goal
slide-4
SLIDE 4

4

Motivation

  • Irom - Intelligent Recommender Of MOOCs
  • MOOC - Massive Open Online Course

The goal of Irom

  • To improve learning and studying at the university.
  • To develop an intelligent MOOC search engine.

Goal of thesis

  • Define a unified set of categories across all MOOC platforms.

slide-5
SLIDE 5

5

Motivation

slide-6
SLIDE 6

6

Modified Goal

slide-7
SLIDE 7

7

Tasks

  • 1. Define a MOOC model
  • 2. Build a Web Scraper and extract data
  • 3. Select a platform as the “Gold standard”
  • 4. Text categorisation approach (TF-IDF & cosine similarity)
  • 5. Evaluate the TF-IDF and cosine similarity approach
  • 6. Categorise courses from other platforms
  • 7. Evaluate the results
slide-8
SLIDE 8

8

  • 2. Defining MOOC model
slide-9
SLIDE 9

9

slide-10
SLIDE 10

10

slide-11
SLIDE 11

11

Motivation - MOOC platforms

slide-12
SLIDE 12

12

Unified MOOC Model (Table)

                   Coursera  Udacity  Edx  FutureLearn  Open2Study  Udemy
Url                ✓         ✓        ✓    ✓            ✓           ✓
Title              ✓         ✓        ✓    ✓            ✓           ✓
Summary            ✓         ✓        ✓    ✓            ✓           ✓
Description        ✓         ✓        ✓    ✓            ✓           ✓
Subcategory        ✓         ✘        ✘    ✘            ✓           ✓
Category           ✓         ✓        ✓    ✓            ✓           ✓
WhyTakeThisCourse  ✓         ✓        ✓    ✓            ✓           ✓
Provider           ✓         ✓        ✓    ✓            ✓           ✓
Level              ✓         ✓        ✓    ✘            ✘           ✓
ImageUrl           ✓         ✓        ✓    ✓            ✓           ✓
Price              ✓         ✓        ✓    ✓            ✓           ✓
Duration           ✓         ✓        ✓    ✓            ✓           ✓
RatingValue        ✓         ✘        ✓    ✓            ✓           ✓
RatingAmount       ✓         ✘        ✓    ✓            ✓           ✓
StartDate          ✓         ✘        ✓    ✓            ✓           ✘
EndDate            ✘         ✘        ✘    ✓            ✓           ✘
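The unified model in the table could be represented roughly as follows. This is a minimal sketch, not the thesis code: the class name `Course`, the snake_case attribute names and all Python types are assumptions. Only the set of fields comes from the table, and fields that some platforms do not provide are made optional.

```python
from dataclasses import dataclass, asdict, fields
from typing import Optional


@dataclass
class Course:
    """Hypothetical record for one scraped course (field set taken from the table)."""
    # Fields available on all six platforms.
    url: str
    title: str
    summary: str
    description: str
    category: str
    why_take_this_course: str
    provider: str
    image_url: str
    price: str          # type is an assumption
    duration: int       # type is an assumption
    # Fields missing on at least one platform, hence optional.
    subcategory: Optional[str] = None
    level: Optional[str] = None
    rating_value: Optional[float] = None
    rating_amount: Optional[int] = None
    start_date: Optional[str] = None
    end_date: Optional[str] = None


def missing_fields(course: Course) -> list:
    """Return the names of optional fields the platform did not supply."""
    return [f.name for f in fields(course) if getattr(course, f.name) is None]
```

A record from a platform without subcategories would simply leave `subcategory` as `None`, which is the gap the text categorisation approach is meant to fill.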

slide-13
SLIDE 13

13

  • 3. Web Scraper / Results
slide-14
SLIDE 14

14

Scraped Data (Table)

                         Coursera  Udacity  Edx    FutureLearn  Open2Study  Udemy   All
Total number of courses  3,032     232      1,098  193          49          40,003  44,607

slide-15
SLIDE 15

15

  • 4. Gold standard selection
slide-16
SLIDE 16

16

Gold standard criteria

  • 1. Number of categories
  • 2. Number of courses
  • 3. Diversity
  • 4. Represent University Subjects
slide-17
SLIDE 17

17

Gold standard elimination process

                   Coursera  Udacity  Edx  FutureLearn  Open2Study  Udemy
No. of categories  ✓         ✓        ✘    ✓            ✘           ✓
No. of courses     ✓         ✘        ✓    ✘            ✘           ✓
Diversity          ✓         ✘        ✓    ✓            ✓           ✓
University rep.    ✓         ✘        ✓    ✓            ✓           ✘
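The elimination can be read as "keep only the platforms that pass every criterion". A small sketch of that reading, with the pass/fail values copied from the table (the function and variable names are illustrative only):

```python
PLATFORMS = ["Coursera", "Udacity", "Edx", "FutureLearn", "Open2Study", "Udemy"]

# One row per criterion, one boolean per platform, in the order of PLATFORMS.
CRITERIA = {
    "No. of categories": [True, True, False, True, False, True],
    "No. of courses":    [True, False, True, False, False, True],
    "Diversity":         [True, False, True, True, True, True],
    "University rep.":   [True, False, True, True, True, False],
}


def surviving_platforms(platforms, criteria):
    """Return the platforms that satisfy all criteria."""
    return [
        name
        for index, name in enumerate(platforms)
        if all(row[index] for row in criteria.values())
    ]


print(surviving_platforms(PLATFORMS, CRITERIA))  # ['Coursera']
```

Under this reading Coursera is the only platform left, which matches its use as the gold standard in the following slides.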

slide-18
SLIDE 18

18

Gold standard structure

slide-19
SLIDE 19

19

Gold standard structure

slide-20
SLIDE 20

20

Gold standard structure

slide-21
SLIDE 21

21

  • 5. Text categorisation approach
slide-22
SLIDE 22

22

Text categorisation approach (Step 1)

Query database: ‘Platform’ = ‘Coursera’ AND GROUP BY ‘Subcategory’

MongoDB
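The query on this slide could be expressed as a MongoDB aggregation pipeline. The sketch below assumes the documents carry fields literally named `Platform` and `Subcategory`, as written on the slide; the real collection schema is not shown. The pure-Python function does the same grouping on an in-memory list, so the idea can be tried without a database.

```python
from collections import defaultdict

# Aggregation pipeline: filter to one platform, then collect the full
# course documents per subcategory. With pymongo this would be passed to
# collection.aggregate(...); the collection itself is not shown in the slides.
GROUP_BY_SUBCATEGORY_PIPELINE = [
    {"$match": {"Platform": "Coursera"}},
    {"$group": {"_id": "$Subcategory", "courses": {"$push": "$$ROOT"}}},
]


def group_by_subcategory(courses, platform="Coursera"):
    """In-memory equivalent: map each subcategory to its list of courses."""
    grouped = defaultdict(list)
    for course in courses:
        if course.get("Platform") == platform:
            grouped[course.get("Subcategory")].append(course)
    return dict(grouped)
```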

slide-23
SLIDE 23

23

Text categorisation approach (Step 2)

subcategories: Finance, Marketing, Algorithms, …, Subcategory m

Each subcategory: course 1, course 2, …, course n

Array of courses (JSON object)

slide-24
SLIDE 24

24

Text categorisation approach (Step 3)

subcategories: Finance, Marketing, …, Subcategory m

Each subcategory: course 1, course 2, …, course n

Array of courses (String)

Iterate through the courses; for each one, extract and combine the ‘title’, ‘Summary’ and ‘Description’.

slide-25
SLIDE 25

25

Text categorisation approach (Step 4)

subcategories: Finance, Marketing, …, Subcategory m

Each subcategory: course 1, course 2, …, course n

Combined array of courses (String)

Join each subcategory's list of strings into one string:

Finance: “Intro into Finance. This course …”
Marketing: “Marketing 101. Learn fundamentals …”
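Steps 3 and 4 can be sketched together: combine the text fields of each course, then join all courses of a subcategory into one string. The lowercase keys `title`, `summary` and `description` are an assumption (the slides mix capitalisations), and both function names are illustrative.

```python
TEXT_FIELDS = ("title", "summary", "description")  # assumed key names


def course_text(course):
    """Step 3: combine the text fields of a single course into one string."""
    parts = (course.get(key) or "" for key in TEXT_FIELDS)
    return " ".join(part.strip() for part in parts if part.strip())


def subcategory_documents(grouped):
    """Step 4: join all course strings of each subcategory into one string.

    `grouped` maps a subcategory name to its list of course dicts.
    """
    return {
        subcategory: " ".join(course_text(course) for course in courses)
        for subcategory, courses in grouped.items()
    }
```

The result is one (possibly very long) text document per subcategory, which is what the later similarity comparison operates on.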

slide-26
SLIDE 26

26

Text categorisation approach (Step 5)

subcategories: Finance, Marketing, …, Subcategory m

Preprocessed combined array of courses

Preprocess data:

  • Remove all stop words and punctuation
  • Convert all words to lowercase
  • Stem all words

Finance: “intro financ cours …”
Marketing: “market learn fundamental …”
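A minimal sketch of the preprocessing step. The slides do not name the stop-word list or the stemmer, so both are stand-ins here: `STOP_WORDS` is a tiny hand-picked set and `crude_stem` is a deliberately naive suffix stripper, not a real stemming algorithm. In practice a proper stemmer (for example a Porter stemmer) would be passed in via the `stem` parameter.

```python
import re

# Tiny illustrative stop-word list (assumption, not the list used in the thesis).
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "this", "to", "into", "of", "in", "is", "for"}
)


def crude_stem(word):
    """Naive stand-in for a stemmer: strip one common suffix if enough remains."""
    for suffix in ("ing", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, stop_words=STOP_WORDS, stem=crude_stem):
    """Lowercase, drop punctuation and stop words, then stem each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(stem(token) for token in tokens if token not in stop_words)


print(preprocess("Intro into Finance. This course"))  # intro financ cours
```

Keeping the stemmer pluggable means the same function can be reused unchanged for the subcategory documents and for the query course.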

slide-27
SLIDE 27

27

Text categorisation approach (Step 6)

Course (query) from another platform that needs to be categorised

{ “title”: String, “courseUrl”: String, “imageUrl”: String, “description”: String, “duration”: Int, “category”: String, … }

Extract and combine the ‘title’, ‘Summary’ and ‘Description’.

Preprocess data: remove all stop words and punctuation, convert all words to lowercase, stem all words.

slide-28
SLIDE 28

28

Text categorisation approach (Step 6)

Course(s)

{ “title”: String, “courseUrl”: String, “imageUrl”: String, “description”: String, “duration”: Int, “category”: String, … }

Subcategories (String): Finance, Marketing, …, Subcategory m

TF-IDF and cosine similarity

Calculate TF-IDF and cosine similarity for all subcategories. The course is assigned to the subcategory with the highest value.
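The categorisation step can be sketched from scratch as below. This is one plausible reading, not the thesis implementation: it uses length-normalised term frequency and unsmoothed IDF, `log(N / n_t)`, which is an assumption because the slides do not state the variant. Inputs are already-preprocessed token lists, and all names are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Build a sparse TF-IDF vector per document.

    `documents` maps a subcategory name to its list of tokens.
    Returns (vectors, idf), where idf maps each term to log(N / n_t).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = {}
    for name, tokens in documents.items():
        counts = Counter(tokens)
        vectors[name] = {t: (c / len(tokens)) * idf[t] for t, c in counts.items()}
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is zero)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def categorise(query_tokens, documents):
    """Return (best_subcategory, scores) for a preprocessed query course."""
    vectors, idf = tfidf_vectors(documents)
    # Terms never seen in any subcategory carry no information and are dropped.
    counts = Counter(t for t in query_tokens if t in idf)
    total = sum(counts.values())
    query_vec = {t: (c / total) * idf[t] for t, c in counts.items()}
    scores = {name: cosine(query_vec, vec) for name, vec in vectors.items()}
    return max(scores, key=scores.get), scores
```

Note that when every score is zero, `max` simply returns the first subcategory, so a real pipeline would probably want an explicit "uncategorised" fallback.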

slide-29
SLIDE 29

29

TF-IDF & Cosine Similarity

TF-IDF - term frequency-inverse document frequency

Term frequency - how frequently a term appears in a given document.

Inverse document frequency - diminishes the weight of terms that appear very frequently in the corpus and increases the weight of terms that appear rarely.

Cosine similarity - a measure of similarity between two vectors, given by the cosine of the angle between them.
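One common textbook formulation of these quantities is given below. The slides do not state which TF or IDF variant was used, so the exact weighting is an assumption. Here f_{t,d} is the number of times term t occurs in document d, N is the number of documents, n_t is the number of documents containing t, and u, v are the TF-IDF vectors of two documents.

```latex
\begin{align*}
\mathrm{tf}(t, d) &= \frac{f_{t,d}}{\sum_{t'} f_{t',d}} \\
\mathrm{idf}(t) &= \log \frac{N}{n_t} \\
\text{tf-idf}(t, d) &= \mathrm{tf}(t, d) \cdot \mathrm{idf}(t) \\
\cos(\mathbf{u}, \mathbf{v}) &= \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
\end{align*}
```

Because TF-IDF weights are non-negative, the cosine similarity lies between 0 (no shared weighted terms) and 1 (same direction).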

slide-30
SLIDE 30

30

  • 6. Approach evaluation
slide-31
SLIDE 31

31

Approach evaluation

{ “title”: String, “courseUrl”: String, “imageUrl”: String, “description”: String, “duration”: Int, “category”: String, … }

Coursera courses → TF-IDF and cosine similarity approach → New category

slide-32
SLIDE 32

32

Approach evaluation

Gold standard accuracy = 2625 / 3032 ≈ 0.87

Accuracy - the ratio of correctly categorised courses to the total number of courses.
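The accuracy figure can be reproduced directly from the two counts on the slide. The function name is illustrative.

```python
def accuracy(correct, total):
    """Ratio of correctly categorised courses to all courses."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


gold_standard_accuracy = accuracy(2625, 3032)
print(f"{gold_standard_accuracy:.2f}")  # 0.87
```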

slide-33
SLIDE 33

33

  • 7. Evaluation of all platforms
slide-34
SLIDE 34

34

Evaluation of all platforms (Udacity)

  • Intro. to Android

Category: Android

→ TF-IDF and cosine similarity approach → Category: Computer Science

Good or bad outcome?

slide-35
SLIDE 35

35

Evaluation of all platforms (Udacity)

Figure labels: Gold standard (Coursera) categories; Udacity categories

slide-36
SLIDE 36

36

Evaluation of all platforms (Udacity)

*The heat map shows the percentage of courses assigned to each category; darker colours indicate a higher percentage.
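The percentages behind such a heat map could be computed as sketched below. It assumes the values are normalised per original (Udacity) category, so each row sums to 100; the slides do not say how the normalisation was done. The function name is illustrative.

```python
from collections import Counter, defaultdict


def category_percentages(assignments):
    """Compute, per original category, the share of courses per assigned category.

    `assignments` is an iterable of (original_category, assigned_category) pairs.
    Returns {original: {assigned: percentage}} with each row summing to 100.
    """
    counts = defaultdict(Counter)
    for original, assigned in assignments:
        counts[original][assigned] += 1
    return {
        original: {
            assigned: 100.0 * n / sum(row.values()) for assigned, n in row.items()
        }
        for original, row in counts.items()
    }
```

Each inner dictionary corresponds to one row of the heat map, with larger values drawn in darker colours.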

slide-37
SLIDE 37

37

Grading schema

slide-38
SLIDE 38

38

Evaluation of all platforms (Udacity)

Udacity evaluation table

slide-39
SLIDE 39

39

Evaluation of all platforms (Udacity)

Udacity courses distribution (Pie Chart)

slide-40
SLIDE 40

40

Evaluation of all platforms (Udacity)

Udacity courses distribution (Table)

slide-41
SLIDE 41

41

Evaluation of all platforms (Edx)

slide-42
SLIDE 42

42

  • 8. Conclusion & Future work
slide-43
SLIDE 43

43

Conclusion

  • 1. Approx. 45,000 courses scraped and indexed for IROM.
  • 2. Using Coursera’s categories as the gold standard led to a great outcome.
  • 3. The TF-IDF and cosine similarity approach also produced a positive outcome.
slide-44
SLIDE 44

44

Future work

  • 1. Measure the quality of the scraped data
  • 2. Better approach - machine learning (neural networks, etc.)
  • 3. Evaluating text categorisation
slide-45
SLIDE 45

45

Thank you.