LARGE DATASETS rogier.kievit@mrc-cbu.cam.ac.uk/@rogierK
Outline • 1) What is big data? • 2) Why bother? • 3) Where to find large datasets • 4) Challenges, pitfalls and opportunities
Big data? • ‘The Australian Square Kilometre Array Pathfinder (ASKAP) project currently acquires 7.5 terabytes/second of sample image data, a rate projected to increase 100-fold to 750 terabytes/second (~25 zettabytes per year) by 2025’
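As a rough check on the quoted figures, the rate can be converted to a yearly volume. This is a back-of-the-envelope sketch that assumes decimal units (1 terabyte = 10^12 bytes, 1 zettabyte = 10^21 bytes) and continuous acquisition over a 365-day year; neither assumption is stated in the quote.

```python
# Back-of-the-envelope conversion of a data rate to a yearly volume.
# Assumptions (not from the quote): decimal units, continuous operation,
# and a 365-day year.

BYTES_PER_TERABYTE = 10**12
BYTES_PER_ZETTABYTE = 10**21
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def zettabytes_per_year(terabytes_per_second: float) -> float:
    """Convert a sustained rate in TB/s to a volume in ZB/year."""
    bytes_per_year = terabytes_per_second * BYTES_PER_TERABYTE * SECONDS_PER_YEAR
    return bytes_per_year / BYTES_PER_ZETTABYTE


current = zettabytes_per_year(7.5)
projected = zettabytes_per_year(750)
print(f"current:   {current:.2f} ZB/year")
print(f"projected: {projected:.2f} ZB/year")
```

Under these assumptions the projected rate comes to a little under 24 zettabytes per year, the same order of magnitude as the quoted ~25.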
Outline • 1) What is big data? • 2) Why bother? • 3) Where to find them • 4) Challenges, pitfalls and opportunities
‘Well if I found an effect in a small sample, then there must be something there right?’
Why bother • 1) It’s (almost) free • 2) More statistical power is always better • 3) Reproducible • 4) Generalizability/replicability • 5) Inspires new questions • 6) Look beyond your current domain • 7) Develop/apply/test new methods on real data
BF10 = 5.26 × 10^8315
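To make the link between sample size and statistical power concrete, here is a minimal sketch of the approximate power to detect a correlation, using the Fisher z approximation. The effect size (r = 0.1), the alpha level, and the sample sizes are illustrative assumptions, not values from the talk.

```python
from math import atanh, sqrt
from statistics import NormalDist


def correlation_power(r: float, n: int, alpha: float = 0.05) -> float:
    """Approximate two-sided power to detect a true correlation r with n cases.

    Uses the Fisher z transformation, under which atanh(sample r) is roughly
    normal with standard error 1 / sqrt(n - 3).
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha / 2)
    noncentrality = atanh(r) * sqrt(n - 3)
    # Probability that the test statistic falls in either rejection region.
    upper = 1 - std_normal.cdf(critical - noncentrality)
    lower = std_normal.cdf(-critical - noncentrality)
    return upper + lower


for n in (50, 500, 5000):  # hypothetical sample sizes
    print(n, round(correlation_power(0.1, n), 3))
```

With a small true effect, power is low at n = 50 and approaches 1 only in the thousands, which is the range the datasets in this talk operate in.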
Procedure • 1) Find suitable data • 2) Apply • 3) Wait • 4) (Wait some more) • 5) Data!
Open data types: Databases (cognitive neuro)
Dataset        Sample size   Cost      Age     Data
Biobank        500.000       2000 £    43-73   everything
ABCD           10000         free      9-11    cognitive, neural, mental health
HCP            1000          <£1000    21-35   cognitive, neural, mental health
IMAGEN         1500          free      0-3     cognitive, behavioural, some neural
PNC            800           free      11–17   cognitive, behavioural, neural
NKI Rockland   800           free      6-18    cognitive, behavioural, some neural
Reach out online/Google data
OASIS, ADNI, HABS, ENIGMA, and many more
Integration with other open science practices? • Data sharing: By definition • Preregistration: Possible • Reproducibility
Large public datasets in practice
• https://openpsychometrics.org/_rawdata/ • Freely downloadable • e.g.: Stress, anxiety, depression • N=48.000 in 5 seconds • (Demanding) model fits excellently • Personality and demographic covariates explained >50% (!) of the variance in depression/anxiety/stress
Jacobucci, R., Brandmaier, A. M., & Kievit, R. A. (2018). Variable selection in structural equation models with regularized MIMIC models. In press, AMPPS.
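‘Variance explained’ on this slide can be read as the usual R² statistic. The sketch below shows only that generic definition on made-up numbers; it is not the regularized MIMIC model from the cited paper, and the scores are hypothetical.

```python
def r_squared(observed, predicted):
    """Proportion of variance in `observed` accounted for by `predicted`."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_observed = sum(observed) / len(observed)
    ss_total = sum((y - mean_observed) ** 2 for y in observed)
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    if ss_total == 0:
        raise ValueError("observed values have no variance")
    return 1 - ss_residual / ss_total


# Hypothetical depression scores and model predictions (illustration only).
observed_scores = [10.0, 14.0, 20.0, 26.0, 30.0]
predicted_scores = [12.0, 15.0, 19.0, 24.0, 30.0]
print(round(r_squared(observed_scores, predicted_scores), 3))
```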
Big data by leveraging technical tools • ‘Math Garden’ • Incentivisation • Free ‘participation’ • Accessible through signed form • ‘improving an online practice environment for math, currently containing over a billion responses’
Brinkhuis, M., Savi, A., Hofman, A., Coomans, F., van der Maas, H., & Maris, G. (2018). Learning As It Happens: A Decade of Analyzing and Shaping a Large-Scale Online Learning System.
Survey of Health, Ageing and Retirement in Europe • Freely and easily available • N=111.000 (!) in 60 minutes • 6 waves • 27 European countries and Israel • Fit a series of complex growth models
Cognitive health • Immediate recall word list of 10 words • 0-10, 4 waves • Proportion remembered
Mental health • EURO-D scale • Depressive symptoms • Inverted so that higher scores -> better mental health
[Figure: Decline in mental health, Decline in memory; r=.94]
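The two scoring steps on this slide (proportion remembered, and inverting the depression scale) are simple transformations. A minimal sketch, assuming the 10-word list stated on the slide and a EURO-D range of 0 to 12; the range and the function names are assumptions for illustration, not taken from the slide.

```python
WORD_LIST_LENGTH = 10  # from the slide: word list of 10 words
EURO_D_MAX = 12        # assumption: EURO-D total score runs from 0 to 12


def proportion_remembered(words_recalled: int) -> float:
    """Turn a 0-10 recall count into a proportion between 0 and 1."""
    if not 0 <= words_recalled <= WORD_LIST_LENGTH:
        raise ValueError("recall count out of range")
    return words_recalled / WORD_LIST_LENGTH


def inverted_euro_d(score: int) -> int:
    """Flip the scale so that higher values mean better mental health."""
    if not 0 <= score <= EURO_D_MAX:
        raise ValueError("EURO-D score out of range")
    return EURO_D_MAX - score


print(proportion_remembered(7), inverted_euro_d(3))
```

Inverting the scale puts both measures in the same direction, so that a decline in either one shows up as a falling score.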
• Case study: Biobank • Age-related decline (3 waves) in fluid intelligence • Data access, acquisition: All excellent • But cognitive data • Not 3 waves • Not fluid reasoning • Self-paced • Ceiling effects • Floor effects • Easy to remember • No slope variance in N=160.000 • At the mercy of the data available
[Images: ‘Me preregistering’, ‘Me now’]
Kievit, R. A., Fuhrmann, D., Borgeest, G. S., Simpson-Kent, I. L., & Henson, R. N. (2018). The neural determinants of age-related changes in fluid intelligence: a pre-registered, longitudinal analysis in UK Biobank. Wellcome Open Research, 3.
• 1) Time • 2) Effort • 3) Requirements
Beyond CBU: 36 emails… 10 phone calls… Within CBU: 3 months…
- Anyone who shares an office with you has to sign an NDA to get a single signature.
- The computer cannot be on if anybody who has NOT signed the NDA is in the same room.
- The computer with the data cannot be connected to the internet or the CBU network.
- You have to enter a password every time you load the data.
Interim summary • Many benefits for researchers • Power, replication, precision • Widening access • Enrich existing paradigms • Learning/teaching data analysis • But… Where does this data come from?
18 citations
Cam-CAN data portal • 400 downloads • Managed access
Summary • There is an ocean of data out there • It can be your primary focus, or complement other (e.g. experimental) work • Benefits • Power • Generalisability • Extensions/scope • Challenges • Cost (modest) • Time/effort (negligible, relatively) • Suitability (low to high) • Slanted towards individual differences/epidemiological (some experimental exists!)
• Your data can, and where possible should, contribute to the ecosystem • Adapt your ethics forms to allow sharing • Don’t decide which data is valuable enough
Questions? rogier.kievit@mrc-cbu.cam.ac.uk/@rogierK