
SLIDE 1

Some Ideas for “Best Practice” in Scientific Computing

Dr Owain Kenway, (@owainkenway) UCL/ISD/RITS/RCAS/Team Leader

SLIDE 2

“Scientific Computing?”

“Doing science with computers”

– Generating data → Simulation
– Analysing data → Filtering, statistical analysis…
– Theorising about data → Machine learning/AI?

Not just science

– Arts/humanities

→ “Research Computing”

SLIDE 3

About Me

Been at UCL since 2005 (Computational Chemistry PhD)

Spent the last 8 or so years working in Research Computing in ISD

– Team Lead of Research Computing Applications and Support
– Look after users and applications on UCL ISD managed resources + design those services

SLIDE 4

UCL Research Computing Resources

UCL only services:

– Grace → High Performance Computing (HPC)
– Myriad, Legion → High Throughput Computing (HTC)
– Aristotle → Interactive teaching Linux service

National services:

– Thomas (Tier 2 MMM hub)
– Michael (Faraday Battery Institute)

Parallel

– Single job spans multiple nodes
– Tightly coupled parallelisation usually in MPI
– Sensitive to network performance
– Currently primarily chemistry, physics, engineering

High throughput

– Lots (tens of thousands) of independent jobs on different data
– High I/O
– Currently, primarily biosciences and physics
– In the future, digital humanities

SLIDE 6

HPC

[Diagram: Input Data → many processes on many processors, working simultaneously + communicating with each other → Output Data]

SLIDE 7

HTC

[Diagram: Input Data → many processes, operating independently of each other and in any order → Output Data]
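
To make the HTC pattern concrete, here is a minimal sketch in Python. It assumes a hypothetical per-sample function `analyse` and uses a thread pool on one machine as a stand-in for a cluster scheduler; on a real HTC service each task would more likely be its own job. The point it illustrates is that no task needs anything from any other task, so results can arrive in any order.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def analyse(sample: int) -> int:
    """Hypothetical stand-in for one independent job: it only needs its own input."""
    return sample * sample


def run_all(samples, workers=4):
    """Run every task independently and collect results keyed by input."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit one task per input; tasks never talk to each other.
        futures = {pool.submit(analyse, sample): sample for sample in samples}
        # as_completed yields tasks in whatever order they finish.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

Calling `run_all([1, 2, 3, 4])` gives one result per input regardless of completion order. An HPC code would not fit this shape, because its processes have to exchange data while they run.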

SLIDE 8

The what + why of version control

Version control systems are tools that let you keep track of who changed a file or set of files, when and what they changed.

– If you are collaborating they let you all work on a project and share changes in a structured way.
– If you are working on a long term project (e.g. your PhD thesis!) they help you keep a record of what you did and when (and get old versions back).

Many available, many types – from very basic (e.g. “track changes”) to very advanced decentralised systems.

SLIDE 9

Git and Github

Git is an Open Source (GPL) command line tool originally written by Linus Torvalds.

– But there are lots of graphical tools available that “talk git”
– “Decentralised” - i.e. every person working on a repository has their own copy

Github is a centralised service for hosting, sharing and contributing to git repositories of open source code

– A sort of “social network” for coding
– Free for public repositories
– Recently bought by Microsoft!

“Octocat”, Github’s cute mascot

SLIDE 10

Github is an interesting place to explore

It’s the default for RITS (including RSD) at UCL – e.g.

– https://github.com/UCL/i_newspaper_rods - software to run queries over the British Museum’s Times Digital archive.
– https://github.com/UCL-RITS/rcps-buildscripts/ - all the installation management for UCL RC services (and where you can request new software).

Code for all sorts and scales of projects, inc. big companies like Microsoft, Valve...

SLIDE 11

Setting up git/Github

Depends on whether you are using Linux, Mac or Windows!

– Linux – often already installed, or install from your package manager
– Mac – install from the Xcode developer tools
– Windows – a lot more complicated: pick an option from:
    Command-line tools: https://git-scm.com/downloads
    GUI choices: https://git-scm.com/downloads/guis

Set up name and email in the client

IF you want to use Github, register a Github account

– More detail on linking this to git on your local machine here: https://help.github.com/articles/set-up-git/
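
As a rough illustration of the "set up name and email" step for the command-line client, the sketch below builds the two `git config --global` calls from Python. The helper names `git_identity_commands` and `set_git_identity` are made up for this example, and it assumes git is installed and on your PATH; a GUI client will have its own settings dialog instead.

```python
import subprocess


def git_identity_commands(name: str, email: str) -> list:
    """Build the two `git config` calls that record who you are."""
    return [
        ["git", "config", "--global", "user.name", name],
        ["git", "config", "--global", "user.email", email],
    ]


def set_git_identity(name: str, email: str) -> None:
    """Run the commands; raises if git is missing or a call fails."""
    for command in git_identity_commands(name, email):
        subprocess.run(command, check=True)
```

The same two commands can simply be typed into a terminal; wrapping them in Python is only to keep the examples in one language.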

SLIDE 12

But overall...

You don’t have to like git or even use it:

– Other version control systems are available (SVN, CVS...)
– Anything is better than nothing – what is important is to have a good automated way of tracking what you did when and getting back “that” version of the code.
– Find out if your research group already uses a version control tool and use that.
– Similarly there are Github alternatives for collaboration like BitBucket.

Anything that’s a text file(*) can go into version control – this includes LaTeX source if you use that for your thesis/papers.

(*) Binary files can go in but you can’t see the difference between versions as easily

SLIDE 13

Aside: Code: Application vs Method

Applications

Packaged as “ready for other people”

Works on machines other than the developer’s:

– No hard coding of paths
– “sensible” install process
– works on arbitrary dataset

Used directly by other people for work

Method

“What I did”

Really a part of the write-up

– Probably hard-coded to work with one dataset, in the few environments available to the user.
– Jupyter notebooks etc.

Inspire other people’s work
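
One way to picture the "no hard coding of paths" point is a script that takes its dataset as a command-line argument. This is only a sketch: `count_lines` is a hypothetical stand-in for whatever analysis the code really does, and the argument name `dataset` is an assumption.

```python
import argparse
from pathlib import Path


def count_lines(path: Path) -> int:
    """Hypothetical analysis step: count the records (lines) in a file."""
    with path.open() as handle:
        return sum(1 for _ in handle)


def main(argv=None) -> int:
    """Read the dataset location from the command line instead of the source."""
    parser = argparse.ArgumentParser(description="Count records in a dataset.")
    parser.add_argument("dataset", type=Path, help="path to the input file")
    args = parser.parse_args(argv)
    print(count_lines(args.dataset))
    return 0
```

Calling `main(["/some/other/data.txt"])`, or running the script with a path as its argument, works on any machine and any dataset, which is roughly the difference between an "Application" and a "Method" with one person's home directory baked in.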

SLIDE 14

Publishing Code + Data

First contentious statement: IF your research is publicly funded it is your moral obligation to make your code and data available to outsiders under a reasonable license. Increasingly, funding councils agree.

SLIDE 15

Publishing Code + Data: Motivations

Citations:

IF you license appropriately, get citations for free!

Collaborations:

IF you are willing to, potential collaborators will come to you!

“Reproducibility” + finding errors

IF you are not evil, this is good.

SLIDE 16

Reproducibility

Second contentious statement: “Reproducibility” in scientific computing has been hijacked by software engineers.

Overwhelming focus on bit-perfect reproduction of results:

– Containers/VMs to exactly reproduce environment.
  → Doesn’t work anyway because of hardware.
– Only actually of use forensically (of course useful for moving your software about which is a separate issue)

SLIDE 17

Reproducibility

Relatively little focus on whether the general method is stable:

– If your method only works with a particular compiler/MKL/whatever version then it may be a bug, not a valid result.
– (Related) If your code stops working because a language feature is deprecated then expecting that old version to be available for the lifetime of your research is a bad idea – update your working version of your code.
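
A small sketch of why "stable" is a more useful test than "bit-perfect": floating-point addition depends on the order of operations, which is the kind of thing a different compiler, library or processor can change. The helper `results_agree` and its tolerance are assumptions for illustration, not a recommended threshold.

```python
import math

# The same three numbers, added in two different orders.
forward = (0.1 + 0.2) + 0.3
backward = (0.3 + 0.2) + 0.1


def results_agree(a: float, b: float, rel_tol: float = 1e-9) -> bool:
    """Compare two results to within a relative tolerance rather than bit for bit."""
    return math.isclose(a, b, rel_tol=rel_tol)


bitwise_identical = forward == backward          # False: the last bits differ
scientifically_same = results_agree(forward, backward)  # True
```

If a conclusion survives only when `bitwise_identical` is true, that is a warning sign about the method rather than about the environment.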

SLIDE 18

Publishing Code: “Do”

Things to think about:

What you are publishing – is it an “Application” or is it a “Method”?

– Set expectations for users in the documentation

License – in order for people to actually use your software!

Versions:

– Keep old versions online and distinguish them i.e. “myprg-1.2.3.tar.gz” not “myprg-current.tar.gz”
– Tag releases (if on Github etc.).
– When publishing results say which version/tag you used!

DO THE SAME FOR DATA SETS!
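
As one possible way to "say which version you used", the sketch below stores the code and data set versions alongside the results themselves. The constant `CODE_VERSION`, the function `package_results` and the field names are all hypothetical; the version string just echoes the “myprg-1.2.3” example.

```python
import json

CODE_VERSION = "1.2.3"  # hypothetical: would match the tagged release you ran


def package_results(results: dict, dataset_version: str) -> str:
    """Bundle results with the code and data versions that produced them."""
    record = {
        "code_version": CODE_VERSION,
        "dataset_version": dataset_version,
        "results": results,
    }
    return json.dumps(record, indent=2, sort_keys=True)
```

Anyone who picks up the output file later can then tell which release and which data set to fetch without having to ask.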

SLIDE 19

Pitfalls

Ritual
The “things I did” explosion
Obsessing over performance/not caring enough
Designing experiments based on the contents of slide decks

SLIDE 20

Ritual

Third contentious statement:

– IF you are publishing research you should know how your results are generated.
– i.e. it’s not just enough to plug some data into a black box whose workings you do not understand and then publish the results.

THIS DOES NOT MEAN YOU NEED TO UNDERSTAND THE COMPUTER DOWN TO THE MICROCODE!

SLIDE 21

Avoiding Ritual

Read up on the software you are using.

Think about its limitations:

– Is its output deterministic or is there a random element?
– Where does that algorithm break down?
– What sort of machine can I run this on?

Think about how it might be applied to your problem:

– Am I actually using this software appropriately?
– What data requirements do I need to think about?

Think about your results:

– Are they reasonable?
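
For the "deterministic or random?" question, a quick check is to fix the seed and then vary it. The sketch below uses a toy Monte Carlo estimate of pi as a hypothetical stand-in for a stochastic method; `estimate_pi` and `spread_over_seeds` are illustrative names, not part of any package mentioned here.

```python
import random
import statistics


def estimate_pi(n_samples: int, seed: int) -> float:
    """Toy stochastic method: the answer depends on the random seed."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples


def spread_over_seeds(n_samples: int, seeds) -> tuple:
    """Mean and standard deviation of the estimate across several seeds."""
    estimates = [estimate_pi(n_samples, seed) for seed in seeds]
    return statistics.mean(estimates), statistics.stdev(estimates)
```

The same seed should give the same number every time; different seeds should give numbers whose spread is small compared with the effect you are reporting. If they do not, "are they reasonable?" already has its answer.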

SLIDE 22

It’s just a model...

This all dovetails neatly together into a larger problem: When we simulate things, we are just building a model. Models have limits! Computer models are not the only models:

Animal models
Building an actual model
Theoretical models

JUST BECAUSE THE COMPUTER MODEL SAYS IT IS TRUE DOESN’T MAKE IT SO!

SLIDE 23

It’s just a model...

Real life... Computer model…. Physical scale model...

SLIDE 24

The “things I did” explosion

This is not unique to scientific computing but is encouraged by the way we use computers.

It can be tempting to try a lot of unplanned things on our input data and see what “works”: “I’ll just run it through X and see...”

– This can be difficult to track.
– This can be dangerously close to “p-hacking” when analysing data...

Always record what you did even if you didn’t plan it.

– Version control helps (particularly if you are modifying code)
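
Recording what you did does not need to be elaborate. A minimal sketch, assuming a hypothetical helper `record_step` and a plain text log with one JSON entry per line:

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_step(logfile: Path, description: str, parameters: dict) -> dict:
    """Append a timestamped note of what was run, and with which settings."""
    entry = {
        "when": datetime.now(timezone.utc).isoformat(),
        "what": description,
        "parameters": parameters,
    }
    # Append mode keeps every earlier entry, including the dead ends.
    with logfile.open("a") as handle:
        handle.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry
```

A call such as `record_step(Path("steps.jsonl"), "ran it through X", {"tool": "X"})` takes seconds and means the unplanned attempts are still on record when it is time to write up. The log file itself is a text file, so it can go into version control too.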

SLIDE 25

Performance

Performance is important. But what is important is the time to get to a meaningful solution, not the performance of code alone.

There’s no point in learning C to make one job that takes 48 hours run in 4 hours.

– But maybe if you have to run 10000 of them?

Obsessive optimisation is madness.

It’s completely worth slightly modifying your code to make it run 10x as fast.
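
The arithmetic behind the 48 hour vs 4 hour example can be written out as below. The 200 hours for learning C and rewriting the code is an invented figure purely for illustration, and the totals add up compute hours as if jobs ran one after another, which ignores any parallelism on the cluster.

```python
def time_to_solution(effort_hours: float, hours_per_job: float, n_jobs: int) -> float:
    """Total hours: one-off human effort plus compute time for every job."""
    return effort_hours + hours_per_job * n_jobs


LEARNING_C_HOURS = 200.0  # assumption, not a figure from the slides

# One job: the rewrite costs far more time than it saves.
one_job_as_is = time_to_solution(0.0, 48.0, 1)                    # 48 hours
one_job_rewritten = time_to_solution(LEARNING_C_HOURS, 4.0, 1)    # 204 hours

# 10000 jobs: the one-off effort is dwarfed by the compute saved.
many_jobs_as_is = time_to_solution(0.0, 48.0, 10_000)                   # 480,000 hours
many_jobs_rewritten = time_to_solution(LEARNING_C_HOURS, 4.0, 10_000)   # 40,200 hours
```

Whatever the real effort figure is, the shape of the comparison is the same: the effort term is fixed, the compute term scales with the number of jobs.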

SLIDE 26

Experiment design by slide deck

Supervisor: “Hey, I went to this conference and saw a really interesting presentation by Prof. X’s group on this application they have developed and you should try using it on our problem”

SLIDE 27

Why is this bad?

Think about what you present when YOU present your work:

Gloss over “insignificant” problems
Small amount of self-hyping
Don’t mention the detail of how our code is tailored to our environment/problem
“You can just go here to download the code!”

For a good example of this I could talk about how wonderful our package management on our HPC clusters is but it suffers from many of these issues!

SLIDE 28

Using other projects’ code

Here are some things to think about:

Is their code actually an “application” or is it really documenting their method?
Can you even legally re-use it?
Is it easy to install?

– What language is it in? (can you understand the code?)
– What does it depend on to work?
– Is it full of hard coded paths?
– Can you work out which version you should be using?

If the answer to the above is “Oh no” where can you get help?

Is it even suitable to your problem at all?
Telling your supervisor “I took a look and looking at their code the issues with using it are X, Y and Z” is a reasonable response!

SLIDE 29

And finally...

Fourth contentious statement: “Best practice” is at best an inaccurate and at worst a meaningless phrase. What we should aim for is “good practice”:

Use resources efficiently to make your job easier.
Make stuff you create open in a way that’s as frictionless for you and the end user as makes sense.

At UCL there are a lot of teams to help you do this!


Contents

4 contentious statements
