Some Ideas for Best Practice in Scientific Computing



slide-1
SLIDE 1

Some Ideas for “Best Practice” in Scientific Computing

Dr Owain Kenway, (@owainkenway) UCL/ISD/RITS/RCAS/Team Leader

slide-2
SLIDE 2

“Scientific Computing?”

  • “Doing science with computers”

– Generating data → Simulation
– Analysing data → Filtering, statistical analysis…
– Theorising about data → Machine learning/AI?

  • Not just science

– Arts/humanities

→ “Research Computing”

slide-3
SLIDE 3

About Me

  • Been at UCL since 2005

(Computational Chemistry PhD)

  • Spent the last 8 or so years

working in Research Computing in ISD

– Team Lead of Research Computing

Applications and Support

– Look after users and applications on UCL ISD managed resources + design those services

slide-4
SLIDE 4

Contents

  • Overview of HPC/HTC services at UCL
  • Version Control
  • Publishing Code (+ Data)
  • Pitfalls

4 contentious statements

slide-5
SLIDE 5

UCL Research Computing Resources

  • UCL only services:

– Grace → High Performance Computing (HPC)
– Myriad, Legion → High Throughput Computing (HTC)
– Aristotle → Interactive teaching Linux service

  • National services:

– Thomas (Tier 2 MMM hub)
– Michael (Faraday Battery Institute)

  • Parallel

– Single job spans multiple nodes
– Tightly coupled parallelisation, usually in MPI
– Sensitive to network performance
– Currently primarily chemistry, physics, engineering

  • High throughput

– Lots (tens of thousands) of independent jobs on different data
– High I/O
– Currently primarily biosciences and physics
– In the future, digital humanities

slide-6
SLIDE 6

HPC

Input Data → Many processes on many processors work simultaneously + communicate with each other → Output Data

slide-7
SLIDE 7

HTC

Input Data → Many processes operate independently of each other and in any order → Output Data
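
A toy sketch of the high throughput pattern in Python: each task depends only on its own input, so the tasks can run in any order. The function `analyse` and the sample data are hypothetical stand-ins, and a thread pool on one machine stands in for a cluster scheduler running thousands of separate jobs.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def analyse(sample):
    """Hypothetical per-sample analysis: no task needs another task's result."""
    return sum(sample) / len(sample)


def run_all(samples, workers=4):
    """Run every task independently and return results in input order."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(analyse, s): i for i, s in enumerate(samples)}
        # Tasks may finish in any order; the index puts results back in place.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return [results[i] for i in range(len(samples))]


if __name__ == "__main__":
    data = [[1, 2, 3], [10, 20], [5]]
    print(run_all(data))
```

An HPC job is the opposite case: the processes exchange data while running, so they cannot simply be farmed out like this.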

slide-8
SLIDE 8

The what + why of version control

  • Version control systems are tools that let you keep track of who changed a file or set of files, when, and what they changed.

– If you are collaborating, they let you all work on a project and share changes in a structured way.
– If you are working on a long term project (e.g. your PhD thesis!), they help you keep a record of what you did and when (and get old versions back).

  • Many available, many types – from very basic (e.g. “track changes”) to very advanced decentralised systems.

slide-9
SLIDE 9

Git and Github

  • Git is an Open Source (GPL) command line tool originally written by Linus Torvalds.

– But there are lots of graphical tools available that “talk git”
– “Decentralised” - i.e. every person working on a repository has their own copy

  • Github is a centralised service for hosting, sharing and contributing to git repositories of open source code

– A sort of “social network” for coding
– Free for public repositories
– Recently bought by Microsoft!

“Octocat”, Github’s cute mascot

slide-10
SLIDE 10

Github is an interesting place to explore

  • It’s the default for RITS (including RSD) at UCL – e.g.

– https://github.com/UCL/i_newspaper_rods - software to run queries over the British Museum’s Times Digital archive.
– https://github.com/UCL-RITS/rcps-buildscripts/ - all the installation management for UCL RC services (and where you can request new software).

  • Code for all sorts and scales of projects, inc. big companies like Microsoft, Valve...

slide-11
SLIDE 11

Setting up git/Github

  • Depends on whether you are using Linux, Mac or Windows!

Linux – often already installed, or install from your package manager

Mac – install from the Xcode developer tools

Windows – a lot more complicated: pick an option from:

  • Command-line tools: https://git-scm.com/downloads
  • GUI choices: https://git-scm.com/downloads/guis
  • Set up name and email in the client
  • IF you want to use Github, register a Github account

More detail on linking this to git on your local machine here: https://help.github.com/articles/set-up-git/

slide-12
SLIDE 12

But overall...

  • You don’t have to like git or even use it:

– Other version control systems are available (SVN, CVS...)
– Anything is better than nothing – what is important is to have a good automated way of tracking what you did when and getting back “that” version of the code.
– Find out if your research group already uses a version control tool and use that.
– Similarly there are Github alternatives for collaboration like BitBucket.

  • Anything that’s a text file(*) can go into version control – this includes

LaTeX source if you use that for your thesis/papers.

(*) Binary files can go in but you can’t see the difference between versions as easily

slide-13
SLIDE 13

Aside: Code: Application vs Method

Applications

  • Packaged as “ready for other

people”

  • Works on machines other than the

developer’s:

No hard coding of paths

“sensible” install process

works on arbitrary dataset

  • Used directly by other people for

work

Method

  • “What I did”
  • Really a part of the write-up

– Probably hard-coded to work

with one dataset, in the few environments available to the user.

– Jupyter notebooks etc.

  • Inspire other people’s work
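
As a minimal sketch of the “no hard coding of paths” and “works on arbitrary dataset” points, the Python below takes its input and output locations from the command line. The tool itself (a line counter), the function names and the default output name `results.txt` are all hypothetical; the point is only that nothing in the code refers to a path on the developer’s own machine.

```python
import argparse
from pathlib import Path


def build_parser():
    """Command line interface: paths come from the user, not the source code."""
    parser = argparse.ArgumentParser(description="Hypothetical analysis tool")
    parser.add_argument("input", type=Path, help="dataset to analyse")
    parser.add_argument("--output", type=Path, default=Path("results.txt"),
                        help="where to write the results")
    return parser


def count_lines(path):
    """Stand-in for the real analysis: count the lines in the dataset."""
    with path.open() as handle:
        return sum(1 for _ in handle)


def main(argv=None):
    """Parse arguments, run the analysis and write a one-line summary."""
    args = build_parser().parse_args(argv)
    n_lines = count_lines(args.input)
    args.output.write_text(f"{args.input.name}: {n_lines} lines\n")
    return n_lines
```

Calling `main(["mydata.csv", "--output", "summary.txt"])`, or wiring `main()` to the script entry point, runs the same code on any dataset. A “Method” script with the path typed into the source is still worth publishing, but it should be described as such.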
slide-14
SLIDE 14

Publishing Code + Data

First contentious statement: IF your research is publicly funded it is your moral obligation to make your code and data available to outsiders under a reasonable license. Increasingly, funding councils agree.

slide-15
SLIDE 15

Publishing Code + Data: Motivations

  • Citations:

IF you license appropriately, get citations for free!

  • Collaborations:

IF you are willing to, potential collaborators will come to you!

  • “Reproducibility” + finding errors

IF you are not evil, this is good.

slide-16
SLIDE 16

Reproducibility

Second contentious statement: “Reproducibility” in scientific computing has been hijacked by software engineers.

  • Overwhelming focus on bit-perfect reproduction of results:

Containers/VMs to exactly reproduce environment.

→ Doesn’t work anyway because of hardware.

Only actually of use forensically (of course useful for moving your software about which is a separate issue)

slide-17
SLIDE 17

Reproducibility

  • Relatively little focus on whether the general method is stable:

– If your method only works with a particular compiler/MKL/whatever version then it may be a bug, not a valid result.
– (Related) If your code stops working because a language feature is deprecated then expecting that old version to be available for the lifetime of your research is a bad idea – update your working version of your code.
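
One way to illustrate the difference between bit-perfect and stable results is the small Python sketch below. It sums the same three numbers in two different orders, as might happen with a different compiler, maths library or process count, and compares the answers both bit-for-bit and within a tolerance. The helper `results_agree` and the tolerance of 1e-9 are assumptions for illustration; a real tolerance should come from the science.

```python
import math


def results_agree(a, b, rel_tol=1e-9):
    """Compare two runs within a tolerance instead of bit-for-bit."""
    return math.isclose(a, b, rel_tol=rel_tol)


# The same three numbers summed in two different orders.
run_a = (0.1 + 0.2) + 0.3
run_b = 0.1 + (0.2 + 0.3)

print(run_a == run_b)               # bit-for-bit comparison: False
print(results_agree(run_a, run_b))  # comparison within tolerance: True
```

The two sums differ in the last bit, so a bit-perfect check fails even though nothing is wrong. If the results disagreed well beyond the tolerance, that would be the sign of an unstable method or a bug.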
slide-18
SLIDE 18

Publishing Code: “Do”

Things to think about:

  • What you are publishing – is it an “Application” or is it a “Method”?

Set expectations for users in the documentation

  • License – in order for people to actually use your software!
  • Versions:

Keep old versions online and distinguish them, i.e. “myprg-1.2.3.tar.gz” not “myprg-current.tar.gz”

Tag releases (if on Github etc.).

When publishing results say which version/tag you used!

  • DO THE SAME FOR DATA SETS!
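
A minimal sketch of “say which version/tag you used”, in Python: the code carries its own version number and stamps it, together with the dataset version, onto every set of results it produces. The version strings, the `with_provenance` helper and the example result are all hypothetical.

```python
import json

__version__ = "1.2.3"  # hypothetical release number, matching the release tag


def with_provenance(results, dataset_version):
    """Wrap results with the code and data versions that produced them."""
    return {
        "code_version": __version__,
        "dataset_version": dataset_version,
        "results": results,
    }


record = with_provenance({"mean_energy": -1.5}, dataset_version="2.0.0")
print(json.dumps(record, indent=2, sort_keys=True))
```

With the versions stored next to the numbers, the statement in the paper can be checked against the output files rather than against memory.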
slide-19
SLIDE 19

Pitfalls

  • Ritual
  • The “things I did” explosion
  • Obsessing over

performance/not caring enough

  • Designing experiments

based on the contents of slide decks

slide-20
SLIDE 20

Ritual

Third contentious statement:

– IF you are publishing research you should know how your results are

generated.

– i.e. it’s not just enough to plug some data into a black box whose

workings you do not understand and then publish the results.

THIS DOES NOT MEAN YOU NEED TO UNDERSTAND THE COMPUTER DOWN TO THE MICROCODE!

slide-21
SLIDE 21

Avoiding Ritual

  • Read up on the software you are using.
  • Think about its limitations:

– Is its output deterministic or is there a random element?
– Where does that algorithm break down?
– What sort of machine can I run this on?

  • Think about how it might be applied to your problem:

– Am I actually using this software appropriately?
– What data requirements do I need to think about?

  • Think about your results:

– Are they reasonable?
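
The “deterministic or random?” question can be explored with a small experiment. The Python sketch below uses a Monte Carlo estimate of pi as a hypothetical stand-in for any software with a random element: fixing the seed makes a run repeatable, and varying the seed shows how much of the answer is noise.

```python
import random


def estimate_pi(n_samples, seed=None):
    """Monte Carlo estimate of pi; the result depends on the random stream."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples


# Same seed: the run can be repeated exactly.
first = estimate_pi(10_000, seed=42)
second = estimate_pi(10_000, seed=42)
print(first == second)

# Different seeds: the spread shows how much of the answer is noise.
estimates = [estimate_pi(10_000, seed=s) for s in range(5)]
print(min(estimates), max(estimates))
```

If a conclusion only holds for one particular seed, it is probably not a result.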

slide-22
SLIDE 22

It’s just a model...

This all dovetails neatly together into a larger problem: When we simulate things, we are just building a model. Models have limits! Computer models are not the only models:

  • Animal models
  • Building an actual model
  • Theoretical models

JUST BECAUSE THE COMPUTER MODEL SAYS IT IS TRUE DOESN’T MAKE IT SO!

slide-23
SLIDE 23

It’s just a model...

Real life... Computer model…. Physical scale model...

slide-24
SLIDE 24

The “things I did” explosion

  • This is not unique to scientific computing but is encouraged by

the way we use computers.

  • It can be tempting to try a lot of unplanned things on our input

data and see what “works”: “I’ll just run it through X and see...”

– This can be difficult to track.
– This can be dangerously close to “p-hacking” when analysing data...

  • Always record what you did even if you didn’t plan it.

– Version control helps (particularly if you are modifying code)
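
A minimal sketch of recording what you did, in Python: each run appends one line to a log file with the time, the input file, a hash of its contents and the parameters used. The function names, the log format and the field names are assumptions for illustration, not an established convention.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path):
    """Hash the input so the exact data used can be identified later."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def log_run(logfile, input_path, params, note=""):
    """Append one JSON line describing what was run, on what, and when."""
    entry = {
        "when": datetime.now(timezone.utc).isoformat(),
        "input": str(input_path),
        "input_sha256": file_sha256(input_path),
        "params": params,
        "note": note,
    }
    with open(logfile, "a") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry
```

A call such as `log_run("runs.jsonl", "data.csv", {"filter": "X"}, note="unplanned: just trying X")` takes seconds and makes the unplanned experiments visible later, including the ones that did not “work”.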

slide-25
SLIDE 25

Performance

Performance is important. But what is important is the time to get to a meaningful solution, not the performance of code alone.

  • There’s no point in learning C to make one job that takes 48

hours run in 4 hours.

– But maybe if you have to run 10000 of them?

  • Obsessive optimisation is madness.
  • It’s completely worth slightly modifying your code to make it run

10x as fast.
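
The 48 hour versus 4 hour example can be turned into rough arithmetic. The Python sketch below counts the net hours saved once the effort of optimising is included; the figure of 200 hours to learn C and rewrite the code is an assumption chosen for illustration, not a number from the talk.

```python
import math


def hours_saved(runs, old_hours, new_hours, effort_hours):
    """Net hours saved by optimising, after subtracting the effort spent."""
    return runs * (old_hours - new_hours) - effort_hours


def break_even_runs(old_hours, new_hours, effort_hours):
    """Smallest number of runs at which the optimisation pays for itself."""
    saving_per_run = old_hours - new_hours
    if saving_per_run <= 0:
        raise ValueError("the new version must be faster than the old one")
    return math.ceil(effort_hours / saving_per_run)


EFFORT = 200  # assumed hours spent learning C and rewriting the code

print(hours_saved(1, 48, 4, EFFORT))       # one job: a net loss
print(hours_saved(10_000, 48, 4, EFFORT))  # 10000 jobs: a large net gain
print(break_even_runs(48, 4, EFFORT))      # runs needed to break even
```

Under that assumption a single job loses 156 hours overall, 10000 jobs save 439800 hours, and the rewrite pays for itself from the fifth run onwards.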

slide-26
SLIDE 26

Experiment design by slide deck

Supervisor: “Hey, I went to this conference and saw a really interesting presentation by Prof. X’s group on this application they have developed and you should try using it on our problem”

slide-27
SLIDE 27

Why is this bad?

Think about what you present when YOU present your work:

  • Gloss over “insignificant” problems
  • Small amount of self-hyping
  • Don’t mention the detail of how our code is tailored to our environment/problem

“You can just go here to download the code!”

For a good example of this, I could talk about how wonderful the package management on our HPC clusters is, but it suffers from many of these issues!

slide-28
SLIDE 28

Using other projects’ code

Here are some things to think about:

  • Is their code actually an “application” or is it really documenting their method?
  • Can you even legally re-use it?
  • Is it easy to install?

– What language is it in? (Can you understand the code?)
– What does it depend on to work?
– Is it full of hard coded paths?
– Can you work out which version you should be using?

If the answer to the above is “Oh no” where can you get help?

  • Is it even suitable to your problem at all?
  • Telling your supervisor “I took a look at their code, and the issues with using it are X, Y and Z” is a reasonable response!

slide-29
SLIDE 29

And finally...

Fourth contentious statement: “Best practice” is at best an inaccurate and at worst a meaningless phrase. What we should aim for is “good practice”:

  • Use resources efficiently to make your job easier.
  • Make stuff you create open in a way that’s as frictionless for you and the end user as makes sense.

At UCL there are a lot of teams to help you do this!