Some Ideas for “Best Practice” in Scientific Computing
Dr Owain Kenway (@owainkenway), UCL/ISD/RITS/RCAS/Team Leader
Scientific Computing?
Doing science with computers:
– Generating data → Simulation
– Analysing data → Filtering, statistical analysis…
– Theorising about data → Machine learning/AI?
– Arts/humanities
– Team Lead of Research Computing Applications and Support
– Look after users and applications, and help design those services
– Grace → High Performance Computing (HPC)
– Myriad, Legion → High Throughput Computing (HTC)
– Aristotle → Interactive teaching Linux service
– Thomas (Tier 2 MMM hub)
– Michael (Faraday Battery Institute)
High Performance Computing (HPC):
– Single job spans multiple nodes
– Tightly coupled parallelisation, usually in MPI
– Sensitive to network performance
– Currently primarily chemistry, physics, engineering
High Throughput Computing (HTC):
– Lots (tens of thousands) of independent jobs on different data
– High I/O
– Currently primarily biosciences and physics
– In the future, digital humanities
[Diagram: HPC – input data → many processes on many processors, working simultaneously and communicating with each other → output data]
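To make the "tightly coupled" picture concrete, here is a minimal sketch in Python using mpi4py (my illustration, not from the talk; it assumes mpi4py and an MPI library are installed, and would be run with e.g. mpirun -np 4 python ring.py):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # each process owns part of the problem and must exchange data
    # with its neighbours every step - this is why HPC jobs are
    # sensitive to network performance
    local_value = float(rank)
    right = (rank + 1) % size
    left = (rank - 1) % size

    # send to the right neighbour, receive from the left, in one call
    received = comm.sendrecv(local_value, dest=right, source=left)
    print(f"rank {rank} got {received} from rank {left}")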
[Diagram: HTC – input data → many processes operating independently of each other, in any order → output data]
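The high-throughput pattern, by contrast, needs no communication at all. A sketch of the same idea on a single node using Python's multiprocessing (again my own example, with made-up file names; on a cluster each task would instead be a separate job):

    from multiprocessing import Pool

    def analyse(filename):
        # stand-in for analysing one independent input file
        return filename, len(filename)

    if __name__ == "__main__":
        inputs = [f"sample_{i}.dat" for i in range(10000)]
        with Pool() as pool:
            # no shared state, so tasks can run in any order on any core
            for name, result in pool.imap_unordered(analyse, inputs):
                print(name, result)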
– If you are collaborating, version control lets you all work on a project and share changes.
– If you are working on a long-term project (e.g. your PhD thesis!) it helps you get back to earlier versions of your work.
– git itself is a command-line tool, but there are lots of graphical tools available that “talk git”.
– “Decentralised”, i.e. every person working on a repository has their own copy.
GitHub:
– A sort of “social network” for coding
– Free for public repositories
– Recently bought by Microsoft!
[Image: “Octocat”, GitHub’s mascot]
– https://github.com/UCL/i_newspaper_rods – software to run queries
– https://github.com/UCL-RITS/rcps-buildscripts/ – all the installation scripts
– Linux – often already installed, or install from your package manager
– Mac – install from the Xcode developer tools
– Windows – a lot more complicated: pick an option from…
– More detail on linking this to git on your local machine here: https://help.github.com/articles/set-up-git/
– Other version control systems are available (SVN, CVS…)
– Anything is better than nothing – what is important is to have a good automated way of tracking what you did when, and getting back “that” version of the code.
– Find out if your research group already uses a version control tool, and use that.
– Similarly, there are GitHub alternatives for collaboration, like BitBucket.
(*) Binary files can go in, but you can’t see the differences between versions as easily.
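One way to automate "getting back 'that' version": record the commit hash with every output. A Python sketch (my illustration; the file name is made up, and it assumes the code runs inside a git repository):

    import subprocess

    def current_commit():
        # ask git which commit the running code corresponds to
        return subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()

    with open("results.txt", "w") as out:
        out.write(f"# produced by commit {current_commit()}\n")
        out.write("...your results here...\n")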
Applications should work for “other people”, not just in the environments available to the developer:
– No hard coding of paths (see the sketch below)
– A “sensible” install process
– Works on arbitrary datasets
Code written only for its author tends to look different:
– Probably hard-coded to work with one dataset, in the few environments available to the user
– Jupyter notebooks etc.
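A sketch of the "no hard coding of paths" point in Python (my own example; the argument and file pattern are invented): take the dataset location from the command line instead of baking in the developer's home directory.

    import argparse
    from pathlib import Path

    parser = argparse.ArgumentParser(description="run the analysis on any dataset")
    parser.add_argument("data_dir", type=Path,
                        help="directory containing the input files")
    args = parser.parse_args()

    # works on an arbitrary dataset, not just /home/developer/data
    for path in sorted(args.data_dir.glob("*.dat")):
        print(f"processing {path}")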
Second contentious statement: “Reproducibility” in scientific computing has been hijacked by software engineers.
– Containers/VMs to exactly reproduce the environment → doesn’t work anyway, because of hardware.
– Only actually of use forensically (of course useful for moving your software about, but that is a separate issue).
– If your method only works with a particular compiler/MKL/whatever, that is a problem with your method.
– (Related) If your code stops working because a language feature is deprecated or removed, freezing the old environment only hides the problem.
Things to think about:
– Set expectations for users in the documentation
– Keep old versions online and distinguish them, i.e. “myprg-1.2.3.tar.gz” not “myprg-current.tar.gz”
– Tag releases (if on GitHub etc.)
– When publishing results, say which version/tag you used! (see the sketch below)
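A sketch of making the version visible in the output itself (my illustration; the version string and file name are invented), so a reader can match published results to a tagged release:

    __version__ = "1.2.3"  # should match the release tag, e.g. v1.2.3 on GitHub

    with open("results.csv", "w") as out:
        # stamp every output with the version that produced it
        out.write(f"# myprg version {__version__}\n")
        out.write("x,y\n")
        out.write("1,2\n")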
– If you are publishing research, you should know how your results are generated.
– i.e. it’s not enough to just plug some data into a black box whose workings you don’t understand.
– Is its output deterministic, or is there a random element? (see the sketch after this list)
– Where does that algorithm break down?
– What sort of machine can I run this on?
– Am I actually using this software appropriately?
– What data requirements do I need to think about?
– Are they reasonable?
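On the determinism question: if there is a random element, record the seed so a run can be repeated. A NumPy sketch (my own; the seed value is arbitrary):

    import numpy as np

    seed = 12345  # record this alongside your results
    rng = np.random.default_rng(seed)
    sample = rng.normal(size=5)
    # rerunning with the same seed (on the same machine and the same
    # NumPy version) gives the same numbers
    print(f"seed={seed} ->", sample)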
[Images: real life vs computer model vs physical scale model]
– This can be difficult to track.
– This can be dangerously close to “p-hacking” when analysing data…
– Version control helps (particularly if you are modifying code)
– But what if you have to run 10,000 of them? (see the sketch below)
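When version control alone can't track 10,000 runs, one option is to write a small metadata record next to every output. A sketch (my own invention; the field names and files are illustrative):

    import json
    import time

    def record_run(run_id, params, output_file):
        # one small JSON file per run keeps runs distinguishable later
        meta = {
            "run_id": run_id,
            "params": params,
            "output": output_file,
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        }
        with open(f"run_{run_id:05d}.json", "w") as f:
            json.dump(meta, f, indent=2)

    record_run(1, {"threshold": 0.5}, "out_00001.dat")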
Think about what you present when YOU present your work:
– Is it tailored to our environment/problem?
– “You can just go here to download the code!” – is that actually true?
– For a good example of this, I could talk about how wonderful our package management on our HPC clusters is, but it suffers from many of these issues!
Here are some things to think about:
– What language is it in? (can you understand the code?)
– What does it depend on to work?
– Is it full of hard-coded paths?
– Can you work out which version you should be using?
– If the answer to the above is “Oh no!”, where can you get help?
– Deciding not to use it at all is a reasonable response!
Fourth contentious statement: “best practice” is at best an inaccurate, and at worst a meaningless, phrase. What we should aim for is “good practice”: doing what makes sense. At UCL there are a lot of teams to help you do this!