Scientific Computing: 5 Key Considerations for Research Projects
Scientific computing has grown exponentially in importance over the last two decades; it is now becoming a central part of many multidisciplinary projects, from materials engineering and biotechnology to artificial intelligence. It is a strong and growing part of scientific practice, and will remain so long into the future. So it is more important than ever that research projects start off with solid fundamentals.
A question I have encountered often, through my experience working in scientific computing support, is: what should be the key considerations for a research project underpinned by scientific computing? Most researchers starting a project need to answer five main technical questions:
- What programming languages and libraries should be used? Python, as we know, is hugely popular, with Julia also gaining pace. C, C++ and Fortran still have a strong foothold. MATLAB continues to expand its functionality and still commands strong support in research and education, alongside many domain-specific languages.
- How should I maintain this code? There is only one answer here: version control with Git. Whether you host it on GitHub, GitLab or Bitbucket, any of these is better than nothing.
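Getting started costs almost nothing. Here is a minimal sketch of putting an analysis script under version control; the repository, file and remote names are placeholders:

```shell
# Create a project with a single analysis script (contents illustrative).
mkdir -p my-simulation
echo 'print("hello, science")' > my-simulation/analyse.py

# Initialise the repository and record the first commit.
git init my-simulation
git -C my-simulation add analyse.py
git -C my-simulation -c user.name="Researcher" -c user.email="r@example.org" \
    commit -m "Initial commit: analysis script"

# Then link it to a hosted remote (GitHub, GitLab or Bitbucket) and push:
# git -C my-simulation remote add origin git@github.com:your-lab/my-simulation.git
# git -C my-simulation push -u origin main
```

From there, every change to the code is recorded, attributable and reversible.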
- What acceleration do I need? This depends very much on the problem at hand, but can be condensed (for simplicity) into CPU parallelism with frameworks such as OpenMP and MPI, GPU acceleration (often combined with MPI and interconnects such as NVLink), and, more recently, the emergence of quantum computing.
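As a small illustration of CPU parallelism, here is a sketch using only Python's standard library to split a numerical integration across processes; the integrand and grid sizes are illustrative, not from any real project:

```python
# Approximate the integral of sin(x)^2 over [0, pi] by splitting the
# Riemann sum across worker processes. The analytic answer is pi/2.
from concurrent.futures import ProcessPoolExecutor
import math

def partial_sum(args):
    """Sum sin(i*dx)^2 * dx over one chunk of the grid."""
    start, stop, dx = args
    return sum(math.sin(i * dx) ** 2 * dx for i in range(start, stop))

def integrate(n=1_000_000, workers=4):
    """Divide the n-point grid into chunks and sum them in parallel."""
    dx = math.pi / n
    step = n // workers
    chunks = [(i * step, n if i == workers - 1 else (i + 1) * step, dx)
              for i in range(workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    print(integrate())  # close to pi/2 ~ 1.5708
```

The same decomposition idea, splitting a problem into independent chunks, underlies OpenMP threads on one node and MPI ranks across many.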
- What platform should be used to run my programs? For scientific computing, the choice of platform can be narrow or effectively infinite, since every research project has bespoke needs. Here I condense it to three main cluster types: High-Performance Computing (HPC), High-Throughput Computing (HTC) and Big Data. All of these allow you to run your code at massive scale, and scientific computing demands more power as the scope of your research widens.
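To make the HPC option concrete, here is a sketch of a batch script for Slurm, one common HPC scheduler; the job name, partition, module name and executable are all placeholders that a real cluster would dictate:

```shell
#!/bin/bash
#SBATCH --job-name=my-simulation
#SBATCH --partition=compute        # placeholder partition name
#SBATCH --nodes=2                  # request two nodes
#SBATCH --ntasks-per-node=32       # 32 MPI ranks per node
#SBATCH --time=04:00:00            # wall-clock limit

module load openmpi                # module names are site-specific
mpirun ./simulate input.dat
```

The scheduler queues the job and launches it when the requested resources become free, which is what lets a cluster serve many researchers at once.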
- How will my output data be managed? Always consider how much data will be produced and where it will be stored. Durability and scale are essential considerations: are we talking GBs, TBs or even PBs? We should use our knowledge of the science to minimise the volume, and hence the cost incurred. If, for example, a simulation is deterministic, we only need to keep the inputs and the code in storage once the work is published, rather than TBs of output. There is nothing worse than juggling tens of external HDDs; find a solution at the right scale, such as Amazon S3 or Azure Blob Storage.
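A back-of-envelope estimate before the first run goes a long way. Here is a sketch, assuming double-precision output fields; the grid dimensions, field count and snapshot count are illustrative:

```python
# Estimate raw output volume for a gridded simulation:
# grid points x fields x snapshots x bytes per value.
def output_size_bytes(nx, ny, nz, n_fields, n_snapshots, bytes_per_value=8):
    """Raw (uncompressed) size of the written output, in bytes."""
    return nx * ny * nz * n_fields * n_snapshots * bytes_per_value

# A 1024^3 grid, 5 fields, 100 snapshots of float64 values:
size = output_size_bytes(1024, 1024, 1024, n_fields=5, n_snapshots=100)
print(f"{size / 1e12:.1f} TB")  # about 4.3 TB
```

Seeing "4.3 TB" before the run, rather than after, is what lets you decide whether to store fewer snapshots, compress, or keep only inputs and code.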
Develop code with the expectation that it will be scaled up to a larger platform; your desktop computer can only take you so far.
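One habit that helps: never hard-code resources. A sketch, assuming a Slurm cluster (SLURM_CPUS_PER_TASK is the variable Slurm sets for a job; the fallback covers your desktop):

```python
# Pick the worker count from the scheduler when running under Slurm,
# and fall back to the local CPU count on a desktop, so the same
# script runs unchanged in both places.
import os

def worker_count():
    return int(os.environ.get("SLURM_CPUS_PER_TASK", os.cpu_count() or 1))
```

Small choices like this mean the move from laptop to cluster is a job script, not a rewrite.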
More often than it should be, the choices above are made incorrectly, or never considered before a research project begins. Often the decision in a research group is left to a PhD student, or to a researcher starting out on their first programming project. As a result, we find groundbreaking research taking place on poorly maintained codebases, their scope limited by a lack of foresight in platform choice as well as a poor choice of programming language. A sad truth in research communities is that bad code is passed down through generations of researchers, getting worse and less portable with every iteration, because research outputs and papers are valued above good code.
A well-maintained codebase leads to reproducibility, the foundation upon which scientific research stands.
My advice for any scientific computing project is to seek advice from any technologist you can; most universities and academic institutions have some sort of Research Computing department. These are generally made up of System Administrators and Research Software Engineers, specialists in this domain who can empower you to develop scientific code that will enrich your research. They will have extensive experience across a plethora of previous projects, so they can give you the best advice and support to ensure the success and longevity of your computing project. Just ask them the five questions and you won't go wrong!