====== High Performance Computing and Best Practices ====== ===== Background ===== ==== HPC Vs. Single Processors ==== Serial Computing - Instructions run one at a time: {{:images:serialproblem.gif?400|}} Parallel Computing - Problem is split into multiple problems, each problem can be run concurrently (at the same time) on multiple processors {{:images:parallelproblem.gif?500|}} ==== Need for HPC ==== //Time// - Parallel computing can solve problems faster //Cost// - Parallel computing can be accomplished with cheap hardware and the time savings can lead to cost savings //Larger Problems// - Complex problems like climate change, traffic, website transactions, and plasma physics can be infeasible with serial computing and are better suited to parallel computing ==== Concepts and Terminology ==== //HPC// - High Performance Computing, solving big problems with big computing power //Cluster// - A group of computers working together //Node// - A single computer, many nodes join together to form a cluster //Core/CPU// - A processing unit //Job// - A problem or program to be run //Parallel Overhead// - Extra time needed to setup and coordinate a parallel job: synchronizing, data exchange, start-up/termination, etc //Scalability// - Ability of a system to handle more work or its ability to be enlarged to accommodate more work ==== Limits and Cost ==== $Speedup = \frac{1}{(P/N)+S}$ P = parallel fraction, N = number of processors, S = serial fraction {{:images:amdahl2.gif|}} Development cost of parallel programming is higher. ==== Memory Architectures ==== ===== Batch Processing ===== ==== Batch Processing and Shared Resources ==== The need for shared resources //Our Cluster// * 16 cores on two nodes //NERSC// * National Energy Research Scientific Computing Center * Primary scientific computing facility for the Office of Science in the U.S. Department of Energy * Multiple machines with over 100,000 cores {{:images:nersc.jpg?300|}} //Sunway TaihuLight// * in China, currently the top HPC center with 10,649,600 cores {{::sunway-supercomputer-6.jpg?200|}} Check [[https://www.top500.org/|www.top500.org]] for an updated list of the top500 supercomputers ==== Batch Processing Management ==== With multiple users wishing to access computing resources, we need a way to manage who gets to use what resources and when they can do so. Some Resource Managers: * [[http://www.adaptivecomputing.com/products/open-source/torque/|Torque]] * [[https://computing.llnl.gov/linux/slurm/slurm.html|Slurm]] * [[http://www.adaptivecomputing.com/products/hpc-products/moab-hpc-suite-enterprise-edition/|Moab]] ===== Using HPC Resources at WSI ===== ==== Accessing Resources via SSH ==== In the office: $ ssh user@control ==== Launching and Managing Your Jobs ==== [[torque_tutorial|Using Torque (NERSC)]] [[slurm_tutorial|Using Slurm]] ==== Version Control ==== Often multiple programmers will be working on a codebase simultaneously. Alice may want to work on the code but Bob does too. If they both make changes to a file without telling each other, changes might get lost. Using a version control system ensures that Alice and Bob can work on the code together without undoing each-others work or breaking the code. === Commonly used Version Control Systems === Github SVN ===== HPC Best Practices ===== * Redundant Data * Seperate Storage from Compute Nodes * Central Source for User Documentation