high_performance_computing_and_best_practices [2014/09/08 16:32] – [Accessing Resources via SSH] wsi
high_performance_computing_and_best_practices [2022/07/21 06:59] (current) – external edit 127.0.0.1
 +====== High Performance Computing and Best Practices ======
 +===== Background =====
 +==== HPC Vs. Single Processors ====
 +Serial Computing - Instructions run one at a time:
 +{{:images:serialproblem.gif?400|}}
 +
 +Parallel Computing - A problem is split into smaller subproblems, each of which can run concurrently (at the same time) on multiple processors:
 +{{:images:parallelproblem.gif?500|}}
 +
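As a minimal sketch of the split described above, the following Python example divides one large sum into chunks and hands each chunk to a separate worker process. The function names (''partial_sum'', ''split_range'', ''parallel_sum_of_squares'') and the choice of four workers are illustrative assumptions, not part of any cluster setup described on this page.

```python
from multiprocessing import Pool


def partial_sum(bounds):
    """Sum the squares of the integers in [start, stop)."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def split_range(n, parts):
    """Split range(n) into `parts` contiguous (start, stop) chunks."""
    step, extra = divmod(n, parts)
    chunks = []
    start = 0
    for k in range(parts):
        # The first `extra` chunks take one additional element each.
        stop = start + step + (1 if k < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


def serial_sum_of_squares(n):
    """Serial version: one process walks the whole range."""
    return partial_sum((0, n))


def parallel_sum_of_squares(n, workers=4):
    """Parallel version: each worker process sums one chunk."""
    chunks = split_range(n, workers)
    with Pool(processes=workers) as pool:
        # pool.map sends one chunk to each worker and collects the results.
        return sum(pool.map(partial_sum, chunks))


if __name__ == "__main__":
    n = 1_000_000
    # Both versions must agree; only the way the work is divided differs.
    assert serial_sum_of_squares(n) == parallel_sum_of_squares(n)
    print(parallel_sum_of_squares(n))
```

For a problem this small, starting the worker processes may cost more time than the parallelism saves, so the parallel version is not guaranteed to be faster.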
 +
 +==== Need for HPC ====
 +//Time// - Parallel computing can solve problems faster
 +
 +//Cost// - Parallel computing can be done on inexpensive commodity hardware, and the time saved can translate into cost savings
 +
 +//Larger Problems// - Large, complex problems such as climate modeling, traffic simulation, website transaction processing, and plasma physics can be infeasible with serial computing and are better suited to parallel computing
 +
 +==== Concepts and Terminology ====
 +
 +//HPC// - High Performance Computing, solving big problems with big computing power
 +
 +//Cluster// - A group of computers working together
 +
 +//Node// - A single computer, many nodes join together to form a cluster
 +
 +//Core/CPU// - A processing unit
 +
 +//Job// - A problem or program to be run
 +
 +//Parallel Overhead// - Extra time needed to set up and coordinate a parallel job: synchronization, data exchange, start-up and termination, etc.
 +
 +//Scalability// - The ability of a system to handle a growing workload, or to be enlarged to accommodate one
 +
 +
 +
 +
 +==== Limits and Cost ====
 +
 +$Speedup = \frac{1}{(P/N)+S}$
 +
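 +This is Amdahl's law, where P = parallel fraction of the program, N = number of processors, and S = serial fraction (S = 1 - P). As N grows, the speedup approaches 1/S.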
 +
 +{{:images:amdahl2.gif|}}
 +
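To make the formula concrete, here is a small Python sketch that evaluates the speedup for a given parallel fraction and processor count. The function names and the example value P = 0.95 are assumptions chosen for illustration.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup = 1 / (P/N + S), with S = 1 - P."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (parallel_fraction / processors + serial_fraction)


def max_speedup(parallel_fraction):
    """Limit of the speedup as the number of processors goes to infinity: 1/S."""
    serial_fraction = 1.0 - parallel_fraction
    if serial_fraction == 0.0:
        return float("inf")
    return 1.0 / serial_fraction


if __name__ == "__main__":
    p = 0.95  # assumed example: 95% of the run time can be parallelized
    for n in (1, 2, 4, 16, 256, 65536):
        print(f"N = {n:>6}: speedup = {amdahl_speedup(p, n):.2f}")
    print(f"upper bound: {max_speedup(p):.2f}")
```

With P = 0.95, 16 processors give a speedup of roughly 9.1, and no number of processors can push it past 20, because the serial 5% always has to run on one processor.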
 +
 +Parallel programs cost more to develop than serial ones, because the work must be divided and the pieces coordinated.
 +
 +
 +==== Memory Architectures ====
 +
 +
 +
 +
 +===== Batch Processing =====
 +==== Batch Processing and Shared Resources ====
 +The need for shared resources
 +
 +//Our Cluster//
 +  * 16 cores on two nodes
 +
 +//NERSC//
 +  * National Energy Research Scientific Computing Center
 +  * Primary scientific computing facility for the Office of Science in the U.S. Department of Energy
 +  * Multiple machines with over 100,000 cores
 +{{:images:nersc.jpg?300|}}
 +
 +//Sunway TaihuLight//
 +  * Located in China; ranked first on the TOP500 list from June 2016 through November 2017, with 10,649,600 cores
 +{{::sunway-supercomputer-6.jpg?200|}}
 +
 +Check [[https://www.top500.org/|www.top500.org]] for an up-to-date list of the TOP500 supercomputers.
 +==== Batch Processing Management ====
 +With multiple users wishing to access computing resources, we need a way to manage who gets to use what resources and when they can do so.
 +
 +Some Resource Managers:
 +  * [[http://www.adaptivecomputing.com/products/open-source/torque/|Torque]]
 +  * [[https://computing.llnl.gov/linux/slurm/slurm.html|Slurm]]
 +  * [[http://www.adaptivecomputing.com/products/hpc-products/moab-hpc-suite-enterprise-edition/|Moab]]
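The core idea, deciding which job gets which cores and when, can be sketched with a toy first-come, first-served scheduler in Python. This is only an illustration under simplifying assumptions: the ''Job'' class, the ''schedule_fcfs'' function, and the example jobs are hypothetical, and real resource managers such as the ones listed above use more sophisticated policies.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Job:
    name: str       # hypothetical job label
    cores: int      # number of cores the job requests
    runtime: float  # run time in hours


def schedule_fcfs(jobs, total_cores):
    """Return {job name: start time} under strict first-come, first-served order."""
    free_cores = total_cores
    now = 0.0
    running = []  # min-heap of (end_time, cores) for jobs in progress
    start_times = {}
    for job in jobs:
        if job.cores > total_cores:
            raise ValueError(f"{job.name} requests more cores than the cluster has")
        # Wait for running jobs to finish until enough cores are free.
        while free_cores < job.cores:
            end_time, cores = heapq.heappop(running)
            now = max(now, end_time)
            free_cores += cores
        start_times[job.name] = now
        free_cores -= job.cores
        heapq.heappush(running, (now + job.runtime, job.cores))
    return start_times


if __name__ == "__main__":
    queue = [
        Job("job_a", cores=8, runtime=2.0),
        Job("job_b", cores=8, runtime=1.0),
        Job("job_c", cores=16, runtime=1.0),
        Job("job_d", cores=4, runtime=0.5),
    ]
    for name, start in schedule_fcfs(queue, total_cores=16).items():
        print(f"{name} starts at t = {start} h")
```

In this example ''job_a'' and ''job_b'' share the 16 cores immediately, ''job_c'' has to wait until both have finished, and ''job_d'' waits behind ''job_c'' even though 8 cores sit idle for an hour after ''job_b'' finishes. That idle time is one reason production schedulers go beyond strict queue order.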
 +===== Using HPC Resources at WSI =====
 +==== Accessing Resources via SSH ====
 +
 +In the office:
 +  $ ssh user@control
 +  
 +==== Launching and Managing Your Jobs ====
 +[[torque_tutorial|Using Torque (NERSC)]]
 +
 +[[slurm_tutorial|Using Slurm]]
 +==== Version Control ====
 +Often multiple programmers work on a codebase simultaneously. Alice may want to work on the code, but so does Bob. If they both change the same file without telling each other, changes might get lost. A version control system lets Alice and Bob work on the code together without undoing each other's work or breaking the code.
 +
 +=== Commonly Used Version Control Systems ===
 +  * Git (often hosted on GitHub)
 +  * SVN (Subversion)
 +===== HPC Best Practices =====
 +
 +  * Redundant Data
 +  * Separate Storage from Compute Nodes
 +  * Central Source for User Documentation