high_performance_computing_and_best_practices
high_performance_computing_and_best_practices [2015/09/11 17:34] – [Version Control] jestuber_gmail.com | high_performance_computing_and_best_practices [2022/07/21 06:59] (current) – external edit 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ====== High Performance Computing and Best Practices ====== | ||
+ | ===== Background ===== | ||
+ | ==== HPC vs. Single Processors ==== | ||
+ | Serial Computing - Instructions run one at a time: | ||
+ | {{: | ||
+ | |||
+ | Parallel Computing - A problem is split into multiple subproblems, each of which can run concurrently (at the same time) on a separate processor: | ||
+ | {{: | ||
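The split-then-combine idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`split`, `serial_sum_of_squares`, `parallel_sum_of_squares`) and the sum-of-squares workload are hypothetical, and threads are used purely to keep the sketch short. In CPython, CPU-bound work generally needs separate processes (or MPI across nodes) to see a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Split data into `parts` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks each take one leftover element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def serial_sum_of_squares(data):
    """Serial version: one instruction stream visits every element in turn."""
    total = 0
    for x in data:
        total += x * x
    return total


def parallel_sum_of_squares(data, workers=4):
    """Parallel version: each worker handles one chunk, then results are combined."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(serial_sum_of_squares, chunks))
    # Combining the partial results is part of the parallel overhead.
    return sum(partials)


if __name__ == "__main__":
    numbers = list(range(1_000))
    print(serial_sum_of_squares(numbers))
    print(parallel_sum_of_squares(numbers, workers=4))
```

Both calls return the same total; only the way the work is organized differs.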
+ | |||
+ | |||
+ | ==== Need for HPC ==== | ||
+ | //Time// - Parallel computing can solve problems faster | ||
+ | |||
+ | //Cost// - Parallel computing can be done on inexpensive hardware, and the time saved can translate into cost savings | ||
+ | |||
+ | //Larger Problems// - Complex problems like climate change, traffic, website transactions, | ||
+ | |||
+ | ==== Concepts and Terminology ==== | ||
+ | |||
+ | //HPC// - High Performance Computing, solving big problems with big computing power | ||
+ | |||
+ | //Cluster// - A group of computers working together | ||
+ | |||
+ | //Node// - A single computer; many nodes join together to form a cluster | ||
+ | |||
+ | // | ||
+ | |||
+ | //Job// - A problem or program to be run | ||
+ | |||
+ | //Parallel Overhead// - Extra time needed to set up and coordinate a parallel job: synchronizing, | ||
+ | |||
+ | // | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | ==== Limits and Cost ==== | ||
+ | |||
+ | $Speedup = \frac{1}{(P/N) + S}$ | ||
+ | |||
+ | P = parallel fraction, N = number of processors, S = serial fraction | ||
+ | |||
+ | {{: | ||
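A small calculation makes the limit concrete. The sketch below assumes the speedup formula in this section is Amdahl's law, Speedup = 1 / (P/N + S) with S = 1 - P; the function name `amdahl_speedup` is hypothetical and chosen only for illustration.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Estimated speedup under Amdahl's law (assumed model: 1 / (P/N + S))."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction  # S = 1 - P
    return 1.0 / (parallel_fraction / processors + serial_fraction)


if __name__ == "__main__":
    # With P = 0.9 the speedup can never exceed 1 / S = 10,
    # no matter how many processors are added.
    for n in (1, 2, 4, 16, 1024):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

The serial fraction sets the ceiling: once P/N becomes small compared with S, adding processors barely changes the result.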
+ | |||
+ | |||
+ | The development cost of parallel programs is higher than that of serial programs. | ||
+ | |||
+ | |||
+ | ==== Memory Architectures ==== | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | ===== Batch Processing ===== | ||
+ | ==== Batch Processing and Shared Resources ==== | ||
+ | The need for shared resources | ||
+ | |||
+ | //Our Cluster// | ||
+ | * 16 cores on two nodes | ||
+ | |||
+ | //NERSC// | ||
+ | * National Energy Research Scientific Computing Center | ||
+ | * Primary scientific computing facility for the Office of Science in the U.S. Department of Energy | ||
+ | * Multiple machines with over 100,000 cores | ||
+ | {{: | ||
+ | |||
+ | //Sunway TaihuLight// | ||
+ | * Located in China; the top-ranked HPC system at the time of writing, with 10,649,600 cores | ||
+ | {{:: | ||
+ | |||
+ | Check [[https:// | ||
+ | ==== Batch Processing Management ==== | ||
+ | With multiple users competing for the same computing resources, we need a way to manage who may use which resources and when. | ||
+ | |||
+ | Some Resource Managers: | ||
+ | * [[http:// | ||
+ | * [[https:// | ||
+ | * [[http:// | ||
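To illustrate what a resource manager decides, here is a toy first-come, first-served scheduler. It is a simplified sketch, not how any of the resource managers listed above actually work: the `Job` class, the `fifo_schedule` function, the user names, and the runtimes are all hypothetical, and real schedulers also handle priorities, time limits, and placement across nodes.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int    # number of cores requested
    runtime: int  # run time in arbitrary time units


def fifo_schedule(jobs, total_cores):
    """Return {job name: start time} under strict first-come, first-served order."""
    free = total_cores
    now = 0
    running = []  # min-heap of (finish_time, cores_held)
    starts = {}
    for job in jobs:
        if job.cores > total_cores:
            raise ValueError(f"{job.name} requests more cores than the cluster has")
        # Wait for running jobs to finish until enough cores are free.
        while free < job.cores:
            finish, cores = heapq.heappop(running)
            now = max(now, finish)
            free += cores
        starts[job.name] = now
        free -= job.cores
        heapq.heappush(running, (now + job.runtime, job.cores))
    return starts


if __name__ == "__main__":
    queue = [Job("alice", 12, 10), Job("bob", 8, 5), Job("carol", 4, 3)]
    for name, start in fifo_schedule(queue, total_cores=16).items():
        print(name, "starts at t =", start)
```

In this queue, the third job waits behind the second even though four cores sit idle while the first job runs. Avoiding that kind of waste is one reason production schedulers are more sophisticated than a plain queue.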
+ | ===== Using HPC Resources at WSI ===== | ||
+ | ==== Accessing Resources via SSH ==== | ||
+ | |||
+ | In the office: | ||
+ | $ ssh user@control | ||
+ | | ||
+ | ==== Launching and Managing Your Jobs ==== | ||
+ | [[torque_tutorial|Using Torque (NERSC)]] | ||
+ | |||
+ | [[slurm_tutorial|Using Slurm]] | ||
+ | ==== Version Control ==== | ||
+ | Multiple programmers often work on the same codebase simultaneously. Alice may want to work on the code, but so does Bob. If they both change a file without telling each other, changes might get lost. A version control system lets Alice and Bob work on the code together without undoing each other's work or breaking the code. | ||
+ | |||
+ | === Commonly Used Version Control Systems === | ||
+ | * Git (often hosted on GitHub) | ||
+ | * SVN (Apache Subversion) | ||
+ | ===== HPC Best Practices ===== | ||
+ | |||
+ | * Redundant Data | ||
+ | * Separate Storage from Compute Nodes | ||
+ | * Central Source for User Documentation | ||