The field of high performance scientific computing lies at the

crossroads of a number of disciplines and skill sets, and

correspondingly, for someone to be successful at using high

performance computing in science requires at least elementary

knowledge of and skills in all these areas. This book brings

together the strands of numerical modeling, numerical linear

algebra, computer architecture, parallel computing, performance

optimization in a unified manner.

The contents of this book are a combination of theoretical material

and self-guided tutorials on various practical skills. Together,

this teaches a graduate student or advanced undergraduate the

necessary skills to be a successful computational scientist.

Victor Eijkhout is a research scientist at the

Texas Advanced Computing Center of

The University of Texas at Austin.

This book is released under a CC-BY license, thanks to a gift from the Saylor Foundation. Print copies and course materials are available from the author's web page.

- I Theory
- Single-processor Computing
- The Von Neumann architecture
- Modern processors
- The processing cores
- 8-bit, 16-bit, 32-bit, 64-bit
- Caches: on-chip memory
- Graphics, controllers, special purpose hardware
- Superscalar processing and instruction-level parallelism

- Memory Hierarchies
- Busses
- Latency and Bandwidth
- Registers
- Caches
- Prefetch streams
- Concurrency and memory transfer
- Memory banks
- TLB and virtual memory

- Multicore architectures
- Cache coherence
- Computations on multicore chips

- Locality and data reuse
- Data reuse and arithmetic intensity
- Locality

- Programming strategies for high performance
- Peak performance
- Pipelining
- Cache size
- Cache lines
- TLB
- Cache associativity
- Loop tiling
- Optimization strategies
- Cache aware and cache oblivious programming
- Case study: Matrix-vector product
- Case study: Goto matrix-matrix product

- Power consumption
- Derivation of scaling properties
- Multicore
- Total computer power

- Review questions

- Parallel Computing
- Introduction
- Quantifying parallelism
- Definitions
- Asymptotics
- Amdahl's law
- Scalability
- Simulation scaling

- Parallel Computers Architectures
- SIMD
- MIMD / SPMD computers
- The commoditization of supercomputers

- Different types of memory access
- Symmetric Multi-Processors: Uniform Memory Access
- Non-Uniform Memory Access
- Logically and physically distributed memory

- Granularity of parallelism
- Data parallelism
- Instruction-level parallelism
- Task-level parallelism
- Conveniently parallel computing
- Medium-grain data parallelism
- Task granularity

- Parallel programming
- Thread parallelism
- OpenMP
- Distributed memory programming through message passing
- Hybrid shared/distributed memory computing
- Parallel languages
- OS-based approaches
- Active messages
- Bulk synchronous parallelism
- Program design for parallelism

- Topologies
- Some graph theory
- Busses
- Linear arrays and rings
- 2D and 3D arrays
- Hypercubes
- Switched networks
- Cluster networks
- Bandwidth and latency
- Locality in parallel computing

- Multi-threaded architectures
- Co-processors
- A little history
- Bottlenecks
- GPU computing
- Intel Xeon Phi

- Remaining topics
- Load balancing
- Distributed computing, grid computing, cloud computing
- Usage scenarios
- Characterization

- Capability versus capacity computing
- FPGA computing
- MapReduce
- The top500 list
- The top500 list as a recent history of supercomputing

- Heterogeneous computing

- Computer Arithmetic
- Integers
- Real numbers
- They're not really real numbers
- Representation of real numbers
- Limitations
- Normalized and unnormalized numbers
- Representation error
- Machine precision
- The IEEE 754 standard for floating point numbers

- Round-off error analysis
- Correct rounding
- Addition
- Multiplication
- Subtraction
- Examples
- Roundoff error in parallel computations

- Compilers and round-off
- More about floating point arithmetic
- Programming languages
- Other computer arithmetic systems
- Extended precision
- Fixed-point arithmetic
- Complex numbers

- Conclusions

- Numerical treatment of differential equations
- Initial value problems
- Error and stability
- Finite difference approximation: Euler explicit and implicit methods

- Boundary value problems
- General PDE theory
- The Poisson equation in one space dimension
- The Poisson equation in two space dimensions
- Difference stencils
- Other discretization techniques

- Initial boundary value problem
- Discretization
- Stability analysis

- Initial value problems
- Numerical linear algebra
- Elimination of unknowns
- Linear algebra in computer arithmetic
- Roundoff control during elimination
- Influence of roundoff on eigenvalue computations

- LU factorization
- The algorithm
- The Cholesky factorization
- Uniqueness
- Pivoting
- Solving the system
- Complexity
- Block algorithms

- Sparse matrices
- Storage of sparse matrices
- Sparse matrices and graph theory
- LU factorizations of sparse matrices

- Iterative methods
- Abstract presentation
- Convergence and error analysis
- Computational form
- Convergence of the method
- Choice of K
- Stopping tests
- Theory of general iterative methods
- Iterating by orthogonalization
- Coupled recurrences form of iterative methods
- The method of Conjugate Gradients
- Derivation from minimization
- GMRES
- Complexity

- Further Reading

- High performance linear algebra
- Collective operations
- Broadcast
- Reduction
- Allreduce
- Allgather
- Reduce-scatter

- The sparse matrix-vector product
- Parallel dense matrix-vector product
- Implementing the block-row case
- Scalability of the dense matrix-vector product

- LU factorization in parallel
- Solving a triangular system
- Factorization

- Parallel sparse matrix-vector product
- Parallel efficiency of the sparse matrix-vector product
- Memory behaviour of the sparse matrix-vector product
- The transpose product
- Setup of the sparse matrix-vector product

- Parallelism in solving linear systems from PDEs
- Computational aspects of iterative methods
- Vector operations
- Finite element matrix construction

- Parallel preconditioners
- Jacobi preconditioning
- The trouble with parallel ILU
- Block Jacobi methods
- Parallel ILU

- Ordering strategies and parallelism
- Nested dissection
- Variable reordering and colouring: independent sets

- Operator splitting
- Parallelism and implicit operations
- Wavefronts
- Recursive doubling
- Approximating implicit by explicit operations, series expansion

- Grid updates
- Analysis
- Communication and work minimizing strategy

- Block algorithms on multicore architectures

- Collective operations

- Single-processor Computing
- II Applications
- Molecular dynamics
- Force Computation
- Force Fields
- Computing Short-Range Nonbonded Forces
- Computing Long-Range Forces

- Parallel Decompositions
- Atom Decompositions
- Force Decompositions
- Spatial Decompositions
- Neutral Territory Methods

- Parallel Fast Fourier Transform
- Parallel 1-D FFT
- Parallel 3-D FFT

- Integration for Molecular Dynamics

- Force Computation
- Sorting
- Brief introduction to sorting
- Quicksort
- Quicksort in shared memory
- Quicksort on a hypercube
- Quicksort on a general parallel processor

- Bitonic sort

- Graph analytics
- Traditional graph algorithms
- Shortest path algorithms
- Parallelization
- Floyd-Warshall all-pairs shortest path
- Spanning trees

- `Real world' graphs
- Properties of random graphs
- Concepts

- Hypertext algorithms
- HITS
- PageRank

- Large-scale computational graph theory

- Traditional graph algorithms
- N-body problems
- The Barnes-Hut algorithm
- The Fast Multipole Method
- Full computation
- Implementation
- Vectorization
- Shared memory implementation
- Distributed memory implementation

- Monte Carlo Methods
- Parallel Random Number Generation
- Examples
- Integration by statistics
- Monte Carlo simulation of the Ising model

- Computational biology
- Dynamic programming approaches
- Suffix tree

- Big data
- Recommender systems
- Stochastic gradient descent
- Alternating least squares

- Recommender systems
- Other physics applications
- Lattice Boltzmann methods
- Hartree-Fock / Density Functional Theory

- Molecular dynamics
- III Appendices
- Linear algebra
- Norms
- Gram-Schmidt orthogonalization
- The power method
- Nonnegative matrices; Perron vectors
- The Gershgorin theorem
- Householder reflectors

- Complexity
- Partial Differential Equations
- Partial derivatives
- Poisson or Laplace Equation
- Heat Equation
- Steady state

- Taylor series
- Graph theory
- Definitions
- Common types of graphs
- Graph colouring and independent sets
- Graphs and matrices
- Spectral graph theory

- Fourier Transforms
- Automata theory
- Finite State Automatons (FSAs)
- General discussion

- Parallel Prefix

- Linear algebra
- IV Tutorials
- Unix intro
- Files and such
- Text searching and regular expressions
- Command execution
- Scripting
- Expansion
- Shell interaction
- The system and other users
- The sed and awk tools
- Review questions

- Compilers and libraries
- An introduction to binary files
- Simple compilation
- Libraries

- Managing projects with Make
- A simple example
- Makefile power tools
- Miscellania
- Shell scripting in a Makefile
- A Makefile for LaTeX

- Source code control
- Workflow in source code control systems
- Subversion or SVN
- Mercurial or hg

- Scientific Data Storage
- Introduction to HDF5
- Creating a file
- Datasets
- Writing the data
- Reading

- Scientific Libraries
- The Portable Extendable Toolkit for Scientific Computing
- Libraries for dense linear algebra: Lapack and Scalapack

- Plotting with GNUplot
- Usage modes
- Plotting
- Workflow

- Good coding practices
- Defensive programming
- Guarding against memory errors
- Testing

- Debugging
- Invoking gdb
- Finding errors
- Memory debugging with Valgrind
- Stepping through a program
- Inspecting values
- Breakpoints
- Further reading

- Performance measurement
- Timers
- Accurate counters
- Profiling tools
- Tracing

- C/Fortran interoperability
- Linker conventions
- Arrays
- Strings
- Subprogram arguments
- Input/output
- Fortran/C interoperability in Fortran2003

- LaTeX for scientific documentation
- The idea behind LaTeX, some history
- A gentle introduction to LaTeX
- A worked out example
- Where to take it from here
- Review questions

- Unix intro
- V Projects, codes, index
- Class projects
- Cache simulation and analysis
- Evaluation of Bulk Synchronous Programming
- Heat equation
- The memory wall

- Codes
- Hardware event counting
- Test setup
- Cache size
- Cachelines
- Cache associativity
- TLB
- Unrepresentible numbers

- Index and list of acronyms

- Class projects

#### Free Machine Learning Books

11 Books

- Pattern Recognition and Machine Learning (Information Science and Statistics)
- by Christopher M. Bishop
- Data mining
- by I. H. Witten
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction
- by Various

#### Free Chemistry Textbooks

9 Books

- CK-12 Chemistry
- by Various
- Concept Development Studies in Chemistry
- by John Hutchinson
- An Introduction to Chemistry - Atoms First
- by Mark Bishop

#### Free Mathematics Textbooks

21 Books

- Microsoft Word - How to Use Advanced Algebra II.doc
- by Jonathan Emmons
- Advanced Algebra II: Activities and Homework
- by Kenny Felder
- de2de
- by

#### Free Children Books

38 Books

- The Sun Who Lost His Way
- by
- Tania is a Detective
- by Kanika G
- Firenze_s-Light
- by

#### Free Java Books

10 Books

- Java 3D Programming
- by Daniel Selman
- The Java EE 6 Tutorial
- by Oracle Corporation
- JavaKid811
- by