Parallel Computing

Papers
(The median citation count of Parallel Computing is 2. The table below lists the papers at or above that threshold, based on CrossRef citation counts [max. 250 papers]. It covers papers published in the past four years, i.e., from 2020-04-01 to 2024-04-01.)
Article | Citations
NekRS, a GPU-accelerated spectral element Navier–Stokes solver | 40
Porting WarpX to GPU-accelerated platforms | 25
Towards electronic structure-based ab-initio molecular dynamics simulations with hundreds of millions of atoms | 17
Parallel and scalable Dunn Index for the validation of big data clusters | 17
OpenMP application experiences: Porting to accelerated nodes | 16
A thread-adaptive sparse approximate inverse preconditioning algorithm on multi-GPUs | 16
SVM-SMO-SGD: A hybrid-parallel support vector machine algorithm using sequential minimal optimization with stochastic gradient descent | 16
Toward performance-portable PETSc for GPU-based exascale systems | 15
Benchmarking the performance of irregular computations in AutoDock-GPU molecular docking | 14
A novel hybrid heuristic-based list scheduling algorithm in heterogeneous cloud computing environment for makespan optimization | 13
GPU algorithms for Efficient Exascale Discretizations | 12
GPU-based parallel multi-objective particle swarm optimization for large swarms and high dimensional problems | 11
LU-Cholesky QR algorithms for thin QR decomposition | 11
Enabling GPU accelerated computing in the SUNDIALS time integration library | 11
Implementation and evaluation of MPI 4.0 partitioned communication libraries | 10
Multiscale modeling and cinematic visualization of photosynthetic energy conversion processes from electronic to cell scales | 10
Building a novel physical design of a distributed big data warehouse over a Hadoop cluster to enhance OLAP cube query performance | 9
HBPFP-DC: A parallel frequent itemset mining using Spark | 9
Porting hypre to heterogeneous computer architectures: Strategies and experiences | 9
Dynamic power management for value-oriented schedulers in power-constrained HPC system | 8
On revisiting energy and performance in microservices applications: A cloud elasticity-driven approach | 8
Linear solvers for power grid optimization problems: A review of GPU-accelerated linear solvers | 8
AIR: Iterative refinement acceleration using arbitrary dynamic precision | 8
Measurement and analysis of GPU-accelerated applications with HPCToolkit | 8
AMG based on compatible weighted matching for GPUs | 7
A new scalable distributed k-means algorithm based on Cloud micro-services for High-performance computing | 7
Graph optimization algorithm for low-latency interconnection networks | 7
High performance sparse multifrontal solvers on modern GPUs | 7
Routing brain traffic through the von Neumann bottleneck: Efficient cache usage in spiking neural network simulation code on general purpose computers | 7
Achieving performance portability in Gaussian basis set density functional theory on accelerator based architectures in NWChemEx | 7
Towards performance portability in the Spark astrophysical magnetohydrodynamics solver in the Flash-X simulation framework | 6
ImRP: A Predictive Partition Method for Data Skew Alleviation in Spark Streaming Environment | 6
Callback-based completion notification using MPI Continuations | 6
Exploring GPU acceleration of Deep Neural Networks using Block Circulant Matrices | 6
A domain partitioning method using a multi-phase-field model for block-based AMR applications | 6
Scalable communication for high-order stencil computations using CUDA-aware MPI | 6
A novel method of grouping target paths for parallel programs | 5
Optimizing small channel 3D convolution on GPU with tensor core | 5
Optimal task scheduling for partially heterogeneous systems | 5
Ginkgo—A math library designed for platform portability | 5
A computational-graph partitioning method for training memory-constrained DNNs | 5
GPU acceleration of Levenshtein distance computation between long strings | 5
Asynchronous parallel stochastic Quasi-Newton methods | 5
Collectives in hybrid MPI+MPI code: Design, practice and performance | 5
Using long vector extensions for MPI reductions | 5
An international survey on MPI users | 4
Low-synch Gram–Schmidt with delayed reorthogonalization for Krylov solvers | 4
Minimizing development costs for efficient many-core visualization using MCD3 | 4
Sphynx: A parallel multi-GPU graph partitioner for distributed-memory systems | 4
Accelerated molecular dynamics simulation of Silicon Crystals on TaihuLight using OpenACC | 4
A parallel strategy for density functional theory computations on accelerated nodes | 4
MPI detach — Towards automatic asynchronous local completion | 4
Performance portability through machine learning guided kernel selection in SYCL libraries | 4
High performance solution of skew-symmetric eigenvalue problems with applications in solving the Bethe-Salpeter eigenvalue problem | 4
CCF: An efficient SpMV storage format for AVX512 platforms | 4
Parallel branch and bound algorithm for solving integer linear programming models derived from behavioral synthesis | 4
Speedup vs. quality: Asynchronous and cluster-based distributed adaptive genetic algorithms for ordered problems | 4
An optimisation of allreduce communication in message-passing systems | 4
GPU accelerated parallel reliability-guided digital volume correlation with automatic seed selection based on 3D SIFT | 4
OpenCL-like offloading with metaprogramming for SX-Aurora TSUBASA | 4
Delaunay triangulation of large-scale datasets using two-level parallelism | 3
Operational Data Analytics in practice: Experiences from design to deployment in production HPC environments | 3
Analysis of energy efficiency of a parallel AES algorithm for CPU-GPU heterogeneous platforms | 3
Parallelization of network motif discovery using star contraction | 3
QMPI: A next generation MPI profiling interface for modern HPC platforms | 3
On the scalability of CFD tool for supersonic jet flow configurations | 3
A multi-improvement local search using dataflow and GPU to solve the minimum latency problem | 3
Improving the I/O of large geophysical models using PnetCDF and BeeGFS | 3
An on-node scalable sparse incomplete LU factorization for a many-core iterative solver with Javelin | 3
AIOC2: A deep | 3
Parallel graph coloring algorithms for distributed GPU environments | 3
Tree cutting approach for domain partitioning on forest-of-octrees-based block-structured static adaptive mesh refinement with lattice Boltzmann method | 3
NVIDIA IndeX accelerated computing for visualizing Cholla's galactic winds | 3
Improved probabilistic I/O scheduling for limited-size Burst-Buffers deployed HPC | 2
Evaluating adaptive and predictive power management strategies for optimizing visualization performance on supercomputers | 2
Characterizing the performance of node-aware strategies for irregular point-to-point communication on heterogeneous architectures | 2
Visualizing the world’s largest turbulence simulation | 2
Spatial-aware data partition for distributed memory parallelization of ANN search in multimedia retrieval | 2
Metall: A persistent memory allocator for data-centric analytics | 2
Efficient parallel branch-and-bound approaches for exact graph edit distance problem | 2
Towards leveraging collective performance with the support of MPI 4.0 features in MPC | 2
Optimal ATAPE task scheduling on reconfigurable and partitionable hierarchical hypercube networks | 2
Octopus-DF: Unified DataFrame-based cross-platform data analytic system | 2
Robust parallel eigenvector computation for the non-symmetric eigenvalue problem | 2
Asynchronous runtime with distributed manager for task-based programming models | 2
Optimizing convolutional neural networks on multi-core vector accelerator | 2
Tight Lower bound on power consumption for scheduling real-time periodic tasks in core-level DVFS systems | 2
Immortal rays: Rethinking random ray neutron transport on GPU architectures | 2
An evaluation of fast segmented sorting implementations on GPUs | 2
Accelerating domain propagation: An efficient GPU-parallel algorithm over sparse matrices | 2
Context switch cost aware joint task merging and scheduling for deep learning applications | 2
Data stream processing in HPC systems: New frameworks and architectures for high-frequency streaming | 2
GPU-accelerated Lagrangian heuristic for multidimensional assignment problems with decomposable costs | 2
Multi-level parallel multi-layer block reproducible summation algorithm | 2
A case study on parallel HDF5 dataset concatenation for high energy physics data analysis | 2
NPDP benchmark suite for the evaluation of the effectiveness of automatic optimizing compilers | 2
MPI collective communication through a single set of interfaces: A case for orthogonality | 2
Towards scaling community detection on distributed-memory heterogeneous systems | 2
Resource allocation for task-level speculative scientific applications: A proof of concept using Parallel Trajectory Splicing | 2
HySet: A hybrid framework for exact set similarity join using a GPU | 2
OpenACC + Athread collaborative optimization of Silicon-Crystal application on Sunway TaihuLight | 2
An improved exact algorithm and an NP-completeness proof for sparse matrix bipartitioning | 2