\hfuzz\maxdimen % (or 999.0pt) \newdimen\hfuzz

Prototype for FC5-Aware Reorganization Operations in MPI-Next

2017 Sep 10

Robert P. Munafo, Boston University ( rmunafo@bu.edu )

1. Introduction

Recently, HPC systems (including "clusters", purpose-built dedicated supercomputers, and nodes in "clouds") have begun to incorporate one or more Field-Programmable Gate Array (FPGA) accelerators in or near each node to enable much greater computational performance for common and important HPC applications. In many cases, the FPGAs are equipped to communicate directly with each other, allowing data exchange between nodes together with on-chip computation. This allows many applications to achieve great savings in latency, as compared to performing the computation on a CPU or GPU and using the primary networking system (e.g. InfiniBand) to transfer data between nodes.

The Message Passing Interface (MPI) is a broadly-adopted standard for inter-node communication in HPC applications[1]. There is an ongoing research effort, headed by co-Principal Investigators Herbordt and Skjellum [4], to support acceleration of MPI functions through the use of the FPGAs and/or their backside networks.

One area of particular importance is data reorganization (shuffling or permutation): the exchange of large amounts of data between nodes, according to a pre-determined pattern. When the physical topology of the backside network allows, many patterns of data reorganization (for example, the modulo-2^n-1 exchange performed at step n of a Fast Fourier Transform) can be optimized through lower latency static routing.

The groups performing the aforementioned research have access to FPGA-equipped HPC systems, including Novo-G# [3] and Catapult [2] systems. It is a primary aim of their research to develop MPI-Next (a major version update to MPI, to succeed the current 3.x specification [6]) to support these specific systems, as well as other novel designs and non-FPGA commodity designs.

The aim of this research is to develop and certify several specific implementations of data reorganization on at least one of the FPGA-equipped systems, and on a conventional HPC cluster using InfiniBand.

2. Background --- Distributed Computing with MPI

The Message Passing Interface [6] is a standard specification for an API that enables large computational tasks to be distributed among autonomous processes running on many nodes connected by a network. Typically all processes working on a given task have identical binaries, and non-identical data. MPI functions (part of an MPI runtime library) allow each process to discover its "rank" (unique sequence number) and the "size of the world" (total number of processes working on this task).

Each process will typically begin by allocating memory to hold its share of the task's data set, then load/initialize its data. Computation is performed locally by each process, then data is communicated through calls to an MPI runtime library. This communication is usually with neighbors according to some logical network graph that corresponds to a natural and efficient division of the task's data set.

For example, processes might be arranged in a ring, and pass data only in the clockwise direction. Each process might call MPI_Isend to transmit some data from its own memory to the next-higher-rank process, and call MPI_Irecv to receive data from the next-lower-rank process. These API calls are asynchronous; the process could perform computation while waiting for the data communication to take place. A call to MPI_Waitall blocks until all sends and receives are actually finished. The process could then make use of the received data, then loop (returning to the MPI_Isend step).

The API calls mentioned in the previous paragraph are enough to do anything, albeit inefficiently, but many specialized functions exist. For example, MPI_Bcast broadcasts data from a single process to all other processes; MPI_Allreduce performs a reduction, such as adding the values of all data items, and sends the answer to all processes; MPI_Allgather sends a single datum from each process to all other processes.

Communication patterns are typically more complicated than a ring; HPC tasks are typically partitioned according to a 2- or 3-dimensional spatial arrangement, with data exchange taking place between each process and from 4 to 26 neighbors. Each process would invoke four API calls to each neighbor. To simplify this, MPI provides a distributed graph model, wherein each process informs MPI which "ranks" are its neighbors. Then transfers similar to broadcast and scatter can be performed, addressing just the neighbors and effectively performing from 4 to 26 MPI_Isend calls at once.

Many tasks involve repetitive communication that follows very simple patterns, and with FPGA-equipped systems the potential exists to accelerate these tasks by making the MPI library "aware" of the capabilities of the hardware.

3. Research Plan

The ISO/OSI 7-layer model will be used to describe the planned work. Development is necessarily bottom-up, as each layer depends upon functionality of the layers below it. So this description will process from the bottom layer up.

3.1 Simulation and Communications Modeling

The Novo-G# system is shared by the two PIs, a colleague at the University of Florida, and several members of their respective groups, but it does not support time-sharing or virtualization. I can expect that any access I might have will be very limited; and it will often be necessary to go for days or weeks without any access at all. The same concerns apply even more strongly for Catapult: the four-node test system at BU would be too small for this work.

Control studies will be performed on commodity networks (e.g. on the Shared Computing Cluster (SCC)) but there the routing will be performed transparently by an existing MPI implementation over InfiniBand, providing no way to verify the correctness of any optimized routing patterns, nor to measure their performance.

Therefore, for debugging and diagnosis it will be essential to have a simulator that accurately models the behavior of the actual system. Prof. Herbordt's group has used simulators in the past [5], [7]; I will need to create s simulator that incorporates design ideas from some of this prior work but that is largely new. This simulator will properly emulate the FPGA network with any parameter adjustments I have made (e.g. "Physical and Data Link Layers" below) performing data exchange according to the patterns used for specific tasks; and produce latency and throughput measurements.

3.2 Physical and Data Link Layers --- Runtime Libraries for FPGA (re)-Configuration

Others have already implemented a functional physical layer, but by design some aspects can be reconfigured by reprogramming the FGPA. I may need to change packet and flit sizes, buffer sizes, clock speeds, and other parameters of the physical and data link layers.

3.3 Network Layer --- Runtime Libraries to Provide Routing Configuration to Backside Network

Prior work has configured Novo-G# to support simple static routing, various routing patterns using table-lookup, and general adaptive routing. I will need to implement algorithms to generate routing tables specific to each type of data reorganization pattern that I wish to accelerate through the FPGA network.

3.4 Transport Layer --- Interface to MPI-Next

Applications will use APIs such as MPI_Graph_create, MPI_Dist_graph_create_adjacent, and MPI_Dist_graph_create to inform the system of their logical (task-space) topology, and such APIs as MPI_Neighbor_alltoallv to perform actual data reorganizations. To support the proposed optimizations, an FC5-aware MPI-Next implementation will need to recognize logical topologies and data exchange patterns that match one of the available (optimizable) patterns. This must be done in a distributed manner with local answers combined into a global answer via reduction; and results of each recognition attempt must be memoized to avoid repeating this time-intensive operation. A new API or an additional parameter to existing APIs is needed so that the application can tell the middleware when a given data reorganization pattern is identical to one that has been used before.

4. Evaluation Plan

Benchmarks representing kernels of important HPC applications (such as FFT, Matrix Transpose, and Molecular Dynamics) will be used to compare the performance of data reorganizations through InfiniBand vs. the same via FC5 using pre-existing dynamic routing, and FC5 accelerated by custom deterministic routing implemented in this work.

5. Proposed Course and Credit

EC 954, Fall 2017, 2 credits

EC 954, Spring 2018, 2 credits

Oral Defense: April 2018

Graduation Date: May 20 2018

Bibliography

[1] William Gropp, Ewing Lusk, Anthony Skjellum Using MPI: Portable Parallel Programming with the Message-passing Interface, Volume 1 (1999)

[2] A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services. In Proc. Int. Symp. on Computer Architecture. 13–24.

[3] A.D. George, M.C. Herbordt, H. Lam, A.G. Lawande, J. Sheng, C. Yang. "Novo-G#: A Community Resource for Exploring Large-Scale Reconfigurable Computing with Direct and Programmable Interconnects," Proceedings of the IEEE High Performance Extreme Computing Conference, 2016.

[4] SHF: Small: Collaborative Research: Coupling Computation and Communication in FPGA-Enhanced Clouds and Clusters, NSF awards #1618303 and #1617690, 2016 Jun 3

[5] Jie Meng, Eduard Llamosi, Fulya Kaplan, Chulian Zhang, Jaiyi Sheng, Martin Herbordt, Gunar Schirner, Ayse K. Coskun. Communication and cooling aware job allocation in data centers for communication-intensive workloads. J. Parallel Distrib. Comput. 96(2016) 181-193.

[6] The Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, Version 3.1, mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf , 2015 June 4.

[7] Jaiyi Sheng, High Performance Communication on Multi-FPGA Systems (Ph.D. dissertation), Boston University, 2017.

This page was written in the "embarrassingly readable" markup language RHTF, and was last updated on 2022 Jun 20.

s.27