What is massively parallel processing?
Massively Parallel Processing (MPP) is a processing paradigm where hundreds or thousands of processing nodes work on parts of a computational task in parallel. Each of these nodes run individual instances of an operating system. They have their own input and output devices, and do not share memory. They achieve a common computational task by communicating with each other over a high-speed interconnect.
Organizations that deal with huge volumes of ever-expanding data use massively parallel processing for their data processing. For example, imagine a popular insurance company with millions of customers. As the number of customers increases, so does the customer data. Even if the firm uses parallel processing, they may experience a delay in processing customer data. Assume a data analyst is running a query against 100 million rows of a database. If the organization uses a massively parallel processing system with 1000 nodes, each node has to bear only 1/1000 computational load.
What are the major hardware components of massively parallel processing?
It’s essential to understand the hardware components of a massively parallel processing system to understand various architectures.
Processing nodes
Processing nodes are the basic building blocks of massively parallel processing. These nodes are simple, homogeneous processing cores with one or more central processing units. The nodes can be visualized as simple desktop PCs.
High-speed interconnect
The nodes in a massively parallel processing system parallelly work on parts of a single computation problem. Even though their processing is independent of each other, they need to regularly communicate with each other as they are trying to solve a common problem. A low latency, high bandwidth connection is required between the nodes. This is called a high-speed interconnect or a bus. It might be an ethernet connection, fiber distributed data interface, or any proprietary connection method.
Distributed lock manager (DLM)
In those massively parallel processing architectures where the external memory or disk space is shared among the nodes, a distributed lock manager (DLM) coordinates this resource sharing. The distributed lock manager takes a request for resources from various nodes and connects the nodes when the resources are available. In some systems, the distributed lock manager ensures data consistency and recovery of any failed node.
Massively parallel processing architecture(s)
The massively parallel processing architectures belong to two major groups depending upon how the nodes share their resources.
Shared disk systems
Each processing node in the shared disk system will have one or more central processing units (CPUs) and an independent random-access memory (RAM). These nodes, however, share an external disk space for the storage of files. These processing nodes are connected with a high-speed bus. The scalability of the shared disk systems depends upon the bandwidth of the high-speed interconnect and the hardware constraints on distributed lock manager.
What are the advantages of shared disk systems?
As all the nodes share a single external database, the massively parallel processing system becomes highly available. No data is permanently lost even if one node is damaged. The shared disk systems are more straightforward as they do not have to use a distributed database. It is easy to add new nodes in the shared disk systems.
What are the disadvantages of shared disk systems?
Since the processing nodes share a common disk, the coordination of the data access is complex. The system needs to rely on a distributed lock manager. These communications between nodes take up some bandwidth of the high-speed interconnect. The shared disk needs an operating system to control it. This adds additional overhead.
Shared nothing systems
A more popular architecture of massively parallel processing systems is the “shared nothing” architecture. The processing nodes have independent random-access memory and disk that stores the necessary files and databases. The data that needs to be processed is shared among the nodes using various techniques.
Replicated Database: In this method, each processing node owns a complete copy of the data. In this model, the risk of data loss is low, even if a few nodes fail. This model has the overhead of additional storage space.
Distributed Database: In this model, the database is partitioned into multiple slices. Each processing node owns a particular slice of the database and works on it. This method saves a lot of disk storage as there is no redundancy. However, this method is more complex than a replicated database. Here, a lot of data moves between the nodes to complete the processing. This will increase the traffic on the interconnect bus. This model might also lead to data loss, as there is no redundancy.
What are the advantages of “shared nothing” systems?
Shared nothing systems can horizontally scale to include a massive number of nodes. As the processing nodes are relatively independent of each other, adding a new processing node is easier. If the database is read-only, the “shared nothing” systems work great. The failure of a node does not affect the other nodes as they are almost independent. The chances of database corruption are minimal in the case of “shared nothing” systems.
What are the disadvantages of the “shared nothing” systems?
The “shared nothing” systems with distributed databases need a lot of coordination to complete a common task. Each node owns slices of the database. Managing this database could be very difficult. Shared nothing systems with the replicated database are not suitable for applications with tremendous data requirements. If the computation needs a lot of data modification operations like data insertion and join, then the “shared nothing” architecture may not be viable.
Massively parallel processing (MPP) vs symmetric multi-processing (SMP)
Massively parallel processing is a loosely-coupled system where nodes don’t share a memory or disk space in some cases. Massively parallel processing can be thought of as an array of independent processing nodes communicating over a high-speed interconnection bus.
Symmetric multi-processing is a single system with multiple, tightly coupled processors. These processors share the operating system, I/O devices, and memory. Symmetric multi-processing systems are often cheaper than massively parallel processing. But, symmetric multi-processing is limited on how much it can scale. All the processing nodes in symmetric multi-processing share a single memory bus, so as the number of processors increases, memory bottlenecks may occur, and it will slow down the system eventually. Massively parallel processing systems are costly and complex, but they can grow infinitely. Massively parallel processing works best when the processing tasks can be perfectly partitioned, and minimal communication between the nodes is required.