Parallel matrix algorithms
Nonparallel matrix multiplication algorithm
MPI9Matr1°. Integers M, P, Q, a matrix A of the size M × P, and a matrix B of the size P × Q are given in the master process. Find and output a M × Q matrix C that is the product of the matrices A and B. The formula for calculating the elements of the matrix C under the assumption that the rows and columns of all matrices are numbered from 0 is as follows: C_{I,J} = A_{I,0}·B_{0,J} + A_{I,1}·B_{1,J} + … + A_{I,P1}·B_{P1,J}, where I = 0, …, M − 1, J = 0, …, Q − 1. To store the matrices A, B, C, use onedimensional arrays of size M·P, P·Q, and M·Q placing elements of matrices in a rowmajor order (that is, the matrix element with indices I and J will be stored in the element of the corresponding array with the index I·N + J, where N is the number of columns of the matrix). The slave processes are not used in this task.
Band algorithm 1 (horizontal bands)
MPI9Matr2°. Integers M, P, Q, a matrix A of the size M × P, and a matrix B of the size P × Q are given in the master process. In the first variant of the band algorithm of matrix multiplication, each matrix multiplier is divided into K horizontal bands, where K is the number of processes (hereinafter bands are distributed by processes and used to calculate a part of the total matrix product in each process). The band of the matrix A contains N_{A} rows, the band of the matrix B contains N_{B} rows. The numbers N_{A} and N_{B} are calculated as follows: N_{A} = ceil(M/K), N_{B} = ceil(P/K), where the operation "/" means the division of real numbers and the function ceil performs rounding up. If the matrix contains insufficient number of rows to fill the last band then the zerovalued rows should be added to this band. Add, if necessary, the zerovalued rows to the initial matrices, save them in onedimensional arrays in the master process, and then send the matrix bands from these arrays to all processes as follows: a band with the index R is sent to the process of rank R (R = 0, 1, …, K − 1), all the bands A_{R} are of the size N_{A} × P, all the bands B_{R} are of the size N_{B} × Q. In addition, create a band C_{R} in each process to store the part of the matrix product C = AB which will be calculated in this process. Each band C_{R} is of the size N_{A} × Q and is filled with zerovalued elements. The bands, like the initial matrices, should be stored in onedimensional arrays in a rowmajor order. To send the matrix sizes, use the MPI_Bcast collective function, to send the bands of the matrices A and B, use the MPI_Scatter collective function. Include all the above mentioned actions in a Matr1ScatterData function (without parameters). As a result of the call of this function, each process will receive the values N_{A}, P, N_{B}, Q, as well as onedimensional arrays filled with the corresponding bands of the matrices A, B, C. Output all obtained data (that is, the numbers N_{A}, P, N_{B}, Q and the bands of the matrices A, B, C) in each process after calling the Matr1ScatterData function. Perform the input of initial data in the Matr1ScatterData function, perform the output of the results in the Solve function. Note. To reduce the number of the MPI_Bcast function calls, all matrix sizes may be sent as a single array.
MPI9Matr3°. Integers N_{A}, P, N_{B}, Q and onedimensional arrays filled with the corresponding bands of matrices A, B, C are given in each process (thus, the given data coincide with the results obtained in the MPI9Matr2 task). Implement the first step of the band algorithm of matrix multiplication as follows: multiply elements in the bands A_{R} and B_{R} of each process and perform the cyclic sending each band B_{R} to the process of the previous rank (that is, from the process 1 to the process 0, from the process 2 to the process 1, …, from the process 0 to the process K − 1, where K is the number of processes). Use the MPI_Sendrecv_replace function to send the bands. To determine the ranks of the sending and receiving processes, use the expression containing the % operator that gives the remainder of a division. Include all the above mentioned actions in a Matr1Calc function (without parameters). Output the new contents of the bands C_{R} and B_{R} in each process; perform data input and output in the Solve function. Note. As a result of multiplying the bands A_{R} and B_{R}, each element of the band C_{R} will contain a part of the terms included in the elements of the product AB; all elements of the band B_{R} and some of the elements of the band A_{R} will be used (in particular, the first N_{B} elements of the band A_{0} will be used in the process 0 in the first step and the last N_{B} elements of the band A_{K1} will be used in the process K − 1 in the first step).
MPI9Matr4°. Integers N_{A}, P, N_{B}, Q and onedimensional arrays filled with the corresponding bands of matrices A, B, C are given in each process (thus, the given data coincide with the results obtained in the MPI9Matr2 task). Modify the Matr1Calc function, which was implemented in the previous task; the modified function should provide execution of any step of the band algorithm of matrix multiplication. To do this, add the parameter named step to the function (this parameter specifies the step number and may be in the range 0 to K − 1, where K is the number of processes) and use the value of this parameter in the part of the algorithm that deals with the recalculation of the elements of the band C_{R} (the cyclic sending of the bands B_{R} does not depend on the value of the parameter step). Using two calls of the modified Matr1Calc function with the parameters 0 and 1, execute two initial steps of the band algorithm and output the new contents of the bands C_{R} and B_{R} in each process. Perform data input and output in the Solve function. Note. The parameter step determines which part of the band A_{R} will be used for the next recalculation of the elements of the band C_{R} (note that these parts should be selected cyclically).
MPI9Matr5°. Integers N_{A}, P, N_{B}, Q and onedimensional arrays filled with the corresponding bands of matrices A, B, C are given in each process (thus, the given data coincide with the results obtained in the MPI9Matr2 task). In addition, a number L with the same value is given in each process. The value of L is in the range 3 to K, where K is the number of processes, and determines the number of steps of the band algorithm. Using the function Matr1Calc(I) (see the previous task) in a loop with the parameter I (I = 0, …, L − 1), execute the initial L steps of the band algorithm and output the new contents of the bands C_{R} and B_{R} in each process. Perform data input and output in the Solve function. Remark. If the value of L is equal to K then the bands C_{R} will contain the corresponding parts of the final matrix product AB.
MPI9Matr6°. An integer M (the number of rows of the matrix product) is given in the master process. In addition, integers N_{A}, Q and onedimensional arrays filled with the N_{A} × Q bands of matrix C are given in each process (the given bands of C are obtained as a result of K steps of the band algorithm — see the MPI9Matr5 task). Send all the bands C_{R} to the master process and output the received matrix C of the size M × Q in this process. To store the resulting matrix C in the master process, use a onedimensional array sufficient to store the matrix of the size (N_{A}·K) × Q. To send data to this array, use the MPI_Gather collective function. Include all the above mentioned actions in a Matr1GatherData function (without parameters). Perform the input of initial data in the Solve function, perform the output of the resulting matrix in the Matr1GatherData function.
MPI9Matr7°. Integers M, P, Q, a matrix A of the size M × P, and a matrix B of the size P × Q are given in the master process (thus, the given data coincide with the given data in the MPI9Matr2 task). Using successively the Matr1ScatterData, Matr1Calc (in a loop), and Matr1GatherData functions, that are developed in the MPI9Matr2−MPI9Matr6 tasks, find a matrix C, which is equal to the product of the initial matrices A and B, and output this matrix in the master process. In addition, output the current contents of the band C_{R} in each process after each call of the Matr1Calc function. Modify the Matr1Calc function (see the MPI9Matr4 task), before using in this task, as follows: the bands B_{R} should not be sent when the parameter step is equal to K − 1.
MPI9Matr8°. Integers M, P, Q and two file names are given in the master process. The given files contain elements of a matrix A of the size M × P and a matrix B of the size P × Q. Modify the initial stage of the band algorithm of matrix multiplication (see the MPI9Matr2 task) as follows: each process should read the corresponding bands of the matrices A and B directly from the given files using the MPI_File_seek and MPI_File_read_all collective functions (a new file view is not required). To send the sizes of matrices and file names, use the MPI_Bcast collective function. Include all these actions in a Matr1ScatterFile function (without parameters). As a result of the call of this function, each process will receive the values N_{A}, P, N_{B}, Q, as well as onedimensional arrays filled with the corresponding bands of the matrices A, B, C. Output all obtained data (that is, the numbers N_{A}, P, N_{B}, Q and the bands of the matrices A, B, C) in each process after calling the Matr1ScatterFile function. Perform the input of initial data in the Matr1ScatterFile function, perform the output of the results in the Solve function. Remark. For some bands, some of their elements (namely, the last rows) or even the entire bands should not be read from the source files and will remain zerovalued ones. However, this situation does not require special processing, since the MPI_File_read_all function automatically stops reading the data (without generating any error message) when the end of the file is reached.
MPI9Matr9°. Integers N_{A}, Q and onedimensional arrays filled with the N_{A} × Q bands C_{R} are given in each process (the given bands C_{R} are obtained as a result of K steps of the band algorithm of matrix multiplication — see the MPI9Matr5 task). In addition, an integer M (the number of rows of the matrix product) and the name of file (to store this product) are given in the master process. Send the number M and the file name to all processes using the MPI_Bcast function. Write all the parts of the matrix product contained in the bands C_{R} to the resulting file, which will eventually contain a matrix C of the size M × Q. To write the bands to the file, use the MPI_File_seek and MPI_File_write_all collective functions. Include all these actions (namely, the input of file name, sending number M and the file name, and writing all bands to the file) in a Matr1GatherFile function. Perform the input of all initial data, except the file name, in the Solve function. Note. When writing data to the resulting file, it is necessary to take into account that some of the bands C_{R} may contain trailing zerovalued rows that are not related to the resulting matrix product (the number M should be sent to all processes in order to control this situation).
MPI9Matr10°. Integers M, P, Q and three file names are given in the master process. The first two file names are related to the existing files containing the elements of matrices A and B of the size M × P and P × Q, respectively, the third file should be created to store the resulting matrix product C = AB. Using successively the Matr1ScatterFile, Matr1Calc (in a loop), and Matr1GatherFile functions (see the MPI9Matr8, MPI9Matr5, MPI9Matr9 tasks), find a matrix C and write its elements to the resulting file. In addition, output the current value of the c[step] in each process after each call of the Matr1Calc function, where c is a onedimensional array containing the band C_{R}, and step is the algorithm step number (0, 1, …, K − 1). Thus, the element c[0] should be output on the first step of the algorithm, the element c[1] should be output on the second step of the algorithm, and so on.
Band algorithm 2 (horizontal and vertical bands)
MPI9Matr11°. Integers P and Q are given in each process; in addition, a matrix B of the size P × Q is given in the master process. The number Q is a multiple of the number of processes K. Input the matrix B into a onedimensional array of the size P·Q in the master process and define a new datatype named MPI_BAND_B that contains a vertical band of the matrix B. The width of the vertical band should be equal to N_{B} = Q/K columns. When defining the MPI_BAND_B datatype, use the MPI_Type_vector and MPI_Type_commit functions. Include this definition in a Matr2CreateTypeBand(p, n, q, t) function with the input integer parameters p, n, q and the output parameter t of the MPI_Datatype type; the parameters p and n determine the size of the vertical band (the number of its rows and columns), and the parameter q determines the number of columns of the matrix from which this band is extracted. Using the MPI_BAND_B datatype, send to each process (including the master process) the corresponding band of the matrix B in the ascending order of ranks of receiving processes. Sending should be performed using the MPI_Send and MPI_Recv functions; the bands should be stored in onedimensional arrays of the size P·N_{B}. Output the received band in each process. Remark. In the MPICH2 version 1.3, the MPI_Send function call is erroneous if the sending and receiving processes are the same. You may use the MPI_Sendrecv function to send a band to the master process. You may also fill a band in the master process without using tools of the MPI library.
MPI9Matr12°. Integers M, P, Q, a matrix A of the size M × P, and a matrix B of the size P × Q are given in the master process. In the second variant of the band algorithm of matrix multiplication, the first multiplier (the matrix A) is divided into K horizontal bands and the second multiplier (the matrix B) is divided into K vertical bands, where K is the number of processes (hereinafter bands are distributed by processes and used to calculate a part of the total matrix product in each process). The band of the matrix A contains N_{A} rows, the band of the matrix B contains N_{B} columns. The numbers N_{A} and N_{B} are calculated as follows: N_{A} = ceil(M/K), N_{B} = ceil(Q/K), where the operation "/" means the division of real numbers and the function ceil performs rounding up. If the matrix contains insufficient number of rows (or columns) to fill the last band then the zerovalued rows (or columns) should be added to this band. Add, if necessary, the zerovalued rows or columns to the initial matrices, save them in onedimensional arrays in the master process, and then send the matrix bands from these arrays to all processes as follows: a band with the index R is sent to the process of rank R (R = 0, 1, …, K − 1), all the bands A_{R} are of the size N_{A} × P, all the bands B_{R} are of the size P × N_{B}. In addition, create a band C_{R} in each process to store the part of the matrix product C = AB which will be calculated in this process. Each band C_{R} is of the size (N_{A}·K) × N_{B} and is filled with zerovalued elements. The bands, like the initial matrices, should be stored in onedimensional arrays in a rowmajor order. To send the matrix sizes, use the MPI_Bcast collective function, to send the bands of the matrix A, use the MPI_Scatter collective function, to send the bands of the matrix B, use the MPI_Send and MPI_Recv functions and also the MPI_BAND_B datatype created by the Matr2CreateTypeBand function (see the previous task and a note to it). Include all the above mentioned actions in a Matr2ScatterData function (without parameters). As a result of the call of this function, each process will receive the values N_{A}, P, N_{B}, as well as onedimensional arrays filled with the corresponding bands of the matrices A, B, C. Output all obtained data (that is, the numbers N_{A}, P, N_{B} and the bands of the matrices A, B, C) in each process after calling the Matr2ScatterData function. Perform the input of initial data in the Matr2ScatterData function, perform the output of the results in the Solve function. Notes. (1) When input the matrix B into an array in the master process, it should be taken into account that this array may contain elements corresponding to additional zerovalued columns. (2) To reduce the number of the MPI_Bcast function calls, all matrix sizes may be sent as a single array.
MPI9Matr13°. Integers N_{A}, P, N_{B} and onedimensional arrays filled with the corresponding bands of matrices A, B, C are given in each process (thus, the given data coincide with the results obtained in the MPI9Matr12 task). Implement the first step of the band algorithm of matrix multiplication as follows: multiply elements in the bands A_{R} and B_{R} of each process and perform the cyclic sending each band A_{R} to the process of the previous rank (that is, from the process 1 to the process 0, from the process 2 to the process 1, …, from the process 0 to the process K − 1, where K is the number of processes). Use the MPI_Sendrecv_replace function to send the bands. To determine the ranks of the sending and receiving processes, use the expression containing the % operator that gives the remainder of a division. Include all the above mentioned actions in a Matr2Calc function (without parameters). Output the new contents of the bands C_{R} and A_{R} in each process; perform data input and output in the Solve function. Note. In this variant of the band algorithm, the bands A_{R} contain the full rows of the matrix A and the bands B_{R} contain the full columns of the matrix B, so, as a result of their multiplication, the band C_{R} will contain part of the elements of the final matrix product already at the first step of the algorithm (the other elements of the band C_{R} will remain zerovalued). The location of the found elements in the band C_{R} depends on the rank of the process (in particular, the first N_{A} rows of the band C_{0} in the process 0 will be filled in the first step and the last N_{A} rows of the band C_{K1} in the process K − 1 will be filled in the first step).
MPI9Matr14°. Integers N_{A}, P, N_{B} and onedimensional arrays filled with the corresponding bands of matrices A, B, C are given in each process (thus, the given data coincide with the results obtained in the MPI9Matr12 task). Modify the Matr2Calc function, which was implemented in the previous task; the modified function should provide execution of any step of the band algorithm of matrix multiplication. To do this, add the parameter named step to the function (this parameter specifies the step number and may be in the range 0 to K − 1, where K is the number of processes) and use the value of this parameter in the part of the algorithm that deals with the recalculation of the elements of the band C_{R} (the cyclic sending of the bands A_{R} does not depend on the value of the parameter step). Using two calls of the modified Matr2Calc function with the parameters 0 and 1, execute two initial steps of the band algorithm and output the new contents of the bands C_{R} and A_{R} in each process. Perform data input and output in the Solve function. Note. The parameter step determines which rows of the band C_{R} will be calculated in this step of the algorithm (note that these rows are selected cyclically).
MPI9Matr15°. Integers N_{A}, P, N_{B} and onedimensional arrays filled with the corresponding bands of matrices A, B, C are given in each process (thus, the given data coincide with the results obtained in the MPI9Matr12 task). In addition, a number L with the same value is given in each process. The value of L is in the range 3 to K, where K is the number of processes, and determines the number of steps of the band algorithm. Using the function Matr2Calc(I) (see the previous task) in a loop with the parameter I (I = 0, …, L − 1), execute the initial L steps of the band algorithm and output the new contents of the bands C_{R} and A_{R} in each process. Perform data input and output in the Solve function. Remark. If the value of L is equal to K then the bands C_{R} will contain the corresponding parts of the final matrix product AB.
MPI9Matr16°. Integers M and Q (the numbers of rows and columns of the matrix product) are given in the master process. In addition, integers N_{A}, N_{B} and onedimensional arrays filled with the (N_{A}·K) × N_{B} bands of the matrix C are given in each process (the given bands of C are obtained as a result of K steps of the band algorithm — see the MPI9Matr15 task). Send all the bands C_{R} to the master process and output the received matrix C of the size M × Q in this process. To store the resulting matrix C in the master process, use a onedimensional array sufficient to store the matrix of the size (N_{A}·K) × (N_{B}·K). To send data to this array, use the MPI_Send and MPI_Recv functions and the MPI_BAND_C datatype created by the Matr2CreateTypeBand function (see the MPI9Matr11 task and a note to it). Include all the above mentioned actions in a Matr2GatherData function (without parameters). Perform the input of initial data in the Solve function, perform the output of the resulting matrix in the Matr2GatherData function. Note. When output the matrix C in the master process, it should be taken into account that an array, which is intended for matrix storage, may contain elements corresponding to additional zerovalued columns.
MPI9Matr17°. Integers M, P, Q, a matrix A of the size M × P, and a matrix B of the size P × Q are given in the master process (thus, the given data coincide with the given data in the MPI9Matr12 task). Using successively the Matr2ScatterData, Matr2Calc (in a loop), and Matr2GatherData functions, that are developed in the MPI9Matr12−MPI9Matr16 tasks, find a matrix C, which is equal to the product of the initial matrices A and B, and output this matrix in the master process. In addition, output the current contents of the band C_{R} in each process after each call of the Matr2Calc function. Modify the Matr2Calc function (see the MPI9Matr14 task), before using in this task, as follows: the bands A_{R} should not be sent when the parameter step is equal to K − 1.
MPI9Matr18°. Integers M, P, Q and two file names are given in the master process. The given files contain elements of a matrix A of the size M × P and a matrix B of the size P × Q. The number Q is a multiple of the number of processes K. Modify the initial stage of the band algorithm of matrix multiplication (see the MPI9Matr12 task) as follows: each process should read the corresponding bands of the matrices A and B directly from the given files. To send the sizes of matrices and file names, use the MPI_Bcast collective function. Use the MPI_File_seek and MPI_File_read_all collective functions to read the horizontal bands of the matrix A. To read the vertical bands of the matrix B, set the appropriate file view using the MPI_File_set_view function and the MPI_BAND_B file type defined with the Matr2CreateTypeBand function (see the MPI9Matr11 task), and then use the MPI_File_read_all function. Include all these actions in a Matr2ScatterFile function (without parameters). As a result of the call of this function, each process will receive the values N_{A}, P, N_{B}, as well as onedimensional arrays filled with the corresponding bands of the matrices A, B, C. Output all obtained data (that is, the numbers N_{A}, P, N_{B} and the bands of the matrices A, B, C) in each process after calling the Matr2ScatterFile function. Perform the input of initial data in the Matr2ScatterFile function, perform the output of the results in the Solve function. Note. A condition that the number Q is a multiple of K allows us to perform reading of the bands B_{R} using the same file type in all processes. If this condition is not fulfilled then it would be necessary to use special types that ensure the correct reading from the file and write to the array of "truncated" bands of the matrix B in the last processes (in addition, in this case it would be necessary to send to each process the value of Q which is necessary for the correct type definition for "truncated" bands).
MPI9Matr19°. Integers N_{A}, N_{B} and onedimensional arrays filled with the (N_{A}·K) × N_{B} bands C_{R} are given in each process (the given bands C_{R} are obtained as a result of K steps of the band algorithm of matrix multiplication — see the MPI9Matr15 task). In addition, an integer M (the number of rows of the matrix product) and the name of file (to store this product) are given in the master process. The number of columns Q of the matrix product is a multiple of the number of processes K (and, therefore, is equal to N_{B}·K). Send the number M and the file name to all processes using the MPI_Bcast function. Write all the parts of the matrix product contained in the bands C_{R} to the resulting file, which will eventually contain a matrix C of the size M × Q. To write the bands to the file, set the appropriate file view using the MPI_File_set_view function and the MPI_BAND_C file type defined with the Matr2CreateTypeBand function (see the MPI9Matr11 task), and then use the MPI_File_write_all function. Include all these actions (namely, the input of file name, sending number M and the file name, and writing all bands to the file) in a Matr2GatherFile function. Perform the input of all initial data, except the file name, in the Solve function. Note. When writing data to the resulting file, it is necessary to take into account that the bands C_{R} may contain trailing zerovalued rows that are not related to the resulting matrix product (the number M should be sent to all processes in order to control this situation).
MPI9Matr20°. Integers M, P, Q and three file names are given in the master process. The first two file names are related to the existing files containing the elements of matrices A and B of the size M × P and P × Q, respectively, the third file should be created to store the resulting matrix product C = AB. The number Q is a multiple of the number of processes K. Using successively the Matr2ScatterFile, Matr2Calc (in a loop), and Matr2GatherFile functions (see the MPI9Matr18, MPI9Matr15, MPI9Matr19 tasks), find a matrix C and write its elements to the resulting file. In addition, output the current value of the c[step] in each process after each call of the Matr2Calc function, where c is a onedimensional array containing the band C_{R}, and step is the algorithm step number (0, 1, …, K − 1). Thus, the element c[0] should be output on the first step of the algorithm, the element c[1] should be output on the second step of the algorithm, and so on.
Cannon's block algorithm
MPI9Matr21°. Integers M and P are given in each process; in addition, a matrix A of the size M × P is given in the master process. The number of processes K is a perfect square: K = K_{0}·K_{0}, the numbers M and P are multiples of K_{0}. Input the matrix A into a onedimensional array of the size M·P in the master process and define a new datatype named MPI_BLOCK_A that contains a M_{0} × P_{0} block of the matrix A, where M_{0} = M/K_{0}, P_{0} = P/K_{0}. When defining the MPI_BLOCK_A type, use the MPI_Type_vector and MPI_Type_commit functions. Include this definition in a Matr3CreateTypeBlock(m0, p0, p, t) function with the input integer parameters m0, p0, p and the output parameter t of the MPI_Datatype type; the parameters m0 and p0 determine the size of the block, and the parameter p determines the number of columns of the matrix from which this block is extracted. Using the MPI_BLOCK_A datatype, send to each process (in ascending order of ranks of processes, including the master process) the corresponding block of the matrix A in a rowmajor order of blocks (that is, the first block should be sent to the process 0, the next block in the same row of blocks should be sent to the process 1, and so on). Sending should be performed using the MPI_Send and MPI_Recv functions; the blocks should be stored in onedimensional arrays of the size M_{0}·P_{0}. Output the received block in each process. Remark. In the MPICH2 version 1.3, the MPI_Send function call is erroneous if the sending and receiving processes are the same. You may use the MPI_Sendrecv function to send a block to the master process. You may also fill a block in the master process without using tools of the MPI library.
MPI9Matr22°. Integers M_{0}, P_{0} and a matrix A of the size M_{0} × P_{0} are given in each process. The number of processes K is a perfect square: K = K_{0}·K_{0}. Input the matrix A into a onedimensional array of the size M_{0}·P_{0} in each process and create a new communicator named MPI_COMM_GRID using the MPI_Cart_create function. The MPI_COMM_GRID communicator defines a Cartesian topology for all processes as a twodimensional periodic K_{0} × K_{0} grid (ranks of processes should not be reordered). Include the creation of the MPI_COMM_GRID communicator in a Matr3CreateCommGrid(comm) function with the output parameter comm of the MPI_Comm type. Using the MPI_Cart_coords function for this communicator, output the process coordinates (I_{0}, J_{0}) in each process. Perform a cyclic shift of the matrices A given in all processes of each grid row I_{0} by I_{0} positions left (that is, in descending order of ranks of processes) using the MPI_Cart_shift and MPI_Sendrecv_replace functions. Output the received matrix in each process.
MPI9Matr23°. Integers M, P, Q, a matrix A of the size M × P, and a matrix B of the size P × Q are given in the master process. The number of processes K is a perfect square: K = K_{0}·K_{0}. In the block algorithms of matrix multiplication, the initial matrices are divided into K blocks and are interpreted as square block matrices of the order K_{0} (hereinafter blocks are distributed by processes and used to calculate a part of the total matrix product in each process). The block of the matrix A is of the size M_{0} × P_{0}, the block of the matrix B is of the size P_{0} × Q_{0}, the numbers M_{0}, P_{0}, Q_{0} are calculated as follows: M_{0} = ceil(M/K_{0}), P_{0} = ceil(P/K_{0}), Q_{0} = ceil(Q/K_{0}), where the operation "/" means the division of real numbers and the function ceil performs rounding up. If the matrix contains insufficient number of rows (or columns) to fill the last blocks then the zerovalued rows (or columns) should be added to these blocks. Add, if necessary, the zerovalued rows or columns to the initial matrices (as a result, the matrices A and B will have the size (M_{0}·K_{0}) × (P_{0}·K_{0}) and (P_{0}·K_{0}) × (Q_{0}·K_{0}) respectively), save them in onedimensional arrays in the master process, and then send the matrix blocks (in a rowmajor order) from these arrays to all processes (in ascending order of its ranks): the process R will receive the blocks A_{R} and B_{R}, R = 0, …, K − 1. In addition, create a block C_{R} in each process to store the part of the matrix product C = AB which will be calculated in this process. Each block C_{R} is of the size M_{0} × Q_{0} and is filled with zerovalued elements. The blocks, like the initial matrices, should be stored in onedimensional arrays in a rowmajor order. To send the matrix sizes, use the MPI_Bcast collective function, to send the blocks of the matrices A and B, use the MPI_Send and MPI_Recv functions and also the MPI_BLOCK_A and MPI_BLOCK_B datatypes created by the Matr3CreateTypeBlock function (see the MPI9Matr21 task and a note to it). Include all the above mentioned actions in a Matr3ScatterData function (without parameters). As a result of the call of this function, each process will receive the values M_{0}, P_{0}, Q_{0}, as well as onedimensional arrays filled with the corresponding blocks of the matrices A, B, C. Output all obtained data (that is, the numbers M_{0}, P_{0}, Q_{0} and the blocks of the matrices A, B, C) in each process after calling the Matr3ScatterData function. Perform the input of initial data in the Matr3ScatterData function, perform the output of the results in the Solve function. Notes. (1) When input the matrices A and B into arrays in the master process, it should be taken into account that these arrays may contain elements corresponding to additional zerovalued columns. (2) To reduce the number of the MPI_Bcast function calls, all matrix sizes may be sent as a single array.
MPI9Matr24°. Integers M_{0}, P_{0}, Q_{0} and onedimensional arrays filled with the corresponding blocks of matrices A, B, C are given in each process (thus, the given data coincide with the results obtained in the MPI9Matr23 task). Implement the initial block redistribution used in the Cannon's algorithm for block matrix multiplication. To do this, define a Cartesian topology for all processes as a twodimensional periodic K_{0} × K_{0} grid, where K_{0}·K_{0} is equal to the number of processes (ranks of processes should not be reordered), and perform a cyclic shift of the blocks A_{R} given in all processes of each grid row I_{0} by I_{0} positions left (that is, in descending order of ranks of processes), I_{0} = 0, …, K_{0} − 1, and perform a cyclic shift of the blocks B_{R} given in all processes of each grid column J_{0} by J_{0} positions up (that is, in descending order of ranks of processes), J_{0} = 0, …, K_{0} − 1. To create the MPI_COMM_GRID communicator associated with the Cartesian topology, use the Matr3CreateCommGrid function implemented in the MPI9Matr22 task. Use the MPI_Cart_coords, MPI_Cart_shift, MPI_Sendrecv_replace functions to perform the cyclic shifts (compare with MPI9Matr22). Include all the above mentioned actions in a Matr3Init function (without parameters). Output the received blocks A_{R} and B_{R} in each process; perform data input and output in the Solve function.
MPI9Matr25°. Integers M_{0}, P_{0}, Q_{0} and onedimensional arrays filled with the corresponding blocks of matrices A, B, C are given in each process. The blocks C_{R} are zerovalued, the initial redistribution for the blocks A_{R} and B_{R} has already been performed in accordance with the Cannon's algorithm (see the previous task). Implement one step of the Cannon's algorithm of matrix multiplication as follows: multiply elements in the blocks A_{R} and B_{R} of each process and perform a cyclic shift of the blocks A_{0} given in all processes of each row of the Cartesian periodic grid by 1 position left (that is, in descending order of ranks of processes) and perform a cyclic shift of the blocks B_{0} given in all processes of each grid column by 1 position up (that is, in descending order of ranks of processes). To create the MPI_COMM_GRID communicator associated with the Cartesian topology, use the Matr3CreateCommGrid function implemented in the MPI9Matr22 task. Use the MPI_Cart_shift and MPI_Sendrecv_replace functions to perform the cyclic shifts (compare with MPI9Matr22). Include all the above mentioned actions in a Matr3Calc function (without parameters). Output the new contents of the blocks C_{R}, A_{R}, and B_{R} in each process; perform data input and output in the Solve function. Remark. A special feature of the Cannon's algorithm is that the actions at each step are not depend on the step number.
MPI9Matr26°. Integers M_{0}, P_{0}, Q_{0} and onedimensional arrays filled with the corresponding blocks of matrices A, B, C are given in each process. The blocks C_{R} are zerovalued, the initial redistribution for the blocks A_{R} and B_{R} has already been performed in accordance with the Cannon's algorithm (see the MPI9Matr24 task). In addition, a number L with the same value is given in each process. The value of L is in the range 2 to K_{0}, where K_{0}·K_{0} is the number of processes, and determines the number of steps of the Cannon's algorithm. Using the function Matr3Calc (see the previous task) in a loop, execute the initial L steps of the Cannon's algorithm and output the new contents of the blocks C_{R}, A_{R}, and B_{R} in each process. Perform data input and output in the Solve function. Note. If the value of L is equal to K_{0} then the blocks C_{R} will contain the corresponding parts of the final matrix product AB.
MPI9Matr27°. Integers M and Q (the numbers of rows and columns of the matrix product) are given in the master process. In addition, integers M_{0}, Q_{0} and onedimensional arrays filled with the M_{0} × Q_{0} blocks of the matrix C are given in each process (the given blocks of C are obtained as a result of K_{0} steps of the Cannon's algorithm — see the MPI9Matr26 task). Send all the blocks C_{R} to the master process and output the received matrix C of the size M × Q in this process. To store the resulting matrix C in the master process, use a onedimensional array sufficient to store the matrix of the size (M_{0}·K_{0}) × (Q_{0}·K_{0}). To send the blocks C_{R} to this array, use the MPI_Send and MPI_Recv functions and the MPI_BLOCK_C datatype created by the Matr3CreateTypeBlock function (see the MPI9Matr21 task and a note to it). Include all the above mentioned actions in a Matr3GatherData function (without parameters). Perform the input of initial data in the Solve function, perform the output of the resulting matrix in the Matr3GatherData function. Note. When output the matrix C in the master process, it should be taken into account that an array, which is intended for matrix storage, may contain elements corresponding to additional zerovalued columns.
MPI9Matr28°. Integers M, P, Q, a matrix A of the size M × P, and a matrix B of the size P × Q are given in the master process (thus, the given data coincide with the given data in the MPI9Matr23 task). Using successively the Matr3ScatterData, Matr3Calc (in a loop), and Matr3GatherData functions, that are developed in the MPI9Matr23−MPI9Matr27 tasks, find a matrix C, which is equal to the product of the initial matrices A and B, and output this matrix in the master process. In addition, output the current contents of the block C_{R} in each process after each call of the Matr3Calc function. The MPI_COMM_GRID communicator, which is used in the Matr3Init and Matr3Calc functions, should not be created several times. To do this, modify the Matr3CreateCommGrid function; the modified function should not perform any actions when it is called with the parameter comm that is not equal to MPI_COMM_NULL. In addition, modify the Matr3Calc function (see the MPI9Matr25 task), before using in this task, as follows: add the parameter named step to this function (step = 0, …, K_{0} − 1); the blocks A_{R} and B_{R} should not be sent when the parameter step is equal to K_{0} − 1.
MPI9Matr29°. Integers M, P, Q and two file names are given in the master process. The given files contain elements of a matrix A of the size M × P and a matrix B of the size P × Q. Modify the stage of receiving blocks for the Cannon's algorithm of matrix multiplication (see the MPI9Matr23 task) as follows: each process should read the corresponding blocks of the matrices A and B directly from the given files. In this case, all processes should receive not only the sizes M_{0}, P_{0}, Q_{0} of the blocks, but also the sizes M, P, Q of the initial matrices, which are needed to determine correctly the positions of blocks in the source files. To send the sizes of matrices and file names, use the MPI_Bcast collective function. Use the MPI_File_read_at local function to read each row of the block (a new file view is not required). Include all these actions in a Matr3ScatterFile function (without parameters). As a result of the call of this function, each process will receive the values M, P, Q, M_{0}, P_{0}, Q_{0}, as well as onedimensional arrays filled with the corresponding blocks of the matrices A, B, C. Output all obtained data (that is, the numbers M, P, Q, M_{0}, P_{0}, Q_{0} and the blocks of the matrices A, B, C) in each process after calling the Matr3ScatterFile function. Perform the input of initial data in the Matr3ScatterFile function, perform the output of the results in the Solve function. Note. For some blocks, some of their elements (namely, the last rows and/or columns) should not be read from the source files and will remain zerovalued ones. To determine the actual size of the block being read (the number of rows and columns), it is required to use the sizes of the initial matrices and the coordinates (I_{0}, J_{0}) of the block in a square Cartesian grid of order K_{0} (note that I_{0} = R/K_{0}, J_{0} = R%K_{0}, where R is the process rank). Remark. Whereas the values of P and Q are necessary to ensure the correct reading of the file blocks, the value of M is not required for this purpose, since the attempt to read data beyond the end of file is ignored (without generating any error message). However, the value of M is required at the final stage of the algorithm (see the next task), so it must also be sent to all processes.
MPI9Matr30°. Integers M, Q, M_{0}, Q_{0} and onedimensional arrays filled with the M_{0} × Q_{0} blocks C_{R} are given in each process (the given blocks C_{R} are obtained as a result of K_{0} steps of the Cannon's block algorithm of matrix multiplication — see the MPI9Matr25 task). In addition, the name of file to store the matrix product is given in the master process. Send the file name to all processes using the MPI_Bcast function. Write all the parts of the matrix product contained in the blocks C_{R} to the resulting file, which will eventually contain a matrix C of the size M × Q. Use the MPI_File_write_at local function to write each row of the block to the file (a new file view is not required). Include all these actions (namely, the input of file name, sending the file name, and writing all blocks to the file) in a Matr3GatherFile function. Perform the input of all initial data, except the file name, in the Solve function. Note. When writing data to the resulting file, it is necessary to take into account that some of the blocks C_{R} may contain trailing zerovalued rows and/or columns that are not related to the resulting matrix product (see also the note and the remark for the previous task).
MPI9Matr31°. Integers M, P, Q and three file names are given in the master process. The first two file names are related to the existing files containing the elements of matrices A and B of the size M × P and P × Q, respectively, the third file should be created to store the resulting matrix product C = AB. Using successively the Matr3ScatterFile, Matr3Init, Matr3Calc (in a loop), and Matr3GatherFile functions (see the MPI9Matr29, MPI9Matr24, MPI9Matr25, and MPI9Matr30 tasks), find a matrix C and write its elements to the resulting file. In addition, output the current value of the c[step] in each process after each call of the Matr3Calc function, where c is a onedimensional array containing the block C_{R}, and step is the algorithm step number (0, 1, …, K_{0} − 1). Thus, the element c[0] should be output on the first step of the algorithm, the element c[1] should be output on the second step of the algorithm, and so on.
Fox's block algorithm
MPI9Matr32°. Integers M and P are given in each process; in addition, a matrix A of the size M × P is given in the master process. The number of processes K is a perfect square: K = K_{0}·K_{0}, the numbers M and P are multiples of K_{0}. Input the matrix A into a onedimensional array of the size M·P in the master process and define a new datatype named MPI_BLOCK_A that contains a M_{0} × P_{0} block of the matrix A, where M_{0} = M/K_{0}, P_{0} = P/K_{0}. When defining the MPI_BLOCK_A type, use the MPI_Type_vector and MPI_Type_commit functions. Include this definition in a Matr4CreateTypeBlock(m0, p0, p, t) function with the input integer parameters m0, p0, p and the output parameter t of the MPI_Datatype type; the parameters m0 and p0 determine the size of the block, and the parameter p determines the number of columns of the matrix from which this block is extracted. Using the MPI_BLOCK_A datatype, send to each process (in ascending order of ranks of processes, including the master process) the corresponding block of the matrix A in a rowmajor order of blocks (that is, the first block should be sent to the process 0, the next block in the same row of blocks should be sent to the process 1, and so on). Sending should be performed using the MPI_Alltoallw function; the blocks should be stored in onedimensional arrays of the size M_{0}·P_{0}. Output the received block in each process. Notes. (1) Use the MPI_Send and MPI_Recv functions instead of the MPI_Alltoallw function when solving this task using the MPI1 library. (2) The MPI_Alltoallw function introduced in MPI2 is the only collective function that allows you to specify the displacements for the sent data in bytes (not in elements). This gives opportunity to use it in conjunction with complex data types to implement any variants of collective communications (in our case, we need to implement a communication of the scatter type). It should be noted that all array parameters of the MPI_Alltoallw function associated with the sent data must be differently defined in the master and slave processes. In particular, the array scounts (which determines the number of sent elements) must contain the values 0 in all the slave processes and the value 1 in the master process (the sent elements are of the MPI_BLOCK_A datatype). At the same time, arrays associated with the received data will be defined in the same way in all processes; in particular, the zeroindexed element of the array rcounts (which determines the number of received elements) must be equal to M_{0}·P_{0}, and all other elements of this array must be equal to 0 (the received elements are of the MPI_INT datatype). It is necessary to pay special attention to the correct definition of elements in the array sdispls of displacements for the sent data in the master process (in the slave processes, it is enough to use the zerovalued array sdispls).
MPI9Matr33°. Integers M_{0}, P_{0} and a matrix A of the size M_{0} × P_{0} are given in each process. The number of processes K is a perfect square: K = K_{0}·K_{0}. Input the matrix A into a onedimensional array of the size M_{0}·P_{0} in each process and create a new communicator named MPI_COMM_GRID using the MPI_Cart_create function. The MPI_COMM_GRID communicator defines a Cartesian topology for all processes as a twodimensional periodic K_{0} × K_{0} grid (ranks of processes should not be reordered). Include the creation of the MPI_COMM_GRID communicator in a Matr4CreateCommGrid(comm) function with the output parameter comm of the MPI_Comm type. Using the MPI_Cart_coords function for this communicator, output the process coordinates (I_{0}, J_{0}) in each process. On the base of the MPI_COMM_GRID communicator, create a set of communicators named MPI_COMM_ROW, which are associated with the rows of the initial twodimensional grid. Use the MPI_Cart_sub function to create the MPI_COMM_ROW communicators. Include the creation of the MPI_COMM_ROW communicators in a Matr4CreateCommRow(grid, row) function with the input parameter grid (the communicator associated with the initial twodimensional grid) and the output parameter row (both parameters are of the MPI_Comm type). Output the process rank R_{0} for the MPI_COMM_ROW communicator in each process (this rank must be equal to J_{0}). In addition, send the matrix A from the grid element (I_{0}, I_{0}) to all processes of the same grid row I_{0} (I_{0} = 0, …, K_{0} − 1) using the MPI_Bcast collective function for the MPI_COMM_ROW communicator. Save the received matrix in the auxiliary matrix T of the same size as the matrix A (it is necessary to copy the matrix A to the matrix T in the sending process before the call of the MPI_Bcast function). Output the received matrix T in each process.
MPI9Matr34°. Integers P_{0}, Q_{0} and a matrix B of the size P_{0} × Q_{0} are given in each process. The number of processes K is a perfect square: K = K_{0}·K_{0}. Input the matrix B into a onedimensional array of the size P_{0}·Q_{0} in each process and create a new communicator named MPI_COMM_GRID, which defines a Cartesian topology for all processes as a twodimensional periodic K_{0} × K_{0} grid. Use the Matr4CreateCommGrid function (see the MPI9Matr33 task) to create the MPI_COMM_GRID communicator. Using the MPI_Cart_coords function for this communicator, output the process coordinates (I_{0}, J_{0}) in each process. On the base of the MPI_COMM_GRID communicator, create a set of communicators named MPI_COMM_COL, which are associated with the columns of the initial twodimensional grid. Use the MPI_Cart_sub function to create the MPI_COMM_COL communicators. Include the creation of the MPI_COMM_COL communicators in a Matr4CreateCommCol(grid, col) function with the input parameter grid (the communicator associated with the initial twodimensional grid) and the output parameter col (both parameters are of the MPI_Comm type). Output the process rank R_{0} for the MPI_COMM_COL communicator in each process (this rank must be equal to I_{0}). In addition, perform a cyclic shift of the matrices B given in all processes of each grid column J_{0} by 1 position up (that is, in descending order of ranks of processes) using the MPI_Sendrecv_replace function for the MPI_COMM_COL communicator (to determine the ranks of the sending and receiving processes, use the expression containing the % operator that gives the remainder of a division). Output the received matrix in each process.
MPI9Matr35°. Integers M, P, Q, a matrix A of the size M × P, and a matrix B of the size P × Q are given in the master process. The number of processes K is a perfect square: K = K_{0}·K_{0}. In the block algorithms of matrix multiplication, the initial matrices are divided into K blocks and are interpreted as square block matrices of the order K_{0} (hereinafter blocks are distributed by processes and used to calculate a part of the total matrix product in each process). The block of the matrix A is of the size M_{0} × P_{0}, the block of the matrix B is of the size P_{0} × Q_{0}, the numbers M_{0}, P_{0}, Q_{0} are calculated as follows: M_{0} = ceil(M/K_{0}), P_{0} = ceil(P/K_{0}), Q_{0} = ceil(Q/K_{0}), where the operation "/" means the division of real numbers and the function ceil performs rounding up. If the matrix contains insufficient number of rows (or columns) to fill the last blocks then the zerovalued rows (or columns) should be added to these blocks. Add, if necessary, the zerovalued rows or columns to the initial matrices (as a result, the matrices A and B will have the size (M_{0}·K_{0}) × (P_{0}·K_{0}) and (P_{0}·K_{0}) × (Q_{0}·K_{0}) respectively), save them in onedimensional arrays in the master process, and then send the matrix blocks (in a rowmajor order) from these arrays to all processes (in ascending order of its ranks): the process R will receive the blocks A_{R} and B_{R}, R = 0, …, K − 1. In addition, create two blocks C_{R} and T_{R} filled with zerovalued elements in each process: the block C_{R} (of the size M_{0} × Q_{0}) is intended to store the part of the matrix product C = AB, which will be calculated in this process, the block T_{R} (of the size M_{0} × P_{0}) is an auxiliary one. The blocks, like the initial matrices, should be stored in onedimensional arrays in a rowmajor order. To send the matrix sizes, use the MPI_Bcast collective function, to send the blocks of the matrices A and B, use the MPI_Alltoallw collective function and also the MPI_BLOCK_A and MPI_BLOCK_B datatypes created by the Matr4CreateTypeBlock function (see the MPI9Matr32 task and notes to it). Include all the above mentioned actions in a Matr4ScatterData function (without parameters). As a result of the call of this function, each process will receive the values M_{0}, P_{0}, Q_{0}, as well as onedimensional arrays filled with the blocks A_{R}, B_{R}, C_{R}, T_{R}. Output all obtained data (that is, the numbers M_{0}, P_{0}, Q_{0} and the blocks A_{R}, B_{R}, C_{R}, T_{R}) in each process after calling the Matr4ScatterData function. Perform the input of initial data in the Matr4ScatterData function, perform the output of the results in the Solve function. Notes. (1) When input the matrices A and B into arrays in the master process, it should be taken into account that these arrays may contain elements corresponding to additional zerovalued columns. (2) To reduce the number of the MPI_Bcast function calls, all matrix sizes may be sent as a single array.
MPI9Matr36°. Integers M_{0}, P_{0}, Q_{0} and onedimensional arrays filled with the blocks A_{R}, B_{R}, C_{R}, T_{R} are given in each process (thus, the given data coincide with the results obtained in the MPI9Matr35 task). A virtual Cartesian topology in the form of a square grid of order K_{0} is used for all processes, the value of K_{0}·K_{0} is equal to the number of the processes. Each step of the Fox's block algorithm of matrix multiplication consists of two stages. In the first stage of the first step, the block A_{R} is sent from the process with the grid coordinates (I_{0}, I_{0}) to all processes of the same grid row I_{0} (I_{0} = 0, …, K_{0} − 1). The received block is saved in the block T_{R} in the receiving processes. Then the block T_{R} is multiplied by the block B_{R} from the same process and the result is added to the block C_{R}. Implement the first stage of the first step of the Fox's algorithm. To do this, create the MPI_COMM_GRID and MPI_COMM_ROW communicators using the Matr4CreateCommGrid and Matr4CreateCommRow functions implemented in the MPI9Matr33 task. Use the MPI_Bcast function for the MPI_COMM_ROW communicator to send the blocks A_{R} (compare with MPI9Matr33). Include all the above mentioned actions in a Matr4Calc1 function (without parameters). Output the new contents of the blocks T_{R} and C_{R} in each process; perform data input and output in the Solve function.
MPI9Matr37°. Integers M_{0}, P_{0}, Q_{0} and onedimensional arrays filled with the blocks A_{R}, B_{R}, C_{R}, T_{R} are given in each process (thus, the given data coincide with the results obtained in the MPI9Matr35 task). Implement the second stage of the first step of the Fox's algorithm of matrix multiplication as follows: perform a cyclic shift of the blocks B_{R} given in all processes of each column of the Cartesian periodic grid by 1 position up (that is, in descending order of ranks of processes). To do this, create the MPI_COMM_GRID and MPI_COMM_COL communicators using the Matr4CreateCommGrid and Matr4CreateCommCol functions implemented in the MPI9Matr34 task, then use the MPI_Sendrecv_replace function for the MPI_COMM_COL communicator to perform the cyclic shift of the blocks B_{R} (compare with MPI9Matr34). Include all the above mentioned actions in a Matr4Calc2 function (without parameters). Output the received blocks B_{R} in each process; perform data input and output in the Solve function.
MPI9Matr38°. Integers M_{0}, P_{0}, Q_{0} and onedimensional arrays filled with the blocks A_{R}, B_{R}, C_{R}, T_{R} are given in each process (thus, the given data coincide with the results obtained in the MPI9Matr35 task). Modify the Matr4Calc1 function, which was implemented in the MPI9Matr36 task; the modified function should provide execution of the first stage of any step of the Fox's algorithm. To do this, add the parameter named step to the function (this parameter specifies the step number and may be in the range 0 to K_{0} − 1, where K_{0} is the order of the Cartesian grid of processes) and use the value of this parameter in the part of the algorithm that deals with the sending the blocks A_{R}: the block A_{R} should be sent from the process with the grid coordinates (I_{0}, (I_{0} + step)%K_{0}) to all processes of the same grid row I_{0}, I_{0} = 0, …, K_{0} − 1 (the recalculation of the elements of the block C_{R} does not depend on the value of the parameter step). Using successively the calls of Matr4Calc1(0), Matr4Calc2(), Matr4Calc1(1) (the Matr4Calc2 function provides the second stage of each step of the algorithm — see the MPI9Matr37 task), execute two initial steps of the Fox's algorithm and output the new contents of the blocks T_{R}, B_{R}, and C_{R} in each process. Perform data input and output in the Solve function.
MPI9Matr39°. Integers M_{0}, P_{0}, Q_{0} and onedimensional arrays filled with the blocks A_{R}, B_{R}, C_{R}, T_{R} are given in each process (thus, the given data coincide with the results obtained in the MPI9Matr35 task). In addition, a number L with the same value is given in each process. The value of L is in the range 3 to K_{0} and determines the number of steps of the Fox's algorithm. Using successively the calls of Matr4Calc1(0), Matr4Calc2(), Matr4Calc1(1), Matr4Calc2(), …, Matr4Calc1(L − 1) (see the MPI9Matr38 and MPI9Matr37 tasks), execute the initial L steps of the Fox's algorithm and output the new contents of the blocks T_{R}, B_{R}, and C_{R} in each process. Perform data input and output in the Solve function. Remark. If the value of L is equal to K_{0} then the blocks C_{R} will contain the corresponding parts of the final matrix product AB. Note that the second stage (associated with the call of the Matr4Calc2 function) is not necessary at the last step of the algorithm.
MPI9Matr40°. Integers M and Q (the numbers of rows and columns of the matrix product) are given in the master process. In addition, integers M_{0}, Q_{0} and onedimensional arrays filled with the M_{0} × Q_{0} blocks of the matrix C are given in each process (the given blocks of C are obtained as a result of K_{0} steps of the Fox's algorithm — see the MPI9Matr39 task). Send all the blocks C_{R} to the master process and output the received matrix C of the size M × Q in this process. To store the resulting matrix C in the master process, use a onedimensional array sufficient to store the matrix of the size (M_{0}·K_{0}) × (Q_{0}·K_{0}). To send the blocks C_{R} to this array, use the MPI_Alltoallw collective function and the MPI_BLOCK_C datatype created by the Matr4CreateTypeBlock function (see the MPI9Matr32 task and notes to it). Include all the above mentioned actions in a Matr4GatherData function (without parameters). Perform the input of initial data in the Solve function, perform the output of the resulting matrix in the Matr4GatherData function. Note. When output the matrix C in the master process, it should be taken into account that an array, which is intended for matrix storage, may contain elements corresponding to additional zerovalued columns.
MPI9Matr41°. Integers M, P, Q, a matrix A of the size M × P, and a matrix B of the size P × Q are given in the master process (thus, the given data coincide with the given data in the MPI9Matr35 task). Using successively the Matr4ScatterData, Matr4Calc1, Matr4Calc2, and Matr4GatherData functions, that are developed in the MPI9Matr35−MPI9Matr40 tasks, find a matrix C, which is equal to the product of the initial matrices A and B, and output this matrix in the master process. The Matr4Calc1 and Matr4Calc2 functions should be called in a loop, the number of Matr4Calc2 function calls must be one less than the number of Matr4Calc1 function calls. In addition, output the current contents of the block C_{R} in each process after each call of the Matr4Calc1 function. The MPI_COMM_GRID, MPI_COMM_ROW, and MPI_COMM_COL communicators that are used in the Matr4Calc1 and Matr4Calc2 functions, should not be created several times. To do this, modify the Matr4CreateCommGrid, Matr4CreateCommRow, and Matr4CreateCommCol functions (see the MPI9Matr33 and MPI9Matr34 tasks); the modified functions should not perform any actions when it is called with the parameter comm that is not equal to MPI_COMM_NULL.
MPI9Matr42°. Integers M, P, Q and two file names are given in the master process. The given files contain elements of a matrix A of the size M × P and a matrix B of the size P × Q. The numbers M, P, and Q are multiples of the order K_{0} of square grid of processes. Modify the stage of receiving blocks for the Fox's algorithm of matrix multiplication (see the MPI9Matr35 task) as follows: each process should read the corresponding blocks of the matrices A and B directly from the given files. To send the sizes of matrices and file names, use the MPI_Bcast collective function. To read the blocks from the files, set the appropriate file views using the MPI_File_set_view function and the MPI_BLOCK_A and MPI_BLOCK_B file types defined with the Matr4CreateTypeBlock function (see the MPI9Matr32 task), and then use the MPI_File_read_all function. Include all these actions in a Matr4ScatterFile function (without parameters). As a result of the call of this function, each process will receive the values M_{0}, P_{0}, Q_{0}, as well as onedimensional arrays filled with the blocks A_{R}, B_{R}, C_{R}, T_{R} (the blocks C_{R} and T_{R} should contain zerovalued elements). Output all obtained data (that is, the numbers M_{0}, P_{0}, Q_{0} and the blocks A_{R}, B_{R}, C_{R}, T_{R}) in each process after calling the Matr4ScatterFile function. Perform the input of initial data in the Matr4ScatterFile function, perform the output of the results in the Solve function. Remark. A condition that the numbers M, P, Q are multiples of K_{0} means that you do not need to add zerovalued rows and/or zerovalued columns to the blocks obtained from the matrices A and B, and therefore you may perform reading of the blocks A_{R} and B_{R} using the same file types (namely, MPI_BLOCK_A and MPI_BLOCK_B) in all processes. If this condition is not fulfilled then it would be necessary to use special types that ensure the correct reading from the file and write to the array of "truncated" blocks of the matrices A and B in some processes (in addition, in this case it would be necessary to send to each process the values of P and Q that are necessary for the correct type definition for "truncated" blocks).
MPI9Matr43°. Integers M_{0}, Q_{0} and onedimensional arrays filled with the M_{0} × Q_{0} blocks C_{R} are given in each process (the given blocks C_{R} are obtained as a result of K_{0} steps of the Fox's block algorithm of matrix multiplication — see the MPI9Matr39 task). In addition, the name of file to store the matrix product is given in the master process. The numbers M and Q (the numbers of rows and columns of the matrix product) are multiples of the order K_{0} of square grid of processes (thus, M = M_{0}·K_{0}, Q = Q_{0}·K_{0}). Send the file name to all processes using the MPI_Bcast function. Write all the parts of the matrix product contained in the blocks C_{R} to the resulting file, which will eventually contain a matrix C of the size M × Q. To write the blocks to the files, set the appropriate file view using the MPI_File_set_view function and the MPI_BLOCK_C file type defined with the Matr4CreateTypeBlock function (see the MPI9Matr32 task), and then use the MPI_File_write_all collective function. Include all these actions (namely, the input of file name, sending the file name, and writing all blocks to the file) in a Matr4GatherFile function. Perform the input of all initial data, except the file name, in the Solve function. Remark. A condition that the numbers M and Q are multiples of K_{0} means that the blocks C_{R} do not contain "extra" zerovalued rows and/or columns, and therefore you may perform writing of the blocks C_{R} to the file using the same file type (namely, MPI_BLOCK_C) in all processes.
MPI9Matr44°. Integers M, P, Q and three file names are given in the master process. The first two file names are related to the existing files containing the elements of matrices A and B of the size M × P and P × Q, respectively, the third file should be created to store the resulting matrix product C = AB. The numbers M, P, and Q are multiples of the order K_{0} of square grid of processes. Using successively the Matr4ScatterFile, Matr4Calc1, Matr3Calc2, and Matr4GatherFile functions (see the MPI9Matr42, MPI9Matr38, MPI9Matr37, and MPI9Matr43 tasks), find a matrix C and write its elements to the resulting file. The Matr4Calc1 and Matr4Calc2 functions should be called in a loop, the number of Matr4Calc2 function calls must be one less than the number of Matr4Calc1 function calls. In addition, output the current value of the c[step] in each process after each call of the Matr4Calc1 function, where c is a onedimensional array containing the block C_{R}, and step is the algorithm step number (0, 1, …, K_{0} − 1). Thus, the element c[0] should be output on the first step of the algorithm, the element c[1] should be output on the second step of the algorithm, and so on.
