SGEMM GPU kernel performance Data Set
Abstract: Running times for multiplying two 2048 x 2048 matrices using a GPU OpenCL SGEMM kernel with varying parameters (using the library 'CLTune').


Data Set Characteristics: Multivariate
Number of Instances: 241600
Area: Computer
Attribute Characteristics: Integer
Number of Attributes: 18
Date Donated: 2018-02-27
Associated Tasks: Regression
Missing Values? N/A
Source:
Enrique G. Paredes (egparedes '@' ifi.uzh.ch). Visualization and MultiMedia Lab, Department of Informatics, University of Zurich. Zurich, 8050. Switzerland
Rafael Ballester-Ripoll (rballester '@' ifi.uzh.ch). Visualization and MultiMedia Lab, Department of Informatics, University of Zurich. Zurich, 8050. Switzerland
Data Set Information:
This data set measures the running time of a matrix-matrix product A*B = C, where all matrices have size 2048 x 2048, using a parameterizable SGEMM GPU kernel with 241600 possible parameter combinations. For each tested combination, 4 runs were performed and their results are reported in the last 4 columns. All times are measured in milliseconds*.
There are 14 parameters: the first 10 are ordinal and can each take at most 4 different values, all powers of two, and the last 4 are binary. Out of 1327104 total parameter combinations, only 241600 are feasible (due to various kernel constraints). This data set contains the results for all of these feasible combinations.
The experiment was run on a desktop workstation running Ubuntu 16.04 Linux with an Intel Core i5 (3.5GHz), 16GB RAM, and an NVidia Geforce GTX 680 4GB GF580 GTX1.5GB GPU. We use the 'gemm_fast' kernel from the automatic OpenCL kernel tuning library 'CLTune' ([Web Link]).
* Note: for this kind of data set it is usually better to work with the logarithm of the running times (see e.g. Falch and Elster, 'Machine learning-based auto-tuning for enhanced performance portability of OpenCL applications', 2015).
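The note on log running times can be illustrated with a minimal sketch. It assumes the four runs of one parameter combination are first averaged and that the natural logarithm is used; both choices, the function name log_mean_runtime, and the sample values are illustrative assumptions, not part of the data set description.

```python
import math
import statistics


def log_mean_runtime(runs_ms):
    """Return the natural log of the mean running time (milliseconds).

    Averaging the runs before taking the log is an assumption made for
    this sketch; the median or the per-run logs are equally plausible.
    """
    if not runs_ms or any(t <= 0 for t in runs_ms):
        raise ValueError("running times must be positive")
    return math.log(statistics.mean(runs_ms))


# Hypothetical values for Run1..Run4 of a single parameter combination.
example_runs = [115.3, 115.1, 116.0, 115.6]
target = log_mean_runtime(example_runs)
print(f"regression target (log ms): {target:.4f}")
```

The running times span more than two orders of magnitude, so a regression target on the log scale keeps the slowest configurations from dominating a squared-error loss.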
Attribute Information:
 Independent variables:
1-2. MWG, NWG: per-matrix 2D tiling at workgroup level: {16, 32, 64, 128} (integer)
3. KWG: inner dimension of 2D tiling at workgroup level: {16, 32} (integer)
4-5. MDIMC, NDIMC: local workgroup size: {8, 16, 32} (integer)
6-7. MDIMA, NDIMB: local memory shape: {8, 16, 32} (integer)
8. KWI: kernel loop unrolling factor: {2, 8} (integer)
9-10. VWM, VWN: per-matrix vector widths for loading and storing: {1, 2, 4, 8} (integer)
11-12. STRM, STRN: enable stride for accessing off-chip memory within a single thread: {0, 1} (categorical)
13-14. SA, SB: per-matrix manual caching of the 2D workgroup tile: {0, 1} (categorical)
 Output:
15-18. Run1, Run2, Run3, Run4: performance times in milliseconds for 4 independent runs using the same parameters. They range between 13.25 and 3397.08.
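As a quick consistency check, the total of 1327104 parameter combinations can be reproduced by multiplying the sizes of the value sets listed for the 14 independent variables. The sketch below uses only those listed sets; the dictionary name PARAM_DOMAINS is an illustrative assumption, and the kernel constraints that reduce the grid to the 241600 feasible combinations are not modelled.

```python
import math

# Value sets copied from the attribute list (parameters 1-14).
PARAM_DOMAINS = {
    "MWG": (16, 32, 64, 128),
    "NWG": (16, 32, 64, 128),
    "KWG": (16, 32),
    "MDIMC": (8, 16, 32),
    "NDIMC": (8, 16, 32),
    "MDIMA": (8, 16, 32),
    "NDIMB": (8, 16, 32),
    "KWI": (2, 8),
    "VWM": (1, 2, 4, 8),
    "VWN": (1, 2, 4, 8),
    "STRM": (0, 1),
    "STRN": (0, 1),
    "SA": (0, 1),
    "SB": (0, 1),
}

# Size of the full Cartesian grid, before any feasibility filtering.
total_combinations = math.prod(len(values) for values in PARAM_DOMAINS.values())
feasible_combinations = 241600  # number of instances in the data set

print(total_combinations)  # 1327104
print(f"feasible share: {feasible_combinations / total_combinations:.1%}")
```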
Relevant Papers:
A fraction of this data set was used in the following paper to compute a tensor-train-based predictive model and estimate the Sobol sensitivity indices of all the parameters:
Rafael Ballester-Ripoll, Enrique G. Paredes, Renato Pajarola.
Sobol Tensor Trains for Global Sensitivity Analysis.
In arXiv Computer Science / Numerical Analysis e-prints, 2017
([Web Link]).
Citation Request:
If you use this data set, please cite one or both of these:
 Rafael Ballester-Ripoll, Enrique G. Paredes, Renato Pajarola.
Sobol Tensor Trains for Global Sensitivity Analysis.
In arXiv Computer Science / Numerical Analysis e-prints, 2017
([Web Link]).
 Cedric Nugteren and Valeriu Codreanu.
CLTune: A Generic Auto-Tuner for OpenCL Kernels.
In: MCSoC: 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip. IEEE, 2015
([Web Link])
