# ROC Profiler Library Specification

## 1. High level overview
```
The goal of the implementation is to provide a HW specific low-level performance analysis
interface for profiling of GPU compute applications. The profiling includes HW performance
counters with complex performance metrics and HW traces. The implementation distinguishes
two profiling features, metrics and traces. HW performance counters are treated as the basic
metrics and the formulas can be defined for derived complex metrics.
The library can be loaded by HSA runtime as a tool plugin and it can be loaded by higher
level HW independent performance analysis API like PAPI.
The library has C API and is based on AQLprofile AMD specific HSA extension.

  1. The library provides methods to query the list of supported HW features.
  2. The library provides profiling APIs to start, stop, read metrics results and tracing
  data.
  3. The library provides a intercepting API for collecting per-kernel profiling data for
  the kernels
  dispatched to HSA AQL queues.
  4. The library provides mechanism to load profiling tool library plugin by env variable
  ROCP_TOOL_LIB.
  5. The library is responsible for allocation of the buffers for profiling and notifying
  about output data buffer overflow for traces.
  6. The library is implemented based on AMD specific AQLprofile HSA extension.
  7. The library implementation is abstracted from the specific GFXIP.
  8. The library implementation is extensible:
    - Easy adding of counters and metrics
    - Counters enumeration 
    - Counters and metrics can be dynamically configured using XML configuration files with
    counters and metrics tables:
      o Counters table entry, basic metric: counter name, block name, event id
      o Complex metrics table entry: metric name, an expression for calculation the metric
      from the counters

Metrics XML file example:
<gfx8>
	<metric name=L1_CYCLES_COUNTER block=L1 event=0 >
	<metric name=L1_MISS_COUNTER block=L1 event=33 >
	. . .
</gfx8>

<gfx9>
	. . .
</gfx9>

<global>
	<metric name=L1_MISS_RATIO expr=L1_CYCLES_COUNT/ L1_MISS_COUNTER ></metric>
</global>
```
## 2. Environment
```
* HSA_TOOLS_LIB - required to be set to the name of rocprofiler library to be loaded by
HSA runtime
* ROCP_METRICS - path to the metrics XML file
* ROCP_TOOL_LIB - path to profiling tool library loaded by ROC Profiler
* ROCP_HSA_INTERCEPT - if set then HSA dispatches intercepting is enabled
```
## 3. General API
### 3.1. Description
```
The library supports method for getting the error number and error string of the last
failed library API call.
To check the conformance of used library APi header and the library binary the version
macros and API methods can be used.

Returning the error and error string methods:
- rocprofiler_errno - method for returning the error number
- rocprofiler_error_string - method for returning the error string

Library version:
- ROCPROFILER_VERSION_MAJOR - API major version macro
- ROCPROFILER_VERSION_MINOR - API minor version macro
- rocprofiler_version_major - library major version
- rocprofiler_version_minor - library minor version
```
### 3.2. Returning the error and error string methods
```
rocprofiler_errno_t rocprofiler_errno();
const char* rocprofiler_error_string();
```
### 3.3. Library version
```
The library provides back compatibility if the library major version is less or equal
then the API major version macro.

API version macros defined in the library API header 'rocprofiler.h':

ROCPROFILER_VERSION_MAJOR
ROCPROFILER_VERSION_MINOR

Methods to check library major and minor venison:

uint32_t rocprofiler_major_version();
uint32_t rocprofiler_minor_version();
```
## 4. Backend API
### 4.1. Description
```
The library provides the methods to open/close profiling context, to start, stop and read
HW performance counters and traces, to intercept kernel dispatches to collect per-kernel
profiling data. Also the library provides methods to calculate complex performance metrics
and to query the list of available metrics. The library distinguishes two profiling features,
metrics and traces, where HW performance counters are treated as the basic metrics. To check
if there was an error the library methods return HSA standard status code.
For a given context the profiling can be started/stopped and counters sampled in standalone
mode or profiling can be initiated by intercepting the kernel dispatches with registering
a dispatch callback.
For counters sampling, which is the usage model of higher level APIs like PAPI,
the start/stop/read APIs should be used.
For collecting per-kernel data for the submitted to HSA queues kernels the dispatch callback
API should be used.
The library provides back compatibility if the library major version is less or equal.

Returned API status:
- hsa_status_t - HSA status codes are used from hsa.h header

Info API:
- rocprofiler_info_kind_t - profiling info kind
- rocprofiler_info_query_t - profiling info query
- rocprofiler_info_data_t - profiling info data
- rocprofiler_get_info - return the info for a given info kind
- rocprofiler_iterate_info - iterate over the info for a given info kind 
- rocprofiler_query_info - iterate over the info for a given info query

Context API:
- rocprofiler_t - profiling context handle
- rocprofiler_feature_kind_t - profiling feature kind
- rocprofiler_feature_parameter_t - profiling feature parameter
- rocprofiler_data_kind_t - profiling data kind
- rocprofiler_data_t - profiling data
- rocprofiler_feature_t - profiling feature
- rocprofiler_mode_t - profiling modes
- rocprofiler_properties_t - profiler properties
- rocprofiler_open - open new profiling context
- rocprofiler_close - close profiling context and release all allocated resources
- rocprofiler_group_count - return profiling groups count
- rocprofiler_get_group - return profiling group for a given index
- rocprofiler_get_metrics - method for calculating the metrics data
- rocprofiler_iterate_trace_data - method for iterating output trace data instances

Sampling API:
- rocprofiler_start - start profiling
- rocprofiler_stop - stop profiling
- rocprofiler_read - read profiling data to the profiling features objects
- rocprofiler_get_data - wait for profiling data
  Group versions of start/stop/read/get_data methods:
  o rocprofiler_group_start
  o rocprofiler_group_stop
  o rocprofiler_group_read
  o rocprofiler_group_get_data

Intercepting API:
- rocprofiler_callback_t - profiling callback type
- rocprofiler_callback_data_t - profiling callback data type
- rocprofiler_set_queue_callbacks - set queue kernel dispatch and queue destroy callbacks
- rocprofiler_remove_queue_callbacks - remove queue callbacks
```
### 4.2. Info API
```
The profiling metrics are defined by name and the traces are defined by name and parameters.
All supported features can be iterated using 'iterate_info/query_info' methods. The counter
names are defined in counters table configuration file, each counter has a unique name and
defined by block name and event id. The traces and trace parameters names are same as in
the hardware documentation and the parameters codes are rocprofiler_feature_parameter_t values,
see below in the "Context API" section.
Profiling info kind:

typedef enum {
	ROCPROFILER_INFO_KIND_METRIC = 0,		// metric info
	ROCPROFILER_INFO_KIND_METRIC_COUNT = 1,		// metrics count
	ROCPROFILER_INFO_KIND_TRACE = 2,		// trace info
	ROCPROFILER_INFO_KIND_TRACE_COUNT = 3,		// traces count
} rocprofiler_info_kind_t;

Profiling info data:

typedef struct {
	rocprofiler_info_kind_t kind;			// info data kind
	union {
		struct {
			const char* name;		// metric name
			const char* description;	// metric description
		} metric;
		struct {
			const char* name;		// trace name
			const char* description;	// trace description
			uint32_t parameter_count;	// supported by the trace number
			                                // parameters
		} trace;
	};
} rocprofiler_info_data_t;

Return info for a given info kind:

has_status_t rocprofiler_get_info(
	const hsa_agent_t* agent,			// [in] GPU handle, NULL for all
	                                                // GPU agents
	rocprofiler info_kind_t kind,			// kind of iterated info
	void *data);					// data passed to callback

Iterate over the info for a given info kind, and invoke an application-defined callback on
every iteration:

has_status_t rocprofiler_iterate_info(
	const hsa_agent_t* agent,			// [in] GPU handle, NULL for all
	                                                // GPU agents
	rocprofiler info_kind_t kind,			// kind of iterated info
	hsa_status_t (*callback)(const rocprofiler_info_data_t info, void *data), // callback
	void *data);	

Iterate over the info for a given info query, and invoke an application-defined callback on
every iteration. The query
fields set to NULL define the query wildcard:

has_status_t rocprofiler_query_info(
	const hsa_agent_t* agent,			// [in] GPU handle, NULL for all
	                                                // GPU agents
	rocprofiler info_kind_t kind,			// kind of iterated info
	rocprofiler_info_data_t query,			// info query
	hsa_status_t (*callback)(const rocprofiler_info_data_t info, void *data), // callback
	void *data);					// data passed to callback
```
### 4.3. Context API
```
Profiling context is accumulating all profiling information including profiling features
which carry profiling data, required buffers for profiling command packets and output data.
The context can be created and deleted by the library open/close methods. By deleting
the context all accumulated by the library resources associated with this context will be
released. If it is required more than one run to collect all requested counters data then
data for all profiling groups should be collected and then the metrics can be calculated by
loading the saved groups' data to the profiling context. Saving and loading of the groups
data is responsibility of the tool. The groups are automatically identified on the profiling
context open and there is API to access them, see the "Profiling groups" section below.

Profiling context handle:

typename rocprofiler_t;

Profiling feature kind:

typedef enum {
	ROCPROFILER_FEATURE_KIND_METRIC = 0,	// metric
	ROCPROFILER_FEATURE_KIND_TRACE = 1	// trace
} rocprofiler_feature_kind_t;

Profiling feature parameter:

typedef hsa_ven_amd_aqlprofile_parameter_t rocprofiler_feature_parameter_t;

Profiling data kind:

typedef enum {
	ROCPROFILER_DATA_KIND_UNINIT = 0,	// data uninitialized
	ROCPROFILER_DATA_KIND_INT32 = 1,	// 32bit integer
	ROCPROFILER_DATA_KIND_INT64 = 2,	// 64bit integer
	ROCPROFILER_DATA_KIND_FLOAT = 3,	// float single-precision result
	ROCPROFILER_DATA_KIND_DOUBLE = 4,	// float double-precision result
	ROCPROFILER_DATA_KIND_BYTES = 5		// trace output as a bytes array
} rocprofiler_data_kind_t;


Profiling data:

typedef struct {
	rocprofiler_data_kind_t kind;		// result kind
	union {
		uint32_t result_int32;		// 32bit integer result
		uint64_t result_int64;		// 64bit integer result
		float result_float;		// float single-precision result
		double result_double;		// float double-precision result
		typedef struct {
 			void* ptr;		// pointer
			uint32_t size;		// byte size
			uint32_t instances;	// number of trace instances
		} result_bytes;			// data by ptr and byte size
	};
} rocprofiler_data_t;

Profiling feature:

typedef struct {
	rocprofiler_feature_kind_t type;			// feature type
	const char* name;					// feature name
	const rocprofiler_feature_parameter_t* parameters;	// feature parameters
	uint32_t parameter_count;				// feature parameter count
	rocprofiler_data_t* data;				// profiling data
} rocprofiler_feature_t;

Profiling mode masks:
There are several modes which can be specified for the profiling context.
STANDALONE mode can be used for the counters sampling in another then application context
to support statistical system wide profiling. In this mode the profiling context supports
its own queue which can be created on the context open if the CREATEQUEUE mode also specified.
See also "Profiler properties" section below for the standalone mode queue properties.
The profiler supports several profiling groups for collecting profiling data in several
runs and 'SINGLEGROUP' mode allows only one group and the context open will fail if more
groups are needed.

typedef enum {
	ROCPROFILER_MODE_STANDALONE = 1,	// standalone mode when ROC profiler
	                                        // supports own AQL queue
	ROCPROFILER_MODE_CREATEQUEUE = 2,	// profiler creates queue in STANDALONE mode
	ROCPROFILER_MODE_SINGLEGROUP = 4	// profiler allows one group only and fails
	                                        // if more groups are needed
} rocprofiler_mode_t;

Context data readiness callback:

typedef void (*rocprofiler_context_callback_t)(
	rocprofiler_group_t* group,		// profiling group
	void* arg);				// callback arg

Profiler properties:
There are several properties which can be specified for the context. A callback can be
registered which will be called when the context data is ready. In standalone profiling mode
'ROCPROFILER_MODE_STANDALONE' the context supports its own queue and the queue can be set by
the property 'queue' or a queue will be created with the specified depth 'queue_depth' if mode
'ROCPROFILER_MODE_CREATEQUEUE' also specified.

typedef struct {
	rocprofiler_context_callback_t callback; // callback on the context data readiness
	void* callback_arg;			 // callback arg
	has_queue_t* queue;			 // HSA queue for standalone mode
	uint32_t queue_depth;			 // created queue depth,for create-queue mode
} rocprofiler_properties_t;

Open/close profiling context:

hsa_status_t rocprofiler_open(
	hsa_agent_t agent,			// GPU handle
	rocprofiler_feature_t* features,	// [in/out] profiling feature array
	uint32_t feature_count,			// profiling feature count
	rocprofiler_t** context,		// [out] profiling context handle
	uint32_t mode,				// profiling mode mask
	rocprofiler_properties_t* properties);	// profiler properties

hsa_status_t rocprofiler_close(
	rocprofiler_t* context);		// [in] profiling context

Profiling groups:
The profiler on the context open automatically identifies a required number of the application
runs to collect all data needed for all specified metrics and creates a metric group per each
run. Data for all profiling groups should be collected and then the metrics can be calculated
by loading the saved groups' data to the profiling context. Saving and loading of he groups
data is responsibility of the tool.

typedef struct {
	uint32_t index;				// profiling group index
	rocprofiler_feature_t** features;	// profiling features array
	uint32_t feature_count;			// profiling feature count
	rocprofiler_t* context;			// profiling context handle
} rocprofiler_group_t;

Return profiling groups count:

hsa_status_t rocprofiler_group_count(
	rocprofiler_t* context);		// [in/out] profiling context
	uint32* count);				// [out] profiling groups count

Return the profiling group for a given index:

hsa_status_t rocprofiler_get_group(
	rocprofiler_t* context,			// [in/out] profiling context,
	                                        // will be returned as
						// a part of the group structure
	uint32_t index,				// [in] group index
	rocprofiler_group_t* group);		// [out] profiling group

Calculate metrics data. The data will be stored to the registered profiling features data fields:
After all profiling context data is ready the registered metrics can be calculated. The context
data readiness can be checked by 'get_data' API or using the context callback.

hsa_status_t rocprofiler_get_metrics(
	rocprofiler_t* context);		// [in/out] profiling context

Method for iterating trace data instances:
Trace data can have several instance, for example, one instance per Shader Engine.

hsa_status_t rocprofiler_iterate_trace_data(
	const rocprofiler_t* contex,			// [in] context object
	hsa_ven_amd_aqlprofile_data_callback_t callback, // [in] callback to iterate
	                                                // the output data
	void* callback_data);				// [in/out] passed to callback data
```
### 4.4. Sampling API
```
The API supports the counters sampling usage model with start/read/stop methods and also lets
to wait for the profiling data in the intercepting usage model with get_data method.

Start/stop/read methods:

hsa_status_t rocprofiler_start(
	rocprofiler_t* context,			// [in/out] profiling context
	uint32_t group_index = 0);		// group index

hsa_status_t rocprofiler_stop(
	rocprofiler_t* context,			// [in/out] profiling context
	uint32_t group_index = 0);		// group index
	
hsa_status_t rocprofiler_read(
	rocprofiler_t* context,			// [in/out] profiling context
	uint32_t group_index = 0);		// group index

Wait for profiling data:

hsa_status_t rocprofiler_get_data(
	rocprofiler_t* context,			// [in/out] profiling context
	uint32_t group_index = 0);		// group index

Group versions of the above start/stop/read/get_data methods:

hsa_status_t rocprofiler_group_start(
	rocprofiler_group_t* group);		// [in/out] profiling group

hsa_status_t rocprofiler_group_stop(
	rocprofiler_group_t* group);		// [in/out] profiling group

	
hsa_status_t rocprofiler_group_read(
	rocprofiler_group_t* group);		// [in/out] profiling group


hsa_status_t rocprofiler_group_get_data(
	rocprofiler_group_t* group);		// [in/out] profiling group
```
### 4.5.  Intercepting API 
```
The library provides a callback API for enabling profiling for the kernels dispatched to
HSA AQL queues. The API enables per-kernel profiling data collection.
Currently implemented the option with serializing the kernels execution.

ROC profiler callback type:

hsa_status_t (*rocprofiler_callback_t)(
	const rocprofiler_callback_data_t* callback_data, // callback data passed by HSA runtime
	void* user_data,				  // [in/out] user data passed
	                                                  // to the callback
	rocprofiler_group** group);			  // [out] returned profiling group

Profiling callback data:

typedef struct {
	uint64_t dispatch;                                   // dispatch timestamp
	uint64_t begin;                                      // begin timestamp
	uint64_t end;                                        // end timestamp
	uint64_t complete;                                   // completion signal timestamp
} rocprofiler_dispatch_record_t;

typedef struct {
	hsa_agent_t agent;                                   // GPU agent handle
	uint32_t agent_index;                                // GPU index
	const hsa_queue_t* queue;                            // HSA queue
	uint64_t queue_index;                                // Index in the queue
	const hsa_kernel_dispatch_packet_t* packet;          // HSA dispatch packet
	const char* kernel_name;                             // Kernel name
	const rocprofiler_dispatch_record_t* record;         // Dispatch record
} rocprofiler_callback_data_t;

Queue callbacks:

typedef struct {
    rocprofiler_callback_t dispatch;                             // kernel dispatch callback
    hsa_status_t (*destroy)(hsa_queue_t* queue, void* data);     // queue destroy callback
} rocprofiler_queue_callbacks_t;

Adding/removing kernel dispatch and queue destroy callbacks

hsa_status_t rocprofiler_set_intercepting(
    rocprofiler_intercepting_t callbacks,                        // intercepting callbacks
    void* data);                                                 // [in/out] passed callbacks data

hsa_status_t rocprofiler_remove_intercepting();
```
## 5. Application code examples
### 5.1. Querying available metrics
```
Info data callback:

	hsa_status_t info_data_callback(const rocprofiler_info_data_t info, void *data) {
		switch (info.kind) {
			case ROCPROFILER_INFO_KIND_METRIC: {
				printf("metric %s, description %s\n",
					info.metric.name,
					info.metric.description);
				break;
			}
			default:
				printf("wrong info kind %u\n", kind);
				return HSA_STATUS_ERROR;
		}
		return HSA_STATUS_SUCCESS;
	}

Printing all available metrics:

	hsa_status_t status = rocprofiler_iterate_info(
		agent,
		ROCPROFILER_INFO_KIND_METRIC,
		info_data_callback,
		 NULL);
	<check status>
```
### 5.2. Profiling code example
```
Profiling of L1 miss ratio, average memory bandwidth.
In the example below rocprofiler_group_get_data group APIs are used for the purpose of a usage
example but in SINGLEGROUP mode when only one group is allowed the context handle itself can be
saved and then direct context method rocprofiler_get_data with default group index equal to 0
can be used.

hsa_status_t_dispatch_callback(
	const rocprofiler_callback_data_t* callback_data,
	void* user_data,
	rocprofiler_group_t* group)
{
	hsa_status_t status = HSA_STATUS_SUCCESS;
	// Profiling context
	rocprofiler_t* context;
	// Profiling info objects
	rocprofiler_feature_t features* = new rocprofiler_feature_t[2];
	// Tracing parameters
	rocprofiler_feature_parameter_t* parameters = new rocprofiler_feature_parameter_t[2];

	// Setting profiling features
	features[0].type = ROCPROFILER_METRIC;
	features[0].name = "L1_MISS_RATIO";
	features[1].type = ROCPROFILER_METRIC;
	features[1].name = "DRAM_BANDWIDTH";

	// Creating profiling context
	status = rocprofiler_open(callback_data->dispatch.agent, features, 2, &context,
		ROCPROFILER_MODE_SINGLEGROUP, NULL);
	<check status>
	
	// Get the profiling group
	// For general case with many groups there is rocprofiler_group_count() API
	const uint32_t group_index = 0
	status = rocprofiler_get_group(context, group_index, group);
	<check_status>

	// In SINGLEGROUP mode the context handle itself can be saved, because there is just one group
	<saving the callback data/profiling group/profiling features>

	return status;
}

void profiling_libary_constructor() {
	// Defining callback data, no data in this simple example
	void* callback_data = NULL;

	// Adding observers
	hsa_sttaus_t status = rocprofiler_add_dispatch_callback(dispatch_callback, callback_data);
	<check status>

	// Dispatching profiled kernel
	<dispatching profiled kernels>
}

void profiling_libary_destructor() {
	<for entry : <saved callbacks data>> {
		// In SINGLEGROUP mode the rocprofiler_get_group() method with default zero group
		// index can be used, if context handle would be saved
		status = rocprofiler_group_get_data(entry->group);
		<check status>
		status = rocprofiler_get_metrics(entry->group->context);
		<check status>
		status = rocprofiler_close(entry->group->context);
		<check status>

		<tool_dump_data_method(entry->dispatch_data, entry->features, entry->features_count)>;
	}
}
```
### 5.3. Option to use completion callback
```
Creating profiling context with completion callback:
	. . .
	rocprofiler_properties_t properties = {};
	properties.callback = completion_callback;
	properties.callback_arg = NULL; 	// no args defined
	status = rocprofiler_open(agent, features, 3, &context,
		ROCPROFILER_MODE_SINGLEGROUP, properties);
	<check status>
	. . .

Definition of completion callback:

void completion_callback(profiler_group_t group, void* arg) {
	<tool_dump_data_method(group)>
	hsa_status_t status = rocprofiler_close(group.context);
	<check status>
}
```
