The main model consists of a pretrained convolutional encoder that extracts features and a transformer decoder that generates captions. For more information, please refer to the corresponding DCASE task ...
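
As a rough illustration of the encoder-decoder layout described above, here is a minimal PyTorch sketch. It is not the project's actual implementation: the class name `CaptionModel`, the helper `greedy_caption`, the layer sizes, the mel-spectrogram input shape, and the `bos_id`/`eos_id` values are all assumptions made for the example. The small convolutional stack only stands in for a real pretrained encoder, and positional encodings are left out for brevity.

```python
import torch
import torch.nn as nn


class CaptionModel(nn.Module):
    """Toy captioner: convolutional encoder + transformer decoder (hypothetical sizes)."""

    def __init__(self, vocab_size=100, d_model=64, n_mels=64):
        super().__init__()
        # Stand-in for the pretrained convolutional encoder; a real model
        # would load pretrained weights here instead of random ones.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(
            d_model, nhead=4, dim_feedforward=128, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, mel, tokens):
        # mel: (batch, n_mels, time); tokens: (batch, seq_len)
        memory = self.encoder(mel).transpose(1, 2)  # (batch, time', d_model)
        seq_len = tokens.size(1)
        # Causal mask: each position may only attend to earlier tokens.
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1
        )
        hidden = self.decoder(self.embed(tokens), memory, tgt_mask=mask)
        return self.out(hidden)  # (batch, seq_len, vocab_size)


@torch.no_grad()
def greedy_caption(model, mel, bos_id=1, eos_id=2, max_len=20):
    """Generate token ids one step at a time, always picking the top logit."""
    model.eval()
    tokens = torch.full((mel.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len - 1):
        next_id = model(mel, tokens)[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_id], dim=1)
        if (next_id == eos_id).all():  # stop once every caption has ended
            break
    return tokens
```

Calling `greedy_caption(CaptionModel(), torch.randn(2, 64, 50))` returns a tensor of token ids with one row per input clip. With untrained weights the ids are meaningless; the point is only to show how encoder features feed the decoder as cross-attention memory.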
For setup and configuration, see the setup documentation. For advanced usage, see the configuration reference and the custom instrumentation API. Confused about APM terminology? Take a look at ...