A Large-scale Dataset for Audio-Language Representation Learning


Luoyi Sun1,3
Xuenan Xu2
Mengyue Wu2
Weidi Xie1,3

1CMIC & 2X-LANCE, Shanghai Jiao Tong University
3Shanghai AI Lab

Accepted by ACM MM 2024



Code [GitHub]

Paper [arXiv]

Cite [BibTeX]

Dataset [HuggingFace]


We present an innovative, automatic audio caption generation pipeline (*) and construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.9M audio-text pairs. As shown in the figure on the left, the text descriptions in Auto-ACD are long (18 words on average) and use a diverse vocabulary (23K words), and they provide information about the surrounding auditory environment (data points with shadow) in which the sounds take place.
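To make the two headline statistics concrete, below is a minimal sketch of how the average caption length and vocabulary size can be computed from a plain-text dump of the captions (one caption per line). The file name and the simple whitespace tokenisation are assumptions for illustration, not part of any released toolkit.

```python
# Minimal sketch: compute average caption length and vocabulary size from a
# hypothetical plain-text file with one Auto-ACD caption per line.
from collections import Counter

def caption_stats(path: str):
    lengths, vocab = [], Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.lower().strip().strip(".").split()
            if not tokens:
                continue
            lengths.append(len(tokens))
            vocab.update(tokens)
    avg_len = sum(lengths) / max(len(lengths), 1)
    return avg_len, len(vocab)

avg_len, vocab_size = caption_stats("auto_acd_captions.txt")  # hypothetical file
print(f"average caption length: {avg_len:.1f} words, vocabulary size: {vocab_size}")
```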



Preview

A singing bowl resonates with a gentle gong sound, accompanied by soft music playing in a church.

A melodic accordion tune fills the air as the musician plays in a music studio, creating a pleasant ambiance.

A train horn blares as a train approaches, creating a loud and powerful sound in a railway environment.

Rain falls hard on a surface as people talk in the distance, creating a soothing ambiance of a rainy day.

Sheep bleat in the distance as people talk faintly, creating a pastoral atmosphere in a wheat field.

An emergency vehicle siren blares loudly as a fire engine speeds through the city streets, signaling an urgent situation.

The motorcycle engine revs up and down while driving through a residential neighborhood, accompanied by some speech and light engine sounds.

Bird wings flap as rustling and birds chirping in the background create a serene ambiance in a garden.

A roaring crowd erupts in cheers and battle cries, creating an electrifying atmosphere during a lively event.




Abstract

The AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. However, in the audio representation learning community, present audio-language datasets suffer from limitations such as insufficient volume, simplistic content, and arduous collection procedures. To tackle these challenges, we present an innovative, automatic audio caption generation pipeline built on a series of public tools and APIs, and construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.9M audio-text pairs. To demonstrate the effectiveness of the proposed dataset, we train popular models on it and show performance improvements on various downstream tasks, namely audio-language retrieval, audio captioning, and environment classification. In addition, we establish a novel test set and provide a benchmark for audio-text tasks.




Dataset Pipeline

This figure illustrates our data collection pipeline. We employ a range of publicly available tools and APIs from across the AI community, e.g., vision, language, and audio models, to generate comprehensive language descriptions for the audio tracks of existing video datasets, e.g., AudioSet and VGGSound.
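To make the idea concrete, the following is a conceptual sketch of how per-modality clues can be assembled into a prompt for a language model that writes the final caption. The field names, prompt wording, and the `llm` callable are illustrative assumptions, not the exact tools or prompts used to build Auto-ACD.

```python
# Conceptual sketch: clues from off-the-shelf vision/audio tools are gathered
# into a structured record, turned into a prompt, and summarised by a language
# model into a single natural caption.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Clues:
    visual_caption: str           # e.g. output of an image captioning model on a key frame
    detected_objects: List[str]   # e.g. output of an object detector
    scene: str                    # e.g. output of a place/scene recognition model
    audio_tags: List[str]         # e.g. output of an audio tagging model
    dataset_label: str            # original AudioSet / VGGSound label

def build_prompt(c: Clues) -> str:
    return (
        "Write one fluent sentence describing what can be heard in an audio clip, "
        "mentioning the likely environment, based on these clues:\n"
        f"- visual caption: {c.visual_caption}\n"
        f"- objects: {', '.join(c.detected_objects)}\n"
        f"- scene: {c.scene}\n"
        f"- audio tags: {', '.join(c.audio_tags)}\n"
        f"- dataset label: {c.dataset_label}"
    )

def generate_caption(clues: Clues, llm: Callable[[str], str]) -> str:
    """`llm` is any text-completion callable, e.g. a wrapper around a chat API."""
    return llm(build_prompt(clues))
```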




Architecture

An overview of our architectures. The left figure shows the audio-language retrieval model: we employ the pre-trained HTSAT as the audio encoder and the pre-trained RoBERTa as the language encoder, with both encoders initialised from the pre-trained CLAP model. The right figure shows the automatic audio captioning model: we adopt a lightweight design in which both the audio backbone and the language model (GPT-2) are frozen, and only a mapping network is trained.
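As a rough illustration of the captioning branch, the sketch below connects a frozen audio backbone and a frozen GPT-2 through a small trainable mapping network that turns the audio embedding into a sequence of prefix token embeddings. The prefix length, MLP shape, and audio feature dimension are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch of a lightweight captioning model: frozen audio encoder + frozen GPT-2,
# with only a mapping network trained to produce prefix embeddings for GPT-2.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class AudioCaptioner(nn.Module):
    def __init__(self, audio_encoder: nn.Module, audio_dim: int = 768, prefix_len: int = 8):
        super().__init__()
        self.audio_encoder = audio_encoder.eval()                    # frozen audio backbone
        self.gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()   # frozen language model
        for p in list(self.audio_encoder.parameters()) + list(self.gpt2.parameters()):
            p.requires_grad = False
        d = self.gpt2.config.n_embd
        self.prefix_len = prefix_len
        # The only trainable component: map the audio embedding to `prefix_len`
        # pseudo-token embeddings that condition GPT-2.
        self.mapper = nn.Sequential(
            nn.Linear(audio_dim, d * prefix_len), nn.Tanh(),
            nn.Linear(d * prefix_len, d * prefix_len),
        )

    def forward(self, audio, caption_ids, labels=None):
        with torch.no_grad():
            a = self.audio_encoder(audio)                            # (B, audio_dim)
        prefix = self.mapper(a).view(a.size(0), self.prefix_len, -1)
        tok = self.gpt2.transformer.wte(caption_ids)                 # caption token embeddings
        embeds = torch.cat([prefix, tok], dim=1)
        if labels is not None:
            # Ignore the prefix positions when computing the language-modelling loss.
            pad = torch.full(prefix.shape[:2], -100, dtype=labels.dtype, device=labels.device)
            labels = torch.cat([pad, labels], dim=1)
        return self.gpt2(inputs_embeds=embeds, labels=labels)
```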




Protocol-I: Audio-language Retrieval

To validate the efficacy of our proposed dataset, we train an audio-language model with standard contrastive learning. The results show that: (i) training on Auto-ACDVS leads to a significant improvement in Recall@k; (ii) training on the full Auto-ACD leads to a further remarkable performance gain; (iii) on the Auto-ACD benchmark, which contains a more diverse lexicon and richer language descriptions, models trained on Auto-ACD significantly outperform the model trained on LAION-Audio-630K.
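For reference, the standard contrastive objective used in this kind of audio-language training is sketched below. The temperature value and L2 normalisation follow the common CLIP-style recipe and are assumptions rather than the exact training configuration.

```python
# Minimal sketch of a symmetric contrastive (CLIP-style) loss over a batch of
# paired audio and caption embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, text_emb: (B, D) embeddings of paired audio clips and captions."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # matching pairs lie on the diagonal
    # Symmetric cross-entropy: each audio matches its own caption and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```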




Protocol-II: Automatic Audio Captioning

Comparison with CLAP on Clotho and Auto-ACD.

We use audio captioning to demonstrate the effectiveness of our pre-trained audio backbone. The results show improved performance over the baseline across all evaluation metrics, while the baseline approach suffers a sharp performance drop on Auto-ACD.




Protocol-III: Environment Classification

Comparison with CLAP on DCASE 2020 Mobile and AudioSet Env.

We conduct environment classification; the results indicate that our audio-language model has a stronger ability to recognise acoustic environments than CLAP.
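One common way to use such an audio-language model for environment classification is zero-shot: score each scene label by the similarity between the audio embedding and the embedding of a text prompt naming that scene. The sketch below illustrates this; the prompt template and example class list are assumptions for illustration, not the exact evaluation setup.

```python
# Illustrative zero-shot environment classification with an audio-language model.
import torch
import torch.nn.functional as F

def classify_environment(audio_emb: torch.Tensor, text_encoder, classes):
    """audio_emb: (D,) clip embedding; text_encoder: str -> (D,) tensor; classes: list of names."""
    prompts = [f"The sound of a {c}." for c in classes]          # assumed prompt template
    text_embs = torch.stack([text_encoder(p) for p in prompts])  # (C, D)
    sims = F.normalize(audio_emb, dim=-1) @ F.normalize(text_embs, dim=-1).T
    return classes[sims.argmax().item()]

# Example class list (illustrative): ["airport", "bus", "metro station", "park", "street traffic"]
```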



Acknowledgements

Based on a template by Phillip Isola and Richard Zhang.