Anonymous Submission: Audio-Journey

This repository contains the anonymous submission of our work titled "Audio-Journey: Open Domain Latent Diffusion Based Text-to-Audio Generation".

Abstract

Despite recent progress, machine learning for open-domain audio generation lags behind models for image, text, speech, and music. In this paper, we leverage state-of-the-art (SOTA) Large Language Models (LLMs) to augment the existing weak labels of the audio dataset into enriched captions; we adopt a SOTA video-captioning model to automatically generate video captions, and we again use LLMs to merge the audio and visual captions into a rich, large-scale dataset. In our experiments, we first verify that our Audio+Visual Captions are of high quality relative to baselines and ground truth (a 12.5% gain in semantic score over baselines). Using this dataset, we construct a latent diffusion model that generates in the EnCodec latent space. Our model is novel in the current SOTA audio-generation landscape because its generation space, text encoder, noise schedule, and attention mechanism function together to provide competitive open-domain audio generation.
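
For context on the generation space mentioned above, the sketch below (not the authors' code) illustrates how a waveform maps into and out of the EnCodec latent space using Meta's open-source encodec package. The file name, bandwidth setting, and use of torchaudio are illustrative assumptions, not details taken from this repository.

```python
# Minimal sketch: round-tripping audio through the EnCodec latent space,
# the space in which the paper's latent diffusion model generates.
# Assumes `pip install encodec torchaudio` and a local file `example.wav`.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the pretrained 24 kHz EnCodec model and pick a target bandwidth
# (6.0 kbps here is an arbitrary illustrative choice).
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# Read a waveform and resample/remix it to the model's expected format.
wav, sr = torchaudio.load("example.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.unsqueeze(0)  # shape: [batch, channels, time]

# Encode: each frame is a (codes, scale) pair; `codes` has shape
# [batch, n_codebooks, time] and is the latent representation a
# diffusion model could be trained to generate.
with torch.no_grad():
    frames = model.encode(wav)
codes = torch.cat([codes for codes, _ in frames], dim=-1)
print(codes.shape)

# A generated latent would be decoded back to audio the same way:
with torch.no_grad():
    reconstruction = model.decode(frames)
```

Generating in this compressed latent space, rather than in raw waveforms or mel spectrograms, is what the abstract refers to as the model's generation space.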

Code

The supporting code for our work is available in this repository.

Paper

Visit the Audio Journey project on GitHub:

Audio Journey on GitHub

Read our preprint:

Audio Journey Preprint

Read our Appendix:

Audio Journey Appendix

Examples

Audio samples generated from the following text prompts:

Sound of a hammer striking wood

Sound of a dog howling

Progressive rock followed by a band playing music

Sound of shuffling cards

A train whistle heard in the distance

Dependencies

Please check the README file for instructions on how to set up the necessary dependencies to run the code.

Contact

For any issues related to the code or the project, please raise an issue on this GitHub repository.

Note

The authors' identities are withheld as part of the anonymous submission process.