This repository contains the anonymous submission of our work titled "Audio-Journey: Open Domain Latent Diffusion Based Text-to-Audio Generation".
Despite recent progress, machine learning for open-domain audio generation lags behind models for image, text, speech, and music. In this paper, we leverage state-of-the-art (SOTA) Large Language Models (LLMs) to augment the existing weak labels of the audio dataset and enrich its captions; we adopt a SOTA video-captioning model to automatically generate video captions; and we again use LLMs to merge the audio and visual captions into a rich, large-scale dataset. In our experiments, we first verified that our Audio+Visual Captions are of high quality against baselines and ground truth (a 12.5% gain in semantic score over baselines). Using this dataset, we constructed a Latent Diffusion Model that generates in the EnCodec latent space. Our model is novel in the current SOTA audio-generation landscape because our generation space, text encoder, noise schedule, and attention mechanism work together to provide competitive open-domain audio generation.
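For readers unfamiliar with the generation space, the sketch below shows how a waveform can be mapped into EnCodec's latent codes using Meta's open-source `encodec` package. This is only an illustrative assumption about the encoding step (the 24 kHz model variant, 6 kbps bandwidth, and file name are placeholders), not the exact configuration used in our pipeline.

```python
# Minimal sketch: encode a waveform into EnCodec codes, i.e. the kind of latent
# space the latent diffusion model generates in. Assumes the public `encodec`
# pip package; model variant and bandwidth below are illustrative placeholders.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # kbps; controls how many codebooks are used

wav, sr = torchaudio.load("example.wav")  # placeholder audio file
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.unsqueeze(0)  # add batch dimension: [B, C, T]

with torch.no_grad():
    encoded_frames = model.encode(wav)  # list of (codes, scale) tuples

# Concatenate codes across frames -> [B, n_codebooks, T_codes]
codes = torch.cat([codes for codes, _ in encoded_frames], dim=-1)
print(codes.shape)

# The same codec maps latents back to audio after generation:
with torch.no_grad():
    reconstructed = model.decode(encoded_frames)
```

Any concrete diffusion backbone, text encoder, or noise schedule on top of this space is described in the paper and repository rather than in this sketch.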
The supporting code for our work is available in the repository.
Visit the Audio Journey project on GitHub: Audio Journey on GitHub
Read our preprint: Audio Journey Preprint
Read our Appendix: Audio Journey Appendix

Example text prompts for our generated audio samples:

- Sound of a hammer striking wood
- Sound of a dog howling
- Progressive rock followed by a band playing music
- Sound of shuffling cards
- A train whistle heard in the distance
Please check the README file for instructions on how to set up the necessary dependencies to run the code.
For any issues related to the code or the project, please raise an issue on this GitHub repository.
The authors' identities are withheld as part of the anonymous submission process.