This repository aims to reproduce the results of the paper "ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling". This work introduces a new way of compressing audio into discrete tokens that are both low-bitrate and semantically rich, making them more suitable for audio language modeling tasks such as text-to-speech, audio captioning, or music generation.
Original: [audio]
Reconstructed (w=3): [audio]
Reconstructed (w=6): [audio]
Reconstructed (w=10): [audio]
In the following examples, we showcase the results of our interpolation smoothness tests. Each sequence begins with a sound from one instrument and gradually transforms into another as we interpolate between their latent representations. Ideally, these transitions should feel continuous, with the timbre evolving naturally rather than changing abruptly. While the evaluation is ultimately subjective, these samples provide a direct way to hear how well the latent space of ALMTokenizer captures meaningful and coherent trajectories compared to EnCodec.
From flute C5 to flute E5 to flute G5:
EnCodec: [audio]
ALMTokenizer: [audio]
From clarinet G5 to trumpet G5:
EnCodec: [audio]
ALMTokenizer: [audio]
From cello A3 to flute A6:
EnCodec: [audio]
ALMTokenizer: [audio]
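The latent interpolation behind the smoothness tests above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: it assumes plain linear interpolation between continuous latents of identical shape, which the text does not confirm. The helper name `interpolate_latents`, the `steps_per_segment` parameter, and the random toy latents are hypothetical; in practice the anchors would come from a codec encoder (EnCodec or ALMTokenizer), and each point on the path would be passed through the decoder to produce audio.

```python
import numpy as np


def interpolate_latents(anchors, steps_per_segment):
    """Return a piecewise-linear path through same-shaped latent arrays.

    Hypothetical helper: linear interpolation is an assumption here,
    not necessarily the scheme used to produce the samples above.
    """
    anchors = [np.asarray(a, dtype=np.float64) for a in anchors]
    if len(anchors) < 2:
        raise ValueError("need at least two anchor latents")
    if steps_per_segment < 1:
        raise ValueError("steps_per_segment must be >= 1")
    if any(a.shape != anchors[0].shape for a in anchors):
        raise ValueError("all anchor latents must share one shape")

    path = [anchors[0]]
    # Walk each consecutive pair of anchors, e.g. C5 -> E5, then E5 -> G5.
    for start, end in zip(anchors[:-1], anchors[1:]):
        for k in range(1, steps_per_segment + 1):
            alpha = k / steps_per_segment  # goes from just above 0 up to 1
            path.append((1.0 - alpha) * start + alpha * end)
    return np.stack(path)


# Toy stand-ins for encoder outputs (frames x latent_dim); real latents
# would come from the codec's encoder.
rng = np.random.default_rng(0)
z_c5, z_e5, z_g5 = (rng.standard_normal((4, 8)) for _ in range(3))

path = interpolate_latents([z_c5, z_e5, z_g5], steps_per_segment=5)
print(path.shape)  # (11, 4, 8): the first anchor plus 5 steps per segment
```

Each segment ends exactly on the next anchor, so a three-note example such as flute C5 to E5 to G5 passes through the middle note rather than skipping it. Whether the decoded audio along such a path sounds continuous is what the listening examples are meant to probe.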
In the following examples, we test whether we can change the timbre of a sound (its instrument-like quality) while keeping the pitch and rhythm intact.
The idea is simple:
These examples show that while reconstruction quality is still limited, ALMTokenizer's latent space captures semantic structure more clearly. This makes the timbre transfer feel more intentional than with EnCodec, even if the results are far from perfect.
EnCodec before transfer: [audio]
EnCodec after transfer: [audio]
ALMTokenizer before transfer: [audio]
ALMTokenizer after transfer: [audio]
EnCodec before transfer: [audio]
EnCodec after transfer: [audio]
ALMTokenizer before transfer: [audio]
ALMTokenizer after transfer: [audio]