almtokenizer

This repository aims to reproduce the results of the paper "ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling". This work introduces a new way of compressing audio into discrete tokens that are both low-bitrate and semantically rich, making them more suitable for audio language modeling tasks such as text-to-speech, audio captioning, or music generation.

Sound Reconstruction Examples

Here we present some examples of sound reconstruction using the ALMTokenizer. The model offers bitrate on demand by changing the window size (w), which controls the rate at which [CLS] tokens are interleaved. A higher w yields a more compressed representation, which can lose detail but also lowers the overall bitrate. The examples show the reconstruction of a sound with windows of size 3, 6, and 10. All three reconstructions were produced with the same model; there is no need to retrain for each bitrate.
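To see why a larger window size lowers the bitrate, a rough back-of-the-envelope sketch helps. Assuming one [CLS] token is interleaved per window of w encoder frames (the exact token layout inside ALMTokenizer may differ), the token count, and hence the bitrate, scales roughly as 1/w:

```python
import math

def cls_token_count(num_frames: int, w: int) -> int:
    """Number of [CLS] tokens if one is interleaved per window of w frames.

    Illustrative only: the precise layout in ALMTokenizer may differ,
    but the 1/w scaling is what makes bitrate-on-demand work.
    """
    return math.ceil(num_frames / w)

# For a fixed number of encoder frames, a larger window means fewer tokens.
frames = 300
for w in (3, 6, 10):
    print(f"w={w}: {cls_token_count(frames, w)} [CLS] tokens")
# w=3: 100, w=6: 50, w=10: 30 -- roughly a 3x bitrate range from one model.
```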

Original:

Reconstructed (w=3):

Reconstructed (w=6):

Reconstructed (w=10):

Sound Space Traversals

In the following examples, we showcase the results of our interpolation smoothness tests. Each sequence begins with a sound from one instrument and gradually transforms into another as we interpolate between their latent representations. Ideally, these transitions should feel continuous, with the timbre evolving naturally rather than changing abruptly. While the evaluation is ultimately subjective, these samples provide a direct way to hear how well the latent space of ALMTokenizer captures meaningful and coherent trajectories compared to EnCodec.
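The traversal procedure itself is simple: encode the two endpoint sounds, linearly interpolate between their latent sequences, and decode each intermediate point. A minimal numpy sketch, using random arrays as stand-ins for the codec latents (the real latents would come from the EnCodec or ALMTokenizer encoder, and each slice of the path would be decoded back to audio):

```python
import numpy as np

def interpolate_latents(z_a: np.ndarray, z_b: np.ndarray, steps: int) -> np.ndarray:
    """Linearly interpolate between two latent sequences of the same shape.

    Returns an array of shape (steps, *z_a.shape); each slice along the
    first axis would be decoded into one segment of the transition.
    """
    alphas = np.linspace(0.0, 1.0, steps).reshape(-1, 1, 1)
    return (1.0 - alphas) * z_a + alphas * z_b

rng = np.random.default_rng(0)
z_flute = rng.normal(size=(75, 128))  # stand-in latents: (frames, dim)
z_cello = rng.normal(size=(75, 128))
path = interpolate_latents(z_flute, z_cello, steps=9)
# path[0] is exactly z_flute, path[-1] is exactly z_cello;
# the seven slices in between form the traversal.
```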

From flute C5 to flute E5 to flute G5:

EnCodec:

ALMTokenizer:

From clarinet G5 to trumpet G5:

EnCodec:

ALMTokenizer:

From cello A3 to flute A6:

EnCodec:

ALMTokenizer:

Zero-shot Timbre Transfer

In the following examples, we test whether we can change the timbre of a sound (its instrument-like quality) while keeping the pitch and rhythm intact.

The idea is simple:

  1. We encode an input sound into latent representations using both EnCodec and ALMTokenizer.
  2. For each instrument in the Good-sounds dataset, we compute an “average point” in latent space (a centroid).
  3. To transform a sound, we take its latent representation and shift it toward the centroid of a target instrument.
  4. Finally, we decode the shifted representation back into audio.
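Steps 2 and 3 above can be sketched in a few lines of numpy. The arrays below are random stand-ins for encoded latents (the real ones would come from EnCodec or ALMTokenizer over the Good-sounds dataset), and the mixing weight `alpha` is a hypothetical knob, not a value from the paper:

```python
import numpy as np

def instrument_centroids(latents_by_instrument):
    """Step 2: average latent frame per instrument (the centroid)."""
    return {name: np.concatenate(zs).mean(axis=0)
            for name, zs in latents_by_instrument.items()}

def shift_toward(z, centroid, alpha=0.7):
    """Step 3: move each latent frame a fraction alpha of the way to the
    target centroid. alpha=0 leaves the sound unchanged; alpha=1 collapses
    the timbre entirely onto the centroid."""
    return z + alpha * (centroid - z)

rng = np.random.default_rng(1)
dataset = {  # stand-in latents per instrument: lists of (frames, dim)
    "trumpet": [rng.normal(loc=1.0, size=(50, 128)) for _ in range(4)],
    "cello":   [rng.normal(loc=-1.0, size=(50, 128)) for _ in range(4)],
}
centroids = instrument_centroids(dataset)

z_voice = rng.normal(size=(80, 128))  # stand-in for an encoded voice
z_shifted = shift_toward(z_voice, centroids["trumpet"])
# Step 4 would decode z_shifted back into audio with the codec's decoder.
```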

These examples show that while reconstruction quality is still limited, ALMTokenizer's latent space captures semantic structure more clearly. This makes the timbre transfer feel more intentional than with EnCodec, even if the results are far from perfect.

Male speech to trumpet (note A, any octave)

EnCodec before transfer:

EnCodec after transfer:

ALMTokenizer before transfer:

ALMTokenizer after transfer:

Female speech to trumpet (note A, any octave)

EnCodec before transfer:

EnCodec after transfer:

ALMTokenizer before transfer:

ALMTokenizer after transfer: