Bark: Real-time Open-Source Text-to-Audio Rivaling ElevenLabs

May 14, 2023

Bark is a transformer-based text-to-audio model created by Suno. Bark can generate highly realistic, multilingual speech as well as other audio – including music, background noise and simple sound effects. The model can also produce nonverbal communications like laughing, sighing and crying. To support the research community, we are providing access to pretrained model checkpoints, which are ready for inference and available for commercial use.
suno-ai/bark

This test is part of a comprehensive series of assessments aimed at identifying a suitable alternative to ElevenLabs for on-premise infrastructure. While ElevenLabs offers high-quality real-time TTS, its cost through API access is prohibitive for end-user applications. Our use case, the development of an AI companion system, necessitates a more economical solution.

For this test, we had the following requirements:

End-to-end real-time or faster inference (excluding network transfer)
High-quality voice synthesis that a person would find acceptable and “inviting.” Perfect human-like quality is not essential for our use case.

Testing

Software

	System 1 (H100)	System 2 (A100)
Driver (Nvidia)	525.85.12	525.105.17
CUDA	12.0	12.0
PyTorch	2.1.0.dev20230513	2.0.0+cu117
Python	3.10.9	3.10.11
Ubuntu	20.04.5 LTS	22.04.2 LTS

Hardware

	System 1 (H100)	System 2 (A100)
CPU	26x vCPUs	30x vCPUs
GPU	1x Nvidia H100 PCIe 80 GB	1x Nvidia A100 SXM4 40 GB
RAM	200 GiB	200 GiB
Storage	512 GiB SSD	512 GiB SSD

Test 1 (Lawrence of Arabia)

4 sentences
about 60 words
300 characters

“All men dream: but not equally. Those who dream by night in the dusty recesses of their minds wake in the day to find that it was vanity: but the dreamers of the day are dangerous men, for they may act their dreams with open eyes, to make it possible. This I did. — T.E Lawrence "Lawrence of Arabia”

Test 2 (Lord of the Rings)

6 sentences
150 words
867 characters

"In ancient times the Rings of Power were crafted by the Elven-smiths, and Sauron, the Dark Lord, forged the One Ring, filling it with his own power so that he could rule all others. But the One Ring was taken from him, and though he sought it throughout Middle-earth, it remained lost to him. After many ages it fell by chance into the hands of the hobbit Bilbo Baggins.

From Sauron's fastness in the Dark Tower of Mordor, his power spread far and wide. Sauron gathered all the Great Rings to him, but always he searched for the One Ring that would complete his dominion.

When Bilbo reached his eleventy-first birthday he disappeared, bequeathing to his young cousin Frodo the Ruling Ring and a perilous quest: to journey across Middle-earth, deep into the shadow of the Dark Lord, and destroy the Ring by casting it into the Cracks of Doom."

Test 3 (Animal Farm)

1 sentence
29 words
149 characters

“The creatures outside looked from pig to man, and from man to pig, and from pig to man again; but already it was impossible to say which was which.”

Test 4 (Extended Sentence)

1 sentence
33 words
183 characters

"When she looked out the window, she saw unimaginable beauty; the likes of which she had never seen before; and that is when it struck her: she had been transported to another world."

Results

		System 1 (H100)	System 2 (A100)
Test 1 (Lawrence of Arabia)	Sentences per Second	0.17	0.08
	Words per Second	2.55	1.12
	Characters per Second	12.82	5.38
Test 2 (Lord of the Rings)	Sentences per Second	0.11	0.05
	Words per Second	2.63	1.24
	Characters per Second	15.22	7.17
Test 3 (Animal Farm)	Sentences per Second	0.07	0.04
	Words per Second	2.10	1.12
	Characters per Second	10.86	5.72
Test 4 (Extended Sentence)	Sentences per Second	0.07	0.04
	Words per Second	2.30	1.31
	Characters per Second	12.76	7.29

H100 evaluations are an average of 6 tests; A100 evaluations are an average of 3 tests.
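
The three rates above are presumably just counts divided by wall-clock generation time. As a minimal sketch of that arithmetic (the `throughput` and `count_sentences` helpers are hypothetical names, and the naive sentence splitter is an assumption, not the method used in our script):

```python
import re
from dataclasses import dataclass


@dataclass
class Throughput:
    chars_per_second: float
    words_per_second: float
    sentences_per_second: float


def count_sentences(text: str) -> int:
    """Naive count: runs of '.', '!' or '?' followed by whitespace or end of text."""
    stripped = text.strip()
    if not stripped:
        return 0
    endings = re.findall(r"[.!?]+(?:\s|$)", stripped)
    # Text without terminal punctuation still counts as one sentence.
    return max(len(endings), 1)


def throughput(text: str, elapsed_seconds: float) -> Throughput:
    """Divide character, word and sentence counts by the generation time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return Throughput(
        chars_per_second=len(text) / elapsed_seconds,
        words_per_second=len(text.split()) / elapsed_seconds,
        sentences_per_second=count_sentences(text) / elapsed_seconds,
    )
```

Because every rate shares the same denominator, characters per second is always the largest of the three figures and sentences per second the smallest.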

Conclusions

Note: We conducted these evaluations using the most recent version of Bark (as of May 14, 2023). For these initial tests we used a serve script built with FastAPI, which was designed primarily for rapid prototyping. The test code has not been fully optimized, and we believe a more careful benchmark design would yield better numbers. Nonetheless, the figures provided should give a realistic expectation of performance in a production environment, even considering the inevitable quirky setups and potentially haphazard programming. In this round of testing, we focused solely on Bark's capacity for real-time voice synthesis. We plan to explore arbitrary audio generation in future assessments.
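
For readers who want to reproduce this kind of measurement, a timing harness might look roughly like the sketch below. It is not our serve script: `time_synthesis` is a hypothetical helper, and the synthesis function is passed in so the harness itself does not depend on Bark. With Bark installed, `synthesize` could wrap `generate_audio` from the `bark` package after calling `preload_models()`.

```python
import time
from statistics import mean
from typing import Callable, List


def time_synthesis(
    synthesize: Callable[[str], object],
    text: str,
    runs: int = 3,
    warmup: int = 1,
    clock: Callable[[], float] = time.perf_counter,
) -> List[float]:
    """Return the wall-clock duration of each timed synthesis run, in seconds."""
    # Warm-up calls absorb one-off costs (model loading, CUDA initialization)
    # so that they do not skew the timed runs.
    for _ in range(warmup):
        synthesize(text)

    timings = []
    for _ in range(runs):
        start = clock()
        synthesize(text)
        timings.append(clock() - start)
    return timings


if __name__ == "__main__":
    # Stand-in for a real model call; swap in a Bark wrapper to benchmark it.
    def fake_synthesize(prompt: str) -> None:
        time.sleep(0.01)

    durations = time_synthesis(fake_synthesize, "Hello there.", runs=3)
    print(f"mean seconds per run: {mean(durations):.3f}")
```

Reporting the mean over several runs mirrors the averaging described above (6 runs on the H100, 3 on the A100).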

In terms of quality, the synthesized voices are generally clear and have character. To minimize artifacts, careful voice creation and curation are essential. Artifacts can vary from delays in speaking (which can be programmatically detected and removed) to occasional music interference. For extremely long sentences, the system might sometimes cut off before completion, or even generate words that fit logically but were not in the original prompt. However, we believe that further tuning and updates will address these issues.
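
As an illustration of how speaking delays could be detected and removed programmatically, the sketch below caps every run of near-silent samples at a maximum pause length. The `shorten_pauses` helper, the amplitude threshold, and the half-second cap are all assumptions chosen for illustration. A production version would more likely work on short-frame energy than on individual samples.

```python
from typing import List, Sequence


def shorten_pauses(
    samples: Sequence[float],
    sample_rate: int,
    threshold: float = 0.01,
    max_pause_s: float = 0.5,
) -> List[float]:
    """Cap each run of near-silent samples at max_pause_s seconds.

    A sample counts as silent when its absolute amplitude is below
    `threshold` (audio is assumed to be normalized to the range [-1, 1]).
    """
    max_quiet = int(max_pause_s * sample_rate)
    trimmed: List[float] = []
    quiet_run = 0
    for sample in samples:
        if abs(sample) < threshold:
            quiet_run += 1
            if quiet_run > max_quiet:
                continue  # drop silence beyond the allowed pause length
        else:
            quiet_run = 0
        trimmed.append(sample)
    return trimmed
```

Pauses shorter than the cap pass through untouched, so the natural rhythm between words and sentences is preserved.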

The average English speaker talks at roughly 150 words per minute (WPM), or 2.5 words per second (WPS). In our limited testing, multi-sentence synthesis on the H100 averaged about 2.6 WPS (2.55 and 2.63 in Tests 1 and 2), just above that rate, while the A100 averaged about 1.2 WPS, roughly half of real time. Single-sentence generations were slightly slower on the H100 (2.10 and 2.30 WPS). Factors that may contribute to this discrepancy include the way the script is written, variable completion times for different sentences, and warm-up time, which weighs more heavily on a single short sentence.

At present, when multiple sentences are processed together in a batch on the H100, we can expect the entire output to be ready in roughly real time. The A100 did not reach real time in any of our tests.
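
The comparison against speaking rate can be written as a real-time factor: generation throughput divided by the rate at which the words would be spoken. The sketch below uses a hypothetical `real_time_factor` helper and the 150 WPM figure from the paragraph above. A value of 1.0 or more means the audio is produced at least as fast as it would be spoken.

```python
SPEAKING_RATE_WPM = 150  # average English speaking rate cited above


def real_time_factor(
    words_per_second: float,
    speaking_rate_wpm: float = SPEAKING_RATE_WPM,
) -> float:
    """Ratio of synthesis throughput to speaking rate (>= 1.0 is real time)."""
    speaking_rate_wps = speaking_rate_wpm / 60  # 150 WPM -> 2.5 WPS
    return words_per_second / speaking_rate_wps


if __name__ == "__main__":
    # Words-per-second figures for Test 1 from the results table.
    for system, wps in {"H100": 2.55, "A100": 1.12}.items():
        print(f"{system}: real-time factor {real_time_factor(wps):.2f}")
```

For Test 1 this puts the H100 at about 1.02 and the A100 at about 0.45.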

Bark is a relatively new model that has seen significant improvements since its launch on April 20, 2023. If this trend continues, the newly available H100 GPUs should enable real-time inference for various types of audio synthesis, ranging from short text segments to entire paragraphs and beyond (including music and sound effects). Therefore, we currently anticipate that Bark will likely replace ElevenLabs as our TTS provider in the coming weeks.