google-colaboratory large-language-model

What is the size limit on Mistral 7B training samples?


While following this tutorial in this Colab notebook, I tried to replace the default dataset with one I generated from the text of the book "The Terraformers", in an attempt to teach the model the book's content. The original runs fine (after changing a \\n to \n) on a GPU as small as a T4.

I generated some of my own training data. With even a single row containing a large sample (around 400 tokens), memory usage explodes. It consumed 40 GB on an A100:

{"text":"<s>[INST]Please quote part of \"The Terraformers\" that would answer the following question: What is Destry's reaction when she sees someone tending a fire at the edge of the boreal forest?[/INST]\nDestry could smell the smoke long before she saw its improbable source. There was some kind of person---possibly Homo sapiens---tending a fire at the edge of the boreal forest. She squinted, trying to make out details from half a klick away. The person's skin was so pale she guessed it had hardly met real sunlight, which meant they were definitely not a stray worker from one of the construction camps. When the intruder crouched next to the flames, she caught a glimpse of red beard merging into a tangle of hair. In their hands, a hare was speared and cooking on an expensive alloy spit. The sight was horrifying, and Destry flinched back reflexively.\n\n\"Let's stop,\" she whispered to her mount, a thick-barreled moose with red-brown fur and a crown of antlers spreading from his forehead like a pair of massive, cupped hands. He flicked an ear in acknowledgement as she slid off his back and into his long shadow. Sinking down on one knee, Destry pressed her bare fingers into the soil, spreading them wide, establishing a high-bandwidth connection with the local ecosystem.\n\nThousands of sensors welcomed her into the planet's network, their collective perceptions knitting together from shards of cached memory, fragments of recorded sensation and perception. In this state, she too was a sensor, processing data through her eyes, nose, tongue, skin, and ears. What she perceived she shared with the ecosystem. She could feel the sensors collaboratively reviewing the scene from her perspective, learning that she wanted to know more about the mammal at the edge of the forest. It was like her body had become the land. Her awareness stretched forward, racing through root systems and over insects, tasting acid levels in the soil. 
The person's feet on the ground registered as pressure on her back, and she smelled redox reactions in the fire. Each sensor's evaluation joined the swelling chorus in her ears as the tiny machines voted on what their data points might mean: polymer, hair, carnivore, unprocessed excrement, dead trees, carbon cycle perturbation, predator, metal, fur, synthetic microbiome. As Destry's data surged across the field and into the forest, the sensors could see what she did, and their analysis coalesced into a strong probability: Homo sapiens in the region for eight days, causally linked to tree loss, small mammal loss, excrement buildup, complex toxins.</s>"

What is the token size limit for a training sample? Where is that configured?


Solution

  • The out-of-memory (OOM) error is not caused by the dataset. In the training configuration, find the optimizer setting

    optim = "paged_adamw_32bit"

    and replace it with

    optim = "paged_adamw_8bit"

    This should resolve the OOM issue.
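
As a rough illustration of why the optimizer setting can matter for memory, here is a minimal sketch of the arithmetic. It relies on the fact that Adam-style optimizers keep two running statistics per trainable parameter. The helper name `adam_state_bytes` and the parameter count are hypothetical, chosen only for this example. The real figure depends on how many parameters the notebook actually trains, which is far fewer if it uses adapters such as LoRA. The sketch also ignores the small overhead of 8-bit quantization constants.

```python
def adam_state_bytes(trainable_params: int, bytes_per_value: int) -> int:
    """Approximate memory used by Adam/AdamW optimizer state.

    Adam-style optimizers store two values (first and second moment
    estimates) for every trainable parameter.
    """
    return 2 * trainable_params * bytes_per_value


# Hypothetical count: full fine-tuning of roughly 7 billion parameters.
# With adapter-based training the trainable count is much smaller.
TRAINABLE = 7_000_000_000

state_32bit = adam_state_bytes(TRAINABLE, 4)  # 4 bytes per 32-bit value
state_8bit = adam_state_bytes(TRAINABLE, 1)   # 1 byte per 8-bit value

print(f"32-bit optimizer state: {state_32bit / 1e9:.0f} GB")
print(f"8-bit optimizer state:  {state_8bit / 1e9:.0f} GB")
```

Under these assumptions the 8-bit optimizer state is a quarter the size of the 32-bit state. Activations, gradients, and the model weights are separate costs that this estimate does not cover.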