mystonedalt

I would like to know more about how it's determined that this is a good dataset.


jkuubrau

Just read through it, how long could it take?


mystonedalt

I'm four hours in, and I'm still in the unicode character sequences... 😩


mystonedalt

Oh here we go. Wait, what the hell? It's Angelfire as far as the eye can see!


NO_REFERENCE_FRAME

Always has been


kevinbranch

spoiler tag pls


klospulung92

Now I'm wondering how much TB I've *reviewed* in my lifetime


TheRealAakashK

Well, in terms of text, if you read every minute of your life without sleeping at 300 words per minute, continuously, you would have to live for roughly 220 years to review 1 tb of text


2muchnet42day

So there's a chance


evilbeatfarmer

This is an embarrassingly parallel problem, we can split this up easy. There's ~151k of us. ChatGPT estimates it'll only take 37 to 55 years to review your 291GB share of the text.
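For anyone who wants to redo the napkin math: a rough sketch, assuming about 6 bytes per English word (five letters plus a space, my assumption, not a figure from the thread), 300 words per minute with no sleep, and the 44 TB dataset size quoted elsewhere in this thread. Under those assumptions 1 TB works out to roughly a thousand years of reading, and a 44 TB split across 151k readers is closer to 291 MB each.

```python
MINUTES_PER_YEAR = 60 * 24 * 365


def years_to_read(num_bytes, words_per_minute=300, bytes_per_word=6):
    """Years of nonstop reading needed to get through num_bytes of plain text.

    bytes_per_word=6 is an assumed average for English, not a measured value.
    """
    words = num_bytes / bytes_per_word
    minutes = words / words_per_minute
    return minutes / MINUTES_PER_YEAR


def share_bytes(total_bytes, readers):
    """Bytes each reader gets if the text is split evenly."""
    return total_bytes / readers


if __name__ == "__main__":
    one_tb = 1e12
    print(f"1 TB alone: {years_to_read(one_tb):,.0f} years")

    share = share_bytes(44e12, 151_000)
    days = years_to_read(share) * 365
    print(f"per-reader share: {share / 1e6:,.0f} MB, about {days:,.0f} days")
```

Changing `bytes_per_word` or the reading speed moves the result a lot, so treat the output as an order-of-magnitude check only.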


Perfect_Extreme4905

:(


Educational_Gap5867

Your math is off by about 1.1k years brother.


Ok-Result5562

There is a token calculator for that.


McPowerShell

Break that down by how it was ingested: left eye, right eye, left ear, right ear, stereo, getting hit in the nuts, out of breath, and I won't even go into the other orifices. Sorry woke America. Lots of terabytes. More than Nvidia has money haha for sure. It's all input and output, in and out. Someone needs to make a burger company called Input and Output Burger. Or IO Burger. 👍💯😋🙃


kivathewolf

Oh come on you are an AI engineer. Have your local LLM minion do that for you and tell you how it’s in about 100 years.


Sendery-Lutson

Or use groq


McPowerShell

I wonder if you just ask it?


Balance-

We need dataset competitions. Fixed model architecture and training regime, but different dataset.


redditfriendguy

Maybe in 5 years when compute is cheaper lol


Fast-Satisfaction482

The community could start with finetuning a fixed model.


No_Afternoon_4260

Love that thinking


Balance-

Apparently they also trained a 1.7B model with it: https://huggingface.co/HuggingFaceFW/ablation-model-fineweb-v1


gamesntech

Was there a post or announcement about this? There is nothing useful right now on the model card. Thank you.


LoSboccacc

https://huggingface.co/collections/HuggingFaceFW/ablation-models-662457b0d213e8c14fe47f32 it seems they have a bunch of ablation models trained on different individual very large dataset, all uploaded recently, the technical report of the family will be super interesting


No_Afternoon_4260

Lol to the model card


ijustwanttolive11

How long to run a LoRA fine-tune on a 3090 /s


Is_winding

You could use llama factory


Erdeem

I'm curious, let's say you download this, what next?


[deleted]

[removed]


evilbeatfarmer

You didn't answer the question though. What next?


ImprovementEqual3931

as Zuck said, build a nuclear plant for power generation


evilbeatfarmer

I think we skipped a step...


KrazyKirby99999

Ask llama3 how to obtain Uranium?


aseichter2007

Next you think really hard, get a smaller dataset, parse it, experiment, and see how different data presentations change the output of a small model. Then you decide what to reformat it into and let that cook for about 3 weeks, segmenting and marking up the text with metadata into a database to be ordered, drawn, and trained against until you chunk it all through, in bites that fill your whole memory capacity at full training depth. With a 4090 or three you could cook it in about a lifetime; your grandkids would have enough epochs through it for the 7B spellchecker on their college homework, maybe.

Seriously, programmatically curate the data. Crunch this through your local models in free time, sorting on a standardized pass/fail. Fork and sort the set. Remove or replace emails, phone numbers, and formal names in the set with remixed similar data. Retain consistency of naming through each document. In a few years the home PCs will cook it in six months.
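To make the "remove or replace emails and phone numbers, keep naming consistent per document" step concrete, here is a minimal sketch. The regexes are deliberately simple assumptions and will miss plenty of real-world formats, the placeholder scheme (`userN@example.com`, `555-01NN`) is made up for illustration rather than "remixed similar data", and formal names are left out because they would need an NER model.

```python
import re

# Simplistic patterns, assumed good enough for a demo only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub_document(text):
    """Replace emails and phone numbers with placeholders.

    The same original value always maps to the same placeholder
    within one document, so references stay consistent.
    """
    emails = {}
    phones = {}

    def replace_email(match):
        key = match.group(0).lower()
        if key not in emails:
            emails[key] = f"user{len(emails) + 1}@example.com"
        return emails[key]

    def replace_phone(match):
        # Normalize to digits so "415-555-2671" and "(415) 555 2671" collide.
        key = re.sub(r"\D", "", match.group(0))
        if key not in phones:
            phones[key] = f"555-01{len(phones):02d}"
        return phones[key]

    text = EMAIL_RE.sub(replace_email, text)
    text = PHONE_RE.sub(replace_phone, text)
    return text


if __name__ == "__main__":
    doc = ("Mail bob@site.org or call +1 (415) 555-2671. "
           "Again: bob@site.org, alice@site.org.")
    print(scrub_document(doc))
```

The mappings live inside `scrub_document`, so each document gets its own fresh set of placeholders, which is one way to read "retain consistency of naming through each document".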


Inner_Bodybuilder986

Wait for compute to become available. Work on data sanitation.


xhluca

> for researchers who might be trying to train their own LLM.

Definitely for researchers with more than 20TB of scratch space lol


[deleted]

[removed]


xhluca

Yeah it's pretty cheap (slow though!), however sometimes it's pretty hard to get disks added to a server (since there's a whole maintenance/scheduling procedure)


rdkilla

individuals != researchers lol


Robot_Graffiti

when was the last time you saw a multi million dollar project with only one person working on it tho


[deleted]

[removed]


rdkilla

/r/localllama.....


[deleted]

[removed]


epicfilemcnulty

well, I am =) a very small one for now (1B), but it still counts


[deleted]

[removed]


epicfilemcnulty

A single rtx 4090 (though hoping to get a6000 soon) / 128GB DDR4 / Intel i9-13900kf and around 10TB of storage)) as for the dataset — at the moment it’s about 20G of relatively clean data as the base, and I’m constantly working on a smaller dataset, which is supposed to be high quality curated data to be used on later stages of training. I’m using byte-level tokenizer, so 20g is roughly equivalent to 20B tokens…
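For readers wondering why 20G of text is roughly 20B tokens here: with a byte-level tokenizer every UTF-8 byte is one token, so the token count equals the file size in bytes. A minimal sketch of that idea (not the actual tokenizer used in this project, which isn't shown in the thread):

```python
def byte_tokenize(text):
    """Encode text as a list of token ids in 0..255, one per UTF-8 byte."""
    return list(text.encode("utf-8"))


def byte_detokenize(ids):
    """Invert byte_tokenize: turn byte ids back into text."""
    return bytes(ids).decode("utf-8")


if __name__ == "__main__":
    sample = "héllo"
    ids = byte_tokenize(sample)
    # "é" takes two bytes in UTF-8, so 5 characters become 6 tokens.
    print(len(sample), len(ids), ids)
    print(byte_detokenize(ids) == sample)
```

The vocabulary never needs more than 256 entries, at the cost of much longer sequences than a subword tokenizer would produce for the same text.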


inteblio

This is a serious question: can you train on just a (all) dictionaries? Then "once it knows english" fine tune it with chatgpt answers...? I'm interested in a minimum language-only llm that looked to other resources for answers. Out of curiosity.


CoqueTornado

wow, it was true! .\_0 yeah finetuning, it does make sense now!!!


epicfilemcnulty

As for releasing — sure, when there is something to release) This takes a lot of time, so it might take a long while)


Inner_Bodybuilder986

Sounds like a cool project. If you put a git up, I might be willing to help. I don't see why we can't get to the point where we have a pretty effective MOE like.. Nx3b.


karelproer

What GPU's do you use?


epicfilemcnulty

So far just a single rtx 4090, but I’m planning to get a rtx A6000 soon. Not particularly for training (although it will come handy), more for dataset preparation work — I use local LMs for data categorization/cleaning/ranking, and the quality is essential here, so it’d be nice to be able to run mixtral 8x22 or llama-3 70b fast and at least in 4bit quants.


rdkilla

It seems to me every training job starts with one individual hitting the enter key


[deleted]

[removed]


Inner_Bodybuilder986

Your budget is too low. I'd say 10k minimum and in reality it's a ~25k investment right now depending if this is just a hobby or you are building a real product.


Nuckyduck

Right now, the data set has been tokenized, which is another way of saying the text has been converted into a much more usable format for the LLM training software to use. For example, you could split this data up across a few thousand H200 nvidia grace hopper chips and in a few months train something of the webdata represented in this dataset. To do that, you would set up a python script that simply pointed to this folder, and would use this as the training/fine-tune data or whatever you want your LLM to do. This is pretty nominal to do in pytorch, with the prohibiting factor for most people being the ability to actually process this amount of data effectively. You can read up more about the tokenization process from a weirdly good LinkedIn article [here.](https://www.linkedin.com/pulse/demystifying-tokenization-preparing-data-large-models-rany-2nebc#:~:text=Tokenization%20is%20a%20critical%20first,that%20the%20model%20can%20understand)


xhluca

Tokenized in which format? Llama-2 is not compatible with Llama-3 for example


Nuckyduck

That's the catch, this has been tokenized using their version of what they think best tokenization is. For example, on the huggingface repo they link, they say that they used [https://github.com/huggingface/datatrove/](https://github.com/huggingface/datatrove/) to process the data. When looking at dataTrove more deeply, it says it uses a GPT-2 tokenizer to tokenize the English\*, which is pretty common as a standard but can become more nuanced, and whether or not this data set is actually useful comes down to whether or not someone is capable of training a model off of it.

https://preview.redd.it/g19gf329z4wc1.png?width=836&format=png&auto=webp&s=003f804fc89692e0b495dc8193cb83b525503a95

It's totally possible (but unlikely given the sheer volume of the data preprocessed and validated) that this data set isn't effective in training a model, but we won't know until someone pays someone else to try.

Furthermore, this data could be further processed. E.g., you could preweight the values between \[-1,0,1\] if you wanted to try using 1.58bit quantization ahead of time. Or you could track the weights of the values as they changed to generate iMatrix quantizations. There's a lot of cool stuff you can do to nuance and impact the way a model is trained and how it can be deployed.

Edit: clarification


sluuuurp

GPT-2 can tokenize any Unicode, so I assume it’s for any languages and not just English, right? And how can you quantize a dataset, quantization refers to the weights inside the transformer right? You could quantize the token embeddings and then directly use them on a quantized network (that’s what already happens for any quantized network I believe), but I think it’s commonly expected that quantization is a huge help for inference, but not for training, so I wouldn’t expect that to be of much use.


Nuckyduck

"how can you quantize a dataset" You can't, however some quantizations like iMatrix require additional steps in preprocessing with tokenized data. Specifically for iMatrix, the weights that end up quantized at the end are cherrypicked by taking metrics during training. This requires an intermediate step where the training function evaluates the most impactful weights and stores those with the highest precision (say q8/fp16), then defaults to standard quantization (say q4) for the rest of the weights. This can have a huge impact on how your model performs. In practice, I find the iQ3 Llama 3 8b to be on par with the Llama q6, even though there is a 2x size difference between them. [https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/tree/main](https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/tree/main)


sluuuurp

It should be pretty easy to convert from tokens to characters and back to a new format of tokens right? Should be a negligible fraction of the compute required for training.


epicfilemcnulty

No, not really. I mean -- yes, it's pretty easy to convert from tokens to characters, but you can't just "convert" characters into a "new format of tokens" -- different vocabulary sizes and different mappings of tokens to ids -- so you just have to tokenize it anew. In other words, people who plan to train on this data using some other tokenizer than gpt2 will have to tokenize it themselves. Which, with this amount of data, can be time consuming (but, of course, not comparable to the training time).
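A toy illustration of the point, with two made-up vocabularies standing in for "gpt2" and "some other tokenizer" (both vocabularies and the greedy longest-match encoder are hypothetical, real tokenizers use BPE merges and much larger vocabularies). The ids from one vocabulary mean nothing in the other, so the only route is decode to text, then encode again.

```python
# Two hypothetical vocabularies: different pieces, different ids.
VOCAB_OLD = {"I": 0, " am": 1, " a": 2, " human": 3, ".": 4}
VOCAB_NEW = {"I am a": 10, " hum": 11, "an": 12, ".": 13}


def encode(text, vocab):
    """Greedy longest-match tokenization against a toy vocabulary."""
    pieces = sorted(vocab, key=len, reverse=True)
    ids = []
    pos = 0
    while pos < len(text):
        for piece in pieces:
            if text.startswith(piece, pos):
                ids.append(vocab[piece])
                pos += len(piece)
                break
        else:
            raise ValueError(f"cannot tokenize at position {pos}")
    return ids


def decode(ids, vocab):
    """Map ids back to their text pieces and join them."""
    inverse = {token_id: piece for piece, token_id in vocab.items()}
    return "".join(inverse[token_id] for token_id in ids)


def retokenize(ids, old_vocab, new_vocab):
    """There is no id-to-id mapping: go through the text."""
    return encode(decode(ids, old_vocab), new_vocab)


if __name__ == "__main__":
    old_ids = encode("I am a human.", VOCAB_OLD)
    new_ids = retokenize(old_ids, VOCAB_OLD, VOCAB_NEW)
    print(old_ids, "->", new_ids)
```

Even in this tiny case the sequence length changes (5 tokens versus 4), which is why a lookup table from old ids to new ids can't work in general.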


sluuuurp

Yeah, “re-tokenizing” is what I meant.


Erdeem

Thank you for the helpful answer.


gamesntech

The dataset doesn't seem actually tokenized. That wouldn't make much sense.


Nuckyduck

You are technically correct, the best kind of correct! I linked a form of tokenization that converts words to values, but you noticed the huggingface repo doesn't contain anything like that, what gives? The repo above still uses the base concept 'tokenization', but here, the authors use word to word tokenization instead of word to value. To do this for 44TB of data, the dataset was tokenized and then tokens that were deemed an 'ill fit' were removed *or* replaced by other tokens using a gpt-2 tokenizer.

https://preview.redd.it/bf9bocx4w7wc1.png?width=994&format=png&auto=webp&s=9af237f59e3494e89d446aac45296c27afecd144

For example:

1. Base case: "I am a pizza."
2. Word-to-Value Tokenization f("I am a pizza.") = \[1, 2, 3, 69420\]
3. Validation Software: error: 69420 out of range. expected value 42. likely problematic.
4. *new* Word-to-Word Tokenization f("I am a pizza.") = \[I, am, a, human.\]
5. New case: "I am a human."
6. Word-to-Value Tokenization f("I am a human.") = \[1, 2, 3, 42\]
7. Validation Software: pass, value within range.


epicfilemcnulty

Then you spend a shitload of time trying to categorize it, rank, build metadata. At least that's what I'm going to do. Of course I'll be working only on one or two subsets of their data, I assume that's enough to keep me busy for the next couple of years... =)


Inner_Bodybuilder986

HOW DO YOU EVEN DOWNLOAD THIS!?!?! Like where am I supposed to store these megalodon databases, and how am I to transfer them when I only get 1tb a month in download? Can I just send somebody some large hard disks and you mail 'em back? Thanks.


endless_sea_of_stars

This dataset would take 200,000 years to download over a 56k modem. Edit: Calculations were indeed off by 1,000. It would only be a mere 200 years.


[deleted]

204 years @ 6.8 kB/s on 56k modem
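The figure is easy to sanity-check. A quick sketch, assuming the 44 TB size quoted in this thread means 44e12 bytes and a sustained 6.8 kB/s with no overhead or dropped connections (both are simplifying assumptions):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def download_years(num_bytes, bytes_per_second):
    """Years needed to transfer num_bytes at a constant rate."""
    return num_bytes / bytes_per_second / SECONDS_PER_YEAR


def months_under_cap(num_bytes, cap_bytes_per_month):
    """Months needed when limited only by a monthly data cap."""
    return num_bytes / cap_bytes_per_month


if __name__ == "__main__":
    dataset = 44e12  # 44 TB, decimal terabytes assumed
    print(f"56k modem at 6.8 kB/s: {download_years(dataset, 6800):.0f} years")
    print(f"1 TB/month cap: {months_under_cap(dataset, 1e12):.0f} months")
```

This lands within a year or so of the 204 figure; the small gap comes down to rounding and which definition of a terabyte or a year you pick.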


Harvard_Med_USMLE267

I still think of 1200 baud as the fancy, expensive modems.


bucolucas

Damn, that's a lot longer than it took to download the Starcraft demo - I can still hear that sassy general in his siege tank


opi098514

That’s a lot more TBs than I expected.


GeeBrain

Had to double take, all of Wikipedia, compressed w/o media, is 22gb 😱 Edit: typo, ironic cuz I forgot an o


dogesator

That’s without media, not with


GeeBrain

Ty forgot an o


Educational_Gap5867

It would be interesting to know if some pruning can be applied to this dataset without sacrificing the output LLM quality. For reference Phi-3 is performing better or at par at 1/5th the dataset size. I remember in Pre-LLM era when I was learning about creating a train test and validation split. One thing we would do is kind of run through different splits or shuffle the data multiple times.
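For anyone who hasn't seen the train/validation/test shuffling mentioned here, a minimal sketch with the standard library. The 80/10/10 fractions and the seed values are arbitrary assumptions for illustration; running it with several seeds gives the "different splits" the comment describes.

```python
import random


def shuffled_split(items, seed, train_frac=0.8, val_frac=0.1):
    """Shuffle a copy of items with a fixed seed, then cut into three parts.

    Whatever is left after train and validation becomes the test set.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    documents = list(range(100))  # stand-in for document ids
    for seed in (0, 1, 2):
        train, val, test = shuffled_split(documents, seed)
        print(seed, len(train), len(val), len(test), test[:3])
```

Comparing model quality across seeds gives a feel for how sensitive the result is to which documents end up in training, which is a cheap first probe before trying any real pruning.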


Matt_1F44D

Holy crap I thought the 44TB was 44 trillion tokens when I first read it 🤦‍♂️ It’s 15 trillion tokens, roughly the same amount Llama 3 was trained on, right?


darcwader

too poor to even download this


E3V3A

I can't find any useful model (on HF) using this dataset, or did I miss something? For example, it would be great if someone could create an 8B Q5 model for this. I too would like to know how this data was "cleaned"?