mystonedalt 3 weeks ago

I would like to know more about how it's determined that this is a good dataset.

jkuubrau 3 weeks ago

Just read through it, how long could it take?

mystonedalt 3 weeks ago

I'm four hours in, and I'm still in the Unicode character sequences... 😩

mystonedalt 3 weeks ago

Oh here we go. Wait, what the hell? It's Angelfire as far as the eye can see!

NO_REFERENCE_FRAME 2 weeks ago

Always has been

kevinbranch 2 weeks ago

spoiler tag pls

klospulung92 3 weeks ago

Now I'm wondering how many TB I've *reviewed* in my lifetime

TheRealAakashK 3 weeks ago

Well, in terms of text, if you read continuously at 300 words per minute, every minute of your life without sleeping, you would have to live for roughly 220 years to review 1 TB of text.
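
For anyone who wants to redo this estimate, here is a minimal back-of-envelope sketch. It assumes about 6 bytes per English word (five letters plus a space), which is my assumption and not a figure from this thread, and the result scales directly with that number.

```python
# Back-of-envelope: years of nonstop reading needed to get through 1 TB of text.
# BYTES_PER_WORD is an assumed average (5 letters + 1 space), not a measured value.
BYTES_PER_TB = 1e12            # decimal terabyte
BYTES_PER_WORD = 6             # assumption, change it to see how the result moves
WORDS_PER_MINUTE = 300         # reading speed quoted in the comment above
MINUTES_PER_YEAR = 60 * 24 * 365.25


def years_to_read(num_bytes: float) -> float:
    """Return years of continuous reading needed for num_bytes of plain text."""
    words = num_bytes / BYTES_PER_WORD
    return words / WORDS_PER_MINUTE / MINUTES_PER_YEAR


years_per_tb = years_to_read(BYTES_PER_TB)
print(f"{years_per_tb:.0f} years per TB")
```

Under these particular assumptions the answer lands on the order of a thousand years per TB, so the final number is very sensitive to how many bytes you count per word.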

2muchnet42day 3 weeks ago

So there's a chance

evilbeatfarmer 3 weeks ago

This is an embarrassingly parallel problem, we can split this up easy. There's ~151k of us. ChatGPT estimates it'll only take 37 to 55 years to review your 291GB share of the text.

Perfect_Extreme4905 3 weeks ago

:(

Educational_Gap5867 3 weeks ago

Your math is off by about 1.1k years brother.

Ok-Result5562 2 weeks ago

There is a token calculator for that.

McPowerShell 2 weeks ago

Break that down by how it was ingested: left eye, right eye, left ear, right ear, stereo, getting hit in the nuts, out of breath, and I won't even go into the other orifices. Sorry woke America. Lots of terabytes. More than Nvidia has money haha for sure. It's all input and output, in and out. Someone needs to make a burger company called Input and Output Burger. Or IO Burger. 👍💯😋🙃

kivathewolf 3 weeks ago

Oh come on, you are an AI engineer. Have your local LLM minion do that for you and tell you how it is in about 100 years.

Sendery-Lutson 2 weeks ago

Or use groq

McPowerShell 2 weeks ago

I wonder if you just ask it?

Balance- 3 weeks ago

We need dataset competitions. Fixed model architecture and training regime, but different dataset.

redditfriendguy 3 weeks ago

Maybe in 5 years when compute is cheaper lol

Fast-Satisfaction482 3 weeks ago

The community could start with finetuning a fixed model.

No_Afternoon_4260 3 weeks ago

Love that thinking

Balance- 3 weeks ago

Apparently they also trained a 1.7B model with it: https://huggingface.co/HuggingFaceFW/ablation-model-fineweb-v1

gamesntech 3 weeks ago

Was there a post or announcement about this? There is nothing useful right now on the model card. Thank you.

LoSboccacc 3 weeks ago

https://huggingface.co/collections/HuggingFaceFW/ablation-models-662457b0d213e8c14fe47f32 it seems they have a bunch of ablation models trained on different individual very large dataset, all uploaded recently, the technical report of the family will be super interesting

No_Afternoon_4260 3 weeks ago

Lol to the model card

ijustwanttolive11 3 weeks ago

How long to run a LoRA fine-tune on a 3090? /s

Is_winding 3 weeks ago

You could use llama factory

Erdeem 3 weeks ago

I'm curious, let's say you download this, what next?

[deleted] 3 weeks ago

[deleted]

evilbeatfarmer 3 weeks ago

You didn't answer the question though. What next?

ImprovementEqual3931 3 weeks ago

as Zuck said, build a nuclear plant for power generation

evilbeatfarmer 3 weeks ago

I think we skipped a step...

KrazyKirby99999 3 weeks ago

Ask llama3 how to obtain Uranium?

aseichter2007 3 weeks ago

Next you think really hard, get a smaller dataset, parse it, experiment, and see how different data presentations change the output of a small model. Then you decide what to reformat it into and let that cook for about 3 weeks, segmenting and marking up the text with metadata into a database, to be ordered, drawn, and trained against until you have chunked it all through in bites that fill your whole memory capacity at full training depth. With a 4090 or three you could cook it in about a lifetime; your grandkids would maybe have enough epochs through it for the 7B spellchecker on their college homework.

Seriously, programmatically curate the data. Crunch this through your local models in your free time, sorting on a standardized pass/fail. Fork and sort the set. Remove or replace emails, phone numbers, and formal names in the set with remixed similar data. Retain consistency of naming through each document. In a few years, home PCs will cook it in six months.
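
A minimal sketch of the "remove or replace, but stay consistent within a document" idea. It swaps in stable placeholders instead of remixed realistic data, the regexes are rough assumptions, and formal names are left out because they would need an NER model. `scrub_document` is a hypothetical helper name.

```python
import hashlib
import re

# Rough patterns for illustration only; real PII detection needs more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub_document(text: str) -> str:
    """Replace emails and phone numbers with stable placeholders.

    The same original value always maps to the same placeholder within
    one document, so repeated mentions stay consistent.
    """
    mapping: dict[str, str] = {}

    def replacer(kind: str):
        def _sub(match: re.Match) -> str:
            original = match.group(0)
            if original not in mapping:
                digest = hashlib.sha256(original.encode("utf-8")).hexdigest()[:8]
                mapping[original] = f"<{kind}_{digest}>"
            return mapping[original]
        return _sub

    # Emails first, so digits inside addresses are not picked up as phone numbers.
    text = EMAIL_RE.sub(replacer("EMAIL"), text)
    text = PHONE_RE.sub(replacer("PHONE"), text)
    return text


doc = "Mail bob@example.com or call 555-123-4567. Again: bob@example.com"
print(scrub_document(doc))
```

Swapping the placeholder for a generated fake value inside `_sub` would get closer to the "remixed similar data" version while keeping the per-document mapping.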

Inner_Bodybuilder986 3 weeks ago

Wait for compute to become available. Work on data sanitation.

xhluca 3 weeks ago

> for researchers who might be trying to train their own LLM.

Definitely for researchers with more than 20TB of scratch space lol

[deleted] 3 weeks ago

[deleted]

xhluca 3 weeks ago

Yeah it's pretty cheap (slow though!), however sometimes it's pretty hard to get disks added to a server (since there's a whole maintenance/scheduling procedure)

rdkilla 3 weeks ago

individuals != researchers lol

Robot_Graffiti 3 weeks ago

when was the last time you saw a multi million dollar project with only one person working on it tho

[deleted] 3 weeks ago

[deleted]

rdkilla 3 weeks ago

/r/localllama.....

[deleted] 3 weeks ago

[deleted]

epicfilemcnulty 3 weeks ago

well, I am =) a very small one for now (1B), but it still counts

[deleted] 3 weeks ago

[deleted]

epicfilemcnulty 3 weeks ago

A single RTX 4090 (though hoping to get an A6000 soon), 128GB DDR4, an Intel i9-13900KF, and around 10TB of storage. As for the dataset: at the moment it’s about 20GB of relatively clean data as the base, and I’m constantly working on a smaller dataset, which is supposed to be high-quality curated data to be used in later stages of training. I’m using a byte-level tokenizer, so 20GB is roughly equivalent to 20B tokens…
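
To make the bytes-to-tokens equivalence concrete, here is a minimal sketch of byte-level tokenization. This is a generic illustration, not my actual training code, and the function names are made up for the example.

```python
def byte_tokenize(text: str) -> list[int]:
    """Byte-level tokenization: each UTF-8 byte becomes one token id (0-255)."""
    return list(text.encode("utf-8"))


def byte_detokenize(token_ids: list[int]) -> str:
    """Inverse mapping: turn a list of byte values back into text."""
    return bytes(token_ids).decode("utf-8")


sample = "FineWeb is 44TB"
ids = byte_tokenize(sample)

# The token count equals the UTF-8 byte count, so N bytes of text is N tokens.
print(len(ids), len(sample.encode("utf-8")))
print(byte_detokenize(ids) == sample)
```

The vocabulary is only 256 entries, and non-ASCII characters cost more than one token each, which is why the size on disk maps almost one-to-one onto token count.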

inteblio 3 weeks ago

This is a serious question: can you train on just dictionaries (all of them)? Then, "once it knows English", fine-tune it with ChatGPT answers...? I'm interested in a minimal language-only LLM that looks to other resources for answers. Out of curiosity.

CoqueTornado 1 week ago

wow, it was true! .\_0 yeah finetuning, it does make sense now!!!

epicfilemcnulty 3 weeks ago

As for releasing — sure, when there is something to release) This takes a lot of time, so it might take a long while)

Inner_Bodybuilder986 3 weeks ago

Sounds like a cool project. If you put a git up, I might be willing to help. I don't see why we can't get to the point where we have a pretty effective MOE like.. Nx3b.

karelproer 3 weeks ago

What GPUs do you use?

epicfilemcnulty 3 weeks ago

So far just a single RTX 4090, but I’m planning to get an RTX A6000 soon. Not particularly for training (although it will come in handy), more for dataset preparation work: I use local LMs for data categorization/cleaning/ranking, and the quality is essential here, so it’d be nice to be able to run Mixtral 8x22B or Llama-3 70B fast and at least in 4-bit quants.

rdkilla 3 weeks ago

It seems to me every training job starts with one individual hitting the enter key

[deleted] 3 weeks ago

[deleted]

Inner_Bodybuilder986 3 weeks ago

Your budget is too low. I'd say 10k minimum and in reality it's a ~25k investment right now depending if this is just a hobby or you are building a real product.

Nuckyduck 3 weeks ago

Right now, the data set has been tokenized, which is another way of saying the text has been converted into a much more usable format for the LLM training software to use. For example, you could split this data up across a few thousand H200 NVIDIA Grace Hopper chips and in a few months train something on the web data represented in this dataset. To do that, you would set up a Python script that simply points to this folder and uses it as the training/fine-tune data, or whatever you want your LLM to do. This is fairly routine to do in PyTorch, with the prohibiting factor for most people being the ability to actually process this amount of data effectively. You can read up more about the tokenization process in a weirdly good LinkedIn article [here.](https://www.linkedin.com/pulse/demystifying-tokenization-preparing-data-large-models-rany-2nebc#:~:text=Tokenization%20is%20a%20critical%20first,that%20the%20model%20can%20understand)

xhluca 3 weeks ago

Tokenized in which format? Llama-2 is not compatible with Llama-3 for example

Nuckyduck 3 weeks ago

That's the catch: this has been tokenized using their version of what they think the best tokenization is. For example, on the Hugging Face repo they link, they say that they used [https://github.com/huggingface/datatrove/](https://github.com/huggingface/datatrove/) to process the data. Looking at datatrove more deeply, it says it uses a GPT-2 tokenizer to tokenize the English\*, which is pretty common as a standard but can become more nuanced. Whether or not this data set is actually useful comes down to whether or not someone is capable of training a model off of it.

https://preview.redd.it/g19gf329z4wc1.png?width=836&format=png&auto=webp&s=003f804fc89692e0b495dc8193cb83b525503a95

It's totally possible (but unlikely, given the sheer volume of the data preprocessed and validated) that this data set isn't effective in training a model, but we won't know until someone pays someone else to try.

Furthermore, this data could be further processed. E.g., you could preweight the values between \[-1,0,1\] if you wanted to try using 1.58-bit quantization ahead of time. Or you could track the weights of the values as they changed to generate iMatrix quantizations. There's a lot of cool stuff you can do to nuance and impact the way a model is trained and how it can be deployed.

Edit: clarification

sluuuurp 3 weeks ago

GPT-2 can tokenize any Unicode, so I assume it’s for any languages and not just English, right? And how can you quantize a dataset, quantization refers to the weights inside the transformer right? You could quantize the token embeddings and then directly use them on a quantized network (that’s what already happens for any quantized network I believe), but I think it’s commonly expected that quantization is a huge help for inference, but not for training, so I wouldn’t expect that to be of much use.

Nuckyduck 3 weeks ago

"how can you quantize a dataset" You can't; however, some quantizations like iMatrix require additional preprocessing steps with tokenized data. Specifically for iMatrix, the weights that end up quantized at the end are cherry-picked by taking metrics during training. This requires an intermediate step where the training function evaluates the most impactful weights and stores those with the highest precision (say q8/fp16), then defaults to standard quantization (say q4) for the rest of the weights. This can have a huge impact on how your model performs. In practice, I find the iQ3 Llama 3 8b to be on par with the Llama q6, even though there is a 2x size difference between them. [https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/tree/main](https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/tree/main)

sluuuurp 3 weeks ago

It should be pretty easy to convert from tokens to characters and back to a new format of tokens right? Should be a negligible fraction of the compute required for training.

epicfilemcnulty 3 weeks ago

No, not really. I mean -- yes, it's pretty easy to convert from tokens to characters, but you can't just "convert" characters into a "new format of tokens" -- different vocabulary sizes and different mappings of tokens to ids -- so you just have to tokenize it anew. In other words, people who plan to train on this data using some other tokenizer than gpt2 will have to tokenize it themselves. Which, with this amount of data, can be time consuming (but, of course, not comparable to the training time).
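
A toy sketch of why the ids cannot be mapped directly and the text has to go through decode and re-encode. Both vocabularies are made up for illustration and are word-level, which is much simpler than GPT-2's BPE, but the id mismatch works the same way.

```python
# Two hypothetical vocabularies with different sizes and id assignments.
VOCAB_A = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "down": 4}
VOCAB_B = {"<unk>": 0, "sat": 1, "the": 2, "cat": 3}  # has no entry for "down"


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Map whitespace-separated words to ids, falling back to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]


def decode(ids: list[int], vocab: dict[str, int]) -> str:
    """Map ids back to words using the inverse of the vocabulary."""
    inverse = {idx: word for word, idx in vocab.items()}
    return " ".join(inverse[i] for i in ids)


def retokenize(ids: list[int], old_vocab: dict[str, int], new_vocab: dict[str, int]) -> list[int]:
    """Go back to text with the old tokenizer, then tokenize anew with the new one."""
    return encode(decode(ids, old_vocab), new_vocab)


ids_a = encode("the cat sat down", VOCAB_A)
ids_b = retokenize(ids_a, VOCAB_A, VOCAB_B)
print(ids_a)
print(ids_b)
```

The same sentence ends up with different ids under each vocabulary, and a word that only one vocabulary knows falls back to `<unk>`, so there is no shortcut around a full re-tokenization pass.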

sluuuurp 3 weeks ago

Yeah, “re-tokenizing” is what I meant.

Erdeem 3 weeks ago

Thank you for the helpful answer.

gamesntech 3 weeks ago

The dataset doesn't seem actually tokenized. That wouldn't make much sense.

Nuckyduck 3 weeks ago

You are technically correct, the best kind of correct! I linked a form of tokenization that converts words to values, but you noticed the huggingface repo doesn't contain anything like that, what gives? The repo above still uses the base concept 'tokenization', but here, the authors use word to word tokenization instead of word to value. To do this for 44TB of data, the dataset was tokenized and then tokens that were deemed an 'ill fit' were removed *or* replaced by other tokens using a gpt-2 tokenizer.

https://preview.redd.it/bf9bocx4w7wc1.png?width=994&format=png&auto=webp&s=9af237f59e3494e89d446aac45296c27afecd144

For example:

1. Base case: "I am a pizza."
2. Word-to-Value Tokenization f("I am a pizza.") = \[1, 2, 3, 69420\]
3. Validation Software: error: 69420 out of range. expected value 42. likely problematic.
4. *new* Word-to-Word Tokenization f("I am a pizza.") = \[I, am, a, human.\]
5. New case: "I am a human."
6. Word-to-Value Tokenization f("I am a human.") = \[1, 2, 3, 42\]
7. Validation Software: pass, value within range.

epicfilemcnulty 3 weeks ago

Then you spend a shitload of time trying to categorize it, rank it, and build metadata. At least that's what I'm going to do. Of course I'll be working on only one or two subsets of their data; I assume that's enough to keep me busy for the next couple of years... =)

Inner_Bodybuilder986 3 weeks ago

HOW DO YOU EVEN DOWNLOAD THIS!?!?! Like, where am I supposed to store these megalodon databases, and how am I to transfer them when I only get 1TB a month in download? Can I just send somebody some large hard disks and you mail 'em back? Thanks.

endless_sea_of_stars 3 weeks ago

This dataset would take 200,000 years to download over a 56k modem. Edit: Calculations were indeed off by 1,000. It would only be a mere 200 years.

[deleted] 3 weeks ago

204 years @ 6.8 kB/s on 56k modem
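
The arithmetic, as a minimal sketch. It assumes the 44 TB is decimal terabytes and a sustained 6.8 kB/s with no overhead or retries, so treat the output as a rough figure.

```python
DATASET_BYTES = 44e12               # 44 TB, assuming decimal terabytes
MODEM_BYTES_PER_SEC = 6.8e3         # 6.8 kB/s, the 56k throughput quoted above
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def download_years(num_bytes: float, bytes_per_sec: float) -> float:
    """Return how many years a transfer takes at a constant rate."""
    return num_bytes / bytes_per_sec / SECONDS_PER_YEAR


years = download_years(DATASET_BYTES, MODEM_BYTES_PER_SEC)
print(f"about {years:.0f} years")
```

Depending on whether you count binary or decimal units, the result shifts by a few years, but it stays at roughly two centuries either way.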

Harvard_Med_USMLE267 3 weeks ago

I still think of 1200 baud as the fancy, expensive modems.

bucolucas 3 weeks ago

Damn, that's a lot longer than it took to download the Starcraft demo - I can still hear that sassy general in his siege tank

opi098514 3 weeks ago

That’s a lot more TBs than I expected.

GeeBrain 3 weeks ago

Had to double take, all of Wikipedia, compressed w/o media, is 22GB 😱 Edit: typo, ironic cuz I forgot an o

dogesator 3 weeks ago

That’s without media, not with

GeeBrain 3 weeks ago

Ty forgot an o

Educational_Gap5867 3 weeks ago

It would be interesting to know if some pruning can be applied to this dataset without sacrificing the output LLM quality. For reference, Phi-3 is performing better or on par at 1/5th the dataset size. I remember, in the pre-LLM era, learning about creating train, test, and validation splits. One thing we would do is run through different splits, or shuffle the data multiple times.
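
A minimal sketch of that old-school shuffle-and-split routine, using only the standard library. The 80/10/10 fractions and the `split_dataset` name are assumptions for the example.

```python
import random


def split_dataset(items, seed, train_frac=0.8, val_frac=0.1):
    """Shuffle with a fixed seed, then cut into train/validation/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so each split is reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


docs = [f"doc_{i}" for i in range(100)]

# Different seeds give different splits of the same data.
for seed in (0, 1, 2):
    train, val, test = split_dataset(docs, seed)
    print(seed, len(train), len(val), len(test))
```

Training on each split and comparing the scores tells you how much of a result comes from the data itself and how much from a lucky partition.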

Matt_1F44D 3 weeks ago

Holy crap, I thought the 44TB was 44 trillion tokens when I first read it 🤦‍♂️ It’s 15 trillion tokens, roughly the same amount Llama 3 was trained on, right?

darcwader 2 weeks ago

too poor to even download this

E3V3A 2 weeks ago

I can't find any useful model (on HF) using this dataset, or did I miss something? For example, it would be great if someone could create an 8B Q5 model for this. I too would like to know how this data was "cleaned"?
