Oh this is super cool! I think maybe the new RoPE scaling method Llama 3.1 uses isn't added in yet? It's some weird one-time scaling mechanism found by a grid search to enable 128K context. Essentially the model was trained on 15.6T tokens at 8K context, then iteratively extended to 128K context with 800B tokens.
Oh thanks bro, nope, it uses the simple Llama 2 RoPE with theta changed to 500k to match Llama 3's. I'll check your Python PR, take a deeper look at the Meta Llama 3 & 3.1 implementations, and hack together something soonish. Awesome!
To be honest, your PR and these notes are super helpful, cos otherwise I'd have been too lazy to read up on the original implementation, but I can't merge it. Will make the fix soonish and credit you. I'll tell you a secret: nobody is good at C except a few wizards like Woz, Linus, jart, Knuth, Bellard and the like. I'm a total loser at C too. I just love C. Also, the base of this is mostly Karpathy's work and that of other awesome folks who are way better at everything than me.
Meta's Llama 3.1 models can output multilingual text, which is awesome. Here are some example outputs from the 8-bit quantized 8B model with a 100-token limit (-n 100)... Quantization causes some brain damage.
English
My cat is funny. "Funny cat," I say, walking up to it. "What are you up to?" It sits up straight and looks at me with a tilted head, as if to say, "What's wrong with you?" Sometimes I just have to laugh at how funny a cat can be. So I say, "Okay, you're funny. I'll give you some treats." It stretches out a little and I give it some treats. It eats them up quickly and starts
German
Besitzen Sie einen Amiga 500? Wenn nicht, werden Sie wissen, dass dies ein Computer war, der im späten 1980er und frühen 1990er Jahren für Spiele verfügbar war, die für Personen mit bestimmten Körperverletzungen gedacht waren. Manchmal müssen wir uns an frühere Zeiten erinnern, die, wie wir jetzt wissen, schwierig waren. Hier ist ein Link, der meine Geschichte bespre
(Translation: "Do you own an Amiga 500? If not, you will know that this was a computer available in the late 1980s and early 1990s for games intended for people with certain bodily injuries. Sometimes we have to remember earlier times which, as we now know, were difficult. Here is a link that discusses my story")
French
Le vin français est, à bien des égards, un vin des origines, car il a joué un rôle important dans l'histoire de la France". La réputation des vins de France repose principalement sur leurs qualités gustatives et la gestion des vignobles contrôlée, ce qui rend le vin français un "produit d'exception". La France est donc leader mondial de la production de vin, avec 25 % des exportations mon
(Translation: "French wine is, in many respects, a wine of origins, for it has played an important role in the history of France". The reputation of French wines rests mainly on their taste qualities and controlled vineyard management, which makes French wine a "product of exception". France is thus the world leader in wine production, with 25% of world exports)
Hindi
मेरी बिल्ली बहुत मज़ाया है और वह हमेशा अपनी शारीरिक गतिविधियों से मुझे मजाक करती है। वास्तव में, जब वह अपनी खिलौनों की चपपेट में आती है तो वह विशेष रूप से क्लासिक बन जाती है। इसके अलावा, वह एक छोटी सी च
(Translation: "My cat is very funny and she always jokes with me through her physical activities. In fact, when she gets caught up in her toys she becomes especially classic. Besides, she a small")
> My cat is funny. "Funny cat," I say, walking up to it. "What are you up to?" It sits up straight and looks at me with a tilted head, as if to say, "What's wrong with you?" Sometimes I just have to laugh at how funny a cat can be. So I say, "Okay, you're funny. I'll give you some treats." It stretches out a little and I give it some treats. It eats them up quickly and starts
This is kind of like 3rd grade English. What would be required to go beyond that?
This is also how you'd communicate with a cat or dog. We [still] can't use complex long sentences in communication with them. And the LLM seems to be nicely reproducing that style and mindset of complete yet small sentence-bites you adopt when communicating with your Good Boy or your Funny Cat.
Actually, "My cat is funny" was the prompt; it continued from that. I've got to fix some stuff to reflect Meta's implementation and also fix the chat mode, then it would be usable. Will take a few days to do that.
There's actually an old paper titled Optimal Brain Damage, where they don't try to find optimal quantizations, but optimal sparse versions of models, i.e. where some weights are set to zero.
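To illustrate the idea of sparsifying a network: Optimal Brain Damage ranks weights by a second-derivative saliency estimate, but the crude magnitude-based stand-in below (an assumption for illustration, not the paper's method) shows the mechanics of setting low-importance weights to zero:

```c
#include <math.h>

/* Zero out every weight whose magnitude falls below a threshold.
 * Returns how many weights were pruned. OBD would rank weights by a
 * Hessian-based saliency instead of raw magnitude. */
int prune_by_magnitude(float *w, int n, float threshold) {
    int zeroed = 0;
    for (int i = 0; i < n; i++) {
        if (fabsf(w[i]) < threshold) {
            w[i] = 0.0f;
            zeroed++;
        }
    }
    return zeroed;
}
```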
That was really for them. You're out there building neat stuff. Your talents might warrant looking into AdderNets and BitNets, which might get the cost down. There are also some brain-inspired designs.
I don't think many people have implemented such things. You might discover something new by experimenting with them.
I think that's known as degradation, but I think the brain damage metric could be usefully applied to the poor sods who try to get the quantization to work in the first place.
As someone who has literally no idea of ModelOps / GenAI Deployment, what am I seeing there? Code that just loads in the weights and provides an inference API? Or what does this code actually do?
I think the non-English part is mostly hit and miss in this primitive version, probably because the implementation is not correct. I've got to read up a lot and fix it.
It's really interesting and a great piece of work into which you've clearly put a huge amount of effort.
You should definitely be proud of this.
Thank you :) The open source community is great; the core is a confluence of the awesome work of others. I'm just an orchestrator. I wish I had more time to make it into something like llama.cpp and an OS around it.
Can open a PR if people want :) [Edit: Just opened a PR! Apologies my C is very rusty! https://github.com/trholding/llama2.c/pull/14]
https://github.com/trholding/llama2.c/blob/master/runq.c#L65... needs to be scaled with some weird formula like in https://github.com/unslothai/unsloth/blob/main/unsloth/model...
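For reference, the "weird formula" is Meta's one-time frequency rescaling for Llama 3.1's 128K context. A minimal C sketch, assuming the commonly published parameters (scale factor 8, low/high frequency factors 1 and 4, original 8K context); verify the constants against the official repo before relying on them:

```c
#include <math.h>

/* Llama 3.1-style RoPE frequency scaling sketch. Frequencies whose
 * wavelength exceeds the original context are divided by the scale
 * factor, short wavelengths are untouched, and the band in between is
 * smoothly interpolated. */
float rope31_scale_freq(float freq) {
    const float PI = 3.14159265f;
    const float scale_factor = 8.0f;       /* "factor" in the config */
    const float low_freq_factor = 1.0f;
    const float high_freq_factor = 4.0f;
    const float old_context_len = 8192.0f; /* original 8K training context */

    const float low_freq_wavelen = old_context_len / low_freq_factor;
    const float high_freq_wavelen = old_context_len / high_freq_factor;
    float wavelen = 2.0f * PI / freq;

    if (wavelen < high_freq_wavelen) {
        return freq;                /* high frequencies: leave untouched */
    } else if (wavelen > low_freq_wavelen) {
        return freq / scale_factor; /* low frequencies: scale down */
    } else {
        /* mid band: blend between scaled and unscaled */
        float smooth = (old_context_len / wavelen - low_freq_factor)
                     / (high_freq_factor - low_freq_factor);
        return (1.0f - smooth) * freq / scale_factor + smooth * freq;
    }
}
```

This would be applied once to each precomputed RoPE frequency before inference, not per token.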