Llama 3.1 in C (github.com/trholding)
212 points by AMICABoard on July 24, 2024 | 36 comments


Oh this is super cool! I think maybe the new RoPE scaling method Llama 3.1 uses isn't added in yet? It's a weird one-time scaling mechanism found by a grid search to enable 128K context. Essentially the model was trained on 15.6T tokens at 8K context, then iteratively extended to 128K context with 800B tokens.

I can open a PR if people want :) [Edit: Just opened a PR! Apologies, my C is very rusty! https://github.com/trholding/llama2.c/pull/14]

https://github.com/trholding/llama2.c/blob/master/runq.c#L65... needs to be scaled with some weird formula like in https://github.com/unslothai/unsloth/blob/main/unsloth/model...
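Roughly, the per-frequency scaling is as follows (my C sketch of the reference formula, using the published Llama 3.1 defaults - not the exact code from either repo):

    #include <math.h>

    /* Llama 3.1 RoPE frequency scaling (sketch). The constants are the
       published Llama 3.1 defaults; this mirrors the reference formula,
       not the exact code in either repo. */
    float scale_rope_freq(float freq) {
        const float factor = 8.0f;             /* context extension factor */
        const float low_freq_factor = 1.0f;
        const float high_freq_factor = 4.0f;
        const float old_context_len = 8192.0f; /* original 8K training context */

        float low_freq_wavelen = old_context_len / low_freq_factor;
        float high_freq_wavelen = old_context_len / high_freq_factor;
        float wavelen = 2.0f * (float)M_PI / freq;

        if (wavelen < high_freq_wavelen)
            return freq;                       /* high frequencies: leave as-is */
        if (wavelen > low_freq_wavelen)
            return freq / factor;              /* low frequencies: fully scaled */
        /* mid band: smooth interpolation between the two regimes */
        float smooth = (old_context_len / wavelen - low_freq_factor)
                     / (high_freq_factor - low_freq_factor);
        return (1.0f - smooth) * (freq / factor) + smooth * freq;
    }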


Oh thanks bro, nope, it uses the simple Llama 2 RoPE with theta changed to 500k to match Llama 3's. I'll check your Python PR, take a deeper look at the Meta Llama 3 & 3.1 implementations and hack together something soonish. Awesome!


Np! Oh yeah, my PR is just a rough conversion from Python to C - haven't done C in ages though, so it probably won't even compile!


To be honest, your PR and these notes are super helpful, because otherwise I'd have been too lazy to read up on the original implementation, but I can't merge it as-is. Will make the fix soonish and credit it to you. I'll tell you a secret: nobody is good at C except for a few wizards like Woz, Linus, jart, Knuth, Bellard and the like. I am also a total loser at C. I just love C. And the base of this is mostly Karpathy's work and that of other awesome folks who are way better at everything than me.


Oh thanks! :) Oh yeah, Karpathy's repo is pretty sick! Still, great work on it! If you need any help with it, feel free to ask!


Will do! Thanks :)


:)


Okay but hold your horses. Still a bit buggy.

Sample output:

Meta's Llama 3.1 models can output multilingual text, which is awesome. Here is some example output from the 8-bit quantized 8B model with 100-token output (-n 100)... Quantization creates some brain damage.
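(For the curious: the 8-bit path is group-wise symmetric int8, roughly like this sketch based on upstream runq.c - the zero-guard is my own addition, and the group size is configurable at export time.)

    /* Group-wise symmetric int8 quantization, in the spirit of
       upstream runq.c. The zero-guard is my own addition. */
    #include <math.h>
    #include <stdint.h>

    void quantize_q8(const float* x, int n, int gs, int8_t* q, float* scales) {
        for (int g = 0; g < n / gs; g++) {
            float wmax = 0.0f;
            for (int i = 0; i < gs; i++) {
                float v = fabsf(x[g * gs + i]);
                if (v > wmax) wmax = v;
            }
            float scale = wmax / 127.0f;     /* map [-wmax, wmax] onto [-127, 127] */
            if (scale == 0.0f) scale = 1.0f; /* guard against all-zero groups */
            scales[g] = scale;
            for (int i = 0; i < gs; i++) {
                /* each weight keeps ~8 bits of signal; the rest is the "damage" */
                q[g * gs + i] = (int8_t) roundf(x[g * gs + i] / scale);
            }
        }
    }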

English

My cat is funny. "Funny cat," I say, walking up to it. "What are you up to?" It sits up straight and looks at me with a tilted head, as if to say, "What's wrong with you?" Sometimes I just have to laugh at how funny a cat can be. So I say, "Okay, you're funny. I'll give you some treats." It stretches out a little and I give it some treats. It eats them up quickly and starts

German

Besitzen Sie einen Amiga 500? Wenn nicht, werden Sie wissen, dass dies ein Computer war, der im späten 1980er und frühen 1990er Jahren für Spiele verfügbar war, die für Personen mit bestimmten Körperverletzungen gedacht waren. Manchmal müssen wir uns an frühere Zeiten erinnern, die, wie wir jetzt wissen, schwierig waren. Hier ist ein Link, der meine Geschichte bespre

(Translation: "Do you own an Amiga 500? If not, you will know that this was a computer that was available in the late 1980s and early 1990s for games intended for people with certain bodily injuries. Sometimes we have to remember earlier times which, as we now know, were difficult. Here is a link that discusses my story" - cut off mid-word.)

French

Le vin français est, à bien des égards, un vin des origines, car il a joué un rôle important dans l'histoire de la France". La réputation des vins de France repose principalement sur leurs qualités gustatives et la gestion des vignobles contrôlée, ce qui rend le vin français un "produit d'exception". La France est donc leader mondial de la production de vin, avec 25 % des exportations mon

(Translation: "French wine is, in many respects, a wine of origins, for it has played an important role in the history of France". The reputation of French wines rests mainly on their taste qualities and controlled vineyard management, which makes French wine an "exceptional product". France is therefore the world leader in wine production, with 25% of world exports" - cut off mid-word.)

Thai

แมวของฉันตลกชอบเล่นบนม้วนกระดาษ และฉันก็ไม่แน่ใจว่าควรจะยินยอมที่จะให้เล่นหรือไม่

เมื่อเวลาผ่านไป ฉันเห็นว่าแมวของฉันเล่นม้วนกระดาษเป็นระยะ ๆ ฉันจึงตัดสินใจที่จะลองปรับเปลี่ยนเกมให้สนุกขึ้น

(Translation: "My cat is funny and likes playing on a roll of paper, and I'm not sure whether I should agree to let it play or not. As time went by, I saw my cat play with the paper roll from time to time, so I decided to try tweaking the game to make it more fun.")

Hindi

मेरी बिल्ली बहुत मज़ाया है और वह हमेशा अपनी शारीरिक गतिविधियों से मुझे मजाक करती है। वास्तव में, जब वह अपनी खिलौनों की चपपेट में आती है तो वह विशेष रूप से क्लासिक बन जाती है। इसके अलावा, वह एक छोटी सी च

(Rough translation; the Hindi is partly ungrammatical: "My cat is very funny and she always makes jokes at me with her physical activities. In fact, when she gets caught up in her toys she becomes especially classic. Apart from that, she a small..." - cut off mid-word.)


> My cat is funny. "Funny cat," I say, walking up to it. "What are you up to?" It sits up straight and looks at me with a tilted head, as if to say, "What's wrong with you?" Sometimes I just have to laugh at how funny a cat can be. So I say, "Okay, you're funny. I'll give you some treats." It stretches out a little and I give it some treats. It eats them up quickly and starts

This is kind of like 3rd grade English. What would be required to go beyond that?


> This is kind of like 3rd grade English

This is also how you'd communicate with a cat or dog. We [still] can't use complex, long sentences to communicate with them. And the LLM seems to be nicely reproducing that style and mindset of complete yet small sentence-bites you fall into when communicating with your Good Boy or your Funny Cat.


Hmmm interesting take.


Actually "My cat is funny" was the prompt it continued that. I got to fix some stuff to reflect meta's implementation and also fix the chat mode, then it would be usable. Will take a few days to do that.


Einen Amiga 500?

Shut up and take my money!


Ja Amiga 500! My first computer. Still in love with her...:)


The Hindi output is garbage.


Likely, and probably the other languages too. The fault may be my incomplete implementation.


> Quantization creates some brain damage.

Love the wording.


There's actually an old paper titled Optimal Brain Damage, where they don't try to find optimal quantizations, but optimal sparse versions of a model, i.e. where some weights are set to zero.
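The core move, roughly: rank the weights by how little they seem to matter and zero out the least important ones. A toy sketch with plain magnitude as the saliency proxy (the paper itself uses a second-derivative estimate, so this is only the crude cousin):

    /* Toy magnitude pruning: zero the smallest-magnitude weights.
       OBD proper ranks weights by a second-derivative saliency
       estimate; plain magnitude is only the crude stand-in here. */
    #include <math.h>
    #include <stdlib.h>

    static int cmp_float(const void* a, const void* b) {
        float fa = *(const float*)a, fb = *(const float*)b;
        return (fa > fb) - (fa < fb);
    }

    void prune_by_magnitude(float* w, int n, float sparsity) {
        /* find the magnitude below which `sparsity` of the weights fall */
        float* mags = malloc(n * sizeof(float));
        for (int i = 0; i < n; i++) mags[i] = fabsf(w[i]);
        qsort(mags, n, sizeof(float), cmp_float);
        float thresh = mags[(int)(sparsity * (n - 1))];
        free(mags);
        for (int i = 0; i < n; i++)
            if (fabsf(w[i]) <= thresh) w[i] = 0.0f; /* the "brain damage" */
    }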



Thanks, I'll read up on it. Interesting.


That was really for them. You're out there building neat stuff. Your talent might warrant looking into AdderNets and BitNets, which might get the cost down. There are also some brain-inspired designs.

I don’t think many people have implemented such things. You might discover something new experimenting with them.
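In a nutshell, AdderNet swaps the multiply-accumulate dot product for a negative L1 distance, so the hot loop is all additions. Something like this rough sketch (the idea, not any particular implementation):

    #include <math.h>

    /* AdderNet-style "dot product": similarity as negative L1 distance,
       so the inner loop is additions/subtractions, no multiplies. */
    float adder_dot(const float* x, const float* w, int n) {
        float acc = 0.0f;
        for (int i = 0; i < n; i++)
            acc -= fabsf(x[i] - w[i]);
        return acc;
    }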


:)


Interesting, I heard something like that, but now I must read about it.


Maybe we should make "Brain Damage Factor" an official term to denote how much different types of quantization degrade output compared to the unquantized model. :)
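Half seriously, you could even compute it, say as the ratio of quantized to full-precision perplexity on the same eval text (hypothetical inputs, just to pin the idea down - nothing like this exists in the repo):

    #include <math.h>

    /* Tongue-in-cheek "Brain Damage Factor": ratio of quantized to
       full-precision perplexity over the same eval text. The inputs
       are assumed to be mean negative log-likelihood per token,
       computed elsewhere. */
    float brain_damage_factor(float avg_nll_quant, float avg_nll_fp32) {
        float ppl_quant = expf(avg_nll_quant);
        float ppl_fp32  = expf(avg_nll_fp32);
        return ppl_quant / ppl_fp32; /* 1.0 = unharmed, higher = more damage */
    }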


I think that's known as degradation, but I think the brain damage metric could be usefully applied to the poor sods who try to get the quantization to work in the first place.


Lol :)


As someone who has literally no idea about ModelOps / GenAI deployment, what am I seeing here? Code that just loads in the weights and provides an inference API? Or what does this code actually do?


My bad, I directly linked to the C file instead of the project here.

It is a program that, given a model file, a tokenizer file and a prompt, continues generating text.
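At its core the loop is: encode the prompt, run the transformer forward one token at a time, sample the next token, print it. A sketch with illustrative names (not the exact run.c / runq.c API):

    #include <stdio.h>

    /* Sketch of the core generation loop. The names below are
       illustrative stand-ins for the helpers in run.c / runq.c. */
    typedef struct Transformer Transformer;
    typedef struct Tokenizer Tokenizer;
    extern int*   encode(Tokenizer* t, const char* text, int* n_tokens);
    extern float* forward(Transformer* m, int token, int pos);
    extern int    sample(const float* logits);
    extern char*  decode(Tokenizer* t, int prev_token, int token);

    void generate(Transformer* model, Tokenizer* tok, const char* prompt, int steps) {
        int n_prompt;
        int* tokens = encode(tok, prompt, &n_prompt);   /* prompt -> token ids */
        int token = tokens[0];
        for (int pos = 0; pos < steps; pos++) {
            float* logits = forward(model, token, pos); /* one transformer step */
            int next = (pos + 1 < n_prompt)
                     ? tokens[pos + 1]                  /* still feeding the prompt */
                     : sample(logits);                  /* else sample a new token */
            printf("%s", decode(tok, token, next));     /* print the new piece */
            token = next;
        }
    }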

To get it to work, you need to clone and build this: https://github.com/trholding/llama2.c

So the steps are like this:

First you'll need to obtain approval from Meta to download the Llama 3 models on Hugging Face.

Go to https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct, fill in the form, then check your acceptance status at https://huggingface.co/settings/gated-repos. Once accepted, do the following to download the model, export and run.

    huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct --include "original/*" --local-dir Meta-Llama-3.1-8B-Instruct

    git clone https://github.com/trholding/llama2.c.git
    cd llama2.c/

    # Export Quantized 8bit
    python3 export.py ../llama3.1_8b_instruct_q8.bin --version 2 --meta-llama ../Meta-Llama-3.1-8B-Instruct/original/

    # Fastest Quantized Inference build
    make runq_cc_openmp

    # Test Llama 3.1 inference, it should generate sensible text
    ./run ../llama3.1_8b_instruct_q8.bin -z tokenizer_l3.bin -l 3 -i " My cat"


How does this compare to llama.cpp?


Llama.cpp is the king; this is just a lowly wannabe peasant. But some day it will get there.


Cool. I will try it out. I tried the same with ollama; the non-English part needs a lot more polish. Do you see the outcome being any different?


I think the non-English part is mostly hit and miss in this primitive version, probably because the implementation is not correct. I've got to read up a lot and fix it.


I think generalizing llama2.c like this project is doing kind of defeats the purpose, no?


Hmm yeah, it started as a fork of Karpathy's llama2.c plus some experiments. So it is an abomination, I agree.


It's really interesting and a great piece of work that you have clearly put a huge amount of effort into. You should definitely be proud of this.


Thank you :) The open source community is great; the core is a confluence of the awesome work of others. I'm just an orchestrator. I wish I had more time to make it into something like llama.cpp and an OS around it.



