Oh this is super cool! I think maybe the new RoPE scaling method Llama 3.1 uses isn't added in yet? It's some weird one-time scaling mechanism found by a grid search to enable 128K context. Essentially the model was trained on 15.6T tokens at 8K context, then iteratively extended to 128K context with 800B tokens.
Oh thanks bro, nope, it uses the simple Llama 2 RoPE with theta changed to 500k to match Llama 3's. I'll check your Python PR, take a deeper look at the Meta Llama 3 & 3.1 implementations, and hack together something soonish. Awesome!
To be honest, your PR and these notes are super helpful, cos otherwise I'd have been too lazy to read up on the original implementation, but I can't merge it. Will make the fix soonish and credit you. I'll tell you a secret: nobody is good at C except a few wizards like Woz, Linus, jart, Knuth, Bellard and the like. I'm a total loser at C too. I just love C. Also, the base of this is mostly Karpathy's work and that of other awesome folks who are way better at everything than me.
Meta's Llama 3.1 models can output multilingual text, which is awesome. Here are some example outputs from the 8-bit quantized 8B model with a 100-token limit (-n 100)... Quantization causes some brain damage.
English
My cat is funny. "Funny cat," I say, walking up to it. "What are you up to?" It sits up straight and looks at me with a tilted head, as if to say, "What's wrong with you?" Sometimes I just have to laugh at how funny a cat can be. So I say, "Okay, you're funny. I'll give you some treats." It stretches out a little and I give it some treats. It eats them up quickly and starts
German
Besitzen Sie einen Amiga 500? Wenn nicht, werden Sie wissen, dass dies ein Computer war, der im späten 1980er und frühen 1990er Jahren für Spiele verfügbar war, die für Personen mit bestimmten Körperverletzungen gedacht waren. Manchmal müssen wir uns an frühere Zeiten erinnern, die, wie wir jetzt wissen, schwierig waren. Hier ist ein Link, der meine Geschichte bespre
(Translation: "Do you own an Amiga 500? If not, you will know that this was a computer available in the late 1980s and early 1990s for games intended for people with certain bodily injuries. Sometimes we have to remember earlier times which, as we now know, were difficult. Here is a link that discusses my story")
French
Le vin français est, à bien des égards, un vin des origines, car il a joué un rôle important dans l'histoire de la France". La réputation des vins de France repose principalement sur leurs qualités gustatives et la gestion des vignobles contrôlée, ce qui rend le vin français un "produit d'exception". La France est donc leader mondial de la production de vin, avec 25 % des exportations mon
(Translation: "French wine is, in many respects, a wine of origins, for it has played an important role in the history of France". The reputation of French wines rests mainly on their taste qualities and controlled vineyard management, which makes French wine a "product of exception". France is thus the world leader in wine production, with 25% of world exports)
Hindi
मेरी बिल्ली बहुत मज़ाया है और वह हमेशा अपनी शारीरिक गतिविधियों से मुझे मजाक करती है। वास्तव में, जब वह अपनी खिलौनों की चपपेट में आती है तो वह विशेष रूप से क्लासिक बन जाती है। इसके अलावा, वह एक छोटी सी च
(Translation: "My cat is very funny and she always jokes with me through her physical activities. In fact, when she gets caught up in her toys she becomes especially classic. Besides, she a small")
> My cat is funny. "Funny cat," I say, walking up to it. "What are you up to?" It sits up straight and looks at me with a tilted head, as if to say, "What's wrong with you?" Sometimes I just have to laugh at how funny a cat can be. So I say, "Okay, you're funny. I'll give you some treats." It stretches out a little and I give it some treats. It eats them up quickly and starts
This is kind of like 3rd grade English. What would be required to go beyond that?
This is also how you'd communicate with a cat or dog. We [still] can't use complex long sentences in communication with them. And the LLM seems to be nicely reproducing that style and mindset of complete yet small sentence-bites you adopt when communicating with your Good Boy or your Funny Cat.
Actually, "My cat is funny" was the prompt; it continued from that. I've got to fix some stuff to reflect Meta's implementation and also fix the chat mode, then it would be usable. Will take a few days to do that.
There's actually an old paper titled Optimal Brain Damage, where they don't try to find optimal quantizations, but optimal sparse versions of models, i.e. where some weights are set to zero.
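To illustrate the idea of sparsifying a network: Optimal Brain Damage ranks weights by a second-derivative saliency estimate, but the crude magnitude-based stand-in below (an assumption for illustration, not the paper's method) shows the mechanics of setting low-importance weights to zero:

```c
#include <math.h>

/* Zero out every weight whose magnitude falls below a threshold.
 * Returns how many weights were pruned. OBD would rank weights by a
 * Hessian-based saliency instead of raw magnitude. */
int prune_by_magnitude(float *w, int n, float threshold) {
    int zeroed = 0;
    for (int i = 0; i < n; i++) {
        if (fabsf(w[i]) < threshold) {
            w[i] = 0.0f;
            zeroed++;
        }
    }
    return zeroed;
}
```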
That was really for them. You're out there building neat stuff. Your talents might warrant looking into AdderNets and BitNets, which might get the cost down. There are also some brain-inspired designs.
I don't think many people have implemented such things. You might discover something new by experimenting with them.
I think that's known as degradation, but I think the brain damage metric could be usefully applied to the poor sods who try to get the quantization to work in the first place.
As someone who has literally no idea of ModelOps / GenAI Deployment, what am I seeing there? Code that just loads in the weights and provides an inference API? Or what does this code actually do?
I think the non-English part is mostly hit and miss in this primitive version, probably because the implementation is not correct. I've got to read up a lot and fix it.
It's really interesting and a great piece of work into which you've clearly put a huge amount of effort.
You should definitely be proud of this.
Thank you :) The open source community is great; the core is a confluence of the awesome work of others. I'm just an orchestrator. I wish I had more time to make it into something like llama.cpp and an OS around it.
Can open a PR if people want :) [Edit: Just opened a PR! Apologies my C is very rusty! https://github.com/trholding/llama2.c/pull/14]
https://github.com/trholding/llama2.c/blob/master/runq.c#L65... needs to be scaled with some weird formula like in https://github.com/unslothai/unsloth/blob/main/unsloth/model...
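For reference, the "weird formula" is Meta's one-time frequency rescaling for Llama 3.1's 128K context. A minimal C sketch, assuming the commonly published parameters (scale factor 8, low/high frequency factors 1 and 4, original 8K context); verify the constants against the official repo before relying on them:

```c
#include <math.h>

/* Llama 3.1-style RoPE frequency scaling sketch. Frequencies whose
 * wavelength exceeds the original context are divided by the scale
 * factor, short wavelengths are untouched, and the band in between is
 * smoothly interpolated. */
float rope31_scale_freq(float freq) {
    const float PI = 3.14159265f;
    const float scale_factor = 8.0f;       /* "factor" in the config */
    const float low_freq_factor = 1.0f;
    const float high_freq_factor = 4.0f;
    const float old_context_len = 8192.0f; /* original 8K training context */

    const float low_freq_wavelen = old_context_len / low_freq_factor;
    const float high_freq_wavelen = old_context_len / high_freq_factor;
    float wavelen = 2.0f * PI / freq;

    if (wavelen < high_freq_wavelen) {
        return freq;                /* high frequencies: leave untouched */
    } else if (wavelen > low_freq_wavelen) {
        return freq / scale_factor; /* low frequencies: scale down */
    } else {
        /* mid band: blend between scaled and unscaled */
        float smooth = (old_context_len / wavelen - low_freq_factor)
                     / (high_freq_factor - low_freq_factor);
        return (1.0f - smooth) * freq / scale_factor + smooth * freq;
    }
}
```

This would be applied once to each precomputed RoPE frequency before inference, not per token.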