Hacker News

Most data out there is junk, and the internet produces junk data faster than useful data, while current GPT-style AIs basically regurgitate what someone already did somewhere on the internet. So I'd guess the more data we feed into GPTs, the worse the results will get.

My take on improving AI output is to heavily curate the data you feed your AI, much like the expert systems of old (which were also lauded as "AI"). Maybe we can break the vicious circle of "I trained my GPT on billions of Twitter posts and let it write Twitter posts to great success", "Hey, me too!"



There are multiple companies hiring people on contracts to curate and generate data for this. I do confidential contract work for two different ones at the moment, and while my NDAs limit how much I can say, it involves both identifying issues with captured prompt/response pairs that have been filtered, and writing synthetic ones from scratch aided by models (e.g. come up with a coding problem, and rewrite the response to be "perfect").
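The triage workflow described above (flag suspect prompt/response pairs for human review before experts rewrite them) can be sketched roughly like this. A minimal, hypothetical sketch: the field names and heuristics are illustrative assumptions, not any vendor's actual pipeline.

```python
# Hypothetical triage step: flag captured prompt/response pairs that
# look low quality so humans can review or rewrite them. The heuristics
# below are deliberately crude, illustrative assumptions only.

def needs_review(pair: dict) -> list[str]:
    """Return a list of reasons a prompt/response pair looks suspect."""
    reasons = []
    prompt = pair.get("prompt", "").strip()
    response = pair.get("response", "").strip()
    if not response:
        reasons.append("empty response")
    elif len(response) < 20:
        reasons.append("response suspiciously short")
    if prompt and prompt.lower() in response.lower():
        reasons.append("response parrots the prompt")
    if "as an ai language model" in response.lower():
        reasons.append("boilerplate refusal phrasing")
    return reasons

pairs = [
    {"prompt": "Write a haiku about rain.", "response": ""},
    {"prompt": "Sum a list in Python.",
     "response": "Use sum(xs); for example sum([1, 2, 3]) gives 6."},
]
# Keep only the pairs that need a human pass.
flagged = [(p, needs_review(p)) for p in pairs if needs_review(p)]
```

In practice the pre-filtering the commenter mentions would be far more involved (model-based scoring, deduplication, routing by difficulty), but the shape is the same: cheap automatic checks first, expensive human review only on what survives them.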

The first category has obviously been pre-filtered to route simpler problems to cheaper resources: sometimes these projects pay reasonable tech contract rates for 1-2 hours of work to improve only 2-3 conversation turns of a single conversation, and it's clear they usually have more than one person reviewing the same data.

A lot of money is pouring into that space, and the moats in the form of proprietary training data heavily curated by experts are going to grow rapidly given how much cash the big players have.


Thanks for your insights! Would you say this is an approach suited for "general" GPTs (not in the sense of AGI), or more for expert systems like Copilot?


I can't really say whether the outcomes are good, as I won't be told to what extent the output makes it into production models, and I don't even always know which company it's for. But I know at least some of it is being used for "general" models. I do more code-related work than general-purpose work, as it's the work I find most interesting, but the highest-paid contract I've had in this space so far is for a general-purpose model that to my knowledge isn't available yet, from a company you'd know (though I'm under strict NDA not to mention the company name or more details about the work).


> My take to improve AI output is to heavily curate the data you feed your AI

This is what OpenAI is doing with its licensing relationships with companies like Reddit, News Corp, etc.:

https://openai.com/index/news-corp-and-openai-sign-landmark-...

Problem is that we have a finite amount of this type of information.


Massive surveillance: extract the data and use it for training. I hope this never comes to fruition.


Thankfully, we have stalwart and well-known defenders of our security like Apple and Microsoft to protect us. There's nothing of the sort to worry about.



