Hacker News

Most data out there is junk, and the internet produces junk data faster than useful data, while current GPT-style AIs basically regurgitate what someone already did somewhere on the internet. So I'd guess the more data we feed into GPTs, the worse the results will get.

My take on improving AI output is to heavily curate the data you feed your AI, much like the expert systems of old (which were also lauded as "AI"). Maybe we can break the vicious circle of "I trained my GPT on billions of Twitter posts and let it write Twitter posts to great success", "Hey, me too!"



There are multiple companies hiring people on contracts to curate and generate data for this. I do confidential contract work for two different ones at the moment, and while my NDAs limit how much I can say, it involves both identifying issues with captured prompt/response pairs that have been filtered, and writing synthetic ones from scratch aided by models (e.g. come up with a coding problem, and rewrite the response to be "perfect").
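The triage workflow described above (flag suspect prompt/response pairs for human review before experts rewrite them) can be sketched roughly like this. A minimal, hypothetical sketch: the field names and heuristics are illustrative assumptions, not any vendor's actual pipeline.

```python
# Hypothetical triage step: flag captured prompt/response pairs that
# look low quality so humans can review or rewrite them. The heuristics
# below are deliberately crude, illustrative assumptions only.

def needs_review(pair: dict) -> list[str]:
    """Return a list of reasons a prompt/response pair looks suspect."""
    reasons = []
    prompt = pair.get("prompt", "").strip()
    response = pair.get("response", "").strip()
    if not response:
        reasons.append("empty response")
    elif len(response) < 20:
        reasons.append("response suspiciously short")
    if prompt and prompt.lower() in response.lower():
        reasons.append("response parrots the prompt")
    if "as an ai language model" in response.lower():
        reasons.append("boilerplate refusal phrasing")
    return reasons

pairs = [
    {"prompt": "Write a haiku about rain.", "response": ""},
    {"prompt": "Sum a list in Python.",
     "response": "Use sum(xs); for example sum([1, 2, 3]) gives 6."},
]
# Keep only the pairs that need a human pass.
flagged = [(p, needs_review(p)) for p in pairs if needs_review(p)]
```

In practice the pre-filtering the commenter mentions would be far more involved (model-based scoring, deduplication, routing by difficulty), but the shape is the same: cheap automatic checks first, expensive human review only on what survives them.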

The first category has obviously been pre-filtered to route simpler problems to cheaper resources: sometimes these projects pay reasonable tech contract rates for 1-2 hours of work to improve only 2-3 conversation turns of a single conversation, and it's clear they usually have more than one person reviewing the same data.

A lot of money is pouring into that space, and the moats in the form of proprietary training data heavily curated by experts are going to grow rapidly given how much cash the big players have.


Thanks for your insights! Would you say this is an approach suited for "general" GPTs (not in the sense of AGI), or more for expert systems like Copilot?


I can't really say whether the outcomes are good, as I won't be told to what extent the output makes it into production models, and I don't even always know which company it's for. But I know at least some of it is being used for "general" models. I do more code-related work than general-purpose work, as it's the work I find most interesting, but the highest-paid contract I've had in this space so far is for a general-purpose model that to my knowledge isn't available yet, from a company you'd know (though I'm under strict NDA not to mention the company name or more details about the work).


> My take to improve AI output is to heavily curate the data you feed your AI

This is what OpenAI is doing with its licensing relationships with companies like Reddit, News Corp, etc.:

https://openai.com/index/news-corp-and-openai-sign-landmark-...

Problem is that we have a finite amount of this type of information.


Massive surveillance: extract the data and use it for training. I hope this never comes to fruition.


Thankfully, we have stalwart and well-known defenders of our security like Apple and Microsoft to protect us. There's nothing of the sort to worry about.



