More

pvankessel · 2025-11-14T21:29:31 1763155771

This view of the world puts everything on the individual. It might be worth reading up on structuralism to balance that perspective out a bit. I'm somewhere in the middle of the two extremes myself, but surely one must acknowledge that there are larger systems at play that can constrain an individual's ability to "optimize".

pvankessel · 2025-11-03T01:30:25 1762133425

The automation one is so true! When I first deployed a huge job to MTurk, with so much money on the line I wanted to be careful, and I wrote some heuristics to auto-ban Turkers who worked their way through the HITs suspiciously quickly (2 standard deviations above the norm, iirc) - and damn did I wake up to a BUNCH of angry (but kind) emails. Turns out, there was a popular hotkey programming tool that Turk Masters made use of to work through the more prized HITs more efficiently, and on one of their forums someone shared a script for ours. I checked their work and it was quality, they were just hyper-optimizing. It was reassuring to see how much they cared about doing a good job.

pvankessel · 2025-11-02T23:07:49 1762124869

I used MTurk heavily in its hey-day for data annotation - it was an invaluable tool for collecting training data for large-scale research projects, I honestly have to credit it with enabling most of my early career triumphs. We labeled and classified hundreds of thousands of tweets, Facebook posts, news articles, YouTube videos - you name it. Sure, there were bad actors who gave us fake data, but with the right qualifications and timing checks, and if you assigned multiple Turkers (3-5) to each task, you could get very reliable results with high inter-rater reliability that matched that of experts. Wisdom of the crowd, or the law of averages, I suppose. Paying a living wage also helped - the community always got extremely excited when our HITs dropped and was very engaged, I loved getting thank yous and insightful clarifying questions in our inbox. For most of this kind of work, I now use AI and get comparable results, but back in the day, MTurk was pure magic if you knew how to use it to its full potential. Truthfully I really miss it - hitting a button to launch 50k HITs and seeing the results slowly pour in overnight (and frantically spot-checking it to make sure you weren't setting $20k on fire) was about as much of a rush as you can get in the social science research world.

pvankessel · 2025-10-25T19:56:06 1761422166

I got this one a few months ago and have been running it in my basement directly under my living room, separated only by the floor and a bit of insulation. Can't hear it at all. It's been working well and it's a fun low-investment hobby. I live on a glacial moraine so there are lots of unique rocks in my backyard, and my son enjoys digging for them. https://a.co/d/4HSnVVX

pvankessel · 2025-10-05T18:16:58 1759688218

Except many STEM graduates are having a harder time finding jobs right now than liberal arts and humanities majors: https://www.newyorkfed.org/research/college-labor-market#--:....

For what it's worth, I have enjoyed a very successful career in data science and software engineering after taking some AP STEM courses in high school, followed by three liberal arts degrees. Many of the best engineers I've known have had similar backgrounds. A good liberal arts education teaches one how to think and learn independently. It's not a substitute for a highly-specialized education in, say, molecular biology, but it provides a really solid foundation to easily pick up more logic-derived technical skills like software development. It's also essential for an informed citizenry and functional democracy.

LudwigNagasena · 2025-10-05T18:35:09 1759689309

It’s sad that many people need to spend years on liberal arts education to learn to learn independently. Where has our society failed that 11 years of schooling and upbringing can’t provide that?

pvankessel · 2025-10-05T18:49:26 1759690166

Oh I agree with you on that wholeheartedly. I think our society would be substantially healthier if we required civics, philosophy, economics, etc in high school. But if we're already struggling to have evolution taught in schools and we have state boards of education removing references to the slave trade and founding fathers from history curriculum (https://www.theguardian.com/world/2010/may/16/texas-schools-...), expanding liberal arts in public education is a non-starter. Hell, half the country would love to see it wiped from post-secondary education. Best I figure we can do at this point is defend the idea itself to the extent we can - for instance, in Hacker News threads where the liberal arts are being dismissed as an unnecessary lesser-than academic pursuit.

Jensson · 2025-10-06T00:39:43 1759711183

You do realize that an engineering degree is better to learn how to learn than a typical liberal arts degree? Read academically adrift, this is well studies.

In general the more difficult your degree the better it teaches you how to learn, because you are forced to learn more difficult stuff.

pvankessel · 2025-10-06T02:09:26 1759716566

Hard disagree. I gave up a top-10 engineering scholarship and switched to liberal arts largely because my entire curriculum was predetermined in the former. Five courses in calculus and two slots for electives in your entire undergraduate schedule - that doesn't teach you how to think. Political philosophy, symbolic logic, comparative history, econometrics - having the freedom to explore and dabble and push yourself into new ideas instead of being fast-tracked into a pipeline, that's how you learn how to learn. And the "difficulty" is entirely what you make of it. Sure, if you show up to college and want to major in anthropology and put no effort in, you get nothing out. But I saw very quickly that with absolute unfailing effort applied to my engineering degree, I was still going to get exactly one and only one thing out of it. The liberal arts gave me a cornucopia of possibility. I've gone on a human trafficking sting op with the FBI, I've presented my research at the White House, I've been cited by the Pope - that's all wild shit that an engineering degree never would have enabled. Breadth of learning and soft skills matter. I'd be a shell of a person today if not for my liberal arts education. I owe everything to it, and the constant condescension towards non-STEM education in tech would frustrate me more if I didn't run laps around my peers.

ThrowMeAway1618 · 2025-10-06T15:36:36 1759764996

>In general the more difficult your degree the better it teaches you how to learn, because you are forced to learn more difficult stuff.

How right you are! From now on I'm only hiring folks who created abiogenesis in a cereal bowl while fellating a hungry lion. Anyone else had it much too easy, amirite?

It's either that or just folks who discovered a new elementary particle while defending Afghani women from the Taliban.

Anything else would be way too easy.

Roscius · 2025-10-05T20:42:14 1759696934

I entirely agree - I have a 30 year career in STEM and am now a senior software architect at a $5b company. I also read, write and speak classical Latin at an advanced (almost fluent) level.

My favourite pastime is quoting Cicero in planning meetings.

I also hire SEs - if I see a resume come in with a CS and liberal arts background, they are definitely going to the top of the pile and getting an interview. If they can explain to me how Plato relates to their work as a SE then the job is theirs...

umeshunni · 2025-10-07T21:51:11 1759873871

Lol. Way to filter our people without a western education.

It's ok - most companies that matter are led by people who have spent more time reading the Mahabharata rather than Plato. Enjoy your scraps.

DaSHacka · 2025-10-05T20:47:01 1759697221

> Except many STEM graduates are having a harder time finding jobs right now than liberal arts and humanities majors: https://www.newyorkfed.org/research/college-labor-market#--:....

Is that in both respective fields of study, though?

It aplears liberal arts/humanities majors are much more willing to work non-related jobs where their STEM collegues more strictly pursue relevant titles.

https://www.forbes.com/sites/christopherrim/2023/01/11/the-p...

pvankessel · 2025-10-05T21:33:47 1759700027

Well that's kind of my point - liberal arts and humanities set you up with a very versatile baseline. With a proper education in those disciplines you learn how to think, and that's applicable to a wide range of fields. The woman I dated in grad school at UChicago studied war history and wound up being an analyst for a prominent wine auctioneering firm as a key researcher. My master's thesis was on the meaning of life, and now I'm running data science at a non-profit. So many of my fellow liberal arts grads have gone on to do incredible things entirely unrelated to their chosen subject of study.

pvankessel · 2025-09-02T02:17:05 1756779425

Curious about this, is there actually a canonical explanation in the trilogy somewhere?

cyptus · 2025-09-02T14:39:44 1756823984

no, i was just kidding :D

pvankessel · on April 4, 2024

Heh, that's actually pretty compelling - it sounds like a darker twist on the plot of a Stargate episode I recently rewatched: https://en.m.wikipedia.org/wiki/Revisions_(Stargate_SG-1)

pvankessel · on April 1, 2024

Are there any models out there for cleaning up an image, not just upscaling? I have a bunch of old photos taken on early low-res point-and-shoots that have JPEG artifacts etc and this seems like something a modern model could easily be fine-tuned to resolve, but every few months I look around and have yet to find anything

iampims · on April 1, 2024

Check out these models: https://replicate.com/collections/image-restoration

Most of them can be run locally, but I’d recommend testing them with replicate before investing in understanding cog/docker/hf…

pvankessel · on April 1, 2024

Oh this sounds like exactly like what I've been looking for, can't wait to give these a try - many thanks

pvankessel · on Dec 23, 2023

Here's a study I did a few years ago that broke down popular YT videos into more granular categories, might be of interest: https://www.pewresearch.org/internet/2019/07/25/childrens-co...

pvankessel · on Dec 23, 2023

This is such a clever way of sampling, kudos to the authors. Back when I was at Pew we tried to map YouTube using random walks through the API's "related videos" endpoint and it seemed like we hit a saturation point after a year, but the magnitude described here suggests there's a quite a long tail that flies under the radar. Google started locking down the API almost immediately after we published our study, I'm glad to see folks still pursuing research with good old-fashioned scraping. Our analysis was at the channel level and focused only on popular ones but it's interesting how some of the figures on TubeStats are pretty close to what we found (e.g. language distribution): https://www.pewresearch.org/internet/2019/07/25/a-week-in-th...

m463 · on Dec 23, 2023

> Google started locking down the API almost immediately after we published our study

Isn't this ironic, given how google bots scour the web relentlessly and hammer sites almost to death?

LeonM · on Dec 23, 2023

> google bots scour the web relentlessly and hammer sites almost to death

I have been hosting sites and online services for a long time now and never had this problem, or heard of this issue ever before.

If your site can't even handle a crawler, you need to seriously question your hosting provider, or your architecture.

sbarre · on Dec 23, 2023

Perhaps stop and reconsider such a dismissive opinion given that "you've never had this issue before" then? Or go read up a bit more on how crawlers work in 2023.

If your site is very popular and the content changes frequently, you can find yourself getting crawled a higher frequency than you might want, particularly since Google can crawl your site at a high rate of concurrency, hitting many pages at once, which might not be great for your backend services if you're not used to that level of simultaneous traffic.

"Hammered to death" is probably hyperbole but I have worked with several clients who had to use Google's Search Console tooling[0] to rate-limit how often Googlebot crawled their site because it was indeed too much.

0: https://developers.google.com/search/docs/crawling-indexing/...

holoduke · on Dec 23, 2023

I have a website thats get crawled at least 50 times per second. Is that a real deal? No not really. The site is probably doing 10.000 requests per second. I mean a popular site is indexed a lot. Your webserver should be designed for it. What tech are you using if I may ask?

sbarre · on Dec 25, 2023

My specific case doesn't really matter (and my examples are from some years ago and of smaller clients, not my own setup).

My point was that people provision capacity ideally based on observed or expected traffic, and that crawlers can, and do, show up and exceed that capacity sometimes, having a negative effect on your customers' experience.

But you are correct that it's absolutely manageable. And telling crawlers to slow the F down is one of the tools you can use to manage it. :-)

pas · on Dec 23, 2023

if your site is popular and you have a problem with crawlers use robots.txt (in particular the Crawl-delay stanza)

also for less friendly crawlers a rate limiter is needed anyway :(

(of course the existence of such tools doesn't give carte blanche to any crawler to overload sites ... but let's say they implement some sensing, based on response times, that means a significant load is probably needed to increase response times, which definitely can raise some eyebrows, and with autoscaling can cost a lot of money to site operators)

piva00 · on Dec 23, 2023

I worked at a company back in 2005-2010 where we had a massive problem with Googlebot crawlers hammering our servers, stuff like 10-100x the organic traffic.

That's pre-cloud ubiquity so scaling up meant buying servers, installing them on a data center, and paying rent for the racks. It was a fucking nightmare to deal with.

LocalH · on Dec 23, 2023

"Rules for thee, but not for me"

dotandgtfo · on Dec 23, 2023

This is one of the most important parts of the EUs upcoming digital services act in my opinion. Platforms have to share data with (vetted) researchers, public interest groups and journalists.

vasco · on Dec 23, 2023

For aggregated data and stats like this I think it could be fully publicly available.

hotstickyballs · on Dec 23, 2023

Vetted always means people with the time, resources and desire to navigate through the vetting process, which makes them biased.

weeblewobble · on Dec 23, 2023

You might say the same thing about doing research in general

lillecarl · on Dec 23, 2023

I would argue it's better than nothing, and what are they going to be biased towards?

fsckboy · on Jan 4, 2024

Are you talking about Europe? they're certainly going to be biased against Google and any US tech giant.

I'm biased against Google, but I'm honest about it. I don't ask "what could I possibly be biased about?"

MBCook · on Dec 23, 2023

This would find things like unlisted videos which don’t have links to them from recommendations.

trogdor · on Dec 23, 2023

That’s a really good point. I wonder if they have an estimate of the percentage of YouTube videos that are unlisted.

0x1ceb00da · on Dec 23, 2023

This technique isn't new. Biologists use it to count the number of fish in a lake. (Catch 100 fish, tag them, wait a week, catch 100 fish again, count the number of tagged fishes in this batch)

pants2 · on Dec 23, 2023

That's typically the Lincoln-Petersen Estimator. You can use this type of approach to estimate the number of bugs in your code too! If reviewer A catches 4 bugs, and reviewer B catches 5 bugs, with 2 being the same, then you can estimate there are 10 total bugs in the code (7 caught, 3 uncaught) based on the Lincoln-Petersen Estimator.

cpeterso · on Dec 23, 2023

A similar approach is “bebugging” or fault seeding: purposely adding bugs to measure the effectiveness of your testing and to estimate how many real bugs remain. (Just don’t forget to remove the seeded bugs!)

https://en.m.wikipedia.org/wiki/Bebugging

mewpmewp2 · on Dec 23, 2023

But this implies that all bugs are of equal likelihood of being found which I would highly doubt, no?

pants2 · on Dec 23, 2023

Yes, it's obviously not a perfect estimate, but can be directionally helpful.

You could bucket bugs into categories by severity or type and that might improve the estimate, as well.

rightbyte · on Dec 23, 2023

Oh this is a really interesting concept.

I guess it underestimates the number hard to find bugs though since it assumes same likelyhood to be found.

justinpombrio · on Dec 23, 2023

That's not actually the technique the authors are using. Catching 100 fish would be analogous to "sample 100 YouTube videos at random", but they don't have a direct method of doing so. Instead, they're guessing possible YouTube video links at random and seeing how many resolve to videos.

In the "100 fish" example, the formula for approximating the total number of fish is:

    total ~= caught / tagged
    (where caught=100 in the example)

In their YouTube sampling method, the formula for approximating the total number of videos is:

    total ~= (valid / tried) * 2^64

Notice that this is flipped: in the fish example the main measurement is "tagged" (the number of fish that were tagged the second time you caught them), which is in the denominator. But when counting YouTube videos, the main measurement is "valid" (the number of urls that resolved to videos), which is in the numerator.

ad404b8a372f2b9 · on Dec 23, 2023

Did you understand where the 2^64 came from in their explanation btw? I would have thought it would be (64^10)*16 according to their description of the string.

Edit: Oh because 64^10 * 16 = (2^6)^10 * (2^4)

dajonker · on Dec 23, 2023

The YouTube identifiers are actually 64 bit integers encoded using url-safe base64 encoding. Hence the limited number of possible characters for the 11th position.

zellyn · on Dec 23, 2023

Do you get the same 100 dumb fish?

p1mrx · on Dec 23, 2023

Why are they dumb? Free tag.

labster · on Dec 23, 2023

Imagine being the only fish without a tag. Everyone at school will know how lame you are.

DeathArrow · on Dec 23, 2023

It would be illegal not to have a tag. If the fish has nothing to hide, it shouldn't worry about being tagged.

And, also, the fish gets tagged for its own good.

DeathArrow · on Dec 23, 2023

>Everyone at school will know how lame you are.

They'll even call you tinfoil fish.

eek2121 · on Dec 23, 2023

This comment. Please see here.

egeozcan · on Dec 23, 2023

Catching fish is theoretically not perfectly random (risk-averse fish are less likely to get selected/caught) but that's the best method in those circumstances and it's reasonable to argue that the effect is insignificant.

nkurz · on Dec 23, 2023

You make a very weak argument, and are simply assuming the conclusion.

What makes it the "best method"? Would it be better to use a seine, or a trap, or hook-and-line? How would we know if there are subpopulations that have different likelihood of capture by different methods?

To say it's "reasonable to argue that the effect is insignificant" is purely assertion. Why is it unreasonable to argue that a fish could learn from the first experience and be less likely to be captured a second time?

If what you mean is that it's better than a completely blind guess, then I'd agree. But it's not clearly the best method nor is it clearly unbiased.

egeozcan · on Dec 23, 2023

Fair points. But, mark-recapture is about practicality. It's not perfect, but it's a solid compromise between accuracy and feasibility (so I mean best in these regards, to be 100% clear). Sure, different methods might skew results, but this technique is about getting a reliable estimate, not pinpoint accuracy. As for learning behavior in fish, that's considered in many studies (and many other things, like listed here: https://fishbio.com/fate-chance-encounters-mark-recapture-st... ), but overall, it doesn't hugely skew the population estimates. So, again, it's about what works best in the field, not in theory.

lanstin · on Dec 24, 2023

In my experience conservation biologists are really good at finding animals in the wild. Much better than a typical SWE or typical business person.

panarky · on Dec 23, 2023

Wouldn't a previously caught fish be less likely to fall for the same trick a second time?

soonwitdafishis · on Dec 23, 2023

only if you're within a 100 mile radius of me the ultimate dumb fish

dclowd9901 · on Dec 23, 2023

I made the same connection but it’s still the first time I’ve seen it used for reverse looking up IDs.

midasuni · on Dec 23, 2023

It’s not even new in the YouTube space as they acknowledge from 2011

https://dl.acm.org/doi/10.1145/2068816.2068851

krackers · on Dec 23, 2023

Also related is the unseen species problem (if you sample N things, and get Y repeats, what's the estimated total population size?).

https://en.wikipedia.org/wiki/Unseen_species_problem http://www.stat.yale.edu/~yw562/reprints/species-si.pdf

neurostimulant · on Dec 24, 2023

> You generate a five character string where one character is a dash – YouTube will autocomplete those URLs and spit out a matching video if one exists.

Won't this mess up stats though? It's like a lake monster randomly swapping an untagged fish with tagged fish as you catch them.

fergbrain · on Dec 23, 2023

Isn’t this just a variation of the Monte Carlo method?

layer8 · on Dec 23, 2023

That's only vaguely the same. It would be much closer if they divided the lake into a 3D grid and sampled random cubes from it.

gaucheries · on Dec 23, 2023

I think YouTube locked down their APIs after the Cambridge Analytica scandal.

herval · on Dec 23, 2023

in the end, that scandal was the open web's official death sentence :(

m1sta_ · on Dec 23, 2023

The issue wasn't the analytics either. The issue was the engagement algorithms and lack of accountability. Those problems still exist today.

TeMPOraL · on Dec 23, 2023

So as usual, the exploitative agents get to destroy the commons and come out on top.

We need to figure out how to target the malicious individuals and groups instead of getting creeped out by them to the point of destroying most of the so praised democratizing of computing. Between this and locking down the local desktop and mobile software and hardware, we've never got to having the promised "bicycle for the mind".

bossyTeacher · on Dec 23, 2023

no one promised you anything

fallingknife · on Dec 23, 2023

And what kind of accountability is that? An engagement algorithm is a simple thing that gives people more of what they want. It just turns out that what we want is a lot more negative than most people are willing to admit to themselves.

Applejinx · on Dec 23, 2023

I would rephrase that to 'what we predictably respond to'.

You can legitimately claim that people respond in a very striking and predictable way to being set on fire, and even find ways to exploit this behavior for your benefit somehow, and it still doesn't make setting people on fire a net benefit or a service to them in any way.

Just because you can condition an intelligent organism in a certain way doesn't make that become a desirable outcome. Maybe you're identifying a doomsday switch, an exploit in the code that resists patching and bricks the machine. If you successfully do that, it's very much on you whether you make the logical leap to 'therefore we must apply this as hard as possible!'

clippyplz · on Dec 23, 2023

Engagement can be quite unrelated to what people like. A well crafted troll comment will draw tons of engagement, not because people like it.

fallingknife · on Dec 23, 2023

If people didn't like engaging with troll comments, they wouldn't do it. It's not required, and they aren't getting paid.

zaphar · on Dec 23, 2023

This comment has a remarkable lack of nuance in it. That isn't even remotely close to how how human motivation works. We do all kinds of things motivated by emotions that have nothing to do with "liking" it.

herval · on Dec 23, 2023

I don't think people "like" it as much as hate elicits a response from your brain, like it or not.

If people had perfect self-control, they wouldn't do it. IMO it's somewhat irresponsible for the algorithm makers to profit from that - it's basically selling an unregulated, heavily optimized drug. They downrank scammy content for instance, which limits its reach - why not also downrank trolling? (obviously bc the former directly impacts profits, but not the latter, but still)

mdhb · on Dec 23, 2023

This is really a child like understanding of the world.

nextaccountic · on Dec 23, 2023

In which ways were the Cambridge Analytica thing and the openness of Youtube APIs (or other web APIs) related? I just don't see the connection

gloryjulio · on Dec 23, 2023

The original open API from the Facebook was open for the benefit of the good actors to use their data. You can disagree with how it's used, but u can't disagree with the intention.

With the CA scandal, now all the big companies would lock down their app data and sell ads strictly through their limited API only, so the ads buyer would have much less control before.

It's basically saying: u cant behave with the open data. Then we will do business only

newaccount74 · on Dec 23, 2023

CA was about 3rd parties scraping private user data.

Companies are locking down access to public posts. This has nothing to do with CA, just with companies moving away from the open web towards vertical integration.

Companies requiring users to login to view public posts (Twitter, Instagram, Facebook, Reddit) has nothing to do with protecting user data. It's just that tech companies now want to be in control of who can view their public posts.

gloryjulio · on Dec 23, 2023

I'm a bit hazy on the details of the event but the spirit still applies: there were more access to the data that were not 100% profit driven. Now the it's locked down as the companies want to cover their asses and do not want another CA

passion__desire · on Dec 23, 2023

Wasn't the "open" data policy used to create Clearview AI to create a profile and provide it to US govt departments?

pvankessel · on Dec 23, 2023

They actually held out for a couple of years after Facebook and didn't start forcing audits and cutting quotas until 2019/2020

hipadev23 · on Dec 23, 2023

[flagged]

blackle · on Dec 23, 2023

It is a little more sophisticated. They say they use an exploit that was found where a URL with five characters with a dash will get autocompleted by YouTube (I wonder why that is.) That improves sampling by 32,000 times apparently