

It’s easy and cheap to transcribe audio today (e.g. with Whisper). You can automatically create subtitles, which means you conveniently get timecodes for the audio, too.
If you then build a database, you can link audio and video to searchable full text that also gives you the timestamps corresponding to the search term you entered. Cross-reference that with other data, like who is talking or present in a video, and you even get filters for words or phrases said by a specific person. This is far less work than you might imagine.
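A minimal sketch of that database part, assuming you already have Whisper-style transcript segments (text plus start/end timestamps) and made-up video/speaker metadata, using SQLite’s built-in FTS5 full-text index:

```python
import sqlite3

# Hypothetical transcript segments, shaped like Whisper's segment output
# (text plus start/end timestamps in seconds), joined with speaker metadata.
segments = [
    {"video": "ep01.mp4", "speaker": "Alice", "start_s": 12.4, "end_s": 15.0,
     "text": "Welcome back to the show"},
    {"video": "ep01.mp4", "speaker": "Bob", "start_s": 15.1, "end_s": 19.8,
     "text": "Today we talk about full text search"},
    {"video": "ep02.mp4", "speaker": "Alice", "start_s": 3.2, "end_s": 7.5,
     "text": "Full text search with timestamps is easy"},
]

db = sqlite3.connect(":memory:")
# FTS5 virtual table: only `text` is indexed for search; the rest is
# stored metadata returned alongside each hit.
db.execute("""
    CREATE VIRTUAL TABLE transcript USING fts5(
        text, video UNINDEXED, speaker UNINDEXED,
        start_s UNINDEXED, end_s UNINDEXED
    )
""")
db.executemany(
    "INSERT INTO transcript (text, video, speaker, start_s, end_s) "
    "VALUES (:text, :video, :speaker, :start_s, :end_s)",
    segments,
)

def search(phrase, speaker=None):
    """Find all segments matching the phrase, optionally filtered by speaker."""
    sql = "SELECT video, speaker, start_s, end_s FROM transcript WHERE text MATCH ?"
    args = [phrase]
    if speaker:
        sql += " AND speaker = ?"
        args.append(speaker)
    return db.execute(sql, args).fetchall()

# Every clip where the exact phrase is said, with timestamps to jump to:
print(search('"full text search"'))
# The same phrase, but only when Alice says it:
print(search('"full text search"', speaker="Alice"))
```

Each hit gives you a file and a time range, so jumping to the exact moment in the recording is just a seek. The speaker column is whatever metadata you attach; Whisper itself doesn’t identify speakers, so that part would come from a diarization tool or manual tagging.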
One thing to keep in mind: your view may also be biased. You only get to see when they successfully pull it off, not their failures, and you don’t see the constraints, like a lack of material, they had to fight with. If you regularly watch several of those shows, you might see it more often, but done by different people. If you watch them (or single snippets/performances) on e.g. YouTube, you may get pushed those and similar clips more often, but they may come from an arbitrary time frame (and an arbitrary number of people) and are not necessarily new.
As for the industry: Not every event/speech is necessarily covered by each TV station/studio. There are also teams that record them and then sell the footage. I’d guess those teams are interested in providing easy access to their recordings and maybe even metadata like captions, text search, perhaps even marked gesturing and so on.
I reckon it’s hard if you have to compile those clips from random social media posts instead.
As for content creators on YouTube: I remember one doing this. They watched, archived and transcribed every official video and stream from the devs of a specific game, then picked from this material to compile montages that let the devs “announce” whatever the creator wanted, as a meme. In this case, it probably also helped keep the scope reasonable to focus on a single game and only the “official” material.