So I just read a piece from Medium entitled "I Built a Fake News Detector Using Natural Language Processing and Classification Models: Analyzing open source data from Subreddits r/TheOnion & r/nottheonion".
It pretty much does what it says on the tin: the data scientist who wrote the article used standard machine learning to look at example text that was "fake news" (in this case, text from the classic satirical news site The Onion, or more specifically, text from the part of Reddit that reposts Onion stories) and text that was "real," but absurd news (in this case, text from the part of Reddit called nottheonion).
I've put quotes around "real" because a) I'm unsure about the reliability of the subreddit r/nottheonion, and b) sometimes I think there is nothing truer than satire like The Onion. For instance, one of my favourite quotes about the hype around DeepMind's AlphaGo program beating a real, human master of the game Go is the following from American Voices, The Onion's recurring fake "man on the street" vox pop:
“I’m sorry, but this AI stuff scares me to death. It’s only a matter of time until we wake up to find the world overrun with computers playing all sorts of board games.”
DENNIS KALEN • PART-TIME LABORER
In my opinion, truer words were never spoken, Dennis Kalen.
Anyway, the interesting thing in the Medium story is that the titular fake-news-detection algorithm was able to tell r/TheOnion from r/nottheonion 90 per cent of the time!
An impressive number, but as with most algorithmic results, it pays to look at the specifics. In this case, the author included a sorted list of the words that the algorithm found most useful in distinguishing what was satirical from what was merely absurd in the news. Here's the graphic:
And that's right: the words most indicative of a story being from The Onion were "Kavanaugh," "Incredible," and "FTW" (the Internet acronym that means "For The Win"). I suspect that there may be some significant in-sample bias here (that is to say, I think the data may have come from the period when Brett Kavanaugh's confirmation hearings had gone from sad to disturbingly ridiculous).
But a much more amusing algorithmic outcome is the list of words most indicative of "true," but absurd news.
They are "Florida," "Cops," and "Arrested."
Ah, Florida Man, is there nothing you can't fuck up? Even A.I., apparently.
Godspeed, Florida Man, Godspeed.
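
For the curious, here is a minimal sketch of how a "most indicative words" ranking like the one in the article can fall out of nothing more than word counts. To be clear about the assumptions: this is not the author's code or model, the headlines are toy examples I invented for illustration, and the scoring (a smoothed log-odds ratio of word frequencies, with hypothetical helper names `tokenize`, `word_counts`, `log_odds`, and `ranked_words`) is just one simple stand-in for whatever weights a real classifier would learn.

```python
import math
import re
from collections import Counter

# Invented toy headlines for illustration only; NOT data from the Medium article.
SATIRE = [
    "incredible senator wins staring contest ftw",
    "incredible dog elected mayor of small town",
    "area man declares nap a success ftw",
]
ABSURD_BUT_REAL = [
    "florida man arrested after riding alligator",
    "cops say florida woman arrested over pizza dispute",
    "man arrested after calling cops on himself",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def word_counts(headlines):
    """Count how often each word appears across a list of headlines."""
    counts = Counter()
    for headline in headlines:
        counts.update(tokenize(headline))
    return counts


def log_odds(class_a, class_b, alpha=1.0):
    """Smoothed log-odds per word: positive leans class_a, negative leans class_b."""
    a, b = word_counts(class_a), word_counts(class_b)
    vocab = set(a) | set(b)
    # Add-alpha smoothing so words unseen in one class don't produce log(0).
    total_a = sum(a.values()) + alpha * len(vocab)
    total_b = sum(b.values()) + alpha * len(vocab)
    return {
        w: math.log((a[w] + alpha) / total_a) - math.log((b[w] + alpha) / total_b)
        for w in vocab
    }


def ranked_words(scores):
    """Words sorted from most class_b-like to most class_a-like (ties alphabetical)."""
    return [w for w, _ in sorted(scores.items(), key=lambda kv: (kv[1], kv[0]))]


scores = log_odds(SATIRE, ABSURD_BUT_REAL)
ranking = ranked_words(scores)
print("leans 'real but absurd':", ranking[:3])
print("leans satire:", ranking[-3:])
```

The point of the toy is the same as the point of the post: a ranking like this only tells you which words happened to be lopsided in the sample. Feed it a few months of headlines dominated by one news story, or one state, and those words float straight to the top.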