Steal this article, bots. I dare you


Graphic by Sean Mullins.

I recently discovered the shocking truth that bots steal articles from the Webster Journal. Naturally, I opted for the only logical response: write an article about it, which will inevitably be stolen itself.

On Monday, October 10, I was browsing article comments on the Journal website’s dashboard (because that’s what you do when you have no life) and discovered several pingbacks. WordPress sends us these notifications when other websites link to our articles. However, these articles were not shared. They were copied and pasted by fraudulent websites or, to put it colloquially, “borrowed indefinitely.”

This phenomenon is the work of scrapers, bots that scour websites and search engines for digital content to repost. We only noticed these stolen articles because, in their infinite wisdom, the bots accidentally copied embedded links to the originals. Both veteran and new writers in each section fell victim to scrapers; apparently the bots really liked my review of dating apps, because it received pingbacks from four unique imposters.

Seeing these counterfeit articles was like a real-world version of the “can I copy your homework” memes. They differed from the originals through typos (“Dine and Discuss” became “Dine and Discus”) or awkwardly swapped synonyms (“Students take hands-on learning to a new level” became “Students take study to a whole new stage”). Notably, my review of “Shovel Knight Dig” was mistranslated into German by Google, resulting in this S-tier tweet with one like.

Photo by Sean Mullins. The Journal’s WordPress website received pingbacks from news-scraper bots plagiarizing student articles, with the article titles slightly altered.

We were lucky that the scrapers were dumb enough to embed links, but who knows how long they’ve been targeting us without our knowledge? How many of our articles have been hijacked by malware-infested sites? For all I know, there is a bootleg version of my “Deltarune: Chapter 2” review that downloads Spamton G. Spamton to your hard drive when clicked. I had to investigate further.

According to a MediaShift report by Rami Essaid, 2015 was the first year in which more web traffic came from bots than humans, with around 59% of website visits automated. Essaid classified 36% of web traffic that year as “good bots,” including search engines and social media aggregators that benefit websites, while “bad bots” like news scrapers accounted for the other 23%, looting their way across the internet.

“[Bad bots] can drain visitors from a site, hurt their SEO rankings, and cost advertising revenue,” Essaid said. “And because of all the bandwidth bad bots use to steal content, they can slow down page loads, annoy human visitors, and further hurt search engine rankings.”

Photo by Morgan Smith. A sign inside room 116 of Sverdrup Hall, where Journal staff write, edit, present and publish articles.

From stealing clicks to collecting data, it only takes scrapers minutes to steal weeks of honest work. This is not limited to text like Journal articles; image and video content can also be plagiarized. Essaid noted that bot prevention software can fix problems on an individual level, but solving this systemic problem in digital publishing is implausible when laws like the Digital Millennium Copyright Act aren’t effectively enforced.

Even large publications are not immune to scrapers. After sleazy website Newsbuzzr published pirated versions of her articles, HuffPost senior reporter Jesselyn Cook shared her journey down the rabbit hole of bot plagiarism. Reading her article felt like a beat-by-beat description of everything I saw in our pingbacks: randomly placed synonyms that break otherwise coherent sentences, clearly dangerous scam websites, occasional links to the original articles and so on.

Cook’s research led her to multiple websites that stole content from every major publisher imaginable, from The New York Times to Wired, and monetized it with revenue programs like Google AdSense. Although Google later informed Cook that Newsbuzzr had been blocked from AdSense for violating its quality guidelines, these sites abused a faulty system that relies on manual responses instead of proactively fighting scrapers. Therein lies the motive for article theft: profit.

“On the surface level, this rip-reword-repost operation is a creative little scam (and yes, it produced some really great ‘Florida Man’ content),” Cook said. “But it’s alarming that it’s obviously profitable to scavenge content for ad traffic in this way, and the diagram illustrates how skewed the economic incentives of our click-driven media industry are.”

To learn more about how scrapers work and what solutions exist, I reached out to webmaster, writer, and plagiarism/copyright consultant Jonathan Bailey. Scraping operations emerged during the rapid growth of the internet in the early 2000s, when Bailey saw his own journalistic writing frequently plagiarized. This prompted him to share “techniques for detecting and stopping the misuse of online content” when he launched his website, Plagiarism Today, in 2005.

Starting in February 2011, Google dealt scrapers a serious blow with multiple algorithm updates, including Google Panda, which promotes high-quality websites with original and well-researched content. Although scrapers temporarily became less common after these changes, they never completely disappeared. Bailey explained that scrapers have recently made a resurgence as they have become cheaper and more convenient to run. They’re designed to be as hands-free and effortless as possible, which makes them cost-effective.

“Even if it doesn’t work 99.9% of the time, it only takes creating 1,000 websites hosting spun content to see some success. Spam, all kinds of spam, is always a numbers game and the numbers have just started to favor these sites more,” Bailey said.
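To put rough numbers on Bailey’s point, here is a minimal sketch in Python. It assumes, purely for illustration, that each of the 1,000 sites fails independently and with the same 99.9% probability; real spam sites are unlikely to behave that neatly, and the function name is hypothetical rather than anything Bailey described.

```python
def chance_of_any_success(sites: int, failure_rate: float) -> float:
    """Probability that at least one site succeeds.

    Assumption (not from the article): every site fails independently
    with the same failure_rate.
    """
    return 1.0 - failure_rate ** sites


# Figures taken from Bailey's quote: 1,000 sites, 99.9% failure each.
SITES = 1000
FAILURE_RATE = 0.999

p_any = chance_of_any_success(SITES, FAILURE_RATE)
expected_successes = SITES * (1.0 - FAILURE_RATE)

print(f"Chance at least one site succeeds: {p_any:.1%}")
print(f"Expected number of successful sites: {expected_successes:.2f}")
```

Under those assumptions, a spammer running 1,000 near-worthless sites is more likely than not to see at least one of them pay off, which is why cheap automation tilts the numbers game in their favor.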

Because they are so customizable, scrapers use a wide range of tactics, from searching for keywords to tracking RSS feeds. Some scrapers copy individual pages and run them through synonym generators, while others combine phrases from different pages into what Bailey described as “Frankenstein articles that read like inconsistent goop.” Monetization methods also vary, from AdSense earnings to promoting other suspicious websites with links, or even advertisements from other scammers.
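A synonym-spun headline still tends to share most of its word order with the original, which suggests one rough way to spot it. The sketch below compares two headlines word by word using `difflib` from Python’s standard library. The helper name and the unrelated headline are made up for illustration, and this is not how WordPress, Google or Bailey detect scraped content.

```python
from difflib import SequenceMatcher


def word_similarity(original: str, suspect: str) -> float:
    """Return a 0-to-1 score for how much two texts overlap, word by word."""
    original_words = original.lower().split()
    suspect_words = suspect.lower().split()
    return SequenceMatcher(None, original_words, suspect_words).ratio()


# Headline pair quoted earlier in this article.
original = "Students take hands-on learning to a new level"
spun = "Students take study to a whole new stage"
# Hypothetical unrelated headline, invented for comparison.
unrelated = "Campus parking rates rise again this fall"

spun_score = word_similarity(original, spun)
unrelated_score = word_similarity(original, unrelated)

print(f"Spun copy:      {spun_score:.2f}")
print(f"Unrelated text: {unrelated_score:.2f}")
```

The spun headline scores far higher than the unrelated one because the swapped synonyms leave the surrounding words in place. A heavily reworded “Frankenstein” article would likely slip past a check this simple.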

It was unsettling to hear that scammers could profit from the work of student journalists. They’re not even good enough to be my imposters. However, Bailey assured me that they are probably scraping the bottom of the barrel. The few companies that advertise on scam websites pay very little, and any meager income likely goes to someone higher up the chain.

“I’m convinced that [scrapers] make an ill-gotten living, but they’re usually middlemen, either for clients wanting unscrupulous SEO or for advertisers peddling questionable products,” Bailey said. “These groups are probably doing a lot better.”

So what can digital publishers do to fight back? Anyone can file copyright notices with Google or web hosts, and Google regularly updates its algorithms to demote low-quality websites that irritate search engine users. That said, Google itself is occasionally fallible, sometimes mistaking scrapers for the original source and penalizing their victims.

One reason creators might not pursue scrapers is that fighting them takes time or money, both of which are much scarcer at smaller publications. Bailey advises devoting these resources to a targeted approach; while hunting down every scraper is unreasonable and inefficient, focusing on copycat articles that rank higher in Google than the originals yields the most valuable targets with less effort.

Photo by Morgan Smith. A laptop belonging to Sean Mullins, editor of the Journal, on which this article was written.

Having your hard work plagiarized can be demoralizing, although the thieves probably can’t afford a Kraft single on stolen website traffic. I imagine this won’t be the last time scrapers target the Journal’s next generation of staff, and while I highly doubt I’ll pursue journalism after graduation, copyright theft is widespread in all media fields. Luckily, Bailey left me with words of encouragement from writer to writer.

“As corny as it sounds, the best advice I can give is don’t let it get to you. Understand that it exists and there may be times when you have to deal with it, but don’t overdo it. Remember, it’s not personal, it’s done by bots, and it’s a problem with the internet itself, not you or your job,” Bailey said. “Just keep a pragmatic view, and you should be fine.”

Sean Mullins (she/they) is the Journal’s opinion editor and webmaster. She majored in media studies and minored in professional writing at Webster University, but has been involved in student journalism since high school, having served as a gaming columnist, blogger, and cartoonist for the Webster Groves Echo at Webster Groves High School. Her passion is writing and editing stories about video games and other entertainment media. Outside of writing, Sean is also the treasurer of the Webster Literature Club. She enjoys playing games, hanging out with friends, LGBTQ+ and disability advocacy, streaming, making terrible puns, and listening to music.