News publishers are blocking the Internet Archive’s Wayback Machine to stop AI companies from using it

News Publishers Blocking the Internet Archive’s Wayback Machine

News publishers are blocking the Internet Archive’s Wayback Machine to stop AI companies from using it for training language models without permission or payment.

Background

The Internet Archive, established in 1996, has preserved over a trillion web pages, serving as a vital resource for journalists, historians, and researchers. Courts often cite its archives, and it's considered one of the most significant public information infrastructure projects of the digital age.

The Issue

News organizations including The New York Times, CNN, USA Today, and The Guardian are among 241 news sites across nine countries that restrict access to the Archive’s crawlers. The blocking is a response to AI companies using archived news content for model training without authorization.

Key Points:

  • Blockage Extent: According to Originality AI, 23 major news publications block the main web crawler used by the Internet Archive (ia_archiverbot). In total, 241 sites across nine countries disallow at least one of the Archive’s crawling bots.

  • Major Publishers Involved: USA Today Co. accounts for a significant share of the blocking, which extends to its numerous local publications. The New York Times implemented a 'hard block' in late 2025.

  • News Organizations' Argument: Publishers argue that AI companies' use of archived content for training violates copyright law and undermines their business models.
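
As a rough illustration of what "disallowing" a crawler can mean in practice, the sketch below checks a robots.txt file for a rule that blocks a given bot, using Python's standard urllib.robotparser module. The robots.txt body and the helper name is_blocked are hypothetical and written only for this example, the crawler name is the one quoted above, and nothing here is taken from a real publisher or from Originality AI's actual methodology.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, invented for illustration only.
# The crawler name is the one quoted in the article.
SAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiverbot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if the robots.txt text disallows user_agent from url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


print(is_blocked(SAMPLE_ROBOTS_TXT, "ia_archiverbot"))  # True
print(is_blocked(SAMPLE_ROBOTS_TXT, "SomeOtherBot"))    # False
```

A robots.txt rule is only a request that well-behaved crawlers honor, so a check like this would not detect blocking enforced at the server or network level.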

Historical Perspective

The Internet Archive's Wayback Machine has been an invaluable resource for:

  • Journalists verifying article edits.
  • Historians studying historical context.

However, its role in AI model training raises copyright concerns for news publishers.

Quotes:

“The issue is that Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us... The work should not be used without our permission.” - a New York Times spokesperson

“A lot of these AI businesses are looking for readily available, structured databases of content. The Internet Archive’s API would have been an obvious place to plug their own machines into.” - Robert Hahn, Head of Business Affairs at The Guardian