Did You Enjoy Your Ephemeral Web Experience?
Consequences and considerations from the Internet Archive outage.
The Internet Archive has been down since October 9th. At the time of writing, a provisional read-only version of the Wayback Machine is online.
When archive.org first went down, visitors were shown a prompt with this message:
“Have you ever felt like the Internet Archive runs on sticks and is constantly on the verge of suffering a catastrophic security breach? It just happened. See 31 million of you on HIBP!”
“HIBP” refers to the breach notification service Have I Been Pwned, which is worth checking against every email address you use. At the very least it’s worth making sure you’re not re-using any passwords from previous breaches.
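For those who would rather script the password check, here is a minimal sketch in Python. It assumes the Pwned Passwords "range" endpoint that Have I Been Pwned publishes, which this article does not otherwise cover. Only the first five characters of the password's SHA-1 hash are sent, and the rest of the comparison happens locally. The function names and the User-Agent string are made up for illustration.

```python
import hashlib
import urllib.request

# Assumption: the public Pwned Passwords range endpoint (not mentioned in the article).
RANGE_URL = "https://api.pwnedpasswords.com/range/"


def split_sha1(password: str) -> tuple[str, str]:
    """Return the SHA-1 of the password as (first 5 hex chars, remaining 35)."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_response(suffix: str, body: str) -> int:
    """Look for our hash suffix in a response made of 'SUFFIX:COUNT' lines."""
    for line in body.splitlines():
        candidate, _, count = line.partition(":")
        if candidate.strip().upper() == suffix:
            return int(count.strip())
    return 0  # suffix not listed: no known breach contains this password


def pwned_count(password: str) -> int:
    """Ask the service how many times the password appears in known breaches."""
    prefix, suffix = split_sha1(password)
    request = urllib.request.Request(
        RANGE_URL + prefix,
        headers={"User-Agent": "ephemeral-web-example"},  # hypothetical name
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return count_in_response(suffix, response.read().decode("utf-8"))
```

Calling pwned_count("some password") needs a network connection. Any result above zero means the password has appeared in a breach and should be retired.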
For many, the outage meant already experiencing an ephemeral web where the current state is the only state. The old adage “the Internet never forgets” relies on a variety of assumptions and compromises. The truth is, a great deal of information disappears from the web on a very regular basis. Projects terminate, companies rebrand or change, and other sites just disappear without a trace. In the hope of preserving crucial information, various online archives have been used. It takes an immense amount of resources to even attempt to store a record of the web’s state: the Internet Archive allegedly holds 160 petabytes of data. This is something that very few can even consider taking on. This difficulty means that the web will always have ephemeral qualities to a degree, and we have to consider how best to navigate that environment.
Navigating The Ephemeral Web
Have you ever been in a chat room with disappearing messages? Maybe you should try one. The web as it exists is very similar. Every site and service changes over time. In some ways this is not necessarily bad: not every byte of data on the web is inherently valuable or desirable. Some people may be relieved that their digital footprint fades over time. But there are serious and troubling concerns as well. As an information management system, the World Wide Web is arguably one of the wonders of the world. Any decision to deliberately destroy, conceal, or manipulate critical information is itself a massive problem. Beyond mere entropy, there will always be concerted efforts to keep particular information from being used to counter various interests.
The deliberate removal of information online can be as simple as platforms removing content from their sites. It could also be institutions breaking links to previously available reports and documents, or even the destruction of entire sites. It takes a great deal of effort to preserve information across time. That effort isn’t to be taken for granted. Online information destruction certainly isn’t new, but there is a troubling and accelerating trend. One of the more significant warnings is episode 384 of the Corbett Report’s podcast, “The Library of Alexandria is on Fire”, where James Corbett gives some excellent advice on overcoming this problem:
Refuse cloud-confined devices
Preserving information begins with having the storage devices themselves under your own control.
Move away from controlled platforms
Relying on online services to protect information is doomed in the long run.
Save EVERYTHING!
As far as your own information management is concerned: if it’s not on your own machine, it doesn’t exist.
It’s worth reiterating that information management is its own (very worthwhile!) skill. There are a variety of things you’ll want to do on top of preserving information, but saving it is always the first step. Never feel bad about merely saving information, but here are some things worth considering after you have:
Indexing: Is your information stored in an easily searchable way?
The easiest way to accomplish this is to use very descriptive file names, so that they can be searched with your operating system, regardless of media type.
Conversion: Sometimes you want information in a different format. Converting is useful to save space and stay current.
Adding Metadata: There’s a lot of useful information about saved content that’s worth keeping, such as transcriptions, summaries, and citations.
Sharing: Why keep it all to yourself? Sharing is caring and can help better protect information by increasing the number of copies in existence. Remember, hoarding is not preserving, but it’s a good start!
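The indexing advice above can be sketched in a few lines of Python. This is only one possible naming scheme, assumed for illustration: the date saved, the source domain, and a slug of the title. The function name and the 80-character slug limit are arbitrary choices, not a standard.

```python
import re
from datetime import date
from urllib.parse import urlparse


def descriptive_filename(title: str, url: str, saved_on: date, extension: str) -> str:
    """Build a searchable file name: date, source domain, then a title slug."""
    domain = urlparse(url).netloc.lower()
    if domain.startswith("www."):
        domain = domain[4:]
    # Collapse anything that is not a letter or digit into single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    slug = slug[:80].rstrip("-")  # keep names comfortably short (arbitrary limit)
    return f"{saved_on.isoformat()}_{domain}_{slug}.{extension.lstrip('.')}"


name = descriptive_filename(
    "Did You Enjoy Your Ephemeral Web Experience?",
    "https://www.example.com/p/post",  # placeholder URL
    date(2024, 10, 14),
    ".html",
)
print(name)  # 2024-10-14_example.com_did-you-enjoy-your-ephemeral-web-experience.html
```

Leading with the date keeps files in chronological order when sorted by name, and the domain and title words make them findable with any operating system's built-in search.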
But why do we need information on our own machines if the Internet Archive exists? If we can’t hope to compete with the 160PB behemoth, why bother duplicating data ourselves?
Concerns
While it’s been confirmed that the entire Internet Archive should be back online in “days not weeks”, there are some serious questions to ask in the meantime. At the very least, up to 31 million email addresses have been breached, from a population that is more likely than most to include journalists and researchers (independent and mainstream alike!). This is an important reminder that when doing sensitive work, you should isolate online identities from each other by using completely different emails. It should go without saying that any information that can be used, will be.
The outage itself is a serious concern. Not only is this an election year, but many high-stakes events that benefit from a public information trail are currently taking place. The (hopefully) temporary outage of the Internet Archive not only temporarily memory-holes the existing record, but potentially prevents current events from ever being properly preserved. This may have profound implications for the future.
Far more troubling than the outage is the nature of the hack itself. From the breach data, we know that those who attacked the Internet Archive had access since September 28th. There is a great deal of damage that attackers could do in even a single day without being detected. Even worse, because the Internet Archive is a record of events, a mere rollback to a previous state is insufficient to reverse serious inside attacks. Independent tech journalist Bryan Lunduke raises the serious concern that these attackers may not be the only ones attacking the Internet Archive, merely the ones that we know about.
It is difficult to overstate the implications of the Internet Archive potentially being manipulated in real time. Attackers with high-level access could absolutely manipulate records in difficult-to-address ways. We have assurances from staff that the data is safe, but it is an open question how confidently they can make those assurances.
What Makes The Internet Archive Great?
Despite all this, should the Internet Archive return, there is no reason to abandon the cause entirely. Imperfections are themselves noteworthy, but never a reason to completely disregard any particular tool. For what it is, the Internet Archive is a phenomenal achievement and an incredibly useful service for a wide variety of people and purposes.
Trust
As a long-standing archive, the Internet Archive has garnered a great deal of trust. For many, the site is a go-to resource, and it has partnered with many organizations like libraries and other institutions. This trust is a big part of the reason why citing the Wayback Machine is very often considered as good as citing the original link itself. Alternatives may not have the same level of trust, and it may be difficult to rely on them.
Scale, Depth & Diversity
The size alone of the Internet Archive is formidable. Beyond merely archiving versions of the web, it also includes a massive amount of media from all over the world. The cultural and historical relevance of that information is its own treasure not to be taken lightly. No radical new approach will even begin to reach the same size anytime soon.
Age
Founded in 1996, the Internet Archive has a massive head-start over other projects. Despite present circumstances, almost 30 years of being operational is nothing to scoff at. Catching up to 30 years of information at such a scale is an enormous task.
Limitations of The Internet Archive
Despite these strengths, there are reasons why one should consider alternative archives for particular purposes. For some situations, these limitations may be deal-breakers; for others they may be fairly trivial. In any case, it is still best to maintain your own archives of important information, for its own sake and potentially because of the following.
Centralized
The Internet Archive is a massive target. From now until it stops existing, it will always have to contend with a wide variety of threats, from cyber threats like this incident to financial and legal troubles. The institution itself is vulnerable to being pressured by a variety of forces that all have their own agendas. This is not to say that a decentralized internet archive is some kind of panacea. Such an undertaking would absolutely have its own challenges and trade-offs.
Removals
There are primarily three ways that information is deleted from the Internet Archive:
Copyright Infringement:
Since March 2001, their copyright policy states:
The Internet Archive respects the intellectual property rights and other proprietary rights of others. The Internet Archive may, in appropriate circumstances and at its discretion, remove certain content or disable access to content that appears to infringe the copyright or other intellectual property rights of others.
Deletion Requests:
The Internet Archive honors deletion requests as well, which are reviewed by their team. There are many reasons why a request would be honored, such as removing personal information, but these can quite plausibly include a wide variety of other requests as well.
/robots.txt
Every website can forbid the Wayback Machine from recording information through its robots.txt file. This file is generally used to suggest to search engines and other bots what information is supposed to be indexed. Not only does the Wayback Machine honor this mechanism, but it also applies it retroactively, meaning that it is relatively simple for any website operator to “memory-hole” their own specific content, even on a per-item basis.
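To make the mechanism concrete, here is a small sketch using Python's standard urllib.robotparser module to evaluate a made-up robots.txt. The user-agent name ia_archiver is an assumption: it is the name commonly associated with the Wayback Machine's crawler, but check current documentation before relying on it. The example rules and URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one named crawler entirely,
# and keep every other bot out of /private/ only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether the given robots.txt lets this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


report_url = "https://example.com/reports/2024.html"  # placeholder URL
print(is_allowed(EXAMPLE_ROBOTS_TXT, "ia_archiver", report_url))   # False
print(is_allowed(EXAMPLE_ROBOTS_TXT, "SomeOtherBot", report_url))  # True
```

Two lines in a text file are enough to exclude a crawler that honors the convention. The file is only a request, though: nothing technically stops a bot, or your own archiving tools, from ignoring it.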
A Call For Innovation
The greatest challenge with independent archives is credibility. Anyone can create their own modified mirror of a web site and claim it’s an honest and true archive. If the operators of the site deny its authenticity, how can others be sure? It’s a reasonable question whether any archive can be trusted on a long enough time frame. One can even argue that generative AI tools can make any and all information falsifiable. Reputation can’t be trusted, because it can be sold as a commodity like anything else.
Credibility, however, is not the only reason to preserve information. The truth is that there are many reasons to take on the task even in such conditions. In many scenarios, the information itself would otherwise be lost to time forever. Preservation can very often be its own end for a wide variety of reasons. When aiming to hold powerful institutions accountable, very often the information can be corroborated by other facts.
Trying to independently archive the entire Web is effectively an impossible task. The goal should never be to preserve the totality of information online, but merely what is important. Defining what is important is its own massive challenge, but ultimately it boils down to a personal choice on how to use resources available. It is clear that more tools for independent information management are needed. Either building those tools yourself, or teaming up with others to cooperate on them would be a massive step in the right direction. Indexing and automation are both viable avenues to provide powerful tools for independent researchers.
We can hope that for as long as the digital landscape remains somewhat free, there will be those willing to invest in building specialized tools that can begin to level the playing field. If we are not ardent in our support for a better technological future, we may end up with the ones imposed on us via governments and/or corporations.
Tools to Consider
Other Online Archives:
These are definitely worth considering, even in parallel with the Wayback Machine.
Archiving Tools
Courtesy of the Awesome-Selfhosted list:
Webrecorder, a suite of useful tools for maintaining one’s own web archives. Webrecorder aims for accuracy that may not be matched by other tools.
SingleFile, a browser add-on to save pages “as a single .html file”, which can be a handy alternative to printing a page as a PDF.
yt-dlp for downloading videos & audio
For those who want to go pro, Archive Team is a great place to get started.
Don’t forget your backups!
Any data that isn’t backed up is considered to be in a superposition of accessible and gone forever. It’s there until it just isn’t. Since the goal is to preserve information, it’s crucial to learn how to protect information now that you’ve started collecting it.
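One way to find out whether a backup is still intact is to record a checksum for every file and re-check it later. The sketch below is a minimal illustration in Python using SHA-256. The function names are made up for this example, and it is a starting point rather than a complete backup strategy.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large media files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its SHA-256 checksum."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """List files from the manifest that are now missing or no longer match."""
    current = build_manifest(root)
    return sorted(
        name for name, checksum in manifest.items() if current.get(name) != checksum
    )
```

Build the manifest when you make a backup, store it somewhere separate from the archive, and run changed_files against each copy from time to time. An empty list means every recorded file is still present and unmodified.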
PSAs like this are included in Cyber Fix, my bi-monthly tech news roundup, to support this project. Paid subscribers get access to full episodes. Please consider becoming a supporter to help make this project viable.