The Internet Archive, Open Knowledge, and the History of Everything

Posted by Marbenz Antonio on March 31, 2022

Digital storage is both the most fragile and the most robust medium ever devised. On a hard drive, a change in the magnetization of a few tiny bits can wipe out content permanently. Furthermore, anyone who gets into trouble over something on their website or social media can delete the embarrassing evidence with a few keystrokes. However, the ability to produce digital copies at minimal cost allows material to be duplicated and kept in secure locations. The Internet Archive uses this second characteristic of digital material to preserve the history of the web, and more.

When the Internet Archive was founded in 1996, most people had had access to the internet for only a few years. Computer expert Brewster Kahle could already see that historical material was being lost, so he founded the archive. Its crawlers currently visit roughly 750 million pages every day, with each site potentially holding hundreds or thousands of distinct web pages. At the time of writing, the archive is estimated to hold 552 billion web pages, and it contains a lot more than just websites. This article looks at the Internet Archive’s accomplishments and what it has to offer both scholars and ordinary computer users.

Another facet of free information is the online sites that provide original content, which writers frequently use while researching pieces like this one. Wikipedia, which celebrated its 20th anniversary on January 15, 2021, is the superstar of these free sites. Although Wikipedia's material is original, it makes extensive use of references and cautions readers against treating it as a primary source. Furthermore, Wikipedia’s content and pictures are licensed under the Creative Commons Attribution-ShareAlike License, the GNU Free Documentation License, or both. As a result, the information frequently appears on other websites.

Lost in the Mists of Time

The internet’s key characteristic is how easily its content can change or vanish. The United States Supreme Court has not learned this lesson: the justices and their staff often reference websites in their decisions. According to researchers, nearly half of these links are broken and return the standard 404 error response. That means we can no longer examine the evidence the justices used to make such important decisions.

News sites, academic researchers, and anybody else who makes use of the web’s core feature, the ease of linking to other sites, face the same risk of losing accountability. The issue isn’t limited to sites that have gone 404 (disappeared). It also applies to sites that change their material after you’ve made a point based on the previous content. This is why savvy commentators who use other people’s site material or social media posts to make a point post screenshots of the content as it stood at the time.

Amber, a project of Harvard’s Berkman Klein Center for Internet & Society, offers a more organized approach to preserving the past. Amber makes it simple to save a copy of a web page while you’re viewing it. Amber does, however, have a basic requirement: a web server on which to keep the material. Most of us use third-party online services and do not have the permissions needed to save pages there. Harvard therefore offers Perma.cc, a form of “Amber as a Service,” where anybody can store a web page in its present state and create a URL that others can refer to later. As a bonus, Drupal.org also lets you save pages using Amber. The Internet Archive maintains a copy of Perma.cc.

To see how common the problem of broken links is, I searched through one of my own articles, picking one that was fairly substantial and was published four years before the research for this Internet Archive article. The article was published with 43 links, and only four years later, seven of them were broken.
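A link audit like the one described above can be automated. The following is only a minimal sketch under stated assumptions, not the method used for this article: the function names (http_status, find_broken_links) are hypothetical, and it treats any response outside the 2xx and 3xx ranges, or any failed request, as a broken link.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails outright."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # for example, 404 for a page that has disappeared
    except (OSError, ValueError):
        return 0  # DNS failure, refused connection, timeout, or malformed URL


def find_broken_links(urls: Iterable[str],
                      get_status: Callable[[str], int] = http_status) -> list[str]:
    """Return the URLs whose status is neither a success (2xx) nor a redirect (3xx)."""
    return [url for url in urls if not 200 <= get_status(url) < 400]
```

Passing the status checker in as a parameter makes it possible to try the logic without touching the network. Note that some servers reject HEAD requests, so a more careful checker would retry with GET before declaring a link dead.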

The Internet Archive comes to the rescue here. Because it doesn’t throw anything away, you can view a website as it appeared at many points in the past. Let’s look at how to get old pages back. This is done using the Wayback Machine, the Internet Archive’s search interface.

Suppose that one of the links on this page has gone 404. The following steps will take you to the content.

  1. Look at the source code of this web page to find the original URL you wish to visit.
  2. Go to the Wayback Machine.
  3. In the search box, type the URL.
  4. The page returned by the Wayback Machine displays the dates on which it archived the original page. Click on any of those dates to see the page as it appeared then. Be patient, because the site can load slowly. An archive has the luxury of taking its time.

You can also skip the visual interface and search for the page manually, but that is a more complex topic that I won’t go into in depth here. To ensure that a web page is saved in its present state, you can use the save-page-now feature. There’s also a file upload option.
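For curious readers, here is a rough sketch of what looking up a page without the visual interface might involve. It assumes the Wayback Machine's public availability endpoint at archive.org/wayback/available and the response layout (archived_snapshots, closest, available, url) as I understand them from the archive's documentation. It also assumes that save-page-now can be reached at web.archive.org/save/ followed by the URL. All function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_query(url: str, timestamp: str = "") -> str:
    """Build the lookup request for url; timestamp (e.g. YYYYMMDD) is optional."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_API + "?" + urllib.parse.urlencode(params)


def closest_snapshot(response: dict):
    """Return the archived URL of the closest snapshot, or None if there is none."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(url: str, timestamp: str = ""):
    """Ask the archive for the snapshot of url closest to timestamp (uses the network)."""
    with urllib.request.urlopen(availability_query(url, timestamp), timeout=30) as reply:
        return closest_snapshot(json.load(reply))


def save_page_now_url(url: str) -> str:
    """Address that, by assumption, asks the archive to capture url as it is now."""
    return "https://web.archive.org/save/" + url
```

Calling lookup("example.com", "20060101") would return the address of the capture nearest to January 1, 2006, if the archive has one. The request building and the response parsing are kept separate from the network call so that each part can be checked on its own.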

By my estimate, more than 250 of my articles and blog posts have vanished from various websites. Some articles could be recreated from preserved drafts, while others turned up through searches in unusual locations such as mailing list archives. However, the Internet Archive almost certainly has them all. Whenever I decide that one is worth keeping, I recover it and post it on my own website.

You probably don’t agree with everything on the internet, so you won’t agree with everything in the Internet Archive. Remember that everything people publish on the internet, no matter how offensive, might be useful to historians and scholars. To comply with takedown laws, the Internet Archive has a copyright policy comparable to those of social networking sites.

When reviewing this article, Brewster Kahle, the Internet Archive’s Founder and Digital Librarian, said:

“The pandemic and disinformation operations have demonstrated how reliant we are on reliable and high-quality information available online. These are the functions of a library, and we are pleased to assist in any way possible.”

In Praise of Brute Force Computer Algorithms

How can the Internet Archive regularly capture the current state of a medium that is several orders of magnitude larger than anything that has come before it?

The solution is straightforward: they apply the same brute-force strategies as search engines. The Internet Archive goes page by page through the web, trying to find everything it can. To save everything it discovers, the archive has leased huge storage space.
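The page-by-page approach can be illustrated with a toy breadth-first crawler. This is a simplified sketch and not the Internet Archive's actual crawler: the crawl function and its fetch_links parameter are hypothetical, and a plain dictionary stands in for the web so that the example runs without a network connection.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], Iterable[str]],
          max_pages: int = 1000) -> list[str]:
    """Visit every page reachable from seeds, breadth first, up to max_pages."""
    queue = deque(seeds)
    seen = set(queue)
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)  # a real archive would store the page content here
        for link in fetch_links(page):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


# A four-page stand-in for the web: each page maps to the pages it links to.
toy_web = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": []}
print(crawl(["a"], lambda page: toy_web.get(page, [])))
```

Each page is visited once and each link is examined once, so the work grows in direct proportion to the size of the web being crawled. That is the sense in which the approach is brute force.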

Programmers are always looking for ways to avoid brute-force approaches, which have a complexity of O(n) and can be scaled up only by spending proportionally more computing power. However, there are situations where brute force is the best option.

Graphical processing, for example, involves reading a large amount of data about an image and applying algorithms to each pixel. This is why few programs could perform graphical processing until affordable hardware was designed to meet the specific demands of these applications: the now-ubiquitous graphics processing unit, or GPU.

Modern machine learning is another area where brute force prevails. The underlying concept dates back to 1949, when digital computing was still in its infancy. For decades, artificial intelligence researchers were fascinated by the neural network, but after much research and sweat, it was branded a failure. Then processors (including GPUs) became fast enough to run the algorithms in a reasonable length of time, and virtual computing and the cloud made computing power almost unlimited. Machine learning is now being used to solve classification and categorization problems all around the world.

A word about limitations: web crawling misses a lot of what we view on the internet every day. The Internet Archive will not go behind paywalls, which hide a lot of journalistic and scholarly information. Because the crawler cannot submit forms, it cannot capture what users see on dynamically generated web pages such as those on retail websites.

Beyond the Web

The loss of culture is woven into the fabric of history. The following are some of the disasters that we still mourn:

  • After Spain conquered the Maya in Central America in the 1500s, a single Spanish bishop ordered the destruction of all Mayan cultural and religious documents. The few codices that have survived reflect a complex philosophical inquiry that we will never be able to fully understand.
  • Invading Mongols burned Baghdad’s library in 1258, an act of gratuitous destruction that accompanied their conquest of the city. This shattered a fruitful legacy on which medieval Europe’s intellectual revival was built.
  • The destruction of Alexandria’s ancient library appears to have occurred over several centuries. Kahle took inspiration from this library when he founded the Internet Archive.

Add to these tragic events the destruction of ancient architecture (often dismantled by local residents looking for cheap building materials), the extinction of entire languages (each one taking with it not only a culture but also a unique worldview), and the disappearance of poems and plays by Sappho, Sophocles, and others who shaped modern literature.

Many megabytes of data were locked away in corporate data centers long before the internet. Their owners must have recognized that data could be lost whenever organizations moved to new computers, databases, and formats. When software suppliers go out of business, customers are left with their material in opaque, proprietary formats. People today have priceless memories stored on physical media that few devices can still read. As a result, our data is slipping from our grasp.

Although the Internet Archive’s terms of service emphasize its importance to scholars, it provides fantastic tools that anybody can use. It has a book lending service that appears similar to what other libraries now offer. It also provides a section for children with educational materials, as well as dedicated collections of music, photos, videos, video games, and historic radio broadcasts.

