This past week the German Bundestag (parliament) published a law (passed in 2006, but not in effect until published in its final form) mandating that all German websites deliver a copy of all digital content (text, photos, sound, and any other multimedia content) to the National Library in Leipzig, the German equivalent of the U.S. Library of Congress. German companies protested the law throughout the legislative process, arguing that it placed an undue burden on them and would result in enormous compliance costs. Not only is the law itself interesting – a state mandate to archive the internet – but so is how the library intends to preserve the content. The library has asked that all website content be submitted in one of two formats: as a PDF file, or, if the content stretches over multiple pages, such as a multi-page HTML site, as a ZIP archive containing all of the related files. This last bit alone raises so many questions – which files need to be included, and how often do companies need to resubmit their content? Every time the website is updated?
One exception to the law is content generated by private citizens for private use. But this raises a whole other set of questions, such as what counts as “private” on the internet, a space that is by design “public.” Pundits have been quick to point to the gray area of weblogs – are they private or are they public? If companies are supposed to archive all of their content, then what happens to “private” blogs hosted by for-profit companies like Blogger or even Facebook? Theoretically, companies that don’t comply will be served a letter of warning, followed by a fine of up to 10,000 euros for each act of non-compliance. For the moment, the National Library has stated that it will not enforce the statute until it has fully assessed its ability to store all of the data that will be flowing its way.
In a related story out of Europe this week, the European Union has decided to take on the Google Books project by digitizing the contents of Europe’s largest libraries, museums, archives, and film studios and placing this content online. The first incarnation of “Europeana” should launch on November 20th. The European Commission, which is coordinating but not actually carrying out the project, hopes that the new website will become a clearinghouse for access to European civilization. Those are lofty goals indeed, and competing with Google may be loftier still, but it will be a wonderful (free) resource for those of us living and teaching outside of Europe. At the same time, in keeping with the EU’s goals, digitizing these cultural objects will also serve to preserve them in a digital age. For more on this topic, see this English-language article from Der Spiegel.
Both of these recent examples highlight several of the themes addressed in this week’s readings. As all three readings suggest, finding a digital medium that can hold its own over time is probably going to be the greatest challenge for digital archivists. I see a great hurdle being created here by the German National Library: submitting one’s site as a PDF or a ZIP file is at best a temporary fix and does not actually ensure any sort of preservation. With the PDF format, the library is essentially asking site owners to “print” out their sites and submit the copy in what is, at the moment, the ubiquitous e-book or e-paper format. But will it remain so in the near and distant future? Already there are competing formats, most of which rely on less sophisticated encoding – plain ASCII text plus a style sheet instead of embedded formatting. The ZIP format seems even more problematic: who is going to guarantee that what is submitted can actually be accessed? So much of the rich multimedia content on the web depends on specific server-side technologies – databases, search functions, scripts – that a stand-alone copy would be relatively useless. Instead, maybe the National Library should consult with the people at the Internet Archive about better ways to archive the content it wants…
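To make the limitation concrete, here is a minimal sketch (in Python, with hypothetical page URLs – the law itself specifies nothing this precise) of the kind of static snapshot a company might dutifully ZIP up and submit. It shows exactly why such a copy falls short: only the HTML the server happens to return that day gets saved, while everything behind it is lost.

```python
import zipfile
from urllib.request import urlopen
from urllib.parse import urlparse

# Hypothetical list of a site's pages; in practice nothing in the law
# says how a "complete" website is even to be enumerated.
pages = [
    "http://www.example.com/",
    "http://www.example.com/produkte.html",
    "http://www.example.com/kontakt.html",
]

with zipfile.ZipFile("site-snapshot.zip", "w", zipfile.ZIP_DEFLATED) as archive:
    for url in pages:
        # Fetch the rendered HTML exactly as the server returns it today.
        html = urlopen(url).read()
        # Derive a filename inside the archive from the URL path.
        path = urlparse(url).path.lstrip("/") or "index.html"
        archive.writestr(path, html)

# The resulting ZIP holds static HTML only: any search function, login,
# or database-driven page on the live site is simply absent from the copy.
```

A frozen bundle like this may satisfy the letter of the law, but it preserves the surface of a site, not the site itself – which is precisely the problem the Internet Archive has spent years working on.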
The Europeana project sounds more feasible, as it aims not to digitize everything out there but to gather the various digitization projects in Europe under one roof for easy access and cross-referencing. Europeana is, in effect, attempting to build a multimedia encyclopedia out of the content that has been or is being created in Europe. One of the points raised in the Rosenzweig article was that archivists complain they cannot archive everything and that someone (i.e., historians) needs to help determine what should be preserved and what can be discarded. In some ways, Europeana is performing at least part of this function – it selects those aspects of European culture deemed most important for inclusion, which in turn will guide other archivists and curators as they gather more content and the collection grows over time.
The immense job of a digital archivist is far from enviable, especially when important digital historical documents have been deleted – whether carelessly or purposefully. One of my favorite bloggers is Dan Froomkin of the Washington Post. He writes a blog called White House Watch, which analyzes not just the White House but also the White House press corps. Starting in April 2007, Froomkin wrote a series of posts on the deletion of White House emails and how it could affect future historical accounts of the Bush White House. The first article in the series is here and is worth a read.