When you visit a web archive to go back in time and look at a web page, you naturally expect it to display the content exactly as it appeared on the live web at that particular datetime. That expectation assumes, of course, that all of the resources on the page were captured at or near the datetime displayed in the banner for the root HTML page. However, we noticed that this is not always the case, and problems with archiving Twitter's new UI can result in replaying Twitter profile pages that never existed on the live web. In our previous blog post, we talked about how difficult it is to archive Twitter's new UI; in this blog post, we uncover how the new Twitter UI mementos in the Internet Archive are vulnerable to temporal violations.
If a Pulitzer-nominated 34-part series of investigative journalism can vanish from the web, anything can.
[The Web] is a constantly changing patchwork of perpetual nowness.
Highlighted on January 07, 2020 at 11:58AM
I’ve only come across one or two which archive.org didn’t crawl or didn’t have. For many of the broken links, I’m able to link directly to archive copies made on the same day I originally made the links, and my archive snapshots were the only ones ever made.
It was the biggest disaster in the history of the music business — and almost nobody knew. This is the story of the 2008 Universal fire.
I’m surprised that the author doesn’t whip out any references to the burning of the Library at Alexandria, which may have been roughly on par in terms of cultural loss to society. It’s painfully sad that UMG covered up the devastating loss.
The artwork for the piece is really brilliant. Some great art direction here.
A tool to view web pages using old browsers on legacy platforms
This talk will present innovative uses of Docker containers, emulators, and web archives to allow anyone to experience old web sites using old web browsers, as demonstrated by the Webrecorder and oldweb.today projects. Combining containerization with emulation can provide new techniques for preserving both scholarly and artistic interactive works, and enable obsolete technologies like Flash and Java applets to be accessible today and in the future. The talk will briefly cover the technology and how it can be deployed both locally and in the cloud. The latest research in this area, such as automated preservation of educational publishing platforms like Scalar, will also be presented. The presentation will include live demos, and attendees will also be invited to try the latest version of oldweb.today and interact with old browsers directly in their browser. The Q&A will help foster a discussion on the potential opportunities and challenges of containerization technology in ‘future-proofing’ interactive web content and software.
I’ve got a new piece over at The Atlantic on Barack Obama’s prospective presidential library, which will be digital rather than physical. This has caused some consternation. We need to realize, however, that the Obama library is already largely digital: The vast majority of the record his presid...
The means and methods of digital preservation also become an interesting test case for this particular presidency because so much of it was born digitally. I’m curious what the overlaps are for those working in the archival research space? In fact, I know that groups like the Reynolds Journalism Institute have been hosting conferences like Dodging the Memory Hole which are working at preserving born digital news and I suspect there’s a huge overlap with what digital libraries like this one are doing. I have to think Dan would make an interesting keynote speaker if there were another Dodging the Memory Hole conference in the near future.
Given my technological background, I’m less reticent than some detractors of digital libraries, but this article reminds me of some of the structural differences in this particular library from an executive and curatorial perspective. Some of these were well laid out in an episode of On the Media which I listened to recently. I’d be curious to hear what Dan thinks of this aspect of the curatorial design, particularly given the differences a primarily digital archive might have. For example, who builds the search interface? Who builds the API for such an archive and how might it be designed to potentially limit access of some portions of the data? Design choices may potentially make it easier for researchers, but given the current and some past administrations, what could happen if curators were less than ideal? What happens with changes in technology? What about digital rot or even link rot? Who chooses formats? Will they be standardized somehow? What prevents pieces from being digitally tampered with? When those who win get to write the history, what prevents those in the future from digitally rewriting the narrative? There’s lots to consider here.
What do you do with 11,000 blogs on a platform that is over a decade old? That is the question that the Division of Teaching and Learning Technologies (DTLT) and the UMW Libraries are trying to answer. This essay outlines the challenges of maintaining a large WordPress multisite installation and offers potential solutions for preserving institutional digital history. Using a combination of data mining, personal outreach, and available web archiving tools, we show the importance of a systematic, collaborative approach to the challenges we didn’t expect to face in 2007 when UMW Blogs launched. Complicating matters is the increased awareness of digital privacy and the importance of maintaining ownership and control over one’s data online; the collaborative nature of a multisite and the life cycle of a student or even faculty member within an institution blur the lines of who owns or controls the data found on one of these sites. The answers may seem obvious, but as each test case emerges, the situation becomes more and more complex. As an increasing number of institutions deal with legacy digital platforms housing intellectual property and scholarship, we believe that this essay outlines one potential path forward for long-term sustainability and preservation.
When it comes to their stuff, people often have a hard time letting go. When the objects of their obsession are rooms full of old clothes or newspapers, it can be unhealthy—even dangerous. But what about a stash that fits on ten 5-inch hard drives?
Ben Welsh of the LA Times data desk has built Savemy.News, which leverages Twitter in combination with archive.is, webcitation.org, and archive.org to allow journalists to quickly create multiple archives of their work simply by inputting the URLs of their pages. It also has a useful download function.
Richard MacManus, founder of RWW, wrote a worthwhile article on how and why he archived a lot of his past work.
Those with heavier digital journalism backgrounds and portfolios may find some useful information and research coming out of the Reynolds Journalism Institute’s Dodging the Memory Hole series of conferences. I can direct those interested to a variety of archivists, librarians, researchers, and technologists should they need heavier lifting than simpler solutions like archive.org, et al. can provide.
Additional ideas for archiving and saving online work can be found on the IndieWeb wiki page archival copy. There are some additional useful ideas and articles on the IndieWeb for Journalism page as well. I’d welcome anyone with additional ideas or input to feel free to add to any of these pages for others’ benefit as well. If you’re unfamiliar with wiki notation or editing, feel free to reply to this post; I’m happy to make additions on your behalf or help you log in and navigate the system directly.
If you don’t have a website where you keep your personal archive and/or portfolio online already, now might be a good time to put one together. The IndieWeb page mentioned above has some useful ideas, real world examples, and even links to tutorials.
As an added bonus for those who clicked through, if you’re temporarily unemployed and don’t have your own website/portfolio already, I’m happy to help build an IndieWeb-friendly website (gratis) to make it easier to store and display your past and future articles.
I’ve recently outlined how ideas like a Domain of One’s Own and IndieWeb philosophies could be used to allow researchers and academics to practice academic samizdat on the open web to own and maintain their own open academic research and writing. A part of this process is the need to have useful and worthwhile back up and archiving ability as one thing we have come to know in the history of the web is that link rot is not our friend.
Toward that end, for those in the space I’ll point out some useful resources including the IndieWeb wiki pages for archival copies. Please contribute to it if you can. Another brilliant resource is the annual Dodging the Memory Hole conference which is run by the Reynolds Journalism Institute.
While Dodging the Memory Hole is geared toward saving online news in particular, many of the conversations are nearly identical to those in the broader archival space and also involve larger institutional resources and constituencies like the Internet Archive, the Library of Congress, and university libraries as well. The conference is typically in the fall of each year and is usually announced sometime in August/September, so keep an eye out for its announcement. In the meantime, they’ve recorded past sessions and have archive copies of much of their prior work, in addition to creating a network of academics, technologists, and journalists around these ideas and related work. I’ve got a Twitter list of prior DtMH participants and stakeholders for those interested.
I’ll also note briefly, that as I self-publish on my own self-hosted domain, I use a simple plugin so that both my content and the content to which I link are being sent to the Internet Archive to create copies there. In addition to semi-regular back ups I make locally, this hopefully helps to mitigate potential future loss and link rot.
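For those curious about the mechanics, the Wayback Machine exposes a public "Save Page Now" endpoint (`https://web.archive.org/save/<url>`) that tools like the plugin mentioned above can use. Here's a minimal sketch; the function names are my own illustrations, and a real plugin may use a different or authenticated mechanism:

```python
# Sketch: submitting a URL to the Internet Archive's Wayback Machine
# via its public "Save Page Now" endpoint.
from urllib.parse import quote
from urllib.request import urlopen

WAYBACK_SAVE = "https://web.archive.org/save/"

def wayback_save_url(target: str) -> str:
    """Build the Save Page Now URL for a target page."""
    # Percent-encode the target, but keep the scheme and path separators.
    return WAYBACK_SAVE + quote(target, safe=":/")

def archive_page(target: str) -> int:
    """Request a capture of the page; returns the HTTP status code."""
    with urlopen(wayback_save_url(target)) as resp:  # network call
        return resp.status
```

For example, `wayback_save_url("https://example.com/post")` yields `https://web.archive.org/save/https://example.com/post`; fetching that URL asks the Internet Archive to capture the page.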
As a side note, major bonus points to Robin DeRosa (@actualham) for the use of the IndieWeb hashtag in her post!!
Dave Winer has a great post today on the closing of blogs.harvard.edu. These are sites run by Berkman, some dating back to 2003, which are being shut down. My galaxy brain goes towards the idea of …
I got an email in the middle of the night asking if I had seen an announcement from Berkman Center at Harvard that they will stop hosting blogs.harvard.edu. It's not clear what will happen to the archives. Let's have a discussion about this. That was the first academic blog hosting system anywhere. It was where we planned and reported on our Berkman Thursday meetups, and BloggerCon. It's where the first podcasts were hosted. When we tried to figure out what makes a weblog a weblog, that's where the result was posted. There's a lot of history there. I can understand turning off the creation of new posts, making the old blogs read-only, but as a university it seems to me that Harvard should have a strong interest in maintaining the archive, in case anyone in the future wants to study the role we played in starting up these (as it turns out) important human activities.
Running time: 0h 12m 59s | Download (13.9 MB) | Subscribe by RSS | Huffduff
A researcher posts their research work to their own website (as bookmarks, reads, likes, favorites, annotations, etc.); they can post their data for others to review; and they can post their ultimate publication to their own website.
The researcher’s post can webmention an aggregating website similar to the way they would pre-print their research on a server like arXiv.org. The aggregating website can then parse the original and display the title, author(s), publication date, revision date(s), abstract, and even the full paper itself. This aggregator can act as a subscription hub (with WebSub technology) which other researchers can use to find, discover, and read the original research.
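Under the W3C Webmention spec, sending such a notification has two steps: discover the receiver's webmention endpoint (advertised in an HTTP `Link` header or a `<link>`/`<a>` element with `rel="webmention"`), then POST form-encoded `source` and `target` URLs to it. A minimal sketch of the HTML-discovery half, using only the standard library (a full sender would also check the `Link` header):

```python
# Sketch of Webmention endpoint discovery per the W3C Webmention spec:
# find the first <link> or <a> element with rel="webmention" in the
# target page's HTML and resolve its href against the page URL.
from html.parser import HTMLParser
from urllib.parse import urljoin

class EndpointFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.endpoint = None

    def handle_starttag(self, tag, attrs):
        if self.endpoint is not None or tag not in ("link", "a"):
            return
        a = dict(attrs)
        rels = (a.get("rel") or "").split()
        if "webmention" in rels and "href" in a:
            self.endpoint = a["href"]

def discover_endpoint(html: str, page_url: str):
    """Return the absolute webmention endpoint URL, or None if absent."""
    finder = EndpointFinder()
    finder.feed(html)
    if finder.endpoint is None:
        return None
    return urljoin(page_url, finder.endpoint)
```

With the endpoint in hand, the sender would POST `source=<researcher's post>&target=<aggregator page>` to it; the aggregator then fetches the source and parses out the title, authors, dates, and abstract.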
Readers of the original research can then write about, highlight, annotate, and even reply to it on their own websites to effectuate peer review, which then gets sent to the original by way of Webmention technology as well. The peer-reviewers’ work stands in public as work which could be used in evaluations for promotion and tenure.
Readers of original research can post metadata relating to it on their own websites, including bookmarks, reads, likes, replies, annotations, etc., and send webmentions not only to the original but also to the aggregation sites, which could collect these responses and assign them point values based on interaction/engagement levels (e.g., bookmarking something as “want to read” is 1 point, whereas indicating one has read something is 2 points, replying to it is 4 points, and an official citation in another publication provides 5 points). Such a scoring system could be used to provide a better citation measure of the overall value of a research article in a networked world. In general, Webmention could provide a two-way auditable trail for citations, and the citation trail could be used in combination with something like the Vouch protocol to prevent gaming the system with spam.
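The scoring idea above can be sketched as a tiny weighting function. The point values are the ones proposed in the post; the function and data shapes are hypothetical illustrations, not any existing standard:

```python
# A minimal sketch of the interaction-scoring idea. Weights follow the
# post's proposal; unknown interaction types contribute nothing.
INTERACTION_POINTS = {
    "bookmark": 1,   # marked "want to read"
    "read": 2,       # indicated they read it
    "reply": 4,      # wrote a reply
    "citation": 5,   # officially cited in another publication
}

def article_score(interactions):
    """Sum the weighted value of a list of interaction-type strings."""
    return sum(INTERACTION_POINTS.get(kind, 0) for kind in interactions)
```

So an article with one bookmark, one read, one reply, and one citation would score 1 + 2 + 4 + 5 = 12 points, giving aggregators a rough engagement-weighted measure to rank by.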
Government institutions (like the Library of Congress), universities, academic institutions, libraries, and non-profits (like the Internet Archive) can also create and maintain archival copies of digital and/or printed research for future generations. This would guard against the deaths of researchers and the disappearance of their sites from the internet, providing better longevity.
Resources mentioned in the microcast
IndieWeb for Education
IndieWeb for Journalism
arXiv.org (an example pre-print server)
A Domain of One’s Own
Article on A List Apart: Webmentions: Enabling Better Communication on the Internet