Notes from Day 2 of Dodging the Memory Hole: Saving Online News | Friday, October 14, 2016

Some quick thoughts and an archive of the audio and my Twitter notes during the day

If you missed the notes from Day 1, see this post.

It may take me a week or so to finish putting some general thoughts and additional resources together based on the two-day conference so that I might give a more thorough accounting of my opinions as well as next steps. Until then, I hope that the details and mini-archive of content below may help others who attended, or provide a resource for those who couldn’t make the conference.

Overall, it was an incredibly well-programmed and well-run conference, so kudos to all those involved who kept things moving along. I’m now certainly much more aware of the gaping memory hole the internet is facing despite the heroic efforts of a small handful of people and institutions attempting to improve the situation. I’ll try to go into more detail later about a handful of specific topics and next steps, as well as a listing of resources I came across which may prove to be useful tools for both the archiving/preservation and IndieWeb communities.

Archive of materials for Day 2

Audio Files

Below are the recorded audio files embedded in .m4a format (using a Livescribe Pulse Pen) for several sessions held throughout the day. To my knowledge, none of the breakout sessions were recorded except for the one which appears below.

Summarizing archival collections using storytelling techniques


Presentation: Summarizing archival collections using storytelling techniques by Michael Nelson, Ph.D., Old Dominion University

Saving the first draft of history


Special guest speaker: Saving the first draft of history: The unlikely rescue of the AP’s Vietnam War files by Peter Arnett, winner of the Pulitzer Prize for journalism
Peter Arnett talking about news reporting in Vietnam in the ’60s.

Kiss your app goodbye: the fragility of data journalism


Panel: Kiss your app goodbye: the fragility of data journalism
Featuring Meredith Broussard, New York University; Regina Lee Roberts, Stanford University; Ben Welsh, The Los Angeles Times; moderator Martin Klein, Ph.D., Los Alamos National Laboratory

The future of the past: modernizing The New York Times archive


Panel: The future of the past: modernizing The New York Times archive
Featuring The New York Times Technology Team: Evan Sandhaus, Jane Cotler and Sophia Van Valkenburg; moderated by Edward McCain, RJI and MU Libraries

Lightning Rounds: Six Presenters



Lightning rounds (in two parts)
Six + one presenters: Jefferson Bailey, Terry Britt, Katherine Boss (and team), Cynthia Joyce, Mark Graham, Jennifer Younger and Kalev Leetaru
1. Jefferson Bailey, Internet Archive, “Supporting Data-Driven Research using News-Related Web Archives”
2. Terry Britt, University of Missouri, “News archives as cornerstones of collective memory”
3. Katherine Boss, Meredith Broussard and Eva Revear, New York University, “Challenges facing preservation of born-digital news applications”
4. Cynthia Joyce, University of Mississippi, “Keyword ‘Katrina’: Re-collecting the unsearchable past”
5. Mark Graham, Internet Archive/The Wayback Machine, “Archiving news at the Internet Archive”
6. Jennifer Younger, Catholic Research Resources Alliance, “Digital Preservation, Aggregated, Collaborative, Catholic”
7. Kalev Leetaru, senior fellow, The George Washington University and founder of the GDELT Project, “A Look Inside The World’s Largest Initiative To Understand And Archive The World’s News”

Technology and Community


Presentation: Technology and community: Why we need partners, collaborators, and friends by Kate Zwaard, Library of Congress

Breakout: Working with CMS


Working with CMS, led by Eric Weig, University of Kentucky

Alignment and reciprocity


Alignment & reciprocity by Katherine Skinner, Ph.D., executive director, the Educopia Institute

Closing remarks


Closing remarks by Edward McCain, RJI and MU Libraries and Todd Grappone, associate university librarian, UCLA

Live Tweet Archive

Reminder: In many cases my tweets don’t reflect direct quotes of the attributed speaker, but are often slightly modified for clarity and length for posting to Twitter. I have made a reasonable attempt in all cases to capture the overall sentiment of individual statements while using as many of the participant’s original words as possible. Typically, for speed, there wasn’t much editing of these notes. Below I’ve changed the attribution of one or two tweets to reflect the proper person(s). For convenience, I’ve also added a few hyperlinks to useful resources after the fact that I didn’t have time to include in the original tweets. I’ve attached .m4a audio files of most of the audio for the day (apologies for shaky quality as it’s unedited) which can be used for more direct attribution if desired. The Reynolds Journalism Institute videotaped the entire day and livestreamed it. Presumably they will release the video on their website for a more immersive experience.

Peter Arnett:

Condoms were required issue in Vietnam–we used them to waterproof film containers in the field.

Do not stay close to the head of a column, medics, or radiomen. #warreportingadvice

I told the AP I would undertake the task of destroying all the reporters’ files from the war.

Instead the AP files moved around with me.

Eventually the 10 trunks of material went back to the AP when they hired a brilliant archivist.

“The negatives can outweigh the positives when you’re in trouble.”

Edward McCain:

Our first panel: Kiss your app goodbye: the fragility of data journalism

Meredith Broussard:

I teach data journalism at NYU

A news app is not what you’d install on your phone

Dollars for Docs is a good example of a news app

A news app is something that allows users to put themselves into the story.

Often there are three CMSs: web, print, and video.

News apps don’t live in any of the CMSs. They’re bespoke and live on a separate data server.

This has implications for crawlers which can’t handle them well.

Then how do we save news apps? We’re looking at examples and then generalizing.

Everyblock.com was a good example based on chicagocrime and later bought by NBC and shut down.

What?! The internet isn’t forever? Databases need to be saved differently than web pages.

Reprozip was developed by NYU Center for Data and we’re using it to save the code, data, and environment.

Ben Welsh:

My slides will be at http://bit.ly/frameworkfix. I work on the data desk @LATimes

We make apps that serve our audience.

We also make internal tools that empower the newsroom.

We also use our nerdy skills to do cool things.

Most of us aren’t good programmers, we “cheat” by using frameworks.

Frameworks do a lot of basic things for you, so you don’t have to know how to do it yourself.

Archiving tools often aren’t built into these frameworks.

Instagram, Pinterest, Mozilla, and we at the LA Times all use Django as our framework.

Memento for WordPress is a great way to archive pages.
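
Memento (RFC 7089) is the protocol underneath tools like this: a client “negotiates in time” by sending an `Accept-Datetime` header to a TimeGate, which replies with the capture closest to that moment. A minimal sketch of building such a request against the public Time Travel aggregator (the helper name and example URL are mine, not from the talk):

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Public Memento aggregator TimeGate; any RFC 7089 TimeGate works the same way.
TIMEGATE = "http://timetravel.mementoweb.org/timegate/"

def memento_request(url, when):
    """Build the TimeGate URI and Accept-Datetime header for a lookup."""
    # Accept-Datetime must be an RFC 1123 date in GMT.
    headers = {"Accept-Datetime": format_datetime(when.astimezone(timezone.utc),
                                                  usegmt=True)}
    return TIMEGATE + url, headers

uri, headers = memento_request("https://www.latimes.com/",
                               datetime(2016, 10, 13, tzinfo=timezone.utc))
print(headers["Accept-Datetime"])  # Thu, 13 Oct 2016 00:00:00 GMT
```

Issuing a GET on `uri` with those headers would redirect to the nearest archived capture; the datetime formatting is the part frameworks tend to get wrong.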

We must do more. We need archiving baked into the systems from the start.

Slides at http://bit.ly/frameworkfix

Regina Roberts:

Got data? I’m a librarian at Stanford University.

I’ll mention Christine Borgman’s book Big Data, Little Data, No Data.

Journalists are great data liberators: FOIA requests, cleaning data, visualizing, getting stories out of data.

But what happens to the data once the story is published?

BLDR: Big Local Digital Repository, an open repository for sharing open data.

Solutions that exist: Hydra at http://projecthydra.org or Open ICPSR www.openicpsr.org

For metadata: www.ddialliance.org, RDF, International Image Interoperability Framework (iiif) and MODS

Martin Klein:

We’ll open up for questions.

Audience Question:

What’s more important: obey copyright laws or preserving the content?

Regina Roberts:

The new creative commons licenses are very helpful, but we have to be attentive to many issues.

Perhaps archiving it and embargoing for later?

Ben Welsh:

Saving the published work is more important to me, and the rest of the byproduct is gravy.

Evan Sandhaus:

I work for the New York Times, you may have heard of it…

Doing a quick demo of Times Machine from @NYTimes

Sophia van Valkenburg:

Talking about modernizing the born-digital legacy content.

Our problem was how to make an article from 2004 look like it had been published today.

There were hundreds of thousands of articles missing.

There was no one definitive list of missing articles.

Outlining the workflow for reconciling the archive XML and the definitive list of URLs for conversion.

It’s important to use more than one source for building an archive.

Jane Cotler:

I’m going to talk about all of “the little things” that came up along the way.

Article matching (Fusion): how to reconcile print XML with web HTML that was scraped.

Primarily, we looked at common phrases between the corpus of the two different data sets.
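
One common way to implement “common phrases between two corpora” is word-shingle overlap; this is an illustrative sketch under that assumption, not the Times’ actual pipeline:

```python
# Match a print-XML article text against a scraped-HTML article text by
# Jaccard overlap of word n-grams ("shingles"). Illustrative only.

def shingles(text, n=4):
    """Set of all n-word phrases in the text, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity(a, b):
    """Jaccard overlap of word n-grams: 1.0 means identical phrasing."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

print(similarity("the quick brown fox jumps over the lazy dog",
                 "the quick brown fox jumps over a lazy dog"))
```

Pairs scoring above a threshold would be treated as the same article, with the print version winning ties per the prioritization described above.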

We prioritized the print data over the digital data.

We maintain a system called switchboard that redirects from old URLs to the new ones to prevent link rot.
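
A redirect table like that can be sketched in a few lines (the paths and mapping here are invented; the real switchboard service is not public):

```python
# Toy "switchboard": map retired article URLs to their modern equivalents
# so inbound links keep resolving instead of rotting.

REDIRECTS = {
    "/2004/05/01/old-slug.html": "/2004/05/01/us/new-slug.html",
}

def resolve(path):
    """Return an HTTP status and target path for an incoming request."""
    if path in REDIRECTS:
        return 301, REDIRECTS[path]  # permanent redirect to the new URL
    return 200, path                 # no mapping; serve as-is

print(resolve("/2004/05/01/old-slug.html"))
```

Using a 301 (permanent) rather than a 302 matters here: it tells crawlers and archives that the new URL is canonical.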

The case of the missing sections: some sections of the content were blank and not transcribed.

We made the decision to take out data we had in favor of a better user experience for missing sections.

In the future, we’d also like to put photos back into the articles.

Evan Sandhaus:

Modernizing and archiving the @NYTimes archives is an ongoing challenge.

Edward McCain:

Can you discuss the decision to go with a more modern interface rather than a traditional archive of how it looked?

Evan Sandhaus:

Some of the decision was to get the data into an accessible format for modern users.

We do need to continue work on preserving the original experience.

Edward McCain:

Is there a way to distinguish between the print version and the online versions in the archive?

Audience Question:

Could a researcher do work on the entire corpus? Is it available for subscription?

Edward McCain:

We do have a sub-section of data available, but don’t have it prior to 1960.

Audience Question:

Have you documented the process you’ve used on this preservation project?

Sophia van Valkenburg:

We did save all of the code for the project within GitHub.

Jane Cotler:

We do have meeting notes which provide some documentation, though they’re not thorough.

ChrisAldrich:

Oh dear. Of the roughly 1,155 tweets I counted about #DtMH2016 in the last week, about 25% came from me. #noisy

Open-source tool I had mentioned to several: @wallabagapp, a self-hostable application for saving web pages https://www.wallabag.org

Syndicated copies to:

Notes from Day 1 of Dodging the Memory Hole: Saving Online News | Thursday, October 13, 2016

Some quick thoughts and an archive of my Twitter notes during the day

Today I spent the majority of the day attending the first of a two-day conference at UCLA’s Charles Young Research Library entitled “Dodging the Memory Hole: Saving Online News.” While I knew mostly what I was getting into, it hadn’t really occurred to me how much of what is on the web is not backed up or archived in any meaningful way. As a part of human nature, people neglect to back up any of their data, but huge swaths of really important data with newsworthy and historic value are being heavily neglected. Fortunately it’s an interesting enough problem to draw the 100 or so scholars, researchers, technologists, and journalists who showed up for the start of an interesting group being conglomerated through the Reynolds Journalism Institute and several sponsors of the event.

What particularly strikes me is how many of the philosophies of the IndieWeb movement and tools developed by it are applicable to some of the problems that online news faces. I suspect that if more journalists were practicing members of the IndieWeb and used their sites not only for collecting and storing the underlying data upon which they base their stories, but to publish them as well, then some of the (future) archival process may be easier to accomplish. I’ve got so many disparate thoughts running around my mind after the first day that it’ll take a bit of time to process before I write out some more detailed thoughts.

Twitter List for the Conference

As a reminder to those attending, I’ve accumulated a list of everyone who’s tweeted with the hashtag #DtMH2016, so that attendees can more easily follow each other as well as communicate online following our few days together in Los Angeles. Twitter also allows subscribing to entire lists, if that’s something in which people have interest.

Archiving the day

It seems only fitting that an attendee of a conference about saving and archiving digital news, would make a reasonable attempt to archive some of his experience right?! Toward that end, below is an archive of my tweetstorm during the day marked up with microformats and including hovercards for the speakers with appropriate available metadata. For those interested, I used a fantastic web app called Noter Live to capture, tweet, and more easily archive the stream.

Note that in many cases my tweets don’t reflect direct quotes of the attributed speaker, but are often slightly modified for clarity and length for posting to Twitter. I have made a reasonable attempt in all cases to capture the overall sentiment of individual statements while using as many original words of the participant as possible. Typically, for speed, there wasn’t much editing of these notes. I’m also attaching .m4a audio files of most of the audio for the day (apologies for shaky quality as it’s unedited) which can be used for more direct attribution if desired. The Reynolds Journalism Institute videotaped the entire day and livestreamed it. Presumably they will release the video on their website for a more immersive experience.

If you prefer to read the stream of notes in the original Twitter format, so that you can like/retweet/comment on individual pieces, this link should give you the entire stream. Naturally, comments are also welcome below.

Audio Files

Below are the audio files for several sessions held throughout the day.

Greetings and Keynote


Greetings: Edward McCain, digital curator of journalism, Donald W. Reynolds Journalism Institute (RJI) and University of Missouri Libraries and Ginny Steel, university librarian, UCLA
Keynote: Digital salvage operations — what’s worth saving? given by Hjalmar Gislason, vice president of data, Qlik

Why save online news? and NewsScape


Panel: “Why save online news?” featuring Chris Freeland, Washington University; Matt Weber, Ph.D., Rutgers, The State University of New Jersey; Laura Wrubel, The George Washington University; moderator Ana Krahmer, Ph.D., University of North Texas
Presentation: “NewsScape: preserving TV news” given by Tim Groeling, Ph.D., UCLA Communication Studies Department

Born-digital news preservation in perspective


Speaker: Clifford Lynch, Ph.D., executive director, Coalition for Networked Information on “Born-digital news preservation in perspective”

Live Tweet Archive

ChrisAldrich:

Getting Noter Live fired up for Dodging the Memory Hole 2016: Saving Online News https://www.rjionline.org/dtmh2016

Ginny Steel:

I’m glad I’m not at NBC trying to figure out the details for releasing THE APPRENTICE tapes.

Edward McCain:

Let’s thank @UCLA and the library for hosting us all.

While you’re here, don’t forget to vote/provide feedback throughout the day for IMLS

Someone once pulled up behind me and said “Hi Tiiiigeeerrr!” #Mizzou

A server at the Missourian crashed as the system was obsolete and running on baling wire. We lost 15 years of archives

The dean & head of Libraries created a position to save born digital news.

We’d like to help define stake-holder roles in relation to the problem.

Newspaper is really an outmoded term now.

I’d like to celebrate that we have 14 student scholars here today.

We’d like to have you identify specific projects that we can take to funding sources to begin work after the conference

We’ll be going to our first speaker who will be introduced by Martin Klein from Los Alamos.

Martin Klein:

Hjalmar Gislason is a self-described digital nerd. He’s the Vice President of Data.

I wonder how one becomes the President of Data?

Hjalmar Gislason:

My Icelandic name may be the most complicated part of my talk this morning.

Speaking on “Digital Salvage Operations: What’s Worth Saving?”

My father in law accidentally threw away my wife’s favorite stuffed animal. #DeafTeddy

Some people just throw everything away because it’s not being used. Others keep everything and don’t throw anything away.

The fundamental question: Do you want to save everything or do you want to get rid of everything?

I joined @qlik two years ago and moved to Boston.

Before that I was with spurl.net, which let users save copies of webpages they’d previously visited.

I had also previously invested in kjarninn which is translated as core.

We used to have little data, now we’re with gigantic data and moving to gargantuan data soon.

One of my goals today is to broaden our perspective about what data needs saving.

There’s the Web, the “Deep” Web, then there’s “Other” data which is at the bottom of the pyramid.

I got to see into the process of #panamapapers but I’d like to discuss the consequences from April 3rd.

The number of meetings was almost more than could have been covered in real time in Iceland.

The #panamapapers were a soap opera, much like US politics.

Looking back at the process is highly interesting, but it’s difficult to look at all the data as it unfolded.

How can we capture all the media minute by minute as a story unfolds?

You can’t trust that you can go back to a story at a certain time and know that it hasn’t been changed. #1984 #Orwell

There was a relatively pro-HRC piece earlier this year in the @NYTimes that was changed.

Newsdiffs tracks changes in news over time. The HRC article had changed a lot.
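
At its core, a tracker like Newsdiffs compares successive captures of the same article; a minimal sketch of that comparison step (the sample sentences are invented, and Newsdiffs itself also handles crawling and storage):

```python
import difflib

# Two captures of the "same" article at different times.
old = "Clinton outlined her economic plan on Tuesday."
new = "Clinton outlined her revised economic plan on Wednesday."

# Diff word-by-word so small editorial changes stand out.
for line in difflib.unified_diff(old.split(), new.split(), lineterm=""):
    print(line)
```

Tokens prefixed with `+`/`-` are the insertions and deletions between captures, which is exactly the kind of silent revision the talk warns about.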

Let’s say you referenced @CNN 10 years ago, likely now, the CMS and the story have both changed.

8 years ago, I asked, wouldn’t we like to have the social media from Iceland’s only Nobel Laureate as a teenager?

What is private/public, ethical/unethical when dealing with data?

Much data is hidden behind passwords or on systems which are not easily accessed from a database perspective.

Most of the content published on Facebook isn’t public. It’s hard to archive in addition to being big.

We as archivists have no claim on the hidden data within Facebook.

ChrisAldrich:

The #indieweb could help archivists in the future in accessing more personal data.

Hjalmar Gislason:

Then there’s “other” data: 500 hours of video is uploaded to YouTube per minute.

No organization can go around watching all of this video data. Which parts are newsworthy?

Content could surface much later or could surface through later research.

Hornbjargsviti lighthouse recorded the weather every three hours for years creating lots of data.

And that was just one of hundreds of sites that recorded this type of data in Iceland.

Lots of this data is lost. Much that has been found was by coincidence. It was never thought to archive it.

This type of weather data could be very valuable to researchers later on.

There was also a large archive of Icelandic data that was found.

Showing a timelapse of Icelandic earthquakes https://vimeo.com/24442762

You can watch the magma working its way through the ground before it makes its way up through the land.

National Geographic featured this video in a documentary.

Sometimes context is important when it comes to data. What is archived today may be more important later.

As the economic crisis unfolded in Greece, it turned out the data that was used to allow them into the EU was wrong.

The data was published at the time of the crisis, but there was no record of what the data looked like 5 years earlier.

The only way to recreate the data was to take prior printed sources. This is usually only done in extraordinary circumstances.

We captured 150k+ data sets with more than 8 billion “facts” which was just a tiny fraction of what exists.

How can we delve deeper into large data sets, all with different configurations and proprietary systems.

“There’s a story in every piece of data.”

Once a year energy consumption seems to dip because February has fewer days than other months. Plotting it matters.

Year over year comparisons can be difficult because of things like 3 day weekends which shift over time.

Here’s a graph of the population of Iceland. We’ve had our fair share of diseases and volcanic eruptions.

To compare, here’s a graph of the population of sheep. They outnumber us by an order(s) of magnitude.

In the 1780s there was an event that killed off lots of sheep, so people had the upper hand.

Do we learn more from reading today’s “newspaper” or one from 30, 50, or 100 years ago?

There was a letter to the editor about an eruption and people had to move into the city.

letter: “We can’t have all these people come here, we need to build for our own people first.”

This isn’t too different from our problems today with respect to Syria. In that case, the people actually lived closer.

In the born-digital age, what will the experience look like trying to capture today 40 years hence?

Will it even be possible?

Machine data connections will outnumber “people” data connections by a factor of 10 or more very quickly.

With data, we need to analyze, store, and discard it. How do we decide in a split-second what to keep and what to discard?

We’re back to the father-in-law and mother-in-law question: What to get rid of and what to save?

Computers are continually beating humans at tasks: chess, Go, driving a car. They build on lots more experience based on data.

Whoever has the most data on driving cars and landscape will be the ultimate winner in that particular space.

Data is valuable, sometimes we just don’t know which yet.

Hoarding is not a strategy.

You can only guess at what will be important.

“Commercial use in Doubt”: the third sub-headline in a newspaper story about an early test of television.

There’s more to it than just the web.

Kate Zwaard:

“Hoarding is not a strategy” really resonates with librarians. What could that relationship look like?

Hjalmar Gislason:

One should bring in data science, industry may be ahead of libraries.

Cross-disciplinary approaches may be best. How can you get a data scientist to look at your problem? Get their attention?

Peter Arnett:

There’s 60K+ books about the Viet Nam War. How do we learn to integrate what we learn after an event (like that)?

Hjalmar Gislason:

Perspective always comes with time, as additional information arrives.

Scientific papers are archived in a good way, but the underlying data is a problem.

In the future you may have the ability to add supplementary data to what appears in a book (in a better way).

Archives can give the ability to have much greater depth on many topics.

Are there any centers of excellence on the topics we’re discussing today? This conference may be IT.

We need more people that come from the technical side of things to be watching this online news problem.

Hacks/Hackers is a meetup group that takes place all over the world.

It brings the journalists and computer scientists together regularly for beers. It’s some of the outreach we need.

Edward McCain:

If you’re not interested in money, this is a good area to explore. 10 minute break.

Don’t forget to leave your thoughts on the questions at the back of the room.

We’re going to get started with our first panel. Why is it important to save online news?

Matthew Weber:

I’m Matt Weber from Rutgers University, in communications.

I’ll talk about web archives and news media and how they interact.

I worked at Tribune Corp. for several years and covered politics in DC.

I wanted to study the way in which the news media is changing.

We’re increasingly seeing digital-only media with no offline surrogate.

It’s becoming increasingly difficult to do anything but look at it now as it exists.

There was no large scale online repository of online news to do research.

#OccupyWallStreet is one of the first examples of stories that exist online in occurrence and reportage.

There’s a growing need to archive content around local news particularly politics and democracy.

When there is a rich and vibrant local news environment, people are more likely to become engaged.

Local news is one of the least thought about from an archive perspective.

Laura Wrubel:

I’m at GWU Libraries in the scholarly technology group.

I’m involved in Social Feed Manager, which allows archivists to put together archives from social services.

Kimberly Gross, a faculty member, studies tweets of news outlets and journalists.

We created a prototype tool to allow them to collect data from social media.

In 2011, journalists were primarily using their Twitter presences to direct people to articles rather than for conversation.

We collect data of political candidates.

Chris Freeland:

I’m an associate librarian, representing “Documenting the Now” with WashU, UC Riverside, and the University of Maryland.

Documenting the Now revolves around Twitter documentation.

It started with the Ferguson story and documenting media, videos during the protests in the community.

What can we as memory institutions do to capture the data?

We gathered 14 million tweets relating to Ferguson within two weeks.

We tried to build a platform that others could use in the future for similar data capture relating to social.

Ethics is important in archiving this type of news data.

Ana Krahmer:

Digitally preserving PDFs from news organizations and hyper-local news in Texas.

We’re approaching 5 million pages of archived local news.

What is news that needs to be archived, and why?

Matthew Weber:

First, what is news? The definition is unique to each individual.

We need to capture as much of the social news and social representation of news which is fragmented.

It’s an important part of society today.

We no longer produce hard copies like we did a decade ago. We need to capture the online portion.

Laura Wrubel:

We’d like to get the perspective of journalists, and don’t have one on the panel today.

We looked at how midterm election candidates used Twitter. Is that news itself? What tools do we use to archive it?

What does it mean to archive news by private citizens?

Chris Freeland:

Twitter was THE place to find information in St. Louis during the Ferguson protests.

Local news outlets weren’t as good as Twitter during the protests.

I could hear the protest from 5 blocks away and only found news about it on Twitter.

The story was being covered very differently on Twitter than on the local (mainstream) news.

Alternate voices in the mix were very interesting and important.

Twitter was in the moment and wasn’t being edited and causing a delay.

What can we learn from this massive number of Ferguson tweets?

It gives us information about organizing, and what language was being used.

Ana Krahmer:

I think about the archival portion of this question. By whom does it need to be archived?

What do we archive next?

How are we representing the current population now?

Who is going to take on the burden of archiving? Should it be corporate? Cultural memory institution?

Someone needs to curate it; who does that?

Our next question: What do you view as primary barriers to news archiving?

Laura Wrubel:

How do we organize and staff? There’s no shortage of work.

Tools and software can help the process, but libraries are usually staffed very thinly.

No single institution can do this type of work alone. Collaboration is important.

Chris Freeland:

Two barriers we deal with: terms of service are an issue with archiving. We don’t own it, but can use it.

Libraries want to own the data in perpetuity. We don’t own our data.

There’s a disconnect in some of the business models for commercialization and archiving.

Issues with accessing data.

People were worried about becoming targets or losing jobs because of participation.

What is role of ethics of archiving this type of data? Allowing opting out?

What about redacting portions? anonymizing the contributions?

Ana Krahmer:

Publishers have a responsibility for archiving their product. Permission from publishers can be difficult.

We have a lot of underserved communities. What do we do with comments on stories?

Corporations may not continue to exist in the future and data will be lost.

Matthew Weber:

There’s a balance to be struck between the business side and the public good.

It’s hard to convince for-profits of the value of archiving for the social good.

Chris Freeland:

Next Q: What opportunities have revealed themselves in preserving news?

Finding commonalities and differences in projects is important.

What does it mean to us to archive different media types? (think diversity)

What’s happening in my community? in the nation? across the world?

The long-history in our archives will help us learn about each other.

Ana Krahmer:

We can only do so much with the resources we have.

We’ve worked on a cyber cemetery project in the past.

Someone else can use the tools we create within their initiatives.

Chris Freeland:

Repeating an audience question: What are the issues in archiving longer-form video data with regard to stories on Periscope?

Audience Question:

How do you channel the energy around news archiving?

Matthew Weber:

Research in the area is all so new.

Audience Question:

Does anyone have any experience with legal wrangling with social services?

Chris Freeland:

The ACLU is waging a lawsuit against Twitter about archived tweets.

Ana Krahmer:

Outreach to community papers is very rhizomic.

Audience Question:

How do you take local examples and make them a national model?

Ana Krahmer:

We’re teenagers now in the evolution of what we’re doing.

Edward McCain:

Peter Arnett just said, “This is all more interesting than I thought it would be.”

Next Presentation: NewsScape: preserving TV news

Tim Groeling:

I’ll be talking about the NewsScape project of Francis Steen, Director, Communication Studies Archive

I’m leading the archiving of the analog portion of the collection.

The oldest of our collection dates from the 1950s. We’ve hosted them on YouTube, which has created some traction.

Commenters have been an issue with posting to YouTube as well as copyright.

NewsScape is the largest collection of TV news and public affairs programs (local & national).

Prior to 2006, we don’t know what we’ve got.

Paul said, “I’ll record everything I can and someone in the future can deal with it.”

We have 50K hours of Betamax.

VHS are actually most threatened, despite being newest tapes.

Our budget was seriously strapped.

Maintaining closed captioning is important to our archiving efforts.

We’ve done 36k hours of encoding this year.

We use a layer of dead VCR’s over our good VCR’s to prevent RF interference and audio buzzing. 🙂

Post-2006, we’re now going straight to digital.

Preservation is the first step, but we need to be more than the world’s best DVR.

Searching the news is important too.

Showing a data visualization of news analysis with regard to healthcare reform.

We’re doing facial analysis as well.

We have interactive tools at viz2016.com.

We’ve tracked how often candidates have smiled in election 2016. Hillary > Trump

We want to share details within our collection, but don’t have tools yet.

Having a good VCR repairman has helped us a lot.

Edward McCain:

Breaking for lunch…

Clifford Lynch:

Talk “Born-digital news preservation in perspective”

There’s a shared consensus that preserving scholarly publications is important.

While delivery models have shifted, there must be some fall back to allow content to survive publisher failure.

Preservation was a joint investment between memory institutions and publishers.

Keepers register their coverage of journals for redundancy.

In studying coverage, we’ve discovered Elsevier is REALLY well covered, but they’re not what we’re worried about.

It’s the small journals as edge cases that really need more coverage.

Smaller journals don’t have resources to get into the keeper services and it’s more expensive.

Many Open Access Journals are passion projects and heavily underfunded and they are poorly covered.

Being mindful of these business dynamics is key when thinking about archiving news.

There are a handful of large news outlets that are “too big to fail.”

There are huge numbers of small outlets like subject verticals, foreign diasporas, etc. that need to be watched

Different strategies should be used for different outlets.

The material on lots of links (as sources) disappears after a short period of time.

While Archive.org is a great resource, it can’t do everything.

Preserving underlying evidence is really important.

How we deal with massive databases and queries against them are a difficult problem.

I’m not aware of studies of link rot with relationship to online news.

Who steps up to preserve major data dumps like Snowden, PanamaPapers, or email breaches?

Social media is a collection of observations and small facts without necessarily being journalism.

Journalism is a deliberate act and is meant to be public while social media is not.

We need to come up with a consensus about what parts of social media should be preserved as news.

News does often delve into social media as part of its evidence base now.

Responsible journalism should include archival storage, but it doesn’t yet.

Under current law, we can’t protect a lot of this material without the permission of the creator(s).

The Library of Congress can demand deposit, but doesn’t.

With funding issues, I’m not wild about the Library of Congress being the only entity [for storage.]

In the UK, there are multiple repositories.

ChrisAldrich:

testing to see if I’m still live

What happens if you livetweet too much in one day.


Reframing What Academic Freedom Means in the Digital Age

Creation of a Task Force on Academic Freedom

Not long ago, my alma mater Johns Hopkins University announced the creation of a task force on Academic Freedom.   Since then, I’ve corresponded with the group on a few occasions and in the spirit of my notes to them, I thought I’d share some of those thoughts with others in the academy, science writers/communicators, and even the general public who may also find them useful.  Toward that end, below is a slightly modified version of my two main emails to the task force. [They’ve been revised marginally for their appearance and readability in this format and now also include section headings.] While I’m generally writing about Johns Hopkins as an example, I’m sure that the majority of it also applies to the rest of the academy.

On a personal note, the first email has some interesting thoughts and background, while the second has some stronger, broader recommendations.

Jacques-Louis David’s “The Death of Socrates” (1787, oil on canvas)

 

My First Thoughts to the Task Force

Matthew Green’s Blog and Questions of National Security

Early in September 2013, there was a rather large PR nightmare created for the university (especially as regards poor representation within the blogosphere and social media) when interim Dean of the Whiting School of Engineering Andrew Douglas requested to have professor Matthew Green’s web presence modified in relation to an alleged anti-NSA post on it.  Given the increasing level of NSA-related privacy news at the time (and since, as relates to the ongoing Edward Snowden case), the case was certainly blown out of proportion.  But the Green/NSA story is also one of the most highlighted cases relating to academic freedom in higher education in the last several years, and I’m sure it may be the motivating force behind why the task force was created in the first place.  (If you or the task force are unaware of the issues in that case you can certainly do a quick web search, though one of the foremost followers of the controversy was Ars Technica, which provided this post with most of the pertinent information; alternately take a look at what journalism professor Jay Rosen had to say on the issue in the Guardian.) I’m sure you can find a wealth of additional reportage from the Hopkins Office of News and Information, which maintains its daily digests of “Today’s News” from around that time period.

In my mind, much of the issue and the outpouring of poor publicity that fell upon the university resulted from the media getting information about the situation via social media before the internal mechanisms of the university had the chance to look at the issue in detail and provide a more timely resolution. [Rumors via social media will certainly confirm the aphorism, often attributed to Mark Twain, that “A lie can travel half way around the world while the truth is putting on its shoes.”]

While you’re mulling over the issue of academic freedom, I would highly suggest you all closely consider the increased impact of the internet and particularly social media with regard to any policies which are proposed going forward.  As the volunteer creator and initial maintainer of much of Hopkins’ social media presence on both Facebook and Twitter as well as many others for their first five years of existence (JHU was the first university in these areas of social media and most other major institutions followed our early lead), I have a keen insight to how these tools impact higher education.  With easy-to-use blogging platforms and social media (Matthew Green had both a personal blog that was hosted outside the University as well as one that was mirrored through the University as well as a Twitter account), professors now have a much larger megaphone and constituency than they’ve had any time in the preceding 450 years of the academy.  This fact creates unique problems as it relates to the university, its image, how it functions, and how its professoriate interact with relation to academic freedom, which is a far different animal than it had been even 17 years ago at the dawn of the internet age. Things can obviously become sticky and quickly as evinced in the Green/APL situation which was exacerbated by the APL’s single source of income at a time when the NSA and privacy were foremost in the public eye.

What are Some of the Issues for Academic Freedom in the Digital Age?

Consider the following:

  • How should/shouldn’t the university regulate the border of social media and internet presence at  the line between personal/private lives and professional lives?
  • How can the university help to promote/facilitate the use of the internet/social media to increase the academic freedom of its professoriate and simultaneously lower the technological hurdles as well as the generational hurdles faced by the academy? (I suspect that few on the task force have personal blogs or twitter accounts, much less professional blogs hosted by the university beyond their simple “business card” information pages through their respective departments.)
  • How should the university handle issues like the Matthew Green/APL case so that comments via social media don’t gain steam and blow up in the media before the university has a chance to handle them internally? (As I recall, there were about two news cycles of JHU saying “no comment” and resulting bad press which reached the level of national attention prior to a resolution.)
  • How can the university help to diffuse the issues which led up to the Green/APL incident before they happen?
  • What type of press policy can the university create to facilitate/further academic freedom? (Here’s a bad example from professor Jonathan Katz/Washington University with some interesting commentary.)

I hope that the task force is able to spend some time with Dr. Green discussing his case and how it was handled.

Personal Reputation on the Internet in a Connected Age

I also suggest that the students on the task force take a peek into the case file of JHU’s Justin Park from 2007, which has become a textbook case for expression on the internet/in social media and its consequences (while keeping in mind that it was a social/cultural issue which was the root cause of the incident rather than malice or base racism – this aspect of the case wasn’t/isn’t highlighted in extant internet reportage – Susan Boswell [long-time Dean of Student Life] and Student Activities head Robert Turner can shed more light on the situation). Consider what the university would have done if Justin Park had been a professor instead of a student. What role did communication technology and the internet play in how these situations played out now compared to how they would have been handled when Dr. Grossman was a first-year professor just starting out? [Editor’s note: Dr. Grossman is an incredible thought leader, but most of his life and academic work occurred prior to the internet age. Though unconfirmed, I suspect that his internet experience or even experience with email is exceedingly limited.]

Academic Samizdat

In a related issue on academic freedom and the internet, I also hope you’re addressing, or at least touching on, the topic of academic samizdat, so that the university can put forward a clear (and thought-leading) policy on where we stand there as well.  I could certainly make a case that the university should come out strongly in favor of professors maintaining the ability to more easily self-publish without detriment to their subsequent publication chances in major journals (and resultant potential detriment to the arc of their careers), but the political ramifications in this changing landscape are certainly subtle given that the university deals with both major sides: as the employer of the faculty while simultaneously being one of the major customers of the institutionalized research publishing industry.  As I currently view the situation, self-publishing and the internet will likely win the day over the major publishers, which puts the university in the position of pressing the issue in a positive light to its own ends and that of increasing knowledge for the world. I’m sure Dean Winston Tabb [Dean of the Sheridan Libraries at Johns Hopkins] and his excellent staff could provide the task force with some useful insight on this topic. Simultaneously, how can the increased areas of academic expression/publication (for example the rapidly growing but still relatively obscure area known as the “Digital Humanities”) be institutionalized such that publication in what have previously been non-traditional areas is included more formally in promotion decisions? If professors can be incentivized to use some of their academic freedom and expanded opportunities to both their and the university’s benefit, then certainly everyone wins. Shouldn’t academic freedom also include the freedom of where/when to publish without detriment to one’s future career – particularly in an increasingly rapidly shifting landscape of publication choices and outlets?

The Modern Research University is a Content Aggregator and Distributor (and Should Be Thought of as Such)

Taking the topic several steps further, given the value of the professoriate and their intellectual creations and content, couldn’t/shouldn’t the university create a customized platform to assist its employees in disseminating and promoting their own work? As an example, consider the volume of work (approximately 16,000-20,000 journal articles/year, as well as thousands of articles written for newspapers (NY Times, Wall Street Journal, etc.), magazines, and other outlets – academic or otherwise) being generated every year by those within the university.  In a time of decreasing cost of content distribution, universities no longer need to rely on major journals, magazines, television stations, cable/satellite television, et al. to distribute their “product”.  To put things in perspective, I could build the infrastructure to start a 24/7 streaming video service equivalent to both a television station and a major newspaper in my garage for a capital cost of about $10,000.  Why not bring it all in-house, with the benefit of academic flexibility as an added draw, to better support the university and its mission?  (Naturally, this could all be cross-promoted to other outlets after the fact for additional publicity.)  At a time when MOOCs (massive open online courses) are eroding some of the educational mission within higher education and journals are facing increased financial pressures, perhaps there should be a new model of the university as a massive content/information creation engine and distributor for the betterment of humanity? And isn’t that what Johns Hopkins already is at heart? We’re already one of the largest knowledge creators on the planet; why are we not also simultaneously one of the largest knowledge disseminators – particularly at a time when it is inexpensive to do so, and becoming cheaper by the day?

[Email closing formalities removed]

 

Expanded Thoughts on Proactive Academic Freedom

Reframing What Academic Freedom Means in the Digital Age

[Second email opening removed]

Upon continued thought and reading on the topic of academic freedom as well as the associated areas of technology, I might presuppose (as most probably do) that the committee will be looking more directly at the concept of preventing the university from impeding the freedom of its faculty and what happens in those situations where action ought to be taken for the benefit of the wider community (censure, probation, warnings, etc.).  If it hasn’t been brought up as a point yet, I think one of the most positive things the university could do to improve not only academic freedom, but the university’s position in relation to its competitive peers, is to look at the opposite side of the proverbial coin and actually find a way for the university to PROACTIVELY help promote the voices of its faculty and assist them in broadening their reach.

I touched upon the concept tangentially in my first email (see above), but thought it deserved some additional emphasis, examples to consider, and some possible recommendations. Over the coming decades, the aging professoriate will slowly retire to be replaced with younger faculty who grew up completely within the internet age and who are far more savvy about it as well as the concepts of Web 2.0, the social web and social media. More will be literate in how to shoot and edit short videos and how to post them online to garner attention, readership, and acceptance for their ideas and viewpoints.

The recent PBS Frontline documentary “Generation Like” features a handful of pre-teens and teens who are internet sensations garnering hundreds of thousands to millions of views of their content online.  But imagine for a minute: a savvy professoriate that could do something similar with their academic thought and engage hundreds, thousands, or millions on behalf of Johns Hopkins?  Or consider the agency portrayed in the documentary [about 30 minutes in] that helps these internet sensations, and what would happen if that type of functionality were taken on by the Provost’s office?

With a cross-collaboration of the Provost’s office, the Sheridan Libraries, the Film & Media Studies Department, the Digital Media Center, and the Communications Office, the institution should be able to better train faculty who are not already using these tools to improve their web presence and reach.

What “Reach” Do Academics Really Have?

I’ve always been struck by my conversations with many professors about the reach of their academic work. I can cite the particular experience of Dr. P.M. Forni, in the Department of Romance Languages at Krieger, when he told me that he’s written dozens of academic papers and journal articles, most of which have “at most a [collective] readership of at most 11 people on the planet” – primarily because academic specialties have become so niche. He was completely dumbfounded on the expanded reach he had in not only writing a main-stream book on the topic of civility, which was heavily influenced by his academic research and background, but in the even more drastically expanded reach provided to him by appearing on the Oprah Winfrey show shortly after its release. Certainly his experience is not a common one, but there is a vast area in between that is being lost, not only by individual professors, but by the university by extension.  Since you’re likely aware of the general numbers of people reading academic papers, I won’t bore you, but for the benefit of those on the committee I’ll quote a recent article from Pacific Standard Magazine and provide an additional reference from Physics World, 2007:

A study at Indiana University found that ‘as many as 50% of papers are never read by anyone other than their authors, referees and journal editors.’ That same study concluded that ‘some 90% of papers that have been published in academic journals are never cited.’ That is, nine out of 10 academic papers—which both often take years to research, compile, submit, and get published, and are a major component by which a scholar’s output is measured—contribute little to the academic conversation.

Some Examples of Increased Reach in the Academy

To provide some examples and simple statistics on where something like this might go, allow me to present the following brief references:

As a first example, written by an academic about academia, I suggest you take a look at a recent blog post “Why academics should blog and an update on readership” by Artem Kaznatcheev, a researcher in computer science and psychology at McGill University, posting on a shared blog named “Theory, Evolution, and Games Group”. He provides a clear and interesting motivation in the first major portion of his essay, and then, unwittingly (for my example), he shows some basic statistics indicating a general minimum readership of 2,000 people which occasionally goes as high as 8,000.  (Knowing how his platform operates and calculates the baseline statistics he’s using, it’s likely that his actual readership is even higher.) If one skims through the blog, it’s obvious that he’s not providing infotainment-type material like one would find on TMZ, Buzzfeed, or major media outlets, but genuine academic thought – AND MANAGING TO REACH A SIZEABLE AUDIENCE! Even better, I would posit that his blog enriches not only himself and his fellow academy colleagues, but also a reasonable number of people outside of the academy, and therefore the world.

Another example of an even more technical academic blog can be found in that of Dr. Terence Tao, a Fields Medal winner (the mathematical equivalent of the Nobel Prize) and mathematics professor at UCLA. You’ll note that it’s far more technical and rigorous than Dr. Kaznatcheev’s, and though I don’t have direct statistics to back it up, I can posit, based on the number of comments his blog receives, that his active readership is even higher. Dr. Tao uses his blog not only to expound upon his own work, but also to post content for classes, to post portions of a book in process, and to promote the general mathematics research community. (I note that the post he made on 3/19 already has, within a day, 11 comments by people who’ve read it closely enough to suggest typography changes as well as spark some actual conversation on a topic that requires an education to at least the level of a master’s degree in mathematics.)

Business Insider recently featured a list of 50 scientists to follow on social media (Twitter, Facebook, Tumblr, YouTube, and blogs amongst others). While there are a handful of celebrities and science journalists, many of those featured are professors or academics of one sort or another and quite a few of them are Ph.D. candidates (the beginning of the upcoming generation of tech-savvy future faculty I mentioned). Why aren’t there any JHU professors amongst those on this list?

As another clear example, consider the recent online video produced by NPR’s “Science Friday” show featuring research about water flowing uphill via the Leidenfrost effect. Not only is the research generally interesting, but it is also a great advertisement for the University of Bath, a great teaching tool for students, and a showcase for both the research itself and the involvement of undergraduates in it. Though I’ll admit that producing these types of vignettes is not necessarily simple, imagine the potential effect on the awareness of the university’s output if we could do this with even 10% of the academic research paper output. Imagine these types of videos as inspiring tools to assist in gaining research funding from government agencies, or as fundraising tools for Alumni and Development relations. And how much better that they could be easily shared and spread organically on the web, not necessarily by the JHU corporate umbrella, but by its faculty, students, alumni, and friends.

How Does the Academy Begin Accomplishing All of This?

To begin, I’ll mention that Keswick’s new video lab or the Digital Media Center at Homewood and a few others like them are a great start, but they are just the tip of the iceberg (and somewhat unfortunate that faculty from any division will have to travel to use the Keswick facility, if they’re even notionally aware of it and its capabilities).

I recall Mary Spiro, a communications specialist/writer with the Institute of NanoBioTechnology, doing a test-pilot Intersession program in January about 4 years ago in which she helped teach a small group of researchers how to shoot and edit their own films about their research or even tours through their lab. Something like this program could be improved, amplified, and rolled out on a much larger basis. It could also be integrated or dovetailed, in part, with the Digital Media Center and the Film and Media Studies program at Krieger to assist researchers in their work.

The Sheridan Libraries provide teaching and training on academic tools like the bibliographic programs Mendeley, RefWorks, and Zotero, but they could extend this to social media, blogging, or tools like Figshare, GitHub, and others.

Individual departments or divisions could adopt and easily maintain free content management platforms like WordPress and Drupal. (I might even specifically look at OpenScholar, a pre-configured product for academia; for example, take a look at Harvard’s version.) This would make it much easier for even non-technically-minded faculty to come up to speed by removing the initial trouble of starting a blog. It also has the side benefit of allowing the university to assist with ongoing maintenance, backups, data management, and hosting, as well as look and feel, branding, and web optimization. (As a simple example, and not meant to embarrass them: despite the fact that the JHU Math Department may have been one of the first departments in the university to be on the web, it’s a travesty that their website looks almost exactly as it did 20 years ago and has less content on it than Terence Tao’s personal blog, which he maintains as a one-man operation. I’m sure that some of the issue is political in the way the web has grown up over time at Hopkins, but the lion’s share is technology- and management-based.)

The Provost’s office, in conjunction with IT and the Sheridan Libraries, could invest some time and energy into compiling resources and vetting them for ease of use, best practices, and use cases, and then providing summaries of these tools to the faculty so that each faculty member need not re-invent the wheel each time, but can get up and running more quickly.  This type of resource needs to be better advertised and made idiot-proof (for lack of better terminology) to ease faculty access and adoption. Online resources like the Chronicle of Higher Education’s ProfHacker blog can be mined for interesting tools and use cases, for example.

I know portions of these types of initiatives are already brewing in small individual pockets around the university, but they need to be brought together and better empowered as a group instead of as individuals working separately in a vacuum.  In interacting with people across the institution, this technology area seems to be one of those that has been left behind in the “One Hopkins” initiative.  One of the largest hurdles is teaching old dogs new tricks, to put it colloquially, but the hurdles to understanding and comprehending these new digital tools are coming down drastically by the day. As part of the social contract in the university’s granting and promoting of academic freedom, the faculty should be better encouraged (though certainly not forced) to exercise it.  I’m sure there are mandatory annual seminars on topics like sexual harassment; should there not be mandatory technology trainings as well?

To briefly recap, it would be phenomenal to see the committee make not only their base recommendations on what most consider academic freedom, but to further make a group of strong recommendations about the University proactively teaching, training, and providing a broader array of tools to encourage the active expression of the academic freedom that is provided within Hopkins’ [or even all of the Academy’s] mighty walls.

[Email closing removed]

I certainly welcome any thoughts or comments others may have on these topics. Please feel free to add them in the comments below.

 
