Notes from Day 2 of Dodging the Memory Hole: Saving Online News | Friday, October 14, 2016

If you missed the notes from Day 1, see this post.

It may take me a week or so to finish putting some general thoughts and additional resources together based on the two-day conference so that I might give a more thorough accounting of my opinions as well as next steps. Until then, I hope that the details and mini-archive of content below may help others who attended, or provide a resource for those who couldn’t make the conference.

Overall, it was an incredibly well-programmed and well-run conference, so kudos to all those involved who kept things moving along. I’m now certainly much more aware of the gaping memory hole the internet is facing despite the heroic efforts of a small handful of people and institutions attempting to improve the situation. I’ll try to go into more detail later about a handful of specific topics and next steps as well as a listing of resources I came across which may prove to be useful tools for both the archiving/preserving and IndieWeb communities.

Archive of materials for Day 2

Audio Files

Below are the recorded audio files embedded in .m4a format (using a Livescribe Pulse Pen) for several sessions held throughout the day. To my knowledge, none of the breakout sessions were recorded except for the one which appears below.

Summarizing archival collections using storytelling techniques


Presentation: Summarizing archival collections using storytelling techniques by Michael Nelson, Ph.D., Old Dominion University

Saving the first draft of history


Special guest speaker: Saving the first draft of history: The unlikely rescue of the AP’s Vietnam War files by Peter Arnett, winner of the Pulitzer Prize for journalism
Peter Arnett talking about news reporting in Vietnam in the ’60s.

Kiss your app goodbye: the fragility of data journalism


Panel: Kiss your app goodbye: the fragility of data journalism
Featuring Meredith Broussard, New York University; Regina Lee Roberts, Stanford University; Ben Welsh, The Los Angeles Times; moderator Martin Klein, Ph.D., Los Alamos National Laboratory

The future of the past: modernizing The New York Times archive


Panel: The future of the past: modernizing The New York Times archive
Featuring The New York Times Technology Team: Evan Sandhaus, Jane Cotler and Sophia Van Valkenburg; moderated by Edward McCain, RJI and MU Libraries

Lightning Rounds: Six Presenters



Lightning rounds (in two parts)
Six + one presenters: Jefferson Bailey, Terry Britt, Katherine Boss (and team), Cynthia Joyce, Mark Graham, Jennifer Younger and Kalev Leetaru
1: Jefferson Bailey, Internet Archive, “Supporting Data-Driven Research using News-Related Web Archives”
2: Terry Britt, University of Missouri, “News archives as cornerstones of collective memory”
3: Katherine Boss, Meredith Broussard and Eva Revear, New York University, “Challenges facing preservation of born-digital news applications”
4: Cynthia Joyce, University of Mississippi, “Keyword ‘Katrina’: Re-collecting the unsearchable past”
5: Mark Graham, Internet Archive/The Wayback Machine, “Archiving news at the Internet Archive”
6: Jennifer Younger, Catholic Research Resources Alliance, “Digital Preservation, Aggregated, Collaborative, Catholic”
7: Kalev Leetaru, senior fellow, The George Washington University and founder of the GDELT Project, “A Look Inside The World’s Largest Initiative To Understand And Archive The World’s News”

Technology and Community


Presentation: Technology and community: Why we need partners, collaborators, and friends by Kate Zwaard, Library of Congress

Breakout: Working with CMS


Working with CMS, led by Eric Weig, University of Kentucky

Alignment and reciprocity


Alignment & reciprocity by Katherine Skinner, Ph.D., executive director, the Educopia Institute

Closing remarks


Closing remarks by Edward McCain, RJI and MU Libraries and Todd Grappone, associate university librarian, UCLA

Live Tweet Archive

Reminder: In many cases my tweets don’t reflect direct quotes of the attributed speaker, but are often slightly modified for clarity and length for posting to Twitter. I have made a reasonable attempt in all cases to capture the overall sentiment of individual statements while using as many original words of the participant as possible. Typically, for speed, there wasn’t much editing of these notes. Below I’ve changed the attribution of one or two tweets to reflect the proper person(s). For convenience, I’ve also added a few hyperlinks to useful resources after the fact that I didn’t have time to include in the original tweets. I’ve attached .m4a audio files of most of the audio for the day (apologies for shaky quality as it’s unedited) which can be used for more direct attribution if desired. The Reynolds Journalism Institute videotaped the entire day and livestreamed it. Presumably they will release the video on their website for a more immersive experience.

Peter Arnett:

Condoms were required issue in Vietnam–we used them to waterproof film containers in the field.

Do not stay close to the head of a column, medics, or radiomen. #warreportingadvice

I told the AP I would undertake the task of destroying all the reporters’ files from the war.

Instead the AP files moved around with me.

Eventually the 10 trunks of material went back to the AP when they hired a brilliant archivist.

“The negatives can outweigh the positives when you’re in trouble.”

Edward McCain:

Our first panel: Kiss your app goodbye: the fragility of data journalism

Meredith Broussard:

I teach data journalism at NYU

A news app is not what you’d install on your phone

Dollars for Docs is a good example of a news app

A news app is something that allows users to put themselves into the story.

Often there are three CMSs: web, print, and video.

News apps don’t live in any of the CMSs. They’re bespoke and live on a separate data server.

This has implications for crawlers which can’t handle them well.

Then how do we save news apps? We’re looking at examples and then generalizing.

Everyblock.com was a good example based on chicagocrime and later bought by NBC and shut down.

What?! The internet isn’t forever? Databases need to be saved differently than web pages.

ReproZip was developed by the NYU Center for Data Science and we’re using it to save the code, data, and environment.

Ben Welsh:

My slides will be at http://bit.ly/frameworkfix. I work on the data desk @LATimes

We make apps that serve our audience.

We also make internal tools that empower the newsroom.

We also use our nerdy skills to do cool things.

Most of us aren’t good programmers, we “cheat” by using frameworks.

Frameworks do a lot of basic things for you, so you don’t have to know how to do it yourself.

Archiving tools often aren’t built into these frameworks.

Instagram, Pinterest, Mozilla, and the LA Times use Django as our framework.

Memento for WordPress is a great way to archive pages.

We must do more. We need archiving baked into the systems from the start.

Slides at http://bit.ly/frameworkfix

Regina Roberts:

Got data? I’m a librarian at Stanford University.

I’ll mention Christine Borgman’s book Big Data, Little Data, No Data.

Journalists are great data liberators: FOIA requests, cleaning data, visualizing, getting stories out of data.

But what happens to the data once the story is published?

BLDR: Big Local Digital Repository, an open repository for sharing open data.

Solutions that exist: Hydra at http://projecthydra.org or Open ICPSR www.openicpsr.org

For metadata: www.ddialliance.org, RDF, International Image Interoperability Framework (iiif) and MODS

Martin Klein:

We’ll open up for questions.

Audience Question:

What’s more important: obey copyright laws or preserving the content?

Regina Roberts:

The new creative commons licenses are very helpful, but we have to be attentive to many issues.

Perhaps archiving it and embargoing for later?

Ben Welsh:

Saving the published work is more important to me, and the rest of the byproduct is gravy.

Evan Sandhaus:

I work for the New York Times, you may have heard of it…

Doing a quick demo of Times Machine from @NYTimes

Sophia van Valkenburg:

Talking about modernizing the born-digital legacy content.

Our problem was how to make an article from 2004 look like it had been published today.

There were hundreds of thousands of articles missing.

There was no one definitive list of missing articles.

Outlining the workflow for reconciling the archive XML and the definitive list of URLs for conversion.

It’s important to use more than one source for building an archive.

Jane Cotler:

I’m going to talk about all of “the little things” that came up along the way.

Article Matching: Fusion – how to match print XML with web HTML that was scraped.

Primarily, we looked at common phrases between the corpora of the two different data sets.
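As a rough illustration of that phrase-matching idea (my own sketch with made-up texts, not the Times team’s actual pipeline), one could shingle each text into word trigrams and score the overlap:

```python
# Illustrative sketch of phrase-based article matching (not the actual
# NYT code): shingle each text into word trigrams, then score overlap
# with Jaccard similarity.
def shingles(text, n=3):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity(a, b, n=3):
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Hypothetical texts standing in for a print-XML article and its
# scraped web-HTML counterpart:
print_text = "the mayor announced a new budget plan on tuesday"
web_text = "on tuesday the mayor announced a new budget plan"
score = similarity(print_text, web_text)
```

A pair of records whose score clears some threshold would be treated as the same article in both corpora.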

We prioritized the print data over the digital data.

We maintain a system called switchboard that redirects from old URLs to the new ones to prevent link rot.
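A toy model of what a switchboard-style redirect table might look like (URLs and names are hypothetical; this is not the Times’ actual system):

```python
# Toy "switchboard" sketch: legacy paths map to canonical new URLs so
# inbound links to old articles don't rot. URLs are invented examples.
REDIRECTS = {
    "/2004/05/01/old-article.html": "/2004/05/01/us/old-article.html",
}

def resolve(path):
    """Return (status, location): 301 for a known legacy path, else 200."""
    if path in REDIRECTS:
        return 301, REDIRECTS[path]
    return 200, path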

The case of the missing sections: some sections of the content were blank and not transcribed.

We made the decision to take out data we had in favor of a better user experience for the missing sections.

In the future, we’d also like to put photos back into the articles.

Evan Sandhaus:

Modernizing and archiving the @NYTimes archives is an ongoing challenge.

Edward McCain:

Can you discuss the decision to go with a more modern interface rather than a traditional archive of how it looked?

Evan Sandhaus:

Some of the decision was to get the data into an accessible format for modern users.

We do need to continue work on preserving the original experience.

Edward McCain:

Is there a way to distinguish between the print version and the online versions in the archive?

Audience Question:

Could a researcher do work on the entire corpus? Is it available for subscription?

Edward McCain:

We do have a sub-section of data available, but don’t have it prior to 1960.

Audience Question:

Have you documented the process you’ve used on this preservation project?

Sophia van Valkenburg:

We did save all of the code for the project within GitHub.

Jane Cotler:

We do have meeting notes which provide some documentation, though they’re not thorough.

ChrisAldrich:

Oh dear. Of roughly 1,155 tweets I counted about #DtMH2016 in the last week, roughly 25% came from me. #noisy

Open-source tool I had mentioned to several: @wallabagapp A self-hostable application for saving web pages https://www.wallabag.org

Notes from Day 1 of Dodging the Memory Hole: Saving Online News | Thursday, October 13, 2016

Today I spent the majority of the day attending the first of a two-day conference at UCLA’s Charles Young Research Library entitled “Dodging the Memory Hole: Saving Online News.” While I knew mostly what I was getting into, it hadn’t really occurred to me how much of what is on the web is not backed up or archived in any meaningful way. By nature, people neglect to back up their data, but huge swaths of really important data with newsworthy and historic value are being heavily neglected. Fortunately it’s an interesting enough problem to draw the 100 or so scholars, researchers, technologists, and journalists who showed up for the start of an interesting group convened through the Reynolds Journalism Institute and several sponsors of the event.

What particularly strikes me is how many of the philosophies of the IndieWeb movement and tools developed by it are applicable to some of the problems that online news faces. I suspect that if more journalists were practicing members of the IndieWeb and used their sites not only for collecting and storing the underlying data upon which they base their stories, but to publish them as well, then some of the (future) archival process may be easier to accomplish. I’ve got so many disparate thoughts running around my mind after the first day that it’ll take a bit of time to process before I write out some more detailed thoughts.

Twitter List for the Conference

As a reminder to those attending, I’ve accumulated a list of everyone who’s tweeted with the hashtag #DtMH2016, so that attendees can more easily follow each other as well as communicate online following our few days together in Los Angeles. Twitter also allows subscribing to entire lists too if that’s something in which people have interest.

Archiving the day

It seems only fitting that an attendee of a conference about saving and archiving digital news, would make a reasonable attempt to archive some of his experience right?! Toward that end, below is an archive of my tweetstorm during the day marked up with microformats and including hovercards for the speakers with appropriate available metadata. For those interested, I used a fantastic web app called Noter Live to capture, tweet, and more easily archive the stream.

Note that in many cases my tweets don’t reflect direct quotes of the attributed speaker, but are often slightly modified for clarity and length for posting to Twitter. I have made a reasonable attempt in all cases to capture the overall sentiment of individual statements while using as many original words of the participant as possible. Typically, for speed, there wasn’t much editing of these notes. I’m also attaching .m4a audio files of most of the audio for the day (apologies for shaky quality as it’s unedited) which can be used for more direct attribution if desired. The Reynolds Journalism Institute videotaped the entire day and livestreamed it. Presumably they will release the video on their website for a more immersive experience.

If you prefer to read the stream of notes in the original Twitter format, so that you can like/retweet/comment on individual pieces, this link should give you the entire stream. Naturally, comments are also welcome below.

Audio Files

Below are the audio files for several sessions held throughout the day.

Greetings and Keynote


Greetings: Edward McCain, digital curator of journalism, Donald W. Reynolds Journalism Institute (RJI) and University of Missouri Libraries and Ginny Steel, university librarian, UCLA
Keynote: Digital salvage operations — what’s worth saving? given by Hjalmar Gislason, vice president of data, Qlik

Why save online news? and NewsScape


Panel: “Why save online news?” featuring Chris Freeland, Washington University; Matt Weber, Ph.D., Rutgers, The State University of New Jersey; Laura Wrubel, The George Washington University; moderator Ana Krahmer, Ph.D., University of North Texas
Presentation: “NewsScape: preserving TV news” given by Tim Groeling, Ph.D., UCLA Communication Studies Department

Born-digital news preservation in perspective


Speaker: Clifford Lynch, Ph.D., executive director, Coalition for Networked Information on “Born-digital news preservation in perspective”

Live Tweet Archive

ChrisAldrich:

Getting Noter Live fired up for Dodging the Memory Hole 2016: Saving Online News https://www.rjionline.org/dtmh2016

Ginny Steel:

I’m glad I’m not at NBC trying to figure out the details for releasing THE APPRENTICE tapes.

Edward McCain:

Let’s thank @UCLA and the library for hosting us all.

While you’re here, don’t forget to vote/provide feedback throughout the day for IMLS

Someone once pulled up behind me and said “Hi Tiiiigeeerrr!” #Mizzou

A server at the Missourian crashed as the system was obsolete and running on baling wire. We lost 15 years of archives

The dean & head of Libraries created a position to save born digital news.

We’d like to help define stake-holder roles in relation to the problem.

Newspaper is really an outmoded term now.

I’d like to celebrate that we have 14 student scholars here today.

We’d like to have you identify specific projects that we can take to funding sources to begin work after the conference

We’ll be going to our first speaker who will be introduced by Martin Klein from Los Alamos.

Martin Klein:

Hjalmar Gislason is a self-described digital nerd. He’s the Vice President of Data.

I wonder how one becomes the President of Data?

Hjalmar Gislason:

My Icelandic name may be the most complicated part of my talk this morning.

Speaking on “Digital Salvage Operations: What’s Worth Saving?”

My father in law accidentally threw away my wife’s favorite stuffed animal. #DeafTeddy

Some people just throw everything away because they’re not being used. Others keep everything and don’t throw it away.

The fundamental question: Do you want to save everything or do you want to get rid of everything?

I joined @qlik two years ago and moved to Boston.

Before that I was with spurl.net which was about saving copies of webpages they’d previously visited.

I had also previously invested in kjarninn which is translated as core.

We used to have little data; now we have gigantic data and are moving to gargantuan data soon.

One of my goals today is to broaden our perspective about what data needs saving.

There’s the Web, the “Deep” Web, then there’s “Other” data which is at the bottom of the pyramid.

I got to see into the process of #panamapapers but I’d like to discuss the consequences from April 3rd.

The number of meetings was almost more than could have been covered in real time in Iceland.

The #panamapapers were a soap opera, much like US politics.

Looking back at the process is highly interesting, but it’s difficult to look at all the data as it unfolded.

How can we capture all the media minute by minute as a story unfolds?

You can’t trust that you can go back to a story at a certain time and know that it hasn’t been changed. #1984 #Orwell

There was a relatively pro-HRC piece earlier this year @NYTimes that was changed.

Newsdiffs tracks changes in news over time. The HRC article had changed a lot.
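The core mechanic behind tracking post-publication edits can be sketched in a few lines (my illustration with invented headlines, not NewsDiffs’ actual code): diff two saved versions of an article and keep only the changed lines.

```python
import difflib

# Minimal sketch of edit tracking (not NewsDiffs' actual code):
# diff two saved snapshots of an article, keep only changed lines.
v1 = ["Candidate leads in new poll.", "Analysts called the result decisive."]
v2 = ["Candidate leads in new poll.", "Analysts called the result notable."]

diff = difflib.unified_diff(v1, v2, lineterm="")
changed = [line for line in diff
           if line.startswith(("+", "-"))
           and not line.startswith(("+++", "---"))]
```

Run repeatedly against periodic crawls of the same URL, this surfaces exactly the kind of silent revision described above.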

Let’s say you referenced @CNN 10 years ago, likely now, the CMS and the story have both changed.

8 years ago, I asked, wouldn’t we like to have the social media from Iceland’s only Nobel Laureate as a teenager?

What is private/public, ethical/unethical when dealing with data?

Much data is hidden behind passwords or on systems which are not easily accessed from a database perspective.

Most of the content published on Facebook isn’t public. It’s hard to archive in addition to being big.

We as archivists have no claim on the hidden data within Facebook.

ChrisAldrich:

This could help archivists in the future in accessing more personal data.

Hjalmar Gislason:

Then there’s “other” data: 500 hours of video is uploaded to YouTube per minute.

No organization can go around watching all of this video data. Which parts are newsworthy?

Content could surface much later or could surface through later research.

Hornbjargsviti lighthouse recorded the weather every three hours for years creating lots of data.

And that was just one of hundreds of sites that recorded this type of data in Iceland.

Lots of this data is lost. Much that has been found was by coincidence. It was never thought to archive it.

This type of weather data could be very valuable to researchers later on.

There was also a large archive of Icelandic data that was found.

Showing a timelapse of Icelandic earthquakes https://vimeo.com/24442762

You can watch the magma working its way through the ground before it makes its way up through the land.

National Geographic featured this video in a documentary.

Sometimes context is important when it comes to data. What is archived today may be more important later.

As the economic crisis unfolded in Greece, it turned out the data that was used to allow them into EU was wrong.

The data was published at the time of the crisis, but there was no record of what the data looked like 5 years earlier.

The only way to recreate the data was from prior printed sources. This is usually only done in extraordinary circumstances.

We captured 150k+ data sets with more than 8 billion “facts” which was just a tiny fraction of what exists.

How can we delve deeper into large data sets, all with different configurations and proprietary systems?

“There’s a story in every piece of data.”

Once a year, energy consumption seems to dip because February has fewer days than other months. Plotting it matters.
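The February effect is easy to demonstrate (made-up consumption figures, purely illustrative): normalize monthly totals by days in the month before comparing, otherwise the shortest month shows a spurious dip.

```python
import calendar

# Hypothetical monthly energy totals: February's total is the lowest,
# but only because it has fewer days.
monthly_total = {1: 3100, 2: 2830, 3: 3100}

def per_day(totals, year):
    # calendar.monthrange(year, month) returns (first_weekday, days_in_month)
    return {m: t / calendar.monthrange(year, m)[1] for m, t in totals.items()}

daily = per_day(monthly_total, 2015)
```

After normalization, February’s per-day consumption actually comes out highest here even though its raw total is the lowest.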

Year over year comparisons can be difficult because of things like 3 day weekends which shift over time.

Here’s a graph of the population of Iceland. We’ve had our fair share of diseases and volcanic eruptions.

To compare, here’s a graph of the population of sheep. They outnumber us by an order(s) of magnitude.

In the 1780’s there was an event that killed off lots of sheep, so people had the upper hand.

Do we learn more from reading today’s “newspaper” or one from 30, 50, or 100 years ago?

There was a letter to the editor about an eruption and people had to move into the city.

letter: “We can’t have all these people come here, we need to build for our own people first.”

This isn’t too different from our problems today with respect to Syria. In that case, the people actually lived closer.

In the born-digital age, what will the experience look like trying to capture today 40 years hence?

Will it even be possible?

Machine data connections will outnumber “people” data connections by a factor of 10 or more very quickly.

With data, we need to analyze, store, and discard data. How do we decide in a split-second what to keep and discard?

We’re back to the father-in-law and mother-in-law question: What to get rid of and what to save?

Computers are continually beating humans at tasks: chess, Go, driving a car. They build on lots more experience based on data.

Whoever has the most data on driving cars and landscape will be the ultimate winner in that particular space.

Data is valuable, sometimes we just don’t know which yet.

Hoarding is not a strategy.

You can only guess at what will be important.

“Commercial use in Doubt” The third sub-headline in a newspaper about an early test of television.

There’s more to it than just the web.

Kate Zwaard:

“Hoarding isn’t a strategy” really resonates with librarians; what could that relationship look like?

Hjalmar Gislason:

One should bring in data science, industry may be ahead of libraries.

Cross-disciplinary approaches may be best. How can you get a data scientist to look at your problem? Get their attention?

Peter Arnett:

There’s 60K+ books about the Viet Nam War. How do we learn to integrate what we learn after an event (like that)?

Hjalmar Gislason:

Perspective always comes with time, as additional information arrives.

Scientific papers are archived in a good way, but the underlying data is a problem.

In the future you may have the ability to add supplementary data to supplement what appears in a book (in a better way).

Archives can give the ability to have much greater depth on many topics.

Are there any centers of excellence on the topics we’re discussing today? This conference may be IT.

We need more people that come from the technical side of things to be watching this online news problem.

Hacks/Hackers is a meetup group that takes place all over the world.

It brings the journalists and computer scientists together regularly for beers. It’s some of the outreach we need.

Edward McCain:

If you’re not interested in money, this is a good area to explore. 10 minute break.

Don’t forget to leave your thoughts on the questions at the back of the room.

We’re going to get started with our first panel. Why is it important to save online news?

Matthew Weber:

I’m Matt Weber from Rutgers University, in communications.

I’ll talk about web archives and news media and how they interact.

I worked at Tribune Corp. for several years and covered politics in DC.

I wanted to study the way in which the news media is changing.

We’re increasingly seeing digital-only media with no offline surrogate.

It’s becoming increasingly difficult to do anything but look at it now as it exists.

There was no large scale online repository of online news to do research.

#OccupyWallStreet is one of the first examples of stories that exist online in occurrence and reportage.

There’s a growing need to archive content around local news particularly politics and democracy.

When there is a rich and vibrant local news environment, people are more likely to become engaged.

Local news is one of the least thought about from an archive perspective.

Laura Wrubel:

I’m at GWU Libraries in the scholarly technology group.

I’m involved in social feed manager which allows archivists to put together archives from social services.

Kimberly Gross, a faculty member, studies tweets of news outlets and journalists.

We created a prototype tool to allow them to collect data from social media.

In 2011, journalists were primarily using their Twitter presences to direct people to articles rather than for conversation.

We collect data of political candidates.

Chris Freeland:

I’m an associate librarian representing “Documenting the Now” with WashU, UCRiverside, & UofMd

Documenting the Now revolves around Twitter documentation.

It started with the Ferguson story and documenting media, videos during the protests in the community.

What can we as memory institutions do to capture the data?

We gathered 14 million tweets relating to Ferguson within two weeks.

We tried to build a platform that others could use in the future for similar data capture relating to social.

Ethics is important in archiving this type of news data.

Ana Krahmer:

Digitally preserving pdfs from news organizations and hyper-local news in Texas.

We’re approaching 5 million pages of archived local news.

What is news that needs to be archived, and why?

Matthew Weber:

First, what is news? The definition is unique to each individual.

We need to capture as much of the social news and social representation of news which is fragmented.

It’s an important part of society today.

We no longer produce hard copies like we did a decade ago. We need to capture the online portion.

Laura Wrubel:

We’d like to get the perspective of journalists, and don’t have one on the panel today.

We looked at how midterm election candidates used Twitter. Is that news itself? What tools do we use to archive it?

What does it mean to archive news by private citizens?

Chris Freeland:

Twitter was THE place to find information in St. Louis during the Ferguson protests.

Local news outlets weren’t as good as Twitter during the protests.

I could hear the protest from 5 blocks away and only found news about it on Twitter.

The story was being covered very differently on Twitter than the local (mainstream) news.

Alternate voices in the mix were very interesting and important.

Twitter was in the moment and wasn’t being edited and causing a delay.

What can we learn from this massive number of Ferguson tweets?

It gives us information about organizing, and what language was being used.

Ana Krahmer:

I think about the archival portion of this question. By whom does it need to be archived?

What do we archive next?

How are we representing the current population now?

Who is going to take on the burden of archiving? Should it be corporate? Cultural memory institution?

Someone needs to curate it; who does that?

Our next question: What do you view as primary barriers to news archiving?

Laura Wrubel:

How do we organize and staff? There’s no shortage of work.

Tools and software can help the process, but libraries are usually staffed very thinly.

No single institution can do this type of work alone. Collaboration is important.

Chris Freeland:

Two barriers we deal with: terms of service are an issue with archiving. We don’t own it, but can use it.

Libraries want to own the data in perpetuity. We don’t own our data.

There’s a disconnect in some of the business models for commercialization and archiving.

Issues with accessing data.

People were worried about becoming targets or losing jobs because of participation.

What is role of ethics of archiving this type of data? Allowing opting out?

What about redacting portions? anonymizing the contributions?

Ana Krahmer:

Publishers have a responsibility for archiving their product. Permission from publishers can be difficult.

We have a lot of underserved communities. What do we do with comments on stories?

Corporations may not continue to exist in the future and data will be lost.

Matthew Weber:

There’s a balance to be struck between the business side and the public good.

It’s hard to convince for profit about the value of archiving for the social good.

Chris Freeland:

Next Q: What opportunities have revealed themselves in preserving news?

Finding commonalities and differences in projects is important.

What does it mean to us to archive different media types? (think diversity)

What’s happening in my community? in the nation? across the world?

The long-history in our archives will help us learn about each other.

Ana Krahmer:

We can only do so much with the resources we have.

We’ve worked on a CyberCemetery project in the past.

Someone else can use the tools we create within their initiatives.

Chris Freeland:

Repeating an audience question: What are issues in archiving longer-form video data with regard to stories on Periscope?

Audience Question:

How do you channel the energy around news archiving?

Matthew Weber:

Research in the area is all so new.

Audience Question:

Does anyone have any experience with legal wrangling with social services?

Chris Freeland:

The ACLU is waging a lawsuit against Twitter about archived tweets.

Ana Krahmer:

Outreach to community papers is very rhizomic.

Audience Question:

How do you take local examples and make them a national model?

Ana Krahmer:

We’re teenagers now in the evolution of what we’re doing.

Edward McCain:

Peter Arnett just said “This is all more interesting than I thought it would be.”

Next Presentation: NewsScape: preserving TV news

Tim Groeling:

I’ll be talking about the NewsScape project of Francis Steen, Director, Communication Studies Archive

I’m leading the archiving of the analog portion of the collection.

The oldest of our collection dates from the 1950s. We’ve hosted them on YouTube, which has created some traction.

Commenters have been an issue with posting to YouTube as well as copyright.

NewsScape is the largest collection of TV news and public affairs programs (local & national)

Prior to 2006, we don’t know what we’ve got.

Paul said, “I’ll record everything I can and someone in the future can deal with it.”

We have 50K hours of Betamax.

VHS are actually most threatened, despite being newest tapes.

Our budget was seriously strapped.

Maintaining closed captioning is important to our archiving efforts.

We’ve done 36k hours of encoding this year.

We use a layer of dead VCR’s over our good VCR’s to prevent RF interference and audio buzzing. 🙂

Post-2006, we’re now going straight to digital.

Preservation is the first step, but we need to be more than the world’s best DVR.

Searching the news is important too.

Showing a data visualization of news analysis with regard to the Healthcare Reform movement.

We’re doing facial analysis as well.

We have interactive tools at viz2016.com.

We’ve tracked how often candidates have smiled in election 2016. Hillary > Trump

We want to share details within our collection, but don’t have tools yet.

Having a good VCR repairman has helped us a lot.

Edward McCain:

Breaking for lunch…

Clifford Lynch:

Talk “Born-digital news preservation in perspective”

There’s a shared consensus that preserving scholarly publications is important.

While delivery models have shifted, there must be some fall back to allow content to survive publisher failure.

Preservation was a joint investment between memory institutions and publishers.

Keepers register their coverage of journals for redundancy.

In studying coverage, we’ve discovered Elsevier is REALLY well covered, but they’re not what we’re worried about.

It’s the small journals as edge cases that really need more coverage.

Smaller journals don’t have resources to get into the keeper services and it’s more expensive.

Many open access journals are passion projects, heavily underfunded, and poorly covered.

Being mindful of these business dynamics is key when thinking about archiving news.

There are a handful of large news outlets that are “too big to fail.”

There are huge numbers of small outlets like subject verticals, foreign diasporas, etc. that need to be watched

Different strategies should be used for different outlets.

The material behind many links (used as sources) disappears after a short period of time.

While Archive.org is a great resource, it can’t do everything.

Preserving underlying evidence is really important.

How we deal with massive databases and queries against them is a difficult problem.

I’m not aware of studies of link rot with relationship to online news.

Who steps up to preserve major data dumps like Snowden, PanamaPapers, or email breaches?

Social media is a collection of observations and small facts without necessarily being journalism.

Journalism is a deliberate act and is meant to be public while social media is not.

We need to come up with a consensus about what parts of social media should be preserved as news.

News does often delve into social media as part of its evidence base now.

Responsible journalism should include archival storage, but it doesn’t yet.

Under current law, we can’t protect a lot of this material without the permission of the creator(s).

The Library of Congress can demand deposit, but doesn’t.

With funding issues, I’m not wild about the Library of Congress being the only entity [for storage.]

In the UK, there are multiple repositories.

ChrisAldrich:

testing to see if I’m still live

What happens if you livetweet too much in one day.
password-change-required

Homebrew Website Club — Los Angeles

In an effort to provide easier commuting access for a broader cross-section of Homebrew members, we met last night at Yahoo’s primary offices at 11995 W. Bluff Creek Drive, Playa Vista, CA 90094. We hope to alternate meetings of the Homebrew Website Club between the East and West sides of Los Angeles as we go forward. If anyone has additional potential meeting locations, we’re always open to suggestions as well as assistance.

We had our largest RSVP list to date, though some had last minute issues pop up and one sadly had trouble finding the location (likely due to a Google map glitch).

Angelo and Chris met before the quiet writing hour to discuss some general planning for future meetings as well as the upcoming IndieWebCamp in LA in November. Details and help for arrangements for out of town attendees should be posted shortly.

Notes from the “broadcast” portion of the meetup

Chris Aldrich (co-organizer)

Angelo Gladding (co-organizer)

  • Work is proceeding nicely on the overall build of Canopy
  • Discussed an issue with expanding data for social network in relation to events and potentially expanding contacts based on event attendees

Srikanth Bangalore (our host at Yahoo!)

  • Discussed some of his background in coding and work with Drupal and WordPress.
  • His personal site is https://srib.us/

Notes from the “working” portion of the meetup

We sketched out a way to help Srikanth IndieWeb-ify not only his own site, but to potentially help do so for Katie Couric’s Yahoo!-based news site, along with the pros/cons of workflows for journalists in general. We also considered some potential pathways for bolting on webmentions for websites (like Tumblr/WordPress) which utilize Disqus for their commenting system. We worked through the details of webmentions and a bit of micropub for his benefit.

Srikanth discussed some of the history and philosophy behind why Tumblr didn’t have a more “traditional” native commenting system. The point was generally to socially discourage negativity, spamming, and abuse by forcing people to post their comments front and center on their own site (and not just in the “comments” of the receiving site), so that the negativity redounds to their own reputation rather than just to the receiving page of the target. Most social media sites hide (or make hard to search and find) the abusive behavior of many users, while allowing them to appear better and nicer on their easier-to-find public-facing persona.

Before closing out the meeting officially, we stopped by the front lobby where two wonderful and personable security guards (one a budding photographer) not only helped us with a group photo, but managed to help us escape the parking lot!

I think it’s agreed we all had a great time and look forward to more progress on projects, more good discussion, and more interested folks at the next meeting. Srikanth was so amazed at some of the concepts, it’s possible that all of Yahoo! may be IndieWeb-ified by the end of the week. 🙂

We hope you’ll join us next month on 10/05! (Details forthcoming…)

Live Tweets Archive


Ever with grand aspirations to do as good a job as the illustrious Kevin Marks, we tried some livetweeting with Noterlive. Alas, the discussion quickly became so consuming that the effort was abandoned in favor of passion and fun. Hopefully some of the salient points were captured above in better form anyway.

Srikanth Bangalore:

I only use @drupal when I want to make money. (Replying to why his personal site was on @wordpress.) #

(This CMS comment may have been the biggest laugh of the night, though the tone captured here, and the lack of context, doesn’t do the comment any justice at all.)

Angelo Gladding:

I’m a hobby-ist programmer, but I also write code to make money. #

I’m into python which is my language of choice. #

Chris Aldrich:

Thanks again @themarketeng for hosting Homebrew Website Club at Yahoo tonight! We really appreciate the hospitality. #

My first pull request

Replied to My first pull request by Clint Lalonde (ClintLalonde.net)
Crazy to think that, even though I have had a GitHub account for 5 years and have poked, played and forked things, I have never made a pull request and contributed something to another project unti…
Clint, first, congratulations on your first PR!

Oddly, I had seen the VERY same post/repo a few weeks back and meant to add a readme too! (You’ll notice I got too wrapped up in reading through the code and creating some usability issues after installing the plugin instead.)

Given that you’ve got your own domain and website (and you’re playing in ed/tech like many of us are), and you’re syndicating your blog posts out to Medium for additional reach, I feel compelled to mention some interesting web tech and philosophy in the IndieWeb movement. You can find some great resources and tools at their website.

In particular, you might take a look at their WordPress pages, which include some plugins and resources you’ll be sure to appreciate. Their tools allow you not only to syndicate your WP posts (what they call POSSE), but also, by using the new W3C Webmention spec, to connect many of your social media accounts to brid.gy and have services like Twitter, Facebook, G+, Instagram, and others send the comments and likes on your posts there back to your blog directly, thereby allowing you to own all of your data (as well as the commentary that occurs elsewhere). I can see a lot of use for education in some of the infrastructure they’re building and aggregating there. (If you’re familiar with Known, it bakes a lot of IndieWeb goodness into its system from the start, but there’s no reason you shouldn’t have it for your WordPress site as well.)

If you need any help/guidance in following/installing anything there, I’m happy to help.

Congratulations again. Keep on pullin’!

Instagram Single Photo Bookmarklet

Ever wanted a simple and quick way to extract the primary details from an Instagram photo to put it on your own website?

The following javascript-based bookmarklet is courtesy of Tantek Çelik as an Indieweb tool he built at IndieWebCamp NYC2:

If you view a single photo permalink page, the following bookmarklet will extract the permalink (trimmed), photo jpg URL, and photo caption and copy them into a text note, suitable for posting as a photo that’s auto-linked:

javascript:n=document.images.length-1;s=document.images[n].src;s=s.split('?');s=s[0];u=document.location.toString().substring(0,39);prompt('Choose "Copy ⌘C" to copy photo post:',s+' '+u+'\n'+document.images[n].alt.toString().replace(RegExp(/\.\n(\.\n)+/),'\n'))

Any questions, let me know! –Tantek

If you want an easy drag-and-drop version, just drag the button below into your browser’s bookmark bar.

✁ Instagram

Editor’s note: Though we’ll try to keep the code in this bookmarklet updated, the most recent version can be found on the Indieweb wiki via the link above.
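For readers curious about what the minified bookmarklet above actually does, here’s a readable sketch of the same string manipulation as a standalone function. The function name and parameters are my own for illustration; they aren’t part of Tantek’s original.

```javascript
// Readable sketch of the bookmarklet's core logic as a pure function.
// Given the last image's src, the page URL, and the image's alt text,
// build the text note the bookmarklet offers up for copying.
function formatInstagramPost(imgSrc, pageUrl, altText) {
  // Strip any query string from the photo's jpg URL
  const photoUrl = imgSrc.split('?')[0];
  // Trim the permalink to 39 characters, which covers
  // "https://www.instagram.com/p/" plus an 11-character shortcode
  const permalink = pageUrl.substring(0, 39);
  // Collapse Instagram's ".\n.\n" filler runs in the caption to one newline
  const caption = altText.replace(/\.\n(\.\n)+/, '\n');
  return photoUrl + ' ' + permalink + '\n' + caption;
}
```

The real bookmarklet additionally pulls these values out of the live page (`document.images`, `document.location`) before handing them to a `prompt()` so they can be copied.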

Reply to Scott Kingery about Wallabag and Reading

Replied to a post by Scott Kingery (TechLifeWeb)
Chris, as a kind of sidebar to this, we talk about hosting things on our own site. I’ve always kind of thought this should be 1 piece of software we use for everything. I think that way becau…
Scott, as someone who’s studied evolutionary biology, I know that specialists in particular areas are almost always exponentially better at what they do than non-specialists.  This doesn’t mean that we don’t need alternate projects or new ideas which may result in new “Cambrian explosions,” and even better products.

I also feel that one needs the right tool for the right job. While I like WordPress for many things, it’s not always the best thing to solve the problem. In some cases Drupal or even lowly Wix may be the best solution. The key is to find the right balance of time, knowledge, capability and other variables to find the optimal solution for the moment, while maintaining the ability to change in the future if necessary. By a similar analogy there are hundreds of programming languages and all have their pros and cons.  Often the one you know is better than nothing, but if you heard about one that did everything better and faster, it would be a shame not to check it out.

This said, I often prefer to go with specialist software, though I do usually have a few requirements which overlap or align with Indieweb principles, including, but not limited to:

  • It should be open, so I can modify/change/share it with others
  • I should be able to own all the related/resultant data
  • I should be able to self-host it (if I want)
  • It should fit into my workflow and solve a problem I have while not creating too many new problems

In this case, I suspect that Wallabag is far better than anything I might have time to build and maintain myself. If there are bits of functionality that are missing, I can potentially request them or build/add them myself and contribute back to the larger good.

Naturally I also worry about usability and maintenance, so if the general workflow and overhead doesn’t dovetail with my other use cases, all bets may be off. If large pieces of my data, functionality, and workflow are housed in WordPress, for example, and something like this isn’t easily integrated or is very difficult to keep updated and maintained, then I’ll pass and look for (or build) a different solution. (Not every tool is right for just any job.) On larger projects like this, there’s also the happy serendipity that they’re big enough that WordPress (Drupal, Jekyll, or other) developers can better shoehorn the functionality into a bigger project or create a simple API, thereby making the whole more valuable than the sum of the parts.

In this particular situation, it appears to be a 1-1 replacement for a closed silo version of something I’ve been using regularly, but which provides more of the benefits above than the silo does, so it seems like a no-brainer to switch.

 
To reply to this comment preferably do so on the original at: A New Reading Post-type for Bookmarking and Reading Workflow

Homebrew Website Club Meetup Pasadena/Los Angeles Notes from 8-24-16

Last night, shy a few regulars at the tail end of a slow August and almost on the eve of IndieWebCamp NY2, Angelo Gladding and I continued our biweekly Homebrew Website Club meetings.

We met at Charlie’s Coffee House, 266 Monterey Road, South Pasadena, CA, where we stayed until closing at 8:00. Deciding that we hadn’t had enough, we moved the party (South Pasadena rolls up their sidewalks early) over to the local Starbucks, 454 Fair Oaks Ave, South Pasadena, CA where we stayed until they closed at 11:00pm.

Quiet Writing Hour

Angelo manned the fort alone with aplomb while building intently. If I’m not mistaken, he did use my h-card to track down my phone number to see what was holding me up, so as they say in IRC: h-card++!

Introductions and Demonstrations

Participants included:

Needing no introductions this week, Angelo launched us off with a relatively thorough demo of his Canopy platform, which he’s built from the ground up in Python! Starting from an empty folder on a host with a domain name, he downloaded and installed his code directly from GitHub and spun up a completely new version of his site in under 2 minutes. In under 20 minutes of some simple additional downloads and configuration of a few files, he also had locations, events, people, and about modules up and running. Despite the currently spare appearance of his website, there’s really a lot of untapped power in what he’s built so far. It’s all available on GitHub for those interested in playing around; I’m sure he’d appreciate pull requests.

Along the way, I briefly demoed some of the functionality of Kevin Marks’ deceptively powerful Noterlive web app for not only live tweeting, but also owning those tweets on one’s own site in a simple way after the fact (while also automatically including proper markup and microformats)! I also ran through some of the overall functionality of my Known install with a large number of additional plugins to compare and contrast UX/UI with respect to Canopy.

We also discussed a bit of Angelo’s recent Indieweb Graph network crawling project, and I took the opportunity to fix a bit of the representative h-card on my site. (Angelo, does a new crawl appear properly on lahacker.net now?)

Before leaving Charlie’s we did manage to remember to take a group photo this time around. Not having spent enough time chatting over the past few weeks, we decamped to a local Starbucks and continued our conversation along with some additional brief demos and discussion of other itches for future building.

We also spent a few minutes discussing the upcoming IndieWebCamp LA logistics for November as well as outreach to the broader Los Angeles area dev communities. If you’re interested in attending, please RSVP. If you’d like to volunteer or help sponsor the camp, please don’t hesitate to contact either of us. I’m personally hoping to attend DrupalCamp LA this weekend while wearing a stylish IndieWebCamp t-shirt that’s already on its way to me.

IndieWebCamp T-shirt
IndieWebCamp T-shirt

Next Meeting

In keeping with the schedule of the broader Homebrew movement, we’re already committed to our next meeting on September 7. It’s tentatively at the same location unless a more suitable one comes along prior to then. Details will be posted to the wiki in the next few days.

Thanks for coming everyone! We’ll see you next time.

Live Tweets Archive


Though not as great as the notes that Kevin Marks manages to put together, we did manage to make good use of noterlive for a few supplementary thoughts:

Chris Aldrich:

On my way to Homebrew Website Club Los Angeles in moments. http://stream.boffosocko.com/2016/homebrew-website-club-la-2016-08-24 #

Angelo Gladding:

I’ve torn some things down, but slowly rebuilding. I’m just minutes away from rel-me to be able to log into wiki #

ChrisAldrich:

Explaining briefly how @kevinmarks’ noterlive.com works for live tweeting events… #

Angelo Gladding:

My github was receiving some autodumps from a short-lived indieweb experiment. #

is describing his canopy system used to build his site #

Canopy builds in a minute and 52 secs… inside are folders roots and trunk w/ internals #

Describing how he builds in locations to Canopy #

Apparently @t has a broken certificate for https, so my parser gracefully falls back to http instead. #

 

Reply to: Getting started owning your digital home by Chris Hardie

Replied to Getting started owning your digital home by Chris Hardie (Chris Hardie)
My recent post about owning our digital homes prompted some good feedback and discussion. When I talk about this topic with the people in my life who don't work daily in the world of websites, domain names and content management, the most common reaction I get is, "that's sounds good in theory, I'm not sure … Continue reading Getting started owning your digital home
Chris, I came across your post today by way of Bob Waldron’s post WordPress: Default Personal Digital Home (PDH).

Both his concept and that of your own post fit right into the broader themes and goals of the Indieweb community. If you weren’t aware of the movement, I highly recommend you take a look at its philosophies and goals.

There’s already a pretty strong beachhead established for WordPress within the Indieweb community including a suite of plugins for helping to improve your personal web presence, but we’d certainly welcome your additional help as the idea seems right at home with your own philosophy.

I’m happy to chat with you about the group via website, phone, email, IRC, or social media at your leisure if you’re interested in more information. I’m eminently findable via details on my homepage.


A New Reading Post-type for Bookmarking and Reading Workflow

This morning while breezing through my Woodwind feed reader, I ran across a post by Rick Mendes with the hashtags and which put me down a temporary rabbit hole of thought about reading-related post types on the internet.

I’m obviously a huge fan of reading and have accounts on GoodReads, Amazon, Pocket, Instapaper, Readability, and literally dozens of other services that support or assist the reading endeavor. (My affliction got so bad I started my own publishing company last year.)

READ LATER is an indication on (or relating to) a website that one wants to save the URL to come back and read the content at a future time.

I started a page on the IndieWeb wiki to define read later where I began writing some philosophical thoughts. I decided it would be better to post them on my own site instead and simply link back to them. As a member of the Indieweb my general goal over time is to preferentially quit using these web silos (many of which are listed on the referenced page) and, instead, post my reading related work and progress here on my own site. Naturally, the question becomes, how does one do this in a simple and usable manner with pretty and reasonable UX/UI for both myself and others?

Current Use

Currently I primarily use a Pocket bookmarklet to save things (mostly newspaper articles, magazine pieces, blog posts) for reading later and/or the like/favorite functionality in Twitter in combination with an IFTTT recipe to save the URL in the tweet to my Pocket account. I then regularly visit Pocket to speed read through articles. While Pocket allows downloading of (some of) one’s data in this regard, I’m exploring options to bring the ownership of this workflow into my own site.

For more academic leaning content (read journal articles), I tend to rely on an alternate Mendeley-based workflow which also starts with an easy-to-use bookmarklet.

I’ve also experimented with bookmarking a journal article and using hypothes.is to import my highlights from that article, though that workflow has a way to go to meet my personal needs in a robust way while still allowing me to own all of my own data. The benefit is that fixing it can help more than just myself while still fitting into a larger personal workflow.

Brainstorming

A Broader Reading (Parent) Post-type

Philosophically a read later post-type could be considered similar to a (possibly) unshared or private bookmark with possible additional metadata like progress, date read, notes, and annotations added after the fact, which then technically makes it a read post type.

A potential workflow viewed over time might be: read later >> bookmark >> notes/annotations/marginalia >> read >> review. This kind of continuum of workflow might be able to support a slightly more complex overall UI for a more simplified reading post-type in which these others are all sub-types. One could then make a single UI for a reading post type with fields and details for all of the sub-cases. Being updatable, the single post could carry all the details of one’s progress.

Indieweb encourages simplicity (DRY) and having the fewest post-types possible, which I generally agree with, but perhaps there’s a better way of thinking of these several types. Concatenating them into one reading type with various data fields (and the ability of them to be public/private) could allow all of the subcategories to be included or not on one larger and more comprehensive post-type.

Examples
  1. Not including one subsection (or making it private) would simply prevent it from showing; thus one could have a traditional bookmark post by leaving off the read later, read, and review sub-types and/or data.
  2. As another example, I could include the data for read later, bookmark, and read, but leave off data about what I highlighted and/or sub-sections of notes I prefer to remain private.
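As a rough sketch of what such a unified post-type might look like as data, one object could carry all of the sub-type fields, with per-field visibility flags controlling what gets rendered publicly. All field names here are hypothetical illustrations, not any IndieWeb or microformats standard:

```javascript
// Hypothetical unified "reading" post: one object whose optional fields
// cover the bookmark, read-later, read, and review sub-types.
function makeReadingPost(fields) {
  return {
    url: fields.url,                      // the thing being read
    bookmarkedAt: fields.bookmarkedAt || null,
    readLater: fields.readLater || false,
    readAt: fields.readAt || null,
    progress: fields.progress || 0,       // percent read
    notes: fields.notes || [],            // annotations / marginalia
    review: fields.review || null,
    visibility: fields.visibility || {},  // per-field public/private flags
  };
}

// Render only the publicly visible, non-empty fields, so leaving a
// sub-type off (or marking it private) yields a plain bookmark post.
function publicView(post) {
  const out = { url: post.url };
  for (const key of ['bookmarkedAt', 'readLater', 'readAt', 'progress', 'notes', 'review']) {
    if (post.visibility[key] === false) continue; // explicitly private
    const value = post[key];
    if (!value || (Array.isArray(value) && value.length === 0)) continue;
    out[key] = value;
  }
  return out;
}
```

Since the post is one updatable object, moving along the read later » bookmark » notes » read » review continuum is just a matter of filling in more fields over time.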

A Primary Post with Webmention Updates

Alternately, one could create a primary post (potentially a bookmark) for the thing one is reading, and then use further additional posts with webmentions on each (to the original) thereby adding details to the original post about the ongoing progress. In some sense, this isn’t too far from the functionality provided by GoodReads with individual updates on progress with brief notes and their page that lists the overall view of progress. Each individual post could be made public/private to allow different viewerships, though private webmentions may be a hairier issue. I know some are also experimenting with pushing updates to posts via micropub and other methods, which could be appealing as well.

This may be cumbersome over time, but could potentially be made to look something like the GoodReads UI below, which seems very intuitive. (Note that it’s missing any review text as I’m currently writing it, and it’s not public yet.)

Overview of reading progress
Overview of reading progress

Other Thoughts

Ideally, better distinguishing between something that has been bookmarked and read/unread with dates for both the bookmarking and reading, as well as potentially adding notes and highlights relating to the article is desired. Something potentially akin to Devon Zuegel’s “Notes” tab (built on a custom script for Evernote and Tumblr) seems somewhat promising in a cross between a simple reading list (or linkblog) and a commonplace book for academic work, but doesn’t necessarily leave room for longer book reviews.

I’ll also need to consider the publishing workflow, in some sense as it relates to the reverse chronological posting of updates on typical blogs. Perhaps a hybrid approach of the two methods mentioned would work best?

Potentially having an interface that bolts together the interface of GoodReads (pictured above) and Amazon’s notes/highlights would be excellent. I recently noticed (and updated an old post) that they’re already beta testing such a beast.

Kindle Notes and Highlights are now showing up as a beta feature in GoodReads
Kindle Notes and Highlights are now showing up as a beta feature in GoodReads

Comments

I’ll keep thinking about the architecture for what I’d ultimately like to have, but I’m always open to hearing what other (heavy) readers have to say about the subject and the usability of such a UI.

Please feel free to comment below, or write something on your own site (which includes the URL of this post) and submit your URL in the field provided below to create a webmention in which your post will appear as a comment.

 

I now proudly own all of the data from my Tumblr posts on my own domain. #Indieweb #ownyourdata #PESOS

I now proudly own all of the data from my Tumblr posts on my own domain. #Indieweb #ownyourdata #PESOS

Reply to Something the NIH can learn from NASA

Replied to Something the NIH can learn from NASA by Lior Pachter (& comments by Donald Forsdyke) (Bits of DNA)
Pubmed Commons provides a forum, independent of a journal, where comments on articles in that journal can be posted. Why not air your displeasure there? The article is easily found (see PMID: 27467019) and, so far, there are no comments.
I’m hoping that one day (in the very near future) scientific journals and other science communications on the web will support the W3C’s Webmention candidate specification, so that when commentators [like Lior, in this case, above] post something about an article on their own site, the full comment is sent to the original article to appear there automatically. This means that one needn’t go to the site directly to comment (and if the comment isn’t approved, then at least it still lives somewhere searchable on the web).

Some journals already count tweets, and blog mentions (generally for PR reasons) but typically don’t allow access to finding them on the web to see if they indicate positive or negative sentiment or to further the scientific conversation.

I’ve also run into cases in which scientific journals that are “moderating” comments won’t approve reasoned thought, but will simultaneously allow (pre-approved?) accounts to flame every comment that is approved [example on Sciencemag.org: http://boffosocko.com/2016/04/29/some-thoughts-on-academic-publishing/ — see also comments there], so having the original comment live elsewhere may be useful and/or necessary depending on whether the publisher is a good or bad actor, or potentially just lazy.

I’ve also seen people use commenting layers like hypothes.is or genius.com to add commentary directly on journals, but these layers are often hidden from most readers. The community certainly needs a more robust commenting interface. I would hope that a decentralized version using web standards like Webmention might be a worthwhile and robust solution.
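For readers unfamiliar with the mechanics: per the W3C Webmention spec, a sender discovers the receiver’s endpoint (advertised via an HTTP `Link` header or a `<link>`/`<a>` element with `rel="webmention"`) and then POSTs `source` and `target` as form-encoded parameters. The following is a deliberately simplified sketch of those two steps; a real client would also check the `Link` header, resolve relative URLs, handle `rel` appearing after `href`, and actually perform the HTTP POST, which is omitted here.

```javascript
// Find a webmention endpoint in an HTML document via rel="webmention"
// discovery. Simplified: assumes the rel attribute appears before href.
function discoverEndpoint(html) {
  const match = html.match(
    /<(?:link|a)\b[^>]*rel=["'][^"']*\bwebmention\b[^"']*["'][^>]*href=["']([^"']*)["']/i
  );
  return match ? match[1] : null;
}

// Build the form-encoded body a sender POSTs to the endpoint:
// source is the page containing the mention, target is the page mentioned.
function webmentionBody(source, target) {
  return 'source=' + encodeURIComponent(source) +
         '&target=' + encodeURIComponent(target);
}
```

A journal supporting this would simply advertise such an endpoint on each article page and render verified incoming mentions as comments.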