A Fast and Powerful Scraping and Web Crawling Framework
Responsive HTML5 and CSS3 site templates designed by @ajlkn and released under the Creative Commons license.
Update: For a simpler formulation of the ideas in this essay, see Doug Belshaw’s Working openly on the web: a manifesto. Back in 2000, the patterns, principles, and best practices for building web information systems were mostly anecdotal and folkloric. Roy Fielding’s dissertation on the web’s...
Taking Back The Internet One Page At A Time.
The opening keynote from the inaugural HTML Special held before CSS Day 2016 in Amsterdam.
I hope that if you’re starting your adventure on the web, that you manage to find this as one of the first links that starts you off on your journey. It’s a great place to start.
A lot of your post also reminds me of Bryan Alexander’s relatively recent post “I defy the world and go back to RSS”.
I completely get the concept of what you’re getting at with harkening back to the halcyon days of RSS. I certainly love, use, and rely on it heavily both for consumption as well as production. Of course, there’s also the competing standard of Atom, which still powers large parts of the web (including GNU Social networks like Mastodon). But almost no one looks back fondly on the feed format wars…
I think that while many are looking back on the “good old days” of the web, we should not forget the difficult and fraught history that has gotten us to where we are. We should learn from the mistakes made during the feed format wars and try to simplify things to not only move back, but to move forward at the same time.
Today, there is a pared-down approach that is better and simpler than either of these old and difficult specs: simply adding Microformat classes to HTML (aka P.O.S.H) to create feeds. Unless one is relying on pre-existing infrastructure like WordPress, building and maintaining RSS feed infrastructure can be difficult at best, and updates almost never occur, particularly for specifications that support new social media related feeds including replies, likes, favorites, reposts, etc. The nice part is that if one knows how to write basic HTML, then one can create a simple feed by hand without having to learn the markup or specifics of RSS. Most modern feed readers (except perhaps Feedly) support these new h-feeds, as they’re known. Interestingly, some CMSes like WordPress support Microformats as part of their core functionality, though WordPress supports only a subset of Microformats v1 instead of the more modern v2.
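To make the h-feed idea a bit more concrete, here is a minimal sketch in Python of how a reader might pull entries out of hand-written HTML. The sample markup, the class name `HFeedSketch`, and the helper `parse_hfeed` are all made up for illustration, and this is nowhere near a full Microformats v2 parser; it only looks at the `h-entry`, `p-name`, and `u-url` classes.

```python
from html.parser import HTMLParser

# Hypothetical hand-written feed: plain HTML plus a few Microformats classes.
SAMPLE = """
<div class="h-feed">
  <article class="h-entry">
    <a class="p-name u-url" href="https://example.com/first">First post</a>
  </article>
  <article class="h-entry">
    <a class="p-name u-url" href="https://example.com/second">Second post</a>
  </article>
</div>
"""


class HFeedSketch(HTMLParser):
    """Very simplified reader: collects a name and URL for each h-entry."""

    def __init__(self):
        super().__init__()
        self.entries = []
        self._in_name = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "h-entry" in classes:
            # Start a new entry whenever an element carries the h-entry class.
            self.entries.append({"name": "", "url": None})
        if self.entries and "u-url" in classes and attrs.get("href"):
            self.entries[-1]["url"] = attrs["href"]
        if self.entries and "p-name" in classes:
            self._in_name = True

    def handle_endtag(self, tag):
        # Assumption: the p-name element contains only text, no nested tags.
        self._in_name = False

    def handle_data(self, data):
        if self._in_name:
            self.entries[-1]["name"] += data.strip()


def parse_hfeed(html):
    parser = HFeedSketch()
    parser.feed(html)
    return parser.entries


if __name__ == "__main__":
    for entry in parse_hfeed(SAMPLE):
        print(entry["name"], entry["url"])
```

The point is less the parser than the markup: the “feed” is just the page itself with a handful of class names added.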
For those like you who are looking both backward and simultaneously forward, there’s a nice chart of “Lost Infrastructure” on the IndieWeb wiki which was created following a post by Anil Dash entitled The Lost Infrastructure of Social Media. Hopefully we can take back a lot of the ground the web has lost to social media and refashion it for a better and more flexible future. I’m not looking for just a “hipster-web”, but a new and demonstrably better web.
Some of the desire to go back to RSS is built into the problems we’re looking at with respect to algorithmic filtering of our streams (we’re looking at you, Facebook). While algorithms might help to filter out some of the cruft we’re not looking for, we’ve been ceding too much control to third parties like Facebook who have different motivations in presenting us material to read. I’d rather my feeds were closer to the model of fine dining than the junk food that the-McDonald’s-of-the-internet Facebook is providing. As I’m reading Cathy O’Neil’s book Weapons of Math Destruction, I’m also reminded that the black box of Facebook’s algorithm is causing scale and visibility/transparency problems like the Russian ad buys which could have potentially heavily influenced the 2016 election in the United States. The fact that we can’t see or influence the algorithm is both painful and potentially destructive. If I could have access to tweaking a third-party transparent algorithm, I think it would provide me a lot more value.
As for OPML, it’s amazing what kind of power it has to help one find and subscribe to all sorts of content, particularly when it’s been hand curated and is continually self-dogfooded. Some of my favorite tools are readers that allow one to subscribe to the OPML feeds of others; that way, if a person adds new feeds to an interesting collection, the changes propagate to everyone following that feed. With this kind of simple technology, those who are interested in curating things for particular topics (like the newsletter crowd) or even creating master feeds for class material in a planet-like fashion can easily do so. I can also see some worthwhile uses for this in journalism for newspapers and magazines. As an example, imagine if one could subscribe not only to 100 people writing about #edtech, but to only their bookmarked articles that have the tag edtech (thus filtering out their personal posts, or things not having to do with edtech). I don’t believe that Feedly supports subscribing to OPML (though it does support importing OPML files, which is subtly different), but other readers like Inoreader do.
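For anyone curious what subscribing to an OPML file amounts to under the hood, here is a rough Python sketch. The sample OPML, the feed URLs, and the function names `feed_urls` and `new_subscriptions` are all hypothetical; the only thing borrowed from the format itself is that feeds are listed as `outline` elements carrying an `xmlUrl` attribute. A real reader would fetch the OPML file on a schedule and compare it against what it already follows.

```python
import xml.etree.ElementTree as ET

# Hypothetical curated list; in practice this would be fetched from a URL.
SAMPLE_OPML = """<opml version="2.0">
  <head><title>Hypothetical edtech reading list</title></head>
  <body>
    <outline text="edtech">
      <outline type="rss" text="Example Blog A" xmlUrl="https://a.example/feed/"/>
      <outline type="rss" text="Example Blog B" xmlUrl="https://b.example/tag/edtech/feed/"/>
    </outline>
  </body>
</opml>
"""


def feed_urls(opml_text):
    """Return every xmlUrl found in the OPML, in document order."""
    root = ET.fromstring(opml_text)
    return [o.get("xmlUrl") for o in root.iter("outline") if o.get("xmlUrl")]


def new_subscriptions(known, opml_text):
    """Feeds in the curated list that the reader does not follow yet."""
    return [url for url in feed_urls(opml_text) if url not in known]


if __name__ == "__main__":
    already_following = {"https://a.example/feed/"}
    for url in new_subscriptions(already_following, SAMPLE_OPML):
        print("would subscribe to", url)
```

This is the subtle difference between importing and subscribing: an import runs something like `feed_urls` once, while a subscription keeps re-running the comparison so the curator’s additions propagate.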
I’m hoping to finish up some work on my own available OPML feeds to make subscribing to interesting curated content a bit easier within WordPress (over the built-in, but now deprecated, link manager functionality). Since you mentioned it, I tried checking out the OPML file on your blog hoping for something interesting in the #edtech space. Alas… 😉 Perhaps something in the future?
Accelerated Mobile Pages
I’ve been following most of the (Google) Accelerated Mobile Pages (AMP) discussion (most would say debate) through episodes of This Week in Google where Leo Laporte plays an interesting foil to Jeff Jarvis over the issue. The other day I came across a bookmark from Jeremy Keith entitled Need to Catch Up on the AMP Debate? which is a good catch up by CSS-Tricks. It got me thinking about creating a bookmarklet to strip out the canonical URL for AMP pages (the spec requires them to exist in markup) to make them easier to bookmark and share across social media. In addition to social sites wrapping their URLs with short URLs (which often die or disappear as the result of linkrot) or needing to physically exit platforms (I’m looking at you Facebook with your three extra life-sucking clicks meant to protect your walled garden) to properly bookmark canonical URLs for later consumption, I’ve run across several Google prepended URLs which I’d rather not share in lieu of the real ones.
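Since AMP pages carry their canonical URL in the markup, finding it is mostly a matter of looking for a `link` element with `rel="canonical"`. Here is a minimal Python sketch of that lookup; it is not a bookmarklet (which would be JavaScript running in the browser), and the sample page, the `CanonicalFinder` class, and the `canonical_url` helper are invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical, heavily trimmed AMP page.
AMP_PAGE = """
<!doctype html>
<html amp>
<head>
  <meta charset="utf-8">
  <link rel="canonical" href="https://example.com/2017/an-article/">
  <title>An article</title>
</head>
<body><p>Hello</p></body>
</html>
"""


class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> it sees."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rels and self.canonical is None:
            self.canonical = attrs.get("href")


def canonical_url(html):
    """Return the canonical URL declared in the page, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


if __name__ == "__main__":
    print(canonical_url(AMP_PAGE))
```

A bookmarklet would do the same lookup against the live page and then navigate to, or copy, the resulting URL.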
Clean and Simple URLs
As an example, his canonical bookmarklet will take something ugly like
and strip it down to its most basic
so that if you want to share it, it will remove all of the tracking cruft that comes along for the ride.
Even worse offenders like
suddenly become cleaner and clearer
These examples almost remind me of the days of forwarding chain letter emails where friends couldn’t be bothered to cut out the 10 pages of all the blockquoted portions of forwards or the annoying
> > >> >>
> > >> >>
> > >> >>
nonsense before they sent it to you… The only person who gets a pass on this anymore is Grandpa, and even he’s skating on thin ice.
Remember, friends don’t let friends share ridiculous URLs…
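For the curious, the “tracking cruft” cleanup can be sketched in a few lines of Python. To be clear, this is not the bookmarklet code, just an illustration of the idea, and the list of parameters to drop (`utm_*`, `fbclid`, `gclid`), the example URL, and the function name `strip_tracking` are my own assumptions rather than anything a particular bookmarklet is known to use.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed examples of tracking parameters; adjust to taste.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"fbclid", "gclid"}


def strip_tracking(url):
    """Return the URL with known tracking query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_NAMES
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


if __name__ == "__main__":
    ugly = "https://example.com/post?id=7&utm_source=twitter&utm_medium=social&fbclid=abc"
    print(strip_tracking(ugly))
```

Parameters the page actually needs (like the hypothetical `id` above) are left alone; only names on the tracking list are dropped.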
So in that spirit, here are the three bookmarklets that you can easily drag and drop into the bookmark bar on your browser:
The code for the three follows respectively for those who prefer to view the code prior to use, or who wish to fashion their own bookmarklets:
As a bonus tip, Kevin Marks’ post briefly describes how one can use their Chrome browser on mobile to utilize these synced bookmarklets more readily.
Of course, if you want the AMP version of pages just for their clean appearance, then perhaps you may appreciate the Mercury Reader for Chrome. There isn’t a bookmarklet for it (yet?), but it’ll do roughly the same job without the mobile view sizing on desktop. While looking that link up, I noticed that Mercury also has a one-line-of-code AMP solution, though I recommend you brush up on what AMP is, what it does, and whether you really want it before adding it.