👓 The Mueller report redactions, explained in 4 charts | Vox

Read The Mueller report redactions, explained in 4 charts by Alvin Chang (Vox)
We can’t see behind the bars. But we can see where they are — and why they’re there.

👓 See How Much Of The Mueller Report Is Redacted | NPR

Read See How Much Of The Mueller Report Is Redacted by Ryan Lucas, Alyson Hurt, Thomas Wilburn (NPR)
Attorney General William Barr explained before the release of the special counsel report that the law and regulations kept him from including everything that Robert Mueller uncovered, as well as how.

Sparklines of recent activity on my website

Inspired a bit by the work of Jeremy Keith and others, I’ve recently been playing around with some sparklines on my website. While tinkering around with things, mostly on the back end of my site, I’ve tried out several WordPress-specific plugins, both to see how they’re built and the user interfaces they provide. 

There are several simple plugins for adding sparklines to WordPress websites including:

  • Activity Sparks plugin by Greg Jackson, which provides configurable sparklines for posts and comments as well as for tracking categories/tags.
  • Sparkplug by Beau Lebens, which is similar to the Activity Sparks plugin (above), but with slightly older-looking and somewhat less refined output.

At present, I’m using the Activity Sparks plugin in my sidebar to display the recent activity on my site in terms of my posting frequency and the comment frequency. One chart provides the daily activity on my site over the past 3 months while the other provides the monthly activity over the past 5 years.

When on particular category pages, you can see the posting velocity for those particular categories in these respective time periods. While on the homepage and other miscellaneous pages, you can see the aggregate numbers for the website.

Generally I don’t care very much about the statistics, but in aggregate they can sometimes be fun to look at. As quick examples, I can tell roughly by looking at the 5 year time span when I added certain posting features to my website, or spot that time my site got taken down by Hacker News.


Hat tip to Khürt Williams, who reminded me I needed to circle back around and finish off a small piece of this project and document it.

❤️ randal_olson tweeted 10 most populous cities in the world from 1500-2018. #dataviz https://t.co/vtGEBVLdYk https://t.co/uvIkuE4VDI

Liked a tweet by Randy Olson (Twitter)

👓 UK Journalists on Twitter | OUseful.Info, the blog

Read UK Journalists on Twitter by Tony Hirst (OUseful.Info)
A post on the Guardian Datablog earlier today took a dataset collected by the Tweetminster folk and graphed the sorts of thing that journalists tweet about ( Journalists on Twitter: how do Britain&…

👓 Every time Ford and Kavanaugh dodged a question, in one chart | Vox

Read Every time Ford and Kavanaugh dodged a question, in one chart by Alvin Chang (Vox)
There was a striking difference in style — and substance.

An impressively telling visualization here.

👓 Squares and prettier graphs | Stuart Langridge

Read Squares and prettier graphs by Stuart Langridge (kryogenix.org)
The Futility Closet people recently posted “A Square Circle“, in which they showed: 49² + 73² = 7730 77² + 30² = 6829 68² + 29² = 5465 54² + 65² = 7141 71² + 41² = 6722 67² + 22² = 4973 which is a nice little result. I like this sort of recreational maths, so I spent a little time w...

An interesting cyclic structure here.

👓 How Y’all, Youse and You Guys Talk | The New York Times

Read How Y’all, Youse and You Guys Talk by Josh Katz (nytimes.com)
What does the way you speak say about where you’re from? Answer all the questions below to see your personal dialect map.

I’d love to see the data sets and sources they used for these visualizations.

Data mining the New York Philharmonic performance history

Read Data mining the New York Philharmonic performance history by Kris Shaffer (pushpullfork.com)

How does war affect the music an orchestra plays?

The New York Philharmonic has a public dataset containing metadata for their entire performance history. I recently discovered this, and of course downloaded it and started to geek out over it. (On what was supposed to be a day off, of course!) I only explored the data for a few hours, but was able to find some really interesting things. I’m sharing them here, along with the code I used to do them (in R, using TidyVerse tools), so you can reproduce them, or dive further into other questions. (If you just want to see the results, feel free to skip over the code and just check out the visualizations and discussion below.)

All scripts, extracted data, and visualizations in this blog post can also be found in the GitHub repository for this project.

Downloading the data

First, here are the R libraries that I use in the code that follows. If you’re going to run the code, you’ll need these libraries.

library(jsonlite)  
library(tidyverse)  
library(tidytext)  
library(stringr)  
library(scales)  
library(tidyjson)  
library(purrr)  
library(lubridate)  
library(broom)

To load the NYPhil performance data into R, you can download it from GitHub and load it locally, or just load it directly into R from GitHub. (I chose the latter.)

nyp <- fromJSON('https://raw.githubusercontent.com/nyphilarchive/PerformanceHistory/master/Programs/json/complete.json')

Now their entire performance history is in a data frame called nyp!
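Before tidying, it can help to see what we're working with. Assuming the download above succeeded, a quick structural inspection shows the top-level shape of the parsed object:

```r
# Inspect the top-level structure of the parsed JSON; max.level = 2 keeps
# the output readable by not descending into the deeply nested records.
str(nyp, max.level = 2)
```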

Tidying the data

The performance history is organized in a hierarchical format ― more-or-less lists of lists of lists. (See the README file on GitHub for an explanation.) It’s an intuitive way to organize the data, but it makes it difficult to do exploratory data analysis. So I spent more time than I care to admit unpacking the hierarchical structure into a flat, two-dimensional “tidy” structure, where each row is an observation (in this case, a piece of music that appears on a particular program) and each column is a variable or measurement (in this case, things like title, composer, date of program, performance season, conductor, soloist(s), performance venue, etc.).

Getting from the hierarchical structure to a tidy data frame was something of a challenge. There are a number of different kinds of lists embedded in the JSON structure, not all of which I wanted to worry about. So I poked around for a while and then created some functions to extract the info I wanted and assign a single row to each piece on a particular program, which would include all of the pertinent details. Here are the custom functions for expanding the list of metadata for a musical work, and then reproducing the general program information for each work on that program. (Note that I left the soloist field included, but still as a list. I’m not planning on using it, but I left it in for future possibilities.)

# Flatten a single work (a named list from the JSON) into a named vector
# of the fields we care about.
work_to_data_frame <- function(work) {
  workID <- work['ID']
  composer <- work['composerName']
  title <- work['workTitle']
  movement <- work['movement']
  conductor <- work['conductorName']
  soloist <- work['soloists']
  return(c(workID = workID,
           composer = composer,
           title = title,
           movement = movement,
           conductor = conductor,
           soloist = soloist))
}

# Expand a program's list of works into a table with one row per work.
# An empty works list yields a single all-NA row so the program is retained.
expand_works <- function(record) {
  if (is_empty(record)) {
    works_db <- as.data.frame(cbind(workID = NA,
                                    composer = NA,
                                    title = NA,
                                    movement = NA,
                                    conductor = NA,
                                    soloist = NA))
  } else {
    total <- length(record)
    works_db <- t(sapply(record[1:total], work_to_data_frame))
    colnames(works_db) <- c('workID',
                            'composer',
                            'title',
                            'movement',
                            'conductor',
                            'soloist')
  }
  return(works_db)
}

# Build one row of general program information, then replicate it across
# every work on that program by column-binding it to the expanded works.
expand_program <- function(record_number) {
  record <- nyp$programs[[record_number]]
  total <- length(record)
  program <- as.data.frame(cbind(id = record$id,
                                 programID = record$programID,
                                 orchestra = record$orchestra,
                                 season = record$season,
                                 eventType = record$concerts[[1]]$eventType,
                                 location = record$concerts[[1]]$Location,
                                 venue = record$concerts[[1]]$Venue,
                                 date = record$concerts[[1]]$Date,
                                 time = record$concerts[[1]]$Time))
  works <- expand_works(record$works)
  return(cbind(program, works))
}

Then I used a loop to iterate these functions over the entire dataset (13,771 records through the end of 2016, when I downloaded it; this is a dynamic dataset that grows as new programs are performed), then saved the result to CSV and converted it into a tibble (a TidyVerse-friendly data frame).

db <- data.frame()  
for (i in 1:13771) {  
  db <- rbind(db, cbind(i, expand_program(i)))  
}  

tidy_nyp <- db %>%
  as_tibble() %>%
  mutate(workID = as.character(workID),
         composer = as.character(composer),
         title = as.character(title),
         movement = as.character(movement),
         conductor = as.character(conductor),
         soloist = as.character(soloist))
tidy_nyp %>%
  write.csv('ny_phil_programs.csv')

This takes a looooooong time to process on a dual-core PC, which is why I was sure to save the results immediately for reloading in the future. Normally I would write a function that could be vectorized (processed on each value in parallel), which takes advantage of R’s (well, really C’s) high-efficiency matrix multiplication capabilities. However, because the input (one record per concert program) and output (one record per piece per program) were necessarily different lengths, I couldn’t make that work. If you know how to do that, please drop me an email or tweet and I’ll be eternally grateful!
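For what it's worth, a purrr-based sketch (untested on this dataset) can at least sidestep the cost of growing `db` with repeated `rbind()` calls, since `map_dfr()` collects all the variable-length per-program data frames and row-binds them in one pass:

```r
# map_dfr() applies expand_program() to each index and binds the resulting
# data frames by row; each call may return any number of rows, so the
# mismatch between input and output lengths is not a problem here. This is
# not true vectorization, but it avoids the quadratic copying of rbind()
# inside a loop.
db <- map_dfr(1:13771, function(i) cbind(i, expand_program(i)))
```

`map_dfr()` comes from purrr (already loaded above) and uses `dplyr::bind_rows()` under the hood.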

After a cup of coffee (or maybe two!), I have a handy tibble of almost 82,000 performance records from the entire history of the NY Philharmonic!

Most common composers and works

With this tidy tibble, we can really easily find and visualize basic descriptive statistics about the dataset. For example, what composers have the most works in the corpus? Here are all the composers with 400 or more works performed, in order of frequency.

This is produced by running the following code.

tidy_nyp %>%  
  filter(!composer %in% c('NULL', 'Traditional,', 'Anthem,')) %>%  
  count(composer, sort=TRUE) %>%  
  filter(n > 400) %>%  
  mutate(composer = reorder(composer, n)) %>%  
  ggplot(aes(composer, n, fill = composer)) +  
  geom_bar(stat = 'identity') +  
  xlab('Composer') +  
  ylab('Number of works performed') +  
  theme(legend.position="none") +  
  coord_flip()

I was surprised to see Wagner on top, even ahead of Beethoven. Tchaikovsky was also a big surprise to me. He’s popular, but I’ve ushered or attended over 200 performances of the Chicago Symphony Orchestra, and Beethoven and Mozart are definitely performed more frequently than Wagner and Tchaikovsky by the CSO today. So is this a NYP/CSO difference? Many of my music theory & history friends on Twitter were also surprised to see this ordering, so maybe not. In that case, have things changed over time?

Before looking at trends over time, let’s see if looking at specific works can shed any light. Here are the most performed works (and the code to produce the visualization), correcting for multiple movements listed from the same piece on the same program.

tidy_nyp %>%  
  filter(!title %in% c('NULL')) %>%  
  mutate(composer_work = paste(composer, '-', title)) %>%  
  group_by(composer_work, programID) %>%  
  summarize(times_on_program = n()) %>%  
  count(composer_work, sort=TRUE) %>%  
  filter(n > 220) %>%  
  mutate(composer_work = reorder(composer_work, n)) %>%  
  ggplot(aes(composer_work, n, fill = composer_work)) +  
  geom_bar(stat = 'identity') +  
  xlab('Composer and work') +  
  ylab('Number of times performed') +  
  theme(legend.position="none") +  
  coord_flip()

There are a lot of Wagner operas at the top! (Though it’s worth noting that only a few instances of each are full performances; most are just the overture or prelude, a common way of opening a symphony concert.) While many of Wagner’s most performed works are very short (10-minute overtures compared to 30-to-60-minute Beethoven and Tchaikovsky symphonies), and thus Beethoven probably occupies more time on the program than Wagner, the high number of Wagner, and even Tchaikovsky, pieces on NY Phil programs is still surprising to me.

Changes over time

Let’s see how things have changed over time. We can start simply by comparing their early history to their late history. Here are composer counts from 1842 to 1929 and 1930 to 2016 (roughly equal timespans, though not equal numbers of pieces).

Pre-1930:

And post-1929:

To do this, I simply added another filter to tidy_nyp:

filter(as.integer(substr(as.character(date),1,4)) < 1930) %>%
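Presumably the post-1929 chart uses the complementary condition:

```r
# Keep only pieces performed from 1930 onward (year parsed from the
# program date string, as in the pre-1930 filter above).
filter(as.integer(substr(as.character(date),1,4)) >= 1930) %>%
```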

Here we see Beethoven, Tchaikovsky, and Mozart all ahead of Wagner in more recent history, with Wagner dominating (and Mozart missing from) the earlier history.

But we can model this with more nuance. Let’s make a new tibble that contains just the information we need on composer frequency year-by-year.

comp_counts <- tidy_nyp %>%  
  filter(!composer %in% c('NULL', 'Traditional,', 'Anthem,')) %>%  
  mutate(year = as.integer(substr(as.character(date),1,4))) %>%  
  group_by(year) %>%  
  mutate(year_total = n()) %>%  
  group_by(composer, year) %>%  
  mutate(comp_total_by_year = n()) %>%  
  ungroup() %>%  
  group_by(composer, year, comp_total_by_year, year_total) %>%  
  summarize() %>%  
  mutate(share = comp_total_by_year/year_total) %>%  
  group_by(year) %>%  
  mutate(average_share = mean(share))

This produces a tibble that contains a record for each composer-year combination, with fields for:
- composer name
- year
- number of pieces by that composer in that year
- total number of pieces for the year
- composer’s share of pieces for the year
- average composer share for the year (total / number of composers)

With this information, we can then plot the changing frequency of each composer. Here are the top four on a single plot.

We can very clearly see the change in these composers’ frequency of occurrence on the NY Phil’s program over time, with Wagner’s decline very pronounced, and Mozart’s rise (in the twentieth century) clearly evident as well.

However, comparing a composer’s share of the programming year by year isn’t always apples-to-apples. Early on in the Philharmonic’s history, seasons contained far fewer pieces, and thus far fewer composers, than recent years. This has the potential to provide artificially high numbers for composers in sparser years, as seen in the following visualization (and accompanying code).

comp_counts %>%  
  group_by(year) %>%  
  summarize(comp_per_year = n()) %>%  
  ggplot(aes(year, comp_per_year)) +  
  geom_line() +  
  xlab('Year') +  
  ylab('Composers appearing on a program')

To account for this, we can normalize a composer’s share of the repertoire in a given year by dividing it by the average repertoire share for composers in the year. So here is the changing normalized frequency for each of the top four composers on a year-by-year basis.
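The normalized plot described above can be sketched roughly as follows. Note that the composer name strings below are my assumptions about how names appear in the dataset, and may need adjusting to match it exactly:

```r
# Hypothetical name strings -- adjust to match the dataset's spelling.
top_four <- c('Wagner, Richard', 'Beethoven, Ludwig van',
              'Tchaikovsky, Pyotr Ilyich', 'Mozart, Wolfgang Amadeus')

comp_counts %>%
  filter(composer %in% top_four) %>%
  # Normalize each composer's yearly share by the average share across
  # all composers that year, to correct for sparser early seasons.
  mutate(normalized_share = share / average_share) %>%
  ggplot(aes(year, normalized_share, color = composer)) +
  geom_line() +
  xlab('Year') +
  ylab('Normalized share of repertoire')
```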

The same trends can be seen here ― Mozart’s gentle rise and Wagner’s drastic decline ― perhaps even more starkly. In particular, Wagner’s decline from a peak in 1921 to a trough in the 1960s stands out quite strikingly. The decline is the most precipitous in the late 1940s and early 1950s.

And now an explanation begins to emerge.

A number of musicians began to boycott or avoid performing the music of Richard Wagner in the late 1930s, as recounted by conductor Daniel Barenboim. Wagner was known as “Hitler’s favorite composer,” and his music was used prominently in the Reich. The Israel Philharmonic stopped performing his music in 1938, Arturo Toscanini (who occupies a not insignificant share of this dataset as a conductor) stopped performing at Wagner festivals in Bayreuth, etc. Looking at the NY Philharmonic data, it seems like this may be a broader trend.

In addition to Wagner’s decline between WWI and the early Cold War, we can see another significant wartime change, this time an increase. From 1939 to 1946, Tchaikovsky’s share of the NY Philharmonic’s repertoire rose precipitously to his highest (normalized) share in the entire corpus. Could this be due to Russia’s role in the Grand Alliance? I don’t know. I do know that during World War II, then-living Russian composer Dmitri Shostakovich was widely performed in the US as part of a pro-Russia, anti-Nazi wartime propaganda effort (see below). Could Tchaikovsky have been part of that? I don’t know the history of it. But I wouldn’t be surprised. I also wouldn’t be surprised if Tchaikovsky simply filled the role of popular, grand, Romantic composer … who wasn’t German. (Any Tchaikovsky scholars have a perspective to add?)

Conclusion

This is just a start, but I think these are interesting findings. As a music student and scholar, I never studied performance trends like this. My studies were mostly focused on musical structures and the evolution of compositional styles. But it’s cool to take a different kind of empirical look at musical evolution.

If this code helps you find other insights in the corpus, please drop me a line. I’m sure there’s much more to be mined out of this fascinating corpus.

And thanks to the archivists of the New York Philharmonic for putting this together! Hopefully more major orchestras will release their programming history publicly, so we can start mapping larger trends and make comparisons between them.

Banner image by Tim Hynes.

Chris Aldrich is reading “10 Great Last.fm Apps, Hacks and Mashups”

Read 10 great Last.fm apps, hacks and mashups (The Next Web)
A look at some of the best apps, hacks and mashups available for music streaming and scrobbling service Last.fm.

Curious about alternatives to Last.fm’s broken RSS feeds and what people are doing with their listening data. Some relatively interesting ideas in here, but nothing earth-shattering. One or two were focused on visualization, but otherwise nothing I felt I could use.