Advanced Data Analysis from an Elementary Point of View
by Cosma Rohilla Shalizi

This is a draft textbook on data analysis methods, intended for a one-semester course for advanced undergraduate students who have already taken classes in probability, mathematical statistics, and linear regression. It began as the lecture notes for 36-402 at Carnegie Mellon University.

By making this draft generally available, I am not promising to provide any assistance or even clarification whatsoever. Comments are, however, welcome.

The book is under contract to Cambridge University Press; it should be turned over to the press before the end of 2015. A copy of the next-to-final version will remain freely accessible here permanently.

I. Regression and Its Generalizations

1. Regression Basics
2. The Truth about Linear Regression
3. Model Evaluation
4. Smoothing in Regression
5. Simulation
6. The Bootstrap
7. Weighting and Variance
8. Splines
10. Testing Regression Specifications
11. Logistic Regression
12. Generalized Linear Models and Generalized Additive Models
13. Classification and Regression Trees
II. Distributions and Latent Structure
14. Density Estimation
15. Relative Distributions and Smooth Tests of Goodness-of-Fit
16. Principal Components Analysis
17. Factor Models
18. Nonlinear Dimensionality Reduction
19. Mixture Models
20. Graphical Models
III. Dependent Data
21. Time Series
22. Spatial and Network Data
23. Simulation-Based Inference
IV. Causal Inference
24. Graphical Causal Models
25. Identifying Causal Effects
26. Causal Inference from Experiments
27. Estimating Causal Effects
28. Discovering Causal Structure
Appendices
• Data-Analysis Problem Sets
• Reminders from Linear Algebra
• Big O and Little o Notation
• Taylor Expansions
• Multivariate Distributions
• Algebra with Expectations and Variances
• Propagation of Error, and Standard Errors for Derived Quantities
• Optimization
• chi-squared and the Likelihood Ratio Test
• Proof of the Gauss-Markov Theorem
• Rudimentary Graph Theory
• Information Theory
• Hypothesis Testing
• Writing R Functions
• Random Variable Generation

Planned changes:

• Unified treatment of information-theoretic topics (relative entropy / Kullback-Leibler divergence, entropy, mutual information and independence, hypothesis-testing interpretations) in an appendix, with references from chapters on density estimation, on EM, and on independence testing
• More detailed treatment of calibration and calibration-checking (part II)
• Missing data and imputation (part II)
• Move d-separation material from “causal models” chapter to graphical models chapter as no specifically causal content (parts II and IV)?
• Expand treatment of partial identification for causal inference, including partial identification of effects by looking at all data-compatible DAGs (part IV)
• Figure out how to cut at least 50 pages
• Make sure notation is consistent throughout: insist that vectors are always matrices, or use more geometric notation?
• Move simulation to an appendix
• Move variance/weights chapter to right before logistic regression
• Move some appendices online (i.e., after references)?

(Text last updated 30 March 2016; this page last updated 6 November 2015)

🔖 Quantum Information Science II

Bookmarked Quantum Information Science II (edX)

Already know something about quantum mechanics, quantum bits and quantum logic gates, but want to design new quantum algorithms, and explore multi-party quantum protocols? This is the course for you!

In this advanced graduate physics course on quantum computation and quantum information, we will cover:

• The formalism of quantum errors (density matrices, operator sum representations)
• Quantum error correction codes (stabilizers, graph states)
• Fault-tolerant quantum computation (normalizers, Clifford group operations, the Gottesman-Knill Theorem)
• Models of quantum computation (teleportation, cluster, measurement-based)
• Quantum Fourier transform-based algorithms (factoring, simulation)
• Quantum communication (noiseless and noisy coding)
• Quantum protocols (games, communication complexity)

Research problem ideas are presented along the journey.

What you’ll learn

• Formalisms for describing errors in quantum states and systems
• Quantum error correction theory
• Fault-tolerant quantum procedure constructions
• Models of quantum computation beyond gates
• Structures of exponentially-fast quantum algorithms
• Multi-party quantum communication protocols

Meet the instructor

Isaac Chuang, Professor of Electrical Engineering and Computer Science, and Professor of Physics, MIT

In Washington: A Life celebrated biographer Ron Chernow provides a richly nuanced portrait of the father of our nation. With a breadth and depth matched by no other one-volume life of Washington, this crisply paced narrative carries the reader through his troubled boyhood, his precocious feats in the French and Indian War, his creation of Mount Vernon, his heroic exploits with the Continental Army, his presiding over the Constitutional Convention, and his magnificent performance as America's first president. Despite the reverence his name inspires, Washington remains a lifeless waxwork for many Americans, worthy but dull. A laconic man of granite self-control, he often arouses more respect than affection. In this groundbreaking work, based on massive research, Chernow dashes forever the stereotype of a stolid, unemotional man. A strapping six feet, Washington was a celebrated horseman, elegant dancer, and tireless hunter, with a fiercely guarded emotional life. Chernow brings to vivid life a dashing, passionate man of fiery opinions and many moods. Probing his private life, he explores his fraught relationship with his crusty mother, his youthful infatuation with the married Sally Fairfax, and his often conflicted feelings toward his adopted children and grandchildren. He also provides a lavishly detailed portrait of his marriage to Martha and his complex behavior as a slave master. At the same time, Washington is an astute and surprising portrait of a canny political genius who knew how to inspire people. Not only did Washington gather around himself the foremost figures of the age, including James Madison, Alexander Hamilton, John Adams, and Thomas Jefferson, but he also brilliantly orchestrated their actions to shape the new federal government, define the separation of powers, and establish the office of the presidency. In this unique biography, Ron Chernow takes us on a page-turning journey through all the formative events of America's founding. 
With a dramatic sweep worthy of its giant subject, Washington is a magisterial work from one of our most elegant storytellers.

🔖 Want to read: Streaming, Sharing, Stealing: Big Data and the Future of Entertainment by Michael D. Smith and Rahul Telang (MIT Press)

Bookmarked Streaming, Sharing, Stealing: Big Data and the Future of Entertainment (MIT Press; August 8, 2016)
Traditional network television programming has always followed the same script: executives approve a pilot, order a trial number of episodes, and broadcast them, expecting viewers to watch a given show on their television sets at the same time every week. But then came Netflix's House of Cards. Netflix gauged the show's potential from data it had gathered about subscribers' preferences, ordered two seasons without seeing a pilot, and uploaded the first thirteen episodes all at once for viewers to watch whenever they wanted on the devices of their choice. In this book, Michael Smith and Rahul Telang, experts on entertainment analytics, show how the success of House of Cards upended the film and TV industries -- and how companies like Amazon and Apple are changing the rules in other entertainment industries, notably publishing and music. We're living through a period of unprecedented technological disruption in the entertainment industries. Just about everything is affected: pricing, production, distribution, piracy. Smith and Telang discuss niche products and the long tail, product differentiation, price discrimination, and incentives for users not to steal content. To survive and succeed, businesses have to adapt rapidly and creatively. Smith and Telang explain how. How can companies discover who their customers are, what they want, and how much they are willing to pay for it? Data. The entertainment industries must learn to play a little "moneyball." The bottom line: follow the data.

Recommended to me today by Ramzi Hajj.

Introduction to Galois Theory | Coursera

Bookmarked Introduction to Galois Theory by Ekaterina Amerik (Coursera)
A very beautiful classical theory on field extensions of a certain type (Galois extensions) initiated by Galois in the 19th century. Explains, in particular, why it is not possible to solve an equation of degree 5 or more in the same way as we solve quadratic or cubic equations. You will learn to compute Galois groups and (before that) study the properties of various field extensions. We first shall survey the basic notions and properties of field extensions: algebraic, transcendental, finite field extensions, degree of an extension, algebraic closure, decomposition field of a polynomial. Then we shall do a bit of commutative algebra (finite algebras over a field, base change via tensor product) and apply this to study the notion of separability in some detail. After that we shall discuss Galois extensions and Galois correspondence and give many examples (cyclotomic extensions, finite fields, Kummer extensions, Artin-Schreier extensions, etc.). We shall address the question of solvability of equations by radicals (Abel theorem). We shall also try to explain the relation to representations and to topological coverings. Finally, we shall briefly discuss extensions of rings (integral elements, norms, traces, etc.) and explain how to use the reduction modulo primes to compute Galois groups.

I’ve been watching MOOCs for several years and this is one of the few I’ve come across that covers some more advanced mathematical topics. I’m curious to see how it turns out and what type of interest/results it returns.

It’s being offered by National Research University – Higher School of Economics (HSE) in Russia.

[1609.02422] What can logic contribute to information theory?

Bookmarked [1609.02422] What can logic contribute to information theory? by David Ellerman (128.84.21.199)
Logical probability theory was developed as a quantitative measure based on Boole's logic of subsets. But information theory was developed into a mature theory by Claude Shannon with no such connection to logic. But a recent development in logic changes this situation. In category theory, the notion of a subset is dual to the notion of a quotient set or partition, and recently the logic of partitions has been developed in a parallel relationship to the Boolean logic of subsets (subset logic is usually mis-specified as the special case of propositional logic). What then is the quantitative measure based on partition logic in the same sense that logical probability theory is based on subset logic? It is a measure of information that is named "logical entropy" in view of that logical basis. This paper develops the notion of logical entropy and the basic notions of the resulting logical information theory. Then an extensive comparison is made with the corresponding notions based on Shannon entropy.
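To make the comparison in the abstract concrete, here is a minimal sketch in Python. It assumes the usual definition of logical entropy for a finite probability distribution, h(p) = 1 - Σ p_i², which is the probability that two independent draws land on different outcomes, and sets it beside Shannon entropy in bits. The function names and the two example distributions are my own illustration, not taken from the paper.

```python
import math


def shannon_entropy(p):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)), skipping zero terms."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def logical_entropy(p):
    """Logical entropy (assumed definition): 1 - sum(p_i ** 2).

    Equivalently, the probability that two independent draws from p
    give distinct outcomes.
    """
    return 1.0 - sum(x * x for x in p)


# Hypothetical example distributions over four outcomes.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

for name, dist in [("uniform", uniform), ("skewed", skewed)]:
    print(name, shannon_entropy(dist), logical_entropy(dist))
```

Both measures are largest for the uniform distribution and shrink as the distribution concentrates, but logical entropy always stays below 1, while Shannon entropy grows with the logarithm of the number of outcomes.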

Ellerman is visiting at UC Riverside at the moment. Given the information theory and category theory overlap, I’m curious if he’s working with John Carlos Baez, or whether Baez is aware of this work.

Based on a cursory look at his website(s), I’m going to have to start following more of this work.

Randomness And Complexity, from Leibniz To Chaitin | World Scientific Publishing

Bookmarked Randomness And Complexity, from Leibniz To Chaitin (amzn.to)
The book is a collection of papers written by a selection of eminent authors from around the world in honour of Gregory Chaitin's 60th birthday. This is a unique volume including technical contributions, philosophical papers and essays. Hardcover: 468 pages; Publisher: World Scientific Publishing Company (October 18, 2007); ISBN: 9789812770820

The Science of the Oven (Arts and Traditions of the Table: Perspectives on Culinary History)

Bookmarked The Science of the Oven (Arts and Traditions of the Table: Perspectives on Culinary History) by Hervé This (Amazon.com)
The Science of the Oven, by Hervé This. Cooking. Columbia University Press, 2009. Hardcover, 206 pages. Personal library.

Mayonnaise "takes" when a series of liquids form a semisolid consistency. Eggs, a liquid, become solid as they are heated, whereas, under the same conditions, solids melt. When meat is roasted, its surface browns and it acquires taste and texture. What accounts for these extraordinary transformations? The answer: chemistry and physics. With his trademark eloquence and wit, Hervé This launches a wry investigation into the chemical art of cooking. Unraveling the science behind common culinary technique and practice, Hervé This breaks food down to its molecular components and matches them to cooking's chemical reactions. He translates the complex processes of the oven into everyday knowledge for professional chefs and casual cooks, and he demystifies the meaning of taste and the making of flavor. He describes the properties of liquids, salts, sugars, oils, and fats and defines the principles of culinary practice, which endow food with sensual as well as nutritional value.

For fans of Hervé This's popular volumes and for those new to his celebrated approach, The Science of the Oven expertly expands the possibilities of the kitchen, fusing the physiology of taste with the molecular structure of bodies and food.

NIMBioS Tutorial: Evolutionary Quantitative Genetics 2016

Bookmarked NIMBioS Tutorial: Evolutionary Quantitative Genetics 2016 by NIMBioS (nimbios.org)
This tutorial will review the basics of theory in the field of evolutionary quantitative genetics and its connections to evolution observed at various time scales. Quantitative genetics deals with the inheritance of measurements of traits that are affected by many genes. Quantitative genetic theory for natural populations was developed considerably in the period from 1970 to 1990 and up to the present, and it has been applied to a wide range of phenomena including the evolution of differences between the sexes, sexual preferences, life history traits, plasticity of traits, as well as the evolution of body size and other morphological measurements. Textbooks have not kept pace with these developments, and currently few universities offer courses in this subject aimed at evolutionary biologists. There is a need for evolutionary biologists to understand this field because of the ability to collect large amounts of data by computer, the development of statistical methods for changes of traits on evolutionary trees and for changes in a single species through time, and the realization that quantitative characters will not soon be fully explained by genomics. This tutorial aims to fill this need by reviewing basic aspects of theory and illustrating how that theory can be tested with data, both from single species and with multiple-species phylogenies. Participants will learn to use R, an open-source statistical programming language, to build and test evolutionary models. The intended participants for this tutorial are graduate students, postdocs, and junior faculty members in evolutionary biology.

Network Science by Albert-László Barabási

Bookmarked Network Science by Albert-László Barabási (Cambridge University Press)

I ran across a link to this textbook by way of a standing Google alert, and was excited to check it out. I was immediately disappointed to think that I would have to wait another month and change for the physical textbook to be released, but placed my pre-order right away. Then with a bit of digging around, I realized that individual chapters are available immediately to quench my thirst until the physical text is printed next month.

The textbook is available for purchase in September 2016 from Cambridge University Press. Pre-order now on Amazon.com.

Disconnected, Fragmented, or United? A Trans-disciplinary Review of Network Science

Bookmarked Disconnected, Fragmented, or United? A Trans-disciplinary Review of Network Science by César A. Hidalgo (Applied Network Science | SpringerLink)

Abstract

For decades the study of networks has been divided between the efforts of social scientists and natural scientists, two groups of scholars who often do not see eye to eye. In this review I present an effort to mutually translate the work conducted by scholars from both of these academic fronts hoping to continue to unify what has become a diverging body of literature. I argue that social and natural scientists fail to see eye to eye because they have diverging academic goals. Social scientists focus on explaining how context specific social and economic mechanisms drive the structure of networks and on how networks shape social and economic outcomes. By contrast, natural scientists focus primarily on modeling network characteristics that are independent of context, since their focus is to identify universal characteristics of systems instead of context specific mechanisms. In the following pages I discuss the differences between both of these literatures by summarizing the parallel theories advanced to explain link formation and the applications used by scholars in each field to justify their approach to network science. I conclude by providing an outlook on how these literatures can be further unified.

Peter Webb’s A Course in Finite Group Representation Theory

Bookmarked A Course in Finite Group Representation Theory by Peter Webb (math.umn.edu)
Download a pre-publication version of the book which will be published by Cambridge University Press. The book arises from notes of courses taught at the second year graduate level at the University of Minnesota and is suitable to accompany study at that level.

Ten Simple Rules for Taking Advantage of Git and GitHub

Bookmarked Ten Simple Rules for Taking Advantage of Git and GitHub (journals.plos.org)
Bioinformatics is a broad discipline in which one common denominator is the need to produce and/or use software that can be applied to biological data in different contexts. To enable and ensure the replicability and traceability of scientific claims, it is essential that the scientific publication, the corresponding datasets, and the data analysis are made publicly available [1,2]. All software used for the analysis should be either carefully documented (e.g., for commercial software) or, better yet, openly shared and directly accessible to others [3,4]. The rise of openly available software and source code alongside concomitant collaborative development is facilitated by the existence of several code repository services such as SourceForge, Bitbucket, GitLab, and GitHub, among others. These resources are also essential for collaborative software projects because they enable the organization and sharing of programming tasks between different remote contributors. Here, we introduce the main features of GitHub, a popular web-based platform that offers a free and integrated environment for hosting the source code, documentation, and project-related web content for open-source projects. GitHub also offers paid plans for private repositories (see Box 1) for individuals and businesses as well as free plans including private repositories for research and educational use.