Following the Shalizi Model for Blog Maintenance

My attempt to maintain a web presence is undercut by the fact that I don't post many trivial updates or statements; I'm less concerned with the immediate payoff of this sort of writing than with making a longer statement. My friend and colleague Cosma Shalizi pointed out to me when I started this site that there are two steady states for successful blogs:

  • those that are fast to update, have lots of constant yet ephemeral traffic, and have their spam problems mitigated by a quality comment model, and
  • those that are infrequently updated, have occasional yet consistent traffic, and have their spam problems eliminated by removing the ability to comment.
I've become aware that if there's anything I want to say, I want to think carefully about it first, and make it last in the end, mirroring Cosma's model. So goodbye comments and trackbacks; if you want to respond to anything I write, you know how to find me.

Sitting this one out

As a former resident of Massachusetts, it's been interesting for me to watch the Facebook reaction to the Coakley-Brown race purely from the psychology angle. I don't trust the polls or the probabilities in this case because there's so little prior information on their reliability before special elections that are unlikely to be replicated. So I'll just sit back and wait for the obvious narratives to come rolling in before the Daily Show tonight.

Off the grid

For the first time since I started using email on a regular basis, I'll be without it for the next week as I spend quality time with the family. For that matter, it'll be my first time in a while without pointing my eyes toward a screen. If you're reading this between December 20 and 27, you're clearly not doing the same thing.

Tentative syllabus for 36-724: Applied Bayesian Statistical Computing

As previously offered, this was a full-semester, 12-unit course that followed three semester-long courses in mathematical statistics, regression modelling and computation. Now, with only seven weeks available and no precursor course in computing, I'm still working out how to pick the essential concepts and fit them in. Here's what I've got so far.

Carnegie Mellon University, Spring 2010: 36-724: Applied Bayesian Statistical Computing
Instructor: Andrew C. Thomas (acthomas at stat.cmu.edu)
Class Time/Place: MWF 11:30-12:20, CFA 211

Required text:
Andrew Gelman and Jennifer Hill (2007) "Data Analysis using Regression and Multilevel/Hierarchical Models". Cambridge University Press. Buy the softcover version.

Prerequisites: 36-705 "Intermediate Statistics", 36-707 "Intermediate Regression". If you have not taken these classes specifically, examine the syllabuses for these courses and make an appointment to see me within the first week of class.

The goal of this course is to give a meaningful introduction to, and exploration of, Bayesian statistical methods through computational techniques in seven weeks. We will focus on the principles of Bayesian hierarchical modelling methods that can be programmed efficiently and remain scientifically valid, and on methods for debugging without pulling too much hair out. We will not be explicitly covering discriminative machine-learning topics, but we will cover the same debugging concepts that will make things easier when coding them up.

Programming language: R will be the supported language for the course, with the possible use of WinBUGS.

Tentative outline of the course:

Week 1: Introductions. "Central Dogma of Generative Modelling"; one-level models, prior specifications and conjugacy; introduction to sampling and simulation in R.
Week 2: A reintroduction to Markov Chain theory, beginning with discrete models and moving to one-dimensional continuous models.
Week 3: Generalized linear models. Grid sampling, the Metropolis-Hastings algorithm, Gibbs sampling.
Week 4: Gaussian multilevel models. Partial and full pooling of variance components; autocorrelation and cross-correlation in chains; diagnostics for convergence.
Week 5: Generalized multilevel models; posterior predictive checking.
Week 6: Varying-slope models in the multilevel context.
Week 7: Special topics to be determined; Bayesian graphical models, causal inference.

If you have any suggestions for topics that ought to be considered, please let me know.

New Journal: Statistics, Politics and Policy

For those of us who would like to see more concrete discussion of policy issues with a strong numerical component, the journal Statistics, Politics and Policy is currently in pre-launch for a first issue this coming summer. Full disclosure: I am also serving as an associate editor, but I wouldn't have joined if I didn't believe it would be high-quality.

Because SPP is published online by Berkeley Electronic Press, it promises to have a quick turnaround time for submissions as well as great accessibility. And it's definitely something I'll try to uphold in my role with the journal.

Citation Software My High Standards Won't Accept

The main reason I started to develop PaperTrail was that no other software out there was suitable for literature reviews and identifying commonly cited papers. I've been pointed towards several other options that don't do it.

Cross-platform: Zotero as a browser app isn't comprehensive enough. Mendeley is too much in the power of another company and isn't open source. JabRef is a fair product which I'd probably use if not for the lack of cross-citations.

Mac software: Papers and BibDesk are both citation/paper managers but are OS X only. Papers isn't free. Plus I don't have a Mac.

Help me to get PaperTrail out the door

I've spent the better part of my recreational programming time in the last two years working on an enhanced bibliographic project. The purpose was to build a citation manager that would also track references, so that as you build a literature review you can keep track of common sources, "important" papers, etc. There were also a few bigger goals to the project itself that I hoped would solve some problems that academics have in general.

I named this project PaperTrail, and I've been trying to get it ready for other users to test out. The only problem is that this is my first effort at real application building since high school, and even that one didn't work out too well. I built it using the gtkmm interface in C++ on my Ubuntu Linux machine, which means that in theory, it should be cross-platform so that all my Windows and OS X using friends can use it as well. Putting this together in practice -- auto-installation, etc -- is much trickier.

Here are the goals that PaperTrail is meant to help meet:

  1. To standardize citation scripting language in a way that would incorporate author identity. Those of us with the William H. Macy problem know that establishing such a system would be wonderful; I figure we're about 90% of the way there by noting that authors reference themselves in their work, so that a database that includes citations for each paper would ease the need for a scientific author database.
  2. To better process downloaded PDFs. Academics know the pain of sorting their physical paper collection, let alone their digital volumes. I've currently got it rigged so that downloaded PDFs open in PaperTrail, where the relevant bibliographic information can be collected (or imported from the document itself); this info is then saved to the file, which is archived to its own directory, much as iTunes does with music files.
  3. To quickly grab citation information from a paper. I've got a regular-expressions set-up to grab the citations from a paper's reference list (complete with index numbers) and turn them into PaperTrail entries. It clearly needs some work but it's at least built to be expanded upon.
  4. To have better tracking of multiple versions of papers that might be cited -- drafts, conference proceedings, final versions -- within a single entry.
  5. To process and nest comments and rejoinders to journal discussion papers.
  6. To export a data file to bibtex for use in LaTeX documents. I wouldn't mind adding EndNote compatibility if anyone wanted to use it.
  7. More ideas that I'm forgetting to mention.
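
As an illustration of the reference-grabbing idea in item 3, here is a minimal Python sketch. It is not PaperTrail's actual code (which is written in C++); the pattern, the function name split_references and the sample entries are all made up for this example. It assumes that each reference starts with an index such as [12] or 12. and that any other non-empty line is a wrapped continuation of the entry before it.

```python
import re

# Hypothetical pattern: an index written as "[12]" or "12." at the start of a
# line, followed by the text of the citation.
ENTRY_START = re.compile(r"^\s*(?:\[(\d+)\]|(\d+)\.)\s+(.*)$")


def split_references(text):
    """Split a numbered reference list into (index, citation text) pairs."""
    entries = []
    for line in text.splitlines():
        match = ENTRY_START.match(line)
        if match:
            # Exactly one of the two index groups is filled in.
            index = int(match.group(1) or match.group(2))
            entries.append([index, match.group(3).strip()])
        elif entries and line.strip():
            # A wrapped line: glue it onto the entry that came before.
            entries[-1][1] += " " + line.strip()
    return [(index, body) for index, body in entries]


# Placeholder entries, invented purely for the example.
sample = """\
[1] A. Author and B. Writer (2001). A first made-up title.
    Journal of Examples 1, 1-10.
[2] C. Scholar (2005). A second made-up title. Example Press.
"""

for index, body in split_references(sample):
    print(index, body)
```

A real reference list is messier than this: a wrapped line that happens to begin with something like "1999." would be mistaken for a new entry, which is exactly the kind of case that needs more work.
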
What I need is help building the installation procedures for Windows, Mac and *nix respectively; I've almost got it for the last one, except for locating supporting files and directories. Because the audience for this is small enough (poor academics, mainly!) I have no interest in trying to find profitability in this idea, only in making a product that people would want to use and share.

The source code is posted here; please contact me if you're interested in helping out, have friends who know this stuff, or have suggestions on features that should be included.

Over what I'm sure will be some howls of objection, I maintain that Breaking Bad is the best show on AMC, better than that other one that everyone else talks about. The main reason is that there doesn't seem to be a greater dramatic actor with comedy instincts than Bryan Cranston (splitting hairs a bit, as I think Hugh Laurie is the best comedic actor with dramatic instincts), but there are also at least three issues that it raises, and raises well:

  • The law of unintended consequences is ultimately what runs the show. Almost every action taken by a character has a later reaction, predictable or otherwise.
  • Bankruptcy from health-related causes is a serious problem, and it's the lack of a strong insurance system that keeps people from picking their own (quality) doctors.
  • Drug addicts are people too -- any kind of approach to dealing with the problems of addiction must take it into account.
Leaving aside the fact that Walter White is apparently as screwed-up a man as any of us, having made more than his fair share of bad life decisions, the precarious position he is in could have been mitigated by an insurance plan that didn't burden him with the cost of an expensive treatment.

The worst part of this is that we likely can't get back to the real meat of the discussion (what are the consequences, intended and otherwise, of each proposed change to the healthcare system in America?), since the debate is buried under verifiably false scare claims.

In short, this is another example of what I think of as the regression-to-the-mean of policy effects: consequences that appear large are most likely overblown, and those that appear small are likely bigger.

P.S. If you don't believe me about Bryan Cranston's dramatic chops, see him as Buzz Aldrin first.

A Short Note on Breast Cancer Screening: Really Less Effective?

There is plenty in the news on recommendations for breast cancer screening, but one detail jumped out at me -- namely, the suggestion that more women aged 40-49 (1904) than women aged 50-59 (1339) would need to be screened regularly to prevent one death due to breast cancer. This gets prominence in news reports because it's an easy way of summarizing effectiveness, even though it's a completely misleading interpretation of the recommendation. From the source material:

Total number to screen to prevent one fatality from cancer:
Ages 39-49: Mean 1904, CI (929, 6378)
Ages 50-59: Mean 1339, CI (322, 7455)


That's a less-than-compelling difference in effectiveness when one confidence interval lies completely within the other.
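
To make the containment claim concrete, here is a small Python check that uses only the two intervals quoted above; the helper names lies_within and overlap are my own, made up for illustration. Nested intervals are an informal warning sign rather than a formal test of the difference between the age groups, so treat this as a sanity check and nothing more.

```python
# Intervals as quoted above: (lower, upper) bounds on the number of women
# who need to be screened to prevent one death.
ci_39_49 = (929, 6378)
ci_50_59 = (322, 7455)


def lies_within(inner, outer):
    """Return True if the interval `inner` is entirely contained in `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def overlap(a, b):
    """Return the overlapping part of two intervals, or None if disjoint."""
    low, high = max(a[0], b[0]), min(a[1], b[1])
    return (low, high) if low <= high else None


print(lies_within(ci_39_49, ci_50_59))
print(overlap(ci_39_49, ci_50_59))
```

The interval for ages 39-49 sits inside the one for ages 50-59, so the overlap is simply the whole of the narrower interval.
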

The intended point of the recommendation was that screens and operations have risks -- false positive results leading to unnecessary biopsies and unintended consequences -- though on first inspection I couldn't find any data on the mortality risk from overtreatment to compare.

P.S. There's clearly a lot more to say about the implications of this analysis, for the health care debate in the U.S. in particular, but I'm at least in a position to dispute one misinterpretation.

Cochran at 100

I spent this past Saturday at a symposium for the centennial of William G. Cochran, one of my erstwhile department's co-founders, and I wasn't disappointed in the least. The impact he's had, both on the discipline and the world, appears to be vast; the suggestion that his work on the effects of smoking has saved millions of lives is an idea I'll eventually follow up on in detail.

After all I've learned about the diversity of his background, he also appears to be a highly positive case of Doctor No.

The Harvard Gazette has a nice write-up of the event. For those who want a long look at what he did, I strongly recommend The Planning of Observational Studies of Human Populations, a paper I'm ready to classify as timeless.