One model to rule them all

I like hyperbole.

Anyway, I wrote a complicated model again. It ‘works’, it’s slow, and it doesn’t really tell me anything new about the data I ran through it. So what’s really new here?

Still, I like it for philosophical reasons. And I often think I’m a better scientific philosopher than actual scientist. So let me philosophize on this model.

There are two main inspirations for the model, and lots of secondary perks. The first main idea is that multi-omics data needs a theoretically grounded method of factor analysis. There are a lot of tools out there already, but almost all of them put priority on computational efficiency and take potentially disastrous statistical shortcuts. Of course, mine does too. But they’re different disastrous shortcuts, which I think are less disastrous.

For example, the ‘multivariate’ methods that I started with in my grad school days, and which are still very commonly used, involve summarizing these extremely complex datasets into pairwise distance matrices of samples. All such methods require strong assumptions about the relative importance of variable-wise dispersions. They also have varying degrees of success dealing with compositionality and with the presence of zeros in the data. And they are usually heavily influenced by ‘outlier’ observations – or in other words, non-normal residuals. I’ve become aware of a couple of packages that use model-based methods with, for example, zero-inflated negative binomial errors, and I need to check those out thoroughly. But from what I’ve seen, they don’t encourage any kind of sparsity or strict variable selection, and don’t seem to be ideal for multi-omics or for integrating other types of data. Basically, I expect that if anyone replies to this blog, it’ll be to say ‘well, this model solves X!’ But I’d be very surprised to hear about a single method that solves all of it – especially for multi-omics rather than a single omics dataset.

The second main idea is that there are a lot of nominally independent statistical tests that are often performed on omics data, which in reality are not at all independent of one another and whose results are thus difficult to interpret with confidence. For example, people are often interested separately in the ‘alpha diversity’ of their samples, differential abundance of individual variables among their samples, differential prevalence among their categories, and ‘beta diversity’ questions such as the very high-level ‘is there a difference in composition between treatments?’ The problem is that some of these questions are pretty ill-defined – such as the last one. What do we learn about a system if there is a ‘difference in composition’ among categories, e.g. as determined with PERMANOVA? The simplest interpretation is that there are consistent location effects in some number of individual variables – that ‘some number’ being a bit hard to put our fingers on, depending on the exact distance matrix and normalization procedure used… Is it just a single bacterial species that differs, one so abundant that it dominates the signal in the distance matrix? Or are there a lot? Or maybe we didn’t ‘check our assumptions’ by applying PERMDISP beforehand, and the signal is actually driven by differences in variance (dispersions) – the simulation below shows how easily that can happen. Does the PERMANOVA result tell us anything more if we already have significant results from differential abundance analysis? There are a lot of individual scenarios that lead to the same ‘multivariate’ result – and those scenarios can be independently tested, while controlling for the others, if a single, well-defined model is used.
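
To make the dispersion scenario concrete, here’s a minimal simulation sketch – mine, written for this post, not from any particular package. Two groups are drawn with identical means, but the smaller group gets three times the dispersion, and a bare-bones PERMANOVA-style permutation test on Euclidean distances can still return a small p-value (the liberal unbalanced-design behavior described by Anderson & Walsh 2013). The pseudo_f helper is my own naming:

import numpy as np

rng = np.random.default_rng(1)

def pseudo_f(dist, groups):
    # pseudo-F from a distance matrix, following Anderson's PERMANOVA construction
    n = len(groups)
    labels = np.unique(groups)
    total_ss = np.sum(np.square(dist[np.triu_indices(n, 1)])) / n
    within_ss = 0.0
    for g in labels:
        idx = np.where(groups == g)[0]
        sub = dist[np.ix_(idx, idx)]
        within_ss += np.sum(np.square(sub[np.triu_indices(len(idx), 1)])) / len(idx)
    a = len(labels)
    return ((total_ss - within_ss) / (a - 1)) / (within_ss / (n - a))

# identical means by construction; the smaller group has triple the dispersion
x = np.vstack([rng.normal(0, 1, size=(50, 50)),
               rng.normal(0, 3, size=(15, 50))])
groups = np.repeat([0, 1], [50, 15])
dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

f_obs = pseudo_f(dist, groups)
perm = np.array([pseudo_f(dist, rng.permutation(groups)) for _ in range(999)])
p = (1 + np.sum(perm >= f_obs)) / (1 + len(perm))
print(f"pseudo-F = {f_obs:.2f}, permutation p = {p:.3f}")
# there are no location effects here, so a small p-value is driven
# purely by the dispersion difference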

Aside from the pretty plots and the generic ‘second view of the data’ method of gaining confidence in our results, I think the reason we still do both differential abundance analysis and multivariate stats like PERMANOVA is that PERMANOVA is interpreted as a gauge of the ‘overall relative importance’ of various factors to the dataset as a whole. In other words, we might have five separate factors that each have some significantly differentially abundant variables associated with them, but if a couple of those factors don’t also have significant PERMANOVA results, we can rank them as less important. Again, though, given the strong assumptions of those tests, this ranking is a bit hand-wavy. In a holistic model like the one I’ve been working on, ‘overall’ relative influence can be estimated directly with hierarchical variance parameters that are sampled at the same time as the location effects. The confounded influences then control for one another, and each set of parameters answers a fundamentally different question.
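
To sketch what I mean by those hierarchical variance parameters – a toy generative picture, not the actual model – each factor gets its own scale, every variable-level effect for that factor is drawn from a normal with that scale, and the scales themselves are what you compare to rank factors, rather than counting significant per-variable tests:

import numpy as np

rng = np.random.default_rng(0)

n_factors, n_vars = 3, 200
tau = np.array([1.0, 0.3, 0.0])  # per-factor 'overall influence' scales
# variable-level effects for each factor, drawn around zero with scale tau[f]
beta = rng.normal(0, 1, (n_factors, n_vars)) * tau[:, None]

# given the effects, the factor ranking is just the scales; a fitted model
# estimates tau jointly with beta instead of counting post-hoc test results
print(np.round(beta.std(axis=1), 2))  # approximately recovers tau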

I have a lot of other thoughts, but I think this stream of consciousness is too long already. The model is here.

Maybe it doesn’t Matern

Calling all bored and crazy Stan coders. I need a long-term coding partner.

I’ve been doing a lot of completely solo work for a couple years now, and you can guess the result. I have a couple of really awesome models that I am proud of, and yet they’re (1) not quite what I want, (2) probably way more complicated and messy than necessary, and (3) sitting there, waiting for me to figure out how to be a real modeler and justify them.

One of the things I was recently working on, and am a bit stuck on, is a Stan implementation of the circular Matern covariance function as described in “Covariance functions for mean square differentiable processes on spheres” (Guinness & Fuentes 2013). I just learned that the quadratic exponential covariance (i.e., the squared exponential) that I had been using is technically not valid for spherical distances, and that the linear exponential covariance, while valid, is non-differentiable (bad for parameters in Stan, yes?). So this would be a nice addition. And I’m pretty sure I translated the math correctly, but it doesn’t… work. A few iterations of ADVI and it diverges when using my circular Matern implementation, whereas the version of my model with the technically invalid quadratic exponential runs just fine. So maybe I’m trying to solve a problem that doesn’t exist – but I want to make it work anyway!
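
Before the Stan code, here’s a quick numerical sanity check of that validity claim – my own addition, not from the paper. For points on a circle, a valid covariance function of great-circle distance has to produce positive semi-definite matrices, so we can just inspect eigenvalues. If I have this right, the quadratic exponential picks up clearly negative eigenvalues at moderate length-scales, while the linear exponential does not:

import numpy as np

# points evenly spaced on the unit circle, with great-circle (geodesic) distances
n = 200
theta = np.linspace(0, 2 * np.pi, n, endpoint=False)
d = np.abs(theta[:, None] - theta[None, :])
d = np.minimum(d, 2 * np.pi - d)

for ell in [0.5, 1.0, 2.0]:
    k_sqexp = np.exp(-np.square(d) / (2 * ell ** 2))  # quadratic (squared) exponential
    k_exp = np.exp(-d / ell)                          # linear exponential
    print(f"length-scale {ell}:",
          f"min eig sq-exp = {np.linalg.eigvalsh(k_sqexp).min():.2e},",
          f"min eig exp = {np.linalg.eigvalsh(k_exp).min():.2e}")
# negative minimum eigenvalues for the squared exponential would confirm
# it is not a valid covariance for geodesic distances on the circle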

Here’s the code I’ve got. I’ve filtered out a bunch of unnecessary details but haven’t cleaned the remainder up…

functions {
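    // falling factorial with the convention ff(k, 0) = 1; used to precompute ffKJ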
    real ff(int k, int j) {
        if(j)
          return(falling_factorial(k,j));
        else
          return(1);
    }
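    // circular Matern covariance on great-circle distances d, following
    // Guinness & Fuentes (2013): a degree-(n-1) polynomial in alpha*(d-pi)
    // times cosh/sinh terms, with coefficients a constrained via H below;
    // ffKJ and chooseRJ hold precomputed falling factorials and binomials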
    vector circular_matern(vector d, int n, real alpha, matrix ffKJ, matrix chooseRJ) {
        real ap = alpha * pi();
        row_vector[2] csap = [cosh(ap), sinh(ap)];
        matrix[n-1,n] H;
        real annm1 = inv((-2 * square(alpha))^(n-1) * tgamma(n));
        vector[n] a;
        vector[rows(d)] adp = alpha * (d-pi());
        matrix[rows(d),2] csadp = append_col(cosh(adp), sinh(adp));
        vector[rows(d)] cov = zeros_vector(rows(d));
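        // my reading: each row of H is an odd-order derivative constraint at the
        // endpoints, expanded via the Leibniz product rule (hence choose * ff)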
        for(k in 0:(n-1)) {
          for(r in 0:(n-2)) {
            H[r+1,k+1] = 0;
            for(j in 0:(2*r+1)) {
              if(j <= k) {
                H[r+1,k+1]
                 += chooseRJ[r+1,j+1]
                    * ffKJ[k+1,j+1]
                    * ap^(k-j)
                    * csap[((k-j+1) % 2) + 1];
              }
            }
          }
        }
        // suspected fix: the original solved against H[,n-1], which is a column
        // of the left-hand matrix itself; the right-hand side should presumably
        // be the final column H[,n], the one multiplying the fixed a[n] = annm1
        a = append_row(-annm1 * (H[,1:(n-1)] \ H[,n]), annm1);
        for(k in 0:(n-1)) {
          cov += a[k+1] * adp^k .* csadp[,(k % 2) + 1];
        }
        return(cov / (2*alpha*csap[2]));
    }
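    // unpack a lower-triangle vector into a full symmetric matrix,
    // with a constant value c down the diagonal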
    matrix fill_sym(vector lt, int N, real c) {
        matrix[N,N] s_mat;
        int iter = 1;
        for(j in 1:(N-1)) {
            s_mat[j,j] = c;
            for(i in (j+1):N) {
                s_mat[i,j] = lt[iter];
                s_mat[j,i] = lt[iter];
                iter += 1;
            }
        }
        s_mat[N,N] = c;
        return(s_mat);
    }
}
data {
    int M[DRC]; // (DRC, D, and K are declared in the parts I filtered out)
    vector[choose(M[size(M)],2)] distSites; // distm(latlon)[lower.tri(distm(latlon))]
    int ms; // Matern smoothness parameter
}
transformed data {
    matrix[ms-1, 2 * ms] chooseRJ;
    matrix[ms, 2 * ms] ffKJ;
    // these were left unfilled in the excerpt; presumably they hold the
    // binomials choose(2r+1, j) and falling factorials ff(k, j) on the
    // same shifted indices used in circular_matern
    for(r in 1:(ms-1))
        for(j in 1:(2 * ms))
            chooseRJ[r,j] = (j <= 2*r) ? choose(2*r-1, j-1) : 0;
    for(k in 1:ms)
        for(j in 1:(2 * ms))
            ffKJ[k,j] = (j <= k) ? ff(k-1, j-1) : 0;
}
parameters {
    matrix<lower=0>[DRC+D,K] weight_scales;
    vector<lower=0>[K] siteDecay; //alpha
    vector<lower=0, upper=1>[K] siteProp; //nugget
}
transformed parameters {
    // the covSites declaration was among the filtered-out details; presumably:
    matrix[M[size(M)], M[size(M)]] covSites[K];
    for(k in 1:K) {
        covSites[k]
            = fill_sym(siteProp[k]
                       * square(weight_scales[DRC,k])
                       * circular_matern(distSites, ms, inv(siteDecay[k]), ffKJ, chooseRJ),
                       M[size(M)],
                       square(weight_scales[DRC,k]) + 1e-10);
        // below is the alternative, which works
        //covSites[k]
        //    = fill_sym(siteProp[k]
        //               * square(weight_scales[DRC,k])
        //               * exp(-square(distSites) / (2 * square(siteDecay[k]))),
        //               M[size(M)],
        //               square(weight_scales[DRC,k]) + 1e-10);
    }
}
model {
    weight_scales ~ student_t(2,0,2.5);
    siteDecay ~ inv_gamma(5,5);
}

Fun ideas

I have fun ideas, sometimes.

Like, I was imagining the title of this blog could be variable depending on the day. On days that were going pretty well, the nice ‘Mer de Microbes’. On other days, maybe that first space would disappear.

It should be pretty easy to do. I could write some code that would allow me to toggle it rather than manually editing it all the time. But maybe it would be even better if I could write code to analyze my social media posts and guess which title is appropriate automatically?

That would be neat. I’m sure I could do it. And I’m sure I’d gain some useful skills in the meantime. But I shouldn’t spend time on stuff like that.

eh. I’ll just write about the idea instead.

Mer de Microbes

When I started my first blog, I initially used this title for it. I had recently begun working in French Polynesia, was helping a French postdoc, and was interested in learning the language, and I thought, “‘Sea of Microbes’ in French, that’s nice.” I created the page and started working on the blog. And then that French postdoc saw the url: ‘merdemicrobes’.

‘What?? You can’t really call it that…’

His first parsing of the spaceless text was ‘Merde Microbes’. Microbe shit.

This was hilarious to me, but I was using a university-hosted blogging platform and ultimately wanted (at least some of) my writing to be accessible to kids, so, chagrined, I came up with a more bland name and commenced my writing and sciencing adventure under that.

Since then, my adventures have seen some wild ups and downs. My blog expanded for a bit with some features and posts that I was proud of. And then I got busy with other things, discovered that most of the features I was developing had already been done (better) by others, lost my motivation, and it trailed off. This blog arc largely paralleled my sentiments toward my actual science, as well. The more you learn, the more you learn that others have already learned it.

I’ve had a few years of the real world of academia and politics to disabuse myself of perfectionism and idealism. I’ve realized that I’ve tried to hold myself to standards of communication that are impossible for me to achieve, and that my constant self-censorship has led to extreme frustration, burnout, and anxiety. Though I was trying to be the best scientist and communicator possible, I ultimately became just completely stuck. So today I’m trying something new. I’m laying claim to ‘Mer de Microbes’ again. Fully embracing its original intent, but also the accidental double-meaning, the grammatical imperfection, the fact that I’m not French, and everything else that will come with the name. It may change again later. But for now, I am going to try and recommence my journey through a sea of microbes. Or through microbe shit. It doesn’t really matter.