One model to rule them all

I like hyperbole.

Anyway, I wrote a complicated model again. It ‘works’, it’s slow, and it doesn’t really tell me anything new about the data I ran through it. So what’s really new here?

Still, I like it for philosophical reasons. And I often think I’m a better scientific philosopher than actual scientist. So let me philosophize on this model.

There are two main inspirations for the model, and lots of secondary perks. The first main idea is that multi-omics data needs a theoretically grounded method of factor analysis. There are a lot of tools out there already, but almost all of them prioritize computational efficiency and take potentially disastrous statistical shortcuts. Of course, mine does too. But they’re different disastrous shortcuts, which I think are less disastrous.

For example, the ‘multivariate’ methods that I started with in my grad school days, and which are still very commonly used, involve the summarization of these extremely complex datasets into pairwise distance matrices of samples. All such methods require strong assumptions about the relative importance of variablewise dispersions. They also have varying degrees of success dealing with compositionality and the presence of zeros in the data. They are usually heavily influenced by ‘outlier’ observations – or in other words, non-normal residuals. I’ve become aware of a couple packages that use model-based methods with, for example, zero-inflated negative binomial errors, and I need to thoroughly check these out. But from what I’ve seen, those don’t encourage any kind of sparsity or strict variable selection, and don’t seem to be ideal for multi-omics or integrating other types of data. Basically, I expect if anyone replies to this blog, it’ll be to say ‘well this model solves X!’ But I’d be very surprised to hear about a single method that solves all of it. Especially for multi-omics rather than a single omics dataset.

The second main idea is that there are a lot of independent statistical tests that are often performed on omics data, which in reality are not at all independent of one another and whose results are thus difficult to interpret with confidence. For example, people are often interested separately in the ‘alpha diversity’ of their samples, differential abundance of individual variables among their samples, differential prevalence among their categories, and ‘beta diversity’ questions such as the very high-level ‘is there a difference in composition between treatments’. The problem is that some of these questions are pretty ill-defined – such as the last one. What do we learn about a system if there is a ‘difference in composition’ among categories, e.g. as determined with PERMANOVA? The simplest interpretation is that there are consistent location effects in some number of individual variables. Some number being a bit hard to put our fingers on, depending on the exact distance matrix and normalization procedure used… Is it just a single bacterial species which is different, and that species is super abundant and dominates the signal in the distance matrix? Or are there a lot? Or, maybe we didn’t ‘check our assumptions’ by applying PERMDISP beforehand, and the signal is actually driven by differences in variance (dispersions). Does the PERMANOVA result tell us anything more if we already have significant results from differential abundance analysis? There are a lot of individual scenarios that lead to the same ‘multivariate’ result – and those scenarios can be independently tested, while controlling for the others, if a single, well-defined model is used.
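To make the ‘one super abundant species dominates the signal’ scenario concrete, here is a minimal sketch using Bray-Curtis dissimilarity on raw counts. The counts are made up for illustration, and the choice of Bray-Curtis without normalization is my assumption; other distances and transformations will weight things differently, which is sort of the point.

```python
import numpy as np

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two non-negative count vectors."""
    return np.abs(x - y).sum() / (x + y).sum()

# Hypothetical toy sample: one dominant taxon plus nine rare taxa.
a = np.array([1000.0] + [10.0] * 9)

# Scenario 1: only the dominant taxon changes (it halves).
b_dominant = a.copy()
b_dominant[0] = 500.0

# Scenario 2: all nine rare taxa double; the dominant taxon is unchanged.
b_rare = a.copy()
b_rare[1:] = 20.0

d_dominant = bray_curtis(a, b_dominant)
d_rare = bray_curtis(a, b_rare)

print(f"one abundant taxon halves: {d_dominant:.3f}")
print(f"nine rare taxa double:     {d_rare:.3f}")
```

A single two-fold change in the abundant taxon produces a much larger distance than nine simultaneous two-fold changes in the rare taxa, so a test built on this distance matrix mostly hears the one loud variable.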

Aside from the pretty plots and generic ‘second view of the data’ method of gaining confidence in our results, I think the reason we still do both differential abundance analysis and multivariate stats like PERMANOVA is that PERMANOVA is interpreted as a gauge of the ‘overall relative importance’ of various factors to the dataset as a whole. In other words, we might have five separate factors that each have some significantly differentially abundant variables associated with them, but if a couple of those factors don’t also have significant PERMANOVA results, we can rank them as less important. Again, though, given the strong assumptions of those tests, this ranking is a bit hand-wavy. In a holistic model like the one I’ve been working on, ‘overall’ relative influence can be estimated directly with hierarchical variance parameters that are sampled at the same time as location effects. The confounded influences control for one another, and each parameter answers a fundamentally different question.
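Here is a toy illustration of why a per-factor variance parameter works as a measure of ‘overall’ influence. This is not my actual model: the factor names and scales are hypothetical, and the per-variable effects are simulated and treated as known, whereas in a real hierarchical model they are latent and sampled jointly with the scales.

```python
import numpy as np

rng = np.random.default_rng(0)
n_variables = 2000  # e.g. taxa, genes, or metabolites

# Hypothetical factors, each with its own scale for per-variable effects.
true_scales = {"diet": 1.0, "site": 0.3, "batch": 0.05}

# Each factor's location effects are drawn around zero with that factor's scale.
effects = {
    factor: rng.normal(0.0, scale, size=n_variables)
    for factor, scale in true_scales.items()
}

# The spread of a factor's effects across all variables summarizes its
# overall influence, however many individual effects would pass a test.
estimated_scales = {factor: e.std(ddof=1) for factor, e in effects.items()}
ranking = sorted(estimated_scales, key=estimated_scales.get, reverse=True)

print(ranking)
```

The ranking comes from one parameter per factor on a common scale, rather than from comparing which factors happen to cross a significance threshold in separate tests.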

I have a lot of other thoughts, but I think this stream of consciousness is too long already. The model is here.
