stats

attempted data transformations to see if it would help make the data more normal

- log
- logit
- arcsin(sqrt) transform and ANOVA
- still zero-inflated
- best option seems to be binomial glm
- to do this I need to generate # methylated Cs and unmethylated cs
- save intermediate files in methylpy DMRfind
- concatenate all .tsv files
- these have already combined the counts from neighboring sites

- intersect with filtered bed file
- sum mCs and sum un-mCs for each DMR in each sample
- most numbers match, but some disagree. I don’t understand why. Contacted yupeng to see if there is a better way to get this data

- Turns out I can run a binomial glm on proportion data (http://www.flutterbys.com.au/stats/tut/tut10.5a.html)
- binomial glm needs integers not continuous data

- For now, going with arcsin(sqrt) transformation and ANOVA, uncorrected pval of 0.1
- liberal selection because model is a little unfair towards the data since it’s not exactly normal (still zero-inflated)
- selected DMRs still seem legit and show group differences
- arcsin DMR selection at pvalue < 0.1 is more stringent than untransformed data in ANOVA so that’s good

- Would still be better to use a glm but I’m waiting to see what Yupeng says since it’s not easy for me to regenerate that info.