I re-processed the salmon RRBS data implementing the following changes:

1. Trimming

2. Alignment

  • used alignment score threshold of L,0,-0.3.
  • exceprt from script shown below:

      #define genome folder
    	
      genome_folder="/gscratch/srlab/strigg/data/Ssalar/GENOMES/chr1-29MT"
    	
      #run bismark in PE mode on trimmed reads
    	
      find /gscratch/scrubbed/strigg/analyses/20200612/*_R1_001_val_1.fq.gz |\
      xargs basename -s _R1_001_val_1.fq.gz| \
      xargs -I{} bismark \
      --score_min L,0,-0.3 \
      -p 4 \
      --non_directional \
      --genome ${genome_folder} \
      -1 /gscratch/scrubbed/strigg/analyses/20200612/{}_R1_001_val_1.fq.gz \
      -2 /gscratch/scrubbed/strigg/analyses/20200612/{}_R2_001_val_2.fq.gz \
    	
      # deduplicate
    	
      find *.bam | \
      xargs basename -s .bam | \
      xargs -I{} deduplicate_bismark \
      --bam \
      --paired \
      {}.bam
    	
      #run methylation extractor
      bismark_methylation_extractor \
      --paired-end \
      --bedGraph \
      --comprehensive \
      --counts \
      --scaffolds \
      --multicore 14 \
      --buffer_size 75% \
      *.deduplicated.bam
    

3. Generate downstream files for calling DMRs

  • Exceprt from script shown below.
  • Note tab files generated in last part contain a last field that is the total number of reads for each CpG loci, where in the past this field has been the number of reads for unmethylated CpGs. This is because this format is closer to the allc format required for input into the DMRfind program in MethylPywhich calls methylated regions.

      # Sort files for methylkit and IGV
    	
      find *.deduplicated.bam | \
      xargs basename -s .bam | \
      xargs -I{} samtools \
      sort --threads 28 {}.bam \
      -o {}.sorted.bam
    	
      # Index sorted files for IGV
      # The "-@ 16" below specifies number of CPU threads to use.
    	
      find *.sorted.bam | \
      xargs basename -s .sorted.bam | \
      xargs -I{} samtools \
      index -@ 28 {}.sorted.bam
    	
      # Run multiqc 
    	
      /gscratch/srlab/strigg/bin/anaconda3/bin/multiqc \
      /gscratch/scrubbed/strigg/analyses/20200612/ \
      /gscratch/scrubbed/strigg/analyses/20200613/
    	
      # create merged
      find *.deduplicated.bismark.cov.gz |\
      xargs basename -s .deduplicated.bismark.cov.gz |\
      xargs -I{} coverage2cytosine \
      --genome_folder ${genome_folder} \
      -o {} \
      --merge_CpG \
      --zero_based \
      {}.deduplicated.bismark.cov.gz
    	
    	
      #creating tab files with % me, raw mCpG and total CpG counts
    	
      for f in *merged_CpG_evidence.cov
      do
        STEM=$(basename "${f}" .CpG_report.merged_CpG_evidence.cov)
        cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 5) {print $1, $2, $3, $4, $5, $5+$6}}' \
        > "${STEM}"_5x.bed
      done
    	
    

4. Assess genome coverage

I ran picard tools CollectWgsMetrics to get an idea of genome coverage across samples

The results (files ending in .collect_wgs_metrics.txt) suggest low genome coverage (~0.5%), but that is expected given this is RRBS data.

A better way to assess this might be to look at CpG coverage as I did for the meth_compare project (see https://github.com/hputnam/Meth_Compare/blob/master/scripts/README.md)

5. Calling methylated regions

Optimizations:

1. exclude CTRL samples

I decided to not include the 2 uninfected control samples from 8C and 16C because:

  • there are only uninfected controls for 26psu and not for 32psu
  • 2 samples is not enough to calculate mean methylation

2. reduce false negatives by removing the sample-catagory parameter

I’m excluding the sample category parameter because this led to false negatives (see where I encountered this in the geoduck BS data here: https://shellytrigg.github.io/232th-post/). This is different than what I had done in the past: past analysis summary 210th-post and past code 20191030_DMRfindMAPQ20_SalmoPt2.sh. Excerpt from past code shown below:

--min-cov 5 \
--output-prefix DMR250bp_MAPQ20_MCmax25_cov5x \
--dmr-max-dist 250 \
--mc-max-dist 25 \
--min-num-dms 3 \
--sample-category

STEP 1: Generate allc files

allc files were generated as follows

for f in ${input_files}*.merged_CpG_evidence.cov
do
awk -F"\t" '{print $1"\t"$3-1"\t""+""\t""CGA""\t"$5"\t"$5+$6"\t"1}' ${f} |\
gzip > $(basename ${f} _R1_001_val_1_bismark_bt2_pe.CpG_report.merged_CpG_evidence.cov)_allc.tsv.gz
done

STEP 2: Run DMRfind

This step identifies genomic regions across samples that are statistically different than what would be expected from methylation resulting from genetic variation alone.

I first copied all input files from Gannet to Ostrich as follows:

rsync --archive --verbose --progress --exclude CTRL* /Volumes/web/metacarcinus/Salmo_Calig/analyses/20200702/*_allc.tsv.gz ~/Desktop/20200811

I next ran methylpy DMRfind locally on Ostrich.

methylpy DMRfind \
--allc-files \
16C_26psu_1_S13_allc.tsv.gz \
16C_26psu_2_S14_allc.tsv.gz \
16C_26psu_3_S15_allc.tsv.gz \
16C_26psu_4_S16_allc.tsv.gz \
16C_32psu_1_S1_allc.tsv.gz \
16C_32psu_2_S2_allc.tsv.gz \
16C_32psu_3_S3_allc.tsv.gz \
16C_32psu_4_S4_allc.tsv.gz \
8C_26psu_1_S9_allc.tsv.gz \
8C_26psu_2_S10_allc.tsv.gz \
8C_26psu_3_S11_allc.tsv.gz \
8C_26psu_4_S12_allc.tsv.gz \
8C_32psu_1_S5_allc.tsv.gz \
8C_32psu_2_S6_allc.tsv.gz \
8C_32psu_3_S7_allc.tsv.gz \
8C_32psu_4_S8_allc.tsv.gz \
--samples 16C_26psu_1 16C_26psu_2 16C_26psu_3 16C_26psu_4 16C_32psu_1 16C_32psu_2 16C_32psu_3 16C_32psu_4 8C_26psu_1 8C_ ^M26psu_2 8C_26psu_3 8C_26psu_4 8C_32psu_1 8C_32psu_2 8C_32psu_3 8C_32psu_4 \
--mc-type "CGN" \
--num-procs 8 \
--min-cov 5 \
--output-prefix DMR250bp_MCmax25_cov5x \
--dmr-max-dist 250 \
--mc-max-dist 25 \
--min-num-dms 3 

Alternate DMRfind analyses I tried that didn’t give optimal results:

  • 20200616
    • here is used the following parameters and the results file contained 8 regions when I was expecting > 1000 regions based on the geoduck results
      --output-prefix DMR250bp_cov5x 
      --dmr-max-dist 250 
      --min-num-dms 3
    
  • 20200701
    • I thought my results may not be correct because I didn’t employ the min_clust parameter so I tried the following
      --output-prefix DMR250bp_cov5x_minClust1 
      --dmr-max-dist 250 
      --min-num-dms 3 
      --sample-category 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 4 4 5 5 
      --min-cluster 1
    
  • 20200702
    • I realized when I generated the DMRfind input files I mistakenly added fields 5 and 6 of the tab files again, when already field 6 of the tab files contained the total mC and umC counts.
    	
      (base) [strigg@mox1 20200613]$ head CTRL_8C_26psu_2_S18_R1_001_val_1_bismark_bt2_pe.CpG_report.merged_CpG_evidence.cov
      NC_001960.1	106	108	0.000000	0	3
      NC_001960.1	200	202	0.000000	0	419
      NC_001960.1	231	233	0.947867	4	418
      NC_001960.1	267	269	0.000000	0	337
      NC_001960.1	274	276	0.419287	2	475
      NC_001960.1	301	303	0.620347	5	801
      NC_001960.1	326	328	0.459418	3	650
      NC_001960.1	333	335	0.469484	3	636
      NC_001960.1	379	381	0.190476	1	524
      NC_001960.1	396	398	0.451467	2	441
    	
      (base) [strigg@mox1 20200613]$ head CTRL_8C_26psu_2_S18_R1_001_val_1_bismark_bt2_pe_5x.bed 
      NC_001960.1	200	202	0.000000	0	419
      NC_001960.1	231	233	0.947867	4	422
      NC_001960.1	267	269	0.000000	0	337
      NC_001960.1	274	276	0.419287	2	477
      NC_001960.1	301	303	0.620347	5	806
      NC_001960.1	326	328	0.459418	3	653
      NC_001960.1	333	335	0.469484	3	639
      NC_001960.1	379	381	0.190476	1	525
      NC_001960.1	396	398	0.451467	2	443
      NC_001960.1	463	465	0.966184	4	414
    	
      (base) [strigg@mox1 20200613]$ cat *5x.bed | awk -F"\t" '{if($6<$5)print $0}' | wc -l
      0
    	
    	
      (base) [strigg@mox1 20200613]$ zcat ../20200616/CTRL_8C_26psu_2_S18_allc_5x.tsv.gz | head
      NC_001960.1	201	+	CGA	0	419	1
      NC_001960.1	232	+	CGA	4	426	1
      NC_001960.1	268	+	CGA	0	337	1
      NC_001960.1	275	+	CGA	2	479	1
      NC_001960.1	302	+	CGA	5	811	1
      NC_001960.1	327	+	CGA	3	656	1
      NC_001960.1	334	+	CGA	3	642	1
      NC_001960.1	380	+	CGA	1	526	1
      NC_001960.1	397	+	CGA	2	445	1
      NC_001960.1	464	+	CGA	4	418	1
    	
    
    • I remade the input files as follows
    	
      for f in ${input_files}*_5x.bed
      do
      awk -F"\t" '{print $1"\t"$3-1"\t""+""\t""CGA""\t"$5"\t"$6"\t"1}' ${f} |\
      gzip > $(basename ${f} _R1_001_val_1_bismark_bt2_pe_5x.bed)_allc_5x.tsv.gz
      done
    	
    
    • I ran DMRfind with the following parameters
      --output-prefix DMR250bp_cov5x \
      --dmr-max-dist 250 \
      --min-num-dms 3 
    

6. Calling DMRs

  • R Script: 20200812_groupStats.Rmd
    • this script performs the following:
      1. filters methylated regions for coverage in at least 3 out of 4 samples
      2. performs arcsin square root transformation of % methylation data (DMR250bp_MCmax25_cov5x_rms_results_collapsed.tsv) for each region
      3. performs 2-way ANOVA to identify DMRs
      4. generates heatmaps for significant DMRs (p< 0.05)
      5. generates ANOVA results table
      6. generates histograms of DMR methylation before and after arcsin square root transformation and heatmaps of DMR methylation

Histograms of DMR methylation before and after arcsin square root transformation

Results from 2-way ANOVA (percent methylation ~ Temp * Salinity)

  • DMR_2way_aov_summ_stats.csv
    • The last column “sig_effect” denotes a significant effect at P < 0.05
      • 501 regions were not significant (denoted as “none”)
      • 48 regions showed a significant effect from salinity only (denoted as “Salinity”)
      • 30 regions showed a significant effect from temperature only (denoted as “Temp”)
      • 6 regions showed a significant additive effect meaning a significant effect from temperature and a significant effect from salinity but no significant interaction effect (denoted as “Temp+Sal”)
      • 39 regions showed a significant interaction effect from temperature and salinity (denoted as “TempxSal”)

DMRs affected by Salinity

These DMRs showed only a significant effect from salinity and not from temperature or the salinity:temperature interaction.

Heatmap showing sample % methylation across treatment groups

Heatmap showing group mean % methylation across treatment groups

DMRs affected by Temperature

These DMRs showed only a significant effect from temperature and not from salinity or the salinity:temperature interaction.

Heatmap showing sample % methylation across treatment groups

Heatmap showing group mean % methylation across treatment groups

DMRs showing a Salinity + Temperature additive effect These DMRs showed a significant effect from salinity and temperature but no significant effect from their interaction.

Heatmap showing sample % methylation across treatment groups

Heatmap showing group mean % methylation across treatment groups

DMRs showing a Salinity:Temperature interaction effect

These DMRs showed only a significant effect from salinity:temperature interaction and not from salinity or temperature alone.

Heatmap showing sample % methylation across treatment groups

Heatmap showing group mean % methylation across treatment groups

Next steps

  • genomic feature analysis
  • match up with RNAseq data