Mon. Jul. 22, 2019 Salmon sea lice methylomes and Geoduck genome

run fastqc and trim

  • I ended up getting this to run on Ostrich via this jupyter notebook:
  • The multiqc instance I installed on Ostrich via didn’t seem to work properly (errors in jupyter notebook output) so I ran this on emu and it worked

      scp /Volumes/web/metacarcinus/Salmo_Calig/analyses/20190722/*.zip srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc
      scp /Volumes/web/metacarcinus/Salmo_Calig/analyses/20190722/*.html srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc
      srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ multiqc .
      [INFO   ]         multiqc : This is MultiQC v0.9
      [INFO   ]         multiqc : Template    : default
      [INFO   ]         multiqc : Searching '.'
      [INFO   ]          fastqc : Found 92 reports
      [INFO   ]         multiqc : Report      : multiqc_report.html
      [INFO   ]         multiqc : Data        : multiqc_data
      [INFO   ]         multiqc : MultiQC complete
      srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rsync --archive --progress --verbose multiqc_data strigg@ostrich.fish.washington.edu:/Volumes/web/metacarcinus/Salmo_Calig/analyses/20190722
      Password:
      building file list ... 
      6 files to consider
      multiqc_data/
      multiqc_data/.multiqc.log
                6,880 100%    0.00kB/s    0:00:00 (xfr#1, to-chk=4/6)
      multiqc_data/multiqc_fastqc.txt
               20,380 100%   19.44MB/s    0:00:00 (xfr#2, to-chk=3/6)
      multiqc_data/multiqc_general_stats.txt
                6,880 100%    6.56MB/s    0:00:00 (xfr#3, to-chk=2/6)
      multiqc_data/multiqc_report.html
            2,567,171 100%   18.14MB/s    0:00:00 (xfr#4, to-chk=1/6)
      multiqc_data/multiqc_sources.txt
               12,078 100%   87.37kB/s    0:00:00 (xfr#5, to-chk=0/6)
    	
      sent 2,614,141 bytes  received 500 bytes  747,040.29 bytes/sec
      total size is 2,613,389  speedup is 1.00
      srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rm *.html
      srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rm *.zip
      srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rm -r multiqc_data/
    
  • Raw sequence FASTQC output folder:
  • Raw sequence MultiQC report (HTML):
  • TrimGalore! output folder (adapter trimmed FastQ files):
  • Adapter trimming MultiQC report (HTML):
  • TrimGalore hard trim output folder (first 5bp trimmed from 5’ of each read):
  • Hard trim MultiQC report (HTML):

copy the salmon genome and sea lice genome to mox

  • Steven shared the C_rogercresseyi on this CaligusLIFE Slack channel thread.
    • I downloaded it locally and then copied it to Mox:

        Shellys-MacBook-Pro:coverage Shelly$ rsync --archive --progress --verbose ~/Downloads/Caligus_rogercresseyi_Genome.fa strigg@mox.hyak.uw.edu:/gscratch/srlab/strigg/data/Caligus/GENOMES
        Password: 
        Enter passcode or select one of the following options:
         1. Duo Push to Android (XXX-XXX-0029)
         2. Phone call to Android (XXX-XXX-0029)
        Duo passcode or option [1-2]: 1
        building file list ... 
        1 file to consider
        Caligus_rogercresseyi_Genome.fa
           515420160 100%   19.75MB/s    0:00:24 (xfer#1, to-check=0/1)
        sent 515483225 bytes  received 42 bytes  14122829.23 bytes/sec
        total size is 515420160  speedup is 1.00
      
    • I also copied it to Gannet

        Shellys-MacBook-Pro:Caligus Shelly$ rsync --archive --progress --verbose ~/Downloads/Caligus_rogercresseyi_Genome.fa /Volumes/web/metacarcinus/Salmo_Calig/GENOMES/Caligus
        building file list ... 
        1 file to consider
        Caligus_rogercresseyi_Genome.fa
           515420160 100%   12.72MB/s    0:00:38 (xfer#1, to-check=0/1)
      		
        sent 515483225 bytes  received 42 bytes  13050209.29 bytes/sec
        total size is 515420160  speedup is 1.00
      
  • I downloaded the mgannetost recent RefSeq version of the S.salar genome on gannet:

          ostrich:RefSeq strigg$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/233/375/GCF_000233375.1_ICSASG_v2/GCF_000233375.1_ICSASG_v2_genomic.fna.gz
      --2019-07-19 12:21:17--  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/233/375/GCF_000233375.1_ICSASG_v2/GCF_000233375.1_ICSASG_v2_genomic.fna.gz
                 => ‘GCF_000233375.1_ICSASG_v2_genomic.fna.gz’
      Resolving ftp.ncbi.nlm.nih.gov... 130.14.250.11, 2607:f220:41e:250::11
      Connecting to ftp.ncbi.nlm.nih.gov|130.14.250.11|:21... connected.
      Logging in as anonymous ... Logged in!
      ==> SYST ... done.    ==> PWD ... done.
      ==> TYPE I ... done.  ==> CWD (1) /genomes/all/GCF/000/233/375/GCF_000233375.1_ICSASG_v2 ... done.
      ==> SIZE GCF_000233375.1_ICSASG_v2_genomic.fna.gz ... 759073402
      ==> PASV ... done.    ==> RETR GCF_000233375.1_ICSASG_v2_genomic.fna.gz ... done.
      Length: 759073402 (724M) (unauthoritative)
    	
      GCF_000233375.1_ICSASG_v2_genomic.fna.gz                            100%[=================================================================================================================================================================>] 723.91M   631KB/s    in 19m 53s 
    	
      2019-07-19 12:41:12 (621 KB/s) - ‘GCF_000233375.1_ICSASG_v2_genomic.fna.gz’ saved [759073402] 
    
    • then copied the S. salar RefSeq genome to Mox:

        Shellys-MacBook-Pro:Caligus Shelly$ rsync --archive --progress --verbose /Volumes/web/metacarcinus/Salmo_Calig/GENOMES/v2/RefSeq/GCF_000233375.1_ICSASG_v2_genomic.fna.gz strigg@mox.hyak.uw.edu:/gscratch/srlab/strigg/data/Ssalar/GENOMES
        Password: 
        Enter passcode or select one of the following options:
      		
         1. Duo Push to Android (XXX-XXX-0029)
         2. Phone call to Android (XXX-XXX-0029)
      		
        Duo passcode or option [1-2]: 1
        building file list ... 
        1 file to consider
        GCF_000233375.1_ICSASG_v2_genomic.fna.gz
           759073402 100%    7.68MB/s    0:01:34 (xfer#1, to-check=0/1)
      		
        sent 759166220 bytes  received 42 bytes  7195888.74 bytes/sec
        total size is 759073402  speedup is 1.00
      

copy data from gannet to owl

NEXT STEPS:

  1. Concatenate data from different lanes together (L001 + L002 for each sample)
  2. transfer concatenated trimmed reads from gannet to Mox
  3. determine alignment parameters for:
    • subset of Calig. data
    • subset of Salmon data
  4. run bismark on all data

Geoduck genome call notes:

method: assembly = 10X + phase overlay and illumina data used for QC

genome versions:

  • v70
    • MAKER ids 53000 transcripts
    • BUSCO: 84% complete, 78% single copy, 5% dup; 2Gb, N50 = 6Mbp
  • v074 (18scaffolds; not using BLAST to call genes use the gff maker generates)
    • MAKER: 1700 transcripts; instead map raw transcriptome reads to assembly and see if they map to more than 1700 transcriptome
    • BUSCO: 71% complete, 70% single copy, 0.9% dup; 1.42Gb (942Mbp after QC), N50 = 58Mbp
    • Steven’s alignments of subset of data with score min -0.6 had ~40% mapping
    • maybe take Steven’s data and do the same analysis to compare coverage with my previous analysis of v70, v71 and v73; or do the scaffold coverage analysis I previously did but on v074. Then see how things map

Next steps:

  • figure out what’s going on with largest scaffold: do transcripts map to it? yes
    • Use transcriptome as positive control: tissue specific should give estimate of # of transcriptome (Sam’s doing this analysis)
  • pick samples to run against 74, and find # of loci with min cov 10
    • DML (Hollie)
    • DMR 1000bp window (shelly)
    • Gene Body analysis
  • Determine if we should deduplicate or not
  • Do we need PacBio or Nanopore to validate? shouldn’t need validation with all the data we have

Lab meeting notes

Yaamini mentioned GO MWU that does rank based GO enrichment (does MW test instead of fisher)

Oyster proteomics paper: make corrections and don’t overclaim, then resend to Steven

Geoduck broodstock pH exp.: start writing manuscript

Written on July 22, 2019