L3 L4

Goal: To use Weam's L3/L4 miSeq data to streamline and standardize the DE pipeline for RNA-Seq work.

Dependencies for this pipeline

FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Trimmomatic http://www.usadellab.org/cms/index.php?page=trimmomatic
Trinity - best to run on Mason
BowTie 2 (required for TopHat) http://sourceforge.net/projects/bowtie-bio/files/bowtie2/ unzip and cd into directory. Type 'make'. Then copy executables to /usr/local/bin/
TopHat - there are precompiled binaries here http://ccb.jhu.edu/software/tophat/tutorial.shtml or can be compiled from source, but this is a bit more work. Directions can be found at the same site.
Cufflinks http://cole-trapnell-lab.github.io/cufflinks/install/ - download and unzip v.2.2.1 executables for Mac, then copy to /usr/local/bin
HTSeq http://www-huber.embl.de/HTSeq/doc/install.html#installation-on-macos-x download then python setup.py build, sudo python setup.py install
pysam (https://code.google.com/p/pysam/downloads/list) install - type "python setup.py build", "python setup.py install" (Not sure if this is necessary - I think it was a dependency for something else I tried

1 - Get data

Downloaded data from Weam's Galaxy:

Galaxy1-[BrugMalWolL3_S2_L001_R1_001.fastq].fastq
Galaxy2-[BrugMalWolL3_S2_L001_R2_001.fastq].fastq
Galaxy3-[BrugMalWolL4_S3_L001_R1_001.fastq].fastq
Galaxy4-[BrugMalWolL4_S3_L001_R2_001.fastq].fastq
Galaxy5-[BmalayiL31_S5_L001_R1_001.fastq].fastq
Galaxy6-[BmalayiL32_S6_L001_R1_001.fastq].fastq
Galaxy7-[BmalayiL41_S7_L001_R1_001.fastq].fastq
Galaxy8-[BmalayiL42_S8_L001_R1_001.fastq].fastq

2 - Run all through FastQC

- check adapter and overrepresented sequences and put in file for trimmomatic. Here, TruSeq Adapter with Indices 5,6,7 and 12 were found, plus the universal adapter, mostly at the end of the sequence (read through?). Made adapter file L3L4_adapterseqs_0301015.fa which I will upload here.

3 - Ran script runTrimmomatic.py

with these basic trimmomatic parameters:

ILLUMINACLIP:L3L4_adapterseqs_0301015.fa:2:40:30  LEADING:8 TRAILING:10   SLIDINGWINDOW:4:30 MINLEN:25

4 - Build a new reference transcriptome from all data

- following instructions here http://2014-msu-rnaseq.readthedocs.org/en/latest/s-building-a-reference.html

If not already done, build Bowtie2 index files for the genome:

bowtie2-build -f  brugia_malayi.PRJNA10729.WBPS1.genomic.fa brugia_malayi.PRJNA10729.WBPS1.genomic

For this, I removed the "Galaxy…[" from the file names, to avoid any confusion

The code below calls tophat2 with

  • two threads (-p 2)
  • tells it where the genome annotations are (-G /Users/jgrant/Documents/L3\:L4/brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.annotations.gff3)
  • make genome index to save time if this needs to be run again (—transcriptome-index value = where to put it)
  • save to a folder called tophat_all
  • the base name of the bowtie index files - these need to be made in advance - brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.genomic
  • list of sequences, first the forward and unpaired in a comma-delimited list, then the reverse reads of the paired output (in the same order as the first list.)
tophat2 -p 2 \
    -G  /Users/jgrant/Documents/L3\:L4/brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.annotations.gff3 \
    --transcriptome-index=/Users/jgrant/Documents/L3\:L4/transcriptome \
    -o tophat_all \
    brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.genomic \
    allL3/BrugMalWolL3_S2_L001_R1_001.fastq.fastq_paired.fq,allL4/BrugMalWolL4_S3_L001_R1_001.fastq.fastq_paired.fq,allL3/BrugMalWolL3_S2_L001_R1_001.fastq.fastq_unpaired.fq,allL3/BrugMalWolL3_S2_L001_R2_001.fastq.fastq_unpaired.fq,allL3/BmalayiL31_S5_L001_R1_001.fastq.fastq_output,allL3/BmalayiL32_S6_L001_R1_001.fastq.fastq_output,allL4/BrugMalWolL4_S3_L001_R1_001.fastq.fastq_unpaired.fq,allL4/BrugMalWolL4_S3_L001_R2_001.fastq.fastq_unpaired.fq,allL4/BmalayiL41_S7_L001_R1_001.fastq.fastq_output,allL4/BmalayiL42_S8_L001_R1_001.fastq.fastq_output    allL3/BrugMalWolL3_S2_L001_R2_001.fastq.fastq_paired.fq,allL4/BrugMalWolL4_S3_L001_R2_001.fastq.fastq_paired.fq

This took 05:26:56 hours

Look at the summary in align_summary.txt

Left reads:
          Input     :   7262372
           Mapped   :   6379126 (87.8% of input)
            of these:    401236 ( 6.3%) have multiple alignments (878 have >20)
Right reads:
          Input     :   7262372
           Mapped   :   6412516 (88.3% of input)
            of these:    404059 ( 6.3%) have multiple alignments (886 have >20)
Unpaired reads:
          Input     :  24533513
           Mapped   :  20587321 (83.9% of input)
            of these:   1309058 ( 6.4%) have multiple alignments (823 have >20)
85.5% overall read mapping rate.

Aligned pairs:   6184442
     of these:    389143 ( 6.3%) have multiple alignments
                    8669 ( 0.1%) are discordant alignments
85.0% concordant pair alignment rate.

Run cufflinks:

cufflinks -o cuff_all tophat_all/accepted_hits.bam

StdOut:

You are using Cufflinks v2.2.1, which is the most recent release.
[10:03:07] Inspecting reads and determining fragment length distribution.
> Processed 28665 loci.                        [*************************] 100%
> Map Properties:
>    Normalized Map Mass: 27173004.33
>    Raw Map Mass: 27173004.33
>    Fragment Length Distribution: Empirical (learned)
>                  Estimated Mean: 184.47
>               Estimated Std Dev: 35.37
[10:09:48] Assembling transcripts and estimating abundances.
> Processed 32125 loci.                        [*************************] 100%

Run cuffmerge:

ls -1 cuff_all/transcripts.gtf > cuff_list.txt

cuffmerge -g /Users/jgrant/Documents/L3\:L4/brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.annotations.gff3 \
    -o cuffmerge_all \
    -s /Users/jgrant/Documents/L3\:L4/brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.genomic.fa \
    cuff_list.txt

Output is here - cuffmerge_all/merged.gtf

The tutorial here:

http://2014-msu-rnaseq.readthedocs.org/en/latest/s-tophat-round2.html

has a script remove-nostrand.py that removes lines that don't have a mapping strand.

python remove-nostrand.py cuffmerge_all/merged.gtf > cuffmerge_all/nostrand.gtf

This removes one line in my gtf file merged.gtf. Removed line is:

Bmal_v3_scaffold2    Cufflinks    exon    718571    719557    .    .    .    gene_id "XLOC_004918"; transcript_id "TCONS_00019074"; exon_number "1"; gene_name "WBGene00221590"; oId "CUFF.2927.1"; nearest_ref "Transcript:Bm1329f"; class_code "i"; tss_id "TSS10616";

Output is nostrand.gtf. This is what I will use going forward, the new transcriptome annotation.

Compare the different gtf files:

cuffcompare cuffmerge_all/nostrand.gtf \
      /Users/jgrant/Documents/L3\:L4/brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.annotations.gff3 \
      cuffmerge_all/merged.gtf -o compare

Output is:

# Cuffcompare v2.2.1 | Command line was:
#cuffcompare cuffmerge_all/nostrand.gtf /Users/jgrant/Documents/L3:L4/brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.annotations.gff3 cuffmerge_all/merged.gtf -o compare
#

#= Summary for dataset: cuffmerge_all/nostrand.gtf :
#     Query mRNAs :   46157 in   14124 loci  (42876 multi-exon transcripts)
#            (7756 multi-transcript loci, ~3.3 transcripts per locus)

#= Summary for dataset: /Users/jgrant/Documents/L3:L4/brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.annotations.gff3 :
#     Query mRNAs :   39895 in   15833 loci  (36675 multi-exon transcripts)
#            (7879 multi-transcript loci, ~2.5 transcripts per locus)

#= Summary for dataset: cuffmerge_all/merged.gtf :
#     Query mRNAs :   46158 in   14125 loci  (42876 multi-exon transcripts)
#            (7756 multi-transcript loci, ~3.3 transcripts per locus)

 Total union super-loci across all input datasets: 14125 
  (7623 multi-transcript, ~3.5 transcripts per locus)

5 - Now - map individual runs onto the new annotated transcriptome!

For first - L3 paired end read

tophat -p 2 \
    -G cuffmerge_all/nostrand.gtf \
    --transcriptome-index=/Users/jgrant/Documents/L3\:L4/merged \
    -o tophat_BrugMalWolL3 \
    brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.genomic \
    allL3/BrugMalWolL3_S2_L001_R1_001.fastq.fastq_paired.fq allL3/BrugMalWolL3_S2_L001_R2_001.fastq.fastq_paired.fq

All remaining - each took from 30 min to 1 hour (for paired end runs)

tophat2 -p 2 \
    -G cuffmerge_all/nostrand.gtf \
    --transcriptome-index=/Users/jgrant/Documents/L3\:L4/merged \
    -o tophat_BmalayiL31 \
    brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.genomic \
    allL3/BmalayiL31_S5_L001_R1_001.fastq.fastq_output

tophat2 -p 2 \
    -G cuffmerge_all/nostrand.gtf \
    --transcriptome-index=/Users/jgrant/Documents/L3\:L4/merged \
    -o tophat_BmalayiL32 \
    brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.genomic \
    allL3/BmalayiL32_S6_L001_R1_001.fastq.fastq_output

tophat2 -p 2 \
    -G cuffmerge_all/nostrand.gtf \
    --transcriptome-index=/Users/jgrant/Documents/L3\:L4/merged \
    -o tophat_BrugMalWolL4 \
    brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.genomic \
    allL4/BrugMalWolL4_S3_L001_R1_001.fastq.fastq_paired.fq  allL4/BrugMalWolL4_S3_L001_R2_001.fastq.fastq_paired.fq

tophat2 -p 2 \
    -G cuffmerge_all/nostrand.gtf \
    --transcriptome-index=/Users/jgrant/Documents/L3\:L4/merged \
    -o tophat_BmalayiL41 \
    brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.genomic \
    allL4/BmalayiL41_S7_L001_R1_001.fastq.fastq_output

tophat2 -p 2 \
    -G cuffmerge_all/nostrand.gtf \
    --transcriptome-index=/Users/jgrant/Documents/L3\:L4/merged \
    -o tophat_BmalayiL42 \
    brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.genomic \
    allL4/BmalayiL42_S8_L001_R1_001.fastq.fastq_output

Check align_summary files:

less tophat_BmalayiL42/align_summary.txt

Reads:
          Input     :   4026458
           Mapped   :   3670848 (91.2% of input)
            of these:    216788 ( 5.9%) have multiple alignments (55 have >20)
91.2% overall read mapping rate.

less tophat_BmalayiL41/align_summary.txt

Reads:
          Input     :   4846418
           Mapped   :   4070247 (84.0% of input)
            of these:    219185 ( 5.4%) have multiple alignments (146 have >20)
84.0% overall read mapping rate.

less tophat_BrugMalWolL4/align_summary.txt

Left reads:
          Input     :   3135854
           Mapped   :   2756298 (87.9% of input)
            of these:    149075 ( 5.4%) have multiple alignments (239 have >20)
Right reads:
          Input     :   3135854
           Mapped   :   2772931 (88.4% of input)
            of these:    150357 ( 5.4%) have multiple alignments (240 have >20)
88.2% overall read mapping rate.

Aligned pairs:   2668575
     of these:    144025 ( 5.4%) have multiple alignments
                    3459 ( 0.1%) are discordant alignments
85.0% concordant pair alignment rate.

less tophat_BmalayiL32/align_summary.txt

Reads:
          Input     :   5871337
           Mapped   :   4143807 (70.6% of input)
            of these:    199828 ( 4.8%) have multiple alignments (53 have >20)
70.6% overall read mapping rate.

less tophat_BmalayiL31/align_summary.txt

Reads:
          Input     :   4236086
           Mapped   :   3649709 (86.2% of input)
            of these:    172054 ( 4.7%) have multiple alignments (46 have >20)
86.2% overall read mapping rate.

less tophat_BrugMalWolL3/align_summary.txt

Left reads:
          Input     :   4126518
           Mapped   :   3631879 (88.0% of input)
            of these:    207381 ( 5.7%) have multiple alignments (549 have >20)
Right reads:
          Input     :   4126518
           Mapped   :   3649102 (88.4% of input)
            of these:    208772 ( 5.7%) have multiple alignments (553 have >20)
88.2% overall read mapping rate.

Aligned pairs:   3532874
     of these:    201958 ( 5.7%) have multiple alignments
                    5250 ( 0.1%) are discordant alignments
85.5% concordant pair alignment rate.

6 - Now run HTseq-count to get number of reads pre 'gene'

htseq-count --format=bam --stranded=no --order=pos \
    /Users/jgrant/Documents/L3\:L4/tophat_BmalayiL42/accepted_hits.bam \
    /Users/jgrant/Documents/L3\:L4/2_Building_Reference/cuffmerge_all/nostrand.gtf > BmalayiL42_counts.txt

htseq-count --format=bam --stranded=no --order=pos \
    /Users/jgrant/Documents/L3\:L4/tophat_BmalayiL41/accepted_hits.bam \
    /Users/jgrant/Documents/L3\:L4/2_Building_Reference/cuffmerge_all/nostrand.gtf > BmalayiL41_counts.txt

htseq-count --format=bam --stranded=no --order=pos \
    /Users/jgrant/Documents/L3\:L4/tophat_BrugMalWolL4/accepted_hits.bam \
    /Users/jgrant/Documents/L3\:L4/2_Building_Reference/cuffmerge_all/nostrand.gtf > BrugMalWolL4_counts.txt

htseq-count --format=bam --stranded=no --order=pos \
    /Users/jgrant/Documents/L3\:L4/tophat_BmalayiL32/accepted_hits.bam \
    /Users/jgrant/Documents/L3\:L4/2_Building_Reference/cuffmerge_all/nostrand.gtf > BmalayiL32_counts.txt

htseq-count --format=bam --stranded=no --order=pos \
    /Users/jgrant/Documents/L3\:L4/tophat_BmalayiL31/accepted_hits.bam \
    /Users/jgrant/Documents/L3\:L4/2_Building_Reference/cuffmerge_all/nostrand.gtf > BmalayiL31_counts.txt

htseq-count --format=bam --stranded=no --order=pos \
    /Users/jgrant/Documents/L3\:L4/tophat_BrugMalWolL3/accepted_hits.bam \
    /Users/jgrant/Documents/L3\:L4/2_Building_Reference/cuffmerge_all/nostrand.gtf > BrugMalWolL3_counts.txt

The output files are text, that look like this:

XLOC_000001    36
XLOC_000002    164
XLOC_000003    144
XLOC_000004    311
XLOC_000005    302
XLOC_000006    221
XLOC_000007    136
XLOC_000008    334
XLOC_000009    1681
XLOC_000010    658
XLOC_000011    1096
XLOC_000012    305

Make one for each sample and then compare.

These HTseq count files are used as input for EdgrR and DESeq.

6 - DE analyses with EdgeR, DESeq and DESeq2

I followed the tutorial here:
http://www.gettinggeneticsdone.com/2012/09/deseq-vs-edger-comparison.html

Look on this page for method and results:

Compare_DESeq_EdgeR

7 - Another EdgeR analysis.

I followed the instructions from the Titus Brown tutorial (http://2015-mar-semimodel.readthedocs.org/en/latest/s-data-analysis.html#running-edger-on-a-data-subset). This gave very different results from all the analyses above, so I am not that confident in it.

library("edgeR")
files <- dir(pattern="*txt$") #Make sure only count files have txt extension in the working directory
data <- readDGE(files, header=FALSE)

print(data)
head(data$counts)

###

group <- c(rep("3",3), rep("4",3))

dge = DGEList(counts=data, group=group)
dge <- estimateCommonDisp(dge)
dge <- estimateTagwiseDisp(dge)

# plot!
et <- exactTest(dge)
etp <- topTags(et, n=100000)
etp$table$logFC = -etp$table$logFC
pdf("L3L4-edgeR-MA-plot.pdf")
plot(
  etp$table$logCPM,
  etp$table$logFC,
  xlim=c(-3, 20), ylim=c(-12, 12), pch=20, cex=.3,
  col = ifelse( etp$table$FDR < .2, "red", "black" ) )
dev.off()

# more plot!
pdf("L3L4-edgeR-MDS.pdf")
plotMDS(dge)
dev.off()

# output CSV
write.csv(etp$table, "L3L4-edgeR.csv")

Renamed the results in L3L4-edgeR.csv with the script provided with the tutorial - add-gene-name-to-diffexpr-csv.py

Results (L3L4-edgeR-named.csv) are uploaded here.

L3L4-edgeR-MA-plot.pdf:
L3L4-edgeR-MA-plot.pdf

Very few sequences are called - many more are found with the other analysis (see above) and those are comparable through the three analyses, EdgeR, DESeq and DESeq2, so I think I believe that one!

8 - Another method NOISeq.

From Ana Conesa's lab - http://bioinfo.cipf.es/noiseq/doku.php

Frustrating - seems to need informatino about the dataset that I don't have - will look further later.

9 - Command line cuffdiff

I was unable to run cuffdiff on my 8GB iMac - got segmentation errors, but I ran it on mason with the following submission script:

#!/bin/bash
#PBS -l nodes=1:ppn=16,walltime=3:00:00:00
#PBS -l vmem=250gb
#PBS -m abe
#PBS -N cuffdiff-jg
##PBS -j oe
##PBS -k o
cd $PBS_O_WORKDIR

module load cufflinks
cuffdiff -o cuffdiffOut nostrand.gtf tophat_BmalayiL31_accepted_hits.bam,tophat_BmalayiL32_accepted_
hits.bam,tophat_BrugMalWolL3_accepted_hits.bam tophat_BmalayiL41_accepted_hits.bam,tophat_BmalayiL42
_accepted_hits.bam,tophat_BrugMalWolL4_accepted_hits.bam

Where nostrand.gtf is the reference transcriptome we made above, and the data files are the tophat mappings of the data back to that reference.

Cuffdiff gives a number of output files. I will move forward with gene_exp.diff which lists genes that are differentially expressed. Other options would be isoform_exp.diff and cds_exp.diff.

10 - Number of genes differentially expressed by different methods

Compare Cuffdiff, EdgeR, DESeq and DESeq2

method # genes called
deseq 750
deseq2 1029
edger 1231
cuffdiff 969

How do these genes overlap?

Genes called as differentially expressed by #
deseq,deseq2,edger,cuffdiff 499
edger 211
cuffdiff 205
deseq,deseq2,edger 201
deseq2,edger,cuffdiff 142
deseq2,edger 94
edger,cuffdiff 65
deseq2 47
deseq2,cuffdiff 37
deseq 15
deseq,edger 11
deseq,cuffdiff 9
deseq,edger,cuffdiff 7
deseq,deseq2,cuffdiff 5
deseq,deseq2 3

11 - Gene ontology analyses

Made a file called all_genes.csv that has all the genes that went into the analyses - XLOC…. - including the information from the previous analysis. Lines look like this:

XLOC_000037,['n';'n';'n';'n'],none
XLOC_000038,['n';'n';'n';'n'],none
XLOC_000039,['y'; 'y'; 'y'; 'y'],deseq;deseq2;edger;cuffdiff
XLOC_000040,['n';'n';'n';'n'],none
XLOC_000041,['y'; 'y'; 'y'; 'y'],deseq;deseq2;edger;cuffdiff
XLOC_000042,['n';'n';'n';'n'],none
XLOC_000043,['n'; 'y'; 'y'; 'n'],deseq2;edger

I used the script add-gene-name-to-diffexpr-csv.py (from https://github.com/ngs-docs/2015-mar-semimodel/blob/master/files/add-gene-name-to-diffexpr-csv.py) to assign WB gene names to the differentially expressed genes, based on the transcript annotation file. Results file uploaded here, called L3L4_DEgenes.xlsx. I will upload the scritp here, too, which is run as:

python <add-gene-name-to-diffexpr-csv.py annotationfile.gtf> <all_genes.csv> > <all_genes_renamed.csv>

new file looks like this

XLOC_000037,WBGene00221274,Transcript:Bm1013a,['n';'n';'n';'n'],none
XLOC_000038,WBGene00225842,Transcript:Bm5581,['n';'n';'n';'n'],none
XLOC_000039,WBGene00223835,Transcript:Bm3574a,['y'; 'y'; 'y'; 'y'],deseq;deseq2;edger;cuffdiff
XLOC_000040,WBGene00227499,CDS:Bm7238a:bm9,['n';'n';'n';'n'],none
XLOC_000041,WBGene00222215,CDS:Bm1954d:bm11,['y'; 'y'; 'y'; 'y'],deseq;deseq2;edger;cuffdiff
XLOC_000042,WBGene00234617,Transcript:Bm14356,['n';'n';'n';'n'],none
XLOC_000043,WBGene00233076,Transcript:Bm12815,['n'; 'y'; 'y'; 'n'],deseq2;edger

Downloaded an association file for brugia to GO from here

ftp://ftp.wormbase.org/pub/wormbase/releases/WS246/ONTOLOGY/gene_association.WS246.wb.b_malayi

These are automated GO associations from WormBase. Next will write a script that takes the WB gene names and adds the associated GO (where applicable) to the all_genes table. Wrote add_GO.py and uploaded it here. This will take gene_association.WS246.wb.b_malayi and all_genes_renamed.csv and will add GOs to the spreadsheet outputting all_genes_renamed_withGO.csv, which we will use to build the files needed for GOSeq.

Next, run GOSeq.

1 - set up files for GOSeq analysis. This can be time consuming and fidgety!

We need 4 files:

  1. list of ALL gene names
  2. list of DIFFERENTIALLY EXPRESSED gene names
  3. transcript lengths to calculate bias
  4. file relating GO numbers to gene names

One and two are easy - go into the output files from your differential expression analysis and copy
a list of all gene names in a file called ALL and a list of gene names that met your condition for
differential expression (eg FDR < 0.05, padj < 0.05) in a file called DEG

3) Calculate lengths of transcripts. GOSeq has some genomes built it. Since we are using an unsupported genome, we need to make this.

library(GenomicRanges)
library(rtracklayer)
GTFfile = "Your GTF File"
GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))
elementMetadata(reducedGTF)$widths <- width(reducedGTF)
calc_length <- function(x) {sum(elementMetadata(x)$widths)}
output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_length))
colnames(output) <- c("Length")
write.csv(output, "TranscriptLenths_new.csv")

Open file. It probably looks like this:

"","XLOC_000001","XLOC_000002","XLOC_000003","XLOC_000004","XLOC_00...
"1",366,4808,6496,4289,995,4178,2288,1089,2058,737,2195,5301,9018,7...

What you want is a file that looks like this:

366
4808
6496
4289
995
4178
2288
1089
2058
.
.
.

There is probably an R way to get this, but I found it easy to do in text wrangler.
Open your file in TextWrangler. Use find/replace to find "," and replace it with "\r"
This will put each gene name and each number on a separate line. Delete all the gene names.
Delete the "1". You will be left with the transcript lengths in the correct order. Save
this file as transcriptlength_noname.txt

4) Make file relating GO numbers to gene names

From previous work, you have a file called all_genes_renamed_withGO.csv that looks like this

XLOC_000002,WBGene00222583,CDS:Bm2322a:bm11,['n';'n';'n';'n'],none,[('GO:0008168', 'F')],Bm2322
XLOC_000003,WBGene00223921,CDS:Bm3660a:bm11,['n';'n';'n';'n'],none,[('GO:0003824', 'F'), ('GO:0008152', 'P')],Bm3660
XLOC_000004,WBGene00223258,CDS:Bm2997e:bm11,['n';'n';'n';'n'],none,no GO reference,no Gene reference
XLOC_000005,WBGene00221267,Transcript:Bm1006,['n';'n';'n';'n'],none,[('GO:0000245', 'P'), ('GO:0005634', 'C')],Bm1006
XLOC_000006,WBGene00221268,CDS:Bm1007b:bm11,['n';'n';'n';'n'],none,no GO reference,no Gene reference
XLOC_000007,WBGene00227036,CDS:Bm6775h:bm11,['n';'n';'n';'n'],none,no GO reference,no Gene reference

What you want is a file that looks like this:

geneID,Gomapping
XLOC_000001,n/a
XLOC_000002,GO:0008168
XLOC_000003,GO:0003824
XLOC_000003,GO:0008152
XLOC_000004,n/a
XLOC_000005,GO:0000245
XLOC_000005,GO:0005634
XLOC_000006,n/a
XLOC_000007,n/a
XLOC_000008,n/a

Note: most genes will have multiple GO numbers and these need to each be on a separate line.

I did this in python with the following code - it will only work if the input file is
formatted precisely like the example above.

import os, re
infile = open('all_genes_renamed_withGO.csv','r')
li = []
di = {}
for line in infile:
gene = line.split(',')[0]
if re.search('no GO reference',line):
GOlist = []
else:
GOlist = re.findall('GO:\d+',line)
outfile = open('GOasscn.csv','w')
outfile.write('geneID,Gomapping\n'
if GOlist == []:
outfile = open('GOasscn.csv','a')
outfile.write(gene + ',n/a\n')
outfile.close()
else:
for GO in GOlist:
outfile = open('GOasscn.csv','a')
outfile.write(gene + ',' + GO + '\n')
outfile.close()

Open the file GOasscn.csv and confirm it is formatted as above.

Now you're ready to run the GOSeq analysis!

Run the GOSeq analysis - the following assumes that the files are named as they are above, and that they are in the working directory

Prepare the data in R:

library("goseq")
ALL <- read.csv("ALL", header=FALSE)
DEG <- read.csv("DEG", header=FALSE)
Transcriptlength <- read.csv("transcriptlength_noname.txt", header=FALSE)
GOterms <- read.csv("GOasscn.csv", header=TRUE)
as.data.frame.matrix(GOterms)
DEG.vector <- c(t(DEG))
ALL.vector<-c(t(ALL))
gene.vector=as.integer(ALL.vector%in%DEG.vector)
names(gene.vector)=ALL.vector
head(gene.vector)
Length.vector <- c(t(Transcriptlength))
pwf=nullp(gene.vector,bias.data=Length.vector)

DataBias.png

Run the goseq analyses

GO.wall=goseq(pwf,gene2cat=GOterms,use_genes_without_cat=TRUE)

Run the goseq analyses on randomly sampled genes

GO.samp=goseq(pwf,gene2cat=GOterms,use_genes_without_cat=TRUE,method="Sampling",repcnt=1000)

Plot to compare the full goseq and sampled goseq

plot(log10(GO.wall[,2]), log10(GO.samp[match(GO.samp[,1],GO.wall[,1]),2]),xlab="log10(Wallenius p-values)", ylab="log10(Sampling p-values)", xlim=c(-3,0), )
abline(0,1,col=3,lty=2)

You can also run goseq without calculating bias - though it isn't recommended

GO.nobias=goseq(pwf,gene2cat=GOterms,use_genes_without_cat=TRUE,method="Hypergeometric")

head(GO.wall)

category over_represented_pvalue under_represented_pvalue numDEInCat numInCat
42 GO:0008152 0.000000e+00 1 7 7
50 GO:0016020 0.000000e+00 1 6 6
51 GO:0016021 1.001676e-13 1 5 5
70 GO:0055085 1.434443e-10 1 4 4
22 GO:0005634 2.692984e-10 1 4 4
20 GO:0005524 2.713416e-10 1 4 4

head(GO.samp)

category over_represented_pvalue under_represented_pvalue numDEInCat numInCat
4 GO:0003677 0.000999001 1 4 4
7 GO:0003824 0.000999001 1 2 2
20 GO:0005524 0.000999001 1 4 4
22 GO:0005634 0.000999001 1 4 4
31 GO:0006355 0.000999001 1 3 3
33 GO:0006506 0.000999001 1 2 2

head(GO.nobias)

category over_represented_pvalue under_represented_pvalue numDEInCat numInCat
42 GO:0008152 2.117354e-18 1 7 7
50 GO:0016020 7.501245e-16 1 6 6
51 GO:0016021 2.606538e-13 1 5 5
4 GO:0003677 8.886820e-11 1 4 4
20 GO:0005524 8.886820e-11 1 4 4
22 GO:0005634 8.886820e-11 1 4 4

Plot sampled v full goseq analyses
Screen%20Shot%202015-03-27%20at%203.27.35%20PM.pngPlot corrected v uncorrected goseq analyses
Screen%20Shot%202015-03-27%20at%203.32.33%20PM.png

To see which genes are enriched in the DE genes:

enriched.GO=GO.wall$category[p.adjust(GO.wall$over_represented_pvalue,method="BH")<.05]

Write out the data for future use

write.csv(GO.wall,"GO.wall.csv")
write.csv(GO.samp,"GO.samp,csv")
write.csv(GO.nobias,"GO.nobias.csv")
write.csv(enriched.GO,"enriched.GO.csv")

Import the GO library to try to make sense of the enriched GOs

source("http://bioconductor.org/biocLite.R")
bioclite("GO.db")
library(GO.db)
for(go in enriched.GO[1:10]){print(GOTERM[[go]] cat("-------—-\n")}

In this case, only two GOs are enriched in the de genes v all genes - they are

  • GO:0005509: calcium ion binding
  • GO:0008152: metabolic process

That doesn't seem that helpful!

I have been comparing all de genes. What if I split up the genes more highly expressed in one stage or the other - that might be more meaningful.

In the differential expression analysis output, the fold change is measured, with a positive number meaning overexpressed in L4 and a negative number overexpressed in L3. I'll go back and repeat the gene ontolgy analysis with these genes separated.

  • Pulled genes out of the spreadsheet
  • Ran enrichment test on each subset (L3 or L4) compared to ALL
  • only one GO for each!

L3
GOID: GO:0005509
Term: calcium ion binding
Ontology: MF
Definition: Interacting selectively and non-covalently with calcium
ions (Ca2+).
Synonym: calcium ion storage activity


L4
GOID: GO:0005615
Term: extracellular space
Ontology: CC
Definition: That part of a multicellular organism outside the cells
proper, usually taken to be outside the plasma membranes, and
occupied by fluid.
Synonym: intercellular space


I want to see if it is better to compare the GO in L3 v L4, rather than against ALL. I will need to make the transcript length file only for genes that are differentially expressed. Tried but it didn't look good - no enriched genes. Maybe because n too small for statistical significance?

12 - KEGG

I want to use KEGG pathways to annotate the differentially expressed genes. Using the file wormbase.WS240.gene_ids.txt I can get gene names for the DE genes, then use KEGG Mapper to see what pathways they are in.

NOTE: four of the L4 de genes (XLOC_003968, XLOC_004473, XLOC_013467, XLOC_002831) had no WB number associated with them - I blasted them and all 4 blasted 100% to gerbil mitochondria! I will remove these and continue with the remaining genes.

From XLOC number to fasta - using a script I wrote called getGeneNameforDEgenes.py (adapted to get fasta, too) that I will upload here. From fasta used blast to assign KEGG info here:

http://www.genome.jp/kaas-bin/kaas_main?mode=partial

Check nucleotide and run on line.

Results - most genes not mapped. Of those that were:

L3 - 49 pathways, most have one or two genes, only one has > 2, which is 'metabolic pathways'
L4 - 90 pathways, 10 have > 2 genes, top is lysosome with 7

Top L4 upregulated pathways
ko number pathway #
04142 Lysosome 7
00500 Starch and sucrose metabolism 4
00630 Glyoxylate and dicarboxylate metabolism 4
00380 Tryptophan metabolism 4
01200 Carbon metabolism 3
00564 Glycerophospholipid metabolism 3
03030 DNA replication 3
04810 Regulation of actin cytoskeleton 3
04111 Cell cycle - yeast 3
04530 Tight junction 3

Update 08/27/2015

This project had previously been used to explore the differential expression pipeline, and I had run GoSeq (an R package for gene ontology analysis) and some Kegg pipeline analyses on the resulting differentially expressed genes. GoSeq was run using the Brugia malayi gene annotation files from WormBase. However, from Kimberly (ude.hcetlac|nekuanav#ude.hcetlac|nekuanav) at WB:

…there isn't any manual GO curation for brugia in WB, so any GO annotations that exist are based on automated mappings between InterPro domains and GO terms.  If the genes on your list encode proteins that don't have a recognizable InterPro domain, then they won't have any GO annotations associated with them.

Running GoSeq with the Brugia association file gave VERY limited results. Only two GOs were enriched statistically in the differentially expressed genes.

• GO:0005509 - calcium ion binding for L3
• GO:0005615 - extracellular space for L4

The earlier project Weam did in collaboration with Tim Geary’s lab used the annotations from C. elegans, a much more carefully annotated genome, in their GO analyses. I have gone back and repeated my earlier analysis using best blast hits to C. elegans and the C. elegans annotation.

A quick outline of this method:
• blast differentially expressed genes against the C. elegans genome downloaded from wormbase.
• Make a list of best hits and extract gene symbols from the blast report.
• Use all gene symbols from the C. elegans genome as the comparative ALL list

Two ways to look at these data: the online database Panther (http://pantherdb.org/) and GoSeq (implemented in R).

1 – Panther. Since C. elegans is one of the genomes available on Panther, analysis is much easier than for Brugia. Results include gene ontologies found in the data for the three main GO categories (molecular function, biological process and cellular component) and also protein class and pathway. Example – protein class for the genes upregulated in L3 (first) and L4 (second):

Panther_protein_class_pie.png
L4 upregulated genes protein class

Note that these are all the proteins in the data – not a statistical analysis of over or underrepresented classes, or a comparison of representation of classes in L3 v L4. For the statistical analysis, need GoSeq.

2- GoSeq (in R) analysis of these data as compared to C. elegans. Here, I use the results from the differential expression analyses already done, and compare them to the genes annotated for C. elegans. So, the first steps are the same – blasting the differentially expressed genes against the C. elegans gemome, and pulling out the best hits and their gene symbols. Then, use GoSeq as before (see wiki for methods). I did this for L3 and got the same results as before when I used GoSeq using the (less well annotated) Brugia genome! So – it sucks that there is only one statistically enriched GO but is good that the results are the same!

Since the results are the same, I think it means that I can use this method (comparing to C. elegans) for the worms without gene annotation files (e.g. B. pahangi in the plant study.)


Update 11/6/2015

Decided to revisit this project after going to TropMed - people were using much less stringent requirements for calling a gene differentially expressed - only LFC > 1.0.

Did everything in R - I will upload the R markdown file and result html (L3L4.html and L3L4.Rmd) here.

Data exploration - from genome data science class. Interesting, but since most work in DESeq2 manages normalization, etc. it doesn't make much difference in the final analysis. Still, interesting to see, and the data look good.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License