Goal: To use Weam's L3/L4 miSeq data to streamline and standardize the DE pipeline for RNA-Seq work.
Table of Contents
|
Dependencies for this pipeline
FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Trimmomatic http://www.usadellab.org/cms/index.php?page=trimmomatic
Trinity - best to run on Mason
BowTie 2 (required for TopHat) http://sourceforge.net/projects/bowtie-bio/files/bowtie2/ unzip and cd into directory. Type 'make'. Then copy executables to /usr/local/bin/
TopHat - there are precompiled binaries here http://ccb.jhu.edu/software/tophat/tutorial.shtml or can be compiled from source, but this is a bit more work. Directions can be found at the same site.
Cufflinks http://cole-trapnell-lab.github.io/cufflinks/install/ - download and unzip v.2.2.1 executables for Mac, then copy to /usr/local/bin
HTSeq http://www-huber.embl.de/HTSeq/doc/install.html#installation-on-macos-x download then python setup.py build, sudo python setup.py install
pysam (https://code.google.com/p/pysam/downloads/list) install - type "python setup.py build", "python setup.py install" (Not sure if this is necessary - I think it was a dependency for something else I tried
1 - Get data
Downloaded data from Weam's Galaxy:
Galaxy1-[BrugMalWolL3_S2_L001_R1_001.fastq].fastq
Galaxy2-[BrugMalWolL3_S2_L001_R2_001.fastq].fastq
Galaxy3-[BrugMalWolL4_S3_L001_R1_001.fastq].fastq
Galaxy4-[BrugMalWolL4_S3_L001_R2_001.fastq].fastq
Galaxy5-[BmalayiL31_S5_L001_R1_001.fastq].fastq
Galaxy6-[BmalayiL32_S6_L001_R1_001.fastq].fastq
Galaxy7-[BmalayiL41_S7_L001_R1_001.fastq].fastq
Galaxy8-[BmalayiL42_S8_L001_R1_001.fastq].fastq
2 - Run all through FastQC
- check adapter and overrepresented sequences and put in file for trimmomatic. Here, TruSeq Adapter with Indices 5,6,7 and 12 were found, plus the universal adapter, mostly at the end of the sequence (read through?). Made adapter file L3L4_adapterseqs_0301015.fa which I will upload here.
3 - Ran script runTrimmomatic.py
with these basic trimmomatic parameters:
ILLUMINACLIP:L3L4_adapterseqs_0301015.fa:2:40:30 LEADING:8 TRAILING:10 SLIDINGWINDOW:4:30 MINLEN:25
4 - Build a new reference transcriptome from all data
- following instructions here http://2014-msu-rnaseq.readthedocs.org/en/latest/s-building-a-reference.html
If not already done, build Bowtie2 index files for the genome:
bowtie2-build -f brugia_malayi.PRJNA10729.WBPS1.genomic.fa brugia_malayi.PRJNA10729.WBPS1.genomic
For this, I removed the "Galaxy…[" from the file names, to avoid any confusion
The code below calls tophat2 with
- two threads (-p 2)
- tells it where the genome annotations are (-G /Users/jgrant/Documents/L3\:L4/brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.annotations.gff3)
- make genome index to save time if this needs to be run again (—transcriptome-index value = where to put it)
- save to a folder called tophat_all
- the base name of the bowtie index files - these need to be made in advance - brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.genomic
- list of sequences, first the forward and unpaired in a comma-delimited list, then the reverse reads of the paired output (in the same order as the first list.)
tophat2 -p 2 \
-G /Users/jgrant/Documents/L3\:L4/brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.annotations.gff3 \
--transcriptome-index=/Users/jgrant/Documents/L3\:L4/transcriptome \
-o tophat_all \
brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.genomic \
allL3/BrugMalWolL3_S2_L001_R1_001.fastq.fastq_paired.fq,allL4/BrugMalWolL4_S3_L001_R1_001.fastq.fastq_paired.fq,allL3/BrugMalWolL3_S2_L001_R1_001.fastq.fastq_unpaired.fq,allL3/BrugMalWolL3_S2_L001_R2_001.fastq.fastq_unpaired.fq,allL3/BmalayiL31_S5_L001_R1_001.fastq.fastq_output,allL3/BmalayiL32_S6_L001_R1_001.fastq.fastq_output,allL4/BrugMalWolL4_S3_L001_R1_001.fastq.fastq_unpaired.fq,allL4/BrugMalWolL4_S3_L001_R2_001.fastq.fastq_unpaired.fq,allL4/BmalayiL41_S7_L001_R1_001.fastq.fastq_output,allL4/BmalayiL42_S8_L001_R1_001.fastq.fastq_output allL3/BrugMalWolL3_S2_L001_R2_001.fastq.fastq_paired.fq,allL4/BrugMalWolL4_S3_L001_R2_001.fastq.fastq_paired.fq
This took 05:26:56 hours
Look at the summary in align_summary.txt
Left reads:
Input : 7262372
Mapped : 6379126 (87.8% of input)
of these: 401236 ( 6.3%) have multiple alignments (878 have >20)
Right reads:
Input : 7262372
Mapped : 6412516 (88.3% of input)
of these: 404059 ( 6.3%) have multiple alignments (886 have >20)
Unpaired reads:
Input : 24533513
Mapped : 20587321 (83.9% of input)
of these: 1309058 ( 6.4%) have multiple alignments (823 have >20)
85.5% overall read mapping rate.
Aligned pairs: 6184442
of these: 389143 ( 6.3%) have multiple alignments
8669 ( 0.1%) are discordant alignments
85.0% concordant pair alignment rate.
Run cufflinks:
cufflinks -o cuff_all tophat_all/accepted_hits.bam
StdOut:
You are using Cufflinks v2.2.1, which is the most recent release.
[10:03:07] Inspecting reads and determining fragment length distribution.
> Processed 28665 loci. [*************************] 100%
> Map Properties:
> Normalized Map Mass: 27173004.33
> Raw Map Mass: 27173004.33
> Fragment Length Distribution: Empirical (learned)
> Estimated Mean: 184.47
> Estimated Std Dev: 35.37
[10:09:48] Assembling transcripts and estimating abundances.
> Processed 32125 loci. [*************************] 100%
Run cuffmerge:
ls -1 cuff_all/transcripts.gtf > cuff_list.txt
cuffmerge -g /Users/jgrant/Documents/L3\:L4/brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.annotations.gff3 \
-o cuffmerge_all \
-s /Users/jgrant/Documents/L3\:L4/brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.genomic.fa \
cuff_list.txt
Output is here - cuffmerge_all/merged.gtf
The tutorial here:
http://2014-msu-rnaseq.readthedocs.org/en/latest/s-tophat-round2.html
has a script remove-nostrand.py that removes lines that don't have a mapping strand.
python remove-nostrand.py cuffmerge_all/merged.gtf > cuffmerge_all/nostrand.gtf
This removes one line in my gtf file merged.gtf. Removed line is:
Bmal_v3_scaffold2 Cufflinks exon 718571 719557 . . . gene_id "XLOC_004918"; transcript_id "TCONS_00019074"; exon_number "1"; gene_name "WBGene00221590"; oId "CUFF.2927.1"; nearest_ref "Transcript:Bm1329f"; class_code "i"; tss_id "TSS10616";
Output is nostrand.gtf. This is what I will use going forward, the new transcriptome annotation.
Compare the different gtf files:
cuffcompare cuffmerge_all/nostrand.gtf \
/Users/jgrant/Documents/L3\:L4/brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.annotations.gff3 \
cuffmerge_all/merged.gtf -o compare
Output is:
# Cuffcompare v2.2.1 | Command line was:
#cuffcompare cuffmerge_all/nostrand.gtf /Users/jgrant/Documents/L3:L4/brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.annotations.gff3 cuffmerge_all/merged.gtf -o compare
#
#= Summary for dataset: cuffmerge_all/nostrand.gtf :
# Query mRNAs : 46157 in 14124 loci (42876 multi-exon transcripts)
# (7756 multi-transcript loci, ~3.3 transcripts per locus)
#= Summary for dataset: /Users/jgrant/Documents/L3:L4/brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.annotations.gff3 :
# Query mRNAs : 39895 in 15833 loci (36675 multi-exon transcripts)
# (7879 multi-transcript loci, ~2.5 transcripts per locus)
#= Summary for dataset: cuffmerge_all/merged.gtf :
# Query mRNAs : 46158 in 14125 loci (42876 multi-exon transcripts)
# (7756 multi-transcript loci, ~3.3 transcripts per locus)
Total union super-loci across all input datasets: 14125
(7623 multi-transcript, ~3.5 transcripts per locus)
5 - Now - map individual runs onto the new annotated transcriptome!
For first - L3 paired end read
tophat -p 2 \
-G cuffmerge_all/nostrand.gtf \
--transcriptome-index=/Users/jgrant/Documents/L3\:L4/merged \
-o tophat_BrugMalWolL3 \
brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.genomic \
allL3/BrugMalWolL3_S2_L001_R1_001.fastq.fastq_paired.fq allL3/BrugMalWolL3_S2_L001_R2_001.fastq.fastq_paired.fq
All remaining - each took from 30 min to 1 hour (for paired end runs)
tophat2 -p 2 \
-G cuffmerge_all/nostrand.gtf \
--transcriptome-index=/Users/jgrant/Documents/L3\:L4/merged \
-o tophat_BmalayiL31 \
brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.genomic \
allL3/BmalayiL31_S5_L001_R1_001.fastq.fastq_output
tophat2 -p 2 \
-G cuffmerge_all/nostrand.gtf \
--transcriptome-index=/Users/jgrant/Documents/L3\:L4/merged \
-o tophat_BmalayiL32 \
brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.genomic \
allL3/BmalayiL32_S6_L001_R1_001.fastq.fastq_output
tophat2 -p 2 \
-G cuffmerge_all/nostrand.gtf \
--transcriptome-index=/Users/jgrant/Documents/L3\:L4/merged \
-o tophat_BrugMalWolL4 \
brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.genomic \
allL4/BrugMalWolL4_S3_L001_R1_001.fastq.fastq_paired.fq allL4/BrugMalWolL4_S3_L001_R2_001.fastq.fastq_paired.fq
tophat2 -p 2 \
-G cuffmerge_all/nostrand.gtf \
--transcriptome-index=/Users/jgrant/Documents/L3\:L4/merged \
-o tophat_BmalayiL41 \
brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.genomic \
allL4/BmalayiL41_S7_L001_R1_001.fastq.fastq_output
tophat2 -p 2 \
-G cuffmerge_all/nostrand.gtf \
--transcriptome-index=/Users/jgrant/Documents/L3\:L4/merged \
-o tophat_BmalayiL42 \
brugia_genome_files/brugia_malayi.PRJNA10729.WBPS1.genomic \
allL4/BmalayiL42_S8_L001_R1_001.fastq.fastq_output
Check align_summary files:
less tophat_BmalayiL42/align_summary.txt
Reads:
Input : 4026458
Mapped : 3670848 (91.2% of input)
of these: 216788 ( 5.9%) have multiple alignments (55 have >20)
91.2% overall read mapping rate.
less tophat_BmalayiL41/align_summary.txt
Reads:
Input : 4846418
Mapped : 4070247 (84.0% of input)
of these: 219185 ( 5.4%) have multiple alignments (146 have >20)
84.0% overall read mapping rate.
less tophat_BrugMalWolL4/align_summary.txt
Left reads:
Input : 3135854
Mapped : 2756298 (87.9% of input)
of these: 149075 ( 5.4%) have multiple alignments (239 have >20)
Right reads:
Input : 3135854
Mapped : 2772931 (88.4% of input)
of these: 150357 ( 5.4%) have multiple alignments (240 have >20)
88.2% overall read mapping rate.
Aligned pairs: 2668575
of these: 144025 ( 5.4%) have multiple alignments
3459 ( 0.1%) are discordant alignments
85.0% concordant pair alignment rate.
less tophat_BmalayiL32/align_summary.txt
Reads:
Input : 5871337
Mapped : 4143807 (70.6% of input)
of these: 199828 ( 4.8%) have multiple alignments (53 have >20)
70.6% overall read mapping rate.
less tophat_BmalayiL31/align_summary.txt
Reads:
Input : 4236086
Mapped : 3649709 (86.2% of input)
of these: 172054 ( 4.7%) have multiple alignments (46 have >20)
86.2% overall read mapping rate.
less tophat_BrugMalWolL3/align_summary.txt
Left reads:
Input : 4126518
Mapped : 3631879 (88.0% of input)
of these: 207381 ( 5.7%) have multiple alignments (549 have >20)
Right reads:
Input : 4126518
Mapped : 3649102 (88.4% of input)
of these: 208772 ( 5.7%) have multiple alignments (553 have >20)
88.2% overall read mapping rate.
Aligned pairs: 3532874
of these: 201958 ( 5.7%) have multiple alignments
5250 ( 0.1%) are discordant alignments
85.5% concordant pair alignment rate.
6 - Now run HTseq-count to get number of reads pre 'gene'
htseq-count --format=bam --stranded=no --order=pos \
/Users/jgrant/Documents/L3\:L4/tophat_BmalayiL42/accepted_hits.bam \
/Users/jgrant/Documents/L3\:L4/2_Building_Reference/cuffmerge_all/nostrand.gtf > BmalayiL42_counts.txt
htseq-count --format=bam --stranded=no --order=pos \
/Users/jgrant/Documents/L3\:L4/tophat_BmalayiL41/accepted_hits.bam \
/Users/jgrant/Documents/L3\:L4/2_Building_Reference/cuffmerge_all/nostrand.gtf > BmalayiL41_counts.txt
htseq-count --format=bam --stranded=no --order=pos \
/Users/jgrant/Documents/L3\:L4/tophat_BrugMalWolL4/accepted_hits.bam \
/Users/jgrant/Documents/L3\:L4/2_Building_Reference/cuffmerge_all/nostrand.gtf > BrugMalWolL4_counts.txt
htseq-count --format=bam --stranded=no --order=pos \
/Users/jgrant/Documents/L3\:L4/tophat_BmalayiL32/accepted_hits.bam \
/Users/jgrant/Documents/L3\:L4/2_Building_Reference/cuffmerge_all/nostrand.gtf > BmalayiL32_counts.txt
htseq-count --format=bam --stranded=no --order=pos \
/Users/jgrant/Documents/L3\:L4/tophat_BmalayiL31/accepted_hits.bam \
/Users/jgrant/Documents/L3\:L4/2_Building_Reference/cuffmerge_all/nostrand.gtf > BmalayiL31_counts.txt
htseq-count --format=bam --stranded=no --order=pos \
/Users/jgrant/Documents/L3\:L4/tophat_BrugMalWolL3/accepted_hits.bam \
/Users/jgrant/Documents/L3\:L4/2_Building_Reference/cuffmerge_all/nostrand.gtf > BrugMalWolL3_counts.txt
The output files are text, that look like this:
XLOC_000001 36
XLOC_000002 164
XLOC_000003 144
XLOC_000004 311
XLOC_000005 302
XLOC_000006 221
XLOC_000007 136
XLOC_000008 334
XLOC_000009 1681
XLOC_000010 658
XLOC_000011 1096
XLOC_000012 305
Make one for each sample and then compare.
These HTseq count files are used as input for EdgrR and DESeq.
6 - DE analyses with EdgeR, DESeq and DESeq2
I followed the tutorial here:
http://www.gettinggeneticsdone.com/2012/09/deseq-vs-edger-comparison.html
Look on this page for method and results:
7 - Another EdgeR analysis.
I followed the instructions from the Titus Brown tutorial (http://2015-mar-semimodel.readthedocs.org/en/latest/s-data-analysis.html#running-edger-on-a-data-subset). This gave very different results from all the analyses above, so I am not that confident in it.
library("edgeR")
files <- dir(pattern="*txt$") #Make sure only count files have txt extension in the working directory
data <- readDGE(files, header=FALSE)
print(data)
head(data$counts)
###
group <- c(rep("3",3), rep("4",3))
dge = DGEList(counts=data, group=group)
dge <- estimateCommonDisp(dge)
dge <- estimateTagwiseDisp(dge)
# plot!
et <- exactTest(dge)
etp <- topTags(et, n=100000)
etp$table$logFC = -etp$table$logFC
pdf("L3L4-edgeR-MA-plot.pdf")
plot(
etp$table$logCPM,
etp$table$logFC,
xlim=c(-3, 20), ylim=c(-12, 12), pch=20, cex=.3,
col = ifelse( etp$table$FDR < .2, "red", "black" ) )
dev.off()
# more plot!
pdf("L3L4-edgeR-MDS.pdf")
plotMDS(dge)
dev.off()
# output CSV
write.csv(etp$table, "L3L4-edgeR.csv")
Renamed the results in L3L4-edgeR.csv with the script provided with the tutorial - add-gene-name-to-diffexpr-csv.py
Results (L3L4-edgeR-named.csv) are uploaded here.
L3L4-edgeR-MA-plot.pdf:Very few sequences are called - many more are found with the other analysis (see above) and those are comparable through the three analyses, EdgeR, DESeq and DESeq2, so I think I believe that one!
8 - Another method NOISeq.
From Ana Conesa's lab - http://bioinfo.cipf.es/noiseq/doku.php
Frustrating - seems to need informatino about the dataset that I don't have - will look further later.
9 - Command line cuffdiff
I was unable to run cuffdiff on my 8GB iMac - got segmentation errors, but I ran it on mason with the following submission script:
#!/bin/bash
#PBS -l nodes=1:ppn=16,walltime=3:00:00:00
#PBS -l vmem=250gb
#PBS -m abe
#PBS -N cuffdiff-jg
##PBS -j oe
##PBS -k o
cd $PBS_O_WORKDIR
module load cufflinks
cuffdiff -o cuffdiffOut nostrand.gtf tophat_BmalayiL31_accepted_hits.bam,tophat_BmalayiL32_accepted_
hits.bam,tophat_BrugMalWolL3_accepted_hits.bam tophat_BmalayiL41_accepted_hits.bam,tophat_BmalayiL42
_accepted_hits.bam,tophat_BrugMalWolL4_accepted_hits.bam
Where nostrand.gtf is the reference transcriptome we made above, and the data files are the tophat mappings of the data back to that reference.
Cuffdiff gives a number of output files. I will move forward with gene_exp.diff which lists genes that are differentially expressed. Other options would be isoform_exp.diff and cds_exp.diff.
10 - Number of genes differentially expressed by different methods
Compare Cuffdiff, EdgeR, DESeq and DESeq2
method | # genes called |
deseq | 750 |
deseq2 | 1029 |
edger | 1231 |
cuffdiff | 969 |
How do these genes overlap?
Genes called as differentially expressed by | # |
deseq,deseq2,edger,cuffdiff | 499 |
edger | 211 |
cuffdiff | 205 |
deseq,deseq2,edger | 201 |
deseq2,edger,cuffdiff | 142 |
deseq2,edger | 94 |
edger,cuffdiff | 65 |
deseq2 | 47 |
deseq2,cuffdiff | 37 |
deseq | 15 |
deseq,edger | 11 |
deseq,cuffdiff | 9 |
deseq,edger,cuffdiff | 7 |
deseq,deseq2,cuffdiff | 5 |
deseq,deseq2 | 3 |
11 - Gene ontology analyses
Made a file called all_genes.csv that has all the genes that went into the analyses - XLOC…. - including the information from the previous analysis. Lines look like this:
XLOC_000037,['n';'n';'n';'n'],none
XLOC_000038,['n';'n';'n';'n'],none
XLOC_000039,['y'; 'y'; 'y'; 'y'],deseq;deseq2;edger;cuffdiff
XLOC_000040,['n';'n';'n';'n'],none
XLOC_000041,['y'; 'y'; 'y'; 'y'],deseq;deseq2;edger;cuffdiff
XLOC_000042,['n';'n';'n';'n'],none
XLOC_000043,['n'; 'y'; 'y'; 'n'],deseq2;edger
I used the script add-gene-name-to-diffexpr-csv.py (from https://github.com/ngs-docs/2015-mar-semimodel/blob/master/files/add-gene-name-to-diffexpr-csv.py) to assign WB gene names to the differentially expressed genes, based on the transcript annotation file. Results file uploaded here, called L3L4_DEgenes.xlsx. I will upload the scritp here, too, which is run as:
python <add-gene-name-to-diffexpr-csv.py annotationfile.gtf> <all_genes.csv> > <all_genes_renamed.csv>
new file looks like this
XLOC_000037,WBGene00221274,Transcript:Bm1013a,['n';'n';'n';'n'],none
XLOC_000038,WBGene00225842,Transcript:Bm5581,['n';'n';'n';'n'],none
XLOC_000039,WBGene00223835,Transcript:Bm3574a,['y'; 'y'; 'y'; 'y'],deseq;deseq2;edger;cuffdiff
XLOC_000040,WBGene00227499,CDS:Bm7238a:bm9,['n';'n';'n';'n'],none
XLOC_000041,WBGene00222215,CDS:Bm1954d:bm11,['y'; 'y'; 'y'; 'y'],deseq;deseq2;edger;cuffdiff
XLOC_000042,WBGene00234617,Transcript:Bm14356,['n';'n';'n';'n'],none
XLOC_000043,WBGene00233076,Transcript:Bm12815,['n'; 'y'; 'y'; 'n'],deseq2;edger
Downloaded an association file for brugia to GO from here
ftp://ftp.wormbase.org/pub/wormbase/releases/WS246/ONTOLOGY/gene_association.WS246.wb.b_malayi
These are automated GO associations from WormBase. Next will write a script that takes the WB gene names and adds the associated GO (where applicable) to the all_genes table. Wrote add_GO.py and uploaded it here. This will take gene_association.WS246.wb.b_malayi and all_genes_renamed.csv and will add GOs to the spreadsheet outputting all_genes_renamed_withGO.csv, which we will use to build the files needed for GOSeq.
Next, run GOSeq.
1 - set up files for GOSeq analysis. This can be time consuming and fidgety!
We need 4 files:
- list of ALL gene names
- list of DIFFERENTIALLY EXPRESSED gene names
- transcript lengths to calculate bias
- file relating GO numbers to gene names
One and two are easy - go into the output files from your differential expression analysis and copy
a list of all gene names in a file called ALL and a list of gene names that met your condition for
differential expression (eg FDR < 0.05, padj < 0.05) in a file called DEG
3) Calculate lengths of transcripts. GOSeq has some genomes built it. Since we are using an unsupported genome, we need to make this.
library(GenomicRanges)
library(rtracklayer)
GTFfile = "Your GTF File"
GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))
elementMetadata(reducedGTF)$widths <- width(reducedGTF)
calc_length <- function(x) {sum(elementMetadata(x)$widths)}
output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_length))
colnames(output) <- c("Length")
write.csv(output, "TranscriptLenths_new.csv")
Open file. It probably looks like this:
"","XLOC_000001","XLOC_000002","XLOC_000003","XLOC_000004","XLOC_00...
"1",366,4808,6496,4289,995,4178,2288,1089,2058,737,2195,5301,9018,7...
What you want is a file that looks like this:
366
4808
6496
4289
995
4178
2288
1089
2058
.
.
.
There is probably an R way to get this, but I found it easy to do in text wrangler.
Open your file in TextWrangler. Use find/replace to find "," and replace it with "\r"
This will put each gene name and each number on a separate line. Delete all the gene names.
Delete the "1". You will be left with the transcript lengths in the correct order. Save
this file as transcriptlength_noname.txt
4) Make file relating GO numbers to gene names
From previous work, you have a file called all_genes_renamed_withGO.csv that looks like this
XLOC_000002,WBGene00222583,CDS:Bm2322a:bm11,['n';'n';'n';'n'],none,[('GO:0008168', 'F')],Bm2322
XLOC_000003,WBGene00223921,CDS:Bm3660a:bm11,['n';'n';'n';'n'],none,[('GO:0003824', 'F'), ('GO:0008152', 'P')],Bm3660
XLOC_000004,WBGene00223258,CDS:Bm2997e:bm11,['n';'n';'n';'n'],none,no GO reference,no Gene reference
XLOC_000005,WBGene00221267,Transcript:Bm1006,['n';'n';'n';'n'],none,[('GO:0000245', 'P'), ('GO:0005634', 'C')],Bm1006
XLOC_000006,WBGene00221268,CDS:Bm1007b:bm11,['n';'n';'n';'n'],none,no GO reference,no Gene reference
XLOC_000007,WBGene00227036,CDS:Bm6775h:bm11,['n';'n';'n';'n'],none,no GO reference,no Gene reference
What you want is a file that looks like this:
geneID,Gomapping
XLOC_000001,n/a
XLOC_000002,GO:0008168
XLOC_000003,GO:0003824
XLOC_000003,GO:0008152
XLOC_000004,n/a
XLOC_000005,GO:0000245
XLOC_000005,GO:0005634
XLOC_000006,n/a
XLOC_000007,n/a
XLOC_000008,n/a
Note: most genes will have multiple GO numbers and these need to each be on a separate line.
I did this in python with the following code - it will only work if the input file is
formatted precisely like the example above.
import os, re
infile = open('all_genes_renamed_withGO.csv','r')
li = []
di = {}
for line in infile:
gene = line.split(',')[0]
if re.search('no GO reference',line):
GOlist = []
else:
GOlist = re.findall('GO:\d+',line)
outfile = open('GOasscn.csv','w')
outfile.write('geneID,Gomapping\n'
if GOlist == []:
outfile = open('GOasscn.csv','a')
outfile.write(gene + ',n/a\n')
outfile.close()
else:
for GO in GOlist:
outfile = open('GOasscn.csv','a')
outfile.write(gene + ',' + GO + '\n')
outfile.close()
Open the file GOasscn.csv and confirm it is formatted as above.
Now you're ready to run the GOSeq analysis!
Run the GOSeq analysis - the following assumes that the files are named as they are above, and that they are in the working directory
Prepare the data in R:
library("goseq")
ALL <- read.csv("ALL", header=FALSE)
DEG <- read.csv("DEG", header=FALSE)
Transcriptlength <- read.csv("transcriptlength_noname.txt", header=FALSE)
GOterms <- read.csv("GOasscn.csv", header=TRUE)
as.data.frame.matrix(GOterms)
DEG.vector <- c(t(DEG))
ALL.vector<-c(t(ALL))
gene.vector=as.integer(ALL.vector%in%DEG.vector)
names(gene.vector)=ALL.vector
head(gene.vector)
Length.vector <- c(t(Transcriptlength))
pwf=nullp(gene.vector,bias.data=Length.vector)
Run the goseq analyses
GO.wall=goseq(pwf,gene2cat=GOterms,use_genes_without_cat=TRUE)
Run the goseq analyses on randomly sampled genes
GO.samp=goseq(pwf,gene2cat=GOterms,use_genes_without_cat=TRUE,method="Sampling",repcnt=1000)
Plot to compare the full goseq and sampled goseq
plot(log10(GO.wall[,2]), log10(GO.samp[match(GO.samp[,1],GO.wall[,1]),2]),xlab="log10(Wallenius p-values)", ylab="log10(Sampling p-values)", xlim=c(-3,0), )
abline(0,1,col=3,lty=2)
You can also run goseq without calculating bias - though it isn't recommended
GO.nobias=goseq(pwf,gene2cat=GOterms,use_genes_without_cat=TRUE,method="Hypergeometric")
head(GO.wall)
category over_represented_pvalue under_represented_pvalue numDEInCat numInCat
42 GO:0008152 0.000000e+00 1 7 7
50 GO:0016020 0.000000e+00 1 6 6
51 GO:0016021 1.001676e-13 1 5 5
70 GO:0055085 1.434443e-10 1 4 4
22 GO:0005634 2.692984e-10 1 4 4
20 GO:0005524 2.713416e-10 1 4 4
head(GO.samp)
category over_represented_pvalue under_represented_pvalue numDEInCat numInCat
4 GO:0003677 0.000999001 1 4 4
7 GO:0003824 0.000999001 1 2 2
20 GO:0005524 0.000999001 1 4 4
22 GO:0005634 0.000999001 1 4 4
31 GO:0006355 0.000999001 1 3 3
33 GO:0006506 0.000999001 1 2 2
head(GO.nobias)
category over_represented_pvalue under_represented_pvalue numDEInCat numInCat
42 GO:0008152 2.117354e-18 1 7 7
50 GO:0016020 7.501245e-16 1 6 6
51 GO:0016021 2.606538e-13 1 5 5
4 GO:0003677 8.886820e-11 1 4 4
20 GO:0005524 8.886820e-11 1 4 4
22 GO:0005634 8.886820e-11 1 4 4
Plot corrected v uncorrected goseq analyses
To see which genes are enriched in the DE genes:
enriched.GO=GO.wall$category[p.adjust(GO.wall$over_represented_pvalue,method="BH")<.05]
Write out the data for future use
write.csv(GO.wall,"GO.wall.csv")
write.csv(GO.samp,"GO.samp,csv")
write.csv(GO.nobias,"GO.nobias.csv")
write.csv(enriched.GO,"enriched.GO.csv")
Import the GO library to try to make sense of the enriched GOs
source("http://bioconductor.org/biocLite.R")
bioclite("GO.db")
library(GO.db)
for(go in enriched.GO[1:10]){print(GOTERM[[go]] cat("-------—-\n")}
In this case, only two GOs are enriched in the de genes v all genes - they are
- GO:0005509: calcium ion binding
- GO:0008152: metabolic process
That doesn't seem that helpful!
I have been comparing all de genes. What if I split up the genes more highly expressed in one stage or the other - that might be more meaningful.
In the differential expression analysis output, the fold change is measured, with a positive number meaning overexpressed in L4 and a negative number overexpressed in L3. I'll go back and repeat the gene ontolgy analysis with these genes separated.
- Pulled genes out of the spreadsheet
- Ran enrichment test on each subset (L3 or L4) compared to ALL
- only one GO for each!
L3
GOID: GO:0005509
Term: calcium ion binding
Ontology: MF
Definition: Interacting selectively and non-covalently with calcium
ions (Ca2+).
Synonym: calcium ion storage activity
L4
GOID: GO:0005615
Term: extracellular space
Ontology: CC
Definition: That part of a multicellular organism outside the cells
proper, usually taken to be outside the plasma membranes, and
occupied by fluid.
Synonym: intercellular space
I want to see if it is better to compare the GO in L3 v L4, rather than against ALL. I will need to make the transcript length file only for genes that are differentially expressed. Tried but it didn't look good - no enriched genes. Maybe because n too small for statistical significance?
12 - KEGG
I want to use KEGG pathways to annotate the differentially expressed genes. Using the file wormbase.WS240.gene_ids.txt I can get gene names for the DE genes, then use KEGG Mapper to see what pathways they are in.
NOTE: four of the L4 de genes (XLOC_003968, XLOC_004473, XLOC_013467, XLOC_002831) had no WB number associated with them - I blasted them and all 4 blasted 100% to gerbil mitochondria! I will remove these and continue with the remaining genes.
From XLOC number to fasta - using a script I wrote called getGeneNameforDEgenes.py (adapted to get fasta, too) that I will upload here. From fasta used blast to assign KEGG info here:
http://www.genome.jp/kaas-bin/kaas_main?mode=partial
Check nucleotide and run on line.
Results - most genes not mapped. Of those that were:
L3 - 49 pathways, most have one or two genes, only one has > 2, which is 'metabolic pathways'
L4 - 90 pathways, 10 have > 2 genes, top is lysosome with 7
ko number | pathway | # |
04142 | Lysosome | 7 |
00500 | Starch and sucrose metabolism | 4 |
00630 | Glyoxylate and dicarboxylate metabolism | 4 |
00380 | Tryptophan metabolism | 4 |
01200 | Carbon metabolism | 3 |
00564 | Glycerophospholipid metabolism | 3 |
03030 | DNA replication | 3 |
04810 | Regulation of actin cytoskeleton | 3 |
04111 | Cell cycle - yeast | 3 |
04530 | Tight junction | 3 |
Update 08/27/2015
This project had previously been used to explore the differential expression pipeline, and I had run GoSeq (an R package for gene ontology analysis) and some Kegg pipeline analyses on the resulting differentially expressed genes. GoSeq was run using the Brugia malayi gene annotation files from WormBase. However, from Kimberly (ude.hcetlac|nekuanav#ude.hcetlac|nekuanav) at WB:
…there isn't any manual GO curation for brugia in WB, so any GO annotations that exist are based on automated mappings between InterPro domains and GO terms. If the genes on your list encode proteins that don't have a recognizable InterPro domain, then they won't have any GO annotations associated with them.
Running GoSeq with the Brugia association file gave VERY limited results. Only two GOs were enriched statistically in the differentially expressed genes.
• GO:0005509 - calcium ion binding for L3
• GO:0005615 - extracellular space for L4
The earlier project Weam did in collaboration with Tim Geary’s lab used the annotations from C. elegans, a much more carefully annotated genome, in their GO analyses. I have gone back and repeated my earlier analysis using best blast hits to C. elegans and the C. elegans annotation.
A quick outline of this method:
• blast differentially expressed genes against the C. elegans genome downloaded from wormbase.
• Make a list of best hits and extract gene symbols from the blast report.
• Use all gene symbols from the C. elegans genome as the comparative ALL list
Two ways to look at these data: the online database Panther (http://pantherdb.org/) and GoSeq (implemented in R).
1 – Panther. Since C. elegans is one of the genomes available on Panther, analysis is much easier than for Brugia. Results include gene ontologies found in the data for the three main GO categories (molecular function, biological process and cellular component) and also protein class and pathway. Example – protein class for the genes upregulated in L3 (first) and L4 (second):
L4 upregulated genes protein class
Note that these are all the proteins in the data – not a statistical analysis of over or underrepresented classes, or a comparison of representation of classes in L3 v L4. For the statistical analysis, need GoSeq.
2- GoSeq (in R) analysis of these data as compared to C. elegans. Here, I use the results from the differential expression analyses already done, and compare them to the genes annotated for C. elegans. So, the first steps are the same – blasting the differentially expressed genes against the C. elegans gemome, and pulling out the best hits and their gene symbols. Then, use GoSeq as before (see wiki for methods). I did this for L3 and got the same results as before when I used GoSeq using the (less well annotated) Brugia genome! So – it sucks that there is only one statistically enriched GO but is good that the results are the same!
Since the results are the same, I think it means that I can use this method (comparing to C. elegans) for the worms without gene annotation files (e.g. B. pahangi in the plant study.)
Update 11/6/2015
Decided to revisit this project after going to TropMed - people were using much less stringent requirements for calling a gene differentially expressed - only LFC > 1.0.
Did everything in R - I will upload the R markdown file and result html (L3L4.html and L3L4.Rmd) here.
Data exploration - from genome data science class. Interesting, but since most work in DESeq2 manages normalization, etc. it doesn't make much difference in the final analysis. Still, interesting to see, and the data look good.