geneplot tutorial

This document describes an example of usage of the Python geneplot library. Input data are located in the “tests” folder of the source code at GitHub. File types include:

1- (mandatory) A GFF file (version v.3) of the annotation of the genome. Verify that the file complies with the GFF3 standard specifications. GFF3 validation tools are available at modENCODE-DCC, here.

2- (optional) A VCF file with polymorphisms previously identified, for example with GATK.

Verify that the input VCF file complies with standard specifications. There are several file verifiers available elsewhere. The name of the sample on the VCF file must be set accordingly, i.e., at the end of the header line which describes the names of the columns of the VCF file, typically the last one. See the “w79_w11182w79_UPD” sample name in the next figure of the example data:

##fileformat=VCFv4.1
[HEADER LINES]
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	w79_w11182w79_UPD
2L	13247		C	T	146819	PASS		GT:AD:DP:FT:GQ:PL	1/1:0,3:3:PASS:9:128,9,0
2L	26706		GT	G	113317	PASS		GT:AD:DP:GQ:PL	1/1:0,5:5:15:153,15,0
2L	47355		GT	G	166497	PASS		GT:AD:DP:GQ:PL	1/1:0,5:5:15:179,15,0
2L	58166		T	C	150891	PASS		GT:AD:DP:GQ:PL	1/1:0,10:10:30:0,0,1800
2L	66908		G	A	157170	PASS		GT:AD:DP:FT:GQ:PL	1/1:0,39:7:PASS:24:0,54,273
2L	66910		A	G	155506	PASS		GT:AD:DP:FT:GQ:PL	1/1:10,0:8:PASS:24:348,24,0
2L	67241		A	G	154579	PASS		GT:AD:DP:FT:GQ:PL	1/1:0,6:6:PASS:18:248,18,0
2L	67497		T	C	152461	PASS		GT:AD:DP:FT:GQ:PL	1/1:12,0:4:PASS:12:0,12,1132

Currently, only single-sample VCF files are accepted by the geneplot library. If you have a multi-sample VCF, you split it with VCFtools or BCFtools.

3- (optional) The output file of InterproScan, which identifies several protein signatures on protein sequence data, in TSV format.

Example 1.

We are going to plot protein domains and SNP data on one ARGONAUTE gene (ID gene:FBgn0262739, trancript ID transcript:FBtr0087613) of the fruit fly (Drosophila melanogaster). According to the GFF3 file, the gene has associated the protein ID FBpp0086739. We first instantiate a genome object by using the paths of the GFF3, InterproScan output and VCF files as parameters of the genome() class of the library. The sample VCF file has been downloaded from https://doi.org/10.5281/zenodo.51693. Next, we instantiate a gene object with the gene() class of the genome object previously created, using the sample ID of the SNPs we want to plot and the protein signature database to be used as parameters of the plot() method of the gene object.

import geneplot as gp

#input data Drosophila genome
filegff = '/path-to-data/geneplot/tests/fruit_fly.gff3'
iprfile = '/path-to-data/geneplot/tests/fruit_fly.ipr'
vcffiles = '/path-to-data/geneplot/tests/'

# class instantiation (genome object)
genome_1 = gp.genome(filegff, iprfile=iprfile, vcffiles=vcffiles)

# class instantiation (gene object)
gene_1 = genome_1.gene(mRNAid='transcript:FBtr0087613', proteinid='FBpp0086739')

# plot
gene_1.plot('Pfam', sp='w79_w11182w79_UPD', onlycoding=False)

The method uses Pfam database as default. After the first plotting, a list of available database signatures for that gene is printed, in case additional plots are desired. This is the Output image of the example:

ARGONAUTE gene FBpp0086739

Track 0 (bottom) represents the exon-intron structure (exon=green, UTR=grey). Track 1 represents protein domains. Each domain ID is colored from a color loop sequence. Track 2 (top) represents polymorphism data. Since SNPEff annotation is not provided in the VCF file for this sample, SNPs are plotted in black color.

Example 2.

We are going to plot protein domains and SNP data on one Pectin lyase-like gene of citrus (ID Ciclev10019941m.g.v1.0, trancript Ciclev10019941m.v1.0). According to the GFF3 file, the gene has associated the protein ID Ciclev10019941m. WE follow the same method, but in this case we only want to plot SNPs on coding areas of the gene, and they will be colored since SNPEff annotation is provided in the VCF file.

import geneplot as gp

#input data Citrus genome
filegff = '/path-to-data/geneplot/tests/citrus.gff3'
iprfile = '/path-to-data/geneplot/tests/citrus.ipr'
vcffiles = '/path-to-data/geneplot/tests/'

# class instantiation (genome object)
genome_2 = gp.genome(filegff, iprfile=iprfile, vcffiles=vcffiles)

# class instantiation (gene object)
gene_2 = genome_2.gene(mRNAid='Ciclev10019941m.v1.0', proteinid='Ciclev10019941m')

# plot
gene_2.plot('Pfam', sp='ivia_000', onlycoding=True)
ARGONAUTE gene FBpp0086739