AGEpy: A Python Package for Computational Biology

Article Information

Franziska Metge¹#, Robert Sehlke¹,²,³#, Ayesha Iqbal¹, Jorge Boucas¹*

¹Bioinformatics, Max Planck Institute for Biology of Ageing, Cologne, 50931, Germany

²Biological Mechanisms of Ageing, Max Planck Institute for Biology of Ageing, Cologne, 50931, Germany

³Cellular Networks and Systems Biology, CECAD, University of Cologne, 50931 Cologne, Germany

# These authors contributed equally to this work.

*Corresponding author: Jorge Boucas, Bioinformatics, Max Planck Institute for Biology of Ageing, Cologne, 50931, Germany.

Received: 03 February 2022; Accepted: 15 February 2022; Published: 04 March 2022

Citation: Franziska Metge, Robert Sehlke, Ayesha Iqbal, Jorge Boucas. AGEpy: A Python Package for Computational Biology. Journal of Bioinformatics and Systems Biology 5 (2022): 58-62.

Abstract

Summary: AGEpy is a Python package focused on the transformation of interpretable data into biological meaning. It is designed to support high-throughput analysis of pre-processed biological data using either local Python-based processing or Python-based API calls to local or remote servers. In this application note we describe its Python modules as well as its command line tools aDiff, abed, blasto, david, and obo2tsv.

Keywords

Cytoscape; Biomart; Data interpretation; Data handling; Data visualization; DAVID; FASTA; Gene ontology; KEGG; GTF

Article Details

1. Introduction

The generation of meaning from data has become a central topic in biological research. Many tools have emerged for transforming raw data into interpretable values, and others for generating biological meaning from such interpretable values. While the former are often used programmatically (e.g. RNA-seq analysis tools like Cuffdiff [1] or DESeq2 [2]), the latter tend to be Graphical User Interface (GUI) based and have more recently evolved to provide access through an Application Programming Interface (API), e.g. KEGG [3] and DAVID [4]. API access has emerged as a reflection of the growing need to process data at scale with tools initially designed for manual analysis through GUIs. APIs offer a way to democratize computational tools with a client-side, serverless solution. Here we introduce AGEpy, a Python package for the automation of data analysis at the data-to-meaning interface. It uses standard data science dependencies like pandas [5], numpy [6], and matplotlib [7]; bioinformatics tools like pybedtools [8]; and API requests to DAVID, BLAST [9, 10], Cytoscape [11] and others. Defaults are set for research focused on the biology of ageing using Caenorhabditis elegans, Drosophila melanogaster, Mus musculus, and Homo sapiens. AGEpy provides command line executables for downstream analysis of differential gene expression results output by Cuffdiff [1] (aDiff), annotation of bed files (abed), online BLAST queries (blasto), DAVID queries (david), and parsing of gene ontology obo files into tsv (obo2tsv).

2. Approach

2.1. Modules

We have divided AGEpy functions across 13 different modules, with most functions making use of standard Python structures as well as data science ones (e.g. numpy arrays and pandas dataframes): bed.py, biom.py, blast.py, cytoscape.py, david.py, fasta.py, go.py, gtf.py, homology.py, kegg.py, meme.py, plots.py, and sam.py. Of note, biom.py (biomart), cytoscape.py, david.py, and kegg.py make use of the respective service’s APIs. homology.py collects homology information from NCBI’s HomoloGene (https://www.ncbi.nlm.nih.gov/homologene) to generate homology tables. plots.py introduces two visualization plots for enrichment analysis (Figures 1 and 2).

Figure 1: Cellplot. A representation of enriched GO terms. Enrichment of the top 5 significant terms is plotted on the x-axis. For each term, the number of genes is shown as well as the log2(fold change) of each respective gene.

Figure 2: Symplot. A representation of changed genes within an enriched term. The top 20 enriched terms are shown. The size of the central bars is directly proportional to the number of genes in the query belonging to the respective term, and the enrichment value is reflected by the color of the bar. The percentage of down- and upregulated genes in each term is shown.

2.2. Executables

2.2.1 aDiff

aDiff is an annotation and mining tool for differential expression results generated with Cuffdiff [1]: differential gene expression, differential isoform expression, differential promoter usage, differential splicing, and differential CDS. It starts by mapping Cuffdiff’s self-generated artificial gene/transcript ids to gene/transcript ids in a provided Ensembl reference genome annotation (as provided to Cuffmerge). Using Ensembl gene ids, it collects biotype and gene ontology information from Ensembl’s biomart [12] server as well as KEGG annotations through the KEGG API [3] for each respective gene. A report of these annotated tables can be obtained in either tsv or Excel format. After discarding non-significant values (as defined in Cuffdiff), it generates one report sheet for each pairwise comparison. Each respective list of significant genes/transcripts is used to query "The Database for Annotation, Visualization and Integrated Discovery (DAVID)" [4] for enrichment in user-defined categories (default: GOTERM_BP_FAT, GOTERM_CC_FAT, GOTERM_MF_FAT, KEGG_PATHWAY, PFAM, PROSITE, GENETIC_ASSOCIATION_DB_DISEASE, OMIM_DISEASE). Protein-protein interaction networks are assembled through the STRING Cytoscape app [13] using Cytoscape’s REST API running locally or remotely. An example of an aDiff output from the raw data provided by [1] can be downloaded from the project's wiki at https://github.com/mpg-age-bioinformatics/AGEpy/wiki.
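
The step of discarding non-significant values and producing one sheet per pairwise comparison can be illustrated with a minimal sketch. This is not aDiff's code: the function name significant_by_comparison and the example rows are hypothetical, and the column names (sample_1, sample_2, significant) are assumed to follow the layout of a Cuffdiff gene_exp.diff file. Only the Python standard library is used.

```python
import csv
import io
from collections import defaultdict

# Made-up rows laid out like a Cuffdiff gene_exp.diff table (subset of columns).
EXAMPLE_DIFF = "\n".join([
    "gene_id\tgene\tsample_1\tsample_2\tlog2(fold_change)\tq_value\tsignificant",
    "XLOC_000001\tgeneA\tyoung\told\t1.80\t0.001\tyes",
    "XLOC_000002\tgeneB\tyoung\told\t0.10\t0.700\tno",
    "XLOC_000003\tgeneC\tyoung\tmutant\t-2.30\t0.004\tyes",
    "XLOC_000004\tgeneD\tyoung\told\t-1.20\t0.020\tyes",
])


def significant_by_comparison(handle):
    """Keep rows flagged 'yes' and group gene names by (sample_1, sample_2)."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["significant"] == "yes":
            groups[(row["sample_1"], row["sample_2"])].append(row["gene"])
    return dict(groups)


# One entry per pairwise comparison, analogous to one report sheet each.
sheets = significant_by_comparison(io.StringIO(EXAMPLE_DIFF))
for (sample_1, sample_2), genes in sorted(sheets.items()):
    print(f"{sample_1}_vs_{sample_2}: {', '.join(genes)}")
```

Each resulting gene list corresponds to what would then be submitted for enrichment analysis; the sketch stops before any API call.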

2.2.2 abed

abed annotates bed files with the names and ids of overlapping genes, and with other defined features (e.g. promoter, exon, UTR).
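
The idea of annotating regions with overlapping genes can be sketched as follows. This is an illustration only and not how abed is implemented (AGEpy relies on pybedtools for interval operations); the Interval type, the annotate function, and all coordinates are hypothetical. Coordinates are assumed to be 0-based and half-open, as in the BED format, so regions that merely touch do not overlap.

```python
from typing import Dict, List, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str


def annotate(regions: List[Interval], genes: List[Interval]) -> Dict[str, List[str]]:
    """Map each region name to the sorted names of genes overlapping it."""
    annotated = {}
    for region in regions:
        hits = [
            gene.name
            for gene in genes
            if gene.chrom == region.chrom
            and gene.start < region.end
            and region.start < gene.end
        ]
        annotated[region.name] = sorted(hits)
    return annotated


regions = [
    Interval("chrI", 100, 200, "peak1"),
    Interval("chrI", 500, 600, "peak2"),
    Interval("chrII", 100, 200, "peak3"),
]
genes = [
    Interval("chrI", 150, 400, "geneA"),
    Interval("chrI", 190, 250, "geneB"),
    Interval("chrI", 600, 700, "geneC"),  # starts where peak2 ends: no overlap
    Interval("chrII", 50, 120, "geneD"),
]
annotation = annotate(regions, genes)
```

The nested scan is quadratic and only meant to make the overlap rule explicit; real datasets call for sorted or indexed interval structures.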

2.2.3 blasto

blasto uses the blast.py module to query remote BLAST servers through their respective REST APIs. For complex queries with multifasta files where different arguments are required for the different fasta entries, arguments can be given in the respective sequence names. Tabular results as well as html can be saved locally.

2.2.4 david

david uses part of the david.py module to query DAVID through its REST API with user-provided gene lists in a tabular format. Given a second column with log2(fold change) values, it will use the CellPlot and SymPlot functions from the plots.py module to display the respective plots (Figures 1 and 2). Optionally, DAVID queries can be performed against a user-provided list of background genes.

2.2.5 obo2tsv

obo2tsv takes annotation files (in obo format) from the Gene Ontology Consortium [14, 15] as input and converts them into a tsv-formatted file with parent and child term information included. Given a gene association file for a selected organism, the generated tabular gene ontology annotation will be merged with the respective organism data. Inputs can be either local files or URLs to the respective files.
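
The conversion of obo stanzas into a table with parent and child terms can be sketched as below. This is a simplified illustration, not obo2tsv's implementation: parse_obo and to_tsv are hypothetical names, only [Term] stanzas and the id, name, and is_a tags are read, and the embedded example is a toy three-term ontology.

```python
EXAMPLE_OBO = """format-version: 1.2

[Term]
id: GO:0008150
name: biological_process

[Term]
id: GO:0009987
name: cellular process
is_a: GO:0008150 ! biological_process

[Term]
id: GO:0007049
name: cell cycle
is_a: GO:0009987 ! cellular process
"""


def parse_obo(text):
    """Return {term_id: {'id', 'name', 'parents'}} from [Term] stanzas."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; anything other than [Term] is ignored.
            current = {"id": "", "name": "", "parents": []} if line == "[Term]" else None
            if current is not None:
                stanzas.append(current)
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            if key in ("id", "name"):
                current[key] = value
            elif key == "is_a":
                # Drop the trailing "! parent name" comment.
                current["parents"].append(value.split(" ! ")[0].strip())
    return {term["id"]: term for term in stanzas}


def to_tsv(terms):
    """Render terms as TSV, deriving children by inverting the is_a links."""
    children = {term_id: [] for term_id in terms}
    for term_id, term in terms.items():
        for parent in term["parents"]:
            children.setdefault(parent, []).append(term_id)
    rows = ["id\tname\tparents\tchildren"]
    for term_id in sorted(terms):
        term = terms[term_id]
        rows.append("\t".join([
            term_id,
            term["name"],
            ",".join(term["parents"]),
            ",".join(sorted(children[term_id])),
        ]))
    return "\n".join(rows)


table = to_tsv(parse_obo(EXAMPLE_OBO))
print(table)
```

Child terms are not stored in the obo file itself, which is why the sketch derives them from the parent links in a second pass.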

3. Conclusion

AGEpy provides Python support for computational biology. Use cases for several of its functions have recently been published [16] and can be explored further through the provided executables. With most default arguments set for research on the biology of ageing using worm, fly, mouse, or human data, it can be useful for the interpretation of ageing research data on both small and large scales. AGEpy modules come with extensive documentation (available under the package’s docs folder and at http://agepy.readthedocs.io). The open source AGEpy Python package is freely available at https://github.com/mpg-age-bioinformatics/AGEpy.

Acknowledgements

We acknowledge all the current and past members of the Bioinformatics Core Facility of the Max Planck Institute for Biology of Ageing.

Funding

This work has been supported by the Max Planck Institute for Biology of Ageing (F.M., A.I., J.B.) and the Cologne Graduate School of Ageing Research (R.S.).

References

1. Trapnell C, Roberts A, Goff L, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature protocols 7 (2012): 562-578.
2. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome biology 15 (2014): 550.
3. Kanehisa M, Furumichi M, Tanabe M, et al. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic acids research 45 (2017): 353-361.
4. Jiao X, Sherman BT, Huang DW, et al. DAVID-WS: a stateful web service to facilitate gene/protein list analysis. Bioinformatics (Oxford, England) 28 (2012): 1805-1806.
5. McKinney W. Data Structures for Statistical Computing in Python. In Proc. of the 9th Python in Science Conf. (2010): 51-56.
6. van der Walt S, Colbert SC, Varoquaux G. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering 13 (2011): 22-30.
7. Hunter JD. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering 9 (2007): 90-95.
8. Dale RK, Pedersen BS, Quinlan AR. Pybedtools: a flexible Python library for manipulating genomic datasets and annotations. Bioinformatics (Oxford, England) 27 (2011): 3423-3424.
9. Camacho C, Coulouris G, Avagyan V, et al. BLAST+: architecture and applications. BMC bioinformatics 10 (2009): 421.
10. NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic acids research 41 (2013): 8-20.
11. Shannon P, Markiel A, Ozier O, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome research 13 (2003): 2498-2504.
12. Smedley D, Haider S, Durinck S, et al. The BioMart community portal: an innovative alternative to large, centralized data repositories. Nucleic acids research 43 (2015): 589-598.
13. Szklarczyk D, Morris JH, Cook H, et al. The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic acids research 45 (2017): 362-368.
14. Ashburner M, Ball CA, Blake JA, et al. Gene Ontology: tool for the unification of biology. Nature Genetics 25 (2000): 25-29.
15. The Gene Ontology Consortium. Expansion of the Gene Ontology knowledgebase and resources. Nucleic acids research 45 (2017): gkw1108+.
16. Boucas J. Integration of ENCODE RNAseq and eCLIP Data Sets. Methods in molecular biology (Clifton, N.J.) 1720 (2018): 111-129.