Hello R World! - Introduction to R

R is an open-source language and environment for statistical computing and graphics^[1]. It is used in a wide range of fields, including data mining, machine learning, and bioinformatics. R ships with a broad set of statistical tests, both parametric and non-parametric, for hypothesis testing. Like other programming languages, it has conditional statements, loops, and data structures. R also provides ways to visualize data and analysis results as plots.

R is used especially often in the life sciences, and in bioinformatics in particular. Many data analysis algorithms and methods, developed by researchers around the globe, are available as R packages. A simple hypothesis test such as the t-test can be used to check whether two samples differ, while more complex field data can be analyzed with ANOVA, which reports a p-value along with other statistics. In biology, co-expression networks built from gene expression data can reveal interactions and pathways that give insight into how genes function together^[2]. Correlation networks and weighted correlation networks are very helpful in such cases, and both can be built and drawn in R.

Beyond these simple analyses, R can be used for next-generation sequencing (NGS) analyses. A few examples are the analysis of RNA-Seq^[3], ChIP-Seq^[4], whole-genome bisulfite sequencing^[5], and small RNA-Seq^[6] data, among many others. With packages from the Bioconductor project, all of these analyses can be run on a local machine.
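As a rough illustration of the simple hypothesis tests mentioned above, the sketch below runs a t-test and a one-way ANOVA in base R. The expression values are simulated and the group names (`control`, `treated`, `recovery`) are hypothetical; none of them come from a real experiment.

```r
# Simulated expression values for a single gene (hypothetical data).
set.seed(42)
control <- rnorm(10, mean = 5.0, sd = 0.5)
treated <- rnorm(10, mean = 6.5, sd = 0.5)

# Welch two-sample t-test: do the two group means differ?
tt <- t.test(control, treated)
tt$p.value

# One-way ANOVA across three groups (the third group is also simulated).
expr <- data.frame(
  value = c(control, treated, rnorm(10, mean = 5.2, sd = 0.5)),
  group = factor(rep(c("control", "treated", "recovery"), each = 10))
)
fit <- aov(value ~ group, data = expr)

# The ANOVA table holds the F statistic and its p-value for the group term.
anova_table <- summary(fit)[[1]]
anova_p <- anova_table[["Pr(>F)"]][1]
anova_p
```

Both functions are part of the `stats` package that is loaded with every R session, so no extra installation is needed to try this.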

R can handle data with a large number of rows and columns without silently dropping any of it. According to a BBC news report, row and column restrictions in Microsoft Excel caused Covid-19 data for around 16,000 patients to be lost^[7]. Because these cases were missing from the reported figures, they were in effect counted as negatives, and the contacts of those patients could not be traced promptly. This may have contributed to the spread of Covid-19. This kind of issue can be avoided by using R instead of Excel, since the size of a data set in R is limited mainly by available memory rather than by a fixed number of rows or columns.
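To make the size difference concrete, here is a minimal sketch in base R. It builds a table with more rows than an Excel worksheet can hold, writes it to CSV, and checks that nothing was dropped on the way back in. The column names and the row count are made up for the example; the two constants are the worksheet row limits of the legacy .xls and current .xlsx formats.

```r
# Worksheet row limits of the two Excel file formats.
excel_xls_rows  <- 65536L     # legacy .xls
excel_xlsx_rows <- 1048576L   # current .xlsx

# Hypothetical test records: more rows than either Excel format allows.
n <- 1200000L
records <- data.frame(
  patient_id = seq_len(n),
  positive   = rep(c(TRUE, FALSE), length.out = n)
)
exceeds_xlsx <- nrow(records) > excel_xlsx_rows

# Round trip through a CSV file and verify the row count explicitly,
# so that any truncation raises an error instead of passing unnoticed.
path <- tempfile(fileext = ".csv")
write.csv(records, path, row.names = FALSE)
reloaded <- read.csv(path)
unlink(path)
stopifnot(nrow(reloaded) == nrow(records))
```

The explicit row-count check is the important part: whatever tool is used, comparing the number of records before and after a transfer is a cheap guard against silent data loss.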

R in bioinformatics

As mentioned above, R is a programming language for statistical analysis. Because it is freely available and offers a wide range of statistical tests and plotting options, it is widely used to analyze bioinformatics data.

For example, there are many libraries that can remove contamination and perform quality checks on FASTQ files^[8], analyze next-generation sequencing data, calculate gene expression, perform differential gene expression analysis (DESeq^[9] or edgeR^[10]), and generate heatmaps^[11], histograms, line plots^[12], Venn diagrams^[13], and other relevant plots. Similarly, microarray analyses can be done in R: packages such as limma^[14] calculate fold-change values after reducing the noise in the data. Limma can analyze both microarray and NGS data. Many tools written in R can read files that are generated by various instruments and cannot be read directly as text, such as ab1^[15] or BAM files.

Many researchers use R to test for differences between samples and calculate p-values. Some of the best-known tests used in bioinformatics are the t-test, the z-test, ANOVA, tests of normality, and other parametric and non-parametric tests. Machine learning in R is also used to classify and cluster biological data, and many papers use R to build classifiers for biological data^[16],^[17]. Many studies have used R to build mathematical models that predict trends in a dependent variable from independent variables. With R classification libraries, researchers can do text mining, which saves a lot of time in manual curation. R is also widely used to calculate pairwise and multiple correlations in order to find relationships between samples^[12].
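The pairwise correlations and clustering described above can be sketched with base R alone. The example below is only an illustration: the four gene names and their expression values are simulated, with two genes deliberately driven by one shared signal and two by another. Average-linkage hierarchical clustering is just one of several reasonable choices.

```r
# Simulated expression matrix: 12 samples (rows) x 4 genes (columns).
set.seed(1)
n_samples <- 12
signal_a <- rnorm(n_samples)  # shared signal for the hypothetical "A" genes
signal_b <- rnorm(n_samples)  # shared signal for the hypothetical "B" genes
expr <- cbind(
  geneA1 = signal_a + rnorm(n_samples, sd = 0.1),
  geneA2 = signal_a + rnorm(n_samples, sd = 0.1),
  geneB1 = signal_b + rnorm(n_samples, sd = 0.1),
  geneB2 = signal_b + rnorm(n_samples, sd = 0.1)
)

# Pairwise Pearson correlations between genes (cor() works column-wise).
cor_mat <- cor(expr, method = "pearson")

# Turn correlation into a distance and cluster the genes.
gene_dist <- as.dist(1 - cor_mat)
tree <- hclust(gene_dist, method = "average")

# Cut the tree into two groups of co-expressed genes.
modules <- cutree(tree, k = 2)
modules
```

Calling `plot(tree)` draws the dendrogram, and `heatmap(cor_mat)` gives a quick visual of the correlation matrix. Dedicated packages such as WGCNA^[2] build on the same idea with weighted networks.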

R is also used to create plots for publications. A separate project built on R, Bioconductor, lets users carry out a wide range of bioinformatics data analyses. Its packages host a variety of tools for analyzing microarray, differential gene expression, SNP, flow cytometry, PCR, and other data, and researchers can use them for the analyses mentioned above and much more. For example, these packages can analyze NGS or microarray data end to end without much manual intervention. One of the NCBI resources, Gene Expression Omnibus (GEO)^[18], uses R to analyze the microarray data available in its online database. The analysis also maps probes to genes, which makes it easier for researchers without a bioinformatics background to run their own analyses. Many bioinformatics databases can be accessed and downloaded from through R: Ensembl can be queried with biomaRt, TCGA cancer data can be accessed with TCGAbiolinks, and many other web servers offer similar access.

Beyond that, R is used to identify motifs^[19] in sequences and to perform mutation analysis, in which allele-specific expression can be calculated. R can also be used to create HTML pages with built-in APIs that link a database to the front end with ease. This makes it possible to set up a bioinformatics web server with minimal effort using RStudio and Shiny. R is also used to analyze data from flow cytometry^[20], PCR^[21], and other low-throughput methods, and sequence alignments can be done in R as well^[22]. There are many more applications of R in bioinformatics, as almost any bioinformatics data analysis can be done with an R package.
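As a small, hedged illustration of the allele-specific expression idea, one simple approach (not necessarily what any particular package does) is a binomial test on the read counts at a heterozygous site. The counts below are invented, and the null hypothesis assumes that both alleles are expressed equally.

```r
# Hypothetical RNA-Seq read counts at one heterozygous SNP.
ref_reads <- 78   # reads carrying the reference allele
alt_reads <- 42   # reads carrying the alternate allele

# Exact binomial test against balanced expression (expected fraction 0.5).
ase <- binom.test(ref_reads, ref_reads + alt_reads, p = 0.5)

ase$estimate   # observed reference allele fraction: 78 / 120 = 0.65
ase$p.value    # a small value suggests allelic imbalance at this site
```

In a real analysis this test would be repeated over many sites, so the p-values would need a multiple-testing correction, for example with `p.adjust()`.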

References:

[1] R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria (2019).
[2] Langfelder, P. & Horvath, S. WGCNA: An R package for weighted correlation network analysis. BMC Bioinformatics (2008) doi:10.1186/1471-2105-9-559.
[3] Love, M. I., Anders, S., Kim, V. & Huber, W. RNA-Seq workflow: gene-level exploratory analysis and differential expression. F1000Research (2015) doi:10.12688/f1000research.7035.1.
[4] Na, D., Rouf, M., O’Kane, C. J., Rubinsztein, D. C. & Gsponer, J. NeuroGeM, a knowledgebase of genetic modifiers in neurodegenerative diseases. BMC Med. Genomics 6, (2013).
[5] Akalin, A. et al. MethylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol. (2012) doi:10.1186/gb-2012-13-10-R87.
[6] Qian, K., Auvinen, E., Greco, D. & Auvinen, P. MiRSeqNovel: An R based workflow for analyzing miRNA sequencing data. Mol. Cell. Probes (2012) doi:10.1016/j.mcp.2012.05.002.
[7] BBC. Covid: Test error ‘should never have happened’ – Hancock.
[8] Roser, L. G., Agüero, F. & Sánchez, D. O. FastqCleaner: An interactive Bioconductor application for quality-control, filtering and trimming of FASTQ files. BMC Bioinformatics (2019) doi:10.1186/s12859-019-2961-8.
[9] Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. (2010) doi:10.1186/gb-2010-11-10-r106.
[10] Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (2009) doi:10.1093/bioinformatics/btp616.
[11] Trakhtenberg, E. F. et al. Cell types differ in global coordination of splicing and proportion of highly expressed genes. Sci. Rep. (2016) doi:10.1038/srep32249.
[12] Jha, A., Mehra, M. & Shankar, R. The regulatory epicenter of miRNAs. J. Biosci. 36, 621–638 (2011).
[13] Jha, A., Panzade, G., Pandey, R. & Shankar, R. A legion of potential regulatory sRNAs exists beyond the typical microRNAs microcosm. Nucleic Acids Res. 43, 8713–24 (2015).
[14] Ritchie, M. E. et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. (2015) doi:10.1093/nar/gkv007.
[15] Hill, J. T. et al. Poly peak parser: Method and software for identification of unknown indels using sanger sequencing of polymerase chain reaction products. Dev. Dyn. (2014) doi:10.1002/dvdy.24183.
[16] Ru, Y. et al. The multiMiR R package and database: Integration of microRNA-target interactions along with their disease and drug associations. Nucleic Acids Res. (2014) doi:10.1093/nar/gku631.
[17] Zhang, J. et al. MiRspongeR: An R/Bioconductor package for the identification and analysis of miRNA sponge interaction networks and modules. BMC Bioinformatics (2019) doi:10.1186/s12859-019-2861-y.
[18] Edgar, R. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 30, 207–210 (2002).
[19] Yao, Z. et al. Discriminative motif analysis of high-throughput dataset. Bioinformatics (2014) doi:10.1093/bioinformatics/btt615.
[20] Klinke, D. J. & Brundage, K. M. Scalable analysis of flow cytometry data using R/Bioconductor. Cytom. Part A (2009) doi:10.1002/cyto.a.20746.
[21] Ahmed, M. & Kim, D. R. pcr: An R package for quality assessment, analysis and testing of qPCR data. PeerJ (2018) doi:10.7717/peerj.4473.
[22] Bodenhofer, U., Bonatesta, E., Horejš-Kainrath, C. & Hochreiter, S. Msa: An R package for multiple sequence alignment. Bioinformatics (2015) doi:10.1093/bioinformatics/btv494.