--- title: "Genotyping Many SNPs with multidog()" author: "David Gerard" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Genotyping Many SNPs with multidog()} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width=4.5, fig.height=3.5 ) ``` # Abstract `multidog()` provides support for genotyping many SNPs by iterating `flexdog()` over the SNPs. Support is provided for parallel computing through the [`future`](https://cran.r-project.org/package=future) package. The genotyping method is described in Gerard et al. (2018) and Gerard and Ferrão (2020). # Fit `multidog()` Let's load `updog`, `future`, and the data from Uitdewilligen et al. (2013). ```{r setup} library(future) library(updog) data("uitdewilligen") ``` `uitdewilligen$refmat` is a matrix of reference counts while `uitdewilligen$sizemat` is a matrix of total read counts. In these data, the rows index the individuals and the columns index the loci. But for insertion into `multidog()` we need it the other way around (individuals in the columns and loci in the rows). So we will transpose these matrices. ```{r} refmat <- t(uitdewilligen$refmat) sizemat <- t(uitdewilligen$sizemat) ploidy <- uitdewilligen$ploidy ``` `sizemat` and `refmat` should have the same row and column names. These names identify the loci and the individuals. ```{r} setdiff(colnames(sizemat), colnames(refmat)) setdiff(rownames(sizemat), rownames(refmat)) ``` If we want to do parallel computing, we should check that we have the proper number of cores: ```{r} future::availableCores() ``` Now let's run `multidog()`: ```{r} mout <- multidog(refmat = refmat, sizemat = sizemat, ploidy = ploidy, model = "norm", nc = 2) ``` By default, parallelization is run using ```{r, eval = FALSE} future::plan(future::multisession, workers = nc) ``` if `nc` is greater than 1. You can choose your own evaluation strategy by running `future::plan()` prior to running `multidog()`, and then setting `nc = NA`. This should be particularly useful in higher performance computing environments that use schedulers, where you can control the evaluation strategy through the [`future.batchtools`](https://cran.r-project.org/package=future.batchtools) package. For example, the following will run `multidog()` using forked R processes: ```{r, eval = FALSE} future::plan(future::multicore, workers = 2) mout <- multidog(refmat = refmat, sizemat = sizemat, ploidy = ploidy, model = "norm", nc = NA) ## Shut down parallel workers future::plan(future::sequential) ``` # `multidog()` Output There is a plot method for the output of `multidog()`. ```{r} plot(mout, indices = c(1, 5, 100)) ``` The output of multidog contains two data frame. The first contains properties of the SNPs, such as estimated allele bias and estimated sequencing error rate. ```{r} str(mout$snpdf) ``` The second data frame contains properties of each individual at each SNP, such as the estimated genotypes (`geno`) and the posterior probability of being genotyping correctly (`maxpostprob`). ```{r} str(mout$inddf) ``` You can obtain the columns in `inddf` in matrix form with `format_multidog()`. ```{r} genomat <- format_multidog(mout, varname = "geno") head(genomat) ``` To filter SNPs based on quality metrics (bias, sequencing error rate, overdispersion, etc), you can use `filter_snp()`, which uses the same non-standard evaluation you are used to from `dplyr::filter()`. That is, you can define predicates in terms of the variable names in the `snpdf` data frame from the output of `mupdog()`. It then keeps rows in both `snpdf` and `inddf` where the predicate for a SNP evaluates to `TRUE`. ```{r} dim(mout$snpdf) dim(mout$inddf) mout_cleaned <- filter_snp(mout, prop_mis < 0.05 & bias > exp(-1) & bias < exp(1)) dim(mout_cleaned$snpdf) dim(mout_cleaned$inddf) ``` # References Gerard, David, and Luís Felipe Ventorim Ferrão. "Priors for genotyping polyploids." *Bioinformatics* 36, no. 6 (2020): 1795-1800. . Gerard, David, Luís Felipe Ventorim Ferrão, Antonio Augusto Franco Garcia, and Matthew Stephens. 2018. "Genotyping Polyploids from Messy Sequencing Data." *Genetics* 210 (3). Genetics: 789–807. . Uitdewilligen, Anne-Marie A. AND D’hoop, Jan G. A. M. L. AND Wolters. 2013. "A Next-Generation Sequencing Method for Genotyping-by-Sequencing of Highly Heterozygous Autotetraploid Potato." *PLOS ONE* 8 (5). Public Library of Science: 1–14. .