Title: | Poly-Omic Prediction of Complex TRaits |
---|---|
Description: | It provides functions to generate a correlation matrix from a genetic dataset and to use this matrix to predict the phenotype of an individual by using the phenotypes of the remaining individuals through kriging. Kriging is a geostatistical method for optimal prediction or best unbiased linear prediction. It consists of predicting the value of a variable at an unobserved location as a weighted sum of the variable at observed locations. Intuitively, it works as a reverse linear regression: instead of computing correlation (univariate regression coefficients are simply scaled correlation) between a dependent variable Y and independent variables X, it uses known correlation between X and Y to predict Y. |
Authors: | Hae Kyung Im, Heather E. Wheeler, Keston Aquino Michaels, Vassily Trubetskoy |
Maintainer: | Hae Kyung Im <[email protected]> |
License: | GPL (>=3) |
Version: | 1.4.0 |
Built: | 2025-03-12 05:01:12 UTC |
Source: | https://github.com/hakyimlab/omickriging |
This is a flexible cross validation routine which wraps the Omic Kriging
calculation. The user can specify the size of the test set, all the way to
"Leave One Out" cross validation. Additionally, all relevant parameters in the
okriging
function are exposed. This function uses the doParallel
package to distribute computation over multiple cores. If the phenotype is
case/control, a ROCR AUC and GLM analysis is run and the results printed to screen.
krigr_cross_validation(cor.list, pheno.df, pheno.id = 1, h2.vec, covar.mat = NULL, nfold = 10, ncore = "all", verbose = FALSE, ...)
krigr_cross_validation(cor.list, pheno.df, pheno.id = 1, h2.vec, covar.mat = NULL, nfold = 10, ncore = "all", verbose = FALSE, ...)
cor.list |
A list of correlation matrices used in Kriging. rownames and colnames of cor should be IID list and include idtest and idtrain. |
pheno.df |
A data frame with rownames set as sample IDs and a column containing phenotype data. |
pheno.id |
The name of the column in pheno which contains phenotype data to test. |
h2.vec |
has weights for each RM relatednes matrix |
covar.mat |
Data frame of covariates with rownames() set to sample IDs. |
nfold |
Select the number of cross validation rounds to run. The value "LOOCV" will run one round of cross validation for each sample in your dataset. The value "ncore" will set the test set size such that a single round runs on each core specified in the ncore option. Any numeric value will be set to the test size. Default runs 10 rounds of cross validation. |
ncore |
The number of cores available to distribute computaition across If a numeric value is supplied, that number of cores is registered. If the value "all" is supplied, all available cores are used. |
verbose |
Report rounds on cross validation on standard out. |
... |
Optional and unnamed arguments. |
A dataframe with three columns: sample ID, observed phenotype Ytest, and predicted phenotype Ypred
This function loads a file into a data frame. This file should contain one row per sample in your study, and one column for each covariate and phenotype of interest. Additionally, it requires a header with "IID" for the column of sample IDs, and a unique name for each phenotype and covariate.
load_sample_data(phenoFile, main.pheno)
load_sample_data(phenoFile, main.pheno)
phenoFile |
File path to the phenotype/covariate file. |
main.pheno |
Column name of the main phenotype of interest. |
A data frame with dimensions (# of samples) x (# of phenotypes/covar)
This function computes a gene expression correlation matrix given a file of transcript expression levels for each sample in the study. It returns a correlation matrix with rownames and colnames as sample IDs.
make_GXM(expFile = NULL, gxmFilePrefix = NULL, idfile = NULL)
make_GXM(expFile = NULL, gxmFilePrefix = NULL, idfile = NULL)
expFile |
Path to gene expression file. |
gxmFilePrefix |
File path prefixes for outputting GCTA style binary correlation matrices. |
idfile |
Path to file containing family IDs and sample IDs with header FID and IID. |
Returns a correlation matrix of (N-samples x N-samples), with rownames and colnames as sample IDs.
## load gene expression values from vignette expressionFile <- system.file(package = "OmicKriging", "doc/vignette_data/ig_gene_subset.txt.gz") ## compute correlation matrix geneCorrelationMatrix <- make_GXM(expressionFile)
## load gene expression values from vignette expressionFile <- system.file(package = "OmicKriging", "doc/vignette_data/ig_gene_subset.txt.gz") ## compute correlation matrix geneCorrelationMatrix <- make_GXM(expressionFile)
A simple wrapper around the irlba() function which computes a partial SVD
efficiently. This function's run time depends on the number of eigenvectors
requested but scales well. Use this function to generate covariates for use
with the okriging
or krigr_cross_validation
functions.
make_PCs_irlba(X, n.top = 2)
make_PCs_irlba(X, n.top = 2)
X |
A correlation matrix. |
n.top |
Number of top principal compenents to return |
A matrix of Principal Components of dimension (# of samples) x (n.top). As expected, eigenvectors are ordered by eigenvalue. Rownames are given as sample IDs.
library(irlba)
## compute PC's using the gene expression correlation matrix from vignette ## load gene expression values from vignette expressionFile <- system.file(package = "OmicKriging", "doc/vignette_data/ig_gene_subset.txt.gz") ## compute correlation matrix geneCorrelationMatrix <- make_GXM(expressionFile) ## find top ten PC's of this matrix using SVD topPcs <- make_PCs_irlba(geneCorrelationMatrix, n.top=10)
## compute PC's using the gene expression correlation matrix from vignette ## load gene expression values from vignette expressionFile <- system.file(package = "OmicKriging", "doc/vignette_data/ig_gene_subset.txt.gz") ## compute correlation matrix geneCorrelationMatrix <- make_GXM(expressionFile) ## find top ten PC's of this matrix using SVD topPcs <- make_PCs_irlba(geneCorrelationMatrix, n.top=10)
A simple wrapper around the base R svd() function which returns the top N
eigenvectors of a matrix. Use this function to generate covariates for use
with the okriging
or krigr_cross_validation
functions. This wrapper preserves the rownames of the original matrix.
make_PCs_svd(X, n.top = 2)
make_PCs_svd(X, n.top = 2)
X |
A correlation matrix. |
n.top |
Number of top principal compenents to return |
A matrix of Principal Components of dimension (# of samples) x (n.top). As expected, eigenvectors are ordered by eigenvalue. Rownames are given as sample IDs.
## compute PC's using the gene expression correlation matrix from vignette ## load gene expression values from vignette expressionFile <- system.file(package = "OmicKriging", "doc/vignette_data/ig_gene_subset.txt.gz") ## compute correlation matrix geneCorrelationMatrix <- make_GXM(expressionFile) ## find top ten PC's of this matrix using SVD topPcs <- make_PCs_svd(geneCorrelationMatrix, n.top=10)
## compute PC's using the gene expression correlation matrix from vignette ## load gene expression values from vignette expressionFile <- system.file(package = "OmicKriging", "doc/vignette_data/ig_gene_subset.txt.gz") ## compute correlation matrix geneCorrelationMatrix <- make_GXM(expressionFile) ## find top ten PC's of this matrix using SVD topPcs <- make_PCs_svd(geneCorrelationMatrix, n.top=10)
Universal kriging formula: lambda' = ( c + X m )' iSig m' = ( x - X' iSig c )' ( X' iSig X )^-1 m' = ( t(x) - c' iSig X ) ( X' iSig X )^-1 lambda' = (c' + m' X) iSig x: #covariates x ntest X: ntrain x #cov c: ntrain x ntest
okriging(idtest, idtrain = NULL, corlist, H2vec, pheno, phenoname, Xcova = NULL)
okriging(idtest, idtrain = NULL, corlist, H2vec, pheno, phenoname, Xcova = NULL)
idtest |
A vector of sample IDs which constitute the test set. |
idtrain |
A vector of sample IDs which constitute the training set. |
corlist |
A list of correlation matrices used in Kriging. rownames and colnames of cor should be IID list and include idtest and idtrain. |
H2vec |
has weights for each RM relatednes matrix |
pheno |
A data frame with rownames set as sample IDs and a column containing phenotype data. |
phenoname |
The name of the column in pheno which contains phenotype data to test. |
Xcova |
Data frame of covariates with rownames() set to sample IDs. |
A dataframe with three columns: sample ID, observed phenotype Ytest, and predicted phenotype Ypred
Cressie 1993 Statistics for Spatial Data p.154
Function provided by GCTA maintainers (modified slightly) for accessing their recently introduced binary GRM format. The GRM is stored as a vector of numerics which correspond to the lower triangular elements including the diagonal. We simply read these, pull the diagonal elements, and inflate them into a full symmetric matrix. We add sample IDs to colnames and rownames for compatibility with other Kriging functions.
read_GRMBin(prefix, size = 4)
read_GRMBin(prefix, size = 4)
prefix |
The file path prefix to GRM binary files (e.g., test.grm.bin, test.grm.N.bin, test.grm.id.) |
size |
The length (in bytes) of each value in the raw GRM vector. Default is 4, and matches GRM writen by GCTA 1.11. |
Note that the GRM is described by three files, and this function assumes that all have a common prefix that is passed in.
GRM of dim (N.samples x N.samples) with rownames and colnames as sample ID.
http://www.complextraitgenomics.com/software/gcta/estimate_grm.html
## read binary Genetic Relatedness Matrix (GRM) generated by GCTA grmFile <- system.file(package = "OmicKriging", "doc/vignette_data/ig_genotypes.grm.bin") grmFileBase <- substr(grmFile,1, nchar(grmFile) - 4) GRM <- read_GRMBin(grmFileBase)
## read binary Genetic Relatedness Matrix (GRM) generated by GCTA grmFile <- system.file(package = "OmicKriging", "doc/vignette_data/ig_genotypes.grm.bin") grmFileBase <- substr(grmFile,1, nchar(grmFile) - 4) GRM <- read_GRMBin(grmFileBase)
Function to write a binary GRM format recently introduced by GCTA. It takes a correlation matrix as used by other Kriging functions, and writes three files: binary file for storing the diagonal + lower triangular elements, a text file for sample IDs, and a binary file storing the number of SNPs used in the correlation matrix calculation.
write_GRMBin(X, n.snps = 0, prefix, size = 4)
write_GRMBin(X, n.snps = 0, prefix, size = 4)
X |
Correlation matrix with rownames and colnames as sample IDs. |
n.snps |
Number of SNPs used in correlation matrix calculation. Default is 0.0. |
prefix |
Base file path and names for the three output files. |
size |
Number of bytes to write for each value. Default is 4 |
None. Though side effects are writing three files as described above.
http://www.complextraitgenomics.com/software/gcta/estimate_grm.html
## create a random genotype matrix nSamples <- 10 mMarkers <- 100 X <- matrix(rbinom(n=100, size=2, prob=0.5), nrow=nSamples) ## compute the Genetric Relatedness Matrix grm <- cor(X) ## write a Genetic Relatedness Matrix (GRM) ## NOTE: to following is not run here -- not writing any files in examples #write_GRMBin(grm, n.snps=mMarkers, prefix="grm.out")
## create a random genotype matrix nSamples <- 10 mMarkers <- 100 X <- matrix(rbinom(n=100, size=2, prob=0.5), nrow=nSamples) ## compute the Genetric Relatedness Matrix grm <- cor(X) ## write a Genetic Relatedness Matrix (GRM) ## NOTE: to following is not run here -- not writing any files in examples #write_GRMBin(grm, n.snps=mMarkers, prefix="grm.out")