ROBERSY SANCHEZ RODRIGUEZ authored
This is script to run on 'pjob_split' to estimate the best fitted model for J divergence for chromosome "7", "9", "17", and "22" using the data available 'jd_cancer_datasets.RData'

DataSets

This project stores data sets used in publications. The data sets include the Hellinger and J divergences of methylation levels in four Arabidopsis mutant samples and in patients with several types of cancer. All the analyses were made with MethylIT.

Hellinger and J Information Divergences of Methylation levels

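The methylation level $p_{ij}$ for an individual $i$ at cytosine site $j$ corresponds to a probability vector $p^{ij} = (p_{ij}, 1 - p_{ij})$. The information divergence between the methylation levels $p_{1j}$ and $p_{2j}$ from individuals 1 and 2 at site $j$ is then the divergence between the vectors $p^{1j} = (p_{1j}, 1 - p_{1j})$ and $p^{2j} = (p_{2j}, 1 - p_{2j})$. If the vector of coverage $(n_1, n_2)$ is supplied, the information divergence is estimated according to the formula:

$$H(p^{1j}, p^{2j}) = \frac{2 (n_1 + 1)(n_2 + 1)}{n_1 + n_2 + 2} \left[ \left(\sqrt{p_{1j}} - \sqrt{p_{2j}}\right)^2 + \left(\sqrt{1 - p_{1j}} - \sqrt{1 - p_{2j}}\right)^2 \right]$$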

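This formula corresponds to the Hellinger divergence as given in the first formula of Theorem 1 in reference 5. Otherwise:

$$H(p^{1j}, p^{2j}) = \left(\sqrt{p_{1j}} - \sqrt{p_{2j}}\right)^2 + \left(\sqrt{1 - p_{1j}} - \sqrt{1 - p_{2j}}\right)^2$$

which is the formula applied to the current set of samples.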
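The two cases described above (with and without coverage) can be sketched in Python. This is only an illustration, not Methyl-IT code: the function name `hellinger_div` is hypothetical, and the coverage weight `2 (n1 + 1)(n2 + 1) / (n1 + n2 + 2)` is an assumption about the form of the coverage-adjusted estimate.

```python
import math


def hellinger_div(p1, p2, n1=None, n2=None):
    """Hellinger divergence between two methylation levels at one cytosine site.

    p1, p2: methylation levels in [0, 1] for individuals 1 and 2.
    n1, n2: optional coverages (total read counts) behind p1 and p2.
    """
    # Divergence between the probability vectors (p, 1 - p).
    h = (math.sqrt(p1) - math.sqrt(p2)) ** 2 \
        + (math.sqrt(1 - p1) - math.sqrt(1 - p2)) ** 2
    if n1 is None or n2 is None:
        # No coverage supplied: plain divergence.
        return h
    # Assumed coverage weighting; the +1 offsets keep the weight positive
    # even when a count is zero.
    weight = 2 * (n1 + 1) * (n2 + 1) / (n1 + n2 + 2)
    return weight * h


# Same levels, with and without coverage.
print(hellinger_div(0.8, 0.2))
print(hellinger_div(0.8, 0.2, n1=30, n2=25))
```

Under this weighting, the same difference in methylation levels yields a larger value when it is supported by more reads.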

The J-divergence

Given the methylation levels from two individuals at a given cytosine site, Methyl-IT computes the J information divergence (JD) between them. The motivation for introducing JD in Methyl-IT rests on the following:

  1. It is a symmetrised form of the Kullback-Leibler divergence $D_{KL}$. Kullback and Leibler themselves actually defined the divergence as (2):

$$J(P, Q) = D_{KL}(P \,\|\, Q) + D_{KL}(Q \,\|\, P)$$

which is symmetric and nonnegative, where the probability distributions $P$ and $Q$ are defined on the same probability space (see reference (2) and Wikipedia).

  2. In general, JD is highly correlated with the Hellinger divergence, which is the main divergence currently used in Methyl-IT (see the examples for the function estimateDivergence).

  3. By construction, JD is measured in bits of information, which sets the basis for further information-thermodynamics analyses.

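Methylation levels $p_{ij}$ at a given cytosine site $i$ from an individual $j$ lead to the probability vectors $p^{ij} = (p_{ij}, 1 - p_{ij})$. Then the J-information divergence between the methylation levels $p^{ij}$ and $q^{i} = (q_i, 1 - q_i)$ (used as the reference individual) is given by the expression:

$$JD(p^{ij}, q^{i}) = \frac{1}{2}\left[ (p_{ij} - q_i) \log\frac{p_{ij}}{q_i} + (q_i - p_{ij}) \log\frac{1 - p_{ij}}{1 - q_i} \right]$$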
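As a rough illustration of the J divergence between a sample and a reference methylation level, the following Python sketch evaluates it for the two-component vectors `(p, 1 - p)` and `(q, 1 - q)`. Several choices here are assumptions rather than statements about Methyl-IT: the function name `j_divergence` is hypothetical, the factor 1/2 is assumed, base-2 logarithms are used because JD is described as being measured in bits, and the clipping constant `eps` is just a simple guard against `log(0)`.

```python
import math


def j_divergence(p, q, logbase=2.0, eps=1e-12):
    """J divergence between methylation levels p (sample) and q (reference)."""
    # Assumed guard: keep both levels strictly inside (0, 1) so the
    # logarithms stay finite.
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    # Symmetrised Kullback-Leibler divergence of (p, 1 - p) and (q, 1 - q),
    # halved (the 1/2 factor is an assumption of this sketch).
    return 0.5 * ((p - q) * math.log(p / q, logbase)
                  + (q - p) * math.log((1 - p) / (1 - q), logbase))


# The divergence is symmetric in its two arguments.
print(j_divergence(0.75, 0.25))
print(j_divergence(0.25, 0.75))
```

Both calls return the same value, which reflects the symmetry that motivates using JD instead of the one-sided Kullback-Leibler divergence.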

The statistic with asymptotic Chi-squared distribution is based on the statistic suggested by Kupperman (1957) (3) for $D_{KL}$ and discussed in references (4-5). That is:

$$\frac{2 (n_1 + 1)(n_2 + 1)}{n_1 + n_2 + 2} \, JD(p^{ij}, q^{i})$$

where $n_1$ and $n_2$ are the total counts (coverage in the case of methylation) used to compute the probabilities $p_{ij}$ and $q_i$. A basic Bayesian correction is added to prevent zero counts.
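A minimal Python sketch of how such a coverage-weighted statistic could be turned into an approximate p-value follows. It rests on several assumptions that the text above does not confirm: the weight `2 (n1 + 1)(n2 + 1) / (n1 + n2 + 2)`, with the `+1` offsets standing in for the correction against zero counts; natural logarithms in JD; and one degree of freedom for the Chi-squared reference distribution, since each probability vector has a single free parameter. All function names are hypothetical.

```python
import math


def jd_nats(p, q, eps=1e-12):
    """J divergence of (p, 1 - p) and (q, 1 - q) using natural logarithms."""
    # Assumed guard against log(0).
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return 0.5 * ((p - q) * math.log(p / q)
                  + (q - p) * math.log((1 - p) / (1 - q)))


def jd_statistic(p, q, n1, n2):
    """Coverage-weighted JD statistic (assumed form of the weighting)."""
    weight = 2 * (n1 + 1) * (n2 + 1) / (n1 + n2 + 2)
    return weight * jd_nats(p, q)


def chisq1_pvalue(x):
    """Upper tail of a Chi-squared distribution with 1 degree of freedom."""
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2))


stat = jd_statistic(0.8, 0.3, n1=20, n2=25)
print(stat, chisq1_pvalue(stat))
```

With identical methylation levels the statistic is zero and the p-value is one; larger coverage makes the same difference in levels more significant.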

References

  1. Kundariya H, Yang X, Morton K, Sanchez R, Axtell MJ, Hutton SF, et al. MSH1-induced heritable enhanced growth vigor through grafting is associated with the RdDM pathway in plants. Nat Commun. 2020;11: 5343. doi:10.1038/s41467-020-19140-x.
  2. Kullback S, Leibler RA. On information and sufficiency. Annals of Mathematical Statistics. 1951;22: 79–86. doi:10.1214/aoms/1177729694.
  3. Kupperman M. Further applications of information theory to multivariate analysis and statistical inference. Ph.D. Dissertation, George Washington University. 1957.
  4. Salicrú M, Morales D, Menéndez ML, Pardo L. On the applications of divergence type measures in testing statistical hypotheses. Journal of Multivariate Analysis. 1994. pp. 372–391. doi:10.1006/jmva.1994.1068.
  5. Basu A, Mandal A, Pardo L. Hypothesis testing for two discrete populations based on the Hellinger distance. Stat Probab Lett. 2010;80: 206–214. doi:10.1016/j.spl.2009.10.008.