DataSets
This project stores data sets used in publications. The data sets include the Hellinger and J divergences of methylation levels in four Arabidopsis mutant samples and in patients with several types of cancer. All the analyses were made with MethylIT.
Hellinger and J Information Divergences of Methylation levels
The methylation level $p_{ij}$ for an individual $i$ at cytosine site $j$ corresponds to a probability vector $p^{ij} = (p_{ij}, 1 - p_{ij})$. Then, the information divergence between the methylation levels $p_{1j}$ and $p_{2j}$ from individuals 1 and 2 at site $j$ is the divergence between the vectors $p^{1j} = (p_{1j}, 1 - p_{1j})$ and $p^{2j} = (p_{2j}, 1 - p_{2j})$.
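If the vector of coverage is supplied, then the information divergence is estimated according to the formula:

$$H(p^{1j}, p^{2j}) = \frac{2 (n_1 + 1)(n_2 + 1)}{n_1 + n_2 + 2} \left[ \left(\sqrt{p_{1j}} - \sqrt{p_{2j}}\right)^2 + \left(\sqrt{1 - p_{1j}} - \sqrt{1 - p_{2j}}\right)^2 \right]$$

where $p_{1j}$ and $p_{2j}$ are the methylation levels of individuals 1 and 2 at site $j$, and $n_1$ and $n_2$ are the corresponding coverages. This formula corresponds to Hellinger divergence as given in the first formula from Theorem 1 from reference 1. Otherwise:

$$H(p^{1j}, p^{2j}) = \left(\sqrt{p_{1j}} - \sqrt{p_{2j}}\right)^2 + \left(\sqrt{1 - p_{1j}} - \sqrt{1 - p_{2j}}\right)^2$$

which is the formula applied to the current set of samples.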
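As a rough illustration of the two cases, the following Python sketch computes the Hellinger divergence between two methylation levels, with and without coverage. It is not the MethylIT implementation: the function name `hellinger_divergence` and its parameters are hypothetical, and the coverage-weighted form, with scaling factor 2(n1 + 1)(n2 + 1)/(n1 + n2 + 2), is assumed here.

```python
import math


def hellinger_divergence(p1, p2, n1=None, n2=None):
    """Hellinger divergence between the vectors (p1, 1 - p1) and (p2, 1 - p2).

    p1, p2: methylation levels in [0, 1] for individuals 1 and 2 at one site.
    n1, n2: optional coverages; when both are given, the divergence is
            scaled by 2 (n1 + 1)(n2 + 1) / (n1 + n2 + 2) (assumed form).
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("methylation levels must lie in [0, 1]")

    # Unscaled divergence over the two categories (methylated, unmethylated).
    h = (math.sqrt(p1) - math.sqrt(p2)) ** 2 \
        + (math.sqrt(1.0 - p1) - math.sqrt(1.0 - p2)) ** 2

    if n1 is None or n2 is None:
        return h

    # Coverage-weighted version.
    return 2.0 * (n1 + 1) * (n2 + 1) * h / (n1 + n2 + 2)


if __name__ == "__main__":
    print(hellinger_divergence(0.25, 1.0))           # without coverage
    print(hellinger_divergence(0.25, 1.0, 9, 9))     # with coverage
```

Without coverage the value is bounded between 0 and 2; with coverage, sites supported by more reads receive a proportionally larger value.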
The J-divergence
Given the methylation levels from two individuals at a given cytosine site, Methyl-IT computes the J information divergence (JD) between them. The motivation to introduce JD in Methyl-IT is founded on:
- It is a symmetrised form of the Kullback-Leibler divergence ($D_{KL}$). Kullback and Leibler themselves actually defined the divergence as (2): $D_{KL}(P \parallel Q) + D_{KL}(Q \parallel P)$, which is symmetric and nonnegative, where the probability distributions $P$ and $Q$ are defined on the same probability space (see reference (1) and Wikipedia).
- In general, JD is highly correlated with Hellinger divergence, which is the main divergence currently used in Methyl-IT (see examples for function estimateDivergence).
- By construction, JD is measured in bits of information, which sets the basis for further information-thermodynamics analyses.
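Methylation levels $p_{ij}$, at a given cytosine site $i$ from an individual $j$, lead to the probability vectors $p^{ij} = (p_{ij}, 1 - p_{ij})$. Then, the J-information divergence between the methylation levels $p^{ij}$ and $q^{i} = (q_i, 1 - q_i)$ (used as reference individual) is given by the expression:

$$JD(p^{ij}, q^{i}) = \frac{1}{2} \left[ (p_{ij} - q_i) \log_2 \frac{p_{ij}}{q_i} + (q_i - p_{ij}) \log_2 \frac{1 - p_{ij}}{1 - q_i} \right]$$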
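A minimal Python sketch of the J-information divergence for one cytosine site follows. It assumes the symmetrised Kullback-Leibler form with a factor 1/2 and base-2 logarithms (so the result is in bits); the name `j_divergence` and the clamping constant `eps` are hypothetical and only stand in for whatever correction the actual software applies to avoid `log(0)`.

```python
import math


def j_divergence(p, q, eps=1e-9):
    """J-information divergence (bits) between (p, 1 - p) and (q, 1 - q).

    p:   methylation level of the individual at the site.
    q:   methylation level of the reference individual at the same site.
    eps: hypothetical clamp keeping both levels strictly inside (0, 1).
    """
    # Clamp to (0, 1) so that the logarithms are always defined.
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)

    # Assumed form: half of KL(P || Q) + KL(Q || P), written for two categories.
    return 0.5 * ((p - q) * math.log2(p / q)
                  + (q - p) * math.log2((1.0 - p) / (1.0 - q)))


if __name__ == "__main__":
    print(j_divergence(0.8, 0.2))
    print(j_divergence(0.2, 0.8))  # same value: JD is symmetric
```

Swapping the individual and the reference leaves the value unchanged, which is the symmetry property that motivates using JD instead of the plain Kullback-Leibler divergence.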
The statistic with asymptotic Chi-squared distribution is based on the statistic suggested by Kupperman (1957) (3) for $D_{KL}$ and commented in references (4-5). That is:

$$\frac{2 (n_1 + 1)(n_2 + 1)}{n_1 + n_2 + 2} \, JD(p^{ij}, q^{i})$$

where $n_1$ and $n_2$ are the total counts (coverage in the case of methylation) used to compute the probabilities $p_{ij}$ and $q_i$. A basic Bayesian correction is added to prevent zero counts.
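The scaling by total counts can be sketched in Python as below. This is only an illustration under stated assumptions: the function name `scaled_statistic` is hypothetical, the divergence value is taken as already computed, and the factor 2(n1 + 1)(n2 + 1)/(n1 + n2 + 2), including its +1 terms, is an assumed form rather than a confirmed transcription of the MethylIT code.

```python
def scaled_statistic(divergence, n1, n2):
    """Scale a precomputed divergence by the total counts of both samples.

    divergence: divergence value between the two probability vectors.
    n1, n2:     total counts (coverages) behind the two probabilities.
    Returns 2 (n1 + 1)(n2 + 1) * divergence / (n1 + n2 + 2) (assumed form).
    """
    if n1 < 0 or n2 < 0:
        raise ValueError("counts must be non-negative")
    return 2.0 * (n1 + 1) * (n2 + 1) * divergence / (n1 + n2 + 2)


if __name__ == "__main__":
    # Same divergence, different coverage: higher counts give a larger statistic.
    print(scaled_statistic(0.5, 9, 9))
    print(scaled_statistic(0.5, 99, 99))
```

Because of the +1 terms, the factor stays defined even when one of the counts is zero.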
References
- Kundariya H, Yang X, Morton K, Sanchez R, Axtell MJ, Hutton SF, et al. MSH1-induced heritable enhanced growth vigor through grafting is associated with the RdDM pathway in plants. Nat Commun. 2020;11: 5343. doi:10.1038/s41467-020-19140-x.
- Kullback, S.; Leibler, R.A. (1951). "On information and sufficiency". Annals of Mathematical Statistics. 22 (1): 79–86. doi:10.1214/aoms/1177729694.
- Kupperman, M., 1957. Further application to information theory to multivariate analysis and statistical inference. Ph.D. Dissertation, George Washington University.
- Salicrú M, Morales D, Menéndez ML, Pardo L. On the applications of divergence type measures in testing statistical hypotheses. Journal of Multivariate Analysis. 1994. pp. 372–391. doi:10.1006/jmva.1994.1068.
- Basu A, Mandal A, Pardo L. Hypothesis testing for two discrete populations based on the Hellinger distance. Stat Probab Lett. 2010;80: 206–214. doi:10.1016/j.spl.2009.10.008.