Non-parametric Analysis of Granger Causality Using Local Measures of Divergence

Distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Granger causality analysis of temporal data is now a standard routine in many scientific disciplines. Since its inception, Granger causality has been modeled using a wide variety of analytical frameworks, of which linear models and derivations thereof have been the dominant choice. Nevertheless, a body of research on Granger causality and its applications has focused on non-linear and non-parametric models. One common choice for such models is based on multivariate density estimators and measures of divergence. However, these models depend on a number of estimation and tuning components that have a great impact on the final outcome. Here we focus on one such general model and improve a number of its tuning components. Crucially, we i) investigate the bandwidth selection issue in kernel density estimation, and ii) discuss and propose a solution to the sensitivity of estimated information theoretic measures of divergence to non-linear correspondence. The resulting framework of analysis is evaluated using a varied series of simulations.


Introduction
Analysis of causality, as a natural extension of correlative analysis, has been of great interest in many scientific disciplines. Causal dynamics can be quantified and studied in areas as diverse as the social events behind rising levels of criminality, the effects of protein expression on cellular motility, the causes behind growing concentrations of carbon dioxide in the atmosphere, the driving forces leading to increased inflation, and the effects of drug treatments on neuronal activity. The general concept of causality has a history stretching as far back as that of philosophical thinking. However, quantitative concepts of causality are relatively new and have been subject to much debate. Given the wide variety of views on quantitative definitions of causality, it is of little surprise that different schools of thought on this matter have generated a considerable spectrum of quantitative frameworks to define, model and analyze data-driven causal phenomena. Among these, Bayesian networks, differential equation-driven systems analysis and Granger causality have been the dominant frameworks of causal modeling and analysis [17,28]. Here, we address the concept of Granger causality.

The concept of Granger causality was formulated in different lights by the works of [18,48] and consolidated by Clive W.J. Granger in [15]. Granger causality is a particular definition of causality in which, using temporal resolution, a variable is said to Granger-cause another if the earlier values of the former can enhance the prediction of the present value of the latter in the presence of the latter's earlier values. This definition presumes a temporal signal asymmetry where the cause precedes the effect and where the information embedded in the causal variable about the occurrence of the effect, conditioned on all other embedded information, is unique [15,22]. Expressed in the language of probability theory, under $H_0$, given $k$ lags and variables $A$, $B$ and $C$, $\{B\}$ does not Granger-cause $\{A\}$ at observation index $t$ if

$$P\left(A_t \mid \{A\}_{t-k}^{t-1}, \{B\}_{t-k}^{t-1}, \{C\}_{t-k}^{t-1}\right) = P\left(A_t \mid \{A\}_{t-k}^{t-1}, \{C\}_{t-k}^{t-1}\right) \tag{1}$$

The statement above can be tested by comparing the two conditional probability densities (CPDs) below [12]:

$$f\left(A_t \mid \{A\}_{t-k}^{t-1}, \{C\}_{t-k}^{t-1}\right) \tag{2}$$

$$f\left(A_t \mid \{A\}_{t-k}^{t-1}, \{B\}_{t-k}^{t-1}, \{C\}_{t-k}^{t-1}\right) \tag{3}$$

Under $H_0$, where $\{B\}$ does not Granger-cause $\{A\}$, the CPDs (2) and (3) are identical. For the sake of convenience let us implement the following substitutions: $X = A_t$, $Y = \{B\}_{t-k}^{t-1}$ and $Z = \{\{A\}_{t-k}^{t-1}, \{C\}_{t-k}^{t-1}\}$. Thus, it is understood that all formulations, estimations and tests, regardless of their appearance, are easily identified with any arbitrary multivariate setting. The conditional density functions in (2) and (3) are thus expressed more economically as $f_{X|Z}$ and $f_{X|YZ}$, respectively. Accordingly, (1) can be redefined as:

$$H_0: \; f_{X|YZ} = f_{X|Z} \tag{4}$$

Since the introduction of the concept of Granger causality in 1969, for which Granger received the Nobel Memorial Prize in Economics in 2003, this particular definition of causality has led to a wide array of research focused on its theoretical formulation, data-driven analyses, and expansions of practical routines using various statistical tools. Models of Granger causality have been extensively reviewed in [22,30,36,40]. Most prominently, Granger causality has had a long range of applications within the econometric discipline [11,14,15,39,43]. Another prominent field of application includes biologically oriented domains such as biological network inference, and systems modeling and analysis [21,32,35,46].

In this study we enhance the estimates of multidimensional probability densities used in non-parametric analysis of Granger causality through improved bandwidth selection, and address the sensitivity of estimated information theoretic measures to non-linear relationships through the employment of local estimates. The study is arranged as follows. In Methods, we present the elementary models of Granger causality, review the non-parametric extensions using multivariate density estimation and measures of divergence, and address the bandwidth selection and sensitivity issues mentioned above. In Results, we investigate the performance of the improved non-parametric framework using series of simulations. Lastly, in Discussion we draw upon the discussed themes and conclude the study.

Parametric models
Methods of linear regression have been the most frequently employed tools to model and test the presence of Granger causality. Using linear regression, the hypothesis in (1), omitting the variable(s) $C$ for notational convenience, is tested using the models

$$A_t = \alpha_0 + \sum_{i=1}^{k} \beta_{0,i} A_{t-i} + \epsilon_t$$

$$A_t = \alpha_1 + \sum_{i=1}^{k} \beta_{1,i} A_{t-i} + \sum_{i=1}^{k} \gamma_i B_{t-i} + \eta_t$$

where $\alpha_{\cdot}$ are the regression intercept coefficients, $\beta_{\cdot}$ and $\gamma_{\cdot}$ are the regression variable coefficients, and the residual terms $\epsilon_t$ and $\eta_t$ are independent and identically distributed according to a zero-mean Gaussian $N(0, \sigma^2)$. See [13] for a generalization of the models above. Among other techniques, the models above can be tested using the Granger-Sargent test [15], also known as the structural Chow test in the econometric literature [9]:

$$GS = \frac{(RSS_{\epsilon} - RSS_{\eta})/k}{RSS_{\eta}/(n - 2k - 1)} \sim F_{k,\, n-2k-1} \tag{5}$$

where $RSS_{\epsilon}$ is the 'restricted' residual sum of squares under $H_0$, $RSS_{\eta}$ is the 'unrestricted' residual sum of squares under $H_1$, $n$ is the number of observations, and $k$ is the number of included lags.

Naturally, these models assume linear specifications of the functional form between the regressors and the response variable. Additionally, given the widespread usage of ordinary least squares (OLS) regression, the models demand the fulfillment of the standard assumptions of homoscedasticity, error normality, and lack of multicollinearity. A routine way to relax the last of these assumptions is principal components regression or partial least squares regression, at the possible expense of predictive power. A partial circumvention of the other restrictions is outlined in the following.

An approach to relax the strict linearity of the models above is to use functional derivations or higher-dimensional projections of variables. Thus, the hypothesis in (1) can be tested using the models

$$A_t = \alpha_0 + \sum_{i=1}^{k} \sum_{j=1}^{J_i} \beta_{0,ij} f_{ij}(A_{t-i}) + \epsilon_t$$

$$A_t = \alpha_1 + \sum_{i=1}^{k} \sum_{j=1}^{J_i} \beta_{1,ij} f_{ij}(A_{t-i}) + \sum_{i=1}^{k} \sum_{j=1}^{J_i} \gamma_{ij} f_{ij}(B_{t-i}) + \eta_t$$

where $f_{ij}$ denotes the $j$th functional derivation of a variable at lag $t - i$, and $J_i$ represents the number of such derivations for each variable at each lag $t - i$. Derivations and projections of this type can naturally be applied to the response variable as well, leading to multivariate models of regression. Typical derivations and projections include polynomial expansions, power transformations, spline-based expansions, and radial basis function expansions. The hypothesis modeled above can be similarly tested using the Granger-Sargent statistic in (5). Specific studies in the modeling and application of non-linear models of regression in the context of Granger causality include [1,5,7,27].

Although offering an enhanced degree of flexibility, models of non-linear regression as formulated above differ from those of linear regression merely in the sense that they model the linear correspondences between derivations of the response and the regressors. Hence, the obtained specifications of the relationships will still be linear or piece-wise linear. Verifying whether the modeled relationships are the true underlying patterns of correspondence is not a trivial task. Moreover, higher-dimensional projections of data may need adequate treatment, as they often lead to overparametrization and ill-conditioned covariance matrices.
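The linear test described in this section can be sketched in a few lines of Python. This is a minimal illustration rather than the study's implementation: the function names (`lagged`, `rss`, `granger_sargent`), the simulated series and their coefficients are assumptions introduced here, and the degrees of freedom follow the common textbook form of the statistic, with $n$ taken as the number of rows left after removing the first $k$ observations.

```python
import numpy as np

def lagged(series, k):
    """Return an (n - k, k) matrix whose column i - 1 holds the series at lag i."""
    n = len(series)
    return np.column_stack([series[k - i : n - i] for i in range(1, k + 1)])

def rss(design, y):
    """Residual sum of squares of an OLS fit of y on the design plus an intercept."""
    X = np.column_stack([np.ones(len(y)), design])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def granger_sargent(a, b, k):
    """F-type statistic for 'b does not Granger-cause a' using k lags."""
    y = a[k:]
    a_lags, b_lags = lagged(a, k), lagged(b, k)
    rss_restricted = rss(a_lags, y)                                # own lags only
    rss_unrestricted = rss(np.column_stack([a_lags, b_lags]), y)   # own lags + lags of b
    n = len(y)
    df2 = n - 2 * k - 1
    f_stat = ((rss_restricted - rss_unrestricted) / k) / (rss_unrestricted / df2)
    return f_stat, (k, df2)

# Hypothetical example: b drives a with a one-step delay.
rng = np.random.default_rng(0)
n_obs = 500
b = rng.standard_normal(n_obs)
noise = rng.standard_normal(n_obs)
a = np.zeros(n_obs)
for t in range(1, n_obs):
    a[t] = 0.5 * a[t - 1] + 0.8 * b[t - 1] + 0.5 * noise[t]

f_ba, dof = granger_sargent(a, b, k=2)   # tests whether b Granger-causes a
f_ab, _ = granger_sargent(b, a, k=2)     # tests the reverse direction
print(f_ba, f_ab, dof)
```

A p-value can be obtained from the upper tail of the F distribution with the returned degrees of freedom, for example with `scipy.stats.f.sf(f_stat, *dof)`.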

Non-parametric models
As discussed in the previous section, models of linear regression and the corresponding parametric tests lose their practical validity where assumptions are violated. As a simple example, regression via least squares techniques with the Granger-Sargent test in (5) performs rather poorly in the presence of variables with skewed distributions. More important, however, is the potential exacerbation of performance inadequacy in the case of linear model specification where the true forms of correspondence between the response and the regressors may be non-linear.
Based on these concerns, efforts to improve the modeling of Granger causality have focused on relaxing the assumptions of parametricity in general, and liberation from linear forms of correspondence in particular. Here we will focus on the circumvention of these assumptions by means of kernel density estimation and non-parametric (information theoretic) measures of divergence.
Frameworks of this breed have been reviewed and discussed to varying extents in [22,36]. In the current study, however, our aim is to improve some of the overlooked aspects of non-parametric and information theoretic-driven modeling of Granger causality. These include the practices involved in multivariate kernel density estimation and the sensitivity of information theoretic measures to non-linear relationships.
In the following we outline our framework to test the hypothesis formulated in (4) by comparing the CPDs (2) and (3). The procedure of modeling and testing (4) can be divided into three domains: i) estimation of the CPDs, ii) choosing a measure of divergence, and iii) utilization of a suitable framework for tests of significance.

Estimating CPDs: Kernel density estimation
Traditionally, histogram density estimators based on equidistant, spline-based or adaptive partitioning of the observation space have been the preferred solution for estimating probability density functions. Since its introduction, however, kernel density estimation (KDE) has become the dominant solution to probability density estimation due to greater flexibility and superior performance [34,38,47]. More specifically, when compared to histogram density estimators, the KDE approach leads to better rates of convergence and less sensitivity to binning [41].

Having obtained independent and identically distributed samples $(x_1, \ldots, x_n)$ from a distribution with an unknown density $f$, the estimation of $f$ using KDE reduces to the following:

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

where $K(\cdot)$ is a kernel function and $h$ is a bandwidth, a smoothing parameter.

Common kernel functions are the uniform, triangular, Epanechnikov, or the Gaussian [20,38]. The latter function, the standard Gaussian kernel $K(x) = \phi(x) = e^{-x^2/2}/\sqrt{2\pi}$, will be our choice throughout this study. Thus, using KDE, $f_{X|Z}$ and $f_{X|YZ}$ are estimated according to (6) and (7), respectively, with the normalizing constants $C_{\cdot}$ given in (8), where $n$ is the total number of observations used in the estimation, $d_X$ is the number of dimensions of $X$, and $h^{(X)}_{\cdot}$ are the bandwidths associated with the corresponding dimensions of $X$, resulting in an individual bandwidth for every lag of every variable in the KDE of the CPDs. The other definitions follow accordingly.

Varieties of KDE in non-parametric modeling of Granger causality have previously been employed to test the hypothesis in (4) in a number of studies [8,21,35]. Specifically, in [8] a kernel function is chosen and the bandwidths are evaluated using a 'rule of thumb'. Notably, the choice of bandwidths is an issue that demands a discussion.
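As an illustration of kernel-based CPD estimation, the sketch below estimates a conditional density as the ratio of a joint and a marginal kernel estimate. It uses a product of standard Gaussian kernels with one bandwidth per dimension. The ratio form, the function names (`kde_product`, `conditional_density`), the toy data and the fixed bandwidth of 0.3 are assumptions made for this example and are not taken from the estimators used in the study.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde_product(points, data, h):
    """Product-kernel density estimate at each row of `points`.

    points: (m, d) evaluation points, data: (n, d) observations,
    h: (d,) bandwidths, one per dimension.
    """
    h = np.asarray(h, dtype=float)
    u = (points[:, None, :] - data[None, :, :]) / h      # shape (m, n, d)
    weights = gaussian_kernel(u).prod(axis=2)            # shape (m, n)
    return weights.sum(axis=1) / (data.shape[0] * np.prod(h))

def conditional_density(x, z, data_x, data_z, h_x, h_z):
    """Estimate f(x | z) as the joint estimate f(x, z) divided by the marginal f(z)."""
    joint = kde_product(np.hstack([x, z]),
                        np.hstack([data_x, data_z]),
                        np.concatenate([h_x, h_z]))
    marginal = kde_product(z, data_z, h_z)
    return joint / marginal

# Hypothetical data: x follows z closely, so f(x = 1 | z = 1) should exceed f(x = -1 | z = 1).
rng = np.random.default_rng(1)
z_obs = rng.standard_normal((500, 1))
x_obs = z_obs + 0.3 * rng.standard_normal((500, 1))
h = np.array([0.3])

x_query = np.array([[1.0], [-1.0]])
z_query = np.array([[1.0], [1.0]])
cpd = conditional_density(x_query, z_query, x_obs, z_obs, h, h)
print(cpd)
```

The pairwise difference array has shape (m, n, d), so memory grows with the product of the number of evaluation points and observations; for long series the evaluation points can be processed in batches.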

Bandwidth selection
Adequate bandwidth selection in kernel density estimation of multivariate distributions is critical to obtaining reliable estimates. The bandwidth, as a smoothing parameter, determines the number of observations included in 'windows' in the process of KDE. Specification of the bandwidth size is a clear bias-variance tradeoff where relatively small bandwidths lead to high variance and low bias, and relatively large bandwidths to the opposite [20]. Currently, the bandwidth parameter in a wide variety of applications is determined by a 'rule of thumb' or other application-specific evaluations. Unsurprisingly, such frameworks fail to generalize and, when used without critical reflection, can lead to erroneous and biased estimates. This issue has to an extent been addressed by other, more adaptive methods designed to reduce the bias of KDE [19,23,45]. However, none of these approaches qualifies as a purely non-parametric practice yielding satisfactory density estimates regardless of the underlying distributions.

Here, in the following employments of KDE, bandwidths are chosen using the plug-in bandwidth selection method proposed in [4], based on linear diffusion processes. Using simulated data, the bandwidths produced according to [4] return reliable and accurate results and consistently outperform the traditional bandwidth selection methods (see Figure 1). The simulations in Figure 1 are based on two axes of variation: distribution bimodality and mode shape. The source distributions are two-component mixtures of Gaussian distributions with equal mixing proportions, each sampled with 1000 randomly generated numbers. The bimodality of the distribution is determined by distancing the two Gaussian components, and the mode shape is determined by altering the variance of one of the components. Both of these variations are performed progressively for 20 × 20 bins as seen in Figure 1. The obtained residuals represent the pointwise standardized deviations between the estimated densities and the underlying mixture model density. The fixed bandwidth of choice has been $(4\sigma^5/3n)^{0.2}$, where $\sigma$ is the standard deviation of the sample and $n$ is the number of samples. This bandwidth is also known as Silverman's rule of thumb [38]. As seen in Figure 1, the adaptive bandwidths proposed by [4] clearly outperform the fixed bandwidths in terms of unbiasedness. Thus, in the remainder of this study all bandwidths in all instances of KDE are obtained using the method in [4].
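The fixed rule-of-thumb bandwidth quoted above is simple to compute, which also makes its limitation easy to see: it depends on the sample only through its standard deviation and size. The sketch below is an illustration under assumed settings (the function name `silverman_bandwidth`, the seed and the two mixture separations are hypothetical) and does not implement the plug-in method of [4].

```python
import numpy as np

def silverman_bandwidth(sample):
    """Rule-of-thumb bandwidth (4 * sigma^5 / (3 * n))^0.2 for a one-dimensional sample."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    sigma = sample.std(ddof=1)
    return (4.0 * sigma**5 / (3.0 * n)) ** 0.2

def gaussian_mixture(separation, n, rng):
    """Equal-proportion mixture of N(-separation/2, 1) and N(+separation/2, 1)."""
    centers = rng.choice([-0.5 * separation, 0.5 * separation], size=n)
    return centers + rng.standard_normal(n)

rng = np.random.default_rng(2)
unimodal = gaussian_mixture(0.0, 1000, rng)
bimodal = gaussian_mixture(6.0, 1000, rng)

h_unimodal = silverman_bandwidth(unimodal)
h_bimodal = silverman_bandwidth(bimodal)
print(h_unimodal, h_bimodal)
```

Both mixtures have components of unit variance, yet the bimodal sample receives a much larger bandwidth because its overall standard deviation is inflated by the distance between the modes. This is the kind of oversmoothing that motivates data-driven bandwidth selection.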

Measures of divergence
As stated in the Introduction, the existence of Granger causality is equivalent to an inequality between the conditional probability densities $f_{X|Z}$ and $f_{X|YZ}$. Under the null hypothesis (4) the CPDs above have to be equal. This equality, or the lack of it, can be quantified using a variety of measures of divergence. Among others, the weighted Hellinger distance and the Euclidean distance have been used to quantify the divergence between the kernel density estimated CPDs in [8], of which the latter will be employed in a number of our simulations. More specifically, the Euclidean distance between (7) and (6) is defined as [44]:

$$d_E\left(\hat{f}_{X|YZ}, \hat{f}_{X|Z}\right) = \sqrt{\sum_{i=1}^{n} \left[ (\hat{f}_{X|YZ})_i - (\hat{f}_{X|Z})_i \right]^2} \tag{9}$$

where $(f_{\cdot|\cdot})_i$ is the $i$th realization of $(f_{\cdot|\cdot})$, and $n$ is the number of such realizations.

In the remainder the focus will be on measures of divergence derived from the field of information theory. This focus is motivated by the inherent non-parametricity of information theoretic measures of divergence. Firstly, given empirically estimated distribution functions, there is no need for distribution specificity. Secondly, information theoretic measures operate in an evidence-based manner allowing detection of non-linear relationships, the practical validity of which will be discussed shortly.

The first discussed measure for quantification of divergence between $f_{X|YZ}$ and $f_{X|Z}$ is the Jensen-Shannon divergence introduced in [26]. Given the right weighting parameters, the Jensen-Shannon divergence can be regarded as the symmetrized version of the Kullback-Leibler divergence [25,26]. However, before a thorough elaboration of these measures, a review of some of the basic concepts in information theory may be useful.
In information theory the (differential) Shannon entropy of a random variable $X$ with a continuous probability density $f_X$ with support on $\mathcal{X}$ is defined as [10,37]:

$$H(X) = -\int_{\mathcal{X}} f_X(x) \log_b f_X(x) \, dx \tag{10}$$

where $b$ is the base of the logarithm determining the terms in which the entropy is measured (e.g. $b = 2$ for bits and $b = e$ for nats). The higher the value of entropy, the higher the uncertainty associated with the outcomes of the studied random variable. The most illustrative demonstrations are those of Bernoulli trials such as coin tosses. A symmetric coin ($p_1 = p_2 = 0.5$) yields the highest entropy value, whereas the entropy decreases with increasing coin asymmetry and hence decreasing uncertainty.

The Kullback-Leibler divergence between two probability density functions $f_X$ and $g_X$ with support on $\mathcal{X}$ is defined as [10,25]:

$$D_{KL}(f_X \,\|\, g_X) = \int_{\mathcal{X}} f_X(x) \log_b \frac{f_X(x)}{g_X(x)} \, dx \tag{11}$$

Note that interchanging the arguments $f_X$ and $g_X$ in (11) can lead to a different quantification of divergence. Therefore, the Kullback-Leibler divergence is not regarded as a symmetric measure of divergence.

Given the definitions above, the Jensen-Shannon divergence between two probability distributions $f_X$ and $f_Y$ representing two stochastic variables $X$ and $Y$, respectively, with identical support is defined by

$$D_{JS}(f_X \,\|\, f_Y) = \pi_X D_{KL}(f_X \,\|\, m_{XY}) + \pi_Y D_{KL}(f_Y \,\|\, m_{XY}) \tag{12}$$

where $m_{XY} = [f_X + f_Y]/2$, and $\pi_{\cdot}$ denotes the weights assigned to each distribution subject to $\pi_X + \pi_Y = 1$. In our applications, unless otherwise stated, $\pi_{\cdot} = 1/d$ where $d$ is the number of dimensions. Moreover, $0 \le D_{JS} \le \log_b 2$. Expressed in terms of Shannon entropy, (12) can be redefined as:

$$D_{JS}(f_X \,\|\, f_Y) = H(m_{XY}) - \pi_X H(f_X) - \pi_Y H(f_Y) \tag{13}$$

where $H(f)$ denotes the entropy (10) of a variable with density $f$. Thus, with $\pi = 0.5$, the symmetrized divergence between the estimated CPDs (7) and (6) is evaluated according to:

$$D_{JS}\left(\hat{f}_{X|YZ} \,\|\, \hat{f}_{X|Z}\right) = H\left(\frac{\hat{f}_{X|YZ} + \hat{f}_{X|Z}}{2}\right) - \frac{1}{2} H\left(\hat{f}_{X|YZ}\right) - \frac{1}{2} H\left(\hat{f}_{X|Z}\right) \tag{14}$$

Another information theoretic measure of divergence frequently used in the context of Granger causality has been the conditional mutual information. In fact, there is an illuminating relationship between (14) and conditional mutual information, which is elaborated in Appendix 5.1. The conditional mutual information of $X$ and $Y$ given $Z$, with supports on $\mathcal{X}$, $\mathcal{Y}$, and $\mathcal{Z}$ respectively, can be expressed as:

$$I(X; Y | Z) = \int_{\mathcal{Z}} \int_{\mathcal{Y}} \int_{\mathcal{X}} f_{XYZ}(x, y, z) \log_b \frac{f_{XY|Z}(x, y | z)}{f_{X|Z}(x | z) \, f_{Y|Z}(y | z)} \, dx \, dy \, dz \tag{15}$$

The conditional mutual information is symmetric: $I(X; Y|Z) = I(Y; X|Z)$. Furthermore, $I(X; Y|Z) \ge 0$ with equality if and only if $X$ and $Y$ are conditionally independent given $Z$. Lastly, $I(X; X|Z) = H(X|Z)$.
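The properties of entropy, Kullback-Leibler divergence and Jensen-Shannon divergence stated above can be checked numerically. The sketch below uses discrete distributions as a stand-in for the differential (continuous) definitions, which is a simplification made for this example; the function names and the coin probabilities are likewise illustrative.

```python
import numpy as np

def entropy(p, base=2.0):
    """Shannon entropy of a discrete distribution (zero-probability outcomes contribute 0)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])) / np.log(base))

def kl_divergence(p, q, base=2.0):
    """Kullback-Leibler divergence D(p || q); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])) / np.log(base))

def js_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence with equal weights of 0.5."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m, base) + 0.5 * kl_divergence(q, m, base)

fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]

print(entropy(fair_coin), entropy(biased_coin))        # the fair coin is the most uncertain
print(kl_divergence(biased_coin, fair_coin),
      kl_divergence(fair_coin, biased_coin))           # the two directions differ
print(js_divergence(biased_coin, fair_coin),
      js_divergence(fair_coin, biased_coin))           # the two directions agree
```

With equal weights, the same value is obtained from the entropy form, that is, the entropy of the mixture minus the average entropy of the two distributions.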
As the conditional entropy of a set of variables could vary, the conditional mutual information in (15) should be normalized to compensate for possible fluctuations. In this study, we employ the normalized conditional mutual information (16) and the conditional symmetric uncertainty (17). Other studies outside of the context of Granger causality focused on normalization of mutual information include [6,42,49]. It should be noted that a number of information theoretic measures used in the context of Granger causality can be reduced to a derivation of conditional mutual information.
One of these measures is the widely used 'transfer entropy' as proposed in [33], which can be reparametrized as the conditional mutual information [22,36]. Interestingly, the transfer entropy has been shown to have a functional relationship with the linear estimators of Granger causality [3,29]. Other measures coined under the term 'directed influence' also fall under this category as derivations of conditional mutual information [36].
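A small discrete example may help to illustrate what conditioning adds to mutual information. The sketch below computes the conditional mutual information from joint entropies, using the standard identity I(X; Y|Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z). The discrete setting and the XOR construction are assumptions chosen for this illustration and are not part of the simulations in this study.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a (possibly multi-dimensional) probability table."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def conditional_mutual_information(pxyz):
    """I(X; Y | Z) in bits from a joint table indexed as pxyz[x, y, z]."""
    pxz = pxyz.sum(axis=1)
    pyz = pxyz.sum(axis=0)
    pz = pxyz.sum(axis=(0, 1))
    return (entropy_bits(pxz) + entropy_bits(pyz)
            - entropy_bits(pxyz) - entropy_bits(pz))

def mutual_information(pxy):
    """I(X; Y) in bits from a joint table indexed as pxy[x, y]."""
    return (entropy_bits(pxy.sum(axis=1)) + entropy_bits(pxy.sum(axis=0))
            - entropy_bits(pxy))

# X and Y are independent fair bits and Z = X XOR Y.
pxyz = np.zeros((2, 2, 2))
for x in (0, 1):
    for y in (0, 1):
        pxyz[x, y, x ^ y] = 0.25

mi_xy = mutual_information(pxyz.sum(axis=2))
cmi_xy_given_z = conditional_mutual_information(pxyz)
print(mi_xy, cmi_xy_given_z)
```

Here X and Y share no information marginally, but once Z is known each of them determines the other, so the conditional mutual information is one full bit. Swapping the roles of X and Y leaves the value unchanged, in line with the symmetry noted above.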

Sensitivity to linearity
Based on our observations and as put forward in [24,31], information theoretic measures estimated via KDE show a quantifiable and non-random sensitivity to non-linear patterns of correspondence. That is, given the same degree of noise, information theoretic measures estimated via KDE assign lower scores to non-linear relationships than to linear ones. As this contradicts the non-parametric nature of information theoretic measures, a series of different approaches have been devised to circumvent this issue. As outlined in [31], limiting the sensitivity of information theoretic measures (e.g. mutual information) to non-linearity may be achieved by means of domain partitioning. However, the partitioning devised in [31] demands a large number of observations and, due to its extensive partitioning, is limited to the bivariate case. In the present case of time-series modeling using two or more variables and one or more lags, the smallest number of dimensions available for partitioning is three. Moreover, as the number of variables and lags increases, the 'curse of dimensionality' will swiftly necessitate the collection of ever larger numbers of observations.

To circumvent this issue, we propose the local estimates (18), (19) and (20) of the measures (14), (16) and (17), respectively, where $n$ is the number of observations, $N_i$ denotes the neighborhood of observation $i$, and the vectors $X^{(N_i)}$, $Y^{(N_i)}$ and $Z^{(N_i)}$ represent the observation values in neighborhood $N_i$. Naturally, the ability of these estimates to capture non-linear relationships depends on the size of the neighborhood. Adequate neighborhood sizes should be chosen subject to their ability to capture local relationships. Small neighborhoods would potentially suffer from large variance, whereas large neighborhoods might be increasingly biased and fail to capture local structures. When applied to time-series modeling, neighborhood selection is most intuitively based on the domain of the response variable (the 'effect' of the 'cause' variables). Simulations (see Figure 8) confirm the ability of these estimates to detect non-linear relationships.
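The mechanics of a neighborhood-based estimate can be sketched as follows. Several choices here are assumptions made for illustration only: the local estimate is taken to be the average of a measure evaluated on each neighborhood, neighborhoods are the k nearest observations in the domain of the response variable, and a squared Pearson correlation stands in for the information theoretic measures, which would be substituted through the `measure` argument. The function names and the toy data are hypothetical.

```python
import numpy as np

def neighborhoods(response, k):
    """For each observation, indices of its k nearest observations in the response domain.

    The observation itself is included, since its distance to itself is zero.
    """
    dist = np.abs(response[:, None] - response[None, :])
    return np.argsort(dist, axis=1, kind="stable")[:, :k]

def local_estimate(measure, response, cause, k):
    """Average of `measure` evaluated on the neighborhood of every observation."""
    idx = neighborhoods(response, k)
    return float(np.mean([measure(response[nb], cause[nb]) for nb in idx]))

def squared_correlation(u, v):
    """Stand-in measure for this sketch; not one of the measures used in the study."""
    return float(np.corrcoef(u, v)[0, 1] ** 2)

# Hypothetical toy data: a cubic dependence of the response on the cause.
rng = np.random.default_rng(3)
cause = rng.uniform(-2.0, 2.0, size=200)
response = cause**3 + 0.1 * rng.standard_normal(200)

global_score = squared_correlation(response, cause)
local_score = local_estimate(squared_correlation, response, cause, k=10)
print(global_score, local_score)
```

Setting k equal to the number of observations makes every neighborhood the full sample, so the local estimate reduces to the global one. Smaller values of k trade bias for variance, as discussed above.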

Tests of significance
Although parametric distributions exist for some information theoretic measures under specific conditions [36], our aim is a non-parametric framework for modeling and testing. We therefore employ bootstrap resampling to estimate the probability distributions of the chosen measures of divergence under the null hypotheses.
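A resampling test of this kind can be sketched as below. The specific scheme is an assumption made for illustration: the candidate cause is resampled with replacement, which breaks its pairing with the response and so approximates the null hypothesis of no dependence. The absolute correlation is used only as a placeholder statistic, and the function name `bootstrap_pvalue` and the toy data are hypothetical.

```python
import numpy as np

def bootstrap_pvalue(statistic, x, y, n_resamples=1000, seed=0):
    """Estimate a p-value for statistic(x, y) against a resampled null distribution.

    The null distribution is built by resampling y with replacement,
    which destroys any dependence between x and y.
    """
    rng = np.random.default_rng(seed)
    observed = statistic(x, y)
    n = len(y)
    null_scores = np.empty(n_resamples)
    for b in range(n_resamples):
        y_star = y[rng.integers(0, n, size=n)]
        null_scores[b] = statistic(x, y_star)
    # The added ones keep the estimated p-value strictly positive.
    p_value = (1 + np.sum(null_scores >= observed)) / (n_resamples + 1)
    return observed, float(p_value)

def abs_correlation(u, v):
    """Placeholder statistic; a measure of divergence would be used in its place."""
    return float(abs(np.corrcoef(u, v)[0, 1]))

rng = np.random.default_rng(4)
y_dep = rng.standard_normal(200)
x_dep = y_dep + 0.1 * rng.standard_normal(200)      # strongly dependent pair
x_ind = rng.standard_normal(200)                    # independent of y_dep

score_dep, p_dep = bootstrap_pvalue(abs_correlation, x_dep, y_dep, n_resamples=200)
score_ind, p_ind = bootstrap_pvalue(abs_correlation, x_ind, y_dep, n_resamples=200)
print(p_dep, p_ind)
```

For time-series data the resampling would have to respect the lag structure of the conditioning variables, which this simplified two-sample sketch does not attempt.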

Results
The following series of simulations is designed to evaluate the performance of the framework of Granger causality analysis discussed so far. In the bivariate and functional time-series, the parametric framework stated in the Methods has also been included. The focus in the functional time-series is on (19), as this measure performs nearly identically to (18) and (20). In the multivariate time-series, the aim is to investigate the ability of the framework to detect causal links in high-dimensional spaces.

Bivariate time-series
In the simulated bivariate time-series only one variable is designed to be self-generative. More specifically, we define two variables $X$ and $Y$, and denote the temporal index by $t$, with $x_t = a$ and $y_t = b$ where $a$ and $b$ are random numbers drawn from the standard Gaussian distribution. The remaining values are defined in terms of lagged values and a noise term, where $\epsilon \sim N(0, 1)$, $k = 10$, and $g_c(i) = c \cdot i$ determines the degree to which lags of $y$ are designed to correlate with $x_t$. Common choices for $c$ in the following simulations include 0.1, 1, 2, 5 and 10. Each realization of $X$ and $Y$ consists of 100 $\{x_t\}$ and $\{y_t\}$, respectively; that is, 1100 sample points in $X$ and $Y$. At each lag $k$ two null hypotheses are tested: $H_0^X$, under which $\{Y\}$ does not Granger-cause $\{X\}$, and $H_0^Y$, under which $\{X\}$ does not Granger-cause $\{Y\}$.

The hypotheses $H_0^X$ and $H_0^Y$ are tested via the classical linear regression framework in 2.1. In addition, using density estimation via KDE as devised in (7) and (6), the hypotheses above are tested by employing the Euclidean distance (9), the normalized conditional mutual information (16), the conditional symmetric uncertainty (17), and the Jensen-Shannon divergence (12). The results, based on 100 realizations of $X$ and $Y$ using $B = 1000$ in the bootstrap resamplings, are displayed in Figures 2, 3, 4, 5 and 6, and summarized in Figure 7. The arrangement of panels in Figures 2-6 is as follows: panels along the horizontal axis represent the noise levels 0.1, 1, 2, 5 and 10, whereas panels along the vertical axis represent the number of included lags: 1, 5 and 10. The gray probability masses represent the reference distributions under the null hypothesis $H_0^X$. These distributions are either evaluated analytically (as for the Granger-Sargent test) or obtained computationally using the bootstrap resampling scheme outlined above. The crosses on the x-axes of these probability masses represent the 'empirical' scores of the simulations. Increasing noise levels, regardless of the number of included lags, lead to less frequent rejections of the null hypotheses. That is, the detectability of the synthetic causal link between the two variables is degraded as a function of added noise.

The results above under $H_0^X$, and additionally under $H_0^Y$, are summarized in Figure 7 in terms of p-values for each noise level at each lag. Here, the Euclidean distance $d_E$ is outperformed by the Granger-Sargent test and the divergence measures under $H_0^X$. Additionally, the information theoretic measures of divergence outperform the Granger-Sargent test in the detection of causal relationships. Among the information theoretic measures, the Jensen-Shannon divergence performs best.

Functional time-series
Here we define two variables $X$ and $Y$, and denote the temporal index by $t$, with $x_t = a$ and $y_t = b$ where $a$ is a random number generated according to a $U(-2, 2)$ distribution and $b$ is a random number drawn from the standard Gaussian distribution. The remaining values are defined in terms of lagged values and a noise term, where $\epsilon \sim N(0, 1)$, $k = 5$, and $g_c(i) = c \cdot i$ determines the degree to which lags of $x$ and $y$ are designed to correlate with $x_t$. In this series $c$ is set to three different values: 1, 2.5 and 5. Each realization of $X$ and $Y$ consists of 50 $\{x_t\}$ and $\{y_t\}$, respectively. The null hypothesis is identical to that of the bivariate time-series and is tested using the Granger-Sargent test and the neighborhood-based normalized conditional mutual information as defined in (19), where the neighborhood size is chosen to include the 10 nearest neighbors of every observation.

The results, based on 100 simulations of $X$ and $Y$ using $B = 500$ in the bootstrap resamplings, are displayed in Figure 8. The groups of scatterplots on the left side of Figure 8 represent the bivariate distributions of $X$ at lag 0 against $X$ and $Y$ at lags $k = 1, \ldots, 5$. The scatterplots are arranged vertically, top down, with increasing noise levels. The three panels on the right side of Figure 8 represent the p-values yielded by the simulations. Regardless of the noise level, the Granger-Sargent test fails to capture the functional causal link between $X$ and $Y$, whereas the neighborhood-based normalized conditional mutual information $NI_N$ detects the relationship at all noise levels.

Multivariate time-series
The simulated multivariate time-series consists of four variables (nodes) $W$, $X$, $Y$ and $Z$, and resembles a type of reverse design in its construction compared to the former bivariate cases. More specifically, $w_t = a$, $x_t = b$, $y_t = c$ and $z_t = d$, where $a$, $b$, $c$ and $d$ are generated randomly according to the standard Gaussian distribution. Additionally, $k = 10$ and each noise term $\epsilon$ is similarly generated randomly according to $N(0, 1)$. The remaining values are constructed with the causal signals embedded in 'late' lags. This specific setting is designed to test whether the employed framework of KDE, divergence quantification and bootstrap tests of significance can capture time-resolved correlations in high dimensions. The cross-temporal correlations increase as more lags from the 'past' are included in the model. Consequently, during this process the space in which the data is embedded is inflated as more dimensions (lags) are added to the KDE. The results of the simulation using $B = 1000$ bootstrap resamplings, presented in Figures 9 and 10 and based on the Jensen-Shannon divergence and the normalized conditional mutual information respectively, help to illuminate the motivation behind the specific design.

Figures 9 and 10 are arranged as follows. The four nodes and their conditional causal links are represented for each progressive inclusion of lags. Any presence of a conditional causal link between two nodes is marked by a directed arrow (for p-values < 0.05). Absences of conditional causal links are marked by dashed lines. The color-coding of the dashed lines denotes the rank correlation coefficients between one node at lag 0 and the other node at the most recently included lag. The color-coding of the arrows denotes the same quantity between the cause at the most recently included lag and the effect at lag 0.

As evident in Figures 9 and 10, the two frameworks perform equally well in unveiling the conditional causal relationships between the nodes. One slight difference, however, is the order in which the causal links are detected. Closer investigation revealed that this phenomenon had its roots in the between-simulation differences in random number generation. As the causal signals have been embedded in 'late' lags, easing their detectability by looking 'farther back' in time, one significant outcome of these simulations is the power of information theoretic measures to detect conditional causal links in relatively high-dimensional spaces.

Discussion
Given the abundance of multivariate time-series data in the social and life sciences, there is an evident and inherent interest in moving beyond stationary correlative analysis to dynamic analysis of time-resolved causal relationships. The concept of Granger causality, although not synonymous with causality itself, offers a powerful framework for analysis of causality in time-series data.
Regarding the elaborated methods to model Granger causality, techniques based on assumptions of parametricity (e.g. ordinary linear regression) are superior to non-parametric designs in terms of their lower computational demands and more accessible interpretability. Nevertheless, meeting the assumptions of parametric models may not always be feasible.
The framework outlined in this study, based on kernel density estimation (KDE) via adaptive bandwidths, information theoretic measures of divergence, and bootstrap tests of significance, constitutes a fully non-parametric platform for analysis of Granger causality. Additionally, the inferior sensitivity of information theoretic measures derived via KDE to non-linear relationships has been addressed using local neighborhood-based estimates of these measures. Furthermore, the results from our simulations, based on synthetic linear and non-linear (functional) causal relationships, confirm the ability of the discussed platform to detect causal relationships over a varying array of signal-to-noise ratios. In the multivariate setting, both frameworks succeed in revealing the underlying conditional causal relationships between the nodes, matching the design of the simulations.

On prospective directions of research, unbiased estimates of differential information theoretic measures of divergence should be assigned top priority. As in earlier studies, we have seen here that differential information theoretic measures of divergence estimated via KDE fail to adhere to their non-parametric premise in the presence of non-linear relationships. Possible solutions to this problem are discretization of the data space, altered kernel functions in the density estimates, or higher-dimensional projections of the data space combined with supervised regularization. Regardless of possible prospective routes, the relevance of further improvements should be judged by the specific aim and question of the analysis.

Overall, we have shown further advances in non-parametric analysis of Granger causality, allowing this type of temporal data analysis to take better advantage of the available information, given the abundance of data that do not meet parametric assumptions in numerous scientific fields such as econometrics, biology and climatology. The choice of non-parametric analysis of Granger causality is further motivated by ever growing computational power, which facilitates a considerable increase in the efficiency of such frameworks of analysis.
Acknowledgements. The author wishes to thank Joanna Tyrcha and John G. Lock for insightful discussions and feedback. The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement # 258068; EU-FP7-Systems Microscopy NoE; from the Swedish Research Council grant # 340-2012-6011; and from the Center for Biosciences at Karolinska Institutet.

Jensen-Shannon divergence and conditional mutual information
can be derived further as: where $I(X; Y \mid Z)$ is referred to as the conditional mutual information of X and Y given Z [10]. Similar proofs without using conditional densities can be found in [2,16,49].
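For reference, the standard textbook definition of conditional mutual information for continuous variables can be written as below. This is a general statement in terms of conditional densities, not a reconstruction of the derivation above; the density symbols $f_{XYZ}$, $f_{XY|Z}$, $f_{X|Z}$, $f_{Y|Z}$ and $f_{X|YZ}$ are notation assumed here for illustration.

```latex
\begin{align}
I(X; Y \mid Z)
  &= \iiint f_{XYZ}(x, y, z)\,
     \log \frac{f_{XY|Z}(x, y \mid z)}{f_{X|Z}(x \mid z)\, f_{Y|Z}(y \mid z)}
     \, dx\, dy\, dz \\
  &= \iiint f_{XYZ}(x, y, z)\,
     \log \frac{f_{X|YZ}(x \mid y, z)}{f_{X|Z}(x \mid z)}
     \, dx\, dy\, dz .
\end{align}
```

The second line follows from $f_{XY|Z} = f_{X|YZ}\, f_{Y|Z}$ and shows that $I(X; Y \mid Z)$ is the expected Kullback-Leibler divergence between $f_{X|YZ}$ and $f_{X|Z}$; it is zero exactly when X and Y are conditionally independent given Z.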

Figure 2: The Granger-Sargent test. Results based on simulations of bivariate time-series for k = 1..10 lags and 5 levels of noise as defined in Bivariate time-series. The + signs on the x-axes represent the yielded scores from the Granger-Sargent test as formulated in (5). The gray probability mass is that of the null hypothesis $H_0^X$ under simulated data using the same test statistic.

Figure 3: The Euclidean distance. Results based on simulations of bivariate time-series for k = 1..10 lags and 5 levels of noise as defined in Bivariate time-series. The + signs on the x-axes represent the yielded scores from the Euclidean distance as formulated in (9). The gray probability mass is that of the null hypothesis $H_0^X$ under simulated data using the same measure of divergence.

Figure 4: Normalized conditional mutual information. Results based on simulations of bivariate time-series for k = 1..10 lags and 5 levels of noise as defined in Bivariate time-series. The + signs on the x-axes represent the yielded scores from the normalized conditional mutual information as formulated in (16). The gray probability mass is that of the null hypothesis $H_0^X$ under simulated data using the same measure of divergence.

Figure 5: Symmetric uncertainty. Results based on simulations of bivariate time-series for k = 1..10 lags and 5 levels of noise as defined in Bivariate time-series. The + signs on the x-axes represent the yielded scores from the normalized conditional mutual information as formulated in (17). The gray probability mass is that of the null hypothesis $H_0^X$ under simulated data using the same measure of divergence.

Figure 6: The Jensen-Shannon divergence. Results based on simulations of bivariate time-series for k = 1..10 lags and 5 levels of noise as defined in Bivariate time-series. The + signs on the x-axes represent the yielded scores from the Jensen-Shannon divergence as formulated in (13). The gray probability mass is that of the null hypothesis $H_0^X$ under simulated data using the same measure of divergence.

Figure 7: Summary of the bivariate simulations. The aggregate results of the simulated bivariate time-series as defined in Bivariate time-series, demonstrated using p-values under the null hypotheses. The five horizontal panels represent the five different noise levels (c) implemented in the simulations. The upper panels represent the results from the hypotheses based on X ($H_0^X$) being the effect, whereas the lower panels represent the results from the hypotheses based on Y ($H_0^Y$) being the effect. The employed measures of divergence are displayed in the legends.

Figure 8: Functional time-series and the neighbourhood-based divergence. The results of the functional time-series simulations as outlined in Functional time-series. The bivariate scatter plots demonstrate the functional relationship between the two variables at each lag and for each noise level. The plotted p-values are associated with the Granger-Sargent test and the neighbourhood-based normalized conditional mutual information.

Figure 9: Multivariate time-series results using the Jensen-Shannon divergence. The aggregate results of the simulated multivariate time-series as outlined in Multivariate time-series. The measure of interest here is the Jensen-Shannon divergence. The results are represented in a lag-wise manner, as each lag stretching further back into the 'past' reveals a stronger causal relationship on aggregate. The color-coding of the arrows is based on the Spearman correlation coefficient of the most recently included lag with the response variable (the effect).

Figure 10: Multivariate time-series results using the normalized conditional mutual information. The aggregate results of the simulated multivariate time-series as outlined in Multivariate time-series. The measure of interest here is the normalized conditional mutual information. The results are represented in a lag-wise manner, as each lag stretching further back into the 'past' reveals a stronger causal relationship on aggregate. The color-coding of the arrows is based on the Spearman correlation coefficient of the most recently included lag with the response variable (the effect).
The bootstrap resampling is conducted B times with replacement under $H_0: X \perp Y \mid Z$, where at each instance of resampling the empirical observations in Y are randomly permuted to artificially create the independence stated under $H_0$. Comparing the empirically quantified measure of interest with the bootstrapped distribution under $H_0$ yields the relevant p-values. Naturally, depending on the chosen measure, violations of $H_0$ can occur below the lower quantiles, e.g. $D_{JS}(f_{X|YZ} \,\|\, f_{X|Z})$, or above the upper quantiles, e.g. $I(X; Y \mid Z)$, of the bootstrapped distributions.
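The resampling scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: it permutes Y only (no additional resampling with replacement), and the test statistic `binned_cmi` is a hypothetical quantile-binned plug-in estimate of $I(X; Y \mid Z)$ standing in for the KDE-based estimates. The function names, the number of bins and the toy data are all assumptions introduced here.

```python
import numpy as np


def binned_cmi(x, y, z, bins=4):
    """Plug-in estimate of I(X;Y|Z) on quantile-binned data.

    Illustrative stand-in for a KDE-based estimate (an assumption of this sketch).
    """
    def discretize(v):
        # Interior quantile edges give `bins` roughly equally populated cells.
        edges = np.quantile(v, np.linspace(0, 1, bins + 1)[1:-1])
        return np.digitize(v, edges)

    xd, yd, zd = discretize(x), discretize(y), discretize(z)
    joint = np.zeros((bins, bins, bins))
    np.add.at(joint, (xd, yd, zd), 1.0)   # 3-D contingency table
    joint /= joint.sum()
    p_z = joint.sum(axis=(0, 1))          # p(z)
    p_xz = joint.sum(axis=1)              # p(x, z)
    p_yz = joint.sum(axis=0)              # p(y, z)
    cmi = 0.0
    for i, j, k in zip(*np.nonzero(joint)):
        cmi += joint[i, j, k] * np.log(
            joint[i, j, k] * p_z[k] / (p_xz[i, k] * p_yz[j, k])
        )
    return cmi


def permutation_pvalue(x, y, z, statistic, n_boot=200, upper_tail=True, seed=0):
    """Compare the observed statistic with its distribution under permuted Y."""
    rng = np.random.default_rng(seed)
    observed = statistic(x, y, z)
    # Permuting Y breaks its association with X (and Z), emulating H0.
    null = np.array([statistic(x, rng.permutation(y), z) for _ in range(n_boot)])
    if upper_tail:
        extreme = np.sum(null >= observed)   # e.g. mutual-information-type measures
    else:
        extreme = np.sum(null <= observed)   # measures where small values violate H0
    # Add-one correction keeps the p-value strictly positive.
    return (extreme + 1) / (n_boot + 1), observed, null


# Toy data: X depends on Y, while Z is independent noise.
rng = np.random.default_rng(1)
y = rng.normal(size=500)
z = rng.normal(size=500)
x = y + 0.5 * rng.normal(size=500)

p_value, observed, null = permutation_pvalue(x, y, z, binned_cmi)
print(p_value, observed, null.mean())
```

The `upper_tail` flag reflects the remark that the relevant tail of the bootstrapped distribution depends on the chosen measure; any callable taking `(x, y, z)` can be substituted for `binned_cmi`.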