| Title: | Evaluate Correlation Between Time Series |
|---|---|
| Description: | This package takes two time series as inputs and first, determines the lag for the primary time series at which the cross-correlation between the two time series is highest. Second, the rolling correlation is calculated between the lagged primary time-series and the secondary time series. |
| Authors: | Kylie Ainslie [aut, cre] (ORCID: <https://orcid.org/0000-0001-5419-7206>) |
| Maintainer: | Kylie Ainslie <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.2.0 |
| Built: | 2026-07-02 08:56:07 UTC |
| Source: | https://github.com/mrc-ide/pika |
For each group, computes the mean of count_var over a baseline period
of n_baseline_periods consecutive time steps starting at
start_date (or the earliest date if start_date is NULL).
Each observation is then expressed as a fractional change relative to that
baseline mean:
calc_percent_change( dat, date_var = "date", grp_var, count_var, n_baseline_periods = 7, start_date = NULL )calc_percent_change( dat, date_var = "date", grp_var, count_var, n_baseline_periods = 7, start_date = NULL )
dat |
A data frame containing a count column, a date column, and a grouping column. |
date_var |
Character string giving the name of the date column (class
|
grp_var |
Character string giving the name of the grouping column. The baseline mean is computed separately per group. |
count_var |
Character string giving the name of the count column. |
n_baseline_periods |
Integer. Number of consecutive time steps used to
compute the baseline mean. For daily data, |
start_date |
Start date of the baseline period. Accepts a |
A value of 0 indicates no change from baseline; -0.5 indicates a 50% decrease; 1.0 indicates a doubling. Originally developed for population mobility data but applicable to any non-negative count series.
The input data frame with one additional numeric column,
perc_change, giving each observation as a fractional change
relative to the group-specific baseline mean (0 = no change from
baseline, -1 = zero counts, positive values = above baseline).
## Not run: dat_pct <- calc_percent_change( dat = mobility_data, date_var = "date", grp_var = "region", count_var = "trips", n_baseline_periods = 7, start_date = "2020-01-13" ) ## End(Not run)## Not run: dat_pct <- calc_percent_change( dat = mobility_data, date_var = "date", grp_var = "region", count_var = "trips", n_baseline_periods = 7, start_date = "2020-01-13" ) ## End(Not run)
The case data has daily confirmed cases for different provinces in China from 16 January to 24 March 2020 from the dashboard maintained by Chinese Center for Disease Prevention and Control (CCDC). The CCDC dashboard collates numbers of confirmed cases reported by national and local health commissions in each province in mainland China, and Hong Kong SAR and Macau SAR. Confirmed cases are defined as suspected cases, who have epidemiological links and/or clinical symptoms, and are detected with SARS-CoV-2 by PCR tests. However, in Hubei province, clinically diagnosed cases were additionally included between 12 and 19 February.
china_case_datachina_case_data
A data frame with 483 rows and 3 variables:
date, in YYYY-MM-DD format
name of province/region in China where cases were reported
number of daily cases reported of COVID-19
http://2019ncov.chinacdc.cn/2019-nCoV/
Computes the cross-correlation function (CCF) between x_var and
y_var for lags from -max_lag to 0 using
ccf, then returns the lag with the highest CCF for each
group. Only non-positive lags are considered (i.e. x_var leading
y_var), reflecting the assumption that changes in the primary series
precede changes in the secondary series.
cross_corr( dat, date_var = NULL, grp_var, x_var, y_var, max_lag = 20, subset_date = NULL )cross_corr( dat, date_var = NULL, grp_var, x_var, y_var, max_lag = 20, subset_date = NULL )
dat |
A data frame containing the two time series and a grouping column. |
date_var |
Character string giving the name of the date column. Required
when |
grp_var |
Character string giving the name of the grouping column. The CCF is computed separately within each group. |
x_var |
Character string giving the name of the primary (leading) time series column. |
y_var |
Character string giving the name of the secondary (lagged) time series column. |
max_lag |
Integer. Maximum number of lags to evaluate. CCF is computed
for lags |
subset_date |
Character string in the same format as |
A tibble with one row per group containing:
<grp_var>Group identifier; column name matches grp_var.
lagInteger . The lag at which the CCF between
x_var and y_var is highest within that group.
rolling_corr to compute rolling correlation at the
identified lag; ccf for the underlying CCF method.
## Not run: lags <- cross_corr( dat = my_data, date_var = "date", grp_var = "region", x_var = "r_mean", y_var = "movement", max_lag = 14 ) ## End(Not run)## Not run: lags <- cross_corr( dat = my_data, date_var = "date", grp_var = "region", x_var = "r_mean", y_var = "movement", max_lag = 14 ) ## End(Not run)
A grouped wrapper around estimate_R (Cori et al.
2013). For each group, Rt is estimated in a sliding weekly window using a
Bayesian framework with a Gamma-distributed serial interval. Results from all
groups are combined into a single data frame.
estimate_rt( dat, grp_var, date_var, incidence_var, est_method = "parametric_si", si_mean = 6.48, si_std = 3.83 )estimate_rt( dat, grp_var, date_var, incidence_var, est_method = "parametric_si", si_mean = 6.48, si_std = 3.83 )
dat |
A data frame with at least a date column, an incidence column,
and a grouping column. No |
grp_var |
Character string giving the name of the grouping column. Rt is estimated independently for each group. |
date_var |
Character string giving the name of the date column. |
incidence_var |
Character string giving the name of the daily incidence (case count) column. |
est_method |
Character string specifying the serial interval estimation
method passed to |
si_mean |
Mean of the serial interval distribution (days). Used when
|
si_std |
Standard deviation of the serial interval distribution (days).
Used when |
The default serial interval parameters (si_mean = 6.48,
si_std = 3.83) are from Nishiura et al. (2020) for COVID-19 and
should be updated for other pathogens.
A data frame with one row per estimation window per group, containing:
date_startStart date of the estimation window.
date_endEnd date of the estimation window.
<grp_var>Group identifier; column name matches grp_var.
r_meanPosterior mean Rt.
r_medianPosterior median Rt.
r_q2.52.5th percentile of the posterior (lower 95% credible interval).
r_q97.597.5th percentile of the posterior (upper 95% credible interval).
Cori A, Ferguson NM, Fraser C, Cauchemez S (2013). A new framework and software to estimate time-varying reproduction numbers during epidemics. American Journal of Epidemiology, 178(9), 1505–1512. doi:10.1093/aje/kwt133
Nishiura H, Linton NM, Akhmetzhanov AR (2020). Serial interval of novel coronavirus (COVID-19) infections. International Journal of Infectious Diseases, 93, 284–286. doi:10.1016/j.ijid.2020.02.060
estimate_R for full estimation control,
including non-parametric serial intervals.
## Not run: rt_estimates <- estimate_rt( dat = china_case_data, grp_var = "province", date_var = "date", incidence_var = "cases" ) ## End(Not run)## Not run: rt_estimates <- estimate_rt( dat = china_case_data, grp_var = "province", date_var = "date", incidence_var = "cases" ) ## End(Not run)
The daily within-city movement data, used as a proxy for economic activity, is available from 1 January to 24 March 2020 for major metropolitan cities within each province in mainland China, Hong Kong SAR, and Macau SAR. These data, provided by Exante Data Inc, measured travel activity relative to the 2019 average (excluding Lunar New Year). The underlying data are based on near real-time people movement statistics from Baidu.
exante_movement_dataexante_movement_data
A data frame with 672 rows and 3 variables:
date, in YYYY-MM-DD format
name of province/region in China
daily population-weighted within-city movement index
Produces a faceted line plot showing the primary series (x_var),
secondary series (y_var), and rolling correlation (roll_corr)
over time, with one facet per group. Horizontal reference lines are drawn
at y = -1 (dashed), 0 (solid), and 1 (dashed). Optionally adds a shaded
confidence ribbon around x_var. If dat contains a column
named roll_corr (e.g. from rolling_corr) it will be
included in the plot; if absent the line is simply omitted.
plot_corr( dat, date_var, grp_var, x_var, y_var, x_var_lower = NULL, x_var_upper = NULL, facet_labels = NULL, legend_labels = NULL, y_max = NULL, col_values = c(brewer.pal(8, "RdPu")[8], brewer.pal(8, "Greens")[5], brewer.pal(8, "Blues")[8]) )plot_corr( dat, date_var, grp_var, x_var, y_var, x_var_lower = NULL, x_var_upper = NULL, facet_labels = NULL, legend_labels = NULL, y_max = NULL, col_values = c(brewer.pal(8, "RdPu")[8], brewer.pal(8, "Greens")[5], brewer.pal(8, "Blues")[8]) )
dat |
A data frame containing the two time series, a |
date_var |
Character string giving the name of the date column (class
|
grp_var |
Character string giving the name of the grouping column used for faceting. |
x_var |
Character string giving the name of the primary time series column. |
y_var |
Character string giving the name of the secondary time series column. |
x_var_lower |
Character string giving the name of the column containing
the lower confidence bound for |
x_var_upper |
Character string giving the name of the column containing
the upper confidence bound for |
facet_labels |
Named character vector of display labels for the facets,
passed to |
legend_labels |
Character vector of length 3 giving legend labels for
|
y_max |
Numeric. Maximum value for the y-axis. If supplied, the axis is
set to |
col_values |
Character vector of length 3 specifying line colours for
|
A ggplot object.
rolling_corr to compute roll_corr;
plot_lag to visualise the lag distribution.
## Not run: plot_corr( dat = data_corr, date_var = "date_end", grp_var = "province", x_var = "r_mean", y_var = "movement", x_var_lower = "r_q2.5", x_var_upper = "r_q97.5", legend_labels = c("Rolling correlation", "Rt", "Mobility") ) ## End(Not run)## Not run: plot_corr( dat = data_corr, date_var = "date_end", grp_var = "province", x_var = "r_mean", y_var = "movement", x_var_lower = "r_q2.5", x_var_upper = "r_q97.5", legend_labels = c("Rolling correlation", "Rt", "Mobility") ) ## End(Not run)
Produces a histogram of the lag values returned by cross_corr,
showing the distribution of optimal lags across groups.
plot_lag(dat, lag_var, bins = 2)plot_lag(dat, lag_var, bins = 2)
dat |
A data frame containing a lag column, typically the output of
|
lag_var |
Character string giving the name of the lag column to plot. |
bins |
Numeric. Bin width for the histogram. Default is 2. |
A ggplot object.
cross_corr to compute the lag values;
plot_corr to visualise the time series and rolling correlation.
## Not run: lags <- cross_corr( dat = my_data, grp_var = "region", x_var = "r_mean", y_var = "movement" ) plot_lag(lags, lag_var = "lag") ## End(Not run)## Not run: lags <- cross_corr( dat = my_data, grp_var = "region", x_var = "r_mean", y_var = "movement" ) plot_lag(lags, lag_var = "lag") ## End(Not run)
Computes the Pearson correlation between x_var and y_var
over a rolling window of n time periods within each group, using
runCor. The first n - 1 observations in each
group will be NA because there are insufficient data to fill the
window.
rolling_corr(dat, date_var = "date", grp_var, x_var, y_var, n = 14)rolling_corr(dat, date_var = "date", grp_var, x_var, y_var, n = 14)
dat |
A data frame containing the two time series, a date column, and a grouping column. |
date_var |
Character string giving the name of the date column. Must be
of class |
grp_var |
Character string giving the name of the grouping column. Rolling correlation is computed separately within each group. |
x_var |
Character string giving the name of the primary time series column. |
y_var |
Character string giving the name of the secondary time series column. |
n |
Integer. Width of the rolling window in time periods. Default is 14. |
A data frame with the same columns as the input plus one additional
numeric column, roll_corr, containing the rolling Pearson
correlation between x_var and y_var. Values range from -1
to 1. The first n - 1 observations per group are NA. Note
that rows where x_var or y_var are NA are removed
before the rolling correlation is computed, so the returned frame may have
fewer rows than the input.
cross_corr to identify the optimal lag before
computing rolling correlation; runCor for the
underlying method; plot_corr to visualise the result.
## Not run: data_corr <- rolling_corr( dat = my_data, date_var = "date", grp_var = "region", x_var = "r_mean", y_var = "movement", n = 14 ) ## End(Not run)## Not run: data_corr <- rolling_corr( dat = my_data, date_var = "date", grp_var = "region", x_var = "r_mean", y_var = "movement", n = 14 ) ## End(Not run)