%\VignetteIndexEntry{gdpc: An R Package for Generalized Dynamic Principal Components}
%\VignetteEngine{R.rsp::tex}
\documentclass[nojss]{jss}
\usepackage{thumbpdf,lmodern} 
%\graphicspath{{Figures/}}

\usepackage{amssymb}
\usepackage{amsfonts}
\usepackage{graphicx, epsfig}
\usepackage{amsmath}
\usepackage{morefloats}
\usepackage{placeins}
\usepackage{booktabs}
\usepackage{graphicx}%
\newcommand{\bbet}{\boldsymbol{\beta}}
\newcommand{\bal}{\boldsymbol{\alpha}}
\newcommand{\bfac}{\mathbf{f}}

\author{Daniel Pe\~na\\UC3M-BS Institute of Financial Big Data and\\ Universidad Carlos III de Madrid \And
  Ezequiel Smucler\\ Universidad Torcuato Di Tella\AND
  Victor J. Yohai\\Universidad de Buenos Aires -- CONICET}

\title{\pkg{gdpc}: An \proglang{R} Package for Generalized Dynamic Principal Components}

\Plainauthor{Daniel Pe\~na, Ezequiel Smucler, Victor J. Yohai} 
\Plaintitle{gdpc: An R Package for Generalized Dynamic Principal Components} 
\Shorttitle{\pkg{gdpc}: Generalized Dynamic Principal Components} 

\Abstract{ \pkg{gdpc} is an \proglang{R} package for the computation
  of the generalized dynamic principal components proposed in
  \cite{PenaYohai2016}.  In this paper, we briefly introduce the
  problem of dynamical principal components, propose a solution based
  on a reconstruction criteria and present an automatic procedure to
  compute the optimal reconstruction. This solution can be applied to
  the non-stationary case, where the components need not be a linear
  combination of the observations, as is the case in the proposal of
  \cite{Brillinger}. This article discusses some new features that are
  included in the package and that were not considered in
  \cite{PenaYohai2016}. The most important one is an automatic
  procedure for the identification of both the number of lags to be
  used in the generalized dynamic principal components as well as the
  number of components required for a given reconstruction
  accuracy. These tools make it easy to use the proposed procedure in
  large data sets. The procedure can also be used when the number of
  series is larger than the number of observations. We describe an
  iterative algorithm and present an example of the use of the package
  with real data.
  
This vignette is based on \cite{JSSours}.  
  }
  
\Keywords{dimensionality reduction, high-dimensional time series, \proglang{R}}

\Plainkeywords{dimensionality reduction, high-dimensional time series, R} 

\Address{
Daniel Pe{\~n}a \\
UC3M-BS Institute of Financial Big Data and\\
Department of Statistics\\
Universidad Carlos III de Madrid\\
E-mail: \email{daniel.pena@uc3m.es}\\

Ezequiel Smucler\\
Department of Mathematics and Statistics\\
Universidad Torcuato Di Tella\\
Avenida Figueroa Alcorta 7350\\
  Buenos Aires 1428, Argentina\\
E-mail: \email{esmucler@utdt.edu}\\

Victor J. Yohai\\
Instituto de C\'alculo\\
 Universidad de Buenos Aires\\
 Ciudad Universitaria, Pabell\'on 2\\
  Buenos Aires 1428, Argentina\\
E-mail: \email{vyohai@dm.uba.ar}
}
\begin{document}

\section{Introduction}

Dimension reduction is important for the analysis of multivariate time
series, particularly for the high-dimensional data sets that are becoming
increasingly common in applications, because the number of parameters in the
usual multivariate time series models grows quadratically with the number of
variables. The first proposal of a dynamic factor model for time series is
due to \cite{Brillinger64, Brillinger} who proposed to apply standard
techniques of factor analysis to the spectral matrix. He also proposed
dynamic principal components, which are two sided linear combinations of the
data which provide an optimal reconstruction. They are obtained by the
inverse Fourier transform of the principal components of the spectral density
matrices for each frequency. \cite{Geweke} proposed a one sided
generalization of the static factor model where the factors and the
innovations are mutually independent and follow covariance stationary linear
processes, and applied standard estimation methods for factor analysis to
the spectral density matrix instead of the covariance matrix. A similar
model was used by \cite{Sargent}, who named their model the index model. In
this model the factors account for all the cross-correlations among the
series, whereas the factors and the innovations account for the
autocorrelation of the observed series. \cite{Pena87} propose a dynamic
factor model for a vector of stationary time series with white noise
innovations and proved that we can estimate the loading matrix by the
eigenvectors corresponding to non null eigenvalues of the lag covariance
matrices of the data. \cite{Stock} use dynamic factors for forecasting, by
assuming that the variable to forecast and the explanatory variables useful
for its forecasting follow a dynamic factor model. \cite{BaiNg} develop a
criterion to consistently estimate the number of factors, which will be used
in our package. \cite{Hu} proposed a test for the number of factors and
explore the generalization of the model for integrated factors that was
carried out by \cite{Pena2006}. \cite{Pan} include general nonstationary
processes for the factors. \cite{Lam2012} proposed a test for the number of
factors based on the eigenvalues of the lag covariance matrices.

\cite{Forni2000} generalized Geweke's dynamic factor model by allowing the
idiosyncratic components to be autocorrelated and contain weak
cross-correlations. The authors proposed to estimate the common components by the
projection of the data on the dynamic principal components proposed by
Brillinger. \cite{Forni2005} proposed a one sided method of estimation of a dynamic factor
model and used the method for forecasting. The forecasts generated with this procedure have been compared to the ones derived by \cite{Stock} and the results are mixed (see %
\citet{Forni2016}). A modified forecasting approach was proposed by \cite%
{Forni2015}, although again the finite sample results are
mixed. \cite{Hallin2013} give a general presentation of the methodological
foundations of dynamic factor models.

\cite{PenaYohai2016} proposed a new approach for defining dynamic principal
components that is different from the one taken by Brillinger in three ways. First, their generalized dynamic
principal components (GDPC, from now on) are built without the assumption
that the series are stationary. If the data is not stationary, the GDPC
minimize a meaningful reconstruction criterion, unlike Brillinger's approach, which
minimizes the expected reconstruction mean square error. Second, the GDPC\ need not be a linear
combination of the original series. Third, they are computed directly in the
time domain, instead of working in the frequency domain. These GDPC are
optimal reconstructors of the observed data but they can also be used to
estimate the common part in high-dimensional dynamic factor models. In fact,
it has been shown that the GDPC\ provide consistent estimates \citep{Smucler17} of the common part of dynamic factor models. Since the GDPC are based on both leads and lags of  the data, like the dynamic principal components defined by Brillinger, they are not useful for forecasting. 
However, they are useful in general as a way to summarize the information in
sets of dependent time series in which the factor structure may not apply.\
Finally, the  optimal reconstructor property of the GDPC\ is useful for data
compression to reduce resources required to store and transmit data. This
makes them potentially useful for compression of dependent data and they
could be applied for image analysis and other types of spatial data in which
dependence is important.

A large number of \proglang{R} packages are available for working with time
series. Besides the `\code{mts}' class provided by the \pkg{stats} package, 
\cite{R}, several \proglang{R} packages provide classes and methods for
handling time-indexed data. For example, the \pkg{zoo} package, \cite{zoo},
provides an \proglang{S}3 class and methods for handling totally ordered indexed
observations, in particular irregular time series. The \pkg{xts} package, 
\cite{xts}, is able to uniformly handle \proglang{R}'s different time-based
data classes. Our \pkg{gdpc} package supports \code{mts}, \code{xts} and %
`\code{zoo}' objects: if the original data is stored in an object of class %
`\code{mts}', \code{xts} or \code{zoo}, then the principal components and
reconstructed time series will also be stored in an object of class %
`\code{mts}', `\code{xts}' or `\code{zoo}' respectively, with the same date
attributes. Among the multivariate time series \proglang{R} packages, the %
\pkg{MTS} package, \cite{MTS}, is a general tool-kit for multivariate time
series that includes vector autoregressive moving average (VARMA) models, factor models, and multivariate
volatility models. \pkg{freqdom}, \cite{freqdom}, implements Brillinger's
dynamic principal components.
\pkg{pcdpca}, \cite{pcdpca}, extends Brillinger's dynamic
principal components to periodically correlated multivariate time series. An
extensive comparison of Brillinger's approach to dynamic principal components and GDPC
can be found in \cite{PenaYohai2016}. The \pkg{BigVAR} package, \cite{BigVAR}%
, estimates vector autoregressive (VAR) models with structured LASSO penalties. Many more packages
can be found at \url{https://CRAN.R-project.org/view=TimeSeries}.
To the best of our knowledge, \pkg{gdpc} \citep{gdpc-pack} is the only
publicly available implementation of GDPC.

This article is organized as follows. In Section~\ref{sec:defin} we briefly
review the definition of the GDPC and presents a new proposal to apply the
procedure in an automatic way. Several possible methods for choosing the
number of components and the number of lags used for each component are
discussed. In Section~\ref{sec:compu} we review the iterative algorithm used
to compute the GDPC. In Section~\ref{sec:using} we illustrate the use of the %
\pkg{gdpc} package using artificial and real data sets. 
We compare the performance 
regarding computation time and the reconstruction of non-stationary time series of the implementation of Brillinger's method found in the \pkg{freqdom} package to that of GDPC
in Section~\ref{sec:rec_non-stat}. In Section~\ref{sec:tuning} we compare the performance
of the different criteria available in the \pkg{gdpc} package to choose the number of lags that define the GDPC.
Section~\ref{sec:comp_times} includes a small simulation study to estimate the typical
computation times of the main algorithm for both stationary and
non-stationary time series. Finally, conclusions are provided in Section~\ref%
{sec:conclu}.

\section{Definition of generalized dynamic principal components}
\label{sec:defin}

Consider a time series vector $\mathbf{z}_{t}=(z_{1,t},\ldots,z_{m,t})$, where $%
1\leq t\leq T,$ and let $\mathbf{Z}$ be the $T\times m$ matrix whose rows
are $\mathbf{z}_{1},\ldots,\mathbf{z}_{T}$. Here $T$ stands for the number of
periods and $m$ for the number of series. We define the first dynamic
principal component with $k$ lags as a vector $\mathbf{f=}(f_{t})_{-k+1\leq
t\leq T},$ so that the reconstruction of series $z_{j,t},1\leq$ $j\leq m,$
as a linear combination of $(f_{t-k},\dots, f_{t-1}, f_{t})$ is optimal with
respect to the mean squared error (MSE) criterion. More precisely, given a
possible factor $\mathbf{f}$ of length $(T+k)$, an $m\times(k+1)$ matrix of
coefficients $\boldsymbol{\beta}=(\beta_{j,h})_{1\leq j\leq m,1\leq h\leq
k+1}$ and $\boldsymbol{\alpha}=(\alpha_{1},\ldots,\alpha_{m})$, the
reconstruction of the original series $z_{j,t}$ is defined as 
\begin{equation}
{z}^{R,b}_{j,t}(\mathbf{f}, \boldsymbol{\beta}, \boldsymbol{\alpha})=\alpha_{j}+\sum_{h=0}^{k}\beta_{j,h+1}f_{t-h}.
\label{eq:recons}
\end{equation}
The $R$ superscript in ${z}^{R,b}_{j,t}$ stands for reconstruction and the $b$ superscript
stands for backward, due to the reconstruction being defined using the lags of $f_t$, in constrast with the reconstruction using leads, to be defined later.

The MSE loss function when we reconstruct the $m$ series using $\mathbf{f}$, 
$\boldsymbol{\beta}$ and $\boldsymbol{\alpha}$ is given by 
\begin{equation}
\text{MSE}(\mathbf{f,\boldsymbol{\beta},\boldsymbol{\alpha})}=\frac{1}{Tm}%
\sum_{j=1}^{m}\sum_{t=1}^{T}(z_{j,t}-{z}^{R,b}_{j,t}(\mathbf{f}, \boldsymbol{\beta}, \boldsymbol{\alpha}))^{2}
\label{eq:MSE}
\end{equation}
and the values that minimize this MSE are called $(\widehat{\mathbf{f}},%
\widehat{\boldsymbol{\beta}},\widehat{\boldsymbol{\alpha}})$. The value of $%
\widehat{\mathbf{f}}$ is not identified as we can multiply the coefficients $%
\beta_{j,h+1}$ by a constant and divide $f_{t-h}$ by the same contant and the
model is the same. To solve this issue we take $\widehat{\mathbf{f}}$ with
zero mean and unit variance.  $\widehat{\mathbf{f}}$ is then the first GDPC. 
Note that for $k=0$, $\widehat{\mathbf{f}}$ is
the first ordinary (non-dynamic) principal component of the data. The second
GDPC is defined as the first GDPC of the residuals%
\begin{equation*}
r_{j,t}=z_{j,t}-z^{R,b}_{j,t}(\widehat{\mathbf{f}}, \widehat{\boldsymbol{\beta}}, \widehat{\boldsymbol{\alpha}}),\quad 1\leq j\leq m,1\leq t\leq T.
\end{equation*}
Higher order GDPC are defined in a similar fashion.

Note that the reconstruction given in Equation~\ref{eq:recons} can be
written using leads instead of lags. Suppose to simplify that $\alpha_{j}=0$
and $k=1$ so that we have%
\begin{equation*}
{z}^{R,b}_{j,t}(\mathbf{f}, \boldsymbol{\beta}, \boldsymbol{\alpha})=\beta_{j,1}f_{t}+\beta_{j,2}f_{t-1}.
\end{equation*}
Given $\mathbf{f}, \boldsymbol{\beta}$ and $\boldsymbol{\alpha}$, let $f_{t+1}^{\ast}=$ $f_{t}$, $\beta_{j,2}^{\ast}=\beta_{j,1}$, $%
\beta_{j,1}^{\ast}=\beta_{j,2}$ and $\alpha^{\ast}_{j}=0$. Then
\begin{equation*}
{z}^{R,b}_{j,t}(\mathbf{f}, \boldsymbol{\beta}, \boldsymbol{\alpha})=
{z}^{R,f}_{j,t}(\mathbf{f^{\ast}}, \boldsymbol{\beta}^{\ast}, \boldsymbol{\alpha}^{\ast})=\beta_{j,2}^{\ast}f_{t+1}^{\ast}+\beta_{j,1}^{\ast}f_{t}^{%
\ast},
\end{equation*}
that is the same equation but now using leads. We just shift one position
the series of the principal components and interchange the values of the $%
\beta$ coefficients to obtain an equivalent representation but now using
leads instead of lags. In general given $\mathbf{f}, \boldsymbol{\beta}$ and $\boldsymbol{\alpha}$ we can define 
\begin{align}
&f_{t+k}^{\ast}=f_{t}, \quad t=1-k,\dots,T,
\label{eq:leadslags1}
\\ &\beta_{j,k+2-g}^{\ast}=\beta_{j,g}, \quad 1\leq g\leq k+1, j=1,\dots,m,
\label{eq:leadslags2}
\\ & \alpha^{\ast}_{j}=\alpha_{j}, \quad j=1,\dots,m
\label{eq:leadslags3}
\end{align}
and write 
\begin{equation*}
{z}^{R,f}_{j,t}(\mathbf{f^{\ast}}, \boldsymbol{\beta}^{\ast}, \boldsymbol{\alpha}^{\ast})=\alpha^{\ast}_{j}+\sum_{h=0}^{k}\beta_{j,h+1}^{\ast}f_{t+h}^{\ast
}.  
\end{equation*}
Clearly
\begin{equation*}
{z}^{R,f}_{j,t}(\mathbf{f^{\ast}}, \boldsymbol{\beta}^{\ast}, \boldsymbol{\alpha}^{\ast})=
{z}^{R,b}_{j,t}(\mathbf{f}, \boldsymbol{\beta}, \boldsymbol{\alpha})
\end{equation*}
and hence we see that, without loss of generality, we can use either lags or leads
of the principal component to reconstruct the series. Moreover, minimizing
the function in Equation~\ref{eq:MSE} is equivalent to minimizing
\begin{equation}
\frac{1}{Tm}%
\sum_{j=1}^{m}\sum_{t=1}^{T}(z_{j,t}-{z}^{R,f}_{j,t}(\mathbf{f}, \boldsymbol{\beta}, \boldsymbol{\alpha}))^{2}.  \nonumber
\end{equation}
Since the notation is
less cumbersome using leads, the derivation for the optimal solution in
Section~\ref{sec:compu} is presented using leads. 
Moreover, all internal
computations in the \pkg{gdpc} package are performed using leads. However,
since the reconstruction using lags is more intuitive, the final output is
passed to the form using lags via Equations~\ref{eq:leadslags1},~\ref{eq:leadslags2}, \ref{eq:leadslags3}. It can be shown that the GDPC can also be equivalently defined using a $k/2$ leads and $k/2$ lags of $f_t$ to define the reconstruction, see \cite{PenaYohai2016} for details. This explains the noisy character of the GDPC at the ends of the sample, see Figure~\ref{Fig:SP50PC} for example, and why, like Brillinger's DPC, the GDPC are not directly useful for forecasting.

Note that if we are considering $p$ dynamic principal components of order $k_{i}$
each, $i=1,\dots,p$, the number of values required to reconstruct
the original series is $\sum_{i=1}^{p}(T+k_{i}+m(k_{i} +1)+m)$.
In practice, the number of components and the number of lags used for each
component need to be chosen. The MSE of the reconstruction decreases when
either of these two numbers is increased, but the amount of information
needed to be stored for the reconstruction will also increase. The number of
components can be chosen so that a pre-specified fraction of the total
variance of the data is explained, as is usual in ordinary principal
component analysis. Regarding the choice of the number of lags for each
component, let $k$ be the number of lags used to fit the component under
consideration. Let $\widehat{y}_{j,t}=\widehat{\alpha}_{j}+\sum_{h=0}^{k}%
\widehat{\beta}_{j,h+1}\widehat{f}_{t+h}$ be the interpolator used where $y_{j,t}=z_{j,t}$ for
the first component and will be equal to the residuals from the fit with the
previous components otherwise. Let $r_{j,t}=y_{j,t}-\widehat{y}_{j,t},$ be
the residuals from the fit with the component, and $\mathbf{R}$ be the
corresponding matrix of residuals from this fit and let ${\boldsymbol{\Sigma}%
=(\mathbf{R}^{\top}\mathbf{R})/T}$. Given $k_{max}$, $k$ can be chosen among 
$0,\dots,k_{max}$ as the value that minimizes some criterion. This criterion
should take into account the MSE of the reconstruction and the amount of
information to be stored. The following four criteria are available in %
\pkg{gdpc}:
%
\begin{itemize}
\item Leave-one-out (LOO) cross-validation: 
\begin{equation*}
LOO_{k}=\frac{1}{Tm}\sum\limits_{i=1}^{m}\sum\limits_{t=1}^{T}\frac {%
r_{i,t}^{2}}{(1-h_{k,tt})^{2}},
\end{equation*}
where $h_{k,tt}$ are the diagonal elements of the hat matrix $\mathbf{H}_{k}=%
\mathbf{F}_{k}(\mathbf{F}_{k}^{\top}\mathbf{F}_{k})^{-1}\mathbf{F}%
_{k}^{\top} $ , with $\mathbf{F}_{k}$ being the $T\times(k+2)$ matrix with
rows $(f_{t-k},f_{t-k+1},\dots,f_{t},1)$.\bigskip

\item An AIC type criterion (\cite{Akaike}): 
\begin{equation*}
AIC_{k}=T\log\left( \text{trace}(\boldsymbol{\Sigma})\right) +m(k+2)2.
\end{equation*}

\item A BIC type criterion (\cite{BIC}): 
\begin{equation*}
BIC_{k}=T\log\left( \text{trace}(\boldsymbol{\Sigma})\right) +m(k+2)\log T.
\end{equation*}

\item A criterion based on the $IC_{p3}$ proposal of \cite{BaiNg}: 
\begin{equation*}
BNG_{k}=\min(T, m)\log\left( \text{trace}(\boldsymbol{\Sigma})\right) +\log(\min(T, m))(k+1).
\end{equation*}
\end{itemize}
%
The AIC and BIC type criteria are known not to work well when the ratio $T/m$
is small; this is to be expected since they are based on the assumption that 
$m$ is fixed and $T\rightarrow\infty$. For the cases in which the ratio $T/m$
is small it is interesting to use a criterion based on both $m$ and $T$
going to infinity. For this reason, we included the criterion $BNG_{k}$,
based on a proposal of \cite{BaiNg} for choosing the number of factors in a
factor model. We will compare the performance of the criteria in Section~\ref{sec:tuning}.

\section{Computing the GDPC}

\label{sec:compu}

To compute the GDPC we note that, given the component, the $\mathbf{\beta}$
coefficients are regression coefficients that can be computed by least
squares, and given the $\mathbf{\beta}$ coefficients we have again linear
equations that can be easily computed. To write the equations we have to
solve we introduce some notation. Define $a\vee b=\max(a,b)\ $and $a\wedge
b=\min(a,b).$ Let 
$$\mathbf{C}_{j}(\alpha_{j})=(c_{j,t,q}(\alpha_{j}))_{1\leq
t\leq T+k,1\leq q\leq k+1}
$$ be the $(T+k)\times(k+1)$ matrix defined by 
$$
c_{j,t,q}(\alpha_{j})=(z_{j,t-q+1}-\alpha_{j}),
$$ when $1\vee(t-T+1)\leq
q\leq(k+1)\wedge t$ and zero otherwise. Let 
$$
\mathbf{D}_{j}(\mathbf{\beta }%
_{j}\mathbf{)}=(d_{j,t,q}(\mathbf{\beta}_{j}\mathbf{)})
$$ be the $%
(T+k)\times(T+k)$ matrix given by 
$$d_{j,t,q}(\mathbf{\beta}_{j}\mathbf{)}%
=\sum_{v=(t-k)\vee1}^{t\wedge T}\beta_{j,q-v+1}\beta_{j,t-v+1},
$$ when $%
(t-k)\vee1\leq q\leq(t+k)\wedge(T+k)$ and zero otherwise. Define 
$$\mathbf{D}(%
\mathbf{\beta})=\sum_{j=1}^{m}\mathbf{D}_{j}(\mathbf{\beta}_{j}\mathbf{).}
$$
Differentiating Equation~\ref{eq:MSE} with respect to $\mathbf{f}$, it is
easy to show that 
\begin{equation}
\mathbf{f=D}(\boldsymbol{\beta})^{-1}\sum_{j=1}^{m}\mathbf{C}_{j}(%
\boldsymbol{\alpha})\boldsymbol{\beta}_{j}  \label{eq:f}
\end{equation}
where $\boldsymbol{\beta}_{j}$, $j=1,\dots,m$, are the rows of $\boldsymbol{%
\beta}$ . Differentiating Equation~\ref{eq:MSE} with respect to $\boldsymbol{%
\beta}_{j}$ and $\alpha_{j}$ we obtain 
\begin{equation}
\left( 
\begin{array}{c}
\boldsymbol{\beta}_{j} \\ 
\alpha_{j}%
\end{array}
\right) =\left( \mathbf{F(f)}^{\top}\mathbf{F(f)}\right) ^{-1}\mathbf{F(f)}%
^{\top}\mathbf{\ z}^{(j)},  \label{eq:beta}
\end{equation}
where $\mathbf{z}^{(j)}=(z_{j,1},\ldots,z_{j,T})^{\top}$ and $\mathbf{F(f)}$ is
the $T\times(k+2)$ matrix with $t$-th row ($f_{t},f_{t+1},\ldots,f_{t+k},1).$

To define an iterative algorithm to compute $(\widehat{\mathbf{f}}, \widehat{%
\boldsymbol{\beta}}, \widehat{\boldsymbol{\alpha}})$ it is enough to provide 
$\mathbf{f}^{(0)}$ and a rule describing how to compute $\boldsymbol{\beta}%
^{(h)},\boldsymbol{\alpha}^{(h)}$ and $\mathbf{f}^{(h+1)}$ once $\mathbf{f}%
^{(h)}$ is known. The following two steps based on Equations~\ref{eq:f} and %
\ref{eq:beta} describe a natural rule to perform this recursion.

\begin{description}
\item[step 1] Based on Equation~\ref{eq:beta}, define $\boldsymbol{\beta}%
_{j}^{(h)}$ and $\alpha_{j}^{(h)},$ for $1\leq j\leq m$\ , by 
\begin{equation*}
\left( 
\begin{array}{c}
\boldsymbol{\beta}_{j}^{(h)} \\ 
\alpha_{j}^{(h)}%
\end{array}
\right) =\left( \mathbf{F(f}^{(h)}\mathbf{)}^{\top}\mathbf{F(f}^{(h)}\mathbf{%
)}\right) ^{-1}\mathbf{F(f}^{(h)}\mathbf{)}^{\top}\mathbf{z}^{(j)}.
\end{equation*}

\item[step 2] Based on Equation~\ref{eq:f}, define $\mathbf{f}^{(h+1)}$ by 
\begin{equation*}
\mathbf{f}^{\ast}=\mathbf{D}(\boldsymbol{\beta}^{(h)})^{-1}\mathbf{C}(%
\boldsymbol{\alpha}^{(h)})\boldsymbol{\beta}^{(h)}
\end{equation*}
\ and%
\begin{equation*}
\mathbf{f}^{(h+1)}=(T+k-1)^{1/2}(\mathbf{f}^{\ast}-\overline{\mathbf{f}}%
^{\ast})\mathbf{/|||\mathbf{f}^{\ast}-\overline{\mathbf{f}}^{\ast}||.}
\end{equation*}
\end{description}

The initial value $\mathbf{f}^{(0)}$ can be chosen equal to the ordinary
first principal component, completed with $k$ leads. We stop after %
\code{niter_max} iterations or when 
\begin{equation*}
\frac{ \text{MSE}(\mathbf{f}^{(h)} ,\boldsymbol{\beta}^{(h)},\boldsymbol{%
\alpha} ^{(h)})- \text{MSE}(\mathbf{f}^{(h+1)},\boldsymbol{\beta }^{(h+1)},%
\boldsymbol{\alpha}^{(h+1)})}{\text{MSE}(\mathbf{f}^{(h)},\boldsymbol{\beta}%
^{(h)},\boldsymbol{\alpha}^{(h)})}<\text{\code{tol}}
\end{equation*}
for some user-supplied values \code{tol} and \code{niter_max}.

All numerically intensive computations are performed in \proglang{C++},
using the \pkg{Armadillo} linear algebra library, \cite{Armadillo}. The %
\proglang{C++} code is integrated with \proglang{R} using packages \pkg{Rcpp}%
, \cite{Rcpp}, and \pkg{RcppArmadillo}, \cite{RcppArmadillo}.

In step 1 of the algorithm we need to solve $m$ least squares problems, each with $T$ observations and $k+2$ predictor variables. The worst case complexity for solving
each of these least squares problems is $O((k+2)^2 T)=O(k^2 T)$. Hence
the worst case
complexity for step 1 of the algorithm
is $O(m k^2 T)$. The worst case complexity for solving the linear system in step 2 of the algorithm is $O((T+k)^3)$. Hence, at it each iteration the worst
case complexity is $O((T+k)^3 + m k^2 T)$, which is linear in $m$. Thus, the algorithm
is able to deal with very high-dimensional problems.
Note that there are no
restrictions on the values of $\mathbf{f}$ and, in particular, we do not
assume, as in \cite{Brillinger}, that the GDPC\ must be linear combinations of the
series. Note that in Equation~\ref{eq:f} the computation of the component is linear in
the observed data given the $\boldsymbol{\beta }$ parameters. Also, the $%
\boldsymbol{\beta }$ parameters are  estimated  as linear functions of the
observations given the GDPC by Equation~\ref{eq:beta}. However, these two step
estimation leads to a component which,  in general, will not be a linear
function of the observed series. It is shown in \cite{PenaYohai2016} in
particular stationary cases that the components are approximately linear combinations of
the data. This is a similar result to the one found in the static case with
standard principal components where we do not impose this restriction but
the optimal solution verifies it. However, when the series are not
stationary this additional flexibility leads to values of $\mathbf{f}$ better
adapted to the possible nonstationary character of the time series. We will back up
this claim with experimental results in Section~\ref{sec:rec_non-stat}.

\section[Using the gdpc package]{Using the \pkg{gdpc} package}

\label{sec:using}

There are two main functions available to the user: \code{gdpc} and %
\code{auto.gdpc}. We first describe \code{gdpc}.

\subsection[The gdpc function and class]{The \code{gdpc} function and class}

\label{sec:gdpcfun}

The function \code{gdpc} computes a single GDPC with a number of lags that
has to be provided by the user. It has the following arguments:
%
\begin{itemize}
\item {\code{Z}:}{\ Data matrix. Each column is a different time series.}

\item {\code{k}:}{\ Integer. Number of lags to use.}

\item {\code{f\_ini}:}{\ (Optional). Numeric vector. Starting point for the
iterations. If no argument is passed the ordinary first principal component
completed with \code{k} lags is used.}

\item {\code{tol}:}{\ Relative precision. Default is $10^{-4}$.}

\item {\code{niter\_max}:}{\ Integer. Maximum number of iterations. Default
is 500.}

\item {\code{crit}:}{\ A string specifying the criterion to be used to
evaluate the reconstruction. Options are \code{"LOO"}, \code{"AIC"}, \code{"BIC"} and \code{"BNG"}.
Default is \code{"LOO"}.}
\end{itemize}
%
The output of this function is an \proglang{S}3 object of class `\code{gdpc}',
that is, a list with entries:
%
\begin{itemize}
\item {\code{expart}:}{\ Proportion of the variance explained.}

\item {\code{mse}:}{\ Mean squared error.}

\item {\code{crit}:}{\ The value of the criterion of the reconstruction,
according to what the user specified.}

\item {\code{k}:}{\ Number of lags used.}

\item {\code{alpha}:}{\ Vector of intercepts corresponding to \code{f}.}

\item {\code{beta}:}{\ Matrix of loadings corresponding to \code{f}. Column
number $j$ is the vector of $j-1$ lag loadings.}

\item {\code{f}: }{Coordinates of the first dynamic principal component
corresponding to the periods $1,\dots,T$.}

\item {\code{initial\_f}:}{\ Coordinates of the first dynamic principal
component corresponding to the periods $-\text{\code{k}}+1,\dots,0$. Only
for the case $\text{\code{k}}>0$, otherwise 0.}

\item {\code{call}:}{\ The matched call.}

\item {\code{conv}:}{\ Logical. Did the iterations converge?}

\item {\code{niter}:}{\ Integer. Number of iterations.}
\end{itemize}
%
\code{fitted}, \code{plot} and \code{print} methods are available for this
class.

We illustrate the use of this function with the an artificial data set.
First, we load the package and generate the artificial data.
%
\begin{CodeChunk}
\begin{CodeInput}
R> library("gdpc")
R> set.seed(1234)
R> T <- 200
R> m <- 5000
R> f <- rnorm(T + 1)
R> x <- matrix(0, T, m)
R> u <- matrix(rnorm(T * m), T, m)
R> for (i in 1:m) {
+    x[, i] <- 10 * sin(2 * pi * (i / m)) * f[1:T] +
+      10 * cos(2 * pi * (i / m)) * f[2:(T + 1)] + u[, i]
+  }
\end{CodeInput}
\end{CodeChunk}
%
We use \code{gdpc} to compute a single GDPC using one lag. The rest of the
arguments are the default ones. 
%
\begin{CodeChunk}
\begin{CodeInput}
R> fit <- gdpc(x, k = 1)
R> fit
\end{CodeInput}
\begin{CodeOutput}
	    Number.of.lags   LOO   MSE Explained.Variance
Component 1              1 1.017 0.986              0.991
\end{CodeOutput}
\end{CodeChunk}
%
The result is stored in \code{fit} and the code \code{fit} produces the
object to be printed. This shows the number of lags used, the value of the
criterion specified by the user (the default \code{"LOO"} in this case), the MSE of
the reconstruction and the fraction of explained variance. It is seen that
the reconstruction is excellent, with more than 99\% of the variance of the
data explained.

Now we can plot the loadings and the principal component by using the %
\code{plot} method for the `\code{gdpc}' class. The method has the following
arguments
%
\begin{itemize}
\item {\code{x}: }{\ An object of class `\code{gdpc}', usually the result of %
\code{gdpc} or one of the entries of the result of \code{auto.gdpc}.}

\item {\code{which}:}{\ String. Indicates what to plot, either \code{"Component"}
or \code{"Loadings"}. Default is \code{"Component"}.}

\item {\code{which\_load}: }{Lag number indicating which loadings should be
plotted. Only used if \code{which = "Loadings"}. Default is 0.}

\item {\code{\dots}: }{Additional arguments to be passed to the plotting
functions.}
\end{itemize}
%
Continuing with our example, we plot the loadings and the principal
component. 
%
\begin{CodeChunk}
\begin{CodeInput}
R> par(mfrow = c(3, 1))
R> plot(fit, which = "Loadings", which_load = 0, xlab = "", ylab = "")
R> plot(fit, which = "Loadings", which_load = 1, xlab = "", ylab = "")
R> plot(fit, which = "Component", xlab = "", ylab = "")
\end{CodeInput}
\end{CodeChunk}
%
The result is shown in Figure~\ref{Fig:SinCos}.

\begin{figure}[t!]
\centering
\includegraphics[width=0.7\textwidth, trim=0 10 0 5, clip]{FigSinCos}
\caption{Plot of the loadings and principal component for an artificial data
set.}
\label{Fig:SinCos}
\end{figure}

The reconstruction of the original series can be obtained using the %
\code{fitted} method for the `\code{gdpc}' class. We store it in \code{recons}.%
%
\begin{CodeChunk}
\begin{CodeInput}
R> recons <- fitted(fit)
\end{CodeInput}
\end{CodeChunk}
%
We can compare the approximate amount of storage needed for the original
data set and for the \code{fit} object\code{x} and the \code{gdpc} object %
\code{fit}. 
%
\begin{CodeChunk}
\begin{CodeInput}
R> object.size(x)
\end{CodeInput}
\begin{CodeOutput}
8000216 bytes
\end{CodeOutput}
\begin{CodeInput}
R> object.size(fit)
\end{CodeInput}
\begin{CodeOutput}
124352 bytes
\end{CodeOutput}
\end{CodeChunk}
%
Hence, the amount of memory needed to store \code{fit} is about $1.6\%$ of
that needed to store \code{x}.

\subsection[The auto.gdpc function and the gdpcs class]{The \code{auto.gdpc}
function and the `\code{gdpcs}' class}

\label{sec:autogdpcfun}

The function \code{auto.gdpc} computes \textit{several} GDPC. The number of
components can be supplied by the user or chosen automatically so that a
given proportion of variance is explained. The number of lags is chosen
automatically using one of the criteria listed in Section~\ref{sec:defin}.
The function has the following arguments
%
\begin{itemize}
\item {\code{Z}:}{\ Data matrix. Each column is a different time series.}

\item {\code{crit}: }{\ A string specifying the criterion to be used to
choose the number of lags. Options are \code{"LOO"}, \code{"AIC"}, \code{"BIC"} and \code{"BNG"}.
Default is \code{"LOO"}.}

\item {\code{normalize}: }{\ Integer. Either 1, 2 or 3. Indicates whether
the data should be standardized. Default is 1. See details below.}

\item {\code{auto\_comp}: }{\ Logical. If \code{TRUE} compute components
until the proportion of explained variance is equal to \code{expl\_var},
otherwise use \code{num\_comp} components. Default is \code{TRUE}.}

\item {\code{expl\_var}:}{\ A number between 0 and 1. Desired proportion of
explained variance (only used if \code{auto_comp}==\code{TRUE}). Default is
0.9.}

\item {\code{num\_comp}:}{\ Integer. Number of components to be computed
(only used if \newline
\code{auto\_comp}==\code{FALSE}). Default is 5.}

\item {\code{tol}: }{\ Relative precision. Default is $10^{-4}$.}

\item {\code{k\_max}:}{\ Integer. Maximum possible number of lags. Default
is 10.}

\item {\code{niter\_max}:}{\ Integer. Maximum number of iterations. Default
is 500.}

\item {\code{ncores}:}{\ Integer. Number of cores to be used for parallel
computations. Default is 1.}

\item {\code{verbose}:}{\ Logical. Should progress be reported? Default is FALSE.}
\end{itemize}
%
The argument \code{normalize} indicates whether the data should be
normalized. If \code{normalize} = 1, the data is analysed in the original
units, without mean and variance standardization. If \code{normalize} = 2,
the data is standardized to zero mean and unit variance before computing the
principal components, but the intercepts and loadings are those needed to
reconstruct the original series. If \code{normalize} = 3 the data are
standardized as in \code{normalize} = 2, but the intercepts and the loadings
are those needed to reconstruct the standardized series. Default is %
\code{normalize} = 1, and hence the data is analysed in its original units.

The choice of \code{"LOO"} as the default criterion for choosing the number of lags is
justified in Section~\ref{sec:tuning}.
The optimal number of lags for each component can be computed in parallel,
using the \proglang{R} packages \pkg{doParallel}, \cite{doPar} and %
\pkg{foreach}, \cite{foreachJSS}. The argument \code{ncores} indicates the
number of cores to be used for the parallel computations. The default value
is 1, and hence by default no parallel computations are performed. If \code{verbose=TRUE}
a message is printed each time a component is succesfully computed.

The output of \code{auto.gdpc} is an \proglang{S}3 object of class `\code{gdpcs}',
that is, a list of length equal to the number of computed components. The $i$%
-th entry of this list is an object of class `\code{gdpc}' that stores the
information of the $i$-th dynamic principal component, that is, a list with
entries
%
\begin{itemize}
\item {\code{expart}:}{\ Proportion of the variance explained by the first $%
i $ components.}

\item {\code{mse}:}{\ Mean squared error of the reconstruction using the
first $i$ components.}

\item {\code{crit}:}{\ The value of the criterion of the reconstruction,
according to what the user specified.}

\item {\code{k}:}{\ Number of lags chosen.}

\item {\code{alpha}:}{\ Vector of intercepts corresponding to \code{f}.}

\item {\code{beta}:}{\ Matrix of loadings corresponding to \code{f}. Column
number $j$ is the vector of $j-1$ lag loadings.}

\item {\code{f}: }{Coordinates of the $i$-th dynamic principal component
corresponding to the periods $1,\dots,T$.}

\item {\code{initial\_f}:}{\ Coordinates of the $i$-th dynamic principal
component corresponding to the periods $-\text{\code{k}}+1,\dots,0$. Only
for the case $\text{\code{k}}>0$, otherwise 0.}

\item {\code{call}:}{\ The matched call.}

\item {\code{conv}:}{\ Logical. Did the iterations converge?}

\item {\code{niter}:}{\ Integer. Number of iterations.}
\end{itemize}
%
\code{components}, \code{fitted}, \code{plot} and \code{print} methods are
available for this class.

We illustrate the use of this function with a real data set, %
\code{pricesSP50}, that is part of the package. The data set if formed by
fifty series corresponding to the stock prices of the first 50 components of
the Standard\&Poor's 500 index. Five hundred daily observations starting
1/1/2010 are available. The class of the object storing the data set is %
`\code{mts}'.

The data set can be loaded and part of it can be plotted using the following
commands 
%
\begin{CodeChunk}
\begin{CodeInput}
R> data("pricesSP50")
R> plot(pricesSP50[, 1:4], main = "Four components of the S&P500 index")
\end{CodeInput}
\end{CodeChunk}
%
The plot is shown in Figure~\ref{Fig:SP504}.

\begin{figure}[t!]
\centering
\includegraphics[width=0.55\textwidth, trim=0 20 0 5, clip]{FigSP504}
\caption{Plot of four components of the S\&P500 index.}
\label{Fig:SP504}
\end{figure}

Next, we apply \code{auto.gdpc} to the data set, storing the result in %
\code{fit_SP}. Since some of the series are significantly more variable than
others, we apply the procedure to the normalized data set using the option %
\code{normalize}=2. Since convergence is somewhat slow for this data set, we
increased the maximum number of iterations from the default 500 to 1000 by
setting \code{niter\_max}=1000. We used 8 cores to perform the computation,
by setting \code{ncores}=8. The rest of the arguments are left to their
default values. In particular the number of lags for each component is
chosen as the value than minimizes the LOO criterion and the number of
components is chosen so that the reconstruction explains at least 90\% of
the variance of the data. 
%
\begin{CodeChunk}
\begin{CodeInput}
R> fit_SP <- auto.gdpc(pricesSP50, normalize = 2, niter_max = 1000,
+    ncores = 8)
\end{CodeInput}
\begin{CodeInput}
R> fit_SP
\end{CodeInput}
\begin{CodeOutput}
	    Number.of.lags   LOO   MSE Explained.Variance
Component 1             10 0.185 0.174              0.826
Component 2              8 0.074 0.071              0.929
\end{CodeOutput}
\end{CodeChunk}
%
The whole computation took about 3 minutes on \proglang{R} version 3.4.0 on
a computer running OS X El Capitan 10.11.4 64-bit with an Intel Xeon CPU E5-1650 v2 3.50GHz. Entering \code{fit_SP} produces the object to be printed.
A table is printed where row $i$ shows the number of lags used in component $%
i $, the value of the criterion specified by the user (the default \code{"LOO"} in
this case) in component $i$, the MSE of the reconstruction using the
components $1,\dots,i$ and the fraction of explained variance by the
reconstruction using components $1,\dots,i$. Note that the MSEs and the
criteria are those of the reconstruction of the normalized series in this
case.

We can obtain the reconstruction of the original time series using the %
\code{fitted} method for the `\code{gdpcs}' class. The method has the
following arguments
%
\begin{itemize}
\item {\code{object}:}{\ An object of class `\code{gdpcs}', usually the result
of \code{auto.gdpc}.}

\item {\code{num_comp}: }{\ Integer indicating how many components to use
for the reconstruction. Default is 1.}

\item {\dots:}{\ Additional arguments for compatibility.}
\end{itemize}
%
Note that the \pkg{gdpc} package supports `\code{mts}', `\code{xts}' and %
`\code{zoo}' objects. The principal components and reconstructed time series
will be stored in an object of class `\code{mts}', `\code{xts}' or `\code{zoo}'
respectively, the same as the original data, with the same date attributes.
Thus, since the \code{pricesSP50} data was stored in an object of class %
`\code{mts}', the reconstructed time series will be of class `\code{mts}', with
the same date attributes.

We store the reconstructed time series using both computed components in an %
\code{mts} object called \code{recons}, assign each of the series its
corresponding name, and plot the first four series. The result is shown in
Figure~\ref{Fig:SP50Recon}. 
%
\begin{CodeChunk}
\begin{CodeInput}
R> recons <- fitted(fit_SP, num_comp = 2)
R> colnames(recons) <- colnames(pricesSP50)
R> plot(recons[, 1:4],
+    main = "Reconstruction of four components of the S&P500 index")
\end{CodeInput}
\end{CodeChunk}
%
\begin{figure}[t!]
\centering
\includegraphics[width=0.55\textwidth, trim=0 20 0 -2, clip]{FigSP50Recon}
\caption{Plot of the reconstruction of four components of the S\&P500 index.}
\label{Fig:SP50Recon}
\end{figure}

We can get the dynamic principal components using the \code{components}
method for the class `\code{gdpcs}'. The method has the following arguments
%
\begin{itemize}
\item {\code{object}:}{\ An object of class `\code{gdpcs}', usually the result
of \code{auto.gdpc}.}

\item {\code{which_comp}:}{\ Numeric vector indicating which components to
get. Default is 1.}
\end{itemize}
%
Since the original data was stored in an object of class `\code{mts}', the
components will also be of class `\code{mts}'. We store the components in an %
`\code{mts}' object called \code{comps}. 
%
\begin{CodeChunk}
\begin{CodeInput}
R> comps <- components(fit_SP, which_comp = c(1, 2))
\end{CodeInput}
\end{CodeChunk}
%
We can either directly plot the \code{comps} time series matrix or use the %
\code{plot} method for the class `\code{gdpcs}'. The method has the following
arguments
%
\begin{itemize}
\item {\code{x}: }{An object of class `\code{gdpcs}', usually the result of %
\code{auto.gdpc}.}

\item {\code{plot.type}:}{\ Argument to be passed to the \code{plot}
    for `\code{zoo}' objects. Used only when the original data set was
    stored in an object of class `\code{zoo}'.  Default is
    \code{"multiple"}.}

\item {\code{which_comp}:}{\ Numeric vector indicating which components to
plot. Default is 1.}

\item {\dots:}{\ Additional arguments to be passed to the plotting functions.%
}
\end{itemize}
%
Using the plot method for the `\code{gdpcs}' we can plot both principal
components. The result is shown in Figure~\ref{Fig:SP50PC}. 
%
\begin{CodeChunk}
\begin{CodeInput}
R> plot(fit_SP, which_comp = c(1, 2))
\end{CodeInput}
\end{CodeChunk}
%
\begin{figure}[t!]
\centering
\includegraphics[width=0.55\textwidth, trim=0 25 0 5, clip]{FigSP50PC}
\caption{Plot of the first two dynamic principal components of the 
\code{pricesSP50} data set.}
\label{Fig:SP50PC}
\end{figure}

We can compare the approximate amount of storage needed for the original
data set \code{pricesSP50} and the `\code{gdpcs}' object \code{fit_SP}. 
%
\begin{CodeChunk}
\begin{CodeInput}
R> object.size(fit_SP)
\end{CodeInput}
\begin{CodeOutput}
31184 bytes
\end{CodeOutput}
\begin{CodeInput}
R> object.size(pricesSP50)
\end{CodeInput}
\begin{CodeOutput}
204192 bytes
\end{CodeOutput}
\end{CodeChunk}
%
Hence, the amount of memory needed to store \code{fit_SP} is about $14\%$ of
that needed to store \code{pricesSP50}.

\section{Reconstructing non-stationary data}
\label{sec:rec_non-stat} 

We compare the performance of GDPC and the dynamic principal
components proposed by \cite{Brillinger} by conducting a small simulation study.
We consider the following
VARI(1,1) model. The data is generated according to 
\begin{equation*}
\mathbf{z}_{t}=\mathbf{z}_{t-1}+\mathbf{x}_{t},
\end{equation*}%
where $\mathbf{x}_{t}$ satisfies a stationary VAR(1) model 
\begin{equation*}
\mathbf{x}_{t}=\mathbf{A}\mathbf{x}_{t-1}+\mathbf{u}_{t}.
\end{equation*}%
The $\mathbf{u}_{t}$ are i.i.d. with a standard multivariate normal
distribution. In each replication the matrix $\mathbf{A}$ is generated
randomly as $\mathbf{A}=\mathbf{V}\boldsymbol{\Lambda }\mathbf{V}^{\top }$,
where $\mathbf{V}$ is an orthogonal matrix generated at random with uniform
distribution and $\boldsymbol{\Lambda }$ is a diagonal matrix with diagonal
elements are independent random variables with uniform distribution on the $%
[0,0.9]$ interval. Note that the generated vector time series is not stationary. 

We computed GDPC using one component and $k=10$ lags and the dynamic principal
components of Brillinger (DPC) using one component with 5 leads and 5 lags. We used the \code{dpca} function from the \pkg{freqdom} package \citep{freqdom} to compute the method proposed by Brillinger. We take $(m,T)\in \{10, 50, 100\}\times \{100,200\}$  and do 500
replications. Table~\ref{tab:exp_var_time} shows the results. We report the average computation time in seconds and the average fraction of variance explained.
We see that the reconstructions obtained with GDPC are much better than those obtained with DPC, with the
fraction of variance explained by GDPC being around 0.90 to 0.95, and the fraction
of variance explained by DPC being around $0.65$ to $0.70$. Moreover, the computation
times in seconds for GDPC are much lower. For example, for the case of $T=m=100$, GDPC
takes on average 1.69 seconds to compute a solution whereas the same task takes 90 seconds
using DPC.

% latex table generated in R 3.4.4 by xtable 1.8-2 package
% Mon May  7 11:11:21 2018
\begin{table}[t!]
\centering
\begin{tabular}{llcccc}
  \hline
       &     & \multicolumn{2}{c}{GDPC}       & \multicolumn{2}{c}{DPC}  \\
 \cmidrule(lr){3-4} \cmidrule(lr){5-6}   $T$ & $m$ & Time & Explained Variance & Time & Explained Variance\\
 \hline
100 & 10 & 1.34 & 0.95 & 2.87 & 0.72 \\ 
   & 50 & 1.51 & 0.95 & 21.11 & 0.69 \\ 
   & 100 & 1.69 & 0.94 & 90.09 & 0.68 \\ 
  200 & 10 & 8.76 & 0.93 & 2.94 & 0.69 \\ 
   & 50 & 7.77 & 0.92 & 21.74 & 0.65 \\ 
   & 100 & 7.79 & 0.91 & 92.94 & 0.64 \\ 
   \hline
\end{tabular}
\caption{Average computation time in seconds and fraction of explained variance for each method.} 
\label{tab:exp_var_time}
\end{table}


\section{The performance of the lag selection criteria}
\label{sec:tuning} 
In this section we compare the performance of the four different criteria available in
the \pkg{gdpc} package to automatically choose the number of lags used to define the GDPC.
Since we propose to choose the number of components to explain a given fraction of the variance in the data, we focus on the case in which one component is to be computed. We consider the following two scenarios.
%
\begin{itemize}
\item DFM1: The data is generated as 
$$
z_{j,t} = b_{0,j}f_{t} +  b_{1,j}f_{t-1}+e_{j,t},\quad 1\leq t\leq
T,1\leq j\leq m.
$$
$f_{t}$ follows a stationary AR(1) model, $f_{t}=\theta f_{t-1}+u_t$, with standard normal innovations and $\theta$ generated at random on the $(-1, 1)$ interval at each replication. The loadings $b_{0,j}, b_{1,j}$, $j=1,\dots,m$ are generated at random uniformly on the $(-1,1)$ interval. The variables $e_{j,t}$ are generated as i.i.d. standard normal. This is a dynamic factor model with one factor and one lag.
\item DFM2: The data is generated as 
$$z_{j,t} = b_{0,j}f_{t} +  b_{1,j}f_{t-1}+b_{2,j}f_{t-2}+e_{j,t}, \quad 1\leq t\leq
T,1\leq j\leq m.
$$ 
The factor $f_t$ follows a stationary MA(1) model, $f_{t}=\theta u_{t-1}+u_{t}$ with standard normal innovations and with $\theta$ generated at random at each replication, uniformly on the $(-1,1)$ interval. The loadings and the errors are generated as in model DFM1. 
This is a dynamic factor model with one factor and two lags. 
\end{itemize}
%
We take $(m, T) \in \lbrace 5, 10, 20, 200, 800\rbrace \times \lbrace 200, 400\rbrace$ and do 500 replications. We report the average number of lags chosen
by each criterion and the resulting average reconstruction mean squared error. Tables~\ref{tab:k_choice} and~\ref{tab:mse_choice} show the results.

We see that in the cases in which the dimension of the vector time series is small, say $m=5$ or $m=10$, and the sample size is large, the AIC does well, choosing a number of lags close to the true values (1 for DFM1, 2 for DFM2). In this cases BNG tends to underestimate the required number of lags, whereas LOO tends to overestimate it. 
For moderate sized problems, with $m=20$, LOO does best all around.
For the case of high-dimensional vector time series ($m=200, 800$), BNG and LOO perform similarly, choosing a number of lags that is on average very close to the truth. In this cases AIC underestimates the number of lags needed. BIC tends to underestimate the required number of lags in all cases and does not perform well at all in any of the cases considered here.

In Table~\ref{tab:mse_choice} we see that: in the cases where the methods choose on average a number of lags close to the true values, the reconstruction errors are close to the variance of the idiosyncratic part, which is 1 in this case. In cases where the methods overestimate the number of lags needed, the reconstruction error is smaller than the idiosyncratic variance, see for example the case of $T=200$, $m=5$ for the LOO criterion. This is to be expected, since if a larger number of lags than needed is used, the components will also explain part of the variance in the idiosyncratic part. In cases where
the number of lags needed is underestimated, the reconstruction error is larger than 1, see for example the case of $T=200$ and $m=800$ for the BIC criterion.

Since we believe the main appeal of GDPC is for moderate to large sized panels of time series, the default criterion used in \code{auto.gdpc} is LOO. However, for small panels
better results can be obtained by using the AIC criterion.

% latex table generated in R 3.4.4 by xtable 1.8-2 package
% Mon May  7 11:18:05 2018
\begin{table}[t!]
\centering
\begin{tabular}{llrrrrrrrr}
  \hline
       &     & \multicolumn{4}{c}{DFM1}       & \multicolumn{4}{c}{DFM2}  \\
 \cmidrule(lr){3-6} \cmidrule(lr){7-10}   $T$ & $m$ & BNG & LOO & AIC & BIC & BNG & LOO & AIC & BIC\\
 \hline
200 & 5 & 0.02 & 4.14 & 0.81 & 0.26 & 0.00 & 4.02 & 1.64 & 0.54 \\ 
   & 10 & 0.08 & 2.51 & 0.43 & 0.06 & 0.02 & 2.73 & 0.92 & 0.00 \\ 
   & 20 & 0.28 & 1.27 & 0.15 & 0.00 & 0.29 & 2.15 & 0.01 & 0.00 \\ 
   & 200 & 1.01 & 1.01 & 0.00 & 0.00 & 2.00 & 2.00 & 0.00 & 0.00 \\ 
   & 800 & 1.01 & 1.01 & 0.00 & 0.00 & 2.00 & 2.00 & 0.00 & 0.00 \\ 
  400 & 5 & 0.02 & 3.76 & 0.95 & 0.51 & 0.00 & 3.83 & 1.88 & 1.16 \\ 
   & 10 & 0.08 & 2.00 & 0.80 & 0.20 & 0.02 & 2.79 & 1.75 & 0.19 \\ 
   & 20 & 0.19 & 1.15 & 0.38 & 0.03 & 0.18 & 2.10 & 0.83 & 0.00 \\ 
   & 200 & 1.02 & 1.02 & 0.00 & 0.00 & 2.00 & 2.01 & 0.00 & 0.00 \\ 
   & 800 & 1.01 & 1.01 & 0.00 & 0.00 & 2.00 & 2.00 & 0.00 & 0.00 \\ 
   \hline
\end{tabular}
\caption{Average number of lags chosen by each method in scenarios DFM1 and DFM2.} 
\label{tab:k_choice}
\end{table}


% latex table generated in R 3.4.4 by xtable 1.8-2 package
% Mon May  7 11:18:05 2018
\begin{table}[t!]
\centering
\begin{tabular}{llrrrrrrrr}
  \hline
       &     & \multicolumn{4}{c}{DFM1}       & \multicolumn{4}{c}{DFM2}  \\
 \cmidrule(lr){3-6} \cmidrule(lr){7-10}   $T$ & $m$ & BNG & LOO & AIC & BIC & BNG & LOO & AIC & BIC\\
 \hline
200 & 5 & 0.88 & 0.75 & 0.80 & 0.83 & 0.96 & 0.75 & 0.79 & 0.87 \\ 
   & 10 & 0.97 & 0.87 & 0.92 & 0.98 & 1.08 & 0.87 & 0.95 & 1.09 \\ 
   & 20 & 1.00 & 0.93 & 1.02 & 1.06 & 1.09 & 0.93 & 1.14 & 1.15 \\ 
   & 200 & 0.98 & 0.98 & 1.11 & 1.11 & 0.98 & 0.98 & 1.19 & 1.19 \\ 
   & 800 & 0.98 & 0.98 & 1.11 & 1.11 & 0.98 & 0.98 & 1.20 & 1.20 \\ 
  400 & 5 & 0.88 & 0.78 & 0.80 & 0.82 & 0.96 & 0.78 & 0.79 & 0.82 \\ 
   & 10 & 0.98 & 0.89 & 0.91 & 0.96 & 1.08 & 0.89 & 0.90 & 1.04 \\ 
   & 20 & 1.01 & 0.94 & 0.98 & 1.05 & 1.11 & 0.94 & 1.03 & 1.15 \\ 
   & 200 & 0.99 & 0.99 & 1.12 & 1.12 & 0.99 & 0.98 & 1.20 & 1.20 \\ 
   & 800 & 0.99 & 0.99 & 1.13 & 1.13 & 0.99 & 0.99 & 1.21 & 1.21 \\ 
   \hline
\end{tabular}
\caption{Average reconstruction mean squared error for scenarios DFM1 and DFM2, when the 
                          number of lags is chosen by each method.} 
\label{tab:mse_choice}
\end{table}

\section{Computing times}

\label{sec:comp_times} In this section, we estimate the run times of our
implementation of the algorithm described in Section~\ref{sec:compu} for
different problem sizes by conducting a small simulation study. Timings were
carried out for the \code{gdpc} function, since it is the one that
implements the main numerical algorithm. Timings were carried out on %
\proglang{R} version 3.4.4 on a computer running Linux Ubuntu 18.04 64-bit with an Intel Core i7-7560U @ 2.40GHz x4. 

We take $(m,T)\in \{100,1000,2000\}\times \{200,400\}$ and consider the
following two models. 
\begin{itemize}
\item DFM3: The data is generated according to
\begin{equation*}
z_{j,t}=\sin (2\pi (j/m))f_{t}+\cos (2\pi (j/m))f_{t-1} +e_{j,t}, \quad 1\leq t\leq
T,1\leq j \leq m,
\end{equation*}%
where the $e_{j,t}$ are i.i.d. standard normal variables. The vector of
factors $\mathbf{f}_{t}=(f_{1t},f_{2t})$ is generated according to a vector
autoregressive model $\mathbf{f}_{t}=\mathbf{A}\mathbf{f}_{t-1}+v_{t}\mathbf{%
b}$, where the $v_{t}$ are i.i.d. standard normals, $\mathbf{A}$ is a
diagonal matrix with diagonal equal to $(-0.8,0.7)$ and $\mathbf{b}%
=(1,1)^{\top }$. Note that the resulting time series is stationary.

\item DFM4: The data is generated according to 
\begin{equation*}
z_{j,t}=\sin (2\pi (j/m))f_{t}+\cos (2\pi (j/m))f_{t-1} + (j/m) f_{t-2}+e_{j,t}, \quad 1\leq t\leq
T,1\leq j \leq m,
\end{equation*}%
where the $e_{j,t}$ are i.i.d. standard normal variables. The factor $f_t$ follows
an integrated MA(1) model, $f_{t}=f_{t-1}+\theta f_{t-1}+u_t$, with standard normal
innovations and where $\theta$ is generated at random uniformly on the $(-0.8,0.8)$ 
interval at each replication. Note the the resulting time series is non-stationary.
\end{itemize}
%
We used $k=5$ lags for the first model and $k=2$ for the second model. We report
the average computation times in seconds over 500 replications.

The results are shown in Table~\ref{tab:times}. We see 
that in the stationary case (DFM3) the algorithm can
compute solutions for data sets with thousands of time series in under 15
seconds. It is seen that the convergence of the
algorithm is slower in the non-stationary case (DFM4), since the computing
times increase drastically for problems of the same size as before. However,
even in the non-stationary case, the algorithm can compute solutions to
problems with thousands of time series in under 1 minute. 
Moreover, note that computations were performed on an ordinary desktop
computer, on a high-performance cluster we expect the algorithm to be able
to compute the GDPC for tens of thousands of series in a matter of minutes. 

% latex table generated in R 3.4.4 by xtable 1.8-2 package
% Sun May  6 11:14:01 2018
\begin{table}[t!]
\centering
\begin{tabular}{llrr}
  \hline
T & m & DFM3 & DFM4 \\ 
  \hline
200 & 100 & 0.35 & 0.92 \\ 
   & 1000 & 2.07 & 5.89 \\ 
   & 2000 & 3.88 & 11.50 \\ 
  400 & 100 & 1.38 & 4.48 \\ 
   & 1000 & 6.86 & 22.01 \\ 
   & 2000 & 12.68 & 40.46 \\ 
   \hline
\end{tabular}
\caption{Average computing times in seconds for stationary and non-stationary factor models.} 
\label{tab:times}
\end{table}

\section{Conclusions}

\label{sec:conclu}

The \pkg{gdpc} package provides functions to compute the generalized dynamic
principal components for a set of time series. These components are useful
to reconstruct the series and to describe their underlying dynamic
structure. Also, they can be used as estimators of the common part
in a dynamic factor model, as shown in Section~\ref{sec:tuning} and by the theoretical results in \cite{Smucler17}. The package is useful for the analysis of large
data sets, even when the number of series is much larger than the length of
the series. The \pkg{gdpc} package is available from the Comprehensive %
\proglang{R} Archive Network at 
\url{https://
CRAN.R-project.org/package=gdpc}, including the \code{pricesSP50} data set. 
\FloatBarrier

\section*{Acknowledgments}

The authors would like to thank two anonymous reviewers and the editors for their helpful comments. Part of this work was conducted while Ezequiel Smucler was a Ph.D student at the Deparment of Mathematics at Universidad de Buenos Aires, funded by a CONICET fellowship.

\bibliography{ref}

\end{document}
