## Abstract

The availability of historical streamflow data of the desired length is often limited and, in these situations, the ability to synthetically generate statistically representative datasets becomes important. We previously developed a highly efficient stochastic modelling approach for the synthetic generation of daily streamflow sequences using the systematic combination of a hidden Markov model with the generalized Pareto distribution (the HMM-GP model). Daily streamflow sequences provide limited information on short-duration flooding events that exceed the peak over threshold values because such events are averaged out in daily datasets. These short, intense flooding events are often capable of causing significant damage and are important in conducting thorough flood risk management and flood risk assessment studies. This paper presents upgrades to our HMM-GP stochastic modelling approach and examines its efficiency in simulating streamflow at a temporal resolution of 15 minutes. The potential of the HMM-GP model in simulating a synthetic 15-minute streamflow series is investigated by comparing various statistical characteristics (e.g. percentiles, the probability density distribution and the autocorrelation function) of the observed streamflow records with 100 synthetically simulated streamflow time series. The proposed modelling schematics are robustly validated across case studies in four UK rivers (the Don, Nith, Dee and Tweed).

Floods have more devastating impacts than any other form of natural catastrophe in the UK and can cause disruption to crucial public services, such as energy, water, health, transport and communication networks. The analysis of historical flow records can provide a wealth of information on the flow patterns and possible future behaviour of rivers, which is essential in planning and designing flood risk management and flood risk assessments. Streamflow datasets can be analysed to understand the influence of several different climatic and geological factors affecting natural processes, including hydrological processes and the assessment and management of water resources and reservoirs. It is highly unlikely that the exact pattern of observed flows (including extreme flows) will recur in exactly the same order in the near future. However, given that sufficiently long records of the observed flow at fine resolution (e.g. 15 minutes) are often available, it is possible to estimate the key statistics defining the overall characteristics (e.g. the mean, variance, percentiles, probability density distributions, seasonality and correlation of streamflow time series) of both the river flow sequences and the extreme events. This information can be exploited through efficient stochastic modelling schemes, such as our hidden Markov model with the generalized Pareto distribution (the HMM-GP model), to recreate realistically plausible alternative scenarios with identical or similar statistical characteristics to the observed records.

The observed flow records are often not long enough to extract reliable statistics to understand the variability and uncertainty in the flow patterns and the occurrence of all significant rare events (high/low flows). To conduct a statistically significant impact analysis of a 1 : *N* year return period extreme event, a flow series at least 4*N* years in length is recommended – that is, the analysis of a 1 : 200 year return period extreme event requires a flow series at least 800 years long. Daily records and peak over threshold (POT) datasets often miss several short-duration intense flooding events, which, on their own, are capable of causing severe damage. Existing gauge data are usually not sufficiently extensive and are practically unobtainable at the required scale (800 years long with a 15-minute resolution). One possible solution to address these issues is to develop robust methodologies for artificially generating realistically possible synthetic streamflow sequences based on historical measured flow gauge data within acceptable statistical errors (i.e. reasonably close in statistical characteristics to the observed records) (Fiering & Jackson 1971).
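
To put the 4*N* rule of thumb in perspective, a short back-of-the-envelope calculation (illustrative only, ignoring leap years) shows the volume of 15-minute data that such a record would contain:

```python
# Rough scale of the dataset implied by the 4*N rule of thumb:
# a 1:200-year event calls for at least 800 years of flow record.
return_period_years = 200
required_years = 4 * return_period_years       # 800 years of record
samples_per_day = 24 * 60 // 15                # 96 fifteen-minute values per day
required_samples = required_years * 365 * samples_per_day

print(required_years)      # 800
print(required_samples)    # 28032000 (c. 28 million 15-minute values)
```

No measured gauge record approaches this length, which motivates the synthetic generation approach described below.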

We present here a thorough investigation into the existing framework of the modelling approach based on the HMM-GP model to examine its suitability for simulating streamflow at a much finer temporal resolution of 15 minutes. We propose a systematic reframing of the HMM-GP modelling framework to use seasonal-trend decomposition based on loess (STL) (Cleveland *et al.* 1990), and a robust validation of the scheme by conducting an extensive comparison of various statistical characteristics of the observed records with the synthetically simulated time series across four rivers in the UK: the Don, Nith, Dee and Tweed. To our knowledge, this is the first published report of the development of an efficient stochastic modelling framework for the synthetic generation of streamflow at this fine temporal resolution.

The HMM-GP modelling framework, as presented herein, is of a transformative nature and can be further investigated to examine its suitability for simulating a range of geological time series at a range of different temporal resolutions (Eidsvik *et al.* 2004; Wu 2011). An interesting application in this area could include the calibration of the HMM-GP model to reconstruct palaeotime geological/geomorphological datasets (Gangopadhyay *et al.* 2009).

This paper is organized as follows. The next section presents a brief background on the state of the art in the stochastic modelling techniques used for the generation of synthetic streamflows. The following five sections are organized to provide an overview of the case study sites, details of the stochastic modelling framework, a thorough model validation exercise and a synopsis of key outcomes and conclusions.

## Background

The conventional approach to generating multiple realizations of observed streamflow sequences is to calibrate hydrological rainfall–runoff models based on stochastic information with rainfall and temperature datasets. Several previous studies have demonstrated the efficiency of these modelling approaches in simulating synthetic streamflow (Mellor *et al.* 2000; Zhang *et al.* 2013). However, the accurate calibration of hydrological models requires an expert skill set, computational resources and, most importantly, the availability of a wide range of input data (including information on several different parameters and boundary conditions). These modelling techniques often have limited success as a result of the inherent error propagation in the modelling schematics. The limitations associated with the conventional hydrological rainfall–runoff modelling approaches have encouraged the development of more direct approaches, such as the training of stochastic/computational modelling techniques with observed time series of streamflow.

Classical stochastic modelling approaches, such as the autoregressive moving average and the periodic autoregressive modelling frameworks (Stedinger & Taylor 1982; Salas 1985; Box *et al.* 2015), have often been used to generate synthetic streamflow sequences, particularly for monthly and annually averaged flow sequences (O'Connell 1974; Mohammadi *et al.* 2006; Sadek 2016; Musa 2013; O'Connell & O'Donnell 2014). Several other mathematical, statistical and computational modelling techniques have been investigated for synthetic streamflow generation, including the artificial neural networks approach (Stedinger & Taylor 1982; Ochoa-Riveria *et al.* 2002; Ahmed & Sarma 2007), the wavelet transform scheme (Wang *et al.* 2011), gamma autoregressive modelling (Fernandez & Salas 1990), modified *k*-nearest neighbour modelling (Prairie *et al.* 2006) and other non-parametric stochastic modelling approaches (Sharma *et al.* 1997). However, the application of these modelling schemes appears to be considerably less common in the development of efficient stochastic models for the simulation of daily streamflow sequences (possibly due to the large variability in the daily flow dataset) (Wang & Ding 2007; Martins & Sadeeq 2011; Can *et al.* 2012). Most of these daily streamflow modelling approaches were developed in the context of large international catchments with a distinct seasonality relative to catchments in the UK.

In this context, the Markov model and its variants have been extensively investigated for the simulation of synthetic daily streamflow (Yapo *et al.* 1993; Xu *et al.* 2001; Szilagyi *et al.* 2006; Augustin *et al.* 2008; Pender *et al.* 2016). A semi-Markov modelling approach that fits the Markov process within a multinomial logit modelling framework has been developed for the synthetic simulation of daily streamflow sequences (Augustin *et al.* 2008). This approach uses the Markov model to estimate the state transition probabilities of the flow values and fits a multinomial logit autoregressive scheme to link climatic information (observed at time *t* and in the past) for the generation of synthetic flow projections. Unlike the traditional Markov scheme, this approach estimates flow rates within each state by the random sampling of values from the associated continuous distributions of flow values observed within the state. This approach may introduce unexpected errors in the simulated flow, mainly because the sampling is potentially biased towards high probability values of observed flow.

Pender *et al.* (2016) developed a novel HMM-based approach that overcame this issue. The structural design of the HMM scheme provides a robust sampling methodology via the systematic selection of all possible values within a particular state based on the individual probabilities of the values. The flexible structure of the HMM approach facilitates the systematic integration of the distribution of extreme values to effectively simulate the extreme upper limit of the flow values. To attain the optimum efficiency in simulating extreme values (which is important in all fluvial flood hazard studies), they thoroughly compared two possible versions of the modelling schematics: the HMM integrated with the generalized extreme value (HMM-GEV) distribution and the HMM integrated with the generalized Pareto (HMM-GP) distribution. A robust assessment of both modelling schemes was conducted for four hydrologically and morphologically distinct catchments in the UK and the results suggested that the HMM-GP approach provides more effective results than the HMM-GEV approach. The HMM-GP framework has been reported to simulate multiple realizations of flow sequences from long-term measured flow gauge input data. The methodology retains the general statistics of the observed river flow records, but essentially reorders the magnitude, spacing and frequency of river flows. This generates multiple realistic alternatives of artificial flow sequences that can be used, for example, as the inflow boundary conditions of hydraulic flood models to assess the sensitivity of flood hazards to the flow sequence.

Demyanov *et al.* (this volume, in review) investigated the potential of machine learning techniques, specifically self-organizing maps, for the unsupervised clustering of sedimentary logs, with applications in exploring sedimentological patterns in river deposits. The HMM-GP approach presented here and the approach of Demyanov *et al.* (this volume, in review) are complementary and have transformative potential across a wide range of applications, including hydrology and geology.

## Data: stations, sources and processing

### Data stations

This study focused on four UK rivers: the Don, Nith, Dee and Tweed. These four catchments were specifically selected as a result of their similar river mobility, slope and urban extent (Table 1; Fig. 1). The topographic location and characterization of these four rivers were carefully considered to give a systematic spatial representation across Scotland. The Nith is located in SW Scotland, the Tweed in SE Scotland, and the Dee and Don in NE Scotland. All the selected river catchments have urban developments in their mid- to lower catchment areas and agricultural land in their headwater areas, although the extent of urban development is generally <0.5%. The catchment area of the four rivers ranges from 799 to 4390 km^{2} and the variation in river slope is in the range 111–169 m m^{−1}. The average recorded daily flow rate at the location furthest upstream of each river is <13 m^{3} s^{−1} and the downstream gauging stations show mean daily flows in the range 28–81 m^{3} s^{−1}, broadly scaling with the total catchment area.

### Data source

The 15-minute gauged dataset was provided by the Scottish Environment Protection Agency (SEPA), who participated in this research as a key project partner and technical adviser.

### Data processing

Fifteen distinct gauging stations across the rivers Don, Nith, Dee and Tweed were systematically analysed. The gauging stations were selected for the availability of 15-minute data, the length of records and the minimal amount of missing data. Most of the data provided by SEPA for this investigation were complete (>95% data continuity in all instances). Some missing data were noted and these were carefully recorded, investigated and infilled following standard interpolation techniques, such as *X*(*t*) = [*X*(*t* − 1) + *X*(*t* + 1)]/2. The HMM-GP model was fitted to all 15 distinct gauging stations and its efficiency was examined thoroughly. However, for the purpose of illustration, we selected one gauging station from each river, specifically Haughton for the Don, Friars Carse for the Nith, Woodend for the Dee and Norham for the Tweed. The criteria used to select these gauging stations were: gauged flow records for >30 years; mostly urbanized catchments; areas of known flood risk; and frequent use in previous studies. Figure 2 shows example annual hydrographs from 1990 for each river at the selected gauging station.
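
The interpolation rule quoted above can be sketched as a small helper; this is an illustrative stand-in, not the actual processing code used for the SEPA records:

```python
def infill_gaps(series):
    """Fill isolated missing values (None) with the mean of the two
    neighbouring observations, X(t) = [X(t-1) + X(t+1)] / 2.
    Longer gaps and missing endpoints are left untouched here."""
    filled = list(series)
    for t in range(1, len(filled) - 1):
        if (filled[t] is None
                and filled[t - 1] is not None
                and filled[t + 1] is not None):
            filled[t] = 0.5 * (filled[t - 1] + filled[t + 1])
    return filled

# e.g. a single missing 15-minute value is replaced by the neighbour mean
flows = [12.0, 11.5, None, 10.3, 9.8]
print(infill_gaps(flows))  # [12.0, 11.5, 10.9, 10.3, 9.8]
```

In practice a longer run of missing values would require a more careful treatment than this single-point rule.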

To demonstrate the monthly variability in flow rate, Figure 3 is an ordered monthly box plot showing the distribution of five summary statistics (5th percentile, 25th percentile, 50th percentile, 75th percentile and 95th percentile, with the maximum values listed on top of each box plot) for the four rivers at the specified gauging stations. The summary statistics were estimated for continuous 15-minute observations recorded for each month over the specified period, i.e. 1972–2015 for the Don at Haughton, 1958–2015 for the Nith at Friars Carse, 1973–2015 for the Dee at Woodend and 1960–2015 for the Tweed at Norham. The summary statistics vary considerably across the rivers and across the typical UK seasons, i.e. winter (December–February), spring (March–May), summer (June–August) and autumn (September–November). Another interesting observation from the modelling point of view is the large gap between the 95th percentile values and the maximum flow values in all four rivers, which clearly highlights the need for an effective modelling scheme to simulate extreme flow values.

## Stochastic modelling framework

The HMM is a popular stochastic modelling approach and has been successfully applied to model a range of complex processes in fields such as bioinformatics, speech recognition, molecular evolution, stock markets, natural language processing, and human and animal behaviour (Baum & Petrie 1966; Rabiner 1989; Durbin *et al.* 1998; Manning & Schuetze 1999). Patidar *et al.* (2016*a*, *b*) investigated the efficiency of HMM-based approaches in generating synthetic electricity demand profiles at a 1-minute resolution in parallel with autoregressive integrated moving average (ARIMA)-based models. The key idea underpinning the proposed methodology exploits the fact that a time series is mainly composed of three components: the trend, the seasonal variation and a random component. The trend and seasonal components of a time series are attributed to the deterministic process, whereas the random component is attributed to the uncertainty in the system. Time series deseasonalization techniques are conventionally applied to segregate the time series into these three components. To generate synthetic streamflow series, the HMM modelling procedure as described by Patidar *et al.* (2016*a*, *b*) and Pender *et al.* (2016) involves the deseasonalization of the time series through the standard approach described in equation (1):
*Q*_{Deseasonalized}(*t*) = [log *Q*(*t*) − *Q*_{μ}]/*Q*_{σ} (1)

where *Q*_{μ} and *Q*_{σ} are the monthly mean and standard deviation, respectively, of the specified period of the log-transformed daily streamflow series (where *t* represents the daily value) – that is, the monthly mean and standard deviation values were estimated for each month of each year along the full length of the time series (e.g. for a 30-year time series, 30 × 12 = 360 mean and standard deviation values were estimated). Each observed daily value (which was log-transformed) was then processed according to equation (1) with the respective monthly mean and standard deviation of the period in which it occurred.
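
Equation (1) and its inverse can be sketched as follows; the function and variable names are hypothetical, and the generic period label (e.g. a (year, month) pair) stands in for the monthly periods described above:

```python
import math
from collections import defaultdict

def deseasonalize(flows, periods):
    """Equation (1): standardize log-flows by the mean (Q_mu) and standard
    deviation (Q_sigma) of the period each value falls in.  Assumes
    positive flows with some variability within every period."""
    logq = [math.log(q) for q in flows]
    groups = defaultdict(list)
    for x, p in zip(logq, periods):
        groups[p].append(x)
    mu = {p: sum(v) / len(v) for p, v in groups.items()}
    sigma = {p: (sum((x - mu[p]) ** 2 for x in v) / len(v)) ** 0.5
             for p, v in groups.items()}
    z = [(x - mu[p]) / sigma[p] for x, p in zip(logq, periods)]
    return z, mu, sigma

def reseasonalize(z, periods, mu, sigma):
    """Invert equation (1): multiply by Q_sigma, add Q_mu, undo the log."""
    return [math.exp(x * sigma[p] + mu[p]) for x, p in zip(z, periods)]
```

Applying `reseasonalize` to the output of `deseasonalize` recovers the original series, which is the round trip used when the HMM simulations are converted back to flows.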

The deseasonalization procedure described in equation (1) segregates the seasonal component from the observed series while leaving behind the remaining series (*Q*_{Deseasonalized}) consisting of the trend and a random component. The HMM is fitted to the deseasonalized (*Q*_{Deseasonalized}) part of the streamflow series and then simulated to generate *N* synthetic deseasonalized streamflow series, which are then reseasonalized (by multiplying by the corresponding value of *Q*_{σ} and then adding *Q*_{μ}) to output the synthetic daily streamflow series. The model has been robustly validated and shown to be highly efficient in accurately simulating daily streamflow realizations up to the 99th percentile of the observed flow values. However, the model appears to have limited applicability in simulating extreme flow values above the 99th percentile. The extreme high flow values are important for various reasons and need to be simulated effectively. To model the extreme values effectively, the generalized Pareto distribution was fitted to the observed flow values above the 99th percentile and the fitted distribution was then used to sample the extreme flow values for the synthetic flow series (Pender *et al.* 2016).
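
Sampling from a fitted generalized Pareto distribution can be done by inverting its cumulative distribution function. The sketch below illustrates the idea; the threshold, scale and shape values are hypothetical placeholders, not fitted parameters from the paper:

```python
import math
import random

def sample_gpd(n, threshold, scale, shape, seed=None):
    """Draw n exceedances from a generalized Pareto distribution via
    inverse-CDF sampling.  In practice threshold (u), scale (sigma) and
    shape (xi) would come from a fit to flows above the 99th percentile;
    here they are illustrative values only."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        u = rng.random()
        if abs(shape) < 1e-12:
            # xi -> 0 limit: exponential tail
            x = threshold - scale * math.log(1.0 - u)
        else:
            x = threshold + scale / shape * ((1.0 - u) ** (-shape) - 1.0)
        out.append(x)
    return out

# hypothetical fit: every sampled extreme exceeds the threshold flow
extremes = sample_gpd(5, threshold=250.0, scale=40.0, shape=0.1, seed=42)
```

These sampled values would replace the HMM-simulated values above the 99th percentile of the synthetic series.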

To effectively simulate the 15-minute streamflow sequences, we followed the HMM-GP methodology described by Pender *et al.* (2016) with some recent developments, replacing the simplistic approach in equation (1) for the deseasonalization of streamflow sequences with the classical STL procedure detailed by Cleveland *et al.* (1990). A typical time series is again mainly composed of three components: trend, seasonal and random. The trend and seasonal components can be attributed to the deterministic processes of the system, whereas the random component represents the uncertainty factors influencing the processes. The observed flow at time *t* (*O*_{t}) can be represented as the sum of the trend (*T*_{t}), seasonal (*S*_{t}) and random (*R*_{t}) components at time *t*, i.e. *O*_{t} = *T*_{t} + *S*_{t} + *R*_{t}. The STL process allows a systematic decomposition of the time series into these three distinct components.
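
STL itself uses iterated loess smoothing and is more involved; as a simpler stand-in, a classical additive decomposition illustrates how a series splits into the three components *O*_{t} = *T*_{t} + *S*_{t} + *R*_{t} (the function below is illustrative only and is not the STL procedure used in the paper):

```python
def decompose_additive(series, period):
    """Classical additive decomposition O_t = T_t + S_t + R_t (assumes an
    odd period so a simple centred moving average can be used).  Trend:
    centred moving average; seasonal: mean detrended value per phase;
    random: the remainder.  Endpoints without a full window stay None."""
    n = len(series)
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        trend[t] = sum(series[t - half:t + half + 1]) / period
    interior = [t for t in range(n) if trend[t] is not None]
    detrended = {t: series[t] - trend[t] for t in interior}
    means = []
    for phase in range(period):
        vals = [detrended[t] for t in interior if t % period == phase]
        means.append(sum(vals) / len(vals) if vals else 0.0)
    offset = sum(means) / period  # centre the seasonal component on zero
    seasonal = [means[t % period] - offset for t in range(n)]
    rand = [series[t] - trend[t] - seasonal[t] if trend[t] is not None else None
            for t in range(n)]
    return trend, seasonal, rand
```

For a series built from a linear trend plus a repeating seasonal pattern, the recovered random component is essentially zero, confirming the additive split.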

A step-by-step methodological framework was used to process the 15-minute streamflow sequences through the updated HMM-GP framework.

1. Take the log of the time series (this step transforms a multiplicative time series into an additive series).

2. Apply the STL time series deseasonalization procedure based on the loess process (Cleveland *et al.* 1990) to the log series. This step segregates the log time series into three components: trend, seasonal and a random component.

3. Fit an HMM to the random component of the time series and generate *N* (a user-specified integer value describing the desired number of synthetic series) simulations of the random component of user-specified length (we simulated synthetic series of similar length to the observed series).

4. Add each simulated random component to the corresponding seasonal and trend components for that period (the seasonal component quantifies the seasonal influences on the time series).

5. Resample the extreme values of each synthetic series from an extreme value distribution of the observed series, because an HMM approach can have limited success when modelling extreme (i.e. >99th percentile) values. Specifically, a generalized Pareto distribution was carefully fitted to the observed values above the 99th percentile.

The mechanism developed to fit an HMM approach (consisting of five components) integrated with the generalized Pareto distribution to streamflow sequences has been detailed step by step by Pender *et al.* (2016).
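
The state-based sampling idea underlying step 3 can be illustrated with a greatly simplified, fully observed Markov chain over discretized flow states. This is a conceptual sketch only, not the five-component HMM of Pender *et al.* (2016): a true HMM keeps the states hidden and estimates them with forward–backward recursions, whereas here the states are defined directly by quantile bins:

```python
import random
from bisect import bisect_right

def fit_markov_states(values, n_states):
    """Discretize values into equiprobable quantile bins (states), estimate
    the state-transition matrix from consecutive pairs, and keep the
    observed values per state so simulation samples real magnitudes."""
    sv = sorted(values)
    edges = [sv[len(sv) * k // n_states] for k in range(1, n_states)]
    states = [bisect_right(edges, v) for v in values]
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1
    trans = []
    for row in counts:
        total = sum(row)
        trans.append([c / total for c in row] if total
                     else [1.0 / n_states] * n_states)
    pools = [[] for _ in range(n_states)]
    for v, s in zip(values, states):
        pools[s].append(v)
    return trans, pools

def simulate(trans, pools, length, seed=None):
    """Walk the fitted chain, emitting an observed value from each state."""
    rng = random.Random(seed)
    n = len(trans)
    state = rng.randrange(n)
    out = []
    for _ in range(length):
        out.append(rng.choice(pools[state]))
        state = rng.choices(range(n), weights=trans[state])[0]
    return out
```

A simulated series therefore reorders the observed magnitudes while respecting the state-to-state transition probabilities, which is the essence of the sampling behaviour described above.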

## Model validation

This section presents a thorough investigation of the efficiency of the proposed HMM-GP framework in simulating the 15-minute streamflow sequences. The HMM-GP model has been fitted following the specified methodology for all four rivers and for the time periods specified in Table 1. To facilitate a robust procedure for model validation, various statistical characteristics of the synthetic streamflow sequences were rigorously compared with the observed series.

### Comparison of annual statistics

Figures 4–7 compare the annual percentile statistics (5th, 25th, 50th, 75th and 95th percentiles) collated and estimated for each year during the observation period for the observed and 100 synthetic streamflow profiles for the Don at Haughton, the Nith at Friars Carse, the Dee at Woodend and the Tweed at Norham. To generalize the representation, black dotted lines with solid circles are used to represent the estimated statistics for the observed time series and solid brown circles are used to represent the corresponding estimated statistics for the 100 synthetic profiles. All the percentiles were estimated for the 15-minute streamflow data and systematically collated for each year over the entire period of the observation record. For all four rivers and for all the percentiles considered, the synthetic profiles appear to follow the trend of the observed dataset over the entire period of the observations. The statistics estimated for the 100 synthetic profiles are close to the corresponding statistics estimated for the observed dataset, but with some minor variations across both ends (which is anticipated because the synthetic profiles represent an artificial scenario that may be realistically plausible). Another observation of note is the increase in the variability of the estimated percentile statistics for the 100 synthetic profiles with increasing percentile values, which may be attributed to the larger absolute magnitudes of the higher percentile values. To demonstrate the effectiveness and robustness of the modelling techniques, five percentile values were selected to give a uniform selection from the whole range (0–100th percentiles).

To examine the robustness of the HMM-GP scheme, we conducted an extensive comparison of the probability density distributions, quantiles (0–98th percentiles with a step size of 1) and autocorrelation functions (ACFs) for the observed and 100 synthetic streamflow realizations of the four rivers (Figure 8). The thick solid black lines in Figure 8 show the calculated statistics for the observed time series and the solid brown lines display the corresponding 100 synthetic streamflow realizations. The statistics were estimated for the entire available 15-minute streamflow dataset and simulations over the observation period for the Don at Haughton, the Nith at Friars Carse, the Dee at Woodend and the Tweed at Norham. From the analysis in Figure 8, it can be seen that the probability density distribution and the quantile graphs for the 100 synthetic realizations are close to the probability density distribution and quantile lines of the observed time series of all four rivers.

Autocorrelation, considered to be the signature property of a time series, can be defined as the correlation of the time series with its own past and future values (also referred to as lagged correlation or serial correlation). The autocorrelation of a time series is analysed using an ACF plot, which measures the correlation of the time series at time *t* with the past values at lag times (*t* − 1), (*t* − 2), (*t* − 3) … successively. As expected, Figure 8 shows that the observed streamflow time series has a strong autocorrelation (ACF > 0.6) for all four rivers. The ACF lines for the corresponding 100 synthetic series appear to closely follow the trends of the observed series with a consistent minor difference in the range 0.1–0.2.
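
The sample ACF described here can be computed directly; a minimal sketch (standard definition, not code from the paper):

```python
def acf(series, max_lag):
    """Sample autocorrelation function: correlation of the series with
    itself shifted by 1, 2, ..., max_lag steps, normalized by the
    variance of the full series."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    out = []
    for lag in range(1, max_lag + 1):
        cov = sum((series[t] - mean) * (series[t + lag] - mean)
                  for t in range(n - lag))
        out.append(cov / var)
    return out

# a strongly persistent series (e.g. a slow ramp) has lag-1 ACF near 1,
# mirroring the ACF > 0.6 persistence reported for the observed flows
ramp = [0.1 * t for t in range(100)]
```

Comparing `acf(observed, k)` with `acf(synthetic, k)` over a range of lags is exactly the comparison shown in the ACF panels of Figure 8.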

This analysis suggests that the proposed HMM-GP scheme is efficient in capturing the key annual statistical features of the observed streamflow sequences within the 100 synthetic sequences.

### Comparison of extreme percentiles

It is widely accepted that most computational models are unable to model extreme values effectively. In the context of the present work, this is mainly because the distance between the distinct values observed in the high percentile ranges varies significantly compared with the rest of the dataset. For example, the distance between the 99th and 100th percentiles is *c.* 75–800 m^{3} s^{−1} for the Don, *c.* 500–1500 m^{3} s^{−1} for the Tweed and *c.* 200–1500 m^{3} s^{−1} for the Nith and Dee. Thus only a few discrete points are available for modelling extreme events, which is not sufficient to capture their dynamics, and such extreme events often act as outliers within the modelling schematic. One of the unique features of the proposed HMM-GP schematic is that the integration of the distribution of extreme values within the modelling schematics significantly improves the overall ability of the HMM-GP model to simulate extreme values.

To demonstrate the effectiveness of the modelling techniques in simulating extreme values, Figure 9 provides a robust comparison of extreme percentile values starting from the 99th percentile (with a step size of 0.1) and ending at the 100th percentile for the observed (black dotted lines with solid circles) and corresponding 100 synthetic streamflow profiles (solid circles) for all four rivers. The extreme value percentile analysis was conducted on the full 15-minute streamflow time series covering the entire observation period for each river. Figure 9 clearly shows the potential of the HMM-GP model in simulating extreme values. The extreme percentile values of the 100 synthetic series appear to follow the trend of the observed values with some variations (as expected) in both directions. The modelling schematic appears to be effective in simulating flow values up to the 99.99th percentile, although some large variations along the maximum values have been noted. The observed large variation in the simulated maximum flow values can be statistically minimized by introducing a cut-off threshold for the maximum value (depending on the intended application of the HMM-GP model) in the synthetic series, although this refinement has been left out of this study for the purposes of illustration.

### Comparison of monthly statistics

To further explore the effectiveness of the HMM-GP framework in conserving the monthly statistical characteristics of the observed time series in simulated realizations, we now consider the systematic percentile analysis conducted for January and July for all four rivers. January and July represent the dynamic variations during a winter and a summer month, respectively (Figures 10 and 11). For the purposes of illustration, the 5th, 50th and 95th percentiles are compared for the observed and corresponding 100 synthetic streamflow profiles. The 5th and the 95th percentiles cover the extreme ranges, whereas the 50th percentile is used to represent a measure of central tendency. For all the displayed graphs, the black dotted lines with solid circles represent the estimated statistics of the observed flow time series and the brown solid circles represent the estimated statistics for the corresponding 100 synthetic realizations. The percentiles were estimated for the 15-minute observed and simulated streamflow data by systematically collating the January and July data for each year over the observation period for all four rivers. The 100 synthetic realizations of the observed dataset for all four rivers appear to follow the trend of the observed series across the observation period with some variations (as expected) in both directions.

## Conclusions

We have examined the suitability of the HMM-GP approach, originally developed for synthesizing daily streamflow sequences, for simulating long historical 15-minute streamflow time series. To investigate the efficiency of the proposed methodology, we made a thorough comparison of several statistical characteristics of the observed time series with 100 synthetically simulated series. The HMM-GP approach has been shown to successfully simulate long historical time series of streamflow across different months of the year and over the entire period of observation for all four rivers in this study. Extreme events are at the heart of flood risk management/assessment studies. It has been shown that the integration of extreme value distributions within the HMM framework allows the accurate simulation of high flow values occurring above the 99th percentile (rare events). This is the unique and probably most outstanding feature of the proposed HMM-GP schematic, which significantly enhances its overall ability to simulate high flow extreme events. The HMM-GP framework has been successfully applied to simulate daily streamflow (Pender *et al.* 2016) and energy demand profiles at a fine temporal resolution of 1 minute (Patidar *et al.* 2016*b*). The HMM-GP framework is transformative and is suitable for applications in a number of different areas of engineering and science, but specifically for a range of water-related disaster issues, where it can be used to support decision-making in flood control and management frameworks (Kuwajima *et al.* this volume, in review).

The HMM-GP approach presented here synthesizes streamflow data using the observed flow data and does not account for any future changes in the factors influencing river flow processes (e.g. climate change). The modelling framework of the HMM-GP can be further developed to integrate the influence of factors affecting river flow processes. Pender *et al.* (2016) reported preliminary work demonstrating the integration of climate variables, specifically precipitation, within the HMM-GP model. The integration of the STL-based decomposition of time series, a novel addition to the HMM-GP approach presented here, can be further exploited in understanding the influence of climate variables on streamflow simulations. The overall scope of the HMM-GP framework presented here is enormous and will be investigated further to examine the possibility of integrating various geological and climatic factors within the modelling framework to realize their impact on river flow processes.

The methodology proposed in this paper has the following advantages.

- The HMM-GP approach can be used to generate *N* (a sufficiently large integer value) synthetic series to serve a range of potential applications.

- The methodology requires no additional information other than the historical observed data series and can be trained on a relatively small dataset with sufficient statistical features.

- All the synthetic series generated by the model represent realistically plausible alternative scenarios and have statistical properties that are the same as (or very close to) those of the historical observed series.

- The proposed methodology can be easily adapted to different timescales (daily, hourly, monthly or annual) for a range of plausible real-world applications.

## Acknowledgements

The 15-minute data and other technical information on the river case studies were provided by the Scottish Environment Protection Agency.

## Funding

This work was conducted in association with the Scottish Environment Protection Agency and is part of EPSRC's Impact Acceleration Project and the Maths Foresees funded feasibility projects. This work was funded by the Engineering and Physical Sciences Research Council.

© 2018 The Author(s). Published by The Geological Society of London. All rights reserved