A High-Level Survey of Synthetic Data Generation in IoT and New Emerging Methods

Developing an IoT system requires input data to test its hardware and software. An IoT system has two main external inputs: the user interface (UI) and the sensors. UI input can always be emulated, but sensor data is difficult to emulate. The general practice is to collect a small amount of data from the “things” that the IoT system will control, monitor, govern, and optimize, and to use it for testing in emulation mode. Such data is usually limited because most “things” operate in a narrow band of system parameters.

However, IoT software development needs data that exercises edge cases, anomalies, and sudden events. In IoT software, the handling of edge cases, anomalies, and special cases makes up 60 to 70 percent of the code. It is therefore important to synthesize data according to the specifications and requirements. This is what synthetic data means in IoT: data generated by simulation or similar numerical processes that is characteristically very similar to, and almost indistinguishable from, the actual data. In this blog, we will mostly talk about current practices and processes for developing synthetic data for IoT using machine learning.

Synthetic data can be generated using rule-based systems or by modelling the dynamic system behind the thing for which the IoT system is being developed. In some cases, modelling and rule-based methods are the most effective, but they are not general enough to be incorporated into an IoT platform library. Moreover, they need a lot of expert domain knowledge and are cumbersome and slow to compute. That is where the data-oriented approach of traditional machine learning or deep learning shines. Given a limited set of time series (note that each measured observable state of the thing generates a time series) that are realistic and include sensor and observational noise, the goal is to generate the data required for all kinds of system testing, including edge cases, anomalies, failures, noise, and scenarios.
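To make the rule-based option concrete, here is a minimal sketch, assuming a single sensor channel held in a NumPy array. The function name inject_faults, its parameters, and the sinusoidal baseline are all hypothetical choices made for illustration; a real rule set would come from the domain experts mentioned above.

```python
import numpy as np

def inject_faults(series, rng, noise_std=0.05, spike_at=None,
                  spike_size=5.0, stuck_from=None):
    """Return a copy of `series` with sensor noise and optional faults.

    All fault types and default values here are illustrative assumptions.
    """
    out = np.asarray(series, dtype=float).copy()
    # Additive Gaussian noise stands in for sensor/observational noise.
    out += rng.normal(0.0, noise_std, size=out.shape)
    if spike_at is not None:
        # A single-sample spike models a sudden event.
        out[spike_at] += spike_size
    if stuck_from is not None:
        # A stuck sensor keeps repeating its last reading.
        out[stuck_from:] = out[stuck_from]
    return out

rng = np.random.default_rng(42)
t = np.arange(200)
baseline = 20.0 + 2.0 * np.sin(2.0 * np.pi * t / 50.0)  # hypothetical normal behaviour
faulty = inject_faults(baseline, rng, spike_at=60, stuck_from=150)
```

Each rule is easy to read and control, but every new fault type needs another hand-written rule, which is the generality problem noted above.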

Common time series models are the Autoregressive (AR), Moving Average (MA), Autoregressive Moving Average (ARMA), and Autoregressive Integrated Moving Average (ARIMA) models, by which a stationary time series can be modelled. Once the time series is modelled, the fitted model (the orders p, d, q and the estimated coefficients, in the case of ARIMA) can be used to generate new data with different residual noise. Further, if we have many sets of time series data for various system parameter values (called conditions, latent variables, or regression variables depending on the algorithm), a predictor can be built to forecast the model parameters for a new set of system parameters. The time series for various edge cases can be predicted in this way. We can also synthesize time series for various scenarios for testing and development of the IoT system. A nonstationary time series can be made stationary by differencing (the role of the d term in ARIMA) or similar transformations. In that case, obtaining the edge cases through regression is not straightforward, but it can still be done.
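The fit-then-regenerate idea can be sketched in a few lines. This is a simplified illustration, assuming a pure AR(p) model estimated by ordinary least squares with NumPy rather than a full ARIMA fit. The names fit_ar and simulate_ar, the order p = 2, and the stand-in "measured" series are assumptions made for the example.

```python
import numpy as np

def fit_ar(series, p):
    """Least-squares estimate of the mean, AR(p) coefficients, and residual std."""
    x = np.asarray(series, dtype=float)
    mean = x.mean()
    x = x - mean
    # The row for time t holds x[t-1], ..., x[t-p].
    lagged = np.column_stack([x[p - k: len(x) - k] for k in range(1, p + 1)])
    target = x[p:]
    coeffs, *_ = np.linalg.lstsq(lagged, target, rcond=None)
    residuals = target - lagged @ coeffs
    return mean, coeffs, residuals.std()

def simulate_ar(mean, coeffs, sigma, n, rng, burn_in=200):
    """Generate n samples from an AR model driven by fresh Gaussian noise."""
    p = len(coeffs)
    total = n + burn_in
    noise = rng.normal(0.0, sigma, size=total + p)
    x = np.zeros(total + p)
    for t in range(p, total + p):
        # x[t-p:t][::-1] is x[t-1], ..., x[t-p], matching the coefficient order.
        x[t] = coeffs @ x[t - p:t][::-1] + noise[t]
    return mean + x[-n:]  # drop the burn-in samples

# Stand-in for a measured sensor series (hypothetical values).
measured = simulate_ar(20.0, np.array([0.6, -0.2]), 0.5, 5000,
                       np.random.default_rng(0))
mean, coeffs, sigma = fit_ar(measured, p=2)

# New synthetic series: same fitted dynamics, different residual noise.
synthetic = simulate_ar(mean, coeffs, sigma, 300, np.random.default_rng(1))
```

For a nonstationary series, the same two functions could be applied to np.diff(series), with the synthetic output integrated back using np.cumsum.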

Deep learning generative models handle non-stationarity more elegantly. These models can be built using Generative Adversarial Networks (GANs), diffusion models, Variational Autoencoders (VAEs), or their variants. These methods are more versatile because:

They capture the nonlinear characteristics of the data, resulting in more realistic data generation.
Data generalization between the model and the system is smoother.
They generalize better for a system and capture its dynamics (especially simple ones) more effectively.
They can generate a high volume of data without needing complicated domain knowledge.

On the negative side, these methods need much more computation and are sometimes infamous for diverging during training. GAN, cGAN (Conditional GAN), SeqGAN, and TimeGAN are the prominent algorithms used today to generate synthetic data. Note that VAE and diffusion methods produce similar results, and similar variational algorithms exist. Most of these methods are well suited to mildly nonlinear systems with discrete system parameters. In all of them, system parameters are included as latent or condition variables in the vector of variables. Different algorithms propose and experiment with various loss functions to achieve smooth convergence. The major criticisms of these methods are:

cGANs are often trained on large volumes of data. Unfortunately, we usually have only a few time series for system parameters.
Because of this lack of data, algorithms often collapse.
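To illustrate how system parameters enter a conditional generator as condition variables, here is a minimal NumPy sketch. It only shows the conditioning step: the latent noise is concatenated with the condition vector before being passed to the generator. The tiny two-layer network uses random, untrained weights, and every name and dimension (make_generator_input, toy_generator, the load and temperature values) is a hypothetical choice; a real cGAN would train this generator against a discriminator in a deep learning framework.

```python
import numpy as np

def make_generator_input(noise, condition):
    """Concatenate latent noise with the condition (system parameters)."""
    return np.concatenate([noise, condition], axis=1)

def toy_generator(gen_input, weights):
    """Untrained two-layer network mapping the input to a time-series window."""
    w1, b1, w2, b2 = weights
    hidden = np.tanh(gen_input @ w1 + b1)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
batch, noise_dim, cond_dim, hidden_dim, seq_len = 4, 8, 2, 16, 24

# Random weights stand in for a trained generator.
weights = (
    rng.normal(scale=0.1, size=(noise_dim + cond_dim, hidden_dim)),
    np.zeros(hidden_dim),
    rng.normal(scale=0.1, size=(hidden_dim, seq_len)),
    np.zeros(seq_len),
)

noise = rng.normal(size=(batch, noise_dim))
# Hypothetical system parameters: normalized load and temperature.
condition = np.tile([[0.8, 35.0]], (batch, 1))

gen_input = make_generator_input(noise, condition)
fake_windows = toy_generator(gen_input, weights)  # one window per batch row
```

Because the condition is just another part of the input vector, requesting data for a different operating point only means changing the condition values. The data-scarcity criticism above applies to how well the trained network honours them.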

DoppelGANger uses two distinct discriminators, one for the latent variables and one for the observed state variables, and therefore combines the power of regression and GAN. In this category of algorithms, CcGAN (Continuous conditional GAN) shows a lot of promise in solving the problem of nonlinear system parameters. It treats the system parameters as continuous and proposes a new loss function. It also introduces the concepts of hard and soft vicinity to ensure computational stability in nonlinear systems. To date, the algorithm has mainly been applied to generating image data, but it shows tremendous promise of generalizing to IoT systems as well.
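The hard and soft vicinity ideas can be sketched as sample weights around a target value of a continuous system parameter. This is a simplified reading of the concept, not the full CcGAN loss: the hard version keeps only samples whose label lies within a window of half-width kappa, while the soft version down-weights samples smoothly with distance. The function names and the values of kappa and nu are assumptions for illustration.

```python
import numpy as np

def hard_vicinity_weights(labels, target, kappa):
    """1.0 for samples whose label is within kappa of the target, else 0.0."""
    return (np.abs(labels - target) <= kappa).astype(float)

def soft_vicinity_weights(labels, target, nu):
    """Weights that decay smoothly with squared distance from the target."""
    return np.exp(-nu * (labels - target) ** 2)

# Hypothetical normalized system-parameter values attached to five series.
labels = np.array([0.10, 0.48, 0.50, 0.53, 0.90])

hard = hard_vicinity_weights(labels, target=0.5, kappa=0.05)
soft = soft_vicinity_weights(labels, target=0.5, nu=100.0)
```

Either weighting lets series recorded at nearby parameter values contribute to training at a target value for which little or no data exists, which is what makes the approach attractive when only a few time series are available.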

Note that most well-designed “things” operate with linear or mildly nonlinear dynamics, and for them the data-oriented algorithms described above are advised. On the other hand, for a complicated dynamic system that goes through drastic state changes in a very short period, it may be better to use a properly modelled digital twin to create the synthetic data.

For more details, please read below:

A Survey of Time Series Data Generation in IoT by Chuchen Hu, Zihan Sun, Chao Li, Yong Zang, Chunchiao Xing, Sensors, MDPI
Synthetic Data for Deep Learning by Sergey I. Nikolenko, Synthesis.ai, San Francisco, CA, and Steklov Institute of Mathematics at St. Petersburg, Russia, Sep 2019
Synthetic Data Generation for the Internet of Things by Jason W. Anderson, K. E. Kennedy, Linh B. Ngo, Andre Luckow, Amy W. Apon, School of Computing, Clemson University, Clemson, SC
