Some ML algorithms (e.g., random forests) work very nicely with missing data. No data cleaning is required when using these algorithms. In addition to not breaking down amid missing data, these algorithms use the fact of “missingness” as a feature to predict with. This compensates for when the missing points are not randomly missing.
Or, rather than dodge the problem, although that might be the best approach, you can impute the missing values and work from there. Here, very simple ML algorithms that look for the nearest data point (K-Nearest Neighbors) and infer its value work well. Simplicity here can be optimal because the modeling in data cleaning should not be mixed with the modeling in forecasting.
There are also remedies for missing data in time series. The challenge of time series data is that relationships exist, not just between variables, but between variables and their preceding states. And, from the point of view of a historical data point, relationships exist with the future states of the variables.
For the sake of predicting missing values, a data set can be augmented by including lagged values and negative-lagged values (i.e., future values). This, now-wider, augmented data set will have correlated predictors. The regularization trick can be used to forecast missing points with the available data. And, a strategy of repeatedly sampling, forecasting, and then averaging the forecasts can be used. Or, a similar turnkey approach is to use principal component analysis (PCA) following a similar strategy where a meta-algorithm will repeatedly impute, project, and refit until the imputed points stop changing. This is easier said than done, but it is doable.
Author(s): David Romoff
Publication Date: February 2022
Publication Site: Risks & Rewards, SOA