We report an estimate of the Earth’s average land surface temperature for the period 1753 to 2011. To address issues of potential station selection bias, we used a larger sampling of stations than had prior studies. For the post-1880 period, our estimate is similar to those previously reported by other groups, although we report smaller uncertainties. The land temperature rise from the 1950s decade to the 2000s decade is 0.90 ± 0.05°C (95% confidence). Both maximum and minimum temperatures have increased during the last century. Diurnal variations decreased from 1900 to 1987 and then increased; this increase is significant but not understood. The period of 1753 to 1850 is marked by sudden drops in land surface temperature that are coincident with known volcanism; the response function is approximately 1.5 ± 0.5°C per 100 Tg of atmospheric sulfate. This volcanism, combined with a simple proxy for anthropogenic effects (logarithm of the CO2 concentration), reproduces much of the variation in the land surface temperature record; the fit is not improved by the addition of a solar forcing term. Thus, for this very simple model, solar forcing does not appear to contribute to the observed global warming of the past 250 years; the entire change can be modeled by a sum of volcanism and a single anthropogenic proxy. The residual variations include interannual and multi-decadal variability very similar to that of the Atlantic Multidecadal Oscillation (AMO).
A global land–ocean temperature record has been created by combining the Berkeley Earth monthly land temperature field with a spatially kriged version of the HadSST3 dataset. This combined product spans the period from 1850 to present and covers the majority of the Earth’s surface: approximately 57 % in 1850, 75 % in 1880, 95 % in 1960, and 99.9 % by 2015. It includes average temperatures in 1° × 1° latitude–longitude grid cells for each month when available. It provides a global mean temperature record quite similar to records from Hadley’s HadCRUT4, NASA’s GISTEMP, NOAA’s GlobalTemp, and Cowtan and Way, and it provides a spatially complete and homogeneous temperature field. Two versions of the record are provided, treating areas with sea ice cover as either air temperature over sea ice or sea surface temperature under sea ice, the former being preferred for most applications. The choice of how to assess the temperature of areas with sea ice coverage has a notable impact on global anomalies over recent decades due to rapid warming of air temperatures in the Arctic. Accounting for the rapid warming of Arctic air suggests ~0.1 °C more global-average temperature rise since the 19th century than is found in temperature series that do not capture the changes in the Arctic. Updated versions of this dataset will be presented each month at the Berkeley Earth website (http://berkeleyearth.org/data/, last access: November 2020), and a convenience copy of the version discussed in this paper has been archived and is freely available at https://doi.org/10.5281/zenodo.3634713 (Rohde and Hausfather, 2020).
Author(s): Robert A. Rohde and Zeke Hausfather
Citation: Rohde, R. A. and Hausfather, Z.: The Berkeley Earth Land/Ocean Temperature Record, Earth Syst. Sci. Data, 12, 3469–3479, https://doi.org/10.5194/essd-12-3469-2020, 2020.
Some ML algorithms (e.g., random forests) cope well with missing data: no data cleaning is required before training. Beyond simply not breaking down amid missing values, these algorithms can use the fact of “missingness” itself as a feature to predict with. This compensates for cases in which points are not missing at random.
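The missingness-as-a-feature idea can be sketched directly. In this minimal example (toy data, numpy only; the sentinel value and column layout are illustrative assumptions, not a fixed recipe), each original column gains an indicator column marking where values were absent, so a tree-based learner could split on the pattern of missingness itself:

```python
import numpy as np

# Hypothetical toy feature matrix with missing entries (NaN).
X = np.array([[1.0, np.nan],
              [2.0, 5.0],
              [np.nan, 6.0]])

# "Missingness" indicator columns: 1.0 where a value was absent.
indicators = np.isnan(X).astype(float)

# Replace NaNs with an out-of-range sentinel a tree can isolate,
# then append the indicators as extra predictive features.
X_filled = np.where(np.isnan(X), -999.0, X)
X_aug = np.hstack([X_filled, indicators])

print(X_aug)  # 3 rows, 4 columns: filled values plus one indicator per column
```

The augmented matrix can then be fed to any tree-based learner; the indicator columns carry the missingness signal even though the sentinel itself is arbitrary.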
Alternatively, rather than dodge the problem (though dodging might be the best approach), you can impute the missing values and work from there. Here, very simple ML algorithms that look up the nearest data points (K-nearest neighbors) and infer the missing value from them work well. Simplicity can be optimal here, because the modeling done during data cleaning should not be confounded with the modeling done for forecasting.
There are also remedies for missing data in time series. The challenge of time series data is that relationships exist not just between variables, but between variables and their preceding states. And, from the point of view of a historical data point, relationships also exist with the future states of the variables.
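These past-and-future relationships can be made explicit by widening a series with shifted copies of itself. A minimal sketch (toy series, numpy only; the helper name `lag_augment` and the NaN edge-padding are my assumptions) appends one lagged and one "negative-lagged" (future) column:

```python
import numpy as np

# Sketch: widen a series with lagged and negative-lagged (future) copies,
# so each time step carries its neighbors' states as predictors.
# Edge positions with no past/future are padded with NaN.
def lag_augment(y, lags=(1,), leads=(1,)):
    cols = [np.asarray(y, dtype=float)]
    for k in lags:
        cols.append(np.concatenate([[np.nan] * k, y[:-k]]))  # value at t - k
    for k in leads:
        cols.append(np.concatenate([y[k:], [np.nan] * k]))   # value at t + k
    return np.column_stack(cols)

y = np.array([10.0, 11.0, 12.0, 13.0])
A = lag_augment(y)
print(A)  # columns: y[t], y[t-1], y[t+1]
```

Each row now sees both its predecessor and its successor, which is exactly the information an imputation model for an interior gap wants to exploit.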
For the sake of predicting missing values, a data set can be augmented by including lagged values and negative-lagged values (i.e., future values). This now wider, augmented data set will have correlated predictors. The regularization trick can be used to forecast missing points from the available data, together with a strategy of repeatedly sampling, forecasting, and then averaging the forecasts. A similar turnkey approach is to use principal component analysis (PCA) in the same spirit: a meta-algorithm repeatedly imputes, projects, and refits until the imputed points stop changing. This is easier said than done, but it is doable.
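The impute–project–refit loop can be sketched concretely. This is a minimal version of iterative PCA imputation under stated assumptions (a toy rank-1 data set, a single component, my own function name `pca_impute`; production work would use a tested library routine): fill gaps with column means, reconstruct with the top principal component, overwrite the imputed entries with the reconstruction, and repeat until they stop changing.

```python
import numpy as np

# Iterative PCA imputation sketch: impute, project onto the top
# components, refit, and repeat until the imputed points stabilize.
def pca_impute(X, n_components=1, tol=1e-8, max_iter=2000):
    X = X.astype(float)
    mask = np.isnan(X)
    # Start from column-mean imputation.
    X_hat = np.where(mask, np.nanmean(X, axis=0), X)
    for _ in range(max_iter):
        mu = X_hat.mean(axis=0)
        U, s, Vt = np.linalg.svd(X_hat - mu, full_matrices=False)
        recon = mu + (U[:, :n_components] * s[:n_components]) @ Vt[:n_components]
        change = np.max(np.abs(recon[mask] - X_hat[mask]))
        X_hat[mask] = recon[mask]  # only the missing entries are updated
        if change < tol:
            break
    return X_hat

# Toy rank-1 data: column 2 is exactly twice column 1, one entry missing.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, np.nan]])
X_done = pca_impute(X)
print(X_done)
```

On this toy example the loop converges to the value consistent with the one-component structure (close to 6.0), while the observed entries are left untouched; on real data the number of components and the stopping tolerance become genuine modeling choices.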