Adventures in Smarshland

September 27, 2009

Replicating the Temperature Anomaly Dataset

Filed under: Uncategorized — smarsh @ 10:34 pm

I am and will always be a skeptic. For this reason, I have sought to replicate the dataset which is basis for the research in global temperature change debate.  One of the great achievement of technology recently is the ability to do very close to the same level of research using commodity hardware as that of major government funded projects.  Furthermore, due to the open source software movement much of the necessary software to scrap together large datasets and analyze them is free.  Even better much of the world’s government data is now freely available, or obtainable through FOIA requests.  The data I used to try to corroborate the global temperature anomaly time series is available through NOAA, through the National Climatic Data Center.

I used their ftp access and some linux commands to extract all available data.  Unfortunately, the data come in a separate tar file for each year inside of which is a .gz file for each weather station.  So using some further linux commands I untarred and unzipped these files.  Then I concatenated each of these files together and deleted the header rows.  First you will need to install wget, which can be accomplished through synaptic package manager.  The rest of the command can be executed from a standard Ubuntu terminal.  Of course you will first need to cd into an appropriate directory.

wget -nc -r -nd -A “*.tar”
cat *.tar | tar xv -ignore-blocks *.tar

for z in *.gz; do gunzip $z; done
for z in *.op; do cat $z >> data.txt; done

sed -i ‘/STN/d’ data.txt

Then I fired up a MySQL instance and loaded the giant 15 Gb file.  I found the following tutorial helpful for setting up the MySQL database.  The installation can be done with Synaptic package manager, but getting all of the permissions and the database created can be a little bit confusing.  After the database was set up I used a GUI front end, MySQL Administrator, to help craft the SQL queries a little easier.

The initial load of the data can be accomplished with the following query:

LOAD DATA LOCAL INFILE ‘/media/Backup Drive/GSOD/data.txt’ INTO TABLE weather.test
SET station=SUBSTR(var,1,6),

From this I summarized the data to the yearly level.  I then did some analysis of how clean the data were.  I eliminated any observations with less than 360 days of data per year.  I then executed some queries to find the set of weather stations that had no missing data in the period between at least 1973 to 2008 and no gaps if the station’s data extends further.  In other words these constraints serve to make sure the data is largely well reported and continuously reported.

Here is the SQL code to generate the data described above:

CREATE TABLE weather.aggregate (SELECT station, year, AVG(temp) AS temp, COUNT(*) AS cnt
FROM weather.test
WHERE year <= 2008
GROUP BY station, year
ORDER BY station, year)

DELETE FROM weather.aggregate WHERE cnt < 360 OR cnt > 366

CREATE TABLE weather.stations (SELECT station, COUNT(*) as num_years, (MAX(year)-MIN(year)+1) AS year_range, MAX(year) AS max_year from weather.aggregate GROUP BY station)

DELETE FROM weather.stations WHERE year_range > num_years OR max_year < 2008 OR num_years < 36

Now we need to index the data to some fixed time period so we can analyze the temperature variation across time.  In statistical terms this is just “de-meaning” the data allowing for a set of dummy variables across stations.  Since the 1973 to 2008 period is well reported in the entire set of stations we index to this period:

CREATE TABLE weather.mean_temp (SELECT aggregate.station, avg(aggregate.temp) AS mean_temp
FROM weather.aggregate, weather.stations
WHERE aggregate.year > 1972 AND aggregate.station = stations.station
GROUP BY station)

CREATE TABLE weather.aggregate2 (SELECT a.station, a.year, a.temp, (a.temp – s.mean_temp) AS temp_anom
FROM weather.aggregate a, weather.mean_temp s
WHERE a.station = s.station
ORDER BY a.station, a.year)

Now it is time to do some statistical analysis.  My favorite program for these purposes is R from CRAN.  R is a statistical programming language with strong support for objects types common to statistical analysis such as matrix and vector types.  Much of the cutting edge in statistical analysis is programmed in R and is heavily the favorite in the academic community.  It is also free open source software (FOSS).  R is similar to Firefox in that is comes with a lot of great base functionality, but much more can be unlocked through add-ins.

I took the last created table named aggregate2 and exported it as a comma seperated values file calling it resultset.csv

I fire up R and import this file to begin the analysis.  R is a a scripting language.  Here are the commands I executed in my script:

data2 = read.table(file=”/home/scott/resultset.csv”,sep=”,”,header=TRUE)
qplot(year,temp_anom,data=data2,geom=c(“point”,”smooth”,”jitter”),alpha=I(1/10),ylim=c(-4,4),main=”Global Temp. Anom.\nHigh Reliability Stations”,xlab=”Year”,ylab=”Temp (Deg. F)”)

I used variants on the qplot function to create the following graphics outputs:

Taking averages over stations across years we have the graph below, however this obscures the reliability of the data across years.  Before 1973 we only have at best 24 stations to draw from, so the standard error of the process mean explodes by a factor of 3.6.  Before 1946 there are fewer than 10 stations tapering down to only 1 station first two years of the time series.


Looking at the data from the higher reliability period of 1973 and forward we observe a pattern of increasing temperature.  A basic OLS regression estimates the increase at 4.8 deg. f per century, with at better than 0.001 percent confidence level.  However, OLS may not be an appropriate methodology given the time series nature of the data.  An ARIMA model would probably be much more appropriate, however ARIMA models are difficult to estimate with much precision when there are so few observation to build the model, much less validate or forecast from.  The data would probably be best estimated with at the daily level in a time series cross section framework (TSCS) with seasonal differencing.


The following graphs I think are instructive in understanding the reliability of the data in the pre-1973 period.  Each point is a single yearly station observation of annual mean temperature.  The data are slightly perturbed randomly and made semi-transparent so you can get a better sense of the density of the data.  As you can see the estimate of the process mean before 1973 is quite a bit more volatile than the period 1973 and afterward.  However, due to the penalty function used in the estimating basis function of the smoother, the post-1973 period is over-smoothed and the pre-1973 is under-smoothed.


According to the fit of the smoothing function on only the post-1973 period, it appears that the process mean is about 1.7 deg. f. higher over the course of the 36 year period.  I still remain skeptical of strength of the data to make long range forecast of this time series, mainly because of the short duration of the data and inability to build the model and test it on a holdout period.


It is also important to understand the geospatial element of the data.  The following is a graphic of the locations of the station used in the analysis.  We lack homogeneity especially in the southern hemisphere and no data in the ocean region aside from a few islands in the pacific.



  1. I actually upgraded him to the latest version a few weeks back.

    Comment by Daniel Marsh — September 27, 2009 @ 11:24 pm

  2. […] he will have beaten world record holder Haile Gebrselassie’s time of 2h:03m:59. And because Scott likes statistics, here is a graph that proves it! I have assumed that Haile doesn’t improve any with age, but […]

    Pingback by The Deliverator – Wannabee » Blog Archive » Marathon x 2 — October 6, 2009 @ 1:47 am

RSS feed for comments on this post. TrackBack URL

Leave a comment

Powered by WordPress