Adventures in Smarshland

September 27, 2009

Replicating the Temperature Anomaly Dataset

Filed under: Uncategorized — smarsh @ 10:34 pm

I am and will always be a skeptic. For this reason, I have sought to replicate the dataset which is basis for the research in global temperature change debate.  One of the great achievement of technology recently is the ability to do very close to the same level of research using commodity hardware as that of major government funded projects.  Furthermore, due to the open source software movement much of the necessary software to scrap together large datasets and analyze them is free.  Even better much of the world’s government data is now freely available, or obtainable through FOIA requests.  The data I used to try to corroborate the global temperature anomaly time series is available through NOAA, through the National Climatic Data Center.

I used their ftp access and some linux commands to extract all available data.  Unfortunately, the data come in a separate tar file for each year inside of which is a .gz file for each weather station.  So using some further linux commands I untarred and unzipped these files.  Then I concatenated each of these files together and deleted the header rows.  First you will need to install wget, which can be accomplished through synaptic package manager.  The rest of the command can be executed from a standard Ubuntu terminal.  Of course you will first need to cd into an appropriate directory.

wget -nc -r -nd -A “*.tar”
cat *.tar | tar xv -ignore-blocks *.tar

for z in *.gz; do gunzip $z; done
for z in *.op; do cat $z >> data.txt; done

sed -i ‘/STN/d’ data.txt

Then I fired up a MySQL instance and loaded the giant 15 Gb file.  I found the following tutorial helpful for setting up the MySQL database.  The installation can be done with Synaptic package manager, but getting all of the permissions and the database created can be a little bit confusing.  After the database was set up I used a GUI front end, MySQL Administrator, to help craft the SQL queries a little easier.

The initial load of the data can be accomplished with the following query:

LOAD DATA LOCAL INFILE ‘/media/Backup Drive/GSOD/data.txt’ INTO TABLE weather.test
SET station=SUBSTR(var,1,6),

From this I summarized the data to the yearly level.  I then did some analysis of how clean the data were.  I eliminated any observations with less than 360 days of data per year.  I then executed some queries to find the set of weather stations that had no missing data in the period between at least 1973 to 2008 and no gaps if the station’s data extends further.  In other words these constraints serve to make sure the data is largely well reported and continuously reported.

Here is the SQL code to generate the data described above:

CREATE TABLE weather.aggregate (SELECT station, year, AVG(temp) AS temp, COUNT(*) AS cnt
FROM weather.test
WHERE year <= 2008
GROUP BY station, year
ORDER BY station, year)

DELETE FROM weather.aggregate WHERE cnt < 360 OR cnt > 366

CREATE TABLE weather.stations (SELECT station, COUNT(*) as num_years, (MAX(year)-MIN(year)+1) AS year_range, MAX(year) AS max_year from weather.aggregate GROUP BY station)

DELETE FROM weather.stations WHERE year_range > num_years OR max_year < 2008 OR num_years < 36

Now we need to index the data to some fixed time period so we can analyze the temperature variation across time.  In statistical terms this is just “de-meaning” the data allowing for a set of dummy variables across stations.  Since the 1973 to 2008 period is well reported in the entire set of stations we index to this period:

CREATE TABLE weather.mean_temp (SELECT aggregate.station, avg(aggregate.temp) AS mean_temp
FROM weather.aggregate, weather.stations
WHERE aggregate.year > 1972 AND aggregate.station = stations.station
GROUP BY station)

CREATE TABLE weather.aggregate2 (SELECT a.station, a.year, a.temp, (a.temp – s.mean_temp) AS temp_anom
FROM weather.aggregate a, weather.mean_temp s
WHERE a.station = s.station
ORDER BY a.station, a.year)

Now it is time to do some statistical analysis.  My favorite program for these purposes is R from CRAN.  R is a statistical programming language with strong support for objects types common to statistical analysis such as matrix and vector types.  Much of the cutting edge in statistical analysis is programmed in R and is heavily the favorite in the academic community.  It is also free open source software (FOSS).  R is similar to Firefox in that is comes with a lot of great base functionality, but much more can be unlocked through add-ins.

I took the last created table named aggregate2 and exported it as a comma seperated values file calling it resultset.csv

I fire up R and import this file to begin the analysis.  R is a a scripting language.  Here are the commands I executed in my script:

data2 = read.table(file=”/home/scott/resultset.csv”,sep=”,”,header=TRUE)
qplot(year,temp_anom,data=data2,geom=c(“point”,”smooth”,”jitter”),alpha=I(1/10),ylim=c(-4,4),main=”Global Temp. Anom.\nHigh Reliability Stations”,xlab=”Year”,ylab=”Temp (Deg. F)”)

I used variants on the qplot function to create the following graphics outputs:

Taking averages over stations across years we have the graph below, however this obscures the reliability of the data across years.  Before 1973 we only have at best 24 stations to draw from, so the standard error of the process mean explodes by a factor of 3.6.  Before 1946 there are fewer than 10 stations tapering down to only 1 station first two years of the time series.


Looking at the data from the higher reliability period of 1973 and forward we observe a pattern of increasing temperature.  A basic OLS regression estimates the increase at 4.8 deg. f per century, with at better than 0.001 percent confidence level.  However, OLS may not be an appropriate methodology given the time series nature of the data.  An ARIMA model would probably be much more appropriate, however ARIMA models are difficult to estimate with much precision when there are so few observation to build the model, much less validate or forecast from.  The data would probably be best estimated with at the daily level in a time series cross section framework (TSCS) with seasonal differencing.


The following graphs I think are instructive in understanding the reliability of the data in the pre-1973 period.  Each point is a single yearly station observation of annual mean temperature.  The data are slightly perturbed randomly and made semi-transparent so you can get a better sense of the density of the data.  As you can see the estimate of the process mean before 1973 is quite a bit more volatile than the period 1973 and afterward.  However, due to the penalty function used in the estimating basis function of the smoother, the post-1973 period is over-smoothed and the pre-1973 is under-smoothed.


According to the fit of the smoothing function on only the post-1973 period, it appears that the process mean is about 1.7 deg. f. higher over the course of the 36 year period.  I still remain skeptical of strength of the data to make long range forecast of this time series, mainly because of the short duration of the data and inability to build the model and test it on a holdout period.


It is also important to understand the geospatial element of the data.  The following is a graphic of the locations of the station used in the analysis.  We lack homogeneity especially in the southern hemisphere and no data in the ocean region aside from a few islands in the pacific.


September 10, 2009

Heathcare: Hopefully Information and Analysis You Haven’t Heard Yet

Filed under: Uncategorized — smarsh @ 7:24 pm

Passing universal single-payer federal health insurance legislation has been the political ambition of many influential politicians for decades to no avail.  It strikes me, upon a little reflection, this is by design (of our Constitution).  The Reserved Powers clause of the Tenth Amendment, given no prior stipulation of the federal government’s authority to tax for and provide health insurance.  A reading as such should be interpreted to be reserved to the States to solve the issue in a fashion that protects their rights to fashion a law appropriate to their specifics circumstances.  Little attention has been paid to the attempts several states have made at providing health coverage.  Hawaii is one notable state that does have a “public” option.  But their are many other liberal states which would likely support single-payer systems, so the question becomes, if not, why not?  The answer I believe is two-fold, but first a history lesson.
Why is it today that the typical pooled risk group through which health insurance coverage is purchased is one’s employer?  The system arose during WWII when the federal government imposed wage restrictions and price controls.  So, firms, knowing they must provide a compelling reason to attract and retain top talent began providing perquisites which were not regulated by the IRS, one such instance was health insurance.  Note that the same tax exempt status of health insurance and other perquisites still remain today.  Naturally firms will maintain these attractive features because it is cheaper to provide them on a pretax basis.  In essence health insurance through employers exists today for two reasons, its inception was spurred by artificially imposed wage rigidities and remains because of tax avoidance strategies.  Other pooled groups could otherwise be formed to solve the adverse selection problem such as schools, churches, unions, knitting clubs, etc.
But why haven’t states been able to solve the issue at least where it has a high degree of popular support.  Alas, many have tried, but have failed not for lack of legislative support, but because of federal legislation which “preempts” states rights.  States such as California, Colorado, Michigan, and Minnesota have all attempted legislation which levies a payroll tax to effectively create an assigned risk insurance pool.  Sadly, another breach of the Tenth Amendment has nullified such laws.  Section 514 of the Employee Retirement Income Security Act of 1974 (ERISA) has been the avenue by which corporations have appealed to the federal circuit courts to intervene on their behalf.  Hawaii is the only state other than Massachusetts which has state level health coverage, to the best of my knowledge.  Hawaii is able to provide coverage because their system was set up before the 1974 legislation and is hence grandfathered.  Massachusetts has a barely workable system because the fee charged to employers who don’t provide coverage is so nominal they have not banded through collective action to challenge the legislation under s.514 of ERISA, but if they did they would likely be successful in getting it overturned.
I would like to provide for an assigned risk pool coverage option so basic health coverage can be afforded by all, even if I have to subsidize coverage for others, for the same reason I support the socialized provision of fire departments.  I believe, when poor and rich alike are treated for communicable diseases and other basic health problems everyone benefits, call it an externality if you like.  My bigger beef is the solutions best suited for different states are likely very different in form.  Washington state for instance has a relatively youthful population, lower rates of smoking, higher rates of physical active people.  Washington state probably needs more sport medicine doctors to keep active people active, something a federal cookie cutter approach likely would not recognize.  On the opposite end of the spectrum, Ohio probably needs more dietitians and smoking cessation specialists given high rates of obesity and smoking.

I want to cover several additional topics and will do so when I have the time:
-the market efficiency of high deductibles
-ending federal corn subsidies as a method of improving health and the consequent overproduction of corn syrup
-The competing “public option” as a violation of 5th amendment’s contravention against deprivation of private property with due compensation

Powered by WordPress