Near, Far, Wherever You Are 

 Tobler's First Law of Geography states that "everything is related to everything else, but near things are more related than distant things."  There are many examples -- an infection is more likely to spread to a nearby person than to a distant one, a new highway might depress house prices for people living right next to it, and so on.  The point is that there can be important dependencies and heterogeneities that vary across space.  And in those cases the usual assumption that observations or errors are independently distributed doesn't hold. Urgh.  Welcome to the world of spatial statistics. 

 As an estimation problem this is often addressed through clustering methods.  Households in a village with some infected persons are at higher risk than households in neighboring villages.  Or are they really?  Clustering works when the locations are relatively homogeneous and well separated.  What if there is no good way to classify observations into clusters, for example, if an area is evenly populated?  Or if the infected household lives right at the end of the village road, and some of its neighbors are in the other village?  The administrative boundaries commonly used for clustering (village name) might not properly account for actual proximity, or whatever else defines the distance between observations.  If a transmitting mosquito wouldn't care much about the village name when deciding who to bite next, why should an analyst rely on it? 
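 The end-of-the-road problem can be made concrete with a small sketch.  The coordinates, village labels, and the 1 km threshold below are all made up for illustration, and a distance band is only one of several ways to define "neighbors"; the point is simply to compare a village-based neighbor definition with a distance-based one.

```python
import numpy as np

# Hypothetical household locations (in km) along a road, with village labels.
# These values are invented purely for illustration.
coords = np.array([
    [0.0, 0.0],  # village A
    [0.5, 0.0],  # village A
    [3.0, 0.0],  # village A, at the end of the village road
    [3.2, 0.0],  # village B, just across the boundary
    [6.0, 0.0],  # village B
])
village = np.array(["A", "A", "A", "B", "B"])


def distance_band_weights(coords, threshold):
    """Binary weights: 1 if two distinct points lie within `threshold` of each other."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    w = (dist <= threshold).astype(float)
    np.fill_diagonal(w, 0.0)  # a household is not its own neighbor
    return w


def same_cluster_weights(labels):
    """Binary weights: 1 if two distinct observations share a cluster label."""
    w = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(w, 0.0)
    return w


w_dist = distance_band_weights(coords, threshold=1.0)  # assumed 1 km band
w_vill = same_cluster_weights(village)

# Households 2 and 3 are 200 m apart but in different villages.
print("distance-based neighbors:", bool(w_dist[2, 3]))
print("village-based neighbors: ", bool(w_vill[2, 3]))
```

 Under these assumed numbers the two households at the boundary count as neighbors by distance but not by village name, while households 0 and 2 share a village despite being 3 km apart.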

 Clustering is often a good approximation, but in some cases it is not good enough: there can be substantial spatial lags (the outcome at one location depends on outcomes at nearby locations), spatial errors (error terms are correlated across space) and spatial heterogeneity (model parameters vary across space).  Those can lead to biased estimates, inefficient ones, or both.  The bad news is that those effects can matter a lot.  The good news is that there are methods to test for spatial dependence and correlation, and estimation techniques to deal with them. 
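 One widely used test for spatial dependence is Moran's I, which measures whether neighboring observations tend to have similar values.  The sketch below is a minimal version under stated assumptions: the ten locations on a line, the adjacency weights, and the outcome values are all hypothetical, and the permutation test is just one simple way to get a p-value (dedicated spatial packages offer more complete implementations).

```python
import numpy as np


def morans_i(y, w):
    """Moran's I for values `y` and a spatial weights matrix `w` (zero diagonal)."""
    y = np.asarray(y, dtype=float)
    z = y - y.mean()          # deviations from the mean
    n = y.size
    s0 = w.sum()              # sum of all weights
    return (n / s0) * (z @ w @ z) / (z @ z)


def permutation_p_value(y, w, n_perm=999, seed=0):
    """One-sided pseudo p-value for positive spatial autocorrelation.

    Shuffles the values across locations and counts how often the shuffled
    Moran's I is at least as large as the observed one.
    """
    rng = np.random.default_rng(seed)
    observed = morans_i(y, w)
    sims = np.array([morans_i(rng.permutation(y), w) for _ in range(n_perm)])
    p = (np.sum(sims >= observed) + 1) / (n_perm + 1)
    return observed, p


# Hypothetical setup: 10 locations on a line, each adjacent to its immediate neighbors.
n = 10
w = np.zeros((n, n))
for i in range(n - 1):
    w[i, i + 1] = 1.0
    w[i + 1, i] = 1.0

clustered = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])    # similar values sit together
alternating = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])  # neighbors always differ

i_clustered, p_clustered = permutation_p_value(clustered, w)
i_alternating = morans_i(alternating, w)

print("clustered:   I =", round(i_clustered, 3), " p =", round(p_clustered, 3))
print("alternating: I =", round(i_alternating, 3))
```

 With these made-up values the clustered pattern gives I = 7/9 (about 0.78), and the alternating pattern gives I = -1.  Values near zero would be what we expect if location did not matter.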

 Of course the underlying interactions we are trying to better capture can be anything from linear to more complicated relations.  It is unlikely that they are perfectly well described by any abstract spatial model, so we will still need to make assumptions.  But at least there are some methods that can handle cases where the usual assumptions fail, and they can make an important difference to the analysis.  I will write more about them in later blog entries.  Meanwhile you might be interested in the following texts: 

-- James LeSage's Econometrics Toolbox (www.spatial-econometrics.com) has an excellent workbook discussing spatial econometrics and examples for the MATLAB functions provided on the same site 
-- Anselin (2002) "Under the Hood: Issues in the Specification and Interpretation of Spatial Regression Models" Agricultural Economics 27: 247-267 provides a quick overview of the issues 
-- Anselin (1988) Spatial Econometrics: Methods and Models is the classic and widely quoted reference for spatial statistics