MONOGRAPHS ON STATISTICS AND APPLIED PROBABILITY General Editors V. Isham, N. Keiding, T. Louis, N. Reid, R. Tibshirani, and H. Tong 1 Stochastic Population Models in Ecology and Epidemiology M.S. Barlett (1960) 2 Queues D.R. Cox and W.L. Smith (1961) 3 Monte Carlo Methods J.M. Hammersley and D.C. Handscomb (1964) 4 The Statistical Analysis of Series of Events D.R. Cox and P.A.W. Lewis (1966) 5 Population Genetics W.J. Ewens (1969) 6 Probability, Statistics and Time M.S. Barlett (1975) 7 Statistical Inference S.D. Silvey (1975) 8 The Analysis of Contingency Tables B.S. Everitt (1977) 9 Multivariate Analysis in Behavioural Research A.E. Maxwell (1977) 10 Stochastic Abundance Models S. Engen (1978) 11 Some Basic Theory for Statistical Inference E.J.G. Pitman (1979) 12 Point Processes D.R. Cox and V. Isham (1980) 13 Identification of Outliers D.M. Hawkins (1980) 14 Optimal Design S.D. Silvey (1980) 15 Finite Mixture Distributions B.S. Everitt and D.J. Hand (1981) 16 Classification A.D. Gordon (1981) 17 Distribution-Free Statistical Methods, 2nd edition J.S. Maritz (1995) 18 Residuals and Influence in Regression R.D. Cook and S. Weisberg (1982) 19 Applications of Queueing Theory, 2nd edition G.F. Newell (1982) 20 Risk Theory, 3rd edition R.E. Beard, T. Pentikäinen and E. Pesonen (1984) 21 Analysis of Survival Data D.R. Cox and D. Oakes (1984) 22 An Introduction to Latent Variable Models B.S. Everitt (1984) 23 Bandit Problems D.A. Berry and B. Fristedt (1985) 24 Stochastic Modelling and Control M.H.A. Davis and R. Vinter (1985) 25 The Statistical Analysis of Composition Data J. Aitchison (1986) 26 Density Estimation for Statistics and Data Analysis B.W. Silverman (1986) 27 Regression Analysis with Applications G.B. Wetherill (1986) 28 Sequential Methods in Statistics, 3rd edition G.B. Wetherill and K.D. Glazebrook (1986) 29 Tensor Methods in Statistics P. McCullagh (1987) 30 Transformation and Weighting in Regression R.J. Carroll and D. Ruppert (1988) 31 Asymptotic Techniques for Use in Statistics O.E. Bandorff-Nielsen and D.R. Cox (1989) 32 Analysis of Binary Data, 2nd edition D.R. Cox and E.J. Snell (1989) 33 Analysis of Infectious Disease Data N.G. Becker (1989) 34 Design and Analysis of Cross-Over Trials B. Jones and M.G. Kenward (1989) 35 Empirical Bayes Methods, 2nd edition J.S. Maritz and T. Lwin (1989) 36 Symmetric Multivariate and Related Distributions K.T. Fang, S. Kotz and K.W. Ng (1990)

© 2004 by CRC Press LLC

37 Generalized Linear Models, 2nd edition P. McCullagh and J.A. Nelder (1989) 38 Cyclic and Computer Generated Designs, 2nd edition J.A. John and E.R. Williams (1995) 39 Analog Estimation Methods in Econometrics C.F. Manski (1988) 40 Subset Selection in Regression A.J. Miller (1990) 41 Analysis of Repeated Measures M.J. Crowder and D.J. Hand (1990) 42 Statistical Reasoning with Imprecise Probabilities P. Walley (1991) 43 Generalized Additive Models T.J. Hastie and R.J. Tibshirani (1990) 44 Inspection Errors for Attributes in Quality Control N.L. Johnson, S. Kotz and X. Wu (1991) 45 The Analysis of Contingency Tables, 2nd edition B.S. Everitt (1992) 46 The Analysis of Quantal Response Data B.J.T. Morgan (1992) 47 Longitudinal Data with Serial Correlation—A State-Space Approach R.H. Jones (1993) 48 Differential Geometry and Statistics M.K. Murray and J.W. Rice (1993) 49 Markov Models and Optimization M.H.A. Davis (1993) 50 Networks and Chaos—Statistical and Probabilistic Aspects O.E. Barndorff-Nielsen, J.L. Jensen and W.S. Kendall (1993) 51 Number-Theoretic Methods in Statistics K.-T. Fang and Y. Wang (1994) 52 Inference and Asymptotics O.E. Barndorff-Nielsen and D.R. Cox (1994) 53 Practical Risk Theory for Actuaries C.D. Daykin, T. Pentikäinen and M. Pesonen (1994) 54 Biplots J.C. Gower and D.J. Hand (1996) 55 Predictive Inference—An Introduction S. Geisser (1993) 56 Model-Free Curve Estimation M.E. Tarter and M.D. Lock (1993) 57 An Introduction to the Bootstrap B. Efron and R.J. Tibshirani (1993) 58 Nonparametric Regression and Generalized Linear Models P.J. Green and B.W. Silverman (1994) 59 Multidimensional Scaling T.F. Cox and M.A.A. Cox (1994) 60 Kernel Smoothing M.P. Wand and M.C. Jones (1995) 61 Statistics for Long Memory Processes J. Beran (1995) 62 Nonlinear Models for Repeated Measurement Data M. Davidian and D.M. Giltinan (1995) 63 Measurement Error in Nonlinear Models R.J. Carroll, D. Rupert and L.A. Stefanski (1995) 64 Analyzing and Modeling Rank Data J.J. Marden (1995) 65 Time Series Models—In Econometrics, Finance and Other Fields D.R. Cox, D.V. Hinkley and O.E. Barndorff-Nielsen (1996) 66 Local Polynomial Modeling and its Applications J. Fan and I. Gijbels (1996) 67 Multivariate Dependencies—Models, Analysis and Interpretation D.R. Cox and N. Wermuth (1996) 68 Statistical Inference—Based on the Likelihood A. Azzalini (1996) 69 Bayes and Empirical Bayes Methods for Data Analysis B.P. Carlin and T.A Louis (1996) 70 Hidden Markov and Other Models for Discrete-Valued Time Series I.L. Macdonald and W. Zucchini (1997)

© 2004 by CRC Press LLC

71 Statistical Evidence—A Likelihood Paradigm R. Royall (1997) 72 Analysis of Incomplete Multivariate Data J.L. Schafer (1997) 73 Multivariate Models and Dependence Concepts H. Joe (1997) 74 Theory of Sample Surveys M.E. Thompson (1997) 75 Retrial Queues G. Falin and J.G.C. Templeton (1997) 76 Theory of Dispersion Models B. Jørgensen (1997) 77 Mixed Poisson Processes J. Grandell (1997) 78 Variance Components Estimation—Mixed Models, Methodologies and Applications P.S.R.S. Rao (1997) 79 Bayesian Methods for Finite Population Sampling G. Meeden and M. Ghosh (1997) 80 Stochastic Geometry—Likelihood and computation O.E. Barndorff-Nielsen, W.S. Kendall and M.N.M. van Lieshout (1998) 81 Computer-Assisted Analysis of Mixtures and Applications— Meta-analysis, Disease Mapping and Others D. Böhning (1999) 82 Classification, 2nd edition A.D. Gordon (1999) 83 Semimartingales and their Statistical Inference B.L.S. Prakasa Rao (1999) 84 Statistical Aspects of BSE and vCJD—Models for Epidemics C.A. Donnelly and N.M. Ferguson (1999) 85 Set-Indexed Martingales G. Ivanoff and E. Merzbach (2000) 86 The Theory of the Design of Experiments D.R. Cox and N. Reid (2000) 87 Complex Stochastic Systems O.E. Barndorff-Nielsen, D.R. Cox and C. Klüppelberg (2001) 88 Multidimensional Scaling, 2nd edition T.F. Cox and M.A.A. Cox (2001) 89 Algebraic Statistics—Computational Commutative Algebra in Statistics G. Pistone, E. Riccomagno and H.P. Wynn (2001) 90 Analysis of Time Series Structure—SSA and Related Techniques N. Golyandina, V. Nekrutkin and A.A. Zhigljavsky (2001) 91 Subjective Probability Models for Lifetimes Fabio Spizzichino (2001) 92 Empirical Likelihood Art B. Owen (2001) 93 Statistics in the 21st Century Adrian E. Raftery, Martin A. Tanner, and Martin T. Wells (2001) 94 Accelerated Life Models: Modeling and Statistical Analysis Vilijandas Bagdonavicius and Mikhail Nikulin (2001) 95 Subset Selection in Regression, Second Edition Alan Miller (2002) 96 Topics in Modelling of Clustered Data ˇ Marc Aerts, Helena Geys, Geert Molenberghs, and Louise M. Ryan (2002) 97 Components of Variance D.R. Cox and P.J. Solomon (2002) 98 Design and Analysis of Cross-Over Trials, 2nd Edition Byron Jones and Michael G. Kenward (2003) 99 Extreme Values in Finance, Telecommunications, and the Environment Bärbel Finkenstädt and Holger Rootzén (2003) 100 Statistical Inference and Simulation for Spatial Point Processes Jesper Møller and Rasmus Plenge Waagepetersen (2004) 101 Hierarchical Modeling and Analysis for Spatial Data Sudipto Banerjee, Bradley P. Carlin, and Alan E. Gelfand (2004)

© 2004 by CRC Press LLC

Monographs on Statistics and Applied Probability 101

Hierarchical Modeling and Analysis for Spatial Data

Sudipto Banerjee Bradley P. Carlin Alan E. Gelfand

CHAPMAN & HALL/CRC A CRC Press Company Boca Raton London New York Washington, D.C.

© 2004 by CRC Press LLC

Library of Congress Cataloging-in-Publication Data Banerjee, Sudipto. Hierarchical modeling and analysis for spatial data / Sudipto Banerjee, Bradley P. Carlin, Alan E. Gelfand. p. cm. — (Monographs on statistics and applied probability : 101) Includes bibliographical references and index. ISBN 1-58488-410-X (alk. paper) 1. Spatial analysis (Statistics)—Mathematical models. I. Carlin, Bradley P. II. Gelfand, Alan E., 1945- III. Title. IV. Series. QA278.2.B36 2004 519.5—dc22

2003062652

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use. Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microﬁlming, and recording, or by any information storage or retrieval system, without prior permission in writing from the publisher. The consent of CRC Press LLC does not extend to copying for general distribution, for promotion, for creating new works, or for resale. Speciﬁc permission must be obtained in writing from CRC Press LLC for such copying. Direct all inquiries to CRC Press LLC, 2000 N.W. Corporate Blvd., Boca Raton, Florida 33431. Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identiﬁcation and explanation, without intent to infringe.

Visit the CRC Press Web site at www.crcpress.com © 2004 by Chapman & Hall/CRC No claim to original U.S. Government works International Standard Book Number 1-58488-410-X Library of Congress Card Number 2003062652 Printed in the United States of America 1 2 3 4 5 6 7 8 9 0 Printed on acid-free paper

© 2004 by CRC Press LLC

to Sharbani, Caroline, and Mary Ellen

© 2004 by CRC Press LLC

Contents Preface 1 Overview of spatial data problems

xv 1

2 Basics of point-referenced data models

21

1.1 Introduction to spatial data and models 1.1.1 Point-level models 1.1.2 Areal models 1.1.3 Point process models 1.2 Fundamentals of cartography 1.2.1 Map projections 1.2.2 Calculating distance on the earth's surface 1.3 Exercises

2.1 Elements of point-referenced modeling 2.1.1 Stationarity 2.1.2 Variograms 2.1.3 Isotropy 2.1.4 Variogram model tting 2.2 Spatial process models ? 2.2.1 Formal modeling theory for spatial processes 2.2.2 Covariance functions and spectra 2.2.3 Smoothness of process realizations 2.2.4 Directional derivative processes 2.2.5 Anisotropy 2.3 Exploratory approaches for point-referenced data 2.3.1 Basic techniques 2.3.2 Assessing anisotropy 2.4 Classical spatial prediction 2.5 Computer tutorials 2.5.1 EDA and variogram tting in S+SpatialStats 2.5.2 Kriging in S+SpatialStats 2.5.3 EDA, variograms, and kriging in geoR

© 2004 by CRC Press LLC

1 6 7 8 10 10 17 18

22 22 22 24 29 30 30 32 36 37 37 39 39 44 48 52 52 56 61

x

CONTENTS

2.6 Exercises

65

3 Basics of areal data models

69

4 Basics of Bayesian inference

99

3.1 Exploratory approaches for areal data 3.1.1 Measures of spatial association 3.1.2 Spatial smoothers 3.2 Brook's Lemma and Markov random elds 3.3 Conditionally autoregressive (CAR) models 3.3.1 The Gaussian case 3.3.2 The non-Gaussian case 3.4 Simultaneous autoregressive (SAR) models 3.5 Computer tutorials 3.5.1 Adjacency matrix construction in S+SpatialStats 3.5.2 SAR and CAR model tting in S+SpatialStats 3.5.3 Choropleth mapping using the maps library in S-plus 3.6 Exercises 4.1 Introduction to hierarchical modeling and Bayes' Theorem 4.2 Bayesian inference 4.2.1 Point estimation 4.2.2 Interval estimation 4.2.3 Hypothesis testing and model choice 4.3 Bayesian computation 4.3.1 The Gibbs sampler 4.3.2 The Metropolis-Hastings algorithm 4.3.3 Slice sampling 4.3.4 Convergence diagnosis 4.3.5 Variance estimation 4.4 Computer tutorials 4.4.1 Basic Bayesian modeling in R or S-plus 4.4.2 Advanced Bayesian modeling in WinBUGS 4.5 Exercises

5 Hierarchical modeling for univariate spatial data 5.1 Stationary spatial process models 5.1.1 Isotropic models 5.1.2 Bayesian kriging in WinBUGS 5.1.3 More general isotropic correlation functions 5.1.4 Modeling geometric anisotropy 5.2 Generalized linear spatial process modeling 5.3 Nonstationary spatial process models ? 5.3.1 Deformation 5.3.2 Kernel mixing of process variables

© 2004 by CRC Press LLC

70 71 75 76 79 79 83 84 88 88 90 92 95

99 103 103 104 105 110 111 113 116 116 119 120 120 122 125

129

129 130 134 135 141 146 149 151 152

CONTENTS

5.3.3 Mixing of process distributions 5.4 Areal data models 5.4.1 Disease mapping 5.4.2 Traditional models and frequentist methods 5.4.3 Hierarchical Bayesian methods 5.5 General linear areal data modeling 5.6 Comparison of point-referenced and areal data models 5.7 Exercises

xi

156 158 158 158 159 167 168 170

6 Spatial misalignment

175

7 Multivariate spatial modeling

217

8 Spatiotemporal modeling

255

6.1 Point-level modeling 6.1.1 Gaussian process models 6.1.2 Methodology for the point-level realignment 6.2 Nested block-level modeling 6.3 Nonnested block-level modeling 6.3.1 Motivating data set 6.3.2 Methodology for nonnested block-level realignment 6.4 Misaligned regression modeling 6.5 Exercises

7.1 Separable models 7.1.1 Spatial prediction, interpolation, and regression 7.1.2 Regression in the Gaussian case 7.1.3 Avoiding the symmetry of the cross-covariance matrix 7.1.4 Regression in a probit model 7.2 Coregionalization models ? 7.2.1 Coregionalization models and their properties 7.2.2 Unconditional and conditional Bayesian speci cations 7.2.3 Spatially varying coregionalization models 7.2.4 Model- tting issues 7.3 Other constructive approaches ? 7.4 Multivariate models for areal data 7.4.1 Motivating data set 7.4.2 Multivariate CAR (MCAR) theory 7.4.3 Modeling issues 7.5 Exercises

8.1 General modeling formulation 8.1.1 Preliminary analysis 8.1.2 Model formulation 8.1.3 Associated distributional results 8.1.4 Prediction and forecasting

© 2004 by CRC Press LLC

176 176 178 182 192 193 195 206 212

218 220 222 224 225 230 230 233 237 237 243 245 246 247 249 253

257 257 258 261 263

xii

CONTENTS

8.2 Point-level modeling with continuous time 8.3 Nonseparable spatiotemporal models ? 8.4 Dynamic spatiotemporal models ? 8.4.1 Brief review of dynamic linear models 8.4.2 Formulation for spatiotemporal models 8.5 Block-level modeling 8.5.1 Aligned data 8.5.2 Misalignment across years 8.5.3 Nested misalignment both within and across years 8.5.4 Nonnested misalignment and regression 8.6 Exercises

267 274 275 276 277 282 282 287 290 293 298

9 Spatial survival models

301

10 Special topics in spatial process modeling

343

Appendices

377

9.1 Parametric models 9.1.1 Univariate spatial frailty modeling 9.1.2 Spatial frailty versus logistic regression models 9.2 Semiparametric models 9.2.1 Beta mixture approach 9.2.2 Counting process approach 9.3 Spatiotemporal models 9.4 Multivariate models ? 9.5 Spatial cure rate models ? 9.5.1 Models for right- and interval-censored data 9.5.2 Spatial frailties in cure rate models 9.5.3 Model comparison 9.6 Exercises

10.1 Process smoothness revisited ? 10.1.1 Smoothness of a univariate spatial process 10.1.2 Directional nite dierence and derivative processes 10.1.3 Distribution theory 10.1.4 Directional derivative processes in modeling 10.2 Spatially varying coecient models 10.2.1 Approach for a single covariate 10.2.2 Multivariate spatially varying coecient models 10.2.3 Spatiotemporal data 10.2.4 Generalized linear model setting 10.3 Spatial CDFs 10.3.1 Basic de nitions and motivating data sets 10.3.2 Derived-process spatial CDFs 10.3.3 Randomly weighted SCDFs

© 2004 by CRC Press LLC

303 303 308 311 311 313 315 320 328 329 333 334 339

344 344 346 348 350 356 357 359 361 366 367 367 370 372

CONTENTS

xiii

A Matrix theory and spatial computing methods

379

B Answers to selected exercises References Author index Subject index

405 423 443 449

A.1 A.2 A.3 A.4 A.5

Gaussian elimination and LU decomposition Inverses and determinants Cholesky decomposition Fast Fourier transforms Strategies for large spatial and spatiotemporal data sets A.5.1 Subsampling A.5.2 Spectral methods A.5.3 Lattice methods A.5.4 Dimension reduction A.5.5 Coarse- ne coupling A.6 Slice Gibbs sampling for spatial process model tting A.6.1 Constant mean process with nugget A.6.2 Mean structure process with no pure error component A.6.3 Mean structure process with nugget A.7 Structured MCMC sampling for areal model tting A.7.1 Applying structured MCMC to areal data A.7.2 Algorithmic schemes

© 2004 by CRC Press LLC

379 383 384 385 387 387 388 389 390 391 392 395 396 397 398 400 402

Preface As recently as two decades ago, the impact of hierarchical Bayesian methods outside of a small group of theoretical probabilists and statisticians was minimal at best. Realistic models for challenging data sets were easy enough to write down, but the computations associated with these models required integrations over hundreds or even thousands of unknown parameters, far too complex for existing computing technology. Suddenly, around 1990, the \Markov chain Monte Carlo (MCMC) revolution" in Bayesian computing took place. Methods like the Gibbs sampler and the Metropolis algorithm, when coupled with ever-faster workstations and personal computers, enabled evaluation of the integrals that had long thwarted applied Bayesians. Almost overnight, Bayesian methods became not only feasible, but the method of choice for almost any model involving multiple levels incorporating random eects or complicated dependence structures. The growth in applications has also been phenomenal, with a particularly interesting recent example being a Bayesian program to delete spam from your incoming email (see popfile.sourceforge.net). Our purpose in writing this book is to describe hierarchical Bayesian methods for one class of applications in which they can pay substantial dividends: spatial (and spatiotemporal) statistics. While all three of us have been working in this area for some time, our motivation for writing the book really came from our experiences teaching courses on the subject (two of us at the University of Minnesota, and the other at the University of Connecticut). In teaching we naturally began with the textbook by Cressie (1993), long considered the standard as both text and reference in the eld. But we found the book somewhat uneven in its presentation, and written at a mathematical level that is perhaps a bit high, especially for the many epidemiologists, environmental health researchers, foresters, computer scientists, GIS experts, and other users of spatial methods who lacked signi cant background in mathematical statistics. Now a decade old, the book also lacks a current view of hierarchical modeling approaches for spatial data. But the problem with the traditional teaching approach went beyond the mere need for a less formal presentation. Time and again, as we presented

© 2004 by CRC Press LLC

xvi

PREFACE

the traditional material, we found it wanting in terms of its exibility to deal with realistic assumptions. Traditional Gaussian kriging is obviously the most important method of point-to-point spatial interpolation, but extending the paradigm beyond this was awkward. For areal (block-level) data, the problem seemed even more acute: CAR models should most naturally appear as priors for the parameters in a model, not as a model for the observations themselves. This book, then, attempts to remedy the situation by providing a fully Bayesian treatment of spatial methods. We begin in Chapter 1 by outlining and providing illustrative examples of the three types of spatial data: pointlevel (geostatistical), areal (lattice), and spatial point process. We also provide a brief introduction to map projection and the proper calculation of distance on the earth's surface (which, since the earth is round, can dier markedly from answers obtained using the familiar notion of Euclidean distance). Our statistical presentation begins in earnest in Chapter 2, where we describe both exploratory data analysis tools and traditional modeling approaches for point-referenced data. Modeling approaches from traditional geostatistics (variogram tting, kriging, and so forth) are covered here. Chapter 3 oers a similar presentation for areal data models, again starting with choropleth maps and other displays and progressing toward more formal statistical models. This chapter also presents Brook's Lemma and Markov random elds, topics that underlie the conditional, intrinsic, and simultaneous autoregressive (CAR, IAR, and SAR) models so often used in areal data settings. Chapter 4 provides a review of the hierarchical Bayesian approach in a fairly generic setting, for readers previously unfamiliar with these methods and related computing and software. (The penultimate sections of Chapters 2, 3, and 4 oer tutorials in several popular software packages.) This chapter is not intended as a replacement for a full course in Bayesian methods (as covered, for example, by Carlin and Louis, 2000, or Gelman et al., 2004), but should be sucient for readers having at least some familiarity with the ideas. In Chapter 5 then we are ready to cover hierarchical modeling for univariate spatial response data, including Bayesian kriging and lattice modeling. The issue of nonstationarity (and how to model it) also arises here. Chapter 6 considers the problem of spatially misaligned data. Here, Bayesian methods are particularly well suited to sorting out complex interrelationships and constraints and providing a coherent answer that properly accounts for all spatial correlation and uncertainty. Methods for handling multivariate spatial responses (for both point- and block-level data) are discussed in Chapter 7. Spatiotemporal models are considered in Chapter 8, while Chapter 9 presents an extended application of areal unit data modeling in the context of survival analysis methods. Chapter 10 considers novel methodology associated with spatial process modeling, including spa-

© 2004 by CRC Press LLC

PREFACE

xvii

tial directional derivatives, spatially varying coecient models, and spatial cumulative distribution functions (SCDFs). Finally, the book also features two useful appendices. Appendix A reviews elements of matrix theory and important related computational techniques, while Appendix B contains solutions to several of the exercises in each of the book's chapters. Our book is intended as a research monograph, presenting the \state of the art" in hierarchical modeling for spatial data, and as such we hope readers will nd it useful as a desk reference. However, we also hope it will be of bene t to instructors (or self-directed students) wishing to use it as a textbook. Here we see several options. Students wanting an introduction to methods for point-referenced data (traditional geostatistics and its extensions) may begin with Chapter 1, Chapter 2, Chapter 4, and Section 5.1 to Section 5.3. If areal data models are of greater interest, we suggest beginning with Chapter 1, Chapter 3, Chapter 4, Section 5.4, and Section 5.5. In addition, for students wishing to minimize the mathematical presentation, we have also marked sections containing more advanced material with a star (?). These sections may be skipped (at least initially) at little cost to the intelligibility of the subsequent narrative. In our course in the Division of Biostatistics at the University of Minnesota, we are able to cover much of the book in a 3-credit-hour, single-semester (15-week) course. We encourage the reader to check http://www.biostat.umn.edu/~brad/ on the web for many of our data sets and other teaching-related information. We owe a debt of gratitude to those who helped us make this book a reality. Kirsty Stroud and Bob Stern took us to lunch and said encouraging things (and more importantly, picked up the check) whenever we needed it. Cathy Brown, Alex Zirpoli, and Desdamona Racheli prepared signi cant portions of the text and gures. Many of our current and former graduate and postdoctoral students, including Yue Cui, Xu Guo, Murali Haran, Xiaoping Jin, Andy Mugglin, Margaret Short, Amy Xia, and Li Zhu at Minnesota, and Deepak Agarwal, Mark Ecker, Sujit Ghosh, Hyon-Jung Kim, Ananda Majumdar, Alexandra Schmidt, and Shanshan Wu at the University of Connecticut, played a big role. We are also grateful to the Spring 2003 Spatial Biostatistics class in the School of Public Health at the University of Minnesota for taking our draft for a serious \test drive." Colleagues Jarrett Barber, Nicky Best, Montserrat Fuentes, David Higdon, Jim Hodges, Oli Schabenberger, John Silander, Jon Wake eld, Melanie Wall, Lance Waller, and many others provided valuable input and assistance. Finally, we thank our families, whose ongoing love and support made all of this possible. Sudipto Banerjee Minneapolis, Minnesota Bradley P. Carlin Durham, North Carolina Alan E. Gelfand October 2003

© 2004 by CRC Press LLC

CHAPTER 1

Overview of spatial data problems 1.1 Introduction to spatial data and models

Researchers in diverse areas such as climatology, ecology, environmental health, and real estate marketing are increasingly faced with the task of analyzing data that are highly multivariate, with many important predictors and response variables, geographically referenced, and often presented as maps, and temporally correlated, as in longitudinal or other time series structures. For example, for an epidemiological investigation, we might wish to analyze lung, breast, colorectal, and cervical cancer rates by county and year in a particular state, with smoking, mammography, and other important screening and staging information also available at some level. Public health professionals who collect such data are charged not only with surveillance, but also statistical inference tasks, such as modeling of trends and correlation structures, estimation of underlying model parameters, hypothesis testing (or comparison of competing models), and prediction of observations at unobserved times or locations. In this text we seek to present a practical, self-contained treatment of hierarchical modeling and data analysis for complex spatial (and spatiotemporal) data sets. Spatial statistics methods have been around for some time, with the landmark work by Cressie (1993) providing arguably the only comprehensive book in the area. However, recent developments in Markov chain Monte Carlo (MCMC) computing now allow fully Bayesian analyses of sophisticated multilevel models for complex geographically referenced data. This approach also oers full inference for non-Gaussian spatial data, multivariate spatial data, spatiotemporal data, and, for the rst time, solutions to problems such as geographic and temporal misalignment of spatial data layers. This book does not attempt to be fully comprehensive, but does attempt to present a fairly thorough treatment of hierarchical Bayesian approaches for handling all of these problems. The book's mathematical level is roughly comparable to that of Carlin and Louis (2000). That is, we sometimes state

© 2004 by CRC Press LLC

2

OVERVIEW OF SPATIAL DATA PROBLEMS

results rather formally, but spend little time on theorems and proofs. For more mathematical treatments of spatial statistics (at least on the geostatistical side), the reader is referred to Cressie (1993), Wackernagel (1998), Chiles and Del ner (1999), and Stein (1999a). For more descriptive presentations the reader might consult Bailey and Gattrell (1995), Fotheringham and Rogerson (1994), or Haining (1990). Our primary focus is on the issues of modeling (where we oer rich, exible classes of hierarchical structures to accommodate both static and dynamic spatial data), computing (both in terms of MCMC algorithms and methods for handling very large matrices), and data analysis (to illustrate the rst two items in terms of inferential summaries and graphical displays). Reviews of both traditional spatial methods (Chapters 2 and 3) and Bayesian methods (Chapter 4) attempt to ensure that previous exposure to either of these two areas is not required (though it will of course be helpful if available). Following convention, we classify spatial data sets into one of three basic types: point-referenced data, where Y (s) is a random vector at a location s 2 0; 2 > 0; 2 > 0 :

(t) = +0 t otherwise Note that (t) ! 1 as t ! 1, and so this semivariogram does not correspond to a weakly stationary process (although it is intrinsically station-

© 2004 by CRC Press LLC

ELEMENTS OF POINT-REFERENCED MODELING

25

ary). This semivariogram is plotted in Figure 2.1(a) using the parameter values 2 = 0:2 and 2 = 0:5. 2. Spherical: 8 >

+ 32t ; 12 (t)3 : 0

if t 1=; if 0 < t 1=; : otherwise

The spherical semivariogram is valid in r = 1; 2, or 3 dimensions, but for r 4 it fails to correspond to a spatial variance matrix that is positive de nite (as required to specify a valid joint probability distribution). The spherical form does give rise to a stationary process and so the corresponding covariance function is easily computed (see the exercises that follow). This variogram owes its popularity largely to the fact that it oers clear illustrations of the nugget, sill, and range, three characteristics traditionally associated with variograms. Speci cally, consider Figure 2.1(b), which plots the spherical semivariogram using the parameter values 2 = 0:2, 2 = 1, and = 1. While (0) = 0 by de nition, (0+) limt!0+ (t) = 2 ; this quantity is the nugget. Next, limt!1 (t) = 2 + 2 ; this asymptotic value of the semivariogram is called the sill. (The sill minus the nugget, which is simply 2 in this case, is called the partial sill.) Finally, the value t = 1= at which (t) rst reaches its ultimate level (the sill) is called the range. It is for this reason that many of the variogram models of this subsection are often parametrized through R 1=. Confusingly, both R and are sometimes referred to as the range parameter, although is often more accurately referred to as the decay parameter. Note that for the linear semivariogram, the nugget is 2 but the sill and range are both in nite. For other variograms (such as the next one we consider), the sill is nite, but only reached asymptotically. 3. Exponential:

(t) =

2 + 2 (1 ; exp (;t)) if t > 0; 0

otherwise :

The exponential has an advantage over the spherical in that it is simpler in functional form while still being a valid variogram in all dimensions (and without the spherical's nite range requirement). However, note from Figure 2.1(c), which plots this semivariogram assuming 2 = 0:2, 2 = 1, and = 2, that the sill is only reached asymptotically, meaning that strictly speaking, the range R = 1= is in nite. In cases like this, the notion of an eective range if often used, i.e., the distance at which there is essentially no lingering spatial correlation. To make this notion precise, we must convert from scale to C scale (possible here since limt!1 (t) exists; the exponential is not only intrinsically but also weakly stationary).

© 2004 by CRC Press LLC

26

BASICS OF POINT-REFERENCED DATA MODELS

From (2.3) we have C (t) = limu!1 (u) ; (t) = 2 + 2 ; 2 + 2 (1 ; exp(;t)) = 2 exp(;t) : Hence 2 + 2 if t = 0 C (t) = 2exp( ;t) if t > 0 :

(2:5)

If the nugget 2 = 0, then this expression reveals that the correlation between two points t units apart is exp(;t); note that exp(;t) = 1; for t = 0+ and exp(;t) = 0 for t = 1, both in concert with this interpretation. A common de nition of the eective range, t0 , is the distance at which this correlation has dropped to only 0.05. Setting exp(;t0 ) equal to this value we obtain t0 3=, since log(0:05) ;3. The range will be discussed in more detail in Subsection 2.2.2. Finally, the form of (2.5) gives a clear example of why the nugget ( 2 in this case) is often viewed as a \nonspatial eect variance," and the partial sill (2 ) is viewed as a \spatial eect variance." Along with , a statistician would likely view tting this model to a spatial data set as an exercise in estimating these three parameters. We shall return to variogram model tting in Subsection 2.1.4. 4. Gaussian: 2 2 ;1 ; exp ;;2 t2 if t > 0 +

(t) = (2:6) 0 otherwise : The Gaussian variogram is an analytic function and yields very smooth realizations of the spatial process. We shall say much more about process smoothness in Subsection 2.2.3. 5. Powered exponential: 2 p 2 t>0 :

(t) = + (1 ; 0exp (; jtj )) ifotherwise (2:7) Here 0 < p 2 yields a family of valid variograms. Note that both the Gaussian and the exponential forms are special cases of this one. 6. Rational quadratic:

(t) = 7. Wave:

(t) =

© 2004 by CRC Press LLC

(

(

2 + (1+2tt22 ) if t > 0 0

2 + 2 1 ; sin(tt) 0

otherwise

:

if t > 0 : otherwise

ELEMENTS OF POINT-REFERENCED MODELING

Model Linear Spherical Exponential Powered exponential Gaussian Rational quadratic Wave Power law Matern Matern at = 3=2

27

Covariance function, C (t) C (t) does 8 not exist 0 < if t 1= C (t) = : 2 1 ; 23 t + 12 (t)3 if 0 < t 1= 2 + 2 otherwise 2 exp( ; t ) if t > 0 C (t) = 2 2 otherwise 2 + p ) if t > 0 exp( ;j t j C (t) = 2 2 otherwise 2 +2 2 exp( ; t ) if t>0 C (t) = 2 + 2 otherwise

C (t) =

(

(

2 2 1 ; (1+tt if t > 0 2) 2 + 2 otherwise sin( t ) 2 t if t > 0 2 + 2 otherwise

C (t) = C (t) does ( not exist p p 2 C (t) = 2;1 ;( ) (2 2t) 2K (2 t) if t > 0 + otherwise 2 (1 + t ) exp ( ; t ) if t > 0 C (t) = 2 + 2 otherwise

Summary of covariance functions (covariograms) for common parametric isotropic models.

Table 2.1

Note this is an example of a variogram that is not monotonically increasing. The associated covariance function is C (t) = 2 sin(t)=(t). Bessel functions of the rst kind include the wave covariance function and are discussed in detail in Subsections 2.2.2 and 5.1.3. 8. Power law 2 2 t>0

(t) = +0 t of otherwise : This generalizes the linear case and produces valid intrinsic (albeit not weakly) stationary semivariograms provided 0 < 2. 9. Matern : The variogram for the Matern class is given by i ( h p 2 + 2 1 ; (2 ;1t) K (2pt) if t > 0 2 ;( ) : (2:8)

(t) = 2 otherwise This class was originally suggested by Matern (1960, 1986). Interest in it was revived by Handco*ck and Stein (1993) and Handco*ck and Wallis (1994),

© 2004 by CRC Press LLC

28

BASICS OF POINT-REFERENCED DATA MODELS

model Linear Spherical Exponential Powered exponential Gaussian Rational quadratic

Variogram, (t) 2 2 t>0

(t) = +0 t ifotherwise 8 2 + 2 < if t 1= 2 2

(t) = : + 23 t ; 12 (t)3 if 0 < t 1= 0 otherwise 2 2 (1 ; exp(;t)) if t > 0 +

(t) = 0 otherwise 2 2 (1 ; exp(;jtjp )) if t > 0 +

(t) = 0 otherwise 2 2 (1 ; exp(;2 t2 )) if t > 0 +

(t) = 0 otherwise

(t) =

Wave

(t) =

Power law

(t) =

Matern

(t) =

Matern at = 3=2

(t) =

Table 2.2

( (

(

2 + (1+2tt22 ) if t > 0 0

otherwise if t > 0 t 0 otherwise 2 + 2 t if t > 0 0 h otherwise p ) p i 2 + 2 1 ; (22;t 1 ;( ) K (2 t) if t > 0 0 otherwise 2 + 2 [1 ; (1 + t) exp (;t)] if t > 0 0 otherwise

2 + 2 (1 ; sin(t) )

Summary of variograms for common parametric isotropic models.

who demonstrated attractive interpretations for as well as . Here > 0 is a parameter controlling the smoothness of the realized random eld (see Subsection 2.2.3) while is a spatial scale parameter. The function ; () is the usual gamma function while K is the modi ed Bessel function of order (see, e.g., Abramowitz and Stegun, 1965, Chapter 9). Implementations of this function are available in several C/C++ libraries and also in the R package geoR. Note that special cases of the above are the exponential ( = 1=2) and the Gaussian ( ! 1). At = 3=2 we obtain a closed form as well, namely (t) = 2 + 2 [1 ; (1 + t) exp (;t)] for t > 0, and 2 otherwise. The covariance functions and variograms we have described in this subsection are conveniently summarized in Tables 2.1 and 2.2, respectively.

© 2004 by CRC Press LLC

ELEMENTS OF POINT-REFERENCED MODELING

29

2.1.4 Variogram model tting Having seen a fairly large selection of models for the variogram, one might well wonder how we choose one of them for a given data set, or whether the data can really distinguish them (see Subsection 5.1.3 in this latter regard). Historically, a variogram model is chosen by plotting the empirical semivariogram (Matheron, 1963), a simple nonparametric estimate of the semivariogram, and then comparing it to the various theoretical shapes available from the choices in the previous subsection. The customary empirical semivariogram is X

b(t) = 2N1(t) [Y (si ) ; Y (sj )]2 ; (2:9) (si ;sj )2N (t)

where N (t) is the set of pairs of points such that jjsi ; sj jj = t, and jN (t)j is the number of pairs in this set. Notice that, unless the observations fall on a regular grid, the distances between the pairs will all be dierent, so this will not be a useful estimate as it stands. Instead we would \grid up" the t-space into intervals I1 = (0; t1 ); I2 = (t1 ; t2 ), and so forth, up to IK = (tK ;1 ; tK ) for some (possibly regular) grid 0 < t1 < < tK . Representing the t values in each interval by its midpoint, we then alter our de nition of N (t) to N (tk ) = f(si ; sj ) : jjsi ; sj jj 2 Ik g ; k = 1; : : : ; K : Selection of an appropriate number of intervals K and location of the upper endpoint tK is reminiscent of similar issues in histogram construction. Journel and Huijbregts (1979) recommend bins wide enough to capture at least 30 pairs per bin. Clearly (2.9) is nothing but a method of moments (MOM) estimate, the semivariogram analogue of the usual sample variance estimate s2 . While very natural, there is reason to doubt that this is the best estimate of the semivariogram. Certainly it will be sensitive to outliers, and the sample average of the squared dierences may be rather badly behaved since under a Gaussian distributional assumption for the Y (si ), the squared dierences will have a distribution that is a scale multiple of the heavily skewed 21 distribution. In this regard, Cressie and Hawkins (1980) proposed a robusti ed estimate that uses sample averages of jY (si ) ; Y (sj )j1=2 ; this estimate is available in several software packages (see Section 2.5.1 below). Perhaps more uncomfortable is that (2.9) uses data dierences, rather than the data itself. Also of concern is the fact that the components of the sum in (2.9) will be dependent within and across bins, and that N (tk ) will vary across bins. In any case, an empirical semivariogram estimate can be plotted, viewed, and an appropriately shaped theoretical variogram model can be t to this \data." Since any empirical estimate naturally carries with it a signi -

© 2004 by CRC Press LLC

30

BASICS OF POINT-REFERENCED DATA MODELS

cant amount of noise in addition to its signal, this tting of a theoretical model has traditionally been as much art as science: in any given real data setting, any number of dierent models (exponential, Gaussian, spherical, etc.) may seem equally appropriate. Indeed, tting has historically been done \by eye," or at best by using trial and error to choose values of nugget, sill, and range parameters that provide a good match to the empirical semivariogram (where the \goodness" can be judged visually or by using some least squares or similar criterion); again see Section 2.5.1. More formally, we could treat this as a statistical estimation problem, and use nonlinear maximization routines to nd nugget, sill, and range parameters that minimize some goodness-of- t criterion. If we also have a distributional model for the data, we could use maximum likelihood (or restricted maximum likelihood, REML) to obtain sensible parameter estimates; see, e.g., Smith (2001) for details in the case of Gaussian data modeled with the various parametric variogram families outlined in Subsection 2.1.3. In Chapter 4 and Chapter 5 we shall see that the hierarchical Bayesian approach is broadly similar to this latter method, although it will often be easier and more intuitive to work directly with the covariance model C (t), rather than changing to a partial likelihood in order to introduce the semivariogram.

2.2 Spatial process models ? 2.2.1 Formal modeling theory for spatial processes When we write the collection of random variables fY (s) : s 2 Dg for some region of interest D or more generally fY (s) : s 2 points(myscallops$long, myscallops$lat, cex=0.75)

It is often helpful to add contour lines to the plot. In order to add such lines it is necessary to carry out an interpolation. This essentially lls in the gaps in the data over a regular grid (where there are no actual observerd data) using a bivariate linear interpolation. This is done in S-plus using the interp function. The contour lines may then be added to the plot using the contour command: >int.scp contour(int.scp, add=T)

Figure 2.12 shows the result of the last four commands, i.e., the map of the scallop locations and log catch contours arising from the linear interpolation. Two other useful ways of looking at the data may be through image and perspective (three-dimensional surface) plots. Remember that they will use the interpolated object so a preexisting interpolation is also compulsory here.

© 2004 by CRC Press LLC

54

BASICS OF POINT-REFERENCED DATA MODELS

• •

• • •• •• • • • 2 4 2 4• •• • • • 4• • • •• • • • • • • •• • • •• • • • • • • • • • • • 4 • • • • • • • • • • • • •

• •

• • • • • • • • • • • •• • •• • 4 • • • • • •• •6 • • • •• • • • • • • • • • •• •• •6 • • •• • • • • • • •• • • • • • • • 6• •• • • • • • • • • • 6 • • •6 • 6 • • 42 • • • •• • •

Map of observed scallop sites and contours of (linearly interpolated) raw log catch data, scallop data. Figure 2.12

>image(int.scp) >persp(int.scp)

The empirical variogram can be estimated in both the standard and \robust" (Cressie and Hawkins) way with built-in functions. We rst demonstrate the standard approach. After a variogram object is created, typing that object yields the actual values of the variogram function with the distances at which they are computed. A summary of the object may be invoked to see information for each lag, the total number of lags, and the maximum intersite distance. >scallops.var scallops.var >summary(scallops.var)

In scallops.var, distance corresponds to the spatial lag (h in our usual notation), gamma is the variogram (h), and np is the number of points in each bin. In the output of the summary command, maxdist is the largest distance on the map, nlag is the number of lags (variogram bins), and lag is maxdist/nlag, which is the width of each variogram bin. By contrast, the robust method is obtained simply by specifying \robust" in the method option:

© 2004 by CRC Press LLC

55

1

1

2

2

gamma 3

gamma 3

4

4

5

5

6

6

COMPUTER TUTORIALS

0.0

0.2

0.4

0.6 0.8 distance

1.0

1.2

1.4

0.0

0.2

0.4

(a)

Figure 2.13

data.

0.6 0.8 distance

1.0

1.2

1.4

(b)

Ordinary (a) and robust (b) empirical variograms for the scallops

>scallops.var.robust par(mfrow=c(1,2)) >plot(scallops.var) >plot(scallops.var.robust) >printgraph(file=``scallops.empvario.ps'')

The output from this picture is shown in Figure 2.13. The covariogram (a plot of an isotropic empirical covariance function (2.15) versus distance) and correlogram (a plot of (2.15) divided by C^ (0) versus distance) may be created using the covariogram and correlogram functions. (When we are through here, we set the graphics device back to having one plot per page using the par command.) >scallops.cov plot(scallops.cov) >scallops.corr plot(scallops.corr) >printgraph(file=``scallops.covariograms.ps'') > par(mfrow=c(1,1))

Theoretical variograms may also be computed and compared to the observed data as follows. Invoke the model.variogram function and choose an

© 2004 by CRC Press LLC

56

BASICS OF POINT-REFERENCED DATA MODELS

initial theoretical model; say, range=0.8, sill=1.25, and nugget=0.50. Note that the fun option speci es the variogram type we want to work with. Below we choose the spherical (spher.vgram); other options include exponential (exp.vgram), Gaussian (gauss.vgram), linear (linear.vgram), and power (power.vgram). >model.variogram(scallops.var.robust, fun=spher.vgram, range=0.80, sill=1.25, nugget = 0.50)

We remark that this particular model provides relatively poor t to the data; the objective function takes a relatively high value (roughly 213). (You are asked to nd a better- tting model in Exercise 7.) Formal estimation procedures for variograms may also be carried out by invoking the nls function on the spher.fun function that we can create:

f

>spher.fun scallops.nl1 coef(scallops.nl1)

g

Thus we are using nls to minimize the squared distance between the theoretical and empirical variograms. Note there is nothing to the left of the \~" character at the beginning of the nls statement. Many times our interest lies in spatial residuals, or what remains after detrending the response from the eects of latitude and longitude. An easy way to do that is by using the gam function in S-plus. Here we plot the residuals of the scallops lgcatch variable after the eects of latitude and longitude have been accounted for: >gam.scp par(mfrow=c(2,1)) >plot(gam.scp, residuals=T, rug=F)

Finally, at the end of the session we unload the spatial module, after which we can either do other work, or quit. >module(spatial, unload=T) >q()

2.5.2 Kriging in S+SpatialStats We now present a tutorial in using S+SpatialStats to do basic kriging. At the command prompt type S-plus to start the software, and load the spatial module into the environment: >module(spatial)

© 2004 by CRC Press LLC

COMPUTER TUTORIALS

57

Recall that the scallops data is preloaded as a data frame in S-plus, and a descriptive summary of this data set can be obtained by typing >summary(scallops)

while the rst row of the data may be seen by typing >scallops[1,]

Recall from our Section 2.5.1 tutorial that the data on tcatch was highly skewed, so we needed to create another dataframe called myscallops, which includes the log transform of tcatch (or actually log(tcatch +1)). We called this new variable lgcatch. We then computed both the regular empirical variogram and the \robust" (Cressie and Hawkins) version, and compared both to potential theoretical models using the variogram command. >scallops.var.robust trellis.device() >plot(scallops.var.robust)

Next we recall S-plus' ability to compute theoretical variograms. We invoke the model.variogram function, choosing a theoretical starting model (here, range=0.8, sill=4.05, and nugget=0.80), and using fun to specify the variogram type. >model.variogram(scallops.var.robust, fun=spher.vgram, range=0.80, sill=4.05, nugget = 0.80) >printgraph(file=``scallops.variograms.ps'')

The output from this command (robust empirical semivariogram with this theoretical variogram overlaid) is shown in Figure 2.14. Note again the model.variogram command allows the user to alter the theoretical model and continually recheck the value of the objective function (where smaller values indicate better t of the theoretical to the empirical). Formal estimation procedures for variograms may also be carried out by invoking the nls function on the spher.fun function that we create:

f

>spher.fun scallops.nl1 summary(scallops.nl1)

g

We now call the kriging function krige on the variogram object to produce estimates of the parameters for ordinary kriging: >scallops.krige newdata scallops.predicttwo scallops.predict scallops.predict[1,] >x y z scallops.predict.interp persp(scallops.predict.interp)

It may be useful to recall the location of the sites and the surface plot of the raw data for comparison. We create these plots on a separate graphics device: >trellis.device() >plot(myscallops$long, myscallops$lat) >int.scp persp(int.scp)

Figure 2.15 shows these two perspective plots side by side for comparison. The predicted surface on the left is smoother, as expected. It is also useful to have a surface plot of the standard errors, since we expect to see higher standard errors where there is less data. This is well illustrated by the following commands: >z.se scallops.predict.interp.se persp(scallops.predict.interp.se)

Other plots, such as image plots for the prediction surface with added contour lines, may be useful: >image(scallops.predict.interp)

© 2004 by CRC Press LLC

BASICS OF POINT-REFERENCED DATA MODELS

0 2

Z 4

6

Z -2 0 2 4 6 8

8

60

40 .5

40

.5

40 Y

39 .5

39

-72 .5 -72 3 -7 X .5 -73

40 39 Y .5

.5

39

(a)

-72 -73 X 5 . -73

-72

(b)

Perspective plots of the kriged prediction surface (a) and interpolated raw data (b), log scallop catch data.

Figure 2.15

>par(new=T, xaxs=``d'', yaxs=``d'') >contour(scallops.predict.interp)

Turning to universal kriging, here we illustrate with the scallops data using latitude and longitude as the covariates (i.e., trend surface modeling). Our covariance matrix X is therefore n 3 (n = 148 here) with columns corresponding to the intercept, latitude and longitude. >scallops.krige.universal scallops.predict.universal scallops.predict.universal[1,] >x y z scallops.predict.interp persp(scallops.predict.interp) >q()

© 2004 by CRC Press LLC

COMPUTER TUTORIALS

61

2.5.3 EDA, variograms, and kriging in geoR R is an increasing popular freeware alternative to S-plus, available from the web at www.r-project.org. In this subsection we describe methods for kriging and related geostatistical operations available in geoR, a geostatistical data analysis package using R, which is also freely available on the web at www.est.ufpr.br/geoR/. Since the syntax of S-plus and R is virtually identical, we do not spend time here repeating the material of the past subsection. Rather, we only highlight a few dierences in exploratory data analysis steps, before moving on to model tting and kriging. Consider again the scallop data. We recall that it is often helpful to create image plots and place contour lines on the plot. These provide a visual idea of the realized spatial surface. In order to do these, it is necessary to rst carry out an interpolation. This essentially lls up the gaps (i.e., where there are no points) using a bivariate linear interpolation. This is done using the interp.new function in R, located in the library akima. Then the contour lines may be added to the plot using the contour command. The results are shown in Figure 2.16.

>library(akima) >int.scp interp.new(myscallops$long,

myscallops$lat, myscallops$lgcatch, extrap=T) image(int.scp, xlim=range(myscallops$long), ylim=range(myscallops$lat)) contour(int.scp, add=T)

> >

Another useful way of looking at the data is through surface plots (or perspective plots). This is done by invoking the persp function:

>persp(int.scp,

xlim=range(myscallops$long), ylim=range(myscallops$lat))

The empirical variogram can be estimated in the classical way and in the robust way with in-built R functions. There are several packages in R that perform the above computations. We illustrate the geoR package, mainly because of its additional ability to t Bayesian geostatistical models as well. Nevertheless, the reader might want to check out the CRAN website (http://cran.us.r-project.org/) for the latest updates and several other spatial packages. In particular, we mention fields, gstat, sgeostat, spatstat, and spatdep for exploratory work and some model tting of spatial data, and GRASS and RArcInfo for interfaces to GIS software. Returning to the problem of empirical variogram tting, we rst invoke the geoR package. We will use the function variog in this package, which takes in a geodata object as input. To do this, we rst create an object, obj, with only the coordinates and the response. We then create the geodata

© 2004 by CRC Press LLC

BASICS OF POINT-REFERENCED DATA MODELS

40.0 39.5 39.0

Latitude

40.5

62

−73.5

−73.0

−72.5

−72.0

Longitude Figure 2.16

An image plot of the scallops data, with contour lines super-imposed.

object using the as.geodata function, specifying the columns holding the coordinates, and the one holding the response.

>obj cbind(myscallops$long,myscallops$lat, myscallops$lgcatch)

>scallops.geo as.geodata(myscallops,coords.col=1:2, data.col=3)

Now, a variogram object is created.

>scallops.var variogram(scallops.geo, estimator.type=''classical'')

>scallops.var

The robust estimator (see Cressie, 1993, p.75) can be obtained by typing

>scallops.var.robust variogram(scallops.geo, estimator.type=''modulus'')

A plot of the two semivariograms (by both methods, one below the other, as in Figure 2.17) can be obtained as follows:

>par(mfrow=c(2,1)) >plot(scallops.var) >plot(scallops.var.robust)

Covariograms and correlograms are invoked using the covariogram and functions. The remaining syntax is the same as in S-plus. The function variofit estimates the sill, the range, and the nugget

correlogram

© 2004 by CRC Press LLC

COMPUTER TUTORIALS

63

6 4 2 0

semivariance

(a)

0.0

0.5

1.0

1.5

2.0

2.5

3.0

2.0

2.5

3.0

distance

6 4 2 0

semivariance

8

(b)

0.0

0.5

1.0

1.5 distance

Plots of the empirical semivariograms for the scallops data: (a) classical; (b) robust. Figure 2.17

parameters under a speci ed covariance model. A variogram object (typically an output from the variog function) is taken as input, together with initial values for the range and sill (in ini.cov.pars), and the covariance model is speci ed through cov.model. The covariance modeling options include exponential, gaussian, spherical, circular, cubic, wave, power, powered.exponential, cauchy, gneiting, gneiting.matern, and pure.nugget (no spatial covariance). Also, the initial values provided in ini.cov.pars do not include those for the nugget. It is concatenated with the value of the nugget option only if fix.nugget=FALSE. If the latter is TRUE, then the value in the nugget option is taken as the xed true value. Thus, with the exponential covariance function for the scallops data, we can estimate the parameters (including the nugget eect) using

>scallops.var.fit variofit(scallops.var.robust,

ini.cov.pars = c(1.0,50.0), cov.model=''exponential'', fix.nugget=FALSE, nugget=1.0)

The output is given below. Notice that this is the weighted least squares approach for tting the variogram: variofit: model parameters estimated by WLS (weighted least squares): covariance model is: matern with fixed kappa = 0.5 (exponential)

© 2004 by CRC Press LLC

64

BASICS OF POINT-REFERENCED DATA MODELS parameter estimates: tausq sigmasq phi 0.0000 5.1289 0.2160

Likelihood model tting In the previous section we saw parameter estimation through weighted least squares of variograms. Now we introduce likelihood-based and Bayesian estimation functions in geoR. Both maximum likelihood and REML methods are available through the geoR function likfit. To estimate the parameters for the scallops data, we invoke

>scallops.lik.fit likfit(scallops.geo,

ini.cov.pars=c(1.0,2.0),cov.model = ``exponential'', trend = ``cte'', fix.nugget = FALSE, nugget = 1.0, nospatial = TRUE, method.lik = ``ML'')

The option trend = ``cte'' means a spatial regression model with constant mean. This yields the following output:

> scallops.lik.fit

likfit: estimated model parameters: beta tausq sigmasq phi 2.3748 0.0947 5.7675 0.2338

Changing method.lik = ``REML'' yields the restricted maximum likelihood estimation. Note that the variance of the estimate of beta is available by invoking scallops.lik.fit$beta.var, so calculating the con dence interval for the trend is easy. However, the variances of the estimates of the covariance parameters is not easily available within geoR. Kriging in geoR There are two in-built functions in geoR for kriging: one is for classical or conventional kriging, and is called krige.conv, while the other performs Bayesian kriging and is named krige.bayes. We now brie y look into these two types of functions. The krige.bayes function is not as versatile as WinBUGS in that it is more limited in the types of models it can handle, and also the updating is not through MCMC methods. Nevertheless, it is a handy tool and already improved upon the aforementioned likelihood methods by providing posterior samples of all the model parameters, which lead to estimation of their variability. The krige.bayes function can be used to estimate parameters for spatial regression models. To t a constant mean spatial regression model for the scallops data, without doing predictions, we invoke krige.bayes specifying a constant trend, an exponential covariance model, a at prior for the

© 2004 by CRC Press LLC

EXERCISES

65

constant trend level, the reciprocal prior for discrete uniform prior for tausq.

sigmasq

(Jerey's), and a

>scallops.bayes1

> > > > > >

out scallops.krige.bayes$posterior out out$sample beta.qnt quantile(out$beta, c(0.50,0.025,0.975)) phi.qnt quantile(out$phi, c(0.50,0.025,0.975)) sigmasq.qnt quantile(out$sigmasq, c(0.50,0.025,0.975)) tausq.rel.qnt quantile(out$tausq.rel, c(0.50,0.025,0.975)) beta.qnt 50% 2.5% 97.5% 1.931822 -6.426464 7.786515

> phi.qnt 50% 2.5% 97.5% 0.5800106 0.2320042 4.9909913 sigmasq.qnt 50% 2.5% 97.5% 11.225002 4.147358 98.484722

> tausq.rel.qnt 50% 2.5% 97.5% 0.03 0.00 0.19

Note that tausq.rel refers to the ratio of the nugget variance to the spatial variance, and is seen to be negligible here, too. This is consistent with all the earlier analysis, showing that a purely spatial model (no nugget) would perhaps be more suitable for the scallops data.

2.6 Exercises

1. For semivariogram models #2, 4, 5, 6, 7, and 8 in Subsection 2.1.3, (a) identify the nugget, sill, and range (or eective range) for each;

© 2004 by CRC Press LLC

66

BASICS OF POINT-REFERENCED DATA MODELS

(b) nd the covariance function C (t) corresponding to each (t), provided it exists. 2. Prove that for Gaussian processes, strong stationarity is equivalent to weak stationarity. 3. Consider the triangular (or \tent") covariance function, 2 (1 ; khk =) if khk ; 2 > 0; > 0; C (khk) = : 0 if khk > It is valid in one dimension. (The reader can verify that it is the characteristic function of the density function f (x) proportional to 1 ; cos(x)=x2 .) Nowpin twopdimensions, consider a 6 8 grid with locations sjk = (j= 2; k= 2); j = 1; : : : ; 6; k = 1; : : : ; 8. Assign ajk to sjk such that ajk = 1 if j + k is even, ajk = ;1 if j + k is odd. Show that V ar[ajk Y (sjk )] < 0, and hence that the triangular covariance function is invalid in two dimensions. 4. The turning bands method (Christakos, 1984; Stein, 1999a) is a technique for creating stationary covariance functions on ngb

ngb

no.sites

ngb.mat

BASICS OF AREAL DATA MODELS

sids.mapgrp[(sids.mapgrp==1)] >sids.mapgrp[(sids.mapgrp==4)] >sids.mapgrp[(sids.mapgrp==0)]

title(main="Actual Transformed SIDS Rates") >legend(locator(1), legend= c("3.5"), fill=c(4,2,3,1))

In the rst command, the modi ed mapping vector sids.mapgrp is speci ed as the grouping variable for the dierent colors. The fill=T option automates the shading of regions, while the next command (with add=T) adds the county boundaries. Finally, the locator(1) option within the legend command waits for the user to click on the position where the legend is desired; Figure 3.5(a) contains the result we obtained. We hasten to add that one can automate the placing of the legend by replacing the

© 2004 by CRC Press LLC

94

BASICS OF AREAL DATA MODELS a) actual transformed SIDS rates

3.5

b) fitted SIDS rates from SAR model

3.5

Unsmoothed raw (a) and spatially smoothed tted (b) rates, North Carolina SIDS data. Figure 3.5

option with actual (x; y) coordinates for the upper left corner of the legend box. To draw a corresponding map of the tted values from our SAR model (using our parameter estimates in the mean structure), we must rst create a modi ed vector of the ts (again due to the presence of Currituck county): locator(1)

>sids.race.fit sids.race.fit.map sids.race.fit.mapgrp breaks.sids)

© 2004 by CRC Press LLC

sids.race.fit.mapgrp[(sids.race.fit.mapgrp==1)] >sids.race.fit.mapgrp[(sids.race.fit.mapgrp==4)] >sids.race.fit.mapgrp[(sids.race.fit.mapgrp==0)] >map("county", "north carolina", fill=T,

95 noise signal sids.race.pred sids.race.pred.map ;0, j bij 1 for all i, and j bij < 1 for at least one i. Let D = Diag i2 be a diagonal matrix with positive elements i2 such that D;1 (I ; B ) is symmetric; that is, bij =i2 = bji =j2 , for all i; j . Show that D;1 (I ; B ) is positive de nite. 4. Looking again at (3.13), obtain a simple sucient condition on B such

© 2004 by CRC Press LLC

96

BASICS OF AREAL DATA MODELS that the CAR prior with precision matrix D;1 (I ; B ) is a pairwise

dierence prior, as in (3.16). 5. Show that ;y1 = Dw; ; W is nonsingular (thus resolving the impropri ety in (3.15)) if 2 1=(1) ; 1=(n) , where (1) < (2) < < (n) are the ordered eigenvalues of Dw;1=2 WDw;1=2 . 6. Show that if all entries in W are nonnegative and Dw ;W is nonsingular with > 0, then all entries in (Dw ; W );1 are nonnegative. f just 7. Recalling the SAR formulation using the scaled adjacency matrix W f below (3.25), prove that I ; W will be nonsingular if 2 (;1; 1), so that may be sensibly referred to as an \autocorrelation parameter." 8. In the setting of Subsection 3.3.1, if (;y1 )ij = 0, then show that Yi and Yj are conditionally independent given Yk ; k 6= i; j . 9. The le www.biostat.umn.edu/~brad/data/state-sat.dat gives the 1999 state average SAT data (part of which is mapped in Figure 3.1), while www.biostat.umn.edu/~brad/data/contig-lower48.dat gives the contiguity (adjacency) matrix for the lower 48 U.S. states (i.e., excluding Alaska and Hawaii, as well as the District of Columbia). (a) Use the S+SpatialStats software to construct a spatial.neighbor object from the contiguity le. (b) Use the slm function to t the SAR model of Section 3.4, taking the verbal SAT score as the response Y and the percent of eligible students taking the exam in each state as the covariate X . Use row-normalized weights based on the contiguity information in spatial.neighbor object. Is knowing X helpful in explaining Y ? (c) Using the maps library in S-plus, draw choropleth maps similar to Figure 3.1 of both the tted verbal SAT scores and the spatial residuals from this t. Is there evidence of spatial correlation in the response Y once the covariate X is accounted for? (d) Repeat your SAR model analysis above, again using slm but now assuming the CAR model of Section 3.3. Compare your estimates with those from the SAR model and interpret any changes. (e) One might imagine that the percentage of eligible students taking the exam should perhaps aect the variance of our model, not just the mean structure. To check this, re t the SAR model replacing your row-normalized weights with weights equal to the reciprocal of the percentage of students taking the SAT. Is this model sensible? 10. Consider the data www.biostat.umn.edu/~brad/data/Columbus.dat, taken from Anselin (1988, p. 189). These data record crime information for 49 neighborhoods in Columbus, OH, during 1980. Variables measured include NEIG, the neighborhood id value (1{49); HOVAL, its mean

© 2004 by CRC Press LLC

EXERCISES

97

housing value (in $1,000); INC, its mean household income (in $1,000); CRIME, its number of residential burglaries and vehicle thefts per thousand households; OPEN, a measure of the neighborhood's open space; PLUMB, the percentage of housing units without plumbing; DISCBD, the neighborhood centroid's distance from the central business district; X, an x-coordinate for the neighborhood centroid (in arbitrary digitizing units, not polygon coordinates); Y, the same as X for the y-coordinate; AREA, the neighborhood's area; and PERIM, the perimeter of the polygon describing the neighborhood. (a) Use S+SpatialStats to construct spatial.neighbor objects for the neighborhoods of Columbus based upon centroid distances less than i. 3.0 units, ii. 7.0 units, iii. 15 units. (b) For each of the four spatial neighborhoods constructed above, use the slm function to t SAR models with CRIME as the dependent variable, and HOVAL, INC, OPEN, PLUMB, and DISCBD as the covariates. Compare your results and interpret your parameter estimates in each case. (c) Repeat your analysis using Euclidean distances in the B matrix itself. That is, in equation (3.26), set B = W with the Wij the Euclidean distance between location i and location j . (d) Repeat part (b) for CAR models. Compare your estimates with those from the SAR model and interpret them.

© 2004 by CRC Press LLC

CHAPTER 4

Basics of Bayesian inference In this chapter we provide a brief review of hierarchical Bayesian modeling and computing for readers not already familiar with these topics. Of course, in one chapter we can only scratch the surface of this rapidly expanding eld, and readers may well wish to consult one of the many recent textbooks on the subject, either as preliminary work or on an as-needed basis. It should come as little surprise that the book we most highly recommend for this purpose is the one by Carlin and Louis (2000); the Bayesian methodology and computing material below roughly follows Chapters 2 and 5, respectively, in that text. However, a great many other good Bayesian books are available, and we list a few of them and their characteristics. First we must mention the texts stressing Bayesian theory, including DeGroot (1970), Berger (1985), Bernardo and Smith (1994), and Robert (1994). These books tend to focus on foundations and decision theory, rather than computation or data analysis. On the more methodological side, a nice introductory book is that of Lee (1997), with O'Hagan (1994) and Gelman, Carlin, Stern, and Rubin (2004) oering more general Bayesian modeling treatments.

4.1 Introduction to hierarchical modeling and Bayes' Theorem By modeling both the observed data and any unknowns as random variables, the Bayesian approach to statistical analysis provides a cohesive framework for combining complex data models and external knowledge or expert opinion. In this approach, in addition to specifying the distributional model f (yj) for the observed data y = (y1 ; : : : ; yn ) given a vector of unknown parameters = (1 ; : : : ; k ), we suppose that is a random quantity sampled from a prior distribution (j), where is a vector of hyperparameters. For instance, yi might be the empirical mammography rate in a sample of women aged 40 and over from county i, i the underlying true mammography rate for all such women in this county, and a parameter controlling how these true rates vary across counties. If is

© 2004 by CRC Press LLC

100

BASICS OF BAYESIAN INFERENCE

known, inference concerning is based on its posterior distribution, p(jy; ) = p(py(y; jj) ) = R pp((yy;;jj))d = R ff((yyjj))((jj))d : (4:1) Notice the contribution of both the data (in the form of the likelihood f ) and the external knowledge or opinion (in the form of the prior ) to the posterior. Since, in practice, will not be known, a second stage (or hyperprior) distribution h() will often be required, and (4.1) will be replaced with R p(jy) = p(py(y; ) ) = R f f(y(yjj))((jj)h)h(() )ddd : Alternatively, we might replace by an estimate ^ obtained as the maxR imizer of the marginal distribution p(yj) = f (yj)(j)d, viewed as a function of . Inference could then proceed based on the estimated posterior distribution p(jy; ^ ), obtained by plugging ^ into equation (4.1). This approach is referred to as empirical Bayes analysis; see Berger (1985), Maritz and Lwin (1989), and Carlin and Louis (2000) for details regarding empirical Bayes methodology and applications. The Bayesian inferential paradigm oers potentially attractive advantages over the classical, frequentist statistical approach through its more philosophically sound foundation, its uni ed approach to data analysis, and its ability to formally incorporate prior opinion or external empirical evidence into the results via the prior distribution . Data analysts, formerly reluctant to adopt the Bayesian approach due to general skepticism concerning its philosophy and a lack of necessary computational tools, are now turning to it with increasing regularity as classical methods emerge as both theoretically and practically inadequate. Modeling the i as random (instead of xed) eects allows us to induce speci c (e.g., spatial) correlation structures among them, hence among the observed data yi as well. Hierarchical Bayesian methods now enjoy broad application in the analysis of spatial data, as the remainder of this book reveals. A computational challenge in applying Bayesian methods is that for most realistic problems, the integrations required to do inference under (4.1) are generally not tractable in closed form, and thus must be approximated numerically. Forms for and h (called conjugate priors) that enable at least partial analytic evaluation of these integrals may often be found, but in the presense of nuisance parameters (typically unknown variances), some intractable integrations remain. Here the emergence of inexpensive, highspeed computing equipment and software comes to the rescue, enabling the application of recently developed Markov chain Monte Carlo (MCMC) integration methods, such as the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970) and the Gibbs sampler (Geman and Geman, 1984; Gelfand and Smith, 1990). This is the subject of Section 4.3.

© 2004 by CRC Press LLC

INTRODUCTION TO HIERARCHICAL MODELING AND BAYES' THEOREM 101

Illustrations of Bayes' Theorem Equation (4.1) is a generic version of what is referred to as Bayes' Theorem or Bayes' Rule. It is attributed to Reverend Thomas Bayes, an 18th-century nonconformist minister and part-time mathematician; a version of the result was published (posthumously) in Bayes (1763). In this subsection we consider a few basic examples of its use. Example 4.1 Suppose ; we have observed a single normal (Gaussian) observation Y N ; 2 with 2 known, so that the likelihood f (yj) = ; 2 N yj; 2 p12 exp(; (y2;2) ); y 2 1 and ESS () < N , so that Vd arESS (^N ) > Vd ariid (^N ), in concert with intuition. That is, since we have fewer than N eective samples, we expect some in ation in the variance of our estimate. In practice, the autocorrelation time () in (4.21) is often estimated simply by cutting o the summation when the magnitude of the terms rst drops below some \small" value (say, 0.1). This procedure is simple but may lead to a biased estimate of (). Gilks et al. (1996, pp. 50{51) recommend an initial convex sequence estimator mentioned by Geyer (1992) which, while while still output-dependent and slightly more complicated, actually yields a consistent (asymptotically unbiased) estimate here. A nal and somewhat simpler (though also more naive) method of estimating V ar(^N ) is through batching. Here we divide our single long run of length N into m successive batches of length k (i.e., N = mk), with Pm 1 ^ batch means B1 ; : : : ; Bm . Clearly N = B = m i=1 Bi . We then have the variance estimate m X Vd arbatch (^N ) = m(m1; 1) (Bi ; ^N )2 ; (4:22) i=1 provided that k is large enough so that the correlation between batches is negligible, and m is large enough to reliably estimate V ar(Bi ). It is important to verify that the batch means are indeed roughly independent, say, by checking whether the lag 1 autocorrelation of the Bi is less than 0.1. If this is not the case, we must increase k (hence N , unless the current m is already quite large), and repeat the procedure. Regardless of which of the above estimates V^ is used to approximate V ar(^N ), a 95% con dence interval for E (jy) is then given by p ^N z:025 V^ ; where z:025 = 1:96, the upper .025 point of a standard normal distribution. If the batching method is used with fewer than 30 batches, it is a good idea to replace z:025 by tm;1;:025 , the upper .025 point of a t distribution with m ; 1 degrees of freedom. WinBUGS oers both naive (4.20) and batched (4.22) variance estimates; this software is (at last!) the subject of the next section.

4.4 Computer tutorials

4.4.1 Basic Bayesian modeling in R or S-plus In this subsection we merely point out that for simple (typically lowdimensional) Bayesian calculations employing standard likelihoods paired with conjugate priors, the built-in density, quantile, and plotting functions in standard statistical packages may well oer sucient power; there is no

© 2004 by CRC Press LLC

COMPUTER TUTORIALS

121

need to use a \Bayesian" package per se. In such cases, statisticians might naturally turn to S-plus or R (the increasingly popular freeware package that is \not unlike S") due to their broad array of special functions (especially those oering summaries of standard distributions), graphics, interactive environments, and easy extendability. As a concrete example, suppose we are observing a data value Y from a Bin(n; ) distribution, with density proportional to p(yj) / y (1 ; )n;y : (4:23) The Beta(; ) distribution oers a conjugate prior for this likelihood, since its density is proportional to (4.23) as a function of , namely p() / ;1 (1 ; ) ;1 : (4:24) Using Bayes' Rule (4.1), it is clear that p(jy) / y+;1 (1 ; )n;y+ ;1 / Beta(y + ; n ; y + ) ; (4.25) another Beta distribution. Now consider a setting where n = 10 and we observe Y = yobs = 7. Choosing = = 1 (i.e., a uniform prior for ), the posterior is a Beta(yobs + 1; n ; yobs + 1) = Beta(8; 4) distribution. In either R or S-plus we can obtain a plot of this distribution by typing > theta yobs qbeta(.5, yobs+1, n-yobs+1)

while the endpoints of a 95% equal-tail credible interval are > qbeta(c(.025, .975), yobs+1, n-yobs+1)

In fact, these points may be easily added to our posterior plot (see Figure 4.2) by typing > abline(v=qbeta(.5, yobs+1, n-yobs+1)) > abline(v=qbeta(c(.025, .975), yobs+1, n-yobs+1), lty=2)

The pbeta and rbeta functions may be used similarly to obtain prespeci ed posterior probabilities (say, Pr( < 0:8jyobs)) and random draws from the posterior, respectively. Indeed, similar density, quantile, cumulative probability, and random generation routines are available in R or S-plus for a wide array of standard distributional families that often emerge as posteriors (gamma, normal, multivariate normal, Dirichlet, etc.). Thus in settings where MCMC techniques are unnecessary, these languages may oer the most sensible

© 2004 by CRC Press LLC

BASICS OF BAYESIAN INFERENCE

1.5 0.0

0.5

1.0

posterior density

2.0

2.5

3.0

122

0.0

0.2

0.4

0.6

0.8

1.0

Figure 4.2 Illustrative beta posterior, with vertical reference lines added at the .025, .5, and .975 quantiles.

approach. They are especially useful in situations requiring code to be wrapped around statements like those above so that repeated posterior calculations may be performed. For example, when designing an experiment to be analyzed at some later date using a Bayesian procedure, we would likely want to simulate the procedure's performance in repeated sampling (the Bayesian analog of a power or \sample size" calculation). Such repeated sampling might be of the data for xed parameters, or over both the data and the parameters. (We hasten to add that WinBUGS can be called from R, albeit in a special way; see www.stat.columbia.edu/~gelman/bugsR/. Future releases of WinBUGS may be available directly within R itself.) 4.4.2 Advanced Bayesian modeling in WinBUGS In this subsection we provide a introduction to Bayesian data analysis in WinBUGS, the most well-developed and general Bayesian software package available to date. WinBUGS is the Windows successor to BUGS, a UNIX package whose name originally arose as a humorous acronym for Bayesian inference Using Gibbs Sampling. The package is freely available from the website http://www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml. The sofware comes with a user manual, as well as two examples manuals that are enormously helpful for learning the language and various strategies for Bayesian data analysis. We remark that for further examples of good applied Bayesian work,

© 2004 by CRC Press LLC

COMPUTER TUTORIALS

123

in addition to the ne book by Gilks et al. (1996), there are the series of \Bayesian case studies" books by Gatsonis et al. (1993, 1995, 1997, 1999, 2002, 2003), and the very recent Bayesian modeling book by Congdon (2001). While this lattermost text assumes a walking familiarity with the Bayesian approach, it also includes a great many examples and corresponding computer code for their implementation in WinBUGS. WinBUGS has an interactive environment that enables the user to specify models (hierarchical) and it actually performs Gibbs sampling to generate posterior samples. Convergence diagnostics, model checks and comparisons, and other helpful plots and displays are also available. We will now look at some WinBUGS code for greater insight into its modeling language. Example 4.3 The line example from the main WinBUGS manual will be considered in stages, in order to both check the installation and to illustrate the use of WinBUGS. Consider a set of 5 (obviously arti cial) (X; Y ) pairs: (1, 1), (2, 3), (3, 3), (4, 3), (5, 5). We shall t a simple linear regression of Y on X using the notation, ;

Yi N i ; 2 ; where i = + xi :

As the WinBUGS code below illustrates, the language allows a concise expression of the model, where dnorm(a,b) denotes a normal distribution with mean a and precision (reciprocal of the variance) b, and dgamma(c,d) denotes a gamma distribution with mean c=d and variance c=d2 . The data means mu[i] are speci ed using a logical link (denoted by 0 almost surely, but in practice we may observe some zij = 0 as, for example, with the log counts in the scallop data example. A correction is needed and can be achieved by adding to zij;obs where is, say, one half of the smallest possible positive zij;obs .) Setting k = 1 in (5.12), we note that of the Bessel mixtures, the vecomponent model with xed 's and random weights is best according to the D1;m statistic. Here, given max = 7:5, the nodes are xed to be 1 = 1:25; 2 = 2:5; 3 = 3:75; 4 = 5:0, and 5 = 6:25. One would expect that the t measured by the G1;m criterion should improve with increasing

© 2004 by CRC Press LLC

STATIONARY SPATIAL PROCESS MODELS

141

8

Parametric Semivariograms

4 0

2

gamma(d)

6

Exponential Gaussian Cauchy Spherical Bessel-J0

0.0

0.5

1.0

1.5

2.0

2.5

3.0

2.0

2.5

3.0

2.0

2.5

3.0

distance

8

Bessel Mixtures - Random Weights

4 0

2

gamma(d)

6

Two Three Four Five

0.0

0.5

1.0

1.5

distance

8

Bessel Mixtures - Random Phi’s

4 0

2

gamma(d)

6

Two Three Four Five

0.0

0.5

1.0

1.5

distance

Figure 5.4

Posterior means for various semivariogram models.

p. However, the models do not form a nested sequence in p, except in some instances (e.g., the p = 2 model is a special case of the p = 5 model). Thus, the apparent poorer t of the four-component xed model relative to the three-component model is indeed possible. The random Bessel mixture models were all very close and, as a class, these models t as well or better than the best parametric model. Hence, modeling mixtures of Bessel functions appears more sensitive to the choice of xed 's than to xed weights.

5.1.4 Modeling geometric anisotropy As mentioned in Section 2.2.5, anisotropy refers to the situation where the spatial correlation between two observations depends upon the separation vector between their locations, rather than merely its length (i.e., the distance between the points). Thus here we have Cov (Y (s + h) ; Y (s)) = (h; ) : Ansiotropy is generally dicult to deal with, but there are special cases

© 2004 by CRC Press LLC

142

HIERARCHICAL MODELING FOR UNIVARIATE SPATIAL DATA

Model Parametric exponential Gaussian Cauchy spherical Bessel independent Semiparametric xed ` , random w` : two three four ve random ` , xed w` : two three four ve Table 5.1

G1;m

Pm

D1;m

10959 10861 10683 11447 11044 11578

13898 13843 13811 13959 14037 16159

24857 24704 24494 25406 25081 27737

11071 10588 10934 10567

13968 13818 13872 13818

25039 24406 24806 24385

10673 10677 10636 10601

13907 13959 13913 13891

24580 24636 24549 24492

Model choice for tted variogram models, 1993 scallop data.

that are tractable yet still interesting. Among these, the most prominent in applications is geometric anisotropy. This refers to the situation where the coordinate space can be linearly transformed to an isotropic space. A linear transformation may correspond to rotation or stretching of the coordinate axes. Thus in general, (h; ) = 0 (jjLhjj ; ) ; where l is a d d matrix describing the linear transformation. Of course, if L is the identity matrix, this reduces to the isotropic case. We assume a second-order stationary normal model for Y, arising from the customary model, Y (s) = + w(s) + (s) as in (5.1). This yields Y N (1 ; ()), where = ( 2 ; 2 ; B )T , B = LT L, and () = 2 I + 2 H ((h0 B h) 21 ) : (5:13) In (5.13), the matrix H has (i; j )th entry ((h0ij B hij ) 21 ) where is a valid correlation function and hij = si ; sj . Common forms for would be those in Table 2.2. In (5.13), 2 is the semiovariogram nugget and 2 + 2 1is the 1 sill. The variogram is 2 ( 2 ; 2 ; (h0 B h) 2 ) = 2( 2 + 2 (1 ; ((h0 B h) 2 )):

© 2004 by CRC Press LLC

STATIONARY SPATIAL PROCESS MODELS

143

200

> 16

150 - 200

12 - 16

105 - 150

8 - 12

95 - 105

7 - 8

50 - 95

1 - 7

< 50

0 - 1

N 60

60

120

Miles

Scotland lip cancer data: (a) crude standardized mortality ratios (observed / expected 100); (b) AFF covariate values.

Figure 5.10

Figure 5.11 contains the WinBUGS code for this problem, which is also available at http://www.biostat.umn.edu/~brad/data2.html. Note the use of vague priors for c and h as suggested by Best et al. (1999), and the use of the sd function in WinBUGS to greatly facilitate computation of . The basic posterior (mean, sd) and convergence (lag 1 autocorrelation) summaries for ; 1 ; 1 , and 56 are given in Table 5.4. Besides the Best et al. (1999) prior, two priors inspired by equation (5.48) are also reported; see Carlin and Perez (2000). The AFF covariate appears signi cantly different from 0 under all 3 priors, although convergence is very slow (very high values for l1acf). The excess variability in the data seems mostly due to clustering (E (jy) > :50), but the posterior distribution for does not seem robust to changes in the prior. Finally, convergence for the i (reason-

© 2004 by CRC Press LLC

GENERAL LINEAR AREAL DATA MODELING

167

model { for (i in 1 : regions) { O[i] ~ dpois(mu[i]) log(mu[i]) 0), and Z2 (s) = I (Y2 (s) > 0). Approximate the cross-covariance matrix of Z(s) = (Z1(s); Z2(s))T . 4. The data in www.biostat.umn.edu/~brad/data/ColoradoLMC.dat record maximum temperature (in tenths of a degree Celsius) and precipitation (in cm) during the month of January 1997 at 50 locations in the U.S. state of Colorado. (a) Let X denote temperature and Y denote precipitation. Following the model of Example 7.3, t an LMC model to these data using the conditional approach, tting X and then Y jX . (b) Repeat this analysis, but this time tting Y and then X jY . Show that your new results agree with those from part (a) up to simulation variability. 5. If Cl and Cl0 are isotropic, obtain Cll0 (s) in (7.31) by transformation to polar coordinates.

© 2004 by CRC Press LLC

254

MULTIVARIATE SPATIAL MODELING

6. The usual and generalized (but still proper) MCAR models may be constructed using linear transformations of some nonspatially correlated T variables. T Consider a vector blocked by components, say = T 1 ; 2 , where each i is n 1, n being the number of areal units. Suppose we look upon these vectors as arising from linear transformations 1 = A1 v1 and 2 = A2 v2 ; where A1 and A2 are any n n matrices, v1 = (v11 ; : : : ; v1n )T and v2 = (v21 ; : : : ; v2n)T with covariance structure Cov (v1i ; v1j ) = 11 I[i=j] ; Cov (v1i ; v2j ) = 12 I[i=j] ; and Cov (v2i ; v2j ) = 22 I[i=j] ; where I[i=j] = 1 if i = j and 0 otherwise. Thus, although v1 and v2 are associated, their nature of association is nonspatial in that covariances remain same for every areal unit, and there is no association between variables in dierent units. (a) Show that the dispersion matrix (v ;v ) equals I , where = (ij )i;j=1;2 . (b) Show that setting A1 = A2 = A yields a separable covariance structure for . What choice of A would render a separable MCAR model, analogous to (7.34)? (c) Show that appropriate (dierent) choices of A1 and A2 yield the generalized MCAR model, as in (7.36). 1

© 2004 by CRC Press LLC

2

CHAPTER 8

Spatiotemporal modeling In both theoretical and applied work, spatiotemporal modeling has received dramatically increased attention in the past few years. This reason behind this increase is easy to see: the proliferation of data sets that are both spatially and temporally indexed, and the attendant need to understand them. For example, in studies of air pollution, we are interested not only in the spatial nature of a pollutant surface, but also in how this surface changes over time. Customarily, temporal measurements (e.g., hourly, daily, three-day average, etc.) are collected at monitoring sites over several years. Similarly, with climate data we may be interested in spatial patterns of temperature or precipitation at a given time, but also in dynamic patterns in weather. With real estate markets, we might be interested in how the single-family home sales market changes on a quarterly or annual basis. Here an additional wrinkle arises in that we do not observe the same locations for each time period; the data are cross-sectional, rather than longitudinal. Applications with areal unit data are also commonplace. For instance, we may look at annual lung cancer rates by county for a given state over a number of years to judge the eectiveness of a cancer control program. Or we might consider daily asthma hospitalization rates by zip code, over a period of several months. From a methodological point of view, the introduction of time into spatial modeling brings a substantial increase in the scope of our work, as we must make separate decisions regarding spatial correlation, temporal correlation, and how space and time interact in our data. Such modeling will also carry an obvious associated increase in notational and computational complexity. As in previous chapters, we make a distinction between the cases where the geographical aspect of the data is point level versus areal unit level. Again the former case is typically handled via Gaussian process models, while the latter often uses CAR speci cations. A parallel distinction could be drawn for the temporal scale: is time viewed as continuous (say, over tij (and in fact Tij = 1 if Nij = 0). Given Nij ; the Uijk 's are independent with survival function S (tj ij ) and corresponding density function f (tj ij ). The parameter ij is a collection of all the parameters (including possible regression parameters) that may be involved in a parametric speci cation for the survival function S . In this section we will work with a two-parameter Weibull distribution speci cation for the density function f (tj ij ), where we allow the Weibull scale parameter to vary across the regions, and , which may serve as a link to covariates in a regression setup, to vary across individuals. Therefore f (tji ; ij ) = i ti ;1 exp (ij ; ti exp (ij )). In terms of the hazard function h, f (tji ; ij ) = h (tji ; ij ) S (tji ; ij ), with h (t; i ; ij ) = i ti ;1 exp (ij ) and S (tji ; ij ) = exp (;ti exp (ij )). Note we implicitly assume proportional hazards, with baseline hazard function h0 (tji ) = i ti ;1 : Thus an individual ij who is censored at time tij before undergoing the event contributes (S (tij ji ; ij ))Nij to the likelihood, while an individual who experiences the event at time tij contributes Nij (S (tij ji ; ij ))Nij ;1 f (tij ji ; ij ). The latter expression follows from the fact that the event is experienced when any one of the latent factors occurs. Letting ij be the observed event indicator for individual ij , this person contributes

L (tij jNij ; i ; ij ; ij ) ij = (S (tij ji ; ij ))Nij (1;ij ) Nij S (tij ji ; ij )Nij ;1 f (tij ji ; ij ) ; and the joint likelihood for all the patients can now be expressed as

L (ftijQgj fNQij g ; fi g; fij g ; fij g) i L (t jN ; ; ; ) = Ii=1 nj=1 ij ij i ij ij QI Qni = i=1 j=1 (S(tij ji ; ij ))Nij (1;ij ) ij Nij S (tij i ; ij )Nij ;1 f (tij ji ; ij ) Q Q i = Ii=1 nj=1 (S (tij ji ; ij ))Nij ;ij (Nij f (tij ji ; ij ))ij :

© 2004 by CRC Press LLC

SPATIAL CURE RATE MODELS ?

331

This expression can be rewritten in terms of the hazard function as ni I Y Y i=1 j =1

(S (tij ji ; ij ))Nij (Nij h (tij ji ; ij ))ij :

(9:15)

A Bayesian hierarchical formulation is completed by introducing prior distributions on the parameters. We will specify independent prior distributions p (Nij jij ), p (i j ) and p (ij j ) for fNij g, fi g, and fij g, respectively. Here, , , and fij g are appropriate hyperparameters. Assigning independent hyperpriors p (ij j ) for fij g and assuming the hyperparameters = ( ; ; ) to be xed, the posterior distribution for the parameters, p (fij g ; fij g ; fNij g ; fi g j ftij g ; fij g), is easily found (up to a proportionality constant) using (9.15) as QI

n

i [S (t j ; )]Nij [N h (t j ; )]ij p (i j ) nj=1 ij i ij ij ij i ij p (Nij jij ) p (ij j ) p (ij j )g : Chen et al. (1999) assume that the Nij are distributed as independent Poisson random variables with mean ij , i.e., p (Nij jij ) is Poisson (ij ). In this setting it is easily seen that the survival distribution for the (i; j )th patient, P (Tij tij ji ; ij ), is given by exp f;ij (1 ; S (tij ji ; ij ))g. Since S (tij ji ; ij ) is a proper survival function (corresponding to the latent factor times Uijk ), as tij ! 1, P (Tij tij ji ; ij ) ! exp (;ij ) > 0. Thus we have a subdistribution for Tij with a cure fraction given by exp (;ij ). Here a hyperprior on the ij 's would have support on the positive real line.

i=1

Q

While there could certainly be multiple latent factors that increase the risk of smoking relapse (age started smoking, occupation, amount of time spent driving, tendency toward addictive behavior, etc.), this is rather speculative and certainly not as justi able as in the cancer setting for which the multiple factor approach was developed (where Nij > 1 is biologically motivated). As such, we instead form our model using a single, omnibus, \propensity for relapse" latent factor. In this case, we think of Nij as a binary variable, and specify p (Nij jij ) as Bernoulli (1 ; ij ). In this setting it is easier to look at the survival distribution after marginalizing out the Nij . In particular, note that ij = 1 P (Tij tij ji ; ij ; Nij ) = S (tij j1i;; ij ) ; N Nij = 0 : That is, if the latent factor is absent, the subject is cured (does not experience the event). Marginalizing over the Bernoulli distribution for Nij , we obtain for the (i; j )th patient the survival function S (tij jij ; i ; ij ) P (Tij tij ji ; ij ) = ij + (1 ; ij ) S (tij ji ; ij ), which is the classic curerate model attributed to Berkson and Gage (1952) with cure fraction ij . Now we can write the likelihood function for the data marginalized over

© 2004 by CRC Press LLC

332

fNij g, L (ftij g j fi g; fij g; fij g ; fij g), as

SPATIAL SURVIVAL MODELS ij

1;ij ; dtdij S (tij jij ; i ; ij ) i=1Q j =1Q[S (tij jij ; i ; ij )] i [S (t j ; ; )]1;ij [(1 ; ) f (t j ; )]ij ; = Ii=1 nj=1 ij ij i ij ij ij i ij

QI

Q ni

which in terms of the hazard function becomes ni I Y Y

i=1 j =1

[S (tij jij ; i ; ij )]1;ij [(1 ; ij ) S (tij ji ; ij ) h (tij ji ; ij )]ij ;

(9:16) where the hyperprior for ij has support on (0; 1). Now the posterior distribution of the parameters is proportional to

L (ftij g j fi g; fij g; fij g ; fij g)

8 I < Y

ni Y

i=1 :

j =1

p (i j )

9 =

p (ij j ) p (ij j ); :

(9:17) Turning to the issue of incorporating covariates, in the general setting with Nij assumed to be distributed Poisson, Chen et al. (1999) propose their introduction inthe cure fraction through a suitable link function g , so that ij = g xTij e , where g maps the entire real line to the positive axis. This is sensible when we believe that the risk factors aect the probability of an individual being cured. Proper posteriors arise for the regression coecients e even under improper priors. Unfortunately, this is no longer true when Nij is Bernoulli (i.e., in the Berkson and Gage model). Vague but proper priors may still be used, but this makes the parameters dicult to interpret, and can often lead to poor MCMC convergence. Since a binary Nij seems most natural in our setting, we instead introduce covariates into S (tij ji ; ij ) through the Weibull link ij , i.e., we let ij = xTij . This seems intuitively more reasonable anyway, since now the covariates in uence the underlying factor that brings about the smoking relapse (and thus the rapidity of this event). Also, proper posteriors arise here for under improper posteriors even though Nij is binary. As such, henceforth we will only consider the situation where the covariates enter the model in this way (through the Weibull link function). This means we are unable to separately estimate the eect of the covariates on both the rate of relapse and the ultimate level of relapse, but \fair" estimation here (i.e., allocating the proper proportions of the covariates' eects to each component) is not clear anyway since at priors could be selected for , but not for e . Finally, all of our subsequent models also assume a constant cure fraction for the entire population (i.e., we set ij = for all i; j ). Note that the posterior distribution in (9.17) is easily modi ed to incorQ porate covariates. For example, with ij = xTij , we replace ij p (ij j )

© 2004 by CRC Press LLC

SPATIAL CURE RATE MODELS ?

333

in (9.17) with p ( j ), with as a xed hyperparameter. Typically a at or vague Gaussian prior may be taken for p ( j ).

Interval-censored data The formulation above assumes that our observed data are right-censored. This means that we are able to observe the actual relapse time tij when it occurs prior to the nal oce visit. In reality, our study (like many others of its kind) is only able to determine patient status at the oce visits themselves, meaning we observe only a time interval (tijL ; tijU ) within which the event (in our case, smoking relapse) is known to have occurred. For patients who did not resume smoking prior to the end of the study we have tijU = 1, returning us to the case of right-censoring at time point tijL . Thus we now set ij = 1 if subject ij is interval-censored (i.e., experienced the event), and ij = 0 if the subject is right-censored. Following Finkelstein (1986), the general interval-censored cure rate likelihood, L (f(tijL ; tijU )g j fNij g ; fi g; fij g ; fij g), is given by ni I Y Y

i=1 j =1

[S (tijL ji ; ij )]Nij ;ij fNij [S (tijL ji ; ij ) ; S (tijU ji ; ij )]gij =

ni I Y Y i=1 j =1

[S (tijL ji ; ij

)]Nij

Nij 1 ; SS ((ttijU jji;; ij )) ijL i ij

ij

:

As in the previous section, in the Bernoulli setup after marginalizing out the fNij g the foregoing becomes L (f(tijL ; tijU )g j fi g; fij g; fij g ; fij g), and can be written as ij ni I Y Y S (tijL jij ; i ; ij ) 1 ; SS ((ttijU jjij ;; i;; ij )) : (9:18) ijL ij i ij i=1 j =1

We omit details (similar to those in the previous section) arising from the Weibull parametrization and subsequent incorporation of covariates through the link function ij . 9.5.2 Spatial frailties in cure rate models The development of the hierarchical framework in the preceding section acknowledged the data as coming from I dierent geographical regions (clusters). Such clustered data are common in survival analysis and often modeled using cluster-speci c frailties i . As with the covariates, we will introduce the frailties i through the Weibull link as intercept terms in the log-relative risk; that is, we set ij = xTij + i . Here we allow the i to be spatially correlated across the regions; similarly we would like to permit the Weibull baseline hazard parameters, i ,

© 2004 by CRC Press LLC

334

SPATIAL SURVIVAL MODELS

to be spatially correlated. A natural approach in both cases is to use a univariate CAR prior. While one may certainly employ separate, independent CAR priors on and flog i g, another option is to allow these two spatial priors to themselves be correlated. In other words, we may want a bivariate spatial model for the i = (i ; i )T = (i ; log i )T . As mentioned in Sections 7.4 and 9.4, we may use the MCAR distribution for this purpose. In our setting, the MCAR distribution on the concatenated vector = (T ; T )T is Gaussian with mean 0 and precision matrix ;1 (Diag (mi ) ; W ), where is a 2 2 symmetric and positive de nite matrix, 2 (0; 1), and mi and W remain as above. In the current context, we may also wish to allow dierent smoothness parameters (say, 1 and 2 ) for and , respectively, as in Section 9.4. Henceforth, in this section we will denote the proper MCAR with a common smoothness parameter by MCAR (; ), and the multiple smoothness parameter generalized MCAR by MCAR (1 ; 2 ; ). Combined with independent (univariate) CAR models for and , these oer a broad range of potential spatial models. 9.5.3 Model comparison Suppose we let denote the set of all model parameters, so that the deviance statistic (4.9) becomes D( ) = ;2 log f (yj ) + 2 log h(y) : (9:19) When DIC is used to compare nested models in standard exponential family settings, the unnormalized likelihood L( ; y) is often used in place of the normalized R form f (yj ) in (9.19), since in this case the normalizing function m( ) = L( ; y)dy will be free of and constant across models, hence contribute equally to the DIC scores of each (and thus have no impact on model selection). However, in settings where we require comparisons across dierent likelihood distributional forms, it appears one must be careful to use the properly scaled joint density f (yj ) for each model. We argue that use of the usual proportional hazards likelihood (which of course is not a joint density function) is in fact appropriate for DIC computation here, provided we make a fairly standard assumption regarding the relationship between the survival and censoring mechanisms generating the data. Speci cally, suppose the distribution of the censoring times is independent of that of the survival times and does not depend upon the survival model parameters (i.e., independent, noninformative censoring). Let g (tij ) denote the density of the censoring time for the ij th individual, with corresponding survival (1-cdf) function R (tij ). Then the right-censored likelihood (9.16) can be extended to the joint likelihood speci cation, QI Qni 1;ij i=1 j =1 [S (tij jij ; i ; ij )] [(1 ; ij ) S (tij ji ; ij ) h (tij ji ; ij )]ij [R (tij )]ij [g (tij )]1;ij ;

© 2004 by CRC Press LLC

SPATIAL CURE RATE MODELS ?

335

as for example in Le (1997, pp. 69{70). While not a joint probability density, this likelihood is still an everywhere nonnegative and integrable function of the survival model parameters , and thus suitable for use with the Kullback-Leibler divergences that underlie DIC (Spiegelhalter et al., 2002, p. 586). But by assumption, R (t) and g (t) do not depend upon . Thus, like an m( ) that is free of , they may be safely ignored in both the pD and DIC calculations. Note this same argument implies that we can use the unnormalized likelihood (9.16) when comparing not only nonnested parametric survival models (say, Weibull versus gamma), but even parametric and semiparametric models (say, Weibull versus Cox) provided our de nition of \likelihood" is comparable across models. Note also that here our \focus" (in the nomenclature of Spiegelhalter et al., 2002) is solely on . An alternative would be instead to use a missing data formulation, where we include the likelihood contribution of fsij g, the collection of latent survival times for the right-censored individuals. Values for both and the fsij g could then be imputed along the lines given by Cox and Oakes (1984, pp. 165{166) for the EM algorithm or Spiegelhalter et al. (1995b, the \mice" example) for the Gibbs sampler. This would alter our focus from to ( ; fsij g), and pD would re ect the correspondingly larger eective parameter count. Turning to the interval censored case, here matters are only a bit more complicated. Converting the interval-censored likelihood (9.18) to a joint likelihood speci cation yields ni I Y Y

(tijU jij ; i ; ij ) ij S S (tijL jij ; i ; ij ) 1 ;

S (tijL jij ; i ; ij ) ij [R (tijL )]ij 1 ; RR ((ttijU )) [g (tijL )]1;ij : ijL Now [R (tijL )]ij (1 ; R (tijU ) =R (tijL ))ij [g (tijL )]1;ij is the function absorbed into m( ), and is again free of . Thus again, use of the usual form i=1 j =1

of the interval-censored likelihood presents no problems when comparing models within the interval-censored framework (including nonnested parametric models, or even parametric and semiparametric models). Note that it does not make sense to compare a particular right-censored model with a particular interval-censored model. The form of the available data is dierent; model comparison is only appropriate to a given data set. Example 9.5 (Smoking cessation data). We illustrate our methods using the aforementioned study of smoking cessation, a subject of particular interest in studies of lung health and primary cancer control. Described more fully by Murray et al. (1998), the data consist of 223 subjects who reside in 53 zip codes in the southeastern corner of Minnesota. The subjects, all of whom were smokers at study entry, were randomized into either a

© 2004 by CRC Press LLC

336

SPATIAL SURVIVAL MODELS

Map showing missingness pattern for the smoking cessation data: lightly shaded regions are those having no responses.

Figure 9.10

smoking intervention (SI) group, or a usual care (UC) group that received no special antismoking intervention. Each subject's smoking habits were monitored at roughly annual visits for ve consecutive years. The subjects we analyze are actually the subset who are known to have quit smoking at least once during these ve years, and our event of interest is whether they relapse (resume smoking) or not. Covariate information available for each subject includes sex, years as a smoker, and the average number of cigarettes smoked per day just prior to the quit attempt. To simplify matters somewhat, we actually t our spatial cure rate models over the 81 contiguous zip codes shown in Figure 9.10, of which only the 54 dark-shaded regions are those contributing patients to our data set. This enables our models to produce spatial predictions even for the 27 unshaded regions in which no study patients actually resided. All of our MCMC algorithms ran 5 initially overdispersed sampling chains, each for 20,000 iterations. Convergence was assessed using correlation plots, sample trace plots, and Gelman-Rubin (1992) statistics. In every case a burn-in period of 15,000 iterations appeared satisfactory. Retaining the remaining 5,000 samples from each chain yielded a nal sample of 25,000 for posterior summarization. Table 9.12 provides the DIC scores for a variety of random eects cure rate models in the interval-censored case. Models 1 and 2 have only random frailty terms i with i.i.d. and CAR priors, respectively. Models 3 and 4 add random Weibull shape parameters i = log i , again with i.i.d. and CAR priors, respectively, independent of the priors for the i . Finally, Models 5 and 6 consider the full MCAR structure for the (i ; i ) pairs, assuming common and distinct spatial smoothing parameters, respectively. The DIC scores do not suggest that the more complex models are signi cantly better; apparently the data encourage a high degree of shrinkage in the random

© 2004 by CRC Press LLC

SPATIAL CURE RATE MODELS ?

Model Log-relative risk

xTij + i ; i iid N (0; ) ; i = 8 i xTij + i ; fi g CAR () ; i = 8 i xTij + i ; i iid N (0; ) ; i iid N (0; ) xTij + i ; fi g CAR () ; fi g CAR ( ) xTij + i ; (fi g ; fi g) MCAR (; ) xTij + i ; (fi g ; fi g) MCAR (; ; )

1 2 3 4 5 6 Table 9.12

337

pD DIC 10.3 9.4 13.1 10.4 7.9 8.2

438 435 440 439 434 434

DIC and pD values for various competing interval-censored models.

Parameter Intercept Sex (male = 0) Duration as smoker SI/UC (usual care = 0) Cigarettes smoked per day (cure fraction)

Median (2.5%, 97.5%) {2.720 ({4.803, {0.648) 0.291 ({0.173, 0.754) {0.025 ({0.059, 0.009) {0.355 ({0.856, 0.146) 0.010 ({0.010, 0.030) 0.694 (0.602, 0.782) 0.912 (0.869, 0.988) 0.927 (0.906, 0.982) 11 (spatial variance component, i ) 0.005 (0.001, 0.029) 22 (spatial variance component, ) 0.007 (0.002, 0.043) i p 12 = 11 22 0.323 ({0.746, 0.905) Table 9.13

Posterior quantiles, full model, interval-censored case.

eects (note the low pD scores). In what follows we present results for the \full" model (Model 6) in order to preserve complete generality, but emphasize that any of the models in Table 9.12 could be used with equal con dence. Table 9.13 presents estimated posterior quantiles (medians, and upper and lower .025 points) for the xed eects , cure fraction , and hyperparameters in the interval-censored case. The smoking intervention does appear to produce a decrease in the log relative risk of relapse, as expected. Patient sex is also marginally signi cant, with women more likely to relapse than men, a result often attributed to the (real or perceived) risk of weight gain following smoking cessation. The number of cigarettes smoked per day does not seem important, but duration as a smoker is signi cant, and in

© 2004 by CRC Press LLC

338

SPATIAL SURVIVAL MODELS

a)

b) 0.95 - 0.98

-0.30 - -0.20

0.98 - 0.99

-0.20 - -0.10

0.99 - 1.01

-0.10 - -0.02

RR RR

-0.02 - 0.02

0.02 - 0.10

RR RR

1.01 - 1.03 1.03 - 1.05

0.10 - 0.20

1.05 - 1.08

0.20 - 0.37

1.08 - 1.19

Maps of posterior means for the i (a) and the i (b) in the full spatial MCAR model, assuming the data to be interval-censored (see also color insert).

Figure 9.11

a possibly counterintuitive direction: shorter-term smokers relapse sooner. This may be due to the fact that people are better able to quit smoking as they age (and are thus confronted more clearly with their own mortality). The estimated cure fraction in Table 9.13 is roughly .70, indicating that roughly 70% of smokers in this study who attempted to quit have in fact been \cured." The spatial smoothness parameters and are both close to 1, again suggesting we would lose little by simply setting them both equal to 1 (as in the standard CAR model). Finally, the last lines of both tables indicate only a moderate correlation between the two random eects, again consistent with the rather weak case for including them in the model at all. We compared our results to those obtained from the R function survreg using a Weibull link, and also to Weibull regression models t in a Bayesian fashion using the WinBUGS package. While neither of these alternatives featured a cure rate (and only the WinBUGS analysis included spatial random eects), both produced xed eect estimates quite consistent with those in Table 9.13. Turning to graphical summaries, Figure 9.11 (see also color insert Figure C.13) maps the posterior medians of the frailty (i ) and shape (i ) parameters in the full spatial MCAR (Model 6) case. The maps reveal some interesting spatial patterns, though the magnitudes of the dierences appear relatively small across zip codes. The south-central region seems to be of some concern, with its high values for both i (high overall relapse rate) and i (increasing baseline hazard over time). By contrast, the four zip codes comprising the city of Rochester, MN (home of the Mayo Clinic, and marked with an \R" in each map) suggest slightly better than average cessation behavior. Note that a nonspatial model cannot impute anything other than the \null values" (i = 0 and i = 1) for any zip code contributing no data (all of the unshaded regions in Figure 9.10). Our spatial model however is able to impute nonnull values here, in accordance with the observed values in neighboring regions.

© 2004 by CRC Press LLC

EXERCISES

339

Unit Drug Time Unit Drug Time Unit Drug Time A 1 74+ E 1 214 H 1 74+ A 2 248 E 2 228+ H 1 88+ A 1 272+ E 2 262 H 1 148+ A 2 344 H 2 162 F 1 6 B 2 4+ F 2 16+ I 2 8 B 1 156+ F 1 76 I 2 16+ F 2 80 I 2 40 C 2 100+ F 2 202 I 1 120+ F 1 258+ I 1 168+ D 2 20+ F 1 268+ I 2 174+ D 2 64 F 2 368+ I 1 268+ D 2 88 F 1 380+ I 2 276 D 2 148+ F 1 424+ I 1 286+ D 1 162+ F 2 428+ I 1 366 D 1 184+ F 2 436+ I 2 396+ D 1 188+ I 2 466+ D 1 198+ G 2 32+ I 1 468+ D 1 382+ G 1 64+ D 1 436+ G 1 102 J 1 18+ G 2 162+ J 1 36+ E 1 50+ G 2 182+ J 2 160+ E 2 64+ G 1 364+ J 2 254 E 2 82 E 1 186+ H 2 22+ K 1 28+ E 1 214+ H 1 22+ K 1 70+ K 2 106+ Survival times (in half-days) from the MAC treatment trial, from Carlin and Hodges (1999). Here, \+" indicates a censored observation.

Table 9.14

9.6 Exercises 1. The data located at www.biostat.umn.edu/~brad/data/MAC.dat, and also shown in Table 9.14, summarize a clinical trial comparing two treatments for Mycobacterium avium complex (MAC), a disease common in late-stage HIV-infected persons. Eleven clinical centers (\units") have enrolled a total of 69 patients in the trial, 18 of which have died; see Cohn et al. (1999) and Carlin and Hodges (1999) for full details regarding this trial.

© 2004 by CRC Press LLC

340

SPATIAL SURVIVAL MODELS

As in Section 9.1, let tij be the time to death or censoring and xij be the treatment indicator for subject j in stratum i (j = 1; : : : ; ni , i = 1; : : : ; k). With proportional hazards and a Weibull baseline hazard, stratum i's hazard is then h(tij ; xij ) = h0 (tij )!i exp( 0 + 1 xij ) = i tiji ;1 exp( 0 + 1 xij + Wi ) ; where i > 0, = ( 0 ; 1 )0 2 J . While this does not mean that Y (si ) and Y (s0i ) are perfectly associated, it does mean that specifying Y () at J distinct locations determines the value of the process at all other locations. As a result, such modeling may be more attractive for spatial random eects than for the data itself. A variant of this strategy is a conditioning idea. Suppose we partition the region of interest into M subregions so P that we have the total of n points partitioned into nm in subregion m with M m=1 nm = n. Suppose we assume that Y (s) and Y (s0 ) are conditionally independent given s lies in subregion m and s0 lies in subregion m0 . However, suppose we assign random eects

(s1 ); :::; (sM ) with (sm) assigned to subregion m. Suppose the sM 's are \centers" of the subregions (using an appropriate de nition) and that the (sM ) follows a spatial process that we can envision as a hyperspatial process. There are obviously many ways to build such multilevel spatial structures, achieving a variety of spatial association behaviors. We do not elaborate here but note that matrices will now be nm nm and M M rather than n n. A.5.5 Coarse- ne coupling Lastly, particularly for hierarchical models with a non-Gaussian rst stage, a version of the coarse- ne idea as in Higdon, Lee, and Holloman (2003) may be successful. The idea here is, with a non-Gaussian rst stage, if spatial random eects (say, (s1 ); : : : ; (sn )) are introduced at the second stage, then, as in Subsection 5.2, the set of (si ) will have to be updated at each iteration of a Gibbs sampling algorithm. Suppose n is large and that the \ ne" chain does such updating. This chain will proceed very slowly. But now suppose that concurrently we run a \coarse" chain using a much smaller subset n0 of the si 's. The coarse chain will update very rapidly. Since the process for () is the same in both chains it will be the case that the coarse one will explore the posterior more rapidly. However, we need realizations from the ne chain to t the model using all of the data. The coupling idea is to let both the ne and coarse chains run, and after

© 2004 by CRC Press LLC

392

MATRIX THEORY AND SPATIAL COMPUTING METHODS

a speci ed number of updates of the ne chain (and many more updates of the coarse chain, of course) we attempt a \swap;" i.e., we propose to swap the current value of the ne chain with that of the coarse chain. The swap attempt ensures that the equilibrium distributions for both chains are not compromised (see Higdon, Lee, and Holloman, 2003). For instance, given the values of the 's for the ne iteration, we might just use the subset of 's at the locations for the coarse chain. Given the values of the 's for the coarse chain, we might do an appropriate kriging to obtain the 's for the ne chain. Such coupling strategies have yet to be thoroughly investigated.

A.6 Slice Gibbs sampling for spatial process model tting

Auxiliary variable methods are receiving increased attention among those who use MCMC algorithms to simulate from complex nonnormalized multivariate densities. Recent work in the statistical literature includes Tanner and Wong (1987), Besag and Green (1993), Besag et al. (1995), and Higdon (1998). The particular version we focus on here introduces a single auxiliary variable to \knock out" or \slice" the likelihood. Employed in the context of spatial modeling for georeferenced data using a Bayesian formulation with commonly used proper priors, in this section we show that convenient Gibbs sampling algorithms result. Our approach thus nds itself as a special case of recent work by Damien, Wake eld, and Walker (1999), who view methods based on multiple auxiliary variables as a general approach to constructing Markov chain samplers for Bayesian inference problems. We are also close in spirit to recent work of Neal (2003), who also employs a single auxiliary variable, but prefers to slice the entire nonnormalized joint density and then do a single multivariate updating of all the variables. Such updating requires sampling from a possibly high-dimensional uniform distribution with support over a very irregular region. Usually, a bounding rectangle is created and then rejection sampling is used. As a result, a single updating step will often be inecient in practice. Currently, with the wide availability of cheap computing power, Bayesian spatial model tting typically turns to MCMC methods. However, most of these algorithms are hard to automate since they involve tuning tailored to each application. In this section we demonstrate that a slice Gibbs sampler, done by knocking out the likelihood and implemented with a Gibbs updating, enables essentially an automatic MCMC algorithm for tting Gaussian spatial process models. Additional advantages over other simulation-based model tting schemes accrue, as we explain below. In this regard, we could instead slice the product of the likelihood and the prior, yielding uniform draws to implement the Gibbs updates. However, the support for these conditional uniform updates changes with iteration. The conditional interval arises through matrix inverse and determinant functions of model parameters with matrices of dimension equal to the sample size. Slicing only the

© 2004 by CRC Press LLC

SLICE GIBBS SAMPLING FOR SPATIAL PROCESS MODEL FITTING

393

likelihood and doing Gibbs updates using draws from the prior along with rejection sampling is truly \o the shelf," requiring no tuning at all. Approaches that require rst and second derivatives of the log likelihood or likelihood times prior, e.g., the MLE approach of Mardia and Marshall (1984) or Metropolis-Hastings proposal approaches within Gibbs samplers will be very dicult to compute, particularly with correlation functions such as those in the Matern class. Formally, if L(; Y ) denotes the likelihood and () is a proper prior, we introduce the single auxiliary variable U , which, given and Y , is distributed uniformly on (0; L(; Y )). Hence the joint posterior distribution of and U is given by p(; U jY ) / () I (U < L(; Y )) ; (A:10) where I denotes the indicator function. The Gibbs sampler updates U according to its full conditional distribution, which is the above uniform. A component i of is updated by drawing from its prior subject to the indicator restriction given the other 's and U . A standard distribution is sampled and only L needs to be evaluated. Notice that, if hyperparameters are introduced into the model, i.e., () is replaced with (j)( ), the foregoing still applies and is updated without restriction. Though our emphasis here is spatial model tting, it is evident that slice Gibbs sampling algorithms are more broadly applicable. With regard to computation, for large data sets often evaluation of L(; Y ) will produce an under ow, preventing sampling from the uniform distribution for U given and Y . However, log L(; Y ) will typically not be a problem to compute. So, if V = ; log U , given and Y , V + log L(; Y ) Exp(mean = 1:0), and we can transform (A.10) to p(; V jY ) / exp(;V ) I (; log L(; Y ) < V < 1). In fact, in some cases we can implement a more ecient slice sampling algorithm than the slice Gibbs sampler. We need only impose constrained sampling on a subset of the components of . In particular, suppose we write = (1 ; 2 ) and suppose that the full conditional distribution for 1, p(1j2; Y ) / L(1; 2; Y )(1j2), is a standard distribution. Then consider the following iterative updating scheme: sample U given 1 and 2 as above; then, update 2 given 1 and U with a draw from (2 j1) subject to the constraint U < L(1; 2 ; Y ); nally, update 1 with an unconditional draw from p(1 j2 ; Y ). Formally, this scheme is not a Gibbs sampler. Suppressing Y , we are updating p(U j1 ; 2), then p(2 j1 ; U ), and nally p(1j2 ). However, the rst and third distribution uniquely determine p(U; 1j2 ) and, this, combined with the second, uniquely determine the joint distribution. The Markov chain iterated in this fashion still has p(; U jY ) as its stationary distribution. In fact, if p(1j2 ; Y ) is a standard distribution, this implies that we can marginalize over 1 and run the slice Gibbs sampler on 2 with U . Given posterior draws of 2 , we can sample 1 one for one from its posterior using p(1 j2; Y ) and the

© 2004 by CRC Press LLC

394

MATRIX THEORY AND SPATIAL COMPUTING METHODS

R

fact that p(1 jY ) = p(1 j2 ; Y )p(2 jY ). Moreover, if p(1j2 ; Y ) is not a standard distribution, we can add Metropolis updating of 1 either in its entirety or through its components (we can also use Gibbs updating here). We employ these modi ed schemes for dierent choices of (1 ; 2) in the remainder of this section. We note that the performance of the algorithm depends critically on the distribution of the number of draws needed from (2 j1 ) to update 2 given 1 and U subject to the constraint U < L(1 ; 2; Y ). Henceforth, this will be referred to as \getting a point in the slice." A naive rejection sampling scheme (repeatedly sample from (2 j1) until we get to a point in the slice) may not always give good results. An algorithm that shrinks the support of (2 j1) so that it gives a better approximation to the slice whenever there is a rejection is more appropriate. We propose one such scheme called \shrinkage sampling" described in Neal (2003). In this context, it results in the following algorithm. For simplicity, let us assume 2 is one-dimensional. If a point ^ 2 drawn from (2 j1) is not in the slice and is larger (smaller) than the current value 2 (which is of course in the slice), the next draw is made from (2 j1) truncated with the upper (lower) bound being ^2 . The truncated interval keeps shrinking with each rejection until a point in the slice is found. The multidimensional case works by shrinking hyperrectangles. As mentioned in Neal (2003), this ensures that the expected number of points drawn will not be too large, making it a more appropriate method for general use. However, intuitively it might result in higher autocorrelations compared to the simple rejection sampling scheme. In our experience, the shrinkage sampling scheme has performed better than the naive version in most cases. Suppresing Y in our notation, we summarize the main steps in our slice Gibbs sampling algorithm as follows: (a) Partition = (1 ; 2) so that samples from p(1 j2 ) are easy to obtain; (b) Draw V = ; log L() + Z , where Z Exp(mean = 1); (c) Draw 2 from p(2 j1; V ) I (; log L() < V < 1) using shrinkage sampling; (d) Draw 1 from p(1 j2 ); (e) Iterate (b) through (d) until we get the appropriate number of MCMC samples. The spatial models on which we focus arise through the speci cation of a Gaussian process for the data. With, for example, an isotropic covariance function, proposals for simulating the range parameter for, say, an exponential choice, or the range and smoothness parameters for a Matern choice can be dicult to develop. That is, these parameters appear in the covariance matrix for Y in a nonstructured way (unless the spatial locations are on a regular grid). They enter the likelihood through the determinant

© 2004 by CRC Press LLC

SLICE GIBBS SAMPLING FOR SPATIAL PROCESS MODEL FITTING

395

and inverse of this matrix. And, for large n, the fewer matrix inversion and determinant computations, the better. As a result, for a noniterative sampling algorithm, it is very dicult to develop an eective importance sampling distribution for all of the model parameters. Moreover, as overall model dimension increases, resampling typically yields a very \spiked" discrete distribution. Alternative Metropolis algorithms require eective proposal densities with careful tuning. Again, these densities are dicult to obtain for parameters in the correlation function. Morover, in general, such algorithms will suer slower convergence than the Gibbs samplers we suggest, since full conditional distributions are not sampled. Furthermore, in our experience, with customary proposal distributions we often encounter serious autocorrelation problems. When thinning to obtain a sample of roughly uncorrelated values, high autocorrelation necessitates an increased number of iterations. Additional iterations require additional matrix inversion and determinant calculation and can substantially increase run times. Discretizing the parameter spaces has been proposed to expedite computation in this regard, but it too has problems. The support set is arbitrary, which may be unsatisfying, and the support will almost certainly be adaptive across iterations, diminishing any computational advantage. A.6.1 Constant mean process with nugget Suppose Y (s1 ); : : : ; Y (sn ) are observations from a constant mean spatial process over s 2 D with a nugget. That is, Y (si ) = + w(si ) + (si ) ; (A:11) where the (si ) are realizations of a white noise process with mean 0 and variance 2 . In (A.11), the w(si ) are realizations from a second-order stationary Gaussian process with covariance function 2 C (h; ) where C is a valid two-dimensional correlation function with parameters and separation vector h. Below we work with the Matern class (2.8), so that = (; ). Thus (A.11) becomes a ve-parameter model: = (; 2 ; 2 ; ; )T . Note that though the Y (si ) are conditionally independent given the w(si ), a Gibbs sampler that also updates the latent w(si )'s will be sampling an (n+5)-dimensional posterior density. However, it is possible to marginalize explicitly over the w(si )'s (see Section 5.1), and it is almost always preferable to implement iterative simulation with a lower-dimensional distribution. The marginal likelihood associated with Y = (Y (s1 ); : : : ; Y (sn )) is L(; 2 ; 2 ; ; ; Y ) =j 2 H () + 2 I j; 21 expf;(Y ; 1)T (2 H () + 2 I );1 (Y ; 1)=2g ; (A:12) where (H ())ij = 2 C (dij ; ) (dij being the distance between si and sj ).

© 2004 by CRC Press LLC

396

MATRIX THEORY AND SPATIAL COMPUTING METHODS

Suppose we adopt a prior of the form 1 ()2 ( 2 )3 (2 )4 ()5 ( ). Then (A.10) becomes 1 ()2 ( 2 )3 (2 )4 ()5 ( ) I (U < L(; 2 ; 2 ; ; ; Y )). The Gibbs sampler is most easily implemented if, given and , we diagonalize H (), i.e., H () = P ()D()(P ())T where P () is orthogonal with the columns of P () giving the eigenvectors of H () and D() is a diagonal matrix with diagonal elements i , the eigenvalues of H (). Then (A.12) simpli es to n Y (2 i + 2 )); 21 expf; 12 (Y ;1)T P ()(2 D()+ 2 I );1 P T ()(Y ;1)g: i=1

As a result, the constrained updating of 2 and 2 at a given iteration does not require repeated calculation of a matrix inverse and determinant. To minimize the number of diagonalizations of H () we update and together. If there is interest in the w(si ), their posteriors can be sampled straightforwardly after the marginalized model is tted. For instance, R p(w(si )jY ) = p(w(si )j; Y )p(jY )d so each posterior sample ? , using a draw from p(w(si )j? ; Y ) (which is a normal distribution), yields a sample from the posterior for w(si ). We remark that (A.11) can also include a parametric transformation of Y (s). For instance, we could employ a power transformation to nd a scale on which the Gaussian process assumption is comfortable. This only requires replacing Y (s) with Y p (s) and adds one more parameter to the likelihood in (A.12). Lastly, we note that other dependence structures for Y can be handled in this fashion, e.g., equicorrelated forms, Toeplitz forms, and circulants. A.6.2 Mean structure process with no pure error component Now suppose Y (s1 ); : : : ; Y (sn ) are observations from a spatial process over s 2 D such that Y (si ) = X T (si ) + w(si ) : (A:13) Again, the w(si ) are realizations from a second-order stationary Gaussian process with covariance parameters 2 and . In (A.13), X (si ) could arise as a vector of site level covariates or X T (si ) could be a trend surface speci cation as in the illustration below. To complete the Bayesian speci cation we adopt a prior of the form 1 ( )2 (2 )3 () where 1 ( ) is N ( ; ) with and known. This model is not hierarchical in the sense of our earlier forms, but we can marginalize explicitly over , obtaining L(2 ; ; Y ) =j 2 H () + X X T j; 21 expf;(Y ; X )T (2 H () + X X T );1 (Y ; X )=2g ; (A:14)

© 2004 by CRC Press LLC

SLICE GIBBS SAMPLING FOR SPATIAL PROCESS MODEL FITTING

397

where the rows of X are the X X X T is symmetric positive semide nite. Hence, there exists a nonsingular matrix Q() such that (Q;1 ())T Q;1() = H () and also satisfying (Q;1 ())T Q;1 () = X X T , where is diagonal with diagonal elements that are eigenvalues of X X T H ;1 (). Therefore, (A.14) simpli es to jQ()j Qnin=1 (2 + i )); 21 o exp ; 21 (Y ; X )T Q()T (2 I + );1 Q()(Y ; X ) : As in the previous section, we run a Gibbs sampler to update U given 2 ; , and Y , then 2 given ; U , and Y , and nally given 2 ; U , and Y . Then, given posterior samples fl2; l ;2l = 1; :: : ; Lg we can obtain posterior samples for one for one given l and l by drawing l from a N (Aa; A) distribution, where A;1 = 1l2 X T H ;1 (l )X + (A:15) and a = 1l2 X T H ;1 (l )Y + ; 1 : T (si ): Here, H ( ) is positive de nite while

In fact, using standard identities (see, e.g., Rao, 1973, p. 29), 1 ;1 ; 1 T ; 1 = ; X T Q()(2 I + );1QT ()X ; 2 X H ()X +

facilitating sampling from (A.15). Finally, if and were viewed as unknown we could introduce hyperparameters. In this case would typically be diagonal and might be 0 1, but the simultaneous diagonalization would still simplify the implementation of the slice Gibbs sampler. We note an alternate strategy that does not marginalize over and does not require simultaneous diagonalization. The likelihood of ( ; 2 ; ) is given by L( ; 2 ; ; Y ) / j2 H ()j; 12 exp ;(Y ; X )T H ();1 (Y ; X )=22 : (A:16) Letting 1 = ( ; 2 ) and 2 = with normal and inverse gamma priors on and 2, respectively, we can update and 2 componentwise conditional 2 on 2; Y , since j ; 2 ; Y is normal while 2 j ; 2; Y is inverse gamma. 2j1; U; Y is updated using the slice Gibbs sampler with shrinkage as described earlier. A.6.3 Mean structure process with nugget Extending (A.11) and (A.13) we now assume that Y (s1 ); : : : ; Y (sn ) are observations from a spatial process over s 2 D such that Y (si ) = X T (si ) + w(si ) + (si ) : (A:17)

© 2004 by CRC Press LLC

398

MATRIX THEORY AND SPATIAL COMPUTING METHODS

As above, we adopt a prior of the form 1 ( )2 ( 2 )3 (2 )4 (), where 1 ( ) is N ( ; ). Note that we could again marginalize over and the w(si ) as in the previous section, but the resulting marginal covariance matrix is of the form 2 H ()+ X X T + 2 I . The simultaneous diagonalization trick does not help here since Q() is not orthogonal. Instead we just marginalize over the w(si ), obtaining the joint posterior p( ; 2 ; 2 ; ; U jY ) proportional to

1 ( )2 ( 2 )3 (2 )4 () I U < j2 H () + 2 I j; 12 expf;(Y ; X )T (2 H () + 2 I );1 (Y ; X )=2g :

We employ the modi ed scheme suggested below (A.10) taking 1 = and 2 = ( 2 ; 2; ). The required full conditional distribution p( j 2 ; 2 ; ; Y ) is N (Aa; A), where A;1 = X T (2 H () + 2 I );1 X + ; 1 and a = X T (2 H () + 2 I );1 Y + ; 1 :

A.7 Structured MCMC sampling for areal model tting Structured Markov chain Monte Carlo (SMCMC) was introduced by Sargent, Hodges, Carlin (2000) as a general method for Bayesian computing in richly parameterized models. Here, \richly parameterized" refers to hierarchical and other multilevel models. SMCMC (pronounced \smick-mick") provides a simple, general, and exible framework for accelerating convergence in an MCMC sampler by providing a systematic way to update groups of similar parameters in blocks while taking full advantage of the posterior correlation structure induced by the model and data. Sargent (2000) apply SMCMC to several dierent models, including a hierarchical linear model with normal errors and a hierarchical Cox proportional hazards model. Blocking, i.e., simultaneously updating multivariate blocks of (typically highly correlated) parameters, is a general approach to accelerating MCMC convergence. Liu (1994) and Liu et al. (1994) con rm its good performance for a broad class of models, though Liu et al. (1994, Sec. 5) and Roberts and Sahu (1997, Sec. 2.4) give examples where blocking slows a sampler's convergence. In this section, we show that spatial models of the kind proposed by Besag, York, and Mollie (1991) using nonstationary \intrinsic autoregressions" are richly parameterized and lend themselves to the SMCMC algorithm. Bayesian inference via MCMC for these models has generally used single-parameter updating algorithms with often poor convergence and mixing properties. There have been some recent attempts to use blocking schemes for similar models. Cowles (2002, 2003) uses SMCMC blocking strategies for geostatistical and areal data models with normal likelihoods,

© 2004 by CRC Press LLC

STRUCTURED MCMC SAMPLING FOR AREAL MODEL FITTING

399

while Knorr-Held and Rue (2002) implement blocking schemes using algorithms that exploit the sparse matrices that arise out of the areal data model. We study several strategies for block-sampling parameters in the posterior distribution when the likelihood is Poisson. Among the SMCMC strategies we consider here are blocking using dierent-sized blocks (grouping by geographical region), updating jointly with and without model hyperparameters, \oversampling" some of the model parameters, reparameterization via hierarchical centering and \pilot adaptation" of the transition kernel. Our results suggest that our techniques will generally be far more accurate (produce less correlated samples) and often more ecient (produce more eective samples per second) than univariate sampling procedures. SMCMC algorithm basics Following Hodges (1998), we consider a hierarchical model expressed in the general form, 2 y 3 2X 0 3 2 3 1 2 3

66 66 4

M

6 7 77 66 7 77 = 66 H H 777 4 5 + 666 777 : 4 5 5 4 5 1

1

2

G1 G2

2

(A:18)

The rst row of this layout is actually a collection of rows corresponding to the \data cases," or the terms in the joint posterior into which the response, the data y, enters directly. The terms in the second row (corresponding to the Hi ) are called \constraint cases" since they place stochastic constraints on possible values of 1 and 2 . The terms in the third row, the \prior cases" for the model parameters, have known (speci ed) error variances for these parameters. Equation (A.18) can be expressed as Y = X + E, where X and Y are known, is unknown, and E is an error term with block diagonal covariance matrix ; = Diag(Cov(); Cov(); Cov( )). If the error structure for the data is normal, i.e., if the vector in the constraint case formulation (A.18) is normally distributed, then the conditional posterior density of is jY; ; N ((X T ;;1X );1(X T ;;1Y ) ; (X T ;;1X );1) : (A:19) The basic SMCMC algorithm is then nothing but the following two-block Gibbs sampler : (a) Sample as a single block from the above normal distribution, using the current value of ;. (b) Update ; using the conditional distribution of the variance components, using the current value of .

© 2004 by CRC Press LLC

400

MATRIX THEORY AND SPATIAL COMPUTING METHODS

In our spatial model setting, the errors are not normally distributed, so the normal density described above is not the correct conditional posterior distribution for . Still, a SMCMC algorithm with a Metropolis-Hastings implementation can be used, with the normal density in (A.19) taken as the candidate density. A.7.1 Applying structured MCMC to areal data Consider again the Poisson-CAR model of Subsection 5.4.3, with no covariates so that i = i + i ; i = 1; : : : ; N; where N is the total number of regions, and f1 ; ::; N g; f1; ::; N g are vectors of random eects. The i 's are independent and identically distributed Gaussian normal variables with precision parameter h , while the i 's are assumed to follow a CAR(c ) distribution. We place conjugate gamma hyperpriors on the precision parameters, namely h G(h ; h ) and c G(c ; c ) with h = 1:0, h = 100:0, c = 1:0 and c = 50:0 (these hyperpriors have means of 100 and 50, and standard deviations of 10,000 and 2,500, respectively, a speci cation recommended by Bernardinelli et al., 1995). There is a total of 2N + 2 model parameters: fi : i = 1; : : : N g, fi : i = 1; : : : N g, h and c . The SMCMC algorithm requires that we transform the Yi data points to ^i = log(Yi =Ei ), which can be conveniently thought of as the response since they should be roughly linear in the model parameters (the i 's and i 's). For the constraint case formulation, the different levels of the model are written down case by case. The data cases are ^i ; i = 1; : : : ; N . The constraint cases for the i 's are i N (0; 1=h), i = 1; : : : ; N . For the constraint cases involving the i 's, the dierences between the neighboring i 's can be used to get an unconditional distribution for the i 's using pairwise dierences (Besag et al., 1995). Thus the constraint cases can be written as (i ; j )jc N (0; 1=c) (A:20) for each pair of adjacent regions (i; j ). To obtain an estimate of ;, we need estimates of the variance-covariance matrix corresponding to the ^i 's (the data cases) and initial estimates of the variance-covariance matrix for the constraint cases (the rows corresponding to the i 's and i 's). Using the delta method, we can obtain an approximation as follows: assume Yi N (Ei ei ; Ei ei ) (roughly), so invoking the delta method we can see that Var(log(Yi =Ei )) is approximately 1/Yi . A reasonably good starting value is particularly important here since we never update these variance estimates (the data variance section of ; stays the same throughout the algorithm). For initial estimates of the variance components corresponding to the i 's and the i 's, we can use the mean of the hyperprior densities on h and c , and substitute these values into ;.

© 2004 by CRC Press LLC

STRUCTURED MCMC SAMPLING FOR AREAL MODEL FITTING

401

As a result, the SMCMC candidate generating distribution is thus of the form (A.19), with the Yi 's replaced by ^ . To compute the Hastings ratio, the distribution of the i 's is rewritten in the joint pairwise dierence form with the appropriate exponent for c (Hodges, Carlin, and Fan, 2003):

p(1 ; 2 ; :::; N jc ) / c(N ;1)=2

8 9 < c X = exp :; 2 (i ; j ) ; ; ij 2

(A:21)

where i j if i and j are neighboring regions. Finally, the joint distribution of the i 's is given by

p(1 ; 2 ; :::; N jh ) / hN=2 exp

(

)

N X h ; 2 i2 :

(A:22)

i=1

As above, the response vector is ^ T = flog(Y1 =E1 ); : : : ; log(YN =EN )g. The (2N + C ) 2N design matrix for the spatial model is de ned by

2 I I 66 N N N N X = 66 ;IN N 0N N 4

3 77 77 : 5

(A:23)

0C N AC N The design matrix is divided into two halves, the left half corresponding to the N i 's and the right half referring to the N i 's. The top section of this design matrix is an N 2N matrix relating ^i to the model parameters i and i . In the ith row, a 1 appears in the ith and (N + i)th columns while 0's appear elsewhere. Thus the ith row corresponds to i = i + i . The middle section of the design matrix is an N 2N matrix that imposes a stochastic constraint on each i separately (i 's are i.i.d normal). The bottom section of the design matrix is a C 2N matrix with each row having a ;1 and 1 in the (N + k)th and (N + l)th columns, respectively, corresponding to a stochastic constraint being imposed on l ; k (using the pairwise dierence form of the prior on the i 's as described in (A.20) with regions l and k being neighbors). The variance-covariance matrix ; is a diagonal matrix with the top left section corresponding to the variances of the data cases, i.e., the ^i 's. Using the variance approximations described above, the (2N + C ) (2N + C ) block diagonal variance-covariance matrix is 2 Diag(1=Y1; 1=Y2; : : : ; 1=YN ) 0N N 0N C 3

66 4

; = 66

© 2004 by CRC Press LLC

0N N

h IN N

0N C

0C N

0C N

c IC C

1

1

77 77 : 5

(A:24)

402

MATRIX THEORY AND SPATIAL COMPUTING METHODS

Note that the exponent on c in (A.21) would actually be C=2 (instead of (N ; 1)=2) if obtained by taking the product of the terms in (A.20). Thus, (A.20) is merely a form we use to describe the distribution of the i s for our constraint case speci cation. The formal way to incorporate the distribution of the i s in the constraint case formulation is by using an alternate speci cation of the joint distribution of the i 's, as described in Besag and Kooperberg (1995). This form is a N N Gaussian density with precision matrix, Q,

p(1 ; 2 ; :::; N jc ) / exp ; 2c T Q ; where T = (1 ; 2 ; :::; N ); (A:25)

and

8 c < Qij = : 0 ;1

if i = j where c = number of neighbors of region i if i is not adjacent to j : if i is not adjacent to j However, it is possible to show that this alternate formulation (using the corresponding design and ; matrices) results in the same SMCMC candidate mean and covariance matrix for given h and c as the one described in (A.19); see Haran, Hodges, and Carlin (2003) for details. A.7.2 Algorithmic schemes

Univariate MCMC (UMCMC): For the purpose of comparing the dierent blocking schemes, one might begin with a univariate (updating one variable at a time) sampler. This can be done by sampling h and c from their gamma full conditional distributions, and then, for each i, sampling each i and i from its full conditional distribution, the latter using a Metropolis step with univariate Gaussian random walk proposals. Reparameterized Univariate MCMC (RUMCMC): One can also reparameterize from (1 ; : : : ; N ; 1 ; : : : ; N ) to (1 ; : : : ; N ; 1 ; : : : ; N ), where i = i + i . The (new) model parameters and the precision parameters can be sampled in a similar manner as for UMCMC. This \hierarchical centering" was suggested by (Besag et al. (1995) and Waller et al. (1997) for the spatial model, and discussed in general by Gelfand et al. (1995, 1996). Structured MCMC (SMCMC): A rst step here is pilot adaptation, which involves sampling (h , c ) from their gamma full conditionals, updating the ; matrix using the averaged (h , c ) sampled so far, updating the SMCMC candidate covariance matrix and mean vector using the ; matrix, and then sampling (; ) using the SMCMC candidate in a MetropolisHastings step. We may run the above steps for a \tuning" period, after which we x the SMCMC candidate mean and covariance, sampled (h , c ) as before, and use the Metropolis-Hastings to sample (; ) using SMCMC

© 2004 by CRC Press LLC

STRUCTURED MCMC SAMPLING FOR AREAL MODEL FITTING

403

proposals. Some related strategies include adaptation of the ; matrix more or less frequently, adaptation over shorter and longer periods of time, and pilot adaptation while blocking on groups of regions. Our experience with pilot adaptation schemes indicates that a single proposal, regardless of adaptation period length, will probably be unable to provide a reasonable acceptance rate for the many dierent values of (h ; c ) that will be drawn in realistic problems. As such, we typically turn to oversampling relative to (h ; c ); that is, the SMCMC proposal is always based on the current (h ; c ) value. In this algorithm, we sample h and c from their gamma full conditionals, then compute the SMCMC proposal based on the ; matrix using the generated h and c . For each (h ; c ) pair, we run a Hastings independence subchain by sampling a sequence of length 100 (say) of 's using the SMCMC proposal. Further implementational details for this algorithm are given in Haran (2003). Reparameterized Structured MCMC (RSMCMC): This nal algorithm is the SMCMC analogue of the reparametrized univariate algorithm (RUMCMC). It follows exactly the same steps as the SMCMC algorithm, with the only dierence being that is now (; ) instead of (; ), and the proposal distribution is adjusted according to the new parameterization. Haran, Hodges, and Carlin (2003) compare these schemes in the context of two areal data examples, using the notion of eective sample size, or ESS (Kass et al., 1998). ESS is de ned for each parameter as the number of MCMC samples drawn by the parameter's so-called autocorrelation P divided time, = 1 + 2 1 k=1 (k ), where (k ) is the autocorrelation at lag k . One can estimate from the MCMC chain, using the initial monotone positive sequence estimator as given by Geyer (1992). Haran et al. (2003) nd UMCMC to perform poorly, though the reparameterized univariate algorithm (RUMCMC) does provide a signi cant improvement in this case. However, SMCMC and RSMCMC still perform better than both univariate algorithms. Even when accounting for the amount of time taken by the SMCMC algorithm (in terms of eective samples per second), the SMCMC scheme results in a far more ecient sampler than the univariate algorithm; for some parameters, SMCMC produced as much as 64 times more eective samples per second. Overall, experience with applying several SMCMC blocking schemes to real data sets suggests that SMCMC provides a standard, systematic technique for producing samplers with far superior mixing properties than simple univariate Metropolis-Hastings samplers. The SMCMC and RSMCMC schemes appear to be reliable ways of producing good ESSs, irrespective of the data sets and parameterizations. In many cases, the SMCMC algorithms are also competitive in terms of ES/s. In addition since the blocked SMCMC algorithms mix better, their convergence should be easier to diagnose and thus lead to nal parameter estimates that are less biased.

© 2004 by CRC Press LLC

404

MATRIX THEORY AND SPATIAL COMPUTING METHODS

These estimates should also have smaller associated Monte Carlo variance estimates.

© 2004 by CRC Press LLC

APPENDIX B

Answers to selected exercises Chapter 1

3. As hinted in the problem statement, level of urbanicity might well explain the poverty pattern evident in Figure 1.2. Other regional spatially oriented covariates to consider might include percent of minority residents, percent with high school diploma, unemployment rate, and average age of the housing stock. The point here is that spatial patterns can often be explained by patterns in existing covariate data. Accounting for such covariates in a statistical model may result in residuals that show little or no spatial pattern, thus obviating the need for formal spatial modeling. 7.(a) The appropriate R code is as follows: # #

R program to compute geodesic distance see also www.auslig.gov.au/geodesy/datums/distance.htm

#

input:

point1=(long,lat) and point2=(long,lat) in degrees distance in km between the two points

# output: # example: point1_c(87.65,41.90) # Chicago (downtown) point2_c(87.90,41.98) # Chicago (O'Hare airport) point3_c(93.22,44.88) # Minneapolis (airport) # geodesic(point1,point3) returns 558.6867 geodesic