## Abstract

Here we propose a simple statistical algorithm for rapidly scoring loci associated with disease or traits due to recessive mutations or deletions using genome‐wide single nucleotide polymorphism genotyping case–control data in unrelated individuals. This algorithm identifies loci by defining homozygous segments of the genome present at significantly different frequencies between cases and controls. We found that false positive loci could be effectively removed from the output of this procedure by applying different physical size thresholds for the homozygous segments. This procedure is then conducted iteratively using random sub‐datasets until the number of selected loci converges. We demonstrate this method in a publicly available data set for Alzheimer's disease and identify 26 candidate risk loci in the 22 autosomes. In this data set, these loci can explain 75% of the genetic risk variability of the disease.

## Introduction

Advances in whole‐genome single nucleotide polymorphism (SNP) assay technology have provided a powerful array of tools for simultaneously scoring common genetic variation. However, it is often difficult to identify loci associated with disease because of the large number of tests carried out and the associated conservative multiplicity adjustment, such as the Bonferroni method. We are interested in identifying loci associated with a disease that is likely due to recessive mutations or gene deletions.

High density SNP analysis readily reveals the presence of large homozygous segments in unrelated subjects (Hinds *et al*, 2005; Simon‐Sanches *et al*, 2007; Wang *et al*, 2007). The probability of a randomly selected SNP locus being homozygous (‘AA’ or ‘BB’) based on data from HapMap is about 0.65 (Hinds *et al*, 2005; Rabbee and Speed, 2006), and this may lend itself to autozygosity mapping in ostensibly outbred populations; however, traditional autozygosity mapping methods (Lander and Botstein, 1987; Mueller and Bishop, 1993; Gschwend *et al*, 1996) based on consanguineous relationships are not appropriate for unrelated individuals. To identify loci with possible recessive effects of relatively high penetrance in outbred populations, large sample sizes are needed for genotyping. Several recent studies have attempted homozygosity analysis of SNP assays using different approaches (Woods *et al*, 2004; Lencz *et al*, 2007; Miyazawa *et al*, 2007). However, these approaches either require some familial relationship (Woods *et al*, 2004; Miyazawa *et al*, 2007) or have a high false positive rate (Lencz *et al*, 2007).

In the context of SNP genotyping, it is often not easy to distinguish heterozygous genomic deletion from homozygosity; thus a segment in which all genotyped loci are ‘AA’ or ‘BB’ in a pedigree genotype file could be either a region of genuine homozygosity or a region of effective hemizygosity caused by genomic deletion. We call such a region an ‘apparently homozygous region’ (AH). By carrying out an appropriate association analysis on AHs, one can detect not only possible recessively mutated loci inherited from a common ancestor but also deletions (Hunter, 2005; Klein *et al*, 2005; Van Eyken *et al*, 2007).

In this paper, we propose a simple statistical algorithm for genome‐wide AH analysis (GAHA) of case–control data in unrelated subjects. It can robustly identify loci that are associated with disease by efficiently removing false positive loci. We demonstrate this method in a publicly available data set for Alzheimer's disease (AD) (Coon *et al*, 2007), consisting of 502 627 SNP loci genotyped in 859 unrelated cases and 552 neurologically normal controls. A total of 26 loci from the 22 autosomes are identified, and they explain 75% of the genetic risk variability of the disease.

## Results and discussion

### AH size threshold

In the context of the current data, it is not appropriate to use the number of loci as a measure of AH size as previously reported (Lencz *et al*, 2007) because of its dependence on SNP density. Here we use the number of nucleotide basepairs between the first and last loci of an AH as a measure of AH size.
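The size measure described above can be illustrated with a minimal Python sketch. It assumes that genotype calls are coded as the strings `AA`, `BB` and `AB`, that any other call (including a no‐call) ends an AH, and that the size is taken as the inclusive span `last_position - first_position + 1`, so that a single‐locus AH has size 1. The function names `find_ahs` and `on_ah` and the toy positions are hypothetical and are not taken from the original analysis.

```python
from typing import List, Tuple


def find_ahs(positions: List[int], genotypes: List[str]) -> List[Tuple[int, int, int]]:
    """Return (first_index, last_index, size_bp) for each maximal run of homozygous calls."""
    ahs = []
    start = None
    for i, call in enumerate(genotypes):
        homozygous = call in ("AA", "BB")
        if homozygous and start is None:
            start = i  # a new AH begins here
        if not homozygous and start is not None:
            # the AH ended at the previous locus; size is the inclusive span in bp (assumption)
            ahs.append((start, i - 1, positions[i - 1] - positions[start] + 1))
            start = None
    if start is not None:  # an AH that runs to the end of the chromosome
        last = len(genotypes) - 1
        ahs.append((start, last, positions[last] - positions[start] + 1))
    return ahs


def on_ah(positions: List[int], genotypes: List[str], c: int) -> List[bool]:
    """Flag each locus that lies on an AH whose size is at least the threshold c (in bp)."""
    flags = [False] * len(genotypes)
    for first, last, size in find_ahs(positions, genotypes):
        if size >= c:
            for i in range(first, last + 1):
                flags[i] = True
    return flags


# Toy data for one subject: physical positions (bp) and genotype calls.
positions = [100, 5_000, 12_000, 13_000, 40_000, 41_000]
genotypes = ["AA", "BB", "AA", "AB", "BB", "AA"]

print(find_ahs(positions, genotypes))
print(on_ah(positions, genotypes, 10_000))
```

In this toy example the first AH spans three loci and 11 901 bp, whereas the second spans two loci and only 1001 bp, so only the first survives a threshold of 10 kb.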

Let *C* be a size threshold of AHs. We are interested in identifying loci for which the proportion of subjects carrying the locus on an AH of size ⩾*C* differs significantly between controls and cases. As seen in Figure 1, for example, there are *n*_{1} cases, and for a given *C* we compute the proportion of the locus SNP‐1 on AHs as *p*_{1}=(number of AHs containing SNP‐1)/*n*_{1}. Similarly, for the *n*_{0} controls, we find the proportion *p*_{0} of the same locus. Using *p*_{1} and *p*_{0}, we compute the *z*‐statistic for the proportion test as described in Materials and methods. The locus is selected for further screening if ∣*z*∣⩾*z*_{1−α/2}, where α is the level of significance. Under the null hypothesis, the test statistic *z* asymptotically follows a standard normal distribution as *n*_{0} and *n*_{1} increase; we require each to be greater than 30.

We investigated, through simulation, how the power for selecting loci depends on α, the AH percentage difference between cases and controls, and the AH size threshold *C*. The relationships between the *z* value and the AH percentage difference for various *C* are shown in Supplementary Figure 1. The powers to detect candidate loci were computed at a significance level α=0.001, and we consider a candidate locus detectable if the power exceeds 0.8. Our results showed that, at this significance level, we could detect a locus on AHs⩾*C* with a difference of 30% between cases and controls using *C*=10 kb, or of only 7% using *C*=1 Mb.
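To give a feel for how power depends on the difference in AH proportions, the following Python sketch uses a normal approximation to the two‐sided two‐proportion test. It is not the simulation reported above: the pooled‐variance form of the test, the function name `approx_power` and the baseline proportions in the example are all assumptions chosen for illustration, and both proportions must lie strictly between 0 and 1.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p0: float, p1: float, n0: int, n1: int, alpha: float = 0.001) -> float:
    """Approximate power of a two-sided two-proportion z-test (normal approximation).

    p0, p1: assumed true AH proportions in controls and cases (hypothetical inputs).
    n0, n1: numbers of controls and cases.
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    pooled = (n0 * p0 + n1 * p1) / (n0 + n1)
    # standard error used in the test statistic (pooled, under the null)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n0 + 1 / n1))
    # standard error of the observed difference under the alternative
    se_alt = sqrt(p0 * (1 - p0) / n0 + p1 * (1 - p1) / n1)
    delta = p0 - p1
    upper = nd.cdf((delta - z_crit * se_null) / se_alt)   # P(z >= z_crit)
    lower = nd.cdf((-delta - z_crit * se_null) / se_alt)  # P(z <= -z_crit)
    return upper + lower


# Sample sizes from the AD data set; the proportions are hypothetical.
for diff in (0.05, 0.10, 0.30):
    print(diff, round(approx_power(0.50, 0.50 - diff, 552, 859), 3))
```

With no true difference the approximation returns α, as expected for a test of that size.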

With the above significance level α and a moderate *C* value, typically thousands of loci could be selected from data of unrelated subjects, with a large false positive rate. A key step is to efficiently remove these falsely associated loci from the candidate list. If we knew the minimum size of risk loci, we would set it as *C* and consider only AHs⩾*C*, leading to a lower false positive rate. However, such a *C* value is unknown. One approach is to use multiple values of *C*, as discussed below. By convention, we define *C*=1 for considering AHs with size ⩾1.

### Algorithm for screening risk loci

We propose to use multiple *C* values for screening risk loci. Suppose we choose *C*_{1} and *C*_{2}, with *C*_{1}<*C*_{2}, for selecting candidate loci with ∣*z*∣⩾*z*_{1−α/2}. The distance between *C*_{1} and *C*_{2} must be larger than the minimum distance between loci on the platform and may be chosen by referring to public genotyping parameters (for example, the average distance between loci is ∼9 kb in the Affymetrix 500K GeneChip, and the median distance is ∼3 kb in the Illumina HumanHap550 BeadChip according to Gunderson *et al*, 2005; Steemers and Gunderson, 2007). Let *S*_{1} be the set of loci selected with *C*_{1} and *S*_{2} the set selected with *C*_{2}. As the true AHs with size ⩾*C*_{2}>*C*_{1} will remain using either *C*_{1} or *C*_{2}, loci not in *S*_{1}⋂*S*_{2} are more likely to be false positives and are therefore removed. For example, in the AD data using a significance level α=0.001, among the 25 086 loci on chromosome 1, 18 loci were selected using *C*=10 kb and 12 loci using *C*=30 kb, with only three loci common to both sets. In general, we set *C*={*C*_{i}, *i*=1, 2,…, *L*} with *C*_{1}<*C*_{2}<…<*C*_{L} to cover a wide range of AHs and let *S* be the set containing all loci common to adjacent sets, *S*={*S*_{1}⋂*S*_{2}, *S*_{2}⋂*S*_{3},…, *S*_{L−1}⋂*S*_{L}}. This loci‐selecting procedure is called the ‘procedure of adjacent‐C‐selection’ (PACS).
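A minimal Python sketch of the PACS step follows. It reads *S* as the union of the intersections of adjacent sets, which is one interpretation of the definition above. The function name `pacs`, the dictionary layout and the locus identifiers are hypothetical.

```python
from typing import Dict, Set


def pacs(selected_by_c: Dict[int, Set[str]]) -> Set[str]:
    """Keep loci selected under at least one pair of adjacent thresholds.

    selected_by_c maps each threshold C_i (in bp) to the set S_i of loci
    with |z| >= z_{1-alpha/2} at that threshold.
    """
    thresholds = sorted(selected_by_c)  # C_1 < C_2 < ... < C_L
    kept: Set[str] = set()
    for c_low, c_high in zip(thresholds, thresholds[1:]):
        # a locus survives if it is selected at both adjacent thresholds
        kept |= selected_by_c[c_low] & selected_by_c[c_high]
    return kept


# Hypothetical selections at three thresholds.
selected = {
    10_000: {"snp1", "snp2", "snp3"},
    30_000: {"snp2", "snp3", "snp4"},
    50_000: {"snp4", "snp5"},
}
print(sorted(pacs(selected)))
```

Here `snp1` and `snp5` are each selected at only one threshold and are dropped, while `snp2`, `snp3` and `snp4` are each shared by a pair of adjacent thresholds and are kept.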

The PACS can efficiently remove false positive loci. However, for a real data set of unrelated individuals with large genetic variation, the selected loci usually still contain some false positives, many of which could be removed through further ‘purification’. To achieve this, ideally we should repeat the above steps using an independent data set from the same population to obtain another candidate set, and then identify the loci common to both sets. This new candidate set contains fewer false positive loci, which could be further removed by repeating the above steps iteratively until the number of candidate loci converges. Although obtaining independent data sets in this way is generally not realistic, we can carry out the ‘purification’ using random subsets of the full data set, as described below.

Let *n*_{k}^{*}=[*f* × *n*_{k}]>30 be the size of a random subset from the full data set of size *n*_{k}, where *k*=1 for cases and *k*=0 for controls, and let *f* be a constant with 0<*f*<*f*_{max}, *f*_{max}=(min_{k}(*n*_{k}) − 1)/min_{k}(*n*_{k}). The randomly and independently chosen *n*_{1}^{*} cases and *n*_{0}^{*} controls form a random case–control sub‐data set for further removing the false positive loci from the candidate set using the same set of *C* values as applied to the full data set.

Let *S* be the set containing the selected loci from the full data set and *S*^{*} be that from the first random sub‐data set. Let *S*_{1}^{*}=*S*^{*}⋂*S* contain the loci common to both sets and *N*_{1}=∣*S*_{1}^{*}∣ be the number of loci in *S*_{1}^{*}. Next we generate a new *S*^{*} from the second random sub‐data set and let *S*_{2}^{*}=*S*_{1}^{*}⋂*S*^{*} with *N*_{2}=∣*S*_{2}^{*}∣. We repeat these steps to update the candidate loci set until *N*_{t}, *t*=1, 2,…, converges to a constant integer *N*_{c}, with *N*_{c}=0 if the null hypothesis of no difference between *p*_{1} and *p*_{0} is true and *N*_{c}>0 if the alternative hypothesis *p*_{1}≠*p*_{0} is true. For a given *f*, there are

$$\binom{n_{1}}{n_{1}^{*}}\binom{n_{0}}{n_{0}^{*}}$$

possible ways of selecting a case–control subset, which should be much larger than the number required for reaching convergence at an appropriate level of significance. The above GAHA algorithm is summarized in Box 1.
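As a worked example of the subset sizes and of the number of possible subsets, the Python sketch below uses the sample sizes of the AD data set with *f*=0.9. It assumes that [·] denotes rounding to the nearest integer, which reproduces the subset sizes of 773 cases and 497 controls reported for the AD application, and that cases and controls are drawn independently, so that the number of possible subsets is a product of two binomial coefficients. The function name `subset_size` is hypothetical.

```python
from math import comb


def subset_size(f: float, n: int) -> int:
    """Size of a random subset, [f * n], taken here as rounding to the nearest integer."""
    return round(f * n)


n_cases, n_controls, f = 859, 552, 0.9

# Upper bound on f, so that at least one subject is left out of each subset.
f_max = (min(n_cases, n_controls) - 1) / min(n_cases, n_controls)
assert 0 < f < f_max

n_cases_sub = subset_size(f, n_cases)
n_controls_sub = subset_size(f, n_controls)

# Number of distinct case-control subsets when cases and controls are drawn independently.
n_ways = comb(n_cases, n_cases_sub) * comb(n_controls, n_controls_sub)

print(n_cases_sub, n_controls_sub)          # 773 497
print(f"about 10^{len(str(n_ways)) - 1} possible subsets")
```

The count is astronomically larger than any feasible number of iterations, which is the point made in the text.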

### Box 1 Outline of the GAHA algorithm

(1) For case–control SNP data with *n*_{1} cases and *n*_{0} controls, choose a level of significance α, set AH thresholds *C*={*C*_{i}, *i*=1, 2,…, *L*} with *C*_{1}<*C*_{2}<…<*C*_{L}, and then find AHs with size ⩾*C*_{i}, *i*=1, 2,…, *L*, for each subject

(2) Compute *z* at each locus and select it if ∣*z*∣⩾*z*_{1‐α/2}. Perform the PACS and let *S*_{old} be the set of selected loci and *N*_{old}=∣*S*_{old}∣. Choose 0 < *f* < (min_{k}{*n*_{k}}−1)/min_{k}{*n*_{k}}, and set ℓ=0

(3) Randomly select a case–control sub‐dataset from (1) with *n*_{1}^{*} = [*f* × *n*_{1}] >30 cases and *n*_{0}^{*} = [*f* × *n*_{0}] >30 controls. Find AHs for each subject at the given *C*, then compute *z* at each locus and select it if ∣*z*∣⩾ *z*_{1‐α/2}

(4) Carry out the PACS and let *S*^{*} be the set containing all the loci selected from the sub‐dataset. Find *S*_{new}=*S*_{old}⋂*S*^{*} and *N*_{new}=∣*S*_{new}∣

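(5) Set *S*_{old}=*S*_{new} and *N*_{old}=*N*_{new}, and repeat steps (3) and (4) until the number of selected loci converges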

The false positive rate of a locus in the final set should be ⩽α. The false negative rates of loci selection in a random subset were estimated under the same settings for the full data set (Supplementary Table 2).
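The iterative ‘purification’ outlined in Box 1 can be sketched in Python as follows. This is a minimal illustration under several assumptions: the per‐data‐set selection (AH detection, *z*‐test and PACS) is abstracted into a caller‐supplied function `select_loci`; convergence is declared when the number of loci has not changed for `patience` consecutive subsets, which is an assumed stopping rule; and [·] is taken as rounding to the nearest integer. The names `purify` and `toy_select`, the parameter defaults and the toy data are hypothetical.

```python
import random
from typing import Callable, List, Set


def purify(cases: List, controls: List,
           select_loci: Callable[[List, List], Set[str]],
           f: float = 0.9, patience: int = 20,
           max_iter: int = 1000, seed: int = 0) -> Set[str]:
    """Intersect the candidate set with selections from random sub-data sets until stable."""
    rng = random.Random(seed)
    s_old = select_loci(cases, controls)  # candidate set from the full data
    unchanged = 0
    for _ in range(max_iter):
        sub_cases = rng.sample(cases, round(f * len(cases)))
        sub_controls = rng.sample(controls, round(f * len(controls)))
        s_new = s_old & select_loci(sub_cases, sub_controls)
        # count consecutive iterations without a change in the number of loci
        unchanged = unchanged + 1 if len(s_new) == len(s_old) else 0
        s_old = s_new
        if unchanged >= patience or not s_old:
            break
    return s_old


def toy_select(cases: List[Set[str]], controls: List[Set[str]],
               min_diff: float = 0.5) -> Set[str]:
    """Toy stand-in for the z-test plus PACS: keep loci whose AH proportions differ enough."""
    loci = set().union(*cases, *controls)

    def prop(group, locus):
        return sum(locus in subject for subject in group) / len(group)

    return {l for l in loci if abs(prop(cases, l) - prop(controls, l)) >= min_diff}


# Each subject is the set of loci found on AHs. "true" differs strongly between groups;
# "noise" passes the toy threshold only marginally in the full data.
cases = [{"true", "noise"}] * 30 + [{"true"}] * 10
controls = [{"noise"}] * 10 + [set()] * 30

print(sorted(toy_select(cases, controls)))
print(sorted(purify(cases, controls, toy_select)))
```

In the toy data the locus labelled `true` is selected in every subset and is therefore always retained, whereas the marginal locus `noise` fails the threshold in a sizeable fraction of random subsets and is likely to be dropped after a few iterations.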

### Application to AD data set

Set *C*={1, 10 kb, 30 kb, 50 kb, 100 kb, 140 kb, 250 kb, 500 kb, 1 Mb} and α=0.001. We identified 607 loci from 4054 loci whose ∣*z*∣⩾*z*_{1−α/2} (Figure 2A) from the 22 autosomes in the AD data set (Coon *et al*, 2007).

The most significant AH region was on 19q13.2 (see Figure 2B), with positive *z* values suggesting significantly more AHs in controls than in cases. This region, covering the whole *apolipoprotein E* (*APOE*) gene, contains four loci including rs4420638 (Figure 2C), which is in linkage disequilibrium with APOE (Coon *et al*, 2007). However, there were no genotypes within APOE in the AD data. We therefore added available genotyping information (Coon *et al*, 2007) for two loci on APOE, rs429358 and rs7412, to the AD data. The two APOE loci define the ε2/ε3/ε4 genotypes. Figure 2D shows that the APOE loci are indeed on the AH region, where the majority of controls have the ε3 genotype, supporting the observation that APOE ε3 is protective against the disease when compared with ε4 (Farrer *et al*, 1997).

To further reduce the false positive rate within this list, we chose *f*=0.9 for generating random subsets, each with 773 cases and 497 controls. The choice *f*=0.9 may not be statistically optimal; it was, however, the best among the values we tried. The convergence of the number of loci is shown in Figure 3. There were 26 loci in the final list (Figure 3B; Table I). Based on a logistic regression model fit, the percent variation of the genetic risk explained by these 26 loci was 75.3%. Model selection removed 10 confounder loci and retained 16 loci (each with *P*‐value<0.05), including rs4420638, in the reduced model, which explained 74.8% of the genetic risk variation (Supplementary Tables 3 and 4).

The APOE ε4 allele was carried by ∼40% of the late‐onset AD cases (Poirier *et al*, 1993; Laws *et al*, 2003). Recalling that rs4420638 is in linkage disequilibrium with APOE, we found that the percent genetic risk variation explained by this locus alone was 34.2%. However, when rs4420638 was excluded from the reduced model, the percentage of genetic risk variation explained by the remaining 15 loci decreased by only 2.9 percentage points (from 74.8% to 71.9%). This suggests that these loci explain the genetic risk variation of AD as a group. Several of the 26 loci identified in this screening were also found in homozygous regions identified in an early onset AD study of a consanguineous family (Clarimón *et al*, 2008), suggesting that one of these regions harbors a recessive genetic lesion causing AD.

The 26 loci are on 20 genes, of which 13 are in known functional pathways or networks as revealed by an Ingenuity Pathway Analysis (Ingenuity Systems, www.ingenuity.com) (Supplementary Pathway/Network analysis). On the basis of the correlations among the 20 genes and the AD status of subjects, we constructed an AD genetic network (Supplementary Figure 2).

## Summary

We propose a statistical method for GAHA of SNP case–control data in unrelated subjects to identify risk loci that are most likely associated with a disease or abnormality due to recessive mutation or deletion. The main novelty of this method over other approaches is that it minimizes the false positive rate of the risk candidates. We remove the false positive loci by selecting the loci common to different size thresholds of homozygous segments and repeating these steps iteratively using random sub‐data sets until the number of selected loci converges. Furthermore, this method allows the selection of risk loci from a wider AH size range. By demonstrating the method using a publicly available AD SNP assay data set, we identified 26 candidate risk loci from the 22 autosomes.

## Materials and methods

### Notes

Suppose there are *n* SNP loci genotyped on a given chromosome (an autosome). We view the sequence of SNP loci on a chromosome as a series of linked regions that are either heterozygous or AHs. Let *H* be a set such that *H*={*h*_{1}*, h*_{2}*,…, h*_{m}}, where *h*_{i} denotes the number of AHs containing *i* consecutive genotyped SNP loci, and *m* is the maximum number of consecutive SNP loci. The probability of a randomly selected SNP locus being on AHs with SNP number equal to or larger than a predetermined integer *k* is

$$P_{k}=\frac{1}{n}\sum_{i=k}^{m} i\,h_{i}.$$
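A short Python sketch of this probability follows, under the assumption that it is the fraction of the *n* genotyped loci that lie on AHs containing at least *k* loci, that is, the sum of *i* × *h*_{i} over *i* ⩾ *k*, divided by *n*. The function name `prob_on_ah` and the toy counts are hypothetical.

```python
from typing import Dict


def prob_on_ah(h: Dict[int, int], n: int, k: int) -> float:
    """Fraction of n loci lying on AHs with at least k consecutive loci.

    h maps a run length i (number of consecutive SNP loci) to h_i,
    the number of AHs of that length on the chromosome.
    """
    loci_on_long_ahs = sum(i * count for i, count in h.items() if i >= k)
    return loci_on_long_ahs / n


# Toy chromosome with 20 loci: three 1-locus AHs, two 2-locus AHs and one 5-locus AH.
h = {1: 3, 2: 2, 5: 1}
print(prob_on_ah(h, 20, 1))  # all AH loci: (3 + 4 + 5) / 20
print(prob_on_ah(h, 20, 2))  # AHs with >= 2 loci: (4 + 5) / 20
```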

### Data

A SNP genotype data set of late‐onset AD (500K Affymetrix) was downloaded from a publicly available website, http://www.neuron.org, to demonstrate our method. This data set consists of 502 627 SNP loci genotyped in 859 unrelated cases and 552 neurologically normal controls.

### Proportion test

We are interested in identifying loci at which the proportion of a SNP locus, on AHs with size equal to or larger than a given threshold *C*, is significantly different between controls and cases. Our null hypothesis is that the SNP at a given locus has the same probability of being on AHs with size ⩾*C* in the control and case groups. The test statistic in a standard proportion test is

$$z=\frac{p_{0}-p_{1}}{\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_{0}}+\frac{1}{n_{1}}\right)}},\qquad \hat{p}=\frac{n_{0}p_{0}+n_{1}p_{1}}{n_{0}+n_{1}},$$

and follows a Gaussian distribution under the null hypothesis, where *p*_{0} is the proportion of the locus on AHs for the *n*_{0} control subjects, *p*_{1} is that for the *n*_{1} cases, and $\hat{p}$ is the pooled proportion. We define *z*=0 when both *p*_{0}=0 and *p*_{1}=0. For a given level of significance α, a locus is selected if ∣*z*∣⩾*z*_{1−α/2}. This test requires a large sample size (*n*_{0}, *n*_{1}>30).
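A minimal Python sketch of the per‐locus test follows. It assumes the pooled‐variance form of the two‐proportion *z*‐test and a sign convention in which positive *z* indicates a higher AH proportion in controls, in line with the interpretation given for the 19q13.2 region. The functions `z_statistic` and `is_selected` and the counts in the example are hypothetical; returning 0 whenever the pooled variance vanishes extends the stated *z*=0 convention to the case where both proportions equal 1.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(x0: int, n0: int, x1: int, n1: int) -> float:
    """Two-proportion z-statistic with pooled variance (assumed form).

    x0, x1: numbers of controls and cases with the locus on an AH of size >= C.
    n0, n1: numbers of controls and cases.
    """
    p0, p1 = x0 / n0, x1 / n1
    pooled = (x0 + x1) / (n0 + n1)
    variance = pooled * (1 - pooled) * (1 / n0 + 1 / n1)
    if variance == 0:  # p0 = p1 = 0 (or both equal to 1): define z = 0
        return 0.0
    return (p0 - p1) / sqrt(variance)


def is_selected(z: float, alpha: float = 0.001) -> bool:
    """Select the locus if |z| >= z_{1 - alpha/2}."""
    return abs(z) >= NormalDist().inv_cdf(1 - alpha / 2)


# Hypothetical counts with the sample sizes of the AD data set.
z = z_statistic(x0=300, n0=552, x1=350, n1=859)
print(round(z, 2), is_selected(z))
```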

### Logistic regression

In the logistic regression using the selected loci as predictor variables, let *x*_{ij}=1 if the *i*th locus of the *j*th subject is on an AH with size equal to or larger than *C*=10 kb and *x*_{ij}=0 otherwise. Logistic regression was carried out using SAS 9.0.

### Declaration

The views expressed in this article do not represent those of the US Food and Drug Administration.

## Acknowledgements

This study was supported by the Intramural Program of the National Institute on Aging, National Institutes of Health and Department of Health and Human Services, project number AG000950‐07. This study used high‐performance computational capabilities of the Biowulf Systems at the National Institutes of Health, Bethesda, MD (http://helix.nih.gov).

## Conflict of Interest

The authors declare that they have no conflict of interest.

## Supplementary Information


Supplementary figures S1–2, Supplementary tables S1–4, Pathway/Network Analysis [msb200953-sup-0001.doc]

## References

This is an open‐access article distributed under the terms of the Creative Commons Attribution License, which permits distribution, and reproduction in any medium, provided the original author and source are credited. This license does not permit commercial exploitation without specific permission.

- Copyright © 2009 EMBO and Nature Publishing Group