Rebecca Rose, Susanna L. Lamers, James J. Dollar et al.
AIDS research and human retroviruses, vol. 33, 211--218, 2017
10.1089/AID.2016.0205
Abstract
We compared the behavior of two approaches (Cluster Picker and HIV-TRACE) at varying genetic distances to identify transmission clusters. We used three HIV gp41 sequence datasets originating from the Rakai Community Cohort Study: (1) next-generation sequence (NGS) data from nine linked couples; (2) NGS data from longitudinal sampling of 14 individuals; and (3) Sanger consensus sequences from a cross-sectional dataset (n = 1,022) containing 91 epidemiologically linked heterosexual couples. We calculated the optimal genetic distance threshold to separate linked versus unlinked NGS datasets using a receiver operating curve analysis. We evaluated the number, size, and composition of clusters detected by Cluster Picker and HIV-TRACE at six genetic distance thresholds (1\%-5.3\%) on all three datasets. We further tested the effect of using all NGS, versus only a single variant for each patient/time point, for datasets (1) and (2). The optimal gp41 genetic distance threshold to distinguish linked and unlinked couples and individuals was 5.3\% and 4\%, respectively. HIV-TRACE tended to detect larger and fewer clusters, whereas Cluster Picker detected more clusters containing only two sequences. For NGS datasets (1) and (2), HIV-TRACE and Cluster Picker detected all linked pairs at 3\% and 4\% genetic distances, respectively. However, at 5.3\% genetic distance, 20\% of couples in dataset (3) did not cluster using either program, and for {\textgreater}1/3 of couples cluster assignment were discordant. We suggest caution in choosing thresholds for clustering analyses in a generalized epidemic.