Integration of new DNA into cellular genomes mediates replication of retroviruses and transposons; integration reactions have also been adapted for use in human gene therapy. Tracking the distributions of integration sites is important for characterizing populations of transduced cells and for monitoring potential outgrowth of pathogenic cell clones. Here, we describe a pipeline for quantitative analysis of integration site distributions named INSPIIRED (integration site pipeline for paired-end reads). We describe optimized biochemical steps for site isolation using Illumina paired-end sequencing, including new technology for suppressing recovery of unwanted contaminants, followed by software for alignment, quality control, and management of integration site sequences. During library preparation, DNAs are broken by sonication, so that after ligation-mediated PCR the number of distinct ligation junction sites recovered for an integration site can be used to infer the abundance of gene-modified cells. We generated integration sites of known positions in silico, and we describe optimization of sample processing parameters refined by comparison to this ground truth. We also present a novel graph-theory-based method for quantifying integration sites in repeated sequences, and we characterize its consequences using synthetic and experimental data. In an accompanying paper, we describe an additional set of statistical tools for data analysis and visualization. Software is available at https://github.com/BushmanLab/INSPIIRED.
Keywords: SCID-X1; gammaretrovirus; gene therapy; insertional mutagenesis; lentivirus; mutagenesis; recombination; retrovirus; vector; vector driving.
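
The abundance inference described above can be illustrated with a minimal sketch. It assumes that each aligned read pair has already been reduced to an integration site position and a sonication breakpoint (ligation junction) position. Because PCR duplicates share a breakpoint, the count of distinct breakpoints per site, rather than the raw read count, serves as the abundance estimate. The function name `estimate_abundance` and the tuple layout are hypothetical and are not taken from the INSPIIRED code base.

```python
from collections import defaultdict

def estimate_abundance(reads):
    """Count distinct sonication breakpoints per integration site.

    reads: iterable of (chrom, strand, site_pos, breakpoint_pos) tuples.
    This record layout is an assumption made for illustration only.
    """
    breakpoints = defaultdict(set)
    for chrom, strand, site_pos, breakpoint_pos in reads:
        # PCR duplicates share a breakpoint and collapse into one set entry.
        breakpoints[(chrom, strand, site_pos)].add(breakpoint_pos)
    return {site: len(positions) for site, positions in breakpoints.items()}

# Toy data: the chr7 site has more reads, but only one distinct breakpoint.
reads = [
    ("chr1", "+", 1000, 1150),
    ("chr1", "+", 1000, 1150),  # PCR duplicate of the read above
    ("chr1", "+", 1000, 1210),
    ("chr1", "+", 1000, 1305),
    ("chr7", "-", 5000, 4820),
    ("chr7", "-", 5000, 4820),
    ("chr7", "-", 5000, 4820),
]
abundance = estimate_abundance(reads)
```

In this toy data the chr1 site is supported by three distinct breakpoints and the chr7 site by one, even though both have a similar number of reads. A production pipeline would likely also need to tolerate small alignment offsets in site and breakpoint positions, which this sketch ignores.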