KMerNIPT - a kmer based NIPT analysis pipeline 1. Quick usage for the impatient Basic usage (if lists and k-mer table are already built): ZandMah.py -i KMER_TABLE -o OUTPUT_TABLE 2. Detailed usage The pipeline has three basic steps: A. Create reference kmer lists (to be run once before analysis) B. Create control dataset C. Analyze samples KMerNIPT uses GenomeTester4 programs to generate and modify k-mer lists. Kaplinski L, Lepamets M, Remm M. (2015). GenomeTester4: a toolkit for performing basic set operations – union, intersection and complement on k-mer lists. GigaScience; 4:58 It has to be downloaded and compiled before executing pipeline commands. By default all pipeline scripts expect to find the GenomeTester4 binaries (glistmaker, glistcompare and glistquery) in the current directory, but the location can be edited in script headers. A. Creating reference kmer lists A.1. Creating chromosome (or sub-chromosomal region) specific lists generate_reference_lists.pl WORDLENGTH FASTA_FILES... Creates a set of chromosome whitelists, i.e. GenomeTester4 lists of k-mers of specified lenght that occur once and only once in any given FastA file. Usually the FastA files are reference chromosomes. Internally it executes the following steps: 1.1. Creates composite list of all k-mers that occur more than once in the set of all FastA files, using glistmaker Then, for each FastA file: 1.2. Creates list of all kmers in given file with glistmaker 1.3. Subtracts the list created in 1.1. from chromosome-specific list file, using glistcompare diff function. Thus a list of unique kmers of given FastA sequence is obtained The resulting lists are named BASENAME_WORDLENGHT.list Where BASENAME is the name of FastA file, excluding directory path and extension (everything after '.') A.2. Creating blacklist of known polymorphisms generate_polymorphism_lists.pl WORDLENGTH MIN_MAF PATH_TO_CHROMOSOME_FILE VCF_FILES... Creates list named plymorphic_blacklist.list that contains allele-specific k-mers that may potentially occur at the polymorphic sites with MAF >= cutoff. Internally it executes the following steps: For all VCF files: 2.1. Creates FastA file of pseudohaplotypes (i.e. all potential haplotypes ignoring linkage) that overlap with known polymorphisms. To conserve space, SNP-s are marked with UIPAC codes. hapl_generator.pl WORDLENGTH MIN_MAF VCF_FILE CHROMOSOME_FILE 2.2. Generates FastA file of all kmers occuring in given pseudohaplotype file. For each IUPAC code all potential permutations are created kmer_creator.pl WORDLENGTH HAPLOTYPE_FILE Then: 2.3. Create list of all k-mers in all k-mer FastA files using glistmaker 2.4. Cut all k-mer frequencies to 1 using glistcompare It uses two sub-scripts: hapl_generator.pl and kmer_creator.pl The resulting list is named: polymorphic_blacklist_WORDLENGTH.list A.3. Creating backlist from BED files Known problematic regions of genome, such as low-complexity repeats and pseudoautosmal parts of sex chromosomes can be compiled into separate blacklist from corresponding BED files. 3.1. Create FastA file of problematic sequences bed-2-fasta.pl GENOME BED > FASTA GENOME is FastA file of the whole genome, BED lists regions to be included in blacklist FastA file (excluded from k-mer sets). 3.2. Create blacklist Blacklist is compiled with glistmaker glistmaker FASTA -w WORDLENGTH -o lc_blacklist The resulting blacklist is named lc_blacklist_WORDLENGTH.list A.4. Creating whitelists using sequencing data Individual BAM files or GenomeTester4 lists of full-genome sequencing data can be compiled into individual whitelists with script: poisson_cutoff.pl LIST|BAM COVERAGE CUTOFF [haploid] Creates a list of all k-mers in source list/bam file that have the probability greater or equal to cutoff value of coming from Poisson distribution with mean coverage. If 'median' is used in place of coverage, the coverage is automatically calculated. If the argument "haploid" is present, the haploid coverage value is used to calculate cumulative Poisson probabilities of each k-mer count. This should be used when creating whitelist for sex chromosomes from male samples. Internally it executes the following steps: 4.1. If the file extension is .bam, a kmer list is generated from BAM file using script bam2list.pl. 4.2. If coverage value is 'median', the coverage is estimated by finding the first peak in k-mer frequencies above 1. 4.3. Cumulative poisson cutoff values MIN and MAX are calculated at CUTOFF and 1-CUTOFF 4.4. Temporary list A of k-mers with frequency less than MIN is created from full list with glistcompare intersection function 4.5. Temporary list B with k-mer frequencies above MAX is created from list A with glistcompare intersection function 4.6. List B is subtracted from list A with glistcompare diff function The resulting list is named BASENAME_whitelist_WORDLENGTH.list Where BASENAME is the name of list/BAM file, excluding directory path and extension (everything after '.') 4.7. Multiple whitelists can be combined into single whitelist with glistcompare intersection method: glistcompare LISTS... -i -o whitelist The resulting whitelist is named whitelist_WORDLENGTH_intrsec.list NB! As the autosomes and sex chromosomes have different copy numbers, whitelists should be compiled differently for autosomes and sex chromosomes. From female samples single autosome and X whitelists can be prepared. From male samples autosome and sex chromosome whitelists should be compiled separately. NB! Male and female sex-chromosome whitelists should never be intersected (as described in 4.7) because as the female samples do not contain Y chromosome, all Y-specific k-mers will be missing in the result. A.5. Creating final k-mer lists Blacklists should be subtracted and whitlists intersected with chromosome-specific k-mer lists created in step 1. This can be done with GenomeTester4 program glistcompare: 5.1. Subtracting blacklist: glistcompare LIST BLACKLIST -d -o NAME The resulting list is named NAME_WORDLENGTK_0_diff1.list 5.2. Intersecting with whitelist: glistcompare LIST WHITELIST -i -o NAME The resulting list is named NAME_WORDLENGTH_intrsec.list B. Creating control dataset B.1. Creating kmer lists from sequenced control samples From each sequenced sample (fastq files) a GenomeTester4 k-mer list can be created with command: glistmaker FASTQ_FILES... -w WORDLENGTH -o NAME The resulting list in named NAME_WORDLENGTH.list B.2. Calculate k-mer counts for each sequenced sample make_table.pl CHR_SUBDIR LISTS... CHR_SUBDIR is directory, where chromosome-specific k-mer lists are stored. By default it looks it for all files named CHRID_cleaned_25.list (CHRID is one of 1, 2, 3...23, X, Y). Both the chromsome names and list suffix can be changed in script header. This script a table with chromosomes as columns and samples as rows showing how many k-mers from each chromosome-specific list was present in sequenced data. The first row after header are the total k-mer counts in chromosome-specific lists. Internally the script uses glistcompare to find the intersection size of two lists - sample list and chromosome list. The table has the following rows: CHR - chromsome names TOTAL - total number of k-mers in chromosome-specific list GC - average GC content of all chromsome-specific k-mers Then one row for each sequenced sample The table has following columns: CHR (first column) - sample name TOTAL - total number of k-mers in sample UNIQUE - number of unique k-mers in sample GC - average GC content of sample Then one column for each chromsome Each entry of table (aside headers, GC content etc.) is the number of k-mers specific to certain chromosome (column) in certain sample (row) If only some samples are to be added to already existing dataset, the script add_rows.pl can be used: add_rows.pl CHR_SUBDIR LISTS... Its parameters and output are identical to make_table.pl, but it does not print out header row. Thus it can be used to append rows to already existing table. C. Analyzing samples Both control and patient k-mer counts should be listed in single table (preapred in step B). Then the program ZandMah.py is used to calculate z-scores: ZandMah.py -i KMER_TABLE -r CONTROLS -m CHRS -n CHRS -o OUTPUTFILE KMER_TABLE is the full table of sample k-mer counts prepared in step B. CONTROLS is text file listing all control samples (i.e. euploid samples) for model calculation. If no controls file is specified, all samples are used for model creation, resulting in slightly less predicitve power (but it can be useful if caryotypes are not known for any sample. CHRS is comma-separated list of chromosome names that will be used to normalize k-mer counts. Normally single large chromsome (i.e. 1 is sufficient). As these chromosome will be excluded from both regression model and Mahalanobis distance calculation, no more than necessary should be listed. If omitted, no normalisation is performed. Whether or not normalize counts depends on the variance in coverage. If sample coverages vary a lot, normalisation helps to increase model power. The output is table, where rows are samples and columns chromosomes. Each entry corresponds to z-score of given chromosome k-mer count. I.e. values above certain treshold mark possible aneuploidicity. In addition the Mahalanobis distance of given sample and the result of four sex chromosome models (X/female, X/male, Y/female, Y/male) and predicted sex is printed as last columns. To build sex chromosome models, ZandMah first performs "naive" sex prediction by the k-mers of Y-chromosome. Relative coverages 0.01 and 0.02 are used as cutoffs to call prospective females (below 0.01) and males (above 0.02). These prospective sets are then used to build sex-specific models of X and Y chromosome coverage. By applying the female Y model to the whole dataset, the refined sex prediction is calculated.