KMerNIPT - a kmer based NIPT analysis pipeline

1. Quick usage for the impatient
Basic usage (if lists and k-mer table are already built):
  ZandMah.py -i KMER_TABLE -o OUTPUT_TABLE


2. Detailed usage

The pipeline has three basic steps:
A. Create reference kmer lists (to be run once before analysis)
B. Create control dataset
C. Analyze samples

KMerNIPT uses GenomeTester4 programs to generate and modify k-mer lists.
  Kaplinski L, Lepamets M, Remm M. (2015). GenomeTester4: a toolkit for
  performing basic set operations – union, intersection and complement on
  k-mer lists. GigaScience; 4:58
It has to be downloaded and compiled before executing pipeline commands. By
default all pipeline scripts expect to find the GenomeTester4 binaries
(glistmaker, glistcompare and glistquery) in the current directory, but the
location can be edited in script headers.


A. Creating reference kmer lists

A.1. Creating chromosome (or sub-chromosomal region) specific lists

generate_reference_lists.pl WORDLENGTH FASTA_FILES...

Creates a set of chromosome whitelists, i.e. GenomeTester4 lists of k-mers of specified
lenght that occur once and only once in any given FastA file. Usually the
FastA files are reference chromosomes.

Internally it executes the following steps:
  1.1. Creates composite list of all k-mers that occur more than once in the
       set of all FastA files, using glistmaker
Then, for each FastA file:
  1.2. Creates list of all kmers in given file with glistmaker
  1.3. Subtracts the list created in 1.1. from chromosome-specific list
       file, using glistcompare diff function. Thus a list of unique kmers
       of given FastA sequence is obtained

The resulting lists are named
BASENAME_WORDLENGHT.list
Where BASENAME is the name of FastA file, excluding directory path and
extension (everything after '.')


A.2. Creating blacklist of known polymorphisms

generate_polymorphism_lists.pl WORDLENGTH MIN_MAF PATH_TO_CHROMOSOME_FILE VCF_FILES...

Creates list named plymorphic_blacklist.list that contains allele-specific k-mers that
may potentially occur at the polymorphic sites with MAF >= cutoff.

Internally it executes the following steps:
  For all VCF files:
  2.1. Creates FastA file of pseudohaplotypes (i.e. all potential haplotypes
       ignoring linkage) that overlap with known polymorphisms. To conserve
       space, SNP-s are marked with UIPAC codes.
       hapl_generator.pl WORDLENGTH MIN_MAF VCF_FILE CHROMOSOME_FILE
  2.2. Generates FastA file of all kmers occuring in given pseudohaplotype
       file. For each IUPAC code all potential permutations are created
	kmer_creator.pl WORDLENGTH HAPLOTYPE_FILE
  Then:
  2.3. Create list of all k-mers in all k-mer FastA files using glistmaker
  2.4. Cut all k-mer frequencies to 1 using glistcompare

  It uses two sub-scripts: hapl_generator.pl and kmer_creator.pl

The resulting list is named:
polymorphic_blacklist_WORDLENGTH.list


A.3. Creating backlist from BED files

Known problematic regions of genome, such as low-complexity repeats and
pseudoautosmal parts of sex chromosomes can be compiled into separate
blacklist from corresponding BED files.

3.1. Create FastA file of problematic sequences

bed-2-fasta.pl GENOME BED > FASTA

GENOME is FastA file of the whole genome, BED lists regions to be included
in blacklist FastA file (excluded from k-mer sets).

3.2. Create blacklist

Blacklist is compiled with glistmaker

glistmaker FASTA -w WORDLENGTH -o lc_blacklist

The resulting blacklist is named lc_blacklist_WORDLENGTH.list


A.4. Creating whitelists using sequencing data

Individual BAM files or GenomeTester4 lists of full-genome sequencing data
can be compiled into individual whitelists with script:

poisson_cutoff.pl LIST|BAM COVERAGE CUTOFF [haploid]

Creates a list of all k-mers in source list/bam file that have the probability
greater or equal to cutoff value of coming from Poisson distribution with
mean coverage.
If 'median' is used in place of coverage, the coverage is automatically
calculated.
If the argument "haploid" is present, the haploid coverage value is used
to calculate cumulative Poisson probabilities of each k-mer count. This
should be used when creating whitelist for sex chromosomes from male
samples.

Internally it executes the following steps:
  4.1. If the file extension is .bam, a kmer list is generated from BAM file
       using script bam2list.pl.
  4.2. If coverage value is 'median', the coverage is estimated by finding
       the first peak in k-mer frequencies above 1.
  4.3. Cumulative poisson cutoff values MIN and MAX are calculated at CUTOFF and
       1-CUTOFF
  4.4. Temporary list A of k-mers with frequency less than MIN is created from
       full list with glistcompare intersection function
  4.5. Temporary list B with k-mer frequencies above MAX is created from list
       A with glistcompare intersection function
  4.6. List B is subtracted from list A with glistcompare diff function

The resulting list is named
BASENAME_whitelist_WORDLENGTH.list
Where BASENAME is the name of list/BAM file, excluding directory path and
extension (everything after '.')

4.7. Multiple whitelists can be combined into single whitelist with glistcompare
intersection method:

glistcompare LISTS... -i -o whitelist

The resulting whitelist is named whitelist_WORDLENGTH_intrsec.list

NB! As the autosomes and sex chromosomes have different copy numbers,
whitelists should be compiled differently for autosomes and sex chromosomes.
From female samples single autosome and X whitelists can be prepared.
From male samples autosome and sex chromosome whitelists should be compiled
separately.
NB! Male and female sex-chromosome whitelists should never be intersected
(as described in 4.7) because as the female samples do not contain Y
chromosome, all Y-specific k-mers will be missing in the result.


A.5. Creating final k-mer lists

Blacklists should be subtracted and whitlists intersected with
chromosome-specific k-mer lists created in step 1. This can be done with
GenomeTester4 program glistcompare:

5.1. Subtracting blacklist:

glistcompare LIST BLACKLIST -d -o NAME

The resulting list is named NAME_WORDLENGTK_0_diff1.list

5.2. Intersecting with whitelist:

glistcompare LIST WHITELIST -i -o NAME

The resulting list is named NAME_WORDLENGTH_intrsec.list


B. Creating control dataset

B.1. Creating kmer lists from sequenced control samples

From each sequenced sample (fastq files) a GenomeTester4 k-mer list
can be created with command:

glistmaker FASTQ_FILES... -w WORDLENGTH -o NAME

The resulting list in named NAME_WORDLENGTH.list


B.2. Calculate k-mer counts for each sequenced sample

make_table.pl CHR_SUBDIR LISTS...

CHR_SUBDIR is directory, where chromosome-specific k-mer lists are stored.
By default it looks it for all files named CHRID_cleaned_25.list (CHRID is
one of 1, 2, 3...23, X, Y). Both the chromsome names and list suffix can
be changed in script header.

This script a table with chromosomes as columns and samples as rows showing how
many k-mers from each chromosome-specific list was present in sequenced
data. The first row after header are the total k-mer counts in
chromosome-specific lists.

Internally the script uses glistcompare to find the intersection size of two
lists - sample list and chromosome list.

The table has the following rows:
CHR - chromsome names
TOTAL - total number of k-mers in chromosome-specific list
GC - average GC content of all chromsome-specific k-mers
Then one row for each sequenced sample

The table has following columns:
CHR (first column) - sample name
TOTAL - total number of k-mers in sample
UNIQUE - number of unique k-mers in sample
GC - average GC content of sample
Then one column for each chromsome

Each entry of table (aside headers, GC content etc.) is the number of
k-mers specific to certain chromosome (column) in certain sample (row)

If only some samples are to be added to already existing dataset, the script
add_rows.pl can be used:

add_rows.pl CHR_SUBDIR LISTS...

Its parameters and output are identical to make_table.pl, but it does not
print out header row. Thus it can be used to append rows to already existing
table.


C. Analyzing samples

Both control and patient k-mer counts should be listed in single table
(preapred in step B).
Then the program ZandMah.py is used to calculate z-scores:

ZandMah.py -i KMER_TABLE -r CONTROLS -m CHRS -n CHRS -o OUTPUTFILE

KMER_TABLE is the full table of sample k-mer counts prepared in step B.

CONTROLS is text file listing all control samples (i.e. euploid samples) for
model calculation. If no controls file is specified, all samples are used
for model creation, resulting in slightly less predicitve power (but it can
be useful if caryotypes are not known for any sample.

CHRS is comma-separated list of chromosome names that will be used to
normalize k-mer counts. Normally single large chromsome (i.e. 1 is
sufficient). As these chromosome will be excluded from both regression model
and Mahalanobis distance calculation, no more than necessary should be
listed. If omitted, no normalisation is performed. Whether or not normalize
counts depends on the variance in coverage. If sample coverages vary a lot,
normalisation helps to increase model power.

The output is table, where rows are samples and columns chromosomes. Each
entry corresponds to z-score of given chromosome k-mer count. I.e. values
above certain treshold mark possible aneuploidicity.
In addition the Mahalanobis distance of given sample and the result of
four sex chromosome models (X/female, X/male, Y/female, Y/male) and predicted sex
is printed as last columns.

To build sex chromosome models, ZandMah first performs "naive" sex
prediction by the k-mers of Y-chromosome. Relative coverages
0.01 and 0.02 are used as cutoffs to call prospective females (below 0.01) and
males (above 0.02). These prospective sets are then used to build
sex-specific models of X and Y chromosome coverage. By applying the female
Y model to the whole dataset, the refined sex prediction is calculated.