/*********************************************************************
#   GenomeMasker software package                                    #
#   Copyright (C) 2006  University of Tartu                          #
#   All rights reserved.                                             #
#                                                                    #
#   The software and databases should not be redistributed or used   #
#   for any commercial purpose, without written permission from      #
#   Department of Bioinformatics, University of Tartu, ESTONIA       #
#   Mailing address: Riia str.23, Tartu 51010, ESTONIA               #
#   e-mail: maido.remm@ut.ee                                         #
/*********************************************************************
#   REFERENCE:                                                       #
#   Andreson R., Reppo E., Kaplinski L., Remm M.                     #
#   GENOMEMASKER package for designing unique genomic PCR primers.   #
#   BMC Bioinformatics 2006, 7:172.                                  #
/*********************************************************************


This software package contains 5 programs that support genomic-scale 
PCR primer design.

- gmasker
- gtester
- glistmaker
- gindexer
- gt2multiplex

For complete functionality you will need 2 additional PERL scripts, 
a modified version of PRIMER3 and a repeat library: 

- fasta_to_mp3.pl
- mp3_to_table.pl
- mprimer3_core
- repeat.lib
======================================================================

GENOME MASKER

GenomeMasker (gmasker) masks over-represented words in the fasta file, 
preventing design of primers in repeated regions. Input is taken from 
STDIN, output goes to STDOUT.

The masking process is similar to the well-known RepeatMasker program. 
There are several differences between RepeatMasker and GenomeMasker:
1) GenomeMasker does not mask the whole length of an over-represented 
word; it masks only one nucleotide at the 3' end of the word by 
replacing the uppercase nucleotide with a lowercase nucleotide. 
Modified Primer3 is necessary to take full advantage of this type of 
masking. See copyright notice for MPrimer3 in file MPRIMER3_README.
Using other masking symbols (N, X) is also possible, but this will 
give a lower number of candidate primers (primers containing N are 
rejected by default).
2) Because masking by the GenomeMasker is asymmetric, upper and lower
(sense and antisense, forward and reverse) strands can be masked 
separately by changing command line parameters. See the end of this file
for complete listing of 'gmasker' options. Furthermore, we allow 
a switch between sense/antisense masking in the middle of input DNA 
sequence. This will be useful if you want to mask only this part of 
the template DNA where primers are designed, leaving amplified region 
unmasked.
3) GenomeMasker is extremely fast because it does not use sequence 
alignment algorithms. Instead, it uses a blacklist of overrepresented 
words created with the program 'glistmaker'. The word length can 
vary between 8 and 16. We recommend size 16, which is on by default. 
NB!!! Creating a blacklist with 'glistmaker' requires a bit more than 
1 GB of RAM for the human genome.

Blacklists for different genomes can be created locally using the 
program 'glistmaker'.
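
The 3'-end masking described above can be illustrated with a toy 
sketch. This is not how 'gmasker' is implemented. The function name 
toy_mask_forward is hypothetical, the blacklist is assumed here to be 
a plain text file with one word per line (the real file format written 
by 'glistmaker' is not described in this README), each sequence is 
expected on a single line, and only the upper strand is handled.

```shell
# toy_mask_forward WORDLEN BLACKLIST   (sequences are read from STDIN)
# Illustration only: lowercases the last (3') nucleotide of every
# WORDLEN-mer that is listed in BLACKLIST (assumed: text, one word per
# line, at least one word). FASTA header lines are passed through.
toy_mask_forward() {
    awk -v k="$1" '
        NR == FNR { black[toupper($1)] = 1; next }   # first file: blacklist
        /^>/      { print; next }                    # keep FASTA headers
        {
            seq = toupper($0)
            out = seq
            for (i = 1; i + k - 1 <= length(seq); i++) {
                if (substr(seq, i, k) in black) {
                    p = i + k - 1                    # 3-prime end of the word
                    out = substr(out, 1, p - 1) tolower(substr(out, p, 1)) substr(out, p + 1)
                }
            }
            print out
        }' "$2" -
}
```

With word length 4 and a blacklist that contains only ACGT, the input 
TTACGTTT is printed as TTACGtTT: the word itself stays readable and 
only its 3' end is marked.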

======================================================================
GENOME TESTER

GenomeTester (gtester) is the program that tests 
1) whether PCR primers have an excessive number of binding sites on 
   the template sequence and 
2) how many PCR products would be amplified from the template DNA and
   where they are located. 

Having too many binding sites will typically result in failed PCR.
Amplifying more than one product is undesirable because alternative 
PCR products could cause false positive signals in genotyping.

The principle of GenomeTester is similar to SSAHA (Ning et al. 2001. 
Genome Res. 11:1725). However, the GenomeTester is specifically 
designed for fixed word length (default 16) and therefore it is 
an order of magnitude faster.
With large datasets (more than 10000 primer pairs), 1 GB or more 
RAM would reduce the computation time significantly.
The input for the 'gtester' is a list of primer pairs and the location 
of index files. See the end of this file for complete listing of 
'gtester' options.

The GenomeTester relies on indexes that are created with the program 
'gindexer'. The input for this program is genome sequence in FASTA 
format. Indexes for the human genome require ca. 20 GB of disk space.

The GenomeTester can be used without the GenomeMasker. However, the 
use of GenomeMasker significantly improves the quality of designed PCR
primers, and therefore more primers will pass the GenomeTester. The 
primers that do not pass the GenomeTester should be redesigned. Using
the repeat library option of PRIMER3 would help to avoid many low
quality primers, but about 15% of primers will still have too many 
binding sites in the human genome (Andreson et al., unpublished).


======================================================================
ADDITIONAL PROGRAMS

To design primers with GenomeMasker one needs a few helper programs.
Two PERL scripts are available which help to automate the primer 
design with the MPrimer3:

'fasta_to_mp3.pl' converts FASTA format sequences to MPrimer3 input 
format. Remember to change primer design parameters in that script, 
particularly the location of the repeat library 
(PRIMER_MISPRIMING_LIBRARY=) and the target region (TARGET=). 

'mp3_to_table.pl' converts MPrimer3 output to a tab-delimited table 
format with the following columns: 
name, sense_primer, antisense_primer, product.
This table format can be used as input for the MultiPLX program 
(http://bioinfo.ebc.ee/multiplx/).

'MPrimer3' is a modified version of the Primer3 version 1.0 
(Whitehead Institute for Biomedical Research). See the copyright 
notice in file MPRIMER3_README.
The modification allows primers to be designed that overlap masked 
nucleotides. The only change in the code is that MPrimer3 rejects 
candidate primers which have a lowercase letter at the 3' end. This 
makes it compatible with the GenomeMasker and allows more 
high-quality primers to be designed.
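
The rejection rule can be written down as a one-line check. This is 
only an illustration in shell, not the MPrimer3 code. The function 
name has_masked_3prime is hypothetical, and treating 'n' as a masked 
letter alongside a, c, g and t is an assumption of this sketch.

```shell
# Succeeds (exit status 0) if the last character of the given primer
# sequence, i.e. its 3' end, is a lowercase nucleotide letter.
# Assumption: lowercase letters are limited to a, c, g, t and n.
has_masked_3prime() {
    case $1 in
        *[acgtn]) return 0 ;;
        *)        return 1 ;;
    esac
}

has_masked_3prime "ACGTTGCa" && echo "rejected: 3' end is masked"
has_masked_3prime "acgTTGCA" || echo "accepted: 3' end is not masked"
```

Lowercase letters elsewhere in the candidate primer do not matter for 
this check, which is why masking a single 3' nucleotide leaves more 
candidate primers than masking with N or X.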

======================================================================
EXAMPLES

If you included example files in your download, you should have the 
following:

- human12.list  A sample blacklist file with over-represented words.
                This is a list with word size 12, to keep the file size 
                within reasonable limits. The blacklist with word 
                length 16 would be 127 MB. We have found that word
                size 16 gives better separation between high-quality
                and low-quality PCR primers, so we suggest creating 
                your own blacklist locally with word length 16. 

- snp.fas       10 DNA sequences in FASTA format. These are SNPs from
                chromosome 22 with 500 bp flanking sequence from both
		sides.

- snp.12masked  The same sequence masked with GenomeMasker, word 
                length 12, masking type 'target'. This masking type 
		masks upper strand words before the SNP and lower 
		strand words after SNP (target) location. The start 
		and end coordinates of the target region should be 
		defined on command line. Please note that if you use 
		multiple sequences in your FASTA files, all sequences 
		are expected to have the same target coordinates.
		See also example masked sequences snp.16masked (target), 
		snp.unmasked (target), snp.16masked_both (both strands 
		masked), snp.16masked_forward (upper strand masked),
		snp.16masked_backward (lower strand masked).
		This file is input for the MPrimer3 (through the script 
		that converts fasta format to MPrimer3 format). 

- snp.primers.12masked 
                10 pairs of primers, designed from the sequence file
		snp.12masked. Input for the 'gtester'.
                The primer file should contain at least 3 TAB-delimited
		columns:
		NAME OF THE PRIMER PAIR
		SEQUENCE OF PRIMER A (forward/sense/upper primer)
		SEQUENCE OF PRIMER B (backward/antisense/lower primer)
		Other columns are optional and can contain comments 
		or other information.
                The program will not work properly if there are fewer
                than 3 TAB-delimited columns on each line.
		
- chrX_1MB.fa       Fasta file with 1 Mb of Chr22 sequence. Used as an
                input for the 'gindexer'.

- locfiles.txt  Text file with list of location indexes. Example files
                contain indexes for 1 Mb of Chr22 sequence. Full set 
		of indexes for the human genome take 20 GB of disk 
		space.
		
- chrXa.location   Four index files that keep word locations 
- chrXc.location   in the genome. Currently the 'gindexer' is able
- chrXg.location   to handle one sequence per file, so you have to 
- chrXt.location   make a separate index file for each chromosome 
                   (or separate index for each contig). These index 
                   files contain locations of 16-nucleotide words 
                   from the 1 Mb of Chr22 sequence.
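
Because 'gindexer' handles one sequence per file, a multiple FASTA 
file (for example a set of contigs) has to be split first. A minimal 
sketch of such a split is given below. The function name split_fasta 
and the output file names seq_1.fa, seq_2.fa, ... are hypothetical 
and not part of the package.

```shell
# split_fasta INPUT.fa OUTDIR
# Writes each sequence of INPUT.fa, with its header line, to
# OUTDIR/seq_1.fa, OUTDIR/seq_2.fa, ... in order of appearance.
# Lines before the first '>' header are ignored.
split_fasta() {
    mkdir -p "$2" || return 1
    awk -v dir="$2" '
        /^>/ { if (out) close(out); n++; out = dir "/seq_" n ".fa" }
        out  { print > out }
    ' "$1"
}
```

Each resulting file can then be indexed separately with 'gindexer'.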


Three output files of the Genome Test:
- snp.primers.12masked.gt1     
                      File with number of primer binding sites and
                      with number of products.
		      The columns are:
		      NAME
		      NUMBER OF BINDING SITES FOR PRIMER A
		      NUMBER OF BINDING SITES FOR PRIMER B
                      NUMBER OF PRODUCTS
		      
- snp.primers.12masked.gt2     
                      File with description of all PCR products.
                      The columns are:
		      NAME
		      PRODUCT_NUMBER
		      CHR
		      LOCATION (start nucleotide)
		      LENGTH (bp)
		      TYPE OF PRODUCT
		           1: PrimerA-PrimerB (sense strand product)
                          -1: PrimerB-PrimerA (antisense strand product)
			   2: PrimerA-PrimerA 
			  -2: PrimerB-PrimerB
			  
- snp.primers.12masked.gt3     
                      File with description of all primer binding sites.
                      The columns are:
		      NAME OF THE PRIMER PAIR
		      PRIMER (A or B)
		      STRAND (1=sense, -1=antisense)
		      CHR
		      LOCATION OF THE 5' END OF THE PRIMER

The GenomeTester program does not exclude or reject any primers based on
the results. It is YOUR responsibility to check the results of the 
Genome Test and exclude primer pairs that are inappropriate. 
We suggest excluding primers with 10 or more binding sites and 
primer pairs with more than one product.
There are also other example files for primers produced from differently 
masked sequences: 
snp.primers.16masked
snp.primers.16masked.gt1
snp.primers.16masked.gt2
snp.primers.16masked.gt3
snp.primers.unmasked
snp.primers.unmasked.gt1
snp.primers.unmasked.gt2
snp.primers.unmasked.gt3

Please note that unmasked sequences give 4 primer pairs that are 
likely to fail (snp.primers.unmasked.gt1), because they have >10 
binding sites in the genome. Primers in file snp.primers.12masked
contain 2 such low-quality primers and primers in file 
snp.primers.16masked are all high-quality primers.
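
The exclusion rule suggested above (10 or more binding sites for 
either primer, or more than one product) can be applied to a .gt1 
file with a short filter. This is a sketch under assumptions: the 
function name gt1_pass is hypothetical, the .gt1 file is assumed to be 
TAB-delimited without a header line, and any count marked with '+' is 
treated as failing.

```shell
# gt1_pass FILE.gt1
# Prints the lines of FILE.gt1 that pass the suggested thresholds:
#   column 2 (binding sites, primer A) < 10
#   column 3 (binding sites, primer B) < 10
#   column 4 (number of products)      <= 1
# Assumption: columns are TAB-delimited; counts containing '+' fail.
gt1_pass() {
    awk -F'\t' '
        $2 ~ /\+/ || $3 ~ /\+/ { next }
        ($2 + 0) < 10 && ($3 + 0) < 10 && ($4 + 0) <= 1
    ' "$1"
}
```

For example, 'gt1_pass snp.primers.unmasked.gt1' would list only the 
primer pairs from the unmasked example that satisfy both criteria. 
Pairs with zero products also pass this filter, so you may want to 
inspect those separately.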

=====================================================================
RUNNING GENOMEMASKER/GENOMETESTER WITH EXAMPLE FILES

# A separate test file test.sh is included in the package. 
# The test file executes all the commands below automatically.
# If you don't see any error messages after running test.sh, 
# then the programs should work fine on your computer.
# If you are unable to resolve problems, please send us an 
# e-mail and include the output messages from test.sh.
################# GenomeMasker ################
# Copy input files to current directory
cp examples/snp.fas .
cp examples/human12.list .

# Masking with GenomeMasker works through STDIN
cat snp.fas | ./gmasker human12.list l target 500 502 > snp.masked

# If you have enough time and RAM, try to make your own blacklist:
# ./glistmaker files.txt human.blacklist 16
# files.txt should contain full path to your chromosome or 
# contig files, one file name per line.
# See example file files.txt in examples/ directory

################# Primer design ################
# If you downloaded programs for primer design, try making primers:
# Primer design with modified primer3:
cp mprimer3/* .
cat snp.masked | ./fasta_to_mp3.pl | ./mprimer3_core | ./mp3_to_table.pl > snp.primers

################# GenomeTester #################
# Genome Test
cp examples/locfiles.txt .
./gtester snp.primers locfiles.txt

# Three output files with suffixes .gt1, .gt2 and .gt3 are created.
# Description of contents for these files is given above.

# If you want to test gindexer, use the following syntax:
cp examples/chrX_1MB.fa .
./gindexer chrX_1MB.fa chrX
# This creates 4 index files with locations of words


============================================================
COMMAND LINE OPTIONS AND EXAMPLES FOR DIFFERENT PROGRAMS 
IN THE GENOMEMASKER PACKAGE
============================================================
glistmaker is a program that creates the blacklist file
for the gmasker program.

glistmaker -v    version number (for the package)
glistmaker -h    help
glistmaker -d    turn on debugging output

usage:
glistmaker OPTIONS inputfilelist outputfile wordsize

 inputfilelist       Text file with list of file names, 
                     containing FASTA format sequences 
		     (nonredundant genome sequence).
		     Files listed here can contain one or
		     more sequences (multiple FASTA files).
		     
 outputfile          Name for the blacklist file created by 
                     the program.
		     
 wordsize            Length of word (k-tuple) that is used
                     to count over-represented words.
		     Word size 16 gives best primers,
		     sizes 8-15 are also valid and give 
		     smaller blacklist files.
 overreplimit NUMBER Specifies the overrepresentation cutoff 
		     (default 10).


glistmaker examples:
   glistmaker -overreplimit 30 chr_files.txt human12.list 12
   glistmaker contig_files.txt human16.list 16
   
============================================================
gmasker is a program that masks the 3' end of each over-represented 
word. The words are taken from the blacklist file.
The input file should be in (multiple) FASTA format and it is 
read from the standard input.

gmasker -v    version number (for the package)
gmasker -h    help
gmasker -d    turn on debugging output
gmasker -u    convert sequence to uppercase before processing

usage:
gmasker OPTIONS blacklistfile maskingletter maskingtype [start end]
   
   blacklistfile     Name of the blacklist file, created 
                     with the glistmaker
		     
   maskingletter     This can be almost any letter, typical 
                     examples are 'N' or 'X'.
		     The only exception is 'l' (or 'L'), which 
		     triggers lowercase masking (3' ends of
		     over-represented words are in lower 
		     case, 3' ends of words with acceptable
		     frequency are in upper case)
		     
   maskingtype       Valid options are:
                     both, forward, backward, target
                     As the masking with GenomeMasker is
                     asymmetric, one has to specify which
                     strand should be masked.
                     The most useful option for primer design
                     is 'target'. This type masks the upper 
                     strand in front of the target region and
                     the lower strand behind the target region.
                     Target in this case is the region 
                     which should be amplified by the primers
                     designed with MPrimer3.

   start end         These numbers define start and end 
                     nucleotides of the target region if 
		     type of masking is 'target'
		     
   nbases NUMBER     Defines the number of bases from 3' end 
                     to mask (default 1)
   
gmasker examples:
   cat inputfile.fa | gmasker human16.list N both
   cat inputfile.fa | gmasker human16.list l forward
   cat inputfile.fa | gmasker human16.list '#' backward
   cat inputfile.fa | gmasker human16.list l target 500 502   
   cat inputfile.fa | gmasker human16.list l target 400 800
   dust inputfile.fa | gmasker human16.list l target 500 502

============================================================
gindexer is a program that creates index files for gtester,
using word length 16.

gindexer -v    version number (for the package)
gindexer -h    help
gindexer -d    turn on debugging output

usage:
gindexer OPTIONS inputfile outputfile

   inputfile         This is a FASTA file with one single
                     sequence (e.g. the assembled chromosome
		     sequence). Multiple FASTA files (e.g. 
		     multiple contig sequences) are not 
		     supported at the moment. One possible 
		     workaround is that you save each 
		     sequence to a separate file and create
		     indexes for all these files.
		     In any case, if you have to index more 
		     than one file, it would be practical to
		     write a shell script that does it for 
		     each file
		     
   outputfile        Prefix for the output file. You can add
                     directory names in front of the output
		     file name. For each input file 4 output
		     files are created
		     
   wordsize LENGTH   Default wordsize for location indexes 
                     is 16, but it can be changed with this 
		     argument
		     
gindexer examples:
   gindexer chr22.seq chr22
   gindexer /db/ensembl_29/chrX.fa indexes/X 
   gindexer -wordsize 12 chr22.seq chr22
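
As suggested above, indexing several files is easiest with a small 
shell script. The sketch below is one possible way to do it. The 
function name index_all and the GINDEXER variable (path to the 
gindexer binary, default ./gindexer) are conveniences of this sketch 
and not part of the package. Only the documented call 
'gindexer inputfile outputfile' is used.

```shell
# index_all OUTDIR FILE.fa ...
# Runs gindexer once per input file, using OUTDIR/<file name without
# extension> as the output prefix. Files that do not contain exactly
# one FASTA sequence are skipped with a warning, because gindexer
# expects a single sequence per file.
index_all() {
    ia_outdir=$1
    shift
    mkdir -p "$ia_outdir" || return 1
    for ia_fa in "$@"; do
        # count FASTA header lines; grep exits non-zero when there are none
        ia_nseq=$(grep -c '^>' "$ia_fa" 2>/dev/null || true)
        if [ "${ia_nseq:-0}" -ne 1 ]; then
            echo "skipping $ia_fa: ${ia_nseq:-0} sequences, expected 1" >&2
            continue
        fi
        ia_name=$(basename "$ia_fa")
        "${GINDEXER:-./gindexer}" "$ia_fa" "$ia_outdir/${ia_name%.*}" || return 1
    done
}
```

A call such as 'index_all indexes chr21.fa chr22.fa' would then create 
the index files with prefixes indexes/chr21 and indexes/chr22 (the 
input file names here are only examples).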
============================================================
gtester is a program that counts how many times your primers
bind to the template and how many products are generated. 
Location files must be created with the gindexer.
The maximum product length is 1000 bp by default, but the user can
choose a different length with the 'maxprodlen' option. The maximum 
number of binding sites tracked is 1000 by default ('limit' option); 
if more than 1000 are found, they are marked with '+'.

gtester -v    version number (for the package)
gtester -h    help
gtester -d    turn on debugging output

usage:
gtester OPTIONS primerfile locationsfile [blacklistfile]

   primerfile        This is input file with at least 3 
                     TAB-delimited columns: ID, forward 
		     primer, backward primer. Comments can 
		     be added to 4th column.
		     
   locationsfile     Text file with names of the files that were 
                     indexed with gindexer and which you 
                     want to use for finding primer 
                     locations. Each file name should be
                     on a separate line.
   
   blacklistfile     This is blacklist file name - an 
                     optional parameter that might speed 
		     up the genome test. If this file name 
		     is given, then gtester will not test 
		     and record locations of the words that
		     are already listed in the blacklist.
		     Primers that end with blacklist word
		     are marked with '+' in the .gt1 output
		     file.
		     Please note that this blacklist can 
		     only be used if the word length in 
		     location indexes is the same as in this 
		     blacklist
   maxprodlen LENGTH Specifies the maximum product length 
		     (default 1000 bp).
		     
   limit NUMBER      Defines maximum number of binding sites 
                     to track (default 1000)
		     
   output CODE       Prints only defined output files: 
                     1 = gt1, 2 = gt2, 3 = gt3 (default 123)
   
gtester examples:
   gtester my_primers.txt index_files.txt
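
Since gtester expects at least 3 TAB-delimited columns on every line 
of the primer file, it can help to check the file before starting a 
long run. The check below is a sketch. The function name 
check_primer_file is hypothetical, and treating an empty ID or primer 
column as an error is an assumption of this sketch.

```shell
# check_primer_file FILE
# Reports every line that does not have at least 3 TAB-delimited,
# non-empty columns (ID, forward primer, backward primer) and returns
# a non-zero exit status if any such line was found.
check_primer_file() {
    awk -F'\t' '
        NF < 3 || $1 == "" || $2 == "" || $3 == "" {
            printf "line %d: expected at least 3 TAB-delimited columns\n", NR
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

It can be chained in front of the test, for example 
'check_primer_file my_primers.txt && gtester my_primers.txt index_files.txt'.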
   
Three output files of the Genome Test:
- my_primers.txt.gt1     
   File with number of primer binding sites and with the 
   number of products.
   
   The columns are:
      NAME
      NUMBER OF BINDING SITES FOR PRIMER A
      NUMBER OF BINDING SITES FOR PRIMER B
      NUMBER OF PRODUCTS
   
- my_primers.txt.gt2     
   File with description of all PCR products.
   
   The columns are:
      NAME
      PRODUCT_NUMBER
      CHR
      LOCATION (start nucleotide)
      LENGTH (bp)
      TYPE OF PRODUCT
         1: PrimerA-PrimerB (sense strand product)
	-1: PrimerB-PrimerA (antisense strand product)
	 2: PrimerA-PrimerA 
	-2: PrimerB-PrimerB
   
- my_primers.txt.gt3
   File with description of all primer binding sites.
   
   The columns are:
      NAME OF THE PRIMER PAIR
      PRIMER (A or B)
      STRAND (1=sense, -1=antisense)
      CHR
      LOCATION OF THE 5' END OF THE PRIMER
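
For a quick overview of a .gt2 file, the products can be tallied by 
the TYPE OF PRODUCT column. This is a sketch only: the function name 
gt2_type_counts is hypothetical and the .gt2 file is assumed to be 
TAB-delimited with the type in the 6th column, as listed above.

```shell
# gt2_type_counts FILE.gt2
# Prints one line per product type found in column 6, as
# "<type><TAB><number of products>", sorted by type (-2, -1, 1, 2).
# Assumption: columns are TAB-delimited and there is no header line.
gt2_type_counts() {
    awk -F'\t' 'NF >= 6 { n[$6]++ } END { for (t in n) print t "\t" n[t] }' "$1" |
        LC_ALL=C sort -n
}
```

Products of type 2 or -2 are formed by a single primer binding on 
both strands, so a non-zero count for these types may deserve a 
closer look at the corresponding primer pairs.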
      
=====================================================================
CONTACT ADDRESS

If you have any questions or comments, please let us know by sending
an e-mail to Maido Remm at maido.remm@ut.ee
