Supplemental Data. Simpson et al. (2010). Plant Cell 10.1105/tpc.110.077990

NonCanon - extraction of non-canonical start sites.

Note: This is research software, written to extract and quantify
relevant CDS starts from TAIR GenBank-style files.  It may not be
especially easy to use, but it should work as described in the
publication:

Simpson, G.G., Laurie, R.E., Dijkwel, P.P., Quesada, V., Stockwell, P.A.,
Dean, C. and Macknight, R.C., Non-canonical translation initiation of
the flowering time and alternative polyadenylation regulator, FCA,
Plant Cell, In press.

The programs are distributed as a self-extracting shell archive.  To
unpack, execute:

  sh noncanon.shar

which you may already have done, to generate a directory noncanon
containing the source code, make files and this SW_README.txt file.

The sources require that the GNU package gdbm is installed.  It is
available via http://www.gnu.org or other Free Software Foundation
download sites.

The distribution comprises C source code written in a
non-platform-specific way, so that compilation and running should be
possible on a wide range of Unix or Linux systems.  I have tested this
distribution on MacOS X (10.6 Snow Leopard and earlier), Ubuntu and
Debian Linux.

To build:

  cd src
  make
  cd ../br_prog
  make

which will generate src/noncanon, br_prog/br_prog and br_prog/fst_prog.
Various warnings may be issued during compilation, but can be ignored.

The executables can be copied to a system run directory
(e.g. /usr/local/bin) or run from where they have been built.

src/noncanon does the initial scanning for non-canonical starts, using
any file in GenBank format.  Terse help is available via -h.

A typical run might use:

  noncanon -R -r -m -L -c 1 -b 0 -P 3 -f NC_003070.gbk > chr1_ctgg.out

where:
  -R   => restrict non-canon starts to CTG/CTGG (not ATGg/GTGg)
  -r   => require a leading Pu-X-X before init
  -m   => relate CDS contexts to mRNAs, if available
  -L   => enforce 4 base noncanonical initiators
  -c 1 => this is for chromosome 1
  -b 0 => take 5' region right to start of CDS
  -P 3 => put out 3 bases prior to init, using upper case to show codon

NC_003070.gbk is a TAIR file for Chromosome 1 in GenBank format.

This produces an output file similar to:

1  NC_003070:AT1G01590:FRO1    gccATGg:214229   214124   105  atgCTGg
1  NC_003070:AT1G01630:        gtaATGg:229206   229179   27   gatCTGg
1  NC_003070:AT1G01650:        aagATGg:236537   237506   285  gttCTGg
1  NC_003070:AT1G02180:        gatATGa:414505   414532   27   acgCTGg
1  NC_003070:AT1G02800:ATCEL2  gaaATGg:616103   616166   63   ataCTGg
1  NC_003070:AT1G02880:TPK1    ttcATGa:643972   644050   78   gagCTGg
1  NC_003070:AT1G03260:        tatATGg:798102   798240   138  aatCTGg
1  NC_003070:AT1G03905:        aaaATGg:993478   993469   9    agaCTGg
1  NC_003070:AT1G04890:        cttATGa:1382546  1382663  117  aagCTGg
1  NC_003070:AT1G04950:TAF6    aagATGa:1407184  1407193  9    agcCTGg
[...]

in which the tab-delimited fields are:

  Chromosome,
  GeneId:Locus_tag:Name,
  CDScanoninit:position,
  noncanon_position,
  distance_fromCDScanon,
  Noncanonical_Init.

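For downstream work it may be convenient to read this table into
named fields.  The following is a minimal Python sketch and is not
part of the distribution: the names NonCanonRow and parse_noncanon are
hypothetical, and it assumes that no field contains embedded
whitespace, so that splitting on any whitespace is equivalent to
splitting on tabs.  Lines that do not have exactly six fields (such as
a "[...]" placeholder) are skipped.

```python
from typing import Iterable, Iterator, NamedTuple


class NonCanonRow(NamedTuple):
    """One line of noncanon output (hypothetical container, not in the distribution)."""
    chromosome: str
    gene_id: str           # first part of GeneId:Locus_tag:Name, e.g. NC_003070
    locus_tag: str         # e.g. AT1G01590
    name: str              # may be empty
    canon_context: str     # e.g. gccATGg
    canon_pos: int
    noncanon_pos: int
    distance: int          # distance_fromCDScanon, as reported by noncanon
    noncanon_context: str  # e.g. atgCTGg


def parse_noncanon(lines: Iterable[str]) -> Iterator[NonCanonRow]:
    """Yield one NonCanonRow per well-formed line; skip anything else."""
    for line in lines:
        fields = line.split()  # assumption: fields never contain spaces
        if len(fields) != 6:
            continue
        chrom, ident, canon, nc_pos, dist, nc_ctx = fields
        gene_id, locus_tag, name = ident.split(":", 2)
        canon_ctx, canon_pos = canon.rsplit(":", 1)
        yield NonCanonRow(chrom, gene_id, locus_tag, name, canon_ctx,
                          int(canon_pos), int(nc_pos), int(dist), nc_ctx)
```

A possible use would be to open chr1_ctgg.out and pass the file object
directly, e.g. "for row in parse_noncanon(open('chr1_ctgg.out')): ...".
The distance column is taken as reported and is not recomputed from
the two positions.
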
The -x and -X options respectively produce the 5' sequence regions or
their translation as a fasta format file:

  noncanon -R -r -m -L -c 1 -b 0 -X -f NC_003070.gbk > chr1_ctgg.fst

giving:

>AT1G01590 - 1 NC_003070::FRO1 atgg:214229 ctgg:214124
LETNIAHIYGFCKIHCKLHFALFFLLISWKFISGA
>AT1G01630 - 1 NC_003070:: atgg:229206 ctgg:229179
LVVAHRRNV
>AT1G01650 - 1 NC_003070:: atgg:236537 ctgg:237506
LVKVPTRVNGSEYTEYVGVGARFGPTLESKEKHATLIKLAIADPPDCCSTPKNKLTGEVI
LVHRGKCSFTTKTKVAEAAGASAILIINNSTDLFK
>AT1G02180 - 1 NC_003070:: atga:414505 ctgg:414532
LGIQRIKHD

where the header lines are:

  >Locus_tag, Chromosome_No, GenBankID::GeneID CDS_start, noncanon_start

These files are input to br_prog/br_prog or br_prog/fst_prog, which
will respectively run blast or fasta on each sequence fragment against
the defined library.  For each sequence, these programs generate an
appropriate unique command file, a single sequence file and an AWK
script (hitpicker.awk for br_prog, pikfst.awk for fst_prog).  The
working files can optionally be saved.  The command files run
blast/fasta with appropriate options, then process the output with the
awk script to return condensed details to the output file.

For instance:

  br_prog/br_prog -S /bin/sh -B 'blastall -p tblastn' c4.fst \
     plnests_no_at

will run tblastn with sequences from c4.fst against the prebuilt blast
library plnests_no_at, using /bin/sh as the command shell.  Some
sequences will generate blast error messages (usually because they are
too short) - these don't affect the analysis and can be ignored.
Working files will not be preserved, but an output file c4.fst.out
will contain a line for each sequence of c4.fst, giving figures and a
description for the highest-scoring blast match or "**No Hit**".

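Since very short fragments are the usual cause of the blast error
messages, it can be useful to look at the -X fasta file before running
br_prog.  The following is a minimal Python sketch and is not part of
the distribution: read_fasta, parse_header and filter_by_length are
hypothetical names, the header is assumed to end with the two
context:position tokens shown in the example above, and any length
threshold is the user's choice rather than a value used by br_prog.

```python
from typing import Dict, Iterable, Iterator, Tuple, Union

Record = Tuple[str, str]  # (header line, sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs, joining sequences wrapped over several lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_header(header: str) -> Dict[str, Union[str, int]]:
    """Pick out the locus tag (first token) and the two start sites (last two tokens)."""
    tokens = header.lstrip(">").split()
    canon, canon_pos = tokens[-2].split(":")
    noncanon, noncanon_pos = tokens[-1].split(":")
    return {
        "locus_tag": tokens[0],
        "canon_init": canon,
        "canon_pos": int(canon_pos),
        "noncanon_init": noncanon,
        "noncanon_pos": int(noncanon_pos),
    }


def filter_by_length(records: Iterable[Record], min_len: int) -> Iterator[Record]:
    """Keep records whose sequence has at least min_len residues (threshold is arbitrary)."""
    for header, seq in records:
        if len(seq) >= min_len:
            yield header, seq
```

For example, "filter_by_length(read_fasta(open('c4.fst')), 10)" would
list the records with at least 10 residues; the value 10 is only an
illustration.
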
The shell script files are like:

#!/bin/sh
#
blastall -p tblastn -m 0 -e 0.100000 -d plnests_no_at -i ./AT4G10010.seq \
 -o ./AT4G10010.lis
awk -f ./hitpicker.awk ./AT4G10010.lis >> ./c4.fst.out

The above br_prog run produced the following output (noting that long
lines are wrapped here with a '\'):

AT4G00730 - - 4 NC_003075::ANL2 atga:304103 ctgg:304133  **No Hit**
AT4G00740 - - 4 NC_003075:: atgg:310298 ctgg:310316  **No Hit**
AT4G00840 - - 4 NC_003075:: atga:357105 ctgg:357111  **No Hit**
AT4G01240 - - 4 NC_003075:: atgg:521972 ctgg:521984  **No Hit**
AT4G01550 - - 4 NC_003075::anac069 atgc:675758 ctgg:676213 \
 >TC65360 similar to UniRef100_A8MQY1 Cluster: \
 Uncharacterized protein At4g01540.1; n=1; Arabidopsis thaliana|Rep: \
 Uncharacterized protein At4g01540.1 - Arabidopsis thaliana \
 (Mouse-ear cress), partial (25%)  181  4.00e-4 \
 840  89  752  466
[...]
AT4G08967 - - 4 NC_003075:: atgg:5752128 ctgg:5752134  **No Hit**
AT4G10010 - - 4 NC_003075:: atgg:6265720 ctgg:6265870 \
 >BQ867234 similar to UniRef100_A7QC91 Cluster: \
 Chromosome undetermined scaffold_77, whole genome shotgun sequence; \
 n=1; Vitis vinifera|Rep: Chromosome undetermined scaffold_77, \
 whole genome shotgun sequence - Vitis vinifera (Grape), \
 partial (26%)  50  9.00e-06  655  258  398  326
[...]

where lines have the following tab-delimited fields:

  QuerySeq headerline (from noncanon output file),
  Match header text for top hit,
  Bit Score,
  Poisson probability for best match (Expectation),
  Length of best hit sequence,
  Position of match start in Subject Sequence,
  Remaining length of Subject Sequence after match start,
  End position in Subject of match.

br_prog/fst_prog performs a similar operation but uses fasta for the
homology searches, with a ktuple of 1 in order to give improved
sensitivity.

The noncanon code draws heavily on that used in building the TransTerm
translation terminator database: http://uther.otago.ac.nz/Transterm.html,
(Grant H. Jacobs, Augustine Chen, Stewart G. Stevens, Peter A. Stockwell,
Michael A. Black, Warren P. Tate and Chris M. Brown, (2009) Transterm: a
database to aid the analysis of regulatory sequences in mRNAs, NAR, 37
(suppl1), D72-D76).

Peter Stockwell
28-Oct-2010
peter.stockwell@otago.ac.nz