NEXUS

NEXUS: Description in Maddison, D. R., D. L. Swofford and W. P. Maddison. 1997. NEXUS: an extensible file format for systematic information.
Systematic Biology 46:590-621.


NEXUS is a file format designed to contain systematic data for use by computer programs. The goals of the format are to allow future expansion, to include diverse kinds of information, to be independent of particular computer operating systems, and to be easily processed by a program.

format information

standard text file

file format:

  • NEXUS files are free-format, which means that the entire file could conceivably consist of a single long line of text. It does not matter to Hickory where you break lines (as long as you do not split a keyword or the name of a locus, allele, or population), nor does it matter whether you use one space or a dozen to separate the individual words (tokens) in the file. Tokens may be loosely defined as sequences of characters separated by whitespace (spaces, carriage returns, line feeds, tabs, etc.).
  • NEXUS files are, for the most part, not case-sensitive. A big exception is the MATRIX command, where Hickory (by default) treats an allele named A as distinct from one named a.
  • Comments can be added by enclosing text with brackets: [comment]
  • The first line must be: #NEXUS
  • The tokens in a NEXUS file are organized into commands, which are in turn organized into blocks.
    • Commands: the first token in the command is the command name, which is followed by a series of tokens and whitespace; the command is terminated by a semicolon: command-name token token . . . ;
    • Blocks: series of commands, beginning with a BEGIN command and ending with an END command:
      BEGIN block-name;
       command-name token . . . ;
       command-name token . . . ;
       ...
      END;
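
The rules above (free-format tokens, bracketed comments, semicolon-terminated commands, BEGIN ... END blocks) can be sketched in a few lines of Python. This is a minimal illustration, not a full NEXUS reader: the function names are hypothetical, tokens are split on whitespace only (so NTAX=2 stays one token), quoted names are not handled, and nested comments are tolerated as an assumption.

```python
def strip_comments(text):
    """Remove [bracketed] comments; nesting is tolerated (an assumption)."""
    out, depth = [], 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)


def split_commands(text):
    """Return the commands of a NEXUS file, each as a list of tokens."""
    body = text.lstrip()
    if not body.upper().startswith("#NEXUS"):
        raise ValueError("a NEXUS file must begin with #NEXUS")
    body = strip_comments(body[len("#NEXUS"):])
    # A semicolon ends a command; line breaks and extra spaces are irrelevant.
    return [chunk.split() for chunk in body.split(";") if chunk.split()]


def group_blocks(commands):
    """Group commands by block name, using the BEGIN and END commands."""
    blocks, current = {}, None
    for tokens in commands:
        name = tokens[0].upper()  # command names are not case-sensitive
        if name == "BEGIN" and len(tokens) > 1:
            current = tokens[1].upper()
            blocks[current] = []
        elif name == "END":
            current = None
        elif current is not None:
            blocks[current].append(tokens)
    return blocks


# The whole file on (almost) one line, with irregular spacing and a comment.
sample = "#NEXUS begin taxa; dimensions ntax=2; [a comment] taxlabels\n   fish      frog; end;"
print(group_blocks(split_commands(sample)))
# {'TAXA': [['dimensions', 'ntax=2'], ['taxlabels', 'fish', 'frog']]}
```

Splitting on semicolons before looking at line breaks is what makes the sketch free-format: the same result is obtained however the sample is wrapped.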

The primary public blocks are listed below with their commonly defined commands. In the syntax summaries, [ ] encloses optional items and { | } encloses mutually exclusive options:

  • TAXA: TAXA block defines taxa and gives them names. The block also establishes the order (numbering) of the taxa. Taxa consist of the entities (biological species, haplotypes, manuscripts, etc.) whose attributes might be recorded in characters and whose relationships are described in trees
    BEGIN TAXA;
     DIMENSIONS NTAX=number-of-taxa;
     TAXLABELS taxon-name [taxon-name ...] ;
    END;
  • CHARACTERS: contains information about discrete and continuous data, including that for morphological structure and molecular sequences. Polymorphism and frequency data can be accommodated. Names can be given to the characters and their states.
    BEGIN CHARACTERS;
     DIMENSIONS [NEWTAXA NTAX=number-of-taxa] NCHAR=number-of-characters;
     [FORMAT
      [DATATYPE={STANDARD|DNA|RNA|NUCLEOTIDE|PROTEIN|CONTINUOUS}]      default: STANDARD
      [RESPECTCASE]                                                    default: A and a are the same
      [MISSING=symbol]                                                 default: ?
      [GAP=symbol]
      [SYMBOLS="symbol [symbol...]"]
      [EQUATE="symbol=entry [symbol=entry...]"]
      [MATCHCHAR=symbol]
      [[NO]LABELS]
      [TRANSPOSE]
      [INTERLEAVE]
      [ITEMS=([MIN][MAX][MEDIAN][AVERAGE][VARIANCE][STDERROR][SAMPLESIZE][STATES])]  default: STATES
      [STATESFORMAT={STATESPRESENT|INDIVIDUALS|COUNT|FREQUENCY}]                     default: STATESPRESENT
      [[NO]TOKENS]
     ;]
     [ELIMINATE character-set;]
     [TAXLABELS taxon-name [taxon-name...];]
     [CHARSTATELABELS character-number [character-name] [/state-name [state-name...]]
      [, character-number [character-name] [/state-name [state-name...]]
       ...]
     ;]
     [CHARLABELS character-name [character-name...];]
     [STATELABELS character-number [character-name] [/state-name [state-name...]]
      [, character-number [character-name] [/state-name [state-name...]]
       ...]
     ;]
     MATRIX data-matrix;
    END;

    example:

    BEGIN CHARACTERS;
     DIMENSIONS NCHAR=3;
     CHARSTATELABELS 1 hair/absent present, 2 color/red blue, 3 size/small big;
     FORMAT TOKENS;
     MATRIX
      taxon_1 absent red big
      taxon_2 absent blue small
      taxon_3 present blue small;
    END;
  • UNALIGNED: similar to a CHARACTERS block, but it contains unaligned molecular sequence data.
    BEGIN UNALIGNED;
     [DIMENSIONS NEWTAXA NTAX=number-of-taxa;]
     [FORMAT
      [DATATYPE={STANDARD|DNA|RNA|NUCLEOTIDE|PROTEIN}]
      [RESPECTCASE]
      [MISSING=symbol]
      [SYMBOLS="symbol [symbol...]"]
      [EQUATE="symbol=entry [symbol=entry...]"]
      [[NO]LABELS]
     ;]
     [TAXLABELS taxon-name [taxon-name...];]
     MATRIX data-matrix;
    END;

    example:

    BEGIN UNALIGNED;
     FORMAT DATATYPE=DNA;
     MATRIX
      taxon_1 ACTAGGACTAGATCAAGTT,
      taxon_2 ACCAGGACTAGCGGATCAAG,
      taxon_3 ACCAGGACTAGATCAAG;
    END;
  • DISTANCES: contains distance matrices
    BEGIN DISTANCES;
     [DIMENSIONS [NEWTAXA] NTAX=number-of-taxa NCHAR=number-of-characters;]
     [FORMAT
      [TRIANGLE={LOWER|UPPER|BOTH}]
      [[NO]DIAGONAL]
      [[NO]LABELS]
      [MISSING=symbol]
      [INTERLEAVE]
     ;]
     [TAXLABELS taxon-name [taxon-name...];]
     MATRIX distance-matrix;
    END;

    example:

    BEGIN DISTANCES;
     FORMAT TRIANGLE=UPPER;
     MATRIX 
      taxon_1 0.0 1.0 2.0
      taxon_2     0.0 3.0
      taxon_3         0.0;
    END; 
  • DATA: is a CHARACTERS block that includes not only the definition of characters but also the definition of taxa (this block is not recommended)
    BEGIN DATA;
     DIMENSIONS NTAX=5 NCHAR=20;
     FORMAT DATATYPE=DNA GAP=-;
     MATRIX
      taxon_1 A-CTAGGACTA---GATCAA
      taxon_2 A-CCAGGACTAGCGGATCAA
      taxon_3 A-CCAGGACTA---GATCAA
      taxon_4 AGCCAGGACTA---GTTCAA
      taxon_5 ATC-AGGACTA---GATCAA;
    END;
  • SETS: descriptions of collections of objects. These objects include characters, taxa, trees, states, and kinds of changes. In addition, partitions of characters, taxa, and trees can be formed.
    BEGIN SETS;
     [CHARSET charset-name [({STANDARD|VECTOR})]=character-set;]
     [STATESET stateset-name [({STANDARD|VECTOR})]=state-set;]
     [CHANGESET changeset-name=state-set<->state-set [state-set<->state-set...];]
     [TAXSET taxset-name [({STANDARD|VECTOR})]=taxon-set;]
     [TREESET treeset-name [({STANDARD|VECTOR})]=tree-set;]
     [CHARPARTITION partition-name [([{[NO]TOKENS}] [{STANDARD|VECTOR}])]
      =subset-name:character-set [, subset-name:character-set...];]
     [TAXPARTITION partition-name [([{[NO]TOKENS}] [{STANDARD|VECTOR}])]
      =subset-name:taxon-set [, subset-name:taxon-set...];]
     [TREEPARTITION partition-name [([{[NO]TOKENS}] [{STANDARD|VECTOR}])]
      =subset-name:tree-set [, subset-name:tree-set...];]
    END;

    example:

    BEGIN SETS;
     CHARSET larval=1-3 5-8;
     STATESET eyeless=0;
     STATESET eyed=1 2 3;
     CHANGESET eyeloss=eyed -> eyeless;
     TAXSET outgroup=1-4;
     TREESET AfrNZVicariance=3 5 9-12;
     CHARPARTITION bodyparts=head:1-4 7, body:5 6, legs:8-10;
    END;
  • ASSUMPTIONS: assumptions about the data. These can include assignment of weights to various characters, specification of the nature of character changes, exclusion of particular characters, and designation of ancestral states.
    BEGIN ASSUMPTIONS;
     [OPTIONS [DEFTYPE=type-name]
      [POLYTCOUNT={MINSTEPS|MAXSTEPS}]
      [GAPMODE={MISSING|NEWSTATE}];]
     [USERTYPE type-name[({STEPMATRIX|CSTREE})]=USERTYPE-description;]
     [TYPESET [*] typeset-name [({STANDARD|VECTOR})]=TYPESET-definition;]
     [WTSET [*] wtset-name [({STANDARD|VECTOR} {TOKENS|NOTOKENS})]=WTSET-definition;]
     [EXSET [*] exset-name [({STANDARD|VECTOR})]=character-set;]
     [ANCSTATES [*] ancstates-name [({STANDARD|VECTOR} {TOKENS|NOTOKENS})]=ANCSTATES-definition;]
    END;

    example:

    BEGIN ASSUMPTIONS;
     OPTIONS DEFTYPE=ORD;
     USERTYPE myOrd=4
      0 1 2 3
      . 1 2 3
      1 . 1 2
      2 1 . 1
      3 2 1 .;
     USERTYPE myTree (CSTREE)=((0,1)a, (2,3)b)c;
     TYPESET * mixed=IRREV: 1 3 10, UNORD: 5-7;
     WTSET * one=2: 1-3 6 11-15, 3: 7 8;
     WTSET two=2:4 9, 3:1-3 5;
     EXSET nolarval=1-9;
     ANCSTATES mixed=0: 1 3 5-8 11, 1: 2 4 9-15;
    END;
  • CODONS: contains information about the genetic code, the regions of DNA and RNA sequences that are protein coding, and the location of triplets coding for amino acids in nucleotide sequences.
    BEGIN CODONS;
     [CODONPOSSET [*] name [({STANDARD|VECTOR})]=
      N: character-set,
      1: character-set,
      2: character-set,
      3: character-set;]
     [GENETICCODE code-name
      [([CODEORDER=123|other] [NUCORDER=TCAG|other] [[NO]TOKENS] [EXTENSIONS="symbols-list"])]
       =genetic-code-description;]
     [CODESET [*] codeset-name ({CHARACTERS|UNALIGNED|TAXA})
       =code-name:character-set or taxon-set [,code-name:character-set or taxon-set...];]
    END;
  • TREES: stores information about trees
    BEGIN TREES;
     [TRANSLATE arbitrary-token-used-in-tree-description valid-taxon-name 
      [, arbitrary-token-used-in-tree-description valid-taxon-name. . . ] ;]
     [TREE [*] tree-name= tree-specification;]
    END;

    example:

    BEGIN TAXA;
     TAXLABELS Scarabaeus Drosophila Aranaeus;
    END;
    
    BEGIN TREES;
     TRANSLATE beetle Scarabaeus, fly Drosophila, spider Aranaeus;
     TREE tree1 = ((1,2),3);
     TREE tree2 = ((beetle,fly),spider);
     TREE tree3 = ((Scarabaeus,Drosophila),Aranaeus);
    END;
  • NOTES: allows attachment of additional information (text, pictures, etc.) to various objects (trees, taxa, characters, etc.) in the file.
    BEGIN NOTES;
     [TEXT [TAXON=taxon-set] [CHARACTER=character-set] [STATE=state-set] [TREE=tree-set]
      SOURCE={INLINE|FILE|RESOURCE} TEXT=text-or-source-description;]
     [PICTURE [TAXON=taxon-set] [CHARACTER=character-set] [STATE=state-set] [TREE=tree-set]
      [FORMAT={PICT|TIFF|EPS|JPEG|GIF}] [ENCODE={NONE|UUENCODE|BINHEX}]
      SOURCE={INLINE|FILE|RESOURCE} PICTURE=picture-or-source-description;]
    END;
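
The set notation used in the SETS and ASSUMPTIONS examples (for instance 1-3 5-8, or head:1-4 7, body:5 6, legs:8-10) is compact, so it may help to see it expanded. The sketch below is illustrative only: the function names are hypothetical, and it assumes STANDARD-format sets made of plain numbers and number ranges written without internal spaces. Object names used as set members and any other shorthand are not handled.

```python
def expand_set(spec):
    """Expand a set such as '1-3 5-8' into a sorted list of numbers."""
    members = set()
    for token in spec.split():
        if "-" in token:
            # A range such as 5-8 includes both end points.
            start, end = token.split("-")
            members.update(range(int(start), int(end) + 1))
        else:
            members.add(int(token))
    return sorted(members)


def parse_partition(spec):
    """Parse 'name:set, name:set' into a dict mapping names to numbers."""
    result = {}
    for part in spec.split(","):
        name, members = part.split(":")
        result[name.strip()] = expand_set(members)
    return result


print(expand_set("1-3 5-8"))
# [1, 2, 3, 5, 6, 7, 8]
print(parse_partition("head:1-4 7, body:5 6, legs:8-10"))
# {'head': [1, 2, 3, 4, 7], 'body': [5, 6], 'legs': [8, 9, 10]}
```

The same expansion applies to the character lists in the WTSET, EXSET, and ANCSTATES examples, since they use the same range notation.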


  • The order of blocks is fixed for some pairs of blocks but not for others. For example, most programs require a CHARACTERS or DATA block to precede the ASSUMPTIONS block so that the characters are already defined.
  • Many of the commands in a NEXUS file define objects or specify characteristics of them. All objects can be labeled (given names):
    • Duplicate names should be avoided.
    • Names are single words (no spaces).
    • Names cannot consist entirely of digits.
    • Several of the object definitions follow a common format: GENETICCODE, CODONPOSSET, CODESET, CHARSET, STATESET, CHANGESET, TAXSET, TREESET, CHARPARTITION, TAXPARTITION, TREEPARTITION, USERTYPE, WTSET, TYPESET, EXSET, ANCSTATES and TREE. Example: CHARSET larval=1-15;
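
The naming rules above are easy to check mechanically. The following sketch uses a hypothetical helper, check_names, and assumes that duplicates are compared without regard to case (NEXUS being mostly case-insensitive) and that names are plain unquoted words.

```python
def check_names(names):
    """Return a list of problems found in a list of object names."""
    problems = []
    seen = set()
    for name in names:
        if not name or any(ch.isspace() for ch in name):
            problems.append(f"{name!r}: must be a single word")
        if name.isdigit():
            problems.append(f"{name!r}: consists entirely of digits")
        key = name.upper()  # assumption: names differing only in case clash
        if key in seen:
            problems.append(f"{name!r}: duplicate name")
        seen.add(key)
    return problems


print(check_names(["larval", "outgroup"]))
# []
for problem in check_names(["larval", "Larval", "123", "my set"]):
    print(problem)
```

The second call reports three problems: a duplicate, an all-digit name, and a name containing a space.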


example:

#NEXUS
BEGIN TAXA;
      Dimensions NTax=4;
      TaxLabels fish frog snake mouse;
END;

BEGIN CHARACTERS;
      Dimensions NChar=20;
      Format DataType=DNA;
      Matrix
        fish   ACATA GAGGG TACCT CTAAG
        frog   ACATA GAGGG TACCT CTAAG
        snake  ACATA GAGGG TACCT CTAAG
        mouse  ACATA GAGGG TACCT CTAAG;
END;

BEGIN TREES;
      Tree best=(fish, (frog, (snake, mouse)));
END;
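
As a rough illustration of how a program might read the TAXA and CHARACTERS blocks of this example, the Python sketch below pulls out the taxon labels, NCHAR, and the matrix, then checks that they agree. The helper names are hypothetical. It assumes a file without comments or quoted names, with every command (including MATRIX) terminated by a semicolon, and sequences that never coincide with a taxon name.

```python
import re

EXAMPLE = """#NEXUS
BEGIN TAXA;
      Dimensions NTax=4;
      TaxLabels fish frog snake mouse;
END;

BEGIN CHARACTERS;
      Dimensions NChar=20;
      Format DataType=DNA;
      Matrix
        fish   ACATA GAGGG TACCT CTAAG
        frog   ACATA GAGGG TACCT CTAAG
        snake  ACATA GAGGG TACCT CTAAG
        mouse  ACATA GAGGG TACCT CTAAG;
END;
"""


def command_body(text, name):
    """Return the text between command `name` and its closing semicolon."""
    match = re.search(r"\b" + name + r"\b([^;]*);", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("no " + name + " command found")
    return match.group(1)


def read_matrix(text):
    """Return {taxon: sequence}, checked against TAXLABELS and NCHAR."""
    taxa = command_body(text, "TAXLABELS").split()
    nchar = int(re.search(r"NCHAR\s*=\s*(\d+)", text, flags=re.IGNORECASE).group(1))
    rows, current = {}, None
    for token in command_body(text, "MATRIX").split():
        if token in taxa:
            current = token  # a taxon name starts (or resumes) a row
            rows.setdefault(current, "")
        elif current is None:
            raise ValueError("matrix data found before any taxon name")
        else:
            rows[current] += token  # spaces inside a sequence are ignored
    for taxon in taxa:
        if len(rows.get(taxon, "")) != nchar:
            raise ValueError(taxon + " does not have " + str(nchar) + " characters")
    return rows


rows = read_matrix(EXAMPLE)
print(rows["fish"])
# ACATAGAGGGTACCTCTAAG
```

Because rows are keyed by taxon name rather than by line, the sketch does not depend on where the line breaks fall, in keeping with the free-format rule.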

How to cite

Description in Maddison, D. R., D. L. Swofford and W. P. Maddison. 1997. NEXUS: an extensible file format for systematic information.
Systematic Biology 46:590-621.

NCL

NEXUS Class Library (NCL) is an integrated collection of C++ classes designed to allow the user to quickly write a program that reads NEXUS-formatted data files. It also allows easy extension of the NEXUS format to include new blocks of your own design.

nexus.txt · Last modified: 2008/07/22 13:31 by 127.0.0.1