Ranking of structures

How structures are ranked

DOOSS supports ranking structures by their one-dimensional or two-dimensional data values, helping the investigator identify structures that may be of interest.

A structure is ranked by comparing the distribution of data values corresponding to the structure, for a particular data source, against all data values for the same data source using a Mann-Whitney U test.

The Mann-Whitney U test generates a z-score which indicates whether a particular structure lies at an extreme of the distribution of all data values in a data source. For example, a large negative z-score for a particular structure means that the median data value for that structure is significantly less than the median of all data values in the data source. A z-score close to zero indicates that the structure does not contain data values that are significantly different from the rest.
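The following is a minimal sketch of how such a z-score can be computed, using the standard normal approximation to the Mann-Whitney U statistic. It is not taken from DOOSS: the function names are hypothetical, the tie correction to the variance is omitted for brevity, and the exact variant of the test used by DOOSS may differ.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Find the end of the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_z(structure_values, all_values):
    """z-score of the structure's values against all values (hypothetical helper).

    Negative: the structure's values tend to be lower than the rest.
    Positive: they tend to be higher. No tie correction is applied.
    """
    n1, n2 = len(structure_values), len(all_values)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must contain at least one value")
    ranks = average_ranks(list(structure_values) + list(all_values))
    rank_sum = sum(ranks[:n1])               # rank sum of the structure's values
    u = rank_sum - n1 * (n1 + 1) / 2.0       # U statistic for the structure
    mean_u = n1 * n2 / 2.0                   # expected U under the null hypothesis
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (u - mean_u) / sd_u

# Example: a structure holding the lowest values in the data source.
z_low = mann_whitney_z([1, 2, 3], list(range(1, 11)))
z_high = mann_whitney_z([8, 9, 10], list(range(1, 11)))
```

In this sketch a structure whose values sit at the low end of the data source gets a negative z-score, and one at the high end gets a positive z-score.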

Structures that lie at an extreme (high or low z-score) are typically the most interesting.
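As an illustration, structures could be ordered so that the most extreme z-scores come first. The snippet below is only a sketch: the structure names and z-scores are invented for the example, and sorting by absolute z-score is an assumption rather than a description of how DOOSS presents its results.

```python
# Hypothetical z-scores, invented purely for illustration.
z_scores = {"structure_A": -4.2, "structure_B": 0.3, "structure_C": 2.7}

# Order by distance from zero so both extremes appear at the top.
ranked = sorted(z_scores.items(), key=lambda item: abs(item[1]), reverse=True)

for name, z in ranked:
    print(f"{name}\t{z:+.2f}")
```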

How to rank structures

Select Rank structures from the Analysis menu and then select the desired data source to rank against from the Data source menu.

Ranking options

  1. All nucleotides - all non-empty values within a structure are compared against all non-empty values in the data source. Note that for two-dimensional data sources the number of data values may be very large (sequence_length^2), because the ranking test considers every non-empty pair of nucleotides in the sequence, so it may take a long time to run.
  2. Paired only - all non-empty values corresponding to paired nucleotides within a structure are compared against all non-empty values corresponding to paired nucleotides in the data source. In the case of a two-dimensional overlay the ranking test only considers pairs of nucleotides which are paired in the structure, thus greatly reducing the number of data points it needs to consider.
  3. Unpaired only - all non-empty values corresponding to unpaired nucleotides within a structure are compared against all non-empty values corresponding to unpaired nucleotides in the data source. Note: as with the "All nucleotides" option, this may take a long time to run for two-dimensional data sources.
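The three options above can be sketched for a one-dimensional data source as a filter applied to both samples before they are compared. This is an illustration under stated assumptions, not DOOSS code: the function name, the use of None for empty values, and the representation of a structure as a range of nucleotide positions are all hypothetical.

```python
def select_values(values, paired_positions, positions, mode="all"):
    """Collect the non-empty values at `positions` that match `mode`.

    values           -- one value per nucleotide, None marking an empty value
    paired_positions -- set of 0-based indices of paired nucleotides
    positions        -- indices to consider (a structure, or the whole sequence)
    mode             -- "all", "paired" or "unpaired"
    """
    if mode not in ("all", "paired", "unpaired"):
        raise ValueError(f"unknown mode: {mode!r}")
    selected = []
    for i in positions:
        value = values[i]
        if value is None:
            continue  # empty values are never compared
        is_paired = i in paired_positions
        if mode == "all":
            selected.append(value)
        elif mode == "paired" and is_paired:
            selected.append(value)
        elif mode == "unpaired" and not is_paired:
            selected.append(value)
    return selected

# Made-up example data: six nucleotides, one of them with an empty value.
values = [0.1, None, 0.5, 0.9, 0.3, 0.7]
paired = {0, 3, 4}
structure = range(0, 4)             # the structure covers positions 0 to 3
whole_sequence = range(len(values))

structure_paired = select_values(values, paired, structure, "paired")
source_paired = select_values(values, paired, whole_sequence, "paired")
```

With "Paired only", both the structure sample and the data source sample are reduced to paired positions before the test is run, which is why that option handles far fewer data points.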