Tutorial: using the dataset creator

Introduction

This tutorial will guide you through the process of creating a StructureVis dataset using the HIV-1 SHAPE data. The bolded text gives the minimum instructions needed to complete the tutorial.

All files used in this tutorial are located in the 'shape_tutorial' directory of your StructureVis directory (i.e. /your_structure_vis_directory/Manual/shape_tutorial).


1. Project name and location

Welcome panel

Enter a project name. The project name must be unique in the project location, i.e. no file or folder in the location may have the same name. Generally you should leave the project location unchanged, as it is the default StructureVis workspace on your drive: StructureVis will automatically open to this location when you load a new project. Click next.

2. Specify the structure parameters

2.1 Select a structure file

Specify the consensus structure

Click on the 'Browse' button located next to the 'Structure' field, navigate to the 'shape_tutorial' directory and select the file named 'hiv-1-shape-helix-file.txt'. This file represents the secondary structure of an entire HIV-1 genome, as determined by a SHAPE experiment.

Press auto-detect

Press 'Auto-detect'. Once you have specified the file, you must specify the file type; for the HIV-1 secondary structure data, the file type is tab-delimited helix. You can specify the file type in two ways: by selecting the type from the drop-down menu or by clicking 'Auto-detect'.


2.2 Specify a reference nucleotide alignment

A reference nucleotide alignment is an alignment file that is used to map data to the secondary structure that you have specified. It is very important that the reference nucleotide alignment is specified correctly; otherwise all subsequent data will be mapped incorrectly to the structure.

The reference nucleotide alignment consists of a sequence (or sequences) in which every column of the alignment corresponds exactly to one nucleotide of the secondary structure. Usually the secondary structure of a particular genome will have a corresponding sequence or nucleotide alignment, so most of the time you can just use this.

Specify the reference structure

Select the reference alignment by browsing to the NL4-3 sequence file in the tutorial directory (../shape_tutorial/NL4-3-shape-alignment.fas). The HIV-1 SHAPE analysis used here was based on the NL4-3 HIV-1 RNA genome sequence. We have manually edited the provided alignment for you by truncating the ends so that it corresponds precisely to the coordinates specified in the structure file. Unfortunately, this correspondence is not obvious from the structure file that you specified in step 2.1. However, if you inspect the SHAPE reactivities file (../shape_tutorial/hiv-1-shape-reactivities.txt), which corresponds to the structure, you will see that the sequence in the second column is precisely the same as the sequence in the alignment file.
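The comparison between the alignment file and the second column of the reactivities file can also be done programmatically. The following Python sketch is illustrative only and is not part of StructureVis: the function names are hypothetical, the rows are assumed to have already been split into columns (the column separator of the reactivities file is not stated here), and treating T and U as the same base is an assumption made so that DNA and RNA spellings of the same sequence compare equal. The data shown is toy data, not the contents of the tutorial files.

```python
def read_fasta(lines):
    """Return a list of (name, sequence) pairs from FASTA-formatted lines."""
    records, name, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def column_sequence(rows, column=1):
    """Join one column (0-based index) of already-split rows into a sequence."""
    return "".join(row[column] for row in rows)


def normalise(seq):
    """Upper-case and map T to U (an assumption, see the text above)."""
    return seq.upper().replace("T", "U")


def first_mismatch(a, b):
    """Return the 0-based index of the first difference, or None if identical."""
    a, b = normalise(a), normalise(b)
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No differing character found: any remaining difference is in length.
    return None if len(a) == len(b) else min(len(a), len(b))


# Toy data standing in for the two tutorial files (not their real contents).
fasta_lines = [">toy_reference", "GGUCU", "CUC"]
reactivity_rows = [["1", "G", "0.1"], ["2", "G", "0.9"], ["3", "T", "0.2"],
                   ["4", "C", "0.0"], ["5", "U", "0.4"], ["6", "C", "0.3"],
                   ["7", "U", "0.8"], ["8", "C", "0.1"]]

reference = read_fasta(fasta_lines)[0][1]
mismatch = first_mismatch(reference, column_sequence(reactivity_rows))
print("sequences match" if mismatch is None else f"first mismatch at index {mismatch}")
```

A result of None means the two sequences agree along their whole length, which is the condition the reference alignment must satisfy.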

2.3 Set the substructure parameters

Specify substructure parameters

Substructures are individual regions of the structure that are viewable in the 'Substructure' view. Instead of having the user manually generate a list of substructures to be viewed, StructureVis can automatically generate a list of substructures with a user-specified minimum and maximum size. Using the default sizes is generally okay. Click next.
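To make the idea of size-bounded substructures concrete, here is a minimal Python sketch. It is not the algorithm StructureVis uses, which is not described in this tutorial. It assumes the structure is available as a list of base pairs (i, j) with i < j, treats the region enclosed by a base pair as a candidate substructure, and keeps only the largest candidate at each location. The function name and the toy pairs are hypothetical.

```python
def enumerate_substructures(pairs, min_size, max_size):
    """Return (start, end) regions closed by a base pair whose length lies
    in [min_size, max_size].

    Regions nested inside an already accepted region are skipped, so only
    the largest qualifying region at each location is reported.
    """
    accepted = []
    # Visit wider pairs first so outer regions are accepted before inner ones.
    for i, j in sorted(pairs, key=lambda p: p[0] - p[1]):
        length = j - i + 1
        if not (min_size <= length <= max_size):
            continue
        if any(s <= i and j <= e for s, e in accepted):
            continue
        accepted.append((i, j))
    return sorted(accepted)


# Toy base pairs (1-based positions), not taken from the HIV-1 structure.
toy_pairs = [(1, 40), (2, 39), (5, 15), (6, 14), (20, 35), (21, 34)]
regions = enumerate_substructures(toy_pairs, min_size=10, max_size=20)
print(regions)
```

With these toy pairs the outer helix is too long and is skipped, while the two inner stems fall within the size limits. Widening or narrowing the limits changes which regions are listed, which is the effect of the minimum and maximum size settings in the wizard.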


3. Sequence annotations

Add annotations from a Genbank file and map

Sequence annotations provide a way of visualising the genome organisation of a particular organism, or the annotations of a sequence, and an additional way to navigate through the structures contained within a sequence. Click 'Add annotations from a Genbank file and map' and select the file named 'pNL4-3.gb' (this may take a few seconds). Now select a few annotations listed in the table in the upper panel by ticking the 'Use' column (for this example choose whichever ones you like). These will be displayed in the sequence annotations panel below. StructureVis will use the sequence contained within the Genbank file to map these annotations against the reference alignment (notice how the 'Start' and 'End' coordinates from the Genbank file differ dramatically from the 'Mapped start' and 'Mapped end' coordinates). You can change the colours of the annotations by clicking the rectangles in the upper panel, and the height at which they appear by changing the 'Level' column.
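The difference between the Genbank coordinates and the mapped coordinates comes from the reference sequence covering only part of the Genbank sequence. The following Python sketch illustrates the idea under a simplifying assumption: the reference sequence occurs as an exact, ungapped substring of the Genbank sequence, so mapping reduces to subtracting an offset. How StructureVis actually performs the mapping is not described here, and the function name and toy sequences are hypothetical.

```python
def map_annotation(start, end, genbank_seq, reference_seq):
    """Map 1-based inclusive Genbank coordinates onto the reference sequence.

    Returns (mapped_start, mapped_end), or None if the reference cannot be
    located or the feature lies wholly outside it. Features that overlap an
    end of the reference are clipped to it.
    """
    # 0-based position of the reference within the Genbank sequence
    # (assumes an exact, ungapped match).
    offset = genbank_seq.upper().find(reference_seq.upper())
    if offset < 0:
        return None
    mapped_start = max(start - offset, 1)
    mapped_end = min(end - offset, len(reference_seq))
    if mapped_start > mapped_end:
        return None
    return mapped_start, mapped_end


# Toy sequences: the reference is the Genbank sequence with both ends removed.
toy_reference = "ACGTACGTAC"
toy_genbank = "TTTT" + toy_reference + "GG"

print(map_annotation(6, 9, toy_genbank, toy_reference))    # inside the reference
print(map_annotation(1, 3, toy_genbank, toy_reference))    # before the reference
print(map_annotation(12, 16, toy_genbank, toy_reference))  # overlaps the end
```

A feature lying entirely in a truncated region has no mapped coordinates, and a feature overlapping an end of the reference is shortened to fit.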


4. Data overlays

Add 1D data

Data overlays are pieces of data that can be overlaid on the genomic secondary structure. In this tutorial we will add a 1D data overlay. This type of overlay is a list of nucleotide positions and their corresponding data values. Adding a 1D overlay colours the nucleotides of the secondary structure according to their corresponding data values.

  1. Click 'Add 1D data'. This will open a dialog where you can add a corresponding data source. To learn how to create your own custom 1D data source, see 'Creating a 1D data source'.
  2. Click 'Browse' next to the 'Data file' field and select the 'hiv-1-shape-reactivities.csv' file. This file contains SHAPE reactivities and pairing probabilities for nucleotides in the HIV-1 NL4-3 genome.
  3. Next add the 'NL4-3-shape-alignment.fas' mapping alignment by selecting 'Browse' next to 'Mapping alignment'. This mapping alignment allows the 1D data to be correctly mapped to the reference structure.
  4. Type 'SHAPE reactivity' in the 'Field name' field. This gives your data overlay a name, which will be displayed in the data legend and other places where this data overlay appears.
  5. Click the 'Position column' drop-down list. This will display a list of columns in the CSV file. Select the '1. Position' item. This specifies that the first column of the CSV file will be used as the position coordinates of the corresponding data values.
  6. Click the 'Data column' drop-down list. Once again this will display a list of columns in the CSV file. Select the '3. SHAPE reactivity' item. This column of the CSV file contains the SHAPE reactivities of individual nucleotides within the HIV-1 NL4-3 genome. High values indicate nucleotides that are reactive to the SHAPE chemistry, which tend to be unpaired within the genome. It is this information that allowed the authors of the paper to generate a model of the secondary structure of HIV-1 based on experimental data.
  7. From the 'Data maximum' drop-down list select 'Custom' and enter '1.0' into the adjacent text field. This will filter out any data values greater than 1.0. This is necessary for the SHAPE data because the majority of data values fall in the range 0.0 - 1.0. Because very few data values are greater than 1.0, displaying them would stretch the colour scale so that most of the colours corresponding to data values (depicted in the data legend on the right) become indistinguishable from one another.
  8. Manipulate the data legend on the right to match the colour gradient depicted in the image above. Double-click on the triangles to change their colours. Double-click on an empty region to add an intervening colour. The colours can be moved by dragging them up or down and removed by dragging them off the right of the legend.
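The effect of steps 5 to 8 can be sketched in Python: choose a position column and a data column, drop values above the data maximum, and turn each remaining value into a colour on a gradient. This is only an illustration of the concept, not StructureVis code. The function names, the presence of a header row, the column headings other than 'Position' and 'SHAPE reactivity', the evenly spaced colour stops and the toy values are all assumptions.

```python
import csv
import io


def load_1d_overlay(text, position_col, data_col, data_max=None):
    """Read {position: value} from CSV text, dropping values above data_max.

    Columns are 0-based here, so the '1. Position' and '3. SHAPE reactivity'
    items in the wizard correspond to indices 0 and 2.
    """
    overlay = {}
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row (assumed to be present)
    for row in reader:
        try:
            position, value = int(row[position_col]), float(row[data_col])
        except (ValueError, IndexError):
            continue  # skip rows with missing or non-numeric entries
        if data_max is not None and value > data_max:
            continue  # filtered out, as with a custom 'Data maximum'
        overlay[position] = value
    return overlay


def gradient_colour(value, low, high, stops):
    """Linearly interpolate an RGB colour across evenly spaced stops.

    Requires at least two stops; values outside [low, high] are clamped.
    """
    fraction = min(max((value - low) / (high - low), 0.0), 1.0)
    scaled = fraction * (len(stops) - 1)
    index = min(int(scaled), len(stops) - 2)
    t = scaled - index
    a, b = stops[index], stops[index + 1]
    return tuple(round(x + (y - x) * t) for x, y in zip(a, b))


# Toy CSV (not the contents of the tutorial file).
toy_csv = """Position,Nucleotide,SHAPE reactivity
1,G,0.00
2,G,0.50
3,U,2.75
4,C,1.00
"""

overlay = load_1d_overlay(toy_csv, position_col=0, data_col=2, data_max=1.0)
stops = [(255, 255, 255), (255, 255, 0), (255, 0, 0)]  # white, yellow, red
colours = {pos: gradient_colour(v, 0.0, 1.0, stops) for pos, v in overlay.items()}
print(colours)
```

In the toy data, position 3 has a value above 1.0 and is left out, so the gradient is spread across the range 0.0 - 1.0 where most values lie.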

5. Finish

Once you have completed the data overlays section, click next. This will take you to the last panel, which finalizes the creation of your dataset. Any errors that occur during this process will be displayed on this panel, and you will be able to go back in the wizard to rectify them.