GPRO offers three distinct tools for data mining and for managing files, folders and folder contents. If you click the "Management" menu tab, a drop-down list shows three tool options: “File and folders”, “Find sequences” and “Alignments”.
The first of these tools is organized into three sub-sections called "Join Folders", "Join Files" and "Split Files", which you can access by clicking their respective tabs at the top of the GUI.
“Join folders” is a script that lets you reorganize folders using distinct key terms such as enzyme or annotation names (even if the folders are nested inside other folders within the Directory). Figure 10.1 shows an example consisting of the following steps:
Figure 10.1. Join folders, using distinct folders with contents concerning LTR retroelement protein domains as an example: 1) drag the folder under analysis to the "input folder" area; 2) type the “filter words” and set the options; 3) use the mouse to select an output folder and drop it into the output box; 4) run the script; 5) the script exports all joined folders into other folders named according to the terms used as key words.
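The logic behind this kind of folder regrouping can be sketched in a few lines of Python. This is only an illustration and not GPRO's actual implementation: the function name `join_folders`, the case-insensitive substring matching and the choice to copy rather than move are all assumptions made for the example. It also assumes Python 3.8 or later and an output folder located outside the input folder.

```python
import shutil
from pathlib import Path


def join_folders(input_dir, output_dir, terms):
    """Copy every sub-folder whose name contains a term into output_dir/<term>/.

    Hypothetical helper: matching is a case-insensitive substring test,
    and folders are found at any depth below input_dir.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    joined = {term: [] for term in terms}
    # Take a snapshot of the folder tree before copying anything.
    folders = [p for p in sorted(input_dir.rglob("*")) if p.is_dir()]
    for folder in folders:
        for term in terms:
            if term.lower() in folder.name.lower():
                target = output_dir / term / folder.name
                # dirs_exist_ok merges folders that share the same name.
                shutil.copytree(folder, target, dirs_exist_ok=True)
                joined[term].append(folder.name)
    return joined
```

Calling `join_folders("my_input", "my_output", ["RT", "INT"])` would create `my_output/RT/` and `my_output/INT/`, each holding copies of the matching folders; the folder names here are placeholders.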
The script “Join files” allows you to select files and group them in a single file and/or a single folder. The tool accepts only FASTA and XML files. Using “Join files”, you can, for example, collect the distinct XML files of a BLAST search in a single XML file or retrieve common gene features of an annotation database divided into distinct files. The possibilities offered by this tool are the following:
In all cases, you can filter the results using criteria such as file name, extension and type of sequence (nucleotide or protein). The tool works even if the files are in distinct sub-folders of the selected input folder. The procedure is similar to that of "Join folders", but here the tool manages files instead of folders. Key terms act as including or excluding filters, and the tool either exports all files matching the term to a single folder or creates a single database file encompassing the contents of all matching files. Figure 10.2 provides a screenshot of the Join files process.
Figure 10.2. Join files: 1) drag the input folder from which you want to retrieve and reorganize contents from the Directory to the "Input folder" box; 2) check the option, in this case "Select files and place them in a single folder"; 3) type the filter; 4) drop the output folder; and finally run the script. Alternatively, if you select the option "Join files in a single file", the tool collects the contents of all files and groups them in a single file. In other words, this second option is an easy way to create DNA and protein databases.
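As a rough illustration of the two options, the following Python sketch selects files by name and extension at any depth and then either copies them to one folder or concatenates them into one file. It is not GPRO's code: `select_files`, `place_in_folder` and `join_in_single_file` are hypothetical names, and plain concatenation is assumed to be meaningful only for FASTA files (merging XML files properly would need an XML parser).

```python
import shutil
from pathlib import Path


def select_files(input_dir, term, extensions=(".fasta", ".xml"), exclude=False):
    """Return files under input_dir (any depth) whose name passes the filter.

    With exclude=False the term must occur in the file name;
    with exclude=True the term must be absent.
    """
    hits = []
    for path in sorted(Path(input_dir).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in extensions:
            continue
        matches = term.lower() in path.name.lower()
        if matches != exclude:
            hits.append(path)
    return hits


def place_in_folder(files, output_dir):
    """Copy the selected files into a single folder."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for path in files:
        shutil.copy2(path, output_dir / path.name)


def join_in_single_file(files, output_file):
    """Concatenate the selected FASTA files into one database file."""
    with open(output_file, "w") as out:
        for path in files:
            text = path.read_text()
            # Make sure each file ends with a newline before the next starts.
            if text and not text.endswith("\n"):
                text += "\n"
            out.write(text)
```

The selection step is shared by both options, which mirrors the way the GUI applies the same filter before either export mode.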
The script “Split files” is the third tab available at the top of the "Files and folders" GUI. It allows you to split a FASTA file into different files using the labels of the sequences within the original file as the classification criterion. The options offered by this tool are the following:
Sequences are selected from the FASTA headers, either according to one or more key terms in the sequence names or by blocks of sequences (1000 sequences, 10000 sequences, etc.). As shown in Figure 10.3, to use “Split files”, drag your input into the corresponding text box (as in the other scripts) and then select one of the proposed split options. For instance, if you want to split your file according to distinct search terms (such as enzyme acronyms), type each term in the box “Split by sequence names” and add it to the list on the left (you can add as many terms as desired). The script then parses the file and divides it into as many files as there are search terms. Each new file is named after its term and contains the sequences whose FASTA labels matched that term. The script also delivers an additional file (no match) containing the remaining sequences of the processed database.
Figure 10.3. Split files screenshot.
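The two split modes described above can be sketched as follows. This is a minimal example under stated assumptions rather than GPRO's implementation: the function names are hypothetical, term matching is a case-sensitive substring test on the header, and a sequence that matches several terms is placed in every matching group.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_terms(path, terms):
    """Group records by the terms found in their headers, plus a 'no_match' group."""
    groups = {term: [] for term in terms}
    groups["no_match"] = []
    for header, seq in read_fasta(path):
        matched = [term for term in terms if term in header]
        for term in matched or ["no_match"]:
            groups[term].append((header, seq))
    return groups


def split_by_blocks(path, size):
    """Cut the file into consecutive blocks of at most `size` sequences."""
    records = list(read_fasta(path))
    return [records[i:i + size] for i in range(0, len(records), size)]


def write_groups(groups, output_dir):
    """Write each group to <output_dir>/<group name>.fasta."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, records in groups.items():
        with open(out / f"{name}.fasta", "w") as handle:
            for header, seq in records:
                handle.write(f">{header}\n{seq}\n")
```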
“Find sequences” is a data mining tool for searching and extracting sequences from FASTA file databases using the labels or names of the sequences in the FASTA header as the search criterion. With this utility you can search a FASTA file using one or more terms to export sequences and create new databases, as described in Figure 10.4.
In this task:
Figure 10.4. Find sequences script.
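A header-based search of this kind can be written as a single streaming pass over the file, which keeps memory use low for large databases. The sketch below is illustrative only: `find_sequences` and its `match_all` switch are hypothetical, and matching is assumed to be a case-insensitive substring test on the header line.

```python
def find_sequences(in_path, out_path, terms, match_all=False):
    """Copy records whose header contains the search terms.

    With match_all=False a record is kept if any term occurs in its header;
    with match_all=True every term must occur. Returns the number of
    records written to out_path.
    """
    combine = all if match_all else any
    kept, keep = 0, False
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # A new record starts: decide once whether to keep it.
                header = line[1:].lower()
                keep = combine(term.lower() in header for term in terms)
                kept += keep
            if keep:
                dst.write(line)
    return kept
```

Because the decision is taken on each header line and then applied to the sequence lines that follow, the original line wrapping of the kept records is preserved in the new database.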
Frequent tasks in multiple alignment methodology are changing the format of an alignment or trying to identify the set of motifs common to all aligned sequences. GPRO implements two scripts for doing these tasks using DNA/RNA and protein multiple alignment inputs.
GPRO includes a Join Alignments script that automatically joins different files, each containing a distinct multiple alignment (one per domain), into a single alignment within a single file, arranged in a user-defined order. The tool has two requirements: the number and names of the sequences must be identical in every file to be joined, and the alignments must be provided in FASTA format. Figure 10.5 illustrates the join alignment process. As usual, the process is as follows:
Figure 10.5. Join alignments. As an example, we used different files, each containing a multiple alignment based on the GAG (red), Protease (AP, green), Reverse transcriptase (RT, blue), Ribonuclease H (RNaseH, yellow), Integrase (INT, violet) and Envelope (ENV, orange) proteins encoded by distinct Retroviridae retroviruses. We will join these six alignments in a single gag-pol alignment, organized as described in the figure.
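Joining alignments amounts to concatenating, for each sequence name, its aligned rows from every file in the chosen order. The sketch below illustrates this together with the two requirements stated above; it is not GPRO's code, and the names `read_alignment` and `join_alignments`, as well as the extra check that rows within a file share one length, are assumptions of the example.

```python
def read_alignment(path):
    """Read an aligned FASTA file into a {name: sequence} dict (order preserved)."""
    aln, name = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:]
                aln[name] = ""
            elif name is not None:
                aln[name] += line
    return aln


def join_alignments(paths):
    """Concatenate alignments column-wise, in the order the paths are given."""
    joined = None
    for path in paths:
        aln = read_alignment(path)
        # Every row of one alignment must have the same number of columns.
        if len({len(seq) for seq in aln.values()}) > 1:
            raise ValueError(f"{path}: sequences differ in length")
        if joined is None:
            joined = dict(aln)
        elif set(aln) != set(joined):
            raise ValueError(f"{path}: sequence names differ from the first file")
        else:
            for name in joined:
                joined[name] += aln[name]
    return joined or {}
```

The order of the `paths` list plays the role of the user-defined order: passing the files as GAG, AP, RT and so on yields rows laid out in that domain order.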
The alignment format converter is a tool that allows users to upload a protein or nucleotide multiple alignment file in one format and convert it into other formats in one step. The utility accepts and converts the following formats: FASTA, Clustal, Pir, MSF, Phylip and Stockholm. Any of these formats can be used for input, and the output can be written in one, several or all of the formats simultaneously. Figure 10.6 shows a scheme of the procedure.
Figure 10.6. Alignment format converter.
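To give an idea of what such a conversion involves, the sketch below writes an alignment, held as a `{name: sequence}` dictionary, in two of the listed output formats. It is a simplified illustration and not GPRO's converter: the function names are hypothetical, the Phylip output uses the relaxed sequential layout (names of any length, separated from the sequence by spaces) rather than the strict 10-character name field, and names are assumed to contain no whitespace.

```python
def check_alignment(aln):
    """Validate the alignment and return its number of columns."""
    lengths = {len(seq) for seq in aln.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    if any(not name or any(c.isspace() for c in name) for name in aln):
        raise ValueError("names must be non-empty and free of whitespace")
    return lengths.pop()


def to_stockholm(aln):
    """Render the alignment as a minimal Stockholm 1.0 block."""
    check_alignment(aln)
    width = max(map(len, aln))
    rows = [f"{name.ljust(width)} {seq}" for name, seq in aln.items()]
    return "\n".join(["# STOCKHOLM 1.0", *rows, "//"]) + "\n"


def to_phylip(aln):
    """Render the alignment in relaxed sequential Phylip layout."""
    ncols = check_alignment(aln)
    width = max(map(len, aln))
    rows = [f"{name.ljust(width)}  {seq}" for name, seq in aln.items()]
    return "\n".join([f" {len(aln)} {ncols}", *rows]) + "\n"
```

Producing several output formats at once then reduces to calling each writer on the same parsed alignment and saving the results to separate files.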
Llorens, C., Futami, R., Covelli, L., Dominguez-Escriba, L., Viu, J.M., Tamarit, D., Aguilar-Rodriguez, J., Vicente-Ripolles, M., Fuster, G., Bernet, G.P., Maumus, F., Munoz-Pomer, A., Sempere, J.M., LaTorre, A., Moya, A. (2011) The Gypsy Database (GyDB) of Mobile Genetic Elements: Release 2.0. Nucleic Acids Research (NARESE) 39 (suppl 1): D70-D74. doi: 10.1093/nar/gkq1061