In next-generation sequencing (NGS), sequence preprocessing is the process of transforming raw reads into assembly-ready sequences while generating the associated informative reports in parallel. Raw data preprocessing includes tasks such as converting raw trace files from proprietary to standard formats, deriving template information, base calling, vector screening, quality evaluation and control, disk management, the associated tracking and reporting operations, demultiplexing, sequence trimming/clipping, and the elimination of artifacts.
Preprocessing of raw data is thus a necessity that has given rise to a number of free Unix-based software tools built on a wide range of paradigms. Managing and using these tools requires some familiarity with Linux commands. With this as our primary consideration, we implemented GPRO with a multifunctional, user-friendly interface (Figure 7.1) that lets users work with the most representative preprocessing tools installed on the remote server with user-level skills only (click-and-go actions). You can access the GPRO preprocessing interface by clicking the “Data Preprocessing” tab of the main menu, highlighted in Figure 7.1 below.
Figure 7.1. Interface for data preprocessing.
Figure 7.2 outlines the menu of the preprocessing interface and shows how the different solutions installed on our server, to which GPRO currently gives access, are organized. Almost all of these preprocessing tools are free tools developed by third parties, so you must cite them if you obtain publishable results.
Figure 7.2. Preprocessing menu.
The following is a brief description of each interface tab.
This tab gives access to some scripts for format conversion (as shown in Figure 7.2). It contains two tabs, "Color space" and "Nucleotide space". The first links to a script that converts SOLiD (color-space) fasta files, paired with their quality files, into either color-space fastq (csFastq) or conventional nucleotide-based fastq.
If you have your own server coupled with GPRO, you also have a tab that you can use to run other proprietary tools or your personal scripts (if you need more details about how to proceed, please contact us).
This tab provides access to three distinct software packages for preprocessing and cleaning via the preprocessing interface.
You can use this tab to perform quality analyses using FastQC.
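To illustrate what a color-space to nucleotide conversion involves, here is a minimal Python sketch. It is not the script installed on the server; the function names `decode_colors` and `to_fastq` are hypothetical, and it assumes the usual two-base encoding in which the first character of each read is the known primer base, plus Phred+33 quality encoding in the output. Conventions for handling the primer base and its quality value differ between tools, so treat this only as an outline of the idea.

```python
# Transition table of the two-base (color-space) encoding:
# NEXT_BASE[previous_base][color] gives the next nucleotide.
NEXT_BASE = {
    "A": "ACGT",
    "C": "CATG",
    "G": "GTAC",
    "T": "TGCA",
}


def decode_colors(cs_read):
    """Translate a color-space read such as 'T0123' into nucleotides.

    The first character is the known primer base; it seeds the decoding
    and is not part of the returned sequence. An unknown color ('.')
    and every position after it become 'N'.
    """
    previous = cs_read[0].upper()
    bases = []
    for color in cs_read[1:]:
        if previous == "N" or color not in "0123":
            previous = "N"
        else:
            previous = NEXT_BASE[previous][int(color)]
        bases.append(previous)
    return "".join(bases)


def to_fastq(name, sequence, qualities, offset=33):
    """Build one four-line fastq record from integer Phred scores.

    Negative scores are clamped to 0 (an assumption of this sketch).
    """
    quality_line = "".join(chr(max(q, 0) + offset) for q in qualities)
    return f"@{name}\n{sequence}\n+\n{quality_line}\n"


record = to_fastq("read1", decode_colors("T0123"), [30, 20, 10, 40])
print(record, end="")
```

Because each base depends on the previous one, a single miscalled color changes every base after it, which is why many workflows keep the reads in color space (csFastq) until alignment.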
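As a rough idea of the kind of summary a quality analysis produces, the sketch below computes the mean Phred score at each read position from the quality lines of a fastq file. It is an illustration only and not how FastQC itself is implemented; the function name `mean_quality_per_position` is hypothetical and Phred+33 encoding is assumed.

```python
def mean_quality_per_position(quality_lines, offset=33):
    """Return the mean Phred score at each position across all reads.

    Reads may differ in length; each position is averaged over the
    reads that are long enough to cover it.
    """
    totals, counts = [], []
    for line in quality_lines:
        for i, char in enumerate(line):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


# 'I' encodes Phred 40 and '5' encodes Phred 20 under Phred+33.
means = mean_quality_per_position(["II5", "5I"])
print(means)
```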
Managing any of the aforementioned tools via the interface is intuitive. As shown in Figure 7.3, launching the preprocessing interface also opens an FTP connection between your PC and your user account on the server. The first step is to drag the files you want to process from your PC to your user account.
Then create an output folder (by right-clicking), name it as you wish, and select the file format (fastq, fasta or fasta + qual) in the interface box named "format".
Next, select in the menu the tool you are going to use (the interface will automatically display the applications and requirements of the selected tool). Then use the mouse to drag the input files you want to preprocess to the top box (small red line in the figure) and the output folder where you want the resulting files to the output folder box (larger red line).
Figure 7.3. Managing the preprocessing interface.
Finally, at the bottom of the interface there is an interactive form listing all the command options and parameters provided by the selected tool (Figure 7.4). Select the options or fill in the parameter boxes where required, and you are ready to launch the preprocessing analysis. At this point you have two options. You can click "Run program" (Figure 7.3) to launch the analysis as configured, or you can click "Append and command", in which case your command string appears in the queue box below, allowing you to prepare further analyses. In this way you can launch the same command on multiple files at once, since you only need to drag a new input file to the input box. More interestingly, if you keep the option "Use output file created by previous command as the next input file" selected, you can design an ad hoc preprocessing pipeline for a particular data file. That is, you design a command for demultiplexing your file, then another command for trimming the first 10 nucleotides at the 5' end of the output of the previous command, then another for removing from that output all sequences whose quality falls below a threshold, and so on.
Figure 7.4. Form for selecting commands and parameters.
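The chaining described above, where each queued command takes the output of the previous one as its input, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the code GPRO or the server tools run: reads are held in memory as (name, sequence, quality) tuples, the barcode is assumed to sit at the 5' end, quality is Phred+33, and all function names (`demultiplex`, `trim_5prime`, `filter_by_quality`, `run_pipeline`) are hypothetical.

```python
from functools import partial


def demultiplex(records, barcode):
    """Keep reads that start with the barcode and strip it off."""
    for name, seq, qual in records:
        if seq.startswith(barcode):
            yield name, seq[len(barcode):], qual[len(barcode):]


def trim_5prime(records, n=10):
    """Remove the first n nucleotides (and their qualities) of each read."""
    for name, seq, qual in records:
        yield name, seq[n:], qual[n:]


def filter_by_quality(records, min_mean=20, offset=33):
    """Discard empty reads and reads whose mean Phred score is too low."""
    for name, seq, qual in records:
        if qual and sum(ord(c) - offset for c in qual) / len(qual) >= min_mean:
            yield name, seq, qual


def run_pipeline(records, steps):
    """Feed the output of each step to the next one, like a command queue."""
    for step in steps:
        records = step(records)
    return list(records)


reads = [
    ("r1", "ACGT" + "A" * 10 + "GGCC", "I" * 18),           # good read
    ("r2", "ACGT" + "A" * 10 + "GGCC", "I" * 14 + "####"),  # low-quality end
    ("r3", "TTTT" + "A" * 10 + "GGCC", "I" * 18),           # other barcode
]

steps = [
    partial(demultiplex, barcode="ACGT"),
    partial(trim_5prime, n=10),
    partial(filter_by_quality, min_mean=20),
]

result = run_pipeline(reads, steps)
print(result)
```

The order of the steps matters: here the quality filter only sees what remains after demultiplexing and trimming, just as a queued command in the interface only sees the file written by the command before it.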