Menu Tools: Data preprocessing

Data preprocessing

In next-generation sequencing (NGS), sequence preprocessing is the process of transforming raw reads into assembly-ready sequences while generating the associated informative reports in parallel. Raw data preprocessing includes tasks such as converting the raw trace file from a proprietary to a standard format, deriving template information, base calling, vector screening, quality evaluation and control, disk management, the associated tracking and reporting operations, demultiplexing, sequence trimming/clipping, and elimination of artifacts.

Preprocessing of raw data is thus a necessity, and it has given rise to a number of free and open-source Unix-based software tools built on a wide range of paradigms. Managing and using these tools requires some familiarity with Linux commands. With this in mind, we implemented GPRO with a multifunctional, user-friendly interface (Figure 7.1) that lets users work with the most representative preprocessing tools installed on the remote server using only user-level skills (click-and-go actions). You can access the GPRO preprocessing interface by clicking the “Data Preprocessing” tab of the main menu, highlighted in Figure 7.1 below.


Figure 7.1. Interface for Data preprocessing


Menu

Figure 7.2 outlines the menu of the preprocessing interface, showing how the different tools installed on our server, and currently accessible through GPRO, are organized. Almost all of these preprocessing tools are free and open-source tools developed by third parties, so you must cite them if you obtain publishable results.


Figure 7.2. Preprocessing menu.


A brief description of each interface tab follows.

Converters

This tab gives access to scripts for format conversion (as shown in Figure 7.2). It contains two tabs, "Color space" and "Nucleotide space". The first links to a script that converts SOLiD (color-space) fasta files, coupled with their quality files, into either color-space fastq (csFastq) or conventional nucleotide-based fastq.
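As a rough illustration of what a color-space to nucleotide conversion involves, the Python sketch below decodes a single SOLiD color-space read. It is not the script that GPRO runs on the server; the function name `decode_color_space` and the handling of missing color calls are assumptions made for this example. It relies only on the standard SOLiD encoding, in which a read starts with a known primer base and each color (0 to 3) encodes the transition between two adjacent bases.

```python
# Hypothetical helper for illustration; not the converter script used by GPRO.
# TRANSITIONS[base][color] gives the next base for a given color call.
TRANSITIONS = {
    "A": "ACGT",
    "C": "CATG",
    "G": "GTAC",
    "T": "TGCA",
}


def decode_color_space(read: str) -> str:
    """Decode a color-space read such as 'T0123' into nucleotides.

    The leading primer base is used only to start decoding and is not
    part of the returned sequence.
    """
    previous = read[0].upper()
    if previous not in TRANSITIONS:
        raise ValueError("read must start with a primer base (A, C, G or T)")
    bases = []
    for color in read[1:]:
        if color not in ("0", "1", "2", "3"):
            # A missing call (often '.') breaks the chain of transitions,
            # so this sketch marks the rest of the read as unknown.
            bases.append("N" * (len(read) - 1 - len(bases)))
            break
        previous = TRANSITIONS[previous][int(color)]
        bases.append(previous)
    return "".join(bases)
```

Because every base depends on the one before it, a single wrong color call changes all downstream bases, which is one reason to keep data in color space (csFastq) when the downstream tools support it.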

Private user tools

If you have your own server coupled with GPRO, you also have a tab for running other proprietary tools or your personal scripts (if you need more details about how to proceed, please contact us).

Processing and cleaning

This tab provides access, via the preprocessing interface, to three distinct software packages for preprocessing and cleaning:

  1. Cutadapt (Martin 2011), for removing primers and adapters from the sequences, among many other actions. For more details, please visit the web site.
  2. FASTX-Toolkit, a collection of tools (summarized in Figure 7.2) for fasta and fastq preprocessing.
  3. PRINSEQ (Schmieder and Edwards 2011), a tool for filtering, reformatting, and/or trimming sequence data. For more information, visit the web site.
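To make concrete what removing a 3' adapter means, here is a minimal Python sketch. It is not Cutadapt's algorithm: Cutadapt uses error-tolerant alignment, whereas this example only handles exact matches. The function name `trim_3prime_adapter`, the `min_overlap` parameter, and the adapter sequence in the usage note are assumptions chosen for illustration.

```python
def trim_3prime_adapter(sequence: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove an exact 3' adapter match (full or partial) from a read.

    Illustrative only: real tools also tolerate mismatches and indels.
    """
    if not adapter:
        return sequence
    # Case 1: the full adapter occurs in the read; drop it and what follows.
    position = sequence.find(adapter)
    if position != -1:
        return sequence[:position]
    # Case 2: the read ends with a prefix of the adapter (partial adapter).
    longest = min(len(adapter), len(sequence))
    for length in range(longest, min_overlap - 1, -1):
        if sequence.endswith(adapter[:length]):
            return sequence[:len(sequence) - length]
    # No adapter found; return the read unchanged.
    return sequence
```

For example, with the adapter `AGATCGGAAGAGC`, the read `ACGTACGTAGATCGGAAG` ends with the first ten adapter bases and is trimmed to `ACGTACGT`. The `min_overlap` guard avoids trimming reads that merely happen to end in one or two adapter-like bases.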

Quality analyses

You can use this tab to perform quality analyses with FastQC.

How to proceed with the preprocessing interface

Managing any of the tools above via the interface is intuitive. As shown in Figure 7.3, launching the preprocessing interface also opens an FTP connection between your PC and your user account on the server. The first step is to drag the files you want to process from your PC to your user account.

Then create (by right-clicking) an output folder, which you can name as you wish, and select the file format (fastq, fasta or fasta + qual) in the interface box named "format".

Next, select in the menu the tool you are going to use (the interface will automatically show the applications and requirements of the selected tool). Then use the mouse to drag the input files you want to preprocess to the top box (small red line in the figure) and the output folder where you want the resulting files to the output folder box (larger red line).


Figure 7.3. Managing the preprocessing interface.


Finally, at the bottom of the interface there is an interactive form listing all the command options and parameters provided by the selected tool (Figure 7.4). Select the options or fill in the parameter boxes where required, and you are ready to launch the preprocessing analysis. At this point you have two options. You can click the "Run program" tab (see Figure 7.3) to launch the analysis as configured, or you can click the "Append and command" tab, in which case your command string appears in the queue box below, allowing you to prepare further analyses. In this way you can launch the same command on multiple files at once, since you only need to drag a new input file to the input box. More interestingly, if you keep the option "Use output file created by previous command as the next input file" selected, you can design an ad hoc preprocessing pipeline for a particular data file. That is, you design one command for demultiplexing your file, then another for trimming the first 10 nucleotides at the 5´ end of the previous command's output, then another for eliminating all sequences below a quality threshold from that output, and so on.


Figure 7.4. Form for selecting commands and parameters.
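The idea of a command queue in which each command takes the previous command's output as its input can be sketched in a few lines of Python. This is only an illustration of the chaining logic, not how GPRO or the underlying tools are implemented: the names `trim_5prime`, `filter_mean_quality` and `run_queue`, the in-memory record layout, and the Phred+33 quality encoding are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

# Assumed record layout: (read name, sequence, Phred+33 quality string).
Record = Tuple[str, str, str]
Step = Callable[[List[Record]], List[Record]]


def trim_5prime(n: int) -> Step:
    """Build a step that removes the first n bases and qualities of each read."""
    def step(records: List[Record]) -> List[Record]:
        return [(name, seq[n:], qual[n:]) for name, seq, qual in records]
    return step


def filter_mean_quality(threshold: float) -> Step:
    """Build a step that keeps reads whose mean Phred score reaches the threshold."""
    def step(records: List[Record]) -> List[Record]:
        kept = []
        for name, seq, qual in records:
            scores = [ord(char) - 33 for char in qual]  # assumes Phred+33
            if scores and sum(scores) / len(scores) >= threshold:
                kept.append((name, seq, qual))
        return kept
    return step


def run_queue(records: List[Record], queue: List[Step]) -> List[Record]:
    """Run queued steps in order; each step's output is the next step's input."""
    for step in queue:
        records = step(records)
    return records


reads = [
    ("read1", "ACGTACGTACGTACGT", "IIIIIIIIIIIIIIII"),  # 'I' is Phred 40
    ("read2", "ACGTACGTACGTACGT", "IIIIIIIIII######"),  # '#' is Phred 2
]
queue = [trim_5prime(10), filter_mean_quality(20)]
result = run_queue(reads, queue)
```

Here `read2` passes a quality filter on the untrimmed read (mean score above 20) but fails once its first 10 bases are removed, so the order of the queued commands affects the final result.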





