Schumer lab: Upload data from Sherlock to NCBI's SRA

Uploading Data from Sherlock to NCBI's SRA

In general, all raw sequencing data should be uploaded to the NCBI Sequence Read Archive (SRA) by the time of publication. You only have to submit sequencing reads that have not been uploaded as part of previous projects, so coordinate with other lab members and search the SRA database to confirm which sequences have already been added.

These are the steps provided by the NCBI as of July 2021, with specific details edited to match our workflow on the Sherlock cluster. Most information here is also available at https://submit.ncbi.nlm.nih.gov/subs/sra/

  1. Log in to the SRA Submission Portal Wizard.
  2. Create a new SRA submission (click the New submission button).
  3. Register your project (BioProject) and biological samples (BioSamples) if you have not already registered them in the BioProject and BioSample databases, respectively, or in other Submission Portal Wizards. Please refer to the SRA-specific guidelines here and to an example BioSample attributes file from the lab here.
  4. Submit SRA metadata - information that will link your project, samples/experiments and file names. Please refer to the SRA Metadata Overview and to an example metadata file from the lab here.
  5. Once you have uploaded metadata, you will receive options for uploading sequence data files.
    1. Choose the File Transfer Protocol (FTP) option, at which point you'll receive credentials that you'll need for the transfer: an address, a username, and a password, as well as an account folder of the form uploads/<NCBI_given_folder_name>
    2. Log on to Sherlock and navigate to the source folder where all files for submission are located
    3. Set up an FTP remote using rclone:
      1. Type rclone config
      2. Enter your SUNet password (this will then list out your current remotes and their associated names)
      3. Type "n" to set up new remote
      4. Input a name for the remote (e.g., "NCBI_SRA")
      5. Type 13 to choose FTP as the type of storage to configure (the number can differ between rclone versions, so check the printed list for the FTP entry)
      6. Input the address provided by NCBI as the FTP host to connect to
      7. Input the username provided by NCBI as the FTP username
      8. Leave FTP port blank and press enter to use default FTP port
      9. Type "y" to select your own password
      10. Input the password provided by NCBI as the FTP password
      11. Leave the following two security questions blank and press enter
    4. Make a folder where the files will be uploaded using a command of the following form: rclone mkdir <FTP name>:/uploads/<NCBI_given_folder_name>/<desired_upload_folder_name>
      1. When asked to "enter configuration password" during these steps, enter your SUNet password (not the password from NCBI)
    5. Copy the files using rclone copy -L -P <address of folder with files to be transferred> <FTP name>:/uploads/<NCBI_given_folder_name>/<desired_upload_folder_name> --include "*.fastq.gz"

This assumes that the reads to be uploaded are in fastq.gz format; adjust the --include pattern to send other types of data.


View the files using rclone ls <FTP name>:/uploads/<NCBI_given_folder_name>/<desired_upload_folder_name>


You will be able to see when the files have been processed and approved on the SRA Submission Portal.
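
The mkdir, copy, and ls commands above can be collected into a small script. This is only a sketch under stated assumptions: the remote name, the NCBI folder name, and the upload folder name are placeholders to replace with your own values, and the run helper with its DRY_RUN switch is a hypothetical convenience (not part of rclone) that prints each command and executes it only when DRY_RUN=0.

```shell
#!/bin/sh
set -eu

# All values below are hypothetical placeholders; substitute your own.
REMOTE="NCBI_SRA"                  # remote name chosen in `rclone config`
NCBI_FOLDER="your_ncbi_folder"     # account folder given by NCBI (after uploads/)
UPLOAD_FOLDER="my_sra_submission"  # new subfolder to hold this submission
SRC_DIR="${SRC_DIR:-.}"            # folder on Sherlock holding the reads
DRY_RUN="${DRY_RUN:-1}"            # 1 = only print commands, 0 = run them

# Build the destination path in the form used in the steps above.
sra_dest() {
    printf '%s:/uploads/%s/%s' "$REMOTE" "$NCBI_FOLDER" "$UPLOAD_FOLDER"
}

# Hypothetical helper: print the command, and run it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

DEST="$(sra_dest)"
run rclone mkdir "$DEST"                                         # create the upload folder
run rclone copy -L -P "$SRC_DIR" "$DEST" --include "*.fastq.gz"  # upload the reads
run rclone ls "$DEST"                                            # list what was uploaded
```

Running the script as-is only prints the three commands so you can check the paths. Running it with DRY_RUN=0 executes them, and rclone may then prompt for the configuration password as described above.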

Notes:

- When you run the rclone copy command, if a file shows 100% uploaded and then suddenly drops back to 0%, the timeout limit was probably reached before the file could finish uploading. Rclone will keep retrying and failing until you interrupt it. In that case, set --timeout to something longer than the default of 5m. For reference, I used --timeout 10m for a 16 GB Quail prep fastq file and that worked, but 5m did not.

- To debug the rclone copy command, add the -vv flag to get detailed error messages.

- If you want to delete a file that you uploaded, use rclone deletefile followed by the remote path of that file.
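
As a rough illustration of the notes above, the following sketch retries the copy with a longer timeout and debug logging, then removes a single uploaded file. The remote name, folder names, and the file name sample_R1.fastq.gz are hypothetical placeholders, and the run helper with its DRY_RUN switch is an illustrative convenience rather than an rclone feature.

```shell
#!/bin/sh
set -eu

# Hypothetical placeholders; substitute your own remote and folder names.
REMOTE="NCBI_SRA"
DEST="${REMOTE}:/uploads/your_ncbi_folder/my_sra_submission"
SRC_DIR="${SRC_DIR:-.}"    # folder on Sherlock holding the reads
DRY_RUN="${DRY_RUN:-1}"    # 1 = only print commands, 0 = run them

# Hypothetical helper: print the command, and run it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Retry the upload with a 10-minute timeout and debug-level output (-vv).
run rclone copy -L -P -vv --timeout 10m "$SRC_DIR" "$DEST" --include "*.fastq.gz"

# Delete one uploaded file (the file name here is an example only).
run rclone deletefile "$DEST/sample_R1.fastq.gz"
```

The 10m value is the one reported to work for a 16 GB file; larger files may need a longer timeout.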