Talk:PGP and Tranche
It sounds like the Tranche user accounts are not required for downloading public data sets? What about automated uploads? --AWZ 11:23, 10 December 2008 (EST)
Correct. Public data sets can be downloaded without signing up for a Tranche account. Private data sets can also be downloaded without a Tranche user account as long as the person provides the password that was used to encrypt the data. Automated uploads are possible, and have been set up for some groups already, with the command-line tools (http://tranche.proteomecommons.org/users/command-line-clients.html). --James A. Hill 00:04, 17 December 2008 (EST)
The user accounts are used to sign data at the time it is uploaded. When the data is uploaded to the servers, they verify the signed data to ensure the user is permitted to write. So generally, a user only needs to sign in (or use a user file) to sign the data for an upload, not for a download. We have other services (and more on the horizon) that will require signing in, but we do not plan to require signing in for downloading. If you try out the GUI, some of the additional functionality will become clearer (hopefully). --Bryan E. Smith
Great - so, maybe, one "canonical" example of a PGP data set available via Tranche would be helpful. We should probably put data behind passwords and then remove the passwords once the initial batch of publications comes out. We never thought this would be an issue but, happily, our "volunteer" effort has moved ahead very quickly! For example, we want to get http://genomerator.freelogy.org/~awz/pgp2-FC_00037_L002/ into the system (about 100GB) and put it behind a password. Later we want to remove this password. Is that an option (without re-uploading)? Also, can we retrieve just a few files from the upload? Finally, if data is uploaded with different user credentials, how do we get statistics on what data exactly is being stored by the PGP? It's late; sorry to ask so many questions that might already be answered in FAQs. Probably we should put these FAQs on the main page here (with lots of links into your existing documentation). --AWZ 01:17, 17 December 2008 (EST)
1. You can upload a password-protected (encrypted) project, and later "publish" the password. The project will automatically be decrypted at the point of download, creating no barrier for the end user. There is a little overhead, since the project must be decrypted by the download client.
2. You can retrieve a few files from the download. There are many ways to do this. For starters, running the GUI allows you to browse the uploaded project with a file system explorer view. We can work with you for more automated solutions, particularly if you want to get files ending with a specific file extension or using a regular expression.
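As a hypothetical illustration of that kind of filtering, the sketch below selects filenames with a regular expression, the same idea the command-line downloader's -r flag uses later on this page. The filenames themselves are made-up placeholders, not actual Tranche project contents:

```python
import re

# Hypothetical filenames from an uploaded project (placeholders)
filenames = [
    "C36.1/s_2_0001_a.tif.gz",
    "C36.1/s_2_0001_c.tif.gz",
    "C36.1/s_2_0001_g.tif.gz",
    "C36.1/s_2_0001_t.tif.gz",
]

# Keep only the G and C nucleotide channel images, mirroring the
# pattern passed to the downloader's -r flag in the example below.
pattern = re.compile(r"_[gc]\.tif\.gz$")
wanted = [name for name in filenames if pattern.search(name)]
print(wanted)
```

The same approach works for plain extension matching (e.g. `r"\.pf$"`).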
3. We can set up a way to tag the PGP uploads so that you shouldn't have to worry who uploads them. The easiest way for you is for us to create a custom tool that adds these tags automatically. Another way is to simply agree on a consistent title convention, and let us know. Lastly, you can create your own static HTML page with links that will directly download the projects from a user's browser, launch the web start tool with each project ready for download. We have a rich set of tools that give us a lot of flexibility, so we should discuss what sounds best.
4. I'm going to review why Andrea's upload took so long, and I'll get back to you. You might want to wait while I perform the diagnosis so that your future uploads go more smoothly.
--Bryan E. Smith
I just uploaded a test file to Tranche with no error messages, 1.9MB in 42 seconds. Looks like we're all set to go.
--Andrea Loehr 09:00, 17 December 2008
We should expect much faster uploads for larger projects, especially those with many files. There's a sunk startup cost for every upload, regardless of size. Will the uploads be archived (e.g., .tar or .tar.gz/.tgz)? --Bryan E. Smith
Bryan, I am not sure we will use uploads at all.
Andrew - can you confirm that shipping 1 TB drives is the preferred method of data transfer?
Can we completely erase the test data I have uploaded?
Who is going to sort through the public and private PGP data and prepare them for transfer to Tranche?
We're not sure what method you will use. If we have an idea of the size and schedule for the uploads, we can quickly help you decide how we should go about this. Are there going to be many separate projects? What will be their average size (ballpark)? All at once, or staggered? Also, what file formats (most likely the *.pf files you already uploaded)? Lastly, will you be gzipping the files, or will they be unzipped? --Bryan E. Smith
So far, the raw data consists of 8 data sets, each of 36 x (4 x 100) = 14,400 .tiff images at 7MB each. Basically, about 101GB of .tiff images for each of the 8 PGP volunteers. Once these are processed, the resulting .pf and some other txt files are uploaded. However, this is only 10% of a volunteer's genome. We will need to determine what will be uploaded and when, e.g. when the full data sets will be available. The data should probably be gzipped or bzipped.
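A quick back-of-the-envelope check of those numbers, using the directory/channel/tile breakdown described above:

```python
# Per data set (one volunteer's 10% release, as described above)
directories = 36
files_per_directory = 4 * 100      # 4 channel images x 100 tiles
mb_per_image = 7

images_per_set = directories * files_per_directory
gb_per_set = images_per_set * mb_per_image / 1000  # decimal GB

print(images_per_set)              # 14400 images per data set
print(round(gb_per_set, 1))        # roughly 100.8 GB per data set
```

So the "101GB" figure is per data set (per volunteer); all 8 sets together come to roughly 800GB.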
Okay, I understand. GZIP compression is the default for Tranche, so if the files are not already compressed, you may choose not to compress them yourself. (If they are already compressed, no problem.) This is a complex choice, as I pointed out in a recent email. There are several tradeoffs -- disk space, computation and transfer -- and we can continue to explore this. --Bryan E. Smith
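One way to eyeball the disk-space side of that tradeoff is to gzip a sample file and compare sizes. The content below is a synthetic stand-in for a .pf file, not real data:

```python
import gzip
import os
import tempfile

# Repetitive synthetic content as a stand-in for a .pf file
data = b"ACGT" * 25000  # 100,000 bytes

with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "sample.pf")
    with open(raw, "wb") as f:
        f.write(data)
    with gzip.open(raw + ".gz", "wb") as f:
        f.write(data)
    raw_size = os.path.getsize(raw)
    gz_size = os.path.getsize(raw + ".gz")

print(raw_size, gz_size)
```

Real sequencing files will compress differently than this synthetic example; the CPU cost of compressing and decompressing is the other side of the tradeoff.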
Moved from the main page:
Transferring Data onto Tranche
For initial data transfer, could ship (two?) USB drives to BPF:
Attn: Andrew Gagne
Biopolymers Facility
77 Ave. Louis Pasteur, Room 0088
Boston, MA 02115
- PGP2 - FC37_2 - http://genomerator.freelogy.org/~awz/pgp2-FC_00037_L002/ Note: Other data sets could appear on the hard-disk(s) with this directory structure. On arrival, data could be loaded into Tranche as 100 data "bundles" per data set (i.e. per Illumina lane).
- PGP1 - FC37_3
- PGP3 - FC35_3
- PGP5 - FC44_2
- PGP7 - FC44_4
- PGP8 - FC37_1,FC51_2,FC51_6
- PGP9 - FC43_3,FC51_3,FC51_7
- PGP10 - FC41_3
Also, could use:
- CONTROL - FC35, FC37, FC41, FC43, FC44, FC51.
For all of the above there is a top-level directory (e.g. pgp2-FC_00037_L002) and exactly 36 directories below it. Within each of those directories there are 4x100 files. For this release, it would be ideal if the data were organized in Tranche as 18x100 "randomly addressable" bundles that a volunteer computer could request as desired. Each addressable "bundle" of data would then be 4x36 files.
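A minimal sketch of how one lane's 14,400 files could be partitioned into 100 bundles of 4x36 files. The directory names (C1.1 ... C36.1, following the C36.1 example below) and filename scheme are assumptions for illustration only:

```python
# Each lane: 36 directories x (4 channels x 100 tiles) = 14,400 files.
# Bundle k gathers tile k's 4 channel images from all 36 directories,
# giving 100 bundles of 4 x 36 = 144 files each.
CHANNELS = ["a", "c", "g", "t"]  # hypothetical channel labels

def bundle(tile):
    """Return the 144 (directory, filename) pairs making up one bundle."""
    return [
        (f"C{d}.1", f"s_2_{tile:04d}_{ch}.tif.gz")
        for d in range(1, 37)
        for ch in CHANNELS
    ]

bundles = [bundle(tile) for tile in range(1, 101)]
print(len(bundles), len(bundles[0]))
```

With 18 lanes (12 volunteer lanes plus 6 control lanes), this yields the 18x100 addressable bundles described above.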
Example: Upload project, download a portion using command-line tools
- Get directory to test.
besmit@besmit-kubuntu:~/PGP-Test$ wget -r -l 1 http://genomerator.freelogy.org/~awz/pgp2-FC_00037_L002/C36.1/
- Move the downloaded directory contents to C36.1/, then upload this directory to Tranche. Uploading requires a login. See -h or --help for information about parameters. The very last argument is the directory to upload.
besmit@besmit-kubuntu:~/Desktop/TrancheLabs/Upload$ java -Xmx512m -jar Tranche-Uploader.jar -U bryan -P ********** -d "This is my description. Passphrase required for download." -t "This is my title: C35.1 encrypted" -e pgptest4 -c true C36.1/
- This is the stderr for the project. Intended for debugging, etc.
Using batch chunk upload?: yes
Started total of 10 file encoding threads.
- This is the stdout for the project - the hash used to identify the project. This should be saved.
- Download the .tif files whose filenames refer to G or C nucleotides
besmit@besmit-kubuntu:~/Desktop/TrancheLabs/Download$ java -Xmx512m -jar Tranche-Downloader.jar -e pgptest4 -r _[gc].tif.gz$ uiRL5wtqG5FyzE9PnJG47dbxuU3PqpX3aE2Gq9SNJa5vRvlgn14hwUEBW8UZyXIeQWLP9B49sb6/W8dBOz1+QfRC5UkAAAAAAAEnnA==
- The only output is the path to the download directory, shown when the download completes