Next-generation sequencing datasets are extremely large and complex, and can be difficult to process because they require prior knowledge of data processing pipelines and workflows.
The Genomics Repository supports our research teams in the analysis of large next-generation sequencing datasets. The Repository enables our researchers to safely store, analyse and share large genomic datasets, thus enhancing the reproducibility of complex data analyses within the University.
How does it work?
The repository enables researchers to add large datasets to their personal workspace and run standardised workflows on the University's Phoenix High Performance Computing (HPC) system. Once Phoenix has analysed the data, output files are returned to the researcher's workspace for downstream analyses.
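The resources below reference the Workflow Description Language (WDL), in which workflows of this kind are typically written. As an illustration only, a minimal WDL workflow might look like the following; the task, tool choices and file names here are hypothetical examples, not the repository's actual standardised workflows.

```wdl
version 1.0

## Hypothetical sketch: aligns sequencing reads to a reference genome.
## Tool choices (bwa, samtools) are illustrative assumptions.
workflow AlignReads {
  input {
    File reads_fastq
    File reference_fasta
  }

  call Align {
    input:
      reads = reads_fastq,
      reference = reference_fasta
  }

  output {
    File aligned_bam = Align.bam
  }
}

task Align {
  input {
    File reads
    File reference
  }

  command <<<
    # Assumes bwa and samtools are available on the compute node
    bwa mem ~{reference} ~{reads} | samtools sort -o aligned.bam
  >>>

  output {
    File bam = "aligned.bam"
  }

  runtime {
    cpu: 4
    memory: "16 GB"
  }
}
```

Because the workflow declares its inputs, outputs and runtime requirements explicitly, the same definition can be re-run on the HPC system by any collaborator, which is what makes the analysis reproducible.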
Incremental backups of the Genomics Repository are performed daily with full backups being performed monthly.
What is high-throughput sequencing?
A genome is an organism's complete set of DNA, stored in long molecules of DNA called chromosomes. To study the genome of an organism, high-throughput genome sequencing technologies have been developed that produce a large amount of data. Other approaches, such as sequencing only the genes that are expressed (RNA-seq or transcriptome sequencing), can also be carried out using these technologies, allowing researchers to study many elements of the organism's genetic system at low cost.
How can the Genomics Repository help researchers?
The Genomics Repository will enable the University to centralise and simplify genomic data storage, analysis and collaboration. The goal is to provide bioinformaticians and non-computational researchers with the tools they need to perform their research without specialist data processing knowledge, as pre-prepared processing commands are available in the repository. This will result in more efficient preparation and execution of data analysis, allowing for completely reproducible data processing workflows that can be included in high-quality research outputs. Additionally, by centralising genomic data in a backed-up storage repository, researchers will now be able to re-use genomic data with University collaborators, enabling additional outcomes to be found from a single dataset.
As one of Australia's most research-intensive institutions, we need to be innovative in the way we collect, analyse and share our research data. The processing of massive datasets through the Genomics Repository and the Phoenix HPC system is key to achieving this goal and will deliver dividends in research outputs and impact.
The Genomics Repository has extensive data storage capabilities that will enable our researchers to store and process data in a timely manner, with the assurance of regular data backup. Datasets can be shared within the repository, which will enhance collaboration across research groups. Bioinformaticians and non-computational researchers will benefit from this critical platform capability, which can be utilised without specialist data processing knowledge.
The Faculty, Institute and University community look forward to the results of this enhanced research capability and collaborative opportunity in genomic data.
Sarah Robertson and Jimmy Breen
Robinson Research Institute
The following resources are available if you'd like to learn more:
- User guide (subject to change with system improvements)
- Presentation about the Genomics Repository
- Component diagram
- Workspace lifecycle diagram
- Cleaning your workspace (data lifecycle)
- Simple data movements diagram
- Workflow Description Language (WDL)