22 March 2012

Internship proposals from Illumina (Master, PhD Thesis)

Below you will find 7 internship proposals from Illumina Cambridge UK. To apply, send your CV and a cover letter to the relevant contact. Applications will be reviewed until the end of March, after which candidates with a suitable profile will be invited to a phone interview.

Internships can last 3 to 6 months and start according to the availability of the candidate.

Title: Algorithms for structural variant detection in next-generation sequencing data
Supervisors: Ole Schulz-Trieglaff and Richard Shaw
Contact email: oschulz-trieglaff@illumina.com, rshaw@illumina.com

This internship project focuses on variant detection in resequencing experiments: sequencing reads from a sample are aligned to its reference and genetic variants (insertions, deletions, inversions, tandem duplications and inter-chromosomal translocations) in the sample are detected by searching for reads that align in an unexpected fashion to the reference.

GROUPER ("Guided Reassembly Of Unaligned Paired-End Reads") is Illumina's variant calling algorithm and part of the CASAVA software suite. It is a modular workflow with three stages: clustering of anomalous reads, de-novo assembly of each cluster, and alignment and interpretation of the resulting contigs.

The aim of this project is to implement and test one of several potential improvements to the GROUPER workflow. Among them are:
• Read clustering: We currently use a straightforward clustering algorithm based on read alignment positions and overlaps. Clustering is well studied in the data mining and statistics communities, so there is a plethora of existing tools that could be tried.
• Enhanced event type interpretation: Currently, initial read clustering is segregated by a few simple variant types, and if there is ambiguity after subsequent consolidation, one such type is chosen in an ad hoc fashion. Instead, the evidence for genuine compound variants could be considered, avoiding the consolidation of adjacent but separate variants.
• Local de-novo assembly: GROUPER assembles clusters of reads using a de Bruijn graph approach. There are several possible improvements such as a string-graph assembly [1] or a re-assembly guided by alignment positions.
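
To make the read clustering idea concrete, here is a minimal Python sketch of clustering by alignment position and overlap. It is only an illustration and not GROUPER's actual implementation: the read representation (chromosome, start, end), the function name cluster_reads and the max_gap parameter are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical read representation: (chromosome, start, end), half-open coordinates.
Read = Tuple[str, int, int]


def cluster_reads(reads: List[Read], max_gap: int = 0) -> List[List[Read]]:
    """Group reads on the same chromosome whose alignments overlap
    or lie within max_gap bases of the current cluster."""
    clusters: List[List[Read]] = []
    current: List[Read] = []
    current_chrom = None
    current_end = 0

    # Sorting by (chromosome, start, end) lets us sweep each chromosome once.
    for read in sorted(reads):
        chrom, start, end = read
        if current and chrom == current_chrom and start <= current_end + max_gap:
            # Read touches the running cluster: extend it.
            current.append(read)
            current_end = max(current_end, end)
        else:
            # Start a new cluster (first read, new chromosome, or a gap).
            if current:
                clusters.append(current)
            current = [read]
            current_chrom, current_end = chrom, end

    if current:
        clusters.append(current)
    return clusters


if __name__ == "__main__":
    example = [("chr1", 100, 200), ("chr1", 150, 250),
               ("chr1", 400, 500), ("chr2", 120, 220)]
    for cluster in cluster_reads(example):
        print(cluster)
```

A more refined approach from the clustering literature would replace the simple sweep above, for example by also taking read orientation or insert size into account.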

Depending on the interests and previous experience of the candidate, we will agree on one (!) of these improvements as a topic for the internship. We have a large number of real and simulated data sets on which the implementation can be tested.

The intern will learn to work in a collaborative Bioinformatics research environment and gain experience in programming and algorithm development. We also expect a written report summarizing the results of the project and a short oral presentation.

This project is suitable for a student with good programming skills, ideally in C/C++, and an interest in biological applications. Previous knowledge of Makefiles, code version management (cvs), a scripting language and Linux is advantageous but can be acquired during the internship.

[1] Eugene W Myers (2005) "The fragment assembly string graph" Bioinformatics

Title: Estimating tumour heterogeneity in cancer sequencing samples
Supervisors: Sergii Ivakhno, Jennifer Becq
Contact email: sivakhno@illumina.com, jbecq@illumina.com

Copy number aberrations (CNAs) represent an important type of genomic alterations in cancer that can be uncovered with high-throughput genome sequencing (HTG). Interactions between CNAs and other mutation types such as point mutations can enhance the understanding of evolutionary history of individual tumours. The assignment of discrete copy number states at a particular genome location is however complicated by varied ploidy and purity of tumour samples and also internal heterogeneity that many cancers exhibit [1]. We have developed a method called CNAseg, based on Hidden Markov Models (HMM), to disentangle contributions of a non-diploid genome and normal sample admixture towards the final CNA calls [2].

The aim of the internship is to explore possibilities of predicting tumour heterogeneity in addition to ploidy/purity values. The candidate will start by exploring correlations between different germline variants (e.g. based on B-allele ratios) in well-characterized cancer samples, with the aim of finding a heterogeneity signature [3]. Various regression-based models will be used for this task. The heterogeneity correlates that are found will be incorporated into the CNAseg HMM model and tested on real and simulated datasets. There is also scope to explore the feasibility of classifying tumour samples based on their heterogeneity signature.

This internship will be suitable for a person with a statistical background and with knowledge of genetics/cancer biology. Programming and modeling skills in R are essential; intermediate proficiency in C++ is a plus.

1. Ivakhno S, Royce T, Cox AJ, Evers D, Cheetham KR, Tavaré S. (2010) CNAseg: a novel framework for identification of copy number changes in cancer from second-generation sequencing data, Bioinformatics, 26, 3051-3058.
2. Pleasance, E., Cheetham KR, et al. (2010) A comprehensive catalogue of somatic mutations from a human cancer genome, Nature, 463, 191–196.
3. Yau, C., et al. (2010) A statistical approach for detecting genomic aberrations in heterogeneous tumor samples from single nucleotide polymorphism genotyping data, Genome Biology, 11, R92

Title: Using hashing algorithms to cluster DNA sequences
Supervisors: Tony Cox and Ole Schulz-Trieglaff
Contact email: acox@illumina.com, oschulz-trieglaff@illumina.com

The analysis of large read sets is a computationally demanding task. In particular, de-novo assemblies of large genomes require a prohibitive amount of RAM.

This project will explore the use of hashing algorithms to cluster sequence reads with similar content, the aim of which is to serve as a preprocessing step for de novo assembly. The idea is to cluster reads with overlapping sequences using a hashing function. Once the sequences in each cluster have been assembled into a contig (sequence), the subsequent assembly of these contigs is less complex than the assembly of the entire genome directly from the individual reads.
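
As a rough illustration of the idea, the Python sketch below places reads in the same bucket when they share the same minimum-hash k-mer, so that reads with overlapping sequence tend to land together. This is one possible hashing scheme chosen for the example, not the method the project prescribes. The function names, the choice of k and the use of BLAKE2b as the hash are assumptions, and reverse complements are ignored for simplicity.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (BLAKE2b is an arbitrary choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def min_hash_kmer(read: str, k: int) -> str:
    """Return the k-mer of the read with the smallest hash value."""
    kmers = (read[i:i + k] for i in range(len(read) - k + 1))
    return min(kmers, key=kmer_hash)


def cluster_by_min_hash(reads: List[str], k: int = 8) -> Dict[str, List[str]]:
    """Bucket reads by their minimum-hash k-mer; reads shorter than k are skipped."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        if len(read) >= k:
            buckets[min_hash_kmer(read, k)].append(read)
    return dict(buckets)


if __name__ == "__main__":
    reads = ["ACGTACGTTGCA", "CGTACGTTGCAA", "TTTTGGGGCCCC", "ACG"]
    for key, members in cluster_by_min_hash(reads, k=8).items():
        print(key, members)
```

Two overlapping reads end up in the same bucket whenever their shared region contains the minimum-hash k-mer of both reads. Each bucket could then be assembled independently before the resulting contigs are merged.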

This hashing approach has several other applications, including metagenomic analyses and sequence database queries.

We have several whole-genome de-novo assembly data sets, both real and simulated, on which the implementation can be tested. We also expect a written report summarizing the results of the project and a short oral presentation.

The intern will learn to work in a collaborative Bioinformatics research environment and gain experience in programming, algorithm development and the application of de-novo assemblers.

This project would suit a candidate with a strong background in computer science or mathematics. Previous use of C or C++ in some form of project work is essential. An understanding of hash functions and other computer science techniques and a familiarity with algorithms for sequence analysis are both highly desirable.

Title: Combined analysis of methylation array and RNA-Seq data in colorectal cancer samples
Supervisor: Jennifer Becq
Contact email: jbecq@illumina.com

The methylation state of genomic loci is thought to be correlated with their level of transcription. The global methylation status of a cell can be assessed using the Illumina HumanMethylation450 array, which measures the level of methylation of ~500,000 sites, most of which fall within RefSeq genes. Sequencing of mRNA (RNA-Seq) allows an exhaustive measure of the global expression of a cell. For this internship project, HumanMethylation450 data and RNA-Seq data are available for several tumour/normal pairs of colorectal cancer samples. The aim of this internship is to combine the analysis of those two datasets in order to discover potential biomarkers or therapeutic targets of colorectal cancer.

The intern will try available methodologies on this particular dataset, assess the quality of the results and propose the most suitable pipeline. The intern will learn to work in a collaborative bioinformatics research environment within the industry and gain experience in biological data analysis. We also expect a written report summarizing the results of the project and a short oral presentation.

This project would suit a student whose undergraduate degree or previous experience includes a significant amount of bioinformatics and biological data analysis. In particular, the student should be familiar with a scripting language (e.g. Python or Perl) and a statistical tool such as R, and should have a fairly good understanding of methylation and transcription processes in human/eukaryotic cells. Experience in array data analysis and/or next-gen sequencing data would be ideal.

Title: Applications of the Burrows-Wheeler transform to DNA sequence data
Supervisors: Dr. Markus Bauer, Dr. Tony Cox
Contact emails: mbauer@illumina.com, acox@illumina.com

The Burrows-Wheeler transform (BWT) [1] is a permutation of a string with special properties that have given rise to many applications within pattern matching and data compression. Based on the BWT, in 2000 Ferragina and Manzini presented a compressed text index, called the FM-index [2], that allows the occurrences of a pattern to be counted in time linear in the length of the search pattern, and also supports locating them.
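
For readers new to these ideas, the following Python sketch computes the BWT of a short string and counts pattern occurrences with FM-index style backward search. It is a teaching illustration only: it builds the BWT naively by sorting all rotations and stores an uncompressed occurrence table, which would not scale to real read sets. The function names and the "$" terminator are assumptions made for this example.

```python
from collections import Counter


def bwt(text: str) -> str:
    """Naive BWT: sort all rotations of text + '$' and take the last column."""
    text = text + "$"  # '$' sorts before A, C, G, T and all ASCII letters
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count occurrences of pattern in the original text via backward search."""
    counts = Counter(bwt_str)

    # first[c] = number of characters in the text that are smaller than c.
    first = {}
    total = 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    # occ[c][i] = number of occurrences of c in bwt_str[:i] (uncompressed table).
    occ = {c: [0] * (len(bwt_str) + 1) for c in counts}
    for i, ch in enumerate(bwt_str):
        for c in counts:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)

    # Process the pattern right to left, narrowing the suffix range [lo, hi).
    lo, hi = 0, len(bwt_str)
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)
    print(count_occurrences(transformed, "ana"))
```

The backward search loop performs one step per pattern character, which is where the linear-time counting bound comes from.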

Recent work [3] has demonstrated it is feasible to compute FM-index style data structures for sets of reads on the scales seen in human genome sequencing experiments. This project will explore practical applications of this concept to areas such as transcriptomics and metagenomics.

This project would suit a student whose undergraduate degree or previous experience includes a significant amount of computing. Some previous familiarity with algorithms and data structures for text indexing and string matching would be a distinct advantage. The student would need to be able to program confidently in C or (ideally) C++.

[1] Donald Adjeroh, Timothy Bell and Amar Mukherjee. The Burrows-Wheeler transform: data compression, suffix arrays and pattern matching. Springer 2008, ISBN 978-0-387-78908-8
[2] Paolo Ferragina and Giovanni Manzini. Opportunistic Data Structures with Applications. FOCS 2000.
[3] Markus J. Bauer, Anthony J. Cox, Giovanna Rosone: Lightweight BWT Construction for Very Large String Collections. Proc. CPM 2011: 219-23

Title: Learning models for predicting the quality of variant calls from sequencing data
Supervisors: Epameinondas Fritzilas and Adrian Alexa
Contact email: efritzilas@illumina.com, aalexa@illumina.com

An important application of high-throughput sequencing is the detection of the differences between a sequenced sample genome and a pre-assembled reference genome. In a nutshell, this involves the alignment of the sequenced reads against the reference genome and the detection of groups of reads that align in “unexpected” ways. The accuracy of the results depends on several factors: some are related to static properties of the reference genome, such as repeat content, some are related to the chemistry used for the sample preparation and sequencing, such as uniformity of coverage and single-base accuracy, and some are related to the robustness of the alignment and variant calling algorithms.

Understanding exactly how the above factors interact with each other, and what their relative impact is on the accuracy of the reported genomic variants, is a challenging problem in the bioinformatics community. Having a large number of sequenced genomes available gives us the opportunity to approach this problem with systematic unsupervised and supervised learning methods. In particular, we want to discover the most informative features, quantify how they correlate with each other and combine them in predictive models.

This project is an opportunity to work in a cutting-edge R&D environment and gain significant intuition about the mechanics of analysing high-throughput sequencing data with the aim of understanding the quality of the reported genetic variants. At the methodological level, the intern will sharpen their machine-learning skills in a data-rich environment, will get familiar with the interplay between chemistry and algorithms and understand the inherent difficulties of the process. Datasets from Illumina’s latest sequencing platforms will be made available for the evaluation of the investigated methods.

This project requires a candidate with a good background in statistical learning methods for the analysis of multidimensional data. In terms of technical skills, the candidate is expected to implement the investigated methods and, therefore, it is essential to be familiar with the R statistical environment and related machine learning packages or with another open-source framework that offers equivalent functionality. Familiarity with C/C++ and a scripting language is an advantage. Previous experience with high-throughput sequencing data would be ideal, but not a strict requirement.

Title: Methods for annotating non-coding variants
Supervisor: Stewart MacArthur
Contact email: smacarthur@illumina.com

A lot of effort has been put into understanding the effect of variants in coding sequence. It is possible to predict with some accuracy the effect a simple variant, such as a SNP or insertion/deletion, has on a coding sequence, the resulting protein and eventually phenotype. The same is not true of non-coding variants. The absence of a genetic code or an understanding of the grammar of regulation makes it much more complicated to predict how any given non-coding variant will affect phenotype.

There has been some effort to predict the effect of non-coding variants by the Variant Effect Predictor (VEP), using information from publicly available datasets, such as ChIP-seq and DNase hypersensitivity assays, and from the calculation of predicted transcription factor binding motifs. However, these are largely overlap-based annotations, with no prediction regarding the severity of the variant.

The intern would develop methodologies to assess the likely severity of non-coding variants. The approach is flexible to the interests of the intern and could include predicting new transcription factor binding motifs, using multi-species genome alignments to assess the evolutionary conservation of affected positions, or other methods such as using ENCODE and other publicly available data to infer variant effects.

This project would suit a student with previous experience of bioinformatics and managing large data sets. The student should be familiar with a scripting language (e.g. Perl), and experience with R would be beneficial. A good knowledge of the biology of gene regulation and mechanisms of transcriptional control would be useful.
