Methylation Preprocessor
Overview
The methylation preprocessor filters methylation data for use in downstream pipelines. Filtering happens in two stages: the first stage is common to all pipelines, while the second stage is specific to each destination pipeline. The first stage removes problematic rows, removes redundant columns, reorders the columns, and sorts the data by gene name. The second stage is described below for each destination pipeline. Note that the SKCM AWG recently shared /wiki/spaces/GDAC/pages/844333174, which we will try to implement in time for the March/April analysis run (this page is restricted to GDAC members and requires login).
Control Flow
- Standardize
  - standardize methylation data
  - standardize expression data (if expression data is available)
- Preprocess
  - generate data for clustering pipeline
  - generate data for clinical correlation pipeline
  - generate data for expression correlation pipeline (if expression data is available)
- Cleanup
  - delete all temporary files
Stage 1: Standardization
The data files are stripped of any information that is not relevant to their downstream pipeline, and then sorted by gene name.
(It is worth noting that sorting is done on disk. Methylation files can exceed the amount of memory on a machine, so sorting in memory is not always possible. Writing to disk allows us to leverage the highly optimized Unix sort command, which performs an on-disk merge sort within a specified amount of memory.)
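As a rough illustration of the on-disk sort, the sketch below shells out to sort from Python. It assumes GNU sort (for the -S buffer-size option), a tab-separated file with no header row, and hypothetical function names, paths, and memory limit; none of these details come from the preprocessor itself.

```python
import os
import subprocess


def build_sort_command(src, dst, key_column, memory="2G", tmp_dir="/tmp"):
    """Build a GNU sort command that sorts a tab-separated file on one column.

    All parameter names and defaults here are illustrative assumptions.
    """
    return [
        "sort",
        "-t", "\t",                          # fields are tab-separated
        "-k", f"{key_column},{key_column}",  # sort on this column only (1-based)
        "-S", memory,                        # cap the in-memory buffer (GNU sort)
        "-T", tmp_dir,                       # directory for spilled temporary runs
        "-o", dst,                           # write the sorted result here
        src,
    ]


def sort_on_disk(src, dst, key_column, memory="2G", tmp_dir="/tmp"):
    """Run the sort; LC_ALL=C forces plain byte-order comparison."""
    env = dict(os.environ, LC_ALL="C")
    subprocess.run(
        build_sort_command(src, dst, key_column, memory, tmp_dir),
        check=True,
        env=env,
    )
```

A call such as sort_on_disk("methylation.tsv", "methylation.sorted.tsv", key_column=2) would sort on the second column while keeping memory use near the requested limit. A file with a header row would need that row set aside before sorting.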
Methylation Data
Input Format
- Each row corresponds to a single methylation probe.
- The first column lists the name of the methylation probe.
- Each subsequent block of four columns corresponds to a single TCGA barcode:
  - Beta_Value
  - Gene_Symbol
  - Chromosome
  - Genomic_Coordinate
Filtering
- Remove rows where:
  - Chromosome is X or Y
  - Gene_Symbol is NA
  - more than 5% of the Beta_Values are NA
- Split rows corresponding to probes targeting multiple genes.
  - Sometimes a probe overlaps or targets multiple genes. The Gene_Symbol of such a row holds the gene names separated by semicolons. When this happens, we duplicate the row once for each gene in the Gene_Symbol and remove the original row.
- Remove redundant columns.
  - For every column of Beta_Values, there are three columns of other metadata. This metadata is constant across an entire row, so we keep the first three metadata columns and all Beta_Value columns, and discard the rest.
- Sort by Gene_Symbol.
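The row-level filtering steps above can be sketched in Python as follows. This is a minimal illustration, not the preprocessor's actual code: the function name, the representation of a row as a list of strings, and the output column order (probe, gene, chromosome, coordinate, then all beta values) are assumptions.

```python
def standardize_row(fields, max_na_fraction=0.05):
    """Return zero or more output rows for one input row.

    fields: [probe, beta, gene, chrom, coord, beta, gene, chrom, coord, ...]
    Output layout (an assumption): [probe, gene, chrom, coord, beta, beta, ...]
    """
    probe = fields[0]
    # Metadata is constant across the row, so read it from the first block.
    gene, chrom, coord = fields[2], fields[3], fields[4]
    # Beta_Value is the first column of every four-column block.
    betas = fields[1::4]

    # Drop sex-chromosome probes and probes without a gene annotation.
    if chrom in ("X", "Y") or gene == "NA":
        return []

    # Drop rows where more than 5% of the beta values are missing.
    na_count = sum(1 for b in betas if b == "NA")
    if na_count > max_na_fraction * len(betas):
        return []

    # Emit one copy of the row per gene listed in Gene_Symbol.
    return [[g, ] and [probe, g, chrom, coord] + betas for g in gene.split(";")]
```

For example, a row whose Gene_Symbol is "A;B" yields two output rows, one labelled A and one labelled B, each carrying the same beta values.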
Expression Data
This file is optional. If one is provided, we will generate methylation data for expression correlation.
Input Format
- Each row corresponds to a single gene.
- The first column holds the Gene_Symbol, followed by some additional data.
- All subsequent columns are Beta_Values.
Filtering
- Remove the extra data from the Gene_Symbol column.
  - The gene name is followed by a pipe and a genomic coordinate. We don't need the genomic coordinate, so we drop it.
- Sort by Gene_Symbol.
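Stripping the extra data from the first column amounts to keeping everything before the first pipe. A minimal sketch, with a hypothetical function name and made-up example values:

```python
def strip_gene_annotation(first_column):
    """Keep only the gene name, i.e. the text before the first pipe."""
    return first_column.split("|", 1)[0]


# Hypothetical values; real files may use different gene names and suffixes.
print(strip_gene_annotation("GENE1|12345"))  # GENE1
print(strip_gene_annotation("GENE2"))        # GENE2 (no pipe, left unchanged)
```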
Stage 2: Preprocessing
Clustering
For each gene, we select the probe whose beta values have the maximum standard deviation. We then discard any selected probe whose standard deviation falls below a specified cutoff. The default cutoff is 0.2, but it can be tuned based on the desired output file size.
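The selection rule can be sketched as below. The input representation (tuples of gene, probe name, and beta values, with None standing in for NA), the function name, and the choice of population rather than sample standard deviation are all assumptions made for illustration.

```python
from statistics import pstdev


def select_probes_for_clustering(probes, cutoff=0.2):
    """Pick the highest-variance probe per gene, then apply the cutoff.

    probes: iterable of (gene, probe_name, beta_values); None marks an NA.
    Returns {gene: (probe_name, std_dev)}.
    """
    best = {}
    for gene, name, betas in probes:
        observed = [b for b in betas if b is not None]
        if not observed:
            continue  # nothing to measure for this probe
        sd = pstdev(observed)  # population std dev (an assumption)
        if gene not in best or sd > best[gene][1]:
            best[gene] = (name, sd)
    # Discard genes whose best probe still falls below the cutoff.
    return {g: v for g, v in best.items() if v[1] >= cutoff}
```

Raising the cutoff keeps fewer genes and so shrinks the output file; lowering it does the opposite.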
Correlation with expression data
We start by filtering the methylation and expression data even further. We make a list of genes and individuals appearing in each file, and keep only the beta values for genes and individuals that appear in both. This leaves us with multiple methylation probes per gene, and a single measured expression level per gene.
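The intersection step can be sketched as follows. The function name and the use of plain label lists are assumptions, and the sketch compares labels exactly; how barcodes are mapped to individuals is not specified here.

```python
def shared_labels(meth_genes, meth_samples, expr_genes, expr_samples):
    """Return the genes and samples present in both data sets."""
    genes = sorted(set(meth_genes) & set(expr_genes))
    expr_set = set(expr_samples)
    # Keep the methylation file's column order for the shared samples.
    samples = [s for s in meth_samples if s in expr_set]
    return genes, samples
```

Both files would then be reduced to these genes and samples before any correlation is computed.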
For each gene, we select the probe that is most anticorrelated with that gene's expression data. (That is, we compute the Spearman rho between the gene's row of expression data and each of its rows of methylation data, and we keep the row of methylation data that produces the most negative rho value.)
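A dependency-free sketch of this selection is shown below. It computes Spearman's rho as the Pearson correlation of average ranks, and it assumes that the values are already aligned by individual and contain no missing entries. The function names and the dictionary input are illustrative, not the preprocessor's own.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; None when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return None
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def most_anticorrelated_probe(expression, probes):
    """probes: {probe_name: beta_values aligned with expression}.

    Returns (probe_name, rho) for the most negative rho, or (None, None).
    """
    best_name, best_rho = None, None
    for name, betas in probes.items():
        rho = spearman_rho(expression, betas)
        if rho is not None and (best_rho is None or rho < best_rho):
            best_name, best_rho = name, rho
    return best_name, best_rho
```

Probes with constant beta values have an undefined rho and are skipped in this sketch; a library routine such as scipy.stats.spearmanr could replace the hand-written ranking where SciPy is available.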