Help

Introduction to GEPIA

The GEPIA series has provided robust and widely used tools for analyzing gene expression data derived from TCGA, GTEx and ICGC, as well as cell line drug screen data. GEPIA enabled comprehensive studies of differential expression, survival analysis, and gene correlation.

Quick Start

Enter a gene/isoform symbol or gene/isoform ID (Ensembl ID) in the "Enter gene name" field, and click the "GoPIA!" button to search for the gene of interest.

Expression Calculation in GEPIA3

All functions are based on the human reference genome hg38. GEPIA3 uses RNA-seq gene or transcript expression in TCGA tumor, TCGA peritumor and GTEx normal tissues. The expression values are TPM (Transcripts Per Million) and expected counts calculated by RSEM from the UCSC Toil RNA-seq Recompute. Cell line expression was calculated from CCLE RNA-seq samples: we downloaded RSEM-calculated TPM (log2 transformed) from DepMap 22Q1. The gene and isoform biotype annotations are based on GENCODE v46. Users can optionally apply log2 transformation to the input, with two exceptions: (a) DESeq2 only accepts expected counts as input; (b) statistical methods that require normality assumptions for expression analysis. We highly recommend using log2 transformed TPM as input in most common cases. Boxplots and scatter plots with expression values use log2 transformed TPM, as marked in the axis labels.
Signature: all expression-based functions in GEPIA3 support multiple genes as input. In that case, the signature score is the first principal component value derived from Principal Component Analysis (PCA) of the expression levels of the input genes.
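
The signature score can be illustrated with a minimal sketch. This is not GEPIA3's implementation: the function name `pc1_signature_scores`, the use of power iteration to find the first principal component, the centering step, and the toy values are all assumptions made for the example. The sign of a principal component is arbitrary, so only the relative ordering of the scores is meaningful.

```python
import math

def pc1_signature_scores(expr, n_iter=200):
    """Return one first-principal-component score per sample.

    expr: list of samples; each sample is a list of expression values
    (e.g. log2(TPM+1)) for the signature genes, in the same gene order.
    """
    n, g = len(expr), len(expr[0])
    # Center each gene (column) at zero mean.
    means = [sum(row[j] for row in expr) / n for j in range(g)]
    centered = [[row[j] - means[j] for j in range(g)] for row in expr]
    # Gene-by-gene sample covariance matrix.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(g)]
           for a in range(g)]
    # Power iteration converges to the leading eigenvector (PC1 loadings).
    # Assumption: the asymmetric start vector is not orthogonal to PC1.
    v = [float(j + 1) for j in range(g)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(g)) for a in range(g)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance in the data; keep the current vector
        v = [x / norm for x in w]
    # Project every centered sample onto the PC1 loadings.
    return [sum(r[j] * v[j] for j in range(g)) for r in centered]

# Hypothetical values: 4 samples x 3 signature genes.
toy = [[1.0, 2.0, 1.5], [2.0, 4.1, 3.0], [3.0, 6.0, 4.4], [4.0, 8.2, 6.1]]
scores = pc1_signature_scores(toy)
```

Because the three toy genes rise together across the samples, the scores order the samples from lowest to highest overall signature expression.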

The update from GEPIA2 to GEPIA3

GEPIA3 introduces substantial updates, including modules for drug sensitivity analysis that link gene expression profiles with therapeutic outcomes. Additionally, it incorporates advanced network analysis frameworks, leveraging eQTL, co-mutation, and protein interaction data to map regulatory landscapes. Comprehensive tools for studying multiple types of RNA alterations have also been added, enabling the investigation of diverse transcriptomic modifications. These innovations establish GEPIA3 as an indispensable resource for integrative and translational cancer genomics research.

Drug Analysis

TCGA Drug Response Analysis

    This function enables users to compare gene/signature expression level between patients with different drug responses annotated in TCGA.

Parameters

  • Gene: Input one or a group of genes of interest.
  • Methods: Select ANOVA or Wilcoxon for differential analysis.
  • log2 transformed: Select whether to use log2(TPM+1) as expression abundance.
  • Datasets Selection: Select one or multiple cancer types of interest in the "Dataset Selection" field and click "add" to build the dataset list in the "Datasets" field.
  • Groups: Choose which groups to compare.

Results

    The table shows the results of differential expression analysis for gene expression levels between patients with different drug responses.

TCGA Drug Treatment vs Survival

    This function allows users to compare the relationship between gene expression and survival in patients with and without a specific drug treatment.

Parameters

  • Gene: Input a gene/isoform or gene signature of interest.
  • Methods: Select the OS or PFS survival response.
  • Axis Units: Select Month or Day unit for plotting.
  • Datasets Selection: Select one or multiple cancer types of interest in the "Dataset Selection" field and click "add" to build the dataset list in the "Datasets" field.
  • Colors: Choose the colors used to plot the KM curves.
  • Group Cutoff: Select a suitable expression threshold for splitting the high-expression and low-expression cohorts.
  • Cutoff-High(%): Samples with expression level higher than this threshold are considered the high-expression cohort.
  • Cutoff-Low(%): Samples with expression level lower than this threshold are considered the low-expression cohort.
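
The cutoff parameters can be pictured with a short sketch. It assumes that Cutoff-High(%) and Cutoff-Low(%) are percentiles of the expression distribution, that percentiles are linearly interpolated, and that samples falling between the two thresholds are left out of both cohorts. The function names and sample identifiers are hypothetical, and GEPIA3's exact rule may differ.

```python
def percentile(sorted_values, pct):
    """Linearly interpolated percentile (pct in 0-100) of a sorted list."""
    k = (len(sorted_values) - 1) * pct / 100.0
    lower = int(k)
    upper = min(lower + 1, len(sorted_values) - 1)
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * (k - lower)

def split_cohorts(expression, cutoff_high=50.0, cutoff_low=50.0):
    """Split samples into high- and low-expression cohorts.

    expression: dict mapping sample id -> expression value.
    Samples strictly above the Cutoff-High percentile go to the high cohort,
    samples strictly below the Cutoff-Low percentile go to the low cohort.
    """
    values = sorted(expression.values())
    high_thr = percentile(values, cutoff_high)
    low_thr = percentile(values, cutoff_low)
    high = sorted(s for s, v in expression.items() if v > high_thr)
    low = sorted(s for s, v in expression.items() if v < low_thr)
    return high, low

# Hypothetical samples s1..s8 with expression 1.0..8.0, split at quartiles.
expression = {f"s{i}": float(i) for i in range(1, 9)}
high, low = split_cohorts(expression, cutoff_high=75, cutoff_low=25)
```

With the default 50/50 setting the two thresholds coincide at the median, so the cohorts are split around the median expression.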

Results

    The result table shows the results of Cox regression analysis for gene expression levels and patient survival in the drug and non-drug groups. Click the button beside a drug name to calculate the survival differences between the four patient groups: high expression with the drug, low expression with the drug, high expression without the drug, and low expression without the drug. Kaplan-Meier (KM) survival curves will be generated for each pairwise comparison.

Cell Line Drug Screen

    This function allows users to calculate the correlation between drug response and gene expression/copy number in the CTRP, GDSC and CREAMMIST databases.

Parameters

  • Gene: Input a gene/isoform or gene signature A of interest.
  • Drug Screen Dataset: Select the dataset to use.
  • Drug Sensitivity Metric: Select the score for drug sensitivity measurement.
  • Cell Line Types: Select one or multiple cell types and click "+" to build the dataset.

Results

    The result table shows the correlation coefficients between the query gene's expression level and drug responses in the selected cell lines. Z score is the Fisher's Z of the correlation coefficient r. The plot shows the drugs whose response is most strongly associated with the gene's expression (the top 10 drugs with the highest r and the top 10 with the lowest r, respectively).
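
For reference, Fisher's Z transformation of a correlation coefficient r is z = atanh(r) = 0.5 * ln((1 + r) / (1 - r)). The sketch below computes a Pearson r and its Fisher's Z in plain Python. The function names and toy values are illustrative; whether GEPIA3 additionally scales Z by the sample size (for example by sqrt(n - 3)) is not stated here, so the sketch applies the plain transformation only.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_z(r):
    """Fisher's Z transformation; undefined for |r| = 1."""
    return math.atanh(r)

# Hypothetical expression and drug-response values for four cell lines.
expression = [1.0, 2.0, 3.0, 4.0]
response = [1.0, 3.0, 2.0, 4.0]
r = pearson_r(expression, response)
z = fisher_z(r)
```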

Cell CRISPR Screens

    This function allows users to compare the cell responses to CRISPR gene perturbation between drug-treated and untreated conditions.

Parameters

  • Gene: Input a gene for analysis. Gene IDs are not supported; please use the official gene name. A total of 17,382 genes were included in the screens, so not all genes are available.
  • Treatment Datasets: Click "+" to select datasets for drug treatment. Data with more than five entries and data for nonexistent selected genes will be filtered out.

Results

    The result shows the gene's NormZ value (a normalized Z-score of sgRNA read counts comparing DNA damage-treated and untreated cells). Negative NormZ values represent genes whose mutation leads to their depletion from the cell population after genotoxin exposure, whereas positive NormZ scores represent genes whose mutation leads to a selective growth advantage in the presence of the drug. See the original paper for details.

Network Analysis

Comprehensive Analysis

    This feature allows users to explore multiple interaction relationships of genes based on selected datasets and interaction types. STRING database annotations can also be optionally included.

Parameters

  • Gene Input: Input a list of gene symbols or Ensembl IDs of interest.
  • Interaction Types: Choose interaction types, including:
    • Positive Correlation
    • Anti-correlation
    • Co-alteration
    • Mutually Exclusive Alteration
    • Synthetic Lethality (SL) / Synthetic Viability (SV)
    • OncoPPI: Oncogenic Protein-Protein Interactions
  • Nodes Displayed: Choose the number of nodes you want to include in the network graph.
  • Datasets: Select cancer datasets of interest (e.g., TCGA Tumor). Multiple datasets can be chosen, or manual input separated by commas is supported.
  • STRING Annotation: Enable this option to include gene relationships from the STRING database.

Notes

  • Only gene pairs with an absolute Pearson Correlation Coefficient (PCC) greater than 0.5 are considered in the analysis of expression patterns.
  • When searching, the displayed network highlights the top k nodes (k is set by the "Nodes Displayed" input) with the highest degrees connected to the target gene within the selected datasets and interaction patterns.
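
The two notes above can be sketched as a filter followed by a ranking. The sketch assumes that "top k nodes with the highest degrees connected to the target gene" means the target's direct neighbors ranked by their degree in the filtered network. The function name `top_neighbors`, the edge tuples and the PCC values are hypothetical.

```python
from collections import defaultdict

def top_neighbors(edges, target, k, min_abs_pcc=0.5):
    """Return up to k neighbors of `target`, ranked by node degree.

    edges: iterable of (gene_a, gene_b, pcc) tuples.
    Only edges with |pcc| > min_abs_pcc are kept.
    """
    kept = [(a, b) for a, b, pcc in edges if abs(pcc) > min_abs_pcc]
    degree = defaultdict(int)
    neighbors = set()
    for a, b in kept:
        degree[a] += 1
        degree[b] += 1
        if a == target:
            neighbors.add(b)
        elif b == target:
            neighbors.add(a)
    neighbors.discard(target)  # ignore self-loops
    # Highest degree first; ties broken alphabetically for determinism.
    ranked = sorted(neighbors, key=lambda g: (-degree[g], g))
    return ranked[:k]

# Hypothetical co-expression edges around TP53.
edges = [
    ("TP53", "MDM2", 0.8),
    ("TP53", "CDKN1A", 0.6),
    ("TP53", "EGFR", 0.3),    # dropped: |PCC| <= 0.5
    ("MDM2", "CDKN1A", -0.7),
    ("MDM2", "BAX", 0.55),
]
shown = top_neighbors(edges, "TP53", k=1)
```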

Results

  • Edges: A table summarizing gene pairs, interaction types, confidence measures (such as score for STRING and SL/SV, PCC for co-expression, and p-value/q-value for co/me-alteration and oncoPPI), and relevant cancer datasets.
  • Network Visualization: An interactive graph displaying gene relationships. Nodes represent genes, and edges represent interactions. Different colors are used to indicate interaction types.

Alteration Network

    This feature allows users to identify and visualize gene alteration relationships, including co-alteration and mutually exclusive alteration patterns, within selected datasets and genes.

Parameters

  • Gene: Input a gene/isoform or multiple genes of interest.
  • Alteration Pattern: Select alteration pattern of interest.
  • ICGC Tumor/Selected Datasets: Select cancer types of interest in the "ICGC Tumor" field and click "add" to build the dataset list in the "Selected Datasets" field. Manual input of cancer types separated by commas (e.g. Kidney-RCC,Liver-HCC) is also acceptable. The interaction analysis is based on the dataset list.

Notes

  • We considered six types of alterations: alternative promoter usage, expression outliers, genetic variants, alternative splicing events, allele-specific expression, and copy number variations.
  • When searching, the displayed network highlights the top 10 nodes with the highest degrees connected to the target gene within the selected datasets and interaction patterns.

Results

  • Table: Genes that are co-altered or mutually exclusively altered (me-altered) with the target gene, along with selected features of those genes.
  • Heatmap Visualization: Visualizes the alteration status of selected genes across samples from the user-selected cohort. Each bar represents an individual sample, with red indicating alteration and gray indicating no alteration. The percentage on the right reflects the proportion of altered samples for each gene.
  • Network Visualization: An interactive graph displaying gene relationships. Nodes represent genes, and edges represent interactions. Different colors are used to indicate interaction types.

Expression Network

    This feature allows users to search for genes that are most correlated with a target gene or gene signature within selected datasets.

Parameters

  • Gene: Input a gene/isoform or gene signature of interest.
  • Top # Genes: Set the number of top correlated genes to display. (Maximum: 1000)
  • Interaction Pattern: Select interaction pattern of interest.
  • Log Transformation: Apply log2 transformation to normalize the gene expression data.
  • TCGA Tumor/TCGA Normal/GTEx/Used Expression Datasets: Select cancer types of interest in the "TCGA Tumor", "TCGA Normal" or "GTEx" field and click "add" to build the dataset list in the "Used Expression Datasets" field. Manual input of cancer types separated by commas (e.g. COAD Tumor,READ Tumor) is also acceptable. The correlation analysis is based on the dataset list.

Results

  • Table: Displays the top correlated genes, their correlation coefficients, and the number of tumor, normal, and GTEx samples analyzed.
  • Network Visualization: For single gene input, the top correlated genes are displayed in a network, with annotation from the STRING database.

Protein Interaction

    The OncoPPI module identifies oncogenic protein-protein interactions (oncoPPIs) for input genes across selected cancer types, leveraging interaction data from large-scale studies and structural databases. This module is built upon the findings from the study "A structurally informed human protein–protein interactome reveals proteome-wide perturbations caused by disease mutations" published in Nature Biotechnology on October 24, 2024. Specifically, the module uses data from Supplementary Table 3, which includes significance tests of somatic mutation enrichment in protein-protein interaction (PPI) interfaces across 33 TCGA cancer types. The interaction data integrates predictions from the PIONEER deep-learning framework and experimentally resolved structures. In addition to the PIONEER-predicted interactions, structural data were curated from:
  • The Protein Data Bank (PDB): Co-crystal structures of PPIs were downloaded from the PDB FTP site.
  • Interactome3D: Homologous structures for PPIs without co-crystal data were collected from Interactome3D.
  • For citation, please refer to: A structurally informed human protein–protein interactome reveals proteome-wide perturbations caused by disease mutations

Parameters

  • Gene List: Input one or multiple genes to query their oncogenic protein interactions.
  • Datasets: Select TCGA cancer types (e.g., "CHOL Tumor", "BRCA Tumor") to filter relevant oncogenic interactions.

Results

  • Table: Summarizes oncogenic protein interaction details:
    • Gene A: Input gene involved in the interaction.
    • Gene B: Interacting partner gene.
    • P-value A / P-value B: Statistical significance of the somatic mutation enrichment for each gene.
    • FDR: Adjusted False Discovery Rate for multiple testing correction.
    • Cancer Type: Specific cancer type where the interaction is identified.
    • Source: Indicates the method used to define protein-protein interface regions.
  • Network Visualization: Visualizes oncogenic protein interactions, annotated with STRING database.

SLSV Analysis
    This module allows users to explore genetic interactions (GIs) across selected cancer types, focusing on synthetic lethality (SL) and synthetic viability (SV) relationships between gene pairs. For citation, please refer to: "Harnessing genetic interactions for prediction of immune checkpoint inhibitors response signatures in cancer cells."
Parameters
  • Gene: Input a gene/isoform or multiple genes of interest.
  • Interaction Pattern: Select interaction pattern of interest.
  • Tissue/Selected Datasets: Select tissues of interest in the "Tissue" field and click "add" to build the dataset list in the "Selected Datasets" field. Manual input of tissues separated by commas (e.g. pancancer,blood) is also acceptable.
Results
  • Table: Displays SL and SV interaction results:
    • Gene A: The queried gene symbol.
    • Gene B: The interacting partner gene.
    • Interaction Type: Indicates whether the interaction is SL or SV.
    • Tissue: Specifies the tissue or cancer type where the interaction occurs.
  • Network Visualization: Visualizes genetic interactions as an interactive network, allowing for exploration of SL and SV relationships.

eQTL

    This module allows users to identify both cis-eQTLs and trans-eQTLs across 33 cancer types derived from The Cancer Genome Atlas (TCGA). For citation, please refer to: PancanQTL: systematic identification of cis-eQTLs and trans-eQTLs in 33 cancer types.

Parameters

  • Gene: Input a gene/isoform or multiple genes of interest.
  • eQTL Type: Choose between "Cis-eQTLs" for local gene expression regulation or "Trans-eQTLs" for distant gene expression regulation.
  • TCGA Tumor/Selected Datasets: Select cancer types of interest in the "TCGA Tumor" field and click "add" to build the dataset list in the "Selected Datasets" field. Manual input of cancer types separated by commas (e.g. COAD Tumor,READ Tumor) is also acceptable. The interaction analysis is based on the dataset list.

Results

  • Table: Displays eQTL results:
    • Gene: The queried gene symbol.
    • SNP ID: The identifier of the single nucleotide polymorphism (SNP) associated with gene expression (eQTL).
    • Alleles: The observed allelic variants at the SNP locus (e.g., A/G, T/C), representing the different nucleotides in the population.
    • Cancer Type: The specific cancer type in which the eQTL association was identified.
    • Beta: The effect size indicating how much the SNP alters gene expression; a positive value suggests upregulation, while a negative value suggests downregulation.
    • P-value: The statistical significance of the eQTL association; lower values indicate stronger evidence against the null hypothesis of no association.
    • FDR: The false discovery rate–adjusted p-value accounting for multiple testing; controls the expected proportion of false positives.

    RNA Alterations

    Allele-Specific Expression

      This function enables users to identify allelic expression imbalances in one or more genes, along with the factors driving these imbalances.

    Parameters

    • Gene: Input one or a group of genes of interest.

    Results

    • Table: Displays the allele-specific expression levels of the gene in cancer, as well as whether the gene is an imprinted gene, a cancer gene, or a cancer-testis gene.
    • Plot: Heatmaps that show the effect sizes of influencing factors and the input somatic burdens.

      Effect size here refers to the magnitude of the relationship between the genomic factors and allelic expression imbalance, as quantified by generalized linear model. Zoom-in plots for somatic mutation factors are provided to ensure that finer details are not masked by factors with larger effect sizes.

    Alternative Promoter

      This function provides users with information on relative promoter activity and enables comparisons between tumor samples and their corresponding normal tissues.

    Parameters

    • Gene: Input one or a group of genes of interest.
    • ICGC Tumor: Select cancer types of interest.

    Results

    • Table: Shows the basic information of promoters and their average relative activities.
    • Barplot: Displays average activities in tumors and their corresponding normal tissues. Note that only the p-values < 0.10 will be annotated on the plot.

    Gene Fusion

      This function allows users to search for gene fusion events in specific cancer types.

    Parameters

    • Gene: Input one or a group of genes of interest, or enter "all" to search for all fusion events in the given cancer types.
    • ICGC Tumor: Select cancer types of interest.
    • show top n frequent genes: Set the number of genes to be displayed in the bar plot for gene frequency.

    Results

    • Table: Displays the basic information of fusion events and the frequency of these events in the given cancer types.
    • Circos Plot: Visualization for the gene fusion events. Note that when searching for all events in a given cancer type, only the top n (set by users) most frequent genes will be annotated on the plot.
    • Barplot: Shows the top n (set by users) most frequent genes in the search result.

    Other Updates

    Differential Expression with Hotspot Mutation

      This function enables users to compare gene/signature expression level between patients with and without key hotspot mutations.

    Parameters

    • Gene: Input a gene/isoform or gene signature of interest.
    • Methods: Select ANOVA or Wilcoxon for differential analysis.
    • log2 transformed: Select whether to use log2(TPM+1) as expression abundance.
    • Hotspot Mutation Type: Select the type of mutation. Gene level means any nonsilent mutation in the gene. Amino acid level means a change at a specific amino acid site (HGVS protein-level notation).
    • Datasets Selection: Select one or multiple cancer types of interest in the "Dataset Selection" field and click "add" to build dataset list in the "Datasets" field.
    • Plot Color: Choose group color to plot differential expression boxplot and correlation scatter plot.

    Results

      The table shows the significant results of differential expression between hotspot mutant and wildtype cases. Click the button beside mutation to plot gene/signature expression distribution of mutant and wildtype groups, as well as the expression correlation with the hotspot gene.

    Functions

    Gene Card

    Enter a gene symbol or gene ID (Ensembl ID), and click the "GoPIA" button to quickly search for your gene of interest.

    GEPIA provides an interactive bodymap to visualize the expression in both tumor and normal samples, as well as an expression profile that displays the gene's expression level across all tumor samples and normal tissues.

    Profile Summary

    Differential Genes

    This function allows users to list the tumor/normal tissue differentially expressed genes or isoforms in a cancer type, and plot the chromosomal distribution of these genes.

    Users can apply custom statistical methods and thresholds on a given dataset to dynamically obtain differentially expressed genes. For the DESeq2 and limma options, genes with higher |log2FC| values and lower q values than pre-set thresholds are considered differentially expressed genes.

    Parameters

    • Dataset Selection: Select a cancer type of interest.
    • BioType: Type of genes or isoforms.
    • Normal samples: Select "TCGA normal + GTEx normal" or "TCGA normal" for matched normal data.
    • |log2FC| Cutoff: Set custom fold-change threshold.
    • q-value: Set custom q-value threshold.
    • Gene/Isoform: Query differentially expressed genes or isoforms.
    • Differential Methods: Select a method for differential analysis.

    Results

      Click the "List" button: GEPIA will generate a list of differentially expressed genes.

    Prognostic Genes

    This function enables users to search for genes most associated with patient survival based on a cancer type.

    GEPIA uses the Log-rank test for hypothesis testing across all genes, with q-values calculated for multiple testing corrections.

    Parameters

    • Datasets Selection: Select a cancer dataset of interest.
    • Biotype: Type of genes or isoforms.
    • Group Cutoff: Select a suitable expression threshold for splitting the high-expression and low-expression groups.
    • TimeResponse: Select the OS or PFI survival method.
    • Gene/Isoform: Query most prognosis-associated genes or isoforms.

    Results

      Click the "List" button: GEPIA will generate a list of prognostic genes.

    Find Drugs

    This function enables users to perform survival analysis based on TCGA patients using specific drug treatment.

    GEPIA performs differential analysis of Overall Survival (OS) or Progression-free Interval (PFI) between patients with and without a specific drug treatment. GEPIA uses the Log-rank test, also known as the Mantel-Cox test, for hypothesis testing.

    Parameters

    • Drugs: Select a drug of interest. Both generic name and trade name are provided.
    • TCGA Cancer Types/Selected Datasets: Select one or multiple cancer types and click "+" to build the dataset.
    • TimeResponse: Select the OS or PFS survival response.
    • TimeUnit: Select Month or Day unit for plotting.
    • Colors: Choose color to plot.

    Results

      Click the "Run" button: GEPIA will generate a survival curve comparing the two groups (with or without drug treatment).

    Analysis

    Expression Analysis

    Expression DIY

      GEPIA plots expression profiles of a given gene according to selected datasets and statistical methods by cancer types or pathological stages. The details are shown below:

    Profile DIY

      GEPIA generates dot plots profiling gene expression across multiple cancer types and paired normal samples, with each dot representing a distinct tumor or normal sample.

    Parameters

    • Gene: Input a gene/isoform of interest.
    • TCGA Tumor/Selected Datasets: Select cancer types of interest in the "TCGA Tumor" field and click "+" to build a dataset list in the "Selected Datasets" field. Manual input of cancer types split by comma (e.g. ACC,BRCA,BLCA) is also acceptable. The x-axis of the plot will follow the order of datasets.
    • Normal Samples: Select "TCGA normal + GTEx normal" or "TCGA normal" for matched normal data in plotting. [The matched normal samples for differential analysis are "TCGA normal + GTEx normal"]
    • Differential Methods: Statistical methods used for differential gene expression analysis.
    • |log2FC| Cutoff: Set custom fold-change threshold.
    • q-value: Set custom q-value threshold.

    Results

      Click the “Run” button: GEPIA will generate a gene expression profile based on users' custom input parameters.

    Boxplot

      GEPIA generates box plots with jitter for comparing expression in several cancer types (For best visual quality, we recommend 1-4 cancer types).

    Parameters

    • Gene: Input a gene or multiple genes of interest.
    • TCGA Tumor/Selected Datasets: Select cancer types of interest in the "TCGA Tumor" field and click "+" to build a dataset list in the "Selected Datasets" field. Manual input of cancer types split by comma (e.g. ACC,BRCA,BLCA) is also acceptable. The x-axis of the plot will follow the order of datasets.
    • Normal Samples: Select "TCGA normal + GTEx normal" or "TCGA normal" for matched normal data in plotting. [The matched normal samples for differential analysis are "TCGA normal + GTEx normal"]
    • |log2FC| Cutoff: Set custom fold-change threshold.
    • q-value: Set custom q-value threshold.
    • High Group: Set the box color of tumor dataset and the font color for cancer types with elevated expression.
    • Low Group: Set the box color of normal dataset and the font color for cancer types with reduced expression.

    Results

      Click the “Run” button: GEPIA will present a gene expression box plot based on users' custom input parameters.

    Correlation Analysis

      This function performs pair-wise gene expression correlation analysis for given sets of TCGA and/or GTEx expression data, using methods including Pearson, Spearman and Kendall. One gene can be normalized by another gene.
      GEPIA3 uses the log-scale for calculation and visualization.

    Parameters

    • Gene A: Input a gene or multiple genes of interest. [For x-axis]
    • Gene B: Input a gene or multiple genes of interest. [For y-axis]
    • TCGA Tumor/TCGA Normal/GTEx/Selected Expression Datasets: Select cancer types of interest in the "TCGA Tumor", "TCGA Normal" or "GTEx" field and click "add" to build the dataset list in the "Used Expression Datasets" field. Manual input of cancer types separated by commas (e.g. COAD Tumor,READ Tumor) is also acceptable. The correlation analysis is based on the dataset list.
    • Correlation Coefficient: The method for calculating the correlation coefficient.
    • use log2 (TPM+1) scale: Prior to computation, data are transformed to the log2(TPM+1) scale to enhance robustness.

    Results

      Click the “Run” button: GEPIA will generate a scatter plot of correlation analysis result.

    Dimensionality Reduction

      GEPIA performs Principal Component Analysis (PCA) for a given gene list using custom TCGA and/or GTEx expression data. First, GEPIA presents a 3D plot of the top three principal components (PCs) and generates a bar plot of the variance explained by each PC. Second, GEPIA presents a 2D or 3D plot based on user-specified PCs.

    Parameters

    • Gene Set: Input a gene list of interest. Genes can be entered manually, separated by commas (e.g. ERBB2,EGFR).
    • TCGA Tumor/TCGA Normal/GTEx/Selected Expression Datasets: Select cancer types of interest in the "TCGA Tumor", "TCGA Normal" or "GTEx" field and click "add" to build the dataset list in the "Used Expression Datasets" field. Manual input of cancer types separated by commas (e.g. COAD Tumor,READ Tumor) is also acceptable. The analysis is based on the dataset list.
    • use log2 (TPM+1) scale: Prior to computation, data are transformed to the log2(TPM+1) scale to enhance robustness.

    Results

      Click the "Run" button: GEPIA will generate a scatter plot of the dimensionality reduction result.

    Hotspot Mutation

    Survival Analysis

    This function enables users to perform survival analysis based on the expression status of one gene or a multi-gene signature and plot a Kaplan-Meier curve. Some gene signature lists are provided.

    GEPIA performs Overall Survival (OS) or Progression-free Interval (PFI) analysis based on gene expression. GEPIA uses the Log-rank test, also known as the Mantel-Cox test, for hypothesis testing. Cohort thresholds can be adjusted. The Cox proportional hazard ratio and the 95% confidence interval can also be included in the survival plot.
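
For readers unfamiliar with the Log-rank (Mantel-Cox) test, the following is a minimal two-group sketch in plain Python. It is a textbook formulation, not GEPIA's code: the function name `logrank_test`, the event coding (1 = event observed, 0 = censored) and the toy follow-up times are assumptions made for the example.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Returns (chi-square statistic, p-value).

    times_*: follow-up times; events_*: 1 if the event was observed,
    0 if the sample was censored at that time.
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        # Numbers at risk just before time t in each group.
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)
        # Events occurring exactly at time t.
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        d_b = sum(1 for ti, e, g in data if ti == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the event count in group A.
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = observed_minus_expected ** 2 / variance if variance > 0 else 0.0
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical cohorts: low-expression group fails early, high-expression late.
chi2, p = logrank_test([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5)
```

The same routine applies whether the two cohorts are defined by expression cutoffs or by drug treatment status.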

    Parameters

    • Gene: Input a gene/isoform or gene signature of interest.
    • TimeResponse: Select the OS or PFS survival response.
    • TimeUnit: Select Month or Day unit for plotting.
    • TCGA Tumor/Selected Datasets: Select one or multiple cancer types of interest in the "Dataset Selection" field and click "add" to build dataset list in the "Datasets" field.
    • Group Cutoff: Select a suitable expression threshold for splitting the high-expression and low-expression cohorts.
    • Cutoff-High(%): Samples with expression level higher than this threshold are considered as the high-expression cohort.
    • Cutoff-Low(%): Samples with expression level lower than this threshold are considered the low-expression cohort.
    • Colors: Choose color to plot.

    Results

    • Univariable
    • Multivariable

      Drug Analysis

      See GEPIA Update - Drug Analysis

      Network Analysis

      See GEPIA Update - Network Analysis

      RNA Alterations

      See GEPIA Update - RNA Alterations

      Differential Analysis

      We use three methods for differential expression analysis using TCGA tumor samples with paired adjacent TCGA normal samples and GTEx normal samples.

      Why do we use both TCGA normal and GTEx normal samples for differential analysis? TCGA produced 11,257 tumor samples across 33 cancer types, but only 1,475 normal samples. The imbalance between the tumor and normal data can cause inefficiency in various differential analyses. Fortunately, the GTEx project produced RNA sequencing data for ~8,000 normal samples. Meanwhile, the UCSC Xena project recomputed the TCGA and GTEx raw RNA-Seq data using a standard pipeline, which makes the two datasets compatible. As a result, the TCGA and GTEx data can be integrated for comprehensive expression analysis.

      We consulted with multiple medical experts to determine the most appropriate tumor-normal comparisons. The comparisons and data we used are presented in the Dataset Sources - Tissues.

      Differential methods

      ANOVA

      Considering the different stratifications of sex, age, ethnicity in tumor and normal samples, we applied four-way analysis of variance (ANOVA), using sex, age, ethnicity and disease state (Tumor or Normal) as variables for calculating differential expression:

      Gene expression ~ sex + age + ethnicity + disease state

      The expression data are first log2(TPM+1) transformed and the log2FC is defined as median(Tumor) - median(Normal).
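
As a worked illustration of this definition, the sketch below applies the log2(TPM+1) transformation and takes the difference of medians. The function name `log2fc` and the TPM values are made up for the example.

```python
import math
from statistics import median

def log2fc(tumor_tpm, normal_tpm):
    """log2FC = median(log2(TPM+1) in tumor) - median(log2(TPM+1) in normal)."""
    tumor = [math.log2(x + 1) for x in tumor_tpm]
    normal = [math.log2(x + 1) for x in normal_tpm]
    return median(tumor) - median(normal)

# Hypothetical TPM values: tumor medians at log2(8) = 3, normal at log2(2) = 1.
fold_change = log2fc([3, 7, 15], [1, 1, 3])
```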

      The Benjamini and Hochberg false discovery rate (FDR) method was used to adjust the p-value in each factor to obtain the multiple testing adjusted q-value.
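
The Benjamini-Hochberg adjustment itself is standard and can be sketched in a few lines. The function name and the p-values below are illustrative only.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted q-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

# Hypothetical p-values for four genes.
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```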

      LIMMA

      For an alternative method, we use the linear model and the empirical Bayes method implemented by the R package limma, with adjusted p-value (Benjamini and Hochberg FDR). The limma method leverages the highly parallel nature of genomic data, borrowing information between the gene-wise models.

      Similarly, the expression data are first log2(TPM+1) transformed and the log2FC is defined as median(Tumor) - median(Normal).

      Top 10

      Cancer drug targets such as ERBB2 and VEGF are over-expressed in a subset (10-20%) of tumors and lowly expressed in all normal tissues. When discovering cancer drug targets, genes like ERBB2 and VEGF are ideal candidates as therapeutic targets. For this purpose, we implemented a method for detecting the genes that are over-expressed in only a subset of tumors for a given cancer type.

      For each cancer type, we choose tumor samples that have the top 10% expression level for a given gene. For comparison, we choose the same number of normal samples that have the highest expression level for the same gene. We rank the tumor and normal samples by expression level and calculate the percentage of tumor samples in top 50% ranked list as the percentage value. The expression data are first log2(TPM+1) scaled and the log2FC is defined as median(Tumor) - median(Normal).

      By default, we report the over-expressed genes as those passing following thresholds:

      log2FC > 1, percentage > 0.9.
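
      The Top 10 procedure can be sketched as follows. This is one possible reading of the description above, under two assumptions that the text does not state: the number of selected tumor samples is rounded to the nearest integer (at least one), and the log2FC is computed on the selected tumor and normal subsets. The function name is hypothetical.

```python
import math
from statistics import median


def top10_profile(tumor_tpm, normal_tpm, top_fraction=0.10):
    """Return (log2FC, percentage) for one gene in one cancer type."""
    if not tumor_tpm or not normal_tpm:
        raise ValueError("both tumor and normal samples are required")

    # log2(TPM + 1) transform, sorted from highest to lowest expression.
    tumor = sorted((math.log2(v + 1.0) for v in tumor_tpm), reverse=True)
    normal = sorted((math.log2(v + 1.0) for v in normal_tpm), reverse=True)

    # Assumption: round to the nearest sample count, keeping at least one.
    n = max(1, round(top_fraction * len(tumor)))
    top_tumor = tumor[:n]
    top_normal = normal[:n]  # same number of highest-expressing normals

    # Rank the pooled samples and count tumors in the top half of the list.
    pooled = [(v, "tumor") for v in top_tumor] + [(v, "normal") for v in top_normal]
    pooled.sort(key=lambda item: item[0], reverse=True)
    top_half = pooled[: max(1, len(pooled) // 2)]
    percentage = sum(1 for _, label in top_half if label == "tumor") / len(top_half)

    # Assumption: log2FC is taken over the selected subsets.
    log2fc = median(top_tumor) - median(top_normal)
    return log2fc, percentage


# Hypothetical gene: high in 2 of 20 tumors, low in all normals.
tumor = [1023.0, 255.0] + [1.0] * 18
normal = [3.0, 1.0] + [0.0] * 8
print(top10_profile(tumor, normal))  # prints (7.5, 1.0)
```

      In this toy example both selected tumor samples outrank both selected normal samples, so the percentage is 1.0 and the gene passes the default thresholds.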

      For example, CLEC3A is over-expressed in breast cancer, as shown in the profile below:

      Images

      Definition of differentially expressed genes

      For the ANOVA and LIMMA methods, genes whose |log2FC| is above and whose q-value is below the pre-set thresholds are considered differentially expressed genes.

      For the Top 10 option, genes whose log2FC and percentage values are both above the thresholds are considered over-expressed genes.
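
      Expressed as code, the two definitions amount to simple threshold checks. In this sketch the function names are hypothetical, the Top 10 defaults are the ones given above (log2FC > 1, percentage > 0.9), and the ANOVA/LIMMA cutoffs are left as required arguments because they are chosen by the user.

```python
def is_differentially_expressed(log2fc, q_value, log2fc_cutoff, q_cutoff):
    """ANOVA / LIMMA rule: |log2FC| above and q-value below the chosen cutoffs."""
    return abs(log2fc) > log2fc_cutoff and q_value < q_cutoff


def is_over_expressed_top10(log2fc, percentage, log2fc_cutoff=1.0, percentage_cutoff=0.9):
    """Top 10 rule: log2FC and percentage both above their cutoffs."""
    return log2fc > log2fc_cutoff and percentage > percentage_cutoff


# Hypothetical cutoffs of 1.0 and 0.01 for the ANOVA / LIMMA rule.
print(is_differentially_expressed(-2.3, 0.001, 1.0, 0.01))  # prints True
print(is_over_expressed_top10(7.5, 1.0))                    # prints True
```

      Note that the ANOVA/LIMMA rule uses the absolute log2FC and therefore captures both up- and down-regulated genes, whereas the Top 10 rule is one-sided and reports only over-expression.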

      Data Pre-processing

      Data Collection

      For each functionality, GEPIA3 selected the datasets with the largest data size or the most comprehensive compilation of existing resources.

      | Functionalities | Resources | Selection criteria |
      | --- | --- | --- |
      | Expression analysis | TCGA | Integrative pan-cancer multi-omic resources |
      | | GTEx | Largest public expression data source for normal tissues |
      | Drug cell line responses | GDSC | Well-established systematic screening datasets for drug response in over 900 cancer cell lines |
      | | CTRP | |
      | | CREAMMIST | Standardized and aggregated drug response in different projects, providing an integrative dose-response curve across datasets |
      | Drug CRISPR screen | Olivieri et al. Cell. 2020 | Collected from the BioGRID ORCS database with the following criteria: a) published; b) high-throughput screens with gene size above 10000; c) human cells exposed to cancer drugs; d) systematic screens for more than 3 different drugs |
      | | Lau et al. Genome Biol. 2020 | |
      | | Wang et al. Nucleic Acids Res. 2021 | |
      | | Ramaker et al. BMC Cancer. 2021 | |
      | Alteration network | PCAWG | Most comprehensive publicly available resource for gene alterations (both DNA and RNA) beyond expression abundance, uniformly processed across cancer types |
      | | PCAWG Transcriptome Core Group et al. Nature. 2020 | |
      | Protein Interaction | Xiong et al. Nat Biotechnol. 2024 | Integrates TCGA-wide mutation data with multi-source structure-defined interfaces (PDB, Interactome3D, PIONEER) to reveal enriched, functionally disrupted PPIs across cancers |
      | eQTL | PancanQTL | The only dataset comprehensively providing both cis- and trans-eQTLs in multiple cancer types using TCGA data |
      | SL/SV | CGIdb2.0 | Systematically integrated database of SL data from 33 published studies and SV data from 11 studies, currently one of the most comprehensive SL/SV datasets available |
      | Allele-specific Expression | PCAWG Transcriptome Core Group et al. Nature. 2020 | Most comprehensive publicly available resource for RNA alterations beyond expression abundance, uniformly processed across cancer types |
      | Alternative Promoter | PCAWG | |
      | Gene Fusion | PCAWG | |

      RNA Alteration Data Quality Control

      For RNA Alterations, to ensure consistency and reliability, GEPIA3 uses data from the PCAWG project, which provides high-quality and uniformly processed data across a wide range of cancer types.

      For the ASE (allele-specific expression) and Gene Fusion modules, GEPIA3 directly adopts the processed results from PCAWG without further modification. The ASE data quantify allelic expression imbalances and summarize the effect sizes of influential factors as well as input mutational burden categories. The gene fusion data contain curated fusion events across different cancer types.

      The Alternative Promoter module is based on promoter activity matrices from PCAWG. To enable fast queries and visualization, we applied several preprocessing steps. Sample metadata from UCSC Xena are used to classify samples by cancer type and tissue origin (tumor vs. normal). For each promoter, we compute the mean and standard deviation of relative activity within each group, count the number of non-missing observations, and perform two-sample t-tests to assess differential activity between tumor and normal tissues.
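
      The per-promoter summary statistics can be sketched in plain Python as below. The text does not say which two-sample t-test is used, so this sketch assumes Welch's unequal-variance form; the function names and activity values are hypothetical, and missing observations are represented as None or NaN.

```python
import math
from statistics import mean, stdev


def summarize(values):
    """Mean, standard deviation and count of the non-missing values."""
    observed = [v for v in values if v is not None and not math.isnan(v)]
    n = len(observed)
    return {
        "n": n,
        "mean": mean(observed) if n else None,
        "sd": stdev(observed) if n > 1 else None,
    }


def welch_t(tumor, normal):
    """Welch t statistic and degrees of freedom, or None if not computable."""
    a, b = summarize(tumor), summarize(normal)
    if a["n"] < 2 or b["n"] < 2:
        return None  # a variance estimate needs at least two observations
    va = a["sd"] ** 2 / a["n"]
    vb = b["sd"] ** 2 / b["n"]
    if va + vb == 0:
        return None  # both groups are constant
    t = (a["mean"] - b["mean"]) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (a["n"] - 1) + vb ** 2 / (b["n"] - 1))
    return t, df


# Hypothetical relative promoter activities; None marks a missing value.
tumor = [0.6, 0.8, None, 0.7]
normal = [0.2, 0.3, 0.1]
print(summarize(tumor)["n"], welch_t(tumor, normal))
```

      Converting the t statistic to a p-value requires the cumulative distribution function of Student's t, which is not in the Python standard library; in practice a call such as scipy.stats.ttest_ind(a, b, equal_var=False) covers both steps.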

      Network Data Quality Control

      We used the gene-centric binary table from the ICGC PCAWG3 group as the input alterations. First, we filtered out alterations that occurred in fewer than 5 samples. We also excluded co-occurring candidate gene pairs located on the same chromosome to eliminate bypass co-occurrence. Second, we used the DISCOVER tool to test the significance of co-occurrence or mutual exclusivity. Third, we used the Benjamini-Hochberg (BH) method for multiple-testing correction. Unless otherwise specified, downstream analyses include only the 8 cancer types with more than 50 tumor samples (Kidney-RCC, Lymph-BNHL, Liver-HCC, Ovary-AdenoCA, Breast-AdenoCA, Panc-AdenoCA, Lymph-CLL, and ColoRect-AdenoCA) and only 6 types of alterations (alternative promoter, expression outlier, variants, alternative splicing, allele-specific expression, and copy number variations), because fusion and RNA editing involve only a small number of alterations.
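
      The two pre-filtering steps that precede the DISCOVER test can be sketched as follows. The data layout (a mapping from gene to the set of altered sample IDs, plus a gene-to-chromosome lookup) and all names are assumptions for illustration; the DISCOVER test itself is not reproduced here, and for simplicity the same-chromosome filter is applied to every candidate pair.

```python
from itertools import combinations


def candidate_pairs(altered_samples, chromosome, min_samples=5):
    """Return gene pairs that survive both pre-filters.

    altered_samples: gene -> set of sample IDs carrying the alteration
    chromosome:      gene -> chromosome name
    """
    # Step 1: drop alterations seen in fewer than `min_samples` samples.
    kept = sorted(g for g, s in altered_samples.items() if len(s) >= min_samples)
    # Step 2: drop pairs whose genes lie on the same chromosome.
    return [
        (a, b)
        for a, b in combinations(kept, 2)
        if chromosome[a] != chromosome[b]
    ]


# Hypothetical genes and samples (illustrative only).
altered = {
    "GENE_A": {"s1", "s2", "s3", "s4", "s5"},
    "GENE_B": {"s1", "s2", "s3", "s4", "s5", "s6"},
    "GENE_C": {"s2", "s3", "s7", "s8", "s9"},
    "GENE_D": {"s1", "s2", "s3", "s4"},  # only 4 samples, filtered out
}
chrom = {"GENE_A": "chr1", "GENE_B": "chr1", "GENE_C": "chr2", "GENE_D": "chr3"}
print(candidate_pairs(altered, chrom))
# prints [('GENE_A', 'GENE_C'), ('GENE_B', 'GENE_C')]
```

      The surviving pairs would then be passed to the significance test, and the resulting p-values corrected with the BH method.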

      Results Download

      Results can be downloaded in PDF or SVG format by clicking the button next to them.

    License Statement: All content on this website is freely available to all users, including for commercial use.