Help
Introduction to GEPIA
The GEPIA series has provided robust and widely used tools for analyzing gene expression data derived from TCGA, GTEx and ICGC, as well as cell line drug screen data. GEPIA enabled comprehensive studies of differential expression, survival analysis, and gene correlation.
Quick Start
Enter a gene/isoform symbol or gene/isoform ID (Ensembl ID) in the "Enter gene name" field, and click the "GoPIA!" button to search for the gene of interest.

Expression Calculation in GEPIA3
All functions were based on human reference genome hg38. GEPIA3 used RNA-seq gene or transcript expression in TCGA tumor, TCGA peritumor and GTEx normal tissues.
The expression values were TPM (Transcripts Per Million) and expected counts calculated by RSEM from UCSC Toil RNA-seq Recompute. Cell line expression was calculated from CCLE RNA-seq samples.
We downloaded RSEM-calculated TPM (log2 transformed) from DepMap 22Q1. The gene and isoform biotype annotations were based on GENCODE v46. Our analysis framework lets users optionally apply log2 transformation to the input, with two exceptions:
(a) DESeq2 only accepts expected counts as input; (b) statistical methods requiring normality assumptions for expression analysis.
We highly recommend using log2 transformed TPM as input in most common cases. Boxplots and scatter plots of expression values use log2 transformed TPM, as marked in the axis labels.
Signature: all expression-based functions in GEPIA3 support the use of multiple genes as input, and then compute the signature score using the first principal component value derived from Principal Component Analysis (PCA) of the expression levels of the input genes.
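As an illustration of the signature score described above, the following minimal sketch projects samples onto the first principal component of a small expression matrix. It is not GEPIA3's code: the function name `signature_score`, the mean-centering without variance scaling, the power-iteration solver, and the sign convention (loadings sum to a positive value) are all assumptions made for this example.

```python
import math

def signature_score(expr, n_iter=200):
    """First-principal-component score per sample.

    expr: list of samples, each a list of expression values (for example
    log2(TPM+1)) for the signature genes, in the same gene order.
    Assumption: genes are mean-centered but not variance-scaled.
    """
    n, p = len(expr), len(expr[0])
    if n < 2:
        raise ValueError("need at least two samples")
    means = [sum(row[j] for row in expr) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in expr]
    # Sample covariance matrix of the genes (p x p).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration converges to the leading eigenvector of cov.
    v = [1.0 / (j + 1) for j in range(p)]
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # The sign of a principal component is arbitrary; fix it here so that
    # the loadings sum to a positive value (an assumed convention).
    if sum(v) < 0:
        v = [-x for x in v]
    return [sum(r[j] * v[j] for j in range(p)) for r in centered]

# Two perfectly co-varying genes across three samples.
scores = signature_score([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
print(scores)
```

Because the sign of a principal component is arbitrary, a higher score only means higher signature expression under a convention like the one assumed here.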
The update from GEPIA2 to GEPIA3
GEPIA3 introduces substantial updates, including modules for drug sensitivity analysis that link gene expression profiles with therapeutic outcomes. Additionally, it incorporates advanced network analysis frameworks, leveraging eQTL, co-mutation, and protein interaction data to map regulatory landscapes. Comprehensive tools for studying multiple types of RNA alterations have also been added, enabling the investigation of diverse transcriptomic modifications. These innovations establish GEPIA3 as an indispensable resource for integrative and translational cancer genomics research.
Drug Analysis
TCGA Drug Response Analysis
-
This function enables users to compare gene/signature expression level between patients with different drug responses annotated in TCGA.
Parameters
- Gene: Input one or a group of genes of interest.
- Methods: Select ANOVA or Wilcoxon for differential analysis.
- log2 transformed: Select whether to use log2(TPM+1) as expression abundance.
- Datasets Selection: Select one or multiple cancer types of interest in the "Dataset Selection" field and click "add" to build dataset list in the "Datasets" field.
- Groups: Choose which groups to compare.
Results
-
The table shows the results of differential expression analysis for gene expression levels between patients with different drug responses.


TCGA Drug Treatment vs Survival
-
This function allows users to compare the relationship between gene expression and survival in patients with and without a specific drug treatment.
Parameters
- Gene: Input a gene/isoform or gene signature A of interest.
- Methods: Select the OS or PFS survival response.
- Axis Units: Select Month or Day unit for plotting.
- Datasets Selection: Select one or multiple cancer types of interest in the "Dataset Selection" field and click "add" to build dataset list in the "Datasets" field.
- Colors: Choose color to plot KM curves
- Group Cutoff: Select a suitable expression threshold for splitting the high-expression and low-expression cohorts.
- Cutoff-High(%): Samples with expression level higher than this threshold are considered as the high-expression cohort.
- Cutoff-Low(%): Samples with expression level lower than this threshold are considered the low-expression cohort.
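The Cutoff-High(%) and Cutoff-Low(%) parameters can be read as percentile thresholds on the expression values. The sketch below shows one way such a split could work. The helper names `percentile` and `split_cohorts` and the linear-interpolation percentile rule are assumptions for illustration; the source does not state how GEPIA3 computes the thresholds.

```python
import math

def percentile(values, pct):
    """Percentile with linear interpolation between ranks (assumed rule)."""
    s = sorted(values)
    k = (len(s) - 1) * pct / 100.0
    lo, hi = math.floor(k), math.ceil(k)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

def split_cohorts(expression, cutoff_high=50.0, cutoff_low=50.0):
    """Split samples into high- and low-expression cohorts.

    expression: dict mapping sample ID -> expression value.
    Samples strictly above the Cutoff-High percentile form the high cohort;
    samples strictly below the Cutoff-Low percentile form the low cohort.
    Samples between the two thresholds are left out of both cohorts.
    """
    values = list(expression.values())
    high_thr = percentile(values, cutoff_high)
    low_thr = percentile(values, cutoff_low)
    high = sorted(s for s, v in expression.items() if v > high_thr)
    low = sorted(s for s, v in expression.items() if v < low_thr)
    return high, low

expr = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
print(split_cohorts(expr))                  # median split
print(split_cohorts(expr, 75.0, 25.0))      # quartile split
```

With a 50/50 setting every sample lands in one of the two cohorts (unless it sits exactly on the median), while a 75/25 setting compares only the upper and lower quartiles.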
Results
-
The result table shows the results of Cox regression analysis for gene expression levels and patient survival between the drug and non-drug groups.
Click the button beside a drug name to calculate the survival differences between the four patient groups: high expression with the drug,
low expression with the drug, high expression without the drug, and low expression without the drug.
Kaplan-Meier (KM) survival curves will be generated for each pairwise comparison.

Cell Line Drug Screen
-
This function allows users to calculate the correlation between drug response and gene expression/copy number in the CTRP, GDSC and CREAMMIST databases.
Parameters
- Gene: Input a gene/isoform or gene signature A of interest.
- Drug Screen Dataset: Select the dataset to use.
- Drug Sensitivity Metric: Select the score for drug sensitivity measurement.
- Cell Line Types: Select one or multiple cell types and click "+" to build the dataset.
Results
-
The result table shows the correlation coefficients between the query gene's expression level and drug responses in the selected cell lines.
Z score is the Fisher's Z of correlation coefficient r.
The plot shows the drugs whose response is most strongly associated with the gene's expression (the top 10 with the highest r and the top 10 with the lowest r).
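For reference, Fisher's Z transformation of a correlation coefficient r is z = atanh(r) = 0.5 * ln((1 + r) / (1 - r)). The sketch below computes it. The second helper, which scales z by sqrt(n - 3) to obtain an approximately standard normal statistic, is a common textbook convention and is only an assumption here; the source does not say whether GEPIA3 applies this scaling.

```python
import math

def fisher_z(r):
    """Fisher's Z transformation of a correlation coefficient."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return math.atanh(r)

def standardized_z(r, n):
    """Assumed convention: z * sqrt(n - 3), approximately N(0, 1) under r = 0.

    n is the number of cell lines used to compute r (hypothetical parameter).
    """
    if n <= 3:
        raise ValueError("need more than 3 observations")
    return fisher_z(r) * math.sqrt(n - 3)

print(fisher_z(0.5))            # 0.5 * ln(3)
print(standardized_z(0.5, 28))  # scaled by sqrt(25) = 5
```

The transformation stretches values near -1 and 1, which makes correlations computed on different numbers of cell lines easier to compare.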

Cell CRISPR Screens
-
This function allows users to compare the cell responses to CRISPR gene perturbation between drug-treated and untreated conditions.
Parameters
- Gene: Input a gene for analysis. Gene IDs are not supported; please use the official gene symbol. A total of 17,382 genes were included in the screens, so not all genes are available.
- Treatment Datasets: Click "+" to select datasets for drug treatment. Data with more than five entries and data for nonexistent selected genes will be filtered out.
Results
-
The result shows the gene's NormZ value (the normalized Z-score of sgRNA read counts between DNA damage-treated and untreated cells).
Negative NormZ values represent genes whose mutation leads to their depletion from the cell population after genotoxin exposure,
whereas positive NormZ scores represent genes whose mutation leads to a selective growth advantage in the presence of the drug.
See the original paper for details.

Network Analysis
Comprehensive Analysis
-
This feature allows users to explore multiple interaction relationships of genes based on selected datasets and interaction types. STRING database annotations can also be optionally included.
Parameters
- Gene Input: Input a list of gene symbols or Ensembl IDs of interest.
- Interaction Types: Choose interaction types, including:
- Positive Correlation
- Anti-correlation
- Co-alteration
- Mutually Exclusive Alteration
- Synthetic Lethality (SL) / Synthetic Viability (SV)
- OncoPPI: Oncogenic Protein-Protein Interactions
- Nodes Displayed: Choose the number of nodes you want to include in the network graph.
- Datasets: Select cancer datasets of interest (e.g., TCGA Tumor). Multiple datasets can be chosen, or manual input separated by commas is supported.
- STRING Annotation: Enable this option to include gene relationships from the STRING database.
Notes
- Only gene pairs with an absolute Pearson Correlation Coefficient (PCC) greater than 0.5 are considered in the analysis of expression patterns.
- When searching for a gene, the displayed network highlights the top k (Nodes Displayed input) nodes with the highest degrees connected to the target gene within the selected datasets and interaction patterns.
Results
- Edges: A table summarizing gene pairs, interaction types, confidence measures (such as score for STRING and SL/SV, PCC for co-expression, and p-value/q-value for co/me-alteration and oncoPPI), and relevant cancer datasets.
- Network Visualization: An interactive graph displaying gene relationships. Nodes represent genes, and edges represent interactions. Different colors are used to indicate interaction types.


Alteration Network
-
This feature allows users to identify and visualize gene alteration relationships, including co-alteration and mutually exclusive alteration patterns, within selected datasets and genes.
Parameters
- Gene: Input a gene/isoform or multiple genes of interest.
- Alteration Pattern: Select alteration pattern of interest.
- ICGC Tumor/Selected Datasets: Select cancer types of interest in the "ICGC Tumor" field and click "add" to build the dataset list in the "Selected Datasets" field. Manual input of cancer types separated by commas (e.g. Kidney-RCC,Liver-HCC) is also acceptable. The interaction analysis is based on the dataset list.
Notes
- We considered six types of alterations: alternative promoter usage, expression outliers, genetic variants, alternative splicing events, allele-specific expression, and copy number variations.
- When searching for a gene, the displayed network highlights the top 10 nodes with the highest degrees connected to the target gene within the selected datasets and interaction patterns.
Results
- Table: Genes that are co-altered or mutually exclusively altered (me-altered) with the target gene, along with selected features of those genes.
- Heatmap Visualization: Visualizes the alteration status of selected genes across samples from the user-selected cohort. Each bar represents an individual sample, with red indicating alteration and gray indicating no alteration. The percentage on the right reflects the proportion of altered samples for each gene.
- Network Visualization: An interactive graph displaying gene relationships. Nodes represent genes, and edges represent interactions. Different colors are used to indicate interaction types.



Expression Network
-
This feature allows users to search for genes that are most correlated with a target gene or gene signature within selected datasets.
Parameters
- Gene: Input a gene/isoform or gene signature of interest.
- Top # Genes: Set the number of top correlated genes to display. (Maximum: 1000)
- Interaction Pattern: Select interaction pattern of interest.
- Log Transformation: Apply log2 transformation to normalize the gene expression data.
- TCGA Tumor/TCGA Normal/GTEx/Used Expression Datasets: Select cancer types of interest in the "TCGA Tumor", "TCGA Normal" or "GTEx" field and click "add" to build the dataset list in the "Used Expression Datasets" field. Manual input of cancer types separated by commas (e.g. COAD Tumor,READ Tumor) is also acceptable. The correlation analysis is based on the dataset list.
Results
- Table: Displays the top correlated genes, their correlation coefficients, and the number of tumor, normal, and GTEx samples analyzed.
- Network Visualization: For single gene input, the top correlated genes are displayed in a network, with annotation from the STRING database.


Protein Interaction
-
The OncoPPI module identifies oncogenic protein-protein interactions (oncoPPIs) for input genes across selected cancer types, leveraging interaction data from large-scale studies and structural databases.
This module is built upon the findings from the study "A structurally informed human protein–protein interactome reveals proteome-wide perturbations caused by disease mutations" published in Nature Biotechnology on October 24, 2024. Specifically, the module uses data from Supplementary Table 3, which includes significance tests of somatic mutation enrichment in protein-protein interaction (PPI) interfaces across 33 TCGA cancer types. The interaction data integrates predictions from the PIONEER deep-learning framework and experimentally resolved structures.
In addition to the PIONEER-predicted interactions, structural data were curated from:
- The Protein Data Bank (PDB): Co-crystal structures of PPIs were downloaded from the PDB FTP site (PDB FTP).
- Interactome3D: Homologous structures for PPIs without co-crystal data were collected from Interactome3D. For citation, please refer to: A structurally informed human protein–protein interactome reveals proteome-wide perturbations caused by disease mutations
Parameters
- Gene List: Input one or multiple genes to query their oncogenic protein interactions.
- Datasets: Select TCGA cancer types (e.g., "CHOL Tumor", "BRCA Tumor") to filter relevant oncogenic interactions.
Results
- Table: Summarizes oncogenic protein interaction details:
- Gene A: Input gene involved in the interaction.
- Gene B: Interacting partner gene.
- P-value A / P-value B: Statistical significance of the somatic mutation enrichment for each gene.
- FDR: Adjusted False Discovery Rate for multiple testing correction.
- Cancer Type: Specific cancer type where the interaction is identified.
- Source: Indicates the method used to define protein-protein interface regions.
- Network Visualization: Visualizes oncogenic protein interactions, annotated with STRING database.


SLSV Analysis
-
This module allows users to explore genetic interactions (GIs) across selected cancer types, focusing on synthetic lethality (SL) and synthetic viability (SV) relationships between gene pairs. For citation, please refer to:
"Harnessing genetic interactions for prediction of immune checkpoint inhibitors response signatures in cancer cells."
Parameters
- Gene: Input a gene/isoform or multiple genes of interest.
- Interaction Pattern: Select interaction pattern of interest.
- Tissue/Selected Datasets: Select a tissue of interest in the "Tissue" field and click "add" to build the dataset list in the "Selected Datasets" field. Manual input of tissues separated by commas (e.g. pancancer,blood) is also acceptable.
Results
- Table: Displays SL and SV interaction results:
- Gene A: The queried gene symbol.
- Gene B: The interacting partner gene.
- Interaction Type: Indicates whether the interaction is SL or SV.
- Tissue: Specifies the tissue or cancer type where the interaction occurs.
- Network Visualization: Visualizes genetic interactions as an interactive network, allowing for exploration of SL and SV relationships.


eQTL
-
This module allows users to identify both cis-eQTLs and trans-eQTLs across 33 cancer types derived from The Cancer Genome Atlas (TCGA). For citation, please refer to: PancanQTL: systematic identification of cis-eQTLs and trans-eQTLs in 33 cancer types.
Parameters
- Gene: Input a gene/isoform or multiple genes of interest.
- eQTL Type: Choose between "Cis-eQTLs" for local gene expression regulation or "Trans-eQTLs" for distant gene expression regulation.
- TCGA Tumor/Selected Datasets: Select cancer types of interest in the "TCGA Tumor" field and click "add" to build the dataset list in the "Selected Datasets" field. Manual input of cancer types separated by commas (e.g. COAD Tumor,READ Tumor) is also acceptable. The interaction analysis is based on the dataset list.
Results
- Gene: The queried gene symbol.
- SNP ID: The identifier of the single nucleotide polymorphism (SNP) associated with gene expression (eQTL).
- Alleles: The observed allelic variants at the SNP locus (e.g., A/G, T/C), representing the different nucleotides in the population.
- Cancer Type: The specific cancer type in which the eQTL association was identified.
- Beta: The effect size indicating how much the SNP alters gene expression; a positive value suggests upregulation, while a negative value suggests downregulation.
- P-value: The statistical significance of the eQTL association; lower values indicate stronger evidence against the null hypothesis of no association.
- FDR: The false discovery rate–adjusted p-value accounting for multiple testing; controls the expected proportion of false positives.

RNA Alterations
Allele-Specific Expression
-
This function enables users to identify allelic expression imbalances in one or more genes, along with the factors driving these imbalances.
Parameters
- Gene: Input one or a group of genes of interest.
Results
- Table: Displays the allele-specific expression levels of the gene in cancer, as well as whether the gene is an imprinted gene, a cancer gene, or a cancer-testis gene.
- Plot: Heatmaps that show the effect sizes of influencing factors and the input somatic burdens.
Effect size here refers to the magnitude of the relationship between the genomic factors and allelic expression imbalance, as quantified by generalized linear model. Zoom-in plots for somatic mutation factors are provided to ensure that finer details are not masked by factors with larger effect sizes.

Alternative Promoter
-
This function provides users with information on relative promoter activity and enables comparisons between tumor samples and their corresponding normal tissues.
Parameters
- Gene: Input one or a group of genes of interest.
- ICGC Tumor: Select cancer types of interest.
Results
- Table: Shows the basic information of promoters and their average relative activities.
- Barplot: Displays average activities in tumors and their corresponding normal tissues. Note that only p-values < 0.10 are annotated on the plot.


Gene Fusion
-
This function allows users to search for gene fusion events in specific cancer types.
Parameters
- Gene: Input one or a group of genes of interest. Or enter "all" to search for all fusion events in the given cancer types.
- ICGC Tumor: Select cancer types of interest.
- show top n frequent genes: Set the number of genes to be displayed in the bar plot for gene frequency.
Results
- Table: Displays the basic information of fusion events and the frequency of these events in the given cancer types.
- Circos Plot: Visualization for the gene fusion events. Note that when searching for all events in a given cancer type, only the top n (set by users) most frequent genes will be annotated on the plot.
- Barplot: Shows the top n (set by users) most frequent genes in the search result.


Other Updates
Differential Expression with Hotspot Mutation
-
This function enables users to compare gene/signature expression level between patients with and without key hotspot mutations.
Parameters
- Gene: Input a gene/isoform or gene signature A of interest.
- Methods: Select ANOVA or Wilcoxon for differential analysis.
- log2 transformed: Select whether to use log2(TPM+1) as expression abundance.
- Hotspot Mutation Type: Select the type of mutation. Gene level means any nonsilent mutation in the gene. Amino acid level means a change at a specific amino acid site (HGVS protein-level).
- Datasets Selection: Select one or multiple cancer types of interest in the "Dataset Selection" field and click "add" to build dataset list in the "Datasets" field.
- Plot Color: Choose group color to plot differential expression boxplot and correlation scatter plot.
Results
-
The table shows the significant results of differential expression between hotspot mutant and wildtype cases.
Click the button beside mutation to plot gene/signature expression distribution of mutant and wildtype groups,
as well as the expression correlation with the hotspot gene.


Functions
Gene Card
Enter a gene symbol or gene ID (Ensembl ID), and click the "GoPIA" button to quickly search for your gene of interest.
GEPIA provides an interactive bodymap to visualize the expression in both tumor and normal samples, as well as an expression profile that displays the gene's expression level across all tumor samples and normal tissues.


Profile Summary
Differential Genes
This function allows users to list the tumor/normal tissue differentially expressed genes or isoforms in a cancer type, and plot the chromosomal distribution of these genes.
Users can apply custom statistical methods and thresholds on a given dataset to dynamically obtain differentially expressed genes. For the DESeq2 and limma options, genes with higher |log2FC| values and lower q values than pre-set thresholds are considered differentially expressed genes.
Parameters
- Dataset Selection: Select a cancer type of interest.
- BioType: Type of genes or isoforms.
- Normal samples: Select "TCGA normal + GTEx normal" or "TCGA normal" for matched normal data.
- |log2FC| Cutoff: Set custom fold-change threshold.
- q-value: Set custom q-value threshold.
- Gene/Isoform: Query differentially expressed genes or isoforms.
- Differential Methods: Select a method for differential analysis.
Results
-
Click the "List" button: GEPIA will generate a list of differentially expressed genes.

Prognostic Genes
This function enables users to search for genes most associated with patient survival based on a cancer type.
GEPIA uses the Log-rank test for hypothesis testing across all genes, with q-values calculated for multiple testing corrections.
Parameters
- Datasets Selection: Select a cancer dataset of interest.
- Biotype: Type of genes or isoforms.
- Group Cutoff: Select a suitable expression threshold for splitting the high-expression and low-expression groups.
- TimeResponse: Select the OS or PFI survival method.
- Gene/Isoform: Query most prognosis-associated genes or isoforms.
Results
-
Click the "List" button: GEPIA will generate a list of prognostic genes.

Find Drugs
This function enables users to perform survival analysis based on TCGA patients using specific drug treatment.
GEPIA performs differential analysis of Overall Survival (OS) or Progression-free Interval (PFI) between patients with or without a specific drug treatment. GEPIA uses the Log-rank test, also known as the Mantel-Cox test, for hypothesis testing.
Parameters
- Drugs: Select a drug of interest. Both generic name and trade name are provided.
- TCGA Cancer Types/Selected Datasets: Select one or multiple cancer types and click "+" to build the dataset.
- TimeResponse: Select the OS or PFS survival response.
- TimeUnit: Select Month or Day unit for plotting.
- Colors: Choose color to plot.
Results
-
Click the "Run" button: GEPIA will generate survival curve comparing two groups (with or without drug treatment).

Analysis
Expression Analysis
Expression DIY
-
GEPIA plots expression profiles of a given gene according to selected datasets and statistical methods by cancer types or pathological stages. The details are shown below:
Profile DIY
-
GEPIA generates dot plots profiling gene expression across multiple cancer types and paired normal samples, with each dot representing a distinct tumor or normal sample.
Parameters
- Gene: Input a gene/isoform of interest.
- TCGA Tumor/Selected Datasets: Select cancer types of interest in the "TCGA Tumor" field and click "+" to build a dataset list in the "Selected Datasets" field. Manual input of cancer types split by comma (e.g. ACC,BRCA,BLCA) is also acceptable. The x-axis of the plot will follow the order of datasets.
- Normal Samples: Select "TCGA normal + GTEx normal" or "TCGA normal" for matched normal data in plotting. [The matched normal samples for differential analysis are "TCGA normal + GTEx normal"]
- Differential Methods: Statistical methods used for differential gene expression analysis.
- |log2FC| Cutoff: Set custom fold-change threshold.
- q-value: Set custom q-value threshold.
Results
-
Click the “Run” button: GEPIA will generate a gene expression profile based on users' custom input parameters.

Boxplot
-
GEPIA generates box plots with jitter for comparing expression in several cancer types (For best visual quality, we recommend 1-4 cancer types).
Parameters
- Gene: Input a gene or multiple genes of interest.
- TCGA Tumor/Selected Datasets: Select cancer types of interest in the "TCGA Tumor" field and click "+" to build a dataset list in the "Selected Datasets" field. Manual input of cancer types split by comma (e.g. ACC,BRCA,BLCA) is also acceptable. The x-axis of the plot will follow the order of datasets.
- Normal Samples: Select "TCGA normal + GTEx normal" or "TCGA normal" for matched normal data in plotting. [The matched normal samples for differential analysis are "TCGA normal + GTEx normal"]
- |log2FC| Cutoff: Set custom fold-change threshold.
- q-value: Set custom q-value threshold.
- High Group: Set the box color of tumor dataset and the font color for cancer types with elevated expression.
- Low Group: Set the box color of normal dataset and the font color for cancer types with reduced expression.
Results
-
Click the “Run” button: GEPIA will present a gene expression box plot based on users' custom input parameters.


Correlation Analysis
-
This function performs pair-wise gene expression correlation analysis for given sets of TCGA and/or GTEx expression data, using methods including Pearson, Spearman and Kendall. One gene can be normalized by another gene.
- GEPIA3 uses the log-scale for calculation and visualization.
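To make the correlation step concrete, here is a minimal sketch of Pearson and Spearman correlation on log2(TPM+1) values, written with the standard library only. The helper names (`log2_tpm`, `pearson`, `average_ranks`, `spearman`) are illustrative and are not GEPIA3 functions; Kendall's tau is omitted for brevity, and ties in the Spearman ranks are handled with average ranks, which is an assumption.

```python
import math

def log2_tpm(tpm):
    """log2(TPM + 1) transform of a list of TPM values."""
    return [math.log2(t + 1.0) for t in tpm]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

gene_a = log2_tpm([0.0, 1.0, 3.0, 7.0])    # -> 0, 1, 2, 3
gene_b = log2_tpm([1.0, 3.0, 7.0, 15.0])   # -> 1, 2, 3, 4
print(pearson(gene_a, gene_b), spearman(gene_a, gene_b))
```

Spearman depends only on the ordering of the values, so it is unchanged by the log2 transformation, whereas Pearson is not.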
Parameters
- Gene A: Input a gene or multiple genes of interest. [For x-axis]
- Gene B: Input a gene or multiple genes of interest. [For y-axis]
- TCGA Tumor/TCGA Normal/GTEx/Selected Expression Datasets: Select cancer types of interest in the “TCGA Tumor”, “TCGA Normal” or “GTEx” field and click “add” to build the dataset list in the “Used Expression Datasets” field. Manual input of cancer types separated by commas (e.g. COAD Tumor,READ Tumor) is also acceptable. The correlation analysis is based on the dataset list.
- Correlation Coefficient: The method for calculating the correlation coefficient.
- use log2 (TPM+1) scale: Prior to computation, data are transformed to the log2(TPM+1) scale to enhance robustness.
Results
-
Click the “Run” button: GEPIA will generate a scatter plot of correlation analysis result.

Dimensionality Reduction
-
GEPIA performs Principal Component Analysis (PCA) for a given gene list using custom TCGA and/or GTEx expression data. First, GEPIA presents a 3D plot of top three principal components (PC) and generates a bar plot for variances interpreted by each PC. Second, GEPIA presents 2D plot or 3D plot based on user-specified PCs.
Parameters
- Gene Set: Input a gene list of interest, entered manually and separated by commas (e.g. ERBB2,EGFR).
- TCGA Tumor/TCGA Normal/GTEx/Selected Expression Datasets: Select cancer types of interest in the “TCGA Tumor”, “TCGA Normal” or “GTEx” field and click “add” to build the dataset list in the “Used Expression Datasets” field. Manual input of cancer types separated by commas (e.g. COAD Tumor,READ Tumor) is also acceptable. The analysis is based on the dataset list.
- use log2 (TPM+1) scale: Prior to computation, data are transformed to the log2(TPM+1) scale to enhance robustness.
Results
-
Click the “Run” button: GEPIA will generate a scatter plot of the dimensionality reduction result.

Hotspot Mutation
Survival Analysis
This function enables users to perform survival analysis based on the expression status of one gene or a multi-gene signature and plot a Kaplan-Meier curve. Some gene signature lists are provided.
GEPIA performs Overall Survival (OS) or Progression-free Interval (PFI) analysis based on gene expression. GEPIA uses the Log-rank test, also known as the Mantel-Cox test, for hypothesis testing. Cohort thresholds can be adjusted. The Cox proportional hazard ratio and the 95% confidence interval can also be included in the survival plot.
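For readers who want to see what the Log-rank (Mantel-Cox) test computes, the following is a minimal two-group sketch using only the standard library. The function name `logrank_test` and its input layout are assumptions for illustration, and the code is not taken from GEPIA. The p-value uses the chi-square distribution with one degree of freedom, whose upper tail equals erfc(sqrt(x / 2)).

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test.

    times_*: follow-up times; events_*: 1 if the event was observed,
    0 if the sample was censored. Returns (chi-square statistic, p-value).
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        # Numbers at risk just before time t in each group.
        n_a = sum(1 for tt, _, g in data if tt >= t and g == 0)
        n_b = sum(1 for tt, _, g in data if tt >= t and g == 1)
        # Events occurring exactly at time t.
        d_a = sum(1 for tt, e, g in data if tt == t and e and g == 0)
        d_b = sum(1 for tt, e, g in data if tt == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square, 1 d.f.
    return chi2, p_value

# Low-expression cohort fails early, high-expression cohort fails late.
chi2, p = logrank_test([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5)
print(chi2, p)
```

The test compares, at every event time, the number of events observed in one cohort with the number expected if both cohorts shared the same hazard.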
Parameters
- Gene: Input a gene/isoform or gene signature A of interest.
- TimeResponse: Select the OS or PFS survival response.
- TimeUnit: Select Month or Day unit for plotting.
- TCGA Tumor/Selected Datasets: Select one or multiple cancer types of interest in the "Dataset Selection" field and click "add" to build dataset list in the "Datasets" field.
- Group Cutoff: Select a suitable expression threshold for splitting the high-expression and low-expression cohorts.
- Cutoff-High(%): Samples with expression level higher than this threshold are considered as the high-expression cohort.
- Cutoff-Low(%): Samples with expression level lower than this threshold are considered the low-expression cohort.
- Colors: Choose color to plot.
Results
- Univariable
- Multivariable


Drug Analysis
See GEPIA Update - Drug Analysis
Network Analysis
See GEPIA Update - Network Analysis
RNA Alterations
See GEPIA Update - RNA Alterations
Differential Analysis
We use three methods for differential expression analysis using TCGA tumor samples with paired adjacent TCGA normal samples and GTEx normal samples.
Why do we use both TCGA normal and GTEx normal samples for differential analysis? TCGA produced 11,257 tumor samples across 33 cancer types, while this project only provides 1,475 normal samples. The imbalance between the tumor and normal data can cause inefficiency in various differential analyses. Fortunately, GTEx project produced RNA sequencing data for ~8,000 normal samples. Meanwhile, UCSC Xena project recomputed the TCGA and GTEx raw RNA-Seq data using a standard pipeline, which makes two datasets compatible. As a result, the TCGA and GTEx data could be integrated for very comprehensive expression analysis.
We consulted with multiple medical experts to determine the most appropriate tumor-normal comparisons. The comparisons and data we used are presented in the Dataset Sources - Tissues.
Differential methods
ANOVA
Considering the different stratifications of sex, age, ethnicity in tumor and normal samples, we applied four-way analysis of variance (ANOVA), using sex, age, ethnicity and disease state (Tumor or Normal) as variables for calculating differential expression:
Gene expression ~ sex + age + ethnicity + disease state
The expression data are first log2(TPM+1) transformed and the log2FC is defined as median(Tumor) - median(Normal).
The Benjamini and Hochberg false discovery rate (FDR) method was used to adjust the p-value in each factor to obtain the multiple testing adjusted q-value.
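The two numeric steps described above, the log2FC definition and the Benjamini-Hochberg adjustment, can be sketched as follows. The function names `log2fc` and `bh_adjust` are illustrative; the ANOVA model itself is not reproduced here, so the p-values are assumed to come from elsewhere.

```python
import math
from statistics import median

def log2fc(tumor_tpm, normal_tpm):
    """log2FC = median(log2(TPM+1) in tumor) - median(log2(TPM+1) in normal)."""
    tumor = [math.log2(x + 1.0) for x in tumor_tpm]
    normal = [math.log2(x + 1.0) for x in normal_tpm]
    return median(tumor) - median(normal)

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

print(log2fc([3.0, 7.0, 15.0], [1.0, 1.0, 1.0]))   # medians 3 and 1
print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

A gene would then be called differentially expressed when its |log2FC| exceeds the chosen cutoff and its q-value falls below the chosen threshold.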
LIMMA
For an alternative method, we use the linear model and the empirical Bayes method implemented by the R package limma, with adjusted p-value (Benjamini and Hochberg FDR). The limma method leverages the highly parallel nature of genomic data, borrowing information between the gene-wise models.
Similarly, the expression data are first log2(TPM+1) transformed and the log2FC is defined as median(Tumor) - median(Normal).
Top 10
Cancer drug targets such as ERBB2 and VEGF are over-expressed in a subset (10-20%) of tumors and lowly expressed in all normal tissues. Genes with this pattern are ideal candidates when searching for therapeutic targets. For this purpose, we implemented a method for detecting the genes that are over-expressed in only a subset of tumors for a given cancer type.
For each cancer type, we choose the tumor samples that have the top 10% expression level for a given gene. For comparison, we choose the same number of normal samples that have the highest expression level for the same gene. We rank the selected tumor and normal samples together by expression level and calculate the percentage of tumor samples in the top 50% of the ranked list as the percentage value. The expression data are first log2(TPM+1) transformed and the log2FC is defined as median(Tumor) - median(Normal).
By default, we report the over-expressed genes as those passing following thresholds:
log2FC > 1, percentage > 0.9.
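The percentage value of the Top 10 method can be sketched as below. This is one reading of the description, not GEPIA's implementation: the function names, the rounding of the 10% sample count (integer division, at least one sample), and the tie-breaking order are assumptions. The log2FC is taken as an input because the text does not specify which samples enter its medians.

```python
def top10_percentage(tumor, normal):
    """Fraction of tumor samples in the top half of the merged ranking.

    tumor, normal: expression values (for example log2(TPM+1)) of one gene.
    """
    if not tumor or not normal:
        raise ValueError("both tumor and normal samples are required")
    k = max(1, len(tumor) // 10)          # assumed rounding of "top 10%"
    top_tumor = sorted(tumor, reverse=True)[:k]
    top_normal = sorted(normal, reverse=True)[:k]
    merged = [(v, "T") for v in top_tumor] + [(v, "N") for v in top_normal]
    # Stable sort: on exact ties, tumor samples stay ahead of normal ones.
    ranked = sorted(merged, key=lambda s: s[0], reverse=True)
    top_half = ranked[: len(ranked) // 2]
    return sum(1 for _, label in top_half if label == "T") / len(top_half)

def is_overexpressed(log2fc, percentage, fc_cutoff=1.0, pct_cutoff=0.9):
    """Apply the default thresholds: log2FC > 1 and percentage > 0.9."""
    return log2fc > fc_cutoff and percentage > pct_cutoff

tumor = [float(v) for v in range(1, 21)]      # 20 tumor samples
print(top10_percentage(tumor, [5.0, 4.0, 3.0, 2.0, 1.0]))
print(top10_percentage(tumor, [25.0, 1.0, 2.0]))
```

A percentage close to 1 means that even the highest-expressing normal samples fall below the highest-expressing tumor samples.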
For example, CLEC3A is over-expressed in breast cancer, as shown in the profile below:

Definition of differentially expressed genes
For the ANOVA and LIMMA methods, genes with higher |log2FC| values and lower q values than pre-set thresholds are considered differentially expressed genes.
For the Top 10 option, genes with higher log2FC values and higher percentage value than the thresholds are considered over-expressed genes.
Data Pre-processing
Data Collection
For resources selection, GEPIA3 collected datasets with the largest data size or the most comprehensive compiling of existing resources.
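Functionalities | Resources | Selection criteria |
---|---|---|
Expression analysis | TCGA | Integrative pan-cancer multi-omic resources |
Expression analysis | GTEx | Largest public expression data source for normal tissues |
Drug cell line responses | GDSC | Well-established systematic screening datasets for drug response in over 900 cancer cell lines |
Drug cell line responses | CTRP | Well-established systematic screening datasets for drug response in over 900 cancer cell lines |
Drug cell line responses | CREAMMIST | Standardized and aggregated drug response in different projects, providing an integrative dose-response curve across datasets |
Drug CRISPR screen | Olivieri et al. Cell. 2020 | Collected from the BioGRID ORCS database with the following criteria: a) published, b) high-throughput screens with gene size above 10000, c) human cells exposed to cancer drugs, d) systematic screens for more than 3 different drugs |
Drug CRISPR screen | Lau et al. Genome Biol. 2020 | Collected from the BioGRID ORCS database with the following criteria: a) published, b) high-throughput screens with gene size above 10000, c) human cells exposed to cancer drugs, d) systematic screens for more than 3 different drugs |
Drug CRISPR screen | Wang et al. Nucleic Acids Res. 2021 | Collected from the BioGRID ORCS database with the following criteria: a) published, b) high-throughput screens with gene size above 10000, c) human cells exposed to cancer drugs, d) systematic screens for more than 3 different drugs |
Drug CRISPR screen | Ramaker et al. BMC Cancer. 2021 | Collected from the BioGRID ORCS database with the following criteria: a) published, b) high-throughput screens with gene size above 10000, c) human cells exposed to cancer drugs, d) systematic screens for more than 3 different drugs |
Alteration network | PCAWG | Most comprehensive publicly available resource for gene alterations (both DNA and RNA) beyond expression abundance, uniformly processed across cancer types |
Alteration network | PCAWG Transcriptome Core Group et al. Nature. 2020 | Most comprehensive publicly available resource for gene alterations (both DNA and RNA) beyond expression abundance, uniformly processed across cancer types |
Protein Interaction | Xiong et al. Nat Biotechnol. 2024 | Integrates TCGA-wide mutation data with multi-source structure-defined interfaces (PDB, Interactome3D, PIONEER) to reveal enriched, functionally disrupted PPIs across cancers |
eQTL | PancanQTL | The only dataset comprehensively providing both cis- and trans-eQTLs in multiple cancer types using TCGA data |
SL/SV | CGIdb2.0 | Systematically integrated database of SL data from 33 published studies and SV data from 11 studies, currently one of the most comprehensive SL/SV datasets available |
Allele-specific Expression | PCAWG Transcriptome Core Group et al. Nature. 2020 | Most comprehensive publicly available resource for RNA alterations beyond expression abundance, uniformly processed across cancer types |
Alternative Promoter | PCAWG | Most comprehensive publicly available resource for RNA alterations beyond expression abundance, uniformly processed across cancer types |
Gene Fusion | PCAWG | Most comprehensive publicly available resource for RNA alterations beyond expression abundance, uniformly processed across cancer types |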
RNA Alteration Data Quality Control
For RNA Alterations, to ensure consistency and reliability, GEPIA3 uses data from the PCAWG project, which provides high-quality and uniformly processed data across a wide range of cancer types.
For the ASE and Gene Fusion modules, GEPIA3 directly adopts the processed results from PCAWG without further modification. The ASE data quantify allelic expression imbalances and summarize the effect sizes of influential factors as well as input mutational burden categories. The gene fusion data contains curated fusion events across different cancer types.
The Alternative Promoter module is based on promoter activity matrices from PCAWG. To enable fast queries and visualization, we applied several preprocessing steps. Sample metadata from UCSC Xena are used to classify samples by cancer type and tissue origin (tumor vs. normal). For each promoter, we compute the mean and standard deviation of relative activity within each group, count the number of non-missing observations, and perform two-sample t-tests to assess differential activity between tumor and normal tissues.
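The per-group summary described above can be sketched as follows. The helper names are illustrative, missing values are represented as `None`, and Welch's unequal-variance form of the t statistic is an assumption, since the text only says "two-sample t-tests". The p-value is omitted because the Student t distribution is not available in the Python standard library.

```python
import math
from statistics import mean, stdev

def summarize(values):
    """Mean, standard deviation and count of non-missing observations."""
    observed = [v for v in values if v is not None]
    return {
        "n": len(observed),
        "mean": mean(observed),
        "sd": stdev(observed) if len(observed) > 1 else float("nan"),
    }

def welch_t(tumor, normal):
    """Welch two-sample t statistic (assumed variant of the t-test)."""
    a, b = summarize(tumor), summarize(normal)
    standard_error = math.sqrt(a["sd"] ** 2 / a["n"] + b["sd"] ** 2 / b["n"])
    return (a["mean"] - b["mean"]) / standard_error

# Relative promoter activity per sample; None marks a missing value.
tumor_activity = [2.0, 4.0, None, 6.0]
normal_activity = [1.0, 2.0, 3.0]
print(summarize(tumor_activity))
print(welch_t(tumor_activity, normal_activity))
```

Precomputing these summaries per promoter and per group is what allows the module to answer queries without touching the full activity matrix.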
Network Data Quality Control
We used the gene-centric binary table from the ICGC PCAWG3 group as the input alterations. Firstly, we filtered out the alterations that occurred in fewer than 5 samples. We also ruled out co-occurring candidate gene pairs located on the same chromosome to eliminate the bypass co-occurrence. Secondly, we used the DISCOVER tool to test the significance of co-occurrence or mutual exclusivity. Thirdly, we used the BH method for multiple testing correction. In the downstream analysis, unless otherwise stated, we only include the 8 cancer types with more than 50 tumor samples (Kidney-RCC, Lymph-BNHL, Liver-HCC, Ovary-AdenoCA, Breast-AdenoCA, Panc-AdenoCA, Lymph-CLL and ColoRect-AdenoCA) and only include 6 types of alterations (alternative promoter, expression outlier, variants, alternative splicing, allele-specific expression and copy number variations), because fusion and RNA editing only involve a small number of alterations.
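The first filtering step can be illustrated with a small sketch. The function name and the dictionary-based inputs are hypothetical, and the DISCOVER test and BH correction that follow in the pipeline are not reproduced here.

```python
from itertools import combinations

def cooccurrence_candidate_pairs(alterations, chromosome, min_samples=5):
    """Candidate gene pairs for co-occurrence testing.

    alterations: dict gene -> list of 0/1 flags, one per sample
                 (a gene-centric binary table).
    chromosome:  dict gene -> chromosome name.
    Keeps genes altered in at least `min_samples` samples and drops pairs
    whose two genes lie on the same chromosome.
    """
    kept = sorted(g for g, flags in alterations.items()
                  if sum(flags) >= min_samples)
    return [(a, b) for a, b in combinations(kept, 2)
            if chromosome[a] != chromosome[b]]

table = {
    "GENE1": [1, 1, 1, 1, 1, 0],   # altered in 5 samples: kept
    "GENE2": [1, 1, 1, 1, 1, 1],   # altered in 6 samples: kept
    "GENE3": [1, 1, 1, 1, 1, 0],   # kept, but same chromosome as GENE1
    "GENE4": [1, 0, 0, 0, 0, 0],   # altered in 1 sample: filtered out
}
chrom = {"GENE1": "chr1", "GENE2": "chr2", "GENE3": "chr1", "GENE4": "chr3"}
print(cooccurrence_candidate_pairs(table, chrom))
```

The gene names and chromosome assignments in the example are placeholders chosen only to exercise each filter.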
Results Download
PDF and SVG downloads are available by clicking the button near the results.
License Statement: All content on this website is freely available to all users, including for commercial use.