Introduction - The Human Protein Atlas

INTRODUCTION

The Human Protein Atlas portal is a publicly available database with millions of high-resolution images showing the spatial distribution of proteins in 44 different normal human tissues and 20 different cancer types, as well as 46 different human cell lines. The data is released together with application-specific validation performed for each antibody, including immunohistochemistry, Western blot analysis and, for a large fraction, a protein array assay and immunofluorescence-based confocal microscopy. The database has been developed in a gene-centric manner and includes all human genes predicted from genome efforts. Search functionalities allow for complex queries regarding protein expression profiles, protein classes and chromosome location.

Uhlén et al (2015). Tissue-based map of the human proteome. Science.
PubMed: 25613900 DOI: 10.1126/science.1260419

Uhlén et al (2010). Towards a knowledge-based Human Protein Atlas. Nat Biotechnol.
PubMed: 21139605 DOI: 10.1038/nbt1210-1248

Berglund et al (2008). A genecentric Human Protein Atlas for expression profiles based on antibodies. Mol Cell Proteomics.
PubMed: 18669619 DOI: 10.1074/mcp.R800013-MCP200

Uhlén et al (2005). A human protein atlas for normal and cancer tissues based on antibody proteomics. Mol Cell Proteomics.
PubMed: 16127175 DOI: 10.1074/mcp.M500279-MCP200

Pontén et al (2008). The Human Protein Atlas - a tool for pathology. J Pathol.
PubMed: 18853439 DOI: 10.1002/path.2440

The Human Protein Atlas

The Human Protein Atlas contains information for a large majority of all human protein-coding genes regarding the expression and localization of the corresponding proteins based on both RNA and protein data. The atlas consists of four subparts: normal tissue, cancer, subcellular and cell lines, with each subpart containing images and data based on antibody-based proteomics and transcriptomics. The tissue atlas contains information on 44 different human tissues and organs with annotation data for altogether 83 different cell types. The transcriptomics data provide quantitative data on gene expression levels across the tissues and organs, while the antibody-based protein profiles show the spatial distribution at a single cell level for the corresponding protein in the various substructures and cell types of the tissues. Version 14 of the Human Protein Atlas contains RNA data for 99.9% and protein data for 86% of the predicted human genes and includes more than 11 million images with primary data from immunohistochemistry and immunofluorescence.

The normal tissue atlas

The normal tissue atlas contains information and images regarding the expression profiles of human genes both at the mRNA and protein level. The protein expression data is derived from annotation of immunohistochemical staining of cell populations in all major human tissues and organs, including the brain, liver, kidney, lymphoid tissues, heart, lung, skin, gastrointestinal tract, pancreas, endocrine tissues and the reproductive organs. In total, 44 different human tissues are included and contain annotation data for altogether 83 different cell types. The antibody-based protein profiles are qualitative and describe the spatial distribution, cell type specificity and the rough relative abundance of proteins in these tissues, whereas the mRNA data provide quantitative data on gene expression levels. For each gene, the immunohistochemical staining profile is taken into account together with mRNA data and gene/protein characterization data to yield an “annotated protein expression” profile. This procedure is performed for single antibodies as well as paired antibodies (two or more independent antibodies directed against different, non-overlapping epitopes on the same protein) to derive a best estimate of the correct protein expression based on available data.

Example:

MYL7
Myosin, light chain 7, regulatory.

Selective cytoplasmic expression in cardiomyocytes at the protein level, highly tissue enriched in heart muscle at the mRNA level.

The mouse brain atlas

The mouse brain atlas is a complement to the normal tissue atlas, providing a more extended overview of the brain proteome. The normal tissue atlas includes three forebrain regions (cerebral cortex, hippocampus, and lateral ventricle) and one hindbrain region (cerebellum). Adding full mouse brain sections increases the possibility of detecting protein expression that is currently not detected in the human samples. A selected set of genes is profiled in the mouse brain, providing detailed information on the regional and cellular location of proteins in the brain.

Example:

NECAB1
N-terminal EF-hand calcium binding protein 1.

Subsets of neurons showed distinct positivity in cell bodies and dendrites. Main location of the positive neurons is layer 4 of the cerebral cortex.

The cancer atlas

The cancer tissue atlas contains a multitude of human cancer specimens representing the 20 most common forms of cancer, including breast, colon, prostate, lung, urothelial, skin, endometrial and cervical cancer. Altogether 216 different cancer samples are used to generate protein expression profiles for all proteins using immunohistochemistry. The data is presented as pathology-based annotation of protein expression levels in tumor cells, along with the images underlying the annotation. This enables the identification of a potential protein signature for each given type of cancer and provides a starting point for further analyses of cancer type-specific proteins. Because the cancer atlas contains a large number of cancer samples, the available protein profiles are also a useful basis for identifying potential new cancer biomarkers.

Example:

KLK3
Kallikrein-related peptidase 3.

Selective cytoplasmic expression in prostate cancers. All other malignant tissues were negative.

The cell line atlas

The cell line atlas contains expression profiles from a diverse panel of human-derived cell lines at both the mRNA (n=45) and protein level (n=46). In addition, protein data is displayed for patient blood samples representing normal peripheral blood mononuclear cells (PBMC) and different types of leukemia and lymphoma; each antibody is tested on two samples of PBMC and ten samples representing AML, ALL, CML, and CLL. Protein expression has been assessed using immunohistochemistry (IHC), and profiles are currently available for 28% of all protein-coding genes based on 6868 well-validated antibodies. IHC staining positivity has been quantified using the automated image analysis software TMAx (Beecher Instruments, Sun Prairie, WI) (Strömberg et al. 2007). All underlying images of IHC stained cells and cell lines are displayed, along with transcript levels and relative IHC scores.

Example:

Emerin
EMD, LEMD5, STA.

Transcript and protein detected at same medium/high levels in almost all cell lines.

The subcellular atlas

Alongside the immunohistochemical pipeline generating the three sub-atlases above, antibodies are also used for confocal immunofluorescence analyses to generate a subcellular protein atlas. This sub-atlas contains high resolution, multicolor images of immunofluorescently stained cells that reveal spatial expression patterns at the subcellular level. For each antibody, two suitable human cell lines are selected for the immunofluorescence analysis on the basis of RNA expression; a third human cell line, U-2 OS, is always included. The cells are stained in a standardized way where the antibody of interest is labeled green, the cytoskeleton is labeled red, the endoplasmic reticulum is labeled yellow, and nuclei are stained blue by DAPI. The images are manually annotated in terms of localization at the organelle level, staining intensity and staining characteristics.

Example:

Ezrin
EZR, VIL2.

Protein localized to the plasma membrane in both human and mouse cells.

Background and history

The Human Protein Atlas project was initiated in 2003 with funding from the Knut and Alice Wallenberg Foundation. Primarily based in Sweden, the HPA-project involves the joint efforts of the Royal Institute of Technology in Stockholm, Uppsala University, Uppsala Akademiska University Hospital and, more recently, Science for Life Laboratory, based in both Uppsala and Stockholm. Formal collaborations are in place with groups in India, South Korea, Japan, China, Germany, France, Switzerland, USA, Canada, Denmark, Finland, The Netherlands, Spain and Italy.

The first version of the HPA-website was launched in 2005 and contained protein expression data based on approximately 700 antibodies. Since then, each new release has added more data as well as new functionalities and features to the website. Some important changes were the inclusion of cell-line data in version 2, and the inclusion of confocal images showing subcellular localizations in version 3. Version 3 also included a new search function that allowed for building queries. In version 4, the overall database structure was shifted from an antibody-centric structure to a gene-centric structure in order to include information on all genes predicted by Ensembl. The next major restructuring came in 2010 with version 7, when the concept of annotated protein expression for paired antibodies (two independent antibodies directed against different, non-overlapping epitopes on the same protein) was introduced. In 2013, version 12 of the protein atlas database was complemented with transcriptomics profiles from 27 normal tissues, and the format with four sub-atlases was introduced. Version 13 was released at the end of 2014 and included an analysis of all major organs and tissues in the human body using transcriptomics and antibody-based profiling. The results were summarized on interactive knowledge-pages divided into 7 human proteomes and 27 tissues and organs.

Strategy for high-throughput proteomics

The high-throughput approach to human proteomics rests on two main pillars: the streamlined production of antibodies and the use of tissue microarray (TMA) technology for immunohistochemistry. The antibody production process begins with a bioinformatics analysis of the protein-coding part of the genome. For every protein, the amino acid sequence is compared to all other putative protein-coding genes to identify a stretch of 50-150 amino acids that has as low homology as possible with respect to all other proteins. Transmembrane regions, including hydrophobic and less immunogenic regions, are avoided. These sequences are then cloned from cDNA libraries using specifically designed primers and transformed into E. coli bacteria that produce the corresponding peptide chain, here called a PrEST (Protein Epitope Signature Tag). The PrEST is used for various applications, including immunization to produce antibodies and affinity purification of the polyclonal antisera. Numerous quality assurance and validation steps are included throughout this production chain, and all generated antibodies undergo a validation regime and basic characterization before being approved for profiling on tissue microarrays.
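
The region-selection idea can be illustrated with a minimal sketch. It is not the project's actual pipeline: the text does not say how homology is measured, so the sketch assumes a simple stand-in, the fraction of a window's k-mers that also occur in any other protein. The function names, the k-mer length and the toy sequences are hypothetical, and the avoidance of transmembrane regions is not modeled.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def pick_low_homology_window(target, others, window=50, k=5):
    """Find the window of `target` that shares the fewest k-mers with `others`.

    Returns (start, score), where score is the fraction of the window's
    k-mers found in any other sequence (0.0 means nothing is shared).
    Ties are resolved in favor of the earliest window.
    ASSUMPTION: shared k-mers stand in for a real homology search.
    """
    if window < k:
        raise ValueError("window must be at least k residues long")
    if len(target) < window:
        raise ValueError("target is shorter than the window")

    # Pool the k-mers of every other protein into one background set.
    background = set()
    for seq in others:
        background |= kmers(seq, k)

    best_start, best_score = 0, float("inf")
    for start in range(len(target) - window + 1):
        window_kmers = kmers(target[start:start + window], k)
        score = len(window_kmers & background) / len(window_kmers)
        if score < best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Toy example with short sequences; real windows would span 50-150 residues.
start, score = pick_low_homology_window(
    "AAAAAAAACDEFGHIK", ["AAAAAAAA"], window=8, k=3
)
print(start, score)
```

In the toy example the chosen window begins where the sequence stops resembling the other protein, which is the behavior the text describes for selecting a PrEST region.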

The tissue microarray technology enables high-throughput immunohistochemistry on multiple samples within a single experiment. The tissue microarrays used in the Human Protein Atlas project typically consist of 72 different tissue samples, each one punched out as a 1 mm diameter core from formalin-fixed paraffin-embedded tissue blocks. These sampled cores are then arranged in a matrix on a single receiver paraffin block. The resulting receiver block (or TMA) is subsequently sectioned into 200-250 sections that are used for separate immunohistochemical staining experiments. Using this approach, many data points are generated under similar conditions, reducing intra-experimental variation, and saving both time and cost as compared to staining all tissues separately. The Human Protein Atlas project routinely generates protein expression profiles for each antibody by staining 9 different standardized TMA-sets containing samples from 44 different normal human tissues, 20 different cancer types, 46 different human cell lines and 6 hematopoietic cell types from patients.
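
The throughput gain follows from simple arithmetic on the figures given above, worked through here as a short sketch. The constant and function names are illustrative only; the numbers are taken from the text.

```python
# Figures quoted in the text.
CORES_PER_TMA = 72               # tissue cores arranged on one receiver block
SECTIONS_PER_BLOCK = (200, 250)  # sections cut from one block (low, high)

# Sample categories covered by the standardized TMA-sets.
SAMPLE_TYPES = {
    "normal tissues": 44,
    "cancer types": 20,
    "cell lines": 46,
    "hematopoietic cell types": 6,
}


def stained_cores_per_block(cores=CORES_PER_TMA, sections=SECTIONS_PER_BLOCK):
    """Return the (low, high) number of stained cores one block can yield."""
    low, high = sections
    return cores * low, cores * high


print(stained_cores_per_block())   # (14400, 18000)
print(sum(SAMPLE_TYPES.values()))  # 116
```

Since each section is used for a separate staining experiment, a single block supports 200-250 experiments, each covering all 72 cores at once.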

The stained TMA-slides are then scanned using a digital slide scanner, and all tissue cores are separated into individual image files that are uploaded to internal annotation software and annotated by pathologists. The stainings are scored with respect to the intensity of immunoreactivity, the fraction of immunostained cells and the cellular localization of immunoreactivity. The annotation output is then reviewed and compared with available mRNA data, gene/protein characterization data and, if applicable, results from paired antibodies (antibodies directed towards non-overlapping epitopes on the same protein), before it is finally approved for publication on the Human Protein Atlas website.
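
The three scoring dimensions named above could be captured in a record like the one sketched below. This is a hypothetical data structure, not the schema of the internal annotation software: the class name, the field names and the intensity labels are assumptions made for illustration, since the text names the dimensions but not their allowed values.

```python
from dataclasses import dataclass

# ASSUMPTION: illustrative intensity labels, not taken from the source.
INTENSITY_LEVELS = ("negative", "weak", "moderate", "strong")


@dataclass(frozen=True)
class CoreAnnotation:
    """One pathologist's annotation of a single stained tissue core."""
    tissue: str
    intensity: str           # intensity of immunoreactivity
    fraction_stained: float  # fraction of immunostained cells, 0.0 to 1.0
    localization: tuple      # cellular localization(s) of immunoreactivity

    def __post_init__(self):
        # Reject values outside the (assumed) controlled vocabulary and range.
        if self.intensity not in INTENSITY_LEVELS:
            raise ValueError(f"unknown intensity: {self.intensity!r}")
        if not 0.0 <= self.fraction_stained <= 1.0:
            raise ValueError("fraction_stained must be between 0.0 and 1.0")


# Hypothetical example record.
annotation = CoreAnnotation(
    tissue="heart muscle",
    intensity="strong",
    fraction_stained=0.9,
    localization=("cytoplasmic",),
)
print(annotation)
```

Validating each record when it is created keeps malformed annotations out of the later review step, where they are compared with mRNA data and paired-antibody results.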

The Human Protein Atlas project is funded by the Knut & Alice Wallenberg foundation.