Clinical Proteomics
METHODS IN MOLECULAR BIOLOGY™
John M. Walker, SERIES EDITOR

447. Alcohol: Methods and Protocols, edited by Laura E. Nagy, 2008
446. Post-translational Modification of Proteins: Tools for Functional Proteomics, Second Edition, edited by Christoph Kannicht, 2008
443. Molecular Modeling of Proteins, edited by Andreas Kukol, 2008
439. Genomics Protocols: Second Edition, edited by Mike Starkey and Ramnanth Elaswarapu, 2008
438. Neural Stem Cells: Methods and Protocols, Second Edition, edited by Leslie P. Weiner, 2008
437. Drug Delivery Systems, edited by Kewal K. Jain, 2008
436. Avian Influenza Virus, edited by Erica Spackman, 2008
435. Chromosomal Mutagenesis, edited by Greg Davis and Kevin J. Kayser, 2008
434. Gene Therapy Protocols: Volume 2: Design and Characterization of Gene Transfer Vectors, edited by Joseph M. LeDoux, 2008
433. Gene Therapy Protocols: Volume 1: Production and In Vivo Applications of Gene Transfer Vectors, edited by Joseph M. LeDoux, 2007
432. Organelle Proteomics, edited by Delphine Pflieger and Jean Rossier, 2008
431. Bacterial Pathogenesis: Methods and Protocols, edited by Frank DeLeo and Michael Otto, 2008
430. Hematopoietic Stem Cell Protocols, edited by Kevin D. Bunting, 2008
429. Molecular Beacons: Signalling Nucleic Acid Probes, Methods and Protocols, edited by Andreas Marx and Oliver Seitz, 2008
428. Clinical Proteomics: Methods and Protocols, edited by Antonia Vlahou, 2008
427. Plant Embryogenesis, edited by Maria Fernanda Suarez and Peter Bozhkov, 2008
426. Structural Proteomics: High-Throughput Methods, edited by Bostjan Kobe, Mitchell Guss, and Huber Thomas, 2008
425. 2D PAGE: Volume 2: Applications and Protocols, edited by Anton Posch, 2008
424. 2D PAGE: Volume 1: Sample Preparation and Pre-Fractionation, edited by Anton Posch, 2008
423. Electroporation Protocols, edited by Shulin Li, 2008
422. Phylogenomics, edited by William J. Murphy, 2008
421. Affinity Chromatography: Methods and Protocols, Second Edition, edited by Michael Zachariou, 2008
420. Drosophila: Methods and Protocols, edited by Christian Dahmann, 2008
419. Post-Transcriptional Gene Regulation, edited by Jeffrey Wilusz, 2008
418. Avidin-Biotin Interactions: Methods and Applications, edited by Robert J. McMahon, 2008
417. Tissue Engineering, Second Edition, edited by Hannsjörg Hauser and Martin Fussenegger, 2007
416. Gene Essentiality: Protocols and Bioinformatics, edited by Svetlana Gerdes and Andrei L. Osterman, 2008
415. Innate Immunity, edited by Jonathan Ewbank and Eric Vivier, 2007
414. Apoptosis in Cancer: Methods and Protocols, edited by Gil Mor and Ayesha Alvero, 2008
413. Protein Structure Prediction, Second Edition, edited by Mohammed Zaki and Chris Bystroff, 2008
412. Neutrophil Methods and Protocols, edited by Mark T. Quinn, Frank R. DeLeo, and Gary M. Bokoch, 2007
411. Reporter Genes for Mammalian Systems, edited by Don Anson, 2007
410. Environmental Genomics, edited by Cristofre C. Martin, 2007
409. Immunoinformatics: Predicting Immunogenicity In Silico, edited by Darren R. Flower, 2007
408. Gene Function Analysis, edited by Michael Ochs, 2007
407. Stem Cell Assays, edited by Vemuri C. Mohan, 2007
406. Plant Bioinformatics: Methods and Protocols, edited by David Edwards, 2007
405. Telomerase Inhibition: Strategies and Protocols, edited by Lucy Andrews and Trygve O. Tollefsbol, 2007
404. Topics in Biostatistics, edited by Walter T. Ambrosius, 2007
403. Patch-Clamp Methods and Protocols, edited by Peter Molnar and James J. Hickman, 2007
402. PCR Primer Design, edited by Anton Yuryev, 2007
401. Neuroinformatics, edited by Chiquito J. Crasto, 2007
400. Methods in Membrane Lipids, edited by Alex Dopico, 2007
399. Neuroprotection Methods and Protocols, edited by Tiziana Borsello, 2007
398. Lipid Rafts, edited by Thomas J. McIntosh, 2007
397. Hedgehog Signaling Protocols, edited by Jamila I. Horabin, 2007
396. Comparative Genomics, Volume 2, edited by Nicholas H. Bergman, 2007
395. Comparative Genomics, Volume 1, edited by Nicholas H. Bergman, 2007
394. Salmonella: Methods and Protocols, edited by Heide Schatten and Abraham Eisenstark, 2007
393. Plant Secondary Metabolites, edited by Harinder P. S. Makkar, P. Siddhuraju, and Klaus Becker, 2007
392. Molecular Motors: Methods and Protocols, edited by Ann O. Sperry, 2007
391. MRSA Protocols, edited by Yinduo Ji, 2007
390. Protein Targeting Protocols, Second Edition, edited by Mark van der Giezen, 2007
389. Pichia Protocols, Second Edition, edited by James M. Cregg, 2007
388. Baculovirus and Insect Cell Expression Protocols, Second Edition, edited by David W. Murhammer, 2007
387. Serial Analysis of Gene Expression (SAGE): Digital Gene Expression Profiling, edited by Kare Lehmann Nielsen, 2007
386. Peptide Characterization and Application Protocols, edited by Gregg B. Fields, 2007
385. Microchip-Based Assay Systems: Methods and Applications, edited by Pierre N. Floriano, 2007
METHODS IN MOLECULAR BIOLOGY™
Clinical Proteomics Methods and Protocols
Edited by
Antonia Vlahou Biomedical Research Foundation, Academy of Athens, Athens, Greece
Editor
Antonia Vlahou
Biomedical Research Foundation, Academy of Athens
Athens 115 27, Greece

Series Editor
John M. Walker
School of Life Sciences, University of Hertfordshire
Hatfield, Herts., AL10 9AB, UK
ISBN: 978-1-58829-837-9
e-ISBN: 978-1-59745-117-8
Library of Congress Control Number: 2007939413

©2008 Humana Press, a part of Springer Science+Business Media, LLC

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Humana Press, 999 Riverview Drive, Suite 208, Totowa, NJ 07512 USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. While the advice and information in this book are believed to be true and accurate at the date of going to press, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

springer.com
Preface
Clinical proteomics has rapidly evolved over the past few years and is continuously growing as new methodologies and technologies emerge. In this volume, leading researchers in the field have contributed their state-of-the-art methodologies on protein profiling and identification of disease biomarkers in tissues, microdissected cells, and body fluids. Experimental approaches involving application of two-dimensional electrophoresis, multidimensional liquid chromatography, SELDI/MALDI mass spectrometry and protein arrays, as well as the bioinformatics and statistical tools pertinent to the analysis of proteomics data are described. As stated in the introductory chapter by Prof. Paik, the Vice President of the Human Proteome Organization, “clinical proteomics needs the integration of biochemistry, pathology, analytical technology, bioinformatics, and proteome informatics to develop highly sensitive diagnostic tools for routine clinical care in the future.” The multi-disciplinary character of clinical proteomics approaches is evident in the detailed step-by-step protocols described in this volume, which makes them of potential use to a wide range of researchers, including clinicians, molecular biologists, chemists, bioinformaticians, and computational biologists.

Antonia Vlahou
Acknowledgments
The editor gratefully acknowledges all contributing authors for their collaboration, which made this project possible and brought it to fruition; the series editor, Prof. John Walker, whose help and guidance have been instrumental; and Mr. Patrick Marton, Mr. David Casey, and the whole production team at Humana, headed by the late Mr. Tom Laningan, for the excellent production of this book.
Contents

Preface
Acknowledgments
Contributors

1. Overview and Introduction to Clinical Proteomics
   Young-Ki Paik, Hoguen Kim, Eun-Young Lee, Min-Seok Kwon, and Sang Yun Cho

Part I: Specimen Collection for Clinical Proteomics

2. Specimen Collection and Handling: Standardization of Blood Sample Collection
   Harald Tammen
3. Tissue Sample Collection for Proteomics Analysis
   Jose I. Diaz, Lisa H. Cazares, and O. John Semmes

Part II: Clinical Proteomics by 2DE and Direct MALDI/SELDI MS Profiling

4. Protein Profiling of Human Plasma Samples by Two-Dimensional Electrophoresis
   Sang Yun Cho, Eun-Young Lee, Hye-Young Kim, Min-Jung Kang, Hyoung-Joo Lee, Hoguen Kim, and Young-Ki Paik
5. Analysis of Laser Capture Microdissected Cells by 2-Dimensional Gel Electrophoresis
   Daohai Zhang and Evelyn Siew-Chuan Koay
6. Optimizing the Difference Gel Electrophoresis (DIGE) Technology
   David B. Friedman and Kathryn S. Lilley
7. MALDI/SELDI Protein Profiling of Serum for the Identification of Cancer Biomarkers
   Lisa H. Cazares, Jose I. Diaz, Rick R. Drake, and O. John Semmes
8. Urine Sample Preparation and Protein Profiling by Two-Dimensional Electrophoresis and Matrix-Assisted Laser Desorption Ionization Time of Flight Mass Spectroscopy
   Panagiotis G. Zerefos and Antonia Vlahou
9. Combining Laser Capture Microdissection and Proteomics Techniques
   Dana Mustafa, Johan M. Kros, and Theo Luider

Part III: Clinical Proteomics by LC-MS Approaches

10. Comparison of Protein Expression by Isotope-Coded Affinity Tag Labeling
    Zhen Xiao and Timothy D. Veenstra
11. Analysis of Microdissected Cells by Two-Dimensional LC-MS Approaches
    Chen Li, Yi-Hong, Ye-Xiong Tan, Jian-Hua Ai, Hu Zhou, Su-Jun Li, Lei Zhang, Qi-Chang Xia, Jia-Rui Wu, Hong-Yang Wang, and Rong Zeng
12. Label-Free LC-MS Method for the Identification of Biomarkers
    Richard E. Higgs, Michael D. Knierman, Valentina Gelfanova, Jon P. Butler, and John E. Hale
13. Analysis of the Extracellular Matrix and Secreted Vesicle Proteomes by Mass Spectrometry
    Zhen Xiao, Thomas P. Conrads, George R. Beck, Jr., and Timothy D. Veenstra

Part IV: Clinical Proteomics and Antibody Arrays

14. Miniaturized Parallelized Sandwich Immunoassays
    Hsin-Yun Hsu, Silke Wittemann, and Thomas O. Joos
15. Dissecting Cancer Serum Protein Profiles Using Antibody Arrays
    Marta Sanchez-Carbayo

Part V: Statistics and Bioinformatics in Clinical Proteomics Data Analysis

16. 2D-PAGE Maps Analysis
    Emilio Marengo, Elisa Robotti, and Marco Bobba
17. Finding the Significant Markers: Statistical Analysis of Proteomic Data
    Sebastien Christian Carpentier, Bart Panis, Rony Swennen, and Jeroen Lammertyn
18. Web-Based Tools for Protein Classification
    Costas D. Paliakasis, Ioannis Michalopoulos, and Sophia Kossida
19. Open-Source Platform for the Analysis of Liquid Chromatography-Mass Spectrometry (LC-MS) Data
    Matthew Fitzgibbon, Wendy Law, Damon May, Andrea Detter, and Martin McIntosh
20. Pattern Recognition Approaches for Classifying Proteomic Mass Spectra of Biofluids
    Ray L. Somorjai

Index
Contributors

Jian-Hua Ai • Eastern Hepatobiliary Surgery Hospital, Shanghai, China
George R. Beck, Jr. • Division of Endocrinology, Metabolism and Lipids, Emory University School of Medicine, Atlanta, GA
Marco Bobba • University of Eastern Piedmont, Department of Environmental and Life Sciences, Alessandria, Italy
Jon P. Butler • Lilly Corporate Center, Indianapolis, IN
Sebastien Christian Carpentier • Faculty of Bioscience Engineering, Division of Crop Biotechnics, K.U. Leuven, Leuven, Belgium
Lisa H. Cazares • The George L. Wright Jr. Center for Biomedical Proteomics, Eastern Virginia Medical School, Norfolk, VA
Sang Yun Cho • Yonsei Biomedical Proteome Research Center, Department of Biochemistry, College of Sciences, Seoul, Korea
Thomas P. Conrads • Laboratory of Proteomics and Analytical Technologies, SAIC-Frederick, Inc., National Cancer Institute at Frederick, Frederick, MD
Andrea Detter • Fred Hutchinson Cancer Research Center, Seattle, WA
Jose I. Diaz • Cancer Therapy Research Center’s Institute for Drug Development, University of Texas, Health Science Center, San Antonio, TX
Rick R. Drake • Eastern Virginia Medical School, Norfolk, VA
Matthew Fitzgibbon • Fred Hutchinson Cancer Research Center, Seattle, WA
David B. Friedman • Proteomics Laboratory, Mass Spectrometry Research Center, Department of Biochemistry, Vanderbilt University School of Medicine, Nashville, TN
Valentina Gelfanova • Lilly Corporate Center, Indianapolis, IN
John E. Hale • Lilly Corporate Center, Indianapolis, IN
Richard E. Higgs • Lilly Corporate Center, Indianapolis, IN
Yi-Hong • Eastern Hepatobiliary Surgery Hospital, Shanghai, China
Hsin-Yun Hsu • Biochemistry Department, NMI Natural and Medical Sciences Institute at the University of Tuebingen, Reutlingen, Germany
Thomas O. Joos • Biochemistry Department, NMI Natural and Medical Sciences Institute at the University of Tuebingen, Reutlingen, Germany
Min-Jung Kang • Yonsei Biomedical Proteome Research Center, Department of Biochemistry, College of Sciences, Seoul, Korea
Hoguen Kim • Department of Pathology, College of Medicine, Yonsei University, Seoul, Korea
Hye-Young Kim • Yonsei Biomedical Proteome Research Center, Department of Biochemistry, College of Sciences, Seoul, Korea
Michael D. Knierman • Lilly Corporate Center, Indianapolis, IN
Evelyn Siew-Chuan Koay • Department of Pathology, Yong Loo Lin School of Medicine, National University of Singapore, and Molecular Diagnosis Center, Department of Laboratory Medicine, National University Hospital, Singapore
Sophia Kossida • Division of Biotechnology, Biomedical Research Foundation, Academy of Athens, Athens, Greece
Johan M. Kros • Department of Pathology, Josephine Nefkens Institute, Erasmus Medical Center, Rotterdam, The Netherlands
Min-Seok Kwon • Yonsei Biomedical Proteome Research Center, Department of Biochemistry, College of Sciences, Seoul, Korea
Jeroen Lammertyn • Faculty of Bioscience Engineering, Division of Mechatronics, Biostatistics and Sensors, K.U. Leuven, Leuven, Belgium
Wendy Law • Fred Hutchinson Cancer Research Center, Seattle, WA
Eun-Young Lee • Yonsei Biomedical Proteome Research Center, Department of Biochemistry, College of Sciences, Seoul, Korea
Hyoung-Joo Lee • Yonsei Biomedical Proteome Research Center, Department of Biochemistry, College of Sciences, Seoul, Korea
Chen Li • Research Center for Proteome Analysis, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
Su-Jun Li • Research Center for Proteome Analysis, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
Kathryn S. Lilley • Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, United Kingdom
Theo Luider • Laboratories of Neuro-Oncology/Clinical and Cancer Proteomics, Josephine Nefkens Institute, Erasmus Medical Center, Rotterdam, The Netherlands
Emilio Marengo • Department of Environmental and Life Sciences, University of Eastern Piedmont, Alessandria, Italy
Damon May • Fred Hutchinson Cancer Research Center, Seattle, WA
Martin McIntosh • Fred Hutchinson Cancer Research Center, Seattle, WA
Ioannis Michalopoulos • Biomedical Research Foundation, Academy of Athens, Athens, Greece
Dana Mustafa • Department of Pathology, Josephine Nefkens Institute, Erasmus Medical Center, Rotterdam, The Netherlands
Young-Ki Paik • Department of Biochemistry, Yonsei Proteome Research Center & Biomedical Proteome Research Center, Seoul, Korea
Costas D. Paliakasis • Biomedical Research Foundation, Academy of Athens, Athens, Greece
Bart Panis • Faculty of Bioscience Engineering, Division of Crop Biotechnics, K.U. Leuven, Leuven, Belgium
Elisa Robotti • Department of Environmental and Life Sciences, University of Eastern Piedmont, Alessandria, Italy
Marta Sanchez-Carbayo • Tumor Markers Group, Spanish National Cancer Center (CNIO), Madrid, Spain
O. John Semmes • The George L. Wright Jr. Center for Biomedical Proteomics, Eastern Virginia Medical School, Norfolk, VA
Ray L. Somorjai • Biomedical Informatics, Institute for Biodiagnostics, National Research Council, Winnipeg, Manitoba, Canada
Rony Swennen • Faculty of Bioscience Engineering, Division of Crop Biotechnics, K.U. Leuven, Leuven, Belgium
Harald Tammen • Digilab BioVisioN GmbH, Hannover, Germany
Ye-Xiong Tan • Eastern Hepatobiliary Surgery Hospital, Shanghai, China
Timothy D. Veenstra • Laboratory of Proteomics and Analytical Technologies, SAIC-Frederick, Inc., National Cancer Institute at Frederick, Frederick, MD
Antonia Vlahou • Division of Biotechnology, Biomedical Research Foundation, Academy of Athens, Athens, Greece
Hong-Yang Wang • Eastern Hepatobiliary Surgery Hospital, Shanghai, China
Silke Wittemann • Biochemistry Department, NMI Natural and Medical Sciences Institute at the University of Tuebingen, Reutlingen, Germany
Jia-Rui Wu • Research Center for Proteome Analysis, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
Qi-Chang Xia • Research Center for Proteome Analysis, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
Zhen Xiao • Laboratory of Proteomics and Analytical Technologies, SAIC-Frederick, Inc., National Cancer Institute at Frederick, Frederick, MD
Rong Zeng • Research Center for Proteome Analysis, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
Panagiotis G. Zerefos • Division of Biotechnology, Biomedical Research Foundation, Academy of Athens, Athens, Greece
Daohai Zhang • Molecular Diagnosis Center, Department of Laboratory Medicine, National University Hospital, Singapore, and Department of Pathology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
Lei Zhang • Research Center for Proteome Analysis, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
Hu Zhou • Research Center for Proteome Analysis, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
1 Overview and Introduction to Clinical Proteomics
Young-Ki Paik, Hoguen Kim, Eun-Young Lee, Min-Seok Kwon, and Sang Yun Cho

Summary
As the field of clinical proteomics progresses, discovery of disease biomarkers becomes paramount. However, the immediate challenges are to establish standard operating procedures for both clinical specimen handling and reduction of sample complexity and to increase the ability to detect proteins and peptides present in low amounts. The traditional concept of a disease biomarker is shifting toward a new paradigm, namely, that an ensemble of proteins or peptides would be more efficient than a single protein/peptide in the diagnosis of disease. Because clinical proteomics usually requires easy access to well-defined fresh clinical specimens (including morphologically consistent tissue and properly pretreated body fluids of sufficient quantity), biorepository systems need to be established. Here, we address these questions and emphasize the necessity of developing various microdissection techniques for tissue specimens, multidimensional fractionation for body fluids, and other related techniques (including bioinformatics), tools which could become integral parts of clinical proteomics for disease biomarker discovery.

Key Words: biomarker; body fluids; clinical proteomics; translational proteomics; depletion; biorepository; multidimensional fractionation; specimen bank; biomarker panel.

Abbreviations: CSF: Cerebrospinal Fluid; SILAC: Stable Isotope Labeling with Amino acids in Cell culture; FFE: Free Flow Electrophoresis; IMAC: Immobilized Metal Affinity Chromatography; 2DE: 2-Dimensional Gel Electrophoresis; CBB: Coomassie Brilliant Blue; SELDI: Surface-Enhanced Laser Desorption/Ionization; MALDI: Matrix-Assisted Laser Desorption/Ionization; MDLC: Multi-dimensional Liquid Chromatography; LC: Liquid Chromatography; TOF: Time-of-Flight; CID: Collision-Induced Dissociation; ETD: Electron Transfer Dissociation; LIT: Linear Ion-Trap; FT: Fourier-Transform; Q: Quadrupole; ELISA: Enzyme-Linked Immunosorbent Assay; SISCAPA: Stable Isotope Standards with Capture by Anti-Peptide Antibody; AQUA: Absolute Quantitative Analysis.

Commercial brands are also shown: MARS: Multiple Affinity Removal System (Agilent, Palo Alto, CA, USA); Enchant™: Enchant™ Multi-protein Affinity Separation Kit (Pall Life Sciences, Ann Arbor, MI, USA); Gradiflow™: Gradiflow™ Separation (Life Bioprocess, Frenchs Forest, Australia); FFE™: BD Free Flow Electrophoresis System (BD Diagnostics, Martinsried/Planegg, Germany); Zoom®: Zoom® Benchtop Proteomics System (Invitrogen Corporation, Carlsbad, CA, USA); Rotofor: Bio-Rad Rotofor® Prep IEF Cell (Bio-Rad, Hercules, CA, USA); PF2D: ProteomeLab™ PF2D Protein Fractionation System (Beckman Coulter, Inc., Fullerton, CA, USA); DIGE: Ettan™ DIGE System (GE Healthcare Bio-Sciences AB, Uppsala, Sweden); Deep Purple™: Deep Purple™ Total Protein Stain (GE Healthcare Bio-Sciences AB, Uppsala, Sweden); ICAT™: Isotope-coded affinity tags (Applied Biosystems, Foster City, CA, USA); iTRAQ™: iTRAQ™ Reagents (Applied Biosystems, Foster City, CA, USA); Q-TRAP™: (Applied Biosystems, Foster City, CA, USA).

From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols. Edited by: A. Vlahou © Humana Press, Totowa, NJ
1. Overview and Scope of Clinical Proteomics

Clinical proteomics is defined as the comprehensive qualitative and quantitative profiling of proteins (and peptides) present in clinical specimens such as body fluids and tissues. Comparing specimens from healthy and diseased individuals may lead to the discovery of a disease biomarker (1). A biomarker serves as a molecular signature reflecting stages of disease before or after treatment and can also be used for prognostic purposes in monitoring the response to treatment (2). Clinical proteomics consists of a variety of experimental processes, which include the collection of well-phenotyped clinical specimens, analysis of proteins or peptides of interest, data interpretation, and validation of proteomics data in a clinical context (Fig. 1). After successful identification of a few disease biomarker candidates through extensive profiling, translational proteomics involving validation with a cohort study follows. Even after proper identification and verification of a disease biomarker, it takes quite a long time to prove that the biomarker is applicable to clinical diagnosis or prognosis (3,4).

There has been a remarkable increase in the publication of clinical proteomics papers within a short period of time [more than 800 papers in 2006 (Fig. 2)], coinciding with the rapid growth of proteomics. Reflecting this trend, this chapter reviews the core technologies used in clinical proteomics: sample specimen processing, protein separation platforms (e.g., gel-based systems or liquid-based methods), quantitative labeling, mass spectrometry (MS), and proteome informatics tools. Despite the advent of new technologies, several bottlenecks remain in the proteomics field, such as the lack of dataset standardization, difficulties in quantifying the proteins of interest and in verifying the proteins or peptides identified, and the absence of an overall strategy for handling biomarkers after identification. Thus, the pace of biomarker discovery, one of the key agendas of clinical proteomics, will depend on how well these bottlenecks are resolved by technical advancement (4). The following sections address these issues in the context of clinical proteomics.

Fig. 1. Clinical and translational proteomics. The key components of experimental methods are included in each box.
Fig. 2. Recent trends in clinical proteomics publications. The distribution of the articles related to clinical proteomics listed in PubMed is shown here. The key words used for searching articles are as follows: query (clinical[All Fields] OR ((“biological markers”[TIAB] NOT Medline[SB]) OR “biological markers”[MeSH Terms] OR biomarker[Text Word])) AND (“proteomics”[MeSH Terms] OR proteomics[Text Word] OR proteomic[All Fields] OR “proteome”[MeSH Terms] OR proteome[Text Word]).
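The search described in the Fig. 2 caption could be repeated programmatically. The following is a minimal sketch, assuming the NCBI E-utilities `esearch` endpoint is used with its `db`, `term`, `datetype`, `mindate`, `maxdate`, and `rettype` parameters; the caption does not say how the authors ran the query. The names `FIG2_QUERY` and `yearly_count_url` and the year range in the usage loop are hypothetical, chosen only for illustration. The sketch only builds the request URLs and does not contact the server.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Query from the Fig. 2 caption, with typographic quotes replaced by ASCII quotes.
FIG2_QUERY = (
    '(clinical[All Fields] OR (("biological markers"[TIAB] NOT Medline[SB]) '
    'OR "biological markers"[MeSH Terms] OR biomarker[Text Word])) '
    'AND ("proteomics"[MeSH Terms] OR proteomics[Text Word] '
    'OR proteomic[All Fields] OR "proteome"[MeSH Terms] OR proteome[Text Word])'
)

# NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def yearly_count_url(year: int, query: str = FIG2_QUERY) -> str:
    """Build an esearch URL that asks only for the PubMed hit count of one year."""
    params = {
        "db": "pubmed",        # search the PubMed database
        "term": query,         # the search expression
        "datetype": "pdat",    # restrict by publication date
        "mindate": str(year),  # first year included
        "maxdate": str(year),  # last year included
        "rettype": "count",    # return the count only, not the ID list
    }
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical year range; the caption does not state the range plotted.
    for year in range(2000, 2007):
        print(year, yearly_count_url(year))
```

Fetching each URL returns a short XML document whose count element holds the number of matching records for that year. Counts obtained today may differ from those in the figure because PubMed indexing changes over time.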
2. Sample Specimens and Processing Techniques Used for Clinical Proteomics

2.1. General Considerations

Because clinical proteomics relies heavily on patient specimens, three important factors need to be considered before clinical specimens are selected and prepared: (1) selection of the correct clinical samples according to the type of research, (2) isolation of the appropriate component from the clinical samples, and (3) establishment of optimal experimental conditions for each sample (5,6,7,8). In selecting the correct clinical samples, the relationship between the samples and the specific disease should also be considered. For example, although cancer tissue represents a specific cancer, several types of body fluids from patients may also have a relationship to the cancer. If the selected clinical samples specifically represent the disease, the next step is to evaluate which components are related to it. Tumor cells in cancerous tissues, for instance, are surrounded by many types of stromal cells, inflammatory cells, and connective tissues that are directly related to changes in protein expression in the cancer. If the purpose of proteomic analysis is to identify characteristic changes of specific proteins in tumor cells, then the tumor cell percentage, which can be increased by tissue microdissection, needs to be determined precisely (5,6,7).

Because sample specimen conditions directly impact the results of biomarker discovery, well-defined clinical specimens should be used: the discovery of disease biomarkers is much easier when the samples have clear anatomical and pathophysiological definitions. Because clinical specimens are heterogeneous, sophisticated pathological discrimination is required for the isolation of specific diseased tissue or body fluids. Without the expertise of a pathologist at the earliest stage, it may be difficult to isolate a specifically defined specimen for clinical proteomics.

Generally, clinical samples contain variable factors and components originating from the microenvironment of specific tissues. For instance, liver tissues usually contain a large amount of blood in the sinusoids, and this amount is increased in tissues with dilated sinusoids (9). Lung tissues usually contain deposited exogenous materials, and this amount is increased in heavy smokers (10). Note that the amount of blood present in isolated tissues may directly influence the relative proportion of proteins found in clinical specimens. Deposited materials and other chemicals used in microdissection, such as staining dyes and fixatives, may also influence the experimental conditions (11). In the analysis of clinical samples, suitable buffer conditions, minimal lysis time, and high-yield protein precipitation are highly recommended. To avoid substantial variations between experiments using clinical specimens, a large set of specimens is also necessary because, unlike cultured cell lines, clinical specimens have high component variability (12). More details on specific disease types are described throughout this volume.

2.2. Body Fluids

Surveying the literature, there appear to be five or six different types of clinical specimens. Body fluids (e.g., plasma, urine, tears, cerebrospinal fluid, lymph, and ascites), tissues (e.g., liver, heart, muscle, brain, and lung), cells, bone, and hair have all been used for clinical proteomics (Table 1) (13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33). Each has its own merits and limitations for biomarker discovery via proteomic analysis. Among these specimens, the number of publications using body fluids has increased recently, perhaps because of their convenience and ease of use for noninvasive diagnosis. Since proteins secreted into body fluids during or after disease may reflect a broad range of pathophysiological conditions, much emphasis has been given to the identification of prominent protein/peptide biomarkers that exhibit differential expression at different stages. In the literature, the terms “body fluids” and “biofluids” are used interchangeably, although the former indicates a greater likelihood of being obtained directly from patients, while the latter is applied more broadly, referring to liquid or liquid-like samples obtained from living organisms including model animals and plants. Throughout this chapter we will use “body fluids” for clarity. Given the large dynamic range of protein and peptide sources, plasma (a complex liquid interface between tissues) and extracellular fluids may be the best body fluids to use for clinical proteomics and biomarker discovery (34,35,36,37,38). In addition to plasma, more than a dozen additional body fluids are currently used for biomarker discovery, ranging from urine to peritoneal fluids (Table 1).
However, the biggest challenge in body fluid proteomics may be the multiple pretreatment processes required prior to analysis, including depletion of high-abundance proteins (in the case of plasma) (34,35,36) and/or protein enrichment (in the case of urine) (15,39) (Table 1). Thus, the outcome of clinical proteomics may depend on proper sample processing, since the selection and handling of the most specific type of specimen will affect the overall profiling pattern. Because the details of body fluid proteomics have been well described by Hu et al. (38), we focus on only a few essential points. First, standard measures need to be introduced to protect specimens from nonspecific proteolysis, lysis, and modification during collection and preparation (11). For the standardization of blood sample collection, Tammen emphasizes many useful considerations regarding preanalytical variables in plasma proteomics, which can be applied to processes involving blood specimens [(40) and see Chapter 2]. The more specific problems involved in sample
Table 1 Types of Biological Specimens Used in Clinical Proteomics

Type | Specimen | Disease | Reference
Fluid | Plasma/serum | - | (13,14)
Fluid (secretions) | Urine | Prostate cancer | (15)
Fluid (secretions) | Nasal discharge | Seasonal allergic rhinitis | (16)
Fluid (secretions) | Tears | Blepharitis and dry eye | (17,18)
Fluid (secretions) | Saliva | Oral and breast cancer | (19)
Fluid (secretions) | Amniotic/cervical fluid | Fetal aneuploidy and intra-amniotic inflammation | (20,21)
Fluid (proximal fluid) | Follicular fluid | Recurrent spontaneous abortion | (22)
Fluid (proximal fluid) | Seminal fluid | Male infertility | (23)
Fluid (proximal fluid) | Nipple aspirate fluid | Breast cancer | (24)
Fluid (proximal fluid) | Cerebrospinal fluid | Brain tumor | (25)
Fluid (body cavity fluid) | Synovial fluid | Rheumatoid arthritis | (26)
Fluid (body cavity fluid) | Ascites | Ovarian cancer | (13)
Fluid (body cavity fluid) | Bronchial lavage fluid | Chronic obstructive pulmonary disease, asthmatics and lung disease | (27,28)
Fluid (body cavity fluid) | Pleural fluid | Lung cancer | (29)
Fluid (body cavity fluid) | Peritoneal fluid | Ovarian cancer | (14)
Tissue | LCM or LMPC isolated; formalin fixed, paraffin embedded | Any type of disease | (30)
Cell | Cell lines or primary tissue culture | Any type of disease | (31)
Bone | Cartilage | Rheumatoid arthritis | (32)
Hair | - | - | (33)

Characteristics of the samples
Fluid:
• Routinely accessible body fluids
• Very important in the discovery of biomarkers of diseases (systemic vs. organ specific/local)
• Important for early detection, disease severity, prognosis, monitoring of response to therapy
• Can reflect disease perturbations in the organs or tissues from which they are secreted
• Procedure of synovial biopsy is not very difficult
Tissue:
• Very important for the development of novel in situ biomarkers
• Immunofluorescence, immunocytochemistry, imaging mass spectrometry
Cell:
• Very important in the discovery of biomarker candidates
• Validation should be performed using primary tumor samples (e.g., immunohistologic methods, imaging MS)
Bone:
• Cartilage consists mainly of extracellular matrix, mostly made of collagens and proteoglycans
Hair:
• Over 300 proteins were found to constitute the insoluble complex formed by transglutaminase crosslinking

Pretreatment required for proteomics
Fluid:
• Considerations for sample adequacy: storage, hemolysis, influence of anticoagulants, consistent results
• Consider whether to pool samples or analyze individual samples
• Depletion of high-abundance proteins (albumin accounts for 50% of plasma proteins)
• Mucosa and salt have to be removed
Tissue:
• Considerations for sample adequacy
• Integrity, degradation of protein
• Contamination (microorganisms, extraneous material)
Cell:
• Desalting and removal of media components
Bone:
• Cetylpyridinium chloride effectively aggregates with proteoglycan
Hair:
• Sufficient extraction of protein from the insoluble complex is needed
Paik et al.
handling are also addressed by Rai et al. (41). Second, to increase the dynamic range of detection and reduce sample heterogeneity, pretreatments such as depletion of high-abundance proteins appear to be required (34,35,36), and several such depletion steps may be needed during initial sample processing. Multiple fractionations of clinical samples prior to the major separation work would further reduce sample complexity. Note that co-removal of low-abundance proteins during this type of multiple depletion (36,42) and modification of proteins of interest during or after isolation (43) should be considered as well. To address several problems encountered with specimen collection, Xiao et al. (Chapter 13) in this volume describe different methods to isolate the extracellular matrix (ECM) and analyze the proteome of secreted vesicles. These methods will be useful for studying ECM and secreted vesicles in samples ranging from primary cultured cells to tissue specimens. One must therefore consider the best options for sample processing before doing the main experiment.

2.3. Tissues and Other Samples

Usually tissues are used as primary screening samples to find direct causes of disease from the lesion present in tissues of the corresponding organ, for example, liver tissue in hepatocellular carcinoma (HCC) (44,45). Tissues are widely used for clinical proteomics, although there are no standard operating procedures for specimen fractionation and the detection limit of current instrumentation remains borderline. As listed in Table 1, many cancer tissues can be prepared in different ways, such as laser capture microdissection (LCM) (5,6), laser microdissection and pressure catapulting (LMPC) (30,46), and formalin-fixed paraffin-embedded sample preparation (11). These techniques are well described in Chapters 3, 5, 9, and 11 in this volume.
It is desirable, however, that proteomics studies of disease tissues also be coupled with parallel analysis of the corresponding body fluids. For example, in the study of cancer biomarkers, paired cancer tissue sets (tumor vs. nontumor) and the same patient’s plasma were used, which led to a more comprehensive analysis (47,48). Because of the complexity of the sample, experiments on tissue samples may be more suitable for pathophysiological studies than for biomarker discovery. Specimen processing for proteomics studies usually brings several problems, such as artifacts created during sample collection, processing, and storage. Other matters arise in the handling of patient information regarding sex, age, and race (49). To minimize the problems associated with systematic sample handling, it is advisable to establish a specimen bank (50,51,52). In fact, the collection of many clinical samples in a biorepository would have enormous
benefits for proteomic research. This enables the selection of homogeneous clinical samples according to the research purpose and the isolation of specific components from clinical samples. Additionally, large-scale collection of clinical specimens in a biorepository is essential for the validation of specific markers after biomarker candidate discovery. Ideally, the clinical samples stored in the biorepository should be (1) collected and stored immediately, because dead cells and altered proteins affect proteomic analysis, (2) subjected to accurate quality control, and (3) catalogued with reliable and secure clinical data. The quality control of clinical samples includes trimming of specimens and confirmation of diagnosis by pathologists; the information gained (such as the tumor cell to stromal cell ratio, percentage of necrosis, percentage of fibrosis, and proportion of infiltrated inflammatory cells) should be stored in a database of clinical samples. It is also essential to store clinical and follow-up data for each sample and each patient’s written informed consent form in the biorepository network. This clinical specimen banking network provides convenience, reduced cost, and reliability for researchers involved in clinical proteomic research (50,51,52). For representative tissue sample collection for proteomics studies, Diaz et al. (Chapter 3) address a practical experimental strategy for the storage and handling of specimens that are used in surface-enhanced laser desorption/ionization (SELDI), 2D gel, and liquid chromatography (LC)-based proteomics. Emphasis should be given to the primary responsibility of pathologists in the whole process of tissue proteomics, in addition to morphological analysis at the molecular level.
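The quality-control information listed above can be pictured as one record per banked specimen. The following Python sketch is only an illustration under stated assumptions: the class name `SpecimenRecord`, its field names, and the thresholds for storage delay and necrosis are hypothetical and are not taken from any biorepository described in this volume.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SpecimenRecord:
    """One banked clinical specimen with the QC fields named in the text.

    All field names are hypothetical; adapt them to the local database schema.
    """
    specimen_id: str
    collected_at: datetime
    stored_at: datetime
    diagnosis_confirmed: bool      # diagnosis confirmed by a pathologist
    tumor_cell_pct: float          # tumor cells as a percentage of all cells
    necrosis_pct: float
    fibrosis_pct: float
    inflammatory_cell_pct: float
    informed_consent: bool         # written informed consent on file

    def qc_issues(self, max_delay=timedelta(minutes=30), max_necrosis_pct=20.0):
        """Return a list of QC problems (empty if none).

        The default thresholds are illustrative assumptions, not recommendations.
        """
        issues = []
        if self.stored_at - self.collected_at > max_delay:
            issues.append("storage delay exceeds limit")
        if not self.diagnosis_confirmed:
            issues.append("diagnosis not confirmed by pathologist")
        if self.necrosis_pct > max_necrosis_pct:
            issues.append("necrosis above limit")
        if not self.informed_consent:
            issues.append("missing informed consent")
        return issues


def select_homogeneous(records, min_tumor_cell_pct):
    """Pick QC-clean specimens with at least the requested tumor cell content."""
    return [r for r in records
            if not r.qc_issues() and r.tumor_cell_pct >= min_tumor_cell_pct]
```

Calling `select_homogeneous(records, 70.0)` would, under these assumptions, return only specimens that pass every check and contain at least 70% tumor cells, which is one way to read the "selection of homogeneous clinical samples according to the research purpose" described above.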
3. Biomarker Discovery and Clinical Proteomics

Given that one of the central issues of clinical proteomics is biomarker discovery and its application, a brief account of this subject is appropriate here. Excellent reviews of the whole arena of biomarker development can be found elsewhere (53,54,55). Until now, the generally accepted concept of a disease biomarker has been a single protein/peptide with high specificity, which is usually present in low abundance, is expressed in a disease in a stage-specific manner, and serves as a major fingerprint of the body’s response to drugs or other treatments. Although many examples of broad biomarkers for various diseases are known (56,57,58,59,60), identification of more specific and selective biomarkers is urgently needed. Accordingly, we may also need to change the current biomarker concept and eliminate the inherent bias toward individual disease biomarkers. Recently, a new idea has been introduced: an ensemble of different proteins may be more efficient than a single protein/peptide in the diagnosis of disease (61,62,63). To solve
this problem we propose a general strategy of clinical proteomics leading to disease biomarker discovery, as outlined in Fig. 3. Since biomarker candidate proteins could come from many different cellular processes, they could be of either low or high abundance, which would directly or indirectly reflect the physiological condition of the body. Perhaps they are present in different concentrations depending on the disease stage or tissue type. For example, common proteins such as Hsp 27 (64,65), 14-3-3 proteins (66,67), apoA-I (68,69), and serum amyloid precursor A (70) appear in most disease samples from lung cancer, gastric cancer, pancreatic cancer, prostate cancer, neuroblastoma, and inflammation. A number of questions then arise: should they be treated as disease-specific or disease-nonspecific proteins? What would be the criterion for making this decision? Is this due to the fact that the number and type of proteins secreted from a specific
Fig. 3. The concept of the creation of a protein biomarker panel for a specific disease. Each white, gray, dark-gray, and black circle represents a putative protein biomarker of a specific disease at that clinical stage. A group of slash-lined circles symbolizes the biomarker panel of liver disease as an example.
physiological condition of many different types of diseases might be similar? How can one distinguish one type of disease from another simply by looking at their protein profiles? As outlined in Fig. 3, at the beginning of a certain disease, signals may be limited to only a few easily counted molecules. As the disease progresses, more signal molecules may be produced, resulting in mixed types of biomarkers representing multiple disease phenomena. Although this assumption may be oversimplified, more noise is created at a certain stage, where it becomes more difficult to identify those molecules at the molecular level for two reasons: (1) they are present in amounts too small to be detected using current technology and (2) it may be too early for the molecules to be specific for a particular disease. Presumably, proteins appearing in stage 3 or 4 may have higher specificity for a particular disease, but their sensitivity might be low. It may be that this noise interferes with the signaling pathway of a certain disease, and we may end up having no decisive marker. To circumvent this problem, it may be desirable to identify a set of biomarker candidate proteins, termed a “biomarker panel,” which ideally contains potential candidate proteins or peptides that represent specific stages of the disease as a group. Given this panel, extensive validation may be sought using a large cohort. Following this strategy, many biomarker candidates at stage 1 can be included in the panel, which can have more specificity and sensitivity than a single-molecule biomarker. This kind of biomarker panel can be used not only as a diagnostic marker but also as a prognostic indicator in monitoring treatment effectiveness. For example, Linkov et al.
(61) reported that both the sensitivity and specificity were improved up to 84.5 and 98%, respectively, when they used a panel containing 25 markers in the early diagnosis of squamous cell cancer of the head and neck. In the diagnosis of prostate cancer, specificity was increased from 5–15 to 84–95% when a biomarker panel containing six marker proteins was used instead of a single marker. In HCC, studies have been carried out on a biomarker panel consisting of a protein array that can be used as a diagnostic kit (62,63). A general strategy for biomarker discovery is outlined in Fig. 4. In typical clinical proteomics work, sample collection is the first step, followed by pretreatment of the sample to reduce sample complexity and enable the search for low-abundance proteins (e.g., disease biomarkers) using various fractionation tools. This multidimensional fractionation is well described elsewhere (34,35,36) and depends on the properties and concentration of the sample. Typically the prefractionated samples go to either a two-dimensional electrophoresis (2DE) or an LC-based proteomics separation system, followed by single or multiple steps of mass spectrometric analysis depending on the sample
quantity and experimental goal. The data obtained from this series of analyses are integrated into the proteome informatics system, where protein/peptide identification, quantification, modification analysis, and verification of the peak list are carried out [(71) and also Chapter 19]. Usually this step becomes rate limiting, since major profiling data are constructed and analyzed at this point. The clinical relevance of those proteins (and changes in their expression level) in a specific disease state is mostly determined here, which eventually leads to identification of biomarker candidates. In addition, SELDI, molecular imaging, and protein microarrays can also be applied before or after this step. Once major biomarker candidates are identified, those proteins are subjected to further verification via sophisticated analytical arrays and translational proteomics, which involves cohort studies, pre-evaluation, and a robust analytical system (4,72). Throughout the process of translational proteomics, one may be able to judge whether the identified panel or single proteins are suitable as biomarkers of a specific disease. A recent comprehensive review by Zolg (73) addressed several considerations in the biomarker development pipeline from discovery to validation. Three critical challenges within the pipeline are reduction of clinical sample complexity, the proof of principle of biomarker function, and the detection limit of unique proteins present in the samples. In the search for biomarker panels, reliable statistical tools and bioinformatics resources are needed, which are now available on the web (Table 2; see also Chapters 16 and 17). As the number of biomarker panel candidates increases, more cases are being examined, which requires statistical learning methods. These methods include neural networks, genetic algorithms, k-means

Fig. 4. A typical experimental strategy for clinical proteomics and translational proteomics. In clinical proteomics research, various experimental techniques are included: specimen collection, prefractionation, 2DE, non-2DE (liquid-based separation), mass spectrometry, informatics, and others. The course of each section as marked (square, circle in different color) is determined by the investigators, depending on the experimental goal. At the bottom, experimental procedures for the verification and validation of biomarker candidates are schematically outlined, leading to clinical screening and applications. The squares indicate the separation system based on the specific characteristics of proteins and the general prefractionation system. The open circles and open triangle represent analytical modules at the protein and peptide level, respectively. The arrows and junction points indicate the options at each selection. The bottom part indicates the verification procedure employing multiple reaction monitoring and quantitative mass analysis. Biomarker candidates identified from typical clinical proteomics would be subject to translational proteomics for validation, where a large-scale cohort study and evaluation would then proceed.
nearest-neighbor analysis, Euclidean distance-based nonlinear methods, fuzzy pattern matching, self-organizing mapping, and support vector machines (74,75,76,77,78). They are very useful for classifying proteins according to the specific disease state (see also Chapters 16 and 20). Once biomarker candidates are identified, it is necessary to predict the function of these proteins in silico and validate them in the context of clinical application. Table 3 provides web resources that can be used for clinical data management, in silico functional annotation (see Chapter 18), prediction, and identification of modified forms of proteins. Thus, by combining experimental methods (Fig. 4) and informatics tools (Tables 2 and 3), one is able to obtain a set of biomarker candidate proteins (panel) that would be further used for validation through translational proteomics (Fig. 1).
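To make the classification step and the panel metrics discussed in this section more concrete, the following Python sketch shows one of the simplest methods named above, nearest-neighbor classification with Euclidean distance, together with the calculation of sensitivity and specificity. It is a minimal illustration, not the method of any study cited here: the function names, the label strings, and the idea of representing each sample as a tuple of marker abundances are all assumptions made for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length abundance profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training profiles.

    `train` is a list of (profile, label) pairs; profiles are tuples of
    marker abundances for a hypothetical biomarker panel.
    """
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def sensitivity_specificity(true_labels, predicted_labels, positive="disease"):
    """Return (sensitivity, specificity) for a two-class problem."""
    tp = fn = tn = fp = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == positive:
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity
```

In use, each held-out sample would be passed to `knn_predict`, and the predictions would then be compared with the clinical diagnosis through `sensitivity_specificity`. Using an odd `k` avoids tied votes in a two-class setting; real panels would also need cross-validation on a large cohort, as the text emphasizes.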
4. Introduction of the Experimental Strategy Described in This Volume

For protein profiling and identification, proteomics platform technologies are moving forward in many areas, not only in clinical proteomics but also in the general biological field. In this section, the leading scientists in the field of proteomics outline core techniques and their application to the studies of clinical proteomics. For example, in plasma proteome analysis, it is necessary to deplete high-abundance proteins using various techniques such as multidimensional fractionation by immunoaffinity column, gel permeation, and beads (Fig. 4). Cho et al. (Chapter 4) address this in relation to 2D gel analysis of plasma, describing the technical details of sample preparation, gel electrophoresis, and quantification of proteins on the gel. Zhang and Koay (Chapter 5) describe the methods of 2D gel analysis for cells prepared by LCM. They describe the application of LCM in dissecting tumor cells in breast cancer for macromolecular extraction and 2D gels. This can be used to prepare samples from paraffin-embedded tissue blocks by microdissecting the cells of interest. Building on this procedure, Mustafa et al. (Chapter 9) review the application of LCM for proteomics analysis and demonstrate that combining LCM and MS would facilitate identification of specific proteins for each sample type. For urine sample analysis, Zerefos et al. (Chapter 8) provide simple protocols for protein analysis by 2D gel or direct matrix-assisted laser desorption/ionization-time-of-flight mass spectrometry. These techniques include protein enrichment through protein precipitation and ultrafiltration. Combining these methods with the above profiling technologies allows reproducible and sensitive analysis of one of the most significant and complex biological samples (77).
Table 2 Clinical Proteomics Initiatives and Resources

Name | Details | Website

Institute
CPTI | National Cancer Institute’s Clinical Proteomics Technologies, initiative for cancer | http://proteomics.cancer.gov
ABRF | The Association of Biomolecular Resource Facilities, an international society dedicated to advancing core and research biotechnology laboratories through research, communication, and education | http://www.abrf.org/
PPI | Plasma Proteome Institute, the PPI is working to facilitate clinical adoption of advanced diagnostic tests using proteins in plasma and serum | http://www.plasmaproteome.org/plasmaframes.htm
EDRN | The Early Detection Research Network, the EDRN provide up-to-date information on biomarker research through this website and scientific publications | http://edrn.nci.nih.gov

Web resources
ExPASy | Expert Protein Analysis System, proteomics related information and database | http://www.expasy.org/
NCBI | National Center for Biotechnology Information, the protein entries in the Entrez search and retrieval system have been compiled from a variety of sources, including SwissProt, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq | http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein&itool=toolbar
CPRMap | Clinical Proteomics Research Map, updated research article for disease and clinical proteomics | http://www.cprmap.com/

Database
MedGene | MedGene can make a list of human genes associated with a particular human disease in ranking order | http://hipseq.med.harvard.edu/MEDGENE
Table 3 Available Bioinformatic Resources for the Analysis of Proteomics Data

Name | Description | Website URL | PMID

Clinical proteome data management system
Proteus | LIMS for proteomics pipeline | http://www.genologics.com | -
CPAS | LIMS for identification and quantification using LC-MS/MS data | - | 16396501
Systems biology experiment analysis management system | A management system for collecting, storing, and accessing data produced by microarray, proteomics, and immunohistochemistry | http://www.sbeams.org/ | 16756676
GPM database | Open source system for analyzing, validating, and storing protein identification data | http://www.thegpm.org/ | 15595733
SpectrumMill | MS/MS data analysis and management system | http://www.chem.agilent.com/ | -

Phosphorylation
Group-based phosphorylation scoring method | Prediction of kinase-specific phosphorylation sites | http://973proteinweb.ustc.edu.cn/gps/gps_web/ | 15980451
KinasePhos | A web tool for identifying protein kinase-specific phosphorylation sites using a hidden Markov model | http://kinasePhos.mbc.nctu.edu.tw | 15980458
NetPhos | Sequence and structure-based prediction of eukaryotic protein phosphorylation sites | http://www.cbs.dtu.dk/services/NetPhos/ | 10600390
NetPhosK | Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence | http://www.cbs.dtu.dk/services/NetPhosK/ | 15174133
Table 3 (Continued)

Name | Description | Website URL | PMID

PredPhospho | Prediction of phosphorylation sites using support vector machine | http://pred.ngri.re.kr/PredPhospho.htm | 15231530
PREDIKIN | A prediction of substrates for serine/threonine protein kinases based on the primary sequence of a protein kinase catalytic domain | http://florey.biosci.uq.edu.au/kinsub/home.htm | (see note)
Prosite | A prediction of substrates for protein kinases-based conserved motif search | http://kr.expasy.org/prosite | 17237102
Scansite | Prediction of PK-specific phosphorylation site with Bayesian decision theory | http://scansite.mit.edu | 16549034
Phospho.ELM | A database of experimentally verified phosphorylation sites in eukaryotic proteins | http://phospho.elm.eu.org/ | 15212693
Human protein reference database (HPRD) | A database of known kinase/phosphatase substrate as well as binding motifs that are curated from the published literature | http://www.hprd.org/PhosphoMotif_finder | (see note)
PhosphoSite | A bioinformatics resource dedicated to physiological protein phosphorylation | http://www.phosphosite.org/Login.jsp | 15174125

Glycosylation
NetOGlyc 2.0 | Predicts O-glycosylation sites in mucin-type proteins | http://www.cbs.dtu.dk/services/NetOGlyc/ | 9557871
DictyOGlyc 1.1 | Predicts O-GlcNAc sites in eukaryotic proteins | http://www.cbs.dtu.dk/services/DictyOGlyc/ | 10521537
YinOYang 1.2 | Predicts O-GlcNAc sites in eukaryotic proteins | http://www.cbs.dtu.dk/services/YinOYang/ | (see note)
NetNGlyc 1.0 | Predicting N-glycosylation sites | http://www.cbs.dtu.dk/services/NetNGlyc/ | (see note)
GlycoMod | Web software for prediction of the possible oligosaccharide structures in glycoproteins from their experimentally determined masses | http://www.expasy.ch/tools/glycomod/ | 11680880

Note: PMIDs 16445868 and 16316981 are also printed in this part of the table, but the surviving layout does not show which of the rows marked "(see note)" they belong to.
Table 3 (Continued)

Name | Description | Website URL | PMID

Glyco-fragment | A web tool to support the interpretation of mass spectra of complex carbohydrates | http://www.dkfz.de/spec/projekte/fragments/ | 14625865
GlycoSearchMS | Compares each peak of a measured mass spectrum with the calculated fragments of all structures contained in the SweetDB | http://www.dkfz.de/spec/glycosciences.de/sweetdb/ms/ | 15215392
GlycosidIQ | Based on the matching of experimental MS2 data with the theoretical fragmentation of glycan structures in GlycoSuiteDB | https://tmat.proteomesystems.com/glyco/glycosuite/glycodb | 15174134
Saccharide topology analysis tool | A web-based computational program that can quickly extract sequence information from a set of MSn spectra for an oligosaccharide of up to 10 residues | - | 10857602
GlycoX | To determine simultaneously the glycosylation sites and oligosaccharide heterogeneity of glycoproteins using MATLAB | - | 17022651
MODi | A web server for identifying multiple post-translational peptide modifications from tandem mass spectra | http://www.unimod.org | 16845006
SWEET-DB | An attempt to create annotated data collections for carbohydrates | http://www.dkfz.de/spec2/sweetdb/ | 11752350

Protein–protein interaction
Munich information center for protein sequence’s MPPI | The database of mammalian protein–protein interactions | http://mips.gsf.de | 16381839
Database of interacting proteins | A database that documents experimentally determined protein–protein interactions | http://dip.doembi.ecla.edu/ | 11752321
Molecular interaction network database | A database of storing, in a structured format, information about molecular interactions by extracting experimental details from work published in peer-reviewed journals | http://mint.bio.uniroma2.it/mint | 17135203
Protein–protein interactions of cancer proteins | Predicts interactions, which are derived from homology with experimentally known protein–protein interactions from various species | http://bmm.cancerresearchuk.org/~pip | 16398927
IntAct | IntAct provides a freely available, open source database system and analysis tools for protein interaction data | http://www.ebi.ac.uk/intact/ | 17145710
Biomolecular interaction network database | A database designed to store full descriptions of interactions, molecular complexes and pathways | http://www.bind.ca | 12519993

Metabolic and signal pathway
BioCarta | A pathway database | http://www.biocarta.com | -
KEGG | A pathway database with genomical, chemical, and biological network information | http://www.genome.jp/kegg | 16381885
Cancer cell map | The cancer cell map is a selected set of human cancer focused pathways | http://cancer.cellmap.org/cellmap/ | -
HPRD | A database with data pertaining to post-translational modifications, protein–protein interactions, tissue expression, subcellular localization, and enzyme–substrate relationships | http://www.hprd.org/ | -

Proteomic data resource
The cancer cell map | A database of clinical data from SELDI-TOF | http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp | -
Proteomics identifications database | A database of protein and peptide identifications that have been described in the scientific literature | http://www.ebi.ac.uk/pride/ | 16381953
PeptideAtlas | A multiorganism, publicly accessible compendium of peptides identified in a large set of tandem mass spectrometry proteomics experiments | http://www.peptideatlas.org | 16381952

Disease resource
Online mendelian inheritance in man | A database of human genes and genetic disorders | http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM | 17170002
GeneCards | An integrated database of human genes that includes automatically mined genomic, proteomic, and transcriptomic information | http://www.genecards.org/index.shtml | 15608261
Cancer gene census | A catalogue those genes for which mutations have been causally implicated in cancer | http://www.sanger.ac.uk/genetics/CGP/Census/ | 14993899
Two-dimensional electrophoresis is perhaps the most popular start-up tool for proteome analysis. For clinical proteomics, 2DE has been the traditional workhorse, used for the analysis of different clinical specimens ranging from plasma to urine (Table 1). Quantification problems in 2DE are now addressed by employing fluorescent dyes (Cy3 and Cy5), which allow normalization
of data obtained from two different clinical specimens (79). Friedman and Lilley (Chapter 6) present general optimization conditions for differential in-gel electrophoresis (DIGE) in the quantitative analysis of clinical samples. They address the usefulness of differential labeling dyes (Cy2, Cy3, and Cy5). The essence of any DIGE system is to minimize potential human error in the identification and quantification of protein spots in a 2D gel (79). The difficulties in 2D map analysis are introduced by Marengo et al. (Chapter 16). They describe methods for comparing protein spots using image analysis technology and related informatics tools to minimize variations between measurements of spot volume, a key to successful 2D map construction. There are many variations of LC in protein profiling, including mass detection methods, column types, data mining through search engines, mass accuracy, and running conditions (80,81,82). These are all related to quantification of proteins or peptides in the sample, one of the major bottlenecks in proteomics (83,84,85,86,87). Among the several quantification techniques are isotope-coded affinity tags (ICAT), mass-coded affinity tagging, and nonisotope-labeled methods. Xiao and Veenstra (Chapter 10) present the application of ICAT in the study of COX-2 inhibitor-regulated proteins in a colon cancer cell line. With emphasis on sample preparation, they provide details on ICAT procedures for quantitative proteomics (88). In addition to this approach, Li et al. (Chapter 11) employ a strategy that combines LCM for sample preparation of HCC with cleavable isotope-coded affinity tags in order to identify markers quantitatively. However, it should be mentioned that further measures are needed to increase the efficiency of ICAT, since sample recovery during or after the labeling steps is inefficient (87).
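The normalization role of the DIGE dyes can be illustrated with a short calculation. The sketch below assumes a commonly used design that the text above does not spell out: a pooled internal standard labeled with Cy2 is run on every gel, and each sample spot volume (Cy3 or Cy5) is divided by the Cy2 volume of the same spot before gels are compared. The function names and the example volumes are hypothetical.

```python
import math


def standardized_abundance(sample_volume, standard_volume):
    """Spot volume relative to the internal standard measured on the same gel."""
    if standard_volume <= 0:
        raise ValueError("internal-standard volume must be positive")
    return sample_volume / standard_volume


def log2_ratio_across_gels(sample_a, standard_a, sample_b, standard_b):
    """log2 fold change of sample B over sample A for one matched spot.

    Each sample volume is first normalized to the internal standard of its
    own gel, so gel-to-gel differences in loading or staining cancel out.
    """
    a = standardized_abundance(sample_a, standard_a)
    b = standardized_abundance(sample_b, standard_b)
    return math.log2(b / a)
```

For example, a control spot of volume 1200 on a gel whose standard reads 1000, compared with a disease spot of volume 4800 on a gel whose standard reads 2000, gives standardized abundances of 1.2 and 2.4, that is, a twofold change rather than the fourfold change suggested by the raw volumes.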
A label-free serum quantification method has been recently introduced (48) (see Chapter 12 by Higgs et al.). The use of antibody arrays in clinical proteomics has increased recently in the context of high-throughput detection of cancer specimens where the identities of the proteins of interest are known (89,90). The evaluation of antibody cross-reactivity and specificity is crucial in these assays. This matter is addressed by Sanchez-Carbayo (Chapter 15), who describes technical aspects and the application of planar antibody arrays in the quantification of serum proteins, and by Hsu et al. (Chapter 14), who present the development and use of bead-based miniaturized multiplexed sandwich immunoassays for focused protein profiling in various body fluids. The latter method, using bead-based protein arrays or suspension microarrays, allows the simultaneous analysis of a variety of parameters within a single experiment. Given the versatility of suspension microarrays in the analysis of proteins of interest present in different types of body fluids ranging from serum to synovial fluids, the multiplexed protein profiling technology described by Hsu et al. (Chapter 14) seems to hold great promise in clinical proteomics. Similarly, in combination with
tissue microarray technology (91) it would also be possible to perform parallel molecular profiling of clinical samples together with immunohistochemistry, fluorescence in situ hybridization, or RNA in situ hybridization. SELDI is another arena of high-throughput profiling of clinical samples in the course of disease marker discovery [(92,93), Chapter 7]. It is expected that profiling approaches in proteomics, such as SELDI-MS, will be frequently used in disease marker discovery, but only if the identification technologies coupled with SELDI are improved. During the course of biomarker discovery, large data sets are usually generated and deposited in a coordinated fashion (Tables 2 and 3) (94,95). Indeed, statistical analysis of 2DE proteomics data, which comprise several hundred protein spots, is complex. To circumvent some of the inconsistency in 2D gel proteomics data, Friedman and Lilley (Chapter 6) and Carpentier et al. (Chapter 17) point out available statistical tools and suggest case-specific guidelines for 2D gel spot analysis. Fitzgibbon et al. (Chapter 19) describe an open source platform for LC-MS spectra in which the msInspect program is used to lower false positives and guide normalization of the dataset. It is also demonstrated that msInspect can analyze data from quantitative studies with and without isotopic labels. Paliakasis et al. (Chapter 18) introduce web-based tools for protein classification, which lead to prediction of potential protein function and family clustering of related proteins. They provide some guidelines for classifying protein data into more meaningful families. Finally, Somorjai (Chapter 20) addresses important filtering criteria for the application of protein pattern recognition to biomarker discovery using statistical tools.
5. Concluding Remarks
Although there are several bottlenecks in clinical proteomics (such as the lack of standardization of specimen processing, of quantification, and of an overall strategy for handling biomarkers after identification), we believe that the field holds great promise in biomarker discovery. The success of clinical proteomics depends on the availability and selection of well-phenotyped specimens, reduction of sample complexity, development of good informatics tools, and efficient data management. Therefore, sample handling techniques, including microdissection for tissue samples, multidimensional fractionation for body fluids, and pretreatment of other clinical specimens (e.g., urine, tears, and cells), should be developed in this context. Since there is no gold standard for sample collection and handling, one needs to find the best options available for processing samples without damage. In addition, establishment of a biorepository system would systematically minimize some artifacts and variation between samples during or after identification of biomarkers.
It is now generally accepted that an ensemble (or panel) of different proteins would be more efficient than a single protein/peptide in the diagnosis of disease, an idea which is poised to replace the conventional concept of a biomarker. As a high-throughput way of protein profiling, the use of antibody arrays in clinical proteomics has recently increased in regard to detection of cancer specimens. However, in the use of antibody arrays to profile serum autoantibodies, issues of cross-reactivity and specificity have to be resolved. Although not covered here due to space limitations, with the advent of proteomics techniques one can further analyze a network of protein–protein interactions as well as post-translational modifications of those proteins involved in a specific disease (Table 3). It is now highly recommended that common reagents such as antibodies and standard proteins, which are very useful for spiking purposes, quantification work, and sensitivity normalization of one machine to another, be used in worldwide efforts like the Human Proteome Organization (HUPO) Plasma Proteome Project (96,97). Finally, clinical proteomics needs the integration of biochemistry, pathology, analytical technology, bioinformatics, and proteome informatics to develop highly sensitive diagnostic tools for routine clinical care in the future (71,98).
Acknowledgments
This study was supported by a grant from the Korea Health 21 R&D project, Ministry of Health & Welfare, Republic of Korea (A030003 to YKP).
References
1. Etzioni, R., Urban, N., Ramsey, S., McIntosh, M., Schwartz, S., Reid, B., Radich, J., Anderson, G., and Hartwell, L. (2003) The case for early detection. Nat. Rev. Cancer 3, 1–10.
2. Ludwig, J. A. and Weinstein, J. N. (2005) Biomarkers in cancer staging, prognosis and treatment selection. Nat. Rev. Cancer 5, 845–856.
3. Xiao, Z., Prieto, D., Conrads, T. P., Veenstra, T. D., and Issaq, H. J. (2005) Proteomic patterns: their potential for disease diagnosis. Mol. Cell Endocrinol. 230, 95–106.
4. Rifai, N., Gillette, M. A., and Carr, S. A. (2006) Protein biomarker discovery and validation: the long and uncertain path to clinical utility. Nat. Biotechnol. 24, 971–983.
5. Emmert-Buck, M. R., Bonner, R. F., Smith, P. D., Chuaqui, R. F., Zhuang, Z., Goldstein, S. R., Weiss, R. A., and Liotta, L. A. (1996) Laser capture microdissection. Science 274, 998–1001.
6. Gillespie, J. W., Ahram, M., Best, C. J., Swalwell, J. I., Krizman, D. B., Petricoin, E. F., Liotta, L. A., and Emmert-Buck, M. R. (2001) The role of tissue microdissection in cancer research. Cancer J. 7, 32–39.
7. Craven, R. A. and Banks, R. E. (2002) Use of laser capture microdissection to selectively obtain distinct populations of cells for proteomic analysis. Methods Enzymol. 356, 33–49.
8. Vincourt, J. B., Lionneton, F., Kratassiouk, G., Guillemin, F., Netter, P., Mainard, D., and Magdalou, J. (2006) Establishment of a reliable method for direct proteome characterization of human articular cartilage. Mol. Cell Proteomics 5, 1984–1995.
9. Platt, M. S., Agamanolis, D. P., Krill, C. E. Jr., Boeckman, C., Potter, J. L., Robinson, H., and Lloyd, J. (1983) Occult hepatic sinusoid tumor of infancy simulating neuroblastoma. Cancer 52, 1183–1189.
10. Mahadevia, P. J., Fleisher, L. A., Frick, K. D., Eng, J., Goodman, S. N., and Powe, N. R. (2003) Lung cancer screening with helical computed tomography in older adult smokers: a decision and cost-effectiveness analysis. JAMA 289, 313–322.
11. Hood, B. L., Darfler, M. M., Guiel, T. G., Furusato, B., Lucas, D. A., Ringeisen, B. R., Sesterhenn, I. A., Conrads, T. P., Veenstra, T. D., and Krizman, D. B. (2005) Proteomic analysis of formalin-fixed prostate cancer tissue. Mol. Cell Proteomics 4, 1741–1753.
12. Alaiya, A., Al-Mohanna, M., and Linder, S. (2005) Clinical cancer proteomics: promises and pitfalls. J. Proteome Res. 4, 1213–1222.
13. Gericke, B., Raila, J., Sehouli, J., Haebel, S., Konsgen, D., Mustea, A., and Schweigert, F. J. (2005) Microheterogeneity of transthyretin in serum and ascitic fluid of ovarian cancer patients. BMC Cancer 17, 133–141.
14. Swisher, E. M., Wollan, M., Mahtani, S. M., Willner, J. B., Garcia, R., Goff, B. A., and King, M. C. (2005) Tumor-specific p53 sequences in blood and peritoneal fluid of women with epithelial ovarian cancer. Am. J. Obstet. Gynecol. 193, 662–667.
15. Pisitkun, T., Johnstone, R., and Knepper, M. A. (2006) Discovery of urinary biomarkers. Mol. Cell Proteomics 5, 1760–1771.
16. Ghafouri, B., Irander, K., Lindbom, J., Tagesson, C., and Lindahl, M. (2006) Comparative proteomics of nasal fluid in seasonal allergic rhinitis. J. Proteome Res. 5, 330–338.
17. Koo, B. S., Lee, D. Y., Ha, H. S., Kim, J. C., and Kim, C. W. (2005) Comparative analysis of the tear protein expression in blepharitis patients using two-dimensional electrophoresis. J. Proteome Res. 4, 719–724.
18. Grus, F. H., Podust, V. N., Bruns, K., Lackner, K., Fu, S., Dalmasso, E. A., Wirthlin, A., and Pfeiffer, N. (2005) SELDI-TOF-MS ProteinChip array profiling of tears from patients with dry eye. Invest. Ophthalmol. Vis. Sci. 46, 863–876.
19. Amado, F. M., Vitorino, R. M., Domingues, P. M., Lobo, M. J., and Duarte, J. A. (2005) Analysis of the human saliva proteome. Expert Rev. Proteomics 2, 521–539.
20. Wang, T. H., Chang, Y. L., Peng, H. H., Wang, S. T., Lu, H. W., Teng, S. H., Chang, S. D., and Wang, H. S. (2005) Rapid detection of fetal aneuploidy using proteomics approaches on amniotic fluid supernatant. Prenat. Diagn. 25, 559–566.
21. Ruetschi, U., Rosen, A., Karlsson, G., Zetterberg, H., Rymo, L., Hagberg, H., and Jacobsson, B. (2005) Proteomic analysis using protein chips to detect biomarkers in cervical and amniotic fluid in women with intra-amniotic inflammation. J. Proteome Res. 4, 2236–2242.
22. Kim, Y. S., Kim, M. S., Lee, S. H., Choi, B. C., Lim, J. M., Cha, K. Y., and Baek, K. H. (2006) Proteomic analysis of recurrent spontaneous abortion: identification of an inadequately expressed set of proteins in human follicular fluid. Proteomics 6, 3445–3454.
23. Pilch, B. and Mann, M. (2006) Large-scale and high-confidence proteomic analysis of human seminal plasma. Genome Biol. 7, R40.
24. Varnum, S. M., Covington, C. C., Woodbury, R. L., Petritis, K., Kangas, L. J., Abdullah, M. S., Pounds, J. G., Smith, R. D., and Zangar, R. C. (2003) Proteomic characterization of nipple aspirate fluid: identification of potential biomarkers of breast cancer. Breast Cancer Res. Treat. 80, 87–97.
25. Zheng, P. P., Luider, T. M., Pieters, R., Avezaat, C. J., van den Bent, M. J., Sillevis Smitt, P. A., and Kros, J. M. (2003) Identification of tumor-related proteins by proteomic analysis of cerebrospinal fluid from patients with primary brain tumors. J. Neuropathol. Exp. Neurol. 62, 855–862.
26. Gibson, D. S., Blelock, S., Brockbank, S., Curry, J., Healy, A., McAllister, C., and Rooney, M. E. (2006) Proteomic analysis of recurrent joint inflammation in juvenile idiopathic arthritis. J. Proteome Res. 5, 1988–1995.
27. Merkel, D., Rist, W., Seither, P., Weith, A., and Lenter, M. C. (2005) Proteomic study of human bronchoalveolar lavage fluids from smokers with chronic obstructive pulmonary disease by combining surface-enhanced laser desorption/ionization-mass spectrometry profiling with mass spectrometric protein identification. Proteomics 5, 2972–2980.
28. Wu, J., Kobayashi, M., Sousa, E. A., Liu, W., Cai, J., Goldman, S. J., Dorner, A. J., Projan, S. J., Kavuru, M. S., Qiu, Y., and Thomassen, M. J. (2005) Differential proteomic analysis of bronchoalveolar lavage fluid in asthmatics following segmental antigen challenge. Mol. Cell Proteomics 4, 1251–1264.
29. Tyan, Y. C., Wu, H. Y., Lai, W. W., Su, W. C., and Liao, P. C. (2005) Proteomic profiling of human pleural effusion using two-dimensional nano liquid chromatography tandem mass spectrometry. J. Proteome Res. 4, 1274–1286.
30. Khalil, A. A. and James, P. (2007) Biomarker discovery: a proteomic approach for brain cancer profiling. Cancer Sci. 98, 201–213.
31. Khodavirdi, A. C., Song, Z., Yang, S., Zhong, C., Wang, S., Wu, H., Pritchard, C., Nelson, P. S., and Roy-Burman, P. (2006) Increased expression of osteopontin contributes to the progression of prostate cancer. Cancer Res. 66, 883–888.
32. Vincourt, J. B., Lionneton, F., Kratassiouk, G., Guillemin, F., Netter, P., Mainard, D., and Magdalou, J. (2006) Establishment of a reliable method for direct proteome characterization of human articular cartilage. Mol. Cell Proteomics 5, 1984–1995.
33. Lee, Y. J., Rice, R. H., and Lee, Y. M. (2006) Proteome analysis of human hair shaft: from protein identification to post-translational modification. Mol. Cell Proteomics 5, 789–800.
34. Cho, S. Y., Lee, E. Y., Lee, J. S., Kim, H. Y., Park, J. M., Kwon, M. S., Park, Y. K., Lee, H. J., Kang, M. J., Kim, J. Y., Yoo, J. S., Park, S. J., Cho, J. W., Kim, H. S., and Paik, Y. K. (2005) Efficient prefractionation of low-abundance proteins in human plasma and construction of a two-dimensional map. Proteomics 5, 3386–3396.
35. Lathrop, J. T., Hayes, T. K., Carrick, K., and Hammond, D. J. (2005) Rarity gives a charm: evaluation of trace proteins in plasma and serum. Expert Rev. Proteomics 2, 393–406.
36. Lee, H. J., Lee, E. Y., Kwon, M. S., and Paik, Y. K. (2006) Biomarker discovery from the plasma proteome using multidimensional fractionation proteomics. Curr. Opin. Chem. Biol. 10, 42–49.
37. Anderson, N. L. and Anderson, N. G. (2002) The human plasma proteome: history, character, and diagnostic prospects. Mol. Cell Proteomics 1, 845–867.
38. Hu, S., Loo, J. A., and Wong, D. T. (2006) Human body fluid proteome analysis. Proteomics 6, 6326–6353.
39. Park, M. R., Wang, E. H., Jin, D. C., Cha, J. H., Lee, K. H., Yang, C. W., Kang, C. S., and Choi, Y. J. (2006) Establishment of a 2-D human urinary proteomic map in IgA nephropathy. Proteomics 6, 1066–1076.
40. Tammen, H., Schulte, I., Hess, R., Menzel, C., Kellmann, M., and Schulz-Knappe, P. (2005) Prerequisites for peptidomic analysis of blood samples: I. Evaluation of blood specimen qualities and determination of technical performance characteristics. Comb. Chem. High Throughput Screen 8, 725–733.
41. Rai, A. J., Gelfand, C. A., Haywood, B. C., Warunek, D. J., Yi, J., Schuchard, M. D., Mehigh, R. J., Cockrill, S. L., Scott, G. B., Tammen, H., Schulz-Knappe, P., Speicher, D. W., Vitzthum, F., Haab, B. B., Siest, G., and Chan, D. W. (2005) HUPO plasma proteome project specimen collection and handling: towards the standardization of parameters for plasma proteome samples. Proteomics 5, 3262–3277.
42. Zhou, M., Lucas, D. A., Chan, K. C., Issaq, H. J., Petricoin, E. F. 3rd, Liotta, L. A., Veenstra, T. D., and Conrads, T. P. (2004) An investigation into the human serum “interactome”. Electrophoresis 25, 1289–1298.
43. Findeisen, P., Sismanidis, D., Riedl, M., Costina, V., and Neumaier, M. (2005) Preanalytical impact of sample handling on proteome profiling experiments with matrix-assisted laser desorption/ionization time-of-flight mass spectrometry. Clin. Chem. 51, 2409–2411.
44. Park, K. S., Kim, H., Kim, N. G., Cho, S. Y., Choi, K. H., Seong, J. K., and Paik, Y. K. (2002) Proteomic analysis and molecular characterization of tissue ferritin light chain in hepatocellular carcinoma. Hepatology 35, 1459–1466.
45. Park, K. S., Cho, S. Y., Kim, H., and Paik, Y. K. (2002) Proteomic alterations of the variants of human aldehyde dehydrogenase isozymes correlate with hepatocellular carcinoma. Int. J. Cancer 97, 261–265.
46. Marko-Varga, G., Berglund, M., Malmstrom, J., Lindberg, H., and Fehniger, T. E. (2003) Targeting hepatocytes from liver tissue by laser capture microdissection and proteomics expression profiling. Electrophoresis 24, 3800–3805.
47. Paradis, V., Degos, F., Dargere, D., Pham, N., Belghiti, J., Degott, C., Janeau, J. L., Bezeaud, A., Delforge, D., Cubizolles, M., Laurendeau, I., and Bedossa, P. (2005) Identification of a new biomarker of hepatocellular carcinoma by serum protein profiling of patients with chronic liver diseases. Hepatology 41, 40–47.
48. Ru, Q. C., Zhu, L. A., Silberman, J., and Shriver, C. D. (2006) Label-free semiquantitative peptide feature profiling of human breast cancer and breast disease sera via two-dimensional liquid chromatography–mass spectrometry. Mol. Cell Proteomics 5, 1095–1104.
49. Azad, N. S., Rasool, N., Annunziata, C. M., Minasian, L., Whiteley, G., and Kohn, E. C. (2006) Proteomics in clinical trials and practice: present uses and future promise. Mol. Cell Proteomics 5, 1819–1829.
50. Gunter, E. W. (1997) Biological and environmental specimen banking at the Centers for Disease Control and Prevention. Chemosphere 34, 1945–1953.
51. Strauss, G. H. and Kelly, S. J. (1990) The development of the U.S. EPA health effects research laboratory frozen blood cell repository program. Mutat. Res. 234, 349–354.
52. Romeo, M. J., Espina, V., Lowenthal, M., Espina, B. H., Petricoin, E. F. 3rd, and Liotta, L. A. (2005) CSF proteome: a protein repository for potential biomarker identification. Expert Rev. Proteomics 2, 57–70.
53. Conrads, T. P., Hood, B. L., Petricoin, E. F. 3rd, Liotta, L. A., and Veenstra, T. D. (2005) Cancer proteomics: many technologies, one goal. Expert Rev. Proteomics 2, 693–703.
54. Schrader, M. and Selle, H. (2006) The process chain for peptidomic biomarker discovery. Dis. Markers 22, 27–37.
55. Danna, E. A. and Nolan, G. P. (2006) Transcending the biomarker mindset: deciphering disease mechanisms at the single cell level. Curr. Opin. Chem. Biol. 10, 20–27.
56. De Masi, S., Tosti, M. E., and Mele, A. (2005) Screening for hepatocellular carcinoma. Dig. Liver Dis. 37, 260–268.
57. Yamaguchi, K., Nagano, M., Torada, N., Hamasaki, N., Kawakita, M., and Tanaka, M. (2004) Urine diacetylspermine as a novel tumor marker for pancreatobiliary carcinomas. Rinsho. Byori. 52, 336–339.
58. Dabrowska, M., Grubek-Jaworska, H., Domagala-Kulawik, J., Bartoszewicz, Z., Kondracka, A., Krenke, R., Nejman, P., and Chazan, R. (2004) Diagnostic usefulness of selected tumor markers (CA125, CEA, CYFRA 21–1) in bronchoalveolar lavage fluid in patients with non-small cell lung cancer. Pol. Arch. Med. Wewn. 111, 659–665.
59. Gann, P. H., Hennekens, C. H., and Stampfer, M. J. (1995) A prospective evaluation of plasma prostate-specific antigen for detection of prostatic cancer. JAMA 273, 289–294.
60. Ciambellotti, E., Coda, C., and Lanza, E. (1993) Determination of CA 15–3 in the control of primary and metastatic breast carcinoma. Minerva Med. 84, 107–112.
61. Linkov, F., Lisovich, A., Yurkovetsky, Z., Marrangoni, A., Velikokhatnaya, L., Nolen, B., Winans, M., Bigbee, W., Siegfried, J., Lokshin, A., and Ferris, R. L. (2007) Early detection of head and neck cancer: development of a novel screening tool using multiplexed immunobead-based biomarker profiling. Cancer Epidemiol. Biomarkers Prev. 16, 102–107.
62. Casiano, C. A., Mediavilla-Varela, M., and Tan, E. M. (2006) Tumor-associated antigen arrays for the serological diagnosis of cancer. Mol. Cell Proteomics 5, 1745–1759.
63. Nissom, P. M., Lo, S. L., Lo, J. C., Ong, P. F., Lim, J. W., Ou, K., Liang, R. C., Seow, T. K., and Chung, M. C. (2006) Hcc-2, a novel mammalian ER thioredoxin that is differentially expressed in hepatocellular carcinoma. FEBS Lett. 580, 2216–2226.
64. Feng, J. T., Liu, Y. K., Song, H. Y., Dai, Z., Qin, L. X., Almofti, M. R., Fang, C. Y., Lu, H. J., Yang, P. Y., and Tang, Z. Y. (2005) Heat-shock protein 27: a potential biomarker for hepatocellular carcinoma identified by serum proteome analysis. Proteomics 5, 4581–4588.
65. Li, D. Q., Wang, L., Fei, F., Hou, Y. F., Luo, J. M., Wei-Chen, Zeng, R., Wu, J., Lu, J. S., Di, G. H., Ou, Z. L., Xia, Q. C., Shen, Z. Z., and Shao, Z. M. (2006) Identification of breast cancer metastasis-associated proteins in an isogenic tumor metastasis model using two-dimensional gel electrophoresis and liquid chromatography-ion trap-mass spectrometry. Proteomics 6, 3352–3368.
66. Lee, I. N., Chen, C. H., Sheu, J. C., Lee, H. S., Huang, G. T., Yu, C. Y., Lu, F. J., and Chow, L. P. (2005) Identification of human hepatocellular carcinoma-related biomarkers by two-dimensional difference gel electrophoresis and mass spectrometry. J. Proteome Res. 4, 2062–2069.
67. Righetti, P. G., Castagna, A., Antonucci, F., Piubelli, C., Cecconi, D., Campostrini, N., Rustichelli, C., Antonioli, P., Zanusso, G., Monaco, S., Lomas, L., and Boschetti, E. (2005) Proteome analysis in the clinical chemistry laboratory: myth or reality? Clin. Chim. Acta 357, 123–139.
68. Jang, J. S., Cho, H. Y., Lee, Y. J., Ha, W. S., and Kim, H. W. (2004) The differential proteome profile of stomach cancer: identification of the biomarker candidates. Oncol. Res. 14, 491–499.
69. Steel, L. F., Shumpert, D., Trotter, M., Seeholzer, S. H., Evans, A. A., London, W. T., Dwek, R., and Block, T. M. (2003) A strategy for the comparative analysis of serum proteomes for the discovery of biomarkers for hepatocellular carcinoma. Proteomics 3, 601–609.
70. Yip, T. T., Chan, J. W., Cho, W. C., Yip, T. T., Wang, Z., Kwan, T. L., Law, S. C., Tsang, D. N., Chan, J. K., Lee, K. C., Cheng, W. W., Ma, V. W., Yip, C., Lim, C. K., Ngan, R. K., Au, J. S., Chan, A., Lim, W. W., and Ciphergen SARS Proteomics Study Group (2005) Protein chip array profiling analysis in patients with severe acute respiratory syndrome identified serum amyloid A protein as a biomarker potentially useful in monitoring the extent of pneumonia. Clin. Chem. 51, 47–55.
71. Anderson, L. and Hunter, C. L. (2005) Quantitative mass spectrometric multiple reaction monitoring assays for major plasma proteins. Mol. Cell Proteomics 5, 573–588.
72. Lee, J. W., Figeys, D., and Vasilescu, J. (2007) Biomarker assay translation from discovery to clinical studies in cancer drug development: quantification of emerging protein biomarkers. Adv. Cancer Res. 96, 269–298.
73. Zolg, W. (2006) The proteomic search for diagnostic biomarkers: lost in translation? Mol. Cell Proteomics 5, 1720–1726.
74. Bensmail, H., Golek, J., Moody, M. M., Semmes, J. O., and Haoudi, A. (2005) A novel approach for clustering proteomics data using Bayesian fast Fourier transform. Bioinformatics 21, 2210–2224.
75. Ward, D. G., Cheng, Y., N’Kontchou, G., Thar, T. T., Barget, N., Wei, W., Billingham, L. J., Martin, A., Beaugrand, M., and Johnson, P. J. (2006) Changes in the serum proteome associated with the development of hepatocellular carcinoma in hepatitis C-related cirrhosis. Br. J. Cancer 94, 287–292.
76. Lin, N. and Zhao, H. (2005) Are scale-free networks robust to measurement errors? BMC Bioinformatics 6, 119.
77. Castagna, A., Cecconi, D., Sennels, L., Rappsilber, J., Guerrier, L., Fortis, F., Boschetti, E., Lomas, L., and Righetti, P. G. (2005) Exploring the hidden human urinary proteome via ligand library beads. J. Proteome Res. 4, 1917–1930.
78. Rauch, A., Bellew, M., Eng, J., Fitzgibbon, M., Holzman, T., Hussey, P., Igra, M., Maclean, B., Lin, C. W., Detter, A., Fang, R., Faca, V., Gafken, P., Zhang, H., Whiteaker, J., States, D., Hanash, S., Paulovich, A., and McIntosh, M. W. (2006) Computational proteomics analysis system (CPAS): an extensible open source analytic system for evaluating and publishing proteomic data and high throughput biological experiments. J. Proteome Res. 5, 112–121.
79. Lilley, K. S. and Friedman, D. B. (2004) All about DIGE: quantification technology for differential-display 2D-gel proteomics. Expert Rev. Proteomics 1, 401–409.
80. Qian, W. J., Jacobs, J. M., Liu, T., Camp, D. G. 2nd, and Smith, R. D. (2006) Advances and challenges in liquid chromatography-mass spectrometry-based proteomics profiling for clinical applications. Mol. Cell Proteomics 5, 1727–1744.
81. Powell, D. W., Merchant, M. L., and Link, A. J. (2006) Discovery of regulatory molecular events and biomarkers using 2D capillary chromatography and mass spectrometry. Expert Rev. Proteomics 3, 63–74.
82. Andre, M., Le Caer, J. P., Greco, C., Planchon, S., El Nemer, W., Boucheix, C., Rubinstein, E., Chamot-Rooke, J., and Le Naour, F. (2006) Proteomic analysis of the tetraspanin web using LC-ESI-MS/MS and MALDI-FTICR-MS. Proteomics 6, 1437–1449.
83. Greengauz-Roberts, O., Stoppler, H., Nomura, S., Yamaguchi, H., Goldenring, J. R., Podolsky, R. H., Lee, J. R., and Dynan, W. S. (2005) Saturation labeling with cysteine-reactive cyanine fluorescent dyes provides increased sensitivity for protein expression profiling of laser-microdissected clinical specimens. Proteomics 5, 1746–1757.
84. Heck, A. J. and Krijgsveld, J. (2004) Mass spectrometry-based quantitative proteomics. Expert Rev. Proteomics 1, 317–326.
85. Schneider, L. V. and Hall, M. P. (2005) Stable isotope methods for high-precision proteomics. Drug Discov. Today 10, 353–363.
86. Zhang, J., Goodlett, D. R., Peskind, E. R., Quinn, J. F., Zhou, Y., Wang, Q., Pan, C., Yi, E., Eng, J., Aebersold, R. H., and Montine, T. J. (2005) Quantitative proteomic analysis of age-related changes in human cerebrospinal fluid. Neurobiol. Aging 26, 207–227.
87. Liu, T., Qian, W. J., Strittmatter, E. F., Camp, D. G. 2nd, Anderson, G. A., Thrall, B. D., and Smith, R. D. (2004) High-throughput comparative proteome analysis using a quantitative cysteinyl-peptide enrichment technology. Anal. Chem. 76, 5345–5353.
88. Li, C., Hong, Y., Tan, Y. X., Zhou, H., Ai, J. H., Li, S. J., Zhang, L., Xia, Q. C., Wu, J. R., Wang, H. Y., and Zeng, R. (2004) Accurate qualitative and quantitative proteomic analysis of clinical hepatocellular carcinoma using laser capture microdissection coupled with isotope-coded affinity tag and two-dimensional liquid chromatography mass spectrometry. Mol. Cell Proteomics 3, 399–409.
89. Sheehan, K. M., Calvert, V. S., Kay, E. W., Lu, Y., Fishman, D., Espina, V., Aquino, J., Speer, R., Araujo, R., Mills, G. B., Liotta, L. A., Petricoin, E. F. 3rd, and Wulfkuhle, J. D. (2005) Use of reverse phase protein microarrays and reference standard development for molecular network analysis of metastatic ovarian carcinoma. Mol. Cell Proteomics 4, 346–355.
90. Knezevic, V., Leethanakul, C., Bichsel, V. E., Worth, J. M., Prabhu, V. V., Gutkind, J. S., Liotta, L. A., Munson, P. J., Petricoin, E. F. 3rd, and Krizman, D. B. (2001) Proteomic profiling of the cancer microenvironment by antibody arrays. Proteomics 1, 1271–1278.
91. Sharma-Oates, A., Quirke, P., and Westhead, D. R. (2005) TmaDB: a repository for tissue microarray data. BMC Bioinformatics 6, 218.
92. Rai, A. J., Stemmer, P. M., Zhang, Z., Adam, B. L., Morgan, W. T., Caffrey, R. E., Podust, V. N., Patel, M., Lim, L. Y., Shipulina, N. V., Chan, D. W., Semmes, O. J., and Leung, H. C. (2005) Analysis of human proteome organization plasma proteome project (HUPO PPP) reference specimens using surface enhanced laser desorption/ionization-time of flight (SELDI-TOF) mass spectrometry: multi-institution correlation of spectra and identification of biomarkers. Proteomics 5, 3467–3474.
93. Engwegen, J. Y., Gast, M. C., Schellens, J. H., and Beijnen, J. H. (2006) Clinical proteomics: searching for better tumour markers with SELDI-TOF mass spectrometry. Trends Pharmacol. Sci. 27, 251–259.
94. Domon, B. and Aebersold, R. (2006) Mass spectrometry and protein analysis. Science 312, 212–217.
95. Domon, B. and Aebersold, R. (2006) Challenges and opportunities in proteomics data analysis. Mol. Cell Proteomics 5, 1921–1926.
96. Uhlen, M. and Ponten, F. (2005) Antibody-based proteomics for human tissue profiling. Mol. Cell Proteomics 4, 384–393.
97. Taussig, M. J., Stoevesandt, O., Borrebaeck, C. A., Bradbury, A. R., Cahill, D., Cambillau, C., de Daruvar, A., Dubel, S., Eichler, J., Frank, R., Gibson, T. J., Gloriam, D., Gold, L., Herberg, F. W., Hermjakob, H., Hoheisel, J. D., Joos, T. O., Kallioniemi, O., Koegll, M., Konthur, Z., Korn, B., Kremmer, E., Krobitsch, S., Landegren, U., van der Maarel, S., McCafferty, J., Muyldermans, S., Nygren, P. A., Palcy, S., Pluckthun, A., Polic, B., Przybylski, M., Saviranta, P., Sawyer, A., Sherman, D. J., Skerra, A., Templin, M., Ueffing, M., and Uhlen, M. (2007) ProteomeBinders: planning a European resource of affinity reagents for analysis of the human proteome. Nat. Methods 4, 13–17.
98. Ilyin, S. E., Belkowski, S. M., and Plata-Salaman, C. R. (2004) Biomarker discovery and validation: technologies and integrative approaches. Trends Biotechnol. 22, 411–416.
I Specimen Collection for Clinical Proteomics
2
Specimen Collection and Handling
Standardization of Blood Sample Collection
Harald Tammen
Summary
Preanalytical variables can alter the analysis of blood-derived samples. Prior to the analysis of a blood sample, multiple steps are necessary to generate the desired specimen. The choice of blood specimen and its collection, handling, processing, and storage are important aspects, since they can have a tremendous impact on the results of the analysis. Awareness of clinical practices in medical laboratories, together with current knowledge, allows for identification of specific variables that affect the results of a proteomic study. Knowledge of preanalytical variables is a prerequisite to understanding and controlling their impact.
Key Words: blood; plasma; serum; proteomics; specimen; preanalytical variables.
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols. Edited by: A. Vlahou © Humana Press, Totowa, NJ
1. Introduction
Proteomic analysis of blood specimens by semi-quantitative multiplex techniques offers a valuable approach for discovery of disease- or therapy-related biomarkers (1,2). Based on reproducible separation of proteins by their physical–chemical properties in combination with semi-quantitative detection methods and bioinformatic data analysis, proteomics allows for sensitive measurement of proteins in blood specimens (3). Blood can be regarded as a complex liquid tissue that comprises cells and extracellular fluid (4). The choice of a suitable specimen-collection protocol is crucial to minimize artificial processes (e.g., cell lysis, proteolysis) occurring during specimen collection and preparation (5). Preanalytic procedures can alter the analysis of blood-derived samples. These procedures comprise the processes prior to actual analysis of the sample and include the steps needed to obtain the primary sample (e.g., blood) and the analytical specimen (e.g., plasma, serum, cells). Legal or ethical issues (e.g., the importance of informed consent) and potential risks of phlebotomy (e.g., bleeding) are not covered in this article.
1.1. Collection of Blood Samples
It has been reported that the most frequent faults in the preanalytical phase are the result of erroneous sample collection procedures (e.g., drawing blood from an infusion line, resulting in sample dilution) (6). The design of blood collection devices may aid in correct sampling: evacuated containers draw an accurate quantity of blood, ensuring the correct concentration of additives or the correct dilution of the blood, as in the case of citrated plasma. The speed of the blood draw is also controlled, which restricts mechanical stress. The favored site of collection is the median cubital vein, which is generally easily found and accessed. As such, it is most comfortable for the patient and should not evoke additional stress. Preparation of the collection site includes proper cleaning of the skin with alcohol (2-propanol). The alcohol must be allowed to evaporate, since mixing of residual alcohol with the blood sample may result in hemolysis, raise the levels of distinct analytes, and cause interferences. The position of the patient (standing, lying, sitting) can affect the hematocrit (7), and hence may change the concentration of the analytes. The tourniquet should be applied 3–4 inches above the site of venipuncture and should be released as soon as blood begins flowing into the collection device. Venous occlusion lasting longer than 1 min can affect the sample composition. Prolonged occlusion may result in hemoconcentration and subsequently increase the concentrations of various analytes, e.g., total protein levels.
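The collection variables listed so far lend themselves to a simple record-and-flag scheme. The following is a minimal sketch, assuming a hypothetical `BloodDraw` record and a hypothetical study-wide reference position; neither is part of the chapter, and only the 1 min occlusion limit and the qualitative effects in the messages are taken from the text above.

```python
from dataclasses import dataclass


@dataclass
class BloodDraw:
    """Hypothetical record of how one sample was collected."""
    from_infusion_line: bool
    skin_alcohol_dried: bool
    occlusion_seconds: float
    patient_position: str  # "sitting", "lying", or "standing"


def preanalytical_flags(draw, reference_position="sitting"):
    """Return warnings for deviations named in the text.

    reference_position is an assumption: the chapter only states that
    position affects the hematocrit, not which position to prefer.
    """
    flags = []
    if draw.from_infusion_line:
        flags.append("drawn from infusion line: possible sample dilution")
    if not draw.skin_alcohol_dried:
        flags.append("alcohol not evaporated: possible hemolysis or interference")
    if draw.occlusion_seconds > 60:
        flags.append("venous occlusion > 1 min: possible hemoconcentration")
    if draw.patient_position != reference_position:
        flags.append("position differs from reference: hematocrit may differ")
    return flags


# Hypothetical example: a correct draw except for a 90 s tourniquet time.
draw = BloodDraw(
    from_infusion_line=False,
    skin_alcohol_dried=True,
    occlusion_seconds=90,
    patient_position="sitting",
)
flags = preanalytical_flags(draw)
print(flags)
```

Recording such flags with each specimen makes it possible to exclude or stratify samples later, rather than discovering a collection artifact only after analysis.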
Blood should be collected from fasting patients in the morning between 7 and 9 a.m., because food ingestion and circadian rhythms can alter the concentration of analytes considerably (e.g., total protein, hemoglobin, myoglobin).
1.2. Characteristics of Serum and Plasma Specimens
Serum is one of the most frequently analyzed blood specimens. The generation of serum is time consuming and associated with the activation of the coagulation cascade and the complement system. These processes influence the composition of the samples, because they result in cell lysis (e.g., of thrombocytes and erythrocytes). As a consequence, the concentrations of components in the extracellular fluid, such as aspartate aminotransferase, serotonin, neuron-specific enolase, and lactate dehydrogenase, are increased (8). On the other hand, degradation of the analytes (e.g., hormones) may occur faster (9). On the
proteomic level, more peptides and fewer proteins are observed in serum than in plasma (10,11). Consequently, the activation of the clotting cascade necessary to generate serum can lead to artefacts. A reason to use serum as a specimen is the notion that the proteome or peptidome of serum may reflect biological events (12). Post-sampling proteolytic cleavage products have been proposed as biomarkers, and it has been further suggested that the serum peptidome is of particular diagnostic value for the detection of cancer (13). However, it has been reported that more protein changes occur in serum than in plasma (14). Thus, it can be expected that the reproducibility of such ex vivo proteolytic events is comparatively low. In plasma, citrate and EDTA inhibit coagulation and other enzymatic processes by chelating ions, thereby inhibiting ion-dependent enzymes. Heparin, in contrast, acts through the activation of antithrombin III. The main concern with heparinized plasma for proteomic studies is that heparin is a polydisperse charged molecule that binds many proteins non-specifically (15,16) and, because its molecular weight is similar to that of peptides and small proteins, may also influence separation procedures and their mass spectrometric detection (17). The sampling of plasma is less time consuming than the acquisition of serum. Cells and the liquid phase can be separated immediately after sample collection, since no clotting time (30–60 min) is required. The amount of plasma obtained from blood is approximately 10 to 20% higher than the amount of serum. The protein content of plasma is also higher than that of serum because clotting factors and associated components are still present; in serum, proteins may additionally be bound to the clot, which further decreases the protein concentration.

1.3. Processing of Blood Samples

A quick separation of cells from the plasma is favorable, since cellular constituents may liberate substances that alter the composition of the sample. Generally, it is recommended that plasma and serum be centrifuged at 1300–2000×g for 10 min within 30 min of sample collection. The temperature should generally be 15–24°C (18), unless recommended differently for particular analytes such as gastrin or A-type natriuretic peptide. Processing at 4°C appears attractive because enzymatic degradation is reduced at low temperatures. However, platelets become activated at low temperatures (19) and release intracellular proteins and enzymes, which affect the sample composition. Thus, processing at low temperatures is safe only after thrombocytes have been removed. Since one centrifugation step may be insufficient for
depletion of platelets below 10 cells/nL, a second centrifugation step (2500×g for 15 min at room temperature) or a filtration step may be required to obtain platelet-poor plasma. This procedure is applicable only to plasma, since the platelets in serum are already activated.

1.4. Protease Inhibitors

Protease inhibitors have been considered and investigated as additives in proteome research to prevent or slow down proteolytic processes and thereby provide a means of more sensitive detection of markers in blood (21). Their use would be attractive, but commonly used protease inhibitor cocktails may introduce difficulties because they interfere with mass spectrometry and form covalent bonds with proteins, which shifts the isoform pattern (20). Even though protein integrity has been shown to be maintained by the addition of 15 commercially available protease inhibitors, the usefulness of protease inhibitors for overall protein stabilization of blood samples remains to be investigated in more detail (22). Certain protease inhibitors are toxic to live cells when present in whole blood. Stressed, apoptotic, or necrotic cells release substances, and it may be argued that this affects the composition of serum or plasma until the cellular and soluble fractions of blood are separated. However, careful selection of an appropriate protease inhibitor may solve this problem.

2. Materials

1. Twenty-gauge needles and an appropriate adapter (e.g., Sarstedt, Nümbrecht, Germany) or a Vacutainer system (BD Bioscience, Franklin Lakes, USA).
2. Alcohol (2-propanol) in spray flask.
3. Swabs.
4. Examination gloves.
5. Tourniquet or sphygmomanometer.
6. Blood collection tubes (e.g., Sarstedt).
7. Centrifuge with a swinging bucket rotor (e.g., Sigma 4K15, Sigma Laborzentrifugen, Osterode, Harz).
8. A 10-mL syringe equipped with a cellulose acetate filter unit with 0.2 μm pore size and 5 cm² filtration area (e.g., Sartorius Minisart, Sarstedt).
9. 2-mL cryo vials.
10. Pipette and tips.
3. Methods

1. Venipuncture of a cubital vein is performed using a 20-gauge needle (diameter: 0.9 mm; e.g., a butterfly system with a maximum tubing length of 6 cm). If a tourniquet is applied, it should not remain in place for longer than 1 min (risk of falsifying results due to
hemoconcentration). As soon as the blood flows into the container, the tourniquet has to be released at least partially. If more time is required, the tourniquet has to be released so that circulation resumes and normal skin color returns to the extremity.
• Prior to blood collection for proteomic analysis, blood is aspirated into a first container (e.g., 2.7 mL S-Monovette, Sarstedt, Nümbrecht, Germany). This is done to flush the surface and remove initial traces of contact-induced coagulation. This sample is not useful for analysis.
• Afterward, blood is drawn into a standard EDTA- or citrate-containing syringe (e.g., 9 mL EDTA-Monovette, Sarstedt, Nümbrecht, Germany). Depending on the ease of blood flow, several samples can be collected. Free flow with mild aspiration should be ensured to avoid hemolysis.
2. After venipuncture, plasma is obtained by centrifugation for 10 min at 2000×g at room temperature. Centrifugation should start within 30 min after blood collection. The resulting plasma can now be separated from red and white blood cells efficiently and gently. Nevertheless, a significant number of platelets (∼25%) are still present in the sample. This requires an additional preparation step.
3. For platelet depletion, one of the following procedures has to be undertaken directly after step 2:
• Platelet removal by centrifugation: The plasma sample is transferred into a second vial and centrifuged again for 15 min at 2500×g at room temperature. After centrifugation, the supernatant is transferred in aliquots of 1.5 mL into cryo vials.
• Platelet removal by filtration: Plasma aliquots of 1.5 mL resulting from step 2 are transferred into 2-mL cryo vials using a 10-mL syringe equipped with a cellulose acetate filter unit with 0.2 μm pore size and 5 cm² filtration area (e.g., Sartorius Minisart®, Sartorius, Göttingen, Germany). Filtration requires only gentle pressure.
4. Samples are transferred to a –80°C freezer within 30 min. Storage is at –80°C. Transport of samples is done on dry ice.
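The time limits in the steps above lend themselves to a simple automated check of sample records. The following Python sketch is only an illustration: the function name, its arguments, and the timestamps are hypothetical, and it assumes that the 30-min window for transfer to the freezer is counted from the end of platelet depletion, which the protocol does not state explicitly.

```python
from datetime import datetime, timedelta

# Time limits quoted in the protocol; all names below are hypothetical.
MAX_TOURNIQUET = timedelta(minutes=1)
MAX_DRAW_TO_SPIN = timedelta(minutes=30)
MAX_PREP_TO_FREEZER = timedelta(minutes=30)


def timing_deviations(drawn_at, spin_started_at, depletion_done_at, frozen_at,
                      tourniquet_seconds):
    """Return a list of protocol time limits that a single sample exceeded."""
    deviations = []
    if timedelta(seconds=tourniquet_seconds) > MAX_TOURNIQUET:
        deviations.append("tourniquet in place longer than 1 min")
    if spin_started_at - drawn_at > MAX_DRAW_TO_SPIN:
        deviations.append("centrifugation started more than 30 min after collection")
    # Assumption: the freezer window is counted from the end of platelet depletion.
    if frozen_at - depletion_done_at > MAX_PREP_TO_FREEZER:
        deviations.append("transfer to the -80 C freezer took more than 30 min")
    return deviations


if __name__ == "__main__":
    draw = datetime(2024, 1, 1, 8, 0)  # illustrative timestamps only
    report = timing_deviations(
        drawn_at=draw,
        spin_started_at=draw + timedelta(minutes=45),
        depletion_done_at=draw + timedelta(minutes=80),
        frozen_at=draw + timedelta(minutes=95),
        tourniquet_seconds=40,
    )
    print(report)
```

A record that returns an empty list stayed within all three limits; any other result names the limits that were exceeded so that the sample can be flagged before analysis.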
4. Notes

4.1. Frequently Made Mistakes

4.1.1. Blood Withdrawal
• The patient was not fasting (i.e., had taken food prior to sampling).
• The blood was drawn from an infusion line.
• The blood was drawn in a wrong position (e.g., supine, upright).
• The consumables used were different from those recommended.
• The expiry date of consumables was already reached.
• The tubes were not properly filled.
• The tubes were agitated vigorously (instead of gentle shaking to dissolve the anticoagulant).
• The blood sample tubes were not consistently kept at room temperature.
• The sample tubes were put on ice or in a refrigerator.
4.1.2. Lab Handling
• Centrifugation was delayed more than 30 min after blood withdrawal.
• A cooling centrifuge was adjusted below room temperature.
• The centrifugation speed was wrong (e.g., revolutions per minute were set instead of g-force).
• The centrifugation time was wrong.
• The removal of blood plasma by pipetting was done without proper caution. Consequently, the buffy coat or the red blood cells were churned up.
• The second centrifugation of recovered plasma samples was delayed after the first centrifugation.
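Confusing revolutions per minute with g-force, as listed above, can be avoided by converting explicitly. The sketch below uses the standard relation RCF = 1.118 × 10⁻⁵ × r × rpm², with r the rotor radius in cm. The 15-cm radius is a hypothetical value chosen for illustration only; the actual radius must be taken from the rotor documentation.

```python
import math

RCF_CONSTANT = 1.118e-5  # standard factor for radius in cm and speed in rpm


def rcf_from_rpm(rpm, radius_cm):
    """Relative centrifugal force (in multiples of g) at a given speed and radius."""
    return RCF_CONSTANT * radius_cm * rpm ** 2


def rpm_for_rcf(rcf, radius_cm):
    """Rotor speed (rpm) needed to reach a target relative centrifugal force."""
    return math.sqrt(rcf / (RCF_CONSTANT * radius_cm))


if __name__ == "__main__":
    radius_cm = 15.0  # hypothetical rotor radius, not taken from the protocol
    # Speed required for the 2000 x g plasma spin in step 2 of the Methods.
    print(round(rpm_for_rcf(2000, radius_cm)))
    # Force actually applied if 2000 rpm is entered by mistake instead of 2000 x g.
    print(round(rcf_from_rpm(2000, radius_cm)))
```

With this assumed radius, entering 2000 rpm instead of 2000×g would apply only about a third of the intended force, which illustrates why the setting should always be verified in units of g.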
4.1.3. Storage of Samples
• The storage of samples was delayed.
• The storage temperatures were above –80°C.
• The labeling of sample containers was unreadable or confusable.
• The labels were not properly attached to the sample containers, or storage or handling resulted in loss of labels.
4.1.4. General Recommendations
• A proper first centrifugation should produce a visible white blood cell layer (buffy coat) between red blood cells and plasma. If not, the centrifugation speed or time may be wrong.
• One should discard plasma that is icteric or exhibits signs of hemolysis. One should check with an expert whether this could be due to the patient's particular disease.
References 1. Vitzthum F, Behrens F, Anderson NL, Shaw JH. (2005) Proteomics: from basic research to diagnostic application. A review of requirements and needs. J. Proteome Res. 4, 1086–97. 2. Lathrop JT, Anderson NL, Anderson NG, Hammond DJ. (2003) Therapeutic potential of the plasma proteome. Curr. Opin. Mol. Ther. 5, 250–7.
3. Wang W, Zhou H, Lin H, Roy S, Shaler TA, Hill LR et al. (2003) Quantification of proteins and metabolites by mass spectrometry without isotopic labeling or spiked standards. Anal. Chem. 75, 4818–26. 4. Anderson NL, Anderson NG. (2002) The human plasma proteome: history, character, and diagnostic prospects. Mol. Cell. Proteomics 1, 845–67. 5. Omenn GS. (2004) The Human Proteome Organization Plasma Proteome Project pilot phase: reference specimens, technology platform comparisons, and standardized data submissions and analyses. Proteomics 4, 1235–40. 6. Plebani M, Carraro P. (1997) Mistakes in a stat laboratory: types and frequency. Clin. Chem. 43, 1348–51. 7. Burtis CA, Ashwood E. (eds) (2001) Fundamentals of Clinical Chemistry. Saunders, Philadelphia. 8. Guder WG, Narayanan S, Wisser H, Zawata B. (2003) Samples: From the Patient to the Laboratory. The Impact of Preanalytical Variables on the Quality of Laboratory Results. GIT Verlag, Darmstadt, Germany. 9. Evans MJ, Livesey JH, Ellis MJ, Yandle TG. (2001) Effect of anticoagulants and storage temperatures on stability of plasma and serum hormones. Clin. Biochem 34, 107–12. 10. Omenn GS, States DJ, Adamski M, Blackwell TW, Menon R, Hermjakob H et al. (2005) Overview of the HUPO Plasma Proteome Project: results from the pilot phase with 35 collaborating laboratories and multiple analytical groups, generating a core dataset of 3020 proteins and a publicly-available database. Proteomics 5, 3226–45. 11. Rai AJ, Gelfand CA, Haywood BC, Warunek DJ, Yi J, Schuchard MD et al. (2005) HUPO Plasma Proteome Project specimen collection and handling: towards the standardization of parameters for plasma proteome samples. Proteomics 5, 3262–77. 12. Villanueva J, Shaffer DR, Philip J, Chaparro CA, Erdjument-Bromage H, Olshen AB et al. (2006) Differential exoprotease activities confer tumor-specific serum peptidome patterns. J. Clin. Invest. 116, 271–84. 13. Liotta LA, Petricoin EF. 
(2006) Serum peptidome for cancer detection: spinning biologic trash into diagnostic gold. J. Clin. Invest. 116, 26–30. 14. Tammen H, Schulte I, Hess R, Menzel C, Kellmann M, Schulz-Knappe P. (2005) Prerequisites for peptidomic analysis of blood samples: I. Evaluation of blood specimen qualities and determination of technical performance characteristics. Comb. Chem. High Throughput Screen. 8, 725–33. 15. Holland NT, Smith MT, Eskenazi B, Bastaki M. (2003) Biological sample collection and processing for molecular epidemiological studies. Mutat. Res. 543, 217–34. 16. Landi MT, Caporaso N. (1997) Sample collection, processing and storage. IARC Sci. Publ. 223–36. 17. Tammen H, Schulte I, Hess R, Menzel C, Kellmann M, Mohring T, Schulz-Knappe P. (2005) Peptidomic analysis of human blood specimens: comparison between plasma specimens and serum by differential peptide display. Proteomics 13, 3414–22.
18. Favaloro EJ, Soltani S, McDonald J. (2004) Potential laboratory misdiagnosis of hemophilia and von Willebrand disorder owing to cold activation of blood samples for testing. Am. J. Clin. Pathol. 122, 686–92. 19. Mustard JF, Kinlough-Rathbone RL, Packham MA. (1989) Isolation of human platelets from plasma by centrifugation and washing. Methods Enzymol. 169, 3–11. 20. Schuchard MD, Mehigh RJ, Cockrill SL, Lipscomb GT, Stephan JD, Wildsmith J et al. (2005) Artifactual isoform profile modification following treatment of human plasma or serum with protease inhibitor, monitored by 2-dimensional electrophoresis and mass spectrometry. Biotechniques 39, 239–47. 21. Jeffrey DH, Deidra B, Keith H, Shu-Pang H, Deborah LR, Gregory JO, Stanley AH. (2004) An Investigation of Plasma Collection, Stabilization, and Storage Procedures for Proteomic Analysis of Clinical Samples. Humana, Totowa, NJ. 22. Rai AJ, Vitzthum F. (2006) Effects of preanalytical variables on peptide and protein measurements in human serum and plasma: implications for clinical proteomics. Expert Rev. Proteomics 3, 409–26.
3
Tissue Sample Collection for Proteomics Analysis
Jose I. Diaz, Lisa H. Cazares, and O. John Semmes
Summary

Successful collection of tissue samples for molecular analysis requires critical considerations. We describe here our procedure for tissue specimen collection for proteomic purposes, with emphasis on the most important steps, including timing issues and the procedures for immediate freezing, storage, and microdissection of the cells of interest or “tissue targets” and the preparation of lysates for protein isolation for SELDI, MALDI, and 2DGE applications. The pathologist is the cornerstone of this process and is an invaluable collaborator. In most institutions, pathologists are responsible for “tissue custody,” and they closely supervise the tissue bank. In addition, they are optimally trained in histopathology and can therefore assist investigators in correlating tissue morphology with molecular findings. In recent years, the advent of the laser capture microscope, a tool ideally designed for pathologists, has tremendously improved the efficiency of collecting tissue targets for molecular analysis.
Key Words: tissue bank; frozen section; immunofluorescence; laser capture microscope; proteomics.
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols Edited by: A. Vlahou © Humana Press, Totowa, NJ

1. Introduction

From the completion of surgery and the acquisition of tissue sample to protein isolation and performing the various proteomic techniques, a number of challenges must be overcome. The first challenge is time. Surgery is associated with loss of vascular supply, resulting in progressive increase of endogenous protease activity, protein degradation, and tissue autolysis. For this reason, specimens submitted for tissue procurement must be processed without delay. Formalin fixation, a standard processing procedure in pathology,
stops protease activity. However, formalin is a cross-linking fixative that irreversibly alters protein, thus compromising the quality of the extracts for most proteomic techniques. Recent technical developments appear promising and may ultimately enable peptide analysis and protein identification (bottom up proteomics) in formalin-fixed paraffin embedded tissue (1). At present, however, it is imperative to take a representative “fresh” tissue sample immediately after surgery when collecting tissue for proteomic studies, including MALDI TOF MS and 2DGE. The surgical specimen should be transported quickly to pathology, and a representative tissue sample should be obtained under the supervision of a pathologist. The sample should be embedded in OCT and frozen without delay. Ideally, a frozen section should be performed for quality assurance before archiving the sample. Once the pathologist confirms that the expected targets are present in the collected tissue (for instance, tumor and non-tumor tissue), the frozen specimen can be stored in a –80°C freezer for subsequent use. Overcoming time constraints requires appropriate institutional policies and dedicated personnel. From our experience, it is better to delegate the responsibility of transporting the surgical specimen from the operating room to pathology to dedicated tissue procurement personnel, instead of expecting the surgical team to deliver the specimens. When collecting and archiving tissue samples, our policy is to bisect the sample into two halves, one embedded in OCT and stored permanently at –80°C for future molecular studies, and one submitted as a “mirror image” processed in formalin after performing a frozen section for morphologic comparison and cell type mapping after basic hematoxylin and eosin (H&E) staining. This formalin-processed mirror image tissue provides optimal morphological detail, which might be necessary in the future. 
For instance, it is very difficult to identify prostatic intraepithelial neoplasia (PIN) on frozen section slides; however, the formalin-fixed section, which closely mimics the frozen section, can be used for guidance. After archiving the tissue sample, the next challenge is to ensure that the proteomic findings are representative of the tissue targets under investigation, given the cellular heterogeneity present in most tissues. For instance, if one would like to determine the differential protein expression in tumor versus non-tumor, one must ensure that proteins are separately and reliably extracted from normal and tumor cells. Certainly, many solid tumors are visible to the naked eye, and both tumor and non-tumor tissues can be collected by gross inspection. However, under a microscope, the tumor bed contains not only tumor cells but also many other tumor-associated, non-tumoral elements, such as supporting stromal cells, blood vessels, and infiltrating lymphocytes. Moreover, microscopic foci of tumor may infiltrate grossly normal tissue. In the past, various approaches were followed to collect cells from tissue sections, including manual microdissection with a syringe. In recent years, the procedure
of laser-capture microdissection (2) has tremendously increased the quality, specificity, and speed of the process, allowing selective capture of cells and various tissue elements while preserving their molecular integrity (3,4,5). The laser capture microscope (LCM) is a special microscope that isolates cells from frozen or formalin-fixed tissues and cytological preparations. Microdissection of single cells or multicellular structures is accomplished by placing a plastic polymer (cap) over the tissue and pulsing an infrared laser so that the polymer melts and adheres to the target cells under the laser ring. When the cap is removed, the cells that adhered to the polymer detach from the surrounding tissue without any molecular damage, becoming suitable for the extraction of high-quality nucleic acids and proteins, and for a wide range of downstream molecular analyses,
Fig. 1. Selective immunofluorescent LCM of the prostate gland's basal cells by immunocapture: (A) immunofluorescent staining of basal cells with a mAb against high-molecular-weight keratins, which are highly expressed on basal cells, (B) selection of immunofluorescent-positive basal cells for subsequent LCM, (C) captured immunofluorescent-positive cells after LCM photographed from the plastic cap, (D) remainder of the gland after removing the basal cell layer by LCM.
such as gene expression microarrays or proteomics. The use of the microscope can be coupled with special immunostaining procedures if one wishes to capture specific cell types not easily identified by morphology alone. This so-called immunocapture procedure (6,7) further enhances the specificity of tissue procurement for molecular analysis. For example, in a previous study (8), we were able to selectively capture basal cells from benign prostate glands, which are extremely difficult to recognize morphologically but easily identifiable after immunostaining for high-molecular-weight cytokeratin (Fig. 1). We obtained excellent protein quality and were able to identify several protein peaks preferentially expressed in these cells using SELDI-TOF-MS. When we compared the protein spectra from sections of the same tissue sample routinely stained with hematoxylin with those immunostained for high-molecular-weight cytokeratins, there was no difference in the spectra, which argues against any significant protein deterioration due to the immunostaining procedure.

2. Materials

2.1. Tissue Collection and Storage
1. Tissue-Tek Cryomold-standard (Sakura, Torrance, CA)
2. Tissue-Tek OCT (Sakura)
3. 2-methylbutane (Mallinckrodt, St. Louis, MO)
4. Shandon Histobath II (Thermo Electron Corp., Waltham, MA)
5. –80°C freezer
2.2. Frozen Tissue Sectioning and Staining
1. Cryostat
2. HistoGene™ LCM Frozen Section Staining Kit (Arcturus Biosciences Inc, Mountain View, CA). The kit contains HistoGene staining solution, ethanol (75, 95, 100%), xylene, nuclease-free distilled water, HistoGene LCM slides, and disposable slide staining jars.
3. 1× PBS made from 10× stock (Fisher Scientific)
4. Acetone (high purity grade)
5. Cy3-Streptavidin (Invitrogen, Carlsbad, CA)
6. Biotinylated mAbs: Any antibody can be biotinylated. We routinely have 1.5 mg of antibody labeled with 0.2 mg biotin (Alpha Diagnostic Intl. Inc., San Antonio, TX).
2.3. LCM
1. PixCell II LCM System (Arcturus Biosciences Inc)
2. AutoPix™ Automated LCM System (Arcturus Biosciences Inc)
3. CapSure® LCM caps (Arcturus Biosciences Inc)
4. Prep Strip (Arcturus Biosciences Inc)
5. Microcentrifuge tubes (0.5 ml) (Eppendorf North America)
2.4. LCM Lysate
1. Micropipet capable of delivering 1 μl accurately
2. 20 mM HEPES (pH to 8.0 with NaOH) with 1% Triton X-100
3. Sonicator (optional)
4. 1× PBS
2.5. SELDI Analysis
1. IMAC3 or WCX2 Protein Array Chips (Ciphergen Biosystems, Palo Alto, CA)
2. HPLC grade water (Fisher Scientific)
3. 100 mM sodium acetate pH 4.0
4. 100 mM ammonium acetate pH 4.0
5. Sinapinic acid (SPA) (Ciphergen Biosystems, Palo Alto, CA)
6. Optima grade acetonitrile (Fisher Scientific)
7. Trifluoroacetic acid, packaged in 1 ml ampules (Pierce Chemical Company, Rockford, IL)
2.6. MALDI Analysis
1. Target plate
2. α-Cyano-4-hydroxycinnamic acid (CHCA) (Bruker Daltonics, Palo Alto, CA)
3. SPA (Fluka)
4. Optima grade acetonitrile (Fisher Scientific)
5. Trifluoroacetic acid, packaged in 1 ml ampules (Pierce Chemical Company)
3. Method

3.1. Tissue Collection and Storage
1. Embed the tissue sample in OCT using a cryomold and freeze it in the Shandon Histobath, which contains 2-methylbutane (see Note 1).
2. Hold the cryomold against the 2-methylbutane liquid interface and allow the tissue to freeze slowly (3–5 min) (see Note 2).
3. After achieving complete freezing, place the frozen cryomold containing the sample in a plastic bag and transport the sample within a liquid nitrogen container. Store the sample in a –80°C freezer.
3.2. Frozen Tissue Sectioning and Staining

3.2.1. Regular Hematoxylin Staining

Prior to LCM, cut 8-μm-thick frozen tissue sections from the cryostat (discard folded or wrinkled sections). Keep slides with sections in the cryostat after cutting and stain as follows (see Notes 3 and 9; slides may also be frozen at –80°C until stained):
1. Remove the slides from the freezer or cryostat and place in 70% ethanol (30 s).
2. Place in purified water (5 s).
3. Add the Histogene staining solution (30 s) (see Note 4).
4. Rinse the slides with purified water.
5. Wash with 70% ethanol (60 s).
6. Wash with 95% ethanol twice (60 s each).
7. Wash with 100% ethanol (60 s).
8. Place the slides in xylene to ensure complete dehydration (10 min) (see Note 5).
9. Shake off and drain carefully by touching the corner with a particle-free tissue paper.
10. Air dry the slides to allow xylene to evaporate completely (at least 2 min).
11. The slides are now ready for LCM (they should not be coverslipped) (see Note 12).
3.2.2. Immunofluorescence Staining (see Note 7)
1. Thaw slides (1 min).
2. Place in cold acetone at 4°C (2 min).
3. Air dry (30 s).
4. Wash in filtered pH 7.4 1× PBS.
5. Drain off slides.
6. Add 100 μl of first biotinylated Ab at optimal dilution: recommended concentration 30–100 μg/ml, optimize for best results (3 min).
7. Rinse in PBS.
8. Add 100 μl of Cy3 at dilution 1:100 (user may decide the optimal staining concentration of the Cy3 Streptavidin conjugate by performing a serial dilution staining experiment) (1 min).
9. Rinse in PBS.
10. Place slides in 75% ethanol (30 s).
11. Place slides in 95% ethanol (30 s).
12. Place slides in 100% ethanol (30 s).
13. Place slides in xylene (5 min) (see Note 6).
14. Air dry (5 min).
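Because Note 7 advises keeping the total immunostaining and dehydration procedure under 30 min, it can be useful to add up the step times before starting. The sketch below sums the timed steps listed above; the 30-s allowance for each untimed wash, drain, or rinse step is an assumption made for illustration and is not part of the protocol.

```python
# Timed steps of the immunofluorescence protocol, in seconds (from the list above).
TIMED_STEPS = [
    ("thaw slides", 60),
    ("cold acetone", 120),
    ("air dry", 30),
    ("biotinylated antibody", 180),
    ("Cy3 conjugate", 60),
    ("75% ethanol", 30),
    ("95% ethanol", 30),
    ("100% ethanol", 30),
    ("xylene", 300),
    ("final air dry", 300),
]
UNTIMED_STEPS = 4           # PBS wash, drain, and two PBS rinses
ASSUMED_UNTIMED_SECONDS = 30  # hypothetical allowance per untimed step
LIMIT_SECONDS = 30 * 60     # upper limit given in Note 7


def total_staining_seconds(untimed_allowance=ASSUMED_UNTIMED_SECONDS):
    """Sum the timed steps plus an assumed allowance for each untimed step."""
    timed = sum(seconds for _, seconds in TIMED_STEPS)
    return timed + UNTIMED_STEPS * untimed_allowance


if __name__ == "__main__":
    total = total_staining_seconds()
    print(f"estimated total: {total / 60:.1f} min")
    print("within 30-min limit" if total <= LIMIT_SECONDS else "exceeds 30-min limit")
```

The timed steps alone add up to 19 min, so roughly 11 min remain for handling, rinses, and slide transfer before the limit in Note 7 is reached.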
3.3. LCM

The newer instruments developed by Arcturus, such as the AutoPix™ and the Veritas™, are enclosed, automated systems operated entirely by computer. We describe here the LCM procedure using the PixCell II instrument, which is manually operated, is currently the least expensive LCM instrument, and is therefore more widely used (see Note 8).
1. Turn on the instrument and enter pertinent data such as slide #, case #, cap lot #, and thickness (always 8 μm), then place the stained slide on the mechanical stage (see Note 10).
2. Turn on the vacuum pump to immobilize the slide (small aperture on the left side of the stage) and push in the filter bottom for optimal image quality.
3. Place the caps in the rail on the right side of the stage. Unlock the mechanical arm, move it toward the tissue, and drop it onto the top of the tissue. Align the joystick to move the stage to a centered and perpendicular position before beginning the microdissection process.
4. Turn on the key on the right side of the power supply to enable the infrared laser. Focus the laser before beginning microdissection using the smallest ring diameter and adjust to the desired diameter.
5. Select the appropriate energy (mW) and time of exposure (ms) for the desired laser ring diameter and confirm its effectiveness in an area of the tissue that is of no interest, using a cap that will be discarded (see Note 11).
6. Fire the laser each time the ring is over the desired tissue target. Move the stage supporting the glass slide with the aid of the joystick, which allows fine and precise motion. Check that the tissue is appropriately microdissected and capture images of the tissue before and after LCM as well as an image of the target tissue that was captured in the cap (see Note 13).
7. When the cap is filled with the desired amount of tissue, remove the cap and use a 0.5-ml microcentrifuge tube to collect the tissue (the cap is designed to fit the tube perfectly and close it) (see Note 14).
8. The microcentrifuge tube can be safely stored in a –80°C freezer without adding any buffer and without lysing the cells; lysis may be done later at a convenient time.
3.4. LCM Lysate
1. Lyse a total of 1500–2000 laser shots (about 3000 to 6000 microdissected cells) in 4 μl of 20 mM HEPES pH 8.0 with 1% Triton X-100. This is sufficient for one SELDI protein array or one MALDI run. For 2D analysis, a minimum of approximately 25,000 cells is necessary.
2. Add the above lysing buffer to the cap and place the cap in the microfuge tube. This is usually done with two additions of 2 μl to the LCM cap. Pipet up and down and scrape the surface of the LCM cap to remove all the cells. A gentle scraping motion with the pipet tip may be necessary to remove the cells, but be careful not to rip the polymer film (see Note 15). Transfer the lysate from the surface of the cap to the microfuge tube. Cells from multiple caps may be combined by subsequently using the 4 μl of LCM lysate to lyse cells on another cap. In this way the volume will remain small. If 2DGE is to be performed, the lysis procedure is different (see below). Make a 1:10 dilution of each lysate in PBS (for IMAC3 SELDI chips) or 100 mM ammonium acetate pH 4.0 (for WCX2 chips) (i.e., 36 μl added to the 4 μl lysate) and vortex for at least 1 min (see Note 16). Spin down briefly.
3. Prepare the arrays of the IMAC chip with CuSO4 according to the manufacturer's specifications: 20 μl of 100 mM CuSO4 for 10 min, wash with HPLC water; 20 μl of 100 mM Na acetate pH 4.0 for 5 min, wash with water. Use the Micromix shaker for all incubations with the following settings: Form-20, Amplitude-5.
4. Assemble the bioprocessor with the desired number of chips and add 2× 200 μl PBS to each well, incubate on the shaker for 5 min each time. Pretreat the WCX2 chip with 100 mM ammonium acetate pH 4.0. This can be done on the BioMek robot.
5. Add the diluted lysate to the spot on the chip(s) in the bioprocessor.
6. Cover the bioprocessor with a plastic seal and incubate overnight on MicroMix shaker at room temperature, using the same setting as given above.
7. Remove lysates carefully with a pipet; do not touch the surface of the arrays. Save if needed for another experiment.
8. Wash the spots in bioprocessor 2× with 200 μl PBS (for IMAC) or 100 mM ammonium acetate pH 4.0 (for WCX) for 5 min on the shaker.
9. Wash the arrays with HPLC water 2× for 5 min (on shaker).
10. Remove the chip(s) from bioprocessor and give them a final rinse with HPLC water.
11. Let the chip dry completely, usually overnight.
12. Add 2× 0.5 μl saturated SPA dissolved in 50% acetonitrile, 0.5% TFA.
13. Read at instrument settings optimized for resolution and intensity for the m/z range of 1000–20,000. Higher laser energy will be required to see higher molecular weight peaks.
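The 1:10 dilution in step 2 (36 μl of diluent added to 4 μl of lysate) follows from simple arithmetic, which the short sketch below generalizes to other lysate volumes. The function name is hypothetical, and the calculation assumes that "1:10" means one part lysate in ten parts final volume, which is consistent with the volumes given in the protocol.

```python
def diluent_volume_ul(sample_ul, dilution_factor):
    """Volume of diluent (in microliters) to add for a 1:dilution_factor dilution.

    Assumes 1:N means one part sample in N parts final volume.
    """
    if sample_ul <= 0 or dilution_factor < 1:
        raise ValueError("sample volume must be positive and dilution factor >= 1")
    return sample_ul * (dilution_factor - 1)


if __name__ == "__main__":
    lysate_ul = 4.0  # lysate volume from step 1
    added = diluent_volume_ul(lysate_ul, 10)
    print(f"add {added} ul diluent, final volume {lysate_ul + added} ul")
```

For lysates pooled from several caps into the same 4 μl, the volume to add stays the same because the lysate volume, not the number of cells, determines the dilution.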
One method of MALDI sample preparation that reduces the complexity of cell lysates while remaining robust and easily amenable to automated high-throughput applications is sample fractionation using magnetic beads (MB) combined with pre-structured MALDI sample supports (AnchorChip Technology). Several magnetic bead types with different surface chemistries can be used to fractionate serum and increase the number of detectable peaks (see the chapter on serum protein profiling for details). For MALDI analysis, dilute the lysate 1:10 with CHCA or SPA matrix (5–10 mg/ml in 50% acetonitrile, 0.1% TFA). Spot on the Anchorplate and read in a MALDI instrument. Further dilution and/or fractionation of the lysate may be necessary to achieve optimal spectra.

If 2DGE analysis is to be performed, the cells should be lysed as follows: Remove the LCM cap from the tube and add a small volume (10 μl) of 1D focusing rehydration buffer to the tube. The preferred number of laser shots is approximately 100 K. Replace the cap and invert the tube to allow the buffer to come in contact with the cells on the cap and lyse them. Incubate 5 min at room temperature. Sonicate the samples to ensure lysis. Continue with the basic protocol for 1D IEF and 2D analysis.

4. Notes
1. In our experience, a time window of 30 min between completion of surgery and tissue freezing yields good protein quality for most proteomic techniques. However, if one is studying protein phosphorylation, this begins to decrease significantly 20 min after completion of surgery (10).
2. When freezing the tissue sample in the Histobath, avoid immediate and complete immersion in 20002 methylbutane to preserve optimal tissue morphology. Hold the sample at the liquid interface with minimal immersion and wait until the OCT and the tissue slowly turn white. 3. Use uncoated glass slides for LCM. Coated or electrically-charged glass slides will interfere with the detachment process of the plastic polymer and are not suitable for LCM. 4. Precipitate from Hematoxylin can contaminate the surface of the tissue. Filter these solutions. Add one tablet of protease inhibitor to each staining bath (we use Complete, from BMB). Do not add protease inhibitor to alcohol baths. If using the histogene staining kit (Arcturus) for frozen sections, this is not necessary. 5. Change all the staining and alcohol solutions after staining 20 slides. 6. Poor transfers may result if 100% ethanol has hydrated. Increasing the incubation time in xylene often improves transfer. 7. When specific cells need to be microdissected and these cannot be identified morphologically, the cells of interest can be immunostained with specific mAbs against proteins highly expressed on those cells (immunophenotype). It is critical to expedite the immunostaining procedure because the shorter the immunostaining time, the better the protein quality. One must avoid exceeding 30 min for the total immunostaining and dehydration procedure. In the past, we have used the immunoperoxidase technique with DAB labeling (6), but it was difficult to perform quick enough to preserve optimal protein integrity. Also, manual microdissection of DAB labeled cells with Pixel II is extremely tedious and nonpractical. The immunofluorescence staining method (7) is faster and easier to perform. This method coupled with the Autopix microscope, which has dark field fluorescence and automation capabilities, is the ideal procedure for immunocapture. 
Since Cy3-streptavidin binds to the antibody labeled with biotin, there is no need for a secondary antibody, thereby decreasing the necessary staining time. It is recommended to run negative control staining; use a biotinylated control antibody from the same animal species and of the same isotype as your primary antibody. Dilute to the same working concentration as the primary antibody. 8. Always wear gloves while performing LCM, including when handling the plastic caps. 9. The thickness of the tissue section is a critical parameter for effective LCM. In our experience (using the Pixel II and the Autopix instruments by Arcturus), 8 μm is the optimal thickness for LCM. 10. Smooth out the surface of the tissue section with a Prep-strip before placing the slide on the LCM instrument, which improves the efficiency and uniformity of the microdissection process. 11. The main factors affecting the efficiency of LCM include the energy, the time of exposure, and the diameter of the laser beam. Regarding the diameter, when using Pixel II, the smallest ring is 7 μm, the medium ring is 15 μm, and the widest ring is 30 μm. Very often, we have used the medium ring (15 μm, which lifts up about three cells with each shot). When trying to microdissect single cells with Pixel II, one must use the smallest (7 μm) diameter ring, but our experience with it was frustrating. With Autopix, we have observed that microdissection of individual cells is better achieved by setting the laser ring at 10 μm diameter, below which it becomes very difficult to lift up cells efficiently. A 30-μm diameter laser is very effective for microdissection of whole glands and other large tissue structures.
Regarding the other two parameters, the optimization depends on the tissue type. For instance, for prostate tissue, an energy of 80 mW with a duration of 0.5 ms is usually effective for a medium-size ring (15 μm). These parameters are tuned by trial and error, progressively adjusting the energy and the time of exposure for the desired diameter, which in turn depends on the microdissection task (single cells vs. medium- or large-size tissue structures). 12. Another factor that affects the effectiveness of LCM is the time the tissue section has been dry after the staining and dehydration procedure. Ideally, the tissue should be stained and microdissected within 1 h. One must avoid having the slide under LCM for more than 4 h. If microdissecting many tissues, stain only four slides at a time. 13. When capturing images before and after microdissection for documentation purposes, make sure the image on the monitor is in focus because that is the image that will be captured. Sometimes the image is in focus through the microscope but out of focus on the monitor. In a typical experiment, you will capture the image before and after firing the laser, which provides a record of how effectively the target cells were removed. You can also capture the image of the microdissected cells on the polymer cap. 14. Avoid allowing the LCM caps to become excessively crowded. When using the 15-μm laser ring, microdissection yields about three cells per shot. One should expect around 3000 cells for each 1000 shots, which is about right for a single cap. 15. LCM caps can be viewed under a dissecting microscope to ensure that all cells have been removed from the polymer film after the lysing procedure. 16. Depending on the cell type, vigorous vortexing and sonication may be necessary to completely lyse the cells after they are removed from the cap.
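The cap-loading arithmetic in Note 14 can be sketched as a small helper. This is only an illustration: the function names and the default of three cells per shot (the figure quoted for the 15-μm ring) are assumptions for the example, and actual yields depend on the tissue and laser settings.

```python
import math

# Hypothetical helpers; only the "about three cells per shot" figure
# (15-um ring) comes from Note 14.
def estimate_captured_cells(shots: int, cells_per_shot: float = 3.0) -> int:
    """Rough number of cells on a cap after a given number of laser shots."""
    if shots < 0:
        raise ValueError("shots must be non-negative")
    return round(shots * cells_per_shot)


def shots_needed(target_cells: int, cells_per_shot: float = 3.0) -> int:
    """Smallest number of shots expected to yield at least target_cells."""
    if target_cells < 0 or cells_per_shot <= 0:
        raise ValueError("target_cells must be >= 0 and cells_per_shot > 0")
    return math.ceil(target_cells / cells_per_shot)


if __name__ == "__main__":
    print(estimate_captured_cells(1000))  # cells expected from 1000 shots
    print(shots_needed(3000))             # shots needed for about 3000 cells
```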
References 1. Prieto, D.A., Hood, B.L., Darfler, M.M., Guiel, T.G., Lucas, D.A., Conrads, T.P., Veenstra, D.T., and Krizman, D.B. (2005) Liquid Tissue™: proteomic profiling of formalin-fixed tissues. Biotechniques 38: 32–5. 2. Emmert-Buck, M.R., Bonner, R.F., Smith, P.D., Chuaqui, R.F., Zhuang, Z., Goldstein, S.R., Weiss, R.A., and Liotta, L.A. (1996) Laser capture microdissection. Science 274: 998–1001. 3. Espina, V., Milia, J., Wu, G., Cowherd, S., and Liotta, L.A. (2006) Laser capture microdissection. Methods Mol Biol 319: 213–29.
4. Best, C.J., and Emmert-Buck, M.R. (2001) Molecular profiling of tissue samples using laser capture microdissection. Expert Rev Mol Diagn. 1: 53–60. 5. Ornstein, D.K., Gillespie, J.W., Paweletz, C.P., Duray, P.H., Herring, J., Vocke, C.D., Topalian, S.L., Bostwick, D.G., Linehan, W.M., Petricoin, E.F., III, and Emmert-Buck, M.R. (2000) Proteomic analysis of laser capture microdissected human prostate cancer and in vitro prostate cell lines. Electrophoresis 21: 2235–42. 6. Fend, F., Emmert-Buck, M.R., Chuaqui, R., Cole, K., Lee, J., Liotta, L.A., and Raffeld, M. (1999) Immuno-LCM: laser capture microdissection of immunostained frozen sections for mRNA analysis. Am J Pathol 154: 61–6. 7. Murakami, H., Liotta, L., Star, R.A. (2000) IF-LCM: laser capture microdissection of immunofluorescently defined cells for mRNA analysis rapid communication. Kidney Int 58(3): 1346–53. 8. Cazares, L.H., Adam, B.L., Ward, M.D., Nasim, S., Schellhammer, P.F., Semmes, O.J., and Wright, G.L., Jr (2002) Normal, benign, preneoplastic, and malignant prostate cells have distinct protein expression profiles resolved by surface enhanced laser desorption/ionization mass spectrometry. Clin Cancer Res 8: 2541–52. 9. Diaz, J., Cazares, L.H., Corica, A., and Semmes O. (2004) Selective capture of prostatic basal cells and secretory epithelial cells for proteomic and genomic analysis. Urol Oncol 22(4): 329–36. 10. Mora, L., Buettner, R., Seigne, J., Diaz, J., Hamad, N., Garcia, R., Bowman, T., Falcone, R., Faigurth, R., Cantor, A., Muro-Cacho, C., Livistong, S., Levitzki, A., Kraker, A., Karras, J., Pow-Sang, J., and Jove, R. (2002) Constitutive activation of Stat3 in human prostate tumors and cell lines: direct inhibition of stat3 signaling induces apoptosis of prostate cancer cells. Cancer Research 62: 6659–66.
4 Protein Profiling of Human Plasma Samples by Two-Dimensional Electrophoresis Sang Yun Cho, Eun-Young Lee, Hye-Young Kim, Min-Jung Kang, Hyoung-Joo Lee, Hoguen Kim, and Young-Ki Paik
Summary Human plasma is regarded as the most complex and well-known clinical specimen that can be easily obtained; alterations in the levels of plasma proteins or their corresponding enzyme activities may reflect either a healthy or a diseased state. Given that there is no defined genomic information as to the intact protein components in plasma, protein profiling could be the first step toward its molecular characterization. Several problems exist in the analysis of plasma proteins, however. For example, the wide dynamic range of protein concentrations, the presence of high-abundance proteins, and post-translational modifications need to be considered before proteomic studies are undertaken. In particular, efficient depletion or pre-fractionation of high-abundance proteins is crucial for the identification of low-abundance proteins that may include potential biomarkers. After the removal of high-abundance proteins, protein profiling can be initiated using two-dimensional electrophoresis (2DE), which has been widely used for displaying the differential proteome under specific physiological conditions. Here, we describe a typical 2DE procedure for the plasma proteome under either a healthy or a diseased state (e.g., liver cancer) in which pre-fractionation and depletion are integral steps in the search for disease biomarkers.
Key Words: 2-dimensional gel electrophoresis; plasma; HPPP; immunoaffinity column.
Abbreviations: IEF: Isoelectric Focusing, IPG: Immobilized pH Gradient, TCA: Trichloroacetic Acid, FFE: Free Flow Electrophoresis, HPMC: Hydroxypropyl Methylcellulose, TBP: Tributylphosphine, 2DE: 2-dimensional Gel Electrophoresis, BPB: Bromophenol Blue, CHCA: α-cyano-4-hydroxycinnamic acid, LTQ: Linear Ion Trap, MALDI-TOF: Matrix-assisted Laser Desorption Ionization - Time of Flight Mass Spectrometry, HPPP: Human Plasma Proteome Project.
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols Edited by: A. Vlahou © Humana Press, Totowa, NJ
1. Introduction Human plasma is an intravascular fluid that serves as a liquid medium for blood proteins that are derived from various cells, tissues, and other biofluids (1). In fact, the components of plasma are very heterogeneous, including inorganic ions (e.g., bicarbonate, calcium), metabolic intermediates (e.g., cholesterol, glucose), and plasma proteins (e.g., albumin, globulin), which are important in maintaining body fluid balance, immune response, blood clotting, and other metabolic mechanisms of homeostasis. Plasma contains many different proteins that are primarily synthesized in the liver and are often subjected to post-translational modification (PTM) (2). Since human plasma is the most complex and well-known clinical specimen that can be easily obtained, it has been a central target for many biomedical studies (2). Alterations in the levels of plasma proteins or their corresponding enzyme activities may reflect either a healthy or a diseased state that can be monitored by various analytical tools, including biochemical assays and proteomics. Given that there is no defined genomic information as to the intact protein components in plasma, a proteomic study may be the method of choice (3,4). Recently, plasma protein profiling was conducted as part of the plasma proteome project of HUPO, termed HPPP (5). The pilot phase of HPPP produced 3020 non-redundant proteins that were found to be present in human plasma and serum (5,6). However, several points must be addressed before proteomic studies are undertaken. First, plasma protein is believed to contain the most dynamic concentration range (more than 10 orders of magnitude) of each constituent protein, creating many technical obstacles in proteomic detection by mass spectrometry (MS) (2,3). For example, the removal of high-abundance proteins (e.g., albumin, IgG, transferrin, fibrinogen, IgA, etc.) 
that occupy more than 90% of all plasma proteins prior to biochemical analysis may be a big challenge and perhaps even problematic in light of plasma-derived biomarker discovery (3,7). Second, since many plasma proteins have many structural isoforms, a more efficient analytical system is needed to facilitate the analysis of multiple isoforms of plasma proteins (1). Third, since many plasma proteins are synthesized as pre-proteins that are subjected to various PTMs for cellular function, more efficient methods to analyze modified proteins (e.g., glycosylated proteins) are required. For example, since glycopeptides are not easily ionized during MS analysis, and the attached glycans lead to inadequate spectral data and low detection sensitivity, a strategy
for the removal of glycans must be considered for protein identification. Taken together, all these factors are important for the proteomic study of plasma (8). Of the problems listed above, the first to be addressed in the protein profiling of plasma is the depletion or pre-fractionation of high-abundance plasma proteins (3,4,7). Without this depletion procedure, the identification of low-abundance proteins (including biomarkers) may not be practical. After the removal of high-abundance proteins, two-dimensional electrophoresis (2DE) may be the first step chosen to analyze plasma proteins because it is easy to perform in the laboratory. Although 2DE has several limitations in terms of reproducibility and the separation of membrane proteins, low-molecular-weight proteins, and proteins with extreme pIs (<3 or >10), this technique has been widely used as a first analysis of proteins in a particular physiological state when coupled with MS (9). Recently, quantitative 2DE was performed with a difference gel electrophoresis (DIGE) system (see the chapter by Friedman and Lilley for details), where two or three differentially staining dyes can be applied to specific protein populations to determine their quantitative changes in expression levels under a specific physiological condition (10). Thus, this chapter is intended to provide the reader with the necessary information on the systematic analysis of the plasma proteome using 2DE in an attempt to search for disease biomarkers from the plasma proteins of patients with hepatocellular carcinoma (HCC) (11,12).
2. Materials 2.1. Preparation of Human Plasma Samples 1. Blood collection tubes: BD Plus Plastic K2 EDTA (BD, 367525; 10 mL), BD Glass Serum with silica clot activator (367820, 10 mL). 2. Protease inhibitor (Complete Protease Inhibitor Cocktail, Roche, 11 697 498 001, 20 tablets): One tablet contains protease inhibitors (antipain, bestatin, chymostatin, leupeptin, pepstatin, aprotinin, phosphoramidon, and EDTA) sufficient for the processing of 100 mL plasma samples. Prepare 25× stock solutions in 2 mL distilled water.
2.2. Depletion of High-Abundance Proteins with an Immunoaffinity Column 1. HPLC system, such as the HP1100 LC system (Agilent). 2. Multiple affinity removal system (MARS): LC column (Agilent, 5185-5984); Buffer A for sample loading, washing, and equilibrating (Agilent, 5185-5987); Buffer B for eluting (Agilent, 5185-5988).
2.3. Isoelectric Focusing (IEF) with Immobilized pH Gradient (IPG) Strip 1. MultiPhorTM (GE Healthcare) or Protean IEF cell (Bio-Rad): Numerous commercially available isoelectric focusing units exist 2. Re-swelling tray 3. Mineral oil: Immobiline Dry Strip Cover Fluid (GE Healthcare) 4. Power supply, such as the EPS 3501 XL power supply (GE Healthcare) 5. Thermostatic circulator: Multitemp III thermostatic circulator (GE Healthcare) 6. IPG strip: Immobiline Dry Strip, pH 3-10 nonlinear (NL), or pH 4.0-5.0, and pH 5.5-6.7, 18 cm long, 0.5 mm thick (GE Healthcare) or with the same pH ranges for ReadyStrip IPG strip (Bio-Rad) 7. Carrier ampholyte mixtures: IPG buffer or Pharmalyte, same range as the selected IPG strip 8. Sample buffer: 7 M urea, 2 M thiourea, 4% (w/v) CHAPS, 0.5% (v/v) ampholyte, 100 mM DTT, 40 mM Tris-HCl, pH 7.5, a trace amount of bromophenol blue (BPB)
2.4. Microscale Solution Isoelectric Focusing: ZOOM® 1. ZOOM® IEF Fractionator (Invitrogen, ZF10001) 2. ZOOM® disks: pHs 3.0, 4.6, 5.4, 6.2, 7.0, and 10.0 [Invitrogen, ZD series (e.g., ZD10030 for pH 3.0)] 3. IEF Anode Buffer (50X) (Novex, LC5300, 100 mL) 4. IEF Cathode Buffer (10X) (Novex, LC5310, 125 mL) 5. Anode buffer: 8.4 g urea, 3.0 g thiourea, 3.3 mL Novex® IEF Anode Buffer (50X). Add water to a final volume of 20 mL. 6. Cathode buffer: 8.4 g urea, 3.0 g thiourea, 3.3 mL Novex® IEF Cathode Buffer (10X). Add water to a final volume of 20 mL.
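As a quick check on the electrode buffer recipes, the masses listed above can be converted to molar concentrations. The sketch below is illustrative only; the function name is hypothetical, and the molar masses (urea 60.06 g/mol, thiourea 76.12 g/mol) are standard values rather than figures from this chapter. With 8.4 g urea and 3.0 g thiourea in 20 mL, the result is close to 7 M urea and 2 M thiourea, matching the sample buffer in Subheading 2.3.

```python
# Standard molar masses in g/mol (not taken from the chapter).
UREA_G_PER_MOL = 60.06
THIOUREA_G_PER_MOL = 76.12


def molarity(mass_g: float, molar_mass_g_per_mol: float, volume_ml: float) -> float:
    """Molar concentration (mol/L) of a solute dissolved to a final volume."""
    if volume_ml <= 0 or molar_mass_g_per_mol <= 0:
        raise ValueError("volume and molar mass must be positive")
    return (mass_g / molar_mass_g_per_mol) / (volume_ml / 1000.0)


if __name__ == "__main__":
    print(f"urea:     {molarity(8.4, UREA_G_PER_MOL, 20):.2f} M")
    print(f"thiourea: {molarity(3.0, THIOUREA_G_PER_MOL, 20):.2f} M")
```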
2.5. Fractionation of Plasma Samples by Free Flow Electrophoresis (FFE) 1. ProTeam™ FFE instrument (Tecan) 2. 1% 2-(4-sulfophenylazo)-1,8-dihydroxy-3,6-naphthalenedisulfonic acid (SPADNS) (Tecan, 517074) 3. 0.8% hydroxypropyl methylcellulose (HPMC) (Tecan, 5170709) 4. pI markers: mixture of pI markers that indicate pHs 4.2, 5.1, 6.3, 7.4, 8.7, and 10.1 (Tecan, 5170705) 5. Prolyte™ 1, Prolyte™ 2, and Prolyte™ 3 (Tecan, 0309081, 0309102, and 0309093) 6. Anodic stabilization medium (Inlet I1): 14.5% (w/w) glycerol, 8 M urea, 0.03% (w/w) HPMC, 100 mM H2SO4 7. Separation medium 1 (Inlet I2): 14.5% (w/w) glycerol, 8 M urea, 0.03% (w/w) HPMC, 14.5% (w/w) Prolyte™ 1 8. Separation medium 2 (Inlet I3−5): 14.5% (w/w) glycerol, 8 M urea, 0.03% (w/w) HPMC, 14.5% (w/w) Prolyte™ 2 9. Separation medium 3 (Inlet I6): 14.5% (w/w) glycerol, 8 M urea, 0.03% (w/w) HPMC, 14.5% (w/w) Prolyte™ 3 10. Cathodic stabilization medium (Inlet I7): 14.5% (w/w) glycerol, 8 M urea, 0.03% (w/w) HPMC, 100 mM NaOH 11. Counter flow medium (Inlet I8): 14.5% (w/w) glycerol, 8 M urea 12. Anodic circuit electrolyte: 100 mM H2SO4 13. Cathodic circuit electrolyte: 100 mM NaOH
2.6. Preparation of 2D Gels 1. Gradient former: One of the two Bio-Rad models can be used in this step: Model 385 (30-100 mL capacity) or Model 395 (100-750 mL capacity). 2. Orbital shaker with speed controller. 3. SDS-PAGE: Protean II xi multicell and multicasting chamber (Bio-Rad) or Ettan DALT twelve large vertical system (GE Healthcare). 4. 5× Tris-HCl buffer: Dissolve 227 g Tris in 800 mL distilled water and adjust the buffer to pH 8.8 with HCl (∼30 mL). Add distilled water to a final volume of 1 L. 5. 5× Gel buffer: Dissolve 15 g Tris, 72 g glycine, and 5 g sodium dodecyl sulfate (SDS) in 800 mL distilled water and add distilled water to a final volume of 1 L. 6. SDS equilibration buffer: 6 M urea, 2% (w/v) SDS, 5× gel buffer (pH 8.8), 50% (v/v) glycerol, and 2.5% (w/v) acrylamide monomer. 7. Acrylamide stock solution: Acrylamide/Bis-acrylamide 37.5:1, 40% (w/v) solution (Amresco, M157, 500 mL). 8. Fixing solution: 40% (v/v) methanol and 5% (v/v) phosphoric acid in distilled water. 9. Coomassie blue G-250 staining solution: 17% (w/v) ammonium sulfate, 3% (v/v) phosphoric acid, 34% (v/v) methanol, and 0.1% (w/v) Coomassie blue G-250 in distilled water.
2.7. 2D Gel Image Analysis 1. Scanner with transparency unit, such as Bio-Rad GS710 or GS800 2. 2D gel image analysis program: Image Master Platinum 5 (GE Healthcare), PDQuest 7.3.0 (Bio-Rad), or Progenesis Discovery (NonLinear Dynamics, Ltd.)
2.8. Destaining, In-gel Deglycosylation, and In-gel Tryptic Digestion 1. Speed Vac (Heto) 2. PNGase F stock solution for in-gel deglycosylation PNGase F (Glyko, Inc, GKE5010). Dilute 1 μL PNGase F (2 mU) with 2.5 mL 1× N-glycanase incubation buffer (20 mM sodium phosphate, pH 7.5, and 0.02% (w/v) sodium azide)
3. Sequencing-grade modified trypsin (Promega, V5111, 100 μg, 18,100 U/mg) 4. 50 mM ammonium bicarbonate
2.9. Desalting of Peptides and MALDI Plating 1. GELoader tips (Eppendorf, No. 0030 048.083, 20 μL capacity) 2. Poros 10 R2 resin (PerSeptive Biosystems, 1-1118-02, 0.8 g) 3. Oligo R3 resins (PerSeptive Biosystems, 1-1339-03, 6.3 g) 4. 2% (v/v) formic acid in 70% (v/v) acetonitrile (ACN) 5. 0.1% (v/v) trifluoroacetic acid in 70% (v/v) ACN 6. 1-mL syringe 7. Matrix: α-cyano-4-hydroxycinnamic acid (CHCA) 8. Opti-TOF™ 384-well insert (123 × 81 mm, 1016491, Applied Biosystems)
2.10. MALDI-TOF and Peptide Mass Fingerprinting 1. MALDI-TOF and MALDI-TOF/TOF: Voyager DE-Pro and 4800 MALDI TOF/TOFTM Analyzer (Applied Biosystems) equipped with a 355-nm Nd:YAG laser. The pressure in the TOF analyzer is approximately 7.6e-07 Torr.
3. Methods 3.1. Human Plasma Sample Preparation The following protocol is conducted according to the HUPO reference sample collection protocol (13). 1. Each sample pool consisted of 400 mL blood from one healthy, fasting male and one healthy, fasting postmenopausal female, and was collected into 10-mL tubes by two venipunctures, 20 tubes per veni-puncture (see Note 1). 2. Equal numbers of tubes and aliquots were generated with appropriate concentrations of K2 -EDTA, lithium heparin, or sodium citrate for plasma or were permitted to clot at room temperature for 30 min to yield serum (with micronized silica as the clot activator) (see Note 2). 3. The specimens were centrifuged for 10–15 min under refrigerated conditions at 2–6°C. 4. The resultant serum and plasma from 10 spun tubes of the same type from each donor were pooled into one secondary 50-mL conical bottom BDTM Falcon tube for each tube type. 5. The secondary tube was centrifuged at 2400×g for 15 min to remove residual cellular material from serum and to prepare platelet-poor plasma from the EDTA, heparin, and citrate secondary tubes. 6. Equal volumes of either serum or plasma were pooled from each secondary tube into media bottles (see Note 3). 7. Serum/plasma was mixed gently and kept on ice while distributed as 20-μL aliquots into cryovials and was then frozen and stored at –70°C.
3.2. Depletion of High-abundance Proteins with an Immunoaffinity Column For efficient depletion of high-abundance proteins prior to their molecular analysis, many reports have indicated that it is convenient to use commercially available immunoaffinity columns, such as the MARS (Agilent) (2,3) or the prepacked 2-mL SepproTM MIXED12 affinity LC column (GenWay Biotech.) (14), coupled with an HPLC system. For depletion of the six most abundant proteins (i.e., albumin, transferrin, IgG, IgA, haptoglobin, and anti-trypsin) in either serum or plasma, we introduced MARS, which has been used successfully with a wide variety of sample types, including cerebrospinal fluid (CSF) and follicular fluid (2,3) (see Fig. 1 ). 1. Dilute human serum or plasma fivefold with Buffer A (for example: 20 μL human plasma with 80 μL Buffer A) containing the protease inhibitor stock solution (40 μL per 1 mL plasma) (see Note 4) (adopted from the manufacturer’s instructions). 2. Remove the particulates with a 0.22-μm spin filter for 1 min at 16,000×g. 3. Inject 75-100 μL of the diluted serum or plasma at a flow rate of 0.5 mL/min.
Fig. 1. The 2DE images of total human plasma proteins that were depleted of the six major abundant proteins through MARS. Proteins were isoelectrically focused with pH 3–10 NL IPG strips in the first dimension and then resolved by 9–16% SDS-PAGE in the second dimension. (A) Whole plasma. (B) Flow-through from MARS. Approximately 800 protein spots are displayed by 2DE and identified by MALDI-TOF mass spectrometry. The names of the major proteins of each gel are marked on the image (5) (from (4), with permission).
4. Collect the flow-through fractions that appear between 1.5 and 4.5 min and store them at –20°C if they are not to be analyzed immediately. 5. Elute bound proteins from the column with Buffer B (elution buffer) at a flow rate of 1 mL/min for 3.5 min. 6. Regenerate the column by equilibrating with Buffer A for an additional 7.4 min at a flow rate of 1 mL/min.
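The dilution in step 1 of this subheading can be worked out for any starting volume. The sketch below is a minimal illustration with a hypothetical function name; it assumes that the protease inhibitor stock (40 μL per 1 mL plasma) is counted as part of the Buffer A portion, which the text does not state explicitly.

```python
def mars_dilution(plasma_ul: float, fold: float = 5.0,
                  inhibitor_ul_per_ml: float = 40.0) -> dict:
    """Volumes (uL) for diluting plasma with Buffer A before injection.

    Assumption: the inhibitor stock is included in the Buffer A portion,
    so plasma + inhibitor + Buffer A equals the final diluted volume.
    """
    if plasma_ul <= 0 or fold <= 1:
        raise ValueError("plasma_ul must be > 0 and fold must be > 1")
    total_ul = plasma_ul * fold
    inhibitor_ul = plasma_ul / 1000.0 * inhibitor_ul_per_ml
    buffer_a_ul = total_ul - plasma_ul - inhibitor_ul
    return {
        "total_ul": total_ul,
        "inhibitor_stock_ul": inhibitor_ul,
        "buffer_a_ul": buffer_a_ul,
    }


if __name__ == "__main__":
    # The example from step 1: 20 uL plasma diluted fivefold.
    for name, value in mars_dilution(20).items():
        print(f"{name}: {value:.1f}")
```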
3.3. TCA/Acetone Precipitation During 2DE, interfering compounds, such as proteolytic enzymes, salts, lipids, nucleic acids, and any residual high-abundance proteins present after depletion, must be removed or inactivated. In the case of plasma samples, the two most important parameters are salt and proteolysis. TCA/acetone precipitation is the most useful method for desalting the whole plasma and the flow-through fractions of MARS. 1. Add 50% (w/v) trichloroacetic acid (TCA, Sigma, T9159) to reach a final TCA concentration of 5-8%. Mix gently by inverting the tube 5 to 6 times and incubate on ice for 2 h. 2. Centrifuge the sample at 14,000×g for 15 min and discard the supernatant. 3. Add 200 μL cold acetone and resuspend the protein pellet with a pipette. 4. Incubate on ice for 15 min and centrifuge the sample at 14,000×g for 20 min, discard the acetone, and dry the pellet in air (see Note 5). 5. Dissolve the pellet in the sample buffer for 2DE and quantify the protein concentration by the Bradford protein assay.
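The volume of 50% TCA stock needed in step 1 follows from a simple mass balance: if a volume V of stock at concentration Cs is added to a sample of volume S, the final concentration is Cf = Cs·V/(S + V), so V = S·Cf/(Cs − Cf). The helper below is a sketch with a hypothetical name; it assumes volumes are additive and that the sample contains no TCA beforehand.

```python
def tca_stock_volume(sample_ul: float, final_pct: float,
                     stock_pct: float = 50.0) -> float:
    """Volume of TCA stock (uL) to add to reach final_pct (w/v) TCA.

    Derived from final_pct = stock_pct * V / (sample_ul + V),
    assuming additive volumes.
    """
    if not 0 < final_pct < stock_pct:
        raise ValueError("final_pct must be between 0 and stock_pct")
    if sample_ul <= 0:
        raise ValueError("sample_ul must be positive")
    return sample_ul * final_pct / (stock_pct - final_pct)


if __name__ == "__main__":
    # Range quoted in step 1 (5-8% final TCA) for a 100-uL sample.
    for pct in (5.0, 8.0):
        print(f"{pct:.0f}% final: add {tca_stock_volume(100, pct):.1f} uL of 50% TCA")
```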
3.4. Rehydration of the IPG Gel Strip For analytical purposes, typically 0.3–1.0 mg protein can be loaded onto an 18-cm-long IPG with a wide pH range (e.g., pH 3-10), or 0.5–2.0 mg on an IPG with a narrow pH range (e.g., pH 5.5–6.7). A narrow-range IPG usually produces a higher resolution when proteins are analyzed by sequential IEF systems: first, fractionate the proteins over several pI ranges in solution with ZOOM® disks or FFE (see Subheadings 3.6 and 3.7) and then perform IEF with IPG strips [one pH unit range strips are also available (e.g., pH 3.0–4.0 or pH 3.5–4.5 up to pH 6.7)]. Certain proteins appear to become trapped in the disk membrane partitions, so sample loss should be considered. 1. Dilute 1.0 mg protein with the sample buffer to a final volume of 400 μL for 18-cm-long IPG strips (see Note 6). 2. Transfer the entire protein-containing sample buffer into the re-swelling tray. 3. Peel off the protective cover from the IPG strip and slowly slide the IPG strip (gel side down) onto the sample solution. Avoid trapping air bubbles and distribute the sample solution evenly under the strips.
4. Overlay the strip with mineral oil and leave for 12-16 h at room temperature (see Note 7 for cup loading)
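Step 1 of this subheading can be planned from the Bradford result. The sketch below, with a hypothetical function name, computes how much dissolved sample and how much additional sample buffer give 1.0 mg protein in 400 μL; it assumes the protein is already dissolved in sample buffer at the measured concentration.

```python
def rehydration_mix(conc_mg_per_ml: float, protein_mg: float = 1.0,
                    final_ul: float = 400.0) -> tuple:
    """Return (sample_ul, buffer_ul) for loading an 18-cm IPG strip.

    Raises ValueError if the sample is too dilute to fit the protein
    load into the final rehydration volume.
    """
    if conc_mg_per_ml <= 0:
        raise ValueError("concentration must be positive")
    sample_ul = protein_mg * 1000.0 / conc_mg_per_ml
    if sample_ul > final_ul:
        raise ValueError("sample too dilute for the requested load and volume")
    return sample_ul, final_ul - sample_ul


if __name__ == "__main__":
    sample_ul, buffer_ul = rehydration_mix(5.0)  # 5 mg/mL stock, illustrative
    print(f"sample: {sample_ul:.0f} uL, sample buffer: {buffer_ul:.0f} uL")
```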
3.5. IEF with IPG Strip 1. Remove the rehydrated IPG strips that are carrying the protein samples and place them (gel side up) on the strip tray. 2. Place the 2.5-cm filter papers, wetted with distilled water, on both sides of the strips at both cathodic and anodic ends. Place the strip tray on the IEF unit. 3. Cover the strips entirely with mineral oil. 4. Program the instrument (e.g., Multiphor II): Increase the voltage from 100 to 3500 V to reach 80,000 total voltage hours (Vh) (e.g., sequentially, 300 Vh at 100 V, 600 Vh at 300 V, 600 Vh at 600 V, 1000 Vh at 1000 V, and 2000 Vh at 2000 V, for a total of 80,000 Vh at 3500 V) (see Notes 8 and 9). 5. During IEF, the temperature is set to 20°C with a water circulator.
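The focusing program in step 4 can be turned into approximate run times, since time (h) = volt-hours / voltage for a step held at constant voltage. The sketch below rests on two assumptions that the text does not spell out: each listed step is a hold rather than a ramp, and 80,000 Vh is the cumulative total, so the 3500-V step supplies whatever remains after the low-voltage steps. All names are hypothetical.

```python
# Low-voltage steps from step 4 as (voltage in V, volt-hours in Vh).
LOW_STEPS = [(100, 300), (300, 600), (600, 600), (1000, 1000), (2000, 2000)]
TOTAL_VH = 80_000       # assumed to be the cumulative total
FINAL_VOLTAGE = 3500    # final focusing voltage in V


def ief_schedule(low_steps=LOW_STEPS, total_vh=TOTAL_VH,
                 final_voltage=FINAL_VOLTAGE):
    """Return rows of (voltage, volt_hours, hours), final step included."""
    rows = [(v, vh, vh / v) for v, vh in low_steps]
    remaining_vh = total_vh - sum(vh for _, vh in low_steps)
    if remaining_vh < 0:
        raise ValueError("low-voltage steps already exceed the total Vh")
    rows.append((final_voltage, remaining_vh, remaining_vh / final_voltage))
    return rows


def total_hours(rows) -> float:
    """Total run time in hours for a schedule returned by ief_schedule."""
    return sum(hours for _, _, hours in rows)


if __name__ == "__main__":
    schedule = ief_schedule()
    for voltage, vh, hours in schedule:
        print(f"{voltage:>5} V  {vh:>6} Vh  {hours:5.1f} h")
    print(f"total: {total_hours(schedule):.1f} h")
```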
3.6. Microscale Solution IEF: ZOOM® To reduce typical artifacts that may occur when using narrow-range IPG strips (e.g., streaking, distortion, and loss of protein spots), one may use MicroSol-IEF (e.g., ZOOM®, Invitrogen) prior to running 2D gels (3) (see Fig. 2). MicroSol-IEF is a preparative solution-phase IEF apparatus that is divided into chambers by membrane discs of defined pH (15,16). Using MicroSol-IEF, 2.5-3.0 mg plasma proteins can be loaded and efficiently fractionated into five separate chambers by their pI values. 1. Add 2 μL of 99% N,N-dimethylacrylamide (DMA) to the 400-μL sample (see Subheading 3.4, Step 2) for alkylation and incubate the sample on a rotary shaker for 30 min at room temperature (adapted from the manufacturer’s instructions). 2. Add 4 μL of 2 M DTT to quench any excess DMA. Centrifuge at 16,000×g for 20 min at 4°C. 3. Preparation of protein samples: Dilute 3 mg protein to a 3250-μL volume with sample buffer. The amount of diluted sample per chamber in the ZOOM® IEF Fractionator is 650 μL. 4. Assemble the ZOOM® IEF Fractionator according to the manufacturer’s instructions. Six disks (pHs 3.0, 4.6, 5.4, 6.2, 7.0, and 10.0) are used to create five fractions that span pH 3.0–10.0. 5. Add each buffer (anode or cathode) to the corresponding blank chamber. 6. Remove the sample chamber cap and add 650 μL of protein sample (step 3) to each chamber. 7. Fractionation can be carried out under the following conditions: 100 V for 20 min, 200 V for 80 min, and 600 V for 80 min (see Note 10). The starting current is approximately 0.6 mA, which increases to approximately 1.2 mA at the beginning of the 200-V step, and the ending current is approximately 0.2 mA. 8. Load the electro-focused samples onto the narrow pH range IPG strips for 2DE.
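The bookkeeping in steps 3, 4, and 7 can be sketched as follows: six disks bound five chambers, the 3250-μL sample is split evenly among them, and the voltage program adds up to a total run time and volt-hour figure. The names are hypothetical, and the volt-hour sum assumes each voltage is held constant for the stated time.

```python
# Disk pH values (step 4) and voltage program (step 7) from the protocol.
DISK_PHS = [3.0, 4.6, 5.4, 6.2, 7.0, 10.0]
VOLTAGE_STEPS = [(100, 20), (200, 80), (600, 80)]  # (voltage in V, minutes)


def fraction_ranges(disk_phs=DISK_PHS):
    """pH range of each chamber, bounded by two adjacent disks."""
    ordered = sorted(disk_phs)
    return list(zip(ordered, ordered[1:]))


def volume_per_chamber_ul(total_ul: float = 3250.0, disk_phs=DISK_PHS) -> float:
    """Sample volume loaded into each chamber when split evenly."""
    return total_ul / len(fraction_ranges(disk_phs))


def run_summary(steps=VOLTAGE_STEPS):
    """Total run time (min) and volt-hours, assuming constant-voltage holds."""
    minutes = sum(m for _, m in steps)
    volt_hours = sum(v * m / 60 for v, m in steps)
    return minutes, volt_hours


if __name__ == "__main__":
    for low, high in fraction_ranges():
        print(f"chamber pH {low}-{high}: {volume_per_chamber_ul():.0f} uL")
    minutes, vh = run_summary()
    print(f"run: {minutes} min, {vh:.0f} Vh")
```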
Fig. 2. Narrow pH range 2DE images of plasma proteins after depletion of the major six abundant proteins through MARS. After microscale solution IEF (ZOOM® ), the pH 5.5–6.2 fraction was separated on pH 5.5–6.7 IPG strips by second isoelectric focusing and then resolved on a 9–16% gel. (A) Whole 2DE image of pH 3–10 NL and pH 5.5–6.7. (B) One spot on the pH 3–10 NL gel can be separated into two or more spots in the narrow pH range 2DE. (C) Many hidden spots on the pH 3–10 NL gel appear in the narrow pH range 2DE of normal and HCC plasma.
3.7. Fractionation of the Plasma Samples by Free Flow Electrophoresis To identify and isolate biomarker candidates from the plasma of patients with HCC using 2DE, a higher resolution is critical, and the analysis can be done by performing narrow pH range IEF. However, for narrow pH range IEF, higher amounts of proteins (e.g., 10-fold or higher) should be loaded onto the IPG strip since the proteins present in other pH ranges will be discarded. Thus, prefractionation or depletion is required prior to running both IEF and 2D gels. FFE is useful for prefractionation of plasma samples since it gives rise to a specific fraction of interest (e.g., by pI or density). For example, if one knows the pI of certain proteins, fractionation by FFE can be useful for prefractionation of complex plasma. We describe here one of the several procedures for prefractionation of plasma samples using FFE. 1. Dissolve the TCA-precipitated, flow-through fractions of MARS (∼2.0 mg) in 500 μL of separation medium 3 (see below) (adapted from the manufacturer’s instructions). 2. Add traces of the red acidic dye 2-(4-sulfophenylazo)-1,8-dihydroxy-3,6-naphthalenedisulfonic acid (SPADNS, Aldrich) to ease the optical control of the migration of the sample within the separation chamber. 3. FFE is carried out at 10°C using the following media (solutions marked at each inlet are applied): Anodic stabilization medium (Inlet I1), separation medium 1 (Inlet I2), separation medium 2 (Inlet I3−5), separation medium 3 (Inlet I6), cathodic stabilization medium (Inlet I7), and counter-flow medium (Inlet I8). 4. Anodic circuit electrolyte and cathodic circuit electrolyte are applied to the anode and the cathode, respectively. 5. Assemble the ProTeam™ FFE instrument (Tecan). Use a 0.4-mm spacer for the separation chamber, a flow rate of approximately 60 mL/h (Inlet I1−7), and a voltage of 1500 V, which results in a current of 20–24 mA. 6. Perfuse the separation chamber with the sample using the cathodal inlet at approximately 0.7 mL/h (4,17). Residence time in the separation chamber is approximately 33 min. 7. Collect each fraction into polypropylene, 96 deep-well plates, numbered 1 (anode) through 44 (cathode) (4). 8. Remove glycerol and HPMC by TCA/acetone precipitation and dissolve the proteins with sample buffer. 9. Load the electro-focused, narrow pH range samples onto the IPG strips for 2DE.
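For planning an FFE run, the time needed to infuse the whole sample follows directly from the sample volume (500 μL in step 1) and the sample flow rate (about 0.7 mL/h in step 6). The helper below is a hypothetical sketch; it ignores dead volume in the sample tubing and the residence time in the chamber, which add to the total.

```python
def infusion_minutes(sample_ul: float, rate_ml_per_h: float) -> float:
    """Minutes needed to infuse sample_ul at the given sample flow rate."""
    if sample_ul < 0 or rate_ml_per_h <= 0:
        raise ValueError("sample_ul must be >= 0 and rate must be > 0")
    return (sample_ul / 1000.0) / rate_ml_per_h * 60.0


if __name__ == "__main__":
    # 500 uL of sample at roughly 0.7 mL/h, as in steps 1 and 6.
    print(f"{infusion_minutes(500, 0.7):.0f} min to infuse the sample")
```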
3.8. Preparation of 2D Gels 1. Cast the glass plates (separated by two 1.5-mm spacers positioned along the sides) and thin plastic sheets in the multi-casting chamber (20). 2. Prepare gel solution for making 10 gels (20 × 20 cm, 1.5-mm spacer, 9–16% gradient): heavy solution (66.7 mL of 5× Tris-HCl buffer, 75 mL of a 40% acrylamide stock solution, 0.7 mL of 10% ammonium persulfate (APS), 70 μL TEMED, and 191.7 mL of 50% glycerol), light solution (66.7 mL of 5× Tris-HCl buffer, 141.7 mL of a 40% acrylamide stock solution, 0.7 mL of 10% APS, 70 μL TEMED, and 125 mL distilled water). 3. Assemble the gradient maker and peristaltic pump. Pour the light gel solution into the mixing chamber (close to the casting chamber) and the heavy gel solution into the reservoir chamber of the gradient maker. Operate the magnetic stirrer in the mixing chamber. Turn on the peristaltic pump until the gel solution reaches 0.5-1.0 cm below the end of the glass plates (∼5 min). Check the flow rate, which should be between 100 and 120 mL/min. 4. After the gel solution is poured, overlay the gel solution with distilled water to exclude air and to ensure a level surface on the top of the gel. 5. Allow polymerization to occur overnight at room temperature.
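When fewer or more than 10 gels are cast, the recipe in step 2 can be scaled linearly, and the final acrylamide percentage of each solution can be checked from the volumes. The sketch below uses hypothetical names and the volumes exactly as printed; with those volumes the two solutions work out to roughly 9% and 17% acrylamide, assuming additive volumes.

```python
# Volumes in mL for 10 gels, exactly as listed in step 2.
HEAVY = {"tris_5x": 66.7, "acrylamide_40pct": 75.0, "aps_10pct": 0.7,
         "temed": 0.07, "glycerol_50pct": 191.7}
LIGHT = {"tris_5x": 66.7, "acrylamide_40pct": 141.7, "aps_10pct": 0.7,
         "temed": 0.07, "water": 125.0}


def acrylamide_percent(recipe: dict, stock_pct: float = 40.0) -> float:
    """Final acrylamide % (w/v) of a recipe, assuming additive volumes."""
    total_ml = sum(recipe.values())
    return recipe["acrylamide_40pct"] * stock_pct / total_ml


def scale_recipe(recipe: dict, n_gels: int, base_gels: int = 10) -> dict:
    """Scale every component linearly from base_gels to n_gels."""
    if n_gels <= 0:
        raise ValueError("n_gels must be positive")
    factor = n_gels / base_gels
    return {name: volume * factor for name, volume in recipe.items()}


if __name__ == "__main__":
    for label, recipe in (("heavy", HEAVY), ("light", LIGHT)):
        print(f"{label}: {acrylamide_percent(recipe):.1f}% acrylamide, "
              f"{sum(recipe.values()):.1f} mL total")
    print(scale_recipe(HEAVY, 5))
```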
3.9. Equilibration of the Sample and Running of the Gel To solubilize the electro-focused proteins and to allow SDS to bind to them, it is necessary to soak the IPG strips in SDS equilibration buffer. This step is analogous to boiling the sample in SDS buffer prior to SDS-PAGE. The reducing agents, dithiothreitol (DTT) and tributylphosphine (TBP), reduce disulfide bonds to free sulfhydryl groups on cysteine residues. Alkylating agents such as iodoacetamide (IAA) prevent reoxidation of the free sulfhydryl groups (21). 1. Prior to use, add approximately 158 μL TBP in 1 mL isopropanol to 100 mL SDS equilibration buffer and sonicate in a bath-type sonicator until the solution becomes transparent (see Note 11) (termed TBP equilibration buffer). 2. Add 15 mL TBP equilibration buffer to each strip (gel side up) and gently shake for 25 min on an orbital shaker (TBP equilibration) (see Note 12). 3. Briefly rinse the IPG strip with 1× gel buffer, load the IPG strip onto the top of the gel, and pour the agarose embedding solution (molten agarose solution with trace amounts of BPB) (see Note 13). 4. Perform SDS-PAGE (40 mA/gel) until the BPB dye reaches the bottom of the gel. Keep the temperature at 10°C. The total run time for 20 × 20 cm gels is approximately 6 h.
3.10. Coomassie Brilliant Blue G-250 Staining 1. Fix the separated proteins in the gel in 200 mL of fixing solution for 1 h. 2. Decant the fixing solution and stain the gel in Coomassie brilliant blue G-250 overnight. 3. Decant the staining solution. 4. Wash the gel several times (more than three) in distilled water for more than 4 h. 5. Scan the gel, then wrap it in plastic and store it at 4°C.
Protein Profiling by Two-Dimensional Electrophoresis
3.11. 2D Gel Image Analysis 1. Import the gel image (recommended 12–16 bit, tiff format) and convert it into an ImageMaster file (*.mel). 2. Detect the protein spots and determine the volume and percentage volume of each spot. The percentage volume is the normalized value that remains relatively independent of any irrelevant variations between gels, particularly those caused by varying experimental conditions. 3. Select the differentially displayed protein spots (see Fig. 3).
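The normalization in step 2 can be illustrated with a short sketch. It assumes that the percentage volume of a spot is its volume divided by the summed volume of all detected spots on the same gel, multiplied by 100; the function name and the toy volumes are hypothetical and are not part of the ImageMaster software.

```python
def percent_volumes(spot_volumes):
    """Return each spot volume as a percentage of the total volume on one gel."""
    total = sum(spot_volumes)
    if total <= 0:
        raise ValueError("total spot volume must be positive")
    return [100.0 * volume / total for volume in spot_volumes]


# Hypothetical spot volumes: gel_b is gel_a stained twice as strongly.
gel_a = [120.0, 60.0, 20.0]
gel_b = [240.0, 120.0, 40.0]

pct_a = percent_volumes(gel_a)
pct_b = percent_volumes(gel_b)
```

In this toy case the two gels differ only by a global staining factor, so their percentage volumes coincide, which is the sense in which the normalized value is independent of such gel-to-gel variation.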
3.12. Destaining, In-gel Deglycosylation, and In-gel Tryptic Digestion Most plasma proteins are glycosylated, including clotting factors, lipoproteins, and antibodies (22,23). These carbohydrate-containing proteins play major roles in the normal biological functions of plasma. Because the attached glycans prevent glycopeptides from ionizing efficiently during MS analysis, which can lead to inadequate spectral data and low detection sensitivity, the glycans must be removed before protein identification. 1. Pick (or excise) the protein spot with an end-cut yellow tip and transfer the gel piece into a 1.5-mL Eppendorf tube. 2. Wash the gel piece with 100 μL distilled water. 3. Add 50 μL of 50 mM NH4HCO3 (pH 7.8) and ACN (6:4), and shake for 10 min. 4. Repeat step 3 until the Coomassie blue G-250 dye disappears (2 to 5 times). 5. Decant the supernatant and dry the gel piece in a Speed Vac for 10 min (see Note 14). 6. Add 5 μL trypsin (12.5 ng/μL in 50 mM NH4HCO3) and leave the gel piece on ice for 45 min. 7. Add 10 μL of 50 mM NH4HCO3 to the gel piece. 8. Incubate the gel piece at 37°C for 12 h.
3.13. Desalting of Peptides and MALDI Plating 1. Resin packing: Twist the column body (GELoader tip, Eppendorf) near the end of the tip and push the resin solution [Poros R2:Oligo R3 (2:1) in 70% (v/v) ACN, occasionally in a more efficient ratio of 1:1] with a 1-mL syringe. A packed resin length of 2-3 mm is suitable (18,19). 2. Equilibration of the column: Add 20 μL of 2% (v/v) formic acid and push the solution through the column with the 1-mL syringe. 3. Peptide binding: Add the peptide solution (supernatant of step 9 in Subheading 3.12, approximately 10-12 μL) and push this solution through the column with the syringe. 4. Washing: Add 20 μL of 2% (v/v) formic acid and push this solution through the column with the syringe.
Cho et al.
Fig. 3. Detection of PTMs on the 2DE of plasma proteins. (A) 2DE images of plasma proteins that were depleted of the major six abundant proteins through MARS, untreated (left) and alkaline phosphatase (AP)-treated (AP) (right). (B) One of the differentially displayed proteins after treatment with AP. (C)
Although the sum runs over all the cells in the grid, only the neighboring cells are influenced by the presence of a signal, depending on the σ parameter. The procedure consists of turning each digitalized image into a virtual map containing, in each cell, the sum of the influence of all the spots of the original 2D-PAGE; these virtual maps can be called fuzzy matrices or fuzzy maps. Due to the existence of complex spots of irregular shape in real maps, the Gaussian function is associated with each cell instead of with each spot. Figure 8 represents an example of fuzzyfication of a map at different σ values; the example shows the digitalized and defuzzyfied maps and the fuzzyfication of the map for five increasing σ values.
4.1.4. Application of Multivariate Tools to Fuzzy Maps
The final fuzzy maps can then be analyzed by several multivariate tools for diagnostic/prognostic purposes. Two approaches will be presented here: (1) the coupling of PCA and classification tools; (2) the use of multi-dimensional scaling (MDS) techniques.
Marengo et al.
[Figure 8: panels (A)–(G), each drawn on pixel axes running from 20 to 200; panels (C)–(G) correspond to σ = 0.50, 1.00, 1.50, 2.00, and 2.50.]
Fig. 8. Sample ILL1 from (61): digitalized image (A); defuzzyfied image (B); fuzzyfication at five σ values (C–G).
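The effect of σ shown in Fig. 8 can be mimicked with a small sketch. It assumes an isotropic, unnormalized Gaussian centred on every cell, so that each cell of the fuzzy map receives the Gaussian-weighted sum of the signal of all cells; the exact functional form used in (61) may differ, and the function name and the toy map are hypothetical.

```python
import math


def fuzzify(image, sigma):
    """Spread the signal of every cell over the grid with an isotropic Gaussian."""
    n_rows, n_cols = len(image), len(image[0])
    fuzzy = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            total = 0.0
            for i2 in range(n_rows):
                for j2 in range(n_cols):
                    signal = image[i2][j2]
                    if signal == 0.0:
                        continue  # empty cells contribute nothing
                    dist2 = (i - i2) ** 2 + (j - j2) ** 2
                    total += signal * math.exp(-dist2 / (2.0 * sigma ** 2))
            fuzzy[i][j] = total
    return fuzzy


# Toy 5 x 5 map with a single "spot" in the centre (hypothetical data).
toy = [[0.0] * 5 for _ in range(5)]
toy[2][2] = 1.0

narrow = fuzzify(toy, sigma=0.5)  # signal stays close to the spot
wide = fuzzify(toy, sigma=2.0)    # signal leaks into neighbouring cells
```

With the small σ the neighbouring cells receive little signal, while the larger σ spreads the spot more widely, which is the qualitative trend of panels (C) to (G).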
2D-PAGE Maps Analysis
4.1.4.1. PCA and Classification Methods (61)
Marengo et al. (61) have reported an application of PCA and LDA to fuzzy maps for a set of eight 2D maps belonging to control and mantle cell lymphoma samples. Principal Component Analysis can be applied to images after unwrapping each image; each sample (map) is turned into a series of variables describing the signal in each position of the map. In this case, 200 × 200 pixel images were taken into consideration, providing a final set of 40,000 variables for each map. PCA is particularly useful here to detect a small number of components that account for the differences between the groups of samples while also reducing the dimensionality. The significant PCs calculated were used to build an LDA model to classify the samples; the variables for the LDA model, which discriminates between the classes present in the dataset, were selected by a forward stepwise algorithm ($F_{\text{to-enter}} = 4.0$). The procedure was repeated for different values of the σ parameter in order to find the σ value providing correct classification of the samples with the smallest number of components in the final LDA model. The best results (100% correct assignments) were obtained for σ values ranging from 1.75 to 2.25, with PC1 and PC4 in the final LDA model. The differences between the two groups of samples could then be investigated by analyzing the loadings on the first and the fourth PCs. Figure 9 shows the score plot and the loading plot of PC1 and PC4 for σ = 2.00. The loadings are again represented on a virtual map on a color scale: white tones correspond to the zones in the map characterized by large positive loadings and black tones to the zones characterized by large negative loadings on the corresponding PC.
4.1.4.2. Multi-Dimensional Scaling
In other applications of multivariate tools to fuzzy maps, Marengo et al. (62,63) describe the use of MDS procedures. MDS performs a substantial dimensionality reduction and provides an effective graphical representation of the data on the basis of the similarity calculated between pairs of objects. MDS searches for the smallest number of dimensions in which the objects can be represented as points, matching, as closely as possible, the distances between the objects in the new reference system with those calculated in the original reference system. In these applications, the calculations were performed by the Kruskal iterative method; the search for the coordinates was based on the steepest descent minimization algorithm, where the target function is the so-called stress (S), which is a measure of the ability of the configuration of points to simulate the original distance matrix.
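The role of the stress as a target function can be made concrete with a sketch. It uses a simplified, metric form of Kruskal's stress-1 in which the original distances stand in for the disparities of the non-metric procedure; this simplification, the function names, and the toy coordinates are assumptions for illustration and are not taken from (62,63).

```python
import math


def pairwise_distances(points):
    """Euclidean distance matrix for a list of coordinate tuples."""
    n = len(points)
    return [[math.dist(points[a], points[b]) for b in range(n)] for a in range(n)]


def stress(config, target):
    """Stress-1 style mismatch between configuration and target distances."""
    d = pairwise_distances(config)
    numerator = 0.0
    denominator = 0.0
    for a in range(len(config)):
        for b in range(a + 1, len(config)):
            numerator += (d[a][b] - target[a][b]) ** 2
            denominator += d[a][b] ** 2
    return math.sqrt(numerator / denominator)


# Three hypothetical objects whose original distances are 3, 4 and 5.
original = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
target = pairwise_distances(original)

stress_2d = stress(original, target)                  # exact representation
stress_1d = stress([(0.0,), (3.0,), (-4.0,)], target)  # forced onto a line
```

A two-dimensional configuration reproduces the distances exactly and gives zero stress, whereas squeezing the same objects onto a line leaves a residual mismatch; an iterative method would move the points to reduce this value.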
[Figure 9: (A) score plot of PC4 versus PC1 at σ = 2.00, showing samples HEA1–HEA4 and ILL1–ILL4; (B) loading maps for PC1 and PC4, drawn on pixel axes running from 20 to 200.]
Fig. 9. Score plot (A) and loading plots (B) of PC1 and PC4 with σ = 2.00.
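The scores and loadings of Fig. 9 come from a PCA on unwrapped maps. The sketch below shows only the unwrapping step that precedes PCA, turning each 200 × 200 map into one row of 40,000 variables; the column mean-centring is the usual PCA pre-treatment and is an assumption here, and the function names and synthetic maps are hypothetical.

```python
def unwrap(image):
    """Concatenate the rows of a 2D map into one long vector of variables."""
    return [value for row in image for value in row]


def build_data_matrix(maps):
    """Stack unwrapped maps as rows and mean-centre every column (variable)."""
    rows = [unwrap(m) for m in maps]
    n_vars = len(rows[0])
    if any(len(r) != n_vars for r in rows):
        raise ValueError("all maps must have the same number of pixels")
    means = [sum(r[v] for r in rows) / len(rows) for v in range(n_vars)]
    return [[r[v] - means[v] for v in range(n_vars)] for r in rows]


# Eight synthetic 200 x 200 maps (placeholder values, not real gel data).
maps = [
    [[float((r + c + s) % 7) for c in range(200)] for r in range(200)]
    for s in range(8)
]
data_matrix = build_data_matrix(maps)  # 8 samples x 40,000 variables
```

The resulting matrix has far more variables than samples, which is why PCA is used for dimensionality reduction before the LDA step.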
As for the previous applications based on PCA and LDA, several values of the σ parameter have been investigated, and the one providing the best classification was selected. In this case, for each value of the σ parameter, a similarity matrix has to be built. From the match between the two fuzzy maps k and l, the common signal $SC_{kl}$ (the sum of all signals present in both maps) and the total signal $ST_{kl}$ can be computed:
$$SC_{kl} = \sum_{i=1}^{n} \min\left(S_{ik}, S_{il}\right)$$
$$ST_{kl} = \sum_{i=1}^{n} \max\left(S_{ik}, S_{il}\right)$$
where n is the number of cells in the grid. The similarity index is then computed by:
$$S_{kl} = \frac{SC_{kl}}{ST_{kl}}$$
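The common signal, total signal, and similarity index defined above translate directly into code. This is a minimal sketch for two maps stored as nested lists of non-negative values; the function name and the toy maps are hypothetical.

```python
def similarity_index(map_k, map_l):
    """Ratio of common signal (cell-wise min) to total signal (cell-wise max)."""
    common = 0.0
    total = 0.0
    for row_k, row_l in zip(map_k, map_l):
        for s_k, s_l in zip(row_k, row_l):
            common += min(s_k, s_l)
            total += max(s_k, s_l)
    if total == 0.0:
        raise ValueError("both maps are empty; the index is undefined")
    return common / total


# Toy 2 x 2 fuzzy maps (hypothetical values).
map_a = [[1.0, 0.0], [2.0, 1.0]]
map_b = [[0.0, 1.0], [2.0, 3.0]]

s_ab = similarity_index(map_a, map_b)  # common = 3, total = 7
s_aa = similarity_index(map_a, map_a)  # identical maps
```

Computing the index for every pair of maps fills the similarity matrix that is needed for each value of σ.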
$S_{kl}$ ranges from 0 (two maps showing no common structure) to 1 (two identical maps). In both applications, the optimal σ values that provide the best classification of the samples with only one or two dimensions could be identified.
4.2. Moment Functions
Moment functions have been widely used in image analysis, in applications related to invariant pattern recognition, object classification, pose estimation, image coding, and reconstruction (65,66,67,68,69). A set of moments computed from a digital image generally represents global characteristics of the image shape and provides considerable information about different types of geometrical features of the image. Geometric moments were the first to be applied to images, as they are computationally very simple. With the progress of research in image processing, many new types of moment functions have been introduced, such as orthogonal moments, rotational moments, and complex moments. These are useful tools in the field of pattern recognition and can be used to describe features of objects such as shape, area, border, location, and orientation; naturally, each moment function has its own advantages in specific applications. The most important and most widely used moments are orthogonal moments (e.g., Legendre (70,71,72) and Zernike moments (73,74,75)), which can attain a zero value of the redundancy measure in a set of moment functions, so that these orthogonal moments correspond to independent characteristics of the image. In other words, moments with orthogonal basis functions can be used to represent the image by a set of mutually independent descriptors, with a minimum amount of information redundancy. In addition, orthogonal moments are more robust than non-orthogonal ones in the presence of image noise. Orthogonal moments also permit analytical reconstruction of an image intensity function from a finite set of moments, using the inverse moment transform.
Legendre moments are the most widely used orthogonal moments and can be implemented as feature descriptors for the classification of 2D-PAGE maps. The main advantage of using Legendre moments to cluster the maps is that invariance to translation, scale, and rotation can be obtained; in other words, the original maps can be used for classification without any pre-treatment, and the use of complex commercial software can be avoided entirely.
The number of calculated moments is very large, and many of them do not contain information related to the specific target of correctly classifying the 2D-PAGE maps; for this reason a method for selecting the moments having the highest DP must be applied (e.g., LDA).
4.2.1. Legendre Moments
The Legendre polynomials form a complete orthogonal set on the interval [–1,1]. Moments with Legendre polynomials as kernel functions were first introduced by Teague (68). The kernels of Legendre moments are products of Legendre polynomials defined along the rectangular image coordinate axes inside the square [–1,1] × [–1,1]. The two-dimensional Legendre moments of order (p + q) of an image intensity map f(x, y) are defined as:
$$L_{pq} = \frac{(2p+1)(2q+1)}{4} \int_{-1}^{1}\int_{-1}^{1} P_p(x)\,P_q(y)\,f(x,y)\,dx\,dy; \qquad x, y \in [-1,1]$$
where the Legendre polynomial $P_p(x)$ of order p is given by:
$$P_p(x) = \sum_{\substack{k=0 \\ p-k\ \text{even}}}^{p} (-1)^{\frac{p-k}{2}}\,\frac{1}{2^p}\,\frac{(p+k)!\,x^k}{\left(\frac{p-k}{2}\right)!\,\left(\frac{p+k}{2}\right)!\,k!}$$
The recurrence relation of Legendre polynomials, $P_p(x)$, is:
$$P_p(x) = \frac{(2p-1)\,x\,P_{p-1}(x) - (p-1)\,P_{p-2}(x)}{p},$$
where $P_0(x) = 1$, $P_1(x) = x$, and p > 1. Since the region of definition of Legendre polynomials is the interval [–1,1], a square image of N × N pixels with intensity function f(i, j), 0 ≤ i, j ≤ (N – 1), is scaled to the region –1 ≤ x, y ≤ 1. Legendre moments can be expressed in discrete form as:
$$L_{pq} = \lambda_{pq} \sum_{i=0}^{N-1}\sum_{j=0}^{N-1} P_p(x_i)\,P_q(y_j)\,f(i,j),$$
where the normalizing constant is
$$\lambda_{pq} = \frac{(2p+1)(2q+1)}{N^2}$$
and $x_i$ and $y_j$ denote the normalized pixel coordinates in the range [–1,1]:
$$x_i = \frac{2i}{N-1} - 1 \quad \text{and} \quad y_j = \frac{2j}{N-1} - 1$$
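The explicit sum and the recurrence relation for the Legendre polynomials can be cross-checked numerically. The sketch below assumes the standard explicit form, with (p + k)! in the numerator and the sum restricted to terms with p – k even; the function names are hypothetical.

```python
from math import factorial


def legendre_explicit(p, x):
    """P_p(x) from the explicit sum over k with p - k even."""
    total = 0.0
    for k in range(p + 1):
        if (p - k) % 2:
            continue  # only terms with p - k even contribute
        half_diff = (p - k) // 2
        half_sum = (p + k) // 2
        sign = -1.0 if half_diff % 2 else 1.0
        coeff = factorial(p + k) / (
            2 ** p * factorial(half_diff) * factorial(half_sum) * factorial(k)
        )
        total += sign * coeff * x ** k
    return total


def legendre_recurrence(p, x):
    """P_p(x) from the three-term recurrence with P_0 = 1 and P_1 = x."""
    if p == 0:
        return 1.0
    prev, curr = 1.0, x
    for order in range(2, p + 1):
        prev, curr = curr, ((2 * order - 1) * x * curr - (order - 1) * prev) / order
    return curr


# Compare both routes for orders 0..8 at a few points in [-1, 1].
max_gap = max(
    abs(legendre_explicit(p, x) - legendre_recurrence(p, x))
    for p in range(9)
    for x in (-1.0, -0.3, 0.0, 0.5, 1.0)
)
```

For image work the recurrence is usually the more practical route, since it avoids the large factorials that appear in the explicit sum at high orders.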
The reconstruction of the image function from the calculated moments can be performed by the following inverse transformation:
$$f(i,j) = \sum_{p=0}^{p_{\max}}\sum_{q=0}^{q_{\max}} L_{pq}\,P_p(x_i)\,P_q(y_j)$$
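The discrete moments and the inverse transformation can be sketched together. The code assumes the moments enter the reconstruction as the coefficients of the products of Legendre polynomials, with the normalizing constant (2p + 1)(2q + 1)/N² and the pixel coordinates mapped to [–1,1]; the function names and the toy image are hypothetical, and on a discrete grid the reconstruction is only approximate.

```python
def legendre(p, x):
    """P_p(x) from the three-term recurrence with P_0 = 1 and P_1 = x."""
    if p == 0:
        return 1.0
    prev, curr = 1.0, x
    for order in range(2, p + 1):
        prev, curr = curr, ((2 * order - 1) * x * curr - (order - 1) * prev) / order
    return curr


def coords(n):
    """Map pixel indices 0..n-1 onto [-1, 1]."""
    if n < 2:
        raise ValueError("the image must be at least 2 x 2 pixels")
    return [2.0 * i / (n - 1) - 1.0 for i in range(n)]


def legendre_moments(image, max_order):
    """Discrete Legendre moments L[p][q] for 0 <= p, q <= max_order."""
    n = len(image)
    xs = coords(n)
    basis = [[legendre(p, x) for x in xs] for p in range(max_order + 1)]
    moments = [[0.0] * (max_order + 1) for _ in range(max_order + 1)]
    for p in range(max_order + 1):
        for q in range(max_order + 1):
            lam = (2 * p + 1) * (2 * q + 1) / n ** 2
            acc = sum(
                basis[p][i] * basis[q][j] * image[i][j]
                for i in range(n)
                for j in range(n)
            )
            moments[p][q] = lam * acc
    return moments


def reconstruct(moments, n):
    """Approximate the n x n image from a finite set of moments."""
    max_order = len(moments) - 1
    xs = coords(n)
    basis = [[legendre(p, x) for x in xs] for p in range(max_order + 1)]
    return [
        [
            sum(
                moments[p][q] * basis[p][i] * basis[q][j]
                for p in range(max_order + 1)
                for q in range(max_order + 1)
            )
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy 8 x 8 image of uniform intensity 0.5 (hypothetical data).
toy = [[0.5] * 8 for _ in range(8)]
moments = legendre_moments(toy, max_order=2)
flat = reconstruct([[moments[0][0]]], 8)  # zeroth-order reconstruction
```

For a uniform image the zeroth moment equals the intensity and already reproduces the image. Letting p and q each run from 0 to 100 gives 101 × 101 = 10,201 moments per map.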
Marengo et al. (76) report an interesting application of Legendre moments to a set of 2D-PAGE maps belonging to two different cell lines of control (untreated) and drug-treated pancreatic ductal carcinoma cells. The aim of the work was to obtain the correct classification of the 18 samples using the Legendre moments as discriminant variables. Each 2D-PAGE map, which was automatically digitalized, was described by a 200 × 200 matrix of pixels; the value of each pixel varies from 0 to 1 to indicate the staining intensity in the given position. The Legendre moments of the 18 digitalized images were calculated up to a maximum order of 100. Each matrix held the global information of the corresponding 2D-PAGE map. The final dataset contained 18 samples and 10,201 variables. The number of variables was very large, and many of them were either redundant or did not contain information related to the specific target of correctly classifying the samples; for this reason, a method for selecting the variables with the highest power of discrimination was applied (forward stepwise LDA with $F_{\text{to-enter}} = 4.0$). The stepwise LDA procedure showed that only six Legendre moments were necessary to correctly classify the 18 samples. The results demonstrate that Legendre moments can be successfully applied for fast classification and similarity analysis of 2D-PAGE maps.
4.3. Other Methods
Schultz et al. (78), in addition to applying PCA and PLS to spot volume data, applied PCA to the analysis of gel images after digitalization and unwrapping. The choice of the alignment procedure for the sets of gels proved to be the determinant of the final result. PCA proved to be effective in identifying the groups of maps present. Marengo et al. (77) also applied three-way PCA to the identification of the differences among groups of 2D maps.
Proteomic datasets are well suited to three-way methods because of their three-way structure: the first dimension is the pH gradient, the second the molecular mass, and the third the samples. In three-way PCA, the observed modes (conventionally called I, J, and K) can be synthesized in more fundamental modes, each element of a reduced mode expressing a particular structure existing between all or a part
of the elements of the associated observation mode. The final result is given by three sets of loadings together with a core array describing the relationship among them. Each of the three sets of loadings can be displayed and interpreted in the same way as a score plot of standard PCA. Three-way PCA was preceded by a data transformation to scale all the samples and make them comparable; for this purpose, maximum scaling was selected, and the digitalized 2D-PAGE maps were scaled one at a time to the maximum value of each map. This method was successfully applied to datasets of human lymph nodes and rat sera, allowing the identification of the main differences among the sets of 2D maps.
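The maximum scaling used before three-way PCA can be sketched in a few lines. The code assumes that each map is divided by its own largest pixel value, so that every scaled map peaks at 1; the function name and the toy map are hypothetical.

```python
def max_scale(image):
    """Divide every pixel of one map by the largest pixel value of that map."""
    peak = max(max(row) for row in image)
    if peak <= 0:
        raise ValueError("the map must contain a positive maximum")
    return [[value / peak for value in row] for row in image]


# Toy 2 x 2 map (hypothetical intensities).
raw_map = [[0.0, 2.0], [4.0, 8.0]]
scaled_map = max_scale(raw_map)
```

Because the scaling is applied one map at a time, maps acquired with different overall staining intensities are brought onto a common 0 to 1 range before they are compared.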
References 1. Mahon, P., Dupree, P., (2001) Quantitative and reproducible two-dimensional gel analysis using Phoretix 2D Full, Electrophoresis 22, 2075–2085 2. Rubinfeld, A., Keren-Lehrer, T., Hadas, G., Smilansky, Z., (2003) Hierarchical analysis of large-scale two-dimensional gel-electrophoresis experiments, Proteomics 3, 1930–1935 3. Anderson, N.L., Taylor, J., Scandora, A.E., Coulter, B.P., Anderson, N.G., (1981) The TYCHO system for computer analysis of two-dimensional gel electrophoresis patterns, Clinical Chemistry 27 (11), 1807–1820 4. Rosengren, A.T., Salmi, J.M., Aittokallio, T., Westerholm, J., Lahesmaa, R., Nyman, T.A., Nevalainen, O.S., (2003) Comparison of PDQuest and Progenesis software packages in the analysis of two dimensional electrophoresis gels, Proteomics 3, 1936–1946 5. Raman, B., Cheung, A., Marten, M.R., (2002) Quantitative comparison and evaluation of two commercially available, two-dimensional electrophoresis image analysis software packages, Z3 and Melanie, Electrophoresis 23, 2194–2202 6. Panek, J., Vohradsky, J., (1999) Point pattern matching in the analysis of twodimensional gel electropherograms, Electrophoresis 20, 3483–3491 7. Pleissner, K.P., Hoffman, F., Kriegel, K., Wenk, C., Wegner, S., Sahistrom, A., Oswald, H., Alt, H., Fleck, E., (1999) New algorithmic approaches to protein spot detection and pattern matching in two-dimensional electrophoresis gel databases, Electrophoresis 20, 755–765 8. Voss, T., Haberl, P., (2000) Observations on the reproducibility and matching efficiency of two-dimensional electrophoresis gels: consequences for comprehensive data analysis, Electrophoresis 21, 3345–3350 9. Cutler, P., Heald, G., White, I.R., Ruan, J., (2003) A novel approach to spot detection for two-dimensional gel electrophoresis images using pixel value collection, Proteomics 3, 392–401 10. 
Molloy, M.P., Brzezinski, E.E., Hang, J., McDowell, M.T., VanBogelen, R.A., (2003) Overcoming technical variation and biological variation in quantitative proteomics, Proteomics 3, 1912–1919
11. Moritz, B., Meyer, H.E., (2003) Approaches for the quantification of protein concentration ratios, Proteomics 3, 2208–2220 12. Wheelock, A.M., Buckpitt, A.R., (2005) Software-induced variance in two-dimensional gel electrophoresis image analysis, Electrophoresis 26, 4508–4520 13. Almeida, J.S., Stanislaus, R., Krug, E., Arthur, J.M., (2005) Normalisation and analysis of residual variation in two-dimensional gel electrophoresis for quantitative differential proteomics, Proteomics 5, 1242–1249 14. Pietrogrande, M.C., Marchetti, N., Dondi, F., Righetti, P.G., (2003) Spot overlapping in two-dimensional polyacrylamide gel electrophoresis maps: relevance to proteomics, Electrophoresis 24, 217–224 15. Pietrogrande, M.C., Marchetti, N., Dondi, F., Righetti, P.G., (2002) Spot overlapping in two-dimensional polyacrylamide gel electrophoresis separations: a statistical study of complex protein maps, Electrophoresis 23, 283–291 16. Campostrini, N., Areces, L.B., Rappsilber, J., Pietrogrande, M.C., Dondi, F., Pastorino, F., Ponzoni, M., Righetti, P.G., (2005) Spot overlapping in two-dimensional maps: a serious problem ignored for much too long, Proteomics 5, 2385–2395 17. Garrels, J.I., (1979) Two dimensional gel electrophoresis and computer analysis of proteins synthesized by clonal cell lines, J. Biol. Chem. 254, 7961–7977 18. Garrels, J.I., Farrar, J.T., Burwell IV, C.B., (1984) In: Celis, J.E., Bravo, R. (Eds.), Two-dimensional Gel Electrophoresis of Proteins, Academic Press, Orlando, FL, USA, pp. 38–91 19. Garrels, J.I., (1989) The QUEST system for quantitative analysis of two-dimensional gels, J. Biol. Chem. 264, 5269–5282 20. Massart, D.L., Vandeginste, B.G.M., Deming, S.M., Michotte, Y., Kaufman, L., (1988) Chemometrics: A Textbook. Amsterdam, Elsevier 21. Vandeginste, B.G.M., Massart, D.L., Buydens, L.M.C., De Jong, S., Lewi, P.J., Smeyers-Verbeke, J., (1998) Handbook of Chemometrics and Qualimetrics: Part B. Amsterdam, Elsevier 22. 
Marengo, E., Robotti, E., Righetti, P.G., Campostrini, N., Pascali, J., Ponzoni, M., (2004) Study of Proteomic changes associated with healthy and tumoral murine samples in Neuroblastoma by Principal Component Analysis and classification methods, Clinica Chimica Acta 345, 55–67 23. Marengo, E., Robotti, E., Bobba, M., Liparota, M.C., Antonucci, F., Rustichelli, C., Zamò, A., Chilosi, M., Hamdan, M., Righetti, P.G., (2006) Characterisation of the proteomic profiles of two human lymphoma cell lines by two-dimensional gel-electrophoresis and multivariate statistical tools, Electrophoresis 27, 484–494 24. Massart, D.L., Kaufman, L., (1983) In: Elving, P.J., Winefordner, J.D. (Eds.), The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis. Wiley, New York, USA 25. Eisenbeis, R.A. (Ed.), (1972) Discriminant Analysis and Classification Procedures: Theory and Applications. Lexington, USA
26. Klecka, W.R. (Ed.), (1980) Discriminant Analysis. Sage Publications, Beverly Hills, USA 27. Wold, S., (1976) Pattern recognition by means of disjoint principal components models, Pattern Recognition 8, 127–139 28. Martens, H., Naes, T., (1989) Multivariate Calibration, Wiley, London 29. Kleinbaum, D., Kupper, L., Muller, K., (1988) Applied Regression Analysis and Other Multivariate Methods, 2nd ed.. Pws-Kent, Boston 30. De Noord, O.E., (1994) Multivariate calibration standardization, Chemometr. Intell. Lab. Syst. 25, 85–97 31. Anderson, N.L., Hofmann, J.P., Gemmell, A., Taylor, J., (1984) Global approaches to quantitative analysis of gene-expression patterns observed by use of twodimensional gel electrophoresis, Clin Chem. 30, 2031–2036 32. Tarroux, P., Vincens, P., Rabilloud, T., (1987) HERMeS: a second generation approach to the automatic analysis of two-dimensional electrophoresis gels. Part V: Data analysis, Electrophoresis 8, 187–199 33. Couto, M.M.B., Vogels, J.T.W.E., Hofstra, H., Husiintveld, J.H.J., Vandervossen, J.M.B.M., (1995) Random amplified polymorphic DNA and restriction enzyme analysis of PCR amplified RDNA in taxonomy, 2 Identification techniques for food-borne yeasts, J. Applied Bacteriology 79 (5), 525–535 34. Johansson, M.L., Quednau, M., Ahrne, S., Molin, G., (1995) Classification of lactobacillus-plantarum by restriction-endonuclease analysis of total chromosomal DNA using conventional agarose-gel electrophoresis, International J. of Systematic Bacteriology 45 (4), 670–675 35. Boon, N., De Windt, W., Verstraete, W., Top, E.M., (2002) Evaluation of nested PCR-DGGE (denaturing gradient gel electrophoresis) with group-specific 16S rRNA primers for the analysis of bacterial communities from different wastewater treatment plants, FEMS Microbiology Ecology 39 (2), 101–112 36. 
Gadea, I., Ayala, G., Diago, M.T., Cunat, A., Garcia de Lomas J., (2000) Immunological diagnosis of human hydatid cyst relapse: utility of the enzyme-linked immunoelectrotransfer blot and discriminant analysis, Clinical and Diagnostic Laboratory Immunology 7 (4), 549–552 37. Gadea, I., Ayala, G., Diago, M.T., Cunat, A., Garcia de Lomas, J., (1999) Immunological diagnosis of human cystic echinococcosis: utility of discriminant analysis applied to the enzyme-linked mmunoelectrotransfer blot, Clinical and Diagnostic Laboratory Immunology 6 (4), 504–508 38. Kovarova, H., Hajduch, M., Korinkova, G., Halada, P., Krupickova, S., Gouldsworthy, A., Zhelev, N., Strnad, M., (2000) Proteomics approach in classifying the biochemical basis of the anticancer activity of the new olomoucinederived synthetic cyclin-dependent kinase inhibitor, bohemine, Electrophoresis 21, 3757–3764 39. Kovarova, H., Radzioch, D., Hajduch, M., Sirova, M., Blaha, V., Macela, A., Stulik, J., Hernychova, L., (1998) Natural resistance to intracellular parasites: a study by two-dimensional gel electrophoresis coupled with multivariate analysis, Electrophoresis 19 (8–9), 1325–1331
40. De Moor, B., Marchal, K., Mathys, J., Moreau, Y., (2003) Bioinformatics: organisms from Venus, technology from Jupiter, algorithms from Mars, European Journal of Control 9 (2–3), 237–278 41. Iwadate, Y., Sakaida, T., Hiwasa, T., Nagai, Y., Ishikura, H., Takiguchi, M., Yamaura, A., (2004) Molecular classification and survival prediction in human gliomas based on proteome analysis, Cancer Research 64 (7), 2496–2501 42. Amin, R.A., Vickers, A.E., Sistare, F., Thompson, K.L., Roman, R.J., Lawton, M., Kramer, J., Hamadeh, H.K., Collins, J., Grissom, S., Bennett, L., Tucker, C.J., Wild, S., Kind, C., Oreffo, V., Davis, J.W., Curtiss, S., Naciff, J.M., Cunningham, M., Tennant, R., Stevens, J., Car, B., Bertram, T.A., Afsharil, C.A., (2004) Identification of putative gene-based markers of renal toxicity, Environmental Health Perspectives 112 (4), 465–479 43. Heijne, W.H.M., Stierum, R.H., Slijper, M., van Bladeren, P.J., van Ommen, B., (2003) Toxicogenomics of bromobenzene hepatotoxicity: a combined transcriptomics and proteomics approach, Biochemical Pharmacology 65 (5), 857–875 44. Anderson, N.L., EsquerBlasco, R., Richardson, F., Foxworthy, P., Eacho, P., (1996) The effects of peroxisome proliferators on protein abundances in mouse liver, Toxicology and Applied Pharmacology 137 (1), 75–89 45. Perrot, F., Hebraud, M., Charlionet, R., Junter, G.A., Jouenne, T., (2001) Cell immobilisation induces changes in the protein response of Escherichia coli K-12 to a cold shock, Electrophoresis 22, 2110–2119 46. Verhoeckx, K.C.M., Bijlsma, S., de Groene, E.M., Witkamp, R.F., van der Greef, J., Rodenburg, R.J.T., (2004) A combination of proteomics, principal component analysis and transcriptomics is a powerful tool for the identification of biomarkers for macrophage maturation in the U937 cell line, Proteomics 4 (4), 1014–1028 47. 
Verhoeckx, K.C.M., Bijlsma, S., Jespersen, S., Ramaker, R., Verheij, E.R., Witkamp, R.F., van der Greef, J., Rodenburg, R.J.T., (2004) Characterization of anti-inflammatory compounds using transcriptomics, proteomics, and metabolomics in combination with multivariate data analysis, International Immunopharmacology 4 (12), 1499–1514 48. Marengo, E., Robotti, E., Cecconi, D., Scarpa, A., Righetti, P.G., (2004) Identification of the regulatory proteins in human pancreatic cancers treated with Trichostatin-A by 2D-PAGE maps and Multivariate Statistical Analysis, Analytical and Bioanalytical Chemistry 379 (7–8), 992–1003 49. Fujii, K., Kondo, T., Yokoo, H., Yamada, T., Matsuno, Y., Iwatsuki, K., Hirohashi, S., (2005) Protein expression pattern distinguishes different lymphoid neoplasms, Proteomics 5, 4274–4286 50. Dewettinck, K., Dierckx, S., Eichwalder, P., Huyghebaert, A., (1997) Comparison of SDS-PAGE profiles of four Belgian cheeses by multivariate statistics, Lait 77 (1), 77–89 51. Alika, J.E., AkenOva, M.E., Fatokun, C.A., (1995) Variation among maize (Zea mays L) accessions of Bendel State, Nigeria – numerical analysis of zein protein band patterns, Genetic Resources and Crop Evolution 42 (4), 393–399
52. Magdic, D., Horvat, D., Jurkovic, Z., Sudar, R., Kurtanjek, K., (2002) Chemometric analysis of high molecular mass glutenin subunits and image data of bread crumb structure from Croatian wheat cultivars, Food Technology and Biotechnology 40 (4), 331–341 53. Jessen, F., Lametsch, R., Bendixen, E., Kjaersgard, I.V.H., Jorgensen, B.M., (2002) Extracting information from two-dimensional electrophoresis gels by partial least squares regression, Proteomics 2, 32–35 54. Kleno, T.G., Leonardsen, L.R., Kjeldal, H.O., Laursen, S.M., Jensen, O.N., Baunsgaard, D., (2004) Mechanisms of hydrazine toxicity in rat liver investigated by proteomics and multivariate data analysis, Proteomics 4, 868–880 55. Kjaersgard, I.V.H., Norrelykke, M.R., Jessen, F., (2006) Changes in cod muscle proteins during frozen storage revealed by proteome analysis and multivariate data analysis, Proteomics 6, 1606–1618 56. Gottfries, J., Sjogren, M., Holmberg, B., Rosengren, L., Davidsson, P., Blennow, K., (2004) Proteomics for drug target discovery, Chemometrics and Intelligent Laboratory Systems 73, 47–53 57. Karp, N.A., Griffin, J.L., Lilley, K.S., (2005) Application of partial least squares discriminant analysis to two-dimensional difference gel studies in expression proteomics, Proteomics 5, 81–90 58. Norden, B., Broberg, P., Lindberg, C., Plymoth A., (2005) Analysis and understanding of high-dimensionality data by means of multivariate data analysis, Chemistry and Biodiversity 2 (11), 1487–1494 59. Malone, J., McGarry, K., Bowermann, C., (2006) Automated trend analysis of proteomics data using an intelligent data mining architecture, Expert Systems with Applications 30, 24–33 60. Marengo, E., Robotti, E., Gianotti, V., Righetti P.G., (2003) A new approach to the statistical treatment of 2D-Pages in proteomics using fuzzy logic, Annali di Chimica 93 (1–2), 105–116 61. 
Marengo, E., Robotti, E., Righetti, P.G., Antonucci, F., (2003) A new approach based on fuzzy logic and principal component analysis for the classification of 2Dmaps in health and disease: application to lymphomas, Journal of Chromatography A 1004, 13–28 62. Marengo, E., Robotti, E., Gianotti, V., Righetti, P.G., Domenici, E., Cecconi, D., (2003) A new integrated statistical approach to the diagnostic use of proteomic two-dimensional maps, Electrophoresis 24, 225–236 63. Marengo, E., Robotti, E., Cecconi, D., Scarpa, A., Righetti, P.G., (2004) Application of fuzzy logic principles to the classification of 2D-PAGE maps belonging to human pancreatic cancers treated with Trichostatin-A, Proceedings of 2004 IEEE International Conference on Fuzzy Systems, Budapest, Hungary, 25–29 July 2004, 1, 359–364 64. Marengo, E., Robotti, E., Antonucci, F., Cecconi, D., Campostrini, N., Righetti, P.G., (2005) Spot matching in two-dimensional gels: a review of commercial software and of “home-made” approaches, Proteomics 5, 654–666
17
Finding the Significant Markers
Statistical Analysis of Proteomic Data
Sebastien Christian Carpentier, Bart Panis, Rony Swennen, and Jeroen Lammertyn
Summary
After separation through two-dimensional gel electrophoresis (2DE), several hundred individual protein abundances can be quantified in a cell population or sample tissue. Both a good experimental setup and a valid statistical approach are essential to gain insight into the data and to draw correct conclusions. High-throughput 2DE proteomics yields large and complex datasets with a huge disproportion between the hundreds of variables and the restricted number of replicates. However, the most commonly used statistical tests have been designed to cope with a high number of replicates and a restricted number of variables. There is some inconsistency in the proteomics community related to the use of statistics. Two approaches to data analysis can be distinguished: exploratory data analysis and confirmatory data analysis. Currently, most proteomic data are analyzed with the emphasis on confirmatory analysis, without taking exploratory data analysis into account. This chapter gives an overview of the typical exploratory and confirmatory statistical tools available and suggests case-specific guidelines for a reliable statistical approach that can be used for 2DE analysis. Examples are given for an experimental setup based on classical staining methods as well as for the more advanced difference gel electrophoresis.
Key Words: assumptions; confirmatory data analysis; experimental set-up; exploratory data analysis; missing values; multivariate statistics; non-parametric test; parametric test; principal component analysis; univariate statistics.
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols Edited by: A. Vlahou © Humana Press, Totowa, NJ
1. Introduction
The conventional approach to analyzing a biological problem is to collect data in order to test a particular hypothesis. Starting from this hypothesis, data are collected that should lead to an objective and reliable decision: the hypothesis can be accepted, revised, or rejected. This confirmatory way of data analysis is accompanied by a number of steps that define the experimental setup. However, our understanding of a biological system is usually rather limited, and data may be very heterogeneous and complex. Exploratory data analysis approaches a biological problem from a different angle and tries to describe patterns, relationships, trends, outlying data, etc.
Two-dimensional gel electrophoresis (2DE) simultaneously quantifies hundreds of individual protein abundances in a cell population or sample tissue. High-throughput 2DE proteomics yields large and complex datasets with a huge disproportion between the hundreds of variables and the restricted number of replicates. The most commonly used statistical tests are meant for confirmatory data analysis and have been designed to cope with a high number of replicates and a restricted number of variables. Both a good experimental setup and a valid statistical approach are extremely important. There is some inconsistency in the proteomics community: proteomic data are currently analyzed by a variety of approaches. The objective of this chapter is to give a concise overview of statistical methods used in functional genomics and to find a good compromise between statistics and proteome analysis in practice.
This chapter deals with experimental design and data analysis and, at the end, provides two practical examples (classical staining approach and DIGE approach). Section 2 discusses the issues of replicates and the pooling of samples, and briefly discusses the calibration, normalization, and quantification of data. Section 3 discusses confirmatory univariate and exploratory multivariate analysis and the related assumptions and associated problems.
2. Experimental Design
The design of an experiment is crucial for the robustness of the results obtained. Careful planning is essential to maximize the information output of an experiment. The experimental conditions must be well designed in order to keep variation within an experimental group as small as possible, and the experimental setup should be kept as simple as possible in order to keep the data manageable. When the impact of a particular treatment is to be examined, proper controls should be included (positive and negative control), and irrelevant external influences should be eliminated or anticipated (e.g., by randomized design).
The conventional approach of analyzing a biological problem is to collect data in order to test a particular hypothesis. The collected data should enable the researcher to make an objective and reliable decision concerning the hypothesis. The experimental setup usually includes a procedure that involves several steps: (1) state a null hypothesis (H0) (e.g., there is no difference in protein abundance(s) between the treatments) and its alternative (H1) (e.g., there is a difference between the treatments), (2) choose the most appropriate test statistic to check the hypothesis, (3) specify a significance level (i.e., the accepted risk of obtaining a false positive result by unjustly rejecting the null hypothesis), (4) specify the sample size (number of replicates) needed for sufficient power, and (5) collect the data. The power of a statistical test is its ability to detect possible differences between the experimental groups. The power, or the reduction of false negative results, depends on the variance, the change in abundance, the number of replicates, the statistical test chosen, and the predetermined significance level. Lilley and Karp have illustrated the relationship between power, replicate number, and relative expression change in a proteomics experiment (1). Urfer et al. consider the effect of testing all the proteins simultaneously by means of the family-wise error rate and the false discovery rate (2). Increasing the number of replicates is the best way to control the power of a statistical test. Given the labor and cost involved in 2DE analysis, the number of replicates is often restricted, and thus the variance (technical and biological) should be kept under control.
2.1. Replicates
A well-discussed subject is the nature of replicates. Two types of replicates are reported in 2DE studies: (1) technical replicates (repeated measurements of the same sample, e.g., the same protein extract) and (2) biological replicates (different measurements within the same experimental group). Ideally, only biological replicate samples should be used, and one should try to limit the technical variability to the strict minimum so that a repeated measurement of the same sample is not necessary (Fig. 1A). Therefore, both a reliable sample preparation method (3) and extensive experience in electrophoresis and proteomic techniques are indispensable (4,5,6,7). Technical variability can be introduced at the level of (1) sample collection, (2) sample preparation and protein extraction, (3) sample loading and electrophoresis, and (4) staining and image analysis. Some staining methods, like silver staining, involve many steps, and each sample is run in an individual gel, which makes the approach susceptible to technical variation. Technical replicates might be considered in experiments with a low sample yield, with cost restrictions, or when the technical variability is still too high (high inter-gel variability) (Fig. 1B).
In any case, one should take care to analyze technical replicates next to biological replicates. Statistically speaking, we are dealing with mixed models and nested designs (8,9). Karp et al. discuss the impact of mixing biological and technical replicates in a proteomics experiment (10). Treating technical replicates as biological replicates can increase the rate of false positives. Analyzing biological and technical replicates in one test seems reasonable only in a nested ANOVA test. If another statistical test is used, only the biological replicates should be used (Fig. 1A), and the technical repetition of the same biological samples (protein extracts) should be considered as a distinct and confirmatory analysis. With the low technical variance observed with the difference gel electrophoresis (DIGE) approach (see below), the value of analyzing technical replicates can be questioned, and they can hence be skipped (Fig. 1A).
2.2. Pooling
Another well-debated subject is the pooling of biological samples. Pooling of individual biological tissues or cells averages the sample. On the one hand, pooling reduces the variability and thus increases the power; on the other hand, there is an incontestable loss of relevant information about individuals. The pooling of samples reduces biological variance in detecting changes in protein abundance between the averages of the experimental groups. Pooling of samples is usually done when the biological variation within an experimental group is too big (Fig. 1C and 1D), or when the starting material of one individual is insufficient for protein extraction. Pooling of samples might be useful, but must be evaluated for each individual experimental setup.
2.3. Data Processing
Common strategies for quantitative determination of gel-separated proteins include organic dyes (e.g., colloidal coomassie blue), silver staining, radio labeling, and fluorescent stains (e.g., Deep Purple, Flamingo, SYPRO Orange/Red/Ruby, and other ruthenium complexes and succinimidyl ester derivatives of cyanine dyes). The choice of a particular staining method should be considered carefully, taking into account the lab equipment available, the budget, and the power of the method. The dynamic range of staining methods and the technical variability both have a great impact on the power of a statistical test and are decisive for the experimental setup (the number of replicates) and the choice of the statistical test. Data from 2DE analysis are generated through image analysis software that detects and quantifies protein abundances and matches the same proteins across different gels. An important challenge in 2DE is to estimate the protein concentration in order to ensure that all gels are loaded with an equal amount of
Fig. 1. Experimental setup. Theoretical examples of an experimental setup, control vs. treatment. (A) Small intra-group variation and small technical variation: four biological replicates for control and four biological replicates for treatment. (B) Small intra-group variation and big technical variation (mixed model): four biological and three technical replicates for control and the same for treatment. (C) Big intra-group variation and small technical variation: four replicates of a biological pool for control and the same for treatment. (D) Big intra-group variation and big technical variation (mixed model): four replicates of a biological pool and three technical replicates for control and the same for treatment.
proteins, and hence to minimize the technical variation. Most current software packages take this into account and introduce a calibration or normalization in order to compensate for image differences caused by protein loading, staining, and scanning.
2.3.1. Classical Approach
Calibration in a classical approach (like silver or coomassie staining) takes into account the differences in scanning properties (such as image depth). Scanner grey values are converted to optical densities so that intensities are no longer dependent on the original pixel depth. The most logical normalization procedure to anticipate possible loading differences for a classical staining is % volume, where the individual spot volumes are normalized by the total volume of all spots. Normalized data, whether or not transformed, can subsequently be analyzed by a relevant statistical test (see below). The most commonly used organic staining is coomassie brilliant blue (CBB) staining. CBB staining has a relatively good dynamic range (approximately 10³) and is perfectly compatible with MS. However, its sensitivity is relatively low. The limit of protein detection for colloidal CBB stain is approximately 8–10 ng (11). Therefore, several modifications have been proposed to improve its sensitivity. For an overview, see (12). The introduction of the first sensitive silver-staining method (13) was a major breakthrough in the field of protein detection, which led to extensive research and various alternative silver-staining protocols (14). Silver staining is still one of the most sensitive non-radioactive detection techniques, with a detection limit in the lower nanogram range. However, the linearity and dynamic range are relatively poor (approximately 10² or less), the staining is protein-dependent, and gel-to-gel variation is not negligible due to numerous solution changes and other carefully timed steps.
2.3.2. Difference Gel Electrophoresis Approach
Fluorescence-based methods are surpassing the conventional technologies in use. A standard UV-transilluminator can be used for visualization of most fluorescent stains, but more sophisticated and expensive CCD cameras or laser scanners are appropriate for quantitative determination. The development of succinimidyl ester derivatives of different cyanine fluorescent dyes that modify free amino groups of proteins prior to separation (15) was a major achievement in terms of reproducibility and throughput. The DIGE approach uses fluorophores that have different absorption optima, making it possible to run multiple samples simultaneously in the same gel. Several dyes were designed to ensure that a protein acquires the same relative mobility irrespective of the dye used to tag it.
The difference in MW introduced by linkers of different length is compensated by different alkyl moieties opposite the linker moiety. Originally, only two different cyanine dyes were included (Cy3 and Cy5), but the concept was extended with a third dye (Cy2) that opened the way for a totally new experimental design that further exploits the sample multiplexing capabilities of the dyes by including an internal standard (16,17). The internal standard is a mixture of equal amounts of each sample and guarantees a powerful normalization procedure for high accuracy of protein quantification. This normalization reduces the variability considerably and provides reasonable arguments for the use of powerful parametric statistics after transformation of the standardized volume. If multiple conditions have to be tested spread over different electrophoresis runs, one common internal standard should be created and included in all the gels of each run. However, if an experimental setup is too complex, the internal standard will contain too many samples, possibly resulting in an overlap of spots of different samples. The minimal labeling approach has a dynamic range of four to five orders of magnitude and is currently marginally less sensitive than silver staining (18). Although the dyes have been carefully designed, care should be taken in the experimental design to take into account possible dye-specific effects. Therefore, a supervised randomization of the Cy3/Cy5 labeling is highly recommended. Not only should the labeling be randomized, but the samples representing an experimental group should also be mixed across gels in order to avoid systematic gel artefacts.
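The randomization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: two equally sized groups, one control and one treatment sample per gel next to the Cy2 internal standard, and a balanced dye assignment in which about half of the control samples receive Cy3. The function name `randomize_dige_design` and the sample labels are hypothetical.

```python
import random


def randomize_dige_design(control, treatment, seed=None):
    """Pair one control and one treatment sample per gel and balance the dyes.

    Hypothetical helper: assumes two groups of equal size and a Cy2
    internal standard on every gel.
    """
    if len(control) != len(treatment):
        raise ValueError("this sketch assumes equally sized groups")
    rng = random.Random(seed)
    control, treatment = list(control), list(treatment)
    # Shuffle both groups so that the pairing of samples on a gel is random.
    rng.shuffle(control)
    rng.shuffle(treatment)
    n = len(control)
    # Balanced dye assignment: about half of the controls are labeled with Cy3.
    control_in_cy3 = [i < (n + 1) // 2 for i in range(n)]
    rng.shuffle(control_in_cy3)
    design = []
    for gel, (c, t, c_cy3) in enumerate(zip(control, treatment, control_in_cy3), start=1):
        cy3, cy5 = (c, t) if c_cy3 else (t, c)
        design.append({"gel": gel, "Cy2": "internal standard", "Cy3": cy3, "Cy5": cy5})
    return design


for row in randomize_dige_design(["C1", "C2", "C3", "C4"], ["T1", "T2", "T3", "T4"], seed=1):
    print(row)
```

Fixing the seed makes the layout reproducible, which can be convenient when the design has to be documented alongside the gels.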
3. Data Analysis
3.1. Confirmatory Univariate Data Analysis
Univariate statistical methods examine the individual protein spots one by one, considering the different proteins as independent measurements. Table 1 gives an overview of some commonly used parametric and non-parametric univariate tests. Univariate methods start from the null hypothesis that there is no difference between the two experimental populations. Parametric models
Table 1
Overview of Some Commonly Used Univariate Tests

                           Univariate statistics
Classes of data            Parametric    Non-parametric
Comparing 2 treatments     T-test        Mann–Whitney/Wilcoxon; Kolmogorov–Smirnov test
Comparing k treatments     ANOVA         Kruskal–Wallis test
like Student’s T-test start from the observed sample and assume that the observed sample mean and variance approximate the real population mean and variance, and that the variances of the two experimental populations are equal. Based on the observed mean and variance, the two populations are considered normally distributed and a model is made (Fig. 2). If the test statistic (or T-value) is large enough, the null hypothesis is rejected (Eq. 1). The numerator measures the distance between the experimental means and is thus an estimate of the inter-group variability; the denominator approximates the real variability and estimates the intra-group variability.
$$T^2 = \frac{(\bar{y}_2 - \bar{y}_1)^2}{S_P^2\left[(1/n_1) + (1/n_2)\right]} \qquad (1)$$
where $\bar{y}_i$: experimental mean (estimate of the population mean, $\mu_i$); $S_P^2$: pooled sample variance (estimate of the variance; it is a weighted average of the group variances accounting for the number of replicates or samples in each group); $n_i$: number of replicates per experimental group.
Parametric univariate statistical tests are very powerful, but the data must respect the restrictive assumptions (continuous and normally distributed data, homogeneity of variance, and independent samples), and the assumptions must be tested. A commonly used test for homogeneity of variances is Levene’s test, and for normality, the Shapiro-Wilk test (19). If one assumption is not met, the significance levels and the power of the test might be invalidated. Transformation of data (e.g., log function, arcsine, square root) is frequently used to improve the distribution characteristics (normality and homogeneity of variance) (20). The problem with proteomic data is the low number of replicates. It is impossible to test these assumptions reliably with the low sample sizes commonly used in 2DE experiments. Tests like Levene’s test and the Shapiro-Wilk test are designed for larger sample sizes and have very limited power at the sample sizes commonly used in proteomics experiments. Given the labor and cost involved in 2DE analysis, the number of replicates is often restricted and usually ranges between 3 and 6.
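As a small worked illustration of Eq. 1, the sketch below computes the squared test statistic for one protein spot using only the Python standard library. The function name `pooled_t_squared` and the spot values are hypothetical; the pooled variance is taken as the usual average of the two sample variances weighted by their degrees of freedom, which is our reading of the definition given with Eq. 1.

```python
from statistics import mean, variance


def pooled_t_squared(group1, group2):
    """Squared two-sample T statistic of Eq. 1 (equal variances assumed)."""
    n1, n2 = len(group1), len(group2)
    y1, y2 = mean(group1), mean(group2)
    # Pooled variance: average of the sample variances, weighted by n_i - 1.
    s2_p = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    # Numerator: inter-group distance; denominator: intra-group variability.
    return (y2 - y1) ** 2 / (s2_p * (1 / n1 + 1 / n2))


# Hypothetical normalized volumes of one spot, three replicates per group.
control = [1.0, 2.0, 3.0]
treatment = [2.0, 3.0, 4.0]
print(pooled_t_squared(control, treatment))
```

Turning the statistic into a p-value requires Student's t distribution with n1 + n2 − 2 degrees of freedom; statistical packages provide this directly (for example, `scipy.stats.ttest_ind` with its default `equal_var=True` returns T and the p-value).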
Fig. 2. Distribution of two normal populations with a homogeneous variance. μi : real population average estimated by the sample average.
Although some empirical evidence illustrates that slight deviations in meeting the assumptions underlying parametric tests may not have radical effects on the obtained probability levels, there is no general agreement as to what is a “slight” deviation (21). An alternative to the parametric tests is the use of non-parametric tests, which do not assume any distribution for the data but usually have a relatively low power (21). The assumptions are independent and continuous ordinal data. A useful non-parametric test is the Kolmogorov–Smirnov test, which determines whether or not the experimental groups come from the same distribution. To this end, the data points in each experimental group are sorted in ascending order, and an empirical distribution function is calculated without any assumption of distribution or variance. The Kolmogorov–Smirnov test statistic D is defined as the maximum distance between the cumulative distributions of two experimental groups (for an example, see Fig. 5).
$$D_{n_1,n_2} = \max\left|S_{n_1}(X) - S_{n_2}(X)\right| \qquad (2)$$
where $S_{n_i}(X) = K_i/n_i$; $K_i$: number of data points equal to or less than $X$; $n_i$: number of replicates per experimental group.
3.2. Exploratory Multivariate Data Analysis
Univariate statistical tests, such as the T-test, the Kolmogorov–Smirnov test, ANOVA, or the Kruskal–Wallis test, have not been designed to analyze complex datasets containing multiple correlated variables. Proteomic datasets generally contain hundreds of different proteins that are correlated. Proteins fit within the larger entity of networks and interact with each other. Univariate statistics test the individual variables one by one and cannot detect correlations with other variables (proteins). Moreover, testing hundreds of variables (protein spots) one by one and reporting them with an acceptance of a certain risk of false positives (α) enhances the chance of reporting false positive cases (multiple testing issue), and assumes that the different variables (proteins) are uncorrelated. Proteins are not uncorrelated; they fit within multiple biological pathways and might have close correlations. The field of multivariate analysis consists of those statistical techniques that consider two or more related random variables as a single entity and attempt to produce an overall result taking the relationship among the variables into account (22). In contrast to a univariate approach, it displays the inter-relationships between a large number of variables and is able to correlate multiple proteins to a specific experimental group. The data from different image analysis software packages can be exported, introduced, and analyzed using several software packages to
perform multivariate analysis. Some commonly used packages are Unscrambler, Matlab, SAS, and Statistica. GE Healthcare developed a statistical software package (EDA, extended data analysis) for the DIGE approach, which is linked to the image analysis software Decyder. The package offers both univariate and multivariate tools. Here, we will mainly discuss the use of Principal Component Analysis (PCA) (for an overview of other possibilities of the EDA package and more DIGE-related statistical examples, see Chapter 6).
3.2.1. Principal Component Analysis
Principal Component Analysis is one of the multivariate possibilities to perform exploratory data analysis. A comprehensive overview of the use of PCA in statistics is given by Sharma (23). The basics of PCA date back to Karl Pearson in 1901 (24), and the final procedure as we know it today was developed by Harold Hotelling in 1933 (25). The use of multivariate methods in the analysis of 2DE was already established in the early days of 2DE (26) and is an emerging application in transcriptomics and proteomics (27,28,29,30,31). PCA condenses the information contained in a huge dataset into a smaller number of artificial factors, which explain most of the variance observed. The most logical modus operandi is to consider the different biological replicate samples of the experimental groups as observations (score plot) and the proteins as variables (loading plot). The score plot allows the detection of trends in the samples, and the loading plot allows the identification of the relevant proteins that explain the trends. A principal axis transformation transforms the correlated variables (proteins) into new uncorrelated variables. A principal component (PC) is a linear combination calculated from the existing variables (proteins) [$PC_1 = a_1(\text{protein}_1) + a_2(\text{protein}_2) + \dots + a_n(\text{protein}_n)$; $PC_2 = b_1(\text{protein}_1) + b_2(\text{protein}_2) + \dots + b_n(\text{protein}_n)$]. The relation between the original variables (proteins) and the PCs is displayed in the loading plot. This means that if a protein has a high loading score for a specific PC, that protein explains an important part of the sample variance. The starting point for PCA is the sample covariance matrix. It has been proven that the sum of the original variances is equal to the sum of the eigenvalues of the sample covariance matrix. The eigenvalues are the variances of the PCs. The ratio of each eigenvalue to the total variance indicates the portion of the total variability accounted for by each PC. For the fundamentals of data manipulation and a more detailed description of the properties and mechanisms of multivariate analysis and PCA, the reader is referred to the books of Jackson and Sharma (22,23). It is very important to have an insight into what is calculated and what the assumptions of the different models are. The EDA software offers the user the choice to play with observations and loadings. Hence, the user also has the possibility to use the transposed data matrix, and to consider the gel images as
variables (loading plot) and the proteins as observations (score plot). This might be helpful to improve the image analysis and to detect protein mismatches, but should not be used to explore the inter- and intra-group variability of the biological samples. Exploratory PCA does not impose strict requirements on the data. The majority of PCA applications are descriptive in nature. In these instances, distributional assumptions are of secondary importance (22). The only requirement that must be met is that the dataset has to be complete, meaning that there must be no missing spot values among the different samples. Techniques for performing PCA on incomplete data and/or for estimating missing data can solve the problem. Several methods for estimating missing data have been reported by the microarray community (32,33,34). A missing value in 2DE proteomics occurs when a spot is detected in the reference or master gel but not detected in one of the other sample gel images, or when it is detected but not matched to the reference or master gel. The causes of missing values might be (1) faint spots close to the detection limit that are detected in one gel but not in another; (2) mismatches, probably caused by distortions in the protein pattern; or (3) absence of spots due to bad transfer from the first to the second dimension. Grove et al. show that the staining procedure was an important source of missing values (27). The concept of DIGE with its common internal standard anticipates the missing value problem to some extent by matching the different internal standard images. A good sample preparation (3) and good experience in electrophoresis and proteomic techniques also reduce this problem, but missing values are inherent to 2DE and must be faced. Some software packages replace the missing values with the value zero, and others remove all the variables with missing values. Introducing zeros leaves the results open to serious bias when a protein is mismatched in a particular sample or when the spot is missing due to a technical error. This particular protein will get an important loading value for the sample in question, incorrectly influencing the score for this particular sample. If a protein is really absent or below the detection limit of the staining method, those missing values can be filled either with zeros or with a threshold value (35). A better alternative might be to average the samples within an experimental group and to explore the data based on the group mean. A missing value will still be considered as a zero and will lower the group mean, but the impact of the loading on the sample score plot is buffered by the average. The EDA package offers this possibility (see example below). Taking into account only the proteins that are detected and matched to the master or reference gel solves the problem of missing values, but a lot of useful information is lost (see example below). The EDA package offers the possibility to filter the base dataset and to select only those proteins that are 100% matched. Troyanskaya et al. show that averaging is an improvement upon replacing missing values
with zeros, but it yields drastically lower accuracy than estimation methods such as singular value decomposition and weighted K-nearest neighbors (32). We recommend performing the initial PCA on the complete dataset and not only on the proteins that appear to be significantly different in the individual univariate analyses. Multivariate statistics have an additional value by being capable of differentiating the experimental groups in terms of correlated expression rather than absolute expression (28,36). Both approaches are complementary. Performing the analysis only on significant proteins from univariate analysis might disregard useful information. We recommend starting the analysis with exploratory multivariate analysis and subsequently comparing the results with the confirmatory univariate analysis of the individual proteins.
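The PCA procedure of Section 3.2.1 (covariance matrix, eigenvalues as PC variances, scores and loadings) can be sketched with NumPy as follows. This is a minimal illustration, not the algorithm of any of the packages named above: the function name `pca` and the random example matrix are hypothetical, samples are assumed to be rows and proteins columns, and the data matrix is assumed to be complete.

```python
import numpy as np


def pca(X):
    """PCA via eigendecomposition of the sample covariance matrix.

    X: samples (rows) x proteins (columns), without missing values.
    Returns scores, loadings, eigenvalues and the explained variance ratio.
    """
    X = np.asarray(X, dtype=float)
    if np.isnan(X).any():
        raise ValueError("this sketch requires a complete data matrix")
    Xc = X - X.mean(axis=0)                 # center every protein
    cov = np.cov(Xc, rowvar=False)          # proteins x proteins covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric input, ascending order
    order = np.argsort(eigvals)[::-1]       # sort PCs by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                   # sample coordinates (score plot)
    explained = eigvals / eigvals.sum()     # share of total variance per PC
    return scores, eigvecs, eigvals, explained


# Hypothetical data: 6 gels (samples) x 4 matched spots (proteins).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
scores, loadings, eigvals, explained = pca(X)
# The sum of the eigenvalues equals the sum of the original variances.
print(np.isclose(eigvals.sum(), X.var(axis=0, ddof=1).sum()))
```

The explicit check for missing values mirrors the completeness requirement discussed above; how the gaps are filled or filtered beforehand is left to the user.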
3.2.2. Marker Selection
Principal Component Analysis is outstanding in detecting outlying data and correlations among the different variables (proteins), but it cannot determine a threshold level for identifying which proteins are significant in classifying the experimental groups, which would allow an objective removal of variables (proteins) that do not contribute to the class distinction. Several algorithms exist to select a subset of features from the whole dataset and to perform a classification. In proteome analysis, this corresponds to selecting the proteins that can best discriminate the experimental groups. The use of partial least squares (PLS) as a regression technique has been promoted primarily within the area of chemometrics (37). In contrast to PCA, PLS is a supervised technique mainly applied to link (or regress) a continuous response variable (or dependent variable) to a set of independent variables (e.g., proteins in a gel). However, in proteomic data, the response variable is often a discrete variable (e.g., treatment A, B, C,…) and only takes a fixed number of values. PLS-DA offers an algorithm to deal with this typical data structure. An analysis of the score and (correlation) loading plot allows the proteins that are important in discriminating the different experimental treatments to be defined. The variable importance plot (VIP) is an interesting tool for this purpose. According to the user manual, the PLS algorithm of EDA creates a supervised model of the data (predefined experimental groups) and then uses the variable influence on the projection (VIP) scores from the model to create a list that ranks how well each protein discriminates between the experimental groups. Discriminant analysis (DA) methods, in general, and PLS-DA, in particular, are used to calculate the probability or accuracy of the marker selection. The purpose of DA is to allow the assignment of individual observations (samples) to one of the experimental
groups [e.g., the classification of patient samples as healthy or tumor based on protein extractions (38)].
4. Examples
4.1. Classical Dyes, 2 Conditions
In this example, we examine two different conditions, analyze six biological samples per condition, and perform the analysis with classical CBB staining. The data have been analyzed with the Image Master Platinum software version 5 (GE Healthcare). Image Master version 5 offers the possibility to compensate for technical variance and offers intensity calibration and spot normalization. The relative volume (%vol) spot normalization is the best spot normalization procedure because it takes into account the intensity of a spot as well as its area (Eq. 3).
$$\%\mathrm{vol} = \frac{\mathrm{vol}}{\sum_{S=1}^{n} \mathrm{vol}_S} \qquad (3)$$
where $\mathrm{vol}_S$ is the volume of spot $S$ in a gel containing $n$ detected spots.
Although this spot normalization procedure reduces the possible technical variance, it has consequences for the data. Normalizing all the spots transforms the data and creates an asymmetric population (Fig. 3). A logarithmic transformation of the data improves the distribution characteristics (Fig. 4). However, univariate statistical methods are not developed to analyze all the spots simultaneously as in Figs. 3 and 4. They examine the individual protein spots (variables) one by one, considering the different proteins as independent measurements. Therefore, one should consider each spot individually, and the real population for the experimental groups of this particular protein spot should be estimated based on the six replicates. Performing distribution tests like Levene’s test and the Shapiro-Wilk test on six replicates is a possibility, but it is unlikely that the null hypotheses (homogeneous variance and normal distribution, respectively) will be rejected. The sample sizes need to be large enough in order to minimize the number of false results (i.e., the populations will appear to be normally distributed and of equal variance although this is not necessarily the case). Taking into account the typical heterogeneity of variance associated with classical dyes, the %vol spot normalization of Image Master, and the limited sample size, a non-parametric statistical test seems to be the best choice in this case. We opted here for the non-parametric univariate Kolmogorov–Smirnov test. The test is one of the options offered by Image Master. It is a two-sample test with high power efficiency for small sample sizes. The reduced power of a non-parametric test was anticipated by including a
Fig. 3. Distribution of protein spots analyzed by Image Master and normalized using the %vol criterion (histogram of the number of observations with the expected normal curve; Shapiro-Wilk W = 0.35883, p = 0.0000). There is an asymmetrical distribution, with the majority of the spots lying between 0 and 0.1%.
Fig. 4. A logarithmic transformation of the %vol data of Fig. 3 (histogram of the number of observations with the expected normal curve; Shapiro-Wilk W = 0.98283, p = 0.00000).
higher number (6) of biological replicates. Figure 5 shows an example of an individual Kolmogorov–Smirnov test. For the complete experimental setup and biological background, see Carpentier et al. (39). The options of the Image Master Platinum software are rather limited and are focused on two experimental groups. The multivariate analysis offered by Image Master Platinum is factor analysis. Factor analysis is a technique similar in nature to PCA. The results of both techniques are quite similar, except that factor analysis explains correlations between variables, while PCA explains variability (22). In Image Master Platinum, the gels (images) are used as loadings and the proteins for the score plot. Factor 1 (explaining the majority of the variability) is in our case associated with protein abundance, and the second factor is associated with inter-group variability. As stated above, this might be useful to improve the image analysis and to detect protein mismatches, but to explore the inter- and intra-variability of the biological samples, it might be better to export the
Fig. 5. Example of Kolmogorov–Smirnov test. (A) Descriptive statistics displaying the experimental mean and standard deviation of the two experimental groups (A and B). (B) Descriptive statistics of the individual biological samples of the two experimental groups. (C) The data sorted in ascending order. (D) Empirical cumulative distribution functions of the two experimental groups. Panels A–C plot %vol; panel D plots frequency against %vol.
data to a statistical program. For an example of classical staining and uni- and multivariate analysis, see Pedreschi et al. (40).
4.2. DIGE Approach, 4 Conditions
In this example, we are interested in the effects of a specific treatment over time. Using the DIGE approach, we consider here four time points. At each time point, three biological samples were analyzed, quantifying several hundreds of protein spots (i.e., variables) per sample per time point. To process and analyze the gels, the Decyder software version 6.5 was used in combination with the EDA module (GE Healthcare). The standardized normalization procedure in Decyder 2D BVA is based on the concept of having for each gel the Cy2 labeled internal standard image as reference. This standard image is used to normalize the abundance ratios between the different gels. Decyder offers the possibility to perform transformation and normalization of the data: log standardized abundance (Eq. 4).
$$\text{Log standardized abundance} = \log_{10}\left(\frac{\mathrm{vol}_{\mathrm{Cy5\ or\ Cy3}}}{\mathrm{vol}_{\mathrm{Cy2}}}\right) \qquad (4)$$
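Equation 4 can be illustrated with a few lines of Python. This is only a minimal sketch of the formula as written above, not the Decyder implementation; the function name and the spot volumes are hypothetical.

```python
import math

def log_standardized_abundance(sample_volume, standard_volume):
    """Eq. 4: log10 of the Cy3 or Cy5 spot volume over the Cy2 standard volume."""
    if sample_volume <= 0 or standard_volume <= 0:
        raise ValueError("spot volumes must be positive")
    return math.log10(sample_volume / standard_volume)

# Hypothetical spot volumes (arbitrary units) for one spot on one gel.
cy2_standard = 1500.0
print(log_standardized_abundance(3000.0, cy2_standard))  # twice the standard
print(log_standardized_abundance(750.0, cy2_standard))   # half the standard
```

A doubling and a halving relative to the internal standard give values of equal size and opposite sign, which is one reason the log scale is convenient for comparing increases and decreases in abundance.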
Using the DIGE approach, Karp and Lilley gathered reasonable arguments to assume that the restrictive assumptions of parametric statistics are not violated too strongly after the logarithmic transformation of standardized abundance (1). The use of parametric statistics seems, therefore, acceptable. However, univariate statistics test the individual variables one by one and cannot correlate multiple proteins. Moreover, testing hundreds of variables (protein spots) one by one and reporting them with an acceptance of a certain risk of false positives (α) increases the chance of reporting false positive cases (multiple testing issue). It is, therefore, advisable to first gain insight into the complex dataset by exploring the data via multivariate analysis and then to validate the individual differences via univariate statistics. Not all proteins are relevant to understand the differences between the time points. Therefore, it would be interesting to distinguish relevant proteins from irrelevant proteins that do not have a changing abundance over time. To facilitate the discovery of the differences, we used the PCA of the extended data analysis module of Decyder. PCA reduces more than 1000 variables into PCs that explain most of the variance between the treatment times. PCA is not supervised, meaning that the samples are analyzed without the knowledge of sampling time. In Fig. 6, the score and loading plot are displayed, taking into account the two most important PCs. The different repetitions of the same time point cluster together, and the most important PC (i.e., PC1) is able to separate the clustered treatment times. In practice, this means that proteins with a high positive PC1 value will be abundantly present in the 2-day gels and less
abundant in 14-day gels and vice versa for proteins with a highly negative PC1 value. Proteins that cluster together have a similar impact on the PCs and have a similar expression pattern (Fig. 6). This rough approach explains only a small part of the variability. The first PC explains 34.2% of the variability and explains a great part of the inter-group biological variability (time effect). A high positive PC1 value is correlated with 2 days, and a high negative value is correlated with 14 days. Most proteins cluster around the origin, indicating a poor contribution to the variance; these proteins probably do not change in abundance during the examined time period. The second PC explains 15.1% of the variability and seems to explain mainly (technical) intra-group variability. By default, EDA ignores the missing values. By anticipating the missing value issue, taking the average of each experimental group, and reducing some technical variability, the first component explains 60.9% of the variability and the second PC 23.4%. Taking into account only the proteins that have been matched and detected in all the gels reduces the number of examined proteins by more than 50% and discards very useful proteins that have, for instance, a very low
Fig. 6. PCA analysis. (A) Score plot. The big circle is based on the Hotelling's T² test statistic and is used to detect outlying observables (α = 0.95). The three biological replicates of the same experimental group cluster together, indicating an acceptable intragroup variability (grey ellipse). The different experimental groups are also separated, indicating a certain inter-group variability. There is a clear difference between 2 and 14 days of treatment. (B) The loading plot indicates the correlation between the original variables. A protein with a high loading score for a specific PC explains an important part of the sample variance.
abundance in the early days of treatment and higher abundances at the end and vice versa. As an example, we focus on five proteins that seem highly correlated from the loading plot (highlighted in Fig. 6B). Confirmatory differential expression analysis via ANOVA confirms that all five proteins have a very similar expression pattern over time (Fig. 7). This might suggest a common regulatory mechanism or an interaction between the proteins. The individual confirmatory univariate statistics (ANOVA and multiple comparison test) confirm for four out of the five proteins that 2 days is significantly different from 4 days, 8 days, and 14 days; and that 14 days is significantly different from 4 days and 8 days (α ≤ 0.01). We could identify four proteins as lectin isoforms (39), confirming, at a first level, the correlation between the proteins. One protein could not be identified and is under further investigation. This protein is likely to have a common regulatory mechanism (being also a lectin-like protein), might form a complex, or develop an interaction with lectin proteins. This particular protein shows exactly the same expression pattern as the four identified lectins, but the overall ANOVA has a p-value of 0.0122. This is a nice illustration of
Fig. 7. Confirmatory differential expression analysis—expression pattern of the individual proteins selected from Fig. 6. The different normalized relative abundances are displayed for the different time points (14 days, 8 days, 4 days, and 2 days). The mean of each individual isoform is displayed as a cross.
how exploratory data analysis performs, indicating correlation but also bringing up candidate markers that would have been missed when using only confirmatory data analysis (α ≤ 0.01).
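The univariate test of Subheading 4.1 can be sketched in plain Python. This is a minimal sketch and not the Image Master implementation, which may compute its p-value differently: the function names and the %vol values are hypothetical, and the p-value is obtained by enumerating every way of relabeling the pooled replicates, which is feasible because six replicates per group give only 924 possible splits.

```python
from fractions import Fraction
from itertools import combinations

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic D: the largest vertical gap
    between the empirical cumulative distribution functions of a and b."""
    d = Fraction(0)
    for x in sorted(set(a) | set(b)):
        fa = Fraction(sum(v <= x for v in a), len(a))
        fb = Fraction(sum(v <= x for v in b), len(b))
        d = max(d, abs(fa - fb))
    return d

def ks_permutation_test(a, b):
    """Return D and the exact permutation p-value: the share of all
    relabelings of the pooled values whose D is at least the observed D."""
    observed = ks_statistic(a, b)
    pooled = list(a) + list(b)
    indices = range(len(pooled))
    hits = total = 0
    for chosen in combinations(indices, len(a)):
        chosen_set = set(chosen)
        first = [pooled[i] for i in chosen]
        second = [pooled[i] for i in indices if i not in chosen_set]
        hits += ks_statistic(first, second) >= observed
        total += 1
    return observed, Fraction(hits, total)

# Hypothetical %vol values of one spot, six biological replicates per group.
group_a = [0.21, 0.25, 0.19, 0.23, 0.27, 0.22]
group_b = [0.41, 0.38, 0.45, 0.36, 0.43, 0.40]
d, p = ks_permutation_test(group_a, group_b)
print(float(d), float(p))
```

In this made-up example the two groups do not overlap, so D equals 1 and only 2 of the 924 splits reach that value. Exact fractions are used instead of floating point so that the comparison against the observed D is not affected by rounding.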
5. Conclusions
The experimental conditions are important and must be well designed. Ideally, only biological replicate samples should be used, and one should try to limit the technical variability to the strict minimum. A reliable sample preparation and an extended experience in electrophoresis and proteomic techniques are indispensable. With the low technical variance observed with the DIGE approach, the need for analyzing technical replicates can be questioned. The pooling of samples reduces the biological variance to detect changes in protein abundance between the averages of the experimental groups. Pooling of samples might be useful but must be reconsidered for each individual experimental setup. The use of a particular staining method should carefully be considered taking into account the available lab equipment, budget, and power of a particular method. The dynamic range of the staining methods and the technical variability have a great impact on the power of a statistical test and are decisive for the experimental setup (the number of replicates) and the choice of the statistical test. Univariate statistics test the individual variables one by one and cannot correlate multiple proteins. Moreover, testing hundreds of variables (protein spots) one by one and reporting them with an acceptance of a certain risk of false positives (α) increases the chance of reporting false positive cases (multiple testing issue). Therefore, it is advisable to first gain insight into the complex dataset by exploring the data via multivariate analysis and then to validate the individual differences via univariate statistics. Using a classical approach with the typical heterogeneity of variance associated with classical dyes and the limited sample sizes, a non-parametric test seems to be the best choice.
Using the DIGE approach, the restrictive assumptions of parametric statistics are not violated too strongly after the logarithmic transformation of the standardized abundance. The use of parametric statistics seems, therefore, acceptable.
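The multiple testing issue mentioned above can be made concrete with a short calculation. The sketch below is only an illustration under two stated assumptions: every null hypothesis is true, and the tests are independent (protein spots on a gel are generally not, so the second figure is indicative only). The function names and the numbers of spots are hypothetical.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives when every null hypothesis is true."""
    return n_tests * alpha

def chance_of_any_false_positive(n_tests, alpha):
    """Probability of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

# Hypothetical experiment: 1000 protein spots, each tested at alpha = 0.01.
n_spots, alpha = 1000, 0.01
print(expected_false_positives(n_spots, alpha))
print(chance_of_any_false_positive(n_spots, alpha))
```

With 1000 spots tested at α = 0.01, about 10 spots are expected to be reported by chance alone, and the chance of at least one false positive is close to 1.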
Acknowledgments The authors would like to thank Romina Pedreschi for critical reading and suggestions and Prof. Verbeke for the sharing of his files. Financial support from the Belgian National Fund for Scientific Research (FWO-Flanders) is gratefully acknowledged.
References
1. Karp, N. A. & Lilley, K. S. (2005) Proteomics 5, 3105–3115.
2. Urfer, W., Grzegorczyk, M., & Jung, K. (2006) Proteomics S2, 48–55.
3. Carpentier, S. C., Witters, E., Laukens, K., Deckers, P., Swennen, R., & Panis, B. (2005) Proteomics 5, 2497–2507.
4. Bjellqvist, B., Ek, K., Righetti, P. G., Gianazza, E., Gorg, A., Westermeier, R., & Postel, W. (1982) J. Biochem. Biophys. Methods 6, 317–339.
5. Westermeier, R. (2001) Electrophoresis in Practice. Wiley-VCH, Weinheim.
6. Westermeier, R. & Naven, T. (2002) Proteomics in Practice. Wiley-VCH, Weinheim.
7. Rabilloud, T. (2000) Proteome research: two dimensional gel electrophoresis and identification methods. Springer, Heidelberg.
8. Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W. (1996) In: Applied Linear Statistical Models (Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W., eds.). Irwin, Chicago, pp. 958–1010.
9. Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W. (1996) In: Applied Linear Statistical Models (Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W., eds.). Irwin, Chicago, pp. 1121–1164.
10. Karp, N. A., Spencer, M., Lindsay, H., O’dell, K., & Lilley, K. S. (2005) J. Proteome Res. 4, 1867–1871.
11. Patton, W. F. (2000) Electrophoresis 21, 1123–1144.
12. Westermeier, R. (2006) Proteomics S2, 61–64.
13. Switzer, R. C., Merril, C. R., & Shifrin, S. (1979) Anal. Biochem. 98, 231–237.
14. Rabilloud, T., Vuillard, L., Gilly, C., & Lawrence, J. (1994) Cellular and Molecular Biology 40, 57–75.
15. Unlu, M., Morgan, M. E., & Minden, J. S. (1997) Electrophoresis 18, 2071–2077.
16. Alban, A., Currie, I., Lewis, S., Stone, T., & Sweet, A. C. (2002) Mol. Biol. Cell 13, 407A–408A.
17. Alban, A., David, S. O., Bjorkesten, L., Andersson, C., Sloge, E., Lewis, S., & Currie, I. (2003) Proteomics 3, 36–44.
18. Tonge, R., Shaw, J., Middleton, B., Rowlinson, R., Rayner, S., Young, J., Pognan, F., Hawkins, E., Currie, I. et al. (2001) Proteomics 1, 377–396.
19. Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W. (1996) In: Applied Linear Statistical Models (Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W., eds.). Irwin, Chicago, pp. 95–152.
20. Gustafsson, J. S., Ceasar, R., Glasbey, C. A., Blomberg, A., & Rudemo, M. (2004) Proteomics 4, 3791–3799.
21. Siegel, S. C. N. J. (1988) Non Parametric Statistics for Behavioral Sciences. McGraw-Hill Book Company, Singapore.
22. Jackson, J. E. (2003) A User’s Guide to Principal Components. Wiley, New York.
23. Sharma, S. Applied Multivariate Techniques. Wiley, Hoboken, NJ.
24. Pearson, K. (1901) Phil. Mag. Ser. B. 2, 559–572.
25. Hotelling, H. (1933) J. Educ. Psychol. 24, 417–441.
26. Tarroux, P. (1983) Electrophoresis 4, 63–70.
27. Grove, H., Hollung, K., Uhlen, A. K., Martens, H., & Faergestad, E. M. (2006) J. Proteome Res. 5, 3399–3410.
28. Marengo, E., Robotti, E., Bobba, M., Liparota, M. C., Rustichelli, C., Zamoo, A., Chilosi, M., & Righetti, P. G. (2006) Electrophoresis 27, 484–494.
29. Schultz, J., Gottlieb, D. M., Petersen, M., Nesic, L., Jacobsen, S., & Sondergaard, I. (2004) Electrophoresis 25, 502–511.
30. Verhoeckx, K. C. M., Gaspari, M., Bijlsma, S., Van Der Greef, J., Witkamp, R. F., Doornbos, R. P., & Rodenburg, R. J. T. (2005) J. Proteome Res. 4, 2015–2023.
31. Gottlieb, D. M., Schultz, J., Bruun, S. W., Jacobsen, S., & Sondergaard, I. (2004) Phytochemistry 65, 1531–1548.
32. Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., & Altman, R. B. (2001) Bioinformatics 17, 520–525.
33. Scheel, I., Aldrin, M., Glad, I. K., Sorum, R., Lyng, H., & Frigessi, A. (2005) Bioinformatics 21, 4272–4279.
34. Oba, S., Sato, M., Takemasa, I., Monden, M., Matsubara, K., & Ishii, S. (2003) Bioinformatics 19, 2088–2096.
35. Wood, J., White, I. R., & Cutler, P. (2004) Signal Process. 84, 1777–1788.
36. Karp, N. A., Griffin, J. L., & Lilley, K. S. (2005) Proteomics 5, 81–90.
37. Wold, S. (1985) Encyc. Stat. Sci. 6, 581–591.
38. Nguyen, D. V. & Rocke, D. M. (2002) Bioinformatics 18, 39–50.
39. Carpentier, S. C., Witters, E., Laukens, K., Van Onckelen, H., Swennen, R., & Panis, B. (2007) Proteomics 7, 92–105.
40. Pedreschi, R., Vanstreels, E., Carpentier, S., Robben, J., Noben, J. P., Swennen, R., Lammertyn, J., Vanderleyden, J., & Nicolaï, B. M. Proteomics 7, 2083–2099.
18 Web-Based Tools for Protein Classification Costas D. Paliakasis, Ioannis Michalopoulos, and Sophia Kossida
Summary
Current proteomics technologies generate large amounts of data, among which the investigator has to identify the promising diagnostic/prognostic biomarkers as well as potential therapeutic targets. For the latter, classification of proteins into meaningful families is needed. Current databases, featuring a high level of interconnectivity (cross referencing), provide the tools necessary to bring various data together, facilitating protein classification and elucidation of protein function and interoperativity. This chapter provides guidelines for exploring the information-rich peptide sequences generated by proteomics methodologies, using web-based tools, with the objective of predicting potential protein function. After proper preprocessing (e.g., for internal repeats) of a query protein sequence, known domains can be identified, which aid in dividing the query into smaller meaningful parts. Any unclassified remainder of the protein provides the material for low-level comparative analysis for the discovery of distant homologues or candidate novel domain types to be verified experimentally.
Key Words: protein classification; domain families; recurrent tertiary structural motifs; sequence–structure relationships; (protein) structural evolution; protein database; homology searches; domain inference; protein structure redundancy.
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols. Edited by: A. Vlahou © Humana Press, Totowa, NJ
1. Introduction
From the times of the “one man-one gene” approach, when individuals were working on single protein sequences, which were decoded from the corresponding DNA sequences, to the era of high-throughput techniques, when massive automated procedures produce large numbers of peptide sequences, one task remains virtually the same: individual protein sequences need classification. We, humans, have an amazing instinctive capability to categorize
Paliakasis et al.
objects, even the most complex ones, which in particular can be categorized along various kinds of natural or arbitrary schemes. Proteins feature multiple attributes, such as sequence, structure, function, organelle specificity, evolutionary origin, affinity, isoelectric point, and size (not to mention tissue specificity and antigenicity in higher organisms), all of which offer means for classification. For instance, 2D gel spots corresponding to proteins, which have been separated in terms of their size and isoelectric point, reflect a primary attempt for classification; affinity (e.g., nucleoprotein, lipoprotein, metalloprotein, etc.) and function (e.g., enzyme, carrier) offer another basis for classification, both relating to the chemistry of a protein, and basic spectroscopic data, like those of circular dichroism (which suggest an estimate of the relative amounts of β-stranded vs. α-helical structure), permit classification to the all-α, all-β or mixed α/β classes. However, classification schemes based on general attributes (e.g., the physicochemical properties of proteins) suffer from heterogeneity within their classes. For instance, a number of otherwise unrelated proteins can be classified as “metalloproteins.” In general, two requirements with opposing effects should be satisfied by any classification scheme: specificity, which leads to particularization (i.e., a higher number of narrower classes) and abstraction, which leads to generalization (i.e., a smaller number of wider classes). In the end, a comprehensive and useful hierarchy is a trade-off between specificity and abstraction (i.e., the most general classes possible that are still useful in some desired way).
Proteins, the structures of which represent successful solutions to the problem of thermodynamic stability and at the same time can accommodate a biologically useful function, provide the basis of all kinds of radiant variation at the level of protein sequence (and consequently function). Each protein variant, that survives the evolutionary pressure of competition against other potential variants, has emerged after a series of modifications of various extents; an explanation is presented later on why this is the preferred mode of action. Common ancestry classification schemes provide the specificity necessary to define sensible protein classes, in contrast to those classification schemes, which follow general features. In the former, all members of each class share a common tertiary structure across very wide evolutionary spans, while similarities at the level of amino acid sequence remain exploitable, even in cases where they are hard to detect. Therefore, evolution-based classification schemes are not driven by our natural impulse to categorize objects drawing arbitrary borderlines, but reflect basic principles of the protein nature. In fact, classification with respect to evolutionary history and structure comes so naturally, that when function is not preserved, we tend to refer to a “-like” form within the same family of proteins, rather than to a different family.
Protein sequences derived from a common ancestor by divergent evolution, share a high degree of similarity (both with each other and naturally with their ancestor, although the latter may be unknown). This similarity persists over quite a wide evolutionary span, before it is worn out by divergence and rendered undetectable by direct pair-wise sequence alignments. Conveniently, it is highly unlikely that proteins without common evolutionary origin share a high degree of similarity; in fact, the higher the similarity the more recent the speciation. It will be shown how these nearest relatives provide the guidelines to identify the features that are crucial for the definition of a family of proteins, before the detection of the most remote relationships is attempted. In conclusion, the amino acid sequence offers a highly specific key to classification, albeit intermediary members, and structure may need to be consulted, before any remote members of a class can be detected. The evolution-based classification schemes, as well as the tools available over the web to explore them, constitute the subject of the following notes. Many researchers in the relevant fields tend to take simple homology searches and domain assignment tools for granted, until an unexpected outcome sheds doubt and confusion; it is the authors’ intention that by the end of this chapter, the reader will be capable to conduct those (otherwise routine) tasks with a higher degree of both awareness and confidence.
2. Materials
The procedure of protein classification comprises several more or less independent steps. Although these steps have been arranged (in the present notes) in the order they are usually employed, this order can change, depending on the nature of information available at each point. Steps can also be omitted, if they are unnecessary or their target has already been accomplished (although performing them will provide further reassurance). Each of the steps described is a small protocol in its own right; a number of web tools, some of them in a number of variations, implement each of those steps. However, improvement of user friendliness on one hand and users’ skills on the other has made the procedure look like a single protocol; in fact, sometimes automation hides a number of steps of which only the results can be viewed, in the form of a compiled web page. Instead of listing the websites of all relevant tools, a small and comprehensive selection of entry points is suggested in Table 1, via which a wealth of tools is then accessible. All of those websites provide user friendly interfaces. It is suggested that the reader browses (and gets familiar with) at least those main websites, before attempting to delve deeper into the realm of web-based analysis tools.
Table 1
Main Entry Points to the World Wide Web for Protein Classification
ExPASy (www.expasy.org)
  A wide range of software tools for the analysis of protein sequences and structures as well as 2D PAGE, can be found here. It also offers an entry point to a rich collection of other web sites, mainly the SwissProt/UniProt databases
BLAST (www.ncbi.nlm.nih.gov/BLAST)
  A convenient starting point for on-line search of sequence databases (both protein and DNA ones). Many other sites feature some version of BLAST as well
EnsEMBL (www.ensembl.org)
  A collection of complete genomes, which offers an entry point from a different view – that of a genome rather than that of a sequence
Pfam (www.sanger.ac.uk/software/pfam)
  A collection of profiles of protein families against which a sequence can be matched, for initial domain recognition
Protein data bank (www.pdb.org and www.rcsb.org)
  The archive of experimentally determined 3D-structures (by crystallography, NMR, and other techniques) of biological macromolecules (proteins, nucleic acids, sugars, etc.)
InterPro (www.ebi.ac.uk/interpro)
  An effort to integrate information from several diverse sources to a unified comprehensible form
3. Methods
3.1. Theoretical Issues: Classification Based on Sequence or Structure?
The specifics that define a set of sequences as a protein family (i.e., molecular function and involved amino acid residues, other kinds of sequence fingerprints, post-translational modification, etc.) have to be accommodated within a structural framework (Fig. 1). However, 3D structure is not reserved for one protein family. In fact, there seems to be a countable set of spatially local packing arrangements between α-helices and β-sheets, which, when combined,
Fig. 1. Complex shapes can be misclassified by a general property like size, because of small (or larger) parts missing in relation to the simplest forms from which they derive. More specific (“shape-related”) attributes can bring all stars (and parts thereof) together, as they can do with triangles, squares, and circles. Once a proper overall scheme is in place, general attributes (like color) can then detail the distribution within each class.
lead to 3D structural assemblages, stable in terms of thermodynamics and useful in terms of function (1). The participant elements may be distant along the sequence or they may even belong to different chains. The small number of packing options leads to the occurrence of common 3D structural themes, termed the recurrent tertiary motifs, e.g., “up-and-down” helical bundles, β-barrels, etc. Descriptions at this level of abstraction take into account neither the sequential order of the helices and strands nor their length. Tertiary structural domains in proteins of unrelated evolutionary origin (or function) with apparently unrelated sequences may adopt the same tertiary motif, usually including further 3D structural elements [(2); see also Note 1]. It can be claimed that the abstract idea of a recurrent tertiary motif leans toward the basic packing arrangements, whereas the implemented domains are closer to the protein families. The 3D environment of certain positions on the structure (a different set of positions for each recurrent tertiary structural motif) poses physicochemical
       5-vdef sNIR[enpvtpwnpeps]
     : * : * + : *+
R1:  A PVID PT AYID PE ASVI G
R2:  E VTIG AN VMVS PM ASIR S[degm]
R3:  P IFVG DR SNVQ DG VVLH A[letineegepiednivevdgkey]
R4:  A VYIG NN VSLA HQ SQVH G
R5:  P AAVG DD TFIG MQ AFVF -
R6:  K SKVG NN CVLE PR SAAI -
R7:  G VTIP DG RYIP AG MVVT - <sqaea..>
     ------------------------
CNS: a VfIG DN vyIa pQ AvVh (g s) (Consensus)
       BS#1 T1 BS#2 T2 BS#3
Fig. 2. The seven repeats that form the β-helix in MT-γCA demonstrate the level of the impact that structure can have on sequence. The β-strands (groups of four residues) are shown, separated from the intervening “turns” (groups of two). The turns that connect successive repeats are split: one residue at the left end and a second one, which is missing in some cases, at the right end. Parts of the sequence in square brackets [] are intervening connecting loops; the part in angle brackets <> follows this core motif and is not part of the repeat sequence. The His residues that coordinate the Zn atom are underlined, and stem from positions (within the repeat) marked by a plus sign (+). A partial repeat (every six positions) has been proposed on the basis of other sequences that adopt this structure; the positions marked by stars (*) correspond to main positions in this (partial) repeat, and the ones marked by colon (:) correspond to the secondary ones. No repetition of this kind (i.e., every six positions) is apparent for any other positions, leaving the 17–18-residue-long repeat unit as the only complete one. Positions Asn10–Arg12 (top row) form a small extension of β-sheet #3; preceding residues are shown for completeness and only to emphasize that the repeat does not extend into them. In the consensus, drawn at the bottom row, the main ingredients of the repeat unit are shown in capital letters.
requirements, which can usually be best met by one or a few amino acid types, thus defining a scale of preferences (3). These preferences are reflected onto patterns that may arise at the level of the primary sequences (that adopt the relevant recurrent tertiary motifs), whenever these spatially defined positions are close along the sequence (Fig. 2). It should be noted that these patterns are reflections along the sequence of the abstract tertiary theme and that they are much more general than the detailed protein family-specific sequence fingerprints. Simplified lattice models suggest that a small number of 3D structural motifs set loose requirements that can be met by a large number of sequences, along their evolutionary pathway (4). In this case, nature appears to reuse a
successful structural solution in evolutionarily unrelated sequences (see Note 2). On the other end, a large number of 3D structural motifs pose requirements so manifold and exact that only a few sequences can be compatible with them. The resultant patterns of preferences along the sequence appear occasionally strong enough to permit structural motif prediction from the sequence alone (5). It can be claimed that no more than 200 recurrent tertiary structural motifs (the exact number depending on the stringency of their definition) provide the structural basis of perhaps 95% of the nonredundant set of protein structures (2). The average residue coverage is a much smaller figure due to the need of additional structural elements to complete a domain. Vice versa, a large number of tertiary structural motifs are so rare, that they provide the basis of the small remaining proportion of protein structures (see Note 3). Detailed specialization into families takes place within this structural framework: Chothia (6) has long ago estimated that 95% of the protein information to be discovered will derive from no more than 1000 protein families. In fact, for a substantial (and growing) proportion of any newly identified protein sequences, enough information already exists in the databases to build a 3D model (7). The reason for this lies on a simple fact: during the creation of new protein families, the relatively small number of structural alternatives directs nature to a strong preference for the reuse of already successful solutions at the level of sequence (not structure), especially when similar problems are to be solved rather than discovering new ones, on the basis of the same or different structure. The traits being inherited along reuse of sequences are usually the ones to be exploited in protein classification. 
On the other hand, this small set of structural motifs, the ones easily accessible to protein families of irrelevant origin and/or function, occasionally leads otherwise unrelated proteins to elevated sequence similarity scores (which sometimes appear too high to be explained by chance), just because they fold in the same manner (see Note 4). The traits being developed (as opposed to being inherited) reflect convergent evolution. Protein structure has also served as the basis of classification in some schemes. However, the theoretical considerations, which have been discussed herein (in particular, the fact that unrelated proteins may fold in the same way), hint that classification on the basis of 3D structure alone, will tend to be on a coarser scale. On the other hand, the availability of detailed structural data for a (preferably representative) member of a protein family, experimentally derived by means of X-ray crystallography or NMR spectroscopy, besides all kinds of facilitation reserved for other procedures (e.g., structure-based protein design), offers a valuable aid in sequence-based classification. It provides a very solid ground to assess any sequence-based classification, and a great tool to detect the most remote members. However, unless classifying protein structure per se
Paliakasis et al.
(rather than proteins in their entirety), it appears that a common structural architecture alone is not sufficient evidence to classify proteins in the same class. Evolutionarily refined variants of tertiary structural domains, “similar-yet-different” within a given repertoire, appear in different combinations with those of other repertoires: a domain for a different cofactor or regulatory factor (e.g., GDP vs. ADP) may be combined with a catalytic domain for a slightly different substrate (fructose vs. glucose). Thus, the most complicated and best tuned series of (simpler) functions necessary for life can be accomplished in a spatially ordered and efficient manner. On the other hand, this fact makes it essentially imperative that any classification proceeds down to the level of domains: it suffices to describe any sequence in question as comprising “an N-terminal domain of type X and a C-terminal domain of type Y, joined by a loop region of type Z”; otherwise, extensive subtyping and the “Russian doll” effect (see Note 5) will soon be confronted. In practice, the classification procedure starts with the detection of some similarity between a protein (or part thereof) and a prototype (e.g., a profile extracted from a multiple alignment or a structure through which it is threaded), which is too high to be explained by chance alone. The tools to demonstrate this similarity are presented under Subheading 3.2. In any case, it will be the network of similarities within a set of data (sequences, structures, etc.) that will clarify the underlying reason for the observed similarity.
3.2. The Practical Side
It cannot be stressed enough that most protein sequences are nowadays translations of the corresponding nucleic acid sequences. It is important to identify cDNA originals if possible, to ensure that the employed nucleic acid sequence corresponds to the protein in a reliable way.
When the original data are supplied in the form of genomic DNA fragments, introns could still be included and alternative splicing remains a possibility. Current gene recognition programs like GENSCAN (8), normally expected in genome-oriented databases like Ensembl (9) (see Note 6), can efficiently detect and remove introns, but errors may still creep in. If this is the origin of the protein data, certain precautions should be taken:
• Search for related proteins with reliable sequences, e.g., by means of a preliminary Basic Local Alignment Search Tool (BLAST) (10) search against SwissProt (11).
• Align the sequence of interest to any trustworthy matches and observe the pattern of conservation. Sudden insertions in the sequence in question (especially ones with highly biased composition, short tandem repeats, or full or partial repetitions of other parts of the protein) do not necessarily represent extra features or minidomains; conversely, parts that appear deleted may have been mistakenly considered to be introns.
Web-Based Tools for Protein Classification
• Isolate “candidate” insertions and try to find similar sequences in the databases; see if any trustworthy match makes sense in terms of biology.
• Alternatively, try finding a protein in the Protein Data Bank (PDB) (12), which is similar (even remotely) to the one in question (excluding the insert), and has its 3D structure experimentally known (see Note 7). The location of the candidate insertion/deletion on the structure may verify or reject it.
• Parts of the query protein matching expressed sequence tags (ESTs) (13) provide an extra source of verification (see Note 8): a part matching an EST is an expressed part.
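The step of isolating “candidate” insertions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two gapped strings are taken from a pairwise alignment produced elsewhere, and the function names, the `min_len` cutoff and the crude composition measure are hypothetical choices, not part of any tool cited in this chapter.

```python
from collections import Counter


def candidate_insertions(query_aln, ref_aln, min_len=8):
    """List query segments that face only gaps in a trusted reference.

    Both arguments are gapped strings of equal length from a pairwise
    alignment. Returns (start, end, segment) tuples, where start/end are
    0-based, end-exclusive positions in the ungapped query.
    min_len is an arbitrary illustrative cutoff.
    """
    if len(query_aln) != len(ref_aln):
        raise ValueError("aligned strings must have equal length")
    found = []
    pos = 0            # position in the ungapped query
    run_start = None   # query position where the current insertion began
    run = []
    for q, r in zip(query_aln, ref_aln):
        if q != "-" and r == "-":
            if run_start is None:
                run_start = pos
            run.append(q)
        else:
            if run_start is not None and len(run) >= min_len:
                found.append((run_start, pos, "".join(run)))
            run_start, run = None, []
        if q != "-":
            pos += 1
    # flush an insertion that reaches the end of the alignment
    if run_start is not None and len(run) >= min_len:
        found.append((run_start, pos, "".join(run)))
    return found


def top_residue_fraction(segment):
    """Fraction of the segment taken by its most common residue (crude bias measure)."""
    return Counter(segment).most_common(1)[0][1] / len(segment)


# Hypothetical example: a glutamine-rich stretch absent from the reference.
query = "MKTAY" + "Q" * 10 + "LVK"
reference = "MKTAY" + "-" * 10 + "LVR"
for start, end, segment in candidate_insertions(query, reference):
    print(start, end, segment, top_residue_fraction(segment))
```

Segments reported this way are only candidates; whether they are genuine features or unremoved introns still has to be judged against database matches, known 3D structures, or ESTs as described in the list above.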
Other criteria may apply to verify the integrity of a processed putative gene. For example, if the protein has been biochemically characterized, then any experimentally observed property must match those predicted from the gene-derived sequence (or there must be a good reason why it does not). Another very serious issue is the fact that many annotations are automatically transferred between similar sequences of the same or different databases. Even SwissProt entries are crowded with annotations assigned “by similarity.” The number of proteins with primary annotations is many orders of magnitude smaller than the number of annotated sequences in the current databases. Transferred annotations should therefore be considered as hints that can direct experiments to promising routes, rather than as secure data.
3.2.1. Preprocessing the Query
A preliminary checkup of the protein sequence itself is recommended. Repeats and parts of low complexity are of particular interest.
3.2.1.1. Repeats
Regularities in biological macromolecular structure (like the helical nature of DNA or the super-coiled structure of some protein assemblies) and multimerization create room for repetitions along the protein sequences. Repeats can range in length from a few amino acid residues to complete domains (e.g., as a result of domain duplication). In the latter case, the repetition count is usually small, just two to three copies (14) although much higher counts do occur. When catalytic domains are repeated, the situation may have no ground on structural regularities; it may for instance reflect a need for efficiency (e.g., cooperativity between different copies of a domain). In database searches for multidomain protein queries, it is anyway recommended to treat different domains separately, for reasons explained later on; the difference here lies in the fact that the separate copies can be aligned, and their consensus (or profile) can be extracted and serve as the query.
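As a minimal sketch of extracting a consensus from aligned copies of a repeated domain, the following Python function takes the most common residue in each alignment column. It assumes the copies have already been aligned (gaps written as `-`); the function name, the `min_fraction` cutoff and the use of `X` for poorly conserved columns are illustrative assumptions rather than a prescribed method.

```python
from collections import Counter


def consensus(aligned_copies, min_fraction=0.5, unknown="X"):
    """Column-wise majority consensus of pre-aligned, equal-length copies.

    A column contributes its most common residue if that residue occurs in
    at least min_fraction of all copies; otherwise `unknown` is written.
    Columns consisting only of gaps are skipped.
    """
    length = len(aligned_copies[0])
    if any(len(copy) != length for copy in aligned_copies):
        raise ValueError("all aligned copies must have the same length")
    out = []
    for col in range(length):
        residues = [copy[col] for copy in aligned_copies if copy[col] != "-"]
        if not residues:
            continue
        residue, count = Counter(residues).most_common(1)[0]
        out.append(residue if count / len(aligned_copies) >= min_fraction else unknown)
    return "".join(out)


# Hypothetical aligned copies of a short repeated segment.
print(consensus(["GKSGT", "GKSGS", "GKTGT"]))
```

A plain consensus discards the information about which positions tolerate variation; a profile keeps it, which is why the text mentions both options.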
On the other hand, short tandem repeats (e.g., about 10 amino acid residues long or shorter) normally reflect some structural regularity. In a dot-plot style alignment of a protein sequence to itself, they manifest themselves as a (moderate-to-high) number of tracks, which run parallel to the main diagonal (and to each other) in a regular manner (Fig. 3). Since combinations of parts coming from different tracks produce significant alternative alignments, procedures that attempt to report all possible alternative alignments between two proteins will be severely confounded (see Note 9 on BLAST in particular). A consensus or a profile may again be extracted by a proper alignment of the repeats. However, statistically significant matches cannot be expected for a resultant query only (say) 6 or 12 amino acid residues long. One possible cure is to concatenate a small number of repeats, to produce a query no longer than 50 amino acid residues (see Note 10 on why 50). The small number of repeats (e.g., four repeats of length 11) helps avoid the explosion of alternatives, although a few of them cannot be completely avoided. If this step is taken, it is suggested that the output of a dot-plot utility (such as DOTLET, a Java-based tool hosted on the ExPASy server; Table 1) is consulted at all times.
3.2.1.2. Parts of Low Complexity
Low complexity occurs when some part of the sequence comprises only a few types of amino acid residues, leading database queries to nonspecific results (see Note 11); the situation can be even worse if some of these types are similar to each other. In general, it is important to know beforehand any significant deviations of the composition in types of amino acid residues, as well as the presence of special features such as signal peptides or groupings of biologically relevant charged side chains (see Note 12). Relevant search procedures, like BLAST (10), detect stretches of low complexity and offer to ignore them during the search; however, what appears to be a part of low complexity may be, e.g., a transmembrane stretch. The action to take depends on both the importance and the position of the stretch:
• If a single transmembrane part makes sense (or is known to exist), the extra- and intracellular moieties can be separate queries.
• A signal peptide (especially when located at the extreme of the N-terminus) usually can be excluded from the procedure, profitably or at least without problem.
• A stretch of low complexity, which appears to be of no special significance in terms of structure/function/evolution, can be best left to the search procedure to mask it.
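To get a first impression of where low-complexity stretches lie before deciding how to treat them, a sliding-window composition check can be sketched as follows. This is not the masking algorithm used by BLAST; the Shannon-entropy measure, the window length of 12 and the 2.2-bit cutoff are illustrative assumptions that should be tuned to the sequence at hand.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq, window=12, max_entropy=2.2):
    """Return merged (start, end) spans, 0-based and end-exclusive, covering
    every window whose composition entropy is at or below max_entropy."""
    spans = []
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) <= max_entropy:
            if spans and i <= spans[-1][1]:
                # overlaps or touches the previous span: extend it
                spans[-1] = (spans[-1][0], i + window)
            else:
                spans.append((i, i + window))
    return spans


# Hypothetical sequence with an alanine-rich stretch in the middle.
example = "ACDEFGHIKLMNPQRSTVWY" + "A" * 20 + "ACDEFGHIKLMNPQRSTVWY"
print(low_complexity_windows(example))
```

A composition measure of this kind cannot tell a meaningless biased stretch from a transmembrane segment or a signal peptide, so the reported spans still need the case-by-case judgment described in the list above.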
Relevant tools are available from the Web (e.g., the ExPASy site). Alternatively, a simple dot-plot style alignment of the protein sequence can be run vs. itself. Besides repeats, this will reveal areas of low complexity as square blocks of elevated average score, symmetrical around the main diagonal (Fig. 3). If low
complexity occurs within the boundaries of a repeat, similar square blocks will appear around relevant parallel off-diagonal tracks.
3.2.2. Inference of Domains
In the spirit of the theoretical analysis earlier in this chapter, classification can take the form of assigning parts of the sequence to domains. Hence, using a domain-inferring tool like the ones offered by Pfam (15) and SMART (16) should be among the first steps for classification of a protein based on its sequence (see Note 13). This information serves to divide the sequence of interest into pieces and handle them separately (see Note 14). Given the high coverage achieved by those collections (more than 75% of the proteins have at least one domain recognized by them, and on average about two-thirds of the length of a protein can be described this way) (15), some protein sequence classification efforts end here (see Note 15). In fact, database search procedures should soon be expected to exploit high-level features extracted from the query and relevant sequences, resorting to amino acids alone only for parts where those attempts fail.
3.2.3. Querying Other Databases
Despite the current high coverage of protein sequences in terms of known domains, parts of these sequences still elude assignment. These parts may simply be too distant members of the families they belong to, so that they have failed the thresholds of automatic procedures. Those parts should be isolated, properly preprocessed (mainly for compositional biases), and queried against SwissProt and PDB.
• Entries (records) in SwissProt (11) offer rich annotation and cross-references to a number of resources, all in a mainly human-readable form and via a user-friendly interface. The high level of curation (including annotation derived by similarity) will save duplicate efforts and may provide valuable hints on how to move on.
Fig. 3. (A) Schematic representation of a dot-plot style alignment of a protein against itself; to depict the special cases presented in the text, the protein is supposed to feature two copies of some domain, a low complexity N-terminus and a C-terminal part dominated by some short internal repeat, except for a tail, which appears unique. (B) Alignment of a small part (from a real protein) of low complexity against itself. The situation here is worse than suspected, because the few types of amino acid residues are related to each other (alanine to valine and glycine and, to a lesser extent, to proline and serine).
• Search for similar sequences in PDB (12) will reveal experimentally determined 3D structures of protein instances, possibly related (e.g., through evolution) to the protein of interest. A 3D structure offers a model (even before a model of the query sequence is built, following this information) to think on, a toy on which to visualize and handle data in far more efficient ways (see Note 16).
If domains are inferred by the relevant procedures (or supplied by SwissProt annotation) and/or long stretches (say 30–40 amino acid residues or longer) of special behavior are observed, it is a good idea to handle each sequence part separately, or in small meaningful combinations; for instance, there may be no reason to treat a propeptide separately from the main body of the domain it belongs to (see Notes 17 and 18).
If a few top hits of a database search can be aligned to the query with confidence, and the next ones are marginal (see Note 18), the multiple alignment of the best hits (including the query) should be converted to some kind of profile [e.g., a position-specific scoring matrix (PSSM)] and the database should be scanned with the resulting profile (see Note 19). The marginal hits of the initial query (i.e., the protein of interest) that match positions conserved throughout the profile will have their statistical significance increased and they will surface. If domain-inferring programs can detect some kind of domain on those (initially marginal) hits, this information can then be transferred to the initial query with confidence (recall that the query is a part on which no domain was detected).
Sometimes even the few top hits will be marginal (see Note 18). In that case, each of the “best” marginal hits should be used as a query, and a number of homologues (about 10; see Note 20) should be collected and aligned without the initial query (i.e., the protein of interest). A profile (e.g., a PSSM) should be produced from each of those alignments, and the relevant part of the initial query should be aligned against it. If the initial query matches the profile at conserved positions (see Note 21), the hit was not fortuitous. Again, if domain-inferring programs can detect some kind of domain along the sequences that formed the profile, this information can then be transferred to the initial query with confidence.
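The idea of converting a multiple alignment into a profile and checking whether a query matches its conserved positions can be illustrated with a deliberately simplified PSSM. The sketch below assumes a uniform background (1/20 per residue), a flat pseudocount and gap-free scanning; these are simplifying assumptions made for illustration and do not reproduce the weighting schemes of PSI-BLAST or Pfam. All names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Log-odds PSSM (bits) from equal-length aligned sequences.

    Gaps and non-standard symbols are ignored when counting.
    Returns one dict per alignment column, keyed by residue.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def best_match(pssm, query):
    """Slide the profile along the query without gaps.

    Returns (best_score, offset), or None if the query is shorter
    than the profile.
    """
    width = len(pssm)
    best = None
    for offset in range(len(query) - width + 1):
        window = query[offset:offset + width]
        score = sum(col.get(aa, 0.0) for col, aa in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, offset)
    return best


# Hypothetical alignment of homologues and a query fragment.
profile = build_pssm(["GKSGT", "GKSGS", "GKTGT", "GRSGT"])
print(best_match(profile, "MLAGKSGTLV"))
```

Columns in which one residue dominates receive the highest scores, so a query that agrees with the profile precisely at those conserved positions stands out, which is the criterion the text uses to decide that a marginal hit was not fortuitous.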
Other databases provide high-level annotation for specific tasks. InterPro (17) offers a convenient entry point to a number of them, especially for manual sequence classification (as opposed to some massive automated procedure). SUPERFAMILY (18) builds information based on classification of 3D structures (a hit here implies structural similarity regardless of common function or evolutionary origin). PRINTS (19) and PROSITE (20) are further examples, and one may continue with a long list in which each member targets a specific problem (e.g., if the protein of interest is found to be a peptidase, MEROPS (21) may be consulted for further relevant classification).
4. Notes
1. Often it is just a simple operation (e.g., a function) that is implemented by (part of) the sequence as a 3D domain. For instance, there are tertiary structural domains that simply bind a cofactor and feature an allosteric position, where some regulatory factor (e.g., ADP) will dock to exert its role. The active site may reside on a separate domain, or may be shared between two of them, within the range of the cofactor. 2. Unpublished work (C.D.P., Ph.D. thesis) in continuation of (3) suggests that the requirements set (albeit too vaguely) by an α-helical “up-and-down” bundle, which is an abundant tertiary structural motif, raise the relevant parts of the sequence to the extreme 0.1–1% of a suitable distribution, when proteins in a databank are scored for compatibility. This shift is not enough for structure prediction from the sequence alone (too many false positives), but it still reflects a possibly minimal set of requirements posed by the structure for compatible sequences. 3. There is a tendency to treat the observed structural solutions, i.e., the recurrent tertiary structural motifs and domains, as the end evolutionary product of our days. In fact, all the preceding evolutionary steps (as well as the future ones, probably) had to employ one of the solutions provided in this relatively narrow set. If we depict this set so that similar architectures are close to each other, then “evolution” is a “walk” through this set. Whether this set is continuous or partitioned in a discontinuous manner is the subject of ongoing research. 4.
A continuum is thus established in the scale of similarities between protein sequences: on one end, the small biases due to simple facts (e.g., two transmembrane pieces are coincidentally matched); in the middle of the scale, remote similarities due to common structural architecture; and on the other end, 30% (or more) identity observed due to common origin, as between a mammalian protein and its bacterial homologue (and, usually, more than 80%, e.g., between mammals). 5. This effect characterizes the situation in which a particular domain includes a smaller one, plus some extra structural elements (“decorations”); then, the new total constitutes part of a larger domain, which includes some further structural elements, and so on. Orengo and coworkers (2) have presented a number of examples in their series of papers on classification of protein structure. 6. The version of BLAST featured in Ensembl can run against the results of GENSCAN; this does not simply translate genomic DNA into open reading frames (ORFs) before comparison, but it also attempts to “splice” it, after predicting and removing potential introns. Other task-specific databases feature relevant tools. 7. The version of BLAST at the National Center for Biotechnology Information (NCBI) has access to all protein sequences of known structure. Alternatively, the PDB resource (Table 1) can be directly accessed for this purpose, losing however the interconnection to other databases offered by NCBI. 8. As in Note 7, access by the means provided by NCBI is recommended.
9. For example, BLAST seeks all the instances where a small part of the query matches a database sequence. Then, to form longer alignments, BLAST, depending on its version, either expands these “seed alignments” to contiguous subalignments, uninterrupted by gaps, which are then joined in all valid combinations, or expands the seeds in a gapped alignment fashion. The presence of short repeats may make the output particularly hard to follow, due to the numerous alternatives. 10. Sander and Schneider (22) suggest that the minimum percentage of identity between two proteins that is required to imply structural similarity converges to about 27% for a common alignment length of about 80 amino acids. However, the change in the range of 50–80 is too small to justify inclusion of further repeats, which would increase the number of alternative alignments. See also Note 18. 11. For instance, assume that a stretch, about 20 amino acids long or longer, is dominated by leucine, isoleucine, and perhaps a couple of phenylalanines. Not only will this part be nonspecifically matched to any sequence that features a similar deviation in composition, but the resulting alignment will also appear unstable in this part, because of the numerous and almost equivalent alternative ways in which two stretches of the kind can be aligned. 12. For example, a large deviation toward lysine and alanine will make the sequence look like a histone. When a databank is scanned for similar peptide sequences, the results will tend to include nonspecific stretches rich in positive (and, to a lesser extent, negative) charges in general. 13. The NCBI/BLAST Server (Table 1) offers CDD (Conserved Domain Database), which is based on both Pfam and SMART, further including collections internal to NCBI. Other servers may offer similar compilations. However, for detailed inquiries one may need to resort to the original resources. The information presented by the original collection can be much richer.
Furthermore, each specialized collection offers tools for flexible searches in terms of combinations of various domains, to help detect proteins of similar architecture, reference similarities to other related domain, and so on. 14. The fact that tertiary structural domains tend to behave independently should be exploited. Bench work can usually be facilitated by studying isolated domains, e.g., if some part of a protein makes the molecule hard to crystallize, the relevant information (if available) could indicate which part to remove. Information derived using domain inferring tools can serve to divide a sequence of interest into meaningful pieces. Bioinformatics work may as well get similar profits, e.g., during databases search: assume for example that a protein includes a general hydrolase domain (e.g., an esterase), which is found in many combinations with other domains, which particularize its use; and it also contains a domain, which is specific for the family this sequence belongs to. It will be the latter that will boost the most relevant sequences to the top of the sorted list of BLAST results; accordingly, it will be the one to drive the query protein to the correct subfamily within the framework of a larger family.
15. In the case of multidomain proteins, each hit to a constituent domain (or a significant part of it), signifies the existence of a related part in the databank. Occasionally, some domains will seem apparently missing: either the relevant part of the sequence appears deleted or an expected domain is not recognized along it. Given the statistical nature of the recognition procedure and the nucleotide nature of underlying primary data, the tempting conclusion that this domain/part is not present, is by no means secure. • If the relevant part of the sequence is present, you may check whether domains, which were recognized by domain inference programs along remote homologues of this part, can be transferred by means of alignment involving preformed multiple alignments, as described in Subheading 3.2.3 for the case of remote hits. • If the relevant part of the sequence seems absent, then despite the efficiency of genetic data manipulation procedures, parts of the sequence may have been accidentally considered as introns. Once some major part of a multidomain protein has been located on the complete genome, the hits should serve as pointers to the location to search more carefully at. Perhaps the next generation of data-mining will perform this retro-search of missing parts automatically (like the iterative BLAST is performed today). Until then, and in spite of the times of high-level annotation (which will retrieve the major part of the information being hunted) one should be ready for straightforward TBLASTN of minor parts of the sequence in hand to rule out their existence conclusively and beyond reasonable doubt. 16. When an experimentally determined 3D structure for a similar sequence exists in PDB, then the sequence of interest and the matching structure can be input to some automated model building server (like the SwissModel Server; some servers may also need a ready made alignment between the two) and get a 3D structural approximation of the query protein. 
If nothing else, inspection of this model will explain any mutational data available and will reveal key locations for experimentation by means of site-directed mutagenesis and other kinds of modification and querying (instead of blind trials along the sequence), in order to infer the mechanism of function or other valuable information. If the quality of the alignment is poor, but both the sequence and the structure can be aligned to e.g., a profile, this intermediary link can mediate alignment between the protein of interest and the distantly related sequence of known structure. Alternatively the remote match may serve as the query to retrieve further sequences homologous to the hit, in order to align the original query to their preformed multiple alignments, as it is described under Subheading 3.2.3. 17. The expectancy value (E-value) provided with the sorted hit list by BLAST depends on the product of the length of the database by the length of the query. Assuming that matching counterparts exist for just one of the domains and that this domain comprises a small part of the total protein, BLAST may
miss matching hits of marginal similarity, just because the length product was unnecessarily (thanks to domain independence) too large. 18. The expectancy value should be regarded as only a rough measure. It would be a more accurate measure of the expected number of hits if databases were nonredundant (i.e., they contained absolutely nonhomologous sequences) and there were no biases toward specific types of amino acid residues or toward sequence patterns (e.g., the amphipathic ones met in α-helices, which account for about one quarter of protein structure in general). Besides, Sander and Schneider (22) have long shown that as soon as a subalignment of a given size exceeds a relevant level of identity, 3D structural similarity can be assumed, independently of the length of the proteins that participate in the comparison or the number of sequences to which the query is compared. They suggest a threshold $t(L) = 290.15 \times L^{-0.562}$ for L < 80 and about 27% for L > 80; cases with an identity level higher than t(L) are assumed to have related structure, allowing only a small acceptable number of false positives. Alignments lying below the line defined by this equation do not necessarily signify proteins of unrelated structure; for them, structural similarity, if existent, simply cannot be asserted with confidence. Similarity is rendered more and more improbable as the relevant figures decrease. 19. Details on how to make or use a PSSM may change with implementation. It is worth spending some time on the on-line help offered on PSSMs under their implementation at the NCBI. In any case, Clustal (23) may be used to align a sequence to a block of prealigned sequences, or even to align two preformed multiple alignments. In both cases, if conserved positions in the “reference” block are conserved along the query sequence (or the query block), the match is reliable. Pfam (15) offers the tools for another approach involving hidden Markov models, the explanation of which is beyond the scope of the present notes. 20. Following the results of Henikoff and Henikoff (24,25), it seems that about 10 homologues are usually already enough, with the reservation that they should cover, if possible, the whole range of similarities from 90% down to 40–30%. If all of them are too similar to each other, it will be as if the same sequence was included 10 times. If all of them are too dissimilar to each other, then the risk of mistakes in their multiple alignment will be too high. 21. As a reassurance, in case a hit is correct, some of the sequences that are homologous to the hit should have appeared in the hit list of the initial search (i.e., the one in which the protein of interest was the query sequence). If just one protein from a large family was reported, chances are that the hit was coincidental.
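The length-dependent identity threshold attributed to Sander and Schneider (22) in the notes can be turned into a small helper. The sketch assumes that the garbled exponent reads −0.562, and it uses the plateau of 27% exactly as quoted in the text; the power law itself evaluates to roughly 24.7% at L = 80, so the plateau is left as a parameter. Function names and the cap at 100% are illustrative assumptions.

```python
def identity_threshold(length, plateau=27.0):
    """Percent identity above which structural similarity is assumed.

    Uses t(L) = 290.15 * L**(-0.562) for alignment lengths below 80
    (capped at 100%) and a flat plateau from 80 residues upward.
    The plateau default is the figure quoted in the text.
    """
    if length <= 0:
        raise ValueError("alignment length must be positive")
    if length < 80:
        return min(100.0, 290.15 * length ** -0.562)
    return plateau


def implies_similar_structure(percent_identity, length):
    """True if an alignment lies above the threshold curve."""
    return percent_identity > identity_threshold(length)


# Thresholds for a few alignment lengths, including the 50-residue
# query length recommended for concatenated repeats.
for n in (20, 50, 79, 80, 150):
    print(n, round(identity_threshold(n), 1))
```

An alignment that falls below the curve is not thereby shown to be unrelated; as the notes stress, structural similarity simply cannot be asserted with confidence in that region.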
References 1. Richardson J.S. and Richardson D.C. (1989) “Principles and patterns of protein conformation.” In: Fasman G. (ed) “Prediction of Protein Structure and the Principles of Protein Conformation.” Plenum Press, NY, pp 1–98.
2. Orengo C.A. and Thornton J.M. (2005) “Protein families and their evolution – a structural perspective.” Annu. Rev. Biochem. 74, 867–900. 3. Paliakasis C.D. and Kokkinidis M. (1992) “Relationships between sequence and structure for the four-α-helix bundle tertiary motif in proteins.” Protein Eng. 5, 739–748. 4. Lattman E.E., Fiebig K.M. and Dill K.A. (1994) “Modeling compact denatured states in proteins.” Biochemistry 33, 6158–6166. 5. Lupas A., vanDyke M. and Stock J. (1991) “Predicting coiled-coils from protein sequences.” Science 252, 1162–1164. 6. Chothia C. (1992) “One thousand families for the molecular biologist.” Nature 357, 543–544. 7. Schwede T., Kopp J., Guex N. and Peitsch M.C. (2003) “SWISS MODEL: an automated protein homology modeling server.” Nucleic Acids Res. 31, 3381–3385. 8. Burge C. and Karlin S. (1997) “Prediction of complete gene structures in human genomic DNA.” J. Mol. Biol. 268, 78–94. 9. Hubbard T., Andrews D., Caccamo M., et al. (2005) “Ensembl 2005.” Nucleic Acids Res. 33, D447–D453. 10. Altschul S.F., Madden T.L., Schäffer A.A., Zhang J., Zhang Z., Miller W. and Lipman D.J. (1997) “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.” Nucleic Acids Res. 25, 3389–3402. 11. Bairoch A., Apweiler R., Wu C.H., Barker W.C., Boeckmann B., Ferro S., Gasteiger E., Huang H., Lopez R., Magrane M., Martin M.J., Natale D.A., O’Donovan C., Redaschi N. and Yeh L-S.L. (2005) “The universal protein resource (UniProt).” Nucleic Acids Res. 33, D154–D159. 12. Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N. and Bourne P.E. (2000) “The protein data bank.” Nucleic Acids Res. 28, 235–242. 13. Boguski M.S., Lowe T.M.J. and Tolstoshev C.M. (1993) “dbEST – database for expressed sequence tags.” Nature Genet. 4, 332–333. 14. Apic G., Gough J. and Teichmann S.A. (2001) “Domain combinations in archaeal, eubacterial and eukaryotic proteomes.” J. Mol. Biol. 310, 311–325. 15.
Bateman A., Coin L., Durbin R., Finn R.D., Hollich V., Griffiths-Jones S., Khanna A., Marshall M., Moxon S., Sonnhammer E.L.L., Studholme D.J., Yates C. and Eddy S.R. (2004) “The Pfam protein families database.” Nucleic Acids Res. 32, D138–D141. 16. Letunic I., Copley R.R., Pils B., Pinkert S., Schultz J. and Bork P. (2006) “SMART 5: domains in the context of genomes and networks.” Nucleic Acids Res. 34, D257–D260. 17. The InterPro Consortium; Mulder N.J., Apweiler R., Atwood T.K., et al. (2005) “InterPro, Progress and Status in 2005.” Nucleic Acids Res. 33, D201-D205. 18. Madera M., Vogel C., Kummerfeld S.K., Chothia C. and Gough J. (2004) “The SUPERFAMILY database in 2004: additions and improvements.” Nucleic Acids Res. 32, D235-D239.
19. Attwood T.K., Bradley P., Flower D.R., Gaulton A., Maudling N., Mitchell A.L., Moulton G., Nordle A., Paine K., Taylor P., Uddin A. and Zygouri C. (2003) “PRINTS and its automatic supplement, preprints.” Nucleic Acids Res. 31, 400-402. 20. Hulo N., Bairoch A., Bulliard B., Cerutti L., de Castro E., Langendijk-Genevaux P.S., Pagni M. and Sigrist C.J.A. (2006) “The PROSITE database.” Nucleic Acids Res. 34, D227-D230. 21. Rawlings N.D., Morton F.R. and Barrett A.J. (2006) “MEROPS: the peptidase database.” Nucleic Acids Res. 34, D270–D272. 22. Sander C. and Schneider R. (1991) “Database of homology-derived protein structures and the structural meaning of sequence alignment.” Proteins: Struct. Fun. Gen. 9, 56–68. 23. Thompson J.D., Higgins D.G. and Gibson T.J. (1994) “CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice.” Nucleic Acids Res. 22, 4673–4680. 24. Henikoff S. and Henikoff J.G. (1992) “Amino acid substitution matrices from protein blocks.” Proc. Natl. Acad. Sci. USA 89, 10915–10919. 25. Henikoff S. and Henikoff J.G. (1993) “Performance evaluation of amino acid substitution matrices.” Proteins Struct. Fun. Gen. 17, 49–61.
19 Open-Source Platform for the Analysis of Liquid Chromatography-Mass Spectrometry (LC-MS) Data Matthew Fitzgibbon, Wendy Law, Damon May, Andrea Detter, and Martin McIntosh
Summary
The analysis of protein mixtures by liquid chromatography-mass spectrometry (LC-MS) requires tools for viewing and navigating LC-MS data, locating peptides in LC-MS data, and eliminating low-quality peptides. msInspect, an open source platform, can carry out these steps for single experiments and can align and normalize peptide features in comparative studies with multiple LC-MS runs. In addition, msInspect can analyze quantitative studies with and without isotopic labels to generate peptide arrays.
Key Words: liquid chromatography-mass spectrometry; peptide identification; filtering; alignment; quantitation.
1. Introduction
msInspect is an open-source platform comprising algorithms and visualization tools that process liquid chromatography-mass spectrometry (LC-MS) data files to locate peptides in two dimensions [time and mass over charge (m/z)] and perform various analyses on them (1). msInspect can be used for:
• Visually inspecting LC-MS spectra and peptide features
• Automatically locating peptide features in high mass accuracy MS spectra
• Filtering peptide features by various quality measures
• Quantitating label-free peptide features between experiments via alignment and normalization of the data to create a peptide array
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols Edited by: A. Vlahou © Humana Press, Totowa, NJ
369
370
Fitzgibbon et al.
• Identifying isotopically labeled pairs [e.g., isotope coded affinity tagging (ICAT), sable labeling with amino acids in cell culture (SILAC)] for quantitative peptide analysis within a single experiment • Comparing and developing MS feature-finding algorithms
msInspect implements multiple algorithms specifically designed for LC-MS data. The signal processing component exploits the two-dimensional nature of the data to identify coeluting isotopes and then groups them based on the similarity of the observed isotopic distributions to those of naturally occurring peptides. The alignment method estimates the underlying nonlinear mapping of retention times between experiments. The normalization approach (2) adapts methods developed for genomic arrays to accommodate natural variation of LC-MS signal intensities across runs. Ultimately, the goal of msInspect is to mine LC-MS data and to produce peptide arrays that can then be analyzed using tools traditionally applied to genomic arrays.
msInspect also contains a complete Accurate Mass and Time (AMT) analysis workflow (3). These analytical techniques combine LC-MS and LC-MS/MS data in order to expand peptide coverage and enhance the confidence of peptide identifications.
2. Materials
To run msInspect, the Java Runtime Environment must be installed. To perform alignment of multiple runs, the R environment must also be installed. Both of these programs must be properly configured and on the computer’s PATH. Information on acquiring these software packages is provided in Subheading 2.1 below. Please contact your local IT systems support group for details on installing this software properly. msInspect reads mass spectra from files in the open mzXML format (4). For background on mzXML and information about converting data from particular instruments to mzXML, see Note 1.
2.1. Software
1. msInspect is written in platform-independent Java and requires that the Java Runtime Environment, version 1.5 or later, be installed and on the computer’s PATH. Installation of the Java Runtime Environment will also install the latest version of Java Web Start, which allows msInspect to be run without explicitly installing it or updating it as new versions are released (see Note 2).
a. Windows, Linux, and Solaris users can download “J2SE 5.0” from http://java.sun.com/j2se/1.5.0/download.jsp.
b. Macintosh users running Mac OS X v10.4 or later can download Java from http://www.apple.com/support/downloads.
2. To align multiple runs into a peptide array, the R environment for statistical computing, version 2.1.0 or later, must be installed and on the computer’s PATH. R executables for various operating systems are available from http://www.r-project.org.
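The two PATH requirements above can be checked before launching msInspect. The following is a minimal sketch, not part of msInspect: the function names are hypothetical, and the launch command only mirrors the “-Xmx” example given in Note 3 (the JAR name viewerApp.jar is taken from that note).

```python
import shutil


def find_prerequisites(programs=("java", "R")):
    """Map each required program to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in programs}


def build_launch_command(jar_path="viewerApp.jar", max_heap_mb=512):
    """Build the argument list for a stand-alone launch with a larger Java heap.

    Mirrors the command quoted in Note 3; both arguments are assumptions
    that should be adapted to the local installation.
    """
    if max_heap_mb <= 0:
        raise ValueError("max_heap_mb must be positive")
    return ["java", f"-Xmx{max_heap_mb}M", "-jar", jar_path]


if __name__ == "__main__":
    for name, path in find_prerequisites().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
    print(" ".join(build_launch_command()))
```

The argument list returned by build_launch_command can be passed to subprocess.run once both programs have been found.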
2.2. Hardware
msInspect will run on any computer that supports the software listed in Subheading 2.1. For large input files, typical of high mass accuracy measurements, feature extraction can require several hundred megabytes of memory (see Note 3). msInspect has been tested on computers running Windows XP, GNU Linux, and Mac OS X with at least 1 GB of main memory.
2.3. Data Files
msInspect will open any version 2.0 mzXML file containing MS1 data. However, msInspect was designed using high-resolution liquid chromatography-electrospray ionization-time of flight mass spectrometer data, so it may not perform as well with an mzXML file from another type of mass spectrometer (e.g., a matrix-assisted laser desorption/ionization-time of flight mass spectrometer). Sample mzXML files that may be used to follow all of the steps in Section 3 are available on the Web (see Note 4).
3. Methods
3.1. Viewing and Navigating LC-MS Data
1. Launch msInspect from http://proteomics.fhcrc.org/download/tools/msInspect/viewer.jnlp by clicking on “Launch msInspect with Java Web Start.” “Fred Hutchinson Cancer Research Center” must be accepted as a trusted software publisher for the download to be completed.
2. Upon launching msInspect, the Open File dialog box will automatically open. Browse for the mzXML file to be viewed, select the file, and left click the Open button (see Note 5). You may load a different mzXML file by selecting File > Open from the main msInspect menu bar.
3. The msInspect window (Fig. 1) contains several panes for viewing and navigating the MS run:
a. An image of the MS run will be displayed in the Image Pane (the largest pane in the center of the msInspect window).
b. The Properties Pane (left side of the window) will display detailed information from the mzXML file loaded. This pane will later be used to display details of individual peptide features. It can be hidden with Windows > Show/hide properties.
c. The Detail Pane is on the right side of the window and the Chart Pane is at the bottom part of the window. Each provides a more detailed view of a region of the spectrum. The Detail Pane provides a zoomed view of the area selected in the full Image Pane. The Chart Pane plots intensity versus m/z (to show the isotopes in a single scan) or intensity versus scan (to show the elution profile of a single isotope).
4. Hold the mouse cursor over a location in the Image Pane. A floating tag will appear displaying the scan number and m/z coordinates of that position.
5. Areas containing peptide features in the Image Pane will appear dark. Left click in a dark area of the image where there appear to be many peptide features as shown in Fig. 1.
a. The Detail Pane (right) shows a detailed view of the area selected. Feature finding is automatically launched in this area, and after a few seconds of computation, detected peptide features are circled. Xs indicate the monoisotopic peaks in each feature (see Note 6).
b. To see detailed information about a detected peptide, position the mouse cursor over the monoisotopic peak. A floating tag will display scan number, m/z (followed by mass in parentheses), inferred charge state, intensity/background intensity/median intensity, and the first and last scan for the feature.
c. The Chart Pane (bottom) displays the m/z spectrum for the scan corresponding to the vertical red line in the Detail Pane.
6. Zoom in on features in the Chart Pane by highlighting a desired area. To do this, anchor the mouse cursor by left clicking at the top left corner of the desired area and continue to hold down the left mouse button while dragging the mouse cursor down and to the right. When the mouse button is released, the chart will be redrawn to produce a magnified view of the selected area (see Note 7). To restore the original chart, left click anywhere in the Chart Pane and drag the cursor up or to the left.
7. Select “elution” from the drop-down menu at the top of the Chart Pane to display an elution profile plot. This display shows peaks along the scan axis rather than the m/z axis. Note that the Detail Pane now displays a horizontal line corresponding to the m/z value for the profile as shown in Fig. 2.
8. Zoom in on the Image Pane by right clicking and selecting a magnification value from the list (e.g., 200%).
Fig. 1. msInspect window showing the Properties Pane (top left), Image Pane (top center), Detail Pane (top right), and Chart Pane (bottom).
Fig. 2. msInspect window displaying an elution profile plot in the Chart Pane and corresponding horizontal line in the Detail Pane.
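The inferred charge state and mass shown in the floating tag (step 5b) follow from the spacing of the isotopic peaks, as explained in Note 7. The sketch below illustrates that reasoning only; it is not msInspect’s implementation. The function names, the tolerance, the maximum charge, and the assumption that ions are protonated are all introduced here for illustration.

```python
PROTON_MASS = 1.007276  # approximate mass of a proton in Da


def infer_charge(peak_mzs, max_charge=6, tolerance=0.05):
    """Infer charge as the reciprocal of the mean spacing between isotope peaks.

    Returns 0 for a single ("stray") peak or when the spacing does not match
    any charge from 1 to max_charge within the given tolerance (in m/z units).
    """
    if len(peak_mzs) < 2:
        return 0
    peaks = sorted(peak_mzs)
    spacings = [b - a for a, b in zip(peaks, peaks[1:])]
    mean_spacing = sum(spacings) / len(spacings)
    if mean_spacing <= 0:
        return 0
    charge = round(1.0 / mean_spacing)
    if charge < 1 or charge > max_charge:
        return 0
    if abs(mean_spacing - 1.0 / charge) > tolerance:
        return 0
    return charge


def neutral_mass(mz, charge):
    """Neutral mass of a protonated ion observed at the given m/z and charge."""
    if charge < 1:
        raise ValueError("a charge of at least 1 is needed to compute a mass")
    return (mz - PROTON_MASS) * charge


if __name__ == "__main__":
    cluster = [500.0, 500.5, 501.0]  # peaks 0.5 m/z units apart, as in Note 7
    z = infer_charge(cluster)
    print(z, neutral_mass(cluster[0], z))
```

With peaks 0.5 m/z units apart, the sketch returns a charge of 2, matching the example in Note 7; a lone peak returns 0.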
3.2. Locating Peptides in LC-MS Data
A Feature Set file, which lists all of the peptide features detected in a run, can be generated using one of the algorithms included in the platform (see Note 8).
1. Under the Tools menu, select two dimensional (2D) Peak Alignment. This is the default feature-finding algorithm and is recommended for most purposes.
2. To initiate feature finding, select Tools > Find All Features. This will bring up the Extract Features dialog box as shown in Fig. 3.
3. In the “Save Features to File” field, enter (or browse for) a path and add a name for the new Feature Set file.
4. Specify a scan range in the “Start Scan” and “End Scan” fields to limit feature finding to a subset of scans. By default, msInspect will attempt to find peptides in all scans (see Note 9).
5. Left click the Find Features button to begin the feature finding process. As the file is processed, the status bar at the bottom of the msInspect window will display progress. For a large input file, processing may take upwards of 20–30 min.
6. When processing is complete, features will be written to the specified output file and highlighted as colored crosses in the Image and Detail Panes. The status bar will display “Finding features complete. See file yourfilepathyourfile.peptides.tsv.” Place the mouse cursor over one of the detected features to display a summary of its properties. Left click on the feature to view details in the Properties Pane (display by Windows > Show/hide Properties).
7. Select Tools > Display Peptides… to open the Display Features dialog box as shown in Fig. 4A for customization:
Fig. 3. Extract Features dialog box.
a. Display or hide the colored crosses by checking or unchecking the box under the “Display” field.
b. Change the color of the crosses by left clicking on the colored box under the “Color” field. A new color can be selected from a color palette.
c. View the Feature Set browser by left clicking on the “…” button. This browser lists details of all peptides in the Feature Set. This list can be sorted and edited, comments can be added to a feature, features can be deleted, and the modified Feature Set file may be saved (see Note 10).
3.3. Filtering to Eliminate Low-quality Peptides
Low-quality peptides can be removed in msInspect by applying user-specified filtering criteria (e.g., a minimum number of isotopic peaks detected). Removing low-quality peptides is particularly helpful when peptide arrays are to be generated (described in Subheading 3.4.1).
1. Select Tools > Display Peptides….
2. Left click the Filter tab at the bottom of the Display Features dialog box. This tab displays several parameters by which features can be filtered.
3. Set Min Charge = 1, Min Scans = 3, Min Intensity = 5, Max KL = 1.0, and Min Peaks = 2 as shown in Fig. 4A (see Note 11).
4. Left click the Apply button. The Detail Pane now shows only the features that meet these filtering criteria.
5. Save the filtered Feature Set file over the original file by left clicking on the “…” button at the top right of the Display Features dialog box, then left clicking on the Save button.
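For batch work outside the viewer, the same five criteria can be applied to a tab-delimited feature table. The sketch below is an illustration under stated assumptions rather than msInspect’s own filter: the column names (charge, scanCount, intensity, kl, peaks) and the treatment of lines beginning with “#” as comments are assumptions about the file layout and should be checked against an actual Feature Set file.

```python
import csv

# Thresholds from Subheading 3.3, step 3.
CRITERIA = {
    "min_charge": 1,
    "min_scans": 3,
    "min_intensity": 5.0,
    "max_kl": 1.0,
    "min_peaks": 2,
}


def passes_filter(feature, criteria=CRITERIA):
    """Return True if one feature row (a dict of strings) meets every criterion.

    The column names used here are assumed, not taken from the msInspect
    documentation.
    """
    return (
        int(feature["charge"]) >= criteria["min_charge"]
        and int(feature["scanCount"]) >= criteria["min_scans"]
        and float(feature["intensity"]) >= criteria["min_intensity"]
        and float(feature["kl"]) <= criteria["max_kl"]
        and int(feature["peaks"]) >= criteria["min_peaks"]
    )


def filter_feature_lines(lines, criteria=CRITERIA):
    """Parse tab-delimited lines (header row first) and keep the passing rows."""
    # Skip blank lines and lines assumed to be comments.
    data_lines = [ln for ln in lines if ln.strip() and not ln.startswith("#")]
    reader = csv.DictReader(data_lines, delimiter="\t")
    return [row for row in reader if passes_filter(row, criteria)]
```

To filter a file, pass the open file object (or its lines) to filter_feature_lines. A stray peak with charge 0 and a single isotopic peak fails both the minimum charge and the minimum peaks tests, as described in Note 11.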
3.4. Quantitation of Peptide Features
3.4.1. Quantitation Using Label-free Approaches
Features from multiple experiments can be compared in msInspect by simultaneously opening Feature Set files from multiple LC-MS runs, displaying them together, and generating a peptide array. Below are directions for comparing multiple LC-MS runs after Feature Set files have been produced (as described above in Subheadings 3.1–3.3) for all LC-MS runs to be compared.
1. Select Tools > Display Peptides….
2. Left click on the Add Files button (Fig. 4A).
Fig. 4. (A) Display Features dialog box with one file loaded and the Filter tab selected. (B) Display Features dialog box with two files loaded and the Peptide Array tab selected.
3. Browse to find another Feature Set file (with file extension .peptides.tsv) and open it. A different colored cross is assigned in the Image Pane to the features from each newly opened file. In this way, multiple Feature Set files can be opened and overlaid in the Image Pane (see Note 12).
4. Left click on the Filter tab (Fig. 4A) at the bottom of the Display Features dialog box and make sure the filter criteria are still set to the values entered in Subheading 3.3 (Min Charge = 1, Min Scans = 3, Min Intensity = 5, Max KL = 1.0, and Min Peaks = 2). Left click on the Apply button if any changes are made.
5. Left click on the Peptide Array tab (Fig. 4B) to set criteria for the peptide array to be generated:
a. Enter a name for the peptide array file that will be generated. By convention, this file name should end with “.pepArray.tsv.”
b. Click the Optimize button to have msInspect search for reasonable tolerances for matching features across runs (see Note 13).
c. Check the Normalization box if normalization of features is desired (2).
d. Click the Calculate button to compute the peptide array.
6. The generated peptide array file consists of one column of intensities for each run and one row for each matched feature. The file is stored in a simple tab-delimited format, which can be exported (to Excel and other programs) and analyzed using tools traditionally applied to genomic arrays (see Note 14).
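The structure of a peptide array (one row per matched feature, one intensity column per run) can be illustrated with a simple greedy matcher. This sketch is not msInspect’s method: it assumes that retention times are already aligned, whereas msInspect estimates a nonlinear mapping between runs, and the tolerances, data layout, and function name are hypothetical.

```python
def build_peptide_array(runs, mass_tol_ppm=10.0, scan_window=20):
    """Greedily match features across runs within mass and scan tolerances.

    runs: list of runs, each a list of (mass, scan, intensity) tuples.
    Returns a list of rows; each row holds a reference mass and scan plus one
    intensity per run, with None where a run has no matching feature.
    """
    # Tag every feature with its run index and process in order of mass.
    tagged = sorted(
        (mass, scan, intensity, run_idx)
        for run_idx, features in enumerate(runs)
        for mass, scan, intensity in features
    )
    rows = []
    for mass, scan, intensity, run_idx in tagged:
        for row in rows:
            mass_ok = abs(mass - row["mass"]) <= row["mass"] * mass_tol_ppm * 1e-6
            scan_ok = abs(scan - row["scan"]) <= scan_window
            if mass_ok and scan_ok and row["intensities"][run_idx] is None:
                row["intensities"][run_idx] = intensity
                break
        else:
            # No compatible row found: start a new one for this feature.
            intensities = [None] * len(runs)
            intensities[run_idx] = intensity
            rows.append({"mass": mass, "scan": scan, "intensities": intensities})
    return rows


if __name__ == "__main__":
    run_a = [(1000.000, 100, 50.0), (1500.000, 200, 30.0)]
    run_b = [(1000.005, 105, 60.0), (2000.000, 300, 10.0)]
    for row in build_peptide_array([run_a, run_b]):
        print(row["mass"], row["intensities"])
```

Rows with missing values (None) correspond to features detected in only some runs; how such gaps are handled during normalization is discussed in (2).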
3.4.2. Quantitation Using Isotopic Labeling
A common method of relative quantitation of peptides involves applying heavy and light isotopic labels separately to two samples, then mixing them prior to collecting LC-MS data. Typically, tandem MS/MS (or MS2) experiments are used to analyze these labeled samples. Peptide sequencing in MS/MS can detect the number of labeled residues in each peptide and therefore determine the expected mass difference between light and heavy forms of each peptide. msInspect can perform relative quantitation even in the absence of MS/MS information. Provided with the mass of the light and heavy reagents and with a threshold on the number of labeled residues to consider, msInspect will search for pairs of features consistent with isotopic labeling.
1. Open the file to be analyzed as described in Subheading 3.1.
2. Select Tools > Find All Features.
3. This will again bring up the Extract Features dialog box as shown in Fig. 3. Enter a new output file name and select a scan range of interest as described in Subheading 3.2, steps 3 and 4.
4. Note the “Quantitate” check box in this dialog. Selecting this box will enable several options for relative quantitation.
5. Select one of several common isotopic labeling strategies (e.g., Cleavable ICAT and O16/O18) from the pull-down menu. Details can be entered including masses for light and heavy label reagents, the particular amino acid labeled, and the maximum number of labeled residues to consider.
6. Left click on the “Find Features” button to locate all features in the specified scan range. Display features from the Feature Set file as described in Subheading 3.2, step 7. An additional matching step is performed to locate isotopically labeled pairs. A pair is indicated by a vertical bar connecting the light and heavy partners in the Detail Pane. Selecting a pair by left clicking in the Detail Pane will display feature properties including the light and heavy intensities, the ratio of light to heavy, and the number of isotopic labels detected.
7. The results of this quantitation process are stored in the tab-separated value (TSV) file specified in step 3. One record is written for each isotopically labeled pair and for each unlabeled peptide (see Note 15).
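The pair search described above (a known light/heavy mass difference and a cap on the number of labeled residues) can be sketched as follows. This is an illustrative search under stated assumptions, not msInspect’s matching code: the feature layout, tolerances, function name, and the example mass difference are all hypothetical and must be replaced by the reagent masses of the labeling strategy in use.

```python
def find_labeled_pairs(features, label_delta, max_labels=3,
                       mass_tol=0.02, scan_window=10):
    """Find light/heavy feature pairs consistent with isotopic labeling.

    features: list of dicts with keys "mass", "scan", "intensity", "charge".
    label_delta: heavy minus light reagent mass (Da) for one labeled residue.
    A pair must share a charge state, coelute within scan_window scans, and
    differ in mass by n * label_delta for some n from 1 to max_labels.
    """
    pairs = []
    for light in features:
        for heavy in features:
            if heavy is light or heavy["charge"] != light["charge"]:
                continue
            if abs(heavy["scan"] - light["scan"]) > scan_window:
                continue
            diff = heavy["mass"] - light["mass"]
            for n in range(1, max_labels + 1):
                if abs(diff - n * label_delta) <= mass_tol:
                    pairs.append({
                        "light": light,
                        "heavy": heavy,
                        "labels": n,
                        # Ratio of light to heavy, as reported in step 6.
                        "ratio": light["intensity"] / heavy["intensity"],
                    })
                    break
    return pairs


if __name__ == "__main__":
    example = [
        {"mass": 1000.00, "scan": 100, "intensity": 200.0, "charge": 2},
        {"mass": 1009.03, "scan": 102, "intensity": 100.0, "charge": 2},
    ]
    # 9.03 Da is an illustrative mass difference only.
    for pair in find_labeled_pairs(example, label_delta=9.03):
        print(pair["labels"], pair["ratio"])
```

Because no sequence information is used, unrelated features can satisfy these constraints by chance, which is why Note 15 advises treating the results as suggestive.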
4. Notes
1. More information on the mzXML file format, as well as utilities to convert native acquisition files from many common MS instruments to mzXML, can be found on the Sashimi website at http://sashimi.sourceforge.net.
2. Running msInspect via Java Web Start is highly recommended for casual use, as it greatly simplifies installation and update of the software. msInspect’s major features, such as feature finding and peptide array creation, are available from the command line as well, and command-line use is more appropriate for batch processing of large numbers of mzXML files. To use msInspect from the command line, the stand-alone JAR file can be downloaded from http://proteomics.fhcrc.org/CPL/msinspect.html. This web page also allows download of the msInspect user’s guide, which contains detailed instructions on installation, using msInspect’s features from the command line, and full source code for the released version (5).
3. Feature extraction can require a great deal of memory since it operates on several scans at a time. By default the Java Web Start version of msInspect allows up to 384 MB of memory to be allocated so that a number of scans and intermediate results may be cached. If additional memory is available on the computer, the amount of memory accessible by msInspect may be increased when running msInspect from the command line with the “-Xmx” option when invoking Java. For example, “java -Xmx512M -jar viewerApp.jar.”
4. Sample data files are available at https://proteomics.fhcrc.org/CPAS. From that website, follow the “Published Experiments” link on the lower left side and then left click on the “MiMB Clinical Proteomics” link on the left side. Because LC-MS files can be quite large, the samples provided for download are only small subregions of the files used as figures in Section 3. Some browsers, such as Internet Explorer, may add a “.mzXML.xml” suffix when downloading these files. This should not affect msInspect’s ability to read the files and may be safely modified to “.mzXML” if desired.
5. The first time a particular mzXML file is loaded, msInspect will write a “.inspect” file in the same directory where the mzXML file is located. This file contains an index of each scan in the original file, which will speed subsequent file access. Construction of this index file can take some time for larger input files; the status bar at the bottom of the msInspect window will indicate progress.
6. The area shown in the Detail Pane is indicated in the main Image Pane by a blue rectangle. Several aspects of Detail Pane behavior can be adjusted by selecting Detail Pane Settings from the Tools menu. There, feature detection can be turned on or off, background noise that falls below a threshold can be hidden, and the color scheme of the Detail Pane can be modified.
7. Note that in Fig. 1 the Chart Pane clearly shows individual isotopic peaks because the data are from a high-resolution instrument (in this case a Waters LCT Premier). msInspect depends on resolving individual isotopes to infer the charge state of the peptide and therefore its mass. The charge is derived from the reciprocal of the distance between adjacent peaks. In Fig. 1 the peaks of the peptide on the left side of the Chart Pane are 0.5 m/z units apart; therefore, msInspect infers that this peptide has a charge of 2. It is not possible to infer a charge for a single peak, so “stray peaks” that cannot be grouped into an isotopic cluster are assigned a charge of zero.
8. msInspect includes a number of feature extraction algorithms, which can be selected in the Tools menu. The default, two dimensional (2D) peak alignment, is recommended for most purposes. The single scan algorithm may be useful if there is little or no scan-to-scan coherence.
9. The feature extraction algorithms in msInspect have been designed to work on high-resolution profile mode data. The algorithms have been successfully applied to centroided data, but performance will depend on the particular centroiding algorithm used and on the noise characteristics of the run under consideration. For such data, the centroided scan algorithm may be appropriate.
10. Once peptides have been located, some amount of visual curation is recommended. The Heat Map view (accessed from the Tools menu) can provide a global view of features grouped by charge state and sorted by various metrics such as mass or intensity. Each column in the Heat Map view consists of a small intensity window around each feature, colored from low intensity (red) to high intensity (yellow). Clicking on a feature in the Heat Map will highlight it in the other windows. By sorting on KL score or intensity and inspecting a few features, one can gain a sense of what filtering criteria might be appropriate for a given data set. When new filter settings are applied, as described in Subheading 3.3, the Heat Map view is automatically updated. A typical example of editing a Feature Set file:
a. Sort by ascending KL score (left click on the “KL” column header).
b. Find a feature with KL < 1 that was misidentified by examining its spectrum in the msInspect window’s Chart Pane.
c. Double click in the Description field for the feature to add a comment to the Feature Set List noting that this feature is “questionable.”
d. Click “Save” to save changes by overwriting the old Feature Set file.
11. Filtering peptide features can improve the performance of subsequent steps such as construction of peptide arrays. Specific filtering criteria will depend on instrumentation and the experiment goals. The most frequently used filtering criteria include:
a. Minimum charge: msInspect locates features by first finding peaks and then grouping them into isotopic distributions consistent with individual peptides. Some peaks will not group with any others and are referred to as “stray peaks.” As described in Note 7, it is not possible to infer the charge state of these stray peaks, so they are assigned a charge of zero. Setting the minimum charge to 1 when filtering will remove these stray peaks, which are often due to noise or chemical contaminants.
b. Minimum number of peaks: confidence in the location and charge state assignment of a peptide feature may be greater if it is supported by more isotopic peaks. Setting the minimum number of peaks to 2 will also eliminate the stray peaks described above.
c. Minimum number of scans: set the minimum number of scans that a peptide must span in order to be considered. This has the effect of eliminating peptide features that persist for only a brief time.
d. Minimum intensity: setting a minimum intensity threshold is often appropriate, although the specific value used will depend on the instrument.
e. Maximum KL score: peaks are grouped by how well they match a model of the isotopic distribution of a peptide with a given mass. The KL score described in Bellew et al. (1) measures how much an extracted group of peaks deviates from this model; in general, a lower KL score indicates a better match.
12. When multiple feature sets are loaded, it is often useful to hide particular sets or to change the colors of the crosses that mark features in a given set. Both of these can be accomplished in the Display Features dialog box as shown in Fig. 4A (select Tools > Display Peptides).
For each feature set, this dialog box provides a checkbox to control visibility and a color palette to select colors for the crosses.
13. After optimization, the mass and scan window values that give the best alignment results automatically populate the Peptide Array tab.
14. A number of high-quality open source tools are available for microarray analysis. To analyze peptide arrays produced by msInspect, tools from the Bioconductor project (http://www.bioconductor.org) and from the TM4 microarray software suite (http://www.tm4.org) have been used.
15. Results from isotopic labeling should be treated as suggestive rather than authoritative. Without peptide sequence information, the mass difference between heavy and light partners cannot be definitively ascertained. The quality of the matching is therefore dependent on the quality of feature filtering and the density of features in each run.
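The KL score mentioned in Note 11e can be pictured as a divergence between the observed isotopic peak intensities and a model distribution. The sketch below assumes that “KL” denotes a Kullback-Leibler divergence between normalized intensities; the exact model and formula used by msInspect are given in (1) and may differ. The function names and the example distributions are hypothetical.

```python
import math


def normalize(intensities):
    """Scale non-negative intensities so that they sum to 1."""
    total = sum(intensities)
    if total <= 0:
        raise ValueError("intensities must have a positive sum")
    return [x / total for x in intensities]


def kl_score(observed, model, floor=1e-9):
    """Kullback-Leibler divergence of observed from model isotope intensities.

    Both inputs list the intensities of successive isotopic peaks and must
    have the same length. A score near 0 means the observed pattern has
    nearly the same shape as the model; larger values mean a worse match.
    """
    if len(observed) != len(model):
        raise ValueError("observed and model must have the same length")
    p = normalize(observed)
    q = normalize(model)
    # Terms with p_i == 0 contribute nothing; floor guards against q_i == 0.
    return sum(
        pi * math.log(pi / max(qi, floor))
        for pi, qi in zip(p, q)
        if pi > 0
    )


if __name__ == "__main__":
    model = [100.0, 60.0, 20.0]  # illustrative model, not from reference (1)
    print(kl_score([95.0, 62.0, 21.0], model))   # close to the model shape
    print(kl_score([10.0, 100.0, 10.0], model))  # poor match to the model
```

Under this assumption, a Max KL filter such as the value of 1.0 used in Subheading 3.3 discards features whose peak pattern departs strongly from the expected isotopic shape.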
Acknowledgments
The authors would like to thank Matthew Bellew, Marc Coram, Jimmy Eng, Ruihua Fang, Mark Igra, and Tim Randolph for their intellectual contributions to the development of msInspect. This work was supported by contract # 23XS144A from the National Cancer Institute.
References
1. Bellew, M., Coram, M., Fitzgibbon, M., Igra, M., Randolph, T., Wang, P., May, D., Eng, J., Fang, R., Lin, C.W., Chen, J., Goodlet, D., Whiteaker, J., Paulovich, A., and McIntosh, M. (2006) A suite of algorithms for the comprehensive analysis of complex protein mixtures using high-resolution LC-MS. Bioinformatics Advance Access published on June 9, 2006 http://bioinformatics.oxfordjournals.org/cgi/reprint/btl276v1.
2. Wang, P., Tang, H., Zhang, H., Whiteaker, J., Paulovich, A.G., and McIntosh, M. (2006) Normalization regarding non-random missing values in high-throughput mass spectrometry data. Proceedings of the Pacific Symposium on Biocomputing 11, 315–326.
3. May, D., Fitzgibbon, M., Liu, Y., Holzman, T., Eng, J., Kemp, C.J., Whiteaker, J., Paulovich, A., and McIntosh, M. (2007) A Platform for Accurate Mass and Time Analyses of Mass Spectrometry Data. Journal of Proteome Research 6(7), 2685–2694.
4. Pedrioli, P.G., Eng, J.K., Hubley, R., Vogelzang, M., Deutsch, E.W., Raught, B., Pratt, B., Nilsson, E., Angeletti, R.H., Apweiler, R., Cheung, K., Costello, C.E., Hermjakob, H., Huang, S., Julian, R.K., Kapp, E., McComb, M.E., Oliver, S.G., Omenn, G., Paton, N.W., Simpson, R., Smith, R., Taylor, C.F., Zhu, W., and Aebersold, R. (2004) A common open representation of mass spectrometry data and its application to proteomics research. Nature Biotechnology 22(11), 1459–1466.
5. Computational Proteomics Laboratory. msInspect website. Accessed on June 28, 2006 at http://proteomics.fhcrc.org/CPL/msinspect.html.
20 Pattern Recognition Approaches for Classifying Proteomic Mass Spectra of Biofluids Ray L. Somorjai
Summary The statistical classification strategy we have developed for magnetic resonance, infrared, and Raman spectra for the analysis of biomedical data is discussed, particularly as it applies to proteomic mass spectra. A general discussion of the current use of pattern recognition methods is given, with caveats and suggestions relevant for clinical applicability.
Key Words: visualization; preprocessing; feature selection/extraction; robust classifier; classifier aggregation; proteomics; mass spectroscopy; magnetic resonance spectroscopy; biodiagnostics.
1. Introduction
Unlike magnetic resonance spectroscopy (MRS), infrared spectroscopy (IRS), and Raman spectroscopy (RS) (1,2,3), proteomic mass spectroscopy (PMS) is a relative newcomer to the field of biodiagnostics. However, with the goal of discriminating various diseases and disease states, it is a welcome complementary technique that provides yet another means of analyzing biofluids. In particular, this complementarity extends the range of characterizing biofluids, from vibrational states of specific chemical groups (IRS, RS), through the identification of small molecules (MRS), to proteins and protein fragments (PMS). Being an emerging field, PMS suffers from growing pains. In particular, there are experimental difficulties specific to PMS that have yet to be addressed (see Note 1) (in the following, the author assumes that the spectra, for which classifiers are to be developed, have been properly “processed”).
Typically, biomedical data consist of relatively few (of the order 10–100) samples (patterns) that are initially presented in a very high-dimensional feature space (feature ≡ m/z intensity), with dimensionality L (dimension ≡ number of features) of order 1000–10,000. Unfortunately, these two characteristics lead to two curses that impede the development of robust classifiers: the curse of dimensionality and the curse of dataset sparsity (3). The consequence of the two curses is that the sample to feature ratio (SFR) is 1/10–1/1000, instead of the minimum of 5–10 that the machine learning community generally accepts as required for robust classification. In this chapter, the author presents the specific strategy [dubbed statistical classification strategy (SCS)] developed over the last dozen years to deal with such problems, particularly as they apply to MR, IR, and Raman spectra. We have been adapting this strategy and applying it with success to biomedical data derived from both proteomics mass spectra and microarrays (see Note 2). The author compares the differences and similarities of the SCS with the proteomics data analysts’ current tools and, wherever possible, makes recommendations.
2. The Statistical Classification Strategy
Lifting the twin curses of high dimensionality and dataset sparsity requires special approaches. The “strategy” part of the SCS reflects the fact that no single approach is, or can be, optimal [“there are no panaceas in data analysis” (4)], and that a data-driven, multistage strategy is necessary or even essential. Using a divide-and-conquer philosophy, the SCS consists of five stages:
1. Data visualization
2. Preprocessing
3. Feature selection/extraction
4. Robust classifier development
5. Classifier aggregation (ensembles)
The five stages are, of course, intimately interrelated; in particular, we use the visualization stage to constantly monitor how well the other stages of the strategy are working. Figure 1 provides a flowchart of the SCS. A more detailed description of the SCS can be found in (5) (see Note 3).
[Flowchart: Data visualization → Preprocessing → Feature selection/extraction → Classifier development → Classifier aggregation]
Fig. 1. Flowchart for the five stages of the SCS.
2.1. Visualization of High-Dimensional Data
Proper data visualization is an essential first step that requires dimensionality-reducing mapping/projection from typically a very large, L-dimensional feature space to one to three dimensions. Of course, mapping from high dimensions to lower ones cannot preserve all distances exactly, because most of the original degrees of freedom are lost. However, if only class separability is required, exact visualization, our primary goal, is both achievable and sufficient. In fact, we recently proposed such an approach (6). It involves mapping high-dimensional patterns to a special plane, the relative distance plane (RDP). The mapping procedure starts with the selection of a distance measure. This can range from Euclidean, city block, or maximum norm to Mahalanobis and its generalization (Anderson-Bahadur, AB) (7). Next, two reference patterns are chosen, one from each class. The critical observation, on which the RDP mapping relies, is that the distance of any other pattern to these two reference points is preserved exactly even after the mapping. This is because a triangle remains a triangle in any dimension and for any distance metric. Hence, the three distances of any such triangle can be displayed in two dimensions, without distortion. By cycling through all possible reference pairs, we can display and visualize the data with respect to these sets, i.e., from a large number of possible “perspectives” (as an analogy, consider looking at a sculpture from every angle to assess its shape and form). This is a very powerful approach for detecting outliers (e.g., poor quality spectra), discovering additional subgroups within a class (clustering), assessing whether training and test sets derive from the same distributions, etc.; in short, for establishing and ensuring quality control.
2.2. Preprocessing
Preprocessing enables the user to adapt, “tune” the data, so that the subsequent stages of the SCS are optimized.
For spectra, whether MS or MR, we found that the most useful preprocessing approaches, alone or in combination, are normalization (“whitening,” or scaling to unit area), smoothing (filtering), and/or peak alignment (with respect to some internal or external reference). Various transformations of the spectra frequently lead to better classification. Examples of such transformations include replacing the spectra by their (numerical) derivatives or by rank-ordered variants (the nonlinear rank-ordering replaces the original features by their ranks, thus minimizing the influence of accidentally large or small feature values) and combinations of these. Furthermore, creating differently preprocessed versions of the same dataset, selecting different sets of features from these (stage 3), and developing different classifiers using these feature sets (stage 4) facilitates the aggregation of these multiple classifiers for possibly increased accuracy (stage 5). The achieved classifier’s accuracy and reliability are also assessed by visualization of the results (stage 1). This demonstrates how the strategy uses the stages in an interactive, feedback fashion.
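The interplay between preprocessing (Subheading 2.2) and visualization (Subheading 2.1) can be made concrete with a small sketch. It shows three of the transformations named above (scaling to unit area, numerical derivative, rank ordering) and a mapping of one pattern onto the relative distance plane. The coordinate convention (first reference at the origin, second on the horizontal axis) and all function names are assumptions made here for illustration; the published RDP procedure is described in (6).

```python
import math


def normalize_unit_area(spectrum):
    """Scale a spectrum so that its intensities sum to 1."""
    total = sum(spectrum)
    if total == 0:
        raise ValueError("spectrum has zero total intensity")
    return [x / total for x in spectrum]


def first_difference(spectrum):
    """Numerical first derivative (difference of adjacent intensities)."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]


def rank_order(spectrum):
    """Replace each intensity by its rank (1 = smallest); ties keep input order."""
    order = sorted(range(len(spectrum)), key=lambda i: spectrum[i])
    ranks = [0] * len(spectrum)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def euclidean(a, b):
    """Euclidean distance between two equal-length patterns."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rdp_coordinates(pattern, ref1, ref2, distance=euclidean):
    """Map a pattern to the plane so its distances to both references are exact.

    ref1 is placed at (0, 0) and ref2 at (d12, 0); the pattern is placed at
    the apex of the triangle with side lengths d1, d2, and d12.
    """
    d12 = distance(ref1, ref2)
    if d12 == 0:
        raise ValueError("reference patterns must be distinct")
    d1 = distance(pattern, ref1)
    d2 = distance(pattern, ref2)
    x = (d1 ** 2 - d2 ** 2 + d12 ** 2) / (2 * d12)
    y = math.sqrt(max(d1 ** 2 - x ** 2, 0.0))  # clamp tiny negative round-off
    return x, y


if __name__ == "__main__":
    ref_a = normalize_unit_area([1.0, 5.0, 2.0, 0.5])
    ref_b = normalize_unit_area([0.5, 1.0, 6.0, 2.0])
    sample = normalize_unit_area([1.2, 4.0, 2.5, 0.8])
    print(rdp_coordinates(sample, ref_a, ref_b))
```

Any distance function that satisfies the triangle inequality can be passed in place of euclidean, and the same mapping can be repeated for every pair of reference patterns to obtain the different “perspectives” described in Subheading 2.1.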
2.3. Feature Selection/Extraction

In general, this stage is one of the two most important components of the SCS. It is essential not only for dimensionality reduction (which helps lift the curse of dimensionality) but, when done properly, also for arriving at biologically relevant and transparent interpretations of the data (“biomarker” identification). The driving force behind feature selection/extraction (FSE) is the goal of satisfying one of the two critical requirements for any reliable classifier development: lifting the curse of dimensionality. Spectra, whether mass or MR, are peculiar: their “intrinsic dimensionality,” the number of independent, relevant features they possess, is generally much smaller than their original dimensionality. This is because spectra have many irrelevant features (“noise”), and adjacent features are strongly correlated. Some of these correlated features correspond to spectral peaks, representing small molecules (MRS), or small proteins, protein fragments, or peptides (PMS). Thus, it is clearly beneficial to eliminate irrelevant features and identify discriminatory peaks (potential “biomarkers”). For spectra, principal component analysis, a frequently used dimension reduction method (often the principal tool of many PMS data analysts), is doubly dangerous. First, it “scrambles” the original features, making discriminatory feature identification and selection problematic; second, since the principal components (PCs) are ordered according to the maximum variance explained in the data, there is no guarantee that the first few PCs are discriminatory for classification. Even if one were to choose the first M ≪ L PCs from the original, total L-term set, these are rarely the best discriminators. One could try selecting m < M PCs as optimal for classification (e.g., by exhaustive search); our early experience indicates that some of the good discriminators are among the remaining k = M + 1,…,L
subset of PCs. All these difficulties point to the need for a feature selection method specific to spectral data, one that preserves spectral interpretability. There are two generic approaches to feature selection (8). The filter method selects features without consideration of the classifiers to be used with these features. The wrapper (embedding) method finds optimal features while using the eventual classifier to guide the selection. We have developed a genetic algorithm-based optimal region selection (GA-ORS) method that finds discriminatory features without losing spectral interpretability (9). The GA-ORS is based on the wrapper approach and is an example of feature extraction. It has the advantage that the spectral ranges found are averaged over adjacent data points (thus equivalent to peak area determination). Such averaging increases the signal-to-noise ratio, a bonus. Within the GA-ORS suite of programs, one can also control the widths of the selected spectral subregions (discriminatory peaks); this helps to eliminate those regions that appear to be discriminatory simply because of accidental differences in the “noise” regions due to the limited sample size (9,10). The GA-ORS has been very successful in identifying discriminatory subregions of MR, IR, and Raman spectra of biofluids and tissues, obtained for distinguishing between various diseases and disease states (1). In the context of feature selection, many proteomic mass spectroscopists first identify “relevant” peaks, sometimes in an ad hoc fashion, as possible contributors to discrimination. Although using all available “domain knowledge” is very important and should always be considered when available, it can also introduce bias, because of possible preconceived notions of what is relevant for discrimination.
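The region-averaging step that gives GA-ORS features their interpretability can be illustrated on its own. The sketch below only shows how a set of already chosen spectral subregions is turned into averaged features; the genetic-algorithm search that picks the regions is not reproduced, and the function name, the index convention, and the example regions are hypothetical.

```python
def region_average(spectrum, regions):
    """Average intensities over each (start, end) index range, end exclusive."""
    features = []
    for start, end in regions:
        if not 0 <= start < end <= len(spectrum):
            raise ValueError(f"invalid region ({start}, {end})")
        features.append(sum(spectrum[start:end]) / (end - start))
    return features

# Hypothetical spectrum with two peaks on a low-intensity background.
spectrum = [0.1, 0.2, 4.0, 6.0, 5.0, 0.3, 0.1, 2.0, 2.2, 0.2]
# Hypothetical regions that a search procedure might have selected.
regions = [(2, 5), (7, 9)]
features = region_average(spectrum, regions)
```

Each resulting feature still refers to a contiguous range of the original spectral axis, so a discriminatory feature can be traced back to a peak, which is not the case for a principal component.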
Our feature selection approach, sketched above, removes most of the bias introduced by such preconceived notions, by identifying hitherto unsuspected, novel discriminatory “peaks,” or more accurately, discriminatory spectral subregions. Furthermore, by its explicit multivariate nature, GA-ORS tends to identify a “fingerprint,” a “panel” of peaks whose simultaneous interaction is necessary for discrimination. When the multidimensional feature space does not arise from spectra, e.g., microarray data or preselected discrete peaks in PMS, for which averaging adjacent features is not meaningful, direct application of the GA-ORS methodology may not be appropriate [although we have used it as a preliminary, clustering-type feature selection “trick” (5)]. However, an exhaustive search when possible, or a dynamic programming-based search when not, for optimal or near-optimal discriminatory feature subsets is still feasible and is one of the options available in GA-ORS. Figure 2 demonstrates the importance of feature selection, and the relevance of an interactive, feedback-mode visualization of data. For the two-class, prostate cancer vs. healthy proteomic (mass spectral) dataset (11), we display a Euclidean distance-based mapping, either directly from the original 15,154 dimensions (left panel) or from five dimensions, reduced via GA-ORS (right panel). Clearly, the success of class separation depends on the dimensionality of the feature space. When mapping from the original 15,154 dimensions, the optimal two-dimensional separation of training sets (TS; black disks for class 1, black crosses for class 2) and test sets (VS; grey triangles for class 1, grey squares for class 2) misclassifies eight samples from the training set and nine from the independent test set. For the mapping from five dimensions, all samples are classified correctly (see Note 4).

Fig. 2. Prostate cancer: L2 mapping from 15,154 dimensions (left panel) and from 5 dimensions (right panel). The mapping from the original 15,154 dimensions misclassified eight samples from the training set (TS; class 1, black disks; class 2, black crosses) and nine from the independent validation (test) set (VS; class 1, grey triangles; class 2, grey squares). The mapping from five dimensions classified all TS and VS samples correctly. The dashed lines shown are the optimal LDA separators.

2.4. Robust Classifier Development

There are two, generally interrelated goals for supervised classifiers. First, we want robust classifiers, i.e., with high generalization power. This is realized when the classifier classifies new, unknown “patterns” correctly and reliably. Second, we want to identify the smallest subset of maximally discriminatory features. Eventual disease management/treatment would benefit from having only a few, biologically relevant and interpretable features. Ideally, both classification goals should be achieved, especially in clinically relevant studies. Unfortunately, achieving the first goal is frequently at the expense of the second. A good example is the recent use of support vector machines (SVMs) for classification. These have become particularly popular because of their
persuasive theoretical foundations (12,13) (see Note 5). However, because SVMs project the data into even higher dimensional feature spaces to achieve linear separability of the classes, identifying relevant, discriminatory features becomes more difficult. The technical complexity and sophistication of the classifiers used range from the simplest correlation techniques, through k nearest neighbors, linear and quadratic discriminant analysis, decision trees, neural nets, etc., to (nonlinear) SVMs. However, the choice of classifier seems not to be dictated by the data to be classified, but rather by “expert” recommendation (usually based on other types of data), personal experience or preference, or simply software availability. The maxim “simpler is better” has mostly been ignored [see however (14)]. In general, no specific effort has been expended on choosing the most appropriate, optimal type of classifier for a given dataset. With a few exceptions, the proteomics (mass spectroscopy) community tends to use the “best” (i.e., the most sophisticated) classifier, whether appropriate or not! If the dataset size is sufficiently large, then the optimum approach for developing a robust classifier is to partition the data into a training set, a monitoring set, and a completely independent test (validation) set. Such partitioning is required to prevent overfitting, which occurs when the classifier adapts itself too closely to the peculiarities of a training set that comprises a limited number of samples. Using a monitoring set helps decide when to stop training. The ultimate assessment of the classifier’s generalization capability is how well it does on the independent test set that was in no way involved in creating the classifier. Unfortunately, a sufficiently large sample size is a luxury rarely available to the analysts of biomedical data. The only recourse is to use some version of cross-validation (CV) (15).
CV comes in different flavors, each with its advantages and disadvantages. All of them are designed to deal with the bias introduced by using the entire dataset both to develop the “optimal” classifier and to estimate the classification error (see Note 6). It is important to re-emphasize that because of the typically small sample size of biomedical data, the best approach to robust classifier development is to select the simplest classifier possible. This suggests linear classifiers. Complex classifiers have too many parameters that need optimization, inevitably raising the specter of overfitting (see Note 7). Dimensionality reduction (FSE) is, of course, essential for obtaining an appropriate SFR. Realizing the role of the SFR is important when developing classifiers. However, an essential caveat is that data sparsity can render any classification result statistically suspect, even if the SFR is satisfied (3). The importance of guaranteeing the appropriate SFR is being recognized. However, the consequences of dataset sparsity are still not appreciated (16).
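As an illustration of cross-validation with a deliberately simple classifier, the following sketch runs K-fold CV with a nearest-class-mean rule. The nearest-mean classifier is a stand-in chosen for brevity, not the LDA used by the author, and the toy data, the seed, and all function names are assumptions.

```python
import random

def nearest_mean_fit(X, y):
    """Return the mean feature vector of each class."""
    means = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        means[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return means

def nearest_mean_predict(means, x):
    """Assign x to the class whose mean is closest (squared Euclidean distance)."""
    return min(means, key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, means[lab])))

def k_fold_accuracy(X, y, k, seed=0):
    """Return the per-fold test accuracies of K-fold cross-validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        means = nearest_mean_fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(nearest_mean_predict(means, X[i]) == y[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return accuracies

# Hypothetical, well-separated two-class data.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
     [2.0, 2.1], [2.2, 1.9], [1.9, 2.3], [2.1, 2.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
fold_acc = k_fold_accuracy(X, y, k=4)
mean_acc = sum(fold_acc) / len(fold_acc)
```

Setting `k` equal to the number of samples turns the same function into the leave-one-out variant. On real data the spread of the per-fold accuracies is as informative as their mean, since it indicates how reliable the estimate is.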
The control of disparate sensitivities and specificities produced by classifiers when the dataset is imbalanced has particular clinical relevance (typically, there are many more samples from normal subjects than from patients with particular diseases), and tuning methods are needed for the classifiers developed. The standard method in the pattern recognition literature is either oversampling (taking multiple samples from the sparser class) or undersampling (taking a subset of the samples from the larger class), such that the sample sizes in the two classes become balanced (sensitivity, SE ≈ specificity, SP). However, this approach fails quite frequently. Our approach is based on penalizing misclassification of members of the smaller class until SE ≈ SP (note that the penalty weight is generally not equal to the ratio of the class sizes).

2.5. Classifier Aggregation

Clinically relevant classifiers require statistically significant class assignments for the samples. Thus, when a classifier’s assignment probability for a sample is “fuzzy” (e.g., less than 75% for a two-class problem), that assignment is not really useful from a clinical point of view. If the overall accuracy of a classifier is low and the assignments are fuzzy, a multiple classifier strategy (classifier aggregation) can frequently be beneficial. The idea is to combine the outputs of several classifiers, with the expectation that the new classifier thus formed will be more accurate and less fuzzy than the best of the individual constituents. One of the requirements for accurate ensemble-based classifiers is diversity. It is believed that the component classifiers should be as different as possible. This can be achieved in several ways. One such approach used conceptually and methodologically very different classifiers [linear discriminant analysis (LDA), neural nets, and dynamic programming] on the same, unmodified data (17).
However, our more recent experiments and experiences suggest that classifier diversity is not necessarily required. Comparable accuracy can be achieved in a simpler way, by employing a single, simple classifier (e.g., LDA) and producing diversity using different transformations of the data (we have already discussed some of these in the context of feature selection). How are we to combine the outcomes of the various classifiers? The combination rules range from the simple majority rule to more complex, trainable rules, e.g., stacked generalization (SG) (18). SG uses the output probabilities of the constituent classifiers as input features for a new classifier. Boosting (19) is a very powerful version of a learnable classifier combination rule (see Note 8). It was used for identifying proteomic biomarkers for cancer detection (20). There are many classifier combination rules. When choosing such a rule, it is important to take into account both sample size and classifier complexity.
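The two simplest, non-trainable combination rules can be sketched as follows. The example assumes a two-class problem in which each constituent classifier reports either a label or a class-1 probability for one sample; the 0.75 threshold mirrors the 75% example given above, and the function names and numbers are hypothetical. Stacked generalization, which trains a further classifier on these outputs, is not shown.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label among the constituent classifiers."""
    return Counter(labels).most_common(1)[0][0]

def average_probability(probabilities, threshold=0.75):
    """Average class-1 probabilities and flag the assignment as fuzzy if not crisp.

    The assignment counts as crisp when the averaged probability of the
    winning class reaches the threshold.
    """
    p1 = sum(probabilities) / len(probabilities)
    label = 1 if p1 >= 0.5 else 0
    confidence = p1 if label == 1 else 1.0 - p1
    return label, confidence, confidence < threshold

votes = [1, 0, 1]        # hypothetical labels from three classifiers
probs = [0.9, 0.6, 0.8]  # their class-1 probabilities for the same sample
combined_label = majority_vote(votes)
label, confidence, fuzzy = average_probability(probs)
```

Here the second classifier alone would give a fuzzy assignment (0.6), while the averaged probability is above the threshold, which is the kind of gain aggregation is meant to deliver. An odd number of voters avoids ties in the two-class majority rule.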
3. Discussion

Of course, experimental quality control is essential for good classifiers, i.e., those that have useful generalization properties. Much has been made of the “surprising” observation that different (or even the same) experimental groups, using different classifiers, end up with totally different sets of discriminatory features (21). These discrepancies are ascribed to various possible experimental differences in the spectral acquisition, etc. (22,23,24). Although these are indeed significant contributing factors, and must be considered and corrected, sight is lost of the important fact that when nonunique discriminatory sets are found, they are as likely caused by dataset sparsity (3) as by differences in experimental protocols. The initial euphoria is over: one cannot (or should not be able to) publish in prestigious journals (e.g., Science, Nature, Lancet, PNAS, etc.) proteomic results based on very limited sample sizes. Furthermore, even when there are enough data to produce a respectable classifier, high-impact journals are unlikely to accept a manuscript unless the results are independently validated. In particular, the chemical/biological identification of the discriminatory proteins, protein fragments, or peptides must accompany the classification results. This increased focus on establishing the clinical relevance of putative biomarkers is definitely a good sign. However, at this stage of the game, it is possibly premature, and one would prefer first to have a quick, noninvasive, reliable diagnostic/prognostic tool. To be clinically relevant, many more samples are required to develop such a tool (i.e., a sufficiently robust classifier; this requirement will likely rule out the reliable detection of rare diseases). Unfortunately, currently available sample sizes preclude the discovery of unique biomarker “fingerprints” of a disease.
This nonuniqueness due to data sparsity leads inevitably to expensive, onerous, and unnecessary laboratory investigations to sift out medically relevant, unique subsets from the plethora of putative biomarkers found and suggested for various diseases. Understanding the biochemical causes is, of course, essential for, say, finding a possible cure, but should follow the diagnostic/prognostic stage. Despite such caveats, the proteomics field is maturing and, once the technical problems are successfully resolved, will undoubtedly provide important medical/clinical insights. The author further suggests that the power of proteomic spectroscopy can be enhanced by the simultaneous consideration of other experimental modalities that complement PMS, especially MRS, which could identify smaller discriminatory compounds also present in biofluids.

4. Notes

1. Amongst these are correcting the nonflat baselines arising from the matrix material, peak alignment of the spectra, reconciling data acquisition at different times, in different laboratories, with mass spectrometers of different sensitivity,
correcting high frequency noise, etc. Proper experimental design, including rigorous quality assessment and control, is essential before any classifier development is attempted. Good discussions and summaries are given in (21,22,23,24).
2. The realization that some classification strategy is essential for the analysis of proteomic data is recent. That these strategies are different emphasizes that not only is there no best classifier, but also that no unique, best strategy exists either; different groups discovered different strategies that worked well for the data they analyzed (20,25). What is common is that all strategies are multistage.
3. The data-driven nature of the SCS emphasizes the fact that there is no simple, universal prescription for creating an optimal classifier (4), i.e., no simple, ready “recipe” is, or is likely to be, available.
4. This much-improved result strengthens the importance of feature selection. Note that both mappings were done using the Euclidean distance. This was necessary because, in the original high-dimensional space, one cannot use any other distance measure (e.g., Mahalanobis) that involves matrix inversion. After feature selection, when the number of features is fewer than the number of samples, much more powerful and relevant distance measures can be used. For a fair comparison, the Euclidean distance is used for both cases presented in Fig. 2 [for further possible improvements obtainable using other distance measures see (6)].
5. In practice, SVMs are not nearly as effective as suggested by theory. In fact, we have found (26) that a simple LDA classifier, with wrapper-driven feature selection, when applied to several publicly available proteomic mass spectra, and to six microarray datasets, generally outperformed a linear SVM, even when the latter was used with feature selection. Furthermore, SVM-based classifiers frequently produce classification results that are distinctly out of balance. The accuracy obtained for one of the classes is most of the time considerably better.
This imbalance between sensitivity and specificity is of clinical relevance when trying to minimize false negatives and/or false positives.
6. Different variants of CV deal differently with the so-called bias-variance dilemma, particularly acute for datasets with limited sample size. The simplest version, the leave-one-out (LOO) method, removes one of the N samples, develops a classifier with the remaining N – 1 samples, and tests its prediction accuracy on the left-out sample. By cycling through all N samples, N accuracy assessments are found. For small N (for which the data partition, as described in the main text, is not possible), LOO suffers from large variance, even though it minimizes the bias. K-fold CV is frequently used to balance bias and variance. The samples are partitioned into K roughly equal subsets. K – 1 subsets are used for training the classifier, while the left-out subset is the current test set. Cycling through the K partitions and then calculating the mean and standard deviation of the accuracies over the K test sets assesses how well and how reliably one is expected to classify new, unknown samples. K is typically chosen to be 5 or 10, whether or not the sample size warrants this choice. A more reasonable approach is to determine the best K via CV. Particularly powerful is Efron’s bootstrapping approach (15). This involves the entire dataset, but uses a random resampling with replacement strategy. A large number of artificial datasets
of the same size as the original are thus produced. A classifier is created for each of these, and the outcomes are averaged. Bootstrapping is supposed to reduce both large bias and variance. Inspired by the bootstrapping concept, we have been using, with some success, its generalization (27).
7. Instead of the direct use of nonlinear classifiers, with the attendant optimization problems, a simple trick is to use nonlinear terms but retain the simplicity of a linear classifier. One approach we found useful is to first develop a linear classifier (with feature selection) and then augment the linear features by constructing from them nonlinear functions, say, quadratic terms. This, of course, increases the number of parameters to be determined. However, the problem remains linear in the augmented feature space and linear classifiers can be developed. Furthermore, our explicit approach produces new features that remain interpretable as interaction terms. This is unlike the SVM classifiers that map implicitly into a much higher dimensional linear feature space, without interpretability. In addition, we can reduce the dimensionality of our augmented feature space by additional feature selection via exhaustive search, optimized by CV.
8. Boosting requires “weak” base classifiers, $C_j$, $j = 1, 2, \ldots, J$, that are combined into a more accurate composite classifier, $D_J = C_1 + C_2 + \cdots + C_J$. At stage $m$, the boosting algorithm carries out a weighted selection of a base classifier, given all previously chosen base classifiers. For the new base classifier $C_m$, larger weights are given to samples that are incorrectly classified by the current composite classifier $D_{m-1}$, so that $C_m$ will be chosen with a tendency to correctly classify previously incorrectly classified samples.
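The reweighting idea in Note 8 can be made concrete with a small sketch. It uses discrete AdaBoost with one-dimensional threshold classifiers ("stumps") as the weak base classifiers; AdaBoost is one specific boosting variant chosen here for illustration and is not necessarily the algorithm of (19) or (20). Labels are coded as +1 and -1, and all names and data are hypothetical.

```python
import math

def stump_predict(stump, x):
    """A stump is (threshold, sign): predict sign if x > threshold, else -sign."""
    threshold, sign = stump
    return sign if x > threshold else -sign

def best_stump(X, y, w):
    """Pick the stump with the smallest weighted error."""
    best, best_err = None, float("inf")
    for threshold in sorted(set(X)):
        for sign in (1, -1):
            stump = (threshold, sign)
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if stump_predict(stump, xi) != yi)
            if err < best_err:
                best, best_err = stump, err
    return best, best_err

def adaboost(X, y, rounds):
    """Return a list of (alpha, stump) pairs forming the composite classifier."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(X, y, w)
        err = min(max(err, 1e-12), 1 - 1e-12)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weights of misclassified samples, decrease the others.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def composite_predict(ensemble, x):
    """Sign of the weighted sum of the base classifiers' outputs."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score > 0 else -1

# Hypothetical 1-D data that no single stump separates.
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(X, y, rounds=3)
predictions = [composite_predict(ensemble, x) for x in X]
```

On this toy set the first stump misclassifies two samples; their weights are raised, so the next stump is chosen to get them right, and the weighted combination of three stumps reproduces all six labels.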
Acknowledgments

The author thanks the entire Biomedical Informatics Group for their decade-long, essential contributions to the development of the algorithms and software described.

References

1. Lean, C. L., Somorjai, R. L., Smith, I. C. P., Russell, P., Mountford, C. E. (2002) Accurate diagnosis and prognosis of human cancers by proton MRS and a three stage classification strategy. Annual Reports on NMR Spectroscopy 48, 71–111.
2. Somorjai, R. L., Dolenko, B., Nikulin, A., Nickerson, P., Rush, D., Shaw, A. et al. (2002) Distinguishing normal from rejecting renal allografts: application of a three-stage classification strategy to MR and IR spectra of urine. Vibrational Spectroscopy 28, 97–102.
3. Somorjai, R. L., Dolenko, B., Baumgartner, R. (2003) Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions. Bioinformatics 19, 1484–1491.
4. Huber, P. J. (1985) Projection pursuit. Ann. Statistics 13, 435–475.
5. Somorjai, R. L., Alexander, M., Baumgartner, R., Booth, S., Bowman, C., Demko, A., Dolenko, B., Mandelzweig, M., Nikulin, A. E., Pizzi, N., Pranckeviciene, E., Summers, R., Zhilkin, P. (2004) A data-driven, flexible machine learning strategy for the classification of biomedical data. In: Dubitzky, W. and Azuaje, F. (eds.) Artificial Intelligence Methods and Tools for Systems Biology, Chapter 5. Computational Biology Series, Vol. 5. Springer, pp. 67–85.
6. Somorjai, R. L., Demko, A., Mandelzweig, M., Dolenko, B., Nikulin, A. E., Baumgartner, R. et al. (2004) Mapping high-dimensional data onto a relative distance plane – a novel, exact method for visualizing and characterizing high-dimensional patterns. Journal of Biomedical Informatics 37, 366–379.
7. Anderson, T. W., Bahadur, R. R. (1962) Classification into two multivariate normal distributions with different covariance matrices. Annals of Mathematical Statistics 33, 420–431.
8. Kohavi, R., John, G. H. (1997) Wrappers for feature subset selection. Artificial Intelligence 273–324.
9. Nikulin, A. E., Dolenko, B., Bezabeh, T., Somorjai, R. L. (1998) Near-optimal region selection for feature space reduction: novel preprocessing methods for classifying MR spectra. NMR in Biomedicine 11, 209–217.
10. Li, J., Zhang, Zh., Rosenzweig, J., Wang, Y. Y., Chan, D. W. (2002) Proteomics and bioinformatics approaches for identification of serum biomarkers to detect breast cancer. Clinical Chemistry 48, 1296–1304.
11. Dataset “JNCI-7-3-02,” downloaded from the NIH/FDA Clinical Proteomics Program Databank (http://clinicalproteomics.steem.com).
12. Vapnik, V. N. (2000) The nature of statistical learning theory, 2nd edition, Statistics for Engineering and Information Science. Springer, New York.
13. Schölkopf, B., Smola, A. J. (2002) Learning with Kernels. Support Vector Machines, Regularization, and Beyond. The MIT Press, Cambridge, Mass.
14. Lee, K. R., Lin, X., Park, D. C., Eslava, S. (2003) Megavariate data analysis of mass spectrometric proteomics data using latent variable projection method. Proteomics 3, 1680–1686.
15. Efron, B. (1982) The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, Philadelphia.
16. Diamandis, E. P. (2003) Proteomic patterns in biological fluids: do they represent the future of cancer diagnostics? Clinical Chemistry 49(8), 1272–1278.
17. Somorjai, R. L., Nikulin, A. E., Pizzi, N., Jackson, D., Scarth, G., Dolenko, B., Gordon, H., Russel, P., Lean, C. L., Delbridge, L., Mountford, C. E., Smith, I. C. P. (1995) Computerized consensus diagnosis: a classification strategy for the robust analysis of MR spectra. I. Application to 1H spectra of thyroid neoplasms. Magnetic Resonance in Medicine 33, 257–263.
18. Wolpert, D. H. (1992) Stacked generalization. Neural Networks 5, 241–259.
19. Schapire, R. R. (1990) The strength of weak learnability. Machine Learning 5, 197–227.
20. Yasui, Y., Pepe, M., Thomson, M. L., Adam, B.-L., Wright Jr., G. L., Qu, Y., Potter, J. D., Winget, M., Thornquist, M., Feng, Z. (2003) A data-analytic strategy
for protein biomarker discovery: profiling of high-dimensional data for cancer detection. Biostatistics 3, 449–463.
21. Diamandis, E. P. (2004) Mass spectrometry as a diagnostic and a cancer biomarker discovery tool. Molecular and Cellular Proteomics 3(4), 367–378.
22. Baggerly, K. A., Morris, J. S., Coombes, K. (2004) Cautions about reproducibility in mass spectrometry patterns: joint analysis of several proteomic data sets. Bioinformatics 20, 777–785.
23. Hu, J., Coombes, K. R., Morris, J. S., Baggerly, K. A. (2005) The importance of experimental design in mass spectrometry experiments: some cautionary tales. Briefings in Functional Genomics and Proteomics 3(4), 322–331.
24. Shin, H. and Markey, M. K. (2006) A machine learning perspective on the development of clinical decision support systems utilizing mass spectra of blood samples. Journal of Biomedical Informatics 39, 2237–2248.
25. Zhu, W., Wang, X., Ma, Y., Rao, M., Glimm, J., Kovach, J. S. (2003) Detection of cancer-specific markers amid massive mass spectral data. Proceedings of the National Academy of Sciences USA 100(25), 14666–14671.
26. Somorjai, R. L. and Pranckeviciene, E. (2006) (Unpublished).
27. Somorjai, R. L., Dolenko, B., Nikulin, A., Nickerson, P., Rush, D., Shaw, A., De Glogowski, M., Rendell, J., Deslauriers, R. (2002) Distinguishing normal from rejecting renal allografts: application of a three-stage classification strategy to MR and IR spectra of urine. Vibrational Spectroscopy 28, 97–102.
Index
Affi-gel Protein A MAPS II kit, 277 Aflatoxin B1 (AFB1), 194 Alkaline phosphatase (ALP) assay, 233, 237 Alpha-fetoprotein, 194 Alzheimer’s disease, 310 Annexin V, 172 ANOVA, analysis of variance, 100, 112, 114, 259, 330, 335, 344 Antibody arrays construction, 270–272 direct labeling methods, for cancer diagnostics, 268–269 formats for, 264–266 labeling and hybridization, of serum samples, 269–270, 272–274 and other proteomic strategies, 263–264 planar, labeling-hybridization methods and, 266–268 printing, 269 scanning and data analysis, 274 Anti-SAPE antibody, 267 ArrayQuant scanners, 281 AutoPixTM , 48. See also Laser-capture microdissection Axon scanners, 281
Bayesian classification methods. See Linear Discriminant Analysis Bayes’s rule, 300 BCA 200 Protein Assay Kit, 277 Bead-based multiplex assays. See also Suspension antibody microarrays detection antibody, 254 diluents, 254 general protocol for, 254–255 sample preparation, 252–254 screening protocol, 255–256 Biological variation analysis (BVA) module, of DeCyder, 112–113 “Biomarker panel,” 11 Bio-Rad Micro Bio-Spin P30 column, 277 Biotinyl-tyramide, 275
BLAST, 352, 358 Blood samples, preanalytical phase collection of, 36 processing of, 37–38 protease inhibitors, 38 serum and plasma specimens, characteristics of, 36–37 Bradford assay, 225
Carboxylated beads, 249. See also Suspension antibody microarrays activation, 251 antibodies coupling to activated, 251 cell-counting chamber and, 252 washing and storage of coupled, 251 1-(5-Carboxypentyl)-1-methylindodi-carbocyanine halide (Cy5) N-hydroxy-succinimidyl ester, 163 1-(5-Carboxypentyl)-1-propylindocarbocyanine halide (Cy3) N-hydroxy-succinimidyl ester, 163 CAST. See Clustering Affinity Search Technique Celecoxib, and cyclooxygenase-2 (COX-2), 183 Charge-couple device (CCD) camera-based imaging system, 268, 293, 332 CIMminer (Clustered Image Maps), 259 Cleavable isotope-coded affinity tag (cICAT) labeling technology, 195, 197, 200–201 Clinical proteomics, 1 biological specimens, 6–7 biomarker discovery and, 9–14 overview and scope of, 2–3 sample specimens and processing techniques, 4–9 Cluster analysis techniques, 297–299, 306 gene expression-based, 307 Clustering Affinity Search Technique, 259 Coomassie brilliant blue (CBB) staining, 68, 332, 339 Creatinine assay, 142 Cyanines (Cy3/Cy5), 264, 333 Cyclooxygenase-2 (COX-2) and celecoxib, 183
398 CyDye labeling, 95, 105–106, 109–110. See also Difference gel electrophoresis (DIGE) technology Cy2-labeled internal standard, 98–99 minimal labeling method, 96 pooled-sample internal standard for, 107 saturation labeling, 96 Cy3-labeled streptavidin, 267 Cytokeratin 19 (CK19), 163 DA-PLS method. See Discriminant analysis–partial least squares method DeCyder software, 101, 112–113, 342. See also Difference gel electrophoresis (DIGE) technology Delayed extraction-matrix assisted laser desorption/ionization time-of-flight mass spectrometry (DE-MALDI-TOF-MS), 194 Dendrogram, 297, 299 Dialysis, 150. See also Urine protein profiling, by 2DE and MALDI-TOF-MS Difference gel electrophoresis (DIGE) technology, 78, 93, 330, 332–333, 342–345 ANOVA, 100, 112, 114 in clinical setting, 103 CyDye labeling, 95, 105–106, 109–110 Cy2-labeled internal standard, 98–99 minimal labeling method, 96 pooled-sample internal standard for, 107 saturation labeling, 96 DeCyder suite of software tools, 101, 112–113 2D gel electrophoresis and poststaining, 94, 110–111 experimental design, 108–109 and statistical confidence, 112–114 extended data analysis (EDA) software module, 101, 113 false discovery rate (FDR), 100 hierarchical clustering (HC), 102 labeling materials, 104–105 LCM and, 163–170 MeOH/CHCl3 protocol, 106 MuDPIT, 97 multivariate statistical analysis, 114–115 principle component analysis, 101 SDS-polyacrylamide gel electrophoresis, 104 software algorithms, 111–112 Student’s t-test, 100, 112, 114 DIGE/MS analysis, 103, 115 Direct labeling, 264, 268 protocol for, 272–274 Discriminant analysis–partial least squares method, 306, 309–311
Index Discrimination power (DP), 303–305 Dithiothreitol (DTT), 68 Dot-plot style alignment, of protein sequence, 358–359 DTT/IAA equilibration procedure, 73 ECM. See Extracellular matrix EDA software. See Extended data analysis software EDC/Sulfo-NHS, 249. See also Suspension antibody microarrays 2DE-MALDI-TOF-MS assay, 194 EnsEmbl, 352, 356 Escherichia coli, 307 Ethylene vinyl acetate (EVA) polymer, 161 Ettan 2D electrophoresis system, 110 Exosomes, 142 ExPASy proteomics tools, 202, 352 Expressed sequence tags (ESTs), 357 Extended data analysis software, 101, 113 Extracellular matrix, 8 and matrix vesicles (MVs) proteomes, MS and, 231–232 alkaline phosphatase assay, 234, 237 immunofluorescence staining and, 235, 239 MC3T3-E1, osteoblast cell line, 233, 236–237, 239 nanoRPLC-MS/MS, 235, 238–239 strong cation exchange liquid chromatography, of peptides, 234–235, 238 Extracted ion chromatogram, 219, 221–222, 224 Fetal bovine serum (FBS), 254 Fisher’s F-test, 302 Flow cytometric analysis, 160 Fluorophores, 264, 267 photobleaching and quenching of, 274–275 Fourier transformer mass spectrometry (FTMS), 172–174 Free flow electrophoresis (FFE), plasma samples fractionation and, 60–61, 67 Frontotemporal dementia, 310 GAORS method. See Genetic algorithm-based optimal region selection method 2D Gaussian function, 312 Gaussian multivariate probability distribution, 300 2-D Gel-electrophoresis (2-D GE), 292. See also 2D-PAGE maps analysis LCM cells analysis by, 77 HER-2/neu positive and -negative breast tumors, 87–88
isoelectric focusing (IEF), 79–80, 83–84 MASCOT search engine, 87 paraffin-embedded sections staining, 81–82 preparation and analysis, 61, 67–69 protein sample preparation, 79, 82–83 SDS-PAGE, 79–80, 84–85 silver staining and image analysis, 80, 85–86 tissue block and tissue section preparation, 78–79, 81 trypsin digestion and MS analysis, 80, 86–87 Gel-free mass spectrometry and LCM, 171–172 Gene expression microarrays, 45 GenePix Pro 3.0 software program, 280–281 GeneScan program, 356 Genetic algorithm-based optimal region selection method, 387–388. See also Proteomic mass spectroscopy gp96, tumor rejection antigen, 169 GRANTA-519, 308
HCC. See Hepatocellular carcinoma HCL. See Hierarchical clustering Hematoxylin and eosin (H&E) staining, tissue sample collection, 44, 47–48 Hepatitis B/C virus (HBV/HCV), 194 Hepatocellular carcinoma, 8, 11, 59, 67, 163, 170, 193 qualitative and quantitative proteomic analysis of cICAT labeling technology, 195, 197, 200–201 2DE-MALDI-TOF-MS assay, 194 2D-LC-MS/MS for, 195–197, 201–202 ExPASy proteomics tools, 202 LCM for, 194–196, 199 nonenzymatic method (NESP), 196, 198–199 toluidine blue removal and protein mixture digestion, 197, 199–200 HERMeS software package, PCA and, 306 HER-2/neu oncogene, 85–86, 163 Hierarchical clustering, 259, 299. See also Cluster analysis techniques High performance liquid chromatography, 169, 171, 183, 212–214 Horseradish peroxidase (HRP), 267 HPLC. See High performance liquid chromatography HSP27 protein, 103 HT-29, COX-2 expressing colon cancer cell line, 183 Human Proteome Organization, 143 Hydrogels, 271. See also Antibody arrays
ICAT labeling. See Isotope-coded affinity tag labeling IMAC-Cu2+ ProteinChips, 134, 136 Image analysis. See also 2D-PAGE maps analysis by fuzzy logic principles image defuzzyfication, 312 image digitalization, 311–312 multi-dimensional scaling (MDS), 315–317 PCA and classification methods, 315 refuzzyfication, 312–313 moment functions, 317 Legendre moments, 318–319 Image Master Platinum software, 339, 341 Immobilized pH gradient strip. See also Two-dimensional electrophoresis (2DE) isoelectric focusing (IEF) with, 60, 65 rehydration of, 64–65 Immunofluorescence staining, 235 InterPro, 352, 361 Iodoacetamide (IAA), 68 IPG strip. See Immobilized pH gradient strip Isotope-coded affinity tag labeling, 78, 195 mass spectrometry (MS) and, 181 celecoxib, cyclooxygenase-2 (COX-2) and, 183 cell culture and harvest, 183, 186 cell lysis, desalting, and protein quantitation, 184–187 cleavable reagents, 182, 185, 187–188 cleaving biotin, 186, 189 labeled peptides purification, 185–186, 188–189 proteins, denaturation and reduction of, 185, 187 quantitative proteomic analysis and, 184 Java Runtime Environment, 370. See also msInspect, for LC-MS data analysis KMC (K-Means/K-Medians Clustering), 259 Kolmogorov–Smirnov test, 335, 339, 341 Kruskal–Wallis test, 335 Laser-capture microdissection, 8, 44–45, 160. See also Tissue sample collection, for proteomics analysis AutoPix™, 48 cells analysis, by 2-D GE, 77 HER-2/neu positive and -negative breast tumors, 87–88 isoelectric focusing (IEF), 79–80, 83–84
MASCOT search engine, 87 paraffin-embedded sections staining, 81–82 protein sample preparation, 79, 82–83 SDS-PAGE, 79–80, 84–85 silver staining and image analysis, 80, 85–86 tissue block and tissue section preparation, 78–79, 81 trypsin digestion and MS analysis, 80, 86–87 development, 161 different labeling techniques and, 170 DIGE and, 163–170 and 2-D GE, 162–163 gel-free mass spectrometry and, 171–172 for HCC and non-HCC hepatocytes isolation, 194–195, 199 LCM lysate, 49–50 and mass spectrometry analysis, 172–174 PixCell II instrument, 48–49, 161 and protein chip technology, 172 separation methods and, 171 for tissue sample collection, 44–45 Veritas™, 48 Laser microdissection and pressure catapulting, 8 LC-ESI-MS/MS. See Liquid chromatography-electrospray ionization tandem mass spectrometry LCM. See Laser-capture microdissection LC-MS data. See Liquid chromatography-mass spectrometry data LC-MS/MS. See Liquid chromatography-tandem mass spectrometry LDA. See Linear Discriminant Analysis Legendre moments, 317–319 Levene’s test, 334 Linear Discriminant Analysis, 300–301, 315–316 Liquid chromatography-mass spectrometry data, 370, 374–376, 377 Liquid chromatography-mass spectrometry data analysis, msInspect for, 369 data viewing and navigation, 371–373 locating peptides in, 373–376 low-quality peptides, elimination of, 376 peptide quantitation, 376–378 software installation for, 370 Liquid chromatography-tandem mass spectrometry, 170, 171 label-free, for biomarker identification, 209–210 albumin/IgG depletion, 211–213 chromatographic alignment, 218–221 data transformation and normalization, 222 HPLC, 212–214 mass spectrometer, 212, 214
MS/MS spectral filtering, 216–217 peptide identification, 217–218 peptide quantification, 221–222 statistical analysis, 223 zoom scan data processing, 214–216 LMPC. See Laser microdissection and pressure catapulting two-dimensional (2D-LC/MS/MS), 78 Lysine labeling, 169 MALDI/SELDI protein profiling, of serum, 125–126 on MALDI-TOF–TOF data collection, 131–132 MB fractionation, of human serum, 131 protein identification by, 132–133 MB-based fractionation, 127, 128, 131 SELDI and MALDI spectra acquisition, 129 SELDI ProteinChip, 130 (Magnetic bead based) on SELDI-TOF, 133 ProteinChip arrays, 134–135 SPA matrix addition, 135 spectra collection on, 135–138 MALDI-TOF-MS. See Matrix-assisted laser desorption time of flight mass spectrometry MALDI-TOF, peptide mass fingerprinting (PMF) and, 62, 71 MALDI-TOF–TOF, serum protein profiling on data collection, 131–132 MB fractionation, of human serum, 131 protein identification by, 132–133 Maleimide labeling, of cysteine sulfhydryls, 96 MARS. See Multiple affinity removal system MASCOT software, 81, 87–88 Mass spectrometry, 58–59, 214 ICAT labeling and, 181 celecoxib, cyclooxygenase-2 (COX-2) and, 183 cell culture and harvest, 183, 186 cell lysis, desalting, and protein quantitation, 184–187 cleavable reagents, 182, 185, 187–188 cleaving biotin, 186, 189 labeled peptides purification, 185–186, 188–189 proteins, denaturation and reduction of, 185, 187 quantitative proteomic analysis and, 184 LCM and, 172–174
Matrix-assisted laser desorption time of flight mass spectrometry, 125–126, 142, 163, 194 LCM and, 171 for urine protein profiling. See Urine protein profiling, by 2DE and MALDI-TOF-MS MAVER-1 cell lines, 308 MC3T3-E1, osteoblast cell line, 233, 236–237, 239 MDS technique. See Multi-dimensional scaling techniques MeOH/CHCl3 protocol, 106 Metalloproteins, 350 MicroSol-IEF, ZOOM®, 60, 65–66 Miniaturized parallelized sandwich immunoassays. See Suspension antibody microarrays MS. See Mass spectrometry MS-Fit software, 81 msInspect, for LC-MS data analysis, 369 data viewing and navigation, 371–373 locating peptides in, 373–376 low-quality peptides, elimination, 376 peptide quantitation, 376–378 software installation for, 370 MS/MS spectral filtering, 216–217 Multi-dimensional scaling techniques, 313, 315–317 MultiExperiment Viewer (MeV), 259 Multiple affinity removal system, 59, 63–64 Multiplexed bead-based flow-cytometry assays, 266 Nanoflow reversed-phase LC-tandem mass spectrometry (nanoRPLC-MS/MS), 233, 235, 238–239 Non-enzymatic sample preparation (NESP), 194, 196, 198–199 One-antibody label-based assays, 264–266 One-dimensional liquid chromatography coupled with tandem mass spectrometry (1D-LC-MS/MS), 201–202. See also Hepatocellular carcinoma 16O/18O isotopic labeling, 78 Osteoblasts, 232. See also Extracellular matrix MC3T3-E1, 233, 236–237, 239 2D-PAGE maps analysis, 291 dedicated software packages and, 292–294 image analysis fuzzy logic, 311–317 moment functions, 317–319 spot volume datasets, analysis of, 294 cluster analysis, 297–299 DA-PLS method, 309–311
linear discriminant analysis, 300–301 pattern recognition methods, 306–309 PLS regression and DA-PLS regression, 306 principal component analysis, 294–297 SIMCA method, 301–305 PALM microlaser dissector, 161 Parkinson’s disease, 310 Partial least squares regression, 306, 308, 338 Pattern recognition methods cluster analysis. See Cluster analysis techniques PCA. See Principal component analysis proteomic mass spectroscopy and. See Proteomic mass spectroscopy SIMCA classification. See Soft-independent model of class analogy method PCA. See Principal component analysis PCa-24 protein, in epithelial cells, 172 PDB. See Protein data bank PDQuest system, 293, 308 Peptide mass fingerprinting, MALDI-TOF and, 62, 71 Peptide/protein separation system, 171 PerkinElmer scanners, 281 Pfam, 352, 360 PIN. See Prostatic intraepithelial neoplasia PIVKA-II, 194 PixCell II system, 48–49, 77, 82–83, 161. See also Laser-capture microdissection Planar antibody arrays, 248, 264. See also Antibody arrays main formats of, 265 types of, labeling-hybridization methods and, 266–268 10plex soluble receptor assay, 255–256, 258. See also Bead-based multiplex assays PLS regression. See Partial least squares regression PMF. See Peptide mass fingerprinting PMS. See Proteomic mass spectroscopy Position-specific scoring matrix, 361 Post-translational modification (PTM) profiling, on selected spots, 71–72 Principal component analysis, 101, 259, 294–297, 308, 315–316, 343. See also 2D-PAGE maps analysis Escherichia coli, 307 for explorative data analysis, 336–338 in HERMeS software package, 306 U937 human lymphoma cell line and, 307 Prostatic intraepithelial neoplasia, 44 Protein chip technology and LCM, 172 Protein data bank, 352, 360–361 Protein precipitation, 143–144
Protein profiling of human plasma samples, by two-dimensional electrophoresis, 57 coomassie brilliant blue G-250 staining, 68 destaining, in-gel deglycosylation and in-gel tryptic digestion, 61–62, 69 2D gels preparation and analysis, 61, 67–69 difference in gel electrophoresis (DIGE) system, 59 free flow electrophoresis (FFE), samples fractionation by, 60–61, 67 high-abundance proteins depletion, by immunoaffinity column, 59, 63–64 HPPP, 58 IPG gel strip rehydration, 64–65 isoelectric focusing (IEF), with IPG strip, 60, 65 MALDI plating and peptides desalting, 62, 69–71 mass spectrometry (MS), 58–59 microscale solution isoelectric focusing, ZOOM®, 60, 65–66 peptide mass fingerprinting, MALDI-TOF and, 62, 71 PTMs profiling, on selected spots, 71–72 samples preparation, 59, 62 TCA/acetone precipitation, 64 Proteomic data, statistical analysis, 327 classical dyes, 339–342 confirmatory univariate data analysis, 333–335 DIGE approach, 342–345 experimental design for, 328 data processing, 330–333 pooling, 330 replicates, 329–330 exploratory multivariate data analysis, 335 marker selection, 338–339 principal component analysis, 336–338 Proteomic mass spectroscopy, 383 statistical classification strategy (SCS) for classifier aggregation, 390 data visualization, 384–385 feature selection/extraction (FSE), 386–388 preprocessing, 385–386 robust classifier development, 388–390 Proteomics analysis, for tissue sample collection formalin fixation, 43–44 hematoxylin staining, 47–48 immunocapture procedure, 46 immunofluorescence staining, 48 laser-capture microdissection (LCM), 44–45 AutoPix™, 48 PixCell II instrument, 48–49
Veritas™, 48 LCM lysate, 49–50 SELDI-TOF-MS, 46 PSSM. See Position-specific scoring matrix QTC (QT CLUST), 260 Resonance light scattering (RLS), 268 Reverse protein arrays, 268 Rolling-circle amplification (RCA), 268 SCX-LC. See Strong cation exchange liquid chromatography SDS-PAGE. See Sodium dodecyl sulfate-polyacrylamide gel electrophoresis SELDI. See Surface-enhanced laser desorption/ionization SELDI-TOF. See Surface-enhanced laser desorption/ionization time-of-flight Self Organizing Maps (SOM), 259 Self Organizing Tree Algorithm (SOTA), 259 Shapiro-Wilk test, 334, 339 Significance Analysis of Microarrays (SAM), 259 Silver staining, 80, 332–333. See also Laser-capture microdissection and image analysis, 85–86 SIMCA method. See Soft-independent model of class analogy method SKBR-3, breast cancer cell line, 171 Sodium dodecyl sulfate-polyacrylamide gel electrophoresis, 84–85, 94, 96, 104, 110–111 isoelectric focusing (IEF) and, 79–80 PROTEAN II xi Cell system (Bio-Rad) for, 84 Soft-independent model of class analogy method, 301–305, 307–308 Streptavidin-R-Phycoerythrin (SAPE), 267 Strong cation exchange liquid chromatography, 234–235, 238 Strong cation exchange liquid chromatography, of peptides, 233, 234–235, 238 Student’s T-test, 334 2-(4-Sulfophenylazo)-1,8-dihydroxy-3,6-naphthalenedisulfonic acid (SPADNS), 60, 67 Support vector machines, 388–389. See also Proteomic mass spectroscopy Surface-enhanced laser desorption/ionization, 9, 13, 125–126, 142, 172, 194 serum protein profiling on, 133 ProteinChip arrays, 134–135 SPA matrix addition, 135 spectra collection on, 135–138
Suspension antibody microarrays, 247–248 bead-based multiplex assays processing, 252–256 limit of detection (LOD), 257 miniaturized multiplexed protein assays, analytical performance, 256–259 pattern generation, 259–260 principle of, 249 production, coupling to carboxylated microspheres, 249–252 SVMs. See Support vector machines
TAAs arrays. See Tumor-associated antigen arrays TCA/acetone precipitation, 2DE and, 64 Tissue sample collection, for proteomics analysis formalin fixation, 43–44 hematoxylin staining, 47–48 immunocapture procedure, 46 immunofluorescence staining, 48 laser-capture microdissection (LCM), 44–45 AutoPix™, 48 PixCell II instrument, 48–49 Veritas™, 48 LCM lysate, 49–50 SELDI-TOF-MS, 46 Tributylphosphine (TBP), 68 Trichloroacetic acid (TCA) precipitation, 143–144, 146–147, 151 Trifluoroacetic acid (TFA), 182 Tris buffer, 277 TTEST (T-tests), 259 Tumor-associated antigen arrays, 266, 269 Two-dimensional electrophoresis (2DE), 11, 194, 328 biological replicates, 329–330 LCM and, 162–163 for protein profiling of human plasma samples, 57 coomassie brilliant blue G-250 staining, 68 destaining, in-gel deglycosylation and in-gel tryptic digestion, 61–62, 69 2D gels preparation and analysis, 61, 67–69 difference in gel electrophoresis (DIGE) system, 59 free flow electrophoresis (FFE), samples fractionation by, 60–61, 67 high-abundance proteins depletion, by immunoaffinity column, 59, 63–64 HPPP, 58 IPG gel strip rehydration, 64–65 isoelectric focusing (IEF), with IPG strip, 60, 65
MALDI plating and peptides desalting, 62, 69–71 mass spectrometry (MS), 58–59 microscale solution isoelectric focusing, ZOOM®, 60, 65–66 peptide mass fingerprinting, MALDI-TOF and, 62, 71 PTMs profiling, on selected spots, 71–72 samples preparation, 59, 62 TCA/acetone precipitation, 64 technical replicates, 329–330 for urine protein profiling. See Urine protein profiling, by 2DE and MALDI-TOF-MS Two-dimensional fluorescence difference gel electrophoresis (2-D DIGE), 78. See also Difference gel electrophoresis (DIGE) technology Two-dimensional liquid chromatography tandem mass spectrometry (2D-LC-MS/MS), 78, 170. See also Liquid chromatography-tandem mass spectrometry for HCC and non-HCC hepatocytes isolation, 195–197, 201–202 Two-dimensional polyacrylamide gel electrophoresis (2D PAGE), 162–163, 174. See also 2D gel electrophoresis, 2D gels Two-factor ANOVA (TFA), 259 Ultrafiltration technique, 144 Urine protein profiling, by 2DE and MALDI-TOF-MS, 141–142 analytical/profiling techniques, 145–146 organic solvent precipitation protocol, 145, 147–148 protein precipitation, 143–144 TCA/acetone precipitation protocol, 145–147 ultrafiltration-SPE, 144–145, 148–149 urine SPE, 149 Veritas™, 48. See also Laser-capture microdissection Web-based tools, for protein classification, 349 BLAST, 352, 358 dot-plot style alignment, of protein sequence, 358–359 EnsEmbl, 352, 356 evolution-based classification schemes, 351 ExPASy, 352 expressed sequence tags (ESTs), 357 GeneScan program, 356
InterPro, 352, 361 MEROPS, 361 metalloproteins, 350 PDB, 352, 360–361 Pfam, 352, 360 PRINTS, 361 PROSITE, 361 sequence and structure of proteins and, 352–356
SMART, 360 Western blotting protocols, 275 XIC. See Extracted ion chromatogram ZOOM®, MicroSol-IEF, 60, 65–66 Zoom scan triple-play experiment, 214
Update java to latest version

I Want To Update Java

Update java to 64 bit

Update Java To Latest Version
