Thursday, June 4, 2020

Machine learning uncovers cell identity regulator by histone code

Curating CIGs

We first performed an intensive query of PubMed using the search expression “cell identity”[Title/abstract] OR “cell marker”[Title/abstract], which returned 7581 PubMed abstracts. We then searched the abstract for names of 297 cell types listed in the SHOGoiN database and ranked cell types by number of associated abstracts. To retrieve CIGs, we then conducted a manual literature review for the ten top-ranked cell types that also have RNA-sequencing (RNA-seq) data and ChIP-Seq data for the histone modifications of H3K4me3, H3K4me1, H3K27ac, and H3K27me3. We also defined control genes by requiring that their names did not appear together with the name of the given cell type in literature or any of five major annotation database, i.e., the Entrez Gene, Gene Cards, Ensembl, Gene Ontology, and KEGG. We then further selected a random subset of the control genes, so that number of control genes in the subset is the same as number of our curated CIGs.

Data collection

The RNA-seq, H3K4me3, H3K4me1, H3K27ac, and H3K27me3 ChIP-seq data for the ten well-defined cell types (H1-hESC, CD34 + HPC, B cell, HUVECs, human mammary epithelial cells, neural cells, MRG cell, normal human lung fibroblast, mesenchymal stem cell, human skeletal muscle myoblast) and landscape analysis are downloaded from GEO database and ENCODE project website (

ChIP-seq and RNA-seq data analysis

Human reference genome sequence version hg19 and UCSC Known Genes were downloaded from the UCSC Genome Browser website39. RNA-seq raw reads were mapped to the human genome version hg19 using TopHat version 2.1.1 with default parameter values. Expression value for each gene was determined by the function Cuffdiff in Cufflinks version 2.2.1 with default parameter values. Afterwards, quantile normalization of gene expression values was performed across cell and tissue types.

For ChIP-seq data, reads were first mapped to hg19 human genome by Bowtie version 1.1.0:

bowtie -p 8 -m 1 --chunkmbs 512 –best hg19_reference_genome fastq_file

Wig file is generated using DANPOS 2.2.3:

python dpeak sample –b input --smooth_width 0 -c 25000000 --frsz 200 --extend 200 –o output_dir

Quantile normalization is performed using DANPOS 2.2.3:

python wiq --buffer_size 50 hg19.chrom.sizes.xls wig –reference reference.qnor.sort.wiq --rformat wiq --rsorted 1

By this method, ChIP-seq data from different cell and tissue types were all normalized to have the same quantiles.

Bigwig is generated using the tool WigToBigWig with the following command line:

wigToBigWig -clip sample.bgsub.Fnor.wig hg19.sizes.xls

The tool WigToBigWig was downloaded from the ENCODE project website ( The “hg19.sizes.xls” in the command line is a file containing the length of each chromosome in the human genome. We then submitted the bigWig file to the UCSC Genome Browser ( to visualize ChIP-seq signal at each base pair39,40.

Peak calling and feature value calculation are performed using the GridGO function in the CEFCIG framework (detailed algorithm is described below).

In feature value calculation, skewness and kurtosis values are centered on zero. If no signal is being detected in the peak calling region, skewness and kurtosis are set to zero.

GridGO algorithm

We developed GridGO, a grid-based genetic method to optimize bioinformatics parameters for detecting epigenetic signature of CIGs. We use GridGO to optimize three important parameters, including the height cutoff to define ChIP-seq enrichment peak, the upstream distance cutoff to assign a peak to a nearby gene, and the downstream distance cutoff to assign a peak to a nearby gene. However, GridGO is designed to also allow optimizing different number of parameters. For simplicity, we will describe details of the algorithm by an example in which the upstream and downstream distance cutoffs to assign a ChIP-seq peak to nearby genes are set to be the same, so that GridGO will optimize only two parameters including the height cutoff to define ChIP-Seq enrichment peak and the distance cutoff to assign a peak to a nearby gene (Supplementary Fig. 2a). In the first iteration of optimization, the entire two-dimensional parameter space will be divided into m equal-size grids. Then the parameter values in the center of each grid will be used to define ChIP-Seq enrichment peaks and to assign the peaks to nearby genes. Afterwards, P-value of difference in a peak feature (epigenetic signature) between CIGs and control genes will be determined by Wilcoxon’s test. The grid with the lowest P-value will be the optimal grid saved for the second iteration. In the second iteration, the grid saved in the first iteration will be divided into a new set of m small grids, which will be tested as the previous iteration to select an optimal grid saved for the third iteration. Such genetic evolution of parameter grid keeps going until the number of iteration become larger than a given value n or the new optimal grid is not better than the previous optimal grid. To estimate a potential overfitting effect, we used only 80% of training genes in the GridGo optimization and build the CIGdiscover model based on parameters optimized by these genes. Then the performance of CIGdiscover on these 80% genes and the rest 20% genes were compared, and little overfitting effect was observed.

Backward and forward feature selections

In backward feature elimination, all features are included in the model at the beginning. In each round of iteration, after trying to remove individual features from the model and test the influence on the model, one feature with least impairment to the performance of the model is removed. In contrast, in forward feature construction, there is no feature in the model in the beginning. In each round of iteration, after trying to add individual features from the feature pool and test the influence on the model, the feature that led to the best improvement to the model was added into the model. The performance is measured by the closest distance between ROC curve and the top left corner of the panel.

Specifically, in an iteration I of the forward feature construction process (Supplementary Fig. 3a right section), let Si−1 = [s1, s2, …, si−1] be the combination of features selected by the previous i−1 iterations, and let Ci−1 = [ci, ci+1, …, cn] be the remaining candidate features. Our algorithm will combine ci with Si−1 to form a new candidate combination and evaluate the performance of this combination by 100 times cross-validations. Similarly, the algorithm will combine ci+1, ci+2, …, or cn with Si−1 to form n − i additional candidate combinations, and evaluate the performance of each candidate combination by 100 times cross-validations. Among these n − i + 1 candidate combinations, the combination that shows the best performance will be the combination Si selected by iteration i.

Training CIGdiscover

Logistic regression model is built on the base of the Sklearn logistic regression library. Data are centralized and normalized using the Preprocessing library in Sklearn. Cross-validation is repeated 100 times by splitting the data into 80% training and 20% test data. Penalty is set as L1 to prevent overfitting in all experiments.

We can denote the response for case i as yi, the jth predictor for case i as xij, the regression coefficient and the intercept corresponding to the jth predictor as βj and μ. Let θ = (μ,β1, … ,βp)t and xi = (xi = (xi1, … ,xip)t, we estimate the parameter vector θ by maximizing the log-likelihood

$$Lleft( theta right) = mathop {sum}limits_{i = 1}^n {left[ {y_ilog left( {p_i} right) + left( {1 - y_i} right)log left( {1 - p_i} right)} right]}$$


The Lasso method is implemented by fixing an upper bound on the sum of the absolute value of the model parameters, which can be denoted by penalizing the negative log-likelihood with L1-norm. In the Logistic regression model, the negative log-likelihood is denoted by

$$- mathop{sum}limits_{i = 1}^{n} log left( p_{beta} left( {y_{i}{mathrm{|}}x_{i}} right) right) = mathop {sum}limits_{i = 1}^{n} left{-y_{i} left(mathop {sum}limits_{j = 0}^{p} beta_{j} x^{left( j right)}right) + {mathrm{log}}left(1 + {mathrm{exp}}left(mathop{sum}limits_{j = 0}^{p} beta_{j} x^{(j)}right)right)right}$$


The loss function ρ can be written as

$$rho _{left( beta right)}(x,y) = - yleft(mathop{sum}limits_{j = 0}^{p} beta_{j} x^{left( j right)}right) + log left( 1 + exp left( mathop{sum}limits_{j = 0}^{p} beta_{j} x^{left( j right)} right) right)$$


The Lasso estimator of a Logistic regression model is defined as

$$hat beta left( lambda right) = {mathrm{{argmin}}}_beta left( {n^{ - 1}mathop {sum}limits_{i = 1}^n {rho _{left( beta right)}left( {x_i,y_i} right) + lambda parallel beta parallel _1} } right)$$


For each gene, the signed distance to the hyperplane was used as CIG score. The decision threshold (CIG score cutoff) for CIGs were determined by distance to the top left corner of the ROC curve41.

P-value between a pair of ROC curves was calculated by Hanley’s method42. First, a critical ratio z will be defined as:

$$z = frac{{A_1 - A_2}}{{sqrt {{mathrm{SE}}_1^2 + {mathrm{SE}}_2^2 - 2r{mathrm{SE}}_1{mathrm{SE}}_2} }}$$


where A1 and SE1 are the observed area under curve and estimated SE of area under curve for ROC curve 1; where A2 and SE2 are the associated values for ROC curve 2. r represents the correlation between A1 and A2 via querying the table provided in Hanley’s method42. Two intermediate correlation coefficients are required to calculate r. First, rcig, is the Pearson correlation between the CIG scores given to CIGs by the two models; rnoncig, is the Pearson correlation between the CIG scores given to non-CIG genes by the two models. Furthermore, r is acquired by querying the table42 using (rcig + rnoncig)/2 and (A1 + A2)/2. SE of the ROC areas are calculated based on the following equation43.

$${mathrm{{SE}}} = sqrt {frac{{Aleft( {1 - A} right) + left( {{mathrm{na}} - 1} right)left( {Q_1 - A * A} right) + ({mathrm{nn}} - 1)(Q_2 - A * A)}}{{{mathrm{na}} * {mathrm{nn}}}}}$$


where A is the area under the curve, na and nn are the number of control genes and CIG genes, respectively, and Q1 and Q2 are estimated by: Q = A/(2 − A), Q = 2 A ∗ A/(1 + A). Then this quantity z is referred to tables of normal distributions and used to estimate p-value between the two ROC curves.

Correlation and collinearity test

Spearman correlation coefficient between features was calculated using the Python Pandas library. For collinearity analysis, variance inflation factor is calculated using the SciPy library.

Analyze the influence of cell types

For cross test, identity genes from only five randomly selected cell types were used to train the model and identity genes from the other cell types were used to test the model. Overlapped genes are removed from the training and test datasets. Similarly, to analyze how the number of cell types influence the model, identity genes that were used for training and for testing were from different cell types. For parallel test, genes that were used for training and testing were from the same cell types.

Analyze the influence of noises

To test the robustness of the model, labels of genes were swapped in four different ways: only swap identity genes to negative control genes (false negative), only swap negative control genes to identity genes (false positive), swap equal number of identity genes to control genes and control genes to identity genes (bidirectional false), and randomly change the labels of genes. After swapping, the genes were subject to CIGdiscover to test its performance.

Training CIGnet

Data related to network nodes (genes), edges, and closeness between nodes were downloaded from CellNet website ( For each cell-type-specific cell identity subnetwork, only the CIGs were used. Known master transcription factors were defined as in Fig. 1. Control transcription factors are randomly selected from all known transcription factors, except the master transcription factors. Due to the small number of master transcription factors in the training datasets, SMOTE is used to expand the positive genes in training datasets following the distribution of feature values of the known master transcription factors. All performance tests for the logistic regression model are conducted in the same way as described for CIGdiscover. Cell-type specificity is calculated using Tau index44 and scaled to be between −1 and 1.

Pathway analysis

Top 500 identity genes ranked by the CIG score are selected to perform the DAVID pathway analysis (DAVID 6.8 The pathways with q-values (adjusted P-values using the Benjamini method) smaller than 0.05 were defined as significantly enriched.

Cell lineage hierarchy analysis

For heatmap and hierarchy cluster analysis, we retrieved top 500 identity genes for each cell or tissue type to create a heatmap of cell identity score using the tool Morpheus ( with default parameters.

Data visualization

Decision boundary analysis is performed using Sklearn and visualized by Matplotlib. ROC curves are created using Matplotlib. Bar plots is created using the Prism statistical software package (Graph Pad Software, Inc., La Jolla, CA, USA). Scatter plots are created using Microsoft Excel software.

Average density of epigenetic marks in promoter region around TSS were plotted using the Profile function in DANPOS version 2.2.3:

python profile wig --genefile_paths putative_identity_genes.txt,putative_negative_identity_genes.txt --genefile_aliases positive,negative --heatmap 1 --name outdir --genomic_sites TSS --flank_up 3000 --flank_dn 10000

Heatmap for density of epigenetic marks around TSS is plotted using the software MeV46 version 4.8.1.

We have added figures for visualization of CIG networks for individual cell or tissue types in the “network visualization section” of our CIGDB at

Maintenance of human PSCs

Human PSCs were maintained on Matrigel in mTesR1 medium. Cells were passaged approximately every 6 days. To passage PSCs, cells were washed with Dulbecco’s modified Eagle’s medium (DMEM)/F12 medium (no serum) and incubated in 1 mg/ml dispase until colony edges started to detach from the dish. The dish was then washed three times with DMEM/ F12 medium. After the final wash, colonies were scraped off of the dish with a cell scraper and gently triturated into small clumps and passaged onto fresh Matrigel-coated plates.

Human PSCs differentiation to ECs

Differentiation is induced 4 days after PSCs passaging (day 0). Mesoderm specification is induced by the addition of bone morphogenetic protein 4, activin A, small-molecule inhibitor of glycogen synthase kinase-3β (CHIR99021), and vascular endothelial growth factor (VEGF). Mesoderm inductive factors are removed on day 3 of differentiation and are replaced with vascular specification medium supplemented with VEGF and the transforming growth factor-β pathway small-molecule inhibitor SB431542. Vascular specification medium is additionally refreshed on days 7 and 9 of differentiation. Flow cytometric analysis of differentiated ECs is performed on day 10.

CRISPR gRNA and lentiviral vector design

Two open-access software, Cas-Designer ( and CRISPR design (, were used to design guide RNAs (gRNA) targeted to candidate gene. Two guides were designed per gene shown as in Supplementary Data 4.

Target DNA oligos were purchased from IDT (Integrated DNA Technologies) and cloned into the lentiCRISPR v2 plasmid2 (Addgene plasmid# 52961) via BsmBI restriction enzyme sites upstream of the scaffold sequence of the U6-driven gRNA cassette. All plasmids were sequenced to confirm successful ligation.

Lentiviral constructs

Lentivirus was packaged by co-transfection of constructs with second-generation packaging plasmids pMD2.G, psPAX2 into a six-well plate with HEK293T cells. After the first 24 h of transfection (250 ng of pMD2.G, 750 ng of psPAX2, 1 µg of target plasmid), the medium was changed to DMEM and the supernatants 48 and 72 h after transfection were pooled, filtered through a 0.45 µm filter, and used for infection.

Cell culture and lentiviral transduction

HUVECs were purchased from Lonza (C2517A) and human pluripotent stem cell (hPSC) was a kind gift from Dr John Cooke’s group. All cells used in this study were within 15 passages after receipt. HUVECs were cultured in 5% CO2 and maintained in vitro in Endothelial Growth Basal Medium with EGM-2 SingleQuot Kit. hPSC was cultured in 5% CO2 and maintained in vitro in mTeSR1 basal medium with mTeSR1 supplement. Those cell lines were mycoplasma negative during routine tests. HUVECs were grown to 70% confluence and infected with lentiviral vectors. The media was changed 8 h after viral transduction and incubated for 48 h before selection with 1 µg/mL puromycin for 3 days. HUVECs were collected for cell proliferation assay, nitric oxide production, and genomic DNA extraction. On the second day of PSC passaging, PSCs were infected with lentiviral vectors. The media was changed 6 h after viral transduction. Differentiation was induced 3 days after virus infection.

T7 endonuclease I assay

Genomic DNA from lentiviral transduced cells were extracted with a Quick-DNA Miniprep Kit (Zymo Research, Irvine, CA) following manufacturer’s protocol and were quantified using a Synergy 2 Multi-Mode Reader (BioTek, Winooski, VT, USA). The targeted regions were PCR-amplified with amfiSure PCR Master Mix (GenDEPOT, Barker, TX, USA) using primers flanking the target sites. Primers sequence are shown in Supplementary Data 4.

We denatured 200 ng of the PCR products and then slowly hybridized to form heteroduplexes using the following program settings: 95 °C for 5 min, 95°–85 °C at −2 °C/s, 85°–25 °C at −0.1 °C/s. Heteroduplexes were digested with T7 endonuclease I (New England Biolabs, Ipswich, MA, USA) at 37 °C for 30 min. In addition, the digested products were separated on a 2% TAE agarose gel for analysis. Images were captured using the ChemiDoc XRS+ Molecular Imager system (Bio-Rad, Hercules, CA, USA).

Cell proliferation assay

After viral transduction and puromycin selection, HUVECs were plated in 96-well plates at a density of 1000 cells per well and allowed to attach for 24 h. Viability was measured utilizing the CellTiter-Glo® Luminescent Cell Viability Assay (Promega, Madison, WI, USA). Results were read at 24, 48, and 72 h on the on Synergy 2 Multi-Mode Reader.

Nitric oxide production assay

HUVECs (8 × 103) were added to triplicate wells of a 96-well plate with 200 ml media. 4-amino-5methylamino-2979-difluorofluorescein diacetate (10 mM) in anhydrous dimethylsulfoxide was added to each well and the plate was incubated for 30 min at 37 °C and 5% CO2. The cells were washed with PBS, 200 ml fresh media was added, and the plate was incubated for an additional 30 min. Fluorescence was measured using a Synergy 2 Multi-Mode Reader.

Endothelial cell tube formation assay

Ninety-six-well plates were coated with 50 μl Matrigel (R&D Systems, catalog number 3432-005-01) and incubated at 37 °C for 30 min. Control (1 × 104) and CRISPR/Cas9-edited HUVECs in 100 μl EGM medium were seeded in each well, respectively. After 4 h, images were captured using the Leica epi-fluorescence microscope. Branches number and length were quantified using ImageJ software.

Fluorometric LDL uptake assay

Control and CRISPR/Cas9-edited HUVECs were seeded on a 48-well plate. Alexa Fluor 594 AcLDL (Thermo Fisher Scientific, catalog number L35353) was added to the culture medium for the final 4 h of the incubation time. HUVECs were washed, trypsinized, and centrifuged at 200 × g for 5 min, then resuspended in FACSB-10. Fluorescence was determined using a flow cytometer (LSR II, Becton-Dickinson, San Jose, CA, USA) and the data were analyzed using FlowJo software.

Flow cytometric analysis

Ten days after differentiation, PSCs were trypsinized, centrifuged at 200 × g for 5 min, resuspended in FACSB-10 (FACS buffer–10% fetal bovine serum) and incubated with anti-human-VE-cadherin (Invitrogen, catalog number 17-0319-41, 1:150) and anti-human CD31 (Invitrogen, catalog number 53-1449-41, 1:150) for 30 min on ice. Fluorescence was determined using a flow cytometer (LSR II, Becton-Dickinson, San Jose, CA, USA) and the data were analyzed using FlowJo software.

Statistical analysis

For bar plots and box plots, P-values are calculated with Wilcoxon’s test (two-tail). For ROC curves, P-values are calculated by Hanley’s method42 (two-tail). Q-values (Benjamini) for pathways were directly determined using DAVID ( For CRISPR9 experiments, data were presented as mean ± SD of six individual experiments. Statistical analysis was performed with Student’s T-test by means of the Prism statistical software package (Graph Pad Software, Inc., La Jolla, CA, USA).

No comments:

Post a Comment

If you have any doubts. Please let me know.