Showing posts with label Machine learning. Show all posts
Showing posts with label Machine learning. Show all posts

Thursday, June 4, 2020

Machine learning uncovers cell identity regulator by histone code

Curating CIGs

We first performed an intensive query of PubMed using the search expression “cell identity”[Title/abstract] OR “cell marker”[Title/abstract], which returned 7581 PubMed abstracts. We then searched the abstract for names of 297 cell types listed in the SHOGoiN database and ranked cell types by number of associated abstracts. To retrieve CIGs, we then conducted a manual literature review for the ten top-ranked cell types that also have RNA-sequencing (RNA-seq) data and ChIP-Seq data for the histone modifications of H3K4me3, H3K4me1, H3K27ac, and H3K27me3. We also defined control genes by requiring that their names did not appear together with the name of the given cell type in literature or any of five major annotation database, i.e., the Entrez Gene, Gene Cards, Ensembl, Gene Ontology, and KEGG. We then further selected a random subset of the control genes, so that number of control genes in the subset is the same as number of our curated CIGs.

Data collection

The RNA-seq, H3K4me3, H3K4me1, H3K27ac, and H3K27me3 ChIP-seq data for the ten well-defined cell types (H1-hESC, CD34 + HPC, B cell, HUVECs, human mammary epithelial cells, neural cells, MRG cell, normal human lung fibroblast, mesenchymal stem cell, human skeletal muscle myoblast) and landscape analysis are downloaded from GEO database and ENCODE project website (

ChIP-seq and RNA-seq data analysis

Human reference genome sequence version hg19 and UCSC Known Genes were downloaded from the UCSC Genome Browser website39. RNA-seq raw reads were mapped to the human genome version hg19 using TopHat version 2.1.1 with default parameter values. Expression value for each gene was determined by the function Cuffdiff in Cufflinks version 2.2.1 with default parameter values. Afterwards, quantile normalization of gene expression values was performed across cell and tissue types.

For ChIP-seq data, reads were first mapped to hg19 human genome by Bowtie version 1.1.0:

bowtie -p 8 -m 1 --chunkmbs 512 –best hg19_reference_genome fastq_file

Wig file is generated using DANPOS 2.2.3:

python dpeak sample –b input --smooth_width 0 -c 25000000 --frsz 200 --extend 200 –o output_dir

Quantile normalization is performed using DANPOS 2.2.3:

python wiq --buffer_size 50 hg19.chrom.sizes.xls wig –reference reference.qnor.sort.wiq --rformat wiq --rsorted 1

By this method, ChIP-seq data from different cell and tissue types were all normalized to have the same quantiles.

Bigwig is generated using the tool WigToBigWig with the following command line:

wigToBigWig -clip sample.bgsub.Fnor.wig hg19.sizes.xls

The tool WigToBigWig was downloaded from the ENCODE project website ( The “hg19.sizes.xls” in the command line is a file containing the length of each chromosome in the human genome. We then submitted the bigWig file to the UCSC Genome Browser ( to visualize ChIP-seq signal at each base pair39,40.

Peak calling and feature value calculation are performed using the GridGO function in the CEFCIG framework (detailed algorithm is described below).

In feature value calculation, skewness and kurtosis values are centered on zero. If no signal is being detected in the peak calling region, skewness and kurtosis are set to zero.

GridGO algorithm

We developed GridGO, a grid-based genetic method to optimize bioinformatics parameters for detecting epigenetic signature of CIGs. We use GridGO to optimize three important parameters, including the height cutoff to define ChIP-seq enrichment peak, the upstream distance cutoff to assign a peak to a nearby gene, and the downstream distance cutoff to assign a peak to a nearby gene. However, GridGO is designed to also allow optimizing different number of parameters. For simplicity, we will describe details of the algorithm by an example in which the upstream and downstream distance cutoffs to assign a ChIP-seq peak to nearby genes are set to be the same, so that GridGO will optimize only two parameters including the height cutoff to define ChIP-Seq enrichment peak and the distance cutoff to assign a peak to a nearby gene (Supplementary Fig. 2a). In the first iteration of optimization, the entire two-dimensional parameter space will be divided into m equal-size grids. Then the parameter values in the center of each grid will be used to define ChIP-Seq enrichment peaks and to assign the peaks to nearby genes. Afterwards, P-value of difference in a peak feature (epigenetic signature) between CIGs and control genes will be determined by Wilcoxon’s test. The grid with the lowest P-value will be the optimal grid saved for the second iteration. In the second iteration, the grid saved in the first iteration will be divided into a new set of m small grids, which will be tested as the previous iteration to select an optimal grid saved for the third iteration. Such genetic evolution of parameter grid keeps going until the number of iteration become larger than a given value n or the new optimal grid is not better than the previous optimal grid. To estimate a potential overfitting effect, we used only 80% of training genes in the GridGo optimization and build the CIGdiscover model based on parameters optimized by these genes. Then the performance of CIGdiscover on these 80% genes and the rest 20% genes were compared, and little overfitting effect was observed.

Backward and forward feature selections

In backward feature elimination, all features are included in the model at the beginning. In each round of iteration, after trying to remove individual features from the model and test the influence on the model, one feature with least impairment to the performance of the model is removed. In contrast, in forward feature construction, there is no feature in the model in the beginning. In each round of iteration, after trying to add individual features from the feature pool and test the influence on the model, the feature that led to the best improvement to the model was added into the model. The performance is measured by the closest distance between ROC curve and the top left corner of the panel.

Specifically, in an iteration I of the forward feature construction process (Supplementary Fig. 3a right section), let Si−1 = [s1, s2, …, si−1] be the combination of features selected by the previous i−1 iterations, and let Ci−1 = [ci, ci+1, …, cn] be the remaining candidate features. Our algorithm will combine ci with Si−1 to form a new candidate combination and evaluate the performance of this combination by 100 times cross-validations. Similarly, the algorithm will combine ci+1, ci+2, …, or cn with Si−1 to form n − i additional candidate combinations, and evaluate the performance of each candidate combination by 100 times cross-validations. Among these n − i + 1 candidate combinations, the combination that shows the best performance will be the combination Si selected by iteration i.

Training CIGdiscover

Logistic regression model is built on the base of the Sklearn logistic regression library. Data are centralized and normalized using the Preprocessing library in Sklearn. Cross-validation is repeated 100 times by splitting the data into 80% training and 20% test data. Penalty is set as L1 to prevent overfitting in all experiments.

We can denote the response for case i as yi, the jth predictor for case i as xij, the regression coefficient and the intercept corresponding to the jth predictor as βj and μ. Let θ = (μ,β1, … ,βp)t and xi = (xi = (xi1, … ,xip)t, we estimate the parameter vector θ by maximizing the log-likelihood

$$Lleft( theta right) = mathop {sum}limits_{i = 1}^n {left[ {y_ilog left( {p_i} right) + left( {1 - y_i} right)log left( {1 - p_i} right)} right]}$$


The Lasso method is implemented by fixing an upper bound on the sum of the absolute value of the model parameters, which can be denoted by penalizing the negative log-likelihood with L1-norm. In the Logistic regression model, the negative log-likelihood is denoted by

$$- mathop{sum}limits_{i = 1}^{n} log left( p_{beta} left( {y_{i}{mathrm{|}}x_{i}} right) right) = mathop {sum}limits_{i = 1}^{n} left{-y_{i} left(mathop {sum}limits_{j = 0}^{p} beta_{j} x^{left( j right)}right) + {mathrm{log}}left(1 + {mathrm{exp}}left(mathop{sum}limits_{j = 0}^{p} beta_{j} x^{(j)}right)right)right}$$


The loss function ρ can be written as

$$rho _{left( beta right)}(x,y) = - yleft(mathop{sum}limits_{j = 0}^{p} beta_{j} x^{left( j right)}right) + log left( 1 + exp left( mathop{sum}limits_{j = 0}^{p} beta_{j} x^{left( j right)} right) right)$$


The Lasso estimator of a Logistic regression model is defined as

$$hat beta left( lambda right) = {mathrm{{argmin}}}_beta left( {n^{ - 1}mathop {sum}limits_{i = 1}^n {rho _{left( beta right)}left( {x_i,y_i} right) + lambda parallel beta parallel _1} } right)$$


For each gene, the signed distance to the hyperplane was used as CIG score. The decision threshold (CIG score cutoff) for CIGs were determined by distance to the top left corner of the ROC curve41.

P-value between a pair of ROC curves was calculated by Hanley’s method42. First, a critical ratio z will be defined as:

$$z = frac{{A_1 - A_2}}{{sqrt {{mathrm{SE}}_1^2 + {mathrm{SE}}_2^2 - 2r{mathrm{SE}}_1{mathrm{SE}}_2} }}$$


where A1 and SE1 are the observed area under curve and estimated SE of area under curve for ROC curve 1; where A2 and SE2 are the associated values for ROC curve 2. r represents the correlation between A1 and A2 via querying the table provided in Hanley’s method42. Two intermediate correlation coefficients are required to calculate r. First, rcig, is the Pearson correlation between the CIG scores given to CIGs by the two models; rnoncig, is the Pearson correlation between the CIG scores given to non-CIG genes by the two models. Furthermore, r is acquired by querying the table42 using (rcig + rnoncig)/2 and (A1 + A2)/2. SE of the ROC areas are calculated based on the following equation43.

$${mathrm{{SE}}} = sqrt {frac{{Aleft( {1 - A} right) + left( {{mathrm{na}} - 1} right)left( {Q_1 - A * A} right) + ({mathrm{nn}} - 1)(Q_2 - A * A)}}{{{mathrm{na}} * {mathrm{nn}}}}}$$


where A is the area under the curve, na and nn are the number of control genes and CIG genes, respectively, and Q1 and Q2 are estimated by: Q = A/(2 − A), Q = 2 A ∗ A/(1 + A). Then this quantity z is referred to tables of normal distributions and used to estimate p-value between the two ROC curves.

Correlation and collinearity test

Spearman correlation coefficient between features was calculated using the Python Pandas library. For collinearity analysis, variance inflation factor is calculated using the SciPy library.

Analyze the influence of cell types

For cross test, identity genes from only five randomly selected cell types were used to train the model and identity genes from the other cell types were used to test the model. Overlapped genes are removed from the training and test datasets. Similarly, to analyze how the number of cell types influence the model, identity genes that were used for training and for testing were from different cell types. For parallel test, genes that were used for training and testing were from the same cell types.

Analyze the influence of noises

To test the robustness of the model, labels of genes were swapped in four different ways: only swap identity genes to negative control genes (false negative), only swap negative control genes to identity genes (false positive), swap equal number of identity genes to control genes and control genes to identity genes (bidirectional false), and randomly change the labels of genes. After swapping, the genes were subject to CIGdiscover to test its performance.

Training CIGnet

Data related to network nodes (genes), edges, and closeness between nodes were downloaded from CellNet website ( For each cell-type-specific cell identity subnetwork, only the CIGs were used. Known master transcription factors were defined as in Fig. 1. Control transcription factors are randomly selected from all known transcription factors, except the master transcription factors. Due to the small number of master transcription factors in the training datasets, SMOTE is used to expand the positive genes in training datasets following the distribution of feature values of the known master transcription factors. All performance tests for the logistic regression model are conducted in the same way as described for CIGdiscover. Cell-type specificity is calculated using Tau index44 and scaled to be between −1 and 1.

Pathway analysis

Top 500 identity genes ranked by the CIG score are selected to perform the DAVID pathway analysis (DAVID 6.8 The pathways with q-values (adjusted P-values using the Benjamini method) smaller than 0.05 were defined as significantly enriched.

Cell lineage hierarchy analysis

For heatmap and hierarchy cluster analysis, we retrieved top 500 identity genes for each cell or tissue type to create a heatmap of cell identity score using the tool Morpheus ( with default parameters.

Data visualization

Decision boundary analysis is performed using Sklearn and visualized by Matplotlib. ROC curves are created using Matplotlib. Bar plots is created using the Prism statistical software package (Graph Pad Software, Inc., La Jolla, CA, USA). Scatter plots are created using Microsoft Excel software.

Average density of epigenetic marks in promoter region around TSS were plotted using the Profile function in DANPOS version 2.2.3:

python profile wig --genefile_paths putative_identity_genes.txt,putative_negative_identity_genes.txt --genefile_aliases positive,negative --heatmap 1 --name outdir --genomic_sites TSS --flank_up 3000 --flank_dn 10000

Heatmap for density of epigenetic marks around TSS is plotted using the software MeV46 version 4.8.1.

We have added figures for visualization of CIG networks for individual cell or tissue types in the “network visualization section” of our CIGDB at

Maintenance of human PSCs

Human PSCs were maintained on Matrigel in mTesR1 medium. Cells were passaged approximately every 6 days. To passage PSCs, cells were washed with Dulbecco’s modified Eagle’s medium (DMEM)/F12 medium (no serum) and incubated in 1 mg/ml dispase until colony edges started to detach from the dish. The dish was then washed three times with DMEM/ F12 medium. After the final wash, colonies were scraped off of the dish with a cell scraper and gently triturated into small clumps and passaged onto fresh Matrigel-coated plates.

Human PSCs differentiation to ECs

Differentiation is induced 4 days after PSCs passaging (day 0). Mesoderm specification is induced by the addition of bone morphogenetic protein 4, activin A, small-molecule inhibitor of glycogen synthase kinase-3β (CHIR99021), and vascular endothelial growth factor (VEGF). Mesoderm inductive factors are removed on day 3 of differentiation and are replaced with vascular specification medium supplemented with VEGF and the transforming growth factor-β pathway small-molecule inhibitor SB431542. Vascular specification medium is additionally refreshed on days 7 and 9 of differentiation. Flow cytometric analysis of differentiated ECs is performed on day 10.

CRISPR gRNA and lentiviral vector design

Two open-access software, Cas-Designer ( and CRISPR design (, were used to design guide RNAs (gRNA) targeted to candidate gene. Two guides were designed per gene shown as in Supplementary Data 4.

Target DNA oligos were purchased from IDT (Integrated DNA Technologies) and cloned into the lentiCRISPR v2 plasmid2 (Addgene plasmid# 52961) via BsmBI restriction enzyme sites upstream of the scaffold sequence of the U6-driven gRNA cassette. All plasmids were sequenced to confirm successful ligation.

Lentiviral constructs

Lentivirus was packaged by co-transfection of constructs with second-generation packaging plasmids pMD2.G, psPAX2 into a six-well plate with HEK293T cells. After the first 24 h of transfection (250 ng of pMD2.G, 750 ng of psPAX2, 1 µg of target plasmid), the medium was changed to DMEM and the supernatants 48 and 72 h after transfection were pooled, filtered through a 0.45 µm filter, and used for infection.

Cell culture and lentiviral transduction

HUVECs were purchased from Lonza (C2517A) and human pluripotent stem cell (hPSC) was a kind gift from Dr John Cooke’s group. All cells used in this study were within 15 passages after receipt. HUVECs were cultured in 5% CO2 and maintained in vitro in Endothelial Growth Basal Medium with EGM-2 SingleQuot Kit. hPSC was cultured in 5% CO2 and maintained in vitro in mTeSR1 basal medium with mTeSR1 supplement. Those cell lines were mycoplasma negative during routine tests. HUVECs were grown to 70% confluence and infected with lentiviral vectors. The media was changed 8 h after viral transduction and incubated for 48 h before selection with 1 µg/mL puromycin for 3 days. HUVECs were collected for cell proliferation assay, nitric oxide production, and genomic DNA extraction. On the second day of PSC passaging, PSCs were infected with lentiviral vectors. The media was changed 6 h after viral transduction. Differentiation was induced 3 days after virus infection.

T7 endonuclease I assay

Genomic DNA from lentiviral transduced cells were extracted with a Quick-DNA Miniprep Kit (Zymo Research, Irvine, CA) following manufacturer’s protocol and were quantified using a Synergy 2 Multi-Mode Reader (BioTek, Winooski, VT, USA). The targeted regions were PCR-amplified with amfiSure PCR Master Mix (GenDEPOT, Barker, TX, USA) using primers flanking the target sites. Primers sequence are shown in Supplementary Data 4.

We denatured 200 ng of the PCR products and then slowly hybridized to form heteroduplexes using the following program settings: 95 °C for 5 min, 95°–85 °C at −2 °C/s, 85°–25 °C at −0.1 °C/s. Heteroduplexes were digested with T7 endonuclease I (New England Biolabs, Ipswich, MA, USA) at 37 °C for 30 min. In addition, the digested products were separated on a 2% TAE agarose gel for analysis. Images were captured using the ChemiDoc XRS+ Molecular Imager system (Bio-Rad, Hercules, CA, USA).

Cell proliferation assay

After viral transduction and puromycin selection, HUVECs were plated in 96-well plates at a density of 1000 cells per well and allowed to attach for 24 h. Viability was measured utilizing the CellTiter-Glo® Luminescent Cell Viability Assay (Promega, Madison, WI, USA). Results were read at 24, 48, and 72 h on the on Synergy 2 Multi-Mode Reader.

Nitric oxide production assay

HUVECs (8 × 103) were added to triplicate wells of a 96-well plate with 200 ml media. 4-amino-5methylamino-2979-difluorofluorescein diacetate (10 mM) in anhydrous dimethylsulfoxide was added to each well and the plate was incubated for 30 min at 37 °C and 5% CO2. The cells were washed with PBS, 200 ml fresh media was added, and the plate was incubated for an additional 30 min. Fluorescence was measured using a Synergy 2 Multi-Mode Reader.

Endothelial cell tube formation assay

Ninety-six-well plates were coated with 50 μl Matrigel (R&D Systems, catalog number 3432-005-01) and incubated at 37 °C for 30 min. Control (1 × 104) and CRISPR/Cas9-edited HUVECs in 100 μl EGM medium were seeded in each well, respectively. After 4 h, images were captured using the Leica epi-fluorescence microscope. Branches number and length were quantified using ImageJ software.

Fluorometric LDL uptake assay

Control and CRISPR/Cas9-edited HUVECs were seeded on a 48-well plate. Alexa Fluor 594 AcLDL (Thermo Fisher Scientific, catalog number L35353) was added to the culture medium for the final 4 h of the incubation time. HUVECs were washed, trypsinized, and centrifuged at 200 × g for 5 min, then resuspended in FACSB-10. Fluorescence was determined using a flow cytometer (LSR II, Becton-Dickinson, San Jose, CA, USA) and the data were analyzed using FlowJo software.

Flow cytometric analysis

Ten days after differentiation, PSCs were trypsinized, centrifuged at 200 × g for 5 min, resuspended in FACSB-10 (FACS buffer–10% fetal bovine serum) and incubated with anti-human-VE-cadherin (Invitrogen, catalog number 17-0319-41, 1:150) and anti-human CD31 (Invitrogen, catalog number 53-1449-41, 1:150) for 30 min on ice. Fluorescence was determined using a flow cytometer (LSR II, Becton-Dickinson, San Jose, CA, USA) and the data were analyzed using FlowJo software.

Statistical analysis

For bar plots and box plots, P-values are calculated with Wilcoxon’s test (two-tail). For ROC curves, P-values are calculated by Hanley’s method42 (two-tail). Q-values (Benjamini) for pathways were directly determined using DAVID ( For CRISPR9 experiments, data were presented as mean ± SD of six individual experiments. Statistical analysis was performed with Student’s T-test by means of the Prism statistical software package (Graph Pad Software, Inc., La Jolla, CA, USA).

Machine Learning Takes The Embarrassment Out Of Video-conference Wardrobe Malfunctions

Telecommuters: tired of the constant embarrassment of showing up to video conferences wearing nothing but your underwear? Machine Learning Takes The Embarrassment Out Of Video-conference Wardrobe Malfunctions.

 Save the humiliation and all those pesky trips down to HR with Safe Meeting, the new system that uses the power of artificial intelligence to turn off your camera if you forget that casual Friday isn’t supposed to be that casual.

The following infomercial is brought to you by [Nick Bild], who says the whole thing is tongue-in-cheek but we sense a certain degree of “necessity is the mother of invention” here. It’s true that the sudden throng of remote-work newbies certainly increases the chance of videoconference mishaps and the resulting mortification, so whatever the impetus, Safe Meeting seems like a great idea. It uses a Pi cam connected to a Jetson Nano to capture images of you during videoconferences, which are conducted over another camera. The stream is classified by a convolutional neural net (CNN) that determines whether it can see your underwear. If it can, it makes a REST API call to the conferencing app to turn off the camera. The video below shows it in action, and that it douses the camera quickly enough to spare your modesty.

We shudder to think about how [Nick] developed an underwear-specific training set, but we applaud him for doing so and coming up with a neat application for machine learning. He’s been doing some fun work in this space lately, from monitoring where surfaces have been touched to a 6502-based gesture recognition system.

Thanks To Renewable And Machine Learning

Google currently Forecasts The Wind

Wind farms have historically created less cash for the electricity they turn out as a result of they need been unable to predict however windy it'll be tomorrow.

“The approach plenty of power markets work is you have got to schedule your assets every day ahead,” aforementioned archangel Terrell, the pinnacle of energy market strategy at Google. “And you tend to urge stipendiary higher after you try this than if you sell into the market time period.

“Well, however do variable assets like wind schedule every day ahead after you do not know the wind goes to blow?” Terrell asked, “and however are you able to really reserve your home in line?”

“We're not obtaining the complete profit and therefore the full worth of that power.”

Here’s how: Google and therefore the Google-owned AI firm DeepMind combined weather information with power information from 700 megawatts of wind energy that Google sources within the Central us. exploitation machine learning, they need been ready to higher predict wind production, higher predict electricity provide and demand, and as a result, cut back operative prices.

“What we've been doing is functioning in partnership with the DeepMind team to use machine learning to require the weather information that is out there in public, really forecast what we predict the wind production are succeeding day, and bid that wind into the day-ahead markets,” Terrell aforementioned in a very recent seminar hosted by the Stanford Precourt Institute of Energy. Stanford denote video of the seminar last week.

The result has been a twenty p.c increase in revenue for wind farms, Terrell aforementioned.

The Department of Energy listed improved wind prognostication as a primary priority in its 2015 Wind Vision report, mostly to enhance reliability: “Improve Wind Resource Characterization,” the report aforementioned at the highest of its list of goals. “Collect information and develop models to enhance wind prognostication at multiple temporal scales—e.g., minutes, hours, days, months, years.”

Google’s goal has been additional sweeping: to clean carbon entirely from its energy portfolio, that consumes the maximum amount power as 2 San Franciscos.

Google achieved associate degree initial milestone by matching its annual energy use with its annual renewable-energy procural, Terrell aforementioned. however the corporate has not been carbon-free in each location at each hour, that is currently its new goal—what Terrell calls its “24x7 carbon-free” goal.

“We're extremely getting down to flip our efforts during this direction, and we're finding that it is not one thing that is simple to try to to. It's arguably a rocket launching, particularly in places wherever the renewable resources of these days don't seem to be as value effective as they're in different places.”

The scientists at London-based DeepMind have incontestible that AI will facilitate by increasing the market viability of renewables at Google and on the far side.

“Our hope is that this sort of machine learning approach will strengthen the business case for alternative energy and drive more adoption of carbon-free energy on electrical grids worldwide,” aforementioned DeepMind program manager Sims American Revolutionary leader and Google engineer Carl Elkin. in a very Deepmind diary post, they define however they boosted profits for Google’s wind farms within the Southwest Power Pool, associate degree energy market that stretches across the plains from the Canadian border to north Texas:

“Using a neural network trained on wide out there weather forecasts and historical rotary engine information, we tend to designed the DeepMind system to predict wind-power output thirty six hours prior to actual generation. supported these predictions, our model recommends the way to create optimum hourly delivery commitments to the ability grid a full day prior to.”

The DeepMind system predicts wind-power output thirty six hours prior to, permitting power producers to form ... [+] additional remunerative advance bids to produce power to the grid.


Wednesday, June 3, 2020

Artificial Intelligence - The Future Of Communication

The future of work is no longer merely a concept, but a reality — Covid-19 has made sure of that.
What role, then, does artificial intelligence (AI) have to play in this drastic shift?
For some time now, I’ve firmly maintained the belief that AI would take over the vast majority of process-driven work within 15 years. However, with years of key developments in the world of work having recently been crammed into a matter of months, the future has unfolded very differently than we imagined.
Rather than coming about through careful planning, companies have been thrust into this new way of working.
Without doubt, many were unprepared for it and have had to move quickly to put in place remote working solutions to keep business going.
They simply didn’t have the time to manage change and implement AI-driven solutions. However, many predict that we’ll see remote working becoming part of the ‘new normal’ even after lockdown measures are eased.

Companies like Twitter, for example, have already announced that their employees can work from home indefinitely.
Assuming that the growing trend towards permanent remote working continues, organisations will need to carefully consider the AI solutions they turn to for automating process-driven work.
Yet, how does this affect the security considerations required to make remote working effective for companies?
Every time we send a message to a colleague or share a company file, we share bits of data electronically.
Data, of course, is the lifeblood of every modern organisation, so when it is shared, it must be done so securely.
If we add AI into the mix, we run into a potential data security issue.
This is because, fundamentally, AI needs data to work properly.
Its purpose is to access data, analyse it and generate better outcomes for organisations through automation.
In doing so, it replaces certain tasks, but also enables employees to perform existing jobs more effectively.
Yet, despite the productivity gains, this new era of remote working doesn’t necessarily lend itself to AI being used appropriately when it comes to the secure transmission of data.
AI-driven communications are just as vulnerable to security flaws as those performed by humans.
This is especially evident when we look closely at the kind of technology enterprises use to communicate internally – a crucial component of any business model outside an office’s four walls.

According to Morten Brøgger, CEO of messaging and collaboration platform Wire, AI may not be as intertwined with the future of work as we think: “AI and the future definitely not in the collaboration and communication market or in the future of work 2.0.”
Brøgger continues: “The reason is that if you start building a lot of AI into communication tools, it means surveillance for your users, which is a clear breach of their privacy. If you do need an AI, you need to have the data to examine behavioural patterns. This means breaking end-to-end encryption, because there will be a machine that receives a copy of everything a user is doing.”
Wire is one of a crop of collaboration tools that have enabled many organisations to continue operating during COVID-19.
It found its niche by specifically targeting large enterprises back in 2017, who Brøgger believes have an advanced understanding of the importance of security and privacy.
With its leadership team comprising ex-Skype employees, the company now positions itself as being on a mission to change the way employees communicate in the workplace.
Though organisations can safely integrate AI technology, this abrupt new iteration of the workplace that we’ve suddenly found ourselves in has perhaps gone some way to exposing why AI isn’t the panacea for everything - at least not without careful security planning at the outset.

As Brøgger notes: “There were a lot of companies who were basically caught in this situation that weren’t ready for it. So, how do they put infrastructure in place that is sufficiently secure to allow people to work from home and work on things that are absolutely confidential? There are no longer any global rules – no one size fits all. That’s not how the world is, even with collaboration.”
It’s clear that the world of work is changing, though it took a pandemic to accelerate that change.

Companies need to take stock of how they can make the most of the tools available to them to reduce inefficiencies, and I still maintain that AI is, in
But it must be done in a way that doesn’t come at the expense of breaching a company’s security and putting its most valuable data assets at risk.

Alternative data and artificial intelligence - the future fuel for investors

Tomas Franczyk, Managing Director, Head of Global Information Services, Asia Pacific, Nasdaq. This is the undiscovered data from non-traditional data sources that give investors information and unique insights to help them evaluate investment opportunities.

The alternative data market, expected to become a $1.7 billion industry in the next few years, can encompass a range of sources.
These include natively digital information like web traffic, online buying habits, and social media activity, as well as more granular indicators of financial performance, such as ocean cargo and in the capital markets space, alternative data is viewed as increasingly important.
According to a 2019 survey by Greenwich Associates, 95% of trading professionals believe alternative data will become more valuable to the trading process, and 85% of banks, investors, and capital markets service providers plan to increase spending on data management.
As this new data becomes a powerful differentiator in the search for alpha, a rapidly growing community of buy-side firms have started using it to add power to quantitative and fundamental investment models with the aim of outperforming the market.
For example, Nasdaq's platform Quandl, which identifies datasets from local firms to build investment models, has partnered with large insurance companies in the United States to access insurance policies on this enables users to accurately measure car sales before automotive manufacturers report them.
This data would be extremely important, say, to hedge fund investors who need investment insights into the automotive sector.

Meanwhile, in Asia, Nasdaq is building regional-specific data products and has partnered with local fintechs and other innovative local vendors to migrate their core data and alternative data to the more sophisticated technologies mean organisations can create datasets that support managers with short-term trading strategies as well as those with a long-term approach, such as institutional investors.
Another report by Greenwich Associates last year found that 74% of firms surveyed agreed that alternative data has started to have a big impact on institutional investing, while nearly 30% of quantitative funds attribute at least 20% of their alpha to alternative data.
It can only work if it is properly interpreted and analysed.
By its nature, alternative data is harder to consume than financial data; it is often unstructured, does not follow patterns, and is created at a very fast rate.
Hence, investors now have a growing need for talent and technology, including analytics platforms, testing tools, fluid data architecture, and data science teams, to help them with their data management.
Advanced analytics and artificial intelligence (AI), such as machine learning and natural language processing, can be crucial to analysing data.

Machines can process events at roughly 2,000 times the speed of humans, digest vast datasets, and work around the clock.
During the investment process, AI-enabled data processing can increase the volume and quality of idea generation; this increase in data, including alternative data, when combined with computing power can help investment managers develop a long-lasting competitive advantage.
While some organisations are well on their way to introducing AI-based models, the industry is still understanding and identifying the operational, regulatory, and technological risks that come with the race effective risk management practices will be key for the successful adoption of AI.

Data providers have an opportunity to assist the asset management industry by making alternative data and AI the drivers of future investment research.
In fact, we may see active portfolio managers look less to the sell-side for their research needs and instead develop their own research, invest in data experts and technology, and partner with vendors to supply the information and analytical tools they need. Nasdaq offers comprehensive, bespoke, and timely data and insights to help clients build and protect assets.