Videos uploaded by user “KDD2016 video”
KDD2016 paper 573
 
02:55
Title: "Why Should I Trust You?": Explaining the Predictions of Any Classifier Authors: Marco Túlio Ribeiro*, University of Washington Sameer Singh, University of Washington Carlos Guestrin, University of Washington Abstract: Despite widespread adoption, machine learning models remain mostly black boxes. Understanding the reasons behind predictions is, however, quite important in assessing trust in a model. Trust is fundamental if one plans to take action based on a prediction, or when choosing whether or not to deploy a new model. Such understanding further provides insights into the model, which can be used to turn an untrustworthy model or prediction into a trustworthy one. In this work, we propose LIME, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction. We further propose a method to explain models by presenting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem. We demonstrate the flexibility of these methods by explaining different models for text (e.g. random forests) and image classification (e.g. neural networks). The usefulness of explanations is shown via novel experiments, both simulated and with human subjects. Our explanations empower users in various scenarios that require trust: deciding if one should trust a prediction, choosing between models, improving an untrustworthy classifier, and detecting why a classifier should not be trusted. More on http://www.kdd.org/kdd2016/ KDD2016 Conference will be recorded and published on http://videolectures.net/
Views: 56661 KDD2016 video
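To make the local-surrogate idea concrete, here is a minimal sketch in Python: perturb an instance, query the black-box model, weight the perturbations by proximity, and fit a weighted linear model whose coefficients serve as the local explanation. This assumes scikit-learn and NumPy and only illustrates the idea; it is not the authors' released LIME implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import Ridge

# Black-box model to be explained.
X, y = load_breast_cancer(return_X_y=True)
black_box = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def explain_locally(x, predict_proba, n_samples=5000, kernel_width=1.0):
    """Fit a weighted linear surrogate around instance x (the LIME core idea)."""
    # 1. Perturb the instance by sampling around it.
    Z = x + np.random.normal(scale=X.std(axis=0), size=(n_samples, x.size))
    # 2. Query the black box on the perturbations.
    labels = predict_proba(Z)[:, 1]
    # 3. Weight perturbations by proximity to x (RBF kernel).
    d = np.linalg.norm((Z - x) / X.std(axis=0), axis=1)
    w = np.exp(-(d ** 2) / kernel_width ** 2)
    # 4. Fit an interpretable (linear) model on the weighted neighborhood.
    surrogate = Ridge(alpha=1.0).fit(Z, labels, sample_weight=w)
    return surrogate.coef_  # per-feature local importance

coefs = explain_locally(X[0], black_box.predict_proba)
print(np.argsort(-np.abs(coefs))[:5])  # five locally most influential features
```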
KDD2016 paper 683
 
03:29
Title: MANTRA: A Scalable Approach to Mining Temporally Anomalous Sub-trajectories Authors: Prithu Banerjee*, UBC Pranali Yawalkar, IIT Madras Sayan Ranu, IIT Madras Abstract: In this paper, we study the problem of mining temporally anomalous sub-trajectory patterns from trajectory streams. Given the prevailing road conditions, a sub-trajectory is temporally anomalous if its travel time deviates significantly from the expected time. Mining these patterns requires us to delve into the sub-trajectory space, which is not scalable for real-time analytics. To overcome this scalability challenge, we design a technique called MANTRA. We study the properties unique to anomalous sub-trajectories and utilize them in MANTRA to iteratively refine the search space into a disjoint set of sub-trajectory islands. The expensive enumeration of all possible sub-trajectories is performed only on the islands to compute the answer set of maximal anomalous sub-trajectories. Extensive experiments on both real and synthetic datasets establish MANTRA as more than 3 orders of magnitude faster than baseline techniques. Moreover, through trajectory classification and segmentation, we demonstrate that the proposed model conforms to human intuition. More on http://www.kdd.org/kdd2016/ KDD2016 Conference will be recorded and published on http://videolectures.net/
Views: 3846 KDD2016 video
KDD2016 paper 392
 
02:48
Title: Large-Scale Item Categorization in e-Commerce Using Multiple Recurrent Neural Networks Authors: Jung-Woo Ha*, NAVER LABS Hyuna Pyo, NAVER LABS Jeonghee Kim, NAVER LABS Abstract: Precise item categorization is a key issue in e-commerce domains. However, it still remains a challenging problem due to data size, category skewness, and noisy metadata. Here, we report on the successful deployment of a deep learning-based item categorization method, i.e., a deep categorization network (DeepCN), on an e-commerce website. DeepCN is an end-to-end model using multiple recurrent neural networks (RNNs) dedicated to metadata attributes for generating features from text metadata and fully connected layers for classifying item categories from the generated features. The categorization errors are propagated back through the fully connected layers to the RNNs for weight updates in the learning process. This deep learning-based approach allows diverse attributes to be integrated into a common representation, thus overcoming sparsity and scalability problems. We evaluate DeepCN on large-scale real-world data including more than 94 million items with approximately 4,100 leaf categories from a Korean e-commerce website. Experimental results show our method improves categorization accuracy compared to a model using a single RNN as well as a standard classification model using unigram-based bag-of-words. Furthermore, we investigate how much the model parameters and the attributes used influence categorization performance. More on http://www.kdd.org/kdd2016/ KDD2016 Conference will be recorded and published on http://videolectures.net/
Views: 1578 KDD2016 video
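The architecture described above (one RNN per metadata attribute, with the concatenated states fed to fully connected layers) can be sketched as follows. This is a structural sketch in PyTorch with invented attribute vocabularies and layer sizes, not NAVER's DeepCN implementation.

```python
import torch
import torch.nn as nn

class DeepCNSketch(nn.Module):
    """One RNN per metadata attribute; concatenated states feed FC layers."""
    def __init__(self, vocab_sizes, embed_dim=64, hidden=128, n_categories=4100):
        super().__init__()
        self.embeds = nn.ModuleList(nn.Embedding(v, embed_dim) for v in vocab_sizes)
        self.rnns = nn.ModuleList(nn.GRU(embed_dim, hidden, batch_first=True)
                                  for _ in vocab_sizes)
        self.classifier = nn.Sequential(
            nn.Linear(hidden * len(vocab_sizes), 256), nn.ReLU(),
            nn.Linear(256, n_categories))

    def forward(self, attribute_token_batches):
        # attribute_token_batches: one LongTensor (batch, seq_len) per attribute
        states = []
        for tokens, emb, rnn in zip(attribute_token_batches, self.embeds, self.rnns):
            _, h = rnn(emb(tokens))       # h: (1, batch, hidden)
            states.append(h.squeeze(0))   # last hidden state per attribute
        return self.classifier(torch.cat(states, dim=1))

# Hypothetical attributes: title tokens, brand tokens, seller tokens.
model = DeepCNSketch(vocab_sizes=[50000, 5000, 2000])
batch = [torch.randint(0, v, (8, 12)) for v in [50000, 5000, 2000]]
logits = model(batch)   # (8, 4100); train end-to-end with nn.CrossEntropyLoss
```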
"Why Should I Trust you?" Explaining the Predictions of Any Classifier
 
24:26
Author: Marco Tulio Ribeiro, Department of Computer Science and Engineering, University of Washington Abstract: Despite widespread adoption, machine learning models remain mostly black boxes. Understanding the reasons behind predictions is, however, quite important in assessing trust, which is fundamental if one plans to take action based on a prediction, or when choosing whether to deploy a new model. Such understanding also provides insights into the model, which can be used to transform an untrustworthy model or prediction into a trustworthy one. In this work, we propose LIME, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction. We also propose a method to explain models by presenting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem. We demonstrate the flexibility of these methods by explaining different models for text (e.g. random forests) and image classification (e.g. neural networks). We show the utility of explanations via novel experiments, both simulated and with human subjects, on various scenarios that require trust: deciding if one should trust a prediction, choosing between models, improving an untrustworthy classifier, and identifying why a classifier should not be trusted. More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 11537 KDD2016 video
KDD2016 paper 819
 
02:20
Title: DopeLearning: A Computational Approach to Rap Lyrics Generation Authors: Eric Malmi*, Aalto University Pyry Takala, Aalto University Hannu Toivonen, University of Helsinki Tapani Raiko, Aalto University Aristides Gionis, Aalto University Abstract: Writing rap lyrics requires both creativity to construct a meaningful, interesting story and lyrical skills to produce complex rhyme patterns, which form the cornerstone of good flow. We present a rap lyrics generation method that captures both of these aspects. First, we develop a prediction model to identify the next line of existing lyrics from a set of candidate next lines. This model is based on two machine-learning techniques: the RankSVM algorithm and a deep neural network model with a novel structure. Results show that the prediction model can identify the true next line among 299 randomly selected lines with an accuracy of 17%, i.e., over 50 times more likely than random choice. Second, we employ the prediction model to combine lines from existing songs, producing lyrics with rhyme and meaning. An evaluation of the produced lyrics shows that in terms of quantitative rhyme density, the method outperforms the best human rappers by 21%. The rap lyrics generator has been deployed as an online tool called DeepBeat, and the performance of the tool has been assessed by analyzing its usage logs. This analysis shows that machine-learned rankings correlate with user preferences. More on http://www.kdd.org/kdd2016/ KDD2016 Conference will be recorded and published on http://videolectures.net/
Views: 20144 KDD2016 video
KDD2016 paper 1036
 
02:10
Title: Gemello: Creating a Detailed Energy Breakdown from just the Monthly Electricity Bill Authors: Nipun Batra*, Indraprastha Institute of Information Technology, Delhi Amarjeet Singh, Indraprastha Institute of Information Technology, Delhi Kamin Whitehouse, University of Virginia Abstract: The first step to saving energy in the home is often to create an energy breakdown: the amount of energy used by each individual appliance in the home. Unfortunately, current techniques that produce an energy breakdown are not scalable: they require hardware to be installed in each and every home. In this paper, we propose a more scalable solution called Gemello that estimates the energy breakdown for one home by matching it with similar homes for which the breakdown is already known. This matching requires only the monthly energy bill and household characteristics such as square footage of the home and the size of the household. We evaluate this approach using 57 homes and results indicate that the accuracy of Gemello is comparable to or better than existing techniques that use sensing infrastructure in each home. The information required by Gemello is often publicly available and, as such, it can be immediately applied to many homes around the world. More on http://www.kdd.org/kdd2016/ KDD2016 Conference will be recorded and published on http://videolectures.net/
Views: 8442 KDD2016 video
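A minimal sketch of the matching idea, assuming scikit-learn: treat homes with known breakdowns as training points described by their bill and household characteristics, then estimate a new home's per-appliance usage from its nearest matches. The feature values and appliance figures below are invented for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: monthly bill (kWh), square footage, occupants,
# for homes whose per-appliance breakdown is already known (e.g. sub-metered).
X_known = np.array([[900, 1800, 3], [1200, 2400, 4], [600, 1100, 2], [1500, 3000, 5]])
fridge_kwh = np.array([55, 70, 40, 80])   # known fridge consumption per home

scaler = StandardScaler().fit(X_known)
knn = KNeighborsRegressor(n_neighbors=2, weights="distance")
knn.fit(scaler.transform(X_known), fridge_kwh)

# Estimate the breakdown for a new home from its bill and characteristics alone.
new_home = scaler.transform([[1000, 2000, 3]])
print(knn.predict(new_home))  # estimated fridge kWh; repeat per appliance
```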
KDD2016 paper 303
 
03:15
Title: Multi-layer Representation Learning for Medical Concepts Authors: Edward Choi*, Georgia Institute of Technology Mohammad Taha Bahadori, Georgia Institute of Technology Elizabeth Searles, Children's Healthcare of Atlanta Catherine Coffey, Children's Healthcare of Atlanta Jimeng Sun, Georgia Institute of Technology Abstract: Learning efficient representations for concepts has been proven to be an important basis for many applications such as machine translation or document classification. Proper representations of medical concepts such as diagnosis, medication, procedure codes and visits will have broad applications in healthcare analytics. However, in Electronic Health Records (EHR) the visit sequences of patients include multiple concepts (diagnosis, procedure, and medication codes) per visit. This structure provides two types of relational information, namely sequential order of visits and co-occurrence of the codes within each visit. In this work, we propose Med2Vec, which not only learns distributed representations for both medical codes and visits from a large EHR dataset with over 3 million visits, but also allows us to interpret the learned representations, which were positively confirmed by clinical experts. In the experiments, Med2Vec displays significant improvement in key medical applications compared to popular baselines such as Skip-gram, GloVe and stacked autoencoder, while providing clinically meaningful interpretation. More on http://www.kdd.org/kdd2016/ KDD2016 Conference will be recorded and published on http://videolectures.net/
Views: 2053 KDD2016 video
KDD2016 paper 461
 
03:01
Title: Kam1n0: MapReduce-based Assembly Clone Search for Reverse Engineering Authors: Steven H. H. Ding, McGill University Benjamin C. M. Fung*, McGill University Philippe Charland, Defence Research and Development Canada Abstract: Assembly code analysis is one of the critical processes for detecting and justifying software plagiarism and software patent infringements when the source code is unavailable. It is also a common practice to discover exploits and vulnerabilities in existing software. However, it is a manually intensive and time-consuming process even for experienced reverse engineers. An effective and efficient assembly code clone search engine can greatly reduce the effort of this process, since it can identify the cloned parts that have been previously analyzed. The assembly code clone search problem belongs to the field of software engineering. However, it strongly depends on practical nearest neighbor search techniques in data mining and databases. By closely collaborating with reverse engineers and Defence Research and Development Canada (DRDC), we study the concerns and challenges that make existing assembly code clone approaches not practically applicable from the perspective of data mining. We propose a new variant of the LSH scheme and incorporate it with graph matching to address these challenges. We implement an integrated assembly clone search engine called Kam1n0. It is the first clone search engine that can efficiently identify the given query assembly function’s subgraph clones from a large assembly code repository. Kam1n0 is built upon the Apache Spark computation framework and Cassandra-like key-value distributed storage. The deployed system is publicly available and readers can try out its beta version on Google Cloud. Extensive experimental results suggest that Kam1n0 is accurate, efficient, and scalable for handling large volumes of assembly code. More on http://www.kdd.org/kdd2016/ KDD2016 Conference will be recorded and published on http://videolectures.net/
Views: 3344 KDD2016 video
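The paper combines a new LSH variant with graph matching; neither is reproduced here. The sketch below shows only the generic LSH indexing step such a system builds on, using the datasketch library, with assembly functions reduced to sets of hypothetical instruction n-grams.

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(features, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for f in features:
        m.update(f.encode("utf8"))
    return m

# Toy "assembly functions" represented as sets of instruction bigrams.
repo = {
    "memcpy_v1": {"mov;mov", "mov;add", "add;cmp", "cmp;jne"},
    "memcpy_v2": {"mov;mov", "mov;add", "add;cmp", "cmp;jb"},
    "sha1_init": {"xor;xor", "mov;ror", "ror;add"},
}

# Index the repository; near-duplicate feature sets hash to the same buckets.
lsh = MinHashLSH(threshold=0.5, num_perm=128)
for name, feats in repo.items():
    lsh.insert(name, minhash_of(feats))

query = minhash_of({"mov;mov", "mov;add", "add;cmp", "cmp;je"})
print(lsh.query(query))  # candidate clones, e.g. ['memcpy_v1', 'memcpy_v2']
```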
KDD2016 paper 330
 
01:34
Title: Keeping it Short and Simple: Summarising Complex Event Sequences with Multivariate Patterns Authors: Roel Bertens*, Utrecht University Jilles Vreeken, Max Planck Institute for Informatics and Saarland University Arno Siebes, Utrecht University Abstract: We study how to obtain concise descriptions of discrete multivariate sequential data. In particular, how to do so in terms of rich multivariate sequential patterns that can capture potentially highly interesting (cor)relations between sequences. To this end we allow our pattern language to span over the domains (alphabets) of all sequences, allow patterns to overlap temporally, as well as allow for gaps in their occurrences. We formalise our goal by the Minimum Description Length principle, by which our objective is to discover the set of patterns that provides the most succinct description of the data. To discover high-quality pattern sets directly from data, we introduce DITTO, a highly efficient algorithm that approximates the ideal result very well. Experiments show that DITTO correctly discovers the patterns planted in synthetic data. Moreover, it scales favourably with the length of the data, the number of attributes, and the alphabet sizes. On real data, ranging from sensor networks to annotated text, DITTO discovers easily interpretable summaries that provide clear insight into both the univariate and multivariate structure. More on http://www.kdd.org/kdd2016/ KDD2016 Conference will be recorded and published on http://videolectures.net/
Views: 977 KDD2016 video
Attribute Extraction from Product Titles in eCommerce
 
05:22
Author: Ajinkya More, Wal-Mart Stores, Inc. Abstract: This paper presents a named entity extraction system for detecting attributes in product titles of eCommerce retailers like Walmart. The absence of syntactic structure in such short pieces of text makes extracting attribute values a challenging problem. We find that combining sequence labeling algorithms such as Conditional Random Fields and Structured Perceptron with a curated normalization scheme produces an effective system for the task of extracting product attribute values from titles. To keep the discussion concrete, we will illustrate the mechanics of the system from the point of view of a particular attribute - brand. We also discuss the importance of an attribute extraction system in the context of retail websites with large product catalogs, compare our approach to other potential approaches to this problem and end the paper with a discussion of the performance of our system for extracting attributes. More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 1124 KDD2016 video
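A minimal sketch of the sequence-labeling step for brand extraction, assuming the sklearn-crfsuite library: each title is a token sequence, each token gets a feature dict, and a CRF tags tokens with B-BRAND/I-BRAND/O labels. The titles and features below are invented, and this illustrates the CRF component only, not the full curated-normalization system.

```python
import sklearn_crfsuite

def token_features(tokens, i):
    """Simple per-token features; real systems add gazetteers, shapes, etc."""
    t = tokens[i]
    return {
        "lower": t.lower(),
        "is_title": t.istitle(),
        "is_digit": t.isdigit(),
        "first": i == 0,
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
    }

# Toy training data: product titles tagged with B-BRAND / I-BRAND / O.
titles = [["Apple", "iPhone", "6s", "64GB"],
          ["Samsung", "Galaxy", "S7", "Case"],
          ["USB", "Cable", "for", "Apple", "iPad"]]
tags = [["B-BRAND", "O", "O", "O"],
        ["B-BRAND", "O", "O", "O"],
        ["O", "O", "O", "B-BRAND", "O"]]

X = [[token_features(t, i) for i in range(len(t))] for t in titles]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, tags)

test = ["Apple", "Watch", "Band"]
print(crf.predict([[token_features(test, i) for i in range(len(test))]]))
```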
Learning to learn and compositionality with deep recurrent neural networks
 
01:23:45
Author: Nando de Freitas, Department of Computer Science, University of Oxford Abstract: Deep neural network representations play an important role in computer vision, speech, computational linguistics, robotics, reinforcement learning and many other data-rich domains. In this talk I will show that learning-to-learn and compositionality are key ingredients for dealing with knowledge transfer so as to solve a wide range of tasks, for dealing with small-data regimes, and for continual learning. I will demonstrate this with three examples: learning learning algorithms, neural programmers and interpreters, and learning communication. More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 15161 KDD2016 video
Serving a Billion Personalized News Feeds
 
39:50
Author: Lars Backstrom, Facebook Abstract: Feed ranking's goal is to provide people with over a billion personalized experiences. We strive to provide the most compelling content to each person, personalized to them so that they are most likely to see the content that is most interesting to them. Similar to a newspaper, putting the right stories above the fold has always been critical to engaging customers and interesting them in the rest of the paper. In feed ranking, we face a similar challenge, but on a grander scale. Each time a person visits, we need to find the best piece of content out of all the available stories and put it at the top of feed where people are most likely to see it. To accomplish this, we do large-scale machine learning to model each person, figure out which friends, pages and topics they care about and pick the stories each particular person is interested in. In addition to the large-scale machine learning problems we work on, another primary area of research is understanding the value we are creating for people and making sure that our objective function is in alignment with what people want. More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 3528 KDD2016 video
Plenary Panel: Is Deep Learning the New 42?
 
01:46:20
Authors: moderator: Andrei Broder, Yahoo! Research panelist: Pedro Domingos, Dept. of Computer Science & Engineering, University of Washington panelist: Nando de Freitas, Department of Computer Science, University of Oxford panelist: Isabelle Guyon, Clopinet panelist: Jitendra Malik, UC Berkeley panelist: Jennifer Neville, Computer Science Department, Purdue University Abstract: The history of deep learning goes back more than five decades but in the marketplace of ideas its perceived value went through booms and busts. We are no doubt at an all-time high: in the last couple of years we witnessed extraordinary advances in vision, speech recognition, game playing, translation, and so on, all powered by deep networks. At the same time companies such as Amazon, Apple, Facebook, Google, and Microsoft are making huge bets on deep learning research and infrastructure, ML competitions are dominated by deep learning approaches, open source deep learning software is proliferating, and the popular press both cheerleads the progress and raises the dark specter of unintended consequences. So is deep learning the answer to everything? According to Douglas Adams’s famous “Hitchhiker’s Guide to the Galaxy” after 7.5 million years of work the “Deep Thought” computer categorically found out that 42 is the “Answer to the Ultimate Question of Life, the Universe, and Everything” (although unfortunately, no one knows exactly what that question was). Rather than wait another 7.5 million years for “Deep Thought” to answer our quest we have assembled a distinguished panel of experts to give us their opinion on deep learning and its present and future impact. More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 18140 KDD2016 video
KDD2016 paper 511
 
02:05
Title: Firebird: Predicting Fire Risk and Prioritizing Fire Inspections in Atlanta Authors: Michael Madaio*, Carnegie Mellon University Shang-Tse Chen, Georgia Tech Oliver L. Haimson, University of California, Irvine Wenwen Zhang, Georgia Tech Xiang Cheng, Emory University Matthew Hinds-Aldrich, Atlanta Fire Rescue Dept. Duen Horng Chau, Georgia Tech Bistra Dilkina, Georgia Tech Abstract: The Atlanta Fire Rescue Department (AFRD), like many municipal fire departments, actively works to reduce fire risk by inspecting commercial properties for potential hazards and fire code violations. However, AFRD’s fire inspection practices relied on tradition and intuition, with no existing data-driven process for prioritizing fire inspections or identifying new properties requiring inspection. In collaboration with AFRD, we developed the Firebird framework to help municipal fire departments identify and prioritize commercial property fire inspections, using machine learning, geocoding, and information visualization. Firebird computes fire risk scores for over 5,000 buildings in the city, with true positive rates of up to 71% in predicting fires. It has identified 6,096 new potential commercial properties to inspect, based on AFRD’s criteria for inspection. Furthermore, through an interactive map, Firebird integrates and visualizes fire incidents, property information and risk scores to help AFRD make informed decisions about fire inspections. Firebird has already begun to make positive impact at both local and national levels. It is improving AFRD’s inspection processes and Atlanta residents’ safety, and was highlighted by National Fire Protection Association (NFPA) as a best practice for using data to inform fire inspections. More on http://www.kdd.org/kdd2016/ KDD2016 Conference will be recorded and published on http://videolectures.net/
Views: 1553 KDD2016 video
KDD2016 paper 798
 
03:11
Title: Question Independent Grading using Machine Learning: The Case of Computer Program Grading Authors: Gursimran Singh*, Aspiring Minds Shashank Srikant, Aspiring Minds Varun Aggarwal, Aspiring Minds Abstract: Learning supervised models to grade open-ended responses is an expensive process. A model has to be trained for every prompt/question separately, which in turn requires graded samples. In automatic programming evaluation specifically, the focus of this work, this issue is amplified. The models have to be trained not only for every question but also for every language the question is offered in. Moreover, the availability and time taken by experts to create a labeled set of programs for each question is a major bottleneck in scaling such a system. We address this issue by presenting a method to grade computer programs which requires no labeled samples for grading responses to a new, unseen question. We extend our previous work wherein we introduced a grammar of features to learn question-specific models. In this work, we propose a method to transform those features into a set of features that maintain their structural relation with the labels across questions. Using these features we learn one supervised model across questions, which can then be applied to an ungraded response to an unseen question. We show that our method rivals the performance of both question-specific models and the consensus among human experts, while substantially outperforming extant ways of evaluating codes. We demonstrate the system’s value by deploying it to grade programs in a high stakes assessment. The learning from this work is transferable to other grading tasks such as math question grading and also provides a new variation to the supervised learning approach. More on http://www.kdd.org/kdd2016/ KDD2016 Conference will be recorded and published on http://videolectures.net/
Views: 4729 KDD2016 video
KDD2016 paper 403
 
02:39
Title: Accelerating Online CP Decompositions for Higher Order Tensors Authors: Shuo Zhou*, University of Melbourne Nguyen Vinh, University of Melbourne James Bailey, University of Melbourne Yunzhe Jia, University of Melbourne Ian Davidson, University of California-Davis Abstract: Tensors are a natural representation for multidimensional data. In recent years, CANDECOMP/PARAFAC (CP) decomposition, one of the most popular tools for analyzing multi-way data, has been extensively studied and widely applied. However, today’s datasets are often dynamically changing over time. Tracking the CP decomposition for such dynamic tensors is a crucial but challenging task, due to the large scale of the tensor and the velocity of new data arriving. Traditional techniques, such as Alternating Least Squares (ALS), cannot be directly applied to this problem because of their poor scalability in terms of time and memory. Additionally, existing online approaches have only partially addressed this problem and can only be deployed on third-order tensors. To fill this gap, we propose an efficient online algorithm that can incrementally track the CP decompositions of dynamic tensors with an arbitrary number of dimensions. In terms of effectiveness, our algorithm demonstrates comparable results with the most accurate algorithm, ALS, whilst being computationally much more efficient. Specifically, on small and moderate datasets, our approach is tens to hundreds of times faster than ALS, while for large-scale datasets, the speedup can be more than 3,000 times. Compared to other state-of-the-art online approaches, our method shows not only significantly better decomposition quality, but also better performance in terms of stability, efficiency and scalability. More on http://www.kdd.org/kdd2016/ KDD2016 Conference will be recorded and published on http://videolectures.net/
Views: 391 KDD2016 video
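For reference, the batch baseline the paper accelerates, CP decomposition via ALS, is available off the shelf. A minimal sketch using the tensorly library (the paper's online algorithm itself is not shown here):

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

# A random 3rd-order tensor standing in for, e.g., (user, item, time) data.
T = tl.tensor(np.random.rand(30, 20, 10))

# Batch CP decomposition via ALS -- the baseline the paper speeds up.
weights, factors = parafac(T, rank=5, n_iter_max=200)

# factors[k] has shape (T.shape[k], rank); reconstruct to check the fit.
T_hat = tl.cp_to_tensor((weights, factors))
print(tl.norm(T - T_hat) / tl.norm(T))  # relative reconstruction error
```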
Dealing with Class Imbalance using Thresholding
 
18:40
Author: Rumi Ghosh, Robert Bosch LLC. Abstract: We propose thresholding as an approach to deal with class imbalance. We define the concept of thresholding as a process of determining a decision boundary in the presence of a tunable parameter. The threshold is the maximum value of this tunable parameter where the conditions of a certain decision are satisfied. We show that thresholding is applicable not only for linear classifiers but also for non-linear classifiers. We show that this is the implicit assumption for many approaches to deal with class imbalance in linear classifiers. We then extend this paradigm beyond linear classification and show how non-linear classification can be dealt with under this umbrella framework of thresholding. The proposed method can be used for outlier detection in many real-life scenarios like manufacturing. In advanced manufacturing units, where the manufacturing process has matured over time, the instances (or parts) of the product that need to be rejected (based on a strict regime of quality tests) become relatively rare and are defined as outliers. How to detect these rare parts or outliers beforehand? How to detect combinations of conditions leading to these outliers? These are the questions motivating our research. This paper focuses on prediction of outliers and conditions leading to outliers using classification. We address the problem of outlier detection using classification. The classes are good parts (those passing the quality tests) and bad parts (those failing the quality tests and can be considered as outliers). The rarity of outliers transforms this problem into a class-imbalanced classification problem. More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 2351 KDD2016 video
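A minimal sketch of thresholding with a linear classifier, assuming scikit-learn: train on imbalanced data, then sweep the decision threshold on a validation set instead of accepting the default 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Imbalanced data: ~2% "bad parts" (outliers), mirroring the manufacturing setting.
X, y = make_classification(n_samples=20000, weights=[0.98], flip_y=0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_val)[:, 1]

# Sweep the decision threshold instead of using the default 0.5.
thresholds = np.linspace(0.01, 0.99, 99)
f1s = [f1_score(y_val, scores >= t) for t in thresholds]
best = thresholds[int(np.argmax(f1s))]
print(f"best threshold {best:.2f}, F1 {max(f1s):.3f} "
      f"(vs {f1_score(y_val, scores >= 0.5):.3f} at 0.5)")
```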
KDD2016 paper 920
 
02:36
Title: Reconstructing an Epidemic over Time Authors: Polina Rozenshtein, Aalto University Aristides Gionis*, Aalto University B. Aditya Prakash, Virginia Tech Jilles Vreeken, Max-Planck Institute for Informatics and Saarland University Abstract: We consider the problem of reconstructing an epidemic over time, or, more generally, reconstructing the propagation of an activity in a network. Our input consists of a temporal network, which contains information about when two nodes interacted, and a small sample of nodes that have been reported as infected. The goal is to recover the flow of the spread, including discovering the starting nodes, and identifying other likely-infected nodes that were not reported. This has multiple applications, from public health to social media and viral marketing. Previous work explicitly factors in many unrealistic assumptions: (a) that the underlying network does not change or that we see all interactions; (b) that we have access to perfect, noise-free data; or (c) that we know the exact propagation model. In contrast, we avoid these simplifications, and take into account the temporal network, require only a small sample of reported infections, and do not make any restrictive assumptions on the propagation model. We develop CulT, a scalable and effective algorithm to reconstruct epidemics that is also suited for an online setting. It works by formulating the problem as a temporal Steiner-tree computation, for which we design a fast algorithm leveraging the specific structure of our problem. We demonstrate the efficacy of CulT through extensive experiments on diverse datasets. More on http://www.kdd.org/kdd2016/ KDD2016 Conference will be recorded and published on http://videolectures.net/
Views: 759 KDD2016 video
Algorithmic Bias: From Discrimination Discovery to Fairness-Aware Data Mining (Part 1)
 
35:12
Authors: Carlos Castillo, EURECAT, Technology Centre of Catalonia Francesco Bonchi, ISI Foundation Abstract: Algorithms and decision making based on Big Data have become pervasive in all aspects of our daily lives (offline and online), as they have become essential tools in personal finance, health care, hiring, housing, education, and policies. It is therefore of societal and ethical importance to ask whether these algorithms can be discriminative on grounds such as gender, ethnicity, or health status. It turns out that the answer is positive: for instance, recent studies in the context of online advertising show that ads for high-income jobs are presented to men much more often than to women [Datta et al., 2015]; and ads for arrest records are significantly more likely to show up on searches for distinctively black names [Sweeney, 2013]. This algorithmic bias exists even when the developer of the algorithm has no intention to discriminate. Sometimes it may be inherent to the data sources used (software making decisions based on data can reflect, or even amplify, the results of historical discrimination), but even when the sensitive attributes have been suppressed from the input, a well trained machine learning algorithm may still discriminate on the basis of such sensitive attributes because of correlations existing in the data. These considerations call for the development of data mining systems which are discrimination-conscious by design. This is a novel and challenging research area for the data mining community. The aim of this tutorial is to survey algorithmic bias, presenting its most common variants, with an emphasis on the algorithmic techniques and key ideas developed to derive efficient solutions. The tutorial covers two main complementary approaches: algorithms for discrimination discovery and discrimination prevention by means of fairness-aware data mining. We conclude by summarizing promising paths for future research. More on http://www.kdd.org/kdd2016/ KDD2016 conference is published on http://videolectures.net/
Views: 1839 KDD2016 video
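As a concrete starting point for discrimination discovery, one common check is to compare positive-outcome rates across groups (statistical parity and the "80% rule"). A minimal sketch with pandas on invented audit data:

```python
import pandas as pd

# Hypothetical audit log: model decisions with a protected attribute.
df = pd.DataFrame({
    "group":    ["a", "a", "a", "a", "b", "b", "b", "b"],
    "approved": [1,   1,   0,   1,   0,   1,   0,   0],
})

# Positive-outcome rate per group (statistical parity check).
rates = df.groupby("group")["approved"].mean()
print(rates)
# Disparate-impact ratio; the "80% rule" flags values below 0.8.
print("DI ratio:", rates.min() / rates.max())
```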
Large Scale Machine Learning at Verizon: Theory and Applications
 
34:56
Author: Jeff Stribling, Verizon Communications Abstract: This talk will cover recent innovations in large-scale machine learning and their applications on massive, real-world data sets at Verizon. These applications power new revenue generating products and services for the company and are hosted on a massive computing and storage platform known as Orion. We will discuss the architecture of Orion and the underlying algorithmic framework. We will also cover some of the real-world aspects of building a new organization dedicated to creating new product lines based on data science. More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 1135 KDD2016 video
Gameplay First: Data Science at Blizzard Entertainment
 
39:07
Author: Chaitanya Chemudugunta, Blizzard Entertainment Inc. Abstract: With a focus on gameplay first, Blizzard Entertainment is known for developing premium games like World of Warcraft, Starcraft, Diablo, Hearthstone, Heroes of the Storm and Overwatch. Tens of millions of players log in daily and interact with a variety of game features, generating massive amounts of rich and diverse data streams. In this talk, I will provide a general overview of data science challenges at Blizzard and discuss two challenges in the area of game design. Specifically, I will discuss challenges and solutions for matchmaking in competitive games, and discuss how gameplay and player segmentation can be used to inform game balance. More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 690 KDD2016 video
KDD2016 paper 1081
 
03:10
Title: Recurrent Marked Temporal Point Process Authors: Nan Du*, Georgia Tech Hanjun Dai, Max Planck Institute Rakshit Trivedi, Max Planck Institute Utkarsh Upadhyay, Max Planck Institute Manuel Gomez-Rodriguez, MPI-SWS Le Song, MPI-SWS Abstract: Large volumes of event data are becoming increasingly available in a wide variety of applications, such as healthcare analytics, smart cities, and social network analysis. The precise time interval or the exact distance between two events carries a great deal of information about the dynamics of the underlying systems. These characteristics make such data fundamentally different from independently and identically distributed data and time-series data where time and space are treated as indices rather than random variables. Marked temporal point processes and intensity functions are the mathematical framework for modeling such event data. However, typical point process models, such as Hawkes processes, continuous Markov chains, autoregressive conditional duration processes, make strong assumptions about the generative processes of event data which may or may not reflect reality, and the parametric assumptions have also restricted the expressive power of temporal point process models. Can we obtain a more expressive model of marked temporal point processes? How can we learn such a model from massive data? In this paper, we propose a novel point process, referred to as the Recurrent Temporal Point Process (RTPP), to simultaneously model the event timings and markers. The key idea of our approach is to view the intensity function of a temporal point process as a nonlinear function of the history, and parameterize the function using a recurrent neural network. We develop an efficient stochastic gradient algorithm for learning RTPP which can readily scale up to millions of events. Using both synthetic and real world datasets, we show that, in the case that the true models are parametric models, RTPP can learn the dynamics of such models without knowing the actual parametric forms; and in the case that the true models are unknown, RTPP can also learn the dynamics, and achieve better predictive performance than other parametric alternatives based on prior knowledge. More on http://www.kdd.org/kdd2016/ KDD2016 Conference will be recorded and published on http://videolectures.net/
Views: 1119 KDD2016 video
Decoding Fashion Contexts Using Word Embeddings
 
24:23
Author: Deepak Warrier, Myntra Designs Private Ltd. Abstract: Personalisation in e-commerce hinges on dynamically uncovering the user’s context via his/her interactions on the portal. The harder the context identification, the less effective the personalisation. Our work attempts to uncover and understand the user’s context to effectively render personalisation for fashion e-commerce. We highlight fashion-domain specific gaps with typical implementations of personalised recommendation systems and present an alternate approach. Our approach hinges on user sessions (clickstream) as a proxy for the context and explores the “session vector” as an atomic unit for personalisation. The approach to learning the context vector incorporates both the fashion product (style) attributes and the users’ browsing signals. We establish various possible user contexts (product clusters), and a style can have a fuzzy membership in multiple contexts. We predict the user’s context using the skip-gram model with negative sampling introduced by Mikolov et al. [1]. We are able to decode the context with high accuracy even for non-coherent sessions. More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 559 KDD2016 video
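A minimal sketch of the session-vector idea, assuming gensim: treat each clickstream session as a "sentence" of style IDs, train skip-gram with negative sampling, and average item vectors to get a session vector. The sessions and IDs below are invented; this is not Myntra's pipeline.

```python
import numpy as np
from gensim.models import Word2Vec

# Clickstream sessions: each session is a sequence of style IDs ("words").
sessions = [
    ["tee_101", "tee_204", "jeans_77"],
    ["dress_12", "dress_19", "heels_3"],
    ["tee_204", "tee_101", "sneakers_8"],
    ["dress_19", "dress_12", "clutch_5"],
]

# Skip-gram with negative sampling (sg=1, negative=5), as in Mikolov et al.
model = Word2Vec(sessions, vector_size=32, window=3, sg=1, negative=5,
                 min_count=1, epochs=50, seed=0)

# A "session vector" as the mean of its item vectors; its nearest items
# hint at the user's current context (product cluster).
session_vec = np.mean([model.wv[s] for s in ["tee_101", "tee_204"]], axis=0)
print(model.wv.similar_by_vector(session_vec, topn=3))
```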
KDD2016 paper 567
 
03:00
Title: Graph Wavelets via Sparse Cuts Authors: Arlei Lopes da Silva*, University of California, Santa Barbara Xuan-Hong Dang, University of California, Santa Barbara Prithwish Basu, Raytheon BBN Technologies Ambuj Singh, University of California, Santa Barbara Ananthram Swami, Army Lab Abstract: Modeling information that resides on vertices of large graphs is a key problem in several real-life applications, ranging from social networks to the Internet-of-things. Signal Processing on Graphs and, in particular, graph wavelets can exploit the intrinsic smoothness of these datasets in order to represent them in both a compact and accurate manner. However, how to discover wavelet bases that capture the geometry of the data with respect to the signal as well as the graph structure remains an open question. In this paper, we study the problem of computing graph wavelet bases via sparse cuts in order to produce low-dimensional encodings of data-driven bases. This problem is connected to known hard problems in graph theory (e.g. multiway cuts) and thus requires an efficient heuristic. We formulate the basis discovery task as a relaxation of a vector optimization problem, which leads to an elegant solution as a regularized eigenvalue computation. Moreover, we propose several strategies in order to scale our algorithm to large graphs. Experimental results show that the proposed algorithm can effectively encode both the graph structure and signal, producing compressed and accurate representations for vertex values in a wide range of datasets (e.g. sensor and gene networks) and outperforming the best baseline by up to 8 times. More on http://www.kdd.org/kdd2016/ KDD2016 Conference will be recorded and published on http://videolectures.net/
Views: 767 KDD2016 video
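The paper's regularized eigenvalue formulation is not reproduced here, but the classical building block it relaxes, a sparse cut from a Laplacian eigenvector, can be sketched in a few lines with NumPy and NetworkX:

```python
import numpy as np
import networkx as nx

# Two loosely connected communities -- a sparse cut should separate them.
G = nx.barbell_graph(8, 1)
L = nx.laplacian_matrix(G).toarray().astype(float)

# Second-smallest eigenvector of the Laplacian (the Fiedler vector).
vals, vecs = np.linalg.eigh(L)
fiedler = vecs[:, 1]

# Thresholding the Fiedler vector at 0 yields the two sides of the cut.
side_a = [n for n, v in zip(G.nodes, fiedler) if v < 0]
side_b = [n for n, v in zip(G.nodes, fiedler) if v >= 0]
print(side_a, side_b)
```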
Making Strides in Quantifying and Understanding Soccer
 
35:49
Author: Sarah Rudd, StatDNA, LLC Abstract: Soccer has a rich history of people using data in an attempt to gain a better understanding of what happened in a game. However, due to its fluid nature, the sport is often assumed to be difficult to quantify and analyze. This talk will highlight some of the progress that has been made in soccer analytics in recent years, including some of the advances being made thanks to rich, full-tracking datasets. More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 2890 KDD2016 video
KDD2016 paper 984
 
00:56
Title: The Limits of Popularity-Based Recommendations, and the Role of Social Ties Authors: Marco Bressan*, Sapienza University of Rome Stefano Leucci, Sapienza University of Rome Alessandro Panconesi, Sapienza University of Rome Prabhakar Raghavan, Google Erisa Terolli, Sapienza University of Rome Abstract: In this paper we introduce a mathematical model that captures some of the salient features of recommender systems that are based on popularity and that try to exploit social ties among the users. We show that, under very general conditions, the market always converges to a steady state, for which we are able to give an explicit form. Thanks to this we can tell rather precisely how much a market is altered by a recommendation system, and determine the power of users to influence others. Our theoretical results are complemented by experiments with real world social networks showing that social graphs prevent large market distortions in spite of the presence of highly influential users. More on http://www.kdd.org/kdd2016/ KDD2016 Conference will be recorded and published on http://videolectures.net/
Views: 3331 KDD2016 video
KDD2016 paper 767
 
02:43
Title: A Subsequence Interleaving Model for Sequential Pattern Mining Authors: Jaroslav Fowkes, University of Edinburgh Charles Sutton, University of Edinburgh Abstract: Recent sequential pattern mining methods have used the minimum description length (MDL) principle to define an encoding scheme which describes an algorithm for mining the most compressing patterns in a database. We present a novel subsequence interleaving model based on a probabilistic model of the sequence database, which allows us to search for the most compressing set of patterns without designing a specific encoding scheme. Our proposed algorithm is able to efficiently mine the most relevant sequential patterns and rank them using an associated measure of interestingness. The efficient inference in our model is a direct result of our use of a structural expectation-maximization framework, in which the expectation-step takes the form of a submodular optimization problem subject to a coverage constraint. We show on both synthetic and real world datasets that our model mines a set of sequential patterns with low spuriousness and redundancy, high interpretability and usefulness in real-world applications. Furthermore, we demonstrate that the quality of the patterns from our approach is comparable to, if not better than, existing state of the art sequential pattern mining algorithms. More on http://www.kdd.org/kdd2016/ KDD2016 Conference will be recorded and published on http://videolectures.net/
Views: 296 KDD2016 video
Node Representation in Mining Heterogeneous Information Networks
 
34:26
Author: Yizhou Sun, Computer Science Department, University of California, Los Angeles (UCLA) Abstract: One of the challenges in mining information networks is the lack of an intrinsic metric for representing nodes in a low-dimensional space, which is essential in many mining tasks, such as recommendation and anomaly detection. Moreover, when it comes to heterogeneous information networks, where nodes belong to different types and links represent different semantic meanings, it is even more challenging to represent nodes properly. In this talk, we will focus on two mining tasks, i.e., (1) content-based recommendation and (2) anomaly detection in heterogeneous categorical events, and introduce (1) how to represent nodes when different types of nodes and links are involved; and (2) how heterogeneous links play different roles in these tasks. Our results have demonstrated the superiority as well as the interpretability of these new methodologies. More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 2100 KDD2016 video
Fast and Accurate Kmeans Clustering with Outliers
 
18:45
Author: Shalmoli Gupta, Department of Computer Science, University of Illinois at Urbana-Champaign More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 1768 KDD2016 video
KDD2016 paper 1156
 
02:38
Title: EMBERS AutoGSR: Automated Coding of Civil Unrest Events Authors: Parang Saraf*, Virginia Tech Naren Ramakrishnan, Virginia Tech Abstract: We describe the EMBERS AutoGSR system that conducts automated coding of civil unrest events from news articles published in multiple languages. The nuts and bolts of the AutoGSR system constitute an ecosystem of filtering, ranking, and recommendation models to determine if an article reports a civil unrest event and, if so, proceeds to identify and encode specific characteristics of the civil unrest event such as the when, where, who, and why of the protest. AutoGSR has been deployed for the past 6 months, continually processing data 24x7 in languages such as Spanish, Portuguese, and English, and encoding civil unrest events in 10 countries of Latin America: Argentina, Brazil, Chile, Colombia, Ecuador, El Salvador, Mexico, Paraguay, Uruguay, and Venezuela. We demonstrate the superiority of AutoGSR over both manual approaches and other state-of-the-art encoding systems for civil unrest. More on http://www.kdd.org/kdd2016/ KDD2016 Conference will be recorded and published on http://videolectures.net/
Views: 5698 KDD2016 video
Disaggregating Appliance-Level Energy Consumption: A Probabilistic Framework
 
09:15
Author: Sabina Tomkins, Jack Baskin School of Engineering, University of California Santa Cruz Abstract: In this work we propose a probabilistic disaggregation framework which can determine the energy consumption of individual electrical appliances from aggregate power readings. Our proposed framework uses probabilistic soft logic (PSL) to define a hinge-loss Markov random field (HL-MRF). Our method is novel in that it can integrate a diverse range of features, is highly scalable to any number of appliances, and makes fewer assumptions than existing methods. As the residential sector is responsible for over a third of all electricity demand, and delivering appliance level energy consumption information to consumers has been demonstrated to reduce electricity consumption, our framework has the potential to make a significant impact on energy savings. More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 381 KDD2016 video
Graphons and Machine Learning: Modeling and Estimation of Sparse Massive Networks
 
01:12:17
Author: Jennifer Chayes, Microsoft Research Abstract: There are numerous examples of sparse massive networks, in particular the Internet, WWW and online social networks. How do we model and learn these networks? In contrast to conventional learning problems, where we have many independent samples, it is often the case for these networks that we can get only one independent sample. How do we use a single snapshot today to learn a model for the network, and therefore be able to predict a similar, but larger network in the future? In the case of relatively small or moderately sized networks, it’s appropriate to model the network parametrically, and attempt to learn these parameters. For massive networks, a non-parametric representation is more appropriate. In this talk, we first review the theory of graphons, developed over the last decade to describe limits of dense graphs, and the more recent theory describing sparse graphs of unbounded average degree, including power-law graphs. We then show how to use these graphons as non-parametric models for sparse networks. Finally, we show how to get consistent estimators of these non-parametric models, and moreover how to do this in a way that protects the privacy of individuals on the network. More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 2899 KDD2016 video
KDD2016 paper 943
 
02:37
Title: Evaluating Mobile App Release Authors: Ya Xu*, LinkedIn Corporation Nanyu Chen, LinkedIn Corporation Abstract: We have seen an explosive growth of mobile usage, particularly on mobile apps. It is more important than ever to be able to properly evaluate mobile app release. A/B testing is a standard framework to evaluate new ideas. We have seen much of its applications in the online world across the industry [9,10,12]. Running A/B tests on mobile apps turns out to be quite different, and much of it is attributed to the fact that we cannot ship code easily to mobile apps other than going through a lengthy build, review and release process. Mobile infrastructure and user behavior differences also contribute to how A/B tests are conducted differently on mobile apps, which will be discussed in detail in this paper. In addition to measuring features individually in the new app version through randomized A/B tests, we have a unique opportunity to evaluate the mobile app as a whole using the quasi-experimental framework [21]. Not all features can be A/B tested due to infrastructure changes and holistic product redesign. We propose and establish quasi-experiment techniques for measuring impact from mobile app release, with results shared from a recent major app launch at LinkedIn. More on http://www.kdd.org/kdd2016/ KDD2016 Conference will be recorded and published on http://videolectures.net/
Views: 1007 KDD2016 video
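A minimal sketch of the quasi-experimental idea, on invented numbers: a difference-in-differences estimate compares the metric change among early adopters of the new app version with the change in a comparison group still on the old version, netting out common time trends. This illustrates the general technique, not LinkedIn's specific methodology.

```python
import numpy as np

# Hypothetical daily sessions-per-user metric, before/after an app release,
# for adopters (updated early) and a comparison group (still on the old app).
pre_adopters, post_adopters = np.array([5.1, 5.0, 5.2]), np.array([5.9, 6.0, 5.8])
pre_comparison, post_comparison = np.array([4.8, 4.9, 4.7]), np.array([5.0, 5.1, 4.9])

# Difference-in-differences: the adopters' change minus the comparison change
# removes time trends common to both groups.
did = (post_adopters.mean() - pre_adopters.mean()) \
    - (post_comparison.mean() - pre_comparison.mean())
print(f"estimated release effect: {did:+.2f} sessions/user")
```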
A Framework of Combining Deep Learning and Survival Analysis for Asset Health Management
 
14:32
Author: Linxia Liao, General Electric Company Abstract: We propose a method to integrate feature extraction and prediction as a single optimization task by stacking a three-layer model as a deep learning structure. The first layer of the deep structure is a Long Short Term Memory (LSTM) model which deals with the sequential input data from a group of assets. The output of the LSTM model is followed by mean-pooling, and the result is fed to the second layer. The second layer is a neural network layer, which further learns the feature representation. The output of the second layer is connected to a survival model as the third layer for predicting asset health condition. The parameters of the three-layer model are optimized together via stochastic gradient descent. The proposed method was tested on a small dataset collected from a fleet of mining haul trucks. The model resulted in the “individualized” failure probability representation for assessing the health condition of each individual asset, which well separates the in-service and failed trucks. The proposed method was also tested on a large open source hard drive dataset, and it showed promising results. More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 854 KDD2016 video
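The three-layer stack described above can be sketched structurally in PyTorch. The survival model in the third layer is approximated here by a simple sigmoid failure-probability head, so this is an illustration of the wiring, not the paper's model.

```python
import torch
import torch.nn as nn

class HealthModel(nn.Module):
    """LSTM -> mean-pooling -> dense layer -> failure-probability head."""
    def __init__(self, n_sensors=10, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_sensors, hidden, batch_first=True)        # layer 1
        self.feature = nn.Sequential(nn.Linear(hidden, 32), nn.ReLU())  # layer 2
        self.hazard = nn.Linear(32, 1)      # layer 3 (survival-model stand-in)

    def forward(self, x):                   # x: (batch, time, n_sensors)
        out, _ = self.lstm(x)
        pooled = out.mean(dim=1)            # mean-pooling over time
        return torch.sigmoid(self.hazard(self.feature(pooled)))

model = HealthModel()
x = torch.randn(4, 50, 10)                  # 4 assets, 50 time steps, 10 sensors
p_fail = model(x)                           # individualized failure probability
loss = nn.BCELoss()(p_fail.squeeze(1), torch.tensor([0., 0., 1., 0.]))
loss.backward()                             # all three layers trained jointly
```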
Computational Social Science: Exciting Progress and Future Challenges
 
39:37
Author: Duncan Watts, Microsoft Research Abstract: The past 15 years have witnessed a remarkable increase in both the scale and scope of social and behavioral data available to researchers, leading some to herald the emergence of a new field: “computational social science.” Against these exciting developments stands a stubborn fact: that in spite of many thousands of published papers, there has been surprisingly little progress on the “big” questions that motivated the field in the first place—questions concerning systemic risk in financial systems, problem solving in complex organizations, and the dynamics of epidemics or social movements, among others. In this talk I highlight some examples of research that would not have been possible just a handful of years ago and that illustrate the promise of CSS. At the same time, they illustrate its limitations. I then conclude with some thoughts on how CSS can bridge the gap between its current state and its potential. More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 1474 KDD2016 video
KDD2016 paper 448
 
02:59
Title: Point-of-Interest Recommendations: Learning Potential Check-ins from Friends Authors: Huayu Li, University of North Carolina at Charlotte Yong Ge, University of North Carolina at Charlotte Hengshu Zhu, Baidu Inc. Abstract: The emergence of Location-based Social Network (LBSN) services provides a wonderful opportunity to build personalized Point-of-Interest (POI) recommender systems. Although a personalized POI recommender system can significantly facilitate users’ outdoor activities, it faces many challenging problems, such as the difficulty of modeling users’ POI decision-making processes and of addressing data sparsity and the user/location cold-start problem. To cope with these challenges, we define three types of friends (i.e., social friends, location friends, and neighboring friends) in LBSN, and develop a two-step framework to leverage the information of friends to improve POI recommendation accuracy and address the cold-start problem. Specifically, we first propose to learn a set of potential locations that each individual’s friends have checked in before and this individual is most interested in. Then we incorporate three types of check-ins (i.e., observed check-ins, potential check-ins and other unobserved check-ins) into a matrix factorization model using two different loss functions (i.e., the square error based loss and the ranking error based loss). To evaluate the proposed model, we conduct extensive experiments with many state-of-the-art baseline methods and evaluation metrics on two real-world data sets. The experimental results demonstrate the effectiveness of our methods. More on http://www.kdd.org/kdd2016/ KDD2016 Conference will be recorded and published on http://videolectures.net/
Views: 924 KDD2016 video
Rebalancing Bike Sharing Systems: A Multi-source Data Smart Optimization
 
18:56
Author: Junming Liu, Rutgers, The State University of New Jersey Abstract: Bike sharing systems, aiming at providing the missing links in public transportation systems, are becoming popular in cities. A key to success for a bike sharing system is the effectiveness of rebalancing operations, that is, the efforts of restoring the number of bikes in each station to its target value by routing vehicles through pick-up and drop-off operations. There are two major issues for this bike rebalancing problem: the determination of station inventory target levels and the large scale multiple capacitated vehicle routing optimization with outlier stations. The key challenges include demand prediction accuracy for inventory target level determination, and an effective optimizer for vehicle routing with hundreds of stations. To this end, in this paper, we develop a Meteorology Similarity Weighted K-Nearest-Neighbor (M-SWK) regressor to predict the station pick-up demand based on large-scale historic trip records. Based on further analysis of the station network constructed by station-station connections and the trip duration, we propose an inter-station bike transition (ISBT) model to predict the station drop-off demand. Then, we provide a mixed integer nonlinear programming (MINLP) formulation of the multiple capacitated bike routing problem with the objective of minimizing total travel distance. To solve it, we propose an Adaptive Capacity Constrained K-centers Clustering (AdaCCKC) algorithm to separate outlier stations (the demands of these stations are very large and make the optimization infeasible) and group the remaining stations into clusters within which one vehicle is scheduled to redistribute bikes between stations. In this way, the large scale multiple vehicle routing problem is reduced to an inner-cluster single-vehicle routing problem with guaranteed feasible solutions. Finally, the extensive experimental results on the NYC Citi Bike system show the advantages of our approach for bike demand prediction and large-scale bike rebalancing optimization. More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 439 KDD2016 video
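The demand-prediction step can be illustrated with a toy version of the meteorology-similarity weighting: predict a station's pick-up demand as a similarity-weighted average over the k historical time slots with the most similar weather. The features and numbers below are invented; this sketches the M-SWK idea, not the paper's code.

```python
import numpy as np

def mswk_predict(query_weather, hist_weather, hist_pickups, k=3):
    """Weighted KNN over weather features: closer weather -> larger weight."""
    d = np.linalg.norm(hist_weather - query_weather, axis=1)
    nn = np.argsort(d)[:k]                 # k most similar historical slots
    w = 1.0 / (d[nn] + 1e-9)
    return np.average(hist_pickups[nn], weights=w)

# Hypothetical features: [temperature C, rain mm/h, wind km/h]; pickups/hour.
hist_weather = np.array([[25, 0, 5], [18, 4, 12], [27, 0, 8], [16, 6, 15]])
hist_pickups = np.array([120, 35, 140, 20])
print(mswk_predict(np.array([26, 0, 6]), hist_weather, hist_pickups))
```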
Opportunities and Challenges for Remote Sensing in Agricultural Applications of Data Science
 
45:38
Author: Melba Crawford, College of Engineering, Purdue University Abstract: Increases in global population, coupled with the challenges of climate change, require the development of technologies to support increased food production throughout the entire supply chain – from plant breeding to delivery of agricultural products. Developments in remote sensing from space-based, airborne, and proximal sensing platforms, coupled with advanced capabilities in computational platforms and data analytics, are providing new opportunities for contributing solutions to address grand challenges related to food, energy, and water. Spaceborne platforms carrying new active and passive sensors are moving from complex, multi-purpose missions to lower cost, measurement specific constellations of small satellites. Advances in materials are leading to miniaturization and mass production of sensors and supporting instrumentation, resulting in advanced sensing from affordable autonomous vehicles. New algorithms to exploit the massive, multi-modality data sets and provide actionable information for agricultural applications from phenotyping to crop mapping and monitoring are being developed. An overview of recent contributions, as well as opportunities and challenges for data science in the analysis of multi-temporal, multi-scale, multi-sensor remotely sensed data, will be presented. More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 1355 KDD2016 video
KDD2016 paper 679
 
04:29
Title: Structural Neighborhood Based Classification of Nodes in a Network Authors: Sharad Nandanwar*, Indian Institute of Science Musti Narasimha Murty, Indian Institute of Science Abstract: Classification of entities based on the underlying network structure is an important problem. Networks encountered in practice are sparse and have many missing and noisy links. Even though statistical learning techniques have been used for intra-network classification based on the local neighborhood, they perform poorly as they exploit only local information. In this paper, we propose a novel structural neighborhood based learning approach using random walks. To classify a node, we take a random walk from that node and make a decision based on how the nodes in its k-th level neighborhood are classified. We observe that random walks of short length are helpful in classification, whereas emphasizing longer random walks may cause the underlying Markov chain to converge towards its stationary distribution. Considering this, we take a lazy random walk based approach with a variable termination probability for each node, based on its structural properties including degree; a hedged sketch of this walk-and-vote idea follows this entry. Our experimental study on real-world datasets demonstrates the superiority of the proposed approach over existing state-of-the-art approaches. More on http://www.kdd.org/kdd2016/ KDD2016 Conference will be recorded and published on http://videolectures.net/
Views: 2524 KDD2016 video
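A minimal sketch of the lazy-walk classification idea described above, assuming adjacency lists and a majority vote by labeled nodes encountered along the walks; the paper derives each node's termination probability from its structural properties, so the degree-based formula here is only an illustrative assumption.

```python
import random
from collections import Counter

def classify_node(start, adj, labels, num_walks=200, max_steps=10, alpha=0.5):
    """Illustrative lazy random walk classifier for a node in a graph.

    adj: dict node -> list of neighbors; labels: dict of known node labels.
    alpha is the laziness (probability of staying put at each step).
    """
    votes = Counter()
    for _ in range(num_walks):
        node = start
        for _ in range(max_steps):
            deg = len(adj[node])
            # Node-dependent termination; this degree-based form is an
            # assumption, not the paper's derived probability.
            if deg == 0 or random.random() < 0.3 * deg / (1.0 + deg):
                break
            if random.random() < alpha:
                continue                      # lazy step: stay at this node
            node = random.choice(adj[node])   # move to a uniform neighbor
            if node in labels:
                votes[labels[node]] += 1      # labeled node casts a vote
    return votes.most_common(1)[0][0] if votes else None
```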
KDD2016 paper 185
 
02:32
Title: Mining Subgroups with Exceptional Transition Behavior Authors: Florian Lemmerich*, GESIS Martin Becker, University of Würzburg Philipp Singer, GESIS Denis Helic, TU Graz Andreas Hotho, University of Würzburg Markus Strohmaier, GESIS Abstract: We present a new method for detecting interpretable subgroups with exceptional transition behavior in sequential data. Identifying such patterns has many potential applications, e.g., for studying human mobility or analyzing the behavior of internet users. To tackle this task, we employ exceptional model mining, which is a general approach for identifying interpretable data subsets that exhibit unusual interactions between a set of target attributes with respect to a certain model class. Although exceptional model mining provides a well-suited framework for our problem, previously investigated model classes cannot capture transition behavior. To that end, we introduce first-order Markov chains as a novel model class for exceptional model mining and present a new interestingness measure that quantifies the exceptionality of transition subgroups. The measure compares the distance between the Markov transition matrix of a subgroup and the corresponding matrix of the entire data against the distances obtained for random samples of the dataset; a hedged sketch of this comparison follows this entry. In addition, our method can be adapted to find subgroups that match or contradict given transition hypotheses. We demonstrate that our method is consistently able to recover subgroups with exceptional transition models from synthetic data and illustrate its potential in two application examples. Our work is relevant for researchers and practitioners interested in detecting exceptional transition behavior in sequential data. More on http://www.kdd.org/kdd2016/ KDD2016 Conference will be recorded and published on http://videolectures.net/
Views: 270 KDD2016 video
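The interestingness measure described above compares the subgroup-to-global transition matrix distance against distances from random samples of the same size. A minimal sketch, assuming sequences encoded as lists of integer states and a Manhattan distance between matrices (the paper's exact distance and normalization may differ):

```python
import numpy as np

def transition_matrix(seqs, n_states):
    """Row-normalized first-order Markov transition matrix from sequences."""
    counts = np.zeros((n_states, n_states))
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a, b] += 1
    rows = counts.sum(axis=1, keepdims=True)
    # States never observed fall back to a uniform row.
    return np.divide(counts, rows, out=np.full_like(counts, 1.0 / n_states),
                     where=rows > 0)

def exceptionality(subgroup, all_seqs, n_states, n_samples=500, rng=None):
    """Illustrative interestingness score: the fraction of same-size random
    samples whose transition matrix is closer to the global matrix than the
    subgroup's is. Manhattan distance is an assumption, not the paper's."""
    rng = rng or np.random.default_rng(0)
    T_all = transition_matrix(all_seqs, n_states)
    d_sub = np.abs(transition_matrix(subgroup, n_states) - T_all).sum()
    d_rand = np.array([
        np.abs(transition_matrix(
            [all_seqs[i] for i in rng.choice(len(all_seqs), len(subgroup),
                                             replace=False)],
            n_states) - T_all).sum()
        for _ in range(n_samples)])
    return float((d_rand < d_sub).mean())
```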
Matrix Computations and Optimization in Apache Spark
 
22:52
Authors: Reza Bosagh Zadeh, Institute for Computational and Mathematical Engineering, Stanford University Abstract: We describe matrix computations available in the cluster programming framework Apache Spark. Out of the box, Spark provides abstractions and implementations for distributed matrices and optimization routines using these matrices. When translating single-node algorithms to run on a distributed cluster, we observe that often a simple idea is enough: separating matrix operations from vector operations and shipping the matrix operations to be run on the cluster, while keeping vector operations local to the driver; a short sketch of this pattern follows this entry. In the case of the Singular Value Decomposition, by taking this idea to an extreme, we are able to exploit the computational power of a cluster while running code written decades ago for a single core. Another example is our Spark port of the popular TFOCS optimization package, originally built for MATLAB, which allows for solving linear programs as well as a variety of other convex programs. We conclude with a comprehensive set of benchmarks for hardware-accelerated matrix computations from the JVM, which is interesting in its own right, as many cluster programming frameworks use the JVM. The contributions described in this paper are already merged into Apache Spark, available on Spark installations by default, and commercially supported by a number of companies that provide further services. More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 1024 KDD2016 video
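The tall-and-skinny SVD pattern described above is exposed directly in Spark's MLlib: the heavy matrix work is distributed across the cluster while the small dense subproblem is solved locally on the driver. A minimal PySpark sketch, where the random data, matrix sizes, and k=5 are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.appName("svd-sketch").getOrCreate()

# A tall-and-skinny matrix: many rows distributed over the cluster,
# few columns, so column-by-column products fit on the driver.
rows = spark.sparkContext.parallelize(
    [[float(i + j) for j in range(10)] for i in range(1000)])
mat = RowMatrix(rows)

# computeSVD runs the distributed matrix work on the executors and
# solves the small resulting problem locally on the driver.
svd = mat.computeSVD(5, computeU=False)
print(svd.s)   # top-5 singular values, a local (driver-side) vector

spark.stop()
```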
The Wisdom of Crowds: Best Practices for Data Prep & Machine Learning
 
38:21
Author: Ingo Mierswa, Rapid-I GmbH Abstract: With hundreds of thousands of users, RapidMiner is the most frequently used visual workflow platform for machine learning. It covers the full spectrum of analytics, from data preparation to machine learning and model validation. In this presentation, I will take you on a tour of machine learning spanning the last 15 years of research and industry applications and share key insights about how data scientists perform their daily analysis tasks. These patterns are extracted from mining millions of analytical workflows that have been created with RapidMiner over the past years. This talk will address important questions around the data mining process, such as: What are the most frequently used solutions for typical data quality problems? How often do analysts use decision trees or neural networks? And does this behavior change over time or depend on the user's experience level? More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 809 KDD2016 video
Ranking Relevance in Yahoo Search
 
23:47
Author: Dawei Yin, Yahoo! Inc. Abstract: Search engines play a crucial role in our daily lives. Relevance is the core problem of a commercial search engine. It has attracted thousands of researchers from both academia and industry and has been studied for decades. Relevance in a modern search engine has gone far beyond text matching, and now involves tremendous challenges. The semantic gap between queries and URLs is the main barrier for improving base relevance. Clicks help provide hints to improve relevance, but unfortunately for most tail queries, the click information is too sparse, noisy, or missing entirely. For comprehensive relevance, the recency and location sensitivity of results is also critical. In this paper, we give an overview of the solutions for relevance in the Yahoo search engine. We introduce three key techniques for base relevance – ranking functions, semantic matching features and query rewriting. We also describe solutions for recency sensitive relevance and location sensitive relevance. This work builds upon 20 years of existing efforts on Yahoo search, summarizes the most recent advances and provides a series of practical relevance solutions. The reported performance is based on Yahoo’s commercial search engine, where tens of billions of URLs are indexed and served by the ranking system. More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 1128 KDD2016 video
KDD2016 paper 510
 
02:35
Title: Engagement Capacity and Engaging Team Formation for Reach Maximization of Online Social Media Platforms Authors: Alexander Nikolaev*, University at Buffalo Shounak Gore, University at Buffalo Venu Govindaraju, University at Buffalo Abstract: The challenges of assessing the "health" of online social media platforms and strategically growing them are recognized by many practitioners and researchers. For those platforms that primarily rely on user-generated content, the reach – the degree of participation, referring to the percentage and involvement of users – is a key indicator of success. This paper lays a theoretical foundation for measuring engagement as a driver of reach that achieves growth via positive externality effects. The paper takes a game-theoretic approach to quantifying engagement, viewing a platform's social capital as a cooperatively created value and finding a fair distribution of this value among the contributors; a hedged sketch of one such fair-division computation follows this entry. It introduces engagement capacity, a measure of the ability of users and user groups to engage peers, and formulates the Engaging Team Formation Problem (EngTFP) to identify the sets of users that "make a platform go". We show how engagement capacity can be useful in characterizing forum user behavior and in reach maximization efforts. We also stress how engagement analysis differs from influence measurement. Computational investigations with Twitter and Health Forum data reveal the properties of engagement capacity and the utility of EngTFP. More on http://www.kdd.org/kdd2016/ KDD2016 Conference will be recorded and published on http://videolectures.net/
Views: 193 KDD2016 video
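The abstract frames engagement as a cooperatively created value to be divided fairly among contributors. The classic fair-division concept for cooperative games is the Shapley value, so the sketch below estimates it by Monte Carlo over random orderings; this is an illustrative stand-in, not necessarily the paper's exact engagement-capacity definition, and `platform_value` is a hypothetical value function.

```python
import random

def shapley_engagement(users, platform_value, n_perms=2000, seed=0):
    """Monte Carlo Shapley estimate for a cooperative value function
    platform_value(set_of_users) -> float. Each user is credited with
    their average marginal contribution over random join orders."""
    rng = random.Random(seed)
    shapley = {u: 0.0 for u in users}
    for _ in range(n_perms):
        order = users[:]
        rng.shuffle(order)
        coalition, prev = set(), 0.0
        for u in order:
            coalition.add(u)
            value = platform_value(coalition)
            shapley[u] += value - prev   # u's marginal contribution
            prev = value
    return {u: v / n_perms for u, v in shapley.items()}

# Hypothetical value function: reach grows superlinearly with content,
# a crude stand-in for positive externality effects.
posts = {"alice": 10, "bob": 3, "carol": 7}
value = lambda S: sum(posts[u] for u in S) ** 1.2
print(shapley_engagement(list(posts), value))
```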
Applying Deep Learning for Prognostic Health Monitoring of Aerospace and Building Systems
 
12:26
Author: Kishore K. Reddy, United Technologies Research Center Abstract: Data-driven prognostics are instrumental in enabling anomaly detection, sensor estimation, and prediction in prognostics and health management (PHM) systems. Recent advances in machine learning techniques such as deep learning (DL) have rejuvenated data-driven analysis in PHM. DL algorithms have been successful due to the availability of large volumes of data and their ability to learn features during the learning process, and the features learnt by DL techniques yield significant performance improvements over hand-crafted features. This paper proposes using deep belief networks (DBN) and deep auto-encoders (DAE) in three different aerospace and building systems applications: (i) estimation of fuel flow rate in jet engines, (ii) fault detection in elevator cab doors using a smartphone, and (iii) prediction of chiller power consumption in heating, ventilation, and air conditioning (HVAC) systems. More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 1482 KDD2016 video
Multi-Sensor Prognostics using an Unsupervised Health Index based on an LSTM Encoder-Decoder
 
14:18
Author: Pankaj Malhotra, Tata Consultancy Services Ltd Abstract: Many approaches for estimating the Remaining Useful Life (RUL) of a machine from its operational sensor data make assumptions about how a system degrades or a fault evolves, e.g., exponential degradation. However, in many domains degradation may not follow a pattern. We propose a Long Short Term Memory based Encoder-Decoder (LSTM-ED) scheme to obtain an unsupervised health index (HI) for a system using multi-sensor time-series data. LSTM-ED is trained to reconstruct the time-series corresponding to the healthy state of a system; the reconstruction error is used to compute the HI, which is then used for RUL estimation. A hedged sketch of this reconstruction-error idea follows this entry. We evaluate our approach on the publicly available Turbofan Engine and Milling Machine datasets. We also present results on a real-world industry dataset from a pulverizer mill, where we find significant correlation between the LSTM-ED based HI and maintenance costs. More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 523 KDD2016 video
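Below is a minimal PyTorch sketch of the reconstruction-error health index described above, assuming the model has already been trained to reconstruct healthy sensor windows; the training loop is omitted, and the architecture sizes, decoder wiring, and error normalization are illustrative choices rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class LSTMED(nn.Module):
    """Minimal LSTM encoder-decoder for multi-sensor window reconstruction,
    in the spirit of the LSTM-ED described above (sizes are placeholders)."""
    def __init__(self, n_sensors, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(n_sensors, hidden, batch_first=True)
        self.decoder = nn.LSTM(n_sensors, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_sensors)

    def forward(self, x):                 # x: (batch, time, n_sensors)
        _, state = self.encoder(x)        # compress window into final state
        # Decoding from the encoded state while re-feeding the input is a
        # simplification of the usual teacher-forced decoder.
        dec, _ = self.decoder(x, state)
        return self.out(dec)              # reconstructed window

def health_index(model, window, err_mu, err_sigma):
    """Reconstruction error of a window, normalized by the error statistics
    observed on healthy training data: larger values suggest degradation.
    Mapping HI trends to remaining cycles (RUL) is a separate step."""
    with torch.no_grad():
        err = ((model(window) - window) ** 2).mean().item()
    return (err - err_mu) / err_sigma
```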
Smart Reply: Automated Response Suggestion for Email
 
24:14
Author: Anjuli Kannan, Google Research New York, Google, Inc. Abstract: In this paper we propose and investigate a novel end-to-end method for automatically generating short email responses, called Smart Reply. It generates semantically diverse suggestions that can be used as complete email responses with just one tap on mobile. The system is currently used in Inbox by Gmail and is responsible for assisting with 10% of all mobile responses. It is designed to work at very high throughput and process hundreds of millions of messages daily. The system exploits state-of-the-art, large-scale deep learning. We describe the architecture of the system as well as the challenges that we faced while building it, like response diversity and scalability. We also introduce a new method for semantic clustering of user-generated content that requires only a modest amount of explicitly labeled data. More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 1069 KDD2016 video
Deep Learning for Financial Sentiment Analysis
 
14:44
Author: Sahar Sohangir, Florida Atlantic University More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 1124 KDD2016 video
A Bayesian Network approach to County-Level Corn Yield Prediction
 
14:58
Author: Vikas Chawla, Department of Computer Science, Iowa State University More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 348 KDD2016 video
Asymmetric Transitivity Preserving Graph Embedding
 
16:22
Author: Ziwei Zhang, Tsinghua University Abstract: Graph embedding algorithms embed a graph into a vector space where the structure and the inherent properties of the graph are preserved. Existing graph embedding methods cannot preserve asymmetric transitivity well, which is a critical property of directed graphs. Asymmetric transitivity depicts the correlation among directed edges: if there is a directed path from u to v, then there is likely a directed edge from u to v. Asymmetric transitivity can help in capturing structures of graphs and in recovering from partially observed graphs. To tackle this challenge, we propose the idea of preserving asymmetric transitivity by approximating high-order proximities, which are based on asymmetric transitivity. In particular, we develop a novel graph embedding algorithm, High-Order Proximity preserved Embedding (HOPE for short), which is scalable enough to preserve the high-order proximities of large-scale graphs and capable of capturing asymmetric transitivity. More specifically, we first derive a general formulation that covers multiple popular high-order proximity measurements, then propose a scalable embedding algorithm to approximate the high-order proximity measurements based on this general formulation; a hedged sketch of the factorization idea follows this entry. Moreover, we provide a theoretical upper bound on the RMSE (Root Mean Squared Error) of the approximation. Our empirical experiments on a synthetic dataset and three real-world datasets demonstrate that HOPE can approximate high-order proximities significantly better than state-of-the-art algorithms and outperforms them in the tasks of reconstruction, link prediction, and vertex recommendation. More on http://www.kdd.org/kdd2016/ KDD2016 Conference is published on http://videolectures.net/
Views: 327 KDD2016 video
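Katz proximity is one popular high-order proximity that such a general formulation can cover (assuming Katz here is my choice, not stated in the abstract). The sketch below builds the Katz matrix explicitly and factorizes it with a truncated SVD into source/target embeddings; the actual HOPE algorithm uses a generalized SVD precisely to avoid forming this dense matrix, so this is a small-graph illustration only, with beta and the toy graph as placeholders.

```python
import numpy as np

def hope_katz_embedding(A, dim, beta=0.01):
    """Illustrative HOPE-style embedding with Katz proximity:
    S = (I - beta*A)^{-1} (beta*A), factorized by truncated SVD so that
    U_s @ U_t.T approximates S (asymmetric source/target roles)."""
    n = A.shape[0]
    # Katz proximity: sum over l >= 1 of (beta*A)^l, in closed form.
    S = np.linalg.solve(np.eye(n) - beta * A, beta * A)
    U, sigma, Vt = np.linalg.svd(S)
    sqrt_sigma = np.sqrt(sigma[:dim])
    U_s = U[:, :dim] * sqrt_sigma      # source embeddings
    U_t = Vt[:dim].T * sqrt_sigma      # target embeddings
    return U_s, U_t

# Toy directed graph: edges 0->1 and 1->2, so the path 0->2 should get
# nonzero proximity even though the edge 0->2 is absent.
A = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [0., 0., 0.]])
U_s, U_t = hope_katz_embedding(A, dim=2)
print(U_s @ U_t.T)   # approximates the asymmetric Katz proximity matrix
```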