Research & Services


Services

Served as a Program Committee Member/Invited Reviewer at several leading Machine Learning conferences.

Book

Sculpting Data for ML: The first act of Machine Learning

The book is endorsed by Julian McAuley, Professor at UC San Diego, Laurence Moroney, AI Lead Advocate at Google, and Mengting Wan, Senior Applied Scientist at Microsoft.

Abstract: In the contemporary world of Artificial Intelligence and Machine Learning, _data is the new oil_. For Machine Learning algorithms to work their magic, it is imperative to lay a firm foundation with relevant data. Sculpting Data for ML introduces readers to the first act of Machine Learning: Dataset Curation. The book offers practical tips for identifying valuable information in the vast amount of crude data available at our fingertips. Its step-by-step guide is accompanied by Python code examples for extracting real-world datasets and illustrates ways to hone the skill of extracting meaningful datasets. The book also dives deep into how data fits into the Machine Learning ecosystem and highlights the impact that good-quality data can have on a Machine Learning system's performance.

Publications

  • Springer Nature's Deep Learning for Social Media Data Analytics
    • Do not fake it till you make it! ‑ Synopsis of trending fake news detection methodologies
      Book Chapter by Rishabh Misra and Jigyasa Grover, accepted for publication in Springer Nature Book "Deep Learning for Social Media Data Analytics", September 2022, ISBN: 978-3-031-10868-6.
    • Book Chapter (coming soon)
  • WSDM'20
    • Addressing Marketing Bias in Product Recommendations
      Mengting Wan, Jianmo Ni, Rishabh Misra, Julian McAuley, in Proceedings of the 2020 ACM Conference on Web Search and Data Mining (WSDM'20), Houston, TX, USA, Feb. 2020. (15% acceptance rate)
    • Paper | Data and Code
  • ACL'19
    • Fine-Grained Spoiler Detection from Large-Scale Review Corpora
      Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley, in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL'19), Florence, Italy, Jul. 2019. (18% acceptance rate)
    • Paper | Dataset | Media: NBC, Gizmodo, Geek.com, UCSD News/ UC News, TechXplore
  • RecSys'18
    • Decomposing Fit Semantics for Product Size Recommendation in Metric Spaces
      Rishabh Misra, Mengting Wan, Julian McAuley, in Proceedings of the 2018 ACM Conference on Recommender Systems (RecSys'18), Vancouver, Canada, Oct. 2018. (25% acceptance rate)
    • Paper | Code | Datasets
  • MUSE'15
    • Scalable Bayesian Matrix Factorization
      Avijit Saha*, Rishabh Misra*, Balaraman Ravindran, in Proceedings of the 6th International Workshop on Mining Ubiquitous and Social Environments (MUSE) @ PKDD/ECML, Porto, Portugal, Sep. 2015, pp. 43-54. (* equal contribution)
    • Paper | Code
  • Preprints
    • Sarcasm Detection using Hybrid Neural Network
      Rishabh Misra and Prahal Arora, arXiv preprint arXiv:1908.07414 (2019). Paper | Code
    • Hotel Recommendation System
      Aditi A Mavalankar*, Ajitesh Gupta*, Chetan Gandotra*, Rishabh Misra*, arXiv preprint arXiv:1908.07498 (2018). (* equal contribution) Paper
    • Scalable Variational Bayesian Factorization Machine
      Avijit Saha, Rishabh Misra, Ayan Acharya, and Balaraman Ravindran, preprint 2017. Paper | Code

Datasets

  • News Headlines Dataset For Sarcasm Detection
    • Past studies in Sarcasm Detection mostly use Twitter datasets collected using hashtag-based supervision, but such datasets are noisy in terms of labels and language. Furthermore, many tweets are replies to other tweets, and detecting sarcasm in these requires the availability of contextual tweets. To overcome the limitations related to noise in Twitter datasets, this News Headlines dataset for Sarcasm Detection is collected from two news websites. TheOnion aims at producing sarcastic versions of current events, and we collected all the headlines from its News in Brief and News in Photos categories (which are sarcastic). We collected real (and non-sarcastic) news headlines from HuffPost.
    • Link to Kaggle page (31k+ downloads on Kaggle)
    • Please cite these articles if you use the dataset (BibTeX):
      @article{misra2019sarcasm,
      title={Sarcasm Detection using Hybrid Neural Network},
      author={Misra, Rishabh and Arora, Prahal},
      journal={arXiv preprint arXiv:1908.07414},
      year={2019}
      }

      @book{book,
      author = {Misra, Rishabh and Grover, Jigyasa},
      year = {2021},
      month = {01},
      pages = {},
      title = {Sculpting Data for ML: The first act of Machine Learning},
      isbn = {978-0-578-83125-1}
      }
  • News Category Dataset
    • People rely on daily news to know what is happening around the world. In today’s world, when the proliferation of fake news is rampant, a large-scale, high-quality source of authentic news articles with published category information would be valuable for learning the natural-language syntax and semantics of authentic news. This dataset contains around 200k news headlines from 2012 to 2018 obtained from HuffPost. To make it more useful, I have included the source links of the news articles so that more data can be extracted as needed. The utility of this dataset is multifold: it could be used to produce interesting linguistic insights about the language used in different news articles or simply to identify untracked news articles.
    • Link to Kaggle page (33k+ downloads on Kaggle)
    • Please cite these articles if you use the dataset (BibTeX):
      @dataset{dataset,
      author = {Misra, Rishabh},
      year = {2018},
      month = {06},
      pages = {},
      title = {News Category Dataset},
      doi = {10.13140/RG.2.2.20331.18729}
      }

      @book{book,
      author = {Misra, Rishabh and Grover, Jigyasa},
      year = {2021},
      month = {01},
      pages = {},
      title = {Sculpting Data for ML: The first act of Machine Learning},
      isbn = {978-0-578-83125-1}
      }
  • Clothing Fit Dataset for Size Recommendation
    • Product size recommendation and fit prediction are critical for improving customers’ shopping experiences and reducing product return rates. However, modeling customers’ fit feedback is challenging due to its subtle semantics, arising from the subjective evaluation of products and an imbalanced label distribution (most of the feedback is "Fit"). These datasets, which are the only fit-related datasets publicly available at this time, were collected from ModCloth and RentTheRunWay and could be used to address these challenges and improve the recommendation process.
    • Link to Kaggle page (6k+ downloads on Kaggle)
    • Please cite these articles if you use the dataset (BibTeX):
      @inproceedings{misra2018decomposing,
      title={Decomposing fit semantics for product size recommendation in metric spaces},
      author={Misra, Rishabh and Wan, Mengting and McAuley, Julian},
      booktitle={Proceedings of the 12th ACM Conference on Recommender Systems},
      pages={422--426},
      year={2018},
      organization={ACM}
      }

      @book{book,
      author = {Misra, Rishabh and Grover, Jigyasa},
      year = {2021},
      month = {01},
      pages = {},
      title = {Sculpting Data for ML: The first act of Machine Learning},
      isbn = {978-0-578-83125-1}
      }
  • IMDB Spoiler Dataset
    • User-generated reviews are often our first point of contact when we consider watching a movie or a TV show. However, beyond telling us about the qualitative aspects of the item we want to consume, reviews may inevitably contain undesired revelatory information (i.e. 'spoilers') such as the surprising fate of a character in a movie or the identity of a murderer in a crime-suspense movie. For users who are interested in consuming the item but are unaware of the critical plot twists, spoilers may decrease the excitement that comes from the pleasurable uncertainty and curiosity of media consumption. Therefore, a natural question is how to identify these spoilers in entertainment reviews so that users can navigate review platforms more effectively. This dataset is collected from IMDB and contains metadata about items as well as user reviews labeled with whether or not they contain a spoiler.
    • Link to Kaggle page (2k+ downloads on Kaggle)
    • Please cite these articles if you use the dataset (BibTeX):
      @dataset{dataset,
      author = {Misra, Rishabh},
      year = {2019},
      month = {05},
      pages = {},
      title = {IMDB Spoiler Dataset},
      doi = {10.13140/RG.2.2.11584.15362}
      }