Using ML for holiday planning: Summarising Airbnb reviews

Photo by Arno Smit on Unsplash

When it comes to holiday accommodation, Airbnb is usually the first option that comes to mind.

With Airbnb’s popularity, there are thousands of listings available in most major cities around the world. Even after filtering by the number of guests, property type, and check-in and check-out dates, we’re still left with plenty to choose from.

When choosing an Airbnb, apart from the obvious requirements of price, location, and amenities, I tend to spend time reading through guest reviews to understand more about the host and the experience I can expect while staying there. The only problem is that this manual effort can be very time-consuming!

There has to be a better way!
=============================

> “How can I get a concise understanding of prior guests’ experiences without having to read through pages of reviews?”

I wasn’t only interested in whether most reviews were positive. I also wanted to know what most guests had said about their experience.

With my problem framed, I decided to approach it in three different ways: Topic Modelling, relevant keyword extraction using TF-IDF (**T**erm **F**requency-**I**nverse **D**ocument **F**requency), and Text Summarisation.

Table of Contents
=================

- [**Data**](#data)  
- [**Let’s discuss approaches**](#lets-discuss-approaches)  
    - [1\. Topic Modelling](#1.-topic-modelling)  
    - [2\. TF-IDF](#2.-tf-idf)  
    - [3\. Text Summarisation](#3.-text-summarisation)  
- [**Extraction In Action (How have I used it for my holiday?)**](#extraction-in-action-(how-have-i-used-it-for-my-holiday?))  
- [**The End**](#the-end)

Data
====

I have used Amsterdam Airbnb reviews sourced from [InsideAirbnb](http://insideairbnb.com/get-the-data.html). Over the last 10 years, guests submitted 450,000 reviews of their stays across 15,000 Amsterdam listings.

**_Preprocessing_**

[An average of 4.4 million foreign tourists visit Amsterdam every year.](https://amsterdam.org/en/facts-and-figures.php) So we can expect a portion of the reviews to be written in a language other than English, which was quickly confirmed when exploring the dataset.

As the dataset did not contain a language field, I used [FastText’s Language Identification model](https://fasttext.cc/docs/en/language-identification.html) to predict the language of each review and filter out the non-English ones. The model was great for my use case as it was fast and accurate, having [achieved over 98% accuracy on the EuroGov dataset, which consists of many different European languages.](https://fasttext.cc/blog/2017/10/02/blog-post.html)
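
The filtering step itself only takes a few lines. Here is a minimal sketch, assuming the pre-trained `lid.176.bin` model has been downloaded from fasttext.cc and that the reviews sit in a local `reviews.csv` with a `comments` column:

``` python
# Minimal sketch: predict the language of each review with FastText and keep
# only the English ones. Assumes lid.176.bin sits alongside reviews.csv.
import fasttext
import pandas as pd

lid_model = fasttext.load_model("lid.176.bin")

def detect_language(text: str) -> str:
    """Return FastText's predicted language code, e.g. 'en' or 'nl'."""
    labels, _probs = lid_model.predict(text.replace("\n", " "))
    return labels[0].replace("__label__", "")

reviews = pd.read_csv("reviews.csv").dropna(subset=["comments"])
reviews["language"] = reviews["comments"].map(detect_language)
english_reviews = reviews[reviews["language"] == "en"]
```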



Apart from the language filtering, the same `reviews.csv` dataset was prepared slightly differently for each approach, as briefly covered below.

For each approach, I used the same 211 reviews from Listing#2818 for comparison.

Let’s discuss approaches
========================

1\. Topic Modelling
-------------------

Topic modelling is an unsupervised machine learning method commonly used for discovering abstract topics within a collection of documents. It considers each document to be represented by a blend of topics and each topic to be represented by a set of words that frequently occur together.

Topic modelling clusters the words found within documents into `n` topics. Each cluster of words represents an abstract topic. For example, given a cluster of words `badminton, taekwondo, football`, we can infer that the associated abstract topic relates to `sports`.

I will be using [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) (LDA), one of the most popular examples of topic modelling.

**_Implementation_**

*   _Data preparation_

1.  Symbols and stop words were removed
2.  Tokens were stemmed using the Snowball algorithm (an improved version of Porter)
3.  TF-IDF vectors created for each review using unigrams

*   _Code_
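
In outline, the code for this step looks something like the sketch below, using NLTK for the cleaning and stemming and gensim for the TF-IDF and LDA parts (`english_reviews` is the English-only frame from the preprocessing step):

``` python
# Rough sketch of the LDA pipeline: clean and stem the reviews, build unigram
# TF-IDF vectors, then fit a 5-topic LDA model with gensim.
# Requires nltk.download("stopwords") to have been run once.
import re

from gensim import corpora, models
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
stop_words = set(stopwords.words("english"))

def preprocess(text):
    """Strip symbols, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(tok) for tok in tokens if tok not in stop_words]

docs = [preprocess(review) for review in english_reviews["comments"]]

dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
tfidf_corpus = models.TfidfModel(bow_corpus)[bow_corpus]

lda = models.LdaModel(tfidf_corpus, num_topics=5, id2word=dictionary, passes=10)
for topic_id, words in lda.show_topics(num_topics=5, num_words=10, formatted=False):
    print(f"Topic #{topic_id + 1}:", [word for word, _ in words])
```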



*   _Example output_

``` python
Topic #1:
['<host>', 'provid', 'map', 'came', 'stay', 'amsterdam', 'travel', 'took', 'late', 'would']

Topic #2:
['<host>', 'everyth', 'good', 'kind', 'well', 'part', 'clean', 'comfort', 'also', 'guest']

Topic #3:
['<host>', 'host', 'stay', 'room', 'clean', 'place', 'great', 'help', 'amsterdam', 'map']

Topic #4:
['<host>', 'jouri', 'worthi', 'chang', 'session', 'vacat', 'overal', 'weed', 'scare', 'classi']

Topic #5:
['<host>', 'stay', 'get', 'room', 'host', 'amsterdam', 'provid', 'apart', 'also', 'come']
```

Let’s break down the topics discovered:

1.  Topic #1 is fairly vague and difficult to interpret. Was “late” mentioned in a negative light about the host? Were guests talking about how the host waited for them because they were late, or about how the host would respond to their late replies?
2.  Topic #2 suggests that the host is good and kind and the place was comfortable and clean.
3.  Topic #3 suggests that the host is a great help and the place again was great and clean.
4.  Topic #4 suggests that the guests enjoyed their journey… maybe including the weed?
5.  Topic #5 again is fairly vague.

*   _Good and bad_

**_The good:_** topics are automatically discovered from the data itself without any labelled data required.

**_The bad:_** there is no right way to decide the number of topics upfront unless you have prior knowledge; it takes a lot of trial and error. At best, LDA can only provide a rough idea of the topics that exist within the data, and mapping a set of words to an abstract topic is a subjective guessing game.

2\. TF-IDF
--------------

[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) is short for **T**erm **F**requency-**I**nverse **D**ocument **F**requency. This scoring mechanism is commonly used in information retrieval and text mining to reflect how relevant a word is to a document.

There are two parts to this score:

*   **Term Frequency**: the number of times a word appears within a document
*   **Inverse Document Frequency**: the inverse of how frequently the word appears across the collection of documents. The term **_inverse_** is important to note, as we are not interested in words that appear frequently across all documents.

A word that appears frequently within a document but is rarely found in the rest of the collection will have a high TF-IDF score, as this word is relevant to that document. For example, we can expect `sleep` in an article that discusses “Benefits of 8 hours of sleep” to have a high TF-IDF score, as the word would be frequently mentioned in that article but might not be used as frequently in other articles.

In contrast, words like `the, good, how` are common words which can be used in various articles. These words would have a low TF-IDF score.

It is also worth mentioning that the same word in different documents will have a different TF-IDF score.
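
To make the scoring concrete, here is a toy calculation using the classic `tf * log(N / df)` formulation (real implementations such as scikit-learn’s add smoothing and normalisation, but the intuition is the same):

``` python
# Toy TF-IDF calculation: "sleep" is frequent in the first document but absent
# from the others, so it scores higher than "reduces", which also appears
# elsewhere in the collection.
import math

docs = [
    "sleep improves memory and sleep reduces stress",
    "a good diet reduces stress",
    "exercise improves memory",
]
tokenised = [doc.split() for doc in docs]
n_docs = len(tokenised)

def tf_idf(word, doc):
    tf = doc.count(word) / len(doc)                 # term frequency
    df = sum(word in other for other in tokenised)  # document frequency
    return tf * math.log(n_docs / df)               # "inverse" via log(N / df)

print(tf_idf("sleep", tokenised[0]))    # relatively high
print(tf_idf("reduces", tokenised[0]))  # lower: also found in another document
```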

**_Implementation_**

*   _Data preparation_

1.  Symbols and stop words were removed
2.  Tokens were stemmed using the Snowball algorithm (an improved version of Porter)
3.  TF-IDF vectors created for each review using bigrams

*   _Code_
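
In outline, the extraction looks something like this sketch, using scikit-learn’s `TfidfVectorizer` restricted to bigrams and reusing the `preprocess` helper from the LDA sketch above:

``` python
# Rough sketch: build bigram TF-IDF vectors over the stemmed reviews and
# surface the 10 bigrams with the highest total weight for the listing.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

stemmed_reviews = [" ".join(preprocess(text)) for text in english_reviews["comments"]]

vectoriser = TfidfVectorizer(ngram_range=(2, 2))
tfidf_matrix = vectoriser.fit_transform(stemmed_reviews)

scores = np.asarray(tfidf_matrix.sum(axis=0)).ravel()
top_idx = scores.argsort()[::-1][:10]
print([vectoriser.get_feature_names_out()[i] for i in top_idx])
```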



*   _Example output_

``` python
['great host', 'perfect host', 'public transport', 'high recommend', 'place clean', 'get around', 'make sure', 'stay amsterdam', 'recommend stay', 'host place']
```

The top 10 relevant keywords across all guest reviews indicate that the host is great, the place is clean, and guests highly recommend staying there. There were also frequent mentions of public transport.

Compared to LDA, the keywords extracted from TF-IDF are less ambiguous. But there are still keywords like `make sure` and `get around` which are a little too vague to interpret.

*   _Good and bad_

**_The good:_** relevant keywords are extracted using simple statistical methods, and it is straightforward to implement.

**_The bad:_** semantic meaning is not taken into consideration. Terms like `clean apartment` and `clean flat` share the same meaning, but TF-IDF treats them as two different strings.

3\. Text Summarisation
----------------------

Text Summarisation is used to find the most informative sentences in a document or collection of documents. Extractive Summarisation is the most popular approach; it involves selecting the sentences that best represent the information in the document or collection.

A commonly used technique for Extractive Summarisation is a graph-based algorithm called TextRank. It was built on [PageRank](https://en.wikipedia.org/wiki/PageRank) (think Google!): sentences are ranked by importance based on how similar they are to one another.

**_Implementation_**

*   _Data preparation_

1.  Symbols and stop words were removed
2.  100-dimensional GloVe embeddings pre-trained on [Wikipedia+Gigaword 5](https://nlp.stanford.edu/projects/glove/) were downloaded and extracted
3.  A similarity matrix was built by applying GloVe embeddings to each review sentence and calculating the similarity between every pair of sentences using cosine distance
4.  The TextRank algorithm was then applied to rank the sentences

*   _Code_
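
In outline, this step looks something like the sketch below, using averaged GloVe vectors as sentence embeddings and networkx for PageRank (it assumes `glove.6B.100d.txt` has been downloaded and unzipped from the Stanford GloVe page):

``` python
# Rough sketch of the TextRank step: embed each sentence as the mean of its
# GloVe word vectors, build a cosine-similarity graph, then rank sentences
# with PageRank. Requires nltk.download("punkt") to have been run once.
import numpy as np
import networkx as nx
from nltk.tokenize import sent_tokenize
from sklearn.metrics.pairwise import cosine_similarity

def load_glove(path="glove.6B.100d.txt"):
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *vector = line.split()
            embeddings[word] = np.asarray(vector, dtype="float32")
    return embeddings

glove = load_glove()

def sentence_vector(sentence):
    """Average the GloVe vectors of the words we have embeddings for."""
    words = [w for w in sentence.lower().split() if w in glove]
    return np.mean([glove[w] for w in words], axis=0) if words else np.zeros(100)

sentences = [s for review in english_reviews["comments"] for s in sent_tokenize(review)]
vectors = np.vstack([sentence_vector(s) for s in sentences])

similarity = cosine_similarity(vectors)
np.fill_diagonal(similarity, 0.0)            # a sentence should not vote for itself
ranks = nx.pagerank(nx.from_numpy_array(similarity))

top_five = sorted(range(len(sentences)), key=lambda i: ranks[i], reverse=True)[:5]
print([sentences[i] for i in top_five])
```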



*   _Example output_

``` python
[
    "HOST: <HOST> was very accomodating, has prepared everything you will need for your stay in the city, you get to have great and fun conversations with him, you will be for sure well taken care of!",
    "Not only was the room comfortable, colourful, light, quiet, and equipped with everything we could possibly need - and <HOST>'s flat spotless and beautifully furnished and in a great location - but <HOST> himself is the perfect host, spending the first hour of our arrival talking to us about Amsterdam, answering our many questions, showing us how to get around.",
    "He was friendly, extremely helpful & went the extra mile to make sure my friend and I were at home at his place.",
    "His attention to details and kindness make his place an excellent alternative for those considering a bed and breakfast in Amsterdam\\r\\nI strongly advise to consider his place: Great location, an affordable price, a clean and organized room and a great host.",
    "I traveled first time to Amsterdam with a friend and we stayed at <HOST>´s.He was an excelent host with helping to find out routes and gave lots of tips how to handle things in Amsterdam. The place was very clean and quiet.We recomment <HOST>´s room."
]
```

Immediately I see some resemblance between the top 5 most informative sentences and the top 10 relevant keywords from TF-IDF:

1.  Great host/ Perfect host

```
- <HOST> was very accomodating, has prepared everything you will need for your stay in the city, you get to have great and fun conversations with him, you will be for sure well taken care of
- He was an excelent host with helping to find out routes and gave lots of tips how to handle things in Amsterdam
- His attention to details and kindness make his place an excellent alternative for those considering a bed and breakfast in Amsterdam
- He was friendly, extremely helpful & went the extra mile to make sure my friend and I were at home at his place.
```

2.  Place clean

```
- <HOST>'s flat spotless and beautifully furnished and in a great location
- a clean and organized room
- The place was very clean and quiet
```

3.  High recommend

```
- We recomment <HOST>´s room.
- I strongly advise to consider his place
```

*   _Good and bad_

**_The good:_** the approach is unsupervised, which means no labelled training data is required.

Extraction In Action (How have I used it for my holiday?)
=========================================================

Having tried all 3 approaches above, I found the Text Summarisation approach to be the most insightful, readable, and interpretable, with the least amount of ambiguity.

The section below showcases how I applied it when planning my recent trip to beautiful Lisbon.

After short-listing 5 properties that I wanted to stay in, I copied their URLs into my Jupyter Notebook for extraction.

The workflow involved the following steps (the first two are sketched in code after the list):

1.  extracting the reviews submitted in the last 12 months for each listing
2.  performing the same text cleaning process as discussed above
3.  applying Text Summarisation using TextRank Algorithm as discussed above
4.  visualising the top 5 most informative sentences from each listing
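
A rough sketch of the date-filtering part is shown below; it assumes the Lisbon `reviews.csv` from InsideAirbnb with `listing_id` and `date` columns, after which the reviews are handed to the cleaning and TextRank code shown earlier:

``` python
# Illustrative sketch of steps 1 and 2: per short-listed listing, keep only
# the reviews from the last 12 months. The listing IDs shown are two of the
# five from the shortlist discussed below.
import pandas as pd

lisbon_reviews = pd.read_csv("reviews.csv", parse_dates=["date"])
shortlist = [888141, 21042405]

cutoff = lisbon_reviews["date"].max() - pd.DateOffset(months=12)

for listing_id in shortlist:
    recent = lisbon_reviews[
        (lisbon_reviews["listing_id"] == listing_id)
        & (lisbon_reviews["date"] >= cutoff)
    ]
    print(listing_id, len(recent), "reviews in the last 12 months")
```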

![](https://miro.medium.com/max/1828/1\*P71jGafBTSmC\_tR16dMkLQ.jpeg)

Top 5 most informative review sentences from Airbnb listings

![](https://miro.medium.com/max/1864/1\*qbNy6JCuGmtTQfi4SNqTQA.jpeg)

14 reviews written in the last 12 months for Listing#888141 with summarised text highlighted

Voila! Without summarising the reviews, I would have had to read through 64 reviews for these 5 listings.

**_Hits:_** All 5 summaries covered the main points of concern: the host, the location, and the cleanliness and comfort of the place. The summary for Listing#21042405, in particular, was insightful as it pointed out that keys had to be collected from a different location.

**_Misses:_** A guest from Listing#888141 complained that the place had no A/C and was really hot during their visit. This comment was not picked up in the summarisation, most likely because this guest was the only one to make such a complaint, so the sentence was not as significant compared to the other comments.

The End
=======

Thanks for reading! I’ve enjoyed this wee project that combines two of my favourite things: travel and Data Science. Hopefully you have enjoyed the read and found this application interesting and practical as well.

The Jupyter notebooks used can be found on [GitHub](https://github.com/fifionachow/airbnb-reviews-analysis).

Originally published on [Medium](https://towardsdatascience.com/using-ml-for-holiday-planning-summarising-airbnb-reviews-193abb002232)

References
==========

[https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/](https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/)