This blog post will delve into the challenges, approaches, and algorithms involved in hotel price prediction. We’ll also share what we learned from our experience building a price prediction module for Rakuten Travel.
What is hotel price prediction?
Hotel price prediction is the process of using machine learning algorithms to forecast the rates of hotel rooms based on various factors such as date, location, room type, demand, and historical prices. The goal is to provide hotel operators with valuable insights into pricing trends, enabling them to make informed decisions about revenue management strategies and to stay competitive in the market.
There are quite a few KPIs used by hotels to track their performance and support their business analysis. For price prediction tasks, a logical metric to use as the target variable for machine learning algorithms is the average daily rate (ADR). It shows how much revenue an occupied room earns per day on average. At AltexSoft, we developed an algorithm for price prediction and revenue optimization for one of our clients, Rakuten Travel, Japan's largest online booking platform that also owns some hotels. The algorithm was specifically designed to help a small family hotel among Rakuten's properties forecast ADRs and make informed decisions about starting room prices. The objective was to set these prices 20 to 40 days, or even longer, before the actual check-in to maximize the hotel's revenue.
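To make the ADR definition concrete, here is a minimal sketch that divides a day's room revenue by the number of rooms sold. The function name and the sample figures are made up for illustration and are not taken from the Rakuten project.

```python
def average_daily_rate(room_revenue: float, rooms_sold: int) -> float:
    """ADR = room revenue / number of rooms sold on the given day."""
    if rooms_sold <= 0:
        raise ValueError("ADR is undefined when no rooms were sold")
    return room_revenue / rooms_sold


# Hypothetical day: 12 occupied rooms earned 1,500 in total.
adr = average_daily_rate(room_revenue=1500.0, rooms_sold=12)
print(f"ADR: {adr:.2f}")  # ADR: 125.00
```

Unsold rooms do not enter the calculation, which is why ADR is usually read together with occupancy.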
Hotel price prediction challenges
To run a successful revenue management operation, having the capability to accurately price your rooms to maximize revenue and occupancy is of utmost importance. Of course, all these prediction activities don’t happen by magic.
Check out our video on how revenue management works in hospitality.
Before the advent of machine learning, prediction activities were largely manual and time-consuming, relying heavily on the experience of hotel managers. Today, revenue management has evolved with the help of predictive analytics. This technology enables anticipating future trends and estimating ADRs for hotel or vacation rental properties. With this information, you can adjust your pricing strategy and make sure you offer competitive prices for the inventory.
But there are specific challenges associated with hotel price prediction using machine learning.
The complexity of the hospitality market. The hotel industry is complex, with a wide range of internal and external factors that impact prices. For machine learning algorithms to predict prices accurately, people who do the data preparation must consider these factors and gather all this information to train the model.
Among the internal factors, there are
- historical prices,
- the location of the hotel,
- the number and size of rooms,
- amenities and services provided,
- demand fluctuations based on seasonality and holidays, and more.
The external factors include
- pricing strategies of competitors,
- economic changes,
- events such as sports competitions or festivals,
- exchange rates, and
- political situations, etc.
Data availability and quality. The accuracy of hotel price prediction depends on the availability and quality of data, e.g., historical prices and demand data. If some important information is missing or unreliable, the predictions may be inaccurate. For example, if a hotel is new, there may not be enough historical data to train accurate machine learning models.
Our team faced the challenge of data scarcity when working on the Rakuten project. The hotel had only one year of booking data, whereas for accurate price predictions at least two or three years of data are required. To fill this gap, we used information on direct competitors who sold rooms through Rakuten Travel. We selected similar accommodations in the area and used their booking histories to forecast average market prices.
Data relevance. Including irrelevant data in the training dataset can make the model overly complex, as it tries to learn patterns that don’t actually fit the task. Just like poor-quality or insufficient data, irrelevant information can cause the model to make incorrect predictions when presented with new, unseen data.
While we were engaged in our project, we had to filter out noninformative bookings. Since our goal was to suggest prices for early bookings, we excluded reservations made within 20 (or sometimes 40) days of check-in. Bookings with short lead times could carry prices that differ significantly from the starting rate. Longer lead times between reservation and stay allowed us to obtain a more accurate starting rate.
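The lead-time filter described above can be sketched in a few lines. The record layout, the sample bookings, and the choice to keep bookings made exactly 20 days ahead are assumptions for illustration; the actual project pipeline is not shown in this post.

```python
from datetime import date

# Hypothetical bookings: (booking_date, check_in_date, nightly_rate).
bookings = [
    (date(2023, 1, 5), date(2023, 3, 1), 110.0),   # booked 55 days ahead
    (date(2023, 2, 20), date(2023, 3, 1), 95.0),   # booked 9 days ahead
    (date(2023, 2, 1), date(2023, 3, 10), 120.0),  # booked 37 days ahead
]

MIN_LEAD_DAYS = 20  # threshold from the text; 40 was used in some cases


def is_early_booking(booking_date: date, check_in: date, min_lead_days: int) -> bool:
    """True if the reservation was made at least min_lead_days before check-in."""
    return (check_in - booking_date).days >= min_lead_days


early = [b for b in bookings if is_early_booking(b[0], b[1], MIN_LEAD_DAYS)]
print(len(early))  # 2
```

Raising the threshold to 40 days would keep only the first booking in this sample.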
With sufficient and quality travel data in place, ML becomes a valuable tool for forecasting hotel deals.
So how exactly are hotel price prediction tools built?
Data collection and preprocessing
As with any machine learning task, it all starts with high-quality data that is sufficient for training a model.
To learn more on this and other fundamentals, check out our specialized article on data preparation for machine learning, or alternatively, watch a brief video on the subject.
Data preparation for ML projects in 14 minutes
For now, our focus will be on identifying the appropriate data and constructing a dataset to develop an ML-based model for predicting hotel prices.
Data sources
In developing hotel price prediction models, gathering extensive data from different sources is crucial. Here, the rule is the more, the better, although the quality is equally important. Alexander Konduforov, who served as a Data Science Competence Lead on the Rakuten project, explains, "To achieve high accuracy of predictions, the team used up to 4 years of historical reservations collected from the target hotel and selected competitors."
But where can you obtain sufficient valuable data to construct a predictive model? Several options are available.
Hotel software. The most trustworthy sources of booking data for hotels are property management systems (PMSs), channel managers, and websites with a direct booking module. These systems store all the information regarding reservations and pricing, including booking lead times, occupancy data, and rates at which a particular room or accommodation was booked during a specific period.
For our project, we used reservation data from the client’s PMS, along with reservations made on the Rakuten Travel booking platform for more than 80 hotels located in the same area.
Hospitality data providers. There are data providers that bring all the necessary hotel data to one place, saving you time and effort. For instance, PHP TRAVELS is a data provider that offers various types of travel data, such as travel intent, flight data, hotel rates and pricing, and consumer transaction data, to name a few. Another data provider, Key Data, specializes in short-term rentals. AltexSoft worked with Key Data to enhance their tool with several AI-driven features aimed at more precisely analyzing key hospitality KPIs like Occupancy Rate, RevPAR, Average Daily Rate, etc.
OTAs and metasearch engines. Of course, OTAs like Airbnb and Booking.com as well as metasearch engines like Tripadvisor can be used as data sources for ML models to predict hotel prices. These platforms provide a wealth of data about prices, bookings, and availability for a wide range of accommodations.
It's worth noting, though, that scraping data from websites using web tools or APIs may raise legal concerns. A legitimate approach is to either purchase available datasets or negotiate an agreement with the company so that it provides official access to the information.
Public datasets. You can utilize publicly accessible datasets from Kaggle or other platforms that offer booking information for hotels. However, keep in mind that such datasets are usually limited in size and may not have sufficient features to develop an effective model. For more information on the top public datasets for machine learning, refer to our article.
Dataset structure
After selecting the data source, the next step is to determine the variables that will form the parameters in your hotel price prediction model.
In developing an AI tool for Rakuten Travel, we identified and categorized the following important features from historical data:
- reservation data for a family hotel from its PMS (ADRs, booking lead times, dates);
- reservations data for 80+ competitors located in the same region as the target hotel;
- hotel attributes such as location, property type, max capacity, etc.;
- hotel room amenities;
- seasonality (whether it was high or low season); and
- holidays, vacations, and other external factors.
When the dataset is ready, it has to be split into training and testing sets. Typically, you use 80 percent of the data to train the model and leave the remaining 20 percent unseen to test its accuracy. If you use all of the data to train the model, you may end up with overfitting, a situation in which the model memorizes sample data and performs well on it but shows poor results on new datasets. So you always need to run the trained model on the held-out, unseen test data to detect overfitting and make sure it makes accurate predictions there too.
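As a rough illustration of the 80/20 split, the sketch below shuffles the records with a fixed seed and holds out a fifth of them. The helper name and the stand-in data are hypothetical; in practice a library routine would normally do this job.

```python
import random


def train_test_split(rows, test_share=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(round(len(shuffled) * test_share))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 reservation records
train, test = train_test_split(rows)
print(len(train), len(test))  # 80 20
```

For time-ordered booking data, it is also common to hold out the most recent period instead of a random sample, so that the test mimics forecasting the future.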
Most popular machine learning models to predict hotel prices
Price prediction can be formulated as a regression task. Regression analysis is a statistical technique used to predict continuous numerical values (prices, occupancy rates, or sales volumes in demand forecasting) based on a set of input features (historical prices, location, room size, amenities, etc.). In other words, it helps identify the relationship between a dependent, or target, variable (ADR in our case) and one or more independent (explanatory) variables, also known as predictors, that impact the target variable.
Different ML algorithms can perform a regression analysis to forecast prices: from simpler decision trees to more complex deep-learning neural networks. The most commonly used ones are linear regression, XGBoost, and recurrent neural networks (RNNs). We discuss them in more detail below.
Linear regression
Linear regression models are based on the assumption that the connection between a dependent continuous variable Y and one or more explanatory (independent) variables X is linear, meaning it can be represented by a straight line.
Let's take a visualization of a simple linear regression as an example. Suppose we have a group of cottage hotels of different sizes. The x-axis is their size and the y-axis is their market price.
A regression model plot that represents the correlation between the sizes of hotels and their prices. Source: Medium
In the plot above, the data points are our observed hotels, and they don't lie on a perfectly straight line. To capture the dependency between the two parameters (the impact of hotel size on price), we fit the line that minimizes the sum of squared vertical distances between it and the data points. Now we can use it to predict an expected price from the size.
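The line-fitting idea can be sketched with ordinary least squares for a single feature. The cottage sizes and prices below are invented for illustration, and the function name is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical cottages: size in square meters and nightly price.
sizes = [40, 60, 80, 100, 120]
prices = [90, 130, 150, 190, 210]

slope, intercept = fit_line(sizes, prices)
print(f"price = {slope:.2f} * size + {intercept:.2f}")  # price = 1.50 * size + 34.00
print(f"predicted price for size 90: {slope * 90 + intercept:.2f}")  # 169.00
```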
“While linear regression still works well with structured tabular data, which is the kind of data we had, it wasn’t enough to make accurate predictions in our case,” explains Alexander Konduforov. “So we tried a few other models, including neural networks, evaluated the results, and decided to go with the best-performing one: XGBoost.”
In prediction tasks that use unstructured data (such as images or text), artificial neural networks tend to outperform other algorithms. But when it comes to small structured or tabular datasets, decision tree-based algorithms are often the better choice.
XGBoost
XGBoost (Extreme Gradient Boosting) is a supervised machine learning algorithm based on a set of decision trees to predict a target variable.
The basic idea behind decision trees is to repeatedly split the dataset into smaller subsets based on the features that are most informative for predicting the target variable. Each split results in a node in the tree, with the root node representing the entire dataset and the leaves representing the final predicted outcomes.
Here’s what the decision tree looks like for predictions of booking cancelations and guests coming back. Source: GitHub
The XGBoost algorithm uses a process called gradient boosting, hence the name, to improve the accuracy of the model by iteratively adding new decision trees that focus on the examples the previous models got wrong. Models are added sequentially until further improvements stop.
Besides decision trees, XGBoost allows you to use linear models as base models. The key advantage of XGBoost is its speed and efficiency, making it ideal for large datasets. It also includes several regularization techniques that help prevent overfitting and improve generalization.
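To show the boosting idea in isolation, here is a toy sketch in plain Python: each new one-split tree (a stump) is fitted to the errors left by the trees before it. This is not XGBoost itself, which adds regularization, second-order gradient information, and many optimizations. The data, function names, and hyperparameters are all made up for illustration.

```python
def fit_stump(xs, residuals):
    """Pick the one-threshold split of xs that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keeps both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def fit_boosting(xs, ys, n_trees=50, learning_rate=0.3):
    """Add stumps one by one, each trained on the errors left so far."""
    base = sum(ys) / len(ys)  # start from the mean price
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        preds = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, preds)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the base value and the scaled contribution of every stump."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps
    )


def mse(model, xs, ys):
    """Mean squared error of the model on the given points."""
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Hypothetical data: day of week (1 = Monday) and observed ADR.
days = [1, 2, 3, 4, 5, 6, 7]
adrs = [100.0, 100.0, 105.0, 105.0, 120.0, 150.0, 150.0]

baseline = fit_boosting(days, adrs, n_trees=0)  # predicts the mean only
boosted = fit_boosting(days, adrs, n_trees=50)
print(f"MSE with mean only: {mse(baseline, days, adrs):.2f}")
print(f"MSE after boosting: {mse(boosted, days, adrs):.2f}")
```

The error here is measured on the training points only, so it shows that boosting fits the data, not that it generalizes. That still has to be checked on held-out data.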
Using the XGBoost algorithm, we were able to analyze the historical data of the hotel's direct competitors and forecast average market prices for each day of the year. Then we compared these predicted prices with the rates of the target family hotel to calculate price differentials.
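A price differential of this kind could be computed as a simple per-day subtraction, as in the sketch below. The dates and rates are invented, and treating the differential as an absolute difference (rather than, say, a percentage) is an assumption, since the post does not specify the exact formula.

```python
# Hypothetical model output: predicted average market ADR per date.
predicted_market_adr = {
    "2023-08-01": 140.0,
    "2023-08-02": 155.0,
    "2023-08-03": 150.0,
}
# Hypothetical rates currently set by the target hotel.
hotel_rate = {
    "2023-08-01": 150.0,
    "2023-08-02": 145.0,
    "2023-08-03": 150.0,
}

# Positive values mean the hotel is priced above the predicted market average.
differentials = {
    day: hotel_rate[day] - predicted_market_adr[day] for day in hotel_rate
}
for day, diff in differentials.items():
    print(day, f"{diff:+.2f}")
```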
The ADR differences were used as inputs to a separate price elasticity model that would forecast occupancy rates. It also considered factors such as seasonality, holidays, day of the week, general market trends, as well as hotel prices. This second model was designed to help revenue managers understand how price differences impact occupancy and set the optimal rates that would result in maximum sales.
Recurrent neural networks
Recurrent Neural Networks (RNNs) are a specialized type of neural network that has the unique ability to remember and incorporate past outputs as inputs for the current time step. This differs from traditional neural networks, which treat each input and output as independent entities.
Recurrent neural network architecture
RNNs have internal memory that can be represented as many copies of the same neural network with looped connections. It works like this: Each copy of the network passes a message to the next copy, and the decision is made with consideration of both the current input and the output learned from the previous input.
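The looped connection can be illustrated with a single recurrent unit whose state at each step depends on the current input and on its own previous state. The weights below are arbitrary, untrained values chosen only to show the mechanics; a real model would learn them and would typically use a deep learning library.

```python
import math


def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit RNN over a sequence and return every hidden state."""
    h = 0.0  # initial memory is empty
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


# A single spike followed by zeros: the state stays non-zero afterwards,
# because each step still carries a trace of the earlier input.
states = rnn_forward([1.0, 0.0, 0.0])
print([round(s, 3) for s in states])
```

The trace fades from step to step in this example, which hints at why plain RNNs can struggle to retain information over long sequences.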
RNNs are particularly suited for predicting changes in hotel rates over time, a task known as time-series forecasting. This is because RNNs can remember previous outputs and use them for predictions. Compared to linear regression or decision trees, RNNs, especially gated variants such as LSTM, are better suited to capturing complex sequential patterns and long-term dependencies in the data, which makes them a good choice for price prediction.
As with any other neural network, an RNN requires a lot of data to ensure that it provides accurate results. Basically, if your dataset is huge and you want a model to generalize effectively on all that data, it’s better to opt for neural networks like RNNs.
This brings us to the last but not least point.
Model deployment and evaluation
In the hotel price prediction project, the last step is to deploy the machine learning model or models that have been identified as the best-performing. It's also important to note that the techniques discussed here are just a small subset of the many approaches available for predicting hotel prices.
Depending on the specific use case, other models such as LightGBM or ARIMA may be suitable and provide satisfactory results. To determine the best method for your specific use case, it is beneficial to experiment with various approaches and use machine learning metrics to assess their performance.
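Three metrics commonly used to compare regression models are mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE). The sketch below computes them on invented ADR values; the post does not say which metrics were used on the Rakuten project.

```python
import math


def mae(actual, predicted):
    """Mean absolute error, in the same units as the price."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error; penalizes large misses more than MAE."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def mape(actual, predicted):
    """Mean absolute percentage error; assumes no actual value is zero."""
    return 100 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical observed and predicted ADRs for four days.
actual = [100.0, 120.0, 150.0, 130.0]
predicted = [110.0, 115.0, 140.0, 130.0]

print(f"MAE:  {mae(actual, predicted):.2f}")    # MAE:  6.25
print(f"RMSE: {rmse(actual, predicted):.2f}")   # RMSE: 7.50
print(f"MAPE: {mape(actual, predicted):.2f}%")  # MAPE: 5.21%
```

Computing the same metrics on the same held-out data for every candidate model keeps the comparison fair.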
Another noteworthy point here is that, despite their effectiveness, predictive models may not account for unforeseen events such as pandemics or natural disasters. COVID-19, for example, was impossible to predict, let alone all the ways it would impact the travel industry. Therefore, all the prediction models developed before the pandemic must be revised to incorporate new data.
Also, it’s worth noting that while a price forecasting model trained on historical data is a valuable tool, it is not a replacement for human judgment. Revenue managers must use their critical thinking to make decisions based on the output of the models and develop pricing strategies. Therefore, it is beneficial to combine ML modeling with human expertise to gain more comprehensive insights.