Being a smart e-commerce consumer using data science
eBay as a case study for product price analysis and prediction
E-commerce continues to grow and expand, with an extra boost during the COVID-19 pandemic. Online B2C sales are predicted to generate $4.9 trillion in 2021, a major increase from the $1.3 trillion generated in 2014. An increasing number of people use the internet to purchase all kinds of products, from groceries to laptops and even cars. The main challenge online consumers face is the sheer abundance and variety of products and sellers. Everyone wants the best deal: a good-quality product at the lowest price available. That said, searching through hundreds of pages and scanning thousands of products can be a tedious and time-consuming task. Wouldn't it be nice to have a way of knowing whether the product we found has a good price or not? Data science to the rescue!
As a case study I used eBay. Large e-commerce websites like eBay, AliExpress, Amazon, and others offer hundreds of results for each product searched. I scraped products and their details, including their price, performed some analyses, and built a model aimed at predicting the price of a product. I did it on a small scale as part of my data science studies, but in a commercial setting, knowing the average price of the product you want to buy might help you judge whether you have found a great deal, which could potentially save money.
All of the data and the tools I used, including the data sets, scraper, and Jupyter notebook can be found in the following GitHub repository: https://github.com/royyanovski/Data_Mining_Project
Data scraping
To scrape the data I built a scraper that uses the Beautiful Soup and Requests libraries. The scraper collected product data from the eBay search pages for given search words. The scraped fields were: product title, product price, shipping fee, seller name, seller country, seller feedback score, product category, product condition, and the search page number. The data was stored in a database I designed for it, with the following structure:
I chose to scrape electronic products, namely smartphones, tablets, and computers. Overall, 8,340 products were scraped, with 6,194 left after filtering. These comprise 35 product types from 10 different manufacturers.
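To make the scraping step more concrete, here is a minimal sketch of how a search-results page could be parsed with Beautiful Soup. It is not the scraper from the repository: the HTML snippet, the class names (`item`, `title`, `price`, `shipping`, `condition`), and the helper names are all made up for illustration, and the real eBay markup has to be inspected in the browser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: these class names and values are placeholders,
# not eBay's real HTML and not rows from the scraped dataset.
SAMPLE_HTML = """
<ul>
  <li class="item">
    <span class="title">Example Phone 64GB</span>
    <span class="price">ILS 1,250.00</span>
    <span class="shipping">ILS 45.00</span>
    <span class="condition">Used</span>
  </li>
  <li class="item">
    <span class="title">Example Tablet 10in</span>
    <span class="price">ILS 890.50</span>
    <span class="condition">New</span>
  </li>
</ul>
"""


def parse_amount(text):
    """Turn a string like 'ILS 1,250.00' into a float; None if absent."""
    if text is None:
        return None
    digits = "".join(ch for ch in text if ch.isdigit() or ch == ".")
    return float(digits) if digits else None


def text_of(item, css_class):
    """Return the stripped text of the first matching child, or None."""
    node = item.find(class_=css_class)
    return node.get_text(strip=True) if node else None


def parse_search_page(html, page_number):
    """Extract one dict per listing from a search-results page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("li", class_="item"):
        rows.append({
            "title": text_of(item, "title"),
            "price": parse_amount(text_of(item, "price")),
            "shipping_fee": parse_amount(text_of(item, "shipping")),
            "condition": text_of(item, "condition"),
            "page": page_number,
        })
    return rows


products = parse_search_page(SAMPLE_HTML, page_number=1)
```

In a real run the HTML string would come from the Requests library (for example the `.text` of a response) instead of a hard-coded sample, and missing fields such as the shipping fee of the second listing would be handled during filtering.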
Dataset
The final dataset, after some preprocessing (no scaling, binning, or encoding), looks like this:
Data analyses
Let's see some insights found in the data. First of all, the different products, which had a wide range of prices, are shown below.
We can see that the dataset has product prices ranging from 500 to 6,500 ILS (155–2,015 USD), with 'tablets' being the cheaper category and 'computers' the more expensive one. When examining the plot we must keep in mind that the products are in different conditions.
As we can see, the condition of the product affects its price in a pattern we might have expected, except for the 'tablets' category, which shows a rather stable price that is independent of the condition. In addition, 'certified refurbished' was a condition found only for computers, and its average price is very close to 'new', which makes me wonder about its relevance, especially today, when people throw away usable products just for the sake of buying and owning new ones.
Another thing I wanted to understand is whether eBay sellers rely on the assumption that most people don't look past the first search page in order to promote more expensive products. The graph below shows that this is not the case, as there is no decreasing price trend across pages, and arguably even an increasing trend in the 'tablets' category.
Another interesting thing to examine is the shipping fee. Sellers offer different shipping fees, so I checked whether the fee is influenced by the shipping country. This seems to make sense: if the cargo has a longer way to travel, it might be more expensive to ship.
As shown above, the data do not support this assumption. Distance does not seem to determine the shipping fee, which is probably affected by other factors. The shipping country may still be a factor, but it is not the distance that matters. Perhaps things like the cargo traffic between the countries affect the price. China, for example, is a very large exporter, so shipping items out of it may be cheaper. In addition, the fee might also depend on a country's shipping taxes and relative prices.
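A comparison like this boils down to a group-by over category and condition. The sketch below shows one way to compute it with pandas; the column names (`category`, `condition`, `price`) and all the numbers are invented for illustration and are not taken from the scraped dataset.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the scraped dataset.
df = pd.DataFrame({
    "category": ["computers", "computers", "tablets", "tablets",
                 "smartphones", "smartphones"],
    "condition": ["new", "used", "new", "used", "new", "used"],
    "price": [4000.0, 2500.0, 900.0, 880.0, 3000.0, 1800.0],
})

# Mean price per (category, condition), with conditions spread into columns.
mean_price = (
    df.groupby(["category", "condition"])["price"]
      .mean()
      .unstack("condition")
)

# Relative gap between new and used items within each category.
mean_price["used_discount"] = 1 - mean_price["used"] / mean_price["new"]
```

A `used_discount` close to zero for a category would correspond to the flat pattern described for tablets, while larger values mean that condition matters more for the price.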
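One simple way to quantify such a trend, besides reading it off a plot, is to fit a straight line of price against page number for each category and look at the sign of the slope. The sketch below does this with `numpy.polyfit`; the data and column names are invented for illustration, and a linear fit is my simplification rather than the analysis used in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not taken from the scraped dataset.
df = pd.DataFrame({
    "category": ["tablets"] * 4 + ["computers"] * 4,
    "page": [1, 2, 3, 4, 1, 2, 3, 4],
    "price": [800.0, 850.0, 900.0, 950.0, 3000.0, 3100.0, 3100.0, 3000.0],
})


def price_slope(group):
    """Slope of a least-squares line of price against page number."""
    slope, _intercept = np.polyfit(group["page"], group["price"], deg=1)
    return slope


# One slope per category: negative would mean pricier items appear first.
slopes = {category: price_slope(group)
          for category, group in df.groupby("category")}
```

If sellers were pushing expensive products to the first pages, the slopes would be clearly negative. A slope near zero or above zero matches the pattern described above.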
What about sellers? Can we link their feedback scores to their pricing patterns?
With regard to the shipping fee, there seems to be a kind of optimal pricing range that is linked to higher feedback scores. The range of 70–300 ILS (22–93 USD) might represent the tradeoff between overpricing and charging a fee that allows better and faster shipping conditions or methods, and/or better customer service.
With regard to the product price, the feedback scores seem to be strongly affected by overpricing. Each category has its own threshold, above which the feedback scores decrease.
Modeling and predicting prices
Finally, I built a Random Forest regression model to predict product prices.
The preprocessing steps included (in chronological order): removing redundant columns (like the ones used for database design purposes), splitting the data into the three categories, assigning product names (eBay product titles vary widely), filtering out unfitting products, adding a 'manufacturer' feature, removing outliers by choosing a price range, one-hot encoding categorical features, and binning the 'shipping fee' feature.
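The last three steps (price-range filtering, one-hot encoding, and binning) can be sketched with pandas as follows. This is an illustration under assumptions, not the notebook's code: the column names, the toy rows, the 500 to 6,500 ILS bounds (borrowed from the price range reported earlier), and the bin edges are all my own choices.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the scraped dataset.
df = pd.DataFrame({
    "price": [120.0, 900.0, 2500.0, 5200.0, 9000.0],
    "shipping_fee": [0.0, 40.0, 120.0, 310.0, 80.0],
    "condition": ["used", "new", "used", "new", "new"],
    "manufacturer": ["A", "B", "A", "C", "B"],
})

# 1. Remove outliers by keeping only a chosen price range (bounds assumed).
df = df[df["price"].between(500, 6500)].copy()

# 2. One-hot encode the categorical features.
features = pd.get_dummies(df, columns=["condition", "manufacturer"])

# 3. Bin the shipping fee into ordinal codes (bin edges are assumptions):
#    0 = free, 1 = up to 70, 2 = up to 300, 3 = above 300.
fee_bins = [-0.01, 0, 70, 300, float("inf")]
features["shipping_bin"] = pd.cut(
    features["shipping_fee"], bins=fee_bins, labels=False
)
features = features.drop(columns=["shipping_fee"])
```

Here the binned fee is kept as a single ordinal column. One-hot encoding the bins instead would also work, at the cost of more features.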
The data were divided into train (4,649 samples) and test (1,545 samples) sets. The final number of features was 65. My baseline R-squared was 0.0 (obtained by always predicting the mean price).
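The reason a mean-price baseline scores exactly zero follows from the definition of R-squared. Writing $y_i$ for the true prices, $\hat{y}_i$ for the predictions, and $\bar{y}$ for the mean price of the evaluated set:

```latex
\[
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
\qquad\Longrightarrow\qquad
\hat{y}_i = \bar{y} \;\text{ for all } i
\;\Rightarrow\;
R^2 = 1 - \frac{\sum_i (y_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2} = 0 .
\]
```

So any model with a positive R-squared explains some of the price variance that the mean alone cannot. If the mean is taken from the training set and scored on the test set, the baseline can land slightly below zero instead of exactly on it.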
After some hyperparameter tuning I reached an R-squared score of 0.755 on the test set.
It might be worth noting that the model was slightly overfitted, returning an R-squared of 0.825 on the training set.
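For readers who want to reproduce this kind of workflow, the sketch below tunes a Random Forest regressor with scikit-learn's `GridSearchCV` and compares train and test R-squared. It runs on synthetic data, and the parameter grid, the 75/25 split, and the random seeds are assumptions of mine; the article does not state which hyperparameters were tuned or how.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data for illustration only; not the scraped dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 5))
y = 1000 * X[:, 0] + 500 * X[:, 1] ** 2 + rng.normal(0, 50, size=400)

# Roughly the same 75/25 proportion as 4,649 train vs. 1,545 test samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Assumed search grid; the article does not list the tuned hyperparameters.
param_grid = {
    "n_estimators": [50],
    "max_depth": [4, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0), param_grid, scoring="r2", cv=3
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
train_r2 = best_model.score(X_train, y_train)  # R-squared on seen data
test_r2 = best_model.score(X_test, y_test)     # R-squared on held-out data
```

A gap between `train_r2` and `test_r2`, like the 0.825 versus 0.755 reported above, is the usual sign of overfitting. Limiting `max_depth` or raising `min_samples_leaf` are common ways to narrow it.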
Feature importance
Using the feature importance rank provided by scikit-learn's Random Forest algorithm, I plotted the ten most important features, as shown below.
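Extracting such a ranking takes only a few lines once the model is fitted. The sketch below uses the `feature_importances_` attribute of scikit-learn's `RandomForestRegressor` on synthetic data; the feature names and the data are placeholders, not the 65 features of the actual model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only; feature names are placeholders.
rng = np.random.default_rng(0)
feature_names = [f"feature_{i}" for i in range(12)]
X = pd.DataFrame(rng.uniform(0, 1, size=(300, 12)), columns=feature_names)
y = 1000 * X["feature_0"] + 200 * X["feature_1"] + rng.normal(0, 20, size=300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances; they are normalized to sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
top_ten = importances.sort_values(ascending=False).head(10)
```

Calling `top_ten.plot.barh()` would give a bar chart similar to the one described. Keep in mind that with one-hot encoding, the importance of a single categorical variable is split across its dummy columns.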
The shipping fee was by far the most important feature in predicting item prices, which makes it very interesting for further inspection. Do shipping fees increase with product prices? If so, what is the reason? Are sellers simply allowing themselves to charge more, or are they spending more on shipping to make it safer (e.g. insurance, different shipping methods)?
In addition, we can see the expected high importance of conditions and product categories, especially those with more extreme values (new vs. used, computers vs. tablets). The page number is also important, and the earlier analysis showed that the tablets and computers categories follow different patterns across page numbers. The other important features are product types (models), manufacturers, and countries, which had a relatively big impact. It should be mentioned, though, that after the seven most important features the importance scores drop and become fairly similar across features.
Conclusion
A hypothetical ML-based tool like the one described above could be significant and help private consumers save a lot of money. When browsing e-commerce sites, we see different prices for the same product. We can compare them, and even sort them, but how can we know when we have found a bargain, or when we are overpaying? What is the average cost of this product in this condition? How much more should I pay to order from a trusted, highly rated seller? What is a reasonable shipping fee? Many questions like these could be answered by building the suggested tool, which might make us all smarter consumers.