Being a smart e-commerce consumer using DS

eBay as a case study for product price analysis and prediction

Roy Yanovski
7 min read · May 25, 2021

E-commerce continues to grow and expand, with an extra boost during the COVID-19 pandemic. Online B2C sales are predicted to generate $4.9 trillion in 2021, a major increase from the $1.3 trillion generated in 2014. An increasing number of people use the internet to purchase all kinds of products, from groceries to laptops and even cars. The main challenge online consumers face is the sheer abundance and variety of products and sellers. Everyone wants the best deal: a good-quality product at the lowest price available. However, searching through hundreds of pages and scanning thousands of products is tedious and time-consuming. Wouldn't it be nice to have a way of knowing whether a product we found is well priced? Data science to the rescue!

As a case study I used eBay. Large e-commerce websites like eBay, AliExpress, Amazon, and others offer hundreds of results for each product search. I scraped products and their details, including price, performed some analyses, and built a model aimed at predicting the price of a product. I did this on a small scale as part of my data science studies, but in a commercial setting, knowing the average price of the product you want to buy could help you judge whether you have found a great deal, and potentially save money.

All of the data and the tools I used, including the data sets, scraper, and Jupyter notebook can be found in the following GitHub repository: https://github.com/royyanovski/Data_Mining_Project

Data scraping

In order to scrape the data I built a scraper that uses the 'beautiful soup' and 'requests' libraries. The scraper collected product data from the eBay search result pages for given search words. The scraped fields were: product title, product price, shipping fee, seller name, seller country, seller feedback score, product category, product condition, and the search page number. The data were stored in a database I designed for this purpose, with the following structure:

DB design.
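
The actual schema is only shown in the figure, so the following is a minimal sketch of how the scraped fields listed above could be laid out in a relational database. The table names, column names, and the two-table split are assumptions made for illustration, not the project's design, and SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not the design used in the project.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sellers (
    seller_id      INTEGER PRIMARY KEY,
    name           TEXT UNIQUE NOT NULL,
    country        TEXT,
    feedback_score REAL
);
CREATE TABLE IF NOT EXISTS products (
    product_id     INTEGER PRIMARY KEY,
    title          TEXT NOT NULL,
    price          REAL,
    shipping_fee   REAL,
    category       TEXT,
    item_condition TEXT,
    page_number    INTEGER,
    seller_id      INTEGER REFERENCES sellers(seller_id)
);
"""


def create_db(path=":memory:"):
    """Open a SQLite database and create the two tables if they are missing."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_product(conn, seller, product):
    """Insert one scraped product, reusing the seller row if it already exists."""
    conn.execute(
        "INSERT OR IGNORE INTO sellers (name, country, feedback_score) "
        "VALUES (:name, :country, :feedback_score)",
        seller,
    )
    seller_id = conn.execute(
        "SELECT seller_id FROM sellers WHERE name = ?", (seller["name"],)
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO products (title, price, shipping_fee, category, "
        "item_condition, page_number, seller_id) "
        "VALUES (:title, :price, :shipping_fee, :category, "
        ":item_condition, :page_number, :seller_id)",
        {**product, "seller_id": seller_id},
    )
    conn.commit()
```

Keeping sellers in their own table avoids repeating the seller's country and feedback score for every product they list, which is one common reason to normalize scraped data like this.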

I chose to scrape electronic products, namely smartphones, tablets, and computers. Overall 8,340 products were scraped, with 6,194 left after filtration. These are composed of 35 product types, from 10 different manufacturers.
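
To give a feel for the scraping step, here is a minimal sketch of a 'requests' plus 'beautiful soup' scraper. It is not the project's scraper (that one is in the repository): the URL, the query parameter names, and every CSS class below are assumptions for illustration, and a real search page would have to be inspected to find the right selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical URL and query parameter names, used only for illustration.
SEARCH_URL = "https://www.ebay.com/sch/i.html"


def fetch_search_page(keywords, page):
    """Download one search-results page and return its HTML."""
    import requests  # imported here so the parsing part can be tried offline

    params = {"_nkw": keywords, "_pgn": page}
    response = requests.get(SEARCH_URL, params=params, timeout=10)
    response.raise_for_status()
    return response.text


def parse_items(html, page):
    """Extract the title and price text of each listing on one page."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    # "li.item", ".item-title" and ".item-price" are placeholder selectors.
    for node in soup.select("li.item"):
        title = node.select_one(".item-title")
        price = node.select_one(".item-price")
        if title is None or price is None:
            continue  # skip ads or malformed listings
        items.append(
            {
                "title": title.get_text(strip=True),
                "price": price.get_text(strip=True),
                "page_number": page,
            }
        )
    return items
```

Separating the download from the parsing makes it possible to test the parser on saved HTML without hitting the website on every run.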

Dataset

The final dataset, after some preprocessing (no scaling, binning or encoding) looks like this:

Data analyses

Let's see some insights found in the data. First of all, the different products, which had a wide range of prices, are shown below.

Product price (in ILS) as a factor of product type (name), colored by its category.

We can see that product prices in the dataset range between 500–6,500 ILS (155–2,015 USD), with 'tablets' being the cheapest category and 'computers' the most expensive. When examining the plot, we must keep in mind that the products are in different conditions.

Product price (in ILS) as a factor of its condition, on different categories.

As we can see, the condition of the product affects its price in the pattern we might expect, except in the 'tablets' category, which shows a rather stable price, independent of condition. In addition, 'certified refurbished' was a condition found only for computers, and its average price is very close to that of 'new', which makes me wonder about its relevance, especially today, when people throw away usable products just for the sake of buying and owning new ones.

Another thing I wanted to understand is whether eBay sellers exploit the assumption that most people don't look past the first search page in order to promote more expensive products there. The graph below suggests that this is not the case: there is no decreasing price trend across pages, and arguably there is even an increasing trend in the 'tablets' category.

Product price (in ILS) as a factor of search page number, on different categories.

Another interesting thing to examine is the shipping fee. Sellers charge different shipping fees, so I checked whether the fee is influenced by the shipping country, which seems plausible: if the cargo has a longer way to travel, it might be more expensive to ship.

Shipping fee as a factor of the shipping country. The black line shows the distance from the ordering country (in this case, Israel) in kilometers (right-side y-axis).

As shown above, the data do not support this assumption. Distance appears to have little to do with the shipping fee, which is probably affected by other factors. The shipping country may still be a factor, but it is not the distance that matters. Perhaps things like the volume of cargo traffic between the countries affect the price. China, for example, is a very large exporter, so shipping items out from there may be cheaper. In addition, the fee might depend on a country's shipping taxes and relative prices.
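
The distance plotted as the black line in the figure could be obtained in several ways. One common choice is the great-circle (haversine) distance between two reference points, sketched below. Whether the project used this formula is an assumption, and the coordinates are approximate values chosen only for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate coordinates, used only as an example pair of points.
JERUSALEM = (31.78, 35.22)
BEIJING = (39.90, 116.40)
print(round(haversine_km(*JERUSALEM, *BEIJING)), "km")
```

With one distance per shipping country, the fee can then be plotted or correlated against it to check the pattern discussed above.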

What about sellers? Can we link their feedback scores to their pricing patterns?

Seller feedback score as a factor of the shipping price (in ILS), on different categories.

Regarding the shipping fee, there seems to be an optimal pricing range that is linked with higher feedback scores. The range between 70–300 ILS (22–93 USD) might represent a tradeoff between overpricing and charging a fee that allows better and faster shipping conditions/methods and/or better customer service.

Seller feedback score as a factor of the product price (in ILS), on different categories.

Regarding the product price, the feedback scores seem to be strongly affected by overpricing. Each category has its own threshold, above which feedback scores decrease.

Modeling and predicting prices

Finally, I built a Random Forest regression model to predict product prices.

The preprocessing steps included (in chronological order): removing redundant columns (like the ones used for DB design purposes), splitting into the three categories, assigning product names (eBay product titles vary widely), filtering out unfitting products, adding a 'manufacturer' feature, removing outliers by choosing a price range, one-hot encoding categorical features, and binning the 'shipping fee' feature.
The data were divided into train (4,649 samples) and test (1,545 samples) sets. The final number of features was 65. My baseline R-square was 0.0 (obtained by predicting the mean price for every product).
After some hyperparameter tuning I reached an R-square score of 0.755.
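
The last two preprocessing steps, one-hot encoding and binning, could look roughly like the pandas sketch below. The column names, sample rows, and bin edges are assumptions for illustration, as is the choice to store the bins as integer codes; the project's actual values are in the notebook.

```python
import pandas as pd

# Toy rows with hypothetical column names and values.
df = pd.DataFrame(
    {
        "price": [1200.0, 3400.0, 800.0, 5100.0],
        "shipping_fee": [0.0, 45.0, 120.0, 310.0],
        "condition": ["new", "used", "used", "new"],
        "manufacturer": ["A", "B", "A", "C"],
    }
)

# One-hot encode the categorical columns; numeric columns pass through.
features = pd.get_dummies(
    df.drop(columns=["price"]), columns=["condition", "manufacturer"]
)

# Bin the shipping fee into ordered ranges (edges are illustrative):
# 0 = free, 1 = up to 70, 2 = up to 300, 3 = above 300.
fee_edges = [-0.01, 0, 70, 300, float("inf")]
features["shipping_fee"] = pd.cut(
    features["shipping_fee"], bins=fee_edges, labels=False
)

target = df["price"]
print(features)
```

One-hot encoding is what turns a handful of raw columns into many features, which is how a dataset like this one can end up with 65 of them.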

True product prices vs. predicted product prices from the trained model.

It is worth noting that the model was slightly overfitted, returning an R-square of 0.825 for the training set.
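
The R-square values quoted in this section (0.0 for the baseline, 0.755 on the test set, 0.825 on the training set) all follow the same definition: one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below, using made-up prices, shows why predicting the mean for every product gives exactly 0.

```python
from statistics import mean


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy prices in ILS, made up for illustration.
prices = [900.0, 1500.0, 2100.0, 3300.0]

# Baseline: predict the mean price for every product.
baseline = [mean(prices)] * len(prices)
print(r_squared(prices, baseline))  # 0.0
```

A score of 0.755 therefore means the model removes about three quarters of the squared error that the mean-price baseline leaves behind.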

Python code from the project's notebook, showing the model.
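
The notebook in the repository holds the actual model code. As a rough, independent sketch of the same idea, the snippet below tunes a scikit-learn Random Forest regressor with a grid search and compares train and test R-square. The data are synthetic and the search grid is hypothetical; neither comes from the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the real features come from the preprocessing steps.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 1000 + 300 * X[:, 0] - 200 * X[:, 1] + rng.normal(scale=50, size=400)

# A 75/25 split, roughly the ratio of the 4,649 / 1,545 split described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical search grid; the values actually tuned are not listed here.
param_grid = {"max_depth": [4, 8, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid,
    scoring="r2",
    cv=3,
)
search.fit(X_train, y_train)

# GridSearchCV refits the best configuration on the whole training set.
model = search.best_estimator_
print("best params:", search.best_params_)
print("train R^2:", round(model.score(X_train, y_train), 3))
print("test R^2:", round(model.score(X_test, y_test), 3))
```

Printing both scores side by side is a quick way to spot the kind of train/test gap mentioned above.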

Feature importance

Using the feature importance rank provided by scikit-learn's Random Forest algorithm, I plotted the ten most important features, as shown below.

Feature importance of the top ten most important features

The shipping fee was by far the most important feature in predicting item prices, which makes it very interesting for further inspection. Do shipping prices increase with product prices? And if so, what is the reason? Are sellers simply allowing themselves to charge more, or are they spending more on shipping to make it safer (e.g. insurance, different shipping methods)?
In addition, we can see the expected high importance of conditions and product categories, especially those containing more extreme values (new vs. used, computers vs. tablets). The page number is also important, and the previous analysis shows that the tablets and computers categories follow different patterns with respect to page number. The other important features are product types (models), manufacturers, and countries, which had a relatively big impact, but it should be mentioned that after the seven most important features the importance scores get lower and become fairly similar between features.
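
The ranking discussed above can be read from the `feature_importances_` attribute of a fitted scikit-learn Random Forest. The sketch below shows the mechanics on synthetic data. The feature names are placeholders and the target is constructed so that the first feature dominates, so the output illustrates the procedure only and does not reproduce the project's result.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder feature names and synthetic data, for illustration only.
feature_names = ["shipping_fee", "condition_new", "page_number", "noise_a", "noise_b"]
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
# The target depends strongly on column 0 and weakly on column 1, by construction.
y = 5 * X[:, 0] + 1 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1.
importances = model.feature_importances_
order = np.argsort(importances)[::-1]  # indices from most to least important
top = [(feature_names[i], float(importances[i])) for i in order[:10]]
for name, score in top:
    print(f"{name:15s} {score:.3f}")
```

These impurity-based scores can favor features with many distinct values, so it may be worth cross-checking a surprising ranking with `sklearn.inspection.permutation_importance` on the test set.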

Conclusion

A hypothetical ML-based tool like the one described above could be significant and help private consumers save a lot of money. When browsing e-commerce sites, we see different prices for the same product. We can compare them, and even sort them, but how can we know when we have found a bargain, or when we are overpaying? What is the average cost of this product in this condition? How much more should I pay to order from a trusted, highly rated seller? What is a reasonable shipping fee? Many questions like these could be answered by building the suggested tool, which might make us all smarter consumers.
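
As a very small sketch of what the core of such a tool might do, the function below labels a listing by comparing its price with a reference price, here the mean of comparable listings. The function name, the labels, and the 10% tolerance are all assumptions for illustration; a real tool could use a model's predicted price as the reference instead.

```python
from statistics import mean


def rate_deal(price, comparable_prices, tolerance=0.10):
    """Label a listing by comparing its price with the mean of comparable listings."""
    if not comparable_prices:
        raise ValueError("need at least one comparable price")
    reference = mean(comparable_prices)
    ratio = price / reference
    if ratio <= 1 - tolerance:
        return "bargain"
    if ratio >= 1 + tolerance:
        return "overpriced"
    return "fair"


# Made-up prices in ILS for the same product in the same condition.
print(rate_deal(800, [1000, 1100, 900]))
```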

Roy Yanovski

PhD, Marine biologist, Data scientist, Sports lover, and nature enthusiast. Interested in using data science to make the world a better place.