Struggles and victories of the everyday work of a data scientist

A survey conducted by CrowdFlower showed that data preparation accounts for about 80% of the work of data scientists. I would say that figure should be even higher.

What can go wrong with data preparation in a data scientist’s work? Anything you can imagine, but the worst problems are the ones you cannot imagine. First, the quality of the available data can be poor; in my case, that is data from Adobe Analytics. You have to be cautious at every step and not blindly trust the data you see. In my everyday experience, I find errors such as the number of leads on certain web pages no longer being reported, or the count of a certain event, like opening the form, suddenly becoming twice as high as in previous months. Just because. You have to patiently check everything and report errors to Adobe or the programmers.
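A check like the ones described above can be partly automated. The following is only a minimal sketch, assuming the monthly event counts have already been exported into a plain list; the function name, the window of three months, the ratio threshold and the sample numbers are all hypothetical and are not part of Adobe Analytics.

```python
from statistics import median


def flag_anomalies(monthly_counts, window=3, ratio=1.8):
    """Flag months whose count is far from the median of the preceding months.

    Returns a list of (month_index, count, baseline) tuples.
    The window and ratio values are illustrative assumptions.
    """
    flagged = []
    for i in range(window, len(monthly_counts)):
        baseline = median(monthly_counts[i - window:i])
        count = monthly_counts[i]
        if baseline == 0:
            # No usable baseline; only flag if events suddenly appear.
            if count > 0:
                flagged.append((i, count, baseline))
            continue
        change = count / baseline
        # change == 0 covers "stopped being reported";
        # change >= ratio covers "suddenly twice as high".
        if change >= ratio or change <= 1 / ratio:
            flagged.append((i, count, baseline))
    return flagged


# Made-up monthly counts of "form opened" events.
form_opens = [1000, 1040, 980, 1010, 2050, 2100, 0]
suspicious = flag_anomalies(form_opens)
for month, count, baseline in suspicious:
    print(f"month {month}: {count} events vs. baseline {baseline}")
```

A flagged month is not proof of a tracking error, only a prompt to look closer before reporting it.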

However, how great it feels when, half a year after requesting that the timestamp be available in seconds rather than minutes, you can finally analyze users’ behavior using a graph that takes the order of the users’ steps into account. Such a graph can show, for example, what percentage of users on the landing page went on to open the form, what percentage downloaded the price list and what percentage left the page for good. It shows the users’ behavior for all possible steps and so illustrates their digital customer journey.
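To make the idea of such a graph concrete, here is a small sketch that orders each user’s events by a timestamp in seconds and computes, for every step, the share of users who moved on to each next step. The event tuples, step names and the artificial "exit" node are assumptions made for illustration, not the actual Adobe Analytics schema.

```python
from collections import Counter, defaultdict


def transition_shares(events):
    """Build a step-to-step transition graph from (user, timestamp, step) events.

    Returns {source_step: {next_step: share_of_transitions}}.
    """
    by_user = defaultdict(list)
    for user, timestamp, step in events:
        by_user[user].append((timestamp, step))

    counts = defaultdict(Counter)
    for visits in by_user.values():
        # Second-level timestamps make the order of steps recoverable.
        steps = [step for _, step in sorted(visits)]
        steps.append("exit")  # hypothetical terminal node: user left the page
        for source, target in zip(steps, steps[1:]):
            counts[source][target] += 1

    return {
        source: {target: n / sum(targets.values()) for target, n in targets.items()}
        for source, targets in counts.items()
    }


# Made-up events: (user_id, timestamp_in_seconds, step)
events = [
    ("u1", 0, "landing"), ("u1", 12, "open_form"),
    ("u2", 3, "landing"), ("u2", 20, "download_price_list"),
    ("u3", 5, "landing"),
    ("u4", 7, "landing"), ("u4", 9, "open_form"),
]
shares = transition_shares(events)
for target, share in shares["landing"].items():
    print(f"landing -> {target}: {share:.0%}")
```

With timestamps rounded to minutes, several steps of one user could share the same timestamp and their order would be ambiguous, which is why the finer resolution matters here.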

The second issue with data preparation in a data scientist’s work is that you have to account for the time it takes to set up big data infrastructure, especially for data on webpage users’ behavior, where everything has to comply with GDPR. It really takes time. Moreover, clients really don’t like to share their data with other parties.

Even if the quality of the data is sufficient (or rather: has become sufficient after long days of struggle), the next step seems even more challenging: discovering the meaning of the data. The human mind tends to think of data as an unshaped mass, but data becomes real, useful data only when it is strict and precise, with each variable defined unequivocally. The business forms a vision of what data it wants to collect and what it wants to gain from it, but the data scientist is the first to put that vision in strict terms. The work consists of reformulating business requirements and adapting the IT systems that collect the data so that it all becomes usable for modeling.

Once you have finally got access to the data, learned what it means and finished all the preprocessing, maybe even in real time, it’s time to look for patterns and build machine learning models, such as a scoring model or a recommendation system.

Scoring can measure users’ engagement in a process. For example, analysts from the American company Target created a scoring model that predicted which shopping behavior coincides with early pregnancy. This made it possible to send special offers for products for children. Scoring can also measure how likely a user is to submit a lead or buy a product. The more a user interacts with the webpage, especially with the sections related to finalizing the deal, like the basket, the higher the probability. Users who have a high score but have not purchased can be encouraged with remarketing or discounts.
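As a rough illustration of the scoring idea, the sketch below turns interaction counts into a probability-like score with a logistic function and then picks users with a high score and no purchase as remarketing candidates. The feature names, weights, bias and threshold are invented for the example; in a real project they would be learned from historical data rather than set by hand.

```python
import math

# Hypothetical hand-set weights; basket visits count most because they are
# closest to finalizing the deal.
WEIGHTS = {"page_views": 0.05, "form_opens": 0.8, "basket_visits": 1.2}
BIAS = -3.0


def purchase_score(features):
    """Map interaction counts to a score between 0 and 1 (logistic function)."""
    z = BIAS + sum(weight * features.get(name, 0) for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def remarketing_candidates(users, threshold=0.5):
    """Return ids of users with a high score who have not purchased yet."""
    return [
        user_id
        for user_id, data in users.items()
        if not data["purchased"] and purchase_score(data["features"]) >= threshold
    ]


# Made-up users.
users = {
    "casual": {"features": {"page_views": 2}, "purchased": False},
    "engaged": {
        "features": {"page_views": 10, "form_opens": 1, "basket_visits": 2},
        "purchased": False,
    },
    "buyer": {
        "features": {"page_views": 12, "form_opens": 1, "basket_visits": 3},
        "purchased": True,
    },
}
print(remarketing_candidates(users))
```

The positive weights encode the statement above that more interaction, especially with the basket, means a higher score.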

A recommendation system is used to recommend something the user is likely to buy. How can we know the user’s taste? We usually infer it from users with similar taste or from products similar to those the user has already bought. Amazon, for example, uses a recommender system that, based on the products you have ordered, finds users with similar profiles and proposes products that you haven’t bought but similar users have.
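The "users with similar taste" approach can be sketched in a few lines. This is a deliberately simple user-based variant that measures similarity with the Jaccard index over sets of purchased products; it is an illustration under those assumptions, not a description of how Amazon’s system works, and the user names and products are made up.

```python
def jaccard(a, b):
    """Similarity of two purchase sets: shared items divided by all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, purchases, top_n=3):
    """Recommend items bought by similar users but not yet by the target user."""
    own = purchases[target]
    scores = {}
    for user, items in purchases.items():
        if user == target:
            continue
        similarity = jaccard(own, items)
        if similarity == 0:
            continue  # nothing in common, so this user tells us nothing
        for item in items - own:
            # Items bought by more similar users accumulate a higher score.
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Made-up purchase histories.
purchases = {
    "anna": {"tent", "sleeping_bag", "stove"},
    "ben": {"tent", "sleeping_bag", "headlamp"},
    "cara": {"tent", "stove", "headlamp", "rope"},
    "dan": {"novel"},
}
print(recommend("anna", purchases))
```

Here "dan" shares nothing with "anna", so the novel is never suggested, while the headlamp is ranked first because both similar users bought it.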

Building machine learning models is definitely the most rewarding part of a data scientist’s job: this is where the victories are. Although far more models are built than are ever actually used by the client, each of them usually brings insight into the problem under consideration and, of course, training the algorithms is always a lot of fun.

To sum up, data preparation in a data scientist’s work can be compared to preparing for an extreme expedition into the mountains. It takes a lot of time: you have to gather all the necessary equipment and build up your fitness. It takes months, while the expedition itself might take just two weeks. You can easily get discouraged or neglect the preparations and so ruin your chances of success. On the other hand, you need to know when you are ready and not extend the preparations forever. In the end, the moment of setting off on the expedition is very exciting and satisfying. However, whether you reach the top depends largely on how well prepared you are. This is often decided during the long and dull preparations, not in the exciting finale.
