by Soňa Pochybová
On 20th June 2019, at the ERNI Development Day in Bern, Switzerland, a new Service called AI & Data Science was introduced. This indicates that ERNI would like to start enhancing its portfolio in these fields. But what do these terms stand for? In this blog post I would like to give a short overview of what it means to be a Data Scientist and what areas you need to study, should you choose to become one.
As someone who studied and practiced experimental High-Energy Physics, I hold this subject close to my heart, and I am very glad ERNI is moving in this direction. There is nothing more thrilling than seeing information emerge from apparent chaos, when the data start telling one of the many underlying stories hidden inside.
So let’s start …
Who are Data Scientists and what do they do
The term Data Scientist was coined in 2008 by D.J. Patil and Jeff Hammerbacher, the respective leads of data analytics at LinkedIn and Facebook. The term stands for people who:
Bring structure to unstructured data
To give an example, imagine you have a problem that requires collecting date information from various texts. A date can take a multitude of formats: 21-03-2019, 03/21/2019, 21st March 2019, … all describe the same date. Handling each of these formats separately in your analysis is highly impractical. The job of a data scientist in this case would be to:
- Identify all possible formats of a date in the data set
- Define an optimal, unified structure, e.g. a mapping: {Day: 21, Month: 3, Year: 2019}
- Store all date values from the data set in this unified structure
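The three steps above can be sketched in a few lines of Python. This is only a minimal illustration: the list `KNOWN_FORMATS` and the function name `normalize_date` are assumptions made for this example, and the list covers just the three formats mentioned in the text. A real data set would need its own survey of formats.

```python
import re
from datetime import datetime

# Step 1 (assumed result): the formats identified in the data set.
KNOWN_FORMATS = ("%d-%m-%Y", "%m/%d/%Y", "%d %B %Y")


def normalize_date(text):
    """Return a {'Day', 'Month', 'Year'} mapping, or None if no format matches."""
    # Drop ordinal suffixes such as "21st" -> "21" before parsing.
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text.strip())
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # this format does not fit, try the next one
        # Step 2: the unified structure.
        return {"Day": parsed.day, "Month": parsed.month, "Year": parsed.year}
    return None


# Step 3: store every value from the data set in the unified structure.
raw_values = ["21-03-2019", "03/21/2019", "21st March 2019"]
unified = [normalize_date(value) for value in raw_values]
```

All three raw values end up as the same mapping. Note that a string such as 03/04/2019 is ambiguous between day-first and month-first notation; the sketch simply trusts the separator, which is a decision a data scientist would have to verify against the actual source of the data.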
Define which questions need answering
Having data and having a problem you want to solve is one thing. But knowing how the data may help you solve your problem is another. Skilled Data Scientists know how to approach the data in order to solve the problem at hand and prepare the data for the upcoming analysis in a way best suited for the question.
For example, suppose you want to offer a suitable discount to a customer buying at your store. To do so, you might want to know the answers to the following questions (among others):
- Is the customer male or female?
- How old is he?
- Is he married?
- Does he have kids?
- How often does he buy the sorts of goods available at your store?
- What are his hobbies?
- How do similar customers react to similar incentives?
- How high does the discount need to be?
Analyze data
Once the questions are defined, you can design the process of getting the answers from the data using various analysis techniques. Returning to the example above, you can explore the customer’s purchase history (e.g. from a loyalty card) to answer many of the questions posed and then decide on an offer based on the portfolio of products at your store. You may also want to predict the success of your decision based on the reactions of similar customers, and estimate how this in turn increases your sales.
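As a rough illustration of "the reactions of similar customers", the sketch below estimates how often customers in one segment accepted past offers at each discount level. The records, the segment names and the function `acceptance_rates` are all invented for this example; in practice the data would come from something like the loyalty card history, and a proper model would replace the simple ratio.

```python
from collections import defaultdict

# Hypothetical past offers: (customer segment, discount in %, offer accepted?)
past_offers = [
    ("parent_30s", 10, True),
    ("parent_30s", 10, False),
    ("parent_30s", 20, True),
    ("parent_30s", 20, True),
    ("student", 10, False),
    ("student", 20, True),
]


def acceptance_rates(offers, segment):
    """Share of accepted offers per discount level within one segment."""
    counts = defaultdict(lambda: [0, 0])  # discount -> [accepted, total]
    for seg, discount, accepted in offers:
        if seg == segment:
            counts[discount][0] += int(accepted)
            counts[discount][1] += 1
    return {d: acc / total for d, (acc, total) in counts.items()}


rates = acceptance_rates(past_offers, "parent_30s")
```

With so few records the resulting rates are of course not reliable; the point is only to show how a question from the list ("How do similar customers react?") turns into a concrete computation.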
Drive strategic decision making
Once the first results are ready, it is good to visualize them in a way that makes your message clear and understandable. In our example, you may want to present how sales rise as a result of offering a targeted discount, compared to offers not tailored to customer behaviour. This can influence the store's discount strategy, with the focus on increasing sales volumes.
These activities can be briefly summarized as the Data Science lifecycle: bring structure to the data, define the questions, analyze the data, and drive strategic decisions.
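To make the comparison concrete, here is a small sketch that computes the relative uplift of a targeted offer over a generic one and prints a plain-text bar chart. The sales figures are made up for illustration, and the helper names `relative_uplift` and `text_bar_chart` are assumptions of this example. A plotting library would normally be used for presentation; plain text is used here only to keep the example self-contained.

```python
# Hypothetical weekly sales (arbitrary currency units) for two customer groups.
sales = {
    "generic offer": [100, 104, 98, 102],
    "targeted offer": [100, 118, 125, 131],
}


def mean(values):
    return sum(values) / len(values)


def relative_uplift(baseline, treated):
    """Relative increase of the treated group's mean over the baseline mean."""
    return (mean(treated) - mean(baseline)) / mean(baseline)


def text_bar_chart(series, width=40):
    """One line per series; bar length is proportional to the mean value."""
    top = max(mean(values) for values in series.values())
    lines = []
    for label, values in series.items():
        bar = "#" * round(width * mean(values) / top)
        lines.append(f"{label:<15}{bar} {mean(values):.1f}")
    return "\n".join(lines)


uplift = relative_uplift(sales["generic offer"], sales["targeted offer"])
chart = text_bar_chart(sales)
print(chart)
print(f"Uplift of targeted over generic offers: {uplift:.1%}")
```

A single number such as the uplift, backed by one simple chart, is often easier for decision makers to act on than the full analysis behind it.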
Trades of a Data Scientist
Looking at the description above, it is obvious that Data Scientists have to be curious and strongly data- and results-oriented. On top of that, they need to be highly technically skilled and have a strong background in statistics and linear algebra. They also need great communication and presentation skills. It’s a lot, but no worries! If you’re driven and you see yourself one day having a career in Data Science, it’s knowledge you can build up. To give you some guidance, focus on the following:
- Statistics and linear algebra
- R, Python (numpy, scikit-learn, pandas, tensorflow)
- Apache Spark, Dask, Clouds
- SQL, NoSQL
- Data visualisation tools (bokeh, matplotlib)
- Data analytic tools: Tableau, QlikSense
- Collaborative development tools (GitHub, Jupyter Notebook)
So, start learning and enjoy the journey …